Velocity Reviews - Computer Hardware Reviews


[Q] Text vs Binary Files

 
 
Richard Tobin
 
      06-09-2004
In article <(E-Mail Removed)>,
Corey Murtagh <(E-Mail Removed)> wrote:
>Don't want to be seen to be supporting XML here


???

> but doesn't the UTF-16 standard define byte ordering?


No. There are names for the encodings corresponding to
big-endian-UTF-16 and little-endian-UTF-16, but UTF-16 itself can be
stored in either order.

XML processors can distinguish between them easily because any XML
document not in UTF-8 must begin with a less-than or a byte-order mark
(unless some external indication of encoding is given).
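A rough sketch of that check in C, for illustration only. The names `guess_utf16_order` and `enum utf16_guess` are made up here and do not come from any real parser. The sketch looks at the first two bytes only and ignores UTF-32 and EBCDIC, which a fuller detector (see the autodetection appendix of the XML 1.0 Recommendation) would also have to consider.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
enum utf16_guess { GUESS_NOT_UTF16, GUESS_UTF16_BE, GUESS_UTF16_LE };

/* Guess the UTF-16 byte order of an XML entity from its first two bytes. */
enum utf16_guess guess_utf16_order(const unsigned char *buf, size_t len)
{
    if (len < 2)
        return GUESS_NOT_UTF16;

    /* Byte-order mark U+FEFF, serialized high byte first or low byte first. */
    if (buf[0] == 0xFE && buf[1] == 0xFF) return GUESS_UTF16_BE;
    if (buf[0] == 0xFF && buf[1] == 0xFE) return GUESS_UTF16_LE;

    /* No BOM: a leading '<' (U+003C) as one 16-bit unit gives the order away. */
    if (buf[0] == 0x00 && buf[1] == 0x3C) return GUESS_UTF16_BE;
    if (buf[0] == 0x3C && buf[1] == 0x00) return GUESS_UTF16_LE;

    /* Anything else is treated as UTF-8 or some other encoding. */
    return GUESS_NOT_UTF16;
}
```

A UTF-8 document starting with "<?xml" begins with the bytes 3C 3F, so it falls through to the last case.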

-- Richard
 
 
 
 
 
Richard Tobin
 
      06-09-2004
In article <(E-Mail Removed)>,
Malcolm Dew-Jones <(E-Mail Removed)> wrote:

>You can only have byte order issues when you store the UTF-16 as 8 bit
>bytes.


Which is to say, always in practice.

-- Richard
 
 
 
 
 
Jeff Brooks
 
      06-10-2004
Corey Murtagh wrote:

> Jeff Brooks wrote:
>
>> Rolf Magnus wrote:

>
> <snip>
>
>>>
>>> Linefeeds and carriage returns don't matter in XML. The other
>>> differences are ruled out by specifying the encoding. Any XML parser
>>> should understand utf-8.

>>
>>
>> Actually, to be an XML parser it must support UTF-8, and UTF-16.
>> UTF-16 has byte ordering issues. Writing a UTF-16 file on different
>> CPUs can result in text files that are different. This can be resolved
>> because of the encoding the UTF standards use, but it means that
>> any true XML parser must deal with big-endian/little-endian issues.

>
> Don't want to be seen to be supporting XML here, but doesn't the UTF-16
> standard define byte ordering? I was under the impression (without
> having done any work with it) that a UTF-16 multi-byte sequence could be
> parsed as a byte stream.


Unicode FAQ
http://www.unicode.org/unicode/faq/utf_bom.html#37

Jeff Brooks
 
 
Jeff Brooks
 
      06-10-2004
Malcolm Dew-Jones wrote:

> Jeff Brooks ((E-Mail Removed)) wrote:
> : Rolf Magnus wrote:
> : > Arthur J. O'Dwyer wrote:
> : >
> : >>On Thu, 27 May 2004, Eric wrote:
> : >>
> : >>>Assume that disk space is not an issue [...]
> : >>>Assume that transportation to another OS may never occur.
> : >>>Are there any solid reasons to prefer text files over binary files?
> : >>>
> : >>>Some of the reasons I can think of are:
> : >>>
> : >>>-- should transportation to another OS become useful or needed,
> : >>> the text files would be far easier to work with
> : >>
> : >> I would guess this is wrong, in general. Think of the difference
> : >>between a DOS/Win32 text file, a MacOS text file, and a *nix text
> : >>file (hint: linefeeds and carriage returns).
> : >
> : > Linefeeds and carriage returns don't matter in XML. The other
> : > differences are ruled out by specifying the encoding. Any XML parser
> : > should understand utf-8.
>
> : Actually, to be an XML parser it must support UTF-8, and UTF-16. UTF-16
> : has byte ordering issues.
>
> You can only have byte order issues when you store the UTF-16 as 8 bit
> bytes. But a stream of 8 bit bytes is _not_ UTF-16, which by definition
> is a stream of 16 bit entities, so it is not the UTF-16 that has byte
> order issues.


http://www.unicode.org/unicode/faq/utf_bom.html#37

Jeff Brooks
 
 
Ben Measures
 
      06-10-2004
Jeff Brooks wrote:
> Rolf Magnus wrote:
>
>> Linefeeds and carriage returns don't matter in XML. The other
>> differences are ruled out by specifying the encoding. Any XML parser
>> should understand utf-8.

>
> Actually, to be an XML parser it must support UTF-8, and UTF-16. UTF-16
> has byte ordering issues. Writing a UTF-16 file on different CPUs can
> result in text files that are different. This can be resolved because of
> the encoding the UTF standards use, but it means that any true XML
> parser must deal with big-endian/little-endian issues.
>
> "All XML processors MUST accept the UTF-8 and UTF-16 encodings of
> Unicode 3.1"
> - http://www.w3.org/TR/REC-xml/#charsets


"Entities encoded in UTF-16 MUST [snip] begin with the Byte Order Mark
described by section 2.7 of [Unicode3]"
http://www.w3.org/TR/REC-xml/#charencoding

This makes it trivial to overcome any endian issues, and since endian
issues are so fundamental I don't see it as making XML any less portable.
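To illustrate how little work the BOM leaves for the reader, here is a minimal sketch in C. The function name `utf16_units_from_bytes` is invented for this example, and it assumes the input really does start with a BOM; it simply gives up otherwise.

```c
#include <stddef.h>
#include <stdint.h>

/* Turn BOM-prefixed UTF-16 bytes into 16-bit code units, whatever the
   byte order of the writer or of this machine. Returns the number of
   units stored in out, or (size_t)-1 if the input has no BOM. */
size_t utf16_units_from_bytes(const unsigned char *in, size_t len,
                              uint16_t *out, size_t cap)
{
    int big;
    size_t i, n = 0;

    if (len < 2)
        return (size_t)-1;
    if (in[0] == 0xFE && in[1] == 0xFF)
        big = 1;                      /* big-endian: high byte first */
    else if (in[0] == 0xFF && in[1] == 0xFE)
        big = 0;                      /* little-endian: low byte first */
    else
        return (size_t)-1;

    /* Skip the BOM, then assemble each unit from two bytes by shifting.
       A trailing odd byte is ignored. */
    for (i = 2; i + 1 < len && n < cap; i += 2) {
        unsigned hi = big ? in[i] : in[i + 1];
        unsigned lo = big ? in[i + 1] : in[i];
        out[n++] = (uint16_t)((hi << 8) | lo);
    }
    return n;
}
```

Because the units are built with shifts rather than by casting the buffer to a 16-bit pointer, the same source works unchanged on big-endian and little-endian hosts.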

--
Ben M.
 
 
Michael Wojcik
 
      06-10-2004

[Followups restricted to comp.programming.]

In article <P7Kxc.681713$Pk3.125780@pd7tw1no>, Jeff Brooks <(E-Mail Removed)> writes:
>
> "All XML processors MUST accept the UTF-8 and UTF-16 encodings of
> Unicode 3.1"
> - http://www.w3.org/TR/REC-xml/#charsets
>
> "The primary feature of Unicode 3.1 is the addition of 44,946 new
> encoded characters. ...
>
> For the first time, characters are encoded beyond the original 16-bit
> codespace or Basic Multilingual Plane (BMP or Plane 0). These new
> characters, encoded at code positions of U+10000 or higher, are
> synchronized with the forthcoming standard ISO/IEC 10646-2."
> - http://www.unicode.org/reports/tr27/
>
> The majority of XML parsers only use 16-bit characters. This means that
> the majority of XML parsers can't actually read XML.


I don't believe this is correct. UTF-16 encodes characters in U+10000
- U+10FFFF as surrogate pairs. None of the surrogate code points match
any of the scalar code points, so there's no ambiguity - all surrogate
pairs are composed of 16-bit values that can't be mistaken for scalar
UTF-16 characters.

As long as the parser processes the surrogate pair without altering
it and recognizes it unambiguously, the parser would seem to be
complying with the XML specification. None of those characters (in
their surrogate-pair UTF-16 representation or any other) has any
special meaning in XML, so a parser that treated the surrogate pair
as a pair of 16-bit characters should do just fine.

In other words, the parser doesn't have to recognize that characters
from U+10000 and up (in their surrogate-pair encoding) are special,
because to it they aren't special.
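For anyone who does need the actual code point, a small C sketch of the surrogate arithmetic follows. The helper names are hypothetical; the ranges (D800-DBFF for the first unit, DC00-DFFF for the second) are the ones UTF-16 reserves for surrogates.

```c
#include <stdint.h>

/* First unit of a pair: U+D800..U+DBFF. */
int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

/* Second unit of a pair: U+DC00..U+DFFF. */
int is_low_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a valid high/low pair into a code point in U+10000..U+10FFFF.
   The caller is expected to have checked both units first. */
uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000u
         + (((uint32_t)hi - 0xD800u) << 10)
         + ((uint32_t)lo - 0xDC00u);
}
```

Since neither range overlaps '<', '&' or any other character XML markup cares about, a parser scanning 16-bit units for markup never confuses half of a pair with a delimiter.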

The only case that immediately comes to mind where the distinction
would matter is if the parser had an API that returned data character-
by-character, which should have special provisions for surrogate
pairs (or be documented as returning them in halves). However, I've
not seen such a parser, AFAIK, and I don't know why one would provide
such an API.

Or, I suppose, if the parser offered to transform the document data
among various supported encodings. In that case, not handling UTF-16
surrogate pairs would indeed be a bug. On the other hand, I'm not
sure such transformations are necessarily the job of an XML parser;
that could be considered a bug in a set of additional utilities
provided alongside the parser.

--
Michael Wojcik (E-Mail Removed)

Even though there may be some misguided critics of what we're trying
to do, I think we're on the wrong path. -- Reagan
 
 
Donald Roby
 
      08-28-2004
On Fri, 28 May 2004 10:05:35 -0400, Arthur J. O'Dwyer wrote:

>
> *Again* I urge the consultation of the RFCs defining any standard
> binary file format, and the notice of the complete lack of regard
> for big-endian/little-endian/19-bit-int/37-bit-int issues. At the
> byte level, these things simply never come up.
>
>


Try (for example) RFC 1314.

These things certainly do come up, and they're handled by encoding the
rules in a header of the format.
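As a concrete case, the format in RFC 1314 is built on TIFF, whose header starts with "II" (little-endian) or "MM" (big-endian) followed by the number 42 stored in that byte order. A minimal sketch of reading that header in C, with the function name `tiff_byte_order` made up for this example:

```c
#include <stddef.h>

/* Inspect the first four bytes of a TIFF file.
   Returns 1 for big-endian ("MM"), 0 for little-endian ("II"),
   or -1 if the bytes do not look like a TIFF header. */
int tiff_byte_order(const unsigned char *h, size_t len)
{
    int big;
    unsigned magic;

    if (len < 4)
        return -1;
    /* Compare against byte values, not character literals, so the
       check does not depend on the host character set. */
    if (h[0] == 0x49 && h[1] == 0x49)
        big = 0;                          /* "II" */
    else if (h[0] == 0x4D && h[1] == 0x4D)
        big = 1;                          /* "MM" */
    else
        return -1;

    /* The 16-bit value 42 follows, in the byte order just announced. */
    magic = big ? (((unsigned)h[2] << 8) | h[3])
                : (((unsigned)h[3] << 8) | h[2]);
    return magic == 42 ? big : -1;
}
```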


 
 
Arthur J. O'Dwyer
 
      08-28-2004

On Sat, 28 Aug 2004, Donald Roby wrote:
>
> On Fri, 28 May 2004 10:05:35 -0400, Arthur J. O'Dwyer wrote:
>> *Again* I urge the consultation of the RFCs defining any standard
>> binary file format, and the notice of the complete lack of regard
>> for big-endian/little-endian/19-bit-int/37-bit-int issues. At the
>> byte level, these things simply never come up.

>
> Try (for example) RFC 1314.


[RFC defining among other things a subset(?) of the TIFF image
file format]

> These things certainly do come up, and they're handled by
> encoding the rules in a header of the format.


Not really. TIFF /is/ weird in that it explicitly provides
both a "big-endian" format and a "little-endian" format, and TIFF
readers have to provide routines to read both formats. But the
endianness/word size of the machine never comes up. If it did,
we wouldn't be able to write TIFF writers or readers that worked
on platforms with different endiannesses. (IIRC, this whole thread
was started way back in the mists of time with the idea that

fputs("42000\n", fp);

produces different results on different machines (because of the
embedded newline, which produces different bytes on different
systems; not to mention the possibility of EBCDIC!), while

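unsigned long result = 42000;   /* at least 32 bits, so >>24 is well defined */
unsigned char buffer[8];
buffer[0] = (result>>24)&0xFF;  /* most significant byte first */
buffer[1] = (result>>16)&0xFF;
buffer[2] = (result>>8)&0xFF;
buffer[3] = (result>>0)&0xFF;   /* least significant byte last */
fwrite(buffer, 1, 4, fp);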

produces the exact same bytes on every platform. Thus "binary
is better than text" if you care about portability more than
human-readability.)
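The reading side follows the same pattern. Below is a sketch in C with hypothetical helper names `put_u32_be` and `get_u32_be`; they work on a byte buffer, which would then be handed to fwrite or filled by fread on a stream opened in binary mode.

```c
/* Store the low 32 bits of v into b[0..3], most significant byte first. */
void put_u32_be(unsigned char *b, unsigned long v)
{
    b[0] = (unsigned char)((v >> 24) & 0xFF);
    b[1] = (unsigned char)((v >> 16) & 0xFF);
    b[2] = (unsigned char)((v >> 8) & 0xFF);
    b[3] = (unsigned char)(v & 0xFF);
}

/* Rebuild the value from four bytes written by put_u32_be. Only shifts
   and ORs are used, so the host's own byte order never enters into it. */
unsigned long get_u32_be(const unsigned char *b)
{
    return ((unsigned long)b[0] << 24)
         | ((unsigned long)b[1] << 16)
         | ((unsigned long)b[2] << 8)
         |  (unsigned long)b[3];
}
```

For 42000 (0xA410) the four bytes are 00 00 A4 10 on any machine.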

But since we already had that discussion (several months ago,
IIRC), I'm not going to get back into it.

-Arthur,
signing off
 