UTF16, BOM, and Windows Line endings

 
 
Fuzzyman
      02-06-2006
Hello all,

I'm handling some text files where I don't (necessarily) know the
encoding beforehand. Because I use regular expressions to parse the
text I *must* decode UTF16 encoded text (otherwise the regexes split on
byte boundaries).

I can recognise a UTF8 BOM and remove it (but not necessarily decode the text).
For UTF16 it seems that the Python codec will automatically remove the
BOM. Having detected it (to trigger a decode) is it considered
*invalid* to remove it ? The codec certainly handles the text without a
BOM - I just don't want this part of the code to break later.

Because I don't know the encoding until I've checked for the BOM I have
to read in binary mode. Similarly I have to write in binary mode.
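
A minimal sketch of that BOM check, assuming Python 3 (where data read in binary mode is `bytes`). The helper name `decode_with_bom` and the `default` fallback encoding are hypothetical; the BOM constants come from the standard `codecs` module. The byte-order-specific codecs ("utf-16-le", "utf-16-be") do not strip the BOM themselves, so the sketch removes it before decoding.

```python
import codecs

# BOMs to look for. UTF-32 is not covered here; it would have to be tested
# before UTF-16, because BOM_UTF32_LE starts with the bytes of BOM_UTF16_LE.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

def decode_with_bom(data, default="latin-1"):
    """Return (text, encoding); encoding is None when no BOM was found."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Strip the BOM ourselves, then decode the rest.
            return data[len(bom):].decode(encoding), encoding
    # No BOM found: fall back to an assumed default encoding.
    return data.decode(default), None
```

Returning the detected encoding alongside the text lets the caller write the file back out in the same encoding later.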

How should I handle line-endings for UTF16 ? Is it possible that other
programs (on windows) will have line endings as u'\r\n' ? When saving
files for that platform should I make the line endings u'\r\n' ? (This
sequence obviously encodes to four bytes in UTF16). I would only do
this to ensure compatibility with other programs the user may use to
create the text files.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

 
Neil Hodgson
      02-06-2006
Fuzzyman:

> How should I handle line-endings for UTF16 ? Is it possible that other
> programs (on windows) will have line endings as u'\r\n' ?


Yes, try Notepad and save as Unicode. For the text

Fuzzy
End of lines

>>> contents = open("C:\\fuzzy.txt", "rb").read()
>>> contents
'\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00 \x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00'
>>>


The '\r\x00\n\x00' is a u'\r\n'.
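
The same point as a short sketch, assuming Python 3, where file contents read in binary mode are `bytes` and the `u` prefix is optional. The byte string is the two-line sample text as UTF-16 with a little-endian BOM, retyped by hand.

```python
# Bytes as saved by Notepad's "Unicode" option: a little-endian BOM
# followed by UTF-16-LE text.
contents = (b'\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00'
            b'E\x00n\x00d\x00 \x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00')

# The generic "utf-16" codec reads the BOM to pick the byte order and drops it.
text = contents.decode("utf-16")
print(repr(text))

# One CRLF pair is two characters, i.e. four bytes, in UTF-16.
crlf = "\r\n".encode("utf-16-le")
print(crlf, len(crlf))
```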

> When saving
> files for that platform should I make the line endings u'\r\n' ? (This
> sequence obviously encodes to four bytes in UTF16). I would only do
> this to ensure compatibility with other programs the user may use to
> create the text files.


Notepad will read u'\r\n'. It doesn't like '\n' or u'\n'. Some
applications are OK with other line ends, but '\r\n' and u'\r\n' are
safest on Windows.
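
A sketch of saving text that way, assuming Python 3 and text held in memory as `str`. The function name `encode_for_windows` is hypothetical. The BOM is written explicitly because the plain "utf-16" codec uses the platform's native byte order when encoding.

```python
import codecs

def encode_for_windows(text):
    """Encode text as UTF-16-LE with a BOM and CRLF line endings."""
    # Normalise first so existing CRLF pairs do not become CR CR LF.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    crlf_text = normalised.replace("\n", "\r\n")
    return codecs.BOM_UTF16_LE + crlf_text.encode("utf-16-le")

data = encode_for_windows("Fuzzy\nEnd of lines")
# Write the result in binary mode ("wb") so that Python performs no
# further newline translation on the already-encoded bytes.
```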

Neil
 
Fuzzyman
      02-06-2006

Neil Hodgson wrote:
> Fuzzyman:
>
> > How should I handle line-endings for UTF16 ? Is it possible that other
> > programs (on windows) will have line endings as u'\r\n' ?

>
> Yes, try Notepad and save as Unicode. For the text
>
> Fuzzy
> End of lines
>
> >>> contents = open("C:\\fuzzy.txt", "rb").read()
> >>> contents
> '\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00 \x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00'
> >>>

>
> The '\r\x00\n\x00' is a u'\r\n'.
>
> > When saving
> > files for that platform should I make the line endings u'\r\n' ? (This
> > sequence obviously encodes to four bytes in UTF16). I would only do
> > this to ensure compatibility with other programs the user may use to
> > create the text files.

>
> Notepad will read u'\r\n'. It doesn't like '\n' or u'\n'. Some
> applications are OK with other line ends, but '\r\n' and u'\r\n' are
> safest on Windows.
>


Thanks - so I need to decode to unicode and *then* split on line
endings. Problem is, that means I can't use Python to handle line
endings where I don't know the encoding in advance.
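
To illustrate why the order matters, here is a small sketch, assuming Python 3. Splitting the raw UTF-16 bytes on the newline byte cuts a character in half, while decoding first gives clean lines.

```python
raw = "Fuzzy\r\nEnd of lines".encode("utf-16-le")

# Splitting the bytes on b"\n" leaves the second half of the newline
# character (a stray b"\x00") at the start of the next piece, so that
# piece has an odd length and is misaligned for UTF-16.
broken = raw.split(b"\n")

# Decoding first and then splitting works on whole characters.
lines = raw.decode("utf-16-le").split("\r\n")
```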

In another thread I've posted a small function that *guesses* line
endings in use.

All the best,


Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

> Neil


 
Neil Hodgson
      02-07-2006
Fuzzyman:

> Thanks - so I need to decode to unicode and *then* split on line
> endings. Problem is, that means I can't use Python to handle line
> endings where I don't know the encoding in advance.
>
> In another thread I've posted a small function that *guesses* line
> endings in use.


You can normalise line endings:

>>> x = "a\r\nb\rc\nd\n\re"
>>> y = x.replace("\r\n", "\n").replace("\r","\n")
>>> y

'a\nb\nc\nd\n\ne'
>>> print y

a
b
c
d

e

The empty line is because "\n\r" is 2 line ends.

Neil
 
Fuzzyman
      02-07-2006

Neil Hodgson wrote:
> Fuzzyman:
>
> > Thanks - so I need to decode to unicode and *then* split on line
> > endings. Problem is, that means I can't use Python to handle line
> > endings where I don't know the encoding in advance.
> >
> > In another thread I've posted a small function that *guesses* line
> > endings in use.

>
> You can normalise line endings:
>
> >>> x = "a\r\nb\rc\nd\n\re"
> >>> y = x.replace("\r\n", "\n").replace("\r","\n")
> >>> y

> 'a\nb\nc\nd\n\ne'
> >>> print y

> a
> b
> c
> d
>
> e
>
> The empty line is because "\n\r" is 2 line ends.
>


Thanks - that works, but it replaces *all* instances of '\r' with '\n' -
even if they aren't used as line terminators. (Unlikely perhaps). It
also doesn't tell me what line ending was used.
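
One way to find out which line ending a decoded string uses is to count each kind, sketched here under the assumption of Python 3. The function name `count_line_endings` is hypothetical, and this is not the guessing function posted in the other thread.

```python
import re

def count_line_endings(text):
    """Count CRLF, lone CR and lone LF in already-decoded text."""
    counts = {"\r\n": 0, "\r": 0, "\n": 0}
    # The alternation tries CRLF first, so a CRLF pair is counted once
    # rather than as a separate CR and LF.
    for match in re.finditer(r"\r\n|\r|\n", text):
        counts[match.group()] += 1
    return counts

counts = count_line_endings("a\r\nb\rc\nd\n\re")
```

The most frequent key is a reasonable guess at the file's convention, and a mix of non-zero counts signals inconsistent line endings.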

Apparently files opened in universal mode - 'rU' - have a newlines
attribute. That makes it a bit easier.
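
A sketch of the same idea in Python 3, where universal newlines is the default for text mode (`newline=None`) and the attribute is called `newlines`. An in-memory `io.BytesIO` stands in for a real file here, and the sample bytes are made up for illustration.

```python
import io

data = b'\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00'

# newline=None turns on universal newlines: "\r\n" and "\r" are read as "\n".
stream = io.TextIOWrapper(io.BytesIO(data), encoding="utf-16", newline=None)
text = stream.read()

# After reading, .newlines reports which line endings were actually seen.
seen = stream.newlines
```

Because the text layer decodes before it translates newlines, this works for UTF-16 once the encoding is known.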

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml


> Neil


 