
Velocity Reviews > Newsgroups > Programming > Python > Detecting line endings


Detecting line endings

 
 
Fuzzyman
 
      02-06-2006
Hello all,

I'm trying to detect line endings used in text files. I *might* be
decoding the files into unicode first (which may be encoded using
multi-byte encodings) - which is why I'm not letting Python handle the
line endings.

Is the following safe and sane:

text = open('test.txt', 'rb').read()
if encoding:
    text = text.decode(encoding)
ending = '\n'  # default
if '\r\n' in text:
    text = text.replace('\r\n', '\n')
    ending = '\r\n'
elif '\n' in text:
    ending = '\n'
elif '\r' in text:
    text = text.replace('\r', '\n')
    ending = '\r'


My worry is that if '\n' *doesn't* signify a line break on the Mac,
then it may exist in the body of the text and trigger ``ending =
'\n'`` prematurely?

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

 
 
 
 
 
Sybren Stuvel
 
      02-06-2006
Fuzzyman enlightened us with:
> My worry is that if '\n' *doesn't* signify a line break on the Mac,
> then it may exist in the body of the text - and trigger ``ending =
> '\n'`` prematurely ?


I'd count the number of occurrences of '\r\n', '\n' without a preceding
'\r', and '\r' without a following '\n', and let the majority decide.
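A minimal sketch of that counting idea (the helper name is made up), using plain str.count rather than regexes: every '\r\n' contributes one to both the raw '\r' and '\n' totals, so the "bare" counts fall out by subtraction:

```python
def guess_ending(text, default='\n'):
    """Guess the dominant line ending in text by majority vote."""
    crlf = text.count('\r\n')
    lf = text.count('\n') - crlf   # '\n' without a preceding '\r'
    cr = text.count('\r') - crlf   # '\r' without a following '\n'
    counts = [(lf, '\n'), (cr, '\r'), (crlf, '\r\n')]
    winner = max(counts)
    # No endings at all: fall back to the default.
    if winner[0]:
        return winner[1]
    return default
```

Ties go to whichever literal happens to compare larger, which is arbitrary; a real version would want an explicit priority order.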

Sybren
--
The problem with the world is stupidity. Not saying there should be a
capital punishment for stupidity, but why don't we just take the
safety labels off of everything and let the problem solve itself?
Frank Zappa
 
 
 
 
 
Fuzzyman
 
      02-06-2006

Sybren Stuvel wrote:
> Fuzzyman enlightened us with:
> > My worry is that if '\n' *doesn't* signify a line break on the Mac,
> > then it may exist in the body of the text - and trigger ``ending =
> > '\n'`` prematurely ?

>
> I'd count the number of occurrences of '\r\n', '\n' without a preceding
> '\r', and '\r' without a following '\n', and let the majority decide.
>


Sounds reasonable, edge cases for small files be damned.

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml



 
 
Fuzzyman
 
      02-06-2006

Sybren Stuvel wrote:
> Fuzzyman enlightened us with:
> > My worry is that if '\n' *doesn't* signify a line break on the Mac,
> > then it may exist in the body of the text - and trigger ``ending =
> > '\n'`` prematurely ?

>
> I'd count the number of occurrences of '\r\n', '\n' without a preceding
> '\r', and '\r' without a following '\n', and let the majority decide.
>


This is what I came up with. As you can see from the docstring, it
attempts to do sensible(-ish) things in the event of a tie, or no line
endings at all.

Comments/corrections welcomed. I know the tests aren't very useful
(because they make no *assertions*, they won't tell you if it breaks),
but you can see what's going on:

import re
import os

# One regex per ending style: '\r\n' pairs, bare '\r' (not followed
# by '\n') and bare '\n' (not preceded by '\r').
rn = re.compile('\r\n')
r = re.compile('\r(?!\n)')
n = re.compile('(?<!\r)\n')

# Sequence of (regex, literal, priority) for each line ending
line_ending = [(n, '\n', 3), (rn, '\r\n', 2), (r, '\r', 1)]


def find_ending(text, default=os.linesep):
    r"""
    Given a piece of text, use a simple heuristic to determine the line
    ending in use.

    Returns the value assigned to default if no line endings are found.
    This defaults to ``os.linesep``, the native line ending for the
    machine.

    If there is a tie between two endings, the priority chain is
    ``'\n', '\r\n', '\r'``.
    """
    results = [(len(exp.findall(text)), priority, literal) for
               exp, literal, priority in line_ending]
    # Highest count wins; the priority breaks ties.
    results.sort()
    if not sum([m[0] for m in results]):
        return default
    else:
        return results[-1][-1]


if __name__ == '__main__':
    tests = [
        'hello\ngoodbye\nmy fish\n',
        'hello\r\ngoodbye\r\nmy fish\r\n',
        'hello\rgoodbye\rmy fish\r',
        'hello\rgoodbye\n',
        '',
        '\r\r\r \n\n',
        '\n\n \r\n\r\n',
        '\n\n\r \r\r\n',
        '\n\r \n\r \n\r',
    ]
    for entry in tests:
        print repr(entry)
        print repr(find_ending(entry))
        print

All the best,


Fuzzyman
http://www.voidspace.org.uk/python/index.shtml


 
 
Alex Martelli
 
      02-07-2006
Fuzzyman <(E-Mail Removed)> wrote:

> Hello all,
>
> I'm trying to detect line endings used in text files. I *might* be
> decoding the files into unicode first (which may be encoded using


Open the file with 'rU' mode, and check the file object's
``newlines`` attribute.

> My worry is that if '\n' *doesn't* signify a line break on the Mac,


It does, and has for a few years now, since Mac OS X is a version of
Unix to all practical intents and purposes.


Alex
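In Python 2 that looks something like the sketch below (the file name is made up); in Python 3 universal-newline translation is the default for text mode, so a plain ``open(path)`` behaves like 'rU'. The ``newlines`` attribute is ``None`` until some data has been read, then holds the ending seen (or a tuple, if the file mixes them):

```python
import os
import tempfile

# Create a throwaway file with Windows-style endings.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with open(path, 'wb') as f:
    f.write(b'hello\r\ngoodbye\r\n')

# 'rU' in Python 2; in Python 3 text mode is universal by default.
with open(path) as f:
    text = f.read()          # endings translated to '\n' on read
    ending = f.newlines      # the ending actually encountered

print(repr(text))    # 'hello\ngoodbye\n'
print(repr(ending))  # '\r\n'
```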
 
 
Sybren Stuvel
 
      02-07-2006
Fuzzyman enlightened us with:
> This is what I came up with. [...] Comments/corrections welcomed.


The code could use a few more comments, but apart from that it looks
nice.

Sybren
 
 
Fuzzyman
 
      02-07-2006

Alex Martelli wrote:
> Fuzzyman <(E-Mail Removed)> wrote:
>
> > Hello all,
> >
> > I'm trying to detect line endings used in text files. I *might* be
> > decoding the files into unicode first (which may be encoded using

>
> Open the file with 'rU' mode, and check the file object's
> ``newlines`` attribute.
>


Ha, so long as it works with Python 2.2, that makes things a bit
easier.

Rats, I liked that snippet of code (I'm a great fan of list
comprehensions).

> > My worry is that if '\n' *doesn't* signify a line break on the Mac,

>
> It does, and has for a few years now, since Mac OS X is a version of
> Unix to all practical intents and purposes.
>


I wondered if that might be the case. I think I've worried about this
more than enough now.

Thanks

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

>
> Alex


 
 
Fuzzyman
 
      02-07-2006

Alex Martelli wrote:
> Fuzzyman <(E-Mail Removed)> wrote:
>
> > Hello all,
> >
> > I'm trying to detect line endings used in text files. I *might* be
> > decoding the files into unicode first (which may be encoded using

>
> Open the file with 'rU' mode, and check the file object's
> ``newlines`` attribute.
>


Do you know if this works for multi-byte encodings? Do files have
metadata associated with them showing the line ending in use?

I suppose I could test this...
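For what it's worth, the test is easy to run. With an encoding-aware open, newline detection happens after decoding, so multi-byte encodings work too; there is no metadata involved, the decoded stream is simply scanned as it's read. A sketch using UTF-16 and a made-up file name (Python 3 spelling, where ``open`` takes an ``encoding`` argument; in Python 2, ``codecs.open`` plays a similar role, though its streams are binary and don't do universal newlines):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'test16.txt')

# Write '\r\n' endings as UTF-16: the encoded bytes never contain a
# literal b'\r\n', so byte-level sniffing would miss them entirely.
with open(path, 'w', encoding='utf-16', newline='') as f:
    f.write('hello\r\ngoodbye\r\n')

# Decode first, then let universal newlines do the detection.
with open(path, encoding='utf-16') as f:
    text = f.read()          # 'hello\ngoodbye\n' after translation
    ending = f.newlines      # found in the decoded character stream

print(repr(ending))  # '\r\n'
```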

All the best,


Fuzzy

> > My worry is that if '\n' *doesn't* signify a line break on the Mac,

>
> It does, and has for a few years now, since Mac OS X is a version of
> Unix to all practical intents and purposes.
>
>
> Alex


 
 
Arthur
 
      02-07-2006
Alex Martelli wrote:
> Fuzzyman <(E-Mail Removed)> wrote:
>
>
>>Hello all,
>>
>>I'm trying to detect line endings used in text files. I *might* be
>>decoding the files into unicode first (which may be encoded using

>
>
> Open the file with 'rU' mode, and check the file object's
> ``newlines`` attribute.


Do you think it would be sensible for file.readline to use universal
newline support by default?

I just got flummoxed by this issue, working with a (pre-alpha) package
by some very experienced Python programmers, which fed file.readline to
tokenizer.py without universal newline support. I went on a long (and
educational) journey trying to figure out why my file was not being
processed as expected.

Are there circumstances in which it would be sensible to have tokenizer
process files without universal newline support?

The result here was tokenizer detecting indentation inconsistencies
that did not exist - in the sense that the files compiled and ran
fine under python.exe.
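The failure mode is easy to reproduce. Without universal newline support, a file with old-Mac '\r' endings looks like one long line to anything that splits on '\n', which is exactly the kind of confusion that puts a tokenizer out of step with the compiler. A small sketch (file name made up):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'mac_endings.py')
with open(path, 'wb') as f:
    f.write(b'x = 1\rif x:\r    y = 2\r')   # old-Mac '\r' endings

# Naive byte-level split on '\n': the whole file is one "line".
raw_lines = open(path, 'rb').read().split(b'\n')
print(len(raw_lines))   # 1

# Universal newlines ('rU' in Python 2, default text mode in 3):
with open(path) as f:
    lines = f.readlines()
print(len(lines))       # 3
```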

Art
 
 
Arthur
 
      02-07-2006
Arthur wrote:
> Alex Martelli wrote:
>
> I just got flummoxed by this issue, working with a (pre-alpha) package
> by very experienced Python programmers who sent file.readline to
> tokenizer.py without universal newline support. Went on a long (and
> educational) journey trying to figure out why my file was not being
> processed as expected.


For example, the widely used MoinMoin source code colorizer sends files
to tokenizer without universal newline support:

http://aspn.activestate.com/ASPN/Coo...n/Recipe/52298

Is my premise correct that tokenizer needs universal newline support to
be reliable?

What else could put it out of sync with the compiler?

Art
 
 
 
 