Velocity Reviews - Computer Hardware Reviews

Does Python mess with CRLFs?

 
 
Gilles Ganault
      11-12-2008
Hello

I'm stuck at understanding why Python can't extract some bit from an
HTML file using regexes, although I can find it just fine with
UltraEdit.

I wonder if Python rewrites CRLFs when reading a text file with
open/read?

Here's the code:
==========
import re

# Read the whole file in text mode.
f = open("content.html", "r")
content = f.read()
f.close()

#BAD
friends = re.compile('</td></tr></table>\r\n</div>\r\n',
                     re.IGNORECASE | re.MULTILINE | re.DOTALL)

#GOOD
friends = re.compile('</td></tr></table>',
                     re.IGNORECASE | re.MULTILINE | re.DOTALL)

m = friends.search(content)
if m:
    print("Found")
else:
    print("List not found")
==========

Thank you for any tip.
 
Gilles Ganault
      11-12-2008
On Wed, 12 Nov 2008 12:04:07 +0100, Gilles Ganault wrote:
>I wonder if Python rewrites CRLFs when reading a text file with
>open/read?


For those seeing the same thing, the answer is yes: On Windows, the
code above turns CRLF into LF. I tried "rb" instead of "r", with no
difference.
 
John Machin
      11-12-2008
On Nov 12, 10:04 pm, Gilles Ganault wrote:
> Hello
>
> I'm stuck at understanding why Python can't extract some bit from an
> HTML file using regexes, although I can find it just fine with
> UltraEdit.
>
> I wonder if Python rewrites CRLFs when reading a text file with
> open/read?


Don't wonder; do some very elementary debugging and find out for
yourself.

> Here's the code:
> ==========
> f = open("content.html", "r")
> content = f.read()
> f.close()


Consider inserting
print repr(content)
here.
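
A minimal sketch of that debugging step, written for Python 3 (where `print` is a function) rather than the Python 2 used in the original post. The sample string is made up for illustration and stands in for whatever `f.read()` returned.

```python
# Hypothetical stand-in for the string returned by f.read().
content = "</td></tr></table>\n</div>\n"

# repr() makes invisible characters explicit: a CRLF pair would show
# up as \r\n, a bare LF as \n.
shown = repr(content)
print(shown)

# The same check, done programmatically.
has_crlf = "\r\n" in content
print("contains CRLF:", has_crlf)
```

If the output shows `\n` where the regex expects `\r\n`, the pattern and the data disagree about line endings, which would explain a failed match.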

 
 
Irmen de Jong
      11-12-2008

Gilles Ganault wrote:
> On Wed, 12 Nov 2008 12:04:07 +0100, Gilles Ganault wrote:
>> I wonder if Python rewrites CRLFs when reading a text file with
>> open/read?

>
> For those seeing the same thing, the answer is yes: On Windows, the
> code above turns CRLF into LF. I tried "rb" instead of "r", with no
> difference.


Sorry but that is not what's happening. Your problem is not in reading the
file, it's in the regular expression you're using.

Using open with the "rb" flag leaves the file content intact and does not munge newlines
in any way. A read() will return the exact bytes that are in the file.

--irmen
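
The difference between the two modes can be checked with a small experiment. This is a sketch for Python 3, where text mode translates CRLF to LF by default on every platform; the thread itself concerns Python 2 on Windows, so the text-mode half may behave differently there. The sample bytes and the temporary file are assumptions made for the example, not the poster's actual `content.html`.

```python
import os
import tempfile

# Hypothetical sample data: two lines terminated by CRLF.
raw = b"</td></tr></table>\r\n</div>\r\n"

# Write the bytes to a temporary file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write(raw)

# Binary mode: read() returns the exact bytes stored in the file.
with open(path, "rb") as f:
    binary_content = f.read()

# Text mode: Python 3 translates CRLF to LF by default (newline=None).
with open(path, "r", encoding="ascii") as f:
    text_content = f.read()

os.remove(path)

print(repr(binary_content))
print(repr(text_content))
```

Comparing the two `repr()` lines shows whether the `\r` characters survived the read, which is what a pattern containing a literal `\r\n` depends on.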
 
 
Irmen de Jong
      11-12-2008

Gilles Ganault wrote:
> Hello
>
> I'm stuck at understanding why Python can't extract some bit from an
> HTML file using regexes, although I can find it just fine with
> UltraEdit.
>
> #BAD
> friends = re.compile('</td></tr></table>\r\n</div>\r\n',re.IGNORECASE
> | re.MULTILINE | re.DOTALL)


If you keep running into trouble and you're sure it's related to the newlines,
it may help to use the whitespace class \s instead of a literal \r\n in your expression:
re.compile('</td></tr></table>\\s*</div>\\s*', .... )

Other than that, hard to say what's not working as expected without knowing
the exact contents of the "content.html" file you're searching in....

--irmen
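
A short sketch of the whitespace-tolerant pattern, written for Python 3. The two sample strings are invented for illustration, since the real contents of `content.html` are not shown in the thread, and the flags are reduced to `re.IGNORECASE` because the elided arguments in the suggestion above are not known.

```python
import re

# \s* matches any run of whitespace, so the same pattern works whether
# the lines end in CRLF or LF. A raw string avoids doubling backslashes.
friends = re.compile(r"</td></tr></table>\s*</div>\s*", re.IGNORECASE)

# Hypothetical samples: the same markup with each line-ending style.
with_crlf = "<table><tr><td>x</td></tr></table>\r\n</div>\r\n"
with_lf = "<table><tr><td>x</td></tr></table>\n</div>\n"

for sample in (with_crlf, with_lf):
    print("Found" if friends.search(sample) else "List not found")
```

The pattern no longer cares which line-ending convention the file uses, at the cost of also accepting extra spaces or blank lines between the two tags.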
 