regex over files

 
 
Robin Becker
      04-25-2005
Is there any way to get regexes to work on non-string/unicode objects? I would
like to split large files by regex and it seems relatively hard to do so without
having the whole file in memory. Even with buffers it seems hard to get regexes
to indicate that they failed because of buffer termination and getting a partial
match to be resumable seems out of the question.

What interface does re actually need for its src objects?
--
Robin Becker

 
Bengt Richter
      04-27-2005
On Mon, 25 Apr 2005 16:01:45 +0100, Robin Becker wrote:

>Is there any way to get regexes to work on non-string/unicode objects? I would
>like to split large files by regex and it seems relatively hard to do so without
>having the whole file in memory. Even with buffers it seems hard to get regexes
>to indicate that they failed because of buffer termination and getting a partial
>match to be resumable seems out of the question.
>
>What interface does re actually need for its src objects?


ISTM splitting is a special situation where you can easily
chunk through a file and split as you go, since if splitting
the current chunk succeeds, you can be sure that all but the
tail piece is valid[1]. So you can make an iterator that yields
all but the last and then sets the buffer to last+newchunk
and goes on until there are no more chunks, and the tail part
will be a valid split piece. E.g. (not tested beyond what you see):

>>> def frxsplit(path, rxo, chunksize=8192):
...     buffer = ''
...     for chunk in iter((lambda f=open(path): f.read(chunksize)),''):
...         buffer += chunk
...         pieces = rxo.split(buffer)
...         for piece in pieces[:-1]: yield piece
...         buffer = pieces[-1]
...     yield buffer
...
>>> import re
>>> rxo = re.compile('XXXXX')


The test file:

>>> print '----\n%s----'%open('tsplit.txt').read()
----
This is going to be split on five X's
like XXXXX but we will use a buffer of
XXXXX length 2 to force buffer appending.
We'll try a splitter at the end: XXXXX
----

>>> for piece in frxsplit('tsplit.txt', rxo, 2): print repr(piece)
...
"This is going to be split on five X's\nlike "
' but we will use a buffer of\n'
" length 2 to force buffer appending.\nWe'll try a splitter at the end: "
'\n'

>>> rxo = re.compile('(XXXXX)')
>>> for piece in frxsplit('tsplit.txt', rxo, 2): print repr(piece)
...
"This is going to be split on five X's\nlike "
'XXXXX'
' but we will use a buffer of\n'
'XXXXX'
" length 2 to force buffer appending.\nWe'll try a splitter at the end: "
'XXXXX'
'\n'

[1] In some cases of regexes with lookahead context, you might
have to check that the last piece not only exists but exceeds
max lookahead length, in case there is a <withlookahead>|<plain>
kind of thing in the regex where <withlookahead> would have succeeded
with another chunk appended to buffer, but <plain> did the split.
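
One way to act on footnote [1] is sketched below, in Python 3 rather than the Python 2 of the session above. The names frxsplit_holdback and read_chunks and the holdback parameter are made up for this illustration. The idea is to trust a split only when the tail piece is longer than holdback, which the caller picks as an upper bound on how far past the end of a match the pattern may need to look; otherwise the whole buffer is kept and split again once more data arrives. The sketch assumes the pattern never matches the empty string and uses no lookbehind.

```python
import re

def read_chunks(f, chunksize=8192):
    """Return an iterator over successive chunks of an open text file."""
    return iter(lambda: f.read(chunksize), '')

def frxsplit_holdback(chunks, rxo, holdback=0):
    """Split a stream of text chunks on rxo, deferring doubtful splits.

    holdback is a caller-supplied assumption: an upper bound on how many
    characters beyond the end of a match the pattern might need to see.
    """
    buffer = ''
    for chunk in chunks:
        buffer += chunk
        pieces = rxo.split(buffer)
        # Accept the split only if enough text follows the last match;
        # otherwise keep the whole buffer and retry with the next chunk.
        if len(pieces) > 1 and len(pieces[-1]) > holdback:
            yield from pieces[:-1]
            buffer = pieces[-1]
    # No more data is coming, so the remaining buffer can be split for good.
    yield from rxo.split(buffer)

if __name__ == '__main__':
    rxo = re.compile('XX(?=YY)|X')      # <withlookahead>|<plain>
    text = 'aXXYYb'
    chunks = [text[i:i + 3] for i in range(0, len(text), 3)]  # 'aXX', 'YYb'
    print(list(frxsplit_holdback(chunks, rxo, holdback=4)))
    print(rxo.split(text))
```

With the chunks 'aXX' and 'YYb', the plain frxsplit logic would let the single X alternative split twice before 'YY' arrives and yield 'a', '', 'YYb', whereas splitting the whole string gives ['a', 'YYb']. With holdback=4 the sketch agrees with the whole-string result, at the cost of buffering a little more before yielding. For a file, something like frxsplit_holdback(read_chunks(open(path)), rxo, holdback) should work the same way.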

Regards,
Bengt Richter
 