split large file by string/regex

 
 
Martin Dieringer
      11-22-2004

I am trying to split a file by a fixed string.
The file is too large to read into a string and split there.
I could probably use a lexer, but maybe there is something simpler?
thanks
m.
 
Steve Holden
      11-22-2004
Martin Dieringer wrote:

> I am trying to split a file by a fixed string.
> The file is too large to just read it into a string and split this.
> I could probably use a lexer but there maybe anything more simple?
> thanks
> m.


Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a sequence
of overlapping chunks to make sure that a regex could pick up all
matches. For me that would be more complex than using a lexer, given the
excellent range of modules such as SPARK and PLY, to mention but two.
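
The overlapping-chunk scan described above can be sketched for the fixed-string case as follows. This is only an illustration, assuming Python 3 and a non-empty bytes delimiter; `find_all` and its parameters are hypothetical names, not part of any library. Each window keeps the last len(needle) - 1 bytes of the previous one, so a delimiter that straddles a chunk boundary is still seen. A general regex would need a larger overlap, because its matches have no fixed length.

```python
def find_all(path, needle, chunksize=64 * 1024):
    """Yield the file offset of each non-overlapping occurrence of needle."""
    if not needle:
        raise ValueError('needle must not be empty')
    overlap = len(needle) - 1
    tail = b''        # bytes carried over from the previous window
    base = 0          # file offset of window[0]
    next_free = 0     # file offset where the next match may start
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            window = tail + chunk
            pos = window.find(needle, max(0, next_free - base))
            while pos >= 0:
                yield base + pos
                next_free = base + pos + len(needle)
                pos = window.find(needle, next_free - base)
            # Keep just enough bytes to catch a match across the boundary.
            keep = min(overlap, len(window))
            tail = window[len(window) - keep:]
            base += len(window) - keep
```

Only one chunk plus the small overlap is held in memory at a time, whatever the file size.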

regards
Steve
--
http://www.holdenweb.com
http://pydish.holdenweb.com
Holden Web LLC +1 800 494 3119
 
Jason Rennie
      11-22-2004
On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
> I am trying to split a file by a fixed string.
> The file is too large to just read it into a string and split this.
> I could probably use a lexer but there maybe anything more simple?


If the pattern is contained within a single line, do something like this:

import re

myre = re.compile(r'foo')
fh = open(f)             # f: path of the input file
fh1 = open(f1, 'w')      # f1: path for the part before the match
s = fh.readline()
# Copy lines until the first match; readline() returns '' at EOF,
# so "s and" stops the loop if the pattern never appears.
while s and not myre.search(s):
    fh1.write(s)
    s = fh.readline()
fh1.close()
fh2 = open(f2, 'w')      # f2: path for the matching line and the rest
while s:
    fh2.write(s)
    s = fh.readline()
fh2.close()
fh.close()

I'm doing this off the top of my head, so this code almost certainly
has bugs. Hopefully it's enough to get you started... Note that only
one line is held in memory at any point in time. The first while loop
also checks for EOF, in case the pattern does not appear in the file.

Jason
 
Diez B. Roggisch
      11-22-2004
> Depends on your definition of "simple", I suppose. The problem with
> *not* using a lexer is that you'd have to examine the file in a sequence
> of overlapping chunks to make sure that a regex could pick up all
> matches. For me that would be more complex than using a lexer, given the
> excellent range of modules such as SPARK and PLY, to mention but two.


At least SPARK operates on whole strings if used as a lexer/tokenizer - you
can of course feed it a lazy sequence of tokens by using a generator - but
that's up to you.

--
Regards,

Diez B. Roggisch
 
Martin Dieringer
      11-22-2004
Steve Holden writes:

> Martin Dieringer wrote:
>
>> I am trying to split a file by a fixed string.
>> The file is too large to just read it into a string and split this.
>> I could probably use a lexer but there maybe anything more simple?
>> thanks
>> m.

>
> Depends on your definition of "simple", I suppose. The problem with
> *not* using a lexer is that you'd have to examine the file in a
> sequence of overlapping chunks to make sure that a regex could pick up
> all matches. For me that would be more complex than using a lexer,
> given the excellent range of modules such as SPARK and PLY, to mention
> but two.
>


Yes, lexing would be the simplest, but PLY also can't read from streams,
and it looks to me (from the examples) as if it's the same with SPARK.
I wonder why something like this is not in any library.
Is there any known lexer that can do this?
I don't have to parse, just write the chunks to separate files.
I really hate doing that sequence thing...

m.
 
Martin Dieringer
      11-22-2004
Jason Rennie writes:

> On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
>> I am trying to split a file by a fixed string.
>> The file is too large to just read it into a string and split this.
>> I could probably use a lexer but there maybe anything more simple?

>
> If the pattern is contained within a single line, do something like this:


Hmm it's binary data, I can't tell how long lines would be. OTOH a
line would certainly contain the pattern as it has no \n in it... and
the lines probably wouldn't be too large for memory...
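
The line-based idea can be sketched for binary data along these lines, assuming Python 3, a delimiter that contains no newline byte, and lines and pieces that fit in memory. `split_by_lines` is a hypothetical name used only for illustration.

```python
def split_by_lines(path, delim):
    """Yield the pieces of a binary file between occurrences of delim.

    Assumes delim is non-empty bytes and contains no newline byte, so
    every occurrence lies entirely inside one line.
    """
    piece = []    # parts of the piece currently being collected
    with open(path, 'rb') as f:
        for line in f:                 # binary files iterate by b'\n'
            parts = line.split(delim)
            for part in parts[:-1]:    # each of these ends at a delimiter
                piece.append(part)
                yield b''.join(piece)
                piece = []
            piece.append(parts[-1])    # may continue on the next line
    yield b''.join(piece)
```

The result is the same as calling split(delim) on the whole file contents, but only one line and the piece under construction are held at once.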

m.
 
Bengt Richter
      11-22-2004
On Mon, 22 Nov 2004 15:28:54 +0100, Martin Dieringer wrote:

>Jason Rennie writes:
>
>> On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
>>> I am trying to split a file by a fixed string.
>>> The file is too large to just read it into a string and split this.
>>> I could probably use a lexer but there maybe anything more simple?

>>
>> If the pattern is contained within a single line, do something like this:

>
>Hmm it's binary data, I can't tell how long lines would be. OTOH a
>line would certainly contain the pattern as it has no \n in it... and
>the lines probably wouldn't be too large for memory...
>
>m.

Do you want to keep the splitting string? I.e., if you split with xxx
from '1231xxx45646xxx45646xxx78' do you want the long-file equivalent of

>>> '1231xxx45646xxx45646xxx78'.split('xxx')

['1231', '45646', '45646', '78']

or (I chose this for below)
['1231', 'xxx', '45646', 'xxx', '45646', 'xxx', '78']

or maybe

['1231xxx', '45646xxx', '45646xxx', '78']

??
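
For reference, the three shapes can be produced in memory like this. It is a small sketch, assuming Python 3 and the sample string from above; the variable names are made up for illustration.

```python
import re

s = '1231xxx45646xxx45646xxx78'
sep = 'xxx'

# 1. Delimiter dropped.
dropped = s.split(sep)

# 2. Delimiter kept as its own item: re.split keeps the text matched
#    by a capturing group in the result list.
separate = re.split('(' + re.escape(sep) + ')', s)

# 3. Delimiter attached to the end of the piece before it.
attached = [p + sep for p in dropped[:-1]] + dropped[-1:]
```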

Anyway, I'd use a generator to iterate through the file and look for the delimiter.
This is case-sensitive, BTW (practically untested):

--< splitfile.py >----------------------------------------------
def splitfile(path, splitstr, chunksize=1024*64): # try a megabyte?
    splen = len(splitstr)
    chunks = iter(lambda f=open(path,'rb'):f.read(chunksize), '')
    buf = ''
    for chunk in chunks:
        buf += chunk
        start = end = 0
        while end>=0 and len(buf)>=splen:
            start, end = end, buf.find(splitstr, end)
            if end>=0:
                yield buf[start:end] #not including splitstr
                yield splitstr # == buf[end:end+splen] # splitstr
                end += splen
            else:
                buf = buf[start:]
                break

    yield buf

def test(*args):
    for chunk in splitfile(*args):
        print repr(chunk)

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    try:
        if len(args)==3: args[2]=int(args[2])
    except Exception:
        raise SystemExit, 'Usage: python splitfile.py path splitstr [chunksize=64k]'
    test(*args)
----------------------------------------------------------------

Extent of testing follows

>>> print '%s\n%s%s'%('-'*40, open('splitfile.txt','rb').read(),'-'*40)
----------------------------------------
01234abc5678abc901234
567ab890abc
----------------------------------------
>>> import ut.splitfile
>>> ut.splitfile.test('splitfile.txt', 'abc')
'01234'
'abc'
'5678'
'abc'
'901234\r\n567ab890'
'abc'
'\r\n'
>>> ut.splitfile.test('splitfile.txt', '012')
''
'012'
'34abc5678abc9'
'012'
'34\r\n567ab890abc\r\n'
>>> it = ut.splitfile.splitfile('splitfile.txt','ab89',4)
>>> it.next
<method-wrapper object at 0x02EF1C6C>
>>> it.next()
'01234abc5678abc901234\r\n567'
>>> it.next()
'ab89'
>>> it.next()
'0abc\r\n'
>>> it.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
StopIteration

(I put it in my ut package directory but you can put splitfile.py anywhere handy
and mod it to do what you need).
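
The generator above is written for the Python of 2004. A rough Python 3 port might look like the following. It is a sketch rather than a tested replacement, and it assumes the delimiter is passed as non-empty bytes because the file is opened in binary mode. The empty-delimiter check and the `with` block are additions that the original does not have.

```python
def splitfile(path, splitstr, chunksize=64 * 1024):
    """Yield data, delimiter, data, ... for a file split on splitstr (bytes)."""
    if not splitstr:
        raise ValueError('splitstr must not be empty')
    splen = len(splitstr)
    buf = b''
    with open(path, 'rb') as f:
        # iter(callable, sentinel) calls f.read until it returns b'' (EOF).
        for chunk in iter(lambda: f.read(chunksize), b''):
            buf += chunk
            start = end = 0
            while end >= 0 and len(buf) >= splen:
                start, end = end, buf.find(splitstr, end)
                if end >= 0:
                    yield buf[start:end]   # data before the delimiter
                    yield splitstr         # the delimiter itself
                    end += splen
                else:
                    buf = buf[start:]      # keep the unmatched remainder
                    break
    yield buf                              # whatever follows the last delimiter
```

As in the original, the buffer grows until a delimiter turns up, so a very long stretch without one is held in memory in full.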

Regards,
Bengt Richter
 
Denis S. Otkidach
      11-22-2004
On Mon, 22 Nov 2004 08:53:02 -0500
Steve Holden wrote:

> > I am trying to split a file by a fixed string.
> > The file is too large to just read it into a string and split this.
> > I could probably use a lexer but there maybe anything more simple?
> > thanks
> > m.

>
> Depends on your definition of "simple", I suppose. The problem with
> *not* using a lexer is that you'd have to examine the file in a sequence
> of overlapping chunks to make sure that a regex could pick up all


re module works fine with mmap-ed file, so no need to read it into memory.
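
A minimal sketch of this suggestion, assuming Python 3, a non-empty file (mmap refuses to map an empty one) and a bytes pattern. `match_spans` is a hypothetical helper name.

```python
import mmap
import re

def match_spans(path, pattern):
    """Return (start, end) offsets for every match of a bytes regex."""
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; pages are loaded on demand by the OS.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Build the list before the map is closed.
            return [m.span() for m in re.finditer(pattern, mm)]
```

For example, `match_spans('splitfile.txt', rb'abc')` would list where each `abc` starts and ends, without the program ever calling read() on the file.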

> matches. For me that would be more complex than using a lexer, given the
> excellent range of modules such as SPARK and PLY, to mention but two.


--
Denis S. Otkidach
http://www.python.ru/ [ru]
 
Martin Dieringer
      11-22-2004
"Denis S. Otkidach" writes:

> On Mon, 22 Nov 2004 08:53:02 -0500
> Steve Holden wrote:
>
>> > I am trying to split a file by a fixed string.
>> > The file is too large to just read it into a string and split this.
>> > I could probably use a lexer but there maybe anything more simple?
>> > thanks
>> > m.

>>
>> Depends on your definition of "simple", I suppose. The problem with
>> *not* using a lexer is that you'd have to examine the file in a sequence
>> of overlapping chunks to make sure that a regex could pick up all

>
> re module works fine with mmap-ed file, so no need to read it into memory.
>


Thank you, this is the solution!
Now I can use mmap.find to locate all occurrences and then read the
chunks via file.seek and file.read
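
That plan could look roughly like the sketch below. It assumes Python 3, a non-empty input file and a non-empty bytes delimiter. The function names and the numbered output file names are invented for illustration, and the delimiter itself is dropped from the output.

```python
import mmap
import os

def delimiter_offsets(path, delim):
    """Return the offset of every non-overlapping occurrence of delim."""
    if not delim:
        raise ValueError('delim must not be empty')
    offsets = []
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(delim)
            while pos != -1:
                offsets.append(pos)
                pos = mm.find(delim, pos + len(delim))
    return offsets

def split_to_files(path, delim, prefix, bufsize=1 << 20):
    """Copy each piece between delimiters to prefix.0000, prefix.0001, ..."""
    bounds, start = [], 0
    for off in delimiter_offsets(path, delim):
        bounds.append((start, off))
        start = off + len(delim)
    bounds.append((start, os.path.getsize(path)))
    names = []
    with open(path, 'rb') as f:
        for i, (lo, hi) in enumerate(bounds):
            name = '%s.%04d' % (prefix, i)
            f.seek(lo)
            remaining = hi - lo
            with open(name, 'wb') as out:
                while remaining > 0:   # copy in bounded blocks
                    block = f.read(min(bufsize, remaining))
                    if not block:
                        break
                    out.write(block)
                    remaining -= len(block)
            names.append(name)
    return names
```

Copying in blocks of at most `bufsize` bytes keeps memory use flat even when one piece is very large.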

m.
 
William Park
      11-22-2004
Martin Dieringer wrote:
> Jason Rennie writes:
>
> > On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
> >> I am trying to split a file by a fixed string.
> >> The file is too large to just read it into a string and split this.
> >> I could probably use a lexer but there maybe anything more simple?

> >
> > If the pattern is contained within a single line, do something like this:

>
> Hmm it's binary data, I can't tell how long lines would be. OTOH a
> line would certainly contain the pattern as it has no \n in it... and
> the lines probably wouldn't be too large for memory...


man strings (-o option)

--
William Park
Linux solution for data management and processing.
 