Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > speed of string chunks file parsing

Reply
Thread Tools

speed of string chunks file parsing

 
 
Hyunchul Kim
Guest
Posts: n/a
 
      04-06-2009
Hi, all

I have a simple script.
Can you improve algorithm of following 10 line script, with a view point
of speed ?
Following script do exactly what I want but I want to improve the speed.

This parse a file and accumulate lines till a line match a given regular
expression.
Then, when a line match a given regular expression, this function yield
lines before the matched lines.

****************
import re
resultlist = []
cp_regularexpression = re.compile('^a complex regular expression here$)
for line in file(inputfile):
if cp_regularexpression.match(line):
if resultlist != []:
yield resultlist
resultlist = []
resultlist.append(line)
yield resultlist
****************

Thank you in advance,

Hyunchul


 
Reply With Quote
 
 
 
 
bearophileHUGS@lycos.com
Guest
Posts: n/a
 
      04-06-2009
Hyunchul Kim:

> Following script do exactly what I want but I want to improve the speed.


This may be a bit faster, especially if sequences are long (code
untested):

import re
from collections import deque

def scanner1(deque=deque):
result_seq = deque()
cp_regular_expression = re.compile("^a complex regular expression
here$")
for line in file(inputfile):
if cp_regular_expression.match(line) and result_seq:
yield result_list
result_seq = deque()
result_seq.append(line)
yield result_seq

If the sequences are processed on the fly then you don't need to
create new ones and you can clear old ones, this may be a bit faster:

def scanner2(deque=deque):
result_seq = deque()
cp_regular_expression = re.compile("^a complex regular expression
here$")
for line in file(inputfile):
if cp_regular_expression.match(line) and result_seq:
yield result_list
result_seq.clear()
result_seq.append(line)
yield result_seq

Note that most of the time may be used by the regular expression,
often there are ways to speed it up using string methods, even as a
first faster approximate match too.

Bye,
bearophile
 
Reply With Quote
 
 
 
 
John Machin
Guest
Posts: n/a
 
      04-06-2009
On Apr 6, 11:48*pm, Hyunchul Kim <(E-Mail Removed)> wrote:
> Hi, all
>
> I have a simple script.
> Can you improve algorithm of following 10 line script, with a view point
> of speed ?
> Following script do exactly what I want but I want to improve the speed.


So the first thing to do is to try to work out where it is spending
the time.

Step 1: read your code:

uh-oh: re.compile('^a complex regular expression here$)

By the way, you are using match() so the "^" is redundant.

Step 2: Run some timing tests:

(a) script as written
(b) script modified to do nothing else but count the lines in the
file. Is the time difference worth any further work? If so:
(c) script modified to ignore the regex and pretend the match fails on
each line
(d) script modified to ignore the regex and pretend the match succeeds
on each line
(e) script modified to ignore the appending and yielding; just count
how many matches

So where is most of the time(a) - time(b) difference being taken:
matching or appending/yielding?

>
> This parse a file and accumulate lines till a line match a given regular
> expression.
> Then, when a line match a given regular expression, this function yield
> lines before the matched lines.


I think your code documents the requirement slightly better. Before
you start worrying about speed, it's a good idea to ensure that the
requirements are correct. It would be useful to include explicit
statements to this effect:

Given N matching lines, there will be (N+1) yields. The last yield
will contain the lines (if any) after the last matching line (if any).
No other yield will be empty. Consequently: an empty file will yield
[], and a file with no matching lines will yield [all the lines in the
file].

Presuming of course that is what you want; consider this:
first yield: [line1, line2]
second and final yield: [line3, line4]
The caller has no way of knowing (without using the regex!) if the
last line in the file matches the regex or not. Does this matter?

>
> ****************
> import re
> resultlist = []
> cp_regularexpression = re.compile('^a complex regular expression here$)
> for line in file(inputfile):
> * * * * if cp_regularexpression.match(line):
> * * * * * * * * if resultlist != []:


if resultlist: # slightly faster and much less ugly

> * * * * * * * * * * * * yield resultlist
> * * * * * * * * * * * * resultlist = []
> * * * * resultlist.append(line)
> yield resultlist


If it turns out that you don't want lines after the last matching
line, you need to remove the above yield statement.

HTH,
John
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
reading file objects in chunks Martin Marcher Python 1 11-12-2007 05:07 PM
split string into chunks mdefoor C Programming 1 12-08-2005 07:01 AM
modifying small chunks from long string MackS Python 6 11-15-2005 03:02 AM
How to Upload Large file in Chunks of Bytes ? Nawaz Ijaz ASP .Net 1 06-08-2005 08:23 AM
speed speed speed a.metselaar Computer Support 14 12-30-2003 03:34 AM



Advertisments