Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Bottleneck? More efficient regular expression?

Reply
Thread Tools

Bottleneck? More efficient regular expression?

 
 
Andrew Dalke
Guest
Posts: n/a
 
      09-24-2003
Me:
> I get confused with the DOM API every time I try


/F:
> the DOM API is designed for consultants, not for humans or
> computers.


Here I thought I was making a living the last few years as
a consultant. Now I find I was making a living being a human.
I'm feeling much better now.

Andrew
http://www.velocityreviews.com/forums/(E-Mail Removed)


 
Reply With Quote
 
 
 
 
Andrew Dalke
Guest
Posts: n/a
 
      09-24-2003
Tina Li:
> Yes, I've shortened the XML snippet considerably as
> the data was obscurely long ...


But since there wasn't a problem with the code you posted
it's hard to figure out what's wrong. (And since it wasn't
in XML it made it even harder for people to help.)

If it's too long (and a threaded alignment which includes the
full PDB structure is too long), put it on a web site somewhere
and give a URL to it.

> Andrew: Thanks for the XML code. I've written up something
> similar using xml.parsers.expat, but it's conceivably slower
> than regexp.


Conceivable, yes. But 1) did you test it, and 2) would it make a
difference?

> And thanks for suggesting biopython.org. But I need to have it up
> quick so maybe cobbling something
> together myself is faster than reading through their documentation.


Not only that, but I don't know of anything in biopython for
handling that code. It's a generic XML parsing question, so a
generic tool is the best, like Fredrick's ElementTree.

Here's a question for you. When is it easier to read the
documentation for existing code (which has been tested)
then it is to write and debug your own code?



You can write the regex in a way that's easier to read

pattern = re.compile (
r'<pdbcode>(?P<pdbcode>.*?)</pdbcode>.*?'
r'<targetSeq name="(?P<targetName>.*?)">.*?'
r'<target>(?P<target>.*?)</target>.*?'
r'<align>(?P<align>.*?)</align>.*?'
r'<template>(?P<template> .*?)</template>.*?'
r'<pdbBackbone>(?P<pdbBackbone>.*?)</pdbBackbone>.*?'
r'<queryGaps>(?P<queryGaps>.*?)</queryGaps>.*?'
r'<templateGaps>(?P<templateGaps>.*?)</templateGaps>',
re.S)

You can also make the pattern a bit less ambiguous,
eg, use [^>]* instead of .*? when you are inside an element,
which turns

r'<pdbcode>(?P<pdbcode>.*?)</pdbcode>.*?'

into

r'<pdbcode>(?P<pdbcode>[^<]*)</pdbcode>.*?'

(and use [^"]* instead of .*? for getting the text of an
attribute.)

You can get rid of the other ambiguity (skipping characters
until the start of the next tag) by using something like

([^<]|<(?!queryGaps))*<queryGaps

instead of

.*?<queryGaps

And you can get rid of all ambiguities by having your
pattern match the text completely, that is, capturing all
fields even if ignore a few. In that way you don't have to
write code which skips over tags. That's easily done
with code which generated your pattern, as in

def match_tag(tagname, groupname):
return "<%s>(?P<%s>[^<]*)</%s>" % (tagname, groupname)

pattern = "[^<]*".join(match_tag("pdbcode", "pdbcode"), ...)

You'll need to extend it to support matching fields with
attributes.

In any case, what you have won't support XML like

<pdbcode >2PLV</pdbcode>

(I put extra spaces in the open tag.)

Instead, you are writing parsing code for the specific XML
subset your threading code produces, which may change in
the future. That's why you should use an XML parser.

And if you want to support only this specific format, it's
still easier to write a traditional line-oriented parser.

def readcheck(infile, start):
line = infile.readline()
assert line.startswith(start)
return line

def simpletag(line, convert = str):
i = line.find(">")
j = line.find("<", i)
return convert(line[i:j])

infile = open("threading.xml")

line = readcheck(infile, "<threading")
_, name, _, source, _, template, _ = line.split('"')

line = readcheck(infile, "<pdbcode")
pdbcode = simpletag(line)
pdbchain = simpletag(readcheck(infile, "<pdbchain"))
templateName = simpletag(readcheck(infile, "<templateName"))

# don't save the settings
while 1:
line = infile.readline()
assert line
if line.startswith("</settings>"):
break

....

This may be clumsier or more tedious, but it's easy to understand
and the ways it fails are much easier to diagnose than regexps.

That said, I like parsing with regexps. But this isn't the right
place to use them.

Andrew
(E-Mail Removed)


 
Reply With Quote
 
 
 
 
Tina Li
Guest
Posts: n/a
 
      09-25-2003
Hi Andrew,

| > Thanks for the XML code. I've written up something
| > similar using xml.parsers.expat, but it's conceivably slower
| > than regexp.
|
| Conceivable, yes. But 1) did you test it, and 2) would it make a
| difference?

I tested it in terms of correctness. I didn't do any serious performance bench-marking as it's not the most important.
The lag is *perceivable* (this is what I meant; sorry) by a human user so it's slower.

| Here's a question for you. When is it easier to read the
| documentation for existing code (which has been tested)
| then it is to write and debug your own code?

Is it a test question? =) I think it depends on the nature of the task. If it's intrinsically complex and error-prone,
and that I have little exposure to it, I wouldn't risk re-inventing a wheel that doesn't turn. As well, what would the
opportunity cost be between the time it takes to understand the documentation and that to code and test? Are the docs
long, drab and obscure? Sometimes existing code isn't easy to customize for a particular need, then we'd have to redo
it.

| You can also make the pattern a bit less ambiguous,
| eg, use [^>]* instead of .*? when you are inside an element,
| which turns
|
| r'<pdbcode>(?P<pdbcode>.*?)</pdbcode>.*?'
|
| into
|
| r'<pdbcode>(?P<pdbcode>[^<]*)</pdbcode>.*?'
|
| (and use [^"]* instead of .*? for getting the text of an
| attribute.)
|
| You can get rid of the other ambiguity (skipping characters
| until the start of the next tag) by using something like
|
| ([^<]|<(?!queryGaps))*<queryGaps
|
| instead of
|
| .*?<queryGaps

I in fact tried that before but the over-limit error still happened. So it's not just the non-greedy .*? that's causing
the problem. Hmm.

But again, thanks for the tips. I'm sure they'll come in handy someday. I'm doing this for a quick "hack" rather than a
generic and robust parser module. It only handles tags without space because all tags are guaranteed to be generated
without space.

Thank!

Tina



-----= Posted via Newsfeeds.Com, Uncensored Usenet News =-----
http://www.newsfeeds.com - The #1 Newsgroup Service in the World!
-----== Over 100,000 Newsgroups - 19 Different Servers! =-----
 
Reply With Quote
 
Andrew Dalke
Guest
Posts: n/a
 
      09-26-2003
Tina Li:
> The lag is *perceivable* (this is what I meant; sorry) by a human user so

it's slower.

Yup, that's what I meant. Too many people make theoretical
arguments for why to choose one (complicated) approach
over a simpler one on the basis of performance, when it turns
out performance isn't the issue. My appreciation goes out to you
for doing it the right way.

You may also want to look at pyRXP from ReportLab.
However, there seems to be some drastic problems on their
site -- links on reportlab.com fail and reportlab.org goes
to pair.com's site placeholder page.

It's a very fast XML parser for Python.

> I in fact tried that before but the over-limit error still happened. So

it's
> not just the non-greedy .*? that's causing the problem. Hmm.


No, I don't think it is. The stack space increases by one for
each ambiguity and the .*? should only produce one ambiguity.
Usually there's a stack problem only if you have an ambiguity
or empty match inside a repeat, and I didn't see that in your
pattern.

If you get really interested in tracking this down, you might look
around for some of the GUI regexp debugging tools. There's
one in ActiveState's product, as I recall. Err, but it's based on
Perl's regexp parser and won't handle (?P<>)

(I do have an experimental pure-Python regexp engine that
I would offer for debugging, but it doesn't handle .*? yet and
needs a rewrite before it does.)

> It only handles tags without space because all tags are
> guaranteed to be generated without space.


Sure. All I was saying was that if you're going to code for
a specific layout then you don't need to be as general.

You might even consider using "[^\n]*\n{5}" if you just
want to skip 5 lines.

Andrew
(E-Mail Removed)
P.S.
If you are doing anything open-sourceish, or using
open source in bioinformatics, structural biology, and
related fields, and will be at ISMB in Edinborough next
year, you might consider attending the Bioinformatics
Open Source Conference.


 
Reply With Quote
 
Robin Becker
Guest
Posts: n/a
 
      09-26-2003
In article <%nNcb.3635$(E-Mail Removed) t>, Andrew
Dalke <(E-Mail Removed)> writes
>Tina Li:
>> The lag is *perceivable* (this is what I meant; sorry) by a human user so

>it's slower.
>
>Yup, that's what I meant. Too many people make theoretical
>arguments for why to choose one (complicated) approach
>over a simpler one on the basis of performance, when it turns
>out performance isn't the issue. My appreciation goes out to you
>for doing it the right way.
>
>You may also want to look at pyRXP from ReportLab.
>However, there seems to be some drastic problems on their
>site -- links on reportlab.com fail and reportlab.org goes
>to pair.com's site placeholder page.



yup we're reassembling everything again .... sigh

New more dynamic confusion ......

>
>It's a very fast XML parser for Python.
>
>> I in fact tried that before but the over-limit error still happened. So

>it's
>> not just the non-greedy .*? that's causing the problem. Hmm.

>
>No, I don't think it is. The stack space increases by one for
>each ambiguity and the .*? should only produce one ambiguity.
>Usually there's a stack problem only if you have an ambiguity
>or empty match inside a repeat, and I didn't see that in your
>pattern.
>
>If you get really interested in tracking this down, you might look
>around for some of the GUI regexp debugging tools. There's
>one in ActiveState's product, as I recall. Err, but it's based on
>Perl's regexp parser and won't handle (?P<>)
>
>(I do have an experimental pure-Python regexp engine that
>I would offer for debugging, but it doesn't handle .*? yet and
>needs a rewrite before it does.)
>
>> It only handles tags without space because all tags are
>> guaranteed to be generated without space.

>
>Sure. All I was saying was that if you're going to code for
>a specific layout then you don't need to be as general.
>
>You might even consider using "[^\n]*\n{5}" if you just
>want to skip 5 lines.
>
> Andrew
> (E-Mail Removed)
>P.S.
> If you are doing anything open-sourceish, or using
>open source in bioinformatics, structural biology, and
>related fields, and will be at ISMB in Edinborough next
>year, you might consider attending the Bioinformatics
>Open Source Conference.
>
>


--
Robin Becker
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Help: Efficient regular expression Divya Badrinath Ruby 24 07-11-2007 09:11 AM
Can I make this Stored Proc more efficient? Roy ASP .Net 6 02-16-2005 03:39 AM
Is this efficient way to validate phrase against collection of regular expressions Ossie Java 2 02-14-2004 01:49 PM
Re: More efficient RFC 1867 uploads? Richard K Bethell ASP .Net 0 07-14-2003 05:32 PM
Re: More efficient RFC 1867 uploads? John Timney \(Microsoft MVP\) ASP .Net 1 07-11-2003 08:16 PM



Advertisments