Velocity Reviews - Computer Hardware Reviews

any idea how to optimize this regex?

 
 
drejcicaREMOVE@volja.net
      12-04-2003
Hello. I've discovered that this regex is a bottleneck:

/(?:<!\-.*?>.*?){5}/sig

It tries to locate as many html comments in chunks of five which can
make for quite some possibilities in longer files. Is there a way to
optimize this or do you consider it to be simply poor practice?

Thanks,

andrej

--
echo ${girl_name} > /etc/dumpdates
 
Tad McClellan
      12-04-2003
(E-Mail Removed) wrote:
> Hello. I've discovered that this regex is a bottleneck:
>
> /(?:<!\-.*?>.*?){5}/sig
                       ^
                       ^

That "i" doesn't do anything. So why is it there?


> It tries to locate as many html comments



What will it do when it comes across a comment like this:

<!-- if A > B then -->

??
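To make the question concrete, here is a minimal Perl sketch that runs the comment-matching part of the original pattern against that input, next to a version that insists on the full "-->" terminator. The variable names are illustrative and do not come from the thread.

```perl
use strict;
use warnings;

my $html = 'x <!-- if A > B then --> y';

# The comment-matching piece of the original regex:
# "<!-", a lazy run of anything, then the first ">".
my ($naive) = $html =~ /(<!\-.*?>)/s;

# Same idea, but requiring the full "-->" terminator.
my ($fixed) = $html =~ /(<!--.*?-->)/s;

print "naive: $naive\n";   # naive: <!-- if A >
print "fixed: $fixed\n";   # fixed: <!-- if A > B then -->
```

The first pattern stops at the ">" inside the comment text, so the rest of the comment is treated as ordinary content.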


> in chunks of five which can
> make for quite some possibilities in longer files. Is there a way to
> optimize this or do you consider it to be simply poor practice?



Attempting to use regexes to parse HTML is the poor practice.

Use a module that understands HTML data for processing HTML data.


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
Malcolm Dew-Jones
      12-05-2003
(E-Mail Removed) wrote:
: Hello. I've discovered that this regex is a bottleneck:

: /(?:<!\-.*?>.*?){5}/sig

: It tries to locate as many html comments in chunks of five which can
: make for quite some possibilities in longer files. Is there a way to
: optimize this or do you consider it to be simply poor practice?


First, there are HTML parsers that may help do whatever you want to do,
but ignoring that for the moment...


First off, a comment does not end with >, it ends with --> (and starts
with <!-- so why not test for that correctly also)?

<!--.*?-->

If you know the comments can't have > in them, then a character class
would be quicker than .*?

<!--[^>]*>

Next, I wonder why you would need to find comments in blocks of 5?

Even if you really wish to look for blocks of 5 comments at a time, the /g
says to do this globally, so it looks through the entire file for all
possible combinations of 5 blocks (I didn't say that correctly) and I
suspect that is the biggest bottleneck.

I suspect you don't really want /g at all.
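As a rough illustration of what /g changes, the sketch below builds a string of twelve adjacent comments and matches chunks of five with and without /g. The test data and variable names are made up for the example; the pattern uses the corrected <!-- and --> delimiters.

```perl
use strict;
use warnings;

# Twelve adjacent comments: <!--1--><!--2-->...<!--12-->
my $html = join '', map { '<!--' . $_ . '-->' } 1 .. 12;

my $chunk = qr/(?:<!--.*?-->.*?){5}/s;

# List context with /g: every non-overlapping chunk of five is returned,
# so the whole input is scanned.
my @all = $html =~ /$chunk/g;

# Without /g the engine stops after the first chunk.
my ($first) = $html =~ /($chunk)/;

printf "chunks with /g: %d\n", scalar @all;   # chunks with /g: 2
print  "first chunk: $first\n";
```

If only the first block of five is needed, dropping /g avoids scanning the rest of the file.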

Also, the .*? is a potential bug, because it does not _prevent_ the re
from matching two (or more) comments at the place you intend to match a
single comment, it simply says "match no more than is necessary to get a
match", so the regex engine could be trying combinations of multiple
comments in an attempt to get a {5} /g match to work.

I'm not sure if the above _is_ a bug, but I can't say it isn't. The
character class I mentioned is not prone to this issue as it simply can't
match past the > , but that assumes (as I mentioned) that the comments
never use > .
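A small sketch of that difference, assuming (as above) that comments never contain ">". The trailing "END" is only there to force the engine to keep looking; the input and variable names are invented for the example.

```perl
use strict;
use warnings;

my $html = '<!-- a --> text <!-- b -->END';

# Lazy dot: the match starts at the first comment and, because "END" does
# not follow the first "-->", keeps growing through the second comment.
my ($lazy) = $html =~ /(<!--.*?-->END)/s;

# Character class: cannot step over a ">", so the attempt at the first
# comment fails and only the second comment can match.
my ($class) = $html =~ /(<!--[^>]*>END)/;

print "lazy:  $lazy\n";    # lazy:  <!-- a --> text <!-- b -->END
print "class: $class\n";   # class: <!-- b -->END
```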

Finally, /i is to ignore case, but the pattern contains no letters, so why
specify it? (Though I doubt that makes a difference here.)


 
Ben Morrow
      12-05-2003

(E-Mail Removed) (Malcolm Dew-Jones) wrote:
> First off, a comment does not end with >, it ends with --> (and starts
> with <!-- so why not test for that correctly also)?
>
> <!--.*?-->
>
> If you know the comments can't have > in them, then a character class
> would be quicker than .*?
>
> <!--[^>]*>
>

<snip>
>
> Also, the .*? is a potential bug, because it does not _prevent_ the re
> from matching two (or more) comments at the place you intend to match a
> single comment, it simply says "match no more than is necessary to get a
> match", so the regex engine could be trying combinations of multiple
> comments in an attempt to get a {5} /g match to work.


I'm somewhat thinking aloud here, but would

/ (?: <!-- (?: [^-] (?!->) )* --> ){5} /x

perform the correct match here? The generalisation of [^>], i.e. 'match
anything up to this multi-character string', is something one quite
often wants.
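One way to explore that question is to try the proposed inner pattern on a comment that contains a single hyphen, and compare it with a common alternative that puts the negative lookahead before the dot. This is only a sketch: both patterns are reduced to a single comment, and the variable names and sample inputs are assumptions for the example.

```perl
use strict;
use warnings;

# The inner pattern from the post, reduced to one comment.
my $proposed = qr/<!--(?:[^-](?!->))*-->/s;

# Alternative: at each position, refuse to step onto the start of "-->",
# otherwise consume any one character.
my $tempered = qr/<!--(?:(?!-->).)*-->/s;

for my $html ('<!-- plain -->', '<!-- well-known -->') {
    printf "%-22s proposed=%s tempered=%s\n",
        $html,
        ($html =~ $proposed ? 'yes' : 'no'),
        ($html =~ $tempered ? 'yes' : 'no');
}
```

The proposed form rejects the second input because [^-] can never consume the hyphen in "well-known"; the lookahead-first form accepts both.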

Ben

--
Every twenty-four hours about 34k children die from the effects of poverty.
Meanwhile, the latest estimate is that 2800 people died on 9/11, so it's like
that image, that ghastly, grey-billowing, double-barrelled fall, repeated
twelve times every day. Full of children. [Iain Banks] (E-Mail Removed)
 
Matt Garrish
      12-05-2003

"Malcolm Dew-Jones" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
>
> First off, a comment does not end with >, it ends with --> (and starts
> with <!-- so why not test for that correctly also)?
>
> <!--.*?-->
>


Html comments allow whitespace between the -- and > when you close a
comment, so you'd have to write that as:

<!--.*?--\s*>
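A quick sketch of the difference, using two made-up inputs and anchoring both patterns to the whole string so the result is easy to read:

```perl
use strict;
use warnings;

my $strict  = qr/\A<!--.*?-->\z/s;      # requires "-->" exactly
my $relaxed = qr/\A<!--.*?--\s*>\z/s;   # allows whitespace before ">"

for my $html ('<!-- a -->', "<!-- b --\n>") {
    (my $shown = $html) =~ s/\n/\\n/g;  # make the newline visible when printing
    printf "%-14s strict=%d relaxed=%d\n",
        $shown,
        ($html =~ $strict  ? 1 : 0),
        ($html =~ $relaxed ? 1 : 0);
}
```

Only the relaxed pattern accepts the comment whose closing ">" sits on the next line.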

Matt


 
Ben Morrow
      12-05-2003

"Matt Garrish" <(E-Mail Removed)> wrote:
>
> "Malcolm Dew-Jones" <(E-Mail Removed)> wrote in message
> news:(E-Mail Removed)...
> >
> > First off, a comment does not end with >, it ends with --> (and starts
> > with <!-- so why not test for that correctly also)?
> >
> > <!--.*?-->
> >

>
> Html comments allow whitespace between the -- and > when you close a
> comment, so you'd have to write that as:
>
> <!--.*?--\s*>


HTML (SGML) comments also allow whitespace after the '!', and anything
matching /--\s*--/ to appear within the body of the comment. What
browsers will accept is another matter...

Ben

--
"If a book is worth reading when you are six, * (E-Mail Removed)
it is worth reading when you are sixty." - C.S.Lewis
 
James Willmore
      12-05-2003
On Thu, 04 Dec 2003 22:53:02 GMT
(E-Mail Removed) wrote:

> Hello. I've discovered that this regex is a bottleneck:
>
> /(?:<!\-.*?>.*?){5}/sig
>
> It tries to locate as many html comments in chunks of five which can
> make for quite some possibilities in longer files. Is there a way to
> optimize this or do you consider it to be simply poor practice?


Poor practice

Use one of the *many* HTML parsing modules that are available.
http://search.cpan.org/

HTH

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.

a fortune quote ...
Never hit a man with glasses. Hit him with a baseball bat.

 
Malcolm Dew-Jones
      12-05-2003
Matt Garrish ((E-Mail Removed)) wrote:

: "Malcolm Dew-Jones" <(E-Mail Removed)> wrote in message
: news:(E-Mail Removed)...
: >
: > First off, a comment does not end with >, it ends with --> (and starts
: > with <!-- so why not test for that correctly also)?
: >
: > <!--.*?-->
: >

: Html comments allow whitespace between the -- and > when you close a
: comment, so you'd have to write that as:

: <!--.*?--\s*>

Ah yes, and that is exactly why one should use the HTML parsing modules if
at all possible.

(I was looking at my XML book. XML comments have a more rigid comment
format, if I understand it correctly.)
 
Matt Garrish
      12-05-2003

"Ben Morrow" <(E-Mail Removed)> wrote in message
news:bqoq0c$gms$(E-Mail Removed)...
>
> "Matt Garrish" <(E-Mail Removed)> wrote:
> >
> > "Malcolm Dew-Jones" <(E-Mail Removed)> wrote in message
> > news:(E-Mail Removed)...
> > >
> > > First off, a comment does not end with >, it ends with --> (and starts
> > > with <!-- so why not test for that correctly also)?
> > >
> > > <!--.*?-->
> > >

> >
> > Html comments allow whitespace between the -- and > when you close a
> > comment, so you'd have to write that as:
> >
> > <!--.*?--\s*>

>
> HTML (SGML) comments also allow whitespace after the '!', and anything
> matching /--\s*--/ to appear within the body of the comment. What
> browsers will accept is another matter...
>


I thought no whitespace at the start of a comment was one of the few things
that html did enforce? It obviously would never fly in sgml if that was the
only way to comment out text (or would make for interesting dtds). Then
again, what standards do any browsers adhere to? : )

Matt


 
Tad McClellan
      12-05-2003
Matt Garrish <(E-Mail Removed)> wrote:
> "Ben Morrow" <(E-Mail Removed)> wrote in message
> news:bqoq0c$gms$(E-Mail Removed)...
>> "Matt Garrish" <(E-Mail Removed)> wrote:



>> > Html comments allow whitespace between the -- and > when you close a
>> > comment, so you'd have to write that as:
>> >
>> > <!--.*?--\s*>

>>
>> HTML (SGML) comments also allow whitespace after the '!', and anything



I believe that you are mistaken with that part.


>> matching /--\s*--/ to appear within the body of the comment. What
>> browsers will accept is another matter...



But that part is true enough.


> I thought no whitespace at the start of a comment was one of the few things
> that html did enforce?



You thought correctly. The grammar[1], reformatted, is:


comment declaration =
    MDO,
    ( comment,
      ( s |
        comment
      )*
    )?
    MDC

comment =
    COM
    SGML character*
    COM

Where:

    MDO  (<!)  Markup Declaration Open
    MDC  (>)   Markup Declaration Close
    COM  (--)  Comment Delimiter
    s          Separator ( roughly /\s/ )


So, if you have any "comment"s in the "comment declaration",
then there must be no spaces before that first one.

Note also that <!> is a "comment declaration" as well.
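For illustration only, the grammar above can be transcribed fairly directly into a pair of Perl regexes. This is a sketch under two assumptions that are not spelled out in the grammar: that a comment body cannot itself contain "--" (the next COM closes it), and that "s" can be approximated by \s. The names $comment, $declaration and is_comment_declaration are invented for the example, and it covers this one rule, not the rest of SGML syntax.

```perl
use strict;
use warnings;

# comment = COM, SGML character*, COM
# (assumption: the body may not contain "--")
my $comment = qr/--(?:(?!--).)*--/s;

# comment declaration = MDO, ( comment, ( s | comment )* )?, MDC
my $declaration = qr/<!(?:$comment(?:\s|$comment)*)?>/;

# True if the whole string is exactly one comment declaration.
sub is_comment_declaration {
    my ($text) = @_;
    return $text =~ /\A$declaration\z/ ? 1 : 0;
}

for my $text ('<!-- a -->', '<!-- a -- >', '<!-- a -- -- b -->',
              '<!>', '<! -- a -->') {
    printf "%-20s %s\n", $text,
        is_comment_declaration($text) ? 'valid' : 'invalid';
}
```

The first four inputs are accepted and the last one, with whitespace after "<!", is rejected, which matches the reading of the grammar given above.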


This is but one of the "strange corners" of SGML syntax. There
are several dozen of these. Your choices are:

1. Research the bazillion syntax oddities and code for
*all of them* in your program.

or

2. Use a module.




[1] "The SGML Handbook" Charles Goldfarb, p391

--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 