unexpected behaviour for python regexp: caret symbol almost useless?

 
 
conan
 
      05-28-2006
This regexp
'<widget class=".*" id=".*">'

works well with 'grep' for matching lines of the kind
<widget class="GtkWindow" id="window1">

in an XML .glade file

However, that's not true for the re module in Python, which treats the
regexp as if it were specified this way:
'^<widget class=".*" id=".*">'

For some reason, regexps in Python match from the start of the
line, whether or not you use the caret symbol '^'.

I had a hard time figuring out why this regexp wasn't working:
regexp = re.compile(r'<widget class=".*" id="(.*)">')

The solution was to account for leading whitespace:
regexp = re.compile(r'\s*<widget class=".*" id="(.*)">\s*')

To reproduce the behaviour, just take a .glade file and run this Python script:
<code>
import re

glade_file_name = 'some.glade'

bad_regexp = re.compile(r'<widget class=".*" id="(.*)">')
good_regexp = re.compile(r'\s*<widget class=".*" id="(.*)">\s*')

for line in open(glade_file_name):
    if bad_regexp.match(line):
        print 'bad:', line.strip()
    if good_regexp.match(line):
        print 'good:', line.strip()
</code>

The thing is, I would have expected to have to put the caret in explicitly to
tell the regexp to match at the start of the line, something like:
r'^<widget class=".*" id="(.*)">'
However, Python's regexp is taking care of that for me. This is not the
behaviour I would expect from what I know about regexps, but maybe I'm missing
something.

 
Peter Otten
 
      05-28-2006
conan wrote:

> The thing is, I would have expected to have to put the caret in explicitly to
> tell the regexp to match at the start of the line, something like:
> r'^<widget class=".*" id="(.*)">'
> However, Python's regexp is taking care of that for me. This is not the
> behaviour I would expect from what I know about regexps, but maybe I'm missing
> something.


You want search(), not match().

http://docs.python.org/lib/matching-searching.html
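
A minimal sketch of the difference, assuming an indented line like the ones found in a .glade file (the sample line here is made up for illustration). match() only succeeds if the pattern matches at the very beginning of the string, while search() scans the whole string, which is how grep treats each line:

```python
import re

# Hypothetical sample line; nested widgets in a .glade file are indented like this.
line = '    <widget class="GtkWindow" id="window1">\n'

pattern = re.compile(r'<widget class=".*" id="(.*)">')

# match() is anchored at position 0, so the leading spaces make it fail.
anchored = pattern.match(line)

# search() tries every starting position, as grep does on each line.
found = pattern.search(line)

print(anchored)        # None
print(found.group(1))  # window1
```

This is also why adding \s* to the front of the pattern appeared to fix things: it lets match() consume the indentation before reaching the tag.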

Peter
 
Paul McGuire
 
      05-28-2006
conan wrote:

> This regexp
> '<widget class=".*" id=".*">'
>
> works well with 'grep' for matching lines of the kind
> <widget class="GtkWindow" id="window1">
>
> in an XML .glade file
>


As Peter Otten has already mentioned, this is the difference between the re
"match" and "search" methods.

As purely a lateral exercise, here is a pyparsing rendition of your program:

------------------------------------
from pyparsing import makeXMLTags, line

# define pyparsing patterns for begin and end XML tags
widgetStart,widgetEnd = makeXMLTags("widget")

# read the file contents
glade_file_name = 'some.glade'
gladeContents = open(glade_file_name).read()

# scan the input string for matching tags
for widget,start,end in widgetStart.scanString(gladeContents):
    print "good:", line(start, gladeContents).strip()
    print widget["class"], widget["id"]
    print "Class: %(class)s; Id: %(id)s" % widget
------------------------------------
Not quite an exact match, only the good lines get listed. But also check
out some of the other capabilities. To do this with re's, you have to
clutter up the re expression with field names, as in:

(r'<widget class=(?P<class>".*") id="(?P<id>.*)">')

The parsing patterns generated by makeXMLTags give dict-like and
attribute-like access to any attributes included with the tag. If not for
the unfortunate attribute name "class" (which is a Python keyword), you
could also reference these values as widget.class and widget.id.
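
For comparison, here is a rough sketch of the named-group approach with the re module. The sample line is invented, and the pattern uses [^"]* instead of .* so that each value stays inside its own pair of quotes; that is a choice made for this sketch, not something taken from the posts above:

```python
import re

# Named groups; [^"]* stops at the closing quote of each attribute.
widget_re = re.compile(r'<widget class="(?P<class>[^"]*)" id="(?P<id>[^"]*)">')

# Hypothetical sample line, indented as in a .glade file.
line = '  <widget class="GtkButton" id="ok_button">'

# search() rather than match(), so the indentation does not matter.
m = widget_re.search(line)
fields = m.groupdict() if m else {}

if fields:
    print("Class: %(class)s; Id: %(id)s" % fields)
```

The dictionary returned by groupdict() plays the same role as the dict-like access on the pyparsing result, though only for the attributes spelled out in the pattern.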

If you are parsing HTML, there is also a makeHTMLTags method, which creates
patterns that are less rigid about upper/lower case and other XML
strictnesses.

-- Paul


 
 
conan
 
      05-29-2006
Thank you. I had read this, but somehow I missed it when the issue
arose.

 
 
conan
 
      05-29-2006
Thank you Paul.

Since the only thing I'm doing is extracting these fields, and I have no
plans to include other stuff, a regexp is fine. However, I will keep
'pyparsing' in mind for when I need to do more complex parsing.

As you can see in the example I sent, I was trying to get info from a
glade file. In particular, I was tired of doing this every time I need to
access a widget:

some_var = xml.get_widget('some_id')

(doing this is tiresome when you have more than 10 widgets)

So I wrote a little module that has all widgets instantiated as attributes
of the object. For anyone interested, it is at:

http://www.lugmen.org.ar/~p10n/sourc.../GetWidgets.py

However, it is still pretty immature, since it lacks some checks.
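
The general idea can be sketched roughly as follows. This is not the contents of GetWidgets.py; the class name WidgetNamespace, the sample text, and the stand-in lookup function are all hypothetical, chosen so the sketch runs without GTK installed:

```python
import re

# Captures the id of every widget tag; [^"]* stays inside each pair of quotes.
WIDGET_ID_RE = re.compile(r'<widget class="[^"]*" id="([^"]*)">')


class WidgetNamespace(object):
    """Expose every widget id found in glade_text as an attribute."""

    def __init__(self, glade_text, get_widget):
        # get_widget is any callable that maps an id to a widget object,
        # such as the get_widget method used in the post above.
        for widget_id in WIDGET_ID_RE.findall(glade_text):
            setattr(self, widget_id, get_widget(widget_id))


# Made-up glade fragment and a fake lookup that just returns a string.
sample = '''
<widget class="GtkWindow" id="window1">
  <widget class="GtkButton" id="ok_button">
'''
widgets = WidgetNamespace(sample, lambda name: 'widget:' + name)

print(widgets.window1)    # widget:window1
print(widgets.ok_button)  # widget:ok_button
```

A real version would probably want the kind of checks mentioned above, for example for ids that are not valid attribute names.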

 