
# matching exactly a 4 digit number in python

harijay

 11-21-2008
Hi
I am a few months into Python. I have used regexps before in Perl
and Java, but I am a little confused by this problem.

I want to parse a number of strings and extract only those that
contain a 4 digit number anywhere inside the string.

However the regexp
p = re.compile(r'\d{4}')

matches even sentences that contain numbers longer than 4 digits,
for example it matches "I have 3324234 and more".

I am very confused. Shouldn't \d{4,} match exactly four digit
numbers, so that a sentence with a 5 digit number is not matched?

Here is my test program's output, with the test program itself given below.
Harijay

PyMate r8111 running Python 2.5.1 (/usr/bin/python)
>>> testdigit.py

Matched I have 2004 rupees
Matched I have 3324234 and more
Matched As 3233
Matched 2323423414 is good
Matched 4444 dc sav 2412441 asdf
SKIPPED random1341also and also
SKIPPED
SKIPPED 13
Matched a 1331 saves
SKIPPED A has 13123123
SKIPPED A 13123
Matched 1312 times I have told you
DONE

#!/usr/bin/python
import re

# Test strings (Python 2 syntax, as in the original run under Python 2.5.1)
x = [" I have 2004 rupees ", " I have 3324234 and more", " As 3233 ",
     "2323423414 is good", "4444 dc sav 2412441 asdf ",
     "random1341also and also", "", "13", " a 1331 saves",
     " and and as dad", " A has 13123123", " A 13123", "123 adn",
     "1312 times I have told you"]

# Note: this pattern has a trailing space, unlike the r'\d{4}' quoted above
p = re.compile(r'\d{4} ')

for elem in x:
    if re.search(p, elem):
        print "Matched " + elem
    else:
        print "SKIPPED " + elem

print "DONE"

Mr.SpOOn

 11-21-2008
2008/11/21 harijay <>:
> Hi
> I am a few months new into python. I have used regexps before in perl
> and java but am a little confused with this problem.
>
> I want to parse a number of strings and extract only those that
> contain a 4 digit number anywhere inside a string
>
> However the regexp
> p = re.compile(r'\d{4}')
>
> Matches even sentences that have longer than 4 numbers inside
> strings ..for example it matches "I have 3324234 and more"

Try with this:

p = re.compile(r'\d{4}$')

The $ character matches the end of the string. It should work.
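A quick check of what the end-of-string anchor does may help here. This is only a sketch in Python 3 syntax (the thread itself uses Python 2), and the names `end_anchored`, `samples` and `results` are made up for illustration. It suggests the anchored pattern only finds digits at the very end of the string, and does not by itself rule out longer numbers:

```python
import re

# Four digits that must sit at the very end of the string.
end_anchored = re.compile(r'\d{4}$')

samples = ["As 3233", " I have 2004 rupees ", "A 13123"]

# Map each sample to whether the anchored pattern is found in it.
results = {s: bool(end_anchored.search(s)) for s in samples}

for s, matched in results.items():
    print(repr(s), matched)
# 'As 3233' True                  (four digits at the end)
# ' I have 2004 rupees ' False    (digits are not at the end)
# 'A 13123' True                  (the last four of five digits still match)
```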

John Machin

 11-21-2008
On Nov 22, 8:46 am, harijay <hari...@gmail.com> wrote:
> Hi
> I am a few months new into python. I have used regexps before in perl
> and java but am a little confused with this problem.
>
> I want to parse a number of strings and extract only those that
> contain a 4 digit number anywhere inside a string
>
> However the regexp
> p = re.compile(r'\d{4}')
>
> Matches even sentences that have longer than 4 numbers inside
> strings ..for example it matches "I have 3324234 and more"

No it doesn't. When used with re.search on that string it matches
3324, it doesn't "match" the whole sentence.
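To see this for yourself, you can inspect the match object that re.search returns. A minimal sketch in Python 3 syntax; the variable names `matched_text` and `matched_span` are just for illustration:

```python
import re

m = re.search(r'\d{4}', "I have 3324234 and more")

matched_text = m.group()  # the substring that actually matched
matched_span = m.span()   # (start, end) offsets of that substring

print(matched_text, matched_span)  # 3324 (7, 11)
```

The pattern is satisfied by the first four digits it finds; nothing in it says what may or may not come next.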

>
> I am very confused. Shouldnt the \d{4,} match exactly four digit
> numbers so a 5 digit number sentence should not be matched .

{4} does NOT mean the same as {4,}.
{4} is the same as {4,4}
{4,} means {4,INFINITY}
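The difference between the three quantifiers can be checked with re.fullmatch, which requires the whole string to match. This is a sketch in Python 3 syntax (re.fullmatch does not exist in the Python 2.5 used in this thread), with illustrative variable names:

```python
import re

digits = "12345"

exactly_four = re.fullmatch(r'\d{4}', digits)     # None: five digits is not exactly four
same_as_above = re.fullmatch(r'\d{4,4}', digits)  # None: {4,4} behaves like {4}
four_or_more = re.fullmatch(r'\d{4,}', digits)    # a match: {4,} has no upper bound

# With re.search, {4,} is greedy and takes the whole run of digits.
greedy = re.search(r'\d{4,}', "I have 3324234 and more").group()

print(exactly_four, same_as_above, four_or_more is not None, greedy)
# None None True 3324234
```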

Ignoring {4,}:

You need to specify a regex that says "4 digits followed by (non-digit
or end-of-string)". Have a try at that and come back here if you have
any more problems.

some test data:
xxx1234
xxx12345
xxx1234xxx
xxx12345xxx
xxx1234xxx1235xxx
xxx12345xxx1234xxx
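For anyone who wants to run the test data, here is one possible sketch in Python 3 syntax. The pattern names `right_only` and `both_sides` are my own, and the second pattern is just one way of also constraining the left-hand side, not necessarily what was intended above. It suggests that checking only what follows the four digits is not quite enough:

```python
import re

test_data = [
    "xxx1234",
    "xxx12345",
    "xxx1234xxx",
    "xxx12345xxx",
    "xxx1234xxx1235xxx",
    "xxx12345xxx1234xxx",
]

# "4 digits followed by (non-digit or end-of-string)", taken literally.
right_only = re.compile(r'\d{4}(?:\D|$)')

# The same idea with a matching condition on the left-hand side.
both_sides = re.compile(r'(?:^|\D)\d{4}(?:\D|$)')

right_only_hits = [bool(right_only.search(s)) for s in test_data]
both_sides_hits = [bool(both_sides.search(s)) for s in test_data]

print(right_only_hits)  # [True, True, True, True, True, True]
print(both_sides_hits)  # [True, False, True, False, True, True]
```

The right-hand check alone still accepts "xxx12345", because "2345" is four digits followed by end-of-string.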

skip@pobox.com

 11-21-2008

>> I am a few months new into python. I have used regexps before in perl
>> and java but am a little confused with this problem.

>> I want to parse a number of strings and extract only those that
>> contain a 4 digit number anywhere inside a string

>> However the regexp
>> p = re.compile(r'\d{4}')

>> Matches even sentences that have longer than 4 numbers inside strings
>> ..for example it matches "I have 3324234 and more"

>>> pat = re.compile(r"(?<!\d)(\d{4})(?!\d)")
>>> for s in x:
...     m = pat.search(s)
...     print repr(s),
...     print (m is not None) and "matches" or "does not match"
...
' I have 2004 rupees ' matches
' I have 3324234 and more' does not match
' As 3233 ' matches
'2323423414 is good' does not match
'4444 dc sav 2412441 asdf ' matches
'random1341also and also' matches
'' does not match
'13' does not match
' a 1331 saves' matches
' and and as dad' does not match
' A has 13123123' does not match
'A 13123' does not match
'1312 times I have told you' matches

--
Skip Montanaro - - http://smontanaro.dyndns.org/

George Sakkis

 11-21-2008
On Nov 21, 4:46 pm, harijay <hari...@gmail.com> wrote:

> Hi
> I am a few months new into python. I have used regexps before in perl
> and java but am a little confused with this problem.
>
> I want to parse a number of strings and extract only those that
> contain a 4 digit number anywhere inside a string
>
> However the regexp
> p = re.compile(r'\d{4}')
>
> Matches even sentences that have longer than 4 numbers inside
> strings ..for example it matches "I have 3324234 and more"
>
> I am very confused. Shouldnt the \d{4,} match exactly four digit
> numbers so a 5 digit number sentence should not be matched .

No, why should it? What you're saying is "give me 4 consecutive
digits", without specifying what should precede or follow these
digits. A correct expression is a bit more hairy:

p = re.compile(r'''
    (?:\D|\b)   # find a non-digit or word boundary..
    (\d{4})     # .. followed by the 4 digits to be matched as group #1..
    (?:\D|\b)   # .. which are followed by non-digit or word boundary
''', re.VERBOSE)
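One thing to keep in mind with this pattern is that the \D alternatives consume the neighbouring character, so the digits have to be read from group 1, and two numbers separated by a single letter cannot both be found in one pass. A small sketch in Python 3 syntax, with illustrative names (`consuming`, `lookaround`) and a lookaround pattern added purely for comparison:

```python
import re

consuming = re.compile(r'''
    (?:\D|\b)   # a non-digit or a word boundary
    (\d{4})     # the four digits, captured as group 1
    (?:\D|\b)   # a non-digit or a word boundary
''', re.VERBOSE)

m = consuming.search("xxx1234xxx")
whole = m.group(0)   # includes the neighbouring non-digits
number = m.group(1)  # just the four digits

# The 'a' is consumed by the first match, so 5678 has no usable left context.
consumed_result = consuming.findall("1234a5678")

# Lookarounds test the neighbours without consuming them.
lookaround = re.compile(r'(?<!\d)\d{4}(?!\d)')
lookaround_result = lookaround.findall("1234a5678")

print(whole, number)      # x1234x 1234
print(consumed_result)    # ['1234']
print(lookaround_result)  # ['1234', '5678']
```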

HTH,
George

MRAB

 11-21-2008
George Sakkis wrote:
> On Nov 21, 4:46 pm, harijay <hari...@gmail.com> wrote:
>
>> Hi
>> I am a few months new into python. I have used regexps before in perl
>> and java but am a little confused with this problem.
>>
>> I want to parse a number of strings and extract only those that
>> contain a 4 digit number anywhere inside a string
>>
>> However the regexp
>> p = re.compile(r'\d{4}')
>>
>> Matches even sentences that have longer than 4 numbers inside
>> strings ..for example it matches "I have 3324234 and more"
>>
>> I am very confused. Shouldnt the \d{4,} match exactly four digit
>> numbers so a 5 digit number sentence should not be matched .

>
> No, why should it ? What you're saying is "give me 4 consecutive
> digits", without specifying what should precede or follow these
> digits. A correct expression is a bit more hairy:
>
> p = re.compile(r'''
> (?:\D|\b) # find a non-digit or word boundary..
> (\d{4}) # .. followed by the 4 digits to be matched as group #1..
> (?:\D|\b) # .. which are followed by non-digit or word boundary
> ''', re.VERBOSE)
>

You want to match a sequence of 4 digits: \d{4}
not preceded by a digit: (?<!\d)
not followed by a digit: (?!\d)

which is: re.compile(r'(?<!\d)\d{4}(?!\d)')
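Because the lookarounds are zero-width, the same pattern can also be used with findall to pull every standalone 4 digit number out of a string. A sketch in Python 3 syntax; the sample text and the names `four_digits` and `found` are made up for illustration:

```python
import re

# Four digits with no digit immediately before or after them.
four_digits = re.compile(r'(?<!\d)\d{4}(?!\d)')

text = "4444 dc sav 2412441 asdf, then 1999 and 20081121"
found = four_digits.findall(text)

print(found)  # ['4444', '1999']
```

The 7 and 8 digit runs are skipped entirely, since every 4 digit window inside them has a digit on at least one side.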

harijay

 11-21-2008
Thanks John Machin and Mark Tolonen.
So I guess the correct approach is to use the word boundary metacharacter
"\b",

so r'\b\d{4}\b' is what I need, since it reads as

a 4 digit number in between word boundaries.
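It may be worth comparing \b with the digit lookarounds suggested elsewhere in the thread, since they differ when the digits touch letters or an underscore. A sketch in Python 3 syntax; the names and the "id_1234" sample are my own additions:

```python
import re

word_bounded = re.compile(r'\b\d{4}\b')            # bounded by word boundaries
digit_bounded = re.compile(r'(?<!\d)\d{4}(?!\d)')  # bounded by non-digits

samples = [
    " I have 2004 rupees ",
    "random1341also and also",
    "id_1234",
    " A 13123",
]

word_hits = [bool(word_bounded.search(s)) for s in samples]
digit_hits = [bool(digit_bounded.search(s)) for s in samples]

print(word_hits)   # [True, False, False, False]
print(digit_hits)  # [True, True, True, False]
```

Letters and underscores count as word characters, so \b sees no boundary between "random" and "1341". Which behaviour is wanted depends on whether a number embedded in a word should count as "a 4 digit number anywhere inside a string".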

Thanks a tonne. This is only my second post to comp.lang.python, and I
am always amazed at how helpful everyone on this group is.

Hari
