Capturing a Repeated Group

 
 
Paul Lalli
      07-12-2007
On Jul 12, 8:59 am, "per...@gmail.com" <per...@gmail.com> wrote:
> The short answer is that I'm "inchworming" my way through the string.
> The text may contain sentences with commas, and is not a single number
> string. After the number is matched, I continue with other
> matches.


Regexp::Common is your friend.

> Correct me if I'm wrong, but for my scenario I think substitution
> requires two steps, first a match, then a substitution, like so:
> $_ = "1,234,456,789";
> /\d{1,3}(?:,\d\d\d)*/g && do {
> my $number= $&;
> $number =~ s/,//g;
> print "$number\n";
>
> }
>
> But if the number parts could be eaten up in one regexp, it is
> unnecessary to use two.


Unnecessary, maybe, but a heck of a lot more readable.

#!/opt2/perl/bin/perl
use strict;
use warnings;
use Regexp::Common qw/number/;

my @numbers;
while (<DATA>) {
    # collect every comma-separated integer found on this line
    push @numbers, /$RE{num}{int}{-sep=>','}/g;
}
tr/,//d for @numbers;    # strip the separators
print join(' - ', @numbers), "\n";

__DATA__
Lorem ipsum dolor sit amet, 1,234,567,890 consectetuer 1,000
lacinia risus. 56,650,231 Duis 432 porta vehicula 8,103 ligula.

$ ./nums.pl
1234567890 - 1000 - 56650231 - 432 - 8103

Paul Lalli
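
For readers without Regexp::Common installed, roughly the same extraction can be sketched with a hand-written pattern. This is a simplified stand-in and not the regex the module generates: it assumes unsigned integers whose comma-separated groups are exactly three digits, and extract_numbers is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: pull comma-grouped unsigned integers out of a string.
# The pattern is a simplified stand-in for $RE{num}{int}{-sep=>','}.
sub extract_numbers {
    my ($text) = @_;
    # 1-3 leading digits, then any number of ",ddd" groups, on word boundaries
    my @found = $text =~ /\b(\d{1,3}(?:,\d{3})*)\b/g;
    tr/,//d for @found;    # strip the separators in place
    return @found;
}

my $text = "Lorem ipsum dolor sit amet, 1,234,567,890 consectetuer 1,000\n"
         . "lacinia risus. 56,650,231 Duis 432 porta vehicula 8,103 ligula.\n";
my @numbers = extract_numbers($text);
print join(' - ', @numbers), "\n";    # 1234567890 - 1000 - 56650231 - 432 - 8103
```

Unlike a bare \d+, this sketch skips ungrouped runs such as 12345, which may or may not be what a given input needs.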


 
anno4000@radom.zrz.tu-berlin.de
      07-12-2007
<> wrote in comp.lang.perl.misc:
>
> Thanks Xicheng, Hjalmarsson, Steven and Anno for your inputs!
>
> I'm really "inchworming" my way through the string, scanning tokens
> like a lexical analyzer. If it fails to scan a number like
> "1,234,567,890", it continues by scanning an identifier token
> (e.g. /\w+(\d|\w)*/)


That sounds like you want a real parser, where number recognition
would be part of the general parsing process.
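
A minimal sketch of that idea, assuming a hand-rolled tokenizer built on \G and the /gc modifiers so that each token pattern is tried at the current position and a failed attempt leaves pos() untouched. The sub name tokenize, the token names (NUMBER, IDENT, PUNCT) and the identifier pattern are illustrative choices, not anything taken from the thread.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: token names and patterns are illustrative only.
sub tokenize {
    my ($input) = @_;
    my @tokens;
    for ($input) {                  # alias $_ so every match shares one pos()
        pos($_) = 0;
        while (pos($_) < length $_) {
            if (/\G\s+/gc) {
                next;               # skip whitespace between tokens
            }
            elsif (/\G(\d{1,3}(?:,\d{3})+(?!\d)|\d+)/gc) {
                (my $n = $1) =~ tr/,//d;    # normalise 1,234 to 1234
                push @tokens, [ NUMBER => $n ];
            }
            elsif (/\G([A-Za-z_]\w*)/gc) {
                push @tokens, [ IDENT => $1 ];
            }
            elsif (/\G(.)/gcs) {
                push @tokens, [ PUNCT => $1 ];    # any other single character
            }
        }
    }
    return @tokens;
}

my @demo = tokenize("total = 1,234,567 + x2, 42");
print join(' ', map { "$_->[0]:$_->[1]" } @demo), "\n";
# IDENT:total PUNCT:= NUMBER:1234567 PUNCT:+ IDENT:x2 PUNCT:, NUMBER:42
```

Because the number rule is tried before the identifier rule, a failed number match simply falls through to the next alternative at the same position, which is the "inchworming" behaviour described above.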

> On 12 July, 00:55, Gunnar Hjalmarsson <nore...@gunnar.cc> wrote:
> > Xicheng Jia wrote:
> > > you probably want this:
> >
> > > @parts = /(\d{1,3})(?=,(?:\d{3}))*/g .....
> >
> > Let's see:
> >
> > C:\home>type test.pl
> > use warnings;
> > $_ = '1,234,567,890';
> > @parts = /(\d{1,3})(?=,(?:\d{3}))*/g;
> > print @parts, "\n";
> >
> > C:\home>test.pl
> > (?=,(?:\d{3}))* matches null string many times in regex; marked by
> > <-- HERE in m/(\d{1,3})(?=,(?:\d{3}))* <-- HERE / at C:\home\test.pl
> > line 3.
> > 1234567890
> >
>
> Even if perl reports matches null string many times, isn't this what I
> want?
> I want to match 123 or 1,234 or 1,234,567 or similar patterns.
> Since the regexp starts with /\d{1,3}/ it never matches the null
> string.
> Can't see for what patterns this fails...?


The pattern you're repeating is a *zero width* lookahead. Whatever
the regex engine does internally to determine if it matches, the
width of the match will be zero. That's what it's complaining about.
The asterisk does nothing, you can remove it.

Anno
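
To illustrate the point, here is a small sketch under one assumption: that "remove it" is read as dropping the whole starred lookahead. A lookahead that is allowed to repeat zero times constrains nothing, so the bare capture collects the same groups. Dropping only the asterisk is a different change, because the lookahead then becomes mandatory. The variable names are illustrative.

```perl
use strict;
use warnings;

my $str = '1,234,567,890';

# With the optional (starred) lookahead gone, the capture alone
# collects every group of digits.
my @plain = $str =~ /(\d{1,3})/g;

# Keeping the lookahead but dropping only the asterisk makes it mandatory:
# each group must be followed by ",ddd", so the final group is not matched.
my @strict = $str =~ /(\d{1,3})(?=,\d{3})/g;

print "@plain\n";     # 1 234 567 890
print "@strict\n";    # 1 234 567
```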
 
perrog@gmail.com
      07-12-2007
On 12 July, 04:18, Xicheng Jia <xich...@gmail.com> wrote:
> On Jul 11, 11:09 pm, "attn.steven....@gmail.com"
>
> > The latter won't work for a target string like:
> > $_ = '123,456,789';
> > (i.e., anything with an odd number of comma delimited substrings).
> > You can try global match (//g) in scalar context:
> > $_ = "1,234,567,890";
> > my $n = 0;
> > while (/\G(\d{1,3})(?:,|$)/g)
>
> this should be the same as:
>
> while (/\G(\d{1,3}),?/g)
>


Ohh, now I'm beginning to see the logic... The /(\d{1,3})(?:,(\d{3}))*/g
regexp captured repeated productions, not repeated groups.

So, to sum up: I can't use /(\d{1,3})(?:,(\d\d\d))*/ because the RE
engine only saves the capture from the last iteration of a repeated
group. The fix is to use the g-modifier to capture repeated
productions... the subject of this thread should really have been
"capturing repeated productions", right?
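
The difference can be seen in a short sketch (variable names are illustrative; the loop pattern is the \G form quoted earlier in this post):

```perl
use strict;
use warnings;

my $str = '1,234,567,890';

# One match with a repeated group: $2 keeps only the last repetition.
my ($first, $last) = $str =~ /^(\d{1,3})(?:,(\d\d\d))*$/;
print "single match: $first ... $last\n";    # single match: 1 ... 890

# Repeated matches with /g and \G: each iteration exposes one group in $1.
my @parts;
while ($str =~ /\G(\d{1,3})(?:,|$)/g) {
    push @parts, $1;
}
print "loop: @parts\n";                       # loop: 1 234 567 890
```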

Ideally, /(\d{1,3})|(?<=\d{1,3}),(\d\d\d)/g would work, but
(?<=\d{1,3}) is not implemented yet, so I ended up writing:

@parts = ();
(@parts = grep { defined $_ }
m((\d{1,3})
# (?<=\d{1,3}) not implemented, use three cases
| (?<=\d),(\d\d\d)
| (?<=\d\d),(\d\d\d)
| (?<=\d\d\d),(\d\d\d)
)xg) && do {
my $number = 0;
$number = $number * 1000 + $_ foreach (@parts);
print "$number\n";
};

It uses grep to filter out the undef captures, which come from the
alternation branches that did not take part in a given match.
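
As a hedged alternative sketch: on Perl 5.10 or newer (released after this thread), the branch-reset construct (?|...) lets both alternatives capture into $1, so no undef placeholders are produced and no grep is needed. The three lookbehind cases are also collapsed to a single (?<=\d) here, on the assumption that "preceded by at least one digit" is all they were meant to check.

```perl
use strict;
use warnings;
use 5.010;    # branch reset (?|...) needs Perl 5.10 or newer

my $str = '1,234,567,890';

# Both alternatives capture into $1, so the list holds one defined
# value per match and needs no filtering.
my @parts = $str =~ /(?|(\d{1,3})|(?<=\d),(\d{3}))/g;

my $number = 0;
$number = $number * 1000 + $_ for @parts;
print "$number\n";    # 1234567890
```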

 
Dr.Ruud
      07-12-2007
wrote:

> I want to match 123 or 1,234 or 1,234,567 or similar patterns.


perldoc -f reverse

--
Affijn, Ruud

"Gewoon is een tijger."
 