Regex to extract row data from text

 
 
TimBenz
 
      10-22-2003
I need a RegEx that I can use to scroll through textual data to extract
lines in a semi-regular format. The original data is a form something like
this:

AAA AAAAA BBBB BB CCCCC DDDDD EEEEEE FFFFFFF

Note, there are zero or more spaces in the "A" entity and the "B" entity,
and the rest of the entities have no spaces. Second, there is no fixed
length for any of the entities. They can be any non-zero length. About the
only point of consistency is that the "B" entity has a finite number of
forms, about fifteen. So far my attempt has been like this:

(.*)(COM|COMMON SHARES|Domestic Common)\s{1,}(.*?)\s{1,}(.*?)\s{1,}(.*?)\s

From which I extract $1, $3, and $5.

How do I spool through the whole text file and extract every line for which
the above holds? Are there better ways of doing this without the arduous
part where I have to detail all the variants of the B entity?

Thanks.
 
Anno Siegel
 
      10-22-2003
TimBenz <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> I need a RegEx that I can use to scroll through textual data to extract
> lines in a semi-regular format. The original data is a form something like
> this:
>
> AAA AAAAA BBBB BB CCCCC DDDDD EEEEEE FFFFFFF
>
> Note, there are zero or more spaces in the "A" entity and the "B" entity,
> and the rest of the entities have no spaces. Second, there is no fixed
> length for any of the entities. They can be any non-zero length. About the
> only point of consistency is that the "B" entity has a finite number of
> forms, about fifteen. So far my attempt has been like this:
>
> (.*)(COM|COMMON SHARES|Domestic Common)\s{1,}(.*?)\s{1,}(.*?)\s{1,}(.*?)\s


Which is the part that is supposed to catch the "B" entry? The one
starting "(COM..." has only three alternatives.

> From which I extract $1, $3, and $5.


What about $2?

> How do I spool through the whole text file and extract every line for which
> the above holds?


my @extract;
while ( <FILE> ) {
    push @extract, $_ if /.../;
}
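
A minimal, self-contained sketch of that loop, for anyone who wants to run it. The sample text, the in-memory filehandle, and the two-form pattern are assumptions made for illustration; a real script would open the actual data file and use the full pattern.

```perl
use strict;
use warnings;

# Hypothetical sample input. A real script would open the data file instead,
# e.g. open my $fh, '<', $path or die "Cannot open $path: $!";
my $text = <<'END';
ACME WIDGETS COM 123 456 789 xyz
this line has no share class
BETA HOLDINGS COMMON SHARES 11 22 33 abc
END

# Reading from a string through a filehandle keeps the example runnable.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

# Placeholder pattern: any line containing one of two "B" forms.
my $pattern = qr/\b(?:COMMON SHARES|COM)\s/;

my @extract;
while ( my $line = <$fh> ) {
    chomp $line;
    push @extract, $line if $line =~ $pattern;
}
close $fh;

print "$_\n" for @extract;
```

The lexical filehandle and the explicit `$line` variable are a matter of taste; the bareword `FILE` and `$_` version behaves the same way.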

> Are there better ways of doing this without the arduous
> part where I have to detail all the variants of the B entity?


No. From what you say, it is only possible to delimit the "A" record
after having identified the "B" record.

Anno
 
David Oswald
 
      10-22-2003

"TimBenz" <(E-Mail Removed)> wrote in message

> I need a RegEx that I can use to scroll through textual data to extract
> lines in a semi-regular format. The original data is a form something like
> this:
>
> AAA AAAAA BBBB BB CCCCC DDDDD EEEEEE FFFFFFF
>
> Note, there are zero or more spaces in the "A" entity and the "B" entity,
> and the rest of the entities have no spaces. Second, there is no fixed
> length for any of the entities. They can be any non-zero length. About the
> only point of consistency is that the "B" entity has a finite number of
> forms, about fifteen. So far my attempt has been like this:
>
> (.*)(COM|COMMON SHARES|Domestic Common)\s{1,}(.*?)\s{1,}(.*?)\s{1,}(.*?)\s
>
> From which I extract $1, $3, and $5.


The biggest problem is: how do you plan to delimit the A segment from the
B segment, if the A segment can itself contain spaces, and yet it's a space
that separates A from B? The only way to solve that problem IS to enumerate,
through alternation, all the forms that B can take, so that you can use B as
an anchor point.

Fortunately, you don't have to do it in quite so ugly a way.

Try something like this:

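my $re_alternates = join "|", @alternates_list;
while ( my $line = <DATA> ) {
    # List context: the three captures land in $first, $third and $fifth.
    # The final \s lets a sixth field follow the fifth.
    if ( my ($first, $third, $fifth) = $line =~
        m/^(.+?)(?:$re_alternates)\s+(\w+)\s+\w+\s+(\w+)\s/ ) {
        # do your stuff...
    }
}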

To explain: you said you only want to capture the first, third and fifth
groupings, so I only used capturing parentheses on those portions of the
match. I used non-capturing parens to confine the alternation, and all of
the alternates are built up into $re_alternates.

Finally, instead of using $1, $2, $3, I just used the regexp in list context
so that the scalars $first, $third, and $fifth would be populated in case of
a match.
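
A runnable sketch of the same idea follows. The three "B" forms, the sample lines, and the in-memory filehandle are assumptions for illustration. It also adds two things not in the snippet above: the forms are sorted longest first and passed through quotemeta, and a `\s+` sits between the "A" capture and the alternation so that `$first` carries no trailing space.

```perl
use strict;
use warnings;

# Hypothetical list of "B" forms; the real list would hold all fifteen or so.
my @alternates_list = ( 'COM', 'COMMON SHARES', 'Domestic Common' );

# Longest first, so a longer form wins over a shorter form that is its
# prefix; quotemeta guards against regex metacharacters in a form.
my $re_alternates = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) } @alternates_list;

# A (lazy, may contain spaces), B (not captured), C, D (not captured), E.
my $re = qr/^(.+?)\s+(?:$re_alternates)\s+(\S+)\s+\S+\s+(\S+)\s/;

# Hypothetical sample lines in the A B C D E F shape.
my $text = <<'END';
AAA AAAAA COMMON SHARES CCCCC DDDDD EEEEEE FFFFFFF
XYZ Domestic Common 111 222 333 444
no match on this line
END

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @rows;
while ( my $line = <$fh> ) {
    if ( my ( $first, $third, $fifth ) = $line =~ $re ) {
        push @rows, [ $first, $third, $fifth ];
    }
}
close $fh;

print join( ' | ', @$_ ), "\n" for @rows;
```

The lazy `(.+?)` grows one character at a time until a known "B" form follows it, which is why the "B" list has to be complete for the "A" field to come out right.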

Good luck...





 
Tore Aursand
 
      10-22-2003
On Wed, 22 Oct 2003 07:25:26 +0000, TimBenz wrote:
> The original data is a form something like this:
> [...]


Why don't you post a bit of the _exact_ data you're trying to parse, thus
making it a lot easier for us?

Chances are that you'll get a few answers to your original post, and then
you'll go "yeah, but the data could also include...blah...blah...".


--
Tore Aursand <(E-Mail Removed)>
 
Tassilo v. Parseval
 
      10-22-2003
Also sprach Tore Aursand:

> On Wed, 22 Oct 2003 07:25:26 +0000, TimBenz wrote:
>> The original data is a form something like this:
>> [...]

>
> Why don't you post a bit of the _exact_ data you're trying to parse, thus
> making it a lot easier for us?
>
> Chances are that you'll get a few answers to your original post, and then
> you'll go "yeah, but the data could also include...blah...blah...".


This chance is even higher when he posts a sample of exact data.

Tassilo
--
$_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus}) !JAPH!qq(rehtona{tsuJbus#;
$_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexi ixesixeseg;y~\n~~dddd;eval
 
Chris Mattern
 
      10-22-2003
Tassilo v. Parseval wrote:
> Also sprach Tore Aursand:
>
>
>>On Wed, 22 Oct 2003 07:25:26 +0000, TimBenz wrote:
>>
>>>The original data is a form something like this:
>>>[...]

>>
>>Why don't you post a bit of the _exact_ data you're trying to parse, thus
>>making it a lot easier for us?
>>
>>Chances are that you'll get a few answers to your original post, and then
>>you'll go "yeah, but the data could also include...blah...blah...".

>
>
> This chance is even higher when he posts a sample of exact data.
>

When you're parsing input data, what is necessary is a true understanding
of its syntax, not samples which will almost invariably fail to cover
certain cases. "The data looks like such-and-so" or "The data is in
a form like this" is usually a red flag that the speaker doesn't understand
his input data well enough to parse it properly.

Chris Mattern

 
TimBenz
 
      10-22-2003
Thanks for all the replies. Sorry for having been remiss in not posting the
exact data, but it's proprietary trading data for our money management
firm, so I didn't know what I could post. Here is a representative piece,
however, that I don't think should worry anyone:

NAME OF ISSUER TITLE OF CUSIP MARKET AMOUNT SH/PRINV DISC OTHER VOTING AUTHORITY

21ST CENTURY INS GRP COMMON 90130N103 974 70700 SH SOLE 70700 0 0
3COM CORP COMMON 885535104 5156 873949 SH SOLE 873949 0 0
3M COMPANY COMMON 88579Y101 36846 533460 SH SOLE 527760 0 5700
3M COMPANY COMMON 88579Y101 2735 39596 SH OTHER 39596 0 0
IBM CORP COMMON 88179Y101 735 35110 SH SOLE 35110 0 0



As you can see, the structure is fairly open, and even the tab/space
structure changes depending on the size of entry in the first column.
 
Glenn Jackman
 
      10-22-2003
TimBenz <(E-Mail Removed)> wrote:
> Here is a representative piece,
>
> NAME OF ISSUER TITLE OF CUSIP MARKET AMOUNT SH/PRINV DISC OTHER VOTING AUTHORITY
>
> 21ST CENTURY INS GRP COMMON 90130N103 974 70700 SH SOLE 70700 0 0
> 3COM CORP COMMON 885535104 5156 873949 SH SOLE 873949 0 0
> 3M COMPANY COMMON 88579Y101 36846 533460 SH SOLE 527760 0 5700
> 3M COMPANY COMMON 88579Y101 2735 39596 SH OTHER 39596 0 0
> IBM CORP COMMON 88179Y101 735 35110 SH SOLE 35110 0 0


Looks like fixed width fields, as opposed to delimited.
Does the "COMMON" always start at the 31st character?
If so, use substr() to extract the data.
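
A rough sketch of what that could look like, assuming a 30-character name column, an 8-character class column, and a 9-character CUSIP column. Those widths are made up for illustration, and the record is built with sprintf so that the columns really are fixed.

```perl
use strict;
use warnings;

# Hypothetical fixed-width record; the widths 30, 8 and 9 are assumptions.
my $line = sprintf '%-30s%-8s%-9s', '3COM CORP', 'COMMON', '885535104';

# substr() offsets are zero-based, so the 31st character is offset 30.
my $name  = substr $line, 0,  30;
my $class = substr $line, 30, 8;
my $cusip = substr $line, 38, 9;

# Strip the space padding from each field.
s/\s+$// for $name, $class, $cusip;

print "$name | $class | $cusip\n";    # prints "3COM CORP | COMMON | 885535104"
```

This only works if every line uses the same column positions.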


--
Glenn Jackman
NCF Sysadmin
(E-Mail Removed)
 
TimBenz
 
      10-22-2003
Glenn Jackman <(E-Mail Removed)> wrote in
news:(E-Mail Removed):

>
> Looks like fixed width fields, as opposed to delimited.
> Does the "COMMON" always start at the 31st character?
> If so, use substr() to extract the data.


Sadly, the field widths aren't fixed. It really depends on who filed the
trading report how wide the fields are -- they vary all over the map. So
the substr() method doesn't work. Following advice here, I have written a
regex that keys on the 10 or so variants of the second column and hinges
around that. Irritating, but that seems to be the only thing that works for
me.
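
For the record, a sketch of the kind of regex I mean, run against a few of the sample lines posted earlier. It assumes each record sits on a single line. Only COMMON appears in the sample, so the other class names in the list are hypothetical stand-ins, and the hash keys are names chosen for this example. Requiring a 9-character CUSIP right after the class gives the match a second anchor.

```perl
use strict;
use warnings;

# Title-of-class variants. Only COMMON appears in the posted sample;
# the other entries are hypothetical stand-ins for the real list.
my @classes = ( 'COMMON', 'COM', 'PREFERRED' );
my $class_re = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) } @classes;

# Issuer name (may contain spaces), class, 9-character CUSIP,
# market value, amount.
my $row_re = qr/^(.+?)\s+($class_re)\s+([0-9A-Z]{9})\s+(\d+)\s+(\d+)\b/;

my @lines = (
    'NAME OF ISSUER TITLE OF CUSIP MARKET AMOUNT SH/PRINV',
    '21ST CENTURY INS GRP COMMON 90130N103 974 70700 SH SOLE',
    '3COM CORP COMMON 885535104 5156 873949 SH SOLE',
);

my @rows;
for my $line (@lines) {
    next unless $line =~ $row_re;    # the header line is skipped here
    push @rows, {
        issuer => $1, class  => $2, cusip => $3,
        market => $4, amount => $5,
    };
}

printf "%-22s %-7s %s %6d %7d\n",
    @{$_}{qw(issuer class cusip market amount)} for @rows;
```

The `\s+` in front of the class alternation matters for a name like 3COM CORP: without it, a short variant such as COM could match inside the issuer name.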



 
Tad McClellan
 
      10-22-2003
Glenn Jackman <(E-Mail Removed)> wrote:

> Looks like fixed width fields, as opposed to delimited.


> If so, use substr() to extract the data.



unpack() is the Right Tool for fixed width fields.
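
A small sketch, assuming hypothetical column widths of 30, 8 and 9 characters (the record is built with sprintf just so the widths hold):

```perl
use strict;
use warnings;

# Hypothetical fixed-width record; the widths 30, 8 and 9 are assumptions.
my $line = sprintf '%-30s%-8s%-9s', 'IBM CORP', 'COMMON', '88179Y101';

# "A30 A8 A9" reads three space-padded text fields; on unpack, the A
# format strips trailing spaces, so no separate trimming is needed.
my ( $name, $class, $cusip ) = unpack 'A30 A8 A9', $line;

print "$name | $class | $cusip\n";    # prints "IBM CORP | COMMON | 88179Y101"
```

One template string describes the whole record layout.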


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 