Velocity Reviews


keith@bytebrothers.co.uk 02-18-2008 10:42 AM

Parsing multi-line text
 

Hi all,

I have a data file structured something like this:

------------------8<-----------------------
Chunk 01
NAME: "Alice"
Description: "Some other string"
Age: 37
Chunk 02
NAME: "Bob"
Description: "Some other string"
Age: 28
Chunk 03
FIRST: "Carol"
Description: "Some other string"
Age: 32
Chunk 04
FIRST: "Dave"
Description: "Some other string"
Age: 22
------------------8<-----------------------

and I want to extract from it to produce output something like this:

------------------8<-----------------------
01 NAME: Alice -> 37
02 NAME: Bob -> 28
03 NAME: Carol -> 32
04 NAME: Dave -> 22
------------------8<-----------------------

I've read and re-read the section in perlfaq6 (no, really, I have!)
about multi-line matching, but I can't see how to adapt what is there
to this.

Can someone please point me in the right direction?
Thx!

Gunnar Hjalmarsson 02-18-2008 11:18 AM

Re: Parsing multi-line text
 
keith@bytebrothers.co.uk wrote:
> I have a data file structured something like this:
>
> ------------------8<-----------------------
> Chunk 01
> NAME: "Alice"
> Description: "Some other string"
> Age: 37
> Chunk 02
> NAME: "Bob"
> Description: "Some other string"
> Age: 28
> Chunk 03
> FIRST: "Carol"
> Description: "Some other string"
> Age: 32
> Chunk 04
> FIRST: "Dave"
> Description: "Some other string"
> Age: 22
> ------------------8<-----------------------
>
> and I want to extract from it to produce output something like this:
>
> ------------------8<-----------------------
> 01 NAME: Alice -> 37
> 02 NAME: Bob -> 28
> 03 NAME: Carol -> 32
> 04 NAME: Dave -> 22
> ------------------8<-----------------------


local $/ = 'Chunk';
while (<>) {
    if ( /(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s ) {
        printf "%02d NAME: %-10s -> %d\n", $1, $2, $3;
    }
}

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

keith@bytebrothers.co.uk 02-18-2008 11:41 AM

Re: Parsing multi-line text
 
On 18 Feb, 11:18, Gunnar Hjalmarsson <nore...@gunnar.cc> wrote:
> ke...@bytebrothers.co.uk wrote:
> > I have a data file structured something like this:

>
> > ------------------8<-----------------------
> > Chunk 01
> > NAME: "Alice"
> > Description: "Some other string"
> > Age: 37
> > ------------------8<-----------------------

>
> > and I want to extract from it to produce output something like this:

>
> > ------------------8<-----------------------
> > 01 NAME: Alice -> 37
> > ------------------8<-----------------------

>
> local $/ = 'Chunk';
> while (<>) {
> if ( /(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s ) {
> printf "%02d NAME: %-10s -> %d\n", $1, $2, $3;
> }
> }


Gotta love this place - thanks!

Now, let's see if I can decipher (no point in asking if I don't learn
from the answer)...

You make the text 'Chunk' the record delimiter. Then inside each
record you look for digits (store in $1). Skip anything followed by
uppercase text followed by colon followed by space followed by double-
quote. Now grab everything up to next double quote (store in $2).
Skip double-quote, then anything then the text 'Age:' then spaces,
then grab digits (store in $3), and we're done.

Is that close?!

Gunnar Hjalmarsson 02-18-2008 11:56 AM

Re: Parsing multi-line text
 
keith@bytebrothers.co.uk wrote:
> On 18 Feb, 11:18, Gunnar Hjalmarsson <nore...@gunnar.cc> wrote:
>>
>> local $/ = 'Chunk';
>> while (<>) {
>> if ( /(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s ) {
>> printf "%02d NAME: %-10s -> %d\n", $1, $2, $3;
>> }
>> }

>
> Gotta love this place - thanks!
>
> Now, let's see if I can decipher (no point in asking if I don't learn
> from the answer)...
>
> You make the text 'Chunk' the record delimiter. Then inside each
> record you look for digits (store in $1). Skip anything followed by
> uppercase text followed by colon followed by space followed by double-
> quote. Now grab everything up to next double quote (store in $2).
> Skip double-quote, then anything then the text 'Age:' then spaces,
> then grab digits (store in $3), and we're done.
>
> Is that close?!


Yep, that's about it.

Since each chunk spans multiple lines, the /s modifier is important
(it makes . match newlines as well).
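
For anyone who wants to follow the walk-through step by step, here is the
same pattern spelled out with the /x modifier so each part can carry a
comment. This is only a sketch: the parse_chunks() helper and the in-memory
filehandle are assumptions added for illustration (the code above reads
from <>), and the sample data is a shortened copy of the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data copied (shortened) from the original post.
my $data = <<'EOS';
Chunk 01
NAME: "Alice"
Description: "Some other string"
Age: 37
Chunk 02
NAME: "Bob"
Description: "Some other string"
Age: 28
EOS

# parse_chunks() is a hypothetical helper name, not something from the thread.
sub parse_chunks {
    my ($text) = @_;
    my @lines;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = 'Chunk';    # each read returns text up to and including 'Chunk'
    while ( my $record = <$fh> ) {
        if ( $record =~ /
                (\d+)           # capture 1: the chunk number
                .+              # anything, newlines included because of the s flag
                [A-Z]+:\s+      # an all-uppercase label such as NAME:
                "([^"]*)"       # capture 2: the text between the double quotes
                .+              # skip ahead, again across newlines
                Age:\s+(\d+)    # capture 3: the age
            /xs ) {
            push @lines, sprintf '%02d NAME: %s -> %d', $1, $2, $3;
        }
    }
    close $fh;
    return @lines;
}

print "$_\n" for parse_chunks($data);
```

Note that the very first record is just the word 'Chunk' on its own, which
contains no digits, so the pattern simply does not match it.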

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

Peter J. Holzer 03-01-2008 03:42 PM

Re: Parsing multi-line text
 
On 2008-02-18 10:42, keith@bytebrothers.co.uk <keith@bytebrothers.co.uk> wrote:
> I have a data file structured something like this:
>
> ------------------8<-----------------------
> Chunk 01
> NAME: "Alice"
> Description: "Some other string"
> Age: 37
> Chunk 02
> NAME: "Bob"
> Description: "Some other string"
> Age: 28
> Chunk 03
> FIRST: "Carol"
> Description: "Some other string"
> Age: 32
> Chunk 04
> FIRST: "Dave"
> Description: "Some other string"
> Age: 22
> ------------------8<-----------------------
>
> and I want to extract from it to produce output something like this:
>
> ------------------8<-----------------------
> 01 NAME: Alice -> 37
> 02 NAME: Bob -> 28
> 03 NAME: Carol -> 32
> 04 NAME: Dave -> 22
> ------------------8<-----------------------


Sure about that? In the input you have sometimes "FIRST" and sometimes
"NAME", but in the output it is always NAME. Assuming this is
intentional:


#!/usr/bin/perl
use strict;
use warnings;

my $s = <<EOS;
Chunk 01
NAME: "Alice"
Description: "Some other string"
Age: 37
Chunk 02
NAME: "Bob"
Description: "Some other string"
Age: 28
Chunk 03
FIRST: "Carol"
Description: "Some other string"
Age: 32
Chunk 04
FIRST: "Dave"
Description: "Some other string"
Age: 22
EOS

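while ($s =~ m{
        ^Chunk \s (\d+) \n                   # group 1: chunk number
        \s* (NAME|FIRST): \s "(.*?)" \n      # groups 2 and 3: label and name
        \s* Description:  \s "(.*?)" \n      # group 4: description
        \s* Age:          \s (\d+)   \n      # group 5: age
    }xmg
) {
    # \s* (rather than \s+) lets the fields match with or without indentation
    print "$1 NAME: $3 -> $5\n";
}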

hp

Peter J. Holzer 03-01-2008 03:45 PM

Re: Parsing multi-line text
 
On 2008-02-18 11:18, Gunnar Hjalmarsson <noreply@gunnar.cc> wrote:
> keith@bytebrothers.co.uk wrote:
>> I have a data file structured something like this:
>>
>> ------------------8<-----------------------
>> Chunk 01
>> NAME: "Alice"
>> Description: "Some other string"
>> Age: 37
>> Chunk 02
>> NAME: "Bob"
>> Description: "Some other string"


change this line to

Description: "Some Chunky string"

>> Age: 28

....
>> ------------------8<-----------------------

>
> local $/ = 'Chunk';
> while (<>) {
> if ( /(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s ) {
> printf "%02d NAME: %-10s -> %d\n", $1, $2, $3;
> }
> }


and then run this script again.
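
A minimal sketch of that experiment, for readers who don't want to edit
their data file. The chunk_numbers_found() helper and the in-memory
filehandle are assumptions made for this illustration; the separator and
the pattern are the ones quoted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two chunks from the original data, with the suggested change applied.
my $data = <<'EOS';
Chunk 01
NAME: "Alice"
Description: "Some other string"
Age: 37
Chunk 02
NAME: "Bob"
Description: "Some Chunky string"
Age: 28
EOS

# chunk_numbers_found() is a hypothetical helper for this illustration.
sub chunk_numbers_found {
    my ($text) = @_;
    my @found;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = 'Chunk';    # the separator also matches inside "Chunky"
    while ( my $record = <$fh> ) {
        push @found, $1
            if $record =~ /(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s;
    }
    close $fh;
    return @found;
}

my @found = chunk_numbers_found($data);
print "Chunks reported: @found\n";    # chunk 02 is missing
```

The second chunk is cut in two at "Chunky": the first half has no Age line
and the second half has no name, so neither half matches and the record is
dropped without any warning.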

hp

Gunnar Hjalmarsson 03-02-2008 09:50 AM

Re: Parsing multi-line text
 
Peter J. Holzer wrote:
> On 2008-02-18 11:18, Gunnar Hjalmarsson <noreply@gunnar.cc> wrote:
>> keith@bytebrothers.co.uk wrote:
>>> I have a data file structured something like this:
>>>
>>> ------------------8<-----------------------
>>> Chunk 01
>>> NAME: "Alice"
>>> Description: "Some other string"
>>> Age: 37
>>> Chunk 02
>>> NAME: "Bob"
>>> Description: "Some other string"

>
> change this line to
>
> Description: "Some Chunky string"
>
>>> Age: 28

> ...
>>> ------------------8<-----------------------

>> local $/ = 'Chunk';
>> while (<>) {
>> if ( /(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s ) {
>> printf "%02d NAME: %-10s -> %d\n", $1, $2, $3;
>> }
>> }

>
> and then run this script again.


Well, what's the likelihood that that would happen? At least the OP
didn't object to the idea with 'Chunk' as record separator.

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

Peter J. Holzer 03-02-2008 01:43 PM

Re: Parsing multi-line text
 
On 2008-03-02 09:50, Gunnar Hjalmarsson <noreply@gunnar.cc> wrote:
> Peter J. Holzer wrote:
>> On 2008-02-18 11:18, Gunnar Hjalmarsson <noreply@gunnar.cc> wrote:
>>> keith@bytebrothers.co.uk wrote:
>>>> I have a data file structured something like this:
>>>>
>>>> ------------------8<-----------------------
>>>> Chunk 01
>>>> NAME: "Alice"
>>>> Description: "Some other string"
>>>> Age: 37
>>>> Chunk 02
>>>> NAME: "Bob"
>>>> Description: "Some other string"

>>
>> change this line to
>>
>> Description: "Some Chunky string"
>>
>>>> Age: 28

>> ...
>>>> ------------------8<-----------------------
>>> local $/ = 'Chunk';
>>> while (<>) {
>>> if ( /(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s ) {
>>> printf "%02d NAME: %-10s -> %d\n", $1, $2, $3;
>>> }
>>> }

>>
>> and then run this script again.

>
> Well, what's the likelihood that that would happen?


How would I know? The OP didn't say much about the contents of the
fields. But I'd say it is non-zero. "Chunk" is an English word which
might occur in a description, and it might even be the first 5
characters of a name. Finally, we don't know where data comes from -
somebody might deliberately try to sabotage the script.

> At least the OP didn't object to the idea with 'Chunk' as record
> separator.


I was under the impression that he was glad to understand your solution
at all and wasn't trying to find flaws in it. Far too few people think
about the edge-cases of possible input.

A word of warning about the solution I posted in a different message: It
doesn't handle embedded quotes - that would be quite easy to add, but
there are different systems of escaping quotes and one would need to
know which one to use - the OP didn't tell us.
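
To illustrate just one of those systems: the sketch below assumes that a
quote inside a field is escaped with a backslash (\"). That convention, the
sample name, and the extract() helper are all assumptions made for this
example; nothing in the thread says which scheme the real data uses.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample record with backslash-escaped quotes in the name.
my $s = <<'EOS';
Chunk 01
NAME: "Alice \"Al\" Smith"
Description: "Some other string"
Age: 37
EOS

# Assumption: a quoted field is a run of ordinary characters or
# backslash-escaped characters between two double quotes.
my $quoted = qr{ " ( (?: [^"\\] | \\. )* ) " }x;

# extract() is a hypothetical helper name used only in this sketch.
sub extract {
    my ($text) = @_;
    my @out;
    while ( $text =~ m{
            ^Chunk \s (\d+) \n
            \s* (?:NAME|FIRST): \s $quoted \n
            \s* Description:    \s $quoted \n
            \s* Age:            \s (\d+)   \n
        }xmg ) {
        my ( $num, $name, $age ) = ( $1, $2, $4 );
        $name =~ s/\\(.)/$1/g;    # drop the escaping backslashes
        push @out, "$num NAME: $name -> $age";
    }
    return @out;
}

print "$_\n" for extract($s);
```

If the data doubled its quotes instead (as CSV usually does), the $quoted
pattern would have to change accordingly.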

hp


