Parsing multi-line text

 
 
keith@bytebrothers.co.uk
 
      02-18-2008

Hi all,

I have a data file structured something like this:

------------------8<-----------------------
Chunk 01
NAME: "Alice"
Description: "Some other string"
Age: 37
Chunk 02
NAME: "Bob"
Description: "Some other string"
Age: 28
Chunk 03
FIRST: "Carol"
Description: "Some other string"
Age: 32
Chunk 04
FIRST: "Dave"
Description: "Some other string"
Age: 22
------------------8<-----------------------

and I want to extract from it to produce output something like this:

------------------8<-----------------------
01 NAME: Alice -> 37
02 NAME: Bob -> 28
03 NAME: Carol -> 32
04 NAME: Dave -> 22
------------------8<-----------------------

I've read and re-read the section in perlfaq6 (no, really, I have!)
about multi-line matching, but I can't see how to adapt what is there
to this.

Can someone please point me in the right direction?
Thx!
 
 
 
 
 
Gunnar Hjalmarsson
 
      02-18-2008
(E-Mail Removed) wrote:
> I have a data file structured something like this:
>
> ------------------8<-----------------------
> Chunk 01
> NAME: "Alice"
> Description: "Some other string"
> Age: 37
> Chunk 02
> NAME: "Bob"
> Description: "Some other string"
> Age: 28
> Chunk 03
> FIRST: "Carol"
> Description: "Some other string"
> Age: 32
> Chunk 04
> FIRST: "Dave"
> Description: "Some other string"
> Age: 22
> ------------------8<-----------------------
>
> and I want to extract from it to produce output something like this:
>
> ------------------8<-----------------------
> 01 NAME: Alice -> 37
> 02 NAME: Bob -> 28
> 03 NAME: Carol -> 32
> 04 NAME: Dave -> 22
> ------------------8<-----------------------


local $/ = 'Chunk';
while (<>) {
    if ( /(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s ) {
        printf "%02d NAME: %-10s -> %d\n", $1, $2, $3;
    }
}

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 
 
 
 
 
keith@bytebrothers.co.uk
 
      02-18-2008
On 18 Feb, 11:18, Gunnar Hjalmarsson <(E-Mail Removed)> wrote:
> (E-Mail Removed) wrote:
> > I have a data file structured something like this:

>
> > ------------------8<-----------------------
> > Chunk 01
> > NAME: "Alice"
> > Description: "Some other string"
> > Age: 37
> > ------------------8<-----------------------

>
> > and I want to extract from it to produce output something like this:

>
> > ------------------8<-----------------------
> > 01 NAME: Alice -> 37
> > ------------------8<-----------------------

>
> local $/ = 'Chunk';
> while (<>) {
> if ( /(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s ) {
> printf "%02d NAME: %-10s -> %d\n", $1, $2, $3;
> }
> }


Gotta love this place - thanks!

Now, let's see if I can decipher (no point in asking if I don't learn
from the answer)...

You make the text 'Chunk' the record delimiter. Then inside each
record you look for digits (store in $1). Skip anything followed by
uppercase text followed by colon followed by space followed by double-
quote. Now grab everything up to next double quote (store in $2).
Skip double-quote, then anything then the text 'Age:' then spaces,
then grab digits (store in $3), and we're done.

Is that close?!
 
 
Gunnar Hjalmarsson
 
      02-18-2008
(E-Mail Removed) wrote:
> On 18 Feb, 11:18, Gunnar Hjalmarsson <(E-Mail Removed)> wrote:
>>
>> local $/ = 'Chunk';
>> while (<>) {
>> if ( /(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s ) {
>> printf "%02d NAME: %-10s -> %d\n", $1, $2, $3;
>> }
>> }

>
> Gotta love this place - thanks!
>
> Now, let's see if I can decipher (no point in asking if I don't learn
> from the answer)...
>
> You make the text 'Chunk' the record delimiter. Then inside each
> record you look for digits (store in $1). Skip anything followed by
> uppercase text followed by colon followed by space followed by double-
> quote. Now grab everything up to next double quote (store in $2).
> Skip double-quote, then anything then the text 'Age:' then spaces,
> then grab digits (store in $3), and we're done.
>
> Is that close?!


Yep, that's about it.

Since each chunk spans multiple lines, the /s modifier is important
(it makes . match newlines as well).
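
A minimal sketch of the effect of /s, using one record as it would look after being read with 'Chunk' as the record separator. The variable names ($record, $re_with_s, $re_without_s) are only illustrative and are not part of the script posted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One record as read with $/ = 'Chunk' (sample data from the original post).
my $record = qq{ 01\nNAME: "Alice"\nDescription: "Some other string"\nAge: 37\nChunk};

# The same pattern, once with /s and once without it.
my $re_with_s    = qr/(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s;
my $re_without_s = qr/(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/;

my $with    = $record =~ $re_with_s    ? "match: $1 $2 $3" : "no match";
my $without = $record =~ $re_without_s ? "match: $1 $2 $3" : "no match";

print "with /s:    $with\n";      # match: 01 Alice 37
print "without /s: $without\n";   # no match
```

Without /s, the first .+ cannot cross the newline that follows the chunk number, so the pattern never gets as far as the NAME line.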

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 
 
Peter J. Holzer
 
      03-01-2008
On 2008-02-18 10:42, (E-Mail Removed) <(E-Mail Removed)> wrote:
> I have a data file structured something like this:
>
> ------------------8<-----------------------
> Chunk 01
> NAME: "Alice"
> Description: "Some other string"
> Age: 37
> Chunk 02
> NAME: "Bob"
> Description: "Some other string"
> Age: 28
> Chunk 03
> FIRST: "Carol"
> Description: "Some other string"
> Age: 32
> Chunk 04
> FIRST: "Dave"
> Description: "Some other string"
> Age: 22
> ------------------8<-----------------------
>
> and I want to extract from it to produce output something like this:
>
> ------------------8<-----------------------
> 01 NAME: Alice -> 37
> 02 NAME: Bob -> 28
> 03 NAME: Carol -> 32
> 04 NAME: Dave -> 22
> ------------------8<-----------------------


Sure about that? In the input you have sometimes "FIRST" and sometimes
"NAME", but in the output it is always NAME. Assuming this is
intentional:


#!/usr/bin/perl
use strict;
use warnings;

my $s = <<EOS;
Chunk 01
NAME: "Alice"
Description: "Some other string"
Age: 37
Chunk 02
NAME: "Bob"
Description: "Some other string"
Age: 28
Chunk 03
FIRST: "Carol"
Description: "Some other string"
Age: 32
Chunk 04
FIRST: "Dave"
Description: "Some other string"
Age: 22
EOS

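# \s* (rather than \s+) lets the field lines match with or without leading indentation
while ($s =~ m{
    ^Chunk \s (\d+) \n
    \s*(NAME|FIRST): \s "(.*?)" \n
    \s*Description: \s "(.*?)" \n
    \s*Age: \s (\d+) \n
}xmg
) {
    # $1 = chunk number, $3 = name, $5 = age ($2 is the field label, $4 the description)
    print "$1 NAME: $3 -> $5\n";
}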

hp
 
 
Peter J. Holzer
 
      03-01-2008
On 2008-02-18 11:18, Gunnar Hjalmarsson <(E-Mail Removed)> wrote:
> (E-Mail Removed) wrote:
>> I have a data file structured something like this:
>>
>> ------------------8<-----------------------
>> Chunk 01
>> NAME: "Alice"
>> Description: "Some other string"
>> Age: 37
>> Chunk 02
>> NAME: "Bob"
>> Description: "Some other string"


change this line to

Description: "Some Chunky string"

>> Age: 28

....
>> ------------------8<-----------------------

>
> local $/ = 'Chunk';
> while (<>) {
> if ( /(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s ) {
> printf "%02d NAME: %-10s -> %d\n", $1, $2, $3;
> }
> }


and then run this script again.

hp
 
 
Gunnar Hjalmarsson
 
      03-02-2008
Peter J. Holzer wrote:
> On 2008-02-18 11:18, Gunnar Hjalmarsson <(E-Mail Removed)> wrote:
>> (E-Mail Removed) wrote:
>>> I have a data file structured something like this:
>>>
>>> ------------------8<-----------------------
>>> Chunk 01
>>> NAME: "Alice"
>>> Description: "Some other string"
>>> Age: 37
>>> Chunk 02
>>> NAME: "Bob"
>>> Description: "Some other string"

>
> change this line to
>
> Description: "Some Chunky string"
>
>>> Age: 28

> ...
>>> ------------------8<-----------------------

>> local $/ = 'Chunk';
>> while (<>) {
>> if ( /(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s ) {
>> printf "%02d NAME: %-10s -> %d\n", $1, $2, $3;
>> }
>> }

>
> and then run this script again.


Well, what's the likelihood that that would happen? At least the OP
didn't object to the idea with 'Chunk' as record separator.

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 
 
Peter J. Holzer
 
      03-02-2008
On 2008-03-02 09:50, Gunnar Hjalmarsson <(E-Mail Removed)> wrote:
> Peter J. Holzer wrote:
>> On 2008-02-18 11:18, Gunnar Hjalmarsson <(E-Mail Removed)> wrote:
>>> (E-Mail Removed) wrote:
>>>> I have a data file structured something like this:
>>>>
>>>> ------------------8<-----------------------
>>>> Chunk 01
>>>> NAME: "Alice"
>>>> Description: "Some other string"
>>>> Age: 37
>>>> Chunk 02
>>>> NAME: "Bob"
>>>> Description: "Some other string"

>>
>> change this line to
>>
>> Description: "Some Chunky string"
>>
>>>> Age: 28

>> ...
>>>> ------------------8<-----------------------
>>> local $/ = 'Chunk';
>>> while (<>) {
>>> if ( /(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s ) {
>>> printf "%02d NAME: %-10s -> %d\n", $1, $2, $3;
>>> }
>>> }

>>
>> and then run this script again.

>
> Well, what's the likelihood that that would happen?


How would I know? The OP didn't say much about the contents of the
fields. But I'd say it is non-zero. "Chunk" is an English word which
might occur in a description, and it might even be the first 5
characters of a name. Finally, we don't know where data comes from -
somebody might deliberately try to sabotage the script.
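
To make the concern concrete, here is a rough sketch that runs the 'Chunk'-separator approach against data containing "Chunky", next to one possible alternative that only treats a whole line of the form "Chunk <digits>" as a record start. The sub names by_separator and by_header_line are hypothetical, and the line-by-line parser is just one way to do it, not something proposed in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data from the thread, with the suggested "Chunky" change applied.
my $data = <<'EOS';
Chunk 01
NAME: "Alice"
Description: "Some other string"
Age: 37
Chunk 02
NAME: "Bob"
Description: "Some Chunky string"
Age: 28
EOS

# Approach 1: 'Chunk' as the input record separator, as posted earlier.
sub by_separator {
    my ($text) = @_;
    my @out;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = 'Chunk';
    while ( my $rec = <$fh> ) {
        if ( $rec =~ /(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s ) {
            push @out, "$1 NAME: $2 -> $3";
        }
    }
    close $fh;
    return @out;
}

# Approach 2 (hypothetical alternative): go line by line and start a new
# record only on a line that consists of "Chunk" followed by digits.
sub by_header_line {
    my ($text) = @_;
    my ( @out, $id, $name );
    for my $line ( split /\n/, $text ) {
        if ( $line =~ /^Chunk\s+(\d+)\s*$/ ) {
            ( $id, $name ) = ( $1, undef );
        }
        elsif ( $line =~ /^\s*(?:NAME|FIRST):\s+"([^"]*)"/ ) {
            $name = $1;
        }
        elsif ( $line =~ /^\s*Age:\s+(\d+)/ && defined $id && defined $name ) {
            push @out, "$id NAME: $name -> $1";
        }
    }
    return @out;
}

print "separator:   $_\n" for by_separator($data);
print "header line: $_\n" for by_header_line($data);
```

With this input the separator version reports only Alice: Bob's record is cut in two at "Chunk" inside the description, neither half matches, and the record is dropped without any warning. The line-based version reports both.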

> At least the OP didn't object to the idea with 'Chunk' as record
> separator.


I was under the impression that he was glad to understand your solution
at all and wasn't trying to find flaws in it. Far too few people think
about the edge-cases of possible input.

A word of warning about the solution I posted in a different message: It
doesn't handle embedded quotes - that would be quite easy to add, but
there are different systems of escaping quotes and one would need to
know which one to use - the OP didn't tell us.
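
As an illustration only, this is roughly what the quoted-field part could look like if quotes inside a field were escaped with a backslash. That escaping convention is an assumption; the thread does not say which one the data uses, and the names $quoted and field_value are made up for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# ASSUMPTION: a quote inside a field is written as \" (backslash escaping).
# Other conventions, such as doubling the quote (""), need a different pattern.
my $quoted = qr/"((?:[^"\\]|\\.)*)"/;

# Return the unescaped contents of the first quoted field on a line,
# or nothing if the line has no quoted field.
sub field_value {
    my ($line) = @_;
    return unless $line =~ $quoted;
    my $value = $1;
    $value =~ s/\\(.)/$1/g;    # drop the escaping backslashes
    return $value;
}

print field_value('Description: "Some \"quoted\" string"'), "\n";
print field_value('NAME: "Alice"'), "\n";
```

The alternation consumes either an ordinary character or a backslash together with the character it escapes, so an escaped quote never ends the field early.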

hp

 
 
 
 