Velocity Reviews - Computer Hardware Reviews

Reading poorly structured data

 
 
Alan Mead
      12-08-2004
I have five files of contact info (one for each year of a conference).
All five have slightly different fairly unstructured formats. One looks
like this:

Bush, George, President, 1 White House Way, Washington,
DC 00000; (E-Mail Removed)
Kerry, John, 1 Main, Detroit, MI 00000; (E-Mail Removed)
Williams, Robin, 2 Main, Burbank, CA 00000
Newman, Paul, President and Principal Spokesperson,
Paul Newmans's Own Brand Foods, 123 Main Street,
Olympia Fields, WY 00000; (E-Mail Removed)
Blair, Tony, 1 Downing Street, London, UK 0000000
.... etc..

So the fields are comma-separated, except for email which may be absent,
and the record may be split over two or three lines.

In a later file dozens of records appear on the same line.

I'd like to output

lname=Bush
fname=George
address=President, 1 White House Way, Washington, DC 00000
email=(E-Mail Removed)
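
To make the target output concrete, here is a minimal sketch of the last step only: turning one already complete record into the four fields. `parse_record` is a hypothetical helper, not code from the thread, and it assumes that continuation lines have already been joined and that an email, when present, is whatever follows the last semicolon.

```perl
use strict;
use warnings;

# parse_record is a hypothetical helper, not code from the thread.
# Assumptions: the record is already complete (continuation lines joined),
# and an email, when present, is whatever follows the last semicolon.
sub parse_record {
    my ($rec) = @_;
    $rec =~ s/\s+/ /g;          # fold line breaks and runs of spaces
    $rec =~ s/^\s+|\s+$//g;     # trim both ends

    my $email = '';
    $email = $1 if $rec =~ s/;\s*([^;]*)$//;

    # Limit of 3: everything after the second comma stays in the address.
    my ($lname, $fname, $address) = split /\s*,\s*/, $rec, 3;
    return {
        lname   => $lname,
        fname   => $fname,
        address => defined $address ? $address : '',
        email   => $email,
    };
}

my $r = parse_record(
    "Bush, George, President, 1 White House Way, Washington,\nDC 00000; (E-Mail Removed)"
);
print "$_=$r->{$_}\n" for qw(lname fname address email);
```

The hard part of the problem, deciding where one record stops and the next begins, is deliberately left out of this sketch.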

Any ideas how to parse this using Perl? So far I can parse about 60% of
the records with the hack below. It gets tripped up when a record
contains many commas (some people have five lines of address with
embedded commas); in those cases it parses the first half of the
record fairly well and then tries to parse the second half as a new
record.

-Alan

my $i=0;
while($i<=$count) {
    $i++;
    my($lname,$fname,$address,$email)=('','','','');
    my $line = $lines{$i};
    if ($line =~ /[,;]$/) {              # clearly more on next line
        $lines{$i+1} = "$line $lines{$i+1}";
        next;
    }
    if ( (scalar split/,/,$line) > 4) {  # a proper name and address will
                                         # have at least 5 parts
        if ($line =~ /@/) {
            my @bits = split(/;/,$line); # email is last element when split
                                         # on semicolons, so save it
            $email = pop(@bits);
            $line = join(';',@bits);     # put line back together (just
                                         # in case there's more than one
                                         # semicolon in the record)
        }
        my @bits = split(/,/,$line);     # now split on commas
        $lname = shift @bits;            # lname is first bit
        $fname = shift @bits;            # followed by fname
        $address = join(',',@bits);      # the rest is the address
    } else {
        $lines{$i+1} = "$line $lines{$i+1}";
        next;
    }
    ....
}


 
A. Sinan Unur
      12-08-2004
Alan Mead <(E-Mail Removed)> wrote in
news:(E-Mail Removed):

> I have five files of contact info (one for each year of a conference).
> All five have slightly different fairly unstructured formats. One looks
> like this:
>
> Bush, George, President, 1 White House Way, Washington,
> DC 00000; (E-Mail Removed)
> Kerry, John, 1 Main, Detroit, MI 00000; (E-Mail Removed)
> Williams, Robin, 2 Main, Burbank, CA 00000
> Newman, Paul, President and Principal Spokesperson,
> Paul Newmans's Own Brand Foods, 123 Main Street,
> Olympia Fields, WY 00000; (E-Mail Removed)
> Blair, Tony, 1 Downing Street, London, UK 0000000
> ... etc..


Here is somewhat of a kludge that "works" for the snippet you posted. Hope
this helps.

#! perl

use strict;
use warnings;

use File::Slurp;

# Slurp the whole input and flatten it into one long line
my $input = read_file(\*DATA);
$input =~ tr/\n/ /;

my @records;

while(length $input) {
    my %record;
    $record{lname} = grab_name($input);
    $record{fname} = grab_name($input);
    # The address is taken to end at two capital letters plus digits
    $input =~ /[A-Z]{2} \d+/g;
    $record{address} = substr $input, 0, pos($input);
    $input = substr $input, pos($input);
    # An optional email follows a semicolon
    if($input =~ /^;\s*(\w+\@\w+\.\w+)\s*/g) {
        $record{email} = $1;
        $input = substr $input, pos $input;
    }
    push @records, \%record;
}

use Data::Dumper;
print Dumper \@records;

# Remove and return everything up to the first comma of the argument
sub grab_name {
    my $off = index $_[0], ',';
    my $name = substr $_[0], 0, $off;
    $_[0] = substr $_[0], $off + 2;
    return $name;
}

__DATA__
Bush, George, President, 1 White House Way, Washington,
DC 00000; (E-Mail Removed)
Kerry, John, 1 Main, Detroit, MI 00000; (E-Mail Removed)
Williams, Robin, 2 Main, Burbank, CA 00000
Newman, Paul, President and Principal Spokesperson,
Paul Newmans's Own Brand Foods, 123 Main Street,
Olympia Fields, WY 00000; (E-Mail Removed)
Blair, Tony, 1 Downing Street, London, UK 0000000
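
The loop above assumes every match succeeds. As a hedged sketch of the same idea with a guard, the whole record can be matched in one pattern anchored with `\G`, so the loop simply stops at the first stretch of text that does not fit. `split_records` is a hypothetical name; the "two capitals plus digits" terminator is kept from the code above, and the `(E-Mail Removed)` alternative exists only because this archive has stripped the real addresses.

```perl
use strict;
use warnings;

# split_records is a hypothetical name, not code from the thread.
# It returns the parsed records plus whatever text could not be matched.
sub split_records {
    my ($text) = @_;
    $text =~ tr/\n/ /;
    my @records;
    while ($text =~ /\G \s*
                     ([^,]+) , \s*               # last name
                     ([^,]+) , \s*               # first name
                     (.*? [A-Z]{2} [ ] \d+)      # address, ending in a code plus digits
                     (?: ; \s* ( \(E-Mail[ ]Removed\) | \S+\@\S+ ) )?   # optional email
                    /gcx) {
        my %record = (lname => $1, fname => $2, address => $3);
        $record{email} = $4 if defined $4;
        push @records, \%record;
    }
    # With the c modifier, pos() survives the final failed match.
    my $leftover = substr $text, pos($text) // 0;
    return (\@records, $leftover);
}

my ($records, $leftover) = split_records(
    "Kerry, John, 1 Main, Detroit, MI 00000; (E-Mail Removed)\n"
  . "Williams, Robin, 2 Main, Burbank, CA 00000\n"
);
printf "%s %s: %s\n", @{$_}{qw(fname lname address)} for @$records;
```

Anything other than whitespace in `$leftover` marks the point where the heuristic gave up, which is a reasonable place to start inspecting the data by hand.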


 
Alan Mead
      12-08-2004
On Wed, 08 Dec 2004 04:04:53 +0000, A. Sinan Unur wrote:

> Here is somewhat of a kludge that "works" for the snippet you posted. Hope
> this helps.
>
> #! perl
> use strict;
> use warnings;
> use File::Slurp;
> my $input = read_file(\*DATA);
> $input =~ tr/\n/ /;
> my @records;
> while(length $input) {
> my %record;
> $record{lname} = grab_name($input);
> $record{fname} = grab_name($input);
> $input =~ /[A-Z]{2} \d+/g;
> $record{address} = substr $input, 0, pos($input);
> $input = substr $input, pos($input);
> if($input =~ /^;\s*(\w+\@\w+\.\w+)\s*/g) {
> $record{email} = $1;
> $input = substr $input, pos $input;
> }
> push @records, \%record;
> }

[...]

And so it does very nicely. I think you are making use of the fact that
these all had a pair of capital letters near the end (including the
convenient UK) but there is a 'D.C.' in my data and some other
addresses outside the US (that lack this feature). I should have included
a better sample. But this may get me to 95% ... The way you've slurped the
file makes this perfectly applicable to the rest of the files which is a
REALLY BIG help.
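
One way to cover the 'D.C.' case is to keep the end-of-address pattern in a single variable and add alternatives to it as exceptions turn up. The following is only a sketch: `$ADDRESS_END` and `take_address` are hypothetical names, and it assumes that 'D.C.' is followed by a numeric code like the other entries. Addresses with no such code at all are still not handled.

```perl
use strict;
use warnings;

# Hypothetical, extensible end-of-address pattern: a two-letter code or
# the literal "D.C.", then a space and a run of digits.
my $ADDRESS_END = qr/(?:[A-Z]{2}|D\.C\.)[ ]\d+/;

# take_address is a hypothetical helper. It returns (address, remainder),
# or an empty list when no end-of-address marker is found.
sub take_address {
    my ($text) = @_;
    return unless $text =~ /\A(.*?$ADDRESS_END)/s;
    my $address = $1;
    return ($address, substr($text, length $address));
}

my ($address, $rest) = take_address('1 Main, Washington, D.C. 00000 Blair, Tony');
print "address=$address\n" if defined $address;
```

Returning an empty list on failure lets the caller tell "no terminator found" apart from an empty address and flag that record for manual review.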

Thanks!

-Alan
 
A. Sinan Unur
      12-08-2004
Alan Mead <(E-Mail Removed)> wrote in
news:(E-Mail Removed):

> On Wed, 08 Dec 2004 04:04:53 +0000, A. Sinan Unur wrote:
>
>> $input =~ /[A-Z]{2} \d+/g;

....

> And so it does very nicely. I think you are making use of the fact
> that these all had a pair of capital letters near the end (including
> the convenient UK) but there is a 'D.C.' in my data and some other
> addresses outside the US (that lack this feature).


Actually, that is a standing for some kind of Country/State Code with
numeric postal code match because all your addresses seemed to end with
that.

The "two capital letters followed by some digits as end of mailing address
indicator" was one of the things that made the code kludgy.

I am sure others will provide better ways once the sun comes up. Good luck.

Sinan.
 
A. Sinan Unur
      12-08-2004
"A. Sinan Unur" <(E-Mail Removed)> wrote in
news:Xns95B8F3ED5DCB9asu1cornelledu@132.236.56.8:

> Actually, that is a standing for some kind of Country/State Code with
                      ^^^^^^^^
I meant 'stand-in'. Sorry.

Sinan
 