Help sought Perl with a bit of REGEX

 
 
Chris Newman
 
      07-22-2006
I am working on a script to process a large number of old electoral records.
There are about 100,000 records in all; here is a representative sample
(hd = household duties):


ALLISON, Winifred hd
BRACKENREG, Helen & James hd & lands officer
MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver

Note that the first names are in the same sequence as the occupations. An
occupation may consist of one or two words, e.g. 'hd' or 'tractor driver'. The
last of these sample records has three 'Marshalls' (Margaret, Charles and
Herbert), though other records include up to six family members. In all cases
there is a pattern:

1 person . . . the occupation is immediately followed by a line return
2 people . . . the first occupation is followed by an '&', the last occupation
by a line return
3 or more people . . . each occupation before the second last is followed by
a comma, and the last two follow the 2-people pattern


My initial thoughts
Use a global REGEX that would step through and match the next occupation, but
it has not proved that easy. I need a way to move the 'matching point' forward
to an ampersand, comma or line return depending on context. Could anyone
provide some insight into whether REs can provide this level of control, or
point me to a more appropriate solution?


Here is the relevant code snippet:


# Preceding code deals with last names, addresses etc. That part works well.

@matches = (m/\s([A-Z][a-z]+\s)/g); # holds all first names for a record

foreach $FirstName (@matches) {

    (m/(\s[a-z]+(.*?)(&|,|$))/g); # fails to match anything but the first occupation

    $Occupation = $1; # should store the next matching occupation on each successive loop

    print("\"$FirstName\",\"$Occupation\"\n");

}

 
 
 
 
 
Mumia W.
 
      07-22-2006
On 07/22/2006 02:56 AM, Chris Newman wrote:
> My initial thoughts
> Use a global REGEX that would step through and match the next occupation, but
> it has not proved that easy. I need a way to move the 'matching point' forward
> to an ampersand, comma or line return depending on context. Could anyone
> provide some insight into whether REs can provide this level of control, or
> point me to a more appropriate solution?


The newsgroup comp.lang.perl is defunct. Comp.lang.perl.misc
is where the action is.

I like to break problems into pieces and eat away at them
piece-by-piece. For this problem, I'd use the s/// operator to
match and remove parts of the string that I'm looking for.

Your strings are organized like so: <family-name>
<first-names> <occupations>. So I'd suggest stripping off
(while matching) the family-names first, followed by the
first-names, followed by the occupations. And since '&' seems
to have a function that's the same as the comma, I'd convert
all &'s to commas before doing the real work, e.g.

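use strict;
use warnings;
use Data::Dumper;

my $data = q{
ALLISON, Winifred hd
BRACKENREG, Helen & James hd & lands officer
MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver
};

# Read the sample records through an in-memory filehandle.
open(my $fh, "<", \$data) or die("Couldn't open in-memory file.\n");

while (my $line = <$fh>) {
    $_ = $line;
    s/^\s+//;          # trim leading whitespace
    s/\s+$//;          # trim trailing whitespace (including the newline)
    next if m/^$/;     # skip blank lines

    my ($fam, @names, @occup);
    s/\&/,/g;          # treat '&' as just another separator

    # Strip each field off the front of the string as it is matched.
    # (Captured occupations may keep stray surrounding spaces.)
    if (s/^([A-Z]+),\s*//) { $fam = $1 }
    while (s/^([A-Z][a-z]+)(\s*,\s*)?//) { push @names, $1 }
    while (s/^([a-z ]+)(\s*,\s*)?//) { push @occup, $1 }

    print Data::Dumper->Dump([$fam, \@names, \@occup],
                             [qw(family names occupations)]);
}

close $fh;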
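
For comparison, here is a rough sketch of the same divide-and-conquer idea done with one match plus split instead of repeated s///, printing the quoted "first name","occupation" style output the original snippet was aiming for (with the family name added). The helper name parse_record is made up for illustration, and the sketch assumes that family names are plain capitals, first names are single capitalised words, and occupations are lower-case words, as in the three sample records. Real data with names like O'BRIEN would need a looser pattern.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split one record into the
# family name plus a list of [first name, occupation] pairs.
# Returns an empty list if the line does not fit the assumed layout.
sub parse_record {
    my ($line) = @_;

    my ($family, $names, $occups) = $line =~ m{
        ^\s*
        ([A-Z]+) , \s*                                  # FAMILY,
        ( [A-Z][a-z]+ (?: \s*[,&]\s* [A-Z][a-z]+ )* )   # Name, Name & Name
        \s+
        ( .+? )                                         # occupations
        \s*$
    }x or return;

    # Both lists use ',' and '&' as separators.
    my @names  = split /\s*[,&]\s*/, $names;
    my @occups = split /\s*[,&]\s*/, $occups;
    return unless @names == @occups;    # counts must agree to pair them up

    return ($family, map { [ $names[$_], $occups[$_] ] } 0 .. $#names);
}

for my $line (
    'ALLISON, Winifred hd',
    'BRACKENREG, Helen & James hd & lands officer',
    'MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver',
) {
    my ($family, @people) = parse_record($line) or next;
    printf qq{"%s","%s","%s"\n}, $family, @$_ for @people;
}
```

Records where the number of names and occupations differ are skipped here rather than guessed at; with 100,000 old records it is probably worth logging those lines for manual review instead.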


 