On 07/22/2006 02:56 AM, Chris Newman wrote:
> I am working on a script to process a large number of old electoral records.
> There are about 100,000 records in all but here is a representative sample
>
> BTW hd =household duties
>
>
> ALLISON, Winifred hd
> BRACKENREG, Helen & James hd & lands officer
> MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver
>
> Note that the first names are in the same sequence as the occupations. An
> occupation may consist of one or two words eg 'hd' or 'tractor driver'. The
> last of these sample records has 3 'Marshalls' Margaret, Charles and Herbert
>
> though other records include up to six family members. In all cases there is
>
> a pattern:
>
> 1 person . . . occupation is immediately followed by a line return
> (naturally)
> 2 people . . . first occupation is followed by an '&', last occupation by
> line return
> 3 or more people . . . the first and up to the second last occupation are
> followed by commas and the remainder of the line follows the aforementioned
> patterns
>
>
> My initial thoughts
> Use a global REGEX that would step though and match the next occupation but
> it has not proved that easy. Need a way to move the 'matching point forward
> to a ampersand, comma or line return depending on context. If anyone could
> provide some insights into whether RE can provide this level of control or
> point me to a more appropriate solution.
>
>
> Here the relevant code snippet:
>
>
> #preceding code to do with last name, addresses etc This part works well
>
> @matches = (m/\s([A-Z][a-z]+\s)/g); # holds all first names for a record
>
> foreach $FirstName (@matches ) {
>
> (m/(\s[a-z]+(.*?)(&|,|$))/g); # fails to match but the first occupation
>
> $Occupation =$1; # stores the next matching occupation with each successive
> loop
>
> print ("\"$FirstName\",\"$Occupation\");
>
> }
>

The newsgroup comp.lang.perl is defunct; comp.lang.perl.misc
is where the action is.

I like to break problems into pieces and eat away at them
piece by piece. For this problem, I'd use the s/// operator to
match and remove the parts of the string that I'm looking for.
Your strings are organized like so: <family-name>,
<first-names> <occupations>. So I'd suggest stripping off
(while matching) the family name first, followed by the
first names, followed by the occupations. And since '&' seems
to serve the same function as the comma, I'd convert
all &'s to commas before doing the real work, e.g.

use Data::Dumper;
my $data = q{
ALLISON, Winifred hd
BRACKENREG, Helen & James hd & lands officer
MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver
};
open (FH, "<", \$data) or die("Couldn't open in-memory file.\n");
while (my $line = <FH>) {
$_ = $line;
s/^\s+//;
s/\s+$//;
next if m/^$/;
my ($fam,@names,@occup);
s/\&/,/g;
if (s/^([A-Z]+),\s*//) { $fam = $1 }
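# Consume the separator (optional comma plus surrounding spaces) along
# with each item, so captured names and occupations carry no stray spaces.
while (s/^([A-Z][a-z]+)\s*,?\s*//) { push @names, $1 }
while (s/^([a-z]+(?: [a-z]+)*)\s*,?\s*//) { push @occup, $1 }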
print Data::Dumper->Dump([$fam,\@names,\@occup],
[qw(family names occupations)]);
}
close FH;
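
Since the first names and occupations come out in the same order, you
can pair them by index to get the quoted, comma-separated lines you
were printing. Here is a rough sketch of that. The sub name
parse_record is just something I made up for illustration, and I'm
assuming family names are all capitals and occupations all lowercase,
as in your sample. Names like O'BRIEN or McDonald would need a looser
pattern.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): split one record into the
# family name, a list of first names and a list of occupations, using
# the same strip-as-you-match idea as above.
sub parse_record {
    my ($rec) = @_;
    $rec =~ s/^\s+|\s+$//g;    # trim both ends
    $rec =~ s/&/,/g;           # treat '&' like ','
    my ($fam, @names, @occup);
    if ($rec =~ s/^([A-Z]+)\s*,\s*//) { $fam = $1 }
    while ($rec =~ s/^([A-Z][a-z]+)\s*,?\s*//) { push @names, $1 }
    while ($rec =~ s/^([a-z]+(?: [a-z]+)*)\s*,?\s*//) { push @occup, $1 }
    return ($fam, \@names, \@occup);
}

my $sample = 'MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver';
my ($fam, $names, $occup) = parse_record($sample);

# Pair each first name with the occupation at the same position.
for my $i (0 .. $#$names) {
    my $job = defined $occup->[$i] ? $occup->[$i] : '';
    printf qq{"%s","%s","%s"\n}, $fam, $names->[$i], $job;
}
```

For the sample record this prints one line per person:

"MARSHALL","Margaret","hd"
"MARSHALL","Charles","ganger"
"MARSHALL","Herbert","tractor driver"

With 100,000 records it would be worth checking that the number of
names equals the number of occupations and logging any record where
they differ.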