Velocity Reviews - Computer Hardware Reviews

regex search - suggestions?

 
 
Sara
Guest
Posts: n/a
 
      07-24-2004
Hi All,
I have a string (a paragraph) without newlines, with organization
names and their abbreviations in brackets like...

$tmp = "... was proposed by World Health Organisation (WHO) in ...";

I have the following code segment:

$tmp =~ s/\)/\)\n<brk>/g; # because we have . in regex and
                          # there is no \n in $tmp
my ($abbr,$org) = "";
my (%orgs) = ();
foreach my $line (split (/\n/, $tmp)) {
    if ($line =~ /\b([A-Z])(\w+[ forand]*) ([A-Z])(.*?) \((\1\3[A-Z]*)\)/) {
        $abbr = $5; $org = "$1$2 $3$4";
        $orgs{$abbr} = $org;
    }
}
I added [ forand]* in regex to include 'for', 'of', 'and' that might
appear after the first word.
Can anyone help me to improve the accuracy of this search, especially
the [ forand]* part.
Thanks in advance.
 
Ilmari Karonen
Guest
Posts: n/a
 
      07-24-2004
On 2004-07-24, Sara <(E-Mail Removed)> wrote:
> Hi All,
> I have a string (a paragraph) without newlines, with organization
> names and their abbreviations in brackets like...
>
> $tmp = "... was proposed by World Health Organisation (WHO) in ...";


...and you want to extract the organization names and abbreviations?

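# Split into alternating (preceding text, abbreviation) pairs.
my @tmp = split /\s*\(([A-Z]+)\)/, $tmp;
pop @tmp;   # drop the text after the last abbreviation

my %orgs;
while (my ($str, $abbr) = splice(@tmp, 0, 2)) {
    # Follow each abbreviation letter with lowercase letters or non-word
    # characters. ${1} is in braces so the following [ is taken literally
    # and not as an array subscript.
    (my $re = $abbr) =~ s/(.)/${1}[a-z\\W]*/g;
    $str =~ /.*($re)$/s or warn "Can't expand $abbr!\n" and next;
    $orgs{$abbr} = $1;
}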


> Can anyone help me to improve the accuracy of this search, especially


If you could provide more sample data, I could do some more thorough
testing. My code works for your example case, and probably quite many
others. Some cases where it fails for various reasons include:

World Wide Web Consortium (W3C)
PlayStation 2 (PS2)
Church of Scientology (CoS)
Skip if Equal (SEQ)
Decrement and Jump if Not Zero (DJN)
Deutscher Jugendbund für Naturbeobachtung (DJN)
GNU's Not Unix (GNU)

Most of those can be fixed, although idiosyncratic abbreviations like
W3C are probably not worth the effort.
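
As a rough illustration of why some of the cases listed above fail, here is a minimal sketch of the expansion step on its own. The helper name `expand_abbr` and the test strings are made up for this example, and the pattern is built with `join`/`map` instead of `s///`, so it is the same idea as the code above and not the exact code. Capitals or digits that are not part of the abbreviation stop the match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the expansion of $abbr found at the end of
# $text, or undef if the letters cannot be lined up.
sub expand_abbr {
    my ($text, $abbr) = @_;
    # After each abbreviation letter allow lowercase letters and
    # non-word characters (spaces, punctuation), but no capitals or digits.
    my $re = join '', map { quotemeta($_) . '[a-z\W]*' } split //, $abbr;
    return $text =~ /.*($re)$/s ? $1 : undef;
}

for my $case (
    [ 'proposed by World Health Organisation', 'WHO' ],
    [ 'Skip if Equal',                         'SEQ' ],
    [ 'Decrement and Jump if Not Zero',        'DJN' ],
) {
    my ($text, $abbr) = @$case;
    my $found = expand_abbr($text, $abbr);
    print "$abbr: ", (defined $found ? $found : '(no expansion found)'), "\n";
}
```

Only the WHO case should expand here: SEQ has no Q in its text, and in the DJN case the capital Z of "Zero" is neither a lowercase letter nor a non-word character.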

--
Ilmari Karonen
If replying by e-mail, please replace ".invalid" with ".net" in address.
 
Tad McClellan
Guest
Posts: n/a
 
      07-24-2004
Sara <(E-Mail Removed)> wrote:

> I added [ forand]* in regex to include 'for', 'of', 'and' that might
> appear after the first word.



That will match exactly the same strings as:

[adfnor ]*

It would match:

aaaaaa
afafafaf

etc.

A character class matches a _character_, not a string.
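
To make the difference concrete, here is a small sketch comparing the character class with an alternation of whole words. The helper names and test strings are invented for illustration only.

```perl
use strict;
use warnings;

# [ forand]* accepts any run of the characters ' ', f, o, r, a, n, d.
sub matches_class { return $_[0] =~ /^[ forand]*$/ ? 1 : 0 }

# An alternation matches only the listed words as whole strings.
sub matches_words { return $_[0] =~ /^(?:for|of|and)$/ ? 1 : 0 }

for my $s ('for', 'of', 'and', 'aaaaaa', 'afafafaf', 'fan') {
    printf "%-10s class:%d words:%d\n", $s, matches_class($s), matches_words($s);
}
```

Every test string passes the character class, but only 'for', 'of' and 'and' pass the alternation.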


> Can anyone help me to improve the accuracy of this search, especially
> the [ forand]* part.



(for|of|and)


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
Sara
Guest
Posts: n/a
 
      07-26-2004
Ilmari Karonen wrote in message
>...and you want to extract the organization names and abbreviations?

Yes, forgot to mention that

>If you could provide more sample data, I could do some more thorough
>testing. My code works for your example case, and probably quite many

I have got organization names like ...
European Process Safety Centre (EPSC)
Association of British Chemical Manufacturers (ABCM)
Safety and Reliability Directorate (SRD)
# The next one was not found by your code
Health and Safety at Work etc. Act 1974 (HSWA)
Advisory Committee on Major Hazards (ACMH)
Center for Chemical Process Safety (CCPS)

>Most of those can be fixed, although idiosyncratic abbreviations like
>W3C are probably not worth the effort.

I agree, I don't want to work for it either


Tad McClellan wrote in message
> That will match exactly the same strings as:
> [adfnor ]*
>
> > Can anyone help me to improve the accuracy of this search, especially
> > the [ forand]* part.

>
> (for|of|and)


That was almost exactly what I tried first:
$line =~ /\b([A-Z])(\w+)( for| of| and)? ([A-Z])(.*?) \((\1\4[A-Z]*)\)/;
$abbr = $6; $org = "$1$2$3 $4$5";
$orgs{$abbr} = $org;

since 'for','of','and' don't get included in abbreviations, but won't
it produce 'Use of uninitialized value in ...' for those which don't
have 'for','of','and'? Is that ignorable?
Thanks,
Sara
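
On the warning question: when an optional capture group such as `( for| of| and)?` does not take part in the match, `$3` is undefined, and interpolating it under `use warnings` does produce "Use of uninitialized value". One way to avoid that, sketched below, is to move the `?` inside the capture group so that the group always participates and captures an empty string when no such word is present. The helper name `parse_org` is hypothetical; the sample strings are taken from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (abbreviation, organisation) or an empty list.
sub parse_org {
    my ($line) = @_;
    # The optional words sit in a non-capturing group inside group 3, so
    # group 3 always participates and captures '' when none is present.
    if ($line =~ /\b([A-Z])(\w+)((?: for| of| and)?) ([A-Z])(.*?) \((\1\4[A-Z]*)\)/) {
        return ($6, "$1$2$3 $4$5");
    }
    return;
}

for my $line (
    'European Process Safety Centre (EPSC)',
    'Center for Chemical Process Safety (CCPS)',
) {
    my ($abbr, $org) = parse_org($line);
    print "$abbr => $org\n" if defined $abbr;
}
```

Testing `defined $3` before using it would work as well; the version above just keeps the string-building line unchanged.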
 