regex help!

 
 
Geoff Cox
      09-13-2003
Hello,

I am trying to extract email addresses from about 1000 htm files.

So far I am trying

if ($line =~ /Mailto:(.*)"/) {
    print OUT ("$1 \n");
}

where the line is

<a href="private.php?do=newpm&u="

The problem is with the " after the email address and the "greedy" regex
characteristic, which finds another " further along the line ...

Can I stop at the first " mark?

Cheers

Geoff
 
 
Andreas Kahari
      09-13-2003
Geoff Cox wrote:
> Hello,
>
> I am trying to extract email addresses from about 1000 htm files.


E-mail address harvesting in your spare time, are you?

> if ($line =~ /Mailto:(.*)"/) {
> print OUT ("$1 \n");

[cut]
> problem is with the " after the email address and the "greedy" regex
> characteristic which finds other " further along the line ...


Read the perlre manual about changing the "greediness" of a
quantifier with "?".
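
A minimal sketch of the difference, assuming a made-up input line (the
address and the extra class attribute are hypothetical, not taken from the
original files). A plain .* grabs as much as it can; adding ? after the
quantifier makes it grab as little as it can:

```perl
use strict;
use warnings;

# Hypothetical sample line; the address and the second attribute are made up.
my $line = '<a href="Mailto:someone@example.com" class="contact">';

# Greedy: .* takes as much as it can, so the match ends at the LAST quote.
my ($greedy) = $line =~ /Mailto:(.*)"/;

# Non-greedy: .*? takes as little as it can, so the match ends at the FIRST quote.
my ($lazy) = $line =~ /Mailto:(.*?)"/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture runs on into the class attribute, while the non-greedy
one stops at the end of the address.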


--
Andreas Kähäri
 
 
Michael Budash
      09-13-2003
Geoff Cox wrote:

> Hello,
>
> I am trying to extract email addresses from about 1000 htm files.
>
> So far am trying
>
> if ($line =~ /Mailto:(.*)"/) {
> print OUT ("$1 \n");
>
> where the line is
>
> <a href="private.php?do=newpm&u="
>
> problem is with the " after the email address and the "greedy" regex
> characteristic which finds other " further along the line ...
>
> can I stop at the first " mark?


/Mailto:(.*?)"/

You know that won't match your example, don't you? Not unless you add the 'i'
flag (for 'i'gnore case):


/Mailto:(.*?)"/i
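
To illustrate the point about case, here is a small sketch that assumes the
HTML spells the scheme in lowercase (the sample line and address are made up
for the example):

```perl
use strict;
use warnings;

# Hypothetical sample line with a lowercase "mailto:".
my $line = '<a href="mailto:someone@example.com">';

# Without /i the capital "M" in the pattern has to match literally, so this fails.
my $matched_plain = ($line =~ /Mailto:(.*?)"/) ? 1 : 0;

# With /i, "Mailto", "mailto" and "MAILTO" are all accepted.
my ($address) = $line =~ /Mailto:(.*?)"/i;

print "matched without /i: $matched_plain\n";
print "captured with /i:   $address\n";
```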

hth-

--
Michael Budash
 
Geoff Cox
      09-13-2003
On Sat, 13 Sep 2003 07:33:31 GMT, Michael Budash wrote:

>/Mailto:(.*?)"/
>
>you know that won't match your example don't you? unless you add the 'i'
>flag (for 'i'gnore case):


Michael,

Thanks for the help - the following code works now, but I get the warning
"uninitialized value in string ne at ..." on the line marked with **
below - do you know why?

Cheers

Geoff

use warnings;
use strict;

use File::Find;

open (OUT, ">>out");

my $dir = 'c:/atemp1/directory';

find ( sub {

    open (IN, "$_");
    my $line = <IN>;
**  while ($line ne "") {
        if ($line =~ /Mailto:(.*?)"/i) {
            print OUT ("$1 \n");
        }
        $line = <IN>;
    }

}, $dir);

close (OUT);


>
>/Mailto:(.*?)"/i
>
>hth-


 
Andreas Kahari
      09-13-2003
Geoff Cox wrote:
[cut]
> Thanks for the help - following code works now but I get the error
> message "uninitialized value in string ne at ..." on the line with a **
> below - do you know why?

[cut]
> open (IN, "$_");
> my $line = <IN>;
> ** while ($line ne "") {
> if ($line =~ /Mailto:(.*?)"/i) {
> print OUT ("$1 \n");

[cut]


What happens at the end of a file? Well, <IN> will give you an
undefined value. This will also happen if the open() call failed.
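
A small sketch of that end-of-file behaviour, using an in-memory filehandle
with made-up contents so that no file on disk is needed (the lexical handle
$in is a substitute for the bareword IN used in the thread):

```perl
use strict;
use warnings;

# Read from an in-memory "file" that holds exactly one line.
my $text = "only line\n";
open(my $in, '<', \$text) or die "Failed in open(): $!";

my $first  = <$in>;    # the one line that exists
my $second = <$in>;    # nothing left to read: undef, not ""

print "first is ",  (defined $first  ? "defined" : "undefined"), "\n";
print "second is ", (defined $second ? "defined" : "undefined"), "\n";

close($in);
```

Comparing that undef with ne "" is what triggers the warning. The loop still
stops, because undef is treated as an empty string in the comparison.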


--
Andreas Kähäri
 
Geoff Cox
      09-13-2003
On Sat, 13 Sep 2003 08:21:39 +0000 (UTC), Andreas Kahari wrote:

>Geoff Cox wrote:
>[cut]
>> Thanks for the help - following code works now but I get the error
>> message "uninitialized value in string ne at ..." on the line with a **
>> below - do you know why?

>[cut]
>> open (IN, "$_");
>> my $line = <IN>;
>> ** while ($line ne "") {
>> if ($line =~ /Mailto:(.*?)"/i) {
>> print OUT ("$1 \n");

>[cut]
>
>
>What happens at the end of a file? Well, <IN> will give you an
>undefined value. This will also happen if the open() call failed.


Andreas,

Ah! Well, the open call works, so it must be the end-of-file part. Is
there a better way than using while ($line ne "")? eof?

Geoff

 
Andreas Kahari
      09-13-2003
Geoff Cox wrote:
> On Sat, 13 Sep 2003 08:21:39 +0000 (UTC), Andreas Kahari wrote:
>>Geoff Cox wrote:

[cut]
>>> open (IN, "$_");
>>> my $line = <IN>;
>>> ** while ($line ne "") {
>>> if ($line =~ /Mailto:(.*?)"/i) {
>>> print OUT ("$1 \n");

>>[cut]
>>
>>
>>What happens at the end of a file? Well, <IN> will give you an
>>undefined value. This will also happen if the open() call failed.

>
> Andreas,
>
> ah! well the open call works so must be the end of file part - is
> there a better way than using while ($line ne "" ) ? eof?


Yes, a much much better way:

while(defined($line = <IN>)) {
... code ...
}

And personally I would say

open(IN, $_) or die "Failed in open(): $!";
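
Put together, the two suggestions might look roughly like the sketch below.
The helper name extract_mailto, the sample input and the address are all made
up for illustration, and the lexical filehandle with three-argument open is a
substitution for the bareword IN used above, not something from the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: returns whatever follows "mailto:" (up to the first
# double quote) on each line of an already opened filehandle.
sub extract_mailto {
    my ($in) = @_;
    my @found;
    # defined() ends the loop cleanly at end of file, where <$in> returns undef.
    while (defined(my $line = <$in>)) {
        if ($line =~ /mailto:(.*?)"/i) {
            push @found, $1;
        }
    }
    return @found;
}

# Made-up input, read through an in-memory filehandle instead of a real file.
my $html = join("\n",
    '<p>no link here</p>',
    '<a href="mailto:someone@example.com">mail</a>',
) . "\n";

open(my $in, '<', \$html) or die "Failed in open(): $!";
my @addresses = extract_mailto($in);
close($in);

print "$_\n" for @addresses;
```

Inside a File::Find callback the same loop would run once per file, with the
open() check reporting any file that cannot be read.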


Cheers,
Andreas

--
Andreas Kähäri
 
Geoff Cox
      09-13-2003
On Sat, 13 Sep 2003 08:39:03 +0000 (UTC), Andreas Kahari wrote:

>Yes, a much much better way:
>
> while(defined($line = <IN>)) {
> ... code ...
> }
>
>And personally I would say
>
> open(IN, $_) or die "Failed in open(): $!";


will use both - thanks!

Geoff

>
>
>Cheers,
>Andreas


 
Eric J. Roode
      09-13-2003
Geoff Cox wrote:

> I am trying to extract email addresses from about 1000 htm files.
>
> So far am trying
>
> if ($line =~ /Mailto:(.*)"/) {
> print OUT ("$1 \n");
>
> where the line is
>
> <a href="private.php?do=newpm&u="
>
> problem is with the " after the email address and the "greedy" regex
> characteristic which finds other " further along the line ...
>
> can I stop at the first " mark?


Change your thinking a bit. Instead of matching "Mailto:" followed by as
many characters as possible followed by a quote, match "Mailto:" followed
by as many non-quote characters as possible followed by a quote:

if ($line =~ /Mailto:([^"]*)"/)

Also consider making it case-insensitive with the i modifier.
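
A minimal sketch of this approach, again assuming a made-up input line (the
address and the class attribute are hypothetical):

```perl
use strict;
use warnings;

# Hypothetical sample line; the address and the extra attribute are made up.
my $line = '<a href="Mailto:someone@example.com" class="contact">';

# [^"]* matches any run of characters that are not a double quote, so the
# capture cannot run past the first closing quote even though * is greedy.
my ($address) = $line =~ /Mailto:([^"]*)"/i;

print "$address\n";
```

Because the character class itself excludes the quote, no non-greedy
quantifier is needed here.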

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

 
Geoff Cox
      09-13-2003
On Sat, 13 Sep 2003 09:22:06 -0500, "Eric J. Roode" wrote:


>Change your thinking a bit. Instead of matching "Mailto:" followed by as
>many characters as possible followed by a quote, match "Mailto:" followed
>by as many non-quote characters as possible followed by a quote:
>
> if ($line =~ /Mailto:([^"]*)"/)


Thanks Eric - will give it a try...

Cheers

Geoff

>
>Also consider making it case-insensitive with the i modifier.


 