regex for URL in a log file

 
 
Jaga
 
      10-02-2003
Hail all,
I am trying to write a regular expression to match a URL in a text file.
The test file looks like the sample below the row of asterisks.
I would like to match all the URLs and print them out...
I think this is easy for most, but it is a pain in the neck for me.

thanks!


************
;V8q|`<F- L/& ?Q ` h  
6/$h :2003091520030922:
tfred@http://quintillium.com/mslegal/tssi986
URL  ssq|`<F- L/ ?Q ` h  
6/$h :2003091520030922:
tfred@http://ninet/Lists/Announcements/DispForm.h


 
 
 
 
 
Jaga
 
      10-02-2003
Hail again,
Here is some code I 'lifted' from different places to do pretty much
what I want... Unfortunately, it doesn't work, and I am working on trying to
fix it...
##########################
open IFILE,"<log.txt" or die "Can't Open file:: $!";

@lines=<IFILE>;

$text = join "\n", @lines;

@hrefs=($text=~ m{ \"(?-)|http\:\/\/(.*?))\"\s+ }x);

print "list of href values\n";
$count = 1;
foreach $href (@hrefs) {
print "$href\n";
$count++;
}
print $count;

close IFILE;
##########################
thanks,
Jaga
 
 
 
 
Jaga
 
      10-02-2003
I changed the regex to look like this:
@hrefs=($text=~ m{http\:\/\/(.*?)\s+ }x);
Unfortunately, it only returns:
quintillium.com/mslegal/tssi986

and doesn't return the other URL.
How can I apply it repeatedly throughout the whole $text string?
Or how can I do this more efficiently?
 
Glenn Jackman
 
      10-02-2003
Jaga <(E-Mail Removed)> wrote:
> I am trying to write a regular expression to match a url in a text file.


Don't reinvent the wheel:

use Regexp::Common qw(URI);
my @urls;
while (<>) {
    push @urls, /$RE{URI}{HTTP}/g;
}

--
Glenn Jackman
NCF Sysadmin
 
Florian von Savigny
 
      10-02-2003


One way to do it:

$text = "blabla soiu apoj match poi aigjpo match poua ier";

while ($text =~ /[^a-z](match)[^a-z]/g) {
    print $1, "\n";
}

this outputs:

match
match

The crucial thing is the /g (global) modifier, which causes the
matching to continue after the first match until there are no more matches.
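
Applied to the URL problem, the same loop might look like the sketch below. It assumes the URLs start with "http://" and run up to the next whitespace character; the sample text is modeled on the two URL lines from the original post, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Sample text modeled on the URL lines from the original post;
# a real script would read this from the log file instead.
my $text = join "\n",
    'tfred@http://quintillium.com/mslegal/tssi986',
    'URL  ssq 6/$h :2003091520030922:',
    'tfred@http://ninet/Lists/Announcements/DispForm.h',
    '';

my @found;
# In scalar context, /g resumes after the previous match on each iteration.
while ($text =~ m{(http://\S+)}g) {
    push @found, $1;
}

print "$_\n" for @found;
```

The parentheses enclose the "http://" prefix here, so each printed item is a complete URL.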

> @hrefs=($text=~ m{http\:\/\/(.*?)\s+ }x);
> unfortunately, it only returns:
> quintillium.com/mslegal/tssi986


This seems obvious, since you've excluded the "http://" from the
parentheses. I've never formulated such a thing the way you have done
here, but you might try adding the g modifier (x does something else:
it means "extended regular expressions", which means that you
can use comments and whitespace inside your regex to make it more
readable); it might work similarly to my while () loop. However, as this
seems to return the contents of the first pair of parentheses (all $1,
so to speak), I wouldn't want to guess what it returns if you use more
than one pair.
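
A minimal sketch of the list-context behaviour, assuming a single pair of parentheses and sample text modeled on the original post (the variable names are illustrative):

```perl
use strict;
use warnings;

my $log = "tfred\@http://quintillium.com/mslegal/tssi986\n"
        . "tfred\@http://ninet/Lists/Announcements/DispForm.h\n";

# Without /g, list context returns the captures of the first match only.
my @first_only = $log =~ m{http://(\S+)};

# With /g, list context returns the captures of every match.
my @all = $log =~ m{http://(\S+)}g;

# Widening the parentheses keeps the "http://" prefix in the result.
my @with_scheme = $log =~ m{(http://\S+)}g;

print scalar(@first_only), " vs ", scalar(@all), "\n";
print "$_\n" for @with_scheme;
```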

Some more hints:

- if you use delimiters other than //, as you have done, you need not
escape the "/" in the regex; and you never need to escape ":"

- it is often a good idea to define matches by what they must NOT be:
e.g., formulate the body of the URL as "[^\s]+" (assuming it is
indeed delimited by some whitespace character). This has the side
effect of being helpful with tools such as grep, which don't support
minimal matching quantifiers (*?).

- if you do not want to exclude protocols other than HTTP, you might
want to say something like "(http|ftp|news|mailto)" instead of just
"http" (but see above). You'd have to adjust the slashes, of course.

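The three hints can be combined in one short sketch. The input string is hypothetical (it is not from the original log), and "mailto" is left out because it is not followed by "//", which is the slash adjustment mentioned in the last hint.

```perl
use strict;
use warnings;

# Hypothetical input mixing several schemes; not taken from the original log.
my $mixed = "see http://example.com/a and ftp://example.org/pub/file.txt\n"
          . "also news://news.example.net/comp.lang.perl.misc\n";

# m{...} delimiters: "/" and ":" need no backslashes.
# (?:...) groups the alternation without adding a capture;
# \S+ takes everything up to the next whitespace character.
my @uris = $mixed =~ m{((?:http|ftp|news)://\S+)}g;

print "$_\n" for @uris;
```
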
--


Florian v. Savigny

If you are going to reply in private, please be patient, as I only
check for mail something like once a week. - Si vous allez répondre
personnellement, patientez s.v.p., car je ne lis les courriels
qu'environ une fois par semaine.
 
 
Florian von Savigny
 
      10-02-2003

Florian von Savigny <(E-Mail Removed)> writes:

> However, as this
> seems to return the contents of the first pair of parentheses (all $1,
> so to speak), I wouldn't want to guess what it returns if you use more
> than one pair.


Sorry, got it: it returns What You Would Expect: if you have two pairs
of parentheses, it will return $1, $2, for the first match, then $1,
$2 for the second, and so on. So using more than one pair of
parentheses probably makes your approach unwieldy, as you'd probably
have to post-process your list.
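
A sketch of what that post-processing could look like, assuming two capture groups (host and path) and sample text modeled on the original post; the names are illustrative only:

```perl
use strict;
use warnings;

my $sample = "tfred\@http://quintillium.com/mslegal/tssi986\n"
           . "tfred\@http://ninet/Lists/Announcements/DispForm.h\n";

# Two capture groups: host in $1, path in $2.
# In list context with /g the result is flat: ($1, $2, $1, $2, ...).
my @flat = $sample =~ m{http://([^/\s]+)(/\S*)}g;

# Post-process the flat list two items at a time.
my @pairs;
while (my ($host, $path) = splice @flat, 0, 2) {
    push @pairs, [$host, $path];
}

printf "%-20s %s\n", @$_ for @pairs;
```

With a single pair of parentheses around the whole URL, this regrouping step is not needed.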

 
Tad McClellan
 
      10-03-2003
Florian von Savigny <(E-Mail Removed)> wrote:

> e.g., formulate the body of the URL as "[^\s]+"



or as \S+ which matches exactly the same characters.
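
A quick check of that equivalence, as a sketch on one sample line modeled on the original post (the trailing text is added here only to give the match something to stop at):

```perl
use strict;
use warnings;

my $line = "tfred\@http://ninet/Lists/Announcements/DispForm.h trailing text";

# Negated character class versus the \S shorthand.
my ($negated)   = $line =~ m{(http://[^\s]+)};
my ($shorthand) = $line =~ m{(http://\S+)};

print $negated eq $shorthand ? "same match\n" : "different match\n";
```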


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
 
Ted Zlatanov
 
      10-03-2003
 