Regexp help.

 
 
Cab
      06-02-2006
Hi all.

I'm trying to set up a script to strip out URLs from the body of a
Usenet post.

Any clues please? I have some expressions that I'm using, but they're
very long-winded and inefficient, as seen below. At the moment, I've
done this in bash, but want to eventually set up a perl script to do
this.

So far I've got this small script that moves URLs that start at
the beginning of a line into a file. This is the easy part (note: I
know this is messy, but this is still a dev script at the moment).

---
echo remove spaces from the start of lines
sed 's/^ *//g' sorted_file > 1

echo "Remove every line containing '>' (quoted text)"
sed '/>/d' 1 > 2

echo uniq the file
uniq 2 > 3


echo Move all lines beginning with http or www into another file
sed -n '/^http/p' 3 > 4
sed -n '/^www/p' 3 >> 4

echo Remove all junk on lines from "space" to EOL
sed 's/ .*$//' 4 > 4.1

echo uniq the file
uniq 4.1 > 4.2

echo "So far, I've got a file with all www and http only."
mv 4.2 http_and_www_only
---

Once I've stripped these lines (easy enough), I have a file that
remains like this:

----
And the URL is:
Anton, try reading: url:http://ukrm.net/faq/UKRMsCBT.html
Anyone got any experience with http://www.girlsbike2.com/ ? SWMBO needs
Anyone still got the url of the pages about the woman who keeps going
Are available on: http://www.spete.net/ukrm/sedan06/index.html
are July 6-8. The reason being "Power Big Meet",
http://www.bigmeet.com/ ,
Are you sure? http://www.usgpru.net/
a scout around www.nslu2-linux.org - and perhaps there isn't any easier
asked where the sinks were and if you could plug curling tongs into the
----

The result I want is a list like the following:

http://ukrm.net/faq/UKRMsCBT.html
http://www.girlsbike2.com/
http://www.spete.net/ukrm/sedan06/index.html
http://www.bigmeet.com/
http://www.usgpru.net/
www.nslu2-linux.org

Can anyone give me some clues or pointers to websites where I can go
into this in more detail please?
--
Cab
 
Mirco Wahab
      06-02-2006
Thus spoke Cab (on 2006-06-02 15:57):

> I'm trying to set up a script to strip out URL's from the body of a
> Usenet post.
> The result I want is a list like the following:
>
> http://ukrm.net/faq/UKRMsCBT.html
> http://www.girlsbike2.com/
> http://www.spete.net/ukrm/sedan06/index.html
> http://www.bigmeet.com/
> http://www.usgpru.net/
> www.nslu2-linux.org


The following prints all links
(starting with http or www) read from standard input.

Usage:
$> perl dumplinks.pl < text.txt

#!/usr/bin/perl
use strict;
use warnings;

my $data = do {local $/; <> };
print "$1\n" while $data =~ /(\b(http|www)\S+)/g;

# or:
# while (<>) {
# print "$1\n" while /(\b(http|www)\S+)/g;
# }


Of course, this can be done by a one-liner.

Regards

Mirco
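
To see what this kind of pattern picks up, here is a minimal sketch that runs a slightly tightened variant over a few lines from Cab's sample. The helper name extract_links and the de-duplication are assumptions added for illustration and are not part of the script above; the pattern also uses www\. rather than www, which is a deliberate change.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A few sample lines taken from Cab's post.
my $data = <<'END';
And the URL is:
Anton, try reading: url:http://ukrm.net/faq/UKRMsCBT.html
Anyone got any experience with http://www.girlsbike2.com/ ? SWMBO needs
are July 6-8. The reason being "Power Big Meet",
http://www.bigmeet.com/ ,
a scout around www.nslu2-linux.org - and perhaps there isn't any easier
END

# extract_links() is a hypothetical helper name, not from the thread.
# It returns each link once, in order of first appearance.
sub extract_links {
    my ($text) = @_;
    my (%seen, @links);
    # Non-capturing group for the prefix; "www\." instead of "www"
    # so that ordinary words starting with "www" are not picked up.
    while ($text =~ /\b((?:http|www\.)\S+)/g) {
        push @links, $1 unless $seen{$1}++;
    }
    return @links;
}

print "$_\n" for extract_links($data);
```

On these lines it should print the links they contain, one per line. Punctuation glued directly to a URL (a closing parenthesis or full stop with no space before it) would still be captured, so real posts may need an extra clean-up step.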
 
Dr.Ruud
      06-02-2006
Cab wrote:

> Subject: Regexp help.


Please go and read the Posting Guidelines.

--
Affijn, Ruud

"Gewoon is een tijger."


 
Paul Lalli
      06-02-2006
Cab wrote:
> I'm trying to set up a script to strip out URL's from the body of a
> Usenet post.


<snip bash script>

> Can anyone give me some clues or pointers to websites where I can go
> into this in more detail please?


1. Open the original file for reading.
2. Open two files for writing: one for the modified file, one for the
   list of URLs.
3. Loop through each line of the original file:
   a. Search for a URI, using Regexp::Common::URI. Replace it with
      nothing, and be sure to capture the URI.
   b. Print the modified line to the modified file.
   c. Print the captured URI to the URI file.

Documentation to help you in this goal:
open a file: perldoc -f open
Looping: perldoc perlsyn
Reading a line from a file: perldoc -f readline
Using search-and-replace: perldoc perlop, perldoc perlretut
Regexp::Common::URI:
http://search.cpan.org/~abigail/Rege.../Common/URI.pm
printing to a file: perldoc -f print
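
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the sub name strip_uris is made up, the demo runs on in-memory filehandles rather than real files, and a hand-written pattern stands in for Regexp::Common::URI so that the example runs without any CPAN module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in pattern, NOT the Regexp::Common::URI one: it accepts
# http(s) URLs and bare "www." hosts. With Regexp::Common installed,
# a pattern such as $RE{URI}{HTTP} could be swapped in instead.
my $uri_re = qr{\b(?:https?://|www\.)\S+};

# strip_uris() is a hypothetical name. It reads lines from $in, writes
# each line minus its URIs to $modified, and each URI to $uris.
sub strip_uris {
    my ($in, $modified, $uris) = @_;
    while (my $line = <$in>) {
        # Remove one URI per pass, capturing it, until none are left.
        while ($line =~ s/($uri_re)//) {
            print {$uris} "$1\n";
        }
        print {$modified} $line;
    }
}

# Demo on in-memory filehandles; real code would open three files
# (see perldoc -f open) in the same way.
my $text = "Are you sure? http://www.usgpru.net/\nno link here\n";
my ($body, $list) = ('', '');
open my $in,  '<', \$text or die "open input: $!";
open my $out, '>', \$body or die "open modified: $!";
open my $uri, '>', \$list or die "open URI list: $!";
strip_uris($in, $out, $uri);
close $_ for $in, $out, $uri;

print "modified:\n$body", "uris:\n$list";
```

Passing filehandles into the sub keeps the loop itself independent of where the text comes from, so the same code works on files, on standard input, or on strings as in the demo.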

Once you have made your *perl* attempt, if it doesn't work the way you
want, feel free to post it here to seek assistance. In the meantime,
be sure to read the posting guidelines for this group. They are posted
here twice a week.

Paul Lalli

 
Xicheng Jia
      06-02-2006
Cab wrote:
> Hi all.
>
> I'm trying to set up a script to strip out URL's from the body of a
> Usenet post.


you can start from here:

lynx -dump http://your_url | grep -o '\(http://\|www\.\)[^[:space:]]*'

then filter out any unwanted links.

HTH,
Xicheng

 
Cab
      06-02-2006
Mirco Wahab wrote:

> Thus spoke Cab (on 2006-06-02 15:57):
>
> > I'm trying to set up a script to strip out URL's from the body of a
> > Usenet post.
> > The result I want is a list like the following:
> >
> > http://ukrm.net/faq/UKRMsCBT.html
> > http://www.girlsbike2.com/
> > http://www.spete.net/ukrm/sedan06/index.html
> > http://www.bigmeet.com/
> > http://www.usgpru.net/
> > www.nslu2-linux.org

>
> The following prints all links
> (starting w/http or www) from $text
>
> use:
> $> perl dumplinks.pl < text.txt
>
> #!/usr/bin/perl
> use strict;
> use warnings;
>
> my $data = do {local $/; <> };
> print "$1\n" while $data =~ /(\b(http|www)\S+)/g;
>
> # or:
> # while (<>) {
> # print "$1\n" while /(\b(http|www)\S+)/g;
> # }
>
>
> Of course, this can be done by an one-liner
>
> Regards
>
> Mirco


Ta very much for that. Very helpful.

--
Cab
 
Cab
      06-02-2006
Paul Lalli wrote:

> Documentation to help you in this goal:
> open a file: perldoc -f open
> Looping: perldoc perlsyn
> Reading a line from a file: perldoc -f readline
> Using search-and-replace: perldoc perlop, perldoc perlretut
> Regexp::Common::URI:
> http://search.cpan.org/~abigail/Rege...Regexp/Common/URI.pm
> printing to a file: perldoc -f print


^^^^^^^^^^^^^^^^^^^

Ah, that's handy. Thanks.

--
Cab
 
Dr.Ruud
      06-02-2006
Mirco Wahab wrote:

> my $data = do {local $/; <> };
> print "$1\n" while $data =~ /(\b(http|www)\S+)/g;


{ local ($", $\, $/) = ("\n", "\n", undef) ;
print "@{[ <> =~ /(\b(?:http|www\.)\S+)/g ]}"
}

But read `perldoc -q URL`.

--
Affijn, Ruud

"Gewoon is een tijger."


 
John W. Krahn
      06-02-2006
Dr.Ruud wrote:
> Mirco Wahab schreef:
>
>> my $data = do {local $/; <> };
>> print "$1\n" while $data =~ /(\b(http|www)\S+)/g;

>
> { local ($", $\, $/) = ("\n", "\n", undef) ;
> print "@{[ <> =~ /(\b(?:http|www\.)\S+)/g ]}"
> }



{ local ( $,, $\, $/ ) = ( "\n", "\n" );
print <> =~ /\b(?:http|www\.)\S+/g
}



John
--
use Perl;
program
fulfillment
 
Dr.Ruud
      06-03-2006
John W. Krahn wrote:
> Dr.Ruud:
>> Mirco Wahab:


>>> my $data = do {local $/; <> };
>>> print "$1\n" while $data =~ /(\b(http|www)\S+)/g;

>>
>> { local ($", $\, $/) = ("\n", "\n", undef) ;
>> print "@{[ <> =~ /(\b(?:http|www\.)\S+)/g ]}"
>> }

>
> { local ( $,, $\, $/ ) = ( "\n", "\n" );
> print <> =~ /\b(?:http|www\.)\S+/g
> }


Yes, that certainly is a cleaner variant. I did hesitate to put the
C<undef> at the end of the rightside list, but decided it would be more
educational. But then I was already trapped in using C<$"> where C<$,>
is cleaner.
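
For anyone following along, the difference between $" and $, and the reason the trailing undef can be dropped, may be easier to see in isolation. The sketch below is illustrative only: the @links array holds two entries copied from Cab's list, and the three blocks are not taken from either variant above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @links = ('http://www.bigmeet.com/', 'www.nslu2-linux.org');

{
    # $" joins array elements interpolated into a double-quoted string;
    # $\ is appended after every print.
    local ($", $\) = ("\n", "\n");
    print "@links";
}

{
    # $, is placed between the items of print's argument list,
    # so no string interpolation is needed.
    local ($,, $\) = ("\n", "\n");
    print @links;
}

{
    # A list assignment leaves surplus variables undef, so $/ ends up
    # undef here even though the right-hand side has only two values.
    local ($,, $\, $/) = ("\n", "\n");
    print defined($/) ? 'still set' : 'input record separator is undef';
}
```

The first two blocks should produce the same output, which is why $, is the more direct choice when the list is handed straight to print.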

--
Affijn, Ruud

"Gewoon is een tijger."


 