Walking a tree and extracting info... Problems

 
 
jim.goodman@gmail.com
 
      04-09-2006
I am new to the perl thing and i am trying to extract some data from
some web pages and am having problems.... can someone please tell me
what i am doing wrong... i think i have become a charter member of the
"idiots 'r' us" club... )!

this is my script... pretty simple so far, i am just trying to get one
piece of info working to start. i can traverse the directory and print
the filenames, but it only seems to get the data and do the pattern
matching from the first file in the directory....

any hints would be appreciated!

#!/usr/bin/perl
$dir="/Users/test/";
opendir(DIRECTORY, $dir) || die("Cannot open directory");
@thefiles= readdir(DIRECTORY);
closedir(DIRECTORY);

foreach $file (@thefiles) {
    unless ( ($file eq ".") || ($file eq "..") || ($file eq ".DS_Store") ) {
        open FILE, "$dir/$file" or die "Can't open $file : $!";
        while( <FILE> ) {
            s/\t//; # ignore tabs by erasing them
            next if /^(\s)*$/; # skip blank lines
            chomp; # remove trailing newline characters
            push @lines, $_; # push the data line onto the array
        }
        close FILE;
        $string = "@lines";
        $n++;
        print "$n:$file:";
        $string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is;
        print "$1\n"; # print html page title
    }
}

 
Henry Law
 
      04-09-2006
jim.goodman@gmail.com wrote:
> I am new to the perl thing and i am trying to extract some date from
> some web pages and am having problems.... can someone please tell me
> what i am doing wrong... i think i have become a charter member of the
> "idiots 'r' us" club... )!


Nope; Perl is IMO harder to learn than some other languages. You're not
helping yourself enough though. I'll get to your problem in a moment,
but first some things you should do (a) to help you find your problems
before posting here, and (b) to get better and quicker help here.

1. Always code "use strict;" and "use warnings"; had you done so you
might have picked up the logic problem in your code, but it will
certainly ensure that you pick up many others.
2. Code not only a test program (well done for doing that) but also
some suitable data. I had to make some in order to do the testing.
3. Learn to use the Perl debugger (perl -d yourprog.pl) and to use the
breakpoint and examine commands. Doing that I found your problem in
one pass through the program.

> this is my script... pretty simple so far, i am just trying to get one
> piece of info working to start. i can traverse the directory and print
> the filenames, but it only seems to get the data and do the pattern
> matching from the first file in the directory....


What you mean is that once it has found a file with a match it then
finds that match in all subsequent files even if they themselves don't
have it. I recommend you try to be very precise about your problem.
Actually, showing your incorrect output is very precise and saves extra
thought on your part!

> #!/usr/bin/perl
> $dir="/Users/test/";


If you code "use strict" you'll need to put "my $dir", and the same
elsewhere in the file.

> opendir(DIRECTORY, $dir) || die("Cannot open directory");
> @thefiles= readdir(DIRECTORY);


This is OK as far as it goes but assumes you have enough memory to hold
the whole directory listing. Better practice is to read the directory
one entry at a time, as you've (partly) done with the file.
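A minimal sketch of reading the directory one entry at a time, rather than slurping it into an array first. The helper name each_plain_file and the callback interface are assumptions made for illustration; they are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Call $callback once for each plain file in $dir, reading one directory
# entry at a time. each_plain_file is a hypothetical helper name.
sub each_plain_file {
    my ($dir, $callback) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    my $count = 0;
    while ( defined( my $entry = readdir($dh) ) ) {
        next if $entry =~ /^\./;         # skip ".", ".." and other dot files
        next unless -f "$dir/$entry";    # skip anything that is not a plain file
        $callback->($entry);
        $count++;
    }
    closedir($dh);
    return $count;                       # number of files handed to the callback
}

# Print the plain files in the current directory.
each_plain_file('.', sub { print "$_[0]\n" });
```

The defined() test matters: a file literally named "0" would otherwise end the loop early.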

> closedir(DIRECTORY);
>
> foreach $file (@thefiles) {
> unless ( ($file eq ".") || ($file eq "..") || ($file eq ".DS_Store")
> ) {


A regex could do this (untested)

unless ( $file =~ /^\.{1,2}$|^\.DS_Store$/ ) {

.... but if you could just reject all "dot" files it would be even easier

unless ( $file =~ /^\./ )

> open FILE, "$dir/$file" or die "Can't open $file : $!";


Well done for checking the file open result. Lots of beginners don't.

> while( <FILE> ) {
> s/\t//; # ignore tabs by erasing them
> next if /^(\s)*$/; # skip blank lines
> chomp; # remove trailing newline characters
> push @lines, $_; # push the data line onto the array


Again, you're assuming that you always have enough memory for the whole
file.

Your problem is here. Because you didn't code "use strict" you aren't
forcing yourself to take control of the scope of your variables. Perl
has allocated "@lines" once for the whole program; when you process the
next file in the directory you push its lines onto the end of the same
array, so once one file has matched, the match for the HTML title fires
for every later file too. If you'd coded "my @lines" just before the
"while (<FILE>)" line then you'd have got a new "@lines" each time and
your program would have worked as you wanted it to.
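The difference can be seen in a small self-contained sketch. The two in-memory "files" below are made-up stand-in data, and the /title/ pattern is a simplified stand-in for the real regex.

```perl
use strict;
use warnings;

# Hypothetical stand-in data: each "file" is a list of lines.
my @files = ( [ 'title: first' ], [ 'no match here' ] );

# Buggy shape: one array shared by every iteration keeps growing.
my @shared;
my @buggy_hits;
for my $file (@files) {
    push @shared, @$file;
    push @buggy_hits, ( "@shared" =~ /title/ ? 1 : 0 );
}

# Fixed shape: "my" inside the loop gives a fresh array each time.
my @fixed_hits;
for my $file (@files) {
    my @lines = @$file;
    push @fixed_hits, ( "@lines" =~ /title/ ? 1 : 0 );
}

print "shared array: @buggy_hits\n";   # prints "shared array: 1 1"
print "fresh array:  @fixed_hits\n";   # prints "fresh array:  1 0"
```

The second file contains no title, yet the shared-array version still reports a match because the first file's lines are still in the array.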

> }
> close FILE;
> $string = "@lines";


This is ugly: interpolating an array in a string joins its elements with
single spaces, which is easy to overlook. Not that it doesn't give you
what you want, though ... it's up to you as to whether you want to write
with good style; an explicit join(" ", @lines) says the same thing more
clearly.

> $n++;


When "strict" forces you to code "my $n" then you'll have to put it
outside the directory-read loop.

> print "$n:$file:";
> $string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is;
> print "$1\n"; # print html page title


Always check the extracted text. When I fixed your program so it only
examined the text of the current file I got "Use of uninitialized value"
warnings from this statement every time it failed to find a match.

Here's a minimally-fixed version of your program which "works", in the
sense that it finds the HTML titles. It still needs quite a lot of
cleaning up and more Perlish idiom.

#!/usr/bin/perl
# Jim Goodman's problem April 9

use strict; use warnings; # I added this

#$dir="/Users/test/";
my $dir="F:/scratch"; # My directory instead of his

opendir(DIRECTORY, $dir) || die("Cannot open directory");
my @thefiles= readdir(DIRECTORY);
closedir(DIRECTORY);

my $n;
foreach my $file (@thefiles) {
    unless ( ($file eq ".") || ($file eq "..") || ($file eq ".DS_Store") ) {
        open FILE, "$dir/$file" or die "Can't open $file : $!";
        my @lines = ();
        while( <FILE> ) {
            s/\t//; # ignore tabs by erasing them
            next if /^(\s)*$/; # skip blank lines
            chomp; # remove trailing newline characters
            push @lines, $_; # push the data line onto the array
        }
        close FILE;
        my $string = "@lines";
        $n++;
        print "$n:$file:";
        $string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is;
        print "$1\n" if $1; # print html page title
    }
}

But I think I'd feel inclined to use "grep" to find the files that had the
relevant string in them, and pipe the output into a much smaller Perl
program to find the HTML titles and print them out. You'd lose the
incrementing count of the files, though.

--

Henry Law <>< Manchester, England
 
jim.goodman@gmail.com
 
      04-09-2006
thanks a million.... i want you to know that although the wanted result
was a bit different than what you suggested, your suggestions still
solved my problem. You should also know that i have taken your
suggestions into account and have cleaned up my code, and next time i
will include a sample input file and the output... i wanted to attach
it all and had prepared a nice little archive but... ).

again, thanks a million on resolving what was such a simple issue, i
was just not catching it ).... and if you think i should be learning
something other than perl, please speak up....!

 
 
Henry Law
 
      04-09-2006
jim.goodman@gmail.com wrote:

> again, thanks a million on resolving what was such a simple issue, i
> just not catching it ).... and if you think i should be learning
> something other than perl, please speak up....!


Absolutely not; once you're familiar with it Perl is easy and powerful.
It's just that for some reason (that I can't explain) it seems to me to
be harder to move from "writing random Perl code" to "writing good,
neat, compact-yet-understandable Perl" than it is to make the same
transition for other languages. Keep posting here - in a way that
helps you and helps us - and you'll get the hang of it.

--

Henry Law <>< Manchester, England
 
 
Henry Law
 
      04-09-2006
jim.goodman@gmail.com wrote:
> suggestions into account and have cleaned up my code, and next time i


By the way, walking a directory tree is _exactly_ what the File::Find
module does, and for many applications it's better. Have a look at it.
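A minimal File::Find sketch, assuming the goal is simply to list every plain, non-dot file under a starting directory. The helper name find_plain_files is made up for this example; find(), $_ and $File::Find::name are the module's documented interface.

```perl
use strict;
use warnings;
use File::Find;

# Return the path of every plain file under $top whose name does not
# start with a dot. find_plain_files is a hypothetical helper name.
sub find_plain_files {
    my ($top) = @_;
    my @found;
    find(
        sub {
            # Inside the callback $_ is the bare entry name and find()
            # has already chdir'ed into the directory that contains it.
            return if /^\./;                   # skip dot files such as .DS_Store
            return unless -f $_;               # keep plain files only
            push @found, $File::Find::name;    # path including $top
        },
        $top,
    );
    return @found;
}

print "$_\n" for find_plain_files('.');
```

Returning early from the callback skips that one entry only; this sketch still descends into directories whose names start with a dot. Unlike the readdir() loop, it also walks subdirectories, which the original script did not.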

--

Henry Law <>< Manchester, England
 
 
Tad McClellan
 
      04-09-2006
jim.goodman@gmail.com wrote:

> I am new to the perl thing



You should have a look at the Posting Guidelines that are
posted here frequently (even though you have composed a
very good first post).


> and i am trying to extract some date from
> some web pages and am having problems.... can someone please tell me
> what i am doing wrong...



Putting the lines from the 1st file into @lines, then tacking
on the lines from the 2nd file, then the 3rd ...


> i can traverse the directory and print
> the filenames, but it only seems to get the data and do the pattern
> matching from the first file in the directory....
>
> any hints would be appreciated!
>
> #!/usr/bin/perl


#!/usr/bin/perl
use warnings;
use strict;


> $dir="/Users/test/";


my $dir = '/Users/test/';

> foreach $file (@thefiles) {



Since you want a new @lines array for every iteration of this loop, and
since you will now be using "strict" forevermore <g>, put a declaration
here so that you will get a new @lines array each time through
the foreach loop:

my @lines;


> unless ( ($file eq ".") || ($file eq "..") || ($file eq ".DS_Store")



If you do it this way instead:

next if ($file eq ".") || ($file eq "..") || ($file eq ".DS_Store");

then you can save a level of indent.


> while( <FILE> ) {
> s/\t//; # ignore tabs by erasing them
> next if /^(\s)*$/; # skip blank lines
> chomp; # remove trailing newline characters
> push @lines, $_; # push the data line onto the array
> }



You eventually push() all of the lines from all of the files into @lines.

(the matching line from file 1 is in there every time.)


> $string = "@lines";



This adds space characters between each line. Is that what you wanted?
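A short sketch of what the interpolation does, using made-up sample lines. Array elements inside double quotes are joined with the list separator variable $", which is a single space by default.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the lines of one file.
my @lines = ( '<B>', 'Title', '</B>' );

my $interpolated = "@lines";            # joined with $" (a space by default)
my $spaced       = join ' ', @lines;    # the same result, spelled out
my $tight        = join '',  @lines;    # no separator at all

print "$interpolated\n";   # prints "<B> Title </B>"
print "$tight\n";          # prints "<B>Title</B>"
```

Which form is right depends on whether the pattern you match against expects those spaces.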


> $string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is;
> print "$1\n"; # print html page title



You should never use the dollar-digit variables unless you have
first tested to ensure that the match _succeeded_


if ( $string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is ){
    print "$1\n"; # print html page title
}

> $string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is;

                                         ^
                                         ^
                                         ^
Is there really a space there in the string you are matching against?


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
 
Tad McClellan
 
      04-10-2006
Henry Law wrote:
> jim.goodman@gmail.com wrote:
>> I am new to the perl thing and i am trying to extract some date from
>> some web pages and am having problems.... can someone please tell me
>> what i am doing wrong... i think i have become a charter member of the
>> "idiots 'r' us" club... )!


> to get better and quicker help here.
>
> 1. Always code "use strict;" and "use warnings"; had you done so you
> might have picked up the logic problem in your code, but it will
> certainly ensure that you pick up many others.
> 2. Code not only a test program (well done for doing that) but also
> some suitable data. I had to make some in order to do the testing.



Amen brother!


> 3. Learn to use the Perl debugger (perl -d yourprog.pl) and to use the
> breakpoint and examine commands. Doing that I found your problem in
> one pass through the program.



I've needed to use the Perl debugger about a dozen times in
over 10 years of daily Perl coding.

Carefully placed print() statements usually do it for me (warn()
statements actually, because STDERR is not buffered).
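A minimal sketch of that kind of debugging output, using made-up stand-in data. The "DEBUG:" prefix is only a convention chosen for this example.

```perl
use strict;
use warnings;

# Made-up stand-in for the lines read from one file.
my @lines = ( 'first line', 'second line' );

my $count = 0;
for my $line (@lines) {
    $count++;
    # warn() writes to STDERR. When the message has no trailing newline,
    # Perl appends the script name and line number, which shows where
    # the message came from.
    warn "DEBUG: line $count is [$line]";
}

print "processed $count lines\n";
```

Putting brackets around the value makes stray leading or trailing whitespace visible in the output.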

I'd not spend a lot of my limited time on the debugger for a while.


>> foreach $file (@thefiles) {
>> unless ( ($file eq ".") || ($file eq "..") || ($file eq ".DS_Store")
>> ) {

>
> A regex could do this (untested)
>
> unless ( $file =~ /^\.{1,2}$|^\.DS_Store$/ ) {



The code went from being easy to figure out, to requiring a bit
of analysis.

I would never use your regex alternative in a case like this.


>> open FILE, "$dir/$file" or die "Can't open $file : $!";

>
> Well done for checking the file open result. Lots of beginners don't.



And even more well done for remembering to glue the directory
part back onto the filename from readdir().


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 