Find length of files

 
 
Michael Preminger
      06-23-2004
Hello!

I have a huge directory, for which I need the word-count of all files
(like wc -w *, and then put all the lengths into a database)

Is there a smart way to do it in perl? (apart from wc * > file and then
open the file..)

Thanks

Michael
 
John Bokma
      06-23-2004
Michael Preminger wrote:

> Hello!
>
> I have a huge directory, for which I need the word-count of all files
> (like wc -w *, and then put all the lengths into a database)
>
> Is there a smart way to do it in perl? (apart from wc * > file and then
> open the file..)


For each file:
open the file,
put each word in a hash, like $words{$word}++;
This way you get the count of each word.
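
A minimal sketch of the loop described above, assuming the files are named on the command line and that a "word" is any whitespace-separated token. The sub name word_frequencies is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref mapping each word to the number
# of times it occurs across all of the given files.
sub word_frequencies {
    my @files = @_;
    my %words;
    for my $file (@files) {
        my $fh;
        unless ( open $fh, '<', $file ) {
            warn "Cannot open $file: $!\n";
            next;
        }
        while ( my $line = <$fh> ) {
            # split ' ' splits on runs of whitespace and skips leading whitespace
            $words{$_}++ for split ' ', $line;
        }
        close $fh;
    }
    return \%words;
}

my $freq = word_frequencies(@ARGV);
print "$freq->{$_} $_\n" for sort keys %$freq;
```

Note that this gives one count per distinct word, not one count per file.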

--
John MexIT: http://johnbokma.com/mexit/
personal page: http://johnbokma.com/
Experienced Perl programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html
 
Rusty Phillips
      06-23-2004
The wordcount is not the number of times a word occurs, John.
It's the number of words in a file.

Using wc will probably be faster than the perl way (unless there's a
perl command for doing wordcounts I'm not aware of), but you don't
need to open a file.
Just use a pipe.

http://www.devdaily.com/perl/edu/art...pl010004.shtml
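
A minimal sketch of the pipe approach, assuming a Unix-like system with wc on the PATH. The sub names parse_wc_output and wc_word_counts are made up for illustration. The list form of open avoids the shell, so file names with spaces are passed through untouched (names containing newlines would still confuse the parsing).

```perl
use strict;
use warnings;

# Hypothetical helper: turn lines of "wc -w" output ("  12 name") into a
# hash ref of name => count. $nfiles is the number of files given to wc;
# with more than one file wc appends a "total" line, which is dropped.
sub parse_wc_output {
    my ( $nfiles, @lines ) = @_;
    pop @lines if $nfiles > 1;
    my %count;
    for my $line (@lines) {
        chomp $line;
        my ( $n, $name ) = $line =~ /^\s*(\d+)\s+(.*)$/ or next;
        $count{$name} = $n;
    }
    return \%count;
}

# Hypothetical helper: run wc once for all files and read it through a pipe.
sub wc_word_counts {
    my @files = @_;
    return {} unless @files;
    open my $wc, '-|', 'wc', '-w', '--', @files
        or die "Cannot run wc: $!";
    my @lines = <$wc>;
    close $wc;
    return parse_wc_output( scalar @files, @lines );
}

my $counts = wc_word_counts(@ARGV);
print "$counts->{$_} $_\n" for sort keys %$counts;
```

For a really huge directory the argument list may exceed the system limit, in which case the files would have to be passed to wc in batches.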
 
A. Sinan Unur
      06-23-2004
Rusty Phillips wrote:

> The wordcount is not the number of times a word occurs, John.
> It's the number of words in a file.


Just a minor point: You _can_ obtain the number of words in a file by
adding the number of occurrences of each word in the file.
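
A small sketch of that point, assuming words are whitespace-separated tokens. The sub name total_from_frequencies is made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: count how often each word occurs in a string,
# then add up the per-word counts to get the total number of words.
sub total_from_frequencies {
    my ($text) = @_;
    my %words;
    $words{$_}++ for split ' ', $text;
    return sum( 0, values %words );    # the leading 0 covers empty input
}

# "the" and "cat" occur twice each, "saw" and "other" once: 2+2+1+1
print total_from_frequencies("the cat saw the other cat\n"), "\n";    # prints 6
```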

--
A. Sinan Unur
John Bokma
      06-23-2004
Rusty Phillips wrote:

> The wordcount is not the number of times a word occurs, John.
> It's the number of words in a file.


Ah, ok, then do something like:

$words_in_file{$filename}++;

for every word you find.
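
A minimal sketch of that idea, assuming the files are named on the command line. It adds a whole line's worth of words at a time instead of incrementing once per word, which gives the same totals. The sub name words_per_file is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref of file name => number of
# whitespace-separated words in that file.
sub words_per_file {
    my @files = @_;
    my %words_in_file;
    for my $filename (@files) {
        my $fh;
        unless ( open $fh, '<', $filename ) {
            warn "Cannot open $filename: $!\n";
            next;
        }
        $words_in_file{$filename} = 0;    # empty files still get an entry
        while ( my $line = <$fh> ) {
            my @words = split ' ', $line;
            $words_in_file{$filename} += @words;
        }
        close $fh;
    }
    return \%words_in_file;
}

my $result = words_per_file(@ARGV);
print "$result->{$_} $_\n" for sort keys %$result;
```

The resulting hash has one entry per file, which is a convenient shape for inserting into a database afterwards.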

> Using wc will probably be faster than the perl way (unless there's a
> perl command for doing wordcounts I'm not aware of),


Uhm, I guess for every file you have to fork, so I doubt it.

Hard to say, without benchmarking.

 
John Bokma
      06-23-2004
Purl Gurl wrote:

> count and number of characters. Some say number of
> characters is file size but I am not sure if this
> includes file headers or not. I have never compared
> character count to file size. Perhaps they are the
> same measure.


Guess so: " -c, --bytes, --chars
print the byte counts"

(man wc)

File headers are of course a thing that's hard to guess for a program
like wc, and depends on the file contents specification.

 
Tad McClellan
      06-24-2004
Michael Preminger wrote:

> I have a huge directory, for which I need the word-count of all files
> (like wc -w *, and then put all the lengths into a database)
>
> Is there a smart way to do it in perl?



Yes and no, depending on the definition of "smart".


> (apart from wc * > file and then
> open the file..)



I don't like shelling-out for things that are easily done
in native Perl.

Perhaps you can adapt this "wc -w" workalike one-liner to your purposes?

perl -ane '$c+=@F; print("$c $ARGV\n"), $c=0 if eof(ARGV)' *


or suitable for a Real Program:

my $cnt = 0;
while ( <> ) {
    my @words = split;
    $cnt += @words;
    if ( eof(ARGV) ) {    # end of the current file
        print "$cnt $ARGV\n";    # print, not printf: a '%' in a file name would be read as a format
        $cnt = 0;
    }
}


--
Tad McClellan SGML consulting
Perl programming
Fort Worth, Texas
 
Joe Smith
      06-24-2004
Purl Gurl wrote:

>>(apart from wc * > file and then open the file..)

>
> That syntax produces three outputs, line count, word
> count and number of characters. Some say number of
> characters is file size but I am not sure if this
> includes file headers or not. I have never compared
> character count to file size. Perhaps they are the
> same measure.


What do you mean by file headers?
On the systems I work with, files do not have any headers.
The character count is exactly equal to the file size.
-Joe
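
A quick way to check this on a given system is sketched below: it reads each file in raw mode, adds up the bytes, and prints the total next to the size reported by -s. This assumes ordinary files on a Unix-like filesystem; the sub name byte_count is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count the bytes in a file by actually reading it.
sub byte_count {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "Cannot open $file: $!";
    my ( $total, $buf ) = ( 0, '' );
    while ( my $n = read $fh, $buf, 65536 ) {
        $total += $n;
    }
    close $fh;
    return $total;
}

for my $file (@ARGV) {
    printf "%s: read %d bytes, -s reports %d\n",
        $file, byte_count($file), -s $file;
}
```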
 
Anno Siegel
      06-24-2004
Tad McClellan wrote in comp.lang.perl.misc:
> Michael Preminger wrote:
>
> > I have a huge directory, for which I need the word-count of all files
> > (like wc -w *, and then put all the lengths into a database)
> >
> > Is there a smart way to do it in perl?

>
>
> Yes and no, depending on the definition of "smart".
>
>
> > (apart from wc * > file and then
> > open the file..)

>
>
> I don't like shelling-out for things that are easily done
> in native Perl.
>
> Perhaps you can adapt this "wc -w" workalike one-liner to your purposes?
>
> perl -ane '$c+=@F; print("$c $ARGV\n"), $c=0 if eof(ARGV)' *
>
>
> or suitable for a Real Program:
>
> my $cnt = 0;
> while ( <> ) {
>     my @words = split;
>     $cnt += @words;
>     if ( eof(ARGV) ) {
>         print "$cnt $ARGV\n";
>         $cnt = 0;
>     }
> }


Alternatively, tr/// can be used if speed is an issue but space isn't.

my $cnt;
for ( do { local $/; <> } ) {
    tr/\n\t / /s; # replace sequences of white space with single blanks
    $cnt = tr/ //; # count blanks
}

Because split() ignores leading white space but tr/// doesn't, the
tr/// count may be one higher than the split() count (and one lower
if the file doesn't end in white space), but that's small stuff.
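
A small sketch comparing the two counting methods on strings rather than files. The tr/// method counts runs of white space, so it matches split() when the text has no leading white space and ends in white space (such as a final newline), and can differ by one otherwise. The sub names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count words the split() way.
sub count_with_split {
    my ($text) = @_;
    my @words = split ' ', $text;
    return scalar @words;
}

# Hypothetical helper: count words the tr/// way (works on a copy of the text).
sub count_with_tr {
    my ($text) = @_;
    $text =~ tr/\n\t / /s;       # squeeze runs of white space to one blank
    return $text =~ tr/ //;      # count the blanks
}

for my $text ( "one two three\n", "  one two three\n", "one two three" ) {
    printf "split=%d tr=%d\n", count_with_split($text), count_with_tr($text);
}
# prints:
# split=3 tr=3
# split=3 tr=4
# split=3 tr=2
```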

Anno
 
Rusty Phillips
      06-24-2004
> Uhm, I guess for every file you have to fork, so I doubt it.
>

You only have to run wc once ("wc -w *"), so there should only be one
fork. Because wc is a compiled program designed specifically for this
purpose, it is probably faster than perl at reading all of the files,
enough so to make up for the cost of forking once (a benchmark is
needed to be sure).

In addition, it makes the perl coding simpler. You don't have to
bother with globbing and opening and closing the multiple files it needs,
or with scanning through the files. All you have to do is parse the
output from the wc command.
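
One way to settle the speed question is the core Benchmark module. The sketch below assumes a Unix-like system with wc on the PATH and compares a native Perl count with a single "wc -w" run read through a pipe. The sub names and the iteration count of 10 are arbitrary choices for illustration, and the numbers should be treated as a rough guide since disk caching affects both sides.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Native Perl: read every file and count whitespace-separated words.
sub count_native {
    my %count;
    for my $file (@_) {
        open my $fh, '<', $file or next;
        my $n = 0;
        while ( my $line = <$fh> ) {
            my @words = split ' ', $line;
            $n += @words;
        }
        close $fh;
        $count{$file} = $n;
    }
    return \%count;
}

# One fork: run "wc -w" once over all files and parse its output.
sub count_with_wc {
    my @files = @_;
    open my $wc, '-|', 'wc', '-w', '--', @files
        or die "Cannot run wc: $!";
    my @lines = <$wc>;
    close $wc;
    pop @lines if @files > 1;    # drop the trailing "total" line
    my %count;
    for (@lines) {
        $count{$2} = $1 if /^\s*(\d+)\s+(.*)$/;
    }
    return \%count;
}

# Only benchmark when files are given on the command line.
if (@ARGV) {
    my @files = @ARGV;
    cmpthese( 10, {
        native => sub { count_native(@files) },
        wc     => sub { count_with_wc(@files) },
    } );
}
```

Run it from inside the directory in question, passing * as the argument list.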
 