Quoth "tak" <>:
>
> wrote:
> > "tak" <> wrote:
> > > Hi,
> > >
> > > I have a script, that loads a txt file, with 240k lines in it to a hash
> > > currently. And when it loads the data to the hash - it becomes slower
> > > and slower when it reaches maybe around 150k
> >
> > How much memory do you have? How much are you using at this point?
>
> From looking at the PF Usage, it is about 1.9 GB on a 1 GB machine;
> the available physical memory is down to about 10 MB when loading...
> But the CPU usage remains at only about 5%...
So, you are thrashing. You've run out of memory: I would suggest using
one of the DBM modules, probably DB_File. This stores the contents of
the hash in a (structured, binary, fast-to-index) file on disk, which
will probably make things faster.
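A minimal sketch of what tying a hash to DB_File might look like, assuming the module and the Berkeley DB library are installed on your machine. The file name records.db and the key/value strings are made up for illustration. Note that a DBM file stores flat strings only, so a hash of hashes would need flattened keys (or something like MLDBM) rather than nested references.

```perl
use strict;
use warnings;
use Fcntl;                        # exports O_RDWR and O_CREAT
use DB_File;                      # assumes Berkeley DB support is available
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/records.db";     # hypothetical file name

# From here on, %records lives in the file on disk rather than in memory.
tie my %records, 'DB_File', $file, O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "Cannot tie $file: $!";

# Made-up data standing in for the lines of the text file.
$records{"key$_"} = "value $_" for 1 .. 1000;

my $count  = keys %records;       # walks the file, so avoid it in a hot loop
my $lookup = $records{key42};

print "stored $count entries; key42 => $lookup\n";

untie %records;                   # flush and close the file
```

Each lookup now costs a little disk I/O, but the process no longer needs to hold the whole data set in RAM, which is the point when the machine is swapping.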
> > > (probably due to
> > > collision, since perl's hash uses linear chaining...).
> >
> > That is a rather unlikely bit of speculation, especially on a modern Perl.
> > How many buckets does your hash have and use? (print scalar %hash).
>
> I have 1 main_hash, which stores 27 hashes in it. And out of each 27
> hashes, it averages about 9k unique strings. print scalar %hash
> reports, 23/32. What does this number mean?
From perldoc perldata:
| If you evaluate a hash in scalar context, it returns false if the hash
| is empty. If there are any key/value pairs, it returns true; more
| precisely, the value returned is a string consisting of the number of
| used buckets and the number of allocated buckets, separated by a slash.
| This is pretty much useful only to find out whether Perl's internal
| hashing algorithm is performing poorly on your data set. For example,
| you stick 10,000 things in a hash, but evaluating %HASH in scalar
| context reveals "1/16", which means only one out of sixteen buckets has
| been touched, and presumably contains all 10,000 of your items. This
| isn't supposed to happen.
(Note that this is not meant as a rebuke: no one can be expected to have
all the arcana in Perl's std docs memorized. It is meant so that you may
remember where to find it next time.)
So, your main hash is using 23 buckets to store your 27 subhashes... not
such a useful thing to know. The real question is, how many buckets
does your original hash (with all the data in it) use? For instance, on
my perl
    my %h;
    for (1..240_000) {
        $h{$_} = 1;
    }
    print scalar %h;
prints '157199/262144', so the hash is using 157199 buckets, and each
bucket has on average 240000/157199 ~~ 1.5 entries in it, which should
not be a problem.
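On newer perls (5.26 and later, as far as I know), a hash in scalar context just returns the number of keys, so the snippet above would print 240000 there. The used/allocated figure is still available from Hash::Util::bucket_ratio. A sketch, assuming such a perl; the exact ratio will vary between builds and runs:

```perl
use strict;
use warnings;
use Hash::Util qw(bucket_ratio);   # assumes perl 5.26 or later

my %h;
$h{$_} = 1 for 1 .. 240_000;

my $ratio = bucket_ratio(%h);      # a "used/allocated" string
my ($used, $allocated) = split m{/}, $ratio;
my $count = keys %h;

# Average chain length over the buckets that are actually in use.
printf "%s: about %.2f entries per used bucket\n", $ratio, $count / $used;
```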
> > When the facts don't fit your theory, re-examine your theory. You probably
> > have a swapping problem, not a hash collision problem. And if you do have
> > a collision problem, the better way to fix it would be to start out with a
> > higher number of buckets, by assigning to the keys function.
> >
>
> Can you elaborate on what you mean by a swapping problem?
Your system has started thrashing: the working set (the pages in current
use) has exceeded the size of physical memory, and the system is
spending all its time swapping things in and out.
> And I thought
> about assigning a higher number of buckets to the hash itself, but I
> cannot find the related function to set that... I am a Java programmer,
> and this is my first perl script.. I tried looking into the constructor
> for the hash itself, but it doesn't seem like it accepts an argument...?
The next para after my previous quote:
| You can preallocate space for a hash by assigning to the keys()
| function. This rounds up the allocated buckets to the next power of two:
|
|     keys(%users) = 1000;    # allocate 1024 buckets
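To make that concrete, here is a small sketch of preallocating before a bulk load. The sizes and keys are invented; the bucket check uses Hash::Util::bucket_ratio, which assumes perl 5.26 or later (on older perls, 'print scalar %lines' gives the same information).

```perl
use strict;
use warnings;
use Hash::Util qw(bucket_ratio);   # assumes perl 5.26 or later

my %lines;
keys(%lines) = 240_000;            # ask for buckets up front; perl rounds up to a power of two

# Made-up keys standing in for the lines of the text file.
$lines{"record$_"} = 1 for 1 .. 1000;

my ($used, $allocated) = split m{/}, bucket_ratio(%lines);
print "used $used of $allocated buckets\n";
```

Preallocating only saves the cost of growing the hash as it fills; it will not help if the real problem is swapping.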
> Last question,
>
> How Do you delete an element within a hoh? Say i have a hash of hash,
> like the following.
>
> my %hoh();
Did you even try this? Perl Is Not Java: this is a syntax error. You
don't need the parens.
> loop() { # say this is the loop of each line of my txtFile
What is this 'loop()'? Have you been reading about Perl6? Or did you
mean

    sub loop {

?
> my $value = "TheRecordFromMyTxtFile";
(You really want to sort out your indentation. Makes life easier for
both you and us.)
> my $letter = substr $value, 0, 1; # say, i am using the first letter
> as the key for subhash.
> my $myKey = substr $value, 5, 9; # Say position 5 - 9 is the key for
> the element.
> $hoh{$letter}{$myKey} = $value
> }
>
>
> Now, I want to delete a particular value from one of the subhash...
>
> I tried doing this,
>
> delete $hoh{$letter}{$value};
That's correct (assuming $value corresponds to $myKey in the above, not
to $value there: that is, you delete an element by specifying its key).
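A small self-contained sketch of the whole round trip, with made-up records in place of your file. One thing to watch: the third argument to substr is a length, not an end position, so the sketch uses 5 there on the assumption that you want the five characters starting at offset 5.

```perl
use strict;
use warnings;

my %hoh;    # no parens needed

# Made-up records standing in for the lines of the text file.
my @records = ('alphaKEY01rest', 'alphaKEY02rest', 'bravoKEY03rest');

for my $value (@records) {
    my $letter = substr $value, 0, 1;    # first letter selects the subhash
    my $myKey  = substr $value, 5, 5;    # 5 characters starting at offset 5
    $hoh{$letter}{$myKey} = $value;
}

delete $hoh{a}{KEY01};                   # delete by key, not by value

my $gone = !exists $hoh{a}{KEY01};
print $gone ? "KEY01 deleted\n" : "KEY01 still there\n";
```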
> But it doesn't seem like it is deleting... B/c if I try to get the
> length of the $hoh{$letter}, it still reports the same number...
You really need to learn some basic Perl. I'd recommend a book:
'Learning Perl' published by O'Reilly is universally recommended as a
good place to start. An alternative would be to read through the
perldocs, but that's not an easy way to learn.
length (see perldoc -f length) treats its argument as a string and
returns the length of that string. $hoh{$letter} contains a hash
*reference*: see perldoc perldsc and perldoc perlreftut for how
multi-level data structures are implemented in Perl. Or, again, a decent
book will cover it. Now, when you stringify a hash ref, you get
something that looks like 'HASH(0x80142180)', which is basically
useless, and is always the same length.
To find the number of keys in a hash, you do as it says in perldoc -f
length: 'scalar keys %hash'. This is somewhat complicated by the fact
that what you have is not a hash but a hash ref, so we apply 'Use Rule
1' from perlreftut:
    # an ordinary hash
    print scalar keys %hash;

    # replace the var name with { }
    print scalar keys %{ }

    # put the hashref inside the braces
    print scalar keys %{ $hoh{$letter} };
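Putting the two side by side shows why length misled you. A sketch with an invented three-entry subhash:

```perl
use strict;
use warnings;

my %hoh = ( a => { k1 => 'v1', k2 => 'v2', k3 => 'v3' } );   # invented data
my $letter = 'a';

my $before_len   = length $hoh{$letter};             # length of a string like "HASH(0x...)"
my $before_count = scalar keys %{ $hoh{$letter} };   # number of keys in the subhash

delete $hoh{$letter}{k2};

my $after_len   = length $hoh{$letter};
my $after_count = scalar keys %{ $hoh{$letter} };

print "length: $before_len -> $after_len\n";         # unchanged: same ref, same string
print "keys:   $before_count -> $after_count\n";     # one fewer after the delete
```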
Yes, I agree this is a little icky, but that's what you get when you
graft complex data structures onto a language (Perl4) that doesn't
really support them.
A useful tool for examining data structures is the module Data::Dumper
(obviously, you want to run a test on a smaller dataset rather than
dumping a hash of 240k entries).
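For instance, a sketch with a tiny invented hash of hashes (the keys and values are placeholders, not your data):

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;    # stable key order makes dumps easier to compare

my %hoh = (
    a => { KEY01 => 'alphaKEY01rest' },
    b => { KEY03 => 'bravoKEY03rest' },
);

my $dump = Dumper(\%hoh);       # pass a reference so the structure stays in one piece
print $dump;
```

Dumping before and after a delete is a quick way to see whether the element really went away.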
Ben
--
All persons, living or dead, are entirely coincidental.
Kurt Vonnegut