Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > hashing function

Reply
Thread Tools

hashing function

 
 
Dan Jones
Guest
Posts: n/a
 
      11-28-2004
I'm working with some large (several hundred megs) flat database files. I
need to examine the records for duplicates. Obviously, I don't want to
store several hundred megs of data in a hash. What I'd like to do is to
read each record, generate a hash value for the record, store that hash
value and an index key rather than storing the entire record, and look for
collisions in the hash value.

Perl obviously uses an internal hashing function to create it's hash
variables. Is it possible to access this function or to get the actual
hash value it produces? If not, any pointers to a module or information on
writing a hashing function in Perl would be appreciated. Hashing functions
usually involve low level bit twiddling. While it's probably possible to
do this directly in Perl (what isn't?), I don't know enough Perl to do it.
Right now, I'm looking at using a C function, then having to integrate that
with Perl. I'd really prefer to keep this a pure Perl script if I can.

I've been through the Camel, the Panther, and the Ram without finding
anything relevant. The Cookbook does mention that using hashes to search
for dupes is memory intensive if you have large records but doesn't provide
any alternatives that I could find. Searching for information on hashing
functions and Perl on the web has proven to be an exercise in futility due
to the naming collision with the variable type.

Thanks in advance for any assistance.


 
Reply With Quote
 
 
 
 
Juha Laiho
Guest
Posts: n/a
 
      11-28-2004
Dan Jones <> said:
>I'm working with some large (several hundred megs) flat database files. I
>need to examine the records for duplicates. Obviously, I don't want to
>store several hundred megs of data in a hash.

....
>If not, any pointers to a module or information on writing a hashing
>function in Perl would be appreciated.


MD5 ans SHA1 are two commonly used hashing functions, and implementations
for both are available as perl modules.
--
Wolf a.k.a. Juha Laiho Espoo, Finland
(GC 3.0) GIT d- s+: a C++ ULSH++++$ P++@ L+++ E- W+$@ N++ !K w !O !M V
PS(+) PE Y+ PGP(+) t- 5 !X R !tv b+ !DI D G e+ h---- r+++ y++++
"...cancel my subscription to the resurrection!" (Jim Morrison)
 
Reply With Quote
 
 
 
 
Xavier Noria
Guest
Posts: n/a
 
      11-28-2004
Storing the hash code and an index is what the $hash{$record}++ idiom
does, the post seem to assume that the record itself is being stored in
the hash. Just to avoid a misconception, that's not the case.

Did you already try that simple solution and it wasn't suitable?

-- fxn

 
Reply With Quote
 
Xavier Noria
Guest
Posts: n/a
 
      11-28-2004
I don't know what was I thinking about, that's completely wrong. I am
very sorry.

-- fxn

 
Reply With Quote
 
xhoster@gmail.com
Guest
Posts: n/a
 
      11-28-2004
Dan Jones <> wrote:
> I'm working with some large (several hundred megs) flat database files.
> I need to examine the records for duplicates. Obviously, I don't want to
> store several hundred megs of data in a hash.


I don't consider that to be at all obvious. I often want to and often do
store several hundred meg of data in a hash.

Is there no subset of the record which would be suitable for identifying
duplicates?

> What I'd like to do is to
> read each record, generate a hash value for the record, store that hash
> value and an index key


You would potentially need a list of index keys, not just one index key.
Otherwise what would you do when different records collide on the
same hash values?

> rather than storing the entire record, and look
> for collisions in the hash value.


And then use the index values to go back and look up the whole records
and compare them properly? Wouldn't it be easier to use system sort
routines or a proper database server?

> Perl obviously uses an internal hashing function to create it's hash
> variables. Is it possible to access this function or to get the actual
> hash value it produces?


This seems to work. No gaurantees:

#!/usr/bin/perl -wl

use strict;
use Inline 'C' ;

foreach (1..1_000_000) {
print calc($_);
};

__DATA__
__C__

size_t calc(SV* sv) {
char * c;
size_t size;
size_t hash;

c=SvPV(sv,size);
PERL_HASH(hash,c,size);
return hash;
};



> If not, any pointers to a module or information
> on writing a hashing function in Perl would be appreciated. Hashing
> functions usually involve low level bit twiddling. While it's probably
> possible to do this directly in Perl (what isn't?), I don't know enough
> Perl to do it. Right now, I'm looking at using a C function, then having
> to integrate that with Perl. I'd really prefer to keep this a pure Perl
> script if I can.


The C code behind PERL_HASH is given in PERLDOC PERLGUTS. You could easily
translate that into Perl.

Others have mentioned modules for hashing functions which are intended
more for security than efficiency.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
 
Reply With Quote
 
Mike Deskevich
Guest
Posts: n/a
 
      11-28-2004
rather than trying to store the data in the hash (which you said you
didn't want to do), why not just use the hash to keep track of how many
records there are of each key. something like this

%hash={} #or however you make an empty hash, i don't use them often so
i don't know off the top of my head how to do it.
while (<DATAFILE>)
{
$key=get_key_from_record ($_); #or however you get the key from the
record just read in
$num=(%hash{$key}++);
if ($num>1)
{
#record is a dupe, do what you want with it
}
}

 
Reply With Quote
 
John W. Krahn
Guest
Posts: n/a
 
      11-29-2004
Dan Jones wrote:
> I'm working with some large (several hundred megs) flat database files. I
> need to examine the records for duplicates. Obviously, I don't want to
> store several hundred megs of data in a hash. What I'd like to do is to
> read each record, generate a hash value for the record, store that hash
> value and an index key rather than storing the entire record, and look for
> collisions in the hash value.
>
> Perl obviously uses an internal hashing function to create it's hash
> variables. Is it possible to access this function or to get the actual
> hash value it produces? If not, any pointers to a module or information on
> writing a hashing function in Perl would be appreciated. Hashing functions
> usually involve low level bit twiddling. While it's probably possible to
> do this directly in Perl (what isn't?), I don't know enough Perl to do it.
> Right now, I'm looking at using a C function, then having to integrate that
> with Perl. I'd really prefer to keep this a pure Perl script if I can.
>
> I've been through the Camel, the Panther, and the Ram without finding
> anything relevant. The Cookbook does mention that using hashes to search
> for dupes is memory intensive if you have large records but doesn't provide
> any alternatives that I could find. Searching for information on hashing
> functions and Perl on the web has proven to be an exercise in futility due
> to the naming collision with the variable type.


Here is an example using Digest::MD5 that may suit your needs:

#!/usr/bin/perl
use warnings;
use strict;

use Digest::MD5 'md5';


my $file = 'database.file';

my ( $loc, %data ) = 0;

@ARGV = $file;

while ( <> ) {
push @{ $data{ md5 $_ } }, $loc;
$loc = tell ARGV;
}

close ARGV;

# remove all duplicates except the first one
my @dups = sort { $a <=> $b } map splice( @$_, 1 ), values %data;

# or alternately remove all except the last one
# my @dups = sort { $a <=> $b } map splice( @$_, 0, $#$_ ), values %data;


( $loc, $^I, @ARGV ) = ( 0, '.bck', $file );

while ( <> ) {
if ( @dups and $loc == $dups[ 0 ] ) {
shift @dups;
next;
}
print;
$loc = tell ARGV;
}

__END__



John
--
use Perl;
program
fulfillment
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Hashing function different values on different OS ? Lawrence Java 16 02-18-2007 09:57 PM
Hashing Function Dave Python 0 11-17-2005 07:45 PM
Hashing function for integer coordinates Owen Jacobson Java 3 05-26-2005 12:29 PM
Hashing function Pat C++ 2 05-18-2004 05:39 PM
Hashing function (possibly OT) Matt Bull C Programming 2 07-07-2003 08:46 PM



Advertisments
 



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57