Dan Jones <> wrote:
> I'm working with some large (several hundred megs) flat database files.
> I need to examine the records for duplicates. Obviously, I don't want to
> store several hundred megs of data in a hash.
I don't consider that to be at all obvious. I often want to and often do
store several hundred meg of data in a hash.
Is there no subset of the record which would be suitable for identifying
duplicates?
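If some subset does identify a record, a plain hash keyed on that subset is probably all you need. A minimal sketch, assuming pipe-delimited records whose first two fields decide identity (both the layout and the name find_duplicates are made up for illustration):

```perl
use strict;
use warnings;

# Return [first_line_no, duplicate_line_no] pairs for records whose
# identifying fields have been seen before.
# Assumption: fields are separated by '|' and the first two fields
# are enough to tell records apart.
sub find_duplicates {
    my ($fh) = @_;
    my (%first_seen, @dups);
    while (my $line = <$fh>) {
        chomp $line;
        # Build the key from the identifying fields only, so the hash
        # holds a small string per record instead of the whole record.
        my $key = join '|',
            map { defined($_) ? $_ : '' } (split /\|/, $line, 3)[0, 1];
        if (exists $first_seen{$key}) {
            push @dups, [ $first_seen{$key}, $. ];
        }
        else {
            $first_seen{$key} = $.;
        }
    }
    return @dups;
}
```

Call it with an open filehandle; each pair it returns gives the line number of the first occurrence and of the repeat.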
> What I'd like to do is to
> read each record, generate a hash value for the record, store that hash
> value and an index key
You would potentially need a list of index keys, not just one index key.
Otherwise what would you do when different records collide on the
same hash values?
> rather than storing the entire record, and look
> for collisions in the hash value.
And then use the index values to go back and look up the whole records
and compare them properly? Wouldn't it be easier to use system sort
routines or a proper database server?
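For what it's worth, the hash-then-verify scheme might look roughly like this. It is only a sketch: cheap_hash is a stand-in (a 32-bit byte checksum from unpack, not Perl's internal hash), the names are invented, and it assumes one record per line.

```perl
use strict;
use warnings;

# Stand-in hash: 32-bit sum of the bytes. Deliberately weak, so
# different records can collide, which is why pass 2 is needed.
sub cheap_hash {
    my ($s) = @_;
    return unpack '%32C*', $s;
}

# Return [first_offset, duplicate_offset] pairs for identical records.
sub find_exact_duplicates {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";

    # Pass 1: map each hash value to a LIST of byte offsets,
    # since several records may share one hash value.
    my %offsets_for;
    while (1) {
        my $pos = tell $fh;
        defined(my $line = <$fh>) or last;
        push @{ $offsets_for{ cheap_hash($line) } }, $pos;
    }

    # Pass 2: re-read only the records that share a hash value
    # and compare their full text.
    my @dups;
    for my $list (grep { @$_ > 1 } values %offsets_for) {
        my %first_at;
        for my $pos (@$list) {
            seek $fh, $pos, 0 or die "seek failed: $!";
            my $rec = <$fh>;
            if (exists $first_at{$rec}) {
                push @dups, [ $first_at{$rec}, $pos ];
            }
            else {
                $first_at{$rec} = $pos;
            }
        }
    }
    close $fh;

    @dups = sort { $a->[1] <=> $b->[1] } @dups;
    return @dups;
}
```

Memory use is one integer per record plus whatever a single bucket needs during the second pass, at the cost of a lot of seeking if the hash is poor.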
> Perl obviously uses an internal hashing function to create it's hash
> variables. Is it possible to access this function or to get the actual
> hash value it produces?
This seems to work. No guarantees:
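#!/usr/bin/perl -wl
use strict;
use Inline C => 'DATA';   # the C source follows the __C__ marker below

# Print perl's internal hash value for each key.
foreach (1..1_000_000) {
    print calc($_);
}
__DATA__
__C__
/* Return the value of perl's internal hash function for the string in sv.
   On perls that randomize the hash seed, the value may differ between runs. */
UV calc(SV* sv) {
    STRLEN len;   /* SvPV expects a STRLEN for the length */
    U32 hash;     /* PERL_HASH produces a U32 */
    char *c = SvPV(sv, len);
    PERL_HASH(hash, c, len);
    return hash;
}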
> If not, any pointers to a module or information
> on writing a hashing function in Perl would be appreciated. Hashing
> functions usually involve low level bit twiddling. While it's probably
> possible to do this directly in Perl (what isn't?), I don't know enough
> Perl to do it. Right now, I'm looking at using a C function, then having
> to integrate that with Perl. I'd really prefer to keep this a pure Perl
> script if I can.
The C code behind PERL_HASH is given in perldoc perlguts. You could easily
translate that into Perl.
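To give an idea of what such a translation looks like, here is Bob Jenkins' one-at-a-time hash in pure Perl. As far as I know that is the function the 5.8-era PERL_HASH macro implements, but treat that as an assumption and check the hv.h of your own perl. The sketch also assumes a perl built with 64-bit integers and byte strings as input, and the name oaat_hash is mine.

```perl
use strict;
use warnings;

# One-at-a-time hash, kept to 32 bits by masking after each addition.
# Assumption: perl integers are 64 bits wide, so the shifted
# intermediate values do not overflow before they are masked.
sub oaat_hash {
    my ($str) = @_;
    my $h = 0;
    for my $byte (unpack 'C*', $str) {
        $h = ($h + $byte) & 0xFFFFFFFF;
        $h = ($h + ($h << 10)) & 0xFFFFFFFF;
        $h ^= $h >> 6;
    }
    # Final mixing steps.
    $h = ($h + ($h << 3)) & 0xFFFFFFFF;
    $h ^= $h >> 11;
    $h = ($h + ($h << 15)) & 0xFFFFFFFF;
    return $h;
}
```

Don't expect the numbers to match what a newer perl computes internally, since those perls mix in a hash seed. For spotting candidate duplicates that doesn't matter, as long as the same function is applied to every record.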
Others have mentioned modules for hashing functions which are intended
more for security than efficiency.
Xho