Mailbox-style directory hashing

s1037989@gmail.com
      10-31-2006
I whipped up this quick and ugly script and I wanted to post it for
code review and others' benefit.

With an array such as:
qw(aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd)

The program returns:
# perl list2fs.pl 2
/a/aa/aaa/aaaa/aaaa
/a/aa/aaa/aaab/aaab
/a/aa/aaa/aaac/aaac
/b/bb/bbbb
/b/bc/bcc/bccc
/b/bc/bcd/bcdd
/b/bc/bce/bcee
/b/bc/bcf/bcff
/c/cccc
/d/dddd

Now as you can see, what this program does is take a list of filenames
and "hashify" it, mailbox-storage style, allowing no more than 2 (or
whatever $ARGV[0] is) filenames in a single directory. The point,
obviously, is that if you have 100,000 filenames and ext3 won't store
100,000 files in a single directory, you can use this technique to
break them down.

I had to make it recursive because I'm dealing with unknown data. I
don't know that 3 levels or even 6 levels will suffice. Therefore, the
recursion figures it out.

Now, that said, this is NOT intended for "hashifying" mail storage
dirs. It IS intended to "hashify" a HUGE list of filenames.
Unfortunately this code is VERY inefficient. So, I post it here so
people can see my idea if it helps, and so that people can maybe direct
me to an existing CPAN module that would accomplish the same thing?
Or, perhaps someone likes what I've started and wants to help improve
the code?

---------------------
#!/usr/bin/perl -w

use strict;
$ARGV[0] ||= 1;    # maximum number of filenames allowed per directory

#my @list = map { chomp; $_ } <STDIN>;
my @list = @_ = qw(aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd);

my %hash = a(1, \@list);
b('', %hash);

# Build a nested hash of prefixes, recursing one character deeper
# whenever a prefix still matches more than $ARGV[0] filenames.
sub a {
    my %list = map { lc(substr($_, 0, $_[0])) => 1 } @{$_[1]};
    my %hash = ();
    foreach my $hash ( keys %list ) {
        my @hash = grep { lc($hash) eq lc(substr($_, 0, length($hash))) } @{$_[1]};
        if ( $#hash >= $ARGV[0] ) {
            %{$hash{$hash}} = a($_[0]+1, \@hash);
        } elsif ( $#hash >= 0 ) {
            %{$hash{$hash}} = map { lc($_) => 1 } @hash;
        } elsif ( $#hash == -1 ) {
            # %{$hash{$hash}} = ();
        }
    }
    return %hash;
}

# Walk the nested hash and print one path per filename.
sub b {
    my ($root, %hash) = @_;
    foreach ( sort keys %hash ) {
        if ( ref $hash{$_} eq 'HASH' ) {
            b($root.'/'.$_, %{$hash{$_}});
        } else {
            print "$root/$_\n";
        }
    }
}

 
Tad McClellan
      11-01-2006
(E-Mail Removed) <(E-Mail Removed)> wrote:

> Or, perhaps someone likes what I've started and wants to help improve
> the code?
>
> ---------------------
> #!/usr/bin/perl -w



use warnings;

Lexical warnings are a big improvement over global warnings.


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
s1037989@gmail.com
      11-01-2006

Tad McClellan wrote:
> (E-Mail Removed) <(E-Mail Removed)> wrote:
>
> > Or, perhaps someone likes what I've started and wants to help improve
> > the code?
> >
> > ---------------------
> > #!/usr/bin/perl -w

>
>
> use warnings;
>
> Lexical warnings are a big improvement over global warnings.


Interesting...! Why is that?

> --
> Tad McClellan SGML consulting
> (E-Mail Removed) Perl programming
> Fort Worth, Texas


 
Dr.Ruud
      11-01-2006
s1037989 wrote:
> Tad McClellan:
>> s1037989:


>>> #!/usr/bin/perl -w

>>
>> use warnings;
>>
>> Lexical warnings are a big improvement over global warnings.

>
> Interesting...! Why is that?


Read perllexwarn: `perldoc perllexwarn`.

--
Affijn, Ruud

"Gewoon is een tijger."
 
Tad McClellan
      11-01-2006
(E-Mail Removed) <(E-Mail Removed)> wrote:
>
> Tad McClellan wrote:
>> (E-Mail Removed) <(E-Mail Removed)> wrote:
>>
>> > Or, perhaps someone likes what I've started and wants to help improve
>> > the code?
>> >
>> > ---------------------
>> > #!/usr/bin/perl -w

>>
>>
>> use warnings;
>>
>> Lexical warnings are a big improvement over global warnings.

>
> Interesting...! Why is that?



For the same reason that lexical variables are a big improvement
over global variables.

You can control the scope.
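
For instance, a warning category can be turned off for a single block
(a minimal illustration):

use warnings;

my $x;
{
    no warnings 'uninitialized';   # silenced only inside this block
    print "x is $x\n";
}
print "x is $x\n";                 # warns: Use of uninitialized value

With -w, warnings are on (or off) for the whole program at once,
including any modules it loads.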


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
Peter J. Holzer
      11-01-2006
On 2006-10-31 23:40, (E-Mail Removed) <(E-Mail Removed)> wrote:
> I whipped up this quick and ugly script and I wanted to post it for
> code review and others' benefit.
>
> With an array such as:
> qw(aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd)
>
> The program returns:
> # perl list2fs.pl 2
> /a/aa/aaa/aaaa/aaaa
> /a/aa/aaa/aaab/aaab
> /a/aa/aaa/aaac/aaac
> /b/bb/bbbb
> /b/bc/bcc/bccc
> /b/bc/bcd/bcdd
> /b/bc/bce/bcee
> /b/bc/bcf/bcff
> /c/cccc
> /d/dddd
>
> Now as you can see, what this program does is take a list of filenames
> and "hashify" it, mailbox-storage style, allowing no more than 2 (or
> whatever $ARGV[0] is) filenames in a single directory. The point,
> obviously, is that if you have 100,000 filenames and ext3 won't store
> 100,000 files in a single directory, you can use this technique to
> break them down.


Ext3 will happily store 100000 filenames in a directory - it just won't
be very quick in retrieving them (Even that isn't true for Linux 2.6 any
more - ext3 directories now use a structure called an "htree" to quickly
access files in huge directories). But assuming you need to use ext3
with older kernel versions or other filesystems with linear directories:


Are all the filenames guaranteed to be the same length? Otherwise, what
happens if you have these file names?

aaab aaac aaad aaae aaa

You would need a directory /a/aa/aaa and a file /a/aa/aaa. But you can't
have both.


Do you have all the filenames in advance, or is it possible to create new
files after the structure has been created? If it is the latter, you
probably will need a way to split an existing directory when the number
of files in it becomes too large - what happens when somebody accesses
the directory in the middle of the split operation? How do you determine
when you have to split? Count the files each time you create a new file?


Finally, I assume you used the value of 2 for demonstration purposes
only: Such a small value is not practical: It makes little sense to
restrict directory sizes to much less than a disk block. A structure as
you showed with lots of directories with a single file in it would be
slower than one which is one level less deep.

> Now, that said, this is NOT intended for "hashifying" mail storage
> dirs. It IS intended to "hashify" a HUGE list of filenames.
> Unfortunately this code is VERY inefficient.


How did you determine that it is very inefficient?

> So, I post it here so people can see my idea if it helps, and so that
> people can maybe direct me to an existing CPAN module that would
> accomplish the same thing?


When I needed to do similar things I used a hash function (e.g. MD5 or
SHA-1) on the key (the filename in your case) to get nice, uniformly
distributed, constant-length filenames, and then computed the number of
levels from the maximum number of files I had to store. With 256 files
per directory, 2 levels would be enough for 16 million files and 3
levels would be enough for 4 billion files. I never needed more.
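
A minimal sketch of that scheme (my illustration here, not the exact
code I used; Digest::MD5 ships with core Perl):

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Map a key (a filename, here) to a hashed path with $levels directory
# levels of 256 buckets each (two hex digits per level).
sub hashed_path {
    my ($key, $levels) = @_;
    my $digest = md5_hex($key);   # 32 hex chars, uniformly distributed
    my @dirs = map { substr($digest, 2*$_, 2) } 0 .. $levels - 1;
    return join('/', @dirs, $digest);
}

print hashed_path('aaaa', 2), "\n";   # e.g. 74/b8/74b87337454200d4...

Note that the path contains the digest, not the original name, so you
store the name (or whatever payload you need) inside the file itself.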

> Or, perhaps someone likes what I've started and wants to help improve
> the code?


If you avoid copying around huge lists it might be faster. But I'm not
sure the code even does what you want - see my questions above.

hp

--
_ | Peter J. Holzer | > Why should one invent something that
|_|_) | Sysadmin WSR | > doesn't exist?
| | | (E-Mail Removed) | What else would be the point of inventing?
__/ | http://www.hjp.at/ | -- P. Einstein and V. Gringmuth in desd
 
s1037989@gmail.com
      11-02-2006
Peter, thanks for your thoughtful reply! My response below...

Peter J. Holzer wrote:
> On 2006-10-31 23:40, (E-Mail Removed) <(E-Mail Removed)> wrote:
> > I whipped up this quick and ugly script and I wanted to post it for
> > code review and others' benefit.
> >
> > With an array such as:
> > qw(aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd)
> >
> > The program returns:
> > # perl list2fs.pl 2
> > /a/aa/aaa/aaaa/aaaa
> > /a/aa/aaa/aaab/aaab
> > /a/aa/aaa/aaac/aaac
> > /b/bb/bbbb
> > /b/bc/bcc/bccc
> > /b/bc/bcd/bcdd
> > /b/bc/bce/bcee
> > /b/bc/bcf/bcff
> > /c/cccc
> > /d/dddd
> >
> > Now as you can see, what this program does is take a list of filenames
> > and "hashify" it, mailbox-storage style, allowing no more than 2 (or
> > whatever $ARGV[0] is) filenames in a single directory. The point,
> > obviously, is that if you have 100,000 filenames and ext3 won't store
> > 100,000 files in a single directory, you can use this technique to
> > break them down.

>
> Ext3 will happily store 100000 filenames in a directory - it just won't
> be very quick in retrieving them (Even that isn't true for Linux 2.6 any
> more - ext3 directories now use a structure called an "htree" to quickly
> access files in huge directories). But assuming you need to use ext3
> with older kernel versions or other filesystems with linear directories:


Ok, I didn't know that. I thought I had read somewhere that the limit
was 32,000 (and thus the reason for using something like reiserfs), but
that was apparently an ancient article... Oops!

> Are all the filenames guaranteed to be the same length? Otherwise, what
> happens if you have these file names?


No, they aren't. So that's a good question. To be more specific about
the filenames: they are actually URLs.

> aaab aaac aaad aaae aaa
>
> You would need a directory /a/aa/aaa and a file /a/aa/aaa. But you can't
> have both.


Actually, in this case, aaa the file would go in aaa the directory.

>
> Do you have all the filenames in advance, or is it possible to create new
> files after the structure has been created? If it is the latter, you
> probably will need a way to split an existing directory when the number
> of files in it becomes too large - what happens when somebody accesses
> the directory in the middle of the split operation? How do you determine
> when you have to split? Count the files each time you create a new file?


I have the majority of the filenames in advance, but unfortunately the
list will change. To be more specific about what I'm doing: the URLs
are blacklisted domains to be used with squid. I'm writing my own
redirect program in Perl (because I'm not satisfied with squidGuard
and do not know C well enough to modify squidGuard). I have been
pondering the idea of giving each URL its own filename on the FS (disk
space is not an issue) because looking up a domain name that way would
be -- I think -- far more efficient than parsing a file and/or storing
the whole list in memory. It would be as simple as: does this file exist?
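
Roughly what I have in mind for the lookup side (a hypothetical
sketch; blacklist_path() and the /var/lib/blacklist root are made up,
and the fixed three levels stand in for whatever scheme lays out the
tree):

use strict;
use warnings;

# Hypothetical: map a domain to its place in the hashed tree, using
# leading substrings of the name as the directory levels.
sub blacklist_path {
    my ($domain) = @_;
    $domain = lc $domain;
    my @levels = map { substr($domain, 0, $_) } 1 .. 3;
    return join('/', '/var/lib/blacklist', @levels, $domain);
}

# The check itself is a single file test: no parsing, no big
# in-memory list.
sub is_blacklisted {
    my ($domain) = @_;
    return -e blacklist_path($domain);
}

print is_blacklisted('example.com') ? "blocked\n" : "ok\n";
# looks for /var/lib/blacklist/e/ex/exa/example.com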

Oh, the other reason for using a FS versus a MySQL DB is that I hate to
write Web apps and it would be very simple for someone to just create a
new blacklist using Samba from a Windows box.

So as far as handling a changing list, it could just be done nightly.
Blow away the list, and rebuild.

I know this is not the best approach, but I want to make something
simple for a prototype. I hate to invest too much time into it only to
decide it won't be used.

Lastly, just thinking about the problem got me interested in solving
the problem generically. As such, I wrote the quick script and posted
it for review and perhaps for others' benefit.

> Finally, I assume you used the value of 2 for demonstration purposes
> only: Such a small value is not practical: It makes little sense to
> restrict directory sizes to much less than a disk block. A structure as
> you showed with lots of directories with a single file in it would be
> slower than one which is one level less deep.


Correct, 2 was used merely for demonstration purposes. I intend to set
the number to around 1000. I don't want the directory listing to be
too overwhelming.

> > Now, that said, this is NOT intended for "hashifying" mail storage
> > dirs. It IS intended to "hashify" a HUGE list of filenames.
> > Unfortunately this code is VERY inefficient.

>
> How did you determine that it is very inefficient?


By running it... It was AWFUL! I ran it against a file containing
nearly 600,000 entries. I never waited long enough for it to finish.

> > So, I post it here so people can see my idea if it helps, and so that
> > people can maybe direct me to an existing CPAN module that would
> > accomplish the same thing?

>
> When I needed to do similar things I used a hash function (e.g. MD5 or
> SHA-1) on the key (the filename in your case) to get nice, uniformly
> distributed, constant-length filenames, and then computed the number of
> levels from the maximum number of files I had to store. With 256 files
> per directory, 2 levels would be enough for 16 million files and 3
> levels would be enough for 4 billion files. I never needed more.


This is good information, thanks. I'll consider its practical
uses, but it would defeat the goal of simple blacklist updates via
Samba.

> > Or, perhaps someone likes what I've started and wants to help improve
> > the code?

>
> If you avoid copying around huge lists it might be faster. But I'm not
> sure the code even does what you want - see my questions above.


How could I avoid that?

> hp
>
> --
> _ | Peter J. Holzer | > Why should one invent something that
> |_|_) | Sysadmin WSR | > doesn't exist?
> | | | (E-Mail Removed) | What else would be the point of inventing?
> __/ | http://www.hjp.at/ | -- P. Einstein and V. Gringmuth in desd


 
Peter J. Holzer
      11-03-2006
On 2006-11-02 18:41, (E-Mail Removed) <(E-Mail Removed)> wrote:
> Peter, thanks for your thoughtful reply! My response below...
>
> Peter J. Holzer wrote:
>> On 2006-10-31 23:40, (E-Mail Removed) <(E-Mail Removed)> wrote:
>> > I whipped up this quick and ugly script and I wanted to post it for
>> > code review and others' benefit.
>> >
>> > With an array such as:
>> > qw(aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd)
>> >
>> > The program returns:
>> > # perl list2fs.pl 2
>> > /a/aa/aaa/aaaa/aaaa
>> > /a/aa/aaa/aaab/aaab

[...]
>> > /c/cccc
>> > /d/dddd
>> >
>> > Now as you can see, what this program does is take a list of filenames
>> > and "hashify" it, mailbox-storage style, allowing no more than 2 (or
>> > whatever $ARGV[0] is) filenames in a single directory. The point,
>> > obviously, is that if you have 100,000 filenames and ext3 won't store
>> > 100,000 files in a single directory, you can use this technique to
>> > break them down.

>>
>> Ext3 will happily store 100000 filenames in a directory - it just won't
>> be very quick in retrieving them (Even that isn't true for Linux 2.6 any
>> more - ext3 directories now use a structure called an "htree" to quickly
>> access files in huge directories). But assuming you need to use ext3
>> with older kernel versions or other filesystems with linear directories:

>
> Ok, I didn't know that. I thought I had read somewhere that the limit
> was 32,000 (and thus the reason for using something like reiserfs), but
> that was apparently an ancient article... Oops!


AFAIK, there was never a limit on the number of files in an ext2/ext3
directory. 32 k may be the maximum number of subdirectories (because the
link count is 16 bit, and each ".." entry is a link). But you wouldn't
want more than a few thousand files in a single directory because it
would be slow.


>> aaab aaac aaad aaae aaa
>>
>> You would need a directory /a/aa/aaa and a file /a/aa/aaa. But you can't
>> have both.

>
> Actually, in this case, aaa the file would go in aaa the directory.


Right. But my gut feeling that you don't handle that case correctly was
right. See below.


>> Do you have all the filenames in advance, or is it possible to create new
>> files after the structure has been created? If it is the latter, you
>> probably will need a way to split an existing directory when the number
>> of files in it becomes too large - what happens when somebody accesses
>> the directory in the middle of the split operation? How do you determine
>> when you have to split? Count the files each time you create a new file?

>
> I have the majority of the filenames in advance, but unfortunately the
> list will change. To be more specific about what I'm doing: the URLs
> are blacklisted domains to be used with squid.


Are they URLs or domains? With URLs you may need to be careful because
of '/' and possibly other characters (especially if you access the
filesystem via Windows/Samba).


>> > Now, that said, this is NOT intended for "hashifying" mail storage
>> > dirs. It IS intended to "hashify" a HUGE list of filenames.
>> > Unfortunately this code is VERY inefficient.

>>
>> How did you determine that it is very inefficient?

>
> By running it... It was AWFUL! I ran it against a file containing
> nearly 600,000 entries. I never waited long enough for it to finish.


That's probably because it doesn't finish at all if one of the strings
is a prefix of more than $ARGV[0] other strings: the grep in a() then
selects the same over-full group of strings again at every level, so
the group never gets smaller. Change the initialization to

my @list = @_ = qw(aa aaaa aaab aaac bbbb bccc bcdd bcee bcff cccc dddd);

and try it. (It will probably crash pretty soon when it runs out of
memory, but if you had an infinite amount of memory it would run
forever.)


>> > So, I post it here so people can see my idea if it helps, and so that
>> > people can maybe direct me to an existing CPAN module that would
>> > accomplish the same thing?

>>
>> When I needed to do similar things I used a hash function (e.g. MD5 or
>> SHA-1) on the key (the filename in your case) to get nice, uniformly
>> distributed, constant-length filenames, and then computed the number of
>> levels from the maximum number of files I had to store. With 256 files
>> per directory, 2 levels would be enough for 16 million files and 3
>> levels would be enough for 4 billion files. I never needed more.

>
> This is good information, thanks. I'll consider its practical
> uses, but it would defeat the goal of simple blacklist updates via
> Samba.


You want the directory structure to be created with the script anyway,
don't you? Or do you expect Windows users to create and delete
individual files to edit the blacklist?


>> > Or, perhaps someone likes what I've started and wants to help improve
>> > the code?

>>
>> If you avoid copying around huge lists it might be faster. But I'm not
>> sure the code even does what you want - see my questions above.

>
> How could I avoid that?


I was wrong. You are passing the lists by reference, and the hashes
which you copy needlessly aren't that big (at most 70 elements, assuming
printable ASCII). The problem is that you are calling grep way too
often, which also causes the endless recursion (and would be slow even
without that).

Here is my version of your function "a" (you have some talent for
choosing horrible names, btw. I could only figure out what the variables
were supposed to be by reading the code - the names were at best
useless and at worst misleading):

sub a {
    my ($prefix_len, $list) = @_;
    #print " " x $prefix_len, (scalar @$list), " elements\n";
    my %hash = ();
    # Bucket every string by its prefix of the current length, in one pass.
    push @{ $hash{lc(substr($_, 0, $prefix_len))} }, $_ for @$list;
    foreach my $sublist ( values %hash ) {
        if ( @$sublist > $ARGV[0] ) {
            # Bucket still too big: split it one character deeper.
            $sublist = a($prefix_len+1, $sublist);
        } else {
            $sublist = { map { lc($_) => 1 } @$sublist };
        }
    }
    return \%hash;
}

The biggest change is that I didn't try to figure out the prefixes first
and then find all strings with the same prefix (which caused duplication
and the endless recursion), but do both in one loop - that way each
string can only ever be assigned to one prefix, and the lists must get
shorter. The switching between array and hash references is mostly so
that a can be called with an array reference as before; I could have
used hash references throughout. It also now returns a hash reference
instead of a list-which-is-a-hash, which I consider cleaner.
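
That also means the printing routine has to change to match. Something
like this should work (untested; print_tree is just my name for the
adapted b):

my $tree = a(1, \@list);
print_tree('', $tree);

sub print_tree {
    my ($root, $tree) = @_;
    foreach my $key ( sort keys %$tree ) {
        if ( ref $tree->{$key} ) {
            # a nested hashref is another directory level
            print_tree("$root/$key", $tree->{$key});
        } else {
            # a plain true value marks a filename in a leaf bucket
            print "$root/$key\n";
        }
    }
}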

I didn't have a list of 600000 domain names handy, so I used two lists
with 35000 and 96000 names: The run times (with $ARGV[0] == 1000) were
about 3 and 9 seconds, respectively, on a PIII 850.

hp

--
_ | Peter J. Holzer | > Why should one invent something that
|_|_) | Sysadmin WSR | > doesn't exist?
| | | (E-Mail Removed) | What else would be the point of inventing?
__/ | http://www.hjp.at/ | -- P. Einstein and V. Gringmuth in desd
 