Let's talk a little more about directories

 
 
George Mpouras
07-26-2013
Ok, let's talk a little more about creating directories (or any other
trie-structured entries).
There are three basic methods:

1) The simple one: just create them from top to bottom. This always gives
us N disk accesses. Silly, but do not underestimate it, because it does not
consume CPU.

2) The smarter one: search from bottom to top for the first existing node
and start creating from that point downwards; this gives us N accesses only
in the worst case.

3) The wisest approach: build an array of all the upper nodes, then perform
a binary search on this array against a node-existence test. This gives us
log(N) accesses in the worst case.

So what is the best solution? Unfortunately there is not one ...

If your directories "usually" exist, the 2nd method will be the fastest,
because it will finish within the very first iterations.
If your directories "usually" do not exist, the 3rd method will be the
fastest, because it will dig faster into the bottomless abyss.

Because the 2nd method is almost trivial, I will write the code only for
the 3rd, binary-search method (a rough sketch of the 2nd follows the code
below, for comparison).



mkdir_btree( 'd1/d2/d3/d4/d5' ) or die "$^E\n";

sub mkdir_btree
{
    # split the path into its components (either / or \ as separator)
    my @array = split /[\/\\]/, $_[0];
    return 1 if -1 == $#array;
    # absolute path: fold the empty leading field into the first component
    $array[0] eq '' && splice @array, 0, 2, "/$array[1]";
    my ($begin, $end) = (0, scalar @array);

    # binary search for the first path prefix that does not exist
    while ($begin < $end)
    {
        my $cut = int(($begin + $end)/2);
        -d join('/', @array[0..$cut]) ? ($begin = $cut + 1) : ($end = $cut)
    }

    # everything already exists
    return 1 unless $begin < @array;
    # create the missing nodes from that point downwards
    mkdir(join '/', @array[0 .. $_]) || return 0 for $begin .. $#array;
    1
}
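For comparison, a rough sketch of the 2nd (bottom-to-top) method could look
something like this (the helper name mkdir_bottom_up is only illustrative,
and it assumes '/' or '\' separated paths like the routine above):

sub mkdir_bottom_up
{
    my @array = split /[\/\\]/, $_[0];
    return 1 if -1 == $#array;
    $array[0] eq '' && splice @array, 0, 2, "/$array[1]";

    # walk backwards until an existing ancestor is found
    my $first_missing = scalar @array;
    while ($first_missing > 0) {
        last if -d join '/', @array[0 .. $first_missing - 1];
        $first_missing--;
    }

    # then create forwards from that point downwards
    for my $i ($first_missing .. $#array) {
        mkdir join('/', @array[0 .. $i]) or return 0;
    }
    1
}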







Also, here is a small benchmark if you want to test it (it is better to run
it multiple times).





#!/usr/bin/perl
# Dir creation benchmark
use strict;
use warnings;
use Time::HiRes;
use File::Path;

my $how_many = 1000;
my $how_deep = 40;
my $root_test_dir = '/tmp/test_this';

# create the array of test dirs
print "Creating the test dirs array\n";
my @Test_dirs;
my $start_time;
for (0 .. $how_many) {
    my @R = $root_test_dir;
    for (0 .. $how_deep) {
        push @R, sprintf "%s%02d", ('a'..'z')[int(rand 24)], int(rand 100);
    }
    push @Test_dirs, join('/', @R);
}



foreach my $code (qw/
    Mkdir_recursive
    mkdir_btree
    File_Path_module
/) {
    system("/bin/rm -rf $root_test_dir") if -d $root_test_dir;
    $start_time = [ Time::HiRes::gettimeofday ];
    print "Testing $code ... ";
    foreach (@Test_dirs) { &{\&$code}($_) or die "$!" }
    print "finished at ", Time::HiRes::tv_interval($start_time), " sec\n";
}




sub File_Path_module
{
    my $err;
    File::Path::mkpath( $_[0], { error => \$err } );
    @{$err} ? 0 : 1
}


sub Mkdir_recursive
{
    return 1 if $_[0] eq '' || -d $_[0];
    Mkdir_recursive( $_[0] =~ /^(.*?)\/[^\/]+$/ ) || return undef;
    mkdir($_[0]) || return undef
}


sub mkdir_btree
{
    my @array = split /\//, $_[0];
    return 1 if -1 == $#array;
    $array[0] eq '' && splice @array, 0, 2, "/$array[1]";
    my ($begin, $end) = (0, scalar @array);

    while ($begin < $end)
    {
        my $cut = int(($begin + $end)/2);
        -d join('/', @array[0..$cut]) ? ($begin = $cut + 1) : ($end = $cut)
    }

    return 1 unless $begin < @array;
    mkdir(join '/', @array[0 .. $_]) || return 0 for $begin .. $#array;
    1
}


END {
    print "Clearing test dir: $root_test_dir\n";
    system("/bin/rm -rf $root_test_dir") if -d $root_test_dir;
}

 
Rainer Weikusat
07-26-2013
"George Mpouras"
<(E-Mail Removed) m.com.nospam>
writes:
> 1) The simple one: just create them from top to bottom. This always
> gives us N disk accesses. Silly, but do not underestimate it, because
> it does not consume CPU.
>
> 2) The smarter one: search from bottom to top for the first existing
> node and start creating from that point downwards; this gives us N
> accesses only in the worst case.


As I already tried to tell you last time: method 1 is tuned for the case
where most directories have to be created and method 2 for the case
where most directories already exist. Both stat (the system call
behind the -X file tests) and mkdir need to determine information about a
particular directory entry, stat because that is what it is supposed to
do and mkdir because it must not do anything if this directory entry
already exists. This means creating all directories in the path
/a/b/c/d from scratch will first perform four stat calls,

stat("/a/b/c/d")
stat("/a/b/c")
stat("/a/b")
stat("/a")

which will all fail with ENOENT, followed by four mkdir calls,

mkdir("/a");
mkdir("/a/b");
mkdir("/a/b/c");
mkdir("/a/b/c/d");

for the 'backward' method, while the other will just do the 'mkdirs'.
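(A rough instrumented sketch of the two strategies, with made-up helper
names, makes those numbers easy to reproduce: every -d and every mkdir
bumps a counter, and for a path like 'a/b/c/d' that does not exist at all
yet, the top-down version performs 4 probes while the backward version
performs 8.)

use strict;
use warnings;

my $probes = 0;    # each -d or mkdir touches one directory entry

sub mkdir_top_down {            # method 1: just mkdir every level
    my @parts = split m{/+}, $_[0];
    my $path = '';
    for my $p (@parts) {
        $path .= "$p/";
        $probes++;
        mkdir $path;            # harmless failure if it already exists
    }
    return 1;
}

sub mkdir_backward {            # method 2: stat backwards, then mkdir forwards
    my @parts = split m{/+}, $_[0];
    my $i = $#parts;
    while ($i >= 0) {
        $probes++;
        last if -d join '/', @parts[0 .. $i];
        $i--;
    }
    for my $j ($i + 1 .. $#parts) {
        $probes++;
        mkdir join '/', @parts[0 .. $j] or return 0;
    }
    return 1;
}

$probes = 0; mkdir_top_down('a/b/c/d');  print "top-down: $probes probes\n";
system('rm -rf a');
$probes = 0; mkdir_backward('a/b/c/d');  print "backward: $probes probes\n";
system('rm -rf a');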

> 3) The wisest approach: build an array of all the upper nodes, then
> perform a binary search on this array against a node-existence test.
> This gives us log(N) accesses in the worst case.
>
> So what is the best solution? Unfortunately there is not one ...
>
> If your directories "usually" exist, the 2nd method will be the fastest,
> because it will finish within the very first iterations.
> If your directories "usually" do not exist, the 3rd method will be the
> fastest, because it will dig faster into the bottomless abyss.
>
> Because the 2nd method is almost trivial, I will write the code only
> for the 3rd, binary-search method
>
>
> mkdir_btree( 'd1/d2/d3/d4/d5' ) or die "$^E\n";
>
> sub mkdir_btree
> {
> my @array = split /[\/\\]/, $_[0];


I don't know if this is meant for Windows, but a valid UNIX(*) pathname may
use any number of slashes to separate two directories,

[rw@sapphire]~ $perl -e "chdir('//////////////'); system('pwd');"
/
 
Rainer Weikusat
07-27-2013
"George Mpouras"
<(E-Mail Removed) m.com.nospam>
writes:

[...]

> 3) The wisest approach: build an array of all the upper nodes, then
> perform a binary search on this array against a node-existence test.
> This gives us log(N) accesses in the worst case.
>
> So what is the best solution? Unfortunately there is not one ...
>
> If your directories "usually" exist, the 2nd method will be the fastest,
> because it will finish within the very first iterations.
> If your directories "usually" do not exist, the 3rd method will be the
> fastest, because it will dig faster into the bottomless abyss.


[...]

> mkdir_btree( 'd1/d2/d3/d4/d5' ) or die "$^E\n";
>
> sub mkdir_btree
> {
> my @array = split /[\/\\]/, $_[0];
> return 1 if -1 == $#array;
> $array[0] eq '' && splice @array, 0, 2, "/$array[1]";
> my ($begin, $end) = (0, scalar @array);
>
> while ($begin < $end)
> {
> my $cut = int(($begin + $end)/2);
> -d join('/', @array[0..$cut]) ? ($begin = $cut + 1) : ($end = $cut)
> }
>
> return 1 unless $begin < @array;
> mkdir(join '/', @array[0 .. $_]) || return 0 for $begin .. $#array;1
> }


The 'log(N) worst case' is not true: this algorithm has a constant
overhead of log(N) stat calls in every case (this could be improved
somewhat by using mkdir in the loop). This means it is worse than
'create from the top' when all directories have to be created, and it
will become worse than 'create from the bottom' if the number m of
subdirectories which have to be created at depth n is large enough
that (m - 1) * log(n) stat calls amount to more work than what was avoided
when creating the first subdirectory at this depth. The work saved
when creating the first directory is (based on mkdir)

2n - 1 - n - log(n)

This must be larger than (m - 1) * log(n) for the binary search to be
beneficial. Putting this together,

2n - 1 - n - log(n) > (m - 1) * log(n)

this ends up as (barring calculation errors)

(n - 1)/log(n) > m

The /usr tree on the machine I used as an example has a maximum depth of
12. Using this program,

--------------
sub log2
{
    return log($_[0]) / log(2);
}

sub threshold
{
    return ($_[0] - 1) / log2($_[0]);
}

print($_, "\t", threshold($_), "\n") for 2 .. 12;
--------------

yields the following threshold values:

2 1
3 1.26185950714291
4 1.5
5 1.72270623229357
6 1.93426403617271
7 2.13724312264813
8 2.33333333333333
9 2.52371901428583
10 2.70926996097583
11 2.89064826317888
12 3.06837240216243

(as soon as a directory at depth [first column] has more
subdirectories than [second column], the algorithm sucks).

I've used find /usr -type d to generate a list of directories, edited
the first line to remove a trailing slash and then used the following
program

-------------
use List::Util qw(sum);

@ds = map {chomp; $_} <STDIN>;

for (@ds) {
    /(.*?\/)[^\/]*$/ and $path = $1;
    # print($_, "\t", $path, "\n");

    ++$dirs{$path};
}

for (keys(%dirs)) {
    $depth = tr/\///;
    push(@{$counts[$depth]}, $dirs{$_});
}

for (1 .. $#counts) {
    print($_, "\t", scalar(@{$counts[$_]}), "\t", sum(@{$counts[$_]}) / @{$counts[$_]}, "\n") if @{$counts[$_]};
}
---------------

to calculate the average number of subdirectories at each depth.
The result is (depth, number of times encountered, avg)

1 1 1
2 1 10
3 6 82.3333333333333
4 228 10.0833333333333
5 673 2.93610698365527
6 521 8.13435700575816
7 556 2.42446043165468
8 289 1.96193771626298
9 136 2.63970588235294
10 85 3.23529411764706
11 56 3.17857142857143
12 5 5

In other words (if the results are at least mostly correct), this
idea sucks dead hippopotamuses through nanotubes in the real world.
 
George Mpouras
07-27-2013
On 27/7/2013 15:25, Rainer Weikusat wrote:
> "George Mpouras"
> <(E-Mail Removed) m.com.nospam>
> writes:
>
> [...]
>
>> 3) The most wise approach is to define an array from all the upper
>> nodes. Then perform a binary tree search at this array against a node
>> existence code. This is give as log(N) accesses at worst case.
>>
>> So what is the best solution; Unfortunately there is not ….
>>
>> If your directories “usually” exist, the 2nd method will be the faster
>> because it will finish at the very first iterations.
>> If your directories “usually” do not exist, the 3rd method will be the
>> faster because it will dig faster at the bottomless abyss.

>
> [...]
>
>> mkdir_btree( 'd1/d2/d3/d4/d5' ) or die "$^E\n";
>>
>> sub mkdir_btree
>> {
>> my @array = split /[\/\\]/, $_[0];
>> return 1 if -1 == $#array;
>> $array[0] eq '' && splice @array, 0, 2, "/$array[1]";
>> my ($begin, $end) = (0, scalar @array);
>>
>> while ($begin < $end)
>> {
>> my $cut = int(($begin + $end)/2);
>> -d join('/', @array[0..$cut]) ? ($begin = $cut + 1) : ($end = $cut)
>> }
>>
>> return 1 unless $begin < @array;
>> mkdir(join '/', @array[0 .. $_]) || return 0 for $begin .. $#array;1
>> }

>
> The 'log(N) worst case' is not true: This algortihm has a constant
> overhead of log(N) stat calls for every case (this could be improved
> somewhat by using mkdir in the loop). This means it is worse than
> 'create from the top' when all directories have to be created and will



Do not forget that the computer does not know in advance whether the
directories have to be created. So this method finds out faster whether
they have to be.


> [...]
>
> In other words (if the results are at least mostly correct), this
> idea sucks dead hippopotamuses through nanotubes in the real world.




Well, those are good comments. Still, this idea does not suck: when deep
directories usually do not exist, it is the fastest one.

 
George Mpouras
07-27-2013
> In other words (if the results are at least mostly correct), this
> idea sucks dead hippopotamuses through nanotubes in the real world.
>




Binary search is the fastest method to access a specific record in an
unknown data structure; that is why it is often used by databases. But as
I said, there is no single absolute answer. The following factors must be
considered before deciding what to do on a specific machine:

* the time cost of a CPU cycle
* the time cost of asking whether a node exists
* the time cost of creating a node
* the statistical behaviour of the application
* the current status of your data

So the problem is not so deterministic after all.
 
Willem
07-27-2013
George Mpouras wrote:
) Ok, let's talk a little more about creating directories (or any other
) trie-structured entries).
) There are three basic methods:
)
) 1) The simple one: just create them from top to bottom. This always gives
) us N disk accesses. Silly, but do not underestimate it, because it does
) not consume CPU.
)
) 2) The smarter one: search from bottom to top for the first existing node
) and start creating from that point downwards; this gives us N accesses
) only in the worst case.
)
) 3) The wisest approach: build an array of all the upper nodes, then
) perform a binary search on this array against a node-existence test.
) This gives us log(N) accesses in the worst case.

Have you considered the possibility that it takes O(N) disk accesses
to check whether a directory (a1/a2/.../aN) exists?
The OS will have to open a1, find a2, open a2, find a3, etc.
However, if you already have a handle to directory a2,
then checking or creating a3 is a single step.

If that is the case, method 1 will always be fastest.
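(A rough sketch of that "keep a handle" idea in plain Perl, using chdir so
each component is resolved relative to its parent instead of re-walking the
whole prefix every time; the helper name is made up, and it assumes the
process is free to change and restore its working directory:)

use strict;
use warnings;
use Cwd qw(getcwd);

sub mkdir_chdir_walk {
    my ($path) = @_;
    my @parts  = split m{/+}, $path;
    my $oldcwd = getcwd();

    if ($path =~ m{^/}) {
        shift @parts;                  # drop the empty field before the leading '/'
        chdir '/' or return 0;
    }

    my $ok = 1;
    for my $p (@parts) {
        mkdir $p;                      # ignore "already exists" failures
        unless (chdir $p) { $ok = 0; last }
    }

    chdir $oldcwd;                     # always return to where we started
    return $ok;
}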


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
gamo@telecable.es
07-27-2013
> my @array = split /[\/\\]/, $_[0];
-----------------------

I'm afraid that's not compatible with VMS.


 
George Mpouras
07-27-2013
On 27/7/2013 23:26, (E-Mail Removed) wrote:
>> my @array = split /[\/\\]/, $_[0];

> -----------------------
>
> I'm afraid that's not compatible with VMS
>
>

Maybe this one? split /\/+/, $_[0];
What is VMS?
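(A quick check of that suggestion, illustrative only: split /\/+/ treats any
run of slashes as a single separator, so repeated '/' are harmless.)

my @parts = split /\/+/, '//d1///d2/d3';
print join(',', @parts), "\n";   # prints ",d1,d2,d3" (leading empty field for the root)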
 
George Mpouras
07-27-2013
On 27/7/2013 23:26, (E-Mail Removed) wrote:
>> my @array = split /[\/\\]/, $_[0];

> -----------------------
>
> I'm afraid that's not compatible with VMS
>
>


Are you working on one of these?
http://en.wikipedia.org/wiki/VAX
Unbelievably cool!

 
Keith Keller
07-27-2013
On 2013-07-26, George Mpouras <(E-Mail Removed) m.com.nospam> wrote:
> Ok, let's talk a little more about creating directories (or any other
> trie-structured entries).
> There are three basic methods:


The best method is to use File::Path unless you have thoroughly
documented that you need to optimize directory creation because
File::Path isn't fast enough. Using File::Path optimizes for
programmer time and portability.

If you really believe that your methods are superior, you should write
them up as a patch to File::Path, so that others can thoroughly test
your code and include them in the module if they work properly.
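(For reference, a minimal File::Path sketch along those lines; make_path is
the modern interface, mkpath the older one used in the benchmark above:)

use strict;
use warnings;
use File::Path qw(make_path);

make_path('d1/d2/d3/d4/d5', { error => \my $err });
if ($err && @$err) {
    for my $diag (@$err) {
        my ($dir, $msg) = %$diag;    # each entry maps a path to its error text
        warn $dir eq ''
            ? "general error: $msg\n"
            : "could not create $dir: $msg\n";
    }
}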

--keith

--
(E-Mail Removed)-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt
see X- headers for PGP signature information

 