Velocity Reviews > Perl > domain name ordering

# domain name ordering

Ivan Shmakov
Guest
Posts: n/a

 02-28-2013
One more "domain name" problem: write an "inequality" predicate
to compare domain names "right to left". IOW:

my @ordered
= qw (qux.example.net bar.example.org foo.bar.example.org foo.example.org);

A trivial solution is to split /\./ each of the names, reverse
the resulting lists, and then compare them elementwise, until
either one of the lists ends (which is then the lesser one), or
an inequality is found. Hence:

sub list_cmp (&\$\$) {
## BTW, how do I imitate sort's own signature here?
## (i. e., make \$cmp truly optional.)
my (\$cmp, \$a, \$b) = @_;
\$cmp
//= sub { \$_[0] cmp \$_[1]; };
for (my \$i = 0; \$i <= \$#\$a; \$i++) {
## .
return 1
if (\$i > \$#\$b);
my \$v
= &\$cmp (\$a->[\$i], \$b->[\$i]);
## .
return \$v
if (\$v != 0);
}

## .
return (\$#\$a < \$#\$b ? -1 : 0);
}

sub dns_name_cmp (\$\$) {
my (\$a, \$b) = @_;

## .
list_cmp (undef,
[reverse (split (/\./, \$a, -1))],
[reverse (split (/\./, \$b, -1))]);
}

## print in reverse
print (join ("\n", sort { - dns_name_cmp (\$a, \$b); } (@ordered)), "\n");
# foo.example.org
# foo.bar.example.org
# bar.example.org
# qux.example.net

However, doing a split on every sort iteration doesn't seem all
that sensible (or does it?) Let's try with rindex () instead:

sub dns_name_cmp (\$\$) {
my (\$a, \$b) = @_;

my (\$ta, \$tb)
= (-1 + length (\$a), -1 + length (\$b));
while (\$ta >= 0 && \$tb >= 0) {
## NB: \$[ is deprecated, thus rindex () >= -1
my (\$i, \$j)
= (rindex (\$a, ".", \$ta),
rindex (\$b, ".", \$tb));
## .
my \$v
= (substr (\$a, 1 + \$i, \$ta - \$i)
cmp substr (\$b, 1 + \$j, \$tb - \$j));
return \$v
if (\$v != 0);
(\$ta, \$tb)
= (-1 + \$i, -1 + \$j);
}

## .
return (\$ta > 0 ? +1
: \$tb > 0 ? -1
: 0);
}

## print in reverse
print (join ("\n", sort { - dns_name_cmp (\$a, \$b); } (@ordered)), "\n");
# foo.example.org
# foo.bar.example.org
# bar.example.org
# qux.example.net

Hopefully I didn't miss any corner case with these.

Now, it makes me wonder if there's an easier way to write this
function...

--
FSF associate member #7257

Dr.Ruud
Guest
Posts: n/a

 02-28-2013
On 2013-02-28 09:36, Ivan Shmakov wrote:
> One more "domain name" problem: write an "inequality" predicate
> to compare domain names "right to left". IOW:
>
> my @ordered
> = qw (qux.example.net bar.example.org foo.bar.example.org foo.example.org);

You can build a hash from that, for example:

my %dname = (
net => {
example => { qux => { '.' => -1 } },
},
org => {
example => {
bar => {
foo => { '.' => -1 },
'.' => -1,
},
foo => { '.' => -1 },
},
},
);

Or do
my @dnames= sort { length(\$b) <=> length(\$a) or \$a cmp \$b } @ordered;
to make it:
qw(foo.bar.x.org bar.x.org foo.x.org qux.x.net);
and use a 'for my \$dname (@dnames) { ... } loop
that stops at the first match on
/(?:^|\.)\Q\$dname\$/

--
Ruud

Bo Lindbergh
Guest
Posts: n/a

 02-28-2013
Transform each FQDN into something that will sort correctly
using the default string comparison; sort; reverse the transformation.

@ordered = map {
join ".", reverse split / /;
} sort map {
join " ", reverse split /\./;
} @unordered;

/Bo Lindbergh

Rainer Weikusat
Guest
Posts: n/a

 02-28-2013
Ivan Shmakov <(E-Mail Removed)> writes:
> One more "domain name" problem: write an "inequality" predicate
> to compare domain names "right to left". IOW:
>
> my @ordered
> = qw (qux.example.net bar.example.org foo.bar.example.org foo.example.org);
>
> A trivial solution is to split /\./ each of the names, reverse
> the resulting lists, and then compare them elementwise, until
> either one of the lists ends (which is then the lesser one), or
> an inequality is found. Hence:
>
> sub list_cmp (&\$\$) {

[...]

> However, doing a split on every sort iteration doesn't seem all
> that sensible (or does it?) Let's try with rindex () instead:
>
>
> sub dns_name_cmp (\$\$) {
> my (\$a, \$b) = @_;
>
> my (\$ta, \$tb)
> = (-1 + length (\$a), -1 + length (\$b));

-1 + length(\$a) == length(\$a) - 1

> while (\$ta >= 0 && \$tb >= 0) {
> ## NB: \$[ is deprecated, thus rindex () >= -1
> my (\$i, \$j)
> = (rindex (\$a, ".", \$ta),
> rindex (\$b, ".", \$tb));
> ## .
> my \$v
> = (substr (\$a, 1 + \$i, \$ta - \$i)
> cmp substr (\$b, 1 + \$j, \$tb - \$j));

Have you benchmarked that? You're still doing the structure analysis
and component extraction for ever comparison and out of my head, I
wouldn't be absolutely sure that doing it this way instead of with
split is actually faster.

The obvious other idea would be to split the domain names once and
leave them this way until the list is sorted. If you could afford to
lose some speed at the expense of code clarity, this could be put in a
non-reversed' representation 'on demand' (=> overloaded "") or keeps
both around or constructs the 'exploded' one on demand.

Instead of reversing the splitted lists, one could also compare from
the end, possibly using negative indices (\$a[\$#a] is the same as \$a[-1]).

Rainer Weikusat
Guest
Posts: n/a

 02-28-2013
Bo Lindbergh <(E-Mail Removed)> writes:
> Transform each FQDN into something that will sort correctly
> using the default string comparison; sort; reverse the transformation.
>
> @ordered = map {
> join ".", reverse split / /;
> } sort map {
> join " ", reverse split /\./;
> } @unordered;

This is a neat idea. It could be made slightly more general by using
\000 instead of the space as alternate label separator.

Rainer Weikusat
Guest
Posts: n/a

 02-28-2013
Rainer Weikusat <(E-Mail Removed)> writes:
> Bo Lindbergh <(E-Mail Removed)> writes:
>> Transform each FQDN into something that will sort correctly
>> using the default string comparison; sort; reverse the transformation.
>>
>> @ordered = map {
>> join ".", reverse split / /;
>> } sort map {
>> join " ", reverse split /\./;
>> } @unordered;

>
> This is a neat idea. It could be made slightly more general by using
> \000 instead of the space as alternate label separator.

A 'better' ('faster') idea which needs even less Perl code would be to
build a hash of sort keys and sort the domain name array based on
comparing these instead of destroying and recreating the dataset which
is supposed to be sorted:

------------------
use Benchmark;

my @in = reverse(qw(qux.example.net bar.example.org foo.bar.example.org foo.example.org));

timethese(-5,
{
sort_x => sub {
return map {
join '.', reverse split /\000/
} sort map {
join "\000", reverse split /\./
} @in;
},

sort_y => sub {
my %keys;

\$keys{\$_} = join("\000", reverse(split(/\./, \$_))) for @in;
return sort { \$keys{\$a} cmp \$keys{\$b} } @in;
}});

Ivan Shmakov
Guest
Posts: n/a

 02-28-2013
>>>>> Rainer Weikusat <(E-Mail Removed)> writes:

[...]

> A 'better' ('faster') idea which needs even less Perl code would be
> to build a hash of sort keys and sort the domain name array based on
> comparing these instead of destroying and recreating the dataset
> which is supposed to be sorted:

[...]

Unless I be mistaken, sort { } () with the second variant of
dns_name_cmp I've originally posted is still faster.

\$ cat < 3zytr64ehn1aqckj5mk93da6zg.perl
use common::sense;

require Benchmark;

## Courtesy of /usr/share/dict/words
my @n1 = qw {
aces.mumps.colts pink.bebop.prune coeds.troop.sees tour.hulks.rungs
vent.dial.whelk logos.loped.nor mated.bran.cited pies.loaf.rune
sperm.yelp.harms how.scaly.ante dogma.lawns.plods dell.yaw.piano
caged.daub.gores weird.sofas.arias rave.gouty.tamp sics.avows.today
crass.execs.smuts dale.real.stack boots.dew.pool demur.fancy.stoop
gruff.aspen.memo gaze.homey.sorts new.slot.queue wiry.share.gulf
};
my @n2 = qw {
bums.example.com fools.example.com sic.example.org feel.example.com
wound.example.net stop.example.com diced.example.org
purse.example.com sip.example.net poll.example.com whole.example.net
kinky.example.net homer.example.org ulnas.example.com
won.example.org chase.example.com prong.example.net
bogs.example.net crest.example.org rays.example.com
solar.example.com click.example.net mixer.example.com
rib.example.org pinup.example.com gees.example.net fizzy.example.com
aisle.example.org
};

## sub dns_name_cmp { ... as per news:(E-Mail Removed) ... }

sub rw_sort (@) {
my %keys;

\$keys{\$_}
= join ("\000", reverse (split (/\./, \$_)))
for @_;
## .
return
sort { \$keys{\$a} cmp \$keys{\$b} } @_;
}

Benchmark::timethese (1024 * 1024, {
"is-1" => sub { sort dns_name_cmp (@n1); },
"is-2" => sub { sort dns_name_cmp (@n2); },
"rw-1" => sub { rw_sort (@n1); },
"rw-2" => sub { rw_sort (@n1); }
});
\$ perl 3zytr64ehn1aqckj5mk93da6zg.perl
Benchmark: timing 1048576 iterations of is-1, is-2, rw-1, rw-2...
is-1: -1 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) @ 34952533.33/s (n=1048576)
(warning: too few iterations for a reliable count)
is-2: -1 wallclock secs ( 0.16 usr + 0.00 sys = 0.16 CPU) @ 6553600.00/s (n=1048576)
(warning: too few iterations for a reliable count)
rw-1: 69 wallclock secs (69.14 usr + 0.03 sys = 69.17 CPU) @ 15159.40/s (n=1048576)
rw-2: 69 wallclock secs (68.61 usr + 0.02 sys = 68.63 CPU) @ 15278.68/s (n=1048576)
\$

--
FSF associate member #7257 np. Lord of the Rings -- Blind Guardian

Rainer Weikusat
Guest
Posts: n/a

 02-28-2013
Ben Morrow <(E-Mail Removed)> writes:
> Quoth Rainer Weikusat <(E-Mail Removed)>:
>> Rainer Weikusat <(E-Mail Removed)> writes:
>> > Bo Lindbergh <(E-Mail Removed)> writes:
>> >> Transform each FQDN into something that will sort correctly
>> >> using the default string comparison; sort; reverse the transformation.
>> >>
>> >> @ordered = map {
>> >> join ".", reverse split / /;
>> >> } sort map {
>> >> join " ", reverse split /\./;
>> >> } @unordered;
>> >
>> > This is a neat idea. It could be made slightly more general by using
>> > \000 instead of the space as alternate label separator.

>>
>> A 'better' ('faster') idea which needs even less Perl code would be to
>> build a hash of sort keys and sort the domain name array based on
>> comparing these instead of destroying and recreating the dataset which
>> is supposed to be sorted:

> [...]
>> my %keys;
>>
>> \$keys{\$_} = join("\000", reverse(split(/\./, \$_))) for @in;
>> return sort { \$keys{\$a} cmp \$keys{\$b} } @in;

>
> Or just use a standard ST
>
> map \$\$_[0],
> sort { \$\$a[1] cmp \$\$b[1] }
> map [ \$_, join "\0", reverse split /\./ ],
> @in;

I tried that. It is/ was slower for the data I tested with than the
algorithm originally posted by Bo Lindbergh.

Rainer Weikusat
Guest
Posts: n/a

 02-28-2013
Ivan Shmakov <(E-Mail Removed)> writes:
>>>>>> Rainer Weikusat <(E-Mail Removed)> writes:

>
> [...]
>
> > A 'better' ('faster') idea which needs even less Perl code would be
> > to build a hash of sort keys and sort the domain name array based on
> > comparing these instead of destroying and recreating the dataset
> > which is supposed to be sorted:

>
> [...]
>
> Unless I be mistaken, sort { } () with the second variant of
> dns_name_cmp I've originally posted is still faster.

[...]

> sub rw_sort (@) {
> my %keys;
>
> \$keys{\$_}
> = join ("\000", reverse (split (/\./, \$_)))
> for @_;
> ## .
> return
> sort { \$keys{\$a} cmp \$keys{\$b} } @_;
> }
>
> Benchmark::timethese (1024 * 1024, {
> "is-1" => sub { sort dns_name_cmp (@n1); },
> "is-2" => sub { sort dns_name_cmp (@n2); },
> "rw-1" => sub { rw_sort (@n1); },
> "rw-2" => sub { rw_sort (@n1); }
> });

The obvious issue with this would be that you're adding a gratuitious
subroutine call per iteration for the 3rd and 4th test. But that
doesn't matter that much: Comparing the labels via substring and
stopping at the first mismatch is several orders of magnitude faster
then processing the complete data once per sort and sorting based on
the 'preprocessed' data.

Jürgen Exner
Guest
Posts: n/a

 02-28-2013
Ivan Shmakov <(E-Mail Removed)> wrote:
[...]
> However, doing a split on every sort iteration doesn't seem all
> that sensible (or does it?)

[...]
> Now, it makes me wonder if there's an easier way to write this
> function...

Standard answer to this whole class of problem is to use a Schwartzian
Transformation: http://en.wikipedia.org/wiki/Schwartzian_transform

jue