suggestions for comparing two large data sets requested

 
 
Terry L. Ridder
 
      10-14-2003
hello;

background:
using the daily statistic files from the four regional internet
registries (rir), apnic, arin, lacnic, and ripe, i create a 'holes' data
set. 'holes' are network address blocks that are not reserved by iana
nor allocated or assigned by any of the four rirs. the current 'holes'
file contains 149190 entries in cidr notation.

a third party file contains ip addresses which in theory should be
blocked for various reasons. this flat ascii file has 16420 entries in
cidr notation. a review of the third party file shows ip addresses
listed which are really 'holes', i.e. they are neither reserved,
allocated, nor assigned by iana, apnic, arin, lacnic, or ripe.
however, that does not rule out that someone may actually be attempting
to use them.

the 'holes' data and the third party data need to be compared.
there are several possibilities:

for each 'holes' entry:
$holes_lo == begin ip address of network address block.
$holes_hi == end ip address of network address block.

for each 'third party' entry:
$block_lo == begin ip address of network address block.
$block_hi == end ip address of network address block.

$holes_lo < $block_lo &&
$holes_lo < $block_hi &&
$holes_hi > $block_lo &&
$holes_hi < $block_hi
partial overlap;
flag block entry;

$holes_lo > $block_lo &&
$holes_lo < $block_hi &&
$holes_hi > $block_lo &&
$holes_hi > $block_hi
partial overlap;
flag block entry;

$holes_lo > $block_lo &&
$holes_lo > $block_hi &&
$holes_hi > $block_lo &&
$holes_hi > $block_hi
no overlap;
ok;

$holes_lo < $block_lo &&
$holes_lo < $block_hi &&
$holes_hi < $block_lo &&
$holes_hi < $block_hi
no overlap;
ok;

$holes_lo < $block_lo &&
$holes_lo < $block_hi &&
$holes_hi > $block_lo &&
$holes_hi > $block_hi
total overlap;
flag block entry;
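
The five cases above use strict comparisons and do not list a hole lying wholly inside a block, or ranges that share an endpoint. A minimal sketch of a test that covers those as well, assuming every range is an inclusive pair of integers with lo <= hi. The helper name classify_overlap and its return strings are made up for illustration and are not part of the original post:

```perl
use strict;
use warnings;

# Hypothetical helper: classify how one hole relates to one block.
# All four arguments are inclusive integer addresses with lo <= hi.
sub classify_overlap {
    my ($holes_lo, $holes_hi, $block_lo, $block_hi) = @_;

    # Disjoint: the hole ends before the block starts,
    # or starts after the block ends.
    return 'none' if $holes_hi < $block_lo || $holes_lo > $block_hi;

    # The hole covers the whole block (equal endpoints included).
    return 'total' if $holes_lo <= $block_lo && $holes_hi >= $block_hi;

    # Anything else shares some but not all of the block's addresses,
    # including a hole lying strictly inside the block.
    return 'partial';
}

# Hole starts before the block and ends inside it.
print classify_overlap(10, 20, 15, 30), "\n";
```

Any result other than 'none' would mean the block entry gets flagged, so the flagging decision itself reduces to two comparisons.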

the flagged block entries will be checked against the bgp routing tables
by querying the router for announced routes just to make sure someone
is not attempting to use them.

using foreach loops would be braindead given the number of entries.
149190 x 16420. ( which change daily. )

please note:
all ip addresses are stored as numbers and *not* as dotted quads.

the reason for doing this is to provide feedback to the third party
concerning their listings and to request clarification as to why they
are listing network address blocks which are neither reserved,
allocated, assigned, nor routed.

i would be the first to agree that the third party should be checking
their listings, but for whatever reason they are not. i have pointed
out several 'errors' to the third party but it falls on deaf ears or
blind eyes depending on your perspective.

--
terry l. ridder ><>
postmaster at blauedonau.com

 
 
 
 
 
James Willmore
 
      10-14-2003
On Tue, 14 Oct 2003 00:14:38 -0500
"Terry L. Ridder" <(E-Mail Removed)> wrote:
<snip>

There are various Perl modules you could use to aid in this task.
Check out:
http://search.cpan.org

Since you don't have a specific Perl question, this is the best answer
I can give you. Others may have something else to offer.

HTH

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.

a fortune quote ...
For some reason, this fortune reminds everyone of Marvin
Zelkowitz.
 
 
 
 
 
John W. Krahn
 
      10-14-2003
"Terry L. Ridder" wrote:
>
> [snip]
>
> the flagged block entries will be check against the bgp routing tables
> by querying the router for announced routes just to make sure someone
> is not attempting to use it.
>
> using foreach loops would be braindead given the number of entries.
> 149190 x 16420. ( which change daily. )
>
> please note:
> all ip addresses are stored as numbers and *not* as dotted quads.


It sounds like you could use a bit vector to store the 'holes' data
which would require a 536,870,912 byte string and have look-ups of O(1).
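
A minimal sketch of the bit-vector idea using Perl's built-in vec. To keep it small, a 16-bit address space (8,192 bytes) stands in for the full 2**32 bits (536,870,912 bytes). The sample holes and the name is_hole are assumptions for illustration only:

```perl
use strict;
use warnings;

# Sketch only: a tiny 16-bit address space (65,536 bits = 8,192 bytes)
# stands in for the full 2**32-bit vector.
my $SPACE = 2 ** 16;
my $bits  = "\0" x ($SPACE / 8);

# Hypothetical sample holes as inclusive [lo, hi] pairs.
my @holes = ([100, 199], [4096, 8191]);

for my $hole (@holes) {
    my ($lo, $hi) = @$hole;
    vec($bits, $_, 1) = 1 for $lo .. $hi;    # set one bit per address
}

# Constant-time test of a single address.
sub is_hole { return vec($bits, $_[0], 1) }

print is_hole(150) ? "hole\n" : "not a hole\n";
print is_hole(200) ? "hole\n" : "not a hole\n";
```

The lookup is per address, so checking a whole block still means probing the addresses it contains, and every address of every hole has to be set once when the table is built.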


John
--
use Perl;
program
fulfillment
 
 
Anno Siegel
 
      10-14-2003
John W. Krahn <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> "Terry L. Ridder" wrote:
> >
> > [snip]
> >
> > the flagged block entries will be check against the bgp routing tables
> > by querying the router for announced routes just to make sure someone
> > is not attempting to use it.
> >
> > using foreach loops would be braindead given the number of entries.
> > 149190 x 16420. ( which change daily. )
> >
> > please note:
> > all ip addresses are stored as numbers and *not* as dotted quads.

>
> It sounds like you could use a bit vector to store the 'holes' data
> which would require a 536,870,912 byte string and have look-ups of O(1).


Ah, the sophistication of brute force

It may cost some time to set up the table in the first place. A (probably
substantial) subset of 8*536,870,912 bits must be set, more or less
individually.

Alternatively, one could use binary search to find the starting and
ending points of a possible enclosing interval.
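
One way the binary search could look, as a sketch under assumptions: the holes are held as inclusive [lo, hi] pairs, sorted by lo, and do not overlap one another (so they are sorted by hi as well). The sample data and both sub names are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical sample data: holes as inclusive [lo, hi] pairs,
# sorted by lo and assumed not to overlap one another.
my @holes = ([100, 199], [300, 399], [1000, 1999]);

# Return the index of the first hole whose hi is >= $addr,
# or scalar(@holes) if there is none.
sub first_hole_ending_at_or_after {
    my ($addr) = @_;
    my ($left, $right) = (0, scalar @holes);
    while ($left < $right) {
        my $mid = int(($left + $right) / 2);
        if ($holes[$mid][1] < $addr) { $left  = $mid + 1 }
        else                         { $right = $mid }
    }
    return $left;
}

# A block overlaps some hole if and only if the first hole ending at
# or after block_lo also starts at or before block_hi.
sub block_hits_hole {
    my ($block_lo, $block_hi) = @_;
    my $i = first_hole_ending_at_or_after($block_lo);
    return $i < @holes && $holes[$i][0] <= $block_hi ? 1 : 0;
}

print block_hits_hole(150, 250), "\n";    # overlaps the hole 100-199
print block_hits_hole(200, 299), "\n";    # falls between two holes
```

Each block then costs about log2(149190), roughly 17 probes, instead of a pass over the whole holes list.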

Anno
 
 
Quantum Mechanic
 
      10-14-2003
(E-Mail Removed)-berlin.de (Anno Siegel) wrote in message news:<bmgils$anq$(E-Mail Removed)-Berlin.DE>...
> Alternatively, one could use binary search to find the starting and
> ending points of a possible enclosing interval.


Assuming the begin/end pairs in each list are sorted, walk both lists
in parallel the way the merge step of merge sort does, keeping a little
state about the current position in each stream, and it should take
linear time.
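
A sketch of such a merge-style sweep, assuming both lists are inclusive [lo, hi] pairs sorted by lo and that the holes do not overlap one another. The sample data and variable names are invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample data: inclusive [lo, hi] pairs. Both lists are
# sorted by lo, and the holes are assumed not to overlap one another.
my @holes  = ([100, 199], [300, 399], [1000, 1999]);
my @blocks = ([50, 120], [200, 299], [350, 360], [390, 1100], [5000, 6000]);

my @flagged;
my $h = 0;    # index of the first hole that can still matter
for my $block (@blocks) {
    my ($block_lo, $block_hi) = @$block;

    # Holes ending before this block starts cannot touch it or any
    # later block, because block_lo never decreases.
    $h++ while $h < @holes && $holes[$h][1] < $block_lo;

    # The current hole is the earliest candidate; it overlaps if and
    # only if it starts at or before the block's end.
    push @flagged, $block if $h < @holes && $holes[$h][0] <= $block_hi;
}

print join(' ', map { "$_->[0]-$_->[1]" } @flagged), "\n";
```

The hole index only ever moves forward, so the total work is proportional to the number of holes plus the number of blocks, not their product.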

-QM
 