Mathematisch
 10-18-2010
Hi,

The problem: I would like to create an iterator to iterate through a
csv file with the following structure:

field_1,field_2,...field_14
field_1,field_2,...field_14
(...)

Note that this is a csv file with 14 fields and it is already sorted
by field_1 and then by field_2. There are usually only 5-10 lines
having the same field_1 and field_2 value.

There could be up to hundreds of millions of lines in the file. The
desired iterator should work like this: At each "next_entry" call, the
iterator should return a reference to an array of the lines having the
identical field_1 and field_2 values.

Because I do not fully understand the iterator concept, I have not
come up with a solution yet. The file is too big to group the entries
in memory by using field_1 and field_2 as a hash key.

Thank you very much for any help on this. I hope I can learn from the
eventual proposed solutions.

Kind regards.
F.

J. Gleixner
 10-18-2010

You don't say what you want to do with the data. However,
you could load everything into a database and then use
GROUP BY and ORDER BY to process your data easily.

However, since you say that everything is already sorted
by those keys, you could process things as you read the
file, keeping track of when those fields change. Wrapping
this in a next_entry that returns the data, instead of
passing it to process_data, would be simple enough. I rarely
bother with creating an 'iterator', but that's just me.

Hopefully you're using Text::CSV or some other module to
parse the CSV file.

my ( $prev_f1, $prev_f2, @data );
while ( my $line = <> )
{
    chomp( $line );
    # parse_line_somehow() is a placeholder, e.g. for a Text::CSV call
    my ( $f1, $f2, @fields ) = parse_line_somehow( $line );

    if ( $f1 eq $prev_f1 && $f2 eq $prev_f2 )
    {
        push( @data, \@fields );
    }
    else
    {
        # key changed: hand over the finished group, start a new one
        process_data( $prev_f1, $prev_f2, \@data );
        $prev_f1 = $f1;
        $prev_f2 = $f2;
        undef @data;
        push( @data, \@fields );
    }
}
# the last group is still pending when the input ends
process_data( $prev_f1, $prev_f2, \@data ) if @data;

sub process_data
{
    my $f1 = shift;
    my $f2 = shift;
    my $data_aref = shift;

    # do whatever you want...
}
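
If you do want the next_entry form, the same idea could be wrapped in a closure, roughly like the sketch below. It is only an illustration under a few assumptions: make_group_iterator is a made-up name, lines are split on plain commas (no quoted fields, so swap in Text::CSV for real data), every line has at least two fields, and the demo reads invented sample data from an in-memory filehandle instead of the real file.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that yields one group per call.
sub make_group_iterator {
    my ($fh) = @_;
    my $pending;    # first line of the next group, read ahead by the previous call

    return sub {
        my $first = defined $pending ? $pending : <$fh>;
        return unless defined $first;    # end of file: no more groups
        undef $pending;
        chomp $first;

        my ( $f1, $f2 ) = split /,/, $first;    # assumes no quoted commas
        my @group = ($first);

        while ( defined( my $line = <$fh> ) ) {
            chomp $line;
            my ( $g1, $g2 ) = split /,/, $line;
            if ( $g1 eq $f1 && $g2 eq $f2 ) {
                push @group, $line;
            }
            else {
                $pending = $line;    # belongs to the next group
                last;
            }
        }
        return \@group;
    };
}

# Demo on an in-memory file (sample data invented for illustration).
my $csv = "a,1,x\na,1,y\na,2,z\nb,1,w\n";
open my $fh, '<', \$csv or die "Cannot open in-memory file: $!";

my $next_entry = make_group_iterator($fh);
my @groups;
while ( my $group = $next_entry->() ) {
    push @groups, $group;
    print scalar(@$group), " line(s): @$group\n";
}
```

Only one group plus one read-ahead line is held in memory at a time, so the size of the file should not matter, only the size of the largest group.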

J. Gleixner
 10-18-2010
J. Gleixner wrote:
[...]
> my ( $prev_f1, $prev_f2, @data );
> while ( my $line = <> )
> {
>     chomp( $line );
>     my ( $f1, $f2, @fields ) = parse_line_somehow( $line );
>

FYI: Just looking at this again..

Dumb bug here: on the first line read, $prev_f1 and $prev_f2 are
still undefined, so the comparison will normally fail and it goes
to the else, calling process_data() with an empty @data. You'll
have to modify this if() accordingly.

> if ( $f1 eq $prev_f1 && $f2 eq $prev_f2 )
> {
>     push( @data, \@fields );
> }
> else
> {
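
One possible way to adjust it, as a sketch only: treat an empty @data as "no group started yet", so the first line starts a group instead of flushing an empty one. Here parse_line is a stand-in that splits on plain commas, the sample lines are invented, and process_data just records what it was given so the result can be printed.

```perl
use strict;
use warnings;

my @flushed;    # collected here only so the demo can show the result

sub process_data {
    my ( $f1, $f2, $data_aref ) = @_;
    push @flushed, [ $f1, $f2, scalar @$data_aref ];
}

# Stand-in parser (assumption): plain split, no quoted fields.
sub parse_line { return split /,/, $_[0] }

my ( $prev_f1, $prev_f2, @data );
for my $line ( "a,1,x", "a,1,y", "a,2,z" ) {    # invented sample lines
    my ( $f1, $f2, @fields ) = parse_line($line);

    # Flush only when a group exists and the key has changed.
    if ( @data && ( $f1 ne $prev_f1 || $f2 ne $prev_f2 ) ) {
        process_data( $prev_f1, $prev_f2, [@data] );
        @data = ();
    }
    ( $prev_f1, $prev_f2 ) = ( $f1, $f2 );
    push @data, \@fields;
}
process_data( $prev_f1, $prev_f2, \@data ) if @data;

printf "%s,%s: %d line(s)\n", @$_ for @flushed;
```

Because the @data test comes first, the undefined $prev_f1 and $prev_f2 are never compared on the first line.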

sln@netherlands.com
 10-19-2010
On Mon, 18 Oct 2010 11:09:11 -0500, "J. Gleixner" <(E-Mail Removed)> wrote:

>You don't say what you want to do with the data, however
>you could store everything into a database, then using
>group by, order by, you could process your data easily.
>
>However, since you say that everything is already sorted
>by those keys, you could process things as you read the
>file, keeping track of when those fields change. Throwing a
>next_entry around this and having it return the data
>of calling process_data, would be simple enough, I rarely
>bother with creating an 'iterator'.. but that's just me..
>
>Hopefully you're using Text::CSV or some other module to
>parse the CSV file.
>
>my ( $prev_f1, $prev_f2, @data );
>while ( my $line = <> )
>{
>    chomp( $line );
>    my ( $f1, $f2, @fields ) = parse_line_somehow( $line );
>
>    if ( $f1 eq $prev_f1 && $f2 eq $prev_f2 )
>    {
>        push( @data, \@fields );
>    }
>    else
>    {
>        process_data( $prev_f1, $prev_f2, \@data );
>        $prev_f1 = $f1;
>        $prev_f2 = $f2;
>        undef @data;
>        push( @data, \@fields );
>    }
>}
>process_data( $prev_f1, $prev_f2, \@data ) if @data;
>
>sub process_data
>{
>    my $f1 = shift;
>    my $f2 = shift;
>    my $data_aref = shift;
>
>    # do whatever you want...
>}

Since it's set up to process_data() on every
non-match (including the first line), the check could
be in the function.

----------

my ( $prev_f1, $prev_f2, @data );
while ( my $line = <> )
{
    ...
}
process_data( $prev_f1, $prev_f2, \@data ); # if @data;

sub process_data
{
    my ( $f1, $f2, $data_aref ) = @_;
    # covers the empty call made for the first line read
    return unless @{$data_aref};

    if ( @{$data_aref} > 1 ) {
        # process multiple records (all with same f1 f2 val's)
    }
    else {
        # process single record (or not)
    }
}

-sln

Xho Jingleheimerschmidt
 10-19-2010
Mathematisch wrote:
> Hi,
>
> The problem: I would like to create an iterator to iterate through a
> csv file with the following structure:
>
>
> field_1,field_2,...field_14
> field_1,field_2,...field_14
> (...)
>
> Note that this is a csv file with 14 fields and it is already sorted
> by field_1 and then by field_2. There are usually only 5-10 lines
> having the same field_1 and field_2 value.

What is usually the case is of precious little value. If the unusual
case causes ICBMs to be erroneously launched, where is the comfort in
the fact that this is unusual? What is the *maximum plausible* number
of lines with the same field_1 and field_2?

> There could be up to hundreds of millions of lines in the file. The
> desired iterator should work like this: At each "next_entry" call, the
> iterator should return a reference to an array of the lines having the
> identical field_1 and field_2 values.
>
> Because of my lack of understanding the iterator concept, I could not
> come up with a solution yet. The file is too big to use the field_1
> and field_2 as a hash key to achieve the same goal of grouping the
> entries.

package whatever;
sub new {
    shift;    # not meant for subclassing
    my $file = shift;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $x = <$fh>;    # read ahead: first line of the first group
    chomp $x if defined $x;
    return bless [ $fh, $x ];
}

sub next_entry {
    my $this = shift;
    my $fh   = $this->[0];
    return unless defined $this->[1];    # no read-ahead line left: end of file
    my @return = $this->[1];
    my @line   = split /,/, $this->[1];
    while (1) {
        $this->[1] = <$fh>;
        return [@return] unless defined $this->[1];
        chomp $this->[1];
        my @line2 = split /,/, $this->[1];
        # different key: keep this line as read-ahead for the next call
        return [@return] unless $line2[0] eq $line[0] and $line2[1] eq $line[1];
        push @return, $this->[1];
    }
}
