please help with creating a special iterator

 
 
Mathematisch
      10-18-2010
Hi,

The problem: I would like to create an iterator to iterate through a
csv file with the following structure:


field_1,field_2,...field_14
field_1,field_2,...field_14
(...)



Note that this is a csv file with 14 fields and it is already sorted
by field_1 and then by field_2. There are usually only 5-10 lines
having the same field_1 and field_2 value.

There could be up to hundreds of millions of lines in the file. The
desired iterator should work like this: At each "next_entry" call, the
iterator should return a reference to an array of the lines having the
identical field_1 and field_2 values.

Because of my lack of understanding the iterator concept, I could not
come up with a solution yet. The file is too big to use the field_1
and field_2 as a hash key to achieve the same goal of grouping the
entries.

Thank you very much for any help on this. I hope I can learn from the
eventual proposed solutions.

Kind regards.
F.



 
J. Gleixner
      10-18-2010
Mathematisch wrote:
[...]


You don't say what you want to do with the data, however
you could store everything into a database, then using
group by, order by, you could process your data easily.

However, since you say that everything is already sorted
by those keys, you could process things as you read the
file, keeping track of when those fields change. Wrapping
this in a next_entry that returns the data otherwise passed
to process_data would be simple enough. I rarely bother
with creating an 'iterator', but that's just me.

Hopefully you're using Text::CSV or some other module to
parse the CSV file.

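my ( $prev_f1, $prev_f2, @data );
while( my $line = <> )
{
    chomp( $line );
    # parse_line_somehow() is a placeholder for your CSV parser (e.g. Text::CSV)
    my ( $f1, $f2, @fields ) = parse_line_somehow( $line );

    # NOTE: on the first line $prev_f1 and $prev_f2 are still undef,
    # so this test fails and the else branch runs with an empty @data.
    if( $f1 eq $prev_f1 && $f2 eq $prev_f2 )
    {
        push( @data, \@fields );
    }
    else
    {
        # keys changed: hand over the finished group, then start a new one
        process_data( $prev_f1, $prev_f2, \@data );
        $prev_f1 = $f1;
        $prev_f2 = $f2;
        undef @data;
        push( @data, \@fields );
    }
}
# flush the last group
process_data( $prev_f1, $prev_f2, \@data ) if @data;

sub process_data
{
    my $f1 = shift;
    my $f2 = shift;
    my $data_aref = shift;    # reference to an array of field-array refs

    # do whatever you want...
}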
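
The "next_entry around this" idea could look roughly like the closure below. This is only a minimal sketch under assumptions: make_group_iterator is a made-up name, the plain split on commas only works if no field contains a quoted comma (Text::CSV would be the safer parser), and the input is assumed to have no blank lines. Each call returns a reference to [ f1, f2, \@rows ], mirroring the arguments of process_data, or nothing once the input is exhausted.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that yields one group per call.
# Assumes the input is already sorted by the first two fields.
sub make_group_iterator {
    my ($fh) = @_;

    # Read and parse one line; returns [ $f1, $f2, \@rest ] or nothing at EOF.
    my $read = sub {
        my $line = <$fh>;
        return unless defined $line;
        chomp $line;
        # Assumption: no quoted commas; swap in a real CSV parser otherwise.
        my ( $f1, $f2, @rest ) = split /,/, $line, -1;
        return [ $f1, $f2, \@rest ];
    };

    # Read-ahead: the first line of the group that the next call will return.
    my $pending = $read->();

    return sub {
        return unless $pending;    # input exhausted
        my ( $f1, $f2 ) = @{$pending}[ 0, 1 ];
        my @group = ( $pending->[2] );
        while ( $pending = $read->() ) {
            # Keys changed: keep this line as read-ahead for the next call.
            last if $pending->[0] ne $f1 or $pending->[1] ne $f2;
            push @group, $pending->[2];
        }
        return [ $f1, $f2, \@group ];
    };
}

# Usage with an in-memory file (illustrative data only).
my $csv = "a,1,x\na,1,y\na,2,z\nb,1,w\n";
open my $fh, '<', \$csv or die $!;
my $next_entry = make_group_iterator($fh);
while ( my $entry = $next_entry->() ) {
    my ( $f1, $f2, $rows ) = @$entry;
    printf "%s,%s: %d line(s)\n", $f1, $f2, scalar @$rows;
}
```

Only one group plus one read-ahead line is held in memory at a time, so the total file size should not matter, only the size of the largest group.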
 
 
J. Gleixner
      10-18-2010
J. Gleixner wrote:
[...]
> my ( $prev_f1, $prev_f2, @data );
> while( my $line = <> )
> {
> chomp( $line );
> my ( $f1, $f2, @fields ) = parse-line-somehow();
>


FYI: Just looking at this again..

Dumb bug here: on the first line read, $prev_f1 and $prev_f2 are still
undefined, so it will always go to the else and call process_data()
with an empty @data. You'll have to modify this if() accordingly.

> if( $f1 eq $prev_f1 && $f2 eq $prev_f2 )
> {
> push( @data, \@fields );
> }
> else
> {
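
One possible way to modify that if(), sketched here under assumptions rather than as the definitive fix: only compare against the previous keys once @data holds something. The name group_sorted_lines is made up for illustration, and $parse and $process are caller-supplied code refs standing in for parse-line-somehow() and process_data().

```perl
use strict;
use warnings;

# Hypothetical wrapper around the loop; not part of the original post.
sub group_sorted_lines {
    my ( $fh, $parse, $process ) = @_;
    my ( $prev_f1, $prev_f2, @data );
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $f1, $f2, @fields ) = $parse->($line);

        # @data is empty only before the first line has been stored, so
        # the previous keys are compared only when they are defined.
        if ( @data && ( $f1 ne $prev_f1 || $f2 ne $prev_f2 ) ) {
            # Keys changed: hand over a copy of the finished group.
            $process->( $prev_f1, $prev_f2, [@data] );
            @data = ();
        }
        ( $prev_f1, $prev_f2 ) = ( $f1, $f2 );
        push @data, \@fields;
    }
    # Flush the last group, if any line was read at all.
    $process->( $prev_f1, $prev_f2, [@data] ) if @data;
}
```

With this shape the callback is never invoked with an empty group, so no extra check is needed inside it.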

 
 
sln@netherlands.com
      10-19-2010
On Mon, 18 Oct 2010 11:09:11 -0500, "J. Gleixner" wrote:

>[...]


Since it's set up to process_data() on every
non-match (including the first line), the check could
be in the function.

----------

my ( $prev_f1, $prev_f2, @data );
while( my $line = <> )
{
    ...
}
process_data( $prev_f1, $prev_f2, \@data ); # if @data;

sub process_data
{
    my ( $f1, $f2, $data_aref ) = @_;
    return unless @{$data_aref};    # nothing collected yet (e.g. first line)

    if ( @{$data_aref} > 1 ) {
        # process multiple records (all with same f1 f2 val's)
    }
    else {
        # process single record (or not)
    }
}

-sln
 
 
Xho Jingleheimerschmidt
      10-19-2010
Mathematisch wrote:
> Hi,
>
> The problem: I would like to create an iterator to iterate through a
> csv file with the following structure:
>
>
> field_1,field_2,...field_14
> field_1,field_2,...field_14
> (...)
>
> Note that this is a csv file with 14 fields and it is already sorted
> by field_1 and then by field_2. There are usually only 5-10 lines
> having the same field_1 and field_2 value.


What is usually the case is of precious little value. If the unusual
case causes ICBMs to be erroneously launched, where is the comfort in
the fact that this is unusual? What is the *maximum plausible* number
of lines with the same field_1 and field_2?
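
Since anything that returns a whole group at once has to hold that group in memory, the largest group is the number that matters. A rough one-pass sketch for measuring it is below. The name max_group_size is made up, and the code assumes plain comma-separated lines with no quoted commas and no blank lines.

```perl
use strict;
use warnings;

# Hypothetical helper: size of the largest run of consecutive lines
# sharing the same first two fields. Uses constant memory.
sub max_group_size {
    my ($fh) = @_;
    my ( $prev_key, $count, $max ) = ( undef, 0, 0 );
    while ( my $line = <$fh> ) {
        chomp $line;
        # Limit of 3: only the first two fields need to be separated.
        my ( $f1, $f2 ) = split /,/, $line, 3;
        # The fields cannot contain commas, so this key is unambiguous.
        my $key = "$f1,$f2";
        if ( defined $prev_key && $key eq $prev_key ) {
            $count++;
        }
        else {
            $count    = 1;
            $prev_key = $key;
        }
        $max = $count if $count > $max;
    }
    return $max;
}
```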

> There could be up to hundreds of millions of lines in the file. The
> desired iterator should work like this: At each "next_entry" call, the
> iterator should return a reference to an array of the lines having the
> identical field_1 and field_2 values.
>
> Because of my lack of understanding the iterator concept, I could not
> come up with a solution yet. The file is too big to use the field_1
> and field_2 as a hash key to achieve the same goal of grouping the
> entries.


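package whatever;
use strict;
use warnings;

sub new {
    shift;    # not meant for subclassing
    my $file = shift;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $x = <$fh>;    # read ahead one line; undef if the file is empty
    chomp $x if defined $x;
    return bless [ $fh, $x ];
}

sub next_entry {
    my $this = shift;
    my $fh   = $this->[0];
    return unless defined $this->[1];    # input exhausted
    my @return = $this->[1];             # group starts with the read-ahead line
    my @line   = split /,/, $this->[1];
    while (1) {
        $this->[1] = <$fh>;
        return [@return] unless defined $this->[1];    # EOF: last group
        chomp $this->[1];
        my @line2 = split /,/, $this->[1];
        # keys changed: the new line stays in $this->[1] for the next call
        return [@return] unless $line2[0] eq $line[0] and $line2[1] eq $line[1];
        push @return, $this->[1];
    }
}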




 