suggestions on intelligent processing of data sets in a file

 
 
alt.testing@{g}mail.com
 
      05-09-2007
Hi all,
I am writing a script to parse files and insert the data into MySQL.
The task is simple enough for files containing "standard" fields.
However, for many of the files this is not the case.
Some of the files even vary in the number of fields therein.

Example: (fields are email, name, postcode, phone)
(E-Mail Removed), Firstname Lastname
(E-Mail Removed), Firstname Lastname, 2004, 0412 321 512
(E-Mail Removed), Firstname Lastname, 0412 321 512


Now, other than the obvious and easy solution of breaking up the files
into chunks that are "known" and consistent in themselves, in terms of
data fields, I want to build a mechanism that can:

1. Autodetect the number of fields and "line-by-line" respectively
build the data structure as it goes.
2. Verify (or guess the "type" of field using regex)

I don't mind using modules, but would prefer to use ones shipped as
standard. Otherwise I'd build my own, as I really want to start doing a
bit of "OO", and this could be a good place to begin.

I have a feeling that creating a class, and building some methods
that can create objects (one for each different data set) that
reference/manipulate the actual data structures (or something similar),
might be a good approach. That way operations could actually be built on
the fly? Mind you, I've not yet created a module, so this would be my
first time. Is this the best approach, or is there something better?
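
A minimal sketch of what such a class might look like, using only core Perl. The package name RecordParser, its methods, and all four patterns are made up for illustration, and the sample line is invented. Each object holds its own ordered list of rules, so a different family of files can get a different parser object.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical class: one object per family of files, holding its own
# ordered list of [ field_name, pattern ] rules.
package RecordParser;

sub new {
    my ( $class, @rules ) = @_;
    return bless { rules => [@rules] }, $class;
}

# Split one line on commas and label each part with the first rule
# whose pattern matches; parts that match nothing go under "unknown".
sub parse_line {
    my ( $self, $line ) = @_;
    chomp $line;
    my %record;
    for my $part ( split /\s*,\s*/, $line ) {
        my $type = 'unknown';
        for my $rule ( @{ $self->{rules} } ) {
            my ( $name, $pattern ) = @$rule;
            if ( $part =~ $pattern ) { $type = $name; last }
        }
        # Keep every match, so repeated types are not silently lost.
        push @{ $record{$type} }, $part;
    }
    return \%record;
}

package main;

# Rule order matters: the first matching pattern wins.
my $parser = RecordParser->new(
    [ email    => qr/^[^@\s]+@[^@\s]+$/ ],
    [ postcode => qr/^\d{4}$/ ],
    [ phone    => qr/^0[\d ]{8,12}$/ ],
    [ name     => qr/^[A-Za-z][A-Za-z' -]*$/ ],
);

my $record = $parser->parse_line(
    'someone@example.com, Firstname Lastname, 2004, 0412 321 512' );

# Print each detected type with the value(s) found for it.
print join( ', ', map { "$_=@{ $record->{$_} }" } sort keys %$record ), "\n";
```

Because the rules are plain data passed to new(), operations can be assembled at run time, which is roughly the "built on the fly" idea. Whether a class is worth it over a plain subroutine is a judgment call for a script this size.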

Could anyone suggest some things that I might try?

tia


Full Context (some rough ideas as a starting point)
================================================================================
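#!/usr/bin/perl
# Usage: script.pl input_file field_name [field_name ...]
# The field names give the column order, e.g.: email name location mobile

use strict;
use warnings;

use DBI;    # for the MySQL insert, which is not written yet

# Column positions, filled in by set_indexes() from the command line.
my ( $email_index, $name_index, $location_index, $mobile_index );

my $input_file = $ARGV[0];
die "Usage: $0 input_file field_name [field_name ...]\n"
    unless defined $input_file;

# Candidate patterns for guessing a field's type (not applied yet).
my $email_regex    = qr/^ *[a-zA-Z0-9_.-]*@[a-zA-Z0-9_.-]*\.[a-zA-Z0-9_.-]*/;
my $mobile_regex   = qr/^ *[04][0-9 ]{8,12}/;
my $name_regex     = qr/^ *[a-z -]*/;
my $location_regex = qr/^ *[a-zA-Z0-9 ]*/;

set_indexes();

open( my $in_file, '<', $input_file )
    or die "Cannot open $input_file: $!";

while ( my $line = <$in_file> ) {
    next unless $line =~ /@/;    # skip lines with no email address
    chomp $line;
    my @working_data_array = split /,/, $line;

    my $email    = field( $email_index,    @working_data_array );
    my $name     = field( $name_index,     @working_data_array );
    my $location = field( $location_index, @working_data_array );
    my $mobile   = field( $mobile_index,   @working_data_array );

    print $email, $name, $location, $mobile, "\n";
}

close $in_file;

exit;

# Map each field name given after the file name to its column position.
# $ARGV[0] is the file name, so $ARGV[1] names column 0, and so on.
sub set_indexes {
    for my $counter ( 1 .. $#ARGV ) {
        $email_index    = $counter - 1 if $ARGV[$counter] =~ /email/;
        $name_index     = $counter - 1 if $ARGV[$counter] =~ /name/;
        $location_index = $counter - 1 if $ARGV[$counter] =~ /location/;
        $mobile_index   = $counter - 1 if $ARGV[$counter] =~ /mobile/;
    }
    return;
}

# Return the field at $index, or '' when that column was not named on
# the command line or the line is too short to contain it.
sub field {
    my ( $index, @fields ) = @_;
    return '' unless defined $index && defined $fields[$index];
    return $fields[$index];
}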
 
 
 
 
 
Tad McClellan
 
      05-09-2007
alt.testing@{g}mail.com <alt.testing@{g}mail.com> wrote:

> Some of the files even vary in the number of fields therein.
>
> Example: (fields are email, name, postcode, phone)
> (E-Mail Removed), Firstname Lastname
> (E-Mail Removed), Firstname Lastname, 2004, 0412 321 512
> (E-Mail Removed), Firstname Lastname, 0412 321 512



> I want to build a mechanism that can:
>
> 1. Autodetect the number of fields and "line-by-line" respectively
> build the data structure as it goes.
> 2. Verify (or guess the "type" of field using regex)



------------------------
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

while ( <DATA> ) {
    chomp;
    my %record;
    foreach my $part ( split /,\s*/ ) {
        if ( $part =~ /^\d+$/ )            # all digits
            { $record{postcode} = $part }
        elsif ( $part =~ /^[\d\s]+$/ )     # digits with spaces
            { $record{phone} = $part }
        elsif ( $part =~ /@/ )             # contains at-sign
            { $record{email} = $part }
        else
            { $record{name} = $part }
    }
    print Dumper \%record;
}

__DATA__
(E-Mail Removed), Firstname Lastname
(E-Mail Removed), Firstname Lastname, 2004, 0412 321 512
(E-Mail Removed), Firstname Lastname, 0412 321 512
------------------------
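
To see which layouts a file actually contains before loading anything, one option is to tally a per-line "signature" built from the same type tests. This is only a sketch: the helper names guess_type and signature are made up here, and the sample lines are invented stand-ins for a real input file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Guess a field's type with the same tests, in the same order, as the
# if/elsif chain in the script above.
sub guess_type {
    my ($part) = @_;
    return 'postcode' if $part =~ /^\d+$/;        # all digits
    return 'phone'    if $part =~ /^[\d\s]+$/;    # digits with spaces
    return 'email'    if $part =~ /@/;            # contains at-sign
    return 'name';                                # anything else
}

# Describe a line by the guessed types of its fields, in order.
sub signature {
    my ($line) = @_;
    chomp $line;
    return join ',', map { guess_type($_) } split /,\s*/, $line;
}

# Made-up sample lines standing in for a real input file.
my @lines = (
    'a@example.com, Firstname Lastname',
    'b@example.com, Firstname Lastname, 2004, 0412 321 512',
    'c@example.com, Firstname Lastname, 0412 321 512',
    'd@example.com, Other Person',
);

# Count how many lines share each layout.
my %count;
$count{ signature($_) }++ for @lines;

# One row per distinct layout: line count, then the field types.
printf "%3d  %s\n", $count{$_}, $_ for sort keys %count;
```

A layout that turns up only once or twice is a likely candidate for a malformed line, which covers the "verify" half of the question without hard-coding the number of fields.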


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
 
 
 
 
alt.testing@{g}mail.com
 
      05-14-2007
On Wed, 9 May 2007 06:05:40 -0500, Tad McClellan
<(E-Mail Removed)> wrote:

>alt.testing@{g}mail.com <alt.testing@{g}mail.com> wrote:
>
>> Some of the files even vary in the number of fields therein.
>>
>> Example: (fields are email, name, postcode, phone)
>> (E-Mail Removed), Firstname Lastname
>> (E-Mail Removed), Firstname Lastname, 2004, 0412 321 512
>> (E-Mail Removed), Firstname Lastname, 0412 321 512

>
>
>> I want to build a mechanism that can:
>>
>> 1. Autodetect the number of fields and "line-by-line" respectively
>> build the data structure as it goes.
>> 2. Verify (or guess the "type" of field using regex)

>
>
>------------------------
>#!/usr/bin/perl
>use warnings;
>use strict;
>use Data::Dumper;
>
>while ( <DATA> ) {
> chomp;
> my %record;
> foreach my $part ( split /,\s*/ ) {
> if ( $part =~ /^\d+$/ ) # all digits
> { $record{postcode} = $part }
> elsif ( $part =~ /^[\d\s]+$/ ) # digits with spaces
> { $record{phone} = $part }
> elsif ( $part =~ /@/ ) # contains at-sign
> { $record{email} = $part }
> else
> { $record{name} = $part }
> }
> print Dumper \%record;
>}
>
>__DATA__
>(E-Mail Removed), Firstname Lastname
>(E-Mail Removed), Firstname Lastname, 2004, 0412 321 512
>(E-Mail Removed), Firstname Lastname, 0412 321 512
>------------------------


Thanks, Tad.

 