Parsing Large Files

 
 
Jose Yimpho
 
      11-04-2003
Perl newbie here.. I'm experienced with other languages, but this is
my first grapple with Perl + Regular Expressions, and I could use some
help or a starting point on this problem.

I have a text file that contains lines like what's at the bottom of
this message. I would like to create a new file of
comma-separated values containing the info from that file. Possible
entries are company name, street address, city, state, zip, phone,
fax, email, url, rep, membership type, business type, and major
products.

Thanks for your help,
Joe Laughlin




----------------------------------------
A Street Games
489 Park Ave
Idaho Idaho Falls ID 83402
Phone: 208-542-2824 Fax: 208-542-2824
(E-Mail Removed)
Business Representative: Mike Antonson
Membership Type: C - Ret
Business type: Accessories, Board games, Collectable card games,
Family
games, Magazines, Miniatures, Retailer, Roleplaying games, Video
games,
Wargames, Comic Books
Major products: Role-Playing Games, Games Workshop Products, CCGs

2 Big Guyz
15901 Indian Head Hwy
Accokeek MD 20607
Phone: 240-210-0302
(E-Mail Removed)
www.2bigguyz.com
Business Representative: Andrew Turlington
Membership Type: C - Ret
Business type: Accessories, Board games, Books, Collectable card
games,
Magazines, Miniatures, Retailer, Wargames, Comic Books

21st Century Comics
1531 S Harbor Blvd
Fullerton CA 92832
Phone: 714-992-6649 Fax: 714-992-6604
(E-Mail Removed)
www.21stcenturycomics.com
Business Representative: Barry Short
Membership Type: C - Ret
Business type: Accessories, Books, Collectable card games, Other card
games,
Miniatures, Retailer, Roleplaying games, Wargames
Major products: Wizards of the Coast Products; Wizkids Products
-------------------------------------
 
Tad McClellan
 
      11-04-2003
Jose Yimpho <(E-Mail Removed)> wrote:

> Subject: Parsing Large Files



I see nothing relating to large files in your post, so why
did you say that there would be something relating to large
files in your Subject?


> Perl newbie here.. I'm experienced with other languages, but this is
> my first grapple with Perl + Regular Expressions, and I could use some
> help or a starting point on this problem.



You haven't told us enough to be of much help...


> I have a text file that contains lines like what's at the bottom of
> this message.



To parse a file we need to know the rules that the file will follow.

What rules will the file follow?


> Possible



Which ones are optional?

Which ones are required?


> entries are company name,



Is that always the 1st line?


> street address,



Is that always the 2nd line?


> phone,



Does that one always start with "Phone:" ?


> email,



Is that always the 5th line?


> url,



(you know those aren't really URLs, right?)


> rep, membership type, business type, and major
> products.



Do those ones always have the something-ending-with-colon headings?


> Business type: Accessories, Board games, Collectable card games,
> Family
> games, Magazines, Miniatures, Retailer, Roleplaying games, Video
> games,
> Wargames, Comic Books



Even worse than the sample-with-no-spec approach to getting help
is letting your newsreader break the data for you.

Is that all on one line in your Real Data?


Maybe this will get you started:

---------------------------
#!/usr/bin/perl
use strict;
use warnings;

{
    local $/ = '';    # enable paragraph mode
    while ( <DATA> ) {
        my($name, $street, $addr, $phone, $email) = /(.*)\n/g;
        my($city, $state, $zip) = $addr =~ /(.*?)\s+([A-Z][A-Z])\s+(\d+)$/;
        my($rep) = /^Business Representative:\s+(.*)/m;

        print "$name\n$street\n$city - $state - $zip\n$rep\n";
        print "-----\n";
    }
}

__DATA__
# your data here
---------------------------
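
Going one step further toward the comma-separated output the original post asks for, here is a rough sketch built on the same paragraph-mode loop. The helper names (csv_field, parse_record), the column order, and the quoting rule (double quotes around every value, embedded quotes doubled) are assumptions of this sketch and not something specified in the thread. It also assumes the first three lines of every record are name, street, and "city state zip". The sample records are copied from the first post and read through an in-memory filehandle so the script runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Quote one value for CSV: wrap it in double quotes and double any
# embedded double quotes. Undefined (missing) fields become "".
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s/"/""/g;
    return qq{"$value"};
}

# Turn one paragraph-mode record into a hash of fields.
# Assumption: lines 1-3 are always name, street, "city state zip".
sub parse_record {
    my ($record) = @_;
    my ( $name, $street, $addr ) = split /\n/, $record;
    my %f = ( name => $name, street => $street );
    @f{qw(city state zip)} = $addr =~ /^(.*?)\s+([A-Z][A-Z])\s+(\d+)\s*$/;
    ( $f{phone} ) = $record =~ /^Phone:[ \t]*([\d-]+)/m;
    ( $f{fax} )   = $record =~ /\bFax:[ \t]*([\d-]+)/;
    ( $f{rep} )   = $record =~ /^Business Representative:[ \t]*(.*)/m;
    return \%f;
}

my @columns = qw(name street city state zip phone fax rep);

# Two records from the original post, kept in a string so the sketch is
# self-contained; a real script would open the data file instead.
my $sample = <<'END';
2 Big Guyz
15901 Indian Head Hwy
Accokeek MD 20607
Phone: 240-210-0302
(E-Mail Removed)
www.2bigguyz.com
Business Representative: Andrew Turlington
Membership Type: C - Ret

21st Century Comics
1531 S Harbor Blvd
Fullerton CA 92832
Phone: 714-992-6649 Fax: 714-992-6604
(E-Mail Removed)
www.21stcenturycomics.com
Business Representative: Barry Short
Membership Type: C - Ret
END

open my $in, '<', \$sample or die "cannot open in-memory data: $!";
{
    local $/ = '';    # paragraph mode: one record per read
    while ( my $record = <$in> ) {
        my $f = parse_record($record);
        print join( ',', map { csv_field( $f->{$_} ) } @columns ), "\n";
    }
}
close $in;
```

Fields that a record lacks (the first sample record has no Fax, for instance) come out as empty quoted strings, so every output row keeps the same number of columns. For production use, a CSV module from CPAN would likely be a safer choice than hand-rolled quoting.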


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
Jose Yimpho
 
      11-04-2003
Tad McClellan wrote:

> Jose Yimpho <(E-Mail Removed)> wrote:
>
>> Subject: Parsing Large Files

>
>
> I see nothing relating to large files in your post, so why
> did you say that there would be something relating to large
> files in your Subject?
>


There are about 20,000 lines in the file. I thought that was large?

>
>> Perl newbie here.. I'm experienced with other languages, but this is
>> my first grapple with Perl + Regular Expressions, and I could use some
>> help or a starting point on this problem.

>
>
> You haven't told us enough to be of much help...


Sorry...

>
>
>> I have a text file that contains lines like what's at the bottom of
>> this message.

>
>
> To parse a file we need to know the rules that the file will follow.
>
> What rules will the file follow?
>
>
>> Possible

>
>
> Which ones are optional?
>
> Which ones are required?
>
>
>> entries are company name,

>
>
> Is that always the 1st line?


Yes

>
>
>> street address,

>
>
> Is that always the 2nd line?


Yes, the city, state, and zip are always the third line.

>
>
>> phone,

>
>
> Does that one always start with "Phone:" ?


Yes, and the Fax number has Fax: in front of it.

>
>
>> email,

>
>
> Is that always the 5th line?


No, it's sometimes there.

>
>
>> url,

>
>
> (you know those aren't really URLs, right?)


Forgive me.

>
>
>> rep, membership type, business type, and major
>> products.

>
>
> Do those ones always have the something-ending-with-colon headings?


Yes

>
>
>> Business type: Accessories, Board games, Collectable card games,
>> Family
>> games, Magazines, Miniatures, Retailer, Roleplaying games, Video
>> games,
>> Wargames, Comic Books

>
>
> Even worse than the sample-with-no-spec approach to getting help
> is letting your newsreader break the data for you.
>
> Is that all on one line in your Real Data?


No, not all on one line. I don't think the newsreader broke any data (the
data is on multiple lines for each entity, with a blank line in between
each entity).

Also, something like the following is legal (the linebreaks are
intentional):

Business type: Accessories, Board Games, Books,
Other card games, Family
Games, Magazines, Minatures
Major products: Wizkids Products; Wizards of the Coast
Products; Reaper Minatures
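
One possible way to cope with wrapped values like these, sketched under the assumption that every labelled field runs until the next known label or the end of the record. The label list below is taken from the sample data, and the sub name labeled_fields is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Labels seen in the sample data; extend the list if the real file has more.
my @labels = ( 'Business Representative', 'Membership Type',
               'Business type', 'Major products' );
my $label_re = join '|', map { quotemeta($_) } @labels;

# Return a hash ref of label => value for one record, with wrapped
# continuation lines rejoined into a single line.
sub labeled_fields {
    my ($record) = @_;
    my %field;
    # Capture lazily from "Label:" up to the next line that starts with
    # a known label, or up to the end of the record.
    while ( $record =~ /^($label_re):[ \t]*(.*?)(?=^(?:$label_re):|\z)/gims ) {
        my ( $label, $value ) = ( lc($1), $2 );
        $value =~ s/\s*\n\s*/ /g;    # rejoin wrapped lines with one space
        $value =~ s/\s+\z//;         # drop trailing whitespace
        $field{$label} = $value;
    }
    return \%field;
}

my $sample = <<'END';
Business type: Accessories, Board Games, Books,
Other card games, Family
Games, Magazines, Minatures
Major products: Wizkids Products; Wizards of the Coast
Products; Reaper Minatures
END

my $f = labeled_fields($sample);
print "$_: $f->{$_}\n" for sort keys %$f;
```

Note that rejoining with a space turns "Family" / "Games" into "Family Games"; the data alone does not say whether a line break fell inside an item or between two items, so that ambiguity stays.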





>
>
> Maybe this will get you started:
>
> ---------------------------
> #!/usr/bin/perl
> use strict;
> use warnings;
>
> { local $/ = ''; # enable paragraph mode
> while ( <DATA> ) {
> my($name, $street, $addr, $phone, $email) = /(.*)\n/g;
> my($city, $state, $zip) = $addr =~ /(.*?)\s+([A-Z][A-Z])\s+(\d+)$/;
> my($rep) = /^Business Representative:\s+(.*)/m;
>
> print "$name\n$street\n$city - $state - $zip\n$rep\n";
> print "-----\n";
> }
> }
>
> __DATA__
> # your data here
> ---------------------------
>
>


Thanks, that will get me started. Would appreciate any other help you could
give. If there's anything I can answer, let me know.

With regards to the paragraph grouping, I tried something like this last
night:

$/ = '';
while <FILE>
{
print;
$count++;
}
print "\nNumber of paragraphs: $count\n";

It printed the file contents, and then: 'Number of paragraphs: 1', which
didn't seem right to me, as I was trying to count the number of paragraphs
(or blank lines) in the file. Setting the $/ sets the 'splitter' to split
on all blank lines, right? and each iteration of the while loop reads in
one section of the input (split by blank lines), right? Not sure why it
was printing out a 1.

Joe Laughlin
 
Ben Morrow
 
      11-04-2003

Jose Yimpho <(E-Mail Removed)> wrote:
> With regards to the paragraph grouping, I tried something like this last
> night:
>
> $/ = '';
> while <FILE>
> {
> print;
> $count++;
> }
> print "\nNumber of paragraphs: $count\n";
>
> It printed the file contents, and then: 'Number of paragraphs: 1', which
> didn't seem right to me, as I was trying to count the number of paragraphs
> (or blank lines) in the file.


Are the lines between your paragraphs truly blank? If they contain any
whitespace (in the case of Win32 files opened in binary mode this
includes the \r at the end of each line), then they will not be
counted as paragraph breaks by Perl.

Try

$/ = $\ = "";
while (<FILE>) {
    print "Line $.: |$_|";
}

to see what Perl considers each paragraph to contain. If your file
does have 'blank' lines with whitespace in them, and you want to empty
those lines out, use

perl -pi~ -e 's/^\s+$/\n/' file

(replacing with "\n" keeps the line itself, so the paragraph break
survives).
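
If editing the file in place is not an option, a similar clean-up can be done in memory before splitting. This is only a sketch: the sub name paragraphs is invented here, and it assumes the whole file fits comfortably in memory, which should hold for a file of roughly 20,000 lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split text into paragraphs, tolerating CRLF line endings and
# "blank" lines that contain only spaces or tabs.
sub paragraphs {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;      # CRLF or lone CR becomes LF
    $text =~ s/^[ \t]+$//mg;    # whitespace-only lines become empty
    $text =~ s/\A\n+//;         # drop leading blank lines
    $text =~ s/\n+\z//;         # drop trailing newlines
    return split /\n{2,}/, $text;
}

# Sample with DOS line endings and one whitespace-only separator line.
my $messy = "Hello this\r\n \t\r\nis a\r\n\r\ngreat file\r\n";
my @paras = paragraphs($messy);
print scalar(@paras), " paragraphs\n";
print "|$_|\n" for @paras;
```

In a real script the text would come from slurping the file (for example with a localised, undefined $/) in place of the hard-coded sample string.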

Ben

--
$.=1;*g=sub{print@_};sub r($$\$){my($w,$x,$y)=@_;for(keys%$x){/main/&&next;*p=$
$x{$_};/(\w)::$/&&(r($w.$1,$x.$_,$y),next);$y eq\$p&&&g("$w$_")}};sub t{for(@_)
{$f&&($_||&g(" "));$f=1;r"","::",$_;$_&&&g(chr(0012))}};t # (E-Mail Removed)
$J::u::t, $a::n:::t::h::e::r, $P::e::r::l, $h::a::c::k::e::r, $.
 
Jose Yimpho
 
      11-04-2003
Ben Morrow wrote:

>
> Jose Yimpho <(E-Mail Removed)> wrote:
>> With regards to the paragraph grouping, I tried something like this last
>> night:
>>
>> $/ = '';
>> while <FILE>
>> {
>> print;
>> $count++;
>> }
>> print "\nNumber of paragraphs: $count\n";
>>
>> It printed the file contents, and then: 'Number of paragraphs: 1', which
>> didn't seem right to me, as I was trying to count the number of
>> paragraphs (or blank lines) in the file.

>
> Are the lines between your paragraphs truly blank? If they contain any
> whitespace (in the case of Win32 files opened in binary mode this
> includes the \r at the end of each line), then they will not be
> counted as paragraph breaks by Perl.
>
> Try
>
> $/ = $\ = "";
> while (<FILE>) {
>     print "Line $.: |$_|";
> }
>
> to see what Perl considers each paragraph to contain. If your file
> does have 'blank' lines with spaces in, and you want to get rid of
> them, use
>
> perl -pi~ -e 's/^\s+$/\n/' file
>
> .
>
> Ben
>


Yeah, I thought that too.

In vi (in Redhat 9), I created a file similar to:

=============
Hello this

is a

great file

and I am proud of it.
============

But I still got a paragraph count of one.


 
Glenn Jackman
 
      11-04-2003
Jose Yimpho <(E-Mail Removed)> wrote:
> With regards to the paragraph grouping, I tried something like this last
> night:
>
> $/ = '';
> while <FILE>


syntax error: should be: while (<FILE>)

> {
> print;
> $count++;
> }
> print "\nNumber of paragraphs: $count\n";
>
> It printed the file contents, and then: 'Number of paragraphs: 1', which
> didn't seem right to me, as I was trying to count the number of paragraphs
> (or blank lines) in the file. Setting the $/ sets the 'splitter' to split
> on all blank lines, right? and each iteration of the while loop reads in
> one section of the input (split by blank lines), right? Not sure why it
> was printing out a 1.


Are your blank lines truly empty, or do they have whitespace in them?
For instance, if each line ends with "\r\n", and you're processing the
file on a unixy OS where "\n" is the end of line character, you don't
have any empty lines in the file. Test this theory with: $/="\r\n\r\n";
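
To see the effect in isolation, here is a small sketch that counts paragraph-mode records in the same text with LF and with CRLF line endings. It assumes a Unix-like system where no CRLF translation layer is applied on read; the sub name count_paragraphs is made up for the example, and the text is the test file from earlier in the thread, read through in-memory filehandles.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the records Perl sees when reading a string in paragraph mode.
sub count_paragraphs {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    local $/ = '';    # paragraph mode: records end at runs of "\n\n"
    my $count = 0;
    while ( defined( my $para = <$fh> ) ) {
        $count++;
    }
    close $fh;
    return $count;
}

my $unix = "Hello this\n\nis a\n\ngreat file\n\nand I am proud of it.\n";
( my $dos = $unix ) =~ s/\n/\r\n/g;    # same text with DOS line endings

print "LF line endings:   ", count_paragraphs($unix), "\n";
print "CRLF line endings: ", count_paragraphs($dos),  "\n";
```

With LF endings the four paragraphs are counted separately. With CRLF endings every "blank" line still holds a "\r", so no two newlines are ever adjacent and the whole text comes back as a single record, which matches the 'Number of paragraphs: 1' symptom.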

--
Glenn Jackman
NCF Sysadmin
(E-Mail Removed)
 
Glenn Jackman
 
      11-04-2003
Jose Yimpho <(E-Mail Removed)> wrote:
> In vi (in Redhat 9), I created a file similar to:

[...]
> But I still got a paragraph count of one.


In vi, is your file format 'dos'?
:set fileformat
If so, set it to 'unix' before you save.
:set ff=unix
:wq

--
Glenn Jackman
NCF Sysadmin
(E-Mail Removed)
 
Tad McClellan
 
      11-04-2003
Jose Yimpho <(E-Mail Removed)> wrote:

> I tried something like this last
          ^^^^^^^^^^^^^^
> night:
>
> $/ = '';
> while <FILE>
> {



Please post *real* code.

Have you seen the Posting Guidelines that are posted here frequently?


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 