look up very large table

 
 
ela
      02-10-2010
I have some large data in pieces, e.g.

asia.gz.tar 300M

or

roads1.gz.tar 100M
roads2.gz.tar 100M
roads3.gz.tar 100M
roads4.gz.tar 100M

I wonder whether I should concatenate them all into a single ultra-large
file and then parse them into a large table (I don't know whether
Perl can handle that...).

The final table should look like this:

ID1 ID2 INFO
X1 Y9 san diego; california; West Coast; America; North America; Earth
X2.3 H9 Beijing; China; Asia
.....

Each row may come from a big file of >100M (as mentioned above):

CITY Beijing
NOTE Capital
RACE Chinese
....

And then I have another, much smaller table which contains all the IDs
(either ID1 or ID2, maybe 100,000 records, <20M), and I just need to
annotate this 20M file with the INFO. Hashing does not seem to be a
solution on my 32G, 8-core machine...

Any advice? Or should I resort to some other language?







 
Jim Gibson
      02-10-2010
In article <hku3e0$3fs$(E-Mail Removed)>, ela
<(E-Mail Removed)> wrote:

> I have some large data in pieces, e.g.
>
> asia.gz.tar 300M
>
> or
>
> roads1.gz.tar 100M
> roads2.gz.tar 100M
> roads3.gz.tar 100M
> roads4.gz.tar 100M
>
> I wonder whether I should concatenate them all into a single ultra large
> file and then perform parsing them into a large table (I don't know whether
> perl can handle that...).


There is no benefit that I can see to concatenating the files. Use the
File::Find module to find all files with a certain naming convention,
read each one, and process the information in each file. As for the
amount of information that Perl can handle, that is mostly determined
by the available memory and how smart you are at condensing the data,
keeping only what you need and throwing away stuff you don't need.
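
Something along these lines, as a rough sketch (the directory, the file
name pattern, and the per-line handling are only placeholders for
whatever your data actually looks like):

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my $datadir = '/path/to/data';    # placeholder

# collect every file matching the naming convention, e.g. roads1, roads2, ...
my @files;
find( sub { push @files, $File::Find::name if /^roads\d+/ }, $datadir );

for my $file (sort @files) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        # each record line looks like "CITY Beijing", "NOTE Capital", ...
        my ( $key, $value ) = split ' ', $line, 2;
        next unless defined $value;
        print "$key => $value\n";    # replace with your own condensing logic
    }
    close $fh;
}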

>
> The final table should look like this:
>
> ID1 ID2 INFO
> X1 Y9 san diego; california; West Coast; America; North Ameria; Earth
> X2.3 H9 Beijing; China; Asia


Perl does not have tables. It has arrays and hashes. You can nest
arrays and hashes to store complex datasets in memory by using
references.
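
For instance, the final "table" could live in memory as a hash of hash
references keyed on ID1 (the field names here are just made up for
illustration):

use strict;
use warnings;

# each ID1 maps to a hash reference holding the remaining columns
my %table = (
    'X1'   => { id2 => 'Y9', info => 'san diego; california; West Coast; America; North America; Earth' },
    'X2.3' => { id2 => 'H9', info => 'Beijing; China; Asia' },
);

for my $id1 ( sort keys %table ) {
    print join( "\t", $id1, $table{$id1}{id2}, $table{$id1}{info} ), "\n";
}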

> ....
>
> each row may come from a big file of >100M (as aforementioned):
>
> CITY Beijing
> NOTE Capital
> RACE Chinese
> ...
>
> And then I have another much smaller table which contains all the ID's
> (either ID1 or ID2, maybe 100,000 records, <20M). and I just need to make
> this 20M file annotated with the INFO. Hashing seems not to be a solution
> for my 32G, 8-core machine...
>
> Any advice? or should i resort to some other languages?


Try reading all the files and saving the data you want. If you run out
of memory, then think about a different approach. 32GB of memory is
quite a lot.

If you can't fit all of your data into memory at one time, you might
consider using a database that will store your data in files. Perl has
support for many databases. But I would first determine whether or not
you can fit everything in memory.

--
Jim Gibson
 
ccc31807
      02-10-2010
On Feb 10, 5:57 am, "ela" <(E-Mail Removed)> wrote:
> Any advice? or should i resort to some other languages?


Perl is probably your best bet for this task.

> I have some large data in pieces, e.g.
> asia.gz.tar 300M
> or
> roads1.gz.tar 100M


It might be helpful for you to give a sample of your data format. You
don't mention untarring and unzipping your file, so I assume that you
are dealing with ASCII text. If not, then some of the following might
not work well.

> I wonder whether I should concatenate them all into a single ultra large
> file and then perform parsing them into a large table (I don't know whether
> perl can handle that...).


Irrelevant question. Ordinarily you process files one line at a time,
so it doesn't make any difference how large a particular file is, as
long as each line can be manipulated. In cases where I have to deal
with a number of files, I find it easier to glob the files, or open
and read a directory, to automate the process of opening, reading, and
closing a number of files. You might gain something in particular
cases by combining files, but I don't see any general advantage in
doing so.
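
For example, a minimal sketch using glob() (the pattern is just an
example):

use strict;
use warnings;

for my $file ( glob 'roads*.txt' ) {    # example pattern
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        # ... handle one line at a time ...
    }
    close $fh;
}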

> each row may come from a big file of >100M (as aforementioned):
> CITY    Beijing
> NOTE    Capital
> RACE    Chinese
> ...


Typical data munging. Depending on whether you have duplicates, I
would probably build a hash and write the hash to your output file.
You then have the ability to sort on different fields, e.g., cities,
notes, races, etc.

> And then I have another much smaller table which contains all the ID's
> (either ID1 or ID2, maybe 100,000 records, <20M). and I just need to make
> this 20M file annotated with the INFO. Hashing seems not to be a solution
> for my 32G, 8-core machine...


Hashing is ideal, provided you can link the two files by a common
key. The general technique is to open the ID file first, build a
hash keyed on the record IDs, then open your data file and populate the
hash entries with data according to that common key. Then, open your
output file and print to it.
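
A rough sketch of that flow (the file names and the way the key and the
data are split out of each line are assumptions about your format, so
adjust to taste):

#!/usr/bin/perl
use strict;
use warnings;

# 1. build a hash keyed on the IDs from the small (<20M) table
my %info;
open my $ids, '<', 'ids.txt' or die "Cannot open ids.txt: $!";    # assumed name
while ( my $line = <$ids> ) {
    chomp $line;
    my ($id) = split ' ', $line;    # assume the ID is the first field
    $info{$id} = '';                # placeholder for the annotation
}
close $ids;

# 2. stream the big data file line by line, keeping only records whose
#    ID we actually care about
open my $data, '<', 'asia.txt' or die "Cannot open asia.txt: $!";    # assumed name
while ( my $line = <$data> ) {
    chomp $line;
    my ( $id, $rest ) = split ' ', $line, 2;
    $info{$id} .= "$rest; " if defined $rest and exists $info{$id};
}
close $data;

# 3. write the annotated table
open my $out, '>', 'annotated.txt' or die "Cannot open annotated.txt: $!";
print {$out} "$_\t$info{$_}\n" for sort keys %info;
close $out;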

If you will use the data frequently, you might want to stuff the data
into a database so you can query it conveniently.

If you want help, please be sure to furnish both sample data from each
file and your attempts at writing the script.

CC.


 
John Bokma
      02-10-2010
"ela" <(E-Mail Removed)> writes:

> I have some large data in pieces, e.g.
>
> asia.gz.tar 300M
>
> or
>
> roads1.gz.tar 100M
> roads2.gz.tar 100M
> roads3.gz.tar 100M
> roads4.gz.tar 100M
>
> I wonder whether I should concatenate them all into a single ultra large
> file and then perform parsing them into a large table (I don't know whether
> perl can handle that...).
>
> The final table should look like this:
>
> ID1 ID2 INFO
> X1 Y9 san diego; california; West Coast; America; North Ameria; Earth
> X2.3 H9 Beijing; China; Asia
> ....
>
> each row may come from a big file of >100M (as aforementioned):
>
> CITY Beijing
> NOTE Capital
> RACE Chinese
> ...
>
> And then I have another much smaller table which contains all the ID's
> (either ID1 or ID2, maybe 100,000 records, <20M). and I just need to make
> this 20M file annotated with the INFO. Hashing seems not to be a solution
> for my 32G, 8-core machine...
>
> Any advice? or should i resort to some other languages?


How about importing all your data into a database and using SQL to
extract what you want? Depending on the format of your input files, some
parsing might be required, which can be done with a small Perl
program.
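
For example, with DBI and the SQLite driver (the table layout and file
names here are just illustrative; any database Perl has a DBD driver
for would do):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=places.db', '', '',
                        { RaiseError => 1, AutoCommit => 1 } );

$dbh->do('CREATE TABLE IF NOT EXISTS places (id1 TEXT, id2 TEXT, info TEXT)');

# insert parsed records (the parsing of the raw files happens elsewhere)
my $ins = $dbh->prepare('INSERT INTO places (id1, id2, info) VALUES (?, ?, ?)');
$ins->execute( 'X2.3', 'H9', 'Beijing; China; Asia' );

# later: look up single IDs, or join against the small ID list
my $rows = $dbh->selectall_arrayref(
    'SELECT id1, id2, info FROM places WHERE id1 = ?', undef, 'X2.3' );
print join( "\t", @$_ ), "\n" for @$rows;

$dbh->disconnect;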

--
John Bokma j3b

Hacking & Hiking in Mexico - http://johnbokma.com/
http://castleamber.com/ - Perl & Python Development
 
Jürgen Exner
      02-10-2010
"ela" <(E-Mail Removed)> wrote:
>I have some large data in pieces, e.g.
>
>asia.gz.tar 300M
>
>or
>
>roads1.gz.tar 100M
>roads2.gz.tar 100M
>roads3.gz.tar 100M
>roads4.gz.tar 100M
>
>I wonder whether I should concatenate them all into a single ultra large
>file


I may be mistaken, but isn't that a prerequisite for actually extracting
any data from a compressed (.gz) file?

>and then perform parsing them into a large table (I don't know whether
>perl can handle that...).


The hardware is the limit.

>The final table should look like this:
>
>ID1 ID2 INFO
>X1 Y9 san diego; california; West Coast; America; North Ameria; Earth
>X2.3 H9 Beijing; China; Asia
>....
>
>each row may come from a big file of >100M (as aforementioned):
>
>CITY Beijing
>NOTE Capital
>RACE Chinese
>...
>
>And then I have another much smaller table which contains all the ID's
>(either ID1 or ID2, maybe 100,000 records, <20M). and I just need to make
>this 20M file annotated with the INFO. Hashing seems not to be a solution
>for my 32G, 8-core machine...


Depends. It's easy enough to do, so you can just try it and see whether it works.

>Any advice? or should i resort to some other languages?


If you really are hardware-limited, eventually the system will begin
swapping. And that will happen in any language if you try to keep too
much data in RAM.
If that happens, you will have to revert to time-proven techniques from
the dark ages: trade HD space and time for RAM by keeping only one set
of data in RAM and annotating that set while processing the second set of
data from the HD line by line.

However, the real solution would be to load the whole enchilada into a
database and then do whatever join you want to do. There is a reason why
database systems have been created and optimized for exactly such tasks.

jue
 
sln@netherlands.com
      02-10-2010
On Wed, 10 Feb 2010 18:57:02 +0800, "ela" <(E-Mail Removed)> wrote:

>I have some large data in pieces, e.g.
>
>asia.gz.tar 300M
>
>or
>
>roads1.gz.tar 100M
>roads2.gz.tar 100M
>roads3.gz.tar 100M
>roads4.gz.tar 100M
>
>I wonder whether I should concatenate them all into a single ultra large
>file and then perform parsing them into a large table (I don't know whether
>perl can handle that...).
>
>The final table should look like this:
>

[snip examples that don't convey much info]

>Any advice? or should i resort to some other languages?
>


Yes, go back to the database that produced these files
and run a different query to get the info you need.

If you're not still with the company that owns this information,
I suggest you contact them for permission to use it.

-sln
 
Xho Jingleheimerschmidt
      02-11-2010
ela wrote:
> I have some large data in pieces, e.g.
>
> asia.gz.tar 300M
>
> or
>
> roads1.gz.tar 100M
> roads2.gz.tar 100M
> roads3.gz.tar 100M
> roads4.gz.tar 100M


The data is first gzipped and then tarred? That is an odd way of doing
things.

> I wonder whether I should concatenate them all into a single ultra large
> file


I see no reason to do that. Especially as I don't think tar format
supports that cleanly, does it?

> and then perform parsing them into a large table (I don't know whether
> perl can handle that...).


I bet it can.

>
> The final table should look like this:
>
> ID1 ID2 INFO
> X1 Y9 san diego; california; West Coast; America; North Ameria; Earth
> X2.3 H9 Beijing; China; Asia
> .....
>
> each row may come from a big file of >100M (as aforementioned):
>
> CITY Beijing
> NOTE Capital
> RACE Chinese
> ....


What is the "...." hiding? 100M is an awful lot of "...."

Each file is turned into only one row? And each file is 100M? So how
many rows do you anticipate having?


> And then I have another much smaller table which contains all the ID's
> (either ID1 or ID2, maybe 100,000 records, <20M). and I just need to make
> this 20M file annotated with the INFO. Hashing seems not to be a solution
> for my 32G, 8-core machine...


Why not?

> Any advice? or should i resort to some other languages?


Your description is too vague to give any reasonable advice.


Xho
 
Xho Jingleheimerschmidt
      02-11-2010
Jürgen Exner wrote:
>
> However the real solution would be to load the whole enchilada into a
> database and then do whatever join you want to do. There is a reason why
> database system have been created and optimized for exactly such tasks.


Database systems are generally created for atomicity, concurrency,
isolation, and durability, which is quite a bit more than this task
seems to consist of. It is my general experience that in this type of
task, a Perl script could be written and have completed its job while
the database system is still tying its shoes.


Xho
 
Jürgen Exner
      02-11-2010
Xho Jingleheimerschmidt <(E-Mail Removed)> wrote:
>Jürgen Exner wrote:
>>
>> However the real solution would be to load the whole enchilada into a
>> database and then do whatever join you want to do. There is a reason why
>> database system have been created and optimized for exactly such tasks.

>
>Database systems are generally created for atomicity, concurrency,
>isolation, and durability, which is quite a bit more than this task
>seems to consist of. It is my general experience that in this type of
>task, a Perl script could be written and have completed its job while
>the database system is still tying its shoes.


Certainly true. But they are also designed to handle vast amounts of
data efficiently. And if the OP indeed runs into space issues then a DB
system may (just may!) provide an easier to use and even faster
alternative to looping through files over and over again.

Whether it actually is an advantage or not is hard to tell. I agree, the
OP's task seems to be easy enough to be solved in a single pass. But the
description was rather cryptic, too, so there might be more
cross-referencing going on than either of us is expecting at this time.

jue
 
ccc31807
      02-11-2010
On Feb 11, 1:32 pm, (E-Mail Removed) (Jamie) wrote:
> I'd only add that often databases aren't a very good choice for data
> that doesn't change. Databases are way over used, IMO.
>
> Often, an index structure built out of flat files is superior (doesn't have all
> that extra baggage of servers, code to carefully insert/update and such)
>
> Sometimes, you just can't beat a set of well formed text files on
> top of a carefully planned directory.


Amen, amen, and amen. In my line of work, I see Access almost
exclusively used as a productivity tool (e.g., to produce reports) and
(almost) never as a database. Personally, I have created hundreds, if
not a thousand or two, of Access databases, and I can't recall ever
using a Primary Key in Access.

The same thing can also be said for Excel.

I totally agree that many times using some kind of delimited file is
much easier, simpler, and faster than using a database. For one-time
processing, building a data structure in memory is also much easier,
simpler, and faster.

CC.
 