processing large files

 
 
robic0
 
      04-07-2006
On 7 Apr 2006 17:33:26 GMT, Glenn Jackman <(E-Mail Removed)> wrote:

>At 2006-04-06 08:46PM, robic0 <> wrote:
>> To get around this problem create 24 un-indexed data files. 1 for
>> each letter of the alphabet.

>
>No wonder it's so hard to understand you sometimes.


Oh yeah, one for y,z also.
Good catch.
 

robic0
 
      04-08-2006
On Thu, 06 Apr 2006 17:46:48 -0700, robic0 wrote:

>On 6 Apr 2006 11:42:03 -0700, "Andreja" <(E-Mail Removed)> wrote:
>
>>I'm wondering if anyone in the group has experience with handling large
>>files. If you are a consultant with proven capability to handle a
>>project as (roughly) summarized below please contact me at your
>>convenience.
>>
>>Here's what I have:
>>1. File A: fixed length 830 bytes/row, 193 million rows
>>2. File B: fixed length 760 bytes/row, 122 million rows
>>3. File C: 10 bytes/row, 50 million rows
>>4. File D: fixed length 11 bytes/row, 15 million rows
>>5. File E: 4 bytes/row, 50,000 rows
>>
>>Here's what I need to do:
>>1. merge files A and B, dedupe and create output file - expected to be
>>+/- 210 million rows
>>2. append files C, D and E to this output file
>>3. Collapse this output file into a smaller file (essentially group
>>individual records with certain common characteristics into one record)
>>4. create final output, tab del, ascii
>>
>>Any insight as to feasibility is much appreciated, in particular also
>>with regard to hardware requirement considerations.
>>
>>Thank you all.

>
>I find it hard to believe you would have amassed a 190+ gigabyte file
>in A, for instance. There would never be a need for this unless it were
>from an unintelligent data logging system (received back from space
>after 10 years).
>
>I don't know what you mean by A and B need to be deduped after a merge.
>And I don't know what you consider a dupe. If they are records, then a
>SELECT DISTINCT A.field(1..n) etc... should do it.
>
>If, say, you want to remove dups within A, whether it's a certain
>field or all fields combined as one string, on a per-record basis, then
>you will have to create a separate index for each of files A and B. Then use those
>indexes to merge.
>
>Of course creating an index requires a location (pointer) to the
>actual data so the data is not moved. But the data is not in memory so
>it will have to be the absolute file pointer position of the data that is stored
>in the index.
>
>When the index is being created, as it grows, a binary search is being used so
>that in-file duplicates will not be stored. That takes care of sorting the file
>and removing duplicates.
>
>Of course, when creating an index the data is not actually moved around; however,
>accessing the data for sort comparisons requires seeking to the file position
>and reading the data in during the binary search. This results in constant
>thrashing of the hard drive and would take an incredibly long time, since
>the same data (albeit via a binary search and not a linear one) will need to be brought
>in time and time again.
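>
>(Just to illustrate the cost, here is a minimal sketch of one such lookup, assuming a
>sorted index @ndx of record numbers, an already-open raw filehandle $fhA on file A,
>and fixed 830-byte records; all the names are only placeholders:)
>
>sub find_record {
>    my ($key) = @_;                       # $key is an 830-byte record string
>    my ($lo, $hi) = (0, $#ndx);
>    while ($lo <= $hi) {
>        my $mid = int(($lo + $hi) / 2);
>        seek($fhA, 830 * $ndx[$mid], 0);  # every probe costs a seek...
>        read($fhA, my $rec, 830);         # ...plus an 830-byte read
>        my $c = $rec cmp $key;
>        return $ndx[$mid] if $c == 0;     # duplicate found
>        if ($c < 0) { $lo = $mid + 1 } else { $hi = $mid - 1 }
>    }
>    return undef;                         # not a duplicate
>}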
>
>In this case, because the files are so big, even buffering won't help; actually, it
>would hurt.
>
>To get around this problem create 24 un-indexed data files, one for each letter of
>the alphabet. Each line in a letter file contains the file position of a record in file A
>that starts with that letter. At this point, indexes will be small enough to be created
>in memory; however, the index is now twice indirect to the data in file A. This
>of course is no problem whatsoever, as this scheme could go multiple levels of
>indirection if necessary, but this should do it.
>
>Open the 24 letter files, open file A. Sequentially read each line in A and write its
>file pointer (imagine it would be a big int) to the appropriate letter file.
>If you have a 'square' file of a given line length then just store the line number and
>let the absolute file position for that line be calculated. If the lines differ
>in size, another scheme would be to factor out a common denominator, reducing the
>size of the letter file.
>
>Now create a separate sort index for each of the 24 letter data files.
>These 'index' files now contain positions into the letter files.
>And the 'letter' files contain the position into file A.
>
>What this does is exponentially reduce redundant access to file A during
>binary searches.
>
>At this point you could reconstruct file A using the indexes, to allow it to
>be either viewed in a sorted manner or written out to a new file.
>However it would be better to index file B in the same fashion, then
>use sequential access to the indexes for files A and B to construct either
>a merged index per letter, or create a merged file altogether (without dups).
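>
>(A minimal sketch of that sequential merge, assuming @A_sorted and @B_sorted each
>hold one letter's record numbers already sorted by record content, rec_A() and
>rec_B() are hypothetical helpers that seek/read one record from the respective
>file, and $out is the output filehandle:)
>
>my ($i, $j) = (0, 0);
>my $last = '';
>while ($i < @A_sorted || $j < @B_sorted) {
>    my $ra = $i < @A_sorted ? rec_A($A_sorted[$i]) : undef;
>    my $rb = $j < @B_sorted ? rec_B($B_sorted[$j]) : undef;
>    my $rec;
>    if (!defined $rb || (defined $ra && $ra le $rb)) { $rec = $ra; $i++ }
>    else                                             { $rec = $rb; $j++ }
>    print $out $rec unless $rec eq $last;    # drop duplicates across A and B
>    $last = $rec;
>}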
>
>File A sorted data traversal example. Assume letter 'a' has its index and
>file positions already read into a couple of arrays:
>
># file A is open in binary mode on $fhA
>foreach my $ndxto_a (@A_ndx_a)
>{
>    my $A_fpos = 830 * $A_letter_a[$ndxto_a];
>    seek($fhA, $A_fpos, 0);            # jump to the record's absolute offset
>    read($fhA, my $record, 830);       # read one 830-byte record
>}
>
>The CR/LF will have to be taken into account when seeking, so the record stride may be 832 or so.
>
>That about handles sorting and in-file removal of dups.
>To do the other stuff, a further breakdown of data will be necessary,
>similar but probably with a different strategy.
>
>Creating the @A_ndx_xxx indexes will need a bit of creativity in Perl, since
>it doesn't have memmove() as in C. That would be needed to 'insert' the letter
>array index into the sorted index array. This only applies if you would be
>doing a binary search for the insertion.
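>
>(Perl's splice() can stand in for memmove(); a minimal sketch, assuming
>@A_ndx_a is the sorted index being built, $new is the next slot of the letter
>array to insert, and cmp_recs($i, $j) is a hypothetical compare that seeks/reads
>the two file-A records behind $A_letter_a[$i] and $A_letter_a[$j]:)
>
>my ($lo, $hi) = (0, scalar @A_ndx_a);
>while ($lo < $hi) {                       # binary-search for the insertion point
>    my $mid = int(($lo + $hi) / 2);
>    if (cmp_recs($A_ndx_a[$mid], $new) < 0) { $lo = $mid + 1 }
>    else                                    { $hi = $mid }
>}
>splice(@A_ndx_a, $lo, 0, $new);           # shifts the tail and inserts, like memmove()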
>
>An alternative to the binary search index creation for the letter array
>would be to use Perl's 'sort' function on the letter array directly, making
>it the index. This would do away with the extra step. It's hard to tell which
>would be faster.
>
>To make an index out of the letter array, create the letter file as above.
>Read it into an array and use Perl's 'sort' function with a custom compare
>subroutine. The file access on file A would be the same.
>
>For instance:
>
>@A_letter_a = sort special @letter_a;
>
>sub special {
>    my ($A_fpos1, $A_fpos2) = (830 * $a, 830 * $b);
>    my ($data1, $data2);
>    seek($fhA, $A_fpos1, 0);  read($fhA, $data1, 830);   # record for $a
>    seek($fhA, $A_fpos2, 0);  read($fhA, $data2, 830);   # record for $b
>    return $data1 cmp $data2;
>}
>
>Then as above to access the sorted data, with a slight change:
>
>foreach my $fline (@A_letter_a)
>{
>    my $A_fpos = 830 * $fline;
>    seek($fhA, $A_fpos, 0);            # jump to the record
>    read($fhA, my $record, 830);       # read 830 bytes
>}
>
>I would try this last sort method first: make the letter file into
>a sorted index directly. The best way is to make a test case on one letter
>and see how it goes. If too slow, building a separate sorted index of the
>letter file (the binary-insertion approach above) may be necessary.


Just a follow-up with some possible strategies to consider that would
speed things up on whatever this is, data-mining or whatever.
Btw, it was pointed out to me that there are 26 letters in the alphabet
and NOT 24 as stated (no, I'm not brain dead yet, but pretty close).
I'm almost sure the number of letter arrays won't be 26 though (see below).

CREATING THE LETTER FILES
- Try not to read the entire line; use seek, then read just the first character
  (this applies if it is a record as well). See the sketch after this list.
- Store the file position (the calculated line number) in the particular
  letter file.
- Create an additional 26 files if being case-sensitive.
- This process should consume roughly 193 million x 4 bytes ~= 770 megabytes in total letter
  files for file A. The goal is to move individual letter files into memory
  for processing. Ideally, that is 770 megabytes divided by 26, ~= 30 megabytes or less
  per file. This is manageable. The problem comes when an individual letter file
  is very big. Depending on the data, the skew would tend toward 's' and the
  more used letters.
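
A minimal sketch of that pass, assuming file A is 'fileA.dat' with a fixed
832-byte stride (830 data bytes plus CR/LF) and letter files named
letter_a.idx .. letter_z.idx (all of these names are just placeholders):

use strict;
use warnings;

my $STRIDE = 832;                                # 830-byte record + CR/LF
open my $fhA, '<:raw', 'fileA.dat' or die "fileA.dat: $!";

my %letter_fh;
for my $c ('a' .. 'z') {
    open $letter_fh{$c}, '>', "letter_$c.idx" or die "letter_$c.idx: $!";
}

my $lineno = 0;
while (1) {
    seek($fhA, $lineno * $STRIDE, 0);            # jump straight to the record
    last unless read($fhA, my $first, 1);        # read only its first character
    my $c = lc $first;                           # fold case (use 52 files if case-sensitive)
    print { $letter_fh{$c} } "$lineno\n" if exists $letter_fh{$c};   # non-letters are skipped here
    $lineno++;
}
close $_ for values %letter_fh;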

JOINING SMALL LETTER FILES
 In this step it is assumed that each letter file is smaller than
 the sum of all the letter file sizes divided by 26
 (see below if a file is bigger). This step will actually slow the process down,
 but it will give a roughly equal run time for each letter group.
- Join/append alphabetically consecutive letter files into a single file, up to the file
  size limit imposed (~30 megabytes). A sketch follows this list.
- Sort this combined file as if it were a single letter file.
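
A minimal sketch of the join pass, reusing the letter_X.idx names from the
previous sketch and a hypothetical ~30 MB limit per combined group:

use strict;
use warnings;

my $LIMIT = 30 * 1024 * 1024;                    # target size per combined group
my ($group, $size) = (0, 0);
open my $out, '>', "group_$group.idx" or die $!;

for my $c ('a' .. 'z') {
    my $file = "letter_$c.idx";
    next unless -e $file;
    if ($size > 0 && $size + (-s $file) > $LIMIT) {
        close $out;                              # this group is full, start the next one
        $group++;
        $size = 0;
        open $out, '>', "group_$group.idx" or die $!;
    }
    open my $in, '<', $file or die "$file: $!";
    print {$out} $_ while <$in>;                 # append this letter file to the group
    close $in;
    $size += -s $file;
}
close $out;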

SPLITTING LARGE LETTER FILES
This presents the biggest problem.
There are solutions to this as well.
Hopefully this step is not necessary.

 

robic0
 
      04-10-2006
On Fri, 07 Apr 2006 17:04:35 -0700, robic0 wrote:

[snip]
>SPLITTING LARGE LETTER FILES
> This presents the biggest problem.
> There are solutions to this as well.
> Hopefully this step is not necessary.
>

More on 'Splitting large letter files':
I didn't want to get into the statistics of letter combinations (like 'st'), but they
can be used analytically in an automated algorithm.

Statistically, they should arm you with enough a priori knowledge of the distribution
curve to split the files that come out large. For instance, if a letter such as
's' produces a large file, even if you only know this after a semi-automated pass,
all that's necessary is to create another 26 sub-letter files based on the next character,
e.g. sa, st, so, ... etc. This could all be done programmatically, the entire thing;
statistics would just supply the a priori knowledge the algorithm needs.
There is no doubt, however, that this is entirely possible. (A sketch follows below.)
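
A minimal sketch of splitting one oversized letter file on the second character,
reusing the fileA.dat / letter_s.idx names from the earlier sketches (the
sub-letter file names are likewise placeholders):

use strict;
use warnings;

my $STRIDE = 832;                                # 830-byte record + CR/LF
open my $fhA, '<:raw', 'fileA.dat'    or die $!;
open my $in,  '<',     'letter_s.idx' or die $!;

my %sub_fh;
while (my $lineno = <$in>) {
    chomp $lineno;
    seek($fhA, $lineno * $STRIDE + 1, 0);        # skip to the record's 2nd character
    read($fhA, my $second, 1);
    my $c = lc $second;
    $c = '_' unless $c =~ /[a-z]/;               # one extra bucket for non-letters
    unless ($sub_fh{$c}) {
        open $sub_fh{$c}, '>', "letter_s$c.idx" or die $!;
    }
    print { $sub_fh{$c} } "$lineno\n";
}
close $_ for ($in, values %sub_fh);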

I guess now that the world is geared for such large data sets, absolutely anything can
be done as far as data reduction goes. Back when we had 64 KB of RAM in the old days, 'large' didn't stop
anything either.

Don't think you have come up with something that is not possible; this group is read in America!
Given the knowledge offered here, getting no response at all would lead me to believe you're
not a serious poster but a crackpot/phishing artist.
 