Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > question concerning pipes and large strings

Reply
Thread Tools

question concerning pipes and large strings

 
 
Rainer Weikusat
Guest
Posts: n/a
 
      06-20-2012
math math <> writes:

[...]

>> An simple implementation of the
>> 'concatenate everything in ID-sorted order' with a sensible I/O
>> stragey:
>>

>
> Hm, I don't understand right away what's really going on below in the script, I will try to decipher it.


That's what I think to be the best idea here (this may be wrong): It
splits the input file into files each containing the data part of a
single line while recording the IDs. Afterwards, it sorts the IDs and
concatenates the temporary files into an output file in that order.
There's some additional logic in order to keep as many of the
temporary files open until the output stage as the OS supports.
 
Reply With Quote
 
 
 
 
Rainer Weikusat
Guest
Posts: n/a
 
      06-20-2012
math math <> writes:
>> On 06/19/12 10:50, math math wrote:
>> > Hi,
>> >
>>> I have a file with two tab delimited fields. First field is an
>>> ID, the second field is a large string (up to hundreds of
>>> millions of characters). The file may have many lines.
>>>
>>> I would like to sort the file on the first (ID) field and after
>>> this sorting, merge the second fields (i.e. remove the new lines),
>>> so that I get a single line with many hundreds of lines that are
>>> in the same order appended to each other as their alphabetically
>>> sorted IDs.


[...]

>> Probably sorting it first would make it much easier:
>>
>> man sort
>>

> Indeed, I tried sort first, it works, it is more of a scalability
> question really.


This is a really bad idea because sort will reorder the complete input
lines, including the data part, possible/ probably multiple times for
each input line, and this means a lot of copying of data which doesn't
need to be copied since only the IDs are supposed to be sorted.
 
Reply With Quote
 
 
 
 
Rainer Weikusat
Guest
Posts: n/a
 
      06-20-2012
Ben Morrow <> writes:
> Quoth Rainer Weikusat <>:
>> Ben Morrow <> writes:
>> > Quoth Rainer Weikusat <>:
>> >> Ben Morrow <> writes:
>> >>> I would do this in two passes. Start by reading through the file a block
>> >>> at a time, and finding all the ID fields. (I am assuming these are small
>> >>> enough that you aren't worried about keeping the whole list in
>> >>> memory).
>> >>
>> >> If the file is large and entries with identical ID are not somehow
>> >> clustered, this will result in a lot of (useless) I/O.
>> >
>> > How so?

>>
>> Because the complete contents of the file won't fit into the kernel
>> page cache and this means whenever a block of data needs to be read
>> which presently isn't in the page cache, one of the existing blocks
>> needs to be evicted and the data read 'from disk'. This can happen
>> numerous times when randomly seeking back and forth within the file.

>
> The part you quoted reads through the entire file exactly once, in
> order. There is no seeking.I agree there will be a lot of seeking
> later,


Could you perhaps enlighthen me what the point of this remark is
supposed to be? Since I wrote about seeking and seeking is a necessary
part of the complete solution, it should be blatantly obivious that I
was referring to that.

[...]

>> The same phenomenon will occur at the perl buffering level except that
>> it will likely be much more severe (in terms of system-call overhead)
>> because the perl-level buffer will likely (this may be wrong) be much
>> smaller than the kernel page cache.

>
> That depends on the filesystem and OS, of course, but yes, in general I
> would assume this would be the case. I'm not sure it matters: in a
> process as IO-bound as this, the system call overhead is likely to be
> irrelevant.


Let's assume for the sake of example that 'all of the I/O' takes 30
days. The kilotons of useless system calls take an hour. This is about
0.14% of the first time span. Would you want to wait an additional
hour for this task to complete just because it is only 0.14%? Or would
you rather want to avoid this hour as well? Can you imagine that
someone who just wants to use the code could want to avoid this hour?
Or that someone could want to do something more useful with this hour
of CPU time than 'burn it with useless system calls'?

>> >> #!/usr/bin/perl
>> >> #
>> >>
>> >> sub get_ids
>> >> {
>> >> my ($in, $ids) = @_;
>> >> my ($line, $the_id, $pos);
>> >>
>> >> $pos = tell($in);
>> >> while ($line = <$in>) {
>> >
>> > You're still reading an entire line into memory.

>>
>> Can you imagine that I know that? Actually, that just about everyone
>> reading your text will know that?
>>
>> If you want an opinion on this: If a single line of input is too large
>> to be kept in memory, Perl is decidedly the wrong choice for solving
>> this problem.

>
> I strongly disagree. IMHO a file with lines that long simply isn't
> meaningfully a 'text file' any more, and so needs to be handled like a
> binary file: read in blocks, and remember byte positions. Perl is
> perfectly capable of handling binary data.


As opposed to something sensible such as memory-mapping the input
file (or that part of it which fits into the process address space),
the overhead is going to be grotesque and when 'low-level buffer
control' has to be done in any case, this overhead is not justified.

[...]

>> ----------------
>> #!/usr/bin/perl
>> #
>>
>> use Errno qw(EMFILE ENFILE);
>>
>> {
>> my ($in, $open, $id, %ids, @open, $input);
>>
>> while ($input = <STDIN>) {
>> ($id) = $input =~ /^([^\t]+)\t/;
>> chop($input);
>>
>> until (defined(open($out, '+>', $id))) {
>> die("open: $in: $!")
>> unless ($! == EMFILE || $! == ENFILE) && @open;

>
> Copying the data out to temporary files and then reading it back in
> again is *bound* to be more IO than seeking in the original file.


No, it is not bound to be more I/O (neither physical nor 'system call
I/O') because everything is done strictly sequential and will thus
interact nicely with any intermediate buffering layers: There's never
a need to 'flush the buffer, read a different block into it, flush the
buffer again, reread the original block'. If this helps with the issue
at hand would be a different question: It certainly will if the buffer
is larger than any indivividual data item. This might not be the case
here and consequently, both approaches should be tested to determine
which one is better.
 
Reply With Quote
 
Rainer Weikusat
Guest
Posts: n/a
 
      06-20-2012
Rainer Weikusat <> writes:

[...]

>>> If you want an opinion on this: If a single line of input is too large
>>> to be kept in memory, Perl is decidedly the wrong choice for solving
>>> this problem.

>>
>> I strongly disagree. IMHO a file with lines that long simply isn't
>> meaningfully a 'text file' any more, and so needs to be handled like a
>> binary file: read in blocks, and remember byte positions. Perl is
>> perfectly capable of handling binary data.

>
> As opposed to something sensible such as memory-mapping the input
> file (or that part of it which fits into the process address space),
> the overhead is going to be grotesque


For sake of completeness and because I felt like doing it: Here's a
seek-based variant based on reading data in 'blocks' (of an arbitrary
size > 0) which actually works (according to my limited testing, still
without error handling).

----------------------------
#!/usr/bin/perl
#

use constant BLOCK => 4096;

sub read_block
{
my ($block, $rc);

$rc = sysread($_[0], $block, $_[1] // BLOCK);
$rc // die("sysread: $!");
$_[1] && $rc != $_[1] && die("short read");

return $rc ? $block : undef;
}

sub get_ids
{
my ($in, $ids) = @_;
my ($block, $id, $want, $bpos, $sbpos, $fpos);

$want = "\t";
while ($block = read_block($in)) {
$sbpos = $bpos = 0;

{
$bpos = index($block, $want, $sbpos);

if ($want eq "\t") {
if ($bpos != -1) {
$id .= substr($block, $sbpos, $bpos - $sbpos);
push(@$ids, [$id, $fpos + ++$bpos]);
$id = '';

$want = "\n";
$sbpos = $bpos;
redo if $sbpos < length($block);
} else {
$id .= substr($block, $sbpos);
}

last;
}

if ($want eq "\n") {
last if $bpos == -1;

push(@{$ids->[$#$ids]}, $fpos + $bpos);

$want = "\t";
$sbpos = $bpos + 1;
redo if $sbpos < length($block);
}
}

$fpos += length($block);
}
}

sub print_id_data
{
my ($fh, $id) = @_;
my ($blocks, $len);

seek($fh, $id->[1], 0);
$len = $id->[2] - $id->[1];

$blocks = int($len / BLOCK);
print(read_block($fh, BLOCK)) while ($blocks--);

$len %= BLOCK;
print(read_block($fh, $len)) if $len;
}

{
my ($fh, @ids);

open($fh, '<', $ARGV[0]) // die("open: $ARGV[0]: $!");

get_ids($fh, \@ids);
print_id_data($fh, $_) for sort { $a->[0] cmp $b->[0] } @ids;
}
 
Reply With Quote
 
Rainer Weikusat
Guest
Posts: n/a
 
      06-21-2012
Ben Morrow <> writes:
> Quoth Rainer Weikusat <>:
>> Ben Morrow <> writes:
>> > Quoth Rainer Weikusat <>:

>>
>> >> The same phenomenon will occur at the perl buffering level except that
>> >> it will likely be much more severe (in terms of system-call overhead)
>> >> because the perl-level buffer will likely (this may be wrong) be much
>> >> smaller than the kernel page cache.
>> >
>> > That depends on the filesystem and OS, of course, but yes, in general I
>> > would assume this would be the case. I'm not sure it matters: in a
>> > process as IO-bound as this, the system call overhead is likely to be
>> > irrelevant.

>>
>> Let's assume for the sake of example that 'all of the I/O' takes 30
>> days. The kilotons of useless system calls take an hour.

>
> 'IO-bound' means that for most of its runtime, the process will be
> sitting there doing exactly nothing, waiting for IO to complete.


Really? Who would have guessed that ...
 
Reply With Quote
 
Rainer Weikusat
Guest
Posts: n/a
 
      06-21-2012
Rainer Weikusat <> writes:

[...]

> seek($fh, $id->[1], 0);


This should be sysseek (works here because no buffered input is ever
done on this filehandle).
 
Reply With Quote
 
Rainer Weikusat
Guest
Posts: n/a
 
      06-21-2012
Rainer Weikusat <> writes:

[...]

>>> ----------------
>>> #!/usr/bin/perl
>>> #
>>>
>>> use Errno qw(EMFILE ENFILE);
>>>
>>> {
>>> my ($in, $open, $id, %ids, @open, $input);
>>>
>>> while ($input = <STDIN>) {
>>> ($id) = $input =~ /^([^\t]+)\t/;
>>> chop($input);
>>>
>>> until (defined(open($out, '+>', $id))) {
>>> die("open: $in: $!")
>>> unless ($! == EMFILE || $! == ENFILE) && @open;

>>
>> Copying the data out to temporary files and then reading it back in
>> again is *bound* to be more IO than seeking in the original file.


For a small test I made, this approach is indeed hopeless but not
because of the additional data I/O but because of the overhead of
creating, deleting, opening etc 'a really large number of files'. But
my 'data parts' where tiny compared to the 'many millions' of the OP
and things might still be different for this case.
 
Reply With Quote
 
Martijn Lievaart
Guest
Posts: n/a
 
      06-21-2012
On Wed, 20 Jun 2012 16:29:56 +0100, Rainer Weikusat wrote:

> math math <> writes:
>>> On 06/19/12 10:50, math math wrote:
>>> > Hi,
>>> >
>>>> I have a file with two tab delimited fields. First field is an ID,
>>>> the second field is a large string (up to hundreds of millions of
>>>> characters). The file may have many lines.
>>>>
>>>> I would like to sort the file on the first (ID) field and after this
>>>> sorting, merge the second fields (i.e. remove the new lines),
>>>> so that I get a single line with many hundreds of lines that are in
>>>> the same order appended to each other as their alphabetically sorted
>>>> IDs.

>
> [...]
>
>>> Probably sorting it first would make it much easier:
>>>
>>> man sort
>>>

>> Indeed, I tried sort first, it works, it is more of a scalability
>> question really.

>
> This is a really bad idea because sort will reorder the complete input
> lines, including the data part, possible/ probably multiple times for
> each input line, and this means a lot of copying of data which doesn't
> need to be copied since only the IDs are supposed to be sorted.


As GNU sort is rather optimized, I would benchmark this before making
blanket statements like this.

Also, we don't know if efficiency is relevant. If it runs only once a
month, at night, the OP probably does not care if it takes a few hours as
opposed to a few minutes.

That said, the requirements are rather unique, so there is also a good
chance that sort will handle these files abysmally bad, chewing up all
memory and disk-I/O and effectively bringing the machine to it's knees.

So to the OP: Does sort -k1,1 run in acceptable time for you? If so,
there is the first part of your answer, the second part is now rather
trivial (or if you still run out of memory, more trivial than the
original problem).

(And if you do run out of memory, ask yourself, do I need a "works
always" solution, or does just adding more memory solve your immediate
problem)

HTH,
M4
 
Reply With Quote
 
Rainer Weikusat
Guest
Posts: n/a
 
      06-21-2012
Martijn Lievaart <> writes:
> On Wed, 20 Jun 2012 16:29:56 +0100, Rainer Weikusat wrote:


[...]

>>>> man sort
>>>>
>>> Indeed, I tried sort first, it works, it is more of a scalability
>>> question really.

>>
>> This is a really bad idea because sort will reorder the complete input
>> lines, including the data part, possible/ probably multiple times for
>> each input line, and this means a lot of copying of data which doesn't
>> need to be copied since only the IDs are supposed to be sorted.

>
> As GNU sort is rather optimized, I would benchmark this before making
> blanket statements like this.


'Rather optimmized' usually means the code is seriously convoluted
because it used to run faster on some piece of ancient hardware in
1997 for a single test case because of that. And not matter how
'optimized', a sort program needs to sort its input. Which involves
reordering it. Completely. In case of files which are too large for
the memory of a modern computer, this involves a real lot of copying
data around.

I suggest that you make some benchmarks before making blanket
statements like the one above.

> Also, we don't know if efficiency is relevant. If it runs only once a
> month, at night, the OP probably does not care if it takes a few hours as
> opposed to a few minutes.


Efficiency is always relevant except in a single case: The guy who has
to write the code is so busy with getting it to work at all that the
mere thought of having to try to make it work sensibly scares the ****
out of him and he tries to pass this competence-deficit as 'secret
advantage' when posing for others. Uusually, this will also always
involve a dedicated computer for testing and often, the people who are
going to use the code are not in the position to complain to the
person who wrote it, IOW, run-time efficiency doesn't matter because
it is someone elses problem.

Congratulate yourself to the happy situation you happen to be in. Stop
assuming that it is 'the universal situation'. Things might look
rather different if code is written for in-house use and supposed to
run on a computer which also provides VPN services for customers
coming from fifty different companies.
 
Reply With Quote
 
Rainer Weikusat
Guest
Posts: n/a
 
      06-21-2012
Rainer Weikusat <> writes:
> Martijn Lievaart <> writes:
>> On Wed, 20 Jun 2012 16:29:56 +0100, Rainer Weikusat wrote:

>
> [...]
>
>>>>> man sort
>>>>>
>>>> Indeed, I tried sort first, it works, it is more of a scalability
>>>> question really.
>>>
>>> This is a really bad idea because sort will reorder the complete input
>>> lines, including the data part, possible/ probably multiple times for
>>> each input line, and this means a lot of copying of data which doesn't
>>> need to be copied since only the IDs are supposed to be sorted.

>>
>> As GNU sort is rather optimized, I would benchmark this before making
>> blanket statements like this.

>
> 'Rather optimmized' usually means the code is seriously convoluted
> because it used to run faster on some piece of ancient hardware in
> 1997 for a single test case because of that. And not matter how
> 'optimized', a sort program needs to sort its input. Which involves
> reordering it. Completely. In case of files which are too large for
> the memory of a modern computer, this involves a real lot of copying
> data around.
>
> I suggest that you make some benchmarks before making blanket
> statements like the one above.


On some random computer I just used for that, sorting a 1080M file
(4000000 lines) with sort using the first column as key and sending
output to /dev/null (average from three runs) comes out at 40.9s
wallclock/ 6.4 user/ 3.2sys. Using one of the Perl scripts I posted
(with unused code removed) to extract the ID from each input line and
sort the list of IDs takes 9.5w/ 7.8u/ 0.7s. I didn't check if sort
used temporary files for this but it doesn't really matter because
sort is guaranteed to lose out if the file is only large enough. In
this case, the volume of data it needed to deal with was 1,132,259,700
bytes while the Perl script only needed to sort 28,000,000 bytes. And
the average line length was only 283 bytes in this case, not the 'many
millions' of the original problem.

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
matching strings in a large set of strings Karin Lagesen Python 13 05-03-2010 03:53 PM
External Hashing [was Re: matching strings in a large set of strings] Helmut Jarausch Python 3 04-30-2010 08:44 PM
Strings, Strings and Damned Strings Ben C Programming 14 06-24-2006 05:09 AM
CSS question concerning id and class Deryck HTML 12 12-22-2004 06:49 PM
newbie question concerning strings Dominic van Berkel C++ 6 11-04-2004 05:40 PM



Advertisments
 



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57