Velocity Reviews - Computer Hardware Reviews

whether large hash might leak?

 
 
Kimia
      07-27-2007
Hi, girls and dudes,

I wonder whether a hash might leak when it holds a large
number of pairs.
Recently I was asked to do some statistical work over large
files. All I wanted to do was find the duplicated lines in a file,
and I wrote the snippet below:
code:
mysort.pl
------------------------
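#!/usr/bin/perl

use strict;
use warnings;
my %in;        # line text => number of occurrences
my $cnt = 0;   # running total of lines seen
while (<>) {
    chomp;
    # skip empty lines (a line that is just "0" is also false here)
    $_ or ++$cnt, next;
    ++$in{$_};
}
# print each distinct line with its count, in sorted order
foreach (sort keys %in) {
    $cnt += $in{$_};
    print "$_*$in{$_}\n";
}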

------------------------


When the input file contains only a few lines, it works perfectly well.

data file:
in1.dat
------------------------
1aa
2bbbbb
3cc
1aa
5dd
------------------------

$ ./mysort.pl in1.dat
Then I got:
------------------------
1aa*2
2bbbbb*1
3cc*1
5dd*1
------------------------

However, when I ran it on a large file containing 10M lines, it
failed.

$ ./mysort <TenLinesInput.dat >out
$ echo $?
0
$ tail out -n 5
------------------------
??????????????*2
????????????????*1
??????????????????*1
?????????????????????????????*2834
?????????????????????????????????????????????????? ?????????????????????????????????????????????????? ?????????????????????????????????????????????????? ?????????????????????????????????????????????????? ????????????????????????????????????????
*1
------------------------
Where '?' is \0xff when viewed as a binary file.
I'm sure that the input contains no \0xff character. Most lines
are tens of characters long, few exceed 100, and none exceeds 1000.
All the other output lines, except the last 10, are as expected.

Then I tried it on an input file comprising one million lines,
and it failed with the same error; I tried it on an input file of 100k
lines, and it worked fine.
I am not sure whether this is a bug. If anyone knows the reason,
would you please tell us?

Thank you for your attention.

--
uita uinum est.

 
Mirco Wahab
      07-27-2007
Kimia wrote:
> Recently I was asked to do some statistical work over large
> files. All I wanted to do was find the duplicated lines in a file,
> and I wrote the snippet below:
> code:
> ...
> However, when I ran it on a large file containing 10M lines, it
> failed.
>
> $ ./mysort <TenLinesInput.dat >out
> $ echo $?
> 0
> $ tail out -n 5
> ------------------------
> ??????????????*2
> ????????????????*1
> ??????????????????*1
> ?????????????????????????????*2834
> ------------------------
> Where '?' is \0xff when viewed as a binary file.
> I'm sure that the input contains no \0xff character. Most lines
> are tens of characters long, few exceed 100, and none exceeds 1000.


This might depend on the properties of the input file.
Which encoding does it use: UTF-8/16 or plain ASCII?

What system are you working on, and which Perl version is installed?

Regards

M.
 
xhoster@gmail.com
      07-27-2007
Kimia <> wrote:
> Hi, girls and dudes,
>
> I wonder whether a hash might leak when it holds a large
> number of pairs.
> Recently I was asked to do some statistical work over large
> files. All I wanted to do was find the duplicated lines in a file,
> and I wrote the snippet below:
> code:
> mysort.pl
> ------------------------
> #!/usr/bin/perl
>
> use strict;
> use warnings;
> my %in;
> my $cnt = 0;
> while(<>){
> chomp;
> $_ or ++$cnt, next;
> ++$in{$_};
> }
> foreach(sort keys %in){
> $cnt += $in{$_};
> print "$_*$in{$_}\n";
> }


What is the stuff with $cnt?

>
> However, when I ran it on a large file containing 10M lines, it
> failed.


It doesn't fail. It gives you output you didn't expect.

>
> $ ./mysort <TenLinesInput.dat >out
> $ echo $?
> 0
> $ tail out -n 5
> ------------------------
> ??????????????*2
> ????????????????*1
> ??????????????????*1
> ?????????????????????????????*2834
> ?????????????????????????????????????????????????? ???????????????????????
> ?????????????????????????????????????????????????? ???????????????????????
> ?????????????????????????????????????????????????? ???????????????????????
> ????????????????????? *1
> ------------------------
> Where '?' is \0xff when viewed as a binary file.
> I'm sure that the input contains no \0xff character.


I am not sure of that. Try this and see what it gives, and if
it consistently gives the same thing:

perl -lne 'print $. unless -1==index $_, chr(0xff)' TenLinesInput.dat
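A script-sized sketch of the same check, in case the one-liner is hard to read. The helper name `lines_with_byte` and the in-memory sample are assumptions made for illustration and are not from the thread; the real input would be the data file itself. It collects the numbers of the lines in which the given byte occurs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the numbers of the
# lines read from $fh that contain the byte $byte.
sub lines_with_byte {
    my ($fh, $byte) = @_;
    my @hits;
    while (my $line = <$fh>) {
        # index() returns -1 when $byte does not occur in the line
        push @hits, $. if index($line, $byte) != -1;
    }
    return @hits;
}

# Small in-memory sample; only the second line contains 0xff bytes.
my $sample = "1aa\n\xff\xff\xff\n3cc\n";
open my $fh, '<:raw', \$sample or die "cannot open in-memory file: $!";
my @hits = lines_with_byte($fh, chr(0xff));
close $fh;

print "lines containing 0xff: @hits\n";
```

If the list comes back empty for the real file, the input is clean; any line number printed points at a line that does contain the byte.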


> Most lines
> are tens of characters long, few exceed 100, and none exceeds 1000.
> All the other output lines, except the last 10, are as expected.
>
> Then I tried it on an input file comprising one million lines,
> and it failed with the same error;


It didn't fail with an error. The value of $? shows that. (And I don't
see anything suggestive of a "leak", either.) It seems like what it comes
down to is that you and Perl disagree over what is in your file.


Xho

 
J. Gleixner
      07-27-2007
Kimia wrote:
> hi, girls and dudes,
>
> ....I doubt whether hash might leak when it comprises of a large
> amount of pairs.


You could also try uniq with the -d and -c options: man uniq
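
Note that uniq only compares adjacent lines, so the input normally has to be sorted first. For comparison, a rough Perl sketch of the same idea (count each line, report only the ones seen more than once); the helper name `duplicated_lines` is made up for illustration, and the sample is the in1.dat data from the first post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count every line and return [line, count] pairs,
# in sorted order, for the lines that occur more than once.
sub duplicated_lines {
    my @lines = @_;
    my %count;
    $count{$_}++ for @lines;
    return map { [ $_, $count{$_} ] }
           grep { $count{$_} > 1 }
           sort keys %count;
}

# The lines of in1.dat from the first post.
my @sample = qw(1aa 2bbbbb 3cc 1aa 5dd);

# Print the count first, then the line.
printf "%7d %s\n", $_->[1], $_->[0] for duplicated_lines(@sample);
```

Unlike the uniq pipeline, this keeps every distinct line in memory, which is the same trade-off the original script makes.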
 
Kimia
      07-28-2007
On 27 Jul, 15:45, Mirco Wahab <wahab-m...@gmx.net> wrote:
> Kimia wrote:


> > ?????????????????????????????*2834
> > ------------------------
> > Where '?' is \0xff when viewed as a binary file.
> > I'm sure that the input contains no \0xff character. Most lines
> > are tens of characters long, few exceed 100, and none exceeds 1000.

>
> This might depend on the properties of the input file.
> Which encoding does it use: UTF-8/16 or plain ASCII?
>
> What system are you working on, and which Perl version is installed?
>
> Regards
>
> M.


The file is encoded in GB2312, which is ASCII-compatible and is
used in P.R. China.
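
A rough sketch of what that encoding implies for the byte values, assuming the file uses the common EUC-CN byte form of GB2312, where every byte of a two-byte character lies in the range 0xA1-0xFE. Under that assumption a byte in 0x80-0xA0, or the byte 0xFF, cannot be part of valid text. The helper name `has_stray_byte` is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if $line contains a byte that can appear in
# neither ASCII (0x00-0x7F) nor an EUC-CN two-byte character (0xA1-0xFE).
sub has_stray_byte {
    my ($line) = @_;
    return $line =~ /[\x80-\xA0\xFF]/ ? 1 : 0;
}

my $ok  = "abc \xD6\xD0\xCE\xC4";   # ASCII plus two two-byte characters
my $bad = "\xFF\xFF\xFF";           # the kind of line seen in the output

printf "ok line:  %d\nbad line: %d\n",
    has_stray_byte($ok), has_stray_byte($bad);
```

A line flagged by this check was not produced by a well-behaved GB2312 writer, whatever else is in it.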

 
Kimia
      07-28-2007

Thanks, Xho. I've found the bug, which, of course, I made myself.
The output file is perfectly correct. The input file does contain
lines of ????.
Before debugging, I had tried:
$ perl -lne 'print if /^\0xff/'
and it printed nothing, which convinced me that my assumption was right.
However, the regex should have been: /^\xff/
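
For anyone who hits the same thing, a minimal sketch of why the first pattern printed nothing. In a Perl regex, `\0` is an octal escape for the NUL character, so `/^\0xff/` looks for NUL followed by the literal text "xff", while `/^\xff/` is the hex escape for the single byte 0xff. The sample strings below are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $ff_line  = "\xff\xff\xff";   # a line that starts with the byte 0xff
my $nul_line = "\0xff";          # NUL followed by the literal text "xff"

# /^\xff/  : \xff is a hex escape for the single byte 0xff
# /^\0xff/ : \0 is the NUL character, then the literal characters x, f, f
for my $case ( [ '0xff line', $ff_line ], [ 'NUL+xff line', $nul_line ] ) {
    my ($name, $line) = @$case;
    my $hex   = $line =~ /^\xff/  ? 'matches' : 'no match';
    my $octal = $line =~ /^\0xff/ ? 'matches' : 'no match';
    print "$name: /^\\xff/ $hex, /^\\0xff/ $octal\n";
}
```

Only the hex form matches a line of 0xff bytes, which is why the earlier check came back empty even though such lines were in the file.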

It was part of the voluminous log-file processing that I was asked
to do.
The byte \xff should not exist in a normal encoding, so it must have been
generated in some unknown situation.
I wrote the code that I posted for debugging, after I found
exceptions in other processing. However, I did not succeed with it,
and it was so stupid~
Before debugging expels errors, it imports stupidity.
Thanks for all your help.

ps:
> perl -lne 'print $. unless -1==index $_, chr(0xff)' TenLinesInput.dat


I tried this line and it did help me.

--
fous, c'est un mot qu'on dirait invent'e pour nous.

 