# Re: count

Vilya Harvey
Guest
Posts: n/a

 07-08-2009
2009/7/8 Dhananjay <(E-Mail Removed)>:
> I wanted to sort column 2 in ascending order, so I read the whole file into
> the array "data" and did the following:
>
> data.sort(key = lambda fields: fields[2])
>
> I have sorted column 2, however I want to count the numbers in column 2.
> i.e. I want to know, for example, how many repeats of say '3' (first row,
> 2nd column in above data) there are in column 2.

One thing: indexes in Python start from 0, so the second column has an
index of 1 not 2. In other words, it should be data.sort(key = lambda
fields: fields[1]) instead.

With that out of the way, the following will print out a count of each
unique item in the second column:

    from itertools import groupby
    for x, g in groupby([fields[1] for fields in data]):
        print x, len(tuple(g))
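
As a self-contained illustration of the sort-then-group approach, here is a minimal sketch in Python 3 syntax (the thread itself uses Python 2). The rows in `data` are made up for the example, since the original poster's data is not shown here. `collections.Counter` is an alternative that is not mentioned in the thread; it became available in later Python releases (2.7 and 3.1 onward) and does not need sorted input.

```python
from collections import Counter
from itertools import groupby

# Hypothetical sample rows; the second column (index 1) is the one to count.
data = [
    ["a", 3, "x"],
    ["b", 1, "y"],
    ["c", 3, "z"],
    ["d", 2, "w"],
    ["e", 3, "v"],
]

# groupby only merges adjacent equal items, so sort on the second column first.
data.sort(key=lambda fields: fields[1])

# Count each run of equal values in the (now sorted) second column.
grouped_counts = [(value, sum(1 for _ in group))
                  for value, group in groupby(fields[1] for fields in data)]
print(grouped_counts)  # [(1, 1), (2, 1), (3, 3)]

# Alternative: Counter tallies values without requiring sorted input.
counter_counts = Counter(fields[1] for fields in data)
print(counter_counts[3])  # 3
```

If the data is not sorted first, groupby reports a separate count for every run of equal values rather than one total per value.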

Hope that helps,
Vil.

Bearophile
Guest
Posts: n/a

 07-08-2009
Vilya Harvey:
> from itertools import groupby
> for x, g in groupby([fields[1] for fields in data]):
>     print x, len(tuple(g))

Avoid that len(tuple(g)), use something like the following, it's lazy
and saves some memory.

    def leniter(iterator):
        """leniter(iterator): return the length of a given
        iterator, consuming it, without creating a list.
        Never use it with infinite iterators.

        >>> leniter()
        Traceback (most recent call last):
          ...
        TypeError: leniter() takes exactly 1 argument (0 given)
        >>> leniter([])
        0
        >>> leniter([1])
        1
        >>> leniter(iter([1]))
        1
        >>> leniter(x for x in xrange(100) if x%2)
        50
        >>> from itertools import groupby
        >>> [(leniter(g), h) for h,g in groupby("aaaabccaadeeee")]
        [(4, 'a'), (1, 'b'), (2, 'c'), (2, 'a'), (1, 'd'), (4, 'e')]

        >>> def foo0():
        ...     if False: yield 1
        >>> leniter(foo0())
        0

        >>> def foo1(): yield 1
        >>> leniter(foo1())
        1
        """
        # This code is faster than: sum(1 for _ in iterator)
        if hasattr(iterator, "__len__"):
            return len(iterator)
        nelements = 0
        for _ in iterator:
            nelements += 1
        return nelements
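
For readers on Python 3, a minimal port of the function above might look like the sketch below. It assumes only that `xrange` becomes `range`; the parameter is renamed to `iterable` for the illustration, and the two calls at the end repeat examples from the docstring.

```python
from itertools import groupby


def leniter(iterable):
    """Return the number of items in iterable, consuming it if it is an iterator."""
    # Fast path: sized containers already know their length.
    if hasattr(iterable, "__len__"):
        return len(iterable)
    # Otherwise count one item at a time without building a list.
    nelements = 0
    for _ in iterable:
        nelements += 1
    return nelements


# Run lengths of consecutive equal characters.
run_lengths = [(leniter(g), key) for key, g in groupby("aaaabccaadeeee")]
print(run_lengths)

# Number of odd integers below 100, counted from a generator.
odd_count = leniter(x for x in range(100) if x % 2)
print(odd_count)
```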

Bye,
bearophile

Paul Rubin
Guest
Posts: n/a

 07-08-2009
Bearophile <(E-Mail Removed)> writes:
> >     print x, len(tuple(g))
>
> Avoid that len(tuple(g)), use something like the following

print x, sum(1 for _ in g)

Aahz
Guest
Posts: n/a

 07-08-2009
In article <(E-Mail Removed)>,
Bearophile <(E-Mail Removed)> wrote:
>Vilya Harvey:
>>
>> from itertools import groupby
>> for x, g in groupby([fields[1] for fields in data]):
>>     print x, len(tuple(g))
>
>Avoid that len(tuple(g)), use something like the following, it's lazy
>and saves some memory.

The question is whether it saves time, have you tested it?
--
Aahz ((E-Mail Removed)) <*> http://www.pythoncraft.com/

"as long as we like the same operating system, things are cool." --piranha

Paul Rubin
Guest
Posts: n/a

 07-08-2009
(E-Mail Removed) (Aahz) writes:
> >Avoid that len(tuple(g)), use something like the following, it's lazy
> >and saves some memory.
>
> The question is whether it saves time, have you tested it?

len(tuple(xrange(100000000))) ... hmm.

Aahz
Guest
Posts: n/a

 07-08-2009
In article <(E-Mail Removed)>,
Paul Rubin <(E-Mail Removed)> wrote:
>(E-Mail Removed) (Aahz) writes:
>>Paul Rubin deleted an attribution:
>>>
>>>Avoid that len(tuple(g)), use something like the following, it's lazy
>>>and saves some memory.
>>
>> The question is whether it saves time, have you tested it?
>
>len(tuple(xrange(100000000))) ... hmm.

When dealing with small N, O() can get easily swamped by the constant
factors. How often do you deal with more than a hundred fields?
--
Aahz ((E-Mail Removed)) <*> http://www.pythoncraft.com/

"as long as we like the same operating system, things are cool." --piranha

Paul Rubin
Guest
Posts: n/a

 07-08-2009
(E-Mail Removed) (Aahz) writes:
> When dealing with small N, O() can get easily swamped by the constant
> factors. How often do you deal with more than a hundred fields?

The number of fields in the OP's post was not stated. Expecting it to
be less than 100 seems like an ill-advised presumption. If N is
unknown, speed-tuning the case where N is small at the expense of
consuming monstrous amounts of memory when N is large sounds
somewhere between a premature optimization and a nasty bug.
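
The memory side of this argument can be checked directly. The following is a minimal sketch in Python 3, assuming CPython and the standard `tracemalloc` module; the helper names `peak_bytes`, `count_with_tuple`, and `count_with_sum` are made up for the illustration, and the size `n` is arbitrary.

```python
import tracemalloc


def count_with_tuple(iterator):
    # Materializes every item in a tuple before counting.
    return len(tuple(iterator))


def count_with_sum(iterator):
    # Counts items one at a time; nothing is stored.
    return sum(1 for _ in iterator)


def peak_bytes(count_func, n):
    """Return (result, peak traced bytes) for counting a generator of n items."""
    tracemalloc.start()
    result = count_func(x for x in range(n))
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak


n = 200_000  # arbitrary size chosen for the illustration
tuple_result, tuple_peak = peak_bytes(count_with_tuple, n)
sum_result, sum_peak = peak_bytes(count_with_sum, n)
print("tuple:", tuple_result, "items, peak", tuple_peak, "bytes")
print("sum:  ", sum_result, "items, peak", sum_peak, "bytes")
```

The peak for the tuple version grows with n, because every item is held at once, while the generator version should stay roughly constant. Exact byte counts depend on the interpreter and platform.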

J. Clifford Dyer
Guest
Posts: n/a

 07-09-2009
On Wed, 2009-07-08 at 14:45 -0700, Paul Rubin wrote:
> (E-Mail Removed) (Aahz) writes:
> > >Avoid that len(tuple(g)), use something like the following, it's lazy
> > >and saves some memory.
> >
> > The question is whether it saves time, have you tested it?
>
> len(tuple(xrange(100000000))) ... hmm.

timer.py
--------
    from datetime import datetime

    def tupler(n):
        return len(tuple(xrange(n)))

    def summer(n):
        return sum(1 for x in xrange(n))

    def test_func(f, n):
        print f.__name__,
        start = datetime.now()
        print f(n)
        end = datetime.now()
        print "Start: %s" % start
        print "End: %s" % end
        print "Duration: %s" % (end - start,)

    if __name__ == '__main__':
        test_func(summer, 10000000)
        test_func(tupler, 10000000)
        test_func(summer, 100000000)
        test_func(tupler, 100000000)

$ python timer.py
summer 10000000
Start: 2009-07-08 22:02:13.216689
End: 2009-07-08 22:02:15.855931
Duration: 0:00:02.639242
tupler 10000000
Start: 2009-07-08 22:02:15.856122
End: 2009-07-08 22:02:16.743153
Duration: 0:00:00.887031
summer 100000000
Start: 2009-07-08 22:02:16.743863
End: 2009-07-08 22:02:49.372756
Duration: 0:00:32.628893
Killed
$

Note that "Killed" did not come from anything I did. The tupler just
bombed out when the tuple got too big for it to handle. Tupler was
faster for as large an input as it could handle, as well as for small
inputs (test not shown).

Bearophile
Guest
Posts: n/a

 07-09-2009
Paul Rubin:
> print x, sum(1 for _ in g)

Don't use that, use my function. If g has a __len__ you are wasting
time. And sum(1 ...) is (on my PC) slower.

J. Clifford Dyer:
> if __name__ == '__main__':
>     test_func(summer, 10000000)
>     test_func(tupler, 10000000)
>     test_func(summer, 100000000)
>     test_func(tupler, 100000000)

Have you forgotten my function?

Bye,
bearophile

J. Cliff Dyer
Guest
Posts: n/a

 07-09-2009
Bearophile wins! (This only times the loop itself. It doesn't check
for __len__)

summer:5
0:00:00.000051
bearophile:5
0:00:00.000009
summer:50
0:00:00.000030
bearophile:50
0:00:00.000013
summer:500
0:00:00.000077
bearophile:500
0:00:00.000053
summer:5000
0:00:00.000575
bearophile:5000
0:00:00.000473
summer:50000
0:00:00.005583
bearophile:50000
0:00:00.004625
summer:500000
0:00:00.055834
bearophile:500000
0:00:00.046137
summer:5000000
0:00:00.426734
bearophile:5000000
0:00:00.349573
summer:50000000
0:00:04.180920
bearophile:50000000
0:00:03.652311
summer:500000000
0:00:42.647885
bearophile:500000000
0:00:35.190550
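
The script that produced these numbers was not posted. As a rough way to rerun a similar comparison, here is a minimal sketch in Python 3 using the standard `timeit` module. The names `loop_count` and `best_time` are made up for the illustration, and `loop_count` is only the counting loop from leniter without the `__len__` check, matching the note above. No timings are claimed; results will vary by machine and interpreter version.

```python
import timeit


def summer(iterator):
    # Paul Rubin's suggestion: count with a generator expression.
    return sum(1 for _ in iterator)


def loop_count(iterator):
    # The counting loop from leniter, without the __len__ check.
    nelements = 0
    for _ in iterator:
        nelements += 1
    return nelements


def best_time(func, n, repeats=3):
    """Best wall-clock time in seconds for func over a fresh iterator of n items."""
    timings = timeit.repeat(lambda: func(iter(range(n))), number=1, repeat=repeats)
    return min(timings)


for n in (5, 500, 50_000):
    print(n,
          "summer: %.6f s" % best_time(summer, n),
          "loop: %.6f s" % best_time(loop_count, n))
```

Taking the minimum over several repeats reduces noise from other processes, which matters most for the small sizes.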

On Thu, 2009-07-09 at 04:04 -0700, Bearophile wrote:
> Paul Rubin:
> > print x, sum(1 for _ in g)
>
> Don't use that, use my function. If g has a __len__ you are wasting
> time. And sum(1 ...) is (on my PC) slower.
>
>
> J. Clifford Dyer:
> > if __name__ == '__main__':
> >     test_func(summer, 10000000)
> >     test_func(tupler, 10000000)
> >     test_func(summer, 100000000)
> >     test_func(tupler, 100000000)
>
> Have you forgotten my function?
>
> Bye,
> bearophile
