Velocity Reviews > number of different lines in a file

# number of different lines in a file

r.e.s.
Guest
Posts: n/a

 05-18-2006
I have a million-line text file with 100 characters per line,
and simply need to determine how many of the lines are distinct.

On my PC, this little program just goes to never-never land:

def number_distinct(fn):
f = file(fn)
L = []
while x<>'':
if x not in L:
L = L + [x]
return len(L)

Would anyone care to point out improvements?
Is there a better algorithm for doing this?

Larry Bates
Guest
Posts: n/a

 05-18-2006
r.e.s. wrote:
> I have a million-line text file with 100 characters per line,
> and simply need to determine how many of the lines are distinct.
>
> On my PC, this little program just goes to never-never land:
>
> def number_distinct(fn):
> f = file(fn)
> L = []
> while x<>'':
> if x not in L:
> L = L + [x]
> return len(L)
>
> Would anyone care to point out improvements?
> Is there a better algorithm for doing this?

Sounds like homework, but I'll bite.

def number_distinct(fn):
hash_dict={}
total_lines=0
for line in open(fn, 'r'):
total_lines+=1
key=hash(line.strip())
if hash_dict.has_key(key): continue
hash_dict[key]=1

if __name__=="__main__":
fn='c:\\test.txt'
total_lines, distinct_lines=number_distinct(fn)
print "Total lines=%i, distinct lines=%i" % (total_lines, distinct_lines)

-Larry Bates

Ben Finney
Guest
Posts: n/a

 05-18-2006
"r.e.s." <(E-Mail Removed)> writes:

> I have a million-line text file with 100 characters per line,
> and simply need to determine how many of the lines are distinct.

I'd generalise it by allowing the caller to pass any iterable set of
items. A file handle can be iterated this way, but so can any
sequence or iterable.

def count_distinct(seq):
""" Count the number of distinct items """
counts = dict()
for item in seq:
if not item in counts:
counts[item] = 0
counts[item] += 1
return len(counts)

>>> infile = file('foo.txt')
>>> for line in file('foo.txt'):

... print line,
...
abc
def
ghi
abc
ghi
def
xyz
abc
abc
def

>>> infile = file('foo.txt')
>>> print count_distinct(infile)

5

--
\ "A man may be a fool and not know it -- but not if he is |
`\ married." -- Henry L. Mencken |
_o__) |
Ben Finney

Fredrik Lundh
Guest
Posts: n/a

 05-18-2006
r.e.s. wrote:

> I have a million-line text file with 100 characters per line,
> and simply need to determine how many of the lines are distinct.
>
> On my PC, this little program just goes to never-never land:
>
> def number_distinct(fn):
> f = file(fn)
> L = []
> while x<>'':
> if x not in L:
> L = L + [x]
> return len(L)

ouch.

> Would anyone care to point out improvements?
> Is there a better algorithm for doing this?

try this:

def number_distinct(fn):
return len(set(s.strip() for s in open(fn)))

</F>

Bill Pursell
Guest
Posts: n/a

 05-18-2006

r.e.s. wrote:
> I have a million-line text file with 100 characters per line,
> and simply need to determine how many of the lines are distinct.
>
> On my PC, this little program just goes to never-never land:
>
> def number_distinct(fn):
> f = file(fn)
> L = []
> while x<>'':
> if x not in L:
> L = L + [x]
> return len(L)
>
> Would anyone care to point out improvements?
> Is there a better algorithm for doing this?

Have you tried
cat file | sort | uniq | wc -l ?
sort might choke on the large file, and this isn't python, but it
might work. You might try breaking the file into
smaller peices, maybe based on the first character, and then
process them seperately. The time killer is probably
the "x not in L" line, since L is getting very large. By
subdividing the problem initially, that time constraint
will be better.

Tim Chase
Guest
Posts: n/a

 05-18-2006
> I have a million-line text file with 100 characters per line,
> and simply need to determine how many of the lines are distinct.

A few ideas:

1) the shell way:

bash\$ sort file.in | uniq | wc -l

This doesn't strip whitespace...a little sed magic would
strip off whitespace for you:

bash\$ sed 'regexp' file.in | sort | uniq | wc -l

where 'regexp' is something like this atrocity

's/^[[:space:]]*\(\([[:space:]]*[^[:space:]][^[:space:]]*\)*\)[[:space:]]*\$/\1/'

(If your sed supports "\s" and "\S" for "whitespace" and
"nonwhitespace", it makes the expression a lot less hairy:

's/^\s*\(\(\s*\S\S*\)*\)\s*\$/\1/'

and, IMHO, a little easier to read. There might be a
nice/concise perl one-liner for this too)

2) use a python set:

s = set()
for line in open("file.in"):
return len(s)

3) compact #2:

return len(set([line.strip() for line in file("file.in")]))

or, if stripping the lines isn't a concern, it can just be

return len(set(file("file.in")))

The logic in the set keeps track of ensuring that no
duplicates get entered.

Depending on how many results you *expect*, this could
become cumbersome, as you have to have every unique line in
memory. A stream-oriented solution can be kinder on system
resources, but would require that the input be sorted first.

-tkc

Andrew Robert
Guest
Posts: n/a

 05-18-2006
r.e.s. wrote:
> I have a million-line text file with 100 characters per line,
> and simply need to determine how many of the lines are distinct.
>
> On my PC, this little program just goes to never-never land:
>
> def number_distinct(fn):
> f = file(fn)
> L = []
> while x<>'':
> if x not in L:
> L = L + [x]
> return len(L)
>
> Would anyone care to point out improvements?
> Is there a better algorithm for doing this?

Take a look at http://aspn.activestate.com/ASPN/Coo...n/Recipe/52560

It is a python approach to the uniq command on *nix.

r.e.s.
Guest
Posts: n/a

 05-19-2006
"Tim Chase" <(E-Mail Removed)> wrote ...
> 2) use a python set:
>
> s = set()
> for line in open("file.in"):
> return len(s)
>
> 3) compact #2:
>
> return len(set([line.strip() for line in file("file.in")]))
>
> or, if stripping the lines isn't a concern, it can just be
>
> return len(set(file("file.in")))
>
> The logic in the set keeps track of ensuring that no
> duplicates get entered.
>
> Depending on how many results you *expect*, this could
> become cumbersome, as you have to have every unique line in
> memory. A stream-oriented solution can be kinder on system
> resources, but would require that the input be sorted first.

Thank you (and all the others who responded!) -- set() does
the trick, reducing the job to about a minute. I may play
later with the other alternatives people mentionsed (dict(),
the "expected number", which in my case was around 0-10 (as
it turned out, there were no dups).

BTW, the first thing I tried was Fredrik Lundh's program:

def number_distinct(fn):
return len(set(s.strip() for s in open(fn)))

which worked without the square brackets. Interesting that
omitting them doesn't seem to matter.

pac
Guest
Posts: n/a

 05-19-2006
A generator expression can "share" the parenthesis of a function call.
The syntax is explained in PEP 289, which is also in "What's new" in
the Python 2.4 docs.

Nice line of code!

Fredrik Lundh
Guest
Posts: n/a

 05-19-2006
r.e.s. wrote:

> BTW, the first thing I tried was Fredrik Lundh's program:
>
> def number_distinct(fn):
> return len(set(s.strip() for s in open(fn)))
>
> which worked without the square brackets. Interesting that
> omitting them doesn't seem to matter.

a for loop inside square brackets is a "list comprehension", and the
result is a list. if you use a list comprehension inside a function
call, the full list is built *before* the function is called. in this
case, this would mean that the entire file would be read into memory
before the set was constructed.

if you change the square brackets to ordinary parentheses, you get a

http://pyref.infogami.com/generator-expressions

the generator expression results in an iterator object that calculates
the values one by one. if you pass it to a function that expects an
iterator, that function will end up "running the for loop itself", and
no extra storage is needed. (in this case, you still need memory to
hold the set, of course, so the difference between a list comprehension
and a generator expression will only matter if you have lots of duplicates).

finally, a syntax shortcut lets you remove the parentheses if the
generator expression is the only argument in a function call, as in the
above example.

</F>