Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > gather information from various files efficiently

Reply
Thread Tools

gather information from various files efficiently

 
 
Klaus Neuner
Guest
Posts: n/a
 
      12-14-2004
Hello,

I need to gather information that is contained in various files.

Like so:

file1:
=====================
foo : 1 2
bar : 2 4
baz : 3
=====================

file2:
=====================
foo : 5
bar : 6
baz : 7
=====================

file3:
=====================
foo : 4 18
bar : 8
=====================


The straightforward way to solve this problem is to create a
dictionary. Like so:


[...]

a, b = get_information(line)
if a in dict.keys():
dict[a].append(b)
else:
dict[a] = [b]


Yet, I have got 43 such files. Together they are 4,1M
large. In the future, they will probably become much larger.
At the moment, the process takes several hours. As it is a process
that I have to run very often, I would like it to be faster.

How could the problem be solved more efficiently?


Klaus
 
Reply With Quote
 
 
 
 
Keith Dart
Guest
Posts: n/a
 
      12-14-2004
Klaus Neuner wrote:
> Hello,
>
> I need to gather information that is contained in various files.
>
> Like so:
>
> file1:
> =====================
> foo : 1 2
> bar : 2 4
> baz : 3
> =====================
>
> file2:
> =====================
> foo : 5
> bar : 6
> baz : 7
> =====================
>
> file3:
> =====================
> foo : 4 18
> bar : 8
> =====================
>
>
> The straightforward way to solve this problem is to create a
> dictionary. Like so:
>
>
> [...]
>
> a, b = get_information(line)
> if a in dict.keys():
> dict[a].append(b)
> else:
> dict[a] = [b]


Aye...

the dict.keys() line creates a temporary list, and then the 'in' does a
linear search of the list. Better would be:

try:
dict[a].append(b)
except KeyError:
dict[a] = [b]

since you expect the key to be there most of the time, this method is
most efficient. You optomistically get the dictionary entry, and on the
exceptional case where it doesn't yet exist you add it.






--
\/ \/
(O O)
-- --------------------oOOo~(_)~oOOo----------------------------------------
Keith Dart <>
public key: ID: F3D288E4
================================================== ==========================
 
Reply With Quote
 
 
 
 
Kent Johnson
Guest
Posts: n/a
 
      12-14-2004
Keith Dart wrote:
> try:
> dict[a].append(b)
> except KeyError:
> dict[a] = [b]


or my favorite Python shortcut:
dict.setdefault(a, []).append(b)

Kent
 
Reply With Quote
 
Fernando Perez
Guest
Posts: n/a
 
      12-14-2004
Keith Dart wrote:

> Aye...
>
> the dict.keys() line creates a temporary list, and then the 'in' does a
> linear search of the list. Better would be:
>
> try:
> dict[a].append(b)
> except KeyError:
> dict[a] = [b]
>
> since you expect the key to be there most of the time, this method is
> most efficient. You optomistically get the dictionary entry, and on the
> exceptional case where it doesn't yet exist you add it.


I wonder if

dct.setdefault(a,[]).append(b)

wouldn't be even faster. It saves setting up the try/except frame handling in
python (I assume the C implementation of dicts achieves similar results with
much less overhead).

Cheers,

f

ps. I changed dict->dct because it's a generally Bad Idea (TM) to name local
variables as builtin types. This, for the benefit of the OP (I know you were
just following his code conventions).

 
Reply With Quote
 
Keith Dart
Guest
Posts: n/a
 
      12-14-2004
Kent Johnson wrote:
> Keith Dart wrote:
>
>> try:
>> dict[a].append(b)
>> except KeyError:
>> dict[a] = [b]

>
>
> or my favorite Python shortcut:
> dict.setdefault(a, []).append(b)
>
> Kent


Hey, when did THAT get in there? That's nice. However, the
try..except block is a useful pattern for many similiar situations that
the OP might want to keep in mind. It is usually better than the
following, also:

if dct.has_key(a):
dct[a].append(b)
else:
dct[a] = [b]


Which is a pattern I have seen often.




--
\/ \/
(O O)
-- --------------------oOOo~(_)~oOOo----------------------------------------
Keith Dart <>
vcard: <http://www.kdart.com/~kdart/kdart.vcf>
public key: ID: F3D288E4 URL: <http://www.kdart.com/~kdart/public.key>
================================================== ==========================
 
Reply With Quote
 
Fredrik Lundh
Guest
Posts: n/a
 
      12-14-2004
Keith Dart wrote:

>>> try:
>>> dict[a].append(b)
>>> except KeyError:
>>> dict[a] = [b]


the drawback here is that exceptions are relatively expensive; if the
number of collisions are small, you end up throwing and catching lots
of exceptions. in that case, there are better ways to do this.

>> dict.setdefault(a, []).append(b)


the drawback here is that you create a new object for each call, but
if the number of collisions are high, you end up throwing most of them
away. in that case, there are better ways to do this.

(gotta love that method name, btw. a serious candidate for the "most
confusing name in the standard library" contest... or maybe even the
"most confusing name in the history of python" contest...)

> Hey, when did THAT get in there? That's nice. However, the try..except block is a useful
> pattern for many similiar situations that the OP might want to keep in mind. It is usually better
> than the following, also:
>
> if dct.has_key(a):
> dct[a].append(b)
> else:
> dct[a] = [b]


the drawback here is that if the number of collisions are high, you end
up doing lots of extra dictionary lookups. in that case, there are better
ways to do this.

</F>



 
Reply With Quote
 
Keith Dart
Guest
Posts: n/a
 
      12-14-2004
Fredrik Lundh wrote:
> ...
>>if dct.has_key(a):
>> dct[a].append(b)
>>else:
>> dct[a] = [b]

>
>
> the drawback here is that if the number of collisions are high, you end
> up doing lots of extra dictionary lookups. in that case, there are better
> ways to do this.


Sigh, this reminds me of a discussion I had at my work once... It seems
to write optimal Python code one must understand various probabilites of
your data, and code according to the likely scenario. Now, perhaps
we could write an adaptive data analyzer-code-generator...







--
\/ \/
(O O)
-- --------------------oOOo~(_)~oOOo----------------------------------------
Keith Dart <>
public key: ID: F3D288E4
================================================== ==========================
 
Reply With Quote
 
Keith Dart
Guest
Posts: n/a
 
      12-14-2004
Fredrik Lundh wrote:
> ...
>>if dct.has_key(a):
>> dct[a].append(b)
>>else:
>> dct[a] = [b]

>
>
> the drawback here is that if the number of collisions are high, you end
> up doing lots of extra dictionary lookups. in that case, there are better
> ways to do this.


Sigh, this reminds me of a discussion I had at my work once... It seems
to write optimal Python code one must understand various probabilites of
your data, and code according to the likely scenario. Now, perhaps
we could write an adaptive data analyzer-code-generator...







--
\/ \/
(O O)
-- --------------------oOOo~(_)~oOOo----------------------------------------
Keith Dart <>
public key: ID: F3D288E4
================================================== ==========================
 
Reply With Quote
 
Richie Hindle
Guest
Posts: n/a
 
      12-14-2004

[Keith]
> Sigh, this reminds me of a discussion I had at my work once... It seems
> to write optimal Python code one must understand various probabilites of
> your data, and code according to the likely scenario.


s/Python //g

--
Richie Hindle


 
Reply With Quote
 
Peter Hansen
Guest
Posts: n/a
 
      12-14-2004
Keith Dart wrote:
> Sigh, this reminds me of a discussion I had at my work once... It seems
> to write optimal Python code one must understand various probabilites of
> your data, and code according to the likely scenario.


And this is different from optimizing in *any* other language
in what way?

-Peter
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Gather autocompletion-information while running tests? Jan Kechel Ruby 1 08-17-2010 12:23 AM
How to efficiently extract information from structured text file Imaginationworks Python 7 02-18-2010 01:39 PM
How to gather information about the system hardware? Florencio Cano Python 1 05-08-2008 08:38 AM
How to gather the caller page information? ABC ASP .Net 1 01-13-2006 10:56 AM
Gather Results From Aspx File Laidbak ASP .Net 0 01-22-2004 11:24 AM



Advertisments
 



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57