Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Best way to handle large lists?

Reply
Thread Tools

Best way to handle large lists?

 
 
Chaz Ginger
Guest
Posts: n/a
 
      10-03-2006
I have a system that has a few lists that are very large (thousands or
tens of thousands of entries) and some that are rather small. Many times
I have to produce the difference between a large list and a small one,
without destroying the integrity of either list. I was wondering if
anyone has any recommendations on how to do this and keep performance
high? Is there a better way than

[ i for i in bigList if i not in smallList ]

Thanks.
Chaz
 
Reply With Quote
 
 
 
 
Duncan Booth
Guest
Posts: n/a
 
      10-03-2006
Chaz Ginger <(E-Mail Removed)> wrote:

> I have a system that has a few lists that are very large (thousands or
> tens of thousands of entries) and some that are rather small. Many times
> I have to produce the difference between a large list and a small one,
> without destroying the integrity of either list. I was wondering if
> anyone has any recommendations on how to do this and keep performance
> high? Is there a better way than
>
> [ i for i in bigList if i not in smallList ]


How about:

smallSet = set(smallList)
something = [ i for i in bigList if i not in smallSet ]

Use timeit.py on some representative data to see what difference that
makes.
 
Reply With Quote
 
 
 
 
Paul Rubin
Guest
Posts: n/a
 
      10-03-2006
Chaz Ginger <(E-Mail Removed)> writes:
> I have a system that has a few lists that are very large (thousands or
> tens of thousands of entries) and some that are rather small. Many times
> I have to produce the difference between a large list and a small one,
> without destroying the integrity of either list. I was wondering if
> anyone has any recommendations on how to do this and keep performance
> high? Is there a better way than
>
> [ i for i in bigList if i not in smallList ]


diff = list(set(bigList) - set(smallList))

Note that doesn't get you the elements in any particular order.
 
Reply With Quote
 
durumdara
Guest
Posts: n/a
 
      10-03-2006
Chaz Ginger írta:
> I have a system that has a few lists that are very large (thousands or
> tens of thousands of entries) and some that are rather small. Many times
> I have to produce the difference between a large list and a small one,
> without destroying the integrity of either list. I was wondering if
> anyone has any recommendations on how to do this and keep performance
> high? Is there a better way than
>
> [ i for i in bigList if i not in smallList ]
>
> Thanks.
> Chaz
>

Hi !

If you have big list, you can use dbm like databases.
They are very quick. BSDDB, flashdb, etc. See SleepyCat, or see python help.

In is very slow in large datasets, but bsddb is use hash values, so it
is very quick.
The SleepyCat database have many extras, you can set the cache size and
many other parameters.

Or if you don't like dbm style databases, you can use SQLite. Also
quick, you can use SQL commands.
A little slower than bsddb, but it is like SQL server. You can improve
the speed with special parameters.

dd

 
Reply With Quote
 
Bill Williams
Guest
Posts: n/a
 
      10-03-2006
I don't know enough about Python internals, but the suggested solutions
all seem to involve scanning bigList. Can this presumably linear
operation be avoided by using dict or similar to find all occurrences of
smallist items in biglist and then deleting those occurrences?

Bill Williams



In article <prrUg.1662$We.477@trndny08>,
Chaz Ginger <(E-Mail Removed)> wrote:

> I have a system that has a few lists that are very large (thousands or
> tens of thousands of entries) and some that are rather small. Many times
> I have to produce the difference between a large list and a small one,
> without destroying the integrity of either list. I was wondering if
> anyone has any recommendations on how to do this and keep performance
> high? Is there a better way than
>
> [ i for i in bigList if i not in smallList ]
>
> Thanks.
> Chaz

 
Reply With Quote
 
Hari Sekhon
Guest
Posts: n/a
 
      10-03-2006
I don't know much about the python internals either, so this may be the
blind leading the blind, but aren't dicts much slower to work with than
lists and therefore wouldn't your suggestion to use dicts be much
slower? I think it's something to do with the comparative overhead of
using keys in dicts rather than using positional indexes in lists/arrays...

At least that is what I thought.

Can anyone confirm this?

-h

Hari Sekhon



Bill Williams wrote:
> I don't know enough about Python internals, but the suggested solutions
> all seem to involve scanning bigList. Can this presumably linear
> operation be avoided by using dict or similar to find all occurrences of
> smallist items in biglist and then deleting those occurrences?
>
> Bill Williams
>
>
>
> In article <prrUg.1662$We.477@trndny08>,
> Chaz Ginger <(E-Mail Removed)> wrote:
>
>
>> I have a system that has a few lists that are very large (thousands or
>> tens of thousands of entries) and some that are rather small. Many times
>> I have to produce the difference between a large list and a small one,
>> without destroying the integrity of either list. I was wondering if
>> anyone has any recommendations on how to do this and keep performance
>> high? Is there a better way than
>>
>> [ i for i in bigList if i not in smallList ]
>>
>> Thanks.
>> Chaz
>>

 
Reply With Quote
 
Sybren Stuvel
Guest
Posts: n/a
 
      10-03-2006
Bill Williams enlightened us with:
> I don't know enough about Python internals, but the suggested
> solutions all seem to involve scanning bigList. Can this presumably
> linear operation be avoided by using dict or similar to find all
> occurrences of smallist items in biglist and then deleting those
> occurrences?


And how would that beat O(n)? Every element of bigList has to be
scanned at one point, either to compare it to every earlier element in
bigList and eliminate it, or to compare it to every element in
smallList.

Run benchmarks on the suggestions, and see which is fastest for
yourself.

Sybren
--
Sybren Stüvel
Stüvel IT - http://www.stuvel.eu/
 
Reply With Quote
 
Chaz Ginger
Guest
Posts: n/a
 
      10-03-2006
I've done that and decided that Python's 'list comprehension' isn't a
way to go. I was hoping that perhaps someone had some experience with
some C or C++ library that has a Python interface that would make a
difference.

Chaz

Sybren Stuvel wrote:
> Bill Williams enlightened us with:
>> I don't know enough about Python internals, but the suggested
>> solutions all seem to involve scanning bigList. Can this presumably
>> linear operation be avoided by using dict or similar to find all
>> occurrences of smallist items in biglist and then deleting those
>> occurrences?

>
> And how would that beat O(n)? Every element of bigList has to be
> scanned at one point, either to compare it to every earlier element in
> bigList and eliminate it, or to compare it to every element in
> smallList.
>
> Run benchmarks on the suggestions, and see which is fastest for
> yourself.
>
> Sybren

 
Reply With Quote
 
Paul Rubin
Guest
Posts: n/a
 
      10-03-2006
Sybren Stuvel <(E-Mail Removed)> writes:
> > I don't know enough about Python internals, but the suggested
> > solutions all seem to involve scanning bigList. Can this presumably
> > linear operation be avoided by using dict or similar to find all
> > occurrences of smallist items in biglist and then deleting those
> > occurrences?

>
> And how would that beat O(n)? Every element of bigList has to be
> scanned at one point, either to compare it to every earlier element in
> bigList and eliminate it, or to compare it to every element in
> smallList.


Maybe the application should use sets instead of lists for these
collections.
 
Reply With Quote
 
Chaz Ginger
Guest
Posts: n/a
 
      10-03-2006
Paul Rubin wrote:
> Sybren Stuvel <(E-Mail Removed)> writes:
>>> I don't know enough about Python internals, but the suggested
>>> solutions all seem to involve scanning bigList. Can this presumably
>>> linear operation be avoided by using dict or similar to find all
>>> occurrences of smallist items in biglist and then deleting those
>>> occurrences?

>> And how would that beat O(n)? Every element of bigList has to be
>> scanned at one point, either to compare it to every earlier element in
>> bigList and eliminate it, or to compare it to every element in
>> smallList.

>
> Maybe the application should use sets instead of lists for these
> collections.

What would sets do for me over lists?

Chaz
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
best way to handle sql decimal fields Steve Richter ASP .Net 3 03-31-2005 02:55 PM
What's the best way to handle showing/editing this data? Alan Silver ASP .Net 4 02-16-2005 06:23 PM
Best way to handle documents in ASP.NET Thomas Scheiderich ASP .Net 11 05-20-2004 05:57 PM
Question: Best way to handle DBNULL in datareaders Ravikanth[MVP] ASP .Net 6 07-18-2003 10:51 AM
Re: Best way to handle a mutually exclusive situation gabriel XML 0 06-25-2003 08:08 AM



Advertisments