Velocity Reviews > Comparing lists

Comparing lists

Odd-R.
Guest
Posts: n/a

 10-10-2005
I have to lists, A and B, that may, or may not be equal. If they are not
identical, I want the output to be three new lists, X,Y and Z where X has
all the elements that are in A, but not in B, and Y contains all the
elements that are B but not in A. Z will then have the elements that are
in both A and B.

One way of doing this is of course to iterate throug the lists and compare
each of the element, but is there a more efficient way?

--
Har du et kjøleskap, har du en TV
så har du alt du trenger for å leve

-Jokke & Valentinerne

Laszlo Zsolt Nagy
Guest
Posts: n/a

 10-10-2005
Odd-R. wrote:

>I have to lists, A and B, that may, or may not be equal. If they are not
>identical, I want the output to be three new lists, X,Y and Z where X has
>all the elements that are in A, but not in B, and Y contains all the
>elements that are B but not in A. Z will then have the elements that are
>in both A and B.
>
>

These are set operations.

>One way of doing this is of course to iterate throug the lists and compare
>each of the element, but is there a more efficient way?
>
>

Maybe, using sets?

L1 = [1,2,3,4]
L2=[3,4,5,6]
diff1 = list(set(L1)-set(L2)) # [1,2]
diff2 = list(set(L2)-set(L1)) # [5,6]
symdiff = diff1+diff2 # Symmetric difference [1,2,5,6]
intersect = set(L1+L2) - set(symdiff) # Intersect [3,4]

Best,

Les

ajikoe@gmail.com
Guest
Posts: n/a

 10-10-2005
try to use set.
L1 = [1,1,2,3,4]
L2 = [1,3, 99]
A = set(L1)
B = set(L2)

X = A-B
print X

Y = B-A
print Y

Z = A | B
print Z

Cheers,
pujo

Christian Stapfer
Guest
Posts: n/a

 10-10-2005
<(E-Mail Removed)> wrote in message
news:(E-Mail Removed) oups.com...
> try to use set.
> L1 = [1,1,2,3,4]
> L2 = [1,3, 99]
> A = set(L1)
> B = set(L2)
>
> X = A-B
> print X
>
> Y = B-A
> print Y
>
> Z = A | B
> print Z

But how "efficient" is this? Could you be a bit
more explicit on that point? What is the order
of complexity of set([...]) or of A-B, B-A,
A | B, A ^ B and A & B? - The Python Documentation
leaves me completely in the dark in this regard.

Sorting the two lists and then extracting
A-B, B-A, A|B, A & B and A ^ B in one single
pass seems to me very likely to be much faster
for large lists.

Regards,
Christian

George Sakkis
Guest
Posts: n/a

 10-10-2005
"Christian Stapfer" <(E-Mail Removed)> wrote:

> <(E-Mail Removed)> wrote:
> > try to use set.

>
> Sorting the two lists and then extracting
> A-B, B-A, A|B, A & B and A ^ B in one single
> pass seems to me very likely to be much faster
> for large lists.

Why don't you implement it, test it and time it to be more convincing about your intuition ?

George

Steven D'Aprano
Guest
Posts: n/a

 10-10-2005
On Mon, 10 Oct 2005 14:34:35 +0200, Christian Stapfer wrote:

> Sorting the two lists and then extracting
> A-B, B-A, A|B, A & B and A ^ B in one single
> pass seems to me very likely to be much faster
> for large lists.

Unless you are running a Python compiler in your head, chances are your
intuition has no connection to the actual performance of Python except by
pure coincidence.

Write some code, and time it in action.

In fact, here is some code I've written for you, complete with timer. Do
yourself a favour, and run it and see for yourself which is faster.

Then, if you think you can write a list version that is more efficient
than my (admittedly very inefficient) version, please do so. Then time the
difference again. Then think about how much time it took you to write,
test and debug your list version, compared to my four lines using sets.

from sets import Set
import time

def compare_and_separate(A, B):
only_A = []
only_B = []
both_AB = []
for item in A:
# ignore items we've already seen before
if item in only_A or item in both_AB:
continue
if item in B:
both_AB.append(item)
else:
only_A.append(item)
for item in B:
# ignore items we've already seen before
if item in only_B or item in both_AB:
continue
only_B.append(item)
return (only_A, only_B, both_AB)

def compare_and_separate_with_sets(A, B):
only_A = list(Set(A)-Set(B))
only_B = list(Set(B)-Set(A))
both_AB = list(Set(A+B) - Set(only_A+only_B))
return (only_A, only_B, both_AB)

A = range(199) + range(44, 300, 3) + range(250, 500) + range(2000, 2100)
B = range(1000) + range(3000, 4000)

def tester():
# confirm that the two functions give the same results
r1 = compare_and_separate(range(5), range(3,)
r2 = compare_and_separate_with_sets(range(5), range(3,)
print r1
print r2
if r1 == r2:
print " Same."
else:
print " Warning: results are different."
loopit = range(20) # repeat the test 20 times for timing purposes
t1 = time.time()
for i in loopit:
results = compare_and_separate(A, B)
t1 = time.time() - t1
t2 = time.time()
for i in loopit:
results = compare_and_separate_with_sets(A, B)
t2 = time.time() - t2
print "Time with loops:", t1
print "Time with sets:", t2

--
Steven.

Christian Stapfer
Guest
Posts: n/a

 10-10-2005
"George Sakkis" <(E-Mail Removed)> wrote in message
news:1128952417.0544cc73190bbbd4e683448c137969c8@t eranews...
> "Christian Stapfer" <(E-Mail Removed)> wrote:
>
>> <(E-Mail Removed)> wrote:
>> > try to use set.

>>
>> Sorting the two lists and then extracting
>> A-B, B-A, A|B, A & B and A ^ B in one single
>> pass seems to me very likely to be much faster
>> for large lists.

>
> Why don't you implement it, test it and time it

The problem is in the generation of the test data.
Even merely generating a set of (suitably "average",
"random", and suitably "worst case") datasets might
turn out to be a major undertaking.
If the documentation stated the order-of-magnitude
behavior of those basic operations up front, then
I (and *anyone* else who ever wanted to use those
operations on large lists / large sets) could do
a quick order-of-magnitude estimation of how
a certain program design will behave, performance
wise.
*Experimenting* is not necessarily as easy to
do as you seem to believe. How do you, for example,
hit upon the worst-case behavior with your test
data? - Without knowing *anything* about the
implementation it might a matter sheer luck.
If you *know* something about the implementation
then, of course, you might be able to figure it
out. (But note that if you know *that* much about
the implementation, you usually have an order-of-
magnitude estimate anyway and don't need to do
*any* experimenting in order to answer my question.)

Regards,
Christian

Steve Holden
Guest
Posts: n/a

 10-10-2005
Christian Stapfer wrote:
> "George Sakkis" <(E-Mail Removed)> wrote in message
> news:1128952417.0544cc73190bbbd4e683448c137969c8@t eranews...
>
>>"Christian Stapfer" <(E-Mail Removed)> wrote:
>>
>>
>>><(E-Mail Removed)> wrote:
>>>
>>>>try to use set.
>>>
>>> Sorting the two lists and then extracting
>>>A-B, B-A, A|B, A & B and A ^ B in one single
>>>pass seems to me very likely to be much faster
>>>for large lists.

>>
>>Why don't you implement it, test it and time it

>
>
> The problem is in the generation of the test data.
> Even merely generating a set of (suitably "average",
> "random", and suitably "worst case") datasets might
> turn out to be a major undertaking.
> If the documentation stated the order-of-magnitude
> behavior of those basic operations up front, then
> I (and *anyone* else who ever wanted to use those
> operations on large lists / large sets) could do
> a quick order-of-magnitude estimation of how
> a certain program design will behave, performance
> wise.
> *Experimenting* is not necessarily as easy to
> do as you seem to believe. How do you, for example,
> hit upon the worst-case behavior with your test
> data? - Without knowing *anything* about the
> implementation it might a matter sheer luck.
> If you *know* something about the implementation
> then, of course, you might be able to figure it
> out. (But note that if you know *that* much about
> the implementation, you usually have an order-of-
> magnitude estimate anyway and don't need to do
> *any* experimenting in order to answer my question.)
>

You are, of course, either assuming that there's a single implementation
of Python, or that all implementations have the same behaviour. Or
alternatively you are asking all implementers to do what you seem to
consider so difficult (i.e. identify worst-case scenarios and then
estimate order-of-magnitude behaviour for them).

Test results with known test data are relatively easy to extrapolate
from, and if your test data are reasonably representative of live data
then so will your performance estimates.

Anyway, aren't you more interested in average behaviour than worst-case?
Most people are.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/

Christian Stapfer
Guest
Posts: n/a

 10-11-2005
"Steve Holden" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> Christian Stapfer wrote:
>> "George Sakkis" <(E-Mail Removed)> wrote in message
>> news:1128952417.0544cc73190bbbd4e683448c137969c8@t eranews...
>>
>>>"Christian Stapfer" <(E-Mail Removed)> wrote:
>>>><(E-Mail Removed)> wrote:
>>>>
>>>>>try to use set.
>>>>
>>>> Sorting the two lists and then extracting
>>>>A-B, B-A, A|B, A & B and A ^ B in one single
>>>>pass seems to me very likely to be much faster
>>>>for large lists.
>>>
>>>Why don't you implement it, test it and time it

>>
>> The problem is in the generation of the test data.
>> Even merely generating a set of (suitably "average",
>> "random", and suitably "worst case") datasets might
>> turn out to be a major undertaking.
>> If the documentation stated the order-of-magnitude
>> behavior of those basic operations up front, then
>> I (and *anyone* else who ever wanted to use those
>> operations on large lists / large sets) could do
>> a quick order-of-magnitude estimation of how
>> a certain program design will behave, performance
>> wise.
>> *Experimenting* is not necessarily as easy to
>> do as you seem to believe. How do you, for example,
>> hit upon the worst-case behavior with your test
>> data? - Without knowing *anything* about the
>> implementation it might a matter sheer luck.
>> If you *know* something about the implementation
>> then, of course, you might be able to figure it
>> out. (But note that if you know *that* much about
>> the implementation, you usually have an order-of-
>> magnitude estimate anyway and don't need to do
>> *any* experimenting in order to answer my question.)
>>

> You are, of course, either assuming that there's a
> single implementation of Python,

Of course not!

> or that all implementations have the same behaviour.

Of course not!

But it is reasonable, anyway, to ask for information
about a specific implementation (that one is *forced*
to use). Performance *will* depend on it, whether we
like it or not, whether that information is given by
the implementer or not.
And if the implementer wants to complain that
giving such information would break his wonderful
abstraction then I can only answer: It is *reality*
that *will* break your abstraction as regards
performance! It is, therefore, absolutely *no*
use to *pretend* you *can* avoid this form of "breaking
the abstraction" by simply avoiding to mention it
in the documentation...

> Or alternatively you are asking all implementers
> to do what you seem to consider so difficult
> (i.e. identify worst-case scenarios and then estimate order-of-magnitude
> behaviour for them).

I consider it the job of the implementer to know
choosing one particular implementation, and to
know what computational complexity therefore
attaches to the various operations exposed in
its interface. Why should *every* user of his
module be forced to either read the source
code of his implementation or try to figure
it out experiment-wise on his own?

> Test results with known test data are relatively
> easy to extrapolate from, and if your test data
> are reasonably representative of live data then so will your performance
> estimates.

How reasonable is it to ask me, or anyone else
for that matter, to extract, experiment-wise
(if it can be done at all with reasonable effort)
information that properly belongs to the implementer
and really should have been exposed in the
documentation in the first place?

> Anyway, aren't you more interested in average
> behaviour than worst-case? Most people are.

It depends. *I* am NOT the OP of this thread.
to do what he needed to do. There are many
measures of efficiency. Even "programmer efficiency"
may be one of them. But I assumed that, since
the OP took the time to ask his question in this
NG, that it really was about "computational
efficiency". Then he was offered solutions -
but they were offered just matter-of-factly.
*Surely* those who had posted those solutions
would be able to answer my question *why*
it was, that they *assumed* conversion into
sets to be more efficient than operating
on the lists in the first place. I was *not*
at all criticizing anyone (except, perhaps,
the Python documentation for its lack of
*insightful* information): I was just asking
for a *small* clarification that I *assumed*
(mistakenly it now appears) others could
*easily* provide.

Regards,
Christian
--
"Those who cannot remember the past are
condemned to repeat it."
- George Santayana

Scott David Daniels
Guest
Posts: n/a

 10-11-2005
Christian Stapfer wrote:
> "Steve Holden" <(E-Mail Removed)> wrote in message
> news:(E-Mail Removed)...
>
>>Christian Stapfer wrote:
>>
>>>"George Sakkis" <(E-Mail Removed)> wrote in message
>>>news:1128952417.0544cc73190bbbd4e683448c137969c 8@teranews...
>>>
>>>
>>>>"Christian Stapfer" <(E-Mail Removed)> wrote:
>>>>
>>>>><(E-Mail Removed)> wrote:
>>>>>>try to use set.

A reasonable suggestion. A set must have the trade-offs involved.
The abstraction itself brings to mind the issues, and the performance
can, at least in theory, be handled there. If that is true (that
the "set" abstraction "sees" the problem), then you can rely on the
Python implementation of "set" to either now, or eventually, have
a "good" implementation -- one not too far off efficient. The Python
gang is good at this stuff; a "better" set implementation will win if
it can show better performance without related down-sides.

As to the "either now, or eventually;" if you _must_ have performance
now, not in some abstract future, then it behooves you to _test_,
_test_, _test_!

>>> If the documentation stated the order-of-magnitude
>>>behavior of those basic operations up front, then
>>>I (and *anyone* else who ever wanted to use those
>>>operations on large lists / large sets) could do
>>>a quick order-of-magnitude estimation of how
>>>a certain program design will behave, performance
>>>wise.

And, if the proper documentation is in place, and it
says "dictionary lookup is O(N)" (and you avoid using
it for exactly that reason), how surprised will you be
to discover that the O(N) is only reached if the hash
values of the keys are all equal?

Oh, maybe you expect "O(N)" to really mean "\Theta(N)".
Then, if you are a dweeb like me, you will respond that
"This is not possible, a dictionary of size N must take at
least 'O(lg N)' to read the key, never mind processing it."
But, it turns out, that given a bound on the size of a
process, processing an address is "O(1)", not "O(lg N)".
Is that too practical for you, or not enough?

>>> *Experimenting* is not necessarily as easy to
>>>do as you seem to believe.

>>>How do you, for example, hit upon the worst-case

Are you saying the documentation should characterize the
cases that achieve worst-case behavior? This is a stiff
burden indeed, not one I am used to in even the most rigorous
classes I've seen. If there is such a characterization,
how burned will you feel if a case is overlooked and that
case is the one that you sold to your customer? Are you
willing to provide the same guaranteed performance and
documentation of performance to your customer that you
you expect of the Python system? Would you mind if the
quality is proportional to the price you paid?

>>You are, of course, either assuming that there's a
>>single implementation of Python,

> Of course not!
>
>>or that all implementations have the same behaviour.

> Of course not!

> But it is reasonable, anyway, to ask for information
> about a specific implementation (that one is *forced*
> to use).

You are not _forced_ to use any implementation of Python.
You are free to implement your own Python system.

> And if the implementer wants to complain that
> giving such information would break his wonderful
> abstraction then I can only answer: It is *reality*
> that *will* break your abstraction as regards
> performance! It is, therefore, absolutely *no*
> use to *pretend* you *can* avoid this form of "breaking
> the abstraction" by simply avoiding to mention it
> in the documentation...

I simply repeat: I have never seen a paper characterizing
the performance of an algorithm as "O(expr(N))" that described
in details _all_ cases that reached that limit. At most such
papers describe _one_ such cases. Nor have I ever seen papers
describing the performance of an algorithm as "\Theta(expr(N))"
that characterized the cases that broke the "\Theta" performance.
If you provide me the papers, provide me a C compiler with
equivalent docs on all C expressions, and provide me the funding
to update the Python docs, I will be happy to do so for a single
version of Python and a since version of CPython. I expect I
will have an easier time of it than the IronPython people will
have.

> I consider it the job of the implementer to know
> choosing one particular implementation, and to
> know what computational complexity therefore
> attaches to the various operations exposed in
> its interface.

Am I to take it you provide this to all of your customers?

> How reasonable is it to ask me, or anyone else
> for that matter, to extract, experiment-wise
> (if it can be done at all with reasonable effort)
> information that properly belongs to the implementer
> and really should have been exposed in the
> documentation in the first place?

Not at all reasonable. How reasonable is it to ask

--Scott David Daniels
http://www.velocityreviews.com/forums/(E-Mail Removed)