

 
jon
Guest
Posts: n/a

To take the heat out of the discussion: sets are blazingly fast. 






 
Scott David Daniels
Guest
Posts: n/a

Let me begin by apologizing to Christian as I was too snippy in
my reply, and sounded even snippier than I meant to.

Christian Stapfer wrote:
> "Scott David Daniels" <(EMail Removed)> wrote in message
> news:(EMail Removed)...
>> a "better" set implementation will win if
>> it can show better performance without
>> related downsides.
>
> Is there ever such a thing? I always
> thought that, most of the time, there
> is no such thing as a free lunch in

If you look at the history of Python's sort, it has steadily gotten
better. The list implementation has been tweaked to produce better
performance when appending and popping. There are a number of such cases.
In fact, as Python rolls along, code keeps getting improved. Usually the
requirement is that the change not degrade current benchmarks and provide
a substantial improvement in at least some practical cases.

>> As to the "either now, or eventually;" if you _must_ have performance
>> now, not in some abstract future, then it behooves you to _test_,
>> _test_, _test_!
>
> Well, I might want to test: but *first* I want
> to design, preferably in "armchair style" ... [using]
> rough complexity measures to ....
>
>>>>> If the documentation stated the order-of-magnitude
>>>>> behavior of those basic operations up front, then
>>>>> I (and *anyone* else who ever wanted to use those
>>>>> operations on large lists / large sets) could do
>>>>> a quick order-of-magnitude estimation of how
>>>>> a certain program design will behave, performance
>>>>> wise.

I think this is where I started over-reacting. There are a number of
requests here over time by people who state that things would be so much
better for everyone if only someone (never themselves) did some more work
that they might not otherwise want to do. The people who implement the
code often do so on their own time. I find the Python docs surprisingly
good for even commercial documentation. For work that is done gratis, it
is phenomenal. I hesitate to ask any more of it (although I have pointed
out bugs in docs as I've found them).

>> And, if the proper documentation is in place, and it
>> says "dictionary lookup is O(N)" (and you avoid using
>> it for exactly that reason), how surprised will you be
>> to discover that the O(N) is only reached if the hash
>> values of the keys are all equal?
>
> It's not that difficult to distinguish
> *average* case and *worst* case scenarios.
> It might even be a good idea to state,
> if that can easily be done, what the
> *best* case happens to be...
>
>> Oh, maybe you expect "O(N)" to really mean "\Theta(N)".
>> Then, if you are a dweeb like me, you will respond that
>> "This is not possible, a dictionary of size N must take at
>> least 'O(lg N)' to read the key, never mind processing it."
>> But, it turns out, that given a bound on the size of a
>> process, processing an address is "O(1)", not "O(lg N)".
>> Is that too practical for you, or not enough?
>
> I do not expect a developer to expend *inordinate*
> amounts of work to figure out the computational
> complexity of what he has implemented. But, as I
> wrote, he *must* have thought about the matter, and
> is thus, by definition, in a rather good position
> to provide what he can frequently simply pull from
> memory when it comes to documenting the interface
> of his module.

I talked about Big-O (worst case) or Big-Theta (average case) just to
point out that no simple characterization like "O(N)" tells enough of the
story to be practically useful.

Once you decide that isn't good enough, the burden of creating the
documentation is getting substantial, especially given that you've
already spent the effort to write the code and tests for it. In fact I'd
hesitate to raise the documentation bar any higher -- I'd hate to think
someone thought, "Oh, forget it, I'll just use this code myself."

>>>>> *Experimenting* is not necessarily as easy to
>>>>> do as you seem to believe.

No, "experimenting" is not always easy (though often it is easy enough).
However, "experimenting" puts the cost on the person who derives the
benefit, and is thus likely to not be done in a slipshod way.

> I am not at all a lawyer type, as you seem to
> imagine. I just want to suggest that some
> (to the implementer -- but not the average user)
> *easily* available information about computational
> complexity (especially for the most basic data types)
> would be a good thing to have. Nothing more.

My point is that simple performance characterization is not good enough
to be useful, and fully accurate characterization is an onerous
documentation burden. Something in between will be fraught with
complaints about "surprising" worst cases. Whether you would approach
middle-ground documentation with the spirit of "this is enough to go on"
or not, rest assured that a number of people will complain in a "language
lawyer-like" way about any perceived imperfections.

>> Would you mind if the quality is proportional to
>> the price you paid?

Here I was too snippy. Sorry.

>>>> You are, of course, either assuming that there's a
>>>> single implementation of Python,
>>>> or that all implementations have the same behaviour.

(Each line denied with "Of course not.") But realistically, most users
will learn the performance characteristics once, and continue to use the
tradeoffs that were characteristics of that version of Python going
forward. Now maybe you behave differently, but most programmers I know
(and I explicitly include myself) do have that tendency. Lots of Python
code moves around operating systems and implementations (and the number
of implementations is growing).

>>> But it is reasonable, anyway, to ask for information
>>> about a specific implementation (that one is *forced*
>>> to use).
>> You are not _forced_ to use any implementation of Python.
> What I wanted to say is ... this: *whenever* a Python
> program is run, it will run on some specific
> implementation of Python. That much should be obvious.

If I were still in a snippy mood, I'd point out that you might look at
your original statement and see how it might be read by someone who
accidentally read what you wrote, rather than what you wanted to write.

>> You are free to implement your own Python system.
> Don't be ridiculous.

If PyPy gets going this may not be as unlikely as you think. You will be
able to (relatively easily) replace implementations of parts of Python
and regenerate a system.

> It is you (and only you) here, who is arguing that if
> perfection is not attainable (with reasonable effort),
> nothing should be done at all.

I am trying to say the request you make is substantially larger than you
understand (or would appear at first blush). Further, it is a request
that asks for work from others for your (and others') benefit, not work
that you offer to pitch in and help on.

> I am as willing to live with imperfect module interface
> specification as I am willing with imperfect tools more
> generally. But that will not stop me from making
> *suggestions* for improvement (not arrogant demands, mind you).

Try writing something that would fit your standards for all of the basic
types, punting where you don't know details. If it is enough to be
useful, others will chip in and help fix it where it is broken.

> I have seen *many* books on algorithms that specify
> computational complexity measures for the algorithms
> described. Any really good library of algorithms will
> at least try to reach that standard, too.

How many "really good libraries of algorithms" do you know of? Could you
name a couple that have these characterizations? Practical code is rife
with issues of "finite address space," low-order effects swamping the
higher-order terms for most practical sizes, and such. It is this
collection of nasty little details that makes characterizing libraries
frustratingly hard.

>>> I consider it the job of the implementer to ...
>> Am I to take it you provide this to all of your customers?
> ... I did not want to pose as the customer who expects nothing
> but perfection from *others* (but not himself, of course).
> I am not at all like that: you are attacking
> a figment of your imagination.

Actually, I was not attacking at all. I was trying to suggest that it is
a harder task than you imagine. I am assuming you won't take me up on the
suggestion of starting a flawed document, but I'd be happy to be proved
wrong. If not, try writing a characterization of the performance of a
dozen or so of your own programs. Does the result seem useful in
proportion to the work it took you to write it?

>>> How reasonable is it to ask me, or anyone else
>>> for that matter, to extract, experiment-wise
>>> (if it can be done at all with reasonable effort)
>>> information that properly belongs to the implementer
>>> and really should have been exposed in the
>>> documentation in the first place?
>> Not at all reasonable. How reasonable is it to ask
>> me to provide you support information for free?

OK, here I was _very_ snippy. Sorry again. I will assert that often the
experiment results in a single bit of information: "it is fast enough."
Experimenting will get you reliable answers like that, and only when
runtime is an issue will you need to go into it much further.

--Scott David Daniels
http://www.velocityreviews.com/forums/(EMail Removed)
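[Aside: Scott's remark that dictionary lookup only degrades to O(N) "if
the hash values of the keys are all equal" is easy to demonstrate. The
sketch below is modern Python 3, not code from the thread; the class name
and the sizes are made up for the illustration. It forces every key into
the same hash bucket, so lookups slide from the usual near-constant time
toward linear time.]

    import time

    class AlwaysCollides:
        """Key wrapper whose instances all share one hash value."""
        def __init__(self, value):
            self.value = value
        def __hash__(self):
            return 42                 # every key lands in the same bucket
        def __eq__(self, other):
            return self.value == other.value

    def probe_seconds(n, key_type):
        """Build a dict with n keys of key_type and time n lookups."""
        keys = [key_type(i) for i in range(n)]
        d = dict.fromkeys(keys)
        start = time.perf_counter()
        for k in keys:
            _ = d[k]                  # lookup only; the value is irrelevant
        return time.perf_counter() - start

    for n in (250, 500, 1000):
        spread = probe_seconds(n, int)              # well-spread hashes
        clash = probe_seconds(n, AlwaysCollides)    # all hashes equal
        print("n=%4d  spread: %.6f s  colliding: %.6f s" % (n, spread, clash))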




Christian Stapfer
Guest
Posts: n/a

"jon" <(EMail Removed)> wrote in message
news:(EMail Removed)oups.com...
>
> To take the heat out of the discussion:
>
> sets are blazingly fast.

I'd prefer a (however) rough characterization
of computational complexity in terms of Big-Oh
(or Big-whatever) *anytime* to marketing-type
characterizations like this one...

Regards,
Christian




Steven D'Aprano
Guest
Posts: n/a

On Sat, 15 Oct 2005 06:31:53 +0200, Christian Stapfer wrote:
> "jon" <(EMail Removed)> wrote in message > news:(EMail Removed) oups.com... >> >> To take the heat out of the discussion: >> >> sets are blazingly fast. > > I'd prefer a (however) rough characterization > of computational complexity in terms of BigOh > (or Bigwhatever) *anytime* to marketingtype > characterizations like this one... Oh how naive. The marketing department says: "It's O(N), so it is blindingly fast." Translation: the amount of computation it does is linearly proportional to N. The constant of proportionality is 1e10. The marketing department says: "Our competitor's product is O(N**2), so it runs like a threelegged dog." Translation: the amount of computation it does is linearly proportional to N squared. The constant of proportionality is 1e100. You do the maths. Big O notation is practically useless for judging how fast a single algorithm will be, or how one algorithm compares to another. It is only useful for telling you how a single algorithm will scale as the input increases. It is very common for sensible programmers to fall back on a "less efficient" O(N**2) or even O(2**N) algorithm for small amounts of data, if that algorithm runs faster than the "more efficient" O(N) or O(log N) algorithm. In fact, that's exactly what the sort() method does in Python: for small enough lists, say, under 100 elements, it is quicker to run an O(N**2) algorithm (shell sort I believe) than it is to perform the complex set up for the mergesort variant used for larger lists. As for sets, they are based on dicts, which are effectively hash tables. Hash tables are O(1), unless there are collisions, in which case the more common algorithms degenerate to O(N). So, a very rough and ready estimate of the complexity of the algorithm using sets would be somewhere between O(1) and O(N) depending on the details of Python's algorithms. So, let's do an experiment, shall we? from sets import Set import time def compare_and_separate_with_sets(A, B): AB = Set(A+B) A = Set(A) B = Set(B) only_A = list(AB) only_B = list(BA) both_AB = list(AB  Set(only_A+only_B)) return (only_A, only_B, both_AB) def timeit(f, args, n): """Time function f when called with *args. For timing purposes, does n calls of f. Returns the average time used per call in seconds. """ loopit = range(n) t = time.time() for i in loopit: results = f(*args) t = time.time()  t return t/n def single_test(N): print ("N = %8d" % N), A = range(N) B = range(N/2, 3*N/2) return timeit(compare_and_separate_with_sets, (A, B), 20) def test_Order(): # test how compare_and_separate_with_sets scales with size for i in range(7): print single_test(10**i) Now run the test code: py> test_Order() N = 1 0.000219106674194 N = 10 0.000135183334351 N = 100 0.000481128692627 N = 1000 0.0173740386963 N = 10000 0.103679180145 N = 100000 0.655336141586 N = 1000000 8.12827801704 In my humble opinion, that's not bad behaviour. It looks O(log N) to me, and quite fast too: about 8 seconds to compare and separate two lists of one million items each. The craziest thing is, the amount of time it took to write and test two different algorithms was probably 1% of the time it would take to hunt up theoretical discussions of what the big O behaviour of the algorithms would be.  Steven. 




Christian Stapfer
Guest
Posts: n/a

"Steven D'Aprano" <(EMail Removed)> wrote in message
news:pan.2005.10.15.12.58.03.207379@REMOVETHIScyber.com.au...
> On Sat, 15 Oct 2005 06:31:53 +0200, Christian Stapfer wrote:
>
>> "jon" <(EMail Removed)> wrote in message
>> news:(EMail Removed)oups.com...
>>>
>>> To take the heat out of the discussion:
>>>
>>> sets are blazingly fast.
>>
>> I'd prefer a (however) rough characterization
>> of computational complexity in terms of Big-Oh
>> (or Big-whatever) *anytime* to marketing-type
>> characterizations like this one...
>
> Oh how naive.

Why is it that even computer science undergrads
are required to learn the basics of Big-Oh and
all that? Are computer scientists really morons,
as Xah Lee suggests? I can't believe it, but
maybe you have a point...

> The marketing department says: "It's O(N), so it is blindingly fast."

I might as well interpret "blindingly fast" as meaning O(1). Why not?
Surely marketing might also have reasoned like this: "It's O(1), so it's
blindingly fast". But I *want*, nay, I *must* know whether it is O(N) or
O(1). So forget about marketing-speak, it's a deadly poison for
developers. It might be ok to induce grandiose feelings in childish
users -- but developers are grown-ups: they must face reality...

> Translation: the amount of computation it does is linearly proportional
> to N. The constant of proportionality is 1e10.
>
> The marketing department says: "Our competitor's product is O(N**2), so it
> runs like a three-legged dog."
>
> Translation: the amount of computation it does is linearly proportional to
> N squared. The constant of proportionality is 1e-100.
>
> You do the maths.
>
> Big O notation is practically useless for judging how fast a single
> algorithm will be, or how one algorithm compares to another.

That's why Knuth liked it so much?
That's why Aho, Hopcroft and Ullman liked it so much?
That's why Gonnet and Baeza-Yates liked it so much?

> It is only useful for telling you how a single algorithm
> will scale as the input increases.

And that's really very useful information indeed. Since, given such
information for the basic data types and operations, as implemented by
the language and its standard libraries, I stand a real chance of being
able to determine the computational complexity of the
*particular*combination* of data types and algorithms of my own small
utility or of a critical piece of my wonderful and large application, on
which the future of my company depends, with some confidence and
accuracy.

> It is very common for sensible programmers to fall back on a "less
> efficient" O(N**2) or even O(2**N) algorithm for small amounts of data, if
> that algorithm runs faster than the "more efficient" O(N) or O(log N)
> algorithm. In fact, that's exactly what the sort() method does in Python:
> for small enough lists, say, under 100 elements, it is quicker to run an
> O(N**2) algorithm (shell sort I believe) than it is to perform the
> complex set up for the mergesort variant used for larger lists.
>
> As for sets, they are based on dicts, which are effectively hash tables.
> Hash tables are O(1), unless there are collisions,

Depending on the "load factor" of the hash tables. So we would want to
ask, if we have very large lists indeed, how much space needs to be
invested to keep the load factor so low that we can say that the
membership test is O(1). Do A-B and A&B have to walk the entire hash
table (which must be larger than the sets, because of a load factor < 1)?
Also: the conversion of lists to sets needs the insertion of N elements
into those hash tables. That alone already makes the overall algorithm
*at*least* O(N). So forget about O(log N).

> in which case the more
> common algorithms degenerate to O(N).

So here, indeed, we have the kind of reasoning that one ought to be able
to deliver, based on what's in the Python documentation. Luckily, you
have that kind of knowledge of both how sets are implemented and what
Big-Oh attaches to the hash table operation of "look up". In order to
*enable* SUCH reasoning for *everyone*, starting from the module
interface documentation only, one clearly needs something along the lines
that I was suggesting...

> So, a very rough and ready estimate
> of the complexity of the algorithm using sets would be somewhere between
> O(1) and O(N) depending on the details of Python's algorithms.
>
> So, let's do an experiment, shall we?
>
> from sets import Set
> import time
>
> def compare_and_separate_with_sets(A, B):
>     AB = Set(A+B)
>     A = Set(A)
>     B = Set(B)
>     only_A = list(A-B)
>     only_B = list(B-A)
>     both_AB = list(AB - Set(only_A+only_B))
>     return (only_A, only_B, both_AB)
>
> def timeit(f, args, n):
>     """Time function f when called with *args. For timing purposes,
>     does n calls of f. Returns the average time used per call in seconds.
>     """
>     loopit = range(n)
>     t = time.time()
>     for i in loopit:
>         results = f(*args)
>     t = time.time() - t
>     return t/n
>
> def single_test(N):
>     print ("N = %8d" % N),
>     A = range(N)
>     B = range(N/2, 3*N/2)
>     return timeit(compare_and_separate_with_sets, (A, B), 20)
>
> def test_Order():
>     # test how compare_and_separate_with_sets scales with size
>     for i in range(7):
>         print single_test(10**i)
>
> Now run the test code:
>
> py> test_Order()
> N =        1 0.000219106674194
> N =       10 0.000135183334351

Curious: N=10 takes less time than N=1?

> N =      100 0.000481128692627

Why do we have such a comparatively large jump here, from N=100 to
N=1000? Did hash tables overflow during conversion or something like
that?

> N =     1000 0.0173740386963
> N =    10000 0.103679180145
> N =   100000 0.655336141586
> N =  1000000 8.12827801704

Doesn't look quite O(n). Not yet...

> In my humble opinion, that's not bad behaviour.
> It looks O(log N) to me,

How could that be? *Every* element of A and B must be touched, if only to
be copied: that can't make it O(log(N)). Also, the conversion of lists to
sets must be at least O(N). And N isn't the right measure anyway. It
would probably have to be in terms of A and B. For example, if A is very
small, as compared to B, then A-B and A & B can be determined rather
quickly by only considering elements of A.

> and quite fast too: about 8 seconds to compare and separate two lists of
> one million items each.
>
> The craziest thing is, the amount of time it took to write and test two
> different algorithms was probably 1% of the time it would take to hunt up
> theoretical discussions of what the big O behaviour of the algorithms
> would be.

You must distinguish questions of principle and questions of muddling
through like this testing bit you've done. It would take me some time to
even be *sure* how to interpret the result. I would never want to say "it
looks O(log N) to me", as you do, and leave it at that. Rather, I might
say, as you do, "it looks O(log N) to me", *but* then try to figure out,
given my knowledge of the implementation (performance-wise, based on
information that is sadly missing in the Python documentation), *why*
that might be. Then, if my experiment says "it looks like O(log N)" AND
if my basic knowledge of the implementation of set and list primitives
says "it should be O(log N)" as well, I would venture, with some
*confidence*, to claim: "it actually IS O(log N)"....

You do not compare the convert-it-to-sets approach to the single
list-walk either. Supposing the OP had actually sorted lists to begin
with, then a single, simultaneous walk of the lists would be about as
fast as it can get. Very, very likely *faster* than conversion to sets
would be...

Regards,
Christian
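[Aside: the "single, simultaneous walk" Christian refers to is the merge
step familiar from mergesort. Assuming both inputs are already sorted and
free of duplicates, a minimal sketch (Python 3, not code from the thread;
the function name is made up) looks like this; it touches each element
once, so it runs in O(len(A) + len(B)).]

    def compare_sorted(A, B):
        """Separate two sorted, duplicate-free lists into
        (only in A, only in B, in both) with one simultaneous walk."""
        only_A, only_B, both = [], [], []
        i = j = 0
        while i < len(A) and j < len(B):
            if A[i] == B[j]:
                both.append(A[i])
                i += 1
                j += 1
            elif A[i] < B[j]:
                only_A.append(A[i])
                i += 1
            else:
                only_B.append(B[j])
                j += 1
        only_A.extend(A[i:])    # leftovers occur in one list only
        only_B.extend(B[j:])
        return only_A, only_B, both

    # compare_sorted([1, 2, 4, 7], [2, 3, 7, 9]) -> ([1, 4], [3, 9], [2, 7])

Whether this beats the set-based version in practice depends partly on
the point Steven raises later in the thread: the set machinery runs in C,
while this loop runs in the Python interpreter.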




Steven D'Aprano
Guest
Posts: n/a

On Sat, 15 Oct 2005 18:17:36 +0200, Christian Stapfer wrote:
>>> I'd prefer a (however) rough characterization
>>> of computational complexity in terms of Big-Oh
>>> (or Big-whatever) *anytime* to marketing-type
>>> characterizations like this one...
>>
>> Oh how naive.
>
> Why is it that even computer science undergrads
> are required to learn the basics of Big-Oh and
> all that?

So that they know how to correctly interpret what Big O notation means,
instead of misinterpreting it. Big O notation doesn't tell you everything
you need to know to predict the behaviour of an algorithm. It doesn't
even tell you most of what you need to know about its behaviour. Only
actual *measurement* will tell you what you need to know.

Perhaps you should actually sit down and look at the assumptions,
simplifications, shortcuts and tradeoffs that computer scientists make
when they estimate an algorithm's Big O behaviour. It might shock you out
of your faith that Big O is the be all and end all of algorithm planning.

For all but the simplest algorithm, it is impractical to actually count
all the operations -- and even if you did, the knowledge wouldn't help
you, because you don't know how long each operation takes to get
executed. That is platform specific.

So you simplify. You pretend that paging never happens. That memory
allocations take zero time. That set up and removal of data structures
take insignificant time. That if there is an N**2 term, it always swamps
an N term. You assume that code performance is independent of the CPUs.
You assume that some operations (e.g. comparisons) take no time, and
others (e.g. moving data) are expensive.

Those assumptions sometimes are wildly wrong. I've been seriously bitten
following text book algorithms written for C and Pascal: they assume that
comparisons are cheap and swapping elements is expensive. But in Python,
swapping elements is cheap and comparisons are expensive, because of all
the heavy object-oriented machinery used. Your classic text book
algorithm is not guaranteed to survive contact with the real world: you
have to try it and see.

Given all the assumptions, it is a wonder that Big O estimates are ever
useful, not that they sometimes are misleading.

[snip]

>> The marketing department says: "It's O(N), so it is blindingly fast."
>
> I might as well interpret "blindingly fast" as meaning O(1). Why not?
> Surely marketing might also have reasoned like
> this: "It's O(1), so it's blindingly fast". But I *want*, nay, I *must*
> know whether it is O(N) or O(1).

You might _want_, but you don't _need_ to know which it is, not in every
case. In general, there are many circumstances where it makes no sense to
worry about Big O behaviour. What's your expected data look like? If your
data never gets above N=2, then who cares whether it is O(1)=1, O(N)=2,
O(N**2)=4 or O(2**N)=4? They are all about as fast.

Even bubble sort will sort a three element list fast enough -- and
probably faster than more complicated sorts. Why spend all the time
setting up the machinery for a merge sort for three elements?

[snip]

>> Big O notation is practically useless for judging how fast a single
>> algorithm will be, or how one algorithm compares to another.
>
> That's why Knuth liked it so much?
> That's why Aho, Hopcroft and Ullman liked it so much? That's why Gonnet
> and Baeza-Yates liked it so much?

Two reasons: it is useful for telling you how a single algorithm will
scale as the input increases, just as I said.

And, unlike more accurate ways of calculating the speed of an algorithm
from first principles, it is actually possible to do Big O calculations.

No doubt the state of the art of algorithm measurements has advanced
since I was an undergraduate, but certain fundamental facts remain: in
order to calculate that Big O, you have to close your eyes to all the
realities of practical code execution, and only consider an idealised
calculation. Even when your calculation gives you constants of
proportionality and other coefficients, Big O notation demands you throw
that information away.

But by doing so, you lose valuable information. An O(N**2) algorithm that
scales like 1e-6 * N**2 will be faster than an O(N) algorithm that scales
as 1e6 * N, until N reaches one million million. By tossing away those
coefficients, you wrongly expect the first algorithm to be slower than
the second, and choose the wrong algorithm.

>> It is only useful for telling you how a single algorithm will scale as
>> the input increases.
>
> And that's really very useful information indeed.

Yes it is. Did I ever say it wasn't?

> Since, given such
> information for the basic data types and operations, as implemented by
> the language and its standard libraries, I stand a real chance of being
> able to determine the computational complexity of the
> *particular*combination* of data types and algorithms of my own small
> utility or of a critical piece of my wonderful and large application, on
> which the future of my company depends, with some confidence and
> accuracy.

Yes, zero is a real chance.

[snip]

>>> As for sets, they are based on dicts, which are effectively hash
>>> tables. Hash tables are O(1), unless there are collisions,
>>
>> Depending on the "load factor" of the hash tables. So we would want to
>> ask, if we have very large lists indeed, how much space needs to be
>> invested to keep the load factor so low that we can say that the
>> membership test is O(1).

And knowing that hash tables are O(1) will not tell you that, will it?

There is only one practical way of telling: do the experiment. Keep
loading up that hash table until you start getting lots of collisions.

>> Do A-B and A&B have to walk the entire hash
>> table (which must be larger than the sets, because of a load factor <
>> 1)? Also: the conversion of lists to sets needs the insertion of N
>> elements into those hash tables. That alone already makes the overall
>> algorithm *at*least* O(N). So forget about O(log N).

Yes, inserting N items into a hash table takes at least N inserts. But if
those inserts are fast enough, you won't even notice the time it takes to
do it, compared to the rest of your algorithm. In many algorithms, you
don't even care about the time it takes to put items in your hash table,
because that isn't part of the problem you are trying to solve.

So in a real, practical sense, it may be that your algorithm gets
dominated by the O(log N) term even though there is technically an O(N)
term in there. Are Python dicts like that? I have no idea. But I know how
to find out: choose a problem domain I care about ("dicts with less than
one million items") and do the experiment.

>> in which case the more
>> common algorithms degenerate to O(N).
>
> So here, indeed, we have the kind of reasoning that one ought to be able
> to deliver, based on what's in the Python documentation. Luckily, you
> have that kind of knowledge of both how sets are implemented and what
> Big-Oh attaches to the hash table operation of "look up".
> In order to *enable* SUCH reasoning for *everyone*,
> starting from the module interface documentation only, one clearly needs
> something along the lines that I was suggesting...

I don't object to having that Big O information available, except insofar
as it can be misleading, but I take issue with your position that such
information is necessary.

[snip]

>> Now run the test code:
>>
>> py> test_Order()
>> N =        1 0.000219106674194
>> N =       10 0.000135183334351
>
> Curious: N=10 takes less time than N=1?

Yes, funny how real-world results aren't as clean and neat as they are in
theory. There are those awkward assumptions coming to bite you again.
I've done two additional tests, and get:

N = 1 0.000085043907166
N = 10 0.000106656551361

N = 1 0.000497949123383
N = 10 0.000124049186707

Remember, these results are averaged over twenty trials. So why is it
quicker to do work with sets of size 10 than sets of size 1? Big O
notation will never tell you, because it ignores the implementation
details that really make a difference.

>> N =      100 0.000481128692627
>
> Why do we have such a comparatively large jump here, from N=100 to
> N=1000? Did hash tables overflow during conversion or something like
> that?

Who knows? Maybe Python was doing some garbage collection the first time
I ran it. I've modified my code to print a scale factor, and here is
another run:

N = 1 0.00113509893417
N = 10 0.000106143951416 (x 0.093511)
N = 100 0.00265134572983 (x 24.978774)
N = 1000 0.0057701587677 (x 2.176313)
N = 10000 0.0551437973976 (x 9.556721)
N = 100000 0.668345856667 (x 12.120055)
N = 1000000 8.6285964489 (x 12.910376)

An increase from N=1 to 1000000 (that's a factor of one million) leads to
an increase in execution time of about 7600.

You will notice that the individual numbers vary significantly from trial
to trial, but the overall pattern is surprisingly consistent.

>> N =     1000 0.0173740386963
>> N =    10000 0.103679180145
>> N =   100000 0.655336141586
>> N =  1000000 8.12827801704
>
> Doesn't look quite O(n). Not yet...

No it doesn't.

>> In my humble opinion, that's not bad behaviour. It looks O(log N) to
>> me,

That's a mistake -- it is nowhere near O(log N). My bad. Closer to
O(sqrt N), if anything.

> How could that be? *Every* element of A and B must be touched, if only
> to be copied: that can't make it O(log(N)).

And, no doubt, if you had *really enormous* lists, oh, I don't know,
maybe a trillion items, you would see that O(N) behaviour. But until
then, the overall performance is dominated by the smaller-order terms
with larger coefficients.

> Also, the conversion of lists
> to sets must be at least O(N). And N isn't the right measure anyway. It
> would probably have to be in terms of A and B. For example, if A
> is very small, as compared to B, then A-B and A & B can be determined
> rather quickly by only considering elements of A.

Both lists have the same number of elements, so double N.

[snip]

> You must distinguish questions of principle and questions of muddling
> through like this testing bit you've done.

Your "question of principle" gives you completely misleading answers.
Remember, you were the one who predicted that lists would have to be
faster than sets. Your prediction failed miserably.

> It would take me some time to
> even be *sure* how to interpret the result.

What's to interpret? I know exactly how fast the function will run, on
average, on my hardware. I can even extrapolate to larger sizes of N,
although I would be very careful to not extrapolate too far. (I predict
less than 10 minutes to work on a pair of 10,000,000 element lists, and
less than two hours to work on 100,000,000 element lists.)

> I would never want to say
> "it looks O(log N) to me", as you do, and leave it at that. Rather, I
> might say, as you do, "it looks O(log N) to me", *but* then try to
> figure out, given my knowledge of the implementation (performance-wise,
> based on information that is sadly missing in the Python documentation),
> *why* that might be.

Fine. You have the source code, knock yourself out.

> Then, if my experiment says "it looks like O(log
> N)" AND if my basic knowledge of the implementation of set and list
> primitives says "it should be O(log N)" as well, I would venture, with
> some *confidence*, to claim: "it actually IS O(log N)"....
>
> You do not compare the convert-it-to-sets approach
> to the single list-walk either.

No I did not, because I didn't have a function to do it. You've got my
source code. Knock yourself out to use it to test any function you like.

> Supposing the OP had actually sorted
> lists to begin with, then a single, simultaneous walk of the lists would
> be about as fast as it can get. Very, very likely *faster* than
> conversion to sets would be...

Please let us know how you go with that. It should be really interesting
to see how well your prediction copes with the real world.

(Hint: another of those awkward little implementation details... how much
work is being done in C code, and how much in pure Python? Just something
for you to think about. And remember, an O(N) algorithm in Python will be
faster than an O(N**2) algorithm in C... or is that slower?)

--
Steven.
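[Aside: one way to make the "looks like O(log N)" vs "closer to
O(sqrt N)" dispute concrete, at least over the range actually measured,
is to fit a line to the timings on a log-log scale and read off the
slope, which acts as an empirical exponent. A minimal Python 3 sketch
using the scale-factor figures Steven posted above (the N=1 point is left
out as an obvious outlier); the fit is mine, not part of the thread.]

    import math

    # (N, average seconds per call) from the scale-factor run quoted above
    data = [(10, 0.000106143951416), (100, 0.00265134572983),
            (1000, 0.0057701587677), (10000, 0.0551437973976),
            (100000, 0.668345856667), (1000000, 8.6285964489)]

    xs = [math.log10(n) for n, _ in data]
    ys = [math.log10(t) for _, t in data]
    k = len(data)
    mx, my = sum(xs) / k, sum(ys) / k
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    # A slope near 1.0 means roughly linear growth; near 0.5 means ~O(sqrt N).
    print("empirical exponent over this range: %.2f" % slope)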




Christian Stapfer
Guest
Posts: n/a

"Steven D'Aprano" <(EMail Removed)> wrote in message
news:pan.2005.10.15.19.22.50.864680@REMOVETHIScyber.com.au...
> On Sat, 15 Oct 2005 18:17:36 +0200, Christian Stapfer wrote:
>
>>>> I'd prefer a (however) rough characterization
>>>> of computational complexity in terms of Big-Oh
>>>> (or Big-whatever) *anytime* to marketing-type
>>>> characterizations like this one...
>>>
>>> Oh how naive.
>>
>> Why is it that even computer science undergrads
>> are required to learn the basics of Big-Oh and
>> all that?
>
> So that they know how to correctly interpret what Big O notation means,
> instead of misinterpreting it. Big O notation doesn't tell you everything
> you need to know to predict the behaviour of an algorithm.

Well, that's right. I couldn't agree more: it doesn't tell you
*everything* but it does tell you *something*. And that *something* is
good to have.

> It doesn't even tell you most of what you need to know about its
> behaviour.
> Only actual *measurement* will tell you what you need to know.

Well, that's where you err: Testing doesn't tell you everything *either*.
You need *both*: a reasonable theory *and* experimental data... If theory
and experimental data disagree, we would want to take a closer look at
both, the theory (which may be mistaken or inadequate) *and* the
experiment (which may be inadequate or just plain botched as well).

> Perhaps you should actually sit down and look at the assumptions,
> simplifications, shortcuts and tradeoffs that computer scientists make
> when they estimate an algorithm's Big O behaviour. It might shock you out
> of your faith that Big O is the be all and end all of algorithm planning.
>
> For all but the simplest algorithm, it is impractical to actually count
> all the operations -- and even if you did, the knowledge wouldn't help
> you, because you don't know how long each operation takes to get executed.
> That is platform specific.
>
> So you simplify. You pretend that paging never happens. That memory
> allocations take zero time. That set up and removal of data structures
> take insignificant time. That if there is an N**2 term, it always swamps
> an N term. You assume that code performance is independent of the CPUs.
> You assume that some operations (e.g. comparisons) take no time, and
> others (e.g. moving data) are expensive.
>
> Those assumptions sometimes are wildly wrong. I've been seriously bitten
> following text book algorithms written for C and Pascal: they assume that
> comparisons are cheap and swapping elements is expensive. But in Python,
> swapping elements is cheap and comparisons are expensive, because of all
> the heavy object-oriented machinery used. Your classic text book algorithm
> is not guaranteed to survive contact with the real world: you have to try
> it and see.

Still: the expensiveness of those operations (such as swapping elements
vs. comparisons) will only affect the constant of proportionality, not
the asymptotic behavior of the algorithm. Sooner or later the part of
your program that has the worst asymptotic behavior will determine the
speed (or memory requirements) of your program.

> Given all the assumptions, it is a wonder that Big O
> estimates are ever useful, not that they sometimes
> are misleading.
>
> [snip]
>>> The marketing department says: "It's O(N), so it is blindingly fast."
>>
>> I might as well interpret "blindingly fast" as meaning O(1). Why not?
>> Surely marketing might also have reasoned like
>> this: "It's O(1), so it's blindingly fast". But I *want*, nay, I *must*
>> know whether it is O(N) or O(1).
>
> You might _want_, but you don't _need_ to know which it is, not in every
> case. In general, there are many circumstances where it makes no
> sense to worry about Big O behaviour. What's your expected data look like?
> If your data never gets above N=2, then who cares whether it is O(1)=1,
> O(N)=2, O(N**2)=4 or O(2**N)=4? They are all about as fast.
>
> Even bubble sort will sort a three element list fast enough -- and
> probably faster than more complicated sorts. Why spend all the time
> setting up the machinery for a merge sort for three elements?

Since you have relapsed into a fit of mere polemics, I assume to have
made my point as regards marketing-type characterizations of algorithms
("blazingly fast") vs. measures, however rough, of asymptotic complexity
measures, like Big-Oh. Which really was the root of this sub-thread that
went like this:

...>> To take the heat out of the discussion:
...>> sets are blazingly fast.
...> I'd prefer a (however) rough characterization
...> of computational complexity in terms of Big-Oh
...> (or Big-whatever) *anytime* to marketing-type
...> characterizations like this one...

> [snip]
>
>>> Big O notation is practically useless for judging how fast a single
                     ^^^^^^^^^^^^^^^^^^^^
>>> algorithm will be, or how one algorithm compares to another.
>>
>> That's why Knuth liked it so much?
>> That's why Aho, Hopcroft and Ullman liked it so much? That's why Gonnet
>> and Baeza-Yates liked it so much?
>
> Two reasons: it is useful for telling you how a single algorithm will
> scale as the input increases, just as I said.

Curiously, just a few lines before writing this, you have polemically
denied any "practical" use for Big-Oh notation.

> And, unlike more accurate ways of calculating the speed of an algorithm
> from first principles, it is actually possible to do Big O calculations.

Right. It's a compromise: being somewhat precise -- without getting
bogged down trying to solve major combinatorial research problems...

> No doubt the state of the art of algorithm measurements has advanced since
> I was an undergraduate, but certain fundamental facts remain: in order to
> calculate that Big O, you have to close your eyes to all the realities
> of practical code execution, and only consider an idealised calculation.

That's right. Nothing stops you from then opening your eyes and testing
some code, of course. *But* always try to relate what you see there with
what theoretical grasp of the situation you have. If experimental data
and theory *disagree*: try to fix the experiment and/or the theory.

> Even when your calculation gives you constants of proportionality and
> other coefficients, Big O notation demands you throw that information
> away.
>
> But by doing so, you lose valuable information. An O(N**2) algorithm that
> scales like 1e-6 * N**2 will be faster than an O(N) algorithm that scales
> as 1e6 * N, until N reaches one million million. By tossing away those
> coefficients, you wrongly expect the first algorithm to be slower than the
> second, and choose the wrong algorithm.
>
>>> It is only useful for telling you how a single algorithm will scale as
>>> the input increases.
>>
>> And that's really very useful information indeed.
>
> Yes it is. Did I ever say it wasn't?

Well yes, by the way you attacked Big-Oh notation as "practically
useless" (see above) I assumed you did.

>> Since, given such
>> information for the basic data types and operations, as implemented by
>> the language and its standard libraries, I stand a real chance of being
>> able to determine the computational complexity of the
>> *particular*combination* of data types and algorithms of my own small
>> utility or of a critical piece of my wonderful and large application, on
>> which the future of my company depends, with some confidence and
>> accuracy.
>
> Yes, zero is a real chance.
>
> [snip]
>
>>> As for sets, they are based on dicts, which are effectively hash
>>> tables. Hash tables are O(1), unless there are collisions,
>>
>> Depending on the "load factor" of the hash tables. So we would want to
>> ask, if we have very large lists indeed, how much space needs to be
>> invested to keep the load factor so low that we can say that the
>> membership test is O(1).
>
> And knowing that hash tables are O(1) will not tell you that, will it?
>
> There is only one practical way of telling: do the experiment. Keep
> loading up that hash table until you start getting lots of collisions.
>
>> Do A-B and A&B have to walk the entire hash
>> table (which must be larger than the sets, because of a load factor <
>> 1)? Also: the conversion of lists to sets needs the insertion of N
>> elements into those hash tables. That alone already makes the overall
>> algorithm *at*least* O(N). So forget about O(log N).
>
> Yes, inserting N items into a hash table takes at least N inserts. But if
> those inserts are fast enough, you won't even notice the time it takes to
> do it, compared to the rest of your algorithm. In many algorithms, you
> don't even care about the time it takes to put items in your hash table,
> because that isn't part of the problem you are trying to solve.
>
> So in a real, practical sense, it may be that your algorithm gets dominated
> by the O(log N) term even though there is technically an O(N) term in
> there. Are Python dicts like that? I have no idea. But I know how to find
> out: choose a problem domain I care about ("dicts with less than one
> million items") and do the experiment.
>
>>> in which case the more
>>> common algorithms degenerate to O(N).
>>
>> So here, indeed, we have the kind of reasoning that one ought to be able
>> to deliver, based on what's in the Python documentation. Luckily, you
>> have that kind of knowledge of both how sets are implemented and what
>> Big-Oh attaches to the hash table operation of "look up".
>> In order to *enable* SUCH reasoning for *everyone*,
>> starting from the module interface documentation only, one clearly needs
>> something along the lines that I was suggesting...
>
> I don't object to having that Big O information available, except
> insofar as it can be misleading, but I take issue with your position that
> such information is necessary.

*Blindly* testing, that is, testing *without* being able to *relate* the
outcomes of those tests (even the *design* of those tests) to some
suitably simplified but not at all completely nonsensical theory (via
Big-Oh notation, for example), is *not* really good enough.

>>> Now run the test code:
>>>
>>> py> test_Order()
>>> N =        1 0.000219106674194
>>> N =       10 0.000135183334351
>>
>> Curious: N=10 takes less time than N=1?
>
> Yes, funny how real-world results aren't as clean and neat as they are in
> theory. There are those awkward assumptions coming to bite you again.
> I've done two additional tests, and get:
>
> N = 1 0.000085043907166
> N = 10 0.000106656551361
>
> N = 1 0.000497949123383
> N = 10 0.000124049186707
>
> Remember, these results are averaged over twenty trials. So why is it
> quicker to do work with sets of size 10 than sets of size 1? Big O
> notation will never tell you, because it ignores the implementation
> details that really make a difference.
>
>>> N =      100 0.000481128692627
>>
>> Why do we have such a comparatively large jump here, from N=100 to
>> N=1000? Did hash tables overflow during conversion or something like
>> that?
>
> Who knows? Maybe Python was doing some garbage collection the first time I
> ran it. I've modified my code to print a scale factor, and here is another
> run:
>
> N = 1 0.00113509893417
> N = 10 0.000106143951416 (x 0.093511)
> N = 100 0.00265134572983 (x 24.978774)
> N = 1000 0.0057701587677 (x 2.176313)
> N = 10000 0.0551437973976 (x 9.556721)
> N = 100000 0.668345856667 (x 12.120055)
> N = 1000000 8.6285964489 (x 12.910376)
>
> An increase from N=1 to 1000000 (that's a factor of one million) leads to
> an increase in execution time of about 7600.
>
> You will notice that the individual numbers vary significantly from trial
> to trial, but the overall pattern is surprisingly consistent.
>
>>> N =     1000 0.0173740386963
>>> N =    10000 0.103679180145
>>> N =   100000 0.655336141586
>>> N =  1000000 8.12827801704
>>
>> Doesn't look quite O(n). Not yet...
>
> No it doesn't.
>
>>> In my humble opinion, that's not bad behaviour. It looks O(log N) to
>>> me,
>
> That's a mistake -- it is nowhere near O(log N). My bad. Closer to
> O(sqrt N), if anything.
>
>> How could that be? *Every* element of A and B must be touched, if only to
>> be copied: that can't make it O(log(N)).
>
> And, no doubt, if you had *really enormous* lists, oh, I don't know, maybe
> a trillion items, you would see that O(N) behaviour. But until then, the
> overall performance is dominated by the smaller-order terms with larger
> coefficients.
>
>> Also, the conversion of lists
>> to sets must be at least O(N). And N isn't the right measure anyway. It
>> would probably have to be in terms of A and B. For example, if A
>> is very small, as compared to B, then A-B and A & B can be determined
>> rather quickly by only considering elements of A.
>
> Both lists have the same number of elements, so double N.
>
> [snip]
>
>> You must distinguish questions of principle and questions of muddling
>> through like this testing bit you've done.
>
> Your "question of principle" gives you completely misleading answers.
> Remember, you were the one who predicted that lists would have to be
> faster than sets.

I didn't say they would *have* to be faster -- I was mainly asking for
some *reasoned* argument why (and in what sense) conversion to sets would
be an "efficient" solution of the OP's problem.

> Your prediction failed miserably.

Interestingly, you admit that you did not really compare the two
approaches that were under discussion. So your experiment does *not*
(yet) prove what you claim it proves.

>> It would take me some time to
>> even be *sure* how to interpret the result.
>
> What's to interpret? I know exactly how fast the function will run, on
> average, on my hardware. I can even extrapolate to larger sizes of N,
> although I would be very careful to not extrapolate too far. (I predict
> less than 10 minutes to work on a pair of 10,000,000 element lists, and
> less than two hours to work on 100,000,000 element lists.)

For a starter: You have chosen a very particular type of element of those
lists / sets: integers. So the complexity of comparisons for the OP's
application might get *seriously* underestimated.

>> I would never want to say
>> "it looks O(log N) to me", as you do, and leave it at that. Rather, I
>> might say, as you do, "it looks O(log N) to me", *but* then try to
>> figure out, given my knowledge of the implementation (performance-wise,
>> based on information that is sadly missing in the Python documentation),
>> *why* that might be.
>
> Fine. You have the source code, knock yourself out.

That's just what I do *not* think to be a particularly reasonable
approach. Instead, I propose stating some (to the implementer *easily*
available) information about asymptotic behavior of operations that are
exposed by the module interface upfront.

>> Then, if my experiment says "it looks like O(log
>> N)" AND if my basic knowledge of the implementation of set and list
>> primitives says "it should be O(log N)" as well, I would venture, with
>> some *confidence*, to claim: "it actually IS O(log N)"....
>>
>> You do not compare the convert-it-to-sets approach
>> to the single list-walk either.
>
> No I did not, because I didn't have a function to do it.

Here we see one of the problems of a purely experimentalist approach to
computational complexity: you need an implementation (of the algorithm
and the test harness) *before* you can get your wonderfully decisive
experimental data. This is why we would like to have a way of (roughly)
estimating the reasonableness of the outlines of a program's design in
"armchair fashion" -- i.e. without having to write any code and/or test
harness.

> You've got my
> source code. Knock yourself out to use it to test any function you like.
>
>> Supposing the OP had actually sorted
>> lists to begin with, then a single, simultaneous walk of the lists would
>> be about as fast as it can get. Very, very likely *faster* than
>> conversion to sets would be...
>
> Please let us know how you go with that. It should be really interesting
> to see how well your prediction copes with the real world.
>
> (Hint: another of those awkward little implementation details... how much
> work is being done in C code, and how much in pure Python? Just something
> for you to think about. And remember, an O(N) algorithm in Python will be
> faster than an O(N**2) algorithm in C... or is that slower?)

This discussion begins to sound like the recurring arguments one hears
between theoretical and experimental physicists. Experimentalists tend to
overrate the importance of experimental data (setting up a useful
experiment, how to interpret the experimental data one then gathers, and
whether one stands any chance of detecting systematic errors of
measurement, all depend on having a good *theory* in the first place).
Theoreticians, on the other hand, tend to overrate the importance of the
coherence of theories. In truth, *both* are needed: good theories *and*
carefully collected experimental data.

Regards,
Christian

»When asked how he would have reacted if Eddington's *measurements* had
come out differently, Einstein replied: "Then I would have been sorry for
him -- the *theory* is correct."«
 -- Paul B. Armstrong: 'Conflicting Readings'
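[Aside: Christian's point about integer elements is easy to probe.
Swapping the element type does not change the growth rate of the
set-based approach, but it can shift the constant factor noticeably,
because hashing and equality checks cost more for bigger keys. A small
Python 3 sketch of such a probe (mine, not from the thread); tuples are
used as the "expensive" keys because, unlike strings, CPython does not
cache their hash values.]

    import time

    def difference_seconds(A, B, repeats=10):
        """Average seconds to build two sets and take their difference."""
        start = time.perf_counter()
        for _ in range(repeats):
            set(A) - set(B)
        return (time.perf_counter() - start) / repeats

    N = 10_000
    int_A = list(range(N))
    int_B = list(range(N // 2, 3 * N // 2))
    # Same shape of data, but each element is a 20-item tuple instead of an int.
    tup_A = [tuple(range(i, i + 20)) for i in int_A]
    tup_B = [tuple(range(i, i + 20)) for i in int_B]

    print("int elements:   %.5f s per run" % difference_seconds(int_A, int_B))
    print("tuple elements: %.5f s per run" % difference_seconds(tup_A, tup_B))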




Ron Adam
Guest
Posts: n/a

Christian Stapfer wrote:
> This discussion begins to sound like the recurring
> arguments one hears between theoretical and
> experimental physicists. Experimentalists tend
> to overrate the importance of experimental data
> (setting up a useful experiment, how to interpret
> the experimental data one then gathers, and whether
> one stands any chance of detecting systematic errors
> of measurement, all depend on having a good *theory*
> in the first place). Theoreticians, on the other hand,
> tend to overrate the importance of the coherence of
> theories. In truth, *both* are needed: good theories
> *and* carefully collected experimental data.
>
> Regards,
> Christian

An interesting parallel can be made concerning management of production
vs management of creativity.

In general, production needs checks and feedback to ensure quality, but
will often come to a standstill if incomplete resources are available.

Whereas creativity needs checks to ensure production, but in many cases
can still be productive even with incomplete or questionable resources.
The quality may vary quite a bit in both directions, but in creative
tasks, that is to be expected.

In many ways programmers are a mixture of these two. I think I and Steven
use a style that is closer to the creative approach. I get the feeling
your background may be closer to the production style. Both are good and
needed for different types of tasks. And I think most programmers can
switch styles to some degree if they need to.

Cheers,
Ron




Christian Stapfer
Guest
Posts: n/a

"Ron Adam" <(EMail Removed)> wrote in message
news:cTp4f.16180$(EMail Removed)...
> Christian Stapfer wrote:
>
>> This discussion begins to sound like the recurring
>> arguments one hears between theoretical and
>> experimental physicists. Experimentalists tend
>> to overrate the importance of experimental data
>> (setting up a useful experiment, how to interpret
>> the experimental data one then gathers, and whether
>> one stands any chance of detecting systematic errors
>> of measurement, all depend on having a good *theory*
>> in the first place). Theoreticians, on the other hand,
>> tend to overrate the importance of the coherence of
>> theories. In truth, *both* are needed: good theories
>> *and* carefully collected experimental data.
>>
>> Regards,
>> Christian
>
> An interesting parallel can be made concerning management of production
> vs management of creativity.
>
> In general, production needs checks and feedback to ensure quality, but
> will often come to a standstill if incomplete resources are available.
>
> Whereas creativity needs checks to ensure production, but in many cases
> can still be productive even with incomplete or questionable resources.
> The quality may vary quite a bit in both directions, but in creative
> tasks, that is to be expected.
>
> In many ways programmers are a mixture of these two. I think I and
> Steven use a style that is closer to the creative approach. I get the
> feeling your background may be closer to the production style.

This diagnosis reminds me of C.G. Jung, the psychologist, who, after
having introduced the concepts of extra- and introversion, came to the
conclusion that Freud was an extravert whereas Adler was an introvert.
The point is that he got it exactly wrong...

As to the value of complexity theory for creativity in programming (even
though you seem to believe that a theoretical bent of mind can only serve
to stifle creativity), the story of the discovery of an efficient string
searching algorithm by D. E. Knuth provides an interesting case in point.
Knuth based himself on seemingly quite "uncreatively theoretical work"
(from *your* point of view) that gave a *better* value for the
computational complexity of string searching than any of the then known
algorithms could provide.

Regards,
Christian

»It is no paradox to say that in our most theoretical moods we may be
nearest to our most practical applications.«
 -- Alfred North Whitehead

[and those "practical applications" will likely be the most "creative"
ones...]






 

