Velocity Reviews - Computer Hardware Reviews

Performance of int/long in Python 3

 
 
Steven D'Aprano
      03-28-2013
On Thu, 28 Mar 2013 23:11:55 +1100, Neil Hodgson wrote:

> Ian Foote:
>
>> Specifically, indexing a variable-length encoding like utf-8 is not as
>> efficient as indexing a fixed-length encoding.

>
> Many common string operations do not require indexing by character
> which reduces the impact of this inefficiency.


Which common string operations do you have in mind?

Specifically in Python's case, the most obvious counter-example is the
length of a string. But that's only because Python strings are immutable
objects, and include a field that records the length. So once the string
is created, checking its length takes constant time.

Some string operations need to inspect every character, e.g. str.upper().
Even for them, the increased complexity of a variable-width encoding
costs. It's not sufficient to walk the string inspecting a fixed 1, 2 or
4 bytes per character. You have to walk the string grabbing 1 byte at a
time, and then decide whether you need another 1, 2 or 3 bytes. Even
though it's still O(N), the added bit-masking and overhead of variable-
width encoding adds to the overall cost.
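
[Editorial note: the byte-walk described above can be sketched in a few lines of Python. `utf8_char_len` and `char_count` are illustrative names, not stdlib functions; the lead-byte ranges follow the UTF-8 bit patterns.]

```python
def utf8_char_len(lead: int) -> int:
    """Bytes in a UTF-8 sequence, decided from the lead byte's high bits."""
    if lead < 0x80:
        return 1                # 0xxxxxxx: ASCII
    if lead < 0xC0:
        raise ValueError("continuation byte cannot start a character")
    if lead < 0xE0:
        return 2                # 110xxxxx
    if lead < 0xF0:
        return 3                # 1110xxxx
    return 4                    # 11110xxx

def char_count(data: bytes) -> int:
    """Character count over raw UTF-8 -- the O(N) walk described above."""
    i = count = 0
    while i < len(data):
        i += utf8_char_len(data[i])
        count += 1
    return count

print(char_count("héllo".encode("utf-8")))  # 5 characters, from 6 bytes
```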

Any string method that takes a starting offset requires the method to
walk the string byte-by-byte. I've even seen languages put responsibility
for dealing with that onto the programmer: the "start offset" is given in
*bytes*, not characters. I don't remember what language this was... it
might have been Haskell? Whatever it was, it horrified me.
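
[Editorial note: the character-offset vs byte-offset distinction is easy to demonstrate from Python itself, since `str` indexes by character and `bytes` by byte; any string containing a multi-byte character would do.]

```python
s = "héllo world"
b = s.encode("utf-8")

# str.find answers in characters; bytes.find answers in bytes.
print(s.find("world"))   # 6 -- character offset
print(b.find(b"world"))  # 7 -- byte offset: "é" occupies two bytes
```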


> UTF-8 seems like a
> reasonable choice for an internal representation to me.


It's not. Haskell, for example, uses UTF-8 internally, and it warns that
this makes string operations O(N) instead of O(1) precisely because of
the need to walk the string inspecting every byte.

Remember, when your string primitives are O(N), it is very easy to write
code that becomes O(N**2). Using UTF-8 internally is just begging for
user-written code to be O(N**2).
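
[Editorial note: to make that concrete, here is a hypothetical indexing primitive over raw UTF-8 bytes. This is not how CPython stores strings; it shows how an O(i) lookup turns an innocent loop into O(N**2) total work.]

```python
def utf8_index(data: bytes, i: int) -> str:
    """Return character i of well-formed UTF-8 -- O(i): must walk from byte 0."""
    pos = 0
    for _ in range(i):
        b = data[pos]
        pos += 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
    b = data[pos]
    n = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
    return data[pos:pos + n].decode("utf-8")

# N calls, each costing O(i): quadratic overall, yet it looks harmless.
data = "abcdé".encode("utf-8")
print(''.join(utf8_index(data, i) for i in range(5)))  # abcdé
```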


> One benefit of
> UTF-8 over Python's flexible representation is that it is, on average,
> more compact over a wide set of samples.


Sure. And over a different set of samples, it is less compact. If you
write a lot of Latin-1 text, Python will use one byte per character,
while UTF-8 will use two bytes for every non-ASCII character.
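
[Editorial note: one concrete data point, reproducible in any Python 3. For accented Western European text, latin-1 costs one byte per character while UTF-8 pays two bytes for each accented character.]

```python
s = "naïve café résumé"           # 17 characters, 4 of them accented
print(len(s.encode("latin-1")))   # 17 -- one byte per character
print(len(s.encode("utf-8")))     # 21 -- two bytes for each of ï, é, é, é
```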


--
Steven
 
 
 
 
 
jmfauth
      03-28-2013
On 28 mar, 11:30, Chris Angelico <(E-Mail Removed)> wrote:
> On Thu, Mar 28, 2013 at 8:03 PM, jmfauth <(E-Mail Removed)> wrote:


-----

> You really REALLY need to sort out in your head the difference between
> correctness and performance. I still haven't seen one single piece of
> evidence from you that Python 3.3 fails on any point of Unicode
> correctness.


That's because you are not understanding Unicode. Unicode takes
you from the character to the Unicode transformation format via
the code point, working with a unique set of characters with
a contiguous range of code points.
Then it is up to the "implementors" (languages, compilers, ...)
to implement this UTF.

> Covering the whole range of Unicode has never been a
> problem.


.... for all those who are following the scheme explained above.
And it magically works smoothly. Of course, there are some variations
due to the Character Encoding Form, which is later influenced by the
Character Encoding Scheme (the serialization of the Character Encoding
Form).

Rough explanation in other words:
it does not matter if you are using UTF-8, -16, -32, UCS-2 or UCS-4;
all the single characters are handled in the same way, with the "same
algorithm".

---

The flexible string representation takes the problem from the
other side: it attempts to work with the characters by using
their representations, and it can only fail...

PS: I never proposed to use utf-8. I only spoke about utf-8
as an example. If you start to discuss indexing, you are off-topic.

jmf


 
 
 
 
 
jmfauth
      03-28-2013
On 28 mar, 14:01, Steven D'Aprano <steve
(E-Mail Removed)> wrote:
> On Thu, 28 Mar 2013 23:11:55 +1100, Neil Hodgson wrote:
> > Ian Foote:
>
> > One benefit of
> > UTF-8 over Python's flexible representation is that it is, on average,
> > more compact over a wide set of samples.
>
> Sure. And over a different set of samples, it is less compact. If you
> write a lot of Latin-1, Python will use one byte per character, while
> UTF-8 will use two bytes per character.


This flexible string representation is so absurd that not only
does "it" not know that you cannot write Western European languages
with latin-1 alone, "it" penalizes you by attempting to optimize for
latin-1. Shown in my multiple examples.

(This is a case similar to the long/short int question/discussion
Chris Angelico opened.)


PS1: I received plenty of private mails. I'm surprised how the devs
do not understand Unicode.

PS2: A question I received once from a registered French Python
developer (in another context): what are those French characters
you can handle with cp1252 and not with latin-1?

jmf


 
 
Chris Angelico
      03-28-2013
On Fri, Mar 29, 2013 at 1:12 AM, jmfauth <(E-Mail Removed)> wrote:
> This flexible string representation is so absurd that not only
> "it" does not know you can not write Western European Languages
> with latin-1, "it" penalizes you by just attempting to optimize
> latin-1. Shown in my multiple examples.


PEP393 strings have two optimizations, or kinda three:

1a) ASCII-only strings
1b) Latin1-only strings
2) BMP-only strings
3) Everything else

Options 1a and 1b are almost identical - I'm not sure what the detail
is, but there's something flagging those strings that fit inside seven
bits. (Something to do with optimizing encodings later?) Both are
optimized down to a single byte per character.

Option 2 is optimized to two bytes per character.

Option 3 is stored in UTF-32.
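
[Editorial note: the three storage widths are observable from pure Python. Growing a string by one character of the same kind and comparing `sys.getsizeof` is a measurement trick that assumes CPython 3.3+'s compact PEP 393 layout; it is not a documented API.]

```python
import sys

def char_width(s):
    """Bytes per character in s's PEP 393 buffer, measured empirically."""
    return sys.getsizeof(s + s[0]) - sys.getsizeof(s)

print(char_width("hello"))        # 1 -- ASCII/Latin-1: one byte each
print(char_width("\u0100bc"))     # 2 -- BMP beyond Latin-1: two bytes
print(char_width("\U0001F600x"))  # 4 -- astral characters: UTF-32 units
```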

Once again, jmf, you are forgetting that option 2 is a safe and
bug-free optimization.

ChrisA
 
 
MRAB
      03-28-2013
On 28/03/2013 12:11, Neil Hodgson wrote:
> Ian Foote:
>
>> Specifically, indexing a variable-length encoding like utf-8 is not
>> as efficient as indexing a fixed-length encoding.

>
> Many common string operations do not require indexing by character
> which reduces the impact of this inefficiency. UTF-8 seems like a
> reasonable choice for an internal representation to me. One benefit
> of UTF-8 over Python's flexible representation is that it is, on
> average, more compact over a wide set of samples.
>

Implementing the regex module (http://pypi.python.org/pypi/regex) would
have been more difficult if the internal representation had been UTF-8,
because of the need to decode, and the implementation would also have
been slower for that reason.
 
 
Chris Angelico
      03-28-2013
On Fri, Mar 29, 2013 at 1:51 AM, MRAB <(E-Mail Removed)> wrote:
> On 28/03/2013 12:11, Neil Hodgson wrote:
>> Ian Foote:
>>> Specifically, indexing a variable-length encoding like utf-8 is not
>>> as efficient as indexing a fixed-length encoding.
>>
>> Many common string operations do not require indexing by character
>> which reduces the impact of this inefficiency. UTF-8 seems like a
>> reasonable choice for an internal representation to me. One benefit
>> of UTF-8 over Python's flexible representation is that it is, on
>> average, more compact over a wide set of samples.
>
> Implementing the regex module (http://pypi.python.org/pypi/regex) would
> have been more difficult if the internal representation had been UTF-8,
> because of the need to decode, and the implementation would also have
> been slower for that reason.


In fact, nearly ALL string parsing operations would need to be done
differently. The only method that I can think of that wouldn't be
impacted is a linear state-machine parser - something that could be
written inside a "for character in string" loop.

# Sample input for the sketch; any markup-ish text works.
string = "Hello, <b>world</b>!"

text = []

def initial(c):
    global state
    if c == '<': state = tag
    else: text.append(c)

def tag(c):
    global state
    if c == '>': state = initial

state = initial
for character in string:
    state(character)

print(''.join(text))


I'm pretty sure this will run in O(N) time, even with UTF-8 strings.
But it's an *extremely* simple parser.

ChrisA
 
 
jmfauth
      03-28-2013
On 28 mar, 15:38, Chris Angelico <(E-Mail Removed)> wrote:
> On Fri, Mar 29, 2013 at 1:12 AM, jmfauth <(E-Mail Removed)> wrote:
> > This flexible string representation is so absurd that not only
> > "it" does not know you can not write Western European Languages
> > with latin-1, "it" penalizes you by just attempting to optimize
> > latin-1. Shown in my multiple examples.

>
> PEP393 strings have two optimizations, or kinda three:
>
> 1a) ASCII-only strings
> 1b) Latin1-only strings
> 2) BMP-only strings
> 3) Everything else
>
> Options 1a and 1b are almost identical - I'm not sure what the detail
> is, but there's something flagging those strings that fit inside seven
> bits. (Something to do with optimizing encodings later?) Both are
> optimized down to a single byte per character.
>
> Option 2 is optimized to two bytes per character.
>
> Option 3 is stored in UTF-32.
>
> Once again, jmf, you are forgetting that option 2 is a safe and
> bug-free optimization.
>
> ChrisA


As long as you are attempting to divide a set of characters into
chunks and try to handle them separately, it will never work.

Read my previous post about the unicode transformation format.
I know what PEP 393 does.

jmf

 
 
Chris Angelico
      03-28-2013
On Fri, Mar 29, 2013 at 2:14 AM, jmfauth <(E-Mail Removed)> wrote:
> As long as you are attempting to devide a set of characters in
> chunks and try to handle them seperately, it will never work.


Okay. Let's look at integers. To properly represent the Python 3 'int'
type (or the Python 2 'long'), we need to be able to encode ANY
integer. And of course, any attempt to divide them up into chunks will
never work. So we need a single representation that will cover ANY
integer, right?

Perfect. We already have one of those, detailed in RFC 2795. (It's
coming up to its thirteenth anniversary in a day or two. Very
appropriate.)

http://tools.ietf.org/html/rfc2795#section-4

Are you saying Python's integers should be stored as I-TAGs?

ChrisA
 
 
jmfauth
      03-28-2013
On 28 mar, 16:14, jmfauth <(E-Mail Removed)> wrote:
> On 28 mar, 15:38, Chris Angelico <(E-Mail Removed)> wrote:
> [Chris's PEP 393 summary, quoted in full upthread; snipped]
>
> As long as you are attempting to divide a set of characters into
> chunks and try to handle them separately, it will never work.
>
> Read my previous post about the unicode transformation format.
> I know what PEP 393 does.
>
> jmf

Addendum.

This is what you correctly perceived in another thread.
You qualified it as a "switch". Now you have to understand
where this "switch" comes from.

jmf

 
Ian Kelly
      03-28-2013
On Thu, Mar 28, 2013 at 7:01 AM, Steven D'Aprano
<(E-Mail Removed)> wrote:
> Any string method that takes a starting offset requires the method to
> walk the string byte-by-byte. I've even seen languages put responsibility
> for dealing with that onto the programmer: the "start offset" is given in
> *bytes*, not characters. I don't remember what language this was... it
> might have been Haskell? Whatever it was, it horrified me.


Go does this. I remember because it came up in one of these threads,
where jmf (or was it Ranting Rick?) was praising Go for just getting
Unicode "right".
 
 
 
 