split on NO-BREAK SPACE

 
 
Peter Kleiweg
      07-22-2007

Is this a bug or a feature?


Python 2.4.4 (#1, Oct 19 2006, 11:55:22)
[GCC 2.95.3 20010315 (SuSE)] on linux2

>>> a = 'a b c\240d e'
>>> a
'a b c\xa0d e'
>>> a.split()
['a', 'b', 'c\xa0d', 'e']
>>> a = a.decode('latin-1')
>>> a
u'a b c\xa0d e'
>>> a.split()
[u'a', u'b', u'c', u'd', u'e']



--
Peter Kleiweg L:NL,af,da,de,en,ia,nds,no,sv,(fr,it) S:NL,de,en,(da,ia)
info: http://www.let.rug.nl/kleiweg/ls.html
 

Carsten Haese
      07-22-2007
On Sun, 2007-07-22 at 17:15 +0200, Peter Kleiweg wrote:
> Is this a bug or a feature?
>
>
> Python 2.4.4 (#1, Oct 19 2006, 11:55:22)
> [GCC 2.95.3 20010315 (SuSE)] on linux2
>
> >>> a = 'a b c\240d e'
> >>> a
> 'a b c\xa0d e'
> >>> a.split()
> ['a', 'b', 'c\xa0d', 'e']
> >>> a = a.decode('latin-1')
> >>> a
> u'a b c\xa0d e'
> >>> a.split()
> [u'a', u'b', u'c', u'd', u'e']


It's a feature. See help(str.split): "If sep is not specified or is
None, any whitespace string is a separator."

--
Carsten Haese
http://informixdb.sourceforge.net


 

Peter Kleiweg
      07-22-2007
Carsten Haese wrote on the 22nd day of the hay month (July) of the year 2007:

> On Sun, 2007-07-22 at 17:15 +0200, Peter Kleiweg wrote:
> > Is this a bug or a feature?
> >
> >
> > Python 2.4.4 (#1, Oct 19 2006, 11:55:22)
> > [GCC 2.95.3 20010315 (SuSE)] on linux2
> >
> > >>> a = 'a b c\240d e'
> > >>> a
> > 'a b c\xa0d e'
> > >>> a.split()
> > ['a', 'b', 'c\xa0d', 'e']
> > >>> a = a.decode('latin-1')
> > >>> a
> > u'a b c\xa0d e'
> > >>> a.split()
> > [u'a', u'b', u'c', u'd', u'e']
>
> It's a feature. See help(str.split): "If sep is not specified or is
> None, any whitespace string is a separator."


Define "any whitespace".
Why is it different in <type 'str'> and <type 'unicode'>?
Why does split() split when it says NO-BREAK?

--
Peter Kleiweg L:NL,af,da,de,en,ia,nds,no,sv,(fr,it) S:NL,de,en,(da,ia)
info: http://www.let.rug.nl/kleiweg/ls.html
 

Carsten Haese
      07-22-2007
On Sun, 2007-07-22 at 17:44 +0200, Peter Kleiweg wrote:
> > It's a feature. See help(str.split): "If sep is not specified or is
> > None, any whitespace string is a separator."

>
> Define "any whitespace".


Any string for which isspace returns True.
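
As a rough way to make that definition concrete, the sketch below lists every code point that isspace() accepts. It assumes Python 3, where str is always Unicode (the sessions in this thread are Python 2.4). The name whitespace_chars is made up for illustration, and the exact list depends on the Unicode database bundled with the interpreter.

```python
import sys

# Collect every code point that str.isspace() treats as whitespace.
# Assumption: Python 3; the result varies slightly with the Unicode version.
whitespace_chars = [
    chr(cp) for cp in range(sys.maxunicode + 1) if chr(cp).isspace()
]

for ch in whitespace_chars:
    print("U+%04X" % ord(ch))
```

U+00A0 shows up in that list, while U+FEFF does not.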

> Why is it different in <type 'str'> and <type 'unicode'>?


>>> '\xa0'.isspace()
False
>>> u'\xa0'.isspace()
True

For byte strings, Python doesn't know whether 0xA0 is a whitespace
because it depends on the encoding whether the number 160 corresponds to
a whitespace character. For unicode strings, code point 160 is
unquestionably a whitespace, because it is a no-break SPACE.
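
The same contrast can be reproduced in Python 3, where the two types are spelled bytes and str. This is a sketch under that assumption, not a replay of the 2.4 session. The names raw, latin1_text and cp437_text are illustrative, and cp437 is included only as an example of an encoding in which byte 0xA0 is a letter.

```python
# Python 3: bytes methods only know ASCII whitespace; str methods use Unicode data.
raw = b"a b c\xa0d e"                # undecoded bytes: 0xA0 has no agreed meaning yet

latin1_text = raw.decode("latin-1")  # in Latin-1, byte 0xA0 is U+00A0 NO-BREAK SPACE
cp437_text = raw.decode("cp437")     # in code page 437, byte 0xA0 is the letter U+00E1

print(raw.split())          # [b'a', b'b', b'c\xa0d', b'e']
print(latin1_text.split())  # ['a', 'b', 'c', 'd', 'e']
print(cp437_text.split())   # ['a', 'b', 'c\xe1d', 'e']
print(b"\xa0".isspace(), "\xa0".isspace())  # False True
```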

> Why does split() split when it says NO-BREAK?


Precisely. It says NO-BREAK. It doesn't say NO-SPLIT.

--
Carsten Haese
http://informixdb.sourceforge.net


 

Peter Kleiweg
      07-22-2007
Carsten Haese wrote on the 22nd day of the hay month (July) of the year 2007:

> On Sun, 2007-07-22 at 17:44 +0200, Peter Kleiweg wrote:
> > > It's a feature. See help(str.split): "If sep is not specified or is
> > > None, any whitespace string is a separator."

> >
> > Define "any whitespace".

>
> Any string for which isspace returns True.


Define white space to isspace()

> > Why is it different in <type 'str'> and <type 'unicode'>?

>
> >>> '\xa0'.isspace()
> False
> >>> u'\xa0'.isspace()
> True


Here is another "space":

>>> u'\uFEFF'.isspace()
False

isspace() is inconsistent

> For byte strings, Python doesn't know whether 0xA0 is a whitespace
> because it depends on the encoding whether the number 160 corresponds to
> a whitespace character. For unicode strings, code point 160 is
> unquestionably a whitespace, because it is a no-break SPACE.


I question it. And so does the sre module:

\s Matches any whitespace character; equivalent to [ \t\n\r\f\v]

Where is the NO-BREAK SPACE in there?
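
For comparison, in Python 3 the re module treats \s in a str pattern as Unicode whitespace by default, and the ASCII-only set quoted above has to be requested with re.ASCII. The sketch assumes Python 3. It says nothing about how sre behaved in 2.4 beyond the documentation line quoted here, and the variable names are illustrative.

```python
import re

sample = "c\xa0d"  # contains U+00A0 NO-BREAK SPACE

# Default for str patterns in Python 3: \s follows Unicode whitespace.
unicode_hits = re.findall(r"\s", sample)

# With re.ASCII, \s is restricted to [ \t\n\r\f\v].
ascii_hits = re.findall(r"\s", sample, flags=re.ASCII)

print(unicode_hits)  # ['\xa0']
print(ascii_hits)    # []
```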


> > Why does split() split when it says NO-BREAK?

>
> Precisely. It says NO-BREAK. It doesn't say NO-SPLIT.


That is a stupid answer.


--
Peter Kleiweg L:NL,af,da,de,en,ia,nds,no,sv,(fr,it) S:NL,de,en,(da,ia)
info: http://www.let.rug.nl/kleiweg/ls.html
 

Wildemar Wildenburger
      07-22-2007
Peter Kleiweg wrote:
>
> Define white space to isspace()
>
>

Explain that phrase.

>
> Here is another "space":
>
> >>> u'\uFEFF'.isspace()
> False
>
> isspace() is inconsistent
>

I don't really know much about Unicode, but Google tells me that \uFEFF
is a byte order mark. I thought we were implicitly in agreement that
"whitespace" (whatever the formal definition) means "the stuff we put
into text to visually separate words".
So what is *your* definition of whitespace?


>>> Why does split() split when it says NO-BREAK?
>>>

>> Precisely. It says NO-BREAK. It doesn't say NO-SPLIT.
>>

>
> That is a stupid answer.
>
>

I fail to see why you deem it a good idea to become insulting at this point.
It is a very valid answer: NO-BREAK means "when wrapping characters into
paragraphs do not break at this space".
split() however does not wrap text, it /splits/ it (at whitespace
characters, as it happens). The NO-BREAK semantic has no meaning here.
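
If the goal is to keep words joined by a NO-BREAK SPACE together, one option is to name the separators explicitly instead of relying on the default. A minimal sketch, assuming Python 3 and assuming that only the six ASCII whitespace characters should split. split_ascii_ws is a hypothetical helper, not part of the standard library.

```python
import re

_ASCII_WS = re.compile(r"[ \t\n\r\f\v]+")

def split_ascii_ws(text):
    """Split on runs of ASCII whitespace only, leaving U+00A0 inside words."""
    # strip() with an explicit character set avoids empty strings at the
    # ends without also stripping NO-BREAK SPACE.
    stripped = text.strip(" \t\n\r\f\v")
    return _ASCII_WS.split(stripped) if stripped else []

print(split_ascii_ws("a b c\xa0d e"))  # ['a', 'b', 'c\xa0d', 'e']
```

For a single known separator, plain "a b c\xa0d e".split(" ") behaves the same way on this input, but it does not collapse runs of spaces.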


/W
 

Steve Holden
      07-22-2007
Jean-Paul Calderone wrote:
> On Sun, 22 Jul 2007 21:13:02 +0200, Peter Kleiweg wrote:
>> Carsten Haese schreef op de 22e dag van de hooimaand van het jaar 2007:
>>
>>> On Sun, 2007-07-22 at 17:44 +0200, Peter Kleiweg wrote:
>>>>> It's a feature. See help(str.split): "If sep is not specified or is
>>>>> None, any whitespace string is a separator."
>>>> Define "any whitespace".
>>> Any string for which isspace returns True.

>> Define white space to isspace()
>>
>>>> Why is it different in <type 'str'> and <type 'unicode'>?
>>>>>> '\xa0'.isspace()
>>> False
>>>>>> u'\xa0'.isspace()
>>> True

>> Here is another "space":
>>
>> >>> u'\uFEFF'.isspace()
>> False
>>
>> isspace() is inconsistent

>
> It's only inconsistent if you think it should behave based on the
> name of a unicode code point. It doesn't use the name, though. It
> uses the category. NO-BREAK SPACE is in the Zs category (Separator, Space).
> ZERO WIDTH NO-BREAK SPACE is in the Cf category (Other, Format).
>
> Maybe that makes unicode inconsistent (I won't try to argue either way),
> but it's pretty clear that isspace is being consistent based on the data
> it has to work with.
>

Well, if you're going to start answering questions with FACTS, how can
questioners rely on their prejudices to guide them any more?

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
--------------- Asciimercial ------------------
Get on the web: Blog, lens and tag the Internet
Many services currently offer free registration
----------- Thank You for Reading -------------

 

I V
      07-22-2007
On Sun, 22 Jul 2007 21:13:02 +0200, Peter Kleiweg wrote:
> Here is another "space":
>
> >>> u'\uFEFF'.isspace()
> False
>
> isspace() is inconsistent


Well, U+00A0 is in the category "Separator, Space" while U+FEFF is in the
category "Other, Format", so it doesn't seem unreasonable that one is
treated as a space and the other isn't.
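
The categories mentioned here can be checked with the standard unicodedata module. A sketch assuming Python 3; the loop and its choice of sample characters are illustrative. Note that Python 3 documents isspace() as also accepting characters whose bidirectional class is WS, B, or S, which is why a tab counts as whitespace even though its general category is Cc.

```python
import unicodedata

# Print code point, name, general category, bidirectional class and the
# isspace() verdict for a few characters discussed in the thread.
for ch in ("\u00a0", "\ufeff", " ", "\t"):
    print(
        "U+%04X" % ord(ch),
        unicodedata.name(ch, "<unnamed>"),  # control characters have no name
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        ch.isspace(),
    )
```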
 

Ben Finney
      07-23-2007
Steve Holden writes:

> Well, if you're going to start answering questions with FACTS, how
> can questioners rely on their prejudices to guide them any more?


You clearly underestimate the capacity for such people to choose only
the particular facts that support those prejudices.

--
\ "Are you pondering what I'm pondering?" "I think so, Brain, but |
`\ I don't think Kay Ballard's in the union." -- _Pinky and The |
_o__) Brain_ |
Ben Finney
 