UNICODE mode for regular expressions - time to change the default?

 
 
John Nagle
 
      04-05-2007
Regular expressions are compiled in ASCII mode unless
Unicode mode is specified to "re.compile". The difference is that regular
expressions in ASCII mode don't recognize things like
Unicode whitespace, even when applied to Unicode strings.
For example, Unicode character 0x00A0 is a "NO-BREAK SPACE", which is
a form of whitespace. It's the Unicode equivalent of HTML's "&nbsp;".
This can create some strange bugs.
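A minimal sketch of the kind of bug meant here, assuming Python 3 syntax (the thread itself predates it). In Python 3, `re.ASCII` requests the ASCII-only behaviour described above and `re.UNICODE` the Unicode-aware one. The sample string is made up for illustration.

```python
import re

# Hypothetical input: two words separated by U+00A0 NO-BREAK SPACE.
text = "hello\u00a0world"

# ASCII-only matching: \s covers just [ \t\n\r\f\v].
ascii_parts = re.split(r"\s+", text, flags=re.ASCII)

# Unicode-aware matching: \s also covers Unicode whitespace such as U+00A0.
unicode_parts = re.split(r"\s+", text, flags=re.UNICODE)

print(ascii_parts)    # the string is left in one piece
print(unicode_parts)  # the string is split into two words
```

The two calls differ only in the flag, which is why this sort of failure is easy to overlook.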

Is the current default good? Or is it time to compile all regular
expressions in Unicode mode by default? It shouldn't hurt processing of
ASCII strings to do that. The current setup is really a legacy of when
most things in Python didn't work in Unicode mode, and you didn't want to
introduce Unicode unnecessarily. It's another one of those obscure
Unicode "gotchas" that really should go away.

John Nagle
 
 
 
 
 
John Machin
 
      04-05-2007
On Apr 6, 5:50 am, John Nagle wrote:
> Regular expressions are compiled in ASCII mode unless
> Unicode mode is specified to "re.compile". The difference is that regular
> expressions in ASCII mode don't recognize things like
> Unicode whitespace, even when applied to Unicode strings.


AFAICT, the default is that \s, \d, etc are interpreted according to
the current locale's properties. Specifying re.U changes that to use
the unicodedata properties instead. There is no such thing as "ASCII
mode".

> For example, Unicode character 0x00A0 is a "NO-BREAK SPACE", which is
> a form of whitespace. It's the Unicode equivalent of HTML's "&nbsp;".
> This can create some strange bugs.
>
> Is the current default good? Or is it time to compile all regular
> expressions in Unicode mode by default? It shouldn't hurt processing of
> ASCII strings to do that.


Believe it or not: there are folk out there who have data which is
encoded in 8-bit encodings which are not ASCII and for which
"\xA0".decode('whatever') does not produce u"\xA0" ... it could for
example be a box-drawing character or a letter:

>>> import unicodedata as ucd
>>> "\xA0".decode('koi8-r')
u'\u2550'
>>> ucd.name(_)
'BOX DRAWINGS DOUBLE HORIZONTAL'
>>> "\xA0".decode('cp850')
u'\xe1'
>>> ucd.name(_)
'LATIN SMALL LETTER A WITH ACUTE'
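The same check can be sketched in Python 3 syntax (an assumption; the session above is Python 2), where the byte has to be written as a `bytes` literal. The `latin-1` entry is added here only for contrast and is not part of the original session.

```python
import unicodedata

RAW = b"\xa0"  # one byte whose meaning depends on the encoding

# Decode the same byte under three 8-bit encodings and name the result.
names = {
    encoding: unicodedata.name(RAW.decode(encoding))
    for encoding in ("latin-1", "koi8-r", "cp850")
}

for encoding, name in names.items():
    print(encoding, "->", name)
```

Only the `latin-1` decoding yields NO-BREAK SPACE; the other two give the box-drawing character and the accented letter shown above.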


Problem number 2: Users in locale X probably wouldn't want a match to
succeed on a character that is regarded as a digit (say) only in distant
locale Y. If such a character turns up in a data file in locale X, it more
likely indicates that binary data was read instead of Unicode text:

>>> import re
>>> ucd.name(u"\u0f20")
'TIBETAN DIGIT ZERO'
>>> re.match(ur"\d", u"\u0f20")
>>> re.match(ur"\d", u"\u0f20", re.U)
<_sre.SRE_Match object at 0x00EFC9C0>
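For anyone who wants `\d` to mean only 0-9 whatever the flags, one possible approach is sketched below in Python 3 syntax (an assumption; the session above is Python 2). It compares Unicode-aware `\d`, `\d` under `re.ASCII`, and an explicit `[0-9]` class.

```python
import re

TIBETAN_ZERO = "\u0f20"  # TIBETAN DIGIT ZERO

# Unicode-aware \d (what re.U selects, and the Python 3 default for str).
unicode_digit = re.match(r"\d", TIBETAN_ZERO, re.UNICODE)

# \d restricted to the ASCII digits.
ascii_digit = re.match(r"\d", TIBETAN_ZERO, re.ASCII)

# An explicit class does not depend on any flag.
explicit_class = re.match(r"[0-9]", TIBETAN_ZERO)

print(unicode_digit is not None, ascii_digit, explicit_class)
```

Spelling out `[0-9]` keeps the pattern's meaning fixed even if the default mode changes later.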


> The current setup is really a legacy of when
> most things in Python didn't work in Unicode mode, and you didn't want to
> introduce Unicode unnecessarily. It's another one of those obscure
> Unicode "gotchas" that really should go away.


It's the ASCII-centric mindset that creates gotchas and really should
go away.

HTH,
John

 
 
 
 
 
Steve Holden
 
      04-05-2007
John Nagle wrote:
> Regular expressions are compiled in ASCII mode unless
> Unicode mode is specified to "re.compile". The difference is that regular
> expressions in ASCII mode don't recognize things like
> Unicode whitespace, even when applied to Unicode strings.
> For example, Unicode character 0x00A0 is a "NO-BREAK SPACE", which is
> a form of whitespace. It's the Unicode equivalent of HTML's "&nbsp;".
> This can create some strange bugs.
>
> Is the current default good? Or is it time to compile all regular
> expressions in Unicode mode by default? It shouldn't hurt processing of
> ASCII strings to do that. The current setup is really a legacy of when
> most things in Python didn't work in Unicode mode, and you didn't want to
> introduce Unicode unnecessarily. It's another one of those obscure
> Unicode "gotchas" that really should go away.
>
> John Nagle


Personally I'd leave it to go away with Python 3.0, when all strings
will be Unicode.
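As a rough check of how this turned out, here is a short sketch assuming a Python 3 interpreter: patterns written as `str` are compiled with Unicode matching by default, while `bytes` patterns stay ASCII-only.

```python
import re

str_pattern = re.compile(r"\s")     # str pattern: Unicode matching by default
bytes_pattern = re.compile(rb"\s")  # bytes pattern: ASCII-only matching

# The compiled str pattern reports the UNICODE flag without it being passed.
str_is_unicode = bool(str_pattern.flags & re.UNICODE)

str_hit = str_pattern.match("\u00a0")     # NO-BREAK SPACE as text
bytes_hit = bytes_pattern.match(b"\xa0")  # the raw byte 0xA0

print(str_is_unicode, str_hit is not None, bytes_hit)
```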

regards
Steve

 
 
 
 