Correct handling of case in unicode and regexps

 
 
Devin Jeanpierre
      02-23-2013
Hi folks,

I'm pretty unsure of myself when it comes to unicode. As I understand
it, you're generally supposed to compare things in a case insensitive
manner by case folding, right? So instead of a.lower() == b.lower()
(the ASCII way), you do a.casefold() == b.casefold()
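
A minimal sketch of the difference, using only the built-in str methods (the variable names lower_equal and fold_equal are just for illustration):

```python
a = 'ss'
b = 'ß'  # U+00DF LATIN SMALL LETTER SHARP S

# lower() leaves 'ß' unchanged, so the two strings still differ.
lower_equal = a.lower() == b.lower()

# casefold() expands 'ß' to 'ss', so the two strings compare equal.
fold_equal = a.casefold() == b.casefold()

print(lower_equal, fold_equal)
```

Note that casefold() can change the length of a string, which matters for anything that reports character positions.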

However, I'm struggling to figure out how regular expressions should
treat case. Python's re module doesn't "work properly" to my
understanding, because:

>>> import re
>>> a = 'ss'
>>> b = 'ß'
>>> a.casefold() == b.casefold()
True
>>> re.match(re.escape(a), b, re.UNICODE | re.IGNORECASE)
>>> # oh dear!


In addition, it seems improbable that this ever _could_ work. Because
if it did work like that, then what would the value be of
re.match('s', 'ß', re.UNICODE | re.IGNORECASE).end() ? 0.5?
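
To make the offset problem concrete, here is a rough sketch that folds a string one character at a time and remembers which original index each folded character came from. The helper casefold_with_offsets is hypothetical, not part of re or any library:

```python
import re

def casefold_with_offsets(text):
    """Case-fold text character by character and record, for each folded
    character, the index of the original character it came from."""
    folded, origin = [], []
    for i, ch in enumerate(text):
        f = ch.casefold()
        folded.append(f)
        origin.extend([i] * len(f))
    return ''.join(folded), origin

folded, origin = casefold_with_offsets('ß')

# Matching 's' against the folded text succeeds...
m = re.match('s', folded)

# ...but the match ends between two folded characters that both come from
# the same original character, so there is no integer end offset in 'ß'.
partial = m.end() < len(folded) and origin[m.end()] == origin[m.end() - 1]
print(folded, origin, m.span(), partial)
```

So a matcher built on full case folding would have to either reject such partial matches or report positions in the folded text rather than in the original.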

I'd really like to hear the thoughts of people more experienced with
unicode. What is the ideal correct behavior here? Or do I
misunderstand things?

-- Devin
 
 
jmfauth
      02-24-2013
On 23 Feb, 15:26, Devin Jeanpierre wrote:
> Hi folks,
>
> I'm pretty unsure of myself when it comes to unicode. As I understand
> it, you're generally supposed to compare things in a case insensitive
> manner by case folding, right? So instead of a.lower() == b.lower()
> (the ASCII way), you do a.casefold() == b.casefold()
>
> However, I'm struggling to figure out how regular expressions should
> treat case. Python's re module doesn't "work properly" to my
> understanding, because:
>
>     >>> a = 'ss'
>     >>> b = 'ß'
>     >>> a.casefold() == b.casefold()
>     True
>     >>> re.match(re.escape(a), b, re.UNICODE | re.IGNORECASE)
>     >>> # oh dear!
>
> In addition, it seems improbable that this ever _could_ work. Because
> if it did work like that, then what would the value be of
> re.match('s', 'ß', re.UNICODE | re.IGNORECASE).end() ? 0.5?
>
> I'd really like to hear the thoughts of people more experienced with
> unicode. What is the ideal correct behavior here? Or do I
> misunderstand things?


-----

I'm just wondering whether there is a real issue here. After all,
this is only a question of conventions. Unicode has its
conventions, and a regex module may (indeed must) adopt some of its own.

It seems to me that the safest way is to preprocess the text
that is to be examined.
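
One possible way to do that preprocessing, sketched under the assumption that the thing being searched for is a literal string rather than a general pattern (case-folding an arbitrary regex would corrupt escapes such as \S or \W). The helper name search_folded is made up for this example:

```python
import re

def search_folded(literal, text):
    """Search for a literal string in text, ignoring case, by case-folding
    both sides before handing them to re. The returned match object refers
    to positions in the folded text, not in the original text."""
    return re.search(re.escape(literal.casefold()), text.casefold())

m = search_folded('ss', 'ß')
print(m.group() if m else None)
```

The offsets of the match are only meaningful in the folded text, so this suits yes/no tests better than extracting spans from the original.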

Proposed case study:
How should ss/ß/SS/ẞ be interpreted?

'Richard-Strauss-Straße'
'Richard-Strauss-Strasse'
'RICHARD-STRAUSS-STRASSE'
'RICHARD-STRAUSS-STRAẞE'
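
For what it is worth, here is how the built-in str methods treat these four spellings. This is only a sketch of what casefold() and lower() do, not a claim about which interpretation is the right one:

```python
names = [
    'Richard-Strauss-Straße',
    'Richard-Strauss-Strasse',
    'RICHARD-STRAUSS-STRASSE',
    'RICHARD-STRAUSS-STRAẞE',
]

# Full case folding maps both 'ß' and 'ẞ' to 'ss', so all four collapse
# to a single folded form.
folded = {n.casefold() for n in names}

# lower() keeps 'ß' and maps 'ẞ' to 'ß', so two distinct forms remain.
lowered = {n.lower() for n in names}

print(folded)
print(lowered)
```

Folding also makes the 'ss' in "Strauss" indistinguishable from a folded 'ß', which is exactly the kind of convention one has to choose deliberately.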


The situation with sorting is more or less the same.
Unicode cannot do everything, and it may be necessary to
preprocess the input.

E.g., here is a function I once wrote for fun. It sorts French
words (without unicodedata or locale).

>>> import libfrancais
>>> z = ['oeuf', 'œuf', 'od', 'of']
>>> zo = libfrancais.sortedfr(z)
>>> zo
['od', 'oeuf', 'œuf', 'of']
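
The source of libfrancais is not shown here, so the following is only a guess at the general idea, not the actual implementation: preprocess each word into a sort key that expands the œ ligature. The function sort_key_fr is hypothetical and ignores accents and the other rules of French collation:

```python
def sort_key_fr(word):
    """Hypothetical sort key: expand the oe ligature so that 'œ' sorts
    like 'oe', then case-fold so that case does not affect the order."""
    return word.replace('œ', 'oe').replace('Œ', 'OE').casefold()

z = ['oeuf', 'œuf', 'od', 'of']

# sorted() is stable, so 'oeuf' and 'œuf' (which get the same key) keep
# their relative input order.
zo = sorted(z, key=sort_key_fr)
print(zo)
```

With this input the result happens to agree with the output shown above, but that is no evidence that sortedfr works the same way.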

jmf
 