unicode regex example: trouble

 
 
marek
      05-21-2004
I'm trying this example and expect it to print a MatchObject reference, but it fails (prints None).
Does anybody know where I went wrong?

# -*- coding: cp1251 -*-

import re

# pattern in Ukrainian ('привіт')
p = '\377\376?\004@\0048\0042\004V\004B\004'

# data (pattern is in the middle of the string)
d = '\377\376t\000e\000s\000t\000?\004@\0048\0042\004V\004B\004t\000t\000'

re_test = re.compile(p, re.UNICODE)

print re_test.search(d, re.UNICODE)
 
Peter Otten
      05-21-2004
marek wrote:

> trying this example to make print MatchObject reference. Fails (prints
> None). Does anybody know where I am wrong?
>
> # -*- coding: cp1251 -*-
>
> import re
>
> # pattern in Ukrainian ('привіт')
> p = '\377\376?\004@\0048\0042\004V\004B\004'
>
> # data (pattern is in the middle of the string)
> d = '\377\376t\000e\000s\000t\000?\004@\0048\0042\004V\004B\004t\000t\000'
>
> re_test = re.compile(p, re.UNICODE)
>
> print re_test.search(d, re.UNICODE)


What you have here are funny 8-bit characters (plain byte strings), not unicode:

>>> print p, d
 ■?@82VB  ■test?@82VBtt

The leading '\377\376' looks like a UTF-16 byte order mark, so I guess the encoding is utf-16, therefore:

>>> du = d.decode("utf-16")
>>> pu = p.decode("utf-16")
>>> r = re.compile(pu)
>>> m = r.search(du)
>>> m
<_sre.SRE_Match object at 0x40392090>
>>> print m.group(0).encode("utf-16")
 ■?@82VB

Works as expected.
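
For reference, the decode-then-search steps can be sketched as a standalone script. This is only a sketch under a few assumptions: the space inside the posted data string is treated as a line-wrap artifact, the names raw_p and raw_d are mine, and b'' literals are used so the same code runs on Python 2.6+ and Python 3. It also hints at why the raw byte search finds nothing: the byte order mark occurs only at the start of the data, and the '?' byte is read as a regex quantifier.

```python
import re

# Raw UTF-16 bytes as posted: BOM '\377\376' followed by little-endian code units.
raw_p = b'\377\376?\004@\0048\0042\004V\004B\004'
raw_d = b'\377\376t\000e\000s\000t\000?\004@\0048\0042\004V\004B\004t\000t\000'

# Searching the undecoded bytes fails: the pattern carries its own BOM, which
# never appears in the middle of the data, and '?' acts as a quantifier.
raw_match = re.search(raw_p, raw_d)

# Decoding consumes the BOM and yields real unicode text.
pu = raw_p.decode("utf-16")
du = raw_d.decode("utf-16")

# The decoded pattern is found in the decoded data, after the four letters "test".
m = re.compile(pu).search(du)
print(m.span())
```

The match object reports character offsets into the decoded text, not byte offsets into the original string.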

Here's what the docs say about the unicode flag:

UNICODE
Make \w, \W, \b, and \B dependent on the Unicode character properties
database. New in version 2.0.
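
A small sketch of what that flag changes, assuming a made-up sample string (u"test привіт", written with escapes). The getattr() line is my own workaround so the example runs on both major versions: Python 3 treats str patterns as Unicode-aware by default and needs re.ASCII to switch that off, while Python 2 is ASCII-only unless re.UNICODE is given.

```python
import re

# Hypothetical sample text: u"test привіт"
text = u"test \u043f\u0440\u0438\u0432\u0456\u0442"

# re.ASCII exists only in Python 3; in Python 2 flags=0 already means ASCII-only.
ascii_only = getattr(re, "ASCII", 0)

# With re.UNICODE, \w follows the Unicode character database,
# so the Cyrillic letters count as word characters.
words_unicode = re.findall(r"\w+", text, re.UNICODE)

# Without Unicode matching, \w covers only [a-zA-Z0-9_].
words_ascii = re.findall(r"\w+", text, ascii_only)

print(len(words_unicode), len(words_ascii))
```

For a purely literal pattern such as the one in the question, the flag makes no difference; it only matters once \w, \W, \b or \B appear.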

You may or may not need that when you refine your regexp.
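
If the flag is needed, it belongs in re.compile() only. The second positional argument of a compiled pattern's search() is the start position (pos), not a flags value, so the original call search(d, re.UNICODE) asks the search to begin at index 32. A rough sketch of that effect, reusing the decoded strings and again assuming the space in the posted data was a wrapping artifact:

```python
import re

du = b'\377\376t\000e\000s\000t\000?\004@\0048\0042\004V\004B\004t\000t\000'.decode("utf-16")
pu = b'\377\376?\004@\0048\0042\004V\004B\004'.decode("utf-16")

# Flags are given once, when the pattern is compiled.
r = re.compile(pu, re.UNICODE)

# re.UNICODE has the integer value 32; as search()'s second argument
# it is interpreted as pos, the index where the search starts.
pos = int(re.UNICODE)

miss = r.search(du, pos)   # starts past the end of the 12-character string
hit = r.search(du)         # searches from the beginning

print(pos, len(du), miss is None, hit.start())
```

So even with decoded unicode strings, passing the flag to search() would still have produced None here.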

Peter

 