Unicode in regexp

 
 
patari
      05-21-2007
Hi,

I have some text which contains the Unicode character U+2013, for example:
PERFORMANCE – A COMPARATIVE STUDY

How can I find this character and change it to two - characters for
LaTeX?

Somehow the following code doesn't work, assuming that $str contains the
string mentioned above:

$str =~ s/\x{2013}/--/g;

If I save that text in a UTF-8 file and open that file like this
open(FILE,"<:utf8","text.txt");
then the above regular expression works. How can I get the regexp to work
for text that is not read from a file which is specified to be in
UTF-8 encoding?

 
gypark2@gmail.com
      05-21-2007
On May 21, 8:09 pm, patari <(E-Mail Removed)> wrote:
> Hi,
>
> I have some text which has unicode character \u+2013 for example:
> PERFORMANCE - A COMPARATIVE STUDY
>
> How can I find this character and change it to two - characters for
> LaTeX?
>
> Somehow next code doesn't work, assuming that $str contains string
> mentioned earlier:
>
> $str =~ s/\x{2013}/--/g;
>
> If I save that text in a UTF-8 file and open that file like this
> open(FILE,"<:utf8","text.txt");
> then above regular expression works. How could I get regexp to work
> for text that is not read from a file which is specified to be in
> UTF-8 encoding?



Hello,

Save your script in UTF-8 encoding and just use the Unicode
character itself, rather than the \x{****} form, in the regexp:

$str =~ s/–/--/g; # The first "–" is U+2013 (en dash), not a hyphen.

Or,

decode it first, perform substitution, and encode it back:

use Encode;
$octets = decode("UTF-8", $str);
$octets =~ s/\x{2013}/--/g;
$str = encode("UTF-8", $octets);

 
Mumia W.
      05-21-2007
On 05/21/2007 06:09 AM, patari wrote:
> [...]
> Somehow next code doesn't work, assuming that $str contains string
> mentioned earlier:
>
> $str =~ s/\x{2013}/--/g;
>
> If I save that text in a UTF-8 file and open that file like this
> open(FILE,"<:utf8","text.txt");
> then above regular expression works. How could I get regexp to work
> for text that is not read from a file which is specified to be in
> UTF-8 encoding?
>


Where does the text come from?

How do you know that U+2013 is in that text?

 
patari
      05-22-2007
On 21 May, 18:57, "Mumia W." <paduille.4061.mumia.w
(E-Mail Removed)> wrote:
> On 05/21/2007 06:09 AM, patari wrote:
>
> > [...]
> > Somehow next code doesn't work, assuming that $str contains string
> > mentioned earlier:

>
> > $str =~ s/\x{2013}/--/g;

>
> > If I save that text in a UTF-8 file and open that file like this
> > open(FILE,"<:utf8","text.txt");
> > then above regular expression works. How could I get regexp to work
> > for text that is not read from a file which is specified to be in
> > UTF-8 encoding?

>
> Where does the text come from?
>
> How do you know that u+2013 is in that text?



The text originally comes from a user of a CGI application, but in this
case it is fetched from a database. I believe the character is U+2013
because it shows up when the text is viewed in a browser, and I can copy
it from there to, for example, Emacs, which tells me the code of the character.

 
patari
      05-22-2007
Hi,

On 21 May, 15:37, (E-Mail Removed) wrote:
> Hello,
>
> Save your script in UTF-8 encoding and just use the unicode
> characters, rather than \x{****} form, in the regexp:
>
> $str =~ s/–/--/g; # The first "–" is U+2013 (en dash), not a hyphen.
>
> Or,
>
> decode it first, perform substitution, and encode it back:
>
> use Encode;
> $octets = decode("UTF-8", $str);
> $octets =~ s/\x{2013}/--/g;
> $str = encode("UTF-8", $octets);



Unfortunately that doesn't work either. It only turns that character,
and characters like it, into a mess of characters. I think that decode
and encode should be swapped.
Neither does $str =~ s/\x20\x13/--/g; work.

But thanks to you and Petr Vileta I finally got the solution by
combining your hints. I first encoded the string
my $octets = encode("UTF-8",$str);
and then printed it to Apache's log. The character appeared to be encoded
as \xc2\x96. Using this I could match the regexp and change the
character.

Here is the solution if anyone else bumps into similar problems:
my $octets = encode("UTF-8",$str);
if ($octets =~ /\xc2\x96/) {
    $octets =~ s/\xc2\x96/--/g;
}
$str = decode("UTF-8",$octets);

I'm still wondering why \x{2013} didn't match after encode. It seems
that encode also changes that character and in this case codes it as
\xc2\x96.

 
Brian McCauley
      05-22-2007
On May 21, 12:09 pm, patari <(E-Mail Removed)> wrote:
> I have some text which has unicode character \u+2013 for example:
> PERFORMANCE - A COMPARATIVE STUDY


Unicode text is an abstract series of code points.

When you pass Unicode character data from one place to another (e.g.
web form to web server, web server to web browser, application to
database, database to application, file to application, application to
file...) you need the two ends to agree what encoding is being used to
serialise the abstract series of code points into a series of bytes.

Perl has two types of string: Unicode strings and byte strings. Byte
strings contain bytes or, sometimes, ASCII text. There are various
rules about what happens if you treat a byte string containing bytes
in the range 0x80-0xFF as text but I'm not going to go into those here.
Ideally you should say explicitly when you want to convert a byte
sequence to a Unicode character sequence, and specify what encoding you
are using.

So, when you want to read your sample text (as a series of bytes from
an external source) into a Perl Unicode string you need to make sure
that you tell Perl (somehow) what encoding is being used.
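
One way to tell Perl the encoding at the point where bytes enter the program is an I/O layer on the filehandle. The following is a minimal sketch, not taken from the thread: the sample bytes and the choice of cp1250 are assumptions for illustration, and an in-memory filehandle stands in for a real file or socket.

```perl
use strict;
use warnings;

# Assumed sample data: raw bytes as a Windows-1250 source would deliver
# them. In that encoding the single byte 0x96 is the en dash.
my $raw = "PERFORMANCE \x96 A COMPARATIVE STUDY\n";

# The :encoding() layer decodes the bytes into Unicode characters while
# reading. The in-memory filehandle (\$raw) stands in for a real file.
open(my $fh, "<:encoding(cp1250)", \$raw) or die "cannot open: $!";
my $line = <$fh>;
close($fh);

# $line is now a Unicode string, so \x{2013} can match.
$line =~ s/\x{2013}/--/g;
print $line;
```

The same substitution on $raw itself would not match, because $raw holds the byte 0x96 rather than the character U+2013.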

> How can I find this character and change it to two - characters for
> LaTeX?
>
> Somehow next code doesn't work, assuming that $str contains string
> mentioned earlier:
>
> $str =~ s/\x{2013}/--/g;


The code is right; the assumption is wrong. $str did not contain
U+2013.

From evidence elsewhere in this thread I can determine that $str
either was not a Unicode string at all (in which case it contained
only bytes, one of which was 0x96) or it was a Unicode string and
contained U+0096.

Now it just so happens that in Latin-1 the byte 0x96 encodes the
Unicode code point U+0096 and in Windows-1250 the byte 0x96 encodes the
Unicode code point U+2013.

So I conclude that at some point your Unicode text has been passed
from one place to another in such a way that the sender thinks it's
using Windows-1250 encoding and the receiver thinks it's Latin-1
encoding. The effect of this is to transform the printable Unicode
character 'EN DASH' into the non-printable Unicode control character
'START OF GUARDED AREA'.
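
The mismatch described above can be reproduced with the Encode module. This is a sketch under the assumption that the sender used Windows-1250 and the receiver Latin-1, as inferred in this post; the variable names are made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $original = "\x{2013}";                   # EN DASH as a Unicode character

# Sender serialises the character using Windows-1250: one byte.
my $wire = encode("cp1250", $original);

# Receiver wrongly interprets that byte as Latin-1.
my $received = decode("iso-8859-1", $wire);

# Encoding the corrupted character as UTF-8 gives the bytes seen in the log.
my $utf8 = encode("UTF-8", $received);
my $hex  = join " ", map { sprintf "%02X", ord($_) } split //, $utf8;

printf "byte on the wire:     0x%02X\n", ord($wire);
printf "received code point:  U+%04X\n", ord($received);
printf "matches \\x{2013}:     %s\n", ($received =~ /\x{2013}/ ? "yes" : "no");
printf "UTF-8 bytes:          %s\n", $hex;
```

The last line reports the byte pair C2 96, which is the \xc2\x96 sequence that turned up in the Apache log earlier in the thread.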

There is not sufficient evidence presented in this thread to work out
where this corruption occurred.

> If I save that text in a UTF-8 file and open that file like this
> open(FILE,"<:utf8","text.txt");
> then above regular expression works. How could I get regexp to work
> for text that is not read from a file which is specified to be in
> UTF-8 encoding?


By making sure that you know what encoding is being used by the place
that you are reading it from and instructing Perl to decode it from
that encoding into Unicode.
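
If the text has already arrived wrongly decoded and the source cannot be fixed, the wrong step can be undone in code. The sketch below assumes the situation inferred in this post (bytes that were really Windows-1250 but were decoded as Latin-1); whether that matches a given database or CGI setup has to be checked first.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumed input: a string that was decoded as Latin-1 by mistake, so the
# en dash arrived as the control character U+0096.
my $str = "PERFORMANCE \x{96} A COMPARATIVE STUDY";

# Undo the wrong decoding to get the original bytes back ...
my $bytes = encode("iso-8859-1", $str);

# ... and decode them with the encoding the sender is assumed to have used.
my $fixed = decode("cp1250", $bytes);

# Now the string really contains U+2013 and the original regexp works.
$fixed =~ s/\x{2013}/--/g;
print "$fixed\n";
```

Fixing the encoding declaration at the point where the data is read remains the cleaner option; this round trip only repairs data after the fact.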

 
Brian McCauley
      05-22-2007
On May 22, 8:48 am, patari <(E-Mail Removed)> wrote:
> On 21 May, 18:57, "Mumia W." <paduille.4061.mumia.w
> (E-Mail Removed)> wrote:
> > On 05/21/2007 06:09 AM, patari wrote:

>
> > > [...]
> > > Somehow next code doesn't work, assuming that $str contains string
> > > mentioned earlier:

>
> > > $str =~ s/\x{2013}/--/g;

>
> > > If I save that text in a UTF-8 file and open that file like this
> > > open(FILE,"<:utf8","text.txt");
> > > then above regular expression works. How could I get regexp to work
> > > for text that is not read from a file which is specified to be in
> > > UTF-8 encoding?

>
> > Where does the text come from?

>
> > How do you know that u+2013 is in that text?

>
> Text comes originally from user of cgi application, but in this case
> the text is fetched from database. I know that character u+2013
> because the text is viewed with browser where it shows, and I can copy
> that for example to emacs which tells me the code of the character.


That is a bad inference.

 
Brian McCauley
      05-22-2007
On May 22, 9:50 am, patari <(E-Mail Removed)> wrote:

> I first encoded the string
> my $octets = encode("UTF-8",$str);
> and then printed it to Apaches log. The character seemed to be encoded
> \xc2\x96.


Which tells us that the character is U+0096.

> I'm still wondering why \x{2013} didn't match after encode.


encode() returns a byte string. It contains only bytes. \x{2013} is
not a byte so it can never exist in a byte string.
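
A small sketch of this point, using only the Encode module: a string that genuinely contains U+2013 becomes three bytes after encode(), and only a byte-level pattern can match them.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars = "\x{2013}";               # one Unicode character, EN DASH
my $bytes = encode("UTF-8", $chars);  # byte string: E2 80 93

printf "characters: %d, bytes: %d\n", length($chars), length($bytes);

# A character pattern cannot match inside the byte string ...
print "\\x{2013} on bytes:     ",
    ($bytes =~ /\x{2013}/ ? "match" : "no match"), "\n";

# ... but the corresponding byte sequence does.
print "\\xe2\\x80\\x93 on bytes: ",
    ($bytes =~ /\xe2\x80\x93/ ? "match" : "no match"), "\n";
```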

> It seems
> that encode also changes that character and in this case codes it as
> \xc2\x96.


No, there's no reason to believe that $str ever contained U+2013.
 
Brian McCauley
      05-22-2007
On May 21, 1:37 pm, (E-Mail Removed) wrote:

> use Encode;
> $octets = decode("UTF-8", $str);


Your variable naming is confusing. decode() takes a byte (aka octet)
string as an argument and returns a string of Unicode characters (not
a string of bytes).
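
For comparison, here is the same decode, substitute, encode round trip with names that follow the direction of each call. It is a sketch that assumes the input really is UTF-8; the sample text and variable names are chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Input: a byte (octet) string, assumed to be UTF-8. U+2013 EN DASH is
# the three bytes E2 80 93 in that encoding.
my $octets = "PERFORMANCE \xe2\x80\x93 A COMPARATIVE STUDY";

# decode(): octets in, Unicode characters out.
my $string = decode("UTF-8", $octets);

# Work on characters, so \x{2013} is a single character here.
$string =~ s/\x{2013}/--/g;

# encode(): Unicode characters in, octets out.
my $octets_out = encode("UTF-8", $string);
print "$octets_out\n";
```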

 
gypark2@gmail.com
      05-22-2007
On May 23, 2:24 am, Brian McCauley <(E-Mail Removed)> wrote:
> On May 21, 1:37 pm, (E-Mail Removed) wrote:
>
> > use Encode;
> > $octets = decode("UTF-8", $str);

>
> Your variable naming is confusing. decode() takes an byte (aka octet)
> string as an argument and returns a string of Unicode characters (not
> a string of bytes).


Oops,

You are right. I copied that code from "perldoc Encode" but I made a
mistake and wrote the names the wrong way round. :'(

Thanks for pointing it out.

 