Replace Unicode character

 
 
Ryan Chan
 
      10-05-2009
Hello,

Below is my code, which is meant to replace the Unicode character "□" with an
empty string. What is wrong with the code?

###################

use strict;
use warnings;
use utf8;

my $s = "□"; # hex value = A1BC

$s =~ s/\xA1\xBC//gi;
print $s;

###################


Thanks.
 
 
 
 
 
Ryan Chan
 
      10-05-2009
Hello,

On Oct 5, 11:24 pm, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
> Your regexp replaces TWO characters, first one A1, second one BC.
>
> Since your target string does not contain either of these
> characters, nothing happens.
>
> BugBear



Even if I use

$s =~ s/\xA1BC//gi;

the result is the same...

Thanks anyway.
 
 
 
 
 
Ben Bullock
 
      10-05-2009
On Oct 6, 12:28 am, Ryan Chan <(E-Mail Removed)> wrote:

> $s =~ s/\xA1BC//gi;


\x{A1BC} works though.

It's documented in "perldoc perlunicode".

According to Unicode::UCD this is the character "YI SYLLABLE LIEX".
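
To make the difference between the three spellings concrete, here is a minimal sketch (the `@forms` list and its labels are just illustration, not from anyone's post). It relies on the documented behaviour that `\x` without braces consumes at most two hex digits, while `\x{...}` takes the whole code point.

```perl
use strict;
use warnings;

# Each entry: a label (single-quoted, so nothing is interpolated) and the
# double-quoted string that the same spelling actually produces.
my @forms = (
    [ '\xA1\xBC', "\xA1\xBC" ],    # two characters: U+00A1 and U+00BC
    [ '\xA1BC',   "\xA1BC"   ],    # U+00A1, then the plain letters "B" and "C"
    [ '\x{A1BC}', "\x{A1BC}" ],    # one character: U+A1BC
);

for my $form (@forms) {
    my ($label, $string) = @$form;
    # List the code point of every character in the string.
    my @code_points = map { sprintf('U+%04X', ord($_)) } split //, $string;
    printf "%-10s length %d: %s\n", $label, length($string), join(' ', @code_points);
}
```

Only the braced form yields a single character, so it is the only one of the three that can match a one-character string at all. Whether U+A1BC is the character actually in the string is a separate question.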

 
 
Peter J. Holzer
 
      10-05-2009
On 2009-10-05 15:36, Ben Bullock <(E-Mail Removed)> wrote:
> On Oct 6, 12:28 am, Ryan Chan <(E-Mail Removed)> wrote:
>> $s =~ s/\xA1BC//gi;

>
> \x{A1BC} works though.
>
> It's documented in "perldoc perlunicode".
>
> According to Unicode::UCD this is the character "YI SYLLABLE LIEX".
>


Also note that the byte sequence "\xA1\xBC" is not the UTF-8 encoding of
U+A1BC. In fact, "\xA1\xBC" is not valid UTF-8 at all. U+A1BC is
"\xEA\x86\xBC" in UTF-8, and the character in Ryan's posting was U+25A1
(WHITE SQUARE), which is "\xE2\x96\xA1" in UTF-8.

hp
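
A short sketch of what this implies for the original code, assuming the character really is U+25A1 as stated above. The string is written with an escape so the example does not depend on how the file is saved; with `use utf8` and a source file saved as UTF-8, a literal "□" would give the same string. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Print the UTF-8 bytes of the two code points mentioned in this thread.
for my $code_point (0x25A1, 0xA1BC) {
    my $bytes = encode('UTF-8', chr($code_point));
    my @hex   = map { sprintf('%02X', ord($_)) } split //, $bytes;
    printf "U+%04X => %s\n", $code_point, join(' ', @hex);
}

# Match the character by its code point, not by bytes of some encoding.
my $s = "a\x{25A1}b";
(my $cleaned = $s) =~ s/\x{25A1}//g;
print "$cleaned\n";    # prints "ab"
```

The `/i` flag from the original substitution is dropped here because case folding has no effect on this character.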

 
 
Jochen Lehmeier
 
      10-05-2009
On Mon, 05 Oct 2009 17:18:20 +0200, Ryan Chan <(E-Mail Removed)>
wrote:

> Below my code which want to replace unicode character "□" with empty
> string, what wrong with the code?


Since it has not been spelled out yet:

$s contains one character. The regex contains two characters. One
character never matches two characters.

Funnily, if you're working in a UTF-8 environment, even a simple \xA1 can
actually be stored as two *bytes*:

> perl -e '$s="\xa1"; print $s; binmode STDOUT,":encoding(utf8)"; print $s;' | hexdump -C

00000000 a1 c2 a1 |...|
00000003
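
The same effect can be reproduced without a shell pipeline. This is a sketch that swaps STDOUT for an in-memory filehandle (`$fh` and `$buf` are illustrative names) so the written bytes can be inspected directly.

```perl
use strict;
use warnings;

my $s = "\xA1";    # one character, U+00A1

# Write to an in-memory buffer instead of STDOUT so the bytes can be examined.
open my $fh, '>', \my $buf or die "cannot open in-memory handle: $!";

print {$fh} $s;                      # no encoding layer: written as one byte
binmode $fh, ':encoding(UTF-8)';     # from here on, characters are encoded as UTF-8
print {$fh} $s;                      # same character, now written as two bytes
close $fh or die "close failed: $!";

# Hex-dump the buffer, one byte per group.
print join(' ', map { sprintf('%02x', ord($_)) } split //, $buf), "\n";
```

The string itself never changes; only the output layer decides how many bytes the character becomes.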

 
 
sln@netherlands.com
 
      10-05-2009
On Mon, 05 Oct 2009 20:56:39 +0200, "Jochen Lehmeier" <(E-Mail Removed)> wrote:

>On Mon, 05 Oct 2009 17:18:20 +0200, Ryan Chan <(E-Mail Removed)>
>wrote:
>
>> Below my code which want to replace unicode character "□" with empty
>> string, what wrong with the code?

>
>Since it has not been spelled out yet:
>
>$s contains one character. The regex contains two characters. One
>character never matches two characters.
>
>Funnily, if you're working in an utf8 environment, even a simple \xA1 can
>actually be stored as two *bytes*:
>
>> perl -e '$s="\xa1"; print $s; binmode STDOUT,":encoding(utf8)"; print $s;' | hexdump -C

>00000000 a1 c2 a1 |...|
>00000003


I guess scalar data can be stored as plain bytes (0..255) until, say,
octets are decoded into Perl's internal form. The resulting string is then
either all ASCII or a mix with the utf8 flag turned on (character semantics).

I think this is Perl's basic storage strategy; it speeds things up.
Encoding just converts the string back into octets, turning off the utf8 flag
(byte semantics). This process is not always symmetrical, and there is
sometimes more than one encoded representation of the same thing.

Sort of a bastardized system.

-sln
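
For what it's worth, the decode/encode round trip and the flag described above can be observed with a small sketch like this one. The flag is an internal storage detail, so treat it as something to look at, not something to build logic on. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "\xE2\x96\xA1";             # three bytes: UTF-8 for U+25A1
my $chars  = decode('UTF-8', $octets);   # one character, flag on
my $back   = encode('UTF-8', $chars);    # three bytes again, flag off

for my $pair ([octets => $octets], [chars => $chars], [back => $back]) {
    my ($name, $value) = @$pair;
    printf "%-6s length %d, utf8 flag %s\n",
        $name, length($value), utf8::is_utf8($value) ? 'on' : 'off';
}

# The same string can have two internal representations: "\xA1" stored as one
# native byte, and an upgraded copy stored internally as UTF-8.
my $native   = "\xA1";
my $upgraded = $native;
utf8::upgrade($upgraded);
print "equal as strings: ", ($native eq $upgraded ? 'yes' : 'no'), "\n";
```

The two copies of "\xA1" compare equal even though only one of them has the flag set, which is probably the "more than one representation" effect mentioned above.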
 