replace unicode characters by &#number; representation

 
 
Alan J. Flavell
      02-21-2004

Suppose I've read in a line of text that can contain a large
repertoire of characters. Assume that I've done it using Perl's
native Unicode support, à la

binmode IN, ':encoding(whatever)';

specifying, of course, the correct external character encoding for
the input file that I'm reading, whatever it might be.

I've come up with this regex[1]

s/([^\0-\177])/'&#'.ord($1).';'/eg;

to replace all non-ASCII characters by their &#number; representation.

Is this indeed the simplest approach, or am I missing some simpler
code than writing ord($1) and using /e to evaluate it?

(Use \377 if the requirement is to retain iso-8859-1 characters and
only to convert the rest).
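
For illustration, here is a minimal sketch of that substitution applied to a made-up sample string. The sample text and the variable names are assumptions for the example, not part of the original post; the substitution itself is the one quoted above.

```perl
use strict;
use warnings;

# Hypothetical sample: "cafe" with e-acute (U+00E9), then a euro sign (U+20AC).
my $text = "caf\x{e9} \x{20ac}5";

# Same substitution as in the post: every character outside \0-\177
# is replaced by its decimal numeric character reference.
(my $escaped = $text) =~ s/([^\0-\177])/'&#'.ord($1).';'/eg;

print "$escaped\n";   # prints: caf&#233; &#8364;5
```

With \377 as the upper limit instead, the e-acute would be left alone and only the euro sign would be converted.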

[1] Yes, I'm aware that - since this appears to be an HTML/XHTML
problem - then the proper place to do this would be in whatever
HTML-processing package/module one is using, but please humour me for
the low-level approach anyway, for the sake of this discussion.
 
Ben Morrow
      02-21-2004

"Alan J. Flavell" wrote:
>
> Suppose I've read-in a line of text which can contain a large
> repertoire of characters. Assume that I've done it using Perl's
> native unicode support, a la
>
> binmode IN, ':encoding(whatever)';
>
> specifying, of course, the correct external character encoding for
> the input file that I'm reading, whatever it might be.
>
> I've come up with this regex[1]
>
> s/([^\0-\177])/'&#'.ord($1).';'/eg;
>
> to replace all non-ASCII characters by their &#number; representation.


I would have said [^[:ascii:]] was clearer.

> Is this indeed the simplest approach, or am I missing some simpler
> code than writing ord($1) and using /e to evaluate it?
>
> (Use \377 if the requirement is to retain iso-8859-1 characters and
> only to convert the rest).


I usually use

use Encode qw/:fallbacks/;

$PerlIO::encoding::fallback = FB_HTMLCREF;
binmode STDOUT, ':encoding(ascii)'; # or iso8859-1, or whatever

which will leave the conversion until the data is output.
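
The same fallback constant can also be tried directly on a string, which makes the effect easy to see without setting up an output layer. This is only a sketch: the sample string and variable names are assumptions, and it calls Encode's encode() rather than the PerlIO layer shown above.

```perl
use strict;
use warnings;
use Encode qw(encode :fallbacks);

# Hypothetical sample: e-acute (U+00E9) and a euro sign (U+20AC).
my $text = "caf\x{e9} \x{20ac}5";

# Target ASCII: every character ASCII cannot represent becomes &#NNN;
my $ascii = encode('ascii', $text, FB_HTMLCREF);

# Target iso-8859-1: e-acute is representable and stays a single byte,
# so only the euro sign is turned into a reference.
my $latin1 = encode('iso-8859-1', $text, FB_HTMLCREF);

print "$ascii\n";   # prints: caf&#233; &#8364;5
```

Switching the target encoding here plays the same role as switching between \177 and \377 in the regex version.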

Ben

[1] NMF

--
Heracles: Vulture! Here's a titbit for you / A few dried molecules of the gall
From the liver of a friend of yours. / Excuse the arrow but I have no spoon.
(Ted Hughes, /Alcestis/)
[ Heracles shoots Vulture with arrow. Vulture bursts into flame, and falls out of sight. ]
 
Anno Siegel
      02-21-2004
Alan J. Flavell wrote in comp.lang.perl.misc:
>
> Suppose I've read-in a line of text which can contain a large
> repertoire of characters. Assume that I've done it using Perl's
> native unicode support, a la
>
> binmode IN, ':encoding(whatever)';
>
> specifying, of course, the correct external character encoding for
> the input file that I'm reading, whatever it might be.
>
> I've come up with this regex[1]
>
> s/([^\0-\177])/'&#'.ord($1).';'/eg;
>
> to replace all non-ASCII characters by their &#number; representation.


Leaving aside whether this should be done at the I/O level, if what
you want is the recoded string, there's nothing wrong with it. In
particular, /e has none of the bad smell of string eval (/ee does,
a bit). Ben's suggestion about :ascii: is a good one, and I'd space
out the Perl code in s///e as usual, so

s/([^[:ascii:]])/'&#' . ord($1) . ';'/eg;

but that's only stylistics.
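
A quick way to convince yourself that the two character classes select the same characters is to compare them over a range of code points. The range checked (0 to 0x2FF) and the variable name are arbitrary choices for this sketch.

```perl
use strict;
use warnings;

# Collect every code point in the range where the two negated classes disagree.
my @mismatch = grep {
    my $c = chr $_;
    ($c =~ /[^[:ascii:]]/ ? 1 : 0) != ($c =~ /[^\0-\177]/ ? 1 : 0)
} 0 .. 0x2FF;

print scalar(@mismatch), " mismatches\n";   # prints: 0 mismatches
```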

> Is this indeed the simplest approach, or am I missing some simpler
> code than writing ord($1) and using /e to evaluate it?


I can't think of anything more elementary than ord().

Anno
 
 
Alan J. Flavell
      02-22-2004
On Sat, 21 Feb 2004, Anno Siegel wrote:

> Alan J. Flavell wrote in comp.lang.perl.misc:
> >
> > s/([^\0-\177])/'&#'.ord($1).';'/eg;

[...]
> Ben's suggestion about :ascii: is a good one,


You (both) have a point, though I was comfortable with having the
ability to switch the upper limit between \177 (ASCII) and \377 (for
iso-8859-1) in an obvious way.
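
One way to keep that switch explicit is to pass the upper limit as a parameter. This is a rough sketch only: escape_above and the sample string are hypothetical names made up for the example, and it tests ord() against the limit instead of editing the character class.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character whose code point exceeds $max
# with its decimal numeric character reference.
# $max = 0x7F (\177) keeps ASCII only; $max = 0xFF (\377) also keeps iso-8859-1.
sub escape_above {
    my ($text, $max) = @_;
    # /s lets "." match newlines too, so every character is inspected.
    $text =~ s/(.)/ord($1) > $max ? '&#' . ord($1) . ';' : $1/seg;
    return $text;
}

my $sample = "caf\x{e9} \x{20ac}5";          # e-acute and a euro sign
print escape_above($sample, 0x7F), "\n";     # prints: caf&#233; &#8364;5
```

Called with 0xFF, the same helper leaves the e-acute untouched and converts only the euro sign.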

Thanks for the other comments, too.

all the best
 
 
Alan J. Flavell
      02-22-2004
On Sat, 21 Feb 2004, Ben Morrow wrote:

> I usually use
>
> use Encode qw/:fallbacks/;


Thanks for pointing that out. I wasn't properly aware of the feature.

> $PerlIO::encoding::fallback = FB_HTMLCREF;
> binmode STDOUT, ':encoding(ascii)'; # or iso8859-1, or whatever
>
> which will leave the conversion until the data is output.


OK, it looks as if the relevant documentation is in e.g.
http://www.perldoc.com/perl5.8.0/lib/Encode.html

Thanks.
 