help needed making unicode entities

 
 
Dan Jacobson
 
      08-07-2003
Why does
use HTML::Entities; use utf8; print HTML::Entities::encode_entities_numeric("\xE7\xA9\x8D");
print
&#xE7;&#xA9;&#x8D;
i.e. three entities, instead of one?

Must I use locale;? In any particular way?

Am I to blame?

Those three bytes represent a Chinese character.

Must I explore pack()?

Not only do I wish to convert one unicode character (three bytes), but
also a whole string of them.

$ perl -v
This is perl, v5.8.0

perldoc Encode's "The UTF-8 flag" holds the answer? And that is what?

perldoc perluniintro isn't helping.

All I want to do is
$ echo '[unicode string]'|perl -plwe 'something;'
and get
&#22823;&#21407;&#38596;&#39340;...
Is that too much to ask?
 
 
 
 
 
Alan J. Flavell
 
      08-07-2003
On Fri, Aug 8, Dan Jacobson inscribed on the eternal scroll:

> Why does
> use HTML::Entities; use utf8; print
> HTML::Entities::encode_entities_numeric("\xE7\xA9\x8D"); print
> &#xE7;&#xA9;&#x8D;
> i.e. three entities, instead of one?


I think I'm going to have to leave the author to answer that; but my
question would be, did you have a reason for choosing that particular
solution? All you're trying to do is decode utf-8 and then represent
the answer in decimal.

> Those three bytes represent a Chinese character.


Yup, I could well believe that those three octets taken as utf-8
indeed represent a CJK unified character.

> Must I explore pack()?


Possibly. But why do you want to write out the nitty details of a
utf-8 coded octet stream? What's the _real_ starting point of this
exercise?

> Not only do I wish to convert one unicode character (three bytes), but
> also a whole string of them.


[ into HTML &bignumber; representations, apparently. ]

Starting from what? If you want to read them in, then read them in
(with :utf8 in effect, of course); and then use ord() to find out
what they are.
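
A minimal sketch of that approach, under a few assumptions: the helper name to_decimal_refs is hypothetical, an in-memory filehandle stands in for the real input, and the stricter :encoding(UTF-8) layer is used where :utf8 would also do.

```perl
use strict;
use warnings;

# Hypothetical helper: turn every non-ASCII character of an already
# decoded string into a decimal &#number; reference.
sub to_decimal_refs {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# Stand-in for a real input file: the utf-8 octets of U+7A4D U+4E39 U+5C3C.
my $octets = "\xE7\xA9\x8D\xE4\xB8\xB9\xE5\xB0\xBC\n";

# Open with a decoding layer so each read returns characters, not octets.
open(my $in, '<:encoding(UTF-8)', \$octets) or die "cannot open: $!";
while (my $line = <$in>) {
    chomp $line;
    print to_decimal_refs($line), "\n";   # &#31309;&#20025;&#23612;
}
close $in;
```

Because the layer does the decoding, ord() sees whole characters and no hand-written utf-8 arithmetic is needed.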

> Is that too much to ask?


Too much? I don't think so, but maybe the best way to reach a good
answer is to present the actual problem, rather than complaining about
an apparently non-working solution to an only incompletely stated
problem.

The easy way, btw, is to read your utf-8-encoded data into Mozilla,
edit it, and then save it as iso-8859-1-encoded. Mozilla will happily
then convert your CJK characters into &bignumber; representations.
But that's clearly off-topic for here.

Disclaimer: I don't read CJK, and at my time of life I'm probably
unlikely to start; but I'm still interested in the character coding
technology.
 
 
 
 
 
Alan J. Flavell
 
      08-08-2003

Let's try again:

On Fri, Aug 8, Dan Jacobson inscribed on the eternal scroll:

> Why does
> use HTML::Entities; use utf8;
> print HTML::Entities::encode_entities_numeric("\xE7\xA9\x8D");
> print
> &#xE7;&#xA9;&#x8D; i.e. three entities, instead of one?


I think the reason is that you've given it three characters, not one.

The "use utf8;" pragma only tells Perl that the source file itself is
encoded in utf-8; it does not change the meaning of an escape such as
\xE7 in a string literal. Each \xNN escape still produces one
character whose ord() value is hex NN, so your literal contains the
three characters U+00E7, U+00A9 and U+008D, and each one gets its own
entity. This is not what you want.
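
To see what such a literal actually contains, one can print its length and code points. This is only a sketch; the variable name is illustrative.

```perl
use strict;
use warnings;
use utf8;    # as in the original one-liner; the \xNN escapes are unaffected

# Each \xNN escape yields one character, so this literal holds three.
my $literal = "\xE7\xA9\x8D";

printf "length: %d\n", length $literal;    # length: 3
printf "code points: %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $literal;
# code points: U+00E7 U+00A9 U+008D
```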

What it appears you're trying to do is to construct the internal utf-8
representation yourself. I don't know why you'd want to do that, but
as far as I understand it, the following kind of code (I'm doing it
"per pedes" rather than trying any clever shortcuts) could do it.

Disclaimer: I'm still a bit of a beginner at this, but nobody else
seems particularly keen to offer answers in this area, it seems, so
I'm doing my best.

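use Encode;

[...]

# The three octets of the utf-8 encoding of one character.
# ("use bytes" is not strictly needed: \xNN escapes already give
# one octet each.)
my $octets;
{
    use bytes;
    $octets = "\xE7\xA9\x8D";
}

# Decode the octets into a one-character Perl string.
my $string = decode_utf8($octets);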

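Note that not all octet sequences represent valid utf-8. As far as I
can tell from the Encode documentation, without a CHECK argument a
malformed sequence is replaced by the substitution character U+FFFD
rather than being reported as an error.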
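
If invalid input should be rejected outright, one option is to pass an explicit CHECK value. The sketch below assumes Encode's FB_CROAK constant and the strict 'UTF-8' encoding name; decode_strict is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode utf-8 octets, returning undef when invalid.
sub decode_strict {
    my ($octets) = @_;
    # With a CHECK value Encode may modify its argument, so use a copy.
    my $copy = $octets;
    # FB_CROAK makes decode() die on malformed input; eval traps that.
    my $chars = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return $chars;
}

my $good = decode_strict("\xE7\xA9\x8D");
printf "U+%04X\n", ord $good;                            # U+7A4D

my $bad = decode_strict("\xFF");                         # never valid utf-8
print defined $bad ? "decoded\n" : "invalid utf-8\n";    # invalid utf-8
```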

If you want to be quick and dirty, I _think_ you can just set the
internal utf8 flag on your octet-string, taking responsibility
yourself for its validity. Further reading on this is at:

http://www.perldoc.com/perl5.8.0/lib/Encode.html


If you're just trying to compose Unicode characters into your source
code, I suppose you'd be better off using the "wide character"
notation, \x{uuuu} to represent the Unicode character U+uuuu (which
you can look up at the unicode web site, see the URLs I posted on
another recent thread re Japanese), rather than hand-coding utf-8
octets in hex. But then, you didn't explain why or how it arose that
you wanted to start from the latter notation - maybe you have your
own good reasons for wanting that...
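
For comparison, a small sketch of the wide-character notation, using the code point U+7A4D that the three octets above decode to:

```perl
use strict;
use warnings;

# One character written with the wide-character escape: U+7A4D.
my $char = "\x{7A4D}";

printf "length: %d\n", length $char;    # length: 1
printf "entity: &#%d;\n", ord $char;    # entity: &#31309;
```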

cheers
 
 
Alan J. Flavell
 
      08-08-2003
On Fri, Aug 8, Dan Jacobson inscribed on the eternal scroll:

> Works! That was pleasant.


nice to hear

> Never did figure out how to move the :utf8 inside the program whilst
> maintaining the -ple. perldoc -f open doesn't enlighten.


AIUI your standard input and output are already open; to apply :utf8
semantics to an already-open filehandle you use the extended form of
binmode(). I'm not sure if that's really the answer to your question,
though.
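
A rough sketch of that idea, with an in-memory filehandle standing in for STDIN (an assumption made so the example is self-contained); the helper name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: switch an already-open handle to :utf8, read one
# line, and return the code points of its characters.
sub code_points_of_first_line {
    my ($fh) = @_;
    binmode($fh, ':utf8');    # affects reads made after this call
    my $line = <$fh>;
    chomp $line;
    return map { ord } split //, $line;
}

# Stand-in for STDIN: a handle opened without any layer over utf-8 octets.
my $octets = "\xE7\xA9\x8D\n";
open(my $in, '<', \$octets) or die "cannot open: $!";

my @points = code_points_of_first_line($in);
print "@points\n";            # 31309
```

The same binmode() call applied to STDIN or STDOUT would be the in-program counterpart of setting the layer from outside.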

> as a batch job (no mozilla)?


My mention of Mozilla was very much an aside - but if you want to
convert an HTML document from any known coding, into one using a
specific coding - say utf-8 - or using &#number; notations, then it's
quite a handy tool, it seems to me, thanks to its syntax-awareness.

But of course something like HTMLtidy, or SP, can do that too. Or XML
tools if you're using XHTML.

> Certainly there is a ready made solution?


As I say, I'm also learning this stuff as I go along, so even if there
*is* one, there's no guarantee I have it at my fingertips. And you
can see for yourself how many other regular contributors here get
involved when the word Unicode is mentioned. Rather few,
unfortunately (which makes me worry a bit...).
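
One candidate that has not come up in this thread, offered only as a sketch: the Encode module documents an FB_HTMLCREF check value that replaces characters the target encoding cannot represent with decimal &#number; references. The helper name octets_to_charrefs is hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: utf-8 octets in, us-ascii with &#number; out.
sub octets_to_charrefs {
    my ($octets) = @_;
    my $chars = Encode::decode('UTF-8', $octets);
    # FB_HTMLCREF substitutes a decimal character reference for each
    # character that us-ascii cannot represent.
    return Encode::encode('ascii', $chars, Encode::FB_HTMLCREF());
}

print octets_to_charrefs("\xE7\xA9\x8D\xE4\xB8\xB9\xE5\xB0\xBC"), "\n";
# &#31309;&#20025;&#23612;
```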

cheers
 
 
Alan J. Flavell
 
      08-10-2003
On Sun, Aug 10, Dan Jacobson inscribed on the eternal scroll:

> Alan> [In perl] to apply :utf8 semantics to an already-open filehandle
> Alan> you use the extended form of binmode().
>
> perldoc -f binmode has no eye grabbing example.


I'm looking at http://www.perldoc.com/perl5.8.0/pod/func/binmode.html

binmode FILEHANDLE, LAYER

[...]

If LAYER is present it is a single string, but may contain multiple
directives. The directives alter the behaviour of the file handle.
When LAYER is present using binmode on text file makes sense.

To mark FILEHANDLE as UTF-8, use :utf8.

Might not be an "eyegrabbing example", but it seems clear enough to
me, no?

Your "eyegrabbing example" seems to be here:
http://www.perldoc.com/perl5.8.0/pod...ml#Unicode-I-O

and on already open streams, use binmode():

binmode(STDOUT, ":utf8");

I would certainly recommend referring back to both perluniintro and
perlunicode while doing this sort of work - they've helped me, anyhow.

cheers
 
 
Dan Jacobson
 
      08-11-2003
Alan> binmode(STDOUT, ":utf8");

Bad news, only the first one works:
echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
PERLIO=:utf8 perl -wple 's/./"&#".ord($&).";"/eg'
積丹尼
echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
perl -wple 'binmode(STDIN,":utf8");s/./"&#".ord($&).";"/eg'
積丹尼
echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
perl -wple 'binmode(STDOUT,":utf8");s/./"&#".ord($&).";"/eg'
積丹尼
echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
perl -wple 'binmode(STDOUT,":utf8");binmode(STDIN,":utf8");s/./"&#".ord($&).";"/eg'
積丹尼
perl -v
This is perl, v5.8.0 built for i386-linux-thread-multi
 
 
Alan J. Flavell
 
      08-11-2003
On Mon, Aug 11, Dan Jacobson inscribed on the eternal scroll:

> Alan> binmode(STDOUT, ":utf8");
>
> Bad news, only the first one works:
> echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
> PERLIO=:utf8 perl -wple 's/./"&#".ord($&).";"/eg'
> 積丹尼


Seems to be one of the possibilities documented in perlrun, so that's
good.

> echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
> perl -wple 'binmode(STDIN,":utf8");s/./"&#".ord($&).";"/eg'
> 積丹尼


I have to confess, I have no familiarity with the details of this part
of the -p option. I'm really not a great one-liner, I'm afraid.

> echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
> perl -wple 'binmode(STDOUT,":utf8");s/./"&#".ord($&).";"/eg'
> 積丹尼


Since you're not trying to send any utf-8-encoded characters (other
than those which are trivially us-ascii) to STDOUT, I'm not sure why
you're suggesting binmode(STDOUT, ...) as being possibly relevant.

Well, it looks as if you have one option which works.

I plead lack of knowledge on the other one, but it's at least
plausible that setting binmode on STDIN ought to work. Maybe someone
reading this who understands the -p processing better than I do would
care to comment - maybe even try reporting a bug - or at least getting
it documented in perlrun?

cheers
 
 
Dave Weaver
 
      08-12-2003
On Mon, 11 Aug 2003 09:47:43 +0800, Dan Jacobson wrote:
>
> Bad news, only the first one works:


> echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
> PERLIO=:utf8 perl -wple 's/./"&#".ord($&).";"/eg'
> 積丹尼


> echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
> perl -wple 'binmode(STDIN,":utf8");s/./"&#".ord($&).";"/eg'
> 積丹尼


Don't know much about utf8 etc, but try putting the binmode in a BEGIN{}
block, so that it is done immediately and only once (rather than once per
line) :

[davew]% echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
perl -wple 'BEGIN{binmode(STDIN,":utf8")};s/./"&#".ord($&).";"/eg'
積丹尼
[davew]% perl -v
This is perl, v5.8.0 built for i386-linux-thread-multi
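
Why the BEGIN block matters can be sketched as follows. Per perlrun, -p wraps the code in roughly "LINE: while (<>) { ... } continue { print }", so the first line has already been read before a binmode() in the body runs. The emulation below uses an in-memory handle in place of STDIN, and first_line_length is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical emulation of the read that -p performs implicitly.
sub first_line_length {
    my ($binmode_first) = @_;
    my $octets = "\xE7\xA9\x8D\n";
    open(my $in, '<', \$octets) or die "cannot open: $!";
    binmode($in, ':utf8') if $binmode_first;        # like BEGIN { binmode ... }
    my $line = <$in>;                               # the implicit read of -p
    binmode($in, ':utf8') unless $binmode_first;    # like binmode in the body
    chomp $line;
    return length $line;
}

print first_line_length(1), "\n";    # 1 (one decoded character)
print first_line_length(0), "\n";    # 3 (three undecoded octets)
```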


--
Cheers,
Dave
 