Velocity Reviews


Jim Higson 07-23-2004 06:18 PM

HTML::Entities::encode() returning wrong(?) entities
 
I'm calling encode_entities on some text I have read from a file, to turn it
into a webpage. According to file:

$ file text/text.en
$ text/text.en: UTF-8 Unicode English text, with very long lines

(although this might not matter)
Anyway, the letter ä appears in the text, and should be changed to &auml;

However, instead it is changed to:
&Atilde;&curren;

I can't see anything unusual about my code. Any ideas why I'm having this
problem?




Jim Higson 07-23-2004 07:43 PM

Re: HTML::Entities::encode() returning wrong(?) entities
 
Jim Higson wrote:

> I'm calling encode_entities on some text I have read from a file, to turn
> it into a webpage. According to file:
>
> $ file text/text.en
> $ text/text.en: UTF-8 Unicode English text, with very long lines
>
> (although this might not matter)
> Anyway, the letter ä appears in the text, and should be changed to &auml;
>
> However, instead it is changed to:
> &Atilde;&curren;
>
> I can't see anything unusual about my code. Any ideas why I'm having this
> problem?



I just found the answer myself - as I suspected it was to do with reading
the unicode in perl. Adding use open ':utf8'; to the top of the source
fixed this (although I'm not quite certain exactly what this means)

Joe Smith 07-25-2004 10:13 AM

Re: HTML::Entities::encode() returning wrong(?) entities
 
Jim Higson wrote:

> $ text/text.en: UTF-8 Unicode English text, with very long lines
> Anyway, the letter ä appears in the text, and should be changed to &auml;


In UTF-8 encoding, the single character "ä" is stored as two bytes:
"\xC3" and "\xA4". If you allow perl to think that the file is ISO-8859-1,
it will interpret those two bytes as "Ã" and "¤", which is why
encode_entities produced &Atilde;&curren;. You need to tell perl
that the file is :utf8 in order for it to recognize those two bytes as
being a single Unicode character.

-Joe
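
A minimal sketch of the effect described above, assuming HTML::Entities is installed (it is not a core module) and using the core Encode module to decode a hard-coded byte string instead of reading a file. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# The two raw bytes that UTF-8 uses for "ä" (U+00E4).
my $bytes = "\xC3\xA4";

# Undecoded, each byte is treated as its own Latin-1 character.
my $wrong = encode_entities($bytes);

# Decoding first turns the two bytes into one character.
my $chars = decode('UTF-8', $bytes);
my $right = encode_entities($chars);

print "$wrong\n";   # &Atilde;&curren;
print "$right\n";   # &auml;
```

The first result matches the output reported in the original post; the second is what was wanted.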

Eric Amick 07-25-2004 09:13 PM

Re: HTML::Entities::encode() returning wrong(?) entities
 
On Fri, 23 Jul 2004 20:43:44 +0100, Jim Higson <jh@333.org> wrote:

>Jim Higson wrote:
>
>> I'm calling encode_entities on some text I have read from a file, to turn
>> it into a webpage. According to file:
>>
>> $ file text/text.en
>> $ text/text.en: UTF-8 Unicode English text, with very long lines
>>
>> (although this might not matter)
>> Anyway, the letter ä appears in the text, and should be changed to &auml;
>>
>> However, instead it is changed to:
>> &Atilde;&curren;
>>
>> I can't see anything unusual about my code. Any ideas why I'm having this
>> problem?

>
>
>I just found the answer myself - as I suspected it was to do with reading
>the unicode in perl. Adding use open ':utf8'; to the top of the source
>fixed this (although I'm not quite certain exactly what this means)


It tells Perl to apply the UTF-8 layer by default to every file opened
within the scope of the pragma. Only you can say whether that is the right
thing. If it isn't, you can specify it for specific files by putting
':utf8' in the mode, the second argument of a three-argument open (for
example '<:utf8'), or with a binmode call on the appropriate filehandle.

--
Eric Amick
Columbia, MD
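
The three options mentioned above can be compared side by side. This is only a sketch: the temporary file and the slurp helper are made up for the illustration, and the file holds nothing but the two UTF-8 bytes for "ä".

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample file holding just the two UTF-8 bytes for "ä".
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} "\xC3\xA4";
close($tmp) or die "close: $!";

# Illustrative helper: read the whole file with the given open mode,
# optionally adding a layer afterwards with binmode.
sub slurp {
    my ($mode, $layer) = @_;
    open(my $fh, $mode, $path) or die "open $path: $!";
    binmode($fh, $layer) if defined $layer;
    local $/;
    my $text = <$fh>;
    close($fh);
    return $text;
}

my $raw        = slurp('<');            # no layer: two separate bytes
my $by_open    = slurp('<:utf8');       # layer given in the open mode
my $by_binmode = slurp('<', ':utf8');   # layer added later with binmode

my $by_pragma;
{
    # Inside this block only, plain opens default to the :utf8 layer.
    use open ':utf8';
    open(my $fh, '<', $path) or die "open $path: $!";
    local $/;
    $by_pragma = <$fh>;
    close($fh);
}

# Lengths in characters: 2 for the raw read, 1 for each decoded read.
printf "%d %d %d %d\n", map { length } $raw, $by_open, $by_binmode, $by_pragma;
```

The pragma is lexically scoped, so it does not affect the slurp helper defined outside the block. For input you do not fully trust, ':encoding(UTF-8)' can be used in place of ':utf8' in any of these spots; it validates the bytes as it decodes them.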

