utf8 and HTML Entities

 
 
Nick Gerber
      09-19-2007
Hi

I'm lost

I have a string encoded in utf8 that is part HTML entities and part
characters in utf-8.

How do I translate the HTML Entities into proper utf-8?

Thanks
 
Ben Bullock
      09-19-2007
On Wed, 19 Sep 2007 14:59:02 +0200, Nick Gerber wrote:


> I have a string encoded in utf8 that is part HTML entities and part
> characters in utf-8.
>
> How do I translate the HTML Entities into proper utf-8?


Since this must be a commonly encountered problem, my first step would be
to check CPAN before writing anything myself. I quickly found:

http://search.cpan.org/~gaas/HTML-Pa...ML/Entities.pm

Please note that I can't vouch for this software since I have not tried it.

As far as utf8 goes, you need the "Encode" module to convert between raw
bytes and Perl character strings.
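
A minimal sketch of how the two modules might be combined, assuming the input arrives as raw UTF-8 bytes and the output should be UTF-8 bytes as well. The helper name entities_to_utf8 and the sample string are made up for illustration; decode, encode and decode_entities are the functions exported by Encode and HTML::Entities.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Entities qw(decode_entities);

# Hypothetical helper (not from the thread): takes UTF-8 encoded bytes that
# also contain HTML entities and returns UTF-8 bytes with entities expanded.
sub entities_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);   # raw bytes -> Perl characters
    $chars = decode_entities($chars);      # &uuml; -> U+00FC, and so on
    return encode('UTF-8', $chars);        # Perl characters -> raw bytes
}

# "caf\xC3\xA9" is "café" as raw UTF-8 bytes; "&uuml;" is an entity.
my $out = entities_to_utf8("caf\xC3\xA9 and M&uuml;nchen");
print $out, "\n";
```

Decoding the bytes first matters: if decode_entities is handed undecoded UTF-8 bytes, the result can mix byte sequences with real characters and come out garbled when printed.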
 
Nick Gerber
      09-20-2007
I tried HTML/Entities.pm, but it didn't do the trick for me. That said, the
problem was probably me: I could not get it to do the conversion. I'll try again.

Thanks

Helmut Wollmersdorfer
      09-21-2007
Nick Gerber wrote:
> I tried HTML/Entities.pm, but it didn't do the trick for me. But, it was
> me that could not make it to do the conversion for me. I'll try again.


This is the approach I use; it works for millions of HTML (or XML) files:

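use HTML::Entities;

my $ENCODING = 'utf8'; # or iso-8859-7, CP1250 etc.

# $DIR and $file are assumed to be set elsewhere.
open(my $fh, "<:encoding($ENCODING)", "$DIR/$file")
    or die "Can't open $DIR/$file: $!";

# Slurp the whole file; a plain <$fh> in scalar context reads only one line.
my $data = do { local $/; <$fh> };
close($fh);

# The input layer already turned bytes into characters; now expand entities.
my $content = decode_entities($data);

binmode(STDOUT, ":utf8");

print "$content\n";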

It is also safe (in most cases) to use

my $content = decode_entities(decode_entities($data));

which also decodes doubly encoded entities such as

&amp;amp;
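
A small sketch of what the double call does, and why it is safe only "in most cases". It assumes HTML::Entities is installed; the variable names and sample strings are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# One pass strips one level of encoding.
my $once  = decode_entities('&amp;amp;');   # now '&amp;'
my $twice = decode_entities($once);         # now '&'

# The caveat: text that was meant to display the literal characters "&lt;"
# is correctly written as '&amp;lt;'. Two passes turn it into '<' instead.
my $over = decode_entities(decode_entities('&amp;lt;'));

print "$once $twice $over\n";
```

So the double call helps with input that was accidentally encoded twice, but it changes the meaning of input that deliberately contains an escaped entity.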



| $ perl -version
| This is perl, v5.8.8 built for i486-linux-gnu-thread-multi

Helmut Wollmersdorfer
 
 
Mumia W.
      09-21-2007
On 09/20/2007 08:31 PM, (E-Mail Removed) wrote:
> On Wed, 19 Sep 2007 14:59:02 +0200, Nick Gerber <(E-Mail Removed)> wrote:
>
>> Hi
>>
>> I'm lost
>>
>> I have a string encoded in utf8 that is part HTML entities and part
>> characters in utf-8.
>>
>> How do I translate the HTML Entities into proper utf-8?
>>
>> Thanks

>
> Should be enough here to get you going:
>
> [ long program snipped ]


No, that's too much.

Mr. Gerber didn't post any code or data, and so he didn't get many
responses because no one knew exactly what he was talking about.

As Mr. Bullock said, HTML::Entities should do it. Here is an example:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities;

binmode(STDOUT, ':utf8');
local $/;
my $data = <DATA>;

$data = decode_entities($data);

print $data, "\n";

__DATA__
&#x8184; &#x8185; &#x8186;
&aacute; &eacute; &iacute; &oacute; &uacute;
&auml; &euml; &iuml; &ouml; &uuml;

 
Nick Gerber
      09-25-2007
Thanks all.

Nick

 
 
 