unicode: is decode-process-encode a "good" approach?

 
 
peter pilsl
 
      09-28-2004

Thanks to Alan and Shawn for their replies to my last posting. I read a lot
of docs before and after, and still do, but it's all very confusing.

Finally I found an approach that actually works for me, and I wanted
to ask you if it makes sense and *might* even keep working, or if it
is just asking for trouble.

I read parameters delivered by the web browser (the HTML header is always
UTF-8 !!), and want to sort and lowercase them and print them out again.
I don't set STDIN and STDOUT to ":utf8", because this does not work with
mod_perl.


.....
my $input=$cgi->param('myfield');
utf8::decode($input);
utf8::downgrade($input);  # otherwise sort will not sort according to
                          # my LC_COLLATE setting, and I need a
                          # localized sort (mainly German data)

my $value=do_a_lot($input);  # do some data processing, including sorting

utf8::upgrade($value);    # otherwise the lc() in the next line would
                          # not lowercase chars like German umlauts
$value=lc($value);
utf8::downgrade($value);  # to make sort work again

$value=do_a_lot_more($value);  # do some more data processing and sorting

utf8::encode($value);
print $value;


So is it ok to get the data somehow "raw" from the webinterface, then
decode it, process it and encode it again to print it out or is this a
rather stupid approach?

Is it normal that I need to decode values delivered by a web page that
has the UTF-8 charset in its header?

Is it ok to clear the utf-8 flag to make sorting work in a locale-way
and set the flag again to make lc() work? Or does this just show that
there is something wrong in my script?
If I use Unicode::Collate I would not need this fiddling with utf-8, but
it is very slow (because it loads the big allkeys.txt file) and might
cause trouble in multithreaded applications (as I read somewhere).

I did not provide a full script, because this posting is long enough as
it is. Hope this is OK.


I also tried to replace the utf8::encode/decode with Encode::from_to but
failed so far, because I actually don't know from what to what I want to
convert. One side is utf8, but what is the other side?


Thanks a lot,
peter





--
http://www2.goldfisch.at/know_list
http://leblogsportif.sportnation.at
 
Ben Morrow
 
      09-30-2004

Quoth peter pilsl:
>
> I read parameters delivered by the web browser (the HTML header is always
> UTF-8 !!), and want to sort and lowercase them and print them out again.
> I don't set STDIN and STDOUT to ":utf8",


I will say, as I often have: I would recommend using :encoding(utf8)
rather than :utf8, as you can then handle malformed utf8 properly.
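
For illustration, a minimal sketch of an :encoding(...) layer on a read handle. The in-memory filehandle and the sample bytes are assumptions made up for this example, and the layer name is spelled in its strict form "UTF-8"; a real script would apply the layer to its own handle.

```perl
use strict;
use warnings;

# Sample input (assumed): the UTF-8 bytes for "Grüße" plus a newline.
my $bytes = "Gr\xC3\xBC\xC3\x9Fe\n";

# Open an in-memory filehandle with a decoding layer, so every read
# returns characters instead of raw bytes.
open my $in, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
my $line = <$in>;
close $in;
chomp $line;

# $line now holds 5 characters; the raw input had 7 bytes before the newline.
printf "%d characters\n", length $line;
```
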

> because this does not work with
> mod_perl.
>
>
> ....
> my $input=$cgi->param('myfield');
> utf8::decode($input);


I would use Encode::decode here, as you'll get better error handling.
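
A minimal sketch of what that error handling could look like, assuming the parameter arrives as raw bytes. The helper name decode_param and the sample values are hypothetical; Encode::FB_CROAK makes decode() die on malformed input, and the eval turns that into an undef return.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded character string, or undef
# if $bytes is not well-formed UTF-8.
sub decode_param {
    my ($bytes) = @_;
    return undef unless defined $bytes;
    my $copy = $bytes;    # decode() may modify its input when CHECK is set
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
}

my $good = decode_param("M\xC3\xBCller");   # valid UTF-8 for "Müller"
my $bad  = decode_param("M\xFCller");       # lone 0xFC byte: malformed UTF-8

print defined $bad ? "decoded\n" : "rejected malformed input\n";
```
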

<snip>
> So is it ok to get the data somehow "raw" from the webinterface, then
> decode it, process it and encode it again to print it out or is this a
> rather stupid approach?
>
> Is it normal that I need to decode values delivered by a web page that
> has the UTF-8 charset in its header?


If you haven't specified that the FH is utf8, then you'll have to decode it
by hand.

> Is it ok to clear the utf-8 flag to make sorting work in a locale-way
> and set the flag again to make lc() work? Or does this just show that
> there is something wrong in my script?


Hmmmmmmm..... I think this is a bad idea. What if you have chars outside
ISO8859-1? I would strongly recommend using Encode::encode to convert it
to ISO8859-1 explicitly, and be prepared to handle errors.

If you read perlunicode it tells you that Unicode and locales currently
don't play nicely together; I'd probably recommend doing something like
this:

my $iso = Encode::encode 'iso8859-1' => $utf8;
{
    use locale;
    do_stuff_with($iso);
}
$utf8 = Encode::decode 'iso8859-1' => $iso;

so that you don't try and use unicode data when locales are switched on.
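
A hedged sketch of the same idea with the error handling spelled out. The helper name locale_sort_latin1 and the word list are hypothetical, and sort stands in for the unspecified do_stuff_with(). The resulting order depends on the LC_COLLATE setting of the environment, so none is shown here.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: sort character strings under the current locale by
# round-tripping them through ISO-8859-1 bytes. Dies if a string contains
# a character that ISO-8859-1 cannot represent.
sub locale_sort_latin1 {
    my @chars = @_;
    my @bytes = map {
        my $copy = $_;    # encode() may modify its input when CHECK is set
        encode('iso-8859-1', $copy, Encode::FB_CROAK);
    } @chars;
    my @sorted;
    {
        use locale;       # lexically scoped: only this block sorts by locale
        @sorted = sort @bytes;
    }
    return map { decode('iso-8859-1', $_) } @sorted;
}

my @words  = ("Zebra", "\x{E4}rger", "Apfel");   # "\x{E4}" is a-umlaut
my @sorted = locale_sort_latin1(@words);

# A CJK character is outside ISO-8859-1, so the helper dies and eval fails.
my $ok = eval { locale_sort_latin1("\x{4E2D}"); 1 };
print "non-Latin-1 input rejected\n" unless $ok;
```
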

Ben

--
We do not stop playing because we grow old;
we grow old because we stop playing.
 
peter pilsl
 
      10-01-2004
>
>>Is it ok to clear the utf-8 flag to make sorting work in a locale-way
>>and set the flag again to make lc() work? Or does this just show that
>>there is something wrong in my script?

>
>
> Hmmmmmmm..... I think this is a bad idea. What if you have chars outside
> ISO8859-1? I would strongly recommend using Encode::encode to convert it
> to ISO8859-1 explicitly, and be prepared to handle errors.
>


Thanks. I got around all these problems now by finding an appropriate
locale for my needs: "de_AT.UTF-8". I get the input from a
non-utf8 filehandle, decode it, and then everything works smoothly,
including sorting, lowercasing and pattern matching (see below). Then I
encode and print out to a non-utf8 filehandle again.
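
As a rough sketch of that decode, process, encode round trip (the sample bytes are assumed, and the locale-dependent sorting step is left out because its result depends on which locales are installed):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed input: the UTF-8 bytes for "ÄRGER", as read from a plain
# (non-utf8) filehandle.
my $raw = "\xC3\x84RGER";

my $text  = decode('UTF-8', $raw);     # bytes -> characters
my $lower = lc $text;                  # lowercases the umlaut as well
my $out   = encode('UTF-8', $lower);   # characters -> bytes for output

print $out, "\n";
```
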


> If you read perlunicode it tells you that Unicode and locales currently
> don't play nicely together; I'd probably recommend doing something like
> this:
>
> my $iso = Encode::encode 'iso8859-1' => $utf8;
> {
> use locale;
> do_stuff_with($iso);
> }
> $utf8 = Encode::decode 'iso8859-1' => $iso;
>
> so that you don't try and use unicode data when locales are switched on.
>


perlunicode states that this is discouraged, but it also explains a bit of
what can happen, and in the end I don't have much of a choice but to use
Unicode and locales.
The data I need to process can definitely include many different
languages and charsets, and the handling (especially collation) should
definitely follow German rules (German text that can include words from
any other language, including Chinese and Hindi and other things I never
heard of). And it should be fast ....

Your idea above looks very smart and I'll definitely give it a very
close look. Currently all my locale stuff works (almost all: see my
other new posting, where there is one construct that makes $s=~/$s/i fail !!).


Thanks a lot,
peter



--
http://www2.goldfisch.at/know_list
http://leblogsportif.sportnation.at
 