
character encodings

 
 
Jaap Karssenberg
Guest
Posts: n/a
 
      11-30-2003
I have a script that should read files as UTF-8, so I used
binmode(FILE, ':utf8'). But now it appears some users have Latin-2
encoded files, causing some regexes to throw warnings about malformed
UTF-8 characters. Is there a way to detect the character encoding and DWIM? I
would hate to have to tell my users they should convert everything to
UTF-8 first.

--
) ( Jaap Karssenberg || Pardus [Larus] | |0| |
: : http://pardus-larus.student.utwente.nl/~pardus | | |0|
) \ / ( |0|0|0|
",.*'*.," Proud owner of "Perl6 Essentials" 1st edition
 
 
 
 
 
Jürgen Exner
Guest
Posts: n/a
 
      11-30-2003
Jaap Karssenberg wrote:
> I have a script that should read files utf8 compliant, so I used
> binmode(FILE, ':utf8'). But now it appears some users have latin2
> encoded files, causing some regexes to throw warnings about malformed
> utf8 chars. Is there a way to detect the character encoding and DWIM


You don't tell us what kind of files those are.
For HTML or XML, for example, the meta charset header or the encoding
attribute, respectively, should tell you what encoding to expect.
Just scan for it and evaluate the rest of the file accordingly.
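
A minimal sketch of that scan, assuming the file uses an ASCII-compatible encoding so the declaration itself can be matched as raw bytes (UTF-16 files would need separate handling). The helper name declared_charset is hypothetical, and the regexes are deliberately loose rather than a full HTML or XML parser:

```perl
use strict;
use warnings;

# Hypothetical helper: look for an encoding declaration in the first
# bytes of an HTML or XML file. Returns the declared name, or undef
# if nothing was found.
sub declared_charset {
    my ($head) = @_;    # raw bytes from the start of the file

    # XML declaration, optionally preceded by a UTF-8 BOM:
    #   <?xml version="1.0" encoding="ISO-8859-2"?>
    return $1
        if $head =~ /\A(?:\xEF\xBB\xBF)?<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z0-9._:-]+)["']/;

    # HTML meta element, e.g.
    #   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">
    return $1
        if $head =~ /<meta\b[^>]*?\bcharset\s*=\s*["']?([A-Za-z0-9._:-]+)/i;

    return;
}
```

The returned name could then be handed to an :encoding(...) layer, after checking that Encode actually knows it.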

jue


 
 
 
 
 
Jaap Karssenberg
Guest
Posts: n/a
 
      11-30-2003
On Sun, 30 Nov 2003 16:52:46 GMT Jürgen Exner wrote:
: You don't tell us what kind of files those are.
: For e.g. HTML or XML the meta charset header resp. the encoding
: attribute should tell you what encoding to expect.
: Just scan for it and evaluate the rest of the file accordingly.

They can be all kinds of files; the script is there to determine what
they are. In general the files have neither metadata attached to them
nor headers with metadata.

--
) ( Jaap Karssenberg || Pardus [Larus] | |0| |
: : http://pardus-larus.student.utwente.nl/~pardus | | |0|
) \ / ( |0|0|0|
",.*'*.," Proud owner of "Perl6 Essentials" 1st edition
 
 
Alan J. Flavell
Guest
Posts: n/a
 
      11-30-2003
On Sun, 30 Nov 2003, Jürgen Exner wrote:

> For e.g. HTML or XML the meta charset header resp. the encoding attribute
> should tell you what encoding to expect.


Maybe. Depends how the files got there. For HTTP transactions it's
legal (and often preferable) to supply the character coding on the
actual HTTP content-type header, and to make no mention of it inside
the actual body of the content. However, the O.P. speaks of "files",
so presumably you're right, and the HTTP transaction issue is outside
this particular problem domain. But there's still the BOM option to
keep in mind!

> Just scan for it and evaluate the rest of the file accordingly.


But you can't scan it without reading it, and you can't read it
without opening it; so you'd have to open it provisionally with *some*
mode, scan for the stuff that you have described - and then maybe
re-open it with a different mode?

OTOH, if one opens it in raw mode, and scans it in a way which can
accommodate itself to different encodings, then, when the relevant
encoding information has been found, the data can be piped through the
appropriate encoding layers explicitly.
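
One way this might look, as a sketch only: open the file as raw bytes, look for a BOM, then push an encoding layer onto the same handle with binmode. The helper name open_with_sniffed_encoding and its fallback parameter are hypothetical; UTF-32 BOMs are ignored here, and a real script would also scan for any in-band declaration before settling on the fallback.

```perl
use strict;
use warnings;

# Hypothetical helper: open $path as raw bytes, inspect the first bytes
# for a BOM, then push an encoding layer onto the same handle.
sub open_with_sniffed_encoding {
    my ($path, $fallback) = @_;

    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $n = read($fh, my $head, 4);
    die "Cannot read $path: $!" unless defined $n;

    # Default to the caller's guess unless a BOM says otherwise.
    my ($encoding, $skip) = ($fallback, 0);
    if    ($head =~ /\A\xEF\xBB\xBF/) { ($encoding, $skip) = ('UTF-8',    3) }
    elsif ($head =~ /\A\xFE\xFF/)     { ($encoding, $skip) = ('UTF-16BE', 2) }
    elsif ($head =~ /\A\xFF\xFE/)     { ($encoding, $skip) = ('UTF-16LE', 2) }

    # Rewind to just past the BOM (if any), then decode from there on.
    seek($fh, $skip, 0) or die "Cannot seek in $path: $!";
    binmode($fh, ":encoding($encoding)")
        or die "Cannot set encoding layer on $path: $!";

    return ($fh, $encoding);
}
```

A call such as open_with_sniffed_encoding($file, 'iso-8859-2') returns a handle that yields decoded characters, plus the encoding name that was chosen.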

There are a lot of options, and I'm not sure of the practical
implications of choosing one or another. If the data is to be
processed by an appropriate HTML or XML module, maybe that module can
adapt to different data encodings as read in raw mode?

What I think it comes down to is that it would definitely be a mistake
to open the file with a utf8 IO layer without being sure that it's
utf-8-encoded, due to the errors that will inevitably result if it
isn't.

hope this helps a bit.
 
 
Ben Morrow
Guest
Posts: n/a
 
      11-30-2003

Jaap Karssenberg wrote:
> I have a script that should read files utf8 compliant, so I used
> binmode(FILE, ':utf8'). But now it appears some users have latin2
> encoded files, causing some regexes to throw warnings about malformed
> utf8 chars. Is there a way to detect the character encoding and DWIM ? I
> would hate to have to tell my users they should convert everything to
> utf8 first.


You can't, in general. One thing you could try is

1. open the file in :raw mode.
2. read a largeish chunk into a $scalar.
3. turn the utf8 flag on with Encode::_utf8_on($scalar);.
4. check if the data is valid with Encode::is_utf8($scalar, 1);.
5. If it is, reopen the file with :utf8. If it ain't, assume latin2
and reopen with :encoding(latin2).

It seems there is no way to check if a sequence of bytes forms valid
utf8 without first setting the utf8 flag on... but never mind
that. Note that it is perfectly possible for data that was in fact
saved in latin2 to pass this test, just rather unlikely; and that if
next week you find some users are using latin1 you're completely
screwed, as there's no way to tell latin1 from latin2.
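
The steps above can be sketched roughly as follows. This is not the exact recipe from the list: instead of setting the utf8 flag by hand, it asks Encode::decode to decode a copy strictly with FB_CROAK and treats a failure as "not UTF-8", which is a swapped-in technique. The helper names, the chunk size and the fallback parameter are all assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode ();

my $CHUNK = 65536;    # assumed sample size; tune as needed

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;    # a copy, so in-place decoding cannot hurt the caller
    return eval { Encode::decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Hypothetical helper: sample the file as raw bytes, then reopen it
# with UTF-8 or with the caller's fallback encoding.
sub open_guessing_utf8 {
    my ($path, $fallback) = @_;

    open(my $raw, '<:raw', $path) or die "Cannot open $path: $!";
    my $n = read($raw, my $bytes, $CHUNK);
    die "Cannot read $path: $!" unless defined $n;
    close($raw);

    # A full chunk may stop in the middle of a multibyte character; drop
    # the last lead byte and its continuation bytes so that a cut-off
    # character is not mistaken for malformed data.
    $bytes =~ s/[\xC0-\xFF][\x80-\xBF]*\z// if $n == $CHUNK;

    my $encoding = looks_like_utf8($bytes) ? 'UTF-8' : $fallback;
    open(my $fh, "<:encoding($encoding)", $path)
        or die "Cannot reopen $path: $!";
    return ($fh, $encoding);
}
```

For example, my ($fh, $enc) = open_guessing_utf8($file, 'iso-8859-2'); gives a decoding handle and reports which guess was taken. Pure ASCII input passes the check and is opened as UTF-8, which is harmless since ASCII reads the same either way.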

Ben

--
Like all men in Babylon I have been a proconsul; like all, a slave ... During
one lunar year, I have been declared invisible; I shrieked and was not heard,
I stole my bread and was not decapitated.
~ ~ Jorge Luis Borges, 'The Babylon Lottery'
 
 
Alan J. Flavell
Guest
Posts: n/a
 
      11-30-2003
On Sun, 30 Nov 2003, Jaap Karssenberg wrote:

> On Sun, 30 Nov 2003 16:52:46 GMT Jürgen Exner wrote:
> : You don't tell us what kind of files those are.
> : For e.g. HTML or XML the meta charset header resp. the encoding
> : attribute should tell you what encoding to expect.
> : Just scan for it and evaluate the rest of the file accordingly.
>
> Can be all kinds of files, the script is there to determine what they
> are.


It can't be done, in general. There is no way to reliably distinguish
between the commonly-used 8-bit codes (iso-8859-whatever, etc.), for
example. It would be sheer guesswork, without some kind of additional
knowledge, language analysis or something.
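
A small illustration of why the 8-bit codes cannot be told apart from the bytes alone: the same two bytes decode without complaint under both iso-8859-1 and iso-8859-2, just to different characters. The helper name codepoints is made up for this sketch.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode $bytes under $encoding and return the
# resulting code points as "U+XXXX" strings.
sub codepoints {
    my ($encoding, $bytes) = @_;
    my $text = Encode::decode($encoding, $bytes);
    return map { sprintf 'U+%04X', ord } split //, $text;
}

my $bytes = "\xE8\xF8";

# Both decodes succeed; nothing in the bytes says which one was meant.
print 'iso-8859-1: ', join(' ', codepoints('iso-8859-1', $bytes)), "\n";
print 'iso-8859-2: ', join(' ', codepoints('iso-8859-2', $bytes)), "\n";
```

Under iso-8859-1 the bytes are e-grave and o-slash; under iso-8859-2 they are c-caron and r-caron. Only knowledge of the language or of the text's origin can pick between them.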

As others have said, utf-8 can be verified for consistency, and the
hypothesis rejected if it proves to be false. But passing the
consistency test doesn't incontrovertibly prove that it's utf-8: it
might just be a coincidence that a particular 8-bit-coded text passed
the utf-8 consistency check.
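
To make the coincidence concrete, here is a sketch with a two-byte string chosen for illustration: it passes a strict UTF-8 decode, yet it is equally legal iso-8859-2 text.

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "\xC3\xA9";

# Strict UTF-8 decode of a copy; the result is undef if malformed.
my $copy      = $bytes;
my $as_utf8   = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };

# The very same bytes read as iso-8859-2.
my $as_latin2 = Encode::decode('iso-8859-2', $bytes);

for my $pair (['UTF-8', $as_utf8], ['iso-8859-2', $as_latin2]) {
    my ($name, $text) = @$pair;
    my $shown = defined $text
        ? join(' ', map { sprintf 'U+%04X', ord } split //, $text)
        : 'malformed';
    print "$name: $shown\n";
}
```

As UTF-8 the bytes are a single e-acute; as iso-8859-2 they are A-breve followed by S-caron. The check accepts the input either way, so a pass only means "consistent with UTF-8".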

So we really _do_ need to know more about your situation if we are to
offer any kind of realistic help.

> In general the files have neither metadata attached to them nor
> headers with metadata.


Then you're stuck with trying heuristic methods, IMHO.
 