Jaap Karssenberg <> wrote:
> I have a script that should read files utf8 compliant, so I used
> binmode(FILE, ':utf8'). But now it appears some users have latin2
> encoded files, causing some regexes to throw warnings about malformed
> utf8 chars. Is there a way to detect the character encoding and DWIM ? I
> would hate to have to tell my users they should convert everything to
> utf8 first.
You can't, in general. One thing you could try is:
1. Open the file in :raw mode.
2. Read a largeish chunk into a $scalar.
3. Turn the utf8 flag on with Encode::_utf8_on($scalar).
4. Check if the data is valid with Encode::is_utf8($scalar, 1).
5. If it is, reopen the file with :utf8. If it ain't, assume latin2
   and reopen with :encoding(latin2).
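The steps above can be sketched roughly as follows. This is only a sketch: guess_layer() and its 64 KB default chunk size are names and numbers made up for illustration, and the latin2 fallback is an assumption, not a detection. Trimming a possibly cut-off multibyte sequence at the end of the chunk is an extra precaution not mentioned in the steps.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of any module): returns the PerlIO layer
# to reopen $path with. ':utf8' if the first chunk is well-formed UTF-8,
# otherwise ':encoding(latin2)' as an assumed fallback.
sub guess_layer {
    my ($path, $chunk_size) = @_;
    $chunk_size ||= 64 * 1024;    # "largeish" is arbitrary here

    # Step 1 and 2: read raw bytes, no decoding.
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $n = read $fh, my $buf, $chunk_size;
    die "Cannot read $path: $!" unless defined $n;
    close $fh;

    # If the chunk was filled completely it may end in the middle of a
    # multibyte character; drop a trailing lead byte and any continuation
    # bytes after it so valid UTF-8 is not rejected for that reason alone.
    $buf =~ s/[\xC0-\xFF][\x80-\xBF]{0,3}\z// if $n == $chunk_size;

    # Step 3 and 4: set the flag, then ask Encode whether the bytes are
    # well-formed UTF-8.
    Encode::_utf8_on($buf);
    return Encode::is_utf8($buf, 1) ? ':utf8' : ':encoding(latin2)';
}
```

Usage would then be something like open(my $in, '<' . guess_layer($file), $file), with the caveats below still applying.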
Poking the utf8 flag by hand is ugly; a strict
Encode::decode('UTF-8', $scalar, Encode::FB_CROAK) inside an eval
should check the same thing without it.
Note that it is perfectly possible for data that was in fact
saved in latin2 to pass this test, just rather unlikely; and that if
next week you find some users are using latin1 you're completely
screwed, as there's no way to tell latin1 from latin2.
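For illustration, here is a minimal sketch of a validity check that leaves the utf8 flag alone, assuming Encode's strict 'UTF-8' decoder with FB_CROAK; looks_like_utf8() is a made-up name. It also shows the false positive mentioned above: the two latin2 characters A-breve and S-caron are stored as the bytes C3 A9, which happen to be well-formed UTF-8 (for e-acute).

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $bytes is well-formed UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;
    # decode() may modify its argument when a CHECK value is given,
    # so work on a copy.
    my $copy = $bytes;
    # Strict 'UTF-8' with FB_CROAK dies on malformed input.
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Typical latin2 text fails: 0xE8 (c-caron) followed by ASCII is not UTF-8.
print looks_like_utf8("\xE8aj") ? "passes\n" : "fails\n";

# But this latin2 data passes: U+0102 U+0160 encode to bytes C3 A9.
my $latin2_bytes = Encode::encode('iso-8859-2', "\x{102}\x{160}");
print looks_like_utf8($latin2_bytes) ? "passes\n" : "fails\n";
```

Nothing comparable exists for latin1 versus latin2, since every byte sequence is valid in both.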
Ben
--
Like all men in Babylon I have been a proconsul; like all, a slave ... During
one lunar year, I have been declared invisible; I shrieked and was not heard,
I stole my bread and was not decapitated.
~
~ Jorge Luis Borges, 'The Babylon Lottery'