Reading Text File Encoding and converting to Perl's internal UTF-8 encoding

 
 
sln@netherlands.com
      04-17-2009
Need help from Unicode gurus or anybody with some knowledge of the subject.

I have a text (character) file that I may only be able to open as-is: I don't know its
encoding, so I can't open it with an encoding layer.

It appears to me that at the start of a text file there may be an encoding mark (or none),
a BOM: a sequence of octets that marks what the file's encoding is.

Given that I am a module and might be passed only a file descriptor, and I rewind the position
to the start of the file, is there any way I can tell the encoding? If I could, and
it's not UTF-8, I could decode() the rest of the file as octets, i.e. decode it in memory,
create a decoded temp file, or possibly re-open the file with the proper encoding.

I think the encoding will be one of the usual 8/16/32-bit UTFs, but with characters from many locales.

I am still sketchy on where to find a list of encoding markers that would let me find out
this information, and on the methods available for analysis and transformation.

I know Perl has a massive 'use Encode' lib; nevertheless, this is what I need to do to finalize
a module I'm working on.

Thanks for the help.
-sln
 
 
Robert Billing
      04-17-2009
(E-Mail Removed) wrote:

> Given that I might be passed a file descriptor only, I am module, and I rewind the position
> to the start of the file, is there any way I can tell the encoding. If I could, and
> its not utf8, I could decode() the rest of the file as octets, ie: in-place memeory decode,
> create a temp file decoded, or possibly re-open it with the proper encoding.


As I understand it (and I have just written some Perl code that happily
mixes two dozen languages in one web page), there isn't a really good way
of doing what you want. Part of the reason is that a block of text
encoded as plain ASCII is exactly, bit for bit, the same when encoded as
UTF-8. It's only when you introduce non-ASCII ("wide") characters, such as
those in other alphabets, that the UTF-8 encoding differs.
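
A minimal sketch of that point, assuming only the core Encode module (the sample strings are made up for illustration): pure ASCII text comes out as the same octets whether it is encoded as ASCII or as UTF-8, while a single accented character makes the encodings diverge.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Pure ASCII text: every code point is below 0x80.
my $ascii_text = "Hello, world";

# Encoding it as ASCII or as UTF-8 yields the same octets,
# so the octets alone cannot tell the two encodings apart.
my $as_ascii = encode('ascii', $ascii_text);
my $as_utf8  = encode('UTF-8', $ascii_text);
print "ASCII text identical: ", ($as_ascii eq $as_utf8 ? "yes" : "no"), "\n";

# One non-ASCII character (U+00E9, e with acute accent) changes that:
# ISO-8859-1 stores it in one octet, UTF-8 needs two.
my $accented    = "caf\x{e9}";
my $acc_latin1  = encode('iso-8859-1', $accented);
my $acc_utf8    = encode('UTF-8', $accented);
printf "latin1: %d octets, UTF-8: %d octets\n",
    length($acc_latin1), length($acc_utf8);
```

So a file that happens to contain only ASCII gives a detector nothing to go on, which is why guessing is the best that can be done without a BOM or other outside information.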

In some cases it may be possible to make an intelligent guess at the
encoding, but no more.
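
One case where the guess is fairly safe is a file that starts with a byte order mark. The following is only a sketch of the rewind-and-sniff idea from the question, not a tested module: sniff_bom and open_decoded are hypothetical helper names, and the table holds the standard Unicode BOM signatures. A file without a BOM still has to be guessed at (the Encode::Guess module that ships with Encode is one option), and FF FE 00 00 is inherently ambiguous, since it could also be a UTF-16LE BOM followed by a NUL character.

```perl
use strict;
use warnings;

# BOM signatures, longest first, so that UTF-32LE (FF FE 00 00)
# is tested before UTF-16LE (FF FE).
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical helper: given the first octets of a file, return
# (encoding name, BOM length), or an empty list if no BOM matches.
sub sniff_bom {
    my ($head) = @_;
    for my $entry (@BOMS) {
        my ($bom, $name) = @$entry;
        return ($name, length $bom) if index($head, $bom) == 0;
    }
    return;
}

# Hypothetical helper: rewind the handle, look for a BOM, position the
# handle just past it, and push the matching :encoding layer.
# Returns the encoding name, or undef when no BOM was found.
sub open_decoded {
    my ($fh) = @_;
    binmode($fh)    or die "binmode: $!";   # read raw octets while sniffing
    seek($fh, 0, 0) or die "seek: $!";
    my $n = read($fh, my $head, 4);
    die "read: $!" unless defined $n;
    my ($name, $skip) = sniff_bom($head);
    seek($fh, defined $skip ? $skip : 0, 0) or die "seek: $!";
    binmode($fh, ":encoding($name)") if defined $name;
    return $name;
}

# Usage with an in-memory "file": a UTF-16LE BOM followed by "hi".
my $octets = "\xFF\xFE" . "h\x00i\x00";
open(my $fh, '<', \$octets) or die "open: $!";
my $enc  = open_decoded($fh);
my $line = <$fh>;
print defined $enc ? "$enc: $line\n" : "no BOM found\n";
```

This needs a seekable handle, so it will not work on a pipe or socket. The File::BOM module on CPAN covers similar ground and may be a better starting point than rolling your own.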

Incidentally, and somewhat off-topic, is there anyone else for whom the
letters UTF automatically mean 'use the force'?

--
I am Robert Billing, Christian, author, inventor, traveller, cook and
animal lover. "It burned me from within. It quickened; I was with book
as a woman is with child."

Quality e-books for portable readers: http://www.alex-library.com
 
 
sln@netherlands.com
      04-17-2009
On Fri, 17 Apr 2009 23:48:10 +0100, Robert Billing <(E-Mail Removed)> wrote:

>(E-Mail Removed) wrote:
>
>> Given that I might be passed a file descriptor only, I am module, and I rewind the position
>> to the start of the file, is there any way I can tell the encoding. If I could, and
>> its not utf8, I could decode() the rest of the file as octets, ie: in-place memeory decode,
>> create a temp file decoded, or possibly re-open it with the proper encoding.

>
>As I understand it, and I have just written some Perl code that happily
>mixes two dozen languages in one web page, there isn't a really good way
>of doing what you want. Part of the reason for this is that given a big
>block of text encoded as plain ASCII, the same text in UTF8 is exactly,
>bit for bit, the same. It's only when you introduce "wide" characters in
>other alphabets that UTF8 does anything.
>
>In some cases it may be possible to make an intelligent guess at the
>encoding, but no more.
>
>Incidentally, and somewhat off-topic, is there anyone else for whom the
>letters UTF automatically mean 'use the force'?


I'm sorry, 'I exists and therefore I am' doesn't seem to work.

-sln
 