How to detect text file encoding in Perl

 
 
chaojen.chen@gmail.com
 
      05-20-2006
Hello all,

If I have a bunch of text files in the same directory and their
encodings could be UTF-16, UTF-8, or ASCII, is there any way of quickly
detecting the exact encoding of each of them?

Thanks,

Enoch Chen

 
 
 
 
 
Anno Siegel
 
      05-20-2006
cnhackTNT wrote in comp.lang.perl.misc:

[Please don't top-post, and leave some attribution. Text re-arranged]

> > Hello all,
> >
> > If I have a bunch of text files in the same directory and their
> > encodings could be UTF-16, UTF-8, or ASCII, is there any way of quickly
> > detecting the exact encoding of each of them?


> Maybe Encode::GUESS could help


Without even looking at it, I'd say a module with its name in all-caps
is suspect. Supposing it is actually spelled that way.

Anno
--
If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers.
 
 
 
 
 
Gunnar Hjalmarsson
 
      05-20-2006
Anno Siegel wrote:
> cnhackTNT wrote in comp.lang.perl.misc:
>>
>>Maybe Encode::GUESS could help

>
> Without even looking at it, I'd say a module with its name in all-caps
> is suspect.


Yeah, it makes you think of creations like POSIX and CGI.

> Supposing it is actually spelled that way.


It's not.

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 
 
Anno Siegel
 
      05-20-2006
Gunnar Hjalmarsson wrote in comp.lang.perl.misc:
> Anno Siegel wrote:
> > cnhackTNT wrote in comp.lang.perl.misc:
> >>
> >>Maybe Encode::GUESS could help

> >
> > Without even looking at it, I'd say a module with its name in all-caps
> > is suspect.

>
> Yeah, it makes you think of creations like POSIX and CGI.


Well, those are acronyms that weren't invented by the authors.

If GUESS were an acronym, the module would be more than suspect of
cutesiness.

> > Supposing it is actually spelled that way.

>
> It's not.


Good to know

Anno
 
 
Brian McCauley
 
      05-20-2006

wrote:
> If I have a bunch of text files in the same directory and their
> encodings could be UTF-16, UTF-8, or ASCII, is there any way of quickly
> detecting the exact encoding of each of them?


Forget quickly; given an ASCII file, it is fundamentally impossible to
tell that it is not UTF-8.

If the UTF-16 starts with a BOM, you can distinguish UTF-8 from UTF-16 by
examining the first two bytes.

That said, Encode::Guess is probably your friend.
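
For illustration, here is a minimal sketch of the Encode::Guess route. The helper names guess_name and guess_file are made up for this example; guess_encoding is the function the module exports. With no extra suspects it tries ascii, utf8, and BOM-marked UTF-16/32, and a pure ASCII file is reported as ascii even though it is equally valid UTF-8.

```perl
use strict;
use warnings;
use Encode::Guess;    # ships with Perl; exports guess_encoding()

# Hypothetical helper: return an encoding name for a string of raw bytes,
# or undef when Encode::Guess cannot decide.
sub guess_name {
    my ($bytes) = @_;
    my $enc = guess_encoding($bytes);
    # On success we get an encoding object; on failure a plain error string.
    return ref $enc ? $enc->name : undef;
}

# Hypothetical helper: slurp a file as raw bytes and guess its encoding.
sub guess_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };    # read the whole file at once
    close $fh;
    return guess_name($bytes);
}
```

Looping over a directory is then a matter of calling guess_file on each name returned by glob or readdir and treating undef as "could not tell".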

 
 
chaojen.chen@gmail.com
 
      05-21-2006

Brian McCauley wrote:

> wrote:
> > If I have a bunch of text files in the same directory and their
> > encodings could be UTF-16, UTF-8, or ASCII, is there any way of quickly
> > detecting the exact encoding of each of them?

>
> Forget quickly; given an ASCII file, it is fundamentally impossible to
> tell that it is not UTF-8.
>
> If the UTF-16 starts with a BOM, you can distinguish UTF-8 from UTF-16 by
> examining the first two bytes.
>
> That said, Encode::Guess is probably your friend.


Hello Brian,

Thanks for your suggestion. And what does BOM stand for?

Enoch

 
 
 
      05-21-2006
wrote:
: >
: > If the UTF-16 starts with a BOM, you can distinguish UTF-8 from UTF-16 by
: > examining the first two bytes.
: >

: Thanks for your suggestion. And what does BOM stand for?

Google is probably your friend. If not: <B>yte <O>rder <M>ark.

You frequently get a BOM at the beginning of your file if you store it
on Windows with Notepad or similar editor simulations. If you choose to
store your data as UTF-8, or your data _is_ UTF-8, you'll see that after
storing the bytecount is two bytes more because the byte 0xff 0xef get
prepended automatically, in order to tell the software which byte order
is to be expected. This makes sense with UCS-2 Unicode (the "original"
Unicode encoding) but not with UTF-8 (8-bit transformation format of
Unicode) because the characters encoded in UTF-8 are self-synchronizing
and no information about byte order is needed. In contrast, other programs
behaving correctly frequently complain if the BOM appears where it simply
doesn't belong.

Oliver.


--
Dr. Oliver Corff
 
 
Alan J. Flavell
 
      05-21-2006
On Sun, 21 May 2006, wrote:

> Google is probably your friend. If not: <B>yte <O>rder <M>ark.


http://www.unicode.org/faq/utf_bom.html#BOM

> store your data as UTF-8, or your data _is_ UTF-8, you'll see that after
> storing the bytecount is two bytes more because the byte 0xff 0xef get
> prepended automatically,


The BOM is the relevant encoding of the Unicode character U+FEFF. No
way is it 0xff 0xef. The various encoded byte patterns are shown in
that Unicode FAQ, and in utf-8 it's *three* bytes.
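
A quick way to see those byte patterns for yourself is to encode U+FEFF with the Encode module. This is only a sketch; bom_hex is a name invented for the example, and the endian-specific encoding names are used so that Encode does not add a BOM of its own.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode the single character U+FEFF in the given
# encoding and return the resulting bytes as upper-case hex.
sub bom_hex {
    my ($encoding) = @_;
    my $bom_char = "\x{FEFF}";
    return uc(unpack('H*', encode($encoding, $bom_char)));
}

# Prints one line per encoding, e.g. "UTF-8     EFBBBF".
printf "%-9s %s\n", $_, bom_hex($_)
    for qw(UTF-8 UTF-16BE UTF-16LE UTF-32BE UTF-32LE);
```

The UTF-8 form is the three bytes EF BB BF, while FE FF and FF FE are the big-endian and little-endian UTF-16 forms.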

> in order to tell the software which byte order is to be expected.


"No, a BOM can be used as a signature no matter how the Unicode text
is transformed"

> This makes sense with UCS-2 Unicode (the "original" Unicode
> encoding)


Yes, but "UCS-2" is out of date:
http://www.unicode.org/faq/basic_q.html#23

The utf-16 encoding form is its present counterpart.

> but not with UTF-8 (8-bit transformation format of Unicode) because
> the characters encoded in UTF-8 are self-synchronizing and no
> information about byte order is needed.


Nevertheless, the Unicode FAQ points out that utf-8 can usefully
start with a BOM as an encoding signature.

> In contrast, other programs behaving correctly frequently complain
> if the BOM appears where it simply doesn't belong.


Except that it is not inherently incorrect for it to appear at the
beginning of a utf-8 stream - but see the cited FAQ for details.

Seems to me you would have done well to read that FAQ yourself, before
putting misleading opinions on the record.

regards

--

Beware of negative easements.
 
 
 
      05-21-2006
Alan J. Flavell wrote:
: > (Oliver's erroneous statement:)
: > storing the bytecount is two bytes more because the byte 0xff 0xef get
: > prepended automatically,

: The BOM is the relevant encoding of the Unicode character U+FEFF. No
: way is it 0xff 0xef.

Oops, I goofed up here, and the twisted order shows exactly what a byte
order mark is good for. Just imagine this would have been transmitted as
UCS-2, in Big Endian order.

: The various encoded byte patterns are shown in
: that Unicode FAQ, and in utf-8 it's *three* bytes.

Again, my fault. Shouldn't post when I'm too tired.

: > This makes sense with UCS-2 Unicode (the "original" Unicode
: > encoding)

: Yes, but "UCS-2" is out of date:
: http://www.unicode.org/faq/basic_q.html#23

But several (notably MS-based) applications still allow the user to choose
UCS-2, UTF-8 _and_ Unicode.

: > but not with UTF-8 (8-bit transformation format of Unicode) because
: > the characters encoded in UTF-8 are self-synchronizing and no
: > information about byte order is needed.

: Nevertheless, the Unicode FAQ points out that utf-8 can usefully
: start with a BOM as an encoding signature.

The FAQ says so, but...

: > In contrast, other programs behaving correctly frequently complain
: > if the BOM appears where it simply doesn't belong.

: Except that it is not inherently incorrect for it to appear at the
: beginning of a utf-8 stream - but see the cited FAQ for details.

But my experience (with shell scripts, interpretation of shebang lines
of perl scripts, etc.) runs to the contrary. A UTF-8-encoded file _with_
BOM causes unnecessary hiccups, even if this is against the formal spec.
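
The shebang case is easy to reproduce. As far as I understand it, the kernel recognises a script by its first two bytes being "#!", so a file that starts with EF BB BF no longer qualifies. A minimal sketch, with both helper names invented here:

```perl
use strict;
use warnings;

# Hypothetical helper: remove one leading UTF-8 BOM (EF BB BF) from raw bytes.
sub strip_utf8_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A\xEF\xBB\xBF//;
    return $bytes;
}

# Hypothetical check: does the data begin with the two bytes "#!"?
sub starts_with_shebang {
    my ($bytes) = @_;
    return substr($bytes, 0, 2) eq '#!' ? 1 : 0;
}

my $script = "\xEF\xBB\xBF#!/usr/bin/perl\nprint qq{hi\\n};\n";
printf "with BOM:    %d\n", starts_with_shebang($script);
printf "without BOM: %d\n", starts_with_shebang(strip_utf8_bom($script));
```

Stripping the BOM this way only makes sense on raw bytes read with the :raw layer, before any decoding has taken place.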

: Seems to me you would have done well to read that FAQ yourself, before
: putting misleading opinions on the record.

Sorry, I should have consulted the FAQ, but I stand by my negative experiences
with superfluous BOMs.

Oliver.

--
Dr. Oliver Corff
 
 
Alan J. Flavell
 
      05-21-2006
On Sun, 21 May 2006, wrote:

> Alan J. Flavell wrote:


[re. my cite of http://www.unicode.org/faq/utf_bom.html#BOM ]

> : Except that it is not inherently incorrect for it to appear at the
> : beginning of a utf-8 stream - but see the cited FAQ for details.
>
> But my experience (with shell scripts, interpretation of shebang
> lines of perl scripts, etc.) runs to the contrary. A UTF-8-encoded
> file _with_ BOM causes unnecessary hiccups, even if this is against
> the formal spec.


Which is pretty much the point that the cited BOM FAQ makes, at
http://www.unicode.org/faq/utf_bom.html#29 , and that was my primary
reason for that suggestion to "see the cited FAQ for details".

regards
 