how to detect the character encoding in a web page ?

 
 
iMath
      12-24-2012
How can I detect the character encoding of a web page, such as this one?

http://python.org/
 
Chris Angelico
      12-24-2012
On Mon, Dec 24, 2012 at 11:34 AM, iMath <redstone-> wrote:
> how to detect the character encoding in a web page ?
> such as this page
>
> http://python.org/


You read part-way into the page, where you find this:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

That tells you that the character set is UTF-8.

ChrisA
 
Hans Mulder
      12-24-2012
On 24/12/12 01:34:47, iMath wrote:
> how to detect the character encoding in a web page ?


That depends on the site: different sites indicate
their encoding differently.

> such as this page: http://python.org/


If you download that page and look at the HTML code, you'll find a line:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

So it's encoded as utf-8.

Other sites declare their charset in the Content-Type HTTP header line.
And then there are sites relying on the default. And sites that get
it wrong, and send data in a different encoding from what they declare.
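
For the HTTP header case, here is a minimal sketch of pulling the charset out of a Content-Type value using only the standard library. The helper name `charset_from_content_type` is made up for illustration, and the header value is assumed to have been read from the response already (for example via `urllib.request`), which is not shown.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() parses the header parameters and lowercases the result.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

A result of None corresponds to the "relying on the default" case, so the caller still has to decide what to assume.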


Welcome to the real world,

-- HansM
 
iMath
      12-24-2012
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/


But how do I get Python to do it for me?

For example, this page:

http://python.org/

How can I detect the character encoding of this web page with Python?
 
iMath
      12-24-2012
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/


But how do I get Python to do it for me?

For example, these two pages:

http://python.org/
http://msdn.microsoft.com/en-us/libr...ffice.12).aspx

How can I detect the character encoding of these two pages with Python?
 
Kurt Mueller
      12-24-2012
On 24.12.2012 at 04:03, iMath wrote:
> but how to let python do it for you ?
> such as these 2 pages
> http://python.org/
> http://msdn.microsoft.com/en-us/libr...ffice.12).aspx
> how to detect the character encoding in these 2 pages by python ?



If you have the HTML, let
chardetect.py
make an educated guess for you.

http://pypi.python.org/pypi/chardet

Example:
$ wget -q -O - http://python.org/ | chardetect.py
stdin: ISO-8859-2 with confidence 0.803579722043
$

$ wget -q -O - 'http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx' | chardetect.py
stdin: utf-8 with confidence 0.87625
$
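
The same guess can be made from inside Python. This is a minimal sketch, assuming the third-party chardet package is installed (`pip install chardet`); `guess_encoding` and its `min_confidence` threshold are illustrative names, not part of chardet. As the python.org result above suggests, the guess is only a hint.

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None


def guess_encoding(data, min_confidence=0.5):
    """Return chardet's guess for raw bytes, or None if unavailable or unsure."""
    if chardet is None:
        return None
    result = chardet.detect(data)  # dict with 'encoding' and 'confidence' keys
    if result["encoding"] and result["confidence"] >= min_confidence:
        return result["encoding"]
    return None


# Hypothetical sample; real input would be the raw bytes of the downloaded page.
sample = "Grüße aus Zürich, wie geht's?".encode("utf-8") * 10
print(guess_encoding(sample))
```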


Gressli
--


 
Kwpolska
      12-24-2012
On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller wrote:
> $ wget -q -O - http://python.org/ | chardetect.py
> stdin: ISO-8859-2 with confidence 0.803579722043
> $


And it sucks, because it uses magic instead of reading the HTML tags.
The RIGHT thing to do for websites is to detect the meta charset
definition, which is

<meta http-equiv="content-type" content="text/html; charset=utf-8">

or

<meta charset="utf-8">

The second one is for HTML5 websites. When matching either form, you
may need to normalize case and allow for the useless ` /` at the end.
But if somebody is using HTML5, you are pretty much guaranteed to get UTF-8.
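
A minimal sketch of reading either meta form from Python with the standard library's `html.parser`, which already lowercases tag and attribute names and accepts the trailing ` /`. The names `MetaCharsetParser` and `charset_from_meta` are made up for illustration, and scanning only the first 4096 bytes is an assumption, not a rule taken from this thread.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Remember the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # HTML5 form: <meta charset="utf-8">
            self.charset = attrs["charset"].strip().lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older form: <meta http-equiv="content-type" content="...; charset=...">
            content = attrs.get("content")
            if content:
                msg = Message()
                msg["Content-Type"] = content
                self.charset = msg.get_content_charset()


def charset_from_meta(raw_bytes, limit=4096):
    """Return the charset declared in the page's meta tags, or None."""
    parser = MetaCharsetParser()
    # The tags of interest are ASCII, so a lossy ASCII decode is enough here.
    parser.feed(raw_bytes[:limit].decode("ascii", errors="replace"))
    return parser.charset


page = b'<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8" /></head>'
print(charset_from_meta(page))  # utf-8
```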

In today’s world, the proper assumption to make is “UTF-8 or GTFO”.
Because nobody in their right mind would use something else today.

--
Kwpolska <http://kwpolska.tk>
stop html mail | always bottom-post
www.asciiribbon.org | www.netmeister.org/news/learn2quote.html
GPG KEY: 5EAAEA16
 
Steven D'Aprano
      12-24-2012
On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote:

> On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller wrote:
>> $ wget -q -O - http://python.org/ | chardetect.py
>> stdin: ISO-8859-2 with confidence 0.803579722043
>> $

>
> And it sucks, because it uses magic, and not reading the HTML tags. The
> RIGHT thing to do for websites is detect the meta charset definition,
> which is
>
> <meta http-equiv="content-type" content="text/html; charset=utf-8">
>
> or
>
> <meta charset="utf-8">
>
> The second one for HTML5 websites, and both may require case conversion
> and the useless ` /` at the end. But if somebody is using HTML5, you
> are pretty much guaranteed to get UTF-8.
>
> In today’s world, the proper assumption to make is “UTF-8 or GTFO”.
> Because nobody in the right mind would use something else today.


Alas, there are many, many, many, MANY websites that are created by
people who are *not* in their right mind. To say nothing of 15-year-old
websites that use a legacy encoding. And to support those, you may need
to guess the encoding, and for that, chardetect.py is the solution.


--
Steven
 
Roy Smith
      12-24-2012
In article <rn%Bs.693798$4>,
Alister wrote:

> Indeed due to the poor quality of most websites it is not possible to be
> 100% accurate for all sites.
>
> Personally I would start by checking the doctype & then the metadata as
> these should be quick & correct. I then use chardetect only if these
> fail to provide any result.


I agree that checking the metadata is the right thing to do. But, I
wouldn't go so far as to assume it will always be correct. There's a
lot of crap out there with perfectly formed metadata which just happens
to be wrong.

Although it pains me greatly to quote Ronald Reagan as a source of
wisdom, I have to admit he got it right with "Trust, but verify". It's
the only way to survive in the Unicode world. Write defensive code.
Wrap try blocks around calls that might raise exceptions if the external
data is borked w/r/t what the metadata claims it should be.
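
A minimal sketch of that "trust, but verify" idea: try the declared encoding first, and fall back when the bytes do not actually decode. The function name `decode_with_fallback` and the fallback list are assumptions for illustration; the final latin-1 step never fails, because every byte value maps to a character, so its output may be wrong but the call will not raise.

```python
def decode_with_fallback(raw, declared=None, fallbacks=("utf-8", "cp1252")):
    """Return (text, encoding_used), trying the declared encoding first."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: bytes do not match the claimed encoding.
            # LookupError: the declared name is not a codec Python knows.
            continue
    # Last resort: latin-1 accepts any byte sequence.
    return raw.decode("latin-1"), "latin-1"


# Bytes that are really cp1252, served with a wrong "utf-8" declaration.
raw = "naïve café".encode("cp1252")
print(decode_with_fallback(raw, declared="utf-8"))
```

The declared value would come from the HTTP header or the meta tag; a detector's guess could be appended to the fallback list.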
 