HTML - How should Chinese sites be encoded to be listed in search engine?

 
Old 07-17-2006, 03:39 PM   #1


Do Google and other search engines index non-English websites that are
encoded in UTF-8?

Most Chinese websites indexed on Google appear to use
- for Traditional Chinese: charset=big5, encoding=ANSI
- for Simplified Chinese: charset=gb2312, encoding=ANSI


Do they also support
charset=UTF-8, encoding=UTF-8?



Pat
Old 07-17-2006, 03:53 PM   #2
Nikita the Spider
 

"Pat" wrote:

> Do Google and other search engines index non-English websites that are
> encoded in UTF-8?


Yes.


> Most Chinese websites indexed on Google appear to use
> - for Traditional Chinese: charset=big5, encoding=ANSI
> - for Simplified Chinese: charset=gb2312, encoding=ANSI


I don't have any experience with Asian encodings, but my guess is that
Big5 is preferable to UTF-8 because it is more compact (i.e. takes up
less space) when most of the characters are Asian. If you don't mind
larger pages, UTF-8 should be fine.

HTH

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Old 07-17-2006, 05:58 PM   #3
Dylan Sung
 



Encodings like GB and Big5 are double-byte encodings: each Chinese character
takes two bytes. Unicode (UTF-8 at least), however, uses three or more bytes
for East Asian characters (among others). So yes, in terms of economy, GB and
Big5 yield text files that have fewer bytes.
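
The size difference can be checked with a short Python sketch. The sample strings and the helper name byte_len are assumptions made for illustration, not anything from the thread; the codec names are the ones Python's standard library uses.

```python
# Sample strings are illustrative assumptions, not taken from the thread.
SIMPLIFIED = "中文网站编码"    # 6 simplified characters
TRADITIONAL = "中文網站編碼"   # the same phrase in traditional characters


def byte_len(text: str, encoding: str) -> int:
    """Return how many bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))


print("gb2312:", byte_len(SIMPLIFIED, "gb2312"))   # 2 bytes per character
print("big5:  ", byte_len(TRADITIONAL, "big5"))    # 2 bytes per character
print("utf-8: ", byte_len(SIMPLIFIED, "utf-8"))    # 3 bytes per character

# ASCII markup costs 1 byte per character in all three encodings,
# so a tag-heavy page grows by less than pure Chinese text does.
snippet = "<p>中文</p>"
print(byte_len(snippet, "gb2312"), byte_len(snippet, "utf-8"))
```

Since HTML tags and attributes are plain ASCII, the real-world overhead of UTF-8 on a typical page is likely smaller than the raw per-character figures suggest.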

You can view the repertoire of characters in Unicode as containing GB and
Big5 as subsets, so you can convert directly from GB to Unicode and from
Big5 to Unicode. However, there are characters in GB which do not occur in
Big5 and vice versa, so conversion between the two is lossy. My guess is
that Google employs search algorithms which convert characters to UTF-8 and
then search for web pages in both simplified (GB) and traditional (Big5)
characters at the same time; at least, this is what I get when I enter
characters from one character set or the other into their search field.
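
A minimal Python sketch of the subset relationship described above. The helper can_encode and the two sample characters are hypothetical choices for illustration; it says nothing about how Google actually works.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


simplified = "语"    # U+8BED, simplified form (illustrative sample)
traditional = "語"   # U+8A9E, traditional form of the same character

# Legacy bytes -> Unicode -> the same legacy bytes: nothing is lost.
gb_bytes = simplified.encode("gb2312")
assert gb_bytes.decode("gb2312").encode("gb2312") == gb_bytes

# Each character fits its own legacy charset and UTF-8, but not the other one.
for ch in (simplified, traditional):
    print(ch, {enc: can_encode(ch, enc) for enc in ("gb2312", "big5", "utf-8")})
```

Going through Unicode is safe in either direction, while a direct GB-to-Big5 conversion has to drop or substitute any character the target set lacks.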

Dyl.

Old 07-17-2006, 06:00 PM   #4
Dylan Sung
 





Sorry, I didn't answer the original question. I think that web pages should
declare the encoding they actually use: GB when GB is used, and so
forth. Search engines can do the rest.
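
As a rough sketch of "declare what you use", the Python below builds a page whose meta declaration names the same charset the bytes are written in. The function render_page and the sample text are hypothetical; a real site would normally also send the charset in the HTTP Content-Type header.

```python
def render_page(body: str, charset: str) -> bytes:
    """Build a tiny HTML page whose declared charset matches its actual bytes."""
    html = (
        "<html><head>"
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">'
        "</head><body>"
        f"{body}"
        "</body></html>"
    )
    # Encode with the very charset named in the <meta> tag.
    return html.encode(charset)


page = render_page("中文网站", "gb2312")

# A reader that trusts the declaration recovers the original text.
assert "中文网站" in page.decode("gb2312")
```

The same function works unchanged with "big5" for traditional text or "utf-8" for either, which is the point: the label and the bytes stay in step.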

Dyl.
