HTML - How should Chinese sites be encoded to be listed in search engine? |
#1 |
|
Would Google and other search engines support the indexing of non-English UTF-8 encoded websites?

Most Chinese websites indexed on Google appear to use:
- for Traditional Chinese, charset=big5, encoding=ANSI
- for Simplified Chinese, charset=gb2312, encoding=ANSI

Do they also support charset=UTF-8, encoding=UTF-8?

Pat
|
|
|
|
#2 |
|
Posts: n/a
|
In article <. com>, "Pat" <> wrote:

> Would Google and other search engines support the indexing of
> non-English UTF-8 encoded websites?

Yes.

> Most Chinese websites indexed on Google appear to use:
> - for Traditional Chinese, charset=big5, encoding=ANSI
> - for Simplified Chinese, charset=gb2312, encoding=ANSI

I don't have any experience with Asian encodings, but my guess is that Big5 is preferable to UTF-8 because it is more efficient (i.e. takes up less space) when most of the characters are Asian. If you don't mind fatter pages, UTF-8 should be fine.

HTH

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
|
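The size difference guessed at above can be checked with a short Python sketch. The sample strings and the helper name `encoded_sizes` are illustrative assumptions, not something from the thread; the codecs are the ones shipped with Python's standard library.

```python
# Compare how many bytes the same Chinese text needs in a legacy
# double-byte encoding versus UTF-8. The sample strings are made up
# for illustration.
samples = {
    "Traditional": ("中文網站", "big5"),
    "Simplified": ("中文网站", "gb2312"),
}


def encoded_sizes(text, legacy_codec):
    """Return (legacy_byte_count, utf8_byte_count) for the given text."""
    return len(text.encode(legacy_codec)), len(text.encode("utf-8"))


for label, (text, codec) in samples.items():
    legacy, utf8 = encoded_sizes(text, codec)
    print(f"{label}: {codec} = {legacy} bytes, utf-8 = {utf8} bytes")
```

Note that HTML markup is plain ASCII, which takes one byte per character in all three encodings, so the gap for a whole page is smaller than for the Chinese text alone.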
|
|
#3 |
|
Posts: n/a
|
"Nikita the Spider" <> wrote in message news:NikitaTheSpider-...
> In article <. com>,
> "Pat" <> wrote:
>
>> Would Google and other search engines support the indexing of
>> non-English UTF-8 encoded websites?
>
> Yes.
>
>> Most Chinese websites indexed on Google appear to use:
>> - for Traditional Chinese, charset=big5, encoding=ANSI
>> - for Simplified Chinese, charset=gb2312, encoding=ANSI
>
> I don't have any experience with Asian encodings, but my guess is that
> Big5 is preferable to UTF-8 because it is more efficient (i.e. takes up
> less space) when most of the characters are Asian. If you don't mind
> fatter pages, UTF-8 should be fine.

Encodings like GB and Big5 are double-byte encodings. However, Unicode (UTF-8 at least) uses three or more bytes for Far East Asian characters (amongst others). So yes, in terms of economy, GB and Big5 yield text files that have fewer bytes.

You can view the repertoire of characters in Unicode as having subsets of GB and Big5 within it, and thus you can do direct conversions from GB to Unicode, and from Big5 to Unicode. However, there are characters in GB which do not occur in Big5 and vice versa, so conversion between the two is lossy.

My guess is that Google employs searching algorithms which convert characters to UTF-8 and then search for webpages which contain both simplified characters in GB and traditional characters in Big5, all at the same time. At least, this is what I get when I enter characters from one or the other character set into their search field.

Dyl.
|
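The lossy conversion described above can be illustrated with a minimal Python sketch. The helper names `can_encode` and `transcode` and the sample characters are assumptions chosen for illustration; the sketch relies on Python's built-in `gb2312` and `big5` codecs and converts through Unicode, as the post suggests.

```python
def can_encode(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def transcode(data, source, target):
    """Convert bytes from source to target encoding via Unicode.

    Characters missing from the target encoding are replaced with '?'.
    """
    return data.decode(source).encode(target, errors="replace")


simplified = "说"   # simplified form, chosen as an example
traditional = "說"  # traditional form of the same word

gb_bytes = simplified.encode("gb2312")

# GB -> Unicode (UTF-8) keeps the character intact.
print(transcode(gb_bytes, "gb2312", "utf-8").decode("utf-8"))

# GB -> Big5 loses it, because Big5 has no slot for the simplified form.
print(transcode(gb_bytes, "gb2312", "big5"))

# The reverse direction has the same problem.
print(can_encode(traditional, "big5"), can_encode(traditional, "gb2312"))
```

Going through Unicode only changes the byte representation. It does not map simplified forms to traditional ones, which is a separate problem this sketch does not attempt.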
|
|
#4 |
|
Posts: n/a
|
"Dylan Sung" <> wrote in message news:e9gfj6$rli$...
>
> "Nikita the Spider" <> wrote in message
> news:NikitaTheSpider-...
>> In article <. com>,
>> "Pat" <> wrote:
>>
>>> Would Google and other search engines support the indexing of
>>> non-English UTF-8 encoded websites?
>>
>> Yes.
>>
>>> Most Chinese websites indexed on Google appear to use:
>>> - for Traditional Chinese, charset=big5, encoding=ANSI
>>> - for Simplified Chinese, charset=gb2312, encoding=ANSI
>>
>> I don't have any experience with Asian encodings, but my guess is that
>> Big5 is preferable to UTF-8 because it is more efficient (i.e. takes up
>> less space) when most of the characters are Asian. If you don't mind
>> fatter pages, UTF-8 should be fine.
>
> Encodings like GB and Big5 are double-byte encodings. However, Unicode
> (UTF-8 at least) uses three or more bytes for Far East Asian characters
> (amongst others). So yes, in terms of economy, GB and Big5 yield text
> files that have fewer bytes.
>
> You can view the repertoire of characters in Unicode as having subsets of
> GB and Big5 within it, and thus you can do direct conversions from GB
> to Unicode, and from Big5 to Unicode. However, there are characters in GB
> which do not occur in Big5 and vice versa, so conversion between the two
> is lossy. My guess is that Google employs searching algorithms which
> convert characters to UTF-8 and then search for webpages which contain
> both simplified characters in GB and traditional characters in Big5, all
> at the same time. At least, this is what I get when I enter characters
> from one or the other character set into their search field.

Sorry, I didn't answer the original question. I think that web pages should declare whichever encoding they actually use: GB when GB is used, and so forth. Search engines can do the rest.

Dyl.
|
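As a rough sketch of "declare the encoding you actually use", the Python snippet below builds a page whose declared charset always matches the bytes it is saved in. The function name `render_page`, the page skeleton, and the sample text are hypothetical. The `meta http-equiv` form is assumed here because it matches the `charset=...` declarations quoted in the thread.

```python
def render_page(body_text, charset):
    """Return (content_type_header, page_bytes) for a minimal HTML page.

    The same charset name is used for the declaration and for encoding
    the bytes, so the two cannot drift apart.
    """
    html = (
        "<html><head>"
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">'
        "<title>Example</title></head>"
        f"<body><p>{body_text}</p></body></html>"
    )
    header = f"text/html; charset={charset}"
    return header, html.encode(charset)


# Hypothetical pages: one Simplified Chinese in GB2312, one in UTF-8.
for text, charset in [("中文网站", "gb2312"), ("中文网站", "utf-8")]:
    header, page = render_page(text, charset)
    print(header, len(page), "bytes")
```

Sending the same value in the HTTP `Content-Type` header and in the page itself gives a crawler two consistent signals rather than leaving it to guess.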