Velocity Reviews

usgog@yahoo.com 04-11-2008 02:08 AM

identify the language of a web page
 
Suppose I need to classify 10000 web pages based on their languages.
What should I look for to determine the language of each web page? Any
advice is welcome.

Andreas Prilop 04-11-2008 01:26 PM

Re: identify the language of a web page
 
On Thu, 10 Apr 2008, usgog@yahoo.com wrote:

> Suppose I need to classify 10000 web pages based on their languages.
> What should I look for to determine the language of each web page?


The "lang" attribute in HTML; the "xml:lang" attribute in XHTML.
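As a rough illustration of reading those attributes, here is a minimal Python sketch using only the standard library's html.parser. The names LangAttrParser and declared_language are made up for this example, and it only looks at the root <html> element, so pages that declare the language elsewhere (or not at all) will come back as None.

```python
from html.parser import HTMLParser


class LangAttrParser(HTMLParser):
    """Record the lang / xml:lang value declared on the root <html> element."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "html" and self.lang is None:
            attr_map = dict(attrs)
            # Prefer "lang" (HTML), fall back to "xml:lang" (XHTML).
            self.lang = attr_map.get("lang") or attr_map.get("xml:lang")


def declared_language(markup):
    """Return the declared language code of a page, or None if absent."""
    parser = LangAttrParser()
    parser.feed(markup)
    return parser.lang


if __name__ == "__main__":
    print(declared_language('<html lang="fr"><body>Bonjour</body></html>'))
```

Many real pages omit the attribute or get it wrong, so the result is best treated as a hint rather than a definitive answer.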

Richard Tobin 04-11-2008 01:34 PM

Re: identify the language of a web page
 
In article <26414fbe-a0ef-48c9-af07-0575e05b1da1@p39g2000prm.googlegroups.com>,
usgog@yahoo.com <usgog@yahoo.com> wrote:

>Suppose I need to classify 10000 web pages based on their languages.
>What should I look for to determine the language of each web page? Any
>advice is welcome.


Assuming you want to do this by inspection of the text (rather than
looking for xml:lang and the like), Google for "language
identification". The first page of results lists several tools and a
research bibliography on the subject.
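To give a flavour of what inspecting the text can mean, here is a toy Python sketch that counts common function words per language. It is not how any particular tool works; the STOPWORDS table and the guess_language name are assumptions for illustration, and the word lists are far too small for real use.

```python
import re

# Tiny, hand-picked word lists (an assumption for this sketch only).
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "fr": {"le", "la", "les", "et", "de", "des", "est", "un", "une"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    # Split into lowercase runs of letters (Unicode-aware, no digits).
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(guess_language("Le chat est dans le jardin et la maison"))
```

For 10000 pages you would strip the markup first and feed only the visible text to whatever identifier you pick.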

-- Richard
--
:wq

