Language detection module..

 
 
AR
 
      01-21-2004
Does any module/script exist that can detect the language of a text with
100% accuracy, for example English, German, French, ... (European
languages, or at least English)?
 
 
 
 
 
Ben Morrow
 
      01-21-2004

AR <> wrote:
> Does exist any module/script that can 100% detect text language..
> for example English, German, French, ... (European languages, at least
> English...)


100%? No. What language is this string: "hotel"?

Ben

--
Joy and Woe are woven fine,
A Clothing for the Soul divine
Under every grief and pine
Runs a joy with silken twine.
    William Blake, 'Auguries of Innocence'
 
 
 
 
 
J.B. Moreno
 
      01-21-2004
Ben Morrow <> wrote:

> AR <> wrote:
> > Does exist any module/script that can 100% detect text language..
> > for example English, German, French, ... (European languages, at least
> > English...)

>
> 100%? No. What language is this string: "hotel"?


Swahili?

--
JBM
"Everything is futile." -- Marvin of Borg
 
 
Alan J. Flavell
 
      01-21-2004
On Wed, 21 Jan 2004, Ben Morrow wrote:

> 100%? No. What language is this string: "hotel"?


Yeah, ask a German speaker what language this is: "Gift".
 
 
Tad McClellan
 
      01-21-2004
Ben Morrow <> wrote:
>
> AR <> wrote:
>> Does exist any module/script that can 100% detect text language..
>> for example English, German, French, ... (European languages, at least
>> English...)

>
> 100%? No. What language is this string: "hotel"?



Military? (the letter "H") ?


--
Tad McClellan SGML consulting
Perl programming
Fort Worth, Texas
 
 
Martin Quensel
 
      01-21-2004
J.B. Moreno wrote:
> Ben Morrow <> wrote:
>
>
>>AR <> wrote:
>>
>>>Does exist any module/script that can 100% detect text language..
>>>for example English, German, French, ... (European languages, at least
>>>English...)

>>
>>100%? No. What language is this string: "hotel"?

>
>
> Swahili?

Start by adding all words from all the dictionaries in the world in a file.
Then using statistics you get the most likely one.
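
A minimal sketch of that dictionary idea, assuming three tiny hand-picked word lists as hypothetical stand-ins for real dictionaries, and ASCII-only input. The names `score_languages` and `best_guesses` are made up for illustration. Each language is scored by the fraction of sample words found in its list, and every language tied for the top score is returned, which also shows why "hotel" cannot be pinned down:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical, tiny word lists; a real run would load full dictionaries.
my %words = (
    English => [qw(the and is of to in that it hotel gift)],
    German  => [qw(der die das und ist nicht ein zu hotel gift)],
    French  => [qw(le la les et est un une de hotel pas)],
);

# Turn each list into a lookup hash once.
my %lookup;
for my $lang (keys %words) {
    $lookup{$lang} = { map { $_ => 1 } @{ $words{$lang} } };
}

# Return language => fraction of the sample's words found in that list.
sub score_languages {
    my ($text) = @_;
    my @tokens = grep { length } split /[^[:alpha:]]+/, lc $text;
    my %score;
    for my $lang (keys %lookup) {
        my $hits = grep { $lookup{$lang}{$_} } @tokens;
        $score{$lang} = @tokens ? $hits / @tokens : 0;
    }
    return %score;
}

# Return every language tied for the best score (empty list if nothing matched).
sub best_guesses {
    my ($text) = @_;
    my %score = score_languages($text);
    my $max   = max(values %score);
    return () unless $max;
    my @best = grep { $score{$_} == $max } keys %score;
    return sort @best;
}

print join(', ', best_guesses('Das ist nicht die Antwort')), "\n";  # German
print join(', ', best_guesses('hotel')), "\n";  # English, French, German
```

With lists this small the scores mean very little; the point is only that the result is a ranking, and that ties are a normal outcome for short samples.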

Or why not just this?

#!/usr/bin/perl
use strict;
use warnings;

# One answer that covers every possible input.
print "String is in any known language or some constructed language ",
      "such as Esperanto, Volapuk, Glosa, Loglan, or even Klingon.\n";


Now that would almost certainly cover 95% of all the languages (I missed
adding the Tolkien languages, but I leave that as a programming exercise).
But I'm not sure if it's 100% future-proof. The "any known language"
could be interpreted as "known" to the person running the program.

Best Regards
Martin Quensel

 
 
J.B. Moreno
 
      01-22-2004
Martin Quensel <> wrote:

> J.B. Moreno wrote:
> > Ben Morrow <> wrote:
> >
> >>AR <> wrote:
> >>
> >>>Does exist any module/script that can 100% detect text language..
> >>>for example English, German, French, ... (European languages, at least
> >>>English...)
> >>
> >>100%? No. What language is this string: "hotel"?

> >
> > Swahili?

>
> Start by adding all words from all the dictionaries in the world in a
> file. Then using statistics you get the most likely one.


The phrases "100%" and "most likely one" aren't equivalent.

And look up the James Nicoll quote on the purity of the English
language.

--
JBM
"Everything is futile." -- Marvin of Borg
 
 
Anno Siegel
 
      01-22-2004
Ben Morrow <> wrote in comp.lang.perl.misc:
>
> AR <> wrote:
> > Does exist any module/script that can 100% detect text language..
> > for example English, German, French, ... (European languages, at least
> > English...)

>
> 100%? No. What language is this string: "hotel"?


Well, one-word samples are hard, and 100% is unattainable.

Entirely off topic, I have recently heard of an approach to text
classification (with an eye to language recognition) that I found
interesting.

Use a Ziv-Lempel-like method to compress your sample. Then concatenate
it with texts of similar length taken from known languages and compress
again. If the compression rate is similar to or better than that of the
original text, the appended text is similar to the original one. If
the compression deteriorates, the texts are dissimilar.

The source (some idle chat on IRC, sorry) said that this works for
rather small samples of fewer than a hundred words. I have always been
meaning to play with it, but haven't got around to it.
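
A rough sketch of the compression idea, assuming zlib's deflate (through the core Compress::Zlib module) as the Ziv-Lempel-like compressor. It uses a closely related measure rather than the exact comparison described above: for each known language it counts how many extra compressed bytes the sample costs when appended to a reference text, and picks the language where that cost is lowest. The function names `zsize`, `extra_cost` and `guess_language` are hypothetical, and the reference texts are made-up toy sentences, far too short to give trustworthy results:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib ();   # core module; deflate is LZ77-based

# Compressed size in bytes of a (byte) string.
sub zsize {
    my ($data) = @_;
    return length( Compress::Zlib::compress($data, 9) );
}

# Extra bytes needed for $sample once the compressor has seen $reference.
sub extra_cost {
    my ($reference, $sample) = @_;
    return zsize($reference . ' ' . $sample) - zsize($reference);
}

# Return the label of the reference text that makes the sample cheapest
# to encode; ties are broken alphabetically by label.
sub guess_language {
    my ($sample, %reference) = @_;
    my %cost = map { $_ => extra_cost($reference{$_}, $sample) } keys %reference;
    my ($best) = sort { $cost{$a} <=> $cost{$b} or $a cmp $b } keys %cost;
    return $best;
}

# Toy reference texts (assumption: real use needs much longer samples
# of roughly equal length per language).
my %reference = (
    English => 'the weather was fine and the children were playing in the '
             . 'garden while their mother was reading a book',
    German  => 'das wetter war schoen und die kinder spielten im garten '
             . 'waehrend ihre mutter ein buch las',
    French  => 'il faisait beau et les enfants jouaient dans le jardin '
             . 'pendant que leur mere lisait un livre',
);

print guess_language('the mother and the children were in the garden',
                     %reference), "\n";
```

Measuring the difference in compressed size, instead of comparing two compression rates, keeps the reference text's own compressibility out of the result.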

Anno
 
 
Eric Wilhelm
 
      01-22-2004
On Thu, 22 Jan 2004 00:35:12 -0600, J.B. Moreno wrote:

>> Start by adding all words from all the dictionaries in the world in a
>> file. Then using statistics you get the most likely one.

>
> The phrases "100%" and "most likely one" aren't equivalent.


This is true, but in the real world, something which gives a 99.9%
probability is about as good as we are going to get. No sense in
refusing to use a circle simply because it is impossible to make a
perfect one.

IMO, 99.9% might be a low estimate even if the program takes a naive
approach. If the dictionaries include "adopted" phrases (e.g. Latin
expressions which are often cited in English, etc.) and some kind of
best-fit spell check is used, you might push the probabilities into
99.99%. Now feed some works of literature from each language into a
phrase-counter and use phrases as well, and you might find that a text of
100 words or more can be predicted correctly 99.9999% of the time.

If that isn't good enough (missing 1 of 10^6), you're going to be working
on the thing for so long that half of the languages in use at its
conception are out of use before you reach the prototype.

--Eric
 
 
Malcolm Dew-Jones
 
      01-22-2004
Ben Morrow () wrote:

: AR <> wrote:
: > Does exist any module/script that can 100% detect text language..
: > for example English, German, French, ... (European languages, at least
: > English...)

: 100%? No. What language is this string: "hotel"?

I can say with 100% certainty that that is an English word.
 