Text parser (text into sentences) that works with UTF-8 and multiple languages?

 
 
mike b.
      07-30-2007
Hi all,

I have to parse about 2000 files that are written in multiple
languages (some English, some Korean, some Arabic and some Japanese).
I have to split these UTF-8 encoded files into individual sentences. Has
anyone written a good parser that can parse all these non-Latin
character languages or can someone give me some advice on how to go
about writing a parser that can handle all these fairly different
languages?

Thank you,

Mike

 
Robert Klemme
      07-30-2007
2007/7/30, mike b.:
> I have to parse about 2000 files that are written in multiple
> languages (some English, some Korean, some Arabic and some Japanese).
> I have to split these UTF-8 encoded into individual sentences. Has
> anyone written a good parser that can parse all these non-Latin
> character languages or can someone give me some advice on how to go
> about writing a parser that can handle all these fairly different
> languages?


I would consider doing this in Java, as Java's regular expressions
support Unicode. That might make the job much easier. OTOH, if all
files use only dot, question mark etc. (i.e. ASCII chars) as sentence
delimiters then Ruby's regular expressions might do the job as well.
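
For the ASCII-delimiter case, a minimal sketch could look like the following. The helper name `split_ascii_sentences` is made up for illustration, and the sketch assumes that every `.`, `?` or `!` ends a sentence, so abbreviations such as "e.g." and decimal numbers would be split incorrectly.

```ruby
# Hypothetical helper (not from the thread): split text into sentences,
# assuming only the ASCII characters . ? ! terminate a sentence.
def split_ascii_sentences(text)
  # Each match is a run of non-terminators followed by any terminators.
  text.scan(/[^.?!]+[.?!]*/)
      .map(&:strip)          # drop whitespace left between sentences
      .reject(&:empty?)      # ignore whitespace-only fragments
end

p split_ascii_sentences("Hello world. How are you? Fine!")
# => ["Hello world.", "How are you?", "Fine!"]
```

Because the character class is negated rather than listing letters, non-Latin text between the delimiters passes through untouched.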

Kind regards

robert

 
Oblomov
      07-30-2007
On Jul 30, 11:26 am, "Robert Klemme" <shortcut...@googlemail.com>
wrote:
> 2007/7/30, mike b. <michael.w.b...@gmail.com>:
>
> > I have to parse about 2000 files that are written in multiple
> > languages (some English, some Korean, some Arabic and some Japanese).
> > I have to split these UTF-8 encoded into individual sentences. Has
> > anyone written a good parser that can parse all these non-Latin
> > character languages or can someone give me some advice on how to go
> > about writing a parser that can handle all these fairly different
> > languages?

>
> I would consider doing this in Java, as Java's regular expressions
> support Unicode. That might make the job much easier. OTOH, if all
> files use only dot, question mark etc. (i.e. ASCII chars) as sentence
> delimiters then Ruby's regular expressions might as well do the job.


Ruby supports UTF-8 regular expressions: for example, /\w+|\W/u can be
used to scan a string, splitting it into words and non-words. There were
some bugs with Unicode character classifications in older versions of
Ruby, but I'm not aware of any in 1.8.6; OTOH I've never tried it with
non-Latin text, so I don't know if it works correctly in those cases too.
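
A sketch of the same word/non-word scan for Ruby 1.9 or later, which is an assumption since the thread discusses 1.8.6. As far as I know, on those later versions `\w` matches only ASCII word characters, so the sketch swaps in the Unicode property `\p{Word}` instead. The helper name `scan_words` is made up for illustration.

```ruby
# Hypothetical helper (not from the thread): split a UTF-8 string into
# runs of word characters and single non-word characters.
# \p{Word} covers Unicode letters, marks, numbers and connector punctuation.
def scan_words(text)
  text.scan(/\p{Word}+|[^\p{Word}]/)
end

# "\u4E16\u754C" is the two-character Japanese/Chinese word for "world".
p scan_words("Hi, \u4E16\u754C!")
# => ["Hi", ",", " ", "世界", "!"]
```

Note that Japanese text has no spaces between words, so a whole run of kana and kanji comes back as a single token; this scan finds word-character runs, not linguistic words.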


 
James Edward Gray II
      07-30-2007
On Jul 30, 2007, at 3:50 AM, mike b. wrote:

> I have to parse about 2000 files that are written in multiple
> languages (some English, some Korean, some Arabic and some Japanese).
> I have to split these UTF-8 encoded into individual sentences.


As has been stated, Ruby's regular expression engine has a Unicode
mode and that may be all you need here, depending on how you
recognize sentence boundaries.
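
If sentence boundaries are recognized purely by terminator characters, one possible sketch (assuming Ruby 1.9 or later) extends the terminator set beyond ASCII. The set below is an assumption, not something given in the thread: it adds the ideographic full stop (U+3002), the fullwidth exclamation and question marks (U+FF01, U+FF1F) and the Arabic question mark (U+061F). The names `TERMINATORS`, `SENTENCE_RE` and `split_sentences` are made up for illustration.

```ruby
# Assumed terminator set; extend or trim it to match the actual corpus.
TERMINATORS = ".!?\u3002\uFF01\uFF1F\u061F"

# A sentence is a run of non-terminators followed by any terminators.
SENTENCE_RE = /[^#{TERMINATORS}]+[#{TERMINATORS}]*/

# Hypothetical helper: split UTF-8 text into sentences on those characters.
def split_sentences(text)
  text.scan(SENTENCE_RE).map(&:strip).reject(&:empty?)
end

# English followed by two Japanese sentences with no space between them.
p split_sentences("Hello. \u3053\u3093\u306B\u3061\u306F\u3002\u5143\u6C17\uFF1F")
# => ["Hello.", "こんにちは。", "元気？"]
```

This still treats every terminator as a boundary, so abbreviations and decimal points in the English files would need extra rules.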

> Has anyone written a good parser that can parse all these non-Latin
> character languages or can someone give me some advice on how to go
> about writing a parser that can handle all these fairly different
> languages?


I've released an initial version of my Ghost Wheel parser generator
library. It doesn't have documentation yet, but it was built using
TDD and you should be able to look over the tests to see how it
works. I'm also happy to answer questions.

My hope is that it works fine for non-Latin languages, but I'll
confess that I haven't tested it that way yet. I would try to fix
any issues you uncovered though.

James Edward Gray II

 