Text search with accented characters

 
 
Mickey Segal
      12-15-2005
Does Java have a method to take a string with accented characters and
convert it to unaccented characters? I want to search a big string for a
test string, ignoring accents on characters.

Doing the equivalent ignoring of case is simple:

String actualTestString = testString.toLowerCase();
String actualBigString = bigString.toLowerCase();
if (actualBigString.lastIndexOf(actualTestString) >= 0)
{
    // do stuff
}

In the Collator class I see a way of checking if two strings are equivalent,
disregarding both case and accents:

Collator c = Collator.getInstance();
c.setStrength(Collator.PRIMARY); // ignore both case and accents
if (c.compare(oneString, otherString) == 0)
{
    // do stuff
}

However, I don't see a way of reducing the accented string to a simpler
string so I could search in a bigger string using a "toUnaccentedForm"
method instead of the toLowerCase method in the code above.

Is there a built-in method like "toUnaccentedForm" or some other approach
simpler than writing one's own version of lastIndexOf to ignore accents?


 
 
 
 
 
Oliver Wong
      12-15-2005



AFAIK, there is no built-in "toUnaccentedForm()". What might be less
painful than implementing your own lastIndexOf() is to build a Map of
characters that goes from the accented version to the unaccented version,
transform your strings using that map, and THEN do the comparison.

- Oliver
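
The Map suggestion above could be sketched roughly as follows. This is only an illustration, not code from the thread: the class name AccentMap, the method names, and the particular characters in the map are all assumptions, and a real table would hold whatever accented characters actually occur in the data.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; names and map contents are illustrative only.
final class AccentMap {
    // Lower-case entries only: input is lower-cased before the lookup,
    // so the upper-case accented forms are covered as well.
    private static final Map<Character, Character> UNACCENTED = new HashMap<>();
    static {
        UNACCENTED.put('\u00e0', 'a'); // a with grave
        UNACCENTED.put('\u00e1', 'a'); // a with acute
        UNACCENTED.put('\u00e2', 'a'); // a with circumflex
        UNACCENTED.put('\u00e4', 'a'); // a with diaeresis
        UNACCENTED.put('\u00e7', 'c'); // c with cedilla
        UNACCENTED.put('\u00e8', 'e'); // e with grave
        UNACCENTED.put('\u00e9', 'e'); // e with acute
        UNACCENTED.put('\u00ea', 'e'); // e with circumflex
        UNACCENTED.put('\u00f1', 'n'); // n with tilde
        UNACCENTED.put('\u00f6', 'o'); // o with diaeresis
        UNACCENTED.put('\u00fc', 'u'); // u with diaeresis
    }

    // Replace every mapped character; leave everything else untouched.
    static String toUnaccentedForm(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            Character plain = UNACCENTED.get(c);
            sb.append(plain != null ? plain.charValue() : c);
        }
        return sb.toString();
    }

    // Same shape as the toLowerCase/lastIndexOf test in the original question.
    static boolean containsIgnoringCaseAndAccents(String bigString, String testString) {
        String big = toUnaccentedForm(bigString.toLowerCase(Locale.ROOT));
        String test = toUnaccentedForm(testString.toLowerCase(Locale.ROOT));
        return big.lastIndexOf(test) >= 0;
    }
}
```

Lower-casing with Locale.ROOT rather than the default locale is a choice made for this sketch, so the result does not depend on the machine's locale settings.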


 
 
 
 
 
Mickey Segal
      12-16-2005


I came to the same conclusion, mapping the 10 non-standard lower-case
characters likely to come up in our database. Since I was also using
toLowerCase, this also covered the upper-case forms.

I also fiddled around with writing my own equivalent of lastIndexOf() using
CollationElementIterator after finding an example at
http://icu.sourceforge.net/docs/pape...g_in_java.html.
However, in the real world that approach turned out to be painfully slow
when searching 1000 strings. In contrast, mapping 10 characters was very
fast: the characters are very rare in our database, so the handling of
accented characters did not slow down the program much.
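
For comparison, a search built on CollationElementIterator might look something like the sketch below. It is not the code from the linked paper or from the post above. The class and method names are made up, Locale.ENGLISH is an arbitrary choice, and the cast to RuleBasedCollator is an assumption about the collator the JDK returns. Each string is reduced to its primary collation keys, so case and accent differences drop out, and then one key sequence is searched for inside the other.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; names are illustrative only.
final class CollationSearch {
    // Collect the primary order of each collation element, skipping
    // elements whose primary order is 0 (ignorable at primary strength).
    static List<Integer> primaryKeys(RuleBasedCollator collator, String s) {
        List<Integer> keys = new ArrayList<>();
        CollationElementIterator it = collator.getCollationElementIterator(s);
        int element;
        while ((element = it.next()) != CollationElementIterator.NULLORDER) {
            int primary = CollationElementIterator.primaryOrder(element);
            if (primary != 0) {
                keys.add(primary);
            }
        }
        return keys;
    }

    // True if the primary keys of testString occur as a run inside
    // the primary keys of bigString.
    static boolean containsPrimary(String bigString, String testString) {
        // Assumption: the JDK's collator is a RuleBasedCollator.
        RuleBasedCollator collator =
                (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
        List<Integer> big = primaryKeys(collator, bigString);
        List<Integer> test = primaryKeys(collator, testString);
        return Collections.lastIndexOfSubList(big, test) >= 0;
    }
}
```

The sketch only reports whether a match exists. Anything the collator treats as ignorable at primary strength is skipped along with the accents, so it can match in places a plain character comparison would not.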





 
 
Roedy Green
      12-16-2005
On Thu, 15 Dec 2005 14:57:53 -0500, "Mickey Segal" wrote, quoted or
indirectly quoted someone who said:

>Does Java have a method to take a string with accented characters and
>convert it to unaccented characters? I want to search a big string for a
>test string, ignoring accents on characters.


There is one in Abundance, but I don't think I have seen one in Java.
The way you implement it is with a translate table. You index by
accented char to get unaccented. You might just implement it for low
numbered chars.
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
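
A translate table of the kind described above, indexed by char and covering only the low-numbered characters, might be sketched like this. The class name, the 256-entry size, and the choice of mapped characters are assumptions made for illustration.

```java
// Hypothetical helper; names and table contents are illustrative only.
final class TranslateTable {
    // One slot per char in 0..255; higher chars are passed through as-is.
    private static final char[] TABLE = new char[256];
    static {
        for (int i = 0; i < TABLE.length; i++) {
            TABLE[i] = (char) i; // start with the identity mapping
        }
        // Lower-case accented letters only; lower-case the input first
        // if the upper-case forms should be covered too.
        map("\u00e0\u00e1\u00e2\u00e3\u00e4\u00e5", 'a');
        map("\u00e7", 'c');
        map("\u00e8\u00e9\u00ea\u00eb", 'e');
        map("\u00ec\u00ed\u00ee\u00ef", 'i');
        map("\u00f1", 'n');
        map("\u00f2\u00f3\u00f4\u00f5\u00f6", 'o');
        map("\u00f9\u00fa\u00fb\u00fc", 'u');
        map("\u00fd\u00ff", 'y');
    }

    // Point every char in 'accented' at the same unaccented replacement.
    private static void map(String accented, char plain) {
        for (int i = 0; i < accented.length(); i++) {
            TABLE[accented.charAt(i)] = plain;
        }
    }

    // Index the table by each char to get its unaccented form.
    static String translate(String s) {
        char[] out = s.toCharArray();
        for (int i = 0; i < out.length; i++) {
            if (out[i] < TABLE.length) {
                out[i] = TABLE[out[i]];
            }
        }
        return new String(out);
    }
}
```

Compared with a Map, the array lookup avoids boxing each character, at the cost of only handling characters that fit in the table.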
 