Unescaping Unicode code points in a Java string

 
 
Greg
      08-31-2006
My Java program reads in (from an external source) text that contains
the same sort of Unicode character escape sequences as Java source
code. For example, one such string might be:

"En Espa\u00f1ol"

Naturally, I would like to convert the six-character subsequence,
"\u00f1", into the single character code point (hex 00F1) that those
characters actually represent:

"En Español"

I've been browsing the J2SE 1.5 docs hoping to find a convenient method
to perform this kind of conversion, but so far have not found one. Does
anyone have any suggestions?

Thanks,

Greg

 
Thomas Fritsch
      08-31-2006
A long time ago I searched the Java API and sources for a method that
does this kind of String decoding, but to no avail. The only thing I
found was the method
private String loadConvert(String)
in class java.util.Properties. But because it is private, it cannot be
reused outside Properties.

(You can find the source in src.zip in the JDK installation directory.)

--
Thomas
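
Although loadConvert is private, the same decoding can be reached through the public Properties.load method. The sketch below is not from the thread; the class name PropertiesUnescape and the throwaway key "k" are made up for illustration. It assumes Java 6 or later, where Properties.load(Reader) exists; on J2SE 1.5 the text would have to be supplied through load(InputStream) as ISO-8859-1 bytes instead.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper: borrows the escape decoding built into Properties.load.
final class PropertiesUnescape {
    static String unescape(String s) {
        Properties p = new Properties();
        try {
            // Present the text as the value of a one-line properties file.
            p.load(new StringReader("k=" + s));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new IllegalStateException(e);
        }
        return p.getProperty("k");
    }
}
```

This inherits all the properties-file rules, which may or may not be wanted: leading whitespace in the value is skipped, a line break in the input ends the value, escapes such as \t and \n are decoded as well, and the backslash in front of an unrecognized character is silently dropped.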
 
Oliver Wong
      08-31-2006

Iterate through each character of the String, looking for the sequence
"\u". If you find it, delete those two chars, and read in the next 4 chars.
Parse that sequence of 4 characters into an integer assuming hexadecimal
notation. Take that integer and cast it to a char, and insert the resulting
char back into the String.

- Oliver

 
Arne Vajhøj
      09-01-2006
One of many possible solutions:

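import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Accept lower-case hex digits too, so the escape from the question (00f1) matches.
private static final Pattern p = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

public static String U2U(String s) {
    String res = s;
    Matcher m = p.matcher(s);
    while (m.find()) {
        // String.replace treats both arguments literally (no regex or $ handling).
        res = res.replace(m.group(0),
                Character.toString((char) Integer.parseInt(m.group(1), 16)));
    }
    return res;
}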

Arne
 
Dale King
      09-01-2006
Oliver Wong wrote:
> Iterate through each character of the String, looking for the
> sequence "\u". If you find it, delete those two chars, and read in the
> next 4 chars. Parse that sequence of 4 characters into an integer
> assuming hexadecimal notation. Take that integer and cast it to a char,
> and insert the resulting char back into the String.


It's a bit more complicated than that because you will also need to
support things like \\ to actually insert a backslash and perhaps
support things like \n.

--
Dale King
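
A minimal sketch of what handling those extra escapes could look like, assuming the input format wants a doubled backslash, \n and \t in addition to the four-digit Unicode escape. The class name EscapeDecoder, the choice of supported escapes, and the error handling are all assumptions, not something specified in the thread.

```java
// Hypothetical decoder: one left-to-right pass over the input.
final class EscapeDecoder {
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c != '\\' || i == s.length()) {
                out.append(c);              // ordinary char, or a lone trailing backslash
                continue;
            }
            char e = s.charAt(i++);         // the character after the backslash
            switch (e) {
                case 'n': out.append('\n'); break;
                case 't': out.append('\t'); break;
                case 'u': {
                    if (i + 4 > s.length()) {
                        throw new IllegalArgumentException("Truncated Unicode escape");
                    }
                    int code = 0;
                    for (int k = 0; k < 4; k++) {
                        int d = Character.digit(s.charAt(i + k), 16);
                        if (d < 0) {
                            throw new IllegalArgumentException("Bad hex digit in Unicode escape");
                        }
                        code = code * 16 + d;
                    }
                    out.append((char) code);
                    i += 4;
                    break;
                }
                default:
                    out.append(e);          // doubled backslash and unknown escapes: keep the char
            }
        }
        return out.toString();
    }
}
```

With this, EscapeDecoder.unescape applied to the text En Espa\u00f1ol gives "En Español", and a doubled backslash in front of u0041 yields a literal \u0041 rather than the letter A.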
 
David Lee Lambert
      09-01-2006
On Fri, 01 Sep 2006 01:09:40 -0400, Dale King wrote:

> It's a bit more complicated than that because you will also need to
> support things like \\ to actually insert a backslash and perhaps
> support things like \n.


If he is defining a new specification for escaped input, this would be
nice but not necessary. "\" can be escaped as "\u005C", and a newline
as "\u000A". In Java source code, "\u000A" results in a malformed string
literal (which means one needs to use "\n" instead), and "\u005C"
likewise has to be written as "\\", but both escape sequences are
permitted in properties files. On the other hand, the Java
compiler and Properties.load() do not recognize the C escape-sequences
"\v" and "\a" for VT and BEL.

I think Arne's response (that used a regular expression) was too
complicated, and the response to which you are responding was
poorly-thought-out (because strings are immutable in Java). Here's a
possible solution:

String unescape(String s) {
    int i = 0, len = s.length();
    char c;
    StringBuffer sb = new StringBuffer(len);
    while (i < len) {
        c = s.charAt(i++);
        if (c == '\\') {
            if (i < len) {
                c = s.charAt(i++);
                if (c == 'u') {
                    // decode the four hex digits that follow
                    c = (char) Integer.parseInt(s.substring(i, i + 4), 16);
                    i += 4;
                } // add other cases here as desired...
            }
        } // fall through: \ escapes itself, quotes any character but u
        sb.append(c);
    }
    return sb.toString();
}

Unlike Arne's solution, it examines each character in the string only
once, and it doesn't require the java.util.regex package (which was not
introduced until Java 1.4). I also think it's more readable, to one who
is trying to verify that it does exactly what's expected and no more.

(What would Arne's solution do to "\u005Cu0020\u0020"? Is that the
correct result?)

--
PGP key posted on website ... http://www.lmert.com/people/davidl/
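
The closing question can be checked directly. The sketch below is not code from the thread: loopedReplace is a hypothetical stand-in for the "find an escape, then replace it everywhere in the result" style of the regex solution (using String.replace so that a decoded backslash is inserted literally), and singlePass is a hypothetical regex variant built on Matcher.appendReplacement that never rescans its own output.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DoubleDecodeDemo {
    // Backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESC = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    /** Hypothetical: for each escape found in the input, replace every occurrence in the result. */
    static String loopedReplace(String s) {
        String res = s;
        Matcher m = ESC.matcher(s);
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            res = res.replace(m.group(0), String.valueOf(decoded));
        }
        return res;
    }

    /** Hypothetical: decode each escape exactly once, copying everything else unchanged. */
    static String singlePass(String s) {
        Matcher m = ESC.matcher(s);
        StringBuffer out = new StringBuffer(s.length());
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps a decoded backslash or dollar sign literal.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For the input \u005Cu0020\u0020, loopedReplace returns two spaces: the backslash decoded from \u005C joins the following u0020 to form a new escape, which the second replacement then consumes. singlePass returns the six characters \u0020 followed by one space, which matches how the Java compiler treats source text, where a character produced by a Unicode escape does not take part in a further escape.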

 
Dale King
      09-01-2006
David Lee Lambert wrote:
> On Fri, 01 Sep 2006 01:09:40 -0400, Dale King wrote:
>> It's a bit more complicated than that because you will also need to
>> support things like \\ to actually insert a backslash and perhaps
>> support things like \n.
>
> If he is defining a new specification for escaped input, this would be
> nice but not necessary. "\" can be escaped as "\u005C", and a newline
> as "\u000A". In Java source code, "\u000A" results in a malformed string
> literal (which means one needs to use "\n" instead), and "\u005C"
> likewise has to be written as "\\", but both escape sequences are
> permitted in properties files.


It's up to him what he wants to specify, but personally I would prefer
the \\ and \n.

> On the other hand, the Java
> compiler and Properties.load() do not recognize the C escape-sequences
> "\v" and "\a" for VT and BEL.


Which is understandable. BEL is specific to consoles, and Java has no
real support for consoles because they are too platform-specific; VT
is rarely used.

> I think Arne's response (that used a regular expression) was too
> complicated, and the response to which you are responding was
> poorly-thought-out (because strings are immutable in Java). Here's a
> possible solution:
>
> String unescape(String s) {


The proper time to do the conversion is when the text is being read from
the "external source" using some form of FilterReader subclass. I
remember now that I wrote one of those once, but after a long search I
have figured out that I left that code at my previous employer and did
not keep a copy of it (which is a shame because that was part of
something that was some really good work).

--
Dale King
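
Since the original FilterReader code is lost, here is a rough sketch of what such a subclass might look like. It is an illustration only, not a reconstruction: the class name UnicodeUnescapeReader is invented, it decodes only the four-digit Unicode escape (a backslash followed by any other character, including a second backslash, passes through untouched), and it leaves skip() and mark()/reset() at their FilterReader defaults.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical reader that decodes Unicode escapes while the text is being read.
final class UnicodeUnescapeReader extends FilterReader {
    // Character following a backslash that must be returned unchanged next; -1 if none.
    private int pending = -1;

    UnicodeUnescapeReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (pending != -1) {
            int p = pending;
            pending = -1;
            return p;
        }
        int c = in.read();
        if (c != '\\') {
            return c;                   // ordinary character or end of stream
        }
        int next = in.read();
        if (next != 'u') {
            pending = next;             // not a Unicode escape: pass both chars through
            return c;
        }
        int code = 0;
        for (int k = 0; k < 4; k++) {
            int h = in.read();
            int d = (h == -1) ? -1 : Character.digit((char) h, 16);
            if (d < 0) {
                throw new IOException("Malformed Unicode escape");
            }
            code = code * 16 + d;
        }
        return code;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            buf[off + n++] = (char) c;
        }
        return (n == 0 && len > 0) ? -1 : n;
    }
}
```

Wrapping the source, for example new BufferedReader(new UnicodeUnescapeReader(someReader)), means the rest of the program only ever sees decoded text. Because the character after a backslash is always passed through as-is, a doubled backslash in front of u0041 stays as it was instead of being decoded.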
 
vektor
      05-17-2011
I use the Apache Commons Lang converter (its unescapeJava method
handles this kind of input):
org.apache.commons.lang.StringEscapeUtils
 