Velocity Reviews

Andrew 08-06-2009 03:03 PM

how do I expand a unicode string to its visual UTF8 representation?
 
Hello,

I have an example program below that contains weird Icelandic
characters, and a copyright symbol, just for good measure. The code
expresses these as UTF8. They print exactly as you would want/expect
them to. So far so good. But what I want is to be able to go the other
way. I want to take a unicode string and recreate the escape sequences
for the funny international characters. For example, the single
character E-acute should be expanded to \u00C9 (6 characters). Any
ideas on how to do this please?

public class UTF8Test {
    public UTF8Test() {
    }

    public String getString() {
        StringBuilder builder = new StringBuilder();
        builder.append("Copyright \u00A9 2009\n");
        builder.append("Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me\n");
        builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig");
        return builder.toString();
    }

    public static void main(String[] args) {
        UTF8Test test = new UTF8Test();
        System.out.println(test.getString());
    }
}

FWIW, the reason I want to do this is that I need to write strings like
this to a Sybase table where the column is of type varchar. We cannot
make it univarchar (don't ask). So I need to be able to write Unicode
characters without using Unicode chars! I thought that by having them in
this expanded form, Java could convert them just like the program above
does.

Regards,

Andrew Marlow

Knute Johnson 08-06-2009 04:02 PM

Re: how do I expand a unicode string to its visual UTF8 representation?
 
public class UTF8Test {
    public UTF8Test() {
    }

    public void doit() {
        StringBuilder builder = new StringBuilder();
        builder.append("Copyright \u00A9 2009\n");
        builder.append("Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me\n");
        builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig");
        String str = builder.toString();

        System.out.println(str);

        // Dumps every byte of the platform-default encoding as a four-digit hex escape.
        byte[] buf = str.getBytes();
        for (byte b : buf)
            System.out.printf("\\u%04x",b);
    }

    public static void main(String[] args) {
        UTF8Test test = new UTF8Test();
        test.doit();
    }
}

C:\Documents and Settings\Knute Johnson>java UTF8Test
Copyright ⌐ 2009
Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me
╔g get eti≡ gler ßn *ess a≡ mei≡a mig
\u0043\u006f\u0070\u0079\u0072\u0069\u0067\u0068\u0074\u0020\u00a9\u0020\u0032\u
0030\u0030\u0039\u000a\u0048\u0065\u0072\u0065\u0020\u0069\u0073\u0020\u0074\u00
68\u0065\u0020\u0070\u0068\u0072\u0061\u0073\u0065\u0020\u0028\u0069\u006e\u0020
\u0049\u0063\u0065\u006c\u0061\u006e\u0064\u0069\u0063\u0029\u003a\u0020\u0049\u
0020\u0063\u0061\u006e\u0020\u0065\u0061\u0074\u0020\u0067\u006c\u0061\u0073\u00
73\u0020\u0061\u006e\u0064\u0020\u0069\u0074\u0020\u0064\u006f\u0065\u0073\u006e
\u0027\u0074\u0020\u0068\u0075\u0072\u0074\u0020\u006d\u0065\u000a\u00c9\u0067\u
0020\u0067\u0065\u0074\u0020\u0065\u0074\u0069\u00f0\u0020\u0067\u006c\u0065\u00
72\u0020\u00e1\u006e\u0020\u00fe\u0065\u0073\u0073\u0020\u0061\u00f0\u0020\u006d
\u0065\u0069\u00f0\u0061\u0020\u006d\u0069\u0067

--

Knute Johnson
email s/nospam/knute2009/


Andrew 08-06-2009 04:12 PM

Re: how do I expand a unicode string to its visual UTF8 representation?
 
On 6 Aug, 17:02, Knute Johnson <nos...@rabbitbrush.frazmtn.com> wrote:
> C:\Documents and Settings\Knute Johnson>java UTF8Test
> Copyright ⌐ 2009
> Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me
> ╔g get eti≡ gler ßn *ess a≡ mei≡a mig
> \u0043\u006f\u0070\u0079\u0072\u0069\u0067\u0068\u0074\u0020\u00a9\u0020\u0032\u
> 0030\u0030\u0039\u000a\u0048\u0065\u0072\u0065\u0020\u0069\u0073\u0020\u0074\u00
> 68\u0065\u0020\u0070\u0068\u0072\u0061\u0073\u0065\u0020\u0028\u0069\u006e\u0020
> \u0049\u0063\u0065\u006c\u0061\u006e\u0064\u0069\u0063\u0029\u003a\u0020\u0049\u
> 0020\u0063\u0061\u006e\u0020\u0065\u0061\u0074\u0020\u0067\u006c\u0061\u0073\u00
> 73\u0020\u0061\u006e\u0064\u0020\u0069\u0074\u0020\u0064\u006f\u0065\u0073\u006e
> \u0027\u0074\u0020\u0068\u0075\u0072\u0074\u0020\u006d\u0065\u000a\u00c9\u0067\u
> 0020\u0067\u0065\u0074\u0020\u0065\u0074\u0069\u00f0\u0020\u0067\u006c\u0065\u00
> 72\u0020\u00e1\u006e\u0020\u00fe\u0065\u0073\u0073\u0020\u0061\u00f0\u0020\u006d
> \u0065\u0069\u00f0\u0061\u0020\u006d\u0069\u0067


Well, thanks for the quick reply, but that hasn't quite worked, has it?
All the chars have come out as \uxxxx. I want the ones that are 7-bit
ASCII to come out as the normal printable char, i.e. I want the output
of doit to be:

Copyright \u00A9 2009
Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me
\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig


Arne Vajhøj 08-06-2009 04:15 PM

Re: how do I expand a unicode string to its visual UTF8 representation?
 
The specific question asked can be solved with something like:

public static String encode(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c <= 127) {
            // 7-bit ASCII passes through unchanged
            sb.append(c);
        } else {
            // anything else becomes a backslash, 'u' and four (lower-case) hex digits
            String hex = Integer.toHexString(c);
            sb.append("\\u" + "0000".substring(hex.length(), 4) + hex);
        }
    }
    return sb.toString();
}

But decoding it will also take some work, because the unescaping in
your code is done by the compiler at compile time, not at runtime.
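
To illustrate the runtime side: the compiler only translates escapes that appear in source code, so text that arrives at runtime (from a database column, say) still contains a literal backslash, a 'u' and four hex digits. A minimal sketch of a decoder might look like the following. The names UnicodeUnescape and decode are made up for illustration, and the sketch assumes the stored text never contains a literal backslash-u sequence that was not produced by the encoder.

```java
final class UnicodeUnescape {
    private UnicodeUnescape() {}

    /** Replaces each backslash-u-plus-four-hex-digits sequence with the char it names. */
    static String decode(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            // Only treat the sequence as an escape if four hex digits really follow.
            if (s.startsWith("\\u", i) && i + 6 <= s.length() && isHex(s, i + 2)) {
                sb.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                // Everything else, including malformed escapes, is copied unchanged.
                sb.append(s.charAt(i));
                i++;
            }
        }
        return sb.toString();
    }

    /** True if the four chars starting at index from are all hex digits. */
    private static boolean isHex(String s, int from) {
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

Because the sketch works on 16-bit chars, a surrogate pair written as two consecutive escapes decodes back to the original pair without special handling.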

And 1 code point -> 6 bytes is not a very efficient encoding.

Assuming your VARCHAR supports all byte values 0-255, you should be
able to store your UTF-8 bytes as ISO-8859-1.

A bit messy but more efficient space wise and less code.
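
A rough sketch of that idea, assuming the column and the JDBC driver really do pass all 256 byte values through unchanged (the thread does not confirm this for Sybase). Latin1Carrier, pack and unpack are illustrative names, not part of any library.

```java
import java.nio.charset.StandardCharsets;

final class Latin1Carrier {
    private Latin1Carrier() {}

    /** Encodes text as UTF-8, then wraps each byte in one char in the range 0-255. */
    static String pack(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Reverses pack: turns the 0-255 chars back into bytes and decodes them as UTF-8. */
    static String unpack(String stored) {
        byte[] utf8 = stored.getBytes(StandardCharsets.ISO_8859_1);
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

The packed form is unreadable to anyone looking at the table directly (the copyright sign becomes the two characters U+00C2 and U+00A9), but ASCII text stays as it is and a Latin-1 character costs two bytes rather than six.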

Alternatively you could look at Quoted Printable but that
will also have overhead.

Arne

Mayeul 08-06-2009 04:24 PM

Re: how do I expand a unicode string to its visual UTF8 representation?
 
Andrew wrote:
> Hello,
>
> I have an example program below that contains weird Icelandic
> characters, and a copyright symbol, just for good measure. The code
> expresses these as UTF8. They print exactly as you would want/expect
> them to. So far so good. But what I want is to be able to go the other
> way. I want to take a unicode string and recreate the escape sequences
> for the funny international characters. For example, the single
> character E-acute should be expanded to \u00C9 (6 characters). Any
> ideas on how to do this please?

You might want to read up on UTF-8, as something like \u00C9 has absolutely
nothing to do with UTF-8. It is the Java escape notation, which lets you
represent a character by its Unicode code point in hexadecimal.
Nothing to do with UTF-8. A lot to do with UTF-16, though.

As a side note, please be aware that Java Strings are sequences of Java
char values. Char values are unsigned and 16-bit, which is not enough to
hold characters with a Unicode code point above U+FFFF. Such characters
are therefore encoded as a combination of two Java chars, in the same
way UTF-16 works.
This won't impact what you're trying to do, though, since UTF-16 uses
surrogate characters that are still non-ASCII for characters above
U+FFFF. Their correct escape sequence is the horrible \uAAAA\uBBBB, the
escape sequences of the two surrogates. Not addressing the issue at all
will automagically produce the desired results.
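
To make the surrogate point concrete, here is a small sketch. The class name and the choice of U+1F600 as a sample code point are assumptions for illustration, not from the thread. It escapes char by char and ignores surrogates entirely, yet a character above U+FFFF still comes out as two valid escapes.

```java
final class SurrogateEscapeDemo {
    private SurrogateEscapeDemo() {}

    /** Escapes every char above 7-bit ASCII as backslash, 'u' and four upper-case hex digits. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One code point above U+FFFF, stored by Java as two chars.
        String s = new String(Character.toChars(0x1F600));
        System.out.println(s.length());   // prints 2
        System.out.println(escape(s));    // prints the two surrogate escapes, D83D then DE00
    }
}
```

Each surrogate lies in the range U+D800 to U+DFFF, so neither can be mistaken for ASCII, and the pair of escapes is exactly what one would write in Java source for that character.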


As for how to encode to or decode from such a format, I don't know of
any direct way, but Knute and Arne showed it should be rather
straightforward.

> FWIW, the reason I want to do this is I need to write strings like
> this to a sybase table where the column is of type varchar. We cannot
> make it univarchar (don't ask). So I need to be able to write unicode
> characters without using unicode chars!


I recommend you store them encoded in UTF-7 or quoted-printable, then.
This will be more efficient and more standard than what you're trying to
do, and libraries will do it for you.

> I thought by having them in
> this expanded form java can convert them just like the program above
> does.


As far as I know, you were wrong when thinking that.

--
Mayeul

Knute Johnson 08-06-2009 04:46 PM

Re: how do I expand a unicode string to its visual UTF8 representation?
 
Andrew wrote:
> Well, thanks for the quick reply, but that hasn't quite worked has it?
> All the chars have come out as \uxxxx. I want the ones that are 7 bit
> ASCII to come out as the normal printable char, i.e I want the
> output of doit to be:
>
> Copyright \u00A9 2009 Here is the phrase (in Icelandic): I can eat
> glass and it doesn't hurt me \u00C9g get eti\u00F0 gler \u00E1n
> \u00FEess a\u00F0 mei\u00F0a mig


Well I figured since you had a fairly sophisticated question and
appeared to have some knowledge of Java that you could figure out how to
use the 'if' statement yourself. Oh and just so you don't complain that
I used lower case hex, I fixed that too.

C:\Documents and Settings\Knute Johnson>java UTF8Test
Copyright ⌐ 2009
Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me
╔g get eti≡ gler ßn *ess a≡ mei≡a mig
Copyright \u00A9 2009
Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me
\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig

public class UTF8Test {
    public UTF8Test() {
    }

    public void doit() {
        StringBuilder builder = new StringBuilder();
        builder.append("Copyright \u00A9 2009\n");
        builder.append("Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me\n");
        builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig");
        String str = builder.toString();

        System.out.println(str);

        // Bytes with the high bit clear are printed as-is; all others are escaped.
        byte[] buf = str.getBytes();
        for (byte b : buf) {
            if ((b & 0x80) == 0)
                System.out.print(new String(new byte[] { b }));
            else
                System.out.printf("\\u%04X",b);
        }
    }

    public static void main(String[] args) {
        UTF8Test test = new UTF8Test();
        test.doit();
    }
}

--

Knute Johnson
email s/nospam/knute2009/


Roedy Green 08-06-2009 05:11 PM

Re: how do I expand a unicode string to its visual UTF8 representation?
 
On Thu, 6 Aug 2009 08:03:59 -0700 (PDT), Andrew
<marlow.andrew@googlemail.com> wrote, quoted or indirectly quoted
someone who said :

> I want to take a unicode string and recreate the escape sequences
> for the funny international characters. For example, the single
> character E-acute should be expanded to \u00C9 (6 characters). Any
> ideas on how to do this please?


Another way of formulating your question is: how do I take some
UTF-16 data in RAM and write it out in an 8-bit Icelandic encoding, or
possibly in UTF-8?

See http://mindprod.com/applet/file.html

See http://mindprod.com/jgloss/encoding.html
to find the name of the possible Icelandic encodings.

See http://mindprod.com/applet/encodingrecogniser.html
to help you figure out which Icelandic encoding your sample is using.

P.S. none of these codes is "visual". Turning these codes to glyphs is
the job of the font. See
http://mindprod.com/jgloss/font.html
--
Roedy Green Canadian Mind Products
http://mindprod.com

"Let us pray it is not so, or if it is, that it will not become widely known."
~ Wife of the Bishop of Exeter on hearing of Darwin's theory of the common descent of humans and apes.

Andrew 08-06-2009 05:16 PM

Re: how do I expand a unicode string to its visual UTF8 representation?
 
On 6 Aug, 17:24, Mayeul <mayeul.marg...@free.fr> wrote:
> Andrew wrote:
> > Hello,

>
> > I have an example program below that contains weird Icelandic
> > characters, and a copyright symbol, just for good measure. The code
> > expresses these as UTF8. They print exactly as you would want/expect
> > them to. So far so good. But what I want is to be able to go the other
> > way. I want to take a unicode string and recreate the escape sequences
> > for the funny international characters.


>
> You might want to read up on UTF-8, as something like \u00C9 has absolutely
> nothing to do with UTF-8. It is the Java escape notation, which lets you
> represent a character by its Unicode code point in hexadecimal.
> Nothing to do with UTF-8. A lot to do with UTF-16, though.


Yes, ahem, you're right.

> As for how to encode to or decode from such a format, I don't know of
> any direct way, but Knute and Arne showed it should be rather
> straightforward.


I am not sure about those solutions. Don't I need to convert the
internal representation to something specific first, like UTF8? Or is
there a formal definition of the internal representation where no
explicit encoding is given?

>
> > FWIW, the reason I want to do this is I need to write strings like
> > this to a sybase table where the column is of type varchar. We cannot
> > make it univarchar (don't ask). So I need to be able to write unicode
> > characters without using unicode chars!

>
> I recommend you store them encoded in UTF-7 or quoted-printable, then.
> This will be more efficient and more standard than what you're trying to
> do, and libraries will do it for you.


If I store the data in a varchar like this:

Copyright \u00A9 2009
Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me
\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig

then Java will do the work of conversion for me automatically.
That's why I need to move in the other direction first.

> > I thought by having them in
> > this expanded form java can convert them just like the program above
> > does.

>
> As far as I know, you were wrong when thinking that.


I think I am right. When the \uxxxx strings are in a file and I read
them in, printing gives the correct result. Therefore reading from a
varchar should also give the correct result.
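
Whether this holds depends on what does the reading. A plain Reader hands back the six characters unchanged, whereas java.util.Properties.load is one standard API that does translate these escapes, which may be what was observed here (that is a guess; the thread does not say how the file was read). A small sketch of the difference, using made-up names (EscapeReadDemo, readPlain, readAsProperty) and a StringReader standing in for the file:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class EscapeReadDemo {
    private EscapeReadDemo() {}

    /** Reads the first line the way an ordinary text read would: no escape processing. */
    static String readPlain(String fileContent) {
        try (BufferedReader reader = new BufferedReader(new StringReader(fileContent))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Parses the content as a properties file; Properties.load decodes Unicode escapes. */
    static String readAsProperty(String fileContent, String key) {
        Properties props = new Properties();
        try (StringReader reader = new StringReader(fileContent)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }
}
```

With the content `notice=Copyright \u00A9 2009` (a real backslash in the data), readPlain returns the line untouched, while readAsProperty returns the value with an actual copyright sign. A string fetched from a varchar column behaves like the plain read, so it would still need an explicit decoding step.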

Andrew 08-06-2009 05:25 PM

Re: how do I expand a unicode string to its visual UTF8 representation?
 
On 6 Aug, 18:11, Roedy Green <see_webs...@mindprod.com.invalid> wrote:
> On Thu, 6 Aug 2009 08:03:59 -0700 (PDT), Andrew
> <marlow.and...@googlemail.com> wrote, quoted or indirectly quoted
> someone who said :
>
> > I want to take a unicode string and recreate the escape sequences
> > for the funny international characters. For example, the single
> > character E-acute should be expanded to \u00C9 (6 characters). Any
> > ideas on how to do this please?

>
> Another way of formulating your question is: how do I take some
> UTF-16 data in RAM and write it out in an 8-bit Icelandic encoding, or
> possibly in UTF-8?


No, that is not my question. Icelandic was just an example. The point
is the data contains international characters. I don't know what
language the text will be in and I don't care. I just need to be able
to write it to the database without losing information but I cannot
make the column univarchar (for reasons I won't go into here).

> See http://mindprod.com/applet/encodingrecogniser.html
> to help you figure out which Icelandic encoding your sample is using.


This is not the problem (but I appreciate the thought).

>
> P.S. none of these codes is "visual". Turning these codes to glyphs is
> the job of the font. See http://mindprod.com/jgloss/font.html


By visual I meant NOT binary, i.e. I do not want to get to the raw bit
pattern that represents E-acute; I want the single char that is
E-acute to be mapped to the equivalent 6 bytes of the form \uxxxx.

> --
> Roedy Green Canadian Mind Products http://mindprod.com


-Andrew M.

Andrew 08-06-2009 05:32 PM

Re: how do I expand a unicode string to its visual UTF8 representation?
 
On 6 Aug, 17:46, Knute Johnson <nos...@rabbitbrush.frazmtn.com> wrote:
> Andrew wrote:
>
> > Well, thanks for the quick reply, but that hasn't quite worked has it?
> > All the chars have come out as \uxxxx. I want the ones that are 7 bit
> > ASCII to come out as the normal printable char, i.e I want the
> > output of doit to be:
> >
> > Copyright \u00A9 2009 Here is the phrase (in Icelandic): I can eat
> > glass and it doesn't hurt me \u00C9g get eti\u00F0 gler \u00E1n
> > \u00FEess a\u00F0 mei\u00F0a mig
>
> Well I figured since you had a fairly sophisticated question and
> appeared to have some knowledge of Java that you could figure out how to
> use the 'if' statement yourself. Oh and just so you don't complain that
> I used lower case hex, I fixed that too.
>
>     public void doit() {
>         StringBuilder builder = new StringBuilder();
>         builder.append("Copyright \u00A9 2009\n");
>         builder.append("Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me\n");
>         builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig");
>         String str = builder.toString();
>
>         System.out.println(str);
>
>         byte[] buf = str.getBytes();
>         for (byte b : buf) {
>             if ((b & 0x80) == 0)
>                 System.out.print(new String(new byte[] { b }));
>             else
>                 System.out.printf("\\u%04X",b);
>         }
>     }


I do appreciate you trying to help but I'm afraid that code does not
do the job. When I run it, this is what I get:

Copyright \u00C2\u00A9 2009
Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me
\u00C3\u0089g get eti\u00C3\u00B0 gler \u00C3\u00A1n \u00C3\u00BEess a
\u00C3\u00

For example, the copyright symbol comes out as 00C2 when I expect
00A9. The E-acute comes out as 00C3 where I expect 00C9.

-Andrew Marlow
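
The numbers reported above are what one would expect if str.getBytes() encoded the string as UTF-8 on that machine (an assumption; the platform default is not stated in the thread), because UTF-8 turns every non-ASCII character into two or more bytes. The sketch below, with made-up names (BytesVersusChars, utf8Hex, utf16Hex), contrasts the UTF-8 bytes of a string with its 16-bit char values, which are the numbers the escape notation actually uses.

```java
import java.nio.charset.StandardCharsets;

final class BytesVersusChars {
    private BytesVersusChars() {}

    /** Space-separated hex dump of the string's UTF-8 bytes. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Space-separated hex dump of the string's 16-bit char values (UTF-16 code units). */
    static String utf16Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String copyright = "\u00A9";
        System.out.println(utf8Hex(copyright));   // prints C2 A9
        System.out.println(utf16Hex(copyright));  // prints 00A9
    }
}
```

Iterating over the chars with charAt, as in Arne's encode method, does not involve any byte encoding, so its result does not depend on the platform default.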


