Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Java > strings - reading utf8 characters such as japanese. how?


strings - reading utf8 characters such as japanese. how?

 
 
stefoid
 
      07-03-2006
Hi. I've got a problem. I have some code that takes a text file and
breaks it into an array of substrings, for displaying the text truncated
to fit the screen width on word boundaries.

It just looks for spaces.

Trouble is, it crashes on Japanese text. There is a part of the
code that looks at the next character to see if it is a space:

ch = str.substring(offset, offset + 1);
isSpace = false;

// return when a new line is reached
if (ch.equals("\n"))
    return offset + 1;

currentWidth += font.stringWidth(ch);

if (ch.equals(" "))
    isSpace = true;

and if it isn't a space, it adds the width of the character (in pixels)
and keeps going until it does find a space.

The problem with this is that it assumes each byte is a character. In
UTF-8, up to three bytes can make up one character, so this code is trying to
find the widths of characters representing each byte in a UTF-8
sequence, rather than the width of the UTF-8 character as a whole.

My additional problem is that this is iAppli code, so I am limited to a 30K
codebase, and I have hit the limit, so I can't write any more lines of
code - I just have to change the existing code such that it doesn't
generate any more bytecode.

What can I do to the above code so that I can count widths of UTF-8
characters instead of ASCII characters, without writing too much extra
code? I need existing Java library functions to do it for me, but I
don't know what that functionality is.

 
 
 
 
 
Damian Driscoll
 
      07-03-2006
stefoid wrote:

> [original post snipped]
>
> What can I do to the above code so that I can count widths of UTF-8
> characters instead of ASCII characters, without writing too much extra
> code? I need existing Java library functions to do it for me, but I
> don't know what that functionality is.


have a look at:
http://javaalmanac.com/egs/java.nio....nvertChar.html

 
 
 
 
 
Chris Uppal
 
      07-03-2006
stefoid wrote:

> What can I do to the above code so that I can count widths of UTF-8
> characters instead of ASCII characters, without writing too much extra
> code? I need existing Java library functions to do it for me, but I
> don't know what that functionality is.


Why are you working in UTF-8 using Java Strings? Indeed /how/ are you doing
it -- I would put it somewhere between impossible and dangerously difficult and
confusing.

If you want to load your information as /text/, let Java decode the external
UTF-8 into Strings (of characters, already decoded as they are read in). If,
possibly for space reasons, you have to work in UTF-8 internally, then you'd be
far better off keeping the data in byte[] arrays.

-- chris



 
 
Oliver Wong
 
      07-03-2006

"stefoid" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) oups.com...
> Hi. I've got a problem. I have some code that takes a text file and
> breaks it into an array of substrings, for displaying the text truncated
> to fit the screen width on word boundaries.
>
> It just looks for spaces.
>
> Trouble is, it crashes on Japanese text. There is a part of the
> code that looks at the next character to see if it is a space:
>
> ch = str.substring(offset, offset + 1);
> isSpace = false;
>
> // return when a new line is reached
> if (ch.equals("\n"))
> return offset+1;
>
> currentWidth += font.stringWidth(ch);
>
> if (ch.equals(" "))
> isSpace = true;
>
> and if it isn't a space, it adds the width of the character (in pixels)
> and keeps going until it does find a space.


How about something like:

<pseudoCode>
StringTokenizer st = new StringTokenizer(str, " \n", true);
int offset = 0;
while (st.hasMoreTokens()) {
    String token = st.nextToken();
    if (token.equals(" ")) {
        /* do whatever you gotta do with spaces here */
        offset++;
    } else if (token.equals("\n")) {
        return offset;
    } else {
        currentWidth += font.stringWidth(token);
        offset += token.length();
    }
}
</pseudoCode>

That way you avoid breaking the string up into its individual chars and
potentially splitting a character in two.

>
> The problem with this is that it assumes each byte is a character. In
> UTF-8, up to three bytes can make up one character, so this code is trying to
> find the widths of characters representing each byte in a UTF-8
> sequence, rather than the width of the UTF-8 character as a whole.


Actually, it assumes each (Java) char is a (semantic) character. A Java char
is 16 bits long, and Java Strings are internally stored in UTF-16, so a
semantic character might be spread over two Java chars (32 bits).
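To see the distinction (a Java SE sketch; Character.toChars and codePointCount are Java 5 methods that a cut-down CLDC Character class almost certainly lacks):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+20000 is a CJK ideograph outside the Basic Multilingual Plane,
        // so it needs two Java chars (a surrogate pair) in UTF-16
        String s = "A" + new String(Character.toChars(0x20000));
        System.out.println(s.length());                      // prints 3 (chars)
        System.out.println(s.codePointCount(0, s.length())); // prints 2 (characters)
    }
}
```

So code that walks a String one char at a time can land in the middle of a semantic character.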

>
> My additional problem is that this is iAppli code, so I am limited to a 30K
> codebase, and I have hit the limit, so I can't write any more lines of
> code - I just have to change the existing code such that it doesn't
> generate any more bytecode.


Sounds rough. Can't really help you with this.

>
> What can I do to the above code so that I can count widths of UTF-8
> characters instead of ASCII characters, without writing too much extra
> code? I need existing Java library functions to do it for me, but I
> don't know what that functionality is.


See above. Since you're working with Unicode, you might want to use the
Character.isWhitespace() method instead of the String.equals(" ") method. I
believe the Japanese whitespace has a different Unicode value than the ASCII
whitespace.
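For example, Japanese text normally uses the full-width ideographic space U+3000 rather than the ASCII space (a Java SE sketch; whether a cut-down CLDC Character class classifies U+3000 the same way is worth checking):

```java
public class WhitespaceDemo {
    public static void main(String[] args) {
        char ideographicSpace = '\u3000'; // full-width space used in Japanese text
        // Not equal to the ASCII space, so String.equals(" ") misses it
        System.out.println(ideographicSpace == ' ');                  // prints false
        // But Character.isWhitespace recognizes it as whitespace
        System.out.println(Character.isWhitespace(ideographicSpace)); // prints true
    }
}
```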

- Oliver

 
 
stefoid
 
      07-04-2006
Good question. An iAppli is something like an applet, designed to go
into a cut-down Java virtual machine to fit inside mobile devices. The
available Java libraries are greatly restricted - I have java.lang.String
and java.lang.Character to choose from (that relate to this problem). In
addition there is the 30K codebase limit, which I have reached - seriously, I
am like 2 bytes off the maximum.

This is the only part of the code where I have to recognize individual
characters. Everything else is just read a string and output it to the
screen, which works fine for UTF-8, because it's null-terminated.




Chris Uppal wrote:
> [quoted message snipped]


 
 
Chris Uppal
 
      07-04-2006
[reordered to remove top-posting]

stefoid wrote:

[me:]
> > Why are you working in UTF8 using Java Strings ? Indeed /how/ are you
> > doing it -- I would put it somewhere between impossible and dangerously
> > difficult and confusing.
> >

> Good question. An iAppli is something like an applet, designed to go
> into a cut-down Java virtual machine to fit inside mobile devices. The
> available Java libraries are greatly restricted - I have java.lang.String
> and java.lang.Character to choose from (that relate to this problem). In
> addition there is the 30K codebase limit, which I have reached - seriously, I
> am like 2 bytes off the maximum.
>
> This is the only part of the code where I have to recognize individual
> characters. Everything else is just read a string and output it to the
> screen, which works fine for UTF-8, because it's null-terminated.


But you haven't really answered my question. I'll try again:

Are you saying that your iAppli doesn't support byte[] arrays? I find that
impossible to believe.

Are you handling your UTF-8 data as binary (in byte[] arrays) or are you
somehow stuffing UTF-8 encoded data into Java Strings? If the latter, then
(a) why? and (b) how?

When you read your data in, why don't you use the Java-provided stuff to decode
the UTF-8 into native (decoded) Java Strings? I could understand that you
might want to stick with UTF-8 encoded data for space reasons, but then it
doesn't make sense that you'd put that data into Strings (16 bits per
character), which would double the space requirement over byte[] arrays for the
same data. (Unless you stuffed two bytes into each Java char -- which would be
downright perverse.)

Maybe this implementation lacks the character encoding stuff found everywhere in
real Java? If not, then why are you not using it? If it does, then I suspect
you are hosed.

-- chris


 
 
Oliver Wong
 
      07-04-2006

"stefoid" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) ps.com...
> good question. An iAppli is something like an applet, designed to go
> into a cutdown java virtual machine to fit inside mobile devices. The
> available java libraires are greatly restricted - I have lang.string
> and lang.character to choose from (that relate to this problem).


Maybe you should have mentioned this when you wrote

<quote>
I need existing java library functions to do it for me, but I
dont know what that fucntionality is.
</quote>

otherwise you're wasting people's time coming up with solutions that won't
solve your problem.

> In
> addition there is the 30K codebase limit, which I have reached - seriously, I
> am like 2 bytes off the maximum.
>
> This is the only part of the code where I have to recognize individual
> characters. Everything else is just read a string and output it to the
> screen, which works fine for UTF-8, because it's null-terminated.


My concern right now is that you might not know what you're talking
about. Where are you getting the string data from? What is the type of the
parameter of that string data? Is it String? Byte[]? byte[]? Something else?

What makes you believe it is UTF-8 encoded? What makes you think it's
null terminated?

I don't want to start explaining how to convert UTF-8 binary data
stuffed into Java Strings into "normal" Java Strings, unless I'm sure that's
what is necessary to solve your problem.

- Oliver

 
 
stefoid
 
      07-05-2006
Yeah, you're right, sorry I didn't mention that. I think you're also
right in that I don't have a firm grasp of Java strings, internal
encoding, etc...

This is the code that is used to read the UTF-8 text resources into
strings:

    dis = Connector.openDataInputStream(resourcePath);
    text = new byte[bytes];
    dis.readFully(text, 0, bytes);
    dis.close();
    return new String(text);

I didn't write it, but I wrote the code that uses the strings, and since
the strings passed to my stuff seemed to print OK, I was happy to
ignore where they came from. Now that guy has gone, the strings
are in Japanese, and the problems begin.

Actually, I have re-written the code that truncates the strings and
solved my original problem. It's very inefficient, but it uses fewer
lines of code than the original and still works, so I save bytes of
code, which is a godsend.

However, I have noticed another problem - the start of every UTF-8
encoded string resource starts with an unwanted 'dot' character which
does not appear in the original text files (whether it has passed
through my truncating code or not - it still happens). I have tracked
this down to (I think) the fact that Java uses a modified UTF-8 encoding
scheme, and the text files I am inputting are generated with Word, which
will be writing them in normal UTF-8. I assume that's the problem,
anyway. I have yet to work out how to fix it. I am looking for a
convert program that will convert the UTF-8 text files to modified
UTF-8 format... seems easiest and preserves precious bytes of code.

Any help appreciated.

Oliver Wong wrote:
> [quoted message snipped]


 
 
stefoid
 
      07-05-2006
I should add, here is what the CLDC has available (cut-down Java for
wireless devices and PDAs):

java.io:
Interfaces
--------
DataInput
DataOutput

Classes
-------
ByteArrayInputStream
ByteArrayOutputStream
DataInputStream
DataOutputStream
InputStream
InputStreamReader
OutputStream
OutputStreamWriter
PrintStream
Reader
Writer


java.lang:
Classes
---------
Boolean
Byte
Character
Class
Double
Float
Integer
Long
Math
Object
Runtime
Short
String
StringBuffer
System
Thread
Throwable

and something called the microedition connector API:

Interfaces
---------
Connection
ContentConnection
Datagram
DatagramConnection
InputConnection
OutputConnection
StreamConnection
StreamConnectionNotifier
Classes
----------
Connector

 
 
Oliver Wong
 
      07-05-2006
"stefoid" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) oups.com...
> This is the code that is used to read the UTF-8 text resources into
> strings:
>
>     dis = Connector.openDataInputStream(resourcePath);
>     text = new byte[bytes];
>     dis.readFully(text, 0, bytes);
>     dis.close();
>     return new String(text);
>
> I didn't write it, but I wrote the code that uses the strings, and since
> the strings passed to my stuff seemed to print OK, I was happy to
> ignore where they came from. Now that guy has gone, the strings
> are in Japanese, and the problems begin.


The problem is that new String(text) decodes the bytes using the platform's
default encoding, instead of specifying the encoding to be UTF-8.
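A minimal fix, assuming this CLDC's String class has the two-argument String(byte[], String) constructor, would be to change the last line of the loading code to return new String(text, "UTF-8"). A Java SE sketch of the difference:

```java
import java.io.UnsupportedEncodingException;

public class EncodingFix {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // The nine UTF-8 bytes for the three-character Japanese word "nihongo"
        byte[] text = {(byte) 0xE6, (byte) 0x97, (byte) 0xA5,
                       (byte) 0xE6, (byte) 0x9C, (byte) 0xAC,
                       (byte) 0xE8, (byte) 0xAA, (byte) 0x9E};
        String wrong = new String(text);          // platform default: may garble the bytes
        String right = new String(text, "UTF-8"); // explicit UTF-8: decodes correctly
        System.out.println(right.length()); // prints 3 (characters, not bytes)
    }
}
```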

>
> Actually I have re-written the code that truncates the strings and
> solved my original problem. Its very inefficient, but it uses less
> lines of code than the original and still works, so I save bytes of
> code which is a godsend.


I don't know if it's relevant, but I haven't seen "the code that
truncates the string".

>
> However, I have noticed another problem - the start of every utf8
> encoded string resource starts with an unwanted 'dot' character which
> does not appear in the original text files. (whether it has passed
> through my truncating code or not - it still happens) I have tracked
> this down to (I think) the fact that java uses a modified utf8 encoding
> scheme, and the text files I am inputting are generated with Word which
> will be writing them in normal utf8. I assume thats the problem,
> anyway. I have yet to work out how to fix it. I am looking for a
> convert program that will convert the the utf8 text files to modified
> utf8 format .. seems easiest and preserves precious bytes of code.


UTF-8 encoded files sometimes have a byte order mark (BOM) at the
beginning, which is probably your unwanted 'dot'. Incidentally, Java doesn't
use UTF-8 internally; it uses (a modified) UTF-16. The two formats are
significantly different. I think if you use a reader and specify the encoding
as UTF-8, it'll take care of handling the BOM for you.
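If the decoder does not strip it, the BOM comes through as the single character U+FEFF at the start of the string, which is cheap to drop by hand (a Java SE sketch; Word's UTF-8 BOM is the byte sequence EF BB BF):

```java
import java.io.UnsupportedEncodingException;

public class BomStrip {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // "hi" preceded by the UTF-8 byte order mark EF BB BF
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String s = new String(withBom, "UTF-8");
        // If the BOM survived decoding, it is the char U+FEFF at index 0
        if (s.length() > 0 && s.charAt(0) == '\uFEFF') {
            s = s.substring(1);
        }
        System.out.println(s); // prints "hi"
    }
}
```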

> [rest of quoted message snipped]


"stefoid" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) oups.com...
> I should add, here is what the CLDC has available (cut-down Java for
> wireless devices and PDAs)

[most of it snipped]
>
> InputStreamReader


Right, so after you get your DataInputStream, you should wrap an
InputStreamReader around it. I don't know if the constructors on CLDC are the
same as in Java SE, but in Java SE, it'd look like this:

<code>
// get your input stream somehow; in your case, it looks like this:
InputStream is = Connector.openDataInputStream(resourcePath);
InputStreamReader isr = new InputStreamReader(is, "UTF-8");
</code>

From there, you use the isr.read() method to read one character at a time
(note that a character is a 16-bit value, not an 8-bit value). If
.read() returns -1, that means it has reached the end of the stream.

Normally, in JavaSE, you'd also wrap your InputStreamReader into a
BufferedReader. In addition to improving performance via buffering,
BufferedReader also provides a convenience method readLine() which will
return a whole line of text to you, instead of only 1 character at a time.
Unfortunately, BufferedReader wasn't in the list of classes you provided, so
you might have to construct the string manually from the individual
characters.
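Putting that together, reading a whole UTF-8 resource into a String without BufferedReader might look like this (a Java SE sketch using a ByteArrayInputStream as a stand-in for the Connector stream):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class ReadUtf8Resource {
    public static void main(String[] args) throws IOException {
        // The nine UTF-8 bytes for "nihongo", standing in for the resource stream
        byte[] utf8 = {(byte) 0xE6, (byte) 0x97, (byte) 0xA5,
                       (byte) 0xE6, (byte) 0x9C, (byte) 0xAC,
                       (byte) 0xE8, (byte) 0xAA, (byte) 0x9E};
        InputStream is = new ByteArrayInputStream(utf8);
        // The reader decodes UTF-8 byte sequences into 16-bit chars as it reads
        InputStreamReader isr = new InputStreamReader(is, "UTF-8");
        // No BufferedReader on CLDC, so build the String one char at a time
        StringBuffer sb = new StringBuffer();
        int c;
        while ((c = isr.read()) != -1) {
            sb.append((char) c);
        }
        isr.close();
        String text = sb.toString();
        System.out.println(text.length()); // prints 3: chars, not bytes
    }
}
```

From there the existing word-wrapping code operates on an already-decoded String, so substring(offset, offset + 1) picks out whole Japanese characters (BMP ones, at least) instead of fragments of a UTF-8 byte sequence.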

- Oliver

 
 
 
 

