length of char in bits differs on Win/Linux and Mac

 
 
Bart Rider
      05-29-2006

Hi all,

last week I had to write a little homework program that opens
a file and counts all characters present in it. I did it
using a counting array of size 256 and incrementing the
char's position by one whenever I read that character from the
file.
The file itself was opened via a FileReader/BufferedReader and
the lines were read by readLine().

Now I observed the following. The character 'ä' stored in the
char variable c and used to access the counting array:
countingArray[c]++
caused no problems on Windows/Linux computers, but it failed on Macs,
where the value 8240 (0x2030) was assigned to this char.

It seems to me, that char on mac computers is 16bit wide.
Is this true?

Using a mac even a double cast like
countingArray[(char)(int)c]++
did not work. And (c & 0xFF) was no option either, because now
i would match the 'ä' to '0' (0x30).

I solved the problem by using a try-catch-block and counting
'other' characters through it.

Best regards,
Bart
 
Thomas Hawtin
      05-29-2006
Bart Rider wrote:
>
> Now i observed the following. The character 'ä' stored in the
> char variable c and used to access the counting array:
> countingArray[c]++
> caused no problems on windows/linux computers, but on macs,
> where the value 8240 (0x2030) was assigned with this char.
>
> It seems to me, that char on mac computers is 16bit wide.
> Is this true?


Windows is probably using a single byte character encoding (probably
Cp1252 or similar), whereas Linux and Macs are probably using UTF-8,
which encodes ASCII characters as ASCII, but characters with codes of
128 or higher as sequences of two or more bytes.

http://en.wikipedia.org/wiki/UTF-8

On Linux, I believe Java by default uses the LANG environment variable. If you
type echo $LANG you should see something like en_US.UTF-8 printed. You
can get back to old-fashioned character sets with export LANG=C (as it's
an environment variable, it won't apply to Java processes run from other
shell processes).
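
A minimal sketch for checking which default encoding a given JVM actually picked up. The class name DefaultCharsetDemo is made up for illustration; Charset.defaultCharset() and the file.encoding system property are standard Java. Note that on recent JDKs (18 and later) the default is UTF-8 regardless of the locale unless it is overridden, so the output depends on both the platform and the Java version.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the original program.
final class DefaultCharsetDemo {

    /** Name of the JVM default charset, which FileReader uses when no charset is given. */
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + defaultCharsetName());
        System.out.println("file.encoding  : " + System.getProperty("file.encoding"));
    }
}
```

Running this on each machine should show whether the platforms really disagree about the default.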

Tom Hawtin
--
Unemployed English Java programmer
http://jroller.com/page/tackline/
 
Thomas Weidenfeller
      05-29-2006
Bart Rider wrote:
> Now i observed the following. The character 'ä' stored in the
> char variable c and used to access the counting array:
> countingArray[c]++
> caused no problems on windows/linux computers, but on macs,
> where the value 8240 (0x2030) was assigned with this char.
>
> It seems to me, that char on mac computers is 16bit wide.
> Is this true?


You were just lucky on Windows with your algorithm, and you used the
wrong encoding for reading on the Mac.

You were lucky on Windows, because Java uses Unicode for all characters.
Current Unicode standards support characters with code points beyond
2^16 (Unicode is not a 16-bit standard) - although you have
trouble with Unicode beyond 2^16 in Java. But whatever Java version you
use, your 256-element array could have failed at any time. You were lucky,
because your input didn't contain any character beyond the Latin-1
range. If it had, your code would have blown up on Windows already.

Regarding the Mac result: You used the wrong encoding. When you read
text data into Java, Java needs to know in what encoding that data
comes, so it can be translated to Java's internal Unicode. You did use
an encoding (implicitly or explicitly) which triggered the translation
of some input data to the Unicode code point 0x2030. Since 0x2030 is the
Unicode code point for the permille sign, and not for a-umlaut, the
conversion was wrong.
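
A small sketch of how the same byte can decode to different code points. It assumes, as one plausible case, that the file stores a-umlaut as the single byte 0xE4 (as in ISO-8859-1) and that the Mac default was MacRoman. The class and method names are made up for illustration, and "x-MacRoman" is an extended charset that may be missing from some runtimes, hence the guard.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original program.
final class DecodeDemo {

    /** Decodes a single byte with the given charset and returns the resulting char. */
    static char decodeByte(int b, Charset cs) {
        return new String(new byte[] { (byte) b }, cs).charAt(0);
    }

    public static void main(String[] args) {
        // Byte 0xE4 is a-umlaut (U+00E4) in ISO-8859-1.
        System.out.printf("ISO-8859-1: U+%04X%n",
                (int) decodeByte(0xE4, StandardCharsets.ISO_8859_1));

        // The same byte is the per mille sign (U+2030) in MacRoman.
        if (Charset.isSupported("x-MacRoman")) {
            System.out.printf("MacRoman  : U+%04X%n",
                    (int) decodeByte(0xE4, Charset.forName("x-MacRoman")));
        }
    }
}
```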

You need to fix the encoding which you use for reading the data. All
your casting and bit-masking is nonsense, it will not fix the
encoding problem.

In general, even if you had fixed the encoding problem, your original
algorithm was faulty. It failed for everything beyond code point 255,
which leaves roughly 96000 possible characters your algorithm doesn't
cover. Your original algorithm just handled about 1/377th of all valid
input values.

You only partly fixed that with the counting of 'other' characters,
partly only because ...

> I solved the problem by using a try-catch-block and counting
> 'other' characters through it.


.... using exceptions to handle valid input data is bad. A simple
comparison to check whether a code point is greater than 255 would be
the right thing to do here.
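
A minimal sketch of such a comparison, assuming a 256-element array plus a single 'other' counter. BoundedCounter and its method names are invented for this example and are not the OP's actual code.

```java
// Hypothetical counter: chars 0..255 go into the array, everything else into 'other'.
final class BoundedCounter {
    private final int[] counts = new int[256];
    private int other;

    /** Counts one char without relying on ArrayIndexOutOfBoundsException. */
    void add(char c) {
        if (c < counts.length) {
            counts[c]++;
        } else {
            other++;
        }
    }

    int countOf(char c) {
        return c < counts.length ? counts[c] : 0;
    }

    int otherCount() {
        return other;
    }

    /** Convenience factory that counts every char of the given text. */
    static BoundedCounter of(String text) {
        BoundedCounter bc = new BoundedCounter();
        for (int i = 0; i < text.length(); i++) {
            bc.add(text.charAt(i));
        }
        return bc;
    }

    public static void main(String[] args) {
        BoundedCounter bc = BoundedCounter.of("a\u00E4\u2030a");
        System.out.println("a     : " + bc.countOf('a'));
        System.out.println("other : " + bc.otherCount());
    }
}
```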

/Thomas
--
The comp.lang.java.gui FAQ:
ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/...g/java/gui/faq
http://www.uni-giessen.de/faq/archiv....java.gui.faq/
 
 
alexandre_paterson@yahoo.fr
      05-29-2006
Bart Rider wrote:
> Hi all,
>
> last week i had to write a little homework program like open
> a file and count all characters present in this file.


Apparently that's not what your program is trying to do: your
program seems to be trying to count how many occurrences of each
character appear in the file.

The billion-dollar question: what is the encoding of the file
containing the characters you want to count?


> I did it
> using a counting array of size 256 and increasing the specific
> chars position by one, if i've read that character from the
> file.


It could work the way you programmed it if you knew for sure
that your source file contains characters that could be mapped
to ISO-Latin-1 chars when "decoded"/recoded to Unicode.

If a Java char is between 0 and 127 you know that it is an
ASCII character (and hence also an ISO-Latin-1 character).

If a Java char is between 160 and 255 you know that you
have an ISO-Latin-1 character (128 through 159 being
control codes).

If you read a file by specifying a wrong encoding (or by using
a default encoding that doesn't match your file's encoding),
you'll read meaningless char values...

If you read a file specifying a correct encoding, and the file
contains characters outside the ISO-Latin-1
range (which is completely legal), some of your chars *will*
be greater than 255 and hence your broken code *will*
throw ArrayIndexOutOfBoundsExceptions.


> It seems to me, that char on mac computers is 16bit wide.
> Is this true?


"char" in Java is always 16 bits wide (which is unfortunate btw,
because since Unicode 3.1 this is not wide enough to represent
every Unicode code point, but this is another topic).
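
A short sketch of what that means in practice. The class name is made up. U+1D11E (a musical symbol outside the 16-bit range) is picked only as an example of a supplementary code point.

```java
// Hypothetical demo class showing the width of char and a surrogate pair.
final class CharWidthDemo {

    /** Number of Unicode code points in s, as opposed to the number of chars. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Character.SIZE is the number of bits in a char on every platform.
        System.out.println("bits per char: " + Character.SIZE);

        // A code point above U+FFFF needs two chars (a surrogate pair).
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println("chars       : " + clef.length());
        System.out.println("code points : " + codePoints(clef));
    }
}
```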

Your question shows one thing: you need to read up on Java's
primitive char type and on the various character encodings.


> Using a mac even a double cast like
> countingArray[(char)(int)c]++


nonsense...


> did not work. And (c & 0xFF) was no option either, because now
> i would match the 'ä' to '0' (0x30).


0x2030 & 0xff gives indeed 0x30...
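
The arithmetic can be checked with a few lines. This is only an illustration with made-up names. It also shows why the double cast changes nothing: a char widened to int and narrowed back is the same char.

```java
// Hypothetical demo class for the masking and casting discussed above.
final class MaskDemo {

    /** Keeps only the low 8 bits of a char. */
    static int lowByte(char c) {
        return c & 0xFF;
    }

    public static void main(String[] args) {
        char perMille = '\u2030';

        // 0x2030 & 0xFF == 0x30, which is the code of the digit '0'.
        System.out.printf("masked : 0x%02X ('%c')%n", lowByte(perMille), (char) lowByte(perMille));

        // Widening to int and narrowing back to char is a round trip.
        System.out.println("double cast unchanged: " + ((char) (int) perMille == perMille));
    }
}
```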

'ä' can be represented in ISO-Latin-1 and in Unicode by the value
0x00e4 (it cannot be represented in ASCII).

The problem is that you're using FileReader, which uses the
platform's default encoding (in this case "MACROMAN"), on a
file that is encoded in ISO-8859-1, hence the
conversion of the byte 0xE4 to 0x2030 instead of 0x00e4.

You should use an InputStreamReader and specify the correct
encoding:

InputStream is = new FileInputStream("/home/public/dl/tmp.txt");
InputStreamReader isr = new InputStreamReader(is, "ISO-8859-1");
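
A fuller sketch along the same lines, assuming the file really is ISO-8859-1. The class name Latin1Histogram is invented, and the file path is taken from the command line rather than hard-coded. Because ISO-8859-1 maps every byte to a code point in the range 0..255, a 256-element array cannot overflow under this assumption.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical example class: counts chars of an ISO-8859-1 encoded stream.
final class Latin1Histogram {

    /** Decodes the stream explicitly as ISO-8859-1 and counts each char. */
    static int[] count(InputStream in) throws IOException {
        int[] counts = new int[256];
        try (Reader r = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1))) {
            int c;
            while ((c = r.read()) != -1) {
                counts[c]++; // safe: ISO-8859-1 only yields 0..255
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Latin1Histogram <file>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            int[] counts = count(in);
            for (int i = 0; i < counts.length; i++) {
                if (counts[i] > 0) {
                    System.out.printf("U+%04X: %d%n", i, counts[i]);
                }
            }
        }
    }
}
```

Reading char by char instead of with readLine() also counts the line terminators, which may or may not be wanted.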


> I solved the problem by using a try-catch-block and counting
> 'other' characters through it.


Using exceptions for flow control is a seriously broken way of
programming in Java...

You want to read up on "encoding", you want to know what is
the encoding of the file you're trying to read, you want to
know what your platform's default encoding is, you want to
understand what the char primitive in Java is, you want
to know that ISO-Latin-1 (aka ISO-8859-1) is a superset
of ASCII (using the same code for the same characters) and
you want to know that Unicode is a superset of the
ISO-Latin-1 characters (using the same "codepoint" [though
this is Unicode-specific terminology] for same characters).

As a last note, ASCII (aka US-ASCII) defines the position of
128 characters, not 256 as many people believe.

Hope it helps,

Alex

 
 
alexandre_paterson@yahoo.fr
      05-29-2006
Hi Thomas,

two really minor nitpicks...

(I thought the same "nonsense" about the OP's double cast.)


Thomas Weidenfeller wrote:
....
> Regarding the Mac result: You used the wrong encoding. When you read
> text data into Java, Java needs to know in what encoding that data
> comes, so it can be translated to Java's internal Unicode. You did use
> an encoding (implicitly or explicitly) which triggered the translation
> of some input data to the Unicode code point 0x2030. Since 0x2030 is the
> Unicode code point for the permille sign, and not for a-umlaut, the
> conversion was wrong.


yup, wrong conversion because FileReader uses the platform's default
encoding, "MACROMAN" in his case, to read a file that is not encoded
in MACROMAN.


> > I solved the problem by using a try-catch-block and counting
> > 'other' characters through it.

>
> ... using exceptions to handle valid input data is bad. A simple
> comparison if a code point is greater 255 would be the right thing to do
> here.


The right thing to do here would be to use an InputStreamReader and
specify the correct file encoding (ie ISO-8859-1).

 
 
Thomas Weidenfeller
      05-29-2006
alexandre_paterson@yahoo.fr wrote:
> The right thing to do here would be to use an InputStreamReader and
> specify the correct file encoding (ie ISO-8859-1).


Only if one knows that the input is indeed ISO-8859-1 - which the OP
didn't tell us. If the input contains data which, if correctly
decoded, maps to a Unicode code point greater than 255, you are back to
the same problem. The usage of an 'other' counter is IMHO a good idea.

/Thomas
 
 
Oliver Wong
      05-29-2006

alexandre_paterson@yahoo.fr wrote:
> Bart Rider wrote:
> > Hi all,
> >
> > last week i had to write a little homework program like open
> > a file and count all characters present in this file.

>
> Apparently that's not what your program is trying to do: your
> program seems to be trying to count how many occurence of each
> character appears in the file.


This threw me off too. To the OP: Please be very precise about what your
program is supposed to do, or else I'll be very confused and my advice will
probably be less effective.

> > I did it
> > using a counting array of size 256 and increasing the specific
> > chars position by one, if i've read that character from the
> > file.


Perhaps the OP isn't trying to read characters at all, but instead is
reading in bytes. That is, the reader could stick with an array of size 256,
and read in one byte at a time, counting how often each byte appears in a
file. That would remove the need for an encoding altogether, as well as
that "others" variable mentioned upthread.
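
A minimal sketch of the byte-counting variant, assuming raw bytes are what should be counted. The class name ByteHistogram is made up. No Reader and no charset is involved, and since a byte has exactly 256 possible values, the array can never be indexed out of bounds.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical example class: counts raw byte values, no decoding at all.
final class ByteHistogram {

    /** Counts how often each byte value 0..255 occurs in the stream. */
    static long[] count(InputStream in) throws IOException {
        long[] counts = new long[256];
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                // Java bytes are signed, so mask to get an index in 0..255.
                counts[buf[i] & 0xFF]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ByteHistogram <file>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            long[] counts = count(in);
            for (int i = 0; i < counts.length; i++) {
                if (counts[i] > 0) {
                    System.out.printf("0x%02X: %d%n", i, counts[i]);
                }
            }
        }
    }
}
```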

- Oliver

 
 
Rogan Dawes
      05-30-2006
Oliver Wong wrote:
>
> alexandre_paterson@yahoo.fr wrote:
>> Bart Rider wrote:
>> > Hi all,
>> >
>> > last week i had to write a little homework program like open
>> > a file and count all characters present in this file.

>>
>> Apparently that's not what your program is trying to do: your
>> program seems to be trying to count how many occurence of each
>> character appears in the file.

>
> This threw me off too. To the OP: Please be very precise about what
> your program is supposed to do, or else I'll be very confused and my
> advice will probably be less effective.
>
>> > I did it
>> > using a counting array of size 256 and increasing the specific
>> > chars position by one, if i've read that character from the
>> > file.

>
> Perhaps the OP isn't trying to read characters at all, but instead is
> reading in bytes. That is, the reader could stick with an array of size
> 256, and read in one byte at a time, counting how often each byte
> appears in a file. That would remove the need for an encoding all
> together, as well as that "others" variable mentioned upthread.
>
> - Oliver


As an additional aside, given that the OP will be potentially dealing
with far more characters than just 256, but possibly quite sparsely
distributed, the better data structure would probably be a
Map<Character, Integer>

Assuming he really IS interested in chars, not bytes, that is.
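
A minimal sketch of the map-based approach, with invented names. It uses a TreeMap only so that the output is sorted by char value. Map.merge requires Java 8 or later, which postdates this thread.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical example class: sparse per-char counts in a map.
final class CharFrequency {

    /** Counts every char of the text; only chars that occur get an entry. */
    static Map<Character, Integer> count(String text) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : text.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        for (Map.Entry<Character, Integer> e : count("a\u00E4\u2030a").entrySet()) {
            System.out.printf("U+%04X: %d%n", (int) e.getKey(), e.getValue());
        }
    }
}
```

Like the array version, this counts chars, so a code point above U+FFFF shows up as two separate surrogate entries.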

FWIW.

Rogan
 
 
Bart Rider
      05-30-2006
Rogan Dawes wrote:
> Oliver Wong wrote:
>
>>
>> alexandre_paterson@yahoo.fr wrote:
>>
>>> Bart Rider wrote:
>>> > Hi all,
>>> >
>>> > last week i had to write a little homework program like open
>>> > a file and count all characters present in this file.
>>>
>>> Apparently that's not what your program is trying to do: your
>>> program seems to be trying to count how many occurence of each
>>> character appears in the file.

>>
>>
>> This threw me off too. To the OP: Please be very precise about what
>> your program is supposed to do, or else I'll be very confused and my
>> advice will probably be less effective.
>>
>>> > I did it
>>> > using a counting array of size 256 and increasing the specific
>>> > chars position by one, if i've read that character from the
>>> > file.

>>
>>
>> Perhaps the OP isn't trying to read characters at all, but instead
>> is reading in bytes. That is, the reader could stick with an array of
>> size 256, and read in one byte at a time, counting how often each byte
>> appears in a file. That would remove the need for an encoding all
>> together, as well as that "others" variable mentioned upthread.
>>
>> - Oliver

>
>
> As an additional aside, given that the OP will be potentially dealing
> with far more characters than just 256, but possibly quite sparsely
> distributed, the better data structure would probably be a
> Map<Character, Integer>
>
> Assuming he really IS interested in chars, not bytes, that is.
>
> FWIW.
>
> Rogan


Thanks a lot for all your replies. They helped me a lot to
understand what the flaws in my little program are.

Actually I really thought char is only 8 bits wide (I come from
C programming, where char is a replacement for byte ...)
But now, with your hints on Unicode and character mapping, I
have to look more closely at every file I read and what I intend to
do with it.

Thanks again,
Bart
 
 
Chris Uppal
      05-30-2006
Rogan Dawes wrote:

> As an additional aside, given that the OP will be potentially dealing
> with far more characters than just 256, but possibly quite sparsely
> distributed, the better data structure would probably be a
> Map<Character, Integer>


Or maybe even an int[] array for the first 127 code points and a Map<Character,
Integer> to handle the overflow.
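
A sketch of that hybrid, under the assumption that the array covers the ASCII range 0..127 (128 slots) and everything else goes to the map. HybridCounter and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example class: array for ASCII, map for everything else.
final class HybridCounter {
    private final int[] ascii = new int[128];
    private final Map<Character, Integer> overflow = new HashMap<>();

    /** Counts one char, using the fast array path for ASCII. */
    void add(char c) {
        if (c < ascii.length) {
            ascii[c]++;
        } else {
            overflow.merge(c, 1, Integer::sum);
        }
    }

    /** Returns how often c has been added so far. */
    int get(char c) {
        return c < ascii.length ? ascii[c] : overflow.getOrDefault(c, 0);
    }

    public static void main(String[] args) {
        HybridCounter hc = new HybridCounter();
        for (char c : "a\u00E4\u2030a".toCharArray()) {
            hc.add(c);
        }
        System.out.println("a      : " + hc.get('a'));
        System.out.println("U+2030 : " + hc.get('\u2030'));
    }
}
```

Whether the extra branch pays off over a plain map depends on the input, so it is worth measuring before adopting it.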

-- chris


 