Re: number of bytes for each (uni)code point while using utf-8 as encoding ...

 
 
Joshua Cranmer
      07-12-2012
On 7/10/2012 3:45 PM, lbrt chx _ gemale wrote:
>> On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:
>>> How can you get the number of bytes you "get()"?
>>
>> Well, UTF-8 always encodes the same char to the same (number of) bytes,
>> doesn't it?
>
> ~
> What about files, which (author's) claim to be UTF-8 encoded but they aren't, and/or get somehow corrupted in transit? There are quite a bit of "monkeys" (us) messing with the metadata headers of html pages
> ~
> Sometimes you must double check every file you keep in a text bank/corpus, because, through associations, one mistake may propagate and create other kinds of problems
> ~


I don't see how knowing the char -> length mapping is going to help you
in this case. If your input is a blob of bytes which someone claims is
UTF-8 but isn't, you can set up decoders to throw an error, or at least
to insert the replacement char (U+FFFD), either of which makes it
detectable that someone screwed up.
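
As a minimal sketch of both options, using only the standard java.nio.charset API: the class name Utf8Check and its method names are made up for illustration, and the input is assumed to fit in memory as a byte array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class Utf8Check {

    // Strict decoding: throws CharacterCodingException on the first
    // byte sequence that is not valid UTF-8.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Lenient decoding: this String constructor replaces malformed
    // input with the charset's replacement string, U+FFFD for UTF-8.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // True if the bytes decode cleanly as UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0xE9 is 'e acute' in ISO-8859-1 but an incomplete lead byte in UTF-8.
        byte[] claimedUtf8 = {(byte) 0xE9, (byte) 0x61};
        System.out.println(isValidUtf8(claimedUtf8));
        System.out.println(decodeLenient(claimedUtf8).indexOf('\uFFFD') >= 0);
    }
}
```

Passing this check only tells you the bytes are well-formed UTF-8, not that the text was meant to be UTF-8; it is a detector for the "claims to be UTF-8 but isn't" case and nothing more.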

The problem also is, if it's not UTF-8, what is it then? The heuristics
for this kind of stuff are incredibly squirrelly, and it more or less turns
out that the most reliable way to fix it is to know the default charset
of the computer spitting data out at you. Even then, there's still a
possibility that its input was screwed up in a similar fashion: I've
seen one message undergo the standard I-thought-your-UTF8-was-ISO-8859-1
mistake twice, so that each two-byte UTF-8 character ended up as 4
gibberish characters.
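
To illustrate the double misreading, here is a small sketch that simulates it and then undoes it. The class Mojibake and its methods are hypothetical names, and it assumes the wrong charset really was ISO-8859-1; in practice it is often windows-1252, in which case the round trip below would need that charset instead.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class Mojibake {

    // One round of the mistake: UTF-8 bytes read as if they were ISO-8859-1.
    static String misread(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8),
                          StandardCharsets.ISO_8859_1);
    }

    // Undoes one round: recover the original bytes, then decode them as UTF-8.
    static String repair(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e9";     // 'e acute', two bytes in UTF-8
        String once = misread(original);
        String twice = misread(once);
        System.out.println(once.length());   // prints 2
        System.out.println(twice.length());  // prints 4
        System.out.println(repair(repair(twice)).equals(original)); // prints true
    }
}
```

The repair only works because you know which wrong charset was applied and how many times; without that knowledge you are back to guessing.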

--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth


 