Velocity Reviews - Computer Hardware Reviews

Re: Getting started with AVR and C

 
 
Richard Damon
 
      12-15-2012
On 12/11/12 9:53 AM, John Devereux wrote:
>
> UTF-8 is the way forward isn't it?
>


As with most compression systems it depends on what the usage pattern of
characters is. If the text base is mostly the 7 bit ASCII character set,
with some of the other lower valued characters and only a few bigger
valued characters, UTF-8 makes sense. If most of the characters are in
the larger values (like using a non-Latin based character set) then
UTF-16 may make much more sense.
 
 
 
 
 
Keith Thompson
 
      12-15-2012
Richard Damon <(E-Mail Removed)> writes:
> On 12/11/12 9:53 AM, John Devereux wrote:
>> UTF-8 is the way forward isn't it?

>
> As with most compression systems it depends on what the usage pattern of
> characters is. If the text base is mostly the 7 bit ASCII character set,
> with some of the other lower valued characters and only a few bigger
> valued characters, UTF-8 makes sense. If most of the characters are in
> the larger values (like using a non-Latin based character set) then
> UTF-16 may make much more sense.


UTF-8 has a couple of other advantages. It's equivalent to ASCII
as long as all the characters are <= 127, which means you can
(mostly) deal with UTF-8 using old tools that aren't Unicode-aware.
And it has no byte ordering issues, so it doesn't need a BOM (Byte
Order Mark).

As for compression, you can always use another compression tool
if necessary; gzipped UTF-8 should be about as compact as gzipped
UTF-16.

--
Keith Thompson (The_Other_Keith) (E-Mail Removed) <http://www.ghoti.net/~kst>
Will write code for food.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
 
 
 
 
 
Richard Damon
 
      12-15-2012
On 12/15/12 3:24 PM, Keith Thompson wrote:
> Richard Damon <(E-Mail Removed)> writes:
>> On 12/11/12 9:53 AM, John Devereux wrote:
>>> UTF-8 is the way forward isn't it?

>>
>> As with most compression systems it depends on what the usage pattern of
>> characters is. If the text base is mostly the 7 bit ASCII character set,
>> with some of the other lower valued characters and only a few bigger
>> valued characters, UTF-8 makes sense. If most of the characters are in
>> the larger values (like using a non-Latin based character set) then
>> UTF-16 may make much more sense.

>
> UTF-8 has a couple of other advantages. It's equivalent to ASCII
> as long as all the characters are <= 127, which means you can
> (mostly) deal with UTF-8 using old tools that aren't Unicode-aware.
> And it has no byte ordering issues, so it doesn't need a BOM (Byte
> Order Mark).
>
> As for compression, you can always use another compression tool
> if necessary; gzipped UTF-8 should be about as compact as gzipped
> UTF-16.
>


UTF-8 and UTF-16 *ARE* compression methods. Uncompressed Unicode would
be UTF-32 or UCS-4, using 32 bits per character. For most uses, if you
don't need code points above U+FFFF, you might consider UCS-2 the
uncompressed format; then UTF-16 isn't really compression, but a way
to mark the very rare characters above U+FFFF. UTF-8 is really just a
compression format that tries to squeeze out some of the extra space,
and it succeeds to the extent that characters U+0000-U+007F are more
common than those at U+0800 and above, the former saving you a byte
and the latter costing you one.

UTF-8 does have the other advantage you mention: looking like ASCII
for those characters lets many Unicode-unaware programs mostly
function with UTF-8 data.
 
 
Keith Thompson
 
      12-15-2012
Richard Damon <(E-Mail Removed)> writes:
> On 12/15/12 3:24 PM, Keith Thompson wrote:

[...]
>> As for compression, you can always use another compression tool
>> if necessary; gzipped UTF-8 should be about as compact as gzipped
>> UTF-16.
>>

>
> UTF-8 and UTF-16 *ARE* compression methods.

[...]

I don't recall saying they aren't.

But they're (relatively) simplistic compression methods that don't
adapt to the content being compressed, which is why applying another
compression tool (I *did* say "another") can be useful.

--
Keith Thompson (The_Other_Keith) (E-Mail Removed) <http://www.ghoti.net/~kst>
Will write code for food.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
 
 
Nobody
 
      12-16-2012
On Sat, 15 Dec 2012 14:30:44 -0500, Richard Damon wrote:

>> UTF-8 is the way forward isn't it?
>>

>
> As with most compression systems it depends on what the usage pattern of
> characters is. If the text base is mostly the 7 bit ASCII character set,
> with some of the other lower valued characters and only a few bigger
> valued characters, UTF-8 makes sense. If most of the characters are in
> the larger values (like using a non-Latin based character set) then
> UTF-16 may make much more sense.


Size isn't the only issue; the fact that UTF-16 may (and usually does)
contain null bytes ('\0') rules it out for many applications.

Similarly, anything that expects specific byte values (e.g. '\x0a',
'\x0d', etc.) to have their "usual" meanings regardless of context will
work fine with UTF-8 but not with UTF-16 or UTF-32.

 
 
upsidedown@downunder.com
 
      12-16-2012
On Sat, 15 Dec 2012 14:30:44 -0500, Richard Damon
<(E-Mail Removed)> wrote:

>On 12/11/12 9:53 AM, John Devereux wrote:
>>
>> UTF-8 is the way forward isn't it?
>>

>
>As with most compression systems it depends on what the usage pattern of
>characters is. If the text base is mostly the 7 bit ASCII character set,
>with some of the other lower valued characters and only a few bigger
>valued characters, UTF-8 makes sense. If most of the characters are in
>the larger values (like using a non-Latin based character set) then
>UTF-16 may make much more sense.


For any given non-Latin-based language, only a few bit combinations
occur in the first byte(s) of each UTF-8 sequence, so the text should
still compress quite well.

For use inside a program, UTF-32 would be the natural choice, with one
array element per character.

Compressing a UTF-32 file with some form of Huffman coding should not
take more space than compressed UTF-8/UTF-16 files, since the symbol
table actually used (and stored) would reflect the actual usage of
sequences across the whole file. Doing the compression on the fly over
a communication link would be less effective, since only part of the
data is available at a time if the latencies are to stay acceptable.

 
 
Richard Damon
 
      12-17-2012
On 12/15/12 6:45 PM, Keith Thompson wrote:
> Richard Damon <(E-Mail Removed)> writes:
>> On 12/15/12 3:24 PM, Keith Thompson wrote:

> [...]
>>> As for compression, you can always use another compression tool
>>> if necessary; gzipped UTF-8 should be about as compact as gzipped
>>> UTF-16.
>>>

>>
>> UTF-8 and UTF-16 *ARE* compression methods.

> [...]
>
> I don't recall saying they aren't.
>
> But they're (relatively) simplistic compression methods that don't
> adapt to the content being compressed, which is why applying another
> compression tool (I *did* say "another") can be useful.
>


But they are fundamentally different from other compression methods.
Multi-byte/symbol encodings are generally designed so that it is
possible to process the data in that encoding; it isn't much harder
to process than if it were kept fully expanded. Some operations,
like computing the length of a string, require a pass over the data
instead of just taking the difference of two addresses, but nothing
becomes particularly hard.

On the other hand, it is very unusual for a program to actually
process "zipped" data as such; it is almost always uncompressed to be
worked on and then re-compressed, and any change tends to require
reprocessing the entire rest of the file (or at least the current
compression block).
 
 
Keith Thompson
 
      12-17-2012
Richard Damon <(E-Mail Removed)> writes:
> On 12/15/12 6:45 PM, Keith Thompson wrote:
>> Richard Damon <(E-Mail Removed)> writes:
>>> On 12/15/12 3:24 PM, Keith Thompson wrote:

>> [...]
>>>> As for compression, you can always use another compression tool
>>>> if necessary; gzipped UTF-8 should be about as compact as gzipped
>>>> UTF-16.
>>>>
>>>
>>> UTF-8 and UTF-16 *ARE* compression methods.

>> [...]
>>
>> I don't recall saying they aren't.
>>
>> But they're (relatively) simplistic compression methods that don't
>> adapt to the content being compressed, which is why applying another
>> compression tool (I *did* say "another") can be useful.

>
> But they are fundamentally different than other compressions.
> Multi-byte/symbol encodings are generally designed so that it is
> possible to process the data in that encoding. It isn't that much harder
> to process the data then if it was kept fully expanded. Some operations,
> like computing the length of a string, require doing a pass over the
> data instead of just taking the difference in the addresses, but nothing
> becomes particularly hard.
>
> On the other hand, it is very unusual for any program to actually
> process "zipped" data as such, it is almost always uncompressed to be
> worked on and then re-compressed, and any changes tend to require
> reprocessing the entire rest of the file (or at least the current
> compression block).


I'd say that's a difference of degree, not anything fundamental.

Computing the length of a string requires doing a pass over it, whether
it's UTF-8 encoded or gzipped. And it's certainly possible to process
UTF-8 data by internally converting it to UTF-32.

And copying a file doesn't require uncompressing it, regardless of the
format.

--
Keith Thompson (The_Other_Keith) (E-Mail Removed) <http://www.ghoti.net/~kst>
Will write code for food.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
 
 
Phil Carmody
 
      12-17-2012
Richard Damon <(E-Mail Removed)> writes:
> On 12/15/12 3:24 PM, Keith Thompson wrote:
> > Richard Damon <(E-Mail Removed)> writes:
> >> On 12/11/12 9:53 AM, John Devereux wrote:
> >>> UTF-8 is the way forward isn't it?
> >>
> >> As with most compression systems it depends on what the usage pattern of
> >> characters is. If the text base is mostly the 7 bit ASCII character set,
> >> with some of the other lower valued characters and only a few bigger
> >> valued characters, UTF-8 makes sense. If most of the characters are in
> >> the larger values (like using a non-Latin based character set) then
> >> UTF-16 may make much more sense.

> >
> > UTF-8 has a couple of other advantages. It's equivalent to ASCII
> > as long as all the characters are <= 127, which means you can
> > (mostly) deal with UTF-8 using old tools that aren't Unicode-aware.
> > And it has no byte ordering issues, so it doesn't need a BOM (Byte
> > Order Mark).
> >
> > As for compression, you can always use another compression tool
> > if necessary; gzipped UTF-8 should be about as compact as gzipped
> > UTF-16.
> >

>
> UTF-8 and UTF-16 *ARE* compression methods.


Hmmm, those who work in compression tend to prefer the term
"encodings" for such fixed 1-1 mappings of input to output
tokens. UTF-8, and the others you consider to be "compressed",
simply have output tokens of different lengths.

Phil
--
I'm not saying that google groups censors my posts, but there's a strong link
between me saying "google groups sucks" in articles, and them disappearing.

Oh - I guess I might be saying that google groups censors my posts.
 
 
John Devereux
 
      12-17-2012
Keith Thompson <(E-Mail Removed)> writes:

> Richard Damon <(E-Mail Removed)> writes:
>> On 12/11/12 9:53 AM, John Devereux wrote:
>>> UTF-8 is the way forward isn't it?

>>
>> As with most compression systems it depends on what the usage pattern of
>> characters is. If the text base is mostly the 7 bit ASCII character set,
>> with some of the other lower valued characters and only a few bigger
>> valued characters, UTF-8 makes sense. If most of the characters are in
>> the larger values (like using a non-Latin based character set) then
>> UTF-16 may make much more sense.

>
> UTF-8 has a couple of other advantages. It's equivalent to ASCII
> as long as all the characters are <= 127, which means you can
> (mostly) deal with UTF-8 using old tools that aren't Unicode-aware.
> And it has no byte ordering issues, so it doesn't need a BOM (Byte
> Order Mark).


Yes, precisely. I had to update an embedded system with a simple
home-made GUI so that it could do Chinese. I was pleasantly surprised
how painless it was using UTF-8. Strings are still null-terminated char
arrays; almost everything just worked as before. You can't predict the
number of characters from the string size alone, but I was already using
proportional fonts so this was not an issue. I could even abuse the C
standard - sorry c.l.c - and embed UTF-8 in the C source code, and that
worked too. (I moved these out into resource files in the end, though.)

> As for compression, you can always use another compression tool
> if necessary; gzipped UTF-8 should be about as compact as gzipped
> UTF-16.


--

John Devereux
 
 
 
 