Questions on conversions between char* to unsigned char* and vice versa

Navaneeth
Guest
Posts: n/a

 12-31-2010
I have a few questions on conversions between "char*" and "unsigned char*" in both directions. I am assuming casting "unsigned char*" to "char*" is safe because "char" can hold all the values that an "unsigned char" can hold.

But conversion of "char*" to "unsigned char*" won't be safe, as "char" can hold more values. Is this understanding correct? In what cases will "char*" have negative values?

I have never seen negative values in a "char*" string. So is it safe to convert from "char*" to "unsigned char*"?

By conversion, I mean using a cast: char* c = (char*) string; where string is an "unsigned char*".

Why I am using unsigned char
------

If anyone is wondering why I use unsigned char: I use it for doing some UTF-8 processing on the string. I need it to skip the multi-byte sequences correctly.

Any help would be great!

Angus
Guest
Posts: n/a

 12-31-2010
On Dec 31, 11:19 am, Navaneeth <(E-Mail Removed)> wrote:
> I have few questions on conversions between "char*" to "unsigned char*" and vice versa. I am assuming casting "unsigned char*" to "char*" is safe because "char" can hold all the values that an "unsigned char" can hold.
>
> But conversion of "char*" to "unsigned char*" won't be safe as "char" can hold more values. Is this understanding correct? On what cases "char*" will have negative values?
>
> I have never seen negative values on a "char*" string. So is that safe to do conversion from "char*" to "unsigned char*"?
>
> By conversion, I mean using casting - char* c = (char*) string; where string is a "unsigned char*".
>
> Why I am using unsigned char
> ------
>
> If any one wondering, why I use unsigned char - I use it for doing some UTF8 processing on the string. I need to use that to skip the multi-byte sequences correctly.
>
> Any help would be great!

In ASCII (and maybe also EBCDIC, not sure) all the printing
characters are represented as positive numbers, i.e. only the lower 7
bits are used, so converting printable characters either way should
make no difference.

That also assumes your target machine uses a two's complement system.

If you are using extended characters then possibly you may have
problems.

Ben Bacarisse
Guest
Posts: n/a

 12-31-2010
Navaneeth <(E-Mail Removed)> writes:

> I have few questions on conversions between "char*" to "unsigned
> char*" and vice versa. I am assuming casting "unsigned char*" to
> "char*" is safe because "char" can hold all the values that an
> "unsigned char" can hold.
>
> But conversion of "char*" to "unsigned char*" won't be safe as "char"
> can hold more values. Is this understanding correct? On what cases
> "char*" will have negative values?

There's been some confusion in the answers you've had. For one thing,
they reinforce your idea that the conversion of a char * to an unsigned
char * might be related to the range of values the char and unsigned
char can represent. This is not the case.

You can convert from a char * to an unsigned char * because the language
standard permits this.

Once you have done so, the characters pointed to are not converted when
you access them. Conversion has a special meaning in C, and it does not
apply here. Having done:

unsigned char *up = (unsigned char *)cp;

*up (or up[0]) does not convert anything. It simply reinterprets the
first byte of whatever cp pointed to as an unsigned char -- i.e. as a
number from 0 to UCHAR_MAX (almost always 255).
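
A minimal sketch of that point, assuming 8-bit bytes. The helper name byte_at is made up for illustration and is not from the thread. Reading through the converted pointer yields the stored byte as a value in 0..UCHAR_MAX, whatever the signedness of plain char:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: read byte i of a char string through an
   unsigned char pointer. The stored byte is not converted; it is
   simply read as a number in the range 0..UCHAR_MAX. */
unsigned byte_at(const char *cp, size_t i)
{
    assert(cp != NULL);
    const unsigned char *up = (const unsigned char *)cp;
    return up[i];
}
```

For example, byte_at("\xC3\xA9", 0) gives 0xC3 (195) even on an implementation where plain char is signed and cp[0] itself would be negative.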

> I have never seen negative values on a "char*" string. So is that safe
> to do conversion from "char*" to "unsigned char*"?

Yes, and it is safe regardless of whether there are negative char values.

You may view *any* object at all (and a string of chars is no different in
this respect) by converting a pointer to it to an unsigned char * and
examining the bytes of the object by using that converted pointer.
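
As a rough illustration of examining an object's bytes this way, here is a sketch that writes them out as hex digits. The name bytes_to_hex is an assumption made for this example, and the masking assumes the usual 8-bit byte:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: write the bytes of any object as hex digits.
   out must have room for 2 * n + 1 characters. */
void bytes_to_hex(const void *obj, size_t n, char *out)
{
    static const char digits[] = "0123456789abcdef";
    const unsigned char *p = obj;   /* void * converts implicitly in C */

    assert(obj != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = digits[(p[i] >> 4) & 0x0F];  /* high nibble */
        out[2 * i + 1] = digits[p[i] & 0x0F];         /* low nibble  */
    }
    out[2 * n] = '\0';
}
```

The same call works for an int, a struct or a char string: pass its address and sizeof the object. The byte order you see for multi-byte types depends on the machine.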

> By conversion, I mean using casting - char* c = (char*) string; where
> string is a "unsigned char*".

This is also safe, but much less useful. char is an odd type -- it may
be signed or it may be unsigned, so it is less useful than unsigned char
for examining objects. However, it is safe to do this pointer conversion
and you'll do it often if you are working with unsigned char * and you
have to call library functions that expect a char * parameter.
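
A small sketch of that situation, using strlen from <string.h> as the library function. The wrapper name ustrlen is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: length of a NUL-terminated unsigned char
   buffer. The cast only changes the pointer type; the bytes that
   strlen inspects are the same ones the buffer holds. */
size_t ustrlen(const unsigned char *s)
{
    assert(s != NULL);
    return strlen((const char *)s);
}
```

Without the cast the call would still be diagnosed by the compiler, since char * and unsigned char * are distinct pointer types.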

> Why I am using unsigned char
> ------
>
> If any one wondering, why I use unsigned char - I use it for doing
> some UTF8 processing on the string. I need to use that to skip the
> multi-byte sequences correctly.

That's a perfectly valid reason to use unsigned char. You can do all
this using char * rather than unsigned char *, but I think the code is
clearer if you use unsigned char.
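
For what it's worth, here is one way the skipping could look. This is only a sketch under assumptions, not the poster's code: the names utf8_seq_len and utf8_count are invented, and the input is assumed to be mostly well-formed UTF-8 (overlong forms and out-of-range lead bytes are not rejected).

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence introduced by lead byte b,
   or 0 if b is a continuation byte or cannot start a sequence. */
size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;              /* 0xxxxxxx: ASCII      */
    if ((b & 0xE0) == 0xC0) return 2;    /* 110xxxxx             */
    if ((b & 0xF0) == 0xE0) return 3;    /* 1110xxxx             */
    if ((b & 0xF8) == 0xF0) return 4;    /* 11110xxx             */
    return 0;
}

/* Count code points in a NUL-terminated string by stepping over
   each multi-byte sequence as a unit. */
size_t utf8_count(const char *s)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    while (*p != '\0') {
        size_t len = utf8_seq_len(*p);
        if (len == 0)
            len = 1;                     /* stray byte: step over it */
        /* Stop early rather than run past the terminator if a
           sequence is truncated. */
        for (size_t i = 0; i < len && *p != '\0'; i++)
            p++;
        count++;
    }
    return count;
}
```

With plain char on a signed implementation, the test b < 0x80 would be true for every byte and the masks would operate on negative values, which is why the bytes are read as unsigned char here.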

--
Ben.

Keith Thompson
Guest
Posts: n/a

 12-31-2010
Navaneeth <(E-Mail Removed)> writes:
> I have few questions on conversions between "char*" to "unsigned
> char*" and vice versa. I am assuming casting "unsigned char*" to
> "char*" is safe because "char" can hold all the values that an
> "unsigned char" can hold.

char cannot necessarily hold all the values that an unsigned char can hold.

(Plain) char may be either signed or unsigned, depending on the
implementation. If it's unsigned, it has exactly the same range
as unsigned char. But if it's signed, it can hold negative values.
Very commonly, the range of char is -128 .. +127, and the range of
unsigned char is 0 .. 255.
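
To see what a given implementation does, the macros in <limits.h> can be printed. A sketch, with function names chosen only for this example:

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Nonzero if plain char is a signed type on this implementation. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Print the ranges of char and unsigned char for this implementation. */
void print_char_ranges(void)
{
    /* unsigned char must be able to hold every non-negative char value. */
    assert(CHAR_MAX <= UCHAR_MAX);

    printf("char:          %d .. %d (%s)\n", (int)CHAR_MIN, (int)CHAR_MAX,
           plain_char_is_signed() ? "signed" : "unsigned");
    printf("unsigned char: 0 .. %u\n", (unsigned)UCHAR_MAX);
    printf("CHAR_BIT:      %d\n", CHAR_BIT);
}
```

The output differs between implementations, which is the point: code should not rely on either answer.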

ASCII only specifies character values from 0 to 127, but there are
a number of extended-ASCII character sets (Latin-1, for example)
that specify character values from 0 to 255. This makes dealing
with Latin-1 characters as (signed) char slightly awkward.
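
One example of the awkwardness, as a sketch with made-up helper names: where plain char is signed, a naive test such as c == 0xE9 is false for a char holding the Latin-1 byte 0xE9, because the char has a negative value. Converting to unsigned char first avoids that, and the same conversion is needed before calling the <ctype.h> functions.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical helper: compare a plain char against a character
   code in the range 0..255, independent of char's signedness. */
int char_equals_code(char c, unsigned code)
{
    assert(code <= 255);
    return (unsigned char)c == code;
}

/* Hypothetical helper: the <ctype.h> functions require an argument
   representable as unsigned char (or EOF), so convert first. */
int safe_isalpha(char c)
{
    return isalpha((unsigned char)c) != 0;
}
```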

(EBCDIC is an 8-bit encoding; systems that use EBCDIC (almost?) always
make plain char unsigned.)

--
Keith Thompson (The_Other_Keith) (E-Mail Removed) <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Seebs
Guest
Posts: n/a

 01-01-2011
On 2010-12-31, Navaneeth <(E-Mail Removed)> wrote:
> I have few questions on conversions between "char*" to "unsigned
>char*" and vice versa. I am assuming casting "unsigned char*" to
>"char*" is safe because "char" can hold all the values that an
>"unsigned char" can hold.

This is true if, and only if, you are on a system where "char" and "unsigned
char" have the exact same range of values. Otherwise, there will be values
that you can store in "unsigned char" that can't be stored in "char".

> But conversion of "char*" to "unsigned char*" won't be safe as
>"char" can hold more values.

No, it can't. At least, so far as I recall, it's absolutely necessary
that "unsigned char" have at least as many possible values as "char".

>Is this understanding correct? On what
>cases "char*" will have negative values?

Negative values are not coherent for pointers. You probably meant "char".
The answer is, if you're on an implementation where "char" is a signed type,
then sometimes it could have negative values.

> I have never seen negative values on a "char*" string. So is that
> safe to do conversion from "char*" to "unsigned char*"?

Maybe.

> By conversion, I mean using casting - char* c = (char*) string;
> where string is a "unsigned char*".

Maybe.

You haven't explained what you mean by "safe", though. If you convert any
numeric value whatsoever to "unsigned char", it is guaranteed "safe" in that
it cannot cause a processor trap, or result in a value that is not valid
for "unsigned char". It may, however, not be the value you expected to get.
For instance, on most modern CPUs, if you convert any of 256, 512, or 1024 to
unsigned char, you will quite safely and reliably get the value 0. But it
won't crash.
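
A quick sketch of that value conversion (to_uchar and check_wraparound are names invented here). Note that the result does not depend on the CPU: the language defines conversion to an unsigned type as reduction modulo one more than the type's maximum, so the zeros below follow whenever UCHAR_MAX is 255.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: convert a value to unsigned char. The result
   is v reduced modulo UCHAR_MAX + 1, so the conversion never traps. */
unsigned char to_uchar(long v)
{
    return (unsigned char)v;
}

/* Self-check of the cases mentioned in the post. */
void check_wraparound(void)
{
    assert(to_uchar(-1) == UCHAR_MAX);   /* holds for any UCHAR_MAX */
#if UCHAR_MAX == 255
    assert(to_uchar(256) == 0);
    assert(to_uchar(512) == 0);
    assert(to_uchar(1024) == 0);
    assert(to_uchar(257) == 1);
#endif
}
```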

> If any one wondering, why I use unsigned char - I use it for doing
> some UTF8 processing on the string. I need to use that to skip the
> multi-byte sequences correctly.

So you probably do. But before you go reinventing the wheel, why not check
to see what your implementation has for existing UTF-8 support?

If you're at a level of experience where you're not quite sure about how
char and unsigned char interact, I would suggest that you are probably not
ready to reliably and consistently implement UTF-8. If you're doing it just
to learn, hey, sounds like a fun project, good luck with that. If you're
doing it because you want to get something done, though, consider using the
existing code that already does it correctly.

-s
--
Copyright 2010, all wrongs reversed. Peter Seebach / (E-Mail Removed)
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
I am not speaking for my employer, although they do rent some of my opinions.

Barry Schwarz
Guest
Posts: n/a

 01-01-2011
On Fri, 31 Dec 2010 04:49:03 -0800 (PST), Angus
<(E-Mail Removed)> wrote:

>In ASCII (and maybe also EBCIDIC, not sure) all the printing
>characters are are represented as positive numbers - ie only lower 7
>bits are used so converting printable characters either way should
>make no difference.

In EBCDIC, upper case letters range between 0xC1 and 0xE9 (and they
are not contiguous). Digits range from 0xF1 to 0xF9. Definitely not
the lower 7 bits. On EBCDIC systems, char defaults to unsigned char
to avoid negative values for normal characters.

--
Remove del for email

Ben Bacarisse
Guest
Posts: n/a

 01-01-2011
Seebs <(E-Mail Removed)> writes:

> On 2010-12-31, Navaneeth <(E-Mail Removed)> wrote:

<snip>
>> I have never seen negative values on a "char*" string. So is that
>> safe to do conversion from "char*" to "unsigned char*"?

>
> Maybe.
>
>> By conversion, I mean using casting - char* c = (char*) string;
>> where string is a "unsigned char*".

>
> Maybe.
>
> You haven't explained what you mean by "safe", though. If you convert any
> numeric value whatsoever to "unsigned char", it is guaranteed "safe" in that
> it cannot cause a processor trap, or result in a value that is not valid
> for "unsigned char". It may, however, not be the value you expected to get.
> For instance, on most modern CPUs, if you convert any of 256, 512, or 1024 to
> unsigned char, you will quite safely and reliably get the value 0. But it
> won't crash.

Did you miss the * in the question? I am not sure why you are talking
about converting numbers to unsigned char. That is not what is being

<snip>
--
Ben.

Seebs
Guest
Posts: n/a

 01-01-2011
On 2011-01-01, Ben Bacarisse <(E-Mail Removed)> wrote:
> Did you miss the * in the question?

Yes.

> I am not sure why you are talking
> about converting numbers to unsigned char. That is not what is being

Probably because elsewhere there was a * that looked spurious, so I started
translating everything to questions about conversions between values -- in
particular, because of the assertion that char could hold more values than
unsigned char. At least, I think that was how it happened; my brain is a
mysterious place.

-s

Keith Thompson
Guest
Posts: n/a

 01-01-2011
Ben Pfaff <(E-Mail Removed)> writes:
> Barry Schwarz <(E-Mail Removed)> writes:
>> In EBCDIC, upper case letters range between 0xC1 and 0xE9 (and they
>> are not contiguous). Digits range from 0xF1 to 0xF9. Definitely not
>> the lower 7 bits. On EBCDIC systems, char defaults to unsigned char
>> to avoid negative values for normal characters.

>
> It's not just a default. Having plain char be signed would be
> nonconforming in an EBCDIC environment.

Unless CHAR_BIT > 8, but I presume that all existing EBCDIC-based
systems have CHAR_BIT==8. (If EBCDIC had caught on more widely
than it did, there could easily have been, for example, EBCDIC-based
DSPs with CHAR_BIT==32.)

