I want unsigned char * string literals

 
 
Michael B Allen
      07-22-2007
Hello,

Early on I decided that all text (what most people call "strings" [1])
in my code would be unsigned char *. The reasoning is that the elements
of these arrays are decidedly not signed. In fact, they may not even
represent complete characters. At this point I think of text as simple
binary blobs. What charset, character encoding and termination they use
should not be exposed in the interface used to operate on them.

But now I have a dilemma. C string literals are signed char *. With GCC
4 warning about every sign mismatch, my code is spewing warnings all
over the place and I'm trying to figure out what to do about it.

My current thought is to define a Windows style _T macro:

#define _T(s) ((unsigned char *)s)

Use "text" functions like:

int
text_copy(const unsigned char *src, unsigned char *dst, int n)
{
while (n-- && *src) {
*dst++ = *src++;
...

And abolish the use of traditional string functions (at least for "text").

The code might then look like the following:

unsigned char buf[255];
text_copy(_T("hello, world"), buf, sizeof(buf));

What do you think?

If I do the above I have a lot of work to do so if someone has a better
idea I'd really like to hear about it.

Mike

PS: If you have an opinion that is unfavorable (but professional) let's
hear it.

[1] I use the term "text" to mean stuff that may actually be displayed
to a user (possibly in a foreign country). I use the term "string"
to represent traditional 8 bit zero terminated char * arrays.
 
Eric Sosman
      07-22-2007
Michael B Allen wrote:
> Hello,
>
> Early on I decided that all text (what most people call "strings" [1])
> in my code would be unsigned char *. The reasoning is that the elements
> of these arrays are decidedly not signed. In fact, they may not even
> represent complete characters. At this point I think of text as simple
> binary blobs. What charset, character encoding and termination they use
> should not be exposed in the interface used to operate on them.
>
> But now I have a dilemma. C string literals are signed char *.


Well, no. String literals (in typical contexts) generate
anonymous arrays of char -- just plain char, not signed char
or unsigned char. Plain char is signed on some systems and
unsigned on others, but it is a type of its own nevertheless.

(People seem to have a hard time with the notion that char
behaves like one of signed char or unsigned char, but is a
type distinct from both. The same people seem to have no
trouble with the fact that int is a type distinct from both
short and long, even though on most systems it behaves exactly
like one or the other. Go figure.)
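
A minimal sketch of that point, assuming a C11 compiler (the _Generic selection below is newer than this thread). CHAR_KIND, check_char_kinds and plain_char_is_signed are hypothetical names used only for illustration. The selection compiles only because the three character types are distinct; if plain char were the same type as one of the others, the association list would contain a duplicate and be rejected.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Hypothetical helper: yields the name of the argument's type. The three
   associations can coexist only because char, signed char and
   unsigned char are three distinct types. */
#define CHAR_KIND(x) _Generic((x),          \
        char:          "char",              \
        signed char:   "signed char",       \
        unsigned char: "unsigned char",     \
        default:       "other")

/* Nonzero if plain char behaves like signed char on this implementation. */
static int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Checks that each object selects its own association. */
static void check_char_kinds(void)
{
    char c = 'x';
    signed char sc = 'x';
    unsigned char uc = 'x';

    assert(strcmp(CHAR_KIND(c), "char") == 0);
    assert(strcmp(CHAR_KIND(sc), "signed char") == 0);
    assert(strcmp(CHAR_KIND(uc), "unsigned char") == 0);
    /* An element of a string literal is plain char as well. */
    assert(strcmp(CHAR_KIND("hello"[0]), "char") == 0);
}
```

Whether plain_char_is_signed() returns 0 or 1 varies between implementations; the outcome of CHAR_KIND does not.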

> With GCC
> 4 warning about every sign mismatch, my code is spewing warnings all
> over the place and I'm trying to figure out what to do about it.


"Don't Do That." The compiler is telling you that the
square peg is a poor fit for the round hole, no matter how
hard you push on it.

> My current thought is to define a Windows style _T macro:
>
> #define _T(s) ((unsigned char *)s)


... invading the namespace reserved to the implementation,
thus making the code non-portable to any implementation that
decides to use _T as one of its own identifiers. If you really
want to pursue this folly, change the macro name. And put
parens around the use of the argument, too.
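
Both suggestions might be applied roughly as follows. UTEXT and utext_len are hypothetical names, not anything from the original post, and the cast targets const unsigned char * here on the assumption that the literal will never be written through.

```c
#include <assert.h>
#include <string.h>

/* UTEXT is a hypothetical name. It starts with a letter, so it stays out
   of the identifiers reserved for the implementation (those beginning
   with an underscore followed by an uppercase letter or a second
   underscore). The argument is parenthesized so that an expression such
   as UTEXT(cond ? a : b) is cast as a whole rather than only its first
   operand. */
#define UTEXT(s) ((const unsigned char *)(s))

/* Length in bytes of a zero-terminated unsigned char sequence. */
static size_t utext_len(const unsigned char *s)
{
    assert(s != NULL);
    return strlen((const char *)s);
}
```

A call would then read utext_len(UTEXT("hello, world")). Without the inner parentheses, UTEXT(flag ? "ab" : "abcd") would cast flag instead of the selected string.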

> Use "text" functions like:
>
> int
> text_copy(const unsigned char *src, unsigned char *dst, int n)
> {
> while (n-- && *src) {
> *dst++ = *src++;
> ...
>
> And abolish the use of traditional string functions (at least for "text").


You'll also need to find substitutes for the *printf family,
for getenv, for the strto* family, for asctime and ctime, for
most of the locale mechanism, for ...

> The code might then look like the following:
>
> unsigned char buf[255];
> text_copy(_T("hello, world"), buf, sizeof(buf));
>
> What do you think?


I think you want some other programming language, possibly
Java. If you try to do this in C, you will waste an inordinate
amount of time and effort struggling against the language and
(especially) against the library.

--
Eric Sosman
(E-Mail Removed)
 
Malcolm McLean
      07-22-2007

"Michael B Allen" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> Hello,
>
> Early on I decided that all text (what most people call "strings" [1])
> in my code would be unsigned char *. The reasoning is that the
> elements of these arrays are decidedly not signed. In fact, they may not
> even represent complete characters. At this point I think of text as
> simple binary blobs. What charset, character encoding and termination
> they use should not be exposed in the interface used to operate on
> them.
>

char * for a list of human readable characters.
unsigned char * for a list of arbitrary bytes - almost always octets.
signed char * - very rare. Sometimes you might need a tiny integer. I will
resist mentioning my campaign for 64 bit ints.

unsigned char really ought to be "byte". Unfortunately a bad decision was
taken to treat characters and bytes the same way, and now we are stuck with
sizeof(char) == 1 byte.

If you start using unsigned char* for strings then, as you have found, you
will merrily break all the calls to string library functions. This can be
patched up by a cast, but the real answer is not to do that in the first
place.
Very rarely are you interested in the actual encoding of a character. A few
exceptions arise when you want to code lookup tables for speed, or write
low-level routines to convert from decimal to machine letter, or put text
into binary files in an agreed coding, but they are very few.

--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm



 
Michael B Allen
      07-22-2007
On Sun, 22 Jul 2007 15:02:31 -0400
Eric Sosman <(E-Mail Removed)> wrote:

> > With GCC
> > 4 warning about every sign mismatch, my code is spewing warnings all
> > over the place and I'm trying to figure out what to do about it.

>
> "Don't Do That." The compiler is telling you that the
> square peg is a poor fit for the round hole, no matter how
> hard you push on it.


Hi Eric,

Trying to put a square peg in a round hole does not fairly characterize
casting char * to unsigned char *.

> > My current thought is to define a Windows style _T macro:
> >
> > #define _T(s) ((unsigned char *)s)

>
> ... invading the namespace reserved to the implementation,
> thus making the code non-portable to any implementation that
> decides to use _T as one of its own identifiers. If you really
> want to pursue this folly, change the macro name. And put
> parens around the use of the argument, too.


I didn't invade the namespace, MS did. Which is to say that symbol is
unlikely to be used for anything other than what MS (and I) are using it
for.

But I don't see why I can't use a different symbol and retain
compatibility with the Windows platform. I will do that.

> > Use "text" functions like:
> >
> > int
> > text_copy(const unsigned char *src, unsigned char *dst, int n)
> > {
> > while (n-- && *src) {
> > *dst++ = *src++;
> > ...
> >
> > And abolish the use of traditional string functions (at least for "text").


> You'll also need to find substitutes for the *printf family,
> for getenv, for the strto* family, for asctime and ctime, for
> most of the locale mechanism, for ...


That's not a big deal. I suspect that in the end I would only end up
wrapping very few functions. I don't really use any of the above directly
as it is.
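
A rough sketch of what such thin wrappers could look like. text_to_long and text_getenv are hypothetical names chosen for illustration; each one casts once and hands the work to the standard function, so the casts stay in one place.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical wrapper around strtol for unsigned char text. */
static long text_to_long(const unsigned char *s, int base)
{
    assert(s != NULL);
    return strtol((const char *)s, NULL, base);
}

/* Hypothetical wrapper around getenv that returns unsigned char text.
   Like getenv itself, it returns NULL when the variable is not set. */
static const unsigned char *text_getenv(const char *name)
{
    assert(name != NULL);
    return (const unsigned char *)getenv(name);
}
```

Converting between char * and unsigned char * in this way is well defined, which is why the wrappers can remain one-liners.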

Note that if you need a truly internationalized solution (everyone should)
you can't use a lot of the traditional C string functions anyway. Strncpy
and ctype stuff is useless. Consider that web servers almost invariably
run in the C locale so anything that depends on the locale mechanism is
of limited use.

> > The code might then look like the following:
> >
> > unsigned char buf[255];
> > text_copy(_T("hello, world"), buf, sizeof(buf));
> >
> > What do you think?

>
> I think you want some other programming language, possibly
> Java. If you try to do this in C, you will waste an inordinate
> amount of time and effort struggling against the language and
> (especially) against the library.


I would love to use Java the language. Unfortunately its libraries,
host OS integration, multi-threading and networking capabilities and
just about everything else is not suitable for my purposes. C++ seems
like an over design to me but I've never really tried to use it. The
C language itself is ideal for me. I don't think deficiencies in text
processing should deter me from using it.

So I take it you just use char * for text?

It doesn't bother you that char * isn't the appropriate type for what
is effectively a binary blob especially when most of the str* functions
don't handle internationalized text anyway?

Mike
 
Keith Thompson
      07-22-2007
Michael B Allen <(E-Mail Removed)> writes:
> Early on I decided that all text (what most people call "strings" [1])
> in my code would be unsigned char *. The reasoning is that the elements
> of these arrays are decidedly not signed. In fact, they may not even
> represent complete characters. At this point I think of text as simple
> binary blobs. What charset, character encoding and termination they use
> should not be exposed in the interface used to operate on them.
>
> But now I have a dilemma. C string literals are signed char *. With GCC
> 4 warning about every sign mismatch, my code is spewing warnings all
> over the place and I'm trying to figure out what to do about it.

[...]

No, C string literals have type 'array[N] of char'; in most, but not
all, contexts, this is implicitly converted to 'char*'. (Consider
'sizeof "hello, world"'.)
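
The sizeof case can be checked directly. This is a small sketch assuming C11 (for static_assert); literal_size and pointer_size are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* As the operand of sizeof the literal is still an array:
   12 characters plus the terminating '\0'. */
static_assert(sizeof "hello, world" == 13,
              "12 characters plus the terminator");

/* Size of the array that the literal creates. */
static size_t literal_size(void)
{
    return sizeof "hello, world";
}

/* Once the literal has been converted to a pointer, sizeof sees
   only the pointer. */
static size_t pointer_size(void)
{
    const char *p = "hello, world";
    return sizeof p;
}
```

The first function always returns 13; the second returns whatever size a character pointer has on the platform.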

My main point isn't that they're arrays rather than pointers, but that
they're arrays of (plain) char, not of signed char. Plain char is
equivalent to *either* signed char or unsigned char, but is still a
distinct type from either of them. It appears that plain char is
signed in your implementation.

I know this doesn't answer your actual question; hopefully someone
else can help with that.

--
Keith Thompson (The_Other_Keith) (E-Mail Removed) <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
 
pete
      07-22-2007
Michael B Allen wrote:
>
> Hello,
>
> Early on I decided that all text (what most people call "strings" [1])
> in my code would be unsigned char *.
> The reasoning is that the elements
> of these arrays are decidedly not signed. In fact, they may not even
> represent complete characters. At this point I think of text as simple
> binary blobs. What charset,
> character encoding and termination they use
> should not be exposed in the interface used to operate on them.
>
> But now I have a dilemma. C string literals are signed char *.


They are arrays of plain char,
which may be either a signed or unsigned type.

> With GCC
> 4 warning about every sign mismatch, my code is spewing warnings all
> over the place and I'm trying to figure out what to do about it.
>
> My current thought is to define a Windows style _T macro:
>
> #define _T(s) ((unsigned char *)s)
>
> Use "text" functions like:
>
> int
> text_copy(const unsigned char *src, unsigned char *dst, int n)
> {
> while (n-- && *src) {
> *dst++ = *src++;
> ...
>
> And abolish the use of traditional string functions
> (at least for "text").
>
> The code might then look like the following:
>
> unsigned char buf[255];
> text_copy(_T("hello, world"), buf, sizeof(buf));
>
> What do you think?
>
> If I do the above I have a lot of work to do
> so if someone has a better idea
> I'd really like to hear about it.
>
> Mike
>
> PS: If you have an opinion that is unfavorable
> (but professional) let's hear it.


The solution is obvious: use arrays of char to contain strings.

Using arrays of unsigned char to hold strings
creates a problem for you, but solves nothing.

If I have a problem
that is caused by using arrays of char to hold strings,
I'm unaware of what the problem is.

--
pete
 
Michael B Allen
      07-22-2007
On Sun, 22 Jul 2007 22:02:42 GMT
pete <(E-Mail Removed)> wrote:

> > The code might then look like the following:
> >
> > unsigned char buf[255];
> > text_copy(_T("hello, world"), buf, sizeof(buf));
> >
> > What do you think?
> >
> > If I do the above I have a lot of work to do
> > so if someone has a better idea
> > I'd really like to hear about it.
> >
> > Mike
> >
> > PS: If you have an opinion that is unfavorable
> > (but professional) let's hear it.

>
> The solution is obvious: use arrays of char to contain strings.
>
> Using arrays of unsigned char to hold strings
> creates a problem for you, but solves nothing.
>
> If I have a problem
> that is caused by using arrays of char to hold strings,
> I'm unaware of what the problem is.


Hi pete,

I accept that there's no technical problem with using char. But I just
can't get over the fact that char isn't the right type for text.

If you read data from binary file would you read it into a char buffer
or unsigned char buffer?

Type char is not the correct type for text. It is merely adequate for
a traditional C 7 bit encoded "string". But char is not the right type
for binary blobs of "text" used in internationalized programs.

The only problem with using unsigned char is string literals and that
seems like a weak reason to make all downstream functions use char.

Also, technically speaking, if I used char, all internationalized string
functions would eventually have to cast char to unsigned char so that they
could decode, encode and interpret whole characters.
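
To make that concrete, here is a sketch of one such function. It ASSUMES the text is valid UTF-8, which the thread never states, and utf8_count is a hypothetical name. The interface takes char * and the function views the bytes as unsigned char internally.

```c
#include <assert.h>
#include <stddef.h>

/* Counts code points in a zero-terminated string, ASSUMING valid UTF-8.
   Each byte is read through an unsigned char pointer so that its value
   is in the range 0..255 whether or not plain char is signed. */
static size_t utf8_count(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    assert(s != NULL);
    for (; *p != 0; ++p) {
        /* Continuation bytes have the form 10xxxxxx; every other
           byte starts a new code point. */
        if ((*p & 0xC0u) != 0x80u)
            ++n;
    }
    return n;
}
```

For example, the five bytes of "caf\xC3\xA9" hold four code points, because the last two bytes encode a single character.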

If compilers allowed the user to specify what the type for string literals
was, that would basically solve this "problem".

Mike
 
Eric Sosman
      07-23-2007
Michael B Allen wrote:
> On Sun, 22 Jul 2007 15:02:31 -0400
> Eric Sosman <(E-Mail Removed)> wrote:
>
>>> With GCC
>>> 4 warning about every sign mismatch, my code is spewing warnings all
>>> over the place and I'm trying to figure out what to do about it.

>> "Don't Do That." The compiler is telling you that the
>> square peg is a poor fit for the round hole, no matter how
>> hard you push on it.

>
> Hi Eric,
>
> Trying to put a square peg in a round hole does not fairly characterize
> casting char * to unsigned char *.


Sorry: my mistake. I ought to have said round peg and
square hole. My apologies.

>>> My current thought is to define a Windows style _T macro:
>>>
>>> #define _T(s) ((unsigned char *)s)

>> ... invading the namespace reserved to the implementation,
>> thus making the code non-portable to any implementation that
>> decides to use _T as one of its own identifiers. If you really
>> want to pursue this folly, change the macro name. And put
>> parens around the use of the argument, too.

>
> I didn't invade the namespace, MS did. Which is to say that symbol is
> unlikely to be used for anything other than what MS (and I) are using it
> for.
>
> But I don't see why I can't use a different symbol and retain
> compatibility with the Windows platform. I will do that.


Sorry again; I have no idea what you're talking about.
Whatever it is doesn't seem to be C, in which identifiers
beginning with _ and a capital letter belong to the implementation
and not to the programmer.

>>> Use "text" functions like:
>>>
>>> int
>>> text_copy(const unsigned char *src, unsigned char *dst, int n)
>>> {
>>> while (n-- && *src) {
>>> *dst++ = *src++;
>>> ...
>>>
>>> And abolish the use of traditional string functions (at least for "text").

>
>> You'll also need to find substitutes for the *printf family,
>> for getenv, for the strto* family, for asctime and ctime, for
>> most of the locale mechanism, for ...

>
> That's not a big deal. I suspect that in the end I would only end up
> wrapping very few functions. I don't really use any of the above directly
> as it is.


Not even printf? Are you writing for freestanding environments
where most of the Standard library is absent?

> Note that if you need a truly internationalized solution (everyone should)
> you can't use a lot of the traditional C string functions anyway. Strncpy
> and ctype stuff is useless.


I'll agree with you about strncpy, but not about <ctype.h>.

> Consider that web servers almost invariably
> run in the C locale so anything that depends on the locale mechanism is
> of limited use.


Well, that's really not a C problem, or at least not a "C-
only" problem. C's locale support is, admittedly, an afterthought
if not actually a wart, and doesn't generalize to multi-threaded
environments. But then, C itself has no notion of multiple threads,
so what can you expect?

>>> The code might then look like the following:
>>>
>>> unsigned char buf[255];
>>> text_copy(_T("hello, world"), buf, sizeof(buf));
>>>
>>> What do you think?

>> I think you want some other programming language, possibly
>> Java. If you try to do this in C, you will waste an inordinate
>> amount of time and effort struggling against the language and
>> (especially) against the library.

>
> I would love to use Java the language. Unfortunately its libraries,
> host OS integration, multi-threading and networking capabilities and
> just about everything else is not suitable for my purposes. C++ seems
> like an over design to me but I've never really tried to use it. The
> C language itself is ideal for me. I don't think deficiencies in text
> processing should deter me from using it.


Then go ahead; nobody's stopping you. But if you've made up
your mind to use C, then use C and not some Frankenstein's monster
made of parts from one language and parts from the other. If text
processing is important to you and C's text processing isn't rich
enough for your needs, then either seek another language or add
your own text-processing libraries to C. But don't try to retrofit
C's admittedly primitive text-processing to suit your more advanced
goals; all you're doing is putting lipstick on a pig.

> So I take it you just use char * for text?


That I do.

> It doesn't bother you that char * isn't the appropriate type for what
> is effectively a binary blob especially when most of the str* functions
> don't handle internationalized text anyway?


You haven't explained just why you find char* inadequate,
and the only virtue of unsigned char* you've mentioned is that it's
unsigned. I don't see how that helps with internationalization.

Are you looking for wchar_t, by any chance?

--
Eric Sosman
(E-Mail Removed)
 
Keith Thompson
      07-23-2007
Michael B Allen <(E-Mail Removed)> writes:
[...]
> I accept that there's no technical problem with using char. But I just
> can't get over the fact that char isn't the right type for text.


But that's exactly what it's *supposed* to be. If you're saying it
doesn't meet that requirement, I don't disagree. Personally, I think
it would make more sense in most environments for plain char to be
unsigned.

> If you read data from binary file would you read it into a char buffer
> or unsigned char buffer?


Probably an unsigned char buffer, but a binary file could be anything.
If it contained 8-bit signed data, I'd use signed char.

> Type char is not the correct type for text. It is merely adequate for
> a traditional C 7 bit encoded "string". But char is not the right type
> for binary blobs of "text" used in internationalized programs.
>
> The only problem with using unsigned char is string literals and that
> seems like a weak reason to make all downstream functions use char.
>
> Also, technically speaking, if I used char, all internationalized string
> functions would eventually have to cast char to unsigned char so that they
> could decode, encode and interpret whole characters.
>
> If compilers allowed the user to specify what the type for string literals
> was, that would basically solve this "problem".


Not really; the standard functions that take strings would still
require pointers to plain char.

As I said, IMHO making plain char unsigned is the best solution in
most environments. I don't know why that hasn't caught on. Perhaps
there's too much badly written code that assumes plain char is signed.
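
One place where the signedness of plain char bites is table lookup. The sketch below uses hypothetical names (seen, byte_index, mark_seen, was_seen) and shows the usual defensive conversion, which gives the same index on either kind of implementation.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical lookup table with one slot per possible byte value. */
static unsigned char seen[UCHAR_MAX + 1];

/* Maps a plain char to a table index. The conversion to unsigned char
   brings the value into 0..UCHAR_MAX; using the char directly would give
   a negative index for bytes above 0x7F where plain char is signed. */
static int byte_index(char c)
{
    int i = (unsigned char)c;

    assert(i >= 0 && i <= UCHAR_MAX);
    return i;
}

/* Records that a byte value has occurred. */
static void mark_seen(char c)
{
    seen[byte_index(c)] = 1;
}

/* Nonzero if the byte value was recorded earlier. */
static int was_seen(char c)
{
    return seen[byte_index(c)];
}
```

Code that writes seen[c] instead happens to work where plain char is unsigned and indexes out of bounds where it is signed.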

--
Keith Thompson (The_Other_Keith) (E-Mail Removed) <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
 
pete
      07-23-2007
Michael B Allen wrote:
>
> Hello,
>
> Early on I decided that all text (what most people call "strings" [1])
> in my code would be unsigned char *.
> The reasoning is that the elements
> of these arrays are decidedly not signed. In fact, they may not even
> represent complete characters. At this point I think of text as simple
> binary blobs. What charset,
> character encoding and termination they use
> should not be exposed in the interface used to operate on them.
>
> But now I have a dilemma. C string literals are signed char *.
> With GCC
> 4 warning about every sign mismatch, my code is spewing warnings all
> over the place and I'm trying to figure out what to do about it.
>
> My current thought is to define a Windows style _T macro:
>
> #define _T(s) ((unsigned char *)s)
>
> Use "text" functions like:
>
> int
> text_copy(const unsigned char *src, unsigned char *dst, int n)
> {
> while (n-- && *src) {
> *dst++ = *src++;
> ...
>
> And abolish the use of traditional string functions
> (at least for "text").
>
> The code might then look like the following:
>
> unsigned char buf[255];
> text_copy(_T("hello, world"), buf, sizeof(buf));
>
> What do you think?
>
> If I do the above I have a lot of work to do
> so if someone has a better
> idea I'd really like to hear about it.
>
> Mike
>
> PS: If you have an opinion that is unfavorable
> (but professional) let's hear it.
>
> [1] I use the term "text" to mean stuff that may actually be displayed
> to a user (possibly in a foreign country). I use the term "string"
> to represent traditional 8 bit zero terminated char * arrays.


I think it might be simpler to retain the char interface,
and then cast inside your functions:

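int
text_copy(const char *src, char *dst, int n)
{
    unsigned char *s1 = (unsigned char *)dst;
    const unsigned char *s2 = (const unsigned char *)src;

    /* Copy at most n bytes, stopping at the terminator. */
    while (n != 0 && *s2 != '\0') {
        *s1++ = *s2++;
        --n;
    }
    /* Zero-fill the rest of the n bytes, as strncpy does. */
    while (n-- != 0) {
        *s1++ = '\0';
    }
    /* The post does not say what to return; the number of bytes
       copied, excluding padding, is assumed here. */
    return (int)(s2 - (const unsigned char *)src);
}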

--
pete
 