Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C Programming > I want unsigned char * string literals


I want unsigned char * string literals

Eric Sosman
      07-23-2007
Keith Thompson wrote:
> Michael B Allen <(E-Mail Removed)> writes:
> [...]
>> I accept that there's no technical problem with using char. But I just
>> can't get over the fact that char isn't the right type for text.

>
> But that's exactly what it's *supposed* to be. If you're saying it
> doesn't meet that requirement, I don't disagree. Personally, I think
> it would make more sense in most environments for plain char to be
> unsigned.
>
>> If you read data from binary file would you read it into a char buffer
>> or unsigned char buffer?

>
> Probably an unsigned char buffer, but a binary file could be anything.
> If it contained 8-bit signed data, I'd use signed char.
>
>> Type char is not the correct type for text. It is merely adequate for
>> a traditional C 7 bit encoded "string". But char is not the right type
>> for binary blobs of "text" used in internationalized programs.
>>
>> The only problem with using unsigned char is string literals and that
>> seems like a weak reason to make all downstream functions use char.
>>
>> Also, technically speaking, if I used char, all internationalized string
>> functions would eventually have to cast char to unsigned char so that they
>> could decode, encode, and interpret whole characters.
>>
>> If compilers allowed the user to specify what the type for string literals
>> was, that would basically solve this "problem".

>
> Not really; the standard functions that take strings would still
> require pointers to plain char.
>
> As I said, IMHO making plain char unsigned is the best solution in
> most environments. I don't know why that hasn't caught on. Perhaps
> there's too much badly written code that assumes plain char is signed.


The historical background for C's ambiguity is fairly
clear: The "load byte" instruction sign-extended on some
machines and zero-extended on others (and on some, simply
left the high-order bits of the destination register alone).
Had C mandated either sign-extension or zero-extension, it
would have added extra instructions to every single character
fetch on the un-favored architectures.

Nowadays it is a good trade to hide such minor matters
behind a veneer of "programmer friendliness," but the economics
(i.e., the relative cost of computer time and programmer time)
were different when C was devised. It would, I think, be an act
of supreme arrogance and stupidity to maintain that today's
economic balance is the end state, subject to no further change.

--
Eric Sosman
http://www.velocityreviews.com/forums/(E-Mail Removed)lid
 
Keith Thompson
      07-23-2007
Eric Sosman <(E-Mail Removed)> writes:
> Keith Thompson wrote:

[...]
>> As I said, IMHO making plain char unsigned is the best solution in
>> most environments. I don't know why that hasn't caught on. Perhaps
>> there's too much badly written code that assumes plain char is signed.

>
> The historical background for C's ambiguity is fairly
> clear: The "load byte" instruction sign-extended on some
> machines and zero-extended on others (and on some, simply
> left the high-order bits of the destination register alone).
> Had C mandated either sign-extension or zero-extension, it
> would have added extra instructions to every single character
> fetch on the un-favored architectures.
>
> Nowadays it is a good trade to hide such minor matters
> behind a veneer of "programmer friendliness," but the economics
> (i.e., the relative cost of computer time and programmer time)
> were different when C was devised. It would, I think, be an act
> of supreme arrogance and stupidity to maintain that today's
> economic balance is the end state, subject to no further change.


I'm not (necessarily) suggesting that the standard should require
plain char to be unsigned. What I'm suggesting is that most current
implementations should probably choose to make plain char unsigned.
Many of them make it signed, perhaps for backward compatibility, but
IMHO it's a poor tradeoff.

--
Keith Thompson (The_Other_Keith) (E-Mail Removed) <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
 
Michael B Allen
      07-23-2007
On Mon, 23 Jul 2007 01:31:22 GMT
pete <(E-Mail Removed)> wrote:

> Michael B Allen wrote:
> >
> > Hello,
> >
> > Early on I decided that all text (what most people call "strings" [1])
> > in my code would be unsigned char *.
> > The reasoning is that the elements
> > of these arrays are decidedly not signed. In fact, they may not even
> > represent complete characters. At this point I think of text as simple
> > binary blobs. What charset,
> > character encoding and termination they use
> > should not be exposed in the interface used to operate on them.
> >
> > But now I have a dilemma. C string literals are signed char *.
> > With GCC
> > 4 warning about every sign mismatch, my code is spewing warnings all
> > over the place and I'm trying to figure out what to do about it.
> >
> > My current thought is to define a Windows style _T macro:
> >
> > #define _T(s) ((unsigned char *)s)
> >
> > Use "text" functions like:
> >
> > int
> > text_copy(const unsigned char *src, unsigned char *dst, int n)
> > {
> >     while (n-- && *src) {
> >         *dst++ = *src++;
> >         ...
> >
> > And abolish the use of traditional string functions
> > (at least for "text").
> >
> > The code might then look like the following:
> >
> > unsigned char buf[255];
> > text_copy(_T("hello, world"), buf, sizeof(buf));
> >
> > What do you think?
> >
> > If I do the above I have a lot of work to do
> > so if someone has a better
> > idea I'd really like to hear about it.
> >
> > Mike
> >
> > PS: If you have an opinion that is unfavorable
> > (but professional) let's hear it.
> >
> > [1] I use the term "text" to mean stuff that may actually be displayed
> > to a user (possibly in a foreign country). I use the term "string"
> > to represent traditional 8 bit zero terminated char * arrays.

>
> I think it might be simpler to retain the char interface,
> and then cast inside your functions:
>
> int
> text_copy(const char *src, char *dst, int n)
> {
>     unsigned char *s1 = (unsigned char *)dst;
>     const unsigned char *s2 = (const unsigned char *)src;


Hi pete,

Ok, I'm giving in. I asked, I got an answer and you guys are right.

Even though char is wrong, it's just another little legacy wart with
no serious technical impact other than the fact that to inspect bytes
within the text one should cast to unsigned char first. So if casting
has to occur, doing it in the base functions is a lot more elegant than
casting every string literal throughout the entire codebase.

But in hope that someday compilers will provide an option for char to
be unsigned, I have started to replace all instances of the char type
with my own typedef so that when that day comes I can tweak one line of
code and have what I want.

Actually I see GCC has a -funsigned-char option that seems to be what
I want but it didn't seem to have any effect on the warnings.

Mike
 
Ian Collins
      07-23-2007
Michael B Allen wrote:
>
> Actually I see GCC has a -funsigned-char option that seems to be what
> I want but it didn't seem to have any effect on the warnings.
>

Could it be that it simply makes char unsigned?

--
Ian Collins.
 
Alan Curry
      07-23-2007
In article <(E-Mail Removed)>,
Michael B Allen <(E-Mail Removed)> wrote:
>
>Actually I see GCC has a -funsigned-char option that seems to be what
>I want but it didn't seem to have any effect on the warnings.


-funsigned-char affects the compiler's behavior, possibly causing your
program to behave differently, but it doesn't make your code correct. Correct
code works when compiled with either -fsigned-char or -funsigned-char.
The warning is designed to help you make your code correct, by alerting you
when you've done something which might not work the same if you changed from
-funsigned-char to -fsigned-char (or from gcc to some other compiler that
doesn't let you choose).

If you got different warnings depending on your -f[un]signed-char option,
you'd have to compile your code twice to see all the possible warnings. That
wouldn't be friendly.

--
Alan Curry
(E-Mail Removed)
 
Eric Sosman
      07-23-2007
Michael B Allen wrote:
> [...]
>
> Even though char is wrong, it's just another little legacy wart with
> no serious technical impact other than the fact that to inspect bytes
> within the text one should cast to unsigned char first. [...]


It is unnecessary to cast anything in order to "inspect"
a character in a string. *cptr == 'A' and *cptr == 'ß' work
just fine (on systems that have a ß character), and there's
no need to cast either *cptr or the constant.

Perhaps you're unhappy about the casting that *is* needed
for the <ctype.h> functions, and I share your unhappiness.
But that's not really a consequence of the sign ambiguity of
char; rather, it follows from the functions' having a domain
consisting of all char values *plus* EOF. Were it not for the
need to handle EOF -- a largely useless addition, IMHO -- there
would be no need to cast when using <ctype.h>.

However, that's far from the worst infelicity in the C
library. The original Standard tried (mostly) to codify
C-as-it-was, not to replace it with C-remade-in-trendy-mode.
The <ctype.h> functions -- and their treatment of EOF -- were
already well-established before the first Standard was written,
and the writers had little choice but to accept them.

--
Eric Sosman
(E-Mail Removed)lid
 
Michael B Allen
      07-23-2007
On Mon, 23 Jul 2007 09:02:04 -0400
Eric Sosman <(E-Mail Removed)> wrote:

> Michael B Allen wrote:
> > [...]
> >
> > Even though char is wrong, it's just another little legacy wart with
> > no serious technical impact other than the fact that to inspect bytes
> > within the text one should cast to unsigned char first. [...]

>
> It is unnecessary to cast anything in order to "inspect"
> a character in a string. *cptr == 'A' and *cptr == 'ß' work
> just fine (on systems that have a ß character), and there's
> no need to cast either *cptr or the constant.


Hi Eric,

The above code will not work with non-latin1 character encodings (most
importantly UTF-8). That will severely limit its portability from an i18n
perspective (e.g. no CJK). And even domestically you're going to run into
trouble soon. Standards related to Kerberos, LDAP, GSSAPI and many more
are basically saying they don't care about codepages anymore. Everything
is going to be UTF-8 (except on Windows which will of course continue
to use wchar_t).

> Perhaps you're unhappy about the casting that *is* needed
> for the <ctype.h> functions, and I share your unhappiness.
> But that's not really a consequence of the sign ambiguity of
> char; rather, it follows from the functions' having a domain
> consisting of all char values *plus* EOF. Were it not for the
> need to handle EOF -- a largely useless addition, IMHO -- there
> would be no need to cast when using <ctype.h>.


Forget casting, the ctype functions don't even work at all if the high
bit is on. Ctype only works with ASCII.

> However, that's far from the worst infelicity in the C
> library. The original Standard tried (mostly) to codify
> C-as-it-was, not to replace it with C-remade-in-trendy-mode.
> The <ctype.h> functions -- and their treatment of EOF -- were
> already well-established before the first Standard was written,
> and the writers had little choice but to accept them.


Ok. A little history is nice. But I really think these discussions
should be punctuated with saying that the C standard library is basically
useless at this point.

ctype - useless for i18n
errno - a classic non-standard standard
locale - no context object so it can't be safely used in libraries
setjmp - not portable
signal - no comment necessary
stdio - no context object to keep state separate (e.g. can't mix wide
and non-wide I/O)
stdlib - malloc has no context object
string - useless for i18n

If we're ever going to create a new "standard" library for C the first
step is to admit that the one we have now is useless for anything but
hello world programs.

Mike
 
Eric Sosman
      07-23-2007
Michael B Allen wrote On 07/23/07 12:53,:
> On Mon, 23 Jul 2007 09:02:04 -0400
> Eric Sosman <(E-Mail Removed)> wrote:
>> [...]
>> Perhaps you're unhappy about the casting that *is* needed
>>for the <ctype.h> functions, and I share your unhappiness.
>>But that's not really a consequence of the sign ambiguity of
>>char; rather, it follows from the functions' having a domain
>>consisting of all char values *plus* EOF. Were it not for the
>>need to handle EOF -- a largely useless addition, IMHO -- there
>>would be no need to cast when using <ctype.h>.

>
>
> Forget casting, the ctype functions don't even work at all if the high
> bit is on. Ctype only works with ASCII.


First, C does not assume ASCII character encodings,
and runs happily on systems that do not use ASCII. The
only constraints on the encoding are (1) that the available
characters include a specified set of "basic" characters,
(2) that the codes for the basic characters be non-negative,
and (3) that the codes for the characters '0' through '9'
be consecutive and ascending. Any encoding that meets
these requirements -- ASCII or not -- is acceptable for C.

Second, the <ctype.h> functions are required to accept
arguments whose values cover the entire range of unsigned
char (plus EOF). Half those values have the high bit set,
and the <ctype.h> functions cannot ignore that half.

> Ok. A little history is nice. But I really think these discussions
> should be punctuated with saying that the C standard library is basically
> useless at this point.


If you think so, then why use C? You're planning on
throwing away the entire library and changing the handling
of text in fundamental ways (ways that go far beyond your
initial "I want unsigned text" plea). The result would be
a programming language in which existing C programs would
> not run and perhaps would not compile; why are you so set
on calling this new and different language "C?" Call it
"D" or "Sanskrit" or "Baloney" if you like, but it ain't C.

--
(E-Mail Removed)
 
Richard Heathfield
      07-23-2007
Michael B Allen said:

<snip>

> Forget casting, the ctype functions don't even work at all if the high
> bit is on. Ctype only works with ASCII.


Strange, that - I've used it with EBCDIC, with the high bit set, and it
worked just fine. I wonder what I'm doing wrong.

> If we're ever going to create a new "standard" library for C the first
> step is to admit that the one we have now is useless for anything but
> hello world programs.


The standard C library could be a lot, lot better, it's true, but it's
surprising just how much can be done with it if you try.

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999
 
Michael B Allen
      07-23-2007
On Mon, 23 Jul 2007 13:31:24 -0400
Eric Sosman <(E-Mail Removed)> wrote:

> Michael B Allen wrote On 07/23/07 12:53,:
> > On Mon, 23 Jul 2007 09:02:04 -0400
> > Eric Sosman <(E-Mail Removed)> wrote:
> >> [...]
> >> Perhaps you're unhappy about the casting that *is* needed
> >>for the <ctype.h> functions, and I share your unhappiness.
> >>But that's not really a consequence of the sign ambiguity of
> >>char; rather, it follows from the functions' having a domain
> >>consisting of all char values *plus* EOF. Were it not for the
> >>need to handle EOF -- a largely useless addition, IMHO -- there
> >>would be no need to cast when using <ctype.h>.

> >
> >
> > Forget casting, the ctype functions don't even work at all if the high
> > bit is on. Ctype only works with ASCII.

>
> First, C does not assume ASCII character encodings,
> and runs happily on systems that do not use ASCII. The
> only constraints on the encoding are (1) that the available
> characters include a specified set of "basic" characters,
> (2) that the codes for the basic characters be non-negative,
> and (3) that the codes for the characters '0' through '9'
> be consecutive and ascending. Any encoding that meets
> these requirements -- ASCII or not -- is acceptable for C.


True. I forgot about EBCDIC and such (thanks Richard).

But that is just a pedantic distraction from the real point which is that
your code will not work with non-latin1 encodings and that is going to
seriously impact its portability.

> Second, the <ctype.h> functions are required to accept
> arguments whose values cover the entire range of unsigned
> char (plus EOF). Half those values have the high bit set,
> and the <ctype.h> functions cannot ignore that half.


#include <stdio.h>
#include <ctype.h>

#define CH 0xdf

int
main(void)
{
    printf("%c %d %x\n", CH, CH, CH);

    printf("isalnum=%d\n", isalnum(CH));
    printf("isalpha=%d\n", isalpha(CH));
    printf("iscntrl=%d\n", iscntrl(CH));
    printf("isdigit=%d\n", isdigit(CH));
    printf("isgraph=%d\n", isgraph(CH));
    printf("islower=%d\n", islower(CH));
    printf("isupper=%d\n", isupper(CH));
    printf("isprint=%d\n", isprint(CH));
    printf("ispunct=%d\n", ispunct(CH));
    printf("isspace=%d\n", isspace(CH));

    return 0;
}

$ LANG=en_US.ISO-8859-1 ./t
ß 223 df
isalnum=0
isalpha=0
iscntrl=0
isdigit=0
isgraph=0
islower=0
isupper=0
isprint=0
ispunct=0
isspace=0

Again, even if these functions did work they *still* wouldn't handle
non-latin1 encodings (e.g. UTF-8).

> > Ok. A little history is nice. But I really think these discussions
> > should be punctuated with saying that the C standard library is basically
> > useless at this point.

>
> If you think so, then why use C? You're planning on
> throwing away the entire library and changing the handling
> of text in fundamental ways (ways that go far beyond your
> initial "I want unsigned text" plea). The result would be
> a programming language in which existing C programs would
> not run and perhaps would not compile; why are so you set
> on calling this new and different language "C?" Call it
> "D" or "Sanskrit" or "Baloney" if you like, but it ain't C.


I think that you should consider the possibility that programming
requirements are changing and that discussing the history of C will have
no impact on that. Anyone who could move to Java or .NET already has. The
rest of us are doing systems programming that needs to be C (like me).

If standards mandate UTF-8 your techniques will have to change or you're
going to be doing a lot of painful character encoding conversions at
interface boundaries.

Mike
 