Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C Programming > Manipulation of strings: upper/lower case

Manipulation of strings: upper/lower case

 
 
Peter Nilsson
      01-17-2005
infobahn wrote:
> Lew Pitcher wrote:
> >
> > #include <ctype.h>
> >
> > void UppercaseString(char *string)
> > {
> > for(;*string;++string)
> > if (islower(*string)) *string = toupper(*string);
> > }

>
> Caution is necessary here. The behaviours of islower and toupper
> are undefined if they are passed a value that is neither EOF nor
> representable as an unsigned char. It is good practice, therefore,
> to cast *string to unsigned char.


I believe the cast (conversion) of individual characters is
incorrect. Instead, the byte characters should be interpreted as
unsigned char...

  char *make_upper(char *s)
  {
      unsigned char *us = (unsigned char *) s;
      for (; *us; us++) *us = toupper(*us);
      return s;
  }

The reason being that reinterpretation is more likely to be
correct.

I did once post a query about this...
http://groups.google.com/groups?thre...ivernet.com.au
--
Peter

 
Old Wolf
      01-17-2005
Peter Nilsson wrote:
> infobahn wrote:
> > Lew Pitcher wrote:
> > > if (islower(*string)) *string = toupper(*string);

> >
> > Caution is necessary here. The behaviours of islower and toupper
> > are undefined if they are passed a value that is neither EOF nor
> > representable as an unsigned char. It is good practice, therefore,
> > to cast *string to unsigned char.

>
> I believe the cast (conversion) of individual characters is
> incorrect. Instead, the byte characters should be interpreted as
> unsigned char...
>
> unsigned char *us = (unsigned char *) s;
>
> The reason being that reinterpretation is more likely to be
> correct.


Casting a signed char to unsigned is always correct,
so everything else is equally or less likely to be
correct.
AFAIK the standard does not explicitly say that you
can cast a (char *) to an (unsigned char *); for example,
many compilers warn about parameter type mismatches if you
pass one to a function expecting the other.
However, it does say that they must have the same size,
alignment, etc., so I don't see how an implementation
could conform but not allow the cast. (Unless it was the DS9K.)

 
Peter Nilsson
      01-17-2005
Old Wolf wrote:
> Peter Nilsson wrote:
> > infobahn wrote:
> > > Lew Pitcher wrote:
> > > > if (islower(*string)) *string = toupper(*string);
> > >
> > > Caution is necessary here. The behaviours of islower and toupper
> > > are undefined if they are passed a value that is neither EOF nor
> > > representable as an unsigned char. It is good practice,

therefore,
> > > to cast *string to unsigned char.

> >
> > I believe the cast (conversion) of individual characters is
> > incorrect. Instead, the byte characters should be interpreted as
> > unsigned char...
> >
> > unsigned char *us = (unsigned char *) s;
> >
> > The reason being that reinterpretation is more likely to be
> > correct.

>
> Casting a signed char to unsigned is always correct.
> So everything else is equally or less likely to be
> correct


Chapter and verse, please.

Consider that I/O functions write to buffers (and strings)
using unsigned char, not char. The string and mem functions
use unsigned char, not char.

My main point is that a cast from char to unsigned char may
NOT yield the original value that was written to the char.

> AFAIK the standard does not explicitly say that you
> can cast a (char *) to an (unsigned char *) ,


6.3.2.3p7 "... When a pointer to an object is converted to a
pointer to a character type, the result points to the lowest
addressed byte of the object. ..."

> for example
> many compilers warn about parameter type mismatches if you
> pass one to a function expecting the other.


Because many implicit conversions _require_ a diagnostic.
> <snip>


--
Peter

 
infobahn
      01-18-2005
Old Wolf wrote:
> Peter Nilsson wrote:
> > infobahn wrote:
> > > It is good practice, therefore,
> > > to cast *string to unsigned char.

> >
> > I believe the cast (conversion) of individual characters is
> > incorrect. Instead, the byte characters should be interpreted as
> > unsigned char...
> >
> > unsigned char *us = (unsigned char *) s;
> >
> > The reason being that reinterpretation is more likely to be
> > correct.

>
> Casting a signed char to unsigned is always correct.


Yes. His complaint is most strange, since there's nothing at all
wrong with the cast I suggested.

> So everything else is equally or less likely to be
> correct
> AFAIK the standard does not explicitly say that you
> can cast a (char *) to an (unsigned char *) ,


You can point an unsigned char * anywhere you can point (within
reason - for example, you wouldn't want to point it at a function).

The closest the Standard comes to formalising this, as far as I can
tell, is:

"Values stored in non-bit-field objects of any other object type
consist of n x CHAR_BIT bits, where n is the size of an object of
that type, in bytes. The value may be copied into an object of type
unsigned char [n] (e.g., by memcpy); the resulting set of bytes is
called the object representation of the value."

This doesn't actually say anything about casting, but it does say
we can represent any object using an array of unsigned char.

> for example
> many compilers warn about parameter type mismatches if you
> pass one to a function expecting the other.


And rightly so, but not because objects can't be pointed to by
unsigned char *.

> However it does say that they must have the same size,
> alignment etc. etc. etc. so I don't see how an implementation
> could conform but not allow the cast. (Unless it was the DS9k).


I do not believe the DS9K could refuse the cast either.
 
Peter Nilsson
      01-18-2005
infobahn wrote:
> Old Wolf wrote:
> > Peter Nilsson wrote:
> > > infobahn wrote:
> > > > It is good practice, therefore,
> > > > to cast *string to unsigned char.
> > >
> > > I believe the cast (conversion) of individual characters is
> > > incorrect. Instead, the byte characters should be interpreted as
> > > unsigned char...
> > >
> > > unsigned char *us = (unsigned char *) s;
> > >
> > > The reason being that reinterpretation is more likely to be
> > > correct.

> >
> > Casting a signed char to unsigned is always correct.

>
> Yes. His complaint is most strange, since there's nothing at all
> wrong with the cast I suggested.


6.2.5p3 says:

" An object declared as type char is large enough to store any
" member of the basic execution character set. If a member of the
" basic execution character set is stored in a char object, its
" value is guaranteed to be positive. If any other character is
" stored in a char object, the resulting value is implementation-
" defined but shall be within the range of values that can be
" represented in that type.

This makes it quite clear that plain char may not be sufficient
to represent the values of all (extended) characters in the
execution character set. This is the first clue that a conversion
of a plain char value might not be appropriate.

But let's look at an example...

Suppose we have an implementation with an extended character set
that includes an accented e. For the sake of argument, let's
suppose the coding for that character is 233 (0xE9). This is
representable within a byte on any system, and is therefore a
valid single-byte character.

Let's go on to suppose we read input into a character array, and
that input includes one accented e. Note that ordinary input is
made through "byte input/output functions", so the value stored
in the corresponding byte is 233. Assuming an 8-bit byte, this
has the representation...

11101001

Consider the possible signed plain char value of this
representation on various allowed 8-bit implementations...

    two's complement:   -23
    ones' complement:   -22
    sign-magnitude:    -105

Using your cast to convert char to unsigned char, we get...

    two's complement:   233
    ones' complement:   234
    sign-magnitude:     151

...only _one_ of which is correct.

If instead we interpret the byte through an unsigned char
pointer, then we get 233, irrespective of the signed plain
char value. Had I considered the character coding of 128,
then the last sentence of 6.2.5p3 says you have _NO_ guarantee
that your cast to unsigned char will produce 128.

That is why the 'interpreted' way is better than 'conversion'.
Note that the string/memory functions interpret, rather than
cast, for similar reasons.
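To illustrate the point about the string functions, here is a sketch of the specified behaviour (not the actual library source; cmp_like_strcmp is my name for it):

```c
/* Sketch: strcmp is specified to compare the bytes as if they had
   type unsigned char -- it interprets the representation through an
   unsigned char pointer rather than converting char values. */
int cmp_like_strcmp(const char *a, const char *b)
{
    const unsigned char *ua = (const unsigned char *) a;
    const unsigned char *ub = (const unsigned char *) b;
    while (*ua != '\0' && *ua == *ub) {
        ua++;
        ub++;
    }
    return (*ua > *ub) - (*ua < *ub);
}
```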

--
Peter

 
Lawrence Kirby
      01-18-2005
On Tue, 18 Jan 2005 00:49:54 -0800, Peter Nilsson wrote:

....

> If instead we interpret the byte through an unsigned char
> pointer, then we get 233, irrespective of the signed plain
> char value. Had I considered the character coding of 128,
> then the last sentence of 6.2.5p3 says you have _NO_ guarantee
> that your cast to unsigned char will produce 128.
>
> That is why the 'interpreted' way is better than 'conversion'.
> Note that the string/memory functions interpret, rather than
> cast, for similar reasons.


The real issue is that neither approach is correct until we know how the
value in the char has been derived in the first place. Maybe the character
value was obtained by converting the return value of getc() to char,
maybe it was written directly by fgets() or fread().

In practice implementations that create inconsistent results for the
various approaches discussed are going to cause problems. In such
environments it would probably be wise for the implementation to define
char as an unsigned type. It is one of those things where the best thing
to do is ignore it until you come across it. You would have to be
AMAZINGLY unlucky for that to happen. IMO you are more likely to encounter
problems due to compiler bugs than this, and you might as well treat this
as such.

Lawrence
 
Peter Nilsson
      01-18-2005
Lawrence Kirby wrote:
> On Tue, 18 Jan 2005 00:49:54 -0800, Peter Nilsson wrote:
>
> ...
>
> > If instead we interpret the byte through an unsigned char
> > pointer, then we get 233, irrespective of the signed plain
> > char value. Had I considered the character coding of 128,
> > then the last sentence of 6.2.5p3 says you have _NO_ guarantee
> > that your cast to unsigned char will produce 128.
> >
> > That is why the 'interpreted' way is better than 'conversion'.
> > Note that the string/memory functions interpret, rather than
> > cast, for similar reasons.

>
> The real issue is that neither approach is correct until we know
> how the value in the char has been derived in the first place.
> Maybe the character value was obtained by converting the return
> value of getc() to char, maybe it was written directly by fgets()
> or fread().


This is generally within the control of the programmer. Reading
input into char arrays by assigning values returned by fgetc is
wrong... in the theoretical sense. That a lot of programs do it
(K&R2 does it) doesn't make it any less 'wrong'.

> In practice implementations that create inconsistent results for
> the various approaches discussed are going to cause problems. In
> such environments it would probably be wise for the implementation
> to define char as an unsigned type.


It would be even better if the standard actually _required_ this
for qualified implementations.

Personally, I think the standard is defective, not merely because
of the above issues, but also in the way it treats character
constants.

Consider an 8-bit implementation where plain char is signed, uses
a non-two's-complement representation, but supports a subset of iso646. C99, by
my reading, _requires_ that such implementations generate a value
_other than_ 233 for the character constants '\xe9' and '\u00e9'!

That said, I don't honestly claim to be able to rectify the standard
in a way that a significant majority of C diehards would approve of.

> It is one of those things where the best thing to do is ignore it
> until you come across it.
>
> You would have to be AMAZINGLY unlucky for that to happen. IMO you
> are more likely to encounter problems due to compiler bugs than
> this, and you might as well treat this as such.


I agree, but I note that a modern C programmer would have to be
'amazingly unlucky' to ever program a hosted implementation that
didn't use two's complement, or had 9-bit chars, or uses different
sized pointers for different (object or incomplete) pointer types,
has integer padding bits, ... and all the other things which are
regularly cited in clc as being supposedly relevant considerations.

Such things are so esoteric as to be worth ignoring. Nonetheless, I
still believe clc would be doing a disservice to its readers if it
did not mention them.

--
Peter

 
Old Wolf
      01-18-2005
infobahn wrote:
> Old Wolf wrote:
>
> > AFAIK the standard does not explicitly say that you
> > can cast a (char *) to an (unsigned char *) ,

>
> You can point an unsigned char * anywhere you can point (within
> reason - for example, you wouldn't want to point it at a function).


Right. I meant to also say "...and get the expected result".

After reading Peter Nilsson's last post, I think his point
was that if you want to access the representation of a byte,
then you must point to it with (unsigned char *) and then read
it. This is of course different from reading the C value of a
signed char and then converting it to unsigned (the two can
disagree on non-two's-complement systems).

 
Keith Thompson
      01-18-2005
"Peter Nilsson" <(E-Mail Removed)> writes:
[...]
> Personally, I think the standard is defective, not merely because
> of the above issues, but also in the way it treats character
> constants.
>
> Consider an 8-bit implementation where plain char is signed, uses
> non two's complement, but supports a subset of iso646. C99, by
> my reading, _requires_ that such implementations generate a value
> _other than_ 233 for the character constants '\xe9' and '\u00e9'!
>
> That said, I don't honestly claim to be able to rectify the standard
> in a way that a significant majority of C diehards would approve of.


Is there any real advantage (other than not breaking existing
implementations) in allowing plain char to be signed? I know there
are historical reasons, but what would break if the standard required
char to have the same characteristics as unsigned char?

--
Keith Thompson (The_Other_Keith) (E-Mail Removed) <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
 
Eric Sosman
      01-19-2005
Richard Bos wrote:

> "Peter Nilsson" <(E-Mail Removed)> wrote:
>
>>My main point is that a cast from char to unsigned char may
>>NOT yield the original value that was written to the char.

>
> I'm afraid they must.


A counterexample comes to mind. Consider a signed `char'
on a system that uses either ones' complement or signed
magnitude to represent negative integers. On such a system
there are two distinct `char' representations that have the
value zero (unless "minus zero" is a trap value), and both
of them produce the same value (zero) upon conversion to
`unsigned char'. Conversion obliterates the distinction.

Whether all this makes much difference is open to question,
though. A conforming C implementation can use signed magnitude,
can choose signed `char', can even choose CHAR_MAX==ULLONG_MAX,
but if it is a hosted implementation it must still make the I/O
functions work "properly." A successful getc() delivers an `int'
in the range 0..UCHAR_MAX, and if CHAR_MAX<UCHAR_MAX we might
think it unsafe to assign such a value to a plain `char' -- the
attempted conversion, according to the Standard, produces an
implementation-defined result or raises an implementation-defined
signal, and thus cannot be performed in a strictly-conforming
program. However, an implementation capable of reading a valid
character from an input stream but incapable of storing it into
a `char' would be laughed out of the marketplace. It might be
too ambitious to claim that such an implementation violated the
Standard, but "quality of implementation" concerns would, I think,
rule it out. As a practical matter, any system with signed `char'
must do "something reasonable" when it converts an out-of-range
`unsigned char' to plain (signed) `char'; the implementation-
defined aspect will turn out to be "what you wanted."

--
Eric Sosman
(E-Mail Removed)lid
 