
clarification on character handling

 
 
aegis
Guest
Posts: n/a
 
      08-08-2005
7.4#1 states
The header <ctype.h> declares several functions useful for classifying
and mapping characters.166) In all cases the argument is an int, the
value of which shall be representable as an unsigned char or shall
equal the value of the macro EOF. If the
argument has any other value, the behavior is undefined.

Why should something such as:
tolower(-10); invoke undefined behavior?

It obviously has something to do with how tolower can be implemented,
but I can't think of anything concrete.

--
aegis

 
 
 
 
 
Peter Nilsson
Guest
Posts: n/a
 
      08-08-2005
aegis wrote:
> 7.4#1 states
> The header <ctype.h> declares several functions useful for classifying
> and mapping characters.166) In all cases the argument is an int, the
> value of which shall be representable as an unsigned char or shall
> equal the value of the macro EOF. If the
> argument has any other value, the behavior is undefined.
>
> Why should something such as:
> tolower(-10); invoke undefined behavior?


More to the point, what should it be if _not_ UB?

> It obviously has something to do with how tolower can be implemented,
> but I can't think of anything concrete.


Consider a simple look up table (and the fact that EOF is quite
often and deliberately set at -1). The toxxxx() macros and functions
are often implemented in this way...

unsigned char _flags[257] = { 0, .... };

#define tolower(x) (_flags[(x) + 1] & _lower_case_flag)

If you try tolower(-10), then the element referenced is not within
the specified array. It's no different to tolower(32767) on an 8-bit
char system. Why would you _expect_ some defined behaviour?

--
Peter

 
 
 
 
 
RAJU
Guest
Posts: n/a
 
      08-08-2005
Hi aegis,

The expected argument to tolower(c) is stated in the specification.
What happens when an unexpected argument is passed is not specified. It
is left to the compiler writers, so the result depends on the compiler
and the system.

It is the programmer's responsibility to avoid these scenarios. These C
functions return no error code, which is very common in the C standard
library.

Regards,
Raju




aegis wrote:
> 7.4#1 states
> The header <ctype.h> declares several functions useful for classifying
> and mapping characters.166) In all cases the argument is an int, the
> value of which shall be representable as an unsigned char or shall
> equal the value of the macro EOF. If the
> argument has any other value, the behavior is undefined.
>
> Why should something such as:
> tolower(-10); invoke undefined behavior?
>
> It obviously has something to do with how tolower can be implemented,
> but I can't think of anything concrete.
>
> --
> aegis


 
 
CBFalconer
Guest
Posts: n/a
 
      08-08-2005
aegis wrote:
>
> 7.4#1 states
> The header <ctype.h> declares several functions useful for
> classifying and mapping characters.166) In all cases the argument
> is an int, the value of which shall be representable as an
> unsigned char or shall equal the value of the macro EOF. If the
> argument has any other value, the behavior is undefined.
>
> Why should something such as:
> tolower(-10); invoke undefined behavior?
>
> It obviously has something to do with how tolower can be implemented,
> but I can't think of anything concrete.


Many systems have an array of bit flags, tested with masks, such that
the array can be indexed by the value of the character + 1. If the
value of EOF is -1, this maps onto a normal 0-based array; if EOF is
something else, appropriate code can correct for it. The bits indicate
whether the character is upper case, lower case, printable, numeric,
etc. A single index and mask can return the appropriate characteristic.

Negative input values other than EOF foul this up and result in
out-of-bounds memory accesses.
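
A minimal sketch of the kind of table described above, not any particular library's code. The names sk_flags, sk_test and the SK_* bits are made up for illustration, and only the letters and digits of the basic character set are filled in. Slot 0 is reserved for EOF, which reduces to the plain "c + 1" index when EOF is -1.

```c
#include <limits.h>  /* UCHAR_MAX */
#include <stdio.h>   /* EOF */

/* Hypothetical flag bits; a real library chooses its own layout. */
enum { SK_UPPER = 1, SK_LOWER = 2, SK_DIGIT = 4 };

/* One slot for EOF plus one slot per unsigned char value. */
static unsigned char sk_flags[UCHAR_MAX + 2];
static int sk_ready = 0;

/* Set one flag bit for every character in the given string. */
static void sk_mark(const char *chars, unsigned char flag)
{
    for (; *chars != '\0'; ++chars)
        sk_flags[(unsigned char)*chars + 1] |= flag;
}

/* Fill the table once, on first use. */
static void sk_init(void)
{
    sk_mark("ABCDEFGHIJKLMNOPQRSTUVWXYZ", SK_UPPER);
    sk_mark("abcdefghijklmnopqrstuvwxyz", SK_LOWER);
    sk_mark("0123456789", SK_DIGIT);
    sk_ready = 1;
}

/* Valid only for c == EOF or 0 <= c <= UCHAR_MAX, like the real
   <ctype.h> functions. Anything else indexes outside the table. */
static int sk_test(int c, unsigned char flag)
{
    if (!sk_ready)
        sk_init();
    return sk_flags[c == EOF ? 0 : c + 1] & flag;
}

static int sk_isupper(int c) { return sk_test(c, SK_UPPER); }
static int sk_isdigit(int c) { return sk_test(c, SK_DIGIT); }
```

With this layout, a call such as sk_isupper(-10) would read sk_flags[-9], which lies before the start of the array. That is the out-of-bounds access the undefined-behaviour wording permits.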

--
Chuck F
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!

 
 
Richard Kettlewell
Guest
Posts: n/a
 
      08-08-2005
"aegis" writes:
> 7.4#1 states
> The header <ctype.h> declares several functions useful for
> classifying and mapping characters.166) In all cases the argument is
> an int, the value of which shall be representable as an unsigned
> char or shall equal the value of the macro EOF. If the argument has
> any other value, the behavior is undefined.
>
> Why should something such as:
> tolower(-10); invoke undefined behavior?
>
> It obviously has something to do with how tolower can be implemented,
> but I can't think of anything concrete.


I would say you have it backwards: the ways in which tolower can be
implemented are defined by the specification, and the specification
allows implementations to break on negative non-EOF input if that's
the most convenient thing for them.

--
http://www.greenend.org.uk/rjk/
 
 
Antoine Leca
Guest
Posts: n/a
 
      08-08-2005
aegis wrote:
> Why should something such as:
> tolower(-10); invoke undefined behavior?


Because historically it does (out-of-bounds access), and it was not deemed
worthwhile to give it a defined, reasonable behaviour (which one, by the way?)


Antoine

 
 
Antoine Leca
Guest
Posts: n/a
 
      08-08-2005
Sorry if I am being too picky. I do not know what the original poster's
point was, but since he posted to both comp.lang.c and comp.std.c, perhaps
he wants to make a point about toxxx() vs. isxxx().

Peter Nilsson wrote:
> The toxxxx() macros and functions are often implemented in this way...
>
> unsigned char _flags[257] = { 0, .... };
>
> #define tolower(x) (_flags[(x) + 1] & _lower_case_flag)


This is unlikely to work correctly on a large scale (and *_flags can't be
0); furthermore, your _flags[] array cannot be shared with toupper(), which
makes its name pretty misleading.

Also, implementations of tolower() and toupper() as macros using the
classification array lookup, like
#define tolower(x) ((x) ^ _flags[(x) + 1] & _upper_case_flag)
(with a suitably chosen _upper_case_flag, i.e. 0x20 for ASCII and 0x40
for EBCDIC), do not comply with the C standard, because the x argument is
evaluated twice.

The other obvious "solution",
#define tolower(x) (_locale_dependent_array_for_tolower[(x) + 1])
is difficult to get working correctly according to the specification,
because it has to return an int, including for EOF (which is negative) and
UCHAR_MAX (which is positive), so the element type of the array cannot in
general be a character type; and the resulting increase in width wastes
memory. As a result, many implementations do not provide tolower() and
toupper() as macros, only as functions.
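
As a rough illustration of the function form, here is a sketch of a mapping table with int elements. The names sk_lower_map and sk_tolower are hypothetical, and only the 26 letters of the basic character set are mapped. Because it is a function, the argument is evaluated exactly once, and the int element type lets both EOF and UCHAR_MAX come back unchanged.

```c
#include <limits.h>  /* UCHAR_MAX */
#include <stdio.h>   /* EOF */

/* int elements: wide enough to hold EOF as well as UCHAR_MAX. */
static int sk_lower_map[UCHAR_MAX + 2];
static int sk_map_ready = 0;

/* Build an identity map, then redirect the uppercase letters. */
static void sk_map_init(void)
{
    const char *up = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const char *lo = "abcdefghijklmnopqrstuvwxyz";
    int i;

    sk_lower_map[0] = EOF;              /* slot 0: EOF maps to itself */
    for (i = 0; i <= UCHAR_MAX; ++i)
        sk_lower_map[i + 1] = i;        /* every other value: identity */
    for (i = 0; up[i] != '\0'; ++i)
        sk_lower_map[(unsigned char)up[i] + 1] = (unsigned char)lo[i];
    sk_map_ready = 1;
}

/* Valid only for c == EOF or 0 <= c <= UCHAR_MAX. */
static int sk_tolower(int c)
{
    if (!sk_map_ready)
        sk_map_init();
    return sk_lower_map[c == EOF ? 0 : c + 1];
}
```

The cost mentioned above is visible in the declaration: the table holds UCHAR_MAX + 2 ints rather than that many bytes.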


Antoine

 
 
Keith Thompson
Guest
Posts: n/a
 
      08-08-2005
"Peter Nilsson" writes:
> aegis wrote:
>> 7.4#1 states
>> The header <ctype.h> declares several functions useful for classifying
>> and mapping characters.166) In all cases the argument is an int, the
>> value of which shall be representable as an unsigned char or shall
>> equal the value of the macro EOF. If the
>> argument has any other value, the behavior is undefined.
>>
>> Why should something such as:
>> tolower(-10); invoke undefined behavior?

>
> More to the point, what should it be if _not_ UB?


If plain char is signed, it would be sensible to define the various
functions to work properly with signed values, including negative
values. All the characters of the basic character set are required to
be nonnegative, but it would be nice to be able to do something like:

char c = some_arbitrary_value;
if (isupper(c)) {
    do_something();
}
else {
    do_something_else();
}

The need to cast the argument to unsigned char is well documented, but
IMHO counterintuitive.
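
For reference, the documented idiom looks roughly like this. The helper count_upper is a made-up name used only for illustration; the cast is the part that matters.

```c
#include <ctype.h>   /* isupper */
#include <stddef.h>  /* size_t */

/* Count the uppercase letters in a string of plain char. */
static size_t count_upper(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; ++s) {
        /* The cast converts a possibly negative char into the range
           0..UCHAR_MAX, which <ctype.h> is required to accept. */
        if (isupper((unsigned char)*s))
            ++n;
    }
    return n;
}
```

Without the cast, a string containing a byte such as 0xF6 would pass a negative value to isupper() on implementations where plain char is signed.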

The restriction to non-negative values and EOF makes things slightly
easier for the implementation, and slightly more difficult for the
programmer. This may have been a good tradeoff when the functions
were first defined; I don't think it is now.

I've seen implementations of <ctype.h> that work properly for values
from -128 to +255, covering both signed and unsigned characters.
There is an overlap between EOF (typically -1) and whatever character
is encoded as -1 (lowercase-y-with-diaeresis in Latin-1, I think), but
that's not a problem in the default locale, since all the functions
happen to return the same value for EOF and that character.

>> It obviously has something to do with how tolower can be implemented,
>> but I can't think of anything concrete.

>
> Consider a simple look up table (and the fact that EOF is quite
> often and deliberately set at -1). The toxxxx() macros and functions
> are often implemented in this way...
>
> unsigned char _flags[257] = { 0, .... };
>
> #define tolower(x) (_flags[(x) + 1] & _lower_case_flag)
>
> If you try tolower(-10), then the element referenced is not within
> the specified array. It's no different to tolower(32767) on an 8-bit
> char system. Why would you _expect_ some defined behaviour?


This approach can handle negative values sensibly by changing the
offset value and making the array bigger.

Of course, since the standard doesn't require implementations to do
this, portable code still needs to make sure the argument is either
EOF or a non-negative value.

--
Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
 
 
Johan Borkhuis
Guest
Posts: n/a
 
      08-09-2005
Peter Nilsson wrote:
> Consider a simple look up table (and the fact that EOF is quite
> often and deliberately set at -1). The toxxxx() macros and functions
> are often implemented in this way...
>
> unsigned char _flags[257] = { 0, .... };
>
> #define tolower(x) (_flags[(x) + 1] & _lower_case_flag)
>
> If you try tolower(-10), then the element referenced is not within
> the specified array.


Then why not change it to:
#define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
This will make sure that you cannot get outside the boundaries of the
lookup table.

Kind regards,
Johan

--
Johan Borkhuis http://www.borkhuis.com
VxWorks FAQ: http://www.xs4all.nl/~borkhuis/vxworks/vxworks.html
 
 
Krishanu Debnath
Guest
Posts: n/a
 
      08-09-2005

Johan Borkhuis wrote:
> Peter Nilsson wrote:
> > Consider a simple look up table (and the fact that EOF is quite
> > often and deliberately set at -1). The toxxxx() macros and functions
> > are often implemented in this way...
> >
> > unsigned char _flags[257] = { 0, .... };
> >
> > #define tolower(x) (_flags[(x) + 1] & _lower_case_flag)
> >
> > If you try tolower(-10), then the element referenced is not within
> > the specified array.

>
> Then why not change it to:
> #define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
> This will make sure that you cannot get outside the boundaries of the
> lookup table.
>


It would make the tolower implementation buggy, because tolower would
then quietly succeed when called with a negative integer that maps to
a valid uppercase letter after the conversion to unsigned char.
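
The wrap can be checked directly. The helper index_after_cast below is a hypothetical name that simply reproduces the index expression from the proposed macro.

```c
#include <limits.h>  /* UCHAR_MAX */
#include <stdio.h>   /* EOF */

/* The table index the proposed macro would compute for x. */
static int index_after_cast(int x)
{
    /* Conversion to unsigned char reduces x modulo UCHAR_MAX + 1. */
    return (unsigned char)x + 1;
}
```

With an 8-bit char, index_after_cast(-191) is the same slot as index_after_cast('A') on an ASCII system, so an out-of-range argument is silently treated as a letter. The cast also folds EOF onto a real character slot (the one for UCHAR_MAX when EOF is -1), so EOF loses its dedicated entry.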

Krishanu

 
 
 
 