Velocity Reviews > VLA question

VLA question

Stephen Sprunk
Guest
Posts: n/a

 07-02-2013
On 01-Jul-13 13:07, James Kuyper wrote:
> On 07/01/2013 12:36 PM, Stephen Sprunk wrote:
>> On 01-Jul-13 07:56, James Kuyper wrote:
>>> On 06/27/2013 12:07 PM, Stephen Sprunk wrote:
>>>> an int would require a possibly-problematic conversion.
>>>>
>>>> Is that the concern?
>>>
>>> Almost. The conversion to 'int' would be guaranteed to produce
>>> exactly the same value that the character literal would have had
>>> under the current rules.

>>
>> Why? I thought that, while converting a negative value to unsigned
>> was well-defined, converting an out-of-range unsigned value to
>> signed was not.

>
> I mentioned my argument for that conclusion earlier in this thread -
> both you and Keith seem to have skipped over it without either
> accepting it or explaining why you had rejected it. Here it is
> again.

I'll admit that I didn't quite understand the relevance the first time;
you added some clarification this time (plus some of the other points
discussed have started to sink in), so now I think I get it.

> ... While, in general, conversion to signed type of a value that is
> too big to be represented by that type produces an implementation-
> defined result or raises an implementation-defined signal, for this
> particular conversion, I think that 7.21.2p3 implicitly prohibits the
> signal, and requires that if 'c' is an unsigned char, then
>
> (unsigned char)(int)c == c
>
> If CHAR_MAX > INT_MAX, then 'char' must behave the same as 'unsigned
> char'. Also, on such an implementation, there cannot be more valid
> 'int' values than there are 'char' values, and the inversion
> requirement implies that there cannot be more char values than there
> are valid 'int' values. This means that we must also have, if 'i' is
> an int object containing a valid representation, that
>
> (int)(char)i == i

This is indeed an interesting property of such systems, and one with
unexpectedly far-reaching implications.

> In particular, this applies when i==EOF, which is why comparing
> fgetc() values with EOF is not sufficient to determine whether or not
> the call was successful.

I'd wondered about that, since the usual excuse for fgetc() returning an
int is to allow for EOF, which is presented by most introductory texts
as being impossible to mistake for a valid character.

> Negative zero and positive zero have to
> convert to the same unsigned char, which would make it impossible to
> meet both inversion requirements, so it also follows that 'int' must
> have a 2's complement representation on such a platform.

That only holds if plain char is unsigned, right?

It seems these apparently unrelated restrictions would not apply if plain
char were signed, which IMHO would be the only logical choice if
character literals were signed.

>> I consider it insane to have an unsigned plain char when character
>> literals can be negative.

>
> You've already said that. What you haven't done so far is explained
> why. I agree that there's a bit of conflict there, but 'insane' seems
> extreme.

Perhaps "insane" was a bit strong, but I see no rational excuse for the
signedness of plain chars and character literals to differ; the two are
logically linked, and only C's definition of the latter as "int" even
allows such a bizarre case to exist in theory.

IMHO, that C++ implicitly requires the signedness of the two to match,
apparently without problems, is an argument in favor of adopting the
same rule in C. As long as the signedness matches, none of the problems
mentioned in this thread could come up and potentially break code that
was not written to account for this unlikely corner case.

>> In C++, character literals have type char, so if char is unsigned,
>> then by definition no character literal can be negative.

>
> I'd forgotten that C++ had a different rule for the value of a
> character literal than C does. The C rule is defined in terms of
> conversion of a char object's value to type 'int', which obviously
> would be inappropriate given that C++ gives character literals a type
> of 'char'. Somehow I managed to miss that "obvious" conclusion, and I
> didn't bother to check. Sorry.

I'm in no position to complain about that.

>>> 3. Obscure, and possibly mythical, implementations where
>>> CHAR_MAX > INT_MAX.
>>>
>>> I consider the third item to be overwhelmingly the most
>>> significant of the three issues, even though the unlikelihood of
>>> such implementations makes it an insignificant issue in absolute
>>> terms.

>>
>> We know there are systems where sizeof(int)==1; can we really
>> assume that plain char is signed on all such implementations, which
>> is the only way for them to _avoid_ CHAR_MAX > INT_MAX?

>
> Every time I've brought up the odd behavior of implementations which
> have UCHAR_MAX > INT_MAX, it's been argued that they either don't
> exist or are so rare that we don't need to bother worrying about
> them. Implementations where CHAR_MAX>INT_MAX must be even rarer
> (since they are a subset of implementations where UCHAR_MAX >
> INT_MAX), so I'm surprised (and a bit relieved) to see someone
> actually arguing for the probable existence of such implementations.
> I'd feel happier about it if someone could actually cite one, but I
> don't remember anyone ever doing so.

I'm not arguing for the _probable_ existence of such systems so much as
admitting that I don't have enough experience with atypical systems to
have much idea what's really out there on the fringes; the world had
pretty much standardized on two's-complement systems with flat, 32-bit
address spaces by the time I started using C. 64-bit systems were my
first real-world experience with having to think about variations in the
sizes of base types--and even then usually only pointers.

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking

James Kuyper

 07-02-2013
On 07/02/2013 01:10 AM, Stephen Sprunk wrote:
> On 01-Jul-13 13:07, James Kuyper wrote:

....
>> If CHAR_MAX > INT_MAX, then 'char' must behave the same as 'unsigned
>> char'. Also, on such an implementation, there cannot be more valid
>> 'int' values than there are 'char' values, and the inversion
>> requirement implies that there cannot be more char values than there
>> are valid 'int' values. This means that we must also have, if 'i' is
>> an int object containing a valid representation, that
>>
>> (int)(char)i == i

>
> This is indeed an interesting property of such systems, and one with
> unexpectedly far-reaching implications.
>
>> In particular, this applies when i==EOF, which is why comparing
>> fgetc() values with EOF is not sufficient to determine whether or not
>> the call was successful.

>
> I'd wondered about that, since the usual excuse for fgetc() returning an
> int is to allow for EOF, which is presented by most introductory texts
> as being impossible to mistake for a valid character.

On most systems, including the ones where C was first developed, that's
perfectly true. But the C standard allows an implementation where that's
not true to still be fully conforming. This does not "break" fgetc(), as
some have claimed, since you can still use feof() and ferror() to
determine whether an EOF value indicates success, failure, or
end-of-file; but in principle it does make use of fgetc() less convenient.

>> Negative zero and positive zero have to
>> convert to the same unsigned char, which would make it impossible to
>> meet both inversion requirements, so it also follows that 'int' must
>> have a 2's complement representation on such a platform.

>
> That only holds if plain char is unsigned, right?
>
> It seems these seemingly-unrelated restrictions would not apply if plain
> char were signed, which would be the (IMHO only) logical choice if
> character literals were signed.

Correct - most of what I've been saying has been explicitly about
platforms where CHAR_MAX > INT_MAX, which would not be permitted if char
were signed. "For any two integer types with the same signedness and
different integer conversion rank (see 6.3.1.1), the range of values of
the type with smaller integer conversion rank is a subrange of the
values of the other type." (6.2.5p8)

....
> Perhaps "insane" was a bit strong, but I see no rational excuse for the
> signedness of plain chars and character literals to differ; the two are
> logically linked, and only C's definition of the latter as "int" even
> allows such a bizarre case to exist in theory.
>
> IMHO, that C++ implicitly requires the signedness of the two to match,
> apparently without problems, is an argument in favor of adopting the
> same rule in C. As long as the signedness matches, none of the problems
> mentioned in this thread would come up--and potentially break code that
> was not written to account for this unlikely corner case.

I agree that the C++ approach makes more sense - I'm taking issue only
with your characterization of C code which relies upon the C approach as
"broken". I also think it's unlikely that the C committee would decide
to change this, even though I've argued that the breakage that could
occur would be fairly minor.

You've seen how many complicated ideas and words I've had to put
together to construct my arguments for the breakage being minor. The
committee would have to be even more rigorous in considering the same
issues. The fact that there could be any breakage at all (and there can
be) means that there would have to be some pretty significant
compensating advantages for the committee to decide to make such a
change. Despite agreeing with the C++ approach, I don't think the
advantages are large enough to justify such a change.
--
James Kuyper

James Kuyper

 07-02-2013
On 06/27/2013 06:17 PM, James Kuyper wrote:
> On 06/27/2013 06:06 PM, Keith Thompson wrote:
>> James Kuyper <(E-Mail Removed)> writes:
>>> On 06/27/2013 02:41 PM, Seebs wrote:
>>>> On 2013-06-27, James Kuyper <(E-Mail Removed)> wrote:
>>>>> As I indicated above, the problem I described arises only on
>>>>> implementations where CHAR_MAX > INT_MAX. If CHAR_BIT==8, then you can't
>>>>> have been testing on such a system.
>>>>
>>>> I was about to say "hang on, how can that happen, int must be at least
>>>> as wide as char", but of course, it can happen if CHAR_MAX == UCHAR_MAX.
>>>
>>> Right - as I mentioned earlier, CHAR_MAX > INT_MAX implies that CHAR_MIN
>>> == 0.

>>
>> Suppose CHAR_BIT==32, CHAR_MIN==-2**31, CHAR_MAX==2**31-1,
>> sizeof(int)==1, and int has one padding bit, so INT_MAX==2**30-1.

>
> You're right - I reached that conclusion so many years ago that I forgot
> the assumptions I relied upon to reach it. I was thinking of the minimal
> case where CHAR_MAX is as small as possible while still being greater
> than INT_MAX, in which case there's no room for padding bits. If you
> move away from the minimal case, there is room for padding bits, and
> then the argument breaks down. Of course, such implementations are even
> less commonplace than the minimal case.
>
> I'll have to review my earlier comments more carefully with that
> correction in mind.

I was tired and in a hurry to go home, and didn't put enough thought
into my response. Such an implementation would violate 6.2.5p8:

"For any two integer types with the same signedness and different
integer conversion rank (see 6.3.1.1), the range of values of the type
with smaller integer conversion rank is a subrange of the values of the
other type."
--
James Kuyper

James Kuyper

 07-02-2013
On 06/29/2013 02:05 PM, Keith Thompson wrote:
....
> Right -- but that's only an issue when CHAR_BIT >= 16, which is the
> context I missed in my previous response. As I also noted elsethread,
> the conversion from char to int, where char is an unsigned type and the
> value doesn't fit, is implementation-defined; the result is *probably*
> negative, but it's not guaranteed.

I've just posted an argument on a different branch of this thread that
7.21.2p3 indirectly implies that on systems where UCHAR_MAX > INT_MAX,
given an unsigned character c and a valid int i, we must have

(unsigned char)(int)c == c

and

(int)(unsigned char)i == i

Comment?
--
James Kuyper

glen herrmannsfeldt

 07-02-2013
James Kuyper <(E-Mail Removed)> wrote:
> On 07/02/2013 01:10 AM, Stephen Sprunk wrote:

(snip)
>> I'd wondered about that, since the usual excuse for fgetc() returning an
>> int is to allow for EOF, which is presented by most introductory texts
>> as being impossible to mistake for a valid character.

> On most systems, including the ones where C was first developed,
> that's perfectly true. But the C standard allows an
> implementation where that's not true to still be fully conforming.
> This does not "break" fgetc(), as some have claimed, since you can
> still use feof() and ferror() to determine whether an EOF value
> indicates success, failure, or end-of-file; but in principle it
> does make use of fgetc() less convenient.

Depending on your definition of valid character. My understanding
is that a 7-bit ASCII system can use a signed 8-bit char, but
8-bit EBCDIC systems should use unsigned char. (No systems ever
used the ASCII-8 code that IBM designed into S/360.)

A Unicode-based system could use a 16-bit unsigned char, like
Java does.

"Valid character" doesn't mean anything you can put the
bit pattern out for, but an actual character in the input
character set.

-- glen

James Kuyper

 07-02-2013
On 07/02/2013 11:14 AM, glen herrmannsfeldt wrote:
> James Kuyper <(E-Mail Removed)> wrote:
>> On 07/02/2013 01:10 AM, Stephen Sprunk wrote:

>
> (snip)
>>> I'd wondered about that, since the usual excuse for fgetc() returning an
>>> int is to allow for EOF, which is presented by most introductory texts
>>> as being impossible to mistake for a valid character.

>
>> On most systems, including the ones where C was first developed,
>> that's perfectly true. But the C standard allows an
>> implementation where that's not true to still be fully conforming.
>> This does not "break" fgetc(), as some have claimed, since you can
>> still use feof() and ferror() to determine whether an EOF value
>> indicates success, failure, or end-of-file; but in principle it
>> does make use of fgetc() less convenient.

>
> Depending on your definition of valid character.

For this purpose, a valid character is anything that can be returned by
a successful call to fgetc(). Since I can fill a buffer with unsigned
char values from 0 to UCHAR_MAX, and write that buffer to a binary
stream, with a guarantee of being able read the same values back, I must
respectfully disagree with the following assertion:

....
> "Valid character" doesn't mean anything you can put the
> bit pattern out for, but an actual character in the input
> character set.

Do you think that the only purpose for fgetc() is to read text files?
All C input, whether from text streams or binary, has behavior defined
by the standard in terms of calls to fgetc(), whether or not actual
calls to that function occur.

Keith Thompson

 07-02-2013
Stephen Sprunk <(E-Mail Removed)> writes:
[...]
> Granted, one can create arbitrary character literals, but doing so
> ventures into "contrived" territory. I only mean to include real
> characters, which I think means ones in the source or execution
> character sets.

[...]

I wouldn't call '\xff' (or '\xffff' for CHAR_BIT==16) contrived.

--
Keith Thompson (The_Other_Keith) (E-Mail Removed) <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Keith Thompson

 07-02-2013
James Kuyper <(E-Mail Removed)> writes:
> On 06/29/2013 02:05 PM, Keith Thompson wrote:
> ...
>> Right -- but that's only an issue when CHAR_BIT >= 16, which is the
>> context I missed in my previous response. As I also noted elsethread,
>> the conversion from char to int, where char is an unsigned type and the
>> value doesn't fit, is implementation-defined; the result is *probably*
>> negative, but it's not guaranteed.

>
> I've just posted an argument on a different branch of this thread that
> 7.21.2p3 indirectly implies that on systems where UCHAR_MAX > INT_MAX,
> given an unsigned character c and a valid int i, we must have
>
> (unsigned char)(int)c == c
>
> and
>
> (int)(unsigned char)i == i
>
> Comment?

I agree.

I sometimes wonder how much thought the committee put into making
everything consistent for "exotic" systems, particularly those with
char and int having the same size (which implies CHAR_BIT >= 16).
I'm fairly sure that most C programmers don't put much thought
into it.

For most systems, having fgetc() return EOF reliably indicates that
there were no more characters to read, and that exactly one of feof()
or ferror() will then return true, and I think most C programmers
rely on that assumption. That assumption can be violated only if
CHAR_BIT >= 16.

Even with CHAR_BIT == 8, storing the (non-EOF) result of fgetc() into a
char object depends on the conversion to char (which is
implementation-defined if plain char is signed) being particularly well
behaved.

Are there *any* systems with sizeof (int) == 1 (implying CHAR_BIT >= 16)
that support stdio? I know that some implementations for DSPs have
CHAR_BIT > 8, but are they all freestanding?

I wonder if we (well, the committee) should consider adding some
restrictions for hosted implementations, such as requiring INT_MAX >
CHAR_MAX or specifying the results of out-of-range conversions to plain
or signed char.

--
Keith Thompson (The_Other_Keith) (E-Mail Removed) <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

James Kuyper

 07-02-2013
On 07/02/2013 03:24 PM, Keith Thompson wrote:
....
> I wonder if we (well, the committee) should consider adding some
> restrictions for hosted implementations, such as requiring INT_MAX >
> CHAR_MAX or specifying the results of out-of-range conversions to plain
> or signed char.

That sounds like a good idea to me. However, if there are any existing
implementations that would become non-conforming as a result of such a
change, it could be difficult (and properly so) to get it approved.

Stephen Sprunk

 07-02-2013
On 02-Jul-13 14:12, Keith Thompson wrote:
> Stephen Sprunk <(E-Mail Removed)> writes:
>> Granted, one can create arbitrary character literals, but doing so
>> ventures into "contrived" territory. I only mean to include real
>> characters, which I think means ones in the source or execution
>> character sets.

>
> I wouldn't call '\xff' (or '\xffff' for CHAR_BIT==16) contrived.

Why would anyone use that syntax for a character literal, rather than
the shorter 0xff (or 0xffff)? That strikes me as contrived.

There are certain cases where using the escape syntax is reasonable,
such as '\n', but even '\0' is more simply written as just 0. String
literals are another matter entirely, but those already have type
(pointer to) char--another argument in favor of character literals
having type char.

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking