# Advancing past the last element of an array

Johannes Schaub (litb)

 12-27-2009

int a[3][1];
int *ap = &a[0][0];

I know that in C++ the following is perfectly fine:

int ap1 = *ap;
int ap2 = *(ap + 1);
int ap3 = *(ap + 1 + 1);

That is because the past-the-end pointer "ap + 1" happens to point to an
unrelated integer that just happens to be stored there, and dereferencing it
dereferences that pointer. Adding +1 again adds +1 to *that* pointer which
is a pointer into the second subarray of "a". Which will point to the last
integer that just happens to be at the past-the-end position of the second
subarray.

I know that the following two lines are undefined behavior in C++:

int ap3 = *(ap + 2);
int ap3_secondtry = *(ap + (1 + 1));

That is because it adds 2 to the pointer into an array that only has one
element.

Now my question is - how is the matter in C? Is there some paragraph in the
Standard that allows it? To me, it looks like all but "int ap1 = *ap;" is
undefined behavior in C, because it seems to disallow dereferencing the
past-the-end pointer.

Any help is welcome!

Seebs

 12-27-2009
On 2009-12-27, Johannes Schaub (litb) <(E-Mail Removed)> wrote:
>
> int a[3][1];
> int *ap = &a[0][0];
>
> I know that in C++ the following is perfectly fine:
>
> int ap1 = *ap;
> int ap2 = *(ap + 1);
> int ap3 = *(ap + 1 + 1);

Are you sure? I'd agree that it's almost certainly going to work. However,
it's not as clear to me that it's "perfectly fine".

> That is because the past-the-end pointer "ap + 1" happens to point to an
> unrelated integer that just happens to be stored there, and dereferencing it
> dereferences that pointer.

In C, that's a bounds violation, because you're going past the bounds of
the object to which you have a pointer. You have a pointer to the first of
the three subarrays. While it happens that this is part of a larger object,
it's still going past the bounds of the specific object from which you
derived the pointer.

You're certainly allowed to generate a pointer one past the end of an
array, but you're not allowed to dereference it.
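
A minimal C sketch of that rule, assuming a plain one-dimensional array: the one-past-the-end pointer is computed and compared against, but the loop never dereferences it. The name `sum_ints` is made up for illustration and does not come from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n ints starting at first. The pointer `end` is one past the last
   element: it is formed and compared, but never dereferenced. */
int sum_ints(const int *first, size_t n)
{
    assert(first != NULL);
    const int *end = first + n;   /* valid to compute, even though it is past the end */
    int total = 0;
    for (const int *p = first; p != end; ++p)
        total += *p;              /* p is always strictly before end here */
    return total;
}
```

For `int v[3] = {1, 2, 3};`, calling `sum_ints(v, 3)` reads only `v[0]` through `v[2]` and returns 6.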

> I know that the following two lines are undefined behavior in C++:

> int ap3 = *(ap + 2);

You are clearly crazy.

There is no difference between "ap + 1 + 1" and "ap + 2".

If you think there is, either the C++ standards committee is deeply
insane, or you're very confused. I'm guessing both.

> int ap3_secondtry = *(ap + (1 + 1));

> That is because it adds 2 to the pointer into an array that only has one
> element.

Again, this is not different. There's no difference to be had. You're still
going past the end of an array by the same amount.

> Now my question is - how is the matter in C? Is there some paragraph in the
> Standard that allows it? To me, it looks like all but "int ap1 = *ap;" is
> undefined behavior in C, because it seems to disallow dereferencing the
> past-the-end pointer.

It does.

If C++ allows dereferencing the one-past-the-end pointer, that gets you
ap+1, but it doesn't make ap+1+1 different from ap+2. The mere fact that
ap+1 happens to be the same address as &a[1][0] doesn't mean that you can
then expect ap+1+1 to be dereferenceable; it's still going two past the
end of an array.

-s
--
Copyright 2009, all wrongs reversed. Peter Seebach / (E-Mail Removed)
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!

Johannes Schaub (litb)

 12-28-2009
Seebs wrote:

> On 2009-12-27, Johannes Schaub (litb) <(E-Mail Removed)> wrote:
>>
>> int a[3][1];
>> int *ap = &a[0][0];
>>
>> I know that in C++ the following is perfectly fine:
>>
>> int ap1 = *ap;
>> int ap2 = *(ap + 1);
>> int ap3 = *(ap + 1 + 1);

>
> Are you sure? I'd agree that it's almost certainly going to work.
> However, it's not as clear to me that it's "perfectly fine".
>
>> That is because the past-the-end pointer "ap + 1" happens to point to an
>> unrelated integer that just happens to be stored there, and dereferencing
>> it dereferences that pointer.

>
> In C, that's a bounds violation, because you're going past the bounds of
> the object to which you have a pointer. You have a pointer to the first
> of the three subarrays. While it happens that this is part of a larger
> object, it's still going past the bounds of the specific object from which
> you derived the pointer.
>
> You're certainly allowed to generate a pointer one past the end of an
> array, but you're not allowed to dereference it.
>
>> I know that the following two lines are undefined behavior in C++:

>
>> int ap3 = *(ap + 2);

>
> You are clearly crazy.
>
> There is no difference between "ap + 1 + 1" and "ap + 2".
>
> If you think there is, either the C++ standards committee is deeply
> insane, or you're very confused. I'm guessing both.
>

I'm fairly sure it is a fact in C++ that "+ 1 + 1" is different from "+ 2"
above (notice that the binding is ((a + b) + c) and not (a + (b + c))). The
C++ Standard says:

"If an object of type T is located at an address A, a pointer of type cv T*
whose value is the address A is said to point to that object, regardless of
how the value was obtained. [Note: for instance, the address one past the
end of an array (5.7) would be considered to point to an unrelated object of
the array’s element type that might be located at that address. ]"

In my opinion, the note sufficiently clarifies that "ap + 1" above is the same
as "&a[1][0]", and because C++ does not unconditionally forbid dereferencing a
past-the-end pointer, I was of the opinion that it is valid.

The reason I say "+ 2" or "+ (1+1)" is undefined behavior is that the
Standard says one cannot add an integer that takes the pointer further than
one past the end. But once we have reached the past-the-end position, and so
point to an element of another array, we can increment again.

My question was whether such things exist in C too. It has indeed practical
relevance, since if C guarantees it too, then we could go from &a[0][0] to
&a[2][0] without undefined behavior in C too, like we can in C++ using
"ap++" until we hit end.

If C does not provide this - is there a reason for that? Thanks for any
pointers!

Seebs

 12-28-2009
On 2009-12-28, Johannes Schaub (litb) <(E-Mail Removed)> wrote:
> I'm fairly sure this is a fact in C++ that "+ 1 + 1" is different from "+ 2"
> above (notice that the binding is ((a + b) + c) and not (a + (b + c)).

Hmm.

> "If an object of type T is located at an address A, a pointer of type cv T*
> whose value is the address A is said to point to that object, regardless of
> how the value was obtained. [Note: for instance, the address one past the
> end of an array (5.7) would be considered to point to an unrelated object of
> the array's element type that might be located at that address. ]"

If you did:
int *x = &a[0][0];
int *y = x + 1;
int *z = y + 1;
you might be able to argue that the +1s are each being resolved separately, and
you're not just running two past the end of the array. But otherwise, I don't
think I buy it. If this really is what the spec says, I'd guess it's an
unintentional bug.

However, if you want to discuss C++ rules, you'll get more informed
opinions in a C++ newsgroup.

> The reason that i say "+ 2" or "+ (1+1)" is undefined behavior is because it
> says that one can not add an integer so it goes further than past-the-end.
> But once we have hit past-the-end, and point to an element of another array,
> we could increment again.

Interesting.

> My question was whether such things exist in C too. It has indeed practical
> relevance, since if C guarantees it too, then we could go from &a[0][0] to
> &a[2][0] without undefined behavior in C too, like we can in C++ using
> "ap++" until we hit end.

> If C does not provide this - is there a reason for that? Thanks for any
> pointers!

There is absolutely no such guarantee, and there is a very good reason:
Because all such code is fundamentally, deeply, broken.

C does not allow dereferencing outside the bounds of an object. The one
thing you can do is calculate the address one past the end -- but you
can't dereference it. C does not have the rule that, no matter how you
get a pointer, if you have a pointer that compares equal to another
pointer, they're the same -- because this would break bounds checking.

My guess is that C++ has that for some stupid reason pertaining to

But even if it works, you should never, ever, not in a million years, not
under any circumstances, write code depending on this kind of idiocy.

(There is in fact a very good practical reason for this -- one of the
most common changes to see in 2D array code is a shift from a 2D array
to a 1D array of pointers, in which case, iterating off one does NOT
lead you to the next one...)
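
A rough C sketch of that practical point, assuming the rows are reached through a table of pointers: nothing guarantees that one row starts where the previous one ends, so each row is walked on its own. The name `sum_rows` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Rows are reached through a table of pointers, so nothing guarantees
   that row r+1 starts where row r ends. Each row is indexed separately. */
int sum_rows(int *const rows[], size_t nrows, size_t ncols)
{
    assert(rows != NULL);
    int total = 0;
    for (size_t r = 0; r < nrows; ++r)
        for (size_t c = 0; c < ncols; ++c)
            total += rows[r][c];   /* never steps from one row into the next */
    return total;
}
```

The same two-index loop also works on a true 2D array if the parameter type is changed accordingly, which is why it survives the refactoring described above while a flat `ap++` walk does not.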

-s

Kaz Kylheku

 12-28-2009
On 2009-12-28, Johannes Schaub (litb) <(E-Mail Removed)> wrote:
>
> I'm fairly sure this is a fact in C++ that "+ 1 + 1" is different from "+ 2"

Not for the built-in + operator over arithmetic types.

> above (notice that the binding is ((a + b) + c) and not (a + (b + c)). It
> says:

> The reason that i say "+ 2" or "+ (1+1)" is undefined behavior is because it
> says that one can not add an integer so it goes further than past-the-end.
> But once we have hit past-the-end, and point to an element of another array,
> we could increment again.

That isn't true; you are past the end of the object from which you
derived the pointer.
This is undefined behavior.

When you're dealing with multidimensional arrays, the compiler can
generate code which assumes that no bounds are violated.

So for instance if you have some machine instruction with, say, a 12 bit
displacement field, and in the given situation, that bit field is wide
enough to address a dimension of the array, then the compiler can just
blindly generate that instruction, even if your program overflows the 12
bit width.

Johannes Schaub (litb)

 12-28-2009
Kaz Kylheku wrote:

> On 2009-12-28, Johannes Schaub (litb) <(E-Mail Removed)> wrote:
>>
>> I'm fairly sure this is a fact in C++ that "+ 1 + 1" is different from "+
>> 2"

>
> Not for the built-in + operator over arithmetic types.
>
>> above (notice that the binding is ((a + b) + c) and not (a + (b + c)). It
>> says:

>
>> The reason that i say "+ 2" or "+ (1+1)" is undefined behavior is because
>> it says that one can not add an integer so it goes further than
>> past-the-end. But once we have hit past-the-end, and point to an element
>> of another array, we could increment again.

>
> That isn't true; you are past the end of the object from which you
> derived the pointer.
> This is undefined behavior.
>

One-past-the-end is fine. Going past *that* is undefined if you do the
addition in one operation (e.g. +2 instead of +1 + 1). As far as I know,
nothing in the Standard says that it is undefined behavior to do this in
two steps.

I'm just unsure about C. But it seems it's actually not allowed by C.

> When you're dealing with multidimensional arrays, the compiler can
> generate code which assumes that no bounds are violated.
>
> So for instance if you have some machine instruction with, say, a 12 bit
> displacement field, and in the given situation, that bit field is wide
> enough to address a dimension of the array, then the compiler can just
> blindly generate that instruction, even if your program overflows the 12
> bit width.

There are exactly two valid values of object pointers: Either a byte in
memory, or a null pointer value - there is no dedicated past-the-end value.
If you are one past the end of one array that is just prior to another
array, it follows you are also at the first element of the next array.

The compiler could switch the segments if it hits past-the-end and it sees
the addition would overflow the segment. This sounds like a practical reason
for why "+1 + 1" is not UB: It just inserts checks after each addition
whether it hit the end of a segment, and switches, if needed. But if you do
"+2", you "jump over it", and do an overflow right there, with no chance for
the compiler to switch segments.

Johannes Schaub (litb)

 12-28-2009
Seebs wrote:

> On 2009-12-28, Johannes Schaub (litb) <(E-Mail Removed)> wrote:
>> I'm fairly sure this is a fact in C++ that "+ 1 + 1" is different from "+
>> 2" above (notice that the binding is ((a + b) + c) and not (a + (b + c)).

>
> Hmm.
>
>> "If an object of type T is located at an address A, a pointer of type cv
>> T* whose value is the address A is said to point to that object,
>> regardless of how the value was obtained. [Note: for instance, the
>> address one past the end of an array (5.7) would be considered to point
>> to an unrelated object of the array's element type that might be located

>
> If you did:
> int *x = &a[0][0];
> int *y = x + 1;
> int *z = y + 1;
> you might be able to argue that the +1s are each being resolved
> separately, and you're not just running two past the end of the array.
> But otherwise, I don't think I buy it. If this really is what the spec
> says, I'd guess it's an unintentional bug.
>

Hmm, maybe I should ask this question in a C++ group. I thought this was
exactly the intended behavior. But seeing how surprised you all are, I now
have doubts.

Kaz Kylheku

 12-28-2009
On 2009-12-28, Johannes Schaub (litb) <(E-Mail Removed)> wrote:
>> When you're dealing with multidimensional arrays, the compiler can
>> generate code which assumes that no bounds are violated.
>>
>> So for instance if you have some machine instruction with, say, a 12 bit
>> displacement field, and in the given situation, that bit field is wide
>> enough to address a dimension of the array, then the compiler can just
>> blindly generate that instruction, even if your program overflows the 12
>> bit width.

>
> There are exactly two valid values of object pointers: Either a byte in
> memory, or a null pointer value - there is no dedicated past-the-end value.

In C, there is a concept of pointer validity which takes into account
/how/ that pointer was obtained. That information is not necessarily
encoded in the pointer's run time value. (Remember, in C, type
information is not also encoded in run-time values; that doesn't mean
you can violate the type system and still have a well-defined program).

Since the validity of a pointer includes how it was obtained,
merely knowing where that pointer points is not enough of an assurance
of correctness.

Validity is important when it comes to code generation, and
optimization. Code can be generated and optimized based on validity
assumptions (that the program hasn't invoked any undefined behavior).

> If you are one past the end of one array that is just prior to another
> array, it follows you are also at the first element of the next array.

It doesn't follow that you are legally at the first element of the
array.

If a prisoner climbs the fence, it follows that he's physically not in
prison any more, not that he's legally a free man.

C is not assembly language. What is well-defined or not at the language
level is not governed by the object code generated by some compilers.

There isn't just one C language so you have to be careful about what
you mean; when you say that something is well-defined, do you mean
ISO C, or do you mean some dialect accepted by some compilers?

Both concepts of definedness are valid and useful, as is not
confusing one for the other.
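
A small C sketch of the distinction drawn above, offered as one reading rather than a ruling: the two pointers below hold the same address and, as I read the equality rules in C99 6.5.9, compare equal, yet only the one derived from `a[1]` is used for the read. The function name `compare_and_read` and the parameter `value_out` are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Returns nonzero if the one-past-the-end pointer of a[0] compares equal
   to the address of a[1][0]. The element is read only through the pointer
   that was derived from a[1]. */
int compare_and_read(int a[3][1], int *value_out)
{
    assert(a != NULL && value_out != NULL);
    int *past_first = &a[0][0] + 1;  /* one past the end of a[0]: formed and compared only */
    int *second = &a[1][0];          /* derived from a[1]: may be dereferenced */
    *value_out = *second;            /* the read goes through `second`, not `past_first` */
    return past_first == second;
}
```

Swapping `*second` for `*past_first` would read the same bytes on typical implementations, but under the argument made in this thread it would no longer be a read that ISO C defines.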

Johannes Schaub (litb)

 12-28-2009
Kaz Kylheku wrote:

> On 2009-12-28, Johannes Schaub (litb) <(E-Mail Removed)> wrote:
>>> When you're dealing with multidimensional arrays, the compiler can
>>> generate code which assumes that no bounds are violated.
>>>
>>> So for instance if you have some machine instruction with, say, a 12 bit
>>> displacement field, and in the given situation, that bit field is wide
>>> enough to address a dimension of the array, then the compiler can just
>>> blindly generate that instruction, even if your program overflows the 12
>>> bit width.

>>
>> There are exactly two valid values of object pointers: Either a byte in
>> memory, or a null pointer value - there is no dedicated past-the-end
>> value.

>
> In C, there is a concept of pointer validity which takes into account
> /how/ that pointer was obtained. That information is not necessarily
> encoded in the pointer's run time value. (Remember, in C, type
> information is not also encoded in run-time values; that doesn't mean
> you can violate the type system and still have a well-defined program).
>
> Since the validity of a pointer includes how it was obtained,
> merely knowing where that pointer points is not enough of an assurance
> of correctness.
>
> Validity is important when it comes to code generation, and
> optimization. Code can be generated and optimized based on validity
> assumptions (that the program hasn't invoked any undefined behavior).
>

I see now. In C, pointers seem to carry a relationship to the object they
were derived from.

>> If you are one past the end of one array that is just prior to another
>> array, it follows you are also at the first element of the next array.

>
> It doesn't follow that you are legally at the first element of the
> array.
>
> If a prisoner climbs the fence, it follows that he's physically not in
> prison any more, not that he's legally a free man.
>
> C is not assembly language. What is well-defined or not at the language
> level is not governed by the object code generated by some compilers.
>
> There isn't just one C language so you have to be careful about what
> you mean; when you say that something is well-defined, do you mean
> ISO C, or do you mean some dialect accepted by some compilers?
>
> Both concepts of definedness are valid and useful, as is not
> confusing one for the other.

I think this makes sense. I'm talking about ISO C99, so we cannot do this in
C. Thanks for explaining the matter to me; I like the prisoner analogy.

Seebs

 12-28-2009
On 2009-12-28, Johannes Schaub (litb) <(E-Mail Removed)> wrote:
> Hmm, maybe i should ask this question in a C++ group. I thought it's exactly
> intended behavior. But reading how surprised you guys are, i've now doubts

It makes no sense for it to be intentionally specified to work.

The rule about pointers outside the bounds of an object is that you're
allowed to generate a pointer one past the end of an array, for purposes
of comparing it to pointers into the array, or subtracting other addresses
in the array from it to count offsets.
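
The offset-counting use mentioned above might look like the following C sketch. It assumes `p` points into the same array as `base` (or one past its end); the function name `remaining` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements from p up to the end of the n-element array that
   starts at base. */
ptrdiff_t remaining(const int *base, size_t n, const int *p)
{
    const int *end = base + n;     /* one past the end: never dereferenced */
    assert(p >= base && p <= end); /* both comparisons stay within one array, end included */
    return end - p;                /* subtraction of two pointers into the same array */
}
```

With `int v[4];`, `remaining(v, 4, v + 3)` gives 1 and `remaining(v, 4, v + 4)` gives 0, and no element is read in either call.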

In C, where you got the pointer matters.

int ary[3][1] = { 0 };
int *ap_1 = (int *) ary;
int *ap_2 = (int *) ary[0];
int x;
x = ap_1[0]; // clearly well-defined
x = ap_1[1]; // well-defined, reads from ary[1][0]
x = ap_2[0]; // clearly well-defined
x = ap_2[1]; // undefined, tries to read from ary[0][1]

In short, even though ap_1 and ap_2 are the same address in memory, the
compiler is allowed to note that one of them is the address of a block of
three arrays of single integers, thus, an object of size 3*sizeof(int),
and the other is the address of an array of one integer, thus, an object
of size 1*sizeof(int).

If you want a pointer to the whole object, don't derive it from one of
the members.
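
One conservative way to visit every int without relying on either reading of the rules is to step through the rows with a pointer to a whole row, and only walk within a row using a pointer derived from that row. This is a sketch, not the only option; the names `ROWS`, `COLS` and `sum_all` are made up here.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 3, COLS = 1 };   /* dimensions of the example array; hypothetical names */

int sum_all(int a[ROWS][COLS])
{
    assert(a != NULL);
    int total = 0;
    int (*row_end)[COLS] = a + ROWS;          /* one past the last row */
    for (int (*row)[COLS] = a; row != row_end; ++row) {
        int *col_end = *row + COLS;           /* one past the end of this row only */
        for (int *p = *row; p != col_end; ++p)
            total += *p;                      /* p never leaves the current row */
    }
    return total;
}
```

Each inner pointer is derived from its own row and stops at that row's end, so the question of stepping from `a[0]` into `a[1]` never comes up.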

-s