Velocity Reviews

Velocity Reviews (http://www.velocityreviews.com/forums/index.php)
-   C++ (http://www.velocityreviews.com/forums/f39-c.html)
-   -   C++0x: release sequence (http://www.velocityreviews.com/forums/t620349-c-0x-release-sequence.html)

Dmitriy V'jukov 06-15-2008 08:17 PM

C++0x: release sequence
 
Current C++0x draft (N2606):
1.10/6
A release sequence on an atomic object M is a maximal contiguous sub-
sequence of side effects in the modification
order of M, where the first operation is a release, and every
subsequent operation
— is performed by the same thread that performed the release, or
— is a non-relaxed atomic read-modify-write operation.

Why in second clause there is *non-relaxed* atomic read-modify-write
operation? Why non-relaxed?
On what hardware architecture relaxed atomic read-modify-write
operations in release sequence will interfere with efficient
implementation?

Dmitriy V'jukov

Anthony Williams 06-16-2008 10:13 AM

Re: C++0x: release sequence
 
"Dmitriy V'jukov" <dvyukov@gmail.com> writes:

> Current C++0x draft (N2606):
> 1.10/6
> A release sequence on an atomic object M is a maximal contiguous sub-
> sequence of side effects in the modification
> order of M, where the first operation is a release, and every
> subsequent operation
> — is performed by the same thread that performed the release, or
> — is a non-relaxed atomic read-modify-write operation.
>
> Why in second clause there is *non-relaxed* atomic read-modify-write
> operation? Why non-relaxed?
> On what hardware architecture relaxed atomic read-modify-write
> operations in release sequence will interfere with efficient
> implementation?


Relaxed operations can read values from other threads
out-of-order. Consider the following:

atomic_int x=0;
atomic_int y=0;

Processor 1 does store-release:

A: x.store(1,memory_order_relaxed)
B: y.store(1,memory_order_release)

Processor 2 does relaxed RMW op:
int expected=1;
C: while(!y.compare_swap(expected,2,memory_order_rela xed));

Processor 3 does load-acquire:
D: a=y.load(memory_order_acquire);
E: b=x.load(memory_order_relaxed);

If a is 2, what is b?

On most common systems (e.g. x86, PowerPC, Sparc), b will be 1. This
is not guaranteed by the standard though, since this may not be
guaranteed by NUMA systems.

Anthony
--
Anthony Williams | Just Software Solutions Ltd
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

Dmitriy V'jukov 06-16-2008 10:40 AM

Re: C++0x: release sequence
 
On Jun 16, 2:13 pm, Anthony Williams <anthony....@gmail.com> wrote:

> Relaxed operations can read values from other threads
> out-of-order. Consider the following:
>
> atomic_int x=0;
> atomic_int y=0;
>
> Processor 1 does store-release:
>
> A: x.store(1,memory_order_relaxed)
> B: y.store(1,memory_order_release)
>
> Processor 2 does relaxed RMW op:
> int expected=1;
> C: while(!y.compare_swap(expected,2,memory_order_rela xed));
>
> Processor 3 does load-acquire:
> D: a=y.load(memory_order_acquire);
> E: b=x.load(memory_order_relaxed);
>
> If a is 2, what is b?
>
> On most common systems (e.g. x86, PowerPC, Sparc), b will be 1. This
> is not guaranteed by the standard though, since this may not be
> guaranteed by NUMA systems.



The problem is that it prohibits usage of relaxed fetch_add in acquire
operation in reference counting with basic thread-safety:

struct rc_t
{
std::atomic<int> rc;
};

void acquire(rc_t* obj)
{
obj->rc.fetch_add(1, std::memory_order_relaxed);
}

This implementation can lead to data races in some usage patterns
according to C++0x. Is it intended?

Dmitriy V'jukov

Anthony Williams 06-16-2008 11:09 AM

Re: C++0x: release sequence
 
"Dmitriy V'jukov" <dvyukov@gmail.com> writes:

> On Jun 16, 2:13 pm, Anthony Williams <anthony....@gmail.com> wrote:
>
>> Relaxed operations can read values from other threads
>> out-of-order. Consider the following:
>>
>> atomic_int x=0;
>> atomic_int y=0;
>>
>> Processor 1 does store-release:
>>
>> A: x.store(1,memory_order_relaxed)
>> B: y.store(1,memory_order_release)
>>
>> Processor 2 does relaxed RMW op:
>> int expected=1;
>> C: while(!y.compare_swap(expected,2,memory_order_rela xed));
>>
>> Processor 3 does load-acquire:
>> D: a=y.load(memory_order_acquire);
>> E: b=x.load(memory_order_relaxed);
>>
>> If a is 2, what is b?
>>
>> On most common systems (e.g. x86, PowerPC, Sparc), b will be 1. This
>> is not guaranteed by the standard though, since this may not be
>> guaranteed by NUMA systems.

>
>
> The problem is that it prohibits usage of relaxed fetch_add in acquire
> operation in reference counting with basic thread-safety:
>
> struct rc_t
> {
> std::atomic<int> rc;
> };
>
> void acquire(rc_t* obj)
> {
> obj->rc.fetch_add(1, std::memory_order_relaxed);
> }
>
> This implementation can lead to data races in some usage patterns
> according to C++0x. Is it intended?


I'm fairly sure it was intentional. If you don't want data races,
specify a non-relaxed ordering: I'd guess that
std::memory_order_acquire would be good for your example. If you don't
want the sync on the fetch_add, you could use a fence. Note that the
use of fences in the C++0x WP has changed this week from
object-specific fences to global fences. See Peter Dimov's paper
N2633:
http://www.open-std.org/JTC1/SC22/WG...008/n2633.html

Relaxed ordering is intended to be minimal overhead on all systems, so
it provides no ordering guarantees. On systems that always provide the
ordering guarantees, putting memory_order_acquire on the fetch_add is
probably minimal overhead. On systems that truly exhibit relaxed
ordering, requiring that the relaxed fetch_add participate in the
release sequence could add considerable overhead.

Consider my example above on a distributed system where the processors
are conceptually "a long way" apart, and data synchronization is
explicit.

With the current WP, processor 2 only needs to synchronize access to
y. If the relaxed op featured in the release sequence, it would need
to also handle the synchronization data for x, so that processor 3 got
the "right" values for x and y.

Anthony
--
Anthony Williams | Just Software Solutions Ltd
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

Dmitriy V'jukov 06-16-2008 04:25 PM

Re: C++0x: release sequence
 
On Jun 16, 3:09 pm, Anthony Williams <anthony....@gmail.com> wrote:

> Relaxed ordering is intended to be minimal overhead on all systems, so
> it provides no ordering guarantees. On systems that always provide the
> ordering guarantees, putting memory_order_acquire on the fetch_add is
> probably minimal overhead. On systems that truly exhibit relaxed
> ordering, requiring that the relaxed fetch_add participate in the
> release sequence could add considerable overhead.
>
> Consider my example above on a distributed system where the processors
> are conceptually "a long way" apart, and data synchronization is
> explicit.
>
> With the current WP, processor 2 only needs to synchronize access to
> y. If the relaxed op featured in the release sequence, it would need
> to also handle the synchronization data for x, so that processor 3 got
> the "right" values for x and y.



In your example, yes, one have to use non-relaxed rmw. But consider
following example:

struct object
{
std::atomic<int> rc;
int data;

void acquire()
{
rc.fetch_add(1, std::memory_order_relaxed);
}

void release()
{
if (1 == rc.fetch_sub(1, std::memory_order_release)
{
std::atomic_fence(std::memory_order_acquire);
data = 0;
delete this;
}
}
};

object* g_obj;

void thread1();
void thread2();
void thread3();

int main()
{
g_obj = new object;
g_obj->data = 1;
g_obj->rc = 3;

thread th1 = start_thread(&thread1);
thread th2 = start_thread(&thread2);
thread th3 = start_thread(&thread3);

join_thread(th1);
join_thread(th2);
join_thread(th3);
}

void thread1()
{
volatile int data = g_obj->data;
g_obj->release(); // T1-1
}

void thread2()
{
g_obj->acquire(); // T2-1
g_obj->release(); // T2-2
g_obj->release(); // T2-3
}

void thread3()
{
g_obj->release(); // T3-1
}

From point of view of current C++0x draft this code contains race on
g_obj->data. But I think this code is perfectly legal from hardware
point of view.

Consider following order of execution:
T1-1
T2-1 - here release sequence is broken, because of relaxed rmw
T2-2 - but here release sequence is effectively "resurrected from
dead", because thread, which executed relaxed rmw, now execute non-
relaxed rmw
T2-3
T3-1

So I think that T1-1 must 'synchronize-with' T3-1.

Formal definition is something like this:

A release sequence on an atomic object M is a maximal contiguous sub-
sequence of side effects in the modification order of M, where the
first operation is a release, and every subsequent operation
— is performed by the same thread that performed the release, or
— is a non-relaxed atomic read-modify-write operation.
— is a *relaxed* atomic read-modify-write operation.

Loaded release sequence on an atomic object M wrt evaluation A is part
of release sequence starting from the beginning and up to (inclusive)
value loaded by evaluation A.

An evaluation A that performs a release operation on an object M
synchronizes with an evaluation B that performs an acquire operation
on M and reads a value written by any side effect in the release
sequence headed by A, *if* for every relaxed rmw operation in loaded
release sequence there is subsequent non-relaxed rmw operation in
loaded release sequence executed by the same thread.

More precisely: *if* for every relaxed rmw operation (executed not by
thread which execute release)...

I'm trying to make definitions more "permissive", thus making more
correct (from hardware point of view) usage patterns legal (from C++0x
point of view).

What do you think?

Dmitriy V'jukov

Dmitriy V'jukov 06-16-2008 04:48 PM

Re: C++0x: release sequence
 
On Jun 16, 3:09 pm, Anthony Williams <anthony....@gmail.com> wrote:

> Note that the
> use of fences in the C++0x WP has changed this week from
> object-specific fences to global fences. See Peter Dimov's paper
> N2633:http://www.open-std.org/JTC1/SC22/WG...008/n2633.html



Yes, I've already read this. It's just GREAT! It's far more useful and
intuitive.
And it contains clear and simple binding to memory model, i.e.
relations between acquire/release fences; and between acquire/release
fences and acquire/release operations.

Is it already generally approved by memory model working group?
For atomic_fence() I'm not worry :) But what about complier_fence()?

Btw, I see some problems in Peter Dimov's proposal.
First, it's possible to write:

x.store(1, std::memory_order_relaxed);
std::atomic_compiler_fence(std::memory_order_relea se);
y.store(1, std::memory_order_relaxed);

But it's not possible to write:

x.store(1, std::memory_order_relaxed);
y.store(1, std::memory_order_relaxed_but_compiler_order_relea se);
// or just y.store(1, std::compiler_order_release);

I.e. it's not possible to use complier ordering, when using acquire/
release operations. It's a bit inconsistent, especially taking into
account that acquire/release operations are primary and standalone
bidirectional fences are supplementary.

Second, more important moment. It's possible to write:

//thread 1:
data = 1;
std::atomic_memory_fence(std::memory_order_release );
x.store(1, std::memory_order_relaxed);

//thread 2:
if (x.load(std::memory_order_acquire))
assert(1 == data);

But it's not possible to write:

//thread 1:
data = 1;
z.store(1, std::memory_order_release);
x.store(1, std::memory_order_relaxed);

//thread 2:
if (x.load(std::memory_order_acquire))
assert(1 == data);

From point of view of Peter Dimov's proposal, this core contains race
on 'data'.

I think there must be following statements:

- release operation *is a* release fence
- acquire operation *is a* acquire fence

So this:
z.store(1, std::memory_order_release);
basically transforms to:
std::atomic_memory_fence(std::memory_order_release );
z.store(1, std::memory_order_release);

Then second example will be legal. What do you think?

Dmitriy V'jukov

Anthony Williams 06-16-2008 07:36 PM

Re: C++0x: release sequence
 
"Dmitriy V'jukov" <dvyukov@gmail.com> writes:

> On Jun 16, 3:09 pm, Anthony Williams <anthony....@gmail.com> wrote:
>
>> Relaxed ordering is intended to be minimal overhead on all systems, so
>> it provides no ordering guarantees. On systems that always provide the
>> ordering guarantees, putting memory_order_acquire on the fetch_add is
>> probably minimal overhead. On systems that truly exhibit relaxed
>> ordering, requiring that the relaxed fetch_add participate in the
>> release sequence could add considerable overhead.
>>
>> Consider my example above on a distributed system where the processors
>> are conceptually "a long way" apart, and data synchronization is
>> explicit.
>>
>> With the current WP, processor 2 only needs to synchronize access to
>> y. If the relaxed op featured in the release sequence, it would need
>> to also handle the synchronization data for x, so that processor 3 got
>> the "right" values for x and y.

>
>
> In your example, yes, one have to use non-relaxed rmw. But consider
> following example:
>
> struct object
> {
> std::atomic<int> rc;
> int data;
>
> void acquire()
> {
> rc.fetch_add(1, std::memory_order_relaxed);
> }
>
> void release()
> {
> if (1 == rc.fetch_sub(1, std::memory_order_release)
> {
> std::atomic_fence(std::memory_order_acquire);
> data = 0;
> delete this;
> }
> }
> };
>
> object* g_obj;
>
> void thread1();
> void thread2();
> void thread3();
>
> int main()
> {
> g_obj = new object;
> g_obj->data = 1;
> g_obj->rc = 3;
>
> thread th1 = start_thread(&thread1);
> thread th2 = start_thread(&thread2);
> thread th3 = start_thread(&thread3);
>
> join_thread(th1);
> join_thread(th2);
> join_thread(th3);
> }
>
> void thread1()
> {
> volatile int data = g_obj->data;
> g_obj->release(); // T1-1
> }
>
> void thread2()
> {
> g_obj->acquire(); // T2-1
> g_obj->release(); // T2-2
> g_obj->release(); // T2-3
> }
>
> void thread3()
> {
> g_obj->release(); // T3-1
> }
>
> From point of view of current C++0x draft this code contains race on
> g_obj->data. But I think this code is perfectly legal from hardware
> point of view.


I guess it depends on your hardware. The relaxed fetch_add says "I
don't care about ordering", yet your code blatantly does care about
the ordering. I can't help thinking it should be
fetch_add(1,memory_order_acquire).

> Consider following order of execution:
> T1-1
> T2-1 - here release sequence is broken, because of relaxed rmw
> T2-2 - but here release sequence is effectively "resurrected from
> dead", because thread, which executed relaxed rmw, now execute non-
> relaxed rmw
> T2-3
> T3-1


And there's the rub: I don't think this is sensible. You explicitly
broke the release sequence with the relaxed fetch_add, so you can't
resurrect it.

T2-1 is not ordered wrt the read of g_obj->data in thread1. If it
needs to be ordered, it should say so.

Anthony
--
Anthony Williams | Just Software Solutions Ltd
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

Anthony Williams 06-16-2008 07:47 PM

Re: C++0x: release sequence
 
"Dmitriy V'jukov" <dvyukov@gmail.com> writes:

> On Jun 16, 3:09 pm, Anthony Williams <anthony....@gmail.com> wrote:
>
>> Note that the
>> use of fences in the C++0x WP has changed this week from
>> object-specific fences to global fences. See Peter Dimov's paper
>> N2633:http://www.open-std.org/JTC1/SC22/WG...008/n2633.html

>
>
> Yes, I've already read this. It's just GREAT! It's far more useful and
> intuitive.
> And it contains clear and simple binding to memory model, i.e.
> relations between acquire/release fences; and between acquire/release
> fences and acquire/release operations.
>
> Is it already generally approved by memory model working group?
> For atomic_fence() I'm not worry :) But what about complier_fence()?


Yes. It's been approved to be applied to the WP with minor renamings
(atomic_memory_fence -> atomic_thread_fence, atomic_compiler_fence ->
atomic_signal_fence)

> Btw, I see some problems in Peter Dimov's proposal.
> First, it's possible to write:
>
> x.store(1, std::memory_order_relaxed);
> std::atomic_compiler_fence(std::memory_order_relea se);
> y.store(1, std::memory_order_relaxed);
>
> But it's not possible to write:
>
> x.store(1, std::memory_order_relaxed);
> y.store(1, std::memory_order_relaxed_but_compiler_order_relea se);
> // or just y.store(1, std::compiler_order_release);
>
> I.e. it's not possible to use complier ordering, when using acquire/
> release operations. It's a bit inconsistent, especially taking into
> account that acquire/release operations are primary and standalone
> bidirectional fences are supplementary.


You're right that you can't do this. I don't think it's a problem as
compiler orderings are not really the same as the inter-thread
orderings.

> Second, more important moment. It's possible to write:
>
> //thread 1:
> data = 1;
> std::atomic_memory_fence(std::memory_order_release );
> x.store(1, std::memory_order_relaxed);
>
> //thread 2:
> if (x.load(std::memory_order_acquire))
> assert(1 == data);
>
> But it's not possible to write:
>
> //thread 1:
> data = 1;
> z.store(1, std::memory_order_release);
> x.store(1, std::memory_order_relaxed);
>
> //thread 2:
> if (x.load(std::memory_order_acquire))
> assert(1 == data);
>
> From point of view of Peter Dimov's proposal, this core contains race
> on 'data'.


Yes. Fences are global, whereas ordering on individual objects is
specific. The fence version is equivalent to:

// thread 1
data=1
x.store(1,std::memory_order_release);

> I think there must be following statements:
>
> - release operation *is a* release fence
> - acquire operation *is a* acquire fence
>
> So this:
> z.store(1, std::memory_order_release);
> basically transforms to:
> std::atomic_memory_fence(std::memory_order_release );
> z.store(1, std::memory_order_release);
>
> Then second example will be legal. What do you think?


I think that compromises the model, because it makes release
operations contagious. The fence transformation is precisely the
reverse of this, which I think is correct.

Anthony
--
Anthony Williams | Just Software Solutions Ltd
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

Dmitriy V'jukov 06-16-2008 07:52 PM

Re: C++0x: release sequence
 
On Jun 16, 11:36 pm, Anthony Williams <anthony....@gmail.com> wrote:

> > From point of view of current C++0x draft this code contains race on
> > g_obj->data. But I think this code is perfectly legal from hardware
> > point of view.

>
> I guess it depends on your hardware. The relaxed fetch_add says "I
> don't care about ordering", yet your code blatantly does care about
> the ordering. I can't help thinking it should be
> fetch_add(1,memory_order_acquire).


Ok. Let's put it this way. I change your initial example a bit:

atomic_int x=0;
atomic_int y=0;

Processor 1 does store-release:

A: x.store(1,memory_order_relaxed)
B: y.store(1,memory_order_release)

Processor 2 does relaxed RMW op and then non-relaxed RMW op:
int expected=1;
C: while(!y.compare_swap(expected,2,memory_order_rela xed));
D: y.fetch_add(1,memory_order_acq_rel));

Processor 3 does load-acquire:
E: a=y.load(memory_order_acquire);
F: b=x.load(memory_order_relaxed);
if (3 == a) assert(1 == b);

If a is 3, what is b?

I believe that b==1, and here is no race on x.

And in this particular example following lines:
C: while(!y.compare_swap(expected,2,memory_order_rela xed));
D: y.fetch_add(1,memory_order_acq_rel));
synchronize memory exactly like this single line:
C: while(!y.compare_swap(expected,3,memory_order_acq_ rel));

Or I am wrong even here?

Dmitriy V'jukov

Dmitriy V'jukov 06-16-2008 08:15 PM

Re: C++0x: release sequence
 
On Jun 16, 11:47 pm, Anthony Williams <anthony....@gmail.com> wrote:

> > Yes, I've already read this. It's just GREAT! It's far more useful and
> > intuitive.
> > And it contains clear and simple binding to memory model, i.e.
> > relations between acquire/release fences; and between acquire/release
> > fences and acquire/release operations.

>
> > Is it already generally approved by memory model working group?
> > For atomic_fence() I'm not worry :) But what about complier_fence()?

>
> Yes. It's been approved to be applied to the WP with minor renamings
> (atomic_memory_fence -> atomic_thread_fence, atomic_compiler_fence ->
> atomic_signal_fence)


COOL!

Looking forward to next draft. Btw, what about dependent memory
ordering (memory_order_consume)? Is it going to be accepted?




> > Btw, I see some problems in Peter Dimov's proposal.
> > First, it's possible to write:

>
> > x.store(1, std::memory_order_relaxed);
> > std::atomic_compiler_fence(std::memory_order_relea se);
> > y.store(1, std::memory_order_relaxed);

>
> > But it's not possible to write:

>
> > x.store(1, std::memory_order_relaxed);
> > y.store(1, std::memory_order_relaxed_but_compiler_order_relea se);
> > // or just y.store(1, std::compiler_order_release);

>
> > I.e. it's not possible to use complier ordering, when using acquire/
> > release operations. It's a bit inconsistent, especially taking into
> > account that acquire/release operations are primary and standalone
> > bidirectional fences are supplementary.

>
> You're right that you can't do this. I don't think it's a problem as
> compiler orderings are not really the same as the inter-thread
> orderings.


Yes, but why I can do and inter-thread orderings and compiler
orderings with stand-alone fences, and can do only inter-thread
orderings with operations? Why stand-alone fences are more 'powerful'?


> > Second, more important moment. It's possible to write:

>
> > //thread 1:
> > data = 1;
> > std::atomic_memory_fence(std::memory_order_release );
> > x.store(1, std::memory_order_relaxed);

>
> > //thread 2:
> > if (x.load(std::memory_order_acquire))
> > assert(1 == data);

>
> > But it's not possible to write:

>
> > //thread 1:
> > data = 1;
> > z.store(1, std::memory_order_release);
> > x.store(1, std::memory_order_relaxed);

>
> > //thread 2:
> > if (x.load(std::memory_order_acquire))
> > assert(1 == data);

>
> > From point of view of Peter Dimov's proposal, this core contains race
> > on 'data'.

>
> Yes. Fences are global, whereas ordering on individual objects is
> specific.


Hmmm... need to think some more on this...


> The fence version is equivalent to:
>
> // thread 1
> data=1
> x.store(1,std::memory_order_release);
>
> > I think there must be following statements:

>
> > - release operation *is a* release fence
> > - acquire operation *is a* acquire fence

>
> > So this:
> > z.store(1, std::memory_order_release);
> > basically transforms to:
> > std::atomic_memory_fence(std::memory_order_release );
> > z.store(1, std::memory_order_release);

>
> > Then second example will be legal. What do you think?

>
> I think that compromises the model, because it makes release
> operations contagious...


.... and this will interfere with efficient implementation on some
hardware. Right? Or there are some 'logical' reasons for this (why you
don't want to make release operations contagious)?

Dmitriy V'jukov


All times are GMT. The time now is 05:06 AM.

Powered by vBulletin®. Copyright ©2000 - 2013, vBulletin Solutions, Inc.
SEO by vBSEO ©2010, Crawlability, Inc.


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57