Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C++ > Synchronization Algorithm Verificator for C++0x

Reply
Thread Tools

Synchronization Algorithm Verificator for C++0x

 
 
Peter Dimov
Guest
Posts: n/a
 
      08-06-2008
On Aug 6, 5:31*am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> Also. Following naive approach doesn't work (at least in C++0x):
> * * * * void signal()
> * * * * {
> * * * * * * * * std::atomic_thread_fence(std::memory_order_seq_cst );
> * * * * * * * * signal_relaxed();
> * * * * }


I'm not sure that I understand the algorithm here, but the reason this
doesn't introduce a synchronize-with relationship is that the fence
needs to follow the load, not precede it.

load.relaxed count
fence.seq_cst

does synchronize with

store.seq_cst count, v

when the load sees the value v written by the store.

If you do need a leading seq_cst fence, why not just use a seq_cst
load?
 
Reply With Quote
 
 
 
 
Dmitriy V'jukov
Guest
Posts: n/a
 
      08-06-2008
On Aug 6, 5:08 pm, Peter Dimov <pdi...@gmail.com> wrote:
> On Aug 6, 5:31 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>
> > Also. Following naive approach doesn't work (at least in C++0x):
> > void signal()
> > {
> > std::atomic_thread_fence(std::memory_order_seq_cst );
> > signal_relaxed();
> > }

>
> I'm not sure that I understand the algorithm here, but the reason this
> doesn't introduce a synchronize-with relationship is that the fence
> needs to follow the load, not precede it.


I will try to describe the issue in more detail in separate post in
near future.

> load.relaxed count
> fence.seq_cst
>
> does synchronize with
>
> store.seq_cst count, v
>
> when the load sees the value v written by the store.
>
> If you do need a leading seq_cst fence, why not just use a seq_cst
> load?


Because this seq_cst fence must introduce synchronizes-with not only
from second thread to this thread, but also possibly it must introduce
synchronizes-with from this thread to second thread. I.e. OR first
thread must synchronize-with second thread OR second thread must
synchronize-with first thread (depending on which thread's seq_cst
operations is first in global order).
If I use seq_cst load, then it only possible that second thread
synchronizes-with this thread, but not vice versa, because seq_cst
load is only acquire, it's NOT release.

Dmitriy V'jukov
 
Reply With Quote
 
 
 
 
Dmitriy V'jukov
Guest
Posts: n/a
 
      08-06-2008
On Aug 6, 5:42 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> On Aug 6, 5:08 pm, Peter Dimov <pdi...@gmail.com> wrote:
>
> > On Aug 6, 5:31 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

>
> > > Also. Following naive approach doesn't work (at least in C++0x):
> > > void signal()
> > > {
> > > std::atomic_thread_fence(std::memory_order_seq_cst );
> > > signal_relaxed();
> > > }

>
> > I'm not sure that I understand the algorithm here, but the reason this
> > doesn't introduce a synchronize-with relationship is that the fence
> > needs to follow the load, not precede it.

>
> I will try to describe the issue in more detail in separate post in
> near future.


Here:
http://groups.google.com/group/comp....513cf616dc168c

Dmitriy V'jukov
 
Reply With Quote
 
Chris M. Thomasson
Guest
Posts: n/a
 
      08-07-2008
"Dmitriy V'jukov" <> wrote in message
news:29a4e368-c068-460a-86a6-...
> On Aug 6, 12:23 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
>
>> Have you testing the following algorithm yet?
>>
>> http://research.sun.com/scalable/pub...rkstealing.pdf

>
> Yes and no. No, I didn't test this particular algorithm. But I test my
> analogous algorithm for work-stealing deque.
> It's extremely hardcore lock-free algorithm with 500+ LOC. And it was
> hangup sometimes, and sometimes cause SIGSEGV. I was trying to debug
> it manually, I was trying to insert some asserts, I was trying to
> insert some "yields", I was trying to review it several times, I was
> trying to analyze dumps. Uselessly.
> When I run it under Relacy Race Detector I localize and fix 2 subtle
> algo errors and clarify some moments with memory fences it about 3
> hours. And it was almost mechanical work, not some black magic.


Did you develop Relacy in part so that you can totally "deconstruct" your
work-stealing algorithm into a form in which its "random" run-time behavior
could be studied at another level and therefore be _reliably_ debugged?

;^D

 
Reply With Quote
 
Chris M. Thomasson
Guest
Posts: n/a
 
      08-07-2008
"Peter Dimov" <> wrote in message
news:8cf29aaf-916b-40dc-89bf-...
On Aug 6, 5:31 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> > Also. Following naive approach doesn't work (at least in C++0x):
> > void signal()
> > {
> > std::atomic_thread_fence(std::memory_order_seq_cst );
> > signal_relaxed();
> > }


> I'm not sure that I understand the algorithm here, but the reason this
> doesn't introduce a synchronize-with relationship is that the fence
> needs to follow the load, not precede it.


> load.relaxed count
> fence.seq_cst


> does synchronize with


> store.seq_cst count, v


> when the load sees the value v written by the store.


> If you do need a leading seq_cst fence, why not just use a seq_cst
> load?


That can be done as long as the full fence is executed before the load of
the eventcount variable. BTW Peter, here is a very simple pseudo-code impl
of my algorithm:

http://groups.google.com/group/comp....8c62ad06dbb380

Here is where Joe Seigh points out the membars:

http://groups.google.com/group/comp....7a0379b55557c0


AFAICT, the release barrier in a user algorithm is in the wrong place. The
load cannot be allowed to be hoisted up above the production of any item in
a user container class. Basically, the production of an item into a user
container usually only requires a release barrier. Its in the wrong place
wrt coupling the algorithm with my eventcount. A StoreLoad needs to be there
in order to allow the production to be completely visible BEFORE the load
from the eventcount is performed... Where am I going wrong?

Read here for more info:

http://groups.google.com/group/comp....cf4a952ff5af7d

Humm...

 
Reply With Quote
 
Dmitriy V'jukov
Guest
Posts: n/a
 
      08-07-2008
On Aug 7, 3:09 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> That can be done as long as the full fence is executed before the load of
> the eventcount variable. BTW Peter, here is a very simple pseudo-code impl
> of my algorithm:
>
> http://groups.google.com/group/comp..../browse_frm/th...
>
> Here is where Joe Seigh points out the membars:
>
> http://groups.google.com/group/comp..../msg/8e7a0379b...
>
> AFAICT, the release barrier in a user algorithm is in the wrong place. The
> load cannot be allowed to be hoisted up above the production of any item in
> a user container class. Basically, the production of an item into a user
> container usually only requires a release barrier. Its in the wrong place
> wrt coupling the algorithm with my eventcount. A StoreLoad needs to be there
> in order to allow the production to be completely visible BEFORE the load
> from the eventcount is performed... Where am I going wrong?



The problem with C++0x in this context is the following.
Assume container is lock-free stack. And 'production' is implemented
as seq_cst atomic RMW on 'head' variable. In this situation you can
reasonably presume that you can use 'relaxed signaling', i.e. relaxed
load of eventcount w/o any additional fences, because seq_cst atomic
RMW on 'head' variable already contains strong enough fence in right
place. WRONG! Why? See here:
http://groups.google.com/group/comp....513cf616dc168c


Dmitriy V'jukov
 
Reply With Quote
 
Peter Dimov
Guest
Posts: n/a
 
      08-07-2008
On Aug 7, 2:09*pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> AFAICT, the release barrier in a user algorithm is in the wrong place. The
> load cannot be allowed to be hoisted up above the production of any item in
> a user container class. Basically, the production of an item into a user
> container usually only requires a release barrier. Its in the wrong place
> wrt coupling the algorithm with my eventcount. A StoreLoad needs to be there
> in order to allow the production to be completely visible BEFORE the load
> from the eventcount is performed... Where am I going wrong?
>
> Read here for more info:
>
> http://groups.google.com/group/comp..../browse_frm/th...
>
> Humm...


Hmmm.

In terms of the C++MM, the options I see are:

1. store.seq_cst in push, load.seq_cst in signal, cas.seq_cst in get
2. fence.seq_cst+load.relaxed in signal, cas.relaxed+fence.seq_cst in
get
3. RMW.seq_cst in signal, cas.seq_cst in get

and maybe

4. RMW.acq_rel in signal, cas.acq_rel in get
 
Reply With Quote
 
Chris M. Thomasson
Guest
Posts: n/a
 
      08-07-2008
"Peter Dimov" <> wrote in message
news:791c816b-dce9-4325-a231-...
On Aug 7, 2:09 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > AFAICT, the release barrier in a user algorithm is in the wrong place.
> > The
> > load cannot be allowed to be hoisted up above the production of any item
> > in
> > a user container class. Basically, the production of an item into a user
> > container usually only requires a release barrier. Its in the wrong
> > place
> > wrt coupling the algorithm with my eventcount. A StoreLoad needs to be
> > there
> > in order to allow the production to be completely visible BEFORE the
> > load
> > from the eventcount is performed... Where am I going wrong?
> >
> > Read here for more info:
> >
> > http://groups.google.com/group/comp..../browse_frm/th...
> >
> > Humm...


> Hmmm.


> In terms of the C++MM, the options I see are:


> 1. store.seq_cst in push, load.seq_cst in signal, cas.seq_cst in get
> 2. fence.seq_cst+load.relaxed in signal, cas.relaxed+fence.seq_cst in
> get
> 3. RMW.seq_cst in signal, cas.seq_cst in get


> and maybe


> 4. RMW.acq_rel in signal, cas.acq_rel in get


For now, I choose number 2 because in the current eventcount algorithm
as-is, the `signal()' procedure needs a preceding StoreLoad, and the `get()'
function needs a trailing StoreLoad. AFAICT, this is fine... As for the
other choices, well, I don't like number 1 because it moves memory
visibility rules directly into an end-user algorithm. I dislike 3-4 because
of the dummy RMW; for instance, I could use a `seq_cst' fetch-and-add on the
eventcount state with an addend of zero. However, IMVHO, that looks like
fairly dirty hack indeed... I mean, IMVHO, a RMW which does not perform a
mutation is basically pointless...

 
Reply With Quote
 
Dmitriy V'jukov
Guest
Posts: n/a
 
      08-07-2008
On Aug 7, 7:13 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > In terms of the C++MM, the options I see are:
> > 1. store.seq_cst in push, load.seq_cst in signal, cas.seq_cst in get
> > 2. fence.seq_cst+load.relaxed in signal, cas.relaxed+fence.seq_cst in
> > get
> > 3. RMW.seq_cst in signal, cas.seq_cst in get
> > and maybe
> > 4. RMW.acq_rel in signal, cas.acq_rel in get

>
> For now, I choose number 2 because in the current eventcount algorithm
> as-is, the `signal()' procedure needs a preceding StoreLoad, and the `get()'
> function needs a trailing StoreLoad. AFAICT, this is fine... As for the
> other choices, well, I don't like number 1 because it moves memory
> visibility rules directly into an end-user algorithm. I dislike 3-4 because
> of the dummy RMW; for instance, I could use a `seq_cst' fetch-and-add on the
> eventcount state with an addend of zero. However, IMVHO, that looks like
> fairly dirty hack indeed... I mean, IMVHO, a RMW which does not perform a
> mutation is basically pointless...



Consider implementation of number 2 on x86.
Producer:
1. Interlocked RMW to enqueue element
2. mfence in signal (even more expensive than Interlocked RMW, and
absolutely useless)

Consumer:
3. Interlocked RMW in get
4. mfence in get (even more expensive than Interlocked RMW, and
absolutely useless)

Maybe compiler can optimize 4 away, because 3 and 4 will be in one
function. But I am not sure that first versions of C++0x compilers
will be so smart. And it's questionable that compiler will be able to
optimize 2 away, because 1 and 2 will be in different functions and
probably in different source files.

Do you like this? Don't hasten forgetting your asm skills

Dmitriy V'jukov
 
Reply With Quote
 
Peter Dimov
Guest
Posts: n/a
 
      08-07-2008
On Aug 7, 6:31*pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:

> Consider implementation of number 2 on x86.
> Producer:
> 1. Interlocked RMW to enqueue element
> 2. mfence in signal (even more expensive than Interlocked RMW, and
> absolutely useless)


It might be interesting to time the actual mfence overhead when the
preceding locked RMW instruction has already drained the store queue.
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are Off
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Synchronization john VHDL 7 04-13-2009 05:18 PM
Filtered Back Projection Algorithm (FBP Algorithm) Bapaiah Katepalli VHDL 1 06-23-2006 04:50 PM
synchronization Firefox 0 04-12-2005 03:44 PM
synchronization Thomas Jollans Java 14 12-22-2003 03:18 PM
Key generation algorithm and Cipher algorithm Ahmed Moustafa Java 0 11-15-2003 05:35 AM



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57