Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C++ > single producer, single consumer

Reply
Thread Tools

single producer, single consumer

 
 
goodfella
Guest
Posts: n/a
 
      10-06-2009
Here is some code I came up with for a lock free single producer
single consumer algorithm. I have done some tests using GCC with -O3
optimizations and all seems well. I would like some constructive
criticism of the code as I am not yet an expert on concurrency.

I'm assuming that reading and writing the value of a pointer is
atomic. Although, I have read on some architectures that is a bad
assumption, so to insure correctness across all platforms, I may have
to wait for C++ atomic types. Any suggestions and criticisms are
welcomed thanks. The code is below.


bool push_element(T* el);
T* pop_element();

const size_t Size = 1024;
T* volatile data[Size];

bool push_element(T* el)
{
static int write_idx = 0;

if( !data[write_idx] )
{
data[write_idx] = el;
write_idx = (write_idx + 1) % Size;
return true;
}
else
return false;
}

T* pop_element()
{
static int read_idx = 0;

T* element;

if( element = data[read_idx] )
{
data[read_idx] = 0;
read_idx = (read_idx + 1) % Size;
return element;
}
else
return 0;
}
 
Reply With Quote
 
 
 
 
REH
Guest
Posts: n/a
 
      10-06-2009
On Oct 6, 10:00*am, goodfella <goodfella...@gmail.com> wrote:
> Here is some code I came up with for a lock free single producer
> single consumer algorithm. *I have done some tests using GCC with -O3
> optimizations and all seems well. *I would like some constructive
> criticism of the code as I am not yet an expert on concurrency.
>
> I'm assuming that reading and writing the value of a pointer is
> atomic. *Although, I have read on some architectures that is a bad
> assumption, so to insure correctness across all platforms, I may have
> to wait for C++ atomic types. *Any suggestions and criticisms are
> welcomed thanks. *The code is below.
>
> bool push_element(T* el);
> T* pop_element();
>
> const size_t Size = 1024;
> T* volatile data[Size];
>
> bool push_element(T* el)
> {
> * * static int write_idx = 0;
>
> * * if( !data[write_idx] )
> * * {
> * * * * data[write_idx] = el;
> * * * * write_idx = (write_idx + 1) % Size;
> * * * * return true;
> * * }
> * * else
> * * * * return false;
>
> }
>
> T* pop_element()
> {
> * * static int read_idx = 0;
>
> * * T* element;
>
> * * if( element = data[read_idx] )
> * * {
> * * * * data[read_idx] = 0;
> * * * * read_idx = (read_idx + 1) % Size;
> * * * * return element;
> * * }
> * * else
> * * * * return 0;
>
> }


You have a few potential issues:
1) Even if pointer accesses are atomic on your machine, your array
element accesses most certainly aren't.
2) If you run this on a multi-processor system, you will have cache-
coherency issues.

You really need to use proper atomic operations. Most systems provide
these. GCC 4.1 or higher has atomic built-ins. Windows has atomic
APIs. Most processors provide them (e.g., Intel has lock+cmpxchg).

REH
 
Reply With Quote
 
 
 
 
REH
Guest
Posts: n/a
 
      10-06-2009
On Oct 6, 10:00*am, goodfella <goodfella...@gmail.com> wrote:
> Here is some code I came up with for a lock free single producer
> single consumer algorithm. *I have done some tests using GCC with -O3
> optimizations and all seems well. *I would like some constructive
> criticism of the code as I am not yet an expert on concurrency.
>
> I'm assuming that reading and writing the value of a pointer is
> atomic. *Although, I have read on some architectures that is a bad
> assumption, so to insure correctness across all platforms, I may have
> to wait for C++ atomic types. *Any suggestions and criticisms are
> welcomed thanks. *The code is below.
>


Incidentally, I've written a lock-free multi-read/multi-write queue
for multi-core machines that runs on both Windows and Linux, if you
need an example. It probably does more than you want/need, but you
could strip it down.

REH
 
Reply With Quote
 
goodfella
Guest
Posts: n/a
 
      10-06-2009
On Oct 6, 7:29*am, REH <spamj...@stny.rr.com> wrote:
> On Oct 6, 10:00*am, goodfella <goodfella...@gmail.com> wrote:
>
> > Here is some code I came up with for a lock free single producer
> > single consumer algorithm. *I have done some tests using GCC with -O3
> > optimizations and all seems well. *I would like some constructive
> > criticism of the code as I am not yet an expert on concurrency.

>
> > I'm assuming that reading and writing the value of a pointer is
> > atomic. *Although, I have read on some architectures that is a bad
> > assumption, so to insure correctness across all platforms, I may have
> > to wait for C++ atomic types. *Any suggestions and criticisms are
> > welcomed thanks. *The code is below.

>
> Incidentally, I've written a lock-free multi-read/multi-write queue
> for multi-core machines that runs on both Windows and Linux, if you
> need an example. It probably does more than you want/need, but you
> could strip it down.
>
> REH


REH, thanks for your replay, and yes I would be interested in seeing
your lock-free multi-read/multi-write queue. However, I do not see
how the array element access would not be atomic. From my
understanding, the following line of code:

element = data[read_idx]

Is converted to something like:

T* temp = data + read_idx
element = temp

The first line might not be atomic, but if reading the value from a
pointer is atomic, my code is still valid because its the second line
that counts. Neither data nor read_idx are going to be updated by
another thread. data is never pointed to another array of T* and
read_idx is only updated by the consumer thread.

Also, could you elaborate more on your second point, thanks.
 
Reply With Quote
 
REH
Guest
Posts: n/a
 
      10-06-2009
On Oct 6, 1:22*pm, goodfella <goodfella...@gmail.com> wrote:
> REH, thanks for your replay, and yes I would be interested in seeing
> your lock-free multi-read/multi-write queue. *However, I do not see
> how the array element access would not be atomic. *From my
> understanding, the following line of code:
>
> element = data[read_idx]
>
> Is converted to something like:
>
> T* temp = data + read_idx
> element = temp
>
> The first line might not be atomic, but if reading the value from a
> pointer is atomic, my code is still valid because its the second line
> that counts. *Neither data nor read_idx are going to be updated by
> another thread. *data is never pointed to another array of T* and
> read_idx is only updated by the consumer thread.


I glossed over the part where you said you had single readers and
writers. In that limited case, and if you can guarantee that your
pointer accesses are atomic, I think you are OK.


>
> Also, could you elaborate more on your second point, thanks.


If your code is running on a system with more than one processor
(e.g., a multi-core processor), each processor will have its own
cache. If one thread updates a pointer, the other processor(s) may not
see it, because they are accessing a value stored in their cache. One
of the things an atomic operation needs to do is ensure the value it
is working with is the most recent one, regardless of any caches that
may lay between the processor and memory. You normally don't give this
a second thought in multi-threaded code because your synchronization
objects (i.e., mutexes, semaphores, etc.) do this for you. They also
ensure that optimizations such as code reordering do not change the
logic. Code that must occur before or after the atomic operation must
not be move across that operation. You cannot provide these guarantees
in normal C++ (yet).

REH
 
Reply With Quote
 
REH
Guest
Posts: n/a
 
      10-06-2009
On Oct 6, 1:22*pm, goodfella <goodfella...@gmail.com> wrote:
> REH, thanks for your replay, and yes I would be interested in seeing
> your lock-free multi-read/multi-write queue.


It extracted the relevant code, and wrote a README file to explain all
the pieces. I can email it to you tonight when I get home. Is the
email you have above the correct one to use?

REH
 
Reply With Quote
 
Joshua Maurice
Guest
Posts: n/a
 
      10-06-2009
On Oct 6, 7:00*am, goodfella <goodfella...@gmail.com> wrote:
> Here is some code I came up with for a lock free single producer
> single consumer algorithm. *I have done some tests using GCC with -O3
> optimizations and all seems well. *I would like some constructive
> criticism of the code as I am not yet an expert on concurrency.
>
> I'm assuming that reading and writing the value of a pointer is
> atomic. *Although, I have read on some architectures that is a bad
> assumption, so to insure correctness across all platforms, I may have
> to wait for C++ atomic types. *Any suggestions and criticisms are
> welcomed thanks. *The code is below.
>

[Snip code]

As already mentioned, multi-processor systems will probably not run
your code as you intended because of cache coherency. I'd strongly
suggest looking it up and getting some good references on threading.
The short version is that BOTH the reader and the writer threads must
execute some kind of synchronization primitive to have any guarantee
WHATSOEVER on the order that writes become visible to another thread
(where a synchronization primitive might be a full blown lock, acquire
and release memory barriers, read and write memory barriers, etc.).
Ex:

#include <string>
#include <fstream>
using namespace std;

//insert threading library
template <typename callable_t>
void startThread(callable_t callable);

int x = 0;
int y = 0;

struct foo
{ foo(string const& f) : filename(f) {}
string filename;
void operator() ()
{ ofstream fout(filename.c_str());
fout << x << " " << y << endl;
}
};

int main()
{ startThread(foo("a"));
startThread(foo("b"));
startThread(foo("c"));
startThread(foo("d"));
x = 1;
y = 2;
}

This program, on a \single\ execution, can produce 4 files with the
contents (though highly unlikely for this contrived example):
0 0
0 2
1 0
1 2

This is the cache coherency problem. Let me emphasize again: two
different threads may see the same writes occur in different orders.
On modern machines, there is no single view of main memory. Everyone
has their own local copy because of their own independent caches. One
thread may execute the instructions "load register to x" then "load
register to y", but another thread may see it happen in the opposite
order.

Synchronization instructions make the view consistent when used
properly. Atomic is insufficient. The pointer might have been updated
but what it points to has not been updated. See C++ And The Perils Of
Double Checked Locking
http://www.aristeia.com/Papers/DDJ_J...04_revised.pdf
for an excellent description of modern threading, and common
pitfalls.

This is of course ignoring the other two big "culprits" of making non-
conformant code do what the user does not expect: compiler
optimizations and single core pipeline reorderings / optimizations.

Also, even on a single core single processor machine, you might still
hit related issues due to compiler optimizations and hardware pipeline
optimizations, especially with signal handlers.
 
Reply With Quote
 
goodfella
Guest
Posts: n/a
 
      10-06-2009
On Oct 6, 1:52*pm, Joshua Maurice <joshuamaur...@gmail.com> wrote:
> On Oct 6, 7:00*am, goodfella <goodfella...@gmail.com> wrote:> Here is some code I came up with for a lock free single producer
> > single consumer algorithm. *I have done some tests using GCC with -O3
> > optimizations and all seems well. *I would like some constructive
> > criticism of the code as I am not yet an expert on concurrency.

>
> > I'm assuming that reading and writing the value of a pointer is
> > atomic. *Although, I have read on some architectures that is a bad
> > assumption, so to insure correctness across all platforms, I may have
> > to wait for C++ atomic types. *Any suggestions and criticisms are
> > welcomed thanks. *The code is below.

>
> [Snip code]
>
> As already mentioned, multi-processor systems will probably not run
> your code as you intended because of cache coherency. I'd strongly
> suggest looking it up and getting some good references on threading.
> The short version is that BOTH the reader and the writer threads must
> execute some kind of synchronization primitive to have any guarantee
> WHATSOEVER on the order that writes become visible to another thread
> (where a synchronization primitive might be a full blown lock, acquire
> and release memory barriers, read and write memory barriers, etc.).
> Ex:
>
> #include <string>
> #include <fstream>
> using namespace std;
>
> //insert threading library
> template <typename callable_t>
> void startThread(callable_t callable);
>
> int x = 0;
> int y = 0;
>
> struct foo
> { * foo(string const& f) : filename(f) {}
> * * string filename;
> * * void operator() ()
> * * { * ofstream fout(filename.c_str());
> * * * * fout << x << " " << y << endl;
> * * }
>
> };
>
> int main()
> { * startThread(foo("a"));
> * * startThread(foo("b"));
> * * startThread(foo("c"));
> * * startThread(foo("d"));
> * * x = 1;
> * * y = 2;
>
> }
>
> This program, on a \single\ execution, can produce 4 files with the
> contents (though highly unlikely for this contrived example):
> * * 0 0
> * * 0 2
> * * 1 0
> * * 1 2
>
> This is the cache coherency problem. Let me emphasize again: two
> different threads may see the same writes occur in different orders.
> On modern machines, there is no single view of main memory. Everyone
> has their own local copy because of their own independent caches. One
> thread may execute the instructions "load register to x" then "load
> register to y", but another thread may see it happen in the opposite
> order.
>
> Synchronization instructions make the view consistent when used
> properly. Atomic is insufficient. The pointer might have been updated
> but what it points to has not been updated. See C++ And The Perils Of
> Double Checked Lockinghttp://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
> for an excellent description of modern threading, and common
> pitfalls.
>
> This is of course ignoring the other two big "culprits" of making non-
> conformant code do what the user does not expect: compiler
> optimizations and single core pipeline reorderings / optimizations.
>
> Also, even on a single core single processor machine, you might still
> hit related issues due to compiler optimizations and hardware pipeline
> optimizations, especially with signal handlers.


The machine I'm testing this on has a dual core processor, and I have
yet to hit a problem like that. I would expect the process to hang
because the producer would constantly be getting a false value
returned from the push function and the consumer would constantly be
getting a null pointer from the pop function due to the fact that the
pointer value would be stored in their respective cache's and a write
to the pointer in one cache would not propagate to the pointer value
in the other cache. Are there any guarantees in regard to cache
coherency on certain platforms?
 
Reply With Quote
 
Joshua Maurice
Guest
Posts: n/a
 
      10-06-2009
On Oct 6, 3:10*pm, goodfella <goodfella...@gmail.com> wrote:
> On Oct 6, 1:52*pm, Joshua Maurice <joshuamaur...@gmail.com> wrote:
>
>
>
> > On Oct 6, 7:00*am, goodfella <goodfella...@gmail.com> wrote:> Here is some code I came up with for a lock free single producer
> > > single consumer algorithm. *I have done some tests using GCC with -O3
> > > optimizations and all seems well. *I would like some constructive
> > > criticism of the code as I am not yet an expert on concurrency.

>
> > > I'm assuming that reading and writing the value of a pointer is
> > > atomic. *Although, I have read on some architectures that is a bad
> > > assumption, so to insure correctness across all platforms, I may have
> > > to wait for C++ atomic types. *Any suggestions and criticisms are
> > > welcomed thanks. *The code is below.

>
> > [Snip code]

>
> > As already mentioned, multi-processor systems will probably not run
> > your code as you intended because of cache coherency. I'd strongly
> > suggest looking it up and getting some good references on threading.
> > The short version is that BOTH the reader and the writer threads must
> > execute some kind of synchronization primitive to have any guarantee
> > WHATSOEVER on the order that writes become visible to another thread
> > (where a synchronization primitive might be a full blown lock, acquire
> > and release memory barriers, read and write memory barriers, etc.).
> > Ex:

>
> > #include <string>
> > #include <fstream>
> > using namespace std;

>
> > //insert threading library
> > template <typename callable_t>
> > void startThread(callable_t callable);

>
> > int x = 0;
> > int y = 0;

>
> > struct foo
> > { * foo(string const& f) : filename(f) {}
> > * * string filename;
> > * * void operator() ()
> > * * { * ofstream fout(filename.c_str());
> > * * * * fout << x << " " << y << endl;
> > * * }

>
> > };

>
> > int main()
> > { * startThread(foo("a"));
> > * * startThread(foo("b"));
> > * * startThread(foo("c"));
> > * * startThread(foo("d"));
> > * * x = 1;
> > * * y = 2;

>
> > }

>
> > This program, on a \single\ execution, can produce 4 files with the
> > contents (though highly unlikely for this contrived example):
> > * * 0 0
> > * * 0 2
> > * * 1 0
> > * * 1 2

>
> > This is the cache coherency problem. Let me emphasize again: two
> > different threads may see the same writes occur in different orders.
> > On modern machines, there is no single view of main memory. Everyone
> > has their own local copy because of their own independent caches. One
> > thread may execute the instructions "load register to x" then "load
> > register to y", but another thread may see it happen in the opposite
> > order.

>
> > Synchronization instructions make the view consistent when used
> > properly. Atomic is insufficient. The pointer might have been updated
> > but what it points to has not been updated. See C++ And The Perils Of
> > Double Checked Lockinghttp://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
> > for an excellent description of modern threading, and common
> > pitfalls.

>
> > This is of course ignoring the other two big "culprits" of making non-
> > conformant code do what the user does not expect: compiler
> > optimizations and single core pipeline reorderings / optimizations.

>
> > Also, even on a single core single processor machine, you might still
> > hit related issues due to compiler optimizations and hardware pipeline
> > optimizations, especially with signal handlers.

>
> The machine I'm testing this on has a dual core processor, and I have
> yet to hit a problem like that.


Probably just luck. No one side it's likely to fail. Depending on the
application it could be quite rare. You don't want such a Heisen-bug
to show up only once a month on a customer's machine.

> I would expect the process to hang
> because the producer would constantly be getting a false value
> returned from the push function and the consumer would constantly be
> getting a null pointer from the pop function due to the fact that the
> pointer value would be stored in their respective cache's and a write
> to the pointer in one cache would not propagate to the pointer value
> in the other cache.


Hardware, from the perspective of a software programmer, is a black
art. You seem to have a very poor understanding of how modern
operating systems work, and how basic processors and caches work. Just
follow the rules and you should be OK. Here are some examples from
real machines.

Caches are not infinite in size. Eventually they get full and things
must be evicted, aka spilled, to main memory. Generally they use a
heuristic of least recently referenced, which means first in is not
guaranteed to be first out. (To answer your question, surely you have
other processes running on your machine, and swapping between these
processes many times a second means that the cache will be cleared
many times a second from this context switching.)

The reader's cache may contain a stale entry for one but not the
other, which means it'll fetch the new version from main memory for
one but not the other.

The infamous DEC Alpha had a split cache, meaning that (something
like ??) all even addresses went to one cache, and all odd addresses
went to the other cache. This is much more likely to produce effects
like above, as something much more recently referenced will be flushed
to main memory first because that subportion of the cache is full.
(The DEC Alpha also has crazy look ahead, where **x may be the newest
value in main memory, but *x may be a stale value. At least, that's
how it's been explained to me. I have no clue how or why you would
implement that.)

I'd assume there exist some caches which use other heuristics, like
locality heuristics. Maybe some caches only contain ~20 byte units of
main memory, in effect like a look-ahead optimization, pulling data
from main memory before it's required, resulting in possibly stale
data for some data if it's near other used data, but if it's far away
in main memory, then it will be much more likely to be the newest
version of main memory.

I'm sure there are even "weirder" optimizations which will mess with
you if you don't follow the rules. The programming language gives you
a contract, and that contract is both threads must execute the proper
synchronization primitives to give any guarantee of the order that
writes become visible from one thread to the other. The programming
language, hardware, and the operating system fulfill this contract,
and they will do their best to optimize within the confines of this
contract. If you do not follow the contract, then you violated this
contract, and violated assumptions made by the implementation, and in
which case you're SOL.

> Are there any guarantees in regard to cache
> coherency on certain platforms?


Probably. I don't know offhand.
 
Reply With Quote
 
Chris M. Thomasson
Guest
Posts: n/a
 
      10-07-2009
"goodfella" <> wrote in message
news:d75f8e29-b59e-4c40-9660-...
> Here is some code I came up with for a lock free single producer
> single consumer algorithm. I have done some tests using GCC with -O3
> optimizations and all seems well. I would like some constructive
> criticism of the code as I am not yet an expert on concurrency.
>
> I'm assuming that reading and writing the value of a pointer is
> atomic. Although, I have read on some architectures that is a bad
> assumption, so to insure correctness across all platforms, I may have
> to wait for C++ atomic types. Any suggestions and criticisms are
> welcomed thanks. The code is below.


[...]

Check this example code out:

(IA-32/MSVC/PThread)

http://pastebin.com/f693b64b3


It's a wait-free fast-pathed single-consumer/producer bounded ring buffer
that allows for efficient in place writes and reads. A producer can directly
reserve and utilize buffer space, then atomically commit it such that a
consumer can read and process the buffer in place. Nothing gets overwritten,
everything works, and it has conditional blocking ability.

The code has the ability to block because it makes efficient use of my
eventcount algorithm. The eventcount is as important to non-blocking code as
condition variables are to PThread's. This uses my original algorithm which
I disclosed to the public here:


http://groups.google.com/group/comp....8c62ad06dbb380

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
CISCO Vs Consumer Grade mhaase-at-springmind.com Cisco 3 11-06-2005 04:11 PM
Parser Error Message: Could not load type 'Consumer.Global' =?Utf-8?B?cG11ZA==?= ASP .Net 1 12-20-2004 11:38 PM
VOIP - Consumer Phones and Quality information Subba Rao Cisco 4 09-10-2004 12:03 AM
Hosting remoting obj in consumer web app Gatwick ASP .Net 0 08-31-2004 10:03 PM
Remoting obj and consumer web app question senglory ASP .Net 0 08-25-2004 10:41 AM



Advertisments
 



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57