Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C++ > Why is java consumer/producer so much faster than C++

Reply
Thread Tools

Why is java consumer/producer so much faster than C++

 
 
Bo Persson
Guest
Posts: n/a
 
      07-24-2012
Luca Risolia skrev 2012-07-24 00:33:
> On 23/07/2012 12:17, Melzzzzz wrote:
>>> (On a side note, the C++ version you have given does not have the
>>> best possible interface for a thread-safe queue)

>>
>> I have just copied Java BlockingQueue interface, which is of course
>> suited for Java. Don't know actually how ideal interface would look
>> like.

>
> In C++11 the ideal interface for a generic thread-safe and
> exception-safe queue should at least provide std::shared_ptr<T> take()
> and/or void take(T&). T take() is limited to those types having no-throw
> copy constructors or move-constructors. This also means that you should
> check the type at compile-time by using type traits. However, in your
> test case you used move-semantic everywhere, so that is not problem.
> Also note that the term "pop" instead of "take" is more in the spirit of
> C++.
>


Using move semantics in a return statement might interfere with the NRVO
that would otherwise likely kick in. It is not an optimization.


Bo Persson

 
Reply With Quote
 
 
 
 
Melzzzzz
Guest
Posts: n/a
 
      07-24-2012
On Mon, 23 Jul 2012 18:11:58 -0700 (PDT)
Howard Hinnant <> wrote:

> On Monday, July 23, 2012 6:32:10 PM UTC-4, Melzzzzz wrote:
> > On Mon, 23 Jul 2012 13:23:30 -0700 (PDT)
> > Howard Hinnant wrote:
> >
> > &gt; I made a few changes to your code:
> > &gt;
> > &gt;
> > &gt; On my system this sped the C++ solution up considerably (to
> > about &gt; 13.7 seconds). I didn't time the Java solution.
> >
> > Yes, that's it, on my system:
> > bmaxa@maxa:~/examples$ time ./consprod2
> >
> > real 0m9.478s
> > user 0m10.797s
> > sys 0m7.196s
> >
> > which is more than twice speed of original program!
> > (and faster than java)

>
> At second glance I wasn't that fond of my solution and tweaked it as
> below:
>
>
> void put(T t)
> {
> std::unique_lock<std::mutex> lock(m_, std::try_to_lock);
> while (!lock.owns_lock())
> {
> if (queue_.size() > capacity_/4)
> {


There is problem, as queue_ is not protected under lock.

> for (int i = 0; i < 3250; ++i)
> std::this_thread::yield();
> lock.try_lock();
> }
> else
> {
> lock.lock();
> break;
> }
> }
> while(queue_.size() >= capacity_)c_full_.wait(lock);
> queue_.push_back(std::move(t));
> if (queue_.size() == 1)
> c_empty_.notify_one();
> }
> T take()
> {
> std::unique_lock<std::mutex> lock(m_, std::try_to_lock);
> int retry = 0;
> while (!lock.owns_lock())
> {
> if (queue_.size() < 3*capacity_/4)
> {
> for (int i = 0; i < 3250; ++i)
> std::this_thread::yield();
> lock.try_lock();
> }
> else
> {
> lock.lock();
> break;
> }
> }
> while(queue_.empty())c_empty_.wait(lock);
> T tmp = std::move(queue_.front());
> queue_.pop_front();
> if (queue_.size() == capacity_-1)
> c_full_.notify_one();
> return tmp;
> }
>
> I am admittedly coding to the benchmark (which is not a really great
> idea). But I got the time on my system down from 13.7 seconds to
> about 9.7 seconds (another 40%).



This is version with array (identical implementation
as ArrayBlockingQueue in Java), instead of deque (I
used atomic type for count_). This executes about same speed as Java
as atomic type slows it down a bit.
Sadly, no noticing difference between deque and array.

#include <condition_variable>
#include <mutex>
#include <thread>
#include <cstdlib>
#include <atomic>

template <class T>
class BlockingQueue{
public:
BlockingQueue(unsigned cap)
:
queue_((T*)new char[cap*sizeof(T)]),
insPos_(0),
extPos_(0),
count_(0),
capacity_(cap)
{
}
BlockingQueue(const BlockingQueue&)=delete;
BlockingQueue& operator=(const BlockingQueue&)=delete;
~BlockingQueue()
{
delete[] (char*)queue_;
}
void put(T t)
{
std::unique_lock<std::mutex> lock(m_, std::try_to_lock);
int retry = 0;
while (!lock.owns_lock())
{
if (count_ > capacity_/4 && ++retry < 1000)
{
std::this_thread::yield();
}
else
{
lock.lock();
}
}
c_full_.wait(lock, [this]{return count_ < capacity_;});
new(queue_+insPos_)T(std::move(t));
inc(insPos_);
++count_;
if (count_ == 1)
c_empty_.notify_one();
}
T take()
{
std::unique_lock<std::mutex> lock(m_, std::try_to_lock);
int retry = 0;
while (!lock.owns_lock())
{
if (count_ < 3*capacity_/4 && ++retry < 1000)
{
std::this_thread::yield();
}
else
{
lock.lock();
}
}
c_empty_.wait(lock, [this]{return count_ > 0 ;});
T tmp = std::move(queue_[extPos_]);
queue_[extPos_].~T();
inc(extPos_);
--count_;
if (count_ == capacity_-1)
c_full_.notify_one();
return tmp;
}
bool empty()
{
std::lock_guard<std::mutex> lock(m_);
return count_ == 0;
}
private:
void inc(unsigned& i)
{
++i;
if(i == capacity_)i = 0;
}
std::mutex m_;
std::condition_variable c_empty_,c_full_;
T* queue_;
unsigned insPos_,extPos_;
std::atomic<unsigned> count_;
const unsigned capacity_;
};

int main()
{
BlockingQueue<int> produced(1000);
const int nitems = 100000000;
std::srand(12345);


std::function<void()> f_prod = [&]() {
int i = nitems;
while(i-- > 0){
produced.put(i);
}
};

std::thread producer1(f_prod);

std::function<void()> f_cons = [&]() {
const int size = 10000;
int arr[size];
int i = nitems;
while(i-- > 0)
{
arr[std::rand()%size] =
produced.take();
}
};

std::thread consumer1(f_cons);

producer1.join();
consumer1.join();
}

 
Reply With Quote
 
 
 
 
Luca Risolia
Guest
Posts: n/a
 
      07-24-2012
On 24/07/2012 21:45, Melzzzzz wrote:
> This is version with array (identical implementation
> as ArrayBlockingQueue in Java), instead of deque (I
> used atomic type for count_).


Why do you need count_ to be an atomic type, it seems that all the
updates to that variable are already protected by a mutex..

 
Reply With Quote
 
Melzzzzz
Guest
Posts: n/a
 
      07-24-2012
On Tue, 24 Jul 2012 23:24:07 +0200
Luca Risolia <> wrote:

> On 24/07/2012 21:45, Melzzzzz wrote:
> > This is version with array (identical implementation
> > as ArrayBlockingQueue in Java), instead of deque (I
> > used atomic type for count_).

>
> Why do you need count_ to be an atomic type, it seems that all the
> updates to that variable are already protected by a mutex..
>


count_ is not protected when testing if lock is held in while loop.
Anyway this has not be precise and if count_ is not atomic program
is much faster.
I have tried Howard's algorithm with Java and got same
time as C++.
Perhaps Array/LinkedBlockingQueue implementation in Java has
to be updated

 
Reply With Quote
 
Luca Risolia
Guest
Posts: n/a
 
      07-24-2012
On 25/07/2012 00:48, Melzzzzz wrote:
> On Tue, 24 Jul 2012 23:24:07 +0200
> Luca Risolia <> wrote:
>
>> On 24/07/2012 21:45, Melzzzzz wrote:
>>> This is version with array (identical implementation
>>> as ArrayBlockingQueue in Java), instead of deque (I
>>> used atomic type for count_).

>>
>> Why do you need count_ to be an atomic type, it seems that all the
>> updates to that variable are already protected by a mutex..
>>

>
> count_ is not protected when testing if lock is held in while loop.


Correct, I did not read the code carefully, since I supposed the only
difference from your original version was the use of an array, which
made the program about 20-30% faster in my tests.

 
Reply With Quote
 
Howard Hinnant
Guest
Posts: n/a
 
      07-25-2012
On Tuesday, July 24, 2012 3:45:22 PM UTC-4, Melzzzzz wrote:
> On Mon, 23 Jul 2012 18:11:58 -0700 (PDT)
> Howard Hinnant &lt;&gt; wrote:
> &gt; void put(T t)
> &gt; {
> &gt; std::unique_lock&lt;std::mutex&gt; lock(m_, std::try_to_lock);
> &gt; while (!lock.owns_lock())
> &gt; {
> &gt; if (queue_.size() &gt; capacity_/4)
> &gt; {
>
> There is problem, as queue_ is not protected under lock.


Correct. But the code gets away with it anyway because:

1. It does not modify the queue.
2. It does not need an accurate value for queue_.size().

The *very* worst that could happen is that the value of queue_.size() read by one thread is never updated by the other as the other thread fills/empties the queue. Eventually the other thread will drop the mutex ownership, either because it runs out of things to put into the queue, or because it fills the queue and has to wait, or because it empties the queue and has to wait. In any event, the other thread drops the lock and this thread acquires it via the try_lock. It has not mattered, except perhaps for efficiency purposes, that this thread has been reading the wrong value of queue_.size().

And as you pointed out, efficiency hasn't suffered.
 
Reply With Quote
 
none
Guest
Posts: n/a
 
      07-25-2012
In article <jumg3b$nvn$>, Melzzzzz <> wrote:
>On Tue, 24 Jul 2012 12:51:20 +0000 (UTC)
>yatremblay@bel1lin202.(none) (Yannick Tremblay) wrote:
>
>> In article <86c79114-84e6-4065-a9a4->,
>> Howard Hinnant <> wrote:
>> >I made a few changes to your code:
>> >
>> >1. I only signal if the capacity becomes non-zero or non-full.

>>
>> >2. If the queue is the queue has got some items in it and the mutex
>> >is locked, the put thread yields to the take thread. If the queue
>> >has some free space in it and the mutex is locked, the take thread
>> >yields to the put thread.

>>
>> Are you sure about the usefulness of #2?

>
>Problem is that we need to reduce waiting on condition to minimum
>as more condition waits program is slower.
>Ideally there would be no condition waits (maximum speed).


OK, for this test program, I guess it might make sense (albeit in my
case, this worsens the results)

However, you should be careful to generalizing this to a "real"
program. This test program passes integers back and forth via the
queue and do no "work" at all. A "real" program will most likely do
work.

Typically for a real program, fcons/fprod will look more like:

void fprod()
{
for(;
{
Job job = CreateJob();
queue.put(job);
}
}

void fcons()
{
for(;
{
Job job = queue.take();
ProcessJob(Job);
}
}

And the cost of ProcessJob()/CreateJob() will dwarf the cost of
put()/take() by several order of magnitude. (in your test program, the
cost of storing the integer is dwarfed by the cost of accessing the
queue)

For such a system, you actually want it to be a common occurence that
the consumer blocks waiting on a condition.

>> Doing a few tests myself, adding this in m queue seems to reduce
>> performance by some 20%: while (!lock.owns_lock())
>> {
>> if (queue_.size() > capacity_/4)
>> {
>> for (int i = 0; i < 3250; ++i)
>> std::this_thread::yield();
>> lock.try_lock();
>> }
>> else
>> {
>> lock.lock();
>> break;
>> }
>> }
>>

>
>This is second version, on my system first version seems better
>for smaller capacities...
>
>



 
Reply With Quote
 
Howard Hinnant
Guest
Posts: n/a
 
      07-25-2012
On Wednesday, July 25, 2012 4:09:43 AM UTC-4, none wrote:

> However, you should be careful to generalizing this to a &quot;real&quot;
> program.


100% agreed.

I wrote earlier:

> I am admittedly coding to the benchmark (which is not a really great idea).


It was fun. I had a good time.

Howard
 
Reply With Quote
 
Luca Risolia
Guest
Posts: n/a
 
      07-26-2012
On 24/07/2012 21:45, Melzzzzz wrote:
> This is version with array (identical implementation
> as ArrayBlockingQueue in Java), instead of deque (I
> used atomic type for count_). This executes about same speed as Java
> as atomic type slows it down a bit.


> Sadly, no noticing difference between deque and array.


Right, it may well depend on how smart the std::deque::allocator_type is.

Below is a version of the BlockingQueue using spin locks. Compared to
the original version in Java it is about 2.5 times faster on my
platform, provided that you remove std::rand() from the test, as it
seems to be a real bottleneck and does not give any precious
informations about the efficiency of the Queue (which is the thing to
measure after all).

Let me know your results.

---

#include <thread>
#include <cstdlib>
#include <functional>
#include <atomic>
#include <memory>
#include <type_traits>

class spinlock_mutex {
std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:

void lock() {
while (flag.test_and_set(std::memory_order_acquire)) {
std::this_thread::yield();
};
}

void unlock() {
flag.clear(std::memory_order_release);
}
};

template <class T, class A = std::allocator<T >>
class BlockingQueue {
const std::size_t cap_;
A alloc_;
T* queue_;
std::size_t n = 0, b = 0, f = 0;
spinlock_mutex m_;
public:

BlockingQueue(const std::size_t cap, const A& alloc = A())
: cap_(cap), alloc_(alloc), queue_(alloc_.allocate(cap)) { }

~BlockingQueue() {
alloc_.deallocate(queue_, cap_);
}

static_assert(std::is_nothrow_move_constructible<T >::value,
"BlockingQueue cannot be exception-safe");

void put(const T t) {
for (; {
m_.lock();
if (n < cap_)
break;
m_.unlock();
}
alloc_.construct(&queue_[f], std::move(t));
f = ++f % cap_;
++n;
m_.unlock();
}

T take() {
for (; {
m_.lock();
if (n > 0)
break;
m_.unlock();
}
const T t = std::move(queue_[b]);
alloc_.destroy(&queue_[b]);
b = ++b % cap_;
--n;
m_.unlock();
return t;
}
};

int main() {
BlockingQueue<int> produced(100000);
const int nitems = 100000000;
std::srand(12345);


std::function<void() > f_prod = [&](){
int i = nitems;
while (i-- > 0) {
produced.put(i);
}
};

std::thread producer1(f_prod);

std::function<void() > f_cons = [&](){
const int size = 10000;
int arr[size];
int i = nitems;
while (i-- > 0) {
arr[std::rand() % size] = produced.take();
}
};

std::thread consumer1(f_cons);

producer1.join();
consumer1.join();
}

 
Reply With Quote
 
Melzzzzz
Guest
Posts: n/a
 
      07-26-2012
On Thu, 26 Jul 2012 21:19:21 +0200
Luca Risolia <> wrote:

> On 24/07/2012 21:45, Melzzzzz wrote:
> > This is version with array (identical implementation
> > as ArrayBlockingQueue in Java), instead of deque (I
> > used atomic type for count_). This executes about same speed as Java
> > as atomic type slows it down a bit.

>
> > Sadly, no noticing difference between deque and array.

>
> Right, it may well depend on how smart the std::deque::allocator_type
> is.
>
> Below is a version of the BlockingQueue using spin locks. Compared to
> the original version in Java it is about 2.5 times faster on my
> platform, provided that you remove std::rand() from the test, as it
> seems to be a real bottleneck and does not give any precious
> informations about the efficiency of the Queue (which is the thing to
> measure after all).
>
> Let me know your results.

(your version without rand and array assignment)
bmaxa@maxa:~/examples$ time ./consprod5

real 0m6.812s
user 0m7.532s
sys 0m5.136s
(Howards version without rand and array assignment)
bmaxa@maxa:~/examples$ time ./consprod4

real 0m6.748s
user 0m7.376s
sys 0m5.000s

(your version with rand and array assignment)
bmaxa@maxa:~/examples$ time ./consprod5

real 0m23.733s
user 0m27.042s
sys 0m16.849s

(Howards version with rand and array assignment)
bmaxa@maxa:~/examples$ time ./consprod4

real 0m9.351s
user 0m10.257s
sys 0m6.900s

(java (Howards version) without rand and array assignment)
bmaxa@maxa:~/examples$ time java consprod

real 0m8.088s
user 0m8.841s
sys 0m5.596s

(java (Howards version) with rand and array assignment)
bmaxa@maxa:~/examples$ time java consprod

real 0m10.052s
user 0m11.117s
sys 0m7.208s


>
> ---
>
> #include <thread>
> #include <cstdlib>
> #include <functional>
> #include <atomic>
> #include <memory>
> #include <type_traits>
>
> class spinlock_mutex {
> std::atomic_flag flag = ATOMIC_FLAG_INIT;


gcc 4.6.3 doesn't implements non static member initialization ;(

>
> static_assert(std::is_nothrow_move_constructible<T >::value,
> "BlockingQueue cannot be exception-safe");


this is not yet in library ;(

>
> void put(const T t) {
> for (; {
> m_.lock();
> if (n < cap_)
> break;
> m_.unlock();
> }
> alloc_.construct(&queue_[f], std::move(t));
> f = ++f % cap_;


compiler warns about undefined behavior here.

> ++n;
> m_.unlock();
> }
>
> T take() {
> for (; {
> m_.lock();
> if (n > 0)
> break;
> m_.unlock();
> }
> const T t = std::move(queue_[b]);
> alloc_.destroy(&queue_[b]);
> b = ++b % cap_;


also warning about undefined behavior.


 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Why {} is much faster than Hash.new ? IƱaki Baz Castillo Ruby 5 01-17-2011 02:30 PM
Wow, Python much faster than MatLab Stef Mientki Python 11 01-01-2007 02:05 AM
Is a Java Application much faster than an Applet. Sanny Java 12 12-15-2006 02:47 AM
findcontrol("PlaceHolderPrice") why why why why why why why why why why why Mr. SweatyFinger ASP .Net 2 12-02-2006 03:46 PM
why is perl -e 'unlink(glob("*"))' so much faster than rm ? ewaguespack@gmail.com Perl Misc 13 07-19-2006 10:37 AM



Advertisments
 



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57