NSOperationQueue vs pthread priority

NSOperationQueue vs pthread priority - c++

I have this problem:
I have a C++ code that use some threads. These thread are pthread type.
In my iPhone app I use NSOperationQueue and also some C++ code.
The problem is this: the C++ pthread always have lower priority than NsOperationQueue.
How can I fix this? I have also tried to give low priority to NSOpertionQueue but this fix does not work.

If you have to resort to twiddling priority (notably upwards), it's usually indicative of a design flaw in concurrent models. This should be reserved for very special cases, like a realtime thread (e.g. audio playback).
First assess how your threads and tasks operate, and make sure you have no other choice. Typically, you can do something simple, like reducing the operation queue's max operation count, reducing total thread count, or by grouping your tasks by the resource they require.
What method are you using to determine the threads' priorities?
Also note that setting an operation's priority affects the ordering of enqueued operations (not the thread itself).
I've always been able to solve this problem by tweaking distribution. You should stop reading now :)
Available, but NOT RECOMMENDED:
To lower an operation's priority, you could approach it in your operation's main:
- (void)main
{
#autorelease {
const double priority = [NSThread threadPriority];
const bool isMainThread = [NSThread isMainThread];
if (!isMainThread) {
[NSThread setThreadPriority:priority * 0.5];
}
do_your_work_here
if (!isMainThread) {
[NSThread setThreadPriority:priority];
}
}
}
If you really need to push the kernel after all that, this is how you can set a pthread's priority:
pthreads with real time priority
How to increase thread priority in pthreads?

Related

boost lockfree queue performance issue calling boost::lockfree::queue push(), pop(), empty() functions [duplicate]

So i'm using a boost::lockfree::spec_queue to communicate via two boost_threads running functors of two objects in my application.
All is fine except for the fact that the spec_queue::pop() method is non blocking. It returns True or False even if there is nothing in the queue. However my queue always seems to return True (problem #1). I think this is because i preallocate the queue.
typedef boost::lockfree::spsc_queue<q_pl, boost::lockfree::capacity<100000> > spsc_queue;
This means that to use the queue efficiently i have to busy wait constantly popping the queue using 100% cpu. Id rather not sleep for arbitrary amounts of time. I've used other queues in java which block until an object is made available. Can this be done with std:: or boost:: data structures?

A lock free queue, by definition, does not have blocking operations.
How would you synchronize on the datastructure? There is no internal lock, for obvious reasons, because that would mean all clients need to synchronize on it, making it your grandfathers locking concurrent queue.
So indeed, you will have to devise a waiting function yourself. How you do this depends on your use case, which is probably why the library doesn't supply one (disclaimer: I haven't checked and I don't claim to know the full documentation).
So what can you do:
As you already described, you can spin in a tight loop. Obviously, you'll do this if you know that your wait condition (queue non-empty) is always going to be satisfied very quickly.
Alternatively, poll the queue at a certain frequency (doing micro-sleeps in the mean time). Scheduling a good good frequency is an art: for some applications 100ms is optimal, for others, a potential 100ms wait would destroy throughput. So, vary and measure your performance indicators (don't forget about power consumption if your application is going to be deployed on many cores in a datacenter :)).
Lastly, you could arrive at a hybrid solution, spinning for a fixed number of iterations, and resorting to (increasing) interval polling if nothing arrives. This would nicely support servers applications where high loads occur in bursts.

Use a semaphore to cause the producers to sleep when the queue is full, and another semaphore to cause the consumers to sleep when the queue is empty.
when the queue is neither full nor empty, the sem_post and sem_wait operations are nonblocking (in newer kernels)
#include <semaphore.h>
template<typename lock_free_container>
class blocking_lock_free
{
public:
lock_free_queue_semaphore(size_t n) : container(n)
{
sem_init(&pop_semaphore, 0, 0);
sem_init(&push_semaphore, 0, n);
}
~lock_free_queue_semaphore()
{
sem_destroy(&pop_semaphore);
sem_destroy(&push_semaphore);
}
bool push(const lock_free_container::value_type& v)
{
sem_wait(&push_semaphore);
bool ret = container::bounded_push(v);
ASSERT(ret);
if (ret)
sem_post(&pop_semaphore);
else
sem_post(&push_semaphore); // shouldn't happen
return ret;
}
bool pop(lock_free_container::value_type& v)
{
sem_wait(&pop_semaphore);
bool ret = container::pop(v);
ASSERT(ret);
if (ret)
sem_post(&push_semaphore);
else
sem_post(&pop_semaphore); // shouldn't happen
return ret;
}
private:
lock_free_container container;
sem_t pop_semaphore;
sem_t push_semaphore;
};

boost::lockfree::spsc_queue busy wait strategy. Is there a blocking pop?

So i'm using a boost::lockfree::spec_queue to communicate via two boost_threads running functors of two objects in my application.
All is fine except for the fact that the spec_queue::pop() method is non blocking. It returns True or False even if there is nothing in the queue. However my queue always seems to return True (problem #1). I think this is because i preallocate the queue.
typedef boost::lockfree::spsc_queue<q_pl, boost::lockfree::capacity<100000> > spsc_queue;
This means that to use the queue efficiently i have to busy wait constantly popping the queue using 100% cpu. Id rather not sleep for arbitrary amounts of time. I've used other queues in java which block until an object is made available. Can this be done with std:: or boost:: data structures?

A lock free queue, by definition, does not have blocking operations.
How would you synchronize on the datastructure? There is no internal lock, for obvious reasons, because that would mean all clients need to synchronize on it, making it your grandfathers locking concurrent queue.
So indeed, you will have to devise a waiting function yourself. How you do this depends on your use case, which is probably why the library doesn't supply one (disclaimer: I haven't checked and I don't claim to know the full documentation).
So what can you do:
As you already described, you can spin in a tight loop. Obviously, you'll do this if you know that your wait condition (queue non-empty) is always going to be satisfied very quickly.
Alternatively, poll the queue at a certain frequency (doing micro-sleeps in the mean time). Scheduling a good good frequency is an art: for some applications 100ms is optimal, for others, a potential 100ms wait would destroy throughput. So, vary and measure your performance indicators (don't forget about power consumption if your application is going to be deployed on many cores in a datacenter :)).
Lastly, you could arrive at a hybrid solution, spinning for a fixed number of iterations, and resorting to (increasing) interval polling if nothing arrives. This would nicely support servers applications where high loads occur in bursts.

Use a semaphore to cause the producers to sleep when the queue is full, and another semaphore to cause the consumers to sleep when the queue is empty.
when the queue is neither full nor empty, the sem_post and sem_wait operations are nonblocking (in newer kernels)
#include <semaphore.h>
template<typename lock_free_container>
class blocking_lock_free
{
public:
lock_free_queue_semaphore(size_t n) : container(n)
{
sem_init(&pop_semaphore, 0, 0);
sem_init(&push_semaphore, 0, n);
}
~lock_free_queue_semaphore()
{
sem_destroy(&pop_semaphore);
sem_destroy(&push_semaphore);
}
bool push(const lock_free_container::value_type& v)
{
sem_wait(&push_semaphore);
bool ret = container::bounded_push(v);
ASSERT(ret);
if (ret)
sem_post(&pop_semaphore);
else
sem_post(&push_semaphore); // shouldn't happen
return ret;
}
bool pop(lock_free_container::value_type& v)
{
sem_wait(&pop_semaphore);
bool ret = container::pop(v);
ASSERT(ret);
if (ret)
sem_post(&push_semaphore);
else
sem_post(&pop_semaphore); // shouldn't happen
return ret;
}
private:
lock_free_container container;
sem_t pop_semaphore;
sem_t push_semaphore;
};

thread building block combined with pthreads

I have a queue with elements which needs to be processed. I want to process these elements in parallel. The will be some sections on each element which need to be synchronized. At any point in time there can be max num_threads running threads.
I'll provide a template to give you an idea of what I want to achieve.
queue q
process_element(e)
{
lock()
some synchronized area
// a matrix access performed here so a spin lock would do
unlock()
...
unsynchronized area
...
if( condition )
{
new_element = generate_new_element()
q.push(new_element) // synchonized access to queue
}
}
process_queue()
{
while( elements in q ) // algorithm is finished condition
{
e = get_elem_from_queue(q) // synchronized access to queue
process_element(e)
}
}
I can use
pthreads
openmp
intel thread building blocks
Top problems I have
Make sure that at any point in time I have max num_threads running threads
Lightweight synchronization methods to use on queue
My plan is to the intel tbb concurrent_queue for the queue container. But then, will I be able to use pthreads functions ( mutexes, conditions )? Let's assume this works ( it should ). Then, how can I use pthreads to have max num_threads at one point in time? I was thinking to create the threads once, and then, after one element is processes, to access the queue and get the next element. However it if more complicated because I have no guarantee that if there is not element in queue the algorithm is finished.
My question
Before I start implementing I'd like to know if there is an easy way to use intel tbb or pthreads to obtain the behaviour I want? More precisely processing elements from a queue in parallel
Note: I have tried to use tasks but with no success.

First off, pthreads gives you portability which is hard to walk away from. The following appear to be true from your question - let us know if these aren't true because the answer will then change:
1) You have a multi-core processor(s) on which you're running the code
2) You want to have no more than num_threads threads because of (1)
Assuming the above to be true, the following approach might work well for you:
Create num_threads pthreads using pthread_create
Optionally, bind each thread to a different core
q.push(new_element) atomically adds new_element to a queue. pthreads_mutex_lock and pthreads_mutex_unlock can help you here. Examples here: http://pages.cs.wisc.edu/~travitch/pthreads_primer.html
Use pthreads_mutexes for dequeueing elements
Termination is tricky - one way to do this is to add a TERMINATE element to the queue, which upon dequeueing, causes the dequeuer to queue up another TERMINATE element (for the next dequeuer) and then terminate. You will end up with one extra TERMINATE element in the queue, which you can remove by having a named thread dequeue it after all the threads are done.
Depending on how often you add/remove elements from the queue, you may want to use something lighter weight than pthread_mutex_... to enqueue/dequeue elements. This is where you might want to use a more machine-specific construct.

TBB is compatible with other threading packages.
TBB also emphasizes scalability. So when you port over your program to from a dual core to a quad core you do not have to adjust your program. With data parallel programming, program performance increases (scales) as you add processors.
Cilk Plus is also another runtime that provides good results.
www.cilkplus.org
Since pThreads is a low level theading library you have to decide how much control you need in your application because it does offer flexibility, but at a high cost in terms of programmer effort, debugging time, and maintenance costs.

My recommendation is to look at tbb::parallel_do. It was designed to process elements from a container in parallel, even if the container itself is not concurrent; i.e. parallel_do works with an std::queue correctly without any user synchronization (of course you would still need to protect your matrix access inside process_element(). Moreover, with parallel_do you can add more work on the fly, which looks like what you need, as process_element() creates and adds new elements to the work queue (the only caution is that the newly added work will be processed immediately, unlike putting in a queue which would postpone processing till after all "older" items). Also, you don't have to worry about termination: parallel_do will complete automatically as soon as all initial queue items and new items created on the fly are processed.
However, if, besides the computation itself, the work queue can be concurrently fed from another source (e.g. from an I/O processing thread), then parallel_do is not suitable. In this case, it might make sense to look at parallel_pipeline or, better, the TBB flow graph.
Lastly, an application can control the number of active threads with TBB, though it's not a recommended approach.

ring buffer without priority inversion

I have a high-priority process that needs to pass data to a low-priority process. I've written a basic ring buffer to handle the passing of data:
class RingBuffer {
public:
RingBuffer(int size);
~RingBuffer();
int count() {return (size + end - start) % size;}
void write(char *data, int bytes) {
// some work that uses only buffer and end
end = (end + bytes) % size;
}
void read(char *data, int bytes) {
// some work that uses only buffer and start
start = (start + bytes) % size;
}
private:
char *buffer;
const int size;
int start, end;
};
Here's the problem. Suppose the low-priority process has an oracle that tells it exactly how much data needs to be read, so that count() need never be called. Then (unless I'm missing something) there are no concurrency issues. However, as soon as the low-priority thread needs to call count() (the high-priority thread might want to call it too to check if the buffer is too full) there is the possibility that the math in count() or the update to end is not atomic, introducing a bug.
I could put a mutex around the accesses to start and end but that would cause priority inversion if the high-priority thread has to wait for the lock acquired by the low-priority thread.
I might be able to work something out using atomic operations but I'm not aware of a nice, cross-platform library providing these.
Is there a standard ring-buffer design that avoids these issues?

What you have should be OK, as long as you adhere to these guidelines:
Only one thread can do writes.
Only one thread can do reads.
Updates and accesses to start and end are atomic. This might be automatic, for example Microsoft states:
Simple reads and writes to
properly-aligned 32-bit variables are
atomic operations. In other words, you
will not end up with only one portion
of the variable updated; all bits are
updated in an atomic fashion.
You allow for the fact that count might be out of date even as you get the value. In the reading thread, count will return the minimum count you can rely on; for the writing thread count will return the maximum count and the true count might be lower.

Boost provides a circular buffer, but it's not thread safe. Unfortunately, I don't know of any implementation that is.
The upcoming C++ standard adds atomic operations to the standard library, so they'll be available in the future, but they aren't supported by most implementations yet.
I don't see any cross-platform solution to keeping count sane while both threads are writing to it, unless you implement locking.
Normally, you would probably use a messaging system and force the low priority thread to request that the high priority thread make updates, or something similar. For example, if the low priority thread consumes 15 bytes, it should ask the high priority thread to reduce the count by 15.
Essentially, you would limit 'write' access to the high priority thread, and only allow the low priority thread to read. This way, you can avoid all locking, and the high priority thread won't have to worry about waiting for a write to be completed by the lower thread, making the high priority thread truly high priority.

boost::interprocess offers cross-platform atomic functions in boost/interprocess/detail/atomic.hpp

Overhead due to use of Events

I have a custom thread pool class, that creates some threads that each wait on their own event (signal). When a new job is added to the thread pool, it wakes the first free thread so that it executes the job.
The problem is the following : I have around 1000 loops of each around 10'000 iterations do to. These loops must be executed sequentially, but I have 4 CPUs available. What I try to do is to split the 10'000 iteration loops into 4 2'500 iterations loops, ie one per thread. But I have to wait for the 4 small loops to finish before going to the next "big" iteration. This means that I can't bundle the jobs.
My problem is that using the thread pool and 4 threads is much slower than doing the jobs sequentially (having one loop executed by a separate thread is much slower than executing it directly in the main thread sequentially).
I'm on Windows, so I create events with CreateEvent() and then wait on one of them using WaitForMultipleObjects(2, handles, false, INFINITE) until the main thread calls SetEvent().
It appears that this whole event thing (along with the synchronization between the threads using critical sections) is pretty expensive !
My question is : is it normal that using events takes "a lot of" time ? If so, is there another mechanism that I could use and that would be less time-expensive ?
Here is some code to illustrate (some relevant parts copied from my thread pool class) :
// thread function
unsigned __stdcall ThreadPool::threadFunction(void* params) {
// some housekeeping
HANDLE signals[2];
signals[0] = waitSignal;
signals[1] = endSignal;
do {
// wait for one of the signals
waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);
// try to get the next job parameters;
if (tp->getNextJob(threadId, data)) {
// execute job
void* output = jobFunc(data.params);
// tell thread pool that we're done and collect output
tp->collectOutput(data.ID, output);
}
tp->threadDone(threadId);
}
while (waitResult - WAIT_OBJECT_0 == 0);
// if we reach this point, endSignal was sent, so we are done !
return 0;
}
// create all threads
for (int i = 0; i < nbThreads; ++i) {
threadData data;
unsigned int threadId = 0;
char eventName[20];
sprintf_s(eventName, 20, "WaitSignal_%d", i);
data.handle = (HANDLE) _beginthreadex(NULL, 0, ThreadPool::threadFunction,
this, CREATE_SUSPENDED, &threadId);
data.threadId = threadId;
data.busy = false;
data.waitSignal = CreateEvent(NULL, true, false, eventName);
this->threads[threadId] = data;
// start thread
ResumeThread(data.handle);
}
// add job
void ThreadPool::addJob(int jobId, void* params) {
// housekeeping
EnterCriticalSection(&(this->mutex));
// first, insert parameters in the list
this->jobs.push_back(job);
// then, find the first free thread and wake it
for (it = this->threads.begin(); it != this->threads.end(); ++it) {
thread = (threadData) it->second;
if (!thread.busy) {
this->threads[thread.threadId].busy = true;
++(this->nbActiveThreads);
// wake thread such that it gets the next params and runs them
SetEvent(thread.waitSignal);
break;
}
}
LeaveCriticalSection(&(this->mutex));
}

This looks to me as a producer consumer pattern, which can be implented with two semaphores, one guarding the queue overflow, the other the empty queue.
You can find some details here.

Yes, WaitForMultipleObjects is pretty expensive. If your jobs are small, the synchronization overhead will start to overwhelm the cost of actually doing the job, as you're seeing.
One way to fix this is bundle multiple jobs into one: if you get a "small" job (however you evaluate such things), store it someplace until you have enough small jobs together to make one reasonably-sized job. Then send all of them to a worker thread for processing.
Alternately, instead of using signaling you could use a multiple-reader single-writer queue to store your jobs. In this model, each worker thread tries to grab jobs off the queue. When it finds one, it does the job; if it doesn't, it sleeps for a short period, then wakes up and tries again. This will lower your per-task overhead, but your threads will take up CPU even when there's no work to be done. It all depends on the exact nature of the problem.

Watch out, you are still asking for a next job after the endSignal is emitted.
for( ;; ) {
// wait for one of the signals
waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);
if( waitResult - WAIT_OBJECT_0 != 0 )
return;
//....
}

Since you say that it is much slower in parallel than sequential execution, I assume that your processing time for your internal 2500 loop iterations is tiny (in the few micro seconds range). Then there is not much you can do except review your algorithm to split larger chunks of precessing; OpenMP won't help and every other synchronization techniques won't help either because they fundamentally all rely on events (spin loops do not qualify).
On the other hand, if your processing time of the 2500 loop iterations is larger than 100 micro seconds (on current PCs), you might be running into limitations of the hardware. If your processing uses a lot of memory bandwidth, splitting it to four processors will not give you more bandwidth, it will actually give you less because of collisions. You could also be running into problems of cache cycling where each of your top 1000 iteration will flush and reload the cache of the 4 cores. Then there is no one solution, and depending on your target hardware, there may be none.

If you are just parallelizing loops and using vs 2008, I'd suggest looking at OpenMP. If you're using visual studio 2010 beta 1, I'd suggesting looking at the parallel pattern library, particularly the "parallel for" / "parallel for each"
apis or the "task group class because these will likely do what you're attempting to do, only with less code.
Regarding your question about performance, here it really depends. You'll need to look at how much work you're scheduling during your iterations and what the costs are. WaitForMultipleObjects can be quite expensive if you hit it a lot and your work is small which is why I suggest using an implementation already built. You also need to ensure that you aren't running in debug mode, under a debugger and that the tasks themselves aren't blocking on a lock, I/O or memory allocation, and you aren't hitting false sharing. Each of these has the potential to destroy scalability.
I'd suggest looking at this under a profiler like xperf the f1 profiler in visual studio 2010 beta 1 (it has 2 new concurrency modes which help see contention) or Intel's vtune.
You could also share the code that you're running in the tasks, so folks could get a better idea of what you're doing, because the answer I always get with performance issues is first "it depends" and second, "have you profiled it."
Good Luck
-Rick

It shouldn't be that expensive, but if your job takes hardly any time at all, then the overhead of the threads and sync objects will become significant. Thread pools like this work much better for longer-processing jobs or for those that use a lot of IO instead of CPU resources. If you are CPU-bound when processing a job, ensure you only have 1 thread per CPU.
There may be other issues, how does getNextJob get its data to process? If there's a large amount of data copying, then you've increased your overhead significantly again.
I would optimise it by letting each thread keep pulling jobs off the queue until the queue is empty. that way, you can pass a hundred jobs to the thread pool and the sync objects will be used just the once to kick off the thread. I'd also store the jobs in a queue and pass a pointer, reference or iterator to them to the thread instead of copying the data.

The context switching between threads can be expensive too. It is interesting in some cases to develop a framework you can use to process your jobs sequentially with one thread or with multiple threads. This way you can have the best of the two worlds.
By the way, what is your question exactly ? I will be able to answer more precisely with a more precise question :)
EDIT:
The events part can consume more than your processing in some cases, but should not be that expensive, unless your processing is really fast to achieve. In this case, switching between thredas is expensive too, hence my answer first part on doing things sequencially ...
You should look for inter-threads synchronisation bottlenecks. You can trace threads waiting times to begin with ...
EDIT: After more hints ...
If I guess correctly, your problem is to efficiently use all your computer cores/processors to parralellize some processing essencialy sequential.
Take that your have 4 cores and 10000 loops to compute as in your example (in a comment). You said that you need to wait for the 4 threads to end before going on. Then you can simplify your synchronisation process. You just need to give your four threads thr nth, nth+1, nth+2, nth+3 loops, wait for the four threads to complete then going on. You should use a rendezvous or barrier (a synchronization mechanism that wait for n threads to complete). Boost has such a mechanism. You can look the windows implementation for efficiency. Your thread pool is not really suited to the task. The search for an available thread in a critical section is what is killing your CPU time. Not the event part.

It appears that this whole event thing
(along with the synchronization
between the threads using critical
sections) is pretty expensive !
"Expensive" is a relative term. Are jets expensive? Are cars? or bicycles... shoes...?
In this case, the question is: are events "expensive" relative to the time taken for JobFunction to execute? It would help to publish some absolute figures: How long does the process take when "unthreaded"? Is it months, or a few femtoseconds?
What happens to the time as you increase the threadpool size? Try a pool size of 1, then 2 then 4, etc.
Also, as you've had some issues with threadpools here in the past, I'd suggest some debug
to count the number of times that your threadfunction is actually invoked... does it match what you expect?
Picking a figure out of the air (without knowing anything about your target system, and assuming you're not doing anything 'huge' in code you haven't shown), I'd expect the "event overhead" of each "job" to be measured in microseconds. Maybe a hundred or so. If the time taken to perform the algorithm in JobFunction is not significantly MORE than this time, then your threads are likely to cost you time rather than save it.

As mentioned previously, the amount of overhead added by threading depends on the relative amount of time taken to do the "jobs" that you defined. So it is important to find a balance in the size of the work chunks that minimizes the number of pieces but does not leave processors idle waiting for the last group of computations to complete.
Your coding approach has increased the amount of overhead work by actively looking for an idle thread to supply with new work. The operating system is already keeping track of that and doing it a lot more efficiently. Also, your function ThreadPool::addJob() may find that all of the threads are in use and be unable to delegate the work. But it does not provide any return code related to that issue. If you are not checking for this condition in some way and are not noticing errors in the results, it means that there are idle processors always. I would suggest reorganizing the code so that addJob() does what it is named -- adds a job ONLY (without finding or even caring who does the job) while each worker thread actively gets new work when it is done with its existing work.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js