Bottleneck in parallel packet dispatcher - C++

I'll say up front that very high throughput is needed and that each call to ExecutePackets is expensive.
I need the ExecutePackets function to process many packets, submitted in parallel from different threads, in a single call.
struct Packet {
    bool responseStatus;
    char data[1024];
};

struct PacketPool {
    int packet_count;
    Packet* packets[10];
} packet_pool;
std::mutex queue_mtx;
std::mutex request_mtx;

bool ParallelExecutePacket(Packet* p_packet) {
    p_packet->responseStatus = false;
    struct QueuePacket {
        bool executed;
        Packet* p_packet;
    } queue_packet{ false, p_packet };
    static std::list<std::reference_wrapper<QueuePacket>> queue;
    // make queue
    queue_mtx.lock();
    queue.push_back(queue_packet);
    queue_mtx.unlock();
    request_mtx.lock();
    if (!queue_packet.executed)
    {
        ZeroMemory(&packet_pool, sizeof(packet_pool));
        // move queue to packet_pool and clear queue
        queue_mtx.lock();
        auto iter = queue.begin();
        while (iter != queue.end())
            if (!(*iter).get().executed)
            {
                int current_count = packet_pool.packet_count++;
                packet_pool.packets[current_count] = (*iter).get().p_packet;
                (*iter).get().executed = true;
                queue.erase(iter++);
            }
            else ++iter;
        queue_mtx.unlock();
        // execute packets
        ExecutePackets(&packet_pool);
    }
    request_mtx.unlock();
    return p_packet->responseStatus;
}
The ParallelExecutePacket function can be called from multiple loops at the same time. I want packets to be processed in batches; more precisely, whichever thread gets there first should process the entire queue. The number of ExecutePackets calls would then drop without reducing the number of packets processed.
However, with multiple threads the total number of packets processed is about the same as what one thread processes on its own, and I don't understand why.
In my test, I created several threads and in each thread called ParallelExecutePacket in a loop.
The results are the number of processed requests per second.
Multithread:
Summ:91902
Thread 0 : 20826
Thread 1 : 40031
Thread 2 : 6057
Thread 3 : 12769
Thread 4 : 12219
Singlethread:
Summ:104902
Thread 0 : 104902
And if my version doesn't work, how can I implement what I need?

queue_mtx.lock();
auto iter = queue.begin();
while (iter != queue.end())
    queue.erase(iter++);
queue_mtx.unlock();
Only one execution thread locks the queue at a time, drains all messages from it, and then unlocks it. Even if a thousand execution threads are available, only one of them will be able to do any work. All others get blocked.
The length of time the queue_mtx is held must be minimized as much as possible; it should be no more than the absolute minimum it takes to pluck one message out of the queue, removing it completely, then unlocking the queue while all the real work is done.
int current_count = packet_pool.packet_count++;
packet_pool.packets[current_count] = (*iter).get().p_packet;
This appears to be the extent of the work that's done here. Currently the shown code enjoys the benefit of being protected by the queue_mtx. If this is no longer protected by it, then thread safety must be implemented here in some other way, if that's needed (it's unclear what any of this is, and whether there is a thread synchronization issue here at all).

You never drop request_mtx during the while loop. That loop includes ExecutePackets, so your thread blocks all the others until it finishes executing every task it finds.
Also note that you won't actually see any speed-up from this style of parallelism. To get n threads of parallelism out of this code, you need n callers calling into ParallelExecutePacket, which is exactly the parallelism you would get by just letting each caller do its own work. Indeed, statistically speaking you will find that almost always every thread just runs its own task. Every now and then you'll get contention that causes one thread to execute another's task, and when that happens both threads slow down to the slower of the two.
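For the batching behaviour the question is after, one workable shape is a leader/followers scheme: every caller enqueues its packet, exactly one caller at a time becomes the leader, drains a batch while holding the lock only briefly, runs ExecutePackets with the lock released, and then wakes the waiters. A minimal sketch, assuming the Packet, PacketPool and ExecutePackets declarations from the question, with the batch capped at the pool's capacity of 10:

#include <condition_variable>
#include <mutex>
#include <vector>

std::mutex batch_mtx;
std::condition_variable batch_cv;

struct QueuedPacket {
    Packet* packet;
    bool done = false;
};
std::vector<QueuedPacket*> waiting; // guarded by batch_mtx
bool leader_active = false;         // guarded by batch_mtx

bool ParallelExecutePacket(Packet* p) {
    QueuedPacket qp{ p };
    std::unique_lock<std::mutex> lock(batch_mtx);
    waiting.push_back(&qp);

    while (!qp.done) {
        if (!leader_active) {
            // Become the leader: grab a batch under the lock, run the
            // expensive call with the lock released, then wake everyone.
            leader_active = true;

            PacketPool pool{};
            std::vector<QueuedPacket*> batch;
            while (!waiting.empty() && pool.packet_count < 10) {
                QueuedPacket* q = waiting.back();
                waiting.pop_back();
                pool.packets[pool.packet_count++] = q->packet;
                batch.push_back(q);
            }

            lock.unlock();
            ExecutePackets(&pool);   // the expensive call, done unlocked
            lock.lock();

            for (QueuedPacket* q : batch)
                q->done = true;
            leader_active = false;
            batch_cv.notify_all();
        } else {
            batch_cv.wait(lock);     // a leader is already running a batch
        }
    }
    return p->responseStatus;
}

The lock is held only to shuffle pointers, so other callers can keep enqueueing while a batch executes, and the next leader forms as soon as the current one finishes.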

Related

How to maintain certain frame rate in different threads

I have two different computational tasks that have to execute at certain frequencies. One has to be performed every 1ms and the other every 13.3ms. The tasks share some data.
I am having a hard time figuring out how to schedule these tasks and how to share data between them. One approach I thought might work is to create two threads, one for each task.
The first task is relatively simple and can be completed within 1 ms. But when the second task (which is relatively more time-consuming) launches, it will make a copy of the data that was just used by task 1, and continue to work on that copy.
Do you think this would work? How can it be done in C++?
There are multiple ways to do that in C++.
One simple way is to have 2 threads, as you described. Each thread does its action and then sleeps till the next period start. A working example:
#include <functional>
#include <iostream>
#include <chrono>
#include <thread>
#include <atomic>
#include <mutex>

std::mutex mutex;
std::atomic<bool> stop = {false};
unsigned last_result = 0; // Whatever thread_1ms produces.

void thread_1ms_action() {
    // Do the work.

    // Update the last result.
    {
        std::unique_lock<std::mutex> lock(mutex);
        ++last_result;
    }
}

void thread_1333us_action() {
    // Copy thread_1ms result.
    unsigned last_result_copy;
    {
        std::unique_lock<std::mutex> lock(mutex);
        last_result_copy = last_result;
    }
    // Do the work.
    std::cout << last_result_copy << '\n';
}

void periodic_action_thread(std::chrono::microseconds period, std::function<void()> const& action) {
    auto const start = std::chrono::steady_clock::now();
    while (!stop.load(std::memory_order_relaxed)) {
        // Do the work.
        action();
        // Wait till the next period start.
        auto now = std::chrono::steady_clock::now();
        auto iterations = (now - start) / period;
        auto next_start = start + (iterations + 1) * period;
        std::this_thread::sleep_until(next_start);
    }
}

int main() {
    std::thread a(periodic_action_thread, std::chrono::milliseconds(1), thread_1ms_action);
    std::thread b(periodic_action_thread, std::chrono::microseconds(13333), thread_1333us_action);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    stop = true;
    a.join();
    b.join();
}
If executing an action takes longer than one period, the thread sleeps till the next period start (skipping one or more periods). I.e. each Nth action happens exactly at start_time + N * period, so there is no time drift regardless of how long the action takes.
All access to the shared data is protected by the mutex.
So I'm thinking that task1 needs to make the copy, because it knows when it is safe to do so. Here is one simplistic model:
Shared:
    std::atomic<Result*> latestResult{nullptr};

Task1:
    // Perform calculation
    Result* pNewResult = new Result;
    // Copy result into *pNewResult
    pNewResult = latestResult.exchange(pNewResult);
    delete pNewResult; // non-null means Task2 didn't take the previous one!

Task2:
    Result* pResult = latestResult.exchange(nullptr);
    if (pResult) {
        // Process result
        delete pResult;
    }
In this model task1 and task2 only ever contend when exchanging a single atomic pointer, which is quite painless.
Note that this makes many assumptions about your calculation. Could your task1 usefully calculate the result straight into the buffer, for example? Also note that at the start Task2 may find the pointer is still null.
Also it inefficiently new()s the buffers. You need 3 buffers to ensure there is never any significant contention between the tasks, but you could instead manage three buffer pointers under a mutex, such that Task 1 has one set of data ready and is writing another set, while Task 2 reads from a third.
Note that even if you have Task 2 copy the buffer, Task 1 still needs 2 buffers to avoid stalls.
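A sketch of that triple-buffer bookkeeping, with a hypothetical Result type; the mutex is only ever held to swap slot indices, never while computing or reading, so the two tasks essentially never stall each other:

#include <mutex>

struct Result { /* whatever the tasks share */ };

Result buffers[3];
std::mutex buf_mtx;
int ready = -1;   // newest completed buffer, or -1 if none
int writing = 0;  // slot Task 1 is currently filling
int reading = -1; // slot Task 2 is currently reading, or -1

// Task 1 calls this after filling buffers[writing].
void publish() {
    std::lock_guard<std::mutex> lock(buf_mtx);
    ready = writing;
    for (int i = 0; i < 3; ++i)  // pick the slot neither side is using
        if (i != ready && i != reading) { writing = i; break; }
}

// Task 2 calls this; on success it may then read buffers[reading]
// without holding the lock, since publish() never picks that slot.
bool fetch() {
    std::lock_guard<std::mutex> lock(buf_mtx);
    if (ready < 0) return false; // nothing new since the last fetch
    reading = ready;
    ready = -1;
    return true;
}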
You can use C++ threads and facilities like class thread, and timer classes like steady_clock, as described in the previous answer, but whether this solution works strongly depends on the platform your code is running on.
1 ms and 13.3 ms are pretty short time intervals, and if your code is running on a non-real-time OS like Windows or a non-RTOS Linux, there is no guarantee that the OS scheduler will wake your threads at the exact times.
C++11 has the class high_resolution_clock, which should use a high-resolution timer if your platform supports one, but it still depends on the implementation of this class. And a bigger problem than the timer is the C++ wait functions: neither sleep_until nor sleep_for guarantees that it will wake your thread at the specified time. Here is the quote from the C++ documentation:
sleep_for - blocks the execution of the current thread for at least the specified sleep_duration.
Fortunately, most OSes have special facilities, like Windows Multimedia Timers, that you can use if your threads are not woken at the expected times.
More details here: Precise thread sleep needed. Max 1ms error

How should boost::lockfree::spsc_queue's read_available and write_available be used?

Boost documentation for spsc_queue says:
read_available(): Thread-safe and wait-free, should only be called from the producer thread
write_available(): Thread-safe and wait-free, should only be called from the consumer thread
I would expect the most common use case to be just the other way around: producer thread (thread writing data to the queue) would need write_available(), and consumer thread (thread reading data from the queue) would need read_available().
If I need to know how much I can write to the queue in the producer thread, should I use QUEUE_CAPACITY - read_available()?
Any kind of size assessment is going to be a race in the lockfree world.
Simple reason being that the size might be changed on other threads before you act on the "measured size".
Single-Producer/Single-Consumer is special in the sense that the consumer knows nobody else can read from the queue (so read_available will never decrease unless the consumer reads it itself). Similarly for the producer side.
It is clear that write_available is what you need. Sure, there could be more room by the time you actually write, but you can't get more accurate, and at least it will never be less (after all, there is only one producer thread).
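In other words, on the producer thread write_available() is a safe lower bound: the consumer can only free slots, never occupy them. A small sketch, with an assumed element type and capacity:

#include <boost/lockfree/spsc_queue.hpp>
#include <algorithm>

boost::lockfree::spsc_queue<int, boost::lockfree::capacity<1024>> queue;

// Producer thread only. write_available() can only grow underneath us,
// so acting on it is safe; push() returns how many items actually went in.
std::size_t produce(const int* data, std::size_t n) {
    std::size_t room = queue.write_available();
    return queue.push(data, std::min(n, room));
}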
Note that the documentation appears to be in error (it swaps the allowed threads).
This made me double check, and sure enough the functions are used internally in the expected way (contradicting the erroneous claim in the documentation):
ConstIterator push(ConstIterator begin, ConstIterator end, T * internal_buffer, size_t max_size)
{
    const size_t write_index = write_index_.load(memory_order_relaxed); // only written from push thread
    const size_t read_index = read_index_.load(memory_order_acquire);
    const size_t avail = write_available(write_index, read_index, max_size);
I heartily suggest using the range-push shown here to automatically push the exact number of items possible. E.g.:

auto b = messages.begin(), e = messages.end();
do {
    b = queue.push(b, e); // pushes as many items as fit, returns the new begin
} while (b != e);

confusion with semaphore definitions

For semaphore implementations, what does process specify? In the context of the producer/consumer problem, is the process the producer method/Consumer method? Or is it P() if we are in P() and the value is less than 0?
P() {
    value = value - 1;
    if (value < 0) {
        add the calling process to this semaphore's list;
        block this process;
    }
}
EXAMPLE
If the consumer runs before the producer produces its first item:
The consumer would decrement the full value -> full = -1,
and then, since the value is less than 0, it would add the calling process to this semaphore's list. But I'm not sure what the process is.
And what does it mean to block this process? Does it mean that the entire consumer method halts, and the producer method runs?
code:
#define N 100
typedef int Semaphore;
Semaphore full = 0;   // Initially, no items in the buffer
Semaphore empty = N;  // Initially, N empty buffer slots
Semaphore mutex = 1;  // No thread updating the buffer

void producer(void) {
    int item;
    while (TRUE) {
        item = produce_item();
        down(&empty);
        down(&mutex);
        insert_item(item);
        up(&mutex);
        up(&full);
    }
}

void consumer(void) {
    int item;
    while (TRUE) {
        down(&full);
        down(&mutex);
        item = remove_item();
        up(&mutex);
        up(&empty);
        consume_item(item);
    }
}
A process, in this usage, is exactly like a thread. Usually when 'multiprocess' is used instead of 'multithreaded', it implies that the kernel handles the threading, which allows the computer to take advantage of multiple cores. However, that isn't important for this specific implementation, and it is also false for this specific implementation, because nothing here is atomic.
Blocking the process here means that a process that calls P() and decrements the value to anything negative will halt its own execution when it reaches the 'block this process' command.
Assuming multithreading, your 'producer' will continually decrease the empty semaphore unless it tries to decrement it below zero, in which case it will be halted and only the 'consumer' will run -- at least until 'consumer' increases the empty semaphore enough that 'producer' can run again. You can also swap 'empty'<->'full' and 'producer'<->'consumer' in the previous sentence, and it remains correct.
Also, I suggest you read up on semaphores elsewhere, because they are a basic part of threading/multiprocessing, and other people have described them better than I ever could. (Look at the producer/consumer example in the Wikipedia article on semaphores, for instance.)
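For concreteness, here is one common way a blocking P()/V() pair is implemented on top of a mutex and condition variable -- a sketch of the general idea, not the textbook's code (this variant blocks instead of letting the count go negative):

#include <condition_variable>
#include <mutex>

// A counting semaphore built from a mutex and a condition variable.
class Semaphore {
    std::mutex m;
    std::condition_variable cv;
    int value;
public:
    explicit Semaphore(int initial) : value(initial) {}

    void P() {                 // a.k.a. down / wait
        std::unique_lock<std::mutex> lock(m);
        while (value == 0)
            cv.wait(lock);     // "block this process" happens here
        --value;
    }

    void V() {                 // a.k.a. up / signal
        std::lock_guard<std::mutex> lock(m);
        ++value;
        cv.notify_one();       // wake one blocked process, if any
    }
};

The "add the calling process to this semaphore's list" step from the pseudocode is what cv.wait() does internally: the blocked thread is parked on the condition variable's wait queue until a V() wakes it.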

C++ pthread'ed process running slower than single thread issue

I was trying to run a function on multiple pthreads in order to increase efficiency and reduce runtime. This function performs a lot of matrix calculations and print statements. However, when I ran tests to compare performance, the single-threaded code ran faster.
My tests went as follows:
- For the single-threaded version: run a for loop 1..1000 that calls the function.
- For the multi-pthreaded version: spawn 100 pthreads, build a queue of 1000 items, use a pthread_cond_wait, and have the threads run the function until the queue is empty.
Here is my code for the pthreads (single-threaded is just a for loop instead):
#include <iostream>
#include <string>
#include <pthread.h>
#include <queue>
using namespace std;

#define NUM_THREADS 100

int main();
queue<int> testQueue;
void *playQueue(void* arg);
void matrix_exponential_test01();
void matrix_exponential_test02();
pthread_mutex_t queueLock;
pthread_cond_t queue_cv;

int main()
{
    pthread_t threads[NUM_THREADS];
    pthread_mutex_init(&queueLock, NULL);
    pthread_cond_init(&queue_cv, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
    {
        pthread_create(&threads[i], NULL, playQueue, (void*)NULL);
    }
    pthread_mutex_lock(&queueLock);
    for (int z = 0; z < 1000; z++)
    {
        testQueue.push(1);
        pthread_cond_signal(&queue_cv);
    }
    pthread_mutex_unlock(&queueLock);
    pthread_mutex_destroy(&queueLock);
    pthread_cond_destroy(&queue_cv);
    // pthread_cancel(NULL);
    return 0;
}

void* playQueue(void* arg)
{
    bool accept;
    while (true)
    {
        pthread_cond_wait(&queue_cv, &queueLock);
        accept = false;
        if (!testQueue.empty())
        {
            testQueue.pop();
            accept = true;
        }
        pthread_mutex_unlock(&queueLock);
        if (accept)
        {
            runtest();
        }
    }
    pthread_exit(NULL);
}
My intuition tells me that the multi-threaded version should run faster, but it doesn't. Is there a reason, or is my code faulty? I am using C++ on Windows, and had to download a library to use pthreads.
First, your code is written in a way that allows only one thread to run at any time (your mutex is locked the whole time the thread is doing work). So at best you can expect your code to be as fast as the single-threaded version.
Also, all threads are reading and writing the same memory all the time, which forces your CPU cores to keep synchronizing their caches, putting more load on the bus than a single thread would. Since you are not doing anything computationally expensive, it is likely that memory bandwidth is your actual bottleneck, so the bus load added by cache synchronization slows down your program. Take a look at http://en.wikipedia.org/wiki/False_sharing for more information.
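As an illustration of the false-sharing point, the usual mitigation is to give each thread's hot data its own cache line. A sketch; the 64 here is a common (but not universal) cache-line size:

#include <atomic>

// Both counters share one cache line: two cores incrementing them
// "independently" still bounce that line back and forth (false sharing).
struct Unpadded {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Forcing each counter onto its own cache line removes the bouncing.
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};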
If runtest() is CPU-bound -- that is, doesn't do anything which might block on I/O or the like -- then there's not much point starting 100 threads unless you have 100 CPUs/cores! [Edit: I now notice that runtest() does some print statements... file I/O probably won't block... so it won't release the CPU.]
The code as currently shown holds the mutex while filling the queue, so nothing will start until the queue is full. By the time the queue has been filled, signalling 1000 times along the way, hopefully all 100 threads will have started and be waiting on the mutex.
As currently shown, the waiting in playQueue() is broken. It should be something along the lines of:

pthread_mutex_lock(&queueLock);
while (testQueue.empty())
    pthread_cond_wait(&queue_cv, &queueLock);
// take the work item (or spot an end-of-work marker) while still locked
int val = testQueue.front();
testQueue.pop();
pthread_mutex_unlock(&queueLock);
But even when this is all sorted out, there is no guarantee you will see an improvement in performance unless runtest() does a serious amount of work. [Edit: I now notice that it does "a lot of matrix calculations", which sounds like it could be plenty of work.]
One small suggestion: starting the worker threads and filling the queue could be overlapped, by (for instance) starting one worker with the job of starting all the others, or starting a thread to fill the queue.
Without knowing more about the problem: if the work can be statically divided across the worker threads -- for instance, giving the first thread items 0..9, the second thread items 10..19, and so on -- then each worker can ignore all the others, reducing the number of synchronisation operations, as in the sketch below.
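A minimal sketch of that static division, reusing the question's NUM_THREADS and assuming runtest() is the per-item work:

#include <pthread.h>
#include <stdint.h>

#define NUM_THREADS 100
#define NUM_ITEMS 1000

void runtest(); // assumed: the per-item work from the question

// Each worker handles its own contiguous slice: no queue, no mutex,
// no condition variable.
void* worker(void* arg) {
    long id = (long)(intptr_t)arg;
    int per_thread = NUM_ITEMS / NUM_THREADS;
    int begin = id * per_thread;
    int end = (id == NUM_THREADS - 1) ? NUM_ITEMS : begin + per_thread;
    for (int i = begin; i < end; ++i)
        runtest();
    return NULL;
}

int main() {
    pthread_t threads[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; ++i)
        pthread_create(&threads[i], NULL, worker, (void*)(intptr_t)i);
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}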
In addition to the other good answers, you said your runtest() function does I/O.
So you could well be I/O bound, in which case all your threads have to wait in line like everybody else to empty out their buffers.

How to make thread synchronization without using mutex, semorphore, spinLock and futex?

This is an interview question; the interview is over.
How do you achieve thread synchronization without using a mutex, semaphore, spinlock, or futex?
Given 5 threads, how do you make 4 of them wait at the same point for a signal from the remaining thread?
That is, when threads 1, 2, 3 and 4 reach a certain point in their thread function, they stop and wait for a signal from thread 5; until thread 5 sends that signal, they must not proceed.
My idea:
Use a global bool variable as a flag. Each of the other threads waits at one point and sets its own flag variable to true; once thread 5 finds that all the threads' flag variables are true, it sets the global flag to true, releasing them.
It is a busy-wait.
Any better ideas?
Thanks
the pseudocode:

bool globalflag = false;
bool a[10] = { false };

int main()
{
    for (int i = 0; i < 10; i++)
        pthread_create(threadfunc, i); // pseudocode: start thread i

    while (1)
    {
        bool b = true;
        for (int i = 0; i < 10; i++)
            b = a[i] & b;
        if (b) break;
    }
    globalflag = true; // release the spinning threads
}

void threadfunc(i)
{
    a[i] = true;
    while (!globalflag);
}
Start with an empty linked list of waiting threads. The head should be set to 0.
Use CAS, compare and swap, to insert a thread at the head of the list of waiters. If the head is -1, then do not insert or wait. You can safely use CAS to insert items at the head of a linked list if you do it right.
After being inserted, the waiting thread should wait on SIGUSR1. Use sigwait() to do this.
When ready, the signaling thread uses CAS to set the head of the wait list to -1. This prevents any more threads from adding themselves to the wait list. The signaling thread then iterates over the threads in the wait list and calls pthread_kill(thread, SIGUSR1) to wake up each waiting thread.
If SIGUSR1 is sent before a call to sigwait, sigwait will return immediately. Thus, there will not be a race between adding a thread to the wait list and calling sigwait.
EDIT:
Why is CAS faster than a mutex? Layman's answer (I'm a layman): it is faster for some things in some situations, because it has lower overhead when there is NO race. So if you can reduce your concurrent problem down to needing to change 8/16/32/64/128 bits of contiguous memory, and a race is not going to happen very often, CAS wins. CAS is basically a slightly more fancy/expensive mov instruction right where you were going to do a regular mov anyway -- a "lock cmpxchg" or something like that.
A mutex, on the other hand, is a whole bunch of extra stuff that gets other cache lines dirty and uses more memory barriers, etc. (although CAS itself acts as a memory barrier on x86, x64, etc.). Then, of course, you have to unlock the mutex, which is probably about the same amount of extra stuff again.
Here is how you add an item to a linked list using CAS:

while (1)
{
    pOldHead = pHead;                  // snapshot of the world: the race starts here
    pItem->pNext = pOldHead;
    if (CAS(&pHead, pOldHead, pItem))  // race over, if pHead still equals pOldHead
        break;                         // success
}
So how often do you think your code is going to have multiple threads at that CAS line at the exact same time? In reality... not very often. We did tests that just looped, adding millions of items with multiple threads at the same time, and it happened way less than 1% of the time. In a real program, it might never happen.
Obviously if there is a race you have to go back and do that loop again, but in the case of a linked list, what does that cost you?
The downside is that you can't do very complex things to that linked list if you are going to use this method to add items to the head. Try implementing a doubly linked list -- what a pain.
EDIT:
In the code above I use a macro CAS. If you are using Linux, CAS = a macro using __sync_bool_compare_and_swap (see the gcc atomic builtins). If you are using Windows, CAS = a macro using something like InterlockedCompareExchange. Here is what the inline functions might look like on Windows:
inline bool CAS(volatile WORD* p, const WORD nOld, const WORD nNew) {
    return InterlockedCompareExchange16((volatile SHORT*)p, nNew, nOld) == nOld;
}
inline bool CAS(volatile DWORD* p, const DWORD nOld, const DWORD nNew) {
    return InterlockedCompareExchange((volatile LONG*)p, nNew, nOld) == nOld;
}
inline bool CAS(volatile QWORD* p, const QWORD nOld, const QWORD nNew) {
    return InterlockedCompareExchange64((volatile LONGLONG*)p, nNew, nOld) == nOld;
}
inline bool CAS(void* volatile* p, const void* pOld, const void* pNew) {
    return InterlockedCompareExchangePointer(p, (PVOID)pNew, (PVOID)pOld) == pOld;
}
1. Choose a signal to use, say SIGUSR1.
2. Use pthread_sigmask to block SIGUSR1.
3. Create the threads (they inherit the signal mask, hence the blocking must be done first!).
4. Threads 1-4 call sigwait, blocking until SIGUSR1 is received.
5. Thread 5 calls kill() or pthread_kill four times with SIGUSR1. Since POSIX specifies that signals will be delivered to a thread which is not blocking the signal, it will be delivered to one of the threads waiting in sigwait(). There is thus no need to keep track of which threads have already received the signal and which haven't, with the associated synchronization. A sketch of this recipe follows below.
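A minimal sketch of that recipe (POSIX assumed, error handling omitted; main plays the role of thread 5):

#include <pthread.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>

void* waiter(void* arg) {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    int sig = 0;
    sigwait(&set, &sig);                 // blocks until SIGUSR1 arrives
    printf("thread %ld released\n", (long)(intptr_t)arg);
    return NULL;
}

int main() {
    // Block SIGUSR1 *before* creating the threads so they inherit the mask.
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &set, NULL);

    pthread_t t[4];
    for (long i = 0; i < 4; ++i)
        pthread_create(&t[i], NULL, waiter, (void*)(intptr_t)i);

    // The "fifth thread" (main, here) releases the four waiters.
    for (int i = 0; i < 4; ++i)
        pthread_kill(t[i], SIGUSR1);

    for (int i = 0; i < 4; ++i)
        pthread_join(t[i], NULL);
    return 0;
}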
You can do this using SSE3's MONITOR and MWAIT instructions, available via the _mm_mwait and _mm_monitor intrinsics, Intel has an article on it here.
(there is also a patent for using memory-monitor-wait for lock contention here that may be of interest).
I think you are looking for Peterson's algorithm or Dekker's algorithm.
They synchronize threads using only shared memory.
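For reference, here is a sketch of Peterson's algorithm for two threads. std::atomic with its default sequentially consistent ordering stands in for the "plain shared memory" of the original formulation, since ordinary loads and stores would be reordered on modern hardware:

#include <atomic>

// Peterson's algorithm: mutual exclusion for exactly two threads
// (ids 0 and 1), built from nothing but shared memory.
std::atomic<bool> interested[2] = { {false}, {false} };
std::atomic<int> turn{0};

void lock(int id) {
    int other = 1 - id;
    interested[id].store(true); // declare intent to enter
    turn.store(other);          // politely yield the tie-break
    while (interested[other].load() && turn.load() == other)
        ;                       // busy-wait until it is our turn
}

void unlock(int id) {
    interested[id].store(false);
}

Thread 0 brackets its critical section with lock(0)/unlock(0), and thread 1 with lock(1)/unlock(1); the algorithm only works for exactly two threads (Dekker's algorithm is the same two-thread setting, and generalizations exist for N threads).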