I am using ALSA to read in an audio stream. I have set my period size to 960 frames, which are received every 20 ms. I read in the PCM values using: snd_pcm_readi(handle, buffer, period_size);
Every time I fill this buffer, I need to iterate through it and perform multiple checks on the received values. Iterating through it with a simple for loop takes too long, and I get a buffer overrun error on subsequent calls to snd_pcm_readi(). I have been told not to increase the ALSA buffer_size to prevent this from happening. A friend suggested I create a separate thread to iterate through this buffer and perform the checks. How would this work, given that I don't know exactly how long snd_pcm_readi() will take to fill the buffer? Knowing when to lock the buffer is a bit confusing to me.
A useful way to multithread an application for signal processing and computation is to use OpenMP (OMP). This avoids the developer having to use locking mechanisms themselves to synchronise multiple computation threads. Using locking is typically a bad thing in real-time application programming.
In this example, a for loop is multithreaded in the C language:
#include <omp.h>
#include <stdio.h>

int main(void) {
    const int N = 960;
    float audio[960] = {0}; // the signal to process
    #pragma omp parallel for
    for (int n = 0; n < N; n++) { // perform the function on each sample in parallel
        printf("n=%d thread num %d\n", n, omp_get_thread_num()); // remove this to speed up the code
        audio[n] *= 0.5f; // process your signal here (placeholder gain)
    }
    return 0;
}
You can see a concrete example of this in the gtkIOStream FIR code base. The FIR filter's channels are processed independently, one per available thread.
To initialise the OMP subsystem, specify the number of threads you want to use. To use all available threads:
int MProcessors = omp_get_max_threads();
omp_set_num_threads(MProcessors);
If you would prefer to look at an approach that uses locking techniques, then you could go for a concurrency pattern such as the one developed for Nuclear Processing.
I'm writing a C++ application in which I'll receive 4096 bytes of data every 0.5 seconds. This data is processed and the output is sent to another application. Processing each set of data takes nearly 2 seconds.
This is exactly how I'm doing it:
In my main function, I receive the data and push it into a vector.
I've created a thread which always processes the first element and deletes it immediately after processing. Below is a simulation of the receiving part of my application.
#include <iostream>
#include <unistd.h>
#include <vector>
#include <mutex>
#include <pthread.h>

using namespace std;

struct Student {
    int id;
    int age;
};

vector<Student> dustBin;
pthread_mutex_t lock1;
bool isEven = true;

void *processData(void *arg) {
    Student st1;
    while (true) {
        if (dustBin.size()) {
            printf("front: %d\tSize: %zu\n", dustBin.front().id, dustBin.size());
            st1 = dustBin.front();
            cout << "Currently Processing ID " << st1.id << endl;
            sleep(2);
            pthread_mutex_lock(&lock1);
            dustBin.erase(dustBin.begin());
            cout << "Deleted" << endl;
            pthread_mutex_unlock(&lock1);
        }
    }
    return NULL;
}
int main()
{
    pthread_t ptid;
    Student st;
    dustBin.clear();
    pthread_mutex_init(&lock1, NULL);
    pthread_create(&ptid, NULL, &processData, NULL);
    while (true) {
        for (int i = 0; i < 4096; i++) {
            st.id = i + 1;
            st.age = i + 2;
            pthread_mutex_lock(&lock1);
            dustBin.push_back(st);
            printf("Pushed: %d\n", st.id);
            pthread_mutex_unlock(&lock1);
            usleep(500000);
        }
    }
    pthread_join(ptid, NULL);
    pthread_mutex_destroy(&lock1);
}
The output of this code is shown in an image in the original post. There you can observe the exact sequence of the processing: only one item is processed for every 4 insertions.
Note that the reception time of the data is much smaller than the processing time.
Because of this, my input buffer is growing very rapidly. Also, since the main thread and the processData thread share a mutex, each depends on the other to release the lock; this sometimes blocks my incoming buffer and leads to data misses. Please suggest how to handle this, or suggest some method to do it.
Thanks & Regards
Vamsi
Undefined behavior
When you read data, you must lock the mutex before getting the size; calling dustBin.size() or dustBin.front() while the producer thread may be modifying the vector is undefined behavior.
Busy waiting
You should always avoid a tight loop that does nothing. Here, if dustBin is empty, you will immediately check it again, forever, which will use 100% of that core, slow down everything else, drain the laptop battery and make it hotter than it should be. It is a very bad idea to write such code!
Learn multithreading first
You should read a book or 2 on multithreading. Doing multithreading right is hard and almost impossible without taking time to learn it properly. C++ Concurrency in Action is highly recommended for standard C++ multithreading.
Condition variable
Usually you will use a condition variable, or some sort of event, to tell the consumer thread when data is added, so that it does not wake up uselessly to check whether that is the case.
Since you have a typical producer/consumer, you should be able to find a lot of information on how to do it, as well as special containers or other constructs that will help implement your code.
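As a minimal sketch of that idea (reusing the Student struct from the question; process() is a hypothetical stand-in for the 2-second job):
#include <condition_variable>
#include <deque>
#include <mutex>

struct Student { int id; int age; };

static std::deque<Student> pending;
static std::mutex m;
static std::condition_variable cv;

void process(const Student& st) { /* the slow 2-second job goes here */ }

void consumer() {
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !pending.empty(); }); // sleeps instead of busy waiting
        Student st = pending.front();
        pending.pop_front();
        lock.unlock(); // release before the slow processing step
        process(st);
    }
}

void produce(const Student& st) {
    {
        std::lock_guard<std::mutex> lock(m);
        pending.push_back(st);
    }
    cv.notify_one(); // wake the consumer only when data actually arrives
}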
Output
Your printf and cout calls will have an impact on performance, and since some are inside a lock and others are not, you will probably get improperly interleaved output. If you really need output, a third thread might be a better option. In any case, you want to minimize the time you hold a lock, so formatting into a temporary buffer might be a good idea too.
By the way, standard output is relatively slow and it is perfectly possible that it might even be the reason why you are not able to process rapidly all data.
Processing rate
Obviously if you are able to produce 4096 bytes of data every 0.5 second but need 2 seconds to process that data, you have a serious problem.
You should really think about what you want to do in such a case before asking a question here, as without that information we are only guessing at possible solutions.
Here are some possibilities:
Slow down the producer. Obviously, this does not work if you get data in real time.
Optimize the consumer (better algorithms, better hardware, optimal parallelism…)
Skip some data
Obviously, for performance problems you should use a profiler to know where you lose your time. Once you know that, you will have a better idea of where to look to improve your code.
Taking 2 seconds to process the data is really slow, but we cannot help you since we have no idea what your code is doing.
For example, if you add the data to a database and the database cannot keep up, you might want to batch multiple inserts into a single command to reduce the overhead of communicating with the database over the network.
Another example: if you append the data to a file, you might want to keep the file open and accumulate some data before doing each write.
Container
A vector is not a good choice if you remove items from the head one by one and its size becomes somewhat large (say more than 100 small items), since every remaining item has to be moved each time.
In addition to changing the container as suggested in a comment, another possibility would be to use 2 vectors and swap them. That way, you can reduce the number of times you lock the mutex and process many items without needing a lock.
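A hedged sketch of that swap approach (again reusing the mutex and Student type from the question; process() stands in for the real work):
#include <mutex>
#include <vector>

struct Student { int id; int age; };

static std::vector<Student> incoming; // the producer pushes here under `m`
static std::mutex m;

void process(const Student& st) { /* the slow per-item work */ }

void consumeBatch() {
    std::vector<Student> batch;
    {
        std::lock_guard<std::mutex> lock(m);
        batch.swap(incoming); // O(1): steals every queued item at once
    }
    for (const Student& st : batch)
        process(st); // all the slow work happens outside the lock
}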
How to optimize
You should accumulate enough data (say 30 seconds' worth), stop accumulating, and then test your processing speed with that data. If you cannot process that data in less than about half the time (15 seconds), then you clearly need to improve your processing speed one way or another. Once your consumer(s) are fast enough, you can optimize communication from the producer to the consumer(s).
You have to know whether your bottleneck is I/O, the database, or something else, and whether some parts might be done in parallel.
There are probably a lot of optimizations that can be done in the code you have not shown...
If you can't handle messages fast enough, you have to drop some.
Use a circular buffer of a fixed size.
Then if the producer is faster than the consumer, older entries will be overwritten.
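A minimal sketch of such a circular buffer, where the producer simply overwrites the oldest entry instead of blocking (the element type and capacity are placeholders):
#include <array>
#include <cstddef>
#include <mutex>

template <typename T, std::size_t N>
class DroppingRing {
    std::array<T, N> buf;
    std::size_t head = 0, count = 0; // head indexes the oldest element
    std::mutex m;
public:
    void push(const T& v) {
        std::lock_guard<std::mutex> lock(m);
        buf[(head + count) % N] = v;
        if (count < N) ++count;
        else head = (head + 1) % N; // full: the oldest entry is overwritten
    }
    bool pop(T& out) {
        std::lock_guard<std::mutex> lock(m);
        if (count == 0) return false; // nothing to consume
        out = buf[head];
        head = (head + 1) % N;
        --count;
        return true;
    }
};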
If you cannot skip some data and you cannot process it fast enough, you are doomed.
Create two const variables, NBUFFERS and NTHREADS; make them both 8 initially if you have 16 cores and your processing is 4x too slow. Play with these values later.
Create NBUFFERS data buffers, each big enough to hold 4096 samples. In practice, just create a single large buffer and divide it up with offsets.
Start NTHREADS threads. Each will continuously wait to be told which buffer to process, process it, and then wait again for another buffer.
In your main program, go into a loop, receiving data. Receive the first 4096 samples into the first buffer and notify the first thread. Receive the second 4096 samples into the second buffer and notify the second thread. Then, after each reception:
buffer = (buffer + 1) % NBUFFERS
thread = (thread + 1) % NTHREADS
Rinse and repeat. As you have 8 threads, and data only arrives every 0.5 seconds, each thread will only get a new buffer every 4 seconds but needs only 2 seconds to clear the previous buffer.
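A compact sketch of this scheme, assuming C++20 std::binary_semaphore; receive() and process() are hypothetical stand-ins for the real input source and the 2-second computation:
#include <semaphore>
#include <thread>
#include <vector>

constexpr int NBUFFERS = 8, NTHREADS = 8, NSAMPLES = 4096;
static short pool[NBUFFERS * NSAMPLES]; // one large buffer, divided by offsets

void receive(short* dst, int n) { /* blocking read of n samples */ }
void process(short* src, int n) { /* the 2-second computation */ }

struct Worker {
    std::binary_semaphore ready{0}; // signalled when `buffer` has been filled
    int buffer = -1;
};
static Worker workers[NTHREADS];

void workerLoop(Worker& w) {
    for (;;) {
        w.ready.acquire(); // wait to be told which buffer to process
        process(&pool[w.buffer * NSAMPLES], NSAMPLES);
    }
}

int main() {
    std::vector<std::thread> threads;
    for (Worker& w : workers)
        threads.emplace_back(workerLoop, std::ref(w));
    int buffer = 0, thread = 0;
    for (;;) {
        receive(&pool[buffer * NSAMPLES], NSAMPLES); // next 4096 samples
        workers[thread].buffer = buffer;
        workers[thread].ready.release(); // notify that thread
        buffer = (buffer + 1) % NBUFFERS;
        thread = (thread + 1) % NTHREADS;
    }
}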
I am using the OpenCL 1.2 C++ wrapper for my project. I want to know the correct method of calling my kernel. In my case, I have 2 devices and the data should be sent to them simultaneously.
I am dividing my data into two chunks and both the devices should be able to perform computations on them separately. They have no interconnection and they don't need to know what is happening in the other device.
When the data is sent to both devices, I want to wait for the kernels to finish before my program goes further, because I will be using the results returned from both kernels. So I don't want to start reading the data before the kernels have returned.
I have 2 methods. Which one is programmatically correct in my case:
Method 1:
for (int i = 0; i < numberOfDevices; i++) {
    // Enqueue the kernel.
    kernelGA(cl::EnqueueArgs(queue[i],
             arguments etc...);
    queue[i].flush();
}
// Wait for the kernels to return.
for (int i = 0; i < numberOfDevices; i++) {
    queue[i].finish();
}
Method 2:
for (int i = 0; i < numberOfDevices; i++) {
    // Enqueue the kernel.
    kernelGA(cl::EnqueueArgs(queue[i],
             arguments etc...);
}
for (int i = 0; i < numberOfDevices; i++) {
    queue[i].flush();
}
// Wait for the kernels to return.
for (int i = 0; i < numberOfDevices; i++) {
    queue[i].finish();
}
Or is neither of them correct, and is there a better way to wait for my kernels to return?
Assuming each device computes in its own memory:
I would go for a multithreaded (for) loop version of your Method 1, because OpenCL does not force vendors to enqueue asynchronously. Nvidia, for example, does synchronous enqueuing for some drivers and hardware, while AMD enqueues asynchronously.
When each device is driven by a separate thread, they should enqueue write+compute together before synchronizing for reading partial results (a second threaded loop).
Having multiple threads is also advantageous for spin-wait type synchronization (clFinish), because multiple spin-wait loops run in parallel. This should save time on the order of a millisecond.
Flush helps some vendors, like AMD, start executing the enqueued commands early.
To have correct input and correct output for all devices, only two finish commands are enough: one after write+compute, then one after read (of the results). That way each device gets the same time-step data and produces results at the same time step. Write and compute don't need a finish between them if the queue type is in-order, because an in-order queue computes commands one by one. This also means the read operations don't need to be blocking.
Unnecessary finish commands always kill performance.
Note: I have already written a load balancer using all of this, and it performs better when using event-based synchronization instead of finish. Finish is easier, but it has bigger synchronization overhead than an event-based approach.
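For reference, a hedged sketch of event-based completion with the OpenCL C++ bindings (buffer names and sizes are placeholders, and it assumes kernelGA here is the underlying cl::Kernel rather than the functor, with one in-order queue per device):
std::vector<cl::Event> done(numberOfDevices);
for (int i = 0; i < numberOfDevices; i++) {
    // Non-blocking write and kernel launch on this device's queue.
    queue[i].enqueueWriteBuffer(inBuf[i], CL_FALSE, 0, chunkBytes, hostIn[i]);
    queue[i].enqueueNDRangeKernel(kernelGA, cl::NullRange,
                                  cl::NDRange(globalSize), cl::NullRange);
    // Non-blocking read; done[i] fires when this device's results are ready.
    queue[i].enqueueReadBuffer(outBuf[i], CL_FALSE, 0, chunkBytes, hostOut[i],
                               nullptr, &done[i]);
    queue[i].flush(); // push the whole batch to the device
}
cl::Event::waitForEvents(done); // one wait instead of per-queue finish()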
Also, a single queue doesn't always push a GPU to its limits. Using at least 4 queues per device ensures latency hiding of write and compute on my AMD system. Sometimes even 16 queues help a bit more, but I/O-bottlenecked situations may need even more.
Example:
Thread 1: write, compute, synchronize with the other thread
Thread 2: write, compute, synchronize with the other thread
Thread 1: read, synchronize with the other thread
Thread 2: read, synchronize with the other thread
Gratuitous synchronization kills performance, because drivers don't know your intention and leave it as it is, so you should eliminate unnecessary finish commands and convert blocking writes to non-blocking ones where you can.
Zero synchronization is also wrong, because OpenCL doesn't force vendors to start computing after several enqueues; the queued work may grow indefinitely to gigabytes of memory in minutes or even seconds.
You should use Method 1. clFlush is the only way of guaranteeing that commands are issued to the device (and not just buffered somewhere before sending).
What is the best solution if I want to parallelize only the loop and keep the saving to file sequential, using OpenMP? I have a file with a large volume of information that I would like to split into equal chunks (16 bytes each) and encrypt using OpenMP (multithreaded programming in C++). After the encryption, these chunks must be stored in a single file, in the same sequence as the original.
i_count = meta->subchunk2_size - meta->subchunk2_size % 16; // TO GET THE EXACT LENGTH MOD 16

// Get the number of processors in this system
int iCPU = omp_get_num_procs();

// Now set the number of threads
omp_set_num_threads(iCPU);

#pragma omp parallel for ordered
for (i = 0; i < i_count; i += 16)
{
    fread(Store, sizeof(char), 16, Rfile); // read
    ENCRYPT();
    #pragma omp ordered
    fwrite(Store, sizeof(char), 16, Wfile); // write
}
The program is supposed to work in parallel with only the saving to file sequential, but in practice it runs completely sequentially.
You're much better off reading the whole file into a buffer in one thread, processing the buffer in parallel without using ordered, and then writing the buffer in one thread. Something like this:
fread(Store, sizeof(char), i_count, Rfile); // read
#pragma omp parallel for schedule(static)
for (i = 0; i < i_count; i += 16) {
    ENCRYPT(&Store[i]); // ENCRYPT acts on 16 bytes at a time
}
fwrite(Store, sizeof(char), i_count, Wfile); // write
If the file is too big to read in all at once, then do it in chunks in a loop. The main point is that the ENCRYPT function should be much slower than reading and writing files; otherwise there is no point in using multiple threads, because you can't really speed up file I/O with multiple threads anyway.
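A hedged sketch of that chunked variant (assuming, as above, that ENCRYPT operates in place on 16 bytes):
#define CHUNK (1 << 20) /* 1 MiB per chunk, a multiple of 16 */
static unsigned char buf[CHUNK];
size_t n;
while ((n = fread(buf, 1, CHUNK, Rfile)) > 0) {
    n -= n % 16; /* ignore a trailing partial block, as in the original code */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i += 16) {
        ENCRYPT(&buf[i]); /* each 16-byte block is independent */
    }
    fwrite(buf, 1, n, Wfile); /* write the chunk before reading the next */
}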
I need to decode audio data as fast as possible using the Opus decoder.
Currently my application is not fast enough.
The decoding is as fast as it can get, but I need to gain more speed.
I need to decode about 100 sections of audio. These sections are not consecutive (they are not related to each other).
I was thinking about using multi-threading so that I don't have to wait until one of the 100 decodings is completed. In my dreams I could start everything in parallel.
I have not used multithreading before.
I would therefore like to ask if my approach is generally fine or if there is a thinking mistake somewhere.
Thank you.
This answer is probably going to need a little refinement from the community, since it's been a long while since I worked in this environment, but here's a start -
Since you're new to multi-threading in C++, start with a simple project to create a bunch of pthreads doing a simple task.
Here's a quick and small example of creating pthreads:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void* ThreadStart(void* arg);

int main(int argc, char** argv) {
    pthread_t thread1, thread2;
    int* threadArg1 = (int*)malloc(sizeof(int));
    int* threadArg2 = (int*)malloc(sizeof(int));
    *threadArg1 = 1;
    *threadArg2 = 2;
    pthread_create(&thread1, NULL, &ThreadStart, (void*)threadArg1);
    pthread_create(&thread2, NULL, &ThreadStart, (void*)threadArg2);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    free(threadArg1);
    free(threadArg2);
}

void* ThreadStart(void* arg) {
    int threadNum = *((int*)arg);
    printf("hello world from thread %d\n", threadNum);
    return NULL;
}
Next, you're going to be using multiple opus decoders. Opus appears to be thread safe, so long as you create separate OpusDecoder objects for each thread.
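For reference, creating a separate decoder per thread with the libopus C API looks roughly like this (the 48 kHz stereo parameters are assumptions):
#include <opus/opus.h>

int err;
OpusDecoder* dec = opus_decoder_create(48000, 2, &err); // one decoder per thread
if (err != OPUS_OK || dec == NULL) {
    // handle the error
}
// ... decode only this thread's packets with opus_decode(dec, ...) ...
opus_decoder_destroy(dec);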
To feed jobs to your threads, you'll need a list of pending work units that can be accessed in a thread safe manner. You can use std::vector or std::queue, but you'll have to use locks around it when adding to it and when removing from it, and you'll want to use a counting semaphore so that the threads will block, but stay alive, while you slowly add workunits to the queue (say, buffers of files read from disk).
Here's some example code, similar to the above, that shows how to use a shared queue and how to make the threads wait while you fill the queue:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <queue>
#include <semaphore.h>
#include <unistd.h>

void* ThreadStart(void* arg);

static std::queue<int> workunits;
static pthread_mutex_t workunitLock;
static sem_t workunitCount;

int main(int argc, char** argv) {
    pthread_t thread1, thread2;
    pthread_mutex_init(&workunitLock, NULL);
    sem_init(&workunitCount, 0, 0);
    pthread_create(&thread1, NULL, &ThreadStart, NULL);
    pthread_create(&thread2, NULL, &ThreadStart, NULL);

    // Make a bunch of workunits while the threads are running.
    for (int i = 0; i < 200; i++) {
        pthread_mutex_lock(&workunitLock);
        workunits.push(i);
        sem_post(&workunitCount);
        pthread_mutex_unlock(&workunitLock);

        // Pretend that it takes some effort to create work units;
        // this shows that the threads really do block patiently
        // while we generate workunits.
        usleep(5000);
    }

    // Sometime in the next while, the threads will be blocked on
    // sem_wait because they're waiting for more workunits. None
    // of them are quitting because they never saw an empty queue.
    // Pump the semaphore once for each thread so they can wake
    // up, see the empty queue, and return.
    sem_post(&workunitCount);
    sem_post(&workunitCount);

    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    pthread_mutex_destroy(&workunitLock);
    sem_destroy(&workunitCount);
}

void* ThreadStart(void* arg) {
    int workUnit;
    bool haveUnit;
    do {
        sem_wait(&workunitCount);
        pthread_mutex_lock(&workunitLock);

        // Figure out if there's a unit, grab it under
        // the lock, then release the lock as soon as we can.
        // After we release the lock, then we can 'process'
        // the unit without blocking everybody else.
        haveUnit = !workunits.empty();
        if (haveUnit) {
            workUnit = workunits.front();
            workunits.pop();
        }
        pthread_mutex_unlock(&workunitLock);

        // Now that we're not under the lock, we can spend
        // as much time as we want processing the workunit.
        if (haveUnit) {
            printf("Got workunit %d\n", workUnit);
        }
    } while (haveUnit);
    return NULL;
}
You would break your work up by task. Let's assume your process is in fact CPU bound (you indicate it is but… it's not usually that simple).
Right now, you decode 100 sections:
I was thinking about using multi-threading so that I don't have to wait until one of the 100 decodings is completed. In my dreams I could start everything in parallel.
Actually, you should use a number close to the number of cores on the machine.
Assuming a modern desktop (e.g. 2-8 cores), running 100 threads at once will just slow things down; the kernel will waste a lot of time switching from one thread to another, and the process is also likely to hit higher peak resource usage and contend for similar resources.
So just create a task pool which restricts the number of active tasks to the number of cores. Each task would (generally) represent the decoding work to perform for one input file (section). This way, the decoding process is not actually sharing data across multiple threads (allowing you to avoid locking and other resource contention).
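A minimal sketch of such a pool, where each worker claims the next section index from a shared atomic counter (decodeSection() is a hypothetical stand-in for decoding one independent section):
#include <atomic>
#include <thread>
#include <vector>

void decodeSection(int section) { /* decode one independent section */ }

void decodeAll(int sections) {
    std::atomic<int> next{0};
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 2; // fallback when the core count is undetectable
    std::vector<std::thread> pool;
    for (unsigned c = 0; c < cores; ++c)
        pool.emplace_back([&] {
            // Each worker repeatedly claims the next undecoded section.
            for (int i; (i = next.fetch_add(1)) < sections; )
                decodeSection(i);
        });
    for (std::thread& t : pool)
        t.join(); // all sections are decoded once every worker returns
}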
When complete, go back and fine-tune the number of threads in the task pool (e.g. using the exact same inputs and a stopwatch on multiple machines). The fastest count may be lower or higher than the number of cores (most likely because of disk I/O). It also helps to profile.
I would therefore like to ask if my approach is generally fine or if there is a thinking mistake somewhere.
Yes, if the problem is CPU bound, then that is generally fine. This also assumes your decoder/dependent software is capable of running with multiple threads.
The problem you will run into, if these are files on disk, is that you will probably need to optimize how you read (and write?) the files from many cores. Allowing 8 jobs to run at once can make your problem disk bound, and 8 simultaneous readers/writers is a bad way to use hard disks, so you may find that it is not as fast as you expected. Therefore, you may need to optimize I/O for your concurrent decode implementation. Larger buffer sizes can help in this regard, but they come at a cost in memory.
Instead of creating your own threads and managing them, I suggest you use a thread pool and give your decoding tasks to the pool. The pool will assign tasks to as many threads as it and the system can handle. There are different types of thread pools, so you can set parameters such as forcing a specific number of threads or allowing the pool to keep increasing the number of threads.
One thing to keep in mind is that more threads doesn't mean they execute in parallel. I think the correct term is concurrently, unless you have a guarantee that each thread runs on a different CPU (which would give true parallelism).
Your entire pool can come to a halt if blocked for IO.
Before jumping into multithreading as a solution to speed things up, study the concepts of oversubscription and undersubscription.
If the processing of the audio involves long blocking I/O calls, then multithreading is worth it.
Although the vagueness of your question doesn't really help, how about:
Create a list of audio files to convert.
While there is a free processor,
launch the decoder application with the next file in the queue.
Repeat until there is nothing else in the list
If, during testing, you discover the processors aren't always 100% busy, launch 2 decodes per processor.
It could be done quite easily with a bit of bash/tcl/python.
You can use threads in general, but locking has some issues. I will base the answer around POSIX threads and locks, but this is fairly general and you will be able to port the idea to any platform. If your jobs require any kind of locking, you may find the following useful. Also, it is best to keep reusing existing threads again and again, because thread creation is costly (see thread pooling).
Locking is a bad idea in general for "real-time" audio since it adds latency, but for non-real-time jobs such as decoding/encoding, locks are perfectly OK. Even for real-time jobs you can get better performance, without dropping frames, by using some threading knowledge.
For audio, semaphores are a bad, bad idea. They were too slow on at least my system (POSIX semaphores) when I tried them, but you will need them if you are thinking of cross-thread locking (not the type of locking where you lock and unlock in the same thread). POSIX mutexes only allow self-lock and self-unlock (you have to do both in the same thread); otherwise the program might work, but it's undefined behavior and should be avoided.
Most lock-free atomic operations may give you enough freedom from locks to use some functionality (like locking) but with better performance.
I'm developing a Windows Forms application that aims to perform real-time image manipulation. The user interface is a key element, and I want to use multithreading to perform image processing tasks separately from the UI thread. Moreover, due to the intensity of the calculations, I wish to perform the image processing on multiple threads.
One way I have managed to do this is by using the ThreadStart delegate to do the processing on multiple threads. This works quite well, but I am explicitly creating each thread. Another approach I have looked at is using OpenMP to perform processing tasks on multiple threads. OpenMP seems to be a much simpler approach since it automatically distributes the processing across all available threads. This would be particularly useful where the program is run on different computers with different numbers of cores. However, I find that user responsiveness is lacking with the OpenMP approach, and I have come to the conclusion that this is because calculations are being done on the UI thread as well.
I hence tried to combine the two approaches: I start a new thread using ThreadStart, and from it call a function that performs the image processing, using OpenMP to parallelize a for loop. However, when I do this the program does not use all of the threads available to it (it seems to use only 2 or 3 out of 8).
Hence my questions are the following: Would it be bad practice to try to do the multithreading in these two different ways? Is there a way to successfully combine the approaches so that the image processing runs on all available threads while the UI thread handles user input? Or perhaps there is a simpler implementation using only one of the above approaches? I realise that I could probably create threads dynamically using ThreadStart, but this seems like a more complicated approach than if I could use OpenMP instead.
Here is some pseudocode showing what I want to do (note: ProcessedData and ImageData are unsigned char arrays of pixel data):
//this sets up a new thread when called for in the form
public: static void ThreadProc()
{
    ProcessedData = compute(ImageData); //this is the image processing; there are other variables as well as ImageData
};

//this computes the processed image when the user moves a label/node
private: void labelnode_MouseMove(Object^ /*sender*/, System::Windows::Forms::MouseEventArgs^ e) {
    //if not being moved or left mouse button not used, exit
    if (!bMoving || e->Button != System::Windows::Forms::MouseButtons::Left)
    {
        return;
    }
    Thread^ oThread = gcnew Thread(gcnew ThreadStart(&Form1::ThreadProc));
    oThread->Start(); //launch new thread to do calculations for image processing
};
The compute function contains the following parallel loop:
#define CHUNKSIZE 1
#pragma omp parallel
{
    #pragma omp for schedule(dynamic, CHUNKSIZE)
    for (i = 0; i < numberofrows; i++)
    {
        //Process all of the pixels in the given row, putting results in the processed data array.
    }
}
UPDATE: It turns out that the debugging tools were preventing the full performance of the application; running the exe outside the debugger works as it should. However, I now have a memory leak which causes a crash after the process has been repeated enough times. I'm sure there was no memory leak before I parallelised with OpenMP, so I'm wondering if there is some residual memory left over from the continuous opening and closing of threads. Any ideas?
Your approach is right: it's a recommended pattern to keep the main thread for UI responsiveness and start a separate thread for the compute-intensive job, using OpenMP (or any other parallel framework) to process that job in parallel.
Why it does not use all HW threads/cores this way is a separate question. It might be due to load imbalance, some synchronization issues, significant serial portions of the computation, or an amount of work insufficient to keep all threads busy all the time.