CPU cores not utilized properly using QThreads - C++

Using: C++ (MinGW), Qt 4.7.4, Windows Vista, Intel Core 2 (vPro)
I need to process 2 huge files in exactly the same way, so I would like to call the processing routine from 2 separate threads for 2 separate files. The GUI thread does nothing heavy; it just displays a label and runs an event loop to check for emission of thread termination signals, quitting the main application accordingly. I expected this to utilize the two cores (Intel Core 2) roughly equally, but on the contrary I see from Task Manager that one of the cores is highly utilized and the other is not (though not every time I run the code). Also, the time taken to process the 2 files is much more than the time taken to process one file: I thought it should have been equal or a little more, but it is almost the same as processing the 2 files one after another in a non-threaded application. Can I somehow force the threads to use the cores that I specify?
QThread* ptrThread1=new QThread;
QThread* ptrThread2=new QThread;
ProcessTimeConsuming* ptrPTC1=new ProcessTimeConsuming();
ProcessTimeConsuming* ptrPTC2=new ProcessTimeConsuming();
ptrPTC1->moveToThread(ptrThread1);
ptrPTC2->moveToThread(ptrThread2);
//make connections to specify what to do when processing ends, threads terminate etc
//display some label to give an idea that the code is in execution
ptrThread1->start();
ptrThread2->start(); //i want this thread to be executed in the core other than the one used above
ptrQApplication->exec(); //GUI event loop for label display and signal-slot monitoring

Reading in parallel from a single mechanical disk often (and probably in your case) will not yield any performance gain, since the disk head needs to move every time to seek to the next reading location, effectively making your reads sequential. Worse, if many threads are trying to read, performance may even degrade with respect to the sequential version, because the disk head is bounced between different locations on the disk and thus has to seek back to where it left off every time.
Generally, you cannot do better than reading the files in sequence and then processing them in parallel, perhaps using a producer-consumer model.
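To make the producer-consumer idea concrete, here is a minimal sketch in plain C++11 (my own illustration, not the asker's Qt code; readWholeFile(), processData() and the file names are hypothetical): the disk is touched by one thread at a time, and only the CPU-heavy processing runs concurrently.
#include <fstream>
#include <iterator>
#include <thread>
#include <vector>

std::vector<char> readWholeFile(const char* path)        // sequential disk access
{
    std::ifstream in(path, std::ios::binary);
    return std::vector<char>((std::istreambuf_iterator<char>(in)),
                             std::istreambuf_iterator<char>());
}

void processData(const std::vector<char>& data)          // the CPU-heavy part
{
    // ... per-file number crunching goes here ...
    (void)data;
}

int main()
{
    // Read one file after the other so the disk never seeks between them.
    std::vector<char> file1 = readWholeFile("huge1.dat");
    std::vector<char> file2 = readWholeFile("huge2.dat");

    // Only the processing runs concurrently; each thread can land on its own core.
    std::thread t1([&] { processData(file1); });
    std::thread t2([&] { processData(file2); });
    t1.join();
    t2.join();
}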

With mechanical hard drives, you need to explicitly control the ratio of time spent doing sequential reads vs. time spent seeking. The canonical way of doing it is with n+m objects running on m+min(n, QThread::idealThreadCount()) threads. Here, m is the number of hard drives that the files are on, and n is the number of files.
Each of the m objects reads files from a given hard drive in a round-robin fashion. Each read must be sufficiently large. On modern hard drives, let's budget 70 Mbytes/s of bandwidth (you can benchmark the real value) and 5 ms per seek. To waste at most 10% of the bandwidth, you can only spend 100 ms per second seeking, i.e. 100 ms / (5 ms/seek) = 20 seeks per second. Thus you must read at least 70 Mbytes / (20 + 1) ≈ 3.3 Mbytes from each file before moving on to the next one. The reader thread fills a buffer with file data, and the buffer then signals the relevant computation object that is attached to the other side of the buffer. When a buffer is busy, you simply skip reading from that file until the buffer becomes available again.
The other n objects are computation objects; they perform a computation upon a signal from a buffer indicating that the buffer is full. As soon as the buffer data is no longer needed, the buffer is "reset" so that the file reader can refill it.
All reader objects need their own threads. The computation objects can be distributed among their own threads in a round-robin fashion, so that the threads all have within +1, -0 objects of each other.
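Below is a rough sketch of the reader/computation split described above (my own illustration, not the answerer's code; it uses plain std::thread and a polling worker loop purely to keep it short, where the Qt version would use signals and slots instead).
#include <atomic>
#include <cstddef>
#include <fstream>
#include <functional>
#include <thread>
#include <vector>

const std::size_t kChunk = 3300 * 1024;           // ~3.3 MB per read, as computed above

struct FileSlot {
    std::ifstream stream;
    std::vector<char> buffer;
    std::atomic<bool> full{false};                // set by the reader, cleared by the worker
    std::atomic<bool> done{false};
};

void readerLoop(std::vector<FileSlot>& slots)     // one reader per hard drive
{
    bool anyLeft = true;
    while (anyLeft) {
        anyLeft = false;
        for (FileSlot& s : slots) {               // round robin over the files
            if (s.done) continue;
            anyLeft = true;
            if (s.full) continue;                 // its worker is still busy: skip this file
            s.buffer.resize(kChunk);
            s.stream.read(s.buffer.data(), kChunk);
            s.buffer.resize(static_cast<std::size_t>(s.stream.gcount()));
            if (s.buffer.empty()) { s.done = true; continue; }
            s.full = true;                        // hand the chunk to the computation object
        }
    }
}

void workerLoop(FileSlot& s)                      // one computation object per file
{
    while (!s.done || s.full) {
        if (!s.full) { std::this_thread::yield(); continue; }
        // processChunk(s.buffer);                // the actual computation goes here
        s.full = false;                           // "reset" the buffer so it can be refilled
    }
}

int main()
{
    std::vector<FileSlot> slots(2);               // two files on the same drive (hypothetical names)
    slots[0].stream.open("huge1.dat", std::ios::binary);
    slots[1].stream.open("huge2.dat", std::ios::binary);

    std::thread reader(readerLoop, std::ref(slots));
    std::thread w0(workerLoop, std::ref(slots[0]));
    std::thread w1(workerLoop, std::ref(slots[1]));
    reader.join(); w0.join(); w1.join();
}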

I thought my empirical data might be of some use to this discussion. I have a directory with 980 txt files that I would like to read. Using the Qt/C++ framework and running on an Intel i5 quad core, I created a GUI application and added a worker class that reads a file given its path. I pushed the worker into a thread, then repeated the run, adding one more thread each time. I timed roughly 13 minutes with 1 thread, 9 minutes with 2, and 8 minutes with 3. So in my case there was some benefit, but it degraded quickly.

Related

Why is 6-7 threads faster than 20?

In school we were introduced to C++11 threads. The teacher gave us a simple assessment to complete which was to make a basic web crawler using 20 threads. To me threading is pretty new, although I do understand the basics.
I would like to mention that I am not looking for someone to complete my assessment as it is already done. I only want to understand the reason why using 6 threads is always faster than using 20.
Please see code sample below.
main.cpp:
do
{
    for (size_t i = 0; i < THREAD_COUNT; i++)
    {
        threads[i] = std::thread(SweepUrlList);
    }

    for (size_t i = 0; i < THREAD_COUNT; i++)
    {
        threads[i].join();
    }

    std::cout << std::endl;
    WriteToConsole();

    listUrl = listNewUrl;
    listNewUrl.clear();
} while (listUrl.size() != 0);
Basically this assigns each worker thread the job to complete, which is the method SweepUrlList (shown below), and then joins all the threads.
while (1)
{
    mutextGetNextUrl.lock();
    std::set<std::string>::iterator it = listUrl.begin();
    if (it == listUrl.end())
    {
        mutextGetNextUrl.unlock();
        break;
    }
    std::string url(*it);
    listUrl.erase(*it);
    mutextGetNextUrl.unlock();

    ExtractEmail(url, listEmail);

    std::cout << ".";
}
So each worker thread loops until listUrl is empty. ExtractEmail is a method that downloads the webpage (using curl) and parses it to extract emails from mailto links.
The only blocking call in ExtractEmail can be found below:
if (email.length() != 0)
{
    mutextInsertNewEmail.lock();
    ListEmail.insert(email);
    mutextInsertNewEmail.unlock();
}
All answers are welcome and if possible links to any documentation you found to answer this question.
This is a fairly universal problem with threading, and at its core:
What you are seeing is thread scheduling. The operating system works with the various threads and schedules work on processors that currently have capacity to spare.
Assuming you have 4 cores with hyper-threading, you have 8 logical processors that can carry the load, but they also carry the load of everything else running (the operating system, the C++ debugger, and your application itself, for a start).
In theory, you would probably be OK on performance up until about 8 intensive threads. Once you exceed the number of threads your processor can use effectively, the threads begin to compete with each other for resources. This shows up (especially in intensive applications and tight loops) as poor performance.
Finally, this is a simplified answer, but I suspect it is what you are seeing.
The simple answer is choke points. Something you are doing is causing a choke point, and when this occurs there is a slowdown. It could be the number of active connections you are making to something, or merely the extra overhead of the number and memory size of the threads (see the answer below about core count being one of these chokes).
You will need to set up a series of monitors to investigate where your choke point is and what needs to change in order to achieve scale. Many systems across every industry face this problem every day. Opening up the throttle at one end does not produce the same increase in output at the other end; in some cases it can even decrease the output at the other end.
Take, for example, individuals leaving a hall. The goal is to get 100 people out of the building as quickly as possible. If single file produces a rate of 1 person every second, it takes 100 seconds to clear the building. We may be able to halve that time by sending them out 2 abreast, so 50 seconds to clear the building. What if we then sent them out 8 abreast? The door is only 2 m wide, so with 8 abreast being equivalent to 4 m, only 50% of the first row would make it through. The other 4 would then cause a blockage for the next row, and so on. Depending on the rate, this could cause temporary blockages and increase the time tenfold.
Threads are an operating system construct. Basically, each thread's state (which is essentially all of the CPU's registers plus the virtual memory mapping, which is part of the process construct) is saved by the operating system. Once the OS gives that specific thread "execution time" it restores this state and lets it run. Once this time is up, it has to save this state again. The process of saving one thread's state and restoring another's is called context switching, and it takes a significant amount of time (usually between a few hundred and a few thousand CPU cycles).
There are also additional penalties to context switching. Some of the processor's caches (like the virtual memory translation cache, called the TLB) have to be flushed, pipelined instructions have to be discarded, and more. Generally, you want to minimize context switching as much as possible.
If your CPU has 4 cores, then 4 threads can run simultaneously. If you try to run 20 threads on a 4-core system, the OS has to time-slice between those threads so that they appear to run in parallel. E.g., threads 1-4 will run for 50 milliseconds, then threads 5-8 will run for 50 milliseconds, and so on.
Therefore, if all of your threads are running CPU-intensive operations, it is generally most efficient to make your program use the same number of threads as cores (sometimes called 'logical processors' in Windows). If you have more threads than cores, then context switching must happen, and that is overhead which can be minimized.
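For example, instead of hard-coding 20 threads, the thread count can be taken from the hardware. A small sketch of my own, where the lambda stands in for the SweepUrlList-style job:
#include <thread>
#include <vector>

int main()
{
    unsigned n = std::thread::hardware_concurrency();   // number of logical processors
    if (n == 0) n = 4;                                   // the call may return 0; pick a fallback

    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n; ++i)
        threads.emplace_back([] { /* SweepUrlList-style CPU-bound work goes here */ });
    for (std::thread& t : threads)
        t.join();
}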
In general, more threads is not better. More threading provides value in two ways: higher parallelism and less blocking. More threading hurts through higher memory use, more context switching and more resource contention.
The value of more threads for higher parallelism is generally maximized between 1-2x the number of actual cores you have available. If your threads are already CPU bound, the maximum value is generally 1x the number of cores.
The value of less blocking is much harder to quantify and depends on the type of work you are performing. If you are IO bound and your threads are primarily waiting for IO to be ready, then a larger number of threads could be beneficial.
However, if you have shared state between threads, or you are doing some form of message passing between threads, then you will run into synchronization and contention issues. As the number of threads increases, these kinds of overhead, along with context switches, increasingly dominate the time spent doing your task.
Amdahl's law is a useful measure to determine if higher parallelism will actually improve the total runtime of your job.
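As a quick back-of-the-envelope illustration (my own example; the 90% parallel fraction is purely hypothetical):
#include <cstdio>

// Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n), where p is the fraction
// of the job that can actually run in parallel.
double amdahlSpeedup(double p, double n) { return 1.0 / ((1.0 - p) + p / n); }

int main()
{
    // If, say, 90% of the crawler's time is parallelizable, 6 threads already
    // give a 4.0x speedup while 20 threads only reach about 6.9x - and that
    // ignores the extra contention and context switching of the 20-thread run.
    std::printf("6 threads:  %.2fx\n", amdahlSpeedup(0.9, 6.0));
    std::printf("20 threads: %.2fx\n", amdahlSpeedup(0.9, 20.0));
}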
You also must be careful that your increased parallelism doesn't exceed some other resource like total memory or disk or network throughput. Once you have saturated the current bottleneck, you will not see improved performance by increasing the number of threads.
Before doing any performance tuning, it is important to understand what the dominant resource bottleneck is. There are lots of tools for doing system-wide resource monitoring. On Linux, one very useful tool is dstat. On Windows, you can use the Task Manager to monitor many of these resources.

Multithreaded Realtime audio programming - To block or Not to block

When writing audio software, many people on the internet say it is paramount not to use memory allocation or blocking code, i.e. no locks, because these are non-deterministic and could cause the output buffer to underflow so the audio will glitch.
Real Time Audio Programming
When I write video software, I generally use both, i.e. allocating video frames on the heap and passing them between threads using locks and condition variables (bounded buffers). I love the power this provides, as a separate thread can be used for each operation, allowing the software to max out each of the cores and give the best performance.
With audio I'd like to do something similar, passing frames of maybe 100 samples between threads; however, there are two issues.
How do I generate the frames without using memory allocation? I suppose I could use a pool of frames that have been pre-allocated but this seems messy.
I'm aware you can use a lock-free queue, and that Boost has a nice library for this. This would be a great way to share between threads, but constantly polling the queue to see if data is available seems like a massive waste of CPU time.
In my experience using mutexes doesn't actually take much time at all, provided that the section where the mutex is locked is short.
What is the best way to achieve passing audio frames between threads, whilst keeping latency to a minimum, not wasting resources and using relatively little non-deterministic behaviour?
Seems like you did your research! You've already identified the two main problems that could be the root cause of audio glitches. The question is: how much of this was important 10 years ago, and how much is only folklore and cargo-cult programming these days?
My two cents:
1. Heap allocations in the rendering loop:
These can have quite a lot of overhead depending on how small your processing chunks are. The main culprit is that very few runtimes have a per-thread heap, so each time you touch the heap your performance depends on what the other threads in your process are doing. If, for example, a GUI thread is currently deleting thousands of objects and you access the heap from the audio rendering thread at the same time, you may experience a significant delay.
Writing your own memory management with pre-allocated buffers may sound messy, but in the end it's just two functions that you can hide somewhere in a utility source file. Since you usually know your allocation sizes in advance, there is a lot of opportunity to fine-tune and optimize your memory management. You can store your segments as a simple linked list, for example. Done right, this has the benefit that you hand out the most recently used buffer again, and that buffer has a very high probability of being in the cache.
If fixed-size allocators don't work for you, have a look at ring buffers. They fit the use cases of streaming audio very well.
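For completeness, here is a bare-bones single-producer/single-consumer ring buffer sketch of my own (a production version would round the capacity to a power of two and pad the two indices onto separate cache lines):
#include <atomic>
#include <cstddef>
#include <vector>

class SpscRing {
public:
    explicit SpscRing(std::size_t capacity) : buf_(capacity + 1) {}

    bool push(float sample)                       // called by the producer thread only
    {
        std::size_t head = head_.load(std::memory_order_relaxed);
        std::size_t next = (head + 1) % buf_.size();
        if (next == tail_.load(std::memory_order_acquire))
            return false;                         // full: caller decides what to drop
        buf_[head] = sample;
        head_.store(next, std::memory_order_release);
        return true;
    }

    bool pop(float& sample)                       // called by the consumer thread only
    {
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;                         // empty
        sample = buf_[tail];
        tail_.store((tail + 1) % buf_.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<float> buf_;
    std::atomic<std::size_t> head_{0}, tail_{0};
};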
2. To lock, or not to lock:
I'd say that these days using mutex and semaphore locks is fine if you can estimate that you do fewer than 1000 to 5000 of them per second (on a PC; things are different on something like a Raspberry Pi). If you stay below that range it is unlikely that the overhead will show up in a performance profile.
Translated to your use case: if you work with 48 kHz audio and 100-sample chunks, you generate roughly 960 lock/unlock operations per second in a simple two-thread producer/consumer pattern. That is well within the range. Even if you completely max out the rendering thread, the locking will not show up in a profile. And if you only use, say, 5% of the available processing power, the locks may show up, but you will not have a performance problem either :-)
Going lock-less is also an option, but so are hybrid solutions that first do some lock-less tries and then fall back to hard locking. You'll get the best of both worlds that way. There is a lot of good stuff to read about this topic on the net.
In any case:
You should raise the thread priority of your non-GUI threads gently to make sure that if they run into a lock, they get out of it quickly. It is also a good idea to read up on what priority inversion is and what you can do to avoid it:
https://en.wikipedia.org/wiki/Priority_inversion
'I suppose I could use a pool of frames that have been pre-allocated but this seems messy' - not really. Either allocate an array of frames, or new up frames in a loop, and then shove the indices/pointers onto a blocking queue. Now you have an auto-managed pool of frames. Pop one off when you need a frame, push it back on when you are done with it. No continual malloc/free/new/delete, no chance of memory runaway, simpler debugging, and frame flow control (if the pool runs out, threads asking for frames will wait until frames are released back into the pool), all built in.
Using an array may seem easier/safer/faster than a new loop, but newing individual frames does have an advantage - you can easily change the number of frames in the pool at runtime.
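A rough sketch of that pool-as-blocking-queue idea (my own code; the Frame type and the sizes are hypothetical). Note that acquire() blocks, so it belongs on the non-real-time side of the pipeline:
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <memory>
#include <mutex>
#include <vector>

struct Frame { std::vector<float> samples; };

class FramePool {
public:
    FramePool(std::size_t count, std::size_t samplesPerFrame)
    {
        for (std::size_t i = 0; i < count; ++i) {
            storage_.push_back(std::make_unique<Frame>());
            storage_.back()->samples.resize(samplesPerFrame);
            free_.push_back(storage_.back().get());
        }
    }

    Frame* acquire()                      // blocks if the pool is empty: built-in flow control
    {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !free_.empty(); });
        Frame* f = free_.front();
        free_.pop_front();
        return f;
    }

    void release(Frame* f)                // return a frame once the consumer is done with it
    {
        { std::lock_guard<std::mutex> lock(m_); free_.push_back(f); }
        cv_.notify_one();
    }

private:
    std::vector<std::unique_ptr<Frame>> storage_;  // owns the frames, allocated once up front
    std::deque<Frame*> free_;
    std::mutex m_;
    std::condition_variable cv_;
};

int main()
{
    FramePool pool(8, 100);               // 8 frames of 100 samples, allocated at startup
    Frame* f = pool.acquire();            // producer grabs a free frame
    f->samples.assign(100, 0.0f);         // ... fill it with audio ...
    pool.release(f);                      // consumer returns it when done
}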
Um, why are you passing frames of 100 samples between threads?
Assuming that you are working at a nominal sample rate of 44.1 kHz and passing 100 samples at a time between threads, your thread hand-off interval can be at most 100 samples / (44100 samples/s × 2) ≈ 1.13 ms; the 2 accounts for both the producer and the consumer. That means you have a time slice of about 1.13 ms for every 100 samples you send. Nearly all operating systems schedule at time slices greater than 10 ms, so it is impossible to build an audio engine where you are sharing only 100 samples between threads at 44.1 kHz on a modern OS.
The solution is to buffer more samples per time slice, either via a queue or by using larger frames. Most modern real time audio APIs use 128 samples per channel (on dedicated audio hardware) or 256 samples per channel (on game consoles).
Ultimately, the answer to your question is mostly the answer you would expect... Pass around uniquely owned queues of pointers to buffers, not the buffers themselves; manage ALL audio buffers in a fixed pool allocated at program start; and lock all queues for as little time as necessary.
Interestingly, this is one of the few good situations in audio programming where there is a distinct performance advantage to busting out the assembly code. You definitely don't want a malloc and free occurring with every queue lock. Operating-system provided atomic locking functions can ALWAYS be improved upon, if you know your CPU.
One last thing: there's no such thing as a lockfree queue. All multithread "lockfree" queue implementations rely on a CPU barrier intrinsic or a hard compare-and-swap somewhere to make sure that exclusive access to memory is guaranteed per thread.

Multi-threaded reading from a file in C++?

My application stores its data in a text file.
I was testing for the fastest way of reading it by multi-threading the operation.
I used the following 2 techniques:
Method 1: Use as many streams as the NUMBER_OF_PROCESSORS environment variable indicates. Each stream runs on its own thread. Divide the total number of lines in the file equally among the streams. Parse the text.
Method 2: Only one stream parses the entire file and loads the data into memory. Create threads (= NUMBER_OF_PROCESSORS - 1) to parse the data from memory.
The test was run on various file sizes 100kB - 800MB.
Data in file:
100.23123 -42343.342555 ...(and so on)
4928340 -93240.2 349 ...
...
The data is stored in 2D array of double.
Result: Both methods take approximately the same time for parsing the file.
Question: Which method should I choose?
Method 1 is bad for the hard disk, as multiple read accesses are performed at random locations simultaneously.
Method 2 is bad because the memory required is proportional to the file size. This can be partially overcome by limiting the container to a fixed size, deleting the parsed content and filling it again from the reader, but that increases the processing time.
Method 2 has a sequential bottleneck (the single-threaded reading and handing out of the work items). It will not scale indefinitely, according to Amdahl's law. It is a very fair and reliable method, though.
Method 1 has no bottleneck and will scale. Be sure not to cause random IO on the disk: I'd use a mutex so that only one thread reads at a time, in big sequential blocks of maybe 4-16 MB. In the time the disk does a single head seek it could have read about 1 MB of data.
If parsing the lines takes a considerable amount of time, you can't use method 2 because of the big sequential part. It would not scale. If parsing is fast, though, use method 2 because it is easier to get right.
To illustrate the concept of a bottleneck: imagine 1,000,000 computation threads asking one reader thread to give them lines. That one reader thread would not be able to keep up handing out lines as quickly as they are demanded; you would not get 1,000,000 times the throughput, so it would not scale. But if those 1,000,000 threads read independently from a very fast IO device, you would get 1,000,000 times the throughput, because there is no bottleneck. (I have used extreme numbers to make the point. The same idea applies in the small.)
I'd prefer a slightly modified method 2: read the data sequentially in a single thread in big chunks, and pass each finished chunk to a thread pool where it is processed. That way you have concurrent reading and processing.
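A minimal sketch of that variant (my own example; the file name is hypothetical, std::async stands in for a real thread pool, and a real version would cut each chunk at a line boundary so no number is split across chunks):
#include <cstddef>
#include <fstream>
#include <future>
#include <sstream>
#include <string>
#include <vector>

std::vector<double> parseChunk(std::string chunk)        // runs on a worker thread
{
    std::istringstream s(std::move(chunk));
    std::vector<double> values;
    for (double v; s >> v; ) values.push_back(v);
    return values;
}

int main()
{
    std::ifstream in("data.txt", std::ios::binary);
    const std::size_t kChunk = 8 * 1024 * 1024;          // big sequential reads, ~8 MB
    std::vector<std::future<std::vector<double>>> results;

    std::string chunk(kChunk, '\0');
    while (in.read(&chunk[0], kChunk) || in.gcount() > 0) {
        chunk.resize(static_cast<std::size_t>(in.gcount()));
        // Reading stays single-threaded; parsing is handed off to a worker thread.
        // One worker per chunk here; a real thread pool would cap the worker count.
        results.push_back(std::async(std::launch::async, parseChunk, chunk));
        chunk.assign(kChunk, '\0');
    }

    for (auto& r : results)
        std::vector<double> parsed = r.get();             // collect the parsed blocks in order
}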
With enough RAM you can do it without a single-thread bottleneck. For Linux:
1) mmap your whole file into RAM with MAP_LOCKED; this requires root or raising the system-wide locked-memory limits. Or skip MAP_LOCKED on an SSD, since SSDs handle random access well.
2) Give each thread a start position. Each thread processes the data from the first newline after its own start position up to the first newline after the next thread's start position (a rough sketch follows below).
P.S. What is your program's CPU load? The HDD is probably the bottleneck.
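A Linux-only sketch of that approach (my own example; error handling is omitted and MAP_LOCKED is left out so it runs without raising the locked-memory limit):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstring>
#include <thread>
#include <vector>

void parseRange(const char* begin, const char* end)   // each thread parses [begin, end)
{
    (void)begin; (void)end;                            // actual parsing goes here
}

int main()
{
    int fd = open("data.txt", O_RDONLY);               // hypothetical file name
    struct stat st;
    fstat(fd, &st);
    const char* data = static_cast<const char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;

    // Nominal split points, moved forward to the first newline so that every
    // thread starts exactly where the previous one stops.
    std::vector<const char*> bounds(n + 1);
    bounds[0] = data;
    bounds[n] = data + st.st_size;
    for (unsigned i = 1; i < n; ++i) {
        const char* nominal = data + st.st_size * i / n;
        const char* nl = static_cast<const char*>(
            std::memchr(nominal, '\n', static_cast<std::size_t>(bounds[n] - nominal)));
        bounds[i] = nl ? nl + 1 : bounds[n];
    }

    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n; ++i)
        threads.emplace_back(parseRange, bounds[i], bounds[i + 1]);
    for (std::thread& t : threads)
        t.join();

    munmap(const_cast<char*>(data), static_cast<std::size_t>(st.st_size));
    close(fd);
}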

FAST DMA benefit from FPGA using threads in C++

I am transferring data from an FPGA over PCIe through DMA, which is very fast. I have 500 data blocks, each consisting of 80,000 bytes. Receiving all 500 blocks and saving them to a .bin file takes 0.5 seconds. If I do the same with a .txt file (which is my final goal) it takes 15 seconds.
So now what I want is to use threads in C++, where one thread (I call it the master thread) takes the DMA data (a single block at a time) and 500 other threads (one per file) are opened simultaneously, with each file-saving thread waiting for some trigger event, etc. (I don't have much of an idea here, since the CPU inherently runs sequentially, which is a problem for an FPGA designer used to working in the parallel domain.)
Please see the case I have explained below, which could be the solution, but I need to know how to implement it in C++ if it is correct.
case
The 1st data block (through DMA) comes in on the master thread (where global memory is assigned using malloc()) -> the thread for file 1 waits for some TRIGGER, and as soon as it gets this trigger it copies the memory contents to its own allocated memory and then starts saving them to its file; meanwhile it also triggers the master thread to increment its counter and receive the next data block, and the process continues for all 500 blocks.
I am mostly an FPGA guy and this is my first time using C++ at this level; I am determined but stuck. I have been really messed up for two days reading loads of material on threads (in C++), mainly starting from CreateThread() and going on and on. I thought WaitForSingleObject might be the solution but I cannot understand how to implement it...
Any idea would be appreciated. I am not asking for code, I just want to know how to implement it. For example, those familiar with VHDL might know that in VHDL we can use
Code: wait until abc'event and abc = '1';
but what to do here?
Thanks
sraza
The performance measurements you give show that the problem has nothing to do with DMA or threads. What's slow is converting from binary to string data.
Not surprising, since C++ iostreams are miserably slow and even the C stdio functions are significantly suboptimal.
Use an optimized function for number->string conversion, and your 15-second time for writing a text file will get a lot closer to the 0.5-second time you have for binary. I'd expect 1.0 second or less from this single change.
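For example, on a C++17 toolchain whose library supports floating-point std::to_chars (recent MSVC, or GCC 11 and later), the conversion can bypass iostreams and locales entirely. This is my own sketch; the 10,000-double vector is just a stand-in for one 80,000-byte DMA block:
#include <charconv>
#include <cstdio>
#include <vector>

void writeTextFast(std::FILE* out, const std::vector<double>& values)
{
    char buf[64];
    for (double v : values) {
        // std::to_chars does no locale lookups and no stream state handling,
        // which is where most of the iostream cost goes.
        std::to_chars_result r = std::to_chars(buf, buf + sizeof(buf), v);
        *r.ptr = '\n';
        std::fwrite(buf, 1, static_cast<std::size_t>(r.ptr - buf) + 1, out);
    }
}

int main()
{
    std::vector<double> block(10000, 3.14159);     // stand-in for one 80,000-byte DMA block
    std::FILE* out = std::fopen("data.txt", "w");
    writeTextFast(out, block);
    std::fclose(out);
}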

how to design threading for many short tasks

I want to use multi-threads to accelerate my program, but not sure which way is optimal.
Say we have 10,000 small tasks; it takes maybe only 0.1 s to finish one of them. Now I have a CPU with 12 cores and I want to use 12 threads to make it faster.
So far as I know, there are two ways:
1. Tasks Pool
There are always 12 threads running, each of them get one new task from the tasks pool after it finished its current work.
2. Separate Tasks
By separating the 10000 tasks into 12 parts and each thread works on one part.
The problem is, if I use a task pool, time is wasted on locking/unlocking when multiple threads try to access the pool. But the 2nd way is not ideal either, because some of the threads finish early and the total time depends on the slowest thread.
I am wondering how you deal with this kind of work and any other best way to do it? Thank you.
EDIT: Please note that the number 10000 is just an example; in practice, there may be 1e8 or more tasks, and 0.1 s per task is also just an average.
EDIT2: Thanks for all your answers :] It is good to know about the different options.
So one midway point between the two approaches is to break the work into, say, 100 batches of 100 tasks each and let a thread pick a batch of 100 tasks at a time from the task pool.
Perhaps if you model the randomness in execution time in a single core for a single task, and get an estimate of mutex locking time, you might be able to find an optimal batch size.
But without too much work we at least have the following lemma:
The slowest thread can take at most 100 × 0.1 s = 10 s longer than the others.
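A small sketch of that batching idea (my own code; runTask() is a placeholder): one atomic fetch-and-add claims a whole batch, so there is one synchronization point per 100 tasks instead of one per task.
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

int main()
{
    const std::size_t kTasks = 10000;
    const std::size_t kBatch = 100;                       // tasks claimed per atomic operation
    std::atomic<std::size_t> next{0};

    auto worker = [&] {
        for (;;) {
            std::size_t begin = next.fetch_add(kBatch);   // claim the next batch
            if (begin >= kTasks)
                return;
            std::size_t end = std::min(begin + kBatch, kTasks);
            for (std::size_t i = begin; i < end; ++i) {
                // runTask(i);                            // the ~0.1 s unit of work goes here
            }
        }
    };

    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 12;
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back(worker);
    for (std::thread& t : pool)
        t.join();
}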
Task pool is always the best solution here. It's not just optimum time, it's also comprehensibility of code. You should never force your tasks to conform to the completely unrelated criteria of having the same number of subtasks as cores - your tasks have nothing to do with that (in general), and such a separation doesn't scale when you change machines, etc. It requires overhead to collaborate on combining results in subtasks for the final task, and just generally makes an easy task hard.
But you should not be worrying about the use of locks for task pools. There are lock-free queues available if you ever determine them necessary. But determine that first. If time is your concern, use the appropriate methods of speeding up your task, and put your effort where you will get the most benefit. Profile your code. Why do your tasks take 0.1 s? Do they use an inefficient algorithm? Can loop unrolling help? If you find the hotspots in your code through profiling, you may find that locks are the least of your worries. And if you find everything is running as fast as possible, and you want that extra second from removing locks, search the internet with your favorite search engine for "lockfree queue" and "waitfree queue". Compare-and-swap makes atomic lists easy.
Both ways suggested in the question will perform well, and similarly to one another (in simple cases with predictable and relatively long task durations). If the target system type is known and available (and if performance is really a top concern), the approach should be chosen based on prototyping and measurements.
Do not necessarily prejudge the optimal number of threads as matching the number of cores. If this is a regular server or desktop system, there will be various system processes kicking in here and there, and you may see your 12 threads variously floating between processors, which hurts memory caching.
There are also crucial non-measurement factors you should check: do those small tasks require any resources to execute? Do these resources impose additional potential delays (blocking) or competition? Are there additional apps competing for the CPU power? Will the application need to grow to accommodate different execution environments, task types, or user interaction models?
If the answer to all is negative, here are some additional approaches that you can measure and consider.
1. Use only 10 or 11 threads. You will observe a small slowdown, or even a small speedup (the additional core will serve OS processes, so that the thread affinity of the rest will become more stable compared to 12 threads). Any concurrent interactive activity on the system will see a big boost in responsiveness.
2. Create exactly 12 threads but explicitly set a different processor affinity mask on each, to impose a 1-1 mapping between threads and processors (a rough sketch follows this list). This is good in the simplest near-academical case where there are no resources other than CPU and shared memory involved; you will see no chronic migration of threads across processors. The drawback is an algorithm closely coupled to a particular machine; on another machine it could behave so poorly as to never finish at all (because of an unrelated real-time task that blocks one of your threads forever).
3. Create 12 threads and split the tasks evenly. Have each thread downgrade its own priority once it is past 40% and again once it is past 80% of its load. This will improve load balancing inside your process, but it will behave poorly if your application is competing with other CPU-bound processes.
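For the second option, pinning is OS-specific. Here is a Linux-flavoured sketch of my own using pthread_setaffinity_np on the std::thread native handle; on Windows, SetThreadAffinityMask plays the same role:
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

int main()
{
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 12;                         // fallback if the query fails

    std::vector<std::thread> threads;
    for (unsigned i = 0; i < cores; ++i) {
        threads.emplace_back([] { /* this thread's share of the tasks */ });

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(i, &set);                               // pin thread i to processor i
        pthread_setaffinity_np(threads.back().native_handle(), sizeof(set), &set);
    }
    for (std::thread& t : threads)
        t.join();
}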
100ms/task - pile 'em on as they are - pool overhead will be insignificant.
OTOH..
1E8 tasks @ 0.1 s/task = 10,000,000 seconds
≈ 2,777.8 hours
≈ 115.7 days
That's much more than the interval between patch Tuesday reboots.
Even if you run this on Linux, you should batch up the output and flush it to disk in such a manner that the job is restartable.
Is there a database involved? If so, you should have told us!
Each worker thread may have its own small task queue with a capacity of no more than one or two memory pages. When the queue gets low (say, half of capacity), it should signal a manager thread to populate it with more tasks. If the queue is organized in batches, then worker threads do not need to enter critical sections as long as the current batch is not empty. Avoiding critical sections will give you extra cycles for the actual job. Two batches per queue are enough; in that case one batch can occupy one memory page, so the queue takes two.
The point of memory pages is that a thread does not have to jump all over memory to fetch data. If all the data is in one place (one memory page) you avoid cache misses.