how to design threading for many short tasks

how to design threading for many short tasks - c++

I want to use multi-threads to accelerate my program, but not sure which way is optimal.
Say we have 10000 small tasks, it takes maybe only 0.1s to finish one of them. Now I have a CPU with 12 cores and I want to use 12 threads to make it faster.
So far as I know, there are two ways:
1.Tasks Pool
There are always 12 threads running, each of them get one new task from the tasks pool after it finished its current work.
2.Separate Tasks
By separating the 10000 tasks into 12 parts and each thread works on one part.
The problem is, if I use tasks pool it is a waste of time for lock/unlock when multiple threads try to access the tasks pool. But the 2nd way is not ideal because some of the threads finish early, the total time depends on the slowest thread.
I am wondering how you deal with this kind of work and any other best way to do it? Thank you.
EDIT: Please note that the number 10000 is just for example, in practice, it may be 1e8 or more tasks and 0.1 per task is also an average time.
EDIT2: Thanks for all your answers :] It is good to know kinds of options.

So one midway between the two approaches is to break into say 100 batches of 100 tasks each and let the a core pick a batch of 100 tasks at a time from the task pool.
Perhaps if you model the randomness in execution time in a single core for a single task, and get an estimate of mutex locking time, you might be able to find an optimal batch size.
But without too much work we at least have the following lemma :
The slowest thread can only take at max 100*.1 = 10s more than others.

Task pool is always the best solution here. It's not just optimum time, it's also comprehensibility of code. You should never force your tasks to conform to the completely unrelated criteria of having the same number of subtasks as cores - your tasks have nothing to do with that (in general), and such a separation doesn't scale when you change machines, etc. It requires overhead to collaborate on combining results in subtasks for the final task, and just generally makes an easy task hard.
But you should not be worrying about the use of locks for taskpools. There are lockfree queues available if you ever determined them necessary. But determine that first. If time is your concern, use the appropriate methods of speeding up your task, and put your effort where you will get the most benefit. Profile your code. Why do your tasks take 0.1 s? Do they use an inefficient algorithm? Can loop unrolling help? If you find the hotspots in your code through profiling, you may find that locks are the least of your worries. And if you find everything is running as fast as possible, and you want that extra second from removing locks, search the internet with your favorite search engine for "lockfree queue" and "waitfree queue". Compare and swap makes atomic lists easy.

Both ways suggested in the question will perform well and similarly to each another (in simple cases with predictable and relatively long duration of the tasks). If the target system type is known and available (and if performance is really a top concern), the approach should be chosen based on prototyping and measurements.
Do not necessarily prejudice yourself as to the optimal number of threads matching the number of the cores. If this is a regular server or desktop system, there will be various system processes kicking in here and then and you may see your 12 threads variously floating between processors which hurts memory caching.
There are also crucial non-measurement factors you should check: do those small tasks require any resources to execute? Do these resources impose additional potential delays (blocking) or competition? Are there additional apps competing for the CPU power? Will the application need to be grow to accommodate different execution environments, task types, or user interaction models?
If the answer to all is negative, here are some additional approaches that you can measure and consider.
Use only 10 or 11 threads. You will observe a small slowdown, or even
a small speedup (the additional core will serve OS processes, so that
thread affinity of the rest will become more stable compared to 12
threads). Any concurrent interactive activity on the system will see
a big boost in responsiveness.
Create exactly 12 threads but explicitly set a different processor
affinity mask to each, to impose a 1-1 mapping between threads and processors.
This is good in the simplest near-academical case
where there are no resources other than CPU and shared memory
involved; you will see no chronic migration of threads across
processes. The drawback is an
algorithm closely coupled to a particular machine; on another machine
it could behave so poorly as to finish never at all (because of an
unrelated real time task that
blocks one of your threads forever).
Create 12 threads and split the tasks evenly. Have each thread
downgrade its own priority once it is past 40% and again once it is
past 80% of its load. This will improve load balancing inside your
process, but it will behave poorly if your application is competing
with other CPU-bound processes.

100ms/task - pile 'em on as they are - pool overhead will be insignificant.
OTOH..
1E8 tasks # 0.1s/task = 10,000,000 seconds
= 2777.7r hours
= 115.7 days
That's much more than the interval between patch Tuesday reboots.
Even if you run this on Linux, you should batch up the output and flush it to disk in such a manner that the job is restartable.
Is there a database involved? If so, you should have told us!

Each working thread may have its own small task queue with the capacity of no more than one or two memory pages. When the queue size becomes low (a half of capacity) it should send a signal to some manager thread to populate it with more tasks. If queue is organized in batches then working threads do not need to enter critical sections as long as current batch is not empty. Avoiding critical sections will give you extra cycles for actual job. Two batches per queue are enough, and in this case one batch can take one memory page, and so queue takes two.
The point of memory pages is that thread does not have to jump all over the memory to fetch data. If all data are in one place (one memory page) you avoid cache misses.

Related

why does having more than one thread(parallel processing) in some specific cases degrade performance?

i noticed that having more than a thread running for some code is much much slower than having one thread, and i have been really pulling my hair to know why,can anyone help?
code explanation :
i have ,sometimes, a very large array that i need to process parts of in a parallel way for optimization,each "part" of a row gets looped on and processed on in a specific thread, now i've noticed that if i only have one "part",i.e the whole array and a single worker thread that runs through it is noticeably faster than if i divide the array and process it as separate sub arrays with different threads.
bool m_generate_row_worker(ull t_row_start,ull t_row_end)
{
for(;t_row_start<t_row_end;t_row_start++)
{
m_current_row[t_row_start]=m_singularity_checker(m_previous_row[t_row_start],m_shared_random_row[t_row_start]);
}
return true;
}
...
//code
...
for(unsigned short thread_indx=0;thread_indx<noThreads-1;thread_indx++)
{
m_threads_array[thread_indx]=std::thread(
m_generate_row_worker,this,
thread_indx*(m_parts_per_thread),(thread_indx+1)*(m_parts_per_thread));
}
m_threads_array[noThreads-1]=std::thread(m_generate_row_worker,this,
(noThreads-1)*(m_parts_per_thread),std::max((noThreads)*(m_parts_per_thread),m_blocks_per_row));
//join
for(unsigned short thread_indx=0;thread_indx<noThreads;thread_indx++)
{
m_threads_array[thread_indx].join();
}
//EDIT
inline ull m_singularity_checker(ull t_to_be_ckecked_with,ull
t_to_be_ckecked)
{
return (t_to_be_ckecked & (t_to_be_ckecked_with<<1)
& (t_to_be_ckecked_with>>1) ) | (t_to_be_ckecked_with &
t_to_be_ckecked);
}

why does having more than one thread(parallel processing) in some specific cases degrade performance?
Because thread creation has overhead. If the task to be performed has only small computational cost, then the cost of creating multiple threads is more than the time saved by parallelism. This is especially the case when creating significantly more threads than there are CPU cores.
Because many algorithms do not easily divide into independent sub-tasks. Dependencies on other threads requires synchronisation, which has overhead that can in some cases be more than the time saved by parallelism.
Because in poorly designed programs, synchronization can cause all tasks to be processed sequentially even if they are in separate threads.
Because (depending on CPU architecture) sometimes otherwise correctly implemented, and seemingly independent tasks have effectual dependency because they operate on the same area of memory. More specifically, when a threads writes into a piece of memory, all threads operating on the same cache line must synchronise (the CPU does this for you automatically) to remain consistent. The cost of cache misses is often much higher than the time saved by parallelism. This problem is called "false sharing".
Because sometimes introduction of multi threading makes the program more complex, which makes it more difficult for the compiler / optimiser to make use of instruction level parallelism.
...
In conclusion: Threads are not a silver bullet that automatically multiplies the performance of your program.
Regarding your program, we cannot count out any of the above potential issues given the excerpt that you have shown.
Some tips on avoiding or finding above issues:
Don't create more threads than you have cores, discounting the number of threads that are expected to be blocking (waiting for input, disk, etc).
Only use multi-threading with problems that are computationally expensive, (or to do work while a thread is blocking, but this may be more efficiently solved using asynchronous I/O and coroutines).
Don't do (or do as little as possible) I/O from more than one thread into a single device (disk, NIC, virtual terminal, ...) unless it is specially designed to handle it.
Minimise the number of dependencies between threads. Consider all access to global things that may cause synchronisation, and avoid them. For example, avoid memory allocation. Keep in mind that things like operations on standard containers do memory allocation.
Keep the memory touched by distinct threads far from each other (not adjacent small elements of array). If processing an array, divide it in consecutive blocks, rather than striping one element every (number of threads)th element. In some extreme cases, extra copying into thread specific data structures, and then joining in the end may be efficient.
If you've done all you can, and multi threading measures slower, consider whether perhaps it is not a good solution for your problem.

Using threads do not always mean that you will get more work done. For example using 2 threads does not mean you will get a task done in half the time. There is an overhead to setting up the threads and depending on how many cores and OS etc... how much context switching is occurring between threads (saving the thread stack/regs and loading the next one - it all adds up). At some point adding more threads will start to slow your program down since there will be more time spent switching between threads/setting threads up/down then there is work being done. So you may be a victim of this.
If you have 100 very small items (like 1 instruction) of work to do, then 100 threads will be guaranteed to be slower since you now have ("many instructions" + 1) x 100 of work to do. Where the "many instructions" are the work of setting up the threads and clearing them up at the end - and switching between them.
So, you may want to start to profile this for yourself.. How much work is done processing each row and how many threads in total are you setting up?
One very crude, but quick/simple way to start to measure is to just take the time elapsed to processes one row in isolation (e.g. use std::chrono functions to measure the time at the start of processing one row and then take the time at the end to see total time spent. Then maybe do the same test over the entire table to get an idea how total time.
If you find that a individual row is taking very little time then you may not be getting so much benefit from the threads... You may be better of splitting the table into chunks of work that are equal to the number of cores your CPU has, then start changing the number of threads (+/-) to find the sweet spot. Just making threads based on number of rows is a poor choice - you really want to design it to max out each core (for example).
So if you had 4 cores, maybe start by splitting the work into 4 threads to start with. Then test it with 8 if its better try 16, if its worse try 12....etc...
Also you might get different results on different PCs...

Why is 6-7 threads faster than 20?

In school we were introduced to C++11 threads. The teacher gave us a simple assessment to complete which was to make a basic web crawler using 20 threads. To me threading is pretty new, although I do understand the basics.
I would like to mention that I am not looking for someone to complete my assessment as it is already done. I only want to understand the reason why using 6 threads is always faster than using 20.
Please see code sample below.
main.cpp:
do
{
for (size_t i = 0; i < THREAD_COUNT; i++)
{
threads[i] = std::thread(SweepUrlList);
}
for (size_t i = 0; i < THREAD_COUNT; i++)
{
threads[i].join();
}
std::cout << std::endl;
WriteToConsole();
listUrl = listNewUrl;
listNewUrl.clear();
} while (listUrl.size() != 0);
Basically this assigns to each worker thread the job to complete which is the method SweepUrlList that can be found below and then join all thread.
while (1)
{
mutextGetNextUrl.lock();
std::set<std::string>::iterator it = listUrl.begin();
if (it == listUrl.end())
{
mutextGetNextUrl.unlock();
break;
}
std::string url(*it);
listUrl.erase(*it);
mutextGetNextUrl.unlock();
ExtractEmail(url, listEmail);
std::cout << ".";
}
So each worker thread loop until ListUrl is empty. ExtractEmail is a method that downloads the webpage (using curl) and parse it to extract emails from mailto links.
The only blocking call in ExtractEmail can be found below:
if(email.length() != 0)
{
mutextInsertNewEmail.lock();
ListEmail.insert(email);
mutextInsertNewEmail.unlock();
}
All answers are welcome and if possible links to any documentation you found to answer this question.

This is a fairly universal problem with threading, and at its core:
What you are demonstrating is thread Scheduling. The operating system is going to work with the various threads, and schedule work where there is currently not work.
Assuming you have 4 cores and hyper threading you have 8 processors that can carry the load, but also that of other applications (Operating System, C++ debugger, and your application to start).
In theory, you would probably be OK on performance up until about 8 intensive threads. After you reach the most threads your processor can effectively use, then threads begin to compete against each other for resources. This can be seen (especially with intensive applications and tight loops) by poor performance.
Finally, this is a simplified answer but I suspect what you are seeing.

The simple answer is choke points. Something that you are doing is causing a choke point. When this occurs there is a slow down. It could be in the number of active connections you are making to something, or merely the extra overhead of the number and memory size of the threads (see the below answer about cores being one of these chokes).
You will need to set up a series of monitors to investigate where your choke point is, and what needs to change in order to achieve scale. Many systems across every industry face this problem every day. Opening up the throttle at one end does not equal the same increase in the output at the other end. In cases it can decrease the output at the other end.
Take for example individuals leaving a hall. The goal is to get 100 people out of the building as quickly as possible. If single file produces a rate of 1 person every 1 second therefore 100 seconds to clear the building. We many be able to half that time by sending them out 2 abreast, so 50 seconds to clear the building. What if we then sent them out as 8 abreast. The door is only 2m wide, so with 8 abreast being equivalent to 4m, only 50% of the first row would make it through. The other 4 would then cause a blockage for the next row and so on. Depending on the rate, this could cause temporary blockages and increase the time 10 fold.

Threads are an operating system construct. Basically, each thread's state (which is basically all the CPU's registers and virtual memory mapping [which is a part of the process construct]) is saved by the operating system. Once the OS gives that specific thread "execution time" it restores this state and let it run. Once this time is finished, it has to save this state. The process of saving a specific thread's state and restoring another is called Context Switching, and it takes a significant amount of time (usually between a couple of hundreds to thousand of CPU cycles).
There are also additional penalties to context switching. Some of the processor's cache (like the virtual memory translation cache, called the TLB) has to be flushed, pipelining instruction to be discarded and more. Generally, you want to minimize context switching as much as possible.
If your CPU has 4 cores, than 4 threads can run simultaneously. If you try to run 20 threads on a 4 core system, then the OS has to manage time between those threads so it will seem like they run in parallel. E.g, threads 1-4 will run for 50 milliseconds, then 5-9 will run for 50 milliseconds, etc.
Therefore, if all of your threads are running CPU intensive operations, it is generally most efficient to make your program use the same amount of threads as cores (sometimes called 'processors' in windows). If you have more threads than cores, than context switching must happen, and it is overhead that can be minimized.

In general, more threads is not better. More threading provides value in two ways higher parallelism and less blocking. More threading hurts by higher memory, higher context switching and higher resource contention.
The value of more threads for higher parallelism is generally maximized between 1-2x the number of actual cores that you have available. If your threads are already CPU bound the maximum value is generally 1x number of cores.
The value of less blocking is much harder to quantify and depends on the type of work you are performing. If you are IO bound and your threads are primarily waiting for IO to be ready then a larger number of threads could be beneficial.
However if you have shared state between threads, or you are doing some form of message passing between threads then you will run into synchronization and contention issues. As the number of threads increases, the more these types of overhead as well as context switches dominates the time spent doing your task.
Amdahl's law is a useful measure to determine if higher parallelism will actually improve the total runtime of your job.
You also must be careful that your increased parallelism doesn't exceed some other resource like total memory or disk or network throughput. Once you have saturated the current bottleneck, you will not see improved performance by increasing the number of threads.
Before doing any performance tuning, it is important to understand what the dominant resource bottleneck is. There are lots of tools for doing system-wide resource monitoring. On Linux, one very useful tool is dstat. On Windows, you can use the Task Manager to monitor many of these resources.

Multithreading crowds out other processes

I have added multithreading to a raytracer I am writing, and while it does run much faster now, when it's running, my computer is almost unusably slow. Obviously I want to use all my PC's compute power, but I don't want it to prevent any other application from getting access to the CPUs.
I thought about having the threads sleep, but unless they all sleep at the same time, then the other threads would just eat up the extra time. Also, I don't necessarily want to give up a certain percentage of available compute power if I'm not going to use it.
Also, (This is not my official question) I've noticed that for some reason the first thread launched does more work than the second, and the second more than the third, and so on until like the last 5 threads (out of 32) won't actually get a crack at any work, despite the fact that there's plenty to go a around (there's at least 0.5M work items for them to chew through). If someone would like to venture a guess in the comments, it would be appreciated.

If you use the standard threads, you could try to use thread::hardware_concurrency to find out an estimate of the maximul number of threads that are really supported by hardware, in order not to overload your cpu.
If it returns 0 the information is not available. In other cases you could limit yourself to this number or a little bit below (thinking that other processes might use these as well).
If limiting the number of threads does not improve responsiveness, you can also consider calling from time to time this_thread::yield() to give opportunity to reschedule threads. But depending on the kind of job and synchronisation you use, this second alternative might decrease performance.

As requested, my comment as an answer:
It sounds like you've oversubscribed your poor CPU. Try reducing the number of threads?
If there's significantly more threads than hardware cores, a lot of time is going to be wasted switching between threads, scheduling them in the OS, and in contention over shared variables. It would also cause the general slowdown of the other running programs, because they have to contend with the high number of threads from your program (which by default all have the same priority as the other programs' threads in the eyes of the OS scheduler).

What is the best way to determine the number of threads to fire off in a machine with n cores? (C++)

I have a vector<int> with 10,000,000 (10 million) elements, and that my workstation has four cores. There is a function, called ThrFunc, that operates on an integer. Assume that the runtime for ThrFunc for each integer in the vector<int> is roughly the same.
How should I determine the optimal number of threads to fire off? Is the answer as simple as the number of elements divided by the number of cores? Or is there a more subtle computation?
Editing to provide extra information
No need for blocking; each function invocation needs only read-only
access

The optimal number of threads is likely to be either the number of cores in your machine or the number of cores times two.
In more abstract terms, you want the highest possible throughput. Getting the highest throughput requires the fewest contention points between the threads (since the original problem is trivially parallelizable). The number of contention points is likely to be the number of threads sharing a core or twice that, since a core can either run one or two logical threads (two with hyperthreading).
If your workload makes use of a resource of which you have fewer than four available (ALUs on Bulldozer? Hard disk access?) then the number of threads you should create will be limited by that.
The best way to find out the correct answer is, with all hardware questions, to test and find out.

Borealid's answer includes test and find out, which is impossible to beat as advice goes.
But there's perhaps more to testing this than you might think: you want your threads to avoid contention for data wherever possible. If the data is entirely read-only, then you might see best performance if your threads are accessing "similar" data -- making sure to walk through the data in small blocks at a time, so each thread is accessing data from the same pages over and over again. If the data is completely read-only, then there is no problem if each core gets its own copy of the cache lines. (Though this might not make the most use of each core's cache.)
If the data is in any way modified, then you will see significant performance enhancements if you keep the threads away from each other, by a lot. Most caches store data along cache lines, and you desperately want to keep each cache line from bouncing among CPUs for good performance. In that case, you might want to keep the different threads running on data that is actually far apart to avoid ever running into each other.
So: if you're updating the data while working on it, I'd recommend having N or 2*N threads of execution (for N cores), starting them with SIZE/N*M as their starting point, for threads 0 through M. (0, 1000, 2000, 3000, for four threads and 4000 data objects.) This will give you the best chance of feeding different cache lines to each core and allowing updates to proceed without cache line bouncing:
+--------------+---------------+--------------+---------------+--- ...
| first thread | second thread | third thread | fourth thread | first ...
+--------------+---------------+--------------+---------------+--- ...
If you're not updating the data while working on it, you might wish to start N or 2*N threads of execution (for N cores), starting them with 0, 1, 2, 3, etc.. and moving each one forward by N or 2*N elements with each iteration. This will allow the cache system to fetch each page from memory once, populate the CPU caches with nearly identical data, and hopefully keep each core populated with fresh data.
+-----------------------------------------------------+
| 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 ... |
+-----------------------------------------------------+
I also recommend using sched_setaffinity(2) directly in your code to force the different threads to their own processors. In my experience, Linux aims to keep each thread on its original processor so much it will not migrate tasks to other cores that are otherwise idle.

Assuming ThrFunc is CPU-bound then you want probably one thread per core, and divide the elements between them.
If there's an I/O element to the function then the answer is more complicated, because you can have one or more threads per core waiting for I/O while another is executing. Do some tests and see what happens.

I agree with the previous comments. You should run tests to determine what number yields the best performance. However, this will only yield the best performance for the particular system you're optimizing for. In most scenarios, your program will be run on other people's machines, on the architecture of which you should not make too many assumptions.
A good way to numerically determine the number of threads to start would be to use
std::thread::hardware_concurrency()
This is part of the C++11 and should yield the number of logical cores in the current system. Logical cores means either the physical number of cores - in case the processor does not support hardware threads (ie HyperThreading) - or the number of hardware threads.
There's also a Boost-function that does the same, see Programmatically find the number of cores on a machine.

The optimal number of threads should equal the number of cores, in which situation the computation capacity of each core will be fully utilized, if the computation on each element is independently.

The optimal number of cores (threads) will probably be determined by when you achieve saturation of the memory system (caches and RAM). Another factor that could come into play is that of inter-core locking (locking a memory area that other cores might want to access, updating it and then unlocking it) and how efficient it is (how long the lock is in place and how often it is locked/unlocked).
A single core running a generic software whose code and data are not optmized for multi-core will come close to saturating memory all by itself. Adding more cores will, in such a scenario, result in a slower application.
So unless your code economizes heavily on memory accesses I'd guess the answer to your question is one (1).

I've found a real world example I'll put here for the ones who want a less technical / more intuitional answer:
Having multiple threads per core is like having two queues in an airport for each scanner(which people on both queues eventually have to pass through).
Two people at a time can put their baggage on the conveyer belt, but only one at a time can pass through the scanner. Now at this point, obviously there's a contention point at the entrance of the scanner, but what happens in reality is most of the times both queues function very well.
In this example, the queues represent threads and the scanner is the main functions of a core. As a general rule of thumb, the impact of each thread is 1.25th a core, i.e., it's not like having an entire new core. So if the task is CPU-bound slightly over the number of available processors is probably best.
But notice that if the task is IO-Bound, where threads will be spending most of their time waiting for external resources such as database connections, file systems, or other external sources of data, then you can assign (many) more threads than the number of available processors.
Source1, Source2

Is my approach eligible for a Thread Pool approach?

I have win 32 C++ application . I have to load 330,000 objects into memory . If I use sequential approach it takes around 16 min . In the threading approach I divide the 330,000 objects equally among 10 containers . I create 10 threads and I assign each thread one container of size 33000 objects to load them in memory . This approach took around 9 min .
INcreasing the number of threads did not help .....
Will I get any further improvement if I use ThreadPool ?

As always without specifics, it depends.
Are you loading objects from disk or creating them in memory? If you're loading them from disk then it's probably IO bound so increasing the number of threads probably won't help very much.
In the comment you mentioned you are loading from a database. I presume when you use threads you are making N queries simultaneously? Might be worth investigating the database console to understand how its coping with many concurrent queries.
On the other hand, if the objects are created as the result of some CPU bound process (e.g. calculating pi) then chances are increasing the number of threads grater than the number of CPU's probably won't increase performance (and as ronag points out in the comments will probably hurt performance due to the increased context switching).
Are there dependencies between the objects? That would again affect how things go.
You'd typically use a thread pool if you have a collection of independent tasks that you want to run with a configurable way of running them. It sounds like using a thread pool would be a good way to run lots of benchmarks with various thread settings. You could also make the number of threads configurable which would help when running on different architectures/systems.

IME, and yours, a few threads will speed up this kind of task. I'm guessing that the overall throughput is improved because of better use of the 'intelligent' disk cacheing available on modern controllers - the disk/controller spends less time idle because there are always threads wanting to read something. The diminishing returns sets in, however, after only a few threads are loaded in and you are disk-bound. In a slightly similar app, I found that any more than 6 threads provided no additional advantage & just used up more memory.
I can't see how pooling, or otherwise, of these threads would make any difference to performance - it's just a big job that has to be done :(
Tell your customers that they have to install an SSD
Rgds,
Martin

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js