why does having more than one thread(parallel processing) in some specific cases degrade performance? - c++

i noticed that having more than a thread running for some code is much much slower than having one thread, and i have been really pulling my hair to know why,can anyone help?
code explanation :
i have ,sometimes, a very large array that i need to process parts of in a parallel way for optimization,each "part" of a row gets looped on and processed on in a specific thread, now i've noticed that if i only have one "part",i.e the whole array and a single worker thread that runs through it is noticeably faster than if i divide the array and process it as separate sub arrays with different threads.
bool m_generate_row_worker(ull t_row_start,ull t_row_end)
{
for(;t_row_start<t_row_end;t_row_start++)
{
m_current_row[t_row_start]=m_singularity_checker(m_previous_row[t_row_start],m_shared_random_row[t_row_start]);
}
return true;
}
...
//code
...
for(unsigned short thread_indx=0;thread_indx<noThreads-1;thread_indx++)
{
m_threads_array[thread_indx]=std::thread(
m_generate_row_worker,this,
thread_indx*(m_parts_per_thread),(thread_indx+1)*(m_parts_per_thread));
}
m_threads_array[noThreads-1]=std::thread(m_generate_row_worker,this,
(noThreads-1)*(m_parts_per_thread),std::max((noThreads)*(m_parts_per_thread),m_blocks_per_row));
//join
for(unsigned short thread_indx=0;thread_indx<noThreads;thread_indx++)
{
m_threads_array[thread_indx].join();
}
//EDIT
inline ull m_singularity_checker(ull t_to_be_ckecked_with,ull
t_to_be_ckecked)
{
return (t_to_be_ckecked & (t_to_be_ckecked_with<<1)
& (t_to_be_ckecked_with>>1) ) | (t_to_be_ckecked_with &
t_to_be_ckecked);
}

why does having more than one thread(parallel processing) in some specific cases degrade performance?
Because thread creation has overhead. If the task to be performed has only small computational cost, then the cost of creating multiple threads is more than the time saved by parallelism. This is especially the case when creating significantly more threads than there are CPU cores.
Because many algorithms do not easily divide into independent sub-tasks. Dependencies on other threads requires synchronisation, which has overhead that can in some cases be more than the time saved by parallelism.
Because in poorly designed programs, synchronization can cause all tasks to be processed sequentially even if they are in separate threads.
Because (depending on CPU architecture) sometimes otherwise correctly implemented, and seemingly independent tasks have effectual dependency because they operate on the same area of memory. More specifically, when a threads writes into a piece of memory, all threads operating on the same cache line must synchronise (the CPU does this for you automatically) to remain consistent. The cost of cache misses is often much higher than the time saved by parallelism. This problem is called "false sharing".
Because sometimes introduction of multi threading makes the program more complex, which makes it more difficult for the compiler / optimiser to make use of instruction level parallelism.
...
In conclusion: Threads are not a silver bullet that automatically multiplies the performance of your program.
Regarding your program, we cannot count out any of the above potential issues given the excerpt that you have shown.
Some tips on avoiding or finding above issues:
Don't create more threads than you have cores, discounting the number of threads that are expected to be blocking (waiting for input, disk, etc).
Only use multi-threading with problems that are computationally expensive, (or to do work while a thread is blocking, but this may be more efficiently solved using asynchronous I/O and coroutines).
Don't do (or do as little as possible) I/O from more than one thread into a single device (disk, NIC, virtual terminal, ...) unless it is specially designed to handle it.
Minimise the number of dependencies between threads. Consider all access to global things that may cause synchronisation, and avoid them. For example, avoid memory allocation. Keep in mind that things like operations on standard containers do memory allocation.
Keep the memory touched by distinct threads far from each other (not adjacent small elements of array). If processing an array, divide it in consecutive blocks, rather than striping one element every (number of threads)th element. In some extreme cases, extra copying into thread specific data structures, and then joining in the end may be efficient.
If you've done all you can, and multi threading measures slower, consider whether perhaps it is not a good solution for your problem.

Using threads do not always mean that you will get more work done. For example using 2 threads does not mean you will get a task done in half the time. There is an overhead to setting up the threads and depending on how many cores and OS etc... how much context switching is occurring between threads (saving the thread stack/regs and loading the next one - it all adds up). At some point adding more threads will start to slow your program down since there will be more time spent switching between threads/setting threads up/down then there is work being done. So you may be a victim of this.
If you have 100 very small items (like 1 instruction) of work to do, then 100 threads will be guaranteed to be slower since you now have ("many instructions" + 1) x 100 of work to do. Where the "many instructions" are the work of setting up the threads and clearing them up at the end - and switching between them.
So, you may want to start to profile this for yourself.. How much work is done processing each row and how many threads in total are you setting up?
One very crude, but quick/simple way to start to measure is to just take the time elapsed to processes one row in isolation (e.g. use std::chrono functions to measure the time at the start of processing one row and then take the time at the end to see total time spent. Then maybe do the same test over the entire table to get an idea how total time.
If you find that a individual row is taking very little time then you may not be getting so much benefit from the threads... You may be better of splitting the table into chunks of work that are equal to the number of cores your CPU has, then start changing the number of threads (+/-) to find the sweet spot. Just making threads based on number of rows is a poor choice - you really want to design it to max out each core (for example).
So if you had 4 cores, maybe start by splitting the work into 4 threads to start with. Then test it with 8 if its better try 16, if its worse try 12....etc...
Also you might get different results on different PCs...

Related

What's the "real world" performance improvement for multithreading I can expect?

I'm programming a recursive tree search with multiple branches and works fine. To speed up I'm implementing a simple multithreading: I distribute the search into main branches and scatter them among the threads. Each thread doesn't have to interact with the others, and when a solve is found I add it to a common std::vector using a mutex this way:
if (CubeTest.IsSolved())
{ // Solve algorithm found
std::lock_guard<std::mutex> guard(SearchMutex); // Thread safe code
Solves.push_back(Alg); // Add the solve
}
I don't allocate variables in dynamic store (heap) with new and delete, since the memory needs are small.
The maximum number of threads I use is the quantity I get from: std::thread::hardware_concurrency()
I did some tests, always the same search but changing the amount or threads used, and I found things that I don't expected.
I know that if you double the amount of threads (if the processor has enougth capacity) you can't expect to double the performance, because of context switching and things like that.
For example, I have an old Intel Xeon X5650 with 6 cores / 12 threads. If I execute my code, until the sixth thread things are as expected, but if I use an additional thread the performace is worst. Using more threads increase the performace very little, to the point that the use of all avaliable threads (12) barely compensates for the use of only 6:
Threads vs processing time chart for Xeon X5650:
(I repeat the test several times and I show the average times of all the tests).
I repeat the tests in other computer with an Intel i7-4600U (2 cores / 4 threads) and I found this:
Threads vs processing time chart for i7-4600U:
I understand that with less cores the performance gain using more threads is worst.
I think also that when you start to use the second thread in the same core the performance is penalized in some way. Am I right? How can I improve the performance in this situation?
So my question is if this performance gains for multithreading is what I can expect in the real world, or on the other hand, this numbers are telling me that I'm doing things wrong and I should learn more about mutithreading programming.
What's the “real world” performance improvement for multithreading I can expect?
It depends on many factors. In general, the most optimistic improvement that one can hope for is reduction of runtime by factor of number of cores1. In most cases this is unachievable because of the need for threads to synchronise with one another.
In worst case, not only is there no improvement due to lack of parallelism, but also the overhead of synchronisation as well as cache contention can make the runtime much worse than the single threaded program.
Peak memory use often increases linearly by number of threads because each thread needs to operate on data of their own.
Total CPU time usage, and therefore energy use also increases due to extra time spent on synchronisation. This is relevant to systems that operate on battery power as well as those that have poor heat management (both apply to phones and laptops).
Binary size would be marginally larger due to extra code that deals with threads.
1 Whether you get all of the performance out of "logical" cores i.e. "hyper threading" or "clustered multi threading" also depends on many factors. Often, one executes the same function in all threads, in which case they tend to use the same parts of the CPU, in which case sharing the core with multiple threads doesn't necessarily yield benefit.
A CPU which uses hyperthreading claims to be able to execute two threads simultaneously on one core. But actually it doesn't. It just pretends to be able to do that. Internally it performs preemptive multitasking: Execute a bit of thread A, then switch to thread B, execute a bit of B, back to A and so on.
So what's the point of hyperthreading at all?
The thread switches inside the CPU are faster than thread switches managed by the thread scheduler of the operating system. So the performance gains are mostly through avoiding overhead of thread switches. But it does not allow the CPU core to perform more operations than it did before.
Conclusion: The performance gain you can expect from concurrency depend on the number of physical cores of the CPU, not logical cores.
Also keep in mind that thread synchronization methods like mutexes can become pretty expensive. So the less locking you can get away with the better. When you have multiple threads filling the same result set, then it can sometimes be better to let each thread build their own result set and then merge those sets later when all threads are finished.

Why is 6-7 threads faster than 20?

In school we were introduced to C++11 threads. The teacher gave us a simple assessment to complete which was to make a basic web crawler using 20 threads. To me threading is pretty new, although I do understand the basics.
I would like to mention that I am not looking for someone to complete my assessment as it is already done. I only want to understand the reason why using 6 threads is always faster than using 20.
Please see code sample below.
main.cpp:
do
{
for (size_t i = 0; i < THREAD_COUNT; i++)
{
threads[i] = std::thread(SweepUrlList);
}
for (size_t i = 0; i < THREAD_COUNT; i++)
{
threads[i].join();
}
std::cout << std::endl;
WriteToConsole();
listUrl = listNewUrl;
listNewUrl.clear();
} while (listUrl.size() != 0);
Basically this assigns to each worker thread the job to complete which is the method SweepUrlList that can be found below and then join all thread.
while (1)
{
mutextGetNextUrl.lock();
std::set<std::string>::iterator it = listUrl.begin();
if (it == listUrl.end())
{
mutextGetNextUrl.unlock();
break;
}
std::string url(*it);
listUrl.erase(*it);
mutextGetNextUrl.unlock();
ExtractEmail(url, listEmail);
std::cout << ".";
}
So each worker thread loop until ListUrl is empty. ExtractEmail is a method that downloads the webpage (using curl) and parse it to extract emails from mailto links.
The only blocking call in ExtractEmail can be found below:
if(email.length() != 0)
{
mutextInsertNewEmail.lock();
ListEmail.insert(email);
mutextInsertNewEmail.unlock();
}
All answers are welcome and if possible links to any documentation you found to answer this question.
This is a fairly universal problem with threading, and at its core:
What you are demonstrating is thread Scheduling. The operating system is going to work with the various threads, and schedule work where there is currently not work.
Assuming you have 4 cores and hyper threading you have 8 processors that can carry the load, but also that of other applications (Operating System, C++ debugger, and your application to start).
In theory, you would probably be OK on performance up until about 8 intensive threads. After you reach the most threads your processor can effectively use, then threads begin to compete against each other for resources. This can be seen (especially with intensive applications and tight loops) by poor performance.
Finally, this is a simplified answer but I suspect what you are seeing.
The simple answer is choke points. Something that you are doing is causing a choke point. When this occurs there is a slow down. It could be in the number of active connections you are making to something, or merely the extra overhead of the number and memory size of the threads (see the below answer about cores being one of these chokes).
You will need to set up a series of monitors to investigate where your choke point is, and what needs to change in order to achieve scale. Many systems across every industry face this problem every day. Opening up the throttle at one end does not equal the same increase in the output at the other end. In cases it can decrease the output at the other end.
Take for example individuals leaving a hall. The goal is to get 100 people out of the building as quickly as possible. If single file produces a rate of 1 person every 1 second therefore 100 seconds to clear the building. We many be able to half that time by sending them out 2 abreast, so 50 seconds to clear the building. What if we then sent them out as 8 abreast. The door is only 2m wide, so with 8 abreast being equivalent to 4m, only 50% of the first row would make it through. The other 4 would then cause a blockage for the next row and so on. Depending on the rate, this could cause temporary blockages and increase the time 10 fold.
Threads are an operating system construct. Basically, each thread's state (which is basically all the CPU's registers and virtual memory mapping [which is a part of the process construct]) is saved by the operating system. Once the OS gives that specific thread "execution time" it restores this state and let it run. Once this time is finished, it has to save this state. The process of saving a specific thread's state and restoring another is called Context Switching, and it takes a significant amount of time (usually between a couple of hundreds to thousand of CPU cycles).
There are also additional penalties to context switching. Some of the processor's cache (like the virtual memory translation cache, called the TLB) has to be flushed, pipelining instruction to be discarded and more. Generally, you want to minimize context switching as much as possible.
If your CPU has 4 cores, than 4 threads can run simultaneously. If you try to run 20 threads on a 4 core system, then the OS has to manage time between those threads so it will seem like they run in parallel. E.g, threads 1-4 will run for 50 milliseconds, then 5-9 will run for 50 milliseconds, etc.
Therefore, if all of your threads are running CPU intensive operations, it is generally most efficient to make your program use the same amount of threads as cores (sometimes called 'processors' in windows). If you have more threads than cores, than context switching must happen, and it is overhead that can be minimized.
In general, more threads is not better. More threading provides value in two ways higher parallelism and less blocking. More threading hurts by higher memory, higher context switching and higher resource contention.
The value of more threads for higher parallelism is generally maximized between 1-2x the number of actual cores that you have available. If your threads are already CPU bound the maximum value is generally 1x number of cores.
The value of less blocking is much harder to quantify and depends on the type of work you are performing. If you are IO bound and your threads are primarily waiting for IO to be ready then a larger number of threads could be beneficial.
However if you have shared state between threads, or you are doing some form of message passing between threads then you will run into synchronization and contention issues. As the number of threads increases, the more these types of overhead as well as context switches dominates the time spent doing your task.
Amdahl's law is a useful measure to determine if higher parallelism will actually improve the total runtime of your job.
You also must be careful that your increased parallelism doesn't exceed some other resource like total memory or disk or network throughput. Once you have saturated the current bottleneck, you will not see improved performance by increasing the number of threads.
Before doing any performance tuning, it is important to understand what the dominant resource bottleneck is. There are lots of tools for doing system-wide resource monitoring. On Linux, one very useful tool is dstat. On Windows, you can use the Task Manager to monitor many of these resources.

Simple multi-threading confusion for C++

I am developing a C++ application in Qt.
I have a very basic doubt, please forgive me if this is too stupid...
How many threads should I create to divide a task amongst them for minimum time?
I am asking this because my laptop is 3rd gen i5 processor (3210m). So since it is dual core & NO_OF_PROCESSORS environment variable is showing me 4. I had read in an article that dynamic memory for an application is only available for that processor which launched that application. So should I create only 1 thread (since env variable says 4 processors) or 2 threads (since my processor is dual core & env variable might be suggesting the no of cores) or 4 threads (if that article was wrong)?
Please forgive me since I am a beginner level programmer trying to learn Qt.
Thank You :)
Although hyperthreading is somewhat of a lie (you're told that you have 4 cores, but you really only have 2 cores, and another two that only run on what resources the former two don't use, if there's such a thing), the correct thing to do is still to use as many threads as NO_OF_PROCESSORS tells you.
Note that Intel isn't the only one lying to you, it's even worse on recent AMD processors where you have 6 alleged "real" cores, but in reality only 4 of them, with resources shared among them.
However, most of the time, it just more or less works out. Even in absence of explicitly blocking a thread (on a wait function or a blocking read), there's always a point where a core is stalled, for example in accessing memory due to a cache miss, which gives away resources that can be used by the hyperthreaded core.
Therefore, if you have a lot of work to do, and you can parallelize it nicely, you should really have as many workers as there are advertized cores (whether they're "real" or "hyper"). This way, you make maximum use of the available processor resources.
Ideally, one would create worker threads early at application startup, and have a task queue to hand tasks to workers. Since synchronization is often non-neglegible, the task queue should be rather "coarse". There is a tradeoff in maximum core usage and synchronization overhead.
For example, if you have 10 million elements in an array to process, you might push tasks that refer to 100,000 or 200,000 consecutive elements (you will not want to push 10 million tasks!). That way, you make sure that no cores stay idle on the average (if one finishes earlier, it pulls another task instead of doing nothing) and you only have a hundred or so synchronizations, the overhead of which is more or less neglegible.
If tasks involve file/socket reads or other things that can block for indefinite time, spawning another 1-2 threads is often no mistake (takes a bit of experimentation).
This totally depends on your workload, if you have a workload which is very cpu intensive you should stay closer to the number of threads your cpu has(4 in your case - 2 core * 2 for hyperthreading). A small oversubscription might be also be ok, as that can compensate for times where one of your threads waits for a lock or something else.
On the other side, if your application is not cpu dependent and is mostly waiting, you can even create more threads than your cpu count. You should however notice that thread creation can be quite an overhead. The only solution is to measure were your bottleneck is and optimize in that direction.
Also note that if you are using c++11 you can use std::thread::hardware_concurrency to get a portable way to determine the number of cpu cores you have.
Concerning your question about dynamic memory, you must have misunderstood something there.Generally all threads you create can access the memory you created in your application. In addition, this has nothing to do with C++ and is out of the scope of the C++ standard.
NO_OF_PROCESSORS shows 4 because your CPU has Hyper-threading. Hyper-threading is the Intel trademark for tech that enables a single core to execute 2 threads of the same application more or less at the same time. It work as long as e.g. one thread is fetching data and the other one accessing the ALU. If both need the same resource and instructions can't be reordered, one thread will stall. This is the reason you see 4 cores, even though you have 2.
That dynamic memory is only available to one of the Cores is IMO not quite right, but register contents and sometimes cache content is. Everything that resides in the RAM should be available to all CPUs.
More threads than CPUs can help, depending on how you operating systems scheduler works / how you access data etc. To find that you'll have to benchmark your code. Everything else will just be guesswork.
Apart from that, if you're trying to learn Qt, this is maybe not the right thing to worry about...
Edit:
Answering your question: We can't really tell you how much slower/faster your program will run if you increase the number of threads. Depending on what you are doing this will change. If you are e.g. waiting for responses from the network you could increase the number of threads much more. If your threads are all using the same hardware 4 threads might not perform better than 1. The best way is to simply benchmark your code.
In an ideal world, if you are 'just' crunching numbers should not make a difference if you have 4 or 8 threads running, the net time should be the same (neglecting time for context switches etc.) just the response time will differ. The thing is that nothing is ideal, we have caches, your CPUs all access the same memory over the same bus, so in the end they compete for access to resources. Then you also have an operating system that might or might not schedule a thread/process at a given time.
You also asked for an Explanation of synchronization overhead: If all your threads access the same data structures, you will have to do some locking etc. so that no thread accesses the data in an invalid state while it is being updated.
Assume you have two threads, both doing the same thing:
int sum = 0; // global variable
thread() {
int i = sum;
i += 1;
sum = i;
}
If you start two threads doing this at the same time, you can not reliably predict the output: It might happen like this:
THREAD A : i = sum; // i = 0
i += 1; // i = 1
**context switch**
THREAD B : i = sum; // i = 0
i += 1; // i = 1
sum = i; // sum = 1
**context switch**
THREAD A : sum = i; // sum = 1
In the end sum is 1, not 2 even though you started the thread twice.
To avoid this you have to synchronize access to sum, the shared data. Normally you would do this by blocking access to sum as long as needed. Synchronization overhead is the time that threads would be waiting until the resource is unlocked again, doing nothing.
If you have discrete work packages for each thread and no shared resources you should have no synchronization overhead.
The easiest way to get started with dividing work among threads in Qt is to use the Qt Concurrent framework. Example: You have some operation that you want to perform on every item in a QList (pretty common).
void operation( ItemType & item )
{
// do work on item, changing it in place
}
QList<ItemType> seq; // populate your list
// apply operation to every member of seq
QFuture<void> future = QtConcurrent::map( seq, operation );
// if you want to wait until all operations are complete before you move on...
future.waitForFinished();
Qt handles the threading automatically...no need to worry about it. The QFuture documenation describes how you can handle the map completion asymmetrically with signals and slots if you need to do that.

how to design threading for many short tasks

I want to use multi-threads to accelerate my program, but not sure which way is optimal.
Say we have 10000 small tasks, it takes maybe only 0.1s to finish one of them. Now I have a CPU with 12 cores and I want to use 12 threads to make it faster.
So far as I know, there are two ways:
1.Tasks Pool
There are always 12 threads running, each of them get one new task from the tasks pool after it finished its current work.
2.Separate Tasks
By separating the 10000 tasks into 12 parts and each thread works on one part.
The problem is, if I use tasks pool it is a waste of time for lock/unlock when multiple threads try to access the tasks pool. But the 2nd way is not ideal because some of the threads finish early, the total time depends on the slowest thread.
I am wondering how you deal with this kind of work and any other best way to do it? Thank you.
EDIT: Please note that the number 10000 is just for example, in practice, it may be 1e8 or more tasks and 0.1 per task is also an average time.
EDIT2: Thanks for all your answers :] It is good to know kinds of options.
So one midway between the two approaches is to break into say 100 batches of 100 tasks each and let the a core pick a batch of 100 tasks at a time from the task pool.
Perhaps if you model the randomness in execution time in a single core for a single task, and get an estimate of mutex locking time, you might be able to find an optimal batch size.
But without too much work we at least have the following lemma :
The slowest thread can only take at max 100*.1 = 10s more than others.
Task pool is always the best solution here. It's not just optimum time, it's also comprehensibility of code. You should never force your tasks to conform to the completely unrelated criteria of having the same number of subtasks as cores - your tasks have nothing to do with that (in general), and such a separation doesn't scale when you change machines, etc. It requires overhead to collaborate on combining results in subtasks for the final task, and just generally makes an easy task hard.
But you should not be worrying about the use of locks for taskpools. There are lockfree queues available if you ever determined them necessary. But determine that first. If time is your concern, use the appropriate methods of speeding up your task, and put your effort where you will get the most benefit. Profile your code. Why do your tasks take 0.1 s? Do they use an inefficient algorithm? Can loop unrolling help? If you find the hotspots in your code through profiling, you may find that locks are the least of your worries. And if you find everything is running as fast as possible, and you want that extra second from removing locks, search the internet with your favorite search engine for "lockfree queue" and "waitfree queue". Compare and swap makes atomic lists easy.
Both ways suggested in the question will perform well and similarly to each another (in simple cases with predictable and relatively long duration of the tasks). If the target system type is known and available (and if performance is really a top concern), the approach should be chosen based on prototyping and measurements.
Do not necessarily prejudice yourself as to the optimal number of threads matching the number of the cores. If this is a regular server or desktop system, there will be various system processes kicking in here and then and you may see your 12 threads variously floating between processors which hurts memory caching.
There are also crucial non-measurement factors you should check: do those small tasks require any resources to execute? Do these resources impose additional potential delays (blocking) or competition? Are there additional apps competing for the CPU power? Will the application need to be grow to accommodate different execution environments, task types, or user interaction models?
If the answer to all is negative, here are some additional approaches that you can measure and consider.
Use only 10 or 11 threads. You will observe a small slowdown, or even
a small speedup (the additional core will serve OS processes, so that
thread affinity of the rest will become more stable compared to 12
threads). Any concurrent interactive activity on the system will see
a big boost in responsiveness.
Create exactly 12 threads but explicitly set a different processor
affinity mask to each, to impose a 1-1 mapping between threads and processors.
This is good in the simplest near-academical case
where there are no resources other than CPU and shared memory
involved; you will see no chronic migration of threads across
processes. The drawback is an
algorithm closely coupled to a particular machine; on another machine
it could behave so poorly as to finish never at all (because of an
unrelated real time task that
blocks one of your threads forever).
Create 12 threads and split the tasks evenly. Have each thread
downgrade its own priority once it is past 40% and again once it is
past 80% of its load. This will improve load balancing inside your
process, but it will behave poorly if your application is competing
with other CPU-bound processes.
100ms/task - pile 'em on as they are - pool overhead will be insignificant.
OTOH..
1E8 tasks # 0.1s/task = 10,000,000 seconds
= 2777.7r hours
= 115.7 days
That's much more than the interval between patch Tuesday reboots.
Even if you run this on Linux, you should batch up the output and flush it to disk in such a manner that the job is restartable.
Is there a database involved? If so, you should have told us!
Each working thread may have its own small task queue with the capacity of no more than one or two memory pages. When the queue size becomes low (a half of capacity) it should send a signal to some manager thread to populate it with more tasks. If queue is organized in batches then working threads do not need to enter critical sections as long as current batch is not empty. Avoiding critical sections will give you extra cycles for actual job. Two batches per queue are enough, and in this case one batch can take one memory page, and so queue takes two.
The point of memory pages is that thread does not have to jump all over the memory to fetch data. If all data are in one place (one memory page) you avoid cache misses.

What is the best way to determine the number of threads to fire off in a machine with n cores? (C++)

I have a vector<int> with 10,000,000 (10 million) elements, and that my workstation has four cores. There is a function, called ThrFunc, that operates on an integer. Assume that the runtime for ThrFunc for each integer in the vector<int> is roughly the same.
How should I determine the optimal number of threads to fire off? Is the answer as simple as the number of elements divided by the number of cores? Or is there a more subtle computation?
Editing to provide extra information
No need for blocking; each function invocation needs only read-only
access
The optimal number of threads is likely to be either the number of cores in your machine or the number of cores times two.
In more abstract terms, you want the highest possible throughput. Getting the highest throughput requires the fewest contention points between the threads (since the original problem is trivially parallelizable). The number of contention points is likely to be the number of threads sharing a core or twice that, since a core can either run one or two logical threads (two with hyperthreading).
If your workload makes use of a resource of which you have fewer than four available (ALUs on Bulldozer? Hard disk access?) then the number of threads you should create will be limited by that.
The best way to find out the correct answer is, with all hardware questions, to test and find out.
Borealid's answer includes test and find out, which is impossible to beat as advice goes.
But there's perhaps more to testing this than you might think: you want your threads to avoid contention for data wherever possible. If the data is entirely read-only, then you might see best performance if your threads are accessing "similar" data -- making sure to walk through the data in small blocks at a time, so each thread is accessing data from the same pages over and over again. If the data is completely read-only, then there is no problem if each core gets its own copy of the cache lines. (Though this might not make the most use of each core's cache.)
If the data is in any way modified, then you will see significant performance enhancements if you keep the threads away from each other, by a lot. Most caches store data along cache lines, and you desperately want to keep each cache line from bouncing among CPUs for good performance. In that case, you might want to keep the different threads running on data that is actually far apart to avoid ever running into each other.
So: if you're updating the data while working on it, I'd recommend having N or 2*N threads of execution (for N cores), starting them with SIZE/N*M as their starting point, for threads 0 through M. (0, 1000, 2000, 3000, for four threads and 4000 data objects.) This will give you the best chance of feeding different cache lines to each core and allowing updates to proceed without cache line bouncing:
+--------------+---------------+--------------+---------------+--- ...
| first thread | second thread | third thread | fourth thread | first ...
+--------------+---------------+--------------+---------------+--- ...
If you're not updating the data while working on it, you might wish to start N or 2*N threads of execution (for N cores), starting them with 0, 1, 2, 3, etc.. and moving each one forward by N or 2*N elements with each iteration. This will allow the cache system to fetch each page from memory once, populate the CPU caches with nearly identical data, and hopefully keep each core populated with fresh data.
+-----------------------------------------------------+
| 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 ... |
+-----------------------------------------------------+
I also recommend using sched_setaffinity(2) directly in your code to force the different threads to their own processors. In my experience, Linux aims to keep each thread on its original processor so much it will not migrate tasks to other cores that are otherwise idle.
Assuming ThrFunc is CPU-bound then you want probably one thread per core, and divide the elements between them.
If there's an I/O element to the function then the answer is more complicated, because you can have one or more threads per core waiting for I/O while another is executing. Do some tests and see what happens.
I agree with the previous comments. You should run tests to determine what number yields the best performance. However, this will only yield the best performance for the particular system you're optimizing for. In most scenarios, your program will be run on other people's machines, on the architecture of which you should not make too many assumptions.
A good way to numerically determine the number of threads to start would be to use
std::thread::hardware_concurrency()
This is part of the C++11 and should yield the number of logical cores in the current system. Logical cores means either the physical number of cores - in case the processor does not support hardware threads (ie HyperThreading) - or the number of hardware threads.
There's also a Boost-function that does the same, see Programmatically find the number of cores on a machine.
The optimal number of threads should equal the number of cores, in which situation the computation capacity of each core will be fully utilized, if the computation on each element is independently.
The optimal number of cores (threads) will probably be determined by when you achieve saturation of the memory system (caches and RAM). Another factor that could come into play is that of inter-core locking (locking a memory area that other cores might want to access, updating it and then unlocking it) and how efficient it is (how long the lock is in place and how often it is locked/unlocked).
A single core running a generic software whose code and data are not optmized for multi-core will come close to saturating memory all by itself. Adding more cores will, in such a scenario, result in a slower application.
So unless your code economizes heavily on memory accesses I'd guess the answer to your question is one (1).
I've found a real world example I'll put here for the ones who want a less technical / more intuitional answer:
Having multiple threads per core is like having two queues in an airport for each scanner(which people on both queues eventually have to pass through).
Two people at a time can put their baggage on the conveyer belt, but only one at a time can pass through the scanner. Now at this point, obviously there's a contention point at the entrance of the scanner, but what happens in reality is most of the times both queues function very well.
In this example, the queues represent threads and the scanner is the main functions of a core. As a general rule of thumb, the impact of each thread is 1.25th a core, i.e., it's not like having an entire new core. So if the task is CPU-bound slightly over the number of available processors is probably best.
But notice that if the task is IO-Bound, where threads will be spending most of their time waiting for external resources such as database connections, file systems, or other external sources of data, then you can assign (many) more threads than the number of available processors.
Source1, Source2