Determining Optimum Thread Count - c++

So, as part of a school assignment, we are being asked to determine what our optimum thread count is for our personal computers by constructing a toy program.
To start, we are to create a task that takes between 20 and 30 seconds to run. I chose to do a coin toss simulation, where the total number of heads and tails are accumulated and then displayed. On my machine, 300,000,000 tosses on one thread ended up at 25 seconds. After that, I went to 2 threads, then 4, then 8, 16, 32, and, just for fun, 100.
Here are the results:
Threads   Tosses per thread   Time (seconds)
-------   -----------------   --------------
      1         300,000,000               25
      2         150,000,000               13
      4          75,000,000               13
      8          37,500,000               13
     16          18,750,000               14
     32           9,375,000               14
    100           3,000,000               14
And here is the code I'm using:
#include <iostream>
#include <random>
#include <thread>
#include <vector>
#include <ctime>
using namespace std;

void toss()
{
    int heads = 0, tails = 0;
    default_random_engine gen;               // default-seeded (same sequence in every thread)
    uniform_int_distribution<int> dist(0, 1);
    int max = 3000000;                        // tosses per thread
    for (int x = 0; x < max; ++x) { (dist(gen)) ? ++heads : ++tails; }
    cout << heads << " " << tails << endl;
}

int main()
{
    vector<thread> thr;
    time_t st, fin;
    st = time(0);
    for (int i = 0; i < 100; ++i) { thr.push_back(thread(toss)); }   // thread count
    for (auto& t : thr) { t.join(); }
    fin = time(0);
    cout << fin - st << " seconds\n";
    return 0;
}
Now for the main question:
Past a certain point, I would've expected there to be a considerable decline in computing speed as more threads were added, but the results don't seem to show that.
Is there something fundamentally wrong with my code that would yield these sorts of results, or is this behavior considered normal? I'm very new to multi-threading, so I have a feeling it's the former....
Thanks!
EDIT: I am running this on a macbook with a 2.16 GHz Core 2 Duo (T7400) processor

Your results seem very normal to me. While thread creation has a cost, it's not that much (especially compared to the per-second granularity of your tests). An extra 100 thread creations, destructions, and possible context switches isn't going to change your timing by more than a few milliseconds, I'd bet.
Running on my Intel i7-4790 @ 3.60 GHz I get these numbers:
threads - seconds
-----------------
    1 -  6.021
    2 -  3.205
    4 -  1.825
    8 -  1.062
   16 -  1.128
   32 -  1.138
  100 -  1.213
 1000 -  2.312
10000 - 23.319
It takes many, many more threads to get to the point at which the extra threads make a noticeable difference. Only when I get to 1,000 threads do I see that the thread-management has made a significant difference and at 10,000 it dwarfs the loop (the loop is only doing 30,000 tosses at that point).
As for your assignment, it should be fairly straightforward to see that the optimal number of threads for your system is the number of hardware threads it can execute at once. There's no processing power left to execute another thread until one of those is either done or yields, so extra threads don't help you finish faster; and with fewer threads you aren't using all the available resources. My CPU supports 8 threads and the chart reflects that.

Edit 2 - To further elaborate on the "lack of performance penalty" part due to popular demand:
...I would've expected there to be a considerable decline in computing
speed as more threads were added, but the results don't seem to show
that.
I made this giant chart in order to better illustrate the scaling.
To explain the results:
The blue bar illustrates the total time to do all the tosses. Although that time keeps decreasing all the way up to 256 threads, the gain from each doubling of the thread count gets smaller and smaller. The CPU I ran this test on has 4 physical and 8 logical cores. Scaling is pretty good up to 4 cores and decent up to 8 cores, and then it plummets. Pipeline saturation allows minor gains all the way to 256, but it is simply not worth it.
The red bar illustrates the time per toss. It is nearly identical for 1 and 2 threads, as the CPU pipeline hasn't reached full saturation yet. It takes a minor hit at 4 threads: it still runs fine, but now the pipeline is saturated. At 8 threads it really shows that logical cores are not the same thing as physical cores, and that gets progressively worse pushing above 8 threads.
The green bar illustrates the overhead, i.e. how much lower actual performance is relative to the expected doubling from doubling the threads. Pushing above the available logical cores causes the overhead to skyrocket. Note that this is mostly thread synchronization; the actual thread-scheduling overhead is probably constant after a given point. There is a minimal window of activity time a thread must receive, which explains why thread switching never gets to the point of overwhelming the work throughput. In fact there is no severe performance drop all the way up to 4k threads, which is expected, since modern systems have to be able to run well over a thousand threads in parallel. And again, most of that drop is due to thread synchronization, not thread switching.
The black outline bar illustrates the time difference relative to the lowest time. At 8 threads we only lose ~14% of absolute performance by not oversaturating the pipeline, which is a good thing, because in most cases it is not really worth stressing the entire system over so little. It also shows that 1 thread is only ~6 times slower than the maximum the CPU can pull off, which gives a figure of how good logical cores are compared to physical cores: 100% extra logical cores give a 50% boost in performance, so in this use case a logical core is ~50% as good as a physical core, which also correlates with the ~47% boost we see going from 4 to 8 threads. Note that this is a very simple workload, though; in more demanding cases the boost is closer to 20-25% for this particular CPU, and in some edge cases there is actually a performance hit.
Edit 1 - I foolishly forgot to isolate the computational workload from the thread synchronization workload.
Running the test with little to no work reveals that for high thread counts the thread-management part takes the bulk of the time. Thus the thread-switching penalty is indeed very small, and possibly becomes a constant after a certain point.
And it would make a lot of sense if you put yourself in the shoes of a thread-scheduler author. The scheduler can easily be protected from being choked by an unreasonably high switching-to-working ratio, so there is likely a minimal window of time the scheduler will give to a thread before switching to another, while the rest are put on hold. This ensures that the switching-to-working ratio never exceeds the limits of what is reasonable. Stalling other threads is much better than going crazy with thread switching, where the CPU would mostly be switching and doing very little actual work.
The optimal thread count is the available amount of logical CPU cores. This achieves optimal pipeline saturation.
If you use more you will suffer performance degradation due to the cost of thread context switching. The more threads, the more penalty.
If you use less, you will not be utilizing the full hardware potential.
There is also the problem of workload granularity, which becomes very important when you use synchronization such as a mutex. If your concurrency is too finely grained you can experience performance drops even when going from 1 to 2 threads on an 8-thread machine. You want to reduce synchronization as much as possible and do as much work as possible between synchronizations, otherwise you can experience huge performance drops.
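A hedged illustration of the granularity point (the counters and batch structure below are made up for the example, not taken from the question): taking the mutex once per item versus once per batch.

#include <mutex>

std::mutex m;
long shared_total = 0;

// Fine-grained: the lock is taken once per item, so threads mostly contend on it.
void fine_grained(int items)
{
    for (int i = 0; i < items; ++i) {
        std::lock_guard<std::mutex> lock(m);
        ++shared_total;
    }
}

// Coarse-grained: each thread accumulates locally and synchronizes once.
void coarse_grained(int items)
{
    long local = 0;
    for (int i = 0; i < items; ++i)
        ++local;
    std::lock_guard<std::mutex> lock(m);
    shared_total += local;
}

The coarse-grained version does the same total work but keeps the critical section tiny relative to the computation, which is what lets it scale.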
Note the difference between physical and logical CPU cores. Processors with hyper-threading have more than one logical core per physical core. The "secondary" logical cores do not have the same computational power as the "primary" ones; they merely fill vacancies in the processor pipeline.
For example, if you have a 4-core, 8-thread CPU, a perfectly scaling workload will show a 4x increase in performance going from 1 to 4 threads, but a lot less going from 4 to 8 threads, as is evident from vu1p3n0x's answer.
You can look here for ways to determine the number of available CPU cores.

Related

Why is 6-7 threads faster than 20?

In school we were introduced to C++11 threads. The teacher gave us a simple assessment to complete which was to make a basic web crawler using 20 threads. To me threading is pretty new, although I do understand the basics.
I would like to mention that I am not looking for someone to complete my assessment as it is already done. I only want to understand the reason why using 6 threads is always faster than using 20.
Please see code sample below.
main.cpp:
do
{
    for (size_t i = 0; i < THREAD_COUNT; i++)
    {
        threads[i] = std::thread(SweepUrlList);
    }
    for (size_t i = 0; i < THREAD_COUNT; i++)
    {
        threads[i].join();
    }
    std::cout << std::endl;

    WriteToConsole();
    listUrl = listNewUrl;
    listNewUrl.clear();
} while (listUrl.size() != 0);
Basically this assigns to each worker thread the job to complete, which is the SweepUrlList method shown below, and then joins all the threads.
while (1)
{
    mutextGetNextUrl.lock();
    std::set<std::string>::iterator it = listUrl.begin();
    if (it == listUrl.end())
    {
        mutextGetNextUrl.unlock();
        break;
    }
    std::string url(*it);
    listUrl.erase(*it);
    mutextGetNextUrl.unlock();

    ExtractEmail(url, listEmail);
    std::cout << ".";
}
So each worker thread loops until listUrl is empty. ExtractEmail is a method that downloads the webpage (using curl) and parses it to extract emails from mailto links.
The only blocking call in ExtractEmail can be found below:
if (email.length() != 0)
{
    mutextInsertNewEmail.lock();
    ListEmail.insert(email);
    mutextInsertNewEmail.unlock();
}
All answers are welcome and if possible links to any documentation you found to answer this question.
This is a fairly universal problem with threading, and at its core:
What you are demonstrating is thread scheduling. The operating system juggles the various threads and schedules work wherever a core is currently free.
Assuming you have 4 cores and hyper-threading, you have 8 logical processors that can carry the load, but they also carry the load of other applications (the operating system, the C++ debugger, and your application, to start).
In theory, you would probably be OK on performance up until about 8 intensive threads. After you reach the most threads your processor can effectively use, threads begin to compete against each other for resources. This can be seen (especially with intensive applications and tight loops) as poor performance.
Finally, this is a simplified answer, but I suspect it is what you are seeing.
The simple answer is choke points. Something you are doing is causing a choke point, and when this occurs there is a slowdown. It could be the number of active connections you are making to something, or merely the extra overhead of the number and memory footprint of the threads (see the answer below about cores being one of these chokes).
You will need to set up a series of monitors to investigate where your choke point is and what needs to change in order to achieve scale. Many systems across every industry face this problem every day. Opening up the throttle at one end does not produce the same increase in output at the other end; in some cases it can decrease the output at the other end.
Take, for example, people leaving a hall. The goal is to get 100 people out of the building as quickly as possible. If single file produces a rate of 1 person per second, it takes 100 seconds to clear the building. We may be able to halve that time by sending them out 2 abreast, so 50 seconds to clear the building. What if we then sent them out 8 abreast? The door is only 2 m wide, so with 8 abreast being equivalent to 4 m, only 50% of the first row would make it through. The other 4 would then cause a blockage for the next row, and so on. Depending on the rate, this could cause temporary blockages and increase the time tenfold.
Threads are an operating system construct. Basically, each thread's state (essentially all of the CPU's registers plus the virtual memory mapping, which is part of the process construct) is saved by the operating system. Once the OS gives a specific thread some execution time, it restores this state and lets it run; once that time is up, it has to save the state again. The process of saving one thread's state and restoring another's is called context switching, and it takes a significant amount of time (usually between a few hundred and a few thousand CPU cycles).
There are also additional penalties to context switching. Some of the processor's caches (like the virtual memory translation cache, the TLB) have to be flushed, pipelined instructions are discarded, and more. Generally, you want to minimize context switching as much as possible.
If your CPU has 4 cores, then 4 threads can run simultaneously. If you try to run 20 threads on a 4-core system, the OS has to time-slice between those threads so that they appear to run in parallel. E.g., threads 1-4 will run for 50 milliseconds, then threads 5-8 will run for 50 milliseconds, and so on.
Therefore, if all of your threads are running CPU-intensive operations, it is generally most efficient to use the same number of threads as cores (sometimes called "logical processors" in Windows). If you have more threads than cores, then context switching must happen, and that is overhead that can be minimized.
In general, more threads is not better. More threading provides value in two ways: higher parallelism and less blocking. More threading hurts through higher memory use, more context switching, and more resource contention.
The value of more threads for higher parallelism is generally maximized between 1x and 2x the number of actual cores you have available. If your threads are already CPU bound, the maximum value is generally 1x the number of cores.
The value of less blocking is much harder to quantify and depends on the type of work you are performing. If you are IO bound and your threads are primarily waiting for IO to be ready, then a larger number of threads could be beneficial.
However, if you have shared state between threads, or you are doing some form of message passing between threads, then you will run into synchronization and contention issues. As the number of threads increases, these types of overhead, as well as context switches, increasingly dominate the time spent doing your task.
Amdahl's law is a useful measure to determine if higher parallelism will actually improve the total runtime of your job.
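For reference, here is a minimal sketch of that formula (not code from any answer above): if a fraction p of the work parallelizes perfectly, Amdahl's law bounds the speedup on n threads at 1 / ((1 - p) + p / n).

#include <cstdio>

// Best possible speedup on n threads when a fraction p of the work is parallel.
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main()
{
    // Example: even at 95% parallel work, 16 threads only give ~9x.
    const int counts[] = {1, 2, 4, 8, 16};
    for (int n : counts)
        std::printf("n=%2d  speedup=%.2f\n", n, amdahl_speedup(0.95, n));
}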
You also must be careful that your increased parallelism doesn't exceed some other resource like total memory or disk or network throughput. Once you have saturated the current bottleneck, you will not see improved performance by increasing the number of threads.
Before doing any performance tuning, it is important to understand what the dominant resource bottleneck is. There are lots of tools for doing system-wide resource monitoring. On Linux, one very useful tool is dstat. On Windows, you can use the Task Manager to monitor many of these resources.

How can I measure how my multithreaded code scales (speedup)?

What would be the best way to measure the speedup of my program assuming I only have 4 cores? Obviously I could measure it up to 4, however it would be nice to know for 8, 16, and so on.
Ideally I'd like to know the amount of speedup per number of thread, similar to this graph:
Is there any way I can do this? Perhaps a method of simulating multiple cores?
I'm sorry, but in my opinion, the only reliable measurement is to actually get an 8, 16 or more cores machine and test on that.
Memory bandwidth saturation, number of CPU functional units and other hardware bottlenecks can have a huge impact on scalability. I know from personal experience that if a program scales on 2 cores and on 4 cores, it might dramatically slow down when run on 8 cores, simply because it's not enough to have 8 cores to be able to scale 8x.
You could try to predict what will happen, but there are a lot of factors that need to be taken into account:
caches - size, number of layers, shared / non-shared
memory bandwidth
number of cores vs. number of processors i.e. is it an 8-core machine or a dual-quad-core machine
interconnection between cores - a lower number of cores (2, 4) can still work reasonably well with a bus, but for 8 or more cores a more sophisticated interconnection is needed.
memory access - again, a lower number of cores works well with the SMP (symmetric multiprocessing) model, while a higher number of cores needs a NUMA (non-uniform memory access) model.
I don't think there is a real way to do this either, but one thing that comes to mind is that you could use a virtual machine to simulate more cores. In VirtualBox, for example, you can select up to 16 cores from the standard menu, but I am fairly confident there are hacks that can go beyond that, and other virtual machines such as VMware might even support more out of the box.
bamboon and doron are correct that many variables are at play, but if you have a tunable input size n, you can figure out the strong scaling and weak scaling of your code.
Strong scaling refers to fixing the problem size (e.g. n = 1M) and varying the number of threads available for computation. Weak scaling refers to fixing the problem size per thread (n = 10k/thread) and varying the number of threads available for computation.
It's true there are a lot of variables at work in any program; however, if you have some basic input size n, it's possible to get some semblance of scaling. On an n-body simulator I developed a few years back, I varied the thread count for a fixed size and the input size per thread, and was able to calculate a reasonably rough measure of how well the multithreaded code scaled.
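As an illustration only (the kernel, iteration count, and thread counts below are placeholders, not taken from the question), a strong-scaling run keeps the total work fixed and sweeps the thread count:

#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Placeholder kernel; volatile keeps the loop from being optimized away.
void do_work(long iterations)
{
    volatile long sink = 0;
    for (long i = 0; i < iterations; ++i)
        sink = sink + 1;
}

int main()
{
    const long total = 400000000L;            // fixed problem size (strong scaling)
    const int counts[] = {1, 2, 4, 8, 16};
    for (int t : counts) {
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> threads;
        for (int i = 0; i < t; ++i)
            threads.emplace_back(do_work, total / t);
        for (auto& th : threads)
            th.join();
        std::chrono::duration<double> sec = std::chrono::steady_clock::now() - start;
        std::printf("%2d threads: %.3f s\n", t, sec.count());
    }
}

For weak scaling you would pass a fixed per-thread amount (e.g. `total`) to every thread instead of `total / t`.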
Since you only have 4 cores, you can only feasibly compute the scaling up to 4 threads. This severely limits your ability to see how well it scales to largely threaded loads. But this may not be an issue if your application is only used on machines where there are small core counts.
You really need to ask yourself the question: Is this going to be used on 10, 20, 40+ threads? If it is, the only way to accurately determine scaling to those regimes is to actually benchmark it on a platform where you have that hardware available.
Side note: depending on your application, it may not matter that you only have 4 cores. Some workloads scale with increasing threads regardless of the real number of cores available, when many of those threads spend time "waiting" for something to happen (e.g. web servers). If you're doing pure computation, though, this won't be the case.
I don't believe this is possible, since there are too many variables to be able to accurately extrapolate performance, even assuming you are 100% parallel. There are other factors, like bus speed and cache misses, that might limit your performance, not to mention peripheral performance. How all of these factors affect your code can only be determined through measuring on your specific hardware platform.
I take it you are asking about measurement, so I won't address the issue of predicting the effect on higher numbers of cores.
This question can be viewed another way: how busy can you keep each thread, and what do they total up to? Six threads, each running at say 50% utilization, means you have 3 equivalent processors running. Dividing that by, say, four processors means that your method is achieving 75% utilization. Comparing that utilization against the clock-time of the actual speedup tells you how much of your utilization is new overhead and how much is real speedup. Isn't that what you are really interested in?
The processor utilization can be computed in real time a couple of different ways. Threads can independently ask the system for their thread times, compute ratios, and maintain global totals. If you have total control over your blocking states, you don't even need the system calls, because you can just keep track of the ratio of blocking to non-blocking machine cycles to compute utilization. A real-time multithreading instrumentation package I developed uses such methods, and they work well. The CPU clock counter in newer CPUs can be read in under 20 machine cycles.
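A rough sketch of the utilization idea, under the assumption that std::clock() reports CPU time summed over all threads (true on most POSIX systems, not on every platform); the thread count is a stand-in and the workload is elided:

#include <chrono>
#include <cstdio>
#include <ctime>

int main()
{
    const int threads_used = 4;                      // assumed number of worker threads
    std::clock_t c0 = std::clock();                  // process CPU time
    auto w0 = std::chrono::steady_clock::now();      // wall-clock time

    // ... run the multithreaded workload here ...

    double cpu_sec  = double(std::clock() - c0) / CLOCKS_PER_SEC;
    double wall_sec = std::chrono::duration<double>(
                          std::chrono::steady_clock::now() - w0).count();
    std::printf("average utilization ~ %.0f%%\n",
                100.0 * cpu_sec / (wall_sec * threads_used));
}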

CPU speed and threads in C++

I have the following C++ program:
#include <boost/thread.hpp>
#include <boost/timer.hpp>

void testSpeed(int start, int end)
{
    int temp = 0;
    for (int i = start; i < end; i++)
    {
        temp++;
    }
}

int main()
{
    using namespace boost;
    timer aTimer;

    // start two new threads that call the "testSpeed" function
    boost::thread my_thread1(&testSpeed, 0, 500000000);
    boost::thread my_thread2(&testSpeed, 500000000, 1000000000);

    // wait for both threads to finish
    my_thread1.join();
    my_thread2.join();

    double elapsedSec = aTimer.elapsed();
    double IOPS = 1 / elapsedSec;
}
So the idea is to test the CPU speed in terms of integer operations per second (IOPS).
There are 1 billion iterations (operations) in total, so on a 1 GHz CPU we should get around a billion integer operations per second, I believe.
My assumption is that more threads = more integer operations per second. But the more threads I try the less operations per second I see (I have more cores than threads).
What may causing such a behavior? Is it the threads overhead? Maybe I should try a much longer experiment to see if the threads actually help?
Thank you!
UPDATE:
So I changed the loop to run 18 billions times, and declared temp as volatile. Also added another testSpeed method with different name so now a single threaded executes both methods one after another, while two threads get each one method; so there shouldn't be any sync' issues, etc. And... still no change in behavior! single threaded is faster according to timer. Ahhh! I found the sucker, apparently timer is bluffing. The two threads take half the time to finish but timer tells me the single threaded run was two seconds faster. I'm now trying to understand why... Thanks everyone!
I am almost certain that the compiler optimizes away your loops. Since you do not subtract the overhead of creating/synchronizing the threads, you actually measure only that. So the more threads you have, the more overhead you create and the more time it takes.
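One common way to keep the loop from being discarded, in the spirit of what the asker ended up doing in the update (volatile here is an illustration of the idea, not a complete benchmarking fix):

void testSpeed(int start, int end)
{
    volatile int temp = 0;          // volatile forces each increment to actually be emitted
    for (int i = start; i < end; i++)
        temp = temp + 1;
}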
Overall, you can refer to the documentation of your CPU to find out its frequency and how many cycles any given instruction takes. Testing it yourself with an approach like this is nearly impossible and is, well, useless, because of overheads like context switches, transferring execution from one CPU/core to another, scheduler swap-outs, and branch misprediction. In real life you will also encounter cache misses and a lot of memory bus latency, since no program fits into ~15 registers. So you are better off testing a real program using a good profiler. For example, the latest CPUs can report CPU stalls, cache misses, branch mispredictions, and a lot more. You can use a good profiler to decide when and how to parallelize your program as well.
As the number of threads increases beyond a certain point, the number of cache misses rises (the cache is being shared among the threads), but at the same time memory access latency is being masked by the large number of threads (while one thread is waiting for data to be fetched from memory, other threads are running). Hence there is a trade-off. Here is an interesting paper on the subject.
According to this paper, on a multi-core machine when the number of threads is very low (of the order of number of cores), the performance will increase on increasing the number of threads, because now the cores are being fully utilized.
After that, a further increase in the number of threads leads to the effect of cache misses dominating, thus leading to a degradation in the performance.
If the number of threads becomes very large, such that the amount of cache storage per thread is almost zero, all memory accesses are made from main memory. But at the same time, the increased number of threads is also very effectively masking the increased memory access latency. This time the second effect dominates, leading to an increase in performance.
Thus the valley in the middle is the region with the worst performance.

What is the best way to determine the number of threads to fire off in a machine with n cores? (C++)

I have a vector<int> with 10,000,000 (10 million) elements, and my workstation has four cores. There is a function, called ThrFunc, that operates on an integer. Assume that the runtime of ThrFunc for each integer in the vector<int> is roughly the same.
How should I determine the optimal number of threads to fire off? Is the answer as simple as the number of elements divided by the number of cores? Or is there a more subtle computation?
Editing to provide extra information:
No need for blocking; each function invocation needs only read-only access.
The optimal number of threads is likely to be either the number of cores in your machine or the number of cores times two.
In more abstract terms, you want the highest possible throughput. Getting the highest throughput requires the fewest contention points between the threads (since the original problem is trivially parallelizable). The number of contention points is likely to be the number of threads sharing a core or twice that, since a core can either run one or two logical threads (two with hyperthreading).
If your workload makes use of a resource of which you have fewer than four available (ALUs on Bulldozer? Hard disk access?) then the number of threads you should create will be limited by that.
The best way to find out the correct answer is, with all hardware questions, to test and find out.
Borealid's answer includes test and find out, which is impossible to beat as advice goes.
But there's perhaps more to testing this than you might think: you want your threads to avoid contention for data wherever possible. If the data is entirely read-only, then you might see best performance if your threads are accessing "similar" data -- making sure to walk through the data in small blocks at a time, so each thread is accessing data from the same pages over and over again. If the data is completely read-only, then there is no problem if each core gets its own copy of the cache lines. (Though this might not make the most use of each core's cache.)
If the data is in any way modified, then you will see significant performance enhancements if you keep the threads away from each other, by a lot. Most caches store data along cache lines, and you desperately want to keep each cache line from bouncing among CPUs for good performance. In that case, you might want to keep the different threads running on data that is actually far apart to avoid ever running into each other.
So: if you're updating the data while working on it, I'd recommend having N or 2*N threads of execution (for N cores), starting thread M at offset SIZE/N*M (0, 1000, 2000, 3000 for four threads and 4000 data objects). This will give you the best chance of feeding different cache lines to each core and allowing updates to proceed without cache-line bouncing:
+--------------+---------------+--------------+---------------+--- ...
| first thread | second thread | third thread | fourth thread | first ...
+--------------+---------------+--------------+---------------+--- ...
If you're not updating the data while working on it, you might wish to start N or 2*N threads of execution (for N cores), starting them at 0, 1, 2, 3, etc. and moving each one forward by N or 2*N elements with each iteration. This will allow the cache system to fetch each page from memory once, populate the CPU caches with nearly identical data, and hopefully keep each core fed with fresh data.
+-----------------------------------------------------+
| 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 ... |
+-----------------------------------------------------+
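A sketch of the two walking orders described above, for N threads over a std::vector; do_work is a hypothetical per-element function, not something from the question:

#include <cstddef>
#include <vector>

void do_work(int element);   // hypothetical per-element processing

// Block partitioning: thread m owns one contiguous chunk (good when threads
// modify the data, since their cache lines stay far apart).
void block_walk(const std::vector<int>& data, int m, int N)
{
    std::size_t chunk = data.size() / N;
    std::size_t begin = m * chunk;
    std::size_t end   = (m == N - 1) ? data.size() : begin + chunk;
    for (std::size_t i = begin; i < end; ++i)
        do_work(data[i]);
}

// Strided partitioning: thread m takes elements m, m+N, m+2N, ... (suited to
// read-only data, since every core works on the same freshly fetched lines).
void strided_walk(const std::vector<int>& data, int m, int N)
{
    for (std::size_t i = m; i < data.size(); i += N)
        do_work(data[i]);
}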
I also recommend using sched_setaffinity(2) directly in your code to force the different threads onto their own processors. In my experience, Linux aims to keep each thread on its original processor so strongly that it will not migrate tasks to other cores that are otherwise idle.
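A minimal Linux-only sketch of such pinning, using the non-portable pthread affinity call on a std::thread's native handle rather than raw sched_setaffinity(2); core numbering and error handling are kept trivial:

#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>

// Pin an already-created std::thread to a specific core (Linux-specific).
void pin_to_core(std::thread& t, int core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    if (pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset) != 0)
        std::fprintf(stderr, "failed to pin thread to core %d\n", core);
}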
Assuming ThrFunc is CPU-bound, you probably want one thread per core, with the elements divided between them.
If there's an I/O element to the function then the answer is more complicated, because you can have one or more threads per core waiting for I/O while another is executing. Do some tests and see what happens.
I agree with the previous comments. You should run tests to determine what number yields the best performance. However, this will only yield the best performance for the particular system you're optimizing for. In most scenarios, your program will be run on other people's machines, on the architecture of which you should not make too many assumptions.
A good way to numerically determine the number of threads to start would be to use
std::thread::hardware_concurrency()
This is part of C++11 and should yield the number of logical cores in the current system. "Logical cores" means either the physical number of cores, in case the processor does not support hardware threads (i.e. HyperThreading), or the number of hardware threads.
There's also a Boost-function that does the same, see Programmatically find the number of cores on a machine.
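A small usage sketch; note that hardware_concurrency() is allowed to return 0 when the value cannot be determined, so a fallback (the 4 below is arbitrary, not a recommendation) is worth having:

#include <thread>

unsigned pick_thread_count()
{
    unsigned n = std::thread::hardware_concurrency();
    return (n != 0) ? n : 4;   // arbitrary fallback when the count is unknown
}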
The optimal number of threads should equal the number of cores; in that situation the computation capacity of each core will be fully utilized, provided the computation on each element is independent.
The optimal number of cores (threads) will probably be determined by when you achieve saturation of the memory system (caches and RAM). Another factor that could come into play is that of inter-core locking (locking a memory area that other cores might want to access, updating it and then unlocking it) and how efficient it is (how long the lock is in place and how often it is locked/unlocked).
A single core running generic software whose code and data are not optimized for multi-core will come close to saturating memory all by itself. Adding more cores will, in such a scenario, result in a slower application.
So unless your code economizes heavily on memory accesses I'd guess the answer to your question is one (1).
I've found a real world example I'll put here for the ones who want a less technical / more intuitional answer:
Having multiple threads per core is like having two queues in an airport for each scanner (which people in both queues eventually have to pass through).
Two people at a time can put their baggage on the conveyor belt, but only one at a time can pass through the scanner. Now, obviously there's a contention point at the entrance of the scanner, but what happens in reality is that most of the time both queues function very well.
In this example, the queues represent threads and the scanner is the core's main execution unit. As a general rule of thumb, the impact of each thread is about 1.25th of a core, i.e., it's not like having an entire new core. So if the task is CPU-bound, slightly more threads than the number of available processors is probably best.
But notice that if the task is IO-Bound, where threads will be spending most of their time waiting for external resources such as database connections, file systems, or other external sources of data, then you can assign (many) more threads than the number of available processors.
Source1, Source2

VS 7.1 Release compile and multiple threads

VS 7.1 release mode does not seem to be properly parallelizing threads while debug mode does. Here is a summary of what is happening.
First, for what it's worth, here is the main piece of code that parallelizes, but I don't think it's an issue:
// parallelize the search
CWinThread* thread[THREADS];
for ( i = 0; i < THREADS; i++ ) {
    thread[i] = AfxBeginThread( game_search, &parallel_params[i],
                                THREAD_PRIORITY_NORMAL, 0, CREATE_SUSPENDED );
    thread[i]->m_bAutoDelete = FALSE;
    thread[i]->ResumeThread();
}
for ( i = 0; i < THREADS; i++ ) {
    WaitForSingleObject(thread[i]->m_hThread, INFINITE);
    delete(thread[i]);
}
THREADS is a global variable that I set and I recompile if I want to change the number of threads. To give a bit of context this is a game playing program that searches game positions.
Here is what happens that doesn't make sense to me.
First, compiling in debug mode. If I set THREADS to 1 the one thread manages to search about 13,000 positions. If I set THREADS to 2, each thread searches about 13,000 positions. Great!
If I compile in release mode and set THREADS to 1 the thread manages to search about 30,000 positions, a typical speedup I'm used to seeing when moving from debug to release. But here is the kicker. When I compile with THREADS = 2 each thread only searches about 15,000 positions. Obviously half of what THREADS = 1 does, so effectively a release compile gives me no effective speedup whatsoever. :(
Watching task manager when these things run, with THREADS = 1 I see 50% CPU usage on my dual core machine and when THREADS = 2 I see 100% CPU usage. But the release compile seems to be giving me an effective CPU usage of 50%. Or something?!
Any thoughts? Is there something I should be setting in the Property Pages?
Update: The following is also posted below but it was suggested I update this post. It was also suggested I post code, but it is a quite large project. I'm hoping others have run into this kind of behavior themselves in the past and can shed some light on what going on.
I ran the program on a quad core system and got consistent but still confusing results. I know I am verging on getting away from a specific programming question and becoming a bit abstract, but I'd really like to hear any comments you might have to help explain the numbers I am seeing. For all of these tests I run for 30 seconds and according to task manager all threads are running full power for the entire 30 seconds.
When running in Debug mode, if I run with 1 thread it gets X amount of work done. If I run 2 threads each thread gets X amount of work done. Similarly with 3 and 4 threads. Scaling is perfect.
When running in Release mode, this is what happens:
With 1 thread: it gets Y amount of work done, where Y is nearly double X.
With 2 threads: Each thread gets Y amount of work done. Again, perfect scaling.
With 3 threads: 1 thread gets Y amount of work done, the other 2 threads get 2/3 Y amount of work done. I've lost about 2/3 of a CPU even though one is presumably completely idle. Task Manager shows 75% CPU usage.
With 4 threads: 1 thread gets Y amount of work done. The other 3 threads get 1/2 Y amount of work done. Now I've lost about 1.5 CPU's worth of computing. The Task Manager shows 100% CPU usage.
The obvious questions are:
(1) Repeating the earlier question, why does Debug mode scale so well, but not Release?
(2) Why is one core always able to get full usage but the others seem to fall off? This lack of symmetry is disturbing.
(3) Why are the others falling off? Memory bandwidth was suggested earlier, but that seems like an awfully steep price.
Any comments or insights are most welcome. And, as always, thanks!
I think you should be using WaitForMultipleObjects().
The problem with multi-threading is that it is non-deterministic.
First of all, the DEBUG target doesn't optimize the code. It also adds additional code for runtime checks (e.g. asserts, traces in MFC, etc.).
The RELEASE target is optimized. So in release mode, the binary can be slightly different than in case of DEBUG mode.
What is the job executed by the thread is also important. For example, if your threads are using some IO operations, they will have some idle times, waiting for those IO operations to complete. Since in RELEASE mode the code to be executed is expected to be more efficient, the ratio between idle time and execution time might be different than in DEBUG mode.
I am only guessing possible explanations, given the provided information.
Later update:
You can use WaitForMultipleObjects to wait for all the threads to finish:
DWORD result = WaitForMultipleObjects(
    numberOfThreads,     // number of thread handles in the array
    threadHandleArray,   // the array of thread handles
    TRUE,                // TRUE means wait for all the threads to finish
    INFINITE);           // wait indefinitely

if (result == WAIT_FAILED)
    // some error handling here
I'm not sure I understand why there are a different number of positions searched in Debug vs. Release. You are waiting for the threads to complete, so I would just expect the Release version to finish faster but for both versions to generate the same results.
Are you imposing a per-thread time limit? If so what is the mechanism for this?
In the absence of logic bugs, it would appear that your processing is CPU-limited in the Debug case for both the single- and double-threaded versions. In the Release case you are not getting any effective speedup, which means either that the processing is more efficient and is now limited by something else (e.g. IO or memory bandwidth), or that any gains you are making are offset by frequent context switching between the threads, which might happen if you have a poor synchronization strategy between the threads.
It would be helpful to know exact what processing each thread does, what shared data they have and how often they need to synchronize with each other.
As Charles Bailey said, from your description it seems like you are imposing a per-thread time limit.
It could be the case that the timing mechanism you use references wall clock time in debug mode and CPU time (which sums across all processors/cores in use) in release mode. Thus, when THREADS = 2 in release mode, you use the total allotment of CPU time twice as fast, doing half as much work on each core.
Just an idea. Can you give more detail on your timing mechanism?
The fact that you get 30k positions from both 1 and 2 threads looks suspicious to me. Could that limit come from another component in your system? You mention each thread is totally independent, but are you by any chance using any of the Interlocked* functions? They look innocent, but they actually force a synchronization of all CPU caches, which can be painful when trying to squeeze the most out of the CPU.
What I would do is have each thread do some dummy action (such as string manipulation), just to waste some time. If that scales well, add a portion of the thread's real code to the dummy action and test again. Repeat until the performance stops scaling, which means the latest code addition is the bottleneck.
Another direction I'd look into is making sure both threads are actually running concurrently, on different CPUs. Try binding each thread to a single CPU. This is not something I'd leave in production, but if your system is loaded by other processes, you might not get the gain you expect from dual CPUs. After all, on a single-CPU machine you'll probably get lower throughput using two threads than you'd get using one.
There are many things that may hamper your performance.
One problem might be false sharing of cache lines.
When you have something like :
struct Data
{
    int cnt_parsed_thread[THREADS];
    // ...
};
static Data data;   // shared by all threads
and in the threads itself :
void threadFunc( int threadNum )
{
    while( !end )
    {
        // ...
        // do something
        ++data.cnt_parsed_thread[threadNum];
    }
}
You force each processor to send the cache line to the other processor after every increment, stalling computation enormously.
This problem can be worked around by spreading the falsely shared data across separate cache lines.
e.g. :
struct Data
{
    int cnt_parsed_thread[THREADS * CACHELINESIZE];
    // ...
    int& at( int k ) { return cnt_parsed_thread[k * CACHELINESIZE]; }
};
(The cache line size should be 64 bytes (I think); maybe play around with that. Note that CACHELINESIZE in this snippet is measured in ints, not bytes.)
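A more modern take on the same fix (a sketch under assumptions, not part of the original answer): C++11 alignas can pad each counter to its own cache line, with 64 bytes assumed as the line size.

constexpr int kThreads = 4;        // stand-in for the THREADS constant above

// alignas(64) pads sizeof(PaddedCounter) up to a full (assumed) 64-byte cache
// line, so counters for different threads never share a line.
struct alignas(64) PaddedCounter
{
    int value = 0;
};

PaddedCounter cnt_parsed_thread[kThreads];

// In the thread function:
//     ++cnt_parsed_thread[threadNum].value;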