Increased Execution TIme even with increased number of CPUs, Why? - c++

I have run the same C++ problem size of different number of CPUs on an HPC cluster, but what I figured out is that when the number of CPUs increased the execution time also increased. I was expecting a significant decrease in execution time. Can anyone shed some light in this issue?
Below are my execution times per # of CPUs
Number of CPUs Problem size Time (seconds)
1 3000000 15.48
2 3000000 18.2
4 3000000 21.73
8 3000000 40.55
16 3000000 60.14
32 3000000 98.75
My thoughts:
Too much communications increased between the CPUs that leads to increased the execution time.

Hope this explains it:
"There are two major factors that influence performance: the speed of the CPUs themselves, and the speed of their access to memory. In a cluster, it’s fairly obvious that a given CPU will have fastest access to the RAM within the same computer (node). Perhaps more surprisingly, similar issues are relevant on a typical multicore laptop, due to differences in the speed of main memory and the cache. Consequently, a good multiprocessing environment should allow control over the “ownership” of a chunk of memory by a particular CPU."

Related

How does OpenMP actually reduce clock cycles?

It might be a silly question but, with OpenMP you can achieve to distribute the number of operations between all the cores your CPU has. Of course, it is going to be faster in 99% times because you went from a single core doing N operations to K cores doing the same amount operations at the same time.
Despite of this, the total amount of clock cycles should be the same, right? Because the number of operations is the same. Or I am wrong?
This question boils down more or less to the difference between CPU time and elapsed time. Indeed, we see more often than none here questions which start by "my code doesn't scale, why?", for which the first answer is "How did you measure the time?" (I let you make a quick search and I'm sure you'll find many results)
But to illustrate more how things work, let's imagine you have a fixed-size problem, for which you have an algorithm that is perfectly parallelized.You have 120 actions to do, each taking 1 second. Then, 1 CPU core would take 120s, 2 cores would take 60s, 3 cores 40s, etc.
That is the elapsed time that is decreasing. However, 2 cores, running for 60 seconds in parallel, will consume 120s of CPU time. This means that the overall number of clock cycles won't have reduced compared to having only one CPU core running.
In summary, for a perfectly parallelized problem, you expect to see your elapsed time scaling down perfectly with the number of cores used, and the CPU time to remain constant.
In reality, what you often see is the elapsed time to scale down less than expected, due to parallelization overheads and/or imperfect parallelization. By the meantime, you see the CPU time slightly increasing with the number of cores used, for the same reasons.
I think the answer depends on how you define the total amount of clock cycles. If you define it as the sum of all the clock cycles from the different cores then you are correct and there will not be fewer clock cycles. But if you define it as the amount of clock cycles for the "main" core between initiating and completing the distributed operations then it is hopefully fewer.

Is this behavior showing that I have a memory problem?

I have a LP problem with ~4 million variables and ~4 million constraints. I use gurobi to solve it. My PC has 4 cores and 8 GB memory.
According to the log file, it takes ~100 seconds to find the optimal solution. Then the CPU is released, but still almost full memory is being used. It hangs there, doing nothing for hours until it continues to run the script (e.g. print command) after the solving.
results = opt.solve(model, tee=True)
print("model solved")
I used barrier method with crossover disabled, this worked best. I also tried different number of threads to be used, it turned out using 4 is the best in terms of the hanging time (but still hours).
This hanging significantly increases the total run time, which is not desired.
I plan to upgrade the memory, but want to get answers from the community that it indeed is a memory issue. Is this a memory problem?
Likely the problem does not fit in memory and virtual memory (i.e. disk) is used. This is called thrashing when it is really bad. It can bring your machine to its knees. Depending on the number of nonzeros in the problem, the presolve statistics and the number of threads you are using, you need at least 16 GB (and may be more like 32 GB).
Also: try to reduce the number of threads Gurobi is using. It may be better to use 1 thread (after benchmarking which LP algorithm works best: primal or dual simplex or a barrier method). By default a concurrent LP method is used: use different LP solvers in parallel, significantly increasing the memory footprint.

Performance Decrease by Increasing Threads Number More Than 4 in 24 Core CPU

I have an Intel Xeon E5-2620 which has 24 on 2 CPU. I have write an application which creates 24 threads for decryption of AES using openssl. When I increase thread number from 1 to 24 on 1 million data decryption I get a result such as following image.
The problem is when I increase thread numbers all of core which I determined are becoming 100% and because of 32GB ram of system always at least half of the ram is free which indicate that the problem is not core usage or ram limit.
I wonder to know that should I set a special parameter for increasing performance in OS level or it is process limitation which can not attain more than 4 thread in maximum performance.
I have to mention that when I execute "openssl evp ..." for testing aes encryption decryption because of process fork it increase the performance about 20 times more than one core performance.
Does anyone has any idea?
I finally found the the reason. multiple CPUs have different rams on servers which have different distances. when I created threads until 4 threads are created on one single cpu but fifth thread will be placed on second cpu which decrease performance because of not using NUMA in os.
so when I disabled cores of second cpu, performance of 6 threads increased as expected.
you can disable 7th core using following command:
cd /sys/devices/system/cpu/
echo 0 > cpu6/online
If multiprocessing gives a 20x speedup and equivalent multithreading only gives 2.5x, there's clearly a bottleneck in the multithreaded code. Furthermore, this bottleneck is unrelated to the hardware architecture.
It could be something in your code, or in the underlying library. It's really impossible to tell without studying both in some detail.
I'd start by looking at lock contention in your multithreaded application.

in a worst case how much QPI latency can slow-down arbitrary application?

I'm developing low-latency HFT trading application.
I'm using single-CPU machine. Because it's much easier to configure and maintain, (no need to tune NUMA). Also, obviously, assuming we have enough resources, it should be definitely not slower than dual-CPU setup, and likely it will be a little bit faster, cause no QPI/NUMA latency.
HFT requires a lot of resources and now I realize I want to have much more cores. Also colocating two 1U single CPU machines is much more expensive than colocating one 1U dual-cpu machine, so even assuming I can "split" my program to two it's still make sense to use 1U dual-CPU machine.
So how fear QPI/NUMA latency is? If I move my application from single-CPU machine to dual-CPU machine how much slower it can be? Maximum I can afford several-microseconds delay, but not more. Can QPI/Numa introduce significant delay if not tuned correctly and how significant this delay would be?
Is it possible to write such application which runs much slower (more than several microseconds slower) on dual-CPU setup than single-CPU setup? I.e runs much slower on a faster computer? (of course assuming we have the same processors, memory, network card and everything else)
This is not trivially answerable, since it depends on so many factors. Is the code written for NUMA?
Is the code doing mostly reads, mostly writes or about equal? How much data is shared between threads that run on separate CPUs? How often is such data written to, forcing cache-refresh?
How does tasks get scheduled, how and when does the OS decide to move threads from one CPU socket to the next?
Does the code and data fit in cache?
Those are just a few factors that will change the results dramatically between a "works really well" and "gives really poor performance".
As with EVERYTHING performance-related, details can make a huge difference, and reading answers like this one on the internet will not give you a reliable answer that applies to YOUR situati8on. Benchmark your application, check performance counters and tweak based on that. [Given the price for a machine of the specs you describe in comments above, I'd expect the supplier would allow some sort of test, demo, "try before you buy", etc].
Assuming you have a worst case scenario, a memory access will be straddling two cache-lines (unaligned access of a 8-byte value, for example), which is split between your worst placed CPUs, and the MMU needs reloading, each of those page-table entries also being in the worst possible CPUs, and since the memory for that pair of memory locations is in different locations, needing new TLB entries for each of the two 4-byte reads to load your 64-bit value. (Each TLB entry is a separate location).
This means 2 x 4 x n, where n is something like 50-100 ns. So one memory access could, at least in theory take 1600 ns. So 1.6 microseconds. It's unlikely that you will get MUCH worse than this for a single operation. The overhead is a lot less than for example swapping to disk, which can add milliseconds to your execution time.
It is not very hard to write code that updates the same cache-line on multiple CPUs and thus causing dramatic reduction in performance - I remember a long time back when I first had an Athlon SMP system running a simple benchmark, where the author did this for a Dhrystone benchmark
int numberOfRuns[MAX_CPUS];
Now, numberOfRuns is the outer loop-counter, and updating that for each loop, on either CPU, would cause "false sharing" (so each time the counter was updated, the other CPU had to flush that cache-line).
Running this on 2 core SMP system gave 30% of the single CPU performance. So 3 times SLOWER than the one CPU, rather than faster as you'd expect. (This was some 12 or so years ago, so memory may be a little "off" on the exact details, but the essense of this story is still true - a badly written application can run slower on multiple cores compared to single core).
I'd expect at least that bad performance on a modern system where you have false sharing of commonly used variables.
In comparison, well-written code should run near N times faster, if there is little or no sharing between CPU cores. I have a highly CPU-bound, multithreaded, calculator for weird numbers, which gives near n-times performance gain both on my single-socket system at home and my two-socket system at work.
$ time ./weird -t 1 -e 100000
real 0m22.641s
user 0m22.660s
sys 0m0.003s
$ time ./weird -t 6 -e 100000
real 0m5.096s
user 0m25.333s
sys 0m0.005s
So about 11% overhead. That is sharing one variable [current number] which is atomically updated between threads (using C++ standard atomics). Unfortunately, I don't have a good example of "badly written code" to contrast this against.

How can I measure how my multithreaded code scales (speedup)?

What would be the best way to measure the speedup of my program assuming I only have 4 cores? Obviously I could measure it up to 4, however it would be nice to know for 8, 16, and so on.
Ideally I'd like to know the amount of speedup per number of thread, similar to this graph:
Is there any way I can do this? Perhaps a method of simulating multiple cores?
I'm sorry, but in my opinion, the only reliable measurement is to actually get an 8, 16 or more cores machine and test on that.
Memory bandwidth saturation, number of CPU functional units and other hardware bottlenecks can have a huge impact on scalability. I know from personal experience that if a program scales on 2 cores and on 4 cores, it might dramatically slow down when run on 8 cores, simply because it's not enough to have 8 cores to be able to scale 8x.
You could try to predict what will happen, but there are a lot of factors that need to be taken into account:
caches - size, number of layers, shared / non-shared
memory bandwidth
number of cores vs. number of processors i.e. is it an 8-core machine or a dual-quad-core machine
interconnection between cores - a lower number of cores (2, 4) can still work reasonably well with a bus, but for 8 or more cores a more sophisticated interconnection is needed.
memory access - again, a lower number of cores work well with the SMP (symmetrical multiprocessing) model, while a higher number of core need a NUMA (non-uniform memory access) model.
I do neither think that there is a real way to do this, but one thing which comes to my mind is that you could use a virtual machine to simulate more cores. In VirtualBox for example you can select up to 16 cores out of the standard menu, but I am very confident that there are some hacks, which can make more of that and other VirtualMachines like VMware might even support more out of the Box.
bamboon and and doron are correct that many variables are at play, but if you have a tunable input size n, you can figure out the strong scaling and weak scaling of your code.
Strong scaling refers to fixing the problem size (e.g. n = 1M) and varying the number of threads available for computation. Weak scaling refers to fixing the problem size per thread (n = 10k/thread) and varying the number of threads available for computation.
It's true there's a lot of variables at work in any program -- however if you have some basic input size n, it's possible to get some semblance of scaling. On a n-body simulator I developed a few years back, I varied the threads for fixed size and the input size per thread and was able to reasonably calculate a rough measure of how well the multithreaded code scaled.
Since you only have 4 cores, you can only feasibly compute the scaling up to 4 threads. This severely limits your ability to see how well it scales to largely threaded loads. But this may not be an issue if your application is only used on machines where there are small core counts.
You really need to ask yourself the question: Is this going to be used on 10, 20, 40+ threads? If it is, the only way to accurately determine scaling to those regimes is to actually benchmark it on a platform where you have that hardware available.
Side note: Depending on your application, it may not matter that you only have 4 cores. Some workloads scale with increasing threads regardless of the real number of cores available, if many of those threads spend time "waiting" for something to happen (e.g. web servers). If you're doing pure computation though, this won't be the case
I don't believe this is possible since there are too many variables to be able to accurately extrapolate performace. Even assuming you are 100% parallel. There are other factors like bus speed and cache misses that might limit your performance, not to mention periferal performace. How all of these factors affect your code can only be done though measuring on your specific hardware platform.
I take it you are asking about measurement, so I won't address the issue of predicting the effect on higher numbers of cores.
This question can be viewed another way: how busy can you keep each thread, and what do they total up to? So for six threads, running at say 50% utilization each, means you have 3 equivalent processors running. Dividing that by say four processors, means that your methods are achieving 75% utilization. Comparing that utilization, against the clock-time of actual speedup, tells you how much of your utilization is new overhead, and how much is real speed up. Isn't that what you are really interested in?
The processor utilization can be computed in real-time a couple different ways. Threads can independently ask the system for their thread times, compute ratios and maintain global totals. If you have total control over your blocking states, you don't even need the system calls, because you can just keep track of the ratio of blocking to nonblocking machine cycles, for computing utilization. A real-time multithreading instrumentation package I developed uses such methods and they work well. The cpu clock counter in newer cpus reads on the inside of 20 machine cycles.