I have some extremely simple C++ code that I was certain would run 3x faster with multithreading but somehow only runs 3% faster (or less) on both GCC and MSVC on Windows 10.
There are no mutex locks and no shared resources. And I can't see how false sharing or cache thrashing could be at play since each thread only modifies a distinct segment of the array, which has over a billion int values. I realize there are many questions on SO like this but I haven't found any that seem to solve this particular mystery.
One hint might be that moving the array initialization into the loop of the add() function does make the function 3x faster when multithreaded vs single-threaded (~885ms vs ~2650ms).
Note that only the add() function is being timed and takes ~600ms on my machine. My machine has 4 hyperthreaded cores, so I'm running the code with threadCount set to 8 and then to 1.
Any idea what might be going on? Is there any way to turn off (when appropriate) the features in processors that cause things like false sharing (and possibly like what we're seeing here) to happen?
#include <chrono>
#include <iostream>
#include <thread>
void startTimer();
void stopTimer();
void add(int* x, int* y, int threadIdx);
namespace ch = std::chrono;
auto start = ch::steady_clock::now();
const int threadCount = 8;
int itemCount = 1u << 30u; // ~1B items
int itemsPerThread = itemCount / threadCount;
int main() {
    int* x = new int[itemCount];
    int* y = new int[itemCount];

    // Initialize arrays
    for (int i = 0; i < itemCount; i++) {
        x[i] = 1;
        y[i] = 2;
    }

    // Call add() on multiple threads
    std::thread threads[threadCount];
    startTimer();
    for (int i = 0; i < threadCount; ++i) {
        threads[i] = std::thread(add, x, y, i);
    }
    for (auto& thread : threads) {
        thread.join();
    }
    stopTimer();

    // Verify results
    for (int i = 0; i < itemCount; ++i) {
        if (y[i] != 3) {
            std::cout << "Error!";
        }
    }

    delete[] x;
    delete[] y;
}

void add(int* x, int* y, int threadIdx) {
    int firstIdx = threadIdx * itemsPerThread;
    int lastIdx = firstIdx + itemsPerThread - 1;
    for (int i = firstIdx; i <= lastIdx; ++i) {
        y[i] = x[i] + y[i];
    }
}

void startTimer() {
    start = ch::steady_clock::now();
}

void stopTimer() {
    auto end = ch::steady_clock::now();
    auto duration = ch::duration_cast<ch::milliseconds>(end - start).count();
    std::cout << duration << " ms\n";
}
You may simply be hitting the memory transfer rate of your machine: you are doing 8 GB of reads and 4 GB of writes.
On my machine your test completes in about 500 ms, which works out to roughly 24 GB/s (similar to the results given by a memory bandwidth tester).
As you hit each memory address with a single read and a single write, the caches aren't much use, since you aren't reusing any memory.
Your problem is not the processor; you are running into RAM latency and bandwidth. Your cache can hold only a few megabytes of data, and you exceed that capacity by far. Multithreading only helps for as long as you can keep shoveling data into your cores. The cache in your processor is incredibly fast compared to your RAM; once you exceed the cache capacity, your benchmark turns into a RAM bandwidth test.
If you want to see the advantages of multithreading, you have to choose data sizes within the range of your cache size.
EDIT
Another thing to do would be to create a higher workload for the cores, so that the memory latency goes unnoticed.
Side note: keep in mind that your core has several execution units, one or more for each type of operation (integer, float, shift, and so on). That means one core can execute more than one instruction per cycle; in the best case, one operation per execution unit. You can keep the size of the test data and simply do more work with it; be creative =) Filling the pipeline with independent integer operations will give multithreading an advantage. If you can vary when and where in your code you do different kinds of operations, try it; that will also affect the speedup. Or avoid it, if you want to see a clean multithreading speedup.
To avoid any kind of optimization skewing the results, you should use randomized test data, so that neither the compiler nor the processor itself can predict the outcome of your operations.
Also avoid branches like if and while in the timed loop. Every decision the processor has to predict and execute slows you down and alters the result; with branch prediction in play you will never get a deterministic result. Later, in a "real" program, be my guest and do what you want. But when you want to explore the multithreading world, branches can lead you to wrong conclusions.
BTW
Please pair a delete with every new you use, to avoid memory leaks. And even better, avoid raw pointers, new, and delete altogether: use RAII. I advise using std::array or std::vector, or simply any STL container. This will save you tons of debugging time and headaches.
Speedup from parallelization is limited by the portion of the task that remains serial. This is called Amdahl's law. In your case, a decent amount of that serial time is spent initializing the array.
Are you compiling the code with -O3? If so, the compiler might be able to unroll and/or vectorize some of the loops. The loop strides are predictable, so hardware prefetching might help as well.
You might want to also explore whether using all 8 hyperthreads is useful or whether it's better to run 1 thread per core (I am going to guess that since the problem is memory-bound, you'll likely benefit from all 8 hyperthreads).
Nevertheless, you'll still be limited by memory bandwidth. Take a look at the roofline model. It'll help you reason about the performance and what speedup you can theoretically expect. In your case, you're hitting the memory bandwidth wall that effectively limits the ops/sec achievable by your hardware.
Related
If we have an array of integer pointers that all point to the same int, and we loop over it doing a ++ operation, it'll be about 100% slower than when the pointers point to two different ints. Here is a concrete example:
int* data[2];
int a, b;
a = b = 0;
for (auto i = 0ul; i < 2; ++i) {
    // Case 3: 2.5 sec
    data[i] = &a;

    // Case 2: 1.25 sec
    // if (i & 1)
    //     data[i] = &a;
    // else
    //     data[i] = &b;
}
for (auto i = 0ul; i < 1000000000; ++i) {
    // Case 1: 0.5 sec
    // asm volatile("" : "+g"(i)); // deoptimize
    // ++*data[0];
    ++*data[i & 1];
}
In summary, the observations are (describing the loop body):
case 1 (fast): ++*pointer[0]
case 2 (medium): ++*pointer[i], with half the pointers pointing to one int and the other half pointing to another int.
case 3 (slow): ++*pointer[i], with all pointers pointing to the same int.
Here are my current thoughts. Case 1 is fast because the modern CPU knows we are reading/writing the same memory location and can buffer the operation, while in Cases 2 and 3 we need to write the result out in each iteration. The reason Case 3 is slower than Case 2 is that when we write to a memory location through pointer a and then try to read it through pointer b, we have to wait for the write to finish. This stops superscalar execution.
Do I understand it correctly? Is there any way to make Case 3 faster without changing the pointer array? (perhaps adding some CPU hints?)
The question is extracted from the real problem https://github.com/ClickHouse/ClickHouse/pull/7550
You've discovered one of the effects that causes bottlenecks in histograms. A workaround for that problem is to keep multiple arrays of counters and rotate through them, so repeated runs of the same index are distributed over 2 or 4 different counters in memory.
(Then loop over the arrays of counters to sum them down into one final set of counts. This part can benefit from SIMD.)
Case 1 is fast because modern CPU knows we are read/write the same memory location, thus buffering the operation
No, it's not the CPU, it's a compile-time optimization.
++*pointer[0] is fast because the compiler can hoist the store/reload out of the loop and actually just increment a register. (If you don't use the result, it might optimize away even that.)
Assumption of no data-race UB lets the compiler assume that nothing else is modifying pointer[0] so it's definitely the same object being incremented every time. And the as-if rule lets it keep *pointer[0] in a register instead of actually doing a memory-destination increment.
So that means 1 cycle latency for the increment, and of course it can combine multiple increments into one and do *pointer[0] += n if it fully unrolls and optimizes away the loop.
when we write to a memory location by pointer a, and then trying to read it by pointer b, we have to wait the write to finish. This stops superscalar execution.
Yes, the data dependency through that memory location is the problem. Without knowing at compile time that the pointers all point to the same place, the compiler will make asm that does actually increment the pointed-to memory location.
"wait for the write to finish" isn't strictly accurate, though. The CPU has a store buffer to decouple store execution from cache misses, and out-of-order speculative exec from stores actually committing to L1d and being visible to other cores. A reload of recently-stored data doesn't have to wait for it to commit to cache; store forwarding from the store-buffer to a reload is a thing once the CPU detects it.
On modern Intel CPUs, store-forwarding latency is about 5 cycles, so a memory-destination add has 6-cycle latency. (1 for the add, 5 for the store/reload if it's on the critical path.)
And yes, out-of-order execution lets two of these 6-cycle-latency dependency chains run in parallel. And the loop overhead is hidden under that latency, again by OoO exec.
Related:
Store-to-Load Forwarding and Memory Disambiguation in x86 Processors
on stuffedcow.net
Store forwarding Address vs Data: What the difference between STD and STA in the Intel Optimization guide?
How does store to load forwarding happens in case of unaligned memory access?
Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?
Why is execution time of a process shorter when another process shares the same HT core (On Sandybridge-family, store-forwarding latency can be reduced if you don't try to reload right away.)
Is there any way to make Case 3 faster without changing the pointer array?
Yes, if that case is expected, maybe branch on it:
int *current_pointer = pointer[0];
int repeats = 1;
for (size_t i = 1; i < n; ++i) {
    if (pointer[i] == current_pointer) {
        repeats++;
    } else {
        *current_pointer += repeats;
        current_pointer = pointer[i];
        repeats = 1;
    }
}
*current_pointer += repeats;  // flush the final run
We optimize by counting the run length of repeats of the same pointer.
This is totally defeated by Case 2 and will perform poorly if long runs are not common.
Short runs can be hidden by out-of-order exec; only when the dep chain becomes long enough to fill the ROB (reorder buffer) do we actually stall.
I have decided to compare the times of passing by value and by reference in C++ (g++ 5.4.0) with the following code:
#include <iostream>
#include <cstdio>     // printf
#include <sys/time.h>

using namespace std;

int fooVal(int a) {
    for (size_t i = 0; i < 1000; ++i) {
        ++a;
        --a;
    }
    return a;
}

int fooRef(int& a) {
    for (size_t i = 0; i < 1000; ++i) {
        ++a;
        --a;
    }
    return a;
}

int main() {
    int a = 0;
    struct timeval stop, start;

    gettimeofday(&start, NULL);
    for (size_t i = 0; i < 10000; ++i) {
        fooVal(a);
    }
    gettimeofday(&stop, NULL);
    printf("The loop has taken %lu microseconds\n", stop.tv_usec - start.tv_usec);

    gettimeofday(&start, NULL);
    for (size_t i = 0; i < 10000; ++i) {
        fooRef(a);
    }
    gettimeofday(&stop, NULL);
    printf("The loop has taken %lu microseconds\n", stop.tv_usec - start.tv_usec);

    return 0;
}
It was expected that the fooRef execution would take much more time than the fooVal case because of "looking up" the referenced value in memory while performing operations inside fooRef. But the result proved unexpected for me:
The loop has taken 18446744073708648210 microseconds
The loop has taken 99967 microseconds
And the next time I run the code it can produce something like
The loop has taken 97275 microseconds
The loop has taken 99873 microseconds
Most of the time produced values are close to each other (with fooRef being just a little bit slower), but sometimes outbursts like in the output from the first run can happen (both for fooRef and fooVal loops).
Could you please explain this strange result?
UPD: Optimizations were turned off, O0 level.
If the gettimeofday() function relies on the operating-system clock, that clock is not really designed to deal with microseconds accurately. It is typically updated only often enough to give the appearance of showing seconds accurately for date/time purposes. Sampling at the microsecond level may therefore be unreliable for a benchmark such as the one you are performing.
You should be able to work around this limitation by making your test time much longer; for example, several seconds.
Again, as mentioned in other answers and comments, the effects of which type of memory is accessed (register, cache, main, etc.) and whether or not various optimizations are applied, could substantially impact results.
As with working around the time sampling limitation, you might be able to somewhat work around the memory type and optimization issues by making your test data set much larger such that memory optimizations aimed at smaller blocks of memory are effectively bypassed.
Firstly, you should look at the assembly language to see if there are any differences between passing by reference and passing by value.
Secondly, make the functions equivalent by passing by constant reference. Passing by value says that the original variable won't be changed. Passing by constant reference keeps the same principle.
My belief is that the two techniques should be equivalent in both assembly language and performance.
I'm no expert in this area, but I would tend to think that the reason why the two times are somewhat equivalent is due to cache memory.
When you need to access a memory location (say, address 0xaabbc125 on an IA-32 architecture), the CPU copies the containing memory block (addresses 0xaabbc000 to 0xaabbcfff) into your cache. Reading from and writing to main memory is very slow, but once a block has been copied into the cache, you can access its values very quickly. This is useful because programs usually access the same range of addresses over and over.
Since you execute the same code over and over and your code doesn't require a lot of memory, the first time a function is executed the relevant memory block(s) are copied to your cache, which probably accounts for most of the ~97000 time units. Any subsequent calls to your fooVal and fooRef functions access addresses that are already in the cache, so they require only a few nanoseconds (I'd figure roughly between 10 ns and 1 µs). Thus, dereferencing the pointer (since a reference is implemented as a pointer) takes about double the time of just accessing a value, but it's double of not much anyway.
Someone who is more of an expert may have a better or more complete explanation than mine, but I think this could help you understand what's going on here.
A little idea: try running the fooVal and fooRef functions a few times (say, 10 times) before setting start and beginning the loop. That way (if my explanation is correct!) the memory block should already be in the cache when you begin looping, which means you won't be including the caching cost in your timings.
About the super-high value you got: tv_usec only holds the sub-second part of the timestamp, so stop.tv_usec - start.tv_usec goes negative whenever the seconds field ticks over between the two calls, and printing that negative difference with %lu wraps it around to a huge unsigned number.
It's not a bug, it's a feature! =)
I give the following example to illustrate my question:
void fun(int i, float* pt)
{
    // do something based on i
    std::cout << *(pt + i) << std::endl;
}

const unsigned int LOOP = 2000000007;

void fun_without_optimization()
{
    float* example = new float[LOOP];
    for (unsigned int i = 0; i < LOOP; i++)
    {
        fun(i, example);
    }
    delete[] example;
}

void fun_with_optimization()
{
    float* example = new float[LOOP];
    unsigned int unit_loop = LOOP / 10;
    unsigned int left_loop = LOOP % 10;
    float* pt = example;
    for (unsigned int i = 0; i < unit_loop; i++)
    {
        fun(0, pt);
        fun(1, pt);
        fun(2, pt);
        fun(3, pt);
        fun(4, pt);
        fun(5, pt);
        fun(6, pt);
        fun(7, pt);
        fun(8, pt);
        fun(9, pt);
        pt = pt + 10;
    }
    delete[] example;
}
As far as I understand, fun_without_optimization() and fun_with_optimization() should perform the same. The only argument for the second function being better than the first is that the pointer calculation in fun becomes simpler. Any other arguments why the second function is better?
Unrolling a loop in which I/O is performed is like moving the landing strip for a B747 from London an inch eastward in JFK.
Re: "Any other arguments why the second function is better?" - would you accept the answer explaining why it is NOT better?
Manually unrolling a loop is error-prone, as is clearly illustrated by your code: you forgot to process the left_loop tail.
Compilers have been doing this optimization for you for at least a couple of decades.
How do you know the optimal number of iterations to put in that unrolled loop? Do you target a specific cache size and calculate the length of the assembly instructions in bytes? The compiler might.
Messing with the otherwise clean loop can prevent other optimizations, like the use of SIMD.
The bottom line is: if you know something that your compiler doesn't (specific pattern of the run-time data, details of the targeted execution environment, etc.), and you know what you are doing - you can try manual loop unrolling. But even then - profile.
The technique you describe is called loop unrolling; potentially this increases performance, as the time spent evaluating the control structures (updating the loop variable and checking the termination condition) becomes smaller. However, decent compilers can do this for you, and maintainability of the code decreases if it is done manually.
This is an optimization technique used for parallel architectures (architectures that support VLIW instructions). Depending on the number of DALU (most commonly 4) and ALU (most commonly 2) units the architecture supports, and the level of parallelism the code allows, multiple instructions can be executed in one cycle.
So this code:
for (int i = 0; i < n; i++) // n a multiple of 4, for simplicity
    a += temp[i];           // just a random instruction
Will actually execute faster on a parallel architecture if rewritten like:
for (int i = 0; i < n; i += 4)
{
    s0 += temp[i];     // these four additions are independent of each other,
    s1 += temp[i + 1]; // so they can be issued in the same cycle
    s2 += temp[i + 2]; // on separate execution units
    s3 += temp[i + 3];
}
a += s0 + s1 + s2 + s3;
There is a limit to how much you can parallelize your code, a limit imposed by the physical ALUs/DALUs the CPU has. That's why it's important to know your architecture before you attempt to (properly) optimize your code.
It does not stop there: the code you want to optimize has to be a continuous block of code, meaning no jumps (no function calls, no change-of-flow instructions), for maximum efficiency.
Writing your code, like:
for (unsigned int i = 0; i < unit_loop; i++)
{
    fun(0, pt);
    fun(1, pt);
    fun(2, pt);
    fun(3, pt);
    fun(4, pt);
    fun(5, pt);
    fun(6, pt);
    fun(7, pt);
    fun(8, pt);
    fun(9, pt);
    pt = pt + 10;
}
would not do much unless the compiler inlines the function calls; and it looks like too many instructions per iteration anyway...
On a different note: while it's true that you always have to work with the compiler when optimizing your code, you should never rely on it alone when you want to get the maximum optimization out of your code. Remember, the compiler handles "the general case" while you are likely interested in a particular situation; that's why some compilers have special directives to help with the optimization process.
I am thinking about heavy memory cache optimization and like to have some feedback.
Consider this example:
class example
{
    float phase1;
    float phaseInc;
    float factor;

public:
    void process(float* buffer, unsigned int iSamples) // <- high-prio audio thread
    {
        for (unsigned int i = 0; i < iSamples; i++) // mostly iSamples is 32
        {
            phase1 += phaseInc;
            float f1 = sinf(phase1); // <- sinf is just an example!
            buffer[i] = f1 * factor;
        }
    }
};
optimization idea:
void example::process(float* buffer, unsigned int iSamples)
{
    float stackMemory[3];                            // should fit in L1
    memcpy(stackMemory, &phase1, sizeof(float) * 3); // get all memory at once
    for (unsigned int i = 0; i < iSamples; i++)
    {
        stackMemory[0] += stackMemory[1];
        float f1 = sinf(stackMemory[0]);
        buffer[i] = f1 * stackMemory[2];
    }
    memcpy(&phase1, stackMemory, sizeof(float) * 1); // write back only changed memory
}
Note that the real sample loop will contain thousands of operations,
so the stackMemory can become quite big.
I think it will be no more than 32 KB (are there any smaller L1 caches out there?).
Does the order of the variables used in this stack memory matter?
I hope not, because I'd like to order them so that I can reduce the write-back size.
Or does the L1 cache have the same cache-line behaviour that RAM has?
I have the feeling that I am somehow doing what prefetch was made for, but everything I have read about prefetch is rather vague about how to use it efficiently. Trial and error is not an option with 5000+ lines of code.
The code will run on Windows, Mac, and iOS.
Any ARM <-> Intel issues to expect?
Is it possible that this kind of optimization is useless, since all the memory is accessed and transferred to L1 on the first iteration of the loop anyway?
Thanks for any hints and ideas.
At first I thought there was a good chance that the second one could be slower as a result of additional memory access and instructions required for memcpy, while the first could simply work directly with these three class members already loaded into registers.
Nevertheless, I tried fiddling with the code in GCC 5.2 with both -O2 and -O3 and found that, no matter what I tried, I got identical assembly instructions for both. This is pretty amazing considering all the extra conceptual work that memcpy typically has to do that apparently got squashed away to zilch.
The one case I can think of where your second version might be faster in some scenario, on some compiler, is if the aliasing involved to access this->data_member interfered with an optimization and caused redundant loads and stores to/from registers.
It would have nothing to do with the L1 cache in that case and everything to do with register allocation on the compiler side. Caches are largely irrelevant here when you're loading the same memory (member variables) regardless for a contiguous chunk of data, it has entirely to do with registers. Nevertheless, I couldn't find a single scenario where I could cause that to happen where the compiler did a worse job with one over the other -- every case I tested yielded identical results. In a sufficiently complex real world case, perhaps there might be a difference.
Then again, in such a case, it should be on the safer side to simply do:
void process(float* buffer, unsigned int iSamples)
{
    const float pinc = phaseInc;
    const float fact = factor;
    float p1 = phase1;
    for (unsigned int i = 0; i < iSamples; i++)
    {
        p1 += pinc;
        buffer[i] = sinf(p1) * fact;
    }
    phase1 = p1; // write back the one member that changed
}
There's no need to jump through hoops with memcpy to store the members into an array and back. That puts additional strain on the optimizer, even if, in my findings, the optimizer managed to eliminate the overhead typically associated with it.
I realize your example is simplified, but there should not be a need to reduce the structure down to such a primitive array no matter how many data members you're dealing with (unless such an array actually is the most convenient representation). From a performance standpoint, a compiler will have an "easier" time (even if optimizers today are pretty amazing and can handle this) optimizing if you just use local variables instead of an array to which you memcpy aggregate data members in and out.
Is it possible to get the remaining available memory on a system (x86, x64, PowerPC / Windows, Linux or MacOS) in standard C++11 without crashing ?
A naive way would be to try allocating very large arrays, starting with a too-large size, catching the exception every time the allocation fails, and decreasing the size until no exception is thrown. But maybe there is a more efficient/clever method...
EDIT 1: In fact I do not need the exact amount of memory. I would like to know approximately (error bar of 100MB) how much my code could use when I start it.
EDIT 2 :
What do you think of this code ? Is it secure to run it at the start of my program or it could corrupt the memory ?
#include <iostream>
#include <array>
#include <list>
#include <initializer_list>
#include <stdexcept>
int main(int argc, char* argv[])
{
    static const long long int megabyte = 1024 * 1024;
    std::array<char, megabyte> content({{'a'}});
    std::list<decltype(content)> list1;
    std::list<decltype(content)> list2;
    const long long int n1 = list1.max_size();
    const long long int n2 = list2.max_size();
    long long int i1 = 0;
    long long int i2 = 0;
    long long int result = 0;
    for (i1 = 0; i1 < n1; ++i1) {
        try {
            list1.push_back(content);
        }
        catch (const std::exception&) {
            break;
        }
    }
    for (i2 = 0; i2 < n2; ++i2) {
        try {
            list2.push_back(content);
        }
        catch (const std::exception&) {
            break;
        }
    }
    list1.clear();
    list2.clear();
    result = (i1 + i2) * sizeof(content);
    std::cout << "Memory available for program execution = " << result / megabyte << " MB" << std::endl;
    return 0;
}
This is highly dependent on the OS/platform, and the approach that you suggest need not even work in real life. On some platforms the OS will grant you all your memory requests but not actually back them with physical memory until you use it, at which point you get a SEGFAULT...
The standard does not have anything related to this.
It seems to me that the answer is no, you cannot do it in standard C++.
What you could do instead is discussed under How to get available memory C++/g++? and the contents linked there. Those are all platform specific stuff. It's not standard but it least it helps you to solve the problem you are dealing with.
As others have mentioned, the problem is hard to precisely define, much less solve. Does virtual memory on the hard disk count as "available"? What about if the system implements a prompt to delete files to obtain more hard disk space, meanwhile suspending your program? (This is exactly what happens on OS X.)
The system probably implements a memory hierarchy which gets slower as you use more. You might try detecting the performance cliff between RAM and disk by allocating and initializing chunks of memory while using the C alarm interrupt facility or clock or localtime/mktime, or the C++11 clock facilities. Wall-clock time should appear to pass quicker as the machine slows down under the stress of obtaining memory from less efficient resources. (But this makes the assumption that it's not stressed by anything else such as another process.) You would want to tell the user what the program is attempting, and save the results to an editable configuration file.
I would advise using a configurable maximum amount of memory instead. Since some platforms overcommit memory, it's not easy to tell how much memory you will actually have access to. It's also not polite to assume that you have exclusive access to 100% of the memory available, many systems will have other programs running.