How can I resolve data dependency in pointer arrays? - c++

If we have an array of integer pointers that all point to the same int, and we loop over it applying ++ through each pointer, it runs about 100% slower than when the pointers point to two different ints. Here is a concrete example:
int* data[2];
int a, b;
a = b = 0;
for (auto i = 0ul; i < 2; ++i) {
    // Case 3: 2.5 sec
    data[i] = &a;
    // Case 2: 1.25 sec
    // if (i & 1)
    //     data[i] = &a;
    // else
    //     data[i] = &b;
}
for (auto i = 0ul; i < 1000000000; ++i) {
    // Case 1: 0.5 sec
    // asm volatile("" : "+g"(i)); // deoptimize
    // ++*data[0];
    ++*data[i & 1];
}
In summary, the observations (describing the body of the second loop) are:
case 1 (fast): ++*pointer[0]
case 2 (medium): ++*pointer[i] with half the pointers pointing to one int and the other half pointing to another int.
case 3 (slow): ++*pointer[i] with all pointers pointing to the same int.
Here are my current thoughts. Case 1 is fast because the modern CPU knows we read/write the same memory location, so it can buffer the operation, while in Case 2 and Case 3 we need to write the result out in each iteration. The reason Case 3 is slower than Case 2 is that when we write to a memory location through pointer a and then try to read it through pointer b, we have to wait for the write to finish. This stops superscalar execution.
Do I understand it correctly? Is there any way to make Case 3 faster without changing the pointer array? (perhaps adding some CPU hints?)
The question is extracted from the real problem https://github.com/ClickHouse/ClickHouse/pull/7550

You've discovered one of the effects that causes bottlenecks in histograms. A workaround for that problem is to keep multiple arrays of counters and rotate through them, so repeated runs of the same index are distributed over 2 or 4 different counters in memory.
(Then loop over the arrays of counters to sum them down into one final set of counts. This part can benefit from SIMD.)
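For illustration, here is a minimal sketch of that workaround (the histogram() helper, its bucket array, and the choice of NSPLIT = 4 are illustrative, not taken from the question):

#include <cstddef>
#include <vector>

// Keep NSPLIT copies of the counters; repeated hits on the same bucket land in
// different copies, which breaks the store/reload dependency chain.
std::vector<int> histogram(const int *bucket, std::size_t n, std::size_t nbuckets)
{
    constexpr std::size_t NSPLIT = 4;
    std::vector<int> counts(NSPLIT * nbuckets, 0);

    for (std::size_t i = 0; i < n; ++i)
        ++counts[(i % NSPLIT) * nbuckets + bucket[i]];

    // Sum the copies down into one final set of counts; this loop auto-vectorizes well.
    std::vector<int> total(nbuckets, 0);
    for (std::size_t s = 0; s < NSPLIT; ++s)
        for (std::size_t b = 0; b < nbuckets; ++b)
            total[b] += counts[s * nbuckets + b];
    return total;
}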
Case 1 is fast because the modern CPU knows we read/write the same memory location, so it can buffer the operation
No, it's not the CPU, it's a compile-time optimization.
++*pointer[0] is fast because the compiler can hoist the store/reload out of the loop and actually just increment a register. (If you don't use the result, it might optimize away even that.)
The assumption of no data-race UB means the compiler can assume that nothing else is modifying pointer[0], so it's definitely the same object being incremented every time. And the as-if rule lets it keep *pointer[0] in a register instead of actually doing a memory-destination increment.
So that means 1 cycle latency for the increment, and of course it can combine multiple increments into one and do *pointer[0] += n if it fully unrolls and optimizes away the loop.
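Roughly speaking, the optimizer is allowed to turn the Case-1 loop into something like the following (a sketch of the permitted transformation, not actual compiler output):

int tmp = *data[0];                        // load once, before the loop
for (auto i = 0ul; i < 1000000000; ++i)
    ++tmp;                                 // 1-cycle register increment
*data[0] = tmp;                            // store once, after the loop
// ...and with full unrolling even that collapses to *data[0] += 1000000000.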
when we write to a memory location through pointer a and then try to read it through pointer b, we have to wait for the write to finish. This stops superscalar execution.
Yes, the data dependency through that memory location is the problem. Without knowing at compile time that the pointers all point to the same place, the compiler will make asm that does actually increment the pointed-to memory location.
"wait for the write to finish" isn't strictly accurate, though. The CPU has a store buffer to decouple store execution from cache misses, and out-of-order speculative exec from stores actually committing to L1d and being visible to other cores. A reload of recently-stored data doesn't have to wait for it to commit to cache; store forwarding from the store-buffer to a reload is a thing once the CPU detects it.
On modern Intel CPUs, store-forwarding latency is about 5 cycles, so a memory-destination add has 6-cycle latency. (1 for the add, 5 for the store/reload if it's on the critical path.)
And yes, out-of-order execution lets two of these 6-cycle-latency dependency chains run in parallel. And the loop overhead is hidden under that latency, again by OoO exec.
Related:
Store-to-Load Forwarding and Memory Disambiguation in x86 Processors
on stuffedcow.net
Store forwarding Address vs Data: What the difference between STD and STA in the Intel Optimization guide?
How does store to load forwarding happens in case of unaligned memory access?
Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?
Why is execution time of a process shorter when another process shares the same HT core (On Sandybridge-family, store-forwarding latency can be reduced if you don't try to reload right away.)
Is there any way to make Case 3 faster without changing the pointer array?
Yes, if that case is expected, maybe branch on it:
int *current_pointer = pointer[0];
int repeats = 1;
...
for (size_t i = 1; i < n; ++i) {        // n = number of pointers (illustrative)
    if (pointer[i] == current_pointer) {
        repeats++;
    } else {
        *current_pointer += repeats;    // flush the finished run
        current_pointer = pointer[i];
        repeats = 1;
    }
}
*current_pointer += repeats;            // flush the final run after the loop
We optimize by counting a run-length of repeating the same pointer.
This is totally defeated by Case 2 and will perform poorly if long runs are not common.
Short runs can be hidden by out-of-order exec; only when the dep chain becomes long enough to fill the ROB (reorder buffer) do we actually stall.

Related

Simple C++ Loop Not Benefitting from Multithreading

I have some extremely simple C++ code that I was certain would run 3x faster with multithreading but somehow only runs 3% faster (or less) on both GCC and MSVC on Windows 10.
There are no mutex locks and no shared resources. And I can't see how false sharing or cache thrashing could be at play since each thread only modifies a distinct segment of the array, which has over a billion int values. I realize there are many questions on SO like this but I haven't found any that seem to solve this particular mystery.
One hint might be that moving the array initialization into the loop of the add() function does make the function 3x faster when multithreaded vs single-threaded (~885ms vs ~2650ms).
Note that only the add() function is being timed and takes ~600ms on my machine. My machine has 4 hyperthreaded cores, so I'm running the code with threadCount set to 8 and then to 1.
Any idea what might be going on? Is there any way to turn off (when appropriate) the features in processors that cause things like false sharing (and possibly like what we're seeing here) to happen?
#include <chrono>
#include <iostream>
#include <thread>

void startTimer();
void stopTimer();
void add(int* x, int* y, int threadIdx);

namespace ch = std::chrono;
auto start = ch::steady_clock::now();

const int threadCount = 8;
int itemCount = 1u << 30u; // ~1B items
int itemsPerThread = itemCount / threadCount;

int main() {
    int* x = new int[itemCount];
    int* y = new int[itemCount];

    // Initialize arrays
    for (int i = 0; i < itemCount; i++) {
        x[i] = 1;
        y[i] = 2;
    }

    // Call add() on multiple threads
    std::thread threads[threadCount];
    startTimer();
    for (int i = 0; i < threadCount; ++i) {
        threads[i] = std::thread(add, x, y, i);
    }
    for (auto& thread : threads) {
        thread.join();
    }
    stopTimer();

    // Verify results
    for (int i = 0; i < itemCount; ++i) {
        if (y[i] != 3) {
            std::cout << "Error!";
        }
    }

    delete[] x;
    delete[] y;
}

void add(int* x, int* y, int threadIdx) {
    int firstIdx = threadIdx * itemsPerThread;
    int lastIdx = firstIdx + itemsPerThread - 1;
    for (int i = firstIdx; i <= lastIdx; ++i) {
        y[i] = x[i] + y[i];
    }
}

void startTimer() {
    start = ch::steady_clock::now();
}

void stopTimer() {
    auto end = ch::steady_clock::now();
    auto duration = ch::duration_cast<ch::milliseconds>(end - start).count();
    std::cout << duration << " ms\n";
}
You may simply be hitting the memory transfer rate of your machine: you are doing 8 GB of reads and 4 GB of writes.
On my machine your test completes in about 500 ms, which is 24 GB/s (similar to the results given by a memory bandwidth tester).
Since you hit each memory address with a single read and a single write, the caches aren't much use, as you aren't reusing any memory.
Your problem is not the processor; you are running up against the speed of your RAM. Your cache can hold a few megabytes of data, and you exceed that by far. Multi-threading only helps as long as you can shovel data into your processor fast enough. The cache in your processor is incredibly fast compared to your RAM, so once you exceed the cache capacity the benchmark effectively turns into a RAM speed test.
If you want to see the advantages of multi-threading, you have to choose data sizes within the range of your cache size.
EDIT
Another thing to do would be to create a higher workload per element for the cores, so the memory latency goes unnoticed; a sketch of this idea follows after these notes.
Side note: keep in mind that each core has several execution units, one or more for each type of operation (integer, float, shift, and so on). That means one core can execute more than one instruction per cycle, in principle one operation per execution unit. You can keep the size of the test data the same and simply do more work with it -- be creative. Filling the pipeline with integer operations only will give multi-threading an advantage. If you can vary when and where you do different kinds of operations in your code, that will also have an impact on the speedup (or avoid varying them, if you want to see a clean multi-threading speedup).
To defeat any kind of optimization, you should use randomized test data, so that neither the compiler nor the processor itself can predict the outcome of your operations.
Also avoid branches like if and while inside the timed loop. Each decision the processor has to predict and execute will slow you down and alter the result; with branch prediction you will never get a fully deterministic measurement. Later, in a "real" program, do whatever you want, but while exploring the multi-threading world this can lead you to wrong conclusions.
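For illustration, here is a minimal sketch of the "more work per element" idea (the extra per-element arithmetic is an arbitrary stand-in, not from the original post, and the result check in main() would of course have to change):

void add_heavy(int* x, int* y, int threadIdx) {
    int firstIdx = threadIdx * itemsPerThread;
    int lastIdx  = firstIdx + itemsPerThread - 1;
    for (int i = firstIdx; i <= lastIdx; ++i) {
        unsigned v = static_cast<unsigned>(x[i] + y[i]);
        // Extra dependent integer work per element so the loop becomes
        // compute-bound instead of memory-bound; each step needs the
        // previous result, so it cannot be optimized away.
        for (int k = 0; k < 64; ++k)
            v = v * v + 12345u;
        y[i] = static_cast<int>(v);
    }
}

With enough work per element, the 8-thread version should pull clearly ahead of the single-threaded one, because the bottleneck is no longer memory bandwidth.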
BTW
Please use a delete for every new, to avoid memory leaks. Even better, avoid raw pointers, new, and delete entirely and use RAII: I advise using std::array or std::vector, i.e. a standard container. This will save you tons of debugging time and headaches.
Speedup from parallelization is limited by the portion of the task that remains serial. This is called Amdahl's law. In your case, a decent amount of that serial time is spent initializing the array.
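For reference, Amdahl's law puts the overall speedup with N threads at 1 / ((1 - p) + p / N), where p is the fraction of the run time that can be parallelized. Even with p = 0.9, eight threads give at most about 4.7x.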
Are you compiling the code with -O3? If so, the compiler might be able to unroll and/or vectorize some of the loops. The loop strides are predictable, so hardware prefetching might help as well.
You might also want to explore whether using all 8 hyperthreads is useful or whether it's better to run 1 thread per core (I am going to guess that since the problem is memory-bound, you'll likely benefit from all 8 hyperthreads).
Nevertheless, you'll still be limited by memory bandwidth. Take a look at the roofline model. It'll help you reason about the performance and what speedup you can theoretically expect. In your case, you're hitting the memory bandwidth wall that effectively limits the ops/sec achievable by your hardware.

Timing of using variables passed by reference and by value in C++

I have decided to compare the times of passing by value and by reference in C++ (g++ 5.4.0) with the following code:
#include <cstdio>    // for printf
#include <iostream>
#include <sys/time.h>
using namespace std;

int fooVal(int a) {
    for (size_t i = 0; i < 1000; ++i) {
        ++a;
        --a;
    }
    return a;
}

int fooRef(int & a) {
    for (size_t i = 0; i < 1000; ++i) {
        ++a;
        --a;
    }
    return a;
}

int main() {
    int a = 0;
    struct timeval stop, start;

    gettimeofday(&start, NULL);
    for (size_t i = 0; i < 10000; ++i) {
        fooVal(a);
    }
    gettimeofday(&stop, NULL);
    printf("The loop has taken %lu microseconds\n", stop.tv_usec - start.tv_usec);

    gettimeofday(&start, NULL);
    for (size_t i = 0; i < 10000; ++i) {
        fooRef(a);
    }
    gettimeofday(&stop, NULL);
    printf("The loop has taken %lu microseconds\n", stop.tv_usec - start.tv_usec);

    return 0;
}
I expected the fooRef execution to take much more time than the fooVal case because of "looking up" the referenced value in memory while performing operations inside fooRef. But the result proved unexpected:
The loop has taken 18446744073708648210 microseconds
The loop has taken 99967 microseconds
And the next time I run the code it can produce something like
The loop has taken 97275 microseconds
The loop has taken 99873 microseconds
Most of the time the produced values are close to each other (with fooRef being just a little bit slower), but sometimes outliers like the one in the output of the first run can happen (for both the fooRef and fooVal loops).
Could you please explain this strange result?
UPD: Optimizations were turned off, O0 level.
If the gettimeofday() function relies on the operating system clock, that clock is not really designed to deal with microseconds accurately. It is typically updated periodically, only often enough to give the appearance of showing seconds accurately for the purpose of working with date/time values. Sampling at the microsecond level may therefore be unreliable for a benchmark such as the one you are performing.
You should be able to work around this limitation by making your test time much longer; for example, several seconds.
Again, as mentioned in other answers and comments, the effects of which type of memory is accessed (register, cache, main, etc.) and whether or not various optimizations are applied, could substantially impact results.
As with working around the time sampling limitation, you might be able to somewhat work around the memory type and optimization issues by making your test data set much larger such that memory optimizations aimed at smaller blocks of memory are effectively bypassed.
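As a side note on the measurement code itself, here is a minimal sketch (not part of the original answer) that combines tv_sec and tv_usec. The posted code subtracts only tv_usec, which wraps around whenever a whole-second boundary is crossed and then produces huge bogus values like the first one shown:

#include <sys/time.h>

// Elapsed microseconds between two gettimeofday() samples.
long elapsed_us(const timeval &start, const timeval &stop) {
    return (long)(stop.tv_sec - start.tv_sec) * 1000000L
         + (stop.tv_usec - start.tv_usec);
}

// usage: printf("The loop has taken %ld microseconds\n", elapsed_us(start, stop));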
Firstly, you should look at the assembly language to see if there are any differences between passing by reference and passing by value.
Secondly, make the functions equivalent by passing by constant reference. Passing by value says that the original variable won't be changed. Passing by constant reference keeps the same principle.
My belief is that the two techniques should be equivalent in both assembly language and performance.
I'm no expert in this area, but I would tend to think that the reason why the two times are somewhat equivalent is due to cache memory.
When you need to access a memory location (say, address 0xaabbc125 on an IA-32 architecture), the CPU copies the containing memory block (addresses 0xaabbc000 to 0xaabbcfff) into your cache. Reading from and writing to main memory is very slow, but once the block has been copied into your cache, you can access its values very quickly. This is useful because programs usually require the same range of addresses over and over.
Since you execute the same code over and over and that your code doesn't require a lot of memory, the first time the function is executed, the memory block(s) is (are) copied to your cache once, which probably takes most of the 97000 time units. Any subsequent calls to your fooVal and fooRef functions will require addresses that are already in your cache, so they will require only a few nanoseconds (I'd figure roughly between 10ns and 1µs). Thus, dereferencing the pointer (since a reference is implemented as a pointer) is about double the time compared to just accessing a value, but it's double of not much anyway.
Someone who is more of an expert may have a better or more complete explanation than mine, but I think this could help you understand what's going on here.
A little idea: try running the fooVal and fooRef functions a few times (say, 10 times) before setting start and beginning the loop. That way (if my explanation is correct!) the memory blocks should already be in the cache when you begin looping, which means you won't be including the caching cost in your timings.
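A minimal sketch of that warm-up, assuming the rest of main() stays as in the question:

// Warm-up: bring the code and data into cache before the timed region.
for (int warm = 0; warm < 10; ++warm) {
    fooVal(a);
    fooRef(a);
}
gettimeofday(&start, NULL);
for (size_t i = 0; i < 10000; ++i) {
    fooVal(a);
}
gettimeofday(&stop, NULL);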
About the super-high value you got, I can't explain that. But the value is obviously wrong.
It's not a bug, it's a feature! =)

Unaligned load versus unaligned store

The short question: if I have a function that takes two vectors, one input and one output (no aliasing), and I can only align one of them, which one should I choose?
The longer version: consider a function,
#include <immintrin.h>   // AVX intrinsics

void func(size_t n, void *in, void *out)
{
    __m256i *in256 = reinterpret_cast<__m256i *>(in);
    __m256i *out256 = reinterpret_cast<__m256i *>(out);
    while (n >= 32) {
        __m256i data = _mm256_loadu_si256(in256++);
        // process data
        _mm256_storeu_si256(out256++, data);
        n -= 32;
    }
    // process the remaining n % 32 bytes
}
If in and out are both 32-byte aligned, then there's no penalty for using vmovdqu instead of vmovdqa. The worst case is that both are unaligned, and one in four loads/stores will cross a cache-line boundary.
In this case I can align one of them to the cache-line boundary by processing a few elements before entering the loop. However, the question is which one should I choose? Between an unaligned load and an unaligned store, which one is worse?
At the risk of stating the obvious: there is no "right answer" except "you need to benchmark both with actual code and actual data". Whichever variant is faster depends strongly on the CPU you are using, the amount of computation you do on each packet, and many other things.
As noted in the comments, you should also try non-temporal stores. What can also sometimes help is to load the input of the following packet inside the current loop iteration, i.e.:
__m256i next = _mm256_loadu_si256(in256++);
for (...) {
    __m256i data = next;   // usually 0 cost
    next = _mm256_loadu_si256(in256++);
    // do computations and store data
}
If the calculations you are doing have unavoidable data latencies, you should also consider calculating two packages interleaved (this uses twice as many registers though).
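For example, a minimal sketch of processing two packets per iteration, reusing the in256/out256 pointers from the question (the processing step is left as a placeholder, and the loop tail still has to be handled as before):

while (n >= 64) {
    __m256i a = _mm256_loadu_si256(in256++);
    __m256i b = _mm256_loadu_si256(in256++);
    // ... process a and b independently so their latency chains overlap ...
    _mm256_storeu_si256(out256++, a);
    _mm256_storeu_si256(out256++, b);
    n -= 64;
}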

Does it take more time to access a value in a pointed struct than a local value?

I have a struct which holds values that are used as the arguments of a for loop:
struct ARGS {
    int endValue;
    int step;
    int initValue;
};

ARGS * arg = ...; // get a pointer to an initialized struct

for (int i = arg->initValue; i < arg->endValue; i += arg->step) {
    //...
}
Since the values of initValue and step are checked each iteration, would it be faster if I move them to local values before using in the for loop?
int initValue = arg->initValue;
int endValue = arg->endValue;
int step = arg->step;

for (int i = initValue; i < endValue; i += step) {
    //...
}
The clear-cut answer is that in 99.9% of cases it does not matter, and you should not be concerned with it. There may be micro-level differences, but they won't matter to almost anyone, and the gory details depend on the architecture and the optimizer. Still, bear with me, and understand "no difference" to mean "a very, very high probability that there is no difference".
// case 1
ARGS * arg = ...; // get a pointer to an initialized struct
for (int i = arg->initValue; i < endValue; i += arg->step) {
    //...
}

// case 2
int initValue = arg->initValue;
int step = arg->step;
for (int i = initValue; i < endValue; i += step) {
    //...
}
In the case of initValue, there will not be a difference. The value will be loaded through the pointer and stored into the initValue variable, just to store it in i. Chances are that the optimizer will skip initValue and write directly to i.
The case of step is a bit more interesting, in that the compiler can prove that the local variable step is not shared with any other thread and can only change locally. If register pressure is low, it can keep step in a register and never touch the real variable again. On the other hand, it cannot assume that arg->step is unchanged by external means, and it is required to go to memory to read the value. Understand that memory here most probably means the L1 cache. An L1 cache hit on a Core i7 takes approximately 4 CPU cycles (each cycle being roughly 0.5 * 10^-9 seconds on a 2 GHz processor, so about 2 nanoseconds in total). And that difference only exists under the assumption that the compiler can keep step in a register, which may not be the case; if step cannot be held in a register, you will pay for the access to memory (cache) in both cases.
Write code that is easy to understand, then measure. If it is slow, profile and figure out where the time is really spent. Chances are that this is not the place where you are wasting cpu cycles.
This depends on your architecture. If it is a RISC or CISC processor, then that will affect how memory is accessed, and on top of that the addressing modes will affect it as well.
On the ARM code I work with, typically the base address of a structure will be moved into a register, and then it will execute a load from that address plus an offset. To access a variable, it will move the address of the variable into the register, then execute the load without an offset. In this case it takes the same amount of time.
Here's what the example assembly code might look like on ARM for accessing the second int member of a structure, compared to directly accessing a variable.
ldr r0, =MyStruct   ; struct {int x, int y} MyStruct
ldr r0, [r0, #4]    ; load MyStruct.y into r0
ldr r1, =MyIntY     ; int MyIntX, MyIntY
ldr r1, [r1]        ; directly load MyIntY into r1
If your architecture does not allow addressing with offsets, then it would need to move the address into a register and then perform the addition of the offset.
Additionally, since you've tagged this as C++ as well, if you overload the -> operator for the type, then this will invoke your own code which could take longer.
The problem is that the two versions are not identical. If the code in the ... part modifies the values in arg, then the two options will behave differently (the "optimized" one will keep using the original step and end values, not the updated ones).
If the optimizer can prove by looking at the code that this is not going to happen then the performance will be the same because moving things out of loops is a common optimization performed today. However it's quite possible that something in ... will POTENTIALLY change the content of the structure and in this case the optimizer must be paranoid and the generated code will reload the values from the structure at each iteration. How costly it will be depends on the processor.
For example if the arg pointer is received as a parameter and the code in ... calls any external function for which the code is unknown to the compiler (including things like malloc) then the compiler must assume that MAY BE the external code knows the address of the structure and MAY BE it will change the end or step values and thus the optimizer is forbidden to move those computations out of the loop because doing so would change the behavior of the code.
Even if it's obvious for you that malloc is not going to change the contents of your structure this is not obvious at all for the compiler, for which malloc is just an external function that will be linked in at a later step.
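As a hedged illustration of that point (do_work() is a hypothetical opaque function, not something from the question): copying the fields into locals whose address never escapes lets the optimizer keep them in registers even when the loop body calls unknown code.

void do_work(int);   // defined in another translation unit, unknown to the compiler

void process(const ARGS *arg) {
    const int initValue = arg->initValue;
    const int endValue  = arg->endValue;
    const int step      = arg->step;
    for (int i = initValue; i < endValue; i += step) {
        do_work(i);   // the call cannot modify the locals above, but it could
                      // modify *arg, so arg->endValue and arg->step would have
                      // to be reloaded every iteration in the original version
    }
}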

Using SSE for vector initialization

I am relatively new to C++ (I moved from Java to get more performance for my scientific app) and I know nothing about SSE. Still, I need to improve the following very simple code:
int myMax = INT_MAX;
int size = 18000003;
vector<int> nodeCost(size);

/* init part */
for (int k = 0; k < size; k++) {
    nodeCost[k] = myMax;
}
I have measured the time for the initialization part and it takes 13ms which is way too big for my scientific app (the entire algorithm runs in 22ms which means that the initialization takes 1/2 of the total time). Keep in mind that the initialization part will be repeated multiple times for the same vector.
As you can see, the size of the vector is not divisible by 4. Is there a way to accelerate the initialization with SSE? Can you suggest how? Do I need to use plain arrays, or can SSE be used with vectors as well?
Please, since I need your help, let's all avoid a) "how did you measure the time" and b) "premature optimization is the root of all evil", which are both reasonable things to ask, but a) the measured time is correct and b) I agree with it, yet I have no other choice. I do not want to parallelize the code with OpenMP, so SSE is the only fallback.
Thanks for your help
Use the vector's constructor:
std::vector<int> nodeCost(size, myMax);
This will most likely use an optimized "memset"-type of implementation to fill the vector.
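Since the question says the initialization is repeated multiple times for the same vector, later re-fills can use std::fill (or assign), which typically compiles down to the same optimized fill. A one-line sketch reusing the names from the question (needs <algorithm>):

#include <algorithm>

std::fill(nodeCost.begin(), nodeCost.end(), myMax);   // or: nodeCost.assign(size, myMax);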
Also tell your compiler to generate architecture-specific code (e.g. -march=native -O3 on GCC). On my x86_64 machine, this produces the following code for filling the vector:
.L5:
    add     r8, 1                    ;; increment counter
    vmovdqa YMMWORD PTR [rax], ymm0  ;; magic, ymm0 contains the data, and rax...
    add     rax, 32                  ;; ... the "end" pointer for the vector
    cmp     r8, rdi                  ;; loop condition, rdi holds the total size
    jb      .L5
The vmovdqa instruction here is the VEX-encoded (AVX) form of movdqa operating on a 256-bit ymm register, so it copies 32 bytes to memory at once.
Try std::fill first as already suggested, and then if that's still not fast enough you can go to SIMD if you really need to. Note that, depending on your CPU and memory sub-system, for large vectors such as this you may well hit your DRAM's maximum bandwidth and that could be the limiting factor. Anyway, here's a fairly simple SSE implementation:
#include <emmintrin.h>

const __m128i vMyMax = _mm_set1_epi32(myMax);
int * const pNodeCost = &nodeCost[0];
int k;

for (k = 0; k < size - 3; k += 4)
{
    _mm_storeu_si128((__m128i *)&pNodeCost[k], vMyMax);
}
for ( ; k < size; ++k)
{
    pNodeCost[k] = myMax;
}
This should work well on modern CPUs - for older CPUs you might need to handle the potential data misalignment better, i.e. use _mm_store_si128 rather than _mm_storeu_si128. E.g.
#include <emmintrin.h>
#include <cstdint>   // for intptr_t

const __m128i vMyMax = _mm_set1_epi32(myMax);
int * const pNodeCost = &nodeCost[0];
int k;

for (k = 0; k < size && (((intptr_t)&pNodeCost[k] & 15ULL) != 0); ++k)
{                                // initial scalar loop until we
    pNodeCost[k] = myMax;        // hit 16 byte alignment
}
for ( ; k < size - 3; k += 4)    // 16 byte aligned SIMD loop
{
    _mm_store_si128((__m128i *)&pNodeCost[k], vMyMax);
}
for ( ; k < size; ++k)           // scalar loop to take care of any
{                                // remaining elements at end of vector
    pNodeCost[k] = myMax;
}
This is an extension of the ideas in Mats Petersson's comment.
If you really care about this, you need to improve your referential locality. Plowing through 72 megabytes of initialization, only to come back later to overwrite it, is extremely unfriendly to the memory hierarchy.
I do not know how to do this in straight C++, since std::vector always initializes itself. But you might try (1) using calloc and free to allocate the memory; and (2) interpreting the elements of the array as "0 means myMax and n means n-1". (I am assuming "cost" is non-negative. Otherwise you need to adjust this scheme a bit. The point is to avoid the explicit initialization.)
On a Linux system, this can help because calloc of a sufficiently large block does not need to explicitly zero the memory, since pages acquired directly from the kernel are already zeroed. Better yet, they only get mapped and zeroed the first time you touch them, which is very cache-friendly.
(On my Ubuntu 13.04 system, Linux calloc is smart enough not to explicitly initialize. If yours is not, you might have to do an mmap of /dev/zero to use this approach...)
Yes, this does mean every access to the array will involve adding/subtracting 1. (Although not for operations like "min" or "max".) Main memory is pretty darn slow by comparison, and simple arithmetic like this can often happen in parallel with whatever else you are doing, so there is a decent chance this could give you a big performance win.
Of course whether this helps will be platform dependent.
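For what it's worth, here is a minimal sketch of the calloc-based encoding described above (the get/set helpers and names are illustrative, not from the original answer; costs are assumed non-negative and strictly less than INT_MAX):

#include <climits>
#include <cstdlib>

int main() {
    const int myMax = INT_MAX;
    const int size = 18000003;

    // calloc of a large block is typically satisfied with already-zeroed pages,
    // so no explicit initialization pass is needed.
    int *nodeCost = static_cast<int *>(std::calloc(size, sizeof(int)));
    if (!nodeCost) return 1;

    // Encoding: stored 0 means myMax, stored n means cost n - 1.
    auto get = [&](int k) { return nodeCost[k] == 0 ? myMax : nodeCost[k] - 1; };
    auto set = [&](int k, int cost) { nodeCost[k] = cost + 1; };

    set(5, 42);
    int c5 = get(5);   // 42
    int c0 = get(0);   // myMax, without ever having touched most of the array
    (void)c5; (void)c0;

    std::free(nodeCost);
}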