I was reading an interesting post about Why is it faster to process a sorted array than an unsorted array? and saw a comment by @mp31415 that said:
Just for the record: on Windows / VS2017 / i7-6700K 4 GHz there is NO difference between the two versions. It takes 0.6s in both cases. If the number of iterations in the outer loop is increased 10 times, the execution time increases 10 times too, to 6s in both cases.
So I tried it on an online C/C++ compiler (with, I suppose, a modern server architecture), and I get, for the sorted and unsorted versions respectively, ~1.9s and ~1.85s: not so different, but repeatable.
So I wonder: is this still true on modern architectures? The question is from 2012, not so long ago...
Or where am I wrong?
Clarification for reopening:
Please forget about my adding C code as an example. That was a terrible mistake. Not only was the code erroneous, but posting it misled people into focusing on the code itself rather than on the question.
When I first tried the C++ code used in the link above, I got only a 2% difference (1.9s vs 1.85s).
My first question and intent were about the previous post, its C++ code, and the comment by @mp31415.
@rustyx made an interesting comment, and I wondered if it could explain what I observed:
Interestingly, a debug build exhibits a 400% difference between sorted/unsorted, while a release build shows at most a 5% difference (i7-7700).
In other words, my question is:
Why did the C++ code in the previous post not perform as well as the previous OP claimed?
which I would refine as:
Could the timing difference between a release build and a debug build explain it?
You're a victim of the as-if rule:
... conforming implementations are required to emulate (only) the observable behavior of the abstract machine ...
Consider the function under test ...
const size_t arraySize = 32768;
int *data;

long long test()
{
    long long sum = 0;
    for (size_t i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (size_t c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    return sum;
}
Looking at the generated assembly (VS 2017, x86_64, /O2), the machine does not execute your loops. Instead it executes a similar program that does this:
long long test()
{
    long long sum = 0;
    // Primary loop
    for (size_t c = 0; c < arraySize; ++c)
    {
        for (size_t i = 0; i < 20000; ++i)
        {
            if (data[c] >= 128)
                sum += data[c] * 5;
        }
    }
    return sum;
}
Observe how the optimizer reversed the order of the loops and defeated your benchmark.
Obviously the latter version is much more branch-predictor-friendly.
You can in turn defeat the loop hoisting optimization by introducing a dependency in the outer loop:
long long test()
{
    long long sum = 0;
    for (size_t i = 0; i < 100000; ++i)
    {
        sum += data[sum % 15]; // <== dependency!
        // Primary loop
        for (size_t c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    return sum;
}
Now this version again exhibits a massive difference between sorted and unsorted data. On my system (i7-7700) it is 1.6s vs 11s (or 700%).
Conclusion: the branch predictor matters more than ever these days, when we are facing unprecedented pipeline depths and instruction-level parallelism.
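If you want to reproduce the measurement yourself, a minimal harness along these lines should do. This is only a sketch: the std::chrono timing and the random fill are my own choices, and test() is assumed to be the dependent version shown above.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>

const size_t arraySize = 32768;
int *data;

long long test();   // the dependent version shown above

int main()
{
    data = new int[arraySize];
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, 255);
    for (size_t i = 0; i < arraySize; ++i)
        data[i] = dist(gen);

    std::sort(data, data + arraySize);   // remove this line for the unsorted run

    auto t0 = std::chrono::steady_clock::now();
    long long s = test();
    auto t1 = std::chrono::steady_clock::now();
    std::printf("sum = %lld, %.2f s\n",
                s, std::chrono::duration<double>(t1 - t0).count());
    delete[] data;
}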
I need profiling in two forms. The first is the total time the code spends, and the second is a unit that does not depend on CPU frequency or on sleeps in the code. (The profiling is needed for our software with its own language/interpreter; it runs on Windows.)
My problem is with the second one.
Results from GetThreadTimes depend on CPU frequency and are not accurate (10-15 ms); see for more: Why GetThreadTimes is wrong? Kalmbachnet
QueryThreadCycleTime also depends on the implementation (and also counts sleeps, as I tested); see for more: What does QueryThreadCycleTime actually count? OldNewThing
QueryPerformanceCounter is an accurate counter, but CPU frequency changes affect the result, and sleeps are also included.
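For reference, a sketch of the QueryPerformanceCounter pattern I mean (wall-clock time; the sleep shows up in the result):

#include <windows.h>
#include <cstdio>

static double nowSeconds()
{
    LARGE_INTEGER freq, counts;
    ::QueryPerformanceFrequency(&freq);   // ticks per second, fixed at boot
    ::QueryPerformanceCounter(&counts);
    return static_cast<double>(counts.QuadPart) / freq.QuadPart;
}

int main()
{
    double t0 = nowSeconds();
    ::Sleep(100);                         // the sleep is counted in the elapsed time
    double t1 = nowSeconds();
    std::printf("%.6f s\n", t1 - t0);
}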
Is what I want to do possible, or is there another way? How does Visual Studio's profiler do it?
Note: I know my question looks like a duplicate. I tried commenting on some old answers to the same questions (like: Another question) to get better answers, but my comments were deleted after 1-2 days. (see: meta for comments deleted)
EDIT: (My test code for QueryThreadCycleTime)
static void foo()
{
    for (int i = 0; i < 3; i++)
    {
        Sleep(20);
        for (int x = 0; x < 1000; x++)
            x = x + 1 - 1;
    }
}

static void testCycles()
{
    HANDLE hThread = nullptr;
    ::DuplicateHandle(::GetCurrentProcess(), ::GetCurrentThread(), ::GetCurrentProcess(),
                      &hThread, 0, false, DUPLICATE_SAME_ACCESS);
    std::vector<ULONG64> results;
    results.resize(7);
    for (auto &tElapsed : results)
    {
        ULONG64 tStart = 0;
        ::QueryThreadCycleTime(hThread, &tStart);
        foo();
        ULONG64 tEnd = 0;
        ::QueryThreadCycleTime(hThread, &tEnd);
        tElapsed = tEnd - tStart;
    }
    ::CloseHandle(hThread);
}
And here are the results:
with Sleep(20)
in Thread
123383
192271
128028
208208
277983
223377
155222
in Main-Thread
191616
120002
126258
125267
141934
204753
125243
with Sleep(1000)
in Thread
121595
143863
182068
307464
388448
342315
468244
in Main-Thread
289568
256256
348599
359328
234065
167849
299888
USELESS QUESTION - ASKED TO BE DELETED
I have to run a piece of code that manages a video stream from a camera.
I am trying to speed it up, and I noticed some weird C++ behaviour. (I have to admit I am realizing I do not know C++ well.)
The first piece of code runs faster than the second ones. Why? Could it be that the stack is almost full?
Faster version
double* temp = new double[N];
for (int i = 0; i < N; i++) {
    temp[i] = operation(x[i], y[i]);
    res = res + (temp[i] * temp[i]) * coeff[i];
}
Slower version1
double temp;
for (int i = 0; i < N; i++) {
    temp = operation(x[i], y[i]);
    res = res + (temp * temp) * coeff[i];
}
Slower version2
for (int i = 0; i < N; i++) {
    double temp = operation(x[i], y[i]);
    res = res + (temp * temp) * coeff[i];
}
EDIT
I realized the compiler was optimizing away the product between elements of coeff and temp. I beg your pardon for the useless question. I will delete this post.
This obviously has nothing to do with "writing vs overwriting".
Assuming your results are indeed correct, I can guess that your "faster" version can be vectorized (i.e. pipelined) by the compiler more efficiently.
The difference is that in this version you allocate storage for every temp: each iteration uses its own element of the array, hence all the iterations can be executed independently.
Your "slow 1" version creates a (sort of) false dependency on a single temp variable. A primitive compiler might "buy" it and produce non-pipelined code.
Your "slow 2" version actually seems to be OK; the loop iterations are independent.
Why is it still slower?
I can guess that this is due to the use of the same CPU registers: arithmetic on double is usually implemented via FPU stack registers, and reusing them creates interference between loop iterations.
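If that guess is right, you should be able to get the same effect without the array by giving the iterations independent accumulators. A sketch, assuming operation() has no side effects:

double res0 = 0.0, res1 = 0.0;
for (int i = 0; i + 1 < N; i += 2) {
    double t0 = operation(x[i],     y[i]);
    double t1 = operation(x[i + 1], y[i + 1]);
    res0 += (t0 * t0) * coeff[i];       // two independent dependency chains
    res1 += (t1 * t1) * coeff[i + 1];
}
if (N & 1) {                            // leftover element when N is odd
    double t = operation(x[N - 1], y[N - 1]);
    res0 += (t * t) * coeff[N - 1];
}
res = res + res0 + res1;

Note that this changes the order of the floating-point additions, so the result can differ in the last bits.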
I've been doing some of the LeetCode problems, and I notice that the C solutions are a couple of times faster than the exact same thing in C++. For example:
Updated with a couple of simpler examples:
Given a sorted array and a target value, return the index if the target is found. If not, return the index where it would be if it were inserted in order. You may assume no duplicates in the array. (Link to question on LeetCode)
My solution in C runs in 3 ms:
int searchInsert(int A[], int n, int target) {
    int left = 0;
    int right = n;
    int mid = 0;
    while (left < right) {
        mid = (left + right) / 2;
        if (A[mid] < target) {
            left = mid + 1;
        }
        else if (A[mid] > target) {
            right = mid;
        }
        else {
            return mid;
        }
    }
    return left;
}
My other C++ solution, exactly the same but as a member function of a Solution class, runs in 13 ms:
class Solution {
public:
    int searchInsert(int A[], int n, int target) {
        int left = 0;
        int right = n;
        int mid = 0;
        while (left < right) {
            mid = (left + right) / 2;
            if (A[mid] < target) {
                left = mid + 1;
            }
            else if (A[mid] > target) {
                right = mid;
            }
            else {
                return mid;
            }
        }
        return left;
    }
};
Even simpler example:
Reverse the digits of an integer. Return 0 if the result will overflow. (Link to question on LeetCode)
The C version runs in 6 ms:
int reverse(int x) {
    long rev = x % 10;
    x /= 10;
    while (x != 0) {
        rev *= 10L;
        rev += x % 10;
        x /= 10;
        if (rev > (-1U >> 1) || rev < (1 << 31)) {
            return 0;
        }
    }
    return rev;
}
And the C++ version, exactly the same but as a member function of the Solution class, runs in 19 ms:
class Solution {
public:
    int reverse(int x) {
        long rev = x % 10;
        x /= 10;
        while (x != 0) {
            rev *= 10L;
            rev += x % 10;
            x /= 10;
            if (rev > (-1U >> 1) || rev < (1 << 31)) {
                return 0;
            }
        }
        return rev;
    }
};
I can see how there would be considerable overhead from using a vector of vectors as a 2D array in the original example if the LeetCode testing system doesn't compile the code with optimisation enabled. But the simpler examples above shouldn't suffer from that issue, because the data structures are pretty raw; especially in the second case, where all you have is long and integer arithmetic. That's still slower by a factor of three.
I'm starting to think that there might be something odd happening with the way LeetCode does its benchmarking in general, because even in the C version of the integer-reversing problem you get a huge bump in running time just from replacing the line
if (rev>(-1U >> 1) || rev < (1 << 31)) {
with
if (rev>INT_MAX || rev < INT_MIN) {
Now, I suppose having to #include <limits.h> might have something to do with that, but it seems a bit extreme that this simple change bumps the execution time from just 6 ms to 19 ms.
Lately I've been seeing the vector<vector<int>> suggestion a lot for doing 2D arrays in C++, and I've been pointing out to people why it really isn't a good idea. It's a handy trick to know when slapping together temporary code, but there's (almost) never a reason to use it in real code. The right thing to do is to use a class that wraps a contiguous block of memory.
So my first reaction might be to point to this as a possible source for the disparity. However you're also using int** in the C version, which is generally a sign of the exact same problem as vector<vector<int>>.
So instead I decided to just compare the two solutions.
http://coliru.stacked-crooked.com/a/fa8441cc5baa0391
6468424
6588511
That's the time taken by the 'C version' vs the 'C++ version' in nanoseconds.
My results don't show anything like the disparity you describe. Then it occurred to me to check a common mistake people make when benchmarking
http://coliru.stacked-crooked.com/a/e57d791876b9252b
18386695
42400612
Notice that the -O3 flag from the first example has become -O0, which disables optimization.
Conclusion: you're probably comparing unoptimized executables.
C++ supports building rich abstractions that don't require overhead, but eliminating the overhead does require certain code transformations that play havoc with the 'debuggability' of code.
That means debug builds avoid those transformations and therefore C++ debug builds are often slower than debug builds of C style code because C style code just doesn't use much abstraction. Seeing a 130% slowdown such as the above is not at all surprising when timing, for example, machine code that uses function calls in place of simple store instructions.
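As a toy illustration of the kind of abstraction that is free under optimization but not in a debug build (my own example, not LeetCode's code):

#include <cstddef>
#include <vector>

int sum_raw(const int* p, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += p[i];              // plain loads, cheap even at -O0
    return s;
}

int sum_vec(const std::vector<int>& v) {
    int s = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        s += v[i];              // v.size() and operator[] are real function calls at -O0
    return s;
}

At -O1 and above, both compile to essentially the same loop.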
Some code really needs optimizations in order to have reasonable performance even for debugging, so compilers often offer a mode that applies some optimizations which don't cause too much trouble for debuggers. Clang and gcc use -O1 for this, and you can see that even this level of optimization essentially eliminates the gap in this program between C style code and the more C++ style code:
http://coliru.stacked-crooked.com/a/13967ebcfcfa4073
8389992
8196935
Update:
In those later examples optimization shouldn't make a difference, since the C++ is not using any abstraction beyond what the C version is doing. I'm guessing that the explanation for this is that the examples are being compiled with different compilers or with some other different compiler options. Without knowing how the compilation is done I would say it makes no sense to compare these runtime numbers; LeetCode is clearly not producing an apples to apples comparison.
You are using a vector of vectors in your C++ code snippet. Vectors are sequence containers in C++, like arrays that can change in size. Instead of vector<vector<int>>, statically allocated arrays would be better. You could also use your own Array class with operator[] overloaded, but vector has more overhead because it dynamically resizes when you add more elements than its original capacity. In C++ you can also pass by reference to further reduce your time compared with C. C++ should run even faster if written well.
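For illustration, a minimal sketch of the kind of contiguous wrapper both answers have in mind (the class name and interface are my own invention):

#include <cstddef>
#include <vector>

// One flat allocation instead of a row-pointer table:
// better locality and one less indirection per access.
class Grid {
public:
    Grid(std::size_t rows, std::size_t cols)
        : cols_(cols), cells_(rows * cols) {}

    int& operator()(std::size_t r, std::size_t c)       { return cells_[r * cols_ + c]; }
    int  operator()(std::size_t r, std::size_t c) const { return cells_[r * cols_ + c]; }

private:
    std::size_t cols_;
    std::vector<int> cells_;
};

// Usage: Grid g(rows, cols); g(r, c) = 42;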
I was implementing a rolling-median solution and was not sure why my Python implementation was around 40 times slower than the C++ implementation.
Here are the complete implementations
C++
#include <iostream>
#include <vector>
#include <string.h>

using namespace std;

int tree[17][65536];

void insert(int x) { for (int i = 0; i < 17; i++) { tree[i][x]++; x /= 2; } }
void erase(int x)  { for (int i = 0; i < 17; i++) { tree[i][x]--; x /= 2; } }

int kThElement(int k) {
    int a = 0, b = 16;
    while (b--) { a *= 2; if (tree[b][a] < k) k -= tree[b][a++]; }
    return a;
}

long long sumOfMedians(int seed, int mul, int add, int N, int K) {
    long long result = 0;
    memset(tree, 0, sizeof(tree));
    vector<long long> temperatures;
    temperatures.push_back(seed);
    for (int i = 1; i < N; i++)
        temperatures.push_back((temperatures.back() * mul + add) % 65536);
    for (int i = 0; i < N; i++) {
        insert(temperatures[i]);
        if (i >= K) erase(temperatures[i - K]);
        if (i >= K - 1) result += kThElement((K + 1) / 2);
    }
    return result;
}

// default input
// 47 5621 1 125000 1700
// output
// 4040137193
int main()
{
    int seed, mul, add, N, K;
    cin >> seed >> mul >> add >> N >> K;
    cout << sumOfMedians(seed, mul, add, N, K) << endl;
    return 0;
}
Python
def insert(tree, levels, n):
    for i in xrange(levels):
        tree[i][n] += 1
        n /= 2

def delete(tree, levels, n):
    for i in xrange(levels):
        tree[i][n] -= 1
        n /= 2

def kthElem(tree, levels, k):
    a = 0
    for b in reversed(xrange(levels)):
        a *= 2
        if tree[b][a] < k:
            k -= tree[b][a]
            a += 1
    return a

def main():
    seed, mul, add, N, K = map(int, raw_input().split())
    levels = 17
    tree = [[0] * 65536 for _ in xrange(levels)]
    temps = [0] * N
    temps[0] = seed
    for i in xrange(1, N):
        temps[i] = (temps[i-1] * mul + add) % 65536
    result = 0
    for i in xrange(N):
        insert(tree, levels, temps[i])
        if i >= K:
            delete(tree, levels, temps[i-K])
        if i >= K-1:
            result += kthElem(tree, levels, (K+1)/2)
    print result

# default input
# 47 5621 1 125000 1700
# output
# 4040137193
main()
On the above-mentioned input (in the comments of the code), the C++ code took around 0.06 seconds while the Python code took around 2.3 seconds.
Can someone suggest the possible problems with my Python code and how to improve it to less than a 10x performance hit?
I don't expect it to be anywhere near the C++ implementation, but on the order of 5-10x. I know I can optimize this by using libraries like numpy (and/or scipy). I am asking this question from the point of view of using Python to solve programming challenges, where these libraries are usually not allowed. I am just asking if it is even possible to beat the time limit for this algorithm in Python.
If somebody is interested, the C++ code is borrowed from the Floating median problem at http://community.topcoder.com/tc?module=Static&d1=match_editorials&d2=srm310
[Edit]
For those who think using numpy arrays will improve performance, it does not: just using a numpy ndarray instead of a list of lists degraded performance further, to around 14 seconds, which is more than a 200x slowdown from C++.
Pure Python code which is compute-bound and written procedurally is likely to be slow, as you have found. If you want to make something in Python which runs quickly for tasks like this, you'll need to use some C (or C++, Fortran, or other) extensions, which are abundant. For example, statistics and math people use NumPy and SciPy and related tools, which are easy to use from Python but which are actually implemented in compiled languages and have high performance (if used carefully).
If you want to try to squeeze a bit more performance out of pure Python, you can try using the "cProfile" module to analyze your code. But it probably won't get anywhere near C++ speed unless you use smarter modules like NumPy or write your own extensions.
You might gain a small amount by refactoring this:
reversed(xrange(levels))
Especially if you are using Python 2.x, where this adds an extra reversed() wrapper around the xrange. You can instead do something like this:
xrange(levels - 1, -1, -1)
Can some one suggest [...] how to improve to less than 10x performance hit?
Profile the code.
Look into using NumPy instead of native lists.
If that turns out to not be enough, look into using Cython for the critical part.
I have the following C++ code:
const int N = 1000000;
int id[N];       // value can range from 0 to 9
float value[N];

// load id and value from an external source...

int size[10] = { 0 };
float sum[10] = { 0 };

for (int i = 0; i < N; ++i)
{
    ++size[id[i]];
    sum[id[i]] += value[i];
}
How should I optimize the loop?
I considered using SSE to add every 4 floats to a sum, so that after N iterations the result would just be the sum of the 4 floats in the xmm register, but this doesn't work when the source is indexed like this and needs to write out to 10 different buckets.
This kind of loop is very hard to optimize using SIMD instructions. Not only isn't there an easy way in most SIMD instruction sets to do this kind of indexed read ("gather") or write ("scatter"), even if there was, this particular loop still has the problem that you might have two values that map to the same id in one SIMD register, e.g. when
id[0] == 0
id[1] == 1
id[2] == 2
id[3] == 0
in this case, the obvious approach (pseudocode here)
x = gather(size, id[i]);
y = gather(sum, id[i]);
x += 1; // componentwise
y += value[i];
scatter(x, size, id[i]);
scatter(y, sum, id[i]);
won't work either!
You can get by if there's a really small number of possible cases (e.g. assume that sum and size only had 3 elements each) by just doing brute-force compares, but that doesn't really scale.
One way to get this somewhat faster without using SIMD is by breaking up the dependencies between instructions a bit using unrolling:
int size[10] = { 0 }, size2[10] = { 0 };
float sum[10] = { 0 }, sum2[10] = { 0 };
for (int i = 0; i < N/2; i++) {
    int id0 = id[i*2+0], id1 = id[i*2+1];
    ++size[id0];
    ++size2[id1];
    sum[id0]  += value[i*2+0];
    sum2[id1] += value[i*2+1];
}
// if N was odd, process the last element
if (N & 1) {
    ++size[id[N-1]];
    sum[id[N-1]] += value[N-1];
}
// add partial sums together
for (int i = 0; i < 10; i++) {
    size[i] += size2[i];
    sum[i]  += sum2[i];
}
Whether this helps or not depends on the target CPU though.
Well, you are reading id[i] twice in your loop. You could store it in a variable, or in a register int if you wanted to.
register int index;
for (int i = 0; i < N; ++i)
{
    index = id[i];
    ++size[index];
    sum[index] += value[i];
}
The MSDN docs state this about register:
The register keyword specifies that the variable is to be stored in a machine register. Microsoft Specific: The compiler does not accept user requests for register variables; instead, it makes its own register choices when global register-allocation optimization (/Oe option) is on. However, all other semantics associated with the register keyword are honored.
Something you can do is to compile it with the -S flag (or equivalent if you aren't using gcc) and compare the various assembly outputs using -O, -O2, and -O3 flags. One common way to optimize a loop is to do some degree of unrolling, for (a very simple, naive) example:
int end = N/2;
int index = 0;
for (int i = 0; i < end; ++i)
{
    index = 2 * i;
    ++size[id[index]];
    sum[id[index]] += value[index];
    index++;
    ++size[id[index]];
    sum[id[index]] += value[index];
}
which will cut the number of cmp instructions in half. However, any half-decent optimizing compiler will do this for you.
Are you sure it will make much difference? The likelihood is that the loading of "id from an external source" will take significantly longer than adding up the values.
Do not optimise until you KNOW where the bottleneck is.
Edit in answer to the comment: You misunderstand me. If it takes 10 seconds to load the ids from a hard disk, then the fractions of a second spent processing the list are immaterial in the grander scheme of things. Let's say it takes 10 seconds to load and 1 second to process: if you optimise the processing loop down to 0 seconds (almost impossible, but it illustrates the point), it STILL takes 10 seconds. 11 seconds really isn't that bad a performance hit, and you would be better off focusing your optimisation time on the actual data load, as that is far more likely to be the slow part.
In fact it can be quite optimal to do double-buffered data loads: i.e. you load buffer 0, then start the load of buffer 1. While buffer 1 is loading you process buffer 0. When finished, start the load of the next buffer while processing buffer 1, and so on. This way you can completely amortise the cost of processing.
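A rough sketch of that double-buffered pattern using std::thread (load_chunk() and process() are placeholders for whatever does the I/O and the summing loop):

#include <thread>
#include <utility>
#include <vector>

std::vector<int> load_chunk();                 // placeholder: read the next batch from disk
void process(const std::vector<int>& chunk);   // placeholder: the counting/summing loop

void run()
{
    std::vector<int> current = load_chunk();
    while (!current.empty()) {
        std::vector<int> next;
        std::thread loader([&next] { next = load_chunk(); });  // overlap the next load...
        process(current);                                      // ...with processing this buffer
        loader.join();
        current = std::move(next);
    }
}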
Further edit: In fact your best optimisation would probably come from loading things into a set of buckets that eliminates the "id[i]" part of the calculation. You could then simply offload to 3 threads where each uses SSE adds. This way you could have them all going simultaneously and, provided you have at least a triple-core machine, process the whole data in a tenth of the time. Organising data for optimal processing will always allow for the best optimisation, IMO.
Depending on your target machine and compiler, see if you have the _mm_prefetch intrinsic and give it a shot. Back in the Pentium D days, pre-fetching data using the asm instruction for that intrinsic was a real speed win as long as you were pre-fetching a few loop iterations before you needed the data.
See here (Page 95 in the PDF) for more info from Intel.
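A sketch of what that could look like in the loop above (the prefetch distance of 16 iterations is a guess you would have to tune for your machine):

#include <xmmintrin.h>   // _mm_prefetch

for (int i = 0; i < N; ++i)
{
    if (i + 16 < N) {
        // hint the caches to pull in data we will need a few iterations from now
        _mm_prefetch(reinterpret_cast<const char*>(&id[i + 16]),    _MM_HINT_T0);
        _mm_prefetch(reinterpret_cast<const char*>(&value[i + 16]), _MM_HINT_T0);
    }
    ++size[id[i]];
    sum[id[i]] += value[i];
}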
This computation is trivially parallelizable; just add
#pragma omp parallel for reduction(+:size[:10],sum[:10]) schedule(static)
immediately above the loop if you have OpenMP support (-fopenmp in GCC.) However, I would not expect much speedup on a typical multicore desktop machine; you're doing so little computation per item fetched that you're almost certainly going to be constrained by memory bandwidth.
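For reference, the pragma applied to the loop. Note that array-section reductions like these require OpenMP 4.5 or later; with older compilers you would keep per-thread copies of the arrays and merge them by hand:

#pragma omp parallel for reduction(+:size[:10],sum[:10]) schedule(static)
for (int i = 0; i < N; ++i)
{
    ++size[id[i]];
    sum[id[i]] += value[i];
}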
If you need to perform the summation several times for a given id mapping (i.e. the value[] array changes more often than id[]), you can halve your memory bandwidth requirements by pre-sorting the value[] elements into id order and eliminating the per-element fetch from id[]:
for (i = 0, j = 0, k = 0; j < 10; sum[j] += tmp, j++)
    for (k += size[j], tmp = 0; i < k; i++)
        tmp += value[i];