Why is dividing slower than bitshifting in C++? - c++

I wrote two pieces of code, one that divides a random number by two, and one that bitshifts the same random number right once. As I understand it, this should produce the same result. However, when I time both pieces of code, I consistently get data saying that the shifting is faster. Why is that?
Shifting code:
double iterations = atoi(argv[1]) * 1000;
int result = 0;
cout << "Doing " << iterations << " iterations." << endl;
srand(31459);
for(int i=0;i<iterations;i++){
if(i % 2 == 0){
result = result + (rand()>>1);
}else{
result = result - (rand()>>1);
}
}
Dividing code:
double iterations = atoi(argv[1]) * 1000;
int result = 0;
cout << "Doing " << iterations << " iterations." << endl;
srand(31459);
for(int i=0;i<iterations;i++){
if(i % 2 == 0){
result = result + (rand() / 2);
}else{
result = result - (rand() / 2);
}
}
Timing and results:
$ time ./divide 1000000; time ./shift 1000000
Doing 1e+09 iterations.
real 0m12.291s
user 0m12.260s
sys 0m0.021s
Doing 1e+09 iterations.
real 0m12.091s
user 0m12.056s
sys 0m0.019s
$ time ./shift 1000000; time ./divide 1000000
Doing 1e+09 iterations.
real 0m12.083s
user 0m12.028s
sys 0m0.035s
Doing 1e+09 iterations.
real 0m12.198s
user 0m12.158s
sys 0m0.028s
Addtional information:
I am not using any optimizations when compiling
I am running this on a virtualized install of Fedora 20, kernal: 3.12.10-300.fc20.x86_64

It's not; it's slower on the architecture you're running on. It's almost always slower because the hardware behind bit shifting is trivial, while division is a bit of a nightmare. In base 10, what's easier for you, 78358582354 >> 3 or 78358582354 / 85? Instructions generally take the same time to execute regardless of input, and in you case, it's the compiler's job to convert /2 to >>1; the CPU just does as it's told.

It isn't actually slower. I've run your benchmark using nonius like so:
#define NONIUS_RUNNER
#include "Nonius.h++"
#include <type_traits>
#include <random>
#include <vector>
NONIUS_BENCHMARK("Divide", [](nonius::chronometer meter)
{
std::random_device rd;
std::uniform_int_distribution<int> dist(0, 9);
std::vector<int> storage(meter.runs());
meter.measure([&](int i) { storage[i] = storage[i] % 2 == 0 ? storage[i] - (dist(rd) >> 1) : storage[i] + (dist(rd) >> 1); });
})
NONIUS_BENCHMARK("std::string destruction", [](nonius::chronometer meter)
{
std::random_device rd;
std::uniform_int_distribution<int> dist(0, 9);
std::vector<int> storage(meter.runs());
meter.measure([&](int i) { storage[i] = storage[i] % 2 == 0 ? storage[i] - (dist(rd) / 2) : storage[i] + (dist(rd) / 2); });
})
And these are the results:
As you can see both of them are neck and neck.
(You can find the html output here)
P.S: It seems I forgot to rename the second test. My bad.

It seems that difference in resuls is bellow the results spread, so you cann't really tell if it is different. But in general division can't be done in single opperation, bit shift can, so bit shift usualy should be faster.
But as you have literal 2 in your code, I would guess that compiler, even without optimizations produces identical code.

Note that rand returns int and divide int (signed by default) by 2 is not the same as shifting by 1. You can easily check generated asm and see the difference, or simply check resulting binary size:
> g++ -O3 boo.cpp -c -o boo # divide
> g++ -O3 foo.cpp -c -o foo # shift
> ls -la foo boo
... 4016 ... boo # divide
... 3984 ... foo # shift
Now add static_cast patch:
if (i % 2 == 0) {
result = result + (static_cast<unsigned>(rand())/2);
}
else {
result = result - (static_cast<unsigned>(rand())/2);
}
and check the size again:
> g++ -O3 boo.cpp -c -o boo # divide
> g++ -O3 foo.cpp -c -o foo # shift
> ls -la foo boo
... 3984 ... boo # divide
... 3984 ... foo # shift
to be sure you can verify that generated asm in both binaries is the same

Related

Change value of randomly generated number upon second compilation

I applied the random number generator to my code although the first number generated doesn't change when I run the code second or the third time. The other numbers change however and the issue is only on the first value. I'm using code blocks; Cygwin GCC compiler (c++ 17). Seeding using time.
#include <iostream>
#include <random>
#include <ctime>
int main()
{
std::default_random_engine randomGenerator(time(0));
std::uniform_int_distribution randomNumber(1, 20);
int a, b, c;
a = randomNumber(randomGenerator);
b = randomNumber(randomGenerator);
c = randomNumber(randomGenerator);
std::cout<<a<<std::endl;
std::cout<<b<<std::endl;
std::cout<<c<<std::endl;
return 0;
}
In such a case when I run the code the first time it may produce a result like a = 4, b = 5, c = 9. The second and further time (a) remains 4 but (b) and (c) keep changing.
Per my comment, the std::mt19937 is the main PRNG you should consider. It's the best one provided in <random>. You should also seed it better. Here I use std::random_device.
Some people will moan about how std::random_device falls back to a deterministic seed when a source of true random data can't be found, but that's pretty rare outside of low-level embedded stuff.
#include <iostream>
#include <random>
int main() {
std::mt19937 randomGenerator(std::random_device{}());
std::uniform_int_distribution randomNumber(1, 20);
for (int i = 0; i < 3; ++i) {
std::cout << randomNumber(randomGenerator) << ' ';
}
std::cout << '\n';
return 0;
}
Output:
~/tmp
❯ ./a.out
8 2 16
~/tmp
❯ ./a.out
7 12 14
~/tmp
❯ ./a.out
8 12 4
~/tmp
❯ ./a.out
18 8 7
Here we see four runs that are pretty distinct. Because your range is small, you'll see patterns pop up every now and again. There are other areas of improvement, notably providing a more robust seed to the PRNG, but for a toy program, this suffices.

Branch prediction does not improve the performance [duplicate]

I discovered this popular ~9-year-old SO question and decided to double-check its outcomes.
So, I have AMD Ryzen 9 5950X, clang++ 10 and Linux, I copy-pasted code from the question and here is what I got:
Sorted - 0.549702s:
~/d/so_sorting_faster$ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
std::sort(data, data + arraySize);
0.549702
sum = 314931600000
Unsorted - 0.546554s:
~/d/so_sorting_faster $ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
// std::sort(data, data + arraySize);
0.546554
sum = 314931600000
I am pretty sure that the fact that unsorted version turned out to be faster by 3ms is just noise, but it seems it is not slower anymore.
So, what has changed in the architecture of CPU (so that it is not an order of magnitude slower anymore)?
Here are results from multiple runs:
Unsorted: 0.543557 0.551147 0.541722 0.555599
Sorted: 0.542587 0.559719 0.53938 0.557909
Just in case, here is my main.cpp:
#include <algorithm>
#include <ctime>
#include <iostream>
int main()
{
// Generate data
const unsigned arraySize = 32768;
int data[arraySize];
for (unsigned c = 0; c < arraySize; ++c)
data[c] = std::rand() % 256;
// !!! With this, the next loop runs faster.
// std::sort(data, data + arraySize);
// Test
clock_t start = clock();
long long sum = 0;
for (unsigned i = 0; i < 100000; ++i)
{
// Primary loop
for (unsigned c = 0; c < arraySize; ++c)
{
if (data[c] >= 128)
sum += data[c];
}
}
double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
std::cout << elapsedTime << std::endl;
std::cout << "sum = " << sum << std::endl;
return 0;
}
Update
With larger number of elements (627680):
Unsorted
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
// std::sort(data, data + arraySize);
10.3814
Sorted:
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
std::sort(data, data + arraySize);
10.6885
I think the question is still relevant - almost no difference.
Several of the answers in the question you link talk about rewriting the code to be branchless and thus avoiding any branch prediction issues. That's what your updated compiler is doing.
Specifically, clang++ 10 with -O3 vectorizes the inner loop. See the code on godbolt, lines 36-67 of the assembly. The code is a little bit complicated, but one thing you definitely don't see is any conditional branch on the data[c] >= 128 test. Instead it uses vector compare instructions (pcmpgtd) whose output is a mask with 1s for matching elements and 0s for non-matching. The subsequent pand with this mask replaces the non-matching elements by 0, so that they do not contribute anything when unconditionally added to the sum.
The rough C++ equivalent would be
sum += data[c] & -(data[c] >= 128);
The code actually keeps two running 64-bit sums, for the even and odd elements of the array, so that they can be accumulated in parallel and then added together at the end of the loop.
Some of the extra complexity is to take care of sign-extending the 32-bit data elements to 64 bits; that's what sequences like pxor xmm5, xmm5 ; pcmpgtd xmm5, xmm4 ; punpckldq xmm4, xmm5 accomplish. Turn on -mavx2 and you'll see a simpler vpmovsxdq ymm5, xmm5 in its place.
The code also looks long because the loop has been unrolled, processing 8 elements of data per iteration.

Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?

I discovered this popular ~9-year-old SO question and decided to double-check its outcomes.
So, I have AMD Ryzen 9 5950X, clang++ 10 and Linux, I copy-pasted code from the question and here is what I got:
Sorted - 0.549702s:
~/d/so_sorting_faster$ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
std::sort(data, data + arraySize);
0.549702
sum = 314931600000
Unsorted - 0.546554s:
~/d/so_sorting_faster $ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
// std::sort(data, data + arraySize);
0.546554
sum = 314931600000
I am pretty sure that the fact that unsorted version turned out to be faster by 3ms is just noise, but it seems it is not slower anymore.
So, what has changed in the architecture of CPU (so that it is not an order of magnitude slower anymore)?
Here are results from multiple runs:
Unsorted: 0.543557 0.551147 0.541722 0.555599
Sorted: 0.542587 0.559719 0.53938 0.557909
Just in case, here is my main.cpp:
#include <algorithm>
#include <ctime>
#include <iostream>
int main()
{
// Generate data
const unsigned arraySize = 32768;
int data[arraySize];
for (unsigned c = 0; c < arraySize; ++c)
data[c] = std::rand() % 256;
// !!! With this, the next loop runs faster.
// std::sort(data, data + arraySize);
// Test
clock_t start = clock();
long long sum = 0;
for (unsigned i = 0; i < 100000; ++i)
{
// Primary loop
for (unsigned c = 0; c < arraySize; ++c)
{
if (data[c] >= 128)
sum += data[c];
}
}
double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
std::cout << elapsedTime << std::endl;
std::cout << "sum = " << sum << std::endl;
return 0;
}
Update
With larger number of elements (627680):
Unsorted
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
// std::sort(data, data + arraySize);
10.3814
Sorted:
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
std::sort(data, data + arraySize);
10.6885
I think the question is still relevant - almost no difference.
Several of the answers in the question you link talk about rewriting the code to be branchless and thus avoiding any branch prediction issues. That's what your updated compiler is doing.
Specifically, clang++ 10 with -O3 vectorizes the inner loop. See the code on godbolt, lines 36-67 of the assembly. The code is a little bit complicated, but one thing you definitely don't see is any conditional branch on the data[c] >= 128 test. Instead it uses vector compare instructions (pcmpgtd) whose output is a mask with 1s for matching elements and 0s for non-matching. The subsequent pand with this mask replaces the non-matching elements by 0, so that they do not contribute anything when unconditionally added to the sum.
The rough C++ equivalent would be
sum += data[c] & -(data[c] >= 128);
The code actually keeps two running 64-bit sums, for the even and odd elements of the array, so that they can be accumulated in parallel and then added together at the end of the loop.
Some of the extra complexity is to take care of sign-extending the 32-bit data elements to 64 bits; that's what sequences like pxor xmm5, xmm5 ; pcmpgtd xmm5, xmm4 ; punpckldq xmm4, xmm5 accomplish. Turn on -mavx2 and you'll see a simpler vpmovsxdq ymm5, xmm5 in its place.
The code also looks long because the loop has been unrolled, processing 8 elements of data per iteration.

Very large execution time differences for virtually same C++ and Python code

I was trying to write a solution for Problem 12 (Project Euler) in Python. The solution was just too slow, so I tried checking up other people's solution on the internet. I found this code written in C++ which does virtually the same exact thing as my python code, with just a few insignificant differences.
Python:
def find_number_of_divisiors(n):
if n == 1:
return 1
div = 2 # 1 and the number itself
for i in range(2, n/2 + 1):
if (n % i) == 0:
div += 1
return div
def tri_nums():
n = 1
t = 1
while 1:
yield t
n += 1
t += n
t = tri_nums()
m = 0
for n in t:
d = find_number_of_divisiors(n)
if m < d:
print n, ' has ', d, ' divisors.'
m = d
if m == 320:
exit(0)
C++:
#include <iostream>
int main(int argc, char *argv[])
{
unsigned int iteration = 1;
unsigned int triangle_number = 0;
unsigned int divisor_count = 0;
unsigned int current_max_divisor_count = 0;
while (true) {
triangle_number += iteration;
divisor_count = 0;
for (int x = 2; x <= triangle_number / 2; x ++) {
if (triangle_number % x == 0) {
divisor_count++;
}
}
if (divisor_count > current_max_divisor_count) {
current_max_divisor_count = divisor_count;
std::cout << triangle_number << " has " << divisor_count
<< " divisors." << std::endl;
}
if (divisor_count == 318) {
exit(0);
}
iteration++;
}
return 0;
}
The python code takes 1 minute and 25.83 seconds on my machine to execute. While the C++ code takes around 4.628 seconds. Its like 18x faster. I had expected the C++ code to be faster but not by this great margin and that too just for a simple solution which consists of just 2 loops and a bunch of increments and mods.
Although I would appreciate answers on how to solve this problem, the main question I want to ask is Why is C++ code so much faster? Am I using/doing something wrongly in python?
Replacing range with xrange:
After replacing range with xrange the python code takes around 1 minute 11.48 seconds to execute. (Around 1.2x faster)
This is exactly the kind of code where C++ is going to shine compared to Python: a single fairly tight loop doing arithmetic ops. (I'm going to ignore algorithmic speedups here, because your C++ code uses the same algorithm, and it seems you're explicitly not asking for that...)
C++ compiles this kind of code down to a relatively few number of instructions for the processor (and everything it does probably all fits in the super-fast levels of CPU cache), while Python has a lot of levels of indirection it's going through for each operation. For example, every time you increase a number it's checking that the number didn't just overflow and need to be moved into a bigger data type.
That said, all is not necessarily lost! This is also the kind of code that a just-in-time compiler system like PyPy will do well at, since once it's gone through the loop a few times it compiles the code to something similar to what the C++ code starts at. On my laptop:
$ time python2.7 euler.py >/dev/null
python euler.py 72.23s user 0.10s system 97% cpu 1:13.86 total
$ time pypy euler.py >/dev/null
pypy euler.py > /dev/null 13.21s user 0.03s system 99% cpu 13.251 total
$ clang++ -o euler euler.cpp && time ./euler >/dev/null
./euler > /dev/null 2.71s user 0.00s system 99% cpu 2.717 total
using the version of the Python code with xrange instead of range. Optimization levels don't make a difference for me with the C++ code, and neither does using GCC instead of Clang.
While we're at it, this is also a case where Cython can do very well, which compiles almost-Python code to C code that uses the Python APIs, but uses raw C when possible. If we change your code just a little bit by adding some type declarations, and removing the iterator since I don't know how to handle those efficiently in Cython, getting
cdef int find_number_of_divisiors(int n):
cdef int i, div
if n == 1:
return 1
div = 2 # 1 and the number itself
for i in xrange(2, n/2 + 1):
if (n % i) == 0:
div += 1
return div
cdef int m, n, t, d
m = 0
n = 1
t = 1
while True:
n += 1
t += n
d = find_number_of_divisiors(t)
if m < d:
print n, ' has ', d, ' divisors.'
m = d
if m == 320:
exit(0)
then on my laptop I get
$ time python -c 'import euler_cy' >/dev/null
python -c 'import euler_cy' > /dev/null 4.82s user 0.02s system 98% cpu 4.941 total
(within a factor of 2 of the C++ code).
Rewriting the divisor counting algorithm to use divisor function makes the run time reduces to less than 1 second. It is still possible to make it faster, but not really necessary.
This is to show that: before you do any optimization trick with the language features and compiler, you should check whether your algorithm is the bottleneck or not. The trick with compiler/interpreter is indeed quite powerful, as shown in Dougal's answer where the gap between Python and C++ is closed for the equivalent code. However, as you can see, the change in algorithm immediately give a huge performance boost and lower the run time to around the level of algorithmically inefficient C++ code (I didn't test the C++ version, but on my 6-year-old computer, the code below finishes running in ~0.6s).
The code below is written and tested with Python 3.2.3.
import math
def find_number_of_divisiors(n):
if n == 1:
return 1
num = 1
count = 1
div = 2
while (n % div == 0):
n //= div
count += 1
num *= count
div = 3
while (div <= pow(n, 0.5)):
count = 1
while n % div == 0:
n //= div
count += 1
num *= count
div += 2
if n > 1:
num *= 2
return num
Here's my own variant built on nhahtdh's factor-counting optimization plus my own prime factorization code:
def prime_factors(x):
def factor_this(x, factor):
factors = []
while x % factor == 0:
x /= factor
factors.append(factor)
return x, factors
x, factors = factor_this(x, 2)
x, f = factor_this(x, 3)
factors += f
i = 5
while i * i <= x:
for j in (2, 4):
x, f = factor_this(x, i)
factors += f
i += j
if x > 1:
factors.append(x)
return factors
def product(series):
from operator import mul
return reduce(mul, series, 1)
def factor_count(n):
from collections import Counter
c = Counter(prime_factors(n))
return product([cc + 1 for cc in c.values()])
def tri_nums():
n, t = 1, 1
while 1:
yield t
n += 1
t += n
if __name__ == '__main__':
m = 0
for n in tri_nums():
d = factor_count(n)
if m < d:
print n, ' has ', d, ' divisors.'
m = d
if m == 320:
break

How to let Boost::random and Matlab produce the same random numbers

To check my C++ code, I would like to be able to let Boost::Random and Matlab produce the same random numbers.
So for Boost I use the code:
boost::mt19937 var(static_cast<unsigned> (std::time(0)));
boost::uniform_int<> dist(1, 6);
boost::variate_generator<boost::mt19937&, boost::uniform_int<> > die(var, dist);
die.engine().seed(0);
for(int i = 0; i < 10; ++i) {
std::cout << die() << " ";
}
std::cout << std::endl;
Which produces (every run of the program):
4 4 5 6 4 6 4 6 3 4
And for matlab I use:
RandStream.setDefaultStream(RandStream('mt19937ar','seed',0));
randi(6,1,10)
Which produces (every run of the program):
5 6 1 6 4 1 2 4 6 6
Which is bizarre, since both use the same algorithm, and same seed.
What do I miss?
It seems that Python (using numpy) and Matlab seems comparable, in the random uniform numbers:
Matlab
RandStream.setDefaultStream(RandStream('mt19937ar','seed',203));rand(1,10)
0.8479 0.1889 0.4506 0.6253 0.9697 0.2078 0.5944 0.9115 0.2457 0.7743
Python:
random.seed(203);random.random(10)
array([ 0.84790006, 0.18893843, 0.45060688, 0.62534723, 0.96974765,
0.20780668, 0.59444858, 0.91145688, 0.24568615, 0.77430378])
C++Boost
0.8479 0.667228 0.188938 0.715892 0.450607 0.0790326 0.625347 0.972369 0.969748 0.858771
Which is identical to ever other Python and Matlab value...
I have to agree with the other answers, stating that these generators are not "absolute". They may produce different results according to the implementation. I think the simplest solution would be to implement your own generator. It might look daunting (Mersenne twister sure is by the way) but take a look at Xorshift, an extremely simple though powerful one. I copy the C implementation given in the Wikipedia link :
uint32_t xor128(void) {
static uint32_t x = 123456789;
static uint32_t y = 362436069;
static uint32_t z = 521288629;
static uint32_t w = 88675123;
uint32_t t;
t = x ^ (x << 11);
x = y; y = z; z = w;
return w = w ^ (w >> 19) ^ (t ^ (t >> 8));
}
To have the same seed, just put any values you want int x,y,z,w (except(0,0,0,0) I believe). You just need to be sure that Matlab and C++ use both 32 bit for these unsigned int.
Using the interface like
randi(6,1,10)
will apply some kind of transformation on the raw result of the random generator. This transformation is not trivial in general and Matlab will almost certainly do a different selection step than Boost.
Try comparing raw data streams from the RNGs - chances are they are the same
In case this helps anyone interested in the question:
In order to the get the same behavior for the Twister algorithm:
Download the file
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/CODES/mt19937ar.c
Try the following:
#include <stdint.h>
// mt19937ar.c content..
int main(void)
{
int i;
uint32_t seed = 100;
init_genrand(seed);
for (i = 0; i < 100; ++i)
printf("%.20f\n",genrand_res53());
return 0;
}
Make sure the same values are generated within matlab:
RandStream.setGlobalStream( RandStream.create('mt19937ar','seed',100) );
rand(100,1)
randi() seems to be simply ceil( rand()*maxval )
Thanks to Fezvez's answer I've written xor128 in matlab:
function [ w, state ] = xor128( state )
%XOR128 implementation of Xorshift
% https://en.wikipedia.org/wiki/Xorshift
% A starting state might be [123456789, 362436069, 521288629, 88675123]
x = state(1);
y = state(2);
z = state(3);
w = state(4);
% t1 = (x << 11)
t1 = bitand(bitshift(x,11),hex2dec('ffffffff'));
% t = x ^ (x << 11)
t = bitxor(x,t1);
x = y;
y = z;
z = w;
% t2 = (t ^ (t >> 8))
t2 = bitxor(t, bitshift(t,-8));
% t3 = w ^ (w >> 19)
t3 = bitxor(w, bitshift(w,-19));
% w = w ^ (w >> 19) ^ (t ^ (t >> 8))
w = bitxor(t3, t2);
state = [x y z w];
end
You need to pass state in to xor128 every time you use it. I've written a "tester" function which simply returns a vector with random numbers. I tested 1000 numbers output by this function against values output by cpp with gcc and it is perfect.
function [ v ] = txor( iterations )
%TXOR test xor128, returns vector v of length iterations with random number
% output from xor128
% output
v = zeros(iterations,1);
state = [123456789, 362436069, 521288629, 88675123];
i = 1;
while i <= iterations
disp(i);
[t,state] = xor128(state);
v(i) = t;
i = i + 1;
end
I would be very careful assuming that two different implementations of pseudo random generators (even though based on the same algorithms) produce the same result. There could be that one of the implementations use some sort of tweak, hence producing different results. If you need two equal "random" distributions I suggest you either precalculate a sequence, store and access from both C++ and Matlab or create your own generator. It should be fairly easy to implement MT19937 if you use the pseudocode on Wikipedia.
Take care ensuring that both your Matlab and C++ code runs on the same architecture (that is, both runs on either 32 or 64-bit) - using a 64 bit integer in one implementation and a 32 bit integer in the other will lead to different results.