The code was run on Visual Studio 2019 Version 16.11.8 with /O2 optimization on an Intel CPU. I am trying to find the root cause of a counter-intuitive result: according to a t-test, the version without attributes is statistically faster than the version with attributes. I am not sure what the root cause is. Could it be some sort of caching effect? Or some magic the compiler is doing? I cannot really read assembly.
#include <chrono>
#include <iomanip>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>
#include <cmath>
#include <functional>
static const size_t NUM_EXPERIMENTS = 1000;
double calc_mean(std::vector<double>& vec) {
double sum = 0;
for (auto& x : vec)
sum += x;
return sum / vec.size();
}
double calc_deviation(std::vector<double>& vec) {
double sum = 0;
for (int i = 0; i < vec.size(); i++)
sum = sum + (vec[i] - calc_mean(vec)) * (vec[i] - calc_mean(vec));
return sqrt(sum / (vec.size()));
}
double calc_ttest(std::vector<double> vec1, std::vector<double> vec2){
double mean1 = calc_mean(vec1);
double mean2 = calc_mean(vec2);
double sd1 = calc_deviation(vec1);
double sd2 = calc_deviation(vec2);
double t_test = (mean1 - mean2) / sqrt((sd1 * sd1) / vec1.size() + (sd2 * sd2) / vec2.size());
return t_test;
}
namespace with_attributes {
double calc(double x) noexcept {
if (x > 2) [[unlikely]]
return sqrt(x);
else [[likely]]
return pow(x, 2);
}
} // namespace with_attributes
namespace no_attributes {
double calc(double x) noexcept {
if (x > 2)
return sqrt(x);
else
return pow(x, 2);
}
} // namespace no_attributes
std::vector<double> benchmark(std::function<double(double)> calc_func) {
std::vector<double> vec;
vec.reserve(NUM_EXPERIMENTS);
std::mt19937 mersenne_engine(12);
std::uniform_real_distribution<double> dist{ 1, 2.2 };
for (size_t i = 0; i < NUM_EXPERIMENTS; i++) {
const auto start = std::chrono::high_resolution_clock::now();
for (auto size{ 1ULL }; size != 100000ULL; ++size) {
double x = dist(mersenne_engine);
calc_func(x);
}
const std::chrono::duration<double> diff =
std::chrono::high_resolution_clock::now() - start;
vec.push_back(diff.count());
}
return vec;
}
int main() {
std::vector<double> vec1 = benchmark(with_attributes::calc);
std::vector<double> vec2 = benchmark(no_attributes::calc);
std::cout << "with attribute: " << std::fixed << std::setprecision(6) << calc_mean(vec1) << '\n';
std::cout << "without attribute: " << std::fixed << std::setprecision(6) << calc_mean(vec2) << '\n';
std::cout << "T statistics" << std::fixed << std::setprecision(6) << calc_ttest(vec1, vec2) << '\n';
}
Per godbolt, the two functions generate identical assembly under MSVC:
movsd xmm1, QWORD PTR __real#4000000000000000
comisd xmm0, xmm1
jbe SHORT $LN2#calc
xorps xmm1, xmm1
ucomisd xmm1, xmm0
ja SHORT $LN7#calc
sqrtsd xmm0, xmm0
ret 0
$LN7#calc:
jmp sqrt
$LN2#calc:
jmp pow
Since MSVC is not open source, one can only guess why it chooses to ignore this optimization -- maybe because both branches are just function calls (tail calls, hence jmp instead of call), and that is too costly for [[likely]] to make a difference.
But if clang is used, it is smart enough to optimize pow(x, 2) into x * x, so different code is generated. Following that lead, if your code is modified into
double calc(double x) noexcept {
if (x > 2)
return x + 1;
else
return x - 2;
}
MSVC also outputs a different layout.
Compilers are smart. These days, they are very smart. They do a lot of work to figure out which optimizations pay off and where.
The likely and unlikely attributes exist to solve extremely specific problems. Problems that only become apparent after deep analysis of the performance characteristics, and generated assembly, of a particular piece of performance-critical code. They are not a salve you rub into any old code to make it go faster.
They are a scalpel. And without surgical training, a scalpel is likely to be misused.
So unless you have specific knowledge of a performance problem which analysis of assembly shows can be solved by better branch prediction, you should not assume that any use of these attributes will make any particular code go faster.
That is, the result you're getting is entirely legitimate.
Related
I am testing the performance of the C++ standard library algorithms and encountered a weird thing.
Here is my code to compare the performance of std::count vs plain for loop:
#include <algorithm>
#include <vector>
#include <iostream>
#include <chrono>
using namespace std::chrono;
int my_count(const std::vector<int>& v, int val) {
int num = 0;
for (int i: v) {
if (i == val)
num++;
}
return num;
}
int main()
{
int total_count = 0;
std::vector<int> v;
v.resize(100000000);
// Fill vector
for (int i = 0; i < v.size(); i++) {
v[i] = i % 10000;
}
int val = 1;
{
auto start = high_resolution_clock::now();
total_count += std::count(v.begin(), v.end(), val);
auto stop = high_resolution_clock::now();
std::cout << "std::count time: " << duration_cast<microseconds>(stop - start).count() << std::endl;
}
{
auto start = high_resolution_clock::now();
total_count += my_count(v, val);
auto stop = high_resolution_clock::now();
std::cout << "my_count time: " << duration_cast<microseconds>(stop - start).count() << std::endl;
}
// We need this so the compiler does not prune the code above
std::cout << "Total items: " << total_count << std::endl;
}
With MinGW I get this:
std::count time: 65827
my_count time: 64861
And with MSVC I get a pretty weird result:
std::count time: 65532
my_count time: 28584
The MinGW result seems reasonable since, as far as I know, the STL count function is roughly equivalent to the plain for loop, but the MSVC result seems weird: why is the plain for loop more than 2x faster than std::count?
These results are reproducible on my machine - it's not something that occurs once, but it occurs each time I run the code. I even tried changing the function order, running multiple for loops to avoid caching or branch prediction bias, but I still get the same result.
Is there any reason for this?
This is because MSVC vectorizes your manually written code, but is unable to do the same for std::count.
This is how vectorized code looks:
movdqa xmm5, XMMWORD PTR __xmm#00000001000000010000000100000001
and rcx, -8
xorps xmm3, xmm3
xorps xmm2, xmm2
npad 3
$LL4#my_count:
movdqu xmm1, XMMWORD PTR [rax]
add r8, 8
movdqa xmm0, xmm5
paddd xmm0, xmm3
pcmpeqd xmm1, xmm4
pand xmm0, xmm1
pandn xmm1, xmm3
movdqa xmm3, xmm0
movdqa xmm0, xmm5
por xmm3, xmm1
paddd xmm0, xmm2
movdqu xmm1, XMMWORD PTR [rax+16]
add rax, 32 ; 00000020H
pcmpeqd xmm1, xmm4
pand xmm0, xmm1
pandn xmm1, xmm2
movdqa xmm2, xmm0
por xmm2, xmm1
cmp r8, rcx
jne SHORT $LL4#my_count
You can see how it loads four 1s into the xmm5 register at the beginning. This value is used to maintain 4 separate counters that track the results for the 1st, 2nd, 3rd and 4th DWORDs. Once counting is done, those 4 values are added together to form the result of the function.
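For intuition, here is a rough scalar sketch of the shape of that computation (an illustration of the idea, not MSVC's actual code; names are made up): four interleaved counters that are only combined after the loop.
#include <cstddef>
#include <vector>
int count_like_vectorized(const std::vector<int>& v, int val) {
    int c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        c0 += (v[i + 0] == val);   // lane 0 counter
        c1 += (v[i + 1] == val);   // lane 1 counter
        c2 += (v[i + 2] == val);   // lane 2 counter
        c3 += (v[i + 3] == val);   // lane 3 counter
    }
    int num = c0 + c1 + c2 + c3;   // horizontal sum once, after the loop
    for (; i < v.size(); ++i)      // scalar tail for leftover elements
        num += (v[i] == val);
    return num;
}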
The issue with the MSVC vectorizer seems to lie in the fact that the counter, data type and argument type should be "compatible":
Return type should match in size the data type
Argument type should be equal or less in size to data type
If either of those constraints is not met, the code is not vectorized. This makes sense: if your data type is 32 bits wide, you have to operate on 32-bit counters to make them work together, so if your return type is 64 bits wide instead, some additional manipulation is required (which is what GCC is able to do, but this still slows down std::count compared to a manually written loop).
This is a case where a manually written loop should be preferred, as a subtle difference in semantics (the int return type) makes it easier to vectorize (even for GCC, which generates shorter code).
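To make that size constraint concrete, here is a minimal sketch (my own illustration, not from the answer above): std::count accumulates into the iterator's difference_type, which is a 64-bit ptrdiff_t on this platform, while the hand-written loop uses a 32-bit int that matches the 32-bit elements.
#include <cstdint>
#include <vector>
// 32-bit counter, 32-bit elements: sizes match, so this is the easy case for the vectorizer.
int count_int(const std::vector<int>& v, int val) {
    int num = 0;
    for (int i : v)
        if (i == val)
            num++;
    return num;
}
// 64-bit counter, 32-bit elements: the same mismatch std::count has (its return type is the
// iterator's difference_type), which per the answer above defeats MSVC's vectorizer.
std::int64_t count_int64(const std::vector<int>& v, int val) {
    std::int64_t num = 0;
    for (int i : v)
        if (i == val)
            num++;
    return num;
}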
Well, that seems to be an iterator issue.
I've made an extended test:
#include <algorithm>
#include <vector>
#include <iostream>
#include <chrono>
using namespace std::chrono;
int std_count(const std::vector<int>& v, int val) {
return std::count(v.begin(), v.end(), val);
}
int my_count_for(const std::vector<int>& v, int val) {
int num = 0;
for (int i = 0; i < v.size(); i++) {
if (v[i] == val) {
num++;
}
}
return num;
}
int my_count_for_in(const std::vector<int>& v, int val) {
int num = 0;
for (int i : v) {
if (i == val) {
num++;
}
}
return num;
}
int my_count_iter(const std::vector<int>& v, int val) {
int num = 0;
for (auto i = v.begin(); i < v.end(); i++) {
if (*i == val) {
num++;
}
}
return num;
}
int main()
{
std::vector<int> v;
v.resize(1000000);
// Fill vector
for (int i = 0; i < v.size(); i++) {
v[i] = i % 10000;
}
int val = 1;
int num_iters = 1000;
int total_count = 0;
for (int a = 0; a < 3; a++) {
{
auto start = high_resolution_clock::now();
for (int i = 0; i < num_iters; i++) {
total_count += std_count(v, val);
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
std::cout << "std::count time: " << duration.count() << std::endl;
}
{
auto start = high_resolution_clock::now();
for (int i = 0; i < num_iters; i++) {
total_count += my_count_for(v, val);
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
std::cout << "my_count_for time: " << duration.count() << std::endl;
}
{
auto start = high_resolution_clock::now();
for (int i = 0; i < num_iters; i++) {
total_count += my_count_for_in(v, val);
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
std::cout << "my_count_for_in time: " << duration.count() << std::endl;
}
{
auto start = high_resolution_clock::now();
for (int i = 0; i < num_iters; i++) {
total_count += my_count_iter(v, val);
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
std::cout << "my_count_iter time: " << duration.count() << std::endl;
}
std::cout << std::endl;
}
std::cout << total_count << std::endl;
std::cin >> total_count;
}
And here's what I get:
std::count time: 679683
my_count_for time: 235269
my_count_for_in time: 228185
my_count_iter time: 650714
std::count time: 656192
my_count_for time: 231248
my_count_for_in time: 231050
my_count_iter time: 652598
std::count time: 660295
my_count_for time: 238812
my_count_for_in time: 225893
my_count_iter time: 648812
It still seems quite weird that the STL function is not the fastest way to solve the task. If someone knows the detailed answer, please share it with me.
I'm starting out with SIMD programming, but I don't know what to do at this point. I'm trying to reduce the runtime, but it's going the other way.
This is my basic code:
https://codepaste.net/a8ut89
void blurr2(double * u, double * r) {
int i;
double dos[2] = { 2.0, 2.0 };
for (i = 0; i < SIZE - 1; i++) {
r[i] = u[i] + u[i + 1];
}
}
blurr2: 0.43s
int contarNegativos(double * u) {
int i;
int contador = 0;
for (i = 0; i < SIZE; i++) {
if (u[i] < 0) {
contador++;
}
}
return contador;
}
negativeCount: 1.38s
void ord(double * v, double * u, double * r) {
int i;
for (i = 0; i < SIZE; i += 2) {
r[i] = *(__int64*)&(v[i]) | *(__int64*)&(u[i]);
}
}
ord: 0.33s
And this is my SIMD code:
https://codepaste.net/fbg1g5
void blurr2(double * u, double * r) {
__m128d rp2;
__m128d rdos;
__m128d rr;
int i;
int sizeAux = SIZE % 2 == 1 ? SIZE : SIZE - 1;
double dos[2] = { 2.0, 2.0 };
rdos = *(__m128d*)dos;
for (i = 0; i < sizeAux; i += 2) {
rp2 = *(__m128d*)&u[i + 1];
rr = _mm_add_pd(*(__m128d*)&u[i], rp2);
*((__m128d*)&r[i]) = _mm_div_pd(rr, rdos);
}
}
blurr2: 0.42s
int contarNegativos(double * u) {
__m128d rcero;
__m128d rr;
int i;
double cero[2] = { 0.0, 0.0 };
int contador = 0;
rcero = *(__m128d*)cero;
for (i = 0; i < SIZE; i += 2) {
rr = _mm_cmplt_pd(*(__m128d*)&u[i], rcero);
if (((__int64 *)&rr)[0]) {
contador++;
};
if (((__int64 *)&rr)[1]) {
contador++;
};
}
return contador;
}
negativeCount: 1.42s
void ord(double * v, double * u, double * r) {
__m128d rr;
int i;
for (i = 0; i < SIZE; i += 2) {
*((__m128d*)&r[i]) = _mm_or_pd(*(__m128d*)&v[i], *(__m128d*)&u[i]);
}
}
ord: 0.35s
Different solutions.
Can you explain what I'm doing wrong? I'm a bit lost...
Use _mm_loadu_pd instead of pointer-casting and dereferencing a __m128d. Your code is guaranteed to segfault on gcc/clang where __m128d is assumed to be aligned.
blurr2: multiply by 0.5 instead of dividing by 2. It will be much faster. (I commented the same thing on a question with the exact same code in the last day or two, was that also you?)
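A minimal sketch of blurr2 along those lines (unaligned loads plus a multiply by 0.5; SIZE is assumed to be defined as in the original code, and r must have at least SIZE - 1 writable elements):
#include <immintrin.h>
#define SIZE 10000  /* assumed; use whatever SIZE the original code defines */
void blurr2_sse(const double *u, double *r) {
    const __m128d half = _mm_set1_pd(0.5);
    int i = 0;
    for (; i + 2 <= SIZE - 1; i += 2) {
        __m128d a = _mm_loadu_pd(&u[i]);        /* u[i],   u[i+1] */
        __m128d b = _mm_loadu_pd(&u[i + 1]);    /* u[i+1], u[i+2] */
        _mm_storeu_pd(&r[i], _mm_mul_pd(_mm_add_pd(a, b), half));
    }
    for (; i < SIZE - 1; ++i)                   /* scalar tail */
        r[i] = (u[i] + u[i + 1]) * 0.5;
}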
negativeCount: _mm_castpd_si128 the compare result to integer, and accumulate it with _mm_sub_epi64. (The bit pattern is all-zero or all-one, i.e. 2's complement 0 / -1).
#include <immintrin.h>
#include <stdint.h>
static const size_t SIZE = 1024;
uint64_t countNegative(double * u) {
__m128i counts = _mm_setzero_si128();
for (size_t i = 0; i < SIZE; i += 2) {
__m128d cmp = _mm_cmplt_pd(_mm_loadu_pd(&u[i]), _mm_setzero_pd());
counts = _mm_sub_epi64(counts, _mm_castpd_si128(cmp));
}
//return counts[0] + counts[1]; // GNU C only, and less efficient
// horizontal sum
__m128i hi64 = _mm_shuffle_epi32(counts, _MM_SHUFFLE(1, 0, 3, 2));
counts = _mm_add_epi64(counts, hi64);
uint64_t scalarcount = _mm_cvtsi128_si64(counts);
return scalarcount;
}
To learn more about efficient vector horizontal sums, see Fastest way to do horizontal float vector sum on x86. But the first rule is to do it outside the loop.
(source + asm on the Godbolt compiler explorer)
From MSVC (which I'm guessing you're using, or you'd get segfaults from *(__m128d*)foo), the inner loop is:
$LL4#countNegat:
movups xmm0, XMMWORD PTR [rcx]
lea rcx, QWORD PTR [rcx+16]
cmpltpd xmm0, xmm2
psubq xmm1, xmm0
sub rax, 1
jne SHORT $LL4#countNegat
It could maybe go faster with unrolling (and maybe two vector accumulators), but this is fairly good and might go close to 1.25 clocks per 16 bytes on Sandybridge/Haswell. (Bottleneck on 5 fused-domain uops).
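For example, a hypothetical unrolled variant with two vector accumulators might look like this (my sketch, not benchmarked; assumes SIZE is a multiple of 4):
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>
static const size_t SIZE = 1024;   // assumed, as in the version above
uint64_t countNegative_unrolled(const double *u) {
    __m128i c0 = _mm_setzero_si128();
    __m128i c1 = _mm_setzero_si128();
    for (size_t i = 0; i < SIZE; i += 4) {                           // two vectors per iteration
        __m128d m0 = _mm_cmplt_pd(_mm_loadu_pd(&u[i]),     _mm_setzero_pd());
        __m128d m1 = _mm_cmplt_pd(_mm_loadu_pd(&u[i + 2]), _mm_setzero_pd());
        c0 = _mm_sub_epi64(c0, _mm_castpd_si128(m0));                // subtract 0 / -1
        c1 = _mm_sub_epi64(c1, _mm_castpd_si128(m1));
    }
    __m128i counts = _mm_add_epi64(c0, c1);
    // horizontal sum, same as before
    __m128i hi64 = _mm_shuffle_epi32(counts, _MM_SHUFFLE(1, 0, 3, 2));
    counts = _mm_add_epi64(counts, hi64);
    return (uint64_t)_mm_cvtsi128_si64(counts);
}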
Your version was actually unpacking to integer inside the inner loop! And if you were using MSVC -Ox, it was actually branching instead of using a branchless compare + conditional add. I'm surprised it wasn't slower than the scalar version.
Also, (int64_t *)&rr violates strict aliasing. char* can alias anything, but it's not safe to cast other pointer types onto SIMD vectors and expect it to work. If it does, you got lucky. Compilers usually generate similar code for that as for intrinsics, and usually not worse for proper intrinsics.
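If you do need the lanes of a vector in scalar code, a safe pattern (a sketch of mine, not from the answer above; names are made up) is to extract them with intrinsics or store to a plain array:
#include <immintrin.h>
#include <stdint.h>
// Well-defined ways to read the lanes of rr, instead of ((__int64*)&rr)[0]:
void read_lanes(__m128d rr, double out[2], int64_t *low_bits) {
    _mm_storeu_pd(out, rr);                                // both lanes as doubles
    *low_bits = _mm_cvtsi128_si64(_mm_castpd_si128(rr));   // low 64 bits, reinterpreted as an integer
}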
Do you know that the ord function with SIMD is not 1:1 equivalent to the ord function without SIMD instructions?
In the ord function without SIMD, the result of the OR operation is calculated only for the even indexes:
r[0] = v[0] | u[0],
r[2] = v[2] | u[2],
r[4] = v[4] | u[4]
What about the odd indexes? Maybe, if the OR operation were calculated for all indexes, it would take more time than it does now.
I have the following AVX and Native codes:
__forceinline double dotProduct_2(const double* u, const double* v)
{
_mm256_zeroupper();
__m256d xy = _mm256_mul_pd(_mm256_load_pd(u), _mm256_load_pd(v));
__m256d temp = _mm256_hadd_pd(xy, xy);
__m128d dotproduct = _mm_add_pd(_mm256_extractf128_pd(temp, 0), _mm256_extractf128_pd(temp, 1));
return dotproduct.m128d_f64[0];
}
__forceinline double dotProduct_1(const D3& a, const D3& b)
{
return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
And respective test scripts:
std::cout << res_1 << " " << res_2 << " " << res_3 << '\n';
{
std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < (1 << 30); ++i)
{
zx_1 += dotProduct_1(aVx[i % 10000], aVx[(i + 1) % 10000]);
}
std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
std::cout << "NAIVE : " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << '\n';
}
{
std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < (1 << 30); ++i)
{
zx_2 += dotProduct_2(&aVx[i % 10000][0], &aVx[(i + 1) % 10000][0]);
}
std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
std::cout << "AVX : " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << '\n';
}
std::cout << math::min2(zx_1, zx_2) << " " << zx_1 << " " << zx_2;
Well, all of the data is aligned to 32 bytes (D3 with __declspec... and the aVx array with _mm_malloc()).
And, as I can see, the native variant is equal to or faster than the AVX variant. I can't understand it: is this normal behaviour? I thought AVX was supposed to be 'super fast'... If not, how can I optimize it? I compile it with MSVC 2015 (x64), with /arch:AVX. Also, my hardware is an Intel i7-4750HQ (Haswell).
Simple profiling with basic loops isn't a great idea - it usually just means you are memory bandwidth limited, so the tests end up coming out at about the same speed (memory is typically slower than the CPU, and that's basically all you are testing here).
As others have said, your code example isn't great, because you are constantly going across the lanes (which I assume is just to find the fastest dot product, and not specifically because a sum of all the dot products is the desired result?). To be honest, if you really need a fast dot product (for AOS data as presented here), I think I would prefer to replace the VHADDPD with a VADDPD + VPERMILPD (trading an additional instruction for twice the throughput, and a lower latency)
double dotProduct_3(const double* u, const double* v)
{
__m256d dp = _mm256_mul_pd(_mm256_load_pd(u), _mm256_load_pd(v));
__m128d a = _mm256_extractf128_pd(dp, 0);
__m128d b = _mm256_extractf128_pd(dp, 1);
__m128d c = _mm_add_pd(a, b);
__m128d yy = _mm_unpackhi_pd(c, c);
__m128d dotproduct = _mm_add_pd(c, yy);
return _mm_cvtsd_f64(dotproduct);
}
asm:
dotProduct_3(double const*, double const*):
vmovapd ymm0,YMMWORD PTR [rsi]
vmulpd ymm0,ymm0,YMMWORD PTR [rdi]
vextractf128 xmm1,ymm0,0x1
vaddpd xmm0,xmm1,xmm0
vpermilpd xmm1,xmm0,0x3
vaddpd xmm0,xmm1,xmm0
vzeroupper
ret
Generally speaking, if you are using horizontal adds, you're doing it wrong! Whilst a 256bit register may seem ideal for a Vector4d, it's not actually a particularly great representation (especially if you consider that AVX512 is now available!). A very similar question to this came up recently: For C++ Vector3 utility class implementations, is array faster than struct and class?
If you want performance, then structure-of-arrays is the best way to go.
struct HybridVec4SOA
{
__m256d x;
__m256d y;
__m256d z;
__m256d w;
};
__m256d dot(const HybridVec4SOA& a, const HybridVec4SOA& b)
{
return _mm256_fmadd_pd(a.w, b.w,
_mm256_fmadd_pd(a.z, b.z,
_mm256_fmadd_pd(a.y, b.y,
_mm256_mul_pd(a.x, b.x))));
}
asm:
dot(HybridVec4SOA const&, HybridVec4SOA const&):
vmovapd ymm1,YMMWORD PTR [rdi+0x20]
vmovapd ymm2,YMMWORD PTR [rdi+0x40]
vmovapd ymm3,YMMWORD PTR [rdi+0x60]
vmovapd ymm0,YMMWORD PTR [rsi]
vmulpd ymm0,ymm0,YMMWORD PTR [rdi]
vfmadd231pd ymm0,ymm1,YMMWORD PTR [rsi+0x20]
vfmadd231pd ymm0,ymm2,YMMWORD PTR [rsi+0x40]
vfmadd231pd ymm0,ymm3,YMMWORD PTR [rsi+0x60]
ret
If you compare the latencies (and more importantly throughput) of load/mul/fmadd compared to hadd and extract, and then consider that the SOA version is computing 4 dot products at a time (instead of 1), you'll start to understand why it's the way to go...
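As a usage sketch (my own illustration, not from the answer; the function and parameter names are made up, and FMA support is assumed, e.g. /arch:AVX2): each input array holds one component of four 4D vectors (ax = {a0.x, a1.x, a2.x, a3.x}, and so on), and lane i of the result is the dot product of the i-th pair, so one call yields four dot products.
#include <immintrin.h>
void four_dot_products(const double* ax, const double* ay, const double* az, const double* aw,
                       const double* bx, const double* by, const double* bz, const double* bw,
                       double out[4])
{
    __m256d d = _mm256_mul_pd(_mm256_loadu_pd(ax), _mm256_loadu_pd(bx));   // x*x
    d = _mm256_fmadd_pd(_mm256_loadu_pd(ay), _mm256_loadu_pd(by), d);      // + y*y
    d = _mm256_fmadd_pd(_mm256_loadu_pd(az), _mm256_loadu_pd(bz), d);      // + z*z
    d = _mm256_fmadd_pd(_mm256_loadu_pd(aw), _mm256_loadu_pd(bw), d);      // + w*w
    _mm256_storeu_pd(out, d);                                              // four dot products
}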
You add too much overhead with the vzeroupper and hadd instructions. A good way to write it is to do all the multiplies in a loop and aggregate the result just once at the end. Imagine you unroll the original loop 4 times and use 4 accumulators:
for(i=0; i < (1<<30); i+=4) {
s0 += a[i+0] * b[i+0];
s1 += a[i+1] * b[i+1];
s2 += a[i+2] * b[i+2];
s3 += a[i+3] * b[i+3];
}
return s0+s1+s2+s3;
And now just replace the unrolled loop with SIMD mul and add (or even an FMA intrinsic, if available).
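A rough sketch of that idea with AVX intrinsics (my illustration, not the answer's code; assumes FMA support and that n is a multiple of 4):
#include <immintrin.h>
#include <cstddef>
// Sum of element-wise products a[i]*b[i], accumulated in a vector register and
// reduced to a scalar only once, after the loop.
double dot_sum(const double* a, const double* b, std::size_t n) {
    __m256d acc = _mm256_setzero_pd();
    for (std::size_t i = 0; i < n; i += 4)
        acc = _mm256_fmadd_pd(_mm256_loadu_pd(&a[i]), _mm256_loadu_pd(&b[i]), acc);
    // horizontal sum of the 4 lanes, done once at the end
    __m128d lo = _mm256_castpd256_pd128(acc);
    __m128d hi = _mm256_extractf128_pd(acc, 1);
    __m128d s  = _mm_add_pd(lo, hi);
    s = _mm_add_sd(s, _mm_unpackhi_pd(s, s));
    return _mm_cvtsd_f64(s);
}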
There is something that baffles me with integer arithmetic in tutorials. To be precise, integer division.
The seemingly preferred method is to cast the divisor to a float, round the resulting quotient to the nearest whole number, and then cast that back to an integer:
#include <cmath>
int round_divide_by_float_casting(int a, int b){
return (int) std::roundf( a / (float) b);
}
Yet this seems like scratching your left ear with your right hand. I use:
int round_divide (int a, int b){
return a / b + a % b * 2 / b;
}
It's no breakthrough, but the fact that it is not the standard approach makes me wonder if I am missing something.
Despite my (albeit limited) testing, I couldn't find any scenario where the two methods give different results. Has anyone run into a scenario where the int → float → int casting produced more accurate results?
Arithmetic solution
If one had to define what your functions should return, it would be something close to: "f(a, b) returns the integer closest to the result of the real-number division of a by b."
Thus, the question can be summarized as: can we determine this closest integer using only integer division? I think we can.
There are exactly two candidates for the closest integer: a / b and (a / b) + 1 (1). The selection is easy: if a % b is closer to 0 than it is to b, then a / b is our result; if not, it is (a / b) + 1.
One could then write something similar to the following, ignoring optimization and good practices:
int divide(int a, int b)
{
const int quot = a / b;
const int rem = a % b;
int result;
if (rem < b - rem) {
result = quot;
} else {
result = quot + 1;
}
return result;
}
While this definition satisfies our needs, one can optimize it so the division of a by b is not computed twice, using std::div():
int divide(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem >= b - dv.rem) {
++result;
}
return result;
}
The analysis of the problem we did earlier assures us of the well defined behaviour of our implementation.
(1) There is just one last thing to check: how does it behave when a or b is negative? This is left to the reader ;).
Benchmark
#include <iostream>
#include <iomanip>
#include <string>
// solutions
#include <cmath>
#include <cstdlib>
// benchmark
#include <array>
#include <ctime>
#include <limits>
#include <random>
#include <chrono>
#include <algorithm>
#include <functional>
//
// Solutions
//
namespace
{
int round_divide_by_float_casting(int a, int b) {
return (int)roundf(a / (float)b);
}
int round_divide_by_modulo(int a, int b) {
return a / b + a % b * 2 / b;
}
int divide_by_quotient_comparison(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem >= b - dv.rem)
{
++result;
}
return result;
}
}
//
// benchmark
//
class Randomizer
{
std::mt19937 _rng_engine;
std::uniform_int_distribution<int> _distri;
public:
Randomizer() : _rng_engine(std::time(0)), _distri(std::numeric_limits<int>::min(), std::numeric_limits<int>::max())
{
}
template<class ForwardIt>
void operator()(ForwardIt begin, ForwardIt end)
{
std::generate(begin, end, std::bind(_distri, _rng_engine));
}
};
class Clock
{
std::chrono::time_point<std::chrono::steady_clock> _start;
public:
static inline std::chrono::time_point<std::chrono::steady_clock> now() { return std::chrono::steady_clock::now(); }
Clock() : _start(now())
{
}
template<class DurationUnit>
std::size_t end()
{
return std::chrono::duration_cast<DurationUnit>(now() - _start).count();
}
};
//
// Entry point
//
int main()
{
Randomizer randomizer;
std::array<int, 1000> dividends; // SCALE THIS UP (1'000'000 would be great)
std::array<int, dividends.size()> divisors;
std::array<int, dividends.size()> results;
randomizer(std::begin(dividends), std::end(dividends));
randomizer(std::begin(divisors), std::end(divisors));
{
Clock clock;
auto dividend = std::begin(dividends);
auto divisor = std::begin(divisors);
auto result = std::begin(results);
for ( ; dividend != std::end(dividends) ; ++dividend, ++divisor, ++result)
{
*result = round_divide_by_float_casting(*dividend, *divisor);
}
const float unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<float>(results.size());
std::cout << std::setw(40) << "round_divide_by_float_casting(): " << std::setprecision(3) << unit_time << " ns\n";
}
{
Clock clock;
auto dividend = std::begin(dividends);
auto divisor = std::begin(divisors);
auto result = std::begin(results);
for ( ; dividend != std::end(dividends) ; ++dividend, ++divisor, ++result)
{
*result = round_divide_by_modulo(*dividend, *divisor);
}
const float unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<float>(results.size());
std::cout << std::setw(40) << "round_divide_by_modulo(): " << std::setprecision(3) << unit_time << " ns\n";
}
{
Clock clock;
auto dividend = std::begin(dividends);
auto divisor = std::begin(divisors);
auto result = std::begin(results);
for ( ; dividend != std::end(dividends) ; ++dividend, ++divisor, ++result)
{
*result = divide_by_quotient_comparison(*dividend, *divisor);
}
const float unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<float>(results.size());
std::cout << std::setw(40) << "divide_by_quotient_comparison(): " << std::setprecision(3) << unit_time << " ns\n";
}
}
Outputs:
g++ -std=c++11 -O2 -Wall -Wextra -Werror main.cpp && ./a.out
round_divide_by_float_casting(): 54.7 ns
round_divide_by_modulo(): 24 ns
divide_by_quotient_comparison(): 25.5 ns
Demo
The two arithmetic solutions' performances are not distinguishable (their benchmark converges when you scale the bench size up).
Which one is better really depends on the processor and on the range of the integers involved (and using double would resolve most of the range issues).
For modern "big" CPUs like x86-64 and ARM, integer division and floating-point division are roughly the same effort, and converting an integer to a float or vice versa is not a "hard" task (and that conversion does the correct rounding directly, at least), so most likely the resulting operations are:
atmp = (float) a;
btmp = (float) b;
resfloat = divide atmp/btmp;
return = to_int_with_rounding(resfloat)
About four machine instructions.
On the other hand, your code uses two divides, one modulo and a multiply, which is quite likely longer on such a processor.
tmp = a/b;
tmp1 = a % b;
tmp2 = tmp1 * 2;
tmp3 = tmp2 / b;
tmp4 = tmp + tmp3;
So five instructions, and three of those are "divide" (unless the compiler is clever enough to reuse a / b for a % b - but it's still two distinct divides).
Of course, if you are outside the range of values that a float or double can hold without losing digits (24 bits of significand for float, 53 bits for double), then your method MAY be better (assuming there is no overflow in the integer math).
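For instance (my own example, using the two functions as defined above): with 32-bit int inputs that exceed float's 24 bits of precision, the two methods already disagree.
#include <cmath>
#include <iostream>
int round_divide_by_float_casting(int a, int b) {
    return (int)std::roundf(a / (float)b);
}
int round_divide_by_modulo(int a, int b) {
    return a / b + a % b * 2 / b;
}
int main() {
    int a = 16777217;   // 2^24 + 1: not representable in float, rounds to 16777216.0f
    int b = 1;
    std::cout << round_divide_by_float_casting(a, b) << '\n';  // prints 16777216
    std::cout << round_divide_by_modulo(a, b) << '\n';         // prints 16777217
}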
On top of all that, since the first form is used by "everyone", it's the one that the compiler recognises and can optimise.
Obviously, the results depend on both the compiler being used and the processor it runs on, but these are my results from running the code posted above, compiled through clang++ (v3.9-pre-release, pretty close to released 3.8).
round_divide_by_float_casting(): 32.5 ns
round_divide_by_modulo(): 113 ns
divide_by_quotient_comparison(): 80.4 ns
However, the interesting thing I find when I look at the generated code:
xorps %xmm0, %xmm0
cvtsi2ssl 8016(%rsp,%rbp), %xmm0
xorps %xmm1, %xmm1
cvtsi2ssl 4016(%rsp,%rbp), %xmm1
divss %xmm1, %xmm0
callq roundf
cvttss2si %xmm0, %eax
movl %eax, 16(%rsp,%rbp)
addq $4, %rbp
cmpq $4000, %rbp # imm = 0xFA0
jne .LBB0_7
is that the roundf is actually a call, which really surprises me, but explains why on some machines (particularly more recent x86 processors) it is faster.
g++ gives better results with -ffast-math, which gives around:
round_divide_by_float_casting(): 17.6 ns
round_divide_by_modulo(): 43.1 ns
divide_by_quotient_comparison(): 18.5 ns
(This is with increased count to 100k values)
Prefer the standard solution. Use std::div family of functions declared in cstdlib.
See: http://en.cppreference.com/w/cpp/numeric/math/div
Casting to float and then to int may be very inefficient on some architectures, for example, microcontrollers.
Thanks for the suggestions so far. To shed some light I made a test setup to compare performance.
#include <iostream>
#include <string>
#include <cmath>
#include <cstdlib>
#include <chrono>
using namespace std;
int round_divide_by_float_casting(int a, int b) {
return (int)roundf(a / (float)b);
}
int round_divide_by_modulo(int a, int b) {
return a / b + a % b * 2 / b;
}
int divide_by_quotient_comparison(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem >= b - dv.rem) {
++result;
}
return result;
}
int main()
{
int itr = 1000;
//while (true) {
auto begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 10; j < itr + 1; j++) {
divide_by_quotient_comparison(i, j);
}
}
auto end = std::chrono::steady_clock::now();
cout << "divide_by_quotient_comparison(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 10; j < itr + 1; j++) {
round_divide_by_float_casting(i, j);
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_float_casting(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 10; j < itr + 1; j++) {
round_divide_by_modulo(i, j);
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_modulo(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
//}
return 0;
}
The results I got on my machine (an i7 with Visual Studio 2015) were as follows: the modulo arithmetic was about twice as fast as the int → float → int casting method. The method relying on std::div_t (suggested by @YSC and @teroi) was faster than the int → float → int casting, but slower than the modulo arithmetic method.
A second test was performed to avoid certain compiler optimizations pointed out by @YSC:
#include <iostream>
#include <string>
#include <cmath>
#include <cstdlib>
#include <chrono>
#include <vector>
using namespace std;
int round_divide_by_float_casting(int a, int b) {
return (int)roundf(a / (float)b);
}
int round_divide_by_modulo(int a, int b) {
return a / b + a % b * 2 / b;
}
int divide_by_quotient_comparison(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem >= b - dv.rem) {
++result;
}
return result;
}
int main()
{
int itr = 100;
vector <int> randi, randj;
for (int i = 0; i < itr; i++) {
randi.push_back(rand());
int rj = rand();
if (rj == 0)
rj++;
randj.push_back(rj);
}
vector<int> f, m, q;
while (true) {
auto begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 0; j < itr; j++) {
q.push_back( divide_by_quotient_comparison(randi[i] , randj[j]) );
}
}
auto end = std::chrono::steady_clock::now();
cout << "divide_by_quotient_comparison(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 0; j < itr; j++) {
f.push_back( round_divide_by_float_casting(randi[i], randj[j]) );
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_float_casting(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 0; j < itr; j++) {
m.push_back( round_divide_by_modulo(randi[i], randj[j]) );
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_modulo(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
cout << endl;
f.clear();
m.clear();
q.clear();
}
return 0;
}
In this second test the slowest was divide_by_quotient_comparison(), which relies on std::div_t, followed by round_divide_by_float_casting(), and the fastest again was round_divide_by_modulo(). However, this time the performance difference was much, much lower: less than 20%.
The documentation of std::hypot says that:
Computes the square root of the sum of the squares of x and y, without undue overflow or underflow at intermediate stages of the computation.
I struggle to conceive a test case where std::hypot should be used over the trivial sqrt(x*x + y*y).
The following test shows that std::hypot is roughly 20x slower than the naive calculation.
#include <iostream>
#include <chrono>
#include <random>
#include <algorithm>
#include <vector>
int main(int, char**) {
std::mt19937_64 mt;
const auto samples = 10000000;
std::vector<double> values(2 * samples);
std::uniform_real_distribution<double> urd(-100.0, 100.0);
std::generate_n(values.begin(), 2 * samples, [&]() {return urd(mt); });
std::cout.precision(15);
{
double sum = 0;
auto s = std::chrono::steady_clock::now();
for (auto i = 0; i < 2 * samples; i += 2) {
sum += std::hypot(values[i], values[i + 1]);
}
auto e = std::chrono::steady_clock::now();
std::cout << std::fixed <<std::chrono::duration_cast<std::chrono::microseconds>(e - s).count() << "us --- s:" << sum << std::endl;
}
{
double sum = 0;
auto s = std::chrono::steady_clock::now();
for (auto i = 0; i < 2 * samples; i += 2) {
sum += std::sqrt(values[i]* values[i] + values[i + 1]* values[i + 1]);
}
auto e = std::chrono::steady_clock::now();
std::cout << std::fixed << std::chrono::duration_cast<std::chrono::microseconds>(e - s).count() << "us --- s:" << sum << std::endl;
}
}
So I'm asking for guidance: when must I use std::hypot(x, y) to obtain correct results over the much faster std::sqrt(x*x + y*y)?
Clarification: I'm looking for answers that apply when x and y are floating point numbers. I.e. compare:
double h = std::hypot(static_cast<double>(x),static_cast<double>(y));
to:
double xx = static_cast<double>(x);
double yy = static_cast<double>(y);
double h = std::sqrt(xx*xx + yy*yy);
The answer is in the documentation you quoted
Computes the square root of the sum of the squares of x and y, without undue overflow or underflow at intermediate stages of the computation.
If x*x + y*y overflows, then if you carry out the calculation manually, you'll get the wrong answer. If you use std::hypot, however, it guarantees that the intermediate calculations will not overflow.
You can see an example of this disparity here.
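Since that linked example isn't reproduced above, here is a minimal sketch of the kind of disparity meant: with large (but perfectly representable) inputs, x*x overflows to infinity while std::hypot still returns the right magnitude.
#include <cmath>
#include <iostream>
int main() {
    double x = 3e200, y = 4e200;            // x*x and y*y overflow a double
    std::cout << std::sqrt(x * x + y * y)   // prints inf
              << '\n'
              << std::hypot(x, y)           // prints 5e+200
              << '\n';
}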
If you are working with numbers which you know will not overflow the relevant representation for your platform, you can happily use the naive version.