What is faster at division? doubles / floats / UInt32 / UInt64? in C++/C

I did some speed testing to figure out what is fastest when doing multiplication or division on numbers. I had to work really hard to defeat the optimiser. I got nonsensical results, such as a massive loop running in 2 microseconds, or multiplication being the same speed as division (if only that were true).
After I finally worked hard enough to defeat enough of the compiler optimisations, while still letting it optimise for speed, I got these speed results. They may be of interest to someone else.
If my test is STILL FLAWED, let me know, but be kind seeing as I just spent two hours writing this crap :P
64 time: 3826718 us
32 time: 2476484 us
D(mul) time: 936524 us
D(div) time: 3614857 us
S time: 1506020 us
"Multiplying to divide" using doubles seems the fastest way to do a division, followed by integer division. I did not test the accuracy of division. Could it be that "proper division" is more accurate? I have no desire to find out after these speed test results as I'll just be using integer division on a base 10 constant and letting my compiler optimise it for me ;) (and not defeating it's optimisations either).
Here's the code I used to get the results:
#include <cstdint>
#include <cstdio>
#include <ctime>
int Run(int bla, int div, int add, int minus) {
// these parameters are to force the compiler to not be able to optimise away the
// multiplications and divides :)
long LoopMax = 100000000;
uint32_t Origbla32 = 1000000000;
long i = 0;
uint32_t bla32 = Origbla32;
uint32_t div32 = div;
clock_t Time32 = clock();
for (i = 0; i < LoopMax; i++) {
div32 += add;
div32 -= minus;
bla32 = bla32 / div32;
bla32 += bla;
bla32 = bla32 * div32;
}
Time32 = clock() - Time32;
uint64_t bla64 = bla32;
clock_t Time64 = clock();
uint64_t div64 = div;
for (long i = 0; i < LoopMax; i++) {
div64 += add;
div64 -= minus;
bla64 = bla64 / div64;
bla64 += bla;
bla64 = bla64 * div64;
}
Time64 = clock() - Time64;
double blaDMul = Origbla32;
double multodiv = 1.0 / (double)div;
double multomul = div;
clock_t TimeDMul = clock();
for (i = 0; i < LoopMax; i++) {
multodiv += add;
multomul -= minus;
blaDMul = blaDMul * multodiv;
blaDMul += bla;
blaDMul = blaDMul * multomul;
}
TimeDMul = clock() - TimeDMul;
double blaDDiv = Origbla32;
clock_t TimeDDiv = clock();
for (i = 0; i < LoopMax; i++) {
multodiv += add;
multomul -= minus;
blaDDiv = blaDDiv / multomul;
blaDDiv += bla;
blaDDiv = blaDDiv / multodiv;
}
TimeDDiv = clock() - TimeDDiv;
float blaS = Origbla32;
float divS = div;
clock_t TimeS = clock();
for (i = 0; i < LoopMax; i++) {
divS += add;
divS -= minus;
blaS = blaS / divS;
blaS += bla;
blaS = blaS * divS;
}
TimeS = clock() - TimeS;
printf("64 time: %i us (%i)\n", (int)Time64, (int)bla64);
printf("32 time: %i us (%i)\n", (int)Time32, bla32);
printf("D(mul) time: %i us (%f)\n", (int)TimeDMul, blaDMul);
printf("D(div) time: %i us (%f)\n", (int)TimeDDiv, blaDDiv);
printf("S time: %i us (%f)\n", (int)TimeS, blaS);
return 0;
}
int main(int argc, char* const argv[]) {
Run(0, 10, 0, 0); // adds and minuses 0 so it doesn't affect the math, only kills the opts
return 0;
}

There are lots of ways to perform certain arithmetic, so there might not be a single answer (shifting, fractional multiplication, actual division, some round-trip through a logarithm unit, etc; these might all have different relative costs depending on the operands and resource allocation).
Let the compiler do its thing with the program and data flow information it has.
For some data applicable to assembly on x86, you might look at: "Instruction latencies and throughput for AMD and Intel x86 processors"

What is fastest will depend entirely on the target architecture. It looks here like you're interested only in the platform you happen to be on, which guessing from your execution times seems to be 64-bit x86, either Intel (Core2?) or AMD.
That said, floating-point multiplication by the inverse will be the fastest on many platforms, but is, as you speculate, usually less accurate than a floating-point divide (two roundings instead of one -- whether or not that matters for your usage is a separate question). In general, you are better off re-arranging your algorithm to use fewer divides than you are jumping through hoops to make division as efficient as possible (the fastest division is the one you don't do), and make sure to benchmark before you spend time optimizing at all, as algorithms that bottleneck on division are few and far between.
Also, if you have integer sources and need an integer result, make sure to include the cost of conversion between integer and floating-point in your benchmarking.
Since you're interested in timings on a specific machine, you should be aware that Intel now publishes this information in their Optimization Reference Manual (pdf). Specifically, you will be interested in the tables of Appendix C section 3.1, "Latency and Throughput with Register Operands".
Be aware that integer divide timings depend strongly on the actual values involved. Based on the information in that guide, it seems that your timing routines still have a fair bit of overhead, as the performance ratios you measure don't match up with Intel's published information.

As Stephen mentioned, use the optimisation manual - but you should also be considering the use of SSE instructions. These can do 4 or 8 divisions / multiplications in a single instruction.
Also, it is fairly common for a division to be issued in a single clock cycle. The result may not be available for several clock cycles (this delay is called latency), but the next division can often begin during this time (overlapping with the first) as long as it does not need the result of the first. This is due to pipelining in the CPU, in the same way that you can wash more clothes while the previous load is still drying.
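As a hedged illustration of that latency/throughput distinction (this example is not from the answer, and how much the chains actually overlap depends on whether the target CPU's divide unit is pipelined):

// Every divide here depends on the previous result, so the loop runs at the
// divider's full latency per iteration:
double dependent_chain(double x, int n) {
    for (int i = 0; i < n; ++i)
        x /= 1.0000001;
    return x;
}

// Four independent chains give the hardware a chance to overlap divides (to
// whatever extent the divide unit is pipelined) and the surrounding loop
// bookkeeping:
double independent_chains(double x, int n) {
    double a = x, b = x + 1.0, c = x + 2.0, d = x + 3.0;
    for (int i = 0; i < n; i += 4) {
        a /= 1.0000001; b /= 1.0000001; c /= 1.0000001; d /= 1.0000001;
    }
    return a + b + c + d;
}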
Multiplying to divide is a common trick, and should be used wherever your divisor changes infrequently.
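A minimal sketch of the trick, assuming the divisor is loop-invariant (the function and names are illustrative, not from the answer):

#include <cstddef>

// Pay for one division up front, then multiply per element. The result can
// differ from a true divide in the last bit or so, because two roundings are
// involved instead of one.
void scale_down(double* data, std::size_t n, double divisor) {
    const double inv = 1.0 / divisor;   // hoisted: one divide for the whole loop
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= inv;                 // multiply instead of divide per element
}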
There is a very good chance that you will spend time and effort making the maths fast, only to discover that it is the speed of memory access (as you navigate the input and write the output) that limits your final implementation.

I wrote a flawed test to do this on MSVC 2008
double i32Time = GetTime();
{
volatile __int32 i = 4;
__int32 count = 0;
__int32 max = 1000000;
while( count < max )
{
i /= 61;
count++;
}
}
i32Time = GetTime() - i32Time;
double i64Time = GetTime();
{
volatile __int64 i = 4;
__int32 count = 0;
__int32 max = 1000000;
while( count < max )
{
i /= 61;
count++;
}
}
i64Time = GetTime() - i64Time;
double fTime = GetTime();
{
volatile float i = 4;
__int32 count = 0;
__int32 max = 1000000;
while( count < max )
{
i /= 4.0f;
count++;
}
}
fTime = GetTime() - fTime;
double fmTime = GetTime();
{
volatile float i = 4;
const float div = 1.0f / 4.0f;
__int32 count = 0;
__int32 max = 1000000;
while( count < max )
{
i *= div;
count++;
}
}
fmTime = GetTime() - fmTime;
double dTime = GetTime();
{
volatile double i = 4;
__int32 count = 0;
__int32 max = 1000000;
while( count < max )
{
i /= 4.0f;
count++;
}
}
dTime = GetTime() - dTime;
double dmTime = GetTime();
{
volatile double i = 4;
const double div = 1.0f / 4.0f;
__int32 count = 0;
__int32 max = 1000000;
while( count < max )
{
i *= div;
count++;
}
}
dmTime = GetTime() - dmTime;
DebugOutput( _T( "%f\n" ), i32Time );
DebugOutput( _T( "%f\n" ), i64Time );
DebugOutput( _T( "%f\n" ), fTime );
DebugOutput( _T( "%f\n" ), fmTime );
DebugOutput( _T( "%f\n" ), dTime );
DebugOutput( _T( "%f\n" ), dmTime );
DebugBreak();
I then ran it on an AMD64 Turion 64 in 32-bit mode. The results I got were as follows:
0.006622
0.054654
0.006283
0.006353
0.006203
0.006161
The reason the test is flawed is the usage of volatile, which forces the compiler to re-load the variable from memory just in case it has changed. All in all, it shows there is precious little difference between any of the implementations on this machine (__int64 is obviously slow).
It also categorically shows that the MSVC compiler performs the multiply-by-reciprocal optimisation. I imagine GCC does the same if not better. If I change the float and double division checks to divide by "i" then the time increases significantly. Though, while a lot of that could be the re-loading from memory, it is obvious the compiler can't optimise that away so easily.
To understand such micro-optimisations try reading this pdf.
All in all, I'd argue that if you are worrying about such things you obviously haven't profiled your code. Profile and fix the problems as and when they actually ARE a problem.

Agner Fog has done some pretty detailed measurements himself, which can be found here. If you're really trying to optimize stuff, you should read the rest of the documents from his software optimization resources as well.
I would point out that, even if you are measuring non-vectorized floating-point operations, the compiler has two options for the generated assembly: it can use the FPU instructions (fadd, fmul) or it can use SSE instructions while still manipulating one floating-point value per instruction (addss, mulss). In my experience the SSE instructions are faster and introduce less inaccuracy, but compilers don't make them the default because that could break compatibility with code that relies on the old behavior. You can turn them on in gcc with the -mfpmath=sse flag.
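As a hedged illustration (not part of the answer; the flags shown apply to 32-bit x86 GCC, where the x87 FPU is still the default, whereas on x86-64 scalar SSE math is already the default, and the exact instructions depend on the compiler version):

// The same scalar function can be compiled to either unit:
//
//   g++ -m32 -O2 -mfpmath=387        -> fmul / fadd   (x87 FPU)
//   g++ -m32 -O2 -msse2 -mfpmath=sse -> mulsd / addsd (scalar SSE)
double scale_and_offset(double x, double a, double b) {
    return a * x + b;   // one multiply and one add, on whichever unit is chosen
}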

Related

Are floating point operations faster than int operations?

So I'm doing a little benchmark measuring operations per second for different operator/type combinations in C++, and now I'm stuck. My tests for +/int and +/float look like
int int_plus(){
int x = 1;
for (int i = 0; i < NUM_ITERS; ++i){
x += i;
}
return x;
}
float float_plus(){
float x = 1.0;
for (int i = 0; i < NUM_ITERS; ++i){
x += i;
}
return x;
}
And time measurement looks like
//same for float
start = chrono::high_resolution_clock::now();
int res_int = int_plus();
end = chrono::high_resolution_clock::now();
diff = end - start;
ops_per_sec = NUM_ITERS / (diff.count() / 1000);
When I run tests I get
3.65606e+08 ops per second for int_plus
3.98838e+08 ops per second for float_plus
But as I understand it, float operations are always slower than int operations, yet my tests show a higher value for the float type.
So here is the question: am I wrong, or is there something wrong with the code? Or maybe something else?
There are a few things that could be going on. Optimization can be part of it. Using a #define constant could be letting the compiler do who-knows-what.
Note that the loop code is also being counted. Now, that's a constant for both loops, but it's part of your time, and that means you're doing a lot of int operations, not just 1 * NUM_ITERS.
If NUM_ITERS is relatively small, then the execution time is going to be very low, and that means the overhead of a method call probably dwarfs the cost of the operations inside the method.
Optimization level will also matter.
I'm not sure what else.
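A hedged sketch of how those points might be addressed (NUM_ITERS and the function names here are illustrative, not taken from the question): make the iteration count large and opaque to the compiler, time the whole loop, and actually use the result.

#include <chrono>
#include <cstdio>

volatile long NUM_ITERS = 500000000L;   // runtime-visible so the loop cannot be folded away

template <typename T>
double ops_per_second() {
    const long iters = NUM_ITERS;
    T x = 1;
    auto start = std::chrono::high_resolution_clock::now();
    for (long i = 0; i < iters; ++i) x += static_cast<T>(i);
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> diff = end - start;
    std::printf("(checksum %f)\n", static_cast<double>(x));  // keep the result alive
    return iters / diff.count();
}

int main() {
    // unsigned rather than int so the inevitable wrap-around is well defined
    std::printf("unsigned: %.3e ops/s\n", ops_per_second<unsigned>());
    std::printf("float:    %.3e ops/s\n", ops_per_second<float>());
}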

Why do I get a larger relative speedup vs. scalar from SIMD intrinsics with larger arrays?

I want to learn SIMD programming, and now I have run into something interesting in my code.
I just want to measure how long my code takes to run. I apply a simple function to an array of a particular size.
First I use a function written with SIMD instructions, and after that I use the usual approach, and I compare the times of these two implementations of the same function.
I defined performance like (time without sse) / (time using sse).
But when my size is 8 the performance is 1.3; when size = 512 the performance is 3; when size = 1000 the performance is 4; and when size = 4000 the performance is 5.
I don't understand why my performance is increasing when size of array is increasing.
My code
void init(double* v, size_t size) {
for (int i = 0; i < size; ++i) {
v[i] = i / 10.0;
}
}
void sub_func_sse(double* v, int start_idx) {
__m256d vector = _mm256_loadu_pd(v + start_idx);
__m256d base = _mm256_set_pd(2.0, 2.0, 2.0, 2.0);
for (int i = 0; i < 128; ++i) {
vector = _mm256_mul_pd(vector, base);
}
_mm256_storeu_pd(v + start_idx, vector);
}
void sub_func(double& item) {
for (int k = 0; k < 128; ++k) {
item *= 2.0;
}
}
int main() {
const size_t size = 8;
double* v = new double[size];
init(v, size);
const int num_repeat = 2000;//I should repeat my measuraments
//because I want to get average time - it is more clear information
double total_time_sse = 0;
for (int p = 0; p < num_repeat; ++p) {
init(v, size);
TimerHc t;
t.restart();
for (int i = 0; i < size; i += 8) {
sub_func_sse(v, i);
}
total_time_sse += t.toc();
}
double total_time = 0;
for (int p = 0; p < num_repeat; ++p) {
init(v, size);
TimerHc t;
t.restart();
for (int i = 0; i < size; ++i) {
sub_func(v[i]);
}
total_time += t.toc();
}
std::cout << "time using sse = " << total_time_sse / num_repeat << std::endl <<
"time without sse = " << total_time / num_repeat << std::endl;
system("pause");
}
I defined performance like (time without sse) / (time using sse).
What you measure is speedup.
The speedup you can expect from applying parallelization is modelled by Amdahl's law. It relates the savings in the parts that can be made faster (by parallelization or other means) to the total speedup. Amdahl's law can be rather intimidating, because it basically says that making parts faster does not always yield a proportionate total speedup; the limit on achievable speedup is determined by the relative fraction of the workload that can be parallelized.
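For reference, the usual statement of the law (the formula is standard, not quoted from the answer): if a fraction p of the work can be sped up by a factor s, the overall speedup is S = 1 / ((1 - p) + p/s), so even as s grows without bound the total speedup is capped at 1 / (1 - p).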
Gustafson's law takes a different point of view. In a nutshell, it states that you just have to increase the workload to make efficient use of parallelization. A larger total workload typically suffers relatively less from parallelization overhead and from the non-parallel part of the computation, and hence (according to Amdahl's law) allows more efficient use of parallelism.
...and in some sense, that's what you are observing here. The bigger your array, the more impact parallelization has.
PS: This is just some handwaving to explain why the effect you see is not too surprising. Luckily there is another answer which addresses your specific benchmark in more detail.
You're probably a victim of CPU frequency scaling; for stable results you should disable dynamic frequency scaling and turbo boost, or at least warm up the CPU before starting the measurement.
Since you start by measuring SSE performance and then proceed to regular performance, the CPU frequency is low in the beginning, so SSE performance appears worse.
Having said that, there are a few other issues with your approach:
The overhead of the high_resolution_clock::now() calls is high compared to the work being measured; move the time measurement outside the for (..num_repeat) loop, i.e. time the entire loop rather than individual iterations (then optionally divide the measured time by the number of iterations) - see the sketch after this list.
The results of the computation are never used; the compiler is free to optimize the work out entirely. Make sure to "use" the result, e.g. by printing it.
It is quite inefficient to multiply a double by 2.0. Indeed, the non-SSE version is optimized to an ADD instead (item *= 2.0 ==> vaddsd xmm0, xmm0, xmm0). So your hand-made SSE version is losing out.
An optimizing compiler will probably auto-vectorize your non-SSE code. To be sure, always check the generated assembly. Link to godbolt
Use a benchmarking framework like Google Benchmark; it will help you avoid many pitfalls associated with code benchmarking.
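A hedged sketch of that measurement structure (the names are illustrative, TimerHc is replaced by std::chrono directly, and this is not the question's code):

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// Time the whole repeat loop once, divide afterwards, and consume the output
// so the compiler cannot discard the work. "kernel" stands in for either the
// SSE or the scalar version from the question.
double bench(void (*kernel)(double*, std::size_t), std::vector<double>& v, int repeats) {
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int p = 0; p < repeats; ++p) {
        kernel(v.data(), v.size());
    }
    auto t1 = std::chrono::high_resolution_clock::now();

    double checksum = 0.0;
    for (double x : v) checksum += x;                     // "use" the result
    std::cout << "checksum " << checksum << '\n';

    return std::chrono::duration<double>(t1 - t0).count() / repeats;  // seconds per repeat
}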

What is the point implementing custom math functions in C++ (like SQRT)?

In the past few weeks I have been wondering why people try to re-invent the wheel and spend hours writing their own sqrt function, for example. The built-in version is well optimized, precise, and stable enough.
I am speaking about the Carmack-style square root, for example. What is the point? It loses precision during the approximation and it uses casting.
The Intel-style SSE square root gave precise results, but in my measurements it was slower than the standard SQRT.
On average, all the above tricks were beaten by far by the standard SQRT. So my question is, what is the point?
My PC has the below CPU:
Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz.
I've got the following results for each method (I've fixed the performance test according to the below suggestion by the helpful comment, thanks for that n.m.):
(Please keep in mind that if you're using approximation like Newton method, you'll lose precision so you must align your calculation accordingly.)
You can find the source code below for reference.
#include <chrono>
#include <cmath>
#include <deque>
#include <iomanip>
#include <iostream>
#include <immintrin.h>
#include <random>
using f64 = double;
using s64 = int64_t;
using u64 = uint64_t;
static constexpr u64 cycles = 24;
static constexpr u64 sample_max = 1000000;
f64 sse_sqrt(const f64 x) {
__m128d root = _mm_sqrt_pd(_mm_set_sd(x)); // load the scalar safely instead of reading 16 bytes from &x
return _mm_cvtsd_f64(root);
}
f64 carmack_sqrt(const f64 x) {
union {
f64 x;
s64 i;
} u = {};
u.x = x;
u.i = 0x5fe6eb50c7b537a9 - (u.i >> 1);
f64 xhalf = 0.5 * x;
u.x = u.x * (1.5 - xhalf * u.x * u.x);
// u.x = u.x * (1.5 - xhalf * u.x * u.x);
// u.x = u.x * (1.5 - xhalf * u.x * u.x);
// ... and so on, if you want a more precise result ...
return u.x * x;
}
int main(int /* argc */, char ** /*argv*/) {
std::random_device r;
std::default_random_engine e(r());
std::uniform_real_distribution<f64> dist(1, sample_max);
std::deque<f64> samples(sample_max);
for (auto& sample : samples) {
sample = dist(e);
}
// std sqrt
{
std::cout << "> Measuring std sqrt.\r\n> Please wait . . .\r\n";
f64 result = 0;
auto t1 = std::chrono::high_resolution_clock::now();
for (auto cycle = 0; cycle < cycles; ++cycle) {
for (auto& sample : samples) {
result += std::sqrt(static_cast<f64>(sample));
}
}
auto t2 = std::chrono::high_resolution_clock::now();
auto dt = t2 - t1;
std::cout << "> Accumulated result: " << std::setprecision(19) << result << "\n";
std::cout << "> Total execution time: " <<
std::chrono::duration_cast<std::chrono::milliseconds>(dt).count() << " ms.\r\n\r\n";
}
// sse sqrt
{
std::cout << "> Measuring sse sqrt.\r\n> Please wait . . .\r\n";
f64 result = 0;
auto t1 = std::chrono::high_resolution_clock::now();
for (auto cycle = 0; cycle < cycles; ++cycle) {
for (auto& sample : samples) {
result += sse_sqrt(static_cast<f64>(sample));
}
}
auto t2 = std::chrono::high_resolution_clock::now();
auto dt = t2 - t1;
std::cout << "> Accumulated result: " << std::setprecision(19) << result << "\n";
std::cout << "> Total execution time: " <<
std::chrono::duration_cast<std::chrono::milliseconds>(dt).count() << " ms.\r\n\r\n";
}
// carmack sqrt
{
std::cout << "> Measuring carmack sqrt.\r\n> Please wait . . .\r\n";
f64 result = 0;
auto t1 = std::chrono::high_resolution_clock::now();
for (auto cycle = 0; cycle < cycles; ++cycle) {
for (auto& sample : samples) {
result += carmack_sqrt(static_cast<f64>(sample));
}
}
auto t2 = std::chrono::high_resolution_clock::now();
auto dt = t2 - t1;
std::cout << "> Accumulated result: " << std::setprecision(19) << result << "\n";
std::cout << "> Total execution time: " <<
std::chrono::duration_cast<std::chrono::milliseconds>(dt).count() << " ms.\r\n\r\n";
}
std::cout << "> Press any key to exit . . .\r\n";
std::getchar();
return 0;
}
Please note that I am not here for criticizing anybody, I am just here for learning, experimenting and trying to figure out my own method and the best toolset to choose from.
I am writing my own game engine to one of my portfolio. I am appreciating your kind answers and I am open for any suggestions.
Have a nice day.
That fast reciprocal square root trick is mostly obsolete. SSE's built-in approximate reciprocal square root, which has existed since the Pentium III, has completely replaced it on the PC platform. Other platforms usually have their own reciprocal square root; for example, ARM has VRSQRTE and a handy instruction that does the Newton step too.
By the way, turning the result into a non-reciprocal square root usually makes it less useful: the primary use case is normalizing a vector, where a "straight" square root is awkward (you would have to divide by it) while a reciprocal square root fits exactly (it becomes a multiply).
As often, your benchmark isn't quite accurate. I happen to have done some relevant tests a while ago, where the relevant parts look like this:
std::sqrt based:
HMM_INLINE float HMM_LengthVec4(hmm_vec4 A)
{
float Result = std::sqrt(HMM_LengthSquaredVec4(A));
return(Result);
}
HMM_INLINE hmm_vec4 HMM_NormalizeVec4(hmm_vec4 A)
{
hmm_vec4 Result = {0};
float VectorLength = HMM_LengthVec4(A);
/* NOTE(kiljacken): We need a zero check to not divide-by-zero */
if (VectorLength != 0.0f)
{
float Multiplier = 1.0f / VectorLength;
#ifdef HANDMADE_MATH__USE_SSE
__m128 SSEMultiplier = _mm_set1_ps(Multiplier);
Result.InternalElementsSSE = _mm_mul_ps(A.InternalElementsSSE, SSEMultiplier);
#else
Result.X = A.X * Multiplier;
Result.Y = A.Y * Multiplier;
Result.Z = A.Z * Multiplier;
Result.W = A.W * Multiplier;
#endif
}
return (Result);
}
SSE reciprocal square root plus Newton step:
HMM_INLINE hmm_vec4 HMM_NormalizeVec4_new(hmm_vec4 A)
{
hmm_vec4 Result;
// square elements and add them together, result is in every lane
__m128 t0 = _mm_mul_ps(A.InternalElementsSSE, A.InternalElementsSSE);
__m128 t1 = _mm_add_ps(t0, _mm_shuffle_ps(t0, t0, _MM_SHUFFLE(2, 3, 0, 1)));
__m128 sq = _mm_add_ps(t1, _mm_shuffle_ps(t1, t1, _MM_SHUFFLE(0, 1, 2, 3)));
// compute reciprocal square root with Newton step for ~22bit accuracy
__m128 rLen = _mm_rsqrt_ps(sq);
__m128 half = _mm_set1_ps(0.5);
__m128 threehalf = _mm_set1_ps(1.5);
__m128 t = _mm_mul_ps(_mm_mul_ps(sq, half), _mm_mul_ps(rLen, rLen));
rLen = _mm_mul_ps(rLen, _mm_sub_ps(threehalf, t));
// multiply elements by the reciprocal of the vector length
__m128 normed = _mm_mul_ps(A.InternalElementsSSE, rLen);
// normalize zero-vector to zero, not to NaN
__m128 zero = _mm_setzero_ps();
Result.InternalElementsSSE = _mm_andnot_ps(_mm_cmpeq_ps(A.InternalElementsSSE, zero), normed);
return (Result);
}
SSE reciprocal square root without Newton step:
HMM_INLINE hmm_vec4 HMM_NormalizeVec4_lowacc(hmm_vec4 A)
{
hmm_vec4 Result;
// square elements and add them together, result is in every lane
__m128 t0 = _mm_mul_ps(A.InternalElementsSSE, A.InternalElementsSSE);
__m128 t1 = _mm_add_ps(t0, _mm_shuffle_ps(t0, t0, _MM_SHUFFLE(2, 3, 0, 1)));
__m128 sq = _mm_add_ps(t1, _mm_shuffle_ps(t1, t1, _MM_SHUFFLE(0, 1, 2, 3)));
// compute reciprocal square root without Newton step for ~12bit accuracy
__m128 rLen = _mm_rsqrt_ps(sq);
// multiply elements by the reciprocal of the vector length
__m128 normed = _mm_mul_ps(A.InternalElementsSSE, rLen);
// normalize zero-vector to zero, not to NaN
__m128 zero = _mm_setzero_ps();
Result.InternalElementsSSE = _mm_andnot_ps(_mm_cmpeq_ps(A.InternalElementsSSE, zero), normed);
return (Result);
}
(quick-bench)
As you can see I measured throughput and latency separately, and the distinction mattered a lot. Reciprocal square root with a Newton step takes a long time, about as long as using a normal square root, but can be processed at a higher throughput. Without Newton step, a single vector-normalize operation takes less time from start to finish too, and the throughput becomes even better than before. Anyway this should demonstrate that there is some point to doing something about your square roots.
By the way the above code is not meant to be Good Practice, that would be normalizing 4 vectors simultaneously, so as to not waste a 4-wide SIMD operation on calculating a single (reciprocal) square root. That's not really the issue here though.
For fun and profit?
Based on your question, there is no reason to do so, but if you want to learn a language, it is recommended to solve mathematical problems, because they rely on integers/floats (which are primitives in (mostly) any language) and the algorithms are well documented.
In "real" code, one should use the methods provided by libc as long as you have one. Embedded platforms usually lack a libc or roll their own, so you have to implement your own.
The point of Carmack's technique was to extract better performance from integer operations than could be obtained from floating-point operations in the 1990s. Since that time, floating-point performance has improved massively, as you can see in your own benchmarks. There would be no practical reason to use that technique in new code unless you faced a similar constraint in your hardware, which the i7 doesn't.
Limitation of standard library usage due to scope of project
What is the point implementing custom math functions in C++ (like SQRT)?
In addition to what has already been mentioned in the other answers, the choice of a project implementing their own (custom) math functions could be due to:
Limitations placed on the project, e.g. due to some standard, to not make use of the math library shipped with the compiler.
One example could be an ASIL-classified project adhering to the ISO 26262 standard that uses a compiler which is adequately qualified for correctly compiling the project's source code, but whose shipped standard library is not adequately qualified. The math library might, for example, only be available as object code rather than source code (for the latter, the project could write its own tests and source-code qualification).

Improving computation speed for sine/cosine and large arrays

for signal processing I need to compute relatively large C arrays as shown in the code part below. This is working fine so far; unfortunately, the implementation is slow. The size of "calibdata" is around 150k and it needs to be calculated for different frequencies/phases. Is there a way to improve speed significantly? Doing the same with logical indexing in MATLAB is way faster.
What I tried already:
using a Taylor approximation of sine: no significant improvement.
using std::vector: also no significant improvement.
code:
double phase_func(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier){
double result = 0.0; // accumulator declaration was missing in the original snippet
for (int i = 0; i < size; i++)
result += calibdata[i] * cos((2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180) - (PI / 2)));
result = fabs(result / size);
return result;
}
Best regards,
Thomas
When optimizing code for speed, step 1 is to enable compiler optimizations. I hope you've done that already.
Step 2 is to profile the code and see exactly how the time is being spent. Without profiling, you're just guessing, and you could end up trying to optimize the wrong thing.
For example, your guess seems to be that the cos function is the bottleneck. But the other possibility is that the calculation of the angle is the bottleneck. Here's how I would refactor the code to reduce the time spent calculating the angle.
double phase_func(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier)
{
double result = 0;
double angle = phase * (PI / 180) - (PI / 2);
double delta = 2 * PI * freqscale[currentcarrier] / fs;
for (int i = 0; i < size; i++)
{
result += calibdata[i] * cos( angle );
angle += delta;
}
return fabs(result / size);
}
Okay, I'm probably going to get flogged for this answer, but I would use the GPU for this. Because your array doesn't appear to be self-referential, the best speedup you're going to get for large arrays is through parallelization... by far. I don't use MATLAB, but I just did a quick search for GPU utilization on the MathWorks site:
http://www.mathworks.com/company/newsletters/articles/gpu-programming-in-matlab.html?requestedDomain=www.mathworks.com
Outside of MATLAB you could use OpenCL or CUDA yourself.
Your enemies in execution time are:
Division
Function calls (including implicit ones in loops)
Accessing data from different areas
Executing dissimilar instructions
You should research data-driven programming and how to use the data cache effectively.
Division
Whether it has hardware support or software support, division takes a long time by its very nature. Eliminate it if possible, by changing the numeric base or by factoring the division out of the loop, as in the sketch below.
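A hedged illustration of factoring a division out of a loop (not code from the question): computing a mean with one division instead of one per element.

#include <cstddef>

double mean_slow(const double* a, std::size_t n) {
    double m = 0.0;
    for (std::size_t i = 0; i < n; ++i) m += a[i] / n;  // n divisions
    return m;
}

double mean_fast(const double* a, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += a[i];      // additions only
    return s / n;                                       // a single division
}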
Function Calls
The most efficient method of execution is sequential. Processors are optimized for this. A branch may require the processor to perform some additional calculation (branch prediction) or to reload the instruction cache/pipeline: a waste of time that could be spent executing data instructions.
The optimization for this is to use techniques like loop unrolling and inlining of small functions. Also reduce the quantity of branches by simplifying expressions and using Boolean algebra.
Accessing data from different areas
Modern processors are optimized to operate on local data (data in one area). One example is loading an internal cache with data. Specifically, loading a cache line with data. For example, if the data from your arrays is in one location and the cosine data in another, this may cause the data cache to be reloaded, again wasting time.
A better solution is to place all data contiguously or to contiguously access all the data. Rather than making many discontiguous accesses to the cosine table, look up a batch of cosine values sequentially (without any other data accesses between).
Dissimilar Instructions
Modern processors are more efficient at processing a batch of similar instructions. For example the pattern load, add, store is more efficient for blocks when all the loading is performed, then all adding, then all storing.
Summary
Here's an example:
register double result = 0.0;
register unsigned int i = 0U;
for (i = 0; i < size; i += 2)
{
register double cos_angle1 = /* ... */;
register double cos_angle2 = /* ... */;
result += calibdata[i + 0] * cos_angle1;
result += calibdata[i + 1] * cos_angle2;
}
The above loop is unrolled and like operations are performed in groups.
Although the keyword register may be deprecated, it is a suggestion to the compiler to use dedicated registers (if possible).
You can try to use the definition of cosine based on the complex exponential: cos(x) = Re(exp(j*x)) = (exp(j*x) + exp(-j*x)) / 2, where j^2 = -1.
Store exp((2 * PI*freqscale[currentcarrier] / fs)*j) and exp(phase*j). Evaluating cos(...) then reduces to a couple of products and additions in the for loops, and sin(), cos() and exp() are only called a couple of times.
Here goes the implementation:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <complex.h>
#include <time.h>
#define PI 3.141592653589
typedef struct cos_plan{
double complex* expo;
int size;
}cos_plan;
double phase_func(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier){
double result=0; //initialization
for (int i = 0; i < size; i++){
result += calibdata[i] * cos ( (2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180.) - (PI / 2.)) );
//printf("i %d cos %g\n",i,cos ( (2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180.) - (PI / 2.)) ));
}
result = fabs(result / size);
return result;
}
double phase_func2(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier, cos_plan* plan){
//first, let's compute the exponentials:
//double complex phaseexp=cos(phase*(PI / 180.) - (PI / 2.))+sin(phase*(PI / 180.) - (PI / 2.))*I;
//double complex phaseexpm=conj(phaseexp);
double phasesin=sin(phase*(PI / 180.) - (PI / 2.));
double phasecos=cos(phase*(PI / 180.) - (PI / 2.));
if (plan->size<size){
double complex *tmp=realloc(plan->expo,size*sizeof(double complex));
if(tmp==NULL){fprintf(stderr,"realloc failed\n");exit(1);}
plan->expo=tmp;
plan->size=size;
}
plan->expo[0]=1;
//plan->expo[1]=exp(2 *I* PI*freqscale[currentcarrier]/fs);
plan->expo[1]=cos(2 * PI*freqscale[currentcarrier]/fs)+sin(2 * PI*freqscale[currentcarrier]/fs)*I;
//printf("%g %g\n",creall(plan->expo[1]),cimagl(plan->expo[1]));
for(int i=2;i<size;i++){
if(i%2==0){
plan->expo[i]=plan->expo[i/2]*plan->expo[i/2];
}else{
plan->expo[i]=plan->expo[i/2]*plan->expo[i/2+1];
}
}
//computing the result
double result=0; //initialization
for(int i=0;i<size;i++){
//double coss=0.5*creall(plan->expo[i]*phaseexp+conj(plan->expo[i])*phaseexpm);
double coss=creall(plan->expo[i])*phasecos-cimagl(plan->expo[i])*phasesin;
//printf("i %d cos %g\n",i,coss);
result+=calibdata[i] *coss;
}
result = fabs(result / size);
return result;
}
int main(){
//the parameters
long n=100000000;
double* calibdata=malloc(n*sizeof(double));
if(calibdata==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
int freqnb=42;
double* freqscale=malloc(freqnb*sizeof(double));
if(freqscale==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
for (int i = 0; i < freqnb; i++){
freqscale[i]=i*i*0.007+i;
}
double fs=n;
double phase=0.05;
//populate calibdata
for (int i = 0; i < n; i++){
calibdata[i]=i/((double)n);
calibdata[i]=calibdata[i]*calibdata[i]-calibdata[i]+0.007/(calibdata[i]+3.0);
}
//call to sample code
clock_t t;
t = clock();
double res=phase_func(calibdata,n, freqscale, fs, phase, 13);
t = clock() - t;
printf("first call got %g in %g seconds.\n",res,((float)t)/CLOCKS_PER_SEC);
//initialize
cos_plan plan;
plan.expo=malloc(n*sizeof(double complex));
plan.size=n;
t = clock();
res=phase_func2(calibdata,n, freqscale, fs, phase, 13,&plan);
t = clock() - t;
printf("second call got %g in %g seconds.\n",res,((float)t)/CLOCKS_PER_SEC);
//cleaning
free(plan.expo);
free(calibdata);
free(freqscale);
return 0;
}
Compile with gcc main.c -o main -std=c99 -lm -Wall -O3. Using the code you provided, it takes 8 seconds with size=100000000 on my computer, while the execution time of the proposed solution is 1.5 seconds... It is not so impressive, but it is not negligible.
The presented solution does not involve any call to cos or sin in the for loops. Indeed, there are only multiplications and additions. The bottleneck is either the memory bandwidth or the tests and memory accesses in the exponentiation by squaring (most likely the first issue, since I had to use an additional array of complex numbers).
For complex number in c, see:
How to work with complex numbers in C?
Computing e^(-j) in C
If the problem is memory bandwidth, then parallelism is required... and directly computing cos would be easier. Additional simplifications could have been performed if freqscale[currentcarrier] / fs were an integer. Your problem is really close to the computation of a Discrete Cosine Transform; the present trick is close to the Discrete Fourier Transform, and the FFTW library is really good at computing these transforms.
Notice that the present code can produce inaccurate results due to loss of significance: result can be much larger than cos(...)*calibdata[] when size is large. Using partial sums can resolve the issue.
Use a simple trig identity to eliminate the - (PI / 2). This is also more accurate than attempting the subtraction, which uses the machine value of PI; this matters when values are near PI/2.
cosine(x - PI/2) == sine(x) (expand with the angle-sum formula: cos(x)*cos(PI/2) + sin(x)*sin(PI/2) = sin(x); the code below accumulates with -=, which only flips the overall sign and is absorbed by the final fabs()).
Use of const and restrict: Good compilers can perform more optimizations with this knowledge. (See also #user3528438)
// double phase_func(double* calibdata, long size,
// double* freqscale, double fs, double phase, int currentcarrier) {
double phase_func(const double* restrict calibdata, long size,
const double* restrict freqscale, double fs, double phase, int currentcarrier) {
Some platforms perform faster calculations with float vs double with a tolerable loss of precision. YMMV. Profile code both ways.
// result += calibdata[i] * cos(...
result += calibdata[i] * cosf(...
Minimize recalculations.
double angle_delta = ...;
double angle_current = ...;
for (int i = 0; i < size; i++) {
result += calibdata[i] * cos(angle_current);
angle_current += angle_delta;
}
Unclear why the code uses long size and int currentcarrier. I'd expect the same type, and would use size_t, which is idiomatic for array indexing. #Daniel Jour
Reversing loops can allow a compare to 0 rather than compare to variable. Sometimes a modest performance gain.
Ensure compiler optimizations are enabled.
All together
double phase_func2(const double* restrict calibdata, size_t size,
const double* restrict freqscale, double fs, double phase,
size_t currentcarrier) {
double result = 0.0;
double angle_delta = 2.0 * PI * freqscale[currentcarrier] / fs;
double angle_current = angle_delta * (size - 1) + phase * (PI / 180);
size_t i = size;
while (i) {
result -= calibdata[--i] * sinf(angle_current);
angle_current -= angle_delta;
}
result = fabs(result / size);
return result;
}
Leveraging the cores you have, without resorting to the GPU, use OpenMP. Testing with VS2015, the invariants are lifted out of the loop by the optimizer. Enabling AVX2 and OpenMP.
double phase_func3(double* calibdata, const int size, const double* freqscale,
const double fs, const double phase, const size_t currentcarrier)
{
double result{};
constexpr double PI = 3.141592653589;
#pragma omp parallel
#pragma omp for reduction(+: result)
for (int i = 0; i < size; ++i) {
result += calibdata[i] *
cos( (2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180.0) - (PI / 2.0)));
}
result = fabs(result / size);
return result;
}
The original version with AVX enabled took: ~1.4 seconds
and adding OpenMP brought it down to: ~0.51 seconds.
Pretty nice return for two pragmas and a compiler switch.

Fast percentile in C++

My program calculates a Monte Carlo simulation for the value-at-risk metric. To simplify as much as possible, I have:
1/ simulated daily cashflows
2/ to get a sample of a possible 1-year cashflow,
I need to draw 365 random daily cashflows and sum them
Hence, the daily cashflows are an empirically given distribution function to be sampled 365 times. For this, I
1/ sort the daily cashflows into an array called this->distro
2/ calculate 365 percentiles corresponding to random probabilities
I need to do this simulation of a yearly cashflow, say, 10K times to get a population of simulated yearly cashflows to work with. Having the distribution function of daily cashflows prepared, I do the sampling like...
for ( unsigned int idxSim = 0; idxSim < _g.xSimulationCount; idxSim++ )
{
generatedVal = 0.0;
for ( register unsigned int idxDay = 0; idxDay < 365; idxDay ++ )
{
prob = (FLT_TYPE)fastrand(); // prob [0,1]
dIdx = prob * dMaxDistroIndex; // scale prob to distro function size
// to get an index into distro array
_floor = ((FLT_TYPE)(long)dIdx); // fast version of floor
_ceil = _floor + 1.0f; // 'fast' ceil:)
iIdx1 = (unsigned int)( _floor );
iIdx2 = iIdx1 + 1;
// interpolation per se
generatedVal += this->distro[iIdx1]*(_ceil - dIdx );
generatedVal += this->distro[iIdx2]*(dIdx - _floor);
}
this->yearlyCashflows[idxSim] = generatedVal ;
}
The code inside both for loops does linear interpolation. If, say, USD 1000 corresponds to prob=0.01 and USD 10000 corresponds to prob=0.1, then if I don't have an empirical number for p=0.05 I want to get USD 5000 by interpolation.
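A hedged, self-contained version of that interpolation step (the function name and signature are illustrative, not from the program): a fractional index selects two neighbouring table entries and blends them linearly, so an index of 4.5 returns the value halfway between entry 4 and entry 5.

#include <cmath>

double interp_percentile(const double* distro, double dIdx) {
    const double fl   = std::floor(dIdx);
    const unsigned i  = static_cast<unsigned>(fl);
    const double frac = dIdx - fl;                        // fractional part in [0, 1)
    return distro[i] * (1.0 - frac) + distro[i + 1] * frac;
}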
The question: this code runs correctly, though the profiler says that the program spends about 60% of its runtime on the interpolation per se. So my question is: how can I make this task faster? Sample runtimes reported by VTune for specific lines are as follows:
prob = (FLT_TYPE)fastrand(); // 0.727s
dIdx = prob * dMaxDistroIndex; // 1.435s
_floor = ((FLT_TYPE)(long)dIdx); // 0.718s
_ceil = _floor + 1.0f; // -
iIdx1 = (unsigned int)( _floor ); // 4.949s
iIdx2 = iIdx1 + 1; // -
// interpolation per se
generatedVal += this->distro[iIdx1]*(_ceil - dIdx ); // -
generatedVal += this->distro[iIdx2]*(dIdx - _floor); // 12.704s
Dashes mean the profiler reports no runtimes for those lines.
Any hint will be greatly appreciated.
Daniel
EDIT:
Both c.fogelklou and MSalters have pointed out great enhancements. The best code in line with what c.fogelklou said is
converter = distroDimension / ((FLT_TYPE)RAND_MAX + 1);
for ( unsigned int idxSim = 0; idxSim < _g.xSimulationCount; idxSim++ )
{
generatedVal = 0.0;
for ( register unsigned int idxDay = 0; idxDay < 365; idxDay ++ )
{
dIdx = (FLT_TYPE)fastrand() * converter;
iIdx1 = (unsigned long)dIdx;
_floor = (FLT_TYPE)iIdx1;
generatedVal += this->distro[iIdx1] + this->diffs[iIdx1] *(dIdx - _floor);
}
}
and the best I have along MSalter's lines is
normalizer = 1.0/((FLT_TYPE)RAND_MAX + 1);
for ( unsigned int idxSim = 0; idxSim < _g.xSimulationCount; idxSim++ )
{
generatedVal = 0.0;
for ( register unsigned int idxDay = 0; idxDay < 365; idxDay ++ )
{
dIdx = (FLT_TYPE)fastrand()* normalizer ;
iIdx1 = fastrand() % _g.xDayCount;
generatedVal += this->distro[iIdx1];
generatedVal += this->diffs[iIdx1]*dIdx;
}
}
The second code is approx. 30 percent faster. Now, of 95 s of total runtime, the last line consumes 68 s. The last-but-one line consumes only 3.2 s, hence the double*double multiplication must be the devil. I thought of SSE - saving the last three operands into an array and then carrying out a vector multiplication of this->diffs[i]*dIdx[i] and adding this to this->distro[i] - but this code ran 50 percent slower. Hence, I think I have hit the wall.
Many thanks to all.
D.
This is a proposal for a small optimization, removing the need for ceil, two casts, and one of the multiplies. If you are running on a fixed point processor, that would explain why the multiplies and casts between float and int are taking so long. In that case, try using fixed point optimizations or turning on floating point in your compiler if the CPU supports it!
for ( unsigned int idxSim = 0; idxSim < _g.xSimulationCount; idxSim++ )
{
generatedVal = 0.0;
for ( register unsigned int idxDay = 0; idxDay < 365; idxDay ++ )
{
prob = (FLT_TYPE)fastrand(); // prob [0,1]
dIdx = prob * dMaxDistroIndex; // scale prob to distro function size
// to get an index into distro array
iIdx1 = (long)dIdx;
_floor = (FLT_TYPE)iIdx1; // fast version of floor
iIdx2 = iIdx1 + 1;
// interpolation per se
{
const FLT_TYPE diff = this->distro[iIdx2] - this->distro[iIdx1];
const FLT_TYPE interp = this->distro[iIdx1] + diff * (dIdx - _floor);
generatedVal += interp;
}
}
this->yearlyCashflows[idxSim] = generatedVal ;
}
I would recommend to fix fastrand. Floating-point code isn't the fastest in the world, but what is especially slow is the switching between floating point and integer code. Since you need an integer index, use an integer random function.
It may even be advantageous to pre-generate all 365 random values in a loop. Since you need only log2(dMaxDistroIndex) bits of randomness per value, you may be able to reduce the number of RNG calls.
You would subsequently pick a random number between 0 and 1 for the interpolation fraction.
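A hedged sketch of the bit-reuse idea (illustrative only, assuming the table size is a power of two; none of these names come from the answer): a single 64-bit RNG draw can be sliced into several independent indices, cutting the number of RNG calls per simulated day.

#include <cstdint>
#include <random>

struct IndexSource {
    std::mt19937_64 rng;       // any 64-bit generator will do
    std::uint64_t bits = 0;
    int avail = 0;

    // kBits is assumed to satisfy tableSize == 1 << kBits
    unsigned next_index(int kBits) {
        if (avail < kBits) { bits = rng(); avail = 64; }
        unsigned idx = static_cast<unsigned>(bits & ((1u << kBits) - 1u));
        bits >>= kBits;
        avail -= kBits;
        return idx;
    }
};

// usage: IndexSource src; unsigned i = src.next_index(10);  // index into a 1024-entry table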