Are floating point operations faster than int operations? - c++

So I'm doing a little benchmark measuring operations per second for different operator/type combinations in C++, and now I'm stuck. My tests for +/int and +/float look like
int int_plus(){
int x = 1;
for (int i = 0; i < NUM_ITERS; ++i){
x += i;
}
return x;
}
float float_plus(){
float x = 1.0;
for (int i = 0; i < NUM_ITERS; ++i){
x += i;
}
return x;
}
And the time measurement looks like
//same for float
start = chrono::high_resolution_clock::now();
int res_int = int_plus();
end = chrono::high_resolution_clock::now();
diff = end - start;
ops_per_sec = NUM_ITERS / (diff.count() / 1000);
When I run tests I get
3.65606e+08 ops per second for int_plus
3.98838e+08 ops per second for float_plus
But as I understand it, float operations are always slower than int operations, yet my tests show a higher value for the float type.
So here is the question: am I wrong, or is there something wrong with the code? Or maybe something else?

There are a few things that could be going on. Optimization can be part of it. Using a #define constant could be letting the compiler do who-knows-what.
Note that the loop code is also being counted. Now, that's a constant for both loops, but it's part of your time, and that means you're doing a lot of int operations, not just 1 * NUM_ITERS.
If NUM_ITERS is relatively small, then the execution time is going to be very low, and that means the overhead of a method call probably dwarfs the cost of the operations inside the method.
Optimization level will also matter.
I'm not sure what else.
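A minimal sketch (mine, not the asker's code) of one way to reduce those effects: keep the result observably used so the compiler can't delete the loop, and compute ops/sec directly from a duration in seconds. The volatile sink is an assumption on my part.
volatile int sink_int;   // writing the result here keeps the loop from being optimized away
auto start = std::chrono::high_resolution_clock::now();
sink_int = int_plus();
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> seconds = end - start;   // count() is in seconds for this duration type
double ops_per_sec = NUM_ITERS / seconds.count();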

Related

Float rounding error in time-critical C++ loop, looking for an efficient solution

As a premise, I am aware that this problem has been addressed already, but never in this specific scenario, from what I could find searching.
In a time-critical piece of code, I have a loop where a float value x must grow linearly from exactly 0 up to and including exactly 1 in 'z' steps.
The unoptimized solution, which would work without rounding errors, is:
const int z = (some number);
int c;
float x;
for(c=0; c<z; c++)
{
x = (float)c/(float)(z-1);
// do something with x here
}
Obviously I can avoid the float conversions by using two loop variables and caching (float)(z-1):
const int z = (some number);
int c;
float xi,x;
const float fzm1 = (float)(z-1);
for(c=0,xi=0.f; c<z; c++, xi+=1.f)
{
x=xi/fzm1;
// do something with x
}
But who would ever repeat a division by a constant for every loop pass? Obviously anyone would turn it into a multiplication:
const int z = (some number);
int c;
float xi,x;
const float invzm1 = 1.f/(float)(z-1);
for(c=0,xi=0.f; c<z; c++, xi+=1.f)
{
x=xi * invzm1;
// do something with x
}
Here is where obvious rounding issues may start to manifest.
For some integer values of z, (z-1)*(1.f/(float)(z-1)) won't give exactly one but 0.999999..., so the value assumed by x in the last loop cycle won't be exactly one.
If I use an adder instead, i.e.
const int z = (some number);
int c;
float x;
const float x_adder = 1.f/(float)(z-1);
for(c=0,x=0.f; c<z; c++, x+=x_adder)
{
// do something with x
}
the situation is even worse, because the error in x_adder will build up.
So the only solution I can see is using a conditional somewhere, like:
const int z = (some number);
int c;
float xi,x;
const float invzm1 = 1.f/(float)(z-1);
for(c=0,xi=0.f; c<z; c++, xi+=1.f)
{
x = (c==z-1) ? 1.f : xi * invzm1;
// do something with x
}
but in a time-critical loop a branch should be avoided if possible!
Oh, and I can't even split the loop and do
for(c=0,xi=0.f; c<z-1; c++, xi+=1.f) // note: loop runs now up to and including z-2
{
x=xi * invzm1;
// do something with x
}
x=1.f;
// do something with x
because I would have to replicate the whole 'do something with x' block, which is neither short nor simple. I can't turn it into a function call (too many local variables to pass, so it would be inefficient), nor do I want to use #defines (that would be very poor, inelegant and impractical).
Can you figure out any efficient or smart solution to this problem?
First, a general consideration: xi += 1.f introduces a loop-carried dependency chain of however many cycles your CPU needs for floating point addition (probably 3 or 4). It also kills any attempt at vectorization unless you compile with -ffast-math. If you run on a modern super-scalar desktop CPU, I recommend using an integer counter and converting to float in each iteration.
In my opinion, avoiding int->float conversions is outdated advice from the era of x87 FPUs. Of course you have to consider the entire loop for the final verdict, but the throughput is generally comparable to that of floating point addition.
For the actual problem, we may look at what others have done, for example Eigen in the implementation of their LinSpaced operation. There is also a rather extensive discussion in their bug tracker.
Their final solution is so simple that I think it is okay to paraphrase it here, simplified for your specific case:
float step = 1.f / (n - 1);
for(int i = 0; i < n; ++i)
{
float x = (i + 1 == n) ? 1.f : i * step;
// do something with x here
}
The compiler may choose to peel off the last iteration to get rid of the branch but in general it is not too bad anyway. In scalar code branch prediction will work well. In vectorized code it's a packed compare and a blend instruction.
We may also force the decision to peel off the last iteration by restructuring the code appropriately. Lambdas are very helpful for this since they are a) convenient to use and b) very strongly inlined.
auto loop_body = [&](int i, float x) mutable {
...;
};
for(int i = 0; i < n - 1; ++i)
loop_body(i, i * step);
if(n > 0)
loop_body(n - 1, 1.f);
Checking with Godbolt (using a simple array initialization for the loop body), GCC only vectorizes the second version. Clang vectorizes both but does a better job with the second.
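To make that second version concrete, here is a self-contained sketch (my own; the name fill_ramp, the std::vector, and the array-fill body are assumptions standing in for "do something with x"):
#include <vector>
void fill_ramp(std::vector<float>& out)
{
    const int n = static_cast<int>(out.size());
    if (n == 0) return;
    if (n == 1) { out[0] = 1.f; return; }
    const float step = 1.f / (n - 1);
    auto loop_body = [&](int i, float x) { out[i] = x; };   // "do something with x"
    for (int i = 0; i < n - 1; ++i)
        loop_body(i, i * step);
    loop_body(n - 1, 1.f);   // the peeled last iteration: exactly 1, no rounding involved
}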
What you need is Bresenham's line algorithm.
It would allow you to avoid multiplications and divisions and use only additions and subtractions. Just scale your range so that it can be represented by integers, and round at the final stage if an exact split into parts is mathematically (or "representationally") impossible.
Consider using this:
const int z = (some number > 0);
const int step = 1000000/z;
float x = 0.f; // accumulator (declaration added; value is scaled, per the comment below)
for(int c=0; c<z-1; ++c)
{
x += step; //just if you really need the conversion, divide it by 1000000 when required
// do something with x
}
x = 1.f;
//do the last step with x
No conversions if you don't really need it, first and last values are as expected, multiplication is reduced to accumulation.
By changing the 1000000 you can manually control the precision.
I suggest that you start with the last alternative you have shown and use a lambda to avoid passing local variables:
auto do_something_with_x = [&](float x){ /*...*/ };
for(c=0,xi=0.f; c<z-1; c++, xi+=1.f) // note: loop runs now up to and including z-2
{
x=xi * invzm1;
do_something_with_x(x);
}
do_something_with_x(1.f);

Why do I get a larger relative speedup vs. scalar from SIMD intrinsics with larger arrays?

I want to learn SIMD programming, and now I've hit an interesting point in my code.
I just want to measure how long my code takes. I apply a simple function to an array of a particular size.
First I use a function written with SIMD instructions, and then I use the usual scalar approach, and I compare the times of these two implementations of the same function.
I defined performance as (time without sse) / (time using sse).
But when my size is 8 I get a performance of 1.3; when size = 512 I get 3; with size = 1000 it's 4; and with size = 4000 it's 5.
I don't understand why the performance increases as the size of the array increases.
My code
void init(double* v, size_t size) {
for (int i = 0; i < size; ++i) {
v[i] = i / 10.0;
}
}
void sub_func_sse(double* v, int start_idx) {
__m256d vector = _mm256_loadu_pd(v + start_idx);
__m256d base = _mm256_set_pd(2.0, 2.0, 2.0, 2.0);
for (int i = 0; i < 128; ++i) {
vector = _mm256_mul_pd(vector, base);
}
_mm256_storeu_pd(v + start_idx, vector);
}
void sub_func(double& item) {
for (int k = 0; k < 128; ++k) {
item *= 2.0;
}
}
int main() {
const size_t size = 8;
double* v = new double[size];
init(v, size);
const int num_repeat = 2000; // I should repeat my measurements
// because I want to get the average time - it is clearer information
double total_time_sse = 0;
for (int p = 0; p < num_repeat; ++p) {
init(v, size);
TimerHc t;
t.restart();
for (int i = 0; i < size; i += 8) {
sub_func_sse(v, i);
}
total_time_sse += t.toc();
}
double total_time = 0;
for (int p = 0; p < num_repeat; ++p) {
init(v, size);
TimerHc t;
t.restart();
for (int i = 0; i < size; ++i) {
sub_func(v[i]);
}
total_time += t.toc();
}
std::cout << "time using sse = " << total_time_sse / num_repeat << std::endl <<
"time without sse = " << total_time / num_repeat << std::endl;
system("pause");
}
I defined performance like (time without sse) / (time using sse).
What you measure is speedup.
The speedup you can expect from applying parallelization is modelled by Amdahl's law. It relates the savings in the parts that can be made faster (by parallelization or other means) to the total speedup. Amdahl's law can be rather intimidating, because it basically says that making parts faster will not always gain you a total speedup. The limit on the achievable speedup is determined by the relative fraction of the workload that can be parallelized.
Gustafson's law takes a different point of view. In a nutshell, it states that you just have to increase the workload to make efficient use of parallelization. A larger total workload typically means the overhead from parallelization and the non-parallel part of the computation matter less, and hence (according to Amdahl's law) results in more efficient use of parallelism.
...and in some sense, that's what you are observing here. The bigger your array, the more impact parallelization has.
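To put rough, made-up numbers on it: Amdahl's law gives a total speedup of 1 / ((1 - p) + p/s), where p is the fraction of the run time that can be accelerated and s is the speedup of that part. If half the time is per-call and timing overhead that SIMD cannot touch (p = 0.5) and the vectorized part runs 4x faster (s = 4), the total is 1 / (0.5 + 0.125) = 1.6. If a larger array shrinks the overhead to 5% of the time (p = 0.95), the same s = 4 gives 1 / (0.05 + 0.2375) ≈ 3.5, roughly the trend in your measurements.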
PS: This is just some handwaving to explain why the effect you see is not too surprising. Luckily there is another answer which addresses your specific benchmark in more detail.
You're probably a victim of CPU frequency scaling; for stable results you should disable dynamic frequency scaling and turbo boost, or at least warm up the CPU before starting the measurement.
Since you start by measuring SSE performance and then proceed to regular performance, the CPU frequency is low in the beginning, so SSE performance appears worse.
Having said that, there are a few other issues with your approach:
The overhead of the high_resolution_clock::now() calls compared to the work being measured is high; move the time measurement outside the for (..num_repeat) loop, i.e. time the entire loop, not individual iterations (then optionally divide the measured time by the number of iterations). See the sketch after this list.
The results of the computation are never used; the compiler is free to optimize the work out entirely. Make sure to "use" the result, e.g. by printing it.
It is quite inefficient to multiply a double by 2.0. Indeed, the non-SSE version is optimized to an ADD instead (item *= 2.0 ==> vaddsd xmm0, xmm0, xmm0). So your hand-made SSE version is losing out.
An optimizing compiler will probably auto-vectorize your non-SSE code. To be sure, always check the generated assembly. Link to godbolt
Use a benchmarking framework like Google Benchmark; it will help you avoid many pitfalls associated with code benchmarking.
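A minimal sketch (mine, not part of the original answer) of the first two points applied to the SSE loop, using std::chrono directly instead of your TimerHc; the checksum is just one way to "use" the result, and stepping by 4 is my assumption, since a __m256d holds 4 doubles:
auto t0 = std::chrono::high_resolution_clock::now();
for (int p = 0; p < num_repeat; ++p) {
    init(v, size);                                 // re-init so values stay finite; small cost vs. the 128 multiplies
    for (size_t i = 0; i + 4 <= size; i += 4)      // 4 doubles per __m256d
        sub_func_sse(v, static_cast<int>(i));
}
auto t1 = std::chrono::high_resolution_clock::now();
double checksum = 0;
for (size_t i = 0; i < size; ++i) checksum += v[i];   // "use" the result so the work can't be optimized away
std::cout << "sse: " << std::chrono::duration<double>(t1 - t0).count() / num_repeat
          << " s per repeat, checksum = " << checksum << "\n";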

weird performance in C++ (VC 2010)

I have this loop written in C++ that, compiled with MSVC 2010, takes a long time to run (300 ms).
for (int i=0; i<h; i++) {
for (int j=0; j<w; j++) {
if (buf[i*w+j] > 0) {
const int sy = max(0, i - hr);
const int ey = min(h, i + hr + 1);
const int sx = max(0, j - hr);
const int ex = min(w, j + hr + 1);
float val = 0;
for (int k=sy; k < ey; k++) {
for (int m=sx; m < ex; m++) {
val += original[k*w + m] * ds[k - i + hr][m - j + hr];
}
}
heat_map[i*w + j] = val;
}
}
}
It seemed a bit strange to me, so I did some tests, then changed a few bits to inline assembly (specifically, the code that sums "val"):
for (int i=0; i<h; i++) {
for (int j=0; j<w; j++) {
if (buf[i*w+j] > 0) {
const int sy = max(0, i - hr);
const int ey = min(h, i + hr + 1);
const int sx = max(0, j - hr);
const int ex = min(w, j + hr + 1);
__asm {
fldz
}
for (int k=sy; k < ey; k++) {
for (int m=sx; m < ex; m++) {
float val = original[k*w + m] * ds[k - i + hr][m - j + hr];
__asm {
fld val
fadd
}
}
}
float val1;
__asm {
fstp val1
}
heat_map[i*w + j] = val1;
}
}
}
Now it runs in half the time, 150ms. It does exactly the same thing, but why is it twice as quick? In both cases it was run in Release mode with optimizations on. Am I doing anything wrong in my original C++ code?
I suggest you try different floating-point calculation models supported by the compiler - precise, strict or fast (see /fp option) - with your original code before making any conclusions. I suspect that your original code was compiled with some overly restrictive floating-point model (not followed by your assembly in the second version of the code), which is why the original is much slower.
In other words, if the original model was indeed too restrictive, then you were simply comparing apples to oranges. The two versions didn't really do the same thing, even though it might seem so at the first sight.
Note, for example, that in the first version of the code the intermediate sum is accumulated in a float value. If it was compiled with precise model, the intermediate results would have to be rounded to the precision of float type, even if the variable val was optimized away and the internal FPU register was used instead. In your assembly code you don't bother to round the accumulated result, which is what could have contributed to its better performance.
I'd suggest you compile both versions of the code in /fp:fast mode and see how their performances compare in that case.
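For instance (a hedged example, not from the original answer; the file name is made up and the settings can also be changed in the project options), building both versions from the command line with the fast model would look something like
cl /O2 /fp:fast weird_loop.cpp
versus the stricter default
cl /O2 /fp:precise weird_loop.cpp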
A few things to check out:
You need to check that it actually is the same code. As in, are your inline assembly statements exactly the same as those generated by the compiler? I can see three potential differences (potential because they may be optimised out). The first is the initial setting of val to zero, the second is the extra variable val1 (unlikely since it will most likely just change the constant subtraction of the stack pointer), the third is that your inline assembly version may not put the interim results back into val.
You need to make sure your sample space is large. You didn't mention whether you'd done only one run of each version or a hundred runs but, the more runs, the better, so as to remove the effect of "noise" in your statistics.
An even better measurement would be CPU time rather than elapsed time. Elapsed time is subject to environmental changes (like your virus checker or one of your services deciding to do something at the time you're testing). The large sample space will alleviate, but not necessarily solve, this.

Why is this code so slow?

So I have this function used to calculate statistics (min/max/std/mean). Now the thing is this runs generally on a 10,000 by 15,000 matrix. The matrix is stored as a vector<vector<int> > inside the class. Now creating and populating said matrix goes very fast, but when it comes down to the statistics part it becomes so incredibly slow.
E.g. reading all the pixel values of the geotiff one pixel at a time takes around 30 seconds (and that involves a lot of complex math to properly georeference the pixel values to a corresponding point), while calculating the statistics for the entire matrix takes around 6 minutes.
void CalculateStats()
{
//OHGOD
double new_mean = 0;
double new_standard_dev = 0;
int new_min = 256;
int new_max = 0;
size_t cnt = 0;
for(size_t row = 0; row < vals.size(); row++)
{
for(size_t col = 0; col < vals.at(row).size(); col++)
{
double mean_prev = new_mean;
T value = get(row, col);
new_mean += (value - new_mean) / (cnt + 1);
new_standard_dev += (value - new_mean) * (value - mean_prev);
// find new max/min's
new_min = value < new_min ? value : new_min;
new_max = value > new_max ? value : new_max;
cnt++;
}
}
stats_standard_dev = sqrt(new_standard_dev / (vals.size() * vals.at(0).size()) + 1);
std::cout << stats_standard_dev << std::endl;
}
Am I doing something horrible here?
EDIT
To respond to the comments, T would be an int.
EDIT 2
I fixed my std algorithm, and here is the final product:
void CalculateStats(const std::vector<double>& ignore_values)
{
//OHGOD
double new_mean = 0;
double new_standard_dev = 0;
int new_min = 256;
int new_max = 0;
size_t cnt = 0;
int n = 0;
double delta = 0.0;
double mean2 = 0.0;
std::vector<double>::const_iterator ignore_begin = ignore_values.begin();
std::vector<double>::const_iterator ignore_end = ignore_values.end();
for(std::vector<std::vector<T> >::const_iterator row = vals.begin(), row_end = vals.end(); row != row_end; ++row)
{
for(std::vector<T>::const_iterator col = row->begin(), col_end = row->end(); col != col_end; ++col)
{
// This method of calculation is based on Knuth's algorithm.
T value = *col;
if(std::find(ignore_begin, ignore_end, value) != ignore_end)
continue;
n++;
delta = value - new_mean;
new_mean = new_mean + (delta / n);
mean2 = mean2 + (delta * (value - new_mean));
// Find new max/min's.
new_min = value < new_min ? value : new_min;
new_max = value > new_max ? value : new_max;
}
}
stats_standard_dev = mean2 / (n - 1);
stats_min = new_min;
stats_max = new_max;
stats_mean = new_mean;
}
This still takes ~120-130 seconds, but it's a huge improvement :)!
Have you tried to profile your code?
You don't even need a fancy profiler. Just stick some debug timing statements in there.
Anything I tell you would just be an educated guess (and probably wrong)
You could be getting lots of cache misses due to the way you're accessing the contents of the vector. You might want to cache the results of size(), but I don't know if that's the issue.
I just profiled it. 90% of the execution time was in this line:
new_mean += (value - new_mean) / (cnt + 1);
You should calculate the sum of values, min, max and count in the first loop,
then calculate the mean in one operation by dividing sum/count,
then in a second loop calculate std_dev's sum
That would probably be a bit faster.
First thing I spotted is that you evaluate vals.at(row).size() in the loop, which obviously doesn't help performance. The same applies to vals.size(), though of course the inner loop is worse. If vals is a vector of vectors, you'd better use iterators, or at least keep a reference to the outer vector (because get() with index parameters surely eats up quite some time as well).
This code sample is supposed to illustrate my intentions ;-)
for(TVO::const_iterator i=vals.begin(),ie=vals.end();i!=ie;++i) {
for(TVI::const_iterator ii=i->begin(),iie=i->end();ii!=iie;++ii) {
T value = *ii;
// the rest
}
}
First, change your row++ to ++row. A minor thing, but you want speed, so that will help
Second, turn row < vals.size() into a comparison against a const instead. The compiler doesn't know that vals won't change, so it has to play nice and always call size().
What is the 'get' method in the middle there? What does it do? That might be your real problem.
I'm not too sure about your std dev calculation. Take a look at the wikipedia page on calculating variance in a single pass (they have a quick explanation of Knuth's algorithm, which is an expansion of a recursion relation).
It's slow because you're benchmarking debug code.
Building and running the code on Windows XP using VS2008:
a Release build with the default optimisation level, the code in the OP runs in 2734 ms.
a Debug build with the default of no optimisation, the code in the OP runs in a massive 398,531 ms.
In comments below you say you're not using optimisation, and this appears to make a big difference in this case - normally it's less than a factor of ten, but in this case it's over a hundred times slower.
I'm using VS2008 rather than 2005, but it's probably similar:
In the Debug build, there are two range checks on each access, each of which calls std::vector::size() using a non-inlined function call and requires branch prediction. There is overhead involved both with function calls and with branches.
In the Release build, the compiler optimizes away the range checks ( I don't know whether it just drops them, or does flow analysis based on the limits of the loop ), and the vector access becomes a small amount of inline pointer arithmetic with no branches.
No-one cares how fast the debug build is. You should be unit testing the release build anyway, as that's the build which has to work correctly. Only use the Debug build if you don't get all the information you want when you try to step through the code.
The code as posted runs in < 1.5 seconds on my PC with test data of 15000 x 10000 integers all equal to 42. You report that it's running 230 times slower than that. Are you on a 10 MHz processor?
There are other suggestions for making it faster (such as moving it to use SSE, if all the values are representable using 8-bit types), but there's clearly something else which is making it slow.
On my machine, neither a version which hoisted a reference to the vector for the row and hoisting the size of the row, nor a version which used iterator had any measurable benefit ( with g++ -O3 using iterators takes 1511ms repeatably; the hoisted and original version both take 1485ms ). Not optimising means it runs in 7487ms ( original ), 3496ms ( hoisted ) or 5331ms ( iterators ).
But unless you're running on a very low power device, or are paging, or a running non-optimised code with a debugger attached, it shouldn't be this slow, and whatever is making it slow is not likely to be the code you've posted.
( as a side note, if you test it with values with a deviation of zero your SD comes out as 1 )
There are far too many calculations in the inner loop:
For the descriptive statistics (mean, standard deviation) the only things required are the sum of the values and the sum of the squared values. From these two sums the mean and standard deviation can be computed after the outer loop (together with a third value, the number of samples - n in your new/updated code). The equations can be derived from the definitions or found on the web, e.g. Wikipedia. For instance the mean is just the sum of the values divided by n. For the n version (in contrast to the n-1 version - however n is large in this case so it doesn't matter which one is used) the standard deviation is sqrt(n * sumOfSquaredValue - sumOfValue * sumOfValue) / n. Thus only two floating point additions and one multiplication are needed in the inner loop. Overflow is not a problem with these sums, as the range of a double is about 10^308. In particular you will get rid of the expensive floating point division that the profiling reported in another answer has revealed.
A lesser problem is that the minimum and maximum are rewritten every time (the compiler may or may not prevent this). As the minimum quickly becomes small and the maximum quickly becomes large, only the two comparisons should happen for the majority of loop iterations; use if statements instead to be sure. It can be argued, but on the other hand it is trivial to do.
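A rough sketch of that suggestion (my own code, not the answerer's; it assumes vals is a std::vector<std::vector<int> > of 8-bit pixel values, so the sums fit comfortably in doubles):
#include <cmath>   // std::sqrt
double sum = 0.0, sum_sq = 0.0;
int new_min = 256, new_max = 0;
size_t n = 0;
for (size_t row = 0; row < vals.size(); ++row)
{
    const std::vector<int>& r = vals[row];    // hoist the inner vector once per row
    const size_t w = r.size();
    for (size_t col = 0; col < w; ++col)
    {
        const int value = r[col];
        sum += value;                          // two additions and one multiply per element
        sum_sq += double(value) * value;
        if (value < new_min) new_min = value;  // usually just a compare, rarely a write
        if (value > new_max) new_max = value;
        ++n;
    }
}
const double mean = sum / n;
const double std_dev = std::sqrt(n * sum_sq - sum * sum) / n;   // population ("n") form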
I would change how I access the data. Assuming you are using std::vector for your container you could do something like this:
vector<vector<T> >::const_iterator row;
vector<vector<T> >::const_iterator row_end = vals.end();
for(row = vals.begin(); row < row_end; ++row)
{
vector<T>::const_iterator value;
vector<T>::const_iterator value_end = row->end();
for(value = row->begin(); value < value_end; ++value)
{
double mean_prev = new_mean;
new_mean += (*value - new_mean) / (cnt + 1);
new_standard_dev += (*value - new_mean) * (*value - mean_prev);
// find new max/min's
new_min = min(*value, new_min);
new_max = max(*value, new_max);
cnt++;
}
}
The advantage of this is that in your inner loop you aren't consulting the outer vector, just the inner one.
If your container type is a list, this will be significantly faster, because the lookup time of get/operator[] is linear for a list and constant for a vector.
Edit, I moved the call to end() out of the loop.
Move the .size() calls to before each loop, and make sure you are compiling with optimizations turned on.
If your matrix is stored as a vector of vectors, then in the outer for loop you should directly retrieve the i-th vector, and then operate on that in the inner loop. Try that and see if it improves performance.
I'm not sure of what type vals is, but vals.at(row).size() could take a long time if it itself iterates through the collection. Store that value in a variable. Otherwise it could make the algorithm more like O(n³) than O(n²).
I think that I would rewrite it to use const iterators instead of row and col indexes. I would set up a const const_iterator for row_end and col_end to compare against, just to make certain it wasn't making function calls at every loop end.
As people have mentioned, it might be get(). If it accesses neighbors, for instance, you will totally smash the cache which will greatly reduce the performance. You should profile, or just think about access patterns.
Coming a bit late to the party here, but a couple of points:
You're effectively doing numerical work here. I don't know much about numerical algorithms, but I know enough to know that references and expert support are often useful. This discussion thread offers some references; and Numerical Recipes is a standard (if dated) work.
If you have the opportunity to redesign your matrix, you want to try using a valarray and slices instead of vectors of vectors; one advantage that immediately comes to mind is that you're guaranteed a flat linear layout, which makes cache pre-fetching and SIMD instructions (if your compiler can use them) more effective. (A tiny sketch of the flat layout follows after these points.)
In the inner loop, you shouldn't be testing size, you shouldn't be doing any divisions, and iterators can also be costly. In fact, some unrolling would be good in there.
And, of course, you should pay attention to cache locality.
If you get the loop overhead low enough, it might make sense to do it in separate passes: one to get the sum (which you divide to get the mean), one to get the sum of squares (which you combine with the sum to get the variance), and one to get the min and/or max. The reason is to simplify what is in the inner unrolled loop so the compiler can keep stuff in registers.
I couldn't get the code to compile, so I couldn't pinpoint issues for sure.
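A tiny sketch (mine, with made-up h, w, row and col) of the flat-layout idea from the second point above:
#include <valarray>
std::valarray<int> vals(h * w);     // one contiguous block instead of a vector of vectors
int value = vals[row * w + col];    // row-major indexing; cache-friendly and SIMD-friendly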
I have modified the algorithm to get rid of almost all of the floating-point division.
WARNING: UNTESTED CODE!!!
void CalculateStats()
{
//OHGOD
double accum_f;
double accum_sq_f;
double new_mean = 0;
double new_standard_dev = 0;
int new_min = 256;
int new_max = 0;
const int oku = 100000000;
int accum_ichi = 0;
int accum_oku = 0;
int accum_sq_ichi = 0;
int accum_sq_oku = 0;
size_t cnt = 0;
int v1 = 0;
int v2 = 0;
v1 = vals.size();
for(size_t row = 0; row < v1; row++)
{
v2 = vals.at(row).size();
for(size_t col = 0; col < v2; col++)
{
T value = get(row, col);
accum_ichi += value;
accum_sq_ichi += (value * value);
// perform carries
accum_oku += (accum_ichi / oku);
accum_ichi %= oku;
accum_sq_oku += (accum_sq_ichi / oku);
accum_sq_ichi %= oku;
// find new max/min's
new_min = value < new_min ? value : new_min;
new_max = value > new_max ? value : new_max;
cnt++;
}
}
// now, and only now, do we use floating-point arithmetic
accum_f = (double)(oku) * (double)(accum_oku) + (double)(accum_ichi);
accum_sq_f = (double)(oku) * (double)(accum_sq_oku) + (double)(accum_sq_ichi);
new_mean = accum_f / (double)(cnt);
// standard deviation formula from Wikipedia
stats_standard_dev = sqrt((double)(cnt)*accum_sq_f - accum_f*accum_f)/(double)(cnt);
std::cout << stats_standard_dev << std::endl;
}

What is faster on division? doubles / floats / UInt32 / UInt64 ? in C++/C

I did some speed testing to figure out what is fastest when doing multiplication or division on numbers. I had to really work hard to defeat the optimiser. I got nonsensical results, such as a massive loop operating in 2 microseconds, or multiplication being the same speed as division (if only that were true).
After I finally worked hard enough to defeat enough of the compiler optimisations, while still letting it optimise for speed, I got these speed results. They may be of interest to someone else?
If my test is STILL FLAWED, let me know, but be kind seeing as I just spent two hours writing this crap :P
64 time: 3826718 us
32 time: 2476484 us
D(mul) time: 936524 us
D(div) time: 3614857 us
S time: 1506020 us
"Multiplying to divide" using doubles seems the fastest way to do a division, followed by integer division. I did not test the accuracy of division. Could it be that "proper division" is more accurate? I have no desire to find out after these speed test results as I'll just be using integer division on a base 10 constant and letting my compiler optimise it for me ;) (and not defeating it's optimisations either).
Here's the code I used to get the results:
#include <iostream>
#include <cstdio>   // printf
#include <cstdint>  // uint32_t, uint64_t
#include <ctime>    // clock, clock_t
int Run(int bla, int div, int add, int minus) {
// these parameters are to force the compiler to not be able to optimise away the
// multiplications and divides :)
long LoopMax = 100000000;
uint32_t Origbla32 = 1000000000;
long i = 0;
uint32_t bla32 = Origbla32;
uint32_t div32 = div;
clock_t Time32 = clock();
for (i = 0; i < LoopMax; i++) {
div32 += add;
div32 -= minus;
bla32 = bla32 / div32;
bla32 += bla;
bla32 = bla32 * div32;
}
Time32 = clock() - Time32;
uint64_t bla64 = bla32;
clock_t Time64 = clock();
uint64_t div64 = div;
for (long i = 0; i < LoopMax; i++) {
div64 += add;
div64 -= minus;
bla64 = bla64 / div64;
bla64 += bla;
bla64 = bla64 * div64;
}
Time64 = clock() - Time64;
double blaDMul = Origbla32;
double multodiv = 1.0 / (double)div;
double multomul = div;
clock_t TimeDMul = clock();
for (i = 0; i < LoopMax; i++) {
multodiv += add;
multomul -= minus;
blaDMul = blaDMul * multodiv;
blaDMul += bla;
blaDMul = blaDMul * multomul;
}
TimeDMul = clock() - TimeDMul;
double blaDDiv = Origbla32;
clock_t TimeDDiv = clock();
for (i = 0; i < LoopMax; i++) {
multodiv += add;
multomul -= minus;
blaDDiv = blaDDiv / multomul;
blaDDiv += bla;
blaDDiv = blaDDiv / multodiv;
}
TimeDDiv = clock() - TimeDDiv;
float blaS = Origbla32;
float divS = div;
clock_t TimeS = clock();
for (i = 0; i < LoopMax; i++) {
divS += add;
divS -= minus;
blaS = blaS / divS;
blaS += bla;
blaS = blaS * divS;
}
TimeS = clock() - TimeS;
printf("64 time: %i us (%i)\n", (int)Time64, (int)bla64);
printf("32 time: %i us (%i)\n", (int)Time32, bla32);
printf("D(mul) time: %i us (%f)\n", (int)TimeDMul, blaDMul);
printf("D(div) time: %i us (%f)\n", (int)TimeDDiv, blaDDiv);
printf("S time: %i us (%f)\n", (int)TimeS, blaS);
return 0;
}
int main(int argc, char* const argv[]) {
Run(0, 10, 0, 0); // adds and minuses 0 so it doesn't affect the math, only kills the opts
return 0;
}
There are lots of ways to perform certain arithmetic, so there might not be a single answer (shifting, fractional multiplication, actual division, some round-trip through a logarithm unit, etc; these might all have different relative costs depending on the operands and resource allocation).
Let the compiler do its thing with the program and data flow information it has.
For some data applicable to assembly on x86, you might look at: "Instruction latencies and throughput for AMD and Intel x86 processors"
What is fastest will depend entirely on the target architecture. It looks here like you're interested only in the platform you happen to be on, which guessing from your execution times seems to be 64-bit x86, either Intel (Core2?) or AMD.
That said, floating-point multiplication by the inverse will be the fastest on many platforms, but is, as you speculate, usually less accurate than a floating-point divide (two roundings instead of one -- whether or not that matters for your usage is a separate question). In general, you are better off re-arranging your algorithm to use fewer divides than you are jumping through hoops to make division as efficient as possible (the fastest division is the one you don't do), and make sure to benchmark before you spend time optimizing at all, as algorithms that bottleneck on division are few and far between.
Also, if you have integer sources and need an integer result, make sure to include the cost of conversion between integer and floating-point in your benchmarking.
Since you're interested in timings on a specific machine, you should be aware that Intel now publishes this information in their Optimization Reference Manual (pdf). Specifically, you will be interested in the tables of Appendix C section 3.1, "Latency and Throughput with Register Operands".
Be aware that integer divide timings depend strongly on the actual values involved. Based on the information in that guide, it seems that your timing routines still have a fair bit of overhead, as the performance ratios you measure don't match up with Intel's published information.
As Stephen mentioned, use the optimisation manual - but you should also be considering the use of SSE instructions. These can do 4 or 8 divisions / multiplications in a single instruction.
Also, it is fairly common for a division to take a single clock cycle to process. The result may not be available for several clock cycles (called latency), however the next division can begin during this time (overlapping with the first) as long as it does not require the result from the first. This is due to pipe-lining in the CPU, in the same way as you can wash more clothes while the previous load is still drying.
Multiplying to divide is a common trick, and should be used wherever your divisor changes infrequently.
There is a very good chance that you will spend time and effort making the maths fast only to discover that it is the speed of memory access (as you navigate the input and write the output) that limits your final implementation.
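A small sketch of the "multiply to divide" trick when the divisor changes infrequently (my own example; scale_all and its parameters are made up):
#include <cstddef>   // size_t
void scale_all(float* data, std::size_t n, float divisor)
{
    const float inv = 1.0f / divisor;    // one division, hoisted out of the loop
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= inv;                  // a multiply per element instead of a divide
}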
I wrote a flawed test to do this on MSVC 2008
double i32Time = GetTime();
{
volatile __int32 i = 4;
__int32 count = 0;
__int32 max = 1000000;
while( count < max )
{
i /= 61;
count++;
}
}
i32Time = GetTime() - i32Time;
double i64Time = GetTime();
{
volatile __int64 i = 4;
__int32 count = 0;
__int32 max = 1000000;
while( count < max )
{
i /= 61;
count++;
}
}
i64Time = GetTime() - i64Time;
double fTime = GetTime();
{
volatile float i = 4;
__int32 count = 0;
__int32 max = 1000000;
while( count < max )
{
i /= 4.0f;
count++;
}
}
fTime = GetTime() - fTime;
double fmTime = GetTime();
{
volatile float i = 4;
const float div = 1.0f / 4.0f;
__int32 count = 0;
__int32 max = 1000000;
while( count < max )
{
i *= div;
count++;
}
}
fmTime = GetTime() - fmTime;
double dTime = GetTime();
{
volatile double i = 4;
__int32 count = 0;
__int32 max = 1000000;
while( count < max )
{
i /= 4.0f;
count++;
}
}
dTime = GetTime() - dTime;
double dmTime = GetTime();
{
volatile double i = 4;
const double div = 1.0f / 4.0f;
__int32 count = 0;
__int32 max = 1000000;
while( count < max )
{
i *= div;
count++;
}
}
dmTime = GetTime() - dmTime;
DebugOutput( _T( "%f\n" ), i32Time );
DebugOutput( _T( "%f\n" ), i64Time );
DebugOutput( _T( "%f\n" ), fTime );
DebugOutput( _T( "%f\n" ), fmTime );
DebugOutput( _T( "%f\n" ), dTime );
DebugOutput( _T( "%f\n" ), dmTime );
DebugBreak();
I then ran it on an AMD64 Turion 64 in 32-bit mode. The results I got were as follows:
0.006622
0.054654
0.006283
0.006353
0.006203
0.006161
The reason the test is flawed is the usage of volatile, which forces the compiler to re-load the variable from memory just in case it has changed. All in all, it shows there is precious little difference between any of the implementations on this machine (__int64 is obviously slow).
It also categorically shows that the MSVC compiler performs the multiply-by-reciprocal optimisation. I imagine GCC does the same, if not better. If I change the float and double division checks to divide by "i" then the time increases significantly. Though a lot of that could be the re-loading from memory, it is obvious the compiler can't optimise that away so easily.
To understand such micro-optimisations try reading this pdf.
All in I'd argue that if you are worrying about such things you obviously haven't profiled your code. Profile and fix the problems as and when they actually ARE a problem.
Agner Fog has done some pretty detailed measurements himself, which can be found here. If you're really trying to optimize stuff, you should read the rest of the documents from his software optimization resources as well.
I would point out that, even if you are measuring non-vectorized floating point operations, the compiler has two options for the generated assembly: it can use the FPU instructions (fadd, fmul) or it can use SSE instructions while still manipulating one floating point value per instruction (addss, mulss). In my experience the SSE instructions are faster and have fewer inaccuracies, but compilers don't make them the default because that could break compatibility with code that relies on the old behaviour. You can turn it on in gcc with the -mfpmath=sse flag.
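For example (a hedged illustration with a made-up file name, assuming a 32-bit x86 target where the x87 FPU is still GCC's default):
g++ -O2 -msse2 -mfpmath=sse mybenchmark.cpp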