I am creating a testing utility that makes heavy use of the sqrt() function. After digging into possible optimisations, I decided to try inline assembler in C++. The code is:
#include <iostream>
#include <cstdlib>
#include <cmath>
#include <ctime>
using namespace std;
volatile double normalSqrt(double a){
double b = 0;
for(int i = 0; i < ITERATIONS; i++){
b = sqrt(a);
}
return b;
}
volatile double asmSqrt(double a){
double b = 0;
for(int i = 0; i < ITERATIONS; i++){
asm volatile(
"movq %1, %%xmm0 \n"
"sqrtsd %%xmm0, %%xmm1 \n"
"movq %%xmm1, %0 \n"
: "=r"(b)
: "g"(a)
: "xmm0", "xmm1", "memory"
);
}
return b;
}
int main(int argc, char *argv[]){
double a = atoi(argv[1]);
double c;
std::clock_t start;
double duration;
start = std::clock();
c = asmSqrt(a);
duration = std::clock() - start;
cout << "asm sqrt: " << c << endl;
cout << duration << " clocks" <<endl;
cout << "Start: " << start << " end: " << start + duration << endl;
start = std::clock();
c = normalSqrt(a);
duration = std::clock() - start;
cout << endl << "builtin sqrt: " << c << endl;
cout << duration << " clocks" << endl;
cout << "Start: " << start << " end: " << start + duration << endl;
return 0;
}
I compile this code using the following script, which sets the number of iterations, runs the profiled binary, and opens the profiling output in Vim:
#!/bin/bash
DEFAULT_ITERATIONS=1000000
if [ $# -eq 1 ]; then
echo "Setting ITERATIONS to $1"
DEFAULT_ITERATIONS=$1
else
echo "Using default value: $DEFAULT_ITERATIONS"
fi
rm -rf asd
g++ -msse4 -std=c++11 -O0 -ggdb -pg -DITERATIONS=$DEFAULT_ITERATIONS test.cpp -o asd
./asd 16
gprof asd gmon.out > output.txt
vim -O output.txt
true
The output is:
Using default value: 1000000
asm sqrt: 4
3802 clocks
Start: 1532 end: 5334
builtin sqrt: 4
5501 clocks
Start: 5402 end: 10903
The question is: why does the sqrtsd instruction take only 3802 clocks to compute the square root of 16, while sqrt() takes 5501 clocks?
Does it have something to do with the hardware implementation of certain instructions? Thank you.
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 21
Model: 48
Model name: AMD A8-7600 Radeon R7, 10 Compute Cores 4C+6G
Stepping: 1
CPU MHz: 3100.000
CPU max MHz: 3100,0000
CPU min MHz: 1400,0000
BogoMIPS: 6188.43
Virtualization: AMD-V
L1d cache: 16K
L1i cache: 96K
L2 cache: 2048K
NUMA node0 CPU(s): 0-3
Floating-point arithmetic has to take rounding into account. Most C/C++ compilers adopt IEEE 754, so they have an "ideal" algorithm for operations such as square root. They are then free to optimize, but they must return the same result down to the last bit, in all cases, so their freedom to optimize is not complete; in fact it is severely constrained.
Your algorithm is probably off by a digit or two some of the time. That could be completely negligible for some users, but it could also cause nasty bugs for others, so it is not allowed by default.
If you care more about speed than standard compliance, try poking around with your compiler's options. For instance, in GCC the first one I'd try is -funsafe-math-optimizations, which enables optimizations that disregard strict standard compliance. Once you tweak it enough, you should get close to, and possibly surpass, the speed of your handmade implementation.
Ignoring the other problems, it will still be the case that sqrt() is a bit slower than a bare sqrtsd unless it is compiled with specific flags.
sqrt() potentially has to set errno, so it has to check whether the input falls into that case. On any reasonable compiler it still boils down to the native square-root instruction, but with a little overhead. Not as much overhead as your flawed test suggests, but still some.
You can see that in action here.
Some compile flags suppress this check; for GCC, for example, -fno-math-errno and -ffinite-math-only.
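As a quick way to see that errno check appear and disappear, here is a minimal sketch (assuming GCC on x86-64; the file name is just an example). Compile it to assembly with and without -fno-math-errno and compare the output:
// sqrt_errno.cpp
// g++ -O2 -S sqrt_errno.cpp                  -> expect sqrtsd plus a compare/branch to the libm sqrt (negative inputs, to set errno)
// g++ -O2 -fno-math-errno -S sqrt_errno.cpp  -> expect a bare sqrtsd
#include <cmath>

double just_sqrt(double x) {
    return std::sqrt(x);
}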
I tested the following code on my machine to see how much throughput I can get. The code does not do much beyond assigning each thread two nested loops:
#include <chrono>
#include <iostream>
int main() {
auto start_time = std::chrono::high_resolution_clock::now();
#pragma omp parallel for
for(int thread = 0; thread < 24; thread++) {
float i = 0.0f;
while(i < 100000.0f) {
float j = 0.0f;
while (j < 100000.0f) {
j = j + 1.0f;
}
i = i + 1.0f;
}
}
auto end_time = std::chrono::high_resolution_clock::now();
auto time = end_time - start_time;
std::cout << time / std::chrono::milliseconds(1) << std::endl;
return 0;
}
To my surprise, the throughput is very low according to perf
$ perf stat -e all_dc_accesses -e fp_ret_sse_avx_ops.all cmake-build-release/roofline_prediction
8907
Performance counter stats for 'cmake-build-release/roofline_prediction':
325.372.690 all_dc_accesses
240.002.400.000 fp_ret_sse_avx_ops.all
8,909514307 seconds time elapsed
202,819795000 seconds user
0,059613000 seconds sys
With 240.002.400.000 FLOPs in 8.83 seconds, the machine achieved only 27.1 GFLOPs/second, way below the CPU's capacity of 392 GFLOPs/sec (I got this number from a roofline modelling software).
My question is, how can I achieve higher throughput?
Compiler: GCC 9.3.0
CPU: AMD Threadripper 1920X
Optimization level: -O3
OpenMP's flag: -fopenmp
Compiled with GCC 9.3 with those options, the inner loop looks like this:
.L3:
addss xmm0, xmm2
comiss xmm1, xmm0
ja .L3
Some other combinations of GCC version / options may result in the loop being elided entirely; after all, it doesn't really do anything (except waste time).
The addss forms a loop-carried dependency chain with only itself in it. That is not fast: on Zen 1 it takes 3 cycles per iteration, so the number of additions per cycle is 1/3. The maximum number of floating-point additions per cycle could be attained by having at least 6 independent addps instructions in flight (256-bit vaddps may help a bit, but Zen 1 executes such 256-bit SIMD instructions as two 128-bit operations internally), to cover the latency of 3 cycles at the throughput of 2 per cycle (so 6 operations need to be active at any time). That would correspond to 8 additions per cycle, 24 times as many as the current code.
From a C++ program, it may be possible to coax the compiler into generating suitable machine code by:
Using -ffast-math (if possible, which it isn't always)
Using explicit vectorization using _mm_add_ps
Manually unrolling the loop, using (at least 6) independent accumulators
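As an illustration of the last point, here is a minimal sketch of breaking the single addss dependency chain into several independent accumulators. It is not the original benchmark; the loop counts and names (kChains, kStepsPerChain) are made up for the example:
#include <iostream>

int main() {
    // 8 independent accumulators: each chain depends only on itself, so the
    // FP adder's 3-cycle latency on Zen 1 can overlap with its 2-per-cycle
    // throughput instead of serializing every addition on one chain.
    constexpr int kChains = 8;
    constexpr long kStepsPerChain = 10000000;   // below 2^24, so "+= 1.0f" stays exact
    float acc[kChains] = {};
    for (long s = 0; s < kStepsPerChain; ++s)
        for (int c = 0; c < kChains; ++c)
            acc[c] += 1.0f;                     // independent of the other chains
    float total = 0.0f;
    for (int c = 0; c < kChains; ++c)
        total += acc[c];
    std::cout << total << '\n';                 // prints 8e+07
    return 0;
}
Explicit _mm_add_ps / _mm256_add_ps vectors can then be layered on top of the same pattern.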
I am new to the field of SSE2 and AVX. I write the following code to test the performance of both SSE2 and AVX.
#include <cmath>
#include <iostream>
#include <chrono>
#include <emmintrin.h>
#include <immintrin.h>
#include <cstdlib>   // aligned_alloc
void normal_res(float* __restrict__ a, float* __restrict__ b, float* __restrict__ c, unsigned long N) {
for (unsigned long n = 0; n < N; n++) {
c[n] = sqrt(a[n]) + sqrt(b[n]);
}
}
void normal(float* a, float* b, float* c, unsigned long N) {
for (unsigned long n = 0; n < N; n++) {
c[n] = sqrt(a[n]) + sqrt(b[n]);
}
}
void sse(float* a, float* b, float* c, unsigned long N) {
__m128* a_ptr = (__m128*)a;
__m128* b_ptr = (__m128*)b;
for (unsigned long n = 0; n < N; n+=4, a_ptr++, b_ptr++) {
__m128 asqrt = _mm_sqrt_ps(*a_ptr);
__m128 bsqrt = _mm_sqrt_ps(*b_ptr);
__m128 add_result = _mm_add_ps(asqrt, bsqrt);
_mm_store_ps(&c[n], add_result);
}
}
void avx(float* a, float* b, float* c, unsigned long N) {
__m256* a_ptr = (__m256*)a;
__m256* b_ptr = (__m256*)b;
for (unsigned long n = 0; n < N; n+=8, a_ptr++, b_ptr++) {
__m256 asqrt = _mm256_sqrt_ps(*a_ptr);
__m256 bsqrt = _mm256_sqrt_ps(*b_ptr);
__m256 add_result = _mm256_add_ps(asqrt, bsqrt);
_mm256_store_ps(&c[n], add_result);
}
}
int main(int argc, char** argv) {
unsigned long N = 1 << 30;
auto *a = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));
auto *b = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));
auto *c = static_cast<float*>(aligned_alloc(128, N*sizeof(float)));
std::chrono::time_point<std::chrono::system_clock> start, end;
for (unsigned long i = 0; i < N; ++i) {
a[i] = 3141592.65358;
b[i] = 1234567.65358;
}
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
normal(a, b, c, N);
end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
std::cout << "normal elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
normal_res(a, b, c, N);
end = std::chrono::system_clock::now();
elapsed_seconds = end - start;
std::cout << "normal restrict elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
sse(a, b, c, N);
end = std::chrono::system_clock::now();
elapsed_seconds = end - start;
std::cout << "sse elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
start = std::chrono::system_clock::now();
for (int i = 0; i < 5; i++)
avx(a, b, c, N);
end = std::chrono::system_clock::now();
elapsed_seconds = end - start;
std::cout << "avx elapsed time: " << elapsed_seconds.count() / 5 << std::endl;
return 0;
}
I compile my program with the g++ compiler as follows.
g++ -msse -msse2 -mavx -mavx512f -O2
The results are as follows. It seems there is no further improvement when I use the wider 256-bit vectors.
normal elapsed time: 10.5311
normal restrict elapsed time: 8.00338
sse elapsed time: 0.995806
avx elapsed time: 0.973302
I have two questions.
Why does AVX not give me any further improvement? Is it because of memory bandwidth?
According to my experiment, SSE2 performs 10 times faster than the naive version. Why is that? I expected SSE2 to be only 4 times faster, based on its 128-bit vectors operating on single-precision floats. Thanks a lot.
There are several issues here....
Memory bandwidth is very likely to be important for these array sizes -- more notes below.
Throughput for SSE and AVX square root instructions may not be what you expect on your processor -- more notes below.
The first test ("normal") may be slower than expected because the output array is instantiated (i.e., virtual to physical mappings are created) during the timed part of the test. (Just fill c with zeros in the loop that initializes a and b to fix this.)
Memory Bandwidth Notes:
With N = 1<<30 and float variables, each array is 4GiB.
Each test reads two arrays and writes to a third array. This third array must also be read from memory before being overwritten -- this is called a "write allocate" or a "read for ownership".
So you are reading 12 GiB and writing 4 GiB in each test. The SSE and AVX tests therefore correspond to ~16 GB/s of DRAM bandwidth, which is near the high end of the range typically seen for single-threaded operation on recent processors.
Instruction Throughput Notes:
The best reference for instruction latency and throughput on x86 processors is "instruction_tables.pdf" from https://www.agner.org/optimize/
Agner defines "reciprocal throughput" as the average number of cycles per retired instruction when the processor is given a workload of independent instructions of the same type.
As an example, for an Intel Skylake core, the per-element throughput of the SSE and AVX SQRT instructions is the same:
SQRTPS (xmm) 1/throughput = 3 --> 1 instruction every 3 cycles
VSQRTPS (ymm) 1/throughput = 6 --> 1 instruction every 6 cycles
Execution time for the square roots is expected to be (1<<31) square roots / 4 square roots per SSE SQRT instruction * 3 cycles per SSE SQRT instruction / 3 GHz = 0.54 seconds (arbitrarily assuming a 3 GHz processor frequency).
Expected throughput for the "normal" and "normal_res" cases depends on the specifics of the generated assembly code.
Scalar being 10x instead of 4x slower:
You're getting page faults in c[] inside the scalar timed region because that's the first time you're writing it. If you did tests in a different order, whichever one was first would pay that large penalty. That part is a duplicate of this mistake: Why is iterating through `std::vector` faster than iterating through `std::array`? See also Idiomatic way of performance evaluation?
normal pays this cost in its first of the 5 passes over the array. Smaller arrays and a larger repeat count would amortize this even more, but better to memset or otherwise fill your destination first to pre-fault it ahead of the timed region.
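For example, a small helper like this (a sketch; prefault is a made-up name, and c / N are the pointer and element count from the question's main()) called once before the first timed region takes the page-fault cost out of the measurement:
#include <cstring>   // std::memset
#include <cstddef>   // std::size_t

// Touch every page of the destination once, outside the timed region,
// so the first timed pass over c[] does not pay for page faults.
static void prefault(float* c, std::size_t N) {
    std::memset(c, 0, N * sizeof(float));
}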
normal_res is also scalar but is writing into an already-dirtied c[]. Scalar is 8x slower than SSE instead of the expected 4x.
You used sqrt(double) instead of sqrtf(float) or std::sqrt(float). On Skylake-X, this perfectly accounts for an extra factor of 2 in throughput. Look at the compiler's asm output on the Godbolt compiler explorer (GCC 7.4, assuming the same system as your last question). I used -mavx512f (which implies -mavx and -msse), and no tuning options, to hopefully get about the same code-gen you did. main doesn't inline normal_res, so we can just look at the stand-alone definition for it.
normal_res(float*, float*, float*, unsigned long):
...
vpxord zmm2, zmm2, zmm2 # uh oh, 512-bit instruction reduces turbo clocks for the next several microseconds. Silly compiler
# more recent gcc would just use `vpxor xmm0,xmm0,xmm0`
...
.L5: # main loop
vxorpd xmm0, xmm0, xmm0
vcvtss2sd xmm0, xmm0, DWORD PTR [rdi+rbx*4] # convert to double
vucomisd xmm2, xmm0
vsqrtsd xmm1, xmm1, xmm0 # scalar double sqrt
ja .L16
.L3:
vxorpd xmm0, xmm0, xmm0
vcvtss2sd xmm0, xmm0, DWORD PTR [rsi+rbx*4]
vucomisd xmm2, xmm0
vsqrtsd xmm3, xmm3, xmm0 # scalar double sqrt
ja .L17
.L4:
vaddsd xmm1, xmm1, xmm3 # scalar double add
vxorps xmm4, xmm4, xmm4
vcvtsd2ss xmm4, xmm4, xmm1 # could have just converted in-place without zeroing another destination to avoid a false dependency :/
vmovss DWORD PTR [rdx+rbx*4], xmm4
add rbx, 1
cmp rcx, rbx
jne .L5
The vpxord zmm only reduces turbo clock for a few milliseconds (I think) at the start of each call to normal and normal_res. It doesn't keep using 512-bit operations so clock speed can jump back up again later. This might partially account for it not being exactly 8x.
The compare / ja is because you didn't use -fno-math-errno so GCC still calls actual sqrt for inputs < 0 to get errno set. It's doing if (!(0 <= tmp)) goto fallback, jumping on 0 > tmp or unordered. "Fortunately" sqrt is slow enough that it's still the only bottleneck. Out-of-order exec of the conversion and compare/branching means the SQRT unit is still kept busy ~100% of the time.
vsqrtsd throughput (6 cycles) is 2x slower than vsqrtss throughput (3 cycles) on Skylake-X, so using double costs a factor of 2 in scalar throughput.
Scalar sqrt on Skylake-X has the same throughput as the corresponding 128-bit ps / pd SIMD version. So 6 cycles per 1 number as a double vs. 3 cycles per 4 floats as a ps vector fully explains the 8x factor.
The extra 8x vs. 10x slowdown for normal was just from page faults.
SSE vs. AVX sqrt throughput
128-bit sqrtps is sufficient to get the full throughput of the SIMD div/sqrt unit; assuming this is a Skylake-server like your last question, it's 256 bits wide but not fully pipelined. The CPU can alternate sending a 128-bit vector into the low or high half to take advantage of the full hardware width even when you're only using 128-bit vectors. See Floating point division vs floating point multiplication (FP div and sqrt run on the same execution unit.)
See also instruction latency/throughput numbers on https://uops.info/, or on https://agner.org/optimize/.
The add/sub/mul/fma are all 512-bits wide and fully pipelined; use that (e.g. to evaluate a 6th order polynomial or something) if you want something that can scale with vector width. div/sqrt is a special case.
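For instance, here is a sketch of the kind of per-element ALU work that does scale with vector width (assuming AVX2 + FMA is available and the file is compiled with something like -mavx2 -mfma; the coefficients are just a truncated Taylor series for exp(x)):
#include <immintrin.h>

// Evaluate a 6th-order polynomial per element with Horner's rule using FMA.
// Unlike div/sqrt, the FMA units are fully pipelined, so wider vectors
// really do buy proportionally more throughput for this kind of work.
__m256 poly6_avx2(__m256 x) {
    const __m256 c6 = _mm256_set1_ps(1.0f / 720.0f);
    const __m256 c5 = _mm256_set1_ps(1.0f / 120.0f);
    const __m256 c4 = _mm256_set1_ps(1.0f / 24.0f);
    const __m256 c3 = _mm256_set1_ps(1.0f / 6.0f);
    const __m256 c2 = _mm256_set1_ps(0.5f);
    const __m256 c1 = _mm256_set1_ps(1.0f);
    const __m256 c0 = _mm256_set1_ps(1.0f);
    __m256 r = c6;
    r = _mm256_fmadd_ps(r, x, c5);   // r = r * x + c5
    r = _mm256_fmadd_ps(r, x, c4);
    r = _mm256_fmadd_ps(r, x, c3);
    r = _mm256_fmadd_ps(r, x, c2);
    r = _mm256_fmadd_ps(r, x, c1);
    r = _mm256_fmadd_ps(r, x, c0);
    return r;
}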
You'd expect a benefit from using 256-bit vectors for SQRT only if you had a bottleneck on the front-end (4/clock instruction / uop throughput), or if you were doing a bunch of add/sub/mul/fma work with the vectors as well.
256-bit isn't worse, but it doesn't help when the only computation bottleneck is on the div/sqrt unit's throughput.
See John McCalpin's answer for more details about write-only costing about the same as a read+write, because of RFOs.
With so little computation per memory access, you're probably close to bottlenecking on memory bandwidth again / still. Even if the FP SQRT hardware was wider / faster, you might not in practice have your code run any faster. Instead you'd just have the core spend more time doing nothing while waiting for data to arrive from memory.
It seems you are getting exactly the expected speedup from 128-bit vectors (2x * 4x = 8x), so apparently the __m128 version is not bottlenecked on memory bandwidth either.
2x sqrt per 4 memory accesses is about the same as the a[i] = sqrt(a[i]) (1x sqrt per load + store) you were doing in the code you posted in chat, but you didn't give any numbers for that. That one avoided the page-fault problem because it was rewriting an array in-place after initializing it.
In general rewriting an array in-place is a good idea if you for some reason keep insisting on trying to get a 4x / 8x / 16x SIMD speedup using these insanely huge arrays that won't even fit in L3 cache.
Memory access is pipelined, and overlaps with computation (assuming sequential access so prefetchers can be pulling it in continuously without having to compute the next address): faster computation doesn't speed up overall progress. Cache lines arrive from memory at some fixed maximum bandwidth, with ~12 cache line transfers in flight at once (12 LFBs in Skylake). Or L2 "superqueue" can track more cache lines than that (maybe 16?), so L2 prefetch is reading ahead of where the CPU core is stalled.
As long as your computation can keep up with that rate, making it faster will just leave more cycles of doing nothing before the next cache line arrives.
(The store buffer writing back to L1d and then evicting dirty lines is also happening, but the basic idea of core waiting for memory still works.)
You could think of it like stop-and-go traffic in a car: a gap opens ahead of your car. Closing that gap faster doesn't gain you any average speed, it just means you have to stop faster.
If you want to see the benefit of AVX and AVX512 over SSE, you'll need smaller arrays (and a higher repeat-count). Or you'll need lots of ALU work per vector, like a polynomial.
In many real-world problems, the same data is used repeatedly so caches work. And it's possible to break up your problem into doing multiple things to one block of data while it's hot in cache (or even while loaded in registers), to increase the computational intensity enough to take advantage of the compute vs. memory balance of modern CPUs.
I am currently recording some frame times in MS instead of ticks. I know this can be an issue as we are adding all the frame times (in MS) together and then dividing by the number of frames. This could cause bad results due to floating point precision.
It would make more sense to add all the tick counts together then convert to MS once at the end.
However, I am wondering what the actual difference would be for a small number of samples? I expect to have between 900-1800 samples. Would this be an issue at all?
I have made this small example and run it on GCC 4.9.2:
// Example program
#include <iostream>
#include <string>
#include <cstdlib>   // rand(), RAND_MAX
int main()
{
float total = 0.0f;
double total2 = 0.0f;
for(int i = 0; i < 1000000; ++i)
{
float r = static_cast <float> (rand()) / static_cast <float> (RAND_MAX);
total += r;
total2 += r;
}
std::cout << "Total: " << total << std::endl;
std::cout << "Total2: " << total2 << std::endl;
}
Result:
Total: 500004 Total2: 500007
So as far as I can tell with 1 million values we do not lose a lot of precision. Though I am not sure if what I have written is a reasonable test or actually testing what I want to test.
So my question is, how many floats can I add together before precision becomes an issue? I expect my values to be between 1 and 60 MS. I would like the end precision to be within 1 millisecond. I have 900-1800 values.
Example Value: 15.1345f for 15 milliseconds.
Counterexample
Using the assumptions below about the statement of the problem (times are effectively given as values such as .06 for 60 milliseconds), if we convert .06 to float and add it 1800 times, the computed result is 107.99884796142578125. This differs from the mathematical result, 108.000, by more than .001. Therefore, the computed result will sometimes differ from the mathematical result by more than 1 millisecond, so the goal desired in the question is not achievable in these conditions. (Further refinement of the problem statement and alternate means of computation may be able to achieve the goal.)
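A quick way to reproduce that counterexample (a minimal sketch; only printf formatting is assumed):
#include <cstdio>

int main() {
    float sum = 0.0f;
    for (int i = 0; i < 1800; ++i)
        sum += 0.06f;                // 60 ms per sample, stored as .06
    std::printf("%.17g\n", sum);     // differs from the exact 108.000 by more than .001
    return 0;
}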
Original Analysis
Suppose we have 1800 integer values in [1, 60] that are converted to float using float y = x / 1000.f;, where all operations are implemented using IEEE-754 basic 32-bit binary floating-point with correct rounding.
The conversions of 1 to 60 to float are exact. The division by 1000 has an error of at most ½ ULP(.06), which is ½ • 2^−5 • 2^−23 = 2^−29. 1800 such errors amount to at most 1800 • 2^−29.
As the resulting float values are added, there may be an error of at most ½ ULP in each addition, where the ULP is that of the current result. For a loose analysis, we can bound this with the ULP of the final result, which is at most around 1800 • .06 = 108, which has an ULP of 2^6 • 2^−23 = 2^−17. So each of the 1799 additions has an error of at most ½ • 2^−17 = 2^−18, and the total error in the additions is at most 1799 • 2^−18.
Thus, the total error during divisions and additions is at most 1800 • 2^−29 + 1799 • 2^−18, which is about .006866.
That is a problem. I expect a better analysis of the errors in the additions would halve the error bound, as it is an arithmetic progression from 0 to the total, but that still leaves a potential error above .003, which means there is a possibility the sum could be off by several milliseconds.
Note that if the times are added as integers, the largest potential sum is 1800 • 60 = 108,000, which is well below the first integer not representable in float (16,777,217). Addition of these integers in float would be error-free.
This bound of .003 is small enough that some additional constraints on the problem and some additional analysis might, just might, push it below .0005, in which case the computed result will always be close enough to the correct mathematical result that rounding the computed result to the nearest millisecond would produce the correct answer.
For example, if it were known that, while the times range from 1 to 60 milliseconds, the total is always less than 7.8 seconds, that could suffice.
As much as possible, reduce the errors caused by floating point calculations
Since you've already described measuring your individual timings in milliseconds, it's far better if you accumulate those timings using integer values before you finally divide them:
std::chrono::milliseconds duration{};
for(Timing const& timing : timings) {
//Lossless integer accumulation, in a scenario where overflow is extremely unlikely
//or possibly even impossible for your problem domain
duration += std::chrono::milliseconds(timing.getTicks());
}
//Only one floating-point calculation performed, error is minimal
float averageTiming = duration.count() / float(timings.size());
The errors that accumulate are highly particular to the scenario
Consider these two ways of accumulating values:
#include<iostream>
int main() {
//Make them volatile to prevent compilers from optimizing away the additions
volatile float sum1 = 0, sum2 = 0;
for(float i = 0.0001; i < 1000; i += 0.0001) {
sum1 += i;
}
for(float i = 1000; i > 0; i -= 0.0001) {
sum2 += i;
}
std::cout << "Sum1: " << sum1 << std::endl;
std::cout << "Sum2: " << sum2 << std::endl;
std::cout << "% Difference: " << (sum2 - sum1) / (sum1 > sum2 ? sum1 : sum2) * 100 << "%" << std::endl;
return 0;
}
Results may vary on some machines (particularly machines that don't have IEEE754 floats), but in my tests, the second value was 3% different than the first value, a difference of 13 million. That can be pretty significant.
Like before, the best option is to minimize the number of calculations performed using floating point values until the last possible step before you need them as floating point values. That will minimize accuracy losses.
Just for what it's worth, here's some code to demonstrate that yes, after 1800 items, a simple accumulation can be incorrect by more than 1 millisecond, but Kahan summation maintains the required level of accuracy.
#include <iostream>
#include <iterator>
#include <iomanip>
#include <vector>
#include <numeric>
#include <algorithm>   // std::fill_n
template <class InIt>
typename std::iterator_traits<InIt>::value_type accumulate(InIt begin, InIt end)
{
typedef typename std::iterator_traits<InIt>::value_type real;
real sum = real();
real running_error = real();
for (; begin != end; ++begin)
{
real difference = *begin - running_error;
real temp = sum + difference;
running_error = (temp - sum) - difference;
sum = temp;
}
return sum;
}
int main()
{
const float addend = 0.06f;
const float count = 1800.0f;
std::vector<float> d;
std::fill_n(std::back_inserter(d), count, addend);
float result = std::accumulate(d.begin(), d.end(), 0.0f);
float result2 = accumulate(d.begin(), d.end());
float reference = count * addend;
std::cout << " simple: " << std::setprecision(20) << result << "\n";
std::cout << " Kahan: " << std::setprecision(20) << result2 << "\n";
std::cout << "Reference: " << std::setprecision(20) << reference << "\n";
}
For this particular test, it appears that double precision is sufficient, at least for the input values I tried--but to be honest, I'm still a bit leery of it, especially when exhaustive testing isn't reasonable, and better techniques are easily available.
I am testing out the clock_t functions in C++ and I ran across a problem. I compile on two different compilers: Visual Studio 2012 on my Windows 7 computer, and g++ on a Unix system called "ranger". When I compiled my code in an attempt to output the time in seconds (up to a thousandth of a second) it takes to run different sort functions, it seems that the g++ compiler completely ignores my attempt to divide the time stamp by 1000 in order to convert it from milliseconds to seconds. Any advice? Is there a difference between g++ and Visual Studio's compiler in regard to this?
A short code snippet (output and what I do for the division):
//Select Sort
begin = clock(); //Begin time
selectionSort(array, n);
end = clock(); //End time
d_select = ((float)(end/1000.0) - (float)(begin/1000.0)); //Calculates the time in MS, and converts from MS, to a float second.
//Output data
cout << setprecision(3) << fixed; //Set precision to 3 decimal places, with a fixed output (0's are displayed regardless of rounding)
cout << n << "\t" << d_bubble << "\t" << d_insert << "\t" << d_merge << "\t" << d_quick << "\t"
<< d_select << endl;
Visual Studio output (Correct):
n Bubble Insert Merge Quick Select
100000 12.530 1.320 0.000 0.030 2.900
Unix output (Incorrect):
n Bubble Insert Merge Quick Select
100000 51600.000 11700.000 30.000 150.000 18170.000
Any suggestions? Thanks!
Divide by CLOCKS_PER_SEC, not 1000. On Unix, and POSIX in general, clock() gives a value in microseconds, not milliseconds.
Note that both that value and clock_t are integers; so if you want fractional seconds, convert to a floating-point format before dividing:
d_select = float(end - begin) / CLOCKS_PER_SEC;
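Put together, a minimal version of the timing looks like this (the placeholder loop stands in for selectionSort or whatever is being measured):
#include <ctime>
#include <iostream>

int main() {
    std::clock_t begin = std::clock();

    volatile double sink = 0;                    // placeholder workload
    for (long i = 0; i < 10000000; ++i)
        sink += i * 0.5;

    std::clock_t end = std::clock();

    // CLOCKS_PER_SEC is the only portable divisor: it is 1000 with
    // Visual Studio's CRT but 1000000 on POSIX systems.
    double seconds = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << seconds << " s\n";
    return 0;
}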
Why does this bit of code,
const float x[16] = { 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8,
1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
1.923, 2.034, 2.145, 2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
y[i] = x[i];
}
for (int j = 0; j < 9000000; j++)
{
for (int i = 0; i < 16; i++)
{
y[i] *= x[i];
y[i] /= z[i];
y[i] = y[i] + 0.1f; // <--
y[i] = y[i] - 0.1f; // <--
}
}
run more than 10 times faster than the following bit (identical except where noted)?
const float x[16] = { 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8,
1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
1.923, 2.034, 2.145, 2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
y[i] = x[i];
}
for (int j = 0; j < 9000000; j++)
{
for (int i = 0; i < 16; i++)
{
y[i] *= x[i];
y[i] /= z[i];
y[i] = y[i] + 0; // <--
y[i] = y[i] - 0; // <--
}
}
when compiling with Visual Studio 2010 SP1.
The optimization level was -O2 with SSE2 enabled.
I haven't tested with other compilers.
Welcome to the world of denormalized floating-point! They can wreak havoc on performance!!!
Denormal (or subnormal) numbers are kind of a hack to get some extra values very close to zero out of the floating point representation. Operations on denormalized floating-point can be tens to hundreds of times slower than on normalized floating-point. This is because many processors can't handle them directly and must trap and resolve them using microcode.
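If you want to check whether a value has drifted into that range, std::fpclassify can tell you; here is a small sketch, independent of the benchmark code below:
#include <cmath>
#include <cstdio>
#include <cfloat>

int main() {
    float tiny = 1e-40f;    // below FLT_MIN (~1.18e-38), so it is subnormal
    std::printf("subnormal? %d\n", std::fpclassify(tiny) == FP_SUBNORMAL);   // prints 1
    std::printf("smallest normal float: %g\n", FLT_MIN);
    return 0;
}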
If you print out the numbers after 10,000 iterations, you will see that they have converged to different values depending on whether 0 or 0.1 is used.
Here's the test code compiled on x64:
#include <iostream>
#include <cstdlib>   // system()
#include <omp.h>     // omp_get_wtime()
using namespace std;
int main() {
double start = omp_get_wtime();
const float x[16]={1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5,2.6};
const float z[16]={1.123,1.234,1.345,156.467,1.578,1.689,1.790,1.812,1.923,2.034,2.145,2.256,2.367,2.478,2.589,2.690};
float y[16];
for(int i=0;i<16;i++)
{
y[i]=x[i];
}
for(int j=0;j<9000000;j++)
{
for(int i=0;i<16;i++)
{
y[i]*=x[i];
y[i]/=z[i];
#ifdef FLOATING
y[i]=y[i]+0.1f;
y[i]=y[i]-0.1f;
#else
y[i]=y[i]+0;
y[i]=y[i]-0;
#endif
if (j > 10000)
cout << y[i] << " ";
}
if (j > 10000)
cout << endl;
}
double end = omp_get_wtime();
cout << end - start << endl;
system("pause");
return 0;
}
Output:
#define FLOATING
1.78814e-007 1.3411e-007 1.04308e-007 0 7.45058e-008 6.70552e-008 6.70552e-008 5.58794e-007 3.05474e-007 2.16067e-007 1.71363e-007 1.49012e-007 1.2666e-007 1.11759e-007 1.04308e-007 1.04308e-007
1.78814e-007 1.3411e-007 1.04308e-007 0 7.45058e-008 6.70552e-008 6.70552e-008 5.58794e-007 3.05474e-007 2.16067e-007 1.71363e-007 1.49012e-007 1.2666e-007 1.11759e-007 1.04308e-007 1.04308e-007
//#define FLOATING
6.30584e-044 3.92364e-044 3.08286e-044 0 1.82169e-044 1.54143e-044 2.10195e-044 2.46842e-029 7.56701e-044 4.06377e-044 3.92364e-044 3.22299e-044 3.08286e-044 2.66247e-044 2.66247e-044 2.24208e-044
6.30584e-044 3.92364e-044 3.08286e-044 0 1.82169e-044 1.54143e-044 2.10195e-044 2.45208e-029 7.56701e-044 4.06377e-044 3.92364e-044 3.22299e-044 3.08286e-044 2.66247e-044 2.66247e-044 2.24208e-044
Note how in the second run the numbers are very close to zero.
Denormalized numbers are generally rare and thus most processors don't try to handle them efficiently.
To demonstrate that this has everything to do with denormalized numbers, if we flush denormals to zero by adding this to the start of the code:
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
Then the version with 0 is no longer 10x slower and actually becomes faster. (This requires that the code be compiled with SSE enabled.)
This means that rather than using these weird lower precision almost-zero values, we just round to zero instead.
Timings: Core i7 920 @ 3.5 GHz:
// Don't flush denormals to zero.
0.1f: 0.564067
0 : 26.7669
// Flush denormals to zero.
0.1f: 0.587117
0 : 0.341406
In the end, this really has nothing to do with whether it's an integer or floating-point. The 0 or 0.1f is converted/stored into a register outside of both loops. So that has no effect on performance.
Using gcc and applying a diff to the generated assembly yields only this difference:
73c68,69
< movss LCPI1_0(%rip), %xmm1
---
> movabsq $0, %rcx
> cvtsi2ssq %rcx, %xmm1
81d76
< subss %xmm1, %xmm0
The cvtsi2ssq one being 10 times slower indeed.
Apparently, the float version uses an XMM register loaded from memory, while the int version converts a real int value 0 to float using the cvtsi2ssq instruction, taking a lot of time. Passing -O3 to gcc doesn't help. (gcc version 4.2.1.)
(Using double instead of float doesn't matter, except that it changes the cvtsi2ssq into a cvtsi2sdq.)
Update
Some extra tests show that it is not necessarily the cvtsi2ssq instruction. Once it is eliminated (using `int ai = 0; float a = ai;` and using `a` instead of `0`), the speed difference remains. So @Mysticial is right: the denormalized floats make the difference. This can be seen by testing values between 0 and 0.1f. The turning point in the above code is approximately at 0.00000000000000000000000000000001, when the loop suddenly takes 10 times as long.
Update<<1
A small visualisation of this interesting phenomenon:
Column 1: a float, divided by 2 for every iteration
Column 2: the binary representation of this float
Column 3: the time taken to sum this float 1e7 times
You can clearly see the exponent (the last 9 bits) change to its lowest value, when denormalization sets in. At that point, simple addition becomes 20 times slower.
0.000000000000000000000000000000000100000004670110: 10111100001101110010000011100000 45 ms
0.000000000000000000000000000000000050000002335055: 10111100001101110010000101100000 43 ms
0.000000000000000000000000000000000025000001167528: 10111100001101110010000001100000 43 ms
0.000000000000000000000000000000000012500000583764: 10111100001101110010000110100000 42 ms
0.000000000000000000000000000000000006250000291882: 10111100001101110010000010100000 48 ms
0.000000000000000000000000000000000003125000145941: 10111100001101110010000100100000 43 ms
0.000000000000000000000000000000000001562500072970: 10111100001101110010000000100000 42 ms
0.000000000000000000000000000000000000781250036485: 10111100001101110010000111000000 42 ms
0.000000000000000000000000000000000000390625018243: 10111100001101110010000011000000 42 ms
0.000000000000000000000000000000000000195312509121: 10111100001101110010000101000000 43 ms
0.000000000000000000000000000000000000097656254561: 10111100001101110010000001000000 42 ms
0.000000000000000000000000000000000000048828127280: 10111100001101110010000110000000 44 ms
0.000000000000000000000000000000000000024414063640: 10111100001101110010000010000000 42 ms
0.000000000000000000000000000000000000012207031820: 10111100001101110010000100000000 42 ms
0.000000000000000000000000000000000000006103515209: 01111000011011100100001000000000 789 ms
0.000000000000000000000000000000000000003051757605: 11110000110111001000010000000000 788 ms
0.000000000000000000000000000000000000001525879503: 00010001101110010000100000000000 788 ms
0.000000000000000000000000000000000000000762939751: 00100011011100100001000000000000 795 ms
0.000000000000000000000000000000000000000381469876: 01000110111001000010000000000000 896 ms
0.000000000000000000000000000000000000000190734938: 10001101110010000100000000000000 813 ms
0.000000000000000000000000000000000000000095366768: 00011011100100001000000000000000 798 ms
0.000000000000000000000000000000000000000047683384: 00110111001000010000000000000000 791 ms
0.000000000000000000000000000000000000000023841692: 01101110010000100000000000000000 802 ms
0.000000000000000000000000000000000000000011920846: 11011100100001000000000000000000 809 ms
0.000000000000000000000000000000000000000005961124: 01111001000010000000000000000000 795 ms
0.000000000000000000000000000000000000000002980562: 11110010000100000000000000000000 835 ms
0.000000000000000000000000000000000000000001490982: 00010100001000000000000000000000 864 ms
0.000000000000000000000000000000000000000000745491: 00101000010000000000000000000000 915 ms
0.000000000000000000000000000000000000000000372745: 01010000100000000000000000000000 918 ms
0.000000000000000000000000000000000000000000186373: 10100001000000000000000000000000 881 ms
0.000000000000000000000000000000000000000000092486: 01000010000000000000000000000000 857 ms
0.000000000000000000000000000000000000000000046243: 10000100000000000000000000000000 861 ms
0.000000000000000000000000000000000000000000022421: 00001000000000000000000000000000 855 ms
0.000000000000000000000000000000000000000000011210: 00010000000000000000000000000000 887 ms
0.000000000000000000000000000000000000000000005605: 00100000000000000000000000000000 799 ms
0.000000000000000000000000000000000000000000002803: 01000000000000000000000000000000 828 ms
0.000000000000000000000000000000000000000000001401: 10000000000000000000000000000000 815 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 44 ms
An equivalent discussion about ARM can be found in Stack Overflow question Denormalized floating point in Objective-C?.
It is due to the use of denormalized floating-point. How do you get rid of both it and the performance penalty? Having scoured the Internet for ways of killing denormal numbers, it seems there is no single "best" way to do this yet. I have found these three methods that may work best in different environments:
Might not work in some GCC environments:
// Requires #include <fenv.h>
fesetenv(FE_DFL_DISABLE_SSE_DENORMS_ENV);
Might not work in some Visual Studio environments:
// Requires #include <xmmintrin.h>
_mm_setcsr( _mm_getcsr() | (1<<15) | (1<<6) );
// Does both FTZ and DAZ bits. You can also use just hex value 0x8040 to do both.
// You might also want to use the underflow mask (1<<11)
Appears to work in both GCC and Visual Studio:
// Requires #include <xmmintrin.h>
// Requires #include <pmmintrin.h>
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
The Intel compiler has options to disable denormals by default on modern Intel CPUs. More details here
Compiler switches. -ffast-math, -msse or -mfpmath=sse will disable denormals and make a few other things faster, but unfortunately also do lots of other approximations that might break your code. Test carefully! The equivalent of fast-math for the Visual Studio compiler is /fp:fast, but I haven't been able to confirm whether this also disables denormals.
Dan Neely's comment ought to be expanded into an answer:
It is not the zero constant 0.0f that is denormalized or causes a slow down, it is the values that approach zero each iteration of the loop. As they come closer and closer to zero, they need more precision to represent and they become denormalized. These are the y[i] values. (They approach zero because x[i]/z[i] is less than 1.0 for all i.)
The crucial difference between the slow and fast versions of the code is the statement y[i] = y[i] + 0.1f;. As soon as this line is executed each iteration of the loop, the extra precision in the float is lost, and the denormalization needed to represent that precision is no longer needed. Afterwards, floating point operations on y[i] remain fast because they aren't denormalized.
Why is the extra precision lost when you add 0.1f? Because floating point numbers only have so many significant digits. Say you have enough storage for three significant digits, then 0.00001 = 1e-5, and 0.00001 + 0.1 = 0.1, at least for this example float format, because it doesn't have room to store the least significant bit in 0.10001.
In short, y[i]=y[i]+0.1f; y[i]=y[i]-0.1f; isn't the no-op you might think it is.
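A tiny sketch to see that in isolation (nothing here depends on the benchmark code above):
#include <cstdio>

int main() {
    float y = 1e-40f;        // subnormal, far smaller than half an ulp of 0.1f (~3.7e-9)
    float z = y + 0.1f;      // rounds to exactly 0.1f; y's contribution is lost
    float back = z - 0.1f;   // exactly 0.0f, not the original subnormal value
    std::printf("y = %g, (y + 0.1f) - 0.1f = %g\n", y, back);   // y = 1e-40, result = 0
    return 0;
}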
Mysticial said this as well: the content of the floats matters, not just the assembly code.
EDIT: To put a finer point on this, not every floating point operation takes the same amount of time to run, even if the machine opcode is the same. For some operands/inputs, the same instruction will take more time to run. This is especially true for denormal numbers.
In gcc you can enable FTZ and DAZ with this:
#include <xmmintrin.h>
#define FTZ 1
#define DAZ 1
void enableFtzDaz()
{
int mxcsr = _mm_getcsr ();
if (FTZ) {
mxcsr |= (1<<15) | (1<<11);
}
if (DAZ) {
mxcsr |= (1<<6);
}
_mm_setcsr (mxcsr);
}
also use gcc switches: -msse -mfpmath=sse
(corresponding credits to Carl Hetherington [1])
[1] http://carlh.net/plugins/denormals.php
CPUs have been only slightly slower on denormal numbers for a long time now. My Zen 2 CPU needs five clock cycles for a computation with denormal inputs and denormal outputs, and four clock cycles with a normalized number.
This is a small benchmark written with Visual C++ to show the slight performance-degrading effect of denormal numbers:
#include <iostream>
#include <cstdint>
#include <chrono>
#include <cmath>     // trunc
using namespace std;
using namespace chrono;
uint64_t denScale( uint64_t rounds, bool den );
int main()
{
auto bench = []( bool den ) -> double
{
constexpr uint64_t ROUNDS = 25'000'000;
auto start = high_resolution_clock::now();
int64_t nScale = denScale( ROUNDS, den );
return (double)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / nScale;
};
double
tDen = bench( true ),
tNorm = bench( false ),
rel = tDen / tNorm - 1;
cout << tDen << endl;
cout << tNorm << endl;
cout << trunc( 100 * 10 * rel + 0.5 ) / 10 << "%" << endl;
}
This is the MASM assembly part.
PUBLIC ?denScale@@YA_K_K_N@Z
CONST SEGMENT
DEN DQ 00008000000000000h
ONE DQ 03FF0000000000000h
P5 DQ 03fe0000000000000h
CONST ENDS
_TEXT SEGMENT
?denScale@@YA_K_K_N@Z PROC
xor rax, rax
test rcx, rcx
jz byeBye
mov r8, ONE
mov r9, DEN
test dl, dl
cmovnz r8, r9
movq xmm1, P5
mov rax, rcx
loopThis:
movq xmm0, r8
REPT 52
mulsd xmm0, xmm1
ENDM
sub rcx, 1
jae loopThis
mov rdx, 52
mul rdx
byeBye:
ret
?denScale@@YA_K_K_N@Z ENDP
_TEXT ENDS
END
It would be nice to see some results in the comments.