I'm trying to use the __rdtscp intrinsic to measure time intervals. The target platform is Linux x64 with an Intel Xeon X5550 CPU. Although the constant_tsc flag is set for this processor, calibrating __rdtscp gives very different results:
$ taskset -c 1 ./ticks
Ticks per usec: 256
$ taskset -c 1 ./ticks
Ticks per usec: 330.667
$ taskset -c 1 ./ticks
Ticks per usec: 345.043
$ taskset -c 1 ./ticks
Ticks per usec: 166.054
$ taskset -c 1 ./ticks
Ticks per usec: 256
$ taskset -c 1 ./ticks
Ticks per usec: 345.043
$ taskset -c 1 ./ticks
Ticks per usec: 256
$ taskset -c 1 ./ticks
Ticks per usec: 330.667
$ taskset -c 1 ./ticks
Ticks per usec: 256
$ taskset -c 1 ./ticks
Ticks per usec: 330.667
$ taskset -c 1 ./ticks
Ticks per usec: 330.667
$ taskset -c 1 ./ticks
Ticks per usec: 345.043
$ taskset -c 1 ./ticks
Ticks per usec: 256
$ taskset -c 1 ./ticks
Ticks per usec: 125.388
$ taskset -c 1 ./ticks
Ticks per usec: 360.727
$ taskset -c 1 ./ticks
Ticks per usec: 345.043
As you can see, the results differ by up to a factor of about 3 between runs (125-360). Such instability makes the counter unusable for any measurements.
Here is the code (gcc 4.9.3, running on Oracle Linux 6.6, kernel 3.8.13-55.1.2.el6uek.x86_64):
// g++ -O3 -std=c++11 -Wall ticks.cpp -o ticks
#include <x86intrin.h>
#include <ctime>
#include <cstdint>
#include <iostream>
int main()
{
    timespec start, end;
    uint64_t s = 0;

    const double rdtsc_ticks_per_usec = [&]()
    {
        unsigned int dummy;

        clock_gettime(CLOCK_MONOTONIC, &start);
        uint64_t rd_start = __rdtscp(&dummy);

        for (size_t i = 0; i < 1000000; ++i) ++s;

        uint64_t rd_end = __rdtscp(&dummy);
        clock_gettime(CLOCK_MONOTONIC, &end);

        double usec_dur = double(end.tv_sec) * 1E6 + end.tv_nsec / 1E3;
        usec_dur -= double(start.tv_sec) * 1E6 + start.tv_nsec / 1E3;

        return (double)(rd_end - rd_start) / usec_dur;
    }();

    std::cout << s << std::endl;
    std::cout << "Ticks per usec: " << rdtsc_ticks_per_usec << std::endl;
    return 0;
}
When I run a very similar program under Windows 7 (i7-4470, VS2015), the calibration result is quite stable, differing only in the last digit.
So the question is: what causes this? Is it a CPU issue, a Linux issue, or an issue in my code?
Other sources of jitter will remain if you don't also ensure the CPU is isolated. You really want to avoid having another process scheduled onto that core.
Also, ideally you should run a tickless kernel so that kernel code never runs on that core at all. In the above code, I'd guess that only matters if you're unlucky enough to get the tick or a context switch between the call to clock_gettime() and __rdtscp.
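For example, on a kernel built with NO_HZ_FULL you could combine pinning with boot-time isolation. A sketch, assuming core 1 is the measurement core (these are standard kernel boot parameters, not anything specific to this program):

# kernel boot command line: keep the scheduler, the tick and RCU callbacks off core 1
isolcpus=1 nohz_full=1 rcu_nocbs=1

# then pin the benchmark to the isolated core
$ taskset -c 1 ./ticks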
Making s volatile is another way to defeat that kind of compiler optimisation.
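For instance, a minimal sketch of that approach:

#include <cstdint>
#include <cstddef>

int main()
{
    volatile uint64_t s = 0;                  // volatile: every increment must actually happen
    for (size_t i = 0; i < 1000000; ++i)
        ++s;                                  // can no longer be folded into s = 1000000
    return 0;
}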
It was definitely an issue with my code (or with what gcc did to it): the compiler optimized the loop out entirely, replacing it with s = 1000000. To prevent gcc from doing that, the calibration loop should be changed like this:
for (size_t i = 0; i < 1000000; ++i) s += i;
Or, more simply and correctly (thanks to Hal):
for (volatile size_t i = 0; i < 1000000; ++i);
In the application I am working on, a colleague noticed that some results change minimally when compiling with -march=native (on a Skylake CPU) versus the less aggressive -march=x86-64. I therefore tried to come up with an MWE to clarify the compiler behaviour for myself (of course the real case is way more complex, hence the MWE looks stupid, but that's not the point). Before entering details, let me state my question, which can be split in two parts:
Part 1: How is -fexpensive-optimizations related to -ffp-contract? Does the former in some sense imply the latter?
Part 2: Since the default for floating-point expression contraction is -ffp-contract=fast, independently of the optimisation level, why does changing it to off with -O2 in the MWE below fix the discrepancy? This is a way of rephrasing my comment about the manual description of this flag (see below).
Bonus question: Why, in my MWE, does reducing the std::vector to one entry change the decimal representation, while leaving the std::vector out makes the discrepancy go away entirely?
Consider the following MWE:
#include <iostream>
#include <iomanip>
#include <array>
#include <vector>
double inline sqr(std::array<double, 4> x) {
    return x[0] * x[0] - x[1] * x[1] - x[2] * x[2] - x[3] * x[3];
}

int main(){
    //std::vector<std::array<double, 4>> v = {{0.6, -0.3, -0.5, -0.5}}; // <-- for bonus question
    std::vector<std::array<double, 4>> v = {{0.6, -0.3, -0.5, -0.5}, {0.6, -0.3, -0.5, -0.5}};
    for(const auto& a : v)
        std::cout << std::setprecision(18) << std::fixed << std::showpos
                  << sqr(a) << " -> " << std::hexfloat << sqr(a) << std::defaultfloat << '\n';
}
If compiled with optimisations up to -O1, using the GNU compiler g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, results are the same targeting both architectures, while they differ starting with -O2:
$ for ARCH in native x86-64; do g++ -march=${ARCH} -O1 -std=c++17 -o mwe_${ARCH} mwe.cpp; done; diff <(./mwe_native) <(./mwe_x86-64)
$ for ARCH in native x86-64; do g++ -march=${ARCH} -O2 -std=c++17 -o mwe_${ARCH} mwe.cpp; done; diff <(./mwe_native) <(./mwe_x86-64)
1,2c1,2
< -0.230000000000000038 -> -0x1.d70a3d70a3d72p-3
< -0.230000000000000038 -> -0x1.d70a3d70a3d72p-3
---
> -0.229999999999999982 -> -0x1.d70a3d70a3d7p-3
> -0.229999999999999982 -> -0x1.d70a3d70a3d7p-3
I then went through all the -O2 options listed in the manual, and it turned out that -fexpensive-optimizations is what triggers the difference:
$ for ARCH in native x86-64; do g++ -march=${ARCH} -O1 -fexpensive-optimizations -std=c++17 -o mwe_${ARCH} mwe.cpp; done; diff <(./mwe_native) <(./mwe_x86-64)
1,2c1,2
< -0.230000000000000038 -> -0x1.d70a3d70a3d72p-3
< -0.230000000000000038 -> -0x1.d70a3d70a3d72p-3
---
> -0.229999999999999982 -> -0x1.d70a3d70a3d7p-3
> -0.229999999999999982 -> -0x1.d70a3d70a3d7p-3
$ for ARCH in native x86-64; do g++ -march=${ARCH} -O2 -fno-expensive-optimizations -std=c++17 -o mwe_${ARCH} mwe.cpp; done; diff <(./mwe_native) <(./mwe_x86-64)
$
I then tried to understand the meaning of the option and found this SO answer, which did not shed much light on the situation. After more research it turned out that the -ffp-contract option also makes the discrepancy go away when explicitly turned off. In particular, FMA operations seem to be responsible for the differences:
$ for ARCH in native x86-64; do g++ -march=${ARCH} -O2 -mfma -std=c++17 -o mwe_${ARCH} mwe.cpp; done; sdiff <(./mwe_native) <(./mwe_x86-64)
-0.230000000000000038 -> -0x1.d70a3d70a3d72p-3 -0.230000000000000038 -> -0x1.d70a3d70a3d72p-3
-0.230000000000000038 -> -0x1.d70a3d70a3d72p-3 -0.230000000000000038 -> -0x1.d70a3d70a3d72p-3
$ for ARCH in native x86-64; do g++ -march=${ARCH} -O2 -ffp-contract=off -std=c++17 -o mwe_${ARCH} mwe.cpp; done; sdiff <(./mwe_native) <(./mwe_x86-64)
-0.229999999999999982 -> -0x1.d70a3d70a3d7p-3 -0.229999999999999982 -> -0x1.d70a3d70a3d7p-3
-0.229999999999999982 -> -0x1.d70a3d70a3d7p-3 -0.229999999999999982 -> -0x1.d70a3d70a3d7p-3
In the previous output, note how -mfma makes the x86-64 result identical to the native one, while -ffp-contract=off does the reverse (the difference is in the last digit of the hex representation).
The manual for -ffp-contract reads:
-ffp-contract=style
-ffp-contract=off disables floating-point expression contraction. -ffp-contract=fast enables floating-point expression contraction such as
forming of fused multiply-add operations if the target has native support for them. -ffp-contract=on enables floating-point expression
contraction if allowed by the language standard. This is currently not implemented and treated equal to -ffp-contract=off.
The default is -ffp-contract=fast.
I assume that when specifying -march=x86-64 the manual's "if the target has native support for them" does not apply, so FMA operations are not used. Please correct me if this is wrong.
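To see what contraction means numerically, here is a sketch that emulates one possible fused evaluation of sqr with std::fma, which rounds a*b+c only once (the compiler may pick a different association, so this illustrates the effect rather than reproducing GCC's exact choice):

#include <cmath>
#include <cstdio>

int main() {
    const double x0 = 0.6, x1 = -0.3, x2 = -0.5, x3 = -0.5;

    // Plain evaluation: every product is rounded before the subtractions.
    double plain = x0 * x0 - x1 * x1 - x2 * x2 - x3 * x3;

    // One possible contracted evaluation: std::fma(a, b, c) computes a*b+c
    // with a single rounding, like the hardware FMA instructions do.
    double fused = std::fma(x0, x0,
                   -std::fma(x1, x1,
                    std::fma(x2, x2, x3 * x3)));

    std::printf("plain: %a\nfused: %a\n", plain, fused); // may differ in the last bit
}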
NOTE: SO suggested having a look at this question, but I doubt there is any UB in the code above. I should also say that I tried with clang-14 (which changed the default to -ffp-contract=on) and reproduced the discrepancies, while with clang-13 (whose default is -ffp-contract=off) no discrepancy appears.
I was trying to understand the overhead of using std::optional in interface design: whether it's worth using std::optional in a single interface rather than a combination of "isNull and getX". The main concern is the overhead of std::optional, so I crafted this simple program to test it out.
The code snippet and test results are summarized in this gist https://gist.github.com/shawncao/fa801d300c4a0a2826240fd9f5961db2
Surprisingly, there are two things I don't understand, and I would like to hear from experts:
On Linux with gcc (8 and 9 both tested), I consistently see std::optional doing better than the direct condition, which I don't understand.
On Mac (my laptop), the program executes much faster than on Linux (12ms vs >60ms), and I don't see why (is it the Apple LLVM implementation?). From the CPU info I don't think the Mac is faster; maybe I'm missing something.
Any pointers will be appreciated.
Basically, I want to understand the runtime difference between these two platforms: maybe it's processor related, maybe compiler related, I'm not sure. A repro app is attached in the gist. Also, I don't quite understand why std::optional is faster on Linux with GCC.
Code Snippet (copied from the gist based on suggestion)
inline bool f1ni(int i) {
    return i % 2 == 0;
}

TEST(OptionalTest, TestOptionalPerf) {
    auto f1 = [](int i) -> int {
        return i + 1;
    };

    auto f2 = [](int i) -> std::optional<int> {
        if (i % 2 == 0) {
            return {};
        }
        return i + 1;
    };

    // run 100M cycles
    constexpr auto cycles = 100000000;

    nebula::common::Evidence::Duration duration;
    long sum2 = 0;
    for (int i = 0; i < cycles; ++i) {
        auto x = f2(i);
        if (x) {
            sum2 += x.value();
        }
    }
    LOG(INFO) << fmt::format("optional approach: sum={0}, time={1}", sum2, duration.elapsedMs());

    duration.reset();
    long sum1 = 0;
    for (int i = 0; i < cycles; ++i) {
        if (i % 2 != 0) {
            sum1 += f1(i);
        }
    }
    LOG(INFO) << fmt::format("special value approach: sum={0}, time={1}", sum1, duration.elapsedMs());

    EXPECT_EQ(sum1, sum2);
}
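For anyone who wants to reproduce this without the nebula/glog/fmt dependencies, here is a self-contained sketch of the same comparison using std::chrono (compile with -O3 -std=c++17; absolute numbers will of course vary):

#include <chrono>
#include <cstdio>
#include <optional>

int main() {
    constexpr int cycles = 100000000;
    auto now = [] { return std::chrono::steady_clock::now(); };
    auto ms = [](auto d) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };

    auto f2 = [](int i) -> std::optional<int> {
        if (i % 2 == 0) return {};
        return i + 1;
    };

    long sum2 = 0;
    auto t0 = now();
    for (int i = 0; i < cycles; ++i)
        if (auto x = f2(i)) sum2 += *x;          // optional approach
    std::printf("optional approach:      sum=%ld, time=%lldms\n",
                sum2, (long long)ms(now() - t0));

    long sum1 = 0;
    auto t1 = now();
    for (int i = 0; i < cycles; ++i)
        if (i % 2 != 0) sum1 += i + 1;           // special value approach
    std::printf("special value approach: sum=%ld, time=%lldms\n",
                sum1, (long long)ms(now() - t1));
}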
Laptop: Mac, macOS Mojave
Processor: 2.6 GHz Intel Core i7
/usr/bin/g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/c++/4.2.1
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
[Test Output]
I0812 09:36:20.678683 348009920 TestCommon.cpp:354] optional approach: sum=2500000050000000, time=13
I0812 09:36:20.692001 348009920 TestCommon.cpp:363] special value approach: sum=2500000050000000, time=12
Linux Ubuntu 18.04 LTS
36 cores; one of them:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping : 4
microcode : 0x100014a
cpu MHz : 1685.329
cache size : 25344 KB
physical id : 0
siblings : 36
core id : 0
cpu cores : 18
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
/usr/bin/g++ --version
g++ (Ubuntu 9.1.0-2ubuntu2~18.04) 9.1.0
Copyright (C) 2019 Free Software Foundation, Inc.
[Test Output]
I0812 16:42:07.391607 15080 TestCommon.cpp:354] optional approach: sum=2500000050000000, time=59
I0812 16:42:07.473335 15080 TestCommon.cpp:363] special value approach: sum=2500000050000000, time=81
===== build flags =====
[build commands capture for Mac]
/usr/bin/g++ -I/Users/shawncao/nebula/include -I/Users/shawncao/nebula/src -I/usr/local/Cellar/boost/1.70.0/include -I/usr/local/Cellar/folly/2019.08.05.00/include -I/Users/shawncao/nebula/build/cuckoofilter/src/cuckoofilter/src -I/Users/shawncao/nebula/build/fmtproj-prefix/src/fmtproj/include -I/Users/shawncao/nebula/src/service/gen/nebula -isystem /usr/local/include -isystem /Users/shawncao/nebula/build/gflagsp-prefix/src/gflagsp-build/include -isystem /Users/shawncao/nebula/build/glogp-prefix/src/glogp-build -isystem /Users/shawncao/nebula/build/roaringproj-prefix/src/roaringproj/include -isystem /Users/shawncao/nebula/build/roaringproj-prefix/src/roaringproj/cpp -isystem /Users/shawncao/nebula/build/yomm2-prefix/src/yomm2/include -isystem /Users/shawncao/nebula/build/xxhash-prefix/src/xxhash -isystem /Users/shawncao/nebula/build/bloom/src/bloom -isystem /usr/local/Cellar/openssl/1.0.2s/include -Wall -Wextra -Werror -Wno-error=nullability-completeness -Wno-error=sign-compare -Wno-error=unknown-warning-option -O3 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk -std=gnu++1z -o CMakeFiles/CommonTests.dir/src/common/test/TestExts.cpp.o -c /Users/shawncao/nebula/src/common/test/TestExts.cpp
[100%] Linking CXX executable CommonTests
/Applications/CMake.app/Contents/bin/cmake -E cmake_link_script CMakeFiles/CommonTests.dir/link.txt --verbose=1
/usr/bin/g++ -Wall -Wextra -Werror -Wno-error=nullability-completeness -Wno-error=sign-compare -Wno-error=unknown-warning-option -O3 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk -Wl,-search_paths_first -Wl,-headerpad_max_install_names CMakeFiles/CommonTests.dir/src/common/test/TestCommon.cpp.o CMakeFiles/CommonTests.dir/src/common/test/TestExts.cpp.o -o CommonTests -Wl,-rpath,/Users/shawncao/nebula/build/bloom/src/bloom-build/lib libNCommon.a yomm2-prefix/src/yomm2-build/src/libyomm2.a googletest-prefix/src/googletest-build/lib/libgtest.a googletest-prefix/src/googletest-build/lib/libgtest_main.a xxhash-prefix/src/xxhash/libxxhash.a roaringproj-prefix/src/roaringproj-build/src/libroaring.a glogp-prefix/src/glogp-build/libglog.a gflagsp-prefix/src/gflagsp-build/lib/libgflags.a bloom/src/bloom-build/lib/libbf.dylib libcflib.a /usr/local/Cellar/openssl/1.0.2s/lib/libssl.a fmtproj-prefix/src/fmtproj-build/libfmt.a
[build commands capture for Linux]
/usr/bin/g++ -Wall -Wextra -Werror -lstdc++ -Wl,--no-as-needed -no-pie -ldl -lunwind -I/usr/include/ -I/usr/local/include -L/usr/local/lib -L/usr/lib -O3 -s CMakeFiles/CommonTests.dir/src/common/test/TestCommon.cpp.o CMakeFiles/CommonTests.dir/src/common/test/TestExts.cpp.o -o CommonTests -Wl,-rpath,/home/shawncao/nebula/build/bloom/src/bloom-build/lib libNCommon.a yomm2-prefix/src/yomm2-build/src/libyomm2.a /usr/local/lib/libgtest.a /usr/local/lib/libgtest_main.a xxhash-prefix/src/xxhash/libxxhash.a roaringproj-prefix/src/roaringproj-build/src/libroaring.a /usr/local/lib/libglog.a /usr/local/lib/libgflags.a bloom/src/bloom-build/lib/libbf.so libcflib.a /usr/local/lib/libssl.a /usr/local/lib/libcrypto.a fmtproj-prefix/src/fmtproj-build/libfmt.a -lpthread
I am creating a testing utility that makes heavy use of the sqrt() function. After digging into possible optimisations, I decided to try inline assembler in C++. The code is:
#include <iostream>
#include <cstdlib>
#include <cmath>
#include <ctime>
using namespace std;
volatile double normalSqrt(double a){
    double b = 0;
    for(int i = 0; i < ITERATIONS; i++){
        b = sqrt(a);
    }
    return b;
}

volatile double asmSqrt(double a){
    double b = 0;
    for(int i = 0; i < ITERATIONS; i++){
        asm volatile(
            "movq %1, %%xmm0 \n"
            "sqrtsd %%xmm0, %%xmm1 \n"
            "movq %%xmm1, %0 \n"
            : "=r"(b)
            : "g"(a)
            : "xmm0", "xmm1", "memory"
        );
    }
    return b;
}

int main(int argc, char *argv[]){
    double a = atoi(argv[1]);
    double c;
    std::clock_t start;
    double duration;

    start = std::clock();
    c = asmSqrt(a);
    duration = std::clock() - start;
    cout << "asm sqrt: " << c << endl;
    cout << duration << " clocks" << endl;
    cout << "Start: " << start << " end: " << start + duration << endl;

    start = std::clock();
    c = normalSqrt(a);
    duration = std::clock() - start;
    cout << endl << "builtin sqrt: " << c << endl;
    cout << duration << " clocks" << endl;
    cout << "Start: " << start << " end: " << start + duration << endl;

    return 0;
}
I am compiling this code using this script that sets number of iterations, starts profiling, and opens profiling output in VIM:
#!/bin/bash

DEFAULT_ITERATIONS=1000000

if [ $# -eq 1 ]; then
    echo "Setting ITERATIONS to $1"
    DEFAULT_ITERATIONS=$1
else
    echo "Using default value: $DEFAULT_ITERATIONS"
fi

rm -rf asd
g++ -msse4 -std=c++11 -O0 -ggdb -pg -DITERATIONS=$DEFAULT_ITERATIONS test.cpp -o asd
./asd 16
gprof asd gmon.out > output.txt
vim -O output.txt
true
The output is:
Using default value: 1000000
asm sqrt: 4
3802 clocks
Start: 1532 end: 5334
builtin sqrt: 4
5501 clocks
Start: 5402 end: 10903
The question is: why does the sqrtsd instruction take only 3802 clocks to compute the square root of 16, while sqrt() takes 5501 clocks?
Does it have something to do with the hardware implementation of certain instructions? Thank you.
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 21
Model: 48
Model name: AMD A8-7600 Radeon R7, 10 Compute Cores 4C+6G
Stepping: 1
CPU MHz: 3100.000
CPU max MHz: 3100,0000
CPU min MHz: 1400,0000
BogoMIPS: 6188.43
Virtualization: AMD-V
L1d cache: 16K
L1i cache: 96K
L2 cache: 2048K
NUMA node0 CPU(s): 0-3
Floating-point arithmetic has to take rounding into consideration. Most C/C++ compilers adopt IEEE 754, so they have an "ideal" algorithm for operations such as square root. They are then free to optimize, but they must return the same result down to the last bit, in all cases. So their freedom to optimize is not complete; in fact it is severely constrained.
Your hand-rolled version is probably off by a bit or two some of the time. That could be completely negligible for some users, but it could also cause nasty bugs for others, so it's not allowed by default.
If you care more about speed than standard compliance, try poking around with your compiler's options. For instance, in GCC the first one I'd try is -funsafe-math-optimizations, which enables optimizations that disregard strict standard compliance. Once you tweak things enough, you should come close to, and possibly surpass, your handmade implementation's speed.
Ignoring the other problems, it will still be the case that sqrt() is a bit slower than sqrtsd unless you compile with specific flags.
sqrt() potentially has to set errno, so it has to check for that case. It still boils down to the native square-root instruction on any reasonable compiler, but with a little overhead. Not as much overhead as your flawed test showed, but still some.
You can see that in action here.
Some compile flags suppress this check. For GCC, for example: -fno-math-errno and -ffinite-math-only.
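To make the errno point concrete, here is a small sketch; it assumes math_errhandling includes MATH_ERRNO, as it does with glibc by default:

#include <cerrno>
#include <cmath>
#include <cstdio>

int main() {
    errno = 0;
    double r = std::sqrt(-1.0);           // domain error: sqrt of a negative number
    if (errno == EDOM)
        std::printf("sqrt(-1.0) = %f, errno = EDOM\n", r);
    // Compiling with -fno-math-errno tells the compiler that sqrt() never
    // sets errno, so the call can collapse to a bare sqrtsd instruction.
    return 0;
}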
I'm converting a C++ system from Solaris (Sun box and Solaris compiler) to Linux (Intel box and gcc compiler). I'm running into several problems when dealing with large "long double" values. (We use "long double" because of some very, very large integers, not for decimal precision.) It manifests itself in several weird ways, but I've simplified it to the following program. It tries to increment a number but doesn't. I don't get any compile or runtime errors; it just doesn't increment the number.
I've also tried a few different compiler switches (-malign-double and -m128bit-long-double, in various combinations turned on and off), but it makes no difference.
I've run this in gdb too and gdb's "print" command shows the same value as the cout statement.
Anyone seen this behavior?
Thanks
compile commands
$ /usr/bin/c++ --version
c++ (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4)
$ /usr/bin/c++ -g -Wall -fPIC -c SimpleLongDoubleTest.C -o SimpleLongDoubleTest.o
$ /usr/bin/c++ -g SimpleLongDoubleTest.o -o SimpleLongDoubleTest
$ ./SimpleLongDoubleTest
Maximum value for long double: 1.18973e+4932
digits 10 = 18
ld = 1268035319515045691392
ld = 1268035319515045691392
ld = 1268035319515045691392
ld = 1268035319515045691392
ld = 1268035319515045691392
SimpleLongDoubleTest.C
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <limits>
#include <iomanip>
int main( int argc, char* argv[] )
{
    std::cout << "Maximum value for long double: "
              << std::numeric_limits<long double>::max() << '\n';
    std::cout << "digits 10 = " << std::numeric_limits<long double>::digits10
              << std::endl;

    // this doesn't work (there might be smaller numbers that also don't work...
    // That is, I'm not sure of the exact number between this and the number defined
    // below where things break)
    long double ld = 1268035319515045691392.0L ;

    // but this or any smaller number works (there might be larger numbers that
    // work... That is, I'm not sure of the exact number between this and the number
    // defined above where things break)
    //long double ld = 268035319515045691392.0L ;

    for ( int i = 0 ; i < 5 ; i++ )
    {
        ld++ ;
        std::cout << std::setiosflags( std::ios::fixed )
                  << std::setprecision( 0 )
                  << "ld = " << ld
                  << std::endl ;
    }
}
This is expected behavior. float, double, long double, etc. are internally represented in the form 1.xxxxx * 2^(exp - bias), where xxxxx is an N-digit binary fraction: N = 23 for float, 52 for double, and possibly 64 for long double. Once the number grows beyond 2^N, it is no longer possible to add 1 to such a variable; you can only add multiples of the value's spacing, 2^(exp - bias - N).
It's also possible that your architecture treats long double the same as double (even though x86 can use 80-bit extended precision internally).
See also the Wikipedia article; a 128-bit long double is the exception rather than the norm (SPARC supports it).
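You can check the spacing (the ULP) at that magnitude directly. A sketch, using the long double overload of std::nextafter:

#include <cmath>
#include <iostream>
#include <limits>

int main() {
    long double ld = 1268035319515045691392.0L;
    // Distance from ld to the next representable long double above it.
    long double ulp = std::nextafter(ld, std::numeric_limits<long double>::max()) - ld;
    std::cout << "ULP at this magnitude: " << ulp << '\n';
    // If this prints a value greater than 2, ld++ rounds straight back to ld.
}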
There are two ways I found to get a whole number from a division in C++. The question is which way is more efficient (faster).
First way:
Quotient = value1 / value2; // normal division, may leave a fractional part
floor(Quotient);            // round the number down to the nearest integer
Second way:
Rest = value1 % value2;               // get the remainder with the modulus operator %
Quotient = (value1 - Rest) / value2;  // subtract the remainder so the division is exact
Also, please demonstrate how to find out which method is faster.
If you're dealing with integers, then the usual way is
Quotient = value1 / value2;
That's it. The result is already an integer, so there is no need for the floor(Quotient); statement. It has no effect anyway; you would want Quotient = floor(Quotient); if it were needed.
If you have floating-point numbers, then the second method won't work at all, since % is only defined for integers. But what does it mean to get a whole number from a division of real numbers? What integer do you get when you divide 8.5 by 3.2? Does the question even make sense?
As a side note, the thing you call "Rest" is normally called the "remainder".
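To make those points concrete, a small sketch (including one caveat worth knowing: for negative operands, integer / truncates toward zero rather than flooring):

#include <cstdio>

int main() {
    int q = 7 / 2;   // 3: integer division already discards the fraction
    int r = 7 % 2;   // 1: the remainder
    std::printf("7 / 2 = %d, 7 %% 2 = %d\n", q, r);

    // Caveat: -7 / 2 == -3 (truncation toward zero),
    // while floor(-3.5) == -4 (rounding down).
    std::printf("-7 / 2 = %d\n", -7 / 2);
    return 0;
}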
Use this program:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#ifdef DIV_BY_DIV
#define DIV(a, b) ((a) / (b))
#else
#define DIV(a, b) (((a) - ((a) % (b))) / (b))
#endif

#ifndef ITERS
#define ITERS 1000
#endif

int main()
{
    int i, a, b;

    srand(time(NULL));
    a = rand();
    b = rand();

    for (i = 0; i < ITERS; i++)
        a = DIV(a, b);

    return 0;
}
You can time the execution:
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 -DDIV_BY_DIV 1.c && time ./a.out
real 0m0.010s
user 0m0.012s
sys 0m0.000s
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c && time ./a.out
real 0m0.019s
user 0m0.020s
sys 0m0.000s
Or you can look at the assembly output:
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 -DDIV_BY_DIV 1.c -S; mv 1.s 1_div.s
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c -S; mv 1.s 1_modulus.s
mihai@keldon:/tmp$ diff 1_div.s 1_modulus.s
24a25,32
> movl %edx, %eax
> movl 24(%esp), %edx
> movl %edx, %ecx
> subl %eax, %ecx
> movl %ecx, %eax
> movl %eax, %edx
> sarl $31, %edx
> idivl 20(%esp)
As you see, doing only the division is faster.
Edited to fix error in code, formatting and wrong diff.
More edit (explaining the assembly diff): in the second case, when doing the modulus first, the assembly shows that two idivl operations are needed: one to get the result of % and one for the actual division. The above diff shows the subtraction and the second division, since the first idivl is exactly the same in both versions.
Edit: more relevant timing information:
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=42000000 -DDIV_BY_DIV 1.c && time ./a.out
real 0m0.384s
user 0m0.360s
sys 0m0.004s
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=42000000 1.c && time ./a.out
real 0m0.706s
user 0m0.696s
sys 0m0.004s
Hope it helps.
Edit: diff between assembly with -O0 and without.
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c -S -O0; mv 1.s O0.s
mihai@keldon:/tmp$ gcc -Wall -Wextra -DITERS=1000000 1.c -S; mv 1.s noO.s
mihai@keldon:/tmp$ diff noO.s O0.s
Since the default optimization level of gcc is -O0 (see this article explaining optimization levels in gcc), the empty diff was expected.
Edit: if you compile with -O3, as one of the comments suggested, you'll get the same assembly for both; at that level of optimization the two alternatives compile to the same code.