I was trying to understand the overhead of using std::optional in an interface design: whether it's worth exposing a single std::optional-returning accessor or a combination of "isNull" and "getX" calls. The main concern is the overhead of std::optional, so I crafted this simple program to test it.
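For context, here is a rough sketch of the two interface styles being compared (the function names and the even-means-null rule are only illustrative, mirroring the test below):

#include <optional>

// Style 1: a single accessor that returns std::optional.
std::optional<int> getValue(int key) {
  if (key % 2 == 0) return std::nullopt;  // "null" case
  return key + 1;
}

// Style 2: an isNull() check paired with a plain getter.
bool isNull(int key) { return key % 2 == 0; }
int getValueUnchecked(int key) { return key + 1; }  // only meaningful when !isNull(key)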
The code snippet and test results are summarized in this gist https://gist.github.com/shawncao/fa801d300c4a0a2826240fd9f5961db2
Surprisingly, there are two things I don't understand, and I would like to hear from experts:
1. On the Linux platform with GCC (8 and 9 both tested), I consistently see the std::optional version beating the direct condition, which I don't understand.
2. On my Mac laptop, the program runs much faster than on Linux (12 ms vs. >60 ms). I don't see why this is the case (is it the Apple LLVM implementation?); judging from the CPU info I don't think the Mac is faster, but maybe I missed some other information to check.
Any pointers will be appreciated.
Basically, I would like to understand the runtime difference between these two platforms; it may be processor-related or compiler-related, I'm not sure. A repro app is attached in the gist (see details there). I also don't quite understand why std::optional is faster on Linux with GCC.
Code snippet (copied from the gist, as suggested):
inline bool f1ni(int i) {
  return i % 2 == 0;
}

TEST(OptionalTest, TestOptionalPerf) {
  auto f1 = [](int i) -> int {
    return i + 1;
  };

  auto f2 = [](int i) -> std::optional<int> {
    if (i % 2 == 0) {
      return {};
    }
    return i + 1;
  };

  // run 100M cycles
  constexpr auto cycles = 100000000;
  nebula::common::Evidence::Duration duration;

  long sum2 = 0;
  for (int i = 0; i < cycles; ++i) {
    auto x = f2(i);
    if (x) {
      sum2 += x.value();
    }
  }
  LOG(INFO) << fmt::format("optional approach: sum={0}, time={1}", sum2, duration.elapsedMs());

  duration.reset();
  long sum1 = 0;
  for (int i = 0; i < cycles; ++i) {
    if (i % 2 != 0) {
      sum1 += f1(i);
    }
  }
  LOG(INFO) << fmt::format("special value approach: sum={0}, time={1}", sum1, duration.elapsedMs());
  EXPECT_EQ(sum1, sum2);
}
Laptop: Mac, macOS Mojave
Processor: 2.6 GHz Intel Core i7
/usr/bin/g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/c++/4.2.1
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
[Test Output]
I0812 09:36:20.678683 348009920 TestCommon.cpp:354] optional approach: sum=2500000050000000, time=13
I0812 09:36:20.692001 348009920 TestCommon.cpp:363] special value approach: sum=2500000050000000, time=12
Linux: Ubuntu 18.04 LTS
36 cores; one of them:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping : 4
microcode : 0x100014a
cpu MHz : 1685.329
cache size : 25344 KB
physical id : 0
siblings : 36
core id : 0
cpu cores : 18
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
/usr/bin/g++ --version
g++ (Ubuntu 9.1.0-2ubuntu2~18.04) 9.1.0
Copyright (C) 2019 Free Software Foundation, Inc.
[Test Output]
I0812 16:42:07.391607 15080 TestCommon.cpp:354] optional approach: sum=2500000050000000, time=59
I0812 16:42:07.473335 15080 TestCommon.cpp:363] special value approach: sum=2500000050000000, time=81
===== build flags =====
[build commands capture for Mac]
/usr/bin/g++ -I/Users/shawncao/nebula/include -I/Users/shawncao/nebula/src -I/usr/local/Cellar/boost/1.70.0/include -I/usr/local/Cellar/folly/2019.08.05.00/include -I/Users/shawncao/nebula/build/cuckoofilter/src/cuckoofilter/src -I/Users/shawncao/nebula/build/fmtproj-prefix/src/fmtproj/include -I/Users/shawncao/nebula/src/service/gen/nebula -isystem /usr/local/include -isystem /Users/shawncao/nebula/build/gflagsp-prefix/src/gflagsp-build/include -isystem /Users/shawncao/nebula/build/glogp-prefix/src/glogp-build -isystem /Users/shawncao/nebula/build/roaringproj-prefix/src/roaringproj/include -isystem /Users/shawncao/nebula/build/roaringproj-prefix/src/roaringproj/cpp -isystem /Users/shawncao/nebula/build/yomm2-prefix/src/yomm2/include -isystem /Users/shawncao/nebula/build/xxhash-prefix/src/xxhash -isystem /Users/shawncao/nebula/build/bloom/src/bloom -isystem /usr/local/Cellar/openssl/1.0.2s/include -Wall -Wextra -Werror -Wno-error=nullability-completeness -Wno-error=sign-compare -Wno-error=unknown-warning-option -O3 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk -std=gnu++1z -o CMakeFiles/CommonTests.dir/src/common/test/TestExts.cpp.o -c /Users/shawncao/nebula/src/common/test/TestExts.cpp
[100%] Linking CXX executable CommonTests
/Applications/CMake.app/Contents/bin/cmake -E cmake_link_script CMakeFiles/CommonTests.dir/link.txt --verbose=1
/usr/bin/g++ -Wall -Wextra -Werror -Wno-error=nullability-completeness -Wno-error=sign-compare -Wno-error=unknown-warning-option -O3 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk -Wl,-search_paths_first -Wl,-headerpad_max_install_names CMakeFiles/CommonTests.dir/src/common/test/TestCommon.cpp.o CMakeFiles/CommonTests.dir/src/common/test/TestExts.cpp.o -o CommonTests -Wl,-rpath,/Users/shawncao/nebula/build/bloom/src/bloom-build/lib libNCommon.a yomm2-prefix/src/yomm2-build/src/libyomm2.a googletest-prefix/src/googletest-build/lib/libgtest.a googletest-prefix/src/googletest-build/lib/libgtest_main.a xxhash-prefix/src/xxhash/libxxhash.a roaringproj-prefix/src/roaringproj-build/src/libroaring.a glogp-prefix/src/glogp-build/libglog.a gflagsp-prefix/src/gflagsp-build/lib/libgflags.a bloom/src/bloom-build/lib/libbf.dylib libcflib.a /usr/local/Cellar/openssl/1.0.2s/lib/libssl.a fmtproj-prefix/src/fmtproj-build/libfmt.a
[build commands capture for Linux]
/usr/bin/g++ -Wall -Wextra -Werror -lstdc++ -Wl,--no-as-needed -no-pie -ldl -lunwind -I/usr/include/ -I/usr/local/include -L/usr/local/lib -L/usr/lib -O3 -s CMakeFiles/CommonTests.dir/src/common/test/TestCommon.cpp.o CMakeFiles/CommonTests.dir/src/common/test/TestExts.cpp.o -o CommonTests -Wl,-rpath,/home/shawncao/nebula/build/bloom/src/bloom-build/lib libNCommon.a yomm2-prefix/src/yomm2-build/src/libyomm2.a /usr/local/lib/libgtest.a /usr/local/lib/libgtest_main.a xxhash-prefix/src/xxhash/libxxhash.a roaringproj-prefix/src/roaringproj-build/src/libroaring.a /usr/local/lib/libglog.a /usr/local/lib/libgflags.a bloom/src/bloom-build/lib/libbf.so libcflib.a /usr/local/lib/libssl.a /usr/local/lib/libcrypto.a fmtproj-prefix/src/fmtproj-build/libfmt.a -lpthread
Related
In the application I am working on, a colleague noticed that some results change minimally when compiling with -march=native (on a Skylake CPU) versus with the less aggressive -march=x86-64. I therefore tried to come up with an MWE to clarify the compiler behaviour for myself (of course the real case is far more complex, hence the MWE looks silly, but that's not the point). Before entering the details, let me state my question, which can be split into two parts:
Part 1: How is -fexpensive-optimizations related to -ffp-contract? Does the former imply in some sense the latter?
Part 2: Since the default value for floating-point expression contraction is -ffp-contract=fast, independently of the optimisation level, why does changing it to off together with -O2 in the MWE below fix the discrepancy? This is a way of rephrasing my comment about the manual description of this flag (see below).
Bonus question: Why does reducing the std::vector to one entry in my MWE change the decimal number representation, and why does leaving the std::vector out make any discrepancy go away?
Consider the following MWE:
#include <iostream>
#include <iomanip>
#include <array>
#include <vector>

double inline sqr(std::array<double, 4> x) {
    return x[0] * x[0] - x[1] * x[1] - x[2] * x[2] - x[3] * x[3];
}

int main() {
    //std::vector<std::array<double, 4>> v = {{0.6, -0.3, -0.5, -0.5}}; // <-- for bonus question
    std::vector<std::array<double, 4>> v = {{0.6, -0.3, -0.5, -0.5}, {0.6, -0.3, -0.5, -0.5}};
    for (const auto& a : v)
        std::cout << std::setprecision(18) << std::fixed << std::showpos
                  << sqr(a) << " -> " << std::hexfloat << sqr(a) << std::defaultfloat << '\n';
}
If compiled with optimisations up to -O1, using the GNU compiler g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, results are the same targeting both architectures, while they differ starting with -O2:
$ for ARCH in native x86-64; do g++ -march=${ARCH} -O1 -std=c++17 -o mwe_${ARCH} mwe.cpp; done; diff <(./mwe_native) <(./mwe_x86-64)
$ for ARCH in native x86-64; do g++ -march=${ARCH} -O2 -std=c++17 -o mwe_${ARCH} mwe.cpp; done; diff <(./mwe_native) <(./mwe_x86-64)
1,2c1,2
< -0.230000000000000038 -> -0x1.d70a3d70a3d72p-3
< -0.230000000000000038 -> -0x1.d70a3d70a3d72p-3
---
> -0.229999999999999982 -> -0x1.d70a3d70a3d7p-3
> -0.229999999999999982 -> -0x1.d70a3d70a3d7p-3
I then went through all -O2 options listed in the manual and it turned out that -fexpensive-optimizations triggers the difference to occur:
$ for ARCH in native x86-64; do g++ -march=${ARCH} -O1 -fexpensive-optimizations -std=c++17 -o mwe_${ARCH} mwe.cpp; done; diff <(./mwe_native) <(./mwe_x86-64)
1,2c1,2
< -0.230000000000000038 -> -0x1.d70a3d70a3d72p-3
< -0.230000000000000038 -> -0x1.d70a3d70a3d72p-3
---
> -0.229999999999999982 -> -0x1.d70a3d70a3d7p-3
> -0.229999999999999982 -> -0x1.d70a3d70a3d7p-3
$ for ARCH in native x86-64; do g++ -march=${ARCH} -O2 -fno-expensive-optimizations -std=c++17 -o mwe_${ARCH} mwe.cpp; done; diff <(./mwe_native) <(./mwe_x86-64)
$
I then tried to understand the meaning of the option and found this SO answer, which did not shed much light on the situation. After more research it turned out that the -ffp-contract option also makes the discrepancy go away, if explicitly turned off. In particular, FMA operations seem to be responsible for the differences:
$ for ARCH in native x86-64; do g++ -march=${ARCH} -O2 -mfma -std=c++17 -o mwe_${ARCH} mwe.cpp; done; sdiff <(./mwe_native) <(./mwe_x86-64)
-0.230000000000000038 -> -0x1.d70a3d70a3d72p-3 -0.230000000000000038 -> -0x1.d70a3d70a3d72p-3
-0.230000000000000038 -> -0x1.d70a3d70a3d72p-3 -0.230000000000000038 -> -0x1.d70a3d70a3d72p-3
$ for ARCH in native x86-64; do g++ -march=${ARCH} -O2 -ffp-contract=off -std=c++17 -o mwe_${ARCH} mwe.cpp; done; sdiff <(./mwe_native) <(./mwe_x86-64)
-0.229999999999999982 -> -0x1.d70a3d70a3d7p-3 -0.229999999999999982 -> -0x1.d70a3d70a3d7p-3
-0.229999999999999982 -> -0x1.d70a3d70a3d7p-3 -0.229999999999999982 -> -0x1.d70a3d70a3d7p-3
In the previous output, note how -mfma makes the x86-64 result identical to the native one, while -ffp-contract=off does the opposite, making the native result match the x86-64 one (the difference is in the last digit of the hex representation).
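(One quick way, not part of the original investigation, to check whether FMA instructions are actually emitted for sqr is to grep the generated assembly; the exact mnemonics, e.g. vfmsub/vfnmadd variants, depend on how the compiler contracts the expression.)

$ g++ -O2 -march=native -std=c++17 -S -o - mwe.cpp | grep -E 'vfn?m(add|sub)'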
The manual for -ffp-contract reads:
-ffp-contract=style
-ffp-contract=off disables floating-point expression contraction. -ffp-contract=fast enables floating-point expression contraction such as
forming of fused multiply-add operations if the target has native support for them. -ffp-contract=on enables floating-point expression
contraction if allowed by the language standard. This is currently not implemented and treated equal to -ffp-contract=off.
The default is -ffp-contract=fast.
I assume that with -march=x86-64 the manual's "if the target has native support for them" clause does not apply, so FMA operations are not used. Please correct me if this is wrong.
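To make the effect of contraction concrete, here is a small illustrative snippet of my own (not from the MWE) that hand-fuses the last multiply-subtract of sqr using std::fma; the compiler may fuse more or different operations, the point is only that a fused step skips one intermediate rounding, which can change the last bits of the result:

#include <cmath>
#include <cstdio>

// Build with -ffp-contract=off so that 'plain' itself is not contracted.
int main() {
    const double x0 = 0.6, x1 = -0.3, x2 = -0.5, x3 = -0.5;

    // Every multiplication and subtraction rounded separately.
    double plain = x0 * x0 - x1 * x1 - x2 * x2 - x3 * x3;

    // One possible contraction: the final "- x3 * x3" fused into an FMA,
    // so x3 * x3 is never rounded on its own.
    double fused = std::fma(-x3, x3, x0 * x0 - x1 * x1 - x2 * x2);

    std::printf("plain: %a\nfused: %a\n", plain, fused);
}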
NOTE: SO suggested having a look at this question, but I doubt there is any UB in the code above. I also have to say that I tried with clang-14 (which changed the default to -ffp-contract=on) and could reproduce the discrepancies, while with clang-13 (whose default is -ffp-contract=off) no discrepancy appears.
I read an article from Igor's blog. The article said:
... today’s CPUs do not access memory byte by byte. Instead, they fetch memory in chunks of (typically) 64 bytes, called cache lines. When you read a particular memory location, the entire cache line is fetched from the main memory into the cache. And, accessing other values from the same cache line is cheap!
The article also provides C# code to verify the above conclusion:
int[] arr = new int[64 * 1024 * 1024];
// Loop 1 (step = 1)
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
// Loop 2 (step = 16)
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
The two for-loops take about the same time, 80 and 78 ms respectively on Igor's machine, so the cache line mechanism is verified.
I then applied the same idea to implement a C++ version to verify the cache line size, as follows:
#include "stdafx.h"
#include <iostream>
#include <chrono>
#include <math.h>

using namespace std::chrono;

const int total_buff_count = 16;
const int buff_size = 32 * 1024 * 1024;

int testCacheHit(int * pBuffer, int size, int step)
{
    int result = 0;
    for (int i = 0; i < size;) {
        result += pBuffer[i];
        i += step;
    }
    return result;
}

int main()
{
    int * pBuffer = new int[buff_size*total_buff_count];
    for (int i = 0; i < total_buff_count; ++i) {
        int step = (int)pow(2, i);
        auto start = std::chrono::system_clock::now();
        volatile int result = testCacheHit(pBuffer + buff_size*i, buff_size, step);
        auto end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = end - start;
        std::cout << "step: " << step << ", elapsed time: " << elapsed_seconds.count() * 1000 << "ms\n";
    }
    delete[] pBuffer;
}
But my test result is totally different from the one in Igor's article. If the step is 1, the time cost is about 114 ms; if the step is 16, the time cost is about 78 ms. The test application is built with the release configuration; there is 32 GB of memory on my machine and the CPU is an Intel Xeon E5 2420 v2 @ 2.2 GHz. The result is the following.
The interesting finding is that the time cost decreases significantly when the step is 2 and again when the step is 2048. My question is: how to explain these two gaps (at step 2 and at step 2048) in my test, and why is my result so different from Igor's? Thanks.
My own explanation for the first question is that the time cost of the code consists of two parts: one is "memory read/write", which covers the memory read/write cost, and the other is "other cost", which covers the loop and the calculation. If the step is 2, the "memory read/write" cost barely changes (because of the cache line), but the calculation and loop cost is cut in half, so we see an obvious gap. And I guess the cache line on my CPU is 4096 bytes (1024 * 4 bytes) rather than 64 bytes, which would explain the other gap at step 2048. But that's just my guess. Any help from you guys is appreciated, thanks.
Drop between 1024 and 2048
Note that you are using an uninitialized array. This basically means that
int * pBuffer = new int[buff_size*total_buff_count];
does not cause your program to actually ask for any physical memory. Instead, just some virtual address space is reserved.
Then, when you first touch some array element, a page fault is triggered and the OS maps the page to physical memory. This is a relatively slow operation which may significantly influence your experiment. Since the page size on your system is likely 4 kB, it can hold 1024 4-byte integers. When you go for a step of 2048, only every second page is actually accessed and the runtime drops proportionally.
You can avoid the negative effect of this mechanism by "touching" the memory in advance:
int * pBuffer = new int[buff_size*total_buff_count]{};
When I tried that, I got almost linear decrease of time between 64 and 8192 step sizes.
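For illustration, an equivalent way to pre-fault the pages (a sketch of my own, assuming a 4 kB page size; the value-initialization above achieves the same thing) is to write one element per page before the timed loop:

#include <cstddef>

int main() {
    constexpr std::size_t count = 32u * 1024 * 1024;           // one buffer, as in the question
    constexpr std::size_t ints_per_page = 4096 / sizeof(int);  // assumes 4 kB pages

    int* p = new int[count];            // only reserves virtual address space
    for (std::size_t i = 0; i < count; i += ints_per_page)
        p[i] = 0;                       // touching one int per page faults it in

    // ... the timed testCacheHit() measurement would go here ...

    delete[] p;
}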
Drop between 1 and 2
The cache line size on your system is definitely not 2048 bytes; it's very likely 64 bytes (generally, it may have different values, and even different values for different cache levels).
As for the first part: with a step of 1, there are simply many more arithmetic operations involved (additions of array elements and increments of i).
Difference from Igor's experiment
We can only speculate about why Igor's experiment gave practically the same times in both cases. I would guess that the runtime of the arithmetic is negligible there, since there is only a single loop-counter increment involved and he writes into the array, which requires an additional transfer of the cache lines back to memory. (We can say that the byte/op ratio is much higher than in your experiment.)
How to verify CPU cache line size with c++ code?
There is std::hardware_destructive_interference_size in C++17, which should provide the smallest cache line size. Note that it is a compile-time value and the compiler relies on your input about which machine is targeted. When targeting an entire architecture family, the number may be inaccurate.
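A minimal sketch of using it (requires C++17 and a standard library that actually ships the constant, e.g. GCC 12+ or recent MSVC):

#include <iostream>
#include <new>

int main() {
    // Chosen at compile time for the *targeted* architecture, so it reflects
    // the compiler's assumption rather than the machine the binary runs on.
    std::cout << "destructive interference size: "
              << std::hardware_destructive_interference_size << " bytes\n";
}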
How to verify CPU cache line size with c++ code?
You reliably cannot.
And you should write portable C++ code. Read n3337.
Imagine that you did not enable compiler optimizations in your C++ compiler. And imagine that you run your C++ compiler in some emulator (like these).
On Linux specifically, you could parse the /proc/cpuinfo pseudo file and get the CPU cache line size from it.
For example:
% head -20 /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 8
model name : AMD Ryzen Threadripper 2970WX 24-Core Processor
stepping : 2
microcode : 0x800820b
cpu MHz : 1776.031
cache size : 512 KB
physical id : 0
siblings : 48
core id : 0
cpu cores : 24
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
BTW, there are many different organizations and levels of caches.
You could imagine a C++ application on Linux parsing the output of /proc/cpuinfo then making HTTP requests (using libcurl) to the Web to get more from it.
See also this answer.
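As an illustration of the /proc/cpuinfo approach, a Linux-only sketch (which fields appear, e.g. "clflush size" or "cache_alignment", depends on the CPU and kernel):

#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::string line;
    while (std::getline(cpuinfo, line)) {
        // Both fields are usually 64 on x86-64; print the first match.
        if (line.rfind("clflush size", 0) == 0 ||
            line.rfind("cache_alignment", 0) == 0) {
            std::cout << line << '\n';
            break;
        }
    }
}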
I want to install pkgconfig from source because I want it in a particular path, to be accessed by my application. However, I am encountering a problem which is probably basic to someone very familiar with C/C++. I have tried to search for a solution without success. The problem is with the definition of end_time, as the extract from make below shows:
In file included from gthread.c:42:0:
gthread-posix.c: In function 'g_cond_timed_wait_posix_impl':
gthread-posix.c:112:19: error: storage size of 'end_time' isn't known
struct timespec end_time;
gthread-posix.c:112:19: warning: unused variable 'end_time' [-Wunused-variable]
Makefile:285: recipe for target 'gthread.lo' failed
The offending line, 112, in the file "pkgconfig-0.23/glib-1.2.10/gthread/gthread-posix.c" (shown below together with one line before and a few lines after) is:
{
  int result;
  struct timespec end_time;
  gboolean timed_out;

  g_return_val_if_fail (cond != NULL, FALSE);
  g_return_val_if_fail (entered_mutex != NULL, FALSE);

  if (!abs_time)
    {
      g_cond_wait (cond, entered_mutex);
      return TRUE;
    }

  end_time.tv_sec = abs_time->tv_sec;
  end_time.tv_nsec = abs_time->tv_usec * (G_NANOSEC / G_MICROSEC);
  g_assert (end_time.tv_nsec < G_NANOSEC);

  result = pthread_cond_timedwait ((pthread_cond_t *) cond,
                                   (pthread_mutex_t *) entered_mutex,
                                   &end_time);

#ifdef HAVE_PTHREAD_COND_TIMEDWAIT_POSIX
  timed_out = (result == ETIMEDOUT);
#else
  timed_out = (result == -1 && errno == EAGAIN);
#endif

  if (!timed_out)
    posix_check_for_error (result);

  return !timed_out;
}
Any assistance will be appreciated.
My OS is Ubuntu 16.04.
Output of cat /proc/cpuinfo
cpu cores : 1
apicid : 0
initial apicid : 0
fdiv_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss nx pdpe1gb rdtscp constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor abm epb fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid xsaveopt dtherm ida arat pln pts
bugs :
bogomips : 5202.00
clflush size : 64
cache_alignment : 64
address sizes : 42 bits physical, 48 bits virtual
power management:
I'm trying to use the __rdtscp intrinsic to measure time intervals. The target platform is Linux x64, with an Intel Xeon X5550 CPU. Although the constant_tsc flag is set for this processor, calibrating __rdtscp gives very different results:
$ taskset -c 1 ./ticks
Ticks per usec: 256
$ taskset -c 1 ./ticks
Ticks per usec: 330.667
$ taskset -c 1 ./ticks
Ticks per usec: 345.043
$ taskset -c 1 ./ticks
Ticks per usec: 166.054
$ taskset -c 1 ./ticks
Ticks per usec: 256
$ taskset -c 1 ./ticks
Ticks per usec: 345.043
$ taskset -c 1 ./ticks
Ticks per usec: 256
$ taskset -c 1 ./ticks
Ticks per usec: 330.667
$ taskset -c 1 ./ticks
Ticks per usec: 256
$ taskset -c 1 ./ticks
Ticks per usec: 330.667
$ taskset -c 1 ./ticks
Ticks per usec: 330.667
$ taskset -c 1 ./ticks
Ticks per usec: 345.043
$ taskset -c 1 ./ticks
Ticks per usec: 256
$ taskset -c 1 ./ticks
Ticks per usec: 125.388
$ taskset -c 1 ./ticks
Ticks per usec: 360.727
$ taskset -c 1 ./ticks
Ticks per usec: 345.043
As we can see, the difference between program executions can be up to a factor of 3 (125-360). Such instability is not acceptable for any measurements.
Here is the code (gcc 4.9.3, running on Oracle Linux 6.6, kernel 3.8.13-55.1.2.el6uek.x86_64):
// g++ -O3 -std=c++11 -Wall ticks.cpp -o ticks
#include <x86intrin.h>
#include <ctime>
#include <cstdint>
#include <iostream>

int main()
{
    timespec start, end;
    uint64_t s = 0;

    const double rdtsc_ticks_per_usec = [&]()
    {
        unsigned int dummy;

        clock_gettime(CLOCK_MONOTONIC, &start);
        uint64_t rd_start = __rdtscp(&dummy);

        for (size_t i = 0; i < 1000000; ++i) ++s;

        uint64_t rd_end = __rdtscp(&dummy);
        clock_gettime(CLOCK_MONOTONIC, &end);

        double usec_dur = double(end.tv_sec) * 1E6 + end.tv_nsec / 1E3;
        usec_dur -= double(start.tv_sec) * 1E6 + start.tv_nsec / 1E3;

        return (double)(rd_end - rd_start) / usec_dur;
    }();

    std::cout << s << std::endl;
    std::cout << "Ticks per usec: " << rdtsc_ticks_per_usec << std::endl;
    return 0;
}
When I run a very similar program under Windows 7 (i7-4470, VS2015), the calibration result is pretty stable, with a small difference in the last digit only.
So the question is: what is this issue about? Is it a CPU issue, a Linux issue, or an issue with my code?
Other sources of jitter will be there if you don't also ensure the CPU is isolated. You really want to avoid having another process scheduled on that core.
Also, ideally you would run a tickless kernel so that you never run kernel code at all on that core. In the above code I guess that only matters if you are unlucky enough to get the tick or a context switch between the call to clock_gettime() and __rdtscp.
Making s volatile is another way to defeat that kind of compiler optimisation.
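A minimal sketch of that variant (illustrative only; it reuses the loop bound from the code in the question):

#include <cstddef>
#include <cstdint>

int main() {
    // volatile forces every increment to actually be performed, so the
    // compiler cannot collapse the loop into a single "s = 1000000" store.
    volatile std::uint64_t s = 0;
    for (std::size_t i = 0; i < 1000000; ++i) s = s + 1;
    return static_cast<int>(s & 1);
}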
It was definitely my code (or gcc) issue: the compiler optimized the loop out, replacing it with s = 1000000.
To prevent gcc from optimizing it away, the calibration loop should be changed this way:
for (size_t i = 0; i < 1000000; ++i) s += i;
Or, in a simpler and more correct way (thanks to Hal):
for (volatile size_t i = 0; i < 1000000; ++i);
I am trying to understand vectorization, but to my surprise this very simple code is not being vectorized:
#define n 1024
int main () {
    int i, a[n], b[n], c[n];
    for (i = 0; i < n; i++) { a[i] = i; b[i] = i*i; }
    for (i = 0; i < n; i++) c[i] = a[i] + b[i];
}
The Intel compiler, meanwhile, for some reason vectorizes only the initialization loop (line 5):
> icc -vec-report a.c
a.c(5): (col. 3) remark: LOOP WAS VECTORIZED
With GCC, it seems I get nothing:
> gcc -ftree-vectorize -ftree-vectorizer-verbose=2 a.c
Am I doing something wrong? Shouldn't this be a very simple vectorizable loop? It's all the same operations, contiguous memory, etc. My CPU supports SSE1/2/3/4.
--- update ---
Following the answer below, this example works for me.
#include <stdio.h>
#define n 1024
int main () {
    int i, a[n], b[n], c[n];
    for (i = 0; i < n; i++) { a[i] = i; b[i] = i*i; }
    for (i = 0; i < n; i++) c[i] = a[i] + b[i];
    printf("%d\n", c[1023]);
}
With icc
> icc -vec-report a.c
a.c(7): (col. 3) remark: LOOP WAS VECTORIZED
a.c(8): (col. 3) remark: LOOP WAS VECTORIZED
And gcc
> gcc -ftree-vectorize -fopt-info-vec -O a.c
a.c:8:3: note: loop vectorized
a.c:7:3: note: loop vectorized
I've slightly modified your source code to be sure that GCC couldn't remove the loops:
#include <stdio.h>
#define n 1024
int main () {
    int i, a[n], b[n], c[n];
    for (i = 0; i < n; i++) { a[i] = i; b[i] = i*i; }
    for (i = 0; i < n; i++) c[i] = a[i] + b[i];
    printf("%d\n", c[1023]);
}
GCC (v4.8.2) can vectorize the two loops but it needs the -O flag:
gcc -ftree-vectorize -ftree-vectorizer-verbose=1 -O2 a.c
and I get:
Analyzing loop at a.c:8
Vectorizing loop at a.c:8
a.c:8 note: LOOP VECTORIZED.
Analyzing loop at a.c:7
Vectorizing loop at a.c:7
a.c:7 note: LOOP VECTORIZED.
a.c: note: vectorized 2 loops in function.
Using the -fdump-tree-vect switch, GCC will dump more information into the a.c.##t.vect file (it's quite useful for getting an idea of what is happening "inside").
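For example (a sketch; the numeric pass id in the dump file name varies between GCC versions):

$ gcc -O2 -ftree-vectorize -fdump-tree-vect-details -c a.c
$ ls a.c.*t.vect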
Also consider that:
the -march= switch could be essential to perform vectorization
-ftree-vectorizer-verbose=n is now being deprecated in favor of -fopt-info-vec and -fopt-info-vec-missed (see http://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html)
Most of the time, the options -Ofast -march=native will vectorize your code, if your processor can do it.
$ gcc compute_simple.c -Ofast -march=native -fopt-info-vec -o compute_simple.bin
compute_simple.c:14:5: note: loop vectorized
compute_simple.c:14:5: note: loop versioned for vectorization because of possible aliasing
compute_simple.c:14:5: note: loop vectorized
To know whether your processor can do it, use lscpu and look at the available flags:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
...
Vendor ID: GenuineIntel
...
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl
xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64
monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1
sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand
lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb
stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1
hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt
xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify
hwp_act_window hwp_epp md_clear flush_l1d
You need SSE/AVX on Intel, NEON on ARM, and others on AMD (like XOP).
You can find much more information on vectorization in the GCC documentation.
Here is a nice article on the subject, with flags that can be used for many platforms:
https://gcc.gnu.org/projects/tree-ssa/vectorization.html
Finally, as written above, use -ftree-vectorizer-verbose=n in old versions of gcc, and -fopt-info-vec / -fopt-info-vec-missed in recent ones, to see what is vectorized.