#include <iostream>
#include <chrono>
using namespace std;
class MyTimer {
private:
std::chrono::time_point<std::chrono::steady_clock> starter;
std::chrono::time_point<std::chrono::steady_clock> ender;
public:
void startCounter() {
starter = std::chrono::steady_clock::now();
}
double getCounter() {
ender = std::chrono::steady_clock::now();
return double(std::chrono::duration_cast<std::chrono::nanoseconds>(ender - starter).count()) /
1000000; // millisecond output
}
// the timer needs to have nanosecond precision
int64_t getCounterNs() {
return std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - starter).count();
}
};
MyTimer timer1, timer2, timerMain;
volatile int64_t dummy = 0, res1 = 0, res2 = 0;
// baseline: time the run without any time measurement
void func0() {
dummy++;
}
// we're trying to measure the cost of startCounter() and getCounterNs(), not "dummy++"
void func1() {
timer1.startCounter();
dummy++;
res1 += timer1.getCounterNs();
}
void func2() {
// start your counter here
dummy++;
// res2 += end your counter here
}
int main()
{
int i, ntest = 1000 * 1000 * 100;
int64_t runtime0, runtime1, runtime2;
timerMain.startCounter();
for (i=1; i<=ntest; i++) func0();
runtime0 = timerMain.getCounter();
cout << "Time0 = " << runtime0 << "ms\n";
timerMain.startCounter();
for (i=1; i<=ntest; i++) func1();
runtime1 = timerMain.getCounter();
cout << "Time1 = " << runtime1 << "ms\n";
timerMain.startCounter();
for (i=1; i<=ntest; i++) func2();
runtime2 = timerMain.getCounter();
cout << "Time2 = " << runtime2 << "ms\n";
return 0;
}
I'm trying to profile a program where certain critical parts have execution time measured in < 50 nanoseconds. I found that my timer class using std::chrono is too expensive (code with timing takes 40% more time than code without). How can I make a faster timer class?
I think some OS-specific system calls would be the fastest solution. The platform is Linux Ubuntu.
Edit: all code is compiled with -O3. It's ensured that each timer is only initialized once, so the measured cost is due to the startCounter()/getCounterNs() calls only. I'm not doing any text printing.
Edit 2: the accepted answer doesn't include the method to actually convert number-of-cycles to nanoseconds. If someone can do that, it'd be very helpful.
What you want is called "micro-benchmarking". It can get very complex. I assume you are using Ubuntu Linux on x86_64. This is not valid for ARM, ARM64 or any other platform.
On Linux, std::chrono is implemented in libstdc++ (gcc) and libc++ (clang) as simply a thin wrapper around glibc, the C library, which does all the heavy lifting. If you look at std::chrono::steady_clock::now() you will see calls to clock_gettime().
clock_gettime() is provided by the vDSO, i.e. it is kernel code that runs in userspace. It should be very fast, but from time to time it may have to do some housekeeping and take a long time every n-th call. So I would not recommend it for micro-benchmarking.
Almost every platform has a cycle counter and x86 has the assembly instruction rdtsc. This instruction can be inserted in your code by crafting asm calls or by using the compiler-specific builtins __builtin_ia32_rdtsc() or __rdtsc().
These calls return a 64-bit integer representing the number of clock cycles since the machine powered up. rdtsc is not immediate but it is fast; it takes roughly 15-40 cycles to complete.
It is not guaranteed on all platforms that this counter will be the same for each core, so beware when the process gets moved from core to core. On modern systems this should not be a problem though.
Another problem with rdtsc is that compilers will often reorder instructions if they decide they have no side effects, and unfortunately rdtsc is one of them. So you may have to use barriers around these counter reads if you see that the compiler is playing tricks on you - look at the generated assembly.
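To make the barrier idea concrete, here is a minimal sketch for GCC/Clang on x86_64 using intrinsics from <x86intrin.h> (the helper names are my own; this is one common pattern, not the only valid one): fence before the first read so earlier work cannot drift into the measured region, and use rdtscp plus a fence at the end.
#include <cstdint>
#include <x86intrin.h>   // __rdtsc, __rdtscp, _mm_lfence

// Hypothetical helpers: serialize around the counter reads so neither the
// compiler nor the CPU moves the measured code outside the two reads.
static inline uint64_t tsc_begin() {
    _mm_lfence();                 // earlier instructions finish first
    return __rdtsc();
}

static inline uint64_t tsc_end() {
    unsigned aux;
    uint64_t t = __rdtscp(&aux);  // waits for preceding instructions to complete
    _mm_lfence();                 // later instructions start only after the read
    return t;
}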
Another big problem is CPU out-of-order execution itself. Not only can the compiler change the order of execution, the CPU can as well. Since the 486, Intel x86 CPUs have been pipelined, so several instructions can be in flight at the same time - roughly speaking. So you might end up measuring spurious execution.
I recommend you get familiar with the quantum-like problems of micro-benchmarking. It is not straightforward.
Notice that rdtsc() will return the number of cycles. You have to convert to nanoseconds using the timestamp counter frequency.
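Regarding the question's Edit 2: one common way to get that frequency (a sketch of my own, not a standard API) is to calibrate the TSC once against clock_gettime() and then scale cycle deltas by the measured ticks-per-nanosecond ratio.
#include <cstdint>
#include <ctime>
#include <x86intrin.h>

// Measure TSC ticks per nanosecond once at startup by timing a ~10 ms
// wall-clock interval with clock_gettime (the interval length is arbitrary).
double tsc_ticks_per_ns() {
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t c0 = __rdtsc();
    int64_t elapsed_ns = 0;
    do {
        clock_gettime(CLOCK_MONOTONIC, &t1);
        elapsed_ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL + (t1.tv_nsec - t0.tv_nsec);
    } while (elapsed_ns < 10 * 1000 * 1000);
    uint64_t c1 = __rdtsc();
    return double(c1 - c0) / double(elapsed_ns);
}

// Usage: compute the ratio once, then
//   double ns = double(cycles_elapsed) / ticks_per_ns;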
Here is one example:
#include <iostream>
#include <cstdio>
#include <cstdint> // uint32_t, uint64_t
void dosomething() {
// yada yada
}
int main() {
double sum = 0;
const uint32_t numloops = 100000000;
for ( uint32_t j=0; j<numloops; ++j ) {
uint64_t t0 = __builtin_ia32_rdtsc();
dosomething();
uint64_t t1 = __builtin_ia32_rdtsc();
uint64_t elapsed = t1-t0;
sum += elapsed;
}
std::cout << "Average:" << sum/numloops << std::endl;
}
This paper is a bit outdated (2010) but it is sufficiently up to date to give you a good introduction to micro-benchmarking:
How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures
I have heard that passing a variable by reference is not always faster than passing by value. Passing by reference is faster for big variables, but for small ones the question can be tricky.
Passing by value requires time to create a copy, but accessing a local value should be faster.
Passing by reference does not waste time creating a copy, but there is an extra indirection: first the pointer, then the data it points to.
I am aware that this detail is not so important in optimization problems, but it was interesting for me to measure it (I know that -O0 is passé for optimization, but this code is too simple; after optimization I was not sure what I was measuring).
g++ -std=c++14 -O0 -g3 -DSIZE_OF_DATA_ARRAY=16 main.cpp && ./a.out
g++ (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406
SIZE_OF_DATA_ARRAY | copy time [s] | reference time [s]
4                  | 0.04          | 0.045
8                  | 0.04          | 0.46
16                 | 0.04          | 0.05
17                 | 0.07          | 0.05
24                 | 0.07          | 0.05
My questions:
Why is the execution time roughly constant for copying, regardless of struct size?
Why is there a threshold between 16 and 17 for copying?
My guess: it is connected with the cache.
My code:
#include <iostream>
#include <vector>
#include <limits>
#include <chrono>
#include <iomanip>
#include <vector>
#include <algorithm>
struct Data {
double x[SIZE_OF_DATA_ARRAY];
};
double workOnData(Data &data) {
for (auto i = 0; i < 10; ++i) {
data.x[0] -= 0.5 * (data.x[0] - 1);
}
return data.x[0];
}
void runTestSuite() {
auto queries = 1000000;
Data data;
for (auto i = 0; i < queries; ++i) {
data.x[0] = i;
auto val = workOnData(data);
if (val == -357)
data.x[0] = 1;
}
}
int main() {
std::cout << "sizeof(Data) = " << sizeof(Data) << "\n";
size_t numberOfTests = 99;
std::vector<std::chrono::duration<double>> timeMeasurements{numberOfTests};
std::chrono::time_point<std::chrono::system_clock> startTime, endTime;
for (auto i = 0; i < numberOfTests; ++i) {
startTime = std::chrono::system_clock::now();
runTestSuite();
endTime = std::chrono::system_clock::now();
timeMeasurements[i] = endTime - startTime;
}
std::sort(timeMeasurements.begin(), timeMeasurements.end());
std::chrono::system_clock::time_point now = std::chrono::system_clock::now();
std::time_t now_c = std::chrono::system_clock::to_time_t(now);
std::cout << std::put_time(std::localtime(&now_c), "%F %T")
<< ": median time = " << timeMeasurements[numberOfTests * 0.5].count() << "s\n";
return 0;
}
Why is the execution time roughly constant for copying, regardless of struct size?
The best way to understand this is to view the assembly language and see which instructions the compiler emitted. The optimizations here depend on the compiler's optimization settings and whether you are in a release or debug configuration.
It also depends on the processor. For example, some processors have specialized instructions for copying large blocks of memory. Others copy data in parallel chunks, depending on the size of the structure. Some platforms may even have hardware assistance, such as a DMA controller.
Also, sometimes unrolling may be faster than using special instructions or hardware assistance (it depends on the data size).
Why is there a threshold between 16 and 17 for copying?
The threshold may be between alignment boundaries and non-alignment.
Let's take a 32-bit processor. It likes to access (fetch) 4 bytes at a time. Accessing 24 bytes requires 6 fetches. Accessing 16 bytes takes 4 fetches.
However, accessing 17, 18, or 19 bytes requires 5 fetches: it has to fetch another 4 bytes to pick up the remaining bytes.
Another scenario is the implementation of the copy function. Some copy routines use 32-bit copies for the first set of 4-byte quantities, then switch to byte copying for the remainder. Others may switch to byte copying for all the bytes, depending on the size of the data. Many possibilities.
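As a purely illustrative sketch of that kind of copy routine (not what your compiler or library actually does - check the disassembly for the real answer): whole 32-bit words first, then a byte loop for whatever is left.
#include <cstdint>
#include <cstring>

void copy_words_then_bytes(void* dst, const void* src, std::size_t n) {
    auto* d = static_cast<unsigned char*>(dst);
    auto* s = static_cast<const unsigned char*>(src);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {        // copy 4-byte chunks
        std::uint32_t w;
        std::memcpy(&w, s + i, 4);      // memcpy avoids alignment/aliasing issues
        std::memcpy(d + i, &w, 4);
    }
    for (; i < n; ++i)                  // copy the remaining bytes one by one
        d[i] = s[i];
}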
The truth for your system lies in debugging the assembly language or function for copying the data.
Cache Hits & Misses
Your performance metrics may be skewed by processor cache operations. If the processor has your data in cache, the loop will be a lot faster. Usually there will be a performance hit for the first access of the data. There may be more time wasted reloading the data cache if your data is too big for the cache or lies outside the domain of the data cache size.
Instruction Cache Issues
A lot of processors have deep pipelines and instruction caches. When they encounter a branch (such as the end of a for loop), the processor may have to flush its pipeline and refetch instructions from another location in the program. This takes time. You can demonstrate this by unrolling the loop in different size chunks and measuring the performance.
C++17 added std::hardware_destructive_interference_size and std::hardware_constructive_interference_size. At first I thought it was just a portable way to get the size of an L1 cache line, but that is an oversimplification.
Questions:
How are these constants related to the L1 cache line size?
Is there a good example that demonstrates their use cases?
Both are defined static constexpr. Is that not a problem if you build a binary and execute it on other machines with different cache line sizes? How can it protect against false sharing in that scenario when you are not certain on which machine your code will be running?
The intent of these constants is indeed to get the cache-line size. The best place to read about the rationale for them is in the proposal itself:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0154r1.html
I'll quote a snippet of the rationale here for ease-of-reading:
[...] the granularity of memory that does not interfere (to the first-order) [is] commonly referred to as the cache-line size.
Uses of cache-line size fall into two broad categories:
Avoiding destructive interference (false-sharing) between objects with temporally disjoint runtime access patterns from different threads.
Promoting constructive interference (true-sharing) between objects which have temporally local runtime access patterns.
The most significant issue with this useful implementation quantity is the questionable portability of the methods used in current practice to determine its value, despite their pervasiveness and popularity as a group. [...]
We aim to contribute a modest invention for this cause, abstractions for this quantity that can be conservatively defined for given purposes by implementations:
Destructive interference size: a number that’s suitable as an offset between two objects to likely avoid false-sharing due to different runtime access patterns from different threads.
Constructive interference size: a number that’s suitable as a limit on two objects’ combined memory footprint size and base alignment to likely promote true-sharing between them.
In both cases these values are provided on a quality of implementation basis, purely as hints that are likely to improve performance. These are ideal portable values to use with the alignas() keyword, for which there currently exists nearly no standard-supported portable uses.
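To make the alignas() remark concrete, here is a minimal C++17 sketch (the type names are mine): pad per-thread data out to the destructive size, and keep data that is accessed together within the constructive size.
#include <atomic>
#include <new>  // the two interference-size constants live here

// Each counter gets its own cache line, so different threads do not false-share.
struct alignas(std::hardware_destructive_interference_size) PaddedCounter {
    std::atomic<long> value{0};
};

// Members that are always touched together should fit within one line.
struct HotPair {
    int key;
    int hits;
};
static_assert(sizeof(HotPair) <= std::hardware_constructive_interference_size, "");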
"How are these constants related to the L1 cache line size?"
In theory, pretty directly.
Assume the compiler knows exactly what architecture you'll be running on - then these would almost certainly give you the L1 cache-line size precisely. (As noted later, this is a big assumption.)
For what it's worth, I would almost always expect these values to be the same. I believe the only reason they are declared separately is for completeness. (That said, maybe a compiler wants to estimate L2 cache-line size instead of L1 cache-line size for constructive interference; I don't know if this would actually be useful, though.)
"Is there a good example that demonstrates their use cases?"
At the bottom of this answer I've attached a long benchmark program that demonstrates false-sharing and true-sharing.
It demonstrates false-sharing by allocating an array of int wrappers: in one case multiple elements fit in the L1 cache-line, and in the other a single element takes up the L1 cache-line. In a tight loop, a single, fixed element is chosen from the array and updated repeatedly.
It demonstrates true-sharing by allocating a single pair of ints in a wrapper: in one case, the two ints within the pair do not fit in L1 cache-line size together, and in the other they do. In a tight loop, each element of the pair is updated repeatedly.
Note that the code for accessing the object under test does not change; the only difference is the layout and alignment of the objects themselves.
I don't have a C++17 compiler (and assume most people currently don't either), so I've replaced the constants in question with my own. You need to update these values to be accurate on your machine. That said, 64 bytes is probably the correct value on typical modern desktop hardware (at the time of writing).
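If you would rather check the right value for your machine than guess, one Linux-specific way (a glibc extension, not part of the benchmark below) is:
#include <unistd.h>
#include <cstdio>

int main() {
    // Returns the L1 data-cache line size in bytes, or 0/-1 if unknown.
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    std::printf("L1 dcache line size: %ld\n", line);
}
// Or from the shell:
//   cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size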
Warning: the test will use all cores on your machines, and allocate ~256MB of memory. Don't forget to compile with optimizations!
On my machine, the output is:
Hardware concurrency: 16
sizeof(naive_int): 4
alignof(naive_int): 4
sizeof(cache_int): 64
alignof(cache_int): 64
sizeof(bad_pair): 72
alignof(bad_pair): 4
sizeof(good_pair): 8
alignof(good_pair): 4
Running naive_int test.
Average time: 0.0873625 seconds, useless result: 3291773
Running cache_int test.
Average time: 0.024724 seconds, useless result: 3286020
Running bad_pair test.
Average time: 0.308667 seconds, useless result: 6396272
Running good_pair test.
Average time: 0.174936 seconds, useless result: 6668457
I get ~3.5x speedup by avoiding false-sharing, and ~1.7x speedup by ensuring true-sharing.
"Both are defined static constexpr. Is that not a problem if you build a binary and execute it on other machines with different cache line sizes? How can it protect against false sharing in that scenario when you are not certain on which machine your code will be running?"
This will indeed be a problem. These constants are not guaranteed to map to any cache-line size on the target machine in particular, but are intended to be the best approximation the compiler can muster up.
This is noted in the proposal, and in the appendix they give an example of how some libraries try to detect cache-line size at compile time based on various environmental hints and macros. You are guaranteed that this value is at least alignof(max_align_t), which is an obvious lower bound.
In other words, this value should be used as your fallback case; you are free to define a precise value if you know it, e.g.:
constexpr std::size_t cache_line_size() {
#ifdef KNOWN_L1_CACHE_LINE_SIZE
return KNOWN_L1_CACHE_LINE_SIZE;
#else
return std::hardware_destructive_interference_size;
#endif
}
During compilation, if you want to assume a cache-line size just define KNOWN_L1_CACHE_LINE_SIZE.
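A hypothetical use of that fallback, with the macro supplied on the command line:
// g++ -std=c++17 -DKNOWN_L1_CACHE_LINE_SIZE=64 ...
struct alignas(cache_line_size()) Slot {   // cache_line_size() from the snippet above
    int value;
};
static_assert(alignof(Slot) == cache_line_size(), "");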
Hope this helps!
Benchmark program:
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <iostream>
#include <mutex>   // std::mutex, std::unique_lock
#include <random>
#include <thread>
#include <tuple>   // std::tuple, std::get, std::make_tuple
#include <vector>
// !!! YOU MUST UPDATE THIS TO BE ACCURATE !!!
constexpr std::size_t hardware_destructive_interference_size = 64;
// !!! YOU MUST UPDATE THIS TO BE ACCURATE !!!
constexpr std::size_t hardware_constructive_interference_size = 64;
constexpr unsigned kTimingTrialsToComputeAverage = 100;
constexpr unsigned kInnerLoopTrials = 1000000;
typedef unsigned useless_result_t;
typedef double elapsed_secs_t;
//////// CODE TO BE SAMPLED:
// wraps an int, default alignment allows false-sharing
struct naive_int {
int value;
};
static_assert(alignof(naive_int) < hardware_destructive_interference_size, "");
// wraps an int, cache alignment prevents false-sharing
struct cache_int {
alignas(hardware_destructive_interference_size) int value;
};
static_assert(alignof(cache_int) == hardware_destructive_interference_size, "");
// wraps a pair of int, purposefully pushes them too far apart for true-sharing
struct bad_pair {
int first;
char padding[hardware_constructive_interference_size];
int second;
};
static_assert(sizeof(bad_pair) > hardware_constructive_interference_size, "");
// wraps a pair of int, ensures they fit nicely together for true-sharing
struct good_pair {
int first;
int second;
};
static_assert(sizeof(good_pair) <= hardware_constructive_interference_size, "");
// accesses a specific array element many times
template <typename T, typename Latch>
useless_result_t sample_array_threadfunc(
Latch& latch,
unsigned thread_index,
T& vec) {
// prepare for computation
std::random_device rd;
std::mt19937 mt{ rd() };
std::uniform_int_distribution<int> dist{ 0, 4096 };
auto& element = vec[vec.size() / 2 + thread_index];
latch.count_down_and_wait();
// compute
for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
element.value = dist(mt);
}
return static_cast<useless_result_t>(element.value);
}
// accesses a pair's elements many times
template <typename T, typename Latch>
useless_result_t sample_pair_threadfunc(
Latch& latch,
unsigned thread_index,
T& pair) {
// prepare for computation
std::random_device rd;
std::mt19937 mt{ rd() };
std::uniform_int_distribution<int> dist{ 0, 4096 };
latch.count_down_and_wait();
// compute
for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
pair.first = dist(mt);
pair.second = dist(mt);
}
return static_cast<useless_result_t>(pair.first) +
static_cast<useless_result_t>(pair.second);
}
//////// UTILITIES:
// utility: allow threads to wait until everyone is ready
class threadlatch {
public:
explicit threadlatch(const std::size_t count) :
count_{ count }
{}
void count_down_and_wait() {
std::unique_lock<std::mutex> lock{ mutex_ };
if (--count_ == 0) {
cv_.notify_all();
}
else {
cv_.wait(lock, [&] { return count_ == 0; });
}
}
private:
std::mutex mutex_;
std::condition_variable cv_;
std::size_t count_;
};
// utility: runs a given function in N threads
std::tuple<useless_result_t, elapsed_secs_t> run_threads(
const std::function<useless_result_t(threadlatch&, unsigned)>& func,
const unsigned num_threads) {
threadlatch latch{ num_threads + 1 };
std::vector<std::future<useless_result_t>> futures;
std::vector<std::thread> threads;
for (unsigned thread_index = 0; thread_index != num_threads; ++thread_index) {
std::packaged_task<useless_result_t()> task{
std::bind(func, std::ref(latch), thread_index)
};
futures.push_back(task.get_future());
threads.push_back(std::thread(std::move(task)));
}
const auto starttime = std::chrono::high_resolution_clock::now();
latch.count_down_and_wait();
for (auto& thread : threads) {
thread.join();
}
const auto endtime = std::chrono::high_resolution_clock::now();
const auto elapsed = std::chrono::duration_cast<
std::chrono::duration<double>>(
endtime - starttime
).count();
useless_result_t result = 0;
for (auto& future : futures) {
result += future.get();
}
return std::make_tuple(result, elapsed);
}
// utility: sample the time it takes to run func on N threads
void run_tests(
const std::function<useless_result_t(threadlatch&, unsigned)>& func,
const unsigned num_threads) {
useless_result_t final_result = 0;
double avgtime = 0.0;
for (unsigned trial = 0; trial != kTimingTrialsToComputeAverage; ++trial) {
const auto result_and_elapsed = run_threads(func, num_threads);
const auto result = std::get<useless_result_t>(result_and_elapsed);
const auto elapsed = std::get<elapsed_secs_t>(result_and_elapsed);
final_result += result;
avgtime = (avgtime * trial + elapsed) / (trial + 1);
}
std::cout
<< "Average time: " << avgtime
<< " seconds, useless result: " << final_result
<< std::endl;
}
int main() {
const auto cores = std::thread::hardware_concurrency();
std::cout << "Hardware concurrency: " << cores << std::endl;
std::cout << "sizeof(naive_int): " << sizeof(naive_int) << std::endl;
std::cout << "alignof(naive_int): " << alignof(naive_int) << std::endl;
std::cout << "sizeof(cache_int): " << sizeof(cache_int) << std::endl;
std::cout << "alignof(cache_int): " << alignof(cache_int) << std::endl;
std::cout << "sizeof(bad_pair): " << sizeof(bad_pair) << std::endl;
std::cout << "alignof(bad_pair): " << alignof(bad_pair) << std::endl;
std::cout << "sizeof(good_pair): " << sizeof(good_pair) << std::endl;
std::cout << "alignof(good_pair): " << alignof(good_pair) << std::endl;
{
std::cout << "Running naive_int test." << std::endl;
std::vector<naive_int> vec;
vec.resize((1u << 28) / sizeof(naive_int)); // allocate 256 mebibytes
run_tests([&](threadlatch& latch, unsigned thread_index) {
return sample_array_threadfunc(latch, thread_index, vec);
}, cores);
}
{
std::cout << "Running cache_int test." << std::endl;
std::vector<cache_int> vec;
vec.resize((1u << 28) / sizeof(cache_int)); // allocate 256 mebibytes
run_tests([&](threadlatch& latch, unsigned thread_index) {
return sample_array_threadfunc(latch, thread_index, vec);
}, cores);
}
{
std::cout << "Running bad_pair test." << std::endl;
bad_pair p;
run_tests([&](threadlatch& latch, unsigned thread_index) {
return sample_pair_threadfunc(latch, thread_index, p);
}, cores);
}
{
std::cout << "Running good_pair test." << std::endl;
good_pair p;
run_tests([&](threadlatch& latch, unsigned thread_index) {
return sample_pair_threadfunc(latch, thread_index, p);
}, cores);
}
}
I would almost always expect these values to be the same.
Regarding the above, I would like to make a minor contribution to the accepted answer. A while ago, I saw a very good use case in the folly library where these two should be defined separately. Please see the caveat about the Intel Sandy Bridge processor.
https://github.com/facebook/folly/blob/3af92dbe6849c4892a1fe1f9366306a2f5cbe6a0/folly/lang/Align.h
// Memory locations within the same cache line are subject to destructive
// interference, also known as false sharing, which is when concurrent
// accesses to these different memory locations from different cores, where at
// least one of the concurrent accesses is or involves a store operation,
// induce contention and harm performance.
//
// Microbenchmarks indicate that pairs of cache lines also see destructive
// interference under heavy use of atomic operations, as observed for atomic
// increment on Sandy Bridge.
//
// We assume a cache line size of 64, so we use a cache line pair size of 128
// to avoid destructive interference.
//
// mimic: std::hardware_destructive_interference_size, C++17
constexpr std::size_t hardware_destructive_interference_size =
kIsArchArm ? 64 : 128;
static_assert(hardware_destructive_interference_size >= max_align_v, "math?");
// Memory locations within the same cache line are subject to constructive
// interference, also known as true sharing, which is when accesses to some
// memory locations induce all memory locations within the same cache line to
// be cached, benefiting subsequent accesses to different memory locations
// within the same cache line and helping performance.
//
// mimic: std::hardware_constructive_interference_size, C++17
constexpr std::size_t hardware_constructive_interference_size = 64;
static_assert(hardware_constructive_interference_size >= max_align_v, "math?");
I've tested the above code, but I think there is a minor error preventing us from understanding the underlying behaviour: to prevent false sharing, a single cache line should not be shared between two distinct atomics.
I've changed the definition of those structs.
struct naive_int {
alignas(sizeof(int)) std::atomic<int> value;
};
struct cache_int {
alignas(hardware_constructive_interference_size) std::atomic<int> value;
};
struct bad_pair {
// two atomics sharing a single 64-byte cache line
alignas(hardware_constructive_interference_size) std::atomic<int> first;
std::atomic<int> second;
};
struct good_pair {
// first cache line begins here
alignas(hardware_constructive_interference_size) std::atomic<int> first;
// this one is still in the first cache line
std::atomic<int> first_s;
// second cache line starts here
alignas(hardware_constructive_interference_size) std::atomic<int> second;
// this one is still in the second cache line
std::atomic<int> second_s;
};
And the resulting run:
Hardware concurrency := 40
sizeof(naive_int) := 4
alignof(naive_int) := 4
sizeof(cache_int) := 64
alignof(cache_int) := 64
sizeof(bad_pair) := 64
alignof(bad_pair) := 64
sizeof(good_pair) := 128
alignof(good_pair) := 64
Running naive_int test.
Average time: 0.060303 seconds, useless result: 8212147
Running cache_int test.
Average time: 0.0109432 seconds, useless result: 8113799
Running bad_pair test.
Average time: 0.162636 seconds, useless result: 16289887
Running good_pair test.
Average time: 0.129472 seconds, useless result: 16420417
I experienced a lot of variance in the last result, but I never pinned the threads to specific cores for this particular problem. Anyway, this ran on two Xeon 2690V2 CPUs, and from various runs using either 64 or 128 for hardware_constructive_interference_size I found 64 to be more than enough, while 128 made very poor use of the available cache.
I suddenly realized that your question helps me understand what Jeff Preshing was talking about - it's all about the payload!
Does mt19937_64 have a higher throughput (bit/s) than the 32 bit version, mt19937, assuming a 64 bit architecture?
What about after vectorization?
As @byjoe points out, this obviously depends on the compiler.
In this case, it seems to be considerably more dependent on the compiler than is typical though. For example, the Boost test linked in the comments uses the compiler from VC++ 2010, and shows only a fairly slight increase in random bits per second from using mt19937_64.
To get more up-to-date information, I whipped up a simple test:
#include <random>
#include <chrono>
#include <iostream>
#include <iomanip>
template <class T, class U>
U test(char const *label, U count) {
using namespace std::chrono;
T gen(100);
U result = 0;
auto start = high_resolution_clock::now();
for (U i = 0; i < count; i++)
result ^= gen();
auto stop = high_resolution_clock::now();
std::cout << "Time for " << std::left << std::setw(12) << label
<< duration_cast<milliseconds>(stop - start).count() << "\n";
return result;
}
int main(int argc, char **argv) {
unsigned long long limit = 1000000000;
auto result1 = test<std::mt19937>("mt19937: ", limit);
auto result2 = test<std::mt19937_64>("mt19937_64: ", limit);
std::cout << "Ignore: " << result1 << ", " << result2 << "\n";
}
With VC++ 2015 update 3 (with /o2b2 /GL, though it probably doesn't matter), I got results like these:
Time for mt19937: 4339
Time for mt19937_64: 4215
Ignore: 2598366015, 13977046647333287932
This shows mt19937_64 as being slightly faster per call, so over twice as fast per bit as mt19937. With MinGW (using -O3), the results were much more like those linked from the Boost site:
Time for mt19937: 2211
Time for mt19937_64: 4183
Ignore: 2598366015, 13977046647333287932
In this case, mt19937_64 takes just a little less than twice the time per call, so it's only slightly faster per bit. The highest overall speed seems to be from g++ with mt19937_64, but the difference between g++ and VC++ (on these runs) is less than 1%, so I'm not sure it's reproducible.
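To put the per-bit comparison into rough numbers using the MinGW timings above: mt19937 produced 10^9 * 32 bits in 2.211 s, about 14.5 Gbit/s, while mt19937_64 produced 10^9 * 64 bits in 4.183 s, about 15.3 Gbit/s - roughly 6% more bits per second, which is indeed only slightly faster per bit.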
For what it's worth, the difference in speed (per call) between mt19937 and mt19937_64 with VC++ is also pretty small, but does seem to be reproducible--it happened quite consistently in my testing. I did wonder about whether that might be (at least partially) a matter of clock management--that when the code first started, the CPU was idle, and the clock had been slowed, so the first part of the first run was at a lower clock speed. To check, I reversed the order to test mt19937_64 first. I think my hypothesis was at least partially correct--when I reversed the order, mt19937_64 slowed down compared to mt19937, so they were nearly identical on a per-call basis with VC++.
It clearly depends on your compiler and its implementation. I just tested, and the 64-bit version takes about 60% longer call-for-call, which makes the 64-bit version about 25% faster bit-for-bit. I tested with an i7 CPU.
If you need max speed, you may want to consider using something else. Especially if the numbers don't need to be very high quality.
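The answer does not name a specific alternative; as one illustration of "something else" that is fast but of lower statistical quality than the Mersenne Twister, a splitmix64-style generator needs only a few arithmetic operations per 64-bit output:
#include <cstdint>

// splitmix64: tiny and fast, statistically decent but not cryptographic and
// not a drop-in replacement for mt19937_64 where quality guarantees matter.
struct SplitMix64 {
    std::uint64_t state;
    explicit SplitMix64(std::uint64_t seed) : state(seed) {}
    std::uint64_t next() {
        std::uint64_t z = (state += 0x9E3779B97F4A7C15ULL);
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
        return z ^ (z >> 31);
    }
};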
This question already has answers here:
What is the fastest way to convert float to int on x86
(10 answers)
Closed 8 years ago.
We're doing a great deal of floating-point to integer number conversions in our project. Basically, something like this
for(int i = 0; i < HUGE_NUMBER; i++)
int_array[i] = float_array[i];
The default C function which performs the conversion turns out to be quite time consuming.
Is there any workaround (maybe a hand-tuned function) which can speed up the process a little bit? We don't care much about precision.
Most of the other answers here just try to eliminate loop overhead.
Only deft_code's answer gets to the heart of what is likely the real problem -- that converting floating point to integers is shockingly expensive on an x86 processor. deft_code's solution is correct, though he gives no citation or explanation.
Here is the source of the trick, with some explanation and also versions specific to whether you want to round up, down, or toward zero: Know your FPU
Sorry to provide a link, but really anything written here, short of reproducing that excellent article, is not going to make things clear.
inline int float2int( double d )
{
union Cast
{
double d;
long l;
};
volatile Cast c;
c.d = d + 6755399441055744.0;
return c.l;
}
// this is the same thing but it's
// not always optimizer safe
// (renamed so both versions can coexist in one translation unit)
inline int float2int_unsafe( double d )
{
d += 6755399441055744.0;
return reinterpret_cast<int&>(d);
}
for(int i = 0; i < HUGE_NUMBER; i++)
int_array[i] = float2int(float_array[i]);
The double parameter is not a mistake! There is a way to do this trick with floats directly, but it gets ugly trying to cover all the corner cases. In its current form this function will round the float to the nearest whole number; if you want truncation instead, use 6755399441055743.5 (0.5 less).
I ran some tests on different ways of doing float-to-int conversion. The short answer is to assume your customer has SSE2-capable CPUs and set the /arch:SSE2 compiler flag. This will allow the compiler to use the SSE scalar instructions which are twice as fast as even the magic-number technique.
Otherwise, if you have long strings of floats to grind, use the SSE2 packed ops.
There's an FISTTP instruction in the SSE3 instruction set which does what you want, but as to whether or not it could be utilized and produce faster results than libc, I have no idea.
Is the time large enough that it outweighs the cost of starting a couple of threads?
Assuming you have a multi-core processor or multiple processors on your box that you could take advantage of, this would be a trivial task to parallelize across multiple threads.
The key is to avoid the _ftol() function, which is needlessly slow. Your best bet for long lists of data like this is to use the SSE2 instruction cvtps2dq, which converts four packed floats to four packed int32s in a single instruction, so you get a whole vector of results per op with no shuffling needed. You don't need assembly to do this; MSVC exposes compiler intrinsics for the relevant instructions -- _mm_cvtps_epi32() for packed floats (_mm_cvtpd_epi32() is the double-precision variant).
If you do this it is very important that your float and int arrays be 16-byte aligned so that the SSE2 load/store intrinsics can work at maximum efficiency. Also, I recommend you software pipeline a little and process sixteen floats at once in each loop, eg (assuming that the "functions" here are actually calls to compiler intrinsics):
for(int i = 0; i < HUGE_NUMBER; i+=16)
{
//int_array[i] = float_array[i];
__m128 a = sse_load4(float_array+i+0);
__m128 b = sse_load4(float_array+i+4);
__m128 c = sse_load4(float_array+i+8);
__m128 d = sse_load4(float_array+i+12);
a = sse_convert4(a);
b = sse_convert4(b);
c = sse_convert4(c);
d = sse_convert4(d);
sse_write4(int_array+i+0, a);
sse_write4(int_array+i+4, b);
sse_write4(int_array+i+8, c);
sse_write4(int_array+i+12, d);
}
The reason for this is that the SSE instructions have a long latency, so if you follow a load into xmm0 immediately with a dependent operation on xmm0 then you will have a stall. Having multiple registers "in flight" at once hides the latency a little. (Theoretically a magic all-knowing compiler could alias its way around this problem but in practice it doesn't.)
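For reference, here is what one iteration of that loop might look like with the actual SSE2 intrinsics spelled out (my own sketch for MSVC or GCC/Clang with <emmintrin.h>; note that cvtps2dq uses the current rounding mode, round-to-nearest by default, rather than C's truncation):
#include <emmintrin.h>

// Converts 16 floats starting at index i; both arrays must be 16-byte aligned.
void convert16(const float* float_array, int* int_array, int i) {
    __m128 a = _mm_load_ps(float_array + i + 0);
    __m128 b = _mm_load_ps(float_array + i + 4);
    __m128 c = _mm_load_ps(float_array + i + 8);
    __m128 d = _mm_load_ps(float_array + i + 12);
    __m128i ia = _mm_cvtps_epi32(a);   // cvtps2dq: 4 floats -> 4 int32s
    __m128i ib = _mm_cvtps_epi32(b);
    __m128i ic = _mm_cvtps_epi32(c);
    __m128i id = _mm_cvtps_epi32(d);
    _mm_store_si128(reinterpret_cast<__m128i*>(int_array + i + 0), ia);
    _mm_store_si128(reinterpret_cast<__m128i*>(int_array + i + 4), ib);
    _mm_store_si128(reinterpret_cast<__m128i*>(int_array + i + 8), ic);
    _mm_store_si128(reinterpret_cast<__m128i*>(int_array + i + 12), id);
}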
Failing this SSE juju, you can supply the /QIfist option to MSVC, which will cause it to issue the single opcode fist instead of a call to _ftol; this means it will simply use whichever rounding mode happens to be set in the CPU without making sure it is ANSI C's specific truncate op. The Microsoft docs say /QIfist is deprecated because their floating point code is fast now, but a disassembler will show you that this is unjustifiably optimistic. Even /fp:fast simply results in a call to _ftol_sse2, which, though faster than the egregious _ftol, is still a function call followed by a long-latency SSE op, and thus unnecessarily slow.
I'm assuming you're on x86 arch, by the way -- if you're on PPC there are equivalent VMX operations, or you can use the magic-number-multiply trick mentioned above followed by a vsel (to mask out the non-mantissa bits) and an aligned store.
You might be able to load the values into the SSE unit of your processor using some assembly code, do the conversion to ints there, and then write the results back out. I'm not sure this would be any faster, though. I'm not an SSE guru, so I don't know how to do this. Maybe someone else can chime in.
In Visual C++ 2008, the compiler generates SSE2 calls by itself, if you do a release build with maxed out optimization options, and look at a disassembly (though some conditions have to be met, play around with your code).
See this Intel article for speeding up integer conversions:
http://software.intel.com/en-us/articles/latency-of-floating-point-to-integer-conversions/
According to Microsoft, the /QIfist compiler option is deprecated in VS 2005 because integer conversion has been sped up. They neglect to say how it has been sped up, but looking at the disassembly listing might give a clue.
http://msdn.microsoft.com/en-us/library/z8dh4h17(vs.80).aspx
Most C compilers generate a call to _ftol or something similar for every float-to-int conversion. Using a reduced floating-point conformance switch (like /fp:fast) might help - IF you understand AND accept the other effects of this switch. Other than that, put the thing in a tight assembly or SSE intrinsic loop, IF you are OK with AND understand the different rounding behavior.
For large loops like your example, you should write a function that sets up the floating-point control word once, does the bulk rounding with only fistp instructions, and then resets the control word - IF you are OK with an x86-only code path, but at least you will not change the rounding.
Read up on the fld and fistp FPU instructions and the FPU control word.
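A portable sketch of the same idea - set the rounding mode once, convert in bulk, restore it - without hand-written fistp, using <cfenv> (my own example, not from the answer above):
#include <cfenv>
#include <cmath>
// #pragma STDC FENV_ACCESS ON   // formally required when changing the FP environment

void bulk_truncate(const float* in, int* out, int n) {
    const int old_mode = std::fegetround();
    std::fesetround(FE_TOWARDZERO);                       // set the rounding mode once
    for (int i = 0; i < n; ++i)
        out[i] = static_cast<int>(std::lrintf(in[i]));    // converts in the current mode
    std::fesetround(old_mode);                            // restore it afterwards
}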
What compiler are you using? In Microsoft's more recent C/C++ compilers, there is an option under C/C++ -> Code Generation -> Floating point model, which has options: fast, precise, strict. I think precise is the default, and works by emulating FP operations to some extent. If you are using a MS compiler, how is this option set? Does it help to set it to "fast"? In any case, what does the disassembly look like?
As thirtyseven said above, the CPU can convert float<->int in essentially one instruction, and it doesn't get any faster than that (short of a SIMD operation).
Also note that modern CPUs use the same FP unit for both single (32 bit) and double (64 bit) FP numbers, so unless you are trying to save memory storing a lot of floats, there's really no reason to favor float over double.
On Intel your best bet is inline SSE2 calls.
I'm surprised by your result. What compiler are you using? Are you compiling with optimization turned all the way up? Have you confirmed using valgrind and Kcachegrind that this is where the bottleneck is? What processor are you using? What does the assembly code look like?
The conversion itself should be compiled to a single instruction. A good optimizing compiler should unroll the loop so that several conversions are done per test-and-branch. If that's not happening, you can unroll the loop by hand:
int i;
for(i = 0; i < HUGE_NUMBER-3; i += 4) {
int_array[i] = float_array[i];
int_array[i+1] = float_array[i+1];
int_array[i+2] = float_array[i+2];
int_array[i+3] = float_array[i+3];
}
for(; i < HUGE_NUMBER; i++)
int_array[i] = float_array[i];
If your compiler is really pathetic, you might need to help it with the common subexpressions, e.g.,
int *ip = int_array+i;
float *fp = float_array+i;
ip[0] = fp[0];
ip[1] = fp[1];
ip[2] = fp[2];
ip[3] = fp[3];
Do report back with more info!
If you do not care very much about the rounding semantics, you can use the lrint() function. This allows for more freedom in rounding and it can be much faster.
Technically, it's a C99 function, but your compiler probably exposes it in C++. A good compiler will also inline it to one instruction (a modern G++ will).
lrint documentation
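Applied to the question's loop, that would look roughly like the sketch below (hypothetical wrapper function; this rounds to nearest under the default rounding mode rather than truncating):
#include <cmath>

void convert(const float* float_array, int* int_array, int n) {
    for (int i = 0; i < n; i++)
        int_array[i] = static_cast<int>(std::lrintf(float_array[i]));  // usually one cvtss2si
}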
Rounding only: the magic-number trick above is excellent, but using 6755399441055743.5 (0.5 less) to get truncation won't work.
6755399441055744 = 2^52 + 2^51; adding it overflows the fractional digits off the end of the mantissa, leaving the integer that you want in bits 51-0 of the FPU register.
In IEEE 754,
6755399441055744.0 =
sign exponent mantissa
0 10000110011 1000000000000000000000000000000000000000000000000000
6755399441055743.5, however, will also compile to
0100001100111000000000000000000000000000000000000000000000000000
because the extra 0.5 rounds up off the end of the mantissa - the same effect that makes the trick work in the first place - so both constants end up as the exact same double.
To do truncation you would have to apply the 0.5 adjustment to your input double instead (rather than to the magic constant) and then do this;
the guard digits should take care of rounding to the correct result done this way.
Also watch out for 64-bit gcc on Linux, where long rather annoyingly means a 64-bit integer.
If you have very large arrays (bigger than a few MB--the size of the CPU cache), time your code and see what the throughput is. You're probably saturating the memory bus, not the FP unit. Look up the maximum theoretical bandwidth for your CPU and see how close to it you are.
If you're being limited by the memory bus, extra threads will just make it worse. You need better hardware (e.g. faster memory, different CPU, different motherboard).
In response to Larry Gritz's comment...
You are correct: the FPU is a major bottleneck (and using the xs_CRoundToInt trick allows one to come very close to saturating the memory bus).
Here are some test results for a Core 2 (Q6600) processor. The theoretical main-memory bandwidth for this machine is 3.2 GB/s (L1 and L2 bandwidths are much higher). The code was compiled with Visual Studio 2008. Similar results for 32-bit and 64-bit, and with /O2 or /Ox optimizations.
WRITING ONLY...
1866359 ticks with 33554432 array elements (33554432 touched). Bandwidth: 1.91793 GB/s
154749 ticks with 262144 array elements (33554432 touched). Bandwidth: 23.1313 GB/s
108816 ticks with 8192 array elements (33554432 touched). Bandwidth: 32.8954 GB/s
USING CASTING...
5236122 ticks with 33554432 array elements (33554432 touched). Bandwidth: 0.683625 GB/s
2014309 ticks with 262144 array elements (33554432 touched). Bandwidth: 1.77706 GB/s
1967345 ticks with 8192 array elements (33554432 touched). Bandwidth: 1.81948 GB/s
USING xs_CRoundToInt...
1490583 ticks with 33554432 array elements (33554432 touched). Bandwidth: 2.40144 GB/s
1079530 ticks with 262144 array elements (33554432 touched). Bandwidth: 3.31584 GB/s
1008407 ticks with 8192 array elements (33554432 touched). Bandwidth: 3.5497 GB/s
(Windows) source code:
// floatToIntTime.cpp : Defines the entry point for the console application.
//
#include <windows.h>
#include <iostream>
using namespace std;
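// Magic-number conversion (see xs_CRoundToInt below): adding 2^52 + 2^51
// pushes the fractional bits off the mantissa so the wanted integer ends up
// in the low 32 bits of the double's representation (assumes little-endian
// layout and the default round-to-nearest FPU mode).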
double const _xs_doublemagic = double(6755399441055744.0);
inline int xs_CRoundToInt(double val, double dmr=_xs_doublemagic) {
val = val + dmr;
return ((int*)&val)[0];
}
static size_t const N = 256*1024*1024/sizeof(double);
int I[N];
double F[N];
static size_t const L1CACHE = 128*1024/sizeof(double);
static size_t const L2CACHE = 4*1024*1024/sizeof(double);
static size_t const Sz[] = {N, L2CACHE/2, L1CACHE/2};
static size_t const NIter[] = {1, N/(L2CACHE/2), N/(L1CACHE/2)};
int main(int argc, char *argv[])
{
__int64 freq;
QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
cout << "WRITING ONLY..." << endl;
for (int t=0; t<3; t++) {
__int64 t0,t1;
QueryPerformanceCounter((LARGE_INTEGER*)&t0);
size_t const niter = NIter[t];
size_t const sz = Sz[t];
for (size_t i=0; i<niter; i++) {
for (size_t n=0; n<sz; n++) {
I[n] = 13;
}
}
QueryPerformanceCounter((LARGE_INTEGER*)&t1);
double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
cout << " " << (t1-t0) << " ticks with " << sz
<< " array elements (" << niter*sz << " touched). "
<< "Bandwidth: " << bandwidth << " GB/s" << endl;
}
cout << "USING CASTING..." << endl;
for (int t=0; t<3; t++) {
__int64 t0,t1;
QueryPerformanceCounter((LARGE_INTEGER*)&t0);
size_t const niter = NIter[t];
size_t const sz = Sz[t];
for (size_t i=0; i<niter; i++) {
for (size_t n=0; n<sz; n++) {
I[n] = (int)F[n];
}
}
QueryPerformanceCounter((LARGE_INTEGER*)&t1);
double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
cout << " " << (t1-t0) << " ticks with " << sz
<< " array elements (" << niter*sz << " touched). "
<< "Bandwidth: " << bandwidth << " GB/s" << endl;
}
cout << "USING xs_CRoundToInt..." << endl;
for (int t=0; t<3; t++) {
__int64 t0,t1;
QueryPerformanceCounter((LARGE_INTEGER*)&t0);
size_t const niter = NIter[t];
size_t const sz = Sz[t];
for (size_t i=0; i<niter; i++) {
for (size_t n=0; n<sz; n++) {
I[n] = xs_CRoundToInt(F[n]);
}
}
QueryPerformanceCounter((LARGE_INTEGER*)&t1);
double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
cout << " " << (t1-t0) << " ticks with " << sz
<< " array elements (" << niter*sz << " touched). "
<< "Bandwidth: " << bandwidth << " GB/s" << endl;
}
return 0;
}