I heard that passing a variable by reference is not always faster than passing it by value. Passing by reference is faster for big variables, but for small ones the answer can be tricky.
Passing by value costs time to create the copy, but reading a local variable should then be faster.
Passing by reference wastes no time on creating a copy, but each access has to go through the pointer to reach the data.
I am aware that this detail is rarely important for optimization, but I was interested in measuring it. (I know that -O0 is a poor choice for benchmarking, but this code is too simple: after optimization I was not sure what I was measuring.)
g++ -std=c++14 -O0 -g3 -DSIZE_OF_DATA_ARRAY=16 main.cpp && ./a.out
g++ (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406
SIZE_OF_DATA_ARRAY | copy time[s] | reference time [s]
4 |0.04 |0.045
8 |0.04 |0.46
16|0.04 |0.05
17|0.07 |0.05
24|0.07 |0.05
My questions:
Why is the execution time nearly constant for copying, regardless of struct size?
Why is there a threshold between 16 and 17 for copying?
Guess: it is connected with cache
My code:
#include <iostream>
#include <vector>
#include <limits>
#include <chrono>
#include <iomanip>
#include <ctime>
#include <algorithm>
struct Data {
double x[SIZE_OF_DATA_ARRAY];
};
double workOnData(Data &data) {
for (auto i = 0; i < 10; ++i) {
data.x[0] -= 0.5 * (data.x[0] - 1);
}
return data.x[0];
}
void runTestSuite() {
auto queries = 1000000;
Data data;
for (auto i = 0; i < queries; ++i) {
data.x[0] = i;
auto val = workOnData(data);
if (val == -357)
data.x[0] = 1;
}
}
int main() {
std::cout << "sizeof(Data) = " << sizeof(Data) << "\n";
size_t numberOfTests = 99;
std::vector<std::chrono::duration<double>> timeMeasurements{numberOfTests};
std::chrono::time_point<std::chrono::system_clock> startTime, endTime;
for (auto i = 0; i < numberOfTests; ++i) {
startTime = std::chrono::system_clock::now();
runTestSuite();
endTime = std::chrono::system_clock::now();
timeMeasurements[i] = endTime - startTime;
}
std::sort(timeMeasurements.begin(), timeMeasurements.end());
std::chrono::system_clock::time_point now = std::chrono::system_clock::now();
std::time_t now_c = std::chrono::system_clock::to_time_t(now);
std::cout << std::put_time(std::localtime(&now_c), "%F %T")
<< ": median time = " << timeMeasurements[numberOfTests * 0.5].count() << "s\n";
return 0;
}
Why is the execution time nearly constant for copying, regardless of struct size?
The best understanding is from viewing the assembly language to see the instructions that the compiler emitted. Optimizations here would depend on the optimization setting of the compiler and whether you are in release or debug configuration.
It also depends on the processor. For example, some processors may have specialized instructions for copying large blocks of memory. Other processors may copy data in parallel chunks, depending on the size of the structure. Also, some platforms may have hardware assistance, such as a DMA controller.
Alas, sometimes unrolling may be faster than using special instructions or hardware assistance (depends on the data size).
Why is there a threshold between 16 and 17 for copying?
The threshold may be between alignment boundaries and non-alignment.
Let's take a 32-bit processor. It likes to access (fetch) 4 bytes at a time. Accessing 24 bytes requires 6 fetches. Accessing 16 bytes takes 4 fetches.
However, accessing 17, 18, or 19 bytes requires 5 fetches. It may fetch another 4 bytes to get those remainder bytes.
Another scenario is the implementation of the copy function. Some copy functions may use 32-bit copies for the first set of 4-byte quantities, then switch to byte copying for the remainder. It could switch to byte copying for all the bytes depending on the size of the data. Many possibilities.
The truth for your system lies in debugging the assembly language or function for copying the data.
Cache Hits & Misses
Your performance metrics may be skewed by processor cache operations. If the processor has your data in cache, the loop will be a lot faster. Usually there will be a performance hit for the first access of the data. There may be more time wasted reloading the data cache if your data is too big to fit in it.
Instruction Cache Issues
A lot of processors have deep pipelines and caches for instructions. When they encounter a branch (such as the end of a for loop), the processor may have to flush the pipeline and fetch instructions from another location in the program. This takes time. You can demonstrate this by unrolling the loop in different size chunks and measuring the performance.
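To compare the two calling conventions side by side, a minimal sketch along these lines may help. The names workByValue, workByReference and secondsFor are hypothetical, and Data is turned into a template here purely so that several sizes can be tried in one build; the question's code uses a macro instead.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical templated stand-in for the question's Data struct.
template <std::size_t N>
struct Data {
    double x[N];
};

// By value: the caller copies all N doubles before every call.
template <std::size_t N>
double workByValue(Data<N> data) {
    for (int i = 0; i < 10; ++i) {
        data.x[0] -= 0.5 * (data.x[0] - 1);
    }
    return data.x[0];
}

// By const reference: nothing is copied; x[0] is read through the reference.
template <std::size_t N>
double workByReference(const Data<N>& data) {
    double x0 = data.x[0];
    for (int i = 0; i < 10; ++i) {
        x0 -= 0.5 * (x0 - 1);
    }
    return x0;
}

// Runs f(i) `queries` times and returns the elapsed wall time in seconds.
// The volatile sink keeps every result observable to the optimizer.
template <typename F>
double secondsFor(F f, int queries) {
    volatile double sink = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < queries; ++i) {
        sink = f(i);
    }
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling secondsFor with a lambda that sets data.x[0] = i and then calls one of the two functions gives a per-variant time. Whatever the numbers say, the generated assembly is still the place to check where a copy is actually made.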
I needed a way of initializing a scalar value given either a single float, or three floating point values (corresponding to RGB). So I just threw together a very simple struct:
struct Mono {
float value;
Mono(){
this->value = 0;
}
Mono(float value) {
this->value = value;
};
Mono(float red, float green, float blue){
this->value = (red+green+blue)/3;
};
};
// Multiplication operator overloads:
Mono operator*( Mono const& lhs, Mono const& rhs){
return Mono(lhs.value*rhs.value);
};
Mono operator*( float const& lhs, Mono const& rhs){
return Mono(lhs*rhs.value);
};
Mono operator*( Mono const& lhs, float const& rhs){
return Mono(lhs.value*rhs);
};
This worked as expected, but then I wanted to benchmark it to see if this wrapper is going to impact performance at all, so I wrote the following benchmark test where I simply multiplied a float by the struct 100,000,000 times, and multiplied a float by a float 100,000,000 times:
#include <vector>
#include <chrono>
#include <iostream>
using namespace std::chrono;
int main() {
size_t N = 100000000;
std::vector<float> inputs(N);
std::vector<Mono> outputs_c(N);
std::vector<float> outputs_f(N);
Mono color(3.24);
float color_f = 3.24;
for (size_t i = 0; i < N; i++){
inputs[i] = i;
};
auto start_c = high_resolution_clock::now();
for (size_t i = 0; i < N; i++){
outputs_c[i] = color*inputs[i];
}
auto stop_c = high_resolution_clock::now();
auto duration_c = duration_cast<microseconds>(stop_c - start_c);
std::cout << "Mono*float duration: " << duration_c.count() << "\n";
auto start_f = high_resolution_clock::now();
for (size_t i = 0; i < N; i++){
outputs_f[i] = color_f*inputs[i];
}
auto stop_f = high_resolution_clock::now();
auto duration_f = duration_cast<microseconds>(stop_f - start_f);
std::cout << "float*float duration: " << duration_f.count() << "\n";
return 0;
}
When I compile it without any optimizations: g++ test.cpp, it prints the following times (in microseconds) very reliably:
Mono*float duration: 841122
float*float duration: 656197
So the Mono*float is clearly slower in that case. But then if I turn on optimizations (g++ test.cpp -O3), it prints the following times (in microseconds) very reliably:
Mono*float duration: 75494
float*float duration: 86176
I'm assuming that something is getting optimized weirdly here and it is NOT actually faster to wrap a float in a struct like this... but I'm struggling to see what is going wrong with my test.
On my system (i7-6700k with GCC 12.2.1), whichever loop I do second runs slower, and the asm for the two loops is identical.
Perhaps because cache is still partly primed from the inputs[i] = i; init loop when the first loop runs. (See https://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ re: Intel's adaptive replacement policy for L3 which might explain some but not all of the entries surviving that big init loop. 100000000 floats is 400 MB per array, and my CPU has 8 MiB of L3 cache.)
So as expected from the low computational intensity (one vector math instruction per 16 bytes loaded + stored), it's just a cache / memory bandwidth benchmark since you used one huge array instead of repeated passes over a smaller array. Nothing to do with whether you have a bare float or a struct {float; }
As expected, both loops compile to the same asm - https://godbolt.org/z/7eTh4ojYf - doing a movups load, mulps to multiply 4 floats, and a movups unaligned store. For some reason, GCC reloads the vector constant of 3.24 instead of hoisting it out of the loop, so it's doing 2 loads and 1 store per multiply. Cache misses on the big arrays should give plenty of time for out-of-order exec to do those extra loads from the same .rodata address that hit in L1d cache every time.
I tried How can I mitigate the impact of the Intel jcc erratum on gcc? but it didn't make a difference; still about the same performance delta with -Wa,-mbranches-within-32B-boundaries, so as expected it's not a front-end bottleneck; IPC is plenty low. Maybe some quirk of cache.
On my system (Linux 6.1.8 on i7-6700k at 3.9GHz, compiled with GCC 12.2.1 -O3 without -march=native or -ffast-math), your whole program spends nearly half its time in the kernel's page fault handler (perf stat vs. perf stat --all-user cycle counts). So that's not great if you're not trying to benchmark memory allocation and TLB misses.
But that's total time; you do touch the input and output arrays before the loop (std::vector<float> outputs_c(N); allocates and zeros space for N elements, same for your custom struct with a constructor.) There shouldn't be page faults inside your timed regions, only potentially TLB misses. And of course lots of cache misses.
BTW, clang correctly optimizes away all the loops, because none of the results are ever used. benchmark::DoNotOptimize(outputs_c[argc]) from Google Benchmark might help with that. Or some manual use of asm with dummy memory inputs / outputs to force the compiler to materialize arrays in memory and forget their contents.
See also Idiomatic way of performance evaluation?
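A rough sketch of the "repeated passes over a smaller array" idea, with the results actually consumed, could look like this. MonoSketch is a simplified stand-in for the question's Mono, and checksumFloat / checksumMono are hypothetical names; none of them come from the original code.

```cpp
#include <cstddef>
#include <vector>

// Minimal stand-in for the question's Mono wrapper (an assumption; the
// original also has constructors and more operator overloads).
struct MonoSketch {
    float value;
};

inline MonoSketch operator*(MonoSketch lhs, float rhs) {
    return MonoSketch{lhs.value * rhs};
}

// Scales a small buffer `passes` times and folds one output element per
// pass into a checksum so the results are consumed.
// Precondition: `in` is not empty.
float checksumFloat(const std::vector<float>& in, float k, int passes) {
    std::vector<float> out(in.size());
    float checksum = 0.0f;
    for (int p = 0; p < passes; ++p) {
        for (std::size_t i = 0; i < in.size(); ++i) {
            out[i] = k * in[i];
        }
        checksum += out[static_cast<std::size_t>(p) % out.size()];
    }
    return checksum;
}

// The same loop with the wrapper type instead of a bare float.
float checksumMono(const std::vector<float>& in, MonoSketch k, int passes) {
    std::vector<MonoSketch> out(in.size());
    float checksum = 0.0f;
    for (int p = 0; p < passes; ++p) {
        for (std::size_t i = 0; i < in.size(); ++i) {
            out[i] = k * in[i];
        }
        checksum += out[static_cast<std::size_t>(p) % out.size()].value;
    }
    return checksum;
}
```

With a buffer of a few thousand floats and many passes, both functions can be timed without the run being dominated by DRAM bandwidth. Reading only one element per pass still leaves the compiler some freedom to drop work, so checking the generated asm remains worthwhile.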
I'm currently working on the overhead of operating system operations.
Specifically, I'm studying the cost of making a system call, and I've developed a simple C++ program to observe it.
#include <cstdint>
#include <iostream>
#include <unistd.h>
#include <sys/time.h>
uint64_t
rdtscp(void) {
uint32_t eax, edx;
__asm__ __volatile__("rdtscp" //! rdtscp instruction
: "=a" (eax), "=d" (edx) //! outputs: low and high halves of the counter
: //! no inputs
: "%ecx"); //! clobbered: rdtscp also writes ecx
return (((uint64_t)edx << 32) | eax);
}
int main(void) {
uint64_t before;
uint64_t after;
struct timeval t;
for (unsigned int i = 0; i < 10; ++i) {
before = rdtscp();
gettimeofday(&t, NULL);
after = rdtscp();
std::cout << after - before << std::endl;
std::cout << t.tv_usec << std::endl;
}
return 0;
}
This program is quite straightforward.
The rdtscp function is just a wrapper to call the RDTSCP instruction (a processor instruction which loads the 64-bit cycle count into two 32-bit registers). This function is used to take the timing.
I iterate 10 times. At each iteration I call gettimeofday and determine the time it took to execute it (as a number of CPU cycles).
The results are quite unexpected:
8984
64008
376
64049
164
64053
160
64056
160
64060
160
64063
164
64067
160
64070
160
64073
160
64077
Odd lines in the output are the number of cycles needed to execute the system call. Even lines are the value contained in t.tv_usec (which is set by gettimeofday, the system call that I'm studying).
I don't really understand how that is possible: the number of cycles drastically decreases, from nearly 10,000 to around 150! But the timeval struct is still updated at each call!
I've tried on different operating systems (Debian and macOS) and the result is similar.
Even if the cache is used, I don't see how it is possible. Making a system call should result in a context switch from user to kernel mode, and we still need to read the clock in order to update the time.
Does anyone have an idea?
The answer? Try another system call. There are vsyscalls (and the vDSO) on Linux, and they accelerate certain syscalls:
What are vdso and vsyscall?
The short version: the syscall is not performed; instead the kernel maps a region of memory into the process where it can read the time information. Cost? Not much (no context switch).
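One way to see the difference is to time the libc wrapper against a call that is forced through the kernel with syscall(2). The sketch below assumes Linux, where SYS_gettimeofday is available; timeViaLibc, timeViaSyscall and nanosecondsPerCall are hypothetical helper names.

```cpp
#include <chrono>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

// libc wrapper: on Linux this is normally served from the vDSO,
// without entering the kernel.
inline int timeViaLibc(timeval& tv) {
    return gettimeofday(&tv, nullptr);
}

// Raw system call: bypasses the vDSO and always enters the kernel.
// Assumes a platform that defines SYS_gettimeofday.
inline int timeViaSyscall(timeval& tv) {
    return static_cast<int>(syscall(SYS_gettimeofday, &tv, nullptr));
}

// Average wall time per call of f, in nanoseconds.
template <typename F>
double nanosecondsPerCall(F f, int calls) {
    timeval tv{};
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < calls; ++i) {
        f(tv);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / calls;
}
```

Comparing nanosecondsPerCall(timeViaLibc, n) with nanosecondsPerCall(timeViaSyscall, n) for a large n should show the cost of a real kernel entry; the exact ratio depends on the machine and kernel configuration.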
I'm testing how reading multiple streams of data affects a CPU's caching performance. I'm using the following code to benchmark this. The benchmark reads integers stored sequentially in memory and writes partial sums back sequentially. The number of sequential blocks that are read from is varied. Integers from the blocks are read in a round-robin manner.
#include <iostream>
#include <vector>
#include <chrono>
#include <cstdlib>
using std::vector;
void test_with_split(int num_arrays) {
int num_values = 100000000;
// Fix up the number of values. The effect of this should be insignificant.
num_values -= (num_values % num_arrays);
int num_values_per_array = num_values / num_arrays;
// Initialize data to process
auto results = vector<int>(num_values); // already zero-filled
auto arrays = vector<vector<int>>(num_arrays);
for (int i = 0; i < num_arrays; ++i) {
// Reserve space in the existing inner vectors instead of appending new ones
arrays[i].reserve(num_values_per_array);
}
for (int i = 0; i < num_values; ++i) {
arrays[i%num_arrays].emplace_back(i);
}
// Try to clear the cache
const int size = 20*1024*1024; // Allocate 20M. Set much larger than L2
char *c = (char *)malloc(size);
for (int i = 0; i < 100; i++)
for (int j = 0; j < size; j++)
c[j] = i*j;
free(c);
auto start = std::chrono::high_resolution_clock::now();
// Do the processing
int sum = 0;
for (int i = 0; i < num_values; ++i) {
sum += arrays[i%num_arrays][i/num_arrays];
results[i] = sum;
}
std::cout << "Time with " << num_arrays << " arrays: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start).count() << " ms\n";
}
int main() {
int num_arrays = 1;
while (true) {
test_with_split(num_arrays++);
}
}
Here are the timings for splitting 1-80 ways on an Intel Core 2 Quad CPU Q9550 @ 2.83GHz:
The bump in the speed soon after 8 streams makes sense to me, as the processor has an 8-way associative L1 cache. The 24-way associative L2 cache in turn explains the bump at 24 streams. These especially hold if I'm getting the same effects as in Why is one loop so much slower than two loops?, where multiple big allocations always end up in the same associativity set. To compare I've included at the end timings when the allocation is done in one big block.
However, I don't fully understand the bump from one stream to two streams. My own guess would be that it has something to do with prefetching to L1 cache. Reading the Intel 64 and IA-32 Architectures Optimization Reference Manual it seems that the L2 streaming prefetcher supports tracking up to 32 streams of data, but no such information is given for the L1 streaming prefetcher. Is the L1 prefetcher unable to keep track of multiple streams, or is there something else at play here?
Background
I'm investigating this because I want to understand how organizing entities in a game engine as components in the structure-of-arrays style affects performance. For now it seems that the data required by a transformation being in two components vs. it being in 8-10 components won't matter much with modern CPUs. However, the testing above suggests that sometimes it might make sense to avoid splitting some data into multiple components if that would allow a "bottlenecking" transformation to only use one component, even if this means that some other transformation would have to read data it is not interested in.
Allocating in one block
Here are the timings if, instead of allocating multiple blocks of data, only one is allocated and accessed in a strided manner. This does not change the bump from one stream to two, but I've included it for the sake of completeness.
And here is the modified code for that:
void test_with_split(int num_arrays) {
int num_values = 100000000;
num_values -= (num_values % num_arrays);
int num_values_per_array = num_values / num_arrays;
// Initialize data to process
auto results = vector<int>(num_values); // already zero-filled
auto array = vector<int>(num_values);
for (int i = 0; i < num_values; ++i) {
// Fill in place; emplace_back would append after the num_values zeros
array[i] = i;
}
// Try to clear the cache
const int size = 20*1024*1024; // Allocate 20M. Set much larger than L2
char *c = (char *)malloc(size);
for (int i = 0; i < 100; i++)
for (int j = 0; j < size; j++)
c[j] = i*j;
free(c);
auto start = std::chrono::high_resolution_clock::now();
// Do the processing
int sum = 0;
for (int i = 0; i < num_values; ++i) {
sum += array[(i%num_arrays)*num_values_per_array+i/num_arrays];
results[i] = sum;
}
std::cout << "Time with " << num_arrays << " arrays: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start).count() << " ms\n";
}
Edit 1
I made sure that the 1 vs 2 splits difference was not due to the compiler unrolling the loop and optimizing the first iteration differently. Using __attribute__ ((noinline)) I made sure the work function is not inlined into the main function. I verified that it did not happen by looking at the generated assembly. The timings after these changes were the same.
To answer the main part of your question: Is the L1 prefetcher able to keep track of multiple streams?
No. This is actually because the L1 cache doesn't have a prefetcher at all. The L1 cache isn't big enough to risk speculatively fetching data that might not be used. It would cause too many evictions and adversely impact any software that isn't reading data in specific patterns suited to that particular L1 cache prediction scheme. Instead L1 caches data that has been explicitly read or written. L1 caches are only helpful when you are writing data and re-reading data that has recently been accessed.
The L1 cache implementation is not the reason for your profile bump from 1X to 2X array depth. On streaming reads like what you've set up, the L1 cache plays little or no factor in performance. Most of your reads are coming directly from the L2 cache. In your first example using nested vectors, some number of reads are probably pulled from L1 (see below).
My guess is your bump from 1X to 2X has a lot to do with the algo and how the compiler is optimizing it. If the compiler knows num_arrays is a constant equal to 1, then it will automatically eliminate a lot of per-iteration overhead for you.
Now for the second part: why is the second version faster?
The reason for the second version being faster is not so much in how the data is arranged in physical memory, but rather what under-the-hood logic change a nested std::vector<std::vector<int>> type implies.
In the nested (first) case, compiled code performs the following steps:
1. Accesses top-level std::vector class. This class contains a pointer to the start of the data array.
2. That pointer value must be loaded from memory.
3. Add current loop offset [i%num_arrays] to that pointer.
4. Access nested std::vector class data. (Likely L1 cache hit)
5. Load pointer to the vector's start of data array. (Likely L1 cache hit)
6. Add loop offset [i/num_arrays]
7. Read data. Finally!
(Note that the chances of getting L1 cache hits on steps #4 and #5 decrease drastically after 24 streams or so, due to the likelihood of evictions before the next iteration through the loop.)
The second version, by comparison:
1. Accesses top-level std::vector class.
2. Load pointer to the vector's start of data array.
3. Add offset [(i%num_arrays)*num_values_per_array+i/num_arrays]
4. Read data!
An entire set of under-the-hood steps are removed. The calculation for offset is slightly longer since it needs an extra multiply by num_values_per_array. But the other steps more than make up for it.
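The two access patterns can be put side by side in a small sketch. The function names sumNested and sumFlat are hypothetical; the indexing expressions are the ones from the question's two versions.

```cpp
#include <cstddef>
#include <vector>

// Nested layout: every read first loads the inner vector's data pointer,
// then indexes into that inner array.
long long sumNested(const std::vector<std::vector<int>>& arrays, int num_values) {
    const int num_arrays = static_cast<int>(arrays.size());
    long long sum = 0;
    for (int i = 0; i < num_values; ++i) {
        sum += arrays[i % num_arrays][i / num_arrays];
    }
    return sum;
}

// Flat layout: a single data pointer plus a computed offset.
// Assumes num_values is a multiple of num_arrays, as in the question.
long long sumFlat(const std::vector<int>& array, int num_arrays, int num_values) {
    const int num_values_per_array = num_values / num_arrays;
    long long sum = 0;
    for (int i = 0; i < num_values; ++i) {
        sum += array[(i % num_arrays) * num_values_per_array + i / num_arrays];
    }
    return sum;
}
```

Both functions visit the same logical elements in the same order, so any timing difference between them comes from the extra pointer load and the memory layout rather than from the amount of data read.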
I create a large boolean 2D array (5000x5000, for a total of 25 million elements at 23MB). Then I loop through and initialize every element with a random true or false. Then I loop through and read every single element. All 25 million elements are read in ~100ms.
23MB is too big to fit in the CPU's cache and I think my program is too simple to benefit from any type of compiler optimization so am I right to conclude that the program is reading 25 million elements from RAM in ~100ms?
#include "stdafx.h"
#include <iostream>
#include <chrono>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
bool **locs;
locs = new bool*[5000];
for(int i = 0; i < 5000; i++)
locs[i] = new bool[5000];
for(int i = 0; i < 5000; i++)
for(int i2 = 0; i2 < 5000; i2++)
locs[i][i2] = rand() % 2 == 0 ? true : false;
int *idx = new int [5000*5000];
for(int i = 0; i < 5000*5000; i++)
*(idx + i) = rand() % 4999;
bool val;
int memAccesses = 0;
auto start = std::chrono::high_resolution_clock::now();
for(int i = 0; i < 5000*5000; i++) {
val = locs[*(idx + i)][*(idx + ++i)];
memAccesses += 2;
}
auto finish = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(finish-start).count() << " ns\n";
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(finish-start).count() << " ms\n";
cout << "TOTAL MEMORY ACCESSES: " << memAccesses << endl;
cout << "The size of the array in memory is " << ((sizeof(bool)*5000*5000)/1048576) << "MB";
int exit; cin >> exit;
return 0;
}
/*
OUTPUT IS:
137013700 ns
137 ms
TOTAL MEMORY ACCESSES: 25000000
The size of the array in memory is 23MB
*/
As other answers have mentioned, the "speed" you are seeing (even if the CPU is executing your code and it is not stripped by the compiler) is about 250 MBps, which is a very, very low number for modern systems.
However, your methodology seems flawed to me (admittedly, I'm not an expert in benchmarking.) And here are the problems I see:
For any benchmark such as this, even in the simplest form, you need to distinguish random-access from sequential-access. Memory is not a random-access device (despite its name) and performs very poorly here. Your code seems to be accessing memory randomly, so you add that to your conclusion as a qualifier: that you are "reading 25 million elements from random locations from RAM in ~100ms."
Another aspect of this sort of benchmarks is the concept of latency vs. throughput. Again, if you want to conclude anything from your numbers and timings, you need to be aware what are you measuring exactly.
You are counting memory accesses incorrectly. Depending of the exact code your compiler is generating, this line:
val = locs[*(idx + i)][*(idx + ++i)];
might realistically access the memory system anywhere between 4 to 9 times.
At best, if i, idx, locs and val are all either in registers or access to them is eliminated, then you need to read *(idx + i), read locs[*(idx + i)] (remember that locs is an array of pointers to arrays, not a 2D array,) read *(idx + ++i), and finally read locs[*(idx + i)][*(idx + ++i)]. A few of these might be cached, but it's unlikely, with the cache-thrashing that's going on.
At worst, in addition to the above, you need two accesses for ++i (read, then write back,) one for idx, one for locs and one for val. I don't know, you might even need another read for the single i and/or two reads for the two idx occurrences (due to pointer aliasing and whatnot.)
You need to be aware that memory is never accessed in single bytes or even words. Memory is always read and written in units of cache-line. And cache line size can be different from system to system, although the most common size these days is 64 bytes. So, each time you read a memory location that is not in the cache, you are loading 64-bytes (or more) from RAM. If the memory locations you are reading are at the cache line boundary (some of the bytes in one cache line and some in the next) then you are loading two cache lines from RAM. Given a sane compiler and properly aligned variables in memory, this doesn't happen very often, but it might. So you have to at least multiply your calculated bandwidth used by the size of your cache line.
However, if you are accessing a memory location that is already in cache, then you don't load anything from RAM. You need to consider this in your conclusions too.
You also need to consider cache line eviction, your cache's associativity, number of levels, the fact that some cache levels are shared between instructions and data and some aren't, some are shared between cores and some aren't, and a lot of other things when evaluating the performance of caches and memory.
The DRAM chips also have a lot of weird and complex behaviors and characteristics. Some memory locations are faster to read after some others (due to the arrangements of rows and columns,) some accesses might get delayed a long time (at CPU speeds) because of the refresh cycle, other devices might be using the RAM or the bus that RAM is on, etc., etc. I'm far from familiar with the operations of modern memory chips, and even I know that it's a complete mess.
You have to consider the effects of compiler optimization on your code. This means that you have to look at your code after the compiler is done with it, in assembly form. You need to look at the generated assembly to be able to know what your code is actually doing: whether and which of your memory accesses are optimized out.
All in all, I don't think that you can conclude much useful information from your program. Sorry about that, but memory is very complex!
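To make the cache-line point concrete, here is a small worked calculation using the numbers printed by the question (25,000,000 counted accesses in 137 ms) and the 64-byte line size mentioned above. The function name is hypothetical, and treating every counted access as a full cache miss is a deliberate worst-case assumption, not a measurement.

```cpp
// Bytes moved per second if each counted read transfers `bytesPerRead` bytes.
double bytesPerSecond(double reads, double bytesPerRead, double seconds) {
    return reads * bytesPerRead / seconds;
}

// Naive view: one byte per counted access.
//   bytesPerSecond(25000000.0, 1.0, 0.137)   is roughly 1.8e8 B/s
// Worst-case view: every access misses and pulls in a 64-byte line.
//   bytesPerSecond(25000000.0, 64.0, 0.137)  is roughly 1.17e10 B/s
```

The true figure lies somewhere between the two, which is why the raw access count alone says little about memory bandwidth.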
Portions (blocks) of memory will be stored in the processor cache at a time, which allows the processor to quickly access those items. However, that speed is perfectly reasonable for modern memory. Even the slowest DDR3 ram can transfer data at about 6 GB/s.
Cache usage is independent from program's complexity. Whenever data is read from RAM it goes into cache. Since cache has a certain size, there's always that amount of data available. If you access a memory location next to the previous, there is a good chance it will be cached already. In such case RAM is not accessed.
I would suggest reading CPU cache wikipedia entry to broaden your knowledge.
BTW: val = locs[*(idx + i)][*(idx + ++i)]; are you certain that this is evaluated from left to right? I am not. Before C++17 the reads of i are unsequenced relative to ++i, so this is undefined behavior. I'd suggest putting the ++i on its own line below the access.
//EDIT:
There is nothing done with the value read from memory. It is quite possible that these instructions are not executed at all! Check the generated assembly, or actually use val (for example, accumulate it and print the result) so that the compiler has to keep the reads.
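A sketch of the read loop with both concerns addressed might look like the following: each index is read in its own statement, and the values read feed a result that the caller can print. The function name countTrue is hypothetical, and idx is assumed to hold (row, column) pairs back to back, as in the question.

```cpp
#include <cstddef>

// Counts how many of the selected cells are true.
// idx holds idxCount entries interpreted as consecutive (row, column) pairs;
// a trailing unpaired entry is ignored.
std::size_t countTrue(const bool* const* locs, const int* idx, std::size_t idxCount) {
    std::size_t hits = 0;
    for (std::size_t i = 0; i + 1 < idxCount; i += 2) {
        const int row = idx[i];      // first index, read on its own
        const int col = idx[i + 1];  // second index, no ++ inside the subscript
        if (locs[row][col]) {
            ++hits;
        }
    }
    return hits;
}
```

Printing the returned count after the timed region gives the compiler a reason to keep the loads. It does not change the point that many of them may be served from cache.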
No. The reads won't always go all the way down to the RAM. Blocks of memory get pulled into the cache when a read (or write) is performed. As long as the block from which you are reading is already in the cache, the cache is used. If you request data from a block that is not in the cache, then the RAM is accessed to fetch the block of memory and place it in the cache. Reading from the cache is significantly cheaper than reading from RAM.
EDIT
Again, write operations cause blocks from memory to get pulled into the cache. Because you are storing the values in your program before reading them, the data you are reading is most likely already in the cache from when you stored it. Therefore, it is likely that your loop that reads the values never needs to access RAM.
We're doing a great deal of floating-point to integer number conversions in our project. Basically, something like this
for(int i = 0; i < HUGE_NUMBER; i++)
int_array[i] = float_array[i];
The default C function which performs the conversion turns out to be quite time consuming.
Is there any workaround (maybe a hand-tuned function) which can speed up the process a little bit? We don't care much about precision.
Most of the other answers here just try to eliminate loop overhead.
Only deft_code's answer gets to the heart of what is likely the real problem -- that converting floating point to integers is shockingly expensive on an x86 processor. deft_code's solution is correct, though he gives no citation or explanation.
Here is the source of the trick, with some explanation and also versions specific to whether you want to round up, down, or toward zero: Know your FPU
Sorry to provide a link, but really anything written here, short of reproducing that excellent article, is not going to make things clear.
inline int float2int( double d )
{
union Cast
{
double d;
long l;
};
volatile Cast c;
c.d = d + 6755399441055744.0;
return c.l;
}
// this is the same thing but it's
// not always optimizer safe
inline int float2int( double d )
{
d += 6755399441055744.0;
return reinterpret_cast<int&>(d);
}
for(int i = 0; i < HUGE_NUMBER; i++)
int_array[i] = float2int(float_array[i]);
The double parameter is not a mistake! There is a way to do this trick with floats directly, but it gets ugly trying to cover all the corner cases. In its current form this function rounds the value to the nearest whole number; if you want truncation instead, use 6755399441055743.5 (0.5 less).
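For reference, the same magic-number idea can be written without the union or the reinterpret_cast by copying the bits out with std::memcpy. This is only a sketch under stated assumptions: IEEE-754 doubles, the default round-to-nearest mode, inputs that fit in a 32-bit int, and no -ffast-math. The name float2intMagic is hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Rounds d to the nearest integer (ties to even) using the magic-number trick.
// Adding 1.5 * 2^52 pushes the integer part of d into the low mantissa bits.
inline std::int32_t float2intMagic(double d) {
    d += 6755399441055744.0;  // 1.5 * 2^52
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);  // well-defined way to inspect the bits
    // The low 32 bits hold the rounded value in two's complement.
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(bits));
}
```

The memcpy is normally compiled down to a plain register move, so this variant should cost no more than the union version while staying within what the language guarantees.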
I ran some tests on different ways of doing float-to-int conversion. The short answer is to assume your customer has SSE2-capable CPUs and set the /arch:SSE2 compiler flag. This will allow the compiler to use the SSE scalar instructions which are twice as fast as even the magic-number technique.
Otherwise, if you have long strings of floats to grind, use the SSE2 packed ops.
There's an FISTTP instruction in the SSE3 instruction set which does what you want, but as to whether or not it could be utilized and produce faster results than libc, I have no idea.
Is the time large enough that it outweighs the cost of starting a couple of threads?
Assuming you have a multi-core processor or multiple processors on your box that you could take advantage of, this would be a trivial task to parallelize across multiple threads.
The key is to avoid the _ftol() function, which is needlessly slow. Your best bet for long lists of data like this is to use the SSE2 instruction cvtps2dq, which converts four packed floats to four packed int32s in one go. You don't need assembly to do this; MSVC exposes compiler intrinsics for the relevant instructions: _mm_cvtps_epi32() rounds using the current rounding mode, and _mm_cvttps_epi32() (cvttps2dq) truncates.
If you do this it is very important that your float and int arrays be 16-byte aligned so that the SSE2 load/store intrinsics can work at maximum efficiency. Also, I recommend you software pipeline a little and process sixteen floats at once in each loop, eg (assuming that the "functions" here are actually calls to compiler intrinsics):
for(int i = 0; i < HUGE_NUMBER; i+=16)
{
//int_array[i] = float_array[i];
__m128 a = sse_load4(float_array+i+0);
__m128 b = sse_load4(float_array+i+4);
__m128 c = sse_load4(float_array+i+8);
__m128 d = sse_load4(float_array+i+12);
a = sse_convert4(a);
b = sse_convert4(b);
c = sse_convert4(c);
d = sse_convert4(d);
sse_write4(int_array+i+0, a);
sse_write4(int_array+i+4, b);
sse_write4(int_array+i+8, c);
sse_write4(int_array+i+12, d);
}
The reason for this is that the SSE instructions have a long latency, so if you follow a load into xmm0 immediately with a dependent operation on xmm0 then you will have a stall. Having multiple registers "in flight" at once hides the latency a little. (Theoretically a magic all-knowing compiler could alias its way around this problem but in practice it doesn't.)
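The sse_load4 / sse_convert4 / sse_write4 names above are placeholders. A minimal sketch with real SSE2 intrinsics, processing four floats per step and using the truncating conversion, could look like this. The function name convertTruncate is hypothetical, unaligned loads and stores are used so that no alignment is assumed, and a scalar loop handles the tail (and the whole array on targets without SSE2).

```cpp
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#endif

// Converts n floats to ints with truncation toward zero, like a C cast.
void convertTruncate(const float* in, int* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE2__) || defined(_M_X64)
    for (; i + 4 <= n; i += 4) {
        const __m128 v = _mm_loadu_ps(in + i);    // load 4 floats
        const __m128i r = _mm_cvttps_epi32(v);    // cvttps2dq: truncate to 4 int32
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), r);
    }
#endif
    for (; i < n; ++i) {
        out[i] = static_cast<int>(in[i]);         // scalar tail / fallback
    }
}
```

Swapping _mm_cvttps_epi32 for _mm_cvtps_epi32 gives rounding in the current rounding mode instead of truncation. Unrolling to sixteen floats per iteration, as suggested above, is a further step that is worth measuring rather than assuming.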
Failing this SSE juju you can supply the /QIfist option to MSVC, which will cause it to issue the single opcode fist instead of a call to _ftol; this means it will simply use whichever rounding mode happens to be set in the CPU without making sure it is ANSI C's specific truncate op. The Microsoft docs say /QIfist is deprecated because their floating point code is fast now, but a disassembler will show you that this is unjustifiably optimistic. Even /fp:fast simply results in a call to _ftol_sse2, which though faster than the egregious _ftol is still a function call followed by a latent SSE op, and thus unnecessarily slow.
I'm assuming you're on x86 arch, by the way -- if you're on PPC there are equivalent VMX operations, or you can use the magic-number-multiply trick mentioned above followed by a vsel (to mask out the non-mantissa bits) and an aligned store.
You might be able to load all of the integers into the SSE module of your processor using some magic assembly code, then do the equivalent code to set the values to ints, then read them as floats. I'm not sure this would be any faster though. I'm not a SSE guru, so I don't know how to do this. Maybe someone else can chime in.
In Visual C++ 2008, the compiler generates SSE2 calls by itself, if you do a release build with maxed out optimization options, and look at a disassembly (though some conditions have to be met, play around with your code).
See this Intel article for speeding up integer conversions:
http://software.intel.com/en-us/articles/latency-of-floating-point-to-integer-conversions/
According to Microsoft, the /QIfist compiler option is deprecated in VS 2005 because integer conversion has been sped up. They neglect to say how it has been sped up, but looking at the disassembly listing might give a clue.
http://msdn.microsoft.com/en-us/library/z8dh4h17(vs.80).aspx
Most C compilers generate calls to _ftol or something similar for every float to int conversion. Setting a reduced floating point conformance switch (like /fp:fast) might help, IF you understand AND accept the other effects of this switch. Other than that, put the conversion in a tight assembly or SSE intrinsic loop, IF you are OK with AND understand the different rounding behavior.
For large loops like your example you should write a function that sets up the floating point control word once, does the bulk conversion with only fistp instructions, and then resets the control word. This works IF you are OK with an x86-only code path, but at least you will not change the rounding.
Read up on the fld and fistp FPU instructions and the FPU control word.
What compiler are you using? In Microsoft's more recent C/C++ compilers, there is an option under C/C++ -> Code Generation -> Floating point model, which has options: fast, precise, strict. I think precise is the default, and works by emulating FP operations to some extent. If you are using a MS compiler, how is this option set? Does it help to set it to "fast"? In any case, what does the disassembly look like?
As thirtyseven said above, the CPU can convert float<->int in essentially one instruction, and it doesn't get any faster than that (short of a SIMD operation).
Also note that modern CPUs use the same FP unit for both single (32 bit) and double (64 bit) FP numbers, so unless you are trying to save memory storing a lot of floats, there's really no reason to favor float over double.
On Intel your best bet is inline SSE2 calls.
I'm surprised by your result. What compiler are you using? Are you compiling with optimization turned all the way up? Have you confirmed using valgrind and Kcachegrind that this is where the bottleneck is? What processor are you using? What does the assembly code look like?
The conversion itself should be compiled to a single instruction. A good optimizing compiler should unroll the loop so that several conversions are done per test-and-branch. If that's not happening, you can unroll the loop by hand:
int i; // declared outside the loop so the cleanup loop below can reuse it
for(i = 0; i < HUGE_NUMBER-3; i += 4) {
int_array[i] = float_array[i];
int_array[i+1] = float_array[i+1];
int_array[i+2] = float_array[i+2];
int_array[i+3] = float_array[i+3];
}
for(; i < HUGE_NUMBER; i++)
int_array[i] = float_array[i];
If your compiler is really pathetic, you might need to help it with the common subexpressions, e.g.,
int *ip = int_array+i;
float *fp = float_array+i;
ip[0] = fp[0];
ip[1] = fp[1];
ip[2] = fp[2];
ip[3] = fp[3];
Do report back with more info!
If you do not care very much about the rounding semantics, you can use the lrint() function. This allows for more freedom in rounding and it can be much faster.
Technically, it's a C99 function, but your compiler probably exposes it in C++. A good compiler will also inline it to one instruction (a modern G++ will).
lrint documentation
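As a minimal sketch of the lrint() suggestion (the helper name convert_with_lrint is made up for illustration), the loop below rounds according to the current floating point rounding mode instead of truncating. With the default mode, round to nearest with ties to even, 2.5 becomes 2 and 3.5 becomes 4, whereas a plain cast would give 2 and 3.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: converts n floats to ints using std::lrint, which
// honours the current rounding mode rather than truncating toward zero.
void convert_with_lrint(const float* src, int* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = static_cast<int>(std::lrint(src[i]));
}
```

Whether this compiles down to a single instruction depends on the compiler and flags, so check the disassembly before relying on it.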
rounding only
Excellent trick, except that using 6755399441055743.5 (0.5 less) to do rounding won't work.
Adding 6755399441055744 = 2^52 + 2^51 pushes the fractional bits off the end of the mantissa, leaving the integer that you want in bits 51 - 0 of the FPU register.
In IEEE 754
6755399441055744.0 =
sign exponent mantissa
0 10000110011 1000000000000000000000000000000000000000000000000000
6755399441055743.5
will also however compile to
0100001100111000000000000000000000000000000000000000000000000000
The 0.5 overflows off the end (rounding up), which is why this works in the first place.
To do truncation you would have to add 0.5 to your double and then do this.
The guard digits should take care of rounding to the correct result when done this way.
Also watch out for 64-bit gcc on Linux, where long rather annoyingly means a 64-bit integer.
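The bit-level argument above can be checked with a small sketch. The names kMagic, round_with_magic and half_less_is_same_double are made up for illustration. The sketch assumes IEEE 754 doubles, a little-endian machine, the default round-to-nearest mode, and inputs that fit in 32 bits. It uses memcpy rather than a pointer cast to read the low 32 bits, and a fixed-width int32_t so the size of long does not matter.

```cpp
#include <cstdint>
#include <cstring>

// 2^52 + 2^51: adding this leaves the rounded integer in the low mantissa bits.
const double kMagic = 6755399441055744.0;

// Hypothetical helper: rounds val to the nearest integer (ties to even).
// Assumes little-endian IEEE 754 doubles and a result that fits in 32 bits.
inline std::int32_t round_with_magic(double val) {
    val += kMagic;
    std::int32_t low;
    std::memcpy(&low, &val, sizeof low);  // low 32 bits come first on little-endian
    return low;
}

// 6755399441055743.5 is not representable as a double (the spacing between
// doubles at this magnitude is 1.0), so the literal rounds to 6755399441055744.0.
inline bool half_less_is_same_double() {
    return 6755399441055743.5 == 6755399441055744.0;
}
```

With these assumptions round_with_magic(2.5) gives 2, round_with_magic(3.5) gives 4 and round_with_magic(-1.5) gives -2, and half_less_is_same_double() returns true.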
If you have very large arrays (bigger than a few MB--the size of the CPU cache), time your code and see what the throughput is. You're probably saturating the memory bus, not the FP unit. Look up the maximum theoretical bandwidth for your CPU and see how close to it you are.
If you're being limited by the memory bus, extra threads will just make it worse. You need better hardware (e.g. faster memory, different CPU, different motherboard).
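A rough, portable way to get that throughput number is sketched below. The function name measure_cast_bandwidth is made up for illustration, and the byte count assumes each element costs one double read plus one int write. Treat the result as an estimate only: a single pass with no warm-up is noisy, and the first pass may also include page-fault cost.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: casts every element of in to int, stores it in out,
// and returns the approximate throughput in GB/s (0 if the clock saw no time).
double measure_cast_bandwidth(const std::vector<double>& in, std::vector<int>& out) {
    out.resize(in.size());
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t n = 0; n < in.size(); ++n)
        out[n] = static_cast<int>(in[n]);
    const auto t1 = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    if (seconds <= 0.0)
        return 0.0;
    // Bytes moved: one double read and one int written per element.
    const double bytes = static_cast<double>(in.size()) * (sizeof(double) + sizeof(int));
    return bytes / seconds / (1024.0 * 1024.0 * 1024.0);
}
```

Run it with an array much larger than your last-level cache and compare the figure with the theoretical bandwidth of your machine.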
In response to Larry Gritz's comment...
You are correct: the FPU is a major bottleneck (and using the xs_CRoundToInt trick allows one to come very close to saturating the memory bus).
Here are some test results for a Core 2 (Q6600) processor. The theoretical main-memory bandwidth for this machine is 3.2 GB/s (L1 and L2 bandwidths are much higher). The code was compiled with Visual Studio 2008. Similar results for 32-bit and 64-bit, and with /O2 or /Ox optimizations.
WRITING ONLY...
1866359 ticks with 33554432 array elements (33554432 touched). Bandwidth: 1.91793 GB/s
154749 ticks with 262144 array elements (33554432 touched). Bandwidth: 23.1313 GB/s
108816 ticks with 8192 array elements (33554432 touched). Bandwidth: 32.8954 GB/s
USING CASTING...
5236122 ticks with 33554432 array elements (33554432 touched). Bandwidth: 0.683625 GB/s
2014309 ticks with 262144 array elements (33554432 touched). Bandwidth: 1.77706 GB/s
1967345 ticks with 8192 array elements (33554432 touched). Bandwidth: 1.81948 GB/s
USING xs_CRoundToInt...
1490583 ticks with 33554432 array elements (33554432 touched). Bandwidth: 2.40144 GB/s
1079530 ticks with 262144 array elements (33554432 touched). Bandwidth: 3.31584 GB/s
1008407 ticks with 8192 array elements (33554432 touched). Bandwidth: 3.5497 GB/s
(Windows) source code:
// floatToIntTime.cpp : Defines the entry point for the console application.
//
#include <windows.h>
#include <iostream>
using namespace std;
double const _xs_doublemagic = double(6755399441055744.0);
inline int xs_CRoundToInt(double val, double dmr=_xs_doublemagic) {
val = val + dmr;
return ((int*)&val)[0];
}
static size_t const N = 256*1024*1024/sizeof(double);
int I[N];
double F[N];
static size_t const L1CACHE = 128*1024/sizeof(double);
static size_t const L2CACHE = 4*1024*1024/sizeof(double);
static size_t const Sz[] = {N, L2CACHE/2, L1CACHE/2};
static size_t const NIter[] = {1, N/(L2CACHE/2), N/(L1CACHE/2)};
int main(int argc, char *argv[])
{
__int64 freq;
QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
cout << "WRITING ONLY..." << endl;
for (int t=0; t<3; t++) {
__int64 t0,t1;
QueryPerformanceCounter((LARGE_INTEGER*)&t0);
size_t const niter = NIter[t];
size_t const sz = Sz[t];
for (size_t i=0; i<niter; i++) {
for (size_t n=0; n<sz; n++) {
I[n] = 13;
}
}
QueryPerformanceCounter((LARGE_INTEGER*)&t1);
double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
cout << " " << (t1-t0) << " ticks with " << sz
<< " array elements (" << niter*sz << " touched). "
<< "Bandwidth: " << bandwidth << " GB/s" << endl;
}
cout << "USING CASTING..." << endl;
for (int t=0; t<3; t++) {
__int64 t0,t1;
QueryPerformanceCounter((LARGE_INTEGER*)&t0);
size_t const niter = NIter[t];
size_t const sz = Sz[t];
for (size_t i=0; i<niter; i++) {
for (size_t n=0; n<sz; n++) {
I[n] = (int)F[n];
}
}
QueryPerformanceCounter((LARGE_INTEGER*)&t1);
double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
cout << " " << (t1-t0) << " ticks with " << sz
<< " array elements (" << niter*sz << " touched). "
<< "Bandwidth: " << bandwidth << " GB/s" << endl;
}
cout << "USING xs_CRoundToInt..." << endl;
for (int t=0; t<3; t++) {
__int64 t0,t1;
QueryPerformanceCounter((LARGE_INTEGER*)&t0);
size_t const niter = NIter[t];
size_t const sz = Sz[t];
for (size_t i=0; i<niter; i++) {
for (size_t n=0; n<sz; n++) {
I[n] = xs_CRoundToInt(F[n]);
}
}
QueryPerformanceCounter((LARGE_INTEGER*)&t1);
double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
cout << " " << (t1-t0) << " ticks with " << sz
<< " array elements (" << niter*sz << " touched). "
<< "Bandwidth: " << bandwidth << " GB/s" << endl;
}
return 0;
}