How do I prevent GCC/Clang from inlining and optimizing out multiple invocations of a pure function?
I am trying to benchmark code of this form
int __attribute__ ((noinline)) my_loop(int const* array, int len) {
// Use array to compute result.
}
My benchmark code looks something like this:
int main() {
const int number = 2048;
// My own aligned_malloc implementation.
int* input = (int*)aligned_malloc(sizeof(int) * number, 32);
// Fill the array with some random numbers.
make_random(input, number);
const int num_runs = 10000000;
for (int i = 0; i < num_runs; i++) {
const int result = my_loop(input, number); // Call pure function.
}
// Since the program exits I don't free input.
}
As expected, Clang seems to be able to turn this into a no-op at -O2 (perhaps even at -O1).
A few things I tried to actually benchmark my implementation are:
Accumulate the intermediate results in an integer and print the results at the end:
const int num_runs = 10000000;
uint64_t total = 0;
for (int i = 0; i < num_runs; i++) {
total += my_loop(input, number); // Call pure function.
}
printf("Total is %llu\n", total);
Sadly this doesn't seem to work. Clang at least is smart enough to realize that this is a pure function and transforms the benchmark to something like this:
int result = my_loop();
uint64_t total = num_runs * result;
printf("Total is %llu\n", total);
Set an atomic variable using release semantics at the end of every loop iteration:
const int num_runs = 10000000;
std::atomic<uint64_t> result_atomic(0);
for (int i = 0; i < num_runs; i++) {
int result = my_loop(input, number); // Call pure function.
// Tried std::memory_order_release too.
result_atomic.store(result, std::memory_order_seq_cst);
}
printf("Result is %llu\n", result_atomic.load());
My hope was that since atomics introduce a happens-before relationship, Clang would be forced to execute my code. But sadly it still did the optimization above and set the value of the atomic to num_runs * result in one shot instead of running num_runs iterations of the function.
Set a volatile int at the end of every loop along with summing the total.
const int num_runs = 10000000;
uint64_t total = 0;
volatile int trigger = 0;
for (int i = 0; i < num_runs; i++) {
total += my_loop(input, number); // Call pure function.
trigger = 1;
}
// If I take this printf out, Clang optimizes the code away again.
printf("Total is %llu\n", total);
This seems to do the trick and my benchmarks seem to work, but it is not ideal for a number of reasons.
Per my understanding of the C++11 memory model, volatile stores do not establish a happens-before relationship, so I can't be sure that some compiler will not decide to do the same num_runs * result_of_1_run optimization.
Also, this method seems undesirable since now I have the overhead (however tiny) of setting a volatile int on every run of my loop.
Is there a canonical way of preventing Clang/GCC from optimizing this result away, maybe with a pragma or something? Bonus points if the method works across compilers.
You can insert instructions directly into the assembly. I sometimes use a macro for splitting up the assembly, e.g. separating loads from calculations and branching.
#define GCC_SPLIT_BLOCK(str) __asm__( "//\n\t// " str "\n\t//\n" );
Then in the source you insert
GCC_SPLIT_BLOCK("Keep this please")
before and after your functions
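A related trick for the original benchmark, sketched here under the assumption that you only need GCC/Clang: an empty asm statement that takes the result as an input and clobbers memory, similar in spirit to what benchmarking libraries such as Google Benchmark do. The do_not_optimize name below is made up for this example.
// Sketch (GCC/Clang inline asm, not MSVC): the volatile asm contains no
// instructions, but it takes `value` as an input and claims to touch memory,
// so the compiler must compute the value on every iteration and cannot hoist
// or delete the call that produced it.
static inline void do_not_optimize(int value) {
    __asm__ __volatile__("" : : "r"(value) : "memory");
}

// Usage in the benchmark loop:
// for (int i = 0; i < num_runs; i++) {
//     do_not_optimize(my_loop(input, number));
// }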
Related
I have the following code. The bitCount function simply counts the number of set bits in a 64-bit integer. The test function is a simplified example of something I am doing in a more complicated piece of code; it replicates how writing to a matrix significantly slows down the performance of the for loop. I am trying to figure out why that is, and whether there are any solutions to it.
#include <cstdint>
#include <vector>
#include <cmath>
#include <omp.h>
// Count the number of set bits
inline int bitCount(uint64_t n){
int count = 0;
while(n){
n &= (n-1);
count++;
}
return count;
}
void test(){
int nthreads = omp_get_max_threads();
omp_set_dynamic(0);
omp_set_num_threads(nthreads);
// I need a priority queue per thread
std::vector<std::vector<double> > mat(nthreads, std::vector<double>(1000,-INFINITY));
std::vector<uint64_t> vals(100,1);
# pragma omp parallel for shared(mat,vals)
for(int i = 0; i < 100000000; i++){
std::vector<double> &tid_vec = mat[omp_get_thread_num()];
int total_count = 0;
for(unsigned int j = 0; j < vals.size(); j++){
total_count += bitCount(vals[j]);
tid_vec[j] = total_count; // if I comment out this line, performance increases drastically
}
}
}
This code runs in about 11 seconds. If I comment out the following line:
tid_vec[j] = total_count;
the code runs in about 2 seconds. Is there a reason why writing to a matrix in my case costs so much in performance?
Since you said nothing about your compiler/system specs, I'm assuming you are compiling with GCC and flags -O2 -fopenmp.
If you comment the line:
tid_vec[j] = total_count;
The compiler will optimize away all the computations whose result is not used. Therefore:
total_count += bitCount(vals[j]);
is optimized away too. If the main kernel of your application is not being executed, it makes sense that the program runs much faster.
On the other hand, I would not implement a bit count function myself but rather rely on functionality that is already provided to you. For example, GCC builtin functions include __builtin_popcount, which does exactly what you are trying to do.
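For the 64-bit values in the question, the long long variant of the builtin would be the drop-in replacement (a sketch using GCC/Clang builtins; bitCount and uint64_t are the names from the question):
// Sketch: __builtin_popcountll counts the set bits of an unsigned long long;
// GCC/Clang lower it to a POPCNT instruction when the target supports one.
inline int bitCount(uint64_t n){
    return __builtin_popcountll(n);
}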
As a bonus: it is way better to work on private data than on different elements of a common array shared between threads. It improves locality (especially important when access to memory is not uniform, i.e. NUMA) and may reduce access contention.
# pragma omp parallel shared(mat,vals)
{
std::vector<double> local_vec(1000,-INFINITY);
#pragma omp for
for(int i = 0; i < 100000000; i++) {
int total_count = 0;
for(unsigned int j = 0; j < vals.size(); j++){
total_count += bitCount(vals[j]);
local_vec[j] = total_count;
}
}
// Copy local_vec to mat[omp_get_thread_num()] here if the results are needed.
}
I am learning to program with AVX, so I wrote a simple program to multiply 4x4 matrices. With no compiler optimizations the AVX version is slightly faster than the non-AVX version, but with -O3 the non-AVX version becomes almost twice as fast as the AVX version. Any tips on how I can improve the performance of the AVX version? The full code follows.
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>
#define MAT_SIZE 4
#define USE_AVX
double A[MAT_SIZE][MAT_SIZE];
double B[MAT_SIZE][MAT_SIZE];
double C[MAT_SIZE][MAT_SIZE];
union {
double m[4][4];
__m256d row[4];
} matB;
void init_matrices()
{
for(int i = 0; i < MAT_SIZE; i++)
for(int j = 0; j < MAT_SIZE; j++)
{
A[i][j] = (float)(i+j);
B[i][j] = (float)(i+j+1);
matB.m[i][j] = B[i][j];
}
}
void print_result()
{
for(int i = 0; i < MAT_SIZE; i++)
{
for(int j = 0; j < MAT_SIZE; j++)
{
printf("%.1f\t", C[i][j]);
}
printf("\n");
}
}
void withoutAVX()
{
for(int row = 0; row < MAT_SIZE; row++)
for(int col = 0; col < MAT_SIZE; col++)
{
float sum = 0;
for(int e = 0; e < MAT_SIZE; e++)
sum += A[row][e] * B[e][col];
C[row][col] = sum;
}
}
void withAVX()
{
for(int row = 0; row < 4; row++)
{
//calculate_resultant_row(row);
const double* rowA = (const double*)&A[row];
__m256d* pr = (__m256d*)(&C[row]);
*pr = _mm256_mul_pd(_mm256_broadcast_sd(&rowA[0]), matB.row[0]);
for(int i = 1; i < 4; i++)
*pr = _mm256_add_pd(*pr, _mm256_mul_pd(_mm256_broadcast_sd(&rowA[i]),
matB.row[i]));
}
}
static __inline__ unsigned long long rdtsc(void)
{
unsigned hi, lo;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
int main()
{
init_matrices();
// start timer
unsigned long long cycles = rdtsc();
#ifdef USE_AVX
withAVX();
#else
withoutAVX();
#endif
// stop timer
cycles = rdtsc() - cycles;
printf("\nTotal time elapsed : %ld\n\n", cycles);
print_result();
return 0;
}
It's hard to be sure without knowing exactly what compiler and system you are using; you need to check the assembly of the generated code to be sure. Below are some possible reasons.
The compiler probably generated extra loads/stores. These will cost.
The innermost loop broadcasts elements from A, so you have extra loads. Optimal code would require only 8 loads (4 each for A and B) and 4 stores back to C. However, your code leads to at least 16 extra loads because of your use of broadcastsd. These will cost you as much as the computation itself, and probably more.
Edit (too long for comments)
There are situations where the compiler won't be able to do smart optimizations, or is sometimes "too clever" for its own good. Recently I even needed to use assembly to avoid a compiler optimization which actually led to bad code! That said, if what you need is performance and you don't really care how you get there, I would suggest you first look for good libraries. For example, Eigen for linear algebra will fit your needs in this example perfectly (a minimal sketch follows below).
If you do want to learn SIMD programming, I suggest you start with simpler cases, such as adding two vectors. Most likely you will find that the compiler is able to generate better vectorized binaries than your first few attempts, but those cases are more straightforward, so you will see where you need improvement more easily. In the process of attempting to produce code as good as or better than what a compiler can generate, you will learn all kinds of things that you need to write optimal code, and eventually you will be able to provide optimal implementations for code that the compiler cannot optimize.
One thing you need to keep in mind is that the lower level you go, the less the compiler can do for you. You will have more control over which binaries are generated, but it is also your responsibility to make them optimal. This advice is pretty vague; sorry I cannot be of more help.
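A minimal sketch of the Eigen route mentioned above (this assumes Eigen 3 is installed; the Random() matrices are placeholders, not the matrices from the question):
#include <Eigen/Dense>
#include <iostream>

int main() {
    // Fixed-size 4x4 double matrices; Eigen vectorizes the product internally
    // (using AVX when compiled with -mavx or -march=native).
    Eigen::Matrix4d A = Eigen::Matrix4d::Random();
    Eigen::Matrix4d B = Eigen::Matrix4d::Random();
    Eigen::Matrix4d C = A * B;
    std::cout << C << "\n";
    return 0;
}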
I implemented some algorithm where the main data structure is a tree. I use a class to represent a node and a class to represent a tree. Because the nodes get updated a lot, I call many setters and getters.
Because I have heard many times that function calls are expensive, I was thinking that maybe if I represented the nodes and the tree using structs, it would make my algorithm more efficient in practice.
Before doing so I decided to run a small experiment to see if this is actually the case.
I created a class that had one private variable, a setter and a getter. Also I created a struct that had one variable as well, without setters/getters since we can just update the variable by calling struct.varName. Here are the results:
The number of runs is just how many times we call the setter/getter. Here is the code of the experiment:
#include <iostream>
#include <fstream>
#include <ctime>   // for timespec and clock_gettime
#define BILLION 1000000000LL
using namespace std;
class foo{
private:
int a;
public:
void set(int newA){
a = newA;
}
int get(){
return a;
}
};
struct bar{
int a;
};
timespec startT, endT;
void startTimer(){
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &startT);
}
double endTimer(){
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endT);
return endT.tv_sec * BILLION + endT.tv_nsec - (startT.tv_sec * BILLION + startT.tv_nsec);
}
int main() {
int runs = 10000000;
int startRun = 10000;
int step = 10000;
int iterations = 10;
int res = 0;
foo f;
ofstream fout;
fout.open("stats.txt", ios_base::out);
fout<<"alg\truns\ttime"<<endl;
cout<<"First experiment progress: "<<endl;
int cnt = 0;
for(int run = startRun; run <= runs; run += step){
double curTime = 0.0;
for(int iter = 0; iter < iterations; iter++) {
startTimer();
for (int i = 1; i <= run; i++) {
f.set(i);
res += f.get();
}
curTime += endTimer()/iterations;
cnt++;
if(cnt%10 == 0)
cout<<cnt/(((double)runs-startRun+1)/step*iterations)*100<<"%\r";
}
fout<<"class\t"<<run<<"\t"<<curTime/BILLION<<endl;
}
int res2 = 0;
bar b;
cout<<"Second experiment progress: "<<endl;
cnt = 0;
for(int run = startRun; run <= runs; run += step){
double curTime = 0.0;
for(int iter = 0; iter < iterations; iter++) {
startTimer();
for (int i = 1; i <= run; i++) {
b.a = i;
res2 += b.a;
}
curTime += endTimer()/iterations;
cnt++;
if(cnt%10 == 0)
cout<<cnt/(((double)runs-startRun+1)/step*iterations)*100<<"%\r";
}
fout<<"struct\t"<<run<<"\t"<<curTime/BILLION<<endl;
}
fout.close();
cout<<res<<endl;
cout<<res2<<endl;
return 0;
}
I don't understand why I get this behaviour. I thought that function calls were more expensive?
EDIT: I reran the same experiment without -O3.
EDIT: OK, this is very surprising: by declaring the class in a separate file called foo.h, implementing the getters/setters in foo.cpp, and running with -O3, the class seems to become even more inefficient.
I have heard many times that function calls are expensive.
Was this in 1970 by any chance?
Compilers are smart. Very smart. They produce the best program they can given your source code, and unless you're doing something very weird, these sorts of design changes are unlikely to make much (if any) performance difference.
Most notably here, a simple getter/setter can even be completely inlined in most cases (unless you're doing something weird), making your two programs effectively the same once compiled! You can see this result on your graph.
Meanwhile, the specific change of replacing class with struct has no effect on performance whatsoever - both keywords define a class.
I don't understand why I get this behaviour. I thought that function calls were more expensive?
See, this is why we don't prematurely optimise. Write clear, easy-to-read code without tricks and let your compiler take care of the rest. That's its job, and it's generally very good at it.
The answer here is almost certainly compiler optimization. First of all, defining your getters and setters in the class definition makes them inline. Even if you didn't do that, though, I'd expect any modern compiler to optimize away the function calls if they're in the same file and the compiler knows the resultant object is the whole program.
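A rough illustration of both points, using hypothetical classes rather than the asker's code: a member function defined inside the class body is implicitly inline and visible at every call site, so the optimizer can turn the call into a plain load or store, whereas a definition in a separate .cpp file (as in the asker's foo.h/foo.cpp edit) is an opaque external call unless link-time optimization is enabled.
// Defined in the class body: implicitly inline, so at -O3 calls to set/get
// typically compile down to a single store/load, the same as touching a
// struct member directly.
class InlineFoo {
    int a = 0;
public:
    void set(int v) { a = v; }
    int  get() const { return a; }
};

// Only declared here, defined in another translation unit: the compiler sees
// an opaque call, so without -flto the call overhead remains.
class OutOfLineFoo {
    int a = 0;
public:
    void set(int v);      // defined in OutOfLineFoo.cpp (hypothetical)
    int  get() const;     // defined in OutOfLineFoo.cpp (hypothetical)
};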
I'm trying to make run-time measurements of simple algorithms like linear search. The problem is that no matter what I do, the time measurement won't work as intended: I get the same search time no matter what problem size I use. Both I and other people who've tried to help me are equally confused.
I have a linear search function that looks like this:
// Search the N first elements of 'data'.
int linearSearch(vector<int> &data, int number, const int N) {
if (N < 1 || N > data.size()) return 0;
for (int i=0;i<N;i++) {
if (data[i] == number) return 1;
}
return 0;
}
I've tried to take time measurements with both time_t and chrono from C++11 without any luck, except for getting more decimals. This is how it looks right now when I'm searching.
vector<int> listOfNumbers = large list of numbers;
for (int i = 15000; i <= 5000000; i += 50000) {
const clock_t start = clock();
for (int a=0; a<NUMBERS_TO_SEARCH; a++) {
int randNum = rand() % INT_MAX;
linearSearch(listOfNumbers, randNum, i);
}
cout << float(clock() - start) / CLOCKS_PER_SEC << endl;
}
The result?
0.126, 0.125, 0.125, 0.124, 0.124, ... (same values?)
I have tried the code with both VC++ and g++, and on different computers.
At first I thought it was my implementation of the search algorithm that was at fault. But a linear search like the one above can't become any simpler; it's clearly O(N). How can the time be the same even when the problem size is increased by so much? I'm at a loss what to do.
Edit 1:
Someone else might have an explanation why this is the case. But it actually worked in release mode after changing:
if (data[i] == number)
To:
if (data.at(i) == number)
I have no idea why this is the case, but the linear search could be timed correctly after that change.
The reason for the near-constant execution times is that the compiler is able to optimize away parts of the code.
Specifically looking at this part of the code:
for (int a=0; a<NUMBERS_TO_SEARCH; a++) {
int randNum = rand() % INT_MAX;
linearSearch(listOfNumbers, randNum, i);
}
When compiling with g++5.2 and optimization level -O3, the compiler can optimize away the call to linearSearch() completely. This is because the result of the code is the same with or without that function being called.
The return value of linearSearch is not used anywhere, and the function does not seem to have side-effects. So the compiler can remove it.
You can cross-check and modify the inner loop as follows. The execution times shouldn't change:
for (int a=0; a<NUMBERS_TO_SEARCH; a++) {
int randNum = rand() % INT_MAX;
// linearSearch(listOfNumbers, randNum, i);
}
What remains in the loop is the call to rand(), and this is what you seem to be measuring. When you change data[i] == number to data.at(i) == number, the call to linearSearch is no longer free of side effects, as at(i) may throw an out-of-range exception. So the compiler does not completely optimize the linearSearch code away. However, with g++ 5.2 it will still inline it rather than emit an actual function call.
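An alternative to relying on at()'s exception path is to make the result observable yourself, for example by accumulating it and printing it after the timing loop. This is a sketch based on the question's loop; the hits variable is introduced here and is not in the original code:
int hits = 0;  // consume the return value so the calls are not dead code
for (int a = 0; a < NUMBERS_TO_SEARCH; a++) {
    int randNum = rand() % INT_MAX;
    hits += linearSearch(listOfNumbers, randNum, i);
}
cout << float(clock() - start) / CLOCKS_PER_SEC
     << " (hits: " << hits << ")" << endl;  // printing hits keeps the calls live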
clock() measures CPU time; maybe you want time(NULL)? Check this issue.
The start should be taken before the for loop. In your case the start is different for each iteration; it is constant only within the { ... }.
const clock_t start = clock();
for (int i = 15000; i <= 5000000; i += 50000){
...
}
This question already has answers here:
Why are elementwise additions much faster in separate loops than in a combined loop?
Performance of breaking apart one loop into two loops
Closed 9 years ago.
What is the overhead in splitting a for-loop like this,
int i;
for (i = 0; i < exchanges; i++)
{
// some code
// some more code
// even more code
}
into multiple for-loops like this?
int i;
for (i = 0; i < exchanges; i++)
{
// some code
}
for (i = 0; i < exchanges; i++)
{
// some more code
}
for (i = 0; i < exchanges; i++)
{
// even more code
}
The code is performance-sensitive, but doing the latter would improve readability significantly. (In case it matters, there are no other loops, variable declarations, or function calls, save for a few accessors, within each loop.)
I'm not exactly a low-level programming guru, so it'd be even better if someone could quantify the performance hit in comparison to basic operations, e.g. "each additional for-loop would cost the equivalent of two int allocations." But I understand (and wouldn't be surprised) if it's not that simple.
Many thanks in advance.
There are often way too many factors at play... And it's easy to demonstrate both ways:
For example, splitting the following loop results in almost a 2x slow-down (full test code at the bottom):
for (int c = 0; c < size; c++){
data[c] *= 10;
data[c] += 7;
data[c] &= 15;
}
And this is almost stating the obvious since you need to loop through 3 times instead of once and you make 3 passes over the entire array instead of 1.
On the other hand, if you take a look at this question: Why are elementwise additions much faster in separate loops than in a combined loop?
for(int j=0;j<n;j++){
a1[j] += b1[j];
c1[j] += d1[j];
}
The opposite is sometimes true due to memory alignment.
What to take from this?
Pretty much anything can happen. Neither way is always faster and it depends heavily on what's inside the loops.
And as such, determining whether such an optimization will increase performance is usually trial-and-error. With enough experience you can make fairly confident (educated) guesses. But in general, expect anything.
"Each additional for-loop would cost the equivalent of two int allocations."
You are correct that it's not that simple. In fact it's so complicated that the numbers don't mean much. A loop iteration may take X cycles in one context, but Y cycles in another due to a multitude of factors such as Out-of-order Execution and data dependencies.
Not only is the performance context-dependent, it also varies with different processors.
Here's the test code:
#include <time.h>
#include <iostream>
#include <cstdlib>   // for system()
using namespace std;
int main(){
int size = 10000;
int *data = new int[size];
clock_t start = clock();
for (int i = 0; i < 1000000; i++){
#ifdef TOGETHER
for (int c = 0; c < size; c++){
data[c] *= 10;
data[c] += 7;
data[c] &= 15;
}
#else
for (int c = 0; c < size; c++){
data[c] *= 10;
}
for (int c = 0; c < size; c++){
data[c] += 7;
}
for (int c = 0; c < size; c++){
data[c] &= 15;
}
#endif
}
clock_t end = clock();
cout << (double)(end - start) / CLOCKS_PER_SEC << endl;
system("pause");
}
Output (one loop): 4.08 seconds
Output (3 loops): 7.17 seconds
Processors prefer to have a higher ratio of data instructions to jump instructions.
Branch instructions may force your processor to clear the instruction pipeline and reload.
Based on the reloading of the instruction pipeline, the first method would be faster, but not significantly. You would add at least 2 new branch instructions by splitting.
A faster optimization is to unroll the loop. Unrolling the loop tries to improve the ratio of data instructions to branch instructions by performing more instructions inside the loop before branching to the top of the loop.
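As a rough sketch, using the loop body from the test code above and assuming size is a multiple of 4:
// Unrolled by 4: one backward branch per 4 elements instead of one per element.
for (int c = 0; c < size; c += 4) {
    data[c]     = (data[c]     * 10 + 7) & 15;
    data[c + 1] = (data[c + 1] * 10 + 7) & 15;
    data[c + 2] = (data[c + 2] * 10 + 7) & 15;
    data[c + 3] = (data[c + 3] * 10 + 7) & 15;
}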
Another significant performance optimization is to organize the data so that it fits in the processor's cache. For example, you could have inner loops that each process one cache-sized block of data, while an outer loop moves on to the next block.
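A rough sketch of that idea applied to the split loops from the test code, with a placeholder block size rather than a tuned one:
// Process the array in cache-sized blocks: each of the three passes over a
// block finds its data still resident in cache before moving to the next block.
const int BLOCK = 4096;  // placeholder: number of elements per block
for (int base = 0; base < size; base += BLOCK) {
    int end = (base + BLOCK < size) ? base + BLOCK : size;
    for (int c = base; c < end; c++) data[c] *= 10;
    for (int c = base; c < end; c++) data[c] += 7;
    for (int c = base; c < end; c++) data[c] &= 15;
}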
These optimizations should only be applied after the program runs correctly and robustly, and the environment demands more performance. The environment here means observers (animation/movies), users (waiting for a response), or hardware (performing operations before a critical time event). For any other purpose it is a waste of your time, as the OS (running concurrent programs) and storage access will contribute more to your program's performance issues.
This will give you a good indication of whether or not one version is faster than another.
#include <array>
#include <chrono>
#include <iostream>
#include <numeric>
#include <string>
const int iterations = 100;
namespace
{
const int exchanges = 200;
template<typename TTest>
void Test(const std::string &name, TTest &&test)
{
typedef std::chrono::high_resolution_clock Clock;
typedef std::chrono::duration<float, std::milli> ms;
std::array<float, iterations> timings;
for (auto i = 0; i != iterations; ++i)
{
auto t0 = Clock::now();
test();
timings[i] = ms(Clock::now() - t0).count();
}
auto avg = std::accumulate(timings.begin(), timings.end(), 0.0f) / iterations;  // 0.0f so the float timings are not truncated to int
std::cout << "Average time, " << name << ": " << avg << std::endl;
}
}
int main()
{
Test("single loop",
[]()
{
for (auto i = 0; i < exchanges; ++i)
{
// some code
// some more code
// even more code
}
});
Test("separated loops",
[]()
{
for (auto i = 0; i < exchanges; ++i)
{
// some code
}
for (auto i = 0; i < exchanges; ++i)
{
// some more code
}
for (auto i = 0; i < exchanges; ++i)
{
// even more code
}
});
}
The thing is quite simple: the first piece of code is like taking a single lap on a race track, while the other is like running a full 3-lap race, so more time is required to take three laps than one. However, if the loops do work that must happen in sequence and depends on earlier results, then the second version is what you need. For example, if the first loop does some calculations and the second loop works with those results, then the two loops have to run in sequence; otherwise they don't.