How to hit cache? - c++

I am using tdm64/mingw32.
Let's say I have a class with a constructor:
class test {
    int number;
    int dependent;
public:
    test(int set_number) {
        number = set_number;
        dependent = some_function(number);  // some_function is declared elsewhere
    }
};
Will the code be faster if I switch to:
dependent = some_function(set_number);
And I need some explanation. Basically, does the first option have to wait with the function call until number has been written back to RAM? Or does it have to wait because the instruction queue is partially empty, waiting for a variable that has not been calculated yet? Is number pulled back for the operation from cache? L1? L2? L3? Will it have to wait multiple cycles or just one? Which assembly instructions will be generated from the following two lines?
number = set_number;
dependent = some_function(number);
What sort of assembly instructions will be generated to assign number and then to pass number to the function for that operation?
What if I have multiple situations like this, mixed with array operations in between?
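For reference, here is a minimal, compilable sketch of the two variants being compared (some_function is a placeholder I added, not part of the original post). With any optimization enabled, the freshly written value is normally still held in a register, so both variants typically compile to the same instructions; the call does not have to wait for number to reach RAM.
// Sketch only, not the original code: some_function is a stand-in so this compiles on its own.
int some_function(int x) { return x * 2; }

class test_a {
    int number;
    int dependent;
public:
    test_a(int set_number) {
        number = set_number;
        dependent = some_function(number);      // variant 1: reads the member just written
    }
};

class test_b {
    int number;
    int dependent;
public:
    test_b(int set_number) {
        number = set_number;
        dependent = some_function(set_number);  // variant 2: reads the argument directly
    }
};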

Large performance difference between comparing a variable to a fixed value and reading or writing from mapped memory address

I'm developing software that runs on a DE10 board with an ARM Cortex-A9 processor.
This software has to access physical memory addresses in order to communicate with the FPGA on the DE10, and this is done by mapping /dev/mem; this method is described here.
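For context, the /dev/mem method referred to above generally looks like the sketch below (the base address and span are placeholders, not values from the question). With O_SYNC the mapping is typically uncached, which is relevant to the timings later on, since every access then goes out to the bus instead of hitting a cache.
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Sketch of the usual /dev/mem mapping; pass the physical base of the FPGA
// bridge region and its size. The returned pointer is volatile because the
// mapping is device memory, not ordinary cached RAM.
volatile uint8_t* map_fpga_region(off_t base, size_t span) {
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) return nullptr;
    void* p = mmap(nullptr, span, PROT_READ | PROT_WRITE, MAP_SHARED, fd, base);
    close(fd);  // the mapping stays valid after closing the descriptor
    return (p == MAP_FAILED) ? nullptr : static_cast<volatile uint8_t*>(p);
}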
I have a situation where I have to select which of 4 addresses to send some values to, and this could be done in one of two ways:
Using an if statement that checks an integer variable (which is always 0 or 1 at that part of the loop) and only writing if it's 1.
Multiplying the values that should be sent by the aforementioned variable and writing to all addresses without any conditional, because writing zero doesn't have any effect on my system.
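Concretely, the two options look roughly like this (variable and address names are placeholders of mine, not from the original code, and neither option is claimed here to be the faster one):
#include <cstdint>

// 'flags[k]' is the 0-or-1 selector, 'addresses[k]' the mapped FPGA registers.
void send_with_branch(volatile uint8_t* addresses[4], const int flags[4], uint8_t value) {
    for (int k = 0; k < 4; k++) {
        if (flags[k] == 1) {            // option 1: write only where the flag is set
            *addresses[k] = value;
        }
    }
}

void send_with_multiply(volatile uint8_t* addresses[4], const int flags[4], uint8_t value) {
    for (int k = 0; k < 4; k++) {       // option 2: always write, zero when the flag is 0
        *addresses[k] = static_cast<uint8_t>(value * flags[k]);
    }
}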
I was curious about which would be faster, so I tried this:
First, I made this loop:
int test = 0;
for (int i = 0; i < 1000000; i++)
{
    if (test == 9)
    {
        test = 15;
    }
    test++;
    if (test == 9)
    {
        test = 0;
    }
}
The first if statement should never be satisfied, so its only contribution to the time taken in the loop is from its comparison itself.
The increment and the second if statement are just things I added in an attempt to prevent the compiler from just "optimizing out" the first if statement.
This loop is run once without being benchmarked (just in case there's any frequency scaling ramp-up, although I'm pretty sure there is none) and then it's run once again while being benchmarked; it takes around 18350 μs to complete.
Without the first if statement, it takes around 17260 μs.
Now, if I replace that first if statement with a line that sets the value of a memory-mapped address to the value of the integer test, like this:
for (int i = 0; i < 1000000; i++)
{
    *(uint8_t*)address = test;
    test++;
    if (test == 9)
    {
        test = 0;
    }
}
This loop takes around 253600 μs to complete, almost 14 times slower.
Reading from that address instead of writing to it barely changes anything.
Is this really the case, or is there some kind of compiler optimization possibly frustrating my benchmarking?
Should I expect this difference in performance (and thus favoring the comparison method) in the actual software?

Is there a way to manipulate function arguments by their position number?

I wish to be able to manipulate function arguments by the order in which they are given. So,
void sum(int a, int b, int c) {
    std::cout << arguments[0] + arguments[1];  // 'arguments' is the hoped-for positional access, as in JavaScript
}
sum(1, 1, 4);
should print 2. This feature is in JavaScript.
I need it to implement a numerical scheme. I'd create a function that takes 4 corner values and a tangential direction as input. Then, using the tangential direction, it decides which corners to use. I wish to avoid an 'if' condition, as this function would be called several times.
EDIT - The reason I do not wish to use an array as input is potential optimization and readability concerns. Let me explain my situation a bit more. solution is a 2D array, and we would be running this double for loop several times:
for (int i = 0; i < N_x; i++)
    for (int j = 0; j < N_y; j++)
        update_solution(solution[i][j], solution[i+1][j], solution[i-1][j], ...);
Optimization: N_x and N_y are large enough for me to be concerned about whether adding a step like variables_used = {solution[i][j], solution[i+1][j], ...} in every single iteration will increase the cost.
Readability: The arguments of update_solution indicate which indices were used to update the solution. Putting that in the previous line is slightly non-standard, judging by the code I have read.
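One way to get JavaScript-style positional access without a runtime if is to pack the arguments into a tuple and index it at compile time. This is only a sketch of the idea, with names of my own choosing, not code from the post:
#include <cstddef>
#include <iostream>
#include <tuple>

// Sketch: positional access to arguments via std::get on a tuple.
// The indices are compile-time constants, so selecting "which corners to use"
// needs no runtime branch when the direction is a template parameter.
template <std::size_t I, std::size_t J, typename... Args>
auto sum_two(Args... args) {
    auto arguments = std::make_tuple(args...);
    return std::get<I>(arguments) + std::get<J>(arguments);
}

int main() {
    std::cout << sum_two<0, 1>(1, 1, 4) << '\n';  // prints 2, like the JavaScript example
}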

Does accessing arrays slow down my function?

I have 2 functions:
unsigned long long getLineAsRow(unsigned long long board, int col) {
    unsigned long long column = (board >> (7 - (col - 1))) & col_mask_right;
    column *= magicLineToRow;
    return (column >> 56) & row_mask_bottom;
}

unsigned long long getDiagBLTR_asRow(unsigned long long board, int line, int row) {
    unsigned long long result = board & diagBottomLeftToTopRightPatterns[line][row];
    result = result << diagBLTR_shiftUp[line][row];
    result = (result * col_mask_right) >> 56;
    return result;
}
The only big difference I see is the access to a 2-dimensional array, defined like:
int diagBRTL_shiftUp[9][9] = {};
I call both functions 10.000.000 times:
getLineAsRow ... time used: 1.14237s
getDiagBLTR_asRow ... time used: 2.18997s
I tested it with cl (vc++) and g++. Nearly no difference.
It is a really huge difference; do you have any advice?
The question of what creates the difference between the execution times of your two functions really cannot be answered without knowing either the resulting assembler code or which of the globals you are accessing are actually constants that can be compiled right into the code. Anyway, analyzing your functions, we see that:
Function 1:
- reads two arguments from the stack, returns a single value
- reads three globals, which may or may not be constants
- performs six arithmetic operations (the two minuses in 7-(col-1) can be collapsed into a single subtraction)
Function 2:
- reads three arguments from the stack, returns a single value
- reads one global, which may or may not be a constant
- dereferences two pointers (not four, see below)
- does five arithmetic operations (the three you see, plus two which produce the array indices)
Note that accesses to 2D arrays actually boil down to a single memory access. When you write diagBottomLeftToTopRightPatterns[line][row], your compiler transforms it to something like diagBottomLeftToTopRightPatterns[line*9 + row]. That's two extra arithmetic instructions, but only a single memory access. What's more, the result of the calculation line*9 + row can be recycled for the second 2D array access.
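As a small illustration of that index flattening (generic int arrays, not the exact tables from the question):
// Illustration only: for a statically sized arr[9][9], arr[line][row] compiles to one
// load from a computed offset, and the offset line*9 + row can be reused for a second array.
int patterns[9][9];
int shifts[9][9];

int demo(int line, int row) {
    int offset = line * 9 + row;          // computed once
    int a = *(&patterns[0][0] + offset);  // what patterns[line][row] boils down to
    int b = *(&shifts[0][0] + offset);    // the same offset reused for the second table
    return a + b;
}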
Arithmetic operations are fast (on the order of a single CPU cycle), reads from memory may take four to twenty CPU cycles. So I guess that the three globals you access in function 1 are all constants which your compiler built right into the assembler code. This leaves function 2 with more memory accesses, making it slower.
However, one thing bothers me: if I assume you have a normal CPU with at least a 2 GHz clock frequency, your times suggest that your functions consume more than 200 and 400 cycles, respectively. This is significantly more than expected. Even if your CPU has no values in cache, your functions shouldn't take more than roughly 100 cycles. So I would suggest taking a second look at how you are timing your code; I assume you have some more code in your measuring loop which spoils your results.
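For what it's worth, a minimal timing harness along these lines (std::chrono, with a placeholder workload of mine, since the real functions depend on globals not shown) keeps everything except the calls themselves out of the measured region:
#include <chrono>
#include <cstdio>

// Placeholder workload standing in for getLineAsRow / getDiagBLTR_asRow.
unsigned long long workload(unsigned long long x) {
    return (x * 0x0101010101010101ULL) >> 56;
}

int main() {
    volatile unsigned long long sink = 0;   // keeps the calls from being optimized away
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 10000000; i++) {
        sink += workload(i);
    }
    auto stop = std::chrono::steady_clock::now();
    std::printf("time used: %fs\n",
                std::chrono::duration<double>(stop - start).count());
    return 0;
}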
Those functions do completely different things, but I assume that's not relevant to the question.
Sometimes these tests don't show the real cost of a function.
In this case the main cost is the access to the array in memory. After the first access it will be in the cache, and after that your function is going to be fast. So you don't really measure this characteristic: even though the test runs 10,000,000 iterations, you pay the price only once.
Now if you execute this function in a batch, calling it many times in bulk, then it's a non-issue. The cache will be warm.
If you access it sporadically, in an application which has high memory demands and frequently flushes the CPU caches, it could be a performance problem. But that of course depends on the context: how often it's called, etc.

Removing code portions does not match the profiler's data

I am doing a little proof of concept profile and optimize type of example. However, I ran into something that I can't quite explain, and I'm hoping that someone here can clear this up.
I wrote a very short snippet of code:
#include <stdio.h>

/* Globals and prototypes added so the snippet compiles stand-alone. */
int a, b, c;
void callbackWasterOne(void);
void callbackWasterTwo(void);

int main (void)
{
    for (int j = 0; j < 1000; j++)
    {
        a = 1;
        b = 2;
        c = 3;
        for (int i = 0; i < 100000; i++)
        {
            callbackWasterOne();
            callbackWasterTwo();
        }
        printf("Two a: %d, b: %d, c: %d.", a, b, c);
    }
    return 0;
}
void callbackWasterOne(void)
{
    a = b * c;
}
void callbackWasterTwo(void)
{
    b = a * c;
}
All it does is call two very basic functions that just multiply numbers together. Since the code is identical, I expect the profiler (OProfile) to report roughly the same numbers for both.
I run this code 10 times per profile, and I got the following values for how much time is spent in each function:
main: average = 5.60%, stdev = 0.10%
callbackWasterOne = 43.78%, stdev = 1.04%
callbackWasterTwo = 50.24%, stdev = 0.98%
rest is in miscellaneous things like printf and no-vmlinux
The difference between the times for callbackWasterOne and callbackWasterTwo is significant enough (to me at least), given that they have the same code, that I switched their order in my code and reran the profiler, with the following results:
main: average = 5.45%, stdev = 0.40%
callbackWasterOne = 50.69%, stdev = 0.49%
callbackWasterTwo = 43.54%, stdev = 0.18%
rest is in miscellaneous things like printf and no-vmlinux
So evidently the profiler samples one more than the other based on the execution order. Not good. Disregarding this, I decided to see the effects of removing some code, and I got the following execution times (averages):
Nothing removed: 0.5295s
call to callbackWasterOne() removed from for loop: 0.2075s
call to callbackWasterTwo() removed from for loop: 0.2042s
remove both calls from for loop: 0.1903s
remove both calls and the for loop: 0.0025s
remove contents of callbackWasterOne: 0.379s
remove contents of callbackWasterTwo: 0.378s
remove contents of both: 0.382s
So here is what I'm having trouble understanding:
When I remove just one of the calls from the for loop, the execution time drops by ~60%, which is greater than the time spent by that one function + the main in the first place! How is this possible?
Why is the effect of removing both calls from the loop so small compared to removing just one? I can't figure out this non-linearity. I understand that the for loop is expensive, but in that case (if most of the remaining time can be attributed to the for loop that performs the function calls), why would removing one of the calls cause such a large improvement in the first place?
I looked at the disassembly and the two functions are the same in code. The calls to them are the same, and removing the call simply deletes the one call line.
Other info that might be relevant
I'm using Ubuntu 14.04LTS
The code is compiled by Eclipse with no optimization (-O0)
I time the code by running it in a terminal using "time"
I use OProfile with count = 10000 and 10 repetitions.
Here are the results from when I do this with -O1 optimization:
main: avg = 5.89%, stdev = 0.14%
callbackWasterOne: avg = 44.28%, stdev = 2.64%
callbackWasterTwo: avg = 49.66%, stdev = 2.54% (greater than before)
Rest is miscellaneous
Results of removing various bits (execution time averages):
Nothing removed: 0.522s
Remove callbackWasterOne call: 0.149s (71.47% decrease)
Remove callbackWasterTwo call: 0.123s (76.45% decrease)
Remove both calls: 0.0365s (93.01% decrease) (what I would expect given the profile data just above)
So removing one call now is much better than before, and removing both still carries a benefit (probably because the optimizer understands that nothing happens in the loop). Still, removing one is much more beneficial than I would have anticipated.
Results of the two functions using different variables:
I defined 3 more variables for callbackWasterTwo() to use instead of reusing same ones. Now the results are what I would have expected.
main: avg = 10.87%, stdev = 0.19% (average is greater, but maybe due to those new variables)
callbackWasterOne: avg = 46.08%, stdev = 0.53%
callbackWasterTwo: avg = 42.82%, stdev = 0.62%
Rest is miscellaneous
Results of removing various bits (execution time averages):
Nothing removed: 0.520s
Remove callbackWasterOne call: 0.292s (43.83% decrease)
Remove callbackWasterTwo call: 0.291s (44.07% decrease)
Remove both calls: 0.065s (87.55% decrease)
So now removing both calls is pretty much equivalent (within stdev) to removing one call + the other.
Since the result of removing either function is pretty much the same (43.83% vs 44.07%), I am going to go out on a limb and say that perhaps the profiler data (46% vs 42%) is still skewed. Perhaps it is the way it samples (going to vary the counter value next and see what happens).
It appears that the success of optimization relates pretty strongly to the code reuse fraction. The only way to achieve "exactly" (you know what I mean) the speedup noted by the profiler is to optimize on completely independent code. Anyway this is all interesting.
I am still looking for some explanation ideas for the 70% decrease in the -O1 case though...
I did this with 10 functions (different formulas in each, but using some combination of 6 different variables, 3 at a time, all multiplication):
These results are disappointing to say the least. I know the functions are identical, and yet the profiler indicates that some take significantly longer. No matter which one I remove (a "fast" or a "slow" one), the results are the same ;) So this leaves me to wonder: how many people are relying on a profiler that points them at the wrong areas of code to fix? If I unknowingly saw these results, what could possibly tell me to go fix the 5% function rather than the 20% one (even though they are exactly the same)? What if the 5% one was much easier to fix, with a large potential benefit? And of course, this profiler might just not be very good, but it is popular! People use it!
Here is a screenshot. I don't feel like typing it in again:
My conclusion: I am overall rather disappointed with OProfile. I decided to try out Callgrind (Valgrind) through the command line on the same functions, and it gave me far more reasonable results. In fact, the results were very reasonable (all functions spent roughly the same amount of time executing). I think Callgrind samples far more often than OProfile ever did.
Callgrind will still not explain the difference in improvement when a function is removed, but at least it gives the correct baseline information...
Ah, I see you did look at the assembly. This question is indeed interesting in its own right, but in general there's no point profiling unoptimized code, since there's so much boilerplate that could easily be reduced even in -O1.
If it's really only the call that's missing, then that could explain the timing differences: there's lots of boilerplate from the -O0 stack manipulation code (any caller-saved registers have to be pushed onto the stack, and any arguments too; afterwards any return value has to be handled and the opposite stack manipulation has to be done). This contributes to the time it takes to call the functions, but is not necessarily fully attributed to the functions themselves by OProfile, since that code is executed before/after the function is actually called.
I suspect the reason the second function always seems to take less time is that there's less (or no) stack juggling that needs to be done: the parameter values are already on the stack thanks to the previous function call, so, as you've seen, only the call to the function has to be executed, without any other extra work.

In C++, how would I make a random number (either 1 or 2) that changes every 5 minutes?

I'm trying to make a simple game and I have a shop in the game. I want it so that every 5 minutes (if the function changeItem() is called) the item in the shop either switches or stays the same. I have no problem generating the random number, but I have yet to find a thread that shows how to make it generate a different one every 5 minutes. Thank you.
In short, keep track of the last time the changeItem() function was called. If it has been more than 5 minutes since the last time it was called, then use your random number generator to generate a new number. Otherwise, use the saved number from the last time it was generated.
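A minimal sketch of that idea (the function name matches the question, but the 1-or-2 range and the use of std::rand are my assumptions, not part of the answer):
#include <cstdlib>
#include <ctime>

// Sketch: remember when the item was last rolled and only re-roll after 5 minutes.
// Call std::srand(std::time(nullptr)) once at startup to seed the generator.
int changeItem() {
    static std::time_t lastRoll = 0;
    static int item = 1;
    std::time_t now = std::time(nullptr);
    if (now - lastRoll >= 5 * 60) {   // more than 5 minutes since the last roll
        item = 1 + std::rand() % 2;   // either 1 or 2
        lastRoll = now;
    }
    return item;                      // otherwise keep the saved number
}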
You've already accepted an answer, but I would like to say that for apps that need simple timing like this and don't need great accuracy, a simple calculation in the main loop is all you need.
Kicking off a thread for a single timer is a lot of unnecessary overhead.
So, here's the code showing how you'd go about doing it.
#include <time.h>

#define FIVE_MINUTES (60*5)

void changeItem();   // defined elsewhere in the game code

int main(int argc, char** argv){
    time_t lastChange = 0, tick;
    bool run_game_loop = true;
    while (run_game_loop){
        // ... game loop
        tick = time(NULL);
        if ((tick - lastChange) >= FIVE_MINUTES){
            changeItem();
            lastChange = tick;
        }
    }
    return 0;
}
It does assume the loop runs reasonably regularly, though. If on the other hand you need it to be accurate, then a thread would be better. And depending on the platform, there are APIs for timers that get called by the system.
Standard and portable approach:
You could consider C++11 threads. The general idea would be:
#include <thread>
#include <chrono>

void myrandomgen () // function that refreshes your random number:
                    // will be executed as a thread
{
    while (! gameover ) {   // gameover: your global game-over flag
        std::this_thread::sleep_for (std::chrono::minutes(5)); // wait 5 minutes
        ... // generate your random number and update your game data structure
    }
}
In the main function, you would then instantiate a thread with your function:
std::thread t1 (myrandomgen); // create and launch the thread
... // do your stuff until game over
t1.join (); // wait until the thread returns
Of course you could also pass parameters (references to shared variables, etc...) when you create the thread, like this:
std::thread t1 (myrandomgen, param1, param2, ...);
The advantage of this approach is that it's standard and portable.
Non-portable alternatives:
I'm less familiar with these, but:
In an MS Windows environment, you could use SetTimer(...) to define a function to be called at a regular interval (and KillTimer(...) to delete it). But this requires a program structure built around the Windows event processing loop.
In a Linux environment, you could similarly define a callback function with signal(SIGALRM, ...) and activate periodic calls with alarm().
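A rough sketch of that signal/alarm approach (Linux-only; the flag name is mine, and the handler deliberately does nothing but set a flag and re-arm the alarm, since very little else is safe to do inside a signal handler):
#include <csignal>
#include <unistd.h>

volatile std::sig_atomic_t timeToReroll = 0;

void onAlarm(int) {
    timeToReroll = 1;
    alarm(5 * 60);                 // re-arm for another 5 minutes
}

int main() {
    std::signal(SIGALRM, onAlarm);
    alarm(5 * 60);                 // first alarm in 5 minutes
    while (true) {
        if (timeToReroll) {
            timeToReroll = 0;
            // ... generate the new random number here, outside the handler
        }
        // ... rest of the game loop
    }
}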
Small update on performance considerations:
Following several remarks about threads being overkill and about performance, I've done a benchmark, executing 1 billion loop iterations and waiting 1 microsecond every 100K iterations. The whole thing was run on an i7 multicore CPU:
Non-threaded execution yielded 213K iterations per millisecond.
Two-thread execution yielded 209K iterations per millisecond per thread, so slightly slower for each thread. The total execution time was, however, only 70 to 90 ms longer, so the overall throughput is about 418K iterations per millisecond.
How come? Because the second thread uses an otherwise unused core on the processor. This means that with an adequate architecture, a game could process many more calculations when using multithreading...