I thought I would save some time if I declared the iterating variable once as a class member:
struct Foo {
int i;
void method1() {
for(i=0; i<A; ++i) ...
}
void method2() {
for(i=0; i<B; ++i) ...
}
} foo;
however, this seems to be roughly 20% faster:
struct Foo {
void method1() {
for(int i=0; i<A; ++i) ...
}
void method2() {
for(int i=0; i<B; ++i) ...
}
} foo;
in this code
void loop() { // Arduino loops
foo.method1();
foo.method2();
}
Can you explain the performance difference?
(I need to run many simple parallel "processes" on Arduino, where such micro-optimization makes a difference.)
When you declare your loop variable inside a loop, it is scoped very narrowly. The compiler is free to keep it in a register all the time, so it does not get committed to memory even once.
When you declare your loop variable as an instance variable, the compiler has no such flexibility. It must keep the variable in memory, in case some of your methods would want to examine its state. For example, if you do this in your first code example
void method2() {
for(i=0; i<B; ++i) { method3(); }
}
void method3() {
printf("%d\n", i);
}
the value of i in method3 must be changing as the loop progresses. The compiler has no way around committing all its side effects to memory. Moreover, it cannot assume that i stayed the same when you come back from method3, further increasing the number of memory accesses.
Dealing with updates in memory requires a lot more CPU cycles than performing updates to register-based variables. That is why it is always a good idea to keep your loop variables scoped down to the loop level.
Can you explain the performance difference?
The most plausible explanation I could come up with for this performance difference is:
The data member i lives in the object (here, in global memory), so it cannot be kept in a register all the time; operations on it are therefore much slower than on a loop-local i, because its scope is very broad (the data member i has to be visible to all the member functions of the class).
@DarioOO adds:
In addition, the compiler is not free to store it temporarily in a register because method3() could throw an exception, leaving the object in an unwanted state (because theoretically nothing prevents you from writing int k=this->i; for(k=0;k<A;k++)method3(); this->i=k;). That code would be almost as fast as a local variable, but you have to take into account what happens when method3() throws (I believe that when there is a guarantee it does not throw, the compiler will optimize this with -O3 or higher; to be verified).
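For reference, a minimal sketch of that manual pattern (B and method3() here are placeholders standing in for the question's code); note that it only preserves behaviour if method3() neither throws nor inspects i:
struct Foo {
    int i = 0;
    static const int B = 100;        // placeholder bound, like the question's B
    void method3() { /* assumed not to read or modify i */ }
    void method2() {
        int k = 0;                   // local copy can stay in a register
        for (k = 0; k < B; ++k) {
            method3();
        }
        i = k;                       // single write back to the member at the end
    }
};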
Related
I'm trying to parallelize my code with OpenMP.
I have a global vector so that I can access it from my functions.
Is there a way that I can assign a copy of the vector to every thread so they can do stuff with it?
Here is some pseudocode to describe my problem:
double var = 1;
std::vector<double> vec;
void function()
{
vec.push_back(var);
return;
}
int main()
{
omp_set_num_threads(2);
#pragma omp parallel
{
#pragma omp for private(vec)
for (int i = 0; i < 4; i++)
{
function();
}
}
return 0;
}
Notes:
I want each thread to have its own vector to store specific values, which later only the same thread needs to access
each thread calls a function (sometimes the same one), which then does some work on the vector (changing specific values)
(in my original code there are many vectors and functions; I've just tried to break the problem down)
I've tried #pragma omp threadprivate(), but that only works for variables and not for vectors.
Also, redeclaring the vector inside the parallel region doesn't help, as my function always works with the global vector, which then leads to problems when different threads call it at the same time.
Is there a way that I can assign a copy of the vector to every thread so they can do stuff with it?
Yes, the firstprivate clause does this:
The firstprivate clause declares one or more list items to be private to a task, and initializes each of them with the value that the corresponding original item has when the construct is encountered.
So, it creates a private copy of the variable for each thread, but the scope of this private variable is the structured block following the OpenMP construct. Outside this block you access the global variable:
#pragma omp ... firstprivate(vec)
{
vec.push_back(...); // private copy is changed here, which is threadsafe
}
void function()
{
vec.push_back(var); // the global variable is changed here, which is not threadsafe
return;
}
If you wish to use the private copy of your variable in a function, you have to pass it to the function by reference:
void function(std::vector<double>& x, double y)
{
x.push_back(y);
return;
}
...
#pragma omp for firstprivate(vec)
for (int i = 0; i < 4; i++)
{
function(vec, 1);
}
Note, however, that as pointed out and explained by @JeromeRichard, you should not use global variables in your code.
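Putting the two points together, a minimal self-contained sketch of the recommended shape (the function name fill is illustrative): no global vector at all, the per-thread vector is declared inside the parallel region and passed by reference.
#include <cstdio>
#include <vector>
#include <omp.h>

// The vector is passed by reference, so the function never touches global state.
void fill(std::vector<double>& x, double y)
{
    x.push_back(y);
}

int main()
{
    omp_set_num_threads(2);
    #pragma omp parallel
    {
        std::vector<double> vec;      // one independent vector per thread
        #pragma omp for
        for (int i = 0; i < 4; i++)
        {
            fill(vec, 1.0);           // each thread fills only its own copy
        }
        #pragma omp critical
        std::printf("thread %d stored %zu values\n",
                    omp_get_thread_num(), vec.size());
    }
    return 0;
}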
I have been developing a simple framework for embedded environments. I came to a design decision on whether to use virtual calls, CRTP, or maybe a switch statement. I have been told that vtables perform poorly in embedded.
Following up from this question
vftable performance penalty vs. switch statement
I decided to run my own test. I ran three different ways to call a member function.
using the ETL library's etl::function, from a library meant to mimic the STL but for embedded environments (no dynamic allocation)
using a master switch statement that calls an object's method based on the object's int ID
using a pure virtual call through a base class pointer
I never tried this with a basic CRTP pattern, but etl::function was supposed to be a variation on that, since that is the mechanism the pattern uses; a minimal CRTP sketch follows below for reference.
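(A minimal CRTP sketch, purely for illustration; the type names are made up and are not part of the timed code. The base class resolves foo() at compile time through the derived type, so no vtable is involved.)
#include <cstdint>

template <typename Derived>
struct CrtpBase {
    void foo() { static_cast<Derived*>(this)->fooImpl(); }  // resolved at compile time
};

struct Counter : CrtpBase<Counter> {
    uint32_t a = 0;
    void fooImpl() { a++; }
};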
The times I got on MSVC (with similar relative performance on an ARM Cortex-M4) were:
etl : 400 million nanoseconds
switch : 420 million nanoseconds
virtual: 290 million nanoseconds
The pure virtual calls are significantly faster.
Am I missing something, or are virtual calls just not as bad as people make them out to be? Here is the code used for the tests.
class testetlFunc
{
public:
uint32_t a;
testetlFunc() { a = 0; };
void foo();
};
class testetlFunc2
{
public:
uint32_t a;
testetlFunc2() { a = 0; };
virtual void foo() = 0;
};
void testetlFunc::foo()
{
a++;
}
class testetlFuncDerived : public testetlFunc2
{
public:
testetlFuncDerived();
void foo() override;
};
testetlFuncDerived::testetlFuncDerived()
{
}
void testetlFuncDerived::foo()
{
a++;
}
etl::ifunction<void>* timer1_callback1;
etl::ifunction<void>* timer1_callback2;
etl::ifunction<void>* timer1_callback3;
etl::ifunction<void>* timer1_callback4;
etl::ifunction<void>* etlcallbacks[4];
testetlFunc ttt;
testetlFunc ttt2;
testetlFunc ttt3;
testetlFunc ttt4;
testetlFuncDerived tttd1;
testetlFuncDerived tttd2;
testetlFuncDerived tttd3;
testetlFuncDerived tttd4;
testetlFunc2* tttarr[4];
static void MasterCallingFunction(uint16_t ID) {
switch (ID)
{
case 1:
ttt.foo();
break;
case 2:
ttt2.foo();
break;
case 3:
ttt3.foo();
break;
case 4:
ttt4.foo();
break;
default:
break;
}
};
int main()
{
tttarr[0] = (testetlFunc2*)&tttd1;
tttarr[1] = (testetlFunc2*)&tttd2;
tttarr[2] = (testetlFunc2*)&tttd3;
tttarr[3] = (testetlFunc2*)&tttd4;
etl::function_imv<testetlFunc, ttt, &testetlFunc::foo> k;
timer1_callback1 = &k;
etl::function_imv<testetlFunc, ttt2, &testetlFunc::foo> k2;
timer1_callback2 = &k2;
etl::function_imv<testetlFunc, ttt3, &testetlFunc::foo> k3;
timer1_callback3 = &k3;
etl::function_imv<testetlFunc, ttt4, &testetlFunc::foo> k4;
timer1_callback4 = &k4;
etlcallbacks[0] = timer1_callback1;
etlcallbacks[1] = timer1_callback2;
etlcallbacks[2] = timer1_callback3;
etlcallbacks[3] = timer1_callback4;
//results for etl::function --------------
int rng;
srand(time(0));
StartTimer(1)
for (uint32_t i = 0; i < 2000000; i++)
{
rng = rand() % 4 + 0;
for (uint16_t j= 0; j < 4; j++)
{
(*etlcallbacks[rng])();
}
}
StopTimer(1)
//results for switch --------------
StartTimer(2)
for (uint32_t i = 0; i < 2000000; i++)
{
rng = rand() % 4 + 0;
for (uint16_t j = 0; j < 4; j++)
{
MasterCallingFunction(rng);
}
}
StopTimer(2)
//results for virtual vtable --------------
StartTimer(3)
for (uint32_t i = 0; i < 2000000; i++)
{
rng = rand() % 4 + 0;
for (uint16_t j = 0; j < 4; j++)
{
tttarr[rng]->foo();
//ttt.foo();
}
}
StopTimer(3)
PrintAllTimerDuration
}
If what you really need is virtual dispatch, C++'s virtual calls are probably the most performant implementation you can get, and you should use them. Scores of compiler engineers have worked on optimizing them to the best performance they could get.
The reason behind people saying to avoid virtual methods is in my experience for when you do not need them. Avoid the virtual keyword on methods that can be statically dispatched, and on hot spots in your code.
Every time you call an object's virtual method, what happens is that the object's v-table is accessed (likely hurting memory locality and evicting a cache line or two), then a pointer is dereferenced to get at the actual function address, and then the actual function call happens. This is only a few nanoseconds slower per call, but if you are a few nanoseconds slower enough times in a loop, it suddenly makes a difference.
When you call a statically dispatched (non-virtual) method, none of the earlier operations happen. The actual function call just happens. If the calling function and the called function are close to each other in memory, the caches can stay the way they are.
So, avoid virtual dispatch in high-performance or low-CPU-power situations and in tight loops (you can, for example, switch on a member variable once and call a method that contains the entire loop instead; see the sketch below).
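A rough sketch of that idea, with made-up type and method names; the point is simply that the dispatch happens once per block of work rather than once per iteration:
struct Filter {
    enum class Kind { Lowpass, Highpass };
    Kind kind = Kind::Lowpass;

    // One switch for the whole buffer, then a tight non-virtual loop.
    void processBlock(float* samples, int n) {
        switch (kind) {
            case Kind::Lowpass:  lowpassLoop(samples, n);  break;
            case Kind::Highpass: highpassLoop(samples, n); break;
        }
    }

private:
    void lowpassLoop(float* s, int n)  { for (int i = 0; i < n; ++i) s[i] *= 0.5f; }
    void highpassLoop(float* s, int n) { for (int i = 1; i < n; ++i) s[i] -= s[0]; }
};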
But there is the saying "premature optimization is the root of all evil". Measure performance beforehand. "Embedded" CPUs have become much faster and more powerful than those of a few years ago. Compilers for popular CPUs are better optimized than ones only recently adapted to a new or exotic CPU. It may simply be that your compiler has an optimizer that alleviates any problems, or that your CPU is similar enough to a common desktop CPU to reap the benefits of work done for more popular CPUs.
Or you may have more RAM etc. than the people who told you to avoid virtual calls.
So, profile, and if the profiler says it's fine, it's fine. Also make sure your tests are representative. Your test code may just be written in a way that a network request coming in pre-empted the switch statement and made it seem slower than it really was, or that the virtual method calls were benefiting from the cache loaded by the non-virtual calls.
Given the following:
class ReadWrite {
public:
int Read(size_t address);
void Write(size_t address, int val);
private:
std::map<size_t, int> db;
};
In the Read function, when accessing an address to which no previous write was made, I want to either throw an exception designating such an error, or allow it and return 0. In other words, I would like to use either std::map<size_t, int>::operator[]() or std::map<size_t, int>::at(), depending on some bool value which the user can set. So I add the following:
class ReadWrite {
public:
int Read(size_t add) { if (allow) return db[add]; return db.at(add);}
void Write(size_t add, int val) { db[add] = val; }
void Allow() { allow = true; }
private:
bool allow = false;
std::map<size_t, int> db;
};
The problem with that is:
Usually, the program will have one call to Allow() or none at the beginning of the program, and then many accesses afterwards. So, performance-wise, this code is bad because it performs the if (allow) check every time, even though it's usually either always true or always false.
So how would you solve such problem?
Edit:
While the described use case (one Allow() call or none at the start) of this class is very likely, it's not guaranteed, so I must allow the user to call Allow() dynamically.
Another Edit:
Solutions which use a function pointer: what about the performance overhead incurred by a function pointer call that the compiler cannot inline? If we use std::function instead, will that solve the issue?
Usually, the program will have one call to Allow() or none at the beginning of the program, and then many accesses afterwards. So, performance-wise, this code is bad because it performs the if (allow) check every time, even though it's usually either always true or always false. So how would you solve such a problem?
I won't. The CPU will.
Branch prediction will figure out that the answer is most likely to stay the same for a long time, so it can optimize the branch very effectively at the hardware level. It will still incur some overhead, but a negligible one.
If you really need to optimize your program, I think you'd be better off using std::unordered_map instead of std::map, or moving to some faster map implementation, like google::dense_hash_map. The branch is insignificant compared to the map lookup.
If you want to decrease the time-cost, you have to increase the memory-cost. Accepting that, you can do this with a function pointer. Below is my answer:
class ReadWrite {
public:
void Write(size_t add, int val) { db[add] = val; }
// when allowed, make the function pointer point to read2
void Allow() { Read = &ReadWrite::read2;}
//function pointer that points to read1 by default
int (ReadWrite::*Read)(size_t) = &ReadWrite::read1;
private:
int read1(size_t add){return db.at(add);}
int read2(size_t add) {return db[add];}
std::map<size_t, int> db;
};
The function pointer can be called like the other members, but through the pointer-to-member call syntax. As an example:
ReadWrite rwObject;
//some code here
//...
(rwObject.*rwObject.Read)(5); //use of the member function pointer
//
Note that non-static data member initializers are available since C++11, so the int (ReadWrite::*Read)(size_t) = &ReadWrite::read1; may not compile with older standards. In that case, you have to declare a constructor explicitly and initialize the function pointer there.
You can use a pointer to member function.
class ReadWrite {
public:
void Write(size_t add, int val) { db[add] = val; }
int Read(size_t add) { return (this->*Rfunc)(add); }
void Allow() { Rfunc = &ReadWrite::Read2; }
private:
std::map<size_t, int> db;
int Read1(size_t add) { return db.at(add); }
int Read2(size_t add) { return db[add]; }
int (ReadWrite::*Rfunc)(size_t) = &ReadWrite::Read1;
};
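A short usage sketch for this version (assuming the class is otherwise used exactly as in the question): the public Read() wrapper forwards to whichever private implementation Rfunc currently points at, so the call syntax stays unchanged for the user.
ReadWrite rw;
rw.Write(5, 42);
int a = rw.Read(5);   // dispatches to Read1: db.at(), throws for unknown addresses
rw.Allow();
int b = rw.Read(7);   // dispatches to Read2: db[], returns 0 for unknown addresses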
If you want runtime dynamic behaviour you'll have to pay for it at runtime (at the point you want your logic to behave dynamically).
You want different behaviour at the point where you call Read depending on a runtime condition and you'll have to check that condition.
No matter whether your overhead is a function pointer call or a branch, you'll end up with a jump or call to different places in your program, depending on allow, at the point where Read is called by the client code.
Note: Profile and fix real bottlenecks - not suspected ones. (You'll learn more if you profile by either having your suspicion confirmed or by finding out why your assumption about the performance was wrong.)
C++ compilers emit warnings when a local variable may be uninitialized on first use. However, sometimes I know that the variable will always be written before being used, so I do not need to initialize it. When I do this, the compiler emits a warning, of course. Since my team is building with -Werror, the code will not compile. How can I turn off this warning for specific local variables? I have the following restrictions:
I am not allowed to change compiler flags
The solution must work on all compilers (i.e., no gnu-extensions or other compiler specific attributes)
I want to use this only on specific local variables. Other uninitialized locals should still trigger a warning
The solution should not generate any instructions.
I cannot alter the class of the local variable. I.e., I cannot simply add a "do nothing" constructor.
Of course, the easiest solution would be to initialize the variable. However, the variable is of a type that is costly to initialize (even default initialization is costly) and the code is used in a very hot loop, so I do not want to waste the CPU cycles for an initialization that is guaranteed to be overwritten before it is read anyway.
So is there a platform-independent, compiler-independent way of telling the compiler that a local variable does not need to be initialized?
Here is some example code that might trigger such a warning:
void foo(){
T t;
for(int i = 0; i < 100; i++){
if (i == 0) t = ...;
if (i == 1) doSomethingWith(t);
}
}
As you see, the first loop cycle initializes t and the second one uses it, so t will never be read uninitialized. However, the compiler is not able to deduce this, so it emits a warning. Note that this code is quite simplified for the sake of brevity.
My answer recommends another approach: instead of disabling the warning, just reformulate the implementation a little. I see two options:
First Option
You can use a pointer instead of a real object and guarantee that it is initialized only when you need it, something like:
std::unique_ptr<T> t;
for(int i=0; i<100; i++)
{
    if(i == 0) { if(!t) t.reset(new T); *t = ...; }
    if(i == 1) { if(!t) t.reset(new T); doSomethingWith(*t); }
}
It's interesting to note that when i==0 you probably don't need to construct t using the default constructor at all. I can't guess how your operator= is implemented, but I suppose you are assigning an object that's already constructed in the code you are omitting in the ... segment.
Second Option
As your code experiences such a huge performance loss, I can infer that T will never be a basic type (int, float, etc.). So, instead of using pointers, you can reimplement your class T in a way that uses an init method and avoids doing the heavy initialization in the constructor. You can use a boolean to indicate whether the class has been initialized:
class FooClass
{
public:
    FooClass() : initialized_(false) { ... }
    //Class implementation
    void init()
    {
        //Do your heavy initialization code here.
        initialized_ = true;
    }
    bool initialized() const { return initialized_; }
private:
    bool initialized_;
};
Then you will be able to write it like this:
T t;
for(int i=0; i<100; i++)
{
    if(i == 0) { if(!t.initialized()) t.init(); t = ...; }
    if(i == 1) { if(!t.initialized()) t.init(); doSomethingWith(t); }
}
If the code is not very complex, I usually unroll one of the iterations:
void foo(){
T t;
t = ...;
for(int i = 1; i < 100; i++){
doSomethingWith(t);
}
}
Although the example below compiles fine except for the last line (which produces the error shown), I'd like to know the ins and outs of this 'scoping' within a scope, and the terminology for it, if any.
Consider these brackets:
void func()
{
int i = 0;
{ // nice comment to describe this scope
while( i < 10 )
++i;
}
{ // nice comment to describe this scope
int j= 0;
while( j< 10 )
++j;
}
i = 0; // OK
// j = 0; // error C2065
}
Uncommenting that last line produces:
error C2065: 'j' : undeclared identifier
edit:
The accepted answer is from bitmask, although I think everyone should read it in the context of anio's answer, especially the line: "perhaps you should break your function into 2 functions".
Do. By all means!
Keeping data as local as possible and as const as possible has two main advantages:
side effects are reduced and the code becomes more functional
with complex objects, destructors can be invoked early within a function, as soon as the data is not needed any more
Additionally, this can be useful for documentation to summarise the job a particular part of a function does.
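As a small illustration of the destructor point (the file name and stream type here are arbitrary examples, not taken from the question): an extra block lets an expensive resource be released as soon as it is no longer needed, long before the function returns.
#include <fstream>
#include <string>

void process()
{
    std::string contents;
    {   // the stream only lives for this block
        std::ifstream in("data.txt");
        std::getline(in, contents);
    }   // "in" is destroyed here, so the file handle is released early
    // ...lengthy computation on "contents" continues without holding the file open...
}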
I've heard this being referred to as explicit or dummy scoping.
I personally don't find much value in adding additional scoping within a function. If you are relying on it to separate parts of your function, perhaps you should break your function into 2 functions. Smaller functions are better than larger ones. You should strive to have small easily understood functions.
The one legitimate use of scopes within a function is for limiting the duration of a lock:
void doX()
{
// Do some work
{
//Acquire lock
} // Lock automatically released because of RAII
}
The inner scope effectively limits the code over which the lock is held. I believe this is common practice.
Yes, definitely - it's a great habit to always keep your variables as local as possible! Some examples:
for (std::string line; std::getline(std::cin, line); ) // does not
{ // leak "line"
// process "line" // into ambient
} // scope
int result = 1;
{ // putting this in a separate scope
int a = foo(); // allows us to copy/paste the entire
a += 3; // block without worrying about
int b = bar(a); // repeated declarators
result *= (a + 2*b);
}
{ // ...and we never really needed
int a = foo(); // a and b outside of this anyway!
a += 3;
int b = bar(a);
result *= (a + 2*b);
}
Sometimes a scope is necessary for synchronisation, and you want to keep the critical section as short as possible:
int global_counter = 0;
std::mutex gctr_mx;
void I_run_many_times_concurrently()
{
int a = expensive_computation();
{
std::lock_guard<std::mutex> _(gctr_mx);
global_counter += a;
}
expensive_cleanup();
}
The explicit scoping is usually not done for commenting purposes, but I don't see any harm in doing it if you feel it makes your code more readable.
Typical usage is for avoiding name clashes and controlling when the destructors are called.
A pair of curly braces defines a scope. Names declared or defined within a scope are not visible outside that scope, which is why j is not defined at the end. If a name in a scope is the same as a name defined earlier and outside that scope, it hides the outer name.
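A tiny sketch of both rules described above (scope ending and name hiding); the variable names are arbitrary:
void demo()
{
    int x = 1;
    {
        int x = 2;   // hides the outer x inside this block
        // here x == 2
    }
    // here x == 1 again; the inner x no longer exists
    {
        int y = 3;
        ++y;
    }
    // y is not visible here, just like j in the question
}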