Arrow dereferencing p->m is syntactic sugar for (*p).m, which looks as though it might involve two separate memory lookups--one to find the object on the heap and a second to locate the member at its field offset.
This made me question whether there is any performance difference between these two code snippets. Assume classA has 30+ disparate fields of various types which need to be accessed in various orders (not necessarily consecutively or contiguously):
Version 1:
void func(classA* ptr)
{
    std::string s = ptr->field1;
    int i = ptr->field2;
    float f = ptr->field3;
    // etc...
}
Version 2:
void func(classA* ptr)
{
    classA &a = *ptr;
    std::string s = a.field1;
    int i = a.field2;
    float f = a.field3;
    // etc...
}
So my question is whether there is a performance difference (even a very slight one) between these two versions, or whether the compiler is smart enough to make them equivalent (even if the field accesses are interrupted by many lines of other code in between, which I did not show here).
Arrow dereferencing p->m is syntactic sugar for (*p).m
That isn't generally true, but is true in the limited context in which you are asking.
which looks as though it might involve two separate memory lookups--one to find the object on the heap and a second to locate the member at its field offset.
Not at all. The two accesses are one read of the parameter or local variable holding the pointer, and a second to reach the member. But any reasonable optimizer will keep the pointer in a register in the code you showed, so there is no extra access.
But your alternate version also has a local pointer, so no difference anyway (at least in the direction you're asking about):
classA &a = *ptr;
Assuming the whole function is not inlined, and the compiler doesn't otherwise know exactly where ptr points, the reference must be implemented as a pointer. So either the compiler can deduce that it is safe for a to be an alias of *ptr, in which case there is NO difference, or it must make a an alias of *copy_of_ptr, in which case the version using the reference is slower (not faster, as you seem to have expected) by the cost of copying ptr.
even if the field accesses are interrupted by many lines of other code in between, which I did not show here
That moves you toward the interesting case. If the intervening code could change ptr, then obviously the two versions behave differently. But what if a human can see that the intervening code can't change ptr, while the compiler can't see that? Then the two versions are semantically equal, but the compiler doesn't know it, and it may generate slower code for the version you tried to hand-optimize by creating a reference.
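To make that concrete, here is a hypothetical sketch (classA reduced to two fields, and the reassignment invented for illustration) in which intervening code changes ptr, so the two versions really do diverge:

#include <string>

struct classA { std::string field1; int field2; };
classA* g_other; // hypothetical second object, set up elsewhere

void func(classA* ptr)
{
    classA &a = *ptr;            // the reference binds once, to the original object
    std::string s = ptr->field1;
    ptr = g_other;               // intervening code changes ptr
    int i = ptr->field2;         // Version 1 now reads the new object...
    int j = a.field2;            // ...while Version 2 still reads the original
}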
Most (if not all) compilers implement references as pointers under the hood, so I would expect no difference in the generated assembly (apart from a possible copy to initialize the reference--but I would expect the optimizer to eliminate even that).
In general, this sort of micro-optimization is not worth it. It is always preferable to concentrate on clear and correct code. It is certainly not worth attempting until you have measured where the bottleneck is.
Related
I have the following class structure -
typedef std::shared_ptr<Inner> InnerPtr;

class Outer {
private:
    const InnerPtr ptr_;
public:
    Outer(const InnerPtr& ptr) : ptr_(ptr) {}
    int calculate(int id) {
        return ptr_->calculate_other(id);
    }
};
Here, Outer::calculate() is likely to be called many times (on the order of millions).
Since each call to Outer::calculate() dereferences ptr_, I think this has a performance impact compared to using the object directly.
So I came up with this -
typedef std::shared_ptr<Inner> InnerPtr;

class Outer {
private:
    Inner& obj_;         // to reference *ptr_
    const InnerPtr ptr_; // still need to hold a copy of the smart pointer
                         // to ensure obj_ stays alive for the lifetime of Outer
public:
    Outer(const InnerPtr& ptr) : obj_(*ptr), ptr_(ptr) {}
    int calculate(int id) {
        return obj_.calculate_other(id);
    }
};
I'm not sure if this is the best solution. I'm looking for suggestions to improve the calculate function.
NOTE: Assume that the ptr passed to Outer() is a non-null shared pointer.
A reference is typically implemented as a pointer at the hardware level in C++, when it cannot be removed entirely (because logically it is an alias). Removing a reference from the body of a class is very difficult; I am unaware of a compiler that tries (other than possibly eliminating the class instance entirely in some cases).
The difference between following an Inner const& and a shared_ptr<Inner const> const, in terms of generated assembly, is going to be basically nothing.
Now, you misunderstand how const works; const InnerPtr is a const pointer to a non-const value, while const Inner& is a reference to a const value. But I expect that is not intentional.
Now, a Foo& is analogous to a Foo* const, so that top level const could lead to some optimizations somehow; you are not allowed to change where either point. And that could lead to a compiler being able to prove that a Foo& refers to the same object in two bits of code and not be able to prove that a Foo* does.
However, in your example, you had a const shared_ptr, which also has top-level const.
As a general rule, premature optimization is the root of all evil. But so is premature pessimization (the opposite of optimization). Following a smart pointer is not premature pessimization, and your reference optimization is an example of premature optimization. You should only consider making this kind of change when you have already identified a performance bottleneck there.
You are honestly more likely to run into problems caused by the extraneous reference-count increase in the constructor when you build the Outer class from an rvalue shared pointer than by the problem you identify. What's more, that extra atomic reference count is going to be a diffuse slowdown: atomic-operation synchronization doesn't cause most of its slowdown at the point where the code runs, but rather by thrashing the CPU cache and making the rest of the program slower.
So change the constructor to this:
explicit Outer(InnerPtr ptr) : ptr_(std::move(ptr)) {}
to remove the pessimization; now:
Outer outer( GenerateInnerPtr() );
does one fewer reference-count increment and decrement than it did before, and no other case results in more increments/decrements.
I'm not sure if this is the best solution.
Probably not. As long as you use an optimiser, indirecting through a reference is just as expensive as indirecting through a shared pointer, so you aren't necessarily saving anything, while you're paying by making the class non-assignable and simply larger, since it stores the address twice.
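A minimal sketch of those two costs (Inner reduced to an empty stand-in; the sizes noted are typical, not guaranteed):

#include <memory>

struct Inner {};

struct WithRef {                 // the question's layout
    Inner& obj_;                 // reference member: implicitly deletes
    std::shared_ptr<Inner> ptr_; // the copy-assignment operator
};

struct WithoutRef {              // the plain layout
    std::shared_ptr<Inner> ptr_; // assignable, and typically one pointer smaller
};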
First:
Premature optimization is the root of all evil (or at least most of it) in programming
is an important quote, because people very often have false ideas about what makes a program slow and what makes it fast.
Second:
Let's assume the indirection makes your program slower. Then the biggest question is: do you need the indirection at all? Your question lacks a clear answer to this. Is lifetime management or polymorphism the reason for it? I strongly suspect that if you can make the optimization you had in mind, you can eliminate the indirection altogether.
Third:
Much more important, I would assume, is whether the compiler is able to inline all the calculate calls. So it also matters what the definition of the Inner class looks like. That's one of the reasons to always post a minimal complete example.
Fourth:
If you still really care about performance, profile your program; you might also inspect the generated code on https://godbolt.org/ .
Update, Fifth:
I forgot to mention: algorithmic complexity is usually the bigger issue with bad performance, meaning: Do you loop several times? Are your loops nested? Can you calculate values up front, when creating the object? Can you cache them? (A sketch of that last idea follows.)
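For illustration, a hypothetical sketch of computing once at construction (the stand-in Inner body and the fixed id are assumptions, not the asker's real code):

#include <memory>

struct Inner {
    int calculate_other(int id) const { return id * 2; } // stand-in body
};
typedef std::shared_ptr<Inner> InnerPtr;

class Outer {
    int cached_; // hypothetical: result computed once, up front
public:
    Outer(const InnerPtr& ptr, int id) : cached_(ptr->calculate_other(id)) {}
    int calculate() const { return cached_; } // no pointer chase per call
};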
There is no improvement from caching a reference to the Inner object, since on all existing implementations references are implemented as pointers. Basically, you just duplicated the pointer that is already part of shared_ptr and increased the size of the Outer object.
Unless you are able to cache Inner directly as a member of Outer, or at least some important subset of its data, there is little you can do. You have to rethink the design of the Inner and Outer classes and whether it is possible to merge them.
Another point to think about is whether it is possible to optimize the memory access patterns of your code. For example, if there is a list of Outer objects over which you call calculate, the accesses to Inner objects may be unpredictable if each Inner object is allocated separately. You may improve performance if these objects are allocated linearly in a contiguous buffer (e.g. in a vector, or by using a custom memory allocator), as sketched below.
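A sketch of that idea, with the Inner objects kept in one contiguous buffer (the stand-in Inner body and names are illustrative):

#include <cstddef>
#include <vector>

struct Inner {
    int calculate_other(int id) const { return id; } // stand-in body
};

int sum_all(const std::vector<Inner>& inners)
{
    int total = 0;
    // The Inner objects sit back-to-back in one allocation, so this loop
    // walks memory linearly instead of chasing scattered heap pointers.
    for (std::size_t i = 0; i < inners.size(); ++i)
        total += inners[i].calculate_other(static_cast<int>(i));
    return total;
}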
And of course, as others have recommended, you should profile your code to see whether this code actually is a bottleneck and whether any changes do in fact improve performance.
The new keyword hands you back a pointer to the object created, which means you keep having to dereference it - I'm just afraid performance may suffer.
E.g. a common situation I'm facing:
class cls {
    obj *x; ...
};

// Later, in some member function:
x = new obj(...);
for (i ...) x->bar[i] = x->foo(i + x->baz); // much dereferencing
I'm not overly keen on reference variables either as I have many *x's (e.g. *x, *y, *z, ...) and having to write &x_ref = *x, &y_ref = *y, ... at the start of every function quickly becomes tiresome and verbose.
Indeed, is it better to do:
class cls {
    obj x; ... // not a pointer
};

x_ptr = new obj(...);
x = *x_ptr; // then work with x, not the pointer
So what's the standard way to work with variables created by new?
There's no other way to work with objects created by new. The location of the unnamed object created by new is always a run-time value. This immediately means that each and every access to such an object will always unconditionally require dereferencing. There's no way around it. That is what "dereferencing" actually is, by definition - accessing through a run-time address.
Your attempts to "replace" pointers with references by doing &x_ref = *x at the beginning of the function are meaningless. They achieve absolutely nothing. References in this context are just syntactic sugar. They might reduce the number of * operators in your source code (and might increase the number of & operators), but they will not affect the number of physical dereferences in the machine code. They will lead to absolutely the same machine code containing absolutely the same amount of physical dereferencing and absolutely the same performance.
Note that in contexts where dereferencing occurs repeatedly many times, a smart compiler might (and will) actually read and store the target address in a CPU register, instead of re-reading it each time from memory. Accessing data through an address stored in a CPU register is always the fastest, i.e. it is even faster than accessing data through compile-time address embedded into the CPU instruction. For this reason, repetitive dereferencing of manageable complexity might not have any negative impact on performance. This, of course, depends significantly on the quality of the compiler.
In situations when you observe significant negative impact on performance from repetitive dereferencing, you might try to cache the target value in a local buffer, use the local buffer for all calculations and then, when the result is ready, store it through the original pointer. For example, if you have a function that repeatedly accesses (reads and/or writes) data through a pointer int *px, you might want to cache the data in an ordinary local variable x
int x = *px;
work with x throughout the entire function and at the end do
*px = x;
Needless to say, this only makes sense when the performance impact from copying the object is low. And of course, you have to be careful with such techniques in aliased situations, since in this case the value of *px is not maintained continuously. (Note again, that in this case we use an ordinary variable x, not a reference. Your attempts to replace single-level pointers with references achieve nothing at all.)
Again, this sort of "data caching" optimization can also be performed implicitly by the compiler, assuming the compiler has a good understanding of the data aliasing relationships present in the code. And this is where the C99-style restrict keyword can help it a lot. But that's a different topic.
In any case, there's no "standard" way to do that. The best approach depends critically on your knowledge of data flow relationships that exist in each specific piece of your code.
Instantiate the object without the new keyword, like this:
obj x;
Or if your constructor for obj takes parameters:
obj x(...);
This will give you an object instead of a pointer thereto.
You have to decide whether you want to allocate your objects on the heap or on the stack. That's entirely your decision, based on your requirements, and there is no performance degradation from dereferencing. You may allocate your cls on the heap so that it outlives the current scope, and keep the obj instances as direct members:

class cls {
    obj x; // obj's default constructor will be called
};

and if obj doesn't have a default constructor, you need to call the appropriate constructor in cls's member initializer list, as sketched below.
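A minimal sketch of that last point (the int argument to obj's constructor is hypothetical):

struct obj {
    explicit obj(int n) : n_(n) {} // no default constructor
    int n_;
};

class cls {
    obj x;
public:
    cls() : x(42) {} // call the appropriate obj constructor here,
                     // in cls's member initializer list
};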
Having
struct Person {
    std::string name;
};
Person* p = ...
Assume that no operators are overloaded.
Which is more efficient (if any) ?
(*p).name vs. p->name
Somewhere in the back of my head I hear bells ringing that the * dereference operator may create a temporary copy of an object; is this true?
The background of this question are cases like this:
Person& Person::someFunction(){
    ...
    return *this;
}
and I began to wonder, if changing the result to Person* and the last line to simply return this would make any difference (in performance)?
There's no difference. Even the standard says the two are equivalent, and if there's any compiler out there that doesn't generate the same binary for both versions, it's a bad one.
When you return a reference, that's exactly the same as passing back a pointer, pointer semantics excluded.
You pass back a sizeof(void*)-sized value, not a sizeof(yourClass)-sized one.
So when you do that:
Person& Person::someFunction(){
    ...
    return *this;
}
You return a reference, and that reference has the same intrinsic size as a pointer, so there's no runtime difference.
Same goes for your use of (*p).name, but in that case you create an lvalue, which then has the same semantics as a reference.
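To make the equivalence concrete, here is a small sketch (the member-function names are invented for illustration) showing that only the call-site syntax differs:

#include <string>

struct Person {
    std::string name;
    Person& viaRef() { return *this; } // returns a pointer-sized handle
    Person* viaPtr() { return this; }  // the same thing at machine level
};

int main()
{
    Person a;
    a.viaRef().name = "x";  // chain with . syntax
    a.viaPtr()->name = "y"; // chain with -> syntax
}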
Yes, (*x).y is much harder to read and type, so you are much better off using x->y - but other than typing efficiency, there is absolutely no difference. The compiler still needs to read the value of x and then add the offset of y, whichever form you use [assuming there are no funny classes involved that overload operator-> and operator*, of course].
There is definitely no extra object created when (*x) is referenced. The value of the pointer is loaded into a register in the processor [1]. That's it.
Returning a reference is typically more efficient, as it returns a pointer (in disguise) to the object, rather than making a copy of the object. For objects that are bigger than the size of a pointer, this is typically a win.
[1] Yes, we can have a C++ compiler for a processor that doesn't have registers. I know of at least one processor from Rank-Xerox that I saw in about 1984 which didn't have registers; it was a dedicated Lisp processor, and it just had a stack for Lisp objects... But such processors are far from common in today's world. If someone is working on a processor that doesn't have registers, please don't downvote my answer simply because I don't cover that option. I'm trying to keep the answer simple.
Any good compiler will produce the same results. You can answer this yourself: compile both snippets to assembly and compare the generated code.
I have a pointer int* p, and do some operations in a loop. I do not modify the memory, just read. If I add const to the pointer (both cases, const int* p, and int* const p), can it help a compiler to optimize the code?
I know other merits of const, like safety or self-documentation, I ask about this particular case. Rephrasing the question: can const give the compiler any useful (for optimization) information, ever?
While this is obviously specific to the implementation, it is hard to see how changing a pointer from int* to int const* could ever provide any additional information that the compiler would not otherwise have known.
In both cases, the value pointed to can change during the execution of the loop.
Therefore it probably will not help the compiler optimize the code.
No. Using const like that will not provide the compiler with any information that can be used for optimization.
Theoretically, for it to be a help to your compiler, the optimizer must be able to prove that nobody will ever use const_cast on your const pointer yet be unable to otherwise prove that the variable is never written to. Such a situation is highly unlikely.
Herb Sutter covers this in more depth in one of his Guru of the Week columns.
It can help or it can make no difference or it can make it worse. The only way to know is to try both and inspect the emitted machine code.
Modern compilers are very smart, so they can often deduce that memory is unchanged without any qualifiers (or they can deduce that many other optimizations are possible without the code being written in a manner that is easier to analyze); yet they are rather complex, have a lot of deficiencies, and often can't optimize every possible thing at every opportunity.
I think the compiler can't do much in your scenario. The fact that your pointer is declared as const int* const p doesn't guarantee that the memory can't be changed externally, e.g. by another thread. Therefore the compiler must generate code that reads the memory value on each iteration of your loop.
But if you are not going to write to the memory location and you know that no other piece of code will, then you can create a local variable and use it similar to this:
const int * p = ...
...
int val = *p;

/* use the value in a loop */
for (i = 0; i < BAZILLION; i++)
{
    use_value(val);
}
Not only do you help potential readers of your code see that val is not changed in the loop, but you also give the compiler a chance to optimize (load val into a register, for instance).
Using const is, as everyone else has said, unlikely to help the compiler optimize your loop.
It may, however, help optimise code outside the loop, or at the site of a call to a const-qualified method, or to a function taking const arguments.
This is likely to depend on whether the compiler can prove it's allowed to eliminate redundant loads, move them around, or cache calculated values rather than re-calculating them.
The only way to prove this is still to profile and/or check the assembly, but that's where you should probably be looking.
You don't say which compiler you are using. But if you are both reading and writing through pointers, you could benefit from "restrict" or similar. The compiler does not know whether your pointers alias the same memory, so any store often forces other values to be reloaded. "restrict" tells the compiler that no aliasing through the pointer is happening, so it can keep using values loaded before a subsequent write. Another way to avoid the aliasing issue is to load your values into local variables; then the compiler is not forced to reload them after a write. A sketch follows.
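A short sketch of the restrict idea. Note that standard C++ has no restrict keyword; this uses the common __restrict__ extension (GCC/Clang; MSVC spells it __restrict), and the function is invented for illustration:

void accumulate(int* __restrict__ out, const int* __restrict__ in, int n)
{
    // Because out is promised not to alias in, the compiler may keep
    // *out in a register instead of storing and reloading it on every
    // iteration of the loop.
    for (int i = 0; i < n; ++i)
        *out += in[i];
}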
I'm writing something performance-critical and wanted to know if it could make a difference if I use:
int test( int a, int b, int c )
{
    // Do millions of calculations with a, b, c
}

or

class myStorage
{
public:
    int a, b, c;
};

int test( myStorage values )
{
    // Do millions of calculations with values.a, values.b, values.c
}
Does this basically result in similar code? Is there an extra overhead of accessing the class members?
I'm sure that this is clear to an expert in C++ so I won't try and write an unrealistic benchmark for it right now
The compiler will probably equalize them. If it has any brains at all, it will copy values.a, values.b, and values.c into local variables or registers, which is also what happens in the simple case.
The relevant maxims:
Premature optimization is the root of much evil.
Write it so you can read it at 1am six months from now and still understand what you were trying to do.
Most of the time significant optimization comes from restructuring your algorithm, not small changes in how variables are accessed. Yes, I know there are exceptions, but this probably isn't one of them.
This sounds like premature optimization.
That being said, there are some differences and opportunities but they will affect multiple calls to the function rather than performance in the function.
First of all, in the second option you may want to pass myStorage by constant reference, as sketched below.
As a result, your compiled code will likely push a single value onto the stack (a pointer through which to access the container), rather than three separate values. If you have additional fields (beyond a-c), passing myStorage by value might actually cost you more, because you will invoke the copy constructor and essentially copy all the additional fields. All of these are per-call costs, not costs within the function.
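That suggestion would look like this (a sketch reusing the question's myStorage; the body is just illustrative):

int test( const myStorage& values )
{
    // only a pointer-sized value is passed in; no fields are copied
    return values.a + values.b + values.c;
}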
If you are doing tons of calculations with a b and c within the function, then it really doesn't matter how you transfer or access them. If you passed by reference, the initial cost might be slightly more (since your object, if passed by reference, could be on the heap rather than the stack), but once accessed for the first time, caching and registers on your machine will probably mean low-cost access. If you have passed your object by value, then it really doesn't matter, since even initially, the values will be nearby on the stack.
For the code you provided, if these are the only fields, there will likely not be a difference. The "values.variable" access is merely interpreted as an offset in the stack, not as "look up one object, then access another address".
Of course, if you don't buy these arguments, just define local variables as the first step in your function, copy the values from the object, and then use those variables. If you really use them multiple times, the initial cost of the copy won't matter :)
No, your CPU would cache the variables you use over and over again.
I think there is some overhead, but it may not be much. The address of the object is stored on the stack and points to the object on the heap; you then access the instance variables through it.
If you store an int directly on the stack, access is faster, because the value is already on the stack and the machine just fetches it from there to do the calculation :).
It also depends on whether you copy the class's member values onto the stack. If, inside test(), you do:
int a = objA.a;
int b = objA.b;
int c = objA.c;
then I think the performance would be almost the same.
If you're really writing performance-critical code and you think one version should be faster than the other, write both versions and time them (with the code compiled with the right optimization switches). You may even want to look at the generated assembly. A lot of quite subtle things can affect the speed of a code snippet, like register spilling. A minimal timing harness is sketched below.
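For example (the loop body is a placeholder for the code under test; compile with optimizations, e.g. -O2, and repeat runs to reduce noise):

#include <chrono>
#include <cstdio>

int main()
{
    auto start = std::chrono::steady_clock::now();

    volatile long long sum = 0;   // volatile keeps the loop from being optimized away
    for (int i = 0; i < 100000000; ++i)
        sum = sum + i;            // placeholder for the code under test

    auto stop = std::chrono::steady_clock::now();
    long long us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    std::printf("elapsed: %lld microseconds\n", us);
}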
you can also start your function with
int & a = values.a;
int & b = values.b;
although the compiler should be smart enough to do that for you behind the scenes. In general I prefer to pass around structures or classes; it often makes it clearer what the function is meant to do, and you don't have to change the signature every time you want to take another parameter into account.
As with your previous, similar question: it depends on the compiler and platform. If there is any difference at all, it will be very small.
Both values on the stack and values in an object are commonly accessed using a pointer (the stack pointer, or the this pointer) and some offset (the location in the function's stack frame, or the location inside the class).
Here are some cases where it might make a difference:
Depending on your platform, the stack pointer might be held in a CPU register, whereas the this pointer might not. If this is the case, accessing this (which is presumably on the stack) would require an extra memory lookup.
Memory locality might be different. If the object in memory is larger than one cache line, the fields are spread out over multiple cache lines. Bringing only the relevant values together in a stack frame might improve cache efficiency.
Do note, however, how often I used the word "might" here. The only way to be sure is to measure it.
If you can't profile the program, print out the assembly language for the code fragments.
In general, less assembly code means fewer instructions to execute, which speeds up performance. This is a technique for getting a rough estimate of performance when a profiler is not available.
An assembly language listing will allow you to see differences, if any, between implementations.