C++ How does a vector of pointers affect performance?

I am wondering how a std::vector of pointers to objects affects the performance of a program, vs. using a std::vector directly containing the objects. Specifically I am referring to the speed of the program.
I was taught to use std::vector over other STL containers such as std::list for its speed, due to all of its data being stored contiguously in memory rather than being fragmented. This means that iterating over the elements is fast. However, if my vector contains pointers to the objects, then the objects themselves can still be stored anywhere in memory and only the pointers are stored contiguously. I am wondering how this would affect the performance of a program when it comes to iterating over the vector and accessing the objects.
My current project design uses a vector of pointers so that I can take advantage of virtual functions; however, I'm unsure whether this is worth the speed hit I may encounter when my vector becomes very large. Thanks for your help!

If you need the polymorphism, as people have said, you should store pointers to the base. If, later, you decide this code is hot and needs its CPU cache usage optimised, you can do that, say, by making the objects fit cleanly in cache lines and/or by using a custom allocator to ensure locality of the dereferenced data.
Slicing is what happens when you store objects by value as Base and copy-construct or assign a Derived into them: the Derived gets sliced. The copy constructor or assignment only takes a Base and will ignore any data in Derived, because there isn't enough space allocated in a Base to hold the full size of a Derived. For example, if Base is 8 bytes and Derived is 16, there's only enough room for Base's 8 bytes in the destination value, even if you provided a copy constructor or assignment that explicitly took a Derived.
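For instance, a minimal sketch of slicing in action (Base and Derived here are illustrative types only, not from the question):
#include <iostream>
#include <vector>

struct Base {
    int a = 1;
    virtual ~Base() = default;
    virtual int value() const { return a; }
};

struct Derived : Base {
    int b = 2;                       // extra data that gets sliced away
    int value() const override { return a + b; }
};

int main() {
    std::vector<Base> by_value;
    by_value.push_back(Derived{});   // copies only the Base subobject: sliced
    std::cout << by_value[0].value() << '\n';     // prints 1, not 3

    std::vector<Base*> by_pointer;
    Derived d;
    by_pointer.push_back(&d);        // no slicing, virtual dispatch works
    std::cout << by_pointer[0]->value() << '\n';  // prints 3
}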
I should say it's really not worth thinking about data cache locality if you're using virtual dispatch heavily in ways that won't be elided by the optimiser. An instruction cache miss is far more devastating than a data cache miss, and virtual dispatch can cause instruction cache misses because the CPU has to look up the vtable pointer before it knows which function to load into the instruction cache, so it can't preemptively load them.
CPUs like to preload as much data as they can into their caches. If you load from an address, an entire cache line (~64 bytes) around it will be pulled into the cache, and the prefetcher will often also load the cache lines before and after it, which is why people are so keen on data locality.
So in your vector-of-pointers scenario, when you load the first pointer you'll get lots of pointers into the cache at once, but loading through each pointer will trigger a cache miss and pull in the data around that object. If your actual particles are 16 bytes and local to each other, you won't lose much beyond this; if they're scattered all over the heap and massive, you'll churn the cache on each iteration and only be relatively OK while working within a single particle.
Traditionally, particle systems tend to be very hot and like to pack data tightly. It's common to see 16-byte Plain Old Data particles which you iterate over linearly with very predictable branching, meaning you can generally rely on 4 particles per cache line and have the prefetcher stay well ahead of your code.
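As a rough sketch of that kind of layout (the Particle fields are an assumption for illustration):
#include <vector>

// A 16-byte POD particle: 4 particles fit in one 64-byte cache line.
struct Particle {
    float x, y;     // position
    float vx, vy;   // velocity
};
static_assert(sizeof(Particle) == 16, "expected tight 16-byte packing");

void integrate(std::vector<Particle>& particles, float dt) {
    // Linear pass over contiguous data: predictable branching,
    // and the hardware prefetcher can stay ahead of the loop.
    for (Particle& p : particles) {
        p.x += p.vx * dt;
        p.y += p.vy * dt;
    }
}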
I should also say that CPU caches are CPU-dependent and I'm focusing on Intel x86. ARM, for example, tends to be quite a bit behind Intel: the pipeline is less complex and the prefetcher less capable, so cache misses can be less devastating.

Related

Pointer versus non-pointer

I read in many places, including Effective C++, that it is better to store data on the stack and not as a pointer to the data.
I can understand doing this with small objects, because the number of new and delete calls is also reduced, which reduces the chance of a memory leak. Also, the pointer can take more space than the object itself.
But with large objects, where copying them will be expensive, is it not better to store them in a smart pointer?
Because with many operations on a large object there will be at least a few copies, which are very expensive (I am not counting getters and setters).
Let's focus purely on efficiency. There's no one-size-fits-all, unfortunately. It depends on what you are optimizing for. There's a saying, always optimize the common case. But what is the common case? Sometimes the answer lies in understanding your software's design inside out. Sometimes it's unknowable even at the high level in advance because your users will discover new ways to use it that you didn't anticipate. Sometimes you will extend the design and reveal new common cases. So optimization, but especially micro-optimization, is almost always best applied in hindsight, based on both this user-end knowledge and with a profiler in your hand.
The few times you can usually have really good foresight about the common case is when your design is forcing it rather than responding to it. For example, if you are designing a class like std::deque, then you're forcing the common case write usage to be push_fronts and push_backs rather than insertions to the middle, so the requirements give you decent foresight as to what to optimize. The common case is embedded into the design, and there's no way the design would ever want to be any different. For higher-level designs, you're usually not so lucky. And even in the cases where you know the broad common case in advance, knowing the micro-level instructions that cause slowdowns is too often incorrectly guessed, even by experts, without a profiler. So the first thing any developer should be interested in when thinking about efficiency is a profiler.
But here are some tips if you do run into a hotspot with a profiler.
Memory Access
Most of the time, the biggest micro-level hotspots, if you have any, will relate to memory access. So if you have a large object that is just one contiguous block and all its members are accessed in some tight loop, the contiguity will aid performance.
For example, if you have an array of 4-component mathematical vectors you're sequentially accessing in a tight algorithm, you'll generally fare far, far better if they're contiguous like so:
x1,y1,z1,w1,x2,y2,z2,w2,...,xn,yn,zn,wn
... with a single-block structure like this (all in one contiguous block):
x
y
z
w
This is because the machine will fetch this data into a cache line which will have the adjacent vectors' data inside of it when it's all tightly packed and contiguous in memory like this.
You can very quickly slow down the algorithm if you use something like std::vector here to represent each individual 4-component mathematical vector, because every single one then stores its components in a potentially completely different place in memory. Now you could have a cache miss with each vector. In addition, you're paying for extra members since it's a variable-sized container.
std::vector is a "2-block" object that often looks like this when we use it for a mathematical 4-vector:
size
capacity
ptr --> [x y z w] another block
It also stores an allocator but I'll omit that for simplicity.
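As a hedged illustration of the two layouts (Vec4 and the summing loops are assumptions for the sketch, not code from the question):
#include <vector>

// One contiguous block per element: N Vec4s sit back to back in memory.
struct Vec4 {
    float x, y, z, w;
};

float sum_contiguous(const std::vector<Vec4>& vs) {
    float s = 0.0f;
    for (const Vec4& v : vs)          // sequential, cache-friendly access
        s += v.x + v.y + v.z + v.w;
    return s;
}

// "2-block" layout: each element is a std::vector whose data lives in a
// separate heap block, so every element can be a potential cache miss.
float sum_indirect(const std::vector<std::vector<float>>& vs) {
    float s = 0.0f;
    for (const std::vector<float>& v : vs)
        s += v[0] + v[1] + v[2] + v[3];
    return s;
}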
On the flip side, if you have a big "1-block" object where only some of its members get accessed in those tight, performance-critical loops, then it might be better to make it into a "2-block" structure. Say you have some Vertex structure where the most-accessed part of it is the x/y/z position but it also has a less commonly-accessed list of adjacent vertices. In that case, it might be better to hoist the adjacency data out and store it elsewhere in memory, perhaps even completely outside the Vertex class itself (or keep merely a pointer or index to it). Your common-case, performance-critical algorithms that don't touch that data will then fit more vertices into a single cache line, since each vertex is smaller and merely refers to the rarely-accessed data stored elsewhere.
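A minimal sketch of that hot/cold split (the field names are assumptions for illustration):
#include <cstdint>
#include <vector>

// Hot data only: small vertices pack densely into cache lines.
struct Vertex {
    float x, y, z;
    std::uint32_t adjacency_index;   // refers into the cold storage below
};

// Cold data kept out of the hot array so it doesn't dilute cache lines.
struct VertexAdjacency {
    std::vector<std::uint32_t> neighbors;
};

struct Mesh {
    std::vector<Vertex> vertices;            // touched by tight loops
    std::vector<VertexAdjacency> adjacency;  // touched only occasionally
};

void translate(Mesh& m, float dx, float dy, float dz) {
    for (Vertex& v : m.vertices) {   // never pulls adjacency data into cache
        v.x += dx;
        v.y += dy;
        v.z += dz;
    }
}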
Creation/Destruction Overhead
When rapid creation and destruction of objects is a concern, you can also do better to create each object in a contiguous memory block. The fewer separate memory blocks per object, the faster it'll generally go (since whether or not this stuff is going on the heap or stack, there will be fewer blocks to allocate/deallocate).
Free Store/Heap Overhead
So far I've been talking more about contiguity than stack vs. heap, and it's because stack vs. heap relates more to client-side usage of an object rather than an object's design. When you're designing the representation of an object, you don't know whether it's going on the stack or heap. What you do know is whether it's going to be fully contiguous (1 block) or not (multiple blocks).
But naturally, if it's not contiguous, then at least part of it is going on the heap, and heap allocations and deallocations can be enormously expensive when you compare their cost to the hardware stack. However, you can often mitigate this overhead with efficient O(1) fixed allocators. They serve a more specialized purpose than malloc or free, but I would suggest concerning yourself less with the stack vs. heap distinction and more with the contiguity of an object's memory layout.
Copy/Move Overhead
Last but not least, if you are copying/swapping/moving objects a lot, then the smaller they are, the cheaper this is going to be. So you might want to sort pointers or indices to big objects sometimes, for example, instead of the original object, since even a move constructor for a type T where sizeof(T) is a large number is going to be expensive to copy/move.
So move-constructing something like the "2-block" std::vector here, which is not fully contiguous (its dynamic contents are contiguous, but they live in a separate block) and stores its bulky data in that separate memory block, is actually going to be cheaper than move-constructing a "1-block" 4x4 matrix that is contiguous. It's because there's no such thing as a cheap shallow copy if the object is just one big memory block rather than a tiny one with a pointer to another. One of the funny trends that arises is that objects which are cheap to copy gain little from moving, while ones which are very expensive to copy are often cheap to move.
However, I would not let copy/move overhead drive your object implementation choices, because the client can always add a level of indirection if they need it for a particular use case that taxes copies and moves. When you're designing for memory layout-type micro-efficiency, the first thing to focus on is contiguity.
Optimization
The rule for optimization is this: if you have no code or no tests or no profiling measurements, don't do it. As others have wisely suggested, your number one concern is always productivity (which includes maintainability, safety, clarity, etc). So instead of trapping yourself in hypothetical what-if scenarios, the first thing to do is to write the code, measure it twice, and change it if you really have to do so. It's better to focus on how to design your interfaces appropriately so that if you do have to change anything, it'll just affect one local source file.
The reality is that this is a micro-optimisation. You should write the code to make it readable, maintainable and robust. If you worry about speed, you use a profiling tool to measure the speed. You find things that take more time than they should, and then and only then do you worry about speed optimisation.
An object should obviously only exist once. If you make multiple copies of an object that is expensive to copy you are wasting time. You also have different copies of the same object, which is in itself not a good thing.
"Move semantics" avoids expensive copying in cases where you didn't really want to copy anything but just move an object from here to there. Google for it; it is quite an important thing to understand.
What you said is essentially correct. However, move semantics alleviate the concern about object copying in a large number of cases.
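A small hedged sketch of what moving buys you here (LargeObject is a made-up type):
#include <string>
#include <utility>
#include <vector>

struct LargeObject {
    std::vector<double> samples;   // bulky heap-allocated payload
    std::string name;
};

int main() {
    LargeObject a;
    a.samples.resize(1'000'000);

    std::vector<LargeObject> store;
    store.push_back(std::move(a));  // steals a's buffers: no element-wise copy
    // 'a' is left in a valid but unspecified state after the move.
}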

Performance of container of objects vs performance of container of pointers

class C { ... };
std::vector<C> vc;
std::vector<C*> pvc;
std::vector<std::unique_ptr<C>> upvc;
Depending on the size of C, either the approach storing by value or the approach storing by pointer will be more efficient.
Is it possible to approximately know what this size is (on both a 32 and 64 bit platform)?
Yes, it is possible - benchmark it. Due to how CPU caches work these days, things are not simple anymore.
Check out this lecture about linked lists by Bjarne Stroustrup:
https://www.youtube.com/watch?v=YQs6IC-vgmo
Here is an excellent lecture by Scott Meyers about CPU caches: https://www.youtube.com/watch?v=WDIkqP4JbkE
Let's look at the details of each example before drawing any conclusions.
Vector of Objects
A vector of objects has an initial performance hit: when an object is added to the vector, it makes a copy. The vector will also make copies when it needs to expand the reserved memory. Larger objects take more time to copy, as do complex or compound objects.
Accessing the objects is very efficient - only one dereference. If your vector can fit inside a processor's data cache, this will be very efficient.
Vector of Raw Pointers
This may have an initialization performance hit: if the objects live in dynamic memory, that memory must be allocated first.
Copying a pointer into a vector is not dependent on the object size. This may be a performance savings depending on the object size.
Accessing the objects takes a performance hit. There are two dereferences before you get to the object, and most processors don't follow pointers when loading their data cache. This may be a performance hit because the processor may have to reload the data cache when dereferencing the pointer to the object.
Vector of Smart Pointers
A little bit more costly in performance than a raw pointer. However, the items will automatically be deleted when the vector is destructed. With raw pointers, you must delete the objects before the vector is destructed, or a memory leak is created.
Summary
The safest version is to have copies in the vector, but has performance hits depending on the size of the object and the frequency of reallocating the reserved memory area. A vector of pointers takes performance hits because of the double dereferencing, but doesn't incur extra performance hits when copying because pointers are a consistent size. A vector of smart pointers may take additional performance hits compared to a vector of raw pointers.
The real truth can be found by profiling the code. The performance savings of one data structure versus another may disappear when waiting for I/O operations, such as networking or file I/O.
Operations with the data structures may need to be performed a huge number of times in order for the savings to be significant. For example, if the difference between the worst-performing data structure and the best is 10 nanoseconds, you will need to perform the operation at least 1E+6 times in order for the savings to add up to anything noticeable. If a second is significant, expect to access the data structures many more times (1E+9).
I suggest picking one data structure and moving on. Your time developing the code is worth more than the time that the program runs. Safety and Robustness are also more important. An unsafe program will consume more of your time fixing issues than a safe and robust version.
For a Plain Old Data (POD) type, a vector of that type is always more efficient than a vector of pointers to that type at least until sizeof(POD) > sizeof(POD*).
Almost always, the same is true for a POD type at least until sizeof(POD) > 2 * sizeof(POD*) due to superior memory locality and lower total memory usage compared to when you are dynamically allocating the objects at which to be pointed.
This kind of analysis will hold true up until sizeof(POD) crosses some threshold for your architecture, compiler and usage that you would need to discover experimentally through benchmarking. The above only puts lower bounds on that size for POD types.
It is difficult to say anything definitive about all non-POD types as their operations (e.g. - default constructor, copy constructors, assignment, etc.) can be as inexpensive as a POD's or arbitrarily more expensive.

What is meaning of locality of data structure?

I was reading the following article,
What Every Programmer Should Know About Compiler Optimizations
There are other important optimizations that are currently beyond the capabilities of any compiler—for example, replacing an inefficient algorithm with an efficient one, or changing the layout of a data structure to improve its locality.
Does that mean that if I change the sequence (layout) of data members in a class, it can affect performance?
So,
class One
{
int data0;
abstract-data-type data1;
};
Differs in performance from,
class One
{
abstract-data-type data0;
int data1;
};
If this is true, what is rule of thumb while defining classes or data structure?
Locality in this sense is speaking mostly to cache locality. Writing data structures and algorithms to operate mostly out of cache makes the algorithm run as fast as it possibly can. Cache locality is one of the reasons quick sort is quick.
For a data structure, you want to keep the parts of your data structure that refer to each other relatively close to each other, to avoid flushing out useful cache lines.
Also, you can rearrange your data structure so that the compiler will use the minimum amount of memory required to hold all the members and still efficiently access them. This helps make sure your data structure consumes the minimum number of cache lines.
A single cache line on a current x86-64 architecture (core i7) is 64 bytes.
I am not an expert on data-structure locality, but it has to do with how you organize your data so that the CPU isn't caching bits of memory from all over RAM, which slows down your program because it is constantly waiting for memory fetches.
For example, a linked list can be scattered all over your memory. However, if you change this into an array of "elements" then they are all in contiguous memory - this saves memory access time if you need to traverse the array all at once (it's just one example).
Additionally:
Also be careful with some of the STL containers; again, I am not 100% sure which are the best, but some of them (e.g. std::list) are quite bad in terms of locality.
Another, perhaps more common, example is an array of pointers, where the pointed-to elements can be scattered around memory.
Of course, you cannot always avoid this easily because you sometimes need to be able to dynamically add/move/insert/delete elements...
Summary:
It basically means take care how you layout your data with regard to memory access.
Sort class members by how frequently you will be accessing them. This maximizes the "hotness" of the cache line that contains the head of your class, increasing the likelihood of it remaining cached. Another factor that you care about is packing - due to alignment, rearranging the order in which members are declared could lead to a reduction in the size of your class which would in turn reduce cache pressure.
(None of them are definitive, of course. These rules of thumb aren't a substitute for profiling.)
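As an illustration of the packing point, here is a hedged sketch of how member order alone can change a struct's size (the exact sizes assume typical alignment rules on a 64-bit platform):
#include <cstdint>

// Members ordered badly: padding is inserted after each small member.
struct Padded {
    std::uint8_t  flag;     // 1 byte + 7 bytes padding
    std::uint64_t id;       // 8 bytes
    std::uint8_t  kind;     // 1 byte + 7 bytes padding
};  // typically sizeof(Padded) == 24

// Same members, largest first: padding shrinks.
struct Packed {
    std::uint64_t id;       // 8 bytes
    std::uint8_t  flag;     // 1 byte
    std::uint8_t  kind;     // 1 byte + 6 bytes padding
};  // typically sizeof(Packed) == 16

static_assert(sizeof(Packed) < sizeof(Padded), "reordering reduced the size");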

std::vector<A> vs std::vector<A*> difference for CPU

Let's discuss a case where I have a huge std::vector. I need to iterate over all elements and call a print function. There are two cases: either I store my objects in the vector, and the objects will be next to each other in memory, or I allocate my objects on the heap and store pointers to them in the vector. In that case the objects can be distributed all over the RAM.
In case copies of the objects are stored in std::vector<A>, when the CPU brings data from RAM into its cache it brings a whole chunk of memory, which contains multiple elements of the vector. So when you iterate over the elements and call a function on each, multiple elements get processed before the CPU has to go back to RAM for the next chunk of data. This is good because the CPU wastes few cycles waiting on memory.
What about the case of std::vector<A*>? When the CPU brings in a chunk of pointers, is it easy for it to obtain the objects behind those pointers? Or does it have to request the objects from RAM before you can call functions on them, causing cache misses and wasted CPU cycles? Is that bad compared with the case above in terms of performance?
At least in a typical case, when the CPU fetches a pointer (or a number of pointers) from memory, it will not automatically fetch the data to which those pointers refer.
So, in the case of the vector of pointers, when you load the item that each of those pointers refers to, you'll typically get a cache miss, and access will be substantially slower than if they were stored contiguously. This is particularly true when/if each item is relatively small, so a number of them could fit in a single cache line (for some level of cache--keep in mind that a current processor will often have two or three levels of cache, each of which might have a different line size).
It may, however, be possible to mitigate this to some degree. You can overload operator new for a class to control allocations of objects of that class. Using this, you can at least keep objects of that class together in memory. That doesn't guarantee that the items in a particular vector will be contiguous, but could improve locality enough to make a noticeable improvement in speed.
Also note that the vector allocates its data via an Allocator object (which defaults to std::allocator<T>, which, in turn, uses new). Although the interface is kind of a mess so it's harder than you'd generally like, you can define an allocator to act differently if you wish. This won't generally have much effect on a single vector, but if (for example) you have a number of vectors (each of fixed size) and want them to use memory next to each other, you could do that via the allocator object.
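For example, here is a very rough sketch of the class-level operator new idea (a trivial bump allocator over a fixed arena; a real pool would need to reuse freed slots and handle alignment more carefully). Particle and the arena size are assumptions for illustration:
#include <cstddef>
#include <new>

// A tiny class-specific pool: Particle objects allocated with 'new' land next
// to each other in a fixed arena. This sketch is deliberately simplistic - it
// never reuses freed slots and falls back to the global heap when full.
struct Particle {
    float x, y, z, w;

    static void* operator new(std::size_t size) {
        if (offset + size <= sizeof(arena)) {
            void* p = arena + offset;
            offset += size;
            return p;
        }
        return ::operator new(size);      // arena exhausted: fall back
    }

    static void operator delete(void* p) {
        auto* cp = static_cast<unsigned char*>(p);
        if (cp < arena || cp >= arena + sizeof(arena))
            ::operator delete(p);         // only free what came from the heap
    }

private:
    static inline alignas(std::max_align_t) unsigned char arena[1 << 16] = {};
    static inline std::size_t offset = 0;
};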
either I store my objects in the vector, and the objects will be next to each other in memory, or I allocate my objects on the heap
Regardless of using std::vector<A> or std::vector<A *>, the inner buffer of the vector will be allocated on the heap. You could, though, use an efficient memory pool to manage allocations and deletions, but you're still going to work with data on the heap.
Is it bad compared with the case above in terms of performance?
In the case of using std::vector<A *> without specialized memory management, you may be lucky and have the allocations land nicely close together in memory, but it is generally better to have the contiguous allocation performed by std::vector<A>. With the pointer vector, reallocating the vector itself is cheaper (since pointers are usually smaller than the structs they point to), but element access will suffer from poor locality.
When the CPU brings in a chunk of pointers, is it easy for it to obtain the objects behind those pointers?
No, it isn't. The CPU doesn't know they're pointers (everything the CPU sees is just a bunch of bits, no semantics involved) until it fetches a dereferencing instruction.
Or does it have to request the objects from RAM before you can call functions on them, causing cache misses and wasted CPU cycles?
That's right. The CPU will try to load the data behind a cached pointer, but it's likely that this data is located somewhere far away from recently accessed memory, so it'd be a cache miss.
Is it bad compared with the case above in terms of performance?
If the only thing you care about is accessing elements, then yes, it's bad. Yet in some cases a vector of pointers is preferable. Namely, if your objects don't support moving (C++11 isn't mainstream yet), then copying the vector becomes more expensive. Even if you don't copy your vector, it may be that you don't know the number of stored elements in advance, so you can't call reserve(n) beforehand. Then all your objects will be copied whenever the vector exhausts its capacity and is forced to resize.
But in the end it depends on the concrete type. If your objects are small (tiny structs, ints or floats) then it's obviously better to work with them by value, because the overhead of the pointers would be too big.
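If in doubt, measure. Here is a rough, hedged micro-benchmark sketch of the two layouts (A and the element count are assumptions; absolute numbers will depend on compiler, flags and hardware):
#include <chrono>
#include <cstddef>
#include <iostream>
#include <memory>
#include <vector>

struct A {
    double v[4] = {1, 2, 3, 4};
};

template <typename F>
static long long time_ms(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
}

int main() {
    const std::size_t n = 5'000'000;
    std::vector<A> by_value(n);
    std::vector<std::unique_ptr<A>> by_pointer;
    by_pointer.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        by_pointer.push_back(std::make_unique<A>());

    double sum1 = 0, sum2 = 0;
    auto ms1 = time_ms([&] { for (const A& a : by_value) sum1 += a.v[0]; });
    auto ms2 = time_ms([&] { for (const auto& p : by_pointer) sum2 += p->v[0]; });

    std::cout << "by value:   " << ms1 << " ms (sum " << sum1 << ")\n"
              << "by pointer: " << ms2 << " ms (sum " << sum2 << ")\n";
}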

Performance impact of objects

I am a beginner programmer with some experience in C and C++ programming. I was assigned by the university to make a physics simulator, so as you might imagine there's a big emphasis on performance.
My questions are the following:
1. How many assembly instructions does an instance data member access through a pointer translate to (e.g. vector->x)?
2. Is it much more than, say, another approach where you simply access the memory through a char* (at the same memory location as variable x), or is it the same?
3. Is there a big impact on performance, compiler-wise, if I use an object to access that memory location or if I just access it directly?
4. Another question regarding the subject would be whether or not accessing heap memory is faster than stack memory access?
C++ is a compiled language. Accessing a memory location through a pointer is the same regardless of whether that's a pointer to an object or a pointer to a char* - it's one instruction in either case. There are a couple of spots where C++ adds overhead, but it always buys you some flexibility. For example, invoking a virtual function requires an extra level of indirection. However, you would need the same indirection anyway if you were to emulate the virtual function with function pointers, or you would spend a comparable number of CPU cycles if you were to emulate it with a switch or a sequence of ifs.
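A hedged sketch of that equivalence, using hypothetical Shape types (not from the question):
#include <iostream>

// Virtual dispatch: the compiler does the indirection for you via a vtable.
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Circle : Shape {
    double r;
    explicit Circle(double r) : r(r) {}
    double area() const override { return 3.14159265 * r * r; }
};

// Hand-rolled equivalent in C style: the same indirection, through a function pointer.
struct CShape {
    double r;
    double (*area)(const CShape*);
};

double circle_area(const CShape* s) { return 3.14159265 * s->r * s->r; }

int main() {
    Circle c{2.0};
    const Shape* s = &c;
    std::cout << s->area() << '\n';        // one indirect call via the vtable

    CShape cs{2.0, &circle_area};
    std::cout << cs.area(&cs) << '\n';     // one indirect call via the function pointer
}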
In general, you should not start optimizing before you know what part of your code to optimize. Usually only a small part of your code is responsible for the bulk of the CPU time used by your program. You do not know what part to optimize until you profile your code. Almost universally it's programmer's code, not the language features of C++, that is responsible for the slowdown. The only way to know for sure is to profile.
1. On x86, a pointer access is typically one extra instruction, above and beyond what you normally need to perform the operation (e.g. y = object->x; would be one load of the address in object, one load of the value of x, and one store to y - in x86 assembler both loads and stores are mov instructions with a memory operand). Sometimes it's "zero" instructions, because the compiler can optimise away the load of the object pointer. In other architectures it really comes down to how the architecture works - some architectures have very limited ways of accessing memory and/or loading addresses into pointers, making it awkward to access data through pointers.
2. Exactly the same number of instructions - this applies for all of them.
3. As #2 - objects in themselves have no impact at all.
4. Heap memory and stack memory are the same kind of memory. One answer says that "stack memory is always in the cache", which is true if it's "near the top of the stack" where all the activity goes on; but if you have an object that was created in main and a pointer to it is passed around through several layers of function calls and then accessed through that pointer, there's an obvious chance that this memory hasn't been used for a long while, so there is no real difference there either. The big difference is that "heap memory is plenty of space, stack is limited", along with "running out of heap allows limited recovery, running out of stack is an immediate end of execution [without tricks that aren't very portable]".
If you look at class as a synonym for struct in C (which, aside from some details, they really are), then you will realize that classes and objects are not really adding any extra "effort" to the generated code.
Of course, used correctly, C++ can make it much easier to write code where you deal with things that are "do this in a very similar way, but subtly differently". In C, you often end up with:
void drawStuff(Shape *shapes, int count)
{
for(int i = 0; i < count; i++)
{
switch (shapes[i].shapeType)
{
case Circle:
... code to draw a circle ...
break;
case Rectangle:
... code to draw a rectangle ...
break;
case Square:
...
break;
case Triangle:
...
break;
}
}
}
In C++, we can do this at object creation time, and your "drawStuff" becomes:
void drawStuff(std::vector<Shape*> shapes)
{
for(auto s : shapes)
{
s->Draw();
}
}
"Look Ma, no switch..." ;)
(Of course, you do need a switch or something to do the selection of which object to create, but once choice is made, assuming your objects and the surrounding architecture are well defined, everything should work "magically" like the above example).
Finally, if performance is IMPORTANT, then run benchmarks, run profiling and check where the code is spending its time. Don't optimise too early (but if you have strict performance criteria for something, keep an eye on it, because deciding in the last week of a project that you need to re-organise your data and code dramatically because performance sucks due to some bad decision is also not the best of ideas!). And don't optimise for individual instructions; look at where the time is spent, and come up with better algorithms WHERE you need to. (In the above example, using const std::vector<Shape*>& shapes will effectively pass a pointer to the shapes vector passed in, instead of copying the entire thing - which may make a difference if there are a few thousand elements in shapes).
It depends on your target architecture. A struct in C (and a class in C++) is just a block of memory containing the members in sequence. An access to such a field through a pointer means adding an offset to the pointer and loading from there. Many architectures allow a load to already specify an offset to the target address, meaning that there is no performance penalty there; but even on extreme RISC machines that don't have that, adding the offset should be so cheap that the load completely shadows it.
Stack and heap memory are really the same thing. Just different areas. Their basic access speed is therefore the same. The main difference is that the stack will most likely already be in the cache no matter what, whereas heap memory might not be if it hasn't been accessed lately.
Variable. On most processors, instructions are translated to something called microcode, similar to how Java bytecode is translated to processor-specific instructions before you run it. How many actual micro-operations you get differs between processor manufacturers and models.
Same as above, it depends on processor internals most of us know little about.
1+2. What you should be asking is how many clock cycles these operations take. On modern platforms the answer is one. It does not matter how many instructions they are; a modern processor has optimizations that make both run in one clock cycle. I will not get into detail here. In other words, when talking about CPU load there is no difference at all.
Here you have the tricky part. While there is no difference in how many clock cycles the instruction itself takes, it needs to have data from memory before it can run - and this can take a HUGE amount of clock cycles. Someone showed a few years ago that even with a very optimized program, an x86 processor spends at least 50% of its time waiting for memory access.
When you use stack memory you are effectively doing the same thing as creating an array of structs: the data is laid out contiguously, and nothing is duplicated per object unless you have virtual functions. If you do sequential access, you will get optimal cache hits. When you use heap memory you will create an array of pointers, and each object will have its own memory. That memory will NOT be contiguous, and therefore sequential access will have a lot of cache misses. Cache misses are what really make your application slower and should be avoided at all cost.
I do not know exactly what you are doing, but in many cases even using objects is much slower than plain arrays. An array of objects is laid out [object1][object2] etc. If you do something like the pseudocode "for each object o { o.setX(o.getX() + 1) }", this means that you only access one variable, and your sequential access will therefore jump over the other variables in each object and get more cache misses than if your X variables were packed in their own array. If you have code that uses all variables in your object, separate plain arrays will not be slower than an array of objects; they will just load the different arrays into different cache blocks.
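A hedged sketch of that difference, with made-up field names (the point is the stride of the hot loop, not the exact layout):
#include <vector>

// Array of structs: each element carries x together with data you may not need.
struct ObjectAoS {
    float x;
    float payload[15];   // 60 bytes of rarely-used data per object
};

void bump_x_aos(std::vector<ObjectAoS>& objs) {
    for (ObjectAoS& o : objs)
        o.x += 1.0f;      // strides 64 bytes per element: roughly one cache line each
}

// Struct of arrays: all x values are packed together.
struct ObjectsSoA {
    std::vector<float> x;
    std::vector<float> payload;   // kept separately, untouched here
};

void bump_x_soa(ObjectsSoA& objs) {
    for (float& v : objs.x)
        v += 1.0f;        // 16 x values per 64-byte cache line
}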
While plain arrays are faster in C++, they are MUCH faster in other languages like Java, where you should NEVER store bulk data in objects - Java objects use more memory and are always stored on the heap. This is the most common mistake that C++ programmers make in Java, and then they complain that Java is slow. However, if they know how to write optimal C++ programs, they store data in arrays, which are about as fast in Java as in C++.
What I usually do is write a class that stores the data in arrays. Even if you use the heap, it's just one object, which becomes about as fast as using the stack. Then I have something like "class myitem { private: int pos; mydata data; public: int getVar1() { return data.getVar1(pos); } }". I am not writing out all of the code here, just illustrating how I do this. When I iterate through it, the iterator class does not actually return a new myitem instance for each item; it increases the pos value and returns the same object. This means you get a nice OO API while you actually only have a few objects and nicely packed arrays. This pattern is the fastest pattern in C++, and if you don't use it in Java you will know pain.
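A hedged, fuller sketch of that pattern; ParticleStore and ParticleView are my own placeholder names, not the answerer's actual code:
#include <cstddef>
#include <vector>

// The bulk data lives in plain, tightly packed arrays.
struct ParticleStore {
    std::vector<float> x;
    std::vector<float> y;
};

// A lightweight "view" object: a nice OO-looking API over the arrays.
class ParticleView {
public:
    ParticleView(ParticleStore& store, std::size_t pos) : store(store), pos(pos) {}
    float getX() const { return store.x[pos]; }
    void setX(float v) { store.x[pos] = v; }
    void advance() { ++pos; }   // the iterator reuses this one object
private:
    ParticleStore& store;
    std::size_t pos;
};

void bump_all_x(ParticleStore& store) {
    ParticleView view(store, 0);                 // a single view object
    for (std::size_t i = 0; i < store.x.size(); ++i, view.advance())
        view.setX(view.getX() + 1.0f);
}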
The fact that we get multiple function calls does not really matter. Modern processors have branch prediction, which hides the cost of the vast majority of those calls: long before the calls actually need to execute, the predictor will have figured out where the chain of calls leads and kept the pipeline fed.
Also, even if all those calls did run, each would take far fewer clock cycles than the memory accesses they require, which, as I pointed out, makes memory layout the only issue that should really bother you.