C++ STL: Array vs Vector: Raw element accessing performance - c++

I'm building an interpreter and as I'm aiming for raw speed this time, every clock cycle matters for me in this (raw) case.
Do you have any experience or information what of the both is faster: Vector or Array?
All that matters is the speed at which I can access an element (opcode fetching); I don't care about inserting, allocation, sorting, etc.
I'm going to go out on a limb now and say:
Arrays are at least a bit faster than vectors in terms of accessing an element i.
It seems really logical to me. With vectors you have all that safety and bookkeeping overhead which doesn't exist for arrays.
(Why) Am I wrong?
No, I can't ignore the performance difference - even if it is so small - I have already optimized and minimized every other part of the VM which executes the opcodes :)

Element access time in a typical implementation of a std::vector is the same as element access time in an ordinary array available through a pointer object (i.e. a run-time pointer value)
std::vector<int> v;
int *pa;
...
v[i];
pa[i];
// Both have the same access time
However, the access time to an element of an array available as an array object is better than both of the above accesses (equivalent to access through a compile-time pointer value)
int a[100];
...
a[i];
// Faster than both of the above
For example, a typical read access to an int array available through a run-time pointer value will look as follows in the compiled code on x86 platform
// pa[i]
mov ecx, pa // read pointer value from memory
mov eax, i
mov <result>, dword ptr [ecx + eax * 4]
Access to vector element will look pretty much the same.
A typical access to a local int array available as an array object will look as follows
// a[i]
mov eax, i
mov <result>, dword ptr [esp + <offset constant> + eax * 4]
A typical access to a global int array available as an array object will look as follows
// a[i]
mov eax, i
mov <result>, dword ptr [<absolute address constant> + eax * 4]
The difference in performance arises from that extra mov instruction in the first variant, which has to make an extra memory access.
However, the difference is negligible. And it is easily optimized to the point of being exactly the same in a multiple-access context (by loading the target address into a register).
So the statement about "arrays being a bit faster" is correct in the narrow case when the array is accessible directly through the array object, not through a pointer object. But the practical value of that difference is virtually nothing.

You may be barking up the wrong tree. Cache misses can be much more important than the number of instructions that get executed.

No. Under the hood, both std::vector and C++0x std::array find the pointer to element n by adding n to the pointer to the first element.
vector::at may be slower than array::at because the former must compare against a variable while the latter compares against a constant. Those are the functions that provide bounds checking, not operator[].
If you mean C-style arrays instead of C++0x std::array, then there is no at member, but the point remains.
EDIT: If you have an opcode table, a global array (such as with extern or static linkage) may be faster. Elements of a global array would be addressable individually as global variables when a constant is put inside the brackets, and opcodes are often constants.
Anyway, this is all premature optimization. If you don't use any of vector's resizing features, it looks enough like an array that you should be able to easily convert between the two.
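To illustrate the point about a global opcode table, here is a minimal sketch (the opcodes, handlers and table are made up for the example; a real table would have one entry for each of the 256 opcodes):

#include <cstddef>
#include <cstdint>

// Hypothetical opcodes and handlers -- just for illustration.
enum : std::uint8_t { OP_NOP = 0, OP_ADD = 1, OP_HALT = 2 };

static void op_nop() {}
static void op_add() { /* ... */ }

// Table with static linkage: with a constant index, each element is
// addressable like an individual global variable.
static void (*const dispatch[])() = { op_nop, op_add };

void run(const std::uint8_t* code) {
    for (std::size_t pc = 0; code[pc] != OP_HALT; ++pc)
        dispatch[code[pc]]();   // indexed dispatch on the opcode
}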

You're comparing apples to oranges. Arrays have a constant size and are automatically allocated, while vectors have a dynamic size and are dynamically allocated. Which you use depends on what you need.
Generally, arrays are "faster" to allocate (in quotes because comparison is meaningless) because dynamic allocation is slower. However, accessing an element should be the same. (Granted an array is probably more likely to be in cache, though that doesn't matter after the first access.)
Also, I don't know what "security" you're talking about - vectors have plenty of ways to get undefined behavior, just like arrays. Though they have at(), which you don't need to use if you know the index is valid.
Lastly, profile and look at the generated assembly. Nobody's guess is going to solve anything.

For decent results, use std::vector as the backing storage and take a pointer to its first element before your main loop or whatever:
std::vector<uint8_t> mem_buf;
// stuff
uint8_t *mem = &mem_buf[0];   // or mem_buf.data() in C++11
for (;;) {
    switch (mem[pc]) {
    // stuff
    }
}
This avoids any issues with over-helpful implementations that perform bounds checking in operator[], and makes single-stepping easier when stepping into expressions such as mem_buf[pc] later in the code.
If each instruction does enough work, and the code is varied enough, this should be quicker than using a global array by some negligible amount. (If the difference is noticeable, the opcodes need to be made more complicated.)
Compared to using a global array, on x86 the instructions for this sort of dispatch should be more concise (no 32-bit displacement fields anywhere), and for more RISC-like targets there should be fewer instructions generated (no TOC lookups or awkward 32-bit constants), as the commonly-used values are all in the stack frame.
I'm not really convinced that optimizing an interpreter's dispatch loop in this way will produce a good return on time invested -- the instructions should really be made to do more, if it's an issue -- but I suppose it shouldn't take long to try out a few different approaches and measure the difference. As always in the event of unexpected behaviour the generated assembly language (and, on x86, the machine code, as instruction length can be a factor) should be consulted to check for obvious inefficiencies.
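If you want to try the measuring part quickly, a tiny timing harness is enough (interpret_vector and interpret_array below are placeholders for whatever dispatch variants you are comparing):

#include <chrono>
#include <cstdio>

// Tiny timing harness for comparing dispatch variants.
template <class F>
double time_ms(F&& run) {
    auto t0 = std::chrono::steady_clock::now();
    run();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Usage sketch:
// std::printf("vector-backed: %.2f ms\n", time_ms([] { interpret_vector(); }));
// std::printf("global array:  %.2f ms\n", time_ms([] { interpret_array(); }));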


In our C++ course they suggest not to use C++ arrays on new projects anymore. As far as I know Stroustrup himself suggests not to use arrays. But are there significant performance differences?
Using C++ arrays with new (that is, using dynamic arrays) should be avoided. There is the problem that you have to keep track of the size, and you need to delete them manually and do all sorts of housekeeping.
Using arrays on the stack is also discouraged because you don't have range checking, and passing the array around will lose any information about its size (array to pointer conversion). You should use std::array in that case, which wraps a C++ array in a small class and provides a size function and iterators to iterate over it.
Now, std::vector vs. native C++ arrays (taken from the internet):
// Comparison of assembly code generated for basic indexing, dereferencing,
// and increment operations on vectors and arrays/pointers.
// Assembly code was generated by gcc 4.1.0 invoked with g++ -O3 -S on a
// x86_64-suse-linux machine.
#include <vector>
struct S
{
int padding;
std::vector<int> v;
int * p;
std::vector<int>::iterator i;
};
int pointer_index (S & s) { return s.p[3]; }
// movq 32(%rdi), %rax
// movl 12(%rax), %eax
// ret
int vector_index (S & s) { return s.v[3]; }
// movq 8(%rdi), %rax
// movl 12(%rax), %eax
// ret
// Conclusion: Indexing a vector is the same damn thing as indexing a pointer.
int pointer_deref (S & s) { return *s.p; }
// movq 32(%rdi), %rax
// movl (%rax), %eax
// ret
int iterator_deref (S & s) { return *s.i; }
// movq 40(%rdi), %rax
// movl (%rax), %eax
// ret
// Conclusion: Dereferencing a vector iterator is the same damn thing
// as dereferencing a pointer.
void pointer_increment (S & s) { ++s.p; }
// addq $4, 32(%rdi)
// ret
void iterator_increment (S & s) { ++s.i; }
// addq $4, 40(%rdi)
// ret
// Conclusion: Incrementing a vector iterator is the same damn thing as
// incrementing a pointer.
Note: If you allocate arrays with new for non-class objects (like plain int) or for classes without a user-defined constructor, and you don't need your elements initialized up front, using new-allocated arrays can have performance advantages, because std::vector initializes all elements to default values (0 for int, for example) on construction (credit to bernie for reminding me).
Preamble for micro-optimizer people
Remember:
"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%".
(Thanks to metamorphosis for the full quote)
Don't use a C array instead of a vector (or whatever) just because you believe it's faster as it is supposed to be lower-level. You would be wrong.
Use by default vector (or the safe container adapted to your need), and then if your profiler says it is a problem, see if you can optimize it, either by using a better algorithm, or changing container.
This said, we can go back to the original question.
Static/Dynamic Array?
The C++ array classes are better behaved than the low-level C array because they know a lot about themselves and can answer questions C arrays can't. They are able to clean up after themselves. And more importantly, they are usually written using templates and/or inlining, which means that what appears to be a lot of code in debug builds resolves to little or no generated code in release builds, meaning no difference from their built-in, less safe competition.
All in all, it falls on two categories:
Dynamic arrays
Using a pointer to a malloc-ed/new-ed array will be at best as fast as the std::vector version, and a lot less safe (see litb's post).
So use a std::vector.
Static arrays
Using a static array will be at best:
as fast as the std::array version
and a lot less safe.
So use a std::array.
Uninitialized memory
Sometimes, using a vector instead of a raw buffer incurs a visible cost because the vector will initialize the buffer at construction, while the code it replaces didn't, as remarked by bernie in his answer.
If this is the case, then you can handle it by using a unique_ptr instead of a vector or, if the case is not exceptional in your codebase, actually write a class buffer_owner that will own that memory and give you easy and safe access to it, including bonuses like resizing it (using realloc?), or whatever you need.
Vectors are arrays under the hood.
The performance is the same.
One place where you can run into a performance issue, is not sizing the vector correctly to begin with.
As a vector fills, it will resize itself, and that can imply a new array allocation, followed by n copy constructions, followed by about n destructor calls, followed by an array delete.
If your construct/destruct is expensive, you are much better off making the vector the correct size to begin with.
There is a simple way to demonstrate this. Create a simple class that shows when it is constructed/destroyed/copied/assigned. Create a vector of these things, and start pushing them on the back end of the vector. When the vector fills, there will be a cascade of activity as the vector resizes. Then try it again with the vector sized to the expected number of elements. You will see the difference.
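A minimal sketch of that experiment (the Noisy class is made up here purely to make the construct/copy/destroy calls visible):

#include <iostream>
#include <vector>

// Small class that reports its special member functions being called.
struct Noisy {
    Noisy()             { std::cout << "construct\n"; }
    Noisy(const Noisy&) { std::cout << "copy\n"; }
    ~Noisy()            { std::cout << "destroy\n"; }
};

int main() {
    std::cout << "--- push_back without reserve ---\n";
    std::vector<Noisy> a;
    for (int i = 0; i < 8; ++i) a.push_back(Noisy());   // reallocations add extra copies/destroys

    std::cout << "--- push_back with reserve ---\n";
    std::vector<Noisy> b;
    b.reserve(8);                                        // one allocation up front, no reallocation
    for (int i = 0; i < 8; ++i) b.push_back(Noisy());
}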
To respond to something Mehrdad said:
However, there might be cases where
you still need arrays. When
interfacing with low level code (i.e.
assembly) or old libraries that
require arrays, you might not be able
to use vectors.
Not true at all. Vectors degrade nicely into arrays/pointers if you use:
vector<double> v;
v.push_back(42);
double *array = &v[0];   // or v.data() in C++11
// pass the array to whatever low-level code you have
This works for all major STL implementations. In the next standard, it will be required to work (even though it does just fine today).
You have even fewer reasons to use plain arrays in C++11.
There are 3 kinds of arrays in nature, from fastest to slowest, depending on the features they have (of course the quality of implementation can make things really fast even for case 3 in the list):
Static with size known at compile time. --- std::array<T, N>
Dynamic with size known at runtime and never resized. The typical optimization here is to allocate the array directly on the stack. -- Not available in the standard library. Maybe std::dynarray in a C++ TS after C++14. In C there are VLAs.
Dynamic and resizable at runtime. --- std::vector<T>
For 1. plain static arrays with a fixed number of elements, use std::array<T, N> in C++11.
For 2. fixed-size arrays specified at runtime, but that won't change their size, there was discussion for C++14, but it was moved to a technical specification and ultimately left out of C++14.
For 3. std::vector<T> will usually ask for memory on the heap. This could have performance consequences, though you could use std::vector<T, MyAlloc<T>> to improve the situation with a custom allocator. The advantage compared to T* mytype = new T[n]; is that you can resize it and that it will not decay to a pointer, as plain arrays do.
Use the standard library types mentioned to avoid arrays decaying to pointers. You will save debugging time and the performance is exactly the same as with plain arrays if you use the same set of features.
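For illustration, a minimal sketch of cases 1 and 3 from the list above (the names here are just for the example):

#include <array>
#include <vector>

// Case 1: size known at compile time -- no decay to pointer, size() available.
std::array<int, 16> regs{};

// Case 3: size known only at run time, possibly resized later.
std::vector<int> heap_mem(1 << 20);

int sum_all() {
    int s = 0;
    for (int r : regs)     s += r;   // both work with range-for and <algorithm>
    for (int v : heap_mem) s += v;
    return s;
}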
There is definitely a performance impact to using an std::vector vs a raw array when you want an uninitialized buffer (e.g. to use as destination for memcpy()). An std::vector will initialize all its elements using the default constructor. A raw array will not.
The C++ spec for the std::vector constructor taking a count argument (it's the third form) states:
"Constructs a new container from a variety of data sources, optionally using a user supplied allocator alloc. [...] Constructs the container with count default-inserted instances of T. No copies are made. Complexity: 2-3) Linear in count."
A raw array does not incur this initialization cost.
Note that with a custom allocator, it is possible to avoid "initialization" of the vector's elements (i.e. to use default initialization instead of value initialization). See these questions for more details:
Is this behavior of vector::resize(size_type n) under C++11 and Boost.Container correct?
How can I avoid std::vector<> to initialize all its elements?
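The core of those answers is a custom allocator whose construct() performs default-initialization instead of value-initialization; a minimal sketch (default_init_allocator is not a standard component, just a name used here):

#include <memory>
#include <new>
#include <utility>
#include <vector>

// Allocator that default-initializes elements (leaving trivial types
// uninitialized) instead of value-initializing them.
template <class T, class A = std::allocator<T>>
class default_init_allocator : public A {
    using traits = std::allocator_traits<A>;
public:
    using A::A;

    template <class U>
    struct rebind {
        using other = default_init_allocator<U, typename traits::template rebind_alloc<U>>;
    };

    template <class U>
    void construct(U* p) {
        ::new (static_cast<void*>(p)) U;   // default-init: no zeroing for int etc.
    }
    template <class U, class... Args>
    void construct(U* p, Args&&... args) {
        traits::construct(static_cast<A&>(*this), p, std::forward<Args>(args)...);
    }
};

// std::vector<int, default_init_allocator<int>> buf(1000000);  // elements left uninitialized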
Go with STL. There's no performance penalty. The algorithms are very efficient and they do a good job of handling the kinds of details that most of us would not think about.
STL is a heavily optimized library. In fact, it's even suggested to use STL in games where high performance might be needed. Arrays are too error prone to be used in day to day tasks. Today's compilers are also very smart and can really produce excellent code with STL. If you know what you are doing, STL can usually provide the necessary performance. For example by initializing vectors to required size (if you know from start), you can basically achieve the array performance. However, there might be cases where you still need arrays. When interfacing with low level code (i.e. assembly) or old libraries that require arrays, you might not be able to use vectors.
Following up on duli's contribution with my own measurements:
The conclusion is that arrays of integers are faster than vectors of integers (5 times in my example). However, arrays and vectors are around the same speed for more complex / unaligned data.
If you compile the software in debug mode, many compilers will not inline the accessor functions of the vector. This will make the stl vector implementation much slower in circumstances where performance is an issue. It will also make the code easier to debug since you can see in the debugger how much memory was allocated.
In optimized mode, I would expect the stl vector to approach the efficiency of an array. This is since many of the vector methods are now inlined.
The performance difference between the two is very much implementation dependent - if you compare a badly implemented std::vector to an optimal array implementation, the array would win, but turn it around and the vector would win...
As long as you compare apples with apples (either both the array and the vector have a fixed number of elements, or both get resized dynamically) I would think that the performance difference is negligible as long as you follow good STL coding practice. Don't forget that using standard C++ containers also allows you to make use of the pre-rolled algorithms that are part of the standard C++ library, and most of them are likely to be better performing than the average implementation of the same algorithm you build yourself.
That said, IMHO the vector wins in a debug scenario with a debug STL, as most STL implementations with a proper debug mode can at least highlight/catch the typical mistakes made by people when working with standard containers.
Oh, and don't forget that the array and the vector share the same memory layout so you can use vectors to pass data to legacy C or C++ code that expects basic arrays. Keep in mind that most bets are off in that scenario, though, and you're dealing with raw memory again.
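For example, passing a vector's storage to a C-style function (legacy_sum here is a made-up stand-in for such an API):

#include <cstddef>
#include <cstdio>
#include <vector>

// Stand-in for a legacy C-style API that expects a raw buffer.
extern "C" void legacy_sum(const int* data, std::size_t count) {
    long s = 0;
    for (std::size_t i = 0; i < count; ++i) s += data[i];
    std::printf("sum = %ld\n", s);
}

int main() {
    std::vector<int> v = {1, 2, 3, 4};
    legacy_sum(v.data(), v.size());   // or &v[0] before C++11
}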
If you're using vectors to represent multi-dimensional behavior, there is a performance hit.
Do 2d+ vectors cause a performance hit?
The gist is that there's a small amount of overhead because each sub-vector carries its own size information, and the data will not necessarily be laid out contiguously (as it is with multi-dimensional C arrays). Addressing this lack of contiguity can offer more than micro-optimization opportunities. If you're doing multi-dimensional arrays, it may be best to just extend std::vector and roll your own get/set/resize functions.
If you do not need to dynamically adjust the size, you have the memory overhead of saving the capacity (one pointer/size_t). That's it.
There might be some edge case where you have a vector access inside an inline function inside an inline function, where you've gone beyond what the compiler will inline and it will force a function call. That would be so rare as to not be worth worrying about - in general I would agree with litb.
I'm surprised nobody has mentioned this yet - don't worry about performance until it has been proven to be a problem, then benchmark.
I'd argue that the primary concern isn't performance, but safety. You can make a lot of mistakes with arrays (consider resizing, for example), where a vector would save you a lot of pain.
Vectors use a tiny bit more memory than arrays since they contain the size of the array. They also slightly increase the on-disk size of programs and probably their memory footprint. These increases are tiny, but may matter if you're working with an embedded system. Though most places where these differences matter are places where you would use C rather than C++.
The following simple test:
C++ Array vs Vector performance test explanation
contradicts the conclusions from "Comparison of assembly code generated for basic indexing, dereferencing, and increment operations on vectors and arrays/pointers."
There must be a difference between the arrays and vectors. The test says so... just try it, the code is there...
Sometimes arrays are indeed better than vectors. If you are always manipulating
a fixed length set of objects, arrays are better. Consider the following code snippets:
int main() {
    int v[3];
    v[0] = 1; v[1] = 2; v[2] = 3;
    int sum = 0;
    int starttime = time(NULL);
    cout << starttime << endl;
    for (int i = 0; i < 50000; i++)
        for (int j = 0; j < 10000; j++) {
            X x(v);
            sum += x.first();
        }
    int endtime = time(NULL);
    cout << endtime << endl;
    cout << endtime - starttime << endl;
}
where the vector version of X is
class X {
    vector<int> vec;
public:
    X(const vector<int>& v) { vec = v; }
    int first() { return vec[0]; }
};
and the array version of X is:
class X {
    int f[3];
public:
    X(int a[]) { f[0] = a[0]; f[1] = a[1]; f[2] = a[2]; }
    int first() { return f[0]; }
};
The array version of main() will be faster because we avoid the overhead of "new" (inside the vector copy) every time around the inner loop.
(This code was posted to comp.lang.c++ by me).
For fixed-length arrays the performance is the same (vs. vector<>) in release build, but in debug build low-level arrays win by a factor of 20 in my experience (MS Visual Studio 2015, C++ 11).
So the "save time debugging" argument in favor of STL might be valid if you (or your coworkers) tend to introduce bugs in your array usage, but maybe not if your debugging time is mostly waiting on your code to run to the point you are currently working on so that you can step through it.
Experienced developers working on numerically intensive code sometimes fall into the second group (especially if they use vector :) ).
Assuming a fixed-length array (e.g. int* v = new int[1000]; vs std::vector<int> v(1000);, with the size of v being kept fixed at 1000), the only performance consideration that really matters (or at least mattered to me when I was in a similar dilemma) is the speed of access to an element. I looked up the STL's vector code, and here is what I found:
const_reference
operator[](size_type __n) const
{ return *(this->_M_impl._M_start + __n); }
This function will most certainly be inlined by the compiler. So, as long as the only thing that you plan to do with v is access its elements with operator[], it seems like there shouldn't really be any difference in performance.
There is no argument about which of them is the best to use. They both have their own use cases, and they both have their pros and cons; the behavior of the two containers differs in various respects. One of the main difficulties with arrays is that they are fixed in size: once defined or initialized, you cannot change their size. Vectors, on the other hand, are flexible; you can change a vector's size whenever you want. Arrays have static memory allocation while vectors use dynamic (heap) memory allocation (we can push and pop elements into/from a vector), and the creator of C++, Bjarne Stroustrup, has said that vectors are more flexible to use than arrays.
Using C++ arrays with new (that is, using dynamic arrays) should be avoided. There is the problem that you have to keep track of the size, and you need to delete them manually and do all sorts of housekeeping.
We can also insert, push and pull values easily with vectors, which is not as easy with arrays.
Performance-wise, if you are working with a small, fixed number of values, arrays are fine; if you are working at a larger scale, go with vector (vectors handle large amounts of data better than arrays).

How to safely implement "Using Uninitialized Memory For Fun And Profit"?

I would like to build a dense integer set in C++ using the trick described at https://research.swtch.com/sparse . This approach achieves good performance by allowing itself to read uninitialized memory.
How can I implement this data structure without triggering undefined behavior, and without running afoul of tools like Valgrind or ASan?
Edit: It seems like responders are focusing on the word "uninitialized" and interpreting it in the context of the language standard. That was probably a poor word choice on my part - here "uninitialized" means only that its value is not important to the correct functioning of the algorithm. It's obviously possible to implement this data structure safely (LLVM does it in SparseMultiSet). My question is what is the best and most performant way to do so?
I can see four basic approaches you can take. These are applicable not only to C++ but also most other low-level languages like C that make uninitialized access possible but not allowed, and the last is applicable even to higher-level "safe" languages.
Ignore the standard, implement it in the usual way
This is the one crazy trick language lawyers hate! Don't freak out yet though - the solutions following this one won't break the rules, so just skip this part if you are of the rules-stickler variety.
The standard makes most uses of uninitialized values undefined, and the few loopholes it does allow (e.g., copying one undefined value to another) don't really give you enough rope to actually implement what you want - even in C, which is slightly less restrictive (see for example this answer covering C11, which explains that while accessing an indeterminate value may not directly trigger UB, anything that results is also indeterminate, and indeed the value may appear to change from access to access).
So you just implement it anyway, with the knowledge that most or all current compilers will just compile it to the expected code, and know that your code is not standards compliant.
At least in my test all of gcc, clang and icc didn't take advantage of the illegal access to do anything crazy. Of course, the test is not comprehensive and even if you could construct one, the behavior could change in a new version of the compiler.
You would be safest if the implementation of the methods that access uninitialized memory was compiled, once, in a separate compilation unit - this makes it easy to check that it does the right thing (just check the assembly once) and makes it nearly impossible (outside of LTCG) for the compiler to do anything tricky, since it can't prove whether uninitialized values are being accessed.
Still, this approach is theoretically unsafe and you should check the compiled output very carefully and have additional safeguards in place if you take it.
If you take this approach, tools like valgrind are fairly likely to report an uninitialized read error.
Now these tools work at the assembly level, and some uninitialized reads may be fine (see, for example, the next item on fast standard library implementations), so they don't actually report an uninitialized read immediately, but rather have a variety of heuristics to determine if invalid values are actually used. For example, they may avoid reporting an error until they determine the uninitialized value is used to determine the direction of a conditional jump, or some other action that is not trackable/recoverable according to the heuristic. You may be able to get the compiler to emit code that reads uninitialized memory but is safe according to this heuristic.
More likely, you won't be able to do that (since the logic here is fairly subtle as it relies on the relationship between the values in two arrays), so you can use the suppression options in your tools of choice to hide the errors. For example, valgrind can suppress based on the stack trace - and in fact there are already many such suppression entries used by default to hide false-positives in various standard libraries.
Since it works based on stack traces, you'll probably have difficulties if the reads occur in inlined code, since the top-of-stack will then be different for every call-site. You could avoid this by making sure the function is not inlined.
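For example, a Memcheck suppression entry passed via --suppressions=my.supp looks roughly like this (the suppression name and the frame pattern are placeholders; Cond is the error kind for "conditional jump or move depends on uninitialised value"):

{
   sparse_set_uninit_read
   Memcheck:Cond
   fun:*is_member*
}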
Use assembly
What is ill-defined in the standard, is usually well-defined at the assembly level. It is why the compiler and standard library can often implement things in a faster way than you could achieve with C or C++: a libc routine written in assembly is already targeting a specific architecture and doesn't have to worry about the various caveats in the language specification that are there to make things run fast on a variety of hardware.
Normally, implementing any serious amount of code in assembly is a costly undertaking, but here it is only a handful of functions, so it may be feasible depending on how many platforms you are targeting. You don't even really need to write the methods yourself - just compile the C++ version (or use godbolt) and copy the assembly. The is_member function, for example[1], looks like:
sparse_array::is_member(unsigned long):
mov rax, QWORD PTR [rdi+16]
mov rdx, QWORD PTR [rax+rsi*8]
xor eax, eax
cmp rdx, QWORD PTR [rdi]
jnb .L1
mov rax, QWORD PTR [rdi+8]
cmp QWORD PTR [rax+rdx*8], rsi
sete al
Rely on calloc magic
If you use calloc[2], you explicitly request zeroed memory from the underlying allocator. Now a correct version of calloc may simply call malloc and then zero out the returned memory, but actual implementations rely on the fact that the OS-level memory allocation routines (sbrk and mmap, pretty much) will generally return you zeroed memory on any OS with protected memory (i.e., all the big ones), to avoid zeroing out the memory again.
As a practical matter, for large allocations, this is typically satisfied by an anonymous mmap, which maps in a special page of all zeros. Only when (if ever) the memory is written does copy-on-write actually allocate a new page. So the allocation of large, zeroed memory regions may be free, since the OS already needs to zero the pages.
In that case, implementing your sparse set on top of calloc could be just as fast as the nominally uninitialized version, while being safe and standards compliant.
Calloc Caveats
You should of course test to ensure that calloc is behaving as expected. The optimized behavior is usually only going to occur when your program initializes a lot of long-lived zeroed memory approximately "up-front". That is, the typical logic for an optimized calloc is something like this:
calloc(N)
    if (can satisfy a request for N bytes from allocated-then-freed memory)
        memset those bytes to zero and return them
    else
        ask the OS for memory, and return it directly because it is already zeroed
Basically, the malloc infrastructure (which also underlies new and friends) has a (possibly empty) pool of memory that it has already requested from the OS and generally tries to allocate from there first. This pool is composed of memory from the last block requested from the OS but not handed out (e.g., because the user requested 32 bytes but the allocator asks for chunks from the OS in 1 MB blocks, so there is a lot left over), and also of memory that was handed out to the process but subsequently returned via free or delete or whatever. The memory in that pool has arbitrary values, and if a calloc can be satisfied from that pool, you don't get your magic, since the zero-init has to occur.
On the other hand, if the memory has to be allocated from the OS, you get the magic. So it depends on your use case: if you are frequently creating and destroying sparse_set objects, you will generally just be drawing from the internal malloc pools and will pay the zeroing costs. If you have long-lived sparse_set objects which take up a lot of memory, they were likely allocated by asking the OS and you got the zeroing nearly for free.
The good news is that if you don't want to rely on the calloc behavior above (indeed, on your OS or with your allocator it may not even be optimized in that way), you could usually replicate the behavior by mapping in /dev/zero manually for your allocations. On OSes that offer it, this guarantees that you get the "cheap" behavior.
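For illustration, a sketch of asking the OS directly for zeroed pages on a POSIX system (on Linux, MAP_ANONYMOUS pages come pre-zeroed, so an explicit /dev/zero mapping is usually unnecessary; error handling is reduced to a nullptr return):

#include <sys/mman.h>
#include <cstddef>

// Ask the OS directly for n bytes of zero-filled memory, bypassing the malloc pools.
void* alloc_zeroed_pages(std::size_t n) {
    void* p = ::mmap(nullptr, n, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

void free_pages(void* p, std::size_t n) { ::munmap(p, n); }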
Use Lazy Initialization
For a solution that is totally platform agnostic you could simply use yet another array which tracks the initialization state of the array.
First you choose some granule at which you will track initialization, and use a bitmap where each bit tracks the initialization state of that granule of the sparse array.
For example, let's say you choose your granule to be 4 elements, and the size of the elements in your array is 4 bytes (e.g., int32_t values): you need 1 bit to track every 4 elements * 4 bytes/element * 8 bits/byte, which is an overhead of less than 1%[3] in allocated memory.
Now you simply check the corresponding bit in this array before accessing sparse. This adds some small cost to accessing the sparse array, but doesn't change the overall complexity, and the check is still quite fast.
For example, your is_member function now looks like:
bool sparse_set::is_member(size_t i) {
    bool init = is_init[i >> INIT_SHIFT] & (1UL << (i & INIT_MASK));
    return init && sparse[i] < n && dense[sparse[i]] == i;
}
The generated assembly on x86 (gcc) now starts with:
mov rax, QWORD PTR [rdi+24]
mov rdx, rsi
shr rdx, 8
mov rdx, QWORD PTR [rax+rdx*8]
xor eax, eax
bt rdx, rsi
jnc .L2
...
.L2:
ret
That's all that is associated with the bitmap check. It's all going to be pretty quick (and often off the critical path, since it isn't part of the data flow).
In general, the cost of this approach depends on the density of your set and on which functions you are calling - is_member is about the worst case for this approach, since some functions (e.g., clear) aren't affected at all, and others (e.g., iterate) can batch up the is_init check and only do it once every INIT_COVERAGE elements (meaning the overhead would again be ~1% for the example values).
Sometimes this approach will be faster than the approach suggested in the OP's link, especially when handling elements not in the set - in this case, the is_init check will fail and often short-circuit the remaining code, and in that case you have a working set that is much smaller (256 times smaller, using the example granule size) than the size of the sparse array, so you may greatly reduce misses to DRAM or outer cache levels.
The granule size itself is an important tunable for this approach. Intuitively, a larger granule size pays a larger initialization cost when an element covered by the granule is accessed for the first time, but saves on memory and up-front is_init initialization cost. You can come up with a formula that finds the optimum size in the simple case - but the behavior also depends on the "clustering" of values and other factors. Finally, it is totally reasonable to use a dynamic granule size to cover your bases under varying workloads - but it comes at the cost of variable shifts.
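For completeness, here is a sketch of the insertion path that maintains the bitmap. To keep it consistent with the is_member sketch above it assumes one init bit per element (with a larger granule you would also write defined values into the rest of the granule's sparse[] entries on first touch); it relies on the fact that any defined value is acceptable, because the dense[] cross-check rejects stale entries:

void sparse_set::add(size_t i) {
    size_t word = i >> INIT_SHIFT;
    unsigned long bit = 1UL << (i & INIT_MASK);

    if (!(is_init[word] & bit)) {
        sparse[i] = 0;           // any defined value will do; dense[] rejects stale entries
        is_init[word] |= bit;
    }
    if (!is_member(i)) {         // the usual sparse-set insert
        sparse[i] = n;
        dense[n]  = i;
        ++n;
    }
}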
Really Lazy Solution
It is worth noting that there is a similarity between the calloc and lazy-init solutions: both lazily initialize blocks of memory as they are needed, but the calloc solution tracks this implicitly in hardware, through MMU magic with page tables and TLB entries, while the lazy-init solution does it in software, with a bitmap explicitly tracking which granules have been initialized.
The hardware approach has the advantage of being nearly free (for the "hit" case, anyway), since it uses the always-present virtual memory support in the CPU to detect misses, but the software case has the advantage of being portable and allowing precise control over the granule size, etc.
You can actually combine these approaches, to make a lazy approach that doesn't use a bitmap, and doesn't even need the dense array at all: just allocate your sparse array with mmap as PROT_NONE, so you fault whenever you read from an un-allocated page in a sparse array. You catch the fault and allocate the page in the sparse array with a sentinel value indicating "not present" for every element.
This is the fastest of all for the "hot" case: you don't need any of the ... && dense[sparse[i]] == i checks any more.
The downsides are:
Your code is really not portable since you need to implement the fault-handling logic, which is usually platform specific.
You can't control the granule size: it must be at page granularity (or some multiple thereof). If your set is very sparse (say less than 1 out of 4096 elements occupied) and uniformly distributed, you end up paying a high initialization cost since you need to handle a fault and initialize a full page of values for every element.
Misses (i.e., non-insert accesses to set elements that don't exist) either need to allocate a page even if no elements will exist in that range, or will be very slow (incurring a fault) every time.
[1] This implementation has no "range checking" - i.e., it doesn't check if i is greater than MAX_ELEM - depending on your use case you may want to check this. My implementation used a template parameter for MAX_ELEM, which may result in slightly faster code, but also more bloat, and you'd do fine to just make the max size a class member as well.
[2] Really, the only requirement is that you use something that calls calloc under the covers or performs the equivalent zero-fill optimization, but based on my tests more idiomatic C++ approaches like new int[size]() just do the allocation followed by a memset. gcc does optimize malloc followed by memset into calloc, but that's not useful if you are trying to avoid the use of C routines anyway!
[3] Precisely, you need 1 extra bit to track every 128 bits of the sparse array.
If we reword your question:
What code reads from uninitialized memory without tripping tools designed to catch reads from uninitialized memory?
Then the answer becomes clear -- it is not possible. Any way of doing this that you could find represents a bug in Valgrind that would be fixed.
Maybe it's possible to get the same performance without UB, but the restrictions you put on your question "I would like to... use the trick... allowing itself to read uninitialized memory" guarantee UB. Any competing method avoiding UB will not be using the trick that you so love.
Valgrind does not complain if you just read uninitialised memory.
Valgrind will complain if you use this data in a way that influences the visible behaviour of the program, e.g. using this data as input to a syscall, or doing a jump based on this data, or using this data to compute another address. See http://www.valgrind.org/docs/manual/mc-manual.html#mc-manual.uninitvals for more info.
So, it might very well be that you will have no problem with Valgrind.
If valgrind still complains but your algorithm is correct even when using this uninitialised data, then you can use client requests to declare this memory as initialised.
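The client-request route looks roughly like this (VALGRIND_MAKE_MEM_DEFINED is the Memcheck macro for it; the request is a no-op when the program is not running under valgrind):

#include <cstddef>
#include <valgrind/memcheck.h>

// Tell Memcheck to treat this address range as initialised, even though we
// never wrote to it ourselves.
void mark_defined(void* p, std::size_t len) {
    VALGRIND_MAKE_MEM_DEFINED(p, len);
}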

Is there any performance difference between accessing a hardcode array and a run time initialization array?

For example, I want to create a square root table using array SQRT[i] to optimize a game, but I don't know if there is a performance difference between the following initialization when accessing the value of SQRT[i]:
Hardcode array
int SQRT[]={0,1,1,1,2,2,2,2,2,3,3,.......255,255,255};
Generate value at run time
int SQRT[65536];
int main(){
    for(int i = 0; i < 65536; i++){
        SQRT[i] = sqrt(i);
    }
    //other code
    return 0;
}
Some example of accessing them:
if(SQRT[a*a+b*b]>something)
...
I'm not clear whether the program stores or accesses a hard-coded array in a different way, and I don't know whether the compiler would optimize the hard-coded array to speed up the access time. Are there performance differences between them when accessing the array?
First, you should do the hard-coded array right:
static const int SQRT[]={0,1,1,1,2,2,2,2,2,3,3,.......255,255,255};
(also, using uint8_t instead of int is probably better, so as to reduce the size of the array and make it more cache-friendly)
Doing this has one big advantage over the alternative: the compiler can easily check that the contents of the array cannot be changed.
Without this, the compiler has to be paranoid -- every function call might change the contents of SQRT, and every pointer could potentially be pointing into SQRT and thus any write through an int* or char* might be modifying the array. If the compiler cannot prove this doesn't happen, then that puts limits on the kinds of optimizations it can do, which in some cases could show a performance impact.
Another potential advantage is the ability to resolve things involving constants at compile-time.
If needed, you may be able to help the compiler figure things out with clever use of __restrict__.
In modern C++ you can get the best of both worlds; it should be possible (and in a reasonable way) to write code that would run at compile time to initialize SQRT as a constexpr. That would best be asked a new question, though.
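A sketch of what that could look like (C++17, because std::array's non-const operator[] is constexpr there; the function name is ours, and it computes floor(sqrt(i)) by counting up instead of calling <cmath>):

#include <array>
#include <cstdint>

// Build the table at compile time: SQRT[i] == floor(sqrt(i)).
constexpr std::array<std::uint8_t, 65536> make_sqrt_table() {
    std::array<std::uint8_t, 65536> t{};
    std::uint32_t r = 0;
    for (std::uint32_t i = 0; i < 65536; ++i) {
        if ((r + 1) * (r + 1) <= i) ++r;   // advance the root past each perfect square
        t[i] = static_cast<std::uint8_t>(r);
    }
    return t;
}

constexpr auto SQRT = make_sqrt_table();   // lands in read-only data, no startup cost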
As people said in comments:
if(SQRT[a*a+b*b]>something)
is a horrible example use-case. If that's all you need SQRT for, just square something.
As long as you can tell the compiler that SQRT doesn't alias anything, a run-time loop will make your executable smaller and only add a tiny amount of CPU overhead during startup. Definitely use uint8_t, not int. Loading a 32-bit temporary from an 8-bit memory location is no slower than from a zero-padded 32-bit memory location. (The extra movzx instruction on x86, instead of using a memory operand, will more than pay for itself in reduced cache pollution. RISC machines typically don't allow memory operands anyway, so you always need an instruction to load the value into a register.)
Also, sqrt has 10-21 cycle latency on Sandy Bridge. If you don't need it often, the int->double, sqrt, double->int chain is not much worse than an L2 cache hit, and better than going to L3 or main memory. If you need a lot of sqrt, then sure, make a LUT. The throughput will be much better, even if you're bouncing around in the table and causing L1 misses.
You could optimize the initialization by squaring instead of sqrting, with something like
#include <stdint.h>
#include <string.h>

uint8_t sqrt_lookup[65536];

void init_sqrt (void)
{
    int idx = 0;
    for (int i = 0; i < 256; i++) {
        // TODO: check that there isn't an off-by-one here
        int iplus1_sqr = (i+1)*(i+1);
        memset(sqrt_lookup + idx, i, iplus1_sqr - idx);
        idx = iplus1_sqr;
    }
}
You can still get the benefits of having sqrt_lookup being const (compiler knows it can't alias). Either use restrict, or lie to the compiler, so users of the table see a const array, but you actually write to it.
This might involve lying to the compiler, by having it declared extern const in most places, but not in the file that initializes it. You'd have to make sure this actually works, and doesn't create code referring to two different symbols. If you just cast away the const in the function that initializes it, you might get a segfault if the compiler placed it in rodata (or read-only bss memory if it's uninitialized, if that's possible on some platforms?)
Maybe we can avoid lying to the compiler, with:
uint8_t restrict private_sqrt_table[65536]; // not sure about this use of restrict, maybe that will do it?
const uint8_t *const sqrt_lookup = private_sqrt_table;
Actually, that's just a const pointer to const data, not a guarantee that what it's pointing to can't be changed by another reference.
The access time will be the same. When you hardcode an array, the C runtime routines called before main will initialize it (in an embedded system, the startup code copies the read-write data, i.e. the hardcoded values, from ROM to the RAM address where the array is located; if the array is const, it is accessed directly from ROM).
If the for loop is used to initialize, then there is the overhead of calling the sqrt function.

Does allocation on the heap affect the access performance?

As known:
ptr = malloc(size);
or in C++
ptr = new Klass();
will allocate size bytes on the heap. The allocation itself is less efficient than allocation on the stack.
But after the allocation, when we access it:
foo(*ptr);
or
(*ptr)++;
Does it have the same performance as data on the stack, or still slower?
The only way to definitively answer this question is to code up both versions and measure their performance under multiple scenarios (different allocation sizes, different optimization settings, etc.). This sort of thing depends heavily on a lot of different factors, such as optimization settings, how the operating system manages memory, the size of the block being allocated, locality of accesses, etc. Never blindly assume that one method is more "efficient" than another in all circumstances.
Even then, the results will only be applicable for your particular system.
It really depends on what you are comparing and how.
If you mean is
ptr = malloc(10 * sizeof(int));
slower than:
int arr[10];
ptr = arr;
and then using ptr to access the integers it points at?
Then no.
If you are referring to using arr[0] instead of *ptr in the second case, possibly, as the compiler has to read the value in ptr to find the address of the actual variable. In many cases, however, it will "know" the value inside ptr, so it won't need to read the pointer itself.
If we are comparing foo(ptr) and foo(arr) it won't make any difference at all.
[There may be some penalty in actually allocating on the heap in that the memory has to be "committed" on the first use. But that is at most once for each 4KB, and we can probably ignore that in most cases].
Efficiency considerations are important when comparing an algorithm that runs in O(n^2) time versus O(nlogn), etc.
Comparing memory storage accesses, both algorithms are O(n) or O(k), and it is usually NOT possible to measure any difference.
However, if you are writing some code for the kernel that is frequently invoked, then a small difference might become measurable.
In the context of this question, the real answer is that it really doesn't matter; use whichever storage makes your program easy to read and maintain, because in the long run the cost of paying humans to read your code is more than the cost of running a CPU for a few more or fewer instructions.
Stack allocation is much faster than heap allocation, since it involves nothing more than moving the stack pointer. The stack is of fixed size. In contrast, with the heap the user needs to manually allocate and de-allocate the memory.

Pointer vs Variable speed in C++

At a job interview I was asked the question "In C++, how do you access a variable faster, through the normal variable identifier or through a pointer?" I must say I did not have a good technical answer to the question, so I took a wild guess.
I said that the access time will probably be the same, as a normal variable/identifier is effectively a pointer to the memory address where the value is stored, just like a pointer. In other words, that in terms of speed they both have the same performance, and that pointers are only different because we can specify the memory address we want them to point to.
The interviewer did not seem very convinced/satisfied with my answer (although he did not say anything, just carried on asking something else), therefore I thought I'd come and ask SO'ers whether my answer was accurate, and if not, why (from a theoretical and technical POV).
When you access a "variable", you look up the address, and then fetch the value.
Remember - a pointer IS a variable. So actually, you:
a) look up the address (of the pointer variable),
b) fetch the value (the address stored at that variable)
... and then ...
c) fetch the value at the address pointed to.
So yes, accessing via a "pointer" (rather than directly) DOES involve a bit of extra work and a slightly longer time.
Exactly the same thing occurs whether or not it's a pointer variable (C or C++) or a reference variable (C++ only).
But the difference is enormously small.
A variable does not have to live in main memory. Depending on the circumstances, the compiler can store it in a register for all or part of its life, and accessing a register is much faster than accessing RAM.
Let's ignore optimization for a moment, and just think about what the abstract machine has to do to reference a local variable vs. a variable through a (local) pointer. If we have local variables declared as:
int i;
int *p;
when we reference the value of i, the unoptimized code has to go get the value that is (say) at 12 past the current stack pointer and load it into a register so we can work with it. Whereas when we reference *p, the same unoptimized code has to go get the value of p from 16 past the current stack pointer, load it into a register, and then go get the value that the register points to and load it into another register so we can work with it as before. The first part of the work is the same, but the pointer access conceptually involves an additional step that needs to be done before we can work with the value.
That was, I think, the point of the interview question - to see if you understood the fundamental difference between the two types of access. You were thinking that the local variable access involved a kind of lookup, and it does - but the pointer access involves that very same type of lookup to get to the value of the pointer before we can start to go after the thing it is pointing to. In simple, unoptimized terms, the pointer access is going to be slower because of that extra step.
Now with optimization, it may happen that the two times are very close or identical. It is true that if other recent code has already used the value of p to reference another value, you may already find p in a register, so that the lookup of *p via p takes the same time as the lookup of i via the stack pointer. By the same token, though, if you have recently used the value of i, you may already find it in a register. And while the same might be true of the value of *p, the optimizer can only reuse its value from the register if it is sure that p hasn't changed in the mean time. It has no such problem reusing the value of i. In short, while accessing both values may take the same time under optimization, accessing the local variable will almost never be slower (except in really pathological cases), and may very well be faster. That makes it the correct answer to the interviewer's question.
In the presence of memory hierarchies, the difference in time may get even more pronounced. Local variables are going to be located near each other on the stack, which means that you are very likely to find the address you need already in main memory and in the cache the first time you access it (unless it is the very first local variable you access in this routine). There is no such guarantee with the address the pointer points to. Unless it was recently accessed, you may need to wait for a cache miss, or even a page fault, to access the pointed-to address, which could make it slower by orders of magnitude vs. the local variable. No, that won't happen all the time - but it's a potential factor that could make a difference in some cases, and that too is something that could be brought up by a candidate in response to such a question.
Now what about the question other commenters have raised: how much does it matter? It's true, for a single access, the difference is going to be tiny in absolute terms, like a grain of sand. But you put enough grains of sand together and you get a beach. And though (to continue the metaphor) if you are looking for someone who can run quickly down a beach road, you don't want someone who will obsess about sweeping every grain of sand off the road before he or she can start running, you do want someone who will be aware when he or she is running through knee-deep dunes unnecessarily. Profilers won't always rescue you here - in these metaphorical terms, they are much better at recognizing a single big rock that you need to run around than noticing lots of little grains of sand that are bogging you down. So I would want people on my team who understand these issues at a fundamental level, even if they rarely go out of their way to use that knowledge. Don't stop writing clear code in the quest for microoptimization, but be aware of the kinds of things that can cost performance, especially when designing your data structures, and have a sense of whether you are getting good value for the price you are paying. That's why I think this was a reasonable interview question, to explore the candidate's understanding of these issues.
What paulsm4 and LaC said + a little asm:
int y = 0;
mov dword ptr [y],0
y = x;
mov eax,dword ptr [x] ; Fetch x to register
mov dword ptr [y],eax ; Store it to y
y = *px;
mov eax,dword ptr [px] ; Fetch address of x
mov ecx,dword ptr [eax] ; Fetch x
mov dword ptr [y],ecx ; Store it to y
Not that it matters much, on the other hand; also, the pointer version is probably harder to optimize (e.g. you can't keep the value in a CPU register, as the pointer just points to some place in memory). So optimized code for y = x; could look like this:
mov dword ptr [y], ebx   ; if we assume that local var x was stored in ebx
I think the interviewer was looking for you to mention the word register. As in, if you declare a variable as a register variable the compiler will do its utmost to ensure that it is stored in a register on the CPU.
A bit of chat around bus access and negotiation for other types of variables and pointers alike would have helped to frame it.
paulsm4 and LaC have already explained it nicely, along with other members. I want to emphasize the effect of paging when the pointer is pointing to something on the heap which has been paged out.
Local variables are available either on the stack or in a register, while in the case of a pointer, the pointer may be pointing to an address which is not in the cache, and paging will certainly slow down the speed.
A variable holds a value of a certain type, and accessing the variable means getting this value, from memory or from a register. When getting the value from memory we need to get its address from somewhere - most of the time it has to be loaded into a register (sometimes it can be part of the load command itself, but this is quite rare).
A pointer keeps an address of a value; this value has to be in memory, the pointer itself can be in memory or in a register.
I would expect that on average access via a pointer will be slower than accessing the value through a variable.
Your analysis ignores the common scenario in which the pointer itself is a memory variable which must also be accessed.
There are many factors that affect the performance of software, but if you make certain simplifying assumptions about the variables involved (notably that they are not cached in any way), then each level of pointer indirection requires an additional memory access.
int a = 1234; // luggage combination
int *b = &a;
int **c = &b;
...
int e = a; // one memory access
int e = *b; // two memory accesses
int e = **c; // three memory accesses
So the short answer to "which is faster" is: ignoring compiler and processor optimizations which might be occurring, it is faster to access the variable directly.
In a best-case scenario, where this code is executed repeatedly in a tight loop, the pointer value would likely be cached into a CPU register or at worst into the processor's L1 cache. In such a case, it is likely that a first-level pointer indirection is as fast or faster than accessing the variable directly since "directly" probably means via the "stack pointer" register (plus some offset). In both cases you are using a CPU register as a pointer to the value.
There are other scenarios that could affect this analysis, such as for global or static data where the variable's address is hard-coded into the instruction stream. In such a scenario, the answer may depend on the specifics of the processor involved.
I think the key part of the question is "access a variable". To me, if a variable is in scope, why would you create a pointer to it (or a reference) to access it? Using a pointer or a reference would only make sense if the variable was in itself a data structure of some sort or if you were accessing it in some non-standard way (like interpreting an int as a float).
Using a pointer or a reference would be faster only in very specific circumstances. Under general circumstances, it seems to me that you would be trying to second guess the compiler as far as optimization is concerned and my experience tells me that unless you know what you're doing, that's a bad idea.
It would even depend on the keyword. A const keyword might very well mean that the variable is totally optimized out at compile time. That is faster than a pointer. The register keyword does not guarantee that the variable is stored in a register. So how do you know whether its faster or not? I think the answer is that it depends because there is no one size fits all answer.
I think a better answer might be: it depends on where the pointer is pointing to. Note, a variable might already be in the cache. However, a pointer might incur a fetch penalty. It's similar to the linked list vs. vector performance tradeoff: a vector is cache-friendly because all of your memory is contiguous, whereas a linked list, since it contains pointers, might incur a cache penalty because the memory is potentially scattered all over the place.