Is it better to perform direct pointer operations or [] - c++

I have a 2D array. I need to perform a few operations on it as fast as possible (the function will be called dozens of times per second, so it would be nice to make it efficient).
Now, let's say I want to get element A[i][j]: is there any difference in speed between simply using A[i][j] and *(A+(i*width+j))? (Ignore the fact that I need to calculate i*width+j; let's say I already have this value.)

With all the optimizations turned on, there should be no difference - not only in the timing, but also in the code the compiler generates for these two constructs.
The biggest difference from a programmer's point of view is readability. The first construct immediately tells the reader that they're dealing with a 2D array, while the second one requires some thinking (is it row-major or column-major order? Where is the width calculated? Why was this chosen over the more obvious 2D array syntax?). That is why the first construct is preferable in real-life scenarios.
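For illustration, here is a minimal sketch of the two constructs (the function names and the width of 100 are made up); with optimizations on, mainstream compilers emit essentially the same address computation for both:
#include <cstddef>

constexpr std::size_t width = 100;

// 2D subscripting: the compiler computes the address as A + i*width + j.
int get2d(const int A[][width], std::size_t i, std::size_t j) {
    return A[i][j];
}

// The same address arithmetic written out by hand over flat storage.
int getFlat(const int* A, std::size_t i, std::size_t j) {
    return *(A + (i * width + j));
}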

Depending on the quality of the compiler, I think the [] notation can result in faster code. The reason is that when you use pointers, the compiler can't be sure that pointer aliasing is not occurring, and this can preclude certain optimizations.
On the other hand, if the [] notation is used, those concerns do not apply and the compiler can get more aggressive with applying optimizations.
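As an illustrative sketch of the general aliasing concern (the function names are made up, and __restrict is a common compiler extension, not standard C++):
// Because out and in may point into the same buffer, the compiler must
// reload *in on every iteration instead of keeping it in a register.
void scale(float* out, const float* in, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = *in * 2.0f; // writing out[i] might change *in
}

// With a no-aliasing promise, the load of *in can be hoisted out of the loop.
void scaleRestrict(float* __restrict out, const float* __restrict in, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = *in * 2.0f;
}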

std::tuple sizeof, is it a missed optimization?

I've checked all major compilers, and sizeof(std::tuple<int, char, int, char>) is 16 for all of them. Presumably they just put elements in order into the tuple, so some space is wasted because of alignment.
If tuple stored elements internally like: int, int, char, char, then its sizeof could be 12.
Is it possible for an implementation to do this, or is it forbidden by some rule in the standard?
std::tuple sizeof, is it a missed optimization?
Yep.
Is it possible for an implementation to do this[?]
Yep.
[Is] it forbidden by some rule in the standard?
Nope!
Reading through [tuple], there is no constraint placed upon the implementation to store the members in template-argument order.
In fact, every passage I can find seems to go to lengths to avoid making any reference to member-declaration order at all: get<N>() is used in the description of operational semantics. Other wording is stated in terms of "elements" rather than "members", which seems like quite a deliberate abstraction.
In fact, some implementations do apparently store the members in reverse order, at least, probably simply due to the way they use inheritance recursively to unpack the template arguments (and because, as above, they're permitted to).
Speaking specifically about your hypothetical optimisation, though, I'm not aware of any implementation that doesn't store elements in [some trivial function of] the user-given order; I'm guessing that it would be "hard" to come up with such an order and to provide the machinery for std::get, at least as compared to the amount of gain you'd get from doing so. If you are really concerned about padding, you may of course choose your element order carefully to avoid it (on some given platform), much as you would with a class (without delving into the world of "packed" attributes). (A "packed" tuple could be an interesting proposal…)
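A quick way to see the padding cost on your own platform (the sizes in the comments assume a typical ABI with a 4-byte int):
#include <cstdio>
#include <tuple>

int main() {
    // Declaration order int, char, int, char: each char is padded up to the
    // 4-byte alignment of the int that follows it; typically 16 bytes.
    std::printf("%zu\n", sizeof(std::tuple<int, char, int, char>));
    // Grouping by alignment by hand removes most of the padding; typically 12.
    std::printf("%zu\n", sizeof(std::tuple<int, int, char, char>));
}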
Yes, it's possible and has been (mostly) done by R. Martinho Fernandes. He used to have a blog called Flaming Danger Zone, which is now down for some reason, but its sources are still available on github.
Here are all four parts of the Size Matters series on this exact topic: 1, 2, 3, 4.
You might wish to view them raw, since GitHub doesn't understand the C++ highlighting markup used and renders the code snippets as unreadable one-liners.
He essentially computes a permutation of the tuple indices via a C++11 template metaprogram that sorts the elements by alignment in descending order, stores the elements in that order, and then applies the permutation to the index on every access.
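A stripped-down sketch of the idea (hand-written for this one element list, C++14; the real metaprogram derives the permutation automatically):
#include <cstddef>
#include <tuple>

// Elements of tuple<int, char, int, char> stored sorted by alignment,
// plus a compile-time permutation mapping the user's index to its slot.
struct packed_tuple {
    std::tuple<int, int, char, char> storage; // typically 12 bytes, not 16

    template <std::size_t I>
    auto& get() {
        constexpr std::size_t perm[] = {0, 2, 1, 3}; // user index -> slot
        return std::get<perm[I]>(storage);
    }
};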
They could. One possible reason they don't: some architectures, including x86, have an addressing mode that can compute base + size × index in a single instruction, but only when size is a power of 2. Or it might be slightly faster to do a load or store aligned to a 16-byte boundary. This could make code that addresses arrays of std::tuple slightly faster and more compact if four padding bytes are added.

What are the Disadvantages of Nested Vectors?

I'm still fairly new to C++ and have a lot left to learn, but something that I've become quite attached to recently is using nested (multidimensional) vectors. So I may typically end up with something like this:
std::vector<std::vector<std::string> > table;
Which I can then easily access elements of like this:
std::string data = table[3][5];
However, recently I've been getting the impression that it's better (in terms of performance) to have a single-dimensional vector and then just use "index arithmetic" to access elements correspondingly. I assume this performance impact is significant for much larger or higher dimensional vectors, but I honestly have no idea and haven't been able to find much information about it so far.
While, intuitively, it kind of makes sense that a single vector would have better performance than a higher-dimensional one, I honestly don't understand the actual reasons why. Furthermore, if I were to just use single-dimensional vectors, I would lose the intuitive syntax I have for accessing elements of multidimensional ones. So here are my questions:
Why are multidimensional vectors inefficient? If I were to only use a single-dimensional vector instead (to represent data in higher dimensions), what would be the best, most intuitive way to access its elements?
It depends on the exact conditions. I'll talk about the case when the nested version is a true 2D table (i.e., all rows have equal length).
A 1D vector will usually be faster for every usage pattern; at the very least, it won't be slower than the nested version.
The nested version can be considered worse because:
it needs to perform number-of-rows allocations instead of just one;
accessing an element requires an additional indirection, which is slower (the extra indirection usually costs more than the multiply needed in the 1D case);
if you process your data sequentially, it could be much slower when the rows are scattered around memory, because there can be a lot of cache misses, depending on where the allocator placed the individual rows.
So, if you go for performance, I'd recommend creating a 2D wrapper class around a 1D vector. This way, you get an API as simple as the nested version's, and you get the best performance too. And if, for some reason, you later decide to use the nested version instead, you can just change the internal implementation of this wrapper class.
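A minimal sketch of such a wrapper (the Table name and interface here are made up; row-major storage assumed):
#include <cstddef>
#include <string>
#include <vector>

// A 2D interface over one contiguous vector, indexed row-major (y*width + x).
template <typename T>
class Table {
public:
    Table(std::size_t width, std::size_t height)
        : width_(width), data_(width * height) {}

    T& operator()(std::size_t x, std::size_t y) { return data_[y * width_ + x]; }
    const T& operator()(std::size_t x, std::size_t y) const { return data_[y * width_ + x]; }

private:
    std::size_t width_;
    std::vector<T> data_;
};

// usage: Table<std::string> table(10, 10); table(5, 3) = "hello";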
The most intuitive way to access 1D elements is y*width+x. But, if you know your access patterns, you can choose a different one. For example, in a painting program, tile-based indexing could be better for storing and manipulating the image. Here, data can be indexed like this:
int tileMask = (1 << tileSizeL) - 1;  // tileSizeL is log2 of tileSize
int tileX = x >> tileSizeL;           // tile column containing x
int tileY = y >> tileSizeL;           // tile row containing y
int tileIndex = tileY * numberOfTilesInARow + tileX;  // linear tile number
// tiles are stored contiguously (tileSize*tileSize elements each),
// with pixels inside a tile stored row-major:
int index = (tileIndex << (tileSizeL * 2)) + ((y & tileMask) << tileSizeL) + (x & tileMask);
This method has better spatial locality in memory (pixels that are near each other tend to have nearby memory addresses). The index calculation is slower than a simple y*width+x, but this method can have far fewer cache misses, so in the end it can be faster.

Eigen: Efficient equivalent to MATLAB's changem()?

I need to perform an operation on an Eigen VectorXi that is equivalent to MATLAB's changem():
http://www.mathworks.com/help/map/ref/changem.html
At the moment, the way I am doing this is looping over the values in the array and performing the remapping with a switch/case block. I am guessing this is not particularly efficient.
Is there a fast way to do this with Eigen? Speed is critical for my application.
Switch / case will be particularly slow and inflexible.
changem takes a matrix and two vectors of values, new and old. If an entry is found in the old list, it is replaced by the corresponding entry in the new list. So it's inherently going to be rather slow: you need to pass over the entire matrix, search the old list for each entry, and, if the entry is found, replace it with the corresponding entry in the new list.
How can you speed it up? First, don't hardcode as a switch / case. A modern compiler will possibly optimise to a loop rather than lots of jumps, but I wouldn't guarantee it. And the approach is inflexible.
Secondly, you can sort the "old" vector and use a binary search rather than a linear one. That will only help significantly if the old vector is long.
Thirdly, you can take advantage of what you know about the matrix. Are the old values constrained to lie in certain regions? Is there one value which is overwhelmingly likely and can be tested for first? Can you quickly exclude some values as not allowed in the old list (too big, too small, not integral)?
Are the old values integers and can you use indexing? Or generalise that to hashing. That would be even faster than a binary search, though with more overhead for hashing.
Can you solve the problem another way and keep an index of matrix xy co-ordinates by value?
There are lots of approaches. But simply implement the MATLAB function naively in C as the first step. It might well be fast enough.
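For instance, a minimal sketch of the hashing approach (the helper name and signature are illustrative, not an Eigen or MATLAB API):
#include <Eigen/Dense>
#include <unordered_map>

// Illustrative sketch: build a hash map from old values to new values,
// then make a single pass over the vector.
Eigen::VectorXi changem(Eigen::VectorXi m,
                        const Eigen::VectorXi& oldvals,
                        const Eigen::VectorXi& newvals)
{
    std::unordered_map<int, int> lut;
    for (Eigen::Index i = 0; i < oldvals.size(); ++i)
        lut[oldvals[i]] = newvals[i];
    for (Eigen::Index i = 0; i < m.size(); ++i) {
        auto it = lut.find(m[i]);
        if (it != lut.end())
            m[i] = it->second; // found in the old list: replace
    }
    return m;
}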

performance of thrust vs. cublas

I have an std::vector of matrices of different sizes and I am going to calculate the square of every matrix. I have two solutions:
1/ Flatten all my matrices and store them in the device as one huge flat array (float *), with the beginning and end indices of each matrix in that array, and use cublas, for example, to do the squaring.
2/ store the matrices in a thrust::device_vector<float *> and use thrust::for_each to square them.
Clearly the second solution gives more readable code, but does it impact performance?
I think this is (now) just a repeat of a question you already asked.
Assuming the elementwise operation you want to do is something simple like squaring of each element, there should be little difference in performance or efficiency between the two cases.
This is because such an operation will be memory-bound, meaning its performance will be limited by (GPU) memory bandwidth. Therefore both realizations will have approximately the same limiter, and approximately the same performance.
Note that in both of your proposals, the data will ultimately need to be effectively "flattened" in the same way (thrust operations cannot be constructed in a typical or simple fashion to operate on a thrust::device_vector<float *>).
If you already have a mix of thrust and CUBLAS, for example, then you could probably use whichever approach suited you. If, on the other hand, your module used only CUBLAS, and you could realize your operation using either CUBLAS or thrust, I'm not sure I would inject thrust just for this one operation. But that's just a matter of opinion.
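A minimal sketch of the flattened approach, assuming (as above) that "squaring" means an element-wise square rather than a matrix-matrix product, and that all matrices have been concatenated into one contiguous device buffer (the sizes here are placeholders):
#include <thrust/device_vector.h>
#include <thrust/transform.h>

// Element-wise square, usable on both host and device.
struct square {
    __host__ __device__ float operator()(float x) const { return x * x; }
};

int main() {
    thrust::device_vector<float> flat(1 << 20, 2.0f); // placeholder contents
    thrust::transform(flat.begin(), flat.end(), flat.begin(), square());
    return 0;
}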

Performance question: Inverting an array of pointers in-place vs array of values

The background for asking this question is that I am solving a linearized equation system (Ax=b), where A is a matrix (typically of dimension less than 100x100) and x and b are vectors. I am using a direct method, meaning that I first invert A, then find the solution by x=A^(-1)b. This step is repeated in an iterative process until convergence.
The way I'm doing it now, using a matrix library (MTL4):
For every iteration I copy all coefficients of A (values) into the matrix object, then invert. This is the easiest and safest option.
Using an array of pointers instead:
For my particular case, the coefficients of A happen to be updated between each iteration. These coefficients are stored in different variables (some are arrays, some are not). Would there be a potential for performance gain if I set up A as an array containing pointers to these coefficient variables, then inverting A in-place?
The nice thing about the last option is that once I have set up the pointers in A before the first iteration, I would not need to copy any values between successive iterations. The values which are pointed to in A would automatically be updated between iterations.
So the performance question boils down to this, as I see it:
- The matrix inversion process takes roughly the same amount of time, assuming de-referencing of pointers is non-expensive.
- The array of pointers does not need the extra memory for matrix A containing values.
- The array of pointers option does not have to copy all NxN values of A between each iteration.
- The values pointed to by the array-of-pointers option are generally NOT ordered in memory. Hopefully, all values lie relatively close in memory, but *A[0][1] is generally not next to *A[0][0], etc.
Any comments on this? Will the last remark affect performance negatively, thus outweighing the positive performance effects?
Test, test, test.
Especially in the field of Numerical Linear Algebra. There are many effects in play, which is why there are a number of optimized libraries that have shouldered that burden for you.
Some effects to consider:
Memory locality and cache effects
Multithreading effects (some algorithms that are optimal while running on a single core cause memory collisions/cache evictions when more than one core is utilized).
There is no substitute for testing.
Here are some comments:
Is the function you use for the inversion capable of handling a matrix of pointers instead of values? If it does not realise it has to do an indirection, all kinds of strange effects could happen.
When doing an in-place matrix inversion (meaning the inverted matrix overwrites the input matrix), all input coefficients will get overwritten with new values, because matrix inversion cannot be done by re-ordering the elements of the matrix.
During the inversion process, none of the input coefficients may be changed by an outside process. All such updates have to be performed between iterations.
So, you get the following set of trade-offs when you chose the pointer solution:
The coefficients making up matrix A can no longer be calculated asynchronously with the matrix inversion.
Either all coefficients must be recalculated for each iteration (when you use in-place inversion, meaning the inverted matrix uses the same memory as the input matrix), or you still have to use a matrix of N x N values to hold the result of the inversion.
You're getting good answers here. The only thing I would add is some general experience with performance.
You are thinking about performance a priori. That's reasonable, but the real payoff is a posteriori. In other words, you don't know for certain where the real optimization opportunities are until the running code tells you.
You don't know if the bulk of the time will be spent in matrix inversion, multiplication, copying the matrix, dereferencing, or what. People can guess. If I had to guess, it would be matrix inversion, because it's 100x100.
However, something else I can't guess might be even bigger.
Guessing has a very poor track record, especially when you can just find out.
Here's an example of what I mean.