Swap 2D Double Arrays in C++

I have the following method to swap two double arrays (double**) in C++. Profiling the code shows that the method accounts for 7% of the runtime. I expected this to be a low-cost operation; any suggestions? I am new to C++, but I was hoping to just swap the references to the arrays.
void Solver::Swap(double** &v1, double** &v2)
{
    double** vswap = NULL;
    vswap = v2;
    v2 = v1;
    v1 = vswap;
}

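1) Make sure your function is inlined.
2) You can swap in place, using XOR for instance. Note that XOR is not defined for pointer types, so the values have to go through an integer type such as std::uintptr_t.
3) Try to force the compiler to pass arguments in registers instead of on the stack (even though there is a lot of register pressure on x86, it's worth trying). You can play with __fastcall on Microsoft's compiler; the standard register keyword was deprecated in C++11 and removed in C++17, so it is left out below.
#include <cstdint>
typedef double** TwoDimArray;
class Solver
{
    inline void Swap( TwoDimArray& a, TwoDimArray& b )
    {
        if (&a == &b) return; // an XOR swap of a variable with itself would zero it
        std::uintptr_t x = reinterpret_cast<std::uintptr_t>(a);
        std::uintptr_t y = reinterpret_cast<std::uintptr_t>(b);
        x ^= y; y ^= x; x ^= y; // separate statements keep the order of evaluation well defined
        a = reinterpret_cast<TwoDimArray>(x);
        b = reinterpret_cast<TwoDimArray>(y);
    }
};
4) Don't bother giving default values to temporaries like vswap.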

The code looks fine. It is just a pointer assignment. The cost depends on how many times the method gets called.

I guess your profiler is getting confused here a bit, as this method really only swaps two pointers, which is very cheap. Unless this method is called a lot, it shouldn't show up in a profile. Does your profiler tell you how often this method gets called?
One issue you have to be aware of with swapping is that one array might be in the cache and the other not (especially if they are large), so constantly swapping pointers might thrash the cache, but this would show up as a general slowdown.

Are you sure you profiled fully optimized code?
You should inline this function.
Other than that the only thing I see is that you first assign NULL to vswap and immediately afterwards some other value - but this should be taken care of by the optimizer.
inline void Solver::Swap(double** &v1, double** &v2)
{
double** vswap = v2;
v2 = v1;
v1 = vswap;
}
However, why don't you use std::swap()?
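As a rough sketch of that last suggestion: std::swap from <utility> exchanges the two pointers directly, without touching the doubles they point to. The names swapBuffers and swapBuffersDemo and the tiny 1x1 arrays are made up for illustration and are not part of the original Solver class.

```cpp
#include <cassert>
#include <utility>  // std::swap

// Hypothetical stand-in for Solver::Swap: exchanges the two row-pointer
// arrays; the doubles themselves are never copied.
inline void swapBuffers(double**& v1, double**& v2)
{
    std::swap(v1, v2);
}

// Small self-check: builds two 1x1 "2D arrays" and swaps them.
inline bool swapBuffersDemo()
{
    double a = 1.0, b = 2.0;
    double* rowA[] = { &a };
    double* rowB[] = { &b };
    double** v1 = rowA;
    double** v2 = rowB;
    assert(v1 != v2);

    swapBuffers(v1, v2);

    // Only the pointers changed places; the data stayed where it was.
    return v1 == rowB && v2 == rowA && v1[0][0] == 2.0 && v2[0][0] == 1.0;
}
```

With optimizations enabled, a call like this typically compiles down to a couple of loads and stores, which is why it should rarely dominate a profile by itself.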

Don't assume that the 7% means this operation is slow - it depends on what else is going on.
You can have an operation that only takes 1 nanosecond, and make it take nearly 100% of the time, by doing nothing else.

Related

Zero overhead subscript operator for a set of values

Assume we have a function with the following signature (the signature may not be changed, since this function is part of a legacy API):
void Foo(const std::string& s, float v0, float v1, float v2)
{ ... }
How can one access the last three arguments by index using the subscript operator [] without actually copying the data into some sort of container?
Regularly when I come across this kind of issue I put the values in a container, like const std::array<float,3> args{v0,v1,v2}; and access these values using args[0], which unfortunately needs to copy the values.
Another idea would be to access the arguments using a parameter pack, which in turn involves the creation of a templated function which seems to be overkill for this task.
I'm aware that the version using the std::array<> might be suitable since the compiler probably will optimize this kind of stuff, however, this question is kind of academically motivated.
You can't. Not in a way that guarantees zero overhead, or overhead similar to that of array subscripting.
You could, of course, do something like float* vs[]{&v0, &v1, &v2};, and then dereference the result of vs[i]. For that matter, you could make a utility class to act as a transparent reference (to try to get around arrays of references being illegal), though the result is inevitably limited.
The ultimate problem, though, is that nothing in the standard guarantees (or even suggests) that function arguments be stored in any particular memory ordering. On most platforms, at least one of those floats is going to be in a register, meaning that there's just no way to natively subscript it.
If a group of objects does not start out as an array, it's not possible to treat them as an array.
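A minimal sketch of the pointer-array idea mentioned above, assuming a hypothetical helper argByIndex whose extra index parameter exists only to make the effect observable (the legacy Foo itself would keep its signature and build the same array internally). No float is copied, but every access pays one extra dereference.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of the legacy API: returns the i-th of the
// three float arguments by going through an array of pointers to them.
float argByIndex(const std::string& s, float v0, float v1, float v2, int i)
{
    (void)s;  // unused in this sketch
    assert(i >= 0 && i < 3);
    const float* vs[] = { &v0, &v1, &v2 };  // addresses only, no float copies
    return *vs[i];                          // subscript, then dereference
}
```

Taking the addresses forces the parameters to live in memory, so this is not guaranteed to be cheaper than copying three floats into a std::array.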
"Another idea would be to access the arguments using a parameter pack, which in turn involves the creation of a templated function which seems to be overkill for this task."
Not necessarily. One thing you can do is use std::tie to build a std::tuple of references to the function parameters and then access that tuple via std::get. That should optimize out, but let you refer to the parameters as if they are part of a single collection. That would look like
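#include <iostream>
#include <string>
#include <tuple>
void Foo(const std::string& s, float v0, float v1, float v2)
{
    auto args = std::tie(v0, v1, v2); // std::tuple<float&, float&, float&>
    std::cout << std::get<1>(args);   // prints v1
}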
It's not using operator [], and requires your indices be compile time constants, but you can now pass them to something else as one object.
Danger, Will Robinson! Danger!
This is going to be horribly implementation dependent, and an all around bad idea! This relies on undefined behavior. Less awful with set hardware and tools, but less as in "we're only going to eat 5 babies, not a full dozen".
Those three floats are on the stack next to each other. I don't know if there are any packing rules for the stack. I don't know which order on the stack they'll be in ("v0 v1 v2" vs "v2 v1 v0"). Hell, some optimized build might even put them in a different order just to optimize some oddball case that doesn't actually come up in real life. I dunno. But I suspect something like this will work.
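#include <cstdio>
#include <string>
void Foo(const std::string& s, float v0, float v1, float v2)
{
    // Undefined behavior: vp[1] and vp[2] read past the single float v2.
    float* vp = &v2;
    for (int i = 0; i < 3; ++i)
    {
        printf("%f\n", vp[i]);
    }
}
int main()
{
    Foo("", 1.0f, 2.0f, 3.0f);
}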
3.0000
2.0000
1.0000
So it is possible. It's also ugly, vile, evil, and probably both fattening and carcinogenic.
On GodBolt.org, using gcc x86-64 9.3, the above code worked fine. In VS2017 intel/64, I had to use float* vp = &v0 and for (int i = 0; i < 5; i += 2). Different alignment, different order, and different output (1, 2, 3, not 3, 2, 1).
I'm pretty sure I just consigned my soul to the Nth circle of hell.

Eigen: Efficiently storing the output of a matrix evaluation in a raw pointer

I am using some legacy C code that passes around lots of raw pointers. To interface with the code, I have to pass a function of the form:
const int N = ...;
T * func(T * x) {
// TODO Put N elements in x
return x + N;
}
where this function should write the result into x, and then return a pointer just past the last element written (x + N).
Internally, in this function, I am using Eigen extensively to perform some calculations. Then I write the result back to the raw pointer using the Map class. A simple example which mimics what I am doing is this:
const int N = 5;
T * func(T * x) {
// Do a lot of operations that result in some matrices like
Eigen::Matrix<T, N, 1 > A = ...
Eigen::Matrix<T, N, 1 > B = ...
Eigen::Map<Eigen::Matrix<T, N, 1 >> constraint(x);
constraint = A - B;
return x + N;
}
Obviously, there is much more complicated stuff going on internally, but that is the gist of it... Do some calculations with Eigen, then use the Map class to write the result back to the raw pointer.
Now the problem is that when I profile this code with Callgrind, and then view the results with KCachegrind, the lines
constraint = A - B;
are almost always the bottleneck. This is sort of understandable, because such lines are potentially doing three things:
Constructing the Map object
Performing the calculation
Writing the result to the pointer
So it is understandable that this line might have the longest runtime. But I am a little bit worried that perhaps I am somehow doing an extra copy in that line before the data gets written to the raw pointer.
So is there a better way of writing the result to the raw pointer? Or is that the idiom I should be using?
In the back of my mind, I am wondering if using the placement new syntax would buy me anything here.
Note: This code is mission critical and should run in realtime, so I really need to squeeze every ounce of speed out of it. For instance, getting this call from a runtime of 0.12 seconds to 0.1 seconds would be huge for us. But code legibility is also a huge concern since we are constantly tweaking the model used in the internal calculations.
These two lines of code:
Eigen::Map<Eigen::Matrix<T, N, 1 >> constraint(x);
constraint = A - B;
are essentially compiled by Eigen as:
for(int i=0; i<N; ++i)
x[i] = A[i] - B[i];
The reality is a bit more complicated because of explicit unrolling and explicit vectorization (both depend on T), but that's essentially it. So the construction of the Map object is essentially a no-op (it is optimized away by any compiler) and no, there is no extra copy going on here.
Actually, if your profiler is able to tell you that the bottleneck lies in this simple expression, then that very likely means that this piece of code has not been inlined, meaning that you did not enable compiler optimization flags (like -O3 with gcc/clang).

data locality for implementing 2d array in c/c++

A long time ago, inspired by "Numerical Recipes in C", I started to use the following construct for storing matrices (2D arrays).
double **allocate_matrix(int NumRows, int NumCol)
{
double **x;
int i;
x = (double **)malloc(NumRows * sizeof(double *));
for (i = 0; i < NumRows; ++i) x[i] = (double *)calloc(NumCol, sizeof(double));
return x;
}
double **x = allocate_matrix(1000,2000);
x[m][n] = ...;
But recently I noticed that many people implement matrices as follows
double *x = (double *)malloc(NumRows * NumCol * sizeof(double));
x[NumCol * m + n] = ...;
From the locality point of view the second method seems perfect, but it has awful readability... So I started to wonder: is my first method, with its auxiliary array of double* row pointers, really bad, or will the compiler eventually optimize it so that it is more or less equivalent in performance to the second method? I am suspicious because I think that in the first method two jumps are made when accessing a value, x[m] and then x[m][n], and there is a chance that each time the CPU will first load the x array and then the x[m] array.
P.S. Do not worry about the extra memory for storing the double* pointers; for large matrices it is just a small percentage.
P.P.S. Since many people did not understand my question very well, I will try to rephrase it: do I understand correctly that the first method is a kind of locality hell, where each time x[m][n] is accessed, first the x array is loaded into the CPU cache and then the x[m] array is loaded, thus making each access run at the speed of talking to RAM? Or am I wrong, and the first method is also OK from a data-locality point of view?
For C-style allocations you can actually have the best of both worlds:
#include <stdlib.h>
double **allocate_matrix(int NumRows, int NumCol)
{
    double **x;
    int i;
    x = (double **)malloc(NumRows * sizeof(double *));
    x[0] = (double *)calloc(NumRows * NumCol, sizeof(double)); // <<< single contiguous memory allocation for entire array
    for (i = 1; i < NumRows; ++i) x[i] = x[i - 1] + NumCol; // each row starts NumCol elements after the previous one
    return x;
}
This way you get data locality and its associated cache/memory access benefits, and you can treat the array as a double ** or as a flattened 2D array (x[0][i * NumCol + j]) interchangeably. You also have fewer calloc/free calls (2 versus NumRows + 1).
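To make the "interchangeable views" claim concrete, here is a small sketch in C++ that allocates the contiguous layout, writes through the double** view and reads back through the flat view. The function free_matrix and the check matrixViewsAgree are hypothetical additions; the null checks are also an assumption about how one might want to handle allocation failure.

```cpp
#include <cassert>
#include <cstdlib>

// Contiguous variant of allocate_matrix: one block for the row pointers,
// one zero-initialized block for all the elements.
double** allocate_matrix(int NumRows, int NumCol)
{
    assert(NumRows > 0 && NumCol > 0);
    double** x = static_cast<double**>(std::malloc(NumRows * sizeof(double*)));
    if (!x) return nullptr;
    x[0] = static_cast<double*>(
        std::calloc(static_cast<std::size_t>(NumRows) * NumCol, sizeof(double)));
    if (!x[0]) { std::free(x); return nullptr; }
    for (int i = 1; i < NumRows; ++i) x[i] = x[i - 1] + NumCol;
    return x;
}

// Hypothetical counterpart: two frees, matching the two allocations.
void free_matrix(double** x)
{
    if (!x) return;
    std::free(x[0]);
    std::free(x);
}

// Writes via x[m][n] and checks that the flat index NumCol * m + n
// refers to the very same element.
bool matrixViewsAgree()
{
    const int rows = 3, cols = 4;
    double** x = allocate_matrix(rows, cols);
    if (!x) return false;
    x[2][1] = 7.5;
    const bool same = (x[0][2 * cols + 1] == 7.5) &&
                      (&x[2][1] == x[0] + 2 * cols + 1);
    free_matrix(x);
    return same;
}
```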
No need to guess whether the compiler will optimize the first method. Just use the second method which you know is fast, and use a wrapper class that implements for example these methods:
double& operator()(int x, int y);
double const& operator()(int x, int y) const;
... and access your objects like this:
arr(2, 3) = 5;
Alternatively, if you can bear a little more code complexity in the wrapper class(es), you can implement a class that can be accessed with the more traditional arr[2][3] = 5; syntax. This is implemented in a dimension-agnostic way in the Boost.MultiArray library, but you can do your own simple implementation too, using a proxy class.
Note: Considering your usage of C style (a hardcoded non-generic "double" type, plain pointers, function-beginning variable declarations, and malloc), you will probably need to get more into C++ constructs before you can implement either of the options I mentioned.
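A minimal sketch of such a wrapper, assuming a hypothetical class name Matrix, row-major storage in a std::vector<double>, and (row, col) as the argument order. It is one possible shape for the idea, not a complete matrix type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal wrapper: one contiguous buffer, (row, col) access.
class Matrix
{
public:
    Matrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    // Mutable access: computes the flat row-major index once.
    double& operator()(std::size_t row, std::size_t col)
    {
        assert(row < rows_ && col < cols_);
        return data_[row * cols_ + col];
    }

    // Read-only access for const objects.
    double const& operator()(std::size_t row, std::size_t col) const
    {
        assert(row < rows_ && col < cols_);
        return data_[row * cols_ + col];
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_, cols_;
    std::vector<double> data_;
};
```

Usage then looks like Matrix arr(4, 5); arr(2, 3) = 5; and the index arithmetic stays in one place instead of being repeated at every access.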
The two methods are quite different.
While the first method allows for easier direct access to the values by adding another indirection (the double** array, hence you need 1+N mallocs), ...
the second method guarantees that ALL values are stored contiguously and only requires one malloc.
I would argue that the second method is always superior. Malloc is an expensive operation and contiguous memory is a huge plus, depending on the application.
In C++, you'd just implement it like this:
std::vector<double> matrix(NumRows * NumCols);
matrix[y * NumCols + x] = value; // Access
and if you're concerned about the inconvenience of having to compute the index yourself, add a wrapper that implements operator()(int x, int y) to it.
You are also right that the first method is more expensive when accessing the values, because you need two memory lookups, as you described: x[m] and then x[m][n]. There is no way the compiler will "optimize this away". The first array, depending on its size, will be cached, and the performance hit may not be that bad. In the second case, you need an extra multiplication for direct access.
In the first method you use, the double* entries in the master array point to logical rows (arrays of size NumCol).
So, if you write something like below, you get the benefits of data locality in some sense (pseudocode):
foreach(row in rows):
foreach(elem in row):
//Do something
If you tried the same thing with the second method, and if element access was done the way you specified (i.e. x[NumCol*m + n]), you still get the same benefit. This is because you treat the array to be in row-major order. If you tried the same pseudocode while accessing the elements in column-major order, I assume you'd get cache misses given that the array size is large enough.
In addition to this, the second method has the additional desirable property of being a single contiguous block of memory which further improves the performance even when you loop through multiple rows (unlike the first method).
So, in conclusion, the second method should be much better in terms of performance.
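The two traversal orders from the pseudocode can be sketched on the flat layout as follows. The function names sumRowMajor and sumColumnMajor are made up for illustration; both compute the same sum, and only the memory access pattern differs. No timings are claimed here, since the actual difference depends on matrix size and hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row by row: the inner loop touches adjacent addresses (stride 1).
double sumRowMajor(const std::vector<double>& x, std::size_t numRows, std::size_t numCol)
{
    assert(x.size() == numRows * numCol);
    double total = 0.0;
    for (std::size_t m = 0; m < numRows; ++m)
        for (std::size_t n = 0; n < numCol; ++n)
            total += x[numCol * m + n];
    return total;
}

// Column by column over the same layout: the inner loop jumps numCol
// elements each step, which is the pattern likely to miss the cache
// once the matrix is large.
double sumColumnMajor(const std::vector<double>& x, std::size_t numRows, std::size_t numCol)
{
    assert(x.size() == numRows * numCol);
    double total = 0.0;
    for (std::size_t n = 0; n < numCol; ++n)
        for (std::size_t m = 0; m < numRows; ++m)
            total += x[numCol * m + n];
    return total;
}
```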
If NumCol is a compile-time constant, or if you are using GCC with language extensions enabled, then you can do:
double (*x)[NumCol] = (double (*)[NumCol]) malloc(NumRows * sizeof (double[NumCol]));
and then use x as a 2D array and the compiler will do the indexing arithmetic for you. The caveat is that unless NumCol is a compile-time constant, ISO C++ won't let you do this, and if you use GCC language extensions you won't be able to port your code to another compiler.

most efficient way of swapping values c++

I was wondering what the most efficient, in terms of operations, way of swapping integers is in c++, and why? Is something like:
int a =..., b = ...;
a = a + b;
b = a - b;
a = a - b;
more efficient than using a temporary? Are there any other more efficient ways? (not asking for just other ways to swap the ints) and why would they be more efficient?
Assigning values is always faster than doing arithmetic operations.
A typical C++ implementation of std::swap is essentially
template<typename T> void swap(T& t1, T& t2) {
T temp = std::move(t1); // or T temp(std::move(t1));
t1 = std::move(t2);
t2 = std::move(temp);
}
So using a temporary variable is better than doing an arithmetic trick.
And using std::swap is even better, because reinventing the wheel in programming is never a good idea.
The best way is to trust your compiler and use the C++ standard library functions. They are designed for each other.
std::swap will win.
You could use an XOR swap for an int (which doesn't require a temporary), but these days it would still perform less well than std::swap.
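In my case, std::swap is 5% slower than the following (both with -O3 optimization). In general, std::swap goes through the type's move (or copy) constructor and assignment operators, which for non-trivial types will probably be slower than just copying a block of memory. Note that the memcpy approach is only valid for trivially copyable types.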
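#include <cstring>
#include <type_traits>
// Byte-wise swap, called from the nested benchmark loops.
// swapBytes is a hypothetical helper name; a and b point to the two objects.
template <typename Object>
void swapBytes(Object* a, Object* b)
{
    static_assert(std::is_trivially_copyable<Object>::value,
                  "memcpy swap needs a trivially copyable type");
    unsigned char temp[sizeof(Object)]; // stack buffer, size known at compile time
    std::memcpy(temp, a, sizeof(Object));
    std::memcpy(a, b, sizeof(Object));
    std::memcpy(b, temp, sizeof(Object));
}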
Edit: Using stack instead of heap memory allocation.
Most efficient way is to NOT try to do it yourself.
It really depends on why/where you want to do this. Trying to be clever and writing obscure code in C++ only reduces the chance that the compiler optimizes it correctly.
Let's say we use the add/subtract approach you wrote:
First the values a and b have to be loaded from memory.
Then you are doing 3 arithmetic operations to "swap" their content.
And lastly the 2 values have to be stored in memory again.
(Not going to use actual assembly code, as I am not well versed in it, and this pseudo-assembly makes it easier to get the concept across.)
load a into register rA
load b into register rB
add rB to rA and store in rA
subtract rB from rA and store in rB
subtract rB from rA and store in rA
store register rA to memory a
store register rB to memory b
If the compiler did exactly what you wrote (it will likely ignore that and do something better), that would be:
2 loads, 3 simple math functions, 2 stores - 7 operations.
It could also do slightly better, as an addition/subtraction can take one of its operands directly from memory.
load 'a' into register rA
add b to rA and store in rA
subtract b from rA and store in rB
subtract rB from rA and store in rA
store rA to a
store rB to b
If we use an extra tmp-variable:
int a =..., b = ...;
int tmp = a;
a = b;
b = tmp;
The compiler will likely recognize that "tmp" is only a temporary variable used for swapping the 2 values, so it will not assign it a memory location but only use registers.
In that case what it would do is something along the lines of:
load a into register rA
load b into register rB
store register rA to memory b
store register rB to memory a
Only 4 operations - Basically the fastest it can do it as you need to load 2 values and you need to store 2 values and nothing else.
(For modern x86_64 processors there is no instruction that would just swap 2 values in memory; other architectures might have one and be even faster in that case.)
Doing those arithmetic operations (or the XOR trick) is a nice exercise, but on modern x86 CPUs with all but the most basic compilers it will not be "more efficient" in any form.
It will use just as many registers and the same amount of memory for the variables, but require more instructions to do the same job.
In general you should not attempt to outsmart the compiler unless you have checked your code, tested and benchmarked it and found that the generated assembly is not as good as it can be.
But it is nearly never needed to go to that level for optimisation and your time is better spent looking at the larger picture.
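For reference, the variants discussed above can be put side by side in a short sketch. The function names are made up for illustration. The arithmetic version uses unsigned here as an assumption of this sketch, because signed overflow in a + b would be undefined behavior, and it needs a guard against both references naming the same variable.

```cpp
#include <cassert>
#include <utility>  // std::swap

// Plain temporary: two loads and two stores once the compiler is done.
void swapWithTemp(int& a, int& b)
{
    int tmp = a;
    a = b;
    b = tmp;
}

// Add/subtract trick. Unsigned arithmetic wraps around, so the
// intermediate sum is well defined even when it does not fit.
void swapWithArithmetic(unsigned& a, unsigned& b)
{
    if (&a == &b) return;  // without this, swapping x with itself yields 0
    a = a + b;
    b = a - b;  // (a0 + b0) - b0 == a0
    a = a - b;  // (a0 + b0) - a0 == b0
}

// All three approaches must leave the values exchanged.
bool swapVariantsAgree()
{
    int a = 1, b = 6;
    swapWithTemp(a, b);
    const bool tempOk = (a == 6 && b == 1);

    std::swap(a, b);
    const bool stdOk = (a == 1 && b == 6);

    unsigned u = 1u, v = 6u;
    swapWithArithmetic(u, v);
    const bool arithOk = (u == 6u && v == 1u);

    assert(tempOk && stdOk && arithOk);
    return tempOk && stdOk && arithOk;
}
```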
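#include <iostream>
using namespace std;
void swap(int &a, int &b){
    // Writing this as b = (a+b) - (a=b); would read and modify a without
    // sequencing (undefined behavior), so the steps are separate statements.
    int sum = a + b;
    a = b;
    b = sum - a;
}
int main() {
    int a=1,b=6;
    swap(a,b);
    cout<<a<<b;
    return 0;
}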

c++ variable declaration

I'm wondering if this code:
int main(){
int p;
for(int i = 0; i < 10; i++){
p = ...;
}
return 0;
}
is exactly the same as that one
int main(){
for(int i = 0; i < 10; i++){
int p = ...;
}
return 0;
}
in terms of efficiency?
I mean, will the p variable be recreated 10 times in the second example?
It's the same in terms of efficiency.
It's not the same in terms of readability. The second is better in this aspect, isn't it?
It's a semantic difference which the code keeps hidden because it's not making a difference for int, but it makes a difference to the human reader. Do you want to carry the value of whatever calculation you do in ... outside of the loop? You don't, so you should write code that reflects your intention.
A human reader will need to search the function and look for other uses of p to confirm that what you did was just premature "optimization" and didn't have any deeper purpose.
Assuming it makes a difference for the type you use, you can help the human reader by commenting your code
/* p is only used inside the for-loop, to keep it from reallocating */
std::vector<int> p;
p.reserve(10);
for(int i = 0; i < 10; i++){
p.clear();
/* ... */
}
In this case, it's the same. Use the smallest scope possible for the most readable code.
If int were a class with a significant constructor and destructor, then the first (declaring it outside the loop) can be a significant savings - but inside you usually need to recreate the state anyway... so oftentimes it ends up being no savings at all.
One instance where it might make a difference is containers. A string or vector uses internal storage that gets grown to fit the size of the data it is storing. You may not want to reconstruct this container each time through the loop, instead, just clear its contents and it may not need as many reallocations inside the loop. This can (in some cases) result in a significant performance improvement.
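A small sketch of the container case, assuming a hypothetical function countIterationsThatGrew that reuses one std::vector across iterations and counts how many iterations ended with a different capacity than before. With the vector declared outside the loop, only the first iteration should need to grow the buffer, because clear() removes the elements but keeps the allocated storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reuses one vector across all iterations and reports how many iterations
// ended with a capacity different from the one they started with.
std::size_t countIterationsThatGrew(int iterations, int itemsPerIteration)
{
    assert(iterations >= 0 && itemsPerIteration >= 0);
    std::vector<int> p;  // declared once, outside the loop
    std::size_t grew = 0;
    std::size_t lastCapacity = p.capacity();
    for (int i = 0; i < iterations; ++i) {
        p.clear();  // drops the elements, keeps the allocated storage
        for (int k = 0; k < itemsPerIteration; ++k)
            p.push_back(k);
        if (p.capacity() != lastCapacity) {
            ++grew;
            lastCapacity = p.capacity();
        }
    }
    return grew;
}
```

Declaring the vector inside the loop instead would start from an empty buffer every time, so each iteration would repeat the allocations that this version pays for only once.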
The bottom-line is write it clearly, and if profiling shows it matters, move it out :)
They are equal in terms of efficiency - you should trust your compiler to get rid of the immeasurably small difference. The second is better design.
Edit: This isn't necessarily true for custom types, especially those that deal with memory. If you were writing a loop for any T, I'd sure use the first form just in case. But if you know that it's an inbuilt type, like int, pointer, char, float, bool, etc. I'd go for the second.
In the second example, p is visible only inside the for loop; you cannot use it further in your code.
In terms of efficiency they are equal.