I was wondering what the most efficient, in terms of operations, way of swapping integers is in c++, and why? Is something like:
int a =..., b = ...;
a = a + b;
b = a - b;
a = a - b;
more efficient than using a temporary? Are there any other more efficient ways? (not asking for just other ways to swap the ints) and why would they be more efficient?
Assigning values is always faster than doing arithmetic operations.
C++ implementation for std::swap is
template<typename T> void swap(T& t1, T& t2) {
T temp = std::move(t1); // or T temp(std::move(t1));
t1 = std::move(t2);
t2 = std::move(temp);
}
So to use a temporary variable is better than doing arithmetic trick.
And to use std::swap is even better because To reinvent the wheel in programming is never a good idea
The best way is to trust your compiler and use the C++ standard library functions. They are designed for each other.
std::swap will win.
You could use an XOR swap for an int (which doesn't require a temporary), but these days it would still perform less well than std::swap.
In my case, std::swap is 5% slower than the following (both with O3 optimization). In general, std::swap() function calls copy constructor that will be probably always slower than just copy part of memory.
#include <cstring>
size_t objectSize = sizeof(Object);
char temp[objectSize];
loop {
loop {
memcpy(temp, a, objectSize);
memcpy(a, b, objectSize);
memcpy(b, temp, objectSize);
}
}
Edit: Using stack instead of heap memory allocation.
Most efficient way is to NOT try to do it yourself.
It really depends on why/were you want to do this. Trying to be clever and writing obscure code in C++ only reduces the chance of the compiler to optimize it correctly.
Lets say we use the ±-way you wrote:
First the values a and b have to be loaded from memory.
Then you are doing 3 arithmetic-operations to "swap" their content.
And lastly the 2 values have to be stored in memory again.
(Not gonna use actual assembly-code as i am not well versed with it and this pseudo-assembly is easier to get the concept accross )
load a into register rA
load b into register rB
add rB to rA and store in rA
subtract rB from rA and stor in rB
subtract rB from rA and store in rA
store register rA to memory b
store register rB to memory a
If the compiler would do exactly what you wanted (likely he will ignore it and make it better) that would be:
2 loads, 3 simple math-funtions, 2 stores - 7 operations.
It could also do slightly better as addition/subtraction can be done with 1 value from memory.
load 'a' into register rA
add b to rA and store in rA
subtract b from rA and store in rB
subtract rB from rA and store in rA
store rA to a
store rB to b
If we use an extra tmp-variable:
int a =..., b = ...;
int tmp = a;
a = b;
b = tmp;
The compiler will likely recognize that "tmp" is only a temporary variable only used for swapping the 2 values so it would not assign it a memory location btu only use registers.
In that case what it would do is something along the lines of:
load a into register rA
load b into register rB
store register rA to memory b
store register rB to memory a
Only 4 operations - Basically the fastest it can do it as you need to load 2 values and you need to store 2 values and nothing else.
(for moder nx86_64 processors there is no command that would just swap 2 values in memory - other architectures might have it and be even faster in that case).
Doing those arithmetic operations (or the xor-trick) is a nice excercise but on modern x86 CPUs with all but the most basic compilers it will not be "more efficient" in any form.
It will user just as many registers, the same amount of memory for the variables, but require more instructions to do the same job.
In general you should not attempt to outsmart the compiler unless you have checked your code, tested and benchmarked it and found that the generated assembly is not as good as it can be.
But it is nearly never needed to go to that level for optimisation and your time is better spent looking at the larger picture.
#include <iostream>
using namespace std;
void swap(int &a, int &b){
b = (a+b) - (a=b);
}
int main() {
int a=1,b=6;
swap(a,b);
cout<<a<<b;
return 0;
}
Related
I have a large array (> millions) of Items, where each Item has the form:
struct Item { void *a; size_t b; };
There are a handful of distinct a fields—meaning there are many items with the same a field.
I would like to "factor" this information out to save about 50% memory usage.
However, the trouble is that these Items have a significant ordering, and that may change over time. Therefore, I can't just go ahead make a separate Item[] for each distinct a, because that will lose the relative ordering of the items with respect to each other.
On the other hand, if I store the orderings of all the items in a size_t index; field, then I lose any memory savings from the removal of the void *a; field.
So is there a way for me to actually save memory here, or no?
(Note: I can already think of e.g. using an unsigned char for a to index into a small array, but I'm wondering if there's a better way. That one will require me to either use unaligned memory or to split every Item[] into two, which isn't great for memory locality, so I'd prefer something else.)
(Note: I can already think of e.g. using an unsigned char for a to index into a small array, but I'm wondering if there's a better way.)
This thinking is on the right track, but it's not that simple, since you will run into some nasty alignment/padding issues that will negate your memory gains.
At that point, when you start trying to scratch the last few bytes of a structure like this, you will probably want to use bit fields.
#define A_INDEX_BITS 3
struct Item {
size_t a_index : A_INDEX_BITS;
size_t b : (sizeof(size_t) * CHAR_BIT) - A_INDEX_BITS;
};
Note that this will limit how many bits are available for b, but on modern platforms, where sizeof(size_t) is 8, stripping 3-4 bits from it is rarely an issue.
Use a combination of lightweight compression schemes (see this for examples and some references) to represent the a* values. #Frank's answer employes DICT followed by NS, for example. If you have long runs of the same pointer, you could consider RLE (Run-Length Encoding) on top of that.
This is a bit of a hack, but I've used it in the past with some success. The extra overhead for object access was compensated for by the significant memory reduction.
A typical use case is an environment where (a) values are actually discriminated unions (that is, they include a type indicator) with a limited number of different types and (b) values are mostly kept in large contiguous vectors.
With that environment, it is quite likely that the payload part of (some kinds of) values uses up all the bits allocated for it. It is also possible that the datatype requires (or benefits from) being stored in aligned memory.
In practice, now that aligned access is not required by most mainstream CPUs, I would just used a packed struct instead of the following hack. If you don't pay for unaligned access, then storing a { one-byte type + eight-byte value } as nine contiguous bytes is probably optimal; the only cost is that you need to multiply by 9 instead of 8 for indexed access, and that is trivial since the 9 is a compile-time constant.
If you do have to pay for unaligned access, then the following is possible. Vectors of "augmented" values have the type:
// Assume that Payload has already been typedef'd. In my application,
// it would be a union of, eg., uint64_t, int64_t, double, pointer, etc.
// In your application, it would be b.
// Eight-byte payload version:
typedef struct Chunk8 { uint8_t kind[8]; Payload value[8]; }
// Four-byte payload version:
typedef struct Chunk4 { uint8_t kind[4]; Payload value[4]; }
Vectors are then vectors of Chunks. For the hack to work, they must be allocated on 8- (or 4-)byte aligned memory addresses, but we've already assumed that alignment is required for the Payload types.
The key to the hack is how we represent a pointer to an individual value, because the value is not contiguous in memory. We use a pointer to it's kind member as a proxy:
typedef uint8_t ValuePointer;
And then use the following low-but-not-zero-overhead functions:
#define P_SIZE 8U
#define P_MASK P_SIZE - 1U
// Internal function used to get the low-order bits of a ValuePointer.
static inline size_t vpMask(ValuePointer vp) {
return (uintptr_t)vp & P_MASK;
}
// Getters / setters. This version returns the address so it can be
// used both as a getter and a setter
static inline uint8_t* kindOf(ValuePointer vp) { return vp; }
static inline Payload* valueOf(ValuePointer vp) {
return (Payload*)(vp + 1 + (vpMask(vp) + 1) * (P_SIZE - 1));
}
// Increment / Decrement
static inline ValuePointer inc(ValuePointer vp) {
return vpMask(++vp) ? vp : vp + P_SIZE * P_SIZE;
}
static inline ValuePointer dec(ValuePointer vp) {
return vpMask(vp--) ? vp - P_SIZE * P_SIZE : vp;
}
// Simple indexed access from a Chunk pointer
static inline ValuePointer eltk(Chunk* ch, size_t k) {
return &ch[k / P_SIZE].kind[k % P_SIZE];
}
// Increment a value pointer by an arbitrary (non-negative) amount
static inline ValuePointer inck(ValuePointer vp, size_t k) {
size_t off = vpMask(vp);
return eltk((Chunk*)(vp - off), k + off);
}
I left out a bunch of the other hacks but I'm sure you can figure them out.
One cool thing about interleaving the pieces of the value is that it has moderately good locality of reference. For the 8-byte version, almost half of the time a random access to a kind and a value will only hit one 64-byte cacheline; the rest of the time two consecutive cachelines are hit, with the result that walking forwards (or backwards) through a vector is just as cache-friendly as walking through an ordinary vector, except that it uses fewer cachelines because the objects are half the size. The four byte version is even cache-friendlier.
I think I figured out the information-theoretically-optimal way to do this myself... it's not quite worth the gains in my case, but I'll explain it here in case it helps someone else.
However, it requires unaligned memory (in some sense).
And perhaps more importantly, you lose the ability easily add new values of a dynamically.
What really matters here is the number of distinct Items, i.e. the number of distinct (a,b) pairs. After all, it could be that for one a there are a billion different bs, but for the other ones there are only a handful, so you want to take advantage of that.
If we assume that there are N distinct items to choose from, then we need n = ceil(log2(N)) bits to represent each Item. So what we really want is an array of n-bit integers, with n computed at run time. Then, once you get the n-bit integer, you can do a binary search in log(n) time to figure out which a it corresponds to, based on your knowledge of the count of bs for each a. (This may be a bit of a performance hit, but it depends on the number of distinct as.)
You can't do this in a nice memory-aligned fashion, but that isn't too bad. What you would do is make a uint_vector data structure with the number of bits per element being a dynamically-specifiable quantity. Then, to randomly access into it, you'd do a few divisions or mod operations along with bit-shifts to extract the required integer.
The caveat here is that the dividing by a variable will probably severely damage your random-access performance (although it'll still be O(1)). The way to mitigate that would probably be to write a few different procedures for common values of n (C++ templates help here!) and then branch into them with various if (n == 33) { handle_case<33>(i); } or switch (n) { case 33: handle_case<33>(i); }, etc. so that the compiler sees the divisor as a constant and generates shifts/adds/multiplies as needed, rather than division.
This is information-theoretically optimal as long as you require a constant number of bits per element, which is what you would want for random-accessing. However, you could do better if you relax that constraint: you could pack multiple integers into k * n bits, then extract them with more math. This will probably kill performance too.
(Or, long story short: C and C++ really need a high-performance uint_vector data structure...)
An Array-of-Structures approach may be helpful. That is, have three vectors...
vector<A> vec_a;
vector<B> vec_b;
SomeType b_to_a_map;
You access your data as...
Item Get(int index)
{
Item retval;
retval.a = vec_a[b_to_a_map[index]];
retval.b = vec_b[index];
return retval;
}
Now all you need to do is choose something sensible for SomeType. For example, if vec_a.size() were 2, you could use vector<bool> or boost::dynamic_bitset. For more complex cases you could try bit-packing, for example to support 4-values of A, we simple change our function with...
int a_index = b_to_a_map[index*2]*2 + b_to_a_map[index*2+1];
retval.a = vec_a[a_index];
You can always beat bit-packing by using range-packing, using div/mod to store a fractional bit length per item, but the complexity grows quickly.
A good guide can be found here http://number-none.com/product/Packing%20Integers/index.html
I am using some legacy C code that passing around lots of raw pointers. To interface with the code, I have to pass a function of the form:
const int N = ...;
T * func(T * x) {
// TODO Put N elements in x
return x + N;
}
where this function should write the result into x, and then return x.
Internally, in this function, I am using Eigen extensively to perform some calculations. Then I write the result back to the raw pointer using the Map class. A simple example which mimics what I am doing is this:
const int N = 5;
T * func(T * x) {
// Do a lot of operations that result in some matrices like
Eigen::Matrix<T, N, 1 > A = ...
Eigen::Matrix<T, N, 1 > B = ...
Eigen::Map<Eigen::Matrix<T, N, 1 >> constraint(x);
constraint = A - B;
return x + N;
}
Obviously, there is much more complicated stuff going on internally, but that is the gist of it... Do some calculations with Eigen, then use the Map class to write the result back to the raw pointer.
Now the problem is that when I profile this code with Callgrind, and then view the results with KCachegrind, the lines
constraint = A - B;
are almost always the bottleneck. This is sort of understandable, because such lines could/are potentially doing three things:
Constructing the Map object
Performing the calculation
Writing the result to the pointer
So it is understandable that this line might have the longest runtime. But I am a little bit worried that perhaps I am somehow doing an extra copy in that line before the data gets written to the raw pointer.
So is there a better way of writing the result to the raw pointer? Or is that the idiom I should be using?
In the back of my mind, I am wondering if using the placement new syntax would buy me anything here.
Note: This code is mission critical and should run in realtime, so I really need to squeeze every ounce of speed out of it. For instance, getting this call from a runtime of 0.12 seconds to 0.1 seconds would be huge for us. But code legibility is also a huge concern since we are constantly tweaking the model used in the internal calculations.
These two lines of code:
Eigen::Map<Eigen::Matrix<T, N, 1 >> constraint(x);
constraint = A - B;
are essentially compiled by Eigen as:
for(int i=0; i<N; ++i)
x[i] = A[i] - B[i];
The reality is a bit more complicated because of explicit unrolling, and explicit vectorization (both depends on T), but that's essentially it. So the construction of the Map object is essentially a no-op (it is optimized away by any compiler) and no, there is no extra copy going on here.
Actually, if your profiler is able to tell you that the bottleneck lies on this simple expression, then that very likely means that this piece of code has not been inlined, meaning that you did not enabled compiler optimizations flags (like -O3 with gcc/clang).
I had an algorithm that started out like
int sumLargest2 ( int * arr, size_t n )
{
int largest(max(arr[0], arr[1])), secondLargest(min(arr[0],arr[1]));
// ...
and I realized that the first is probably not optimal because calling max and then min is repetitious when you consider that the information required to know the minimum is already there once you've found the maximum. So I figured out that I could do
int largest = max(arr[0], arr[1]);
int secondLargest = arr[0] == largest ? arr[1] : arr[0];
to shave off the useless invocation of min, but I'm not sure that actually saves any number of operations. Are there any fancy bit-shifting algorithms that can do the equivalent of
int largest(max(arr[0], arr[1])), secondLargest(min(arr[0],arr[1]));
?????
In C++, you can use std::minmax to produce a std::pair of the minimum and the maximum. This is particularly easy in combination with std::tie:
#include <algorithm>
#include <utility>
int largest, secondLargest;
std::tie(secondLargest, largest) = std::minmax(arr[0], arr[1]);
GCC, at least, is capable of optimizing the call to minmax into a single comparison, identical to the result of the C code below.
In C, you could write the test out yourself:
int largest, secondLargest;
if (arr[0] < arr[1]) {
largest = arr[1];
secondLargest = arr[0];
} else {
largest = arr[0];
secondLargest = arr[1];
}
How about:
int largestIndex = arr[1] > arr[0];
int largest = arr[largestIndex];
int secondLargest = arr[1 - largestIndex];
The first line relies on an implicit cast of a boolean result to 1 in the case of true and 0 in the case of false.
I'm going to assume that you'd rather solve the larger problem... That is, getting the sum of the largest two numbers in an array.
What you are trying to do is a std::partial_sort().
Let's implement it.
int sumLargest2(int * arr, size_t n) {
int * first = arr;
int * middle = arr + 2;
int * last = arr + n;
std::partial_sort(first, middle, last, std::greater<int>());
return arr[0] + arr[1];
}
And if you're unable to modify arr, then I'd recommend looking into std::partial_sort_copy().
x = max(a, b);
y = a + b - x;
It won't necessarily be faster, but it will be different.
Also beware of overflows.
If your intention is to reduce the function call to find min mad max you can try std::minmax_element. This is available since C++11.
auto result = std::minmax_element(arr, arr+n);
std::cout<< "min:"<< *result.first<<"\n";
std::cout<< "max :" <<*result.second << "\n";
If you just want to find the bigger of two values go:
if(a > b)
{
largest = a;
second = b;
}
else
{
largest = b;
second = a;
}
No function calls, one comparison, two assignments.
I'm assuming C++...
Short answer, use std::minmax and compile with the right optimizations and the right instruction set parameters.
Long ugly answer, The compiler cannot make all the assumptions necessary to make it really, really fast. You can. In this case, you can change the algorithm to process all data first and you can force alignment on the data. Doing all this, you can use intrinsics to make it faster.
Although I haven't tested it in this particular case, I've seen enormous performance improvements using these guidelines.
Since you're not passing 2 integers to the function, I'm assuming your using an array and want to iterate it somehow. You now have a choice to make: make 2 arrays and use min/max or use 1 array with both a and b. This decision alone can already influence the performance.
If you have 2 arrays, these can be allocated on 32-byte boundaries with aligned malloc's and then processed using intrinsics. If you are going for real, raw performance - this is the way to go.
F.ex, let's assume you have AVX2. (NOTE: I'm not sure if you do and you SHOULD check this using CPU id's!). Go to the cheat sheet here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ and pick your poison.
The intrinsics you're looking for are in this case probably:
_mm256_min_epi32
_mm256_max_epi32
_mm256_stream_load_si256
If you have to do this for the entire array, you probably want to keep all the stuff in a single __mm256 register before merging the individual items. E.g.: do a min/max per 256-bit vector, and when the loop is done, extract the 32-bit items and do a min/max on that.
Long nicer answer: So ... as for the compiler. Compilers do attempt to optimize these kinds of things, but run into problems.
If you have 2 different arrays that you process, the compiler has to know that they are different in order to be able to optimize it. This is the reason why stuff like restrict exists, which tells the compiler exactly this little thing you probably already knew while writing the code.
Also, the compiler doesn't know your memory is aligned, so it has to check this and branch... for each call. We don't want this; which means we want it to inline its stuff. So, add inline, put it in a header file and that's that. You can also use aligned to give him a hint.
Your compiler also didn't get the hint that the int* won't change over time. If it cannot change, it's a good idea to tell him that using the const keyword.
A compiler uses an instruction set to do the compilation. Normally, they already use SSE, but AVX2 can help a lot (as I've shown with the intrinsics above). If you can compile it with those flags, make sure to use them - they help a lot.
Run in release mode, compile with optimizations on 'fast' and see what happens under the hood. If you do all this, you should see vpmax... instructions appearing in the inner loops, which means that the compiler uses the intrinsics just fine.
I don't know what else you want to do in the loop... if you use all these instructions you should hit the memory speed on big arrays.
How about a time-space trade-off?
#include <utility>
template<typename T>
std::pair<T, T>
minmax(T const& a, T const& b)
{ return b < a ? std::make_pair(b, a) : std::make_pair(a, b); }
//main
std::pair<int, int> values = minmax(a[0], a[1]);
int largest = values.second;
int secondLargest = values.first;
I was wondering whether (apart from the obvious syntax differences) there would be any efficiency difference between having a class containing multiple instances of an object (of the same type) or a fixed size array of objects of that type.
In code:
struct A {
double x;
double y;
double z;
};
struct B {
double xvec[3];
};
In reality I would be using boost::arrays which are a better C++ alternative to C-style arrays.
I am mainly concerned with construction/destruction and reading/writing such doubles, because these classes will often be constructed just to invoke one of their member functions once.
Thank you for your help/suggestions.
Typically the representation of those two structs would be exactly the same. It is, however, possible to have poor performance if you pick the wrong one for your use case.
For example, if you need to access each element in a loop, with an array you could do:
for (int i = 0; i < 3; i++)
dosomething(xvec[i]);
However, without an array, you'd either need to duplicate code:
dosomething(x);
dosomething(y);
dosomething(z);
This means code duplication - which can go either way. On the one hand there's less loop code; on the other hand very tight loops can be quite fast on modern processors, and code duplication can blow away the I-cache.
The other option is a switch:
for (int i = 0; i < 3; i++) {
int *r;
switch(i) {
case 0: r = &x; break;
case 1: r = &y; break;
case 1: r = &z; break;
}
dosomething(*r); // assume this is some big inlined code
}
This avoids the possibly-large i-cache footprint, but has a huge negative performance impact. Don't do this.
On the other hand, it is, in principle, possible for array accesses to be slower, if your compiler isn't very smart:
xvec[0] = xvec[1] + 1;
dosomething(xvec[1]);
Since xvec[0] and xvec[1] are distinct, in principle, the compiler ought to be able to keep the value of xvec[1] in a register, so it doesn't have to reload the value at the next line. However, it's possible some compilers might not be smart enough to notice that xvec[0] and xvec[1] don't alias. In this case, using seperate fields might be a very tiny bit faster.
In short, it's not about one or the other being fast in all cases. It's about matching the representation to how you use it.
Personally, I would suggest going with whatever makes the code working on xvec most natural. It's not worth spending a lot of human time worrying about something that, at best, will probably only produce such a small performance difference that you'll only catch it in micro-benchmarks.
MVC++ 2010 generated exactly the same code for reading/writing from two POD structs like in your example. Since the offsets to read/write to are computable at compile time, this is not surprising. Same goes for construction and destruction.
As for the actual performance, the general rule applies: profile it if it matters, if it doesn't - why care?
Indexing into an array member is perhaps a bit more work for the user of your struct, but then again, he can more easily iterate over the elements.
In case you can't decide and want to keep your options open, you can use an anonymous union:
struct Foo
{
union
{
struct
{
double x;
double y;
double z;
} xyz;
double arr[3];
};
};
int main()
{
Foo a;
a.xyz.x = 42;
std::cout << a.arr[0] << std::endl;
}
Some compilers also support anonymous structs, in that case you can leave the xyz part out.
It depends. For instance, the example you gave is a classic one in favor of 'old-school' arrays: a math point/vector (or matrix)
has a fixed number of elements
the data itself is usually kept
private in an object
since (if?) it has a class as an
interface, you can properly
initialize them in the constructor
(otherwise, classic array
inialization is something I don't
really like, syntax-wise)
In such cases (going with the math vector/matrix examples), I always ended up using C-style arrays internally, as you can loop over them instead of writing copy/pasted code for each component.
But this is a special case -- for me, in C++ nowadays arrays == STL vector, it's fast and I don't have to worry about nuthin' :)
The difference can be in storing the variables in memory. In the first example compiler can add padding to align the data. But in your paticular case it doesn't matter.
raw arrays offer better cache locality than c++ arrays, as presented however, the array example's only advantage over the multiple objects is the ability to iterate over the elements.
The real answer is of course, create a test case and measure.
I have the following method to swap two double arrays (double**) in c++. Profiling the code, the method is accounting for 7% of the runtime... I was thinking that this should be a low cost operation, any suggestions? I am new with c++, but i was hoping to just swap the references to the arrays.
62 void Solver::Swap(double** &v1, double** &v2)
63 {
64 double** vswap = NULL;
65 vswap = v2;
66 v2 = v1;
67 v1 = vswap;
68 }
1) Make sure your function is inlined.
2) You can inplace swap, using a XOR for instance
3) Try to force the compiler to pass arguments using register instead of the stack (even though there's lot of register stress on x86, it's worth trying) - you can use the standard register keyword or play with fastcall on MS' compiler.
typedef double** TwoDimArray;
class Solver
{
inline void Swap( register TwoDimArray& a, register TwoDimArray& b )
{
a ^= b ^= a ^= b;
}
};
4) Don't bother giving default values for temporaries like vswap.
The code looks fine. Its is just a pointer assignment. It depends on how many times the method was got called.
I guess your profiler is getting confused here a bit, as this method really only swaps two pointers, which is very cheap. Unless this method is called a lot, it shouldn't show up in a profile. Does your profiler tell you how often this method gets called?
One issue you have to be aware of with swapping is that one array might be in the cache and the other not (especially if they are large), so constantly swapping pointers might trash the cache, but this would show up as a general slow-down.
Are you sure you profiled fully optimized code?
You should inlinethis function.
Other than that the only thing I see is that you first assign NULL to vswap and immediately afterwards some other value - but this should be taken care of by the optimizer.
inline void Solver::Swap(double** &v1, double** &v2)
{
double** vswap = v2;
v2 = v1;
v1 = vswap;
}
However, why don't you use std::swap()?
Don't assume that the 7% means this operation is slow - it depends on what else is going on.
You can have an operation that only takes 1 nanosecond, and make it take nearly 100% of the time, by doing nothing else.