How to increase speed of code in VS 2008 [closed] - c++

I have a portion of code that is called an enormous number of times. How can I speed it up?
#define SUM(p0, p1, p2, p3, offset) \
((p0)[offset] - (p1)[offset] - (p2)[offset] + (p3)[offset])
inline int Calc::compute( int offset ) const
{
int b = SUM( p[5], p[6], p[9], p[10], offset );
int a1 = SUM(...);
int a2 = SUM(...);
....
return (uchar)(((a1 >= b) << 7) |
((a2 >= b) << 6) |
((a3 >= b) << 5) |
((a4 >= b) << 4) |
((a5 >= b) << 3) |
((a6 >= b) << 2) |
((a7 >= b) << 1) |
(a8 >= b));
}
Thank you.

I see 3 possible opportunities here:
Pack your p0, p1, p2 and p3 data contiguously in memory (so, basically, in one array) in order to prevent cache misses:
#define SUM(array, offset) \
(array[offset] - array[offset + 1] - array[offset + 2] + array[offset + 3])
....
//make sure pArray contains all the p0, p1, ... values.
int a1 = SUM(pArray, offset);
Replace the bit-shift operators with an if structure that ORs together static literals (if (a1 >= b) and so on); the values are going to be static every time anyway:
uint8_t bitmask = 0;
if(a1 >= b)
bitmask |= 0x80;
if(a2 >= b)
bitmask |= 0x40;
...
Try to make sure that SIMD instructions are being used. This involves dumping the assembly and checking whether those kinds of instructions are being generated.
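For illustration, here is a speculative SSE2 sketch of the comparison step (SSE2 intrinsics are available in VS 2008). It assumes the eight sums have already been computed into an array ordered a8..a1, so the movemask bit order matches the original return value (a1 in bit 7, a8 in bit 0); the function name and array layout are mine, not from the original code:
#include <emmintrin.h>
inline int compute_mask_sse2(const int a[8], int b)
{
    __m128i vb = _mm_set1_epi32(b);
    __m128i lo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));     // a8..a5
    __m128i hi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + 4)); // a4..a1
    // a >= b is !(b > a); SSE2 only offers signed greater-than
    __m128i ltLo = _mm_cmpgt_epi32(vb, lo);  // lane is all-ones where a < b
    __m128i ltHi = _mm_cmpgt_epi32(vb, hi);
    int lt = (_mm_movemask_ps(_mm_castsi128_ps(ltHi)) << 4)
           |  _mm_movemask_ps(_mm_castsi128_ps(ltLo));
    return ~lt & 0xFF;  // invert to get the (a >= b) bits
}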
EDIT: Reaction to the comment below:
In order to prevent cache misses, everything revolves around accessing your data in a predictable manner. A problem with your original code is that you resolve the offset with your p-variables.
So you have something like:
int b = SUM( p[5], p[6], p[9], p[10], _offset );
int a1 = SUM( p[0], p[1], p[4], p[5], _offset );
int a2 = SUM( p[1], p[2], p[5], p[6], _offset );
You could create an array which contains these values in the order you use them.
So I'd try to create an array that looks like this at a certain offset:
p[5], p[6], p[9], p[10], p[0], p[1], p[4], p[5], p[1], p[2], p[5], p[6]
And now you can define your sum function like this:
#define SUM(array, offset, calculationOffset) \
(array[offset + calculationOffset * 4] - array[offset + calculationOffset * 4 + 1] \
- array[offset + calculationOffset * 4 + 2] + array[offset + calculationOffset * 4 + 3])
Your calls can be transformed into this:
int b = SUM(pArray, offset, 0);
int a1 = SUM(pArray, offset, 1);
int a2 = SUM(pArray, offset, 2);
There's only one problem with this: if you have to create the array and copy all the data on every function call, this might remove any benefit of what we did; but you may be able to construct this kind of array once, before using this function, and pass it in as an argument.

There's still the possibility to change/improve the algorithm itself. We cannot help you with this unless the algorithm you use is known (I mean the algorithm which uses the code you have shown).
If there are any known properties of the variables used and their values, this could give room for improvement.
I don't know whether the cast to uchar impacts performance. Why do you need it? I think it should be changed, since you are adding ints and returning an int. But you'd have to measure performance to see if, for example, using & 0xff gives better results.
Also, you should try the difference it makes when you replace the bit-OR operators (|) with simple additions (which should work the same in this case because each summand has a single unique bit set to 1).
Then try the effect of altering ((a1 >= b) << 7) to ((a1 >= b) ? 128 : 0), etc.
All of the above might or might not affect performance, and you have to measure the effect with the compiler you use.
But most important: if your problem with analyzing all these images is the total amount of time, you should look into processing different images at the same time (if you are on a multiprocessor machine with sufficient RAM). You have several options:
Parallelize the code that processes a single image (using something like OpenMP; a sketch follows this list).
Use one thread per image (IMHO much easier, and I'd expect better overall throughput than option 1).
Move the concurrency to where you start your program (i.e. a script).
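As a minimal sketch of the OpenMP option (process_image and the images array are illustrative stand-ins, not names from your code), assuming each image can be processed independently; in VS 2008, compile with /openmp:
struct Image;                    // stand-in for your image type
void process_image(Image& img);  // stand-in for your per-image work
void process_all(Image* images, int imageCount)
{
    #pragma omp parallel for
    for (int i = 0; i < imageCount; ++i)  // OpenMP 2.0 requires a signed loop index
        process_image(images[i]);
}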

Related

Broadcasting Row and Column Vector in Eigen C++

I have following Python Code written in NumPy:
> r = 3
> y, x = numpy.ogrid[-r : r + 1, -r : r + 1]
> mask = numpy.sqrt(x**2 + y**2)
> mask
array([[4.24264, 3.60555, 3.16228, 3.00000, 3.16228, 3.60555, 4.24264],
[3.60555, 2.82843, 2.23607, 2.00000, 2.23607, 2.82843, 3.60555],
[3.16228, 2.23607, 1.41421, 1.00000, 1.41421, 2.23607, 3.16228],
[3.00000, 2.00000, 1.00000, 0.00000, 1.00000, 2.00000, 3.00000],
[3.16228, 2.23607, 1.41421, 1.00000, 1.41421, 2.23607, 3.16228],
[3.60555, 2.82843, 2.23607, 2.00000, 2.23607, 2.82843, 3.60555],
[4.24264, 3.60555, 3.16228, 3.00000, 3.16228, 3.60555, 4.24264]])
Now I am making the mask in Eigen, where I need to broadcast a row and a column vector. Unfortunately, that is not allowed, so I made the following workaround:
int len = 1 + 2 * r;
MatrixXf mask = MatrixXf::Zero(len, len);
ArrayXf squared_yx = ArrayXf::LinSpaced(len, -r, r).square();
mask = (mask.array().colwise() + squared_yx) +
(mask.array().rowwise() + squared_yx.transpose());
mask = mask.cwiseSqrt();
cout << "mask" << endl << mask << endl;
4.24264 3.60555 3.16228 3 3.16228 3.60555 4.24264
3.60555 2.82843 2.23607 2 2.23607 2.82843 3.60555
3.16228 2.23607 1.41421 1 1.41421 2.23607 3.16228
3 2 1 0 1 2 3
3.16228 2.23607 1.41421 1 1.41421 2.23607 3.16228
3.60555 2.82843 2.23607 2 2.23607 2.82843 3.60555
4.24264 3.60555 3.16228 3 3.16228 3.60555 4.24264
It works, but I wonder if there is a shorter way to do it. Therefore my question is: how do I broadcast a row and a column vector in Eigen C++?
System Info:
Eigen 3.3.7
GCC 9.4.0
Ubuntu 20.04.4 LTS
I think the easiest approach (as in: most readable) is replicate.
int r = 3;
int len = 1 + 2 * r;
const auto& squared_yx = Eigen::ArrayXf::LinSpaced(len, -r, r).square();
const auto& bcast = squared_yx.replicate(1, len);
Eigen::MatrixXf mask = (bcast + bcast.transpose()).sqrt();
Note that what you do is numerically unstable (for large r), and the hypot function exists to work around exactly these issues. So even your Python code could be better:
r = 3
y, x = numpy.ogrid[-r : r + 1, -r : r + 1]
mask = numpy.hypot(x, y)
To achieve the same in Eigen, do something like this:
const auto& yx = Eigen::ArrayXf::LinSpaced(len, -r, r);
const auto& bcast = yx.replicate(1, len);
Eigen::MatrixXf mask = bcast.binaryExpr(bcast.transpose(),
[](float x, float y) noexcept -> float {
return std::hypot(x, y);
});
Eigen's documentation on binaryExpr is currently broken, so this is hard to find.
To be fair, you will probably never run into stability issues in this particular case, because you will run out of memory first. However, I'd still like to point this out, because seeing a naive sqrt(x**2 + y**2) is always a bit of a red flag. Also, in Python hypot might still be worth it from a performance point of view, because it reduces the number of temporary memory allocations and function calls.
BinaryExpr
The documentation on binaryExpr is missing, I assume because the parser has trouble with Eigen's C++ code. In any case, one can find it indirectly as CwiseBinaryOp and similarly CwiseUnaryOp, CwiseNullaryOp and CwiseTernaryOp.
The use looks a bit weird but is pretty simple. It takes a functor (either a struct with operator(), a function pointer, or a lambda) and applies this element-wise.
The unary operation makes this pretty clear. If Eigen::Array::sin() didn't exist, you could write
array.unaryExpr([](double x) -> double { return std::sin(x); })
to achieve exactly the same effect.
The binary and ternary versions take one or two more Eigen expressions as the second and third argument to the function. That's what I did above. The nullary version is explained in the documentation in its own chapter.
Use of auto
Eigen is correct to warn about auto but only in that you have to know what you do. It is important to realize that auto on an Eigen expression just keeps the expression around. It does not evaluate it into a vector or matrix.
This is fine and very useful if you want to compose a complex expression that would be hard to read when put in a single statement. In my code above, there are no temporary memory allocations and no floating point computations take place until the final expression is assigned to the matrix.
As long as the programmer knows that these are expressions and not final matrices, everything is fine.
I think the main take-away is that use of auto with Eigen should be limited to short-lived (as in: inside a single function) scalar expressions. Any coding style that uses auto for everything will quickly break or be hard to read with Eigen. But it can be used safely and make the code more readable in the process without sacrificing performance in the same way as evaluating into matrices would.
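A minimal illustration of the difference (the variable names are mine, not from the original code):
Eigen::ArrayXf a = Eigen::ArrayXf::Random(8);
Eigen::ArrayXf b = Eigen::ArrayXf::Random(8);
auto expr = a + b;            // expression template; nothing is computed yet
Eigen::ArrayXf sum = a + b;   // evaluated into real storage right here
a(0) = 42.0f;                 // changes what expr will evaluate to later...
Eigen::ArrayXf late = expr;   // ...so late(0) differs from sum(0)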
As for why I chose const auto& instead of auto or const auto: Mostly force of habit that is unrelated to the task at hand. I mostly do it for instances like this:
const Eigen::VectorXf& Foo::get_bar();
void quz(Foo& foo)
{
const auto& bar = foo.get_bar();
}
Here, bar will remain a reference, whereas auto would create a copy. If the return type is changed, everything stays valid.
Eigen::VectorXf Foo::get_bar();
void quz(Foo& foo)
{
const auto& bar = foo.get_bar();
}
Now a copy is created anyway. But everything continues to work because assigning the return value to a const-reference extends the lifetime of the object. So this may look like a dangling pointer, but it is not.

how could I use the power function in c/c++ without pow(), functions, or recursion

I'm using a C++ compiler but writing code in C (if that helps)
There's a series of numbers
((-1)^(a-1) / (2a-1)) * B^(2a-1)
A and X are user defined... A must be positive, but X can be anything (+,-)...
to decode this sequence... I need to use exponents/powers, but was given some restrictions... I can't make another function, use recursion, or use pow() (among other advanced math functions that come with cmath or math.h).
There were plenty of similar questions, but many answers have used functions and recursion which aren't directly relevant to this question.
This is the code that works perfectly with pow(). I spent a lot of time trying to modify it to replace pow() with my own code, but nothing seems to work; I'm mainly getting wrong results. x and j are user-inputted variables:
for (int i = 1; i < j; i++) {
    sum += (pow(-1, i - 1)) / (5 * i - 1) * (pow(x, 5 * i - 1));
}
You can use macros to get around the no-function-calls restriction, as macros generate inline code, which is technically not a function call. However, in the case of more complex operations a macro cannot have a return value, so you need to use some local variable for the result (in the case of more than a single expression), like:
int ret;
#define my_pow_notemp(a,b) ((b==0)?1:(b==1)?a:(b==2)?a*a:(b==3)?a*a*a:0)
#define my_pow(a,b)\
{\
ret=1;\
if (int(b& 1)) ret*=a;\
if (int(b& 2)) ret*=a*a;\
if (int(b& 4)) ret*=a*a*a*a;\
if (int(b& 8)) ret*=a*a*a*a*a*a*a*a;\
if (int(b&16)) ret*=a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a;\
if (int(b&32)) ret*=a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a;\
}
int main()
{
int a=2,b=3,c;
c=my_pow_notemp(a,b); // c = a^b
my_pow(a,b); c = ret; // c = a^b
}
As you can see, you can use my_pow_notemp directly, but the code is hardcoded, so it only goes up to a^3; if you want more, you have to add it to the code. my_pow accepts exponents up to a^63, and it is also an example of how to return a value in case of more complex code inside a macro. Here are some (normal) ways of computing powers in case you need non-integer or negative exponents (but converting them to unrolled code would be insanely hard without loops/recursion):
Power by squaring for negative exponents
In case you want to get away with recursion and function calls you can use templates instead of macros but that is limited to C++.
template<class T> T my_pow(T a,T b)
{
if (b==0) return 1;
if (b==1) return a;
return a*my_pow(a,b-1);
}
int main()
{
int a=2,b=3,c;
c=my_pow(a,b);
}
As you can see, templates have a return value, so there is no problem even with more complex code (more than a single expression).
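A speculative variant moves the exponent into a template parameter, so the whole multiplication chain is generated and inlined at compile time instead of recursing at run time (my_pow_ct is my own name for it):
template<unsigned B> inline int my_pow_ct(int a)
{
    return a * my_pow_ct<B - 1>(a);  // unrolled by the compiler
}
template<> inline int my_pow_ct<0>(int) { return 1; }
int main()
{
    int c = my_pow_ct<3>(2);  // c = 8
    return 0;
}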
To avoid loops you can use LUT tables
int my_pow[4][4]=
{
{1,0,0,0}, // 0^
{1,1,1,1}, // 1^
{1,2,4,8}, // 2^
{1,3,9,27}, // 3^
};
int main()
{
int a=2,b=3,c;
c=my_pow[a][b];
}
If you have access to the FPU or advanced math assembly you can use that, as an asm instruction is not a function call. FPUs usually have log, exp and pow natively. This, however, limits the code to a specific instruction set!
Here some examples:
How to: pow(real, real) in x86
So when I consider your limitation I think the best way is:
#define my_pow(a,b) ((b==0)?1:(b==1)?a:(b==2)?a*a:(b==3)?a*a*a:0)
int main()
{
int a=2,b=3,c;
c=my_pow(a,b); // c = a^b
}
This will work for int exponents b up to 3 (if you want more, just add (b==4)?a*a*a*a: ... :0) and for both int and float bases a. If you need a much bigger exponent, use the complicated version with a local temp variable for returning the result.
[Edit1] ultimate single-expression macro with power by squaring up to a^15
#define my_pow(a,b) (((int(b&1))?a:1)*((int(b&2))?a*a:1)*((int(b&4))?a*a*a*a:1)*((int(b&8))?a*a*a*a*a*a*a*a:1))
int main()
{
int a=2,b=3,c;
c=my_pow(a,b); // c = a^b
}
In case you want more than a^15, just multiply in another sub-term (int(b&16))?a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a:1, and so on for each bit of the exponent.
It is a series. Replace pow() based on the previous iteration. @Bathsheba
The code does not need to call pow(). It can form pow(x, 5 * i - 1) and pow(-1, i - 1) from the prior loop iteration, since both have an int exponent based on the iterator i.
Example:
Let f(x, i) = pow(x, 5 * i - 1)
Then f(x, 1) = x*x*x*x
and f(x, i > 1) = f(x, i-1) * x*x*x*x*x
double sum = 0.0;
double power_n1 = 1.0;
double power_x5 = x*x*x*x;
for (int i = 1; i < j + 1; i++) {
    // sum += (pow(-1, i - 1)) / (5 * i - 1) * (pow(x, 5 * i - 1));
    sum += power_n1 / (5 * i - 1) * power_x5;
    power_n1 = -power_n1;
    power_x5 *= x*x*x*x*x;
}

Fast inner product of ternary vectors

Consider two vectors, A and B, each of size n, 7 <= n <= 23. Both A and B consist of -1s, 0s and 1s only.
I need a fast algorithm which computes the inner product of A and B.
So far I've thought of storing the signs and values in separate uint32_ts using the following encoding:
sign 0, value 0 → 0
sign 0, value 1 → 1
sign 1, value 1 → -1.
The C++ implementation I've thought of looks like the following:
struct ternary_vector {
uint32_t sign, value;
};
int inner_product(const ternary_vector & a, const ternary_vector & b) {
uint32_t psign = a.sign ^ b.sign;
uint32_t pvalue = a.value & b.value;
psign &= pvalue;
pvalue ^= psign;
return __builtin_popcount(pvalue) - __builtin_popcount(psign);
}
This works reasonably well, but I'm not sure whether it is possible to do it better. Any comment on the matter is highly appreciated.
I like having the 2 uint32_t, but I think your actual calculation is a bit wasteful.
Just a few minor points:
I'm not sure about the reference (getting a and b by const &): this adds a level of indirection compared to putting them on the stack. When the code is this small (a couple of clocks, maybe), that is significant. Try passing by value and see what you get.
__builtin_popcount can be, unfortunately, very inefficient. I've used it myself, but found that even a very basic implementation I wrote was far faster than this. However - this is dependent on the platform.
Basically, if the platform has a hardware popcount implementation, __builtin_popcount uses it. If not - it uses a very inefficient replacement.
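For reference, here is a sketch of the classic branch-free software popcount (the usual bit-twiddling version), which you could benchmark against __builtin_popcount on your platform:
#include <cstdint>
inline int popcount32(std::uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                  // pairwise 2-bit counts
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  // 4-bit counts
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  // 8-bit counts
    return static_cast<int>((x * 0x01010101u) >> 24);  // sum the four bytes
}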
The one serious problem here is the reuse of the psign and pvalue variables for the positive and negative vectors. You are doing neither your compiler nor yourself any favors by obfuscating your code in this way.
Would it be possible for you to encode your ternary state in a std::bitset<2> and define the product in terms of AND? For example, if your ternary types are:
1 = P = (1, 1)
0 = Z = (0, 0)
-1 = M = (1, 0) or (0, 1)
I believe you could define their product as:
1 * 1 = 1 => P * P = P => (1, 1) & (1, 1) = (1, 1) = P
1 * 0 = 0 => P * Z = Z => (1, 1) & (0, 0) = (0, 0) = Z
1 * -1 = -1 => P * M = M => (1, 1) & (1, 0) = (1, 0) = M
Then the inner product could start by taking the AND of the bits of the elements, and... I am working on how to add them together.
Edit:
My foolish suggestion did not consider that (-1)*(-1) = 1, which cannot be handled by the representation I proposed. Thanks to @user92382 for bringing this up.
Depending on your architecture, you may want to optimize away the temporary bit vectors -- e.g. if your code is going to be compiled to FPGA, or laid out to an ASIC, then a sequence of logical operations will be better in terms of speed/energy/area than storing and reading/writing to two big buffers.
In this case, you can do:
int inner_product(const ternary_vector & a, const ternary_vector & b) {
return __builtin_popcount( a.value & b.value & ~(a.sign ^ b.sign))
- __builtin_popcount( a.value & b.value & (a.sign ^ b.sign));
}
This will lay out very well -- the (a.value & b.value & ... ) can enable/disable an XOR gate, whose output splits into two signed accumulators, with the first pathway NOTed before accumulation.
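For completeness, a hypothetical helper that builds the encoding described in the question (value bit set for any non-zero element, sign bit set for -1); encode is my own name for it:
ternary_vector encode(const int* v, int n)  // n <= 32, elements in {-1, 0, 1}
{
    ternary_vector t = { 0, 0 };
    for (int i = 0; i < n; ++i) {
        if (v[i] != 0) t.value |= 1u << i;
        if (v[i] < 0)  t.sign  |= 1u << i;
    }
    return t;
}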

Fast dot product for a very special case

Given a vector X of size L, where every scalar element of X is from the binary set {0,1}, find the dot product z = dot(X,Y), where vector Y of size L consists of integer-valued elements. I suspect there must be a very fast way to do this.
Let's say we have L=4; X[L]={1, 0, 0, 1}; Y[L]={-4, 2, 1, 0} and we have to find z=X[0]*Y[0] + X[1]*Y[1] + X[2]*Y[2] + X[3]*Y[3] (which in this case will give us -4).
It is obvious that X can be represented using binary digits, e.g. an integer type int32 for L=32. Then all we have to do is find the dot product of this integer with an array of 32 integers. Do you have any ideas or suggestions on how to do it very fast?
This would really require profiling, but here's an alternative you might want to consider:
int result=0;
int mask=1;
for ( int i = 0; i < L; i++ ){
if ( X & mask ){
result+=Y[i];
}
mask <<= 1;
}
Typically, bit shifting and bitwise operations are faster than multiplication; however, the if statement might be slower than a multiplication, although with branch prediction and large L my guess is it might be faster. You would really have to profile it, though, to determine whether it results in any speedup.
As has been pointed out in the comments below, unrolling the loop either manually or via a compiler flag (such as "-funroll-loops" on GCC) could also speed this up (eliding the loop condition).
Edit
In the comments below, the following good tweak has been proposed:
int result=0;
for ( int i = 0; i < L; i++ ){
if ( X & 1 ){
result+=Y[i];
}
X >>= 1;
}
Is a suggestion to look into SSE2 helpful? It has dot-product type operations already, plus you can trivially do 4 (or perhaps 8, I forget the register size) simple iterations of your naive loop in parallel.
SSE also has some simple logic-type operations, so it may be able to do additions rather than multiplications without using any conditional operations... again, you'd have to look at what ops are available.
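As a concrete (if speculative) sketch of that idea using SSE2 integer ops: expand four bits of X into all-ones/all-zeros lane masks, AND them with Y, and accumulate. It assumes L is a multiple of 4; the function name is mine:
#include <emmintrin.h>
int dot_sse2(unsigned int X, const int* Y, int L)
{
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < L; i += 4) {
        // one 0/1 bit of X per 32-bit lane (lowest bit in lane 0)
        __m128i bits = _mm_set_epi32((X >> (i + 3)) & 1, (X >> (i + 2)) & 1,
                                     (X >> (i + 1)) & 1, (X >> (i + 0)) & 1);
        __m128i mask = _mm_cmpeq_epi32(bits, _mm_set1_epi32(1)); // 1 -> all-ones
        __m128i y    = _mm_loadu_si128(reinterpret_cast<const __m128i*>(Y + i));
        acc = _mm_add_epi32(acc, _mm_and_si128(y, mask));
    }
    int lanes[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}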
Try this:
int result=0;
for ( int i = 0; i < L; i++ ){
result+=Y[i] & (~(((X>>i)&1)-1));
}
This avoids a conditional statement and uses bitwise operators to mask the scalar value with either zeros or ones.
Since size explicitly doesn’t matter, I think the following is probably the most efficient general-purpose code:
int result = 0;
for (size_t i = 0; i < 32; ++i)
result += Y[i] & -X[i];
Bit-encoding X just doesn't bring anything to the table (even if the loop may potentially terminate earlier, as @Mathieu correctly noted). But omitting the if inside the loop does.
Of course, loop unrolling can speed this up drastically, as others have noted.
This solution is identical to, but slightly faster (by my test) than, Micheal Aaron's:
long Lev=1;
long Result=0;
for (int i=0;i<L;i++) {
if (X & Lev)
Result+=Y[i];
Lev*=2;
}
I thought there was a numerical way to rapidly establish the next set bit in a word, which should improve performance if your X data is very sparse, but I currently cannot find said numerical formulation.
I've seen a number of responses with bit trickery (to avoid branching) but none got the loop right imho :/
Optimizing @Goz's answer:
int result=0;
for (int i = 0, x = X; x > 0; ++i, x>>= 1 )
{
result += Y[i] & -(int)(x & 1);
}
Advantages:
no need to do i bit-shifting operations each time (X>>i)
the loop stops sooner if X contains 0 in higher bits
Now, I do wonder if it runs faster, especially since the premature stop of the for loop might not be as easy for loop unrolling (compared to a compile-time constant).
How about combining a shifting loop with a small lookup table?
int result=0;
for ( int x=X; x!=0; x>>=4 ){
switch (x&15) {
case 0: break;
case 1: result+=Y[0]; break;
case 2: result+=Y[1]; break;
case 3: result+=Y[0]+Y[1]; break;
case 4: result+=Y[2]; break;
case 5: result+=Y[0]+Y[2]; break;
case 6: result+=Y[1]+Y[2]; break;
case 7: result+=Y[0]+Y[1]+Y[2]; break;
case 8: result+=Y[3]; break;
case 9: result+=Y[0]+Y[3]; break;
case 10: result+=Y[1]+Y[3]; break;
case 11: result+=Y[0]+Y[1]+Y[3]; break;
case 12: result+=Y[2]+Y[3]; break;
case 13: result+=Y[0]+Y[2]+Y[3]; break;
case 14: result+=Y[1]+Y[2]+Y[3]; break;
case 15: result+=Y[0]+Y[1]+Y[2]+Y[3]; break;
}
Y+=4;
}
The performance of this will depend on how good the compiler is at optimising the switch statement, but in my experience they are pretty good at that nowadays....
There is probably no general answer to this question. You need to profile your code under all the different targets. Performance will depend on compiler optimizations such as loop unwinding and SIMD instructions that are available on most modern CPUs (x86, PPC, ARM all have their own implementations).
For small L, you can use a switch statement instead of a loop. For example, if L = 8, you could have:
int dot8(unsigned int X, const int Y[])
{
switch (X)
{
case 0: return 0;
case 1: return Y[0];
case 2: return Y[1];
case 3: return Y[0]+Y[1];
// ...
case 255: return Y[0]+Y[1]+Y[2]+Y[3]+Y[4]+Y[5]+Y[6]+Y[7];
}
assert(0 && "X too big");
}
And if L = 32, you can write a dot32() function which calls dot8() four times, inlined if possible. (If your compiler refuses to inline dot8(), you could rewrite dot8() as a macro to force inlining.) Added:
int dot32(unsigned int X, const int Y[])
{
return dot8(X >> 0 & 255, Y + 0) +
dot8(X >> 8 & 255, Y + 8) +
dot8(X >> 16 & 255, Y + 16) +
dot8(X >> 24 & 255, Y + 24);
}
This solution, as mikera points out, may have an instruction cache cost; if so, using a dot4() function might help.
Further update: This can be combined with mikera's solution:
static int dot4(unsigned int X, const int Y[])
{
switch (X)
{
case 0: return 0;
case 1: return Y[0];
case 2: return Y[1];
case 3: return Y[0]+Y[1];
//...
case 15: return Y[0]+Y[1]+Y[2]+Y[3];
}
}
Looking at the resulting assembler code with the -S -O3 options with gcc 4.3.4 on CYGWIN, I'm slightly surprised to see that this is automatically inlined within dot32(), with eight 16-entry jump-tables.
But adding __attribute__((__noinline__)) seems to produce nicer-looking assembler.
Another variation is to use fall-throughs in the switch statement, but gcc adds jmp instructions, and it doesn't look any faster.
Edit--Completely new answer: After thinking about the 100 cycle penalty mentioned by Ants Aasma, and the other answers, the above is likely not optimal. Instead, you could manually unroll the loop as in:
int dot(unsigned int X, const int Y[])
{
return (Y[0] & -!!(X & 1<<0)) +
(Y[1] & -!!(X & 1<<1)) +
(Y[2] & -!!(X & 1<<2)) +
(Y[3] & -!!(X & 1<<3)) +
//...
(Y[31] & -!!(X & 1u<<31));
}
This, on my machine, generates 32 x 5 = 160 fast instructions. A smart compiler could conceivably unroll the other suggested answers to give the same result.
But I'm still double-checking.
int result = 0;
for (int i = 0; i < L; i++)
    if (X[i] != 0)
        result += Y[i];
It's quite likely that the time spent loading X and Y from main memory will dominate. If that is the case for your CPU architecture, the algorithm is faster when it loads less, which means that storing X as a bitmask and expanding it into L1 cache will speed up the algorithm as a whole.
Another relevant question is whether your compiler will generate optimal loads for Y. This is highly CPU- and compiler-dependent, but in general it helps if the compiler can see precisely which values are needed when. You could manually unroll the loop; however, if L is a constant, leave it to the compiler:
template<int I> inline void calcZ(int (&X)[L], int(&Y)[L], int &Z) {
Z += X[I] * Y[I]; // Essentially free, as it operates in parallel with loads.
calcZ<I-1>(X,Y,Z);
}
template< > inline void calcZ<0>(int (&X)[L], int(&Y)[L], int &Z) {
Z += X[0] * Y[0];
}
inline int calcZ(int (&X)[L], int(&Y)[L]) {
int Z = 0;
calcZ<L-1>(X,Y,Z);
return Z;
}
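A usage sketch, assuming the compile-time constant L was declared before the templates above (the values are illustrative):
// const int L = 32;    // must be declared before the templates
int X[L] = { 1, 0, 1 };  // remaining elements are zero-initialized
int Y[L] = { -4, 2, 1 };
int z = calcZ(X, Y);     // fully unrolled at compile time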
(Konrad Rudolph questioned this in a comment, wondering about memory use. That's not the real bottleneck in modern computer architectures, bandwidth between memory and CPU is. This answer is almost irrelevant if Y is somehow already in cache. )
You can store your bit vector as a sequence of ints where each int packs a couple of coefficients as bits. Then, the component-wise multiplication is equivalent to bit-and. With this you simply need to count the number of set bits which could be done like this:
inline int count(uint32_t x) {
// the parallel bit-count from the link below
}
int dot(uint32_t a, uint32_t b) {
return count(a & b);
}
For a bit hack to count the set bits see http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
Edit: Sorry I just realized only one of the vectors contains elements of {0,1} and the other one doesn't. This answer only applies to the case where both vectors are limited to coefficients from the set of {0,1}.
Represent X using a linked list of the places where X[i] = 1.
To find the required sum you need O(N) operations, where N is the size of your list.
Well, you want all the bits to get through if it's a 1, and none if it's a 0. So you want to somehow turn 1 into -1 (i.e. 0xffffffff) while 0 stays the same. That's just -X... so you do...
Y & (-X)
for each element ... job done?
Edit 2: To give a code example, you can do something like this and avoid the branch:
int result=0;
for ( int i = 0; i < L; i++ )
{
result+=Y[i] & -(int)((X >> i) & 1);
}
Of course you'd be best off keeping the 1s and 0s in an array of ints and therefore avoiding the shifts.
Edit: It's also worth noting that if the values in Y are 16 bits in size, then you can do 2 of these AND operations per op (4 if you have 64-bit registers). It does mean negating the X values one by one into a larger integer, though.
i.e. YVals = -4, 3 in 16-bit = 0xFFFC, 0x0003... put into one 32-bit value and you get 0xFFFC0003. If you have 1, 0 as the X vals, then you form the bit mask 0xFFFF0000, AND the two together, and you've got 2 results in 1 bitwise-AND op.
Another edit:
If you want the code for the 2nd method, something like this should work (though it takes advantage of unspecified behaviour, so it may not work on every compiler... it works on every compiler I've come across, though):
union int1632
{
    int32_t i32;
    int16_t i16[2];
};
int result=0;
int i;
for ( i = 0; i < (L & ~0x1); i += 2 )
{
    int1632 y1632;
    y1632.i16[0] = Y[i + 0];
    y1632.i16[1] = Y[i + 1];
    int1632 x1632;
    x1632.i16[0] = -(int16_t)((X >> (i + 0)) & 1);
    x1632.i16[1] = -(int16_t)((X >> (i + 1)) & 1);
    int1632 res1632;
    res1632.i32 = y1632.i32 & x1632.i32;
    result += res1632.i16[0] + res1632.i16[1];
}
if ( i < L )
    result += Y[i] & -(int)((X >> i) & 1);
Hopefully the compiler will optimise out the assigns (off the top of my head I'm not sure, but the idea could be reworked so that they definitely are) and give you a small speedup, in that you now only need to do 1 bitwise-AND instead of 2. The speedup would be minor, though...

Is It Possible To Simplify This Branch-Based Vector Math Operation?

I'm trying to achieve something like the following in C++:
class MyVector; // 3 component vector class
MyVector const kA = /* ... */;
MyVector const kB = /* ... */;
MyVector const kC = /* ... */;
MyVector const kD = /* ... */;
// I'd like to shorten the remaining lines, ideally making it readable but less code/operations.
MyVector result = kA;
MyVector const kCMinusD = kC - kD;
if(kCMinusD.X <= 0)
{
result.X = kB.X;
}
if(kCMinusD.Y <= 0)
{
result.Y = kB.Y;
}
if(kCMinusD.Z <= 0)
{
result.Z = kB.Z;
}
Paraphrasing the code into English, I have four 'known' vectors. Two of the vectors have values that I may or may not want in my result, and whether I want them or not is contingent on a branch based on the components of two other vectors.
I feel like I should be able to simplify this code with some matrix math and masking, but I can't wrap my head around it.
For now I'm going with the branch, but I'm curious to know if there's a better way that still would be understandable, and less code-verbose.
Edit:
In reference to Mark's comment, I'll explain what I'm trying to do here.
This code is an excerpt from some spring physics I'm working on. The components are as follows:
kC is the spring's current length, and kD is the minimum spring length.
kA and kB are two sets of spring tensions, each component of which may be unique (i.e., a different spring tension along X, Y, or Z). kA is the spring's tension if it's not fully compressed, and kB is the spring's tension if it IS fully compressed.
I'd like to build up a resultant 'vector' that is simply the amalgamation of kA and kB, dependent on whether the spring is compressed or not.
Depending on the platform you're on, the compiler might be able to optimize statements like
result.x = (kC.x > kD.x) ? kA.x : kB.x;
result.y = (kC.y > kD.y) ? kA.y : kB.y;
result.z = (kC.z > kD.z) ? kA.z : kB.z;
using fsel (floating point select) instructions or conditional moves. Personally, I think the code looks nicer and more concise this way too, but that's subjective.
If the code is really performance critical, and you don't mind changing your vector class to be 4 floats instead of 3, you could use SIMD (e.g. SSE on Intel platforms, VMX on PowerPC) to do the comparison and select the answers. If you went ahead with this, it would look like this (in pseudo-code):
// Set each component of mask to be either 0x0 or 0xFFFFFFFF depending on the comparison
MyVector4 mask = vec_compareLessThan(kC, kD);
// Sets each component of result to either kA or kB's component, depending on whether the bits are set in mask
result = vec_select(kA, kB, mask);
This takes a while getting used to, and it might be less readable initially, but you eventually get used to thinking in SIMD mode.
The usual caveats apply, of course - don't optimize before you profile, etc.
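In real SSE intrinsics the pseudo-code above might look like the following sketch, assuming the vectors are stored as four packed floats with the fourth lane as padding (select_tension is an illustrative name):
#include <xmmintrin.h>
__m128 select_tension(__m128 kA, __m128 kB, __m128 kC, __m128 kD)
{
    __m128 mask = _mm_cmple_ps(kC, kD);        // all-ones lanes where kC <= kD
    return _mm_or_ps(_mm_and_ps(mask, kB),     // take kB where fully compressed
                     _mm_andnot_ps(mask, kA)); // take kA everywhere else
}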
If your vector elements are ints, you can do:
MyVector result;
MyVector const kCMinusD = kC - kD;
int mask = kCMinusD.X >> 31; // either 0 or -1
result.X = (kB.X & mask) | (kA.X & ~mask);
mask = kCMinusD.Y >> 31;
result.Y = (kB.Y & mask) | (kA.Y & ~mask);
mask = kCMinusD.Z >> 31;
result.Z = (kB.Z & mask) | (kA.Z & ~mask);
(note this handles the == 0 case differently, not sure if you care)
If your vector elements are doubles instead of ints, you can do something similar, as the sign bit is in the same place; you just have to convert to integers, do the mask, and convert back. A sketch follows.
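A hedged sketch of that double variant, pulling the sign bit out through an integer copy (memcpy sidesteps strict-aliasing problems; select_by_sign is my own name, and like the int version it treats exactly-zero differently from the original <= test):
#include <cstring>
#include <cstdint>
inline double select_by_sign(double cMinusD, double a, double b)
{
    std::uint64_t ci, ai, bi;
    std::memcpy(&ci, &cMinusD, 8);
    std::memcpy(&ai, &a, 8);
    std::memcpy(&bi, &b, 8);
    // all-ones when cMinusD is negative, zero otherwise
    std::uint64_t mask = (std::uint64_t)0 - (ci >> 63);
    std::uint64_t r = (bi & mask) | (ai & ~mask);
    double out;
    std::memcpy(&out, &r, 8);
    return out;
}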
If you're seeking a clean expression in source more than a runtime optimization, you might consider solving this problem from the "toolbox" point of view. So let's say that on MyVector you defined sign, gt (greater-than), and le (less-than-or-equal-to). Then in two lines:
MyVector const kSignCMinusD = (kC - kD).sign();
result = kSignCMinusD.gt(0) * kA + kSignCMinusD.le(0) * kB;
With operator overloading:
MyVector const kSignCMinusD = (kC - kD).sign();
result = (kSignCMinusD > 0) * kA + (kSignCMinusD <= 0) * kB;
For inspiration here's the MatLab function reference. And obviously there are many C++ vector libraries to choose from with such functions.
You can always go in and optimize further if profiling shows it necessary. But often the biggest performance issues are how well you can see the big picture and reuse intermediate computations.
Since you are only doing a subtraction and then comparing against zero, you can rewrite it as below:
MyVector result;
result.x = kD.x > kC.x ? kB.x : kA.x;
result.y = kD.y > kC.y ? kB.y : kA.y;
result.z = kD.z > kC.z ? kB.z : kA.z;