C++, most efficient way to modulate sign (+/-) in an expression

I have an expression
x += y;
and, based on a boolean, I want to be able to change it to
x -= y;
Of course I could do
if(i){x+=y;} else{x-=y;}
//or
x+=(y*sign); //where sign is either 1 or -1
But if I have to do this iteratively, I want to avoid the extra computation. Is there a more efficient way? Is it possible to modulate the operator?

if (i) {x += y;} else {x -= y;}
is probably going to be as efficient as anything else you can do. y * sign is likely to be fairly expensive (unless the compiler can figure out that sign is guaranteed to be 1 or -1).

The most efficient way to do this iteratively is to precompute the data you need.
So, precomputation:
const YourNumberType increment = (i? y : -y);
Then in your loop:
x += increment;
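Put together, it is just a sketch like the following (count and the plain for loop stand in for whatever iteration you actually have):
const YourNumberType increment = (i ? y : -y);   // decided once, before the loop
for (std::size_t n = 0; n < count; ++n)
{
    x += increment;   // no branch and no multiply inside the loop
}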
EDIT: regarding the question in the comments about how to generate such code, it can be done like this:
#include <stdio.h>

void display( int x ) { printf( "%d\n", x ); }

template< bool isSomething >
inline void advance( int& x, int y );

template<> inline void advance<true>( int& x, int y )  { x += y; }
template<> inline void advance<false>( int& x, int y ) { x -= y; }

template< bool isSomething >
void myFunc()
{
    int x = 314;
    int y = 271;
    for( ;; )
    {
        advance< isSomething >( x, y );   // The nano-optimization.
        display( x );
        if( !( -10000 < x && x < 10000 ) ) { return; }
    }
}

int main( int n, char*[] )
{
    n > 1? myFunc<true>() : myFunc<false>();
}
E.g. with Visual C++ 10.0 this generates two versions of myFunc, one with an add instruction and the other with a sub instruction.
Cheers & hth.,

On a modern pipelined machine you want to avoid branching if at all possible in those cases where performance really does count. When the front of the pipeline hits a branch, the CPU guesses which branch to take and lets the pipeline work ahead based on that guess. Everything is fine if the guess was right. Everything is not so fine if the guess was wrong, particularly so if you're still using one of Intel's processors such as a Pentium 4 that suffered from pipeline bloat. Intel discovered that too much pipelining is not a good thing.
More modern processors still do use pipelining (the Core line has a pipeline length of 14 or so), so avoiding branching still remains one of those good things to do -- when it counts, that is. Don't make your code an ugly, prematurely optimized mess when it doesn't count.
The best thing to do is to first find out where your performance demons lie. It is not at all uncommon for a tiny fraction of one percent of the code base to be the cause of almost all of the CPU usage. Optimizing the 99.9% of the code that doesn't contribute to the CPU usage won't solve your performance problems but it will have a deleterious effect on maintenance.
You optimize once you have found the culprit code, and even then, maybe not. When performance doesn't matter, don't optimize. Performance as a metric runs counter to almost every other code quality metric out there.
So, getting off the soap box, let's suppose that little snippet of code is the performance culprit. Try both approaches and test. Try a third approach you haven't thought of yet and test. Sometimes the code that is the best performance-wise is surprisingly non-intuitive. Think Duff's device.

If i stays constant during the execution of the loop, and y doesn't, move the if outside of the loop.
So instead of...
your_loop {
    y = ...;
    if (i)
        x += y;
    else
        x -= y;
}
...do the following:
if (i) {
    your_loop {
        y = ...;
        x += y;
    }
}
else {
    your_loop {
        y = ...;
        x -= y;
    }
}
BTW, a decent compiler will do that optimization for you, so you may not see the difference when actually benchmarking.

Sounds like you want to avoid branching and multiplication. Let's say the switch i is set to all 1 bits, same size as y, when you want to add, and to 0 when you want to subtract. Then:
x += (y & i) - (y & ~i)
Haven't tested it, this is just to give you the general idea. Bear in mind that this makes the code a lot harder to read in exchange for what would probably be a very small increase in efficiency.
Edit: or, as bdonlan points out in the comments, possibly even a decrease.
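For completeness, here is one way to build such an all-ones/all-zeros mask from a bool (a small sketch; add_flag is a made-up name for your boolean, and negating a 0/1 value is the usual way to get the mask):
int mask = -static_cast<int>(add_flag);   // true -> all 1 bits, false -> all 0 bits
x += (y & mask) - (y & ~mask);            // adds y when mask is all ones, subtracts y otherwise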

I put my suggestion in the comments to the test, and in a simple test bit-fiddling is faster than branching options on an Intel(R) Xeon(R) CPU L5520 @ 2.27GHz, but slower on my laptop Intel Core Duo.
If you are free to give i either the value 0 (for +) or ~0 (for -), these statements are equivalent:
// branching:
if ( i ) sum -= add; else sum += add;
sum += i?-add:add;
sum += (i?-1:1)*add;
// bit fiddling:
sum += (add^i)+(i&1);
sum += (add^i)+(!!i);
sum += (add & ~i) - (add & i);
And as said, one method can beat the other by a factor of 2, depending on CPU and optimization level used.
Conclusion, as always, is that benchmarking is the only way to find out which is faster in your particular situation.
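If you want to reproduce such a comparison yourself, a minimal timing-harness sketch (using <chrono>; swap whichever variant you are testing into the loop body, and keep i volatile so the compiler cannot fold the whole loop away) could look like this:
#include <chrono>
#include <cstdio>

int main()
{
    volatile int i = ~0;   // the +/- switch: 0 means add, ~0 means subtract
    long long sum = 0;     // wide enough that 10^8 iterations cannot overflow
    const int add = 271;

    auto start = std::chrono::steady_clock::now();
    for (long n = 0; n < 100000000L; ++n)
        sum += (add ^ i) + (i & 1);   // the variant under test
    auto stop = std::chrono::steady_clock::now();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    std::printf("sum=%lld  time=%lld ms\n", sum, (long long)ms);
}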

Related

Compute Power(x,y) code clarification C++

int Power(double x, double y) {
    long long power = y;
    double result = 1.0;
    if (y < 0) {
        power = -power;
        x = 1 / x;
    }
    while (power) {
        if (power & 1) {
            result *= x;
        }
        x *= x;
        power >>= 1;
    }
    std::cout << result;
    return result;
}
I am seeking clarification about this code. Here are a few questions I have about it:
When the power is negative where is it multiplying 1/(x*y)?
In the if statement inside the while loop it tests to see if power%2 == 0 but if it is not mod 2 then where is that calculation taking place?
If someone can clarify through an example like x = 2 and y = 4 and show how the program runs to calculate the power to be 16 that will be really helpful.
I am new to programming so trying to understand basic primitive types with these examples. Thank you in advance.
When the power is negative where is it multiplying 1/(x*y)?
Nowhere. When y is negative the code uses power = -power; x = 1 / x;.
In the if statement inside the while loop it tests to see if power%2 == 0 but if it is not mod 2 then where is that calculation taking place?
No, it doesn't. The code does not check whether the power is even; it checks whether the power is odd:
if (power & 1) {
    result *= x;
}
What is left now is an even power, and if you consider that x^(2n) == (x^n)^2, then you will understand why the code continues with:
x *= x;
power >>= 1;
For example, if power was 5, then the odd part is handled by result *= x;, we have 4 left, and x^4 is the same as (x^2)^2, so we can continue the loop with power divided by 2 and x replaced by x^2.
If someone can clarify through an example like x = 2 and y = 4 and show how the program runs to calculate the power to be 16 that will be really helpful.
You should take this opportunity to learn how to use a debugger. If you want to step through code to see what each line does: debugger.
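That said, a quick manual trace for x = 2, y = 4 (so power = 4) looks like this:
// iteration 1: power = 4 (binary 100), low bit 0 -> result stays 1;        x becomes 4,   power becomes 2
// iteration 2: power = 2 (binary 010), low bit 0 -> result stays 1;        x becomes 16,  power becomes 1
// iteration 3: power = 1 (binary 001), low bit 1 -> result = 1 * 16 = 16;  x becomes 256, power becomes 0
// the loop ends, and 16 is printed and returned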
PS: the code is almost "ok-ish". What is not ok at all is the mess with the types: double y is assigned to long long power, and the double result is returned as an int. None of this makes sense, and the function does not correctly calculate floating-point powers. As this is just an exercise, I would recommend using int everywhere and concentrating on integers for now. Last but not least, note that this is code one actually should not write, because someone did it already: std::pow.
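To illustrate that suggestion, a minimal integer-only version of the same squaring idea could look like this (just a sketch, not a drop-in replacement; the name ipow is made up and a non-negative exponent is assumed):
// exponentiation by squaring with integers only (assumes exp >= 0)
long long ipow(long long base, unsigned int exp)
{
    long long result = 1;
    while (exp) {
        if (exp & 1)       // odd exponent: take one factor of base
            result *= base;
        base *= base;      // square the base
        exp >>= 1;         // halve the exponent
    }
    return result;
}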

Multiply numbers which are divisible by 3 and less than 10 with a while loop in c++?

In C++, I should write a program that detects which numbers from 1 to 10 are divisible by 3, then multiplies all of them and prints the result. That means that I should multiply 3, 6, and 9 and print only the result, which is 162, but I should do it by using a "while" loop, not just multiplying the 3 numbers with each other. How should I write the code for this? I attached my attempt at the problem below. Thanks
#include <iostream>
using namespace std;

int main() {
    int x, r;
    int l;
    x = 1;
    r = 0;
    while (x < 10 && x % 3 == 0) {
        r = (3 * x) + 3;
        cout << r;
    }
    cin >> l;
}
Firstly, checking the condition x % 3 == 0 in the loop header brings you out of the while loop right in the first iteration, where x is 1. You need to check that condition inside the loop instead.
Since you wish to store your answer in the variable r, you must initialize it to 1, because multiplying anything by 0 gives 0.
Another important thing is that you need to increment the value of x at each iteration, i.e. to check whether each number in the range 1 to 10 is divisible by 3 or not.
int main()
{
    int x, r;
    int l;
    x = 1;
    r = 1;
    while (x < 10)
    {
        if (x % 3 == 0)
            r = r * x;
        x = x + 1;   // incrementing the value of x
    }
    cout << r;
}
Lastly, I have no idea why you have written the last cin >> l statement. Omit it if not required.
Ok so here are a few hints that hopefully help you solve this:
1. Your approach with two variables (x and r) outside the loop is a good starting point for this.
2. Like I wrote in the comments, you should use *= instead of your formula (I still don't understand how it is related to the problem).
3. Don't check whether x is divisible by 3 in the while condition, because that breaks out of the loop too early.
4. You can delete your l variable because it has no effect at the moment ;)
5. Your output should also happen outside the loop, otherwise it is printed every time the loop runs (in your case this would be 10 times).
I hope this helps ;)
EDIT: Forget about No. 4. I didn't see your comment about the console closing too early.
int main()
{
    int result = 1;   // "result" is better than "r"
    for (int x = 1; x < 10; ++x)
    {
        if (x % 3 == 0)
            result = result * x;
    }
    cout << result;
}
or, more compactly, with some additional knowledge:
for (int x = 3; x < 10; x += 3)   // I know every value we visit is divisible by 3
    result *= x;
or, as it is C++, and for learning purposes, you could do:
#include <vector>       // std::vector
#include <numeric>      // std::accumulate
#include <functional>   // std::multiplies

vector<int> values;                 // a container holding the multiples of 3 that we find
for (int x = 1; x < 10; ++x)        // as usual
    if (!(x % 3))                   // same as x%3 == 0 (note the parentheses: !x%3 would parse as (!x)%3)
        values.push_back(x);        // put the newly found number in the container

// now use a function that multiplies all numbers of the container (1 is the start value)
result = std::accumulate(values.begin(), values.end(), 1, std::multiplies<int>());
// so much fun, also get the sum (0 is the start value, no function needed as add is the default)
int sum = std::accumulate(values.begin(), values.end(), 0);
It's important to remember the difference between = and ==. = sets something to a value while == compares something to a value. You're on the right track with incrementing x and using x as a condition to check your range of numbers. When writing code I usually try and write a "pseudocode" in English to organize my steps and get my logic down. It's also wise to consider using variables that tell you what they are as opposed to just random letters. Imagine if you were coding a game and you just had letters as variables; it would be impossible to remember what is what. When you are first learning to code this really helps a lot. So with that in mind:
/*
- While x is less than 10
- check whether x is divisible by 3
- if it is, multiply it into the running result
- if it's not, just bump the counter
- After my condition is met
- print the result to the screen, pause the screen
*/
Now if we flesh out that pseudocode a little more we'll get a skeletal structure.
int main()
{
    int x = 1;       // value we'll use as a counter
    int result = 1;  // running product we'll print out at the end (start at 1, not 0)
    while (x < 10)   // condition we'll check against
    {
        if (x mod 3 is zero)
        {
            result = result * x;
            increment x
        }
        else
        {
            increment x
        }
    }
    // screen-output the result
    // system pause or cin.get(), use whatever your teacher gave you.
I've given you a lot to work with here; you should be able to figure out what you need from this. Computer science and programming are hard and will require a lot of work. It's important to develop good coding habits now, as it will help you in the future. Coding is a skill like welding; the more you do it, the better you'll get. I often refer to it as the "Blue Collar Science" because it's really a skillset and not just raw knowledge. It's not like studying history or biology (minus biology labs), because those require you to learn things and loosely apply them, whereas programming requires you to actually build something. It's like welding or plumbing, in my opinion.
Additionally, when you come to sites like these, try to read up on how things should be posted, and try to seek the "logic" behind the answer and come up with it on your own as opposed to asking for the answer. People will be more inclined to help you if they think you're working for something instead of asking for a handout (not saying you are, just some advice). Also, take the attitude these guys give you with a grain of salt; computer scientists aren't known to be the world's most personable people. =) Good luck.

Conditional statements with SSE

I'm trying to do some calculations for my game, and I'm trying to calculate the distance between two points. Essentially, I'm using the equation of a circle to see if the points are inside of the radius that I define.
(x - x1)^2 + (y - y1)^2 <= r^2
My question is: how do I evaluate the conditional statement with SSE and interpret the results? So far I have this:
float distSqr4 = (pow(x4 - k->getPosition().x, 2) + pow(y4 - k->getPosition().y, 2));
float distSqr3 = (pow(x3 - k->getPosition().x, 2) + pow(y3 - k->getPosition().y, 2));
float distSqr2 = (pow(x2 - k->getPosition().x, 2) + pow(y2 - k->getPosition().y, 2));
float distSqr1 = (pow(x1 - k->getPosition().x, 2) + pow(y1 - k->getPosition().y, 2));
__m128 distances = _mm_set_ps(distSqr1, distSqr2, distSqr3, distSqr4);
__m128 maxDistSqr = _mm_set1_ps(k->getMaxDistance() * k->getMaxDistance());
__m128 result = _mm_cmple_ps(distances, maxDistSqr);
Once I get the result variable, I get lost. How do I use the result variable that I just got? My plan was, if the condition evaluated turned out to be true, to do some lighting calculations and then draw the pixel on the screen. How do I interpret true vs false in this case?
Any help towards the right direction is greatly appreciated!
My plan was, if the condition evaluated turned out to be true, to do some lighting calculations and then draw the pixel on the screen.
Then you really have little choice but to branch.
The big advantage of doing conditional tests using SSE is that it allows you to write branchless code, which can lead to massive speed improvements. But in your case, you pretty much have to branch because, if I'm understanding you correctly, you never want to output anything on the screen if the condition evaluated to false.
I mean, I guess you could do all of the calculations unconditionally (speculatively) and then just use the result of the conditional to twiddle bits in the pixel values, essentially causing you to draw off of the screen. That would give you branchless code, but it's pretty silly. There is a penalty for branch mispredictions, but it won't be anywhere near as expensive as all of the calculations and drawing code.
In other words, the parallelism you're using SIMD to exploit is exhausted once you have the final result. It's just a simple, scalar compare-and-branch. First you test whether the condition evaluated to true. If not, you'll jump over the code that does the lighting calculations and pixel-drawing. Otherwise, you'll just fall through to execute that code.
The tricky part is that the compiler won't let you use an __m128 variable in a regular old if statement, so you need to "convert" result to an integer that you can use as the basis for a conditional. The easiest way to do that would be the _mm_movemask_epi8 intrinsic.
So you would basically just do:
__m128 distances = _mm_set_ps(distSqr1, distSqr2, distSqr3, distSqr4);
__m128 maxDistSqr = _mm_set1_ps(k->getMaxDistance() * k->getMaxDistance());
__m128 result = _mm_cmple_ps(distances, maxDistSqr);
if (_mm_movemask_epi8(result) == 0xFFFF)   // all 16 byte-mask bits set
{
    // All distances were less-than-or-equal-to the maximum, so
    // go ahead and calculate the lighting and draw the pixels.
    CalcLightingAndDraw(…);
}
This works because _mm_cmple_ps sets each packed double-word to all 1s if the comparison is true, or all 0s if the comparison is false. _mm_movemask_epi8 then collapses that into an integer-sized mask and moves it to an integer value. You then can use that integer value in a normal conditional statement.
Note: With Clang and ICC, you can get away with passing a __m128 value to the _mm_movemask_epi8 intrinsic. On GCC, it insists upon a __m128i value. You can handle this with a cast: _mm_movemask_epi8((__m128i)result).
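As an aside, a slightly simpler variant (a sketch; _mm_movemask_ps is available whenever _mm_cmple_ps is) accepts the __m128 directly and returns a 4-bit mask, one bit per float, so no cast is needed:
int mask = _mm_movemask_ps(result);   // bit n is set if comparison n was true
if (mask == 0xF)
{
    // all four distances were within range
    CalcLightingAndDraw(…);
}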
Of course, I'm assuming here that you are only going to do the drawing if all of the distances are less-than-or-equal-to the maximum distance. If you want to treat each of the four distances independently, then you need to add more conditional tests on the mask:
__m128 distances = _mm_set_ps(distSqr1, distSqr2, distSqr3, distSqr4);
__m128 maxDistSqr = _mm_set1_ps(k->getMaxDistance() * k->getMaxDistance());
__m128 result = _mm_cmple_ps(distances, maxDistSqr);
unsigned condition = _mm_movemask_epi8(result);
if (condition != 0)
{
    // One or more of the distances were less-than-or-equal-to the maximum,
    // so we have something to draw.
    // (Note the element order: _mm_set_ps(distSqr1, ..., distSqr4) puts distSqr4
    // in the lowest lane, so the low nibble of the mask corresponds to distSqr4.)
    if ((condition & 0x000F) != 0)
    {
        // distSqr4 was less-than-or-equal-to the maximum
        CalcLightingAndDraw(distSqr4);
    }
    if ((condition & 0x00F0) != 0)
    {
        // distSqr3 was less-than-or-equal-to the maximum
        CalcLightingAndDraw(distSqr3);
    }
    if ((condition & 0x0F00) != 0)
    {
        // distSqr2 was less-than-or-equal-to the maximum
        CalcLightingAndDraw(distSqr2);
    }
    if ((condition & 0xF000) != 0)
    {
        // distSqr1 was less-than-or-equal-to the maximum
        CalcLightingAndDraw(distSqr1);
    }
}
This won't result in very efficient code, since you have to do so many conditional test-and-branch operations. You might be able to continue parallelizing some of the lighting calculations inside of the main if block. I can't say for sure if this is workable, since I don't have enough details about your algorithm/design.
Otherwise, if you can't see any way to wring more parallelism out of the drawing code, the use of explicit SSE intrinsics isn't buying you much here. You were able to parallelize one comparison (_mm_cmple_ps), but the overhead of setting up for that comparison (_mm_set_ps, which will probably compile into vinsertps or unpcklps+movlhps instructions, assuming the inputs are already in XMM registers) will more than cancel out any trivial gains you might get. You'd arguably be just as well off writing the code like so:
float maxDistSqr = k->getMaxDistance() * k->getMaxDistance();
if (distSqr1 <= maxDistSqr)
{
    CalcLightingAndDraw(distSqr1);
}
if (distSqr2 <= maxDistSqr)
{
    CalcLightingAndDraw(distSqr2);
}
if (distSqr3 <= maxDistSqr)
{
    CalcLightingAndDraw(distSqr3);
}
if (distSqr4 <= maxDistSqr)
{
    CalcLightingAndDraw(distSqr4);
}

Hello, I need to figure out how to change coordinates. Not a GUI

I am using Microsoft Visual C++ Express 2010.
I have a variable:
int x, which represents the position of a video game character (there is a y, of course).
The program loops, and each time it changes x by a couple of places, but x must stay within 0-800, and when it reaches 0 (which is supposed to be the edge of the screen) it rewinds.
I have figured out how to change the value every time the program runs, but how do I make sure that it keeps its value in the 0-800 range, and rewind it when it reaches position 0?
And it has its very own function, outside of main entirely.
Thank you.
x = (x + 800) % 800;
This will keep x within 0..799 (as long as x never drops below -800 in a single step). If you really need 0..800, replace 800 with 801.
Make a direction variable...
int dir = -2;
for (;;) {
    x += dir;
    if (x < 0 || x >= 800) {
        dir *= -1;
        x += dir;
    }
}
First, it's not quite clear exactly what you want. When you say "rewind", do you mean start over at the opposite side again, or turn around and move back in the direction it came from?
Assuming the first, the easy (but somewhat clumsy) way is to just do a comparison and when/if the value goes out of range, adjust as necessary:
x -= increment;
if (x < 0)
    x = 800;
or:
x += increment;
if (x > 800)
    x = 0;
You can also use the remainder operator, but it can be a little bit clumsy to get it entirely correct. When you're going in the positive direction, it's fairly direct and simple, but in the negative direction, it's not -- in this case a negative number is entirely possible, so simple tests like above are needed. If the value only ever goes in the positive direction, so you only care about it becoming greater than the limit, it works fine though.
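For reference, one idiom that handles the negative direction safely (a sketch; it normalizes x back into 0..799 no matter how far out of range it went in either direction) is:
x = ((x % 800) + 800) % 800;   // always lands in 0..799, even for negative x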

Fast dot product for a very special case

Given a vector X of size L, where every scalar element of X is from the binary set {0,1}, the task is to find the dot product z = dot(X,Y), where the vector Y of size L consists of integer-valued elements. I suspect there must be a very fast way to do this.
Let's say we have L=4; X[L]={1, 0, 0, 1}; Y[L]={-4, 2, 1, 0} and we have to find z=X[0]*Y[0] + X[1]*Y[1] + X[2]*Y[2] + X[3]*Y[3] (which in this case will give us -4).
It is obvious that X can be represented using binary digits, e.g. an integer type int32 for L=32. Then, all what we have to do is to find a dot product of this integer with an array of 32 integers. Do you have any idea or suggestions how to do it very fast?
This really would require profiling but an alternative you might want to consider:
int result = 0;
int mask = 1;
for (int i = 0; i < L; i++) {
    if (X & mask) {
        result += Y[i];
    }
    mask <<= 1;
}
Typically bit shifting and bitwise operations are faster than multiplication, however, the if statement might be slower than a multiplication, although with branch prediction and large L my guess is it might be faster. You would really have to profile it, though, to determine if it resulted in any speedup.
As has been pointed out in the comments below, unrolling the loop either manually or via a compiler flag (such as "-funroll-loops" on GCC) could also speed this up (eliding the loop condition).
Edit
In the comments below, the following good tweak has been proposed:
int result = 0;
for (int i = 0; i < L; i++) {
    if (X & 1) {
        result += Y[i];
    }
    X >>= 1;
}
Is a suggestion to look into SSE2 helpful? It has dot-product type operations already, plus you can trivially do 4 (or perhaps 8, I forget the register size) simple iterations of your naive loop in parallel.
SSE also has some simple logic-type operations so it may be able to do additions rather than multiplications without using any conditional operations... again you'd have to look at what ops are available.
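To sketch what that could look like (purely an illustration with SSE2 intrinsics; the dot_sse2 name and the choice to keep X as an array of 0/1 ints rather than packed bits are mine, and L is assumed to be a multiple of 4):
#include <emmintrin.h>   // SSE2

int dot_sse2(const int *X, const int *Y, int L)
{
    __m128i acc  = _mm_setzero_si128();
    __m128i zero = _mm_setzero_si128();
    for (int i = 0; i < L; i += 4)
    {
        __m128i x = _mm_loadu_si128((const __m128i*)(X + i));
        __m128i y = _mm_loadu_si128((const __m128i*)(Y + i));
        __m128i mask = _mm_cmpgt_epi32(x, zero);          // all-ones lanes where X[i] is 1
        acc = _mm_add_epi32(acc, _mm_and_si128(y, mask)); // add Y[i] only in those lanes
    }
    int lanes[4];
    _mm_storeu_si128((__m128i*)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];     // horizontal sum of the accumulator
}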
Try this:
int result = 0;
for (int i = 0; i < L; i++) {
    result += Y[i] & (~(((X >> i) & 1) - 1));
}
This avoids a conditional statement and uses bitwise operators to mask the scalar value with either zeros or ones.
Since size explicitly doesn’t matter, I think the following is probably the most efficient general-purpose code:
int result = 0;
for (size_t i = 0; i < 32; ++i)
    result += Y[i] & -X[i];
Bit-encoding X just doesn't bring anything to the table (even if the loop may potentially terminate earlier, as @Mathieu correctly noted). But omitting the if inside the loop does.
Of course, loop unrolling can speed this up drastically, as others have noted.
This solution is identical to, but slightly faster (by my test) than, Micheal Aaron's:
long Lev = 1;
long Result = 0;
for (int i = 0; i < L; i++) {
    if (X & Lev)
        Result += Y[i];
    Lev *= 2;
}
I thought there was a numerical way to rapidly establish the next set bit in a word, which should improve performance if your X data is very sparse, but I cannot currently find said numerical formulation.
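For what it's worth, the usual trick for walking only the set bits (a sketch, assuming a GCC/Clang-style __builtin_ctz builtin; bits & (bits - 1) clears the lowest set bit) is:
int result = 0;
unsigned int bits = X;              // X packed as bits, as in the question
while (bits)
{
    int i = __builtin_ctz(bits);    // index of the lowest set bit
    result += Y[i];
    bits &= bits - 1;               // clear that bit
}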
I've seen a number of responses with bit trickery (to avoid branching) but none got the loop right imho :/
Optimizing @Goz's answer:
int result = 0;
for (int i = 0, x = X; x > 0; ++i, x >>= 1)
{
    result += Y[i] & -(int)(x & 1);
}
Advantages:
- no need to do i bit-shifting operations each time (X >> i)
- the loop stops sooner if X contains 0 in the higher bits
Now, I do wonder if it runs faster, especially since the premature stop of the for loop might not be as easy for loop unrolling (compared to a compile-time constant).
How about combining a shifting loop with a small lookup table?
int result=0;
for ( int x=X; x!=0; x>>=4 ){
    switch (x&15) {
        case 0: break;
        case 1: result+=Y[0]; break;
        case 2: result+=Y[1]; break;
        case 3: result+=Y[0]+Y[1]; break;
        case 4: result+=Y[2]; break;
        case 5: result+=Y[0]+Y[2]; break;
        case 6: result+=Y[1]+Y[2]; break;
        case 7: result+=Y[0]+Y[1]+Y[2]; break;
        case 8: result+=Y[3]; break;
        case 9: result+=Y[0]+Y[3]; break;
        case 10: result+=Y[1]+Y[3]; break;
        case 11: result+=Y[0]+Y[1]+Y[3]; break;
        case 12: result+=Y[2]+Y[3]; break;
        case 13: result+=Y[0]+Y[2]+Y[3]; break;
        case 14: result+=Y[1]+Y[2]+Y[3]; break;
        case 15: result+=Y[0]+Y[1]+Y[2]+Y[3]; break;
    }
    Y+=4;
}
The performance of this will depend on how good the compiler is at optimising the switch statement, but in my experience they are pretty good at that nowadays....
There is probably no general answer to this question. You need to profile your code under all the different targets. Performance will depend on compiler optimizations such as loop unwinding and SIMD instructions that are available on most modern CPUs (x86, PPC, ARM all have their own implementations).
For small L, you can use a switch statement instead of a loop. For example, if L = 8, you could have:
int dot8(unsigned int X, const int Y[])
{
    switch (X)
    {
        case 0: return 0;
        case 1: return Y[0];
        case 2: return Y[1];
        case 3: return Y[0]+Y[1];
        // ...
        case 255: return Y[0]+Y[1]+Y[2]+Y[3]+Y[4]+Y[5]+Y[6]+Y[7];
    }
    assert(0 && "X too big");
}
And if L = 32, you can write a dot32() function which calls dot8() four times, inlined if possible. (If your compiler refuses to inline dot8(), you could rewrite dot8() as a macro to force inlining.) Added:
int dot32(unsigned int X, const int Y[])
{
    return dot8(X >> 0  & 255, Y + 0)  +
           dot8(X >> 8  & 255, Y + 8)  +
           dot8(X >> 16 & 255, Y + 16) +
           dot8(X >> 24 & 255, Y + 24);
}
This solution, as mikera points out, may have an instruction cache cost; if so, using a dot4() function might help.
Further update: This can be combined with mikera's solution:
static int dot4(unsigned int X, const int Y[])
{
    switch (X)
    {
        case 0: return 0;
        case 1: return Y[0];
        case 2: return Y[1];
        case 3: return Y[0]+Y[1];
        //...
        case 15: return Y[0]+Y[1]+Y[2]+Y[3];
    }
}
Looking at the resulting assembler code with the -S -O3 options with gcc 4.3.4 on CYGWIN, I'm slightly surprised to see that this is automatically inlined within dot32(), with eight 16-entry jump-tables.
But adding __attribute__((__noinline__)) seems to produce nicer-looking assembler.
Another variation is to use fall-throughs in the switch statement, but gcc adds jmp instructions, and it doesn't look any faster.
Edit--Completely new answer: After thinking about the 100 cycle penalty mentioned by Ants Aasma, and the other answers, the above is likely not optimal. Instead, you could manually unroll the loop as in:
int dot(unsigned int X, const int Y[])
{
    return (Y[0]  & -!!(X & 1<<0))  +
           (Y[1]  & -!!(X & 1<<1))  +
           (Y[2]  & -!!(X & 1<<2))  +
           (Y[3]  & -!!(X & 1<<3))  +
           //...
           (Y[31] & -!!(X & 1<<31));
}
This, on my machine, generates 32 x 5 = 160 fast instructions. A smart compiler could conceivably unroll the other suggested answers to give the same result.
But I'm still double-checking.
result = 0;
for (int i = 0; i < L; i++)
    if (X[i] != 0)
        result += Y[i];
It's quite likely that the time spent to load X and Y from main memory will dominate. If this is the case for your CPU architecture, the algorithm is faster when loading less. This means that storing X as a bitmask and expanding it into L1 cache will speed up the algorithm as a whole.
Another relevant question is whether your compiler will generate optimal loads for Y. This is highly CPU- and compiler-dependent. But in general, it helps if the compiler can see precisely which values are needed when. You could manually unroll the loop. However, if L is a constant, leave it to the compiler:
template<int I> inline void calcZ(int (&X)[L], int (&Y)[L], int &Z) {
    Z += X[I] * Y[I];   // Essentially free, as it operates in parallel with loads.
    calcZ<I-1>(X, Y, Z);
}
template<> inline void calcZ<0>(int (&X)[L], int (&Y)[L], int &Z) {
    Z += X[0] * Y[0];
}
inline int calcZ(int (&X)[L], int (&Y)[L]) {
    int Z = 0;
    calcZ<L-1>(X, Y, Z);
    return Z;
}
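A usage sketch for the above (it assumes L is a compile-time constant that is visible before the calcZ templates, since it is used both as an array bound and as a template argument):
const int L = 32;                       // must be declared before the calcZ templates
int X[L] = { 1, 0, 0, 1 /* ... */ };    // 0s and 1s
int Y[L] = { -4, 2, 1, 0 /* ... */ };   // the integer coefficients
int z = calcZ(X, Y);                    // the recursion is fully unrolled at compile time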
(Konrad Rudolph questioned this in a comment, wondering about memory use. That's not the real bottleneck in modern computer architectures, bandwidth between memory and CPU is. This answer is almost irrelevant if Y is somehow already in cache. )
You can store your bit vector as a sequence of ints where each int packs a couple of coefficients as bits. Then, the component-wise multiplication is equivalent to bit-and. With this you simply need to count the number of set bits which could be done like this:
inline int count(uint32_t x) {
    // see link
}

int dot(uint32_t a, uint32_t b) {
    return count(a & b);
}
For a bit hack to count the set bits see http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
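If you would rather not hand-roll it, count() can also be a thin wrapper over a compiler builtin (a sketch assuming GCC/Clang; C++20 additionally offers std::popcount in <bit>):
inline int count(uint32_t x) {
    return __builtin_popcount(x);   // population count: number of set bits
}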
Edit: Sorry I just realized only one of the vectors contains elements of {0,1} and the other one doesn't. This answer only applies to the case where both vectors are limited to coefficients from the set of {0,1}.
Represent X using a linked list of the places where X[i] = 1.
To find the required sum you then need O(N) operations, where N is the size of your list.
Well, you want all bits to get through if it's a 1 and none if it's a 0. So you want to somehow turn 1 into -1 (i.e. 0xffffffff) while 0 stays the same. That's just -X ... so you do ...
Y & (-X)
for each element ... job done?
Edit2: To give a code example you can do something like this and avoid the branch:
int result = 0;
for (int i = 0; i < L; i++)
{
    result += Y[i] & -(int)((X >> i) & 1);
}
Of course you'd be best off keeping the 1s and 0s in an array of ints and therefore avoiding the shifts.
Edit: It's also worth noting that if the values in Y are 16 bits in size then you can do 2 of these AND operations per operation (4 if you have 64-bit registers). It does mean negating the X values one by one into a larger integer, though.
i.e. YVals = -4, 3 in 16-bit = 0xFFFC, 0x3 ... put into one 32-bit value and you get 0xFFFC0003. If you have 1, 0 as the X vals then you form a bit mask of 0xFFFF0000, AND the two together, and you've got 2 results in 1 bitwise-AND op.
Another edit:
If you want the code for how to do the 2nd method, something like this should work (though it takes advantage of unspecified behaviour, so it may not work on every compiler ... it works on every compiler I've come across, though).
// (needs <cstdint> for int16_t / int32_t)
union int1632
{
    int32_t i32;
    int16_t i16[2];
};

int result = 0;
int i = 0;
for (; i < (L & ~0x1); i += 2)
{
    int1632 y1632;
    y1632.i16[0] = Y[i + 0];
    y1632.i16[1] = Y[i + 1];

    int1632 x1632;
    x1632.i16[0] = -(int16_t)((X >> (i + 0)) & 1);
    x1632.i16[1] = -(int16_t)((X >> (i + 1)) & 1);

    int1632 res1632;
    res1632.i32 = y1632.i32 & x1632.i32;

    result += res1632.i16[0] + res1632.i16[1];
}
if (i < L)
    result += Y[i] & -(int)((X >> i) & 1);
Hopefully the compiler will optimise out the assigns (off the top of my head I'm not sure, but the idea could be re-worked so that they definitely are) and give you a small speed up, in that you now only need to do 1 bitwise-AND instead of 2. The speed up would be minor, though ...