Given a vector X of size L, where every element of X is from the binary set {0,1}, I need to find the dot product z = dot(X,Y), where the vector Y of size L consists of integer-valued elements. I suspect there must be a very fast way to do this.
Let's say we have L=4; X[L]={1, 0, 0, 1}; Y[L]={-4, 2, 1, 0} and we have to find z=X[0]*Y[0] + X[1]*Y[1] + X[2]*Y[2] + X[3]*Y[3] (which in this case will give us -4).
Obviously, X can be represented using binary digits, e.g. a single int32 for L=32. Then all we have to do is find the dot product of this integer with an array of 32 integers. Any ideas or suggestions on how to do this very fast?
This really would require profiling, but here is an alternative you might want to consider:
int result = 0;
int mask = 1;
for ( int i = 0; i < L; i++ ) {
    if ( X & mask ) {
        result += Y[i];
    }
    mask <<= 1;
}
Typically, bit shifts and bitwise operations are faster than multiplication. However, the if statement might be slower than a multiplication, although with branch prediction and large L my guess is it could be faster. You would really have to profile it, though, to determine whether it results in any speedup.
As has been pointed out in the comments below, unrolling the loop either manually or via a compiler flag (such as "-funroll-loops" on GCC) could also speed this up (eliding the loop condition).
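For illustration, a 4-way manual unroll of the loop above might look like this (a sketch assuming L is a multiple of 4):

int result = 0;
for ( int i = 0; i < L; i += 4 ) {
    if ( X & (1u << i) )       result += Y[i];
    if ( X & (1u << (i + 1)) ) result += Y[i + 1];
    if ( X & (1u << (i + 2)) ) result += Y[i + 2];
    if ( X & (1u << (i + 3)) ) result += Y[i + 3];
}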
Edit
In the comments below, the following good tweak has been proposed:
int result = 0;
for ( int i = 0; i < L; i++ ) {
    if ( X & 1 ) {
        result += Y[i];
    }
    X >>= 1;
}
Is a suggestion to look into SSE2 helpful? It already has dot-product-type operations, plus you can trivially do 4 iterations of your naive loop in parallel (its registers are 128 bits, i.e. 4 32-bit integers).
SSE also has some simple logic-type operations, so it may be able to do additions rather than multiplications without using any conditional operations... again, you'd have to look at what ops are available.
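As a concrete illustration of that idea, here is a hedged SSE2 sketch (the function name dot_sse2 is made up, and it assumes L is a multiple of 4 and a 32-bit mask X):

#include <emmintrin.h> // SSE2 intrinsics

int dot_sse2(unsigned X, const int *Y, int L) {
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < L; i += 4) {
        // Turn the next four bits of X into 0 / -1 lane masks.
        __m128i bits = _mm_set_epi32((X >> (i + 3)) & 1, (X >> (i + 2)) & 1,
                                     (X >> (i + 1)) & 1, (X >> i) & 1);
        __m128i mask = _mm_sub_epi32(_mm_setzero_si128(), bits); // 1 -> 0xFFFFFFFF
        __m128i y    = _mm_loadu_si128((const __m128i *)(Y + i));
        acc = _mm_add_epi32(acc, _mm_and_si128(y, mask)); // add Y[i] where the bit is set
    }
    int lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}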
Try this:
int result = 0;
for ( int i = 0; i < L; i++ ) {
    result += Y[i] & (~(((X >> i) & 1) - 1));
}
This avoids a conditional statement and uses bitwise operators to mask the scalar value with either all zeros or all ones.
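To trace one term of that expression: when bit i of X is 1, ((X>>i)&1) - 1 is 0 and ~0 is all ones, so the AND passes Y[i] through unchanged; when the bit is 0, the subtraction yields -1 (all ones), whose complement is 0, so the term contributes nothing.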
Since size explicitly doesn’t matter, I think the following is probably the most efficient general-purpose code:
int result = 0;
for (size_t i = 0; i < 32; ++i)
    result += Y[i] & -X[i];
Bit-encoding X just doesn't bring anything to the table (even if the loop may potentially terminate earlier, as @Mathieu correctly noted), but omitting the if inside the loop does.
Of course, loop unrolling can speed this up drastically, as others have noted.
This solution is identical to, but slightly faster (by my test) than, Michael Aaron's:
long Lev = 1;
long Result = 0;
for (int i = 0; i < L; i++) {
    if (X & Lev)
        Result += Y[i];
    Lev *= 2;
}
I thought there was a numerical way to rapidly establish the next set bit in a word, which should improve performance if your X data is very sparse, but I currently cannot find said numerical formulation.
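For what it's worth, one such formulation is available as a compiler builtin. Here is a sketch using GCC's __builtin_ctz (count trailing zeros) to jump straight to each set bit, which pays off when X is sparse (the function name is just illustrative):

int dot_sparse(unsigned X, const int *Y) {
    int result = 0;
    while (X) {
        result += Y[__builtin_ctz(X)]; // index of the lowest set bit
        X &= X - 1;                    // clear that bit and move on
    }
    return result;
}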
I've seen a number of responses with bit trickery (to avoid branching) but none got the loop right imho :/
Optimizing @Goz's answer:
int result = 0;
for (int i = 0, x = X; x > 0; ++i, x >>= 1)
{
    result += Y[i] & -(int)(x & 1);
}
Advantages:
no need to do i bit-shifting operations each time (X>>i)
the loop stops sooner if X contains 0 in higher bits
Now, I do wonder if it runs faster, especially since the premature stop of the for loop might not be as easy for loop unrolling (compared to a compile-time constant).
How about combining a shifting loop with a small lookup table?
int result = 0;
for ( int x = X; x != 0; x >>= 4 ) {
    switch (x & 15) {
        case  0: break;
        case  1: result += Y[0]; break;
        case  2: result += Y[1]; break;
        case  3: result += Y[0]+Y[1]; break;
        case  4: result += Y[2]; break;
        case  5: result += Y[0]+Y[2]; break;
        case  6: result += Y[1]+Y[2]; break;
        case  7: result += Y[0]+Y[1]+Y[2]; break;
        case  8: result += Y[3]; break;
        case  9: result += Y[0]+Y[3]; break;
        case 10: result += Y[1]+Y[3]; break;
        case 11: result += Y[0]+Y[1]+Y[3]; break;
        case 12: result += Y[2]+Y[3]; break;
        case 13: result += Y[0]+Y[2]+Y[3]; break;
        case 14: result += Y[1]+Y[2]+Y[3]; break;
        case 15: result += Y[0]+Y[1]+Y[2]+Y[3]; break;
    }
    Y += 4;
}
The performance of this will depend on how good the compiler is at optimising the switch statement, but in my experience they are pretty good at that nowadays.
There is probably no general answer to this question. You need to profile your code under all the different targets. Performance will depend on compiler optimizations such as loop unwinding and SIMD instructions that are available on most modern CPUs (x86, PPC, ARM all have their own implementations).
For small L, you can use a switch statement instead of a loop. For example, if L = 8, you could have:
int dot8(unsigned int X, const int Y[])
{
    switch (X)
    {
        case 0: return 0;
        case 1: return Y[0];
        case 2: return Y[1];
        case 3: return Y[0]+Y[1];
        // ...
        case 255: return Y[0]+Y[1]+Y[2]+Y[3]+Y[4]+Y[5]+Y[6]+Y[7];
    }
    assert(0 && "X too big");
    return 0; // not reached; avoids falling off the end of a non-void function
}
And if L = 32, you can write a dot32() function which calls dot8() four times, inlined if possible. (If your compiler refuses to inline dot8(), you could rewrite dot8() as a macro to force inlining.) Added:
int dot32(unsigned int X, const int Y[])
{
    return dot8(X >>  0 & 255, Y +  0) +
           dot8(X >>  8 & 255, Y +  8) +
           dot8(X >> 16 & 255, Y + 16) +
           dot8(X >> 24 & 255, Y + 24);
}
This solution, as mikera points out, may have an instruction cache cost; if so, using a dot4() function might help.
Further update: This can be combined with mikera's solution:
static int dot4(unsigned int X, const int Y[])
{
    switch (X)
    {
        case 0: return 0;
        case 1: return Y[0];
        case 2: return Y[1];
        case 3: return Y[0]+Y[1];
        //...
        case 15: return Y[0]+Y[1]+Y[2]+Y[3];
    }
    return 0; // not reached for X < 16
}
Looking at the resulting assembler code with the -S -O3 options with gcc 4.3.4 on CYGWIN, I'm slightly surprised to see that this is automatically inlined within dot32(), with eight 16-entry jump-tables.
But adding __attribute__((__noinline__)) seems to produce nicer-looking assembler.
Another variation is to use fall-throughs in the switch statement, but gcc adds jmp instructions, and it doesn't look any faster.
Edit--Completely new answer: After thinking about the 100 cycle penalty mentioned by Ants Aasma, and the other answers, the above is likely not optimal. Instead, you could manually unroll the loop as in:
int dot(unsigned int X, const int Y[])
{
    return (Y[0]  & -!!(X & 1u <<  0)) +
           (Y[1]  & -!!(X & 1u <<  1)) +
           (Y[2]  & -!!(X & 1u <<  2)) +
           (Y[3]  & -!!(X & 1u <<  3)) +
           //...
           (Y[31] & -!!(X & 1u << 31));
}
This, on my machine, generates 32 x 5 = 160 fast instructions. A smart compiler could conceivably unroll the other suggested answers to give the same result.
But I'm still double-checking.
int result = 0;
for (int i = 0; i < L; i++)
    if (X[i] != 0)
        result += Y[i];
It's quite likely that the time spent to load X and Y from main memory will dominate. If this is the case for your CPU architecture, the algorithm is faster when loading less. This means that storing X as a bitmask and expanding it into L1 cache will speed up the algorithm as a whole.
Another relevant question is whether your compiler will generate optimal loads for Y. This is highly CPU- and compiler-dependent. But in general, it helps if the compiler can see precisely which values are needed when. You could manually unroll the loop. However, if L is a constant, leave it to the compiler:
template<int I> inline void calcZ(int (&X)[L], int (&Y)[L], int &Z) {
    Z += X[I] * Y[I]; // Essentially free, as it operates in parallel with loads.
    calcZ<I-1>(X, Y, Z);
}
template<> inline void calcZ<0>(int (&X)[L], int (&Y)[L], int &Z) {
    Z += X[0] * Y[0];
}
inline int calcZ(int (&X)[L], int (&Y)[L]) {
    int Z = 0;
    calcZ<L-1>(X, Y, Z);
    return Z;
}
(Konrad Rudolph questioned this in a comment, wondering about memory use. That's not the real bottleneck in modern computer architectures; bandwidth between memory and CPU is. This answer is almost irrelevant if Y is somehow already in cache.)
You can store your bit vector as a sequence of ints where each int packs a couple of coefficients as bits. Then, the component-wise multiplication is equivalent to bit-and. With this you simply need to count the number of set bits which could be done like this:
#include <cstdint>

inline int count(uint32_t x) {
    // parallel (SWAR) bit count, from the bithacks page linked below
    x = x - ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    return (((x + (x >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

int dot(uint32_t a, uint32_t b) {
    return count(a & b);
}
For a bit hack to count the set bits see http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
Edit: Sorry I just realized only one of the vectors contains elements of {0,1} and the other one doesn't. This answer only applies to the case where both vectors are limited to coefficients from the set of {0,1}.
Represent X using a linked list of the positions where X[i] = 1.
To find the required sum you need O(N) operations, where N is the size of your list.
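A minimal sketch of that idea, using a std::vector of set positions rather than a literal linked list (the names are illustrative):

#include <cstddef>
#include <vector>

int dot_sparse(const std::vector<int> &ones, const int *Y) {
    int result = 0;
    for (std::size_t k = 0; k < ones.size(); ++k)
        result += Y[ones[k]]; // one addition per stored position
    return result;
}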
Well, you want all bits to get past if it's a 1 and none if it's a 0. So you want to somehow turn 1 into -1 (i.e. 0xFFFFFFFF) and leave 0 the same. That's just -X... so you do...
Y & (-X)
for each element ... job done?
Edit2: To give a code example, you can do something like this and avoid the branch:
int result = 0;
for ( int i = 0; i < L; i++ )
{
    result += Y[i] & -(int)((X >> i) & 1);
}
Of course you'd be best off keeping the 1s and 0s in an array of ints and therefore avoiding the shifts.
Edit: It's also worth noting that if the values in Y are 16 bits in size then you can do two of these AND operations at once (four if you have 64-bit registers). It does mean negating the X values one by one into a larger integer, though.
i.e. YVals = -4, 3 in 16-bit = 0xFFFC, 0x0003... put into one 32-bit word and you get 0xFFFC0003. If you have 1, 0 as the X vals then you form a bit mask of 0xFFFF0000, AND the two together, and you've got 2 results in 1 bitwise-AND op.
Another edit:
If you want the code for the 2nd method, something like this should work (though it takes advantage of unspecified behaviour, so it may not work with every compiler... it works on every compiler I've come across, though).
union int1632
{
    int32_t i32;
    int16_t i16[2];
};

int result = 0;
int i;
for ( i = 0; i < (L & ~0x1); i += 2 )
{
    int1632 y1632;
    y1632.i16[0] = Y[i + 0];
    y1632.i16[1] = Y[i + 1];

    int1632 x1632;
    x1632.i16[0] = -(int16_t)((X >> (i + 0)) & 1);
    x1632.i16[1] = -(int16_t)((X >> (i + 1)) & 1);

    int1632 res1632;
    res1632.i32 = y1632.i32 & x1632.i32;
    result += res1632.i16[0] + res1632.i16[1];
}
if ( i < L )
    result += Y[i] & -(int)((X >> i) & 1);
Hopefully the compiler will optimise out the assigns (off the top of my head I'm not sure, but the idea could be re-worked so that they definitely are) and give you a small speedup, in that you now only need to do 1 bitwise-AND instead of 2. The speedup would be minor, though...
I'm using a C++ compiler but writing code in C (if that helps)
There's a series of numbers:
(-1)^(a-1) / (2a-1) * B^(2a-1)
A and X are user defined... A must be positive, but X can be anything (+,-)...
to decode this sequence... I need to use exponents/powers, but was given some restrictions... I can't make another function, use recursion, or use pow() (among other advanced math functions that come with cmath or math.h).
There were plenty of similar questions, but many answers have used functions and recursion which aren't directly relevant to this question.
This is the code that works perfectly with pow(). I spent a lot of time trying to modify it to replace pow() with my own code, but nothing seems to work; mainly I am getting wrong results. x and j are user-inputted variables:
for (int i = 1; i < j; i++) {
    sum += (pow(-1, i - 1)) / (5 * i - 1) * (pow(x, 5 * i - 1));
}
You can use macros to get around the no-function-calls restriction, as macros generate inline code, which is technically not a function call.
However, in case of more complex operations a macro cannot have a return value, so you need to use some local variable for the result (in case of more than a single expression), like:
int ret;
#define my_pow_notemp(a,b) ((b==0)?1:(b==1)?a:(b==2)?a*a:(b==3)?a*a*a:0)
#define my_pow(a,b)\
{\
    ret=1;\
    if (int(b& 1)) ret*=a;\
    if (int(b& 2)) ret*=a*a;\
    if (int(b& 4)) ret*=a*a*a*a;\
    if (int(b& 8)) ret*=a*a*a*a*a*a*a*a;\
    if (int(b&16)) ret*=a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a;\
    if (int(b&32)) ret*=a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a;\
}
int main()
{
    int a = 2, b = 3, c;
    c = my_pow_notemp(a, b); // c = a^b
    my_pow(a, b); c = ret;   // c = a^b
}
As you can see, you can use my_pow_notemp directly, but the code is hardcoded, so only up to a^3; if you want more, you have to add it to the code. my_pow accepts exponents up to a^63 and is also an example of how to return a value from more complex code inside a macro. Here are some (normal) ways to compute powers in case you need non-integer or negative exponents (but converting them to unrolled code would be insanely hard without loops/recursion):
Power by squaring for negative exponents
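For reference, this is the loop form of power by squaring that the bit tests in the macro above unroll. It uses a loop, so it does not satisfy the question's restrictions; it is shown only to explain the pattern:

int pow_sq(int a, unsigned b) {
    int r = 1;
    while (b) {
        if (b & 1) r *= a; // multiply in the current square when the bit is set
        a *= a;            // a, a^2, a^4, a^8, ...
        b >>= 1;
    }
    return r;
}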
In case you want to get away with recursion and function calls, you can use templates instead of macros, but that is limited to C++.
template<class T> T my_pow(T a, T b)
{
    if (b == 0) return 1;
    if (b == 1) return a;
    return a * my_pow(a, b - 1);
}

int main()
{
    int a = 2, b = 3, c;
    c = my_pow(a, b);
}
As you can see, templates have a return value, so there is no problem even with more complex code (more than a single expression).
To avoid loops you can use LUTs (lookup tables):
int my_pow[4][4] =
{
    {1, 0, 0,  0}, // 0^
    {1, 1, 1,  1}, // 1^
    {1, 2, 4,  8}, // 2^
    {1, 3, 9, 27}, // 3^
};

int main()
{
    int a = 2, b = 3, c;
    c = my_pow[a][b];
}
If you have access to the FPU or advanced math assembly, you can use that, as an asm instruction is not a function call. FPUs usually have log, exp, and pow natively. This, however, limits the code to a specific instruction set!
Here some examples:
How to: pow(real, real) in x86
So when I consider your limitation I think the best way is:
#define my_pow(a,b) ((b==0)?1:(b==1)?a:(b==2)?a*a:(b==3)?a*a*a:0)

int main()
{
    int a = 2, b = 3, c;
    c = my_pow(a, b); // c = a^b
}
This will work for int exponents b up to 3 (if you want more, just add (b==4)?a*a*a*a: ... :0) and for both int and float bases a. If you need a much bigger exponent, use the complicated version with a local temp variable for returning the result.
[Edit1] ultimate single-expression macro with power by squaring up to a^15
#define my_pow(a,b) ( ((b&1)?a:1) * ((b&2)?a*a:1) * ((b&4)?a*a*a*a:1) * ((b&8)?a*a*a*a*a*a*a*a:1) )

int main()
{
    int a = 2, b = 3, c;
    c = my_pow(a, b); // c = a^b
}
In case you want more than a^15, just add the factor ((b&16)?a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a:1) (sixteen a's) and so on for each bit of the exponent.
It is a series. Replace pow() based on the previous iteration. @Bathsheba
The code does not need to call pow(). It can form pow(x, 5 * i - 1) and pow(-1, i - 1), since both have an int exponent based on the iterator i, from the prior loop iteration.
Example:
Let f(x, i) = pow(x, 5 * i - 1)
Then f(x, 1) = x*x*x*x
and f(x, i > 1) = f(x, i-1) * x*x*x*x*x
double power_n1 = 1.0;
double power_x5 = x*x*x*x;
for (int i = 1; i < j + 1; i++) {
    // sum += (pow(-1, i - 1)) / (5 * i - 1) * (pow(x, 5 * i - 1));
    sum += power_n1 / (5 * i - 1) * power_x5;
    power_n1 = -power_n1;
    power_x5 *= x*x*x*x*x;
}
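As a quick sanity check of the first iteration: at i = 1, the loop adds power_n1 / (5*1 - 1) * power_x5 = 1 / 4 * x^4, which is exactly pow(-1, 0) / 4 * pow(x, 4) from the original code.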
The following implementation of square produces a series of cmp/je statements like I would expect of a chained if statement:
int square(int num) {
    if (num == 0) {
        return 0;
    } else if (num == 1) {
        return 1;
    } else if (num == 2) {
        return 4;
    } else if (num == 3) {
        return 9;
    } else if (num == 4) {
        return 16;
    } else if (num == 5) {
        return 25;
    } else if (num == 6) {
        return 36;
    } else if (num == 7) {
        return 49;
    } else {
        return num * num;
    }
}
And the following produces a data table for return:
int square_2(int num) {
    switch (num) {
        case 0: return 0;
        case 1: return 1;
        case 2: return 4;
        case 3: return 9;
        case 4: return 16;
        case 5: return 25;
        case 6: return 36;
        case 7: return 49;
        default: return num * num;
    }
}
Why is gcc unable to optimize the top one into the bottom one?
Dissassembly for reference: https://godbolt.org/z/UP_igi
EDIT: interestingly, MSVC generates a jump table instead of a data table for the switch case. And surprisingly, clang optimizes them to the same result.
The generated code for switch-case conventionally uses a jump table. In this case, the direct return through a look-up table seems to be an optimization making use of the fact that every case here involves a return. Though the standard makes no guarantees to that effect, I would be surprised if a compiler were to generate a series of compares instead of a jump-table for a conventional switch-case.
Now coming to if-else, it is the exact opposite. While switch-case executes in constant time, irrespective of the number of branches, if-else is optimized for a smaller number of branches. Here, you would expect the compiler to basically generate a series of comparisons in the order that you have written them.
So if I had used if-else because I expect most calls to square() to be for 0 or 1 and rarely for other values, then 'optimizing' this to a table-lookup could actually cause my code to run slower than I expect, defeating my purpose for using an if instead of a switch. So although it is debatable, I feel GCC is doing the right thing and clang is being overly aggressive in its optimization.
Someone shared, in the comments, a link where clang does this optimization and generates lookup-table based code for if-else as well. Something notable happens when we reduce the number of cases to just two (and a default) with clang. It once again generates identical code for both if and switch, but this time it switches over to compares and moves instead of the lookup-table approach, for both. This means that even the switch-favoring clang knows that the 'if' pattern is more optimal when the number of cases is small!
In summary, a sequence of compares for if-else and a jump-table for switch-case is the standard pattern that compilers tend to follow and developers tend to expect when they write code. However, for certain special cases, some compilers might choose to break this pattern where they feel it provides better optimization. Other compilers might just choose to stick to the pattern anyway, even if apparently sub-optimal, trusting the developer to know what he wants. Both are valid approaches with their own advantages and disadvantages.
One possible rationale is that if low values of num are more likely, for example always 0, the generated code for the first one might be faster. The generated code for switch takes equal time for all values.
Comparing the best cases, according to this table. See this answer for the explanation of the table.
If num == 0, for "if" you have xor, test, je (with jump), ret. Latency: 1 + 1 + jump. However, xor and test are independent so the actual execution speed would be faster than 1 + 1 cycles.
If num < 7, for "switch" you have mov, cmp, ja (without jump), mov, ret. Latency: 2 + 1 + no jump + 2.
A jump instruction that does not result in a jump is faster than one that does. However, the table does not define the latency for a jump, so it is not clear to me which one is better. It is possible that the last one is always better and GCC is simply not able to optimize it.
In C++, I should write a program that detects which numbers from 1 to 10 are divisible by 3, then multiplies all of them and prints the result. That means I should multiply 3, 6, and 9 and print only the result, which is 162, but I should do it using a "while" loop, not just multiplying the 3 numbers with each other. How should I write this code? I attached my attempt below. Thanks.
#include <iostream>
using namespace std;

int main() {
    int x, r;
    int l;
    x = 1;
    r = 0;
    while (x < 10 && x % 3 == 0) {
        r = (3 * x) + 3;
        cout << r;
    }
    cin >> l;
}
Firstly, checking the condition x % 3 == 0 in the while condition brings you out of the loop right in the first iteration, where x is 1. You need to check the condition inside the loop.
Since you wish to store your answer in the variable r, you must initialize it to 1, since multiplying anything by 0 would give you 0.
Another important thing: you need to increment the value of x at each iteration, i.e. to check whether each number in the range 1 to 10 is divisible by 3 or not.
#include <iostream>
using namespace std;

int main()
{
    int x, r;
    x = 1;
    r = 1;
    while (x < 10)
    {
        if (x % 3 == 0)
            r = r * x;
        x = x + 1; // incrementing the value of x
    }
    cout << r;
}
Lastly, I have no idea why you have written the last cin >> l statement. Omit it if not required.
Ok, so here are a few hints that will hopefully help you solve this:
Your approach with two variables (x and r) outside the loop is a good starting point for this.
Like I wrote in the comments, you should use *= instead of your formula (I still don't understand how it is related to the problem)
Don't check whether x is divisible by 3 inside the while condition, because it leads to breaking out of the loop too early
You can delete your l variable because it has no effect at the moment ;)
Your output should also happen outside the loop, else it is done every time the loop runs (in your case this would be 10 times)
I hope I can help ;)
EDIT: Forget about No. 4. I didn't see your comment about the non-closing console.
#include <iostream>
using namespace std;

int main()
{
    int result = 1; // "result" is better than "r"
    for (int x = 1; x < 10; ++x)
    {
        if (x % 3 == 0)
            result = result * x;
    }
    cout << result;
}
or the loop in short with some additional knowledge:
for (int x = 3; x < 10; x += 3) // x only visits multiples of 3
    result *= x;
or, as it is c++, and for learning purposes, you could do:
vector<int> values; // a container holding integers that will get the multiples of 3
for (int x=1; x < 10; ++x) // as usual
if ( ! x%3 ) // same as x%3 == 0
values.push_back(x); // put the newly found number in the container
// now use a function that multiplies all numbers of the container (1 is start value)
result = std::accumulate(values.begin(), values.end(), 1, multiplies<int>());
// so much fun, also get the sum (0 is the start value, no function needed as add is standard)
int sum = std::accumulate(values.begin(), values.end(), 0);
It's important to remember the difference between = and ==: = sets something to a value, while == compares something to a value. You're on the right track with incrementing x and using x as a condition to check your range of numbers. When writing code I usually try to write pseudocode in English first to organize my steps and get my logic down. It's also wise to use variables whose names tell you what they are, as opposed to just random letters. Imagine if you were coding a game and you just had letters as variables; it would be impossible to remember what is what. When you are first learning to code this really helps a lot. So with that in mind:
/*
- While x is less than 10
    - check the value to see if it's divisible by 3
    - if it is, multiply it into the running result
    - if it's not, bump the counter
- After my condition is met
    - print the result to the screen, pause the screen
*/
Now if we flesh out that pseudocode a little more we'll get a skeletal structure.
int main()
{
    int x = 1;   // value we'll use as a counter
    int sum = 1; // value we'll use to hold the result we print at the end
    while (x < 10) // condition we'll check against
    {
        if (x mod 3 is zero)
        {
            sum = sum * x; // multiply the number into the running result
            increment x
        }
        else
        {
            increment x
        }
    }
    // screen-output the sum
    // system pause or cin.get(); use whatever your teacher gave you
}
I've given you a lot to work with here; you should be able to figure out what you need from this. Computer science and programming are hard and require a lot of work. It's important to develop good coding habits and form now, as it will help you in the future. Coding is a skill, like welding: the more you do it, the better you'll get. I often refer to it as the "blue collar science" because it's really a skillset and not just raw knowledge. It's not like studying history or biology (minus biology labs), because those require you to learn things and loosely apply them, whereas programming requires you to actually build something. It's like welding or plumbing, in my opinion.
Additionally, when you come to sites like these, try to read up on how things should be posted, and try to seek the "logic" behind the answer and come up with it on your own, as opposed to asking for the answer. People will be more inclined to help you if they think you're working for something instead of asking for a handout (not saying you are, just some advice). Also, take the attitude these guys give you with a grain of salt; computer scientists aren't known to be the world's most personable people. =) Good luck.
I am spending my evening doing some programming problems from Kattis. There is one part of the problem "4 thought" that I am stuck on.
Given a number, the program is supposed to return the operations (+, -, * or /) required between 4 fours to achieve that number.
For example, the input
9
would result in the output
4 + 4 + 4 / 4 = 9
My solution (not efficient, but simple) is to evaluate all possible ways to combine the operators above and see if any of the combinations achieve the wanted result.
To do this I have written the function seen below. It takes in an array of chars which are the operators to be evaluated (uo[3], could look like {+, /, *}), and the wanted result as an integer (expRes).
bool check(char uo[3], int expRes) {
    int res = 4;
    for (int oPos = 2; oPos >= 0; oPos--) {
        switch (uo[oPos]) {
            case '+': res += 4; break;
            case '-': res -= 4; break;
            case '*': res *= 4; break;
            case '/': res /= 4; break;
        }
    }
    return res == expRes;
}
I realized that this "sequential" approach comes with a problem: it doesn't follow the order of operations. If I was to call the function with
uo = {+, -, /}
and
expRes = 7 it would return false since 4 + 4 = 8, 8 - 4 = 4, 4 / 4 = 1.
The real answer is obviously true, since 4 + 4 - 4 / 4 = 7.
Can any of you think of a way to rewrite the function so that the evaluation follows the order of operations?
Thanks in advance!
It's an easy problem if you look at it.
You are restricted to four 4's with three operators in between, so you already know your search space. One solution is to generate the complete search space, which is n^3 = 4^3 = 64 total equations, where n is the number of available operators. Keep the answers to these as <key, value> pairs so that looking up a test-case input is O(1).
Step-wise you'd:
Generate Complete Sequence and store them as key, value pairs
Take Input from test cases
Check if key exists, if yes print the sequence, else print that the sequence doesn't exist
The solution would take 64 * 1000 operations, which can easily be computed within a second and avoids the Time Limit Exceeded error that these competitions usually have.
in Code form (most of it is incomplete):
// C++ syntax
#include <cstdio>
#include <map>
#include <string>
using namespace std;

map<int, string> mp;

void generateAll() {
    // generate all equations
}

int main() {
    generateAll();
    int n, t;
    scanf("%d", &t);
    while (t--) {
        scanf("%d", &n);
        if (mp.find(n) != mp.end()) {
            // an equation exists for the input
        } else {
            // no equation exists for the input
        }
    }
}
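For the precedence issue raised in the question itself, generateAll() would need an evaluator that applies * and / before + and -. A hedged sketch (not part of the original answer) that folds the multiplicative operators into their neighbouring 4s first, then applies the additive ones:

bool check(const char uo[3], int expRes) {
    int terms[4] = {4, 4, 4, 4}; // operands left after folding * and /
    char addops[3];
    int nterms = 0, nops = 0;
    for (int k = 0; k < 3; ++k) {
        if (uo[k] == '*')      terms[nterms] *= 4;
        else if (uo[k] == '/') terms[nterms] /= 4;
        else { addops[nops++] = uo[k]; ++nterms; } // '+' or '-' starts a new term
    }
    int res = terms[0];
    for (int k = 0; k < nops; ++k)
        res = (addops[k] == '+') ? res + terms[k + 1] : res - terms[k + 1];
    return res == expRes;
}

With this, check({'+', '-', '/'}, 7) returns true, matching 4 + 4 - 4 / 4 = 7.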
I have an expression
x += y;
and, based on a boolean, I want to be able to change it to
x -= y;
Of course I could do
if(i){x+=y;} else{x-=y;}
//or
x+=(y*sign); //where sign is either 1 or -1
But if I have to do this iteratively, I want to avoid the extra computation. Is there a more efficient way? Is it possible to modulate the operator?
if (i) {x += y;} else {x -= y;}
is probably going to be as efficient as anything else you can do. y * sign is likely to be fairly expensive (unless the compiler can figure out that y is guaranteed to be 1 or -1).
The most efficient way to do this iteratively is to precompute the data you need.
So, precomputation:
const YourNumberType increment = (i? y : -y);
Then in your loop:
x += increment;
EDIT: re the question in the comments about how to generate such code, like this:
#include <stdio.h>

void display( int x ) { printf( "%d\n", x ); }

template< bool isSomething >
inline void advance( int& x, int y );

template<> inline void advance<true>( int& x, int y ) { x += y; }
template<> inline void advance<false>( int& x, int y ) { x -= y; }

template< bool isSomething >
void myFunc()
{
    int x = 314;
    int y = 271;
    for( ;; )
    {
        advance< isSomething >( x, y ); // The nano-optimization.
        display( x );
        if( !( -10000 < x && x < 10000 ) ) { return; }
    }
}

int main( int n, char*[] )
{
    n > 1 ? myFunc<true>() : myFunc<false>();
}
E.g. with Visual C++ 10.0 this generates two versions of myFunc, one with an add instruction and the other with a sub instruction.
Cheers & hth.,
On a modern pipelined machine you want to avoid branching if at all possible, in those cases where performance really does count. When the front of the pipeline hits a branch, the CPU guesses which branch will be taken and lets the pipeline work ahead based on that guess. Everything is fine if the guess was right. Everything is not so fine if the guess was wrong, particularly so if you're still using one of Intel's processors such as the Pentium 4, which suffered from pipeline bloat. Intel discovered that too much pipelining is not a good thing.
More modern processors still do use pipelining (the Core line has a pipeline length of 14 or so), so avoiding branching still remains one of those good things to do -- when it counts, that is. Don't make your code an ugly, prematurely optimized mess when it doesn't count.
The best thing to do is to first find out where your performance demons lie. It is not at all uncommon for a tiny fraction of one percent of the code base to be the cause of almost all of the CPU usage. Optimizing the 99.9% of the code that doesn't contribute to the CPU usage won't solve your performance problems but it will have a deleterious effect on maintenance.
You optimize once you have found the culprit code, and even then, maybe not. When performance doesn't matter, don't optimize. Performance as a metric runs counter to almost every other code quality metric out there.
So, getting off the soap box, let's suppose that little snippet of code is the performance culprit. Try both approaches and test. Try a third approach you haven't thought of yet and test. Sometimes the code that is the best performance-wise is surprisingly non-intuitive. Think Duff's device.
If i stays constant during the execution of the loop, and y doesn't, move the if outside of the loop.
So instead of...
your_loop {
    y = ...;
    if (i)
        x += y;
    else
        x -= y;
}
...do the following....
if (i) {
    your_loop {
        y = ...;
        x += y;
    }
}
else {
    your_loop {
        y = ...;
        x -= y;
    }
}
BTW, a decent compiler will do that optimization (loop unswitching) for you, so you may not see the difference when actually benchmarking.
Sounds like you want to avoid branching and multiplication. Let's say the switch i is set to all 1 bits, same size as y, when you want to add, and to 0 when you want to subtract. Then:
x += (y & i) - (y & ~i)
Haven't tested it, this is just to give you the general idea. Bear in mind that this makes the code a lot harder to read in exchange for what would probably be a very small increase in efficiency.
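To trace it with y = 5: when i is all ones, (y & i) - (y & ~i) = 5 - 0 = 5, so x gains y; when i is 0, it is 0 - 5 = -5, so x loses y.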
Edit: or, as bdonlan points out in the comments, possibly even a decrease.
I put my suggestion in the comments to the test, and in a simple test bit-fiddling is faster than the branching options on an Intel(R) Xeon(R) CPU L5520 @ 2.27GHz, but slower on my laptop's Intel Core Duo.
If you are free to give i either the value 0 (for +) or ~0 (for -), these statements are equivalent:
// branching:
if ( i ) sum -= add; else sum += add;
sum += i ? -add : add;
sum += (i ? -1 : 1) * add;
// bit fiddling:
sum += (add ^ i) + (i & 1);
sum += (add ^ i) + (!!i);
sum += (~i & add) - (i & add);
And as said, one method can beat the other by a factor of 2, depending on CPU and optimization level used.
Conclusion, as always, is that benchmarking is the only way to find out which is faster in your particular situation.
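If you do benchmark, a minimal harness sketch along these lines (std::chrono, so C++11; the names are illustrative) times a variant over a fixed number of iterations:

#include <chrono>

template <class F>
long long micros(F f, int iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (int k = 0; k < iters; ++k)
        f(); // the variant under test
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

In practice you would also want to consume each result (e.g. accumulate it into a volatile) so the optimizer cannot discard the work being measured.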