I'm trying to achieve something like the following in C++:
class MyVector; // 3-component vector class
MyVector const kA = /* ... */;
MyVector const kB = /* ... */;
MyVector const kC = /* ... */;
MyVector const kD = /* ... */;
// I'd like to shorten the remaining lines, ideally keeping them readable while using less code/fewer operations.
MyVector result = kA;
MyVector const kCMinusD = kC - kD;
if (kCMinusD.X <= 0)
{
    result.X = kB.X;
}
if (kCMinusD.Y <= 0)
{
    result.Y = kB.Y;
}
if (kCMinusD.Z <= 0)
{
    result.Z = kB.Z;
}
Paraphrasing the code into English, I have four 'known' vectors. Two of the vectors have values that I may or may not want in my result, and whether I want them or not is contingent on a branch based on the components of two other vectors.
I feel like I should be able to simplify this code with some matrix math and masking, but I can't wrap my head around it.
For now I'm going with the branches, but I'm curious to know if there's a better way that would still be understandable and less verbose.
Edit:
In reference to Mark's comment, I'll explain what I'm trying to do here.
This code is an excerpt from some spring physics I'm working on. The components are as follows:
kC is the spring's current length, and kD is the minimum spring length.
kA and kB are two sets of spring tensions, where each component may be unique (i.e., a different spring tension along X, Y, and Z). kA is the spring's tension if it's not fully compressed, and kB is its tension if it IS fully compressed.
I'd like to build up a resultant 'vector' that is simply the amalgamation of kA and kB, depending on whether the spring is compressed or not along each axis.
Depending on the platform you're on, the compiler might be able to optimize statements like
result.x = (kC.x > kD.x) ? kA.x : kB.x;
result.y = (kC.y > kD.y) ? kA.y : kB.y;
result.z = (kC.z > kD.z) ? kA.z : kB.z;
using fsel (floating point select) instructions or conditional moves. Personally, I think the code looks nicer and more concise this way too, but that's subjective.
If the code is really performance critical, and you don't mind changing your vector class to hold 4 floats instead of 3, you could use SIMD (e.g. SSE on Intel platforms, VMX on PowerPC) to do the comparison and select the answers. If you went ahead with this, it would look something like this (in pseudo-code):
// Set each component of mask to be either 0x0 or 0xFFFFFFFF depending on the comparison
MyVector4 mask = vec_compareLessThan(kC, kD);
// Sets each component of result to either kA or kB's component, depending on whether the bits are set in mask
result = vec_select(kA, kB, mask);
This takes a while getting used to, and it might be less readable initially, but you eventually get used to thinking in SIMD mode.
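To make the pseudo-code concrete, here is a minimal SSE sketch of the same select, assuming 4-float vectors already live in __m128 registers; the intrinsics are real SSE, but the wrapper function is hypothetical:

#include <xmmintrin.h>

// Take kB's lanes where kC <= kD (spring fully compressed), kA's lanes elsewhere.
__m128 select_tension(__m128 kA, __m128 kB, __m128 kC, __m128 kD)
{
    // Each lane of mask becomes all-ones where kC <= kD, all-zeros otherwise.
    __m128 mask = _mm_cmple_ps(kC, kD);
    // Branchless blend: (mask & kB) | (~mask & kA).
    return _mm_or_ps(_mm_and_ps(mask, kB), _mm_andnot_ps(mask, kA));
}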
The usual caveats apply, of course - don't optimize before you profile, etc.
If your vector elements are ints, you can do:
MyVector result;
MyVector const kCMinusD = kC - kD;
int mask = kCMinusD.X >> 31; // arithmetic shift: either 0 or -1 (all bits set)
result.X = (kB.X & mask) | (kA.X & ~mask);
mask = kCMinusD.Y >> 31;
result.Y = (kB.Y & mask) | (kA.Y & ~mask);
mask = kCMinusD.Z >> 31;
result.Z = (kB.Z & mask) | (kA.Z & ~mask);
(note this handles the == 0 case differently, not sure if you care)
If your vector elements are doubles instead of ints, you can do something similar as the sign bit is in the same place, you just have to convert to integers, do the mask, and convert back.
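As an illustration, here is a sketch of that trick for a single double component, assuming IEEE-754 layout; memcpy is the portable way to reinterpret the bits, and the function and parameter names are purely illustrative:

#include <cstdint>
#include <cstring>

// Selects whenNegative if diff's sign bit is set, otherwise picks otherwise.
double select_by_sign(double diff, double whenNegative, double otherwise)
{
    std::uint64_t d, n, p;
    std::memcpy(&d, &diff, sizeof d);
    std::memcpy(&n, &whenNegative, sizeof n);
    std::memcpy(&p, &otherwise, sizeof p);

    std::uint64_t mask = 0ull - (d >> 63); // all-ones if diff is negative, else zero
    std::uint64_t sel = (n & mask) | (p & ~mask);

    double result;
    std::memcpy(&result, &sel, sizeof result);
    return result;
}

You would call it once per component, e.g. result.X = select_by_sign(kCMinusD.X, kB.X, kA.X);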
If you're seeking a clean expression in source more than a runtime optimization, you might consider solving this problem from the "toolbox" point of view. So let's say that on MyVector you defined sign, gt (greater-than), and le (less-than-or-equal-to). Then in two lines:
MyVector const kSignCMinusD = (kC - kD).sign();
result = kSignCMinusD.gt(0) * kA + kSignCMinusD.le(0) * kB;
With operator overloading:
MyVector const kSignCMinusD = (kC - kD).sign();
result = (kSignCMinusD > 0) * kA + (kSignCMinusD <= 0) * kB;
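For reference, here is a minimal sketch of what those helpers might look like on a simple 3-component struct; this is illustrative rather than any particular library's API, and the overloaded comparison operators in the second variant would simply forward to gt and le:

struct MyVector {
    float X, Y, Z;

    // Component-wise sign: -1, 0, or +1 per component.
    MyVector sign() const {
        auto s = [](float v) { return v > 0.f ? 1.f : (v < 0.f ? -1.f : 0.f); };
        return { s(X), s(Y), s(Z) };
    }
    // Component-wise comparisons producing 0/1 masks.
    MyVector gt(float t) const { return { X > t ? 1.f : 0.f, Y > t ? 1.f : 0.f, Z > t ? 1.f : 0.f }; }
    MyVector le(float t) const { return { X <= t ? 1.f : 0.f, Y <= t ? 1.f : 0.f, Z <= t ? 1.f : 0.f }; }
};

// Component-wise product and sum let the 0/1 masks select components.
inline MyVector operator*(MyVector a, MyVector b) { return { a.X * b.X, a.Y * b.Y, a.Z * b.Z }; }
inline MyVector operator+(MyVector a, MyVector b) { return { a.X + b.X, a.Y + b.Y, a.Z + b.Z }; }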
For inspiration, here's the MATLAB function reference. And obviously there are many C++ vector libraries to choose from with such functions.
You can always go in and optimize further if profiling shows it necessary. But often the biggest performance issues are how well you can see the big picture and reuse intermediate computations.
Since you are only doing the subtraction in order to compare components, you can rewrite it as below:
MyVector result;
result.x = kD.x > kC.x ? kB.x : kA.x;
result.y = kD.y > kC.y ? kB.y : kA.y;
result.z = kD.z > kC.z ? kB.z : kA.z;
Related
I have the following Python code written with NumPy:
> r = 3
> y, x = numpy.ogrid[-r : r + 1, -r : r + 1]
> mask = numpy.sqrt(x**2 + y**2)
> mask
array([[4.24264, 3.60555, 3.16228, 3.00000, 3.16228, 3.60555, 4.24264],
[3.60555, 2.82843, 2.23607, 2.00000, 2.23607, 2.82843, 3.60555],
[3.16228, 2.23607, 1.41421, 1.00000, 1.41421, 2.23607, 3.16228],
[3.00000, 2.00000, 1.00000, 0.00000, 1.00000, 2.00000, 3.00000],
[3.16228, 2.23607, 1.41421, 1.00000, 1.41421, 2.23607, 3.16228],
[3.60555, 2.82843, 2.23607, 2.00000, 2.23607, 2.82843, 3.60555],
[4.24264, 3.60555, 3.16228, 3.00000, 3.16228, 3.60555, 4.24264]])
Now I am building the mask in Eigen, where I need to broadcast a row and a column vector together. Unfortunately, that is not allowed, so I made the following workaround:
int len = 1 + 2 * r;
MatrixXf mask = MatrixXf::Zero(len, len);
ArrayXf squared_yx = ArrayXf::LinSpaced(len, -r, r).square();
mask = (mask.array().colwise() + squared_yx) +
       (mask.array().rowwise() + squared_yx.transpose());
mask = mask.cwiseSqrt();
cout << "mask" << endl << mask << endl;
4.24264 3.60555 3.16228 3 3.16228 3.60555 4.24264
3.60555 2.82843 2.23607 2 2.23607 2.82843 3.60555
3.16228 2.23607 1.41421 1 1.41421 2.23607 3.16228
3 2 1 0 1 2 3
3.16228 2.23607 1.41421 1 1.41421 2.23607 3.16228
3.60555 2.82843 2.23607 2 2.23607 2.82843 3.60555
4.24264 3.60555 3.16228 3 3.16228 3.60555 4.24264
It works. But I wonder if there is another and shorter way to do it. Therefore my question is how to broadcast Row and Column Vector in Eigen C++?
System Info:

Tool     Version
Eigen    3.3.7
GCC      9.4.0
Ubuntu   20.04.4 LTS
I think the easiest approach (as in: most readable) is replicate.
int r = 3;
int len = 1 + 2 * r;
const auto& squared_yx = Eigen::ArrayXf::LinSpaced(len, -r, r).square();
const auto& bcast = squared_yx.replicate(1, len);
Eigen::MatrixXf mask = (bcast + bcast.transpose()).sqrt();
Note that what you do is numerically unstable (for large r); the hypot function exists to work around these issues. So even your Python code could be better:
r = 3
y, x = numpy.ogrid[-r : r + 1, -r : r + 1]
mask = numpy.hypot(x, y)
To achieve the same in Eigen, do something like this:
const auto& yx = Eigen::ArrayXf::LinSpaced(len, -r, r);
const auto& bcast = yx.replicate(1, len);
Eigen::MatrixXf mask = bcast.binaryExpr(bcast.transpose(),
    [](float x, float y) noexcept -> float {
        return std::hypot(x, y);
    });
Eigen's documentation on binaryExpr is currently broken, so this is hard to find.
To be fair, you will probably never run into stability issues in this particular case because you will run out of memory first. However, I'd still like to point this out because a naive sqrt(x**2 + y**2) is always a bit of a red flag. Also, in Python, hypot might still be worth it from a performance point of view because it reduces the number of temporary memory allocations and function calls.
BinaryExpr
The documentation on binaryExpr is missing, I assume because the parser has trouble with Eigen's C++ code. In any case, one can find it indirectly as CwiseBinaryOp and similarly CwiseUnaryOp, CwiseNullaryOp and CwiseTernaryOp.
The use looks a bit weird but is pretty simple. It takes a functor (either a struct with operator(), a function pointer, or a lambda) and applies this element-wise.
The unary operation makes this pretty clear. If Eigen::Array.sin() didn't exist, you could write this:
array.unaryExpr([](double x) -> double { return std::sin(x); }) to achieve exactly the same effect.
The binary and ternary versions take one or two more Eigen expressions as the second and third argument to the function. That's what I did above. The nullary version is explained in the documentation in its own chapter.
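To make the pattern concrete, here is a small self-contained sketch using the standard Eigen 3.3 API, showing unaryExpr and binaryExpr side by side:

#include <Eigen/Dense>
#include <cmath>
#include <iostream>

int main()
{
    Eigen::ArrayXf a = Eigen::ArrayXf::LinSpaced(5, 0.f, 4.f);
    Eigen::ArrayXf b = Eigen::ArrayXf::Constant(5, 2.f);

    // Element-wise sin via a lambda; equivalent to a.sin().
    Eigen::ArrayXf s = a.unaryExpr([](float x) { return std::sin(x); });

    // Element-wise hypot over two same-shaped arrays.
    Eigen::ArrayXf h = a.binaryExpr(b, [](float x, float y) { return std::hypot(x, y); });

    std::cout << s.transpose() << "\n" << h.transpose() << "\n";
}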
Use of auto
Eigen's documentation is right to warn about auto, but only in the sense that you have to know what you are doing. It is important to realize that auto on an Eigen expression just keeps the expression around. It does not evaluate it into a vector or matrix.
This is fine and very useful if you want to compose a complex expression that would be hard to read when put in a single statement. In my code above, there are no temporary memory allocations and no floating point computations take place until the final expression is assigned to the matrix.
As long as the programmer knows that these are expressions and not final matrices, everything is fine.
I think the main take-away is that use of auto with Eigen should be limited to short-lived (as in: inside a single function) scalar expressions. Any coding style that uses auto for everything will quickly break or be hard to read with Eigen. But it can be used safely and make the code more readable in the process without sacrificing performance in the same way as evaluating into matrices would.
As for why I chose const auto& instead of auto or const auto: Mostly force of habit that is unrelated to the task at hand. I mostly do it for instances like this:
const Eigen::VectorXf& Foo::get_bar();

void quz(Foo& foo)
{
    const auto& bar = foo.get_bar();
}
Here, bar will remain a reference whereas auto would create a copy. If the return value is changed, everything stays valid.
Eigen::VectorXf Foo::get_bar();

void quz(Foo& foo)
{
    const auto& bar = foo.get_bar();
}
Now a copy is created anyway. But everything continues to work, because assigning the return value to a const reference extends the lifetime of the temporary. So this may look like a dangling reference, but it is not.
I'm trying to do some calculations for my game, and I'm trying to calculate the distance between two points. Essentially, I'm using the equation of a circle to see if the points are inside of the radius that I define.
(x - x1)^2 + (y - y1)^2 <= r^2
My question is: how do I evaluate the conditional statement with SSE and interpret the results? So far I have this:
float distSqr4 = (pow(x4 - k->getPosition().x, 2) + pow(y4 - k->getPosition().y, 2));
float distSqr3 = (pow(x3 - k->getPosition().x, 2) + pow(y3 - k->getPosition().y, 2));
float distSqr2 = (pow(x2 - k->getPosition().x, 2) + pow(y2 - k->getPosition().y, 2));
float distSqr1 = (pow(x1 - k->getPosition().x, 2) + pow(y1 - k->getPosition().y, 2));
__m128 distances = _mm_set_ps(distSqr1, distSqr2, distSqr3, distSqr4);
__m128 maxDistSqr = _mm_set1_ps(k->getMaxDistance() * k->getMaxDistance());
__m128 result = _mm_cmple_ps(distances, maxDistSqr);
Once I get the result variable, I get lost. How do I use the result variable that I just got? My plan was, if the condition evaluated turned out to be true, to do some lighting calculations and then draw the pixel on the screen. How do I interpret true vs false in this case?
Any help towards the right direction is greatly appreciated!
My plan was, if the condition evaluated turned out to be true, to do some lighting calculations and then draw the pixel on the screen.
Then you really have little choice but to branch.
The big advantage of doing conditional tests using SSE is that it allows you to write branchless code, which can lead to massive speed improvements. But in your case, you pretty much have to branch because, if I'm understanding you correctly, you never want to output anything on the screen if the condition evaluated to false.
I mean, I guess you could do all of the calculations unconditionally (speculatively) and then just use the result of the conditional to twiddle bits in the pixel values, essentially causing you to draw off of the screen. That would give you branchless code, but it's pretty silly. There is a penalty for branch mispredictions, but it won't be anywhere near as expensive as all of the calculations and drawing code.
In other words, the parallelism you're using SIMD to exploit is exhausted once you have the final result. It's just a simple, scalar compare-and-branch. First you test whether the condition evaluated to true. If not, you'll jump over the code that does the lighting calculations and pixel-drawing. Otherwise, you'll just fall through to execute that code.
The tricky part is that the compiler won't let you use an __m128 variable in a regular old if statement, so you need to "convert" result to an integer that you can use as the basis for a conditional. The easiest way to do that would be the _mm_movemask_epi8 intrinsic.
So you would basically just do:
__m128 distances = _mm_set_ps(distSqr1, distSqr2, distSqr3, distSqr4);
__m128 maxDistSqr = _mm_set1_ps(k->getMaxDistance() * k->getMaxDistance());
__m128 result = _mm_cmple_ps(distances, maxDistSqr);
if (_mm_movemask_epi8(_mm_castps_si128(result)) == 0xFFFF)
{
    // All distances were less-than-or-equal-to the maximum, so
    // go ahead and calculate the lighting and draw the pixels.
    CalcLightingAndDraw(…);
}
This works because _mm_cmple_ps sets each packed 32-bit element to all 1s if the comparison is true, or all 0s if it is false. _mm_movemask_epi8 then collapses the most significant bit of each byte into a 16-bit integer mask, so the value is 0xFFFF exactly when all four lanes compared true. You can then use that integer value in a normal conditional statement.
Note: With Clang and ICC, you can get away with passing a __m128 value directly to the _mm_movemask_epi8 intrinsic. GCC insists on a __m128i value, and the portable way to satisfy it is the no-op cast intrinsic _mm_castps_si128, as used above.
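As an aside, here is a sketch of the same all-lanes test using _mm_movemask_ps, which accepts a __m128 directly and yields one bit per float lane:

#include <xmmintrin.h>

// True when every squared distance is <= the squared maximum.
bool all_within(__m128 distances, __m128 maxDistSqr)
{
    __m128 cmp = _mm_cmple_ps(distances, maxDistSqr);
    return _mm_movemask_ps(cmp) == 0xF; // bit k is set when lane k compared true
}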
Of course, I'm assuming here that you are only going to do the drawing if all of the distances are less-than-or-equal-to the maximum distance. If you want to treat each of the four distances independently, then you need to add more conditional tests on the mask:
__m128 distances = _mm_set_ps(distSqr1, distSqr2, distSqr3, distSqr4);
__m128 maxDistSqr = _mm_set1_ps(k->getMaxDistance() * k->getMaxDistance());
__m128 result = _mm_cmple_ps(distances, maxDistSqr);
unsigned condition = _mm_movemask_epi8(_mm_castps_si128(result));
if (condition != 0)
{
    // One or more of the distances were less-than-or-equal-to the maximum,
    // so we have something to draw. Note that _mm_set_ps places its first
    // argument in the highest lane, so distSqr1 maps to the high nibble of
    // the mask and distSqr4 to the low nibble.
    if ((condition & 0x000F) != 0)
    {
        // distSqr4 was less-than-or-equal-to the maximum
        CalcLightingAndDraw(distSqr4);
    }
    if ((condition & 0x00F0) != 0)
    {
        // distSqr3 was less-than-or-equal-to the maximum
        CalcLightingAndDraw(distSqr3);
    }
    if ((condition & 0x0F00) != 0)
    {
        // distSqr2 was less-than-or-equal-to the maximum
        CalcLightingAndDraw(distSqr2);
    }
    if ((condition & 0xF000) != 0)
    {
        // distSqr1 was less-than-or-equal-to the maximum
        CalcLightingAndDraw(distSqr1);
    }
}
This won't result in very efficient code, since you have to do so many conditional test-and-branch operations. You might be able to continue parallelizing some of the lighting calculations inside of the main if block. I can't say for sure if this is workable, since I don't have enough details about your algorithm/design.
Otherwise, if you can't see any way to wring more parallelism out of the drawing code, the use of explicit SSE intrinsics isn't buying you much here. You were able to parallelize one comparison (_mm_cmple_ps), but the overhead of setting up for that comparison (_mm_set_ps, which will probably compile into vinsertps or unpcklps+movlhps instructions, assuming the inputs are already in XMM registers) will more than cancel out any trivial gains you might get. You'd arguably be just as well off writing the code like so:
float maxDistSqr = k->getMaxDistance() * k->getMaxDistance();
if (distSqr1 <= maxDistSqr)
{
    CalcLightingAndDraw(distSqr1);
}
if (distSqr2 <= maxDistSqr)
{
    CalcLightingAndDraw(distSqr2);
}
if (distSqr3 <= maxDistSqr)
{
    CalcLightingAndDraw(distSqr3);
}
if (distSqr4 <= maxDistSqr)
{
    CalcLightingAndDraw(distSqr4);
}
I am trying to implement an IIR filter I have designed in MATLAB in a C++ program, to filter an unwanted signal out of a wave file. The fdatool in MATLAB generated this C header to use (it is a bandstop filter):
#include "tmwtypes.h"
/*
* Expected path to tmwtypes.h
* C:\Program Files (x86)\MATLAB\R2013a Student\extern\include\tmwtypes.h
*/
const int al = 7;
const real64_T a[7] = {
    0.9915141178644, -5.910578456199, 14.71918523779, -19.60023964796,
    14.71918523779, -5.910578456199, 0.9915141178644
};
const int bl = 7;
const real64_T b[7] = {
    1, -5.944230431733, 14.76096188047, -19.60009655976,
    14.67733658492, -5.877069568864, 0.9831002459245
};
After hours of exhausting research, I still can't figure out the proper way to use these values to determine the W values, and then how to use those W values to properly calculate my Y outputs. If anyone has any insight into the order in which these values should be used to do all these conversions, it would be a major help.
All the methods I've developed and tried to this point do not generate a valid wave file; the header values all translate correctly, but everything beyond the header cannot be played by a media player.
Thanks.
IIR filters work this way:
Assuming an array of input samples A, an array of output samples B, feed-forward (numerator) coefficients num[0..N] and feedback (denominator) coefficients den[0..N] normalized so that den[0] = 1, each output sample is:
B[i] = num[0]*A[i] + num[1]*A[i-1] + ... + num[N]*A[i-N]
       - den[1]*B[i-1] - den[2]*B[i-2] - ... - den[N]*B[i-N]
Note that the feedback terms are taken from previously computed outputs B; that recursion is what makes the filter IIR rather than FIR, and it means you cannot simply overwrite A in place as you go, since future outputs still need the input history.
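To tie that to code, here is a minimal Direct Form I sketch in C++ under those definitions; the num/den naming is mine, so map your generated a[] and b[] arrays onto it after checking which one fdatool normalized to start with 1 (that one is the denominator):

#include <vector>
#include <cstddef>

// Filters x through an Nth-order IIR filter with numerator num[0..N]
// and denominator den[0..N], where den[0] is expected to be 1.
std::vector<double> iir_filter(const std::vector<double>& x,
                               const std::vector<double>& num,
                               const std::vector<double>& den)
{
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t n = 0; n < x.size(); ++n) {
        double acc = 0.0;
        for (std::size_t k = 0; k < num.size(); ++k)  // feed-forward taps
            if (n >= k) acc += num[k] * x[n - k];
        for (std::size_t k = 1; k < den.size(); ++k)  // feedback taps
            if (n >= k) acc -= den[k] * y[n - k];
        y[n] = acc / den[0];
    }
    return y;
}

This is the straightforward whole-buffer form; for streaming audio you would normally process sample by sample with small history buffers instead.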
These filter coefs are very violent, are you sure you got them right?
The first one is also symmetrical which probably indicates it's an FIR filter.
It appears to me that you have a 3 pole IIR filter with the coefficients given for an Nth order implementation (as opposed to a series of 2nd order sections). Since this is a band reject (or band pass) the polynomial order is twice the pole count.
I am not sure what you mean by W values, unless you are trying to evaluate the frequency response of this filter.
To calculate the Y values, as you put it, see this link for code on implementing IIR filters. See the Nth order implementation code in particular.
http://www.iowahills.com/A7ExampleCodePage.html
BTW: I assumed these were Nth order coefficients and simulated them. I got a 10 dB notch at 0.05 Pi. Sound about right?
where the numerator taps b0 ... b6 are the header's a[] array (b0 = b6 = 0.9915141178644, and so on), and the denominator taps a0 ... a6 are the header's b[] array (a0 = 1, ..., a6 = 0.9831002459245).
Also, you may want to post a question like this on:
https://dsp.stackexchange.com/
Consider two vectors, A and B, of size n, 7 <= n <= 23. Both A and B consist of -1s, 0s and 1s only.
I need a fast algorithm which computes the inner product of A and B.
So far I've thought of storing the signs and values in separate uint32_ts using the following encoding:
sign 0, value 0 → 0
sign 0, value 1 → 1
sign 1, value 1 → -1.
The C++ implementation I've thought of looks like the following:
struct ternary_vector {
uint32_t sign, value;
};
int inner_product(const ternary_vector & a, const ternary_vector & b) {
    uint32_t psign = a.sign ^ b.sign;    // sign of each product (meaningful only where both values are 1)
    uint32_t pvalue = a.value & b.value; // positions where the product is nonzero
    psign &= pvalue;                     // keep sign bits only at nonzero-product positions
    pvalue ^= psign;                     // clear the negative positions, leaving the positive ones
    return __builtin_popcount(pvalue) - __builtin_popcount(psign);
}
This works reasonably well, but I'm not sure whether it is possible to do it better. Any comment on the matter is highly appreciated.
I like having the 2 uint32_t, but I think your actual calculation is a bit wasteful
Just a few minor points:
I'm not sure about the reference (getting a and b by const &) - this adds a level of indirection compared to putting them on the stack. When the code is this small (a couple of clocks maybe) this is significant. Try passing by value and see what you get
__builtin_popcount can be, unfortunately, very inefficient. I've used it myself, but found that even a very basic implementation I wrote was far faster than this. However - this is dependent on the platform.
Basically, if the platform has a hardware popcount implementation, __builtin_popcount uses it. If not - it uses a very inefficient replacement.
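For reference, here is the classic portable SWAR fallback from the well-known bit-twiddling literature, in case profiling shows the builtin's software path is your bottleneck:

#include <cstdint>

// Portable population count for 32-bit values.
inline int popcount32(std::uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                 // sum bits in pairs
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); // sum pairs into nibbles
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 // sum nibbles into bytes
    return static_cast<int>((v * 0x01010101u) >> 24); // add up the four bytes
}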
The one serious problem here is the reuse of the psign and pvalue variables for the positive and negative vectors. You are doing neither your compiler nor yourself any favors by obfuscating your code in this way.
Would it be possible for you to encode your ternary state in a std::bitset<2> and define the product in terms of and? For example, if your ternary types are:
1 = P = (1, 1)
0 = Z = (0, 0)
-1 = M = (1, 0) or (0, 1)
I believe you could define their product as:
1 * 1 = 1 => P * P = P => (1, 1) & (1, 1) = (1, 1) = P
1 * 0 = 0 => P * Z = Z => (1, 1) & (0, 0) = (0, 0) = Z
1 * -1 = -1 => P * M = M => (1, 1) & (1, 0) = (1, 0) = M
Then the inner product could start by taking the and of the bits of the elements and... I am working on how to add them together.
Edit:
My foolish suggestion did not consider that (-1)(-1) = 1, which cannot be handled by the representation I proposed. Thanks to @user92382 for bringing this up.
Depending on your architecture, you may want to optimize away the temporary bit vectors -- e.g. if your code is going to be compiled to FPGA, or laid out to an ASIC, then a sequence of logical operations will be better in terms of speed/energy/area than storing and reading/writing to two big buffers.
In this case, you can do:
int inner_product(const ternary_vector & a, const ternary_vector & b) {
    return __builtin_popcount( a.value & b.value & ~(a.sign ^ b.sign))
         - __builtin_popcount( a.value & b.value &  (a.sign ^ b.sign));
}
This will lay out very well -- the (a.value & b.value & ... ) can enable/disable an XOR gate, whose output splits into two signed accumulators, with the first pathway NOTed before accumulation.
I have an expression
x += y;
and, based on a boolean, I want to be able to change it to
x -= y;
Of course I could do
if(i){x+=y;} else{x-=y;}
//or
x+=(y*sign); //where sign is either 1 or -1
But if I have to do this iteratively, I want to avoid the extra computation. Is there a more efficient way? Is it possible to modulate the operator?
if (i) {x += y;} else {x -= y;}
is probably going to be as efficient as anything else you can do. y * sign is likely to be fairly expensive (unless the compiler can figure out that y is guaranteed to be 1 or -1).
The most efficient way to do this iteratively is to precompute the data you need.
So, precomputation:
const YourNumberType increment = (i? y : -y);
Then in your loop:
x += increment;
Edit: regarding the question in the comments about how to generate such code, like this:
#include <stdio.h>

void display( int x ) { printf( "%d\n", x ); }

template< bool isSomething >
inline void advance( int& x, int y );

template<> inline void advance<true>( int& x, int y ) { x += y; }
template<> inline void advance<false>( int& x, int y ) { x -= y; }

template< bool isSomething >
void myFunc()
{
    int x = 314;
    int y = 271;
    for( ;; )
    {
        advance< isSomething >( x, y ); // The nano-optimization.
        display( x );
        if( !( -10000 < x && x < 10000 ) ) { return; }
    }
}

int main( int n, char*[] )
{
    n > 1? myFunc<true>() : myFunc<false>();
}
E.g. with Visual C++ 10.0 this generates two versions of myFunc, one with an add instruction and the other with a sub instruction.
Cheers & hth.,
On a modern pipelined machine you want to avoid branching if at all possible in those cases where performance really does count. When the front of the pipeline hits a branch, the CPU guesses which branch to take and lets the pipeline work ahead based on that guess. Everything is fine if the guess was right. Everything is not so fine if the guess was wrong, particularly so if you're still using one of Intel's processors such as a Pentium 4 that suffered from pipeline bloat. Intel discovered that too much pipelining is not a good thing.
More modern processors still do use pipelining (the Core line has a pipeline length of 14 or so), so avoiding branching still remains one of those good things to do -- when it counts, that is. Don't make your code an ugly, prematurely optimized mess when it doesn't count.
The best thing to do is to first find out where your performance demons lie. It is not at all uncommon for a tiny fraction of one percent of the code base to be the cause of almost all of the CPU usage. Optimizing the 99.9% of the code that doesn't contribute to the CPU usage won't solve your performance problems but it will have a deleterious effect on maintenance.
You optimize once you have found the culprit code, and even then, maybe not. When performance doesn't matter, don't optimize. Performance as a metric runs counter to almost every other code quality metric out there.
So, getting off the soap box, let's suppose that little snippet of code is the performance culprit. Try both approaches and test. Try a third approach you haven't thought of yet and test. Sometimes the code that is the best performance-wise is surprisingly non-intuitive. Think Duff's device.
If i stays constant during the execution of the loop, and y doesn't, move the if outside of the loop.
So instead of...
your_loop {
    y = ...;
    if (i)
        x += y;
    else
        x -= y;
}
...do the following....
if (i) {
    your_loop {
        y = ...;
        x += y;
    }
}
else {
    your_loop {
        y = ...;
        x -= y;
    }
}
BTW, a decent compiler will do that optimization for you, so you may not see the difference when actually benchmarking.
Sounds like you want to avoid branching and multiplication. Let's say the switch i is set to all 1 bits, same size as y, when you want to add, and to 0 when you want to subtract. Then:
x += (y & i) - (y & ~i);
Haven't tested it, this is just to give you the general idea. Bear in mind that this makes the code a lot harder to read in exchange for what would probably be a very small increase in efficiency.
Edit: or, as bdonlan points out in the comments, possibly even a decrease.
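For completeness, here is a sketch of deriving that all-ones/all-zeros switch from an ordinary bool, assuming two's-complement integers; the function name is illustrative:

#include <cstdint>

// Adds y to x when add is true, subtracts it otherwise, without branching.
void step(std::int32_t& x, std::int32_t y, bool add)
{
    std::int32_t i = -static_cast<std::int32_t>(add); // true -> all ones, false -> 0
    x += (y & i) - (y & ~i);
}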
I put my suggestion in the comments to the test, and in a simple benchmark the bit-fiddling was faster than the branching options on an Intel(R) Xeon(R) CPU L5520 @ 2.27GHz, but slower on my laptop's Intel Core Duo.
If you are free to give i either the value 0 (for +) or ~0 (for -), these statements are equivalent:
// branching:
if ( i ) sum -= add; else sum += add;
sum += i?-add:add;
sum += (i?-1:1)*add;
// bit fiddling:
sum += (add^i)+(i&1);
sum += (add^i)+(!!i);
sum += (add&~i)-(add&i);
And as said, one method can beat the other by a factor of 2, depending on CPU and optimization level used.
Conclusion, as always, is that benchmarking is the only way to find out which is faster in your particular situation.