Why does this take so long to compile in VCC 2003? - c++

My team needs the "Sobol quasi-random number generator" - a common RNG which is famous for good quality results and speed of operation. I found what looks like a simple C implementation on the web. At home I was able to compile it almost instantaneously using my Linux GCC compiler.
The following day I tried it at work: if I compile in Visual Studio in debug mode it takes about 1 minute. If I compile it in release mode it takes about 40 minutes.
Why?
I know that "release" mode triggers some compiler optimization... but how on earth could a file this small take so long to optimize? It's mostly comments and static-data. There's hardly anything worth optimizing.
None of these PCs are particularly slow, and in any case I know that the compile time is consistent across a range of Windows computers. I've also heard that newer versions of Visual Studio have a faster compile time, however for now we are stuck with Visual Studio.Net 2003. Compiling on GCC (the one bundled with Ubuntu 8.04) always takes microseconds.

To be honest, I'm not really sure the code's that good. It's got a nasty smell in it. Namely, this function:
unsigned int i4_xor ( unsigned int i, unsigned int j )
//****************************************************************************80
//
// Purpose:
//
// I4_XOR calculates the exclusive OR of two integers.
//
// Modified:
//
// 16 February 2005
//
// Author:
//
// John Burkardt
//
// Parameters:
//
// Input, unsigned int I, J, two values whose exclusive OR is needed.
//
// Output, unsigned int I4_XOR, the exclusive OR of I and J.
//
{
unsigned int i2;
unsigned int j2;
unsigned int k;
unsigned int l;
k = 0;
l = 1;
while ( i != 0 || j != 0 )
{
i2 = i / 2;
j2 = j / 2;
if (
( ( i == 2 * i2 ) && ( j != 2 * j2 ) ) ||
( ( i != 2 * i2 ) && ( j == 2 * j2 ) ) )
{
k = k + l;
}
i = i2;
j = j2;
l = 2 * l;
}
return k;
}
There's an i8_xor too. And a couple of abs functions.
I think a post to the DailyWTF is in order.
EDIT: For the non-c programmers, here's a quick guide to what the above does:
function xor i:unsigned, j:unsigned
answer = 0
bit_position = 1
while i <> 0 or j <> 0
if least significant bit of i <> least significant bit of j
answer = answer + bit_position
end if
bit_position = bit_position * 2
i = i / 2
j = j / 2
end while
return answer
end function
To determine if the least significant bit is set or cleared, the following is used:
bit set if i <> (i / 2) * 2
bit clear if i == (i / 2) * 2
What makes the code extra WTFy is that C defines an XOR operator, namely '^'. So, instead of:
result = i4_xor (a, b);
you can have:
result = a ^ b; // no function call at all!
The original programmer really should have known about the xor operator. But even if they didn't (and granted, it's another obfuscated C symbol), their implementation of an XOR function is unbelievably poor.
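As a quick sanity check of the claim above, the following sketch restates the loop-based XOR in compact form and compares it with the built-in operator. The names slow_xor and check_small_inputs are made up for this illustration; they are not part of the Sobol code.

```cpp
#include <cassert>

// Hypothetical compact restatement of the loop-based XOR from the question.
unsigned int slow_xor(unsigned int i, unsigned int j) {
    unsigned int k = 0;  // accumulated result
    unsigned int l = 1;  // value of the current bit position
    while (i != 0 || j != 0) {
        // The lowest bits differ exactly when the parities differ.
        if ((i % 2) != (j % 2)) {
            k += l;
        }
        i /= 2;
        j /= 2;
        l *= 2;  // unsigned arithmetic, so wrapping here is well defined
    }
    return k;
}

// Compares the loop against the ^ operator for all pairs of 8-bit values.
void check_small_inputs() {
    for (unsigned int a = 0; a < 256; ++a) {
        for (unsigned int b = 0; b < 256; ++b) {
            assert(slow_xor(a, b) == (a ^ b));
        }
    }
}
```

If the two ever disagreed, the assert would fire; in practice every call to i4_xor can simply be replaced by a ^ b.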

I'm using VC++ 2003 and it compiled instantly in both debug/release modes.
Edit:
Do you have the latest service pack installed on your systems?

I would recommend you download a trial edition of Visual Studio 2008 and try the compile there, just to see if the problem is inherent. Also, if it does happen on a current version, you would be able to report the problem, and Microsoft might fix it.
On the other hand, there is no chance that Microsoft will fix whatever bug is in VS2003.

Related

pigeon hole / multiple numbers

input : integer ( i'll call it N ) and (1 <= N <= 5,000,000 )
output : integer, multiple of N and only contains 0,7
Ex.
Q1 input : 1 -> output : 7 ( 7 mod 1 == 0 )
Q2 input : 2 -> output : 70 ( 70 mod 2 == 0 )
#include <string>
#include <iostream>
using namespace std;
typedef long long ll;
int remaind(string num, ll m)
{
ll mod = 0;
for (int i = 0; i < num.size(); i++) {
int digit = num[i] - '0';
mod = mod * 10 + digit;
mod = mod % m;
}
return mod;
}
int main()
{
int n;
string ans;
cin >> n;
ans.append(n, '7');
for (int i = ans.length() - 1; i >= 0; i--)
{
if (remaind(ans, n) == 0)
{
cout << ans;
return 0;
}
ans.at(i) = '0';
}
return 0;
}
Is there a way to reduce the time complexity?
I have tried hard, but it still takes a little too long to run when n is more than 1000000.
ps. changed code
ps2. changed code again because of wrong code
ps3. optimize code again
ps4. rewrite post
Your approach is wrong. Let's say you divide "70" by 5. Then your result will be 2, which is not right (just analyze your code to see why that happens).
You can really base your search upon numbers like 77777770000000, but think more about that: which numbers need zeros appended and which do not.
Next, do not use strings! Think of the remainder of a * b if you know the remainder of a and the remainder of b. When you program it, be careful with integer size; use 64-bit integers.
Now, what about a + b?
Finally, find remainders for the numbers 10, 100, 1000, 10000, etc. (once again, do not use strings, and still try to find the remainder for any power of 10).
Well, if you do all that, you'll be able to easily solve the whole problem.
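The hints above can be sketched as follows. This is one possible reading of them, not necessarily what the answerer had in mind: it tracks the remainders of 7, 77, 777, ... modulo N using (r * 10 + 7) % N and stops when a remainder repeats. By the pigeonhole principle that must happen within N steps, and the difference of the two numbers is a run of 7s followed by a run of 0s that N divides. The function name sevens_then_zeros is hypothetical, and the result is a valid multiple but is not guaranteed to be the smallest one.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Returns a multiple of n written as a run of '7's followed by a run of '0's.
std::string sevens_then_zeros(std::uint32_t n) {
    assert(n > 0);
    // first_seen[r] = number of 7s in the first prefix 77...7 with remainder r.
    std::vector<int> first_seen(n, -1);
    first_seen[0] = 0;  // the empty prefix has remainder 0
    std::uint64_t rem = 0;
    for (int len = 1;; ++len) {
        rem = (rem * 10 + 7) % n;  // remainder of 77...7 with len digits
        if (first_seen[rem] != -1) {
            // Two prefixes share a remainder, so their difference is divisible by n.
            const int zeros = first_seen[rem];
            const int sevens = len - zeros;
            return std::string(sevens, '7') + std::string(zeros, '0');
        }
        first_seen[rem] = len;
    }
}
```

Only one modulo operation is performed per digit, and no string is ever converted back to a number, so the work is linear in N.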
May I recommend one of the Boost.Multiprecision (boost::multiprecision) integer classes?
I suspect uint1024_t (or whatever... they also have 128, 256, and 512, bit ints already typedefed, and you can declare your own easily enough) will meet your needs, allowing you to perform a single %, rather than one per iteration. This may outweigh the performance lost when using bignum vs c++'s built-in ints.
2^1024 ~= 1.8e+308. Enough to represent any 308 digit number. That's probably excessive.
2^512 ~= 1.34e+154. Good for any 154 digit number.
etc.
I suspect you should first write a loop that went through n = 4e+6 -> 5e+6 and wrote out which string got the longest, then size your uint*_t appropriately. If that longest string length is more than 308 characters, you could just whip up your own:
typedef number<cpp_int_backend<LENGTH, LENGTH, unsigned_magnitude, unchecked, void> > myReallyUnsignedBigInt;
The modulo operator is probably the most expensive operation in that inner loop. Performing once per iteration on the outer loop rather than at the inner loop (O(n) vs O(n^2)) should save you quite a bit of time.
Will that plus the whole "not going to and from strings" thing pay for bignum's overhead? You'll have to try it and see.

g++ optimization breaks for loops

A few days ago, I encountered what I believe to be a bug in g++ 5.3 concerning the nesting of for loops at higher -OX optimization levels. (Been experiencing it specifically for -O2 and -O3). The issue is that if you have two nested for loops, that have some internal sum to keep track of total iterations, once this sum exceeds its maximum value it prevents the outer loop from terminating. The smallest code set that I have been able to replicate this with is:
#include <iostream>
int main(){
int sum = 0;
// Value of 100 million. (2047483647 less than int32 max.)
int maxInner = 100000000;
int maxOuter = 30;
// 100million * 30 = 3 billion. (Larger than int32 max)
for(int i = 0; i < maxOuter; ++i)
{
for(int j = 0; j < maxInner; ++j)
{
++sum;
}
std::cout<<"i = "<<i<<" sum = "<<sum<<std::endl;
}
}
When this is compiled using g++ -o run.me main.cpp it runs just as expected outputting:
i = 0 sum = 100000000
i = 1 sum = 200000000
i = 2 sum = 300000000
i = 3 sum = 400000000
i = 4 sum = 500000000
i = 5 sum = 600000000
i = 6 sum = 700000000
i = 7 sum = 800000000
i = 8 sum = 900000000
i = 9 sum = 1000000000
i = 10 sum = 1100000000
i = 11 sum = 1200000000
i = 12 sum = 1300000000
i = 13 sum = 1400000000
i = 14 sum = 1500000000
i = 15 sum = 1600000000
i = 16 sum = 1700000000
i = 17 sum = 1800000000
i = 18 sum = 1900000000
i = 19 sum = 2000000000
i = 20 sum = 2100000000
i = 21 sum = -2094967296
i = 22 sum = -1994967296
i = 23 sum = -1894967296
i = 24 sum = -1794967296
i = 25 sum = -1694967296
i = 26 sum = -1594967296
i = 27 sum = -1494967296
i = 28 sum = -1394967296
i = 29 sum = -1294967296
However, when this is compiled using g++ -O2 -o run.me main.cpp, the outer loop fails to terminate. (This only occurs when maxInner * maxOuter > 2^31) While sum continually overflows, it shouldn't in any way affect the other variables. I have also tested this on Ideone.com with the test case demonstrated here: https://ideone.com/5MI5Jb
My question is thus twofold.
How is it possible for the value of sum to in some way affect the system? No decisions are based upon its value; it is merely used as a counter and in the std::cout statement.
What could possibly be causing the dramatically different outcomes at different optimization levels?
Thank you greatly in advance for taking the time to read and consider my question.
Note: This question differs from existing questions such as: Why does integer overflow on x86 with GCC cause an infinite loop? because the issue with that problem was an overflow of the sentinel variable. However, both sentinel variables in this question, i and j, never exceed the value of 100m, let alone 2^31.
This is an optimisation that's perfectly valid for correct code. Your code isn't correct.
What GCC sees is that the only way the loop exit condition i >= maxOuter could ever be reached is if you have signed integer overflow during earlier loop iterations in your calculation of sum. The compiler assumes there isn't signed integer overflow, because signed integer overflow isn't allowed in standard C. Therefore, i < maxOuter can be optimised to just true.
This is controlled by the -faggressive-loop-optimizations flag. You should be able to get the behaviour you expect by adding -fno-aggressive-loop-optimizations to your command line arguments. But better would be making sure your code is valid. Use unsigned integer types to get guaranteed valid wraparound behaviour.
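A minimal sketch of the unsigned variant suggested above. The function name wrapping_total is made up for this illustration, and it adds the inner-loop count in one step rather than counting one by one, so that it runs quickly; the point is only that unsigned arithmetic wraps modulo 2^32 in a defined way.

```cpp
#include <cassert>
#include <cstdint>

// Adds `step` to an unsigned counter `times` times. Unsigned overflow is
// defined to wrap modulo 2^32, so the compiler may not assume it never happens.
std::uint32_t wrapping_total(std::uint32_t step, int times) {
    assert(times >= 0);
    std::uint32_t sum = 0;
    for (int i = 0; i < times; ++i) {
        sum += step;
    }
    return sum;
}
```

With 32-bit unsigned arithmetic, 30 steps of 100000000 still fit, while 50 steps wrap around to a smaller value instead of invoking undefined behaviour.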
Your code invokes undefined behaviour, since the int sum overflows. You say "this shouldn't in any way affect the other variables". Wrong. Once you have undefined behaviour, all bets are off. Anything can happen.
gcc is (in)famous for optimisations that assume there is no undefined behaviour and do let's say interesting things if undefined behaviour happens.
Solution: Don't do it.
Answers
As @hvd pointed out, the problem is in your invalid code, not in the compiler.
During your program execution, the sum value overflows int range. Since int is by default signed and overflow of signed values causes undefined behavior* in C, the compiler is free to do anything. As someone noted somewhere, dragons could be flying out of your nose. The result is just undefined.
The difference -O2 causes is in testing the end condition. When the compiler optimizes your loop, it realizes that it can optimize away the inner loop, making it
int sum = 0;
for(int i = 0; i < maxOuter; i++) {
sum += maxInner;
std::cout<<"i = "<<i<<" sum = "<<sum<<std::endl;
}
and it may go further, transforming it to
int i = 0;
for(int sum = 0; sum < (maxInner * maxOuter); sum += maxInner) {
i++;
std::cout<<"i = "<<i<<" sum = "<<sum<<std::endl;
}
To be honest, I don't really know what it does, the point is, it can do just this. Or anything else, remember the dragons, your program causes undefined behavior.
Suddenly, your sum variable is used in the loop end condition. Note that for defined behavior, these optimizations are perfectly valid. If your sum was unsigned (and your maxInner and maxOuter), the (maxInner * maxOuter) value (which would also be unsigned) would be reached after maxOuter loops, because unsigned operations are defined** to overflow as expected.
Now since we're in the signed domain, the compiler is for one free to assume, that at all times sum < (maxInner * maxOuter), just because the latter overflows, and therefore is not defined. So the optimizing compiler can end up with something like
int i = 0;
for(int sum = 0;/* nothing here evaluates to true */; sum += maxInner) {
i++;
std::cout<<"i = "<<i<<" sum = "<<sum<<std::endl;
}
which looks like observed behavior.
*: According to the C11 standard draft, section 6.5 Expressions:
If an exceptional condition occurs during the evaluation of an expression (that is, if the result is not mathematically defined or not in the range of representable values for its type), the behavior is undefined.
**: According to the C11 standard draft, Annex H, H.2.2:
C’s unsigned integer types are ‘‘modulo’’ in the LIA−1 sense in that overflows or out-of-bounds results silently wrap.
I did some research on the topic. I compiled the code above with gcc and g++ (version 5.3.0 on Manjaro) and got some pretty interesting things of it.
Description
To successfully compile it with gcc (C compiler, that is), I have replaced
#include <iostream>
...
std::cout<<"i = "<<i<<" sum = "<<sum<<std::endl;
with
#include <stdio.h>
...
printf("i = %d sum = %d\n", i, sum);
and wrapped this replacement with #ifndef ORIG, so I could have both versions. Then I ran 8 compilations: {gcc,g++} x {-O2, ""} x {-DORIG=1,""}. This yields following results:
Results
gcc, -O2, -DORIG=1: Won't compile, missing <iostream>. Not surprising.
gcc, -O2, "": Produces a compiler warning and behaves "normally". A look at the assembly shows that the inner loop is optimized out (j being incremented by 100000000) and the outer loop variable is compared with the hardcoded value -1294967296. So, GCC can detect this and do some clever things while the program still works as expected. More importantly, a warning is emitted to warn the user about undefined behavior.
gcc, "", -DORIG=1: Won't compile, missing <iostream>. Not surprising.
gcc, "", "": Compiles without warning. No optimizations, program runs as expected.
g++, -O2, -DORIG=1: Compiles without warning, runs in endless loop. This is OP's original code running. C++ assembly is tough to follow for me. Addition of 100000000 is there though.
g++, -O2, "": Compiles with a warning. Changing how the output is printed is enough to change whether the compiler emits the warning. Runs "normally". By the assembly, AFAIK the inner loop gets optimized out. At least there is again a comparison against -1294967296 and an increment by 100000000.
g++, "", -DORIG=1: Compiles without warning. No optimization, runs "normally".
g++, "", "": ditto
The most interesting part for me was to find out the difference upon change of printing. Actually from all the combinations, only the one used by OP produces endless-loop program, the others fail to compile, do not optimize or optimize with warning and preserve sanity.
Code
An example build command and my full code follow:
$ gcc -x c -Wall -Wextra -O2 -DORIG=1 -o gcc_opt_orig main.cpp
main.cpp:
#ifdef ORIG
#include <iostream>
#else
#include <stdio.h>
#endif
int main(){
int sum = 0;
// Value of 100 million. (2047483647 less than int32 max.)
int maxInner = 100000000;
int maxOuter = 30;
// 100million * 30 = 3 billion. (Larger than int32 max)
for(int i = 0; i < maxOuter; ++i)
{
for(int j = 0; j < maxInner; ++j)
{
++sum;
}
#ifdef ORIG
std::cout<<"i = "<<i<<" sum = "<<sum<<std::endl;
#else
printf("i = %d sum = %d\n", i, sum);
#endif
}
}

assuming signed overflow does not occur in if statement

Why is this warning appearing? It's not really an assumption if I check the bounds. And how to fix?
If num_actions_to_skip is set to 1, instead of 2, the error goes away.
Thanks
error: assuming signed overflow does not occur when assuming that (X - c) <= X is always true [-Werror=strict-overflow]
cc1plus: all warnings being treated as errors
On if (loc >= 0 && loc < action_list.count()) {
const QList<QAction *> &action_list = tool_menu->actions();
static const int num_actions_to_skip = 2;
const int loc = action_list.count() - num_actions_to_skip;
if (loc >= 0 && loc < action_list.count()) {
tool_menu->insertAction(action_list.at(loc),
action);
}
It started with
Q_ASSERT_X(i >= 0 && i < p.size()
at qlist.h:454, which performs the same check, and throws this error as well, with just
tool_menu->insertAction(action_list.at(action_list.count() - 2),
action);
You just need to rethink your logic.
static const int num_actions_to_skip = 2;
const int loc = action_list.count() - num_actions_to_skip;
if (loc >= 0 && loc < action_list.count()) {
// ...
}
Apparently action_list.count() is a constant value (at least it won't change as this code is executed), and the compiler is able to figure that out.
Let's simplify this a bit, replacing num_actions_to_skip by 2, reducing action_list.count() to count. We can then re-express loc as count - 2.
Your if condition becomes:
if (count - 2 >= 0 && count - 2 < count)
which is equivalent (assuming, as the compiler warning said, that no overflow occurs) to:
if (count >= 2 && -2 < 0)
The second half of that, -2 < 0, is obviously true, so you can safely drop it, which leaves us with
if (count >= 2)
Re-substituting the original terms, this gives us:
static const int num_actions_to_skip = 2;
// const int loc = action_list.count() - num_actions_to_skip;
if (action_list.count() >= num_actions_to_skip) {
// ...
}
The compiler warned you that it was performing an optimization that might be invalid if there's an integer overflow (it's permitted to assume that there is no overflow because if there is the behavior is undefined). It was kind enough to warn you about this -- which is lucky for you, because it pointed to the fact that your code is doing something it doesn't need to.
I don't know whether you need to keep the declaration of loc; it depends on whether you use it later. But if you simplify the code in the way I've suggested it should work the same way and be easier to read and understand.
If you get a warning message from the compiler, your goal should not just be to make the message go away; it should be to drill down and figure out just what the compiler is warning you about, and beyond that why your code causes that problem.
You know the context of this code better than I do. If you look at the revised version, you may well find that it expresses the intent more clearly.
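The simplified check can be sketched outside of Qt as follows. This is an illustration only: std::vector stands in for QList, and index_before_end is a hypothetical helper name. The size is compared before any subtraction, so the compiler has no overflow assumption to warn about.

```cpp
#include <cassert>
#include <vector>

// Returns the index `skip` positions before the end of `items`,
// or -1 if the container holds fewer than `skip` elements.
int index_before_end(const std::vector<int>& items, int skip) {
    assert(skip > 0);
    const int count = static_cast<int>(items.size());
    if (count >= skip) {
        return count - skip;  // always in the range [0, count - 1]
    }
    return -1;
}
```

A caller would insert before the returned index only when it is not -1, which mirrors the single remaining condition count >= num_actions_to_skip.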
From this GCC resource:
-Wstrict-overflow
-Wstrict-overflow=n
This option is only active when -fstrict-overflow is active. It warns about cases where the compiler optimizes based on the assumption that signed overflow does not occur. Note that it does not warn about all cases where the code might overflow: it only warns about cases where the compiler implements some optimization. Thus this warning depends on the optimization level.
An optimization that assumes that signed overflow does not occur is perfectly safe if the values of the variables involved are such that overflow never does, in fact, occur. Therefore this warning can easily give a false positive: a warning about code that is not actually a problem. To help focus on important issues, several warning levels are defined. No warnings are issued for the use of undefined signed overflow when estimating how many iterations a loop requires, in particular when determining whether a loop will be executed at all.
-Wstrict-overflow=1
Warn about cases that are both questionable and easy to avoid. For example, with -fstrict-overflow, the compiler simplifies x + 1 > x to 1. This level of -Wstrict-overflow is enabled by -Wall; higher levels are not, and must be explicitly requested.
-Wstrict-overflow=2
Also warn about other cases where a comparison is simplified to a constant. For example: abs (x) >= 0. This can only be simplified when -fstrict-overflow is in effect, because abs (INT_MIN) overflows to INT_MIN, which is less than zero. -Wstrict-overflow (with no level) is the same as -Wstrict-overflow=2.
-Wstrict-overflow=3
Also warn about other cases where a comparison is simplified. For example: x + 1 > 1 is simplified to x > 0.
-Wstrict-overflow=4
Also warn about other simplifications not covered by the above cases. For example: (x * 10) / 5 is simplified to x * 2.
-Wstrict-overflow=5
Also warn about cases where the compiler reduces the magnitude of a constant involved in a comparison. For example: x + 2 > y is simplified to x + 1 >= y. This is reported only at the highest warning level because this simplification applies to many comparisons, so this warning level gives a very large number of false positives.
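To make the first example in the quoted text concrete: a test such as x + 1 > x cannot detect overflow, because the compiler may fold it to true. A hedged sketch of the usual alternative is to compare against the limit before doing the arithmetic. The helper names below are hypothetical.

```cpp
#include <cassert>
#include <climits>

// True when x + 1 is representable; decided without computing x + 1.
bool can_increment(int x) {
    return x < INT_MAX;
}

// Returns x + 1. The precondition is checked first, so the addition
// itself can never overflow.
int checked_increment(int x) {
    assert(can_increment(x));
    return x + 1;
}
```

Because no expression here depends on what happens after an overflow, there is nothing for the strict-overflow optimization to simplify away.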
The solution was to change my ints to unsigned ints:
const QList<QAction *> &action_list = tool_menu->actions();
static const unsigned int num_actions_to_skip = 2;
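// pos is unsigned, so "pos >= 0" would always be true; check the count before subtracting instead.
assert(action_list.count() >= static_cast<int>(num_actions_to_skip));
const unsigned int pos = action_list.count() - num_actions_to_skip;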
tool_menu->insertAction(action_list.at(pos),
action);
I just solved a similar issue on a comparison that works fine for x86 but not for arm (optimization level O2):
I replaced (x<right) by (x-right<0) and the code is fine now.
And it sounds logical (in hindsight): the new code is harder for humans to read but expresses what the compiler does even for the overflow case.
static inline bool shape_int_rectangle_contains (
const shape_int_rectangle_t *this_, int32_t x, int32_t y )
{
bool result;
const int32_t right = (*this_).left + (*this_).width;
const int32_t bottom = (*this_).top + (*this_).height;
/*
* warning: assuming signed overflow does not occur when assuming that (X + c) >= X is always true [-Wstrict-overflow]
*
* result = ( x >= (*this_).left )&&( y >= (*this_).top )&&( x < right )&&( y < bottom );
*
* fix:
*/
result = ( x-(*this_).left >= 0 )&&( y-(*this_).top >= 0 )&&( x-right < 0 )&&( y-bottom < 0 );
return result;
}

How to let Boost::random and Matlab produce the same random numbers

To check my C++ code, I would like to be able to let Boost::Random and Matlab produce the same random numbers.
So for Boost I use the code:
boost::mt19937 var(static_cast<unsigned> (std::time(0)));
boost::uniform_int<> dist(1, 6);
boost::variate_generator<boost::mt19937&, boost::uniform_int<> > die(var, dist);
die.engine().seed(0);
for(int i = 0; i < 10; ++i) {
std::cout << die() << " ";
}
std::cout << std::endl;
Which produces (every run of the program):
4 4 5 6 4 6 4 6 3 4
And for matlab I use:
RandStream.setDefaultStream(RandStream('mt19937ar','seed',0));
randi(6,1,10)
Which produces (every run of the program):
5 6 1 6 4 1 2 4 6 6
Which is bizarre, since both use the same algorithm, and same seed.
What am I missing?
It seems that Python (using numpy) and Matlab are comparable in the uniform random numbers they produce:
Matlab
RandStream.setDefaultStream(RandStream('mt19937ar','seed',203));rand(1,10)
0.8479 0.1889 0.4506 0.6253 0.9697 0.2078 0.5944 0.9115 0.2457 0.7743
Python:
random.seed(203);random.random(10)
array([ 0.84790006, 0.18893843, 0.45060688, 0.62534723, 0.96974765,
0.20780668, 0.59444858, 0.91145688, 0.24568615, 0.77430378])
C++Boost
0.8479 0.667228 0.188938 0.715892 0.450607 0.0790326 0.625347 0.972369 0.969748 0.858771
Every other value here is identical to a Python and Matlab value...
I have to agree with the other answers, stating that these generators are not "absolute". They may produce different results according to the implementation. I think the simplest solution would be to implement your own generator. It might look daunting (Mersenne twister sure is by the way) but take a look at Xorshift, an extremely simple though powerful one. I copy the C implementation given in the Wikipedia link :
uint32_t xor128(void) {
static uint32_t x = 123456789;
static uint32_t y = 362436069;
static uint32_t z = 521288629;
static uint32_t w = 88675123;
uint32_t t;
t = x ^ (x << 11);
x = y; y = z; z = w;
return w = w ^ (w >> 19) ^ (t ^ (t >> 8));
}
To have the same seed, just put any values you want into x, y, z, w (except (0,0,0,0), I believe). You just need to be sure that Matlab and C++ both use 32 bits for these unsigned ints.
Using the interface like
randi(6,1,10)
will apply some kind of transformation on the raw result of the random generator. This transformation is not trivial in general and Matlab will almost certainly do a different selection step than Boost.
Try comparing raw data streams from the RNGs - chances are they are the same
In case this helps anyone interested in the question:
In order to get the same behavior for the Twister algorithm:
Download the file
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/CODES/mt19937ar.c
Try the following:
#include <stdint.h>
// mt19937ar.c content..
int main(void)
{
int i;
uint32_t seed = 100;
init_genrand(seed);
for (i = 0; i < 100; ++i)
printf("%.20f\n",genrand_res53());
return 0;
}
Make sure the same values are generated within matlab:
RandStream.setGlobalStream( RandStream.create('mt19937ar','seed',100) );
rand(100,1)
randi() seems to be simply ceil( rand()*maxval )
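The same idea can be sketched in C++ without the downloaded file, under two assumptions that should be verified against your own Matlab output: that std::mt19937 seeded with a single integer yields the same 32-bit stream as init_genrand in mt19937ar.c, and that randi is ceil(rand()*maxval) as guessed above. The function names res53 and randi_like are hypothetical. Each double consumes two 32-bit outputs, which may also explain why a generator that uses one output per double matches only every other value.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>

// 53-bit double in [0, 1) built from two 32-bit outputs, following the
// arithmetic of genrand_res53() in mt19937ar.c.
double res53(std::mt19937& gen) {
    const std::uint32_t a = static_cast<std::uint32_t>(gen()) >> 5;  // top 27 bits
    const std::uint32_t b = static_cast<std::uint32_t>(gen()) >> 6;  // top 26 bits
    return (a * 67108864.0 + b) * (1.0 / 9007199254740992.0);
}

// Hypothetical stand-in for randi(maxval), assuming ceil(rand() * maxval).
// Note: a draw of exactly 0.0 would map to 0 rather than 1.
int randi_like(std::mt19937& gen, int maxval) {
    assert(maxval > 0);
    return static_cast<int>(std::ceil(res53(gen) * maxval));
}
```

Comparing the doubles from res53 with rand in Matlab first, before looking at the integers, separates differences in the raw stream from differences in the integer mapping.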
Thanks to Fezvez's answer I've written xor128 in matlab:
function [ w, state ] = xor128( state )
%XOR128 implementation of Xorshift
% https://en.wikipedia.org/wiki/Xorshift
% A starting state might be [123456789, 362436069, 521288629, 88675123]
x = state(1);
y = state(2);
z = state(3);
w = state(4);
% t1 = (x << 11)
t1 = bitand(bitshift(x,11),hex2dec('ffffffff'));
% t = x ^ (x << 11)
t = bitxor(x,t1);
x = y;
y = z;
z = w;
% t2 = (t ^ (t >> 8))
t2 = bitxor(t, bitshift(t,-8));
% t3 = w ^ (w >> 19)
t3 = bitxor(w, bitshift(w,-19));
% w = w ^ (w >> 19) ^ (t ^ (t >> 8))
w = bitxor(t3, t2);
state = [x y z w];
end
You need to pass state in to xor128 every time you use it. I've written a "tester" function which simply returns a vector with random numbers. I tested 1000 numbers output by this function against values output by cpp with gcc and it is perfect.
function [ v ] = txor( iterations )
%TXOR test xor128, returns vector v of length iterations with random number
% output from xor128
% output
v = zeros(iterations,1);
state = [123456789, 362436069, 521288629, 88675123];
i = 1;
while i <= iterations
disp(i);
[t,state] = xor128(state);
v(i) = t;
i = i + 1;
end
I would be very careful about assuming that two different implementations of pseudo-random generators (even though based on the same algorithm) produce the same result. It could be that one of the implementations uses some sort of tweak, hence producing different results. If you need two equal "random" distributions, I suggest you either precalculate a sequence, store it and access it from both C++ and Matlab, or create your own generator. It should be fairly easy to implement MT19937 if you use the pseudocode on Wikipedia.
Take care to ensure that both your Matlab and C++ code run on the same architecture (that is, both run on either 32 or 64-bit); using a 64-bit integer in one implementation and a 32-bit integer in the other will lead to different results.

Fast dot product for a very special case

Given a vector X of size L, where every scalar element of X is from the binary set {0,1}, the task is to find the dot product z=dot(X,Y), where the vector Y of size L consists of integer-valued elements. I suspect there must be a very fast way to do it.
Let's say we have L=4; X[L]={1, 0, 0, 1}; Y[L]={-4, 2, 1, 0} and we have to find z=X[0]*Y[0] + X[1]*Y[1] + X[2]*Y[2] + X[3]*Y[3] (which in this case will give us -4).
It is obvious that X can be represented using binary digits, e.g. an integer type int32 for L=32. Then all we have to do is find the dot product of this integer with an array of 32 integers. Do you have any ideas or suggestions for how to do it very fast?
This really would require profiling but an alternative you might want to consider:
int result=0;
int mask=1;
for ( int i = 0; i < L; i++ ){
if ( X & mask ){
result+=Y[i];
}
mask <<= 1;
}
Typically bit shifting and bitwise operations are faster than multiplication, however, the if statement might be slower than a multiplication, although with branch prediction and large L my guess is it might be faster. You would really have to profile it, though, to determine if it resulted in any speedup.
As has been pointed out in the comments below, unrolling the loop either manually or via a compiler flag (such as "-funroll-loops" on GCC) could also speed this up (eliding the loop condition).
Edit
In the comments below, the following good tweak has been proposed:
int result=0;
for ( int i = 0; i < L; i++ ){
if ( X & 1 ){
result+=Y[i];
}
X >>= 1;
}
Is a suggestion to look into SSE2 helpful? It has dot-product type operations already, plus you can trivially do 4 (or perhaps 8, I forget the register size) simple iterations of your naive loop in parallel.
SSE also has some simple logic-type operations so it may be able to do additions rather than multiplications without using any conditional operations... again you'd have to look at what ops are available.
Try this:
int result=0;
for ( int i = 0; i < L; i++ ){
result+=Y[i] & (~(((X>>i)&1)-1));
}
This avoids a conditional statement and uses bitwise operators to mask the scalar value with either zeros or ones.
Since size explicitly doesn’t matter, I think the following is probably the most efficient general-purpose code:
int result = 0;
for (size_t i = 0; i < 32; ++i)
result += Y[i] & -X[i];
Bit-encoding X just doesn’t bring anything to the table (even if the loop may potentially terminate earlier as @Mathieu correctly noted). But omitting the if inside the loop does.
Of course, loop unrolling can speed this up drastically, as others have noted.
This solution is identical to, but slightly faster (by my test), than Micheal Aaron's:
long Lev=1;
long Result=0;
for (int i=0;i<L;i++) {
if (X & Lev)
Result+=Y[i];
Lev*=2;
}
I thought there was a numerical way to rapidly establish the next set bit in a word, which should improve performance if your X data is very sparse, but I currently cannot find said formulation.
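One commonly used formulation, offered here as a sketch rather than as what the answerer had in mind: X & (X - 1) clears the lowest set bit, and a count-trailing-zeros instruction gives its index. The code below assumes a GCC- or Clang-compatible compiler for __builtin_ctz, a 32-bit unsigned int, and the hypothetical function name sparse_dot.

```cpp
#include <cassert>
#include <cstdint>

// Dot product of a bit-packed X with Y, visiting only the set bits of X.
// Y must have at least (index of the highest set bit of X) + 1 elements.
int sparse_dot(std::uint32_t X, const int* Y) {
    assert(Y != nullptr);
    int result = 0;
    while (X != 0) {
        const int i = __builtin_ctz(X);  // index of the lowest set bit (GCC/Clang builtin)
        result += Y[i];
        X &= X - 1;  // clear the lowest set bit
    }
    return result;
}
```

The loop runs once per set bit, so it only pays off when X is sparse; for dense X the fixed-length branchless loops elsewhere in this thread may well be faster, and profiling is the only way to know.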
I've seen a number of responses with bit trickery (to avoid branching) but none got the loop right imho :/
Optimizing @Goz's answer:
int result=0;
unsigned x = X; // unsigned, so a set top bit does not end the loop before it starts
for (int i = 0; x != 0; ++i, x >>= 1 )
{
result += Y[i] & -(int)(x & 1);
}
Advantages:
no need to do i bit-shifting operations each time (X>>i)
the loop stops sooner if X contains 0 in higher bits
Now, I do wonder if it runs faster, especially since the premature stop of the for loop might not be as easy for loop unrolling (compared to a compile-time constant).
How about combining a shifting loop with a small lookup table?
int result=0;
for ( unsigned x=X; x!=0; x>>=4 ){ // unsigned: a negative int would never shift down to 0
switch (x&15) {
case 0: break;
case 1: result+=Y[0]; break;
case 2: result+=Y[1]; break;
case 3: result+=Y[0]+Y[1]; break;
case 4: result+=Y[2]; break;
case 5: result+=Y[0]+Y[2]; break;
case 6: result+=Y[1]+Y[2]; break;
case 7: result+=Y[0]+Y[1]+Y[2]; break;
case 8: result+=Y[3]; break;
case 9: result+=Y[0]+Y[3]; break;
case 10: result+=Y[1]+Y[3]; break;
case 11: result+=Y[0]+Y[1]+Y[3]; break;
case 12: result+=Y[2]+Y[3]; break;
case 13: result+=Y[0]+Y[2]+Y[3]; break;
case 14: result+=Y[1]+Y[2]+Y[3]; break;
case 15: result+=Y[0]+Y[1]+Y[2]+Y[3]; break;
}
Y+=4;
}
The performance of this will depend on how good the compiler is at optimising the switch statement, but in my experience they are pretty good at that nowadays....
There is probably no general answer to this question. You need to profile your code under all the different targets. Performance will depend on compiler optimizations such as loop unwinding and SIMD instructions that are available on most modern CPUs (x86, PPC, ARM all have their own implementations).
For small L, you can use a switch statement instead of a loop. For example, if L = 8, you could have:
int dot8(unsigned int X, const int Y[])
{
switch (X)
{
case 0: return 0;
case 1: return Y[0];
case 2: return Y[1];
case 3: return Y[0]+Y[1];
// ...
case 255: return Y[0]+Y[1]+Y[2]+Y[3]+Y[4]+Y[5]+Y[6]+Y[7];
}
assert(0 && "X too big");
return 0; // not reached for X <= 255; avoids falling off the end when NDEBUG disables the assert
}
And if L = 32, you can write a dot32() function which calls dot8() four times, inlined if possible. (If your compiler refuses to inline dot8(), you could rewrite dot8() as a macro to force inlining.) Added:
int dot32(unsigned int X, const int Y[])
{
return dot8(X >> 0 & 255, Y + 0) +
dot8(X >> 8 & 255, Y + 8) +
dot8(X >> 16 & 255, Y + 16) +
dot8(X >> 24 & 255, Y + 24);
}
This solution, as mikera points out, may have an instruction cache cost; if so, using a dot4() function might help.
Further update: This can be combined with mikera's solution:
static int dot4(unsigned int X, const int Y[])
{
switch (X)
{
case 0: return 0;
case 1: return Y[0];
case 2: return Y[1];
case 3: return Y[0]+Y[1];
//...
case 15: return Y[0]+Y[1]+Y[2]+Y[3];
}
}
Looking at the resulting assembler code with the -S -O3 options with gcc 4.3.4 on CYGWIN, I'm slightly surprised to see that this is automatically inlined within dot32(), with eight 16-entry jump-tables.
But adding __attribute__((__noinline__)) seems to produce nicer-looking assembler.
Another variation is to use fall-throughs in the switch statement, but gcc adds jmp instructions, and it doesn't look any faster.
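For reference, here is a minimal sketch of how the dot4()-based version might be wired together. The 16 cases are the ones elided above, written out in full; dot32_via_dot4 is a hypothetical name chosen for this sketch, and the loop over nibbles is only one way to make the eight calls.

```cpp
// All 16 cases of dot4(), written out in full.
static int dot4(unsigned int X, const int Y[])
{
    switch (X & 15u)
    {
    case 0:  return 0;
    case 1:  return Y[0];
    case 2:  return Y[1];
    case 3:  return Y[0] + Y[1];
    case 4:  return Y[2];
    case 5:  return Y[0] + Y[2];
    case 6:  return Y[1] + Y[2];
    case 7:  return Y[0] + Y[1] + Y[2];
    case 8:  return Y[3];
    case 9:  return Y[0] + Y[3];
    case 10: return Y[1] + Y[3];
    case 11: return Y[0] + Y[1] + Y[3];
    case 12: return Y[2] + Y[3];
    case 13: return Y[0] + Y[2] + Y[3];
    case 14: return Y[1] + Y[2] + Y[3];
    case 15: return Y[0] + Y[1] + Y[2] + Y[3];
    }
    return 0; // not reached: X & 15u is always in 0..15
}

// Hypothetical wrapper: handle a 32-bit X one nibble (4 bits) at a time.
int dot32_via_dot4(unsigned int X, const int Y[])
{
    int result = 0;
    for (int nibble = 0; nibble < 8; ++nibble)
        result += dot4((X >> (4 * nibble)) & 15u, Y + 4 * nibble);
    return result;
}
```

Whether the compiler inlines dot4() here, and whether that helps, is something to check in the generated assembler for your own target.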
Edit--Completely new answer: After thinking about the 100 cycle penalty mentioned by Ants Aasma, and the other answers, the above is likely not optimal. Instead, you could manually unroll the loop as in:
int dot(unsigned int X, const int Y[])
{
return (Y[0] & -!!(X & 1u<<0)) +
(Y[1] & -!!(X & 1u<<1)) +
(Y[2] & -!!(X & 1u<<2)) +
(Y[3] & -!!(X & 1u<<3)) +
//...
(Y[31] & -!!(X & 1u<<31)); // 1u: shifting a signed 1 into bit 31 overflows int
}
This, on my machine, generates 32 x 5 = 160 fast instructions. A smart compiler could conceivably unroll the other suggested answers to give the same result.
But I'm still double-checking.
result = 0;
for(int i = 0; i < L ; i++)
if(X[i]!=0)
result += Y[i];
It's quite likely that the time spent to load X and Y from main memory will dominate. If this is the case for your CPU architecture, the algorithm is faster when loading less. This means that storing X as a bitmask and expanding it into L1 cache will speed up the algorithm as a whole.
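As a rough sketch of the bitmask idea, assuming L is at most 32 and that X originally arrives as an array of zero/non-zero ints: pack_bits and dot_masked are hypothetical names, not anything from the question.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: pack up to 32 zero/non-zero flags into one word,
// so the flags occupy 4 bytes instead of one int per element.
inline std::uint32_t pack_bits(const int* X, std::size_t L)
{
    std::uint32_t mask = 0;
    for (std::size_t i = 0; i < L && i < 32; ++i)
        if (X[i] != 0)
            mask |= std::uint32_t(1) << i;
    return mask;
}

// Sum the Y[i] whose bit is set in mask; only Y is read from memory.
inline int dot_masked(std::uint32_t mask, const int* Y, std::size_t L)
{
    int result = 0;
    for (std::size_t i = 0; i < L && i < 32; ++i)
        if ((mask >> i) & 1u)
            result += Y[i];
    return result;
}
```

The packing step only pays off if X is packed once and reused, or is stored as a bitmask in the first place.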
Another relevant question is whether your compiler will generate optimal loads for Y. This is highly CPU and compiler dependent, but in general it helps if the compiler can see precisely which values are needed when. You could manually unroll the loop. However, if L is a constant, leave it to the compiler:
template<int I> inline void calcZ(int (&X)[L], int(&Y)[L], int &Z) {
Z += X[I] * Y[I]; // Essentially free, as it operates in parallel with loads.
calcZ<I-1>(X,Y,Z);
}
template< > inline void calcZ<0>(int (&X)[L], int(&Y)[L], int &Z) {
Z += X[0] * Y[0];
}
inline int calcZ(int (&X)[L], int(&Y)[L]) {
int Z = 0;
calcZ<L-1>(X,Y,Z);
return Z;
}
(Konrad Rudolph questioned this in a comment, wondering about memory use. Memory use is not the real bottleneck on modern computer architectures; the bandwidth between memory and CPU is. This answer is almost irrelevant if Y is somehow already in cache.)
You can store your bit vector as a sequence of ints where each int packs a couple of coefficients as bits. Then, the component-wise multiplication is equivalent to bit-and. With this you simply need to count the number of set bits which could be done like this:
inline int count(uint32_t x) {
// see link
}
int dot(uint32_t a, uint32_t b) {
return count(a & b);
}
For a bit hack to count the set bits see http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
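For completeness, here is a sketch of what the counting function could look like, using the well-known parallel bit count for 32-bit words. The names count_bits and dot_bits are hypothetical, chosen so they do not clash with the placeholders above; many compilers also offer a builtin for this, which is not shown here.

```cpp
#include <cstdint>

// Parallel bit count for a 32-bit word: sum adjacent bits into 2-bit
// fields, then 4-bit fields, then bytes, then add the four bytes.
inline int count_bits(std::uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                 // 2-bit partial sums
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); // 4-bit partial sums
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 // per-byte sums
    return static_cast<int>((v * 0x01010101u) >> 24); // total in top byte
}

// Dot product of two {0,1} vectors packed one coefficient per bit.
inline int dot_bits(std::uint32_t a, std::uint32_t b)
{
    return count_bits(a & b);
}
```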
Edit: Sorry I just realized only one of the vectors contains elements of {0,1} and the other one doesn't. This answer only applies to the case where both vectors are limited to coefficients from the set of {0,1}.
Represent X as a linked list of the positions i where X[i] = 1.
To find the required sum you then need O(N) operations, where N is the size of the list.
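A minimal sketch of this idea follows. Note that it swaps the linked list for a contiguous std::vector of indices, which is my substitution (it keeps the same O(N) cost but is friendlier to the cache); set_positions and dot_sparse are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: record the positions i where X[i] is non-zero.
inline std::vector<std::size_t> set_positions(const int* X, std::size_t L)
{
    std::vector<std::size_t> ones;
    for (std::size_t i = 0; i < L; ++i)
        if (X[i] != 0)
            ones.push_back(i);
    return ones;
}

// Sum Y at the recorded positions: one addition per set element of X.
inline int dot_sparse(const std::vector<std::size_t>& ones, const int* Y)
{
    int result = 0;
    for (std::size_t k = 0; k < ones.size(); ++k)
        result += Y[ones[k]];
    return result;
}
```

This is only likely to win when X is sparse and the position list is built once and reused against many Y vectors.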
Well, you want all the bits of Y to get through if X is a 1 and none if it's a 0. So you want to somehow turn 1 into -1 (i.e. 0xffffffff) while 0 stays the same. That's just -X, so you do ...
Y & (-X)
for each element ... job done?
Edit2: To give a code example you can do something like this and avoid the branch:
int result=0;
for ( int i = 0; i < L; i++ )
{
result+=Y[i] & -(int)((X >> i) & 1);
}
Of course you'd be best off keeping the 1s and 0s in an array of ints and therefore avoiding the shifts.
Edit: It's also worth noting that if the values in Y are 16 bits in size then you can do 2 of these AND operations per instruction (4 if you have 64-bit registers). It does mean negating the X values one by one into a larger integer, though.
I.e. with Y values of -4 and 3, which in 16-bit are 0xFFFC and 0x0003, packing them into one 32-bit word gives 0xFFFC0003. If the X values are 1 and 0 then you form a bit mask of 0xFFFF0000, AND the two together, and you've got 2 results in 1 bitwise-AND op.
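The arithmetic in that example can be checked with a small sketch that uses explicit shifts rather than a union. The helper names pack16, mask16 and sum_halves are hypothetical, and the code assumes the Y values fit in 16 bits.

```cpp
#include <cstdint>

// Pack two 16-bit values into one 32-bit word, hi in the top half.
inline std::uint32_t pack16(std::int16_t hi, std::int16_t lo)
{
    return (std::uint32_t(std::uint16_t(hi)) << 16) | std::uint16_t(lo);
}

// Build the mask word: each X bit becomes 0x0000 or 0xFFFF in its half.
inline std::uint32_t mask16(unsigned x_hi, unsigned x_lo)
{
    return pack16(std::int16_t(-int(x_hi & 1u)), std::int16_t(-int(x_lo & 1u)));
}

// Add the two signed 16-bit halves of a packed word.
// Assumption: narrowing to int16_t wraps modulo 2^16 (guaranteed since
// C++20, and what mainstream compilers did before that).
inline int sum_halves(std::uint32_t w)
{
    return std::int16_t(w >> 16) + std::int16_t(w & 0xFFFFu);
}
```

With these, pack16(-4, 3) is 0xFFFC0003, mask16(1, 0) is 0xFFFF0000, and the sum of the halves of their AND is -4, matching the example.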
Another edit:
If you want code for the 2nd method, something like this should work. (It relies on type punning through a union, which C permits but the C++ standard leaves undefined, so it may not work on every compiler; it has worked on every compiler I've come across, though.)
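union int3264
{
int32_t i32;
int16_t i16[2];
};
int result=0;
int i; // declared outside the loop so the odd-element tail below can use it
for ( i = 0; i < (L & ~0x1); i += 2 )
{
// pack two 16-bit Y values into one 32-bit word
int3264 y3264;
y3264.i16[0] = Y[i + 0];
y3264.i16[1] = Y[i + 1];
// expand two X bits into two 16-bit masks (0x0000 or 0xFFFF)
int3264 x3264;
x3264.i16[0] = -(int16_t)((X >> (i + 0)) & 1);
x3264.i16[1] = -(int16_t)((X >> (i + 1)) & 1);
// one 32-bit AND masks both values at once
int3264 res3264;
res3264.i32 = y3264.i32 & x3264.i32;
result += res3264.i16[0] + res3264.i16[1];
}
// handle the last element when L is odd
if ( i < L )
result+=Y[i] & -(int)((X >> i) & 1);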
Hopefully the compiler will optimise out the assignments (off the top of my head I'm not sure, but the idea could be re-worked so that they definitely are) and give you a small speed-up in that you now only need to do 1 bitwise-AND instead of 2. The speed-up would be minor, though.