Modulo vs if statement in a loop - c++

I have a loop like this:
for(i = 0; i < arrayLength; i++)
{
result[i / 8] = SETBIT(result[i / 8], ++(*bitIndex), array[i]);
*bitIndex %= 8;
}
I wonder what is better, performance-wise, if I use the above style, or this style:
for(i = 0; i < arrayLength; i++)
{
result[i / 8] = SETBIT(result[i / 8], ++(*bitIndex), array[i]);
if(*bitIndex == 8) *bitIndex = 0;
}
This is C, compiled with GCC; an explanation would be appreciated as well.
Thanks

It doesn't make any sense to talk about optimization...
without a specific target system in mind
before you have ensured that all compiler optimizations are enabled and work
before you have performed some kind of benchmarking
In your case, for most systems, there will be no difference between the two examples. What you really should focus on is writing readable code, rather than the opposite. Consider rewriting your program into something like this:
for(i = 0; i < arrayLength; i++)
{
(*bitIndex)++;
SETBIT(&result[i / 8], *bitIndex, array[i]);
*bitIndex %= 8;
}
This will likely yield exactly the same binary executable as what you already had, give or take a few CPU ticks.

In all honesty, there will only be a fraction of a second's difference between them. But I believe the first is slightly better for plain readability. As for performance, which this question asks about, I still think the first has an ever-so-slight edge, since the compiler is reading and computing slightly less.

You don't say why bitIndex is a pointer. But, assuming that bitIndex is a parameter to the function containing the given loop, I would pull the *bitIndex stuff out of the loop, so:
unsigned bi = *bitIndex;
for(i = 0; i < arrayLength; i++)
{
result[i / 8] = SETBIT(result[i / 8], ++bi, array[i]);
bi %= 8;
}
*bitIndex = bi;
[I assume that the ++bi is correct, and that SETBIT() requires 1..8, where bi is 0..7.]
As noted elsewhere, the compiler has the % 8 for breakfast and replaces it by & 0x7 (for unsigned, but not for signed). With gcc -O2, the above produced a loop of 48 bytes of code (15 instructions). Fiddling with *bitIndex in the loop was 50 bytes of code (16 instructions), including a read and two writes of *bitIndex.
How much actual difference this makes is anybody's guess... it could be that the memory read and writes are completely subsumed by the rest of the loop.
If bitIndex is a pointer to a local variable, then the compiler will pull the value into a register for the duration of the loop all on its own -- so it evidently thinks it's worth doing!

Related

Understanding a C++ function for reversing a string using ^=

The code below returns a reversed string. For example, it takes the input "codebyte" and returns "etybedoc".
string FirstReverse(string str) {
for(int i = 0, j = str.length() - 1; i < str.length() / 2; i++, j--)
{
str[i]^=str[j]^=str[i]^=str[j];
}
return str;
}
I am lost as to how this function works:
Why is the ^= operator being used? It is a bitwise operator, but why is it being used here?
Why is str.length() divided by 2 in the for loop?
What is with the alteration of str[i] and str[j]?
I want to work through it with values but I don't know where to begin. The introductory textbook I used did not cover this.
As an answer:
It's a swapping functionality similar to the famous bit-twiddling hacks.
A detailed explanation of this swapping mechanism can be found here.
The length is divided by two because otherwise you would undo every swap and end up with the original string again.
The indices i and j run against each other (from the beginning or end, respectively).

How to make this code shorter to do faster

I am a new C++ learner. I logged in to the Codeforces site, and this is question 11A:
A sequence a[0], a[1], ..., a[t-1] is called increasing if a[i-1] < a[i] for each i: 0 < i < t.
You are given a sequence b[0], b[1], ..., b[n-1] and a positive integer d. In each move you may choose one element of the given sequence and add d to it. What is the least number of moves required to make the given sequence increasing?
Input
The first line of the input contains two integers n and d (2 ≤ n ≤ 2000, 1 ≤ d ≤ 10^6). The second line contains the space-separated sequence b[0], b[1], ..., b[n-1] (1 ≤ b[i] ≤ 10^6).
Output the minimal number of moves needed to make the sequence increasing.
I write this code for this question:
#include <iostream>
using namespace std;
int main()
{
long long int n,d,ci,i,s;
s=0;
cin>>n>>d;
int a[n];
for(ci=0;ci<n;ci++)
{
cin>>a[ci];
}
for(i=0;i<(n-1);i++)
{
while(a[i]>=a[i+1])
{
a[i+1]+=d;
s+=1;
}
}
cout<<s;
return 0;
}
It works correctly, but one of the Codeforces tests feeds in 2000 numbers, the time limit is 1 second, and my program takes longer than that.
How to make this code shorter to calculate faster?
One improvement that can be made is to use
std::ios_base::sync_with_stdio(false);
By default, cin/cout spend time synchronizing themselves with the C library's stdio buffers, so that you can freely intermix calls to scanf/printf with operations on cin/cout. Turning this off with the above call should make the input and output operations in the program faster, since they no longer maintain that synchronization.
This is known to have helped in previous coding challenges with tight time limits, where C++ input/output was the bottleneck.
You can get rid of the while loop. Your program should run faster without it:
#include <iostream>
using namespace std;
int main()
{
long int n,d,ci,i,s;
s=0;
cin>>n>>d;
int a[n];
for(ci=0;ci<n;ci++)
{
cin>>a[ci];
}
for(i=0;i<(n-1);i++)
{
if(a[i]>=a[i+1])
{
int x = ((a[i] - a[i+1])/d) + 1;
s+=x;
a[i+1]+=x*d;
}
}
cout<<s;
return 0;
}
This is not a complete answer, but a hint.
Suppose our seqence is {1000000, 1} and d is 2.
To make an increasing sequence, we need to make the second element 1,000,001 or greater.
We could do it your way, by repeatedly adding 2 until we get past 1,000,000
1 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + ...
which would take a while, or we could say
Our goal is 1,000,001
We have 1
The difference is 1,000,000
So we need to do 1,000,000 / 2 = 500,000 additions
So the answer is 500,000.
Which is quite a bit faster, because we only did 1 addition (1,000,000 + 1), one subtraction (1,000,001 - 1) and one division (1,000,000 / 2) instead of doing half a million additions.
As @molbdnilo said, use math to get rid of the loop; it's simple.
Here is my code, accepted on Codeforces.
#include <iostream>
using namespace std;
int main()
{
int n = 0 , b = 0;
int a[2001];
cin >> n >> b;
for(int i = 0 ; i < n ; i++){
cin >> a[i];
}
int sum = 0;
for(int i = 0 ; i < n - 1 ; i++){
if(a[i] >= a[i + 1]){
int minus = a[i] - a[i+1];
int diff = minus / b + 1;
a[i+1] += diff * b;
sum += diff;
}
}
cout << sum << endl;
return 0;
}
I suggest you profile your code to see where the bottlenecks are.
One of the popular areas of time wasting is input. The fewer input requests you make, the faster your program will be.
So, you could speed up your program by reading from cin using read() into a buffer and then parse the buffer using istringstream.
Other techniques include loop unrolling and optimizing for the data cache. Reducing the number of branches or if statements will also speed up your programs. Processors prefer crunching and moving data to jumping around to different areas in the code.

C++ code translation and explanation

I have the following C++ code snippet. I have a basic understanding of C++ code. Please correct my explanation of the following code wherever necessary:
for (p = q->prnmsk, s = savedx->msk, j = sizeof(q->prnmsk);
j && !(*p & *s); j--, p++, s++);
What it contains: q is declared as char *q, but per the code it points to a structure of type MSK.
q->prnmsk contains byte data; prnmsk holds 15 bytes.
It is similar for s.
So in the for loop, as j decreases, it goes through each byte and evaluates !(*p & *s); the loop continues while that condition holds and exits as soon as it fails, otherwise j runs down to 0.
Am I correct? What do *p and *s mean? Will they contain the byte values?
Some (like me) might think that the following is more readable:
int j;
for (j = 0; j < sizeof(q->prnmsk); ++j)
{
if ((q->prnmsk[j] & savedx->msk[j]) != 0) break;
}
This iterates over q->prnmsk and savedx->msk to find the first position where bit-ANDing the two bytes is not zero. If j equals sizeof(q->prnmsk) afterwards, all the bit-ANDs were zero.
Yes, you are right. !(*p & *s) means that they want to check if q->prnmsk and savedx->msk don't have corresponding bits set to 1 simultaneously.

Fast dot product for a very special case

Given a vector X of size L, where every element of X is from the binary set {0,1}, and a vector Y of size L with integer-valued elements, find the dot product z = dot(X, Y). I suspect there must be a very fast way to do this.
Let's say we have L=4; X[L]={1, 0, 0, 1}; Y[L]={-4, 2, 1, 0} and we have to find z=X[0]*Y[0] + X[1]*Y[1] + X[2]*Y[2] + X[3]*Y[3] (which in this case will give us -4).
It is obvious that X can be represented using binary digits, e.g. an integer type int32 for L=32. Then all we have to do is find the dot product of this integer with an array of 32 integers. Do you have any ideas or suggestions for how to do this very fast?
This really would require profiling but an alternative you might want to consider:
int result=0;
int mask=1;
for ( int i = 0; i < L; i++ ){
if ( X & mask ){
result+=Y[i];
}
mask <<= 1;
}
Typically bit shifting and bitwise operations are faster than multiplication; however, the if statement might be slower than a multiplication, although with branch prediction and large L my guess is it might be faster. You would really have to profile it to determine whether it results in any speedup.
As has been pointed out in the comments below, unrolling the loop either manually or via a compiler flag (such as "-funroll-loops" on GCC) could also speed this up (eliding the loop condition).
Edit
In the comments below, the following good tweak has been proposed:
int result=0;
for ( int i = 0; i < L; i++ ){
if ( X & 1 ){
result+=Y[i];
}
X >>= 1;
}
Is a suggestion to look into SSE2 helpful? It has dot-product type operations already, plus you can trivially do 4 (or perhaps 8, I forget the register size) simple iterations of your naive loop in parallel.
SSE also has some simple logic-type operations so it may be able to do additions rather than multiplications without using any conditional operations... again you'd have to look at what ops are available.
Try this:
int result=0;
for ( int i = 0; i < L; i++ ){
result+=Y[i] & (~(((X>>i)&1)-1));
}
This avoids a conditional statement and uses bitwise operators to mask the scalar value with either zeros or ones.
Since size explicitly doesn’t matter, I think the following is probably the most efficient general-purpose code:
int result = 0;
for (size_t i = 0; i < 32; ++i)
result += Y[i] & -X[i];
Bit-encoding X just doesn't bring anything to the table (even if the loop may potentially terminate earlier, as @Mathieu correctly noted). But omitting the if inside the loop does.
Of course, loop unrolling can speed this up drastically, as others have noted.
This solution is identical to, but slightly faster (by my test) than, Micheal Aaron's:
long Lev = 1;
long Result = 0;
for (int i = 0; i < L; i++) {
if (X & Lev)
Result += Y[i];
Lev *= 2;
}
I thought there was a numerical way to rapidly establish the next set bit in a word, which should improve performance if your X data is very sparse, but I currently cannot find said numerical formulation.
I've seen a number of responses with bit trickery (to avoid branching) but none got the loop right imho :/
Optimizing @Goz's answer:
int result=0;
for (int i = 0, x = X; x > 0; ++i, x>>= 1 )
{
result += Y[i] & -(int)(x & 1);
}
Advantages:
no need to do i bit-shifting operations each time (X>>i)
the loop stops sooner if X contains 0 in higher bits
Now, I do wonder if it runs faster, especially since the premature stop of the for loop might not be as easy for loop unrolling (compared to a compile-time constant).
How about combining a shifting loop with a small lookup table?
int result=0;
for ( int x=X; x!=0; x>>=4 ){
switch (x&15) {
case 0: break;
case 1: result+=Y[0]; break;
case 2: result+=Y[1]; break;
case 3: result+=Y[0]+Y[1]; break;
case 4: result+=Y[2]; break;
case 5: result+=Y[0]+Y[2]; break;
case 6: result+=Y[1]+Y[2]; break;
case 7: result+=Y[0]+Y[1]+Y[2]; break;
case 8: result+=Y[3]; break;
case 9: result+=Y[0]+Y[3]; break;
case 10: result+=Y[1]+Y[3]; break;
case 11: result+=Y[0]+Y[1]+Y[3]; break;
case 12: result+=Y[2]+Y[3]; break;
case 13: result+=Y[0]+Y[2]+Y[3]; break;
case 14: result+=Y[1]+Y[2]+Y[3]; break;
case 15: result+=Y[0]+Y[1]+Y[2]+Y[3]; break;
}
Y+=4;
}
The performance of this will depend on how good the compiler is at optimising the switch statement, but in my experience they are pretty good at that nowadays....
There is probably no general answer to this question. You need to profile your code under all the different targets. Performance will depend on compiler optimizations such as loop unwinding and SIMD instructions that are available on most modern CPUs (x86, PPC, ARM all have their own implementations).
For small L, you can use a switch statement instead of a loop. For example, if L = 8, you could have:
int dot8(unsigned int X, const int Y[])
{
switch (X)
{
case 0: return 0;
case 1: return Y[0];
case 2: return Y[1];
case 3: return Y[0]+Y[1];
// ...
case 255: return Y[0]+Y[1]+Y[2]+Y[3]+Y[4]+Y[5]+Y[6]+Y[7];
}
assert(0 && "X too big");
}
And if L = 32, you can write a dot32() function which calls dot8() four times, inlined if possible. (If your compiler refuses to inline dot8(), you could rewrite dot8() as a macro to force inlining.) Added:
int dot32(unsigned int X, const int Y[])
{
return dot8(X >> 0 & 255, Y + 0) +
dot8(X >> 8 & 255, Y + 8) +
dot8(X >> 16 & 255, Y + 16) +
dot8(X >> 24 & 255, Y + 24);
}
This solution, as mikera points out, may have an instruction cache cost; if so, using a dot4() function might help.
Further update: This can be combined with mikera's solution:
static int dot4(unsigned int X, const int Y[])
{
switch (X)
{
case 0: return 0;
case 1: return Y[0];
case 2: return Y[1];
case 3: return Y[0]+Y[1];
//...
case 15: return Y[0]+Y[1]+Y[2]+Y[3];
}
}
Looking at the resulting assembler code with the -S -O3 options with gcc 4.3.4 on CYGWIN, I'm slightly surprised to see that this is automatically inlined within dot32(), with eight 16-entry jump-tables.
But adding __attribute__((__noinline__)) seems to produce nicer-looking assembler.
Another variation is to use fall-throughs in the switch statement, but gcc adds jmp instructions, and it doesn't look any faster.
Edit--Completely new answer: After thinking about the 100 cycle penalty mentioned by Ants Aasma, and the other answers, the above is likely not optimal. Instead, you could manually unroll the loop as in:
int dot(unsigned int X, const int Y[])
{
return (Y[0] & -!!(X & 1<<0)) +
(Y[1] & -!!(X & 1<<1)) +
(Y[2] & -!!(X & 1<<2)) +
(Y[3] & -!!(X & 1<<3)) +
//...
(Y[31] & -!!(X & 1<<31));
}
This, on my machine, generates 32 x 5 = 160 fast instructions. A smart compiler could conceivably unroll the other suggested answers to give the same result.
But I'm still double-checking.
result = 0;
for(int i = 0; i < L ; i++)
if(X[i]!=0)
result += Y[i];
It's quite likely that the time spent to load X and Y from main memory will dominate. If this is the case for your CPU architecture, the algorithm is faster when loading less. This means that storing X as a bitmask and expanding it into L1 cache will speed up the algorithm as a whole.
Another relevant question is whether your compiler will generate optimal loads for Y. This is highly CPU- and compiler-dependent. But in general, it helps if the compiler can see precisely which values are needed when. You could manually unroll the loop. However, if L is a constant, leave it to the compiler:
template<int I> inline void calcZ(int (&X)[L], int(&Y)[L], int &Z) {
Z += X[I] * Y[I]; // Essentially free, as it operates in parallel with loads.
calcZ<I-1>(X,Y,Z);
}
template< > inline void calcZ<0>(int (&X)[L], int(&Y)[L], int &Z) {
Z += X[0] * Y[0];
}
inline int calcZ(int (&X)[L], int(&Y)[L]) {
int Z = 0;
calcZ<L-1>(X,Y,Z);
return Z;
}
(Konrad Rudolph questioned this in a comment, wondering about memory use. That's not the real bottleneck in modern computer architectures; bandwidth between memory and CPU is. This answer is almost irrelevant if Y is somehow already in cache.)
You can store your bit vector as a sequence of ints, where each int packs several coefficients as bits. Then the component-wise multiplication is equivalent to a bit-AND. With this you simply need to count the number of set bits, which can be done like this:
inline int count(uint32_t x) {
// parallel bit count from the linked bithacks page
x = x - ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
return (((x + (x >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}
int dot(uint32_t a, uint32_t b) {
return count(a & b);
}
For a bit hack to count the set bits see http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
Edit: Sorry I just realized only one of the vectors contains elements of {0,1} and the other one doesn't. This answer only applies to the case where both vectors are limited to coefficients from the set of {0,1}.
Represent X using a linked list of the positions where X[i] = 1.
To find the required sum you need O(N) operations, where N is the size of the list.
Well, you want all the bits to pass through if X is 1 and none if it is 0. So you want to somehow turn 1 into -1 (i.e. 0xffffffff) while 0 stays the same. That's just -X... so you do:
Y & (-X)
for each element ... job done?
Edit2: To give a code example you can do something like this and avoid the branch:
int result=0;
for ( int i = 0; i < L; i++ )
{
result+=Y[i] & -(int)((X >> i) & 1);
}
Of course you'd be best off keeping the 1s and 0s in an array of ints and therefore avoiding the shifts.
Edit: It's also worth noting that if the values in Y are 16 bits in size, then you can do 2 of these AND operations per operation (4 if you have 64-bit registers). It does mean negating the X values one by one into a larger integer, though.
i.e. Y values of -4, 3 in 16-bit are 0xFFFC, 0x0003; packed into one 32-bit word you get 0xFFFC0003. If you have 1, 0 as the X values, you form a bit mask of 0xFFFF0000, AND the two together, and you've got 2 results in 1 bitwise-AND op.
Another edit:
If you want code for the second method, something like this should work (though it takes advantage of unspecified behaviour, so it may not work on every compiler; it works on every compiler I've come across, though).
#include <stdint.h>

union int1632
{
int32_t i32;
int16_t i16[2];
};

int result = 0;
int i;
for (i = 0; i < (L & ~0x1); i += 2)
{
union int1632 y1632;
y1632.i16[0] = Y[i + 0];
y1632.i16[1] = Y[i + 1];

union int1632 x1632;
x1632.i16[0] = -(int16_t)((X >> (i + 0)) & 1);
x1632.i16[1] = -(int16_t)((X >> (i + 1)) & 1);

union int1632 res1632;
res1632.i32 = y1632.i32 & x1632.i32;
result += res1632.i16[0] + res1632.i16[1];
}
if (i < L)  // odd L: handle the last element on its own
result += Y[i] & -(int)((X >> i) & 1);
Hopefully the compiler will optimise out the assigns (off the top of my head I'm not sure, but the idea could be reworked so that they definitely are) and give you a small speed-up, in that you now only need to do 1 bitwise-AND instead of 2. The speed-up would be minor, though...

Why does this take so long to compile in VCC 2003?

My team need the "Sobol quasi-random number generator" - a common RNG which is famous for good quality results and speed of operation. I found what looks like a simple C implementation on the web. At home I was able to compile it almost instantaneously using my Linux GCC compiler.
The following day I tried it at work: If I compile in Visual Studio in debug mode it takes about 1 minute. If I were to compile it in release mode it takes about 40 minutes.
Why?
I know that "release" mode triggers some compiler optimization... but how on earth could a file this small take so long to optimize? It's mostly comments and static-data. There's hardly anything worth optimizing.
None of these PCs are particularly slow, and in any case I know that the compile time is consistent across a range of Windows computers. I've also heard that newer versions of Visual Studio have a faster compile time, however for now we are stuck with Visual Studio.Net 2003. Compiling on GCC (the one bundled with Ubuntu 8.04) always takes microseconds.
To be honest, I'm not really sure the code's that good. It has a nasty smell to it. Namely, this function:
unsigned int i4_xor ( unsigned int i, unsigned int j )
//****************************************************************************80
//
// Purpose:
//
// I4_XOR calculates the exclusive OR of two integers.
//
// Modified:
//
// 16 February 2005
//
// Author:
//
// John Burkardt
//
// Parameters:
//
// Input, unsigned int I, J, two values whose exclusive OR is needed.
//
// Output, unsigned int I4_XOR, the exclusive OR of I and J.
//
{
unsigned int i2;
unsigned int j2;
unsigned int k;
unsigned int l;
k = 0;
l = 1;
while ( i != 0 || j != 0 )
{
i2 = i / 2;
j2 = j / 2;
if (
( ( i == 2 * i2 ) && ( j != 2 * j2 ) ) ||
( ( i != 2 * i2 ) && ( j == 2 * j2 ) ) )
{
k = k + l;
}
i = i2;
j = j2;
l = 2 * l;
}
return k;
}
There's an i8_xor too. And a couple of abs functions.
I think a post to the DailyWTF is in order.
EDIT: For the non-C programmers, here's a quick guide to what the above does:
function xor i:unsigned, j:unsigned
answer = 0
bit_position = 1
while i <> 0 or j <> 0
if least significant bit of i <> least significant bit of j
answer = answer + bit_position
end if
bit_position = bit_position * 2
i = i / 2
j = j / 2
end while
return answer
end function
To determine if the least significant bit is set or cleared, the following is used:
bit set if i <> (i / 2) * 2
bit clear if i == (i / 2) * 2
What makes the code extra WTFy is that C defines an XOR operator, namely '^'. So, instead of:
result = i4_xor (a, b);
you can have:
result = a ^ b; // no function call at all!
The original programmer really should have known about the XOR operator. But even if they didn't (and granted, it's another obfuscated C symbol), their implementation of an XOR function is unbelievably poor.
I'm using VC++ 2003 and it compiled instantly in both debug/release modes.
Edit:
Do you have the latest service pack installed on your systems?
I would recommend you download a trial edition of Visual Studio 2008 and try the compile there, just to see if the problem is inherent. Also, if it does happen on a current version, you would be able to report the problem, and Microsoft might fix it.
On the other hand, there is no chance that Microsoft will fix whatever bug is in VS2003.