Array reverse - XOR slower than swapping with a temporary object - C++

I was reading a question about the fastest way to reverse an array (which ended up being less than thrilling), and I came across an interesting comment here:
https://stackoverflow.com/a/1129028/857994
The solution referenced shows these two possibilities:
//Possibility #1
void reverse(char word[])
{
    int len = strlen(word);
    char temp;
    for (int i = 0; i < len / 2; i++)
    {
        temp = word[i];
        word[i] = word[len - i - 1];
        word[len - i - 1] = temp;
    }
}
//Possibility #2
void reverse(char word[])
{
    int len = strlen(word);
    for (int i = 0; i < len / 2; i++)
    {
        word[i] ^= word[len - i - 1];
        word[len - i - 1] ^= word[i];
        word[i] ^= word[len - i - 1];
    }
}
and the comment states: "Using XOR will be far slower than swapping using a temp object."
Nobody disputed this. So, my questions are:
Is this true?
Why is it true?
Would it still be true if this was an array of a non-built-in-type?

The XOR loop does 2 memory reads and 1 memory write per line, for a total of 6 reads and 3 writes per iteration. Furthermore, there is a true dependency between the first line, which writes to word[i], and the next line, which reads from word[i]: the second line's read of word[i] must stall until the first line's write completes, which prevents pipelining. There is another such dependency between the second and third lines.
In the temp-variable loop, temp will almost certainly live in a CPU register, not in main memory, so the total memory I/O for one iteration is 2 reads and 2 writes. The dependencies between its statements are loose read-before-write dependencies, which pipeline well; the XOR version's dependencies are read-after-write, which are much harder to execute without stalling the pipeline.
6 reads + 3 writes versus 2 reads + 2 writes: the temp version has a distinct advantage.
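As an aside, neither loop is what you would write in idiomatic C++ anyway. A minimal sketch (the function name is mine): std::reverse does the temp-object swap internally and gives the optimizer the best chance to keep the temporary in a register or to vectorize the whole loop.
#include <algorithm>
#include <cstring>

void reverse_idiomatic(char word[])
{
    // std::reverse swaps pairs through a temporary, like possibility #1
    std::reverse(word, word + std::strlen(word));
}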

Related

Verify the number of times a CUDA kernel is called

Say you have a CUDA kernel that you want to run 2048 times, so you define your kernel like this:
__global__ void run2048Times(){ }
Then you call it from your main code:
run2048Times<<<2,1024>>>();
All seems well so far. However, say that for debugging purposes, when you're calling the kernel millions of times, you want to verify that you're actually calling the kernel that many times.
What I did was pass a pointer to the kernel and increment the value it points to every time the kernel ran.
__global__ void run2048Times(int *kernelCount){
    kernelCount[0]++; // Increment the counter
}
However, when I copied that value back to the main function, I got "2".
At first it baffled me; then, after five minutes of coffee and pacing back and forth, I realized this probably makes sense, because the CUDA kernel runs 1024 instances of itself at the same time, which means the kernels overwrite kernelCount[0] instead of truly adding to it.
So instead I decided to do this:
__global__ void run2048Times(int *kernelCount){
    // Get the id of the kernel
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    // If the id is bigger than the stored count, overwrite it
    if(id > kernelCount[0]){
        kernelCount[0] = id;
    }
}
Genius!! This was guaranteed to work, I thought. Until I ran it and got all sorts of numbers between 0 and 2000.
Which tells me that the problem mentioned above still happens here.
Is there any way to do this, even if it involves forcing the kernels to pause and wait for each other to run?
Assuming this is a simplified example, and you are not in fact trying to do profiling (as others have already suggested) but want to use this in a more complex scenario, you can achieve the result you want with atomicAdd, which ensures that the increment is executed as a single atomic operation:
__global__ void run2048Times(int *kernelCount){
    atomicAdd(kernelCount, 1); // Atomically add 1 to the counter
}
Why your solutions didn't work:
The problem with your first solution is that it gets compiled into the following PTX code (see here for description of PTX instructions):
ld.global.u32 %r1, [%rd2];
add.s32 %r2, %r1, 1;
st.global.u32 [%rd2], %r2;
You can verify this by calling nvcc with the --ptx option to only generate the intermediate representation.
What can happen here is the following timeline, assuming you launch 2 threads (Note: this is a simplified example and not exactly how GPUs work, but it is enough to illustrate the problem):
thread 0 reads 0 from kernelCount
thread 1 reads 0 from kernelCount
thread 0 increases its local copy by 1
thread 0 stores 1 back to kernelCount
thread 1 increases its local copy by 1
thread 1 stores 1 back to kernelCount
and you end up with 1 even though 2 threads were launched.
Your second solution is wrong even if the threads are launched sequentially, because thread indexes are 0-based. So I'll assume you wanted to do this:
__global__ void run2048Times(int *kernelCount){
    // Get the id of the kernel
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    // If the id is bigger than the stored count, overwrite it
    if(id + 1 > kernelCount[0]){
        kernelCount[0] = id + 1;
    }
}
This will compile into:
ld.global.u32 %r5, [%rd1];
setp.lt.s32 %p1, %r1, %r5;
@%p1 bra BB0_2;
add.s32 %r6, %r1, 1;
st.global.u32 [%rd1], %r6;
BB0_2:
ret;
What can happen here is the following timeline:
thread 0 reads 0 from kernelCount
thread 1 reads 0 from kernelCount
thread 1 compares 0 to 1 + 1 and stores 2 into kernelCount
thread 0 compares 0 to 0 + 1 and stores 1 into kernelCount
You end up having the wrong result of 1.
I suggest you pick up a good parallel programming / CUDA book if you want to better understand problems with synchronization and non-atomic operations.
EDIT:
For completeness, the version using atomicAdd compiles into:
atom.global.add.u32 %r1, [%rd2], 1;
It seems like the only point of that counter is to do profiling (i.e. analyse how the code runs) rather than to actually count something (i.e. no functional benefit to the program).
There are profiling tools available designed for this task. For example, nvprof gives the number of calls, as well as some time metrics for each kernel in your codebase.
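For example, assuming your binary is called app (a name I made up), running
nvprof ./app
prints a per-kernel summary table whose Calls column is exactly the number of times each kernel was launched.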

Strangely, the for-loop counter variable gets decreased by .get()

Consider the following piece of code. This function reads some integers and strings from a file.
const int vardo_ilgis = 10;

void skaityti(int &n, int &m, int &tiriama, avys A[])
{
    ifstream fd("test.txt");
    fd >> n >> m >> tiriama;
    fd.ignore(80, '\n');
    char vard[vardo_ilgis]; // <---
    for(int i = 1; i <= n; i++)
    {
        cout << i << ' ';
        fd.get(vard, vardo_ilgis+1); // <---
        cout << i << endl;
        A[i].vardas = vard;
        getline(fd, A[i].DNR);
    }
    fd.close();
}
and input:
4 6
4
Baltukas TAGCTT
Bailioji ATGCAA
Doli AGGCTC
Smarkuolis AATGAA
Here the buffer vard has length vardo_ilgis = 10, but fd.get is asked to read up to vardo_ilgis+1 = 11 bytes (ten characters plus the terminating '\0'), more than the buffer can hold. I'm not asking how to fix the problem; it's obvious that you shouldn't read more than you can store in a variable.
However, I really want to understand the reason for this behaviour: the loop counter gets decreased by fd.get. Why, and how, can this even happen? This is the output of this little piece of code:
1 0
1 0
1 0
1 0
1 1
2 2
3 3
4 4
Why did you use +1 ??
fd.get(vard, vardo_ilgis+1);
Overrunning that buffer corrupts some memory. In a simple unoptimized build, that corrupted memory can be the loop index.
the loop count variable gets decreased by fd.get. Why and how even can this happen?
Once you know you have caused undefined behavior, many people say you aren't supposed to inquire into its details. I disagree: by understanding the details, you can improve your ability to diagnose other situations where you don't know what undefined behavior you might have invoked.
All your local variables are stored together, so overwriting one will tend to clobber another.
You describe the variable being "decreased" when in fact it was set to zero. The fact that it was 1 before being zeroed didn't affect its being zeroed. The undefined behavior happened to be equivalent to i &= ~255;, which for values under 256 is the same as i = 0;. It is merely accidental that it looked like i--;.
Hopefully it is clear why i stopped being zeroed once you ran out of input.
fd.get(vard, vardo_ilgis+1); makes the buffer be written out of bounds.
In your case, the area you write to (and should not) is probably the same memory area where i is stored.
But what matters most is that you end up with the famous undefined behaviour, which means anything can happen. What happens is platform-, compiler- and even context-specific, so in general there is little point trying to predict it.
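For completeness, a minimal sketch of two possible fixes, assuming names can be up to vardo_ilgis characters long:
char vard[vardo_ilgis + 1];          // room for vardo_ilgis chars plus '\0'
fd.get(vard, vardo_ilgis + 1);       // get() never writes more than that

// or sidestep fixed-size buffers entirely:
string vardas;
fd >> vardas;                        // grows as needed, cannot overrun
A[i].vardas = vardas;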

How to make this code shorter so it runs faster

I am a new C++ learner. I logged in to the Codeforces site, and this is question 11A:
A sequence a[0], a[1], ..., a[t-1] is called increasing if a[i-1] < a[i] for each i with 0 < i < t.
You are given a sequence b[0], b[1], ..., b[n-1] and a positive integer d. In each move you may choose one element of the given sequence and add d to it. What is the least number of moves required to make the given sequence increasing?
Input
The first line of the input contains two integers n and d (2 ≤ n ≤ 2000, 1 ≤ d ≤ 10^6). The second line contains the space-separated sequence b[0], b[1], ..., b[n-1] (1 ≤ b[i] ≤ 10^6).
Output
Output the minimal number of moves needed to make the sequence increasing.
I write this code for this question:
#include <iostream>
using namespace std;

int main()
{
    long long int n, d, ci, i, s;
    s = 0;
    cin >> n >> d;
    int a[n];
    for(ci = 0; ci < n; ci++)
    {
        cin >> a[ci];
    }
    for(i = 0; i < (n - 1); i++)
    {
        while(a[i] >= a[i+1])
        {
            a[i+1] += d;
            s += 1;
        }
    }
    cout << s;
    return 0;
}
It works correctly, but one of the Codeforces server tests feeds it 2000 numbers with a time limit of 1 second, and my program takes longer than that.
How can I make this code calculate faster?
One improvement that can be made is to use
std::ios_base::sync_with_stdio(false);
By default, cin/cout waste time synchronizing themselves with the C library's stdio buffers, so that you can freely intermix calls to scanf/printf with operations on cin/cout. Turning this off with the call above makes the program's input and output operations faster, since they no longer keep the two sets of buffers in sync.
This is known to have helped in previous coding challenges that require code to complete within a certain time limit and where the C++ input/output was the bottleneck.
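A minimal sketch of where the call goes; untying cin from cout (an extra, closely related tweak, not part of the answer above) also avoids flushing the output before every read:
#include <iostream>

int main()
{
    std::ios_base::sync_with_stdio(false); // detach iostreams from C stdio
    std::cin.tie(nullptr);                 // don't flush cout before each cin read
    // ... rest of the program unchanged ...
}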
You can get rid of the while loop. Your program should run faster without it:
#include <iostream>
using namespace std;

int main()
{
    long int n, d, ci, i, s;
    s = 0;
    cin >> n >> d;
    int a[n];
    for(ci = 0; ci < n; ci++)
    {
        cin >> a[ci];
    }
    for(i = 0; i < (n - 1); i++)
    {
        if(a[i] >= a[i+1])
        {
            int x = ((a[i] - a[i+1]) / d) + 1;
            s += x;
            a[i+1] += x * d;
        }
    }
    cout << s;
    return 0;
}
This is not a complete answer, but a hint.
Suppose our seqence is {1000000, 1} and d is 2.
To make an increasing sequence, we need to make the second element 1,000,001 or greater.
We could do it your way, by repeatedly adding 2 until we get past 1,000,000:
1 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + ...
which would take a while, or we could say
Our goal is 1,000,001
We have 1
The difference is 1,000,000
So we need to do 1,000,000 / 2 = 500,000 additions
So the answer is 500,000.
That is quite a bit faster: we did one addition (1,000,000 + 1), one subtraction (1,000,001 - 1) and one division (1,000,000 / 2) instead of half a million additions.
As @molbdnilo said, use math to get rid of the loop; it's simple.
Here is my code, accepted on Codeforces.
#include <iostream>
using namespace std;

int main()
{
    int n = 0, b = 0;
    int a[2001];
    cin >> n >> b;
    for(int i = 0; i < n; i++){
        cin >> a[i];
    }
    int sum = 0;
    for(int i = 0; i < n - 1; i++){
        if(a[i] >= a[i + 1]){
            int minus = a[i] - a[i+1];
            int diff = minus / b + 1;
            a[i+1] += diff * b;
            sum += diff;
        }
    }
    cout << sum << endl;
    return 0;
}
I suggest you profile your code to see where the bottlenecks are.
One of the most popular areas of time-wasting is input: the fewer input requests, the faster your program will be.
So you could speed up your program by reading everything from cin with read() into a buffer and then parsing the buffer with an istringstream.
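A rough sketch of that idea, assuming the whole input fits in a fixed buffer (the 4 MB size is my guess, not part of the problem statement):
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    static char buf[1 << 22];                       // 4 MB; assumed large enough
    std::cin.read(buf, sizeof(buf));                // one bulk read of all the input
    std::istringstream in(std::string(buf, std::cin.gcount()));
    long long n, d;
    in >> n >> d;                                   // parse from memory, not the stream
    // ... read the rest of the sequence from 'in' instead of cin ...
}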
Other techniques include loop unrolling and optimizing for the data cache. Reducing the number of branches (if statements) will also speed up your program: processors prefer crunching data and moving it around to jumping between different areas of the code.

Time Limit Exceeded - Simple Program - Divisibility Test

Input
The input begins with two positive integers n and k (n, k ≤ 10^7). The next n lines of input each contain one positive integer ti, not greater than 10^9.
Output
Write a single integer to output, denoting how many of the integers ti are divisible by k.
Example
Input:
7 3
1
51
966369
7
9
999996
11
Output:
4
My Code:
#include <iostream>
using namespace std;

int main()
{
    long long n, k, i;
    cin >> n;
    cin >> k;
    int count = 0;
    for(i = 0; i < n; i++)
    {
        int z;
        cin >> z;
        if(z % k == 0) count++;
    }
    cout << count;
    return 0;
}
Now, this code produces the correct output. However, it's not being accepted by CodeChef (http://www.codechef.com/problems/INTEST) for the following reason: Time Limit Exceeded. How can this be further optimized?
As caleb said, the problem is labeled "Enormous Input Test", so it requires some better/faster I/O method.
Just replacing cout with printf and cin with scanf will give you an AC, but to improve your execution time you need an even faster I/O method, for example reading character by character using getchar_unlocked().
So you can read the values using a function like the following for a better execution time:
#include <cstdio>   // getchar_unlocked, printf

inline int read(){
    char c = getchar_unlocked();
    int n = 0;
    while(!(c >= '0' && c <= '9'))      // skip anything before the first digit
        c = getchar_unlocked();
    while(c >= '0' && c <= '9'){        // accumulate decimal digits into n
        n = n * 10 + (c - '0');
        c = getchar_unlocked();
    }
    return n;
}
The linked problem contains the following description:
The purpose of this problem is to verify whether the method you are
using to read input data is sufficiently fast to handle problems
branded with the enormous Input/Output warning. You are expected to be
able to process at least 2.5MB of input data per second at runtime.
Considering that, reading values from input a few bytes at a time using iostreams isn't going to cut it. I googled around a bit and found a drop-in replacement for cin and cout described on CodeChef. Some other approaches you could try include using a memory-mapped file and using stdio.
It might also help to look for ways to optimize the calculation. For example, if ti < k, then you know that k is not a factor of ti. Depending on the magnitude of k and the distribution of ti values, that observation alone could save a lot of time.
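For example: since every ti is a positive integer, ti < k implies ti % k == ti, which is non-zero, so a short-circuit test skips the division entirely (whether this wins depends on the distribution of ti, as noted):
if (z >= k && z % k == 0) count++;   // the modulo only runs when z >= k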
Remember: the fact that your code is short doesn't mean that it's fast.

Need your input Project Euler Q 8

Is there a better way of doing this?
http://projecteuler.net/problem=8
I added a condition to check whether each digit is > 6 (this eliminates small products and zeros):
#include <iostream>
#include <math.h>
#include "bada.h"
using namespace std;

int main()
{
    int badanum[] { DATA };
    int pro = 0, highest = 0;
    for(int i = 0; i <= 996; ++i)
    {
        if (badanum[i] > 6 and badanum[i+1] > 6 and badanum[i+2] > 6 and badanum[i+3] > 6 and badanum[i+4] > 6)
        {
            pro = badanum[i] * badanum[i+1] * badanum[i+2] * badanum[i+3] * badanum[i+4];
            if(pro > highest)
            {
                cout << pro << " " << badanum[i] << badanum[i+1] << badanum[i+2] << badanum[i+3] << badanum[i+4] << endl;
                highest = pro;
            }
            pro = 0;
        }
    }
}
bada.h is just a file containing the 1000-digit number:
#define DATA <1000 digit number>
That if actually slows things down: it causes branching, which disrupts the parallel pipeline of CPU execution. Also, as mentioned before, it can invalidate the result; it does not matter that your solution happens to match the correct one (for different digits it might not).
On the algorithmic side you can do the following: if you have fast enough division, you can lower the number of computations.
char a[]="7316717653133062491922511967442657474235534919493496983520312774506326239578318016984801869478851843858615607891129494954595017379583319528532088055111254069874715852386305071569329096329522744304355766896648950445244523161731856403098711121722383113622298934233803081353362766142828064444866452387493035890729629049156044077239071381051585930796086670172427121883998797908792274921901699720888093776657273330010533678812202354218097512545405947522435258490771167055601360483958644670632441572215539753697817977846174064955149290862569321978468622482839722413756570560574902614079729686524145351004748216637048440319989000889524345065854122758866688116427171479924442928230863465674813919123162824586178664583591245665294765456828489128831426076900422421902267105562632111110937054421750694165896040807198403850962455444362981230987879927244284909188845801561660979191338754992005240636899125607176060588611646710940507754100225698315520005593572972571636269561882670428252483600823257530420752963450\0";
int i = 0, s = 0, m = 1, q;
for (i = 0; i < 4; i++)
{
    q = a[i] - '0'; if (q) m *= q;
}
for (i = 0; i < 996; i++)
{
    q = a[i+4] - '0'; if (q) m *= q;
    if (s < m) s = m;
    q = a[i] - '0'; if (q) m /= q;
}
You can also build lookup tables for the mul/div operations for speed (but this is not faster in all cases):
int mul_5digits[9*9*9*9*9+1][10] = { 0*0, 0*1, 0*2, ... , 9*9*9*9*9*9 };
int div_5digits[9*9*9*9*9+1][10] = { 0/0, 0/1, 0/2, ... , 9*9*9*9*9/9 };
// so a = b*c; is rewritten as a = mul_5digits[b][c];
// and a = b/c; is rewritten as a = div_5digits[b][c];
Of course, instead of the values 0*0 you have to store the neutral value 1, and instead of the values i/0 you have to store the neutral value i!
int i = 0, s = 0, t = 1;
for (i = 0; i < 4; i++)
{
    t = mul_5digits[t][a[i] - '0'];
}
for (i = 0; i < 996; i++)
{
    t = mul_5digits[t][a[i+4] - '0'];
    if (s < t) s = t;
    t = div_5digits[t][a[i] - '0'];
}
Run-time measurements on AMD 3.2 GHz, 64-bit Win7, 32-bit app, BDS2006 C++:
0.022 ms classic approach
0.013 ms single mul/div per step (produces a false output if no product > 0 is present)
0.054 ms tabled single mul/div per step (slower on my setup)
PS. All code improvements should be measured, so you can see whether you actually sped things up or not; what is faster on one compiler/platform/computer can be slower on another. Use at least 0.1 ms resolution. I prefer RDTSC or the PerformanceCounter for that.
Apart from the errors pointed out in the comments, that many multiplications aren't necessary. If you start with the product [0] * [1] * [2] * [3] * [4] for index 0, what is the product starting at [1]? The old result divided by [0] and multiplied by [5]. One division and one multiplication can be faster than four multiplications.
You don't need to store all the digits at once, just the current five of them (use an array with cyclic overwriting), plus one variable for the current best result and one for the latest product (see below). If the number of digits in the input grows, you won't get into any trouble with memory.
Also, you could check whether the oldest digit read equals zero. If it does, you really have to multiply all five current digits; but if not, a better way is to divide the previous product by the oldest digit and multiply it by the latest digit read.
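A hedged end-to-end sketch of that scheme (names are mine, and instead of re-multiplying all five digits whenever a zero retires, this variant keeps a count of zeros in the window, which has the same effect). It reads the digits from standard input:
#include <cstdio>

int main()
{
    const int W = 5;
    int win[W] = {0};                  // cyclic buffer of the last W digits
    long long prod = 1, best = 0;      // prod = product of the non-zero digits in the window
    int zeros = W;                     // how many of the W slots currently hold 0
    int c, pos = 0, seen = 0;

    while ((c = getchar()) != EOF) {
        if (c < '0' || c > '9') continue;          // skip newlines and other bytes
        int out = win[pos], in = c - '0';
        if (out == 0) zeros--; else prod /= out;   // retire the oldest digit
        if (in == 0) zeros++; else prod *= in;     // admit the newest digit
        win[pos] = in;
        pos = (pos + 1) % W;
        seen++;
        if (seen >= W && zeros == 0 && prod > best)
            best = prod;               // full window, no zeros: candidate answer
    }
    printf("%lld\n", best);
    return 0;
}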