OpenCL 1D range loop without knowledge of global size - c++

I was wondering how I can iterate over a loop with any number of work items (the number per group is irrelevant).
I have 3 arrays, and one of them is 2-dimensional (a matrix). The first array contains a set of integers. The matrix is filled with another set of (repeated and random) integers.
The third one only stores the results.
For each number in the first array, I need to find the farthest pair of occurrences of that number in the matrix.
To summarize:
A: Matrix with random numbers
num: Array with numbers to search in A
d: Array with maximum distances of pairs of each number from num
The algorithm is simple (I don't need to optimize it): I only compare calculated Manhattan distances and keep the maximum value.
To keep it simple, it does the following (C-like pseudo code):
for(number in num){
maxDistance = 0
for(row in A){
for(column in A){
//calculateDistance is a function to another nested loop like this
//it returns the max found distance if it is, and 0 otherwise
currentDistance = calculateDistance(row, column, max)
if(currentDistance > maxDistance){
maxDistance = currentDistance
}
}
}
}
As you can see, there is no data dependency between iterations. I tried assigning each work item a slice of the matrix A, but I am still not convinced by that approach.
IMPORTANT: The kernel must be executed with only one dimension for the problem.
Any ideas? How can I use the global id to run multiple searches at once?
Edit:
I added the code to clear away any doubt.
Here is the kernel:
__kernel void maxDistances(int N, __constant int *A, int n, __constant int *numbers, __global int *distances)
{
//N is matrix row and col size
//A the matrix
//n the total count of numbers to be searched
//numbers is the array containing the numbers
//distances is the array containing the computed distances
size_t id = get_global_id(0);
int slice = (N*N)/get_global_size(0);
for(int idx_num = 0; idx_num < n; idx_num++)
{
int number = numbers[idx_num];
int currentDistance = 0;
int maxDistance = 0;
for(int c = id*slice; c < (id+1)*slice; c++)
{
int i = c/N;
int j = c%N;
if(*CELL(A,N,i,j) == number){
coord_t coords;
coords.i = i;
coords.j = j;
//bestDistance is a function with 2 nested loops iterating over
//rows and columns to retrieve the farthest pair of the number
currentDistance = bestDistance(N,A,coords,number, maxDistance);
if(currentDistance > maxDistance)
{
maxDistance = currentDistance;
}
}
}
distances[idx_num] = maxDistance;
}
}
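One detail of the kernel above is that `slice = (N*N)/get_global_size(0)` truncates, so when N*N is not a multiple of the global size the last few cells are not assigned to any work item. The following is a minimal host-side C++ sketch (not OpenCL code) of one way to split a flattened 1D index range so that every cell is covered. The name `sliceFor` and its parameters are hypothetical and only illustrate the arithmetic.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

// Half-open range [first, last) of flattened cell indices for one work item.
// The first (totalCells % globalSize) work items each take one extra cell,
// so the ranges are contiguous and together cover every cell exactly once.
std::pair<std::size_t, std::size_t> sliceFor(std::size_t id,
                                             std::size_t globalSize,
                                             std::size_t totalCells) {
    const std::size_t base = totalCells / globalSize;   // cells every item gets
    const std::size_t extra = totalCells % globalSize;  // leftover cells
    const std::size_t first = id * base + std::min(id, extra);
    const std::size_t last = first + base + (id < extra ? 1 : 0);
    return {first, last};
}
```

Inside the kernel, the same arithmetic would use `get_global_id(0)` and `get_global_size(0)` in place of `id` and `globalSize`; each index c in the range still maps to row c/N and column c%N.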

This answer may be seen as incomplete, nevertheless, I am going to post it in order to close the question.
My problem was not the code, the kernel, or the algorithm; it was the machine. The above code is correct and works perfectly. When I tried my program on another machine, it executed and computed the solution with no problem at all.
So, in brief, the problem was the OpenCL device or most likely the host libraries.

Related

Converting C++ function to recursive function

I'm looking over a function and need to convert it to a dynamic programming form. But I'm having difficulty understanding the logic used in this function (what would be the base case?), the original author of this function is no longer available for questioning, I can't make heads or tails of his work and there is 0 documentation available.
Description:
This function takes in a matrix of positive integers and finds the maximum sum by
selecting one element from every column in the matrix, moving left-to-right. As you move through
the matrix column-by-column, there is a penalty to your sum depending on how you
move relative to your previous two positions. If the next row you select is between the previous two
selected rows, there is no penalty; however, there is a penalty of 2 to your sum for every row above
the maximum of the previous two or below the minimum of the previous two.
int calSum(int row, int cols, vector<vector<int>> inputArray, vector<int> *outputArray){
int ans[row][cols][row];
int index[row][cols][row];
int firstCol[row];
for(int i=0;i<row;i++){
firstCol[i]= inputArray[i][0] - 2*(i);
}
for(int i=0;i<row;i++){
for(int j=0;j<row;j++){
int penalty;
if(i<=j){
penalty=0;
}else{
penalty= 2* (i-j);
}
ans[i][1][j]= inputArray[i][1] - penalty+ firstCol[j];
}
}
for(int j=2;j<cols;j++){
for(int i=0;i<row;i++){
int nextRow= i;
for(int k=0;k<row;k++){
int currRow= k;
int ind=-1;
int maxVal= INT_MIN;
for(int l=0;l<row;l++){
int prevRow=l;
int max1= max(prevRow, currRow);
int min1= min(prevRow, currRow);
int penalty;
if(nextRow<=max1&&nextRow>= min1){
penalty=0;
}else if(nextRow>max1){
penalty= 2*(nextRow-max1);
}else{
penalty= 2*(min1-nextRow);
}
int val= -penalty+ inputArray[i][j] + ans[k][j-1][l];
if(val>maxVal){
maxVal=val;
ind=l;
}
}
ans[i][j][k]=maxVal;
index[i][j][k]=ind;
}
}
}
int max=INT_MIN;
int x=-1;
int y=-1;
for(int i=0;i<row;i++){
for(int j=0;j<row;j++){
if(ans[i][cols-1][j]>max){
max= ans[i][cols-1][j];
x=i;
y=j;
}
}
}
for(int j=cols-1;j>=2;j--) {
outputArray->push_back(x);
int temp=x;
x= y;
y= index[temp][j][y];
}
outputArray->push_back(x);
outputArray->push_back(y);
return max;
}
I have tried tracing the code and keep getting lost in the logic. A basic explanation of what this function is doing would be greatly appreciated.
The core data structure ans works as follows: ans[i][j][k] is the best score of any path that ends at row i in column j and passed through row k in column j-1. (Note this uses row, col notation to match the program.)
If we walk the code for-loop by for-loop:
The first for-loop calculates the score of each value in the first column, taking into account that every row i > 0 incurs a penalty of 2*i.
The second for-loop calculates ans[i][1][j], the best score up to the second column, given row j in the first column and row i in the second.
The third for-loop gradually expands ans to the right. For every column j > 1, it fills in ans[i][j][k] by finding the row l in column j-2 that maximizes the path through (l, j-2), (k, j-1) and (i, j). The first part can be read from ans[k][j-1][l]; the last step is calculated according to the rules given in the problem.
This loop also writes the optimal choice of l into the index array, so you can reconstruct the optimal path later.
The fourth for-loop simply finds the maximal path value and stores the rows chosen in the last two columns.
The final for-loop reconstructs the path by retracing steps through the index array.
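The penalty rule from the problem description is the piece the third loop evaluates for every (next, current, previous) triple. As a minimal sketch, it can be isolated into a helper; the name `movePenalty` is hypothetical and does not appear in the original function.

```cpp
#include <algorithm>

// Penalty for moving to nextRow when the two previously selected rows were
// prevRow1 and prevRow2: zero if nextRow lies between them (inclusive),
// otherwise 2 per row beyond the nearer bound.
int movePenalty(int nextRow, int prevRow1, int prevRow2) {
    const int hi = std::max(prevRow1, prevRow2);
    const int lo = std::min(prevRow1, prevRow2);
    if (nextRow > hi) return 2 * (nextRow - hi);
    if (nextRow < lo) return 2 * (lo - nextRow);
    return 0;
}
```

With such a helper, the inner expression of the third loop would read as `inputArray[i][j] + ans[k][j-1][l] - movePenalty(i, k, l)`, which may make the recurrence easier to see when converting it to a recursive form.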

How to trace error with counter in do while loop in C++?

I am trying to read an array of numbers, take the smaller of two numbers, store it in a variable, and then compare it with another variable that is again the smaller of two other numbers (like 2, -3).
There is something wrong in the way I implement the do-while loop. I need the counter 'i' to be updated twice per iteration so that I have 2 new variables from 4 compared numbers. When I hard-code it as n-1, n-2 it works, but with the loop it gets stuck at one value.
int i=0;
int closestDistance=0;
int distance=0;
int nextDistance=0;
do
{
distance = std::min(values[n],values[n-i]); //returns the largest
distance=abs(distance);
i++;
nextDistance=std::min(values[n],values[n-i]);
nextDistance=abs(closestDistance); //make it positive then comp
if(distance<nextDistance)
closestDistance=distance;//+temp;
else
closestDistance=nextDistance;
i++;
}
while(i<n);
return closestDistance;
Maybe this:
int i = 0;
int m = 0;
do{
int lMin = std::min(values[i],values[i + 1]);
i += 2;
int rMin = std::min(values[i], values[i + 1]);
m = std::min(lMin,rMin);
i += 2;
}while(i < n);
return m;
I didn't fully understand what you meant, but this compares the elements of values 4 at a time to find the minimum. Is that all you needed?
Note that if n is the size of values, this would go out of bounds. n would have to be the size minus 4, leading to odd exceptional cases.
The issue with your code may be in the call to abs. Are all the values positive? Are you trying to find the smallest absolute value?
Also, note that using i += 2 twice ensures that you do not repeat any values. This means that you will go over 4 unique values. Your code goes through 3 in each iteration of the loop.
I hope this clarifies things.
What are you trying to do in the following lines?
nextDistance=std::min(values[n],values[n-i]);
nextDistance=abs(closestDistance); //make it positive , then computed
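If the goal really is the smallest absolute value in the array (this is an assumption, since the question does not state it clearly), a minimal sketch without manual index bookkeeping could look like this. The function name `smallestAbs` is hypothetical, and the input is assumed to be non-empty.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Returns the smallest absolute value found in values.
// Assumes values is non-empty; dereferencing the result of
// std::min_element on an empty range would be undefined.
int smallestAbs(const std::vector<int>& values) {
    auto it = std::min_element(
        values.begin(), values.end(),
        [](int a, int b) { return std::abs(a) < std::abs(b); });
    return std::abs(*it);
}
```

Because the comparison is done on absolute values inside the lambda, there is no need to advance a counter twice per iteration or to worry about reading past the end of the array.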

Iterate through all combinations in Gray code order [duplicate]

This question already has answers here:
Gray code increment function
(4 answers)
Closed 8 years ago.
Let's say I have n integers in an array a, and I want to iterate through all possible subsets of these integers, find the sum, and then do something with it.
What I immediately did was create a bit field b, which indicates which numbers are included in the subset, and iterate through its possible values using ++b. Then, to compute the sum in each step, I had to iterate through all bits like this:
int sum = 0;
for (int i = 0; i < n; i++)
if (b&1<<i)
sum += a[i];
Then I realized that if I iterated through the possible values of b in Gray code order, so that each time only a single bit is flipped, I wouldn't have to reconstruct the sum completely, but would only need to add or subtract the single value that is being added to or removed from the subset. It should work like this:
int sum = 0;
int whichBitToFlip = 0;
bool isBitSet = false;
for (int k = 0; whichBitToFlip < n; k++) {
sum += (isBitSet ? -1 : 1)*a[whichBitToFlip];
// do something with sum here
whichBitToFlip = ???;
isBitSet = ???;
}
But I can't figure out how to directly and efficiently compute whichBitToFlip. The desired values are basically sequence A007814. I know that I can compute the Gray code using the formula (k>>1)^k and XOR it with the previous one, but then I need to find the position of the changed bit, which might not be much faster.
So is there any better way to determine these values (the index of the flipped bit), preferably without a loop, and faster than recomputing the whole sum (of at most 64 values) every time?
To convert a bitmask to a bit index, you can use the ffs function (if you have one), which corresponds to a machine opcode on some machines.
Otherwise, the bit changed in the gray code corresponds to the ruler function:
0, 1, 0, 2, 0, 1, 0, 3, 0, 1...
for which there is a simple recursion. You can simulate the recursion with a stack (it will have maximum depth O(log N), so it's not much space), but probably ffs is a lot faster.
(By the way, even if you were to count bits one at a time from right to left, the increment function would be O(1) on average, because the total number of trailing 0s in the integers from 1 to 2^k is 2^k - 1.)
So I came up with this:
int sum = 0;
unsigned long grayPos = 0;
int graySign = 1;
for (uint64 k = 2; grayPos < n; k++) {
sum += graySign*a[grayPos];
// Do something with sum
#ifdef _M_X64
grayPos = n;
_BitScanForward64(&grayPos, k);
#else
for (grayPos = 0; !(k&1ull<<grayPos); grayPos++);
#endif
graySign = 2-(k>>grayPos&0x3);
}
It works really well and brought down the execution time (in comparison to always recomputing the whole sum) from 254 to only 7 seconds for n = 32. I also found that counting trailing zeroes with the for loop is only slightly (~15%) slower than using _BitScanForward64, for the reasons mentioned by rici. So thanks.
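For comparison, here is a minimal portable sketch of the same idea that keeps an explicit subset mask instead of deriving the sign from k. The names `trailingZeros` and `subsetSumsGrayOrder` are hypothetical, the trailing-zero count uses a plain loop rather than a compiler intrinsic, and n is assumed to be small enough that collecting 2^n sums in a vector is practical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of trailing zero bits of a nonzero value (the ruler function).
int trailingZeros(std::uint64_t k) {
    int pos = 0;
    while ((k & 1u) == 0) {
        k >>= 1;
        ++pos;
    }
    return pos;
}

// Sums of all 2^n subsets of a, visited in Gray code order.
// Step k flips bit trailingZeros(k) of the subset mask, so each step
// needs only one addition or subtraction. Assumes a.size() < 64.
std::vector<long long> subsetSumsGrayOrder(const std::vector<int>& a) {
    const std::size_t n = a.size();
    const std::uint64_t count = std::uint64_t{1} << n;
    std::vector<long long> sums;
    long long sum = 0;
    std::uint64_t mask = 0;
    sums.push_back(sum);  // the empty subset
    for (std::uint64_t k = 1; k < count; ++k) {
        const int bit = trailingZeros(k);
        mask ^= std::uint64_t{1} << bit;
        if ((mask >> bit) & 1u) {
            sum += a[bit];  // element entered the subset
        } else {
            sum -= a[bit];  // element left the subset
        }
        sums.push_back(sum);
    }
    return sums;
}
```

In a real run the `push_back` would be replaced by whatever "do something with sum" means; the vector is only there to make the visiting order observable.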

Number of parallelograms on a NxM grid

I have to solve a problem where, given a grid of size N x M, I have to find the number of parallelograms that can be placed in it such that every vertex has integer coordinates.
Here is my code:
/*
~Keep It Simple!~
*/
#include <cstdio>
#define MaxN 2005
int N,M;
long long Paras[MaxN][MaxN]; // Number of parallelograms of Height i and Width j
long long Rects; // Final Number of Parallelograms
int cmmdc(int a,int b)
{
while(b)
{
int aux = b;
b = a -(( a/b ) * b);
a = aux;
}
return a;
}
int main()
{
freopen("paralelograme.in","r",stdin);
freopen("paralelograme.out","w",stdout);
scanf("%d%d",&N,&M);
for(int i=2; i<=N+1; i++)
for(int j=2; j<=M+1; j++)
{
if(!Paras[i][j])
Paras[i][j] = Paras[j][i] = 1LL*(i-2)*(j-2) + i*j - cmmdc(i-1,j-1) -2; // number of parallelograms with all edges on the grid + number of parallelograms with only 2 edges on the grid.
Rects += 1LL*(M-j+2)*(N-i+2) * Paras[j][i]; // each parallelogram can be moved in (M-j+2)(N-i+2) places.
}
printf("%lld", Rects);
}
Example : For a 2x2 grid we have 22 possible parallelograms.
My algorithm works and is correct, but I need to make it a little bit faster. I would like to know how to do that.
P.S. I've heard that I should pre-process the greatest common divisor and save it in an array which would reduce the run-time to O(n*m), but I'm not sure how to do that without using the cmmdc ( greatest common divisor ) function.
Make sure N is not smaller than M:
if( N < M ){ swap( N, M ); }
Leverage the symmetry in your loops: you only need to run j from 2 to i:
for(int j=2; j<=min( i, M+1); j++)
You don't need the extra array Paras; drop it and use a temporary variable instead.
long long temparas = 1LL*(i-2)*(j-2) + i*j - cmmdc(i-1,j-1) -2;
long long t1 = temparas * (M-j+2)*(N-i+2);
Rects += t1;
// check if the inverse case i <-> j must be considered
if( i != j && i <= M+1 ) // j <= N+1 is always true because of j <= i <= N+1
Rects += t1;
Replace this line: b = a -(( a/b ) * b); using the remainder operator:
b = a % b;
Caching the cmmdc results would probably be possible. You can initialize the array using a sort of sieve algorithm: create a 2D array indexed by a and b, put "2" at each position where a and b are both multiples of 2, then put a "3" at each position where a and b are both multiples of 3, and so on, roughly like this:
int gcd_cache[N][N];
void init_cache(){
for (int u = 1; u < N; ++u){
for (int i = u; i < N; i+=u ) for (int k = u; k < N ; k+=u ){
gcd_cache[i][k] = u;
}
}
}
Not sure if it helps a lot though.
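A self-contained variant of that sieve idea, as a minimal sketch: it uses `std::vector` so the table size can be chosen at run time. The function name `buildGcdTable` and the parameter `limit` are hypothetical. Because u increases, the last value written to `table[i][k]` is the largest u dividing both i and k, which is their greatest common divisor.

```cpp
#include <vector>

// Builds table[i][k] == gcd(i, k) for 1 <= i, k < limit.
// Row 0 and column 0 are left as 0 and are not meaningful.
std::vector<std::vector<int>> buildGcdTable(int limit) {
    std::vector<std::vector<int>> table(limit, std::vector<int>(limit, 0));
    for (int u = 1; u < limit; ++u) {
        // Visit every pair (i, k) where both are multiples of u.
        for (int i = u; i < limit; i += u) {
            for (int k = u; k < limit; k += u) {
                table[i][k] = u;  // later (larger) u overwrites smaller ones
            }
        }
    }
    return table;
}
```

The main loop would then read `table[i-1][j-1]` instead of calling cmmdc(i-1, j-1). The total number of assignments is on the order of limit^2, since the work for each u shrinks as (limit/u)^2.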
The first comment in your code states "keep it simple", so, in the light of that, why not try solving the problem mathematically and printing the result.
If you select two lines of length N from your grid, you can count the parallelograms as follows:
Select two points next to each other in both lines: there are (N-1)^2 ways of doing this, since you can place the two points at N-1 positions on each of the lines.
Select two points with one space between them in both lines: there are (N-2)^2 ways of doing this.
Continue with two points separated by two, three, and up to N-2 spaces.
The resulting number of combinations is (N-1)^2+(N-2)^2+(N-3)^2+...+1.
By solving the sum, we get the formula: 1/6*N*(2*N^2-3*N+1). Check WolframAlpha to verify.
Now that you have a solution for two lines, you simply need to multiply it by the number of combinations of order 2 of M, which is M!/(2*(M-2)!).
Thus, the whole formula would be: 1/12*N*(2*N^2-3*N+1)*M!/(M-2)!, where the ! mark denotes factorial, and the ^ denotes a power operator (note that the same sign is not the power operator in C++, but the bitwise XOR operator).
This calculation requires fewer operations than iterating through the matrix.
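As a minimal sketch, the closed form from this answer can be evaluated directly with integer arithmetic. The function name `closedFormCount` is hypothetical. It uses the simplification M!/(M-2)! = M*(M-1) and only evaluates the formula as stated above; it does not check that formula against the expected output of the original problem.

```cpp
// Evaluates 1/12 * N * (2*N^2 - 3*N + 1) * M * (M-1).
// N*(2*N^2 - 3*N + 1) = (N-1)*N*(2*N-1) is always divisible by 6 and
// M*(M-1) is always even, so the division by 12 is exact.
// Overflow for very large n or m is not handled.
long long closedFormCount(long long n, long long m) {
    return n * (2 * n * n - 3 * n + 1) * m * (m - 1) / 12;
}
```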

Weighted probability with long doubles

I am working with an array of roughly 2000 elements in C++.
Each element represents the probability of that element being selected randomly.
I then convert this array into a cumulative array, with the intention of using it to work out which element to choose when a die is rolled.
Example array:
{1,2,3,4,5}
Example cumulative array:
{1,3,6,10,15}
I want to be able to select 3 in the cumulative array when numbers 3, 4 or 5 are rolled.
The added complexity is that my array is made up of long doubles. Here's an example of a few consecutive elements:
0.96930161525189592646367317541056252139242133125662803649902343750
0.96941377254127855667142910078837303444743156433105468750000000000
0.96944321382974149711383993199831365927821025252342224121093750000
0.96946143938926617454089618153290075497352518141269683837890625000
0.96950069444055009509463721739663810694764833897352218627929687500
0.96951751803395748961766908990966840065084397792816162109375000000
This could be a terrible way of doing weighted probabilities with this data set, so I'm open to any suggestions of better ways of working this out.
You can use partial_sum:
const unsigned int SIZE = 5; // must be a constant to size the arrays
int array[SIZE] = {1,2,3,4,5};
int partials[SIZE] = {0};
std::partial_sum(array, array+SIZE, partials); // declared in <numeric>
// partials is now {1,3,6,10,15}
The value you want from the array is available from the partial sums:
12 == array[2] + array[3] + array[4];
12 == partials[4] - partials[1];
The total is obviously the last value in the partial sums:
15 == partials[4];
Consider storing the information as an integer numerator and denominator so that there is no loss of precision until the final step.
You can actually do this using stream selection without having to compute an array of partial sums. Here's code I have for this in Java:
public static int selectRandomWeighted(double[] wts, Random rnd) {
int selected = 0;
double total = wts[0];
for( int i = 1; i < wts.length; i++ ) {
total += wts[i];
if( rnd.nextDouble() <= (wts[i] / total)) {
selected = i;
}
}
return selected;
}
The above could potentially be further improved using Kahan summation if you want to preserve as many digits of accuracy in the sum as possible.
However, if you want to draw from this array repeatedly, then pre-computing an array of partial sums and using binary search to find the right index will be faster.
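A minimal C++ sketch of that partial-sums-plus-binary-search approach, using long double as in the question. The function names `cumulativeWeights` and `pickIndex` are hypothetical, and the roll is assumed to be drawn uniformly from [0, total), where total is the last cumulative value.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Running totals of the weights, e.g. {1,2,3,4,5} -> {1,3,6,10,15}.
std::vector<long double> cumulativeWeights(const std::vector<long double>& weights) {
    std::vector<long double> cumulative(weights.size());
    std::partial_sum(weights.begin(), weights.end(), cumulative.begin());
    return cumulative;
}

// Index of the first cumulative value strictly greater than roll.
// Assumes cumulative is non-empty; a roll at or above the total is
// clamped to the last index.
std::size_t pickIndex(const std::vector<long double>& cumulative, long double roll) {
    auto it = std::upper_bound(cumulative.begin(), cumulative.end(), roll);
    if (it == cumulative.end()) {
        --it;
    }
    return static_cast<std::size_t>(it - cumulative.begin());
}
```

The roll itself could come from `std::uniform_real_distribution<long double>(0, cumulative.back())` with any standard engine. Using `std::upper_bound` avoids writing the "between two neighbours" comparison by hand.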
Ok I think I've solved this one.
I just did a binary split search, but instead of just having
if (arr[middle] == value)
I added in an OR
if (arr[middle] == value || (arr[middle] < value && arr[middle+1] > value))
This seems to handle it in the way I was hoping for.