OpenCL built-in function select - c++

I am trying to use the select function to choose elements from v1 and v2 based on my 3rd argument, however I do not know of a way to access the current components in v1 and v2.
Say, if v[i] is more than 5, I want to select v2[i] into results[i], else select v1[i], but I can't access the components that way.
Any advice would be appreciated! I am a super beginner at this, btw.
__kernel void copy(__global int4* Array1,
                   __global int* Array2,
                   __global int* output)
{
    int id = get_local_id(0);

    //Reads the contents from array 1 and 2 into local memory
    __local int4 local_array1;
    __local int local_array2;
    local_array1 = Array1[id];
    local_array2 = Array2[id];

    //Copy the contents of array 1 into an int8 vector called v
    int8 v;
    /* i have trouble here too, how do i copy into int8 v from int4 data type */
    v = vload8(0, Array1);

    //Copy the contents of array 2 into two int8 vectors called v1 and v2
    int8 v1, v2;
    v1 = vload8(0, Array2);
    v2 = vload8(1, Array2);

    //Creates an int8 vector in private memory called results
    int8 results;

    if (any(v > 5) == 1) {
        results = select(v2[/* what do i do to get current index */], v1[i], isgreater(v[i], 5.0));
        vstore8(results, 0, output);
    }
    else {
        results.lo = v1.lo;
        results.hi = v2.lo;
        vstore8(results, 0, output);
    }
}

You're trying to access vector elements via the [] operator. That's illegal in OpenCL. It might work with some OpenCL compilers but it's still undefined behaviour.
The "official" way to access vector elements is by 1) vector.X or 2) vector.sX
As you noticed, this does not allow dynamic access.
The reason is: a vector is not an array. A vector is supposed to map to a hardware "vector register" (or multiple registers). E.g. a float8 will map to a single 256-bit register on a CPU with AVX, or two 128-bit SSE registers on an older SSE-only CPU.
OpenCL does not have an operator to dynamically access vector elements. Perhaps a missing feature, but it reflects the reality of vectorized hardware: most instructions only operate on entire hardware vector registers, not on their individual elements. If you want to work on a vector element selected dynamically, you have to extract it from the vector. Here are several ways to do it.
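For illustration, here are two common ways (a hedged sketch, not taken from the question's code; idx can be a runtime value):

// Option 1: spill the vector into an indexable private array with vstore8.
int get_component(int8 v, int idx)
{
    int tmp[8];
    vstore8(v, 0, tmp);   // vstoren also accepts __private pointers
    return tmp[idx];
}

// Option 2: overlay the vector with a scalar array through a union.
int get_component_u(int8 v, int idx)
{
    union { int8 vec; int s[8]; } u;
    u.vec = v;
    return u.s[idx];
}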
Using vectors makes sense in some specific cases; IMO they're useful mainly for two things: 1) when you have a bunch of values logically tied together (e.g. the colors of a pixel) and 99.99% of the time you don't need to access the individual values; 2) when you have hardware with vector registers (e.g. a VLIW CPU or GPU) and your OpenCL compiler can't "autovectorize" the code, so you need to write vectorized code manually to get reasonable performance.
In your code, I'd simply change __global int4* Array1 to __global int* Array1, write the kernel without using vectors (simply indexing it as a normal array), and see how it performs. If you're targeting modern Nvidia/AMD GPUs, you don't need vectors at all to get good performance.
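For instance, a minimal sketch of the scalar version (how Array2 should be indexed depends on what v1 and v2 were meant to represent, so treat the exact selection as an assumption about the intent):

__kernel void copy(__global const int* Array1,
                   __global const int* Array2,
                   __global int* output)
{
    int i = get_global_id(0);
    // per-element selection, no vector types and no select() needed
    output[i] = (Array1[i] > 5) ? Array2[i] : Array1[i];
}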

Related

C++ allocate N-dimensional vector without copying a c-array

I want to load N-dimensional matrices from disk (HDF5) into std::vector objects.
I know their rank beforehand, just not the shape. For instance, one of the matrices is 4-rank std::vector<std::vector<std::vector<std::vector<float>>>> data;
I want to use vectors to store the values because they are standard and not as ugly as c-arrays (mostly because they are aware of their length).
However, the way to load them is using a loading function that takes a void *, which works fine for rank-1 vectors where I can just resize them and then access their data pointer (vector.data()). For higher ranks, vector.data() will just point to vectors, not the actual data.
Worst case scenario, I just load all the data into an auxiliary c-array and then copy it manually, but this could slow things down quite a bit for big matrices.
Is there a way to have contiguous multidimensional data in vectors and then get a single address to it?
If you are concerned about performance, please don't use a vector of vectors of vectors...
Here is why; I think the answer by @OldPeculier is worth reading:
The reason that it's both fat and slow is actually the same. Each "row" in the matrix is a separately allocated dynamic array. Making a heap allocation is expensive both in time and space. The allocator takes time to make the allocation, sometimes running O(n) algorithms to do it. And the allocator "pads" each of your row arrays with extra bytes for bookkeeping and alignment. That extra space costs...well...extra space. The deallocator will also take extra time when you go to deallocate the matrix, painstakingly free-ing up each individual row allocation. Gets me in a sweat just thinking about it.
There's another reason it's slow. These separate allocations tend to live in discontinuous parts of memory. One row may be at address 1,000, another at address 100,000—you get the idea. This means that when you're traversing the matrix, you're leaping through memory like a wild person. This tends to result in cache misses that vastly slow down your processing time.
So, if you absolutely must have your cute [x][y] indexing syntax, use that solution. If you want quickness and smallness (and if you don't care about those, why are you working in C++?), you need a different solution.
Your plan is not a wise one. Vectors of vectors of vectors are inefficient and only really useful for dynamic jagged arrays, which you don't have.
Instead of your plan, load into a flat vector.
Next, wrap it with a multidimensional view.
template<class T, size_t Dim>
struct dimensional {
    size_t const* strides;
    T* data;
    dimensional<T, Dim-1> operator[](size_t i) const {
        return {strides + 1, data + i * *strides};
    }
};

template<class T>
struct dimensional<T, 0> {
    size_t const* strides; // not valid to dereference
    T* data;
    T& operator[](size_t i) const {
        return data[i];
    }
};
where strides points at an array of array-strides for each dimension (the product of the sizes of all later dimensions).
So my_data.access()[3][5][2] gets a specific element.
This sketch of a solution leaves everything public and doesn't support for(:) iteration. A more shippable version would have proper privacy and support C++11-style range-for loops.
I am unaware of the name of a high quality multi-dimensional array view already written for you, but there is almost certainly one in boost.
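For illustration, a hedged usage sketch of the view above (the shape, strides and loader names are assumptions, not part of the answer):

#include <cstddef>
#include <vector>

int main() {
    std::size_t shape[4] = {2, 3, 4, 5};                         // known after reading the metadata
    std::size_t strides[3] = {shape[1] * shape[2] * shape[3],    // step for dimension 0
                              shape[2] * shape[3],               // step for dimension 1
                              shape[3]};                         // step for dimension 2
    std::vector<float> flat(shape[0] * shape[1] * shape[2] * shape[3]);
    // hypothetical loader taking a single contiguous address: loadFromHDF5(flat.data());
    dimensional<float, 3> view{strides, flat.data()};
    view[1][2][3][4] = 42.0f;                                    // multidimensional access, contiguous storage
    return 0;
}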
For a bi-dimensional matrix, you could use an ugly c-array like this:
float data[w * h]; //width, height
data[(y * w) + x] = 0; //access (x,y) element
For a tri-dimensional matrix:
float data[w * h * d]; //width, height, depth
data[((z * h) + y) * w + x] = 0; //access (x,y,z) element
And so on. To load data from, let's say, a file,
float *data = yourProcToLoadData(); //works for any dimension
That's not very scalable, but you're dealing with a known number of dimensions. This way your data is contiguous and you have a single address.
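A hedged variant of the same idea with a flat std::vector, which keeps the single contiguous address via .data() (loadFromHDF5 is a placeholder for the question's void* loading function):

#include <cstddef>
#include <vector>

void loadFromHDF5(void* dst);                         // hypothetical loader from the question

void fill(std::size_t w, std::size_t h, std::size_t d) {
    std::vector<float> data(w * h * d);               // one contiguous block
    loadFromHDF5(data.data());                        // a single address for the loader
    data[((2 * h) + 1) * w + 0] = 0.0f;               // access element (x=0, y=1, z=2)
}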

OpenACC and GNU Scientific Library - data movement of gsl_matrix

I've watched the recorded OpenACC overview course videos up to lecture 3, which talks about expressing data movement. How would you move a gsl_matrix* from the CPU to the GPU using copyin()? For example, on the CPU I can do something like
gsl_matrix *Z = gsl_matrix_calloc(100, 100);
which will give me a 100x100 matrix of zeroes. Now Z is a pointer to a gsl_matrix structure, which looks like:
typedef struct {
    size_t size1;
    size_t size2;
    size_t tda;
    double* data;
    gsl_block* block;
    int owner;
} gsl_matrix;
How would I express data movement of Z (which is a pointer) from the CPU to the GPU using copyin()?
I can't speak directly about using GSL within OpenACC data and compute regions, but I can give you a general answer about aggregate types with dynamic data members.
The first thing to try, assuming you're using the PGI compilers and a newer NVIDIA device, is CUDA Unified Memory (UVM). With the flag "-ta=tesla:managed", all dynamically allocated data is managed by the CUDA runtime, so you don't need to manage the data movement yourself. There is overhead involved and there are caveats, but it makes it easier to get started. Note that CUDA 8.0, which ships with PGI 16.9 or later, improves UVM performance.
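For example (a hypothetical compile line, assuming the PGI C compiler):
pgcc -fast -acc -ta=tesla:managed -Minfo=accel my_source.c -o my_program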
Without UVM, you need to perform a manual deep copy of the data. Below is the basic idea: first create the parent structure on the device and perform a shallow copy. Next, create the dynamic array "data" on the device, copy over the initial values of the array, then attach the device pointer for data to the device structure's data pointer. Since "block" is itself an array of structs with dynamic data members, you'll need to loop through the array, creating its data arrays on the device.
matrix* mat = (matrix*) malloc(sizeof(matrix));
#pragma acc enter data copyin(mat[0:1])
// Change dataSize and blockSize to the actual sizes of "data" and "block"
#pragma acc enter data copyin(mat->data[0:dataSize])
#pragma acc enter data copyin(mat->block[0:blockSize])
for (i = 0; i < blockSize; ++i) {
    #pragma acc enter data copyin(mat->block[i].data[0:mat->block[i].size])
}
To delete, walk the structure again, deleting from the bottom up:
for (i = 0; i < blockSize; ++i) {
    #pragma acc exit data delete(mat->block[i].data)
}
#pragma acc exit data delete(mat->block)
#pragma acc exit data delete(mat->data)
#pragma acc exit data delete(mat)
When you update, be sure to only update scalars or arrays of fundamental data types, i.e. update "data" but not "block". Update does a shallow copy, so updating "block" would overwrite host or device pointers, leading to illegal addresses.
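For example (a hedged sketch; dataSize is the assumed element count of "data"):
#pragma acc update device(mat->data[0:dataSize])   // host to device, fundamental types only
#pragma acc update self(mat->data[0:dataSize])     // device back to host
// Do NOT update the whole struct (e.g. mat[0:1]): the shallow copy would
// clobber the host or device pointers.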
Finally, be sure to put the matrix variable in a "present" clause when using it in a compute region.
#pragma acc parallel loop present(mat)
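A fuller sketch of such a region (n is an assumed element count, not from the question):
#pragma acc parallel loop present(mat)
for (i = 0; i < n; ++i) {
    mat->data[i] *= 2.0;
}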

On memory allocation in armadillo sparse matrices

I want to know whether I need to free up the memory occupied by the locations and values objects after the sparse matrix has been created. Here is the code:
void load_data(umat& locations, vec& values){
    // fill up locations and values
}

int main(int argc, char** argv){
    umat loc;
    vec val;
    load_data(loc, val);
    sp_mat X(loc, val);
    return 0;
}
In the above code, load_data() fills up the locations and values objects, and then the sparse matrix is created in main(). My question: do I need to free up the memory used by locations and values after the construction of X? The reason is that X could be large and I am low on RAM. I know that when main returns, the OS will free up locations and values as well as X. But the real question is whether the memory occupied by X is the same as that occupied by locations and values, OR whether X is allocated memory separately, in which case I need to free locations and values myself.
The constructor (SpMat_meat.hpp:231) you are using
template<typename T1, typename T2>
inline SpMat(const Base<uword,T1>& locations, const Base<eT,T2>& values, const bool sort_locations = true);
fills the sparse matrix with copies of the values in values.
I understand you are worried that you will run out of memory: if you keep loc, val and X separately, you basically have two copies of the same data, taking up twice as much memory as actually needed (this is indeed what happens in your code snippet). So I will focus on addressing this problem and give you a few options:
1) If you are fine with keeping two copies of the data for a short while, the easiest solution is to dynamically allocate loc and val and delete them right after the initialization of X:
int main(int argc, char** argv){
    umat* ploc = new umat();
    vec* pval = new vec();
    load_data(*ploc, *pval);
    // at this point we have one copy of the data
    sp_mat X(*ploc, *pval);
    // at this point we have two copies of the data
    delete ploc;
    delete pval;
    // at this point we have one copy of the data
    return 0;
}
Of course you can use smart pointers instead of C-style ones, but you get the idea.
2) If you absolutely don't want to have two copies of the data at any time, I would suggest that you modify your load_data routine to load the values sequentially, one by one, and insert them directly into X:
void load_data(umat& locations, vec& values, sp_mat& X){
    // fill up locations and values one by one into X
}
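A hedged sketch of what that could look like (read_next_triplet is a placeholder for your file-reading code; note that element-wise insertion into sp_mat can be slow for very large numbers of entries):
void load_data(sp_mat& X){
    // X is assumed to be pre-sized, e.g. via X.set_size(n_rows, n_cols)
    uword row, col;
    double value;
    while (read_next_triplet(row, col, value)) {   // hypothetical reader
        X(row, col) = value;                       // insert directly, no second copy
    }
}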
Other options would be to i) use move semantics to directly move the values in val into X or ii) directly use the memory allocated for val as the memory for X, similar to the advanced constructor for matrices
Mat(eT* aux_mem, const uword aux_n_rows, const uword aux_n_cols, const bool copy_aux_mem = true, const bool strict = false)
Both options would, however, require modifications at the level of the armadillo library, as such functionality is not yet supplied for sparse matrices (there is only a plain move constructor so far). It would be a good idea to request these features from the developers though!

Error: Deallocating a 2D array

I am developing a program in which one of the tasks is to read points (x, y and z) from a text file and then store them in an array. Now the text file may contain 10^2 or even 10^6 points, depending on the text file the user selects. Therefore I am defining a dynamic array.
For allocating a dynamic 2D array, I wrote the following, and it works fine:
const int array_size = 100000;
float** array = new float*[array_size];
for (int i = 0; i < array_size; ++i){
    array[i] = new float[2]; // 0,1,2 being the columns for x,y,z co-ordinates
}
After the points are saved in the array, I write the following to deallocate the allocated memory:
for (int i = 0; i < array_size; i++){
    delete[] array[i];
}
delete[] array;
and then my program stops working and shows "Project.exe stopped working".
If I don't deallocate, the program works just fine.
In your comment you say 0,1,2 being the columns for x,y,z co-ordinates; if that's the case, you need to be allocating float[3]. When you allocate an array of float[N], you are allocating a chunk of memory of size N * sizeof(float), and you index it from 0 to N - 1. Therefore, if you need indices 0, 1, 2, you need to allocate memory of size 3 * sizeof(float), which makes it float[3].
Other than that, I can compile and run the code without an error. If you fix it and still get an error, it might be a problem with your compiler; try decreasing 100000 to a small number and try again.
You are saying that you are trying to implement a dynamic array, this is what std::vector does and I would highly recommend that you use it. This way you are using something from the standard library that's extremely well tested and you won't run into issues by essentially trying to roll your own version of std::vector. Additionally this approach wraps memory better as it uses RAII which leverages the language to solve a lot of memory management issues. This has other benefits too like making your code more exception safe.
Also if you are storing x,y,z coordinates consider using a struct or a tuple, I think that enhances readability a lot. You can typedef the coordinate type too. Something like std::vector< coord_t > is more readable to me.
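A hedged sketch of that suggestion (coord_t is just an illustrative name, as in the paragraph above):
#include <vector>

struct coord_t {
    float x, y, z;
};

int main() {
    std::vector<coord_t> points;
    points.reserve(100000);                  // optional: avoid repeated reallocation
    points.push_back({1.0f, 2.0f, 3.0f});    // no manual new/delete needed
    float y = points[0].y;                   // readable access to a coordinate
    return 0;
}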
(Thanx a lot for suggestions!!)
Finally I am using vectors for the stated problem for reasons as below:
1. Unlike arrays (not the array object, of course), I don't need to manually deallocate the allocated memory.
2. There are numerous built-in methods defined under the vector class.
3. The vector size can be extended at later stages.
Below is how I used a 2D vector to store the points (x, y, z co-ordinates).
Initialized (allocated memory for) a 2D vector:
vector<vector<float>> array(1000, vector<float>(3));
where 1000 is the number of rows and 3 is the number of columns.
Once declared, values can be passed simply as:
array[i][j] = some value;
Also, at a later stage I declared functions taking vector arguments and returning vectors as:
vector<vector<float>> function_name(vector<vector<float>>);

vector<vector<float>> function_name(vector<vector<float>> input_vector_name)
{
    return output_vector_name_created_inside_function;
}
Note: this method creates a copy of the vector when returning; pass the output vector by reference (or via a pointer) to avoid the copy. Even though mine is not working when I return the vector by reference :(
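A hedged sketch of the pass-by-reference alternative mentioned in the note (the names are placeholders):
void function_name(const vector<vector<float>>& input_vector_name,
                   vector<vector<float>>& output_vector_name)
{
    output_vector_name = input_vector_name;   // stand-in for the real computation
    // the caller's vector is filled in place; nothing is copied on return
}
In practice, returning by value is usually fine as well, since modern compilers apply return-value optimization or move the vector instead of copying it.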
For multi-dimensional arrays I recommend using boost::multi_array.
Example:
typedef boost::multi_array<double, 3> array_type;
array_type A(boost::extents[3][4][2]);
A[0][0][0] = 3.14;

pointer arithmetic on vectors in c++

i have a std::vector, namely
vector<vector<vector<double> > > mdata;
i want to pass data from my mdata vector to the GSL function
gsl_spline_init(gsl_spline * spline, const double xa[], const double ya[], size_t size);
as ya. i already figured out that i can do things like
gsl_spline_init(spline, &(mgrid.front()), &(mdata[i][j].front()), mgrid.size());
this is fine if i want to pass the data from mdata for fixed i,j to gsl_spline_init().
however, now i would need to pass along the first dimension of mdata, so for fixed j,k.
i know that for any two fixed indices, all vectors along the remaining dimensions have the same length, so my vector is a 'regular cube'. so the offset between all the values i need should be the same.
of course i could create a temporary vector
int j = 123;
int k = 321;
vector<double> tmp;
for (int i = 0; i < mdata.size(); i++)
    tmp.push_back(mdata[i][j][k]);
gsl_spline_init(spline, &(mgrid.front()), &(tmp.front()), mgrid.size());
but this seems too complicated. perhaps there is a way to achieve my goal with pointer arithmetic?
any help is greatly appreciated :)
You really can't do that without redesigning the array consumer function gsl_spline_init() - it relies on the data passed to it being one contiguous block. This is not the case with your three-level vector: even though it is a "regular cube", each level has a separate buffer allocated on the heap.
This can't be done. Not only with vectors - even with plain pointer-to-pointer arrays, only the last dimension is a contiguous block of data. If gsl_spline_init took an iterator instead of an array, you could try to craft some functor to choose the appropriate data, but I'm not sure it's worth trying. No pointer arithmetic can help you here.
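For completeness, a hedged sketch of the copy approach from the question wrapped in a reusable helper (it does not avoid the copy; it only packages it so the contiguous buffer can be handed to GSL):
std::vector<double> gather_slice(const std::vector<std::vector<std::vector<double> > >& mdata,
                                 std::size_t j, std::size_t k)
{
    std::vector<double> tmp;
    tmp.reserve(mdata.size());
    for (std::size_t i = 0; i < mdata.size(); ++i)
        tmp.push_back(mdata[i][j][k]);   // gather the strided slice into contiguous storage
    return tmp;
}
// usage: std::vector<double> ya = gather_slice(mdata, j, k);
//        gsl_spline_init(spline, &(mgrid.front()), &(ya.front()), mgrid.size());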