Cuda efficient insertion of data into unsorted populated array - c++

I have two arrays within Cuda;
int *main; // unsorted
int *source; // sorted
Part of my algorithm requires that I regularly insert new data into the main array from the source array. If a position within the main array is zero, it is assumed to be empty, so it can be populated with a value from the source array.
I'm just wondering what the most efficient method of doing this is, I've tried a couple of approaches but still think there are some more performance gains to be made here.
Currently I'm using a modified version of a radix sort, to "shuffle" the contents of the main array to the very end of the main array, leaving all zero values at the beginning of the array, making the insertion from source trivial. The sort has been modified to iterate over a single bit, rather than 32 bits, this works with a simple switch on the input;
input[i] = source[i] > 1 ? 1 : 0
I'm wondering if this is already quite an efficient way of doing this? I'm wondering if I wouldn't gain something by using a tactically deployed atomicAdd such as;
__global__ void find(int *destination, int *indices, const int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if((destination[idx] == 0)&&(count<elements_to_add))
    {
        indices[count] = idx;
        atomicAdd(&count, 1);
    }
}
__global__ void insert(int *destination, int *indices, int *source, const int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if((source[idx] > 0)&&(indices[idx] > 0))
    {
        destination[indices[idx]] = source[idx];
    }
}
I'm not inserting that many items via the source array at the moment, but that could change in the future.
This feels like a common problem that should have been solved before. I'm wondering if the thrust library may help, but after browsing for appropriate functions it doesn't quite feel right for what I'm trying to accomplish (it doesn't fit very neatly with the code I already have).
Thoughts from experienced Cuda developers appreciated!

You can decouple your finding algorithm, which is a stream compaction procedure, from your insertion, which is a scatter procedure. However, you can also merge the two into a single kernel.
The kernel below assumes srcPtr points to an int in global memory that is set to zero before the kernel launch.
// Assuming N is the length of the destination buffer and the length of the source buffer is less than N.
// Also assuming blockDim.x is a multiple of the warp size, so every lane reaches the *_sync calls.
// This sketch does not check positionToRead against the source length.
__global__ void find_and_insert( int* destination, int const* source, int const N, int* srcPtr ) {
    int const idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Get the assigned element; out-of-range threads never report an empty slot.
    bool const pred = ( idx < N ) && ( destination[ idx ] == 0 );
    // Intra-warp binary reduction to count the total number of lanes with empty elements.
    // (__ballot_sync / __shfl_sync are the CUDA 9+ forms of __ballot / __shfl.)
    unsigned int const predBallot = __ballot_sync( 0xFFFFFFFFu, pred );
    int const intraWarpRed = __popc( predBallot );
    // Warp-aggregated atomics to reduce the contention over the srcPtr content.
    unsigned int laneID; asm( "mov.u32 %0, %%laneid;" : "=r"(laneID) ); //const uint laneID = tidWithinCTA & ( WARP_SIZE - 1 );
    int posW = 0;
    if( laneID == 0 )
        posW = atomicAdd( srcPtr, intraWarpRed );
    posW = __shfl_sync( 0xFFFFFFFFu, posW, 0 );
    // Threads that have found empty elements can fill out their assigned positions from the src. Intra-warp binary prefix sum is used here.
    unsigned int laneMask; asm( "mov.u32 %0, %%lanemask_lt;" : "=r"(laneMask) ); //const uint laneMask = 0xFFFFFFFF >> ( WARP_SIZE - laneID ) ;
    int const positionToRead = posW + __popc( predBallot & laneMask );
    if( pred )
        destination[ idx ] = source[ positionToRead ];
}
A few things:
This kernel is just a suggestion on how you can do it. Here threads inside the warps collaborate on the task. You can extend the binary reduction and prefix sum over the thread-block.
I wrote this kernel inside the browser and haven't tested it. So be careful.
The whole design is not something new. Similar approaches have been implemented (for example this paper) and are mostly based on the work done by Mark Harris and Michael Garland.
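To make the warp-level arithmetic in the kernel above easier to check by hand, here is a small host-side C++ sketch (not device code). It mimics what `__popc( predBallot & laneMask )` computes for a single warp. The helper names `popcount32`, `rankInWarp` and `readPosition` are made up for this illustration.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Host-side stand-in for __popc: number of set bits in a 32-bit word.
inline int popcount32(std::uint32_t x) {
    return static_cast<int>(std::bitset<32>(x).count());
}

// Rank of `lane` among the lanes whose predicate bit is set in `ballot`,
// i.e. how many lower-numbered lanes also found an empty slot.
// Mirrors __popc(predBallot & laneMask) with laneMask = %lanemask_lt.
inline int rankInWarp(std::uint32_t ballot, unsigned lane) {
    assert(lane < 32);
    const std::uint32_t lanesBelow =
        (lane == 0) ? 0u : (0xFFFFFFFFu >> (32 - lane));
    return popcount32(ballot & lanesBelow);
}

// Source index read by `lane`, given the base offset that lane 0 obtained
// from its single atomicAdd on behalf of the whole warp.
inline int readPosition(int warpBase, std::uint32_t ballot, unsigned lane) {
    return warpBase + rankInWarp(ballot, lane);
}
```

For example, if lanes 1, 2 and 4 found empty slots and the warp's base offset is 10, they read source positions 10, 11 and 12 respectively, and the shared counter advances by 3 with a single atomic.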


Cuda move element in array to the end

Hello, my issue is this; any advice will be gratefully accepted:
I have an array of structs (representing particles), but to simplify, assume the array contains only true values at the start (Particle.exists = true). I run my own CUDA kernel on this array, and in some cases the true value is changed to false. After that I have to move this value to the end of the array as an optimization (no more work on dead particles, i.e. exists = false).
In theory I have two options for doing this:
Use some parallel sorting algorithm, or
Move the dead particle to the end and shift the array.
The second option should be the better choice, but I don't know how to do this in parallel. I could have 1 000 000 particles, so shifting in one thread is not a good idea...
Here is an example of my code. I put a TODO in the part where I need to shift the array.
struct Particle
{
    float2 position;
    float angle;
    bool exists;
};
__global__ void moveParticles(Particle* particles, const unsigned int lengthOfParticles, const Particle* leaders, const unsigned int lengthOfLeaders, const unsigned int sizeOfLeader, const float speedFactor, const cudaTextureObject_t heightMapTexture)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int skip = gridDim.x * blockDim.x;
    while (idx < lengthOfParticles)
    {
        // If particle does not exist then do nothing and skip
        if (!particles[idx].exists) { idx += skip; continue; }
        float bestLength = 3.40282e+038;
        unsigned int bestLeaderIndex;
        for (unsigned int i = 0; i < lengthOfLeaders; i++)
        {
            float currentLength = (
                (particles[idx].position.x - leaders[i].position.x) * (particles[idx].position.x - leaders[i].position.x)
            ) + (
                (particles[idx].position.y - leaders[i].position.y) * (particles[idx].position.y - leaders[i].position.y)
            );
            if (currentLength < bestLength)
            {
                bestLength = currentLength;
                bestLeaderIndex = i;
            }
        }
        Particle bestLeader = leaders[bestLeaderIndex];
        float differenceX = bestLeader.position.x - particles[idx].position.x;
        float differenceY = bestLeader.position.y - particles[idx].position.y;
        float newLength = sqrtf(differenceX * differenceX + differenceY * differenceY);
        // If the newLength is equal to zero, then the particle is at the same position as leader
        if (newLength <= sizeOfLeader / 2) { particles[idx].exists = false; idx += skip; continue; }
        // Current height at the position
        const uchar4 texelOfHeight = tex2D<uchar4>(heightMapTexture, particles[idx].position.x, particles[idx].position.y);
        // Normalize vector
        differenceX /= newLength;
        differenceY /= newLength;
        int nextPositionOnMapX = round(particles[idx].position.x + differenceX);
        int nextPositionOnMapY = round(particles[idx].position.y + differenceY);
        // Height of the next position
        const uchar4 texelOfNextPosition = tex2D<uchar4>(heightMapTexture, nextPositionOnMapX, nextPositionOnMapY);
        float differenceHeight = texelOfHeight.x - texelOfNextPosition.x;
        float speed = sqrtf(speedFactor + 2 * fabsf(differenceHeight));
        // Multiply by speed
        differenceX *= speed;
        differenceY *= speed;
        particles[idx].position.x += differenceX;
        particles[idx].position.y += differenceY;
        idx += skip;
    }
}
One possible solution I am thinking about is to write a separate kernel that only shifts the particles. Something like this:
__global__ void shiftParticles(const Particle* particles, const unsigned int lengthOfParticles, const unsigned int sizeOfParticle) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int skip = gridDim.x * blockDim.x;
    //TODO: Shifting...
}
Sorting on GPUs is rather inefficient, so it is better to select the values to keep and perform a partition based on them. To do that easily, you can use CUB, which is quite efficient (it often implements the state-of-the-art algorithm, or something close to it).
You can use DevicePartition or two DeviceSelect calls (the former will likely be faster, unless you do not want to keep dead particles at all). You could also use the block-level primitives if you want to perform some advanced tweaks or optimizations.
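As a host-side illustration of what "partition" means here (this is plain C++ with the standard library, not the CUB API), the following sketch moves living particles to the front and returns the pivot. The `Particle` layout is simplified and `partitionLiving` is a name chosen for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for the particle struct in the question.
struct Particle {
    float x, y;
    float angle;
    bool exists;
};

// Moves living particles to the front and dead ones to the back.
// Returns the number of living particles (the pivot index).
// std::stable_partition keeps the relative order inside each group.
inline std::size_t partitionLiving(std::vector<Particle>& particles) {
    const auto isAlive = [](const Particle& p) { return p.exists; };
    const auto pivot =
        std::stable_partition(particles.begin(), particles.end(), isAlive);
    assert(std::is_partitioned(particles.begin(), particles.end(), isAlive));
    return static_cast<std::size_t>(pivot - particles.begin());
}
```

After the call, a kernel only needs to be launched over the first `pivot` elements, which is the point of the whole exercise.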
If you still want to do this yourself for some reason (e.g. reducing the number of dependencies in your project), then you can use atomic adds on relatively new devices, since they are very well optimized by the hardware. On old devices you could use scans to do that, but it is a bit harder to implement. The thing is, atomics do not scale particularly well when there are a lot of SMs, so you need some advanced blocking strategy. Here is an untested naive implementation to convey the idea:
// PS: what is the difference between sizeOfParticle and lengthOfParticles?
// pos must point to two ints initialized to 0. Once the kernel has finished its execution,
// pos[0] contains the number of living particles (pivot) and pos[1] the number of dead ones.
__global__ void shiftParticles(const Particle* particles, const unsigned int lengthOfParticles, const unsigned int sizeOfParticle, Particle* outParticles, int* pos) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int skip = gridDim.x * blockDim.x;
    for (; idx < lengthOfParticles; idx += skip) {
        const Particle current = particles[idx];
        // outParticles is a needed temporary array or output one
        // as the operation cannot be efficiently performed in parallel.
        // It should likely be allocated and provided in argument to the kernel
        if (current.exists) {
            // Move the current particle to the beginning
            const int localPos = atomicAdd(&pos[0], 1); // Here is the important point
            outParticles[localPos] = current;
        } else {
            // Move the current particle to the end
            const int localPos = atomicAdd(&pos[1], 1);
            outParticles[lengthOfParticles - 1 - localPos] = current;
        }
    }
}
Note that the ordering is not preserved due to the atomic operations. If you need to keep the order of the particles, then it gets significantly more complicated, especially on GPUs, since it makes the algorithm more sequential. A naive solution in that case could be to use a stable sort. Another solution is to use a global scan followed by an indirection to store the values (so two passes). Implementing an efficient scan is a bit complex and tedious. Fortunately, CUB can help a lot in this case with its DeviceScan primitive.
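The "scan followed by an indirection" idea can be sketched sequentially on the host. This is only an illustration of the two passes, not a GPU implementation, and `compactStable` is a hypothetical name. The keep flags are assumed to be exactly 0 or 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Order-preserving compaction in two passes:
//   1) an exclusive prefix sum over the 0/1 keep flags gives every kept
//      element its output slot;
//   2) every kept element is scattered to that slot.
// Both passes are data-parallel, which is why the scheme maps to a GPU scan.
inline std::vector<int> compactStable(const std::vector<int>& values,
                                      const std::vector<int>& keepFlags) {
    assert(values.size() == keepFlags.size());

    // Pass 1: exclusive scan (slot[i] = number of kept elements before i).
    std::vector<std::size_t> slot(values.size());
    std::size_t running = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        slot[i] = running;
        running += static_cast<std::size_t>(keepFlags[i]);
    }

    // Pass 2: scatter through the indirection computed above.
    std::vector<int> out(running);
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (keepFlags[i]) out[slot[i]] = values[i];
    }
    return out;
}
```

Because each output slot depends only on the scan result, the relative order of the kept elements is the same as in the input.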
Finally, note that using an array of structures is not efficient, especially on hardware using SIMD instructions like GPUs. The implementation should be significantly faster with a structure of arrays (due to cache lines, coalescing, contiguity of the access pattern, etc.).
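For readers unfamiliar with the term, here is a minimal host-side sketch of a structure-of-arrays layout for the particle data. The type and field names (`ParticlesSoA`, `posX`, ...) are assumptions made for this example; on the device the same idea would use one device buffer per field.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Structure-of-arrays layout: each field lives in its own contiguous array,
// so neighbouring threads (or SIMD lanes) reading the same field touch
// neighbouring addresses.
struct ParticlesSoA {
    std::vector<float> posX;
    std::vector<float> posY;
    std::vector<float> angle;
    std::vector<unsigned char> exists;  // 1 = alive, 0 = dead

    explicit ParticlesSoA(std::size_t n)
        : posX(n, 0.0f), posY(n, 0.0f), angle(n, 0.0f), exists(n, 1) {}

    std::size_t size() const { return posX.size(); }
};

// Example pass that only touches the one field it needs.
inline std::size_t countLiving(const ParticlesSoA& p) {
    std::size_t alive = 0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        assert(p.exists[i] <= 1);
        alive += p.exists[i];
    }
    return alive;
}
```

A pass such as `countLiving` reads one byte per particle here, whereas with an array of structures it would pull the whole struct through the cache for every particle.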

cudaMemset equivalents for non-integer types? [duplicate]

How do I initialize device array which is allocated using cudaMalloc()?
I tried cudaMemset, but it fails to initialize the array to any value except 0. The code for cudaMemset looks like below, where value is initialized to 5.
As you are discovering, cudaMemset works like the C standard library memset. Quoting from the documentation:
cudaError_t cudaMemset ( void * devPtr,
                         int value,
                         size_t count )
Fills the first count bytes of the memory area pointed to by devPtr
with the constant byte value value.
So value is a byte value. If you do something like:
int *devPtr;
cudaMalloc((void **)&devPtr,number_bytes);
const int value = 5;
cudaMemset(devPtr,value,number_bytes);
what you are asking to happen is that each byte of devPtr will be set to 5. If devPtr was an array of integers, the result would be that each integer word has the value 84215045 (0x05050505). This is probably not what you had in mind.
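The same effect can be reproduced on the host with the C library memset, which has the same byte-wise semantics. This is a small sketch and `wordAfterByteFill` is a name invented for it.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Fills a 32-bit integer the way memset (and cudaMemset) would:
// every one of its four bytes receives the low byte of byteValue.
inline std::int32_t wordAfterByteFill(int byteValue) {
    std::int32_t word = 0;
    std::memset(&word, byteValue, sizeof word);
    return word;
}
```

With a byte value of 5, all four bytes are 0x05, so the word reads 0x05050505, which is 84215045 in decimal. Only byte patterns that repeat (such as 0, or 0xFF giving -1) produce an "expected" integer.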
Using the runtime API, what you could do is write your own generic kernel to do this. It could be as simple as
template<typename T>
__global__ void initKernel(T * devPtr, const T val, const size_t nwords)
{
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;
    for(; tidx < nwords; tidx += stride)
        devPtr[tidx] = val;
}
(standard disclaimer: written in browser, never compiled, never tested, use at own risk).
Just instantiate the template for the types you need and call it with a suitable grid and block size, paying attention to the last argument now being a word count, not a byte count as in cudaMemset. This isn't really any different from what cudaMemset does anyway; using that API call results in a kernel launch which is not too different from what I posted above.
Alternatively, if you can use the driver API, there are cuMemsetD16 and cuMemsetD32, which do the same thing, but for 16-bit and 32-bit word types. If you need to set 64-bit or larger types (so doubles or vector types), your best option is to use your own kernel.
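The loop `for(; tidx < nwords; tidx += stride)` in the kernel is a grid-stride loop: it lets any launch configuration cover any array length. A host-side emulation may make that easier to see. In this sketch `totalThreads` plays the role of blockDim.x * gridDim.x, and `initWithGridStride` is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host emulation of the grid-stride loop. The outer loop stands in for the
// threads that would run concurrently on the device; the inner loop is the
// one each thread executes inside the kernel.
template <typename T>
void initWithGridStride(std::vector<T>& data, const T& value,
                        std::size_t totalThreads) {
    assert(totalThreads > 0);
    for (std::size_t tid = 0; tid < totalThreads; ++tid) {
        for (std::size_t i = tid; i < data.size(); i += totalThreads) {
            data[i] = value;
        }
    }
}
```

Whether there are fewer threads than elements (each thread handles several) or more (the surplus threads skip the loop body), every element is written exactly once.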
I also needed a solution to this question and I didn't really understand the other proposed solution. In particular, I didn't understand why it iterates over the grid blocks with for(; tidx < nwords; tidx += stride), how the kernel is invoked, or why it uses the counter-intuitive word sizes.
Therefore I created a much simpler monolithic generic kernel and customized it with strides i.e. you may use it to initialize a matrix in multiple ways e.g. set rows or columns to any value:
template <typename T>
__global__ void kernelInitializeArray(T* __restrict__ a, const T value,
                                      const size_t n, const size_t incx) {
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid*incx < n) {
        a[tid*incx] = value;
    }
}
Then you may invoke the kernel like this:
template <typename T>
void deviceInitializeArray(T* a, const T value, const size_t n, const size_t incx) {
    // BLOCK_SIZE is expected to be defined elsewhere (e.g. 256).
    // ceil(n / incx) elements are written; round that up to whole blocks.
    int number_of_blocks = (((n + incx - 1) / incx) + BLOCK_SIZE - 1) / BLOCK_SIZE;
    dim3 gridDim(number_of_blocks, 1);
    dim3 blockDim(BLOCK_SIZE, 1);
    kernelInitializeArray<T> <<<gridDim, blockDim>>>(a, value, n, incx);
}
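The launch arithmetic for the strided kernel is easy to get wrong by one block, so here is a host-side sketch that emulates it. The helper names (`stridedCount`, `blocksFor`, `initStrided`) are hypothetical; the element count uses a ceiling division, since indices 0, incx, 2*incx, ... below n are all written.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of elements touched when writing every incx-th entry of n elements.
inline std::size_t stridedCount(std::size_t n, std::size_t incx) {
    assert(incx > 0);
    return (n + incx - 1) / incx;  // ceil(n / incx)
}

// Number of blocks needed so that every touched element gets one thread.
inline std::size_t blocksFor(std::size_t n, std::size_t incx,
                             std::size_t blockSize) {
    assert(blockSize > 0);
    return (stridedCount(n, incx) + blockSize - 1) / blockSize;
}

// Host stand-in for the strided kernel: one loop iteration per device thread.
template <typename T>
void initStrided(std::vector<T>& a, const T& value, std::size_t incx,
                 std::size_t blockSize) {
    const std::size_t threads = blocksFor(a.size(), incx, blockSize) * blockSize;
    for (std::size_t tid = 0; tid < threads; ++tid) {
        if (tid * incx < a.size()) a[tid * incx] = value;
    }
}
```

For a 3x4 row-major matrix stored in 12 elements, calling `initStrided` with `incx = 4` writes indices 0, 4 and 8, which is the first column.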

Opencl - Transfer Global memory Work-Group + border to Local memory

Here is a draft of the code I produced:
void __kernel myKernel(__global const short* input,
__global short* output,
const int width,
const int height){
// Always square. (and 16x16 in our example)
const uint local_size = get_local_size(0);
// Get the work-item col/row index
const uint wi_c = get_local_id(0);
const uint wi_r = get_local_id(1);
// Get the global col/row index
const uint g_c = get_global_id(0);
const uint g_r = get_global_id(1);
// Declare a local array NxN
const uint arr_size = local_size *local_size ;
__local short local_in[arr_size];
// Transfer the global memory into the local one.
local_in[wi_c + wi_r*local_size ] = input[g_c + g_r*width];
// Wait until all the work-items are synced
barrier(CLK_LOCAL_MEM_FENCE);
// Now add code to process on the local array (local_in).
}
As far as I understand OpenCL work-groups and work-items, this is what I need to do to copy a 16x16 ROI from global to local memory. (Please correct me if I'm wrong, since I'm just beginning with this.)
So after the barrier, each element in local_in can be accessed via wi_c + wi_r*local_size.
But now let's do something tricky. If I want each work-item in my work-group to work on a 3x3 neighborhood, I will need an 18x18 local_in array.
But how do I create this? I have only 16x16=256 work-items (threads), but I need 18x18=324 elements (68 threads are missing).
My basic idea would be to do:
if(wi_c == 0 && wi_r == 0){
// Code that copies the border into the new array, which should be
// local_in[(local_size+2)*(local_size+2)];
}
But this is terrible, since the first work-item (1st thread) will have to handle the whole border and the rest of the work-items in this group will just be waiting for this 1st work-item to finish. (Again, this is my understanding of OpenCL; it might be wrong.)
So here are my real questions:
Is there another, easier solution for this kind of problem? Like changing the NDRange local size to be overlapping, or something?
I have started to read about coalesced memory access; does my first draft of code achieve it? I don't think so, since I'm using a "stride" approach to load the global memory. But I don't understand how I could change the first part of that code to make it efficient as well.
Once the barrier is reached, each work-item continues processing to get a final value that needs to be stored back into the global output array. Should I put another barrier before this write, or is it fine to let all the work-items finish on their own?
I tried different approaches and came up with this final version, which has fewer "if"s and uses the threads as much as possible (the second phase might not be fully efficient since many threads are idle, but it's the best I was able to get).
The principle is to set an origin (start position) at the top-left corner and create the read/write indices from this position using the loop indices. The loops start at the 2D local id position. So all 256 work-items write their first element, and in phase two the remaining 68 elements (the 2 bottom rows and 2 right columns) are loaded by the work-items whose local id is below 2 in either dimension.
I'm not an OpenCL pro yet, so this could still be improved (maybe with loop unrolling, I don't know).
__local float wrkSrc[324];
const int lpitch = 18;
// Add halfROI to handle the corner
// (halfROI, col and row are not defined in this excerpt; the three lines below
// are inferred from lpitch = 16 + 2 * halfROI and from how gid is used.)
const int halfROI = 1;
const int col = get_global_id(0);
const int row = get_global_id(1);
const int lcol = get_local_id(0);
const int lrow = get_local_id(1);
const int2 gid = { col, row };
const int2 lid = { lcol, lrow };
// Always get the most Top-left corner of that ROI to extract.
const int2 startPos = gid - lid - halfROI;
// Loop on each thread to get their right ID.
// Threads with id < 2 * halfROI will process more than others, but that is not much of an issue.
for ( int x = lid.x; x < lpitch; x += 16 ) {
    for ( int y = lid.y; y < lpitch; y += 16 ) {
        // Get the position to write into the local array.
        const int lidx = x + y * lpitch;
        // Get the position to read into the global memory (src)
        const int2 readPos = startPos + (int2)( x, y );
        // Is inside ?
        if ( readPos.x >= 0 && readPos.x < width && readPos.y >= 0 && readPos.y < height )
            wrkSrc[lidx] = src[readPos.x + readPos.y * lab_new_pitch];
        else
            wrkSrc[lidx] = 0.0f;
    }
}
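To check that the two strided loops really fill the whole 18x18 tile with no cell loaded twice, the work-group can be emulated on the host. This is a C++ sketch rather than OpenCL, and `tileLoadCoverage` is a name invented for it. It counts how many times each tile cell would be written.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates one work-group cooperatively loading a (group + 2*halo)^2 tile.
// Returns the number of writes per tile cell so the coverage can be checked.
inline std::vector<int> tileLoadCoverage(int groupSize, int halo) {
    assert(groupSize > 0 && halo >= 0);
    const int pitch = groupSize + 2 * halo;
    std::vector<int> writes(static_cast<std::size_t>(pitch) * pitch, 0);
    // Outer two loops: one iteration per work-item (its 2D local id).
    for (int ly = 0; ly < groupSize; ++ly) {
        for (int lx = 0; lx < groupSize; ++lx) {
            // Inner two loops: the strided loops each work-item executes.
            for (int x = lx; x < pitch; x += groupSize) {
                for (int y = ly; y < pitch; y += groupSize) {
                    ++writes[static_cast<std::size_t>(x + y * pitch)];
                }
            }
        }
    }
    return writes;
}
```

With a 16x16 group and a halo of 1, every one of the 324 cells is written exactly once: 256 in the first iteration of the loops and the remaining 68 by the work-items with a local id of 0 or 1 in either dimension.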

How to speed up this GSL code for selecting a submatrix?

I wrote a very simple function in GSL, to select a submatrix from an existing matrix in a struct.
EDIT: I had timed VERY INCORRECTLY and didn't notice the changed number of zeros in front. Still, I hope this can be sped up.
For 100x100 submatrices of a 10000x10000 matrix, it takes 1.2E-5 seconds. So, repeating that 1E4 times, takes 50 times longer than I need to diagonalise the 100x100 matrix.
I realise, it happens even if I comment out everything except return(0);
Thus, I theorize, it must be something about struct TOWER. This is how TOWER looks:
struct TOWER
{
    int array_level[TOWERSIZE];
    int array_window[TOWERSIZE];
    gsl_matrix *matrix_ordered_covariance;
    gsl_matrix *matrix_peano_covariance;
    double array_angle_tw[XISTEP];
    double array_correl_tw[XISTEP];
    gsl_interp_accel *acc_correl; // interpolating for correlation
    gsl_spline *spline_correl;
    double array_all_eigenvalues[TOWERSIZE]; //contains all eiv. of whole matrix
    std::vector< std::vector<double> > cropped_peano_covariance, peano_mask;
};
Below comes my function!
/* --- --- */
int monolevelsubmatrix(int i, int j, struct TOWER *tower, gsl_matrix *result) //relying on spline!! //must add auto vanishing
{
    int firstrow, firstcol,mu,nu,a,b;
    double aux, correl;
    firstrow = helix*i;
    firstcol = helix*j;
    gsl_matrix_view Xi = gsl_matrix_submatrix (tower ->matrix_ordered_covariance, firstrow, firstcol, helix, helix);
    gsl_matrix_memcpy (result, &(Xi.matrix));
    return(0);
}
/* --- --- */
The problem is almost certainly gsl_matrix_memcpy. The source for that is in copy_source.c, with:
const size_t src_tda = src->tda ;
const size_t dest_tda = dest->tda ;
size_t i, j;
for (i = 0; i < src_size1 ; i++)
for (j = 0; j < MULTIPLICITY * src_size2; j++)
dest->data[MULTIPLICITY * dest_tda * i + j]
= src->data[MULTIPLICITY * src_tda * i + j];
This would be quite slow. Note that gsl_matrix_memcpy raises a GSL_ERROR if the matrices are different sizes, so it's very likely the copy could be served with a CRT memcpy on the data members of dest and src.
This loop is very slow. Each cell is dereferenced through the dest and src structs for the data member, and THEN indexed.
You could choose to write a replacement for the library, or write your own personal version of this matrix copy, with something like (untested suggestion code here):
size_t cellsize = sizeof( src->data[0] ); // just pseudocode here
memcpy( dest->data, src->data, cellsize * src_size1 * src_size2 * MULTIPLICITY );
Note that MULTIPLICITY is a define, usually 1 or 2, probably depends on library configuration - might not apply to your usage (if it's 1 )
Now, important caveat....if the source matrix is a subview, then you have to go by rows...that is, a loop of rows in i where crt's memcpy is limited to rows at a time, not the entire matrix as I show above.
In other words, you do have to account for the source matrix geometry from which the subview was taken...that's probably why they index each cell (makes it simple).
If, however, you KNOW the geometry, you can very likely optimize this WAY above the performance you're seeing.
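A row-at-a-time copy for the subview case could look like the sketch below. It works on plain row-major `double` arrays rather than on `gsl_matrix` itself, so the function name and parameters are assumptions; `srcTda` corresponds to the `tda` field (the distance between consecutive rows, in elements).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copies a rows x cols block starting at (firstRow, firstCol) out of a
// row-major matrix whose rows are srcTda doubles apart. dst is a dense
// rows x cols buffer. One memcpy per row, since the rows of a subview are
// contiguous individually but not as a whole.
inline void copySubmatrix(const double* src, std::size_t srcTda,
                          std::size_t firstRow, std::size_t firstCol,
                          std::size_t rows, std::size_t cols, double* dst) {
    assert(firstCol + cols <= srcTda);
    for (std::size_t i = 0; i < rows; ++i) {
        const double* srcRow = src + (firstRow + i) * srcTda + firstCol;
        std::memcpy(dst + i * cols, srcRow, cols * sizeof(double));
    }
}
```

For the 100x100 blocks of a 10000x10000 matrix in the question, this amounts to 100 calls to memcpy of 800 bytes each per submatrix.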
If all you did was take out the src/dest dereference, you'd see SOME performance gain, as in:
const size_t src_tda = src->tda ;
const size_t dest_tda = dest->tda ;
size_t i, j;
double * dest_data = dest->data; // pseudocode here (gsl_matrix stores doubles)
double * src_data = src->data; // pseudocode here
for (i = 0; i < src_size1 ; i++)
for (j = 0; j < MULTIPLICITY * src_size2; j++)
dest_data[MULTIPLICITY * dest_tda * i + j]
= src_data[MULTIPLICITY * src_tda * i + j];
We'd HOPE the compiler recognized that anyway, but...sometimes...

How to use a 2d vector of pointers

What is the correct way to implement an efficient 2d vector? I need to store a set of Item objects in a 2d collection, that is fast to iterate (most important) and also fast to find elements.
I have a 2d vector of pointers declared as follows:
std::vector<std::vector<Item*>> * items;
In the constructor, I instantiate it as follows:
items = new std::vector<std::vector<Item*>>();
items->resize(10, std::vector<Item*>(10, new Item()));
How do I (correctly) implement methods for accessing items? E.g.:
items[3][4] = new Item();
AddItem(Item *& item, int x, int y)
{
    items[x][y] = item;
}
My reasoning for using pointers is for better performance, so that I can pass things around by reference.
If there is a better way to go about this, please explain, however I would still be interested in how to correctly use the vector.
Edit: For clarification, this is part of a class that is for inventory management in a simple game. The set 10x10 vector represents the inventory grid which is a set size. The Item class contains the item type, a pointer to an image in the resource manager, stack size etc.
My pointer usage was in an attempt to improve performance, since this class is iterated and used to render the whole inventory every frame, using the image pointer.
It seems that you know the size of the matrix beforehand, and that this matrix is square. Though vector<> is fine, you can also use a plain array in that case.
Item **m = new Item*[ n * n ];
If you want to access position r,c, then you only have to multiply r by n, and then add c:
pos = ( r * n ) + c;
So, if you want to access position 1, 2, and n = 5, then:
pos = ( 1 * 5 ) + 2;
Item * it = m[ pos ];
Also, instead of using plain pointers, you can use smart pointers, such as auto_ptr (deprecated in C++11 and removed in C++17) or unique_ptr, which behave similarly here: once they are destroyed, they destroy the object they are pointing to.
std::unique_ptr<Item> * m = new std::unique_ptr<Item>[ n * n ];
The only drawback is that now you need to call get() in order to obtain the pointer.
pos = ( 1 * 5 ) + 2;
Item * it = m[ pos ].get();
Here you have a class that summarizes all of this:
class ItemsSquaredMatrix {
public:
    ItemsSquaredMatrix(unsigned int i): size( i )
        { m = new std::unique_ptr<Item>[ size * size ]; }
    // Copying would delete m twice, so it is disabled.
    ItemsSquaredMatrix(const ItemsSquaredMatrix &) = delete;
    ItemsSquaredMatrix & operator=(const ItemsSquaredMatrix &) = delete;
    ~ItemsSquaredMatrix()
        { delete[] m; }
    Item * get(unsigned int row, unsigned int col)
        { return m[ translate( row, col ) ].get(); }
    const Item * get(unsigned int row, unsigned int col) const
        { return m[ translate( row, col ) ].get(); }
    void set(unsigned int row, unsigned int col, Item * it)
        { m[ translate( row, col ) ].reset( it ); }
private:
    unsigned int translate(unsigned int row, unsigned int col) const
        { return ( ( row * size ) + col ); }
    unsigned int size;
    std::unique_ptr<Item> * m;
};
Now you only have to create the class Item. But if you created a specific class, then you'd have to duplicate ItemsSquaredMatrix for each new kind of data. In C++ there is a specific solution for this: turning the class above into a template (hint: vector<> is a template). Since you are a beginner, it will be simpler to have Item as an abstract class:
class Item {
public:
    virtual ~Item() {}
    // more things...
    virtual std::string toString() const = 0;
};
And derive all the data classes you create from it. Remember to do a cast, though...
As you can see, there are a lot of open questions, and more questions will arise as you keep unveiling things. Enjoy!
Hope this helps.
For numerical work, you want to store your data as locally as possible in memory. For example, if you were making an n by m matrix, you might be tempted to define it as
vector<vector<double>> mat(n, vector<double>(m));
There are severe disadvantages to this approach. Firstly, it will not work with any proper matrix libraries, such as BLAS and LAPACK, which expect the data to be contiguous in memory. Secondly, even if it did, it would lead to lots of random access and pointer indirection in memory, which would kill the performance of your matrix operations. Instead, you need a contiguous block of memory n*m items in size.
vector<double> mat(n*m);
But you wouldn't really want to use a vector for this, as you would then need to translate from 1d to 2d indices manually. There are some libraries that do this for you in C++. One of them is Blitz++, but that seems to not be much developed now. Other alternatives are Armadillo and Eigen. See this previous answer for more details.
Using Eigen, for example, the matrix declaration would look like this:
MatrixXd mat(n,m);
and you would be able to access elements as mat(i,j), and multiply matrices as mat1*mat2, and so on.
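If pulling in a library is not an option, the manual 1d-to-2d index translation mentioned above can be hidden in a very small wrapper. The following is only a sketch with a hypothetical `Matrix` class; it offers none of the arithmetic that Eigen or Armadillo provide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dense rows x cols matrix stored row-major in one contiguous block.
class Matrix {
public:
    Matrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    double& operator()(std::size_t r, std::size_t c) {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }
    double operator()(std::size_t r, std::size_t c) const {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }

    // Raw pointer to the contiguous storage, e.g. for handing to a C library.
    double* data() { return data_.data(); }
    const double* data() const { return data_.data(); }
    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_, cols_;
    std::vector<double> data_;
};
```

Element (r, c) lives at offset r * cols + c. Note that this is row-major, whereas BLAS and LAPACK routines traditionally expect column-major storage, so the layout needs to be taken into account when passing `data()` to them.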
The first question is why the pointers. There's almost never any reason
to have a pointer to an std::vector, and it's not that often that
you'd have a vector of pointers. Your definition should probably be:
std::vector<std::vector<Item> > items;
, or at the very least (supposing that e.g. Item is the base of a
polymorphic hierarchy):
std::vector<std::vector<Item*> > items;
As for your problem, the best solution is to wrap your data in some sort
of a Vector2D class, which contains an std::vector<Item> as member,
and does the index calculations to access the desired element:
class Vector2D
{
    int my_rows;
    int my_columns;
    std::vector<Item> my_data;
public:
    Vector2D( int rows, int columns )
        : my_rows( rows )
        , my_columns( columns )
        , my_data( rows * columns )
    {
    }
    Item& get( int row, int column )
    {
        assert( row >= 0 && row < my_rows
                && column >= 0 && column < my_columns );
        return my_data[row * my_columns + column];
    }
    class RowProxy
    {
        Vector2D* my_owner;
        int my_row;
    public:
        RowProxy(Vector2D& owner, int row)
            : my_owner( &owner )
            , my_row( row )
        {
        }
        Item& operator[]( int column ) const
        {
            return my_owner->get( my_row, column );
        }
    };
    RowProxy operator[]( int row )
    {
        return RowProxy( *this, row );
    }
    // OR...
    Item& operator()( int row, int column )
    {
        return get( row, column );
    }
};
If you forgo bounds checking (but I wouldn't recommend it), the
RowProxy can be a simple Item*.
And of course, you should duplicate the above for const.
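For the fixed-size inventory grid described in the question, the const and non-const accessors might look like the sketch below. It swaps the std::vector for a std::array, since the dimensions are known at compile time. The `Item` fields and the `InventoryGrid` name are placeholders for this example, not the asker's actual classes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the asker's Item; the fields are illustrative.
struct Item {
    int type = 0;
    int stackSize = 0;
};

// Fixed-size grid stored contiguously; W and H are compile-time constants.
template <std::size_t W, std::size_t H>
class InventoryGrid {
public:
    Item& at(std::size_t x, std::size_t y) {
        assert(x < W && y < H);
        return cells_[y * W + x];
    }
    const Item& at(std::size_t x, std::size_t y) const {
        assert(x < W && y < H);
        return cells_[y * W + x];
    }

    // Iteration over all cells in storage order (the per-frame render path).
    auto begin() { return cells_.begin(); }
    auto end() { return cells_.end(); }
    auto begin() const { return cells_.begin(); }
    auto end() const { return cells_.end(); }

private:
    std::array<Item, W * H> cells_{};
};
```

Rendering can then be a single range-based for loop over the grid, which walks the 100 items of a 10x10 inventory in memory order without any pointer chasing.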