I've been looking at Thrust and I stumbled upon a question that almost (but not quite) answered mine: Finding the maximum element value AND its position using CUDA Thrust
The example posted in the answer there works fine, but how can I do the same thing with raw pointers? Assume the following code, which I believe to be correct (ignore the kernel configuration; it is kept simple on purpose):
float* d_A;
const unsigned int noElems = 10;
cudaMalloc(&d_A, noElems * sizeof(float));
initDeviceVector<<<1, noElems>>>(d_A);
thrust::device_ptr<float> d_ptr = thrust::device_pointer_cast(d_A);
thrust::device_vector<float>::iterator iter =
thrust::max_element(d_ptr, d_ptr + noElems);
I can't quite figure out how to extract the position using iter and raw pointers.
Thank you for your time.
There are probably a number of ways to do this. However, working directly from your code, we can compare the value of iter to a device_ptr if we convert it to a suitable device pointer first.
The following fully worked example demonstrates this:
$ cat t436.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/extrema.h>
#include <stdio.h>
__global__ void initDeviceVector(float *data){
int idx = threadIdx.x+blockDim.x*blockIdx.x;
data[idx] = idx%7;
}
int main(){
float* d_A;
const unsigned int noElems = 10;
cudaMalloc(&d_A, noElems * sizeof(float));
initDeviceVector<<<1, noElems>>>(d_A);
thrust::device_ptr<float> d_ptr = thrust::device_pointer_cast(d_A);
thrust::device_vector<float>::iterator iter = thrust::max_element(d_ptr, d_ptr + noElems);
int pos = thrust::device_pointer_cast(&(iter[0])) - d_ptr;
printf("pos = %d\n", pos);
return 0;
}
$ nvcc -arch=sm_20 -o t436 t436.cu
$ ./t436
pos = 6
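The underlying idea, subtracting the start of the range from the iterator that max_element returns, can also be sketched on the host with the standard library. This is only an analogy: plain host pointers stand in for thrust::device_ptr, and argmax is a hypothetical helper name that is not part of Thrust.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: index of the first maximum element of a raw array.
// Assumes n > 0 and that data points to host memory.
std::ptrdiff_t argmax(const float* data, std::size_t n) {
    const float* it = std::max_element(data, data + n);
    return it - data;  // pointer difference gives the element position
}
```

Called on an array holding idx%7 for idx = 0..9, the same values the kernel above writes, this returns position 6.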
I have an MPI program written in C++, which I run on 2+ machines: one is a Fedora machine with an Intel i5-4590T CPU, the other is a Windows machine running Ubuntu 22.04 on WSL 2 (CPU: i7-7820x).
I have the same MPI program on both machines, which calls MPI_Alltoallw. On the Fedora machine, the Alltoallw call works fine and the output buffer is correctly filled, while on WSL this does not happen and the output buffer is left unmodified by the MPI function call (no errors are reported at compile time or at runtime).
I use in both cases mpic++ (g++ version is 11.3.0 on WSL, 11.3.1 on Fedora) and OpenMPI version 4.1.4.
I also tried the same code on an HPC system (RHEL 8.4, g++ 11.3.0, OpenMPI 4.1.4) and it works fine there as well.
Simpler MPI calls (e.g. initializing communicators, an MPI_Bcast) also work fine on WSL.
Is this a known problem with WSL2? Or can something like this arise from a fault in my code?
Edit: here is a minimal example of the code (which also uses Eigen).
#include <iostream>
#include <Eigen/Dense>
#include <unsupported/Eigen/CXX11/Tensor>
#include <cmath>
#include <complex>
#include <mpi.h>
int main() {
const int nc = 8;
const int nzc = 8;
const int nz2 = nzc*nc/2;
const int nzd = 3*nz2;
const int nxs2 = 32;
const int ny = 128;
const int nyc = ny/nc;
const int nbytes_cmplxd = 16;
Eigen::TensorFixedSize<std::complex<double>, Eigen::Sizes<ny, nxs2 + 1, nzc>> xc;
Eigen::TensorFixedSize<std::complex<double>, Eigen::Sizes<nyc, nxs2+1, nzd>> buf2;
xc.setRandom(); //xc is set to some values
int count[nc];
//initialize counts to 1
for (int i = 0; i < nc; i++) {
count[i] = 1;
}
//first initializes mpi types and displacements for the transposition from xc to buf2
//init simple datatypes
MPI_Datatype sendloc1;
MPI_Datatype recvloc1;
MPI_Type_vector(nzc*(nxs2+1), nyc, ny, MPI_DOUBLE_COMPLEX, &sendloc1);
MPI_Type_vector(nzc*(nxs2+1)*nyc, 1, 1, MPI_DOUBLE_COMPLEX, &recvloc1);
int senddisp1[nc];
int recvdisp1[nc];
MPI_Datatype sendtypev1[nc];
MPI_Datatype recvtypev1[nc];
for (int i = 0; i<nc; i++) {
senddisp1[i] = nbytes_cmplxd * i * (nyc); //displacement due to column major ordering
if (i<(nc/2)) {
recvdisp1[i] = nbytes_cmplxd * i * (nzc*(nxs2+1)*nyc); //displacement equal to entire size of recvtype
} else { //add displacement to introduce padding (equal to 1/3 of the array size)
recvdisp1[i] = nbytes_cmplxd * (i * (nzc*(nxs2+1)*nyc) + nz2*(nxs2+1)*nyc);
}
sendtypev1[i] = sendloc1;
recvtypev1[i] = recvloc1;
}
//committing the MPI_Datatype vectors is not needed
MPI_Alltoallw(xc.data(), count, senddisp1, sendtypev1,
buf2.data(), count, recvdisp1, recvtypev1, MPI_COMM_WORLD);
std::cout << buf2 << std::endl; //buf2 is zero after the call
return 0;
}
I'm trying to write an R wrapper for the FINUFFT routines for calculating the FFT of an unevenly sampled series. I have virtually no experience with C/C++, so I'm working from an example that compares the traditional Fourier transform to the NUFFT. The example code follows.
// this is all you must include for the finufft lib...
#include "finufft.h"
#include <complex>
// also needed for this example...
#include <stdio.h>
#include <stdlib.h>
using namespace std;
int main(int argc, char* argv[])
/* Simple example of calling the FINUFFT library from C++, using plain
arrays of C++ complex numbers, with a math test. Barnett 3/10/17
Double-precision version (see example1d1f for single-precision)
Compile with:
g++ -fopenmp example1d1.cpp -I ../src ../lib-static/libfinufft.a -o example1d1 -lfftw3 -lfftw3_omp -lm
or if you have built a single-core version:
g++ example1d1.cpp -I ../src ../lib-static/libfinufft.a -o example1d1 -lfftw3 -lm
Usage: ./example1d1
*/
{
int M = 1e6; // number of nonuniform points
int N = 1e6; // number of modes
double acc = 1e-9; // desired accuracy
nufft_opts opts; finufft_default_opts(&opts);
complex<double> I = complex<double>(0.0,1.0); // the imaginary unit
// generate some random nonuniform points (x) and complex strengths (c):
double *x = (double *)malloc(sizeof(double)*M);
complex<double>* c = (complex<double>*)malloc(sizeof(complex<double>)*M);
for (int j=0; j<M; ++j) {
x[j] = M_PI*(2*((double)rand()/RAND_MAX)-1); // uniform random in [-pi,pi)
c[j] = 2*((double)rand()/RAND_MAX)-1 + I*(2*((double)rand()/RAND_MAX)-1);
}
// allocate output array for the Fourier modes:
complex<double>* F = (complex<double>*)malloc(sizeof(complex<double>)*N);
// call the NUFFT (with iflag=+1): note N and M are typecast to BIGINT
int ier = finufft1d1(M,x,c,+1,acc,N,F,opts);
int n = 142519; // check the answer just for this mode...
complex<double> Ftest = complex<double>(0,0);
for (int j=0; j<M; ++j)
Ftest += c[j] * exp(I*(double)n*x[j]);
int nout = n+N/2; // index in output array for freq mode n
double Fmax = 0.0; // compute inf norm of F
for (int m=0; m<N; ++m) {
double aF = abs(F[m]);
if (aF>Fmax) Fmax=aF;
}
double err = abs(F[nout] - Ftest)/Fmax;
printf("1D type-1 NUFFT done. ier=%d, err in F[%d] rel to max(F) is %.3g\n",ier,n,err);
free(x); free(c); free(F);
return ier;
}
Much of this I don't need, such as generating the test series and comparing to the traditional FFT. Further, I want to return the values of the transform, not just an error code indicating success. Below is my code.
#include "finufft.h"
#include <complex>
#include <Rcpp.h>
#include <stdlib.h>
using namespace Rcpp;
using namespace std;
// [[Rcpp::export]]
ComplexVector finufft(int M, NumericVector x, ComplexVector c, int N) {
// From example code for finufft, sets precision and default options
double acc = 1e-9;
nufft_opts opts; finufft_default_opts(&opts);
// allocate output array for the finufft routine:
complex<double>* F = (complex<double>*)malloc(sizeof(complex<double>*)*N);
// Change vector inputs from R types to C++ types
double* xd = as< double* >(x);
complex<double>* cd = as< complex<double>* >(c);
// call the NUFFT (with iflag=-1): note N and M are typecast to BIGINT
int ier = finufft1d1(M,xd,cd,-1,acc,N,F,opts);
ComplexVector Fd = as<ComplexVector>(*F);
return Fd;
}
When I try to source this in RStudio, I get the error "no matching function for call to 'as(std::complex<double>*&)'", pointing to the line declaring Fd towards the end. I believe the error indicates that either the function 'as' isn't defined (which I know is false), or the argument to 'as' isn't of the correct type. The examples here include one using 'as' to convert to a NumericVector, so unless there's some complication with complex values I don't see why it should be a problem here.
I know there are potential problems using two namespaces, but I don't believe that's the issue here. My best guess is that there's an issue with how I'm trying to use pointers, but I lack the experience to identify it and I can't find any similar examples online to guide me.
Rcpp::as<T> converts from an R data type (SEXP) to a C++ data type, e.g. Rcpp::ComplexVector. This does not fit your situation, where you try to convert from a C-style array to C++. Fortunately Rcpp::Vector, which is the basis for Rcpp::ComplexVector, has a constructor for this task: Vector (InputIterator first, InputIterator last). For the other direction (going from C++ to C-style array) you can use vector.begin() or &vector[0].
However, one needs a reinterpret_cast to convert between Rcomplex* and std::complex<double>*. That should cause no problems, though, since Rcomplex (a.k.a. complex double in C) and std::complex<double> are compatible.
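The compatibility rests on the storage layout: a std::complex<double> is stored as two consecutive doubles, real part first, which matches a C struct of two doubles. The following sketch illustrates that layout in plain C++ without Rcpp; to_interleaved is a hypothetical helper name chosen for this illustration.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper: copies n complex values into an interleaved
// {re, im, re, im, ...} buffer by viewing the complex array as doubles.
// The C++ standard permits this array-oriented access for std::complex<T>.
std::vector<double> to_interleaved(const std::complex<double>* z, std::size_t n) {
    const double* raw = reinterpret_cast<const double*>(z);
    return std::vector<double>(raw, raw + 2 * n);
}
```

For the two values 1+2i and 3+4i the resulting buffer holds 1, 2, 3, 4 in that order.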
A minimal example:
#include <Rcpp.h>
#include <complex>
using namespace Rcpp;
// [[Rcpp::export]]
ComplexVector foo(ComplexVector v) {
std::complex<double>* F = reinterpret_cast<std::complex<double>*>(v.begin());
int N = v.length();
// do something with F
ComplexVector Fd(reinterpret_cast<Rcomplex*>(F),
reinterpret_cast<Rcomplex*>(F + N));
return Fd;
}
/*** R
> Rcpp::sourceCpp('56675308/code.cpp')
> set.seed(42)
> foo(runif(4)*(1+1i))
[1] 0.9148060+0.9148060i 0.9370754+0.9370754i 0.2861395+0.2861395i 0.8304476+0.8304476i
*/
BTW, you can move these reinterpret_casts out of sight by using std::vector<std::complex<double>> as the argument and return types of your function. Rcpp does the rest for you. This also helps to get rid of the naked malloc:
#include <Rcpp.h>
// dummy function with reduced signature
int finufft1d1(int M, double *xd, std::complex<double> *cd, int N, std::complex<double> *Fd) {
return 0;
}
// [[Rcpp::export]]
std::vector<std::complex<double>> finufft(int M,
std::vector<double> x,
std::vector<std::complex<double>> c,
int N) {
// allocate output array for the finufft routine:
std::vector<std::complex<double>> F(N);
// Change vector inputs from R types to C++ types
double* xd = x.data();
std::complex<double>* cd = c.data();
std::complex<double>* Fd = F.data();
int ier = finufft1d1(M, xd, cd, N, Fd);
return F;
}
I have a large device array inputValues of int64_t type. Every 32 elements of this array are sorted in an ascending order. I have an unsorted search array removeValues.
My intention is to look for all the elements of removeValues inside inputValues and mark them as -1. What is the most efficient method to achieve this? I am using a compute capability 3.5 CUDA device, if that helps.
I am not looking for a higher level solution, i.e. I do not want to use thrust or cub, but I want to write this using cuda kernels.
My initial approach was to load every 32 values into shared memory in a thread block. Every thread also loads a single value from removeValues and does an independent binary search on the shared memory array. If found, the value is set accordingly using an if condition.
Wouldn't this approach involve a lot of bank conflicts and branch divergence? Do you think that branch divergence can be addressed by using ternary operators while implementing the binary search? Even if that is solved, how can bank conflict be eliminated? Since the size of sorted arrays is 32, would it be possible to implement a binary search using shuffle instructions? Would that help?
EDIT : I have added an example to show what I intend to achieve.
Let's say that inputValues is a vector where every 32 elements are sorted:
[2, 4, 6, ... , 64], [95, 97, ... , 157], [1, 3, ... , 63], [...]
The typical size of this array can range from 32*2 to 32*32. The values can range from 0 to INT64_MAX.
An example of removeValues would be:
[7, 75, 95, 106]
The typical size for this array could range from 1 to 1024.
After the operation removeValues would be:
[-1, 75, -1, 106]
The values in inputValues remain unchanged.
I would concur with the answer (now deleted) and comment by @harrism. Since I put some effort into the non-thrust approach, I'll present my findings.
I tried to naively implement a binary search at the warp-level using __shfl(), and then repeat that binary search across the data set, passing the data set through each 32-element group.
It's embarrassing, but my code is around 20x slower than thrust (in fact it may be worse than that if you do careful timing with nvprof).
I made the data sizes a little larger than what was proposed in the question, because the data sizes in the question are so small that the timing is in the dust.
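As a host-side illustration of searching one sorted 32-element group without data-dependent branches (the question asks about ternary operators), consider the sketch below. It is not the CUDA shuffle code from this answer: contains32 is a hypothetical helper, and on a GPU each probe of group[...] would be a shuffle or shared-memory read instead.

```cpp
#include <cstdint>

// Hypothetical helper: membership test in a sorted group of exactly 32 values.
// Every call performs the same five probes; a select replaces the branch.
bool contains32(const std::int64_t* group, std::int64_t key) {
    int base = 0;
    for (int step = 16; step > 0; step >>= 1) {
        // Skip the lower part of the remaining window when it is too small.
        base += (group[base + step - 1] < key) ? step : 0;
    }
    // base is now the number of elements below key, capped at 31.
    return group[base] == key;
}
```

With the group [2, 4, ..., 64] this reports 34 and 64 as present and 7 and 65 as absent.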
Here's a fully worked example of 2 approaches:
1. What is approximately outlined in the question, i.e. create a binary search using warp shuffle that can search up to 32 elements against a 32-element ordered array. Repeat this process for as many 32-element ordered arrays as there are, passing the entire data set through each ordered array (hopefully you can start to see some of the inefficiency now).
2. Use thrust, essentially the same as what is outlined by @harrism, i.e. sort the grouped data set, and then run a vectorized thrust::binary_search on that.
Here's the example:
$ cat t1030.cu
#include <stdio.h>
#include <assert.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
typedef long mytype;
const int gsize = 32;
const int nGRP = 512;
const int dsize = nGRP*gsize;//gsize*nGRP;
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
template <typename T>
__device__ T my_shfl32(T val, unsigned lane){
return __shfl(val, lane);
}
template <typename T>
__device__ T my_shfl64(T val, unsigned lane){
T retval = val;
int2 t1 = *(reinterpret_cast<int2 *>(&retval));
t1.x = __shfl(t1.x, lane);
t1.y = __shfl(t1.y, lane);
retval = *(reinterpret_cast<T *>(&t1));
return retval;
}
template <typename T>
__device__ bool bsearch_shfl(T grp_val, T my_val){
int src_lane = gsize>>1;
bool return_val = false;
T test_val;
int shift = gsize>>2;
for (int i = 0; i <= gsize>>3; i++){
if (sizeof(T)==4){
test_val = my_shfl32(grp_val, src_lane);}
else if (sizeof(T)==8){
test_val = my_shfl64(grp_val, src_lane);}
else assert(0);
if (test_val == my_val) return_val = true;
src_lane += (((test_val<my_val)*2)-1)*shift;
assert ((src_lane < gsize)&&(src_lane > 0));}
if (sizeof(T)==4){
test_val = my_shfl32(grp_val, 0);}
else if (sizeof(T)==8){
test_val = my_shfl64(grp_val, 0);}
else assert(0);
if (test_val == my_val) return_val = true;
return return_val;
}
template <typename T>
__global__ void bsearch_grp(const T * __restrict__ search_grps, T *data){
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int tid = threadIdx.x;
if (idx < gsize*nGRP){
T grp_val = search_grps[idx];
while (tid < dsize){
T my_val = data[tid];
if (bsearch_shfl(grp_val, my_val)) data[tid] = -1;
tid += blockDim.x;}
}
}
int main(){
// data setup
assert(gsize == 32); //mandatory (warp size)
assert((dsize % 32)==0); //needed to preserve shfl capability
thrust::host_vector<mytype> grps(gsize*nGRP);
thrust::host_vector<mytype> data(dsize);
thrust::host_vector<mytype> result(dsize);
for (int i = 0; i < gsize*nGRP; i++) grps[i] = i;
for (int i = 0; i < dsize; i++) data[i] = i;
// method 1: individual shfl-based binary searches on each group
mytype *d_grps, *d_data;
cudaMalloc(&d_grps, gsize*nGRP*sizeof(mytype));
cudaMalloc(&d_data, dsize*sizeof(mytype));
cudaMemcpy(d_grps, &(grps[0]), gsize*nGRP*sizeof(mytype), cudaMemcpyHostToDevice);
cudaMemcpy(d_data, &(data[0]), dsize*sizeof(mytype), cudaMemcpyHostToDevice);
unsigned long long my_time = dtime_usec(0);
bsearch_grp<<<nGRP, gsize>>>(d_grps, d_data);
my_time = dtime_usec(my_time);
cudaMemcpy(&(result[0]), d_data, dsize*sizeof(mytype), cudaMemcpyDeviceToHost);
for (int i = 0; i < dsize; i++) if (result[i] != -1) {printf("method 1 mismatch at %d, was %d, should be -1\n", i, (int)(result[i])); return 1;}
printf("method 1 time: %fs\n", my_time/(float)USECPSEC);
// method 2: thrust sort, followed by thrust binary search
thrust::device_vector<mytype> t_grps = grps;
thrust::device_vector<mytype> t_data = data;
thrust::device_vector<bool> t_rslt(t_data.size());
my_time = dtime_usec(0);
thrust::sort(t_grps.begin(), t_grps.end());
thrust::binary_search(t_grps.begin(), t_grps.end(), t_data.begin(), t_data.end(), t_rslt.begin());
my_time = dtime_usec(my_time);
thrust::host_vector<bool> rslt = t_rslt;
for (int i = 0; i < dsize; i++) if (rslt[i] != true) {printf("method 2 mismatch at %d, was %d, should be 1\n", i, (int)(rslt[i])); return 1;}
printf("method 2 time: %fs\n", my_time/(float)USECPSEC);
// method 3: multiple thrust merges, followed by thrust binary search
return 0;
}
$ nvcc -O3 -arch=sm_35 t1030.cu -o t1030
$ ./t1030
method 1 time: 0.009075s
method 2 time: 0.000516s
I was running this on Linux, CUDA 7.5, GT640 GPU. Obviously the performance will be different on different GPUs, but I'd be surprised if any GPU significantly closed the gap.
In short, you'd be well advised to use a well-tuned library like thrust or cub. If you don't like the monolithic nature of thrust, you could try cub. I don't know if cub has a binary search, but a single binary search against the whole sorted data set is not a difficult thing to write, and it's the smaller part of the time involved (for method 2 -- identifiable using nvprof or additional timing code).
Since your 32-element grouped ranges are already sorted, I also pondered the idea of using multiple thrust::merge operations rather than a single sort. I'm not sure which would be faster, but since the thrust method is already so much faster than the 32-element shuffle search method, I think thrust (or cub) is the obvious choice.
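The logic of method 2 (one global sort, then one binary search per query) can be sketched on the host with the standard library. This is a CPU analogy under stated assumptions, not the thrust code above: mark_found is a hypothetical helper, and the -1 marking follows the example in the question.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: replaces every entry of removeValues that occurs in
// inputValues with -1. inputValues is taken by value, so the caller's copy
// stays unchanged even though it is sorted here.
void mark_found(std::vector<std::int64_t> inputValues,
                std::vector<std::int64_t>& removeValues) {
    std::sort(inputValues.begin(), inputValues.end());  // one global sort
    for (std::int64_t& v : removeValues) {
        if (std::binary_search(inputValues.begin(), inputValues.end(), v)) {
            v = -1;  // mark as found
        }
    }
}
```

With the three groups from the question and removeValues = [7, 75, 95, 106], the result is [-1, 75, -1, 106].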
I'm having some trouble understanding an SSE2 instruction. According to the Microsoft documentation, _mm_load_si128 requires a 16-byte-aligned address as its parameter. In the code I am trying to understand, this does not seem to be the case:
void f(uchar* buf0, const int n)
{
ushort* buf = (ushort*)alignPtr(buf0, 16);
for(int i = 0; i < n; i += 16)
{
__m128i v0 = _mm_load_si128((__m128i*)(buf+i)); // 16-byte-aligned, since buf is 16-byte-aligned and i is divisible by 16.
__m128i v1 = _mm_load_si128((__m128i*)(buf+i+8)); // If buf+i is 16-byte-aligned, then buf+i+8 cannot be 16-byte-aligned.
}
}
I reduced the code to the relevant part and renamed some variables. The original code is from the OpenCV implementation of Konolige's block matching algorithm (stereobm.cpp, especially line 313). My question is: why is the code correct, and what is written into v1?
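One observation that may resolve the apparent contradiction: pointer arithmetic on a ushort* advances in units of sizeof(ushort), so buf+i+8 lies 16 bytes past buf+i, not 8 bytes. The sketch below checks this in plain C++, assuming ushort is a 16-bit type (std::uint16_t here); byte_offset and is_aligned16 are hypothetical helpers, and the alignment test assumes the usual linear mapping from pointers to integers.

```cpp
#include <cstddef>
#include <cstdint>

// Byte distance between p and p + n for 16-bit elements.
// p + n must stay inside (or one past the end of) the same array.
std::ptrdiff_t byte_offset(const std::uint16_t* p, int n) {
    return reinterpret_cast<const char*>(p + n) - reinterpret_cast<const char*>(p);
}

// True if the address is a multiple of 16 bytes.
bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

For a 16-byte-aligned buffer of 16-bit values, buf + 8 is again 16-byte-aligned while buf + 4 is not, so under that assumption the second load would read the eight values that follow those read by the first.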
.h file:
#define VECTOR_SIZE 1024
.cpp file:
int main ()
{
unsigned int* A;
A = new unsigned int [VECTOR_SIZE];
CopyToDevice (A);
}
.cu file:
void CopyToDevice (unsigned int *A)
{
ulong4 *UA;
unsigned int VectorSizeUlong4 = VECTOR_SIZE / 4;
unsigned int VectorSizeBytesUlong4 = VectorSizeUlong4 * sizeof(ulong4);
cudaMalloc( (void**)&UA, VectorSizeBytesUlong4 );
// how to use cudaMemcpy to copy data from A to UA?
// I tried to do the following but it gave access violation error:
for (int i=0; i<VectorSizeUlong4; ++i)
{
UA[i].x = A[i*4 + 0];
UA[i].y = A[i*4 + 1];
UA[i].z = A[i*4 + 2];
UA[i].w = A[i*4 + 3];
}
// I also tried to copy *A to device and then work on it instead going back to CPU to access *A every time but this did not work again
}
The CUDA ulong4 is a 16 byte aligned structure defined as
struct __builtin_align__(16) ulong4
{
unsigned long int x, y, z, w;
};
This means that, on platforms where unsigned long is 32 bits wide (for example, Windows), four consecutive 32 bit unsigned source integers occupy exactly the same storage as one ulong4. The simplest solution is contained right in the text on the image you posted: just cast (either implicitly or explicitly) the unsigned int pointer to a ulong4 pointer, use cudaMemcpy directly on the host and device memory, and pass the resulting device pointer to whatever kernel function you have that requires a ulong4 input. Your device transfer function could look something like:
ulong4* CopyToDevice (unsigned int* A)
{
ulong4 *UA, *UA_h;
size_t VectorSizeUlong4 = VECTOR_SIZE / 4;
size_t VectorSizeBytesUlong4 = VectorSizeUlong4 * sizeof(ulong4);
cudaMalloc( (void**)&UA, VectorSizeBytesUlong4);
UA_h = reinterpret_cast<ulong4*>(A); // not necessary but increases transparency
cudaMemcpy(UA, UA_h, VectorSizeBytesUlong4, cudaMemcpyHostToDevice); // the copy direction is a required argument
return UA;
}
[Usual disclaimer: written in browser, not tested or compiled, use at own risk]
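The cast relies on unsigned long and unsigned int having the same width, which holds on some platforms (for example 64-bit Windows) but not on others (for example 64-bit Linux, where unsigned long is 64 bits). A small host-side check can make that assumption explicit before reinterpreting the buffer. In this sketch Quad is a hypothetical stand-in for ulong4, not the CUDA type itself.

```cpp
#include <cstddef>

// Hypothetical stand-in for CUDA's ulong4: four unsigned long fields,
// 16-byte aligned.
struct alignas(16) Quad {
    unsigned long x, y, z, w;
};

// True when four unsigned ints occupy exactly the storage of one Quad,
// i.e. when viewing an unsigned int array as Quad elements keeps the sizes in step.
constexpr bool quad_matches_four_uints() {
    return sizeof(Quad) == 4 * sizeof(unsigned int);
}
```

If the check returns false, an element-wise conversion (or a vector type built from 32 bit fields) would be needed instead of a plain pointer cast.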
This should raise all alarm bells:
cudaMalloc( (void**)&UA, VectorSizeBytesUlong4 );
// ...
UA[i].x = A[i*4 + 0];
You are allocating UA on the device and then dereferencing it in host code. Don't ever do that. You will need to use cudaMemcpy to copy arrays to the device. This tutorial shows you a basic program that uses cudaMemcpy to copy things over. The length argument to cudaMemcpy is the length of your array in bytes, which in your case is VECTOR_SIZE * sizeof(unsigned int).