I am trying to apply OpenACC optimizations to a many-body simulation. I am currently hitting the following memory error:
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution
srun: error: jrc0017: task 0: Exited with exit code 1
I am using the pgc++ compiler with the flags -acc -Minfo=accel -ta=tesla -fast -std=c++11. I don't want to use -ta=tesla:managed because I want to manage memory myself.
#pragma acc kernels present(sim.part.rx, sim.part.ry, sim.part.rz, sim.part.vx, sim.part.vy, sim.part.vz)
for(int idx = 0; idx < sim.num; ++idx) { // Loop over target particle
    const float
        prx = sim.part.rx[idx], // my position
        pry = sim.part.ry[idx],
        prz = sim.part.rz[idx];
    float Fx = 0.f, Fy = 0.f, Fz = 0.f; // Force
    #pragma acc loop
    for(int jdx = 0; jdx < sim.num; ++jdx) { // Loop over interaction partners
        if(idx != jdx) { // No self-force
            const float dx = prx - sim.part.rx[jdx]; // Distance to partner
            const float dy = pry - sim.part.ry[jdx];
            const float dz = prz - sim.part.rz[jdx];
            const float h = 1.f/sqrt(dx*dx + dy*dy + dz*dz + eps);
            const float h3 = h*h*h;
            Fx += dx*h3; // Sum up force
            Fy += dy*h3;
            Fz += dz*h3;
        }
    }
    sim.part.vx[idx] += sim.mass*dt*Fx; // update velocity
    sim.part.vy[idx] += sim.mass*dt*Fy;
    sim.part.vz[idx] += sim.mass*dt*Fz;
}
If I delete the following lines:
sim.part.vx[idx] += sim.mass*dt*Fx; // update velocity
sim.part.vy[idx] += sim.mass*dt*Fy;
sim.part.vz[idx] += sim.mass*dt*Fz;
my code runs without problems, but I get the memory error again if I un-comment them. It seems that sim.part.vx is being updated without the compiler knowing about it, which leads to the memory problem.
Does anyone know how to fix this problem?

I suspect the problem is that sim and sim.part are not on the device (or the compiler doesn't realize that they're on the device). As a workaround, can you try introducing pointers to those arrays directly?
float *rx = sim.part.rx, *ry = sim.part.ry, *rz = sim.part.rz,
      *vx = sim.part.vx, *vy = sim.part.vy, *vz = sim.part.vz;
#pragma acc kernels present(rx, ry, rz, vx, vy, vz)
for(int idx = 0; idx < sim.num; ++idx) { // Loop over target particle
    const float
        prx = rx[idx], // my position
        pry = ry[idx],
        prz = rz[idx];
    float Fx = 0.f, Fy = 0.f, Fz = 0.f; // Force
    #pragma acc loop
    for(int jdx = 0; jdx < sim.num; ++jdx) { // Loop over interaction partners
        if(idx != jdx) { // No self-force
            const float dx = prx - rx[jdx]; // Distance to partner
            const float dy = pry - ry[jdx];
            const float dz = prz - rz[jdx];
            const float h = 1.f/sqrt(dx*dx + dy*dy + dz*dz + eps);
            const float h3 = h*h*h;
            Fx += dx*h3; // Sum up force
            Fy += dy*h3;
            Fz += dz*h3;
        }
    }
    vx[idx] += sim.mass*dt*Fx; // update velocity
    vy[idx] += sim.mass*dt*Fy;
    vz[idx] += sim.mass*dt*Fz;
}
How are sim and sim.part allocated? It's possible to use unstructured data directives in the constructor and destructor to make sure that sim and sim.part are on the device too. If you've already done this, then another possible solution is to add present(sim, sim.part) to your existing present clause so the compiler knows that you've already taken care of those data structures too.
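As a rough illustration of the constructor/destructor approach, here is a minimal sketch. The Particles class, its to_device/to_host helpers and the drift function are hypothetical names made up for this example (the question does not show how sim.part is declared), and only two arrays are shown. Compiled without OpenACC support the pragmas are ignored and the code simply runs on the host.

```cpp
#include <cstddef>

// Hypothetical particle container; the member names mirror the question,
// but the class itself is an assumption made for illustration.
struct Particles {
    std::size_t n;
    float *rx, *vx;

    explicit Particles(std::size_t count)
        : n(count), rx(new float[count]()), vx(new float[count]()) {
        // Put the object itself on the device first, then create the member
        // arrays there so the device copy of the object points at device memory.
        #pragma acc enter data copyin(this)
        #pragma acc enter data create(rx[0:n], vx[0:n])
    }

    ~Particles() {
        // Release in reverse order: arrays first, then the object.
        #pragma acc exit data delete(rx[0:n], vx[0:n])
        #pragma acc exit data delete(this)
        delete[] rx;
        delete[] vx;
    }

    Particles(const Particles&) = delete;
    Particles& operator=(const Particles&) = delete;

    // Explicit transfers, since managed memory is not used.
    void to_device() {
        #pragma acc update device(rx[0:n], vx[0:n])
    }
    void to_host() {
        #pragma acc update self(rx[0:n], vx[0:n])
    }
};

// Position update using local pointer aliases, as in the workaround above.
void drift(Particles& p, float dt) {
    float *rx = p.rx, *vx = p.vx;
    const std::size_t n = p.n;
    #pragma acc parallel loop present(rx[0:n], vx[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        rx[i] += dt * vx[i];
    }
}
```

With this layout the host code fills the arrays, calls to_device() once, runs the kernels, and calls to_host() only when the results are needed on the host.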


Why isn't my 4 thread implementation faster than the single thread one?

I don't know much about multi-threading and I have no idea why this is happening, so I'll just get to the point.
I'm processing an image: I divide it into 4 parts and pass each part to its own thread (in practice, I pass the indices of the first and last pixel rows of each part). For example, if the image has 1000 rows, each thread processes 250 of them. I can go into detail about my implementation and what I'm trying to achieve if that helps. For now, here is the code executed by the threads, in case you can spot why this is happening. I don't know if it's relevant, but in both cases (1 thread or 4 threads) the process takes around 15 ms, and pfUMap and pbUMap are unordered maps.
void jacobiansThread(int start, int end,vector<float> &sJT,vector<float> &sJTJ) {
uchar* rgbPointer;
float* depthPointer;
float* sdfPointer;
float* dfdxPointer; float* dfdyPointer;
float fov = radians(45.0);
float aspect = 4.0 / 3.0;
float focal = 1 / (glm::tan(fov / 2));
float fu = focal * cols / 2 / aspect;
float fv = focal * rows / 2;
float strictFu = focal / aspect;
float strictFv = focal;
vector<float> pixelJacobi(6, 0);
for (int y = start; y <end; y++) {
rgbPointer = sceneImage.ptr<uchar>(y);
depthPointer = depthBuffer.ptr<float>(y);
dfdxPointer = dfdx.ptr<float>(y);
dfdyPointer = dfdy.ptr<float>(y);
sdfPointer = sdf.ptr<float>(y);
for (int x = roiX.x; x <roiX.y; x++) {
float deltaTerm;// = deltaPointer[x];
float raw = sdfPointer[x];
if (raw > 8.0)continue;
float dirac = (1.0f / float(CV_PI)) * (1.2f / (raw * 1.44f * raw + 1.0f));
deltaTerm = dirac;
vec3 rgb(rgbPointer[x * 3], rgbPointer[x * 3+1], rgbPointer[x * 3+2]);
vec3 bin = rgbToBin(rgb, numberOfBins);
int indexOfColor = bin.x * numberOfBins * numberOfBins + bin.y * numberOfBins + bin.z;
float s3 = glfwGetTime();
float pF = pfUMap[indexOfColor];
float pB = pbUMap[indexOfColor];
float heavisideTerm;
heavisideTerm = HEAVISIDE(raw);
float denominator = (heavisideTerm * pF + (1 - heavisideTerm) * pB) + 0.000001;
float commonFirstTerm = -(pF - pB) / denominator * deltaTerm;
if (pF == pB)continue;
vec3 pixel(x, y, depthPointer[x]);
float dfdxTerm = dfdxPointer[x];
float dfdyTerm = -dfdyPointer[x];
if (pixel.z == 1) {
cv::Point c = findClosestContourPoint(cv::Point(x, y), dfdxTerm, -dfdyTerm, abs(raw));
if (c.x == -1)continue;
pixel = vec3(c.x, c.y,<float>(cv::Point(c.x, c.y)));
vec3 point3D = pixel;
pixelToViewFast(point3D, cols, rows, strictFu, strictFv);
float Xc = point3D.x; float Xc2 = Xc * Xc; float Yc = point3D.y; float Yc2 = Yc * Yc; float Zc = point3D.z; float Zc2 = Zc * Zc;
pixelJacobi[0] = dfdyTerm * ((fv * Yc2) / Zc2 + fv) + (dfdxTerm * fu * Xc * Yc) / Zc2;
pixelJacobi[1] = -dfdxTerm * ((fu * Xc2) / Zc2 + fu) - (dfdyTerm * fv * Xc * Yc) / Zc2;
pixelJacobi[2] = -(dfdyTerm * fv * Xc) / Zc + (dfdxTerm * fu * Yc) / Zc;
pixelJacobi[3] = -(dfdxTerm * fu) / Zc;
pixelJacobi[4] = -(dfdyTerm * fv) / Zc;
pixelJacobi[5] = (dfdyTerm * fv * Yc) / Zc2 + (dfdxTerm * fu * Xc) / Zc2;
float weightingTerm = -1.0 / log(denominator);
for (int i = 0; i < 6; i++) {
pixelJacobi[i] *= commonFirstTerm;
sJT[i] += pixelJacobi[i];
for (int i = 0; i < 6; i++) {
for (int j = i; j < 6; j++) {
sJTJ[i * 6 + j] += weightingTerm * pixelJacobi[i] * pixelJacobi[j];
This is the part where I call each thread:
vector<std::thread> myThreads;
float step = (roiY.y - roiY.x) / numberOfThreads;
vector<vector<float>> tsJT(numberOfThreads, vector<float>(6, 0));
vector<vector<float>> tsJTJ(numberOfThreads, vector<float>(36, 0));
for (int i = 0; i < numberOfThreads; i++) {
    int start = roiY.x + i * step;
    int end = start + step;
    if (end > roiY.y) end = roiY.y;
    myThreads.push_back(std::thread(&pwp3dV2::jacobiansThread, this, start, end, std::ref(tsJT[i]), std::ref(tsJTJ[i])));
}
vector<float> sJT(6, 0);
vector<float> sJTJ(36, 0);
for (int i = 0; i < numberOfThreads; i++) myThreads[i].join();
Other Notes
To measure time I used glfwGetTime() before and right after the second code snippet. The measurements vary but the average is about 15ms as I mentioned, for both implementations.
Starting a thread has significant overhead, which might not be worth the time if you have only 15 milliseconds worth of work.
The common solution is to keep threads running in the background and send them data when you need them, instead of calling the std::thread constructor to create a new thread every time you have some work to do.
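A minimal sketch of that idea, assuming a hand-rolled pool is acceptable: the WorkerPool class and its submit/wait members are hypothetical names invented for this example, not part of the question's code or of any library. The threads are created once and then reused for every frame.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool: threads are created once and reused.
class WorkerPool {
public:
    explicit WorkerPool(unsigned count) {
        for (unsigned i = 0; i < count; ++i)
            workers_.emplace_back([this] { run(); });
    }

    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (auto& t : workers_) t.join();
    }

    // Queue one task; it runs on whichever worker becomes free first.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
            ++pending_;
        }
        wake_.notify_one();
    }

    // Block until every task submitted so far has finished.
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // only reached when stopping
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
            {
                std::lock_guard<std::mutex> lock(mutex_);
                --pending_;
            }
            idle_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_, idle_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    unsigned pending_ = 0;
    bool stopping_ = false;
};
```

In the question's setting, each frame would submit one task per row range and then call wait() where the join loop currently sits, so the cost of starting threads is paid once rather than on every frame.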
Pure speculation, but two things might be preventing you from getting the full benefit of parallelization:
Processing speed is limited by the memory bus. Cores will wait until data is loaded before continuing.
Data sharing between cores. Some caches are core specific. If memory is shared between cores, data must traverse down to shared cache before loading.
On Linux you can use Perf to check for cache misses.
If you want better timings, you need to decouple the loop iterations from the loop counter, which requires some preprocessing. Something quick will do, such as building an array of structures with a header for each segment. If you can't think of anything better, you can simply fill a vector<int> with the counter values and run for_each(std::execution::par, ...) over it. That is much faster.
For timing, there is std::chrono:
auto t1 = std::chrono::system_clock::now();
// ... code being measured ...
auto t2 = std::chrono::system_clock::now();
std::chrono::milliseconds f = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1);

CUDA: multiple threads access the same global variable

#define dimG 16
#define dimB 64
// slovebyGPU
__global__ void SloveStepGPU(float* X, float* Y, int * iCons, int* jCons, int * dCons, float* wCons, int cnt, float c)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = id; i<cnt; i += dimG*dimB) {
        int I = iCons[i];
        int J = jCons[i];
        int d = dCons[i];
        float wc = 1.0f*wCons[i]*c;
        if (wc > 1.0)wc = 1.0;
        float XI = atomicAdd(&(X[I]), 0);
        float XJ = atomicAdd(&(X[J]), 0);
        float YI = atomicAdd(&(Y[I]), 0);
        float YJ = atomicAdd(&(Y[J]), 0);
        float pqx = XI - XJ;
        float pqy = YI - YJ;
        float mag = sqrtf(pqx*pqx + pqy*pqy);
        float r = 1.0f*(d - mag) / 2;
        float mx = wc * r * pqx / (mag + eps);
        float my = wc * r * pqy / (mag + eps);
        if (d == 1) {
            atomicAdd(&(X[I]), mx);
            atomicAdd(&(Y[I]), my);
            atomicAdd(&(X[J]), -mx);
            atomicAdd(&(Y[J]), -my);
        }
    }
}
In this code, I know that X and Y may be subject to data races. My earlier assumption was that this is acceptable: the values read into XI, XJ, YI, YJ might simply not be the latest data. However, I found that under a data race, XI, XJ, YI, YJ can end up holding random memory values, that is, a memory access violation. Even if I add a lock around the reads and writes, I still get the same result. Only when I reduce dimB and dimG so that there are almost no data races do I get the correct result. Is there any solution?
I compile for 64-bit on Windows with VS2015 and CUDA 9.1.
However, the same code runs without problems on Linux.
There is also no problem when running under the Nsight CUDA debugger on Windows, probably because running under the debugger is slow and does not trigger the data race.
------- Update -------
(Other code removed.)
The problem was in the if (d == 1) branch. I replaced the if with device functions such as fminf and fmaxf, which solved the problem. My guess is that threads in the same warp diverged at the branch and, combined with the data race, some threads were suspended, which caused the strange behavior.
// Before:
if (d == 1) {
    atomicAdd(&(X[I]), mx);
    atomicAdd(&(Y[I]), my);
}
// After:
float fd = fmaxf(2.0f - d, 0.0f);
X[I] += fd * 1.0f * mx;
Y[I] += fd * 1.0f * my;
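To make the substitution concrete, here is a small host-side C++ sketch (not device code) comparing the weight from fmaxf(2.0f - d, 0.0f) with the weight implied by the original branch. The function names are hypothetical. It suggests the two agree only when d is at least 1; for d == 0 the branchless form gives 2 instead of 0. Note also that the plain += in the updated snippet is not atomic, unlike the atomicAdd calls it replaces.

```cpp
#include <cmath>

// Weight used in the updated kernel: fmaxf(2.0f - d, 0.0f).
inline float branchless_weight(int d) {
    return std::fmax(2.0f - static_cast<float>(d), 0.0f);
}

// Weight implied by the original branch: the update applies only when d == 1.
inline float branch_weight(int d) {
    return d == 1 ? 1.0f : 0.0f;
}
```

If dCons can contain values below 1, the branchless version therefore changes the result, so it may be worth checking the range of d before relying on it.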

Why isn't horizontal advance enough to properly format a glyph?

I am making a simple text renderer with vulkan and I am using freetype to format my text.
I read the freetype tutorial and I have come up with the following function:
void Scribe::CreateSingleLineGeometry(const string &text, ScGeomMetaInfo &info,
    ScGeomtry &geometry)
{
    float texture_length = info.texture_len;
    float h_offset = info.h_offset;
    float char_l = info.char_len;
    float v_anchor = info.v_offset;
    auto &vertices = geometry.first;
    auto &indices = geometry.second;
    auto length = max(ft_face->bbox.xMax - ft_face->bbox.xMin,
        ft_face->bbox.yMax - ft_face->bbox.yMin);
    for(auto &c: text)
    {
        auto glyph_index = FT_Get_Char_Index(ft_face, c);
        auto error = FT_Load_Glyph(ft_face, glyph_index, FT_LOAD_NO_HINTING);
        auto metrics = ft_face->glyph->metrics;
        float g_height = metrics.height;
        float g_bearing = metrics.horiBearingY;
        float correction = 1.0 - (g_height) / float(length) -
            float(metrics.horiBearingY - g_height) / float(length);
        correction *= char_l;
        float bearingX = char_l * float(metrics.horiBearingX) / float(length);
        // Insert the vertex positions and uvs into the returned geometry
        float h_coords[] = {h_offset + bearingX, h_offset + bearingX + char_l};
        float v_coords[] = {v_anchor + correction, v_anchor + correction + char_l};
        auto glyph_data = glyph_map[glyph_index];
        float tex_h_coords[] = {glyph_data.lt_uv.x, glyph_data.rb_uv.x};
        float tex_v_coords[] = {glyph_data.lt_uv.y, glyph_data.rb_uv.y};
        for(int x=0; x<2; x++) {
            for(int y=0; y<2; y++) {
                vertices.insert(end(vertices), {h_coords[x], v_coords[y], 0});
                vertices.insert(end(vertices), {tex_h_coords[x], tex_v_coords[y]});
            }
        }
        // Setup the indices of the current quad
        // There's 4 vertices in a quad, so we offset each index by (quad_size * num_quads)
        uint delta = 4 * info.c_num++;
        indices.insert(end(indices), {0+delta, 3+delta, 1+delta, 0+delta, 3+delta, 2+delta});
        h_offset += char_l * float(metrics.horiAdvance) / float(length) + 0.03;
    }
}
In particular I want to emphasize the line:
h_offset += char_l * float(metrics.horiAdvance) / float(length) + 0.03;
That 0.03 at the end of the line doesn't come from anywhere; I added it only to make things look good.
This is the result with that extra offset:
Which I think looks pretty good. However, if I were to remove the extra offset:
h_offset += char_l * float(metrics.horiAdvance) / float(length);
I get something that doesn't look right at all. Why isn't the advance enough to correctly format the font?

Trying to convert a for loop to SIMD in C++

For a game I'm trying to make faster, I have a piece of code that makes sure tanks evade each other by pushing them apart with a force.
for ( unsigned int i = 0; i < (MAXP1 + MAXP2); i++ )
{
    if (game->m_Tank[i] == this) continue;
    float2 d = pos - game->m_Tank[i]->pos;
    float dlsq = (d.x * d.x + d.y * d.y);
    if (dlsq < 64) force += normalize(d) * 2.0f;
    else if (dlsq < 256) force += normalize(d) * 0.4f;
}
Now I'm trying to process 4 tanks at a time using SIMD, so I have:
for (unsigned int i = 0; i < (MAXP1 + MAXP2) / 4; i++)
{
    union {
        __m128 valuesx4;
        float valuesx[4];
    };
    union {
        __m128 valuesy4;
        float valuesy[4];
    };
    __m128 dx4 = _mm_set_ps((pos - game->m_Tank[i*4]->pos).x, (pos - game->m_Tank[i*4+1]->pos).x, (pos - game->m_Tank[i*4+2]->pos).x, (pos - game->m_Tank[i*4+3]->pos).x);
    __m128 dy4 = _mm_set_ps((pos - game->m_Tank[i*4]->pos).y, (pos - game->m_Tank[i*4+1]->pos).y, (pos - game->m_Tank[i*4+2]->pos).y, (pos - game->m_Tank[i*4+3]->pos).y);
    __m128 dlsq4 = _mm_add_ps(_mm_mul_ps(dx4, dx4), _mm_mul_ps(dy4, dy4));
    __m128 sixtyfour4 = _mm_set_ps1(64.0f);
    __m128 twofiftysix4 = _mm_set_ps1(256.0f);
    __m128 mask0 = _mm_cmpeq_ps(dlsq4,_mm_set_ps1(0.f));
    __m128 adjdlsq4 = _mm_add_ps(dlsq4, _mm_and_ps(mask0, _mm_set_ps1(300.f)));
    __m128 mask1 = _mm_cmplt_ps(adjdlsq4, sixtyfour4);
    __m128 mask2 = _mm_and_ps(_mm_cmplt_ps(adjdlsq4, twofiftysix4), _mm_cmpge_ps(adjdlsq4,sixtyfour4));
    __m128 multiplier4 = _mm_add_ps(_mm_and_ps(mask1, _mm_set_ps1(2.0f)), _mm_and_ps(mask2, _mm_set_ps1(0.4f)));
    __m128 rsqrt4 = _mm_rsqrt_ps(adjdlsq4);
    __m128 values4 = _mm_mul_ps(rsqrt4, multiplier4);
    valuesx4 = _mm_mul_ps(values4, dx4);
    valuesy4 = _mm_mul_ps(values4, dy4);
    force += (valuesx[0], valuesy[0]);
    force += (valuesx[1], valuesy[1]);
    force += (valuesx[2], valuesy[2]);
    force += (valuesx[3], valuesy[3]);
}
I adjust dlsq to account for the case where the checked tank is the current tank. However, the behavior differs from the scalar version, where the tanks gather around the target and stay roughly in a filled circle around it: in the SIMD version, the tanks start following a sloped line and intersect each other. They all move along a line with direction (1,-1) from the target. Does anyone see where it goes wrong?
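One thing in the SIMD snippet that may be worth checking is the expression force += (valuesx[0], valuesy[0]). In C++, a parenthesized pair like (a, b) is the comma operator and evaluates to b alone, not to a two-component vector. The sketch below uses a hypothetical Vec2 type as a stand-in for the game's float2, assuming (without having seen its definition) that it offers both a vector and a scalar operator+=. Under that assumption both components receive the same value, which would push everything along a diagonal.

```cpp
// Hypothetical stand-in for the game's float2; the real type may differ.
struct Vec2 {
    float x, y;
    Vec2(float x_, float y_) : x(x_), y(y_) {}
    Vec2& operator+=(const Vec2& o) { x += o.x; y += o.y; return *this; }
    // Scalar overload: adds the same value to both components.
    Vec2& operator+=(float s) { x += s; y += s; return *this; }
};

// Mirrors the pattern in the question: the comma operator discards vx,
// so only vy is added, and it is added to both components.
Vec2 accumulate_with_comma(float vx, float vy) {
    Vec2 force(0.f, 0.f);
    force += (vx, vy);
    return force;
}

// Constructs the vector explicitly, so each component gets its own value.
Vec2 accumulate_with_vector(float vx, float vy) {
    Vec2 force(0.f, 0.f);
    force += Vec2(vx, vy);
    return force;
}
```

If float2 behaves like this stand-in, writing force += float2(valuesx[0], valuesy[0]) for each lane would restore the per-component accumulation of the scalar loop.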

Second iteration crash - order irrelevant

To save on global memory transfers, and because all of the steps of the code work individually, I have tried to combine all of the kernels into a single kernel, with the first 2 (of 3) steps being done as device calls rather than global calls.
This is failing in the second half of the first step.
There is a function that I need to call twice, to calculate the 2 halves of an image. Regardless of the order the image is calculated in, it crashes on the second iteration.
After examining the code as well as I could, and running it multiple times with different return points, I have found what makes it crash.
void IntersectCone( float* ModDistance,
float* ModIntensity,
float3 ray,
int threadID,
modParam param )
bool ignore = false;
float3 normal = make_float3(0.0f,0.0f,0.0f);
float3 result = make_float3(0.0f,0.0f,0.0f);
float normDist = 0.0f;
float intensity = 0.0f;
float check = abs( Dot(param.position, Cross(param.direction,ray) ) );
if(check > param.r1 && check > param.r2)
ignore = true;
float tran = param.length / (param.r2/param.r1 - 1);
float length = tran + param.length;
float Lsq = length * length;
float cosSqr = Lsq / (Lsq + param.r2 * param.r2);
//Changes the centre position?
float3 position = param.position - tran * param.direction;
float aDd = Dot(param.direction, ray);
float3 e = position * -1.0f;
float aDe = Dot(param.direction, e);
float dDe = Dot(ray, e);
float eDe = Dot(e, e);
float c2 = aDd * aDd - cosSqr;
float c1 = aDd * aDe - cosSqr * dDe;
float c0 = aDe * aDe - cosSqr * eDe;
float discr = c1 * c1 - c0 * c2;
if(discr <= 0.0f)
ignore = true;
float root = sqrt(discr);
float sign;
if(c1 > 0.0f)
sign = 1.0f;
else
sign = -1.0f;
//Try opposite sign....?
float3 result = (-c1 + sign * root) * ray / c2;
e = result - position;
float dot = Dot(e, param.direction);
float3 s1 = Cross(e, param.direction);
float3 normal = Cross(e, s1);
if( (dot > tran) || (dot < length) )
if(Dot(normal,ray) <= 0)
normal = Norm(normal); //This stuff (1)
normDist = Magnitude(result);
intensity = -IntensAt1m * Dot(ray, normal) / (normDist * normDist);
ModDistance[threadID] = normDist; //and this stuff (2)
ModIntensity[threadID] = intensity;
There are two things I can do to make this not crash, both of which defeat the point of the function: not writing to ModDistance[] and ModIntensity[], or not writing to normDist and intensity.
First-chance exceptions are thrown by the code above, but not if either of the blocks is commented out.
Also, the program only crashes the second time this routine is called.
I have been trying to figure this out all day; any help would be fantastic.
The code that calls it is:
int subrow = threadIdx.y + Mod_Height/2;
int threadID = subrow * (Mod_Width+1) + threadIdx.x;
int obsY = windowY + subrow;
float3 ray = CalculateRay(obsX,obsY);
if( !IntersectSphere(ModDistance, ModIntensity, ray, threadID, param) )
IntersectCone(ModDistance, ModIntensity, ray, threadID, param);
subrow = threadIdx.y;
threadID = subrow * (Mod_Width+1) + threadIdx.x;
obsY = windowY + subrow;
ray = CalculateRay(obsX,obsY);
if( !IntersectSphere(ModDistance, ModIntensity, ray, threadID, param) )
IntersectCone(ModDistance, ModIntensity, ray, threadID, param);
The kernel is running out of resources. As posted in the comments, it was giving the error cudaErrorLaunchOutOfResources.
To avoid this, you should use a __launch_bounds__ specifier to declare the maximum block size you intend to launch the kernel with. This tells the compiler to limit per-thread resource usage (registers in particular) so that a block of that size can actually be launched. See the CUDA programming guide for details on __launch_bounds__.