I'm programming in OpenCL using the C++ bindings. I have a problem where, on NVIDIA hardware, my OpenCL code spontaneously produces very large numbers and then (on the next run) a "1.#QNaN". My code is essentially a simple physics simulation using the equation x = vt + .5at^2. The only odd thing I've noticed is that the particle velocities suddenly shoot up to about 6e+34, which I'm guessing is close to the maximum floating-point value on that machine. However, the velocities/forces before that point are quite small, often with values less than 1.
The specific GPU I'm using is a Tesla M2050 with the latest drivers. I prototype on my laptop using my AMD Fusion CPU as the platform (it does not have a dedicated GPU), and the problem does not occur there. I am not sure whether this is an NVIDIA driver problem, a problem with my computation, or something else entirely.
Here is the kernel code (Note: I'm reasonably sure mass is always nonzero):
__kernel void update_atom(__global float4 *pos, __global float4 *vel, __global float4 *force,
                          __global const float *mass, __global const float *radius, const float timestep, const int wall)
{
    // Get the index of the current element to be processed
    int i = get_global_id(0);
    float constMult;
    float4 accel;
    float4 part;

    // Update the position, velocity and force
    accel = (float4)(force[i].x/mass[i],
                     force[i].y/mass[i],
                     force[i].z/mass[i],
                     0.0f);
    constMult = .5*timestep*timestep;
    part = (float4)(constMult*accel.x,
                    constMult*accel.y,
                    constMult*accel.z, 0.0f);
    pos[i] = (float4)(pos[i].x + vel[i].x*timestep + part.x,
                      pos[i].y + vel[i].y*timestep + part.y,
                      pos[i].z + vel[i].z*timestep + part.z,
                      0.0f);
    vel[i] = (float4)(vel[i].x + accel.x*timestep,
                      vel[i].y + accel.y*timestep,
                      vel[i].z + accel.z*timestep,
                      0.0f);
    force[i] = (float4)(force[i].x,
                        force[i].y,
                        force[i].z,
                        0.0f);

    // Do reflections off the wall
    // http://www.3dkingdoms.com/weekly/weekly.php?a=2
    float4 norm;
    float bouncePos = wall - radius[i];
    float bounceNeg = radius[i] - wall;
    norm = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
    if(pos[i].x >= bouncePos)
    {
        // Normal is unit YZ vector
        pos[i].x = bouncePos;
        norm.x = 1.0f;
    }
    else if(pos[i].x <= bounceNeg)
    {
        pos[i].x = bounceNeg;
        norm.x = -1.0f;
    }
    if(pos[i].y >= bouncePos)
    {
        // Normal is unit XZ vector
        pos[i].y = bouncePos;
        norm.y = 1.0f;
    }
    else if(pos[i].y <= bounceNeg)
    {
        // etc
        pos[i].y = bounceNeg;
        norm.y = -1.0f;
    }
    if(pos[i].z >= bouncePos)
    {
        pos[i].z = bouncePos;
        norm.z = 1.0f;
    }
    else if(pos[i].z <= bounceNeg)
    {
        pos[i].z = bounceNeg;
        norm.z = -1.0f;
    }
    float dot = 2 * (vel[i].x * norm.x + vel[i].y * norm.y + vel[i].z * norm.z);
    vel[i].x = vel[i].x - dot * norm.x;
    vel[i].y = vel[i].y - dot * norm.y;
    vel[i].z = vel[i].z - dot * norm.z;
}
And here's how I load the data for the kernel. PutData just push_backs the positions, velocities and forces of the atoms into the corresponding vectors, and kernel is a wrapper class I wrote around the OpenCL libraries (you can trust that I put the right parameters in the right places for the enqueue functions).
void LoadAtoms(int kernelNum, bool blocking)
{
    std::vector<cl_float4> atomPos;
    std::vector<cl_float4> atomVel;
    std::vector<cl_float4> atomForce;
    for(int i = 0; i < numParticles; i++)
        atomList[i].PutData(atomPos, atomVel, atomForce);
    kernel.EnqueueWriteBuffer(kernelNum, posBuf, blocking, 0, numParticles*sizeof(cl_float4), &atomPos[0]);
    kernel.EnqueueWriteBuffer(kernelNum, velBuf, blocking, 0, numParticles*sizeof(cl_float4), &atomVel[0]);
    kernel.EnqueueWriteBuffer(kernelNum, forceBuf, blocking, 0, numParticles*sizeof(cl_float4), &atomForce[0]);
}
void LoadAtomTypes(int kernelNum, bool blocking)
{
    std::vector<cl_float> mass;
    std::vector<cl_float> radius;
    int type;
    for(int i = 0; i < numParticles; i++)
    {
        type = atomList[i].GetType();
        mass.push_back(atomTypes[type].mass);
        radius.push_back(atomTypes[type].radius);
    }
    kernel.EnqueueWriteBuffer(kernelNum, massBuf, blocking, 0, numParticles*sizeof(cl_float), &mass[0]);
    kernel.EnqueueWriteBuffer(kernelNum, radiusBuf, blocking, 0, numParticles*sizeof(cl_float), &radius[0]);
}
There is more to my code, as always, but this is what's related to the kernel.
I saw this question, which is similar, but I use cl_float4 everywhere I can so I don't believe it's an issue of alignment. There aren't really any other related questions.
I realize this probably isn't a simple question, but I've run out of ideas until we can get new hardware in the office to test on. Can anyone help me out?
Since no one answered, I suppose I'll just contribute what I've learned so far.
I don't have a definitive conclusion, but at the very least I figured out why it was happening so often. Since I'm running this (and other similar kernels) to get a rough timing estimate, I was clearing the lists, resizing them and then re-running the calculations. What I wasn't doing, however, was resizing the buffers, so the threads ended up pulling in undefined values.
However, this doesn't explain the QNaNs I get from running the program over long periods of time; those simply appear spontaneously. Maybe it's a similar issue that I'm overlooking, but I can't say. If anyone has further input on the issue, it'd be appreciated.
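For completeness, here is a minimal sketch of what recreating the buffers on a resize might look like with the C++ bindings. The names (context, clKernel, and treating posBuf etc. as cl::Buffer members) are assumptions for illustration, not the actual wrapper class used above.
// Hypothetical sketch: recreate the device buffers whenever numParticles changes,
// so the kernel never reads past the end of an old, smaller buffer.
void ResizeBuffers(cl::Context &context, cl::Kernel &clKernel, int numParticles)
{
    size_t vecBytes    = numParticles * sizeof(cl_float4);
    size_t scalarBytes = numParticles * sizeof(cl_float);
    posBuf    = cl::Buffer(context, CL_MEM_READ_WRITE, vecBytes);
    velBuf    = cl::Buffer(context, CL_MEM_READ_WRITE, vecBytes);
    forceBuf  = cl::Buffer(context, CL_MEM_READ_WRITE, vecBytes);
    massBuf   = cl::Buffer(context, CL_MEM_READ_ONLY,  scalarBytes);
    radiusBuf = cl::Buffer(context, CL_MEM_READ_ONLY,  scalarBytes);
    // Re-bind the new buffers to the kernel arguments (indices follow the
    // update_atom signature above).
    clKernel.setArg(0, posBuf);
    clKernel.setArg(1, velBuf);
    clKernel.setArg(2, forceBuf);
    clKernel.setArg(3, massBuf);
    clKernel.setArg(4, radiusBuf);
}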
I'm new to OpenCL. I wrote an OpenCL software rasterizer to rasterize triangles. Right now a single cube takes 32 seconds, which is far too long; I'm testing on an NVIDIA RTX 3080 Laptop GPU.
The result is very strange and far too slow.
Here is the kernel:
__kernel void fragment_shader(__global struct Fragment* fragments, __global struct Triangle_* triangles, int triCount)
{
    size_t px = get_global_id(0); // triCount
    //size_t py = get_global_id(1); // triCount
    int imageWidth = 256;
    int imageHeight = 256;
    if(px < triCount)
    {
        float3 v0Raster = (float3)(triangles[px].v[0].pos[0], triangles[px].v[0].pos[1], triangles[px].v[0].pos[2]);
        float3 v1Raster = (float3)(triangles[px].v[1].pos[0], triangles[px].v[1].pos[1], triangles[px].v[1].pos[2]);
        float3 v2Raster = (float3)(triangles[px].v[2].pos[0], triangles[px].v[2].pos[1], triangles[px].v[2].pos[2]);
        float xmin = min3(v0Raster.x, v1Raster.x, v2Raster.x);
        float ymin = min3(v0Raster.y, v1Raster.y, v2Raster.y);
        float xmax = max3(v0Raster.x, v1Raster.x, v2Raster.x);
        float ymax = max3(v0Raster.y, v1Raster.y, v2Raster.y);
        float slope = (ymax - ymin) / (xmax - xmin);
        // be careful xmin/xmax/ymin/ymax can be negative. Don't cast to uint32_t
        unsigned int x0 = max((uint)0, (uint)(floor(xmin)));
        unsigned int x1 = min((uint)(imageWidth) - 1, (uint)(floor(xmax)));
        unsigned int y0 = max((uint)0, (uint)(floor(ymin)));
        unsigned int y1 = min((uint)(imageHeight) - 1, (uint)(floor(ymax)));
        float3 v0 = v0Raster;
        float3 v1 = v1Raster;
        float3 v2 = v2Raster;
        float area = edgeFunction(v0Raster, v1Raster, v2Raster);
        for (unsigned int y = y0; y <= y1; ++y) {
            for (unsigned int x = x0; x <= x1; ++x) {
                float3 p = { x + 0.5f, y + 0.5f, 0 };
                float w0 = edgeFunction(v1Raster, v2Raster, p);
                float w1 = edgeFunction(v2Raster, v0Raster, p);
                float w2 = edgeFunction(v0Raster, v1Raster, p);
                if (w0 >= 0 && w1 >= 0 && w2 >= 0) {
                    fragments[y * 256 + x].col[0] = 1.0f;
                    fragments[y * 256 + x].col[1] = 0;
                    fragments[y * 256 + x].col[2] = 0;
                }
            }
        }
    }
}
The kernel is supposed to run once per triangle; it does the bounding-box test and rasterizes the covered pixels.
Here is how I invoke it:
global_size[0] = triCount-1;
auto time_start = std::chrono::high_resolution_clock::now();
err = clEnqueueNDRangeKernel(commandQueue, kernel_fragmentShader, 1, NULL, global_size,
NULL, 0, NULL, NULL);
if (err < 0) {
perror("Couldn't enqueue the kernel_fragmentShader");
exit(1);
}
I tried omitting lighting and everything else, but it still takes around 20 seconds to render a cube.
This kind of approach is well suited for massively parallel rendering, like on a GPU. I assume you are doing this on the CPU side, so the performance is poor: you have little or no parallelization and little or no HW support for the operations used. On a GPU you get SIMD instructions for most of the stuff needed, and a lot of it is done in HW instead of in code.
To gain speed on the CPU side, see how to rasterize rotated rectangle; this was the standard way of SW rendering back in the days before GPUs. The method simply renders the edges of a convex polygon (or triangle) as lines into 2 buffers (start/end points per horizontal line) and then just fills or interpolates the horizontal lines (a minimal sketch follows below). This uses far fewer operations per pixel.
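For illustration only, a CPU-side sketch of that two-buffer edge/fill idea. All names are hypothetical and a fixed 256x256 RGB framebuffer is assumed; it shows the technique, it is not drop-in code for the question above.
#include <algorithm>
#include <cmath>

const int WIDTH = 256, HEIGHT = 256;

// Per-scanline span buffers: leftmost and rightmost x touched by the edges.
static int spanMin[HEIGHT], spanMax[HEIGHT];

// Rasterize one edge into the span buffers (simple interpolation along y).
static void addEdge(float x0, float y0, float x1, float y1)
{
    if (y0 > y1) { std::swap(x0, x1); std::swap(y0, y1); }
    int ys = std::max(0, (int)std::ceil(y0));
    int ye = std::min(HEIGHT - 1, (int)std::floor(y1));
    for (int y = ys; y <= ye; ++y)
    {
        float t = (y1 == y0) ? 0.0f : (y - y0) / (y1 - y0);
        int x = (int)std::lround(x0 + t * (x1 - x0));
        spanMin[y] = std::min(spanMin[y], x);
        spanMax[y] = std::max(spanMax[y], x);
    }
}

// Fill a triangle by rasterizing its three edges and then filling the spans.
void fillTriangle(float ax, float ay, float bx, float by, float cx, float cy,
                  unsigned char* rgb /* WIDTH*HEIGHT*3 bytes */)
{
    for (int y = 0; y < HEIGHT; ++y) { spanMin[y] = WIDTH; spanMax[y] = -1; }
    addEdge(ax, ay, bx, by);
    addEdge(bx, by, cx, cy);
    addEdge(cx, cy, ax, ay);
    for (int y = 0; y < HEIGHT; ++y)
        for (int x = std::max(0, spanMin[y]); x <= std::min(WIDTH - 1, spanMax[y]); ++x)
        {
            unsigned char* p = rgb + (y * WIDTH + x) * 3;
            p[0] = 255; p[1] = 0; p[2] = 0;   // each covered pixel is touched exactly once
        }
}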
Your method computes a point-inside-triangle test for each pixel of the BBOX, which means many more pixels are processed, and each pixel needs too many complicated operations, which kills performance.
On top of this, your code is not optimized; for example:
fragments[y * 256 + x].col[0] = 1.0f;
fragments[y * 256 + x].col[1] = 0;
fragments[y * 256 + x].col[2] = 0;
Why are you computing y * 256 + x three times? Also, I would feel better with (y<<8)+x, but nowadays compilers should do that for you. You can also just add 256 to the starting address instead of multiplying (a hoisted version is sketched below).
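For example, the write could be hoisted like this (a sketch against the question's kernel; it assumes the Fragment struct shown there):
// compute the index once and keep a pointer; the compiler may already do this,
// but it costs nothing to be explicit
__global struct Fragment* frag = &fragments[y * 256 + x];
frag->col[0] = 1.0f;
frag->col[1] = 0.0f;
frag->col[2] = 0.0f;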
I do not code in OpenCL (IIRC it's aimed at computer vision and DIP, not rendering), so I hope you have direct access to fragments[] and not something constrained by additional tests, which kills performance a lot (similar to putpixel, setpixel, pixel[][], etc. in most gfx APIs, which can hurt performance by up to 10000x).
I have been struggling with this problem for over a month, so I really need your help.
To further elaborate on the question:
The question is whether a vector called 'direction' that starts at a vertex called 'start' passes through the 'target'.
You need to confirm both the direction and the distance.
After enough debugging, I decided that using the dot product alone was not workable.
The result is good when calculated directly, so why is the result different when executed in the shader?
The line should be drawn with the same thickness regardless of distance, so why does it become thin when the distance is far?
Do you have any good ideas even if it's not the way I use the rotation matrix?
These are three questions.
First of all, my situation is:
drawing an fSQ (full-screen quad);
I want to check whether the direction from start passes through the target;
the computation is done in the pixel shader;
1 unit is one pixel;
the screen size is 1920*1080.
bool intersect(float2 target, float2 direction, float2 start) {
    bool intersecting = false;
    static const float thresholdX = 0.5 / SCREENWIDTH;
    static const float thresholdY = 0.5 / SCREENHEIGHT;
    if (direction.x == 0 && direction.y == 0);
    else {
        float2 startToTarget = target - start;
        float changedTargetPositionX = startToTarget.x * direction.x + startToTarget.y * direction.y;
        float changedTargetPositionY = startToTarget.x * (-direction.y) + startToTarget.y * direction.x;
        float rangeOfX = (direction.x * direction.x) + (direction.y * direction.y);
        if (changedTargetPositionX <= rangeOfX + thresholdX && changedTargetPositionX >= -thresholdX &&
            changedTargetPositionY <= thresholdY && changedTargetPositionY >= -thresholdY) {
            intersecting = true;
        }
    }
    return intersecting;
}
We use a rotation matrix to rotate a vector and then check the difference between the two vectors. This works in most cases, but it fails when the pixel difference is very small.
For example:
start = (15,0), direction = (10,0), target = (10,0)
In this case, the intersect function should return false, but it returns true.
But if the pixel difference is bigger, it works fine.
and
#define MAX 5
float2 points[MAX*MAX];
for (float fi = 1; fi < MAX; fi++)
    for (float fj = 1; fj < MAX; fj++)
        points[(int)(fi * MAX + fj)] = float2(fi / MAX, fj / MAX);
for (uint ni = 0; ni < MAX*MAX; ni++)
    for (uint nj = 3; nj < MAX*MAX; nj++)
        if (intersect(uv, points[nj] - points[ni], points[ni])) {
            color = float4(1, 0, 0, 1);
            return color;
        }
return float4(0, 0, 0, 1);
When debugging like this, the line becomes thinner the farther away it is.
All the lines should have the same thickness, but I don't know why they don't.
This is the result of running the debugging code:
I look forward to your reply.
Thank you.
I am writing a ray tracer and I implemented a perspective camera, using the current pixel's x and y values to calculate the direction of each ray to be launched.
Here is the code piece:
float fov = 60;
float invWidth = 1/float(image.getWidth());
float invHeight = 1/float(image.getHeight());
float angle = (fov * M_PI * 0.5/180 );
float aspectratio = image.getWidth()/float(image.getHeight());
point camera = scene.getCamera();
for (int y=0; y<image.getHeight(); y++) {
    for (int x=0; x<image.getWidth(); x++) {
        ......
        ......
        float xx = (((x*invWidth) *2)-1) * angle * aspectratio;
        float yy = (((y*invHeight)*2)-1) * angle;
        Ray viewRay = { {camera.x, camera.y, camera.z}, {xx, yy, 1.0f}};
So far, so good; it works perfectly. However, I realized that the values of xx and yy (the per-pixel ray directions) don't need to be calculated for every pixel, only a number of times equal to the width and the height of the image.
So I rewrote that part this way:
float fov = 60;
float invWidth = 1/float(image.getWidth());
float invHeight = 1/float(image.getHeight());
float angle = (fov * M_PI * 0.5/180 );
float aspectratio = image.getWidth()/float(image.getHeight());
float rays_x [image.getWidth()], rays_y [image.getHeight()];
for (int y=0; y<image.getHeight(); y++)
    rays_y [y] = (((y*invHeight)*2)-1) * angle;
for (int x=0; x<image.getWidth(); x++)
    rays_x [x] = (((x*invWidth) *2)-1) * angle * aspectratio;
point camera = scene.getCamera();
for (int y=0; y<image.getHeight(); y++) {
    float yy = rays_y[y];
    for (int x=0; x<image.getWidth(); x++) {
        ......
        ......
        Ray viewRay = { {camera.x, camera.y, camera.z}, {rays_x[x], yy, 1.0f}};
I basically precomputed the directions of the rays and stored them in arrays. I expected a small improvement in performance, maybe nothing in a pessimistic case, but I never expected it to get WORSE.
Before, it took 1.67 s to render a scene and now it takes 1.74 s! Not a massive drop, but surprising, seeing that I expected to be doing a lot less work now. I also disabled compiler optimizations (-O3 and -ffast-math) and tested the two approaches: before it took between 9.03 and 9.05 s, and now it takes between 9.06 and 9.15 s.
So how should I investigate this? The only thing that crossed my mind is extra cache misses due to accessing rays_x[x] every loop iteration and rays_y[y] every 1024 iterations, although I wouldn't really suspect that, because the tables are only 1024*4 + 768*4 = 7168 bytes in total. Any ideas will be appreciated.
The compiler will realize that this one:
float yy = (((y*invHeight)*2)-1) * angle;
is constant within the inner loop, and only needs to be calculated once per outer iteration.
Therefore, your precomputed yy is a waste of effort.
Precomputing xx could help, though; but since the expression mostly combines constant data (i.e. invWidth * 2 and angle * aspectratio), the performance may not increase and may even get worse due to cache misses.
float xx = (((x*invWidth) *2)-1) * angle * aspectratio;
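For instance, the constant parts of that expression can be folded into a single scale and offset so the per-pixel work is one multiply-add (a sketch using the question's variable names):
// Same value as (((x*invWidth) * 2) - 1) * angle * aspectratio,
// but with the constants folded into one scale and one offset.
inline float rayX(int x, float invWidth, float angle, float aspectratio)
{
    const float scale  = 2.0f * invWidth * angle * aspectratio;
    const float offset = -angle * aspectratio;
    return x * scale + offset;
}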
Precomputing the directions will accelerate your tracer, but there is obviously an overhead to creating the lookup table in the first place. In your code you are creating the tables on the stack and recomputing the directions for every frame. This will be slightly slower, because you now have to read from the arrays (which you previously didn't) and because of the memory allocation overhead. Instead, I would suggest you create your lookup arrays on the heap (as pointers outside the method) and precompute the directions only once.
The directions depend only on values that don't change between frames, so there is no need to compute them every frame.
Something like this:
float *rays_x, *rays_y;

void compute_directions()
{
    rays_x = new float[image.getWidth()];
    rays_y = new float[image.getHeight()];
    for (int y=0; y<image.getHeight(); y++)
        rays_y[y] = (((y*invHeight)*2)-1) * angle;
    for (int x=0; x<image.getWidth(); x++)
        rays_x[x] = (((x*invWidth) *2)-1) * angle * aspectratio;
}

void render()
{
    float fov = 60;
    float invWidth = 1/float(image.getWidth());
    float invHeight = 1/float(image.getHeight());
    float angle = (fov * M_PI * 0.5/180 );
    float aspectratio = image.getWidth()/float(image.getHeight());
    point camera = scene.getCamera();
    for (int y=0; y<image.getHeight(); y++) {
        float yy = rays_y[y];
        for (int x=0; x<image.getWidth(); x++) {
            ......
            ......
You obviously have to move angle and aspectratio somewhere else so you can access them in compute_directions.
Also remember to delete your pointers with delete[] when you don't need them anymore, to prevent a memory leak.
Judging by your description, it seems you optimized on a hunch, precomputing some values that are already computed very quickly (and shifting the computation to memory lookups that might not yield any performance improvement; also a hunch!).
Some basic rules on optimization:
Before trying to optimize anything: profile.
After optimizing anything: profile.
You cannot expect any performance gain from optimizing before knowing where your program actually spends time.
On Linux you can use GCC's -pg switch and gprof. You can also use perf and valgrind (e.g. callgrind, to get an idea of the number of calls to a particular function).
Also check out the perf wiki.
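If you just want a quick, coarse measurement of a single section before reaching for gprof/perf, std::chrono is enough (a sketch; it is no substitute for a real profiler):
#include <chrono>
#include <cstdio>

// Times one call of an arbitrary callable and returns milliseconds.
template <typename F>
double time_ms(F&& f)
{
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// usage, assuming a render() function like the one in the other answer:
// std::printf("render took %.2f ms\n", time_ms([] { render(); }));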
I will try to explain my problem as clearly as possible. I have a multithreading framework I have to work with: a path-tracing renderer. It gives me an error when I try to store some information produced by my threads. To avoid posting all the code, I will explain what I mean step by step:
my TileTracer class is a thread
class TileTracer : public Thread{
    ...
};
and I have a certain number of threads:
#define MAXTHREADS 32
TileTracer* worker[MAXTHREADS];
the number of working threads is set in the following initialization code, where the threads are also started:
void Renderer::Init(){
    accumulator = (vec3*)MALLOC64(sizeof(vec3)* SCRWIDTH * SCRHEIGHT);
    memset(accumulator, 0, SCRWIDTH * SCRHEIGHT * sizeof(vec3));
    SYSTEM_INFO systeminfo;
    GetSystemInfo(&systeminfo);
    int cores = systeminfo.dwNumberOfProcessors;
    workerCount = MIN(MAXTHREADS, cores);
    for (int i = 0; i < workerCount; i++)
    {
        goSignal[i] = CreateEvent(NULL, FALSE, FALSE, 0);
        doneSignal[i] = CreateEvent(NULL, FALSE, FALSE, 0);
    }
    // create and start worker threads
    for (int i = 0; i < workerCount; i++)
    {
        worker[i] = new TileTracer();
        worker[i]->init(accumulator, i);
        worker[i]->start(); //start the thread
    }
    samples = 0;
}
the init() method for my thread is simply defined in my header as the following:
void init(vec3* target, int idx) { accumulator = target, threadIdx = idx; }
while the start() is:
void Thread::start()
{
DWORD tid = 0;
m_hThread = (unsigned long*)CreateThread( NULL, 0, (LPTHREAD_START_ROUTINE)sthread_proc, (Thread*)this, 0, &tid );
setPriority( Thread::P_NORMAL );
}
somehow (I don't know exactly where), each thread calls the following main method, which is meant to compute the color of a pixel (you don't have to understand all of it):
vec3 TileTracer::Sample(vec3 O, vec3 D, int depth){
    vec3 color(0, 0, 0);
    // trace path extension ray
    float t = 1000.0f, u, v;
    Triangle* tri = 0;
    Scene::mbvh->pool4[0].TraceEmbree(O, D, t, u, v, tri, false);
    totalRays++;
    // handle intersection, if any
    if (tri)
    {
        // determine material color at intersection point
        Material* mat = Scene::matList[tri->material];
        Texture* tex = mat->GetTexture();
        vec3 diffuse;
        if (tex)
        {
            ...
        }
        else diffuse = mat->GetColor();
        vec3 I = O + t * D; // we get exactly to the intersection point on the object
        // we need to store the info of each bounce of the basePath for the offsetPaths
        basePath baseInfo = { O, D, I, tri };
        basePathHits.push_back(baseInfo);
        vec3 L = vec3(-1 + Rand(2.0f), 20, 9 + Rand(2.0f)) - I; // (-1,20,9) is the hard-coded light position; I add Rand(2.0f) on the X and Z axes
        // so that I have an area light instead of a point light
        float dist = length(L) * 0.99f; // if I cast a ray towards the light source I don't want to hit the source point or the light source itself,
        // otherwise it counts as a shadow even when there is none, so I make the ray a bit shorter by multiplying it by 0.99
        L = normalize(L);
        float ndotl = dot(tri->N, L);
        if (ndotl > 0)
        {
            Triangle* tri = 0;
            totalRays++;
            Scene::mbvh->pool4[0].TraceEmbree(I + L * EPSILON, L, dist, u, v, tri, true); // it just calculates the distance by throwing a ray;
            // I am just interested in whether I hit something or not.
            // if I don't hit anything I calculate the light transport (diffuse * ndotl * lightBrightness * 1/dist^2)
            if (!tri) color += diffuse * ndotl * vec3(1000.0f, 1000.0f, 850.0f) * (1.0f / (dist * dist));
        }
        // continue the random walk since it is a path tracer (only if we have fewer than 20 bounces)
        if (depth < 20)
        {
            // russian roulette
            float Psurvival = CLAMP((diffuse.r + diffuse.g + diffuse.b) * 0.33333f, 0.2f, 0.8f);
            if (Rand(1.0f) < Psurvival)
            {
                vec3 R = DiffuseReflectionCosineWeighted(tri->N); // cosine-weighted direction
                color += diffuse * Sample(I + R * EPSILON, R, depth + 1) * (1.0f / Psurvival);
            }
        }
    }
    return color;
}
Now, you don't have to understand the whole code, because my question concerns the following two lines in the last method:
basePath baseInfo = { O, D, I, tri };
basePathHits.push_back(baseInfo);
I just create a simple struct "basePath" defined as follows:
struct basePath
{
    vec3 O, D, hit;
    Triangle* tri;
};
and I store it in a vector of structs defined at the beginning of my code:
vector<basePath> basePathHits;
The problem is that this seems to raise an exception. Indeed, if I try to store this information (which I need later in my code), the program crashes with the exception:
Unhandled exception at 0x0FD4FAC1 (msvcr120d.dll) in Template.exe: 0xC0000005: Access violation reading location 0x3F4C1BC1.
Some other times, without changing anything, the error is different and it's the following one:
Without storing that info, everything works perfectly. Likewise, if I set the number of cores to 1, everything works. So why does multithreading not allow me to do this? Do not hesitate to ask for further info if this is not enough.
Try making the following change to your code:
//we need to store the info of each bounce of the basePath for the offsetPaths
basePath baseInfo = { O, D, I, tri };

static std::mutex myMutex;   // requires #include <mutex>
myMutex.lock();
basePathHits.push_back(baseInfo);
myMutex.unlock();
If that removes the exceptions, then the problem is unsynchronised access to basePathHits (i.e. multiple threads calling push_back simultaneously). You need to think carefully about what the best solution to this will be, to minimise the impact of synchronisation on performance.
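If the mutex turns out to be a bottleneck, one common alternative is to give each worker its own vector and merge them after the threads finish, so the hot path needs no lock at all. A self-contained sketch of that pattern (using std::thread and stand-in types rather than the framework's Thread class):
#include <thread>
#include <vector>

struct vec3 { float x, y, z; };      // stand-in for the framework's vec3
struct Triangle;                     // only used through a pointer here
struct basePath { vec3 O, D, hit; Triangle* tri; };

int main()
{
    const int workerCount = 4;
    std::vector<std::vector<basePath>> perThreadHits(workerCount);
    std::vector<std::thread> workers;
    for (int i = 0; i < workerCount; ++i)
        workers.emplace_back([i, &perThreadHits] {
            // Each thread only ever touches its own vector: no data race, no lock.
            for (int n = 0; n < 1000; ++n)
                perThreadHits[i].push_back(basePath{ {0,0,0}, {0,0,1}, {0,0,0}, nullptr });
        });
    for (auto& w : workers) w.join();
    // Merge once on the main thread after all workers are done.
    std::vector<basePath> basePathHits;
    for (auto& v : perThreadHits)
        basePathHits.insert(basePathHits.end(), v.begin(), v.end());
}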
Maybe I just didn't see it, but there is no protection for the target vector: no mutex or atomic. And as far as I know, std::vector needs that when written to from multiple threads.
While making a little Pong game in C++ OpenGL, I decided it'd be fun to draw arcs (semi-circles) when things bounce. I decided to skip Bezier curves for the moment and just go with straight algebra, but I didn't get far. My algebra follows a simple quadratic function (y = ±sqrt(mx + c)).
This little excerpt is just an example I've yet to fully parameterize; I just wanted to see how it would look. When I draw this, however, it gives me a straight vertical line where the curve's tangent slope approaches -1.0 / 1.0.
Is this a limitation of GL_LINE_STRIP, or is there an easier way to draw semi-circles/arcs? Or did I just completely miss something obvious?
void Ball::drawBounce()
{
    float piecesToDraw = 100.0f;
    float arcWidth = 10.0f;
    float arcAngle = 4.0f;
    glBegin(GL_LINE_STRIP);
    for (float i = 0.0f; i < piecesToDraw; i += 1.0f) // Positive Half
    {
        float currentX = (i / piecesToDraw) * arcWidth;
        glVertex2f(currentX, sqrtf((-currentX * arcAngle) + arcWidth));
    }
    for (float j = piecesToDraw; j > 0.0f; j -= 1.0f) // Negative half (go backwards in X direction now)
    {
        float currentX = (j / piecesToDraw) * arcWidth;
        glVertex2f(currentX, -sqrtf((-currentX * arcAngle) + arcWidth));
    }
    glEnd();
}
Thanks in advance.
What is the purpose of sqrtf((-currentX * arcAngle) + arcWidth)? When i > 25, that expression becomes imaginary (its argument goes negative). The proper way of doing this would be to use sin()/cos() to generate the X and Y coordinates of a semi-circle, as stated in your question. If you want to use a parabola instead, the cleaner way would be to calculate y = H - H*(x/W)^2.
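For reference, a minimal sketch of the sin()/cos() approach in the same immediate-mode style as the question (the radius and segment count are arbitrary example values, and an active OpenGL context is assumed):
#include <cmath>

void drawSemiCircle(float radius, int segments) // e.g. radius = 5.0f, segments = 100
{
    glBegin(GL_LINE_STRIP);
    for (int i = 0; i <= segments; ++i)
    {
        float t = (float)i / (float)segments;   // 0..1 along the arc
        float a = t * 3.14159265f;              // 0..pi radians: a semi-circle
        glVertex2f(radius * cosf(a), radius * sinf(a));
    }
    glEnd();
}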