Fast rasterization with OpenCL / OpenGL

I am writing a rasterizer for real-time 3D rendering with OpenCL.
My current architecture:
Vertex shader: 1 thread per vertex
Rasterizer: 1 thread per face that loops over all pixels covered by the face
Fragment shader: 1 thread per pixel
This works well when the faces occupy a small amount of screen space, but when a single face covers a large portion of the screen the frame rate tanks, because that one rasterization thread must serially loop over every pixel the face covers.
I think this could be solved by a tiled approach. The screen would be divided into subsections (tiles), and one thread would be launched per tile. Only the faces whose bounding boxes overlap the tile would be processed.
I have some questions about this method though:
Should I find each tile's overlapping faces on the CPU or on the GPU?
What data structure should be used to store the per-tile face lists? They will have variable length, but I believe OpenCL buffers are fixed length. (A sketch of the layout I have in mind is below.)
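To make the question concrete, here is a rough host-side sketch of the layout I have in mind: one fixed-capacity slice of a big buffer per tile, plus a per-tile counter that a binning kernel would bump with atomic_inc. The buffer, kernel, and size names (binFaces, tileRasterizer, maxFacesPerTile) are placeholders, not working code.
// Tile grid: one fixed-capacity face list per tile plus a per-tile count.
const int tileSize = 16;                                  // assumed tile size in pixels
const int numTilesX = (width + tileSize - 1) / tileSize;
const int numTilesY = (height + tileSize - 1) / tileSize;
const int numTiles = numTilesX * numTilesY;
const int maxFacesPerTile = 256;                          // assumed capacity; overflow would need handling

cl::Buffer tileFaceLists(context, CL_MEM_READ_WRITE, numTiles * maxFacesPerTile * sizeof(cl_int));
cl::Buffer tileFaceCounts(context, CL_MEM_READ_WRITE, numTiles * sizeof(cl_int));

// each frame: clear the counts, bin one thread per face, rasterize one thread per tile
std::vector<cl_int> zeroCounts(numTiles, 0);
queue.enqueueWriteBuffer(tileFaceCounts, CL_TRUE, 0, numTiles * sizeof(cl_int), zeroCounts.data());
// set up binning kernel args (transformed verts, tileFaceLists, tileFaceCounts, ...)
queue.enqueueNDRangeKernel(binFaces, cl::NullRange, numFaces, cl::NullRange);
// set up tile rasterizer args; each tile thread reads tileFaceCounts[tile] faces from its slice
queue.enqueueNDRangeKernel(tileRasterizer, cl::NullRange, numTiles, cl::NullRange);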
Sample of host code of current implementation:
// set up vertex shader args
queue.enqueueNDRangeKernel(vertexShader, cl::NullRange, numVerts, cl::NullRange);
// set up rasterizer args
queue.enqueueNDRangeKernel(rasterizer, cl::NullRange, numFaces, cl::NullRange);
// set up fragment shader args
queue.enqueueNDRangeKernel(fragmentShader, cl::NullRange, numPixels, cl::NullRange);
// read frame buffer to draw to screen
queue.enqueueReadBuffer(buffer_screen, CL_TRUE, 0, width * height * 3 * sizeof(unsigned char), screen);
Sample of rasterizer kernel:
float2 bboxmin = (float2)(INFINITY, INFINITY);
float2 bboxmax = (float2)(-INFINITY, -INFINITY);
float2 clampCoords = (float2)(width - 1, height - 1);
// get bounding box
for (int i = 0; i < 3; i++) {
    for (int j = 0; j < 2; j++) {
        bboxmin[j] = max(0.f, min(bboxmin[j], vs[i][j]));
        bboxmax[j] = min(clampCoords[j], max(bboxmax[j], vs[i][j]));
    }
}
// loop over all pixels in bounding box
// this is the part that needs to be improved
int2 pix;
for (pix.x = bboxmin.x; pix.x <= bboxmax.x; pix.x++) {
    for (pix.y = bboxmin.y; pix.y <= bboxmax.y; pix.y++) {
        float3 bc_screen = barycentric(vs[0].xy, vs[1].xy, vs[2].xy, (float2)(pix.x, pix.y), offset);
        float3 bc_clip = (float3)(bc_screen.x / vsVP[0][3], bc_screen.y / vsVP[1][3], bc_screen.z / vsVP[2][3]);
        bc_clip = bc_clip / (bc_clip.x + bc_clip.y + bc_clip.z);
        float frag_depth = dot(homoZs, bc_clip);
        int pixInd = pix.x + pix.y * width;
        if (bc_screen.x < 0 || bc_screen.y < 0 || bc_screen.z < 0 || zbuffer[pixInd] > frag_depth) continue;
        zbuffer[pixInd] = frag_depth;
    }
}

A workaround is to cancel rasterization if a face gets too large and just return. This will lead to some visual artifacts, but at least the frame rate won't suffer.

Related

Accessing Index buffer in shaders (DirectX 11)

I have a vertex buffer and an index buffer, and I am rendering a mesh to just one pixel. I want to know which triangle of the mesh was rendered and access its index in the index buffer on the CPU for further processing (based on my mesh, only one triangle can be rendered to that pixel).
I first implemented it with SV_PrimitiveId, hoping it would generate 0 for the first three indices of the index buffer (the first triangle), 1 for the second three, and so on. That way I could copy the data from the GPU, read the id, and find the triangle. The problem was that the ids did not correspond to my index buffer (i.e. as I run the program it gives, for example, the third triangle id 7 one time and 10 another time).
Is there any way to determine which triangle the pixel shader is drawing and find its index in the index buffer on the CPU?
This should work:
C++:
...
Microsoft::WRL::ComPtr<ID3D11Texture2D> pPrimitiveIDs;
Microsoft::WRL::ComPtr<ID3D11RenderTargetView> pPIDsRTV;
Microsoft::WRL::ComPtr<ID3D11Texture2D> pPIDsStaging;
...
const int number_of_rtvs = 2;
ID3D11RenderTargetView* rtvs[number_of_rtvs] =
{
    pScreenRTV.Get(),
    pPIDsRTV.Get(),
};
pDeviceContext->OMSetRenderTargets(number_of_rtvs, rtvs, pDepthStencilView.Get());
...
pDeviceContext->CopyResource(pPIDsStaging.Get(), pPrimitiveIDs.Get());
D3D11_MAPPED_SUBRESOURCE MappedResource;
pDeviceContext->Map(pPIDsStaging.Get(), 0, D3D11_MAP_READ, 0, &MappedResource);
// here is the pid
// in case of a 1x1 back buffer you would just read the first value
// note: rows of a mapped texture are MappedResource.RowPitch bytes apart
UINT pid = *((UINT*)((BYTE*)MappedResource.pData + MouseY * MappedResource.RowPitch) + MouseX);
pDeviceContext->Unmap(pPIDsStaging.Get(), 0);
...
Pixel Shader:
struct PSOutput
{
    float4 color : SV_Target0;
    uint pid : SV_Target1;
};

PSOutput main(..., uint pid : SV_PrimitiveId)
{
    ...
    PSOutput output =
    {
        color,
        pid,
    };
    return output;
}
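For completeness, the elided setup above needs a second render target in an unsigned-integer format plus a staging texture the CPU can map. The following is only a hedged sketch of how those might be created (the descriptor values and the pDevice name are assumptions, not part of the original answer):
// Second render target for the primitive ids (format must match the uint SV_Target1 output).
D3D11_TEXTURE2D_DESC desc = {};
desc.Width = WindowWidth;
desc.Height = WindowHeight;
desc.MipLevels = 1;
desc.ArraySize = 1;
desc.Format = DXGI_FORMAT_R32_UINT;
desc.SampleDesc.Count = 1;
desc.Usage = D3D11_USAGE_DEFAULT;
desc.BindFlags = D3D11_BIND_RENDER_TARGET;
pDevice->CreateTexture2D(&desc, nullptr, &pPrimitiveIDs);
pDevice->CreateRenderTargetView(pPrimitiveIDs.Get(), nullptr, &pPIDsRTV);

// Staging copy with CPU read access, used as the CopyResource destination above.
desc.Usage = D3D11_USAGE_STAGING;
desc.BindFlags = 0;
desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
pDevice->CreateTexture2D(&desc, nullptr, &pPIDsStaging);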

How to count dead particles in the compute shader?

I am working on a particle system. To calculate each particle's position, time alive, and so on, I use a compute shader. My problem is getting the count of dead particles back to the CPU, so I can tell the renderer how many particles to draw. I store the particle data in a shader storage buffer and render the particles with instancing. I tried using an atomic counter buffer; it works fine, but copying the data from the buffer back to the CPU is slow. I wonder if there is some other option.
This is the important part of the compute shader:
if (pData.timeAlive >= u_LifeTime)
{
    pData.velocity = pData.defaultVelocity;
    pData.timeAlive = 0;
    pData.isAlive = u_Loop;
    atomicCounterIncrement(deadParticles);
    pVertex.position.x = pData.defaultPosition.x;
    pVertex.position.y = pData.defaultPosition.y;
}
InVertex[id] = pVertex;
InData[id] = pData;
To copy the data to the CPU I use the following code:
uint32_t* OpenGLAtomicCounter::GetCounters()
{
    glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, m_AC);
    glGetBufferSubData(GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(uint32_t) * m_NumberOfCounters, m_Counters);
    return m_Counters;
}
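For context, the counter buffer itself is created and reset roughly like this; it is a minimal sketch with assumed names and binding point, not the exact code from my project:
// one GLuint per counter, bound to atomic counter binding point 0
glGenBuffers(1, &m_AC);
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, m_AC);
glBufferData(GL_ATOMIC_COUNTER_BUFFER, sizeof(GLuint) * m_NumberOfCounters, nullptr, GL_DYNAMIC_DRAW);
glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, m_AC);

// per frame: zero the counters before dispatching the compute shader
std::vector<GLuint> zeros(m_NumberOfCounters, 0);
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, m_AC);
glBufferSubData(GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(GLuint) * m_NumberOfCounters, zeros.data());
The count read back by GetCounters is what then decides the instance count for the draw call.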

Precise Texture Overlay

I'm trying to set up a two-stage render of objects in a 3D engine I'm working on, written in C++ with DirectX9, to facilitate transparency (and other things). I thought it was all working nicely until I noticed some dodginess on the edges of objects rendered before objects using this two-stage method.
The two stage method is simple:
Draw the model to an off-screen ("side") texture of the same size, using the same z-buffer (no MSAA is used anywhere)
Draw the off-screen ("side") texture over the top of the main render target with a suitable blend and no alpha test or write
In the image below, the left view is with the two-stage render of the gray object (a lamppost), with the body in front of it rendered directly to the target texture. The right view is with the two-stage render disabled, so both are rendered directly onto the target surface.
On close inspection it is as if the side texture is offset by exactly 1 pixel "down" and 1 pixel "right" when rendered over the target surface (but is rendered correctly in place). This can be seen in an overlay of the off-screen texture (which I get my program to write out to a bitmap file via D3DXSaveTextureToFile) over a screenshot below.
One last image so you can see where the edge in the side texture is coming from (it's because rendering to the side texture does use the z test). Left is the screenshot, right is the side texture (as overlaid above).
All this leads me to believe that my "overlaying" isn't very effective. The code that renders the side texture over the main render target is shown below (note that the same viewport is used for all scene rendering (on and off screen)). The "effect" object is an instance of a thin wrapper over LPD3DXEFFECT, with the "effect" field (sorry about shoddy naming) being a LPD3DXEFFECT itself.
void drawSideOver(LPDIRECT3DDEVICE9 dxDevice, drawData* ddat)
{
    // "ddat" drawdata contains lots of render state information, but all we need here is the handles for the targetSurface and sideSurface
    D3DXMATRIX idMat;
    D3DXMatrixIdentity(&idMat); // create identity matrix
    dxDevice->SetRenderTarget(0, ddat->targetSurface); // switch to targetSurface
    dxDevice->SetRenderState(D3DRS_ZENABLE, false); // disable z test and z write
    dxDevice->SetRenderState(D3DRS_ZWRITEENABLE, false);
    vertexOver overVerts[4]; // create square
    overVerts[0] = vertexOver(-1, -1, 0, 0, 1);
    overVerts[1] = vertexOver(-1, 1, 0, 0, 0);
    overVerts[2] = vertexOver(1, -1, 0, 1, 1);
    overVerts[3] = vertexOver(1, 1, 0, 1, 0);
    effect.setTexture(ddat->sideTex); // use side texture as shader texture ("tex")
    effect.effect->SetTechnique("over"); // change to "over" technique
    effect.setViewProj(&idMat); // set viewProj to identity matrix so 1/-1 map directly
    effect.effect->CommitChanges();
    setAlpha(dxDevice); // this sets up the alpha blending which works fine
    UINT numPasses, pass;
    effect.effect->Begin(&numPasses, 0);
    effect.effect->BeginPass(0);
    dxDevice->SetVertexDeclaration(vertexDecOver);
    dxDevice->DrawPrimitiveUP(D3DPT_TRIANGLESTRIP, 2, overVerts, sizeof(vertexOver));
    effect.effect->EndPass();
    effect.effect->End();
    dxDevice->SetRenderState(D3DRS_ZENABLE, true); // revert these so we don't mess everything up drawn after this
    dxDevice->SetRenderState(D3DRS_ZWRITEENABLE, true);
}
The C++-side definition for the vertexOver struct and constructor (the HLSL side is shown further below):
struct vertexOver
{
public:
    float x;
    float y;
    float z;
    float w;
    float tu;
    float tv;
    vertexOver() { }
    vertexOver(float xN, float yN, float zN, float tuN, float tvN)
    {
        x = xN;
        y = yN;
        z = zN;
        w = 1.0;
        tu = tuN;
        tv = tvN;
    }
};
Setting aside the inefficiency of re-creating and passing the vertices down to the GPU each draw, what I really want to know is why this method doesn't quite work, and whether there are any better methods for overlaying textures like this with an alpha blend that won't exhibit this issue.
I figured that texture sampling might play a part here, but messing about with the options didn't seem to help much (for example, using a LINEAR filter just makes it fuzzy, as you might expect, implying that the offset isn't as clear-cut as a 1-pixel discrepancy). Shader code:
struct VS_Input_Over
{
    float4 pos : POSITION0;
    float2 txc : TEXCOORD0;
};

struct VS_Output_Over
{
    float4 pos : POSITION0;
    float2 txc : TEXCOORD0;
    float4 altPos : TEXCOORD1;
};

struct PS_Output
{
    float4 col : COLOR0;
};

Texture tex;
sampler texSampler = sampler_state { texture = <tex>; magfilter = NONE; minfilter = NONE; mipfilter = NONE; AddressU = mirror; AddressV = mirror; };

// side/over shaders (these make up the "over" technique, pixel shader version 2.0)
VS_Output_Over VShade_Over(VS_Input_Over inp)
{
    VS_Output_Over outp = (VS_Output_Over)0;
    outp.pos = mul(inp.pos, viewProj);
    outp.altPos = outp.pos;
    outp.txc = inp.txc;
    return outp;
}

PS_Output PShade_Over(VS_Output_Over inp)
{
    PS_Output outp = (PS_Output)0;
    outp.col = tex2D(texSampler, inp.txc);
    return outp;
}
I've looked about for a "Blended Blit" or something but I can't find anything, and other related searches have only brought up forums implying that rendering a quad with an orthographic projection is the way to go about doing this.
Sorry if I've given far too much detail for this issue, but it's both interesting and infuriating, and any feedback would be greatly appreciated.
It looks to me like your problem is the mapping of texels to pixels. You must offset a screen-aligned quad by half a pixel to map the texels directly to the screen pixels. This issue is explained here: Directly Mapping Texels to Pixels (MSDN).
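In clip space that half pixel works out to 1/Width and 1/Height, since the quad spans 2 units across the viewport. A hedged sketch of applying the offset to the quad positions (not code from the thread; the signs may need flipping depending on the setup):
// Shift the screen-aligned quad by half a pixel so texel centres land on pixel centres
// (Direct3D 9 convention). Half a pixel = 0.5 * (2 / Width) = 1 / Width in clip space.
for (int i = 0; i < 4; i++)
{
    overVerts[i].x -= 1.0f / (float)ddat->targetVp->Width;
    overVerts[i].y += 1.0f / (float)ddat->targetVp->Height;
}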
For anyone else hitting a similar wall, my specific problem was solved by adjusting the U and V values of the vertices sent to the GPU for the overlaid texture triangles, thus:
for (int i = 0; i < 4; i++)
{
    overVerts[i].tu += 0.5 / (float)ddat->targetVp->Width; // ddat->targetVp is the viewport in use, and the viewport is the same size as the texture
    overVerts[i].tv += 0.5 / (float)ddat->targetVp->Height;
}
See Directly Mapping Texels to Pixels as provided by Gnietschow's answer for an explanation as to why this makes sense.

CPU Ray Casting

I'm attempting to ray cast an octree on the CPU (I know the GPU is better, but I'm unable to get that working at this time; I believe my octree texture is created incorrectly).
I understand what needs to be done, and so far I cast a ray for each pixel and check if that ray intersects any nodes within the octree. If it does and the node is not a leaf node, I check whether the ray intersects its child nodes. I keep doing this until a leaf node is hit. Once a leaf node is hit, I get the colour for that node.
My question is, what is the best way to draw this to the screen? Currently I'm storing the colours in an array and drawing them with glDrawPixels, but this does not produce correct results: there are gaps in the renderings, and the projection is wrong (I am using glRasterPos3fv).
Edit: Here is some code so far; it needs cleaning up, sorry. I have omitted the octree ray casting code as I'm not sure it's needed, but I will post it if it'll help :)
void Draw(Vector cameraPosition, Vector cameraLookAt)
{
    // Calculate the right Vector
    Vector rightVector = Cross(cameraLookAt, Vector(0, 1, 0));
    // Set up the screen plane starting X & Y positions
    float screenPlaneX, screenPlaneY;
    screenPlaneX = cameraPosition.x() - ((WINDOWWIDTH / 2) * rightVector.x());
    screenPlaneY = cameraPosition.y() + ((float)WINDOWHEIGHT / 2);
    float deltaX, deltaY;
    deltaX = 1;
    deltaY = 1;
    int currentX, currentY, index = 0;
    Vector origin, direction;
    origin = cameraPosition;
    vector<Vector4<int>> colours(WINDOWWIDTH * WINDOWHEIGHT);
    currentY = screenPlaneY;
    Vector4<int> colour;
    for (int y = 0; y < WINDOWHEIGHT; y++)
    {
        // Set the current pixel along x to be the left most pixel
        // on the image plane
        currentX = screenPlaneX;
        for (int x = 0; x < WINDOWWIDTH; x++)
        {
            // default colour is black
            colour = Vector4<int>(0, 0, 0, 0);
            // Cast the ray into the current pixel. Set the length of the ray to be 200
            direction = Vector(currentX, currentY, cameraPosition.z() + (cameraLookAt.z() * 200)) - origin;
            direction.normalize();
            // Cast the ray against the octree and store the resultant colour in the array
            colours[index] = RayCast(origin, direction, rootNode, colour);
            // Move to next pixel in the plane
            currentX += deltaX;
            // increase the colour array index position
            index++;
        }
        // Move to next row in the image plane
        currentY -= deltaY;
    }
    // Set the colours for the array
    SetFinalImage(colours);
    // Set the raster position to (0, 0, 0) and pass the array of colours to glDrawPixels
    GLfloat v[3] = { 0.0f, 0.0f, 0.0f };
    glRasterPos3fv(v);
    glDrawPixels(WINDOWWIDTH, WINDOWHEIGHT, GL_RGBA, GL_FLOAT, finalImage);
}
void SetFinalImage(vector<Vector4<int>> colours)
{
    // The array is a 2D array, with the first dimension
    // set to the size of the window (WINDOW_WIDTH * WINDOW_HEIGHT)
    // Second dimension stores the rgba values for each pixel
    for (int i = 0; i < colours.size(); i++)
    {
        finalImage[i][0] = (float)colours[i].r;
        finalImage[i][1] = (float)colours[i].g;
        finalImage[i][2] = (float)colours[i].b;
        finalImage[i][3] = (float)colours[i].a;
    }
}
Your pixel drawing code looks okay, but I'm not sure that your RayCasting routines are correct. When I wrote my raytracer, I had a bug that caused horizontal artifacts on the screen, but it was related to rounding errors in the render code.
I would try this: create a result set of vector<Vector4<int>> where the colors are all red. Now render that to the screen. If it looks correct, then the OpenGL routines are correct. Divide and conquer is always a good debugging method.
Here's a question though: why are you using Vector4<int> when later on you write the image as GL_FLOAT? I'm not seeing any int->float conversion here...
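A minimal sketch of that red-fill test, reusing the poster's own types and helpers (Vector4, SetFinalImage, finalImage) and assuming component values in the 0-1 range since the buffer is uploaded as GL_FLOAT:
// Fill the whole frame with solid red to isolate the OpenGL upload path
// from the ray caster. If this shows a clean red window, the drawing code is fine.
vector<Vector4<int>> colours(WINDOWWIDTH * WINDOWHEIGHT, Vector4<int>(1, 0, 0, 1));
SetFinalImage(colours);
glRasterPos3f(0.0f, 0.0f, 0.0f);
glDrawPixels(WINDOWWIDTH, WINDOWHEIGHT, GL_RGBA, GL_FLOAT, finalImage);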
Your problem may be in your 3DDDA (octree raycaster), and specifically with adaptive termination. It results from the quantisation of rays into grid-cell form, which causes certain octree nodes that lie slightly behind foreground nodes (i.e. at a higher z depth), and which thus should be partly visible and partly occluded, not to be rendered at all. The smaller your voxels are, the less noticeable this will be.
There is a very easy way to test whether this is the problem -- comment out the adaptive termination line(s) in your 3DDDA and see if you still get the same gap artifacts.

Using vertex buffers in JOGL, crash when too many triangles

I have written a simple application in Java using JOGL which draws 3D geometry. The camera can be rotated by dragging the mouse. The application works fine, but drawing the geometry with glBegin(GL_TRIANGLES) ... calls is too slow.
So I started to use vertex buffers. This also works fine until the number of triangles gets larger than 1,000,000. When that happens, the display driver suddenly crashes and my monitor goes dark. Is there a limit on how many triangles fit in the buffer? I had hoped to get 1,000,000 triangles rendered at a reasonable frame rate.
I have no idea how to debug this problem. The nasty thing is that I have to reboot Windows after each launch, since I have no other way to get my display working again. Could anyone give me some advice?
The vertices, triangles and normals are stored in arrays float[][] m_vertices, int[][] m_triangles, float[][] m_triangleNormals.
I initialized the buffer with:
// generate a VBO pointer / handle
if (m_vboHandle <= 0) {
    int[] vboHandle = new int[1];
    m_gl.glGenBuffers(1, vboHandle, 0);
    m_vboHandle = vboHandle[0];
}
// interleave vertex / normal data
FloatBuffer data = Buffers.newDirectFloatBuffer(m_triangles.length * 3*3*2);
for (int t = 0; t < m_triangles.length; t++)
    for (int j = 0; j < 3; j++) {
        int v = m_triangles[t][j];
        data.put(m_vertices[v]);
        data.put(m_triangleNormals[t]);
    }
data.rewind();
// transfer data to VBO
int numBytes = data.capacity() * 4;
m_gl.glBindBuffer(GL.GL_ARRAY_BUFFER, m_vboHandle);
m_gl.glBufferData(GL.GL_ARRAY_BUFFER, numBytes, data, GL.GL_STATIC_DRAW);
m_gl.glBindBuffer(GL.GL_ARRAY_BUFFER, 0);
Then, the scene gets rendered with:
gl.glBindBuffer(GL.GL_ARRAY_BUFFER, m_vboHandle);
gl.glEnableClientState(GL2.GL_VERTEX_ARRAY);
gl.glEnableClientState(GL2.GL_NORMAL_ARRAY);
gl.glVertexPointer(3, GL.GL_FLOAT, 6*4, 0);
gl.glNormalPointer(GL.GL_FLOAT, 6*4, 3*4);
gl.glDrawArrays(GL.GL_TRIANGLES, 0, 3*m_triangles.length);
gl.glDisableClientState(GL2.GL_VERTEX_ARRAY);
gl.glDisableClientState(GL2.GL_NORMAL_ARRAY);
gl.glBindBuffer(GL.GL_ARRAY_BUFFER, 0);
glBufferData itself does not return a value, but you can check for errors right after calling it: glGetError() will report GL_OUT_OF_MEMORY if the driver cannot satisfy the numBytes allocation.