Rendering multiple triangles with a single vertex buffer - c++
Previously I was using a separate vertex and index buffer for each mesh, but I'd like to try using a single large vertex buffer to reduce DrawIndexed() calls. Let's say I have the following arrays:
SimpleVertex vertices[] =
{
{ XMFLOAT3(-1.0f, 1.0f, -1.0f), XMFLOAT4(0.0f, 0.0f, 1.0f, 1.0f) },
{ XMFLOAT3(1.0f, 1.0f, -1.0f), XMFLOAT4(0.0f, 1.0f, 0.0f, 1.0f) },
{ XMFLOAT3(1.0f, 1.0f, 1.0f), XMFLOAT4(0.0f, 1.0f, 1.0f, 1.0f) },
{ XMFLOAT3(-1.0f, 1.0f, 1.0f), XMFLOAT4(1.0f, 0.0f, 0.0f, 1.0f) },
{ XMFLOAT3(-1.0f, -1.0f, -1.0f), XMFLOAT4(1.0f, 0.0f, 1.0f, 1.0f) },
{ XMFLOAT3(1.0f, -1.0f, -1.0f), XMFLOAT4(1.0f, 1.0f, 0.0f, 1.0f) },
{ XMFLOAT3(1.0f, -1.0f, 1.0f), XMFLOAT4(1.0f, 1.0f, 1.0f, 1.0f) },
{ XMFLOAT3(-1.0f, -1.0f, 1.0f), XMFLOAT4(0.0f, 0.0f, 0.0f, 1.0f) },
};
WORD indices[] =
{
3,1,0,
2,1,3,
0,5,4,
1,5,0,
3,4,7,
0,4,3,
1,6,5,
2,6,1,
2,7,6,
3,7,2,
6,4,5,
7,4,6,
};
This works great for a single indexed cube. But, what if I wish to draw two cubes? How do I set up the index buffer to handle that? I'm confused as to whether the indices are local to each cube and thus should just repeat every 36 indices, or if they should be incremented as such:
SimpleVertex vertices[] =
{
//first cube
{ XMFLOAT3(-1.0f, 1.0f, -1.0f), XMFLOAT4(0.0f, 0.0f, 1.0f, 1.0f) },
{ XMFLOAT3(1.0f, 1.0f, -1.0f), XMFLOAT4(0.0f, 1.0f, 0.0f, 1.0f) },
{ XMFLOAT3(1.0f, 1.0f, 1.0f), XMFLOAT4(0.0f, 1.0f, 1.0f, 1.0f) },
{ XMFLOAT3(-1.0f, 1.0f, 1.0f), XMFLOAT4(1.0f, 0.0f, 0.0f, 1.0f) },
{ XMFLOAT3(-1.0f, -1.0f, -1.0f), XMFLOAT4(1.0f, 0.0f, 1.0f, 1.0f) },
{ XMFLOAT3(1.0f, -1.0f, -1.0f), XMFLOAT4(1.0f, 1.0f, 0.0f, 1.0f) },
{ XMFLOAT3(1.0f, -1.0f, 1.0f), XMFLOAT4(1.0f, 1.0f, 1.0f, 1.0f) },
{ XMFLOAT3(-1.0f, -1.0f, 1.0f), XMFLOAT4(0.0f, 0.0f, 0.0f, 1.0f) },
//second cube
{ XMFLOAT3(-2.0f, 2.0f, -2.0f), XMFLOAT4(0.0f, 0.0f, 1.0f, 1.0f) },
{ XMFLOAT3(2.0f, 2.0f, -2.0f), XMFLOAT4(0.0f, 1.0f, 0.0f, 1.0f) },
{ XMFLOAT3(2.0f, 2.0f, 2.0f), XMFLOAT4(0.0f, 1.0f, 1.0f, 1.0f) },
{ XMFLOAT3(-2.0f, 2.0f, 2.0f), XMFLOAT4(1.0f, 0.0f, 0.0f, 1.0f) },
{ XMFLOAT3(-2.0f, -2.0f, -2.0f), XMFLOAT4(1.0f, 0.0f, 1.0f, 1.0f) },
{ XMFLOAT3(2.0f, -2.0f, -2.0f), XMFLOAT4(1.0f, 1.0f, 0.0f, 1.0f) },
{ XMFLOAT3(2.0f, -2.0f, 2.0f), XMFLOAT4(1.0f, 1.0f, 1.0f, 1.0f) },
{ XMFLOAT3(-2.0f, -2.0f, 2.0f), XMFLOAT4(0.0f, 0.0f, 0.0f, 1.0f) },
};
WORD indices[] =
{
//First cube
3,1,0,
2,1,3,
0,5,4,
1,5,0,
3,4,7,
0,4,3,
1,6,5,
2,6,1,
2,7,6,
3,7,2,
6,4,5,
7,4,6,
//second cube
11,9,8,
10,9,11,
8,13,12,
9,13,8,
11,12,15,
8,12,11,
9,14,13,
10,14,9,
10,15,14,
11,15,10,
14,12,13,
15,12,14,
};
So basically, I'm trying to understand how to draw one large index buffer with multiple objects. Am I thinking about this correctly or should I have the buffer re-use the same index buffer over and over?
I'm aware of using Instancing, but there are times where the geometry changes, so I need to avoid it in this case.
I decided to just start writing code to test the theory. It turns out you do indeed need to increment the indices to treat a large vertex array as one mesh: each object's indices must be offset by the number of vertices that precede it in the shared vertex buffer. I wrote a function to illustrate this for those interested:
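UINT* VertexCompiler::BuildIndexArray()
{
    const int numObjects = (int)mVertexObjects.size();

    // Total number of indices across all objects.
    int indexCount = 0;
    for (int i = 0; i < numObjects; i++)
        indexCount += mVertexObjects[i].NumIndices;
    mIndexCount = indexCount;

    UINT* indices = new UINT[indexCount];
    int index = 0;
    UINT vertexOffset = 0; // number of vertices that precede object i in the shared vertex buffer
    for (int i = 0; i < numObjects; i++)
    {
        for (int j = 0; j < (int)mVertexObjects[i].NumIndices; j++)
        {
            // Shift the object's local index to its position in the combined vertex buffer.
            indices[index] = mVertexObjects[i].Indices[j] + vertexOffset;
            index++;
        }
        // A running offset stays correct even when objects have different vertex counts
        // (NumVertices * i is only right when every object has the same count).
        vertexOffset += mVertexObjects[i].NumVertices;
    }
    return indices;
}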
Be sure to release the indices array with delete[] after you're finished with it. Usually it's poor design to allocate memory inside a function and free it outside, but this is just for demonstration purposes. Typically you should allocate the index array outside and pass it to the function as a pointer.
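One way to sidestep the ownership issue entirely is to return a std::vector. The following is only a minimal sketch under stated assumptions: MeshData and buildIndexArray are hypothetical names standing in for the entries of mVertexObjects and for BuildIndexArray(), and std::uint32_t is used in place of UINT so the snippet compiles without the Windows headers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one entry of mVertexObjects.
struct MeshData {
    std::uint32_t numVertices;           // vertices this mesh contributes to the shared vertex buffer
    std::vector<std::uint32_t> indices;  // local, zero-based indices
};

// Concatenates the index lists, shifting each mesh's indices by the number
// of vertices that precede it in the combined vertex buffer.
std::vector<std::uint32_t> buildIndexArray(const std::vector<MeshData>& meshes)
{
    std::size_t total = 0;
    for (const MeshData& m : meshes)
        total += m.indices.size();

    std::vector<std::uint32_t> combined;
    combined.reserve(total);

    std::uint32_t vertexOffset = 0;
    for (const MeshData& m : meshes) {
        for (std::uint32_t local : m.indices)
            combined.push_back(local + vertexOffset);
        vertexOffset += m.numVertices;
    }
    return combined;
}
```

For two 8-vertex cubes, a local triangle 3,1,0 in the second cube becomes 11,9,8, which matches the hand-written index list in the question. The vector frees its memory automatically, and combined.data() can be handed to whatever buffer-creation call you use.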
One important thing to note is the use of UINT. Most textbooks and articles on the subject allocate smaller index buffers with WORD. A WORD is 16 bits, so it cannot hold an index value above 65535, and your incremented indices will wrap around once the combined vertex count exceeds 65,536. So, if you're rendering enough vertices that an index would not fit in 16 bits, use UINT, and don't forget to switch to DXGI_FORMAT_R32_UINT instead of DXGI_FORMAT_R16_UINT.
E.g.
IASetIndexBuffer(indicesBuffer, DXGI_FORMAT_R32_UINT, 0);
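The 16-bit versus 32-bit decision can also be made at run time from the combined vertex count. This is a small sketch, not part of the original answer; IndexFormat, chooseIndexFormat and bytesPerIndex are hypothetical helper names, and the enum is a stand-in so the snippet does not depend on the DXGI headers.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the two DXGI index formats.
enum class IndexFormat { UInt16, UInt32 };

// The largest index used is totalVertexCount - 1, and a 16-bit index
// can hold values up to 65535.
inline IndexFormat chooseIndexFormat(std::size_t totalVertexCount)
{
    return totalVertexCount <= 65536 ? IndexFormat::UInt16 : IndexFormat::UInt32;
}

// Size of one index in bytes, e.g. for computing the buffer's byte width.
inline std::size_t bytesPerIndex(IndexFormat format)
{
    return format == IndexFormat::UInt16 ? sizeof(std::uint16_t) : sizeof(std::uint32_t);
}
```

In a real renderer, IndexFormat::UInt16 would map to DXGI_FORMAT_R16_UINT and IndexFormat::UInt32 to DXGI_FORMAT_R32_UINT, and the index data would have to be stored with the matching element type.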
Related
D3D9 CubeMap texture
I am making a cubemap texture, and the same image is drawn on all 6 sides. How can I display a split image across the 6 sides instead? I want to do it without using shaders. I would also like to know where I can study DirectX. Are there any sites you can recommend? This is my code:
struct CUBEVERTEX
{
    float x, y, z;
    float tu, tv;
};

void SkyBox::onInit(float scale)
{
    CUBEVERTEX vertice[] =
    {
        {-1.0f, 1.0f, -1.0f, 0.0f, 0.0f },
        { 1.0f, 1.0f, -1.0f, 1.0f, 0.0f },
        { 1.0f, -1.0f, -1.0f, 1.0f, 1.0f },
        { 1.0f, -1.0f, -1.0f, 1.0f, 1.0f },
        {-1.0f, -1.0f, -1.0f, 0.0f, 1.0f },
        {-1.0f, 1.0f, -1.0f, 0.0f, 0.0f },

        { 1.0f, 1.0f, 1.0f, 0.0f, 0.0f },
        {-1.0f, 1.0f, 1.0f, 1.0f, 0.0f },
        {-1.0f, -1.0f, 1.0f, 1.0f, 1.0f },
        {-1.0f, -1.0f, 1.0f, 1.0f, 1.0f },
        { 1.0f, -1.0f, 1.0f, 0.0f, 1.0f },
        { 1.0f, 1.0f, 1.0f, 0.0f, 0.0f },

        {-1.0f, -1.0f, -1.0f, 0.0f, 0.0f },
        { 1.0f, -1.0f, -1.0f, 1.0f, 0.0f },
        { 1.0f, -1.0f, 1.0f, 1.0f, 1.0f },
        { 1.0f, -1.0f, 1.0f, 1.0f, 1.0f },
        {-1.0f, -1.0f, 1.0f, 0.0f, 1.0f },
        {-1.0f, -1.0f, -1.0f, 0.0f, 0.0f },

        {-1.0f, 1.0f, 1.0f, 0.0f, 0.0f },
        { 1.0f, 1.0f, 1.0f, 1.0f, 0.0f },
        { 1.0f, 1.0f, -1.0f, 1.0f, 1.0f },
        { 1.0f, 1.0f, -1.0f, 1.0f, 1.0f },
        {-1.0f, 1.0f, -1.0f, 0.0f, 1.0f },
        {-1.0f, 1.0f, 1.0f, 0.0f, 0.0f },

        {-1.0f, 1.0f, 1.0f, 0.0f, 0.0f },
        {-1.0f, 1.0f, -1.0f, 1.0f, 0.0f },
        {-1.0f, -1.0f, -1.0f, 1.0f, 1.0f },
        {-1.0f, -1.0f, -1.0f, 1.0f, 1.0f },
        {-1.0f, -1.0f, 1.0f, 0.0f, 1.0f },
        {-1.0f, 1.0f, 1.0f, 0.0f, 0.0f },

        { 1.0f, 1.0f, -1.0f, 0.0f, 0.0f },
        { 1.0f, 1.0f, 1.0f, 1.0f, 0.0f },
        { 1.0f, -1.0f, 1.0f, 1.0f, 1.0f },
        { 1.0f, -1.0f, 1.0f, 1.0f, 1.0f },
        { 1.0f, -1.0f, -1.0f, 0.0f, 1.0f },
        { 1.0f, 1.0f, -1.0f, 0.0f, 0.0f }
    };

    m_pd3dDevice->CreateVertexBuffer(sizeof(vertice), 0, D3DFVF_CUBEVERTEX, D3DPOOL_DEFAULT, &m_pVB, 0);

    void* pVertice;
    m_pVB->Lock(0, sizeof(vertice), &pVertice, 0);
    memcpy(pVertice, vertice, sizeof(vertice));
    m_pVB->Unlock();
}

void SkyBox::render()
{
    D3DXMATRIX matWorld;
    D3DXMatrixIdentity(&matWorld);
    m_pd3dDevice->SetTransform(D3DTS_WORLD, &matWorld);
    m_pd3dDevice->SetTexture(0, texture);
    m_pd3dDevice->SetStreamSource(0, m_pVB, 0, sizeof(CUBEVERTEX));
    m_pd3dDevice->SetFVF(D3DFVF_CUBEVERTEX);
    m_pd3dDevice->DrawPrimitive(D3DPT_TRIANGLELIST, 0, 12);
}
Rendering a cubemap directly as a skybox is not achievable in Direct3D 9, especially with a "no shaders" requirement. In Direct3D 9, you need to create the resource as a cubemap to use it for environment mapping, and then create six individual 2D textures to render it as a skybox, which means having two copies of each cubemap face in memory.
In Direct3D 10 or later, you can create one resource and then create two shader resource views for it: one as a cubemap and another as a 2D texture array. This results in a single copy of each cubemap face in memory. You can then use shaders to render the individual faces of the 2D texture array on a skybox.
There is an example DirectX 12 skybox implementation that uses the DirectX Tool Kit for DX12. The same technique will work for Direct3D 11 as long as you require Direct3D hardware feature level 10.0 or better.
Unless you are specifically targeting Windows XP, there's no reason to learn Direct3D 9 at this point. Direct3D 11 is the 'mainstream' graphics API you should look at. See Microsoft Docs and DirectX Tool Kit.
Change color of point on mouse click and back to original on release
I am working on OpenGL code using shaders, where one of the requirements is to change the color of a point to white (or anything) on a mouse click and change it back to its original color once the button is released. Is there any function that can retrieve the vertex of the point being clicked so that I can change its color?
There are eight points that form a circle, with coordinates (0,√2), (1,1), (√2, 0), (1,-1), (0,-√2), (-1,-1), (-√2,0), (-1,1) (basically a circle with a radius of √2 units).
I have initially defined the points as given below:
Vertex Vertices[] =
{
    { { 0.0f, 1.414214f, 0.0f, 1.0f },{ 0.5f, 0.0f, 0.0f, 1.0f } }, // 0
    { { -1.0f, 1.0f, 0.0f, 1.0f },{ 0.0f, 1.0f, 0.0f, 1.0f } }, // 1
    { { -1.414214f, 0.0f, 0.0f, 1.0f },{ 0.0f, 0.5f, 0.0f, 1.0f } }, // 2
    { { -1.0f, -1.0f, 0.0f, 1.0f },{ 0.0f, 0.0f, 1.0f, 1.0f } }, // 3
    { { 0.0f, -1.414214f, 0.0f, 1.0f },{ 0.5f, 0.5f, 1.0f, 1.0f } }, // 4
    { { 1.0f, -1.0f, 0.0f, 1.0f },{ 1.0f, 1.0f, 1.0f, 1.0f } }, // 5
    { { 1.414214f, 0.0f, 0.0f, 1.0f },{ 0.5f, 1.0f, 1.0f, 1.0f } }, // 6
    { { 1.0f, 1.0f, 0.0f, 1.0f },{ 1.0f, 0.0f, 0.0f, 1.0f } }, // 7
};
How can I calculate the vertex normals for my model(house)
I recently drew a house in my Direct3D 11 application and it seems to look OK. The only problem I have is calculating the vertex normals for the house. Every time the light strikes the house, it looks a little awkward, and worst of all, the light won't even strike the inside of the house. (I am using the spotlight technique to create the light.) Below is the vertex buffer that contains the vertices, texture coordinates, and normals for my poorly drawn model. Can someone show me how to calculate the normals for the house?
void Create_Vertex_Buffer_for_House()
{
    D3D11_BUFFER_DESC VertexBufferDesc;
    D3D11_SUBRESOURCE_DATA VertexBufferData;

    ZeroMemory(&VertexBufferDesc, sizeof(VertexBufferDesc));
    VertexBufferDesc.BindFlags = D3D11_BIND_VERTEX_BUFFER;
    VertexBufferDesc.ByteWidth = sizeof(Vertex_Buffer) * 34;
    VertexBufferDesc.Usage = D3D11_USAGE_DEFAULT;
    VertexBufferDesc.CPUAccessFlags = 0;

    /* Vertex coordinates, texture coordinates, and vertex normals (respectively) */
    Vertex_Buffer Vertices[] =
    {
        /* Front wall of the house*/
        Vertex_Buffer(-1.0f, -1.0f, 1.0f, 0.0f, 10.0f, -1.0f, -1.0f, 1.0f),
        Vertex_Buffer(-1.0f, 1.0f, 1.0f, 0.0f, 0.0f, -1.0f, 1.0f, 1.0f),
        Vertex_Buffer(1.0f, 1.0f, 1.0f, 10.0f, 0.0f, 1.0f, 1.0f, 1.0f),
        Vertex_Buffer(1.0f, -1.0f, 1.0f, 10.0f, 10.0f, 1.0f, -1.0f, 1.0f),
        /* Front wall of the house*/
        Vertex_Buffer(-4.0f, -1.0f, 1.0f, 0.0f, 10.0f, -4.0f, -1.0f, 1.0f),
        Vertex_Buffer(-4.0f, 1.0f, 1.0f, 0.0f, 0.0f, -4.0f, 1.0f, 1.0f),
        Vertex_Buffer(-2.0f, 1.0f, 1.0f, 10.0f, 0.0f, -2.0f, 1.0f, 1.0f),
        Vertex_Buffer(-2.0f, -1.0f, 1.0f, 10.0f, 10.0f, -2.0f, -1.0f, 1.0f),
        /* Rooftop of house (front)*/
        Vertex_Buffer(-4.0f, 1.0f, 1.0f, 0.0f, 10.0f, -4.0f, 1.0f, 1.0f),
        Vertex_Buffer(-4.0f, 2.5f, -1.0f, 0.0f, 0.0f, -4.0f, 2.5f, -1.0f),
        Vertex_Buffer(1.0f, 2.5f, -1.0f, 10.0f, 0.0f, 1.0f, 2.5f, -1.0f),
        Vertex_Buffer(1.0f, 1.0f, 1.0f, 10.0f, 10.0f, 1.0f, 1.0f, 1.0f),
        /* Rooftop of the house(back)*/
        Vertex_Buffer(-4.0f, 2.5f, -1.0f, 0.0f, 10.0f, -4.0f, 2.5f, 1.0f),
        Vertex_Buffer(-4.0f, 1.0f, -3.0f, 0.0f, 0.0f, -4.0f, 2.5f, -3.0f),
        Vertex_Buffer(1.0f, 1.0f, -3.0f, 10.0f, 0.0f, 1.0f, 1.0f, -3.0f),
        Vertex_Buffer(1.0f, 2.5f, -1.0f, 10.0f, 10.0f, 1.0f, 2.5f, -1.0f),
        /* Right wall of the house*/
        Vertex_Buffer(1.0f, -1.0f, 1.0f, 0.0f, 10.0f, 1.0f, -1.0f, 1.0f),
        Vertex_Buffer(1.0f, 1.0f, 1.0f, 0.0f, 0.0f, 1.0f, 1.0f, 1.0f),
        Vertex_Buffer(1.0f, 1.0f, -3.0f, 10.0f, 0.0f, 1.0f, 1.0f, -3.0f),
        Vertex_Buffer(1.0f, -1.0f, -3.0f, 10.0f, 10.0f, 1.0f, -1.0f, -3.0f),
        /* right wall of the house(small triangle strip)*/
        Vertex_Buffer(1.0f, 1.0f, 1.0f, 0.0f, 10.0f, 1.0f, 1.0f, 1.0f),
        Vertex_Buffer(1.0f, 2.5f, -1.0f, 0.0f, 0.0f, 1.0f, 2.5f, -1.0f),
        Vertex_Buffer(1.0f, 1.0f, -3.0f, 10.0f, 0.0f, 1.0f, 1.0f, -3.0f),
        /* Left wall of the house*/
        Vertex_Buffer(-4.0f, -1.0f, -3.0f, 0.0f, 10.0f, -4.0f, -1.0f, -3.0f),
        Vertex_Buffer(-4.0f, 1.0f, -3.0f, 0.0f, 0.0f, -4.0f, 1.0f, -3.0f),
        Vertex_Buffer(-4.0f, 1.0f, 1.0f, 10.0f, 0.0f, -4.0f, -1.0f, 1.0f),
        Vertex_Buffer(-4.0f, -1.0f, 1.0f, 10.0f, 10.0f, -4.0f, -1.0f, 1.0f),
        /* Left wall of the house (triangle strip)*/
        Vertex_Buffer(-4.0f, 1.0f, 1.0f, 0.0f, 10.0f, -4.0f, 1.0f, 1.0f),
        Vertex_Buffer(-4.0f, 2.5f, -1.0f, 0.0f, 0.0f, -4.0f, 2.5f, -1.0f),
        Vertex_Buffer(-4.0f, 1.0f, -3.0f, 10.0f, 0.0f, -4.0f, 1.0f, -3.0f),
        /* Back side of the house*/
        Vertex_Buffer(-4.0f, -1.0f, -3.0f, 0.0f, 10.0f, -4.0f, -1.0f, -3.0f),
        Vertex_Buffer(-4.0f, 1.0f, -3.0f, 0.0f, 0.0f, -4.0f, 1.0f, -3.0f),
        Vertex_Buffer(1.0f, 1.0f, -3.0f, 10.0f, 0.0f, 1.0f, 1.0f, -3.0f),
        Vertex_Buffer(1.0f, -1.0f, -3.0f, 10.0f, 10.0f, 1.0f, -1.0f, -3.0f),
    };

    ZeroMemory(&VertexBufferData, sizeof(VertexBufferData));
    VertexBufferData.pSysMem = Vertices;
    device->CreateBuffer(&VertexBufferDesc, &VertexBufferData, &HouseVertexBuffer);
}
I am not sure if this will help, but below is how I implement the spotlight technique in my pixel shader.
struct Light
{
    float3 SpotLight_Position;
    float range;
    float3 SpotLight_Direction;
    float cone;
    float3 attenuation;
    float3 directional;
    float4 ambient;
    float4 diffuse;
};

cbuffer Constant_Buffer
{
    float4x4 TRANSFORMEDMATRIX; /* The final transformed matrix */
    float4x4 WORLDSPACE;
    Light LIGHT;
};

struct Vertex_Shader_Output
{
    float4 Positions : SV_POSITION;
    float2 TextureCoord : TEXTURECOORD;
    float4 WorldSpace : POSITION;
    float3 normal : NORMAL;
};

struct Sky_Vertex_Shader_Output
{
    float4 Positions : SV_POSITION;
    float3 TextureCoord : TEXTURECOORD;
};

Texture2D Texture; /* Shader Resource for Pixel Shader*/
SamplerState Sampler; /* Shader Resource for Pixel Shader*/

/* PIXEL SHADER THAT WILL BE USED TO CREATE THE FLASH LIGHT*/
float4 Pixelshader(Vertex_Shader_Output input) : SV_TARGET
{
    float4 TextureFormat;
    float3 lightToPixelVector;
    float3 finalAmbient;
    float HowMuchLight;
    float distance;
    float3 FinalColor = float3 (0.0f, 0.0f, 0.0f);

    /* Sampling the texture and storing the format into an object*/
    TextureFormat = Texture.Sample(Sampler, input.TextureCoord);

    /* Scaling the normal vector to a unit length*/
    input.normal = normalize(input.normal);

    /* Creating a vector between the light source and the pixel positions of every object*/
    lightToPixelVector = LIGHT.SpotLight_Position - input.WorldSpace;

    /* Getting the actual distance between the light source and the pixel position*/
    distance = length(lightToPixelVector);

    /* Adding the ambient and the colors of the texture*/
    finalAmbient = TextureFormat * LIGHT.ambient;

    /* If the pixel is too far from the light source*/
    if (distance > LIGHT.range)
    {
        /* Return the object's color without the light source*/
        return float4(finalAmbient, TextureFormat.a);
    }

    /* Normalizing the vector to make sure it's a unit length*/
    lightToPixelVector = normalize(lightToPixelVector);

    /* Getting the angle between the light source and the vertex normal to see how much light that pixel will receive*/
    HowMuchLight = dot(lightToPixelVector, input.normal);

    if (HowMuchLight > 0.0f)
    {
        /* Adding the diffuse and colors of the texture to make the final color*/
        FinalColor += TextureFormat * LIGHT.diffuse;

        /* Calculating the attenuation for the Final Color*/
        FinalColor /= (LIGHT.attenuation[0] + (LIGHT.attenuation[1] * distance)) + (LIGHT.attenuation[2] * (distance * distance));

        /* Scaling by the spotlight cone falloff*/
        FinalColor *= pow(max(dot(-lightToPixelVector, LIGHT.SpotLight_Direction), 0.0f), LIGHT.cone);
    }

    FinalColor = saturate(FinalColor + finalAmbient);

    /* Returning the final colors of the texture and the alpha value*/
    return float4(FinalColor, TextureFormat.a);
}
Easiest way to draw a scalar and vector field with C++?
What is the easiest way, while still being decently fast, to draw a grid (say, 100 x 100) of scalar values as colors and vectors as lines from arrays of values in C++? Pure OpenGL? I am planning on using either basic OpenGL or SDL. This Windows program will be a real-time demonstration (not a static image) and should be able to handle user (cursor) input. I do not think OpenGL alone can handle input.
In OpenGL you can create an array of floats that represents a bitmap image:
float computations[10000][3] =
{
    { 1, 0.5, 0.5 }, // ..... 100 elements per row
    // ..............................
    // 100 rows
};
Then call the OpenGL function
glDrawPixels(100 /*width*/, 100 /*height*/, GL_RGB, GL_FLOAT, computations);
or use glTexImage2D():
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, 100 /*width*/, 100 /*height*/, 0 /*border*/, GL_RGB, GL_FLOAT, computations);
It is very important to keep the float data in the range [0, 1]. For example, this code draws an 8x8 image:
float pixels[64][3] =
{
    { 0.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 1.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 0.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 1.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f },
    { 0.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 1.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 0.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 1.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f },
    { 0.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 1.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 0.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 1.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f },
    { 0.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 1.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 0.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 1.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f },
    { 0.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 1.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 0.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 1.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f },
    { 0.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 1.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 0.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 1.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f },
    { 0.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 1.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 0.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 1.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f },
    { 0.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 1.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 0.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }, { 1.0f, 1.0f, 1.0f }, { 0.0f, 0.0f, 1.0f }
};
void RenderScene(void)
{
    glClear(GL_COLOR_BUFFER_BIT);
    glRasterPos2i(0, 0);
    glDrawPixels(8, 8, GL_RGB, GL_FLOAT, pixels);
    glFlush();
}
The resulting picture can be scaled with glPixelZoom, or magnified by texture filtering. Here is an example of texture loading:
glGenTextures(1, &g_nTexture);
glBindTexture(GL_TEXTURE_2D, g_nTexture);
glTexParameteri(GL_TEXTURE_2D, GL_GENERATE_MIPMAP, GL_TRUE);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, 8, 8, 0, GL_RGB, GL_FLOAT, pixels);
glEnable(GL_TEXTURE_2D);
With filtering, the result shows smooth color transitions. There are other ways to do this with shaders. Vectors can be drawn separately, each as a single line with a cone on its end, positioned by affine transformations. All of this gives an acceptable level of performance for a real-time application, and leaves plenty of room for user interaction, because the OS handles an OpenGL window as a native window with full support for OS events.
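Since glDrawPixels with GL_FLOAT expects color components in [0, 1], raw scalar values usually need to be normalized first. The function below is a minimal sketch of one way to do that, not part of the original answer; scalarsToRgb is a hypothetical name and the blue-to-red ramp is an arbitrary choice.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Maps each scalar to an RGB triple in [0, 1]: the smallest value becomes
// pure blue, the largest pure red, with a linear blend in between.
std::vector<float> scalarsToRgb(const std::vector<float>& values)
{
    std::vector<float> rgb;
    rgb.reserve(values.size() * 3);
    if (values.empty())
        return rgb;

    const auto mm = std::minmax_element(values.begin(), values.end());
    const float minV = *mm.first;
    const float range = *mm.second - minV;

    for (float v : values) {
        // Guard against a constant field, where range would be zero.
        const float t = range > 0.0f ? (v - minV) / range : 0.0f;
        rgb.push_back(t);         // red
        rgb.push_back(0.0f);      // green
        rgb.push_back(1.0f - t);  // blue
    }
    return rgb;
}
```

For a 100 x 100 grid stored row by row, the result could then be passed as rgb.data() to glDrawPixels(100, 100, GL_RGB, GL_FLOAT, ...) in place of the hard-coded array.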
SDL_FillRect and SDL_RenderDrawLine are fast enough for this purpose.
3D Convolution with CUDA using shared memory
I'm currently trying to adapt the 2D convolution code from THIS question to 3D and having trouble trying to understand where my error is. My 2D Code looks like this: #include <iostream> #define MASK_WIDTH 3 #define MASK_RADIUS MASK_WIDTH / 2 #define TILE_WIDTH 8 #define W (TILE_WIDTH + MASK_WIDTH - 1) /** * GPU 2D Convolution using shared memory */ __global__ void convolution(float *I, float* M, float *P, int width, int height) { /***** WRITE TO SHARED MEMORY *****/ __shared__ float N_ds[W][W]; // First batch loading int dest = threadIdx.x + (threadIdx.y * TILE_WIDTH); int destY = dest / W; int destX = dest % W; int srcY = destY + (blockIdx.y * TILE_WIDTH) - MASK_RADIUS; int srcX = destX + (blockIdx.x * TILE_WIDTH) - MASK_RADIUS; int src = srcX + (srcY * width); if(srcY >= 0 && srcY < height && srcX >= 0 && srcX < width) N_ds[destY][destX] = I[src]; else N_ds[destY][destX] = 0; // Second batch loading dest = threadIdx.x + (threadIdx.y * TILE_WIDTH) + TILE_WIDTH * TILE_WIDTH; destY = dest / W; destX = dest % W; srcY = destY + (blockIdx.y * TILE_WIDTH) - MASK_RADIUS; srcX = destX + (blockIdx.x * TILE_WIDTH) - MASK_RADIUS; src = srcX + (srcY * width); if(destY < W) { if(srcY >= 0 && srcY < height && srcX >= 0 && srcX < width) N_ds[destY][destX] = I[src]; else N_ds[destY][destX] = 0; } __syncthreads(); /***** Perform Convolution *****/ float sum = 0; int y; int x; for(y = 0; y < MASK_WIDTH; y++) for(x = 0; x < MASK_WIDTH; x++) sum = sum + N_ds[threadIdx.y + y][threadIdx.x + x] * M[x + (y * MASK_WIDTH)]; y = threadIdx.y + (blockIdx.y * TILE_WIDTH); x = threadIdx.x + (blockIdx.x * TILE_WIDTH); if(y < height && x < width) P[x + (y * width)] = sum; __syncthreads(); } int main(int argc, char* argv[]) { int image_width = 16; int image_height = 16; float *deviceInputImageData; float *deviceOutputImageData; float *deviceMaskData; float data[] = { 1.0f, 1.0f, 1.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 2.0f, 2.0f, 2.0f, 2.0f, 1.0f, 4.0f, 
1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 3.0f, 3.0f, 3.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 4.0f, 4.0f, 4.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 5.0f, 5.0f, 5.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 6.0f, 6.0f, 6.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 7.0f, 7.0f, 7.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 8.0f, 8.0f, 8.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 9.0f, 9.0f, 9.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 10.0f, 10.0f, 10.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 11.0f, 11.0f, 11.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 12.0f, 12.0f, 12.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 13.0f, 13.0f, 13.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 14.0f, 14.0f, 14.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 15.0f, 15.0f, 15.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 16.0f, 16.0f, 16.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f }; float mask[] = { 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f }; // CHECK CHECK CHECK CHECK CHECK int shared_memory_size = W * W; int block_size = TILE_WIDTH * TILE_WIDTH; int max_size = 2 * block_size; std::cout << "Block Size: " << block_size << " - Shared Memory Size: " << shared_memory_size << " - Max Size: " << max_size << std::endl; std::cout << "SHARED MEMORY SIZE HAS TO BE SMALLER THAN MAX SIZE IN ORDER TO WORK PROPERLY !!!!!!!"; cudaMalloc((void **)&deviceInputImageData, image_width * image_height * sizeof(float)); cudaMalloc((void 
**)&deviceOutputImageData, image_width * image_height * sizeof(float)); cudaMalloc((void **)&deviceMaskData, MASK_WIDTH * MASK_WIDTH * sizeof(float)); cudaMemcpy(deviceInputImageData, data, image_width * image_height * sizeof(float), cudaMemcpyHostToDevice); cudaMemcpy(deviceMaskData, mask, MASK_WIDTH * MASK_WIDTH * sizeof(float), cudaMemcpyHostToDevice); dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1); dim3 dimGrid((image_width + TILE_WIDTH - 1) / TILE_WIDTH, (image_height + TILE_WIDTH - 1) / TILE_WIDTH); convolution<<<dimGrid, dimBlock>>>(deviceInputImageData, deviceMaskData, deviceOutputImageData, image_width, image_height); cudaDeviceSynchronize(); cudaMemcpy(data, deviceOutputImageData, image_width * image_height * sizeof(float), cudaMemcpyDeviceToHost); // Print data for(int i = 0; i < image_width * image_height; ++i) { if(i % image_width == 0) { std::cout << std::endl; } std::cout << data[i] << " - "; } cudaFree(deviceInputImageData); cudaFree(deviceOutputImageData); cudaFree(deviceMaskData); return 0; } And the 3D equivalent: #include <iostream> #define MASK_WIDTH 3 #define MASK_RADIUS MASK_WIDTH / 2 #define TILE_WIDTH 8 #define W (TILE_WIDTH + MASK_WIDTH - 1) /** * GPU 2D Convolution using shared memory */ __global__ void convolution(float *I, float* M, float *P, int width, int height, int depth) { /***** WRITE TO SHARED MEMORY *****/ __shared__ float N_ds[W][W][W]; // First batch loading int dest = threadIdx.x + (threadIdx.y * TILE_WIDTH) + (threadIdx.z * TILE_WIDTH * TILE_WIDTH); int destTmp = dest; int destX = destTmp % W; destTmp = destTmp / W; int destY = destTmp % W; destTmp = destTmp / W; int destZ = destTmp; int srcZ = destZ + (blockIdx.z * TILE_WIDTH) - MASK_RADIUS; int srcY = destY + (blockIdx.y * TILE_WIDTH) - MASK_RADIUS; int srcX = destX + (blockIdx.x * TILE_WIDTH) - MASK_RADIUS; int src = srcX + (srcY * width) + (srcZ * width * height); if(srcZ >= 0 && srcZ < depth && srcY >= 0 && srcY < height && srcX >= 0 && srcX < width) N_ds[destZ][destY][destX] 
= I[src]; else N_ds[destZ][destY][destX] = 0; // Second batch loading dest = threadIdx.x + (threadIdx.y * TILE_WIDTH) + (threadIdx.z * TILE_WIDTH * TILE_WIDTH) + TILE_WIDTH * TILE_WIDTH; destTmp = dest; destX = destTmp % W; destTmp = destTmp / W; destY = destTmp % W; destTmp = destTmp / W; destZ = destTmp; srcZ = destZ + (blockIdx.z * TILE_WIDTH) - MASK_RADIUS; srcY = destY + (blockIdx.y * TILE_WIDTH) - MASK_RADIUS; srcX = destX + (blockIdx.x * TILE_WIDTH) - MASK_RADIUS; src = srcX + (srcY * width) + (srcZ * width * height); if(destZ < W) { if(srcZ >= 0 && srcZ < depth && srcY >= 0 && srcY < height && srcX >= 0 && srcX < width) N_ds[destZ][destY][destX] = I[src]; else N_ds[destZ][destY][destX] = 0; } __syncthreads(); /***** Perform Convolution *****/ float sum = 0; int z; int y; int x; for(z = 0; z < MASK_WIDTH; z++) for(y = 0; y < MASK_WIDTH; y++) for(x = 0; x < MASK_WIDTH; x++) sum = sum + N_ds[threadIdx.z + z][threadIdx.y + y][threadIdx.x + x] * M[x + (y * MASK_WIDTH) + (z * MASK_WIDTH * MASK_WIDTH)]; z = threadIdx.z + (blockIdx.z * TILE_WIDTH); y = threadIdx.y + (blockIdx.y * TILE_WIDTH); x = threadIdx.x + (blockIdx.x * TILE_WIDTH); if(z < depth && y < height && x < width) P[x + (y * width) + (z * width * height)] = sum; __syncthreads(); } int main(int argc, char* argv[]) { int image_width = 16; int image_height = 16; int image_depth = 5; float *deviceInputImageData; float *deviceOutputImageData; float *deviceMaskData; float data[] = { 1.0f, 1.0f, 1.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 2.0f, 2.0f, 2.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 3.0f, 3.0f, 3.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 4.0f, 4.0f, 4.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 5.0f, 5.0f, 5.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 6.0f, 6.0f, 6.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 
1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 7.0f, 7.0f, 7.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 8.0f, 8.0f, 8.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 9.0f, 9.0f, 9.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 10.0f, 10.0f, 10.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 11.0f, 11.0f, 11.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 12.0f, 12.0f, 12.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 13.0f, 13.0f, 13.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 14.0f, 14.0f, 14.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 15.0f, 15.0f, 15.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 16.0f, 16.0f, 16.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 2.0f, 2.0f, 2.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 3.0f, 3.0f, 3.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 4.0f, 4.0f, 4.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 5.0f, 5.0f, 5.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 6.0f, 6.0f, 6.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 7.0f, 7.0f, 7.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 8.0f, 8.0f, 8.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 9.0f, 9.0f, 9.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 10.0f, 10.0f, 10.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 11.0f, 11.0f, 
11.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 12.0f, 12.0f, 12.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 13.0f, 13.0f, 13.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 14.0f, 14.0f, 14.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 15.0f, 15.0f, 15.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 16.0f, 16.0f, 16.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 2.0f, 2.0f, 2.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 3.0f, 3.0f, 3.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 4.0f, 4.0f, 4.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 5.0f, 5.0f, 5.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 6.0f, 6.0f, 6.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 7.0f, 7.0f, 7.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 8.0f, 8.0f, 8.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 9.0f, 9.0f, 9.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 10.0f, 10.0f, 10.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 11.0f, 11.0f, 11.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 12.0f, 12.0f, 12.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 13.0f, 13.0f, 13.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 14.0f, 14.0f, 14.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 15.0f, 15.0f, 15.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 
1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 16.0f, 16.0f, 16.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 2.0f, 2.0f, 2.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 3.0f, 3.0f, 3.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 4.0f, 4.0f, 4.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 5.0f, 5.0f, 5.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 6.0f, 6.0f, 6.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 7.0f, 7.0f, 7.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 8.0f, 8.0f, 8.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 9.0f, 9.0f, 9.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 10.0f, 10.0f, 10.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 11.0f, 11.0f, 11.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 12.0f, 12.0f, 12.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 13.0f, 13.0f, 13.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 14.0f, 14.0f, 14.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 15.0f, 15.0f, 15.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 16.0f, 16.0f, 16.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 2.0f, 2.0f, 2.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 3.0f, 3.0f, 3.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 4.0f, 4.0f, 
4.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 5.0f, 5.0f, 5.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 6.0f, 6.0f, 6.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 7.0f, 7.0f, 7.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 8.0f, 8.0f, 8.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 9.0f, 9.0f, 9.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 10.0f, 10.0f, 10.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 11.0f, 11.0f, 11.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 12.0f, 12.0f, 12.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 13.0f, 13.0f, 13.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 14.0f, 14.0f, 14.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 15.0f, 15.0f, 15.0f, 1.0f, 3.0f, 1.0f, 5.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 16.0f, 16.0f, 16.0f, 2.0f, 1.0f, 4.0f, 1.0f, 6.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f }; float mask[] = { 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f }; // CHECK CHECK CHECK CHECK CHECK int shared_memory_size = W * W * W; int block_size = TILE_WIDTH * TILE_WIDTH * TILE_WIDTH; int max_size = 3 * block_size; std::cout << "Block Size: " << block_size << " - Shared Memory Size: " << shared_memory_size << " - Max Size: " << max_size << std::endl; std::cout << "SHARED MEMORY SIZE HAS TO BE SMALLER THAN MAX SIZE IN ORDER TO WORK PROPERLY !!!!!!!"; cudaMalloc((void **)&deviceInputImageData, image_width * image_height * image_depth * sizeof(float)); cudaMalloc((void **)&deviceOutputImageData, image_width * 
image_height * image_depth * sizeof(float));
cudaMalloc((void **)&deviceMaskData, MASK_WIDTH * MASK_WIDTH * MASK_WIDTH * sizeof(float));
cudaMemcpy(deviceInputImageData, data, image_width * image_height * image_depth * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(deviceMaskData, mask, MASK_WIDTH * MASK_WIDTH * MASK_WIDTH * sizeof(float), cudaMemcpyHostToDevice);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid((image_width + TILE_WIDTH - 1) / TILE_WIDTH, (image_height + TILE_WIDTH - 1) / TILE_WIDTH, (image_depth + TILE_WIDTH - 1) / TILE_WIDTH);
convolution<<<dimGrid, dimBlock>>>(deviceInputImageData, deviceMaskData, deviceOutputImageData, image_width, image_height, image_depth);
cudaDeviceSynchronize();
cudaMemcpy(data, deviceOutputImageData, image_width * image_height * image_depth * sizeof(float), cudaMemcpyDeviceToHost);
// Print data
for(int i = 0; i < image_width * image_height * image_depth; ++i)
{
if((i % image_width) == 0) std::cout << std::endl;
if((i % (image_width * image_height)) == 0) std::cout << std::endl;
std::cout << data[i] << " - ";
}
cudaFree(deviceInputImageData);
cudaFree(deviceOutputImageData);
cudaFree(deviceMaskData);
return 0;
}
With a TILE_WIDTH of 8, the convolution partially works: the second and third layers of the output are identical and their values look correct. For the 3D case, I calculated the destX, destY and destZ indices according to this explanation. The other change is the if-condition for the second batch load, if(destZ < W), which now tests destZ instead of destY. My question is: what causes the incorrect values in layers 4 and 5 of the output? I suspect I am missing something about how large TILE_WIDTH must be for this to work properly.
Based on this answer, I wrote the following check for the 2D case, because every thread is supposed to perform at least 2 loads from global to shared memory:
// CHECK CHECK CHECK CHECK CHECK
int shared_memory_size = W * W;
int block_size = TILE_WIDTH * TILE_WIDTH;
int max_size = 2 * block_size;
std::cout << "Block Size: " << block_size << " - Shared Memory Size: " << shared_memory_size << " - Max Size: " << max_size << std::endl;
std::cout << "SHARED MEMORY SIZE HAS TO BE SMALLER THAN MAX SIZE IN ORDER TO WORK PROPERLY !!!!!!!";
Does this also apply in the 3D case, and if so, did I adapt it correctly in my 3D check?
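It turns out I adapted the check correctly, apart from one silly mistake in the offset for the second batch load:
// Second batch loading
dest = threadIdx.x + (threadIdx.y * TILE_WIDTH) + (threadIdx.z * TILE_WIDTH * TILE_WIDTH) + TILE_WIDTH * TILE_WIDTH;
I forgot one factor of TILE_WIDTH. The block has TILE_WIDTH * TILE_WIDTH * TILE_WIDTH threads, so the first batch already fills that many entries and the second batch has to start at that offset:
// Second batch loading
dest = threadIdx.x + (threadIdx.y * TILE_WIDTH) + (threadIdx.z * TILE_WIDTH * TILE_WIDTH) + TILE_WIDTH * TILE_WIDTH * TILE_WIDTH;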