What this is supposed to do is create a plane that moves in a sine-wave pattern (code that has nothing to do with making it a wave is not shown, as it all works fine). The plane renders fine, but it just stays flat. I'm assuming the problem is that the frame time isn't reaching the vertex shader, so I'd like to know how to do that, because I'm confused as to why this isn't working.
(Screenshot: the plane I'm trying to make into a wave.)
manipulation_shader_vs.hlsl
cbuffer TimeBufferType : register(b1)
{
    float time;
};

OutputType main(InputType input)
{
    OutputType output;

    input.position.y = sin(input.position.x + time);
    input.normal = float3(-cos(input.position.x + time), 1, 0);
    ...
timer.cpp
void Timer::frame()
{
    INT64 currentTime;
    float timeDifference;

    // Query the current time.
    QueryPerformanceCounter((LARGE_INTEGER*)&currentTime);
    timeDifference = (float)(currentTime - startTime);
    frameTime = timeDifference / ticksPerS;

    // Calc FPS
    frames += 1.f;
    elapsedTime += frameTime;
    if (elapsedTime > 1.0f)
    {
        fps = frames / elapsedTime;
        frames = 0.0f;
        elapsedTime = 0.0f;
    }

    // Restart the timer.
    startTime = currentTime;
    return;
}

float Timer::getTime()
{
    return frameTime;
}
manipulationshader.cpp
void ManipulationShader::setShaderParameters(ID3D11DeviceContext* deviceContext, const XMMATRIX& worldMatrix, const XMMATRIX& viewMatrix, const XMMATRIX& projectionMatrix, ID3D11ShaderResourceView* texture, Light* light, Timer* time)
{
    TimeBufferType* timePtr;

    deviceContext->Map(TimeBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mappedResource);
    timePtr = (TimeBufferType*)mappedResource.pData;
    timePtr->time = time->getTime();
    deviceContext->Unmap(TimeBuffer, 0);
    deviceContext->PSSetConstantBuffers(0, 1, &TimeBuffer);
}
This is a classic constant-buffer binding mismatch, plus a missing bind. In your shader you have the constant buffer bound to slot 1:

cbuffer TimeBufferType : register(b1)

In your code, you are binding the constant buffer to slot 0:

deviceContext->PSSetConstantBuffers(0, 1, &TimeBuffer);

And, as noted in the comments by another Stack Overflow user, you are binding it only for the pixel shader, not the vertex shader. You are therefore also missing this method call (note slot 1, to match register(b1)):

deviceContext->VSSetConstantBuffers(1, 1, &TimeBuffer);
It's also important in Direct3D coding to always check functions that return HRESULT for success or failure; complex programs are very hard to debug otherwise. In many cases a fail-fast approach like ThrowIfFailed works well. You can also use a conditional like the following:
D3D11_MAPPED_SUBRESOURCE mappedResource = {};
if (SUCCEEDED(deviceContext->Map(TimeBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0,
    &mappedResource)))
{
    auto timePtr = reinterpret_cast<TimeBufferType*>(mappedResource.pData);
    timePtr->time = time->getTime();
    deviceContext->Unmap(TimeBuffer, 0);
}
See Microsoft Docs.
One additional recommendation: take a look at StepTimer for your 'animation time' computation. Time computation is quite tricky, and there are a number of edge cases you are not dealing with. For more, see this blog post.
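One concrete edge case here: Timer::getTime() returns frameTime, which is the per-frame delta, so sin(input.position.x + time) gets an almost constant phase and the wave will barely move even once the buffer is bound correctly. A minimal sketch of accumulating a running total instead (totalTime is a hypothetical new member, assumed zero-initialised):

void Timer::frame()
{
    INT64 currentTime;
    QueryPerformanceCounter((LARGE_INTEGER*)&currentTime);

    frameTime = (float)(currentTime - startTime) / ticksPerS;
    totalTime += frameTime;   // hypothetical member: total elapsed seconds
    startTime = currentTime;
    // ... FPS bookkeeping as before ...
}

float Timer::getTotalTime()   // feed this to the shader instead of frameTime
{
    return totalTime;
}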
Related
I'm testing writing to 2D and 3D textures in compute shaders, outputting a gradient-noise texture consisting of 32-bit floats. Writing to a 2D texture works fine, but writing to a 3D texture doesn't. Are there additional considerations that need to be made when creating a 3D texture compared to a 2D texture?
Code of how I'm defining the 3D texture below:
HRESULT BaseComputeShader::CreateTexture3D(UINT width, UINT height, UINT depth, DXGI_FORMAT format, ID3D11Texture3D** texture)
{
    D3D11_TEXTURE3D_DESC textureDesc;
    ZeroMemory(&textureDesc, sizeof(textureDesc));
    textureDesc.Width = width;
    textureDesc.Height = height;
    textureDesc.Depth = depth;
    textureDesc.MipLevels = 1;
    textureDesc.Format = format;
    textureDesc.Usage = D3D11_USAGE_DEFAULT;
    textureDesc.BindFlags = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
    textureDesc.CPUAccessFlags = 0;
    textureDesc.MiscFlags = 0;
    return renderer->CreateTexture3D(&textureDesc, 0, texture);
}
HRESULT BaseComputeShader::CreateTexture3DUAV(UINT depth, DXGI_FORMAT format, ID3D11Texture3D** texture, ID3D11UnorderedAccessView** unorderedAccessView)
{
    D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc;
    ZeroMemory(&uavDesc, sizeof(uavDesc));
    uavDesc.Format = format;
    uavDesc.ViewDimension = D3D11_UAV_DIMENSION_TEXTURE3D;
    uavDesc.Texture3D.MipSlice = 0;
    uavDesc.Texture3D.FirstWSlice = 0;
    uavDesc.Texture3D.WSize = depth;
    return renderer->CreateUnorderedAccessView(*texture, &uavDesc, unorderedAccessView);
}
HRESULT BaseComputeShader::CreateTexture3DSRV(DXGI_FORMAT format, ID3D11Texture3D** texture, ID3D11ShaderResourceView** shaderResourceView)
{
    D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc;
    ZeroMemory(&srvDesc, sizeof(srvDesc));
    srvDesc.Format = format;
    srvDesc.ViewDimension = D3D11_SRV_DIMENSION_TEXTURE3D;
    srvDesc.Texture3D.MostDetailedMip = 0;
    srvDesc.Texture3D.MipLevels = 1;
    return renderer->CreateShaderResourceView(*texture, &srvDesc, shaderResourceView);
}
And how I'm writing to it in the compute shader:
// The texture we're writing to
RWTexture3D<float> outputTexture : register(u0);

[numthreads(8, 8, 8)]
void main(uint3 DTid : SV_DispatchThreadID)
{
    float noiseValue = 0.0f;
    float value = 0.0f;
    float localAmplitude = amplitude;
    float localFrequency = frequency;

    // Loop over the octaves, running the noise function as many times as desired (8 is usually sufficient)
    for (int k = 0; k < octaves; k++)
    {
        noiseValue = noise(float3(DTid.x * localFrequency, DTid.y * localFrequency, DTid.z * localFrequency)) * localAmplitude;
        value += noiseValue;

        // Scale the amplitude by the persistence/gain value;
        // localAmplitude gets smaller as the layer count (i.e. k) increases
        localAmplitude *= persistence;

        // Scale the frequency by a lacunarity value of 2.0,
        // giving a frequency of f * 2^k on iteration k
        localFrequency *= 2.0f;
    }

    // Output the value to the 3D index in the texture given by the thread index
    outputTexture[DTid.xyz] = value;
}
And finally, how I'm running the shader:
// Set the shader
deviceContext->CSSetShader(computeShader, nullptr, 0);

// Set the shader's buffers and views
deviceContext->CSSetConstantBuffers(0, 1, &cBuffer);
deviceContext->CSSetUnorderedAccessViews(0, 1, &textureUAV, nullptr);

// Launch the shader
deviceContext->Dispatch(512, 512, 512);

// Reset the shader now we're done
deviceContext->CSSetShader(nullptr, nullptr, 0);

// Reset the shader views
ID3D11UnorderedAccessView* ppUAViewnullptr[1] = { nullptr };
deviceContext->CSSetUnorderedAccessViews(0, 1, ppUAViewnullptr, nullptr);

// Create the shader resource view for access in other shaders
HRESULT result = CreateTexture3DSRV(DXGI_FORMAT_R32_FLOAT, &texture, &textureSRV);
if (result != S_OK)
{
    MessageBox(NULL, L"Failed to create texture SRV after compute shader execution", L"Failed", MB_OK);
    exit(0);
}
My bad, simple mistake. Compute shader thread counts are limited: a single thread group is limited to a total of 1024 threads, and a Dispatch call cannot exceed 65535 thread groups per dimension. The HLSL compiler will catch the former issue, but the C++ compiler will not catch the latter, since the group counts are runtime values.
If you create a texture of 512 * 512 * 512 (which seems what you are trying to achieve), your dispatch needs to be divided by groups:
deviceContext->Dispatch(512 / 8, 512 / 8, 512 / 8);
In your previous case, the dispatch was:

(512 * 8) * (512 * 8) * (512 * 8) = 4096 * 4096 * 4096 = 68,719,476,736 threads

which very likely triggered timeout detection (TDR) and crashed the driver.
Also, the 65535 limit is per dimension, so at 64 groups per dimension you are well within it.
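If the texture dimensions are ever not a multiple of the 8x8x8 group size, the usual pattern is to round the group counts up (a sketch; width, height and depth stand for the texture dimensions, as in the CreateTexture3D helper above):

// Integer round-up division so partial tiles at the edges are still dispatched.
UINT groupsX = (width  + 7) / 8;
UINT groupsY = (height + 7) / 8;
UINT groupsZ = (depth  + 7) / 8;
deviceContext->Dispatch(groupsX, groupsY, groupsZ);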
And one last thing: you can create both the shader resource view and the unordered access view right after creating your 3D texture (before the dispatch call). This is generally recommended, to avoid mixing context code with resource-creation code.
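Concretely, the init-time sequence could look like this (a sketch reusing the helper functions from the question, assuming the 512^3 R32_FLOAT texture from the original code):

// Create the texture and both views once, at initialisation time.
HRESULT hr = CreateTexture3D(512, 512, 512, DXGI_FORMAT_R32_FLOAT, &texture);
if (SUCCEEDED(hr))
    hr = CreateTexture3DUAV(512, DXGI_FORMAT_R32_FLOAT, &texture, &textureUAV);
if (SUCCEEDED(hr))
    hr = CreateTexture3DSRV(DXGI_FORMAT_R32_FLOAT, &texture, &textureSRV);
// Per frame: bind, Dispatch, unbind. No resource creation in the frame loop.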
On resource creation, your check is not valid either:

if (result != S_OK)

The HRESULT success condition is >= 0, and there are success codes other than S_OK. Use the built-in macros instead, e.g.:

if (FAILED(result))
I've just started using DirectCompute in an attempt to move a fluid simulation I have been working on onto the GPU. I found a very similar (if not identical) question here; however, the resolution to my problem is not the same as theirs. I definitely have my CopyResource the right way round. As with the linked question, I only get a buffer filled with 0s when copying back from the GPU. I really can't see the error, as I don't understand how I could be reaching out-of-bounds limits. I'm going to apologise for the mass of code pasting about to occur, but I want to be sure I haven't got any of the setup wrong.
Output Buffer, UAV and System Buffer set up
outputDesc.Usage = D3D11_USAGE_DEFAULT;
outputDesc.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
outputDesc.ByteWidth = sizeof(BoundaryConditions) * numElements;
outputDesc.CPUAccessFlags = 0;
outputDesc.StructureByteStride = sizeof(BoundaryConditions);
outputDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
result = _device->CreateBuffer(&outputDesc, 0, &m_outputBuffer);

outputDesc.Usage = D3D11_USAGE_STAGING;
outputDesc.BindFlags = 0;
outputDesc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
result = _device->CreateBuffer(&outputDesc, 0, &m_outputresult);

D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc;
uavDesc.Format = DXGI_FORMAT_UNKNOWN;
uavDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
uavDesc.Buffer.FirstElement = 0;
uavDesc.Buffer.Flags = 0;
uavDesc.Buffer.NumElements = numElements;
result = _device->CreateUnorderedAccessView(m_outputBuffer, &uavDesc, &m_BoundaryConditionsUAV);
Running the Shader in my frame loop
HRESULT result;
D3D11_MAPPED_SUBRESOURCE mappedResource;

_deviceContext->CSSetShader(m_BoundaryConditionsCS, nullptr, 0);
_deviceContext->CSSetUnorderedAccessViews(0, 1, &m_BoundaryConditionsUAV, 0);
_deviceContext->Dispatch(1, 1, 1);

// Unbind output from compute shader
ID3D11UnorderedAccessView* nullUAV[] = { NULL };
_deviceContext->CSSetUnorderedAccessViews(0, 1, nullUAV, 0);

// Disable Compute Shader
_deviceContext->CSSetShader(nullptr, nullptr, 0);

_deviceContext->CopyResource(m_outputresult, m_outputBuffer);

D3D11_MAPPED_SUBRESOURCE mappedData;
result = _deviceContext->Map(m_outputresult, 0, D3D11_MAP_READ, 0, &mappedData);
BoundaryConditions* newbc = reinterpret_cast<BoundaryConditions*>(mappedData.pData);
for (int i = 0; i < 4; i++)
{
    Debug::Instance()->Log(newbc[i].x.x);
}
_deviceContext->Unmap(m_outputresult, 0);
HLSL
struct BoundaryConditions
{
    float3 x;
    float3 y;
};

RWStructuredBuffer<BoundaryConditions> _boundaryConditions;

[numthreads(4, 1, 1)]
void ComputeBoundaryConditions(int3 id : SV_DispatchThreadID)
{
    _boundaryConditions[id.x].x = float3(id.x, id.y, id.z);
}
I dispatch the compute shader after I begin a frame and before I end the frame. I have played around with moving the dispatch call outside of the end-scene and before the present etc., but nothing seems to affect the result. Can't seem to figure this one out!
Holy smokes, I fixed the error! I was creating the compute shader into a different ID3D11ComputeShader pointer! Works like a charm! Phew, sorry and thanks Adam!
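For anyone skimming: the mistake was of this shape (hypothetical names; the shader bytecode is compiled into one member, while a different, never-assigned member is bound):

// The bug, in sketch form: create into one pointer...
_device->CreateComputeShader(csBlob->GetBufferPointer(), csBlob->GetBufferSize(),
                             nullptr, &m_SomeOtherCS);
// ...but bind another, which is still null, so the dispatch silently does nothing
// and the staging copy comes back full of zeros.
_deviceContext->CSSetShader(m_BoundaryConditionsCS, nullptr, 0);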
It's getting a bit frustrating trying to figure out why I have to call a DirectX 11 shader twice to see the desired result.
Here's my current state:
I have a 3d object built from vertex and index buffer. This object is then instanced several times. Later, there will be a lot more than just one object, but for now, I'm testing it with this one only. In my render routine, I iterate over all instances, change the world matrices (so that all object instances are put together and form "one big, whole object") and call the shader method to render the data to the screen.
Here's the code so far, which doesn't work:
m_pLevel->Simulate(0.1f);

std::list<CLevelElementInstance*>& lst = m_pLevel->GetInstances();

float x = -(*lst.begin())->GetPosition().x, y = -(*lst.begin())->GetPosition().y, z = -(*lst.begin())->GetPosition().z;

int i = 0;
for (std::list<CLevelElementInstance*>::iterator it = lst.begin(); it != lst.end(); it++)
{
    // Extract base element from current instance
    CLevelElement* elem = (*it)->GetBaseElement();

    // Write vertex and index buffer to video memory
    elem->Render(m_pDirect3D->GetDeviceContext());

    // Call shader
    m_pTextureShader->Render(m_pDirect3D->GetDeviceContext(), elem->GetIndexCount(), XMMatrixTranslation(x, y, z + (i * 8)), viewMatrix, projectionMatrix, elem->GetTexture());

    ++i;
}
My std::list consists of four 3D objects, which are all identical; they differ only in their position in 3D space. All of the objects are 8.0f x 8.0f x 8.0f, so for simplicity I just line them up (as can be seen on the shader render line, where I add 8 units on the Z axis per instance).
The result is the following: I only see two elements rendered on the screen. And between them, there's an empty space the size of an element.
At first, I thought I had made some errors with the 3D math, but after a lot of time spent debugging my code, I could not find any.
And here's the confusing part:
If I change the content of the for loop and add a second call to the shader, I suddenly see all four elements, and they're all at their correct positions in 3D space:
m_pLevel->Simulate(0.1f);

std::list<CLevelElementInstance*>& lst = m_pLevel->GetInstances();

float x = -(*lst.begin())->GetPosition().x, y = -(*lst.begin())->GetPosition().y, z = -(*lst.begin())->GetPosition().z;

int i = 0;
for (std::list<CLevelElementInstance*>::iterator it = lst.begin(); it != lst.end(); it++)
{
    // Extract base element from current instance
    CLevelElement* elem = (*it)->GetBaseElement();

    // Write vertex and index buffer to video memory
    elem->Render(m_pDirect3D->GetDeviceContext());

    // Call shader
    m_pTextureShader->Render(m_pDirect3D->GetDeviceContext(), elem->GetIndexCount(), XMMatrixTranslation(x, y, z + (i * 8)), viewMatrix, projectionMatrix, elem->GetTexture());

    // Call shader a second time - this seems to have no effect except to allow the next iteration to perform its shader rendering...
    m_pTextureShader->Render(m_pDirect3D->GetDeviceContext(), elem->GetIndexCount(), XMMatrixTranslation(x, y, z + (i * 8)), viewMatrix, projectionMatrix, elem->GetTexture());

    ++i;
}
Does anybody have an idea of what is going on here?
If it helps, here's the code of the shader:
bool CTextureShader::Render(ID3D11DeviceContext* _pDeviceContext, const int _IndexCount, XMMATRIX& _pWorldMatrix, XMMATRIX& _pViewMatrix, XMMATRIX& _pProjectionMatrix, ID3D11ShaderResourceView* _pTexture)
{
    bool result = SetShaderParameters(_pDeviceContext, _pWorldMatrix, _pViewMatrix, _pProjectionMatrix, _pTexture);
    if (!result)
        return false;
    RenderShader(_pDeviceContext, _IndexCount);
    return true;
}

bool CTextureShader::SetShaderParameters(ID3D11DeviceContext* _pDeviceContext, XMMATRIX& _WorldMatrix, XMMATRIX& _ViewMatrix, XMMATRIX& _ProjectionMatrix, ID3D11ShaderResourceView* _pTexture)
{
    HRESULT result;
    D3D11_MAPPED_SUBRESOURCE mappedResource;
    MatrixBufferType* dataPtr;
    unsigned int bufferNumber;

    _WorldMatrix = XMMatrixTranspose(_WorldMatrix);
    _ViewMatrix = XMMatrixTranspose(_ViewMatrix);
    _ProjectionMatrix = XMMatrixTranspose(_ProjectionMatrix);

    result = _pDeviceContext->Map(m_pMatrixBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mappedResource);
    if (FAILED(result))
        return false;
    dataPtr = (MatrixBufferType*)mappedResource.pData;
    dataPtr->world = _WorldMatrix;
    dataPtr->view = _ViewMatrix;
    dataPtr->projection = _ProjectionMatrix;
    _pDeviceContext->Unmap(m_pMatrixBuffer, 0);

    bufferNumber = 0;
    _pDeviceContext->VSSetConstantBuffers(bufferNumber, 1, &m_pMatrixBuffer);
    _pDeviceContext->PSSetShaderResources(0, 1, &_pTexture);
    return true;
}

void CTextureShader::RenderShader(ID3D11DeviceContext* _pDeviceContext, const int _IndexCount)
{
    _pDeviceContext->IASetInputLayout(m_pLayout);
    _pDeviceContext->VSSetShader(m_pVertexShader, NULL, 0);
    _pDeviceContext->PSSetShader(m_pPixelShader, NULL, 0);
    _pDeviceContext->PSSetSamplers(0, 1, &m_pSampleState);
    _pDeviceContext->DrawIndexed(_IndexCount, 0, 0);
}
If it helps, I can also post the code from the shaders here.
Any help would be appreciated - I'm totally stuck here :-(
The problem is that you are transposing your data every frame, so it's only "right" every other frame:
_WorldMatrix = XMMatrixTranspose(_WorldMatrix);
_ViewMatrix = XMMatrixTranspose(_ViewMatrix);
_ProjectionMatrix = XMMatrixTranspose(_ProjectionMatrix);
Instead, you should be doing something like:
XMMATRIX worldMatrix = XMMatrixTranspose(_WorldMatrix);
XMMATRIX viewMatrix = XMMatrixTranspose(_ViewMatrix);
XMMATRIX projectionMatrix = XMMatrixTranspose(_ProjectionMatrix);

result = _pDeviceContext->Map(m_pMatrixBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mappedResource);
if (FAILED(result))
    return false;

dataPtr = (MatrixBufferType*)mappedResource.pData;
dataPtr->world = worldMatrix;
dataPtr->view = viewMatrix;
dataPtr->projection = projectionMatrix;
_pDeviceContext->Unmap(m_pMatrixBuffer, 0);
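One way to make this class of bug impossible is to take the matrices by const reference, so the transpose has to go into a local copy (a sketch; DirectXMath also offers the FXMMATRIX/CXMMATRIX calling conventions for passing matrices efficiently):

bool CTextureShader::SetShaderParameters(ID3D11DeviceContext* _pDeviceContext,
    const XMMATRIX& _WorldMatrix, const XMMATRIX& _ViewMatrix,
    const XMMATRIX& _ProjectionMatrix, ID3D11ShaderResourceView* _pTexture)
{
    // Transpose into locals; the caller's matrices are left untouched.
    XMMATRIX world = XMMatrixTranspose(_WorldMatrix);
    XMMATRIX view = XMMatrixTranspose(_ViewMatrix);
    XMMATRIX projection = XMMatrixTranspose(_ProjectionMatrix);
    // ... Map, copy, Unmap and bind exactly as above ...
    return true;
}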
I wonder if the problem is that the temporary created by the call to XMMatrixTranslation(x, y, z + (i * 8)) is being passed into a function by reference, and then passed into another function by reference, where it is modified.

My knowledge of the C++ spec isn't complete enough to tell whether that's undefined behaviour or not (but I know there are tricky rules in this area: binding a temporary to a non-const reference is not allowed, for example). Regardless, it's close enough to being suspicious that even if it is well-defined C++, it still might be a dusty corner that could trip up a non-compliant compiler.
To rule out this possibility, try doing it like this:

XMMATRIX worldMatrix = XMMatrixTranslation(x, y, z + (i * 8));
m_pTextureShader->Render(m_pDirect3D->GetDeviceContext(), elem->GetIndexCount(), worldMatrix, viewMatrix, projectionMatrix, elem->GetTexture());
I will try to explain my problem as clearly as possible. I have a multithreading framework I have to work on; it's a path-tracing renderer. It gives me an error when I try to store some information produced by my threads. To avoid posting all the code, I will explain what I mean step by step:
my TileTracer class is a thread
class TileTracer : public Thread
{
    ...
};
and I have a certain number of threads:
#define MAXTHREADS 32
TileTracer* worker[MAXTHREADS];
the number of worker threads is set in the following initialization code, where the threads are also started:
void Renderer::Init()
{
    accumulator = (vec3*)MALLOC64(sizeof(vec3) * SCRWIDTH * SCRHEIGHT);
    memset(accumulator, 0, SCRWIDTH * SCRHEIGHT * sizeof(vec3));
    SYSTEM_INFO systeminfo;
    GetSystemInfo(&systeminfo);
    int cores = systeminfo.dwNumberOfProcessors;
    workerCount = MIN(MAXTHREADS, cores);
    for (int i = 0; i < workerCount; i++)
    {
        goSignal[i] = CreateEvent(NULL, FALSE, FALSE, 0);
        doneSignal[i] = CreateEvent(NULL, FALSE, FALSE, 0);
    }
    // create and start worker threads
    for (int i = 0; i < workerCount; i++)
    {
        worker[i] = new TileTracer();
        worker[i]->init(accumulator, i);
        worker[i]->start(); // start the thread
    }
    samples = 0;
}
the init() method for my thread is simply defined in my header as the following:
void init(vec3* target, int idx) { accumulator = target, threadIdx = idx; }
while the start() is:
void Thread::start()
{
    DWORD tid = 0;
    m_hThread = (unsigned long*)CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)sthread_proc, (Thread*)this, 0, &tid);
    setPriority(Thread::P_NORMAL);
}
somehow (I don't know exactly where), each thread calls the following main method, which is meant to compute the color of a pixel (you don't have to understand it all):
vec3 TileTracer::Sample(vec3 O, vec3 D, int depth)
{
    vec3 color(0, 0, 0);

    // trace path extension ray
    float t = 1000.0f, u, v;
    Triangle* tri = 0;
    Scene::mbvh->pool4[0].TraceEmbree(O, D, t, u, v, tri, false);
    totalRays++;

    // handle intersection, if any
    if (tri)
    {
        // determine material color at intersection point
        Material* mat = Scene::matList[tri->material];
        Texture* tex = mat->GetTexture();
        vec3 diffuse;
        if (tex)
        {
            ...
        }
        else diffuse = mat->GetColor();

        vec3 I = O + t * D; // the exact intersection point on the object

        // we need to store the info of each bounce of the basePath for the offsetPaths
        basePath baseInfo = { O, D, I, tri };
        basePathHits.push_back(baseInfo);

        // (-1,20,9) is the hard-coded light position; adding Rand(2.0f) on the
        // X and Z axes turns the point light into a small area light
        vec3 L = vec3(-1 + Rand(2.0f), 20, 9 + Rand(2.0f)) - I;

        // shorten the shadow ray slightly (* 0.99) so it doesn't hit the light
        // source itself and falsely count as shadowed
        float dist = length(L) * 0.99f;
        L = normalize(L);
        float ndotl = dot(tri->N, L);
        if (ndotl > 0)
        {
            Triangle* tri = 0;
            totalRays++;
            // cast a shadow ray; we only care whether it hits anything
            Scene::mbvh->pool4[0].TraceEmbree(I + L * EPSILON, L, dist, u, v, tri, true);
            // if nothing was hit, accumulate the direct light transport:
            // diffuse * ndotl * lightBrightness * 1/dist^2
            if (!tri) color += diffuse * ndotl * vec3(1000.0f, 1000.0f, 850.0f) * (1.0f / (dist * dist));
        }

        // continue the random walk, since this is a path tracer (at most 20 bounces)
        if (depth < 20)
        {
            // russian roulette
            float Psurvival = CLAMP((diffuse.r + diffuse.g + diffuse.b) * 0.33333f, 0.2f, 0.8f);
            if (Rand(1.0f) < Psurvival)
            {
                vec3 R = DiffuseReflectionCosineWeighted(tri->N); // cosine-weighted direction
                color += diffuse * Sample(I + R * EPSILON, R, depth + 1) * (1.0f / Psurvival);
            }
        }
    }
    return color;
}
Now, you don't have to understand the whole code, because my question concerns only the two following lines:
basePath baseInfo = { O, D, I, tri };
basePathHits.push_back(baseInfo);
I just create a simple struct, basePath, defined as follows:
struct basePath
{
    vec3 O, D, hit;
    Triangle* tri;
};
and I store it in a vector of structs defined at the beginning of my code:
vector<basePath> basePathHits;
The problem is that this seems to throw an exception. Indeed, if I try to store this information, which I need later in my code, the program crashes with the exception:

Unhandled exception at 0x0FD4FAC1 (msvcr120d.dll) in Template.exe: 0xC0000005: Access violation reading location 0x3F4C1BC1.

Other times, without changing anything, the error is different (screenshot missing). Without storing that info, everything works perfectly. Likewise, if I set the number of cores to 1, everything works. So how come multithreading doesn't allow me to do this? Don't hesitate to ask for further info if this isn't enough.
Try making the following change to your code:

// we need to store the info of each bounce of the basePath for the offsetPaths
basePath baseInfo = { O, D, I, tri };

static std::mutex myMutex; // requires #include <mutex>
myMutex.lock();
basePathHits.push_back(baseInfo);
myMutex.unlock();
If that removes the exceptions, then the problem is unsynchronised access to basePathHits (i.e. multiple threads calling push_back simultaneously). You need to think carefully about what the best solution to this will be, to minimise the impact of synchronisation on performance.
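If the lock turns out to be the fix but costs too much, one common pattern is to give each thread its own vector and merge once per frame, so the hot path stays lock-free (a sketch; the names localHits and MergeHits are my assumptions, the rest follows the question's code):

#include <vector>

// Each TileTracer appends to localHits[threadIdx], so no synchronisation is
// needed while tracing.
std::vector<basePath> localHits[MAXTHREADS];

// After all doneSignal events have fired, merge on the main thread.
void MergeHits(std::vector<basePath>& basePathHits, int workerCount)
{
    for (int i = 0; i < workerCount; i++)
    {
        basePathHits.insert(basePathHits.end(),
                            localHits[i].begin(), localHits[i].end());
        localHits[i].clear();
    }
}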
Possibly I didn't see it, but there is no protection for the target vector: no mutex or atomic. And as far as I know, std::vector needs this for multithreaded writes.
I'm programming in OpenCL using the C++ bindings. I have a problem where, on NVIDIA hardware, my OpenCL code spontaneously produces very large numbers and then (on the next run) a "1.#QNaN". My code is pretty much a simple physics simulation using the equation x = vt + 0.5at^2. The only thing I've noticed odd about it is that the velocities of the particles suddenly shoot up to about 6e+34, which I'm guessing is around the maximum floating-point value on that machine. However, the velocities/forces before then are quite small, often with values less than 1.
The specific GPU I'm using is the Tesla M2050, with the latest drivers. I prototype on my laptop using my AMD Fusion CPU as a platform (it does not have a dedicated GPU), and the problem does not occur there. I am not sure if this is an NVIDIA driver problem, a problem with my computation, or something else entirely.
Here is the kernel code (Note: I'm reasonably sure mass is always nonzero):
__kernel void update_atom(__global float4 *pos, __global float4 *vel, __global float4 *force,
                          __global const float *mass, __global const float *radius,
                          const float timestep, const int wall)
{
    // Get the index of the current element to be processed
    int i = get_global_id(0);
    float constMult;
    float4 accel;
    float4 part;

    // Update the position, velocity and force
    accel = (float4)(force[i].x/mass[i],
                     force[i].y/mass[i],
                     force[i].z/mass[i],
                     0.0f);
    constMult = 0.5f*timestep*timestep;
    part = (float4)(constMult*accel.x,
                    constMult*accel.y,
                    constMult*accel.z, 0.0f);
    pos[i] = (float4)(pos[i].x + vel[i].x*timestep + part.x,
                      pos[i].y + vel[i].y*timestep + part.y,
                      pos[i].z + vel[i].z*timestep + part.z,
                      0.0f);
    vel[i] = (float4)(vel[i].x + accel.x*timestep,
                      vel[i].y + accel.y*timestep,
                      vel[i].z + accel.z*timestep,
                      0.0f);
    force[i] = (float4)(force[i].x,
                        force[i].y,
                        force[i].z,
                        0.0f);

    // Do reflections off the walls
    // http://www.3dkingdoms.com/weekly/weekly.php?a=2
    float4 norm;
    float bouncePos = wall - radius[i];
    float bounceNeg = radius[i] - wall;
    norm = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
    if(pos[i].x >= bouncePos)
    {
        // Wall lies in the YZ plane; normal is the unit X vector
        pos[i].x = bouncePos;
        norm.x = 1.0f;
    }
    else if(pos[i].x <= bounceNeg)
    {
        pos[i].x = bounceNeg;
        norm.x = -1.0f;
    }
    if(pos[i].y >= bouncePos)
    {
        // Wall lies in the XZ plane; normal is the unit Y vector
        pos[i].y = bouncePos;
        norm.y = 1.0f;
    }
    else if(pos[i].y <= bounceNeg)
    {
        // etc.
        pos[i].y = bounceNeg;
        norm.y = -1.0f;
    }
    if(pos[i].z >= bouncePos)
    {
        pos[i].z = bouncePos;
        norm.z = 1.0f;
    }
    else if(pos[i].z <= bounceNeg)
    {
        pos[i].z = bounceNeg;
        norm.z = -1.0f;
    }
    float dot = 2 * (vel[i].x * norm.x + vel[i].y * norm.y + vel[i].z * norm.z);
    vel[i].x = vel[i].x - dot * norm.x;
    vel[i].y = vel[i].y - dot * norm.y;
    vel[i].z = vel[i].z - dot * norm.z;
}
And here's how I feed the information to the kernel. PutData just uses std::vector::push_back to append each atom's position, velocity, and force to the corresponding vectors, and kernel is just a wrapper class I wrote around the OpenCL libraries (you can trust me that I put the right parameters into the right places in the enqueue functions).
void LoadAtoms(int kernelNum, bool blocking)
{
    std::vector<cl_float4> atomPos;
    std::vector<cl_float4> atomVel;
    std::vector<cl_float4> atomForce;
    for(int i = 0; i < numParticles; i++)
        atomList[i].PutData(atomPos, atomVel, atomForce);
    kernel.EnqueueWriteBuffer(kernelNum, posBuf, blocking, 0, numParticles*sizeof(cl_float4), &atomPos[0]);
    kernel.EnqueueWriteBuffer(kernelNum, velBuf, blocking, 0, numParticles*sizeof(cl_float4), &atomVel[0]);
    kernel.EnqueueWriteBuffer(kernelNum, forceBuf, blocking, 0, numParticles*sizeof(cl_float4), &atomForce[0]);
}

void LoadAtomTypes(int kernelNum, bool blocking)
{
    std::vector<cl_float> mass;
    std::vector<cl_float> radius;
    int type;
    for(int i = 0; i < numParticles; i++)
    {
        type = atomList[i].GetType();
        mass.push_back(atomTypes[type].mass);
        radius.push_back(atomTypes[type].radius);
    }
    kernel.EnqueueWriteBuffer(kernelNum, massBuf, blocking, 0, numParticles*sizeof(cl_float), &mass[0]);
    kernel.EnqueueWriteBuffer(kernelNum, radiusBuf, blocking, 0, numParticles*sizeof(cl_float), &radius[0]);
}
There is more to my code, as always, but this is what's related to the kernel.
I saw this question, which is similar, but I use cl_float4 everywhere I can, so I don't believe it's an alignment issue. There aren't really any other related questions.
I realize this probably isn't a simple question, but I've run out of ideas until we can get new hardware in the office to test on. Can anyone help me out?
Since no one answered, I suppose I'll just contribute what I've learned so far.

I don't have a definitive conclusion, but at the very least I figured out why it was happening so often. Since I'm running this (and other similar kernels) to estimate the order of the run time, I was clearing the lists, resizing them, and then re-running the calculations. What I wasn't doing, however, was resizing the buffers. This resulted in the threads pulling in undefined numbers.
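In other words, whenever numParticles changes, the device buffers have to be recreated (or created large enough up front) to match. A sketch using the OpenCL C++ bindings, where context stands for whatever cl::Context the wrapper class holds:

// Recreate the device buffers to match the new particle count, so the kernel
// never reads past the end of a stale, smaller allocation.
posBuf   = cl::Buffer(context, CL_MEM_READ_WRITE, numParticles * sizeof(cl_float4));
velBuf   = cl::Buffer(context, CL_MEM_READ_WRITE, numParticles * sizeof(cl_float4));
forceBuf = cl::Buffer(context, CL_MEM_READ_WRITE, numParticles * sizeof(cl_float4));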
However, this doesn't explain the QNaNs I get from running the program over long periods of time; those simply appear spontaneously. Maybe it's a similar issue that I'm overlooking, but I can't say. If anyone has further input on the issue, it'd be appreciated.