How to create Shared Buffer on Vulkan Host Program? - glsl

Goal
I want to set the size of a 'shared' buffer in GLSL at runtime.
The question is: how do I create shared memory from the Vulkan host C/C++ program?
Example
In OpenCL, the kernel function below takes a '__local' argument.
// foo.cl
__kernel void foo(__global float *dst, __global float *src, __local float *buf){
/*do something great*/
}
and in the host C++ program, I create __local memory and pass it as a kernel argument.
int main(){
    ...
    cl_uint args_index = 2;
    foo.setArg(args_index, __local(128)); // allocate 128 bytes of local memory and pass it to the kernel
}
I want to do the same thing with a Vulkan compute pipeline, so I tried the following.
GLSL
//foo.comp
#version 450
layout(binding=0) buffer dstBuffer{
uint dst[];
};
layout(binding=1) buffer srcBuffer{
uint src[];
};
// try 1
layout(binding=2) uniform SharedMemSize{
uint size[];
};
shared uint buf[size[0]]; // compile Error! msg : array size must be a constant integer expression
// try 2
layout(binding=2) shared SharedBuffer{
uint buf[];
}; // compile Error! msg :'shared block' not supported
//try 3
layout(binding=2) shared uint buf[]; // compile Error! msg : binding requires uniform or buffer.
All of the above failed. I need your help. Thanks.

GLSL has shared variables, which represent storage accessible to any member of a work group. However, it doesn't have "shared memory" in the same way as OpenCL. You cannot directly upload data to shared variables from the CPU, for example. You can't get a pointer to shared variables.
The closest you might get to this is to have some kind of shared variable array whose size is determined from outside of the shader. But even then, you're not influencing the size of memory in total; you're influencing the size of just that array.
And you can sort of do that: SPIR-V, as consumed by Vulkan, supports specialization constants. These are values supplied from the outside world when the pipeline is created from the shader module. Such a constant can be used as the size of a shared variable array. Of course, this means that changing it requires creating a new pipeline.
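A minimal sketch of that approach (assuming device, computeShaderModule, and pipelineLayout already exist on the host side, and that the compute shader declares the array via a specialization constant as shown in the comment below):
// GLSL side (sketch):
//   layout(constant_id = 0) const uint SHARED_SIZE = 64;  // default, overridable at pipeline creation
//   shared uint buf[SHARED_SIZE];

uint32_t sharedSize = 128;                          // value decided at runtime

VkSpecializationMapEntry entry{};
entry.constantID = 0;                               // matches constant_id = 0 in the shader
entry.offset     = 0;
entry.size       = sizeof(uint32_t);

VkSpecializationInfo specInfo{};
specInfo.mapEntryCount = 1;
specInfo.pMapEntries   = &entry;
specInfo.dataSize      = sizeof(uint32_t);
specInfo.pData         = &sharedSize;

VkPipelineShaderStageCreateInfo stage{};
stage.sType  = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
stage.stage  = VK_SHADER_STAGE_COMPUTE_BIT;
stage.module = computeShaderModule;                 // assumed to exist
stage.pName  = "main";
stage.pSpecializationInfo = &specInfo;

VkComputePipelineCreateInfo pipelineInfo{};
pipelineInfo.sType  = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
pipelineInfo.stage  = stage;
pipelineInfo.layout = pipelineLayout;               // assumed to exist

VkPipeline pipeline = VK_NULL_HANDLE;
vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &pipelineInfo, nullptr, &pipeline);
Changing sharedSize later means calling vkCreateComputePipelines again; the shader module and descriptor layouts can be reused.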

Related

Byte Array Initialization Causes DirectX to Crash

So I'm trying to gain access to vertex buffers on the GPU. Specifically I need to do some calculations with the vertices. So in order to do that I attempt to map the resource (vertex buffer) from the GPU, and copy it into system memory so the CPU can access the vertices. I used the following SO thread to put the code together: How to read vertices from vertex buffer in Direct3d11
Here is my code:
HRESULT hr = pSwapchain->GetDevice(__uuidof(ID3D11Device), (void**)&pDevice);
if (FAILED(hr))
return false;
pDevice->GetImmediateContext(&pContext);
pContext->OMGetRenderTargets(1, &pRenderTargetView, nullptr);
//Vertex Buffer
ID3D11Buffer* veBuffer;
UINT Stride;
UINT veBufferOffset;
pContext->IAGetVertexBuffers(0, 1, &veBuffer, &Stride, &veBufferOffset);
D3D11_MAPPED_SUBRESOURCE mapped_rsrc;
pContext->Map(veBuffer, NULL, D3D11_MAP_READ, NULL, &mapped_rsrc);
void* vert = new BYTE[mapped_rsrc.DepthPitch]; //DirectX crashes on this line...
memcpy(vert, mapped_rsrc.pData, mapped_rsrc.DepthPitch);
pContext->Unmap(veBuffer, 0);
I'm somewhat of a newbie when it comes to C++, so my assumptions may be incorrect. The value that mapped_rsrc.DepthPitch returns is quite large: 343597386. According to the documentation listed below, the value of DepthPitch is given in bytes. If I replace the initialization value with a much smaller number, like 10, the code runs just fine. From what I read about the Map() function here: https://learn.microsoft.com/en-us/windows/win32/api/d3d11/ns-d3d11-d3d11_mapped_subresource
It states :
Note The runtime might assign values to RowPitch and DepthPitch that
are larger than anticipated because there might be padding between
rows and depth.
Could this have something to do with the large value that is being returned? If so, does that mean I have to parse DepthPitch to remove any unneeded data? Or maybe it is an issue with the way vert is initialized?
There was no Vertex Buffer bound, so your IAGetVertexBuffers failed to return anything. You have to create a VB.
See Microsoft Docs: How to Create a Vertex Buffer
As someone new to DirectX 11, you should take a look at DirectX Tool Kit.
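As a sketch of the point above (reusing the question's variable names, and assuming <d3d11.h>, <vector>, and <cstring> are included and pContext is a valid ID3D11DeviceContext*), the outputs should be validated before DepthPitch is trusted:
ID3D11Buffer* veBuffer = nullptr;
UINT stride = 0, offset = 0;
pContext->IAGetVertexBuffers(0, 1, &veBuffer, &stride, &offset);
if (!veBuffer)
    return false;                       // nothing was bound at slot 0

D3D11_MAPPED_SUBRESOURCE mapped_rsrc = {};
HRESULT hr = pContext->Map(veBuffer, 0, D3D11_MAP_READ, 0, &mapped_rsrc);
if (FAILED(hr))                         // Map fails unless the buffer has CPU read access (staging usage)
{
    veBuffer->Release();
    return false;
}

std::vector<BYTE> vert(mapped_rsrc.DepthPitch);     // DepthPitch is now a meaningful byte count
memcpy(vert.data(), mapped_rsrc.pData, vert.size());
pContext->Unmap(veBuffer, 0);
veBuffer->Release();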

Getting a constant from a GLSL shader

I have a shader written in GLSL with an array of structs for holding light data. I use a constant to declare the array size, as is good practice. Let's say this variable is declared as
const int NUM_POINT_LIGHTS = 100;
How can I use C++ to pull this data out of the shader, so that my C++ program knows exactly how many lights it has available to it? I've tried declaring it as
const uniform int NUM_POINT_LIGHTS = 100;
As expected, this didn't work (though oddly enough, it appears the uniform qualifier simply overrode the const qualifier, as OpenGL complained that I was initializing an array with a non-const value). I also tried
const int NUM_POINT_LIGHTS = 100;
uniform int numPointLights = NUM_POINT_LIGHTS;
This would work, except that GLSL optimizes away unused uniforms, so I would have to trick GLSL into thinking the uniform is used in order to get hold of the data. I've not been able to find any other method to query a program for a constant value. Does anybody have any idea how I might pull a constant out of a shader, so my program can get information that is functionally encoded in the shader for its use?
I don't think you can directly get the value of the constant. However, I figure you must use the value of the constant, most likely as the size of a uniform array. If that's the case, you can get the size of the uniform array, which indirectly gets you the value of the constant.
Say your shader contains something like this:
const int NUM_POINT_LIGHTS = 100;
uniform vec3 LightPositions[NUM_POINT_LIGHTS];
Then you can first get the index of this uniform:
const GLchar* uniformName = "LightPositions";
GLuint uniformIdx = 0;
glGetUniformIndices(program, 1, &uniformName, &uniformIdx);
Using this index, you can then retrieve attributes of this uniform:
const GLsizei nameLen = sizeof("LightPositions");   // includes the null terminator
GLchar name[nameLen];
GLint uniformSize = 0;
GLenum uniformType = GL_NONE;
glGetActiveUniform(program, uniformIdx, nameLen, NULL,
                   &uniformSize, &uniformType, name);
uniformSize should then be the value of the NUM_POINT_LIGHTS constant. Note that I haven't tried this, but I hope I got the arguments right based on the documentation.
A somewhat ugly but possibly very practical solution is of course to parse the value out of the shader source code. Since you need to read it in anyway before passing it to glShaderSource(), picking out the constant value should be easy enough.
Yet another option, if your main goal is to avoid having the constant in multiple places, is to define it in your C++ code and add the constant definition to the shader code dynamically, after reading in the shader code and before passing it to glShaderSource().
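A minimal sketch of that last option (loadTextFile() is a hypothetical helper, and shader is assumed to have been created with glCreateShader()):
const int NUM_POINT_LIGHTS = 100;                   // single source of truth, on the C++ side

std::string source = loadTextFile("lights.frag");   // hypothetical helper
std::string header = "#define NUM_POINT_LIGHTS " +
                     std::to_string(NUM_POINT_LIGHTS) + "\n";

// glShaderSource accepts multiple strings, so the header can be prepended without editing
// the file. If the file starts with a #version line, splice the header in after it instead.
const GLchar* strings[] = { header.c_str(), source.c_str() };
glShaderSource(shader, 2, strings, NULL);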
You can't just query a constant from a GLSL program. No such thing is defined in the GLSL spec.
Uniform Buffer Objects might be a way to get around the issue of uniforms being optimized out.
https://www.opengl.org/wiki/Uniform_Buffer_Object
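For example (a sketch with an illustrative block name), a member of an std140 uniform block keeps its place in the block's fixed layout whether or not the shader reads it, so the block can still be introspected from C++:
// GLSL side (illustrative):
//   layout(std140) uniform Constants { int numPointLights; };
GLuint blockIndex = glGetUniformBlockIndex(program, "Constants");
GLint blockSize = 0;
glGetActiveUniformBlockiv(program, blockIndex, GL_UNIFORM_BLOCK_DATA_SIZE, &blockSize);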

How I use global memory correctly in CUDA?

I'm trying to write an application in CUDA that uses global memory declared with __device__.
These variables are declared in a .cuh file.
My main() is in another .cu file, where I do the cudaMalloc and cudaMemcpy calls.
That's a part of my code:
cudaMalloc((void**)&varOne,*tam_varOne * sizeof(cuComplex));
cudaMemcpy(varOne,C_varOne,*tam_varOne * sizeof(cuComplex),cudaMemcpyHostToDevice);
varOne is declared in the .cuh file like this:
__device__ cuComplex *varOne;
When I launch my kernel (I'm not passing varOne as a parameter) and try to read varOne with the debugger, it says that it can't read the variable. The pointer address is 000..0, so it's obviously wrong.
So, how do I have to declare and copy global memory in CUDA?
First, you need to declare the pointers to the data that will be copied from the CPU to the GPU. In the example below, we want to copy the array original_cpu_array to CUDA global memory.
int original_cpu_array[array_size];
int *array_cuda;
Calculate the memory size that the data will occupy.
int size = array_size * sizeof(int);
CUDA memory allocation:
msg_erro[0] = cudaMalloc((void **)&array_cuda,size);
Copying from CPU to GPU:
msg_erro[0] = cudaMemcpy(array_cuda, original_cpu_array,size,cudaMemcpyHostToDevice);
Execute kernel
Copying from GPU to CPU:
msg_erro[0] = cudaMemcpy(original_cpu_array,array_cuda,size,cudaMemcpyDeviceToHost);
Free Memory:
cudaFree(array_cuda);
For debugging, I typically save the status of the function calls in an array (e.g., cudaError_t msg_erro[var];). This is not strictly necessary, but it will save you some time if an error occurs during the allocations and memory transfers.
And if errors do occur, I print them using a function like:
void printErros(cudaError_t *erros, int size, int flag)
{
    for(int i = 0; i < size; i++)
        if(erros[i] != 0)
        {
            if(flag == 0) printf("Memory allocation ");
            if(flag == 1) printf("CPU -> GPU ");
            if(flag == 2) printf("GPU -> CPU ");
            printf("{%d} => %s\n", i, cudaGetErrorString(erros[i]));
        }
}
The flag is mainly there to indicate which part of the code the error occurred in. For instance, after a memory allocation:
msg_erro[0] = cudaMalloc((void **)&array_cuda,size);
printErros(msg_erro,msg_erro_size, 0);
I have experimented with an example and found that you cannot directly use the global variable in the kernel without passing it in. Even though you declare it in the .cuh file, you need to initialize it in main().
Reason:
If you only declare it globally, no memory is allocated in GPU global memory. You need to use cudaMalloc((void**)&varOne, sizeof(cuComplex)) for the allocation; it can only allocate memory on the GPU. The declaration __device__ cuComplex *varOne; works just as a prototype and variable declaration, but the memory is not allocated until cudaMalloc((void**)&varOne, sizeof(cuComplex)) is used.
Also, you need to treat *varOne in main() as a host pointer initially; after cudaMalloc() has been called on it, it is treated as a device pointer.
The sequence of steps is (for my tested code):
int *Ad; // if you can declare this in the .cuh file, you don't need the declaration shown here
__global__ void Kernel(int *Ad){
....
}
int main(){
....
int size=100*sizeof(int);
cudaMalloc((void**)&Ad,size);
cudaMemcpy(Ad,A,size,cudaMemcpyHostToDevice);
....
}
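For completeness, a minimal sketch (not from the answers above; names are illustrative) of the cudaMemcpyToSymbol route, which is what lets a kernel use a __device__ pointer without receiving it as a parameter:
#include <cuda_runtime.h>

__device__ float* d_varOne;          // file-scope device pointer, as in the .cuh file

__global__ void useVarOne(int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_varOne[i] *= 2.0f;         // the kernel reads the global pointer directly
}

int main()
{
    const int n = 100;
    float* devBuffer = nullptr;

    cudaMalloc(&devBuffer, n * sizeof(float));                  // allocate device memory
    cudaMemset(devBuffer, 0, n * sizeof(float));                // give it defined contents
    cudaMemcpyToSymbol(d_varOne, &devBuffer, sizeof(float*));   // store the device pointer
                                                                // into the __device__ symbol
    useVarOne<<<(n + 255) / 256, 256>>>(n);
    cudaDeviceSynchronize();

    cudaFree(devBuffer);
    return 0;
}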

allocating shared memory

I am trying to allocate shared memory by using a constant parameter but I am getting an error. My kernel looks like this:
__global__ void Kernel(const int count)
{
__shared__ int a[count];
}
and I am getting an error saying
error: expression must have a constant value
count is const! Why am I getting this error? And how can I get around this?
CUDA supports dynamic shared memory allocation. If you define the kernel like this:
__global__ void Kernel(const int count)
{
extern __shared__ int a[];
}
and then pass the number of bytes required as the third argument of the kernel launch
Kernel<<< gridDim, blockDim, a_size >>>(count)
then it can be sized at run time. Be aware that the runtime only supports a single dynamically declared allocation per block. If you need more, you will need to use pointers to offsets within that single allocation. Also be aware when using pointers that shared memory uses 32 bit words, and all allocations must be 32 bit word aligned, irrespective of the type of the shared memory allocation.
const doesn't mean "constant", it means "read-only".
A constant expression is something whose value is known to the compiler at compile-time.
Option one: declare shared memory with a constant value (not the same as const):
__global__ void Kernel(int count_a, int count_b)
{
    __shared__ int a[100];
    __shared__ int b[4];
}
Option two: declare shared memory dynamically and size it in the kernel launch configuration:
__global__ void Kernel(int count_a, int count_b)
{
    extern __shared__ int shared[];
    int *a = &shared[0];        // a starts at the beginning of the dynamic shared allocation
    int *b = &shared[count_a];  // b starts right after a's count_a elements
}

size_t sharedMemory = count_a*sizeof(int) + count_b*sizeof(int);
Kernel<<<numBlocks, threadsPerBlock, sharedMemory>>>(count_a, count_b);
Note: pointers obtained from the dynamic shared memory allocation all start from the same base address. I use two shared memory arrays to illustrate how to manually lay out two arrays within that single allocation.
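A third possibility, not mentioned above (just a sketch), is to make the size a genuine compile-time constant through a template parameter, which satisfies the constant-expression requirement:
template <int Count>
__global__ void Kernel(int count_a, int count_b)
{
    __shared__ int a[Count];   // Count is a constant expression, so this compiles
    // ... use a[] here ...
}

// launch with the size fixed at compile time:
Kernel<128><<<numBlocks, threadsPerBlock>>>(count_a, count_b);
This only helps when the possible sizes are known when the code is built; otherwise the extern __shared__ form above is the way to go.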
From the "CUDA C Programming Guide":
The execution configuration is specified by inserting an expression of the form:
<<<Dg, Db, Ns, S>>>
where:
Dg is of type dim3 and specifies the dimension and size of the grid ...
Db is of type dim3 and specifies the dimension and size of each block ...
Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory. This dynamically allocated memory is used by any of the variables declared as an external array as mentioned in __shared__. Ns is an optional argument which defaults to 0.
S is of type cudaStream_t and specifies the associated stream ...
So by using the dynamic parameter Ns, the user can specify the total size of shared memory a kernel can use, no matter how many shared variables there are in the kernel.
You cannot declare a shared variable like this:
__shared__ int a[count];
although if you are sure enough about the maximum size of array a, then you can declare it directly, like
__shared__ int a[100];
but in that case you should be concerned about how many blocks there are in your program, since reserving more shared memory per block than you actually use limits how many blocks can be resident at once (occupancy) and thus hurts performance.
There is a nice solution to this problem: declare
extern __shared__ int a[];
and allocate the memory when calling the kernel, like
Kernel<<< gridDim, blockDim, a_size >>>(count)
but you should also be careful here, because if you use more shared memory in a block than you assign in the kernel launch, you are going to get unexpected results.

Who can tell me what this bit of C++ does?

CUSTOMVERTEX* pVertexArray;
if( FAILED( m_pVB->Lock( 0, 0, (void**)&pVertexArray, 0 ) ) ) {
return E_FAIL;
}
pVertexArray[0].position = D3DXVECTOR3(-1.0, -1.0, 1.0);
pVertexArray[1].position = D3DXVECTOR3(-1.0, 1.0, 1.0);
pVertexArray[2].position = D3DXVECTOR3( 1.0, -1.0, 1.0);
...
I've not touched C++ for a while, hence the topic, but this bit of code is confusing me. After m_pVB->Lock is called, the array is initialized.
This is great and all, but the problem I'm having is how this happens. The code underneath uses nine elements, but another function (pretty much a copy/paste of the code I'm working with) only accesses, say, four elements.
CUSTOMVERTEX is a struct, but I was under the impression that this doesn't matter and that an array of structs/objects needs to be initialized to a fixed size.
Can anyone clear this up?
Edit:
Given the replies, how does it know that I require nine elements in the array, or four, etc.?
So as long as the buffer is big enough, the elements are legal. If so, this code is setting the buffer size, if I'm not mistaken:
if( FAILED( m_pd3dDevice->CreateVertexBuffer( vertexCount * sizeof(CUSTOMVERTEX), 0, D3DFVF_CUSTOMVERTEX, D3DPOOL_DEFAULT, &m_pVB, NULL ) ) ) {
return E_FAIL;
}
m_pVB points to a graphics object, in this case presumably a vertex buffer. The data held by this object will not generally be in CPU-accessible memory: it may be held in onboard RAM of your graphics hardware or not allocated at all, and it may be in use by the GPU at any particular time. So if you want to read from it or write to it, you need to tell your graphics subsystem, and that's what the Lock() function does: it synchronises with the GPU, ensures there is a buffer in main memory big enough for the data and that it contains the data you expect at this time from the CPU's point of view, and returns to you the pointer to that main memory. There will need to be a corresponding Unlock() call to tell the GPU that you are done reading/mutating the object.
To answer your question about how the size of the buffer is determined, look at where the vertex buffer is being constructed - you should see a description of the vertex format and an element count being passed to the function that creates it.
You're passing a pointer to the CUSTOMVERTEX pointer (a pointer to a pointer) into the Lock function, so Lock itself must be creating the CUSTOMVERTEX storage and setting your pointer to point to the object it creates.
In order to modify a vertex buffer in DX you have to lock it. To enforce this the DX API will only reveal the guts of a VB through calling Lock on it.
Your code passes in the address of pVertexArray, and Lock points it at the VB's internal data. The code then proceeds to modify the vertex data, presumably in preparation for rendering.
You're asking the wrong question: it's not how it knows that you require x objects, it's how YOU know that IT holds x objects. You pass a pointer to a pointer to your struct in, and the function returns the pointer to your structs, already allocated in memory (from when you first initialized the vertex buffer). Everything is always there; you're just requesting a pointer to the array to work with it, then "releasing" it so DX knows to read the vertex buffer and upload it to the GPU.
When you created the vertex buffer, you had to specify a size. When you call Lock(), you're passing 0 as the size to lock, which tells it to lock the entire size of the vertex buffer.
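Putting the answers together, a minimal sketch of the whole sequence (D3D9-style, with illustrative names, inside a function returning HRESULT and with m_pd3dDevice assumed valid):
struct CUSTOMVERTEX { D3DXVECTOR3 position; DWORD color; };
#define D3DFVF_CUSTOMVERTEX (D3DFVF_XYZ | D3DFVF_DIFFUSE)

const UINT vertexCount = 9;                        // the element count is fixed here, up front
LPDIRECT3DVERTEXBUFFER9 m_pVB = nullptr;

if (FAILED(m_pd3dDevice->CreateVertexBuffer(vertexCount * sizeof(CUSTOMVERTEX),
                                            0, D3DFVF_CUSTOMVERTEX,
                                            D3DPOOL_DEFAULT, &m_pVB, NULL)))
    return E_FAIL;

CUSTOMVERTEX* pVertexArray = nullptr;
if (FAILED(m_pVB->Lock(0, 0, (void**)&pVertexArray, 0)))   // 0, 0 = lock the whole buffer
    return E_FAIL;

for (UINT i = 0; i < vertexCount; ++i)             // writing past vertexCount would be undefined behaviour
    pVertexArray[i].position = D3DXVECTOR3(0.0f, 0.0f, 0.0f);

m_pVB->Unlock();                                   // hand the data back to the GPU side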