Changing VkBuffers used in a command without re-building the command buffers? - glsl

This might be an XY problem, so here's my issue:
I'm trying to send a command buffer to the GPU that adds values to a shader storage buffer, e.g.:
#version 450
#define INPUT_ARRAY_SIZE 1024
// A compute shader needs a local workgroup size; 64 is an assumed value.
layout(local_size_x = 64) in;
layout(std430, binding = 1) buffer InputArray {
    float array[];
} input_array;

void main()
{
    uint index = gl_GlobalInvocationID.x;
    if (index >= INPUT_ARRAY_SIZE) {
        return;
    }
    // The block instance name is input_array and the member is array.
    input_array.array[index] += 3;
}
I would like to be able to swap out the VkBuffer I use to back the shader storage buffer with other buffers, i.e.:
void addValue(device, queue, command_buffer, buffer);
or
void addValue(device, queue, command_buffer, descriptor_set);
where I would swap out buffer for other buffers I want to add values to.
Unfortunately I don't see a way to do that without re-recording my command buffer. As far as I can tell, my only options for minimizing the command-buffer overhead (which is large when my invocations take nanoseconds) are to use secondary command buffers and somehow use a pipeline cache. Otherwise I would have to create a command buffer for every single new buffer, which is not feasible when I have more than 100 commands. It doesn't seem to be possible to use vkUpdateDescriptorSets without re-recording either.
Is there a way to use pre-recorded command buffers and change the VkBuffer used behind the shader buffer at will, without re-recording the command?

Not without the VK_EXT_descriptor_indexing extension. Descriptor values (the locations of the GPU resources they represent) are supposed to be baked into the CB at write time, not read from some external source.
Even with descriptor indexing, you need to ensure that the CB is not being executed before you can update the descriptor. So that would require a GPU/CPU sync (which may or may not be bad, depending on your semaphore/submission code structure).
Otherwise I would have to create a command buffer for every single new buffer, which is not feasible when I have more than 100 commands.
You should not put each command in its own buffer. You should bundle as many commands together as possible.
In general, the cost of building command buffers is pretty low. Coupled with threading their construction, they shouldn't be your primary concern here, especially when the number of commands is as low as "more than 100 commands"; Vulkan users routinely issue thousands of commands into CBs, re-recording them every frame.
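For illustration, here is a minimal sketch of the descriptor-indexing route, assuming a device created with VK_EXT_descriptor_indexing enabled and the descriptorBindingStorageBufferUpdateAfterBind feature supported; the helper names are made up, and binding 1 matches the shader above:

#include <vulkan/vulkan.h>

// Hypothetical helper: builds a storage-buffer descriptor set layout whose
// binding 1 may be updated after the set has been bound/recorded.
VkDescriptorSetLayout makeUpdateAfterBindLayout(VkDevice device)
{
    VkDescriptorSetLayoutBinding binding = {};
    binding.binding         = 1;
    binding.descriptorType  = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
    binding.descriptorCount = 1;
    binding.stageFlags      = VK_SHADER_STAGE_COMPUTE_BIT;

    VkDescriptorBindingFlagsEXT bindingFlags =
        VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT_EXT;

    VkDescriptorSetLayoutBindingFlagsCreateInfoEXT flagsInfo = {};
    flagsInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_BINDING_FLAGS_CREATE_INFO_EXT;
    flagsInfo.bindingCount  = 1;
    flagsInfo.pBindingFlags = &bindingFlags;

    VkDescriptorSetLayoutCreateInfo layoutInfo = {};
    layoutInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
    layoutInfo.pNext = &flagsInfo;
    layoutInfo.flags = VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT_EXT;
    layoutInfo.bindingCount = 1;
    layoutInfo.pBindings    = &binding;

    VkDescriptorSetLayout layout = VK_NULL_HANDLE;
    vkCreateDescriptorSetLayout(device, &layoutInfo, nullptr, &layout);
    return layout;
}

// Hypothetical helper: points binding 1 of an already-allocated set at a
// different VkBuffer. The pre-recorded command buffer can then be resubmitted
// unchanged, but only after the host knows no prior submission still uses it.
void pointSetAtBuffer(VkDevice device, VkDescriptorSet set,
                      VkBuffer buffer, VkDeviceSize size)
{
    VkDescriptorBufferInfo bufferInfo = { buffer, 0, size };

    VkWriteDescriptorSet write = {};
    write.sType           = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
    write.dstSet          = set;
    write.dstBinding      = 1;
    write.dstArrayElement = 0;
    write.descriptorCount = 1;
    write.descriptorType  = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
    write.pBufferInfo     = &bufferInfo;

    vkUpdateDescriptorSets(device, 1, &write, 0, nullptr);
}

The descriptor pool the set is allocated from would also need VK_DESCRIPTOR_POOL_CREATE_UPDATE_AFTER_BIND_BIT_EXT, and waiting on a fence before the update is exactly the GPU/CPU sync mentioned above.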

Related

AT command response parser

I am working on my own implementation to read AT commands from a modem using a microcontroller and C/C++.
but!! always a BUT!! I have two "threads" in my program. In the first one I am comparing the possible reply from the modem using strcmp, which I believe is terribly slow.
comparing function
if (strcmp(reply, m_buffer) == 0)
{
    memset(buffer, 0, buffer_size);
    buffer_size = 0;
    memset(m_buffer, 0, m_buffer_size);
    m_buffer_size = 0;
    return 0;
}
else
    return 1;
This one works fine for me with AT commands like AT or AT+CPIN?, where the last response from the modem is "OK" and nothing in the middle, but it is not working with commands like AT+CREG?, where it responds:
+CREG: n,n
OK
and I am expecting "+CREG: n,n", but I believe strncpy is very slow and my buffer data is replaced by "OK".
2nd "thread" where it enables a UART RX interruption and replaces my buffer data every time it receives new data
Interruption handle:
m_buffer_size = buffer_size;
strncpy(m_buffer, buffer, buffer_size + m_buffer_size);
Do you know of anything out there faster than strcmp? Or anything else to improve the AT command response reading?
This has the scent of an XY Problem
If you have seen the buffer contents being overwritten, you might want to look into a thread-safe queue to deliver messages from the RX thread to the parsing thread. That way, even if a second message arrives while you're processing the first, you won't run into "buffer overwrite" problems.
Move the data out of the receive buffer and place it in another buffer. Two buffers is rarely enough, so create a pool of buffers. In the past I have used linked lists of pre-allocated buffers to keep fragmentation down, but depending on the memory management and caching smarts in your microcontroller, and the language you elect to use, something along the lines of std::deque may be a better choice.
So:
Make a list of free buffers.
The UART handling thread loop then looks something like this:
Get a buffer from the free list
Read into the buffer until full or timeout
Pass buffer to parser.
Parser puts buffer in its own receive list
Parsing sends a signal to wake up its thread.
Repeat until terminated. If the free list is emptied, your program is probably still too slow to keep up. Perhaps adding more buffers will allow the program to get through a busy period, but if the data flow is relatively constant and the free list empties out... Well, you have a problem.
The parser loop, which also repeats until terminated, looks like this:
If receive list not empty,
Get buffer from receive list
Process buffer
Return buffer to free list
Otherwise
Sleep
Remember to protect the lists from concurrent access by the different threads. C11 and C++11 have a number of useful tools to assist you here; a rough sketch follows.
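Here is a minimal C++11 sketch of the free-list/receive-list idea, assuming a fixed pool of fixed-size buffers (the names, the 128-byte size, and the loop function are illustrative, not from the question). On a bare-metal target the interrupt side would typically use a lock-free ring buffer instead of a mutex, so treat this as showing the thread-to-thread part only:

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

struct Buffer {
    char data[128];        // assumed maximum AT response length
    std::size_t length;
};

// Thread-safe queue, used both as the free list and as the parser's receive list.
class BufferQueue {
public:
    void push(Buffer* b) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(b);
        }
        cv_.notify_one();
    }

    Buffer* pop() {        // blocks (sleeps) until a buffer is available
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        Buffer* b = queue_.front();
        queue_.pop_front();
        return b;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<Buffer*> queue_;
};

// Parser side: take a received buffer, process it, return it to the free list.
void parser_loop(BufferQueue& free_list, BufferQueue& rx_list) {
    for (;;) {
        Buffer* b = rx_list.pop();   // sleeps until the UART side pushes a message
        // ... parse b->data / b->length here; strcmp is not the bottleneck ...
        free_list.push(b);           // return the buffer to the pool
    }
}

The UART side does the mirror image: pop a buffer from the free list, fill it from the RX data, then push it onto the receive list.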

Monitor buffers in GNU Radio

I have a question regarding buffering between blocks in GNU Radio. I know that each block in GNU Radio (including custom blocks) has buffers to store items that are going to be sent or have been received. In my project, there is a certain sequence I have to maintain to synchronize events between blocks. I am using GNU Radio on the Xilinx ZC706 FPGA platform with the FMCOMMS5.
In the GNU radio companion I created a custom block that controls a GPIO Output port on the board. In addition, I have an independent source block that is feeding information into the FMCOMMS GNU block. The sequence I am trying to maintain is that, in GNU radio, I first send data to the FMCOMMS block, second I want to make sure that the data got consumed by the FMCOMMS block (essentially by checking buffer), then finally I want to control the GPIO output.
From my observations, the source block buffer doesn’t seem to send the items until it’s full. This will cause a major issue in my project because this means that the GPIO data will be sent before or in parallel with sending the items to the other GNU blocks. That’s because I’m setting the GPIO value through direct access to its address in the ‘work’ function of my custom block.
I tried to use pc_output_buffers_full() in the ‘work’ function of my custom source in order to monitor the buffer, but I’m always getting 0.00. I’m not sure if it’s supposed to be used in custom blocks or if the ‘buffer’ in this case is something different from where the output items are stored. Here's a small code snippet which shows the problem:
char level_count = 0, level_val = 1;
vector<float> buff (1, 0.0000);
for(int i=0; i< noutput_items; i++)
{
if(level_count < 20 && i< noutput_items)
{
out[i] = gr_complex((float)level_val,0);
level_count++;
}
else if(i<noutput_items)
{
level_count = 0;
level_val ^=1;
out[i] = gr_complex((float)level_val,0);
}
buff = pc_output_buffers_full();
for (int n = 0; n < buff.size(); n++)
cout << fixed << setw(5) << setprecision(2) << setfill('0') << buff[n] << " ";
cout << "\n";
}
Is there a way to monitor the buffer so that I can determine when the first part of my data bits has been sent? Or is there a way to make sure that each single output item is sent as a continuous stream to the next block(s)?
GNU Radio Companion version: 3.7.8
OS: Linaro 14.04 image running on the FPGA
Or is there a way to make sure that each single output item is sent as a continuous stream to the next block(s)?
Nope, that's not how GNU Radio works (at all!):
A while back I wrote an article that explains how GNU Radio deals with buffers, and what these actually are. While the in-memory architecture of GNU Radio buffers might be of lesser interest to you, let me quickly summarize the dynamics of it:
The buffers that (general_)work functions are called with behave, for all practical purposes, like linearly addressable ring buffers. You get a random number of samples at once (restrictable to minimum numbers, or multiples of a number), and anything you don't consume will be handed to you again the next time work is called.
These buffers hence keep track of how much you've consumed, and thus, how much free space is in a buffer.
The input buffer a block sees is actually the output buffer of the "upstream" block in the flow graph.
GNU Radio's computation is backpressure-controlled: Any block's work method will immediately be called in an endless loop given that:
There's enough input for the block to do work,
There's enough output buffer space to write to.
Therefore, as soon as one block finishes its work call, the upstream block is informed that there's new free output space, which typically leads to it running.
That leads to a high degree of parallelism, since even adjacent blocks can run simultaneously without conflicting.
This architecture favors large chunks of input items, especially for blocks that take a relatively long time to compute: while the block is still working, its input buffer is already being filled with chunks of samples; when it's finished, chances are it's immediately called again with all the available input buffer already filled with new samples.
This architecture is asynchronous: even if two blocks are "parallel" in your flow graph, there's no defined temporal relation between the numbers of items they produce.
I'm not even convinced that switching GPIOs based on the speed of computation in this completely non-deterministically timed data flow graph model is a good idea to start with. Maybe you'd rather calculate "timestamps" at which GPIOs should be switched, and send (timestamp, gpio state) command tuples to some entity in your FPGA that keeps absolute time? On the scale of radio propagation and high-rate signal processing, CPU timing is really inaccurate; you should use the fact that you have an FPGA to actually implement deterministic timing, and use the software running on the CPU (i.e. GNU Radio) to determine when that should happen.
Is there a way to monitor the buffer so that I can determine when my first part of data bits have been sent?
Other than that, a method to asynchronously tell another block that, yes, you've processed N samples would be either to have a single block that just observes the outputs of both blocks you want to synchronize and consumes an identical number of samples from both inputs, or to implement something using message passing. Again, my suspicion is that this is not a solution to your actual problem.
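For the observer idea, here is a rough sketch of such a block (the class name and the gr_complex item type are assumptions; wiring it into GRC would additionally need the usual factory/sptr boilerplate). Because it is a sync_block, the scheduler only calls work() when both inputs have samples, and it consumes the same number from each:

#include <gnuradio/sync_block.h>
#include <gnuradio/io_signature.h>
#include <gnuradio/gr_complex.h>
#include <stdint.h>

// Sink with two inputs and no outputs: its only job is to observe how many
// samples have passed through both streams being synchronized.
class stream_observer : public gr::sync_block
{
public:
    stream_observer()
        : gr::sync_block("stream_observer",
                         gr::io_signature::make(2, 2, sizeof(gr_complex)),
                         gr::io_signature::make(0, 0, 0)),
          d_samples_seen(0)
    {
    }

    int work(int noutput_items,
             gr_vector_const_void_star& input_items,
             gr_vector_void_star& output_items)
    {
        // noutput_items samples are available on *both* inputs here; this is
        // where you could decide that "the first part of the data" has been
        // produced and, e.g., emit a message to the GPIO-controlling block.
        d_samples_seen += noutput_items;
        return noutput_items;   // consume equally from both inputs
    }

private:
    uint64_t d_samples_seen;
};

Note that "produced by the upstream blocks" is still not the same thing as "transmitted by the FMCOMMS hardware"; the timestamp approach described above is the way to get deterministic timing.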

OpenCL SHA1 Throughput Optimisation

Hoping someone more experienced in OpenCL usage may be able to help me here! I'm doing a project (to help me learn a bit more crypto and to try my hand at GPGPU programming) where I'm trying to implement my own SHA-1 algorithm.
Ultimately my question is about maximizing my throughput rates. At present I'm seeing something like 56.1 MH/sec, which compares very badly to open-source programs I've looked at, such as John the Ripper and OCLHashcat, which give 1,000 and 1,500 MH/sec respectively (heck, I'd be well chuffed with a third of that!).
So, what I'm doing
I've written a SHA-1 implementation in an OpenCL kernel and a C++ host application to load data to the GPU (using CL 1.2 C++ wrapper). I'm generating blocks of candidate data to hash in a threaded fashion on the CPU and loading this data onto the global GPU memory using the CL C++ call to enqueueWriteBuffer (using uchars to represent the bytes to hash):
errorCode = dispatchQueue->enqueueWriteBuffer(
    inputBuffer,
    CL_FALSE, //CL_TRUE,
    0,
    sizeof(cl_uchar) * inputBufferSize,
    passwordBuffer,
    NULL,
    &dispatchDelegate);
I'm enqueuing work using enqueueNDRangeKernel in the following manner (where globalWorkgroupSize is a user-defined variable; at present I've set it to my GPU's maximum flattened global work size of 16.777 million per run):
errorCode = dispatchQueue->enqueueNDRangeKernel(
    *kernel,
    NullRange,
    NDRange(globalWorkgroupSize, 1),
    NullRange,
    NULL,
    NULL);
This means that (per dispatch) I load 16.777 million items into a 1D array and index into it from my kernel using get_global_id(0).
My Kernel signature:
__kernel void sha1Crack(__global uchar* out, __global uchar* in,
                        __constant int* passLen, __constant int* targetHash,
                        __global bool* collisionFound)
{
    // Kernel instance global GPU memory IO mapping:
    __private int id = get_global_id(0);
    __private int passwordLen = passLen[0];
    __private int inputIndexStart = id * passwordLen;
    __private uchar inputMem[64]; // working copy of the candidate; 64 is an assumed upper bound

    // Select password input from the key space:
    #pragma unroll
    for (int i = 0; i < passwordLen; i++)
    {
        inputMem[i] = in[inputIndexStart + i];
    }

    // SHA1 code omitted for brevity...
}
So, given all this: am I doing something fundamentally wrong in the way I'm loading data? I.e., one call to enqueueNDRangeKernel for 16.7 million kernel executions over a 1D input vector? Should I be using a 2D space and sub-dividing into local workgroup ranges? I tried playing with this but it didn't seem any quicker.
Or, perhaps just as likely, is my algorithm itself the source of the slowness? I've spent a good while optimizing it and manually unrolling all of the loop stages using pre-processor directives.
I've read about memory coalescing on the hardware. Could that be my issue? :S
Any advice at all appreciated! If I've missed anything important please let me know and I'll update.
Thanks in advance! ;)
Update: 16,777,216 is the device's maximum reported workgroup size (256**3). The global array of boolean values is a single boolean. It's set to false at the start of the kernel enqueue, then a branching statement sets it to true only if a collision is found - will that force a divergence? passwordLen is the length of the current input value and targetHash is an int[4] encoded hash to check against.
Your 'maximum flattened global worksize' should be multiplied by passwordLen. It is the number of kernel instances you can run, not the maximum length of an input array. You can most likely send much more data than this to the GPU.
Other potential issues: for the 'generating blocks of candidate data to hash in a threaded fashion on the CPU', try doing this in advance of the kernel iterations to see whether the delay is in the generation of the data blocks or in the processing of the kernels. Your SHA-1 algorithm is the other obvious potential issue. I'm not sure how much you've really optimised it by 'unrolling' the loops; usually the bigger optimisation issue is 'if' statements (if a single kernel instance within a workgroup tests true, then all of the lockstepped workgroup instances must follow that branch in parallel).
And DarkZeros is correct: you should manually play with the local workgroup size, making it the highest common factor of the global size and the number of kernels which can be run at once on the card. The easiest way to do this is to round up the global work size to the next multiple of the card's capacity and use an external if{} statement in the kernel, only running the kernel body for global_id values less than the actual number of kernels you want to run.
Dave.
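A minimal host-side sketch of that rounding-up approach, assuming the C++ wrapper objects from the question and a hypothetical extra kernel argument carrying the real item count (the argument index, local size, and function names are illustrative); the kernel would then begin with if (get_global_id(0) >= actualItemCount) return;:

#include <CL/cl.hpp>

// Round the requested number of work-items up to a multiple of the chosen
// local size so that the local size evenly divides the global size.
size_t roundUp(size_t value, size_t multiple)
{
    return ((value + multiple - 1) / multiple) * multiple;
}

cl_int dispatch(cl::CommandQueue& dispatchQueue, cl::Kernel& kernel,
                size_t actualItemCount)
{
    const size_t localSize  = 256; // assumed; query the device/kernel for a good value
    const size_t globalSize = roundUp(actualItemCount, localSize);

    // Tell the kernel how many items are real so the padding work-items exit early.
    kernel.setArg(5, (cl_uint)actualItemCount); // argument index is illustrative

    return dispatchQueue.enqueueNDRangeKernel(
        kernel,
        cl::NullRange,
        cl::NDRange(globalSize),
        cl::NDRange(localSize));
}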

Pooling PBOs and textures?

I have an application which does a lot of GPGPU using Opengl and Pixel Buffer Objects to transfer and process data.
Currently I employ pooling of these resources; basically I have a pool for every buffer dimension and usage that my application uses. When a resource is no longer in use it returns to its respective pool for re-use. However, I'm starting to have second thoughts about whether there is any point in this, since I need to "orphan" the PBOs before re-use so as not to interfere with ongoing transfers.
My question is whether there is any merit in pooling resources such as PBOs and textures, or would it be just as good to simply allocate from OpenGL directly when needed?
Here is an example of what I am doing. Vice versa with textures.
std::shared_ptr<pbo> create_pbo(int size, bool write)
{
    auto pool = pbo_pools[write][size];

    std::shared_ptr<pbo> buffer;
    if (!pool->try_pop(buffer))
        buffer.reset(ogl_thread_.invoke([=] { return new pbo(size, write); }));

    // Hand out an aliasing pointer whose deleter returns the pbo to its pool
    // (on the OpenGL thread) instead of destroying it.
    return std::shared_ptr<pbo>(buffer.get(), [=](pbo*) mutable
    {
        ogl_thread_.begin_invoke([=]() mutable
        {
            if (write)
                buffer->map();
            else // read
                buffer->unmap();
            pool->push(buffer);
        });
    });
}
I'm starting to have second thoughts about whether there is any point in this, since I need to "orphan" the PBOs before re-use so as not to interfere with ongoing transfers.
No, you don't have to. That's the nice thing about PBOs: you can submit new data into them while a call to glTex(Sub)Image may still be reading from them, without the read operation being corrupted.
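A minimal sketch of that pattern, assuming an already-created texture and an unpack PBO of matching size (the function name, GL_BGRA format, and GLEW loader are illustrative):

#include <GL/glew.h> // or whichever GL loader the project already uses

void upload_then_refill(GLuint texture, GLuint pbo,
                        int width, int height,
                        const void* nextFrame, size_t frameBytes)
{
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);

    // Start the texture upload; with an unpack PBO bound, the last parameter
    // is a byte offset into the buffer, not a client-memory pointer.
    glBindTexture(GL_TEXTURE_2D, texture);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_BGRA, GL_UNSIGNED_BYTE, (const void*)0);

    // Refill the same PBO with the next frame. OpenGL guarantees the upload
    // above still sees the previous contents, so no manual orphaning is
    // needed for correctness.
    glBufferSubData(GL_PIXEL_UNPACK_BUFFER, 0, (GLsizeiptr)frameBytes, nextFrame);

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}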

Silence between played buffers in OpenAL?

I use alSourceQueueBuffers to stream buffers into an AL sound source. I have buffers of different sizes that need to be played one after another. So far so good; however, between some buffers I need a variable amount of silence. How can I add it programmatically?
Perhaps the easiest way would be to generate buffers that hold silence of the length needed and queue them appropriately. You just need to make an array full of zeros based on the sample rate and the desired length of silence and pass it into the buffer.
If you want things to be more complicated, then you can't queue all of the buffers up front. You queue the one that needs to play right now and set a timer for when it will be done (and the amount of silent time has also passed); then you queue the next buffer. Or you can poll the source to see if it has stopped and, when it does, start counting down the silent time (see the sketch at the end). You could also use the streaming functionality...
Edit:
This worked for me. The sample rate needs to be the same as the other buffers queued on your source. You could also keep a single 'greatest common denominator' length silence buffer and just queue it up multiple times.
int sampleRate = 22050;
double sTime = 2.5; // How long to maintain silence.
int sampleCount = int(sTime * sampleRate);
int byteCount = sampleCount * sizeof(short);
short* silence = (short*)malloc(byteCount);
memset(silence, 0, byteCount);
alBufferData(silenceBuffer, AL_FORMAT_MONO16, silence, byteCount, sampleRate);
alSourceQueueBuffers(mySource, 1, &silenceBuffer);
free(silence);
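And here is a rough sketch of the polling alternative mentioned above (function names are illustrative); AL_BUFFERS_PROCESSED reports how many queued buffers the source has finished with, and AL_SOURCE_STATE tells you when it has run dry so you can start timing the silent gap:

#include <AL/al.h>

// True once 'source' is no longer playing, i.e. everything queued on it has
// been consumed; the caller can then start counting down the silence.
bool sourceFinished(ALuint source)
{
    ALint state = AL_STOPPED;
    alGetSourcei(source, AL_SOURCE_STATE, &state);
    return state != AL_PLAYING;
}

// Unqueue any buffers the source has already played so they can be refilled
// and queued again later.
void reclaimProcessedBuffers(ALuint source)
{
    ALint processed = 0;
    alGetSourcei(source, AL_BUFFERS_PROCESSED, &processed);
    while (processed-- > 0)
    {
        ALuint buffer = 0;
        alSourceUnqueueBuffers(source, 1, &buffer);
        // 'buffer' can now be refilled with alBufferData and re-queued.
    }
}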