I have been trying to integrate Vulkan Memory Allocator into my application, but all I get is a blank screen. I use a deferred, HDR pipeline and allocate all my attachments with VMA. I tried outputting a solid color in the fragment shader and that worked, so I suspect the problem is that my vertex and index buffers are empty. I use a staging buffer to copy data into the vertex buffer. I fill the staging buffer like this:
void* data;
vmaMapMemory(_allocator, stagingAllocation, &data);
memcpy(data, bufferData, bufferSize);
vmaUnmapMemory(_allocator, stagingAllocation);
I also tried using the vertex buffer without a staging buffer, setting the flags
allocCreateInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT | VMA_ALLOCATION_CREATE_MAPPED_BIT;
and copying the data directly to the vertex buffer with
memcpy(allocInfo.pMappedData, bufferData, bufferSize);
but that does not seem to work either.
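For completeness, this is roughly how that mapped variant is created (a sketch rather than my exact code; I am assuming VMA 3.x here, where the HOST_ACCESS flags live and VMA_MEMORY_USAGE_AUTO is the recommended usage value, and allocInfo is the VmaAllocationInfo out-parameter of vmaCreateBuffer):

VkBufferCreateInfo bufferInfo{};
bufferInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
bufferInfo.size = bufferSize;
bufferInfo.usage = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT;
bufferInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

VmaAllocationCreateInfo allocCreateInfo = {};
allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;
allocCreateInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT |
                        VMA_ALLOCATION_CREATE_MAPPED_BIT;

VmaAllocation allocation;
VmaAllocationInfo allocInfo;  // pMappedData stays valid because of MAPPED_BIT
vmaCreateBuffer(_allocator, &bufferInfo, &allocCreateInfo, &vertexBuffer, &allocation, &allocInfo);

memcpy(allocInfo.pMappedData, bufferData, bufferSize);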
Full code:
VkBuffer stagingBuffer;
VkBuffer vertexBuffer;
VmaAllocation stagingAllocation = createBuffer(bufferSize, VK_BUFFER_USAGE_TRANSFER_SRC_BIT, VMA_MEMORY_USAGE_CPU_ONLY, stagingBuffer);
void* data;
vmaMapMemory(_allocator, stagingAllocation, &data);
memcpy(data, bufferData, bufferSize);
vmaUnmapMemory(_allocator, stagingAllocation);
VmaAllocation allocation = createBuffer(bufferSize, VK_BUFFER_USAGE_TRANSFER_DST_BIT | VK_BUFFER_USAGE_VERTEX_BUFFER_BIT, VMA_MEMORY_USAGE_GPU_ONLY, vertexBuffer);
copyBuffer(stagingBuffer, vertexBuffer, bufferSize);
vmaDestroyBuffer(_allocator, stagingBuffer, stagingAllocation);
createBuffer function:
VmaAllocation createBuffer(VkDeviceSize size, VkBufferUsageFlags usage, VmaMemoryUsage memoryUsage, VkBuffer &buffer) {
    VkBufferCreateInfo bufferInfo{};
    bufferInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
    bufferInfo.size = size;
    bufferInfo.usage = usage;
    bufferInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

    VmaAllocationCreateInfo allocCreateInfo = {};
    allocCreateInfo.usage = memoryUsage;

    VmaAllocation allocation;
    if (vmaCreateBuffer(_allocator, &bufferInfo, &allocCreateInfo, &buffer, &allocation, nullptr) != VK_SUCCESS) {
        throw std::runtime_error("Failed to create buffer");
    }

    return allocation;
}
If I set allocCreateInfo.flags to 0, I get weird flickering.
I have looked at many examples and all of them seem to be similar to my code.
I should also mention that it was working before I started using VMA and I have only modified the code I listed (plus the same code for image creation, but that seems to work fine).
I am on macOS, Vulkan 1.2.
Any help would be greatly appreciated.
EDIT:
So after some hours of debugging, I figured it out. The problem was not in my vertex buffers, but in my uniform buffers. I was using the flag VMA_MEMORY_USAGE_CPU_TO_GPU; after changing it to VMA_MEMORY_USAGE_CPU_ONLY, it started to work. This is only a temporary solution, since I do not want to keep all my uniform buffers in CPU memory. If somebody knows what is wrong, please let me know.
Thank you
EDIT 2:
So it turns out that VMA does not add VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT or VK_MEMORY_PROPERTY_HOST_COHERENT_BIT automatically when you set the usage to VMA_MEMORY_USAGE_CPU_TO_GPU. You need to include these flags in allocCreateInfo.requiredFlags. After doing this, it works fine. However, I could not find this in the docs, so it was quite hard to debug.
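For anyone who runs into the same thing, the allocation create info for my uniform buffers now looks like this (just the relevant part):

VmaAllocationCreateInfo allocCreateInfo = {};
allocCreateInfo.usage = VMA_MEMORY_USAGE_CPU_TO_GPU;
// Explicitly require host-visible, host-coherent memory for the uniform buffers.
allocCreateInfo.requiredFlags = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
                                VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;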
I have been porting my RabbitCT CUDA implementation to OpenCL and I'm running into issues with pinned memory.
For CUDA a host buffer is created that buffers the input images to be processed in pinned memory. This allows the host to catch the next batch of input images while the GPU processes the current batch. A simplified mockup of my CUDA implementation is as follows:
// globals
float** hostProjBuffer = new float*[BUFFER_SIZE];
float* devProjection[STREAMS_MAX];
cudaStream_t stream[STREAMS_MAX];

void initialize()
{
    // initiate streams
    for( uint s = 0; s < STREAMS_MAX; s++ ){
        cudaStreamCreateWithFlags (&stream[s], cudaStreamNonBlocking);
        cudaMalloc( (void**)&devProjection[s], imgSize);
    }

    // initiate buffers
    for( uint b = 0; b < BUFFER_SIZE; b++ ){
        cudaMallocHost((void **)&hostProjBuffer[b], imgSize);
    }
}
// main function called for all input images
void backproject(imgdata* r)
{
    uint projNr = r->imgnr % BUFFER_SIZE;
    uint streamNr = r->imgnr % STREAMS_MAX;

    // When buffer is filled, wait until work in current stream has finished
    if(projNr == 0) {
        cudaStreamSynchronize(stream[streamNr]);
    }

    // copy received image data to buffer (maps double precision to float)
    std::copy(r->I_n, r->I_n+(imgSizeX * imgSizeY), hostProjBuffer[projNr]);

    // copy image and matrix to device
    cudaMemcpyAsync( devProjection[streamNr], hostProjBuffer[projNr], imgSize, cudaMemcpyHostToDevice, stream[streamNr] );

    // call kernel
    backproject<<<numBlocks, threadsPerBlock, 0 , stream[streamNr]>>>(devProjection[streamNr]);
}
So, for CUDA, I create a pinned host pointer for each buffer item and copy the data to the device before executing the kernel in each stream.
For OpenCL I initially did something similar when following the Nvidia OpenCL Best Practices Guide. Here they recommend creating two buffers, one for copying the kernel data to and one for the pinned memory. However, this leads to the implementation using double the device memory as both the kernel and pinned memory buffers are allocated on the device.
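For reference, that two-buffer approach looks roughly like this (a sketch, not code from the guide itself; the variable names are mine, and error handling is omitted):

// Pinned host-side staging buffer: CL_MEM_ALLOC_HOST_PTR asks the driver for pinned memory.
cl_mem pinnedBuffer = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                     imgSize, NULL, &status);

// Separate device-side buffer that the kernel actually reads from.
cl_mem deviceBuffer = clCreateBuffer(context, CL_MEM_READ_ONLY, imgSize, NULL, &status);

// Map the pinned buffer once (e.g. during initialization) to get a host pointer to write into.
float* pinnedHostPtr = (float*) clEnqueueMapBuffer(queue[0], pinnedBuffer, CL_TRUE,
                                                   CL_MAP_WRITE, 0, imgSize,
                                                   0, NULL, NULL, &status);

// Per image: fill the pinned region, then copy it asynchronously into the device buffer.
std::copy(r->I_n, r->I_n + (imgSizeX * imgSizeY), pinnedHostPtr);
clEnqueueWriteBuffer(queue[streamNr], deviceBuffer, CL_FALSE, 0, imgSize,
                     pinnedHostPtr, 0, NULL, NULL);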
To get around this memory issue, I created an implementation where only a mapping is made to the device as it is needed. This can be seen in the following implementation:
// globals
float** hostProjBuffer = new float* [BUFFER_SIZE];
cl_mem devProjection[STREAMS_MAX], devMatrix[STREAMS_MAX];
cl_command_queue queue[STREAMS_MAX];

// initiate streams
void initialize()
{
    for( uint s = 0; s < STREAMS_MAX; s++ ){
        queue[s] = clCreateCommandQueueWithProperties(context, device, NULL, &status);
        devProjection[s] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, imgSize, NULL, &status);
    }
}
// main function called for all input images
void backproject(imgdata* r)
{
    const uint projNr = r->imgnr % BUFFER_SIZE;
    const uint streamNr = r->imgnr % STREAMS_MAX;

    // when buffer is filled, wait until work in current stream has finished
    if(projNr == 0) {
        status = clFinish(queue[streamNr]);
    }

    // map host memory region to device buffer
    hostProjBuffer[projNr] = (float*) clEnqueueMapBuffer(queue[streamNr], devProjection[streamNr], CL_FALSE, CL_MAP_WRITE_INVALIDATE_REGION, 0, imgSize, 0, NULL, NULL, &status);

    // copy received image data to hostbuffers
    std::copy(imgPtr, imgPtr + (imgSizeX * imgSizeY), hostProjBuffer[projNr]);

    // unmap the allocated pinned host memory
    clEnqueueUnmapMemObject(queue[streamNr], devProjection[streamNr], hostProjBuffer[projNr], 0, NULL, NULL);

    // set stream specific arguments
    clSetKernelArg(kernel, 0, sizeof(devProjection[streamNr]), (void *) &devProjection[streamNr]);

    // launch kernel
    clEnqueueNDRangeKernel(queue[streamNr], kernel, 3, NULL, global_work_size, local_work_size, 0, NULL, NULL);

    clFlush(queue[streamNr]);
    clFinish(queue[streamNr]); //should be removed!
}
This implementation uses a similar amount of device memory to the CUDA implementation. However, I have been unable to get this last code example working without a clFinish after each loop, which significantly hampers the performance of the application. This suggests data is lost as the host moves ahead of the kernel. I tried increasing my buffer size to the number of input images, but this did not work either. So somehow during execution, the hostProjBuffer data gets lost.
So, with the goal to write OpenCL code similar to CUDA, I have three questions:
What is the recommended implementation for OpenCL pinned memory?
Is my OpenCL implementation similar to how CUDA handles pinned memory?
What causes the wrong data to be used in the OpenCL example?
Thanks in advance!
Kind regards,
Remy
PS: Question initially asked at the Nvidia developer forums
I am pretty new to linux aio (libaio) and am hoping someone here with more experience can help me out.
I have a system that performs high-speed DMA from a PCI device to the system RAM. I then want to write the data to disk using libaio directly from the DMA address space. Currently the memory used for DMA is reserved on boot through the use of the "memmap" kernel boot command.
In my application, the memory is mapped with mmap to a virtual userspace address pointer which I believe I should be able to pass as my buffer to the io_prep_pwrite() call. However, when I do this, the write does not seem to succeed. When the request is reaped with io_getevents the "res" field has error code -22 which prints as "Bad Address".
However, if I do a memcpy from the previously mapped memory location to a new buffer allocated by my application, and then use the pointer to this local buffer in the call to io_prep_pwrite, the request works just fine and the data is written to disk. The problem is that performing the memcpy creates a bottleneck for the speed at which I need to stream data to disk, and I am unable to keep up with the DMA rate.
I do not understand why I have to copy the memory first for the write request to work. I created a minimal example to illustrate the issue below. The bufBaseLoc is the mapped address and the localBuffer is the address the data is copied to. I do not want to have to perform the following line:
memcpy(localBuffer, bufBaseLoc, WRITE_SIZE);
...or have the localBuffer at all. In the "Prepare IOCB" section I want to use:
io_prep_pwrite(iocb, fid, bufBaseLoc, WRITE_SIZE, 0);
...but it does not work. But the local buffer, which is just a copy, does work.
Does anyone have any insight into why? Any help would be greatly appreciated.
Thanks,
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cstdint>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <libaio.h>

#define WRITE_SIZE 0x80000           //512KB buffer
#define DMA_BASE_ADDRESS 0x780000000 //Physical address of memory reserved at boot

//Reaping callback
void writeCallback(io_context_t ctx, struct iocb *iocb, long res, long res2)
{
    //Res is the number of bytes written by the request. It should match the requested IO size. If negative it is an error code.
    if(res != (long)iocb->u.c.nbytes)
    {
        fprintf(stderr, "WRITE_ERROR: %s\n", strerror(-res));
    }
    else
    {
        fprintf(stderr, "Success\n");
    }
}

int main()
{
    //Initialize Kernel AIO
    io_context_t ctx = 0;
    io_queue_init(256, &ctx);

    //Open /dev/mem and map physical address to userspace
    int fdmem = open("/dev/mem", O_RDWR);
    void *bufBaseLoc = mmap(NULL, WRITE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fdmem, DMA_BASE_ADDRESS);

    //Set memory here for test (normally done by DMA)
    memset(bufBaseLoc, 1, WRITE_SIZE);

    //Initialize local memory buffer (DON'T WANT TO HAVE TO DO THIS)
    uint8_t* localBuffer;
    posix_memalign((void**)&localBuffer, 4096, WRITE_SIZE);
    memset(localBuffer, 1, WRITE_SIZE);

    //Open/allocate file on disk
    std::string filename = "tmpData.dat";
    int fid = open(filename.c_str(), O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH);
    posix_fallocate(fid, 0, WRITE_SIZE);

    //Copy from DMA buffer to local process buffer (THIS IS THE COPY I WANT TO AVOID)
    memcpy(localBuffer, bufBaseLoc, WRITE_SIZE);

    //Prepare IOCB
    struct iocb *iocbs[1];
    struct iocb *iocb = (struct iocb*) malloc(sizeof (struct iocb));
    io_prep_pwrite(iocb, fid, localBuffer, WRITE_SIZE, 0); //<--THIS WORKS (but is not what I want to do)
    //io_prep_pwrite(iocb, fid, bufBaseLoc, WRITE_SIZE, 0); //<--THIS DOES NOT WORK (but is what I want to do)
    io_set_callback(iocb, writeCallback);
    iocbs[0] = iocb;

    //Send request
    int res = io_submit(ctx, 1, iocbs);
    if (res != 1)
    {
        fprintf(stderr, "IO_SUBMIT_ERROR: %s\n", strerror(-res));
    }

    //Reap request
    struct io_event events[1];
    int ret = io_getevents(ctx, 1, 1, events, 0);
    if (ret == 1)
    {
        io_callback_t cb = (io_callback_t)events[0].data;
        struct iocb *iocb_e = events[0].obj;
        cb(ctx, iocb_e, events[0].res, events[0].res2);
    }
}
It could be that your original buffer is not aligned.
Any buffer that libaio writes to disk needs to be aligned (to around 4K, I think). When you allocate your temporary buffer you use posix_memalign, which allocates it correctly aligned, so the write succeeds. If your original reserved buffer is not aligned, that could be causing your issues.
If you can, try to allocate that initial buffer with an aligned allocation; it could solve the issue.
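A quick way to check this hypothesis is to test the mapped pointer, the transfer length, and the file offset against the 4096-byte boundary that O_DIRECT typically requires (a small sketch; it only detects misalignment, it does not fix it):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <sys/types.h>

// Returns true if the buffer address, the transfer size, and the file offset
// are all multiples of the given alignment (4096 is a safe choice for O_DIRECT).
bool isDirectIoAligned(const void* buf, size_t len, off_t offset, size_t alignment = 4096)
{
    return (reinterpret_cast<uintptr_t>(buf) % alignment == 0) &&
           (len % alignment == 0) &&
           (static_cast<size_t>(offset) % alignment == 0);
}

// Example: check the mapped DMA buffer before calling io_prep_pwrite on it.
// printf("aligned: %d\n", isDirectIoAligned(bufBaseLoc, WRITE_SIZE, 0));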
So I am making a sort of tower defense game. I shared a build with a friend so I could check whether everything performs as it should on another host.
And what actually happens is that, while everything renders perfectly on my side (both on my Mac with Xcode and on Windows with Visual Studio 2012), on my friend's side the geometry seems to be messed up. Each object on my screen is represented by a VBO which I reuse every time to render to different locations. But it seems like my VBOs contain the geometry imported from all models (hence the tower with the tree on the side).
Here's the result:
(screenshot from my computer) (screenshot from my friend's computer)
By now I've managed to debug this issue up to a certain point. I can tell it's not the way I'm importing my models, because I write a debug.txt file with all the vectors before I send them to the GPU as VBOs, and on both computers they output the same values. So the vectors are not getting messed up by memory issues or anything like that. So I am thinking maybe it is the way I am setting up or rendering my VBOs.
What strikes me the most, though, is why things work on my computer while they do not on my friend's computer. One difference I know for sure is that my computer is a developer station (whatever that means exactly) while my friend's computer is not.
This is my VBO loading function and my VBO drawing function:
I use GLFW to create my window and context and include GLEW headers to enable some of the newer OpenGL functions.
void G4::Renderer::LoadObject(
    G4::TILE_TYPES aType,
    std::vector<float> &v_data,
    std::vector<float> &n_data,
    std::vector<float> &t_data,
    float scale,
    bool has_texture,
    unsigned int texture_id
)
{
    unsigned int vertices_id, vertices_size, normals_id, texturecoordinates_id;

    vertices_size = static_cast<unsigned int>(v_data.size());

    glGenBuffers(1, &vertices_id);
    glGenBuffers(1, &normals_id);

    //::->Vertex array buffer upload.
    glBindBuffer(GL_ARRAY_BUFFER, vertices_id);
    glBufferData(GL_ARRAY_BUFFER, sizeof(float)*v_data.size(), &v_data.front(), GL_STATIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);

    //::->Normal array buffer upload.
    glBindBuffer(GL_ARRAY_BUFFER, normals_id);
    glBufferData(GL_ARRAY_BUFFER, sizeof(float)*n_data.size(), &n_data.front(), GL_STATIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);

    if (has_texture)
    {
        glGenBuffers(1, &texturecoordinates_id);
        glBindBuffer(GL_ARRAY_BUFFER, texturecoordinates_id);
        glBufferData(GL_ARRAY_BUFFER, sizeof(float)*t_data.size(), &(t_data[0]), GL_STATIC_DRAW);
        glBindBuffer(GL_ARRAY_BUFFER, 0);
    }

    this->vbos[aType].Update(vertices_id, vertices_size, normals_id, texture_id, texturecoordinates_id, scale, has_texture);
}
Draw code:
void G4::Renderer::DrawUnit(G4::VBO aVBO, bool drawWithColor, float r, float g, float b, float a)
{
    bool model_has_texture = aVBO.HasTexture();

    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_NORMAL_ARRAY);

    if (model_has_texture && !drawWithColor) {
        glEnableClientState(GL_TEXTURE_COORD_ARRAY);
        glEnable(GL_TEXTURE_2D);
    }

    if (drawWithColor)
    {
        glColor4f(r, g, b, a);
    }

    glScalef(aVBO.GetScaleValue(), aVBO.GetScaleValue(), aVBO.GetScaleValue());

    glBindBuffer(GL_ARRAY_BUFFER, aVBO.GetVerticesID());
    glVertexPointer(3, GL_FLOAT, 0, 0);
    glBindBuffer(GL_ARRAY_BUFFER, 0);

    glBindBuffer(GL_ARRAY_BUFFER, aVBO.GetNormalsID());
    glNormalPointer(GL_FLOAT, 0, 0);
    glBindBuffer(GL_ARRAY_BUFFER, 0);

    if (model_has_texture && !drawWithColor)
    {
        glBindTexture(GL_TEXTURE_2D, aVBO.GetTextureID());
        glBindBuffer(GL_ARRAY_BUFFER, aVBO.GetTextureCoordsID());
        glTexCoordPointer(2, GL_FLOAT, 0, 0);
        glBindBuffer(GL_ARRAY_BUFFER, 0);
    }

    glDrawArrays(GL_TRIANGLES, 0, aVBO.GetVerticesSize());

    if (model_has_texture && !drawWithColor) {
        glDisableClientState(GL_TEXTURE_COORD_ARRAY);
        glDisable(GL_TEXTURE_2D);
    }

    glDisableClientState(GL_NORMAL_ARRAY);
    glDisableClientState(GL_VERTEX_ARRAY);
}
I'm out of ideas; I hope someone can direct me on how to debug this any further.
The OpenGL spec does not specify the exact behaviour that should occur when you issue a draw call with more vertices than your buffer stores. The reason this may work correctly on one machine and not on another comes down to implementation. Each vendor is free to do whatever they want in this situation, so the render artifacts might show up on AMD hardware but not at all on nVIDIA or Intel. Making matters worse, there is actually no error state generated by a call to glDrawArrays (...) when it is asked to draw too many vertices. You definitely need to test your software on hardware sourced from multiple vendors to catch these sorts of errors; who manufactures the GPU in your computer and which driver version you run are just as important as the operating system and compiler.
Nevertheless, there are ways to catch these silly mistakes. gDEBugger is one, and there is also a new OpenGL extension I will discuss below. I prefer to use the new extension because in my experience, in addition to deprecated API calls and errors (which gDEBugger can be configured to monitor), the extension can also warn you about using inefficiently aligned data structures and various other portability and performance issues.
I wanted to add some code I use to enable OpenGL Debug Output in my software, since this is an example of errant behaviour that does not actually generate an error you can catch with glGetError (...). Sometimes you can catch these mistakes with Debug Output (though, I just tested it and this is not one of those situations). You will need an OpenGL Debug Context for this to work (the process of setting this up is highly platform dependent), but it is a context flag just like forward/backward compatible and core (glfw should make this easy for you).
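For example, if you are on GLFW 3 the debug flag is just a window hint set before the window and context are created (a sketch; on GLFW 2.x I believe the equivalent hint goes through glfwOpenWindowHint):

// Ask for a debug context before creating the window; the hint must be set
// before glfwCreateWindow and has no effect on an existing context.
glfwWindowHint(GLFW_OPENGL_DEBUG_CONTEXT, GL_TRUE);
GLFWwindow* window = glfwCreateWindow(1280, 720, "Tower Defense", NULL, NULL);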
Automatic breakpoint macro for x86 platforms
// Breakpoints that should ALWAYS trigger (EVEN IN RELEASE BUILDS) [x86]!
#ifdef _MSC_VER
# define eTB_CriticalBreakPoint() if (IsDebuggerPresent ()) __debugbreak ();
#else
# define eTB_CriticalBreakPoint() asm (" int $3");
#endif
Enable OpenGL Debug Output (requires a Debug Context and a relatively recent driver, OpenGL 4.x era)
// SUPER VERBOSE DEBUGGING!
if (glDebugMessageControlARB != NULL) {
    glEnable (GL_DEBUG_OUTPUT_SYNCHRONOUS_ARB);
    glDebugMessageControlARB (GL_DONT_CARE, GL_DONT_CARE, GL_DONT_CARE, 0, NULL, GL_TRUE);
    glDebugMessageCallbackARB ((GLDEBUGPROCARB)ETB_GL_ERROR_CALLBACK, NULL);
}
Some important utility functions to replace enumerant values with more meaningful text
const char*
ETB_GL_DEBUG_SOURCE_STR (GLenum source)
{
    static const char* sources [] = {
        "API", "Window System", "Shader Compiler", "Third Party", "Application",
        "Other", "Unknown"
    };

    // Clamp to the last ("Unknown") entry so out-of-range enums cannot index past the array.
    int str_idx =
        min ( source - GL_DEBUG_SOURCE_API,
              sizeof (sources) / sizeof (const char *) - 1 );

    return sources [str_idx];
}

const char*
ETB_GL_DEBUG_TYPE_STR (GLenum type)
{
    static const char* types [] = {
        "Error", "Deprecated Behavior", "Undefined Behavior", "Portability",
        "Performance", "Other", "Unknown"
    };

    int str_idx =
        min ( type - GL_DEBUG_TYPE_ERROR,
              sizeof (types) / sizeof (const char *) - 1 );

    return types [str_idx];
}

const char*
ETB_GL_DEBUG_SEVERITY_STR (GLenum severity)
{
    static const char* severities [] = {
        "High", "Medium", "Low", "Unknown"
    };

    int str_idx =
        min ( severity - GL_DEBUG_SEVERITY_HIGH,
              sizeof (severities) / sizeof (const char *) - 1 );

    return severities [str_idx];
}

DWORD
ETB_GL_DEBUG_SEVERITY_COLOR (GLenum severity)
{
    static DWORD severities [] = {
        0xff0000ff, // High (Red)
        0xff00ffff, // Med  (Yellow)
        0xff00ff00, // Low  (Green)
        0xffffffff  // ???  (White)
    };

    int col_idx =
        min ( severity - GL_DEBUG_SEVERITY_HIGH,
              sizeof (severities) / sizeof (DWORD) - 1 );

    return severities [col_idx];
}
My Debug Output Callback (somewhat messy, because it prints each field in a different color in my software)
void
ETB_GL_ERROR_CALLBACK (GLenum        source,
                       GLenum        type,
                       GLuint        id,
                       GLenum        severity,
                       GLsizei       length,
                       const GLchar* message,
                       GLvoid*       userParam)
{
    eTB_ColorPrintf (0xff00ffff, "OpenGL Error:\n");
    eTB_ColorPrintf (0xff808080, "=============\n");

    eTB_ColorPrintf (0xff6060ff, " Object ID: ");
    eTB_ColorPrintf (0xff0080ff, "%d\n", id);

    eTB_ColorPrintf (0xff60ff60, " Severity:  ");
    eTB_ColorPrintf ( ETB_GL_DEBUG_SEVERITY_COLOR (severity),
                      "%s\n",
                      ETB_GL_DEBUG_SEVERITY_STR (severity) );

    eTB_ColorPrintf (0xffddff80, " Type:      ");
    eTB_ColorPrintf (0xffccaa80, "%s\n", ETB_GL_DEBUG_TYPE_STR (type));

    eTB_ColorPrintf (0xffddff80, " Source:    ");
    eTB_ColorPrintf (0xffccaa80, "%s\n", ETB_GL_DEBUG_SOURCE_STR (source));

    eTB_ColorPrintf (0xffff6060, " Message:   ");
    eTB_ColorPrintf (0xff0000ff, "%s\n\n", message);

    // Force the console to flush its contents before executing a breakpoint
    eTB_FlushConsole ();

    // Trigger a breakpoint in gDEBugger...
    glFinish ();

    // Trigger a breakpoint in traditional debuggers...
    eTB_CriticalBreakPoint ();
}
Since I could not actually get your situation to trigger a debug output event, I figured I would at least show an example of an event I was able to trigger. This is not an error that you can catch with glGetError (...), or an error at all for that matter. But it is certainly a draw call issue that you might be completely oblivious to for the duration of your project without using this extension :)
OpenGL Error:
=============
Object ID: 102
Severity: Medium
Type: Performance
Source: API
Message: glDrawElements uses element index type 'GL_UNSIGNED_BYTE' that is not optimal for the current hardware configuration; consider using 'GL_UNSIGNED_SHORT' instead.
After further debugging sessions with my friends and a lot of trial and error, I managed to find the problem. It took me two solid days to figure out, and really it was just a silly mistake.
glDrawArrays(GL_TRIANGLES, 0, aVBO.GetVerticesSize());
The above call does not pass the number of vertices (points) but the total number of floats stored in the buffer, so the count is three times too large. Adding a /3 solved it.
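In other words, the corrected draw call is simply:
glDrawArrays(GL_TRIANGLES, 0, aVBO.GetVerticesSize() / 3);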
So I assume that, since the vertex count was three times too large, the VBO "stole" data from other VBOs stored on the GPU (hence the tree model stuck to my tower).
What I can't figure out yet, though, and would like an answer on, is why everything rendered fine on my computer but not on other computers. As I state in my original question, a hint might be that my computer is a developer station while my friend's is not.
If anyone is kind enough to explain why this effect doesn't reproduce on my machine, I will gladly accept their answer as the solution to my problem.
Thank you
After rearranging some OpenCL/GL interop code, it stopped working and gave the error in the title.
I have already read many threads that discuss this issue and blame the (Nvidia) driver for it. But since the code ran before, this should not be my problem.
Code
As I already said, I have too much code to post all of it. I have also not been able to reproduce the behaviour in a minimal example so far, so I am going to describe most of the program in pseudocode and parts of it in excerpts from my real code.
I use Qt, so the following functions are all members of my subclass of QGLWidget:
void initializeGL()
{
    glewInit();

    cl::Platform::get(&clPlatforms);

    cl_context_properties props[] = {
        CL_GL_CONTEXT_KHR, (cl_context_properties)glXGetCurrentContext(),
        CL_GLX_DISPLAY_KHR, (cl_context_properties)glXGetCurrentDisplay(),
        CL_CONTEXT_PLATFORM, (cl_context_properties)clPlatforms[0](),
        0
    };
    clContext = cl::Context(CL_DEVICE_TYPE_GPU, props, cl_notify, &err);

    std::vector<cl::Device> devices = clContext.getInfo<CL_CONTEXT_DEVICES>();
    clDevice = devices[0];

    clQueue = cl::CommandQueue(clContext, clDevice, 0, 0);

    clProgram = cl::Program(clContext, sources);
    clProgram.build();
    clKernel = cl::Kernel(clProgram, "main");

    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, bufferSize, buffer, GL_DYNAMIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);

    vboBuffer = cl::BufferGL(clContext, CL_MEM_READ_WRITE, vbo);
    clKernel.setArg(0, vboBuffer);
}
void paintGL()
{
    simulate();

    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    //camera transformations ...

    glEnableClientState(GL_VERTEX_ARRAY);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glVertexPointer(3, GL_FLOAT, sizeof(cl_float3), 0);
    glDrawElements(GL_TRIANGLES, 3 * numTriangles, GL_UNSIGNED_INT, triangleIdicesArray);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}
void simulate()
{
    std::vector<cl::Memory> sharedObjects;
    sharedObjects.push_back(vboBuffer);

    clQueue.enqueueAcquireGLObjects(&sharedObjects); //this triggers the error
    clQueue.enqueueNDRangeKernel(clKernel, cl::NullRange, cl::NDRange(numVertices), cl::NullRange);
    clQueue.enqueueReleaseGLObjects(&sharedObjects);
}
Some further investigation of clEnqueueAcquireGLObjects turned this up (in the notes section):
Prior to calling clEnqueueAcquireGLObjects, the application must ensure that any pending GL operations which access the objects specified in mem_objects have completed. This may be accomplished portably by issuing and waiting for completion of a glFinish command on all GL contexts with pending references to these objects.
Adding a finish on the OpenCL command queue before and after the acquire/release got rid of these errors.
As with other errors, I find the error code misleading, and the notify function did not add any valuable information (it said the same thing).
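Putting the note and the fix together, the simulate function from above might end up looking something like this (a sketch; I use a glFinish before the acquire as the spec note suggests and a queue finish after the release, but the exact combination that is needed may differ):

void simulate()
{
    std::vector<cl::Memory> sharedObjects;
    sharedObjects.push_back(vboBuffer);

    // Make sure all pending GL work that touches the VBO has completed (see the spec note above).
    glFinish();

    clQueue.enqueueAcquireGLObjects(&sharedObjects);
    clQueue.enqueueNDRangeKernel(clKernel, cl::NullRange, cl::NDRange(numVertices), cl::NullRange);
    clQueue.enqueueReleaseGLObjects(&sharedObjects);

    // Make sure the CL work has completed before GL draws from the VBO again.
    clQueue.finish();
}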
I need to parse a video stream (MPEG-TS) from a proprietary network protocol (which I already know how to do) and then use OpenCV to process the stream into frames. I know how to use cv::VideoCapture with a file or a standard URL, but I would like to set up OpenCV to read from a buffer (or buffers) in memory where I can store the video stream data until it is needed. Is there a way to set up a callback method (or any other interface) so that I can still use the cv::VideoCapture object? Is there a better way to process the video without writing it out to a file and then re-reading it? I would also consider using FFMPEG directly if that is a better choice. I think I can convert AVFrames to Mat if needed.
I had a similar need recently. I was looking for a way in OpenCV to play a video that was already in memory, but without ever having to write the video file to disk. I found out that the FFMPEG interface already supports this through av_open_input_stream. There is just a little more prep work required compared to the av_open_input_file call used in OpenCV to open a file.
Between the following two websites I was able to piece together a working solution using the ffmpeg calls. Please refer to the information on these websites for more details:
http://ffmpeg.arrozcru.org/forum/viewtopic.php?f=8&t=1170
http://cdry.wordpress.com/2009/09/09/using-custom-io-callbacks-with-ffmpeg/
To get it working in OpenCV, I ended up adding a new function to the CvCapture_FFMPEG class:
virtual bool openBuffer( unsigned char* pBuffer, unsigned int bufLen );
I provided access to it through a new API call in the highgui DLL, similar to cvCreateFileCapture. The new openBuffer function is basically the same as the open( const char* _filename ) function with the following difference:
err = av_open_input_file(&ic, _filename, NULL, 0, NULL);
is replaced by:
ic = avformat_alloc_context();
ic->pb = avio_alloc_context(pBuffer, bufLen, 0, pBuffer, read_buffer, NULL, NULL);
if(!ic->pb) {
    // handle error
}

// Need to probe buffer for input format unless you already know it
AVProbeData probe_data;
probe_data.buf_size = (bufLen < 4096) ? bufLen : 4096;
probe_data.filename = "stream";
probe_data.buf = (unsigned char *) malloc(probe_data.buf_size);
memcpy(probe_data.buf, pBuffer, probe_data.buf_size);

AVInputFormat *pAVInputFormat = av_probe_input_format(&probe_data, 1);
if(!pAVInputFormat)
    pAVInputFormat = av_probe_input_format(&probe_data, 0);

// cleanup
free(probe_data.buf);
probe_data.buf = NULL;

if(!pAVInputFormat) {
    // handle error
}

pAVInputFormat->flags |= AVFMT_NOFILE;

err = av_open_input_stream(&ic, ic->pb, "stream", pAVInputFormat, NULL);
Also, make sure to call av_close_input_stream in the CvCapture_FFMPEG::close() function instead of av_close_input_file in this situation.
Now the read_buffer callback function that is passed in to avio_alloc_context I defined as:
static int read_buffer(void *opaque, uint8_t *buf, int buf_size)
{
    // This function must fill the buffer with data and return the number of bytes copied.
    // opaque is the pointer to private data in the call to avio_alloc_context (4th param).
    memcpy(buf, opaque, buf_size);
    return buf_size;
}
This solution assumes the entire video is contained in a memory buffer and would probably have to be tweaked to work with streaming data.
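For example, to make the callback usable beyond a single read, the opaque pointer could carry a small state struct with the buffer and a running offset, passed as the opaque (4th) argument to avio_alloc_context instead of pBuffer (a sketch; BufferData and its fields are made-up names for illustration):

struct BufferData {
    uint8_t* base;   // start of the in-memory video data
    size_t   size;   // total number of bytes available
    size_t   offset; // how far the demuxer has read so far
};

static int read_buffer(void *opaque, uint8_t *buf, int buf_size)
{
    BufferData* bd = (BufferData*) opaque;
    size_t remaining = bd->size - bd->offset;
    size_t toCopy = (remaining < (size_t)buf_size) ? remaining : (size_t)buf_size;

    if (toCopy == 0)
        return AVERROR_EOF; // no more data; for live streaming you would block or retry here instead

    memcpy(buf, bd->base + bd->offset, toCopy);
    bd->offset += toCopy;
    return (int)toCopy;
}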
So that's it! Btw, I'm using OpenCV version 2.1 so YMMV.
Code that does something similar to the above, for OpenCV 4.2.0, is at:
https://github.com/jcdutton/opencv
Branch: 4.2.0-jcd1
It assumes the entire file has been loaded into RAM, pointed to by buffer, with size buffer_size.
Sample code:
VideoCapture d_reader1;
d_reader1.open_buffer(buffer, buffer_size);
d_reader1.read(input1);
The above code reads the first frame of video.