Does glMapBuffer occupy GPU time? - opengl

A common usage of glMapBuffer is
previousPBO.render();
bindNextPBO();
GLubyte* src = (GLubyte*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if (src) {
    doSomeWork(src);
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
Two PBOs are used in an alternating (ping-pong) fashion.
However, doSomeWork() may sometimes run in another thread. With the code above, the current thread must wait for doSomeWork() to finish. An alternative is:
previousPBO.render();
bindNextPBO();
if (currentPBO.mapped) {
    currentPBO.mapped = false;
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
GLubyte* src = (GLubyte*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if (src) {
    currentPBO.mapped = true;
    doSomeWork(src);
}
In this case, the map/unmap cycle of the same PBO spans two frames.
Does the PBO remaining mapped (not yet unmapped) stall GPU rendering?
Is there any other negative effect on performance?

The second code snippet is just WRONG. What if doSomeWork() doesn't finish in time? It would then be accessing memory that you have already unmapped, which will cause an access violation (or worse).
A better approach is to wait for completion of the other thread and unmap the buffer at the end of the frame just before SwapBuffers.
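For illustration, here is a minimal sketch of that frame structure. It reuses the question's own helpers (previousPBO, bindNextPBO, doSomeWork); the std::thread worker and the final present step are assumptions about the surrounding application, not the original code.

#include <thread>

// Per-frame sketch: map, hand the pointer to a worker thread, issue the rest
// of the frame's GL work, then join and unmap just before presenting.
previousPBO.render();
bindNextPBO();

GLubyte* src = (GLubyte*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
std::thread worker;
if (src)
    worker = std::thread(doSomeWork, src);   // process the mapped data off-thread

// ... issue the rest of the frame's GL commands here ...

if (worker.joinable())
    worker.join();                           // make sure doSomeWork() is done with the pointer
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);         // only now is it safe to unmap
// ... present the frame (SwapBuffers / glfwSwapBuffers, depending on the windowing layer) ...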

Related

glfwSwapBuffers slow (>3s)

Paul Aner is looking for a canonical answer:
I think the reason for this question is clear: I want the main loop to NOT lock up while a compute shader is processing larger amounts of data. I could try to separate the data into smaller chunks, but if the computations were done on the CPU, I would simply start a thread and everything would run nice and smoothly. Although I would of course have to wait until the calculation thread delivers new data to update the screen, the GUI (ImGui) would not lock up...
I have written a program that does some calculations on a compute shader, and the returned data is then displayed. This works perfectly, except that program execution is blocked while the shader is running (see code below), and depending on the parameters this can take a while:
void CalculateSomething(GLfloat* Result)
{
    // load some uniform variables
    glDispatchCompute(X, Y, 1);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
    GLfloat* mapped = (GLfloat*)(glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY));
    memcpy(Result, mapped, sizeof(GLfloat) * X * Y);
    glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
}
int main()
{
    // Initialization stuff
    // ...
    while (glfwWindowShouldClose(Window) == 0)
    {
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        glfwPollEvents();
        glfwSwapInterval(2); // Doesn't matter what I put here
        CalculateSomething(Result);
        Render(Result);
        glfwSwapBuffers(Window.WindowHandle);
    }
}
To keep the main loop running while the compute shader is calculating, I changed CalculateSomething to something like this:
void CalculateSomething(GLfloat* Result)
{
    // load some uniform variables
    glDispatchCompute(X, Y, 1);
    GPU_sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}

bool GPU_busy()
{
    GLint GPU_status;
    if (GPU_sync == NULL)
        return false;
    else
    {
        glGetSynciv(GPU_sync, GL_SYNC_STATUS, 1, nullptr, &GPU_status);
        return GPU_status == GL_UNSIGNALED;
    }
}
These two functions are part of a class, and it would get a little messy and complicated if I had to post all of that here (if more code is needed, tell me). So every loop iteration, when the class is told to do the computation, it first checks whether the GPU is busy. If it's done, the result is copied to CPU memory (or a new calculation is started); otherwise it returns to main without doing anything else. Anyway, this approach works in that it produces the right result. But my main loop is still blocked.
Doing some timing revealed that CalculateSomething, Render (and everything else) run fast (as I would expect them to). But now glfwSwapBuffers takes >3000 ms (depending on how long the calculations of the compute shader take).
Shouldn't it be possible to swap buffers while a compute shader is running? Rendering the result seems to work fine and without delay (as long as the compute shader is not done yet, the old result should get rendered). Or am I missing something here (do queued OpenGL calls get processed before glfwSwapBuffers does anything?)?
Edit:
I'm not sure why this question got closed or what additional information is needed (maybe other than the OS, which is Windows). As for "desired behavior": well, I'd like the glfwSwapBuffers call not to block my main loop. For additional information, please ask...
As pointed out by Erdal Küçük, an implicit call of glFlush might cause latency. I put this call before glfwSwapBuffers for testing purposes and timed it - no latency there...
I'm sure I can't be the only one who ever ran into this problem. Maybe someone could try to reproduce it? Simply put a compute shader in the main loop that takes a few seconds to do its calculations. I have read somewhere that similar problems occur especially when calling glMapBuffer. This seems to be an issue with the GPU driver (mine is an integrated Intel GPU). But nowhere have I read about latencies above 200 ms...
I solved a similar issue with a GL_PIXEL_PACK_BUFFER effectively used as an offscreen compute shader. The approach with fences is correct, but you then need a separate function that checks the status of the fence using glGetSynciv to read GL_SYNC_STATUS. The solution (admittedly in Java) can be found here.
An explanation for why this is necessary can be found in Nick Clark's comment:
Every call in OpenGL is asynchronous, except for the framebuffer swap, which stalls the calling thread until all submitted commands have been executed. Hence why glfwSwapBuffers seems to take so long.
The relevant portion from the solution is:
public void finishHMRead( int pboIndex ){
    int[] length = new int[1];
    int[] status = new int[1];
    GLES30.glGetSynciv( hmReadFences[ pboIndex ], GLES30.GL_SYNC_STATUS, 1, length, 0, status, 0 );
    int signalStatus = status[0];
    int glSignaled = GLES30.GL_SIGNALED;
    if( signalStatus == glSignaled ){
        // Ready a temporary ByteBuffer for mapping (we'll unmap the pixel buffer and lose this) and a permanent ByteBuffer
        ByteBuffer pixelBuffer;
        texLayerByteBuffers[ pboIndex ] = ByteBuffer.allocate( texWH * texWH );

        // Map data to a ByteBuffer
        GLES30.glBindBuffer( GLES30.GL_PIXEL_PACK_BUFFER, pbos[ pboIndex ] );
        pixelBuffer = ( ByteBuffer ) GLES30.glMapBufferRange( GLES30.GL_PIXEL_PACK_BUFFER, 0, texWH * texWH * 1, GLES30.GL_MAP_READ_BIT );

        // Copy to the long term ByteBuffer
        pixelBuffer.rewind(); // copy from the beginning
        texLayerByteBuffers[ pboIndex ].put( pixelBuffer );

        // Unmap and unbind the currently bound pixel buffer
        GLES30.glUnmapBuffer( GLES30.GL_PIXEL_PACK_BUFFER );
        GLES30.glBindBuffer( GLES30.GL_PIXEL_PACK_BUFFER, 0 );
        Log.i( "myTag", "Finished copy for pbo data for " + pboIndex + " at: " + (System.currentTimeMillis() - initSphereStart) );
        acknowledgeHMReadComplete();
    } else {
        // If it wasn't done, resubmit for another check in the next render update cycle
        RefMethodwArgs finishHmRead = new RefMethodwArgs( this, "finishHMRead", new Object[]{ pboIndex } );
        UpdateList.getRef().addRenderUpdate( finishHmRead );
    }
}
Basically, fire off the compute shader, then wait until the glGetSynciv check of GL_SYNC_STATUS returns GL_SIGNALED, then rebind the GL_SHADER_STORAGE_BUFFER and perform the glMapBuffer operation.
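For reference, here is a minimal C++ sketch of the same pattern applied to the question's compute-shader case. It reuses the question's names (GPU_sync, X, Y, Result) and its barrier call; the function split and everything else are assumptions, not the original code.

void StartCalculation()
{
    // load some uniform variables
    glDispatchCompute(X, Y, 1);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
    GPU_sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glFlush();                       // make sure the fence is actually submitted
}

// Called once per frame from the main loop; returns true when the result
// has been copied out and the fence consumed.
bool TryFinishCalculation(GLfloat* Result)
{
    if (GPU_sync == nullptr)
        return false;

    GLint status = GL_UNSIGNALED;
    glGetSynciv(GPU_sync, GL_SYNC_STATUS, 1, nullptr, &status);
    if (status != GL_SIGNALED)
        return false;                // still running: render the old result and try again next frame

    glDeleteSync(GPU_sync);
    GPU_sync = nullptr;

    GLfloat* mapped = (GLfloat*)glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY);
    if (mapped) {
        memcpy(Result, mapped, sizeof(GLfloat) * X * Y);
        glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
    }
    return true;
}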

ffmpeg C API - creating queue of frames

Using the C API of ffmpeg, I have created a C++ application that reads frames from a file and writes them to a new file. Everything works fine as long as I write the frames to the output immediately. In other words, the following structure of the program produces the correct result (I am only posting pseudocode for now; if needed I can also post some real snippets, but the classes I have created for handling the ffmpeg functionality are quite large):
AVFrame* frame = av_frame_alloc();
int got_frame;

// readFrame returns 0 if the file has ended; got_frame = 1 if
// a complete frame has been extracted
while (readFrame(inputfile, frame, &got_frame)) {
    if (got_frame) {
        // I actually do some processing here
        writeFrame(outputfile, frame);
    }
}
av_frame_free(&frame);
The next step was to parallelize the application and, as a consequence, frames are no longer written immediately after they are read (I do not want to go into the details of the parallelization). In this case problems arise: there is some flickering in the output, as if some frames were repeated randomly. However, the number of frames and the duration of the output video remain correct.
What I am trying to do now is to separate the reading completely from the writing in the serial implementation, in order to understand what is going on. I am creating a queue of pointers to frames:
std::queue<AVFrame*> queue;
int ret = 1, got_frame;
while (ret) {
    AVFrame* frame = av_frame_alloc();
    ret = readFrame(inputfile, frame, &got_frame);
    if (got_frame)
        queue.push(frame);
}
To write frames to the output file I do:
while (!queue.empty()) {
    frame = queue.front();
    queue.pop();
    writeFrame(outputFile, frame);
    av_frame_free(&frame);
}
The result in this case is an output video with the correct duration and number of frames that is just a repetition of the last 3 (I think) frames of the video.
My guess is that something goes wrong because in the first case I always use the same memory location for reading frames, while in the second case I allocate many different frames.
Any suggestions on what could be the problem?
Ah, so I'm assuming that readFrame() is a wrapper around libavformat's av_read_frame() and libavcodec's avcodec_decode_video2(), is that right?
From the documentation:
When AVCodecContext.refcounted_frames is set to 1, the frame is reference counted and the returned reference belongs to the caller. The caller must release the frame using av_frame_unref() when the frame is no longer needed.
and:
When AVCodecContext.refcounted_frames is set to 0, the returned reference belongs to the decoder and is valid only until the next call to this function or until closing or flushing the decoder.
From this it obviously follows that you need to set AVCodecContext.refcounted_frames to 1. The default is 0, so my gut feeling is that setting it to 1 will fix your problem. Don't forget to use av_frame_unref() on the pictures after use to prevent memory leaks, and also don't forget to free your AVFrame in this loop if got_frame == 0, again to prevent memory leaks:
while (ret) {
    AVFrame* frame = av_frame_alloc();
    ret = readFrame(inputfile, frame, &got_frame);
    if (got_frame)
        queue.push(frame);
    else
        av_frame_free(&frame);
}
(Alternatively, you could implement a cache for frame so that you only reallocate it if the previous object was pushed into the queue.)
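As a minimal sketch of the suggested fix, assuming readFrame() wraps avcodec_decode_video2() and that decCtx is the AVCodecContext it uses (the name is hypothetical):

// Enable reference-counted frames before opening the decoder, so queued
// frames keep owning their data across later decode calls.
decCtx->refcounted_frames = 1;   // set before avcodec_open2()

// ... read loop as above, pushing decoded frames onto the queue ...

while (!queue.empty()) {
    AVFrame* frame = queue.front();
    queue.pop();
    writeFrame(outputFile, frame);
    av_frame_free(&frame);       // also unrefs the frame's data buffers
}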
There's nothing obviously wrong with your pseudocode. The problem almost certainly lies in how you lock the queue between threads.
Your memory allocation looks the same to me. Do you maybe do something else in between reading and writing the frames?
Is queue the same queue in the routines that read and write the frames?

Any CUDA operation after cudaStreamSynchronize blocks until all streams are finished

While profiling my CUDA application with the NVIDIA Visual Profiler, I noticed that any operation after cudaStreamSynchronize blocks until all streams are finished. This is very odd behavior, because if cudaStreamSynchronize returns, that means the stream is finished, right? Here is my pseudo-code:
std::list<std::thread> waitingThreads;

void startKernelsAsync() {
    for (int i = 0; i < 200; ++i) {
        cudaHostAlloc(cpuPinnedMemory, size, cudaHostAllocDefault);
        memcpy(cpuPinnedMemory, data, size);
        cudaMalloc(gpuMemory);
        cudaStreamCreate(&stream);
        cudaMemcpyAsync(gpuMemory, cpuPinnedMemory, size, cudaMemcpyHostToDevice, stream);
        runKernel<<<32, 32, 0, stream>>>(gpuMemory);
        cudaMemcpyAsync(cpuPinnedMemory, gpuMemory, size, cudaMemcpyDeviceToHost, stream);
        waitingThreads.push_back(std::move(std::thread(waitForFinish, cpuPinnedMemory, stream)));
    }
    while (waitingThreads.size() > 0) {
        waitingThreads.front().join();
        waitingThreads.pop_front();
    }
}
void waitForFinish(void* cpuPinnedMemory, cudaStream_t stream, ...) {
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream); // <== This blocks until all streams are finished.
    memcpy(data, cpuPinnedMemory, size);
    cudaFreeHost(cpuPinnedMemory);
    cudaFree(gpuMemory);
}
If I put cudaFreeHost before cudaStreamDestroy then it becomes the blocking operation.
Is there anything conceptually wrong here?
EDIT: I found another weird behavior: sometimes it un-blocks in the middle of processing the streams and then processes the rest of the streams.
Normal behavior: (profiler timeline screenshot, not reproduced here)
Strange behavior (happens quite often): (profiler timeline screenshot, not reproduced here)
EDIT2: I am testing on a Tesla K40c card with compute capability 3.5 on CUDA 6.0.
As suggested in the comments, it may be viable to reduce the number of streams; however, in my application the memory transfers are quite fast, and I want to use streams mainly to dynamically schedule work to the GPU. The problem is that after a stream finishes I need to download the data from pinned memory and free the allocated memory for further streams, which seems to be a blocking operation.
I am using one stream per data set because every data set has a different size and processing takes an unpredictably long time.
Any ideas how to solve this?
I haven't found out why the operations are blocking, but I concluded that I cannot do anything about it, so I decided to implement memory and stream pooling (as suggested in the comments) to re-use GPU memory, pinned CPU memory and streams and avoid any kind of deletion.
In case anybody is interested, here is my solution. startKernelAsync behaves as an asynchronous operation that schedules the kernel, and a callback is called after the kernel is finished.
std::vector<Instance*> m_idleInstances;
std::vector<Instance*> m_workingInstances;

void startKernelAsync(...) {
    // Search for a finished stream.
    while (m_idleInstances.size() == 0) {
        findFinishedInstance();
        if (m_idleInstances.size() == 0) {
            std::chrono::milliseconds dur(10);
            std::this_thread::sleep_for(dur);
        }
    }
    Instance* instance = m_idleInstances.back();
    m_idleInstances.pop_back();

    // Fill CPU pinned memory
    cudaMemcpyAsync(..., stream);
    runKernel<<<32, 32, 0, stream>>>(gpuMemory);
    cudaMemcpyAsync(..., stream);

    m_workingInstances.push_back(instance);
}
void findFinishedInstance() {
    for (auto it = m_workingInstances.begin(); it != m_workingInstances.end();) {
        Instance* inst = *it;
        cudaError_t status = cudaStreamQuery(inst->stream);
        if (status == cudaSuccess) {
            it = m_workingInstances.erase(it);
            m_callback(inst->clusterGroup);
            m_idleInstances.push_back(inst);
        }
        else {
            ++it;
        }
    }
}
And at the end, just wait for everything to finish:
virtual void waitForFinish() {
    while (m_workingInstances.size() > 0) {
        Instance* instance = m_workingInstances.back();
        m_workingInstances.pop_back();
        m_idleInstances.push_back(instance);
        cudaStreamSynchronize(instance->stream);
        finalizeInstance(instance);
    }
}
And here is a graph from the profiler - works like a charm!
Check out the list of "Implicit Synchronization" rules in the CUDA C Programming Guide PDF that comes with the toolkit. (Section 3.2.5.5.4 in my copy, but yours might be a different version.)
If your GPU is of "compute capability 3.0 or lower", there are some special rules that apply. My guess would be that cudaStreamDestroy() is hitting one of those limitations.
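If it helps, here is a quick sketch of how to check which of those rules apply on a given machine, using the standard CUDA runtime API (error checking omitted; the function name is just illustrative):

#include <cstdio>
#include <cuda_runtime.h>

// Print the device's compute capability; devices at 3.0 or lower are the
// ones subject to the extra implicit-synchronization rules mentioned above.
void printComputeCapability(int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    printf("Device %d: %s, compute capability %d.%d\n",
           device, prop.name, prop.major, prop.minor);
}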

Program crash on boost::thread_group

I'm facing an issue with boost::thread_group.
My software executes some image processing on images acquired from multiple cameras. So far, the single-threaded/serial execution of the operations completes successfully, and I've tested all the individual functions using the google-test framework, just to rule out coding mistakes or crashes in the algorithms.
When I enable the multithreaded processing, where I feed my processing function with different data (the data are not shared between the threads, but I want to process 4 images in parallel to speed up execution), I receive after a while a segmentation fault and the program quits. The try/catch blocks do not prevent this crash.
Is there any hint/way to process in parallel different data and store them on disk?
I'll write down the snippet of code which causes the error:
boost::thread_group processingThreads;
for (unsigned int i = 0; i < images.size(); ++i)
{
    processingThreads.create_thread(boost::bind(processingFunction, i, param, backgrounds[i][0], images[i][0]));
}
processingThreads.join_all();
where images and backgrounds are nested std::vectors (one vector for each camera).
How can I deal with parallel executions of the same function?
Is Boost suitable for this goal?
EDIT
processingFunction is structured as follows (shortened, because it's a long function):
bool processingThread(const unsigned int threadID, const ProjectConfiguration &param, Images &background, Images &cloud)
{
    try
    {
        if (!loadImage(cloud))
            return false;
        /// Then: resize the image, compute the background mask, segment it.
        return true;
    }
    catch (...) { return false; }
}
Thank you in advance
Mike

Prevent frame dropping while saving frames to disk

I am trying to write C++ code which saves incoming video frames to disk. Asynchronously arriving frames are pushed onto a queue by a producer thread. The frames are popped off the queue by a consumer thread. Mutual exclusion of producer and consumer is done using a mutex. However, I still notice frames being dropped. The dropped frames (likely) correspond to instances when the producer tries to push the current frame onto the queue but cannot do so because the consumer holds the lock. Any suggestions? I essentially do not want the producer to wait. A waiting consumer is okay for me.
EDIT-0: An alternate idea which does not involve locking. Will this work?
The producer initially enqueues n seconds worth of video. n can be some small multiple of the frame rate.
As long as the queue contains >= n seconds worth of video, the consumer dequeues on a frame-by-frame basis and saves frames to disk.
When the video is done, the queue is flushed to disk.
EDIT-1: The frames arrive at ~15 fps.
EDIT-2: Outline of the code:
Main driver code
// Main function
void LVD::DumpFrame(const IplImage *frame)
{
    // Copies frame into internal buffer.
    // buffer object is a wrapper around OpenCV's IplImage
    Initialize(frame);

    // (Producer thread) -- Pushes buffer onto queue
    // Thread locks queue, pushes buffer onto queue, unlocks queue and dies
    PushBufferOntoQueue();

    // (Consumer thread) -- Pops off queue and saves to disk
    // Thread locks queue, pops it, unlocks queue,
    // saves popped buffer to disk and dies
    DumpQueue();

    ++m_frame_id;
}

void LVD::Initialize(const IplImage *frame)
{
    if (NULL == m_buffer) // first iteration
        m_buffer = new ImageBuffer(frame);
    else
        m_buffer->Copy(frame);
}
Producer
void LVD::PushBufferOntoQueue()
{
    m_queingThread = ::CreateThread(NULL, 0, ThreadFuncPushImageBufferOntoQueue, this, 0, &m_dwThreadID);
}

DWORD WINAPI LVD::ThreadFuncPushImageBufferOntoQueue(void *arg)
{
    LVD* videoDumper = reinterpret_cast<LVD*>(arg);
    LocalLock ll(&videoDumper->m_que_lock, 60 * 1000);
    videoDumper->m_frameQue.push(*(videoDumper->m_buffer));
    ll.Unlock();
    return 0;
}
Consumer
void LVD::DumpQueue()
{
    m_dumpingThread = ::CreateThread(NULL, 0, ThreadFuncDumpFrames, this, 0, &m_dwThreadID);
}

DWORD WINAPI LVD::ThreadFuncDumpFrames(void *arg)
{
    LVD* videoDumper = reinterpret_cast<LVD*>(arg);
    LocalLock ll(&videoDumper->m_que_lock, 60 * 1000);
    if (videoDumper->m_frameQue.size() > 0)
    {
        videoDumper->m_save_frame = videoDumper->m_frameQue.front();
        videoDumper->m_frameQue.pop();
    }
    ll.Unlock();

    stringstream ss;
    ss << videoDumper->m_saveDir.c_str() << "\\";
    ss << videoDumper->m_startTime.c_str() << "\\";
    ss << setfill('0') << setw(6) << videoDumper->m_frame_id;
    ss << ".png";

    videoDumper->m_save_frame.SaveImage(ss.str().c_str());
    return 0;
}
Note:
(1) I cannot use C++11. Therefore, Herb Sutter's DDJ article is not an option.
(2) I found a reference to an unbounded single-producer/single-consumer queue. However, the authors state that enqueue (adding frames) is probably not wait-free.
(3) I also found liblfds, a C library, but I am not sure whether it will serve my purpose.
The queue cannot be the problem. Video frames arrive at 16 ms intervals, at worst. Your queue only needs to store a pointer to a frame. Adding/removing one in a thread-safe way can never take more than a microsecond.
You'll need to look for another explanation and solution. Video forever presents a fire-hose problem. Disk drives are generally not fast enough to keep up with an uncompressed video stream. So if your consumer cannot keep up with the producer, then something is going to give, with a dropped frame the likely outcome when you (correctly) prevent the queue from growing without bound.
Be sure to consider encoding the video. Real-time MPEG and AVC encoders are available. After they compress the stream, you should not have a problem keeping up with the disk.
A circular buffer is definitely a good alternative. If you make its size a power of two (2^n), you can also use this trick to update the indices:
inline int update_index(int x)
{
    return (x + 1) & (size - 1);
}
That way, there is no need to use an expensive compare (and the consequent jump) or a divide (the single most expensive integer operation in any processor - not counting "fill/copy large chunks of memory" type operations).
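As an illustration only, here is a minimal single-producer/single-consumer ring of frame pointers built around that indexing trick. ImageBuffer is the question's own wrapper class; everything else is an assumed name, and the head/tail updates would still need the existing mutex (or equivalent) since C++11 atomics are off the table here.

class ImageBuffer;                          // the question's frame wrapper

struct FrameRing {
    static const int size = 64;             // must be a power of two
    ImageBuffer* slots[size];
    int head;                               // next write position (producer side)
    int tail;                               // next read position (consumer side)

    FrameRing() : head(0), tail(0) {}

    bool push(ImageBuffer* frame) {         // producer: returns false when full
        int next = (head + 1) & (size - 1);
        if (next == tail)
            return false;                   // full: caller decides whether to drop or wait
        slots[head] = frame;
        head = next;
        return true;
    }

    bool pop(ImageBuffer** frame) {         // consumer: returns false when empty
        if (tail == head)
            return false;
        *frame = slots[tail];
        tail = (tail + 1) & (size - 1);
        return true;
    }
};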
When dealing with video (or graphics in general) it is essential to do "buffer management". Typically, this is a case of tracking the state of the "framebuffer" and avoiding copying content more than necessary.
The typical approach is to allocate 2 or 3 video buffers (or frame buffers, or whatever you want to call them). A buffer can be owned by either the producer or the consumer. The transfer is ONLY of ownership. So when the video driver signals that "this buffer is full", the ownership passes to the consumer, which will read the buffer and store it to disk [or whatever]. When the storing is finished, the buffer is given back ("freed") so that the producer can re-use it. Copying the data out of the buffer is expensive [takes time], so you don't want to do that unless it's ABSOLUTELY necessary.
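A rough sketch of that ownership-transfer scheme, using the Win32 primitives already present in the question's code; ImageBuffer is the question's wrapper class, and the pool layout and method names are assumptions, not a prescribed design.

#include <windows.h>
#include <deque>

class ImageBuffer;                      // the question's frame wrapper

class BufferPool {
public:
    BufferPool() { InitializeCriticalSection(&m_cs); }
    ~BufferPool() { DeleteCriticalSection(&m_cs); }

    // Producer: take an empty buffer to fill (NULL if none is free).
    ImageBuffer* AcquireFree() {
        EnterCriticalSection(&m_cs);
        ImageBuffer* b = NULL;
        if (!m_free.empty()) { b = m_free.front(); m_free.pop_front(); }
        LeaveCriticalSection(&m_cs);
        return b;                       // ownership now belongs to the producer
    }

    // Producer: hand a filled buffer over to the consumer.
    void SubmitFilled(ImageBuffer* b) {
        EnterCriticalSection(&m_cs);
        m_filled.push_back(b);          // only the pointer changes hands, no copy
        LeaveCriticalSection(&m_cs);
    }

    // Consumer: take the next filled buffer (NULL if nothing is pending).
    ImageBuffer* AcquireFilled() {
        EnterCriticalSection(&m_cs);
        ImageBuffer* b = NULL;
        if (!m_filled.empty()) { b = m_filled.front(); m_filled.pop_front(); }
        LeaveCriticalSection(&m_cs);
        return b;
    }

    // Consumer: return a buffer after it has been written to disk.
    void Release(ImageBuffer* b) {
        EnterCriticalSection(&m_cs);
        m_free.push_back(b);
        LeaveCriticalSection(&m_cs);
    }

private:
    CRITICAL_SECTION m_cs;
    std::deque<ImageBuffer*> m_free;    // buffers ready for the producer to fill
    std::deque<ImageBuffer*> m_filled;  // buffers waiting for the consumer to save
};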