I'm using OpenGL for some GPGPU processing, so I have several threads handing work to a single OpenGL processing thread.
After each work item I need to call glReadPixels and glMapBuffer in order to transfer the data back to the host from the PBO. The problem with this, however, is that glMapBuffer blocks the thread, and no useful work can be done until the DMA transfer has finished, even though the GPU is idle. The usual way to solve this is to build a pipeline whose depth covers the longest DMA transfer; however, as I'm working on a low-latency system, this is suboptimal.
Is there a way to wait for glMapBuffer on a separate thread, or to get some notification when the DMA transfer has finished, in order to reduce the latency as much as possible?
Do some additional work in threads other than the one glMapBuffer blocks in. You can have multiple OpenGL contexts, each current in its own thread; if they're configured to share objects, they can operate simultaneously.
However, a DMA transfer still means work for the GPU: at the very least, the bus bandwidth to it is fully consumed, so you might end up with even worse performance.
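A minimal sketch of the shared-context setup on Windows (WGL; hdc and hdcWorker are placeholder device contexts, and error handling is omitted):

    // Create two contexts and share objects between them; call wglShareLists
    // before the second context contains any data.
    HGLRC rcMain   = wglCreateContext(hdc);
    HGLRC rcWorker = wglCreateContext(hdc);
    wglShareLists(rcMain, rcWorker);

    // Main thread:
    wglMakeCurrent(hdc, rcMain);

    // Worker thread (a context may be current in only one thread at a time):
    wglMakeCurrent(hdcWorker, rcWorker);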
If I call glDrawElements with the draw target being the back buffer, and then I call glReadPixels, is it guaranteed that I will read what was drawn?
In other words, is glDrawElements a blocking call?
Note: I am observing a weird issue here that may be caused by glDrawElements not blocking...
In other words, is glDrawElements a blocking call?
That's not how OpenGL works.
The OpenGL memory model is built on the "as if" rule. Certain exceptions aside, everything in OpenGL will function as if all of the commands you issued have already completed. In effect, everything will work as if every command blocked until it completed.
However, this does not mean that the OpenGL implementation actually works this way. It just has to do everything to make it appear to work that way.
Therefore, glDrawElements is generally not a blocking call; however, glReadPixels (when reading to client memory) is a blocking call. Because the results of a pixel transfer directly to client memory must be available when glReadPixels has returned, the implementation must check to see if there are any outstanding rendering commands going to the framebuffer being read from. If there are, then it must block until those rendering commands have completed. Then it can execute the read and store the data in your client memory.
If you were reading to a buffer object, there would be no need for glReadPixels to block. No memory accessible to the client will be modified by the function, since you're reading into a buffer object. So the driver can issue the readback asynchronously. However, if you issue some command that depends on the contents of this buffer (like mapping it for reading or using glGetBufferSubData), then the OpenGL implementation must stall until the reading operation is done.
In short, OpenGL tries to delay blocking for as long as possible. Your job, to ensure performance, is to help OpenGL to do so by not forcing an implicit synchronization unless absolutely necessary. Sync objects can help with this.
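For illustration, here is a minimal sketch of the non-blocking pattern, assuming a GL 3.2+ context (pbo, width, and height are placeholders): the read goes into a PBO, a fence marks its completion, and the map is only attempted once the fence has signaled.

    // Kick off the readback into a PBO; this returns immediately.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);

    // Drop a fence right after the read and flush so it reaches the GPU.
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glFlush();

    // Later: poll the fence (timeout 0) instead of stalling in glMapBuffer.
    GLenum st = glClientWaitSync(fence, 0, 0);
    if (st == GL_ALREADY_SIGNALED || st == GL_CONDITION_SATISFIED) {
        void* data = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY); // no stall now
        /* ... consume data ... */
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        glDeleteSync(fence);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);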
This question is related to using CUDA streams to run many kernels.
In CUDA there are many synchronization commands:
cudaStreamSynchronize,
cudaDeviceSynchronize,
cudaThreadSynchronize,
and also cudaStreamQuery to check if streams are empty.
I noticed when using the profiler that these synchronization commands introduce a large delay into the program. I was wondering if anyone knows a way to reduce this latency, apart from, of course, using as few synchronization commands as possible.
Also, are there any figures for judging the most efficient synchronization method? That is, consider three streams used in an application where two of them need to complete before I launch a fourth stream: should I use two cudaStreamSynchronize calls or just one cudaDeviceSynchronize? Which will incur less loss?
The main difference between synchronize methods is "polling" and "blocking."
"Polling" is the default mechanism for the driver to wait for the GPU - it waits for a 32-bit memory location to attain a certain value written by the GPU. It may return the wait more quickly after the wait is resolved, but while waiting, it burns a CPU core looking at that memory location.
"Blocking" can be requested by calling cudaSetDeviceFlags() with cudaDeviceScheduleBlockingSync, or calling cudaEventCreate() with cudaEventBlockingSync. Blocking waits cause the driver to insert a command into the DMA command buffer that signals an interrupt when all preceding commands in the buffer have been executed. The driver can then map the interrupt to a Windows event or a Linux file handle, enabling the synchronization commands to wait without constantly burning CPU, as do the default polling methods.
The queries are basically a manual check of that 32-bit memory location used for polling waits; so in most situations, they are very cheap. But if ECC is enabled, the query will dive into kernel mode to check if there are any ECC errors; and on Windows, any pending commands will be flushed to the driver (which requires a kernel thunk).
I want to do parallel rendering with two GPUs, so a readback from GPU1 and then drawing the pixels to GPU2 is needed.
I created two windows, one on each screen, with each screen connected to its own GPU, and there are two threads, one associated with each window.
However, the glReadPixels + glDrawPixels path is a bottleneck, so an asynchronous PBO method is being considered: two PBOs for reading back and two PBOs for drawing, used in an alternating (double-buffered) fashion.
My question is:
Can the pointer returned from glMapBufferARB be used in another thread and for a different GPU?
If not, I must copy the data to main memory and then copy it to the other GPU, and the bottleneck will be the CPU-to-GPU copy. Is there any better idea?
Yes, the pointer from glMapBuffer can be used by any thread, even one without a GL context. Just remember to synchronize the threads, and don't call glUnmapBuffer before the thread finishes its job with the pointer.
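A minimal sketch of that hand-off, assuming C++11 threads (readPbo, dst, and size are placeholders); the worker only touches the raw pointer, so it needs no GL context:

    #include <cstring>
    #include <thread>

    glBindBuffer(GL_PIXEL_PACK_BUFFER, readPbo);
    void* src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);

    std::thread worker([=] {
        std::memcpy(dst, src, size);  // plain memory work, no GL calls here
    });
    worker.join();                    // synchronize BEFORE unmapping

    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);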
I'm re-implementing some sections of an image processing library that's multithreaded C++ using pthreads. I'd like to be able to invoke a CUDA kernel in every thread and trust the device itself to handle kernel scheduling, but I know better than to count on that behavior. Does anyone have any experience with this type of issue?
CUDA 4.0 made it much simpler to drive a single CUDA context from multiple threads - just call cudaSetDevice() to specify which CUDA device you want the thread to submit commands to.
Note that this is likely to be less efficient than driving the CUDA context from a single thread - unless the CPU threads have other work to keep them occupied between kernel launches, they are likely to get serialized by the mutexes that CUDA uses internally to keep its data structures consistent.
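For concreteness, a minimal sketch of this pattern (CUDA 4.0+, C++11 threads; the kernel is a trivial stand-in for real per-thread work, and error checking is omitted):

    #include <cuda_runtime.h>
    #include <thread>
    #include <vector>

    // Trivial placeholder kernel standing in for real work.
    __global__ void myKernel(float* p, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] *= 2.0f;
    }

    void workerThread() {
        cudaSetDevice(0);            // every thread submits to the same device
        cudaStream_t stream;
        cudaStreamCreate(&stream);   // a private stream per thread

        float* d = nullptr;
        cudaMalloc(&d, 256 * sizeof(float));
        myKernel<<<1, 256, 0, stream>>>(d, 256);
        cudaStreamSynchronize(stream);

        cudaFree(d);
        cudaStreamDestroy(stream);
    }

    int main() {
        std::vector<std::thread> pool;
        for (int i = 0; i < 4; ++i) pool.emplace_back(workerThread);
        for (auto& t : pool) t.join();
        return 0;
    }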
Perhaps CUDA streams are the solution to your problem: try invoking kernels from a different stream in each thread. However, I don't see how this will help, as I suspect your kernel executions will be serialized even though they are invoked in parallel. In fact, CUDA kernel invocations, even on the same stream, are asynchronous by nature, so you can make any number of invocations from the same thread. I really don't understand what you are trying to achieve.
I have a small architecture question about organizing code into separate functional units (most probably threads?). The application being developed is supposed to perform the following tasks:
Display some images on a screen (i.e. slideshow)
Read the data from external device through the USB port
Match received data against the corresponding image (stimulus)
Do some data analysis
Plot the results of data analysis
My thoughts were to organize the application into the following modules:
GUI thread (+ image slideshow)
USB thread buffering the received data
Thread for analyzing/plotting data (main GUI thread should not be blocked while plotting the data which might consume some more time)
So, what do you generally think about this concept? Is there anything else you think that might be a better fit in this particular scenario?
You can probably get away with combining 1 & 2, since the slide-show feature is essentially GUI oriented anyway.
For #3, you may be able to make do with some kind of asynchronous I/O methodology, so that you don't need to dedicate a polling thread. Not sure if you can do this with USB, but you can certainly get async I/O with serial and network interfaces, so it's worth looking into.
It's probably a good idea to move heavyweight tasks like 4 & 5 to their own thread. If you aren't doing the analysis and plotting concurrently, maybe use one thread for both. However, you should really consider how much CPU time these activities will need. If the worst-case analyze-and-plot takes much less than half a second, you might even just perform these actions with a call from the GUI. Conversely, if there are cases where this will take longer than that, a separate thread is favorable because your users won't like a laggy GUI.
Just bear in mind that the dark side of threads lies in the inevitable challenge of coordinating them.
Because of the way the Windows API works, especially with regard to user input and window ownership, you can really only do UI on a single thread. If you try to use multiple threads, they just end up locking each other out and only one thread runs at a time. There are some specialized exceptions, but you have to be a real master of the API to pull them off.
So.
GUI thread: owns the window and handles all user input.
USB listening thread: you would know better than I whether this makes sense.
Thread(s) for analyzing/plotting data: once again, I can't speak to this, but I'm skeptical that both will really run at the same time. It seems more likely that it would be analyze-then-plot, so one thread.
Thread for rendering frames for a slideshow.
I'm not sure how plotting isn't the same thing as the slideshow, but I do think you can have a background thread for drawing the slideshow as long as it doesn't display the images.
You can render (i.e. draw to a bitmap or DirectX surface) in a background thread, you just can't show it in a window. But you could hand completed bitmaps off to the GUI thread and have it do the actual displaying of the bitmap. This is essentially how a lot of video playback code works.
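A minimal sketch of that hand-off in Win32 (the custom message WM_APP_FRAME_READY is something I've made up for illustration): the worker posts a finished HBITMAP, and the GUI thread blits it and then frees it.

    #include <windows.h>

    static const UINT WM_APP_FRAME_READY = WM_APP + 1; // hypothetical message

    // Worker thread: render off-screen, then notify the GUI thread.
    // PostMessage is safe to call from any thread.
    void PublishFrame(HWND hwndMain, HBITMAP hbmDone) {
        PostMessage(hwndMain, WM_APP_FRAME_READY, 0, (LPARAM)hbmDone);
    }

    // GUI thread: called from the window procedure on WM_APP_FRAME_READY.
    // It takes ownership of the bitmap, blits it, and frees it.
    LRESULT HandleFrameReady(HWND hwnd, LPARAM lParam) {
        HBITMAP hbm = (HBITMAP)lParam;
        HDC dc = GetDC(hwnd);
        HDC mem = CreateCompatibleDC(dc);
        HGDIOBJ old = SelectObject(mem, hbm);
        BITMAP bm;
        GetObject(hbm, sizeof(bm), &bm);
        BitBlt(dc, 0, 0, bm.bmWidth, bm.bmHeight, mem, 0, 0, SRCCOPY);
        SelectObject(mem, old);
        DeleteDC(mem);
        ReleaseDC(hwnd, dc);
        DeleteObject(hbm);
        return 0;
    }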
A lot of this depends on how much is involved in performing 3 (do some data analysis) and 4 (plot the analyzed data).
My instincts would be:
Definitely have a separate thread for reading the data off the USB port. Assuming for a moment that 3 depends on reading the data, I would then do 3 in the same thread as the reading. This will simplify your signaling to the GUI when the data is ready. This also assumes the processing is quick and won't block the USB port (how is it being read? I/O completion ports?). If the processing takes time, then you need a separate thread.
Likewise, if the image slideshow processing takes a long time, it should be done in a separate thread. If it can be recalculated quickly, say in a paint function, I would keep it as part of the main GUI.
There is some overhead in thread context switches, and each added thread brings extra signaling complexity. So I would only add a thread to avoid blocking the GUI or the USB port. It may be possible to do all of this in just two threads.
4 and 5 are definitely good ideas. That being said, avoid using low-level threads unless you absolutely must.
I'd check out Boost and Boost.Thread. Not only does it make your code more portable, but I haven't worked with an easier library for threading.
If you are using Builder 2009, you should look at TThread. It has some stuff to simplify thread coding.
I can't help thinking that you may be going a bit overboard here. A USB port can't really deliver data terribly quickly -- its theoretical bandwidth is only 480 Mbits/second, and realistically, it's a pretty rare USB device that can get very close to that.
Unless the analysis you've mentioned is quite a bit more complex than you've implied, my guess is that a single thread is probably entirely adequate. I'd think hard about using overlapped I/O to read the data, and MsgWaitForMultipleObjects for the main message loop.
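A minimal sketch of such a loop (hReadEvent is assumed to be the hEvent of the OVERLAPPED structure from a pending ReadFile):

    #include <windows.h>

    // Single-threaded pump that services both window messages and an
    // overlapped-I/O completion event.
    void MessageLoop(HANDLE hReadEvent) {
        for (;;) {
            DWORD r = MsgWaitForMultipleObjects(1, &hReadEvent, FALSE,
                                                INFINITE, QS_ALLINPUT);
            if (r == WAIT_OBJECT_0) {
                // The overlapped read finished: process the buffer,
                // then issue the next ReadFile.
            } else if (r == WAIT_OBJECT_0 + 1) {
                MSG msg;
                while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE)) {
                    if (msg.message == WM_QUIT) return;
                    TranslateMessage(&msg);
                    DispatchMessage(&msg);
                }
            }
        }
    }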
It seems to me that the main place you stand a good chance of gaining a lot is in plotting the data after it's processed. It might be worth considering something like OpenGL or DirectX Graphics to do the drawing. Especially if you're producing quite a bit of output, this can give a really substantial speed improvement. In an ideal situation, multiple threads might multiply your speed by the number of available cores -- typically 2 or 4 on today's machines. Drawing the output is likely to be the slowest part of the job, and hardware acceleration can easily speed that up by a considerably larger factor -- 10x is at the low end of what you can typically expect, and 100x is fairly common.