I googled this question and found the following at this link:
clEnqueueAcquireGLObjects
Acquire OpenCL memory objects that have been created from OpenGL objects.
These objects need to be acquired before they can be used by any OpenCL commands queued to a command-queue.
I really don't understand why these objects need to be acquired. In my opinion, the reason for acquiring them is NOT OpenGL/OpenCL synchronization, because that synchronization can be achieved with glFinish and clFinish.
I mean, if clEnqueueAcquireGLObjects/clEnqueueReleaseGLObjects are used, then glFinish/clFinish are redundant, and vice-versa.
I mean, if clEnqueueAcquireGLObjects/clEnqueueReleaseGLObjects are used, then glFinish/clFinish are redundant, and vice-versa.
You're thinking about this in entirely the wrong way.
glFinish causes OpenGL to perform a full CPU synchronization, such that the implementation will have completed all commands afterwards. clFinish does something similar for OpenCL.
The fact that you called one or the other has absolutely no effect on what a different system does. OpenGL has no idea that OpenCL exists, and vice-versa. glFinish has nothing to do with clFinish and vice-versa. So while OpenGL may have finished making some modification to an object, OpenCL has no idea that these modifications took place.
The purpose of acquiring and releasing OpenGL objects is for OpenCL and OpenGL to talk to one another. When objects are acquired, OpenCL tells OpenGL, "Hey, see these objects? They're mine now, so give them to me." This means that the OpenGL/OpenCL driver will do whatever mechanics are necessary to transfer access control over those objects to OpenCL.
For example, if an object has been paged out of GPU memory, OpenCL acquiring it may need to make it resident again. OpenCL and OpenGL have two separate sets of records that refer to this memory; by acquiring the object, you synchronize the OpenCL data with changes made by OpenGL. And so forth.
Notice that these mechanics have nothing at all to do with synchronizing GPU operations. They are about making the objects accessible to OpenCL.
If your OpenCL implementation doesn't have cl_khr_gl_event, then you must use OpenGL's synchronization mechanism to ensure that those objects are no longer in use before you acquire them. The two functions aren't redundant; they're doing different things to ensure the integrity of the system.
How lightweight are operations for creating and destroying CUDA streams? E.g. for CPU threads these operations are heavy, therefore they usually pool CPU threads. Shall I pool CUDA streams too? Or is it fast to create a stream every time I need it and then destroy it?
Guidance from NVIDIA is that you should pool CUDA streams. Here is a comment from the horse's mouth, https://github.com/pytorch/pytorch/issues/9646:
There is a cost to creating, retaining, and destroying CUDA streams in
PyTorch master. In particular:
Tracking CUDA streams requires atomic refcounting
Destroying a CUDA stream can (rarely) cause implicit device synchronization
The refcounting issue has been raised as a concern for expanding stream
tracing to allow streaming backwards, for example, and it's clearly
best to avoid implicit device synchronization as it causes an often
unexpected performance degradation.
For static frameworks the recommended best practice is to create all
the needed streams upfront and destroy them after the work is done.
This pattern is not immediately applicable to PyTorch, but a per
device stream pool would achieve a similar effect.
It probably doesn't matter whether creating streams is fast or not. Creating them once and reusing them will always be faster than continually creating and destroying them.
Whether amortizing that latency is actually important depends on your application much more than anything else.
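The "create upfront, reuse, destroy at the end" pattern can be sketched as a small pool. This is only an illustration under stated assumptions: HandlePool, try_acquire and release are hypothetical names, and the handle type is a template parameter so the sketch compiles without the CUDA toolkit. With CUDA, the handle would be cudaStream_t and the two callables would wrap cudaStreamCreate and cudaStreamDestroy.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical pool: creates all handles upfront, lends them out, and
// destroys them only when the pool itself goes away.
// Handle is a stand-in type; with CUDA it would be cudaStream_t.
template <typename Handle>
class HandlePool {
public:
    HandlePool(std::size_t count,
               std::function<Handle()> create,
               std::function<void(Handle)> destroy)
        : capacity_(count), destroy_(std::move(destroy)) {
        for (std::size_t i = 0; i < count; ++i) free_.push_back(create());
    }

    ~HandlePool() {
        // Assumption of this sketch: every handle was released beforehand.
        assert(free_.size() == capacity_);
        for (Handle h : free_) destroy_(h);
    }

    HandlePool(const HandlePool&) = delete;
    HandlePool& operator=(const HandlePool&) = delete;

    // Writes a free handle to `out` and returns true, or returns false
    // if every handle is currently lent out.
    bool try_acquire(Handle& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_.empty()) return false;
        out = free_.back();
        free_.pop_back();
        return true;
    }

    // Hands a previously acquired handle back for reuse.
    void release(Handle h) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(h);
    }

private:
    std::size_t capacity_;
    std::mutex mutex_;
    std::vector<Handle> free_;
    std::function<void(Handle)> destroy_;
};
```

The pool never calls the destroy callable while work is in flight, so any implicit device synchronization caused by destroying a stream would happen once, at shutdown, rather than in the middle of the workload.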
If I use two different cairo_t (and related cairo_surface_t etc) objects in two different threads, can I be guaranteed that there will be no race conditions due to shared global state?
Can I also formally pass a cairo_t object from one thread to another without any unexpected behaviour (possibly arising from thread local storage)?
This bug-tracking discussion should answer your questions: https://bugs.freedesktop.org/show_bug.cgi?id=74355
1. Cairo should be re-entrant
Uli Schlachter 2014-02-03 18:25:06 UTC
(In reply to comment #0)
share a single cairo_surface_t between the threads, and have each thread
draw using its own cairo_t. This crashes, but maybe I'm hoping for too much
(although an image surface is essentially just a big array of bytes that
should be writable from multiple threads).
Sure, just an array. And this works as long as you expect anything
like useful results. Cairo is supposed to be thread-safe as long as
the threads don't share any state (well, this is an
oversimplification, but your first approach isn't supposed to work).
2. Thread local storage can crash Pixman
Søren Sandmann Pedersen 2014-02-17 16:49:02 UTC
It is possible that pixman's support for TLS on Windows is simply
buggy; it may be that not a lot of people have been using pixman in a
multithreaded way on Windows (or have worked around the problem in
some way). We will need some kind of way to reproduce the issue to
know.
In pixman 0.32.0 and later there is a test program called
'thread-test' that may reproduce this issue if you can get it running
on Windows.
As a policy, you should always consider third-party libraries not thread-safe until proven otherwise.
Since your title asks for reentrancy: There aren't many callbacks in cairo, but as long as you don't cause any recursive callbacks, you should be fine.
Cairo definitely isn't signal-safe and I can't really imagine it being so.
And for your actual question about threads:
There isn't that much global state in cairo, and most of it is protected by appropriate mutexes. There were/are some bugs with font locking. If you stumble upon thread-safety problems and can write a not-too-huge, self-contained program that reproduces the problem, it should be fixed quickly. So any thread-safety issues are considered bugs.
And yes, this does not apply to sharing state between threads. Only implicitly used global state is protected. You cannot use any object that cairo hands to you in multiple threads at the same time. But you can freely move an object between threads.
If I call glDrawElements with the draw target being the back buffer, and then I call glReadPixels, is it guaranteed that I will read what was drawn?
In other words, is glDrawElements a blocking call?
Note: I am observing a weird issue here that may be caused by glDrawElements not being blocking...
In other words, is glDrawElements a blocking call?
That's not how OpenGL works.
The OpenGL memory model is built on the "as if" rule. Certain exceptions aside, everything in OpenGL will function as if all of the commands you issued have already completed. In effect, everything will work as if every command blocked until it completed.
However, this does not mean that the OpenGL implementation actually works this way. It just has to do everything to make it appear to work that way.
Therefore, glDrawElements is generally not a blocking call; however, glReadPixels (when reading to client memory) is a blocking call. Because the results of a pixel transfer directly to client memory must be available when glReadPixels has returned, the implementation must check to see if there are any outstanding rendering commands going to the framebuffer being read from. If there are, then it must block until those rendering commands have completed. Then it can execute the read and store the data in your client memory.
If you were reading to a buffer object, there would be no need for glReadPixels to block. No memory accessible to the client will be modified by the function, since you're reading into a buffer object. So the driver can issue the readback asynchronously. However, if you issue some command that depends on the contents of this buffer (like mapping it for reading or using glGetBufferSubData), then the OpenGL implementation must stall until the reading operation is done.
In short, OpenGL tries to delay blocking for as long as possible. Your job, to ensure performance, is to help OpenGL to do so by not forcing an implicit synchronization unless absolutely necessary. Sync objects can help with this.
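The behaviour described above can be pictured with a toy model. This is a sketch only and not how any real driver is implemented: ToyContext, draw, read_pixels and pending are hypothetical names. It shows a draw call returning immediately after queuing its work, and a read into client memory acting as the point where the queue has to be drained.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Toy model of a deferred-execution API (an assumption for illustration,
// not a description of a real OpenGL implementation).
class ToyContext {
public:
    // Like a draw call: records the work and returns immediately.
    void draw(int value) {
        pending_.push([this, value] { framebuffer_.push_back(value); });
    }

    // Like glReadPixels into client memory: the caller must see the result
    // of every earlier draw, so all queued work is executed first.
    std::vector<int> read_pixels() {
        flush();
        assert(pending_.empty());  // nothing may be outstanding at this point
        return framebuffer_;
    }

    // Number of commands queued but not yet executed.
    std::size_t pending() const { return pending_.size(); }

private:
    void flush() {
        while (!pending_.empty()) {
            pending_.front()();
            pending_.pop();
        }
    }

    std::queue<std::function<void()>> pending_;
    std::vector<int> framebuffer_;
};
```

From the caller's point of view the result is the same as if each draw had blocked, which is the "as if" rule; the only observable difference is when the waiting happens.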
So when you call OpenGL functions like glDraw or glBufferData, does it cause the program's thread to stop and wait for GL to finish the calls?
If not, then how does GL handle calling important functions like glDraw, and then immediately afterwards having a setting changed that affects the draw calls?
No, they (mostly) do not. The majority of GL functions are buffered when called and actually executed later. This means that you cannot think of the CPU and the GPU as two processors working in lockstep. Usually, the CPU issues a bunch of GL functions that get buffered, and the GPU executes them once they are delivered to it. This means that you cannot reliably measure how much time a specific GL function took to execute by just comparing the time before and after its execution.
If you want to do that, you need to first call glFinish() so that it actually waits for all previously buffered GL calls to execute. Then you can start timing, execute the calls that you want to benchmark, call glFinish() again to make sure these calls have executed as well, and then stop timing.
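The benchmarking recipe above can be written as a small helper. This is a minimal sketch: time_between_finishes is a hypothetical name, and the "finish" step is passed in as a callable so the example compiles without an OpenGL context. In a real program, assuming a current context, that callable would be glFinish.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Times `work` as described above: drain the queue, start the clock,
// issue the calls to measure, drain again, stop the clock.
// `finish` is a stand-in for glFinish.
inline std::chrono::steady_clock::duration
time_between_finishes(const std::function<void()>& finish,
                      const std::function<void()>& work) {
    assert(finish && work);  // both callables must be set

    finish();  // wait for everything queued before the benchmark
    const auto start = std::chrono::steady_clock::now();

    work();    // the calls being benchmarked (they may only be queued here)
    finish();  // wait until they have actually executed

    return std::chrono::steady_clock::now() - start;
}
```

Without the first finish, work queued before the benchmark would be charged to it; without the second, the clock would stop while the measured calls might still be sitting in the queue.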
On the other hand, I said "mostly". This is because reading functions will actually NEED to synchronize with the GPU to show real results and so, in this case, they DO wait and freeze the main thread.
edit: I think the explanation itself answers your second question, but just in case: the fact that all calls are buffered in order makes it possible for a draw to complete first, with the changed setting applying only to successive calls.
It strictly depends on the OpenGL call in question and the OpenGL state. When you make OpenGL calls, the implementation first queues them up internally and then executes them asynchronously to the calling program's execution. One important concept in OpenGL is the synchronization point: an operation in the work queue that requires the OpenGL call to block until certain conditions are met.
OpenGL objects (textures, buffer objects, etc.) are purely abstract, and by specification the handle of an object in the client program always refers to the data the object has at the time of the OpenGL call that refers to it. So take for example this sequence:
glBindTexture(GL_TEXTURE_2D, texID);
glTexImage2D(..., image_1);
draw_textured_quad();
glTexImage2D(..., image_2);
draw_textured_quad();
The first draw_textured_quad may return long before anything has been drawn. However, by making the calls, OpenGL creates an internal reference to the data currently held by the texture. So when glTexImage2D is called a second time, which may happen before the first quad was drawn, OpenGL must internally create a secondary texture object that becomes texture texID and is used by the second call of draw_textured_quad. If glTexSubImage2D were called instead, it would even have to make a modified copy of the data.
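The texture sequence above can be modelled in a few lines. This is a toy model under loud assumptions, not a description of a real driver: ToyTextureContext, tex_image and finish are hypothetical names, strings stand in for pixel data, and a shared_ptr stands in for the internal reference the implementation keeps.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Toy model only: respecifying the image gives the texture fresh storage
// instead of overwriting storage that queued draws still refer to.
class ToyTextureContext {
public:
    // Like glTexImage2D: the texture now refers to new storage.
    void tex_image(std::string pixels) {
        current_ = std::make_shared<const std::string>(std::move(pixels));
    }

    // Like draw_textured_quad(): queues work that keeps a reference to
    // the storage the texture had when the call was made.
    void draw_textured_quad() {
        assert(current_ && "tex_image must be called before drawing");
        std::shared_ptr<const std::string> snapshot = current_;
        pending_.push([this, snapshot] { drawn_.push_back(*snapshot); });
    }

    // Like glFinish: executes everything queued so far and returns what
    // each draw actually sampled, in order.
    const std::vector<std::string>& finish() {
        while (!pending_.empty()) {
            pending_.front()();
            pending_.pop();
        }
        return drawn_;
    }

private:
    std::shared_ptr<const std::string> current_;
    std::queue<std::function<void()>> pending_;
    std::vector<std::string> drawn_;
};
```

Calling tex_image("image_1"), draw_textured_quad(), tex_image("image_2"), draw_textured_quad() and then finish() in this model yields the first draw sampling image_1 and the second sampling image_2, even though neither draw ran until finish(). The old storage stays alive until the queued draw that references it has executed.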
OpenGL calls will only block if the result of the call modifies client-side memory and depends on data generated by previous OpenGL calls. In other words, when you make OpenGL calls, the implementation internally builds a dependency tree to keep track of what depends on what. And when a synchronization point must block, it will block at least until all of its dependencies are met.
I'm re-implementing some sections of an image processing library that's multithreaded C++ using pthreads. I'd like to be able to invoke a CUDA kernel in every thread and trust the device itself to handle kernel scheduling, but I know better than to count on that behavior. Does anyone have any experience with this type of issue?
CUDA 4.0 made it much simpler to drive a single CUDA context from multiple threads - just call cudaSetDevice() to specify which CUDA device you want the thread to submit commands to.
Note that this is likely to be less efficient than driving the CUDA context from a single thread - unless the CPU threads have other work to keep them occupied between kernel launches, they are likely to get serialized by the mutexes that CUDA uses internally to keep its data structures consistent.
Perhaps CUDA streams are the solution to your problem. Try to invoke kernels from a different stream in each thread. However, I don't see how this will help, as I think that your kernel executions will be serialized, even though they are invoked in parallel. In fact, CUDA kernel invocations even on the same stream are asynchronous by nature, so you can make any number of invocations from the same thread. I really don't understand what you are trying to achieve.