How can I get data from the buffer efficiently onto the GPU in PyTorch? - ray

I have a Ray actor that collects experiences (the buffer), a Ray actor that optimizes over them (the learner), plus several actors that only collect experiences. This is similar to the Ape-X reinforcement learning algorithm.
My main problem is that sampling from the buffer with the learner takes a lot of time, because data can only be transferred from the buffer to the learner in CPU format (even if the learner and buffer are on the same machine). Consequently, to run an optimization pass on the learner, I still need to push the data to the GPU after every call to ray.get(buffer.GetSamples.remote()). This is very inefficient and takes a lot of time away from the optimization calculations.
In an ideal world, the buffer would continuously push random samples to the GPU, and the learner could simply pick a chunk from those at each pass. How can I make this work?
Also, putting both the learner and the buffer into one Ray actor doesn't work, because Ray (and obviously Python) seems to have significant problems with multi-threading, and doing it in a serial fashion defeats the purpose (it would be even slower).
Note that this is a follow-up to another question from me here.
EDIT: I should note that this is for PyTorch

You can call .cuda() on the loaded samples and push them onto a Python Queue, then consume those GPU tensors from the queue in another thread.
This is how the Ape-X implementation in Ray manages concurrent data loading for TensorFlow.

Related

In Vulkan, is it beneficial for the graphics queue family to be separate from the present queue family?

As far as I can tell it is possible for a queue family to support presenting to the screen but not support graphics. Say I have a queue family that supports both graphics and presenting, and another queue family that only supports presenting. Should I use the first queue family for both processes or should I delegate the first to graphics and the latter to presenting? Or would there be no noticeable difference between these two approaches?
No such HW exists, so the best approach is no approach. If you want to be really nice, you can handle the separate-present-queue-family case while expending minimal brain power on it. Though you have no way to test it on real HW that needs it. So I would say aborting with a nice error message would be adequate, until you can get your hands on actual HW that behaves this way.
I think there is a bit of a design error here on Khronos's part. A separate present queue does look like the more explicit way. But then, the present op itself is not a queue operation, so the driver can use whatever it wants anyway. Also, a separate present queue requires an extra semaphore, and a Queue Family Ownership Transfer (or a VK_SHARING_MODE_CONCURRENT resource). History played out such that no driver is so extreme as to report a separate present-only queue. So I made KhronosGroup/Vulkan-Docs#1234.
For a rough notion of what happens at vkQueuePresentKHR, you can inspect the Mesa code: https://github.com/mesa3d/mesa/blob/bf3c9d27706dc2362b81aad12eec1f7e48e53ddd/src/vulkan/wsi/wsi_common.c#L1120-L1232. There's probably no monkey business there using the queue you provided, beyond waiting on your semaphore, or at most making a blit of the image. If you (voluntarily) want to use a separate present queue, you need to measure and whitelist it only for drivers (and probably other influences) where it actually helps (if any such exist, and if it is even worth your time).
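To make that concrete, here is a minimal sketch of queue-family selection that simply prefers a single family supporting both graphics and present, and bails out with an error otherwise, as suggested above. The helper name is mine and error handling is trimmed.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>
#include <vulkan/vulkan.h>

// Prefer one queue family that supports both graphics and present.
// If only a split graphics/present setup exists (no known hardware does this),
// bail out with a clear error, as suggested above.
uint32_t pickGraphicsPresentFamily(VkPhysicalDevice gpu, VkSurfaceKHR surface) {
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, families.data());

    for (uint32_t i = 0; i < count; ++i) {
        VkBool32 presentSupported = VK_FALSE;
        vkGetPhysicalDeviceSurfaceSupportKHR(gpu, i, surface, &presentSupported);
        if ((families[i].queueFlags & VK_QUEUE_GRAPHICS_BIT) && presentSupported)
            return i;  // one family for both: no ownership transfers, no extra semaphore
    }
    throw std::runtime_error("No queue family supports both graphics and present");
}
```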
First off, I assume you mean "beneficial" in terms of performance, and whenever it comes to questions like that you can never have a definite answer except by profiling the different strategies. If your application needs to run on a variety of hardware, you can have it profile the different strategies the first time it's run and save the results locally for repeated use, provide the user with a benchmarking utility they can run if they see poor performance, etc. etc. Trying to reason about it in the abstract can only get you so far.
That aside, I think the easiest way to think about questions like this is to remember that when it comes to graphics programming, you want to both maximize the amount of work that can be done in parallel and minimize the amount of work overall. If you want to present an image from a non-graphics queue and you need to perform graphics operations on it, you'll need to transfer ownership of it to the non-graphics queue when graphics operations on it have finished. Presumably, that will take a bit of time in the driver if nothing else, so it's only worth doing if it will save you time elsewhere somehow.
A common situation where this would probably save you time is if the device supports async compute and also lets you present from the compute queue. For example, a 3D game might use the compute queue for things like lighting, blur, UI, etc. that make the most sense to do after geometry processing is finished. In this case, the game engine would transfer ownership of the image to be presented to the compute queue first anyway, or even have the compute queue own the swapchain image from beginning to end, so presenting from the compute queue once its work for the frame is done would allow the graphics queue to stay busy with the next frame. AMD and NVIDIA recommend this sort of approach where it's possible.
If your application wouldn't otherwise use the compute queue, though, I'm not sure how much sense it makes to present on it when you have the option. The advantage of that approach is that once graphics operations for a given frame are over, you can have the graphics queue immediately release ownership of the image and acquire the next one without having to pause to present it, which would allow presentation to be done in parallel with rendering the next frame. On the other hand, you'll have to transfer ownership of it to the compute queue first and set up presentation there, which would add some complexity and overhead. I'm not sure which approach would be faster, and I wouldn't be surprised if it varies with the application and environment. Of course, I'm not sure how many realtime Vulkan applications of any significant complexity fit this scenario today, and I'd guess it's not very many, as "per-pixel" things tend to be easier and faster to do with a compute shader.

Possible options to calculate a part of a parallelized program on the GPU

Hi, I am not so familiar with GPUs and just have a theoretical question.
I am working on an application called Sassena, which calculates neutron scattering from molecular dynamics trajectories. This application is parallelized with MPI and works very well on CPUs, but I would like to run part of it on a GPU to make it faster — of course not all of it, just a part. When I look at the source code, it works as typical MPI: the first rank sends the data to each node individually, and then each node does its calculation. Now, there is a part of the calculation that uses the Fast Fourier Transform (FFT), which consumes the most time, and I want to offload this part to the GPU.
I see two solutions ahead of me:
When the nodes reach the FFT part, they send the data back to the main node; once the main node has gathered all the data, it sends it to the GPU, the GPU does the FFT and sends the result back to the CPU, and the CPU does the rest.
Each node dynamically sends its data to the GPU; after the GPU does the FFT, it sends the result back to that node, and the node does the rest of its job.
So my question is: which of these two is possible? I know the first one is doable, but it involves a lot of communication, which is time consuming. As for the second way, I don't know whether it is possible at all. I know that in the second case it will also depend on the computer architecture. But is CUDA or OpenCL capable of this at all?
Thanks for any ideas.
To my knowledge you are not restricted by CUDA. What you are restricted by here is the number of GPUs you have. You need to create some sort of queue that distributes your work to the available GPUs and keeps track of free resources. Depending on the ratio of CPUs to GPUs and the amount of time each FFT takes, you may wait longer for each FFT to be handed to the GPU than it would take to just do it on each core.
What I mean is that you lose the asynchronous FFT computation that the cores perform independently. Rather, CPU 2 has to wait for CPU 1 to finish its FFT computation before it can launch a new kernel on the GPU.
Beyond that, it is possible to create a simple mutex that is locked when a CPU starts computing its FFT and unlocked when it finishes, so that the next CPU can use the GPU.
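To illustrate what the second option could look like per rank, here is a generic cuFFT sketch, not Sassena code; the 1-D complex transform, the in-place layout, and the missing error checking are all simplifications. Each rank pushes its own chunk through the GPU; how access to a shared GPU is arbitrated between ranks (driver queueing, a host-side lock, or a task runtime) is the separate concern discussed above.

```cpp
#include <complex>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Sketch of option 2: one MPI rank offloads its own 1-D complex FFT to the GPU.
// std::complex<float> and cufftComplex share the same layout (two floats).
void fft_on_gpu(std::vector<std::complex<float>>& signal) {
    const int n = static_cast<int>(signal.size());

    cufftComplex* d_data = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&d_data), sizeof(cufftComplex) * n);
    cudaMemcpy(d_data, signal.data(), sizeof(cufftComplex) * n,
               cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, /*batch=*/1);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);   // in-place forward transform

    cudaMemcpy(signal.data(), d_data, sizeof(cufftComplex) * n,
               cudaMemcpyDeviceToHost);

    cufftDestroy(plan);
    cudaFree(d_data);
}
```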
You can look at StarPU. It is a task-based API that can handle sending tasks to GPUs, and it is also designed for distributed-memory models.

How to write data into a buffer and write the buffer into a binary file with a second thread?

I am getting data from a sensor (camera) and writing it into a binary file. The problem is that it takes a lot of space on the disk.
So I used compression from boost (zlib), and the space usage dropped a lot! The problem is that the compression process is slow and lots of data goes missing.
So I want to implement two threads: one getting the data from the camera and writing it into a buffer, and a second thread taking the data from the front of the buffer and writing it into the binary file. That way, all the data will be present.
How do I implement this buffer? It needs to expand dynamically and support pop_front. Shall I use std::deque, or does something better already exist?
First, you have to consider these four rates (or speeds):
Speed of Production (SP): The average number of bytes your sensor produces per second.
Speed of Compression (SC): The average number of bytes per second you can compress. This is the number of input bytes to the compression algorithm.
Rate of Compression (RC): The average ratio of compressed data to uncompressed data your compress algorithm produces (ratio of size of output to the input of compression.) (This is obviously somewhere between 0 and 1.)
Speed of Writing (SW): The average number of bytes you can write to disk, per second.
If SC is less than SP, you are in trouble. It means you can't compress all the data you gather from your sensor in real time, which means you'll eventually run out of buffer memory. You'll have to find a faster compression algorithm, or dedicate more CPU cores to compression.
If SW is less than SP times RC (which is the rate at which compressed data is produced), you are again in trouble. It means you can't write out your output data as fast as you are producing and compressing it, and again you will eventually run out of buffer memory, no matter how much you have. You might be able to gain some speed by adopting a better write strategy or file system, but a real gain in SW comes from a better disk system (RAID, SSD, better hardware, etc.)
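Putting the two conditions together (the numbers below are made up purely for illustration), the pipeline only keeps up in steady state if

```latex
SC \ge SP
\qquad\text{and}\qquad
SW \ge SP \cdot RC
```

For example, a sensor producing SP = 30 MB/s with a compressor achieving RC = 0.4 needs SC of at least 30 MB/s and sustained disk throughput SW of at least 30 × 0.4 = 12 MB/s; if either inequality fails, some buffer grows without bound.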
Now, if everything is OK speed-wise, you can probably employ something like the following architecture to read, compress and write the data out:
You'll have three threads (or two, described later) that do one part of the pipeline each. You'll also have two thread-safe queues, one for communication from each stage of the pipeline to the next.
Assuming the two queues are named Q1 and Q2, the high-level operation of the threads will look like this (a C++ sketch follows the three step lists below):
Input Thread:
Read K bytes of sensor data
Put the whole K bytes as a unit on Q1.
Go to 1.
Compression Thread:
Wait till there is something on Q1.
Pop one buffer of data (probably K bytes) from Q1.
Compress the buffer into a hopefully smaller buffer and put it on Q2.
Go to 1.
Output Thread:
Wait till there is something on Q2.
Pop one buffer of data from Q2.
Write the buffer to the output file.
Go to 1.
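Here is a minimal C++ sketch of that pipeline. The chunk size, the file name, and the read_sensor_chunk/compress_chunk stand-ins are placeholders for your camera read and your boost/zlib call; the point is the shape: a small thread-safe queue built on std::deque, and three loops that mirror the step lists above.

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <fstream>
#include <mutex>
#include <thread>
#include <vector>

using Chunk = std::vector<std::uint8_t>;

// Thread-safe FIFO built on std::deque (a plain deque is not thread-safe on its own).
class ChunkQueue {
public:
    void push(Chunk c) {
        { std::lock_guard<std::mutex> lock(m_); q_.push_back(std::move(c)); }
        cv_.notify_one();
    }
    Chunk pop() {                           // blocks until a chunk is available
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        Chunk c = std::move(q_.front());
        q_.pop_front();
        return c;
    }
private:
    std::deque<Chunk> q_;
    std::mutex m_;
    std::condition_variable cv_;
};

// Stand-ins for the real camera read and the boost/zlib compression call.
Chunk read_sensor_chunk(std::size_t k) { return Chunk(k, 0); }
Chunk compress_chunk(const Chunk& raw) { return raw; }   // identity "compression"

int main() {
    constexpr std::size_t K = 64 * 1024;  // chunk size; see the discussion of K below
    ChunkQueue q1, q2;

    std::thread input([&] {               // Input thread: sensor -> Q1
        for (;;) q1.push(read_sensor_chunk(K));
    });
    std::thread compressor([&] {          // Compression thread: Q1 -> compress -> Q2
        for (;;) q2.push(compress_chunk(q1.pop()));
    });
    std::thread output([&] {              // Output thread: Q2 -> disk
        std::ofstream out("capture.bin", std::ios::binary);
        for (;;) {
            Chunk c = q2.pop();
            out.write(reinterpret_cast<const char*>(c.data()),
                      static_cast<std::streamsize>(c.size()));
        }
    });

    input.join();                         // sketch only: no shutdown handling
    compressor.join();
    output.join();
}
```

A production version would also bound the queues, so that a stalled disk applies backpressure instead of exhausting memory, and would add a clean shutdown path.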
The most CPU-intensive part of the work is in the second thread, and the other two probably don't consume much CPU time and therefore probably can share a CPU core. This means that the above strategy may be runnable on two cores. But it can also run on a single core if the workload is light, or require many many cores. That all depends on the four rates I described up top.
Using asynchronous writes (e.g. IOCP/overlapped I/O on Windows or POSIX AIO/io_uring on Linux), you can drop the third thread and the second queue altogether. Then your second thread needs to execute something like this:
Wait till there is something on Q1.
Pop one buffer of data (probably K bytes) from Q1.
Compress the buffer into a hopefully smaller buffer.
Issue an asynchronous write request to the OS to write out the compressed buffer to disk.
Go to 1.
There are four more issues worth mentioning:
K should be selected so that the time required for the various (usually constant-time) activities associated with allocating a buffer, pushing it into and popping it from a thread-safe queue, starting a compression run, and issuing a write request into a file becomes negligible relative to doing the actual work (reading sensor data, compressing bytes and writing to disk.) This usually means that K needs to be as large as possible. But if K is very large (many megabytes or hundreds of megabytes), then if your application crashes, you'll lose a lot of data. You need to find a balance between performance and risk of data loss. I suggest (without any knowledge of your specific needs and constraints) a value between 10 KiB and 1 MiB for K.
Implementing a thread-safe queue is easy if you have some knowledge and experience with concurrent/parallel programming, but rather hard and error-prone if you do not. Finding good examples and implementations should not be hard. A plain std::deque or std::list or std::anything won't be usable by itself, but can be used as a good basis for writing a thread-safe queue.
Note that you are queuing buffers of data, not individual numbers or bytes. If you pass your data one number at a time through this pipeline, it will be painfully slow and wasteful.
Some compression algorithms are limited in how much data they can consume in each invocation, or require that each call to the compression routine be matched with one call to the decompression routine later on. These might affect the choice of K, and also how you write your output file. You might have to add some metadata so that you can actually decompress and read the data later.
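One simple way to add that metadata (an assumption on my part, not something from the question) is to length-prefix every compressed chunk, so the reader later knows exactly how many bytes to hand to each decompression call:

```cpp
#include <cstdint>
#include <fstream>
#include <vector>

// Write one compressed chunk as [4-byte length][payload] so the reader can
// later recover chunk boundaries and decompress chunk by chunk.
void write_record(std::ofstream& out, const std::vector<std::uint8_t>& compressed) {
    const std::uint32_t len = static_cast<std::uint32_t>(compressed.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof(len));  // host byte order; fix the endianness if the file must be portable
    out.write(reinterpret_cast<const char*>(compressed.data()),
              static_cast<std::streamsize>(compressed.size()));
}
```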

Multithreaded Realtime audio programming - To block or Not to block

When writing audio software, many people on the internet say it is paramount not to use memory allocation or blocking code, i.e. no locks, because these are non-deterministic and could cause the output buffer to underflow so that the audio glitches.
Real Time Audio Programming
When I write video software, I generally use both, i.e. allocating video frames on the heap and passing them between threads using locks and condition variables (bounded buffers). I love the power this provides, as a separate thread can be used for each operation, allowing the software to max out each of the cores and giving the best performance.
With audio I'd like to do something similar, passing frames of maybe 100 samples between threads, however, there are two issues.
How do I generate the frames without using memory allocation? I suppose I could use a pool of frames that have been pre-allocated but this seems messy.
I'm aware you can use a lock-free queue, and that boost has a nice library for this. This would be a great way to share between threads, but constantly polling the queue to see if data is available seems like a massive waste of CPU time.
In my experience using mutexes doesn't actually take much time at all, provided that the section where the mutex is locked is short.
What is the best way to achieve passing audio frames between threads, whilst keeping latency to a minimum, not wasting resources and using relatively little non-deterministic behaviour?
Seems like you did your research! You've already identified the two main problems that could be the root cause of audio glitches. The question is: how much of this was important 10 years ago, and how much is only folklore and cargo-cult programming these days?
My two cents:
1. Heap allocations in the rendering loop:
These can have quite a lot of overhead depending on how small your processing chunks are. The main culprit is that very few runtimes have a per-thread heap, so each time you mess with the heap your performance depends on what other threads in your process are doing. If, for example, a GUI thread is currently deleting thousands of objects and you, at the same time, access the heap from the audio rendering thread, you may experience a significant delay.
Writing your own memory management with pre-allocated buffers may sound messy, but in the end it's just two functions that you can hide somewhere in a utility source file. Since you usually know your allocation sizes in advance, there is a lot of opportunity to fine-tune and optimize your memory management. You can store your free segments as a simple linked list, for example. Done right, this has the benefit that you hand out the most recently used buffer again, and that buffer has a very high probability of being in the cache.
If fixed-size allocators don't work for you, have a look at ring buffers. They fit the use case of streaming audio very well.
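For illustration, here is a minimal sketch of such a pre-allocated pool; the float sample type, the class name, and the use of a std::vector as the free list (instead of a hand-rolled linked list) are my own choices. All memory is allocated once up front, and freed frames go back onto a LIFO free list, so the most recently used frame (the one most likely still in cache) is handed out again first.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Fixed-size frame pool: all memory is allocated up front, and frames are
// recycled through a LIFO free list so the most recently freed (cache-warm)
// frame is reused first.
class FramePool {
public:
    FramePool(std::size_t frameSamples, std::size_t frameCount)
        : storage_(frameSamples * frameCount), frameSamples_(frameSamples) {
        free_.reserve(frameCount);
        for (std::size_t i = 0; i < frameCount; ++i)
            free_.push_back(storage_.data() + i * frameSamples);
    }
    float* acquire() {                    // returns nullptr if the pool is exhausted
        std::lock_guard<std::mutex> lock(m_);
        if (free_.empty()) return nullptr;
        float* f = free_.back();
        free_.pop_back();
        return f;
    }
    void release(float* frame) {          // hand a frame back for reuse
        std::lock_guard<std::mutex> lock(m_);
        free_.push_back(frame);
    }
    std::size_t frameSamples() const { return frameSamples_; }
private:
    std::vector<float> storage_;          // one big allocation, done once
    std::vector<float*> free_;            // LIFO free list
    std::size_t frameSamples_;
    std::mutex m_;
};
```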
2. To lock, or not to lock:
I'd say these days mutex and semaphore locks are fine if you can estimate that you do fewer than 1000 to 5000 of them per second (on a PC; things are different on something like a Raspberry Pi etc.). If you stay below that range, it is unlikely that the overhead will show up in a performance profile.
Translated to your use case: if you, for example, work with 48kHz audio and 100-sample chunks, you generate roughly 960 lock/unlock operations per second in a simple two-thread consumer/producer pattern. That is well within the range. In the case where you completely max out the rendering thread, the locking will not show up in a profile. If, on the other hand, you only use about 5% of the available processing power, the locks may show up, but you will not have a performance problem either :-)
Going lock-less is also an option, but so are hybrid solutions that first do some lock-less tries and then fall back to hard locking. You'll get the best of both worlds that way. There is a lot of good stuff to read about this topic on the net.
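One common reading of that hybrid idea (my interpretation, not necessarily the author's) is an adaptive lock: make a bounded number of non-blocking attempts, then fall back to a normal blocking lock.

```cpp
#include <mutex>

// Hybrid locking: a few non-blocking attempts first, then a blocking lock.
// The retry count is arbitrary; tune it by profiling.
template <typename Mutex>
void hybrid_lock(Mutex& m, int spinTries = 100) {
    for (int i = 0; i < spinTries; ++i) {
        if (m.try_lock())
            return;              // got it without blocking
    }
    m.lock();                    // fall back to a blocking wait
}
```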
In any case:
You should raise the thread priority of your non-GUI threads gently to make sure that if they run into a lock, they get out of it quickly. It is also a good idea to read about what priority inversion is and what you can do to avoid it:
https://en.wikipedia.org/wiki/Priority_inversion
'I suppose I could use a pool of frames that have been pre-allocated but this seems messy' - not really. Either allocate an array of frames, or new up frames in a loop, and then shove the indices/pointers onto a blocking queue. Now you have an auto-managed pool of frames. Pop one off when you need a frame, push it back on when you are done with it. No continual malloc/free/new/delete, no chance of memory runaway, simpler debugging, and frame flow-control (if the pool runs out, threads asking for frames will wait until frames are released back into the pool), all built in.
Using an array may seem easier/safer/faster than a new loop, but newing individual frames does have an advantage - you can easily change the number of frames in the pool at runtime.
Um, why are you passing frames of 100 samples between threads?
Assuming that you are working at a nominal sample rate of 44.1kHz and passing 100 samples at a time between threads, that presumes your threads must switch at least once every 100 samples / (44100 samples/s * 2), where the 2 accounts for both the producer and the consumer. That means you have a time slice of ~1.13 ms for every 100 samples you send. Nearly all operating systems run at time slices greater than 10 ms. So it is impossible to build an audio engine where you are sharing only 100 samples between threads at 44.1kHz on a modern OS.
The solution is to buffer more samples per time slice, either via a queue or by using larger frames. Most modern real time audio APIs use 128 samples per channel (on dedicated audio hardware) or 256 samples per channel (on game consoles).
Ultimately, the answer to your question is mostly the answer you would expect... Pass around uniquely owned queues of pointers to buffers, not the buffers themselves; manage ALL audio buffers in a fixed pool allocated at program start; and lock all queues for as little time as necessary.
Interestingly, this is one of the few good situations in audio programming where there is a distinct performance advantage to busting out the assembly code. You definitely don't want a malloc and free occurring with every queue lock. Operating-system provided atomic locking functions can ALWAYS be improved upon, if you know your CPU.
One last thing: there's no such thing as a lockfree queue. All multithread "lockfree" queue implementations rely on a CPU barrier intrinsic or a hard compare-and-swap somewhere to make sure that exclusive access to memory is guaranteed per thread.

Which OpenGL functions are not GPU-accelerated?

I was shocked when I read this (from the OpenGL wiki):
glTranslate, glRotate, glScale
Are these hardware accelerated?
No, there are no known GPUs that execute this. The driver computes the matrix on the CPU and uploads it to the GPU.
All the other matrix operations are done on the CPU as well: glPushMatrix, glPopMatrix, glLoadIdentity, glFrustum, glOrtho. This is the reason why these functions are considered deprecated in GL 3.0. You should have your own math library, build your own matrix, upload your matrix to the shader.
For a very, very long time I thought most of the OpenGL functions use the GPU to do computation. I'm not sure if this is a common misconception, but after a while of thinking, this makes sense. Old OpenGL functions (2.x and older) are really not suitable for real-world applications, due to too many state switches.
This makes me realise that, possibly, many OpenGL functions do not use the GPU at all.
So, the question is:
Which OpenGL function calls don't use the GPU?
I believe knowing the answer to the above question would help me become a better programmer with OpenGL. Please do share some of your insights.
Edit:
I know this question easily leads to optimisation level. It's good, but it's not the intention of this question.
If anyone knows a set of GL functions on a certain popular implementation (as AshleysBrain suggested, nVidia/ATI, and possibly OS-dependent) that don't use the GPU, that's what I'm after!
Plausible optimisation guides come later. Let's focus on the functions, for this topic.
Edit2:
This topic isn't about how matrix transformations work. There are other topics for that.
Boy, is this a big subject.
First, I'll start with the obvious: Since you're calling the function (any function) from the CPU, it has to run at least partly on the CPU. So the question really is, how much of the work is done on the CPU and how much on the GPU.
Second, in order for the GPU to get to execute some command, the CPU has to prepare a command description to pass down. The minimal set here is a command token describing what to do, as well as the data for the operation to be executed. How the CPU triggers the GPU to do the command is also somewhat important. Since most of the time, this is expensive, the CPU does not do it often, but rather batches commands in command buffers, and simply sends a whole buffer for the GPU to handle.
All this to say that passing work down to the GPU is not a free exercise. That cost has to be pitted against just running the function on the CPU (no matter what we're talking about).
Taking a step back, you have to ask yourself why you need a GPU at all. The fact is, a pure CPU implementation does the job (as AshleysBrain mentions). The power of the GPU comes from its design to handle:
specialized tasks (rasterization, blending, texture filtering, blitting, ...)
heavily parallel workloads (DeadMG points to that in his answer), whereas a CPU is designed more for single-threaded work.
And those are the guiding principles to follow in order to decide what goes in the chip. Anything that can benefit from those ought to run on the GPU. Anything else ought to be on the CPU.
It's interesting, by the way: some functionality of the GL (prior to deprecation, mostly) is really not clearly delineated. Display lists are probably the best example of such a feature. Each driver is free to push as much as it wants from the display-list stream to the GPU (typically in some command-buffer form) for later execution, as long as the semantics of GL display lists are kept (and that is somewhat hard in general). So some implementations choose to push only a limited subset of the calls in a display list to a computed format, and simply replay the rest of the command stream on the CPU.
Selection is another one where it's unclear whether there is value to executing on the GPU.
Lastly, I have to say that in general, there is little correlation between the API calls and the amount of work on either the CPU or the GPU. A state-setting API tends to only modify a structure somewhere in the driver data. Its effect is only visible when a Draw, or some such, is called.
A lot of the GL API works like that. At that point, asking whether glEnable(GL_BLEND) is executed on the CPU or GPU is rather meaningless. What matters is whether the blending will happen on the GPU when Draw is called. So, in that sense, most GL entry points are not accelerated at all.
I could also expand a bit on data transfer but Danvil touched on it.
I'll finish with the little "s/w path". Historically, GL had to work to spec no matter what the hardware special cases were. Which meant that if the h/w was not handling a specific GL feature, then it had to emulate it, or implement it fully in software. There are numerous cases of this, but one that struck a lot of people is when GLSL started to show up.
Since there was no practical way to estimate the code size of a GLSL shader, it was decided that the GL was supposed to accept any shader length as valid. The implication was fairly clear: either implement h/w that could take arbitrary-length shaders (not realistic at the time), or implement a s/w shader emulation (or, as some vendors chose to, simply fail to be compliant). So, if you triggered this condition on a fragment shader, chances were the whole of your GL ended up being executed on the CPU, even when you had a GPU sitting idle, at least for that draw.
The question should perhaps be "What functions eat an unexpectedly high amount of CPU time?"
Keeping a matrix stack for projection and view is not a thing the GPU can handle better than a CPU would (on the contrary ...). Another example would be shader compilation. Why should this run on the GPU? There is a parser, a compiler, ..., which are just normal CPU programs like the C++ compiler.
Potentially "dangerous" function calls are for example glReadPixels, because data can be copied from host (=CPU) memory to device (=GPU) memory over the limited bus. In this category are also functions like glTexImage_D or glBufferData.
So generally speaking, if you want to know how much CPU time an OpenGL call eats, try to understand its functionality. And beware of all functions, which copy data from host to device and back!
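As a tiny illustration of the last point (the helper name and the RGBA/byte format are arbitrary choices of mine), a readback like the following forces the driver to finish all pending rendering and then copy the whole framebuffer back across the bus before it returns:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
#include <GL/gl.h>   // on Windows, include <windows.h> before this header

// Read the current framebuffer back into CPU memory. The call blocks until the
// GPU has finished rendering, then moves width*height*4 bytes over the bus.
std::vector<std::uint8_t> read_back_framebuffer(int width, int height) {
    std::vector<std::uint8_t> pixels(static_cast<std::size_t>(width) * height * 4);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pixels.data());
    return pixels;
}
```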
Typically, if an operation is per-something, it will occur on the GPU. An example is the actual transformation - this is done once per vertex. On the other hand, if it occurs only once per large operation, it'll be on the CPU - such as creating the transformation matrix, which is only done once for each time the object's state changes, or once per frame.
That's just a general answer, and some functionality will occur the other way around - as well as being implementation dependent. However, typically, it shouldn't matter to you, the programmer. As long as you allow the GPU plenty of time to do its work while you're off doing the game sim or whatever, or have a solid threading model, you shouldn't need to worry about it that much.
Sending data to the GPU: as far as I know (I've only used Direct3D), it's all done in-shader; that's what shaders are for.
glTranslate, glRotate and glScale change the currently active transformation matrix. This is of course a CPU operation. The modelview and projection matrices just describe how the GPU should transform vertices when you issue a rendering command.
So, e.g., by calling glTranslate nothing is translated yet. Before rendering, the current projection and modelview matrices are multiplied (MVP = projection * modelview), this single matrix is copied to the GPU, and then the GPU does the matrix * vertex multiplications ("T&L") for each vertex. So the translation/scaling/projection of the vertices is done by the GPU.
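In "build your own matrix, upload it to the shader" terms, the CPU-side work amounts to something like the sketch below. For brevity it uploads only a translation matrix rather than the full projection * modelview product, the helper name is mine, and glUniformMatrix4fv assumes a GL 2.0+ context with a function loader such as GLEW or GLAD.

```cpp
#include <GL/glew.h>   // any GL function loader that declares glUniformMatrix4fv will do

// CPU-side equivalent of glTranslatef(tx, ty, tz): build a column-major 4x4
// translation matrix and hand the finished matrix to a shader uniform.
// The GPU never sees the matrix math, only the result, as the wiki quote says.
void uploadTranslation(GLint mvpLocation, float tx, float ty, float tz) {
    const GLfloat mvp[16] = {
        1.0f, 0.0f, 0.0f, 0.0f,
        0.0f, 1.0f, 0.0f, 0.0f,
        0.0f, 0.0f, 1.0f, 0.0f,
        tx,   ty,   tz,   1.0f,   // translation sits in the fourth column (column-major)
    };
    glUniformMatrix4fv(mvpLocation, 1, GL_FALSE, mvp);
}
```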
Also you really should not be worried about the performance if you don't use these functions in an inner loop somewhere. glTranslate results in three additions. glScale and glRotate are a bit more complex.
My advice is that you should learn a bit more about linear algebra. This is essential for working with 3D APIs.
There are software-rendered implementations of OpenGL, so it's possible for no OpenGL functions to run on the GPU. There's also hardware that doesn't support certain render states in hardware, so if you set such a state the driver switches to software rendering, and again, nothing will run on the GPU (even though there's one there). So I don't think there's any clear distinction between "GPU-accelerated functions" and "non-GPU-accelerated functions".
To be on the safe side, keep things as simple as possible. The straightforward rendering-with-vertices and basic features like Z buffering are most likely to be hardware accelerated, so if you can stick to that with the minimum state changing, you'll be most likely to keep things hardware accelerated. This is also the way to maximize performance of hardware-accelerated rendering - graphics cards like to stay in one state and just crunch a bunch of vertices.