As far as I understand from the OpenGL documentation on compute shader compute spaces, I can divide the data space into local invocations (threads), which execute in parallel, and into work groups, each containing some number of local invocations, where the work groups themselves are executed independently and in an unspecified order (not necessarily in parallel?). Do I understand this correctly? The main question is: what is the best strategy for dividing the data? Should I always try to maximize the local invocation size and minimize the number of work groups to get better parallel execution, or is some other strategy better? For example, if I have 10000 elements in a data buffer (say, velocity in the x direction) and every element can be computed independently, how do I determine the best number of invocations (threads) and work groups?
P.S. For everyone who stumbles upon this question, here is an interesting article to read, which might answer your questions https://gpuopen.com/learn/optimizing-gpu-occupancy-resource-usage-large-thread-groups/
https://www.opengl.org/registry/doc/glspec45.core.pdf
Chapter 19:
A work group is a collection of shader invocations that execute the
same code, potentially in parallel.
While the individual shader
invocations within a work group are executed as a unit, work groups
are executed completely independently and in unspecified order.
After reading these sections quite a few times over, I find the "best" solution is to maximize the local invocation size and minimize the number of work groups, because you then tell the driver to omit the requirement that the invocation sets be independent. Fewer requirements mean fewer rules for the platform when it translates your intent into an execution, which universally yields a better (or the same) result.
An invocation within a work group may share data with other members of the same work group through shared variables (see section 4.3.8 (“Shared Variables”) of the OpenGL Shading Language Specification) and issue memory and control barriers to synchronize with other members of the same work group.
Independence between invocations can be derived by the platform when compiling the shader code.
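To make the mechanics concrete for the 10000-element example from the question (a sketch only: the local size of 256, the buffer name, and the uniform are illustrative choices; the driver reports the actual upper limit via GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS):

#version 430
layout(local_size_x = 256) in;

layout(std430, binding = 0) buffer Velocities { float vx[]; };   // e.g. velocity in x

uniform uint elementCount;   // 10000 in the example

void main() {
    uint i = gl_GlobalInvocationID.x;
    if (i >= elementCount) return;   // the last work group is only partly filled
    vx[i] = vx[i];                   // ...independent per-element update goes here...
}

The host then dispatches ceil(10000 / 256) = 40 work groups, e.g. glDispatchCompute((10000 + 255) / 256, 1, 1); whether a larger or smaller local size actually runs faster is worth measuring on the target hardware.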
I'm writing a raytracer with EM-processing (radar application). The following is done for many different simulation locations of the radar:
raytracing, which generates a lot of data
EM post-processing, which generates a scalar and then deletes the raytrace data
Each simulation point will be encapsulated in its own instance of a class (all grouped in a std::vector), with a specification of the radar location for that simulation point, references to data it will only read (shared by all simulation points), and properties for storing its results (so each simulation point has its own). Because of this setup, I thought I could benefit from using a for_each loop with the std::execution::par_unseq policy without taking further measures. Is that correct?
The problem, however, is that the raytracing generates so much data that, with for example ten thousand simulation locations, I may run out of memory if the scheduler decides to do all the raytracing first, which it is allowed to do with par_unseq. So my idea was to write a normal for loop that feeds an inner parallel for_each loop with, say, 100 simulation points at a time. Is this an optimal solution for my case, or did I totally misinterpret how parallel execution works?
Quoting the standard, §25.3.3/8, one can find (emphasis mine):
The semantics of invocation with execution::unsequenced_policy, execution::parallel_policy, or execution::parallel_unsequenced_policy allow the implementation to fall back to sequential execution if the system cannot parallelize an algorithm invocation, e.g., due to lack of resources.
So implementations are allowed by the standard to fall back to sequential execution if they run out of resources.
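For what it's worth, the chunking idea from the question is also a straightforward way to bound peak memory yourself rather than relying on the implementation. A minimal sketch, with SimulationPoint and run() standing in for the real class:

#include <algorithm>
#include <cstddef>
#include <execution>
#include <vector>

// Stand-in for the class described in the question.
struct SimulationPoint {
    void run() { /* raytrace, EM post-process, then free the raytrace data */ }
};

void run_in_chunks(std::vector<SimulationPoint>& points, std::size_t chunk = 100)
{
    // Sequential outer loop: at most `chunk` points hold raytrace data at once.
    for (std::size_t begin = 0; begin < points.size(); begin += chunk) {
        const auto first = points.begin() + static_cast<std::ptrdiff_t>(begin);
        const auto last  = points.begin() +
            static_cast<std::ptrdiff_t>(std::min(begin + chunk, points.size()));
        // Parallel (and vectorizable) inner loop over the current chunk only.
        std::for_each(std::execution::par_unseq, first, last,
                      [](SimulationPoint& p) { p.run(); });
    }
}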
I am playing with compute shaders in Vulkan and have reached a problem which I cannot solve to my satisfaction. I have two compute shaders. The first one calculates (among other things) the number of invocations needed in the second one and writes it (indirectly, through atomicAdd: every invocation adds an unknown amount to the total) into a field of a VkDispatchIndirectCommand. The problem is that VkDispatchIndirectCommand holds the number of work groups, not invocations, and the invocation count per work group should be at least the subgroup size (e.g. 32 on NVIDIA).
My first attempt, correcting the amount between the two shader runs on the host side, resulted in an immense performance drop. What would be a better approach, or is there an ideal solution in Vulkan that I just do not know yet?
From the use of atomicAdd, it sounds like the number of invocations you want is calculated in a distributed way across all the invocations of the first dispatch. Assuming you can't change that, and really need a post-process to convert from number of invocations to number of workgroups, you can run a very small dispatch (one thread) after the first one which does that conversion before the indirect dispatch. This is essentially what you're doing on the CPU, but done on the GPU in a pipelined way that should have lower latency.
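As an illustration only (the buffer names, bindings, and the second shader's local size are assumptions), such a one-invocation fix-up shader could look like this:

#version 450
layout(local_size_x = 1) in;

// Assumed layout: the first dispatch atomically accumulates the required
// invocation count into invocationCount; the Indirect buffer is the one later
// consumed as the VkDispatchIndirectCommand by the second dispatch.
layout(std430, binding = 0) buffer Counts   { uint invocationCount; };
layout(std430, binding = 1) buffer Indirect { uint groupCountX; uint groupCountY; uint groupCountZ; };

const uint LOCAL_SIZE_X = 64u;   // must match local_size_x of the second shader

void main() {
    groupCountX = (invocationCount + LOCAL_SIZE_X - 1u) / LOCAL_SIZE_X;   // round up
    groupCountY = 1u;
    groupCountZ = 1u;
}

It needs the usual pipeline barriers: one between the first dispatch and this fix-up, and one making the write visible to vkCmdDispatchIndirect.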
Let's say I have an OpenGL compute shader written in GLSL, executing on an NVIDIA GeForce 970.
At the start of the shader, a single invocation writes to a "Shader Storage Buffer Object" (SSBO).
I then issue a suitable barrier, like memoryBarrier() in my GLSL.
I then read from the memory written in the first step, in each invocation.
Will that first write be visible to all invocations in the current compute operation?
At https://www.khronos.org/opengl/wiki/Memory_Model#Ensuring_visibility , Khronos say:
"Use coherent and an appropriate memoryBarrier* or groupMemoryBarrier call if you use a mechanism like barrier to synchronize between invocations."
I'm pretty sure it's possible to synchronize this way within a work group. But does it work for all invocations in every work group, in the entire compute operation?
I'm unsure how an entire set of work groups is scheduled. I would expect them to possibly run sequentially, making the kind of synchronization I'm asking about impossible?
But does it work for all invocations in every work group, in the entire compute operation?
No. The scope of barrier is explicitly within a work group. And you cannot have visibility of operations that you haven't ensured have happened yet. The order of execution of work groups with respect to one another is undefined, so you don't know if one work group has executed yet.
What you want isn't really possible. You need instead to change how your shaders work so that work groups are not dependent on each other. In this case, you can have every work group perform this computation. And instead of storing it in global memory via an SSBO, store the result in a shared variable.
Yes, you'll be computing the same value in each group. But that will yield better performance than having all of those work groups wait on one work group. Especially since that's not something you can actually do.
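A sketch of that restructuring, with computeValue() standing in for whatever the single invocation used to compute; only the shared-variable pattern matters here:

#version 430
layout(local_size_x = 64) in;

shared float sharedValue;                 // per work group, not global memory

float computeValue() { return 42.0; }     // placeholder for the real computation

void main() {
    if (gl_LocalInvocationIndex == 0u) {  // one invocation per group produces the value
        sharedValue = computeValue();
    }
    memoryBarrierShared();                // make the shared write visible...
    barrier();                            // ...and wait for it in every invocation
    // every invocation in this work group can now read sharedValue
}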
I have seen many implementations of parallel scan; the two main ones are Hillis & Steele and the Blelloch scan. However, all the implementations I have seen work within shared memory, i.e. memory shared only between threads in a block.
Are there any implementations of scan that work well over arrays that have more elements than threads per block, i.e. the array will not fit into shared memory?
This link mentions a scan implementation I see in all my searches, a Hillis & Steele version, example 39-1: https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html.
Is the only option to do a segmented scan on sub-arrays within the array and then do a "final scan" that adds an offset value from the prior sub-array to the next?
With or without shared memory, CUDA kernels execute in chunks (threadblocks) that can execute in any order. To take full advantage of the hardware, you must have multiple threadblocks in your kernel call, but this creates an uncertain execution order.
Because of this, a scan algorithm that works across a large array will necessarily have to work in threadblock-sized pieces (in some fashion). If we have multiple threadblocks, then a given threadblock has no way of knowing whether other threadblocks have finished their work on adjacent data. (Yes, there are contrived mechanisms to allow inter-threadblock communication, but these are fraught with difficulty and don't solve the problem on a large scale.)
The net effect of this is that algorithms like this generally imply a global sync of some sort, and the only safe-in-any-scenario global sync is the kernel launch. Threadblocks can do a portion of their work independently, but when it comes time to stitch the work of threadblocks together, we must wait until step A is completed across all threadblocks before proceeding with step B.
Therefore I think you'll find that most device-wide scan algorithms, including the chapter 39 GPU Gems example you linked, as well as thrust and cub will launch multiple kernels to get this job done, since the kernel launch gives a convenient global sync.
Note that we can certainly devise a scan that has individual threadblocks that "work on more elements than threads per block", but this does not ultimately solve our problem (unless we use only 1 threadblock), because we must launch multiple threadblocks to take full advantage of the hardware, and multiple threadblocks in the general case introduces the global sync necessity.
The cub and thrust implementations I mentioned are both open-source template libraries, so you can certainly study the code there if you wish (not a trivial undertaking). They do represent high-quality approaches designed and built by CUDA experts. You can also at a high level study their behavior quite easily using:
nvprof --print-gpu-trace ./mycode
to get a quick read on how many kernels are being launched and what data transfers may be occurring, or you can use nvvp, the visual profiler, to study this.
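If you just want a device-wide scan without writing any of this yourself, a single Thrust call is enough, and running it under nvprof as above shows the several kernel launches it performs internally (the size here is arbitrary):

#include <thrust/device_vector.h>
#include <thrust/scan.h>

int main()
{
    // 1M elements: far more than one threadblock can hold in shared memory.
    thrust::device_vector<int> data(1 << 20, 1);
    // Device-wide inclusive scan; Thrust handles the block-sized pieces and
    // uses kernel boundaries as the global synchronization points.
    thrust::inclusive_scan(data.begin(), data.end(), data.begin());
    return 0;
}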
I’m conducting a molecular dynamics simulation and have been struggling for quite a while to implement it in parallel. Although I succeeded in fully loading my 4-thread processor, the computation time in parallel is greater than the computation time in serial mode.
Studying at which point in time each thread starts and finishes its loop iterations, I’ve noticed a pattern: it’s as if different threads are waiting for each other.
It was then that I turned my attention to the structure of my program. I have a class, an instance of which represents my system of particles, containing all the information about the particles and some functions that use this information. I also have a class, an instance of which represents my interatomic potential, containing the parameters of the potential function along with some functions (one of which calculates the force between two given particles).
So in my program there exist instances of two different classes, and they interact with each other: some functions of one class take references to instances of the other class.
And the block I’m trying to implement in parallel looks like this:
void Run_simulation(Class_system &system, Class_potential &potential, some other arguments){
    #pragma omp parallel for
    for(…)
}
for(...) is the actual computation, using data from the system instance of the Class_system class and some functions from the potential instance of the Class_potential class.
Am I right that it’s this structure that’s the source of my troubles?
Could you suggest what has to be done in this case? Must I rewrite my program in a completely different manner? Should I use some different tool to implement my program in parallel?
Without further details on your simulation type I can only speculate, so here are my speculations.
Did you look into the issue of load balancing? I guess the loop distributes the particles among threads but if you have some kind of a restricted range potential, then the computational time might differ from particle to particle in the different regions of the simulation volume, depending on the spatial density. This is a very common problem in molecular dynamics and one that is very hard to solve properly in distributed memory (MPI in most cases) codes. Fortunately with OpenMP you get direct access to all particles at each computing element and so the load balancing is much easier to achieve. It is not only easier, but it is also built-in, so to speak - simply change the scheduling of the for directive with the schedule(dynamic,chunk) clause, where chunk is a small number whose optimal value might vary from simulation to simulation. You might make chunk part of the input data to the program or you might instead write schedule(runtime) and then play with different scheduling classes by setting the OMP_SCHEDULE environment variable to values like "static", "dynamic,1", "dynamic,10", "guided", etc.
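For example (the per-particle routine and the chunk size of 16 are placeholders), the clause goes directly on the existing directive:

// Hypothetical per-particle routine whose cost varies with local density.
void compute_forces_on(int i);

void force_step(int n_particles)
{
    // Idle threads grab the next chunk of 16 iterations, evening out the load.
    // Alternatively use schedule(runtime) and tune via the OMP_SCHEDULE variable.
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n_particles; ++i)
        compute_forces_on(i);
}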
Another possible source of performance degradation is false sharing and true sharing. False sharing occurs when your data structure is not suitable for concurrent modification. For example, if you keep 3D position and velocity information for each particle (let's say you use a velocity Verlet integrator), given IEEE 754 double precision, each coordinate/velocity triplet takes 24 bytes. This means that a single cache line of 64 bytes accommodates 2 complete triplets and 2/3 of another one. The consequence is that no matter how you distribute the particles among the threads, there will always be at least two threads that have to share a cache line. Suppose that those threads run on different physical cores. If one thread writes to its copy of the cache line (for example, it updates the position of a particle), the cache coherency protocol gets involved and invalidates the cache line in the other thread, which then has to reread it from a slower cache or even from main memory. When the second thread updates its particle, this invalidates the cache line in the first core. The solution to this problem is proper padding and a proper chunk size choice so that no two threads share a single cache line. For example, if you add a superficial 4th dimension (you can use it to store the potential energy of the particle in the 4th element of the position vector and the kinetic energy in the 4th element of the velocity vector), then each position/velocity quadruplet takes 32 bytes and the information for exactly two particles fits in a single cache line. If you then distribute an even number of particles per thread, you automatically get rid of possible false sharing.
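A sketch of that padding scheme (the type and field names are illustrative):

#include <vector>

// 4 doubles = 32 bytes, so a 64-byte cache line holds exactly two records and
// the spare component is put to use instead of being wasted.
struct Vec4 {
    double x, y, z, w;   // w: potential energy (positions) / kinetic energy (velocities)
};
static_assert(sizeof(Vec4) == 32, "two records per 64-byte cache line");

std::vector<Vec4> position;   // position[i].w = potential energy of particle i
std::vector<Vec4> velocity;   // velocity[i].w = kinetic energy of particle i

// Giving each thread an even number of consecutive particles then keeps every
// cache line entirely within one thread.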
True sharing occurs when threads concurrently access the same data structure and there is an overlap between the parts of the structure modified by the different threads. In molecular dynamics simulations this occurs very frequently, as we want to exploit Newton's third law in order to cut the computational time in half when dealing with pairwise interaction potentials. When one thread computes the force acting on particle i while enumerating its neighbours j, computing the force that j exerts on i automatically gives you the force that i exerts on j, so that contribution can be added to the total force on j. But j might belong to another thread that might be modifying it at the same time, so atomic operations have to be used for both updates (both, since another thread might update i if it happens to neighbour one or more of its own particles). Atomic updates on x86 are implemented with locked instructions. This is not as horribly slow as often presented, but still slower than a regular update. It also suffers from the same cache line invalidation effect as false sharing. To get around this, at the expense of increased memory usage, one could use local arrays to store partial force contributions and then perform a reduction at the end. The reduction itself has to be performed either serially or in parallel with locked instructions, so it might turn out that not only is there no gain from using this approach, but that it is even slower. Proper particle sorting and clever distribution between the processing elements so as to minimise the interface regions can be used to tackle this problem.
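A sketch of the per-thread accumulation variant; the neighbour list and pairwise force are hypothetical stand-ins for the real potential code:

#include <omp.h>
#include <vector>

struct Vec3 { double x = 0, y = 0, z = 0; };
Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

// Hypothetical stand-ins: neighbour list (pairs with j > i only) and pair force.
std::vector<int> neighbours_of(int i);
Vec3 pair_force(int i, int j);

void compute_forces(std::vector<Vec3>& force)
{
    const int n = static_cast<int>(force.size());
    // One private force array per thread: no atomics and no shared cache lines
    // during accumulation.
    std::vector<std::vector<Vec3>> partial(omp_get_max_threads(),
                                           std::vector<Vec3>(n));
    #pragma omp parallel
    {
        std::vector<Vec3>& f = partial[omp_get_thread_num()];
        #pragma omp for schedule(dynamic, 16)
        for (int i = 0; i < n; ++i)
            for (int j : neighbours_of(i)) {
                const Vec3 fij = pair_force(i, j);
                f[i] = f[i] + fij;   // Newton's third law applied entirely
                f[j] = f[j] - fij;   // within this thread's private array
            }
    }
    // The price of avoiding atomics: a final reduction pass (serial here).
    for (const std::vector<Vec3>& f : partial)
        for (int i = 0; i < n; ++i)
            force[i] = force[i] + f[i];
}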
One more thing that I would like to touch on is memory bandwidth. Depending on your algorithm, there is a certain ratio between the number of data elements fetched and the number of floating point operations performed at each iteration of the loop. Each processor has only a limited bandwidth available for memory fetches, and if your data does not quite fit in the CPU cache, the memory bus may be unable to deliver enough data to feed so many threads executing on a single socket. Your Core i3-2370M has only 3 MiB of L3 cache, so if you explicitly keep the position, velocity and force for each particle, you can only store about 43000 particles in the L3 cache and about 3600 particles in the L2 cache (or about 1800 particles per hyperthread).
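For reference, the arithmetic behind those estimates, assuming 3 × 24 bytes per particle (position, velocity, force) and the 256 KiB per-core L2 of that CPU:

3 MiB / 72 B = 3,145,728 / 72 ≈ 43,700 particles in L3
256 KiB / 72 B = 262,144 / 72 ≈ 3,640 particles in L2, i.e. ≈ 1,820 per hyperthread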
The last one is hyperthreading. As High Performance Mark has already noted, hyperthreads share a great deal of core machinery. For example, there is only one AVX vector FPU engine that is shared between both hyperthreads. If your code is not vectorised, you lose a great deal of the computing power available in your processor. If your code is vectorised, then both hyperthreads will get in each other's way as they fight for control over the AVX engine. Hyperthreading is useful only when it is able to hide memory latency by overlapping computation (in one hyperthread) with memory loads (in another hyperthread). With dense numerical codes that perform many register operations before they do a memory load/store, hyperthreading gives no benefit whatsoever, and you'd be better off running with half the number of threads and explicitly binding them to different cores so as to prevent the OS scheduler from running them as hyperthreads. The scheduler on Windows is particularly dumb in this respect; see here for an example rant. Intel's OpenMP implementation supports various binding strategies controlled via environment variables, and so does GNU's. I am not aware of any way to control thread binding (a.k.a. affinity masks) in Microsoft's OpenMP implementation.
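For instance, with GNU's OpenMP runtime one could run with half the threads and pin them to separate physical cores (the logical-CPU-to-core mapping is machine specific, so check /proc/cpuinfo first; ./md_sim is a placeholder for the executable):

OMP_NUM_THREADS=2 GOMP_CPU_AFFINITY="0 1" ./md_sim

With Intel's runtime the corresponding knob is KMP_AFFINITY (e.g. KMP_AFFINITY=scatter).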