Can I run this code in a loop without reading results from the SSBO, and only read the SSBO results after 100 iterations?
for (int i = 0; i < 100; i++) {
    glDispatchCompute(1, 200, 1);
    glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT); // I understand this is needed to ensure
    // the GLSL code from the previous iteration has finished running on the GPU
}
Also, will the GLSL code executed, say, the second time through the loop (i == 1) see the results the first execution (i == 0) wrote to the SSBO?
Finally, do I really need the glMemoryBarrier call inside the loop, or can it go outside the loop? I am concerned that the GPU code will not see the changes the first iteration made to the SSBO when it runs the second time.
1) Yes, you can run your shader multiple times without reading the contents of the buffer you are writing to, and read them at the end (this is a very common practice in iterative GPU sorting algorithms).
2) If you are reading/writing the same buffer, yes, the writes of earlier dispatches will be visible (as long as you issue the barrier from point 3).
3) Yes, you need a barrier; otherwise the next compute shader dispatch will be launched without waiting for the previous one to finish, which will lead to wrong results (as you are concerned), if not crashes. However, the barrier type will depend on what you are doing within your shader. Here is the full list of barriers:
https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glMemoryBarrier.xhtml
Most probably, if you are focusing on reading/writing an SSBO, you should use the GL_SHADER_STORAGE_BARRIER_BIT barrier, but if you are not sure, you can just use GL_ALL_BARRIER_BITS.
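For reference, here is a minimal sketch of the corrected loop, assuming the shader only reads and writes the SSBO (computeProgram is a placeholder handle, not from the question):

glUseProgram(computeProgram); // placeholder: your compute program object
for (int i = 0; i < 100; i++) {
    glDispatchCompute(1, 200, 1);
    // Make this dispatch's SSBO writes visible to the next dispatch.
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
}
// Read the results back only once, after the loop, e.g. via
// glGetBufferSubData or glMapBufferRange on the SSBO.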
Related
I'm trying to implement forward+ rendering using compute shaders in GLSL 4.6, but I don't know how to synchronize threads within a work group when working with off-screen pixels. For example, my window resolution is 1600x900 and I'm using a work group size of 16x16, where each thread or invocation corresponds to a single pixel on the screen. This means that size_x = 1600/16 = 100 and size_y = 900/16 = 56.25, so I need to call
glDispatchCompute(100, 57, 1);
As you can see, some threads in a work group may represent pixels that extend beyond the screen, so I want to return early or discard these off-screen pixels to skip the complex computation. However, my compute shader also contains barrier() calls in several places to synchronize local threads, and I don't know how to combine the two. The documentation says
For any given static instance of barrier in a compute shader, all invocations within a single work group must enter it before any are allowed to continue beyond it.
......
Barriers are also disallowed after a return statement
The only workaround I can think of is to fake the computations for these threads, or to use if/else so that they skip the real work in each stage between two barrier() calls. I guess this will introduce a small performance penalty. So, is there a better way to rule out invalid threads in a work group? I believe this problem is quite common for compute shaders, so there might be an idiomatic way of handling it.
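To make that workaround concrete, here is a rough GLSL sketch (not a definitive pattern; the shared data and the hard-coded screen size are placeholders): every invocation writes to shared memory and reaches barrier() in uniform control flow, and only the expensive per-pixel work is guarded by the on-screen test.

#version 460
layout(local_size_x = 16, local_size_y = 16) in;

shared float tileData[16][16]; // placeholder shared storage

void main() {
    ivec2 pixel = ivec2(gl_GlobalInvocationID.xy);
    bool onScreen = all(lessThan(pixel, ivec2(1600, 900))); // or pass the size as a uniform

    // Every invocation participates here, including off-screen ones,
    // so the barrier() below is reached by the whole work group.
    tileData[gl_LocalInvocationID.y][gl_LocalInvocationID.x] = onScreen ? 1.0 : 0.0;
    barrier();

    if (onScreen) {
        // expensive per-pixel computation only for real pixels
    }
}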
I can define a shared data structure (for example an array):
shared float sharedData[gl_WorkGroupSize.x];
for each workgroup. Execution order inside a workgroup is undefined, so at some point I may need to synchronize all threads that use the shared array; for example, all threads have to write some data to the shared array before the calculations start. I found two ways to achieve this:
OpenGL SuperBible:
barrier();
memoryBarrierShared();
OpenGL 4 Shading Language Cookbook:
barrier();
Should I call memoryBarrierShared() after barrier()? Could you give me some practical examples of when I can use memoryBarrierShared() or memoryBarrier() without using barrier()?
Memory barriers ensure visibility in otherwise incoherent memory access.
What this really means is that an invocation of your compute shader will not be allowed to attempt some sort of optimization that would read and/or write cached memory.
Writing to something like a Shader Storage Buffer is an example of ordinarily incoherent memory access; without a memory barrier, changes made in one invocation are only guaranteed to be visible within that invocation. Other invocations are allowed to maintain their own cached view of the memory unless you tell the GLSL compiler to enforce coherent memory access and where to do so (memoryBarrier*()).
There is a serious caveat here, and that is that visibility is only half of the equation. Forcing coherent memory access when the shader is compiled does nothing to solve actual execution order issues across the threads in a workgroup. To make sure that all invocations in a workgroup have finished processing up to a certain point in your shader, you must use barrier().
Consider the following compute shader pseudo-code:
#version 450
layout (local_size_x = 128) in;
shared float foobar [128]; // shared implies coherent
void main (void)
{
    foobar [gl_LocalInvocationIndex] = 0.0;

    memoryBarrierShared (); // Ensure change to foobar is visible in other invocations
    barrier ();             // Stall until every thread is finished clearing foobar

    // At this point, _every_ index (0-127) of `foobar` will have the value 0.0.
    // Without the barrier, and just the memory barrier, the contents of everything
    // but foobar [gl_LocalInvocationIndex] would be undefined at this point.
}
Outside of GLSL, there are also barriers at the GL command level (glMemoryBarrier (...)). You would use those in situations where you need a compute shader to finish executing before GL is allowed to do something that depends on its results.
In the traditional render pipeline GL can implicitly figure out which commands must wait for others to finish (e.g. glReadPixels (...) stalls until all commands finish writing to the framebuffer). However, with compute shaders and image load/store, implicit synchronization no longer works and you have to tell GL which pipeline memory operations must be finished and visible to the next command.
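As a sketch of that situation (the program, texture, and VAO names are placeholders, not from the posts above): a compute pass writes an image via imageStore, and a later draw samples that image as a texture, so a GL_TEXTURE_FETCH_BARRIER_BIT barrier goes in between.

glUseProgram(computeProgram);
glBindImageTexture(0, tex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F);
glDispatchCompute(groupsX, groupsY, 1);

// Tell GL that subsequent texture fetches must see the imageStore writes.
glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT);

glUseProgram(drawProgram);
glBindTextureUnit(0, tex);            // sampled by the fragment shader
glBindVertexArray(fullscreenQuadVao);
glDrawArrays(GL_TRIANGLES, 0, 6);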
I use an atomic counter in a compute shader with an atomic_uint bound to a dynamic GL_ATOMIC_COUNTER_BUFFER (in a similar way to this opengl-atomic-counter tutorial lighthouse3d).
I'm using the atomic counter in a particle system to check a condition has been reached for all particles; I expect to see counter==numParticles when all of the particles are in the correct place.
I map the buffer each frame and check if the atomic counter has counted all of the particles:
GLuint *ptr = (GLuint *) glMapBuffer( GL_ATOMIC_COUNTER_BUFFER, GL_READ_ONLY );
GLuint particleCount = ptr[ 0 ];
glUnmapBuffer( GL_ATOMIC_COUNTER_BUFFER );
if( particleCount == numParticles() ){ // do stuff }
On a single-GPU host the code works fine and particleCount always reaches numParticles(), but on a multi-GPU host particleCount never reaches numParticles().
I can visually check that the condition has been reached and the test should be true; however, particleCount changes each frame, going up and down, but never reaching numParticles().
I have tried an OpenGL memory barrier on GL_ATOMIC_COUNTER_BARRIER_BIT before mapping and reading particleCount:
glMemoryBarrier(GL_ATOMIC_COUNTER_BARRIER_BIT);
GLuint *ptr = (GLuint *) glMapBuffer( GL_ATOMIC_COUNTER_BUFFER, GL_READ_ONLY );
GLuint particleCount = ptr[ 0 ];
glUnmapBuffer( GL_ATOMIC_COUNTER_BUFFER );
if( particleCount == m_particleSystem->numParticles() )
{ // do stuff }
and I've tried a GLSL memory barrier before incrementing the counter in the compute shader:
memoryBarrierAtomicCounter();
atomicCounterIncrement( particleCount );
but the atomic counter doesn't seem to synchronise across devices.
What is the correct way to synchronise so that the atomic counter works with multiple devices?
Your choice of memory barrier is actually inappropriate in this situation.
That barrier (GL_ATOMIC_COUNTER_BARRIER_BIT) would make changes to the atomic counter visible (e.g. flush caches and run shaders in a specific order), but what it does not do is make sure that any concurrent shaders are complete before you map, read and unmap your buffer.
Since your buffer is being mapped and read back, you do not need that barrier - that barrier is for coherency between shader passes. What you really need is to ensure all shaders that access your atomic counter are finished before you try to read data using a GL command, and for this you need GL_BUFFER_UPDATE_BARRIER_BIT.
GL_BUFFER_UPDATE_BARRIER_BIT:
Reads/writes via glBuffer(Sub)Data, glCopyBufferSubData, glProgramBufferParametersNV, and glGetBufferSubData, or to buffer object memory mapped by glMapBuffer(Range) after the barrier will reflect data written by shaders prior to the barrier.
Additionally, writes via these commands issued after the barrier will wait on the completion of any shader writes to the same memory initiated prior to the barrier.
You may be thinking about barriers from the wrong perspective. The barrier you need depends on which type of operation the memory read needs to be coherent to.
I would suggest brushing up on the incoherent memory access use cases:
(1) Shader write/read between rendering commands
One Rendering Command writes incoherently, and the other reads. There is no need for coherent (the GLSL qualifier) here at all. Just use glMemoryBarrier before issuing the reading rendering command, using the appropriate access bit.
(2) Shader writes, other OpenGL operations read
Again, coherent is not necessary. You must use a glMemoryBarrier before performing the read, using a bitfield that is appropriate to the reading operation of interest.
In case (1), the barrier you want is in fact GL_ATOMIC_COUNTER_BARRIER_BIT, because it will force strict memory and execution order rules between different shader passes that share the same atomic counter.
In case (2), the barrier you want is GL_BUFFER_UPDATE_BARRIER_BIT. The "reading operation of interest" is glMapBuffer (...) and as shown above, that is covered under GL_BUFFER_UPDATE_BARRIER_BIT.
In your situation, you are reading the buffer back using the GL API. You need GL commands to wait for all pending shaders to finish writing (this does not happen automatically for incoherent memory access - image load/store, atomic counters, etc.). That is textbook case (2).
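Concretely, a minimal sketch of the corrected read-back (reusing the code from the question, with the barrier bit swapped) would be:

// Wait for all pending shader writes to the buffer before reading it
// through the GL API (map/read/unmap).
glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);

GLuint *ptr = (GLuint *) glMapBuffer(GL_ATOMIC_COUNTER_BUFFER, GL_READ_ONLY);
GLuint particleCount = ptr[0];
glUnmapBuffer(GL_ATOMIC_COUNTER_BUFFER);

if (particleCount == numParticles()) { /* do stuff */ }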
The following GLSL compute shader simply copies inImage to outImage. It is derived from a more complex post-processing pass.
In the first several lines of main(), a single thread loads 64 pixels of data into the shared array. Then, after synchronizing, each of the 64 threads writes one pixel to the output image.
Depending on how I synchronize, I get different results. I originally thought memoryBarrierShared() would be the correct call, but it produces the following result:
which is the same result as having no synchronization or using memoryBarrier() instead.
If I use barrier(), I get the following (desired) result:
The striping is 32 pixels wide, and if I change the workgroup size to anything less than or equal to 32, I get correct results.
What's going on here? Am I misunderstanding the purpose of memoryBarrierShared()? Why should barrier() work?
#version 430

#define SIZE 64

layout (local_size_x = SIZE, local_size_y = 1, local_size_z = 1) in;

layout(rgba32f) uniform readonly image2D inImage;
uniform writeonly image2D outImage;

shared vec4 shared_data[SIZE];

void main() {
    ivec2 base = ivec2(gl_WorkGroupID.xy * gl_WorkGroupSize.xy);
    ivec2 my_index = base + ivec2(gl_LocalInvocationID.x, 0);

    if (gl_LocalInvocationID.x == 0) {
        for (int i = 0; i < SIZE; i++) {
            shared_data[i] = imageLoad(inImage, base + ivec2(i, 0));
        }
    }

    // with no synchronization:    stripes
    // memoryBarrier();         // stripes
    // memoryBarrierShared();   // stripes
    // barrier();               // works

    imageStore(outImage, my_index, shared_data[gl_LocalInvocationID.x]);
}
The problem with image load/store and friends is that the implementation can no longer be sure that a shader only changes the data of its dedicated output values (e.g. the framebuffer after a fragment shader). This applies even more to compute shaders, which don't have a dedicated output and only output things by writing data into writable stores, like images, storage buffers or atomic counters. This may require manual synchronization between individual passes, as otherwise a fragment shader trying to access a texture might not see the most recent data written into that texture with image store operations by a preceding pass, like your compute shader.
So it may be that your compute shader works perfectly, but it is the synchronization with the following display (or whatever) pass (which needs to read this image data somehow) that fails. For this purpose there exists the glMemoryBarrier function. Depending on how you read that image data in the display pass (or, more precisely, in the pass that reads the image after the compute shader pass), you need to give a different flag to this function. If you read it using a texture, use GL_TEXTURE_FETCH_BARRIER_BIT; if you use an image load again, use GL_SHADER_IMAGE_ACCESS_BARRIER_BIT; if you use glBlitFramebuffer for display, use GL_FRAMEBUFFER_BARRIER_BIT...
I don't have much experience with image load/store and manual memory synchronization, though, and this is only what I came up with theoretically. So if anyone knows better, or if you already use a proper glMemoryBarrier, feel free to correct me. Likewise, this need not be your only error (if any). But the last two points from the linked Wiki article actually address your use case and IMHO make it clear that you need some kind of glMemoryBarrier:
Data written to image variables in one rendering pass and read by the shader in a later pass need not use coherent variables or memoryBarrier(). Calling glMemoryBarrier with the SHADER_IMAGE_ACCESS_BARRIER_BIT set in barriers between passes is necessary.
Data written by the shader in one rendering pass and read by another mechanism (e.g., vertex or index buffer pulling) in a later pass need not use coherent variables or memoryBarrier(). Calling glMemoryBarrier with the appropriate bits set in barriers between passes is necessary.
EDIT: Actually the Wiki article on compute shaders says
Shared variable access uses the rules for incoherent memory access. This means that the user must perform certain synchronization in order to ensure that shared variables are visible.
Shared variables are all implicitly declared coherent, so you don't need to (and can't use) that qualifier. However, you still need to provide an appropriate memory barrier.
The usual set of memory barriers is available to compute shaders, but they also have access to memoryBarrierShared(); this barrier is specifically for shared variable ordering. groupMemoryBarrier() acts like memoryBarrier(), ordering memory writes for all kinds of variables, but it only orders read/writes for the current work group.
While all invocations within a work group are said to execute "in parallel", that doesn't mean that you can assume that all of them are executing in lock-step. If you need to ensure that an invocation has written to some variable so that you can read it, you need to synchronize execution with the invocations, not just issue a memory barrier (you still need the memory barrier though).
To synchronize reads and writes between invocations within a work group, you must employ the barrier() function. This forces an explicit synchronization between all invocations in the work group. Execution within the work group will not proceed until all other invocations have reached this barrier. Once past the barrier(), all shared variables previously written across all invocations in the group will be visible.
So this actually sounds like you need the barrier there, and memoryBarrierShared alone is not enough (though you don't need both, as the last sentence says). The memory barrier will just synchronize the memory, but it doesn't stop the execution of the threads from crossing it. Thus the threads won't read any old cached data from the shared memory if the first thread has already written something, but they can very well reach the point of reading before the first thread has tried to write anything at all.
This actually fits perfectly with the fact that it works for block sizes of 32 and below, and that the first 32 pixels work. On NVIDIA hardware at least, 32 is the warp size and thus the number of threads that operate in perfect lock-step. So the first 32 threads (well, every block of 32 threads) always work exactly in parallel (well, conceptually, that is) and thus cannot introduce any race conditions. This is also why you don't actually need any synchronization if you know you work inside a single warp, a common optimization.
I've been writing a raytracer the past week, and have come to a point where it's doing enough that multi-threading would make sense. I have tried using OpenMP to parallelize it, but running it with more threads is actually slower than running it with one.
Reading over other similar questions, especially about OpenMP, one suggestion was that gcc optimizes serial code better. However, running the compiled code below with export OMP_NUM_THREADS=1 is twice as fast as with export OMP_NUM_THREADS=4; i.e. it's the same compiled code on both runs.
Running the program with time:
> export OMP_NUM_THREADS=1; time ./raytracer
real 0m34.344s
user 0m34.310s
sys 0m0.008s
> export OMP_NUM_THREADS=4; time ./raytracer
real 0m53.189s
user 0m20.677s
sys 0m0.096s
User time is a lot smaller than real, which is unusual when using multiple cores; user should be larger than real, as several cores are running at the same time.
Code that I have parallelized using OpenMP
void Raytracer::render( Camera& cam ) {

    // let the camera know to use this raytracer for probing the scene
    cam.setSamplingFunc(getSamplingFunction());

    int i, j;

    #pragma omp parallel private(i, j)
    {
        // Construct a ray for each pixel.
        #pragma omp for schedule(dynamic, 4)
        for (i = 0; i < cam.height(); ++i) {
            for (j = 0; j < cam.width(); ++j) {
                cam.computePixel(i, j);
            }
        }
    }
}
When reading this question I thought I had found my answer. It talks about the glibc implementation of rand() synchronizing calls to itself to preserve state for random number generation between threads. I am using rand() quite a lot for Monte Carlo sampling, so I thought that was the problem. I got rid of the calls to rand, replacing them with a single value, but using multiple threads was still slower. EDIT: oops, it turns out I didn't test this correctly; it was the random values!
Now that those are out of the way, I will give an overview of what's being done on each call to computePixel, so hopefully a solution can be found.
In my raytracer I essentially have a scene tree with all the objects in it. This tree is traversed a lot during computePixel when objects are tested for intersection; however, no writes are done to the tree or to any objects. computePixel essentially reads the scene a bunch of times, calling methods on the objects (all of which are const methods), and at the very end writes a single value to its own pixel array. This is the only part I am aware of where more than one thread will try to write to the same member variable. There is no synchronization anywhere, since no two threads can write to the same cell in the pixel array.
Can anyone suggest places where there could be some kind of contention? Things to try?
Thank you in advance.
EDIT:
Sorry, it was stupid of me not to provide more info on my system.
Compiler gcc 4.6 (with -O2 optimization)
Ubuntu Linux 11.10
OpenMP 3
Intel i3-2310M quad-core 2.1 GHz (on my laptop at the moment)
Code for computePixel:
class Camera {
    // constructors, destructors ...
private:
    // this is the array that is being written to, but not read from.
    Colour* _sensor; // allocated using new at construction.
};

void Camera::computePixel(int i, int j) const {
    Colour col;

    // simple code to construct the appropriate ray for the pixel
    Ray3D ray(/* params */);

    col += _sceneSamplingFunc(ray); // calls a const method that traverses the scene.

    _sensor[i*_scrWidth + j] += col;
}
From the suggestions, it might be the tree traversal that causes the slow-down. Some other aspects: there is quite a lot of recursion involved once the sampling function is called (recursive bouncing of rays); could this be causing these problems?
Thanks everyone for the suggestions, but after further profiling, and getting rid of other contributing factors, random-number generation did turn out to be the culprit.
As outlined in the question above, rand() needs to keep track of its state from one call to the next. If several threads are trying to modify this state, it would cause a race condition, so the default implementation in glibc is to lock on every call, to make the function thread-safe. This is terrible for performance.
Unfortunately the solutions to this problem that I've seen on stackoverflow are all local, i.e. deal with the problem in the scope where rand() is called. Instead I propose a "quick and dirty" solution that anyone can use in their program to implement independent random number generation for each thread, requiring no synchronization.
I have tested the code, and it works; there is no locking, and no noticeable slowdown as a result of calls to threadrand. Feel free to point out any blatant mistakes.
threadrand.h
#ifndef _THREAD_RAND_H_
#define _THREAD_RAND_H_
// max number of thread states to store
const int maxThreadNum = 100;
void init_threadrand();
// requires openmp, for thread number
int threadrand();
#endif // _THREAD_RAND_H_
threadrand.cpp
#include "threadrand.h"
#include <cstdlib>
#include <boost/scoped_ptr.hpp>
#include <omp.h>
// can be replaced with array of ordinary pointers, but need to
// explicitly delete previous pointer allocations, and do null checks.
//
// Importantly, the double indirection tries to avoid putting all the
// thread states on the same cache line, which would cause cache invalidations
// to occur on other cores every time rand_r would modify the state.
// (i.e. false sharing)
// A better implementation would be to store each state in a structure
// that is the size of a cache line
static boost::scoped_ptr<unsigned int> randThreadStates[maxThreadNum];
// reinitialize the array of thread state pointers, with random
// seed values.
void init_threadrand() {
    for (int i = 0; i < maxThreadNum; ++i) {
        randThreadStates[i].reset(new unsigned int(std::rand()));
    }
}

// requires openmp, for thread number, to index into array of states.
int threadrand() {
    int i = omp_get_thread_num();
    return rand_r(randThreadStates[i].get());
}
Now you can initialize the random states for threads from main using init_threadrand(), and subsequently get a random number using threadrand() when using several threads in OpenMP.
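As a usage sketch (the loop body is just a placeholder for the actual Monte Carlo work):

#include "threadrand.h"
#include <cstdlib>
#include <omp.h>

int main() {
    init_threadrand(); // seed one rand_r state per potential thread

    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < 1000000; ++i) {
        // each thread uses its own state, so there is no lock contention
        double u = threadrand() / (double) RAND_MAX;
        (void) u; // placeholder for the real per-sample work
    }
    return 0;
}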
The answer is that, without knowing what machine you're running this on and without really seeing the code of your computePixel function, it depends.
There are quite a few factors that could affect the performance of your code. One thing that comes to mind is cache alignment. Perhaps your data structures (you did mention a tree) are not really ideal for caching, and the CPU ends up waiting for the data to come from RAM, since it cannot fit things into the cache. Wrong cache-line alignment could cause something like that. If the CPU has to wait for things to come from RAM, it is likely that the thread will be context-switched out and another will be run.
Your OS thread scheduler is non-deterministic, so when a thread will run is not predictable. If it so happens that your threads are not running a lot, or are contending for CPU cores, this could also slow things down.
Thread affinity also plays a role. A thread will be scheduled on a particular core, and normally the scheduler attempts to keep it on the same core. If more than one of your threads ends up on a single core, they will have to share it, which is another reason things could slow down. For performance reasons, once a particular thread has run on a core, it is normally kept there unless there's a good reason to swap it to another core.
There are some other factors that I don't remember off the top of my head. However, I suggest doing some reading on threading; it's a complicated and extensive subject, and there's lots of material out there.
Is the data being written at the end data that other threads need in order to run computePixel?
One strong possibility is false sharing. It looks like you are computing the pixels in sequence, so each thread may be working on interleaved pixels. This is usually a very bad thing to do.
What could be happening is that each thread is trying to write the value of a pixel right beside one written by another thread (they all write to the sensor array). If these two output values share the same CPU cache line, this forces the CPU to flush the cache between the processors. This results in an excessive amount of flushing between CPUs, which is a relatively slow operation.
To fix this you need to ensure that each thread truly works on an independent region. Right now it appears you divide on rows (I'm not positive since I don't know OMP). Whether this works depends on how big your rows are, but still, the end of each row will overlap with the beginning of the next (in terms of cache lines). You might want to try breaking the image into four blocks and have each thread work on a series of sequential rows (like 1..10, 11..20, 21..30, 31..40). This would greatly reduce the sharing.
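As a sketch of that suggestion (reusing the Camera interface from the question; the chunk size is a guess and should be tuned), each thread gets one large contiguous block of rows, so writes to the _sensor array from different threads rarely land on the same cache line:

#include <algorithm>
#include <omp.h>

void Raytracer::render(Camera& cam) {
    cam.setSamplingFunc(getSamplingFunction());

    const int height = cam.height();
    const int chunk  = std::max(1, height / omp_get_max_threads());

    // One big contiguous run of rows per thread instead of small
    // interleaved chunks, to reduce false sharing on _sensor.
    #pragma omp parallel for schedule(static, chunk)
    for (int i = 0; i < height; ++i) {
        for (int j = 0; j < cam.width(); ++j) {
            cam.computePixel(i, j);
        }
    }
}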
Don't worry about reading constant data. So long as the data block is not being modified, each thread can read this information efficiently. However, be leery of any mutable data you have inside your constant data.
I just looked, and the Intel i3-2310M doesn't actually have 4 cores; it has 2 cores with hyper-threading. Try running your code with just 2 threads and see if that helps. In general I find hyper-threading totally useless when you have a lot of calculations, and on my laptop I turned it off and got much better compilation times for my projects.
In fact, just go into your BIOS and turn off HT; it's not useful for development/computation machines.