RenderPass dependency and (transition) memory barrier - C++

I am facing a tricky synchronization issue.
Let's say I have an image in a TRANSFER layout. At that point, the memory has already been made available (but not visible).
Let's say I update a uniform buffer (via vkCmdCopyBuffer).
Now let's say I have a render pass (with an "empty framebuffer", so there is no color attachment, to keep things simple) that uses the prior image in SHADER_READ_OPTIMAL layout along with the uniform buffer we just updated. The image and the buffer are both used inside the fragment shader.
Is it correct to do the following?
Transition the image to SHADER_READ_LAYOUT:
srcAccess = 0; // validation layers will report an error: it must be TRANSFER_READ
dstAccess = 0; // visibility will be handled by the render pass dependency (however, the layers say it should be SHADER_READ, I think)
srcPipe = TOP_OF_PIPE;
dstPipe = BOTTOM_OF_PIPE;
In my understanding, it is meaningless to use any access flags other than 0 here, because TOP_OF_PIPE and BOTTOM_OF_PIPE do not access memory.
In the render pass dependency from VK_SUBPASS_EXTERNAL:
srcAccess = TRANSFER_WRITE; // for the uniform buffer
dstAccess = SHADER_READ; // for the uniform buffer and the image
srcPipeline = TRANSFER; // for the uniform buffer
dstPipeline = FRAGMENT_SHADER; // they are used here
This way, we are sure the uniform buffer will not have any problems: the data is both made available and visible thanks to the render pass. The memory should also be made visible for the image (also thanks to the dependency). However, the transition is written here to happen "not before" the bottom stage. Since I am using the image in the FRAGMENT_SHADER stage, is that a mistake? Or does the "end of the render pass dependency" behave like a bottom stage?
This code works on NVIDIA and on AMD, but I am not sure it is really correct.

To understand synchronization properly, one must simply read the specification, especially the Execution and Memory Dependencies chapter. Let's analyze your situation in terms of what is written in the specification.
You have (only) three synchronization commands: S1 (image transition to TRANSFER_LAYOUT and availability operation), S2 (image transition to SHADER_READ_LAYOUT), and S3 (render pass VK_EXTERNAL dependency).
Your command buffer is an ordered list like: [Cmds0, S1, Cmds1, S2, Cmds2, S3, Cmds3].
For S1, let's assume you did the first part of the dependency (i.e. the src part) correctly. You only said you made the image's memory available.
You also said you did not make it visible, so let's assume dstAccess was 0 and dstStage was probably BOTTOM_OF_PIPE.
S2 has no execution dependency and no memory dependency; it is only a layout transition. There is a layout transition exception in the spec saying that layout transitions are performed in full in submission order (i.e. an implicit execution dependency is automagically added). I personally would not be comfortable relying on it (and I would not trust drivers to implement it correctly on the first try). But let's assume it is valid, and assume the image will be correctly transitioned and made available (but not visible) at some point after S1.
S3 is an external subpass dependency for a non-attachment resource, but the spec reassures us it is no different than a vkCmdPipelineBarrier with a VkMemoryBarrier.
The second part of the dependency (i.e. dst) in S3 seems correct for your needs.
TL;DR, so far so good.
The first part of the dependency (i.e. src) in S3 is indeed the problematic one.
There are no automatic layout transitions for non-attachment resources, so we cannot rely on that crutch as above.
The set of commands A3 is all the commands before the render pass.
Synchronization scope A3S includes only those operations that are on the srcStage pipeline stage or any logically earlier stage (i.e. TOP_OF_PIPE up to the specified STAGE_TRANSFER).
The execution dependency is made between A3' and B3'. Above, we agreed the B3' half of the dependency is correct. The A3' half is the intersection of A3 and A3S.
The layout transition in S2 is made between srcPipe = TOP_OF_PIPE and dstPipe = BOTTOM_OF_PIPE, so it can basically happen anywhere. It can be as late as BOTTOM_OF_PIPE (to be precise, it happens-before the BOTTOM_OF_PIPE stage of commands recorded after S2).
So the layout transition is part of A3, but there is no guarantee it is part of A3S; hence there is no guarantee the transition is part of the intersection A3'.
That means there is no guarantee the layout transition to SHADER_READ_LAYOUT happens-before the image reading in the first subpass in STAGE_FRAGMENT_SHADER.
There is no correct memory dependency either, because that is defined in terms of A3' too.
EDIT: Somehow I missed this, which is probably the real question:
Or is the "end of the renderPass dependency" behave like a bottom stage?
The beginning and end of a render pass do not behave like any stage. They only affect submission order. In the presence of a VK_SUBPASS_EXTERNAL dependency, only that dependency applies (plus, of course, any other previous explicit synchronization commands). What happens without an explicit VK_SUBPASS_EXTERNAL dependency is described in the spec too, below the Valid Usage sections of VkSubpassDependency (basically, all memory that is available before TOP_OF_PIPE is made visible to the whole first subpass for attachment use).

Related

Loading Box2D b2World from .Dump() file

I am trying to save and load the state of a b2World in order to resume a simulation at a later time, with the states of the collision manager, etc. exactly maintained. What is the best way to do this (without getting into library internals and having to use boost::serialize while monitoring the public/private members of every class)? Is there a way to repurpose the log output of the b2World::Dump function to construct the object again?
I think parsing Dump as-is is a dead end.
First, the output of Dump seems to be executable C++ code:
b2ChainShape chainShape;
b2Vec2 vertices[] = {b2Vec2(-5,0), b2Vec2(5,0), b2Vec2(5,5), b2Vec2(4,1), b2Vec2(-4,1), b2Vec2(-5,5)};
chainShape.CreateLoop(vertices, 6);
b2FixtureDef groundFixtureDef;
groundFixtureDef.density = 0;
groundFixtureDef.shape = &chainShape;
Secondly, there is the problem of dumping floating point values with enough precision to recreate the original object.
Finally, some objects don't seem to support Dumping at all.
Some alternatives:
Hack box2d and add your own state-preserving dumping mechanism
Keep all box2d objects in a specific memory area and use memory snapshotting and/or checkpointing techniques to restore that memory again on load. One such library I know of is Ken, but I'm sure there are other implementations.
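The second alternative can be illustrated with a toy arena: allocate everything the world owns from one fixed buffer, then save and restore that buffer wholesale. This is only a sketch of the snapshotting idea, not how Ken actually works; real checkpointing must also pin the arena's address so interior pointers stay valid.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Minimal bump-allocated arena with wholesale snapshot/restore.
class Arena {
public:
    explicit Arena(std::size_t size) : buf_(size), used_(0) {}

    // Hand out raw memory from the fixed buffer (no per-object free).
    void* alloc(std::size_t n) {
        void* p = buf_.data() + used_;
        used_ += n;
        return p;
    }

    // Copy the whole arena out as the checkpoint.
    std::vector<char> snapshot() const { return buf_; }

    // Overwrite the arena in place; existing pointers into it stay valid.
    void restore(const std::vector<char>& snap) {
        std::memcpy(buf_.data(), snap.data(), buf_.size());
    }

private:
    std::vector<char> buf_;
    std::size_t used_;
};
```

Anything living in the arena (bodies, fixtures, contacts) would snap back to its checkpointed state on restore.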

Questions about Pure ECS (Entity Component System) and update systems

I have written an ECS, but I have some questions about the update phase (in systems).
I have read many articles, but have not found references to this sort of problem.
In order to get the benefits of ECS (cache friendliness, for example), there are the following requirements:
An entity must be just an ID.
Components must be pure data (structs with no logic).
Systems contain the logic and update the components.
No interactions between systems (instead, systems communicate by adding "Tag" components to entities).
So the logic applied in each system is fine, and everything works when there is no "user code".
But when we deal with user code (for example, the user can attach C++ code to an object, like in Unity or Unreal), the problems come:
Since components contain only data, when the user modifies the local position, the world position is not updated (the world position will only be computed when the Transform System processes each Transform Component). So if the user asks for the world position right after modifying the local position, they will get the previous world position and not the actual one.
When an entity is removed, its children must be removed too. Since components contain only data and no logic, the children will not be removed until the next Parent System update. So we have some "delay" (the children will still be accessible until the next Parent System update removes them).
Suppose we have the entities A, B, C, and B is a child of A. In the user code (C++ code attached to the entity), the user sets the parent of B to C, then removes entity A. When the Parent System updates, it will detect that A has been removed (it can also detect that the parent of entity B has changed), but how can the system know whether entity A was removed after the parent change of entity B or before it?
Adding logic to components would ruin the advantages of pure ECS (doing the same operations on all identical components, in a cache-friendly way), so IMHO it's not a solution.
Does anyone have a solution? I would like to know how you deal with this sort of problem in your ECS implementations.
Thanks!
I had the same questions as you.
I modeled my solution after reading this (honestly a must-read):
Gamasutra: Syncing a data-oriented ECS with a stateful external system
A possible solution is to set up some rules regarding reading and writing components.
I follow the rule that it is always OK to read component data, but if you have to write data to a component from an external system (one that is not part of the component's interface systems), you must always go through a transform system function.
For example:
Given a transform component like this:
struct transform {
    glm::vec2 position = glm::vec2(0);
    glm::vec2 scale = glm::vec2(1);
    float rot_radians = 0.0f;
    glm::mat3 ltp = glm::mat3(1);
    glm::mat3 ltw = glm::mat3(1);
    entt::entity parent = entt::null;
    std::vector<entt::entity> children;
};
I will define some systems to write changes to it like this:
void set_position(entt::registry& r, entt::entity e, glm::vec2 position);
void set_rotation(entt::registry& r, entt::entity e, float rot_radians);
void set_scale(entt::registry& r, entt::entity e, glm::vec2 scale);
void set_parent(entt::registry& r, entt::entity to, entt::entity parent = entt::null);
Inside those functions, you are allowed to read/write transform component data freely.
The more I work with ECS, the more I tend to think as if I were programming in C: you have data and functions that change that data. Of course you can go and change the component data directly, but I realized it's not worth spending time trying to prevent that; it's simply a bug or bad programming if you do it on a component that needs additional work after its data is updated.
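The write-function rule above can be sketched without entt or glm. Here a bare `vec2` and a plain vector of components stand in for the real types, and a hypothetical `set_position` both writes the local data and eagerly recomputes the world position, so a read immediately after the write is never stale:

```cpp
#include <vector>

// Simplified stand-ins for glm::vec2 and an entt registry of components.
struct vec2 { float x = 0, y = 0; };

struct transform {
    vec2 local_position;   // data the user edits
    vec2 world_position;   // derived data
    int  parent = -1;      // index of the parent entity, -1 = none
};

// "Write through a system function": instead of letting user code poke
// local_position directly, set_position updates it AND recomputes the
// world position immediately (translation only, for brevity).
void set_position(std::vector<transform>& registry, int e, vec2 p) {
    transform& t = registry[e];
    t.local_position = p;
    if (t.parent >= 0) {
        const transform& par = registry[t.parent];
        t.world_position = { par.world_position.x + p.x,
                             par.world_position.y + p.y };
    } else {
        t.world_position = p;
    }
}
```

A full version would also propagate the change to `children`, which is exactly the bookkeeping the transform system functions exist to centralize.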

Choosing between multiple shaders based on uniform variable

I want to choose between 2 fragment shaders based on the value of a uniform variable, and I want to know how to do that.
I have an onSurfaceCreated function which compiles and links program1, and calls glGetAttribLocation on program1.
In my onDrawFrame I call glUseProgram(program1). This function runs for every frame.
My problem is that in onDrawFrame I get the value of my uniform variable, and there I have to choose between program1 and program2. But program1 is already compiled and linked. How do I change the program accordingly and use it, since all of that was already done in onSurfaceCreated?
Looks like you need to prepare both programs in your onSurfaceCreated function. I'll try to illustrate that with sample code; please organize it more carefully in your own project:
// onSurfaceCreated function:
glCompileShader(/*shader1 for prog1*/);
glCompileShader(/*shader2 for prog1*/);
//...
glCompileShader(/*shadern for prog1*/);
glCompileShader(/*shader1 for prog2*/);
glCompileShader(/*shader2 for prog2*/);
//...
glCompileShader(/*shadern for prog2*/);
glLinkProgram(/*prog1*/);
glLinkProgram(/*prog2*/);
u1 = glGetUniformLocation(/*uniform in prog1*/);
u2 = glGetUniformLocation(/*uniform in prog2*/);
// onDrawFrame
if (I_need_prog1_condition) {
    glUseProgram(prog1);
    glUniform(/*set uniform using u1*/);
} else {
    glUseProgram(prog2);
    glUniform(/*set uniform using u2*/);
}
If you want to use the same set of uniforms from different programs (like in the code above), there is a more elegant and up-to-date solution: uniform buffer objects! For example, you can create a buffer object with all the variables that any of your shaders may need, while each shader program uses only a subset of them. Moreover, you can determine unneeded (optimized-out) uniforms using glGetActiveUniform.
Also please note that the title of your question is a bit misleading. It looks like you want to choose an execution branch not in your host code (i.e. the onDrawFrame function), but in your shader code. This approach is known as the uber-shader technique. There are lots of discussions about it on the Internet, like these:
http://www.gamedev.net/topic/659145-what-is-a-uber-shader/
http://www.shawnhargreaves.com/hlsl_fragments/hlsl_fragments.html
If you decide to go that way, remember that GPUs are not really good at handling if statements and other branching.

Does IASetInputLayout check to see if you pass an already set input layout?

I am designing a game engine in DirectX 11 and I have a question about the ID3D11DeviceContext::IASetInputLayout function. From what I can find in the documentation, there is no mention of what the function will do if you set an input layout that has previously been set. For context, if I were to do the following:
//this assumes dc is a valid ID3D11DeviceContext interface and that
//ia is a valid ID3D11InputLayout interface.
dc->IASetInputLayout(&ia);
//other program lines: drawing, setting vertex shaders/pixel shaders, etc.
dc->IASetInputLayout(&ia);
//continue execution
would this incur a performance penalty through device state switching, or would the runtime recognize the input layout as being identical to the one already set and return early?
While I also cannot find anything about whether the input layout is already set, you could get a pointer to the currently bound input layout by calling ID3D11DeviceContext::IAGetInputLayout, or do the check internally by keeping your own reference; that way you avoid a call to the ID3D11DeviceContext object entirely.
As far as I know, it should detect that there are no changes, and so the call should be ignored. But it can easily be tested: just call this method 10000 times each frame and see how bad the FPS drop is :)
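The "keep your own reference" idea can be sketched with a thin wrapper that skips the API call when the same layout is already bound. `DeviceContext` and `InputLayout` here are stand-ins for the real ID3D11DeviceContext and ID3D11InputLayout interfaces:

```cpp
// Stand-in for ID3D11InputLayout.
struct InputLayout {};

// Stand-in for ID3D11DeviceContext, instrumented to count set calls
// (the counter exists only for this sketch).
struct DeviceContext {
    int setCalls = 0;
    void IASetInputLayout(InputLayout*) { ++setCalls; }
};

// Thin caching wrapper: forwards the call only when the layout changes.
class StateCache {
public:
    explicit StateCache(DeviceContext* dc) : dc_(dc) {}

    void setInputLayout(InputLayout* il) {
        if (il == current_) return;   // redundant: skip the API call
        dc_->IASetInputLayout(il);
        current_ = il;
    }

private:
    DeviceContext* dc_;
    InputLayout* current_ = nullptr;
};
```

The same pattern extends to shaders, blend states, and other frequently re-set pipeline state.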

OPENGL ARB_occlusion_query Occlusion Culling

for (int i = 0; i < Number_Of_queries; i++)
{
    glBeginQueryARB(GL_SAMPLES_PASSED_ARB, queries[i]);
    Box[i]
    glEndQueryARB(GL_SAMPLES_PASSED_ARB);
}
I'm curious about the method suggested in GPU Gems 1 for occlusion culling, where a certain number of queries are performed. Using the method described, you can't test individual boxes against each other, so are you supposed to do the following?
Test Box A -> Render Box A
Test Box B -> Render Box B
Test Box C -> Render Box C
and so on...
I'm not sure if I understand you correctly, but isn't this one of the drawbacks of the naive implementation: first rendering all boxes (without writing to the depth buffer) and then using the query results to check every object? Your suggestion to use the query result of a single box immediately is an even more naive approach, as it stalls the pipeline. If you read that chapter (assuming you refer to chapter 29) further, they present a simple technique to overcome the disadvantages of both naive approaches: just render everything normally and use the query results from the previous frame.
I think (it would have been good to link the GPU Gems article...) you are confused about asynchronous queries as described in extensions like this one:
http://developer.download.nvidia.com/opengl/specs/GL_NV_conditional_render.txt
If I recall correctly, there were also other extensions to check for the availability of a result without blocking.
As Christian Rau points out, just doing "query, wait for result, do stuff based on result" might stall the pipeline and yield no gain at all, depending on how much work is in "do stuff". In fact, doing the query and waiting for it to round-trip just to save a single draw call is most likely not going to help at all.
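The previous-frame technique from GPU Gems chapter 29 can be sketched as follows. The gl* calls are left as comments so the pattern itself stays visible; real code would use glBeginQuery(GL_SAMPLES_PASSED, ...) and glGetQueryObjectuiv, and the `Object` fields here are assumptions of this sketch:

```cpp
#include <vector>

// Per-object state: last frame's visibility drives this frame's rendering,
// and this frame's query feeds next frame's decision - so the CPU never
// blocks waiting on a query result.
struct Object {
    bool lastFrameVisible = true;   // assume visible on the first frame
    unsigned samplesPassed = 0;     // filled in by the (mock) query readback
};

void drawFrame(std::vector<Object>& objects) {
    for (Object& o : objects) {
        // glBeginQuery(GL_SAMPLES_PASSED, o.query);
        if (o.lastFrameVisible) {
            // Draw the full object normally (also feeds the query).
        } else {
            // Draw only the bounding box, with color/depth writes disabled.
        }
        // glEndQuery(GL_SAMPLES_PASSED);
    }
}

// Called at the start of the NEXT frame, when the results are ready
// without stalling:
void collectResults(std::vector<Object>& objects) {
    for (Object& o : objects) {
        // glGetQueryObjectuiv(o.query, GL_QUERY_RESULT, &o.samplesPassed);
        o.lastFrameVisible = o.samplesPassed > 0;
    }
}
```

The cost is one frame of latency in visibility decisions, which in practice only causes a briefly late pop-in for objects that just became visible.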