So I am implementing terrain, and at run-time applying different generation techniques such as fault line and diamond-square.
I have subgrids that are responsible for culling sections out, and when performing one of the algorithms I loop through the sections, map the vertex buffer, update the positions, and finally unmap.
I notice that if I press the keyboard input that triggers the algorithm twice in quick succession, parts of the mesh disappear and end up with weird values, as if they were uninitialised. I worked around this by putting a small timer around the call, which on my laptop needs to be about 0.4 seconds.
I put my project onto a university computer and was able to take the timer down to almost zero wait, and the algorithm worked perfectly without any loss of data.
So my question is: can anyone give me more information about mapping and unmapping, and whether my assumption is right that my laptop's GPU is simply slower to apply the changes?
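For reference, the per-section update looks roughly like this (a simplified sketch assuming an OpenGL-style glMapBuffer/glUnmapBuffer path; updateSectionHeights is a placeholder for the fault-line / diamond-square pass):

for (TerrainSection& section : sections) {
    glBindBuffer(GL_ARRAY_BUFFER, section.vbo);
    // Mapping gives the CPU a pointer into the buffer; the driver may have to
    // wait (or hand back unsynchronised memory) if the GPU is still using it.
    void* ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    if (ptr) {
        updateSectionHeights(static_cast<Vertex*>(ptr), section);
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
}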
Thanks.
I am making a simple STG (shoot-'em-up) engine with OpenGL (to be exact, with LWJGL3). In this game there can be several different types of items (called bullets) in one frame, and each type can have 10-20 instances. I hope to find an efficient way to render them.
I have read some books about modern OpenGL and found a technique called "instanced rendering", but it seems to only work with identical instances. Should I use a for-loop to draw all items directly in my case?
Another question is about memory. Should I create a new VBO each frame, since the number of items is always changing?
Not the easiest question to answer, but I'll try my best anyway.
An important property of OpenGL is that the OpenGL context is always bound to a single thread, so every OpenGL call has to be made from that thread. A common way of dealing with this is queuing.
Example:
We are using Model-View-Controller architecture.
We have 3 threads; One to read input, one to handle received messages and one to render the scene.
Here OpenGL context is bound to rendering thread.
The first thread receives a message "Add model to position x". The first thread has no time to handle the message, because there might be another message coming right after and we don't want to delay it. So we hand this message to the second thread by adding it to the second thread's queue.
The second thread reads the message and performs as much of the work as it can before the OpenGL context is required, for example reading the Wavefront (.obj) file and building arrays from the parsed data.
The second thread then queues this data for the OpenGL thread to handle. The OpenGL thread generates the VBOs and VAO and stores the data there.
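A minimal sketch of such a queue (hypothetical names, shown here in C++ since the idea is language-agnostic; in LWJGL you would do the same with something like a ConcurrentLinkedQueue of Runnables):

#include <functional>
#include <mutex>
#include <queue>

// Tasks that need the OpenGL context are pushed here by worker threads
// and drained once per frame on the thread that owns the context.
class GLTaskQueue {
public:
    void push(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push(std::move(task));
    }
    // Call this only on the OpenGL thread, e.g. at the start of each frame.
    void drain() {
        std::queue<std::function<void()>> pending;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            std::swap(pending, tasks_);
        }
        while (!pending.empty()) {
            pending.front()();   // e.g. create VBOs/VAOs and upload data
            pending.pop();
        }
    }
private:
    std::mutex mutex_;
    std::queue<std::function<void()>> tasks_;
};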
Back to your question
OpenGL-generated objects stay in the context's memory until they are manually deleted or the context is destroyed. So it works a bit like C, where you have to manually allocate memory and free it once it's no longer used. So you should not create new objects for each frame, but reuse the data that stays unchanged. Also, when you have multiple objects that use the same model or texture, you should load that model only once and apply all object-specific differences in the shaders.
Example:
You have an environment with 10 rocks that all share the same rock model.
You load the data, store it in VBOs and attach those VBOs into a VAO. So now you have a VAO defining a rock.
You generate 10 rock entities that all have a position, rotation and scale. When rendering, you first bind the shader, then bind the model and texture, then loop through the rock entities, and for each entity you bind that entity's position, rotation and scale (usually combined into a transformationMatrix) and render.
bind shader
load values to shader's uniform variables that don't change between entities.
bind model and texture (as those stay the same for each rock)
for(each rock in rocks){
load values to shader's uniform variables that do change between each rock, like the transformation.
render
}
unbind shader
Note: You don't need to unbind/bind the shader each frame if you only use one shader. The same goes for VAOs and every other OpenGL object, so bindings persist across rendering cycles.
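In plain OpenGL calls that loop might look something like this (a sketch with made-up names such as rockShader, rockVAO and the uniform locations; not tied to any particular engine):

glUseProgram(rockShader);
// Uniforms that are the same for every rock, e.g. the view/projection matrix.
glUniformMatrix4fv(viewProjLoc, 1, GL_FALSE, &viewProj[0][0]);

glBindVertexArray(rockVAO);
glBindTexture(GL_TEXTURE_2D, rockTexture);

for (const Rock& rock : rocks) {
    // Per-entity data: the transformation built from position/rotation/scale.
    glUniformMatrix4fv(modelLoc, 1, GL_FALSE, &rock.transform[0][0]);
    glDrawElements(GL_TRIANGLES, rockIndexCount, GL_UNSIGNED_INT, nullptr);
}

glBindVertexArray(0);
glUseProgram(0);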
Hope this helps you get started, although I would recommend a tutorial that has a bit more context to it.
I have read some books about modern OpenGL and found a technique called "instanced rendering", but it seems to only work with identical instances. Should I use a for-loop to draw all items directly in my case?
Another question is about memory. Should I create a new VBO each frame, since the number of items is always changing?
These both depend on the number of bullets you plan on having. If you think you will have fewer than a thousand bullets, you can almost certainly rebuild and upload a VBO with all of them each frame and your end users will not notice. If you plan on some obscene amount, then don't do this.
I would say that you should write everything each frame because it's the simplest to do right now, and if you start noticing performance issues then you need to look into instancing or some other method. When you get to "later" you should be more comfortable with OpenGL and find out ways to optimize it that won't be over your head (not saying it is over your head right now, but more experience can only help make it less complex later on).
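A minimal sketch of that "rewrite everything each frame" approach (hypothetical names; buildBulletVertices and floatsPerVertex stand in for whatever your bullet format is, and GL_STREAM_DRAW tells the driver the data is replaced every frame):

// Once at startup:
GLuint bulletVBO;
glGenBuffers(1, &bulletVBO);

// Every frame: gather the current bullets into a CPU-side array and re-upload.
// Re-specifying the store with glBufferData also "orphans" the old data, so the
// driver doesn't have to stall if the GPU is still reading last frame's buffer.
std::vector<float> vertices = buildBulletVertices(bullets);
glBindBuffer(GL_ARRAY_BUFFER, bulletVBO);
glBufferData(GL_ARRAY_BUFFER,
             vertices.size() * sizeof(float),
             vertices.data(),
             GL_STREAM_DRAW);
// ... with the vertex attributes/VAO set up once beforehand:
glDrawArrays(GL_TRIANGLES, 0, static_cast<GLsizei>(vertices.size() / floatsPerVertex));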
Culling bullets that are not on the screen should also be on your radar.
If you plan on having a ridiculous number of bullets on screen, then you should say so and we can talk about more advanced methods. However, my guess is that if you ever reach that limit on today's hardware, then you either have a large, ambitious game with a zoomed-out camera and a significant number of entities on screen, or you are zoomed in and likely have a mess on your screen anyway.
20 objects is nothing. Your program will be plenty fast no matter how you draw them.
When you have 10000 objects, then you'll want to ask for an efficient way.
Until then, draw them whichever way is most convenient. This probably means a separate draw call per object.
I am in the process of trying to write a Monte Carlo Path tracer using Metal. I have the entire pipeline working (almost) correctly. I have some weird banding issues, but that seems like it has more to do with my path tracing logic than with Metal.
For people who may not have experience with path tracers: it generates rays from the camera, bounces them around the scene in random directions up to a given depth (in my case, 8), and at every intersection/bounce it shades the ray with that material's color. The end goal is for it to "converge" to a very nice and clean image by averaging out many iterations over and over again. In my code, the Metal compute pipeline is run over and over again, with each run of the pipeline representing one iteration.
The way I've structured my compute pipeline is using the following stages:
Generate Rays
Loop 8 times (i.e. bounce the ray around 8 times):
1 - Compute Ray Intersections and Bounce it off that intersection (recalculate direction)
2 - "Shade" the ray's color based on that intersection
Get the average of all iterations by reading the current texture buffer's color, multiplying it by the iteration count, adding the ray's color to it, and dividing by iteration+1. Then store that new combined_color in the exact same buffer location.
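In other words, each pixel keeps a running average; per iteration the update is conceptually just this (a C++-style sketch of the arithmetic that lives in the Metal kernel; Color is a hypothetical vector type with component-wise operators, and iteration starts at 0):

Color combined = accumulation[pixelIndex];                          // average so far
combined = (combined * float(iteration) + rayColor) / float(iteration + 1);
accumulation[pixelIndex] = combined;                                // write back in place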
So on an even higher level what my entire Renderer does is:
1 - Do a bunch of ray calculations in a compute shader
2 - Update Buffer (which is the MTKView's drawable)
The problem is that for some reason my texture cycles between 3 different levels of color accumulation and keeps glitching between different colors, almost as if there are three different programs trying to write to the same buffer. This can't be due to race conditions because we're reading from and writing to the same buffer location, right? How can this be happening?
Here is my system trace of the first few iterations:
As you can see, for the first few iterations, it doesn't render anything for some reason, and they're super fast. I have no clue why that is. Then later, the iterations are very slow. Here is a close-up of that first part:
I've tried outputting just a single iteration's color every time, and it looks perfectly fine; the picture just doesn't converge to a clean image (which is what happens when multiple iterations are averaged).
I've tried using semaphores to synchronize things, but all I end up with is a stalled program because I keep waiting for the command buffer and for some reason it is never ready. I think I may just not be getting semaphores. I've tried looking things up and I seem to be doing it right.
Help.. I've been working on this bug for two days. I can't fix it. I've tried everything. I just do not know enough to even begin to discern the problem. Here is a link to the project. The system trace file can be found under System Traces/NaiveIntegrator.trace. I hate to just paste in my code, and I understand this is not recommended at SO, but the problem is I just have no idea where the bug could be. I promise that as soon as the problem is fixed I'll paste the relevant code snippets here.
If you run the code you will see a simple cornell box with a yellow sphere in the middle. If you leave the code running for a while you'll see that the third image in the cycle eventually converges to a decent image (ignoring the banding effect on the floor that is probably irrelevant). But the problem is that the image keeps flickering between 3 different images.
I'm wondering how, e.g., graphics (/game) engines do their job with lots of heterogeneous data, while a customized simple rendering loop turns into a nightmare as soon as you have some small changes.
Example:
First, let's say we have some blocks in our scene.
Graphic-Engine: create cubes and move them
Customized: create a cube template with vertices, normals, etc., copy and translate it to each position, and copy everything into e.g. a VBO. One glDraw* call does the job.
Second, some weird logic. We want blocks 1, 4, 7, ... to rotate on the x-axis, 2, 5, 8, ... on the y-axis and 3, 6, 9, ... on the z-axis, with a rotation speed proportional to the camera distance.
Graphic-Engine: manipulate each object's matrix and it works.
Customized: (I think) a per-object glDraw* call with a changing model-matrix uniform is not a good idea, so a transformation matrix should be something like a per-instance attribute? I'd have to update them every frame.
Third, a block should disappear if the distance to the camera is lower than any const value Q.
Graphic-Engine: if (object.distance(camera) < Q) scene.drop(object);
Customized: (I think) our VBO is invalid and we have to recreate it?
Again, back to the very first sentence: it feels like engines do those manipulations for free, while we have to rethink how to provide and update data. And while we do so, the engine might just say (though I don't actually know): 'update whatever you want, at least I'm going to send all the matrices'.
Another Example: What about a voxel-based world (e.g. Minecraft) where we only draw the visible surface, and we are able to throw a bomb and destroy many voxels. If the world's view data is in one huge buffer we only have one glDraw*-call but have to recreate the buffer every time then. If there are smaller chunks, we have many glDraw*-calls and also have to manipulate buffers, which are smaller.
So is it a good deal to send, let's say, 10 MB of buffer update data instead of 2 gl* calls with 1 MB? How many updates are okay? Should a rendering loop deal with lazy updates?
I'm searching for a guide on what a 60 fps application should be able to update/draw per frame, to get a feeling for what is possible. In my tests, every optimization attempt turns out to be another bottleneck.
And I don't want those tutorials that say: hey, there is a new cool gl*Instance call which is super fast, buuuuut you have to check if your GPU supports it. Well, I'd also rather consider that an optimization than a meaningful implementation at first.
Do you have any ideas, sources, best practices or rules of thumb for how a rendering/updating routine should best play together?
My questions are all nearly the same:
How many updates per frame are okay on today's hardware?
Can I lazy-load data to have it available after a few frames, but without freezing my application?
Do I have to do small updates and profile my loop if there are some microseconds left till the next rendering?
Maybe I should implement a real-time profiler that builds up a sense over time of how expensive updates are, and can determine the number of updates per frame?
Thank you.
It's unclear how any of your questions relate to your "graphics engine" vs "customized" examples. All the updates you do with a "graphics engine" are translated into those OpenGL calls in the end.
In brief:
How many updates per frame are okay on today's hardware?
Today's PCIe bandwidth is huge (it can go as high as 30 GB/s). However, to utilize it fully you have to reduce the number of transactions by consolidating OpenGL calls. The exact number of updates depends entirely on the hardware, the drivers, and the way you use them, and graphics hardware is diverse.
This is the kind of answer you didn't want to hear, but unfortunately you have to face the truth: to reduce the number of OpenGL calls you have to use the newer API versions. E.g. instead of setting each uniform individually, you are better off submitting a bunch of them through uniform buffer objects. Instead of submitting each model's MVP individually, it's better to use instanced rendering. And so on.
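For example, a uniform buffer object lets you hand the driver a whole block of per-frame values in one call (a sketch; the block layout and names are made up):

// GLSL side:
// layout(std140, binding = 0) uniform FrameData {
//     mat4 viewProj;
//     vec4 lightDir;
// };

struct FrameData { float viewProj[16]; float lightDir[4]; };

GLuint ubo;
glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, sizeof(FrameData), nullptr, GL_DYNAMIC_DRAW);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);   // binding point 0

// Each frame: one upload instead of many individual glUniform* calls.
FrameData frame = gatherFrameData();           // hypothetical helper
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(frame), &frame);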
An even more radical approach would be to move to a lower-level (and newer) API, i.e. Vulkan, which aims to solve exactly this problem: the cost of submitting work to the GPU.
Can I lazy-load data to have it available after a few frames, but without freezing my application?
Yes, you can upload buffer objects asynchronously. See Buffer Object Streaming for details.
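The usual trick there is buffer orphaning (or persistently mapped buffers); a minimal orphaning sketch with hypothetical names:

// Stream new contents into an existing buffer without stalling the pipeline:
// re-specify ("orphan") the storage, then write into the fresh memory.
glBindBuffer(GL_ARRAY_BUFFER, streamVBO);
glBufferData(GL_ARRAY_BUFFER, size, nullptr, GL_STREAM_DRAW);   // orphan old storage
void* dst = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
if (dst) {
    memcpy(dst, newData, size);   // newData can be prepared on another thread
    glUnmapBuffer(GL_ARRAY_BUFFER);
}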
Do I have to do small updates and profile my loop if there are some microseconds left till next rendering?
Maybe I should implement a real-time profiler which gets a feeling over time, how expensive updates are and can determine the amount of updates per frame?
You don't need any of these if you do it asynchronously.
Voxel engine (like Minecraft) optimization suggestions?
As a fun project (and to get my Minecraft-addicted son excited about programming) I am building a 3D Minecraft-like voxel engine using C# .NET 4.5.1, OpenGL and GLSL 4.x.
Right now my world is built using chunks. Chunks are stored in a dictionary, where I can select them based on a 64-bit (X | Z<<32) key. This allows me to create an 'infinite' world that can cache chunks in and out.
Every chunk consists of an array of 16x16x16 block segments. Starting from level 0, bedrock, it can go as high as you want (unlike Minecraft, where the limit is 256, I think).
Chunks are queued for generation on a separate thread when they come in view and need to be rendered. This means that chunks might not show right away. In practice you will not notice this. NOTE: I am not waiting for them to be generated, they will just not be visible immediately.
When a chunk needs to be rendered for the first time, a VBO (glGenBuffers, GL_STREAM_DRAW, etc.) for that chunk is generated containing the possibly visible/outside faces (neighboring chunks are checked as well). [This means that a chunk potentially needs to be re-tessellated when a neighbor has been modified.] When tessellating, first the opaque faces are tessellated for every segment and then the transparent ones. Every segment knows where it starts within that vertex array and how many vertices it has, both for opaque faces and transparent faces.
Textures are taken from an array texture.
When rendering:
I first take the bounding box of the frustum and map that onto the chunk grid. Using that knowledge I pick every chunk that is within the frustum and within a certain distance of the camera.
Now I do a distance sort on the chunks.
After that I determine the ranges (index, length) of the chunks-segments that are actually visible. NOW I know exactly what segments (and what vertex ranges) are 'at least partially' in view. The only excess segments that I have are the ones that are hidden behind mountains or 'sometimes' deep underground.
Then I start rendering ... first I render the opaque faces [culling and depth test enabled, alpha test and blend disabled] front to back using the known vertex ranges. Then I render the transparent faces back to front [blend enabled]
Now... does anyone know a way of improving this while still allowing dynamic generation of an infinite world? I am currently reaching ~80 fps @ 1920x1080 and ~120 fps @ 1024x768 (screenshots: http://i.stack.imgur.com/t4k30.jpg, http://i.stack.imgur.com/prV8X.jpg) on an average 2.2 GHz i7 laptop with an ATI HD8600M graphics card. I think it must be possible to increase the number of frames, and I think I have to, as I want to add entity AI, sound, and bump and specular mapping. Could using occlusion queries help me out? ...which I can't really picture working, given the nature of the segments. I have already minimized the creation of objects, so there is no 'new Object' all over the place. Also, as the performance doesn't really change between Debug and Release mode, I don't think it's the code but rather the approach to the problem.
Edit: I have been thinking of using GL_SAMPLE_ALPHA_TO_COVERAGE, but it doesn't seem to work:
gl.Enable(GL.DEPTH_TEST);
gl.Enable(GL.BLEND); // gl.Disable(GL.BLEND);
gl.Enable(GL.MULTI_SAMPLE);
gl.Enable(GL.SAMPLE_ALPHA_TO_COVERAGE);
To render a lot of similar objects, I strongly suggest you take a look at instanced drawing: glDrawArraysInstanced and/or glDrawElementsInstanced.
It made a huge difference for me. I'm talking going from 2 fps to over 60 fps rendering 100000 similar icosahedrons.
You can parametrize your cubes using attributes (glVertexAttribDivisor and friends) to make them different. Hope this helps.
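A rough sketch of what that setup looks like (hypothetical attribute locations; a mat4 per-instance attribute occupies four consecutive locations):

// One VBO holds the per-instance model matrices, refilled when the cubes change.
glBindBuffer(GL_ARRAY_BUFFER, instanceVBO);
glBufferData(GL_ARRAY_BUFFER, instanceCount * 16 * sizeof(float),
             instanceMatrices, GL_DYNAMIC_DRAW);

// A mat4 attribute is declared as 4 vec4 columns, each advancing once per instance.
for (int i = 0; i < 4; ++i) {
    glEnableVertexAttribArray(3 + i);                 // locations 3..6 as an example
    glVertexAttribPointer(3 + i, 4, GL_FLOAT, GL_FALSE,
                          16 * sizeof(float),
                          (void*)(sizeof(float) * 4 * i));
    glVertexAttribDivisor(3 + i, 1);                  // advance per instance, not per vertex
}

glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT,
                        nullptr, instanceCount);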
It's at ~200 fps now, which should be OK. The 3 main things that I've done are:
1) generating the chunks on a separate thread.
2) tessellating the chunks on a separate thread.
3) using a deferred rendering pipeline.
I don't really think the last one contributed much to the overall performance, but I had to start using it because of some of the shaders. Now the CPU is sort of falling asleep at ~11%.
This question is pretty old, but I'm working on a similar project. I approached it almost exactly the same way as you, however I added in one additional optimization that helped out a lot.
For each chunk, I determine which sides are completely opaque. I then use that information to do a flood fill through the chunks to cull out the ones that are underground. Note, I'm not checking individual blocks when I do the flood fill, only a precomputed bitmask for each chunk.
When I'm computing the bitmask, I also check to see if the chunk is entirely empty, since empty chunks can obviously be ignored.
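In case it helps anyone following along, the chunk-level flood fill is roughly the following (a sketch with hypothetical types and helpers; each chunk only exposes a precomputed "is this face completely opaque" mask, so no individual blocks are touched here):

#include <queue>
#include <unordered_set>
#include <vector>

// Breadth-first flood fill over the chunk grid, starting at the camera's chunk.
// It never crosses a face the current chunk reports as fully opaque, so chunks
// that are sealed off underground are simply never reached (and never rendered).
void collectVisibleChunks(const ChunkPos& cameraChunk,
                          std::vector<ChunkPos>& outVisible) {
    std::queue<ChunkPos> frontier;
    std::unordered_set<ChunkPos, ChunkPosHash> seen;  // ChunkPosHash: hypothetical hasher
    frontier.push(cameraChunk);
    seen.insert(cameraChunk);
    while (!frontier.empty()) {
        ChunkPos current = frontier.front();
        frontier.pop();
        outVisible.push_back(current);
        for (int face = 0; face < 6; ++face) {
            if (chunkAt(current).faceFullyOpaque(face)) continue; // can't see through
            ChunkPos next = neighbor(current, face);              // hypothetical helper
            if (seen.count(next) || !inRenderDistance(next)) continue;
            seen.insert(next);
            frontier.push(next);
        }
    }
}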
I am programming a game using Visual C++ 2008 Express and the Ogre3D sdk.
My core gameplay logic is designed to run at 100 times/second. For simplicity, I'll say it's a method called 'gamelogic()'. It is not time-based, which means if I want to "advance" game time by 1 second, I have to call 'gamelogic()' 100 times. 'gamelogic()' is lightweight in comparison to the game's screen rendering.
Ogre has a "listener" logic that informs your code when it's about to draw a frame and when it has finished drawing a frame. If I just call 'gamelogic()' just before the frame rendering, then the gameplay will be greatly affected by screen rendering speed, which could vary from 5fps to 120 fps.
The easy solution that comes to mind is: calculate the time elapsed since the last rendered frame and call 'gamelogic()' this many times before the next frame: 100 * timeElapsedInSeconds.
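In other words, roughly the classic fixed-timestep accumulator (a sketch with hypothetical names; the accumulator carries leftover time over to the next frame):

const float step = 1.0f / 100.0f;   // one gamelogic() tick
float accumulator = 0.0f;

// Called once per rendered frame with the elapsed time since the last frame.
void advanceGame(float frameTimeSeconds) {
    accumulator += frameTimeSeconds;
    while (accumulator >= step) {
        gamelogic();
        accumulator -= step;
    }
}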
However, I presume that the "right" way to do it is with multithreading: have a separate thread that runs 'gamelogic()' 100 times per second.
The question is, how do I achieve this, and what can be done when there is a conflict between the two separate threads: gamelogic changing screen content (3D object coordinates) while Ogre is rendering the screen at the same time?
Many thanks in advance.
If this is your first game application, using multi-threading to achieve your results might be more work than you should really tackle on your first game. Synchronizing a game loop and a render loop in different threads is not an easy problem to solve.
As you correctly point out, rendering time can greatly affect the "speed" of your game. I would suggest that you do not make your game logic dependent on a set time slice (i.e. 1/100 of a second). Make it dependent on the current frametime (well, the last frametime since you don't know how long your current frame will take to render).
Typically I would write something like below (what I wrote is greatly simplified):
float Frametime = 1.0f / 30.0f;
while(1) {
    game_loop(Frametime); // manipulate objects, etc.
    render_loop();        // render the frame
    calculate_new_frametime();
}
Where Frametime is the calculated time that the current frame took. When you process your game loop you are using the frametime from the previous frame (so set the initial value to something reasonable, like 1/30th or 1/15th of a second). Running it on the previous frametime is close enough to get you the results that you need. Run your game loop using that time step, then render your stuff. You might have to change the logic in your game loop to not assume a fixed time interval, but generally those kinds of fixes are pretty easy.
Asynchronous game/render loops may be something that you ultimately need, but that is a tough problem to solve. It involves taking snapshots of objects and their relevant data, putting those snapshots into a buffer and then passing the buffer to the rendering engine. That memory buffer will have to be correctly partitioned around critical sections to avoid having the game loop write to it while the render loop is reading from it. You'll have to take care to make sure that you copy all relevant data into the buffer before passing it to the render loop. Additionally, you'll have to write logic to stall either the game or render loop while waiting for the other to complete.
This complexity is why I suggest writing it in a more serial manner first (unless you have experience, which you might). The reason is that doing it the "easy" way first will force you to learn how your code works, how the rendering engine works, what kind of data the rendering engine needs, and so on. Multithreading knowledge is definitely required in complex game development these days, but knowing how to do it well requires in-depth knowledge of how game systems interact with each other.
There's not a whole lot of benefit to your core game logic running faster than the player can respond. About the only time it's really useful is for physics simulations, where running at a fast, fixed time step can make the sim behave more consistently.
Apart from that, just update your game loop once per frame, and pass in a variable time delta instead of relying on the fixed one. The benefit you'll get from doing multithreading is minimal compared to the cost, especially if this is your first game.
Double buffering your renderable objects is an approach you could explore: the rendering component reads from one buffer, which is only updated once all game actions have updated the relevant objects in the second buffer.
But personally I don't like it, I'd (and have, frequently) employ Mark's approach.