I am in the process of trying to write a Monte Carlo Path tracer using Metal. I have the entire pipeline working (almost) correctly. I have some weird banding issues, but that seems like it has more to do with my path tracing logic than with Metal.
For people who may not have experience with path tracers: the renderer generates rays from the camera, bounces them around the scene in random directions up to a given depth (in my case, 8), and at every intersection/bounce it shades the ray with that material's color. The end goal is for the image to "converge" to a very nice and clean result by averaging out many iterations over and over again. In my code, my Metal compute pipeline is run over and over again, with each run of the pipeline representing one iteration.
The way I've structured my compute pipeline is using the following stages:
Generate Rays
Looping 8 times (i.e. bounce the ray around 8 times):
1 - Compute Ray Intersections and Bounce it off that intersection (recalculate direction)
2 - "Shade" the ray's color based on that intersection
Get the average of all iterations by taking the current texture buffer's color, multiplying it by the iteration count, adding the ray's color to it, and then dividing by iteration + 1. Then store that new combined color in the same exact buffer location.
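In other words, the per-pixel update is a running average, roughly this (plain C++ just to show the arithmetic; the names are illustrative, not my actual kernel code):

// After iteration i (0-based) the stored value is the mean of samples 0..i,
// so no history of past samples needs to be kept.
struct Color { float r, g, b; };

Color accumulate(Color combined, Color sample, unsigned iteration)
{
    const float n = static_cast<float>(iteration);
    return { (combined.r * n + sample.r) / (n + 1.0f),
             (combined.g * n + sample.g) / (n + 1.0f),
             (combined.b * n + sample.b) / (n + 1.0f) };
}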
So on an even higher level what my entire Renderer does is:
1 - Do a bunch of ray calculations in a compute shader
2 - Update Buffer (which is the MTKView's drawable)
The problem is that, for some reason, my texture cycles between 3 different levels of color accumulation and keeps glitching between different colors, almost as if there were three different programs trying to write to the same buffer. This can't be due to race conditions, because we're reading and writing from the same buffer location, right? How can this be happening?
Here is my system trace of the first few iterations:
As you can see, for the first few iterations, it doesn't render anything for some reason, and they're super fast. I have no clue why that is. Then later, the iterations are very slow. Here is a close-up of that first part:
I've tried outputting just a single iteration's color every time, and it looks perfectly fine; of course, the picture then doesn't converge to a clean image (which is what happens only after multiple iterations are averaged).
I've tried using semaphores to synchronize things, but all I end up with is a stalled program because I keep waiting for the command buffer and for some reason it is never ready. I think I may just not be getting semaphores. I've tried looking things up and I seem to be doing it right.
Help... I've been working on this bug for two days and I can't fix it. I've tried everything, but I just do not know enough to even begin to discern the problem. Here is a link to the project. The system trace file can be found under System Traces/NaiveIntegrator.trace. I hate to just link to my code, and I understand this is not recommended on SO, but the problem is I have no idea where the bug could be. I promise that as soon as the problem is fixed I'll paste the relevant code snippets here.
If you run the code you will see a simple Cornell box with a yellow sphere in the middle. If you leave the code running for a while you'll see that the third image in the cycle eventually converges to a decent image (ignoring the banding effect on the floor, which is probably unrelated). But the problem is that the image keeps flickering between 3 different images.
Related
So I am implementing terrain and, at run-time, applying different generation techniques such as fault line and diamond-square.
I have subgrids that are responsible for culling sections out. Basically, when performing one of the algorithms I loop through the sections, map the vertex buffer, update the positions, and finally unmap.
I notice that if I press the keyboard input that triggers the algorithm twice in quick succession, parts of the mesh disappear and end up with weird values, as if they were uninitialised. I solved this by putting a small timer around the call, which on my laptop is 0.4 s.
I put my project onto a uni computer and was able to take the timer right down to almost zero wait, and the algorithm worked perfectly without any loss of data.
So my question is: can anyone give me info about mapping and unmapping, and whether my assumption is right that my (presumably slower) GPU simply hasn't applied the changes yet?
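For reference, each section update boils down to roughly this (a simplified sketch, OpenGL-style purely as an illustration of the mapping pattern; the names are placeholders, not my actual code):

#include <GL/glew.h>
#include <cstring>
#include <vector>

// Map the section's vertex buffer, overwrite the positions, unmap.
void UpdateSection(GLuint sectionVbo, const std::vector<float>& newPositions)
{
    glBindBuffer(GL_ARRAY_BUFFER, sectionVbo);
    void* ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    if (ptr)
    {
        std::memcpy(ptr, newPositions.data(), newPositions.size() * sizeof(float));
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
}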
Thanks.
Voxel engine (like Minecraft) optimization suggestions?
As a fun project (and to get my Minecraft-addicted son excited about programming) I am building a 3D Minecraft-like voxel engine using C# .NET 4.5.1, OpenGL and GLSL 4.x.
Right now my world is built using chunks. Chunks are stored in a dictionary, where I can look them up by a 64-bit X | Z<<32 key. This allows me to create an 'infinite' world that can cache chunks in and out.
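The key packing itself is just bit arithmetic, something like this (shown C++-style for brevity; the C# version is the same apart from syntax):

#include <cstdint>

// Pack chunk coordinates into the 64-bit X | Z<<32 key described above.
// The uint32_t casts keep negative coordinates from sign-extending into the
// other half of the key.
uint64_t ChunkKey(int32_t x, int32_t z)
{
    return static_cast<uint64_t>(static_cast<uint32_t>(x)) |
           (static_cast<uint64_t>(static_cast<uint32_t>(z)) << 32);
}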
Every chunk consists of an array of 16x16x16 block segments. Starting from level 0, bedrock, it can go as high as you want (unlike Minecraft, where the limit is 256, I think).
Chunks are queued for generation on a separate thread when they come in view and need to be rendered. This means that chunks might not show right away. In practice you will not notice this. NOTE: I am not waiting for them to be generated, they will just not be visible immediately.
When a chunk needs to be rendered for the first time, a VBO (glGenBuffers, GL_STREAM_DRAW, etc.) is generated for that chunk containing the possibly visible/outside faces (neighboring chunks are checked as well). [This means that a chunk potentially needs to be re-tessellated when a neighbor has been modified.] When tessellating, the opaque faces of every segment are tessellated first and then the transparent ones. Every segment knows where it starts within that vertex array and how many vertices it has, both for opaque faces and transparent faces.
Textures are taken from an array texture.
When rendering:
I first take the bounding box of the frustum and map that onto the chunk grid. Using that knowledge I pick every chunk that is within the frustum and within a certain distance of the camera.
Now I do a distance sort on the chunks.
After that I determine the ranges (index, length) of the chunks-segments that are actually visible. NOW I know exactly what segments (and what vertex ranges) are 'at least partially' in view. The only excess segments that I have are the ones that are hidden behind mountains or 'sometimes' deep underground.
Then I start rendering: first I render the opaque faces [culling and depth test enabled, alpha test and blend disabled] front to back using the known vertex ranges, then I render the transparent faces back to front [blend enabled], as sketched below.
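Roughly, the GL-call shape of those two passes (a C++ sketch; the ChunkDrawInfo type, the draw helpers and the blend function are illustrative, not my actual code):

#include <GL/glew.h>
#include <vector>

struct ChunkDrawInfo { /* VBO handle plus opaque/transparent vertex ranges */ };

// Hypothetical helpers: issue the draw calls over the vertex ranges selected above.
void DrawOpaqueRanges(const ChunkDrawInfo& chunk);
void DrawTransparentRanges(const ChunkDrawInfo& chunk);

void RenderChunks(const std::vector<ChunkDrawInfo>& frontToBack,
                  const std::vector<ChunkDrawInfo>& backToFront)
{
    // Pass 1: opaque faces, front to back (culling and depth test on, blend off).
    glEnable(GL_CULL_FACE);
    glEnable(GL_DEPTH_TEST);
    glDisable(GL_BLEND);
    for (const ChunkDrawInfo& chunk : frontToBack)
        DrawOpaqueRanges(chunk);

    // Pass 2: transparent faces, back to front (blend on).
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
    for (const ChunkDrawInfo& chunk : backToFront)
        DrawTransparentRanges(chunk);
}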
Now... does anyone know a way of improving this and still allowing dynamic generation of an infinite world? I am currently reaching ~80fps @ 1920x1080 and ~120fps @ 1024x768 (screenshots: http://i.stack.imgur.com/t4k30.jpg, http://i.stack.imgur.com/prV8X.jpg) on an average 2.2GHz i7 laptop with an ATI HD8600M gfx card. I think it must be possible to increase the number of frames, and I think I have to, as I want to add entity AI, sound, and do bump and specular mapping. Could using occlusion queries help me out? ... which I can't really imagine, based on the nature of the segments. I already minimized the creation of objects, so there is no 'new Object' all over the place. Also, as the performance doesn't really change between Debug and Release mode, I don't think it's the code but more the approach to the problem.
edit: I have been thinking of using GL_SAMPLE_ALPHA_TO_COVERAGE but it doesn't seem to be working?
gl.Enable(GL.DEPTH_TEST);
gl.Enable(GL.BLEND); // gl.Disable(GL.BLEND);
gl.Enable(GL.MULTI_SAMPLE);
gl.Enable(GL.SAMPLE_ALPHA_TO_COVERAGE);
To render a lot of similar objects, I strongly suggest you take a look at instanced drawing: glDrawArraysInstanced and/or glDrawElementsInstanced.
It made a huge difference for me. I'm talking about going from 2 fps to over 60 fps to render 100,000 similar icosahedrons.
You can parametrize your cubes by using attributes (glVertexAttribDivisor and friends) to make them different. Hope this helps.
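A minimal sketch of the setup (C++; the buffer names and attribute location 3 are illustrative):

#include <GL/glew.h>
#include <vector>

// Draw one cube mesh many times with a per-instance offset.
// cubeVao is assumed to already contain the cube's vertex data, and
// attribute location 3 is assumed to be free for the offset.
void DrawCubesInstanced(GLuint cubeVao, GLuint instanceVbo,
                        const std::vector<float>& offsets /* xyz per instance */)
{
    const GLsizei instanceCount = static_cast<GLsizei>(offsets.size() / 3);

    glBindVertexArray(cubeVao);

    // Upload one vec3 offset per instance into its own VBO.
    glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
    glBufferData(GL_ARRAY_BUFFER, offsets.size() * sizeof(float),
                 offsets.data(), GL_STREAM_DRAW);

    glEnableVertexAttribArray(3);
    glVertexAttribPointer(3, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), (void*)0);
    glVertexAttribDivisor(3, 1);   // advance this attribute once per instance, not per vertex

    // 36 vertices = 12 triangles = one cube; every instance drawn in a single call.
    glDrawArraysInstanced(GL_TRIANGLES, 0, 36, instanceCount);
}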
It's at ~200fps currently, which should be OK. The 3 main things that I've done are:
1) generating the chunks on a separate thread.
2) tessellating the chunks on a separate thread.
3) using a Deferred Rendering Pipeline.
I don't really think the last one contributed much to the overall performance, but I had to start using it because of some of the shaders. Now the CPU is sort of falling asleep at ~11%.
This question is pretty old, but I'm working on a similar project. I approached it almost exactly the same way as you, however I added in one additional optimization that helped out a lot.
For each chunk, I determine which sides are completely opaque. I then use that information to do a flood fill through the chunks to cull out the ones that are underground. Note, I'm not checking individual blocks when I do the flood fill, only a precomputed bitmask for each chunk.
When I'm computing the bitmask, I also check to see if the chunk is entirely empty, since empty chunks can obviously be ignored.
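A sketch of that chunk-level flood fill (C++ for brevity; the grid layout, face ordering and names are illustrative, not taken from my engine):

#include <array>
#include <cstdint>
#include <queue>
#include <vector>

// Face order used for the per-chunk bitmask: +X, -X, +Y, -Y, +Z, -Z.
struct ChunkInfo {
    uint8_t opaqueFaces = 0;   // bit i set => face i of the chunk is completely opaque
};

// Marks every chunk reachable from the camera's chunk without crossing a fully
// opaque face; unreachable chunks (e.g. fully underground) are never drawn.
std::vector<bool> FloodFillVisible(const std::vector<ChunkInfo>& chunks,
                                   int sx, int sy, int sz,   // grid size in chunks
                                   int cx, int cy, int cz)   // camera chunk coords
{
    auto index = [&](int x, int y, int z) { return (z * sy + y) * sx + x; };
    const int dir[6][3] = {{1,0,0},{-1,0,0},{0,1,0},{0,-1,0},{0,0,1},{0,0,-1}};

    std::vector<bool> reachable(chunks.size(), false);
    std::queue<std::array<int, 3>> open;
    reachable[index(cx, cy, cz)] = true;
    open.push({cx, cy, cz});

    while (!open.empty()) {
        auto [x, y, z] = open.front();
        open.pop();
        const ChunkInfo& cur = chunks[index(x, y, z)];
        for (int f = 0; f < 6; ++f) {
            if (cur.opaqueFaces & (1 << f)) continue;        // can't leave through this face
            const int nx = x + dir[f][0], ny = y + dir[f][1], nz = z + dir[f][2];
            if (nx < 0 || ny < 0 || nz < 0 || nx >= sx || ny >= sy || nz >= sz) continue;
            const int ni = index(nx, ny, nz);
            if (chunks[ni].opaqueFaces & (1 << (f ^ 1))) continue;  // blocked by its opposing face
            if (!reachable[ni]) { reachable[ni] = true; open.push({nx, ny, nz}); }
        }
    }
    return reachable;
}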
First of all: Windows XP SP3, 2 GB RAM, Intel Core 2 Duo 2.33 GHz, NVIDIA 9600GT 1 GB RAM, OpenGL 3.3, fully updated.
Short description of what I am doing: ideally I need to put ONE single pixel into a GL texture (A) using glTexSubImage2D every frame. Then, modify the texture inside a shader/FBO/camera-facing-quad setup and replace the original image with the resulting FBO.
Of course, I don't want an FBO feedback loop, so instead I put the modified version inside a temporary texture and do the update separately with glCopyTexSubImage2D.
The sequence is now:
1) Put one pixel into GL texture A using glTexSubImage2D every frame (with width = height = 1).
2) This modified version of A is used/modified inside a shader-FBO-quad setup and rendered into a different texture (B).
3) The resulting texture B is then copied over A using glCopyTexSubImage2D.
4) Repeat...
By repeating this loop I want to achieve a slow fading effect by multiplying the color values in the shader by 0.99 every frame.
2 things are badly wrong:
1) With a fading factor of 0.99 repeated every frame, the fading stops at RGB 48,48,48, leaving a trail of greyish pixels that are never fully faded out.
2) The program runs at 100 FPS. Very bad, because if I comment out the glCopyTexSubImage2D the program goes at 1000 FPS!!
I also get 1000 FPS by commenting out just glTexSubImage2D and leaving glCopyTexSubImage2D alone. This is to clarify that glTexSubImage2D and glCopyTexSubImage2D are NOT the bottleneck by themselves (I tried replacing glCopyTexSubImage2D with a secondary FBO to do the copying; same results).
Observation: the bottleneck only shows up when both those commands are working!
Hard mode: no PBOs pls.
Link with source and exe: http://www.mediafire.com/?ymu4v042a1aaha3 (CodeBlocks and SDL used). FPS counts are written to stdout.txt.
I'm asking for a workaround for the 2 things exposed above. Expected results: a full fade-out effect to plain black at 800-1000 FPS.
To problem 1:
You are experiencing some precision (and quantization) issues here. I assume you are using some 8-bit UNORM framebuffer format, so anything you write to it will be rounded to the next discrete step out of 256 levels. Think about it: 48 * 0.99 = 47.52, which will end up as 48 again, so it will not get any darker than that. Using some real floating-point format would be a solution, but it is likely to greatly decrease overall performance...
The fade-out operation you chose is simply not the best choice; it might be better to add some linear term to guarantee that you decrease the value by at least 1/255 per frame, e.g. new = max(old * 0.99 - 1.0/255.0, 0.0).
To problem 2:
It is hard to say what the actual bottleneck here is. As you are not using PBOs, you are limited to synchronous texture updates.
However, why do you need to do that copy operation at all? The standard approach to this kind of thing would be some texture/FBO/color buffer "ping-pong", where you just swap the "roles" of the textures after each iteration. So you get the following sequence (a code sketch follows the list):
update A
render into B (reading from A)
update B
render into A (reading from B)
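A minimal sketch of that loop (assuming two textures, each attached to its own FBO; the full-screen-quad draw with the fade shader is left as a placeholder):

#include <GL/glew.h>
#include <utility>

// tex[] and fbo[] are assumed to be created and attached elsewhere.
GLuint tex[2], fbo[2];
int src = 0, dst = 1;

void FrameStep(int px, int py, const unsigned char* rgba /* the one new pixel */)
{
    // "update": splat the new pixel into the source texture.
    glBindTexture(GL_TEXTURE_2D, tex[src]);
    glTexSubImage2D(GL_TEXTURE_2D, 0, px, py, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, rgba);

    // "render": draw the full-screen quad into the destination FBO,
    // with the fade shader sampling the source texture.
    glBindFramebuffer(GL_FRAMEBUFFER, fbo[dst]);
    glBindTexture(GL_TEXTURE_2D, tex[src]);
    // ... drawFullScreenQuadWithFadeShader();   (hypothetical)
    glBindFramebuffer(GL_FRAMEBUFFER, 0);

    // No glCopyTexSubImage2D needed: just swap roles for the next frame.
    std::swap(src, dst);
}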
Problem 2: splatting arbitrary pixels into a texture as fast as possible.
Since probably the absolute fastest way to dynamically upload data to the GPU from main memory is via vertex arrays or VBOs, the solution to problem 2 becomes trivial (a sketch follows the list):
1) create a vertex array and a color array (or interleave coordinates and colors; performance/bandwidth may vary);
2) Z component = 0, since we want the points to lie on the floor;
3) camera pointing downwards with an orthographic projection (being sure to match the coordinate ranges exactly to the screen size);
4) render to texture with an FBO using GL_POINTS, with glPointSize = 1 and GL_POINT_SMOOTH disabled.
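A rough sketch of that setup (legacy client arrays for brevity; 2D coordinates plus an ortho projection stand in for the Z = 0 / top-down camera described above, and all names are illustrative):

#include <GL/glew.h>
#include <vector>

// Splat a batch of 1-pixel points into the FBO-attached texture.
void SplatPoints(GLuint fbo, int width, int height,
                 const std::vector<float>& xy, const std::vector<unsigned char>& rgba)
{
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glViewport(0, 0, width, height);

    // Orthographic projection that maps coordinates 1:1 onto the texture.
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(0, width, 0, height, -1, 1);
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();

    glDisable(GL_POINT_SMOOTH);
    glPointSize(1.0f);

    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_COLOR_ARRAY);
    glVertexPointer(2, GL_FLOAT, 0, xy.data());
    glColorPointer(4, GL_UNSIGNED_BYTE, 0, rgba.data());

    glDrawArrays(GL_POINTS, 0, static_cast<GLsizei>(xy.size() / 2));

    glDisableClientState(GL_COLOR_ARRAY);
    glDisableClientState(GL_VERTEX_ARRAY);
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
}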
Pretty standard. Now the program runs at 750 fps. Close enough. My dreams were all like "Hey mom look! I'm running glTexSubImage2D at 1000 fps!" and then meh. Though glCopyTexSubImage2D is very fast. Would recommend.
Not sure if this is the best way to GPU-accelerate fading, but given the results one must note a strong concentration of Force with this one. Anyway, the problem with the fading stopping halfway is fixed by setting a minimum constant decrement, so even if the exponential curve flattens out, the fade will finish no matter what.
I'm writing a 2D, sprite-based game and I'm having a hard time with making collision detection. First of all, I am well aware of other methods and in fact I'm using Box2D's quadtree queries to filter out non-overlapping sprites. So pixel-perfect detection would be used only on sprites that were found to overlap and would be used only a few times per frame. The sprites are rotating but not scaling.
The problem is I need it done with pixels, because the sprites can change over time, and creating and maintaining e.g. Box2D geometric shapes to approximate the bitmaps would get really complicated.
I did some research and found out these methods are possible in OpenGL in order to check if any pixels with non-zero alpha channel overlap:
1) Rendering the sprites to a texture/buffer with e.g. 50% alpha and a proper blending function, copying the result to RAM, and checking if there's any pixel with alpha greater than e.g. 80%.
This method is simple, but as I found out, copying back is extremely slow.
2) Using OpenGL's occlusion query.
From what I found on the net, occlusion queries can be tricky (plus sometimes you need to wait until the next frame to get the result) and buggy on some graphics cards. The fact that such queries don't produce results immediately is a deal breaker because of how the game is designed to work.
3) Shaders and atomic counters.
I'm not sure if it would work, but the idea is that when rendering the second sprite I'd use a fragment shader that increments an atomic counter each time it overwrites something, and then check the counter's value on the CPU side. The only problem is that atomic counters are pretty new, and machines that are two or three years old may not support them.
Is there something I missed? Or should I just forget about using the GPU and write my own renderer just for collision detection on the CPU?
Atomic counters are an appropriate way to do this on the GPU. Since you're going to be checking many, many pixels, you might as well do it in parallel. The big performance question here is reading the result back asynchronously, but that depends on how you build your engine, of course.
Atomic counters need OpenGL 4.2; it's quite possible your graphics card supports this, so you should check.
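A rough sketch of the CPU side (GL 4.2; the fragment shader just declares a layout(binding = 0) uniform atomic_uint and calls atomicCounterIncrement() on fragments it considers overlapping; all names here are illustrative):

#include <GL/glew.h>

GLuint CreateOverlapCounter()
{
    GLuint buf = 0;
    glGenBuffers(1, &buf);
    glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, buf);
    glBufferData(GL_ATOMIC_COUNTER_BUFFER, sizeof(GLuint), nullptr, GL_DYNAMIC_DRAW);
    return buf;
}

bool SpritesOverlap(GLuint counterBuf /*, sprites, masks, shader, ... */)
{
    // Reset the counter and bind it to binding point 0 (matching the shader).
    const GLuint zero = 0;
    glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, counterBuf);
    glBufferSubData(GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(GLuint), &zero);
    glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, counterBuf);

    // ... render sprite A's alpha mask, then draw sprite B with the counting shader ...

    // Make the shader's increments visible before reading the buffer back.
    glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);

    GLuint count = 0;
    glGetBufferSubData(GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(GLuint), &count);
    return count > 0;   // note: this readback stalls the pipeline, so batch your tests
}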
I'm getting some repeating lags in my OpenGL application.
I'm using the Win32 API to create the window, and I'm also creating a 2.2 context.
So the main loop of the program is very simple:
Clearing the color buffer
Drawing a triangle
Swapping the buffers.
The triangle is rotating, that's the way I can see the lag.
Also, my frame time isn't smooth, which may be the problem.
But I'm very sure the delta-time calculation is correct, because I've tried plenty of ways.
Do you think it could be a graphics driver problem?
I ask because a friend of mine ran almost exactly the same program, except that I do fewer calculations and I'm using the standard OpenGL shader.
Also, his program uses more CPU power than mine, and its CPU % is smoother than mine.
I should also add:
On my laptop I get the same lag every ~1 second, so I can see some kind of pattern.
There are many reasons for a jittery frame rate. Off the top of my head:
Not calling glFlush() at the end of each frame
other running software interfering
doing things in your code that certain graphics drivers don't like
bugs in graphics drivers
Using the standard Windows time functions with their terrible resolution
Try these:
kill as many running programs as you can get away with. Use the process tab in the task manager (CTRL-SHIFT-ESC) for this.
bit by bit, reduce the amount of work your program is doing and see how that affects the frame rate and the smoothness of the display.
if you can, try enabling/disabling vertical sync (you may be able to do this in your graphics card's settings) to see if that helps
add some debug code to output the time taken to draw each frame, and see if there are anomalies in the numbers, e.g. every 20th frame taking an extra 20 ms, or random frames taking 100 ms (a sketch of such a logger is below).
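For example, a minimal per-frame logger using QueryPerformanceCounter (much better resolution than the default Windows timers; call it once per frame and watch for spikes):

#include <windows.h>
#include <cstdio>

// Prints how long the previous frame took, in milliseconds.
void LogFrameTime()
{
    static LARGE_INTEGER frequency = {};
    static LARGE_INTEGER lastTime  = {};

    if (frequency.QuadPart == 0)            // first call: initialise and bail out
    {
        QueryPerformanceFrequency(&frequency);
        QueryPerformanceCounter(&lastTime);
        return;
    }

    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    const double ms = 1000.0 * static_cast<double>(now.QuadPart - lastTime.QuadPart)
                      / static_cast<double>(frequency.QuadPart);
    lastTime = now;
    std::printf("frame: %.3f ms\n", ms);
}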