How to go about benchmarking a software rasterizer - C++

OK, I've been developing a software rasterizer for some time now, but have no idea how to go about benchmarking it to see if it's actually any good... I mean, say you can render X verts at Y frames per second, what would be a good way to analyse this data to see if it's any good, rather than someone just saying
"30 fps with 1 light is good" etc?

What do you want to measure? I suggest fillrate and triangle rate. Basically, fillrate is how many pixels your rasterizer can spit out each second; triangle rate is how many triangles your rasterizer + affine transformation functions can push out each second, independent of the fillrate. Here's my suggestion for measuring both:
To measure the fillrate without getting noise from the time used for triangle setup, use only two triangles, which form a quad. Start with a small size, and then increase it in small steps. You should eventually find the size at which rendering takes about one second. If you don't, you can render full-screen triangle pairs with blending enabled, which is a pretty slow operation and which only burns fillrate. The fillrate is then the width x height of what you rendered per second, for example 4 megapixels/second.
To measure the triangle rate, do the same thing, only with triangles this time. Start with two tiny triangles, and increase the number of triangles until the rendering time reaches one second. The time used by the triangle/transformation setup is much more apparent with small triangles than the time used to fill them. The unit is triangles/second.
Also, the overall time used to render a frame is worth comparing too. The render time for a frame is the derivative of the global time, i.e. the delta time. The reciprocal of the delta time is the number of frames per second, if that delta time were constant for all frames.
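A rough sketch of such a timing loop in C++ (drawFullscreenQuad, drawTinyTriangleBatch and the sizes are placeholders for whatever your rasterizer exposes; the point is just how elapsed time turns into pixels/s and triangles/s):

#include <chrono>

// Run a drawing callback repeatedly for roughly one second and convert the
// elapsed time into a rate. 'unitsPerIteration' is pixels written per call
// (fillrate) or triangles submitted per call (triangle rate).
template <typename DrawFn>
double measureRate(DrawFn draw, long long unitsPerIteration, double targetSeconds = 1.0)
{
    using clock = std::chrono::steady_clock;
    long long units = 0;
    const auto start = clock::now();
    while (std::chrono::duration<double>(clock::now() - start).count() < targetSeconds) {
        draw();                      // e.g. rasterize a full-screen quad, or a batch of tiny triangles
        units += unitsPerIteration;
    }
    const double elapsed = std::chrono::duration<double>(clock::now() - start).count();
    return units / elapsed;          // pixels/second or triangles/second
}

// Fillrate:      measureRate([&]{ drawFullscreenQuad(); }, width * height);
// Triangle rate: measureRate([&]{ drawTinyTriangleBatch(); }, batchSize);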
Of course, for these numbers to be half-way comparable across rasterizers, you have to use the same techniques and features. Comparing numbers from a rasterizer which uses per-pixel lighting against another which uses flat-shading doesn't make much sense. Resolution and color depth should also be equal.
As for optimization, getting a proper profiler should do the trick. GCC has the GNU profiler gprof. If you want an opinion on clever things to optimize in a rasterizer, ask that as a separate question. I'll answer to the best of my ability.

If you want to determine if it's "any good" you will need to compare your rasterizer with other rasterizers. "30 fps with 1 light" might be extremely good, if no-one else has ever managed to go beyond, say, 10 fps.


Linear Interpolation and Object Collision

I have a physics engine that uses AABB testing to detect object collisions and an animation system that does not use linear interpolation. Because of this, my collisions act erratically at times, especially at high speeds. Here is a glaringly obvious problem in my system...
For the sake of demonstration, assume a frame in our animation system lasts 1 second and we are given the following scenario at frame 0.
At frame 1, the collision of the objects will not be detected, because c1 will have traveled past c2 on the next draw.
Although I'm not using it, I have a bit of a grasp on how linear interpolation works because I have used linear extrapolation in this project in a different context. I'm wondering if linear interpolation will solve the problems I'm experiencing, or if I will need other methods as well.
There is a part of me that is confused about how linear interpolation is used in the context of animation. The idea is that we can achieve smooth animation at low frame rates. In the above scenario, we cannot simply set c1 to be centered at x=3 in frame 1. In reality, they would have collided somewhere between frame 0 and frame 1. Does linear interpolation automatically take care of this and allow for precise AABB testing? If not, what will it solve, and what other methods should I look into to achieve smooth and precise collision detection and animation?
The phenomenon you are experiencing is called tunnelling, and is a problem inherent to discrete collision detection architectures. You are correct in feeling that linear interpolation may have something to do with the solution as it can allow you to, within a margin of error (usually), predict the path of an object between frames, but this is just one piece of a much larger solution. The terminology I've seen associated with these types of solutions is "Continuous Collision Detection". The topic is large and gets quite complex, and there are books that discuss it, such as Real Time Collision Detection and other online resources.
So to answer your question: no, linear interpolation on its own won't solve your problems, unless you're only dealing with circles or spheres.
What to Start Thinking About
The way the solutions look and behave is dependent on your design decisions, and they are generally large. So, just to point in the direction of the solution, the fundamental idea of continuous collision detection is to figure out: how far between the earlier frame and the later frame does the collision happen, and what position and rotation are the two objects in at that point? Then you must calculate the configuration the objects will be in at the later frame time in response to this. Things get very interesting addressing these problems for anything other than circles in two dimensions.
I haven't implemented this, but I've seen a solution described where you march the two candidates forward between the frames, advancing their position with linear interpolation and their orientation with spherical linear interpolation, and checking with discrete algorithms whether they're intersecting (the Gilbert-Johnson-Keerthi algorithm). From there you continue to apply discrete algorithms to get the smallest penetration depth (the Expanding Polytope Algorithm) and pass that, along with the remaining time between the frames, to a solver to get how the objects look at your later frame time. This doesn't give an analytic answer, but I don't know of an analytic answer for generalized 2D or 3D cases.
If you don't want to go down this path, your best weapon in the fight against complexity is assumptions: if you can assume your high-velocity objects can be represented as points, things get easier; if you can assume the orientation of the objects doesn't matter (circles, spheres), things get easier; and it keeps going and going. The topic is beyond interesting and I'm still on the path of learning it, but it has provided some of the most satisfying moments in my time programming. I hope these ideas get you on that path as well.
Edit: Since you specified you're working on a billiard game.
First, we'll check whether discrete or continuous collision detection is needed:
Is any amount of tunnelling acceptable in this game? Not in billiards, no.
What is the speed at which we will see tunnelling? Using a 0.0285m radius for the ball (standard American) and a 0.01s physics step, we get 2.85m/s as the minimum speed at which collisions start giving a bad response. I'm not familiar with the speed of billiard balls, but that number feels too low.
So just checking on every frame if two of the balls are intersecting is not enough, but we don't need to go completely continuous. If we use interpolation to subdivide each frame we can increase the velocity needed to create incorrect behaviour: With 2 subdivisions we get 5.7m/s, which is still low; 3 subdivisions gives us 8.55m/s, which seems reasonable; and 4 gives us 11.4m/s which feels higher than I imagine billiard balls are moving. So how do we accomplish this?
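To make those thresholds concrete, here is the same arithmetic as a tiny C++ snippet (it only reproduces the figures above):

#include <cstdio>

int main()
{
    const double radius = 0.0285; // standard American ball radius, metres
    const double dt     = 0.01;   // physics step, seconds

    // A ball that moves more than one radius per (sub)step can slip past a
    // discrete intersection check, so the threshold is radius / sub-step.
    for (int subdivisions = 1; subdivisions <= 4; ++subdivisions)
        std::printf("%d subdivision(s): tunnelling starts around %.2f m/s\n",
                    subdivisions, subdivisions * radius / dt);
}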
Discrete Collisions with Frame Subdivisions using Linear Interpolation
Using subdivisions is expensive so it's worth putting time into candidate detection to use it only where needed. This is another problem with a bunch of fun solutions, and unfortunately out of scope of the question.
So you have two candidate circles which will very probably collide between the current frame and the next. In pseudocode, the algorithm looks like this:
dt = 0.01
subdivisions = 4
circle1.next_position = circle1.position + (circle1.velocity * dt)
circle2.next_position = circle2.position + (circle2.velocity * dt)
for i from 0 to subdivisions - 1:
    temp_c1.position = interpolate(circle1.position, circle1.next_position, (i + 1) / subdivisions)
    temp_c2.position = interpolate(circle2.position, circle2.next_position, (i + 1) / subdivisions)
    if intersecting(temp_c1, temp_c2):
        return intersection confirmed
return no intersection
Where the interpolate signature is interpolate(start, end, alpha)
So here you have interpolation being used to "move" the circles along the path they would take between the current frame and the next. On a confirmed intersection you can compute the penetration depth and pass it, along with the sub-step delta time (dt / subdivisions), the two circles and the collision points, to a resolution step that determines how they should respond to the collision.
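A minimal C++ sketch of the same idea for circles with position, velocity and radius (interpolate and intersecting match the pseudocode above; sweptIntersection is a made-up wrapper name):

struct Vec2   { double x, y; };
struct Circle { Vec2 position, velocity; double radius; };

// Linear interpolation between two points: alpha = 0 gives 'start', 1 gives 'end'.
static Vec2 interpolate(Vec2 start, Vec2 end, double alpha)
{
    return { start.x + (end.x - start.x) * alpha,
             start.y + (end.y - start.y) * alpha };
}

// Discrete circle-circle overlap test on squared distances.
static bool intersecting(Vec2 a, double ra, Vec2 b, double rb)
{
    const double dx = a.x - b.x, dy = a.y - b.y, r = ra + rb;
    return dx * dx + dy * dy <= r * r;
}

// Returns true and sets 'hitAlpha' to the first sub-step at which the two
// circles overlap, or false if no overlap is found within the frame.
bool sweptIntersection(const Circle& c1, const Circle& c2,
                       double dt, int subdivisions, double& hitAlpha)
{
    const Vec2 next1 = { c1.position.x + c1.velocity.x * dt, c1.position.y + c1.velocity.y * dt };
    const Vec2 next2 = { c2.position.x + c2.velocity.x * dt, c2.position.y + c2.velocity.y * dt };
    for (int i = 0; i < subdivisions; ++i) {
        const double alpha = double(i + 1) / subdivisions;
        if (intersecting(interpolate(c1.position, next1, alpha), c1.radius,
                         interpolate(c2.position, next2, alpha), c2.radius)) {
            hitAlpha = alpha;
            return true;
        }
    }
    return false;
}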

Cost of using multiple render targets

I am using GLSL as a framework for GPGPU real-time image processing. I am currently trying to "shave off" a few more milliseconds to make my application real-time. Here's the basic setup:
I take an input image, calculate several transformations of it, and then output a result image. For instance, let the input image be I. Then one fragment shader calculates f(I), a second calculates g(I), and the last one calculates h(f(I),g(I)).
My question is regarding efficiently calculating f(I),g(I): does it matter if I use 2 separate fragment shaders (and therefore 2 rendering passes), or if I use a single fragment shader with 2 outputs? Will the latter run faster? I have mostly found discussions about the "how-to"; not about the performance.
Edit
Thanks for the replies so far. Following several remarks, here's an example for my use-case with some more details:
I want to filter the rows of image I with a 1-D filter, and also filter the rows of the squared image (each pixel is squared), i.e. f(I) = filter rows and g(I) = square and filter rows:
shader1: (input image) I --> filter rows --> I_rows (output image)
shader2: (input image) I --> square pixels and filter rows--> I^2_rows (output image)
The question is: will writing a single shader that does both operations be faster than running these two shaders one after the other? @derhass suggests that the answer is yes, because both operations access the same texture locations and benefit from locality. But if it weren't for the texture locality, would I still see a performance boost, or is a single shader rendering to two outputs basically equivalent to two render passes?
Using multiple render passes is usually slower than using one pass with MRT output, but this will also depend on your situation.
As I understand it, both f(I) and g(I) sample the input image I, and if each samples the same (or closely neighboring) locations, you can greatly profit from the texture cache between the different operations - you have to sample the input texture just once, instead of twice as with the multipass approach.
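As a concrete sketch of that single-pass MRT setup (not the asker's code; the texture names, GLSL version and filter bodies are assumptions), one fragment shader can write f(I) and g(I) into two colour attachments of the same FBO while fetching the input only once:

#include <GL/glew.h>   // assuming an OpenGL 3.x context and a loader such as GLEW

// Fragment shader with two outputs; the actual 1-D row filters are omitted.
static const char* fragmentSrc = R"GLSL(
#version 330 core
uniform sampler2D inputImage;
in  vec2 uv;
layout(location = 0) out vec4 outRows;    // f(I): row-filtered image
layout(location = 1) out vec4 outSqRows;  // g(I): squared, then row-filtered image
void main() {
    vec4 texel = texture(inputImage, uv); // one fetch feeds both outputs
    outRows   = texel;                    // ...apply the 1-D row filter here
    outSqRows = texel * texel;            // ...square first, then apply the row filter
}
)GLSL";

// Route the two shader outputs to two textures of one FBO, then draw a single
// full-screen quad with the shader above bound.
void setupTwoOutputs(GLuint fbo, GLuint texRows, GLuint texSqRows)
{
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, texRows,   0);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT1, GL_TEXTURE_2D, texSqRows, 0);
    const GLenum buffers[2] = { GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1 };
    glDrawBuffers(2, buffers);
}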
Taking this approach one step further: do you even need the intermediate results f(I) and g(I) separately? Maybe you could just put h(f(I),g(I)) directly into one shader, so you need neither multiple passes nor MRTs. If you want to be able to dynamically combine your operations, you can still use that approach and programmatically combine different shader code parts at runtime to implement the operations (where possible), and use multiple passes only where absolutely necessary.
EDIT
As the question has been updated in the meantime, I think I can give some more specific answers:
What I said so far, especially about putting h(f(I),g(I)) into one shader, is only a good idea if h (or f and g) does not need any neighboring pixels. If h is an nxn filter kernel, you would have to access nxn different input texels, and since those inputs are not directly available, you would have to calculate f and g for each of them. If both f and h are filter kernels, the effective filter size of the compound operation will be greater, and it is much better to calculate the intermediate results first and use multiple passes.
Looking at the specific issue you describe, it comes down to this.
If you use two separate shaders in the most naive way, your rendering will look like this:
use shader1
select some output color buffer
draw a quad
use shader2
select some different color buffer
draw a quad
Every draw call has its overhead. The GL will have to do some extra validation. Switching the shaders might be the most expensive extra step here compared to the combined shader approach, as it might force a GPU pipeline flush. Also, for each draw call, you have the vertex processing, rasterization, and per-fragment attribute interpolation operations. With just one shader, a lot of this overhead goes away, and the per-fragment calculations described so far can be "shared" for both filters.
But if it weren't for the texture locality, would I still see a performance boost?
Because of what I said so far, and specific to the shaders you presented, I tend to say: yes. But the effect will be very small to negligible if we ignore the texture accesses here, especially if we assume reasonably high-resolution images so that the relative overhead compared to the total amount of work appears small. I would at least say that using a single-pass MRT setup will not be slower. However, only benchmarking/profiling the very specific implementation on a specific GPU will give a definitive answer.
Why did I say "the shaders you presented"? Because in both cases, you do the image squaring in one shader. You could also split that into two different shaders and render passes. In that case, you would get additional overhead (on top of what was already mentioned) for writing the intermediate results and reading them back. However, since you run a filter over the intermediate result, you do not have to square any input texel more than once, whereas in the combined approach you do. If the squaring operation is expensive enough and your filter size is big enough, you could in theory save more time than the overhead introduced by multiple passes. Again, only benchmarking/profiling can tell you where the break-even point lies.
I have done some benchmarking of MRT vs. multiple render passes myself in the past, although the image processing operations I was interested in are a bit different from yours. What I found is that in such scenarios, the texture access is the key factor, and you can hide a lot of other calculations (like squaring a color value) in the texture access latency. I think that your "but if it weren't for the texture locality" is a bit unrealistic, since it is the major contribution to the overall running time. And it isn't just the locality, it is also the number of texture accesses in total: with your multiple-shader approach, an image of size w*h, and a 1D filter of size n, you end up with 2*w*h*n texture accesses overall, while with the combined approach you reduce that to w*h*n, and that can make a huge difference in practice.
For an AMD FirePro V9800, an image size of 1920x1080, and just copying the pixels to two output buffers by rendering textured quads, I got ~0.320ms with two passes (even without switching shaders) vs. ~0.230ms with a single-pass MRT setup. So execution time was reduced by "only" 30%, but this was with just one texture fetch per shader invocation. With filter kernels, I'd expect this figure to get closer to a 50% reduction with increasing kernel size (but I haven't measured that, though).
Let us ignore any potential benefits from hardware-specific things like data cache, register re-use, etc. that might occur if you do your entire algorithm in a single shader invocation, and focus entirely on algorithm complexity for a minute.
A Gaussian Blur on a 2D image is a separable filter (X and Y can be blurred as a much simpler series of 1D blurs), and you can actually get better performance if you split the horizontal and vertical application into two passes.
Consider the complexity of two 1D blurs vs. one 2D blur in Big O:
Two-Pass Gaussian Blur (two 1D blurs):
    O(w * h * 2n) samples, for an image of size w x h and a kernel of width n
Single-Pass Gaussian Blur (single 2D blur):
    O(w * h * n^2) samples
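Just to make those tap counts concrete, here is a CPU-side sketch of the separable (two 1-D passes) version on a single-channel float image; names and the clamp-to-edge handling are illustrative, not a definitive implementation:

#include <algorithm>
#include <vector>

// One 1-D convolution pass (horizontal or vertical) with clamp-to-edge sampling.
// A kernel of width k costs ~k taps per pixel; running this twice (rows, then
// columns) costs ~2k taps per pixel, versus k*k for a full 2-D kernel.
std::vector<float> blur1D(const std::vector<float>& src, int w, int h,
                          const std::vector<float>& kernel, bool horizontal)
{
    const int r = static_cast<int>(kernel.size()) / 2;
    std::vector<float> dst(src.size(), 0.0f);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            for (int i = -r; i <= r; ++i) {
                const int sx = horizontal ? std::clamp(x + i, 0, w - 1) : x;
                const int sy = horizontal ? y : std::clamp(y + i, 0, h - 1);
                sum += src[sy * w + sx] * kernel[i + r];
            }
            dst[y * w + x] = sum;
        }
    return dst;
}

// Usage: blurred = blur1D(blur1D(image, w, h, kernel, true), w, h, kernel, false);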
Deferred shading is another example. Instead of one massive loop over all lights in a single pass, many implementations will do one pass per light, shading only the area of the screen that each individual light actually covers.
Multi-pass is not always a bad thing; when it simplifies your algorithm, as in the case of a separable filter or light coverage, it is often a good choice.
Your results may vary, but if you can show an appreciable difference in algorithm complexity in Big O notation using one approach over the other, it is worth exploring the run-time performance of both implementations.

I need advice on how to improve graphics

I have a file with a table containing 23 million records of the following form: {atomName, x, y, z, transparency}. I decided to use OpenGL for the solution.
My task is to render it. In the first iteration, I used a glBegin/glEnd block and drew every atom as a point of some color. This solution worked, but I got 0.002 fps.
Then I tried using VBOs. I created three buffers: vertex, color and index. This solution worked. I got 60 fps, but binding the buffers is not comfortable and I am drawing points, not spheres.
Then I read about VAOs, which can simplify buffer binding. OK, that worked; the binding is comfortable now.
Now I want to draw spheres, not points. My thought was to generate, around each point, a set of vertices from which a sphere could be built (with some accuracy). But if I have 23 million points, I must calculate ~12 or more extra vertices per point; 23,000,000 * 12 * 4 bytes (float) is about 1 GB of data, so perhaps it is not a good solution.
What is the best next move I should make? I cannot fully work out whether shaders are applicable to this task or whether there are other ways.
About your drawing process
My task is to render it. In the first iteration, I used a glBegin/glEnd block and drew every atom as a point of some color. This solution worked, but I got 0.002 fps.
Think about it: for every one of your 23 million records you make at least one function call directly (glVertex) and probably several function calls implicitly through it. Even worse, glVertex likely causes a context switch. What this means is that your CPU hits several speed bumps for every vertex it has to process. A top-notch CPU these days has a clock rate of about 3 GHz and a pipeline length on the order of 10 instructions. When you make a context switch, that pipeline gets stalled; in the worst case it then takes one pipeline length to actually process a single instruction. Let's assume that you have to perform at least 1000 instructions to process a single glVertex call (which is actually a rather optimistic estimate). That alone means you're limited to processing at most 3 million vertices per second, so at 23 million vertices that's already less than one FPS.
But you also get context switches in there, which add a further penalty, and probably a lot of branching which creates further pipeline flushes.
And that's just the glVertex call. You also have colors in there.
And you wonder that immediate mode is slow?
Of course it's slow. Using immediate mode has been discouraged for well over 15 years. Vertex arrays have been available since OpenGL 1.1.
This solution worked. I got 60 fps,
Yes, because all the data resides on the GPU's own memory now. GPUs are massively parallel and optimized to crunch this kind of data and doing the operations they do.
but binding the buffers is not comfortable
Well, OpenGL is not a high level scene graph library. It's a mid to low level drawing API. You use it like a sophisticated pencil to draw on a digital canvas.
Then I read about VAOs
Well, VAOs are meant to coalesce buffer objects that belong together, so it makes sense to use them.
Now I want to draw spheres, not points.
You have two options:
Using point sprite textures. This means that your points will cover an area when drawn, and that area gets a texture applied. I think this is the best method for you. Given the right shader you can even give your point sprites the right kind of depth values, so that your "spheres" will actually intersect like spheres in the depth buffer.
The other option is instancing a single sphere geometry, using your atom records as control data for the instancing process. This would then draw real sphere geometry. However, I fear that implementing an instanced drawing process might be a bit too advanced for your skill level at the moment.
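For reference, a rough sketch of what such an instanced draw might look like (buffer names and the per-atom layout are assumptions; a vertex shader would then offset and scale the unit sphere by its instance's atom data):

#include <GL/glew.h>   // assuming GLEW (or any loader) and an OpenGL 3.x context

// Draw 'atomCount' spheres with one call. 'sphereVBO'/'sphereEBO' hold a unit
// sphere mesh; 'atomVBO' holds one {x, y, z, colorIndex} float4 per atom.
void drawAtoms(GLuint vao, GLuint sphereVBO, GLuint sphereEBO, GLuint atomVBO,
               GLsizei sphereIndexCount, GLsizei atomCount)
{
    glBindVertexArray(vao);

    // Attribute 0: sphere vertex positions, advanced once per vertex.
    glBindBuffer(GL_ARRAY_BUFFER, sphereVBO);
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, nullptr);

    // Attribute 1: per-atom data, advanced once per *instance*.
    glBindBuffer(GL_ARRAY_BUFFER, atomVBO);
    glEnableVertexAttribArray(1);
    glVertexAttribPointer(1, 4, GL_FLOAT, GL_FALSE, 4 * sizeof(float), nullptr);
    glVertexAttribDivisor(1, 1);

    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, sphereEBO);
    glDrawElementsInstanced(GL_TRIANGLES, sphereIndexCount, GL_UNSIGNED_INT,
                            nullptr, atomCount);   // one draw call for all atoms
}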
About drawing 23 million points
Seriously, what kind of display do you have available that can show 23 million distinguishable points? Your typical computer screen has about 2000×1500 pixels. The highest-resolution displays you can buy these days have about 4k×2.5k pixels, i.e. 10 million individual pixels. Let's assume your atoms are evenly distributed in a plane: with 23 million atoms to draw, each pixel will be overdrawn several times. You simply can't display 23 million individual atoms that way. Another way to look at it is that the display's pixel grid implies a spatial sampling, and you can't reproduce anything smaller than twice the average sampling distance (sampling theorem).
So it absolutely makes sense to draw only a subset of the data, namely the subset that's actually in view. Also, if you're zoomed very far out (i.e. you have the full dataset in view), it makes sense to coalesce nearby atoms.
It definitely makes sense to sort your data into a spatial subdivision structure. In your case I think an octree would be a good choice.
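A bare-bones node layout for that kind of octree might look like this (field names are made up; leaves keep atom indices, and a coarse merged representative can stand in for the whole node when it is roughly pixel-sized on screen):

#include <memory>
#include <vector>

struct OctreeNode {
    float centre[3];                          // centre of this cube
    float halfSize;                           // half of the cube's edge length
    std::vector<int> atomIndices;             // atoms stored in this node (leaves)
    std::unique_ptr<OctreeNode> children[8];  // null until the node is split
    float mergedPosition[3];                  // coarse stand-in for everything below,
    float mergedTransparency;                 // used when zoomed far out
};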

Ray Tracing: Only use single ray instead of both reflection & refraction rays

I am currently trying to understand the ray tracer developed by Kevin Beason (smallpt: http://www.kevinbeason.com/smallpt/) and if I understand the code correctly he randomly chooses to either reflect or refract the ray (if the surface is both reflective and refractive).
Lines 71-73:
return obj.e + f.mult(depth>2 ? (erand48(Xi)<P ? // Russian roulette
radiance(reflRay,depth,Xi)*RP:radiance(Ray(x,tdir),depth,Xi)*TP) :
radiance(reflRay,depth,Xi)*Re+radiance(Ray(x,tdir),depth,Xi)*Tr);
Can anybody please explain the disadvantages of only casting a single ray instead of both of them? I had never heard of this technique and I am curious what the trade-off is, given that it results in a huge complexity reduction.
This is a Monte Carlo ray tracer. Its advantage is that you don't spawn an exponentially increasing number of rays, which can occur in some simple geometries. The downside is that you need to average over a large number of samples. Typically you sample until the expected deviation from the true value is "low enough". Working out how many samples are required takes some statistics - or you just take a lot of samples.
Presumably he's relying on super-sampling pixels and trusting that the average colour will work out roughly correct, although not as accurate as tracing both rays every time.
i.e. fire 4 rays through one pixel and on average 2 are reflected, 2 are refracted.
Combine them to get an approximation of one ray reflected and refracted.
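A stripped-down C++ sketch of the choice itself (not smallpt's exact code; traceReflection/traceRefraction are stand-ins for the recursive radiance() calls, and the P/RP/TP weights mirror the variables used in the quoted lines):

#include <cstdlib>

struct Colour { double r, g, b; };
Colour operator*(Colour c, double s) { return { c.r * s, c.g * s, c.b * s }; }

Colour traceReflection() { return {1, 1, 1}; } // stand-in for radiance(reflRay, ...)
Colour traceRefraction() { return {1, 1, 1}; } // stand-in for radiance(Ray(x, tdir), ...)

// Pick ONE of the two rays at random and divide by the probability of that
// choice; averaged over many samples this equals Re*reflection + Tr*refraction,
// where Re and Tr are the Fresnel reflectance and transmittance.
Colour reflectOrRefract(double Re, double Tr)
{
    const double P  = 0.25 + 0.5 * Re;               // probability of taking the reflected ray
    const double RP = Re / P, TP = Tr / (1.0 - P);   // compensating weights
    const double u  = std::rand() / (RAND_MAX + 1.0); // uniform [0,1), stand-in for erand48
    return (u < P) ? traceReflection() * RP
                   : traceRefraction() * TP;
}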

Fastest way to calculate OpenGL texture similarity/distance?

Here's what I have:
I load a texture from disk, say 256x256, say a picture of a penguin
I create another texture with the same dimensions and I draw some stuff on it
I want to find the distance between the two OpenGL textures AS FAST AS POSSIBLE (accuracy of the distance function is of LEAST concern).
Actually, they might not be textures in the first place.
I might as well compare a region of the framebuffer to the loaded texture or do some other "magic".
How to do this super-fast?
This has little to do with OpenGL in fact. OpenGL is just a 3D rasterization "driver".
The main idea is to pick a distance/similarity algorithm, which is a tricky task. To begin with, you could use a root-mean-square deviation (RMSD) algorithm. It will give you a number describing how far apart the pixel values are.
Once you implement it on the CPU, you can benchmark it for your needs and maybe convert it to OpenCL. I don't think loading the "penguin" to the GPU just to compare it to other shapes you prepare there is a super-fast process by itself.
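A minimal CPU sketch of such an RMSD measure over two equally sized RGBA byte buffers (for example read back with glReadPixels); 0 means identical, larger means further apart:

#include <cmath>
#include <cstddef>
#include <vector>

double rmsd(const std::vector<unsigned char>& a, const std::vector<unsigned char>& b)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {   // assumes a.size() == b.size()
        const double d = double(a[i]) - double(b[i]);
        sum += d * d;
    }
    return std::sqrt(sum / a.size());
}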
Try to be more specific with your next question, and avoid the attitude of "I have a very good microscope, how do I hit nails with it super-fast?"
You could use the Pearson product-moment correlation for matching different images. You can find it in an answer of mine. Instead of matching a template smaller than the original image, you can correlate the two images directly.
In short, you take the deviation of each texel from the average, which gives you a correlation factor.
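For reference, a CPU sketch of that correlation on two equal-size grayscale float images (+1 means perfectly correlated, 0 means unrelated):

#include <cmath>
#include <cstddef>
#include <vector>

double pearson(const std::vector<float>& x, const std::vector<float>& y)
{
    double meanX = 0.0, meanY = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) { meanX += x[i]; meanY += y[i]; }
    meanX /= x.size();
    meanY /= y.size();

    double cov = 0.0, varX = 0.0, varY = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - meanX;   // deviation of each texel from the average
        const double dy = y[i] - meanY;
        cov  += dx * dy;
        varX += dx * dx;
        varY += dy * dy;
    }
    return cov / std::sqrt(varX * varY);
}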
But improving that algorithm by using shaders could be a little hard. First, you have to compute the texel average: maybe some OpenGL extension (like the histogram extension) can help you with this task.
Then you could use fragment shaders to perform the per-component computation (the difference between each texel and the average computed previously). The average texel has to be passed as a uniform, and the result should be stored in a floating-point texture (you can render it to a framebuffer object).
Then you should sum up all texels of the resulting texture in order to get the correlation between the two source textures.
This may be worth it when the images are very big. Otherwise, I think it's better to just execute the algorithm on the CPU, using SIMD instruction sets (like MMX, SSE, AVX).