I'm writing a 2D, sprite-based game and I'm having a hard time implementing collision detection. First of all, I am well aware of other methods, and in fact I'm using Box2D's quadtree queries to filter out non-overlapping sprites. So pixel-perfect detection would be used only on sprites that were found to overlap, and only a few times per frame. The sprites rotate but do not scale.
The problem is that I need it done with pixels, because the sprites can change over time, and building and maintaining e.g. Box2D geometric shapes to approximate the bitmap would get really complicated.
I did some research and found that the following methods are possible in OpenGL to check whether any pixels with a non-zero alpha channel overlap:
1) Rendering the sprites to a texture/buffer with e.g. 50% alpha and a suitable blending function, copying the result to RAM and checking whether any pixel has alpha greater than e.g. 80%.
This method is simple, but when I tested it, copying the result back was extremely slow.
2) Using OpenGL's occlusion query.
From what I found on the net, occlusion queries can be tricky (sometimes you need to wait until the next frame to get the result) and buggy on some graphics cards. The fact that such queries don't produce results immediately is a deal breaker because of how the game is designed to work.
3) Shaders and atomic counters.
I'm not sure if it would work, but it seems a solution could be to render the second sprite with a fragment shader that increments an atomic counter each time it overwrites something, and then check the counter's value on the CPU side. The only problem is that atomic counters are pretty new and machines that are 2-3 years old may not support them.
Is there something I missed? Or should I just forget about using the GPU and write my own renderer just for collision detection on the CPU?
Atomic counters are an appropriate way to do this on the GPU. Since you're going to be checking many, many pixels, you might as well do it in parallel. The big performance question is reading the result back asynchronously, but that depends on how you build your engine, of course.
Atomic counters are available in OpenGL 4.2. It's quite possible your graphics card supports them, but you should check.
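For comparison, the CPU fallback mentioned in the question can be quite small when the broad phase already limits it to a few sprite pairs per frame. Below is a minimal sketch, assuming each sprite keeps a 2D boolean alpha mask, rotates about the centre of that mask, and is positioned by that centre in world space. `Sprite` and `overlaps` are hypothetical names for illustration, not part of Box2D or OpenGL.

```python
import math
from dataclasses import dataclass

@dataclass
class Sprite:
    mask: list      # rows of booleans, True where alpha is non-zero
    x: float        # world position of the mask centre
    y: float
    angle: float    # rotation in radians

    def solid_at(self, wx, wy):
        """Return True if world point (wx, wy) hits an opaque pixel."""
        h, w = len(self.mask), len(self.mask[0])
        # Undo the translation, then the rotation, to get local coordinates.
        dx, dy = wx - self.x, wy - self.y
        c, s = math.cos(-self.angle), math.sin(-self.angle)
        lx = dx * c - dy * s + w / 2.0
        ly = dx * s + dy * c + h / 2.0
        ix, iy = math.floor(lx), math.floor(ly)
        return 0 <= ix < w and 0 <= iy < h and self.mask[iy][ix]

def overlaps(a, b):
    """Test the centre of every opaque pixel of `a` against `b`."""
    h, w = len(a.mask), len(a.mask[0])
    c, s = math.cos(a.angle), math.sin(a.angle)
    for iy in range(h):
        for ix in range(w):
            if not a.mask[iy][ix]:
                continue
            # Pixel centre relative to the mask centre, then rotated to world space.
            lx, ly = ix + 0.5 - w / 2.0, iy + 0.5 - h / 2.0
            wx = a.x + lx * c - ly * s
            wy = a.y + lx * s + ly * c
            if b.solid_at(wx, wy):
                return True
    return False
```

Because only pixel centres are sampled, an overlap thinner than one pixel can be missed; whether that matters depends on the game.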
I'm working on a data visualisation application where I need to draw about 20 different time series overlaid in 2D, each consisting of a few million data points. I need to be able to zoom and pan the data left and right to look at it, and I need to be able to place cursors on the data to measure time intervals and inspect data points. It's very important that when zoomed out all the way, I can easily spot outliers in the data and zoom in to look at them. So averaging the data can be problematic.
I have a naive implementation using a standard GUI framework on Linux, which is way too slow to be practical. I'm thinking of using OpenGL instead (testing on a Radeon RX480 GPU), with an orthographic projection. I searched around and it seems that using VBOs to draw line strips might work, but I have no idea if this is the best solution (the one that would give me the best frame rate).
What is the best way to send data sets consisting of millions of vertices to the GPU, assuming the data does not change, and will be redrawn each time the user interacts with it (pan/zoom/click on it)?
In modern OpenGL (versions 3/4 core profile) VBOs are the standard way to transfer geometric / non-image data to the GPU, so yes you will almost certainly end up using VBOs.
Alternatives would be uniform buffers, or texture buffer objects, but for the application you're describing I can't see any performance advantage in using them - might even be worse - and it would complicate the code.
The single biggest speedup will come from having all the data points stored on the GPU instead of being copied over each frame as a 2D GUI might be doing. So do the simplest thing that works and then worry about speed.
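To illustrate why keeping the data on the GPU helps: once the vertices sit in a static VBO, panning and zooming only change a small projection matrix that is uploaded as a uniform each frame. Here is a minimal sketch of that arithmetic, using the same mapping as glOrtho with near = -1 and far = 1. The function names are illustrative, not from any library.

```python
def ortho(left, right, bottom, top):
    """Row-major 4x4 orthographic matrix mapping the given data-space
    rectangle onto clip space [-1, 1] x [-1, 1]."""
    sx = 2.0 / (right - left)
    sy = 2.0 / (top - bottom)
    tx = -(right + left) / (right - left)
    ty = -(top + bottom) / (top - bottom)
    return [
        [sx, 0.0, 0.0, tx],
        [0.0, sy, 0.0, ty],
        [0.0, 0.0, -1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]

def project(m, x, y):
    """Apply the matrix to a 2D point (z = 0, w = 1)."""
    return (m[0][0] * x + m[0][3], m[1][1] * y + m[1][3])

def zoom_about(left, right, cursor, factor):
    """Narrow or widen the visible time range while keeping the sample
    under the cursor at the same screen position."""
    return (cursor - (cursor - left) / factor,
            cursor + (right - cursor) / factor)
```

OpenGL expects column-major matrices, so a row-major matrix like this one would be uploaded with the transpose flag of glUniformMatrix4fv set to GL_TRUE.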
If you are new to OpenGL, I recommend the book "OpenGL SuperBible" 6th or 7th edition. There are many good online OpenGL guides and references, just make sure you avoid the older ones written for OpenGL 1/2.
Hope this helps.
Voxel engine (like Minecraft) optimization suggestions?
As a fun project (and to get my Minecraft-addict son excited about programming) I am building a 3D Minecraft-like voxel engine using C# .NET 4.5.1, OpenGL and GLSL 4.x.
Right now my world is built using chunks. Chunks are stored in a dictionary, where I can select them based on a 64-bit X | Z<<32 key. This allows creating an 'infinite' world that can cache chunks in and out.
Every chunk consists of an array of 16x16x16 block segments. Starting from level 0, bedrock, it can go as high as you want (unlike Minecraft where the limit is 256, I think).
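One thing to watch with a packed key like this is negative coordinates: if X is sign-extended to 64 bits before the OR, its high bits overwrite the Z half of the key. A minimal sketch of packing and unpacking with explicit 32-bit masks follows. It is written in Python for brevity, and the helper names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def chunk_key(x, z):
    """Pack two signed 32-bit chunk coordinates into one 64-bit key."""
    return (x & MASK32) | ((z & MASK32) << 32)

def _signed32(v):
    """Reinterpret an unsigned 32-bit value as a signed one."""
    return v - (1 << 32) if v & 0x80000000 else v

def chunk_coords(key):
    """Recover (x, z) from a key produced by chunk_key."""
    return _signed32(key & MASK32), _signed32((key >> 32) & MASK32)
```

In C# the same idea would need X masked (or cast through uint) and Z widened to a 64-bit type before the shift.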
Chunks are queued for generation on a separate thread when they come in view and need to be rendered. This means that chunks might not show right away. In practice you will not notice this. NOTE: I am not waiting for them to be generated, they will just not be visible immediately.
When a chunk needs to be rendered for the first time, a VBO (glGenBuffers, GL_STREAM_DRAW, etc.) is generated for that chunk, containing the possibly visible/outside faces (neighboring chunks are checked as well). [This means that a chunk potentially needs to be re-tessellated when a neighbor has been modified]. When tessellating, the opaque faces are tessellated first for every segment, and then the transparent ones. Every segment knows where it starts within that vertex array and how many vertices it has, both for opaque faces and transparent faces.
Textures are taken from an array texture.
When rendering:
I first take the bounding box of the frustum and map that onto the chunk grid. Using that knowledge I pick every chunk that is within the frustum and within a certain distance of the camera.
Now I do a distance sort on the chunks.
After that I determine the ranges (index, length) of the chunks-segments that are actually visible. NOW I know exactly what segments (and what vertex ranges) are 'at least partially' in view. The only excess segments that I have are the ones that are hidden behind mountains or 'sometimes' deep underground.
Then I start rendering ... first I render the opaque faces [culling and depth test enabled, alpha test and blend disabled] front to back using the known vertex ranges. Then I render the transparent faces back to front [blend enabled]
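The ordering used for the two passes can be sketched as follows: one sort by squared distance gives the front-to-back order for opaque faces, and its reverse gives the back-to-front order for transparent faces. This is only an illustration with made-up function names and chunk positions given as (x, y, z) tuples, not the engine's actual code.

```python
def dist_sq(chunk_pos, camera):
    """Squared distance; enough for sorting and avoids the square root."""
    return sum((c - p) ** 2 for c, p in zip(chunk_pos, camera))

def draw_order(chunk_positions, camera):
    """Return (front_to_back, back_to_front) lists of chunk positions."""
    front_to_back = sorted(chunk_positions, key=lambda p: dist_sq(p, camera))
    back_to_front = front_to_back[::-1]
    return front_to_back, back_to_front
```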
Now... does anyone know a way of improving this while still allowing dynamic generation of an infinite world? I am currently reaching ~80fps@1920x1080, ~120fps@1024x768 (screenshots: http://i.stack.imgur.com/t4k30.jpg, http://i.stack.imgur.com/prV8X.jpg) on an average 2.2GHz i7 laptop with an ATI HD8600M gfx card. I think it must be possible to increase the frame rate. And I think I have to, as I want to add entity AI and sound, and do bump and specular mapping. Could using Occlusion Queries help me out? ... which I can't really imagine, given the nature of the segments. I have already minimized the creation of objects, so there is no 'new Object' all over the place. Also, as the performance doesn't really change between Debug and Release mode, I don't think it's the code but rather the approach to the problem.
edit: I have been thinking of using GL_SAMPLE_ALPHA_TO_COVERAGE but it doesn't seem to be working?
gl.Enable(GL.DEPTH_TEST);
gl.Enable(GL.BLEND); // gl.Disable(GL.BLEND);
gl.Enable(GL.MULTI_SAMPLE);
gl.Enable(GL.SAMPLE_ALPHA_TO_COVERAGE);
To render a lot of similar objects, I strongly suggest you take a look at instanced drawing: glDrawArraysInstanced and/or glDrawElementsInstanced.
It made a huge difference for me. I'm talking about going from 2 fps to over 60 fps when rendering 100000 similar icosahedrons.
You can parametrize your cubes with per-instance vertex attributes (glVertexAttribDivisor and friends) to make them different. Hope this helps.
It's at ~200fps currently, which should be OK. The 3 main things that I've done are:
1) generating the chunks on a separate thread.
2) tessellating the chunks on a separate thread.
3) using a Deferred Rendering Pipeline.
I don't really think the last one contributed much to the overall performance, but I had to start using it because of some of the shaders. Now the CPU is sort of falling asleep @ ~11%.
This question is pretty old, but I'm working on a similar project. I approached it almost exactly the same way as you, however I added in one additional optimization that helped out a lot.
For each chunk, I determine which sides are completely opaque. I then use that information to do a flood fill through the chunks to cull out the ones that are underground. Note, I'm not checking individual blocks when I do the flood fill, only a precomputed bitmask for each chunk.
When I'm computing the bitmask, I also check to see if the chunk is entirely empty, since empty chunks can obviously be ignored.
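One possible reading of the flood fill above, sketched under stated assumptions: each loaded chunk has a 6-bit mask with one bit per side, set when that side is completely opaque, and the fill starts from the camera's chunk. A neighbour counts as visible if the view can leave the current chunk toward it, and the fill continues through the neighbour only if its facing side is not solid. The face numbering and function name are invented for this example.

```python
from collections import deque

# Face index -> step to the neighbouring chunk, and the index of the opposite face.
FACES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
OPPOSITE = [1, 0, 3, 2, 5, 4]

def visible_chunks(opaque_faces, start):
    """opaque_faces maps chunk position -> 6-bit mask (bit i set when
    face i is completely opaque). Returns the set of chunk positions
    that can be seen starting from `start`."""
    visible = {start}
    expanded = {start}
    queue = deque([start])
    while queue:
        pos = queue.popleft()
        for face, (dx, dy, dz) in enumerate(FACES):
            if (opaque_faces[pos] >> face) & 1:
                continue  # cannot see out of this chunk through a solid side
            nxt = (pos[0] + dx, pos[1] + dy, pos[2] + dz)
            if nxt not in opaque_faces:
                continue  # neighbour is not loaded
            visible.add(nxt)  # its facing surface can be seen
            entry_blocked = (opaque_faces[nxt] >> OPPOSITE[face]) & 1
            if not entry_blocked and nxt not in expanded:
                expanded.add(nxt)
                queue.append(nxt)
    return visible
```

Only one mask lookup per chunk side is needed, so the cost scales with the number of loaded chunks rather than the number of blocks.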
I'm writing a 2D platformer game using SDL with C++. However, I have encountered a huge issue involving scaling to resolution. I want the game to look nice in full HD, so all the images for the game have been created so that the natural resolution of the game is 1920x1080. However, I want the game to scale down to the correct resolution if someone is using a smaller resolution, or to scale up if someone is using a larger resolution.
The problem is that I haven't been able to find an efficient way to do this. I started by using the SDL_gfx library to pre-scale all images, but this doesn't work because it creates a lot of off-by-one errors, where one pixel is lost. And since each of my animations is contained in one image, the animation would shift slightly up or down on each frame as it played.
Then, after some looking around, I tried using OpenGL to handle the scaling. Currently my program draws all the images to an SDL_Surface that is 1920x1080. It then converts this surface to an OpenGL texture, scales this texture to the screen resolution, then draws the texture. This works fine visually, but the problem is that it's not efficient at all. Currently I am getting a max fps of 18 :(
So my question is does anyone know of an efficient way to scale the SDL display to the screen resolution?
It's inefficient because OpenGL was not designed to work that way. Main performance problems with current design:
First problem: You're software rasterizing with SDL. Sorry, but no matter what you do with this configuration, that will be a bottleneck. At a resolution of 1920x1080, you have 2,073,600 pixels to color. Assuming it takes you 10 clock cycles to shade each 4-channel pixel, on a 2GHz processor you're running a maximum of 96.4 fps. That doesn't sound bad, except you probably can't shade pixels that fast, and you still haven't done AI, user input, game mechanics, sound, physics, and everything else, and you're probably drawing over some pixels at least once anyway. SDL_gfx may be quick, but for large resolutions, the CPU is just fundamentally overtasked.
Second problem: Each frame, you're copying data across the graphics bus to the GPU. This is the slowest thing you can possibly do graphics-wise. Image data is probably the worst of that, because there's typically so much of it. Basically, each frame you're telling the GPU to copy two million some pixels from RAM to VRAM. According to Wikipedia, you can expect, for 2,073,600 pixels at 4 bytes each, no more than 258.9 fps, which again doesn't sound bad until you remember everything else you need to do.
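The two estimates above can be reproduced with a few lines of arithmetic. The sketch below only restates the figures from the answer (10 cycles per pixel, a 2 GHz core, 4 bytes per pixel) and works backwards from the quoted 258.9 fps to the bus bandwidth it implies; the variable names are illustrative.

```python
WIDTH, HEIGHT = 1920, 1080
pixels = WIDTH * HEIGHT                    # 2,073,600 pixels per frame

# CPU bound: 10 clock cycles per pixel on a 2 GHz core.
cpu_fps = 2_000_000_000 / (pixels * 10)

# Bus bound: one full 4-byte-per-pixel frame uploaded every frame.
frame_bytes = pixels * 4                   # 8,294,400 bytes per frame
implied_bandwidth = 258.9 * frame_bytes    # bytes per second behind the quoted fps
implied_gib_per_s = implied_bandwidth / 2**30
```

Here `cpu_fps` comes out at about 96.45, and the quoted upload limit corresponds to roughly 2 GiB/s of bus bandwidth.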
My recommendation: switch your application completely to OpenGL. This removes the need to render to a texture and copy to the screen--just render directly to the screen! Also, scaling is handled automatically by your projection matrix (glOrtho/gluOrtho2D for 2D), so you don't have to care about the scaling issue at all--your viewport will just show everything at the same scale. This is the ideal solution to your problem.
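As a rough illustration of the scaling part: the game can keep drawing in fixed 1920x1080 logical coordinates (for example with glOrtho(0, 1920, 1080, 0, -1, 1)) while only the rectangle passed to glViewport depends on the window size. The sketch below computes such a rectangle. Preserving the aspect ratio with letterbox bars is an assumption made here, not something the answer prescribes, and the function name is made up.

```python
def letterbox_viewport(window_w, window_h, logical_w=1920, logical_h=1080):
    """Return (x, y, width, height) of the largest centred rectangle with
    the logical aspect ratio that fits inside the window."""
    scale = min(window_w / logical_w, window_h / logical_h)
    view_w = round(logical_w * scale)
    view_h = round(logical_h * scale)
    return ((window_w - view_w) // 2, (window_h - view_h) // 2, view_w, view_h)
```

For a 1280x1024 window this gives a 1280x720 viewport with equal bars above and below.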
Now, it comes with the one major drawback that you have to recode everything with OpenGL draw commands (which is work, but not too hard, especially in the long run). Short of that, you can try the following ideas to improve speed:
PBOs. Pixel buffer objects can be used to address problem two by making texture loading/copying asynchronous.
Multithread your rendering. Most CPUs have at least two cores, and on newer chips two register states can be saved for a single core (Hyper-Threading). You're essentially duplicating how the GPU solves the rendering problem (have a lot of threads going). I'm not sure how thread safe SDL_gfx is, but I bet that something could be worked out, especially if you're only working on different parts of the image at the same time.
Make sure you pay attention to where SDL places your draw surface. It should probably be SDL_SWSURFACE (because you're drawing on the CPU).
Remove VSync. This can improve performance, even if you're not running at 60Hz.
Make sure you're drawing your original texture--DO NOT scale it up or down to a new one. Draw it at a different size, and let the rasterizer do the work!
Sporadically update: Only update half the image at a time. This will probably close to double your "framerate", and it's (usually) not noticeable.
Similarly, only update the changing parts of the image.
Hope this helps.
I have a rendering step which I would like to perform on a dynamically-generated texture.
The algorithm can operate on rows independently in parallel. For each row, the algorithm will visit each pixel in left-to-right order and modify it in situ (no distinct output buffer is needed, if that helps). Each pass uses state variables which must be reset at the beginning of each row and persist as we traverse the columns.
Can I set up OpenGL shaders, or OpenCL, or whatever, to do this? Please provide a minimal example with code.
If you have access to GL 4.x-class hardware that implements EXT_shader_image_load_store or ARB_shader_image_load_store, I imagine you could pull it off. Otherwise, in-situ read/write of an image is generally not possible (though there are ways with NV_texture_barrier).
That being said, once you start wanting pixels to share state the way you do, you kill off most of your potential gains from parallelism. If the value you compute for a pixel is dependent on the computations of the pixel to its left, then you cannot actually execute each pixel in parallel. Which means that the only parallelism your algorithm actually has is per-row.
That's not going to buy you much.
If you really want to do this, use OpenCL. It's much friendlier to this kind of thing.
Yes, you can do it. No, you don't need 4.X hardware for that, you need fragment shaders (with flow control), framebuffer objects and floating point texture support.
You need to encode your data into 2D texture.
Store "state variable" in 1st pixel for each row, and encode the rest of the data into the rest of the pixels. It goes without saying that it is recommended to use floating point texture format.
Use two framebuffers, and render them onto each other in a loop using a fragment shader that updates the "state variable" in the first column and performs whatever operation you need on another column, which is the "current" one. To reduce the amount of wasted resources, you can limit rendering to the columns you want to process. The NVidia OpenGL SDK examples had "game of life", "GPGPU fluid" and "GPU particles" demos that work in a similar fashion, by encoding data into a texture and then using shaders to update it.
However, just because you can do it doesn't mean you should, and it doesn't mean that it is guaranteed to be fast. Some GPUs might have very high texture memory read speed but relatively slow computation speed (and vice versa), and not all GPUs have many pipelines for processing things in parallel.
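The data flow of this ping-pong scheme can be emulated on the CPU to make it concrete. In the sketch below, `shade` plays the role of the fragment shader: it may read anything from the source buffer but produces only its own output pixel. A running sum per row stands in for the actual per-row algorithm, which the question does not specify. All names are hypothetical, and this is plain Python rather than real shader code.

```python
def shade(src, row, col, current):
    """What the fragment shader would compute for one output pixel."""
    state = src[row][0]
    value = src[row][current]
    if col == 0:
        return state + value      # updated per-row state
    if col == current:
        return state + value      # processed pixel (here: the running sum)
    return src[row][col]          # every other pixel is copied through

def run_passes(data):
    """Emulate ping-pong rendering with one pass per data column."""
    # Column 0 holds the per-row state, reset to 0 before the first pass.
    src = [[0.0] + list(row) for row in data]
    dst = [row[:] for row in src]
    width = len(src[0])
    for current in range(1, width):
        # Within a pass every pixel is independent, which is what
        # lets the GPU process all of them in parallel.
        for row in range(len(src)):
            for col in range(width):
                dst[row][col] = shade(src, row, col, current)
        src, dst = dst, src       # swap the two "framebuffers"
    return [row[1:] for row in src]
```

Calling `run_passes` on a list of rows returns the per-row running sums; the number of passes equals the number of columns, which is the serial cost this approach cannot avoid.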
Also, depending on your app, CUDA or OpenCL might be more suitable.
How can we run an OpenGL application (say, a game) at a higher frame rate, like 500-800 FPS?
For example, AOE 2 runs at more than 700 FPS (I know it uses DirectX). Even though I just clear buffers and swap buffers within the game loop, I can only get about 200 FPS at most. I know that FPS isn't a good measurement (and it also depends on the hardware), but I feel I've missed some concepts in OpenGL. Have I? Can anyone please give me a hint?
I'm getting roughly 5,600 FPS with an empty display loop (GeForce 260 GTX, 1920x1080). Adding glClear lowers it to 4,000 FPS, which is still way over 200...
A simple graphics engine (AoE2 style) should run at about 100-200 FPS (GeForce 8 or similar). Probably more if it's multi-threaded and fully optimized.
I don't know what exactly you do in your loop or what hardware it is running on, but 200 FPS sounds like you are doing something else besides drawing nothing (sleep? game logic stuff? greedy framework? Aero?). The swap-buffers call should not take 5ms even if both framebuffers have to be copied. You can use a profiler to check where most of the CPU time is spent (timing results from gl* functions are mostly useless, though).
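The 5ms figure is simply the frame time at 200 FPS. Converting the frame rates quoted in this thread into milliseconds per frame makes the gap easier to see; the helper below is a trivial sketch with an illustrative name.

```python
def frame_time_ms(fps):
    """Time spent on one frame, in milliseconds."""
    return 1000.0 / fps

empty_loop_ms = frame_time_ms(5600)          # about 0.18 ms
with_clear_ms = frame_time_ms(4000)          # 0.25 ms
asker_ms = frame_time_ms(200)                # 5 ms
clear_cost_ms = with_clear_ms - empty_loop_ms
```

By these numbers glClear accounts for well under a tenth of a millisecond on that machine, so almost all of a 5 ms frame would be going somewhere else.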
If you are doing something with OpenGL (drawing stuff, creating textures, etc.) there is a nice extension to measure times called GL_EXT_timer_query.
Some general optimization tips:
don't use immediate mode (glBegin/glEnd), use VBO and/or display lists+vertex arrays instead
use some culling technique to remove objects outside your view (otherwise OpenGL has to cull every polygon separately)
try minimizing state changes, especially changing the bound texture or vertex buffer
AOE 2 is a DirectDraw application, not Direct3D. There is no way to compare OpenGL and DirectDraw.
Also, check the method you're using for swapping buffers. In Direct3D there are flip method, copy method, and discard method. The best one is discard, which means that you don't care about previous contents in the buffer, and allow the driver to manage them efficiently.
One of the things you seem to miss (judging from your answer/comments, so correct me if I'm wrong) is that you need to determine what to render.
For example, as you said, you have multiple layers and such. Well, the first thing you need to do is not render anything that is off screen (rendering it anyway is possible and is sometimes done). What you should also do is not render things that you are certain are not visible: for example, if some area of the top layer is not transparent (or is filled up), you should not render the layers below it.
In general what I'm trying to say is that it is in most cases better to eliminate invisible things in the logic than to render all things and just let the things on top end up in the rendered image.
If your textures are small, try to combine them in one bigger texture and address them via texture coordinates. That will save you a lot of state changes. If your textures are e.g. 128x128, you can put 16 of them in one 512x512 texture, bringing your texture related state changes down by a factor of 16.
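Addressing a tile inside such an atlas comes down to computing its texture-coordinate rectangle. A minimal sketch for the 128x128-in-512x512 case follows, assuming tiles are numbered row by row from the corner where (u, v) = (0, 0) and that no padding is left between tiles; the function name is made up.

```python
def atlas_uv(index, tile=128, atlas=512):
    """Return (u0, v0, u1, v1) for tile `index` in a square atlas."""
    per_row = atlas // tile               # 4 tiles per row, 16 in total
    col, row = index % per_row, index // per_row
    size = tile / atlas                   # 0.25 of the atlas per tile
    return (col * size, row * size, (col + 1) * size, (row + 1) * size)
```

With filtering or mipmapping enabled, neighbouring tiles can bleed into each other at the edges, so real atlases often add a small border around each tile.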