Sprite Batch concept - C++

I would like to confirm the following: is it fine to use just one sprite batch to draw both fonts and other animated sprites? If so, how many quads can be batched using a single sprite batch? And is that limit handled by the DirectX API, or by the GPU?

Yes, it is fine to use one sprite batch object for fonts and other sprites. In fact, it is probably better that way.
The number of sprites that can be batched is up to the implementation. If you are using the SpriteBatch class from the DirectXTK, it uses a growing array as you add sprites, so there is no real limit to the number of sprites you can give it (other than memory). Internally it creates a vertex buffer that can hold 2048 sprites, i.e. 2048 * 4 vertices. This doesn't cap the number of sprites you can send to the SpriteBatch; it just means that if you queue up 3000 sprites, for example, it will need to make at least two draw calls to render everything (more if you are using multiple textures).
So the number of sprites that can be drawn in one call depends on the size of the vertex buffer the implementation has created, and the maximum size of a vertex buffer ultimately depends on how much memory is available.
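For reference, here is a minimal sketch of mixing sprites and text in one DirectXTK SpriteBatch (DirectX 11 flavour; the function, texture, and font names here are placeholders, not part of the library):

    #include <d3d11.h>
    #include <DirectXMath.h>
    #include <SpriteBatch.h>   // DirectXTK
    #include <SpriteFont.h>
    using namespace DirectX;

    // Everything submitted between Begin() and End() goes through the same
    // batch; End() issues the actual draw calls, grouped by texture.
    void drawHud(SpriteBatch* batch, SpriteFont* font,
                 ID3D11ShaderResourceView* heroTexture)
    {
        batch->Begin();
        batch->Draw(heroTexture, XMFLOAT2(100.f, 100.f));             // sprite
        font->DrawString(batch, L"Score: 42", XMFLOAT2(10.f, 10.f));  // text
        batch->End();
    }

Note that SpriteFont::DrawString simply submits quads into the same batch you pass it, which is why mixing text and sprites costs nothing extra.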

Related

How many draw calls are acceptable in Vulkan?

I've been working on a Vulkan renderer and am in a bit of a pickle. Currently I am using Vulkan to render 2D sprites, and I just imported a whole map to draw. The map is 40x40, i.e. 1600 tiles. I cannot instance/batch these, as there are moving objects in the scene and I may need to interject draw calls in between (some objects need to be rendered in front of others). However, when I render these 1600 sprites individually my CPU chugs, and it takes ~20 ms to accomplish JUST the sprites. This happens in a separate thread and does the following:
Start command buffer & render pass.
For every sprite to draw:
Set up the translation matrix.
Fetch the material if it's not cached.
If this command buffer is not bound to the pipeline, bind it.
Bind the descriptor set given by the material, if not already bound.
Push the translation matrix to the pipeline using a push constant.
Draw.
End command buffer & render pass & submit.
My question, I guess, is: is 1600 draw calls too many? Should I try to find ways to batch this? Would it make more sense to spend those clock cycles building one big buffer on the GPU and drawing it only once? I figured that would be less efficient, since I already only submit once for all the commands given.
Yes, 1600 draw calls is too many for this type of application. It sounds like you could use a single vkCmdDrawIndexedIndirect() instead.
You would just need to create SSBOs for your per-sprite matrices and texture-sampler indices, and index into them for each draw using gl_DrawIDARB in the shaders (don't forget to enable VK_KHR_SHADER_DRAW_PARAMETERS_EXTENSION_NAME).
Your CPU-side pre-draw preparation each frame would consist of setting the correct vertex/index buffer offsets within each VkDrawIndexedIndirectCommand structure, as well as setting up any required texture loads and populating your descriptors.
If draw order matters to you, your application can track a depth per sprite and make sure the commands are written out in the correct order.
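As a rough illustration, this is what the indirect path might look like. Names like indirectBuffer and spriteCount are placeholders, and the upload of the command array into the buffer is elided:

    #include <vulkan/vulkan.h>
    #include <vector>

    // Records one indirect draw covering every sprite. `indirectBuffer` must
    // have been created with VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT and filled
    // with the `commands` array built below (upload elided).
    void recordSpriteDraws(VkCommandBuffer cmd, VkBuffer indirectBuffer,
                           uint32_t spriteCount)
    {
        std::vector<VkDrawIndexedIndirectCommand> commands(spriteCount);
        for (uint32_t i = 0; i < spriteCount; ++i) {
            commands[i].indexCount    = 6;  // one quad = two indexed triangles
            commands[i].instanceCount = 1;
            commands[i].firstIndex    = 0;  // all quads share one index buffer
            commands[i].vertexOffset  = 0;
            commands[i].firstInstance = i;  // extra per-sprite index if needed
        }
        // ... copy `commands` into indirectBuffer here (host-visible or staged) ...

        // One call instead of 1600; the vertex shader fetches its matrix from
        // the SSBO using gl_DrawIDARB.
        vkCmdDrawIndexedIndirect(cmd, indirectBuffer, 0, spriteCount,
                                 sizeof(VkDrawIndexedIndirectCommand));
    }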

How to do particle binning in OpenGL?

Currently I'm creating a particle system, and I would like to transfer most of the work to the GPU using OpenGL, both to gain experience and for performance reasons. At the moment there are multiple particles scattered through space (these are currently still created on the CPU). I would, more or less, like to create a histogram of them. If I understand correctly, I would first translate all the particles from world coordinates to screen coordinates in a vertex shader. Then I want to do the following:
For each pixel, a hit count of how many particles are inside it. Each particle also has several properties (e.g. a colour), and I would like to sum them per pixel as well. Would this be possible using OpenGL? If so, how?
The best tool I can recommend for keeping the whole data set on the GPU (if it fits in GPU memory) is an SSBO (Shader Storage Buffer Object).
You need the data after it has been transformed (e.g. by a projection); an SSBO is still your best option for that:
In the fragment shader, you read the properties already accumulated for the rendered pixel and write the modified properties (number of particles at this pixel, colour, etc.) back to the same index in the buffer.
Due to the parallel nature of the GPU, several invocations coming from different particles may be doing the work for the same index concurrently, so you need to handle this yourself. Read up on the OpenGL memory model and atomic operations.
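A minimal sketch of the atomic SSBO path (GL 4.3+; the binding point, the screenWidth/screenHeight names, and the shader layout are assumptions, not a fixed recipe):

    // C++ side: one uint counter per screen pixel, bound at binding = 0.
    GLuint createBinBuffer(int screenWidth, int screenHeight)
    {
        GLuint ssbo = 0;
        glGenBuffers(1, &ssbo);
        glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
        glBufferData(GL_SHADER_STORAGE_BUFFER,
                     screenWidth * screenHeight * sizeof(GLuint),
                     nullptr, GL_DYNAMIC_DRAW);
        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);
        return ssbo;
    }

    // Fragment shader (as a C++ string): each particle's fragment atomically
    // increments the counter of the pixel it covers.
    const char* binningFrag = R"(
        #version 430
        layout(std430, binding = 0) buffer Bins { uint hitCount[]; };
        uniform int screenWidth;
        out vec4 fragColor;
        void main() {
            ivec2 p = ivec2(gl_FragCoord.xy);
            atomicAdd(hitCount[uint(p.y * screenWidth + p.x)], 1u);
            fragColor = vec4(0.0);  // colour output unused for binning
        }
    )";

Before reading the buffer back on the CPU, you also need a glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) so the shader writes are visible.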
Another approach, more limited, is to use blending.
The idea is that each fragment increments the current colour value in the framebuffer. This can be done using GL_FUNC_ADD with glBlendEquationSeparate and outputting 1/255 (one normalized-integer step) from the fragment shader for each RGBA component you want to count.
The limitation comes from the [0, 255] range: only up to 255 particles can be counted in the same pixel; anything beyond that is clamped to the range and so "lost".
You have four components (RGBA), so four properties can be handled per render target, but you can attach several renderbuffers to an FBO.
You can read the result back with glReadPixels. Call glReadBuffer with GL_COLOR_ATTACHMENTi first if you use an FBO instead of the default framebuffer.
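A sketch of this blending variant (the fragment shader is assumed to output 1.0/255.0 in each component it counts; the width/height parameters and the already-bound FBO are assumptions):

    #include <vector>

    // Accumulate counts additively, then read the clamped results back.
    std::vector<unsigned char> accumulateCounts(int width, int height)
    {
        glEnable(GL_BLEND);
        glBlendEquation(GL_FUNC_ADD);     // or glBlendEquationSeparate
        glBlendFunc(GL_ONE, GL_ONE);      // dst = dst + src, saturates at 255
        // ... draw the particles here ...
        glReadBuffer(GL_COLOR_ATTACHMENT0);  // reading from the FBO attachment
        std::vector<unsigned char> counts(width * height * 4);
        glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE,
                     counts.data());
        return counts;
    }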

Reducing OpenGL draw calls vs. binding smaller textures

I'm making an isometric (2D) game with SFML. I handle the drawing order (depth) by sorting all drawables by their Y position, and it works perfectly well.
The game uses an enormous amount of art assets, such that the NPCs, monsters and player graphics alone are contained in their own 4K texture atlas. It is logistically not possible for me to put everything into one atlas; the target devices would not be able to handle textures of that size. Please do not focus on WHY it's impossible, and understand that I simply MUST use separate files for my textures in this case.
This causes a problem. Let's say I have a level with 2 NPCs and 2 pillars. The NPCs are in NPCs.png and the pillars are in CastleLevel.png. Depending on where the NPCs move, the drawing order (and hence the OpenGL texture binding order) can differ. Let's say the Y positions are sorted like this:
npc1, pillar1, npc2, pillar2
This would mean that OpenGL has to switch between the 2 textures twice. My question is, should I:
a) keep the texture atlases, OR
b) divide them all into smaller PNG files (1 PNG per NPC, 1 PNG per pillar, etc.)? Since the textures must be changed multiple times anyway, would it improve performance if OpenGL had to bind smaller textures instead?
Is it worth keeping the texture atlases because it will SOMETIMES reduce the number of draw calls?
"Since the textures must be changed multiple times anyway, would it improve performance if OpenGL had to bind smaller textures instead?"
Almost certainly not. The cost of a texture bind is fixed; it isn't based on the texture's size.
It would be better for you to either:
Properly batch your rendering. That is, when you say "draw NPC1", you don't actually draw it yet. You stick some data in an array, and later on, you execute "draw NPCs", which draws all of the NPCs you've buffered in one go.
Use a bigger texture atlas, probably involving array textures. Each layer of the array texture would be one of the atlases you load. This way, you only ever bind one texture to render your scene (see the sketch after this list).
Deal with it. 2D games aren't exactly stressful on the GPU or CPU. The overhead from the additional state changes will not be what knocks you down from 60 FPS to 30 FPS.
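Here is what the array-texture route might look like, as a sketch assuming a GL 4.2+ context (SFML won't create this for you, so it's raw GL); the 4096x4096 size and the npcPixels/castlePixels names are placeholders:

    // Pack both atlases into one GL_TEXTURE_2D_ARRAY, one atlas per layer,
    // so the whole scene renders with a single texture bind.
    GLuint createAtlasArray(const void* npcPixels, const void* castlePixels)
    {
        GLuint tex = 0;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D_ARRAY, tex);
        glTexStorage3D(GL_TEXTURE_2D_ARRAY, 1, GL_RGBA8, 4096, 4096, 2);
        glTexSubImage3D(GL_TEXTURE_2D_ARRAY, 0, 0, 0, 0,  // layer 0: NPCs
                        4096, 4096, 1, GL_RGBA, GL_UNSIGNED_BYTE, npcPixels);
        glTexSubImage3D(GL_TEXTURE_2D_ARRAY, 0, 0, 0, 1,  // layer 1: castle
                        4096, 4096, 1, GL_RGBA, GL_UNSIGNED_BYTE, castlePixels);
        // GLSL side: uniform sampler2DArray atlas; texture(atlas, vec3(uv, layer));
        return tex;
    }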

Mouse-picking using off-screen rendering?

I have a 3D scene with a lot of simple objects (potentially a huge number of them), so I don't think ray tracing is a very good idea for picking objects with the mouse.
I'd like to do something like this:
render all these objects into an off-screen OpenGL buffer, using a pointer to the current object instead of its colour
render the same scene onto the screen, using the real colours
when the user picks a point at (x, y) screen coordinates, I take the value from the off-screen buffer (at the corresponding position) and get a pointer to the object back
Is this possible? If so, what type of buffer should I choose for "drawing with pointers"?
I suppose you can render in two passes: first to a buffer or texture holding the data you need for picking, and then, in a second pass, the data actually displayed. I am not really familiar with OpenGL, but in DirectX you can do it like this: http://www.two-kings.de/tutorials/dxgraphics/dxgraphics16.html. You could then find a way to analyse the texture. Keep in mind that you are rendering the data twice; this will not necessarily double your render time (you do not need to apply all your shaders and effects in the picking pass), but it will increase it quite a lot. Also, each frame you are essentially sending at least 2 MB of data from GPU to CPU (assuming 1 byte per pixel on a 2K monitor), and that grows if you have more than 256 objects on screen.
Edit: here is how to do the same with OpenGL, although I cannot verify that the tutorial is correct: http://www.opengl-tutorial.org/intermediate-tutorials/tutorial-14-render-to-texture/ (there are also many more if you look around on Google).
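As a concrete OpenGL sketch of the readback step: this assumes a framebuffer pickFbo with a GL_R32UI colour attachment and a shader that writes each object's integer ID (an ID that indexes a CPU-side table is safer than a raw pointer, which does not fit reliably in a colour channel). All names here are hypothetical:

    // Reads back the object ID under the mouse after the picking pass.
    GLuint pickObjectId(GLuint pickFbo, int mouseX, int mouseY, int windowHeight)
    {
        GLuint pickedId = 0;  // reserve 0 for "no object"
        glBindFramebuffer(GL_READ_FRAMEBUFFER, pickFbo);
        glReadBuffer(GL_COLOR_ATTACHMENT0);
        glReadPixels(mouseX, windowHeight - mouseY - 1, 1, 1,  // GL is bottom-up
                     GL_RED_INTEGER, GL_UNSIGNED_INT, &pickedId);
        return pickedId;  // map back to the object via an ID -> pointer table
    }

Reading a single pixel like this also avoids transferring the whole buffer to the CPU every frame.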

Draw a bunch of elements generated by CUDA/OpenCL?

I'm new to graphics programming, and need to add on a rendering backend for a demo we're creating. I'm hoping you guys can point me in the right direction.
Short version: Is there any way to send OpenGL an array of data for distinct elements, without having to issue a draw command for each element distinctly?
Long version: We have a CUDA program (will eventually be OpenCL) which calculates a bunch of data for a bunch of objects for us. We then need to render these objects using, e.g., OpenGL.
The CUDA kernel can generate our vertices, and using OpenGL interop it can shove these into an OpenGL VBO without having to transfer the data back to host memory. But the problem is that we have a bunch (upwards of a million is our goal) of distinct objects. It seems like our best bet here is allocating one VBO and putting every object's vertices into it. Then we can call glDrawArrays with offsets and lengths of each element inside that VBO.
However, each object may have a variable number of vertices (though the total vertices in the scene can be bounded.) I'd like to avoid having to transfer a list of start indices and lengths from CUDA -> CPU every frame, especially given that these draw commands are going right back to the GPU.
Is there any way to pack a buffer with data such that we can issue only one call to OpenGL to render the buffer, and it can render a number of distinct elements from that buffer?
(Hopefully I've also given enough info to avoid a XY problem here.)
One way would be to get away from understanding these as individual objects and instead make them a single large object drawn with a single draw call. The question is: what data distinguishes the objects from each other, i.e. what do you change between the individual calls to glDrawArrays/glDrawElements?
If it is something simple, like a colour, it would probably be easiest to supply it as an additional per-vertex attribute. This way you can render all objects as one single large object, using a single draw call, with the individual sub-objects (which now really only exist conceptually) coloured correctly. The memory cost of the additional attribute may well be worth it.
If it is something a little more complex (like a texture), you may still be able to index it using an additional per-vertex attribute: either an index into a texture array (texture arrays should be supported on CUDA/OpenCL-capable hardware) or a texture coordinate into a particular subregion of a single large texture (a so-called texture atlas).
But if the difference between those objects is something more complex, like a different shader, you may really need to render individual objects and make individual draw calls. Even then, you don't necessarily need to make a round-trip to the CPU. With the ARB_draw_indirect extension (core since GL 4.0; it may also be exposed on GL 3-class hardware, and thus CUDA/CL-capable hardware), you can source the arguments of a glDrawArrays/glDrawElements call from an additional buffer (which you can write with CUDA/CL like any other GL buffer). So you can assemble the offset/length information for each individual object on the GPU and store it in a single buffer. Then you do your glDrawArraysIndirect loop, offsetting into this single draw-indirect buffer (with the offset between the individual commands now being constant).
But if the only reason for issuing multiple draw calls is that you want to render the objects as individual GL_TRIANGLE_STRIPs or GL_TRIANGLE_FANs (or, god beware, GL_POLYGONs), you may want to reconsider and just use a bunch of GL_TRIANGLES, so that you can render all objects in a single draw call. The (possible) time and memory savings from using triangle strips are likely to be outweighed by the overhead of multiple draw calls, especially when rendering many small triangle strips. If you really want to use strips or fans, you can introduce degenerate triangles (by repeating vertices) to separate them from each other even within a single draw call, or look into the glPrimitiveRestartIndex function introduced with GL 3.1.
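A sketch of that indirect loop (names assumed; indirectBuf is the GL buffer the CUDA kernel wrote the per-object commands into):

    // Command layout defined by ARB_draw_indirect / GL 4.0.
    struct DrawArraysIndirectCommand {
        GLuint count;          // number of vertices for this object
        GLuint instanceCount;  // 1
        GLuint first;          // offset of the object's vertices in the VBO
        GLuint baseInstance;   // must be 0 before GL 4.2
    };

    void drawAllIndirect(GLuint indirectBuf, GLsizei objectCount)
    {
        glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuf);
        for (GLsizei i = 0; i < objectCount; ++i)
            glDrawArraysIndirect(GL_TRIANGLES,
                reinterpret_cast<const void*>(
                    i * sizeof(DrawArraysIndirectCommand)));
    }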
Probably not optimal, but you could issue a single glDrawArrays over your whole buffer...
If you use GL_TRIANGLES, you can fill your buffer with zeroes and write only the needed vertices in your kernel. This way, "empty" regions of your buffer will be drawn as zero-area polygons (= degenerate polygons, i.e. not drawn at all).
If you use GL_TRIANGLE_STRIP, you can do the same, but you'll have to duplicate your first vertex in order to make a fake triangle between (0,0,0) and your mesh.
This can seem overkill, but:
- you'll have to be able to handle that many vertices anyway, and
- degenerate triangles use no fill rate, so they are almost free (the vertex shader still runs for them, though).
A probably better solution would be to use glDrawElements instead: in your kernel, you also generate an index list for your whole buffer, which lets you skip regions of the buffer entirely.
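If the objects need to stay strips, primitive restart (GL 3.1+) is a tidier variant of the same idea; here is a sketch (the sentinel value and the totalIndexCount name are assumptions):

    // The kernel writes RESTART between consecutive objects in the index
    // buffer; one glDrawElements then draws every strip, skipping the breaks.
    void drawAllStrips(GLsizei totalIndexCount)
    {
        const GLuint RESTART = 0xFFFFFFFFu;  // must not collide with a real index
        glEnable(GL_PRIMITIVE_RESTART);
        glPrimitiveRestartIndex(RESTART);
        glDrawElements(GL_TRIANGLE_STRIP, totalIndexCount,
                       GL_UNSIGNED_INT, nullptr);
    }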