If I use fixed-point (or integers with 1 describing the smallest game unit) to describe my vertex vectors, how can I set up OpenGL/Eigen transformations to work with it? If I'm doing this in my vertex shader:
gl_Position = projectionMatrix * viewMatrix * modelMatrix * vec4(in_Position, 1.0)
If I pass in_Position in as a vec3 of GL_INT, while I pass in the matrices as GL_FLOAT mat4, will the proper casting be done? Is there a performance cost?
Is it possible to prepare my transformation matrices to be in fixed-point as well?
This is being done with a 2D game, which I think makes it more feasible than with 3D. I would really prefer the accuracy, since it seems there is degradation of position on large maps when things get far away from the origin. I realize I could probably get away with only object position being an integer while the vertices are still described as floats. However, I think my collision scheme will work better with fixed-point vertices. What is generally the performance difference?
This will imply an int-to-float conversion that will penalize your performance. You should cast in_Position to a vec3 at CPU-to-GPU copy time. If you use an Eigen Matrix object to store the data on the CPU, you can cast it with:
MatrixXf data_as_float = data_as_int.cast<float>();
Then call glBufferData with data_as_float.
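A hedged sketch of that cast-and-upload step (assuming an Eigen matrix with one column per vertex; vbo and vertexCount are illustrative names):

// Convert the integer vertex data to floats on the CPU, then upload it.
Eigen::MatrixXi data_as_int(3, vertexCount);   // one xyz column per vertex
// ... fill data_as_int ...
Eigen::MatrixXf data_as_float = data_as_int.cast<float>();

glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER,
             sizeof(float) * data_as_float.size(),
             data_as_float.data(),
             GL_STATIC_DRAW);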
Ok, after some experimentation, I've settled on a solution.
gl_Position = projviewMatrix * vec4(ivec3(in_Position - camera), 1.0);
camera is a uniform uvec3, and in_Position is the uvec3 position input. Translation is performed as a separate operation, while the view scaling, rotation, and projection are done with a mat4 of floats (projviewMatrix) as usual.
Care must be taken to ensure the proper types and input commands (glVertexAttribIPointer) are used. OpenGL seems very eager to cast to float yet leave the data in an integer type, so any small error will result in mangled input.
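For illustration, a sketch of that setup with placeholder names (the key point is glVertexAttribIPointer, which keeps the attribute integral instead of converting it to float like glVertexAttribPointer would, and glUniform3ui for the uvec3 camera uniform):

// Integer position attribute: 3 unsigned ints per vertex, tightly packed.
glBindBuffer(GL_ARRAY_BUFFER, positionVbo);
glVertexAttribIPointer(0, 3, GL_UNSIGNED_INT, 0, (void*)0);   // matches uvec3 in_Position
glEnableVertexAttribArray(0);

// Camera offset as a plain unsigned-int uniform (matches uniform uvec3 camera).
glUniform3ui(cameraLocation, camX, camY, camZ);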
It simply is not feasible to perform the projviewMatrix multiply in fixed-point, since you do not have access to intermediate 64-bit storage for the multiplications. Only if the bits used by in_Position and projviewMatrix summed to 32 would it approach usability, but considering that coordinates for rendering will be so close to the origin, and that no operations are saved (you still need to shift after the multiply, and the GPU will take as long for ints as for floats), there is no reason to perform fixed-point arithmetic after the position has been centered by camera.
Of course, this is ignoring the royal pain it is to actually manipulate the integer position data. I really wouldn't recommend it.
My vertex positions V∈ℝ³ⁿ should be the result of multiplication of a large, dense matrix B∈ℝ³ⁿˣᵏ with a vector z∈ℝᵏ:
V = B z
In my case, k≈300, so as I understand this is far too big to store the relevant rows Bᵢ for the ith vertex as vertex attributes.
Currently, I'm computing this multiplication in the vertex shader by setting z as a uniform, packing B into a square texture, and then using texelFetch in a for-loop. Something like:
uniform int n; // number of vertices
uniform int k; // size of z
uniform int s; // size of texture square where B is packed
uniform float z[512];
uniform sampler2D tex;
in float id; // index of vertex
out vec3 v;
void main()
{
    v = vec3(0, 0, 0);
    for (int j = 0; j < k; j++)
    {
        int index = int(id) * k + j;
        int si = index % s;
        int sj = (index - si) / s;
        v = v + texelFetch(tex, ivec2(si, sj), 0).xyz * z[j];
    }
}
On my MacBook Pro M1, this works reasonably well for n≈50,000 and k≈100. Increasing either, I start to get dropped frames.
Is doing this in a vertex shader a good idea?
My computation is similar to blendshapes. How are those typically computed on the GPU?
Ideally, I'd like to stick to OpenGL. If not, is there an OpenCL or some other way to best achieve this?
I'm a little confused about your shader. z is never used, but there's access to an array q. I assume this is a typo?
Leaving aside the choice of shader type: depending on whether you're compute bound or memory bound, you may also be leaving some possibly substantial optimisations on the table:
Your si/sj computations use integer division and remainder. Try to avoid those expensive calculations in your hot loop. Instead, perform them once before the loop, then increment si on every iteration, followed by a simple test for whether you've hit the end of the row: if (si >= s) { si = 0; ++sj; } I know they say to avoid branching in shaders, but a simple conditional like this will either not branch at all, or at least has a good chance of being cheaper than integer division. (No point guessing: measure which is faster.)
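A sketch of that restructuring, reusing the names from the shader in the question (measure whether it actually helps on your hardware):

    int index = int(id) * k;
    int si = index % s;           // computed once, outside the loop
    int sj = index / s;
    v = vec3(0, 0, 0);
    for (int j = 0; j < k; j++)
    {
        v = v + texelFetch(tex, ivec2(si, sj), 0).xyz * z[j];
        si++;                      // step through the texture row by row
        if (si >= s) { si = 0; ++sj; }
    }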
You say your matrix is dense. Is your input vector z/q dense, though? If a lot of its elements are 0, it may be wiser to skip the parts of the matrix that would be multiplied by 0 altogether, either by explicitly testing for 0 in the input vector elements, or by passing the input vector in as a list of nonzero elements (arrays containing the nonzero values and their indices, plus the number of nonzero values). Whether this is worth it depends on the extent to which your performance is limited by memory bandwidth; if more than half of the elements are zero, it's definitely worth trying, but given how expensive it is to read your huge matrix, it's likely worth it even for a small number of zeroes.
Finally, you don't specify how you arrive at your giant matrix B. If you can decompose its calculation at all, that may be worth doing if it reduces the amount of data you need to read in your vertex shader.
Shader types
One of the main things that should influence your decision of whether to use an expensive vertex shader or a compute shader is whether you are re-running the computation with the same values at all. If these are all single-use calculations, you might be able to stick with a vertex shader. If your z/q vector and B matrix are constant over multiple render passes, you'll want to cache the result.
A compute shader will also give you more control over parallelism, and you don't have to mess around with packing matrices into textures: just use arrays.
What I mean by more control over parallelism: Instead of performing each matrix multiplication slice sequentially, you can compute each vector element in a work-group, with each work-item in the group computing a few slices, and then the group performs a parallel reduction in local/group memory to accumulate the final result. This is often more efficient for equally occupying all your GPU's shader units.
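For what it's worth, here is a minimal compute-shader sketch of that reduction scheme. It's a hedged illustration rather than a drop-in solution: it assumes OpenGL 4.3 (so it won't run on macOS, as discussed below), it assumes B is stored in an SSBO as three consecutive floats per (vertex, j) pair, and the buffer names and work-group size of 64 are illustrative.

#version 430
layout(local_size_x = 64) in;                      // one work-group per vertex

layout(std430, binding = 0) readonly buffer MatB  { float B[]; };   // n*k*3 floats
layout(std430, binding = 1) readonly buffer VecZ  { float z[]; };   // k floats
layout(std430, binding = 2) writeonly buffer OutV { float V[]; };   // n*3 floats

uniform int k;

shared vec3 partial[64];

void main()
{
    uint vertex = gl_WorkGroupID.x;
    uint lane   = gl_LocalInvocationID.x;

    // Each work-item accumulates a strided subset of the k slices.
    vec3 acc = vec3(0.0);
    for (uint j = lane; j < uint(k); j += gl_WorkGroupSize.x)
    {
        uint base = (vertex * uint(k) + j) * 3u;
        acc += vec3(B[base], B[base + 1u], B[base + 2u]) * z[j];
    }
    partial[lane] = acc;
    barrier();

    // Parallel reduction in shared (group-local) memory.
    for (uint stride = gl_WorkGroupSize.x / 2u; stride > 0u; stride /= 2u)
    {
        if (lane < stride) partial[lane] += partial[lane + stride];
        barrier();
    }

    if (lane == 0u)
    {
        uint base = vertex * 3u;
        V[base] = partial[0].x; V[base + 1u] = partial[0].y; V[base + 2u] = partial[0].z;
    }
}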
Choice of API/Platform
If you're targeting macOS and other Apple platforms, you might want to consider using Metal instead of OpenGL and OpenCL, as the latter two are marked deprecated by Apple, and tooling (profiling, debugging, …) support is non-existent compared to Metal. (Yes, this is incredibly annoying.) On other platforms, OpenGL offers compute shaders as well, but Apple stopped implementing new OpenGL features before compute shaders arrived, so the only way to use them on macOS is via OpenCL or Metal. (Using Metal should let you determine whether your shader's performance is compute or memory bound, for example, and so lets you better guide your optimisation.)
Is it possible to obtain the location of a glsl mat4 individual column? I want to update the value of an individual column of a matrix defined in the shader without actually having the set the whole matrix uniform.
I have a game where the camera orientation stays often the same but the translation part changes frequently. My idea was to only update the affected translation part of the VP matrix (projection * view) in order to squeeze some performance.
You cannot. You can no more set only one column of a matrix uniform than you can set the high two bytes of a uint uniform without setting the other two bytes too. When it comes to uniforms, matrices are just as much a basic type as a vector or scalar.
My idea was to only update the translation part of the view matrix in order to squeeze some performance.
This will not do that. The performance of doing an extremely small CPU-to-GPU memory transfer will be dominated by the overhead of doing any CPU-to-GPU transfer. That is, the cost to transfer 16 bytes will be basically identical to the cost of transferring 64 bytes. The amount of data transferred only becomes significant to the cost of the transfer when that amount starts getting large (kilo/mega bytes).
So this is a waste of time. Just transfer the matrix and move on. Premature optimization is the root of all evil.
Is it possible to obtain the location of a glsl mat4 individual column?
No.
For a mat uniform variable type, you need to use the appropriate glUniformMatrix...() call, and you cannot update individual parts of it.
Possible alternatives:
Use a Uniform Buffer Object, where you can individually control every single byte, as already suggested in @Rabbid76's comment.
Use uniform vec4 mymat[4] instead of mat4 and construct the matrix in the shader (if needed) or directly use the individual column vectors for the calculations.
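A minimal sketch of the second alternative (illustrative names; in_Position and modelMatrix stand in for whatever your shader already uses):

uniform vec4 vpColumns[4];   // the four columns of projection * view
uniform mat4 modelMatrix;
in vec3 in_Position;

void main()
{
    mat4 vp = mat4(vpColumns[0], vpColumns[1], vpColumns[2], vpColumns[3]);
    gl_Position = vp * modelMatrix * vec4(in_Position, 1.0);
}

On the C++ side you could then update only the translation column by calling glUniform4fv with count 1 on the location queried for "vpColumns[3]", though as noted above this is unlikely to be measurably faster than uploading the whole matrix.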
No, you can't. Generally you don't need to do this, because modern hardware does this kind of CPU-to-GPU transfer so fast that there is no sensible difference between sending a column and sending a whole matrix. But in my case (on embedded systems using OpenGL ES) I hit a bottleneck in a similar situation, so I changed the uniforms to attributes: each row of the matrix becomes a vec4 attribute, and a VBO is used to send the data, with frequent changes handled by buffer streaming. I'm not saying this is a good approach that will always increase performance (hardware differs), but in my case it works fine.
I'm currently working on 2D graphics, and as far as I can tell every vertex is ultimately processed as a 4D point in homogeneous space. So I say to myself: what a waste of resources! I gather that the hardware is essentially designed to handle 3D scenes, and as such may be hardcoded to do 4D linear algebra. Yet, is there a way to write shaders (or enable a bunch of options) so that only genuine 2D coordinates are used in memory? I know one could embed two 2x2 matrices in a 4x4 matrix, but the gl_Position variable being a vec4 seems to end the track here. I'm not looking for some kind of "workaround" hack like this, but rather a canonical way to make OpenGL do it, like a specific mode/state.
I've not been able to find either sample code or even a simple mention of such a fact on the net, so I gather it should simply be impossible/not desirable for, say, performance reasons. Is that so?
Modern GPUs are actually scalar architectures, and in GLSL you can also work with shorter vectors: vec2 is a perfectly valid type, and you can create vertex arrays with just 2 scalar elements per vector, as defined by the size parameter of glVertexAttribPointer.
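For example (attribute name and program handle are placeholders):

GLint loc = glGetAttribLocation(program, "in_Position2D");
glEnableVertexAttribArray(loc);
// size = 2: only two floats per vertex are stored and uploaded.
glVertexAttribPointer(loc, 2, GL_FLOAT, GL_FALSE, 2 * sizeof(GLfloat), (void*)0);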
As Andon M. Coleman commented, OpenGL will internally perform a vec4(v, [0, [0]], 1) construction for any data passed in as a vertex attribute of dimension < 4.
In the vertex shader you must assign a vec4 to gl_Position. But you can trivially expand a vec2 to a vec4:
vec2 v2;
gl_Position = vec4(v2, 0, 1);
Yes, the gl_Position output always must be a vec4, due to the fact OpenGL specifies operations in clip space. But this is not really a bottleneck at all.
All credit goes to Andon M. Coleman, who perfectly answered the question as a comment. I just quote it here for the sake of completion:
«Absolutely not. Hardware itself is/was designed around 4-component data and instructions for many years. Modern GPUs are scalar friendly, and they have to be considering the push for GPGPU (but older NV GPUs pre-GeForce 8xxx have a purely vector ALU). Now, as for vertex attributes, you get 16 slots of size (float * 4) for storage. This means whether you use a vec2 or vec4 vertex attribute, it actually behaves like a vec4. This can be seen if you ever write vec4 and only give enough data for 2 of the components - GL automatically assigns Z = 0.0 and W = 1.0.
Furthermore, you could not implement clipping in 2D space with 2D coordinates. You need homogeneous coordinates to produce NDC coordinates. You would need to use window space coordinates, which you cannot do from a vertex shader. After the vertex shader finishes, GL will perform clipping, perspective divide and viewport mapping to arrive at window space. But window space coordinates are still 4D (the Z component may not contribute to a location in window space, but it does affect fragment tests). »
I'm trying to make a GLSL shader that multiplies a 90x10 matrix with a 10x1 one. The 90x1 result corresponds to the xyz values of 30 vertices. The first large matrix is only loaded at startup. The other matrix, on the other hand, can change at each render.
How could this be done? I'm guessing the first matrix could be stored as a texture, but I have no idea what to do with the second.
Just pass the second matrix as a uniform array of floats.
uniform float vec10[10];
and perform the multiplication element by element.
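A hedged sketch of that element-by-element multiplication, assuming the 90x10 matrix is packed one float per texel into a 10-wide, 90-tall GL_R32F texture (texel (j, row) = B[row][j]); bigMatrix is an illustrative name and gl_VertexID indexes the 30 output vertices:

uniform sampler2D bigMatrix;   // assumed 10 x 90 single-channel float texture
uniform float vec10[10];

void main()
{
    vec3 result = vec3(0.0);
    int row = gl_VertexID * 3;   // each vertex owns three consecutive rows (x, y, z)
    for (int j = 0; j < 10; ++j)
    {
        result.x += texelFetch(bigMatrix, ivec2(j, row + 0), 0).r * vec10[j];
        result.y += texelFetch(bigMatrix, ivec2(j, row + 1), 0).r * vec10[j];
        result.z += texelFetch(bigMatrix, ivec2(j, row + 2), 0).r * vec10[j];
    }
    // ... use result as the vertex position and write gl_Position ...
}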
Note that if that's too slow, you can try packing your large texture in such a way that you can read 4 elements with a single texelFetch.
If you want to see the syntax for binding uniform arrays, consult http://www.opengl.org/wiki/Uniform_(GLSL) .
Note that it's also completely legal to store this second matrix in a texture as well; I'm just not sure of the performance impact of doing so as opposed to sending it as a uniform. But get it working first, profile and optimize later.
Is it possible in desktop GLSL to pass a fixed size array of floats to the vertex shader as an attribute? If yes, how?
I want to have per vertex weights for character animation so I would like to have something like the following in my vertex shader:
attribute float weights[25];
How would I fill the attribute array from my C++ & OpenGL program? I have seen in another question that I could get the attribute location of the array attribute and then just add the index to that location. Could someone give an example on that for my pretty large array?
Thanks.
Let's start with what you asked for.
On pretty much no hardware that exists currently will attribute float weights[25]; compile. While shaders can have arrays of attributes, each array index represents a new attribute index. And on all hardware that currently exists, the maximum number of attribute indices is... 16. You'd need 25, and that's just for the weights.
Now, you can mitigate this easily enough by remembering that you can use vec4 attributes. Thus, you store every four array elements in a single attribute. Your array would be attribute vec4 weights[7]; which is doable. Your weight-fetching logic will have to change of course.
Even so, you don't seem to be taking into account the ramifications of what this would actually mean for your vertex data. Each attribute represents a component of a vertex's data. Each vertex for a rendering call will have the same amount of data; the contents of that data will differ, but not how much data.
In order to do what you're suggesting, every vertex in your mesh would need 25 floats describing the weight. Even if this was stored as normalized unsigned bytes, that's still 25 extra bytes' worth of data at a minimum. That's a lot. Especially considering that for the vast majority of vertices, most of these values will be 0. Even in the worst case, you'd be looking at maybe 6-7 bones affecting a single vertex.
The way skinning is generally done in vertex shaders is to limit the number of bones that affects a single vertex to four. This way, you don't use an array of attributes; you just use a vec4 attribute for the weights. Of course, you also now need to say which bone is associated with which weight. So you have a second vec4 attribute that specifies the bone index for that weight.
This strikes a good balance. You only take up 2 extra attributes (which can be unsigned bytes in terms of size). And for the vast majority of vertices, you'll never even notice, because most vertices are only influenced by 1-3 bones. A few use 4, and fewer still use 5+. In those cases, you just cut off the lowest weights and recompute the weights of the others proportionately.
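For illustration, a hedged sketch of that four-bone scheme in an old-style (attribute-keyword) vertex shader; all names and the bone-count limit are placeholders, and the weight/index attributes are shown as plain floats rather than the normalized bytes suggested above:

attribute vec3 in_Position;
attribute vec4 boneWeights;   // four weights per vertex
attribute vec4 boneIndices;   // matching bone indices, passed as floats here

uniform mat4 boneMatrices[32];       // assumed maximum bone count
uniform mat4 modelViewProjection;

void main()
{
    // Blend the four bone transforms by their weights, then transform the vertex.
    mat4 skin = boneMatrices[int(boneIndices.x)] * boneWeights.x
              + boneMatrices[int(boneIndices.y)] * boneWeights.y
              + boneMatrices[int(boneIndices.z)] * boneWeights.z
              + boneMatrices[int(boneIndices.w)] * boneWeights.w;
    gl_Position = modelViewProjection * skin * vec4(in_Position, 1.0);
}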
Nicol Bolas already gave you an answer on how to restructure your task. You should do it, because processing 25 floats per vertex, probably through some quaternion multiplication, will waste a lot of good GPU processing power; most of the attributes for a vertex will translate close to an identity transform anyway.
However, for academic reasons I'm going to tell you how to pass 25 floats per vertex. The key is not using attributes for this, but fetching the data from some buffer, such as a texture. The GLSL vertex shader stage has the built-in variable gl_VertexID, which carries the index of the currently processed vertex. With recent OpenGL you can access textures from the vertex shader as well. So you have a texture of size vertex_count × 25 holding the values. In your vertex shader you can access them using the texelFetch function, i.e. texelFetch(param_buffer, ivec2(gl_VertexID, 3), 0);
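A minimal sketch of that lookup, assuming a vertex_count × 25 single-channel float texture (param_buffer as in the call above):

uniform sampler2D param_buffer;   // vertex_count x 25, one float per texel

void main()
{
    float w[25];
    for (int i = 0; i < 25; ++i)
        w[i] = texelFetch(param_buffer, ivec2(gl_VertexID, i), 0).r;
    // ... use w[] for the per-vertex weighting, then write gl_Position ...
}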
If used in skeletal animation this system is often referred to as texture skinning. However it should be used sparingly, as it's a real performance hog. But sometimes you can't avoid it, for example when implementing a facial animation system where you have to weight all the vertices to 26 muscles, if you want to accurately simulate a human face.