lvalue in assignment too complex - glsl

The following code causes a GLSL error: "lvalue in assignment too complex"
for(int i = 0; i < 4; i++)
{
    if(Lgt.lights[i].position.w == 0.0)
    {
        LightDir[i] = normalize(vec3(Lgt.lights[i].position));
        ViewDir[i] = normalize(cameraWorldPosition - worldPosition);
    }
    else
    {
        LightDir[i] = normalize(vec3(Lgt.lights[i].position) - worldPosition);
        ViewDir[i] = normalize(cameraWorldPosition - worldPosition);
    }
}
But in another shader program an identical piece of code works fine. When the code doesn't contain an if statement, for example:
for(int i = 0; i < 4; i++)
{
    LightDir[i] = normalize(vec3(Lgt.lights[i].position) - worldPosition);
    ViewDir[i] = normalize(cameraWorldPosition - worldPosition);
}
everything is fine, but when I use a multiplication:
for(int i = 0; i < 4; i++)
{
    LightDir[i] = LocalMat * normalize(vec3(Lgt.lights[i].position) - worldPosition);
    ViewDir[i] = LocalMat * normalize(cameraWorldPosition - worldPosition);
}
I get that error again. Can anyone tell me what is going on?

I'm guessing that you're running on a GPU/driver combination that doesn't allow indexed assignments. In order to compile the code, the driver needs to completely unroll the loop, changing all the indexes in the lvalues into constants. This apparently happens in some cases, but not all.
If you're using an Nvidia GPU/driver, you might try putting #pragma optionNV unroll all at the top of your shader program to force full unrolling of all loops -- but that might cause problems if you have other loops that shouldn't be unrolled.

I also had this error, but my cause was different.
It turns out I had a hard-coded maximum index in my for loop that was bigger than the array I was trying to assign to. My guess is that the compiler got confused when attempting to unroll the loop and there were not enough valid indices for the lvalue.

How to get norm of a row in Eigen?

I want to normalize each row of a matrix by its norm. I've written the following code, but it isn't correct!
for (int i = 0; i < A.rows(); i++)
    A.row(i) = A.row(i).array() / (A.row(i).norm());
It is worth mentioning that the type of A is MatrixXcf. In your opinion, what's the problem?
I don't see any problem with your code. Anyway, what you wrote can be written more compactly as either of these:
// assign result to new variable:
Eigen::MatrixXcf N = A.rowwise().normalized();
// or in-place normalization:
A.rowwise().normalize();
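For reference, here is a minimal, self-contained sketch (assuming Eigen 3 is available via the <Eigen/Dense> header) that exercises both forms on a small complex matrix:

#include <Eigen/Dense>
#include <iostream>

int main() {
    // A small complex-valued matrix whose rows we want to normalize.
    Eigen::MatrixXcf A = Eigen::MatrixXcf::Random(3, 4);

    // Copy with every row scaled to unit norm.
    Eigen::MatrixXcf N = A.rowwise().normalized();

    // In-place equivalent.
    A.rowwise().normalize();

    std::cout << (A - N).norm() << "\n"; // prints a value close to 0
    return 0;
}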

Orthogonalization in QR Factorization outputting slightly inaccurate orthogonalized matrix

I am writing code for QR Factorization and for some reason my orthogonal method does not work as intended. Basically, my proj() method is outputting random projections. Here is the code:
apmatrix<double> proj(apmatrix<double> v, apmatrix<double> u)
//Projection of u onto v
{
    //proj(v,u) = [(u dot v)/(v dot v)]*v
    double a = mult(transpose(u,u),v)[0][0], b = mult(transpose(v,v),v)[0][0], c = (a/b);
    apmatrix<double> k;
    k.resize(v.numrows(), v.numcols());
    for (int i = 0; i < v.numrows(); i++)
    {
        for (int j = 0; j < v.numcols(); j++)
        {
            k[i][j] = v[i][j] * c;
        }
    }
    return k;
}
I tested the method by itself with manual matrix inputs, and it seems to work fine. Here is my orthogonal method:
apmatrix<double> orthogonal(apmatrix<double> A) //Orthogonal
{
    /*
    n = (number of columns of A)-1
    x = columns of A
    v0 = x0
    v1 = x1 - proj(v0,x1)
    vn = xn - proj(v0,xn) - proj(v1,xn) - ... - proj(v(n-1),xn)
    V = {v1, v2, ..., vn} or [v0 v1 ... vn]
    */
    apmatrix<double> V, x, v;
    int n = A.numcols();
    V.resize(A.numrows(), n);
    x.resize(A.numrows(), 1);
    v.resize(A.numrows(), 1);
    for (int i = 0; i < A.numrows(); i++)
    {
        x[i][0] = A[i][1];
        v[i][0] = A[i][0];
        V[i][0] = A[i][0];
    }
    for (int c = 1; c < n; c++) //Iterates through each col of A as if each was its own matrix
    {
        apmatrix<double> vn, vc; //vn = Orthogonalized v (avoiding matrix overwriting of v); vc = previously orthogonalized v
        vn = x;
        vc.resize(v.numrows(), 1);
        for (int i = 0; i < c; i++) //Vn = an-(sigma(t=1, n-1, proj(vt, xn))
        {
            for (int k = 0; k < V.numrows(); k++)
                vc[k][0] = V[k][i]; //Sets vc to designated v matrix
            apmatrix<double> temp = proj(vc, x);
            for (int j = 0; j < A.numrows(); j++)
            {
                vn[j][0] -= temp[j][0]; //orthogonalize matrix
            }
        }
        for (int k = 0; k < V.numrows(); k++)
        {
            V[k][c] = vn[k][0]; //Subtracts orthogonalized col to V
            v[k][0] = V[k][c]; //v is redundant. more of a placeholder
        }
        if ((c+1) < A.numcols()) //Matrix Out of Bounds Checker
        {
            for (int k = 0; k < A.numrows(); k++)
            {
                vn[k][0] = 0;
                vc[k][0] = 0;
                x[k][0] = A[k][c+1]; //Moves x onto next v
            }
        }
    }
    system("PAUSE");
    return V;
}
For testing purposes, I have been using the 2D array [[1,1,4],[1,4,2],[1,4,2],[1,1,0]]. Each column is its own 4x1 matrix. The matrices should come out as [1,1,1,1]^T, [-1.5,1.5,1.5,-1.5]^T, and [2,0,0,-2]^T respectively. What's happening now is that the first column comes out correctly (it's the same matrix), but the second and third come out to something that is potentially similar but not equal to their intended values.
Again, each time I call on the orthogonal method, it outputs something different. I think it's due to the numbers inputted in the proj() method, but I am not fully sure.
The apmatrix class is from the AP College Board, back when they taught C++. It is similar to vectors or ArrayLists in Java.
Here is a link to apmatrix.cpp and to the documentation or conditions (probably more useful), apmatrix.h.
Here is a link to the full code (I added visual markers to see what the computer is doing).
It's fair to assume that all custom methods work as intended (except maybe Matrix Regressions, but that's irrelevant). Be sure to enter the matrix using the enter method before trying to factorize. The code might be inefficient partly because I taught myself C++ not too long ago and I've been trying different ways to fix my code. Thank you for the help!
As said in the comments:
@AhmedFasih After doing more tests today, I have found that it is in fact some memory issue. I found that for some reason, if a variable or an apmatrix object is declared within a loop, initialized, and then that loop is reiterated, the memory does not entirely wipe the value stored in that variable or object. This is noted in two places in my code. For whatever reason, I had to set the doubles a, b, and c to 0 in the proj method and apmatrixdh to 0 in the mult method or they would store some value in the next iteration. Thank you so much for your help!
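For what it's worth, that observation is consistent with standard C++ behaviour: a local built-in variable declared without an initializer has an indeterminate value, and re-entering the loop body does not zero it, so in practice it can appear to "remember" whatever was left in that stack slot. A minimal sketch of the correct pattern, with hypothetical names unrelated to the apmatrix code:

#include <iostream>

int main() {
    for (int iter = 0; iter < 3; iter++) {
        // Without the "= 0.0" this accumulator would start each iteration
        // with an indeterminate value; it is NOT automatically reset.
        double sum = 0.0;
        for (int i = 1; i <= 4; i++)
            sum += i;
        std::cout << sum << "\n"; // prints 10 on every iteration
    }
    return 0;
}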

Interfering Vector in glBegin()

I am trying to implement code for an assignment to render skeleton and mesh animations. In my glBegin(GL_TRIANGLES) section, I have some vectors that appear to be interfering with my data when they shouldn't.
glBegin(GL_TRIANGLES);
for (int i = 0; i < mesh->nfaces.size(); i += 1)
    for (int k = 0; k < 3; k += 1) {
        int j = k; //2 - k;
        glm::vec4 myPointPrime;
        myPointPrime.w = 1;
        myPoint.x = ecks = mesh->vertex[mesh->faces[i][j]][0];
        myPoint.y = why = mesh->vertex[mesh->faces[i][j]][1];
        myPoint.z = zed = mesh->vertex[mesh->faces[i][j]][2];
        // Stuff vvvv THIS CAUSES PROBLEMS
        for (int t = 0; t < mySkeleton->vertex.at(i).size(); t++) {
            myPointPrime += mySkeleton->vertex[i][j] * MyXformations * myPoint;
        }
        glNormal3f(mesh->normal[mesh->nfaces[i][j]][0],
                   mesh->normal[mesh->nfaces[i][j]][1],
                   mesh->normal[mesh->nfaces[i][j]][2]);
        glVertex3f(mesh->vertex[mesh->faces[i][j]][0],
                   mesh->vertex[mesh->faces[i][j]][1],
                   mesh->vertex[mesh->faces[i][j]][2]);
        // glVertex3f(myPointPrime.x, myPointPrime.y, myPointPrime.z);
        // glVertex3f(myPoint.x, myPoint.y, myPoint.z);
    }
glEnd();
The myPointPrime += ... code is doing something weird to my glVertex calls; the scene won't render unless I comment out that for loop.
If I comment out the loop, then the scene renders, but I think I kind of need the loop if animating something like 16,000 vertices is going to have any performance at all.
Is having that there kind of like having it automatically multiply with the glVertex calls?
Edit:
Below is another version of the code that I hope is clearer. Instead of calculating the points in the actual drawing code, I now change the whole mesh so that it supposedly follows the skeleton each frame, but nothing is rendered.
for (int vertex_i = 0; vertex_i < mesh->nfaces.size(); vertex_i++) {
    for (int k = 0; k < 3; k += 1) {
        int j = k; //2 - k;
        pointp.x = 0;
        pointp.y = 0;
        pointp.z = 0;
        for (int t = 0; t < mySkeleton->vertex.at(vertex_i).size(); t++) {
            point.x = mesh->vertex[mesh->faces[vertex_i][j]][0];
            point.y = mesh->vertex[mesh->faces[vertex_i][j]][1];
            point.z = mesh->vertex[mesh->faces[vertex_i][j]][2];
            //glPushMatrix();
            pointp += mySkeleton->vertex[vertex_i][t] * myTranslationMatrix * myRotationMatrix * point;
            cout << "PointP X: " << pointp.x << " PointP Y: " << pointp.y << " PointP Z: " << pointp.z << endl;
            mesh->vertex[mesh->faces[vertex_i][j]][0] = pointp.x;
            mesh->vertex[mesh->faces[vertex_i][j]][1] = pointp.y;
            mesh->vertex[mesh->faces[vertex_i][j]][2] = pointp.z;
            //myPointPrime += MyXformations * myPoint;
        }
    }
}
My assumption is that maybe the calculation for pointp isn't doing what I think it's doing?
mySkeleton->vertex[vertex_i][t] is a vector from my 'skeleton' class; it holds all of the weights for every vertex, and there are 17 weights per vertex.
"MyXformations" is a 4x4 matrix passed from my skeleton animation function that holds the last known key frame; this is applied to the vertices.
point is the current point in the vertex.
Your loop variable is t. However, you refer to j in the loop. Looks to me like your loop might simply be crashing for larger values of j.
You're not using t inside the for loop. Is this expected?
mySkeleton->vertex[i][j] looks like it's out of bounds since j should be for mesh->faces/mesh->nfaces.
Also, you can use glNormal3fv and glVertex3fv with arrays (see the sketch below).
With out-of-bounds memory operations you can get all sorts of weird stuff happening, although I can't see any out-of-bounds writes. Your * operators don't modify the objects, do they?
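To illustrate the glNormal3fv/glVertex3fv suggestion, here is a rough sketch of a helper that emits one triangle from flat float arrays; the helper name and array layout are my own assumptions, not part of the question's mesh class:

#ifdef _WIN32
#include <windows.h>
#endif
#include <GL/gl.h>

// Hypothetical helper: the ...3fv variants read three floats from each pointer,
// so per-vertex data stored contiguously can be passed directly.
void emitTriangle(const GLfloat vertices[3][3], const GLfloat normals[3][3])
{
    glBegin(GL_TRIANGLES);
    for (int corner = 0; corner < 3; ++corner) {
        glNormal3fv(normals[corner]);
        glVertex3fv(vertices[corner]);
    }
    glEnd();
}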
If you're worried about performance, you shouldn't be using immediate mode. Instead, put all your data on the GPU with buffer objects (including joint/bone transformations) and animate on the fly in the vertex shader.
This is from a few years ago, but worth a read: Animated Crowd Rendering.

triangular matrix conversion and auto parallelization

I'm playing a bit with auto-parallelization in ICC (11.1; old, but I can't do anything about it) and I'm wondering why the compiler can't parallelize the inner loop of a simple Gaussian elimination:
void makeTriangular(float **matrix, float *vector, int n) {
    for (int pivot = 0; pivot < n - 1; pivot++) {
        // swap row so that the row with the largest value is
        // at pivot position for numerical stability
        int swapPos = findPivot(matrix, pivot, n);
        std::swap(matrix[pivot], matrix[swapPos]);
        std::swap(vector[pivot], vector[swapPos]);
        float pivotVal = matrix[pivot][pivot];
        for (int row = pivot + 1; row < n; row++) { // line 72; should be parallelized
            float tmp = matrix[row][pivot] / pivotVal;
            for (int col = pivot + 1; col < n; col++) { // line 74
                matrix[row][col] -= matrix[pivot][col] * tmp;
            }
            vector[row] -= vector[pivot] * tmp;
        }
    }
}
We're only writing to array elements that depend on the private row (and col) variables, and row is guaranteed to be larger than pivot, so it should be obvious to the compiler that we aren't overwriting anything.
I'm compiling with -O3 -fno-alias -parallel -par-report3 and get lots of dependencies like "assumed FLOW dependence between matrix line 75 and matrix line 73" or "assumed ANTI dependence between matrix line 73 and matrix line 75", and the same for line 75 alone. What problem does the compiler have? Obviously I could tell it exactly what to do with some pragmas, but I want to understand what the compiler can figure out on its own.
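(As an aside, the kind of pragma alluded to here could look like the sketch below: the row loop from makeTriangular with an explicit OpenMP directive instead of relying on the auto-parallelizer. This is my own illustration, not part of the original question; with ICC it would be compiled with -openmp.)

// The row loop, parallelized explicitly; the loop counter and the
// block-local variables are private to each thread automatically.
#pragma omp parallel for
for (int row = pivot + 1; row < n; row++) {
    float tmp = matrix[row][pivot] / pivotVal;
    for (int col = pivot + 1; col < n; col++) {
        matrix[row][col] -= matrix[pivot][col] * tmp;
    }
    vector[row] -= vector[pivot] * tmp;
}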
Basically, the compiler can't figure out that there's no dependency, because the names matrix and vector are both read from and written to (even though in different regions). You might be able to get around this in the following fashion (though it's slightly dirty):
void makeTriangular(float **matrix, float *vector, int n)
{
    for (int pivot = 0; pivot < n - 1; pivot++)
    {
        // swap row so that the row with the largest value is
        // at pivot position for numerical stability
        int swapPos = findPivot(matrix, pivot, n);
        std::swap(matrix[pivot], matrix[swapPos]);
        std::swap(vector[pivot], vector[swapPos]);
        float pivotVal = matrix[pivot][pivot];
        float **matrixForWriting = matrix; // COPY THE POINTER
        float *vectorForWriting = vector;  // COPY THE POINTER
        // (then parallelize this next for loop as you were)
        for (int row = pivot + 1; row < n; row++) {
            float tmp = matrix[row][pivot] / pivotVal;
            for (int col = pivot + 1; col < n; col++) {
                // WRITE TO THE matrixForWriting VERSION
                matrixForWriting[row][col] = matrix[row][col] - matrix[pivot][col] * tmp;
            }
            // WRITE TO THE vectorForWriting VERSION
            vectorForWriting[row] = vector[row] - vector[pivot] * tmp;
        }
    }
}
The bottom line is: give the arrays you're writing to a temporarily different name to trick the compiler. I know that it's a little dirty and I wouldn't recommend this kind of programming in general. But if you're sure that you have no data dependency, it's perfectly fine.
In fact, I'd put some comments around it that make it very clear to future readers of this code that this was a workaround, and why you did it.
Edit: I think the answer was basically touched on by @FPK, and an answer was posted by @Evgeny Kluev. However, in @Evgeny Kluev's answer he suggests making this an extra output parameter; that might parallelize, but it won't give the correct value since the entries in matrix won't be updated. I think the code I posted above will give the correct answer too.
The same auto-parallelization problem occurs on ICC 12.1, so I used this newer version for experiments.
Adding an output matrix to your function's parameter list and changing the body of the third loop to this
out[row][col] = matrix[row][col] - matrix[pivot][col] * tmp;
fixed the "FLOW dependence" problem. This means "-fno-alias" affects only function parameters, while the contents of a single parameter remain under suspicion of being aliased. I don't know why this option does not affect everything. Since different parts of your matrix do not really alias each other, you can just keep this additional parameter on the function and pass the same matrix through it.
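A sketch of how the reworked function might look is below; the parameter name out and the call-site comment are my own illustration. The caller simply passes the same matrix for both matrix and out:

void makeTriangular(float **matrix, float *vector, float **out, int n) {
    for (int pivot = 0; pivot < n - 1; pivot++) {
        int swapPos = findPivot(matrix, pivot, n);
        std::swap(matrix[pivot], matrix[swapPos]);
        std::swap(vector[pivot], vector[swapPos]);
        float pivotVal = matrix[pivot][pivot];
        for (int row = pivot + 1; row < n; row++) {
            float tmp = matrix[row][pivot] / pivotVal;
            for (int col = pivot + 1; col < n; col++) {
                // reads go through 'matrix', writes go through 'out'
                out[row][col] = matrix[row][col] - matrix[pivot][col] * tmp;
            }
            vector[row] -= vector[pivot] * tmp;
        }
    }
}

// Call site: pass the same storage for both parameters, e.g.
// makeTriangular(m, v, m, n);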
Interestingly, while complaining about 'matrix', the compiler says nothing about 'vector', which really does have a problem: the line vector[row] -= vector[pivot] * tmp; may lead to false sharing (writing to vector[row] in one thread may touch the cache line storing vector[pivot], which is read by every thread).
"FLOW dependence" is not the only problem in this code. After it was fixed, compiler still refuses to parallelize second and third loops because of "insufficient computational work". So I tried to give it some extra work:
float tmp = matrix[row][pivot] * pivotVal;
...
out[row][col] = matrix[row][col] - matrix[pivot][col] * tmp / pivotVal / pivotVal;
And after all this, the second loop was at last parallelized, though I'm not sure if it gained any speed improvement.
Update: I found a better alternative to giving the compiler "some extra work". The option -par-threshold50 does the trick.
I have no access to ICC to test my idea, but I suspect the compiler fears aliasing: matrix is defined as float**, an array of pointers pointing to arrays of floats. All those pointers could point to the same float array, so parallelizing this would be very dangerous. That would make no sense here, but the compiler cannot know that.

weird performance in C++ (VC 2010)

I have this loop written in C++ that, compiled with MSVC 2010, takes a long time to run (300 ms):
for (int i = 0; i < h; i++) {
    for (int j = 0; j < w; j++) {
        if (buf[i*w + j] > 0) {
            const int sy = max(0, i - hr);
            const int ey = min(h, i + hr + 1);
            const int sx = max(0, j - hr);
            const int ex = min(w, j + hr + 1);
            float val = 0;
            for (int k = sy; k < ey; k++) {
                for (int m = sx; m < ex; m++) {
                    val += original[k*w + m] * ds[k - i + hr][m - j + hr];
                }
            }
            heat_map[i*w + j] = val;
        }
    }
}
It seemed a bit strange to me, so I did some tests and then changed a few bits to inline assembly (specifically, the code that sums val):
for (int i = 0; i < h; i++) {
    for (int j = 0; j < w; j++) {
        if (buf[i*w + j] > 0) {
            const int sy = max(0, i - hr);
            const int ey = min(h, i + hr + 1);
            const int sx = max(0, j - hr);
            const int ex = min(w, j + hr + 1);
            __asm {
                fldz
            }
            for (int k = sy; k < ey; k++) {
                for (int m = sx; m < ex; m++) {
                    float val = original[k*w + m] * ds[k - i + hr][m - j + hr];
                    __asm {
                        fld val
                        fadd
                    }
                }
            }
            float val1;
            __asm {
                fstp val1
            }
            heat_map[i*w + j] = val1;
        }
    }
}
Now it runs in half the time, 150ms. It does exactly the same thing, but why is it twice as quick? In both cases it was run in Release mode with optimizations on. Am I doing anything wrong in my original C++ code?
I suggest you try the different floating-point calculation models supported by the compiler - precise, strict, or fast (see the /fp option) - with your original code before drawing any conclusions. I suspect that your original code was compiled with an overly restrictive floating-point model (not followed by your assembly in the second version of the code), which is why the original is much slower.
In other words, if the original model was indeed too restrictive, then you were simply comparing apples to oranges. The two versions didn't really do the same thing, even though it might seem so at first sight.
Note, for example, that in the first version of the code the intermediate sum is accumulated in a float value. If it was compiled with the precise model, the intermediate results would have to be rounded to the precision of the float type, even if the variable val was optimized away and an internal FPU register was used instead. In your assembly code you don't bother to round the accumulated result, which could have contributed to its better performance.
I'd suggest you compile both versions of the code in /fp:fast mode and see how their performances compare in that case.
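To make the rounding point concrete, here is a small self-contained sketch (not the original kernel) contrasting an accumulator that is rounded back to float on every addition, as the precise model requires for a float variable, with one kept in a wider type, which is roughly what leaving the running total in the FPU register achieves:

#include <cstdio>

int main() {
    const int n = 1000000;
    float roundedSum = 0.0f; // rounded to float precision after every addition
    double wideSum = 0.0;    // kept in a wider type, rounded only at the end

    for (int i = 0; i < n; i++) {
        float term = 1.0f / (i + 1.0f);
        roundedSum += term;
        wideSum += term;
    }

    // The two results differ slightly, which is why a compiler under a strict
    // floating-point model cannot simply keep the float sum unrounded.
    std::printf("rounded: %.8f  wide: %.8f\n", roundedSum, (float)wideSum);
    return 0;
}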
A few things to check out:
You need to check that it actually is the same code. Are your inline assembly statements exactly the same as those generated by the compiler? I can see three potential differences (potential because they may be optimised out). The first is the initial setting of val to zero, the second is the extra variable val1 (unlikely to matter, since it will most likely just change the constant subtracted from the stack pointer), and the third is that your inline assembly version may not put the interim results back into val.
You need to make sure your sample space is large. You didn't mention whether you did only one run of each version or a hundred runs, but the more runs the better, so as to remove the effect of "noise" in your statistics.
An even better measurement would be CPU time rather than elapsed time. Elapsed time is subject to environmental changes (like your virus checker or one of your services deciding to do something at the time you're testing). The large sample space will alleviate, but not necessarily solve, this.
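For instance, on Windows one way to measure CPU time rather than elapsed time is GetProcessTimes; a rough sketch follows (runKernel is a placeholder for whatever code is being benchmarked):

#include <windows.h>
#include <cstdio>

// Placeholder workload standing in for the loop being measured.
void runKernel() {
    volatile double x = 0;
    for (int i = 0; i < 10000000; i++) x += i * 0.5;
}

// User-mode CPU time consumed by this process so far, in seconds.
double processUserSeconds() {
    FILETIME creationTime, exitTime, kernelTime, userTime;
    GetProcessTimes(GetCurrentProcess(), &creationTime, &exitTime, &kernelTime, &userTime);
    ULARGE_INTEGER t;
    t.LowPart = userTime.dwLowDateTime;
    t.HighPart = userTime.dwHighDateTime;
    return t.QuadPart * 100e-9; // FILETIME ticks are 100 ns
}

int main() {
    double before = processUserSeconds();
    runKernel();
    double after = processUserSeconds();
    std::printf("CPU time: %.3f s\n", after - before);
    return 0;
}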