MIPS If Statement Being Ignored

I have a 7x5 word array of 1s and 0s. My code is supposed to branch whenever an element is 1, in order to display a certain graphic; however, I noticed that the branch statement is being ignored. Here's the code:
bge s1, 5, _increment_outer_loop
la a0, arena
mul t0, s0, 28 #i
mul t1, s1, 4 #j
add t1, t1, t0
add t2, a0, t1
beq t2, 1, _draw_arena
j _increment_inner_loop
Can anyone shed light on the problem? Thank you.
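No answer survives in this thread, so what follows is only a guess at the cause. After the last add, t2 holds the address of the element, and beq t2, 1 compares that address with 1 rather than the word stored there; loading the word (lw) before comparing would be the usual remedy. The Python sketch below models the difference. The base address, the 2x3 sample arena and the helper names are made up for illustration. It also computes the row stride as words-per-row times 4, so whether 28 is right above depends on whether a row really holds 7 words.

```python
WORD = 4  # bytes per MIPS word

def element_address(base, i, j, words_per_row):
    """Byte address of arena[i][j] for a row-major array of words."""
    return base + (i * words_per_row + j) * WORD

def load_word(memory, address):
    """Stand-in for `lw`: fetch the value stored at a byte address."""
    return memory[address]

# Hypothetical 2x3 arena placed at an arbitrary base address.
BASE = 0x10010000
ROWS, COLS = 2, 3
values = [[0, 1, 0],
          [1, 0, 0]]
memory = {element_address(BASE, i, j, COLS): values[i][j]
          for i in range(ROWS) for j in range(COLS)}

addr = element_address(BASE, 0, 1, COLS)
print(addr == 1)                     # comparing the address, as the beq does
print(load_word(memory, addr) == 1)  # comparing the loaded word
```

In the assembly, this would mean something like lw t3, 0(t2) followed by beq t3, 1, _draw_arena.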

Related

Plotting a sympy function. Strange xlim behavior

I am trying to plot a sympy function from 0 to 120 using this code:
import sympy as sp
def symbolicCalc():
    A0, k1, k2, t = sp.symbols("A0 k1 k2 t", real=True)
    fSymb = A0*(1-(k1+k2*A0)/(k2*A0+k1*sp.exp((k1+k2*A0)*t)))
    sp.plotting.plot(fSymb.subs([(A0,70),(k1,8e-4),(k2,1.5e-3)]), xlim=[0,120], ylim=[0,100])
symbolicCalc()
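And I obtain the following figure:
How can I get a plot from 0 to 120?
Thanks for any answer.
OK, I solved my problem by passing the range (t,0,120) to plot. Without an explicit range, sympy samples the expression over its default interval of (-10, 10); xlim only changes the visible part of the axis.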
import sympy as sp
def symbolicCalc():
    A0, k1, k2, t = sp.symbols("A0 k1 k2 t", real=True)
    fSymb = A0*(1-(k1+k2*A0)/(k2*A0+k1*sp.exp((k1+k2*A0)*t)))
    # (t, 0, 120) sets the range over which the expression is sampled
    sp.plotting.plot(fSymb.subs([(A0,70),(k1,8e-4),(k2,1.5e-3)]), (t,0,120), xlim=[0,120], ylim=[0,100])
symbolicCalc()

Improve non-horizontal assignment in AVX

So I've come across another problem when dealing with AVX code. I have a case where 4 ymm registers need to be split vertically into 4 other ymm registers
(i.e. ymm0(ABCD) -> ymm4(A...), ymm5(B...), ymm6(C...), ymm7(D...)).
Here is an example:
// a, b, c, d are __m256 structs with [] operators to access xyzw
__m256d A = _mm256_setr_pd(a[0], b[0], c[0], d[0]);
__m256d B = _mm256_setr_pd(a[1], b[1], c[1], d[1]);
__m256d C = _mm256_setr_pd(a[2], b[2], c[2], d[2]);
__m256d D = _mm256_setr_pd(a[3], b[3], c[3], d[3]);
Just putting Paul's comment into an answer:
My question was really about how to do a matrix transposition, which is easily done in AVX, as shown in the link he provided.
Here's my implementation for those who come across this:
#include <immintrin.h>
/* Transposes the 4x4 matrix of doubles held in A[0..3] into T[0..3]. */
void Transpose(__m256d* A, __m256d* T)
{
    /* Within each 128-bit lane: 0x0 pairs the even elements, 0xF the odd ones. */
    __m256d t0 = _mm256_shuffle_pd(A[0], A[1], 0x0);
    __m256d t1 = _mm256_shuffle_pd(A[0], A[1], 0xF);
    __m256d t2 = _mm256_shuffle_pd(A[2], A[3], 0x0);
    __m256d t3 = _mm256_shuffle_pd(A[2], A[3], 0xF);
    /* 0x20 joins the low lanes of both inputs, 0x31 joins the high lanes. */
    T[0] = _mm256_permute2f128_pd(t0, t2, 0x20);
    T[1] = _mm256_permute2f128_pd(t1, t3, 0x20);
    T[2] = _mm256_permute2f128_pd(t0, t2, 0x31);
    T[3] = _mm256_permute2f128_pd(t1, t3, 0x31);
}
This function cuts the number of instructions roughly in half at full optimization compared to my previous attempt.
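To see why two shuffles per row pair plus four lane permutes give a transpose, here is a scalar model in Python. It is only a sketch: the two helpers imitate _mm256_shuffle_pd and _mm256_permute2f128_pd on plain 4-element lists (the Python names and the sample matrix are illustrative), and the immediates are the same values used in the function above.

```python
def shuffle_pd(a, b, imm):
    """Model of _mm256_shuffle_pd on 4-element lists."""
    return [a[1] if imm & 1 else a[0],
            b[1] if imm & 2 else b[0],
            a[3] if imm & 4 else a[2],
            b[3] if imm & 8 else b[2]]

def permute2f128_pd(a, b, imm):
    """Model of _mm256_permute2f128_pd: each nibble of imm picks one 128-bit lane."""
    lanes = [a[0:2], a[2:4], b[0:2], b[2:4]]

    def pick(ctrl):
        # Bit 3 of the nibble zeroes the lane; bits 0-1 choose the source lane.
        return [0.0, 0.0] if ctrl & 0x8 else lanes[ctrl & 0x3]

    return pick(imm & 0xF) + pick(imm >> 4)

def transpose(A):
    """Same sequence of steps as the intrinsic version, on lists."""
    t0 = shuffle_pd(A[0], A[1], 0x0)
    t1 = shuffle_pd(A[0], A[1], 0xF)
    t2 = shuffle_pd(A[2], A[3], 0x0)
    t3 = shuffle_pd(A[2], A[3], 0xF)
    return [permute2f128_pd(t0, t2, 0x20),
            permute2f128_pd(t1, t3, 0x20),
            permute2f128_pd(t0, t2, 0x31),
            permute2f128_pd(t1, t3, 0x31)]

rows = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
for row in transpose(rows):
    print(row)
```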

Cubic spline / curve fitting

I need to determine the parameters of an illumination change, which is defined by this continuous piecewise polynomial C(t), where f(t) is a cubic curve defined by the two boundary points (t1,c) and (t2,0), with f'(t1)=0 and f'(t2)=0.
Original Paper: Texture-Consistent Shadow Removal
The intensity curve is sampled along the normal to the shadow boundary, and it looks like this:
Each row is a sample showing the illumination change, so X is the column number and Y is the pixel intensity.
My real data looks like this (one sample averaged from all samples):
In total I have N samples, and I need to determine the parameters (c,t1,t2).
How can I do it?
I tried to do it by solving a linear system in Matlab:
avr_curve is the average curve, obtained by averaging over all samples.
f(x)= x^3+a2*x^2+a1*x+a0 is the cubic function
%t1,t2 selected by hand
t1= 10;
t2= 15;
offset=10;
avr_curve= [41, 40, 40, 41, 41, 42, 42, 43, 43, 43, 51, 76, 98, 104, 104, 103, 104, 105, 105, 107, 105];
%gradx= convn(avr_curve,[-1 1],'same');
A= zeros(2*offset+1,3);
%b= zeros(2*offset+1,1);
b= avr_curve';
%for i= 1:2*offset+1
for i=t1:t2
i
x= i-offset-1
A(i,1)= x^2; %a2
A(i,2)= x; %a1
A(i,3)= 1; %a0
b(i,1)= b(i,1)-x^3;
end
u= A\b;
figure,plot(avr_curve(t1:t2))
%estimated cubic curve
for i= 1:2*offset+1
x= i-offset-1;
fx(i)=x^3+u(1)*x^2+u(2)*x+u(3);
end
figure,plot(fx(t1:t2))
part of avr_curve on [t1 t2]
cubic curve that I got (doesn't look like avr_curve)
So what am I doing wrong?
UPDATE:
It seems my error was that I modeled the cubic polynomial using 3 variables, like this:
f(x)= x^3+a2*x^2+a1*x+a0 - 3 variables
but when I use 4 variables everything seems OK:
f(x)= a3*x^3+a2*x^2+a1*x+a0 - 4 variables
Here is the code in Matlab:
%defined by hand
t1= 10;
t2= 14;
avr_curve= [41, 40, 40, 41, 41, 42, 42, 43, 43, 43, 51, 76, 98, 104, 104, 103, 104, 105, 105, 107, 105];
x= [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21];
%x= [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]; %real x axis
%%%model 1
%%f(x)= x^3+a2*x^2+a1*x+a0 - 3 variables
%A= zeros(4,3);
%b= [43 104]';
%%cubic equation at t1
%A(1,1)= t1^2; %a2
%A(1,2)= t1; %a1
%A(1,3)= 1; %a0
%b(1,1)= b(1,1)-t1^3;
%%cubic equation at t2
%A(2,1)= t2^2; %a2
%A(2,2)= t2; %a1
%A(2,3)= 1; %a0
%b(2,1)= b(2,1)-t2^3;
%%1st derivative at t1
%A(3,1)= 2*t1; %a2
%A(3,2)= 1; %a1
%A(3,3)= 0; %a0
%b(3,1)= -3*t1^2;
%%1st derivative at t2
%A(4,1)= 2*t2; %a2
%A(4,2)= 1; %a1
%A(4,3)= 0; %a0
%b(4,1)= -3*t2^2;
%model 2
%f(x)= a3*x^3+a2*x^2+a1*x+a0 - 4 variables
A= zeros(4,4);
b= [43 104]';
%cubic equation at t1
A(1,1)= t1^3; %a3
A(1,2)= t1^2; %a2
A(1,3)= t1; %a1
A(1,4)= 1; %a0
b(1,1)= b(1,1);
%cubic equation at t2
A(2,1)= t2^3; %a3
A(2,2)= t2^2; %a2
A(2,3)= t2; %a1
A(2,4)= 1; %a0
b(2,1)= b(2,1);
%1st derivative at t1
A(3,1)= 3*t1^2; %a3
A(3,2)= 2*t1; %a2
A(3,3)= 1; %a1
A(3,4)= 0; %a0
b(3,1)= 0;
%1st derivative at t2
A(4,1)= 3*t2^2; %a3
A(4,2)= 2*t2; %a2
A(4,3)= 1; %a1
A(4,4)= 0; %a0
b(4,1)= 0;
size(A)
size(b)
u= A\b;
u
%estimated cubic curve
%dx=[1:21]; % global view
dx=t1-1:t2+1; % local view in [t1 t2]
for x= dx
%fx(x)=x^3+u(1)*x^2+u(2)*x+u(3); % model 1
fx(x)= u(1)*x^3+u(2)*x^2+u(3)*x+u(4); % model 2
end
err= 0;
for x= dx
err= err+(fx(x)-avr_curve(x))^2;
end
err
figure,plot(dx,avr_curve(dx),dx,fx(dx))
spline on interval [t1-1 t2+1]
and on full interval
Disclaimer
I cannot give any guarantees on the correctness of the code or methods given below, always use your critical sense before using any of that.
0 Define the problem
You have this piecewise defined function
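The formula was an image in the original post and is missing here. Judging from point_on_curve in the code of section 4, it presumably read:

```latex
C_l(t) =
\begin{cases}
\sigma & t \le t_1 \\
f(t)   & t_1 < t < t_2 \\
0      & t \ge t_2
\end{cases}
```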
where f(t) is a cubic function. To identify it uniquely, the following additional conditions are given:
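The conditions were also an image. Reconstructed from the question (with sigma playing the role of c there), they would be:

```latex
f(t_1) = \sigma, \qquad f(t_2) = 0, \qquad f'(t_1) = 0, \qquad f'(t_2) = 0
```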
You want to recover the values of the parameters t1, t2 and sigma that minimize the error against a given set of points.
This is essentially curve fitting in the least-squares sense.
1 Parametrize the f(t) cubic function
To compute the error between a candidate Cl(t) function and the set of points we need to compute f(t). Its general form (being a cubic) is
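The missing formula is presumably the generic cubic, written here with the coefficient names a, b, c, d used in the code of section 4:

```latex
f(t) = a t^3 + b t^2 + c t + d
```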
So it seems that we have four additional parameters to consider. In fact, these parameters are completely determined by the three free parameters t1, t2 and sigma.
It is important not to confuse the free parameters with the dependent ones.
Given the additional conditions on f(t) we can set up this linear system
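The system itself was an image. Writing out the four conditions f(t1) = sigma, f(t2) = 0, f'(t1) = 0, f'(t2) = 0 for a cubic with coefficients a, b, c, d gives the following (a reconstruction, not the original figure):

```latex
\begin{pmatrix}
t_1^3   & t_1^2 & t_1 & 1 \\
t_2^3   & t_2^2 & t_2 & 1 \\
3 t_1^2 & 2 t_1 & 1   & 0 \\
3 t_2^2 & 2 t_2 & 1   & 0
\end{pmatrix}
\begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix}
=
\begin{pmatrix} \sigma \\ 0 \\ 0 \\ 0 \end{pmatrix}
```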
Which has one solution (as expected) given by
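The solution was an image as well. The expressions below are taken from point_on_curve in section 4, with K standing for (t1-t2)^3 as it does there:

```latex
K = (t_1 - t_2)^3, \qquad
a = -\frac{2\sigma}{K}, \qquad
b = \frac{3\sigma\,(t_1 + t_2)}{K}, \qquad
c = -\frac{6\sigma\, t_1 t_2}{K}, \qquad
d = \frac{\sigma\, t_2^2\,(3 t_1 - t_2)}{K}
```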
This tells us how to compute the parameters of the cubic given the three free parameters.
This way Cl(t) is completely determined, now it's time to find the best candidate.
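As a quick sanity check of that parametrization, the sketch below computes the four coefficients with the same expressions as point_on_curve in section 4 and evaluates the cubic and its derivative at both ends. The function names and the sample values sigma = 8, t1 = 1, t2 = 3 are arbitrary choices for illustration.

```python
def cubic_coefficients(sigma, t1, t2):
    """Coefficients (a, b, c, d) of f(t) = a*t^3 + b*t^2 + c*t + d."""
    k = (t1 - t2) ** 3
    a = -2 * sigma / k
    b = 3 * sigma * (t1 + t2) / k
    c = -6 * sigma * t1 * t2 / k
    d = sigma * t2 ** 2 * (3 * t1 - t2) / k
    return a, b, c, d

def f(coeffs, t):
    """Value of the cubic at t (Horner form)."""
    a, b, c, d = coeffs
    return ((a * t + b) * t + c) * t + d

def f_prime(coeffs, t):
    """First derivative of the cubic at t."""
    a, b, c, _ = coeffs
    return (3 * a * t + 2 * b) * t + c

sigma, t1, t2 = 8.0, 1.0, 3.0
coeffs = cubic_coefficients(sigma, t1, t2)
print("f(t1)  =", f(coeffs, t1))        # should equal sigma
print("f(t2)  =", f(coeffs, t2))        # should be 0
print("f'(t1) =", f_prime(coeffs, t1))  # should be 0
print("f'(t2) =", f_prime(coeffs, t2))  # should be 0
```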
2 Minimize the error
I would normally go for least squares now.
Since this is not a linear function of the parameters, there is no closed form for computing sigma, t1 and t2.
There are, however, numerical methods, like Gauss-Newton.
One way or another, though, these require the partial derivatives with respect to the three parameters.
I don't know how to compute the derivative with respect to a separation parameter like t1.
I searched MathSE and found this question that addresses the same problem; however, nobody answered it.
Without the partial derivatives, the least-squares methods are out.
So I took a more practical road and implemented a brute-force function in C that tries every possible triplet of parameters and returns the best match.
3 The brute force function
Given the nature of the problem, the search tries O(n^2) pairs (t1, t2) in the number of samples n. Each pair is scored with a pass over all the samples, so the code below does O(n^3) work in total.
The algorithm proceeds as follows: divide the sample set into three parts: the points before t1, the points between t1 and t2, and the points after t2.
Only the first part is used to compute sigma, which is simply the arithmetic average of the points in part 1.
t1 and t2 are found with a nested loop: t1 is set to every possible point in the original point set, starting from the second and going forward.
For every choice of t1, t2 is set to every possible point after t1.
At each iteration an error is computed, and if it is the smallest seen so far, the parameters used are saved.
The error is computed as the sum of the absolute values of the residuals, since the absolute value should be fast (surely faster than squaring) and it fits the purpose of a metric.
4 The code
#include <stdio.h>
#include <math.h>
/* Evaluate Cl(t): sigma up to t1, 0 from t2 on, the cubic f(t) in between. */
float point_on_curve(float sigma, float t1, float t2, float t)
{
    float a, b, c, d, K;
    if (t <= t1)
        return sigma;
    if (t >= t2)
        return 0;
    /* Coefficients of f(t) in terms of the free parameters sigma, t1, t2 */
    K = (t1-t2)*(t1-t2)*(t1-t2);
    a = -2*sigma/K;
    b = 3*sigma*(t1+t2)/K;
    c = -6*sigma*t1*t2/K;
    d = sigma*t2*t2*(3*t1-t2)/K;
    return a*t*t*t + b*t*t + c*t + d;
}
/* Sum of absolute residuals between the candidate curve and the samples. */
float compute_error(float sigma, float t1, float t2, int s, int dx, int* data, unsigned int len)
{
    float error = 0;
    unsigned int i;
    for (i = 0; i < len; i++)
        error += fabsf(point_on_curve(sigma, t1, t2, s+i*dx) - data[i]);
    return error;
}
/*
 * s is the starting time of the sample set
 * dx is the separation in time between two samples (a.k.a. sampling period)
 * data is the array of samples
 * len is the number of samples
 * sigma, t1, t2 are pointers to the computed output parameters
 *
 * returns 1 if there are fewer than 3 samples, 0 if everything went OK.
 */
int curve_fit(int s, int dx, int* data, unsigned int len, float* sigma, float* t1, float* t2)
{
    float l_sigma = 0;
    float l_t1, l_t2;
    float sum = 0;
    float min_error, cur_error;
    char error_valid = 0;
    unsigned int i, j;
    if (len < 3)
        return 1;
    for (i = 0; i < len; i++)
    {
        /* Compute sigma as the average of points <= i */
        sum += data[i];
        l_sigma = sum/(i+1);
        /* Set t1 as the point i+1 */
        l_t1 = s+(i+1)*dx;
        for (j = i+2; j < len; j++)
        {
            /* Set t2 as the points i+2, i+3, i+4, ... */
            l_t2 = s+j*dx;
            /* Compute the error */
            cur_error = compute_error(l_sigma, l_t1, l_t2, s, dx, data, len);
            /* Test error_valid first so min_error is never read before it is set */
            if (!error_valid || cur_error < min_error)
            {
                error_valid = 1;
                min_error = cur_error;
                *sigma = l_sigma;
                *t1 = l_t1;
                *t2 = l_t2;
            }
        }
    }
    return 0;
}
int main()
{
    float sigma, t1, t2;
    int data[]={41, 40, 40, 41, 41, 42, 42, 43, 43, 43, 51, 76, 98, 104, 104, 103, 104, 105, 105, 107, 105};
    unsigned int len = sizeof(data)/sizeof(int);
    unsigned int i;
    for (i = 0; i < len; i++)
        data[i] -= 107; /* Subtract the max */
    if (curve_fit(1, 1, data, len, &sigma, &t1, &t2))
        printf("Not enough data!\n");
    else
        printf("Parameters: sigma = %.3f, t1 = %.3f, t2 = %.3f\n", sigma, t1, t2);
    return 0;
}
Please note that Cl(t) was defined as having 0 as its right limit, so the code assumes this is the case.
This is why the max value (107) is subtracted from every sample: I worked with the definition of Cl(t) given at the beginning and only later noticed that the samples were offset.
For now I'm not going to adapt the code; you can easily add another parameter to the problem and redo the steps from 1 if needed, or simply translate the samples using the maximum value.
The output of the code is
Parameters: sigma = -65.556, t1 = 10.000, t2 = 14.000
Which matches the given point set, considering that it is vertically translated by -107.

How do I perform 8 x 8 matrix operation using SSE?

My initial attempt looked like this (suppose we want to multiply):
__m128 mat[n]; /* rows */
__m128 vec[n] = {1,1,1,1};
float outvector[n];
for (int row=0;row<n;row++) {
for(int k =3; k < 8; k = k+ 4)
{
__m128 mrow = mat[k];
__m128 v = vec[row];
__m128 sum = _mm_mul_ps(mrow,v);
sum= _mm_hadd_ps(sum,sum); /* adds adjacent-two floats */
}
_mm_store_ss(&outvector[row],_mm_hadd_ps(sum,sum));
}
But this clearly doesn't work. How do I approach this?
I should load 4 at a time....
The other question is: if my array is very big (say n = 1000), how can I make it 16-byte aligned? Is that even possible?
OK... I'll use a row-major matrix convention. Each row of [m] requires (2) __m128 elements to yield 8 floats. The 8x1 vector v is a column vector. Since you're using the haddps instruction, I'll assume SSE3 is available. Finding r = [m] * v :
#include <pmmintrin.h> /* SSE3, needed for _mm_hadd_ps */
/* r = [m] * v; each row of m and the vector v are stored as two __m128 halves. */
void mul (__m128 r[2], const __m128 m[8][2], const __m128 v[2])
{
    __m128 t0, t1, t2, t3, r0, r1, r2, r3;
    /* rows 0..3, first halves: three hadds reduce four products to four partial sums */
    t0 = _mm_mul_ps(m[0][0], v[0]);
    t1 = _mm_mul_ps(m[1][0], v[0]);
    t2 = _mm_mul_ps(m[2][0], v[0]);
    t3 = _mm_mul_ps(m[3][0], v[0]);
    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r0 = _mm_hadd_ps(t0, t2);
    /* rows 0..3, second halves */
    t0 = _mm_mul_ps(m[0][1], v[1]);
    t1 = _mm_mul_ps(m[1][1], v[1]);
    t2 = _mm_mul_ps(m[2][1], v[1]);
    t3 = _mm_mul_ps(m[3][1], v[1]);
    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r1 = _mm_hadd_ps(t0, t2);
    /* rows 4..7, first halves */
    t0 = _mm_mul_ps(m[4][0], v[0]);
    t1 = _mm_mul_ps(m[5][0], v[0]);
    t2 = _mm_mul_ps(m[6][0], v[0]);
    t3 = _mm_mul_ps(m[7][0], v[0]);
    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r2 = _mm_hadd_ps(t0, t2);
    /* rows 4..7, second halves */
    t0 = _mm_mul_ps(m[4][1], v[1]);
    t1 = _mm_mul_ps(m[5][1], v[1]);
    t2 = _mm_mul_ps(m[6][1], v[1]);
    t3 = _mm_mul_ps(m[7][1], v[1]);
    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r3 = _mm_hadd_ps(t0, t2);
    /* add the two half-row partial sums to get the 8 results */
    r[0] = _mm_add_ps(r0, r1);
    r[1] = _mm_add_ps(r2, r3);
}
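The reduction pattern is easier to follow in a scalar model. The Python sketch below imitates _mm_mul_ps, _mm_hadd_ps and _mm_add_ps on 4-element lists and follows the same grouping as the function above (rows 0..3 and 4..7, first and second halves). The Python helper names and the sample data are illustrative only.

```python
def mul_ps(a, b):
    """Element-wise product, like _mm_mul_ps."""
    return [x * y for x, y in zip(a, b)]

def add_ps(a, b):
    """Element-wise sum, like _mm_add_ps."""
    return [x + y for x, y in zip(a, b)]

def hadd_ps(a, b):
    """Like _mm_hadd_ps: sums of adjacent pairs of a, then of b."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]

def mul8(m, v):
    """m: 8 rows, each a pair of 4-element halves; v: a pair of 4-element halves."""
    r = []
    for base in (0, 4):                  # rows 0..3, then rows 4..7
        partial = []
        for half in (0, 1):              # first and second half of each row
            t = [mul_ps(m[base + k][half], v[half]) for k in range(4)]
            # Three horizontal adds leave one partial dot product per row.
            partial.append(hadd_ps(hadd_ps(t[0], t[1]), hadd_ps(t[2], t[3])))
        r.append(add_ps(partial[0], partial[1]))
    return r

# Sample data: row i is filled with the value i + 1, and v = (1, ..., 8).
m = [([i + 1] * 4, [i + 1] * 4) for i in range(8)]
v = ([1, 2, 3, 4], [5, 6, 7, 8])
print(mul8(m, v))
```

Each output element is the dot product of one row with v, so with this sample data the results are multiples of 36.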
As for alignment, a variable of type __m128 should be automatically aligned on the stack. With dynamic memory, this is not a safe assumption. Some malloc / new implementations may only return memory guaranteed to be 8-byte aligned.
The intrinsics header provides _mm_malloc and _mm_free. The align parameter should be (16) in this case.
Intel has developed a Small Matrix Library for matrices with sizes ranging from 1×1 to 6×6. Application Note AP-930 Streaming SIMD Extensions - Matrix Multiplication describes in detail the algorithm for multiplying two 6×6 matrices. This should be adaptable to other size matrices with some effort.

how to swap array-elements to transfer the array from a column-like into a row-like representation

For example: the array
a1, a2, a3, b1, b2, b3, c1, c2, c3, d1, d2, d3
represents the following table
a1, b1, c1, d1
a2, b2, c2, d2
a3, b3, c3, d3
Now I would like to bring the array into the following form
a1, b1, c1, d1, a2, b2, c2, d2, a3, b3, c3, d3
Does an algorithm exist that takes the array (in the first form) and the dimensions of the table as input arguments and transforms the array into the second form?
I thought of an algorithm which doesn't need to allocate additional memory; I think it should be possible to do the job with element-swap operations.
The term you're looking for is in-place matrix transpose, and here's an implementation.
Wikipedia devotes an article to this process, which is called In-place Matrix Transposition.
http://en.wikipedia.org/wiki/In-place_matrix_transposition
This is nothing more than an in-place matrix transposition. Some pseudo-code, which as written applies to a square N x N matrix:
for n = 0 to N - 2
    for m = n + 1 to N - 1
        swap A(n,m) with A(m,n)
As you can see, you'll need 2 indices to access an element. This can be done by transforming (n,m) to nP+m with P being the number of columns.
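The pairwise swap only covers square matrices, while the table in the question is 3 x 4. For the rectangular case, one common approach is to follow the permutation cycles: in a row-major rows x cols array of n elements, the element at index i belongs at index (i * rows) mod (n - 1), with the first and last elements staying put. The sketch below is a minimal version of that idea, not a tuned implementation; the function name is made up, and it trades extra time (re-walking cycles to find their smallest index) for constant extra memory.

```python
def transpose_in_place(a, rows, cols):
    """Transpose a rows x cols row-major list in place into cols x rows."""
    n = rows * cols
    if len(a) != n:
        raise ValueError("array length does not match the dimensions")
    last = n - 1

    def dest(i):
        # Index where the element currently at i must end up.
        return i * rows % last

    for start in range(1, last):
        # Handle each cycle once, from its smallest index.
        j = dest(start)
        while j > start:
            j = dest(j)
        if j < start:
            continue
        # Rotate the cycle, carrying one displaced element at a time.
        carried = a[start]
        j = dest(start)
        while j != start:
            a[j], carried = carried, a[j]
            j = dest(j)
        a[start] = carried
    return a

# The question's array is 4 groups (a, b, c, d) of 3 elements each.
data = ["a1", "a2", "a3", "b1", "b2", "b3",
        "c1", "c2", "c3", "d1", "d2", "d3"]
print(transpose_in_place(data, 4, 3))
```

Note that for the question's array the call uses rows = 4 and cols = 3, because the input stores the four columns of the table one after another.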
Why bother? If they are laid out in a 1-D array and you know how many elements there are in a logical row/span then you can get sequentially at any index with a little arithmetic.
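/* elements is the number of items in one stored row/span */
int index(int row, int col, int elements)
{
    return ((row * elements) + col);
}
/* same data read with row and column swapped */
int inverted_index(int row, int col, int elements)
{
    return ((col * elements) + row);
}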
Then when you access the elements you can say something like...
array[index(row, col, elements)];
array[inverted_index(row, col, elements)];
I do most of my basic array manipulation like this for precisely the reason that I can transpose a matrix just by indexing it differently without any memory shuffling. It is also just about the fastest thing you can do with a computer.
You can follow the same principle and address your first array in terms that meet the needs of your final example with some of your own arithmetic.
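Applied to the array in the question, that arithmetic could look like the sketch below. It assumes the input stores the table column by column, so the stride is the number of table rows; the function names are made up for illustration. No element is moved: the second form is produced simply by reading in a different order.

```python
def column_like_index(row, col, n_rows):
    """Position of table cell (row, col) in the column-by-column array."""
    return col * n_rows + row

def read_row_by_row(a, n_rows, n_cols):
    """Read the column-by-column array in row-by-row order without moving data."""
    return [a[column_like_index(r, c, n_rows)]
            for r in range(n_rows)
            for c in range(n_cols)]

data = ["a1", "a2", "a3", "b1", "b2", "b3",
        "c1", "c2", "c3", "d1", "d2", "d3"]
print(data[column_like_index(1, 2, 3)])   # cell in table row 1, column 2
print(read_row_by_row(data, 3, 4))
```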