Linear discriminant analysis using alglib

I've been asked to perform a linear discriminant analysis on a set of data for one of my projects. I'm using ALGLIB (C++ version), which has a fisherlda function, but I need some help understanding how to use it.
The user answers a set of 6 questions (each answer is a number from 1 to 7), which gives me an example data set such as {1,2,3,4,5,6}. I then have 5 classes of 6 values each, e.g. {0.765, 0.895, 1.345, 2.456, 0.789, 5.678}.
The fisherlda function takes a 2-dimensional array of values and returns a 1-dimensional array of values (and I have no idea what they mean).
As I understand it, I need to see which class the user's answers fit best?
Any help understanding LDA and/or how I can use this function would be greatly appreciated.
EDIT:
Here's the definition of the function I'm trying to use:
/*************************************************************************
Multiclass Fisher LDA

Subroutine finds coefficients of linear combination which optimally separates
training set on classes.

INPUT PARAMETERS:
    XY          -   training set, array[0..NPoints-1,0..NVars].
                    First NVars columns store values of independent
                    variables, next column stores number of class (from 0
                    to NClasses-1) which dataset element belongs to. Fractional
                    values are rounded to nearest integer.
    NPoints     -   training set size, NPoints>=0
    NVars       -   number of independent variables, NVars>=1
    NClasses    -   number of classes, NClasses>=2

OUTPUT PARAMETERS:
    Info        -   return code:
                    * -4, if internal EVD subroutine hasn't converged
                    * -2, if there is a point with class number
                          outside of [0..NClasses-1].
                    * -1, if incorrect parameters was passed (NPoints<0,
                          NVars<1, NClasses<2)
                    *  1, if task has been solved
                    *  2, if there was a multicollinearity in training set,
                          but task has been solved.
    W           -   linear combination coefficients, array[0..NVars-1]

  -- ALGLIB --
     Copyright 31.05.2008 by Bochkanov Sergey
*************************************************************************/
void fisherlda(const real_2d_array &xy, const ae_int_t npoints, const ae_int_t nvars, const ae_int_t nclasses, ae_int_t &info, real_1d_array &w);

You are using the fisherlda function, which is an implementation of the LDA algorithm.
LDA (Linear Discriminant Analysis) aims to find a linear combination of features that best characterizes or separates two or more classes of objects or events.
Think of the projection y = w·x, where x is a vector of NVars feature values and w is a vector of coefficients. The 1-D array returned by fisherlda is w. You can project the user's answers onto this direction and compare the result with the projections of the training classes to decide which class the answers belong to.
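To make the projection idea concrete, here is a minimal C++ sketch that does not call ALGLIB at all. It assumes you already have the coefficient vector w (for example, copied out of the array that fisherlda fills) and one representative vector per class. The names project and nearestClass, and the rule of picking the class whose projected mean is closest, are illustrative assumptions and not part of ALGLIB.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Project a feature vector x onto the direction w: y = w . x
double project(const std::vector<double>& w, const std::vector<double>& x) {
    assert(w.size() == x.size());
    double y = 0.0;
    for (std::size_t i = 0; i < w.size(); ++i) y += w[i] * x[i];
    return y;
}

// Hypothetical decision rule: return the index of the class whose projected
// mean lies closest to the projected sample on the line defined by w.
std::size_t nearestClass(const std::vector<double>& w,
                         const std::vector<std::vector<double>>& classMeans,
                         const std::vector<double>& sample) {
    const double ys = project(w, sample);
    std::size_t best = 0;
    double bestDist = std::numeric_limits<double>::infinity();
    for (std::size_t c = 0; c < classMeans.size(); ++c) {
        const double d = std::fabs(project(w, classMeans[c]) - ys);
        if (d < bestDist) {
            bestDist = d;
            best = c;
        }
    }
    return best;
}
```

With 6 questions, w and every vector would have 6 entries. Note that a single direction w may not separate 5 classes well, so treat the result as a first approximation.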

Related

Univac Math pack subroutines in old-school FORTRAN (pre-77)

I have been looking at an engineering paper here which describes an old FORTRAN code for solving pipe flow equations (it's dated 1974, before FORTRAN was standardised as Fortran 77). On page 42 of this document the old code calls the following subroutine:
C SYSTEM SUBROUTINE FROM UNIVAC MATH-PACK TO
C SOLVE LINEAR SYSTEM OF EQ.
CALL GJR(A,51,50,NP,NPP,$98,JC,V)
It's a bit of a long shot, but do any veterans or ancient code buffs recall this system subroutine and its input arguments? I'm having trouble finding any information about it.
If I can adapt the old code to my current application, I may rewrite it in C++ or VBA, and will be looking for an equivalent function in those languages.

I'll add to this answer if I find anything more detailed, but I have a place to start looking for the arguments to GJR.
This function is part of the Sperry UNIVAC MATH-PACK library. A full list of functions in the library can be found in http://www.dtic.mil/dtic/tr/fulltext/u2/a170611.pdf. GJR is described there as "determinant; inverse; solution of simultaneous equations". Marginally helpful.
A better description comes from http://nvlpubs.nist.gov/nistpubs/jres/74B/jresv74Bn4p251_A1b.pdf
A FORTRAN subroutine, one of the Univac 1108 Math Pack programs,
available on the library tapes at the University of Maryland computing
center. It solves simultaneous equations, computes a determinant, or
inverts a matrix or any combination of the three above by using a
Gauss-Jordan elimination technique with column pivoting.
This is slightly more useful, but what we really want is "MATH-PACK, Programmer Reference", UP-7542 Rev. 1 from Sperry-UNIVAC (Unisys). I find a lot of references to this document but no full-text PDF of the document itself.
I'd take a look at the arguments in the function call, how they are set up and how the results are used, then look for equivalent routines in LAPACK or BLAS. See http://www.netlib.org/lapack/
I have a few books on piping networks including "Analysis of Flow in Pipe Networks" by Jeppson (same author as in the original PDF hosted by USU) https://books.google.com/books/about/Analysis_of_flow_in_pipe_networks.html?id=peZSAAAAMAAJ - I'll see if I can dig that up. The book may have a more portable matrix solver than the proprietary Sperry-UNIVAC library.
Update:
From p. 41 of http://ngds.egi.utah.edu/files/GL04099/GL04099_1.pdf I found documentation for the CGJR function, the complex version of GJR from the same library. It is likely the only difference in the arguments is variable type (COMPLEX instead of REAL):
CGJR is a subroutine which solves simultaneous equations, computes a determinant, inverts a matrix, or does any combination of these three operations, by using a Gauss-Jordan elimination technique with column pivoting.
The procedure for using CGJR is as follows:
Calling statement: CALL CGJR(A,NC,NR,N,MC,$K,JC,V)
where
A is the matrix whose inverse or determinant is to be determined. If simultaneous equations are solved, the last MC-N columns of the matrix are the constant vectors of the equations to be solved. On output, if the inverse is computed, it is stored in the first N columns of A. If simultaneous equations are solved, the last MC-N columns contain the solution vectors. A is a complex array.
NC is an integer representing the maximum number of columns of the array A.
NR is an integer representing the maximum number of rows of the array A.
N is an integer representing the number of rows of the array A to be operated on.
MC is the number of columns of the array A, representing the coefficient matrix if simultaneous equations are being solved; otherwise it is a dummy variable.
K is a statement number in the calling program to which control is returned if an overflow or singularity is detected.
1) If an overflow is detected, JC(1) is set to the negative of the last correctly completed row of the reduction and control is then returned to statement number K in the calling program.
2) If a singularity is detected, JC(1) is set to the number of the last correctly completed row, and V is set to (0.,0.) if the determinant was to be computed. Control is then returned to statement number K in the calling program.
JC is a one dimensional permutation array of N elements which is used for permuting the rows and columns of A if an inverse is being computed. If an inverse is not computed, this array must have at least one cell for the error return identification. On output, JC(1) is N if control is returned normally.
V is a complex variable. On input REAL(V) is the option indicator, set as follows:
1. invert matrix
2. compute determinant
3. do 1. and 2.
4. solve system of equations
5. do 1. and 4.
6. do 2. and 4.
7. do 1., 2. and 4.
Notes on usage of row dimension arguments N and NR:
The arguments N and NR refer to the row dimensions of the A matrix.
N gives the number of rows operated on by the subroutine, while NR
refers to the total number of rows in the matrix as dimensioned by the
calling program. NR is used only in the dimension statement of the
subroutine. Through proper use of these parameters, the user may specify that only a submatrix, instead of the entire matrix, be operated on by the subroutine.
In your application (pipe flow), look at how matrix A and vector V are populated before the call to GJR and how they are used after the call.
You may be able to replace the call to GJR with a call to LAPACK's SGESV or DGESV without much difficulty.
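If you end up porting to C++ rather than calling LAPACK, the core of such a routine is small. The sketch below is not the UNIVAC code. It is a plain Gauss-Jordan elimination with partial pivoting (row swaps), operating on an augmented matrix laid out the way the CGJR documentation describes: the first n columns hold the coefficients, the remaining mc-n columns hold the right-hand sides and are overwritten with the solutions. The function name and the eps tolerance are my own assumptions. If GJR's arguments really do match CGJR's, the 51 and 50 in the call from the question would correspond to 50 equations plus one right-hand-side column.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Reduces the n x mc augmented matrix [A | B] in place. On success the
// columns n..mc-1 hold the solution vectors and the function returns true.
// Returns false when no usable pivot is found, i.e. A is (numerically)
// singular, which plays the role of the $K error return in the Fortran call.
bool gaussJordanSolve(Matrix& a, std::size_t n, std::size_t mc, double eps = 1e-12) {
    for (std::size_t col = 0; col < n; ++col) {
        // Partial pivoting: pick the row with the largest entry in this column.
        std::size_t pivot = col;
        for (std::size_t r = col + 1; r < n; ++r)
            if (std::fabs(a[r][col]) > std::fabs(a[pivot][col])) pivot = r;
        if (std::fabs(a[pivot][col]) < eps) return false;
        std::swap(a[col], a[pivot]);

        // Scale the pivot row so that the pivot becomes 1.
        const double p = a[col][col];
        for (std::size_t c = 0; c < mc; ++c) a[col][c] /= p;

        // Eliminate this column from every other row.
        for (std::size_t r = 0; r < n; ++r) {
            if (r == col) continue;
            const double f = a[r][col];
            if (f == 0.0) continue;
            for (std::size_t c = 0; c < mc; ++c) a[r][c] -= f * a[col][c];
        }
    }
    return true;
}
```

For anything beyond small systems, DGESV from LAPACK remains the better choice. This sketch is only meant to show what the legacy call computes.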
Aside: The Fortran community really needs a drop-in 'Rosetta library' that wraps LAPACK, etc. for replacing legacy/proprietary IBM, UNIVAC, and Numerical Recipes math functions. The perfect case would be that maintainers would replace legacy functions with de facto standard math functions but in the real world, many of these older programs are un(der)maintained and there simply isn't the will (or, as in this case, the ability) to update them.
Update 2:
I started work on a compatibility library for the Sperry MATH-PACK and STAT-PACK routines as well as a few other legacy libraries, posted at https://bitbucket.org/apthorpe/alfc
Further, I located my copy of Jeppson's Analysis of Flow in Pipe Networks which is a slightly more legible version of the PDF of Steady Flow Analysis of Pipe Networks: An Instructional Manual and modernized the codes listed in the text. I have posted those at https://bitbucket.org/apthorpe/jeppson_pipeflow
Note that I found a number of errors in both the code listings and in the example problems given for many of the codes. If you're trying to learn how to write a pipe flow solver based on Jeppson's paper or text, I'd strongly suggest reviewing my updated codes and test cases because they will save you hours of effort trying to understand why the code doesn't work and why you can't replicate the example cases. This took a fair amount of forensic computing to sort out.
Update 3:
The source to CGJR and DGJR can be found in http://www.dtic.mil/dtic/tr/fulltext/u2/a110089.pdf. DGJR is the closest to what you want, though it references more routines that aren't available (proprietary UNIVAC error-handling routines). It should be easy to convert DGJR to single precision and skip the proprietary calls. Otherwise, use the compatibility library mentioned above.

addWeighted in OpenCV

I came across the function addWeighted in OpenCV, where it was mentioned that it:
Calculates the weighted sum of two arrays.
Does that mean we multiply the pixels in the first array by some weight, do likewise for the second array, and then simply sum the corresponding pixel values together?
Thanks.

From the OpenCV documentation:
http://docs.opencv.org/modules/core/doc/operations_on_arrays.html
Your answer is not completely correct (unless your gamma is 0) because you also have to add the gamma value.

Yes, as it says there in the docs:
The function addWeighted calculates the weighted sum of two arrays
as follows:
dst(I) = saturate(src1(I)*alpha + src2(I)*beta + gamma)
where I is a multi-dimensional index of array elements. In case of
multi-channel arrays, each channel is processed independently.
The function can be replaced with a matrix expression:
dst = src1*alpha + src2*beta + gamma;
where saturate is the saturate_cast<>() conversion function (which performs saturation as opposed to modular arithmetic that wraps around)
You can always check the source as well:
https://github.com/Itseez/opencv/blob/2.4/modules/core/src/arithm.cpp#L2114
The function has multiple execution paths depending on how you build it (what optimizations are available: SSE2, NEON, unrolled version, and then finally a fallback implementation) and the data types involved.
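To see what the formula does per element, here is a small sketch in plain C++ for 8-bit data. It is not OpenCV's implementation. The names saturateToU8 and addWeightedSketch are made up for illustration, and rounding with std::round is an assumption (OpenCV's saturate_cast may treat exact .5 values differently).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Round, then clamp to [0, 255] instead of wrapping around modulo 256.
std::uint8_t saturateToU8(double v) {
    const double r = std::round(v);
    return static_cast<std::uint8_t>(std::min(255.0, std::max(0.0, r)));
}

// dst(i) = saturate(src1(i)*alpha + src2(i)*beta + gamma), element by element.
std::vector<std::uint8_t> addWeightedSketch(const std::vector<std::uint8_t>& src1, double alpha,
                                            const std::vector<std::uint8_t>& src2, double beta,
                                            double gamma) {
    assert(src1.size() == src2.size());
    std::vector<std::uint8_t> dst(src1.size());
    for (std::size_t i = 0; i < src1.size(); ++i)
        dst[i] = saturateToU8(src1[i] * alpha + src2[i] * beta + gamma);
    return dst;
}
```

For example, with alpha = beta = 1 and gamma = 0, the inputs 200 and 200 give 255 rather than 400 or 400 mod 256 = 144. That clamping is the saturation the documentation refers to.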

Transform a 1 by 1 matrix into a variable or scalar

How do I transform a 1 by 1 matrix into a variable or scalar? At the moment I have two matrices which are both 1 by 1, so in principle they are scalars. I would like to divide one of the values (which is the 1 by 1 matrix) by the other value (which is the other 1 by 1 matrix).
I've read that one can do something like that
C[`i',`j']= A[`i',`j']/B[`i',`j']
to do element by element operations in Stata. In this example one would loop over i and j. Unfortunately, it did not work.

In Stata, variables and scalars are two different things. Variables are set up as columns in a Stata database; almost always the subject of some statistical analysis. A scalar is a storage type that holds some expression, be it numeric or string.
The code you show appears to be from this page: http://www.stata.com/support/faqs/data-management/element-by-element-operations-on-matrices/, but you only post one part. That part makes use of local macros, but nowhere do you seem to define them. Furthermore, if you have a matrix with only one element, then you need not loop over the indices of the matrix. Its only element is held in position [1,1].
Below is an example of two matrices with one element each, whose division is saved in a scalar.
clear all
set more off
matrix A = (1)
matrix B = (2)
scalar c = A[1,1]/B[1,1]
display "scalar c is: " c
Stata has its own matrix language, Mata, in case you need "advanced" matrix features.
See at least help macro, help scalar, help matrix, help forvalues and help mata.

Alglib: solving A * x = b in a least squares sense

I have a somewhat complicated algorithm that requires the fitting of a quadric to a set of points. This quadric is given by its parametrization (u, v, f(u,v)), where f(u,v) = au^2+bv^2+cuv+du+ev+f.
The coefficients of the f(u,v) function need to be found since I have a set of exactly 6 constraints this function should obey. The problem is that this set of constraints, although yielding a problem like A*x = b, is not completely well behaved to guarantee a unique solution.
Thus, to cut it short, I'd like to use alglib's facilities to somehow either determine A's pseudoinverse or directly find the best fit for the x vector.
Apart from computing the SVD, is there a more direct algorithm implemented in this library that can solve a system in a least squares sense (again, apart from the SVD or from using the naive inv(transpose(A)*A)*transpose(A)*b formula for general least squares problems where A is not a square matrix)?

Found the answer through some careful documentation browsing:
rmatrixsolvels( A, noRows, noCols, b, singularValueThreshold, info, solverReport, x)
The documentation states that the singular value threshold is a clamping threshold: any singular value in the S matrix of the SVD that falls below it is set to 0. Thus it should be a scalar between 0 and 1.
Hopefully, it will help someone else too.
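For anyone setting up the same problem, here is a hedged sketch of how the system A*x = b from the question can be assembled before handing it to the solver. It uses only the standard library. The Sample struct and the function names are hypothetical, and the unknown vector is assumed to be ordered as x = (a, b, c, d, e, f) to match f(u,v) = au^2+bv^2+cuv+du+ev+f.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for one constraint: the surface must pass
// through height f at parameter position (u, v).
struct Sample {
    double u, v, f;
};

// One row of A: the coefficients that multiply (a, b, c, d, e, f).
std::array<double, 6> designRow(double u, double v) {
    return std::array<double, 6>{{u * u, v * v, u * v, u, v, 1.0}};
}

// Build A (one row per constraint) and b (one target value per constraint).
void assemble(const std::vector<Sample>& samples,
              std::vector<std::array<double, 6>>& A,
              std::vector<double>& b) {
    A.clear();
    b.clear();
    for (const Sample& s : samples) {
        A.push_back(designRow(s.u, s.v));
        b.push_back(s.f);
    }
}
```

The rows and targets would then be copied into the library's own array types before calling rmatrixsolvels. Nearly dependent rows in A are exactly the situation in which the singular value threshold matters.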

Removing unsolvable equations from an underdetermined system

My program tries to solve a system of linear equations. In order to do that, it assembles matrix coeff_matrix and vector value_vector, and uses Eigen to solve them like:
Eigen::VectorXd sol_vector = coeff_matrix
.colPivHouseholderQr().solve(value_vector);
The problem is that the system can be both over- and under-determined. In the former case, Eigen gives either a correct or an incorrect solution, and I check the solution using coeff_matrix * sol_vector - value_vector.
However, please consider the following system of equations:
a + b - c = 0
c - d = 0
c = 11
- c + d = 0
In this particular case, Eigen solves the three latter equations correctly but also gives solutions for a and b.
What I would like to achieve is that only the equations which have only one solution would be solved, and the remaining ones (the first equation here) would be retained in the system.
In other words, I'm looking for a method to find out which equations can be solved in a given system of equations at the time, and which cannot because there will be more than one solution.
Could you suggest any good way of achieving that?
Edit: please note that in most cases the matrix won't be square. I've added one more row here just to note that over-determination can happen too.

I think what you want is the singular value decomposition (SVD), which will give you exactly what you are after. After the SVD, "the equations which have only one solution will be solved", and the solution is given by the pseudoinverse. It will also give you the null space (where infinitely many solutions come from) and the left null space (where inconsistency, i.e. no solution, comes from).

Based on the SVD comment, I was able to do something like this:
Eigen::FullPivLU<Eigen::MatrixXd> lu = coeff_matrix.fullPivLu();
Eigen::VectorXd sol_vector = lu.solve(value_vector);
Eigen::VectorXd null_vector = lu.kernel().rowwise().sum();
AFAICS, the null_vector rows corresponding to single solutions are 0s while the ones corresponding to non-determinate solutions are 1s. I can reproduce this throughout all my examples with the default threshold Eigen has.
However, I'm not sure if I'm doing something correct or just noticed a random pattern.

What you need is to calculate the determinant of your system. If the determinant is 0, then you have either no solution or an infinite number of solutions. If the determinant is very small, the solution exists, but I wouldn't trust the solution found by a computer (it will lead to numerical instabilities).
Here is a link to what is the determinant and how to calculate it: http://en.wikipedia.org/wiki/Determinant
Note that Gaussian elimination should also work: http://en.wikipedia.org/wiki/Gaussian_elimination
With this method, you end up with lines of 0s if there are an infinite number of solutions.
Edit
In case the matrix is not square, you first need to extract a square matrix. There are two cases:
You have more variables than equations: then you have either no solution, or an infinite number of them.
You have more equations than variables: in this case, find a square sub-matrix with a non-zero determinant. Solve for this matrix and check the solution. If the solution doesn't fit, it means you have no solution. If the solution fits, it means the extra equations were linearly dependent on the extracted ones.
In both cases, before checking the dimensions of the matrix, remove rows and columns with only 0s.
As for Gaussian elimination, it should work directly with non-square matrices. However, this time, you should check that the number of non-empty rows (i.e. rows with some non-0 coefficients) is equal to the number of variables. If it's less, you have an infinite number of solutions (provided the system is consistent). If a row ends up with only 0 coefficients but a non-0 right-hand side, you don't have any solution.
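The Gaussian elimination idea can also answer the original question of which unknowns are pinned down. The sketch below is plain C++ with no Eigen calls. It reduces the augmented matrix [A | b] to reduced row echelon form and reports an unknown as uniquely determined when its column has a pivot and its pivot row has no entries in free (non-pivot) columns. The function name and the eps tolerance are assumptions made for illustration. The same information can be read off a null-space basis: an unknown is uniquely determined exactly when its row in that basis is all zeros, so checking the absolute values of each row is safer than summing a row, because entries of opposite sign can cancel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Input: augmented matrix [A | b]; each row has vars + 1 entries.
// Output: determined[j] is true when x_j has the same value in every solution.
// Returns false when the system is inconsistent (no solution at all).
bool uniquelyDetermined(Matrix m, std::size_t vars, std::vector<bool>& determined,
                        double eps = 1e-9) {
    const std::size_t rows = m.size();
    std::vector<std::size_t> pivotRowOfCol(vars, rows);  // rows means "no pivot"
    std::size_t row = 0;
    for (std::size_t col = 0; col < vars && row < rows; ++col) {
        // Choose the largest available entry in this column as pivot.
        std::size_t best = row;
        for (std::size_t r = row + 1; r < rows; ++r)
            if (std::fabs(m[r][col]) > std::fabs(m[best][col])) best = r;
        if (std::fabs(m[best][col]) < eps) continue;  // free column
        std::swap(m[row], m[best]);
        const double p = m[row][col];
        for (std::size_t c = 0; c <= vars; ++c) m[row][c] /= p;
        // Clear this column in all other rows (reduced row echelon form).
        for (std::size_t r = 0; r < rows; ++r) {
            if (r == row) continue;
            const double f = m[r][col];
            for (std::size_t c = 0; c <= vars; ++c) m[r][c] -= f * m[row][c];
        }
        pivotRowOfCol[col] = row++;
    }
    // Remaining rows have zero coefficients; a non-zero right-hand side
    // there means the equations contradict each other.
    for (std::size_t r = row; r < rows; ++r)
        if (std::fabs(m[r][vars]) > eps) return false;

    determined.assign(vars, false);
    for (std::size_t j = 0; j < vars; ++j) {
        if (pivotRowOfCol[j] == rows) continue;  // free variable
        bool ok = true;
        for (std::size_t c = 0; c < vars; ++c)
            if (c != j && std::fabs(m[pivotRowOfCol[j]][c]) > eps) ok = false;
        determined[j] = ok;
    }
    return true;
}
```

For the four equations in the question, with unknowns ordered (a, b, c, d), this marks c and d as determined and leaves a and b open, since the reduced system is a + b = 11, c = 11, d = 11.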
