I'm trying to implement some code from here. I have trained the HMM with my coefficients, but I do not understand how the Viterbi decoder algorithm works. For example:
viterbi_decode(MFCC, M, model, q);
where MFCC = coefficients
M = size of MFCC
model = model of the HMM trained using the MFCC coefficients
q = unknown (believed to be the output path).
But here is what I do not understand: I am attempting to compare two speech signals (training vs. sample) to find the closest possible match. With the DTW algorithm, for example, a single integer is returned, so I can easily find the closest match; this algorithm, however, returns an int* array, which makes the comparison difficult.
Here is how the current program works:
vector<DIMENSIONS_2> MFCC = mfcc.transform(rawData, sample_rate); // extract MFCC frames
int N = MFCC.size(); // number of frames
int M = 13;          // coefficients per frame
double** mfcc_setup = setupHMM(MFCC, N, M); // repack the frames for the HMM API
model_t* model = hmm_init(mfcc_setup, N, M, 10); // HMM with 10 hidden states
hmm_train(mfcc_setup, N, model);
int* q = new int[N]; // receives the decoded state sequence
viterbi_decode(mfcc_setup, M, model, q);
Could anyone please tell me how the Viterbi decoder works for the problem of identifying the best path to take from the training to the input? I've tried both the Euclidean distance and the Hamming distance on the decoded path (q), but had no luck.
Any help would be greatly appreciated.
In this example it seems to me that q is the hidden state sequence, i.e. a list of numbers from 0 to 9. If you have two audio samples, say test and train, and you generate two sequences q_test and q_train, then looking at |q_test - q_train|, where the norm is a componentwise distance, is not useful, because it does not represent a notion of distance correctly: the hidden state labels in an HMM may be arbitrary.
A more natural way to think about distance may be the following: given q_train, you are interested in the probability that your test sample took that same path, which you can compute once you have the transition matrix and the emission probabilities.
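For illustration, here is a minimal sketch of that computation, independent of the library above; the accessors logPi, logA and emission_logprob are hypothetical stand-ins for whatever your model_t actually exposes (e.g. Gaussian emission densities over MFCC frames):

#include <cstddef>
#include <functional>
#include <vector>

// Score a fixed state path q under an HMM, in log space to avoid underflow.
// logPi[i]               : log initial probability of state i
// logA[i][j]             : log transition probability from state i to state j
// emission_logprob(t, i) : log density of observation frame t under state i
double path_logprob(const std::vector<int>& q,
                    const std::vector<double>& logPi,
                    const std::vector<std::vector<double> >& logA,
                    const std::function<double(int, int)>& emission_logprob)
{
    double lp = logPi[q[0]] + emission_logprob(0, q[0]);
    for (std::size_t t = 1; t < q.size(); ++t)
        lp += logA[q[t - 1]][q[t]] + emission_logprob(t, q[t]);
    return lp; // larger (less negative) means the path is more plausible
}

You could then, for example, score the decoded test path under each trained model and pick the model giving the highest log-probability, which plays the role the single DTW integer played.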
Please let me know if I am misunderstanding your question.
I am trying to do a weighted sum of matrices in TensorFlow.
Unfortunately, my dimensions are not small, and I have a problem with memory. The other possibility is that I am doing something completely wrong.
I have two tensors, U with shape (B,F,M) and A with shape (C,B). I would like to do a weighted sum followed by stacking.
Weighted sum
For each index c in C, I have a vector of weights a from A, with shape (B,).
I want to use it to compute a weighted sum over U and get a matrix U_t with shape (F, M). This is much the same as this, where I found some help.
Concatenation
I want to do this for each vector a in A, giving C matrices U_tc in a list, each with the shape (F,M) mentioned above. After that, I concatenate all the matrices in the list to get a super-matrix with shape (C*F,M).
My values are C=2500, M=500, F=80, B=300.
In the beginning, I tried a very naive approach with many loops and element selections, which generated a huge number of operations.
Now, with help from this, I have the following:
U = tf.Variable(tf.truncated_normal([B, F, M], stddev=1.0, dtype=tf.float32))  # just for example
A = tf.Variable(tf.truncated_normal([C, B], stddev=1.0, dtype=tf.float32))     # just for example
U_t = []
for ccc in xrange(C):
    a = A[ccc, :]
    a_broadcasted = tf.tile(tf.reshape(a, [B, 1, 1]), tf.stack([1, F, M]))
    U_t.append(tf.reduce_sum(tf.multiply(U, a_broadcasted), axis=0))

U_tcs = tf.concat(U_t, axis=0)
Unfortunately, this fails with a memory error. I am not sure if I did something wrong, or whether the computation simply has too many operations. I think the variables themselves aren't too large for memory, right? At least, I had larger variables before and it was OK. (I have 16 GB of GPU memory.)
Am I doing that weighted sum correctly?
Any idea how to do it more efficiently?
I will appreciate any help. Thanks.
1. Weighted sum and Concatenation
You can use vector operations directly without loops when memory is not limited.
import tensorflow as tf

C, M, F, B = 2500, 500, 80, 300
U = tf.Variable(tf.truncated_normal([B, F, M], stddev=1.0, dtype=tf.float32))  # just for example
A = tf.Variable(tf.truncated_normal([C, B], stddev=1.0, dtype=tf.float32))     # just for example
# shape = (C,B,1,1)
A_new = tf.expand_dims(tf.expand_dims(A, -1), -1)
# shape = (C,F,M): broadcasting gives (C,B,F,M), then we sum over B
U_t = tf.reduce_sum(tf.multiply(A_new, U), axis=1)
# shape = (C*F,M)
U_tcs = tf.reshape(U_t, (C*F, M))
2. Memory error
In fact, I also had memory errors when I ran the above code.
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2500,300,80,500]...
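That shape is exactly the broadcast intermediate that tf.multiply materialises: 2500 x 300 x 80 x 500 float32 values come to roughly 120 GB, so the fully vectorised version cannot fit in GPU memory at these sizes, no matter how small the variables themselves are.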
With a small modification of the above code, it runs properly within my 8 GB of GPU memory.
import tensorflow as tf

C, M, F, B = 2500, 500, 80, 300
U = tf.Variable(tf.truncated_normal([B, F, M], stddev=1.0, dtype=tf.float32))  # just for example
A = tf.Variable(tf.truncated_normal([C, B], stddev=1.0, dtype=tf.float32))     # just for example
# shape = (C,B,1,1)
A_new = tf.expand_dims(tf.expand_dims(A, -1), -1)
U_t = []
for ccc in range(C):
    a = A_new[ccc, :]                                      # shape (B,1,1)
    u_weighted = tf.reduce_sum(tf.multiply(a, U), axis=0)  # shape (F,M)
    U_t.append(u_weighted)

U_tcs = tf.concat(U_t, axis=0)                             # shape (C*F,M)
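Each loop iteration only materialises one (B,F,M) product (about 48 MB in float32) plus one (F,M) result, so peak memory stays small; the price is a graph containing C copies of these ops. Alternatively, note that the whole computation is a single matrix product over the B axis, so something like tf.reshape(tf.matmul(A, tf.reshape(U, [B, F*M])), [C*F, M]) should give the same U_tcs without either the giant broadcast or the Python loop.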
I am working on a Structure from Motion framework for generating 3D models of moving objects with fixed cameras. To do this, I am following this pipeline:
Obtain the fundamental matrix F through keypoints. As I am still in development and cannot (yet) access the final objects I will model, I am doing this with a small object and manually annotated keypoint pairs between two images.
These are the correspondences used to calculate F. Computing the error of F as x'Fx, where x' are the points from the right image and x the points from the left image (in pixel coordinates), gives an error of 0.1196.
Compute the essential matrix E using the intrinsic matrices and the fundamental matrix: E = Kleft'*F*Kright, where Kleft' is the inverse intrinsic matrix of the left camera. Using the SVD decomposition, I create a new E whose two non-zero singular values are both set to 1.
Decompose E to get R and t. To do so, I used a custom version of the OpenCV recoverPose() function that allows for two different camera matrices.
Obtain Pleft as the identity pose [I|0] and Pright as the composition of R and t.
Here are the significant parts of the code:
F = findFundamentalMat( alignedLeft.points, alignedRight.points, mask, RANSAC);
E = cameraMatrixLeftInv*F*cameraMatrixRight;
SVD::compute(E, w, u, vt);
Mat diag = Mat::zeros(Size(3,3), CV_64F); // 6 == CV_64F
diag.at<double>(0,0) = 1;
diag.at<double>(1,1) = 1;
E = u*diag*vt;
int good = recoverPoseFromTwoCameras(E,alignedLeft.points,alignedRight.points,intrinsicsLeft.K,intrinsicsRight.K,R,t,mask);
Pleft = Matx34f::eye();
Pright = Matx34f(R.at<double>(0,0), R.at<double>(0,1), R.at<double>(0,2), t.at<double>(0),
R.at<double>(1,0), R.at<double>(1,1), R.at<double>(1,2), t.at<double>(1),
R.at<double>(2,0), R.at<double>(2,1), R.at<double>(2,2), t.at<double>(2));
Then I use viz to visualize the camera poses:
viz::Viz3d myWindow("Results");
viz::WCameraPosition cameraLeft(imgLeft->getIntrinsics().K,imgLeft->getImage());
viz::WCameraPosition cameraRight(imgRight->getIntrinsics().K,imgRight->getImage());
I place them in the viewer using Pleft and Pright:
myWindow.showWidget("cameraLeft",cameraLeft,Affine3d(Pleft(Range::all(),Range(0,3)),Pleft.col(3)));
myWindow.showWidget("cameraRight",cameraRight,Affine3d(Pright(Range::all(),Range(0,3)),Pright.col(3)));
However, if I do that, the result is inverted: camera 1 is where camera 2 should be, and vice versa.
But if I apply the matrices like this:
myWindow.showWidget("cameraLeft",cameraLeft,Affine3d(Pright(Range::all(),Range(0,3)),Pleft.col(3)));
myWindow.showWidget("cameraRight",cameraRight,Affine3d(Pleft(Range::all(),Range(0,3)),Pleft.col(3)));
The result is correct.
What am I missing?
I hope you have worked it out. If you have not, you should upload more data: for example, the features, the F/E you computed, and the camera intrinsics. Then we can test your code and try to find your bug; otherwise, there are too many possible causes.
I'm following this tutorial, which uses Features2D + Homography. If I have a known camera matrix for each image, how can I optimize the result? I tried some images, but it didn't work well.
Edit:
After reading some materials, I think I should rectify the two images first. But the rectification is not perfect, so a vertical line on image 1 generally corresponds to a vertical band on image 2. Are there any good algorithms?
I'm not sure if I understand your problem. Do you want to find corresponding points between the images, or do you want to improve the correctness of your matches by using the camera intrinsics?
In principle, in order to use camera geometry for finding matches, you would need the fundamental or essential matrix, depending on whether you know the camera intrinsics (i.e. calibrated cameras). That means you would need an estimate of the relative rotation and translation of the cameras. Then, by computing the epipolar lines corresponding to the features found in one image, you would search along those lines in the second image to find the best match. However, I think it would be better to simply rely on automatic feature matching. Given the fundamental/essential matrix, you could try your luck with correctMatches, which will move the correspondences such that the reprojection error is minimised.
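For reference, a minimal sketch of how correctMatches could be wired up, assuming you already have F and the matched points (the 1xN two-channel layout is the shape the function expects):

#include <opencv2/calib3d/calib3d.hpp>
#include <vector>

using namespace cv;

// Nudge matched points so that they satisfy the epipolar constraint exactly,
// moving each point as little as possible (minimising reprojection error).
void refineCorrespondences(const Mat& F,
                           const std::vector<Point2f>& pts1,
                           const std::vector<Point2f>& pts2,
                           Mat& refined1, Mat& refined2)
{
    // correctMatches expects 1xN arrays of 2-channel points
    correctMatches(F, Mat(pts1).reshape(2, 1), Mat(pts2).reshape(2, 1),
                   refined1, refined2);
}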
Tips for better matches
To increase the stability and saliency of automatic matches, it usually pays to
Adjust the parameters of the feature detector
Try different detection algorithms
Perform a ratio test to filter out those keypoints which have a very similar second-best match and are therefore unstable. This is done like this:
Mat descriptors_1, descriptors_2; // obtained from the feature detector
BFMatcher matcher(NORM_L2, false); // norm depends on the feature descriptor
vector<DMatch> matches;
vector<vector<DMatch> > match_candidates;
const float ratio = 0.8f; // or something similar
matcher.knnMatch(descriptors_1, descriptors_2, match_candidates, 2);
for (size_t i = 0; i < match_candidates.size(); i++)
{
  if (match_candidates[i][0].distance < ratio * match_candidates[i][1].distance)
    matches.push_back(match_candidates[i][0]);
}
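Note that for the ratio test the matcher must be constructed with crossCheck=false (as above): knnMatch needs to return the two best candidates per descriptor, and cross-checking is an alternative filtering strategy that only makes sense with k=1.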
A more involved way of filtering would be to compute the reprojection error for each keypoint in the first frame. This means computing the corresponding epipolar line in the second image and then checking how far its supposed matching point is from that line. Throwing away those points whose distance exceeds some threshold removes the matches which are incompatible with the epipolar geometry (which I assume to be known here). Computing the error can be done like this (I honestly do not remember where I took this code from, and I may have modified it a bit):
double computeReprojectionError(vector<Point2f>& imgpts1, vector<Point2f>& imgpts2, Mat& inlier_mask, const Mat& F)
{
  double err = 0;
  vector<Vec3f> lines[2];
  int npt = sum(inlier_mask)[0];

  // strip outliers so validation is constrained to the correspondences
  // which were used to estimate F
  vector<Point2f> imgpts1_copy(npt),
                  imgpts2_copy(npt);
  int c = 0;
  for (int k = 0; k < inlier_mask.size().height; k++)
  {
    if (inlier_mask.at<uchar>(k) == 1)
    {
      imgpts1_copy[c] = imgpts1[k];
      imgpts2_copy[c] = imgpts2[k];
      c++;
    }
  }

  Mat imgpt[2] = { Mat(imgpts1_copy), Mat(imgpts2_copy) };
  computeCorrespondEpilines(imgpt[0], 1, F, lines[0]);
  computeCorrespondEpilines(imgpt[1], 2, F, lines[1]);
  for (int j = 0; j < npt; j++)
  {
    // the error is the distance between a point u_l = (x,y) and the epipolar
    // line of its corresponding point u_r in the other image, plus the
    // reverse, so errij = d(u_l, F^T * u_r) + d(u_r, F * u_l)
    Point2f u_l = imgpts1_copy[j], // for the purpose of this function, imgpts1
            u_r = imgpts2_copy[j]; // is the "left" image and imgpts2 the
                                   // "right" one; it doesn't make a difference
    float a2 = lines[1][j][0],     // epipolar line in the left image
          b2 = lines[1][j][1],
          c2 = lines[1][j][2];
    float norm_factor2 = sqrt(pow(a2, 2) + pow(b2, 2));
    float a1 = lines[0][j][0],     // epipolar line in the right image
          b1 = lines[0][j][1],
          c1 = lines[0][j][2];
    float norm_factor1 = sqrt(pow(a1, 2) + pow(b1, 2));

    // distance of (x,y) to the line (a,b,c) is |ax + by + c| / sqrt(a^2 + b^2)
    double errij =
      fabs(u_l.x * a2 + u_l.y * b2 + c2) / norm_factor2 +
      fabs(u_r.x * a1 + u_r.y * b1 + c1) / norm_factor1;
    err += errij; // at this point, apply the threshold and mark bad matches
  }
  return err / npt;
}
The point is: grab the fundamental matrix, use it to compute the epilines for all the points, and then compute the distance (the lines are given in a parametric form, so you need to do a little algebra to get the distance). This is somewhat similar in outcome to what findFundamentalMat with the RANSAC method does. It returns a mask in which, for each match, there is either a 1, meaning that the match was used to estimate the matrix, or a 0 if it was thrown out. But estimating the fundamental matrix like this will probably be less accurate than using chessboards.
EDIT: Looks like oarfish beat me to it, but I'll leave this here.
The fundamental matrix (F) defines a mapping from a point in the left image to a line in the right image on which the corresponding point must lie, assuming perfect calibration. This is the epipolar line, i.e. the line through the point in the left image and the two epipoles of the stereo camera pair. For references, see these lecture notes and this chapter of the HZ book.
Given a set of point correspondences in the left and right images, (p_L, p_R), from SURF (or any other feature matcher), and given F, the constraint from the epipolar geometry of the stereo pair says that p_R should lie on the epipolar line projected by p_L onto the right image, i.e. p_R' * F * p_L = 0.
In practice, calibration errors from noise as well as erroneous feature matches lead to a non-zero value.
However, using this idea, you can then perform outlier removal by rejecting those feature matches for which this equation gives a value greater than a certain threshold, i.e. reject (p_L, p_R) if and only if |p_R' * F * p_L| > threshold.
When selecting this threshold, keep in mind that it is the distance in image space of a point from an epipolar line that you are willing to tolerate, which in some sense is your epipolar error tolerance.
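As a minimal sketch of this filtering (note that the raw value of p_R' * F * p_L is only proportional to an image-space distance; dividing by the norm of the line coefficients, as below, turns it into pixels, and the threshold is something you would tune):

#include <opencv2/core/core.hpp>
#include <cmath>
#include <cstddef>
#include <vector>

using namespace cv;

// Keep a match only if the right point lies within `thresh` pixels of the
// epipolar line l = F * p_L induced by the left point. The distance of a
// point (x,y) to the line (a,b,c) is |a*x + b*y + c| / sqrt(a^2 + b^2).
std::vector<bool> epipolarInliers(const Mat& F, // 3x3, CV_64F
                                  const std::vector<Point2f>& ptsL,
                                  const std::vector<Point2f>& ptsR,
                                  double thresh)
{
    std::vector<bool> keep(ptsL.size());
    for (std::size_t i = 0; i < ptsL.size(); ++i)
    {
        Mat pl = (Mat_<double>(3, 1) << ptsL[i].x, ptsL[i].y, 1.0);
        Mat l = F * pl; // epipolar line in the right image
        double a = l.at<double>(0), b = l.at<double>(1), c = l.at<double>(2);
        double dist = std::fabs(a * ptsR[i].x + b * ptsR[i].y + c)
                      / std::sqrt(a * a + b * b);
        keep[i] = dist < thresh;
    }
    return keep;
}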
Degenerate case: To visually imagine what this means, let us assume that the stereo pair differ only in a pure X-translation. Then the epipolar lines are horizontal. This means that you can connect the feature matched point pairs by a line and reject those pairs whose line slope is not close to zero. The equation above is a generalization of this idea to arbitrary stereo rotation and translation, which is accounted for by the matrix F.
Your specific images: It looks like your feature matches are sparse. I suggest instead to use a dense feature matching approach so that after outlier removal, you are still left with a sufficient number of good-quality matches. I'm not sure which dense feature matcher is already implemented in OpenCV, but I suggest starting here.
Given your pictures, you are trying to do stereo matching.
This page will be helpful. The rectification you want can be done using stereoCalibrate followed by stereoRectify.
The documentation shows an example of the resulting rectified pair.
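For orientation, a bare-bones sketch of the second step; all inputs are assumed to come out of stereoCalibrate, and flags are left at their defaults:

#include <opencv2/calib3d/calib3d.hpp>
#include <opencv2/imgproc/imgproc.hpp>

using namespace cv;

// Compute rectification transforms for a calibrated stereo pair and warp the
// left image into its rectified frame (the right image is handled the same
// way with R2/P2).
void rectifyLeft(const Mat& K1, const Mat& dist1,
                 const Mat& K2, const Mat& dist2,
                 Size imageSize, const Mat& R, const Mat& T,
                 const Mat& leftImage, Mat& leftRectified)
{
    Mat R1, R2, P1, P2, Q;
    stereoRectify(K1, dist1, K2, dist2, imageSize, R, T, R1, R2, P1, P2, Q);

    Mat map1, map2;
    initUndistortRectifyMap(K1, dist1, R1, P1, imageSize, CV_32FC1, map1, map2);
    remap(leftImage, leftRectified, map1, map2, INTER_LINEAR);
}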
In order to find the fundamental matrix, you need correct correspondences, but in order to get good correspondences, you need a good estimate of the fundamental matrix. This might sound like an impossible chicken-and-egg problem, but there are well-established methods for it; RANSAC.
It randomly selects a small set of correspondences, uses those to calculate a fundamental matrix (using the 7- or 8-point algorithm) and then tests how many of the other correspondences comply with this matrix (using the method described by scribbleink for measuring the distance between a point and an epipolar line). It keeps testing new combinations of correspondences for a certain number of iterations and selects the one with the most inliers.
This is already implemented in OpenCV as cv::findFundamentalMat (http://docs.opencv.org/2.4/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html#findfundamentalmat). Select the method CV_FM_RANSAC to use RANSAC to remove bad correspondences. It will output a list of all the inlier correspondences.
One requirement for this is that the points do not all lie on the same plane.
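A minimal usage sketch; the 3-pixel distance threshold and 0.99 confidence are just the common defaults and can be tuned:

#include <opencv2/calib3d/calib3d.hpp>
#include <vector>

using namespace cv;

// Estimate F with RANSAC. inlierMask[i] ends up 1 for correspondences that
// are consistent with the estimated matrix and 0 for rejected outliers.
Mat ransacFundamental(const std::vector<Point2f>& pts1,
                      const std::vector<Point2f>& pts2,
                      std::vector<uchar>& inlierMask)
{
    return findFundamentalMat(pts1, pts2, CV_FM_RANSAC,
                              3.0,   // max distance to the epipolar line (px)
                              0.99,  // RANSAC confidence
                              inlierMask);
}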
Given a set of points P, I need to find a line L that best approximates these points. I have tried to use the function gsl_fit_linear from the GNU Scientific Library. However, my data set often contains points that have a line of best fit with undefined slope (x=c), so gsl_fit_linear returns NaN. It is my understanding that it is best to use total least squares for this sort of thing, because it is fast and robust, and it gives the equation in terms of r and theta (so x=c can still be represented). I can't seem to find any C/C++ code out there for this problem. Does anyone know of a library or something that I can use? I've read a few research papers on this, but the topic is still a little fuzzy, so I don't feel confident implementing my own.
Update:
I made a first attempt at programming it myself with armadillo, using the code given on this Wikipedia page. Alas, I have so far been unsuccessful.
This is what I have so far:
void pointsToLine(vector<Point> P)
{
    Col<double> x(P.size());
    Col<double> y(P.size());
    for (size_t i = 0; i < P.size(); i++)
    {
        x(i) = P[i].x;
        y(i) = P[i].y;
    }
    int m = P.size();
    int n = x.n_cols;             // number of predictor columns (here 1)
    mat Z = join_rows(x, y);      // m x 2 data matrix [x y]
    mat U;
    vec s;
    mat V;
    svd(U, s, V, Z);
    mat VXY = V(span(0, n-1), span(n, V.n_cols-1));
    mat VYY = V(span(n, V.n_rows-1), span(n, V.n_cols-1));
    mat B = (-1*VXY) / VYY;       // TLS slope of y = x*B
    cout << B << endl;
}
The output B is always 0.5504, even when my data set changes. Also, I thought the output should be two values, so I'm definitely doing something very wrong.
Thanks!
To find the line that minimises the sum of the squares of the (orthogonal) distances of the points from the line, you can proceed as follows:
The line is the set of points p + r*t, where p and t are vectors to be found and r varies along the line. We restrict t to be of unit length. While there is another, simpler description in two dimensions, this one works in any dimension.
The steps are
1/ compute the mean p of the points
2/ accumulate the covariance matrix C
C = Sum{ i | (q[i]-p)*(q[i]-p)' } / N
(where you have N points and ' denotes transpose)
3/ diagonalise C and take as t the eigenvector corresponding to the largest eigenvalue.
All this can be justified starting from the (orthogonal) squared distance of a point q from a line represented as above, which is
d2(q) = (q-p)'*(q-p) - ((q-p)'*t)^2
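Since the question uses armadillo, here is a minimal sketch of these three steps with it (one point per row of Q; the function name is mine):

#include <armadillo>

using namespace arma;

// Total least squares line fit: returns the centroid p and the unit
// direction t of the line p + r*t that minimises the sum of squared
// orthogonal distances to the rows of Q.
void fitLineTLS(const mat& Q, rowvec& p, rowvec& t)
{
    p = mean(Q, 0);                        // 1/ mean of the points
    mat D = Q.each_row() - p;              // centre the data
    mat C = (D.t() * D) / Q.n_rows;        // 2/ covariance matrix
    vec eigval;
    mat eigvec;
    eig_sym(eigval, eigvec, C);            // eigenvalues in ascending order
    t = eigvec.col(eigvec.n_cols - 1).t(); // 3/ eigenvector of the largest one
}

Note that this representation copes with vertical lines naturally: for points on x = c, the direction simply comes out as t = (0, 1).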
I'm studying the CSS (Curvature Scale Space) algorithm and I can't get the hang of the concept of the 'arc length parameter'.
According to the literature, a planar curve is written Gamma(u) = (x(u), y(u)), where u is said to be the arc length parameter; apparently, the Gaussian kernel g is also parameterized by this u.
Stop me if I got something wrong, but aren't x and y the locations of a pixel? How can they be represented by another parameter?
I had no idea when I first saw it in the literature, so I looked up the code, and apparently I got puzzled even more.
Here is the relevant portion of the code:
void getGaussianDerivs(double sigma, int M, vector<double>& gaussian,
                       vector<double>& dg, vector<double>& d2g) {
    int L = (M - 1) / 2;
    double sigma_sq = sigma * sigma;
    double sigma_quad = sigma_sq * sigma_sq;
    dg.resize(M); d2g.resize(M); gaussian.resize(M);
    Mat_<double> g = getGaussianKernel(M, sigma, CV_64F);
    for (double i = -L; i < L + 1.0; i += 1.0) {
        int idx = (int)(i + L);
        gaussian[idx] = g(idx);
        // from http://www.cedar.buffalo.edu/~srihari/CSE555/Normal2.pdf
        dg[idx]  = (-i / sigma_sq) * g(idx);
        d2g[idx] = ((-sigma_sq + i * i) / sigma_quad) * g(idx);
    }
}
So it seems the code uses a simple 1-D Gaussian kernel with aperture size M and computes its 1st and 2nd derivatives; analytically these are g'(i) = (-i/sigma^2) * g(i) and g''(i) = ((i^2 - sigma^2)/sigma^4) * g(i), which is exactly what the two lines inside the loop evaluate. As far as I know, a 1-D Gaussian kernel has a parameter x, which is a horizontal coordinate, and sigma, which is the scale. It seems like the 'arc length parameter u' is playing the role of the variable x. That doesn't make sense to me, because later in the code it directly convolves this kernel with the x and y sequences of the contour.
What is this u?
PS. Since I replied to the fellow who tried to answer my question, I think I should revise my question, so here we go.
What I'm confused about is: how is this parameter u implemented in code? I think I understood the full code above (of course, I inserted only a portion of it), but the problem is that I have no idea what it would look like for the 'improved' version of the algorithm. It says it uses an 'affine length parameter' instead of the 'arc length parameter', and I'm not sure how to implement that concept in code.
According to the literature, the main difference between the arc length parameter and the affine length parameter is the sampling interval: the arc length parameter uses 1 for the vertical and horizontal directions and the square root of 2 for the diagonal direction. That makes sense, since the portion of code above uses a for loop to compute the 1st and 2nd derivatives of the 1-D Gaussian and directly uses an interval of 1; but how would it work with a different, varying interval? Is it possible that I can't use a for loop for it?