Why are Cosine Similarity and TF-IDF used together? - data-mining

TF-IDF and Cosine Similarity are a commonly used combination for
text clustering. Each document is represented by a vector of TF-IDF
weights.
This is what my textbook says.
With Cosine Similarity you can then compute the similarities between those documents.
But why exactly are those techniques used together?
What is the advantage?
Could, for example, Jaccard Similarity also be used?
I know how it works, but I want to know why exactly these techniques are used.

TF-IDF is the weighting used.
Cosine is the measure used.
You could use cosine without the weighting, but results are then usually worse. Jaccard works on sets; it's not obvious how to incorporate weights without turning it into something else (or effectively making it the same as cosine).
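As an illustration of the combination described above, here is a minimal sketch, assuming scikit-learn is available; the example documents are made up.

```python
# Minimal sketch: documents become vectors of TF-IDF weights, and cosine
# similarity compares the directions of those vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are animals",
]

# Each document is represented as a (sparse) vector of TF-IDF weights.
tfidf = TfidfVectorizer().fit_transform(docs)

# Cosine similarity looks only at the angle between vectors, so document
# length matters much less than with, say, Euclidean distance.
print(cosine_similarity(tfidf))
```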

Related

c++ - implementing a multivariate probability density function for a likelihood filter

I'm trying to construct a multivariate likelihood function in C++, with the aim of comparing multiple temperature simulations for consistency with observations while taking into account autocorrelation between the time steps. I am inexperienced in C++ and so have been struggling to understand how to write the equation in C++ form. I have the covariance matrix, the simulations I wish to judge, and the observations to compare to. The equation is as follows:
f(x, μ, Σ) = (1/√(|Σ|(2π)^d)) * exp(−(1/2)(x−μ) Σ^(−1) (x−μ)')
So I need to find the determinant and the inverse of the covariance matrix. Does anyone know how to do that in C++ if x, μ, and Σ are all specified?
I have found a few examples and resources to follow
https://github.com/dirkschumacher/rcppglm
https://www.youtube.com/watch?v=y8Kq0xfFF3U&t=953s
https://www.codeproject.com/Articles/25335/An-Algorithm-for-Weighted-Linear-Regression
https://www.geeksforgeeks.org/regression-analysis-and-the-best-fitting-line-using-c/
https://cppsecrets.com/users/489510710111510497118107979811497495464103109971051084699111109/C00-MLPACK-LinearRegression.php
https://stats.stackexchange.com/questions/146230/how-to-implement-glm-computationally-in-c-or-other-languages
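For reference, here is a minimal sketch of the density above in Python/numpy, intended only to show the computation (determinant, inverse, quadratic form); it should translate fairly directly to C++ with a matrix library such as Eigen (which provides .determinant() and .inverse()). Variable names are illustrative, not from the original post.

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    """Multivariate normal density f(x, mu, Sigma) as written above."""
    d = len(mu)
    diff = x - mu
    det = np.linalg.det(sigma)        # |Sigma|
    inv = np.linalg.inv(sigma)        # Sigma^(-1)
    norm = 1.0 / np.sqrt(det * (2.0 * np.pi) ** d)
    return norm * np.exp(-0.5 * diff @ inv @ diff)

# For likelihood comparisons, the log-density is usually numerically safer:
def mvn_logpdf(x, mu, sigma):
    d = len(mu)
    diff = x - mu
    sign, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet
                   + diff @ np.linalg.inv(sigma) @ diff)
```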

What is the best comparison algorithm for comparing the similarity of arrays?

I am building a program that takes 2 arrays and returns some value showing to what degree they are similar. For example, images with few differences will have a good score, whereas images that are vastly different will have a worse score.
So far the only two algorithms I have come across for this problem are the sum of the squared differences and the normalized correlation.
Both of these will be fairly simple to implement; however, I was wondering whether there is another algorithm I haven't been able to find that I could use.
Furthermore, which of the previously mentioned methods would be best? It would be great to know, both in terms of accuracy and efficiency.
Thanks,
Comparing images usually depends on the application you are dealing with. Normally, the distance function used depends on the image descriptor.
Take a look at these distance functions:
Euclidean Distance
Squared Euclidean Distance
Cosine Distance or Similarity [THIS SHOULD WORK FINE]
Sum of Absolute Differences
Sum of Squared Differences
Correlation Distance
Hellinger Distance
Grid Distance
Manhattan Distance
Chebyshev Distance
Statistical distance functions
Wasserstein Metric
Mahalanobis Distance
Bray Curtis Distance
Canberra Distance
Binary Distance functions
L0 Norm
Jaccard similarity
Hamming Distance
As you are directly comparing images, taking cosine similarity should work for you.
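For completeness, here is a rough sketch of the three measures discussed in this thread (sum of squared differences, normalized correlation, cosine similarity) on two equally sized arrays or flattened images. It assumes numpy, and the function names are just illustrative.

```python
import numpy as np

def ssd(a, b):
    # Sum of squared differences: 0 for identical arrays, grows with dissimilarity.
    return np.sum((a - b) ** 2)

def normalized_correlation(a, b):
    # Pearson-style correlation of the two arrays, in [-1, 1].
    a0, b0 = a - a.mean(), b - b.mean()
    return np.sum(a0 * b0) / (np.linalg.norm(a0) * np.linalg.norm(b0))

def cosine_similarity(a, b):
    # Cosine of the angle between the flattened arrays, in [-1, 1].
    a, b = a.ravel().astype(float), b.ravel().astype(float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```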
Comparing images well is quite non-trivial.
Just for one example, to get meaningful results you'll need to account for misalignment between the images being compared. If you just compare (for example) the top-left pixel in one image with the top-left pixel in the other, a fairly minor misalignment between the two can make those pixels entirely different, even though a person looking at the images would have difficulty seeing any difference at all.
One way to deal with this would be to start with something similar to the motion compensation used by MPEG-4. That is, break each image into small blocks (e.g., MPEG uses 16x16 pixel blocks) and compare blocks in one image to blocks in the other. This can eliminate (or at least drastically reduce) the effects of misalignment between the images.
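To make the block idea concrete, here is a rough sketch in the same spirit (not MPEG-4 itself): split one image into 16x16 blocks and, for each block, search a small window in the other image for the best-matching position, so that small misalignments no longer dominate the score. Assumes numpy; the block and search sizes are arbitrary.

```python
import numpy as np

def block_ssd_score(img_a, img_b, block=16, search=4):
    """Sum of per-block best-match SSDs between two same-sized images; lower means more similar."""
    h, w = img_a.shape
    total = 0.0
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            ref = img_a[y:y + block, x:x + block].astype(float)
            best = np.inf
            # Try small shifts to absorb misalignment between the images.
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy <= h - block and 0 <= xx <= w - block:
                        cand = img_b[yy:yy + block, xx:xx + block].astype(float)
                        best = min(best, float(np.sum((ref - cand) ** 2)))
            total += best
    return total
```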

how to use tf-idf with Naive Bayes?

While searching about the question that I am posting here, I found many links which propose a solution but do not explain exactly how it is to be done. I have explored, for example, the following links:
Link 1
Link 2
Link 3
Link 4
etc.
Therefore, I am presenting my understanding as to how the Naive Bayes formula with tf-idf can be used here, and it is as follows:
Naive Bayes formula:
P(word|class) = (word_count_in_class + 1) / (total_words_in_class + total_unique_words_in_all_classes), where total_unique_words_in_all_classes is basically the vocabulary of words in the entire training set.
tf-idf weighting can be employed in the above formula as:
word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.
total_words_in_class : sum of (tf-idf weights of all the words belonging to that class)
total_unique_words_in_all_classes : as is.
This question has been posted multiple times on Stack Overflow, but nothing substantial has been answered so far. I want to know whether the way I am thinking about the problem, i.e. the implementation shown above, is correct. I need to know this because I am implementing Naive Bayes myself, without the help of any Python library that comes with built-in functions for both Naive Bayes and tf-idf. What I actually want is to improve the accuracy (currently 30%) of the model, which uses a Naive Bayes trained classifier. So, if there are better ways to achieve good accuracy, suggestions are welcome.
Please advise; I am new to this domain.
It would be better if you actually gave us the exact features and classes you would like to use, or at least gave an example. Since none of those have been given concretely, I'll just assume the following is your problem:
You have a number of documents, each of which has a number of words.
You would like to classify documents into categories.
Your feature vector consists of all possible words in all documents, and its values are the word counts in each document.
Your Solution
The tf-idf scheme you gave is the following:
word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.
total_words_in_class : sum of (tf-idf weights of all the words belonging to that class)
Your approach sounds reasonable. The sum of all probabilities would sum to 1 independent of the tf-idf function, and the features would reflect tf-idf values. I would say this looks like a solid way to incorporate tf-idf into NB.
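A minimal sketch of that idea, i.e. multinomial Naive Bayes where the per-class word counts are replaced by summed tf-idf weights. It assumes each document is a dict mapping word to tf-idf weight; all names are illustrative rather than from the question.

```python
from collections import defaultdict
import math

def train(docs_by_class):
    """docs_by_class: {class_label: [ {word: tfidf_weight, ...}, ... ]}"""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    weights, totals = {}, {}
    for c, docs in docs_by_class.items():
        acc = defaultdict(float)
        for d in docs:
            for w, tfidf in d.items():
                acc[w] += tfidf          # "word_count_in_class" becomes a tf-idf sum
        weights[c] = acc
        totals[c] = sum(acc.values())    # "total_words_in_class" as a tf-idf sum
    return vocab, weights, totals

def log_score(doc_words, c, vocab, weights, totals):
    # Laplace-smoothed P(word|class), with tf-idf sums in place of raw counts.
    return sum(
        math.log((weights[c][w] + 1.0) / (totals[c] + len(vocab)))
        for w in doc_words
    )
```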
Another potential Solution
It took me a while to wrap my head around this problem. The main reason for this was having to worry about maintaining probability normalization. Using a Gaussian Naive Bayes would help ignore this issue entirely.
If you wanted to use this method:
Compute the mean and variance of the tf-idf values for each class.
Compute the class-conditional likelihood using a Gaussian distribution with the above mean and variance.
Proceed as normal (multiply by the class prior) and predict values.
Hard-coding this shouldn't be too hard since numpy inherently has a Gaussian function. I just prefer this type of generic solution for these types of problems.
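If you go the Gaussian route, a bare-bones sketch might look like the following: per-class, per-feature mean and variance over the tf-idf matrix, numpy only, all names illustrative.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """X: (n_docs, n_features) tf-idf matrix, y: array of class labels."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        var = Xc.var(axis=0) + 1e-9      # small floor to avoid division by zero
        prior = len(Xc) / len(X)
        params[c] = (mu, var, prior)
    return params

def predict_one(x, params):
    def log_gauss(x, mu, var):
        # Log of the Gaussian density, evaluated feature-wise.
        return -0.5 * (np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)
    scores = {c: np.sum(log_gauss(x, mu, var)) + np.log(prior)
              for c, (mu, var, prior) in params.items()}
    return max(scores, key=scores.get)
```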
Additional methods to increase accuracy
Apart from the above, you could also use the following techniques to increase accuracy:
Preprocessing:
Feature reduction (usually NMF, PCA, or LDA)
Additional features
Algorithm:
Naive Bayes is fast, but inherently performs worse than other algorithms. It may be better to perform feature reduction and then switch to a discriminative model such as SVM or Logistic Regression.
Misc.
Bootstrapping, boosting, etc. Be careful not to overfit though...
Hopefully this was helpful. Leave a comment if anything was unclear
P(word|class)=(word_count_in_class+1)/(total_words_in_class+total_unique_words_in_all_classes
(basically vocabulary of words in the entire training set))
How would this sum up to 1? If using the above conditional probabilities, I assume the SUM is
P(word1|class)+P(word2|class)+...+P(wordn|class) =
(total_words_in_class + total_unique_words_in_class)/(total_words_in_class+total_unique_words_in_all_classes)
To correct this, I think the P(word|class) should be like
(word_count_in_class + 1)/(total_words_in_class+total_unique_words_in_classes(vocabulary of words in class))
Please correct me if I am wrong.
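For what it's worth, here is a quick toy check of the normalization being discussed, with made-up counts: under Laplace smoothing, the probabilities sum to 1 when the sum runs over the full training vocabulary (including words never seen in the class), which is the usual convention.

```python
# Toy check: Laplace-smoothed P(word|class) summed over the FULL vocabulary.
class_counts = {"rain": 3, "sun": 1}        # words observed in this class
vocab = ["rain", "sun", "snow", "wind"]     # vocabulary of the entire training set

total_words_in_class = sum(class_counts.values())
V = len(vocab)

probs = [(class_counts.get(w, 0) + 1) / (total_words_in_class + V) for w in vocab]
print(sum(probs))   # 1.0; summing only over the words seen in the class gives less
```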
I think there are two ways to do it:
Round the tf-idf values down to integers, then use the multinomial distribution for the conditional probabilities. See this paper: https://www.cs.waikato.ac.nz/ml/publications/2004/kibriya_et_al_cr.pdf.
Use the Dirichlet distribution, which is a continuous version of the multinomial distribution, for the conditional probabilities.
I am not sure if Gaussian mixture will be better.

Better understanding of cosine similarity

I am doing a little research on text mining and data mining. I need more help in understanding cosine similarity. I have read about it and noticed that all of the examples given on the internet use tf-idf before computing cosine similarity.
My questions
1. Is it possible to calculate cosine similarity just by using the term-frequency distribution from a text file, which will be the dataset? Most of the videos and tutorials that I have gone through run tf-idf before feeding the data into cosine similarity. If not, what other kinds of weighting or equations can be fed into cosine similarity?
2. Why is normalization used with tf-idf to compute cosine similarity? (Can I do it without normalization?) Cosine similarity is computed from the normalized tf-idf output. Why is normalization needed?
3. What does cosine similarity actually do to the tf-idf weights?
I do not understand question 1.
TF-IDF weighting is a weighting scheme that has worked well for lots of people on real data (think Lucene search). But its theoretical foundations are a bit weak. In particular, everybody seems to be using a slightly different version of it... and yes, it is weights + cosine similarity. In practice, you may want to try e.g. Okapi BM25 weighting instead, though.
I do not understand this question either. Angular similarity is beneficial because the length of the text has less influence than with other distances. Furthermore, sparsity can be nicely exploited. As for the weights, IDF is a heuristic with only loose statistical arguments: frequent words are more likely to occur at random, and thus should have less weight.
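To make the normalization point concrete, here is a small numpy sketch with made-up values: cosine similarity already divides by the vector lengths, so explicitly L2-normalizing the tf-idf vectors does not change the score; it just turns cosine into a plain dot product.

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

tf_a = np.array([3.0, 0.0, 1.0, 2.0])   # raw term frequencies, document A
tf_b = np.array([1.0, 1.0, 0.0, 2.0])   # raw term frequencies, document B
idf  = np.array([0.1, 2.0, 1.5, 0.3])   # IDF: down-weights very common terms

print(cosine(tf_a, tf_b))                # cosine on plain TF vectors (question 1)
print(cosine(tf_a * idf, tf_b * idf))    # cosine on tf-idf weighted vectors

# After L2 normalization, cosine reduces to a dot product with the same value:
a_n = (tf_a * idf) / np.linalg.norm(tf_a * idf)
b_n = (tf_b * idf) / np.linalg.norm(tf_b * idf)
print(np.dot(a_n, b_n))
```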
Maybe you can try to rephrase your questions so I can fully understand them. Also search for related questions such as these: Cosine similarity and tf-idf and
Better text documents clustering than tf/idf and cosine similarity?

Trajectory interpolation and derivative

I'm working on the analysis of a particle's trajectory in a 2D plane. This trajectory typically consists of 5 to 50 (in rare cases more) points (discrete integer coordinates). I have already matched the points of my dataset to form a trajectory (thus I have time resolution).
I'd like to perform some analysis on the curvature of this trajectory; unfortunately, the analysis framework I'm using has no support for fitting a trajectory. From what I've heard, one can use splines/Bézier curves to get this done, but I'd like your opinion and/or suggestions on what to use.
As this is only an optional part of my work I can not invest a vast amount of time for implementing a solution on my own or understanding a complex framework. The solution has to be as simple as possible.
Let me specify the features I need from a possible library:
- create trajectory from varying number of points
- as the points are discrete it should interpolate their position; no need for exact matches for all points as long as the resulting distance between trajectory and point is less than a threshold
- it is essential that the library can yield the derivative of the trajectory for any given point
- it would be beneficial if the library could report a quality level (like chiSquare for fits) of the interpolation
EDIT: After reading the comments I'd like to add some more:
It is not necessary that the trajectory exactly matches the points. The points are created from the values of a pixel matrix, and thus they form a discrete matrix of coordinates with a spatial resolution limited by the number of pixels per given distance. Therefore the points (which are placed at the center of the firing pixel) do not (exactly) match the actual trajectory of the particle. Either interpolation or a fit is fine for me, as long as the solution can cope with a trajectory which may, and most probably will, be neither bijective nor injective (i.e. it cannot be written as a single-valued function y(x)).
Thus most traditional fit approaches (like fitting with polynomials or exponential functions using a least-squares fit) can't fulfil my criteria.
Additionally, all traditional fit approaches I have tried yield a function which seems to describe the trajectory quite well, but when looking at its first derivative (or at higher resolution) one can find numerous "micro-oscillations" which (from my interpretation) are a result of fitting non-straight functions to (nearly) straight parts of the trajectory.
Edit 2: There has been some discussion in the comments about what those trajectories may look like. Essentially they may have any shape, length and "curliness", although I try to exclude trajectories which overlap or cross in the previous steps. I have included two examples below; ignore the colored boxes, they're just a representation of the values of the raw pixel matrix. The black, circular dots are the points which I'd like to match to a trajectory; as you can see they are always centered on the pixels and therefore may only have discrete (integer) values.
Thanks in advance for any help & contribution!
This MIGHT be the way to go
http://alglib.codeplex.com/
From your description I would say that a parametric spline interpolation may suit your requirements. I have not used the above library myself, but it does have support for spline interpolation. Using an interpolant means you will not have to worry about goodness of fit - the curve will pass through every point that you give it.
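For comparison, here is the same parametric-spline idea sketched in Python/scipy (splprep/splev); ALGLIB offers analogous spline routines. The points and the smoothing factor s are made up; s > 0 lets the curve miss the points by a controlled amount, and der=1 gives the derivative needed for curvature analysis.

```python
import numpy as np
from scipy.interpolate import splev, splprep

# Discrete pixel-center coordinates of one trajectory (made-up example).
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([0, 1, 1, 2, 4, 7], dtype=float)

# Parametric smoothing spline: works even when the trajectory is not a
# single-valued function y(x). s controls how closely the points are matched.
tck, u = splprep([x, y], s=1.0)

u_fine = np.linspace(0.0, 1.0, 200)
xs, ys = splev(u_fine, tck)             # interpolated trajectory
dx, dy = splev(u_fine, tck, der=1)      # first derivative along the curve

tangent_angle = np.arctan2(dy, dx)      # tangent direction at each sample
```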
If you don't mind using matrix libraries, linear least squares is the easiest solution (look at the end of the General Problem section for the equation to use). You can also use linear/polynomial regression to solve something like this.
Linear least squares will always give the best solution, but it's not scalable, because matrix multiplication is moderately expensive. Regression is an iterative heuristic method, so you can just run it until you have a "sufficiently good" answer. I've seen guidelines for the cutoff at about 1000-10000 dimensions in your data. So, with your data set, I'd recommend linear least squares, unless you decide to make them highly dimensioned for some reason.