Better understanding of cosine similarity - data-mining

I am doing a little research on text mining and data mining. I need more help in understanding cosine similarity. I have read about it and notice that all of the given examples on the internet is using tf-idf before computing it through cosine-similarity.
My question
Is it possible to calculate cosine similarity just by using highest frequency distribution from a text file which will be the dataset. Most of the videos and tutorials that i go through has tf-idf ran before inputting it's data into cosine similarity, if no, what other types of equation/algorithm that can be input into cosine similarity?
2.Why is normalization used with tf-idf to compute cosine similarity? (can i do it without normalization?) Cosine similarity are computed from normalization of tf-idf output. Why is normalization needed?
3.What cosine similarity actually does to the weights of tf-idf?

I do not understand question 1.
TF-IDF weighting is a weighting scheme that worked well for lots of people on real data (think Lucene search). But the theoretical foundations of it are a bit weak. In particular, everybody seems to be using a slightly different version of it... and yes, it is weights + cosine similarity. In practise, you may want to try e.g. Okapi BM25 weighting instead, though.
I do not undestand this question either. Angular similarity is beneficial because the length of the text has less influence than with other distances. Furthermore, sparsity can be nicely exploitet. As for the weights, IDF is a heuristic with only loose statistical arguments: frequent words are more likely to occur at random, and thus should have less weight.
Maybe you can try to rephrase your questions so I can fully understand them. Also search for related questions such as these: Cosine similarity and tf-idf and
Better text documents clustering than tf/idf and cosine similarity?

Related

What is the best comparison algorithm for comparing the similarity of arrays?

I am building a program that takes 2 arrays and returns some value showing to what degree they are similar. For example, images with few differences will have a good score, whereas images that are vastly different will have a worse score.
So far the only two algorithms I have come across for this problem are the sum of the squared differences and the normalized correlation.
Both of these will be fairly simple to implement, however I was wondering if there was another algorithm I haven't been able to find that I could use?
Furthermore, which previously mentioned method will be the best? Would be great to know both in terms of their accuracy and efficiency.
Thanks,
Comparing images usually depends on application you are dealing with. Normally distance functions used depends on image descriptor.
Take look at Distance functions
Euclidean Distance
Squared Euclidean Distance
Cosine Distance or Similarity [THIS SHOULD WORK FINE]
Sum of Absolute Differences
Sum of Squared Differences
Correlation Distance
Hellinger Distance
Grid Distance
Manhattan Distance
Chebyshev Distance
statistics distance function
Wasserstein Metric
Mahalanobis Distance
Bray Curtis Distance
Canberra Distance
Binary Distance functions
L0 Norm
Jacard similarity
Hamming Distance
As you are directly comparing images, taking cosine similarity should work for you.
Comparing images well is quite non-trivial.
Just for one example, to get meaningful results you'll need to account for misalignment between the images being compared. If you just compare (for example) the top-left pixel in one image with the top-left pixel in the other, a fairly minor misalignment between the two can make those pixels entirely different, even though a person looking at the images would have difficulty seeing any difference at all.
One way to deal with this would be to start with something similar to the motion compensation used by MPEG 4. That is, break each image into small blocks (e.g., MPEG using 16x16 pixel blocks) and compare blocks in one to blocks in the other. This can eliminate (or at least drastically reduce) the effects of misalignment between the images.

how to use tf-idf with Naive Bayes?

As per my search regarding the query, that I am posting here, I have got many links which propose solution but haven't mentioned exactly how this is to be done. I have explored, for example, the following links :
Link 1
Link 2
Link 3
Link 4
etc.
Therefore, I am presenting my understanding as to how the Naive Bayes formula with tf-idf can be used here and it is as follows:
Naive-Bayes formula :
P(word|class)=(word_count_in_class + 1)/(total_words_in_class+total_unique_words_in_all_classes(basically vocabulary of words in the entire training set))
tf-idf weighting can be employed in the above formula as:
word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.
total_words_in_class : sum of (tf-idf weights of all the words belonging to that class)
total_unique_words_in_all_classes : as is.
This question has been posted multiple times on stack overflow but nothing substantial has been answered so far. I want to know that the way I am thinking about the problem is correct or not i.e. implementation that I have shown above. I need to know this as I am implementing the Naive Bayes myself without taking help of any Python library which comes with the built-in functions for both Naive Bayes and tf-idf. What I actually want is to improve the accuracy(currently 30%) of the model which was using Naive Bayes trained classifier. So, if there are better ways to achieve good accuracy, suggestions are welcome.
Please suggest me. I am new to this domain.
It would be better if you actually gave us the exact features and class you would like to use, or at least give an example. Since none of those have been concretely given, I'll just assume the following is your problem:
You have a number of documents, each of which has a number of words.
You would like to classify documents into categories.
Your feature vector consists of all possible words in all documents, and has values of number of counts in each document.
Your Solution
The tf idf you gave is the following:
word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.
total_words_in_class : sum of (tf-idf weights of all the words belonging to that class)
Your approach sounds reasonable. The sum of all probabilities would sum to 1 independent of the tf-idf function, and the features would reflect tf-idf values. I would say this looks like a solid way to incorporate tf-idf into NB.
Another potential Solution
It took me a while to wrap my head around this problem. The main reason for this was having to worry about maintaining probability normalization. Using a Gaussian Naive Bayes would help ignore this issue entirely.
If you wanted to use this method:
Compute mean, variation of tf-idf values for each class.
Compute the prior using a gaussian distribution generated by the above mean and variation.
Proceed as normal (multiply to prior) and predict values.
Hard coding this shouldn't be too hard since numpy inherently has a gaussian function. I just prefer this type of generic solution for these type of problems.
Additional methods to increase
Apart from the above, you could also use the following techniques to increase accuracy:
Preprocessing:
Feature reduction (usually NMF, PCA, or LDA)
Additional features
Algorithm:
Naive bayes is fast, but inherently performs worse than other algorithms. It may be better to perform feature reduction, and then switch to a discriminative model such as SVM or Logistic Regression
Misc.
Bootstrapping, boosting, etc. Be careful not to overfit though...
Hopefully this was helpful. Leave a comment if anything was unclear
P(word|class)=(word_count_in_class+1)/(total_words_in_class+total_unique_words_in_all_classes
(basically vocabulary of words in the entire training set))
How would this sum up to 1? If using the above conditional probabilities, I assume the SUM is
P(word1|class)+P(word2|class)+...+P(wordn|class) =
(total_words_in_class + total_unique_words_in_class)/(total_words_in_class+total_unique_words_in_all_classes)
To correct this, I think the P(word|class) should be like
(word_count_in_class + 1)/(total_words_in_class+total_unique_words_in_classes(vocabulary of words in class))
Please correct me if I am wrong.
I think there are two ways to do it:
Round down tf-idf as integers, then use the multinomial distribution for the conditional probabilities. See this paper https://www.cs.waikato.ac.nz/ml/publications/2004/kibriya_et_al_cr.pdf.
Use Dirichlet distribution which is a continuous version of the multinomial distribution for the conditional probabilities.
I am not sure if Gaussian mixture will be better.

Why are Cosine Similarity and TF-IDF used together?

TF-IDF and Cosine Similarity is a commonly used combination for
text clustering. Each document is represented by vectors of TF-IDF
weights.
This is what my text book says.
With Cosine Similarity you can then compute the similarities between those documents.
But why are exactly those techniques used together?
What is the advantage?
Could for example Jaccard Similarity also be used?
I know, how it works, but I want to know, why exactly these techniques.
TF-IDF is the weighting used.
Cosine is the measure used.
You could use cosine without weighting, but results then usually are worse. Jaccard works on sets - it's not obvious how to use weights without turning it into something else without making it the same as Cosine.

c++ numerical analysis Accurate data structure?

Using double type I made Cubic Spline Interpolation Algorithm.
That work was success as it seems, but there was a relative error around 6% when very small values calculated.
Is double data type enough for accurate scientific numerical analysis?
Double has plenty of precision for most applications. Of course it is finite, but it's always possible to squander any amount of precision by using a bad algorithm. In fact, that should be your first suspect. Look hard at your code and see if you're doing something that lets rounding errors accumulate quicker than necessary, or risky things like subtracting values that are very close to each other.
Scientific numerical analysis is difficult to get right which is why I leave it the professionals. Have you considered using a numeric library instead of writing your own? Eigen is my current favorite here: http://eigen.tuxfamily.org/index.php?title=Main_Page
I always have close at hand the latest copy of Numerical Recipes (nr.com) which does have an excellent chapter on interpolation. NR has a restrictive license but the writers know what they are doing and provide a succinct writeup on each numerical technique. Other libraries to look at include: ATLAS and GNU Scientific Library.
To answer your question double should be more than enough for most scientific applications, I agree with the previous posters it should like an algorithm problem. Have you considered posting the code for the algorithm you are using?
If double is enough for your needs depends on the type of numbers you are working with. As Henning suggests, it is probably best to take a look at the algorithms you are using and make sure they are numerically stable.
For starters, here's a good algorithm for addition: Kahan summation algorithm.
Double precision will be mostly suitable for any problem but the cubic spline will not work well if the polynomial or function is quickly oscillating or repeating or of quite high dimension.
In this case it can be better to use Legendre Polynomials since they handle variants of exponentials.
By way of a simple example if you use, Euler, Trapezoidal or Simpson's rule for interpolating within a 3rd order polynomial you won't need a huge sample rate to get the interpolant (area under the curve). However, if you apply these to an exponential function the sample rate may need to greatly increase to avoid loosing a lot of precision. Legendre Polynomials can cater for this case much more readily.

Outlier detection in data mining [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I have a few sets of questions regarding outlier detection:
Can we find outliers using k-means and is this a good approach?
Is there any clustering algorithm which does not accept any input from the user?
Can we use support vector machine or any other supervised learning algorithm for outlier detection?
What are the pros and cons of each approach?
I will limit myself to what I think is essential to give some clues about all of your questions, because this is the topic of a lot of textbooks and they might probably be better addressed in separate questions.
I wouldn't use k-means for spotting outliers in a multivariate dataset, for the simple reason that the k-means algorithm is not built for that purpose: You will always end up with a solution that minimizes the total within-cluster sum of squares (and hence maximizes the between-cluster SS because the total variance is fixed), and the outlier(s) will not necessarily define their own cluster. Consider the following example in R:
set.seed(123)
sim.xy <- function(n, mean, sd) cbind(rnorm(n, mean[1], sd[1]),
rnorm(n, mean[2],sd[2]))
# generate three clouds of points, well separated in the 2D plane
xy <- rbind(sim.xy(100, c(0,0), c(.2,.2)),
sim.xy(100, c(2.5,0), c(.4,.2)),
sim.xy(100, c(1.25,.5), c(.3,.2)))
xy[1,] <- c(0,2) # convert 1st obs. to an outlying value
km3 <- kmeans(xy, 3) # ask for three clusters
km4 <- kmeans(xy, 4) # ask for four clusters
As can be seen in the next figure, the outlying value is never recovered as such: It will always belong to one of the other clusters.
One possibility, however, would be to use a two-stage approach where one's removing extremal points (here defined as vector far away from their cluster centroids) in an iterative manner, as described in the following paper: Improving K-Means by Outlier Removal (Hautamäki, et al.).
This bears some resemblance with what is done in genetic studies to detect and remove individuals which exhibit genotyping error, or individuals that are siblings/twins (or when we want to identify population substructure), while we only want to keep unrelated individuals; in this case, we use multidimensional scaling (which is equivalent to PCA, up to a constant for the first two axes) and remove observations above or below 6 SD on any one of say the top 10 or 20 axes (see for example, Population Structure and Eigenanalysis, Patterson et al., PLoS Genetics 2006 2(12)).
A common alternative is to use ordered robust mahalanobis distances that can be plotted (in a QQ plot) against the expected quantiles of a Chi-squared distribution, as discussed in the following paper:
R.G. Garrett (1989). The chi-square plot: a tools for multivariate outlier recognition. Journal of Geochemical Exploration 32(1/3): 319-341.
(It is available in the mvoutlier R package.)
It depends on what you call user input. I interpret your question as whether some algorithm can process automatically a distance matrix or raw data and stop on an optimal number of clusters. If this is the case, and for any distance-based partitioning algorithm, then you can use any of the available validity indices for cluster analysis; a good overview is given in
Handl, J., Knowles, J., and Kell, D.B.
(2005). Computational cluster validation in post-genomic data analysis.
Bioinformatics 21(15): 3201-3212.
that I discussed on Cross Validated. You can for instance run several instances of the algorithm on different random samples (using bootstrap) of the data, for a range of cluster numbers (say, k=1 to 20) and select k according to the optimized criteria taht was considered (average silhouette width, cophenetic correlation, etc.); it can be fully automated, no need for user input.
There exist other forms of clustering, based on density (clusters are seen as regions where objects are unusually common) or distribution (clusters are sets of objects that follow a given probability distribution). Model-based clustering, as it is implemented in Mclust, for example, allows to identify clusters in a multivariate dataset by spanning a range of shape for the variance-covariance matrix for a varying number of clusters and to choose the best model according to the BIC criterion.
This is a hot topic in classification, and some studies focused on SVM to detect outliers especially when they are misclassified. A simple Google query will return a lot of hits, e.g. Support Vector Machine for Outlier Detection in Breast Cancer Survivability Prediction by Thongkam et al. (Lecture Notes in Computer Science 2008 4977/2008 99-109; this article includes comparison to ensemble methods). The very basic idea is to use a one-class SVM to capture the main structure of the data by fitting a multivariate (e.g., gaussian) distribution to it; objects that on or just outside the boundary might be regarded as potential outliers. (In a certain sense, density-based clustering would perform equally well as defining what an outlier really is is more straightforward given an expected distribution.)
Other approaches for unsupervised, semi-supervised, or supervised learning are readily found on Google, e.g.
Hodge, V.J. and Austin, J. A Survey of Outlier Detection Methodologies.
Vinueza, A. and Grudic, G.Z. Unsupervised Outlier Detection and Semi-Supervised Learning.
Escalante, H.J. A Comparison of Outlier Detection Algorithms for Machine Learning.
A related topic is anomaly detection, about which you will find a lot of papers.
That really deserves a new (and probably more focused) question :-)
1) Can we find outliers using k-means, is it a good approach?
Cluster-based approaches are optimal to find clusters, and can be used to detect outliers as
by-products. In the clustering processes, outliers can affect the locations of the cluster centers, even aggregating as a micro-cluster. These characteristics make the cluster-based approaches infeasible to complicated databases.
2) Is there any clustering algorithm which does not accept any input from the user?
Maybe you can achieve some valuable knowledge on this topic:
Dirichlet Process Clustering
Dirichlet-based clustering algorithm can adaptively determine the number of clusters according to the distribution of observation data.
3) Can we use support vector machine or any other supervised learning algorithm for outlier detection?
Any Supervised learning algorithm needs enough labeled training data to construct classifiers. However, a balanced training dataset is not always available for real world problem, such as intrusion detection, medical diagnostics. According to the definition of Hawkins Outlier("Identification of Outliers". Chapman and Hall, London, 1980), the number of normal data is much larger than that of outliers. Most supervised learning algorithms can't achieve an efficient classifier on the above unbalanced dataset.
4) What is the pros and cons of each approach?
Over the past several decades, the research on outlier detection varies from the global computation to the local analysis, and the descriptions of outliers vary from the binary interpretations to probabilistic representations. According to hypotheses of outlier detection models, outlier detection algorithms can be divided into four kinds: Statistic-based algorithms, Cluster-based algorithms, Nearest Neighborhood based algorithms, and Classifier-based algorithms. There are several valuable surveys on outlier detection:
Hodge, V. and Austin, J. "A survey of outlier detection methodologies", Journal of Artificial Intelligence Review, 2004.
Chandola, V. and Banerjee, A. and Kumar, V. "Outlier detection: A survey", ACM Computing Surveys, 2007.
k-means is rather sensitive to noise in the data set. It works best when you remove the outliers beforehand.
No. Any cluster analysis algorithm that claims to be parameter-free usually is heavily restricted, and often has hidden parameters - a common parameter is the distance function, for example. Any flexible cluster analysis algorithm will at least accept a custom distance function.
one-class classifiers are a popular machine-learning approach to outlier detection. However, supervised approaches aren't always appropriate for detecting _previously_unseen_ objects. Plus, they can overfit when the data already contains outliers.
Every approach has its pros and cons, that is why they exist. In a real setting, you will have to try most of them to see what works for your data and setting. It's why outlier detection is called knowledge discovery - you have to explore if you want to discover something new ...
You may want to have a look at the ELKI data mining framework. It is supposedly the largest collection of outlier detection data mining algorithms. It's open source software, implemented in Java, and includes some 20+ outlier detection algorithms. See the list of available algorithms.
Note that most of these algorithms are not based on clustering. Many clustering algorithms (in particular k-means) will try to cluster instances "no matter what". Only few clustering algorithms (e.g. DBSCAN) actually consider the case that maybe not all instance belong into clusters! So for some algorithms, outliers will actually prevent a good clustering!