Tanimoto coefficient in the book Programming Collective Intelligence - data-mining

I have read the book Programming Collective Intelligence. For exercise 1 at the end of chapter 2, could someone please tell me how to calculate the Tanimoto coefficient? A specific mathematical formula would be really appreciated.

An extensive search on a related question has given me two formulas:
T(a,b) = N_intersection / (N_a + N_b - N_intersection), found here, which is the same as the formula on Wikipedia, just written in a slightly more readable fashion.
EDIT: As per your comment, this is the one the OP was looking for.
(n_11 + n_00) / [n_11 + 2*(n_10 + n_01) + n_00], where
n_11: both objects have the attribute,
n_00: neither object has the attribute,
n_10 / n_01: only the first/second object has the attribute.
For the source of the second equation have a look at http://reference.wolfram.com/language/ref/RogersTanimotoDissimilarity.html and calculate the similarity index from the dissimilarity index as (1-dissimilarity).
I believe that the second formula is commonly used in applied statistics and applied marketing.
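For concreteness, here is a minimal Python sketch of both variants (the example items and vectors are made up):
def tanimoto(a, b):
    # First formula: N_intersection / (N_a + N_b - N_intersection) on two sets of items
    a, b = set(a), set(b)
    n_both = len(a & b)
    return n_both / (len(a) + len(b) - n_both)

def rogers_tanimoto(x, y):
    # Second formula on two equal-length binary attribute vectors
    n11 = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    n00 = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    n10 = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    n01 = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return (n11 + n00) / (n11 + 2 * (n10 + n01) + n00)

print(tanimoto({"a", "b", "c"}, {"b", "c", "d"}))   # 2 / (3 + 3 - 2) = 0.5
print(rogers_tanimoto([1, 1, 0, 0], [1, 0, 1, 0]))  # (1 + 1) / (1 + 2*2 + 1) ≈ 0.33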

Related

What are hp.Discrete and hp.RealInterval? Can I include more values in hp.RealInterval instead of just 2?

I am doing hyperparameter tuning with the HParams Dashboard in TensorFlow 2.0-beta0, as suggested here: https://www.tensorflow.org/tensorboard/r2/hyperparameter_tuning_with_hparams
I am confused by step 1 and could not find a better explanation. My questions are about the following lines:
HP_NUM_UNITS = hp.HParam('num_units', hp.Discrete([16, 32]))
HP_DROPOUT = hp.HParam('dropout', hp.RealInterval(0.1, 0.2))
HP_OPTIMIZER = hp.HParam('optimizer', hp.Discrete(['adam', 'sgd']))
My question:
I want to try more dropout values instead of just two (0.1 and 0.2). If I pass more values, it throws an error: 'maximum 2 arguments can be given'. I tried to look for documentation but could not find anything about where these hp.Discrete and hp.RealInterval functions come from.
Any help would be appreciated. Thank you!
Good question. The notebook tutorial is lacking in many respects. At any rate, here is how you do it at a certain resolution res:
for dropout_rate in tf.linspace(
    HP_DROPOUT.domain.min_value,
    HP_DROPOUT.domain.max_value,
    res,
):
    ...  # build and train a model with this dropout_rate
Looking at the implementation, to me it really doesn't seem to be GridSearch but Monte Carlo/random search (note: this is not 100% correct, please see my edit below).
So on every iteration a random float from that real interval is chosen.
If you want GridSearch behavior, just use Discrete. That way you can even mix and match GridSearch with random search, which is pretty cool!
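For example, a minimal sketch based on the tutorial's hp API (the extra dropout values and the HP_LEARNING_RATE name are made up for illustration):
from tensorboard.plugins.hparams import api as hp
# Discrete takes a list, so any number of values works; each one becomes a grid point.
HP_DROPOUT = hp.HParam('dropout', hp.Discrete([0.1, 0.15, 0.2, 0.25]))
# RealInterval only takes (min_value, max_value), which is why passing more values fails.
HP_LEARNING_RATE = hp.HParam('learning_rate', hp.RealInterval(1e-4, 1e-2))
for dropout_rate in HP_DROPOUT.domain.values:
    print('trying dropout', dropout_rate)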
Edit, 27th of July '22 (based on the comment of @dpoiesz):
Just to make it a little clearer: since values are sampled from the intervals, concrete values are returned. Those are then added to the grid dimension, and grid search is performed using them.
RealInterval is a (min, max) tuple from which the hparam will pick a number.
Here is a link to the implementation for better understanding.
The thing is that, as it is currently implemented, there does not seem to be any difference between the two unless you call the sample_uniform method.
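For example, based on my reading of the implementation (so treat this as a sketch, not documented API), something like this draws a value, assuming the HP_DROPOUT = hp.HParam('dropout', hp.RealInterval(0.1, 0.2)) definition from the question:
import random
rng = random.Random(0)
print(HP_DROPOUT.domain.sample_uniform(rng))      # RealInterval: a random float in [0.1, 0.2]
print(hp.Discrete([16, 32]).sample_uniform(rng))  # Discrete: picks one of the listed values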
Note that tf.linspace breaks the sample code mentioned above when saving the current value.
See https://github.com/tensorflow/tensorboard/issues/2348
In particular OscarVanL's comment about his quick & dirty workaround.

Skip-Gram implementation in tensorflow/models - Subsampling of Frequent Words

I have some experiments in mind related to the skip-gram model, so I have started to study and modify the optimized implementation in the tensorflow/models repository, in tutorials/embedding/word2vec_kernels.cc. There I came across the part where corpus subsampling is done.
According to the Tomáš Mikolov paper (https://arxiv.org/abs/1310.4546, eq. 5), a word should be kept with probability
P_keep(w) = sqrt(t / f(w)),
where t denotes the threshold parameter (chosen as 10^-5 in the paper) and f(w) the frequency of the word w,
but the code in word2vec_kernels.cc is following:
float keep_prob = (std::sqrt(word_freq / (subsample_ * corpus_size_)) + 1) *
                  (subsample_ * corpus_size_) / word_freq;
which can be transformed into the previously presented notation (with f(w) = word_freq / corpus_size and t = subsample_) as
P_keep(w) = sqrt(t / f(w)) + t / f(w).
What is the motivation behind this change? Is it just to model 'some kind of relation' to the corpus size in this formula? Or is it some transformation of the original formula? Was it chosen empirically?
Edit: link to the mentioned file on github
https://github.com/tensorflow/models/blob/master/tutorials/embedding/word2vec_kernels.cc
Okay, so I guess that without corpus_size the graph looks roughly the same as the original formula. Multiplying by corpus_size relates the formula to the corpus size and also lets it "work with the large numbers", so the discard/keep probability can be computed from raw counts without first normalizing word frequencies into a proper distribution.
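For what it's worth, here is a small numeric sketch (the counts are made up) showing that the implemented expression equals sqrt(t / f(w)) + t / f(w), i.e. the paper's keep probability plus an extra t / f(w) term:
import math
corpus_size = 10_000_000       # total number of tokens (made up)
word_freq = 3_000              # raw count of one word (made up)
t = 1e-5                       # the subsample_ threshold
f = word_freq / corpus_size    # relative frequency f(w)
paper_keep = math.sqrt(t / f)  # keep probability from eq. 5
code_keep = (math.sqrt(word_freq / (t * corpus_size)) + 1) * (t * corpus_size) / word_freq
rewritten = math.sqrt(t / f) + t / f
print(paper_keep, code_keep, rewritten)  # code_keep == rewritten, both slightly larger than paper_keep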

Weka improve model TP Rate

(screenshot: Weka J48 output)
Hi,
I have a problem with my model in Weka (J48, cross-validation): many instances are classified wrongly when it comes to the second class. Is there any way to improve this, or rather not? I'm not an expert in Weka. Thank you in advance. My output is above.
With NaiveBayes it does better, but the TP Rate is still < 0.5 for the second class.
(screenshot: Weka NaiveBayes output)
It is hard to reproduce your example with the given information. However, the solution is probably to turn your classifier into a cost-sensitive classifier:
https://weka.wikispaces.com/CostSensitiveClassifier
What it does is assign a higher cost to misclassifications of a certain class; in your case this would be the "True" class.
You can also simulate such an algorithm by oversampling your positive examples. That is, if you have n positive examples, you sample k*n positive examples while keeping your negative examples as they are. You could also simply duplicate the positive examples.
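As a language-agnostic illustration of that oversampling idea, here is a minimal Python sketch (the data and the factor k are made up):
import random
random.seed(0)
positives = [("pos%d" % i, "True") for i in range(10)]   # made-up positive examples
negatives = [("neg%d" % i, "False") for i in range(90)]  # made-up negative examples
k = 3                                                    # oversampling factor
oversampled = [random.choice(positives) for _ in range(k * len(positives))]
training_set = negatives + oversampled
random.shuffle(training_set)
print(len(training_set))  # 90 negatives + 30 sampled positives = 120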

How can I find the relevance of a page?

Say I have some features of a person, and then I perform a Google search.
A list of results is displayed. I want to mine those results and check how relevant their contents are to that particular person.
If you need to perform a Google search, the relevancy is best computed using
relevancy = cos(x) * (sqrt(x) + atan2(x,y)) * sin(0)
This is the universal formula for relevancy. You can apply it to people or to questions on SO.

Sentiment analysis with association rule mining

I am trying to come up with an algorithm to find the top-3 most frequently used adjectives for a product within the same sentence. I want to use association rule mining (the Apriori algorithm).
For that I am planning to use Twitter data. I can more or less decompose tweets into sentences and then, with filtering, find product names and the adjectives that go with them.
For instance, after filtering I have data like:
ipad mini, great
ipad mini, horrible
samsung galaxy s2, best
...
etc.
Product names and adjectives are defined in advance, so I have a set of product names and a set of adjectives that I am looking for.
I have read a couple of papers about sentiment analysis and rule mining, and they all say the Apriori algorithm is used, but they don't say how they used it or give any details.
Therefore how can I reduce my problem to association rule mining problem?
What values should I use for minsup and minconf?
How can I modify Apriori algorithm to solve this problem?
What I'm thinking is:
I should find frequent adjectives separately for each product, then by sorting I can get the top-3 adjectives. But I do not know if this is correct.
Finding the top-3 most used adjectives for each product is not association rule mining.
For Apriori to yield good results, you must be interested in itemsets of length 4 and more. Apriori pruning starts at length 3, and begins to yield major gains at length 4. At length 2, it is mostly enumerating all pairs. And if you are only interested in pairs (product, adjective), then apriori is doing much more work than necessary.
Instead, use counting. Use hash tables. If you really have Exabytes of data, use approximate counting and heavy hitter algorithms. (But most likely, you don't have exabytes of data after extracting those pairs...)
Don't bother to investigate association rule mining if you only need to solve this much simpler problem.
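For instance, a minimal Python sketch of the counting approach (the extracted (product, adjective) pairs are made up):
from collections import Counter, defaultdict
pairs = [
    ("ipad mini", "great"),
    ("ipad mini", "horrible"),
    ("ipad mini", "great"),
    ("samsung galaxy s2", "best"),
]  # (product, adjective) pairs extracted from the sentences
counts = defaultdict(Counter)
for product, adjective in pairs:
    counts[product][adjective] += 1
for product, adjective_counts in counts.items():
    print(product, adjective_counts.most_common(3))  # top-3 adjectives per product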
Association rule mining is really only for finding patterns such as
pasta, tomato, onion -> basil
and more complex rules. The contribution of Apriori is to reduce the number of candidates when going from length n-1 -> n for length n > 2. And it gets more effective when n > 3.
Reducing your problem to Association Rule Mining (ARM)
Create a feature vector containing all the topics and adjectives. If a feed contains a topic (or adjective), place a 1 for it in the tuple, otherwise a 0. For example, assume the topics are Samsung and Apple, the adjectives are good and horrible, and a feed contains "Samsung good". Then the corresponding tuple is:
Samsung  Apple  good  horrible
   1       0      1      0
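A minimal sketch of that encoding (the topics, adjectives, and feeds are made up):
topics = ["samsung", "apple"]
adjectives = ["good", "horrible"]
columns = topics + adjectives
feeds = ["samsung good", "apple horrible", "samsung good good"]
transactions = []
for feed in feeds:
    words = feed.split()
    transactions.append(tuple(1 if c in words else 0 for c in columns))
print(columns)        # ['samsung', 'apple', 'good', 'horrible']
for t in transactions:
    print(t)          # e.g. (1, 0, 1, 0) for "samsung good"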
Modification to Apriori Algorithm required
Generate association rules of the form 'topic' --> 'adjective' using a constrained Apriori algorithm; 'topic' --> 'adjective' is the constraint.
How to set MinSup and MinConf:
Read the paper entitled "Mining Top-K Association Rules" and implement it with k=3 for the top-3 adjectives.
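As a rough sketch of what constrained 'topic' --> 'adjective' rules with a top-k cutoff could look like (a simplified illustration with made-up data and thresholds, not the algorithm from that paper):
from collections import Counter
topics = ["ipad mini", "samsung galaxy s2"]
adjectives = ["great", "horrible", "best"]
sentences = [                   # each sentence reduced to the terms it contains
    {"ipad mini", "great"},
    {"ipad mini", "horrible"},
    {"ipad mini", "great"},
    {"samsung galaxy s2", "best"},
]
topic_counts, rule_counts = Counter(), Counter()
for s in sentences:
    for topic in topics:
        if topic in s:
            topic_counts[topic] += 1
            for adj in adjectives:
                if adj in s:
                    rule_counts[(topic, adj)] += 1
minsup, minconf, k = 1, 0.1, 3  # made-up thresholds
for topic in topics:
    candidates = []
    for adj in adjectives:
        support = rule_counts[(topic, adj)]
        confidence = support / topic_counts[topic] if topic_counts[topic] else 0.0
        if support >= minsup and confidence >= minconf:
            candidates.append((adj, support, round(confidence, 2)))
    candidates.sort(key=lambda c: c[1], reverse=True)
    print(topic, '->', candidates[:k])  # top-k adjective rules for this topic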