How to make document clusters using hierarchical clustering

I am trying to cluster documents based on their similarity. The idea is to count the words shared by two documents and divide that count by the total number of words in both documents. Each value is stored in a 2D array:
1.0000 0.1548 0.0968 0.0982 0.2750 0.1239 0.0891 0.1565
0.1548 1.0000 0.0960 0.0898 0.1631 0.0756 0.0874 0.2187
0.0957 0.2300 1.0000 0.4964 0.0980 0.2004 0.4582 0.2315
0.0971 0.2234 0.4946 1.0000 0.0995 0.2010 0.4533 0.2244
0.2793 0.1631 0.0986 0.1001 1.0000 0.1324 0.0904 0.1662
0.1726 0.0756 0.2149 0.2157 0.1795 1.0000 0.2019 0.0819
0.0880 0.2108 0.4582 0.4550 0.0899 0.1880 1.0000 0.2124
0.1556 0.2094 0.0950 0.0884 0.1662 0.0764 0.0867 1.0000
So if there are 8 documents, the result of comparing each document with every other one is stored as in the table above; each index of the array corresponds to one document. Entries 0,0 1,1 2,2 ... will always have the value 1 because a document is identical to itself.
How do I cluster similar documents, i.e. those whose values are close to each other?

Have you tried converting similarity to a distance using e.g.
dist = 1 - sim
As your similarity is bounded by 1, this should work just fine.
Note, however, that hierarchical clustering does not scale well. The usual naive implementation scales as O(n^3), and very careful implementations can run in O(n^2) for some linkage types (single-link, complete-link, maybe UPGMA too). Nevertheless, a typical text corpus will be far too large for this to be feasible.
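For example, with SciPy you could convert the matrix and feed it straight into a hierarchical clustering routine. A minimal sketch using the 8x8 matrix from the question (the choice of average linkage and the cut into two clusters at the end are just illustrations, not something implied by your data):
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# The 8x8 similarity matrix from the question
sim = np.array([
    [1.0000, 0.1548, 0.0968, 0.0982, 0.2750, 0.1239, 0.0891, 0.1565],
    [0.1548, 1.0000, 0.0960, 0.0898, 0.1631, 0.0756, 0.0874, 0.2187],
    [0.0957, 0.2300, 1.0000, 0.4964, 0.0980, 0.2004, 0.4582, 0.2315],
    [0.0971, 0.2234, 0.4946, 1.0000, 0.0995, 0.2010, 0.4533, 0.2244],
    [0.2793, 0.1631, 0.0986, 0.1001, 1.0000, 0.1324, 0.0904, 0.1662],
    [0.1726, 0.0756, 0.2149, 0.2157, 0.1795, 1.0000, 0.2019, 0.0819],
    [0.0880, 0.2108, 0.4582, 0.4550, 0.0899, 0.1880, 1.0000, 0.2124],
    [0.1556, 0.2094, 0.0950, 0.0884, 0.1662, 0.0764, 0.0867, 1.0000],
])

dist = 1.0 - sim                      # similarity -> distance
dist = (dist + dist.T) / 2.0          # the matrix is not perfectly symmetric; average it
np.fill_diagonal(dist, 0.0)           # distances to self must be exactly 0

condensed = squareform(dist)          # linkage() expects a condensed distance vector
Z = linkage(condensed, method='average')   # or 'single' / 'complete'

labels = fcluster(Z, t=2, criterion='maxclust')  # e.g. cut the dendrogram into 2 clusters
print(labels)
Whether two clusters is the right cut is something you would decide from the dendrogram or your application.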

Trajectory Analysis (SAS): Incorrect number of start values

I am attempting a trajectory analysis in SAS (proc traj).
Following instructions found online, I first tested quadratic models with two, then three, then four, then five groups (i.e., order 2 2; order 2 2 2; order 2 2 2 2; order 2 2 2 2 2).
I determined that a three-group linear model is the best fit (order 1 1 1;)
I then wish to add time-stable covariates with the risk command. As found online, I did this by adding the start parameters provided in the log.
At this point, I receive a notice: "Incorrect number of start values. There should be 10 start values based on the model specifications."
I understand that it's possible to delete some of the 12 parameter estimates provided, but how do I select which ones to remove?
Thank you.
Code:
proc traj data=followupyes outplot=op outstat=os out=of outest=oe itdetail;
    id youthid;
    title3 'linear 3-gp model ';
    var pronoun_allpar1-pronoun_allpar3;
    indep time1-time3;
    model logit;
    ngroups 3;
    order 1 1 1;
    weight wgt_00;
    start 0.031547 0.499724 1.969017 0.859566 -1.236747 0.007471
          0.771878 0.495458 0.000000 0.000000 0.000000 0.000000;
    risk P00_45_1;
run;
%trajplot (OP, OS, "linear 3-gp model ", "Traj of Pronoun Support", "Pron Support", "Time");
Because you are estimating a model with 3 linear trajectories, you will need 2 start values for each of your 3 groups.
See here for more info: https://www.andrew.cmu.edu/user/bjones/example.htm

Data has same value for every dimension after PCA

I encountered a bug(?) after performing PCA on a big dataset. I have ca. 2000 measurements and ca. 50 features / dimensions. I perform PCA to reduce the number of dimensions; I want to keep only 20-30 of them. But my data looks strange after I project it into the new PCA feature space: every dimension has the same values, except for the first. It doesn't matter how many dimensions I set for PCA, my data always looks like this (three dimensions and four measurements as an example):
10075.1;2.00177e-23;7.70922e-43
10114.6;2.00177e-23;7.70922e-43
10192.9;2.00177e-23;7.70922e-43
9843.2;2.00177e-23;7.70922e-43
What is the reason? Why do I have good data only for the first feature?
This is the original data:
0;24;54;167;19.3625;46;24;21;298.575;254.743;1.17207;1.73611;2.26757;18;15;14;12;9;8;4;15;13;12;9;8;4;33;28;26;21;17;15;8;0;0;1;92283.9;19441.8;16337;11731.8;6796.85;2215.39;1861.07;3516.91;4587.27;4130.99;7.38638;8;9.41167;10.5923;14;19.9733
0;24;54;167;19.3625;45;23;21;272.609;244.143;1.11659;1.89036;2.26757;17;15;14;11;9;7;4;16;13;12;9;8;4;33;28;26;20;17;14;8;0;1;1;92298.5;19414.8;16445.3;11871.4;6873.36;2071.48;1845.56;4483;4588.43;2854.95;7.06929;8;9.08176;10.0947;14;19.1412
0;24;54;167;19.3625;45;23;21;256.58;248.081;1.03426;1.89036;2.26757;17;15;14;11;9;7;4;15;13;12;9;8;4;32;28;26;20;17;14;8;0;1;1;92262.9;19449.6;16602.1;12066.9;6875.38;1762.22;1813.8;4461.31;4605.87;4540.53;6.72761;7;9.17784;10.0404;14;19.0638
0;24;54;167;19.3625;45;22;23;228.664;293.1;0.780157;2.06612;1.89036;16;14;13;10;8;7;4;17;14;13;10;8;3;33;28;26;20;16;13;7;1;0;0;92047.3;19594.2;16615.9;11855.3;6357.26;1412.1;1931.18;3292.93;4305.41;3125.78;7.14206;7;9.15515;10.0013;14;18.9998
Here are the eigenvalues and eigenvectors:
120544647.296627;
1055287.207309433;
788517.1814841435
4.445188101138883e-06, -1.582751359550716e-06, 0.0001194540407426801, 8.805619419232736e-05, 1.718812629108742e-05, -6.478627494871924e-06, 1.866065159173557e-06, -8.102268773738454e-06, 0.001575116366026065, 0.001368858662087531, 2.42338448583798e-06, 1.468791084230193e-07, 1.619495879919206e-08, 2.045676050284675e-06, 4.522426974955079e-06, 1.935642018365442e-06, 9.400348593348646e-07, 3.50785209102226e-06, -6.886458171608557e-07, -2.272864941126205e-06, -4.576437628645375e-06, -3.711985547436847e-06, -4.179746481364989e-06, -1.080958836802159e-06, 3.018347636693104e-06, -5.401065369031065e-08, -1.776343529071431e-06, -3.239711622030108e-06, 2.426893254220096e-06, 2.329701819532251e-06, -1.335049163771412e-06, -2.016447535744125e-06, -2.48848684914049e-06, 1.034821043317487e-06, 0.9509463574053698, 0.2040750414336948, 0.1698045366243798, 0.1221511665292666, 0.06648621927929886, 0.01787357780337607, 0.02181878649610538, 0.04094056949392437, 0.04589005034245261, 0.03602144595540402, 4.638015609510389e-05, -9.594011737623517e-07, 5.643329708389021e-05, 6.49999142971481e-05, 6.708699420903862e-07, 0.0001209291154324417;
-1.193874321738139e-05, -3.042062337012123e-05, -0.0001368023572559274, -0.0001093928140002418, -1.847065231448535e-05, 3.847106756849437e-05, -1.23803319528626e-05, 2.082402112096706e-06, -0.002107941678699949, -0.0007526438176676972, -1.304240623192574e-06, -4.358106348750469e-06, 4.189661461745327e-06, 3.972537960568455e-07, 5.415441896012467e-06, -3.487031299718403e-06, -3.082927770719131e-06, -6.180776247962886e-06, -3.293811231853141e-06, -3.069190535161948e-06, 9.242946297782889e-06, 1.849824602072292e-06, 8.007250998398399e-06, 9.597348504390614e-06, -7.976030386807306e-07, 1.465838819379542e-05, -1.637206697646072e-06, 4.924323227679534e-06, 3.416572256427778e-06, -4.091414270533951e-06, 3.950956777004832e-06, -1.425709512894606e-05, -1.612907157276045e-06, -1.656147283798045e-06, 0.01791626179130883, -0.03865588909604983, -0.02237813174629856, -0.011581970882016, 0.008401303497694863, 0.00598682750741207, -0.02647921936520565, -0.08745349044258101, -0.6199482703379527, 0.7776587660292456, -2.204501859699998e-05, 3.065799954216684e-06, -0.0001088757748474737, -9.070630703475932e-05, -1.507680849966721e-05, -0.000203298163659711;
2.141350692234778e-05, 3.763794188497906e-05, 0.0002682046623337108, 0.0002761646438217766, 2.250001958053043e-05, -4.493680340744517e-05, 1.71038513853044e-05, 4.793887034272248e-05, -0.002472775598056956, -0.002583273192861402, -2.360815196252781e-05, 8.57575614248591e-07, -2.277442903271404e-06, -9.431493206768549e-06, 2.836934896747011e-06, 1.836715455464421e-05, 2.384241283455247e-05, 4.963711569589484e-06, 1.390892651258379e-05, 2.354454084909798e-05, 2.358174073858803e-05, 3.953694936818999e-05, 3.859322887829735e-05, 4.383431246805508e-06, 9.501429817743515e-06, 2.641867563533516e-05, 5.790410392283418e-05, 6.243564171284964e-05, 9.347142816394926e-06, 2.341035633032736e-05, 3.140572721234472e-05, 2.567884918875704e-06, -2.488581283389154e-06, -1.083945623896245e-05, -0.02381539022135584, 0.1464545802416884, 0.09922198413600333, 0.009864006965697942, -0.07588888859083308, -0.1732512868035658, 0.2074803672415529, 0.5543971362454099, -0.6344797023718978, -0.4234201679790431, -0.0001368109107852992, 2.172633922404158e-07, -0.0001132510107743674, -7.90184051908068e-05, 1.89704719379068e-05, -0.0001862727476251848
I thought the reason there is so much variance for the first feature was that the first eigenvalue is very large compared with the other two. When I normalize my data before PCA, I get very similar eigenvalues:
0.6660936495675316;
0.6449413383086006;
0.383110906838073
But the data still looks similar after projecting it in PCA space:
-0.816894;7.1333e-67;2.00113e-23
-0.822324;7.1333e-67;2.00113e-23
-0.831973;7.1333e-67;2.00113e-23
-0.822553;7.1333e-67;2.00113e-23
The problem is that all of your data for features 2, 3, and 4 is very close to, or exactly the same as, the first feature, which is why your results are not great. The magnitude of the differences may also not be large enough to capture the variance of the data.
PCA works on the covariances between the features. You might want to inspect the covariance matrix used by PCA; I suspect that all the values are very close to each other, and that most of the variance is captured by the first eigenvector of the matrix.
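If it helps, here is a rough numpy-only sketch of that check: standardize the features first (so the huge-magnitude columns don't dominate), then look at how the variance is spread over the eigenvalues. The random X is only a stand-in for your 2000x50 matrix.
import numpy as np

# Stand-in for your real data: shape (n_measurements, n_features) ~ (2000, 50)
X = np.random.rand(2000, 50)

# Standardize each feature; constant columns (std == 0) are left untouched
# so we don't divide by zero.
std = X.std(axis=0)
std[std == 0] = 1.0
Xs = (X - X.mean(axis=0)) / std

# Covariance matrix and its eigendecomposition (eigh: the matrix is symmetric)
C = np.cov(Xs, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]              # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of variance each component explains. If the first entry is ~1.0
# without standardization, one raw feature is dominating everything.
print(eigvals / eigvals.sum())

# Project onto the first k principal components
k = 20
scores = Xs @ eigvecs[:, :k]
print(scores[:4, :3])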

How can I remove indices of non-max values that correspond to duplicate values of separate list from both lists?

I have two lists, the first of which represents times of observation and the second of which represents the observed values at those times. I am trying to find the maximum observed value and the corresponding time given a rolling window of various lengths. For example's sake, here are the two lists.
# observed values
linspeed = [280.0, 275.0, 300.0, 475.2, 360.1, 400.9, 215.3, 323.8, 289.7]
# times that correspond to observed values
time_count = [4.0, 6.0, 8.0, 8.0, 10.0, 10.0, 10.0, 14.0, 16.0]
# actual dataset is of size ~ 11,000
The missing times (ex: 3.0) correspond to an observed value of zero, whereas duplicate times correspond to multiple observations at the floored time. Since my window will be rolling over the time_count (ex: max value in first 2 hours, next 2 hours, 2 hours after that; max value in first 4 hours, next 4 hours, ...), I plan to use an array-reshaping routine. However, it's important to set everything up properly beforehand, which entails finding the maximum value given duplicate times. To solve this problem, I tried the code just below.
def list_duplicates(data_list):
    seen = set()
    seen_add = seen.add
    seen_twice = set(x for x in data_list if x in seen or seen_add(x))
    return list(seen_twice)
# check for duplicate values
dups = list_duplicates(time_count)
print(dups)
>> [8.0, 10.0]
# get index of duplicates
for dup in dups:
    print(time_count.index(dup))
>> 2
>> 4
When checking for the index of the duplicates, it appears that this code will only return the index of the first occurrence of the duplicate value. I also tried using OrderedDict from the collections module for reasons of code efficiency/speed, but dictionaries have a similar problem: given duplicate keys for non-duplicate observation values, only one instance of the duplicate key and its corresponding observation value is kept while all others are dropped from the dict. Per this SO post, my second attempt is just below.
for dup in dups:
    indexes = [i for i, x in enumerate(time_count) if x == dup]
print(indexes)
>> [4, 5, 6] # indices correspond to duplicate time 10.0 but not duplicate time 8.0
I should be getting [2, 3] for time_count = 8.0 and [4, 5, 6] for time_count = 10.0. Among the duplicate time_counts, 475.2 is the max linspeed corresponding to duplicate time_count 8.0 and 400.9 is the max linspeed corresponding to duplicate time_count 10.0, meaning the linspeeds at the remaining indices of the duplicate time_counts should be removed.
I'm not sure what else I can try. How can I adapt this (or find a new approach) to find all of the indices that correspond to duplicate values in an efficient manner? Any advice would be appreciated. (PS - I made numpy a tag because I think there is a way to do this via numpy that I haven't figured out yet.)
Without going into the details of how to implement an efficient rolling-window-maximum filter: reducing the duplicate values can be seen as a grouping problem, for which the numpy_indexed package (disclaimer: I am its author) provides efficient and simple solutions:
import numpy_indexed as npi
unique_time, unique_speed = npi.group_by(time_count).max(linspeed)
For large input datasets (i.e., where it matters), this should be a lot faster than any non-vectorized solution. Memory consumption is linear and performance is O(n log n) in general; but since time_count appears to be sorted already, performance should be linear as well.
OK, if you want to do this with numpy, it's best to turn both of your lists into arrays:
import numpy as np
l = np.array(linspeed)
tc = np.array(time_count)
Now, finding unique times is just an np.unique call:
u, i, c = np.unique(tc, return_inverse = True, return_counts = True)
u
Out[]: array([ 4., 6., 8., 10., 14., 16.])
i
Out[]: array([0, 1, 2, 2, 3, 3, 3, 4, 5], dtype=int32)
c
Out[]: array([1, 1, 2, 3, 1, 1])
Now you can either build your maximums with a for loop
m = np.array([np.max(l[i == j]) if c[j] > 1 else l[i == j][0] for j in range(u.size)])
m
Out[]: array([ 280. , 275. , 475.2, 400.9, 323.8, 289.7])
Or try some 2D method. This could be faster, but it would need to be optimized. This is just the basic idea (note that filling with 0 assumes your observed values are non-negative).
np.max(np.where(i[None, :] == np.arange(u.size)[:, None], linspeed, 0), axis = 1)
Out[]: array([ 280. , 275. , 475.2, 400.9, 323.8, 289.7])
Now your m and u vectors are the same length and include the output you want.
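From there, the reshaping step described in the question (missing hours count as zero) could look roughly like this; the non-overlapping 2-hour window is just an example, and the zero padding at the end is an assumption about how you want to handle a partial last window.
# Dense per-hour maxima, with 0 where no observation fell in that hour.
hours = u.astype(int)                  # observed (floored) hours: 4, 6, 8, 10, 14, 16
dense = np.zeros(hours.max() + 1)
dense[hours] = m

# Non-overlapping windows via reshape; pad so the length divides evenly.
window = 2
pad = (-dense.size) % window
padded = np.pad(dense, (0, pad), mode='constant')
window_max = padded.reshape(-1, window).max(axis=1)
print(window_max)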

Generating random numbers given required distribution and empirical sampling

I have two sets of samples: one is distributed exponentially and the second follows a Bernoulli distribution (I used scipy.stats.expon and scipy.stats.bernoulli to fit my data).
Based on these samples, I want to create two random generators that will enable me to sample numbers from the two distributions.
What alternatives are there for doing so?
How can I find the correct parameters for creating the random generators?
Use the rvs method to generate a sample using the estimated parameters. For example, suppose x holds my initial data.
In [56]: x
Out[56]:
array([ 0.366, 0.235, 0.286, 0.84 , 0.073, 0.108, 0.156, 0.029,
0.11 , 0.122, 0.227, 0.148, 0.095, 0.233, 0.317, 0.027])
Use scipy.stats.expon to fit the exponential distribution to this data. I assume we are interested in the usual case where the location parameter is 0, so I use floc=0 in the fit call.
In [57]: from scipy.stats import expon
In [58]: loc, scale = expon.fit(x, floc=0)
In [59]: scale
Out[59]: 0.21076203455218898
Now use those parameters to generate a random sample.
In [60]: sample = expon.rvs(loc=0, scale=scale, size=8)
In [61]: sample
Out[61]:
array([ 0.21576877, 0.23415911, 0.6547364 , 0.44424148, 0.07870868,
0.10415167, 0.12905163, 0.23428833])
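For the Bernoulli part, scipy's discrete distributions (including bernoulli) don't provide a fit method, but the maximum-likelihood estimate of p is just the sample mean of your 0/1 data. A small sketch (the array b here is made-up example data):
import numpy as np
from scipy.stats import bernoulli

b = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1])   # your 0/1 observations

p_hat = b.mean()                       # MLE of the success probability
sample = bernoulli.rvs(p_hat, size=8)  # draw new samples from the fitted distribution
print(p_hat, sample)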

Computation of Kullback-Leibler (KL) distance between text-documents using numpy

My goal is to compute the KL distance between the following text documents:
1)The boy is having a lad relationship
2)The boy is having a boy relationship
3)It is a lovely day in NY
I first of all vectorised the documents in order to easily apply numpy
1)[1,1,1,1,1,1,1]
2)[1,2,1,1,1,2,1]
3)[1,1,1,1,1,1,1]
I then applied the following code for computing KL distance between the texts:
import numpy as np

v = [[1, 1, 1, 1, 1, 1, 1], [1, 2, 1, 1, 1, 2, 1], [1, 1, 1, 1, 1, 1, 1]]
c = v[0]

def kl(p, q):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(np.where(p != 0, (p - q) * np.log10(p / q), 0))

for x in v:
    KL = kl(x, c)
    print(KL)
Here is the result of the above code: [0.0, 0.602059991328, 0.0].
Texts 1 and 3 are completely different, but the distance between them is 0, while texts 1 and 2, which are highly related, have a distance of 0.602059991328. This isn't accurate.
Does anyone have an idea of what I'm not doing right with regard to KL? Many thanks for your suggestions.
Though I hate to add another answer, there are two points here. First, as Jaime pointed out in the comments, KL divergence (or distance - according to the documentation below, they are the same) is designed to measure the difference between probability distributions. This basically means that what you pass to the function should be two array-likes, the elements of each of which sum to 1.
Second, scipy apparently does implement this, with a naming scheme more related to the field of information theory. The function is "entropy":
scipy.stats.entropy(pk, qk=None, base=None)
http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.entropy.html
From the docs:
If qk is not None, then compute a relative entropy (also known as Kullback-Leibler divergence or Kullback-Leibler distance) S = sum(pk * log(pk / qk), axis=0).
A bonus of this function is that it will normalize the vectors you pass it if they do not sum to 1 (though this means you have to be careful with the arrays you pass - i.e., how they are constructed from data).
Hope this helps, and at least a library provides it so you don't have to code your own.
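For instance, with the two count vectors from the question (entropy normalizes them to probabilities internally, and note that KL is not symmetric):
from scipy.stats import entropy

p = [1, 1, 1, 1, 1, 1, 1]   # document 1
q = [1, 2, 1, 1, 1, 2, 1]   # document 2

print(entropy(p, q))        # KL(p || q); natural log by default, pass base=10 to match log10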
After a bit of googling to understand the KL concept, I think your problem is due to the vectorization: you're comparing counts of words that don't line up across documents. You should either tie each column index to a single word, or use a dictionary:
# The boy is having a lad relationship It lovely day in NY
1)[1 1 1 1 1 1 1 0 0 0 0 0]
2)[1 2 1 1 1 0 1 0 0 0 0 0]
3)[0 0 1 0 1 0 0 1 1 1 1 1]
Then you can use your kl function.
To automatically vectorize to a dictionary, see How to count the frequency of the elements in a list? (collections.Counter is exactly what you need). Then you can loop over the union of the keys of the dictionaries to compute the KL distance, as sketched below.
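A rough sketch of that idea, using collections.Counter for the vectorization and scipy.stats.entropy from the other answer for the divergence; the add-one smoothing is my own addition so that a word missing from one document doesn't make the divergence infinite:
from collections import Counter
from scipy.stats import entropy

docs = ["The boy is having a lad relationship",
        "The boy is having a boy relationship",
        "It is a lovely day in NY"]

# Word counts per document and the shared vocabulary (union of all words)
counts = [Counter(d.lower().split()) for d in docs]
vocab = sorted(set().union(*counts))

# Aligned count vectors over the shared vocabulary, with add-one smoothing
vectors = [[c[w] + 1 for w in vocab] for c in counts]

# entropy() normalizes the counts to probabilities before computing KL
print(entropy(vectors[0], vectors[1]))   # docs 1 and 2: small divergence
print(entropy(vectors[0], vectors[2]))   # docs 1 and 3: noticeably larger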
A potential issue might be in your numpy definition of KL. Read the Wikipedia page for the formula: http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
Note that you multiply (p-q) by the log result. In accordance with the KL formula, this should only be p:
return np.sum(np.where(p != 0, p * np.log10(p / q), 0))
That may help...