Getting significance level, alpha, from KS test results? - python-2.7
I am trying to find the significance level/alpha level (to eventually get the confidence level) of my Kolmogorov-Smirnov test results, and I feel like I'm going crazy because this doesn't seem to be explained clearly anywhere (at least not in a way that I understand).
I have sample data that I want to see if it comes from one of four probability distribution functions: Cauchy, Gaussian, Students t, and Laplace. (I am not doing a two-sample test.)
Here is sample code for Cauchy:
### Cauchy Distribution Function
data = [-1.058, 1.326, -4.045, 1.466, -3.069, 0.1747, 0.6305, 5.194, 0.1024, 1.376, -5.989, 1.024, 2.252, -1.451, -5.041, 1.542, -3.224, 1.389, -2.339, 4.073, -1.336, 1.081, -2.573, 3.788, 2.26, -0.6905, 0.9064, -0.7214, -0.3471, -1.152, 1.904, 2.082, -2.471, 0.6434, -1.709, -1.125, -1.607, -1.059, -1.238, 6.042, 0.08664, 2.69, 1.013, -0.7654, 2.552, 0.7851, 0.5365, 4.351, 0.9444, -2.056, 0.9638, -2.64, 1.165, -1.103, -1.624, -1.082, 3.615, 1.709, 2.945, -5.029, -3.57, 0.6126, -2.88, 0.4868, 0.4222, -0.2062, -1.337, -0.326, -2.784, 6.724, -0.1316, 4.681, 6.839, -1.987, -5.372, 1.522, -2.347, 0.4531, -1.154, -3.631, 0.426, -4.271, 1.687, -1.612, -1.438, 0.8777, 0.06759, 0.6114, -1.296, 0.07865, -1.104, -1.454, -1.62, -1.755, 0.7868, -3.312, 1.054, -2.183, -7.066, -0.04661, 1.612, 1.441, -1.768, -0.2443, -0.7033, -1.16, 0.2529, 0.2441, -1.962, 0.568, 1.568, 8.385, 0.7192, -1.084, 0.9035, 3.376, -0.7172, -0.1221, 3.267, 0.4064, -0.4894, -2.001, 1.63, -2.891, 0.6244, 2.381, -1.037, -1.705, -0.5223, -0.2912, 1.77, -3.792, 0.1716, 4.121, -0.9119, -0.1166, 5.694, -5.904, 0.5485, -2.788, 2.582, -1.553, 1.95, 3.886, 1.066, -0.475, 0.5701, -0.9367, -2.728, 4.588, -5.544, 1.373, 1.807, 2.919, 0.8946, 0.6329, -1.34, -0.6154, 4.005, 0.204, -1.201, -4.912, -4.766, 0.0554, 3.484, -2.819, -5.131, 2.108, -1.037, 1.603, 2.027, 0.3066, -0.3446, -1.833, -2.54, 2.828, 4.763, 0.9926, 2.504, -1.258, 0.4298, 2.536, -1.214, -3.932, 1.536, 0.03379, -3.839, 4.788, 0.04021, -0.2701, -2.139, 0.1339, 1.795, -2.12, 5.558, 0.8838, 1.895, 0.1073, 2.011, -1.267, -1.08, -1.12, -1.916, 1.524, -1.883, 5.348, 0.115, -1.059, -0.4772, 1.02, -0.4057, 1.822, 4.011, -3.246, -7.868, 2.445, 2.271, 0.5377, 0.2612, 0.7397, -1.059, 1.177, 2.706, -4.805, -0.7552, -4.43, -0.4607, 1.536, -4.653, -0.5952, 0.8115, -0.4434, 1.042, 1.179, -0.1524, 0.2753, -1.986, -2.377, -1.21, 2.543, -2.632, -2.037, 4.011, 1.98, -2.589, -4.9, 1.671, -0.2153, -6.109, 2.497]
import scipy.stats as ss

def C(data):
    stuff = []
    # vary gamma
    for scale in xrange(1, 101, 1):
        ks_statistic, pvalue = ss.kstest(data, "cauchy", args=(scale,))
        stuff.append((ks_statistic, pvalue, scale))
    bestks = min(c[0] for c in stuff)
    bestrow = [row for row in stuff if row[0] == bestks]
    return bestrow
I am trying to fit this function to my data and return the scale parameter (gamma) that corresponds to the highest probability of being fit by a Cauchy distribution; the corresponding KS statistic and p-value are returned as well. I thought this would be done by finding the minimum KS statistic, i.e. the curve that yields the smallest distance between any given data point and the corresponding point on the distribution curve. I realize, though, that I still need to find "alpha" so that I can state the probability that the sample data come from a Cauchy distribution with the scale/gamma value I found.
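For what it's worth, here is a minimal sketch of how a chosen significance level relates to the p-value returned by kstest; the alpha = 0.05 threshold is only an illustrative assumption, not something the test itself provides:

alpha = 0.05            # chosen significance level (assumed for illustration)
confidence = 1 - alpha  # corresponding confidence level

ks_statistic, pvalue, scale = C(data)[0]
if pvalue < alpha:
    print("Reject H0 at alpha=%.2f: the data are unlikely to come from "
          "cauchy with scale=%d" % (alpha, scale))
else:
    print("Cannot reject H0 at alpha=%.2f (confidence %.0f%%) for scale=%d"
          % (alpha, confidence * 100, scale))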
I have referenced many sources trying to explain how to find "alpha", but I have no clue how to do this in my code.
Thank you for any help and insight!
I think this question is actually outside the scope of SO because it involves statistics. You would probably be better off asking on, say, Cross Validated. However, let me offer one or two remarks.
The K-S test is used for testing whether a given set of data has arisen from a given, fully specified distribution function. (Even for this purpose it might not be optimal.) It's not intended, as far as I know, as a measure of fit amongst alternatives.
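To make that concrete, here is an illustrative sketch of a K-S test against a fully specified null, reusing the data and the scipy.stats import from the question; the loc=0, scale=1 values are chosen in advance rather than estimated from the same data:

import scipy.stats as ss

# Fully specified null hypothesis: Cauchy with loc=0 and scale=1.
d, p = ss.kstest(data, "cauchy", args=(0, 1))
print("K-S statistic = %.4f, p-value = %.4f" % (d, p))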
To make inferences about probabilities one must have a viable probability model for the data in the first place. In this case, what is the space of alternatives and how are probabilities assigned to them under the null and alternative hypotheses?
Now, to get back to that unhelpful comment that I offered. Thanks for being so tactful about it! This is what I was trying to express.
You try scales from 1 to 100 in unit steps. I wanted to point out that scales less than one produce curious results. Now I see several close fits, especially when the p-values are considered; there's nothing to tell them apart from the fit at scale=2. Here's a plot.
Each triple gives (scale, K-S, p).
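If it helps, here is a sketch of how such a sweep could be produced and plotted (assuming matplotlib is available; the scale range below is only an example):

import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt

# Sweep the scale parameter, including values below 1, and record
# the K-S statistic and p-value for each candidate scale.
scales = np.arange(0.1, 10.0, 0.1)
results = [ss.kstest(data, "cauchy", args=(s,)) for s in scales]
ds = [r[0] for r in results]
ps = [r[1] for r in results]

plt.plot(scales, ds, label="K-S statistic")
plt.plot(scales, ps, label="p-value")
plt.xlabel("scale")
plt.legend()
plt.show()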
The main thing might be, what do you want from your data?
Related
determining whether two convex hulls overlap
I'm trying to find an efficient algorithm for determining whether two convex hulls intersect or not. The hulls consist of data points in N-dimensional space, where N is 3 up to 10 or so. One elegant algorithm was suggested here using linprog from scipy, but you have to loop over all points in one hull, and it turns out the algorithm is very slow for low dimensions (I tried it and so did one of the respondents). It seems to me the algorithm could be generalized to answer the question I am posting here, and I found what I think is a solution here. The authors say that the general linear programming problem takes the form Ax + tp >= 1, where the A matrix contains the points of both hulls, t is some constant >= 0, and p = [1,1,1,1...1] (it's equivalent to finding a solution to Ax > 0 for some x). As I am new to linprog() it isn't clear to me whether it can handle problems of this form. If A_ub is defined as on page 1 of the paper, then what is b_ub?
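For illustration, the feasibility system described above can be phrased for linprog roughly as in the sketch below; the construction of A_ub and b_ub mirrors the answer that follows, the function name is just a placeholder, and (as that answer notes) linprog may not be robust enough in practice:

import numpy as np
from scipy.optimize import linprog

def clouds_overlap_linprog(cloud1, cloud2):
    # Stack the points of both hulls and append the column for t,
    # giving the system A_ub x <= b_ub with b_ub a vector of -1s.
    A_ub = np.hstack((np.vstack((-cloud1, cloud2)),
                      np.r_[np.ones((len(cloud1), 1)),
                            -np.ones((len(cloud2), 1))]))
    b_ub = -np.ones(len(cloud1) + len(cloud2))
    # Feasibility check only, so the objective is all zeros.
    c = np.zeros(A_ub.shape[1])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * A_ub.shape[1])
    # If no separating solution exists, the hulls overlap.
    return not res.success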
There is a nice explanation of how to do this problem, with an algorithm in R, on this website. My original post referred to the scipy.optimize.linprog library, but this proved to be insufficiently robust. I found that the SCS algorithm in the cvxpy library worked very nicely, and based on this I came up with the following python code:

import numpy as np
import cvxpy as cvxpy

# Determine feasibility of Ax <= b
# cloud1 and cloud2 should be numpy.ndarrays
def clouds_overlap(cloud1, cloud2):
    # build the A matrix
    cloud12 = np.vstack((-cloud1, cloud2))
    vec_ones = np.r_[np.ones((len(cloud1), 1)), -np.ones((len(cloud2), 1))]
    A = np.r_['1', cloud12, vec_ones]

    # make b vector
    ntot = len(cloud1) + len(cloud2)
    b = -np.ones(ntot)

    # define the x variable and the equation to be solved
    x = cvxpy.Variable(A.shape[1])
    constraints = [A*x <= b]

    # since we're only determining feasibility there is no minimization
    # so just set the objective function to a constant
    obj = cvxpy.Minimize(0)

    # SCS was the most accurate/robust of the non-commercial solvers
    # for my application
    problem = cvxpy.Problem(obj, constraints)
    problem.solve(solver=cvxpy.SCS)

    # Any 'inaccurate' status indicates ambiguity, so you can
    # return True or False as you please
    if problem.status == 'infeasible' or problem.status.endswith('inaccurate'):
        return True
    else:
        return False

cube = np.array([[1, 1, 1], [1, 1, -1], [1, -1, 1], [1, -1, -1],
                 [-1, 1, 1], [-1, 1, -1], [-1, -1, 1], [-1, -1, -1]])
inside = np.array([[0.49, 0.0, 0.0]])
outside = np.array([[1.01, 0, 0]])

print("Clouds overlap?", clouds_overlap(cube, inside))
print("Clouds overlap?", clouds_overlap(cube, outside))

# Clouds overlap? True
# Clouds overlap? False

The area of numerical instability is when the two clouds just touch, or are arbitrarily close to touching, such that it isn't possible to definitively say whether they overlap or not. That is one of the cases where you will see this algorithm report an 'inaccurate' status. In my code I chose to consider such cases overlapping, but since it is ambiguous you can decide for yourself what to do.
Why does the output of model.wv.similarity() in Word2Vec differ from model.wv.most_similar()?
I have trained a Word2Vec model and I am trying to use it. When I ask for the most similar words to '动力', I get output like this:

动力系统 0.6429724097251892
驱动力 0.5936785936355591
动能 0.5788494348526001
动力车 0.5579575300216675
引擎 0.5339343547821045
推动力 0.5152761936187744
扭力 0.501279354095459
新动力 0.5010953545570374
支撑力 0.48610919713974
精神力量 0.47970670461654663

But the problem is that if I call model.wv.similarity('动力','动力系统') I get the result 0.0, which is not equal to 0.6429724097251892. What confused me even more was that when I asked for the similarity of the word '动力' and the word '驱动力', it showed 3.689349e+19. So why? Am I misunderstanding what similarity does? I need someone to tell me!!

The code is:

res = model.wv.most_similar('动力')
for r in res:
    print(r[0], r[1])
print(model.wv.similarity('动力', '动力系统'))
print(model.wv.similarity('动力', '驱动力'))
print(model.wv.similarity('动力', '动能'))

Output:

动力系统 0.6429724097251892
驱动力 0.5936785936355591
动能 0.5788494348526001
动力车 0.5579575300216675
引擎 0.5339343547821045
推动力 0.5152761936187744
扭力 0.501279354095459
新动力 0.5010953545570374
支撑力 0.48610919713974
精神力量 0.47970670461654663
0.0
3.689349e+19
2.0
I have written a function to replace the model.wv.similarity method:

def Similarity(w1, w2, model):
    # cosine similarity between the two word vectors
    A = model[w1]
    B = model[w2]
    return sum(A*B) / (pow(sum(pow(A, 2)), 0.5) * pow(sum(pow(B, 2)), 0.5))

where w1 and w2 are the words you input and model is the Word2Vec model you have trained.
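If numpy is available, the same cosine similarity can be written a bit more compactly; this is just an equivalent sketch of the function above, not a different method:

import numpy as np

def similarity_np(w1, w2, model):
    # cosine similarity between the two word vectors
    a, b = model[w1], model[w2]
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))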
Using the similarity method directly from the model is deprecated. It has a bit of extra logic in it that performs vector normalization before evaluating the result. You should be using wv directly because, as stated in the documentation, once the word vectors are trained it does not matter how they were trained; they can be treated as an independent structure, and the model is just the means to obtain it. Here is a short discussion which should give you starting points if you want to investigate further.
It may be an encoding issue, where you are not actually comparing the same tokens. Try the following, to see if it gives results closer to what you expect.

res = model.wv.most_similar('动力')
for r in res:
    print(r[0], r[1])

print(model.wv.similarity('动力', res[0][0]))
print(model.wv.similarity('动力', res[1][0]))
print(model.wv.similarity('动力', res[2][0]))

If it does, you could look further into why the model might be reporting strings which print as 动力系统 (etc.) but don't match your typed-in-code string literals like '动力系统' (etc.). For example:

print(res[0][0] == '动力系统')
print(type(res[0][0]))
print(type('动力系统'))
Adding values from multiple .rrd files
Problem: Basically there are three .rrd files, one generated for each of three departments. From each we fetch three values (MIN, MAX, CURRENT) and print them in a 3x3 format; there is a Python script which does that. For example:

Dept1: Min=10 Max=20 Cur=15
Dept2: Min=0 Max=10 Cur=5
Dept3: Min=10 Max=30 Cur=25

Now I want to add the values together (Min, Max, Cur) and print them in one line, e.g.:

Dept: Min=20 Max=60 Cur=45

Issue I am facing: No matter what CDEF I write, I am breaking the graph. :( This is the part I hate, as I do not get any error message. As far as I understand (please correct me if I am wrong), I cannot store the values anywhere in my program because a graph is returned. What would be a proper way to add the values in this situation? Please let me know if my description of the problem is lacking detail.
You can do this with a VDEF over a CDEF'd sum:

DEF:a=dept1.rrd:ds0:AVERAGE
DEF:b=dept2.rrd:ds0:AVERAGE
DEF:maxa=dept1.rrd:ds0:MAXIMUM
DEF:maxb=dept2.rrd:ds0:MAXIMUM
CDEF:maxall=maxa,maxb,+
CDEF:all=a,b,+
VDEF:maxalltime=maxall,MAXIMUM
VDEF:alltimeavg=all,AVERAGE
PRINT:maxalltime:Max=%f
PRINT:alltimeavg:Avg=%f
LINE:all#ff0000:AllDepartments

However, you should note that, apart from at the highest granularity, the Min and Max totals will be wrong! This is because max(a+b) != max(a) + max(b). If you don't calculate the min/max aggregate at time of storage, the granularity will be gone at time of display. For example, if a = (1, 2, 3) and b = (3, 2, 1), then max(a) + max(b) = 6; however the maximum at any point in time is in fact 4. The same issue applies to using min(a) + min(b).
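A quick numeric illustration of that caveat, using the sample values above (Python, just to show the arithmetic):

a = (1, 2, 3)
b = (3, 2, 1)

print(max(a) + max(b))                   # 6: sum of the per-series maxima
print(max(x + y for x, y in zip(a, b)))  # 4: the true maximum of the summed series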
Multi-label text classification with scikit-learn
I'm new to machine learning and I'm having trouble adapting any examples that I've found to my specific problem. The official documentation for scikit is rather spartan and full of terminology I'm unfamiliar with, so I'm not really sure which algorithm I should be using, how to properly prepare my data for it, and how to get the predictions in the form I want.

I already have my feature extraction function for the text in place, which returns a tuple of floats ranging from 0.0 to 100.0. These represent the prevalence of a certain characteristic in the text as a percentage. So my features for a certain piece of text would look something like (0.0, 17.31, 57.0, 93.2, ...). I'm unsure of which algorithm would be the most suitable for this type of data.

As per the title, I also need the ability to classify a piece of text using more than one label. Reading some other SO questions clued me in that I need to use MultiLabelBinarizer and OneVsRestClassifier, but I'm still unsure how to apply them to my data and whichever algorithm I'll need to use.

I also didn't find any examples that would return prediction results for the multiple labels in the form I want them. That is, instead of a binary "is or isn't this label", I'd like a percentage chance that the text is of a certain label. So when doing something like classifier.predict(testData) I'd like a return like {"spam":87.3, "code":27.9, "urlList":3.12} instead of something like ["spam", "code", "urlList"]. That way I can make more precise decisions about what to do with a certain text.

I should probably also mention one characteristic of the dataset that I'm using, and that is that 85-90% of the text will be code, and therefore only have one tag, "code". I imagine there are some tweaks to the algorithm required to account for this?

Some simplified and probably unsuitable code:

possibleLabels = ["code", "spam", "urlList"]

trainData, trainLabels = [(0.0, 17.31, 57.0, 93.2), ...], [["spam"], ["code"], ["code", "urlList"], ...]
testData, testLabels = [], []  # Separate batch of samples in the same format as above

# Not sure if this is the proper way to prepare my labels,
# nor how to later resolve the binarized versions to their string counterparts.
mlb = preprocessing.MultiLabelBinarizer()
fitTrainLabels = mlb.fit_transform(trainLabels)

# Feels like I need more to make it suitable for my data
classifier = OneVsRestClassifier()
classifier.fit(trainData, fitTrainLabels)

# Need return as a list of dicts containing probability of tags,
# i.e. [{"spam":87.3, "code":27.9, "urlList":3.12}, {...}, ...]
predicted = classifier.predict(testData)
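One possible direction, sketched below rather than a vetted solution: OneVsRestClassifier needs a base estimator, and if that estimator supports predict_proba you can map the per-class probabilities back to the label names stored in MultiLabelBinarizer.classes_. The toy feature values and the choice of LogisticRegression here are illustrative assumptions only:

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Toy data in the same shape as described above (values are made up).
trainData = [(0.0, 17.31, 57.0, 93.2),
             (5.0, 2.0, 80.0, 10.0),
             (1.0, 1.0, 1.0, 1.0)]
trainLabels = [["spam"], ["code"], ["code", "urlList"]]
testData = [(0.5, 10.0, 60.0, 50.0)]

mlb = MultiLabelBinarizer()
fitTrainLabels = mlb.fit_transform(trainLabels)

# Any base estimator with predict_proba would work; LogisticRegression
# is only one possible choice.
classifier = OneVsRestClassifier(LogisticRegression())
classifier.fit(trainData, fitTrainLabels)

# predict_proba gives one probability per label, in mlb.classes_ order.
for row in classifier.predict_proba(testData):
    print({label: round(100 * p, 2) for label, p in zip(mlb.classes_, row)})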
Top n outliers in ResultWriter
I am dealing with a high-dimensional and large dataset, so I need to get just the top N outliers from the output of ResultWriter. Is there some option in ELKI to get just the top N outliers from this output?
The ResultWriter is some of the oldest code in ELKI, and needs to be rewritten. It's rather generic: it tries to figure out how to best serialize output as text. If you want some specific format, or a specific subset, the proper way is to write your own ResultHandler. There is a tutorial for writing a ResultHandler.

If you want to find the input coordinates in the result,

Database db = ResultUtil.findDatabase(baseResult);
Relation<NumberVector> rel = db.getRelation(TypeUtil.NUMBER_VECTOR_VARIABLE_LENGTH);

will return the first relation containing numeric vectors. To iterate over the objects sorted by their outlier score, use:

OrderingResult order = outlierResult.getOrdering();
DBIDs ids = order.order(order.getDBIDs());
for (DBIDIter it = ids.iter(); it.valid(); it.advance()) {
  // Output as desired.
}