Top N outliers in ResultWriter - data-mining

I am dealing with a high-dimensional and large dataset, so I need to get just the top N outliers from the output of ResultWriter.
Is there some option in ELKI to get just the top N outliers from this output?

The ResultWriter is some of the oldest code in ELKI, and needs to be rewritten. It's rather generic - it tries to figure out how to best serialize output as text.
If you want some specific format, or a specific subset, the proper way is to write your own ResultHandler. There is a tutorial for writing a ResultHandler.
If you want to find the input coordinates in the result,
Database db = ResultUtil.findDatabase(baseResult);
Relation<NumberVector> rel = db.getRelation(TypeUtil.NUMBER_VECTOR_VARIABLE_LENGTH);
will return the first relation containing numeric vectors.
To iterate over the objects sorted by their outlier score, and stop after the top N (here an int n), use:
OrderingResult order = outlierResult.getOrdering();
DBIDs ids = order.order(order.getDBIDs());
int i = 0;
for (DBIDIter it = ids.iter(); it.valid() && i < n; it.advance(), i++) {
  // Output as desired, e.g. the vector rel.get(it).
}

Related

SAS Action to provide the class probability statistics

I have a vector of nominal values and I need to know the probability of occurrence of each of the nominal values. Basically, I need those to obtain the min, max, mean, and std of the probability of observing the nominal values, and to get the Class Entropy value.
For example, let's assume there is a data-set in which the target to predict is 0, 1, or 2. In the training data-set we can count the number of records whose target is 1 and call it n_1; similarly we can define n_0 and n_2. Then the probability of observing class 1 in the training data-set is simply p_1 = n_1/(n_0 + n_1 + n_2). Once p_0, p_1, and p_2 are obtained, one can get the min, max, mean, and std of these probabilities.
It is easy to get this in Python with pandas, but I want to avoid reading the data-set twice. I was wondering if there is any CAS action in SAS that can provide it. Note that I use the Python API of SAS through swat, so I need the API in Python.
I found the following solution and it works fine. It uses s.dataPreprocess.highcardinality to get the number of classes and then uses s.dataPreprocess.binning to obtain the number of observations within each class. Then, there is just some straightforward calculation.
import swat
import pandas as pd
import numpy as np

# create a CAS session (server, port and target_var are assumed to be defined elsewhere)
s = swat.CAS(server, port)
# load the dataPreprocess action set before calling its actions
s.loadactionset(actionset="dataPreprocess")
# upload the table
tbl_name = 'hmeq'
s.upload("./data/hmeq.csv", casout=dict(name=tbl_name, replace=True))
# call highCardinality to get the number of classes
cardinality_result = s.dataPreprocess.highcardinality(table=tbl_name, vars=[target_var])
cardinality_result_df = pd.DataFrame(cardinality_result["HighCardinalityDetails"])
number_of_classes = int(cardinality_result_df["CardinalityEstimate"])
# call the binning action to get the number of observations in each class
result_binning = s.dataPreprocess.binning(table=tbl_name, vars=[target_var], nBinsArray=[number_of_classes])
result_binning_df = pd.DataFrame(result_binning["BinDetails"])
# class probabilities and their summary statistics
probs = result_binning_df["NInBin"]/result_binning_df["NInBin"].sum()
prob_min = probs.min()
prob_max = probs.max()
prob_mean = probs.mean()
prob_std = probs.std()
# class entropy (in bits)
entropy = -sum(probs*np.log2(probs))
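For comparison only, the same statistics computed client-side with plain pandas. This reads the data into the client, which the question wants to avoid; it is shown just to illustrate the calculation, assuming a hypothetical local DataFrame df that contains the target column:
# class probabilities from a local DataFrame (df and target_var are assumptions)
probs = df[target_var].value_counts(normalize=True)
prob_min, prob_max = probs.min(), probs.max()
prob_mean, prob_std = probs.mean(), probs.std()
entropy = -(probs * np.log2(probs)).sum()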

Adding values from multiple .rrd files

Problem =====>
Basically there are three .rrd files which are generated for three departments.
From those we fetch three values (MIN, MAX, CURRENT) and print them in a 3x3 format. There is a Python script which does that.
eg -
Dept1: Min=10 Max=20 Cur=15
Dept2: Min=0 Max=10 Cur=5
Dept3: Min=10 Max=30 Cur=25
Now I want to add the values together (Min, Max, Cur) and print in one line.
eg -
Dept: Min=20 Max=60 Cur=45
Issue I am facing =====>
No matter what CDEF I write, I am breaking the graph. :(
This is the part I hate, as I do not get any error message.
As far as I understand (please correct me if I am wrong), I definitely cannot store the value anywhere in my program, as a graph is returned.
What would be a proper way to add the values in this situation?
Please let me know if my description of the problem is lacking detail.
You can do this with a VDEF over a CDEF'd sum.
DEF:a=dept1.rrd:ds0:AVERAGE
DEF:b=dept2.rrd:ds0:AVERAGE
DEF:maxa=dept1.rrd:ds0:MAXIMUM
DEF:maxb=dept2.rrd:ds0:MAXIMUM
CDEF:maxall=maxa,maxb,+
CDEF:all=a,b,+
VDEF:maxalltime=maxall,MAXIMUM
VDEF:alltimeavg=all,AVERAGE
PRINT:maxalltime:Max=%f
PRINT:alltimeavg:Avg=%f
LINE:all#ff0000:AllDepartments
However, you should note that, apart from at the highest granularity, the Min and Max totals will be wrong! This is because max(a+b) != max(a) + max(b). If you don't calculate the min/max aggregate at the time of storage, the granularity will be gone at the time of display.
For example, if a = (1, 2, 3) and b = (3, 2, 1), then max(a) + max(b) = 6; however the maximum at any point in time is in fact 4. The same issue applies to using min(a) + min(b).
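If you only need the three summed numbers rather than a graph, another option is to fetch the raw series from each file in the existing Python script, add them per time step, and aggregate the summed series afterwards. A minimal sketch, assuming the python-rrdtool bindings, a data source named ds0 in each file, and a one-hour window with matching steps (all of these are assumptions, not taken from the question):
import rrdtool

files = ['dept1.rrd', 'dept2.rrd', 'dept3.rrd']

def fetch_ds0(path):
    # one value per time step for data source ds0 over the last hour
    (start, end, step), names, rows = rrdtool.fetch(path, 'AVERAGE', '--start', '-1h')
    return [row[names.index('ds0')] for row in rows]

series = [fetch_ds0(f) for f in files]
# add the departments per time step, skipping steps with missing data
totals = [sum(vals) for vals in zip(*series) if None not in vals]
print('Dept: Min=%.2f Max=%.2f Cur=%.2f' % (min(totals), max(totals), totals[-1]))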

How to apply a mask to a DataFrame in Python?

My dataset named ds_f is an 840x57 matrix which contains NaN values. I want to forecast a variable with a linear regression model, but when I try to fit the model I get the message "SVD did not converge":
X = ds_f[ds_f.columns[:-1]]
y = ds_f['target_o_tempm']
model = sm.OLS(y,X) #stackmodel
f = model.fit() #ERROR
So I've been searching for a way to apply a mask to a DataFrame. I thought of creating a mask to "ignore" NaN values and then converting it back into a DataFrame, but I get the same DataFrame as ds_f; nothing changes:
m = ma.masked_array(ds_f, np.isnan(ds_f))
m_ds_f = pd.DataFrame(m,columns=ds_f.columns)
EDIT: I've solved that problem by writing model = sm.OLS(y, X, missing='drop'), but a new problem appears when I display the results: I get only NaN values.
Are you using statsmodels? If so, you could specify sm.OLS(y, X, missing='drop') to drop the NaN values prior to estimation.
Alternatively, you may want to consider interpolating the missing values, rather than dropping them.
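A minimal sketch of both options, reusing the variable names from the question (ds_f and the target column target_o_tempm) and assuming statsmodels is imported as sm:
import statsmodels.api as sm

# Option 1: let statsmodels drop rows containing NaN before fitting
X = ds_f[ds_f.columns[:-1]]
y = ds_f['target_o_tempm']
f = sm.OLS(y, X, missing='drop').fit()
print(f.summary())

# Option 2: interpolate the missing values first, then fit as usual
ds_filled = ds_f.interpolate().dropna()
f2 = sm.OLS(ds_filled['target_o_tempm'], ds_filled[ds_filled.columns[:-1]]).fit()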

How to smooth numbers from a file as many times as wanted in Python 2.7?

I'm trying to create a program that will open a file with a list of numbers in it and then smooth those numbers as many times as the user wants. I have it opening and reading the file, but it will not smooth the numbers. In this format it gives this error: TypeError: unsupported operand type(s) for /: 'str' and 'float'. I also need to figure out how to make it smooth the numbers the number of times the user asks for. The list of numbers I used in my .txt file is [3, 8, 5, 7, 1].
Here is exactly what I am trying to get it to do:
Ask the user for a filename
Read all floating point data from file into a list
Ask the user how many smoothing passes to make
Display smoothed results with two decimal places
Use functions where appropriate
Algorithm:
Never change the first or last value
Compute new values for all other values by averaging the value with its two neighbors
Here is what I have so far:
filename = raw_input('What is the filename?: ')
inFile = open(filename)
data = inFile.read()
print data
data2 = data[:]
print data2
data2[1]=(data[0]+data[1]+data[2])/3.0
print data2
data2[2]=(data[1]+data[2]+data[3])/3.0
print data2
data2[3]=(data[2]+data[3]+data[4])/3.0
print data2
You almost certainly don't want to be manually indexing the list items. Instead, use a loop:
data2 = data[:]
for i in range(1, len(data)-1):
    data2[i] = sum(data[i-1:i+2])/3.0
data = data2
You can then put that code inside another loop, so that you smooth repeatedly:
smooth_steps = int(raw_input("How many times do you want to smooth the data?"))
for _ in range(smooth_steps):
    # code from above goes here
Note that my code above assumes that you have read numeric values into the data list. However, the code you've shown doesn't do this. You simply use data = inFile.read() which means data is a string. You need to actually parse your file in some way to get a list of numbers.
In your immediate example, where the file contains a Python formatted list literal, you could use eval (or ast.literal_eval if you wanted to be a bit safer). But if this data is going to be used by any other program, you'll probably want a more widely supported format, like CSV, JSON or YAML (all of which have parsers available in Python).
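Putting these pieces together, here's a minimal sketch of the whole program, assuming Python 2.7 and a file whose contents are a list literal such as [3, 8, 5, 7, 1], as in the question:
import ast

filename = raw_input('What is the filename?: ')
with open(filename) as in_file:
    # parse the list literal from the file into floats
    data = [float(x) for x in ast.literal_eval(in_file.read())]

smooth_steps = int(raw_input('How many smoothing passes to make? '))
for _ in range(smooth_steps):
    data2 = data[:]
    # never change the first or last value; average the rest with their two neighbours
    for i in range(1, len(data) - 1):
        data2[i] = sum(data[i - 1:i + 2]) / 3.0
    data = data2

# display smoothed results with two decimal places
print ', '.join('%.2f' % x for x in data)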

For loop using a t-stat function to create a list

I am using the following function to calculate the t-stat for data in data frame (x):
wilcox.test.all.genes <- function(x, s1, s2) {
  x1 <- x[s1]
  x2 <- x[s2]
  x1 <- as.numeric(x1)
  x2 <- as.numeric(x2)
  wilcox.out <- wilcox.test(x1, x2, exact=F, alternative="two.sided", correct=T)
  out <- as.numeric(wilcox.out$statistic)
  return(out)
}
I need to write a for loop that will iterate a specific number of times. For each iteration, the columns need to be shuffled, the above function performed and the maximum t-stat value saved to a list.
I know that I can use the sample() function to shuffle the columns of the data frame, and the max() function to identify the maximum t-stat value, but I can't figure out how to put them together to achieve a workable code.
You are trying to generate empirical p-values, corrected for the multiple comparisons you are making because of the multiple columns in your data. First, let's simulate an example data set:
# Simulate data
n.row = 100
n.col = 10
set.seed(12345)
group = factor(sample(2, n.row, replace=T))
data = data.frame(matrix(rnorm(n.row*n.col), nrow=n.row))
We calculate the Wilcoxon test for each column, replicating this many times while permuting the class membership of the observations. This gives us an empirical null distribution of the test statistic.
# Re-calculate columnwise test statisitics many times while permuting class labels
perms = replicate(500, apply(data[sample(nrow(data)), ], 2, function(x) wilcox.test(x[group==1], x[group==2], exact=F, alternative="two.sided", correct=T)$stat))
Calculate the null distribution of the maximum test statistic by collapsing across the multiple comparisons.
# For each permuted replication, calculate the max test statistic across the multiple comparisons
perms.max = apply(perms, 2, max)
By simply sorting the results, we can now determine the p=0.05 critical value.
# Identify critical value
crit = sort(perms.max)[round((1-0.05)*length(perms.max))]
We can also plot our distribution along with the critical value.
# Plot
dev.new(width=4, height=4)
hist(perms.max)
abline(v=crit, col='red')
Finally, comparing a real test statistic to this distribution will give you an empirical p-value, corrected for multiple comparisons by controlling the family-wise error rate at p<0.05. For example, let's pretend a real test statistic was 1600. We could then calculate the p-value like this:
> length(which(perms.max>1600))/length(perms.max)
[1] 0.074