SAS Action to provide the class probability statistics - sas

I have a vector of nominal values and I need the probability of occurrence of each value. I need these probabilities to obtain their min, max, mean, and std, and to compute the class entropy.
For example, let's assume a data set whose target takes the values 0, 1, or 2. In the training data set, we can count the number of records whose target is 1 and call it n_1; n_0 and n_2 are defined similarly. The probability of observing class 1 in the training data set is then simply p_1 = n_1/(n_0 + n_1 + n_2). Once p_0, p_1, and p_2 are obtained, one can get the min, max, mean, and std of these probabilities.
This is easy to do in Python with pandas, but I want to avoid reading the data set twice. I was wondering whether there is a CAS action in SAS that can provide it. Note that I use the Python API of SAS through swat, so I need the solution to be available from Python.

I found the following solution and it works fine. It uses s.dataPreprocess.highcardinality to get the number of classes and then uses s.dataPreprocess.binning to obtain the number of observations within each class. Then, there is just some straightforward calculation.
import numpy as np
import pandas as pd
import swat
# connect to a CAS server (server and port are placeholders for your own host and port)
s = swat.CAS(server, port)
# load the dataPreprocess action set before calling its actions
s.loadactionset(actionset="dataPreprocess")
# load the table
tbl_name = 'hmeq'
target_var = 'BAD'  # assumption: name of the nominal target column; replace with your own
s.upload("./data/hmeq.csv", casout=dict(name=tbl_name, replace=True))
# call highcardinality to get the number of classes
cardinality_result = s.dataPreprocess.highcardinality(table=tbl_name, vars=[target_var])
cardinality_result_df = pd.DataFrame(cardinality_result["HighCardinalityDetails"])
number_of_classes = int(cardinality_result_df["CardinalityEstimate"].iloc[0])
# call the binning action to get the number of observations in each class
result_binning = s.dataPreprocess.binning(table=tbl_name, vars=[target_var], nBinsArray=[number_of_classes])
result_binning_df = pd.DataFrame(result_binning["BinDetails"])
# turn the per-class counts into probabilities and summarize them
probs = result_binning_df["NInBin"]/result_binning_df["NInBin"].sum()
prob_min = probs.min()
prob_max = probs.max()
prob_mean = probs.mean()
prob_std = probs.std()
entropy = -sum(probs*np.log2(probs))
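The final "straightforward calculation" does not depend on CAS at all, so it can be checked on a small in-memory example. The following is only a sketch using the standard library; class_probability_stats is a hypothetical helper name, and the label list is made up. It uses the sample standard deviation, which is also what pandas' std() computes by default.

```python
import math
import statistics
from collections import Counter


def class_probability_stats(labels):
    """Summarize the class probabilities of a sequence of nominal labels."""
    counts = Counter(labels)                 # n_0, n_1, n_2, ...
    total = sum(counts.values())             # n_0 + n_1 + n_2 + ...
    probs = [n / total for n in counts.values()]
    return {
        "min": min(probs),
        "max": max(probs),
        "mean": statistics.mean(probs),
        # sample standard deviation; undefined for a single class, so use 0.0
        "std": statistics.stdev(probs) if len(probs) > 1 else 0.0,
        "entropy": -sum(p * math.log2(p) for p in probs),
    }


# Made-up labels: class 0 twice, class 1 four times, class 2 twice
stats = class_probability_stats([0, 0, 1, 1, 1, 1, 2, 2])
print(stats)
```

With the CAS workflow above, the per-class counts come from the NInBin column instead of a Counter; the rest of the arithmetic is the same.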

Related

Sparse index use for optimization with Pyomo model

I have a Pyomo model connected to a Django-created website.
My decision variable has 4 indices and I have a huge amount of constraints running on it.
Since Pyomo takes a ton of time to read in the constraints with so many variables, I want to sparsify the index set so that it only contains variables that could actually be 1 (I have some conditions for that).
I saw this post
Create a variable with sparse index in pyomo
and tried a for loop over all my conditions. I created a set "AllowedVariables" to use later inside my constraints.
But Django's server takes so long to create this set while performing the system check that it never finishes.
Currently I have this model:
from pyomo.environ import AbstractModel, Boolean, Constraint, Objective, Set, Var, maximize
model = AbstractModel()
model.x = Var(model.K, model.L, model.F, model.Z, domain=Boolean)
def ObjRule(model):
    # some rule, sense maximize
model.Obj = Objective(rule=ObjRule, sense=maximize)
def ARule(model, l):
    # sum only over the index combinations that are allowed
    maxA = sum(model.x[k,l,f,z] for k in model.K for f in model.F
               for z in model.Z if (k,l,f,z) in model.AllowedVariables)
    return maxA <= 1
model.maxA = Constraint(model.L, rule=ARule)
The constraint is just an example; I have 15 more similar ones. I currently create "AllowedVariables" this way:
AllowedVariables = []
for k in model.K:
    for l in model.L:
        ..... check all sorts of conditions, break if not valid
        AllowedVariables.append((k,l,f,z))
model.AllowedVariables = Set(initialize=AllowedVariables)
Using this, the Django server starts checking....and never stops
performing system checks...
Sadly, I need some restriction on the variables, or else reading the model into the solver will take way too long, since the constraints contain so many unnecessary variables that have to be 0 anyway.
Any ideas on how I can sparse my variable set?
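One possible direction, sketched here in plain Python so it runs without Pyomo: build the list of allowed index tuples once, and also group it by l, so that a rule like ARule can sum over the pre-grouped tuples instead of testing membership for every (k, f, z) combination. Everything in this sketch is an assumption for illustration: is_allowed stands in for the real conditions, and the small ranges stand in for the real K, L, F, Z sets.

```python
from collections import defaultdict
from itertools import product


def is_allowed(k, l, f, z):
    """Hypothetical stand-in for the real validity conditions."""
    return (k + l + f + z) % 3 == 0


# Hypothetical small index sets standing in for model.K, model.L, model.F, model.Z
K, L, F, Z = range(4), range(3), range(2), range(5)

# Build the sparse index list once, outside any constraint rule
allowed = [idx for idx in product(K, L, F, Z) if is_allowed(*idx)]

# Group the allowed tuples by l so that a per-l constraint touches only its own tuples
allowed_by_l = defaultdict(list)
for k, l, f, z in allowed:
    allowed_by_l[l].append((k, l, f, z))

print(len(allowed), "of", len(K) * len(L) * len(F) * len(Z), "index tuples kept")
```

In a Pyomo model, a list like allowed could be passed to Set(initialize=...) and the variable declared over that set rather than over the full product of K, L, F and Z, as in the linked post; a constraint rule would then iterate over allowed_by_l[l]. Whether this is fast enough depends on how expensive the real conditions are.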

How to Use MCMC with a Custom Log-Probability and Solve for a Matrix

The code is in PyMC3, but this is a general problem. I want to find which matrix (combination of variables) gives me the highest probability. Taking the mean of the trace of each element is meaningless because they depend on each other.
Here is a simple case; the code uses a vector rather than a matrix for simplicity. The goal is to find a vector of length 2, where each value is between 0 and 1, such that the values sum to 1.
import numpy as np
import theano
import theano.tensor as tt
import pymc3 as mc
# define a theano Op for our likelihood function
class LogLike_Matrix(tt.Op):
    itypes = [tt.dvector]  # expects a vector of parameter values when called
    otypes = [tt.dscalar]  # outputs a single scalar value (the log likelihood)
    def __init__(self, loglike):
        self.likelihood = loglike  # the log-p function
    def perform(self, node, inputs, outputs):
        # the method that is used when calling the Op
        theta, = inputs  # this will contain my variables
        # call the log-likelihood function
        logl = self.likelihood(theta)
        outputs[0][0] = np.array(logl)  # output the log-likelihood
def logLikelihood_Matrix(data):
    """
    We want sum(data) = 1
    """
    p = 1-np.abs(np.sum(data)-1)
    return np.log(p)
logl_matrix = LogLike_Matrix(logLikelihood_Matrix)
# use PyMC3 to sample from the log-likelihood
with mc.Model():
    """
    Data will be sampled randomly with uniform distribution
    because the log-p doesn't work on it
    """
    data_matrix = mc.Uniform('data_matrix', shape=(2), lower=0.0, upper=1.0)
    # convert the sampled vector to a tensor vector
    theta = tt.as_tensor_variable(data_matrix)
    # use a DensityDist (use a lambda function to "call" the Op)
    mc.DensityDist('likelihood_matrix', lambda v: logl_matrix(v), observed={'v': theta})
    trace_matrix = mc.sample(5000, tune=100, discard_tuned_samples=True)
If you only want the highest-likelihood parameter values, then you want the Maximum A Posteriori (MAP) estimate, which can be obtained using pymc3.find_MAP() (see starting.py for method details). If you expect a multimodal posterior, you will likely need to run this repeatedly with different initializations and select the run that obtains the largest logp value. That only increases the chances of finding the global optimum; it cannot guarantee it.
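The multi-start idea can be illustrated without PyMC3. The sketch below is not pymc3.find_MAP(); it swaps in a simple random-perturbation hill climb (a hypothetical stand-in optimizer) on the toy log-probability from the question, runs it from several random starting points, and keeps the run with the largest log-probability. The step count, step scale, and number of starts are arbitrary choices.

```python
import math
import random


def log_likelihood(theta):
    """Toy log-probability from the question: largest when sum(theta) == 1."""
    p = 1.0 - abs(sum(theta) - 1.0)
    return math.log(p) if p > 0 else float("-inf")


def hill_climb(start, rng, steps=2000, scale=0.05):
    """Stand-in local optimizer: accept random perturbations that raise the log-probability."""
    best, best_lp = list(start), log_likelihood(start)
    for _ in range(steps):
        # perturb every coordinate and clip back into [0, 1]
        cand = [min(1.0, max(0.0, v + rng.gauss(0.0, scale))) for v in best]
        lp = log_likelihood(cand)
        if lp > best_lp:
            best, best_lp = cand, lp
    return best, best_lp


rng = random.Random(0)
starts = [[rng.random(), rng.random()] for _ in range(5)]   # different initializations
results = [hill_climb(s, rng) for s in starts]
best_theta, best_logp = max(results, key=lambda r: r[1])    # keep the best run
print(best_theta, best_logp)
```

Note that in this toy problem every vector whose entries sum to 1 is equally good, so different starts end at different vectors with almost the same log-probability.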
It should be noted that at high parameter dimensions, the MAP estimate is usually not part of the typical set, i.e., it is not representative of typical parameter values that would lead to the observed data. Michael Betancourt discusses this in A Conceptual Introduction to Hamiltonian Monte Carlo. The fully Bayesian approach is to use posterior predictive distributions, which effectively averages over all the high-likelihood parameter configurations rather than using a single point estimate for parameters.

Difficulty Understanding TensorFlow Computations

I'm new to TensorFlow and have difficulty understanding how the computations work. I could not find the answer to my question on the web.
For the following piece of code, the last time I print "d" in the for loop of the "train_neural_net()" function, I'm expecting the values to be identical to when I print "test_distance.eval". But they are way different. Can anyone tell me why this is happening? Isn't TensorFlow supposed to cache the Variable results learned in the for loop and use them when I run "test_distance.eval"?
def neural_network_model1(data):
    nn1_hidden_1_layer = {'weights': tf.Variable(tf.random_normal([5, n_nodes_hl1])), 'biasses': tf.Variable(tf.random_normal([n_nodes_hl1]))}
    nn1_hidden_2_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])), 'biasses': tf.Variable(tf.random_normal([n_nodes_hl2]))}
    nn1_output_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl2, vector_size])), 'biasses': tf.Variable(tf.random_normal([vector_size]))}
    nn1_l1 = tf.add(tf.matmul(data, nn1_hidden_1_layer["weights"]), nn1_hidden_1_layer["biasses"])
    nn1_l1 = tf.sigmoid(nn1_l1)
    nn1_l2 = tf.add(tf.matmul(nn1_l1, nn1_hidden_2_layer["weights"]), nn1_hidden_2_layer["biasses"])
    nn1_l2 = tf.sigmoid(nn1_l2)
    nn1_output = tf.add(tf.matmul(nn1_l2, nn1_output_layer["weights"]), nn1_output_layer["biasses"])
    return nn1_output
def neural_network_model2(data):
    nn2_hidden_1_layer = {'weights': tf.Variable(tf.random_normal([5, n_nodes_hl1])), 'biasses': tf.Variable(tf.random_normal([n_nodes_hl1]))}
    nn2_hidden_2_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])), 'biasses': tf.Variable(tf.random_normal([n_nodes_hl2]))}
    nn2_output_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl2, vector_size])), 'biasses': tf.Variable(tf.random_normal([vector_size]))}
    nn2_l1 = tf.add(tf.matmul(data, nn2_hidden_1_layer["weights"]), nn2_hidden_1_layer["biasses"])
    nn2_l1 = tf.sigmoid(nn2_l1)
    nn2_l2 = tf.add(tf.matmul(nn2_l1, nn2_hidden_2_layer["weights"]), nn2_hidden_2_layer["biasses"])
    nn2_l2 = tf.sigmoid(nn2_l2)
    nn2_output = tf.add(tf.matmul(nn2_l2, nn2_output_layer["weights"]), nn2_output_layer["biasses"])
    return nn2_output
def train_neural_net():
    prediction1 = neural_network_model1(x1)
    prediction2 = neural_network_model2(x2)
    distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(prediction1, prediction2)), reduction_indices=1))
    cost = tf.reduce_mean(tf.multiply(y, distance))
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    hm_epochs = 500
    test_result1 = neural_network_model1(x3)
    test_result2 = neural_network_model2(x4)
    test_distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(test_result1, test_result2)), reduction_indices=1))
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(hm_epochs):
            _, d = sess.run([optimizer, distance], feed_dict = {x1: train_x1, x2: train_x2, y: train_y})
            print("Epoch", epoch, "distance", d)
        print("test distance", test_distance.eval({x3: train_x1, x4: train_x2}))
train_neural_net()
Each time you call the functions neural_network_model1() or neural_network_model2(), you create a new set of variables, so there are four sets of variables in total.
The call to sess.run(tf.global_variables_initializer()) initializes all four sets of variables.
When you train in the for loop, you only update the first two sets of variables, created with these lines:
prediction1 = neural_network_model1(x1)
prediction2 = neural_network_model2(x2)
When you evaluate with test_distance.eval(), the tensor test_distance depends only on the variables that were created in the last two sets of variables, which were created with these lines:
test_result1 = neural_network_model1(x3)
test_result2 = neural_network_model2(x4)
These variables were never updated in the training loop, so the evaluation results will be based on the random initial values.
TensorFlow does include some code for sharing weights between multiple calls to the same function, using with tf.variable_scope(...): blocks. For more information on how to use these, see the tutorial on variables and sharing on the TensorFlow website.
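The difference between creating parameters on every call and creating them once can be shown without TensorFlow. The following framework-free sketch is only an analogy: make_layer and forward are hypothetical helpers, with make_layer playing the role of a function that calls tf.Variable each time it runs.

```python
import random


def make_layer(n_in, n_out, rng):
    """Create a fresh, randomly initialized parameter set (a new one on every call)."""
    return {
        "weights": [[rng.gauss(0.0, 1.0) for _ in range(n_out)] for _ in range(n_in)],
        "biases": [rng.gauss(0.0, 1.0) for _ in range(n_out)],
    }


def forward(layer, x):
    """Compute x @ weights + biases for a single input vector x."""
    columns = zip(*layer["weights"])
    return [sum(xi * w for xi, w in zip(x, col)) + b
            for col, b in zip(columns, layer["biases"])]


rng = random.Random(0)
x = [1.0, 2.0, 3.0]

shared = make_layer(3, 2, rng)       # parameters created once
before = forward(shared, x)
shared["biases"][0] += 1.0           # stand-in for a training update
train_out = forward(shared, x)       # "training" path sees the update
test_out = forward(shared, x)        # "test" path reuses the same parameters

separate = make_layer(3, 2, rng)     # second call: brand-new, untrained parameters
other_out = forward(separate, x)
print(train_out, test_out, other_out)
```

Here train_out and test_out agree because both paths read the same parameter set, while other_out comes from parameters that the update never touched, which mirrors what happens to test_distance in the question.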
You don't need to define two functions for generating the models; you can use tf.name_scope and pass a model name to a single function, using it as a prefix for the variable declarations. You also defined two distance tensors, distance and test_distance, but the model only learns from the training data to minimize cost, which depends only on distance. The variables behind test_distance are therefore never trained. There is no need for two distance tensors; one is enough. To compute the training distance, feed it the training data, and to compute the test distance, feed it the test data.
If you do want the second distance to work, you would have to declare another optimizer for it and train it as you did for the first one. Keep in mind that what a model learns depends on both its initial values and the training data. Even if you feed both models exactly the same training batches, you cannot expect them to end up identical, since the initial weights differ and training can fall into different local minima of the error surface. Finally, note that every call to neural_network_model1 or neural_network_model2 generates new weights and biases, because each tf.Variable call creates a new variable.

Processing influx db output of 'influxdb.resultset.ResultSet'

I am trying to integrate InfluxDB with my application and process the output. I import InfluxDBClient to connect to an InfluxDB instance running on my local machine, and query() returns data as an 'influxdb.resultset.ResultSet'.
However, I want to be able to pick out individual elements of the ResultSet for my computations. I tried functions such as keys(), items() and values() from the influxdb-python manual, but to no avail:
http://influxdb-python.readthedocs.io/en/latest/api-documentation.html
This is the sample output of the query():
Result: ResultSet({'(u'cpu', None)': [{u'usage_guest_nice': 0, u'usage_user': 0.90783871790308868, u'usage_nice': 0, u'usage_steal': 0, u'usage_iowait': 0.056348610076366427, u'host': u'xxx.xxx.hostname.com', u'usage_guest': 0, u'usage_idle': 98.184322579062794, u'usage_softirq': 0.0062609566755314457, u'time': u'2016-06-26T16:25:00Z', u'usage_irq': 0, u'cpu': u'cpu-total', u'usage_system': 0.84522915123660536}]})
I am also finding it hard to get the data in JSON format using raw, as mentioned in the above link. Any pointers for processing the above output would be great.
items() returns tuples in the format ((u'cpu', None), <generator>), where the generator can be looped over to get the actual data as dictionaries. It took me some time to figure out, but it was fun!
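The loop over items() can be sketched as follows. To keep the example runnable without an InfluxDB server, fake_items is a hypothetical stand-in that returns data in the shape described above, a list of ((measurement, tags), generator) pairs; the point values are rounded from the sample output in the question.

```python
def fake_items():
    """Hypothetical stand-in for ResultSet.items(): ((measurement, tags), generator) pairs."""
    points = [
        {"time": "2016-06-26T16:25:00Z", "cpu": "cpu-total",
         "usage_idle": 98.18, "usage_user": 0.91},
    ]
    return [(("cpu", None), (p for p in points))]


rows = []
for (measurement, tags), point_gen in fake_items():
    # each point produced by the generator is a plain dictionary
    for point in point_gen:
        rows.append((measurement, point["time"], point["usage_idle"]))

print(rows)
```

With a real ResultSet, the call to fake_items() would be replaced by the items() method of the object returned from query().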
According to the docs you could use the get_points() function to retrieve results from an InfluxDB resultset. The function allows you to filter by either measurement, tag, both measurement AND tag, or simply get all the results without any filtering.
Getting all points
Using rs.get_points() will return a generator for all the points in the ResultSet.
Filtering by measurement
Using rs.get_points('cpu') will return a generator for all the points that are in a series with measurement name cpu, no matter the tags.
rs = cli.query("SELECT * from cpu")
cpu_points = list(rs.get_points(measurement='cpu'))
Filtering by tags
Using rs.get_points(tags={'host_name': 'influxdb.com'}) will return a generator for all the points that are tagged with the specified tags, no matter the measurement name.
rs = cli.query("SELECT * from cpu")
cpu_influxdb_com_points = list(rs.get_points(tags={"host_name": "influxdb.com"}))
Filtering by measurement and tags
Using measurement name and tags will return a generator for all the points that are in a series with the specified measurement name AND whose tags match the given tags.
rs = cli.query("SELECT * from cpu")
points = list(rs.get_points(measurement='cpu', tags={'host_name': 'influxdb.com'}))

For loop using a t-stat function to create a list

I am using the following function to calculate the t-stat for data in data frame (x):
wilcox.test.all.genes<-function(x,s1,s2) {
  x1<-x[s1]
  x2<-x[s2]
  x1<-as.numeric(x1)
  x2<-as.numeric(x2)
  wilcox.out<-wilcox.test(x1,x2,exact=F,alternative="two.sided",correct=T)
  out<-as.numeric(wilcox.out$statistic)
  return(out)
}
I need to write a for loop that will iterate a specific number of times. For each iteration, the columns need to be shuffled, the above function performed and the maximum t-stat value saved to a list.
I know that I can use the sample() function to shuffle the columns of the data frame, and the max() function to identify the maximum t-stat value, but I can't figure out how to put them together to achieve a workable code.
You are trying to generate empiric p-values, corrected for the multiple comparisons you are making because of the multiple columns in your data. First, let's simulate an example data set:
# Simulate data
n.row = 100
n.col = 10
set.seed(12345)
group = factor(sample(2, n.row, replace=T))
data = data.frame(matrix(rnorm(n.row*n.col), nrow=n.row))
Next, calculate the Wilcoxon test for each column, and replicate this many times while permuting the class membership of the observations. This gives us an empirical null distribution of the test statistic.
# Re-calculate columnwise test statistics many times while permuting class labels
perms = replicate(500, apply(data[sample(nrow(data)), ], 2, function(x) wilcox.test(x[group==1], x[group==2], exact=F, alternative="two.sided", correct=T)$stat))
Calculate the null distribution of the maximum test statistic by collapsing across the multiple comparisons.
# For each permuted replication, calculate the max test statistic across the multiple comparisons
perms.max = apply(perms, 2, max)
By simply sorting the results, we can now determine the p=0.05 critical value.
# Identify critical value
crit = sort(perms.max)[round((1-0.05)*length(perms.max))]
We can also plot our distribution along with the critical value.
# Plot
dev.new(width=4, height=4)
hist(perms.max)
abline(v=crit, col='red')
Finally, comparing a real test statistic to this distribution will give you an empiric p-value, corrected for multiple comparisons by controlling the family-wise error to p<0.05. For example, let's pretend a real test stat was 1600. We could then calculate the p-value like:
> length(which(perms.max>1600))/length(perms.max)
[1] 0.074
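For readers who want to see the same max-statistic permutation procedure spelled out step by step, here is a sketch in Python rather than R, using only the standard library. The helper names, the simulated data sizes, and the observed value of 300 are all made up for illustration. rank_sum_stat counts the pairs in which the first group's value exceeds the second's (ties count one half), which is the W statistic that a two-sample Wilcoxon test reports.

```python
import random


def rank_sum_stat(x1, x2):
    """Number of (a, b) pairs with a > b, counting ties as 0.5 (the two-sample W statistic)."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x1 for b in x2)


def max_stat_null(columns, labels, n_perm, rng):
    """For each label permutation, record the maximum statistic across all columns."""
    labels = list(labels)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(labels)                       # permute class membership
        stats = []
        for col in columns:
            g1 = [v for v, g in zip(col, labels) if g == 1]
            g2 = [v for v, g in zip(col, labels) if g == 2]
            stats.append(rank_sum_stat(g1, g2))
        maxima.append(max(stats))                 # collapse across the comparisons
    return maxima


rng = random.Random(12345)
n_row, n_col = 40, 5                              # made-up sizes
labels = [1] * 20 + [2] * 20
columns = [[rng.gauss(0.0, 1.0) for _ in range(n_row)] for _ in range(n_col)]

null_max = max_stat_null(columns, labels, 200, rng)
crit = sorted(null_max)[round(0.95 * len(null_max)) - 1]   # 0.05 critical value
observed = 300.0                                  # hypothetical real test statistic
p_value = sum(m > observed for m in null_max) / len(null_max)
print(crit, p_value)
```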