Get Weka Classifier Results - weka

I am new to Weka and trying to run classifiers on a csv data file. I am able to print the results by getting them in string from classifier.toString(), evaluation.toSummaryString(), evaluation.toMatrixString() method.
But my client's requirement is to return the variables values in json object instead of the output format given by Weka. Can we get all the variables values separately so that I can set them to my custom model and return that as json.
e.g. evaluationSummaryString=
Correctly Classified Instances 4 28.5714 %
Incorrectly Classified Instances 10 71.4286 %
Kappa statistic -0.0769
Mean absolute error 0.4722
Root mean squared error 0.6514
Relative absolute error 101.9709 %
Root relative squared error 132.1523 %
Coverage of cases (0.95 level) 50 %
Mean rel. region size (0.95 level) 45.2381 %
Total Number of Instances 14
Can I get above name value pairs read separately instead having them in string.

You can extract the majority of the values you are interested in directly from the evaluation object. I am unsure of "Coverage of cases", and "mean rel. region". The rest can be done as follows:
Instances train = // load train instances ...
Instances test = // load test instances ...
// build and evaluate a model
J48 classifier = new J48();
classifier.buildClassifier(train);
Evaluation eval = new Evaluation(train);
eval.evaluateModel(classifier, test);
// extract the values of interest
double numberCorrect = eval.correct();
double numberIncorrect = eval.incorrect();
double pctCorrect = eval.pctCorrect();
double pctIncorrect = eval.pctIncorrect();
double Kappa = eval.kappa();
double MeanAbsoluteError = eval.meanAbsoluteError();
double RootMeanSquaredError = eval.rootMeanSquaredError();
double RelativeAbsoluteError = eval.relativeAbsoluteError();
double RootRelativeSquaredError = eval.rootRelativeSquaredError();
double TotalNumberOfInstances = eval.numInstances();

Related

ValueError: Target size (torch.Size([10, 1])) must be the same as input size (torch.Size([10, 2]))

A binary classification problem with Batch Size = 10. Trying to use torch.nn.BCEWithLogitsLoss().
~\Anaconda3\envs\notebook\lib\site-packages\torch\nn\functional.py in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
2578
2579 if not (target.size() == input.size()):
-> 2580 raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
2581
2582 return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
ValueError: Target size (torch.Size([1, 10])) must be the same as input size (torch.Size([10, 2]))
Here is my training code:
def train(epochs):
print('Starting training..')
for e in range(0, epochs):
exp_lr_scheduler.step()
print('='*20)
print(f'Starting epoch {e + 1}/{epochs}')
print('='*20)
train_loss = 0.
val_loss = 0.
resnet18.train() # set model to training phase
for train_step, (images, labels) in enumerate(dl_train):
optimizer.zero_grad()
outputs = resnet18(images)
outputs = outputs.float()
loss = loss_fn(outputs, labels.unsqueeze(0))
loss.backward()
optimizer.step()
train_loss += loss.item()
if train_step % 20 == 0:
print('Evaluating at step', train_step)
accuracy = 0
resnet18.eval() # set model to eval phase
for val_step, (images, labels) in enumerate(dl_val):
outputs = resnet18(images)
outputs = outputs.float()
loss = loss_fn(outputs, labels.unsqueeze(0))
val_loss += loss.item()
_, preds = torch.max(outputs, 1)
accuracy += sum((preds == labels).numpy())
val_loss /= (val_step + 1)
accuracy = accuracy/len(val_dataset)
print(f'Validation Loss: {val_loss:.4f}, Accuracy: {accuracy:.4f}')
show_preds()
resnet18.train() #set model to training phase
if accuracy >= 0.95:
print('Performance condition satisfied, stopping..')
return
train_loss /= (train_step + 1)
print(f'Training Loss: {train_loss:.4f}')
print('Training complete..')**
train(epochs=30)
Target size (torch.Size([1, 10])) must be the same as input size (torch.Size([10, 2]))
Seems to me you have two issues:
target size (a.k.a. ground truth tensor) should have the batch on the first axis: (1, 10).
From what you've described you are dealing with a binary classification task not a multi-label (2-class) classification task. Therefore input size (a.k.a. model's output) should have a shape of (10, 1).
In a binary classification task you should only have a single logit coming out of your model, i.e. your last nn.Linear layer should have a single neuron. The output will define which class has been predicted. Since you are using nn.BCEWithLogitsLoss, the loss input should be the raw output (since it includes a Sigmoid layer, cf. documentation) and should have a shape matching (batch_size=10, 1). Similarly, the target tensor should have the same shape. Its content would be 0s and 1s in shape (batch_size=10, 1).

Use Chi-Squared statistic in pymc3

I am trying to use PyMC3 to fit a model to some observed data. This model is based on external code (interfaced via theano.ops.as_op), and depends on multiple parameters that should be fit by the MCMC process. Since the gradient of the external code cannot be determined, I use the Metropolis-Hastings sampler.
I have established Uniform priors for my inputs, and generate a model using my custom code. However, I want to compare the simulated data to my observations (a 3D np.ndarray) using the chi-squared statistic (sum of the squares of data-model/sigma^2) to obtain a log-likelihood. When the MCMC samples are drawn, this should lead to the trace converging on the best values of each parameter.
My model is explained in the following semi-pseudocode (if that's even a word):
import pymc3 as pm
#Some stuff setting up the data, preparing some functions etc.
#theano.compile.ops.as_op(itypes=[input types],otypes = [output types])
def make_model(inputs):
#Wrapper to external code to generate simulated data
return simulated data
model = pm.model()
with model:
#priors for 13 input parameters
simData = make_model(inputs)
I now want to obtain the chi-squared logLikelihood for this model versus the data, which I think can be done using pm.ChiSquared, however I do not see how to combine the data, model and this distribution together to cause the sampler to perform correctly. I would guess it might look something like:
chiSq = pm.ChiSquared(nu=data.size, observed = (data-simData)**2/err**2)
trace = pm.sample(1000)
Is this correct? In running previous tests, I have found the samples appear to be simply drawn from the priors.
Thanks in advance.
Taking aloctavodia's advice, I was able to get parameter estimates for some toy exponential data using a pm.Normal likelihood. Using a pm.ChiSquared likelihood as the OP suggested, the model converged to correct values, but the posteriors on the parameters were roughly three times as broad. Here's the code for the model; I first generated data and then fit with PyMC3.
# Draw `nPoints` observed data points `y_obs` from the function
# 3. + 18. * numpy.exp(-.2 * x)
# with the points evaluated at `x_obs`
# x_obs = numpy.linspace(0, 100, nPoints)
# Add Normal(mu=0,sd=`cov`) noise to each point in `y_obs`
# Then instantiate PyMC3 model for fit:
def YModel(x, c, a, l):
# exponential model expected to describe the data
mu = c + a * pm.math.exp(-l * x)
return mu
def logp(y_mod, y_obs):
# Normal distribution likelihood
return pm.Normal.dist(mu = y_mod, sd = cov).logp(y_obs)
# Chi squared likelihood (to use, comment preceding line & uncomment next 2 lines)
#chi2 = chi2 = pm.math.sum( ((y_mod - y_obs)/cov)**2 )
#return pm.ChiSquared.dist(nu = nPoints).logp(chi2)
with pm.Model() as model:
c = pm.Uniform('constant', lower = 0., upper = 10., testval = 5.)
a = pm.Uniform('amplitude', lower = 0., upper = 50., testval = 25.)
l = pm.Uniform('lambda', lower = 0., upper = 10., testval = 5.)
y_mod = YModel(x_obs, c, a, l)
L = pm.DensityDist('L', logp, observed = {'y_mod': y_mod, 'y_obs': y_obs}, testval = {'y_mod': y_mod, 'y_obs': y_obs})
step = pm.Metropolis([c, a, l])
trace = pm.sample(draws = 10000, step = step)
The above model converged, but I found that success was sensitive to the bounds on the priors and the initial guesses on those parameters.
mean sd mc_error hpd_2.5 hpd_97.5 n_eff Rhat
c 3.184397 0.111933 0.002563 2.958383 3.397741 1834.0 1.000260
a 18.276887 0.747706 0.019857 16.882025 19.762849 1343.0 1.000411
l 0.200201 0.013486 0.000361 0.174800 0.226480 1282.0 0.999991
(Edited: I had forgotten to sum the squares of the normalized residuals for chi2)

what does the 'find_MAP' output mean in pymc3?

What are the returned values of find_MAP in pymc3 ?
It seems that pymc3.Normal and pymc3.Uniform variables are not considered the same: for pymc3.Normal variables, find_MAP returns a value that looks like the maximum a posteriori probability. But for pymc3.Uniform variables, I get a '_interval' suffix added to the name of the variable and I don't find anywhere in the doc the meaning of the returned value (that may seem absurd, not even within the physical limits).
For example:
import numpy as np
import pymc3 as pm3
# create basic data such as obs = (x*0.95)**2+1.1+noise
x=np.arange(10)+1
obs=(x*0.95)**2+np.random.randn(10)+1.1
# fitting the model y=a(1*x)**2+a0 on data points
with pm3.Model() as model:
a0 = pm3.Uniform("a0",0,5)
a1 = pm3.Normal("a1",mu=1,sd=1)
a2 = pm3.Deterministic('a2',(x*a1)**2+a0)
hypothesis = pm3.Normal('hypothesis', mu=a2, sd=0.1, observed=obs)
start = pm3.find_MAP()
print('start: ',start)
returns:
Optimization terminated successfully.
Current function value: 570.382509
Iterations: 13
Function evaluations: 17
Gradient evaluations: 17
start: {'a1': array(0.9461006484031161), 'a0_interval_': array(-1.0812715249577414)}
By default, pymc3 transforms some variables with bounded support to the set of real numbers. The enables a variety of operations which would otherwise choke when given bounded distributions (e.g. some methods of optimizations and sampling). When this automatic transformation is applied, the random variable that you added to the model becomes a child of the transformed variable. This transformed variable is added to the model with [var]_[transform]_ as the name.
The default transformation for a uniform random variable is called the "interval transformation" and the new name of this variable is [name]_interval_. The MAP estimate is the found by optimizing all of the parameters to maximize the posterior probability. We only need to optimize the transformed variable, since this fully determines the value of the variable you originally added to the model. pm.find_MAP() returns only the variables being optimized, not the original variables. Notice that a2 is also not returned, since it is fully determined by a0 and a1.
The code that pymc3 uses for the interval transformation [^1] is
def forward(self, x):
a, b = self.a, self.b
r = T.log((x - a) / (b - x))
return r
Where a is the lower bound, b is the upper bound, and x is the variable being transformed. Using this map, values which are very close to the lower bound have transformed values approaching negative infinity, and values very close to the upper bound approach positive infinity.
Knowing the bounds, we can convert back from the real line to the bounded interval. The code that pymc3 uses for this is
def backward(self, x):
a, b = self.a, self.b
r = (b - a) * T.exp(x) / (1 + T.exp(x)) + a
return r
If you apply this backwards transformation yourself, you can recover a0 itself:
(5 - 0) * exp(-1.0812715249577414) / (1 + exp(-1.0812715249577414)) + 0 = 1.26632733897
Other transformations automatically applied include the log transform (for variables bounded on one side), and the stick-breaking transform (for variables which sum to 1).
[^1] As of commit 87cdd712c86321121c2ed3150764f3d847f5083c.

Tensorflow RNN training won't execute?

I am currently trying to train this RNN network, but seem to be running into weird errors, which I am not able to decode.
The input to my rnn network is digital sampled audio files. As the audio file can be of different length, will the vector of the sampled audio also have different lengths.
The output or the target of the neural network is to recreate a 14 dimensional vector, containing certain information of the audio files. I've already know the target, by manually calculating it, but need to make it work with a neural network.
I am currently using tensorflow as framework.
My network setup looks like this:
def last_relevant(output):
max_length = int(output.get_shape()[1])
relevant = tf.reduce_sum(tf.mul(output, tf.expand_dims(tf.one_hot(length, max_length), -1)), 1)
return relevant
def length(sequence): ##Zero padding to fit the max lenght... Question whether that is a good idea.
used = tf.sign(tf.reduce_max(tf.abs(sequence), reduction_indices=2))
length = tf.reduce_sum(used, reduction_indices=1)
length = tf.cast(length, tf.int32)
return length
def cost(output, target):
# Compute cross entropy for each frame.
cross_entropy = target * tf.log(output)
cross_entropy = -tf.reduce_sum(cross_entropy, reduction_indices=2)
mask = tf.sign(tf.reduce_max(tf.abs(target), reduction_indices=2))
cross_entropy *= mask
# Average over actual sequence lengths.
cross_entropy = tf.reduce_sum(cross_entropy, reduction_indices=1)
cross_entropy /= tf.reduce_sum(mask, reduction_indices=1)
return tf.reduce_mean(cross_entropy)
#----------------------------------------------------------------------#
#----------------------------Main--------------------------------------#
### Tensorflow neural network setup
batch_size = None
sequence_length_max = max_length
input_dimension=1
data = tf.placeholder(tf.float32,[batch_size,sequence_length_max,input_dimension])
target = tf.placeholder(tf.float32,[None,14])
num_hidden = 24 ## Hidden layer
cell = tf.nn.rnn_cell.LSTMCell(num_hidden,state_is_tuple=True) ## Long short term memory
output, state = tf.nn.dynamic_rnn(cell, data, dtype=tf.float32,sequence_length = length(data)) ## Creates the Rnn skeleton
last = last_relevant(output)#tf.gather(val, int(val.get_shape()[0]) - 1) ## Appedning as last
weight = tf.Variable(tf.truncated_normal([num_hidden, int(target.get_shape()[1])]))
bias = tf.Variable(tf.constant(0.1, shape=[target.get_shape()[1]]))
prediction = tf.nn.softmax(tf.matmul(last, weight) + bias)
cross_entropy = cost(output,target)# How far am I from correct value?
optimizer = tf.train.AdamOptimizer() ## TensorflowOptimizer
minimize = optimizer.minimize(cross_entropy)
mistakes = tf.not_equal(tf.argmax(target, 1), tf.argmax(prediction, 1))
error = tf.reduce_mean(tf.cast(mistakes, tf.float32))
## Training ##
init_op = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init_op)
batch_size = 1000
no_of_batches = int(len(train_data)/batch_size)
epoch = 5000
for i in range(epoch):
ptr = 0
for j in range(no_of_batches):
inp, out = train_data[ptr:ptr+batch_size], train_output[ptr:ptr+batch_size]
ptr+=batch_size
sess.run(minimize,{data: inp, target: out})
print "Epoch - ",str(i)
incorrect = sess.run(error,{data: test_data, target: test_output})
print('Epoch {:2d} error {:3.1f}%'.format(i + 1, 100 * incorrect))
sess.close()
The error seem to be the usage of the function last_relevant, which should take the output, and feed it back.
This is the error message:
TypeError: Expected binary or unicode string, got <function length at 0x7f846594dde8>
Anyway to tell what could be wrong here?
I tried to build your code in my local.
There is a fundamental mistake in the code which is that you call tf.one_hot but what you pass don't really fit with what is expected:
Read documentation here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/api_docs/python/functions_and_classes/shard6/tf.one_hot.md
tf.one_hot(indices, depth, on_value=None, off_value=None, axis=None, dtype=None, name=None)
However, you are passing a function pointer ("length" is a function in your code, I recommend naming your function in a meaningful manner by refraining yourself from using common keywords) instead of the first parameter.
For a wild guide, you can put your indices as first param (instead of my placeholder empty list) and it will be fixed
relevant = tf.reduce_sum(
tf.mul(output, tf.expand_dims(tf.one_hot([], max_length), -1)), 1)

weka EuclideanDistance

I have following code to calculate the EuclideanDistance distance using weka.core.EuclideanDistance, where both two instances are all missing values, like below
Instance first are all missing values: ?,?,?,?
instance second are all missing values:?,?,?,?
EuclideanDistance distance = new EuclideanDistance();
distance.setInstances(test);
Instance first = test.get(0);
Instance second = test.get(1);
double d = distance.distance(first, second);
however, when i run the code, i got the result is 4.0, i have no idea where is this 4.0 from,can anyone tell me? Thanks in advance!
Missing values in k-Nearest Neighbours algorithm usually are handled according to the following criteria:
For nominal attributes:
if isMissingValue(a) or isMissingValue(b), then
distance = 1
For numeric attributes:
if isMissingValue(a) and isMissingValue(b), then
distance = 1
if isMissingValue(a) and !isMissingValue(b), then
distance = max(b, 1-b)
if !isMissingValue(a) and isMissingValue(b), then
distance = max(a, 1-a)
You can check the implementation in the source (link provided by Walter).