Weka displaying weird results for classification - question marks "?"

I am trying to use the ZeroR algorithm in Weka to establish a baseline for my classification problem. However, Weka is displaying weird results for precision and F-measure: it shows a question mark '?' instead of a number. Does anyone know how I can fix this?
=== Classifier model (full training set) ===
ZeroR predicts class value: label 1
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 431 53.607 %
Incorrectly Classified Instances 373 46.393 %
Kappa statistic 0
Mean absolute error 0.4974
Root mean squared error 0.4987
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 804
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.000 0.000 ? 0.000 ? ? 0.488 0.457 label 0
1.000 1.000 0.536 1.000 0.698 ? 0.488 0.530 label 1
Weighted Avg. 0.536 0.536 ? 0.536 ? ? 0.488 0.496
=== Confusion Matrix ===
a b <-- classified as
0 373 | a = label 0
0 431 | b = label 1

It's not wrong. Note that you have no cases classified as "a", so the precision (etc.) is indeterminable for "a". Evidently Weka propagates incalculable values (as Excel does), so the overall precision isn't calculated either.
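Concretely, working the numbers with the standard definitions (nothing Weka-specific here):
precision(label 0) = TP / (TP + FP) = 0 / (0 + 0)
which is undefined, and the F-measure = 2 * precision * recall / (precision + recall) is undefined in turn, because it needs the precision. Since 0/0 has no defined value, Weka prints '?' and propagates it into the weighted average.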
Your real problem here is that you have a model that classifies everything as "b", which is unlikely to be useful. But that's ZeroR, so it's just your starting point.

Related

How to refer to an estimate value from the regression output in an rmarkdown document using backticks?

I know how to refer to a numeric value from an R analysis in my rmarkdown document, for example by putting r round(x.peaks[,2],3)[1] between backticks, without having to update it whenever the value changes. But I was wondering whether there's a way to do the same for an estimated coefficient from regression output. For example, I want to put the intercept -0.32958 (please see the table) in my rmarkdown document using backticks, without having to retype it every time a different dataset generates different output.
lm(formula = log(p_share) ~ xxx, data = df2)
Residuals:
Min 1Q Median 3Q Max
-3.04165 -0.37272 -0.00279 0.48895 1.16493
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.32958 0.05525 -5.965 1.62e-08 ***
xxx 0.03313 0.11835 0.280 0.78
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6103 on 154 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.0005087, Adjusted R-squared: -0.005981
F-statistic: 0.07838 on 1 and 154 DF, p-value: 0.7799
You should use the broom package.
An example:
---
title: "Untitled"
header-includes:
  - \usepackage[utf8]{inputenc}
output:
  pdf_document:
    latex_engine: xelatex
---
```{r}
library(broom)
# our data
df <- data.frame(x = rnorm(10), y = rnorm(10))
# our model
model <- lm(y ~ x, data = df)
t_tab <- tidy(model)
t_tab
```
Your coef is `r round(t_tab$estimate[1],3)`
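As a follow-up sketch (mine, not part of the original answer): tidy() returns the columns term, estimate, std.error, statistic and p.value, so you can also look a coefficient up by its term name rather than its row position, which keeps working if the model formula changes:
```{r}
# match on the term name instead of hard-coding row 1
intercept <- t_tab$estimate[t_tab$term == "(Intercept)"]
```
and then write `r round(intercept, 3)` inline.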

How do you rank based on a measure referencing two tables in a star schema?

I have a star schema data model: DimDate, DimBranchName, BranchActual, BranchBudget.
I have measures to calculate the YTD variance to budget by branch, called QVar. QVar takes the counts from BranchActual and compares them to BranchBudget between two dates. The visual is controlled by DimBranchName and DimDate.
Current Result:
BranchName YTDActual YTDBudget QVar
A 100 150 (33%)
B 200 200 0.0%
C 25 15 66%
I want a measure to be able to rank DimBranchName[BranchName] by QVar that will interact with the filters I have in place.
Desired result:
BranchName YTDActual YTDBudget QVar Rank
A 100 150 (33%) 3
B 200 200 0.0% 2
C 25 15 66% 1
What I've tried so far is
R Rank of Actual v Goal =
VAR V = [QVar]
RETURN
    RANKX ( ALLSELECTED ( 'BranchActual' ), CALCULATE ( V ),, ASC, DENSE )
What I get is all 1's
BranchName YTDActual YTDBudget QVar Rank
A 100 150 (33%) 1
B 200 200 0.0% 1
C 25 15 66% 1
Thanks!
When you declare a variable, it is computed once and then treated as a constant through the rest of your DAX code, so CALCULATE(V) is simply whatever V was at the point the variable was declared.
This might be closer to what you want:
R Rank of Actual v Goal =
RANKX ( ALLSELECTED ( DimBranchName[BranchName] ), [QVar],, ASC, DENSE )
This way [QVar] is evaluated within the filter context of each BranchName instead of being a constant. (Note that referencing a measure within another measure implicitly wraps it in CALCULATE, so you don't need to add it again.)

WEKA: print the indexes of test data instances w.r.t. the original data during cross-validation

I have a query about the indexes of the test data instances chosen by Weka during cross-validation. How can I print the indexes of the test instances that are being evaluated?
==================================
I have chosen:
Dataset : iris.arff
Total instances : 150
Classifier : J48
cross validation: 10 fold
I have also set the output predictions format to "PlainText".
=============
In the output window I can see this:
inst# actual predicted error prediction
1 3:Iris-virginica 3:Iris-virginica 0.976
2 3:Iris-virginica 3:Iris-virginica 0.976
3 3:Iris-virginica 3:Iris-virginica 0.976
4 3:Iris-virginica 3:Iris-virginica 0.976
5 3:Iris-virginica 3:Iris-virginica 0.976
6 1:Iris-setosa 1:Iris-setosa 1
7 1:Iris-setosa 1:Iris-setosa 1
....
...
...
There are 10 test sets in total (15 instances in each).
======================
As Weka uses stratified cross-validation, the instances in the test sets are chosen at random.
So, how can I print the indexes of the test data with respect to the data in the original file?
i.e
inst# actual predicted error prediction
1 3:Iris-virginica 3:Iris-virginica 0.976
Which instance in the main data (among the 50 Iris-virginica in total) is this result for?
===============
After a lot of searching, I found that the YouTube video below addresses this problem.
I hope it helps future visitors with the same query.
Weka Tutorial 34: Generating Stratified Folds (Data Preprocessing)
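For reference, a minimal sketch of the same idea in Weka's Java API (class names as in Weka 3.x; the iris.arff path is an assumption). It first tags every instance with its original row number using the AddID filter, then extracts one stratified fold with StratifiedRemoveFolds, the same filter the video demonstrates:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.StratifiedRemoveFolds;
import weka.filters.unsupervised.attribute.AddID;

public class PrintFoldIndexes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Tag each instance with its row number in the original file.
        AddID addId = new AddID();
        addId.setInputFormat(data);
        Instances tagged = Filter.useFilter(data, addId);
        tagged.setClassIndex(tagged.numAttributes() - 1);

        // Extract one stratified fold (here fold 1 of 10) as a test set.
        StratifiedRemoveFolds folds = new StratifiedRemoveFolds();
        folds.setNumFolds(10);
        folds.setFold(1);
        folds.setInvertSelection(false); // true would keep the training split instead
        folds.setInputFormat(tagged);
        Instances testFold = Filter.useFilter(tagged, folds);

        // The ID attribute (first column) maps back to the original data.
        for (int i = 0; i < testFold.numInstances(); i++) {
            System.out.println((int) testFold.instance(i).value(0));
        }
    }
}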

Gurobi: used model.write but cannot find the file

I am using Gurobi with C++ and want to save the LP to a .lp file. Therefore, I used
model.write("model.lp");
model.optimize();
This is my output, and no error occurs:
Optimize a model with 105 rows, 58 columns and 186 nonzeros
Coefficient statistics:
Matrix range [1e+00, 1e+00]
Objective range [1e+00, 1e+00]
Bounds range [0e+00, 0e+00]
RHS range [1e+00, 6e+00]
Presolve removed 105 rows and 58 columns
Presolve time: 0.00s
Presolve: All rows and columns removed
Iteration Objective Primal Inf. Dual Inf. Time
0 -0.0000000e+00 0.000000e+00 0.000000e+00 0s
Solved in 0 iterations and 0.00 seconds
Optimal objective -0.000000000e+00
obj: -0
Status2
Process finished with exit code 0
So there is probably a mistake in my LP, since the optimal objective should not be 0. That is why I want to look at the model.lp file. However, I cannot find it; I have searched my whole computer. Am I missing anything?
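One thing worth checking (a diagnostic sketch, not an answer from this thread): model.write("model.lp") resolves a relative path against the process's current working directory, and when a program is launched from an IDE that directory is often a build folder rather than the project folder. Printing the working directory shows where the file actually landed:

#include <filesystem>
#include <iostream>

int main() {
    // A relative path like "model.lp" is resolved against this directory.
    std::cout << std::filesystem::current_path() << std::endl;
    return 0;
}

Alternatively, pass an absolute path to model.write so the output location is unambiguous.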

How to create highest & lowest quartiles of a variable in Stata

This is the Stata code I used to divide a Winsorised & centred variable (WC_num_exp, denoting the number of experienced managers) into quartiles and then generate the highest- and lowest-quartile dummies:
egen quartile_num_exp = xtile(WC_num_exp), n(4)
gen high_quartile_numexp = 1 if quartile_num_exp==4
(1433 missing values generated)
gen low_quartile_num_exp = 1 if quartile_num_exp==1
(1062 missing values generated)
Thanks everybody - here's the link
https://dl.dropboxusercontent.com/u/64545449/No%20of%20expeienced%20managers.dta
I did try both Aspen Chen's and Roberto's suggestions. Chen's way of creating the high-quartile dummy gives the same results as I had earlier, and with Roberto's, both quartile dummies show 1 for the same rows - how is that possible?
I forgot to mention that there are indeed many ties: the original variable W_num_exp ranges from 0 to 7 with a mean of 2.126618, and I subtracted that mean from each observation of W_num_exp to get WC_num_exp.
tab high_quartile_numexp shows the same problem I originally had:
le_numexp | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,433 80.64 80.64
1 | 344 19.36 100.00
------------+-----------------------------------
Total | 1,777 100.00
Also, I checked that egenmore is already installed in my Stata (version 13.1).
What I fail to understand is why 75% of observations don't lie below the highest-quartile dummy (I have 1,777 observations in total). To my understanding, this dummy should mark the cut-off above which exactly 25% of the observations lie, yet it contains only 19.36% of them.
Am I doing anything wrong in the Stata code for the high-quartile and low-quartile dummy variables?
Consider the following code:
clear
set more off
sysuse auto
keep make mpg
*-----
// your way (kind of)
egen mpg4 = xtile(mpg), nq(4)
gen lowq = mpg4 == 1
gen highq = mpg4 == 4
*-----
// what you want
summarize mpg, detail
gen lowq2 = mpg < r(p25)
gen highq2 = mpg > r(p75)
*-----
summarize high* low*
list
Now check the listing to see what's going on.
See help stored results.
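If the r(p25) and r(p75) references are unfamiliar: summarize with the detail option stores the percentiles as returned results, which you can inspect directly (a quick sketch, not part of the original answer):
sysuse auto, clear
summarize mpg, detail
return list          // lists r(p25), r(p50), r(p75), among others
display r(p25) " " r(p75)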
The dataset provided answers the question. Consider the tabulation:
. tab W_num_exp
num_execs_i |
ntl_exp, |
Winsorized |
fraction |
.01 | Freq. Percent Cum.
------------+-----------------------------------
0 | 297 16.71 16.71
1 | 418 23.52 40.24
2 | 436 24.54 64.77
3 | 282 15.87 80.64
4 | 171 9.62 90.26
5 | 109 6.13 96.40
6 | 34 1.91 98.31
7 | 30 1.69 100.00
------------+-----------------------------------
Total | 1,777 100.00
Exactly equal numbers in each of 4 quartile-based bins are possible if, and only if, there are values whose cumulative percents hit 25, 50 and 75. No such values exist here: the first bin, for example, must break either after 0 (cumulative 16.71%) or after 1 (cumulative 40.24%), neither of which is close to 25%. You have to make do with approximations. The approximations can be lousy, but the only alternative, arbitrarily assigning observations with the same value to different bins to even up the frequencies, is statistically indefensible.
(For exactly equal frequencies the number of observations would also need to be a multiple of 4 for 4 bins, and so on; that complication bites hard for small datasets, but it is not the major issue here.)