WEKA: Print the indexes of test data instances w.r.t. the original data during cross-validation

I have a question about the indexes of the test data instances that Weka chooses during cross-validation. How can I print the indexes of the test instances that are being evaluated?
==================================
I have chosen:
Dataset : iris.arff
Total instances : 150
Classifier : J48
cross validation: 10 fold
I have also set the output predictions format to "PlainText".
=============
In the output window I see this:
inst# actual predicted error prediction
1 3:Iris-virginica 3:Iris-virginica 0.976
2 3:Iris-virginica 3:Iris-virginica 0.976
3 3:Iris-virginica 3:Iris-virginica 0.976
4 3:Iris-virginica 3:Iris-virginica 0.976
5 3:Iris-virginica 3:Iris-virginica 0.976
6 1:Iris-setosa 1:Iris-setosa 1
7 1:Iris-setosa 1:Iris-setosa 1
....
...
...
In total there are 10 test sets (15 instances in each).
======================
As WEKA uses stratified cross-validation, the instances in each test set are chosen randomly.
So, how can I print the indexes of the test instances w.r.t. the data in the original file?
i.e.
inst# actual predicted error prediction
1 3:Iris-virginica 3:Iris-virginica 0.976
Which instance in the original data (among the 50 Iris-virginica instances) does this result belong to?
===============

After a lot of searching, I found that the YouTube video below helps with this problem.
I hope it will be useful to future visitors with the same question.
Weka Tutorial 34: Generating Stratified Folds (Data Preprocessing)
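The idea behind generating the stratified folds yourself is that every test instance keeps a pointer back to its row in the original file. Below is a minimal Python sketch of that idea. It is an illustration only, not Weka's implementation: the function name stratified_folds, the seed and the round-robin dealing are assumptions of this sketch, so the folds will not match the ones Weka produces.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=1):
    """Split row indices into k folds with roughly equal class proportions.

    Returns a list of k lists; each inner list holds 0-based indices
    into the original data, so the link to the source file is never lost.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # random choice within each class
        for idx in members:
            folds[position % k].append(idx)  # deal indices out round-robin
            position += 1
    return folds


# Stand-in for iris.arff: 50 instances per class, in file order.
labels = (["Iris-setosa"] * 50
          + ["Iris-versicolor"] * 50
          + ["Iris-virginica"] * 50)
folds = stratified_folds(labels, k=10)

# Show the first test fold: position within the fold vs. row in the file.
for inst_no, original_idx in enumerate(folds[0], start=1):
    print(inst_no, "original row", original_idx + 1, labels[original_idx])
```

Each fold here plays the role of one test set of 15 instances, and the second number printed is the 1-based position of that instance in the original data.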

Related

Weka Question: Which cluster do these Iris attributes belong to?

I'm totally new to Weka and data science. I have an assignment to determine which cluster each of the following Iris attributes (SW, SL, PW, PL) belongs to. Can you assist me? Thanks!
The iris dataset that comes with Weka has three classes (Iris-setosa, Iris-versicolor, Iris-virginica).
If you want to see how well clusters determined by your cluster algorithm align with the class labels, you need to select Classes to clusters evaluation in the Weka Explorer or via the -c <class_att_index> option on the command-line.
The following command runs SimpleKMeans with three clusters on the iris dataset that comes with Weka (-c last uses the last attribute as the class and performs classes to clusters evaluation):
java -cp weka.jar weka.clusterers.SimpleKMeans -N 3 -c last -t data/iris.arff
This results in the following output:
=== Clustering stats for training data ===
kMeans
======
Number of iterations: 6
Within cluster sum of squared errors: 6.998114004826762
Initial starting points (random):
Cluster 0: 6.1,2.9,4.7,1.4
Cluster 1: 6.2,2.9,4.3,1.3
Cluster 2: 6.9,3.1,5.1,2.3
Missing values globally replaced with mean/mode
Final cluster centroids:
                                 Cluster#
Attribute      Full Data          0          1          2
                 (150.0)     (61.0)     (50.0)     (39.0)
=========================================================
sepallength       5.8433     5.8885      5.006     6.8462
sepalwidth         3.054     2.7377      3.418     3.0821
petallength       3.7587     4.3967      1.464     5.7026
petalwidth        1.1987      1.418      0.244     2.0795
Clustered Instances
0 61 ( 41%)
1 50 ( 33%)
2 39 ( 26%)
Class attribute: class
Classes to Clusters:
  0  1  2  <-- assigned to cluster
  0 50  0 | Iris-setosa
 47  0  3 | Iris-versicolor
 14  0 36 | Iris-virginica
Cluster 0 <-- Iris-versicolor
Cluster 1 <-- Iris-setosa
Cluster 2 <-- Iris-virginica
Incorrectly clustered instances : 17.0 11.3333 %
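To see where the cluster-to-class mapping and the 17 incorrectly clustered instances come from, the "Classes to Clusters" matrix can be reworked by hand. The Python sketch below is only an illustration of that arithmetic. It assumes the number of clusters equals the number of classes and simply tries every one-to-one assignment; Weka's own search may be implemented differently, and the function name best_assignment is made up for this example.

```python
from itertools import permutations

classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rows are classes, columns are clusters 0, 1, 2
# (the "Classes to Clusters" counts from the output above).
counts = [
    [0, 50, 0],
    [47, 0, 3],
    [14, 0, 36],
]


def best_assignment(counts):
    """Try every one-to-one cluster -> class mapping and keep the one
    that labels the most instances correctly (brute force; assumes as
    many clusters as classes)."""
    n = len(counts)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(counts[perm[c]][c] for c in range(n)),
    )
    correct = sum(counts[best[c]][c] for c in range(n))
    return best, correct


assignment, correct = best_assignment(counts)
total = sum(sum(row) for row in counts)
incorrect = total - correct

for cluster, cls in enumerate(assignment):
    print(f"Cluster {cluster} <-- {classes[cls]}")
print(f"Incorrectly clustered instances: {incorrect} "
      f"({100 * incorrect / total:.4f} %)")
```

With the matrix above, 47 + 50 + 36 = 133 instances agree with the class assigned to their cluster, which leaves 17 of 150 (11.3333 %) incorrectly clustered.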

Plotting categorical variables using a bar diagram/bar chart

I am trying to plot a bar graph for both the Sept and Oct waves. As you can see in the image, the ids are the individuals who are surveyed over time. On a single graph I need to plot Sept in-house, Oct in-house, Sept out-house and Oct out-house, showing only the proportion of people who said yes in each. The other categories do not have to be taken into account.
I also have to show whiskers for the 95% confidence intervals of each of these categories.
* Example generated by -dataex-. For more info, type help dataex
clear
input float(id sept_outhouse sept_inhouse oct_outhouse oct_inhouse)
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 3 3 3
5 4 4 3 3
6 4 4 3 3
7 4 4 4 1
8 1 1 1 1
9 1 1 1 1
10 1 1 1 1
end
label values sept_outhouse codes
label values sept_inhouse codes
label values oct_outhouse codes
label values oct_inhouse codes
label def codes 1 "yes", modify
label def codes 2 "no", modify
label def codes 3 "don't know", modify
label def codes 4 "refused", modify
save tokenexample, replace
rename (*house) (house*)
reshape long house, i(id) j(which) string
replace which = subinstr(proper(which), "_", " ", .)
gen yes = house == 1
label def WHICH 1 "Sept Out" 2 "Sept In" 3 "Oct Out" 4 "Oct In"
encode which, gen(WHICH) label(WHICH)
statsby, by(WHICH) clear: ci proportion yes, jeffreys
set scheme s1color
twoway scatter mean WHICH ///
|| rspike ub lb WHICH, xla(1/4, noticks valuelabel) xsc(r(0.9 4.1)) ///
xtitle("") legend(off) subtitle(Proportion Yes with 95% confidence interval)
This has to be solved backwards.
The means and confidence intervals have to be plotted using twoway, because graph bar is a dead end here: it does not allow whiskers.
The confidence limits have to be put in variables before graphing. Some graph commands, notably graph bar, will calculate means for you, but as said that is a dead end, so we need to calculate the means too.
To do that you need an indicator variable for Yes.
The best way I know to get the results is then to reshape to a different structure and apply ci proportion under statsby.
As a detail, the option jeffreys is made explicit as a signal that there are different methods for calculating the confidence interval. You should choose one knowingly.
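To make the "indicator variable, then proportion, then interval" steps concrete outside Stata, here is a small Python sketch on the same ten example rows. Note that it swaps in the Wilson score interval rather than the Jeffreys interval used above, only because Wilson needs nothing beyond the standard library; the two methods give different limits, which is exactly why the method should be chosen knowingly. The function name wilson_interval is an assumption of this sketch.

```python
from statistics import NormalDist

# id, sept_outhouse, sept_inhouse, oct_outhouse, oct_inhouse
# (the example data; 1 = yes, 2 = no, 3 = don't know, 4 = refused)
rows = [
    (1, 1, 1, 1, 1), (2, 2, 2, 2, 2), (3, 3, 3, 3, 3), (4, 4, 3, 3, 3),
    (5, 4, 4, 3, 3), (6, 4, 4, 3, 3), (7, 4, 4, 4, 1), (8, 1, 1, 1, 1),
    (9, 1, 1, 1, 1), (10, 1, 1, 1, 1),
]
names = ["Sept Out", "Sept In", "Oct Out", "Oct In"]


def wilson_interval(successes, n, level=0.95):
    """Proportion with a Wilson score interval (NOT the Jeffreys interval)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * (p * (1 - p) / n + z * z / (4 * n * n)) ** 0.5 / denom
    return p, centre - half, centre + half


results = {}
for col, name in enumerate(names, start=1):
    yes = sum(1 for row in rows if row[col] == 1)  # indicator for "yes"
    results[name] = wilson_interval(yes, len(rows))
    mean, lb, ub = results[name]
    print(f"{name}: proportion {mean:.2f}, 95% CI [{lb:.3f}, {ub:.3f}]")
```

The three numbers per wave correspond to the point and the two ends of the spike in the twoway graph.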

PowerBI running Total formula

I have a dataset OvertimeHours with EMPLID, checkdate and NumberOfHours (and other fields). I need a running total of NumberOfHours for each employee by checkdate. I tried the Quick Measure option, but it only allows a single column and I have two. I do not want the measure to recalculate when filters are applied. Ultimately, I am trying to identify the records for the first 6 hours of overtime worked on each check so that they get a category of OCB, while all overtime beyond the first 6 hours is OTP. It does not have to be exact (as demonstrated in the output below). I have only been working with Power BI for about a month, and this is a pretty complex (for me) formula to figure out...
EMPLID  CheckDate  WkDate    NumberOfHours  RunningTotal  Category
124     1/1/19     12/20/18  5              5             OCB
124     1/1/19     12/21/18  9              14            OTP
125     1/1/19     12/20/18  3              3             OCB
125     1/1/19     12/20/18  2              5             OCB
125     1/1/19     12/22/18  2              7             OTP
124     1/15/19    1/8/19    3              3             OCB
*Edited to add the WkDate.
Edit:
I have tweaked my query so that I have the running total and a sequential counter now:
Using the first 12 records, I am looking to get the following results:
I can either do it in a query if that is the easiest way or if there is a way to use DAX in PowerBI with this dataset now that I have the sequential piece, I can do that too.
I got it in the query:
select r.CheckDate,
       r.EMPLID,
       -- first 6 overtime hours on the check count as OCB
       case
           when PayrollRunningOTHours <= 6
           then PayrollRunningOTHours
           else 6
       end as OCBHours,
       -- hours beyond the first 6 are OTP (NULL when the total is 6 or less)
       case
           when PayRollRunningOTHours > 6
           then PayRollRunningOTHours - 6
       end as OTPHours
from #rollingtotal r
inner join lastone l
    on r.CheckDate = l.CheckDate
   and r.EMPLID = l.EMPLID
   and r.OTCounter = l.lastRec
order by r.emplid,
         r.CheckDate,
         r.OTCounter
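For anyone who wants to check the logic outside SQL or DAX, here is a rough Python sketch of the same rule applied to the six sample rows from the question. It assumes the rows are already in work-date order within each check, and the names THRESHOLD, labelled and split are made up for this example. Unlike the query, it reports 0 rather than NULL when a check has no OTP hours.

```python
from collections import defaultdict

THRESHOLD = 6  # first 6 overtime hours on each check are OCB

# (EMPLID, CheckDate, WkDate, NumberOfHours), the sample rows above
records = [
    (124, "1/1/19", "12/20/18", 5),
    (124, "1/1/19", "12/21/18", 9),
    (125, "1/1/19", "12/20/18", 3),
    (125, "1/1/19", "12/20/18", 2),
    (125, "1/1/19", "12/22/18", 2),
    (124, "1/15/19", "1/8/19", 3),
]

# Running total per (employee, check), plus the approximate row category.
running = defaultdict(int)
labelled = []
for emplid, check_date, wk_date, hours in records:
    key = (emplid, check_date)
    running[key] += hours
    category = "OCB" if running[key] <= THRESHOLD else "OTP"
    labelled.append((emplid, check_date, wk_date, hours, running[key], category))

# Per-check split, mirroring the OCBHours / OTPHours columns of the query.
split = {
    key: (min(total, THRESHOLD), max(total - THRESHOLD, 0))
    for key, total in running.items()
}

for row in labelled:
    print(*row)
for (emplid, check_date), (ocb, otp) in split.items():
    print(emplid, check_date, "OCB", ocb, "OTP", otp)
```

The row-level category reproduces the approximate labelling shown in the question, while the per-check split gives the exact hours in each bucket.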

Two sample T-Test in SAS

I am trying to run a campaign analysis. I have two campaigns A and B and I have their sample size and response rates.
Data structure:
Campaign_Name Response_Flag
A 0
A 1
A 1
B 1
B 0
I have summarized the data to get a response rate and sample size for each campaign:
Campaign_Name  Sample_Size  Response_Rate
A              6500         0.7%
B              3600         1.2%
I want to see whether the two campaigns are statistically similar or different.
Please help!
Thanks
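Not SAS, but as a rough cross-check of the numbers in the summary table: because the response is a 0/1 flag, a common alternative to a t-test is a two-proportion z-test with a pooled standard error. The Python sketch below is only an illustration under that assumption. The function name is made up, and because the response rates in the table are rounded, the implied response counts (and therefore the result) are approximate.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(p1, n1, p2, n2):
    """Two-sided z-test for equality of two proportions (pooled SE).

    This is a swapped-in technique, not a t-test: it works directly
    from the summarized rates and sample sizes.
    """
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Campaign A: 0.7% of 6500, campaign B: 1.2% of 3600 (rounded rates).
z, p_value = two_proportion_z_test(0.007, 6500, 0.012, 3600)
print(f"z = {z:.3f}, two-sided p-value = {p_value:.4f}")
```

A small p-value would suggest the two response rates differ; with the raw 0/1 data, the same comparison can be run on the exact counts instead of the rounded rates.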

Creating additional value to inform what line was not predicted correctly

I am trying to predict images from a bunch of pixel data in my arff file.
So far so good, it is working fine, but when I need to check which image was not predicted correctly, I have no way of knowing.
Is there a way to output the line number in the "=== Predictions under cross-validation ===" section?
Or is there a way to put some text on each line (like an attribute) with the image name, and have it printed in the "=== Predictions under cross-validation ===" section?
Now my output is:
=== Predictions under cross-validation ===
inst# actual predicted error prediction ()
1 3:3 3:3 0.964
2 3:3 3:3 0.984
3 3:3 3:3 0.947
4 1:1 1:1 0.981
5 1:1 1:1 0.979
6 1:1 1:1 0.96
7 5:5 5:5 0.986
8 5:5 3:3 + 0.685
I need to have the line or image file name in the output.
I created a new string attribute called ID.
Then I called Weka with:
java -classpath weka.jar weka.classifiers.meta.FilteredClassifier -F weka.filters.unsupervised.attribute.RemoveType -t file -W "weka.classifiers.functions.MultilayerPerceptron"