Weka confusion matrix and accuracy analysis

How do I analyze the confusion matrix in Weka with regards to the accuracy obtained?
We know that accuracy can be misleading when the data set is imbalanced.
How does the confusion matrix "confirm" the accuracy?
Examples:
a) accuracy 96.1728 %
a b c d e f g <-- classified as
124 0 0 0 1 0 0 | a = brickface
0 110 0 0 0 0 0 | b = sky
1 0 119 0 2 0 0 | c = foliage
1 0 0 107 2 0 0 | d = cement
1 0 12 7 105 0 1 | e = window
0 0 0 0 0 94 0 | f = path
0 0 1 0 0 2 120 | g = grass
b) accuracy : 96.8 %
a b c d e f g <-- classified as
202 0 0 0 3 0 0 | a = brickface
0 220 0 0 0 0 0 | b = sky
0 0 198 0 10 0 0 | c = foliage
0 0 1 202 16 1 0 | d = cement
2 0 11 2 189 0 0 | e = window
0 0 0 2 0 234 0 | f = path
0 0 0 0 0 0 207 | g = grass
etc...

The accuracy is computed by summing all the instances on the main diagonal and dividing by the total number of instances (the sum of all entries in the confusion matrix). For instance, in a), you get 124 + 110 + ... + 120 = 779, and the total number of instances (summing everything) is 810, so the accuracy is 779/810 = 0.9617 => 96.17%.
Your datasets are rather balanced (all the classes have approximately the same number of instances). You can see that a dataset is imbalanced when the sum of one row is much bigger than the sums of the other rows, since rows represent the actual classes. For instance:
a b <-- classified as
1000 20 | a = class1
10 10 | b = class2
In this case, class1 has 1020 instances and class2 has only 20, so the problem is highly imbalanced. This will affect classifier performance, as learning algorithms typically try to maximize the accuracy (or minimize the error), so a trivial classifier such as the rule "for any X, set class = class1" will have an accuracy of 1020/1040 ≈ 0.9808.

a b c d e f g <-- classified as
124 0 0 0 1 0 0 | a = brickface
...
It means there are 125 examples of class a (brickface): 124 of them are classified as a (correct) and 1 is classified as e (incorrect).
If you think your data is imbalanced, use the AUC score instead; it is better suited to unbalanced data sets.

Accuracy is the proportion of the total number of correct predictions. It is calculated as
Accuracy = (124+110+119+107+105+94+120)/(124+0+0+0+1+0+0+0+110+0+0+0+0+0+1+0+119+0+2+0+0+1+0+0+107+2+0+0+1+0+12+7+105+0+1+0+0+0+0+0+94+0+0+0+1+0+0+2+120)
Accuracy = 779/810 = 0.961728
Similarly,
Accuracy = (202+220+198+202+189+234+207)/(202+0+0+0+3+0+0+0+220+0+0+0+0+0+0+0+198+0+10+0+0+0+0+1+202+16+1+0+2+0+11+2+189+0+0+0+0+0+2+0+234+0+0+0+0+0+0+0+207)
Accuracy = 1452/1500 = 0.968
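As a quick sanity check, here is the same computation in Python with NumPy. This is a minimal sketch that is not part of the original answers; it assumes the confusion matrix is stored as a 2-D array with rows as actual classes and columns as predicted classes.
import numpy as np

# Confusion matrix a) from the question (rows = actual class, columns = predicted class).
cm = np.array([
    [124,   0,   0,   0,   1,   0,   0],  # a = brickface
    [  0, 110,   0,   0,   0,   0,   0],  # b = sky
    [  1,   0, 119,   0,   2,   0,   0],  # c = foliage
    [  1,   0,   0, 107,   2,   0,   0],  # d = cement
    [  1,   0,  12,   7, 105,   0,   1],  # e = window
    [  0,   0,   0,   0,   0,  94,   0],  # f = path
    [  0,   0,   1,   0,   0,   2, 120],  # g = grass
])

accuracy = np.trace(cm) / cm.sum()  # correct predictions / all predictions
print(accuracy)                     # 0.9617..., i.e. 779/810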

Related

Retrieve multiple ArrayFire subarrays from min/max data points

I have an array with sections of touching values in it. For example:
0 0 1 0 0 0 0 0 0 0
0 1 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 2 2 2 0 0
0 0 0 0 0 0 0 2 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 3 0 0 0 0 0 0 0
0 0 3 0 0 0 0 0 0 0
0 0 3 0 0 0 0 0 0 0
From this, I created a set of af::arrays: minX, maxX, minY, maxY. These define the bounding box that encloses each group.
So for this example:
minX would be: [1,5,2] // 1 for label(1), 5 for label(2) and 2 for label(3)
maxX would be: [3,7,2] // 3 for label(1), 7 for label(2) and 2 for label(3)
minY would be: [0,3,7] // 0 for label(1), 3 for label(2) and 7 for label(3)
maxY would be: [1,4,9] // 1 for label(1), 4 for label(2) and 9 for label(3)
So if you take the i'th element from each of those arrays, you get the upper-left/lower-right bounds of a box that encloses the corresponding label.
I would like to use these values to pull out subarrays from the larger array. My goal is to put the values enclosed in the boxes into a flat list. In GPU memory, I have also calculated how many entries I would need for each box using the max/min X/Y values. So in this example, the result of the flat list should be:
result=[0 1 0 1 1 1 2 2 2 0 0 2 3 3 3]
where the first 6 entries are from the box
______
|0 1 0 |
|1 1 1 |
------
the second 6 entries are from the box
______
|2 2 2 |
|0 0 2 |
------
and the final three entries are from the box
___
| 3 |
| 3 |
| 3 |
---
I cannot figure out how to index into this af::array with min/max values that reside in GPU memory (and I do not want to transfer them to the CPU). I tried to see whether gfor/seq would work for me, but it appears that af::seq cannot use array data, and I could not get anything I tried with af::index to work either.
I am able to change how I represent min/max (I could store indices for the upper-left/lower-right corners), but my main goal is to do this efficiently on the GPU without moving data back and forth between the GPU and the CPU.
How can this be achieved efficiently with ArrayFire?
Thank you for your help
How far did you get? Which language are you using?
I guess you could tile the results into the 3rd dimension to handle each region separately and end up with min/max vectors in GPU memory.
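For what it's worth, here is a CPU-side NumPy sketch of the desired flat-list extraction. It only illustrates the expected output from the question's example; it does not address the GPU-resident indexing constraint the question is actually about.
import numpy as np

# The example grid from the question.
grid = np.array([
    [0,0,1,0,0,0,0,0,0,0],
    [0,1,1,1,0,0,0,0,0,0],
    [0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,2,2,2,0,0],
    [0,0,0,0,0,0,0,2,0,0],
    [0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0,0,0,0,0],
    [0,0,3,0,0,0,0,0,0,0],
    [0,0,3,0,0,0,0,0,0,0],
    [0,0,3,0,0,0,0,0,0,0],
])

minX, maxX = [1, 5, 2], [3, 7, 2]  # column bounds per label
minY, maxY = [0, 3, 7], [1, 4, 9]  # row bounds per label

# Concatenate every bounding box, flattened row by row, into one list.
result = np.concatenate([
    grid[y0:y1 + 1, x0:x1 + 1].ravel()
    for x0, x1, y0, y1 in zip(minX, maxX, minY, maxY)
])
print(result)  # [0 1 0 1 1 1 2 2 2 0 0 2 3 3 3]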

Categorize variable with zero values

Trying to categorize a variable that has 82 values equal to 0, 118 values between 1 and 6, 0 values between 7 and 12, 0 values between 13 and 18, and 0 values between 19 and 24.
Tried the following code:
gen X = .
replace X = 1 if Y >= 1 & Y <= 6
replace X = 2 if Y >= 7 & Y <= 12
replace X = 3 if Y >= 13 & Y <= 18
replace X = 4 if Y >= 19 & Y <= 24
I wish to see X categorized as 0, 1-6, 7-12, 13-18, 19-24, instead of just 0 and 1.
Current Results:
tab X
X Freq. Percent Cum.
0 82 41.00 41.00
1 118 59.00 100.00
Total 200 100.00
* Example generated by -dataex-. To install: ssc install dataex
clear
input int FID byte Y float X
150 0 0
17 0 0
95 1 1
0 0 0
18 0 0
1 0 0
96 0 0
54 0 0
172 3 1
97 0 0
57 1 1
19 0 0
98 1 1
151 0 0
99 1 1
2 3 1
197 1 1
55 2 1
58 1 1
100 0 0
end
Your code does serve your purpose, i.e. variable X is indeed the right set of categories for variable Y, as you intended.
The fact that you only see X in the range 0,1 simply means that the data have no observations with Y falling in other categories. If the data had any Y belonging to other categories, then the correct corresponding values of X would show up.
A direct way to achieve this output is shown below. Just give it a try.
egen YCat = cut(Y), at(0,1,7,13,19,25)
Your code looks fine, except crucially that nothing in your code yields 0 as a result.
However, I disagree with @Romalpa Akzo on recommending egen, cut(). Even an experienced Stata user is unlikely to remember the exact rules used by that function of that command.
Are lower bounds >= or >, in particular? What happens above and below the extreme values mentioned? What if you don't want results 1 up?
I prefer explicit code.
Here's another way to do it. With a programmer's understanding that cond(A, B, C) yields B if A is true (non-zero) and C if A is false (zero), then we can go
clear
set obs 26
generate Y = _n - 1
generate X = cond(Y > 24, ., ///
cond(Y >= 19, 4, ///
cond(Y >= 13, 3, ///
cond(Y >= 7, 2, ///
cond(Y >= 1, 1, 0 )))))
tabulate Y X , missing
| X
Y | 0 1 2 3 4 . | Total
-----------+------------------------------------------------------------------+----------
0 | 1 0 0 0 0 0 | 1
1 | 0 1 0 0 0 0 | 1
2 | 0 1 0 0 0 0 | 1
3 | 0 1 0 0 0 0 | 1
4 | 0 1 0 0 0 0 | 1
5 | 0 1 0 0 0 0 | 1
6 | 0 1 0 0 0 0 | 1
7 | 0 0 1 0 0 0 | 1
8 | 0 0 1 0 0 0 | 1
9 | 0 0 1 0 0 0 | 1
10 | 0 0 1 0 0 0 | 1
11 | 0 0 1 0 0 0 | 1
12 | 0 0 1 0 0 0 | 1
13 | 0 0 0 1 0 0 | 1
14 | 0 0 0 1 0 0 | 1
15 | 0 0 0 1 0 0 | 1
16 | 0 0 0 1 0 0 | 1
17 | 0 0 0 1 0 0 | 1
18 | 0 0 0 1 0 0 | 1
19 | 0 0 0 0 1 0 | 1
20 | 0 0 0 0 1 0 | 1
21 | 0 0 0 0 1 0 | 1
22 | 0 0 0 0 1 0 | 1
23 | 0 0 0 0 1 0 | 1
24 | 0 0 0 0 1 0 | 1
25 | 0 0 0 0 0 1 | 1
-----------+------------------------------------------------------------------+----------
Total | 1 6 6 6 6 1 | 26
Naturally, you could write the entire command on one line, but many will find the multi-line layout easier to understand and to debug. With nested function calls, each new condition implies a promise to close all the parentheses at the end.
Multiple commands like those in the question are preferred by many Stata users too; taste lies behind many of these choices.

time series sliding window with occurrence counts

I am trying to get a count between two timestamped values:
for example:
time letter
1 A
4 B
5 C
9 C
18 B
30 A
30 B
I am dividing the time range into windows: (1 + 30) / 30.
Then I want to know how many A, B, C occurrences fall in each time window of size 1:
timeseries A B C
1 1 0 0
2 0 0 0
...
30 1 1 0
This should give me a table of 30 rows and 3 columns (A, B, C) with the occurrence counts.
The problem is that this takes too long, because the code iterates through the whole master table for every window in order to slice the data, even though the data is already sorted:
master = mytable
minimum = master.timestamp.min()
maximum = master.timestamp.max()
window = (minimum + maximum) / maximum
wstart = minimum
wend = minimum + window
concurrent_tasks = []
while (wstart <= maximum):
    As = 0
    Bs = 0
    Cs = 0
    for d, row in master.iterrows():
        ttime = row.timestamp
        if ((ttime >= wstart) & (ttime < wend)):
            # print(row.channel)
            if (row.channel == 'A'):
                As = As + 1
            elif (row.channel == 'B'):
                Bs = Bs + 1
            elif (row.channel == 'C'):
                Cs = Cs + 1
    concurrent_tasks.append([m_id, As, Bs, Cs])
    wstart = wstart + window
    wend = wend + window
Could you help me make this perform better? I want to use a map function and prevent Python from looping over the entire table on every iteration.
This is part of a big data set and it is taking days to finish.
Thank you.
There is a faster approach - pd.get_dummies():
In [116]: pd.get_dummies(df.set_index('time')['letter'])
Out[116]:
A B C
time
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 0 0
30 0 1 0
If you want to "compress" (group) it by time:
In [146]: pd.get_dummies(df.set_index('time')['letter']).groupby(level=0).sum()
Out[146]:
A B C
time
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 1 0
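As a side note (not part of the original answer), the same grouped counts can also be obtained with a cross-tabulation of time against letter:
# Equivalent: one row per time value, one column per letter, with occurrence counts.
counts = pd.crosstab(df['time'], df['letter'])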
or using sklearn.feature_extraction.text.CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(token_pattern=r"\b\w+\b", stop_words=None)
r = pd.SparseDataFrame(cv.fit_transform(df.groupby('time')['letter'].agg(' '.join)),
                       index=df['time'].unique(),
                       columns=df['letter'].unique(),
                       default_fill_value=0)
Result:
In [143]: r
Out[143]:
A B C
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 1 0
If we want to list all times from 1 to 30:
In [153]: r.reindex(np.arange(r.index.min(), r.index.max()+1)).fillna(0).astype(np.int8)
Out[153]:
A B C
1 1 0 0
2 0 0 0
3 0 0 0
4 0 1 0
5 0 0 1
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 1
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 0 0
16 0 0 0
17 0 0 0
18 0 1 0
19 0 0 0
20 0 0 0
21 0 0 0
22 0 0 0
23 0 0 0
24 0 0 0
25 0 0 0
26 0 0 0
27 0 0 0
28 0 0 0
29 0 0 0
30 1 1 0
or using Pandas approach:
In [159]: pd.get_dummies(df.set_index('time')['letter']) \
...: .groupby(level=0) \
...: .sum() \
...: .reindex(np.arange(r.index.min(), r.index.max()+1), fill_value=0)
...:
Out[159]:
A B C
time
1 1 0 0
2 0 0 0
3 0 0 0
4 0 1 0
5 0 0 1
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 1
10 0 0 0
... .. .. ..
21 0 0 0
22 0 0 0
23 0 0 0
24 0 0 0
25 0 0 0
26 0 0 0
27 0 0 0
28 0 0 0
29 0 0 0
30 1 1 0
[30 rows x 3 columns]
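If the windows are wider than one time unit (as the window formula in the question suggests), one possible extension, a sketch that is not part of the original answer, is to bin the timestamps with pd.cut before counting; the window width here is hypothetical.
import numpy as np
import pandas as pd

width = 5  # hypothetical window width
edges = np.arange(df['time'].min(), df['time'].max() + width, width)

# One row per window, columns A/B/C with occurrence counts per window.
window_counts = (pd.get_dummies(df['letter'])
                   .groupby(pd.cut(df['time'], edges, right=False))
                   .sum())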
UPDATE:
Timing:
In [163]: df = pd.concat([df] * 10**4, ignore_index=True)
In [164]: %timeit pd.get_dummies(df.set_index('time')['letter'])
100 loops, best of 3: 10.9 ms per loop
In [165]: %timeit df.set_index('time').letter.str.get_dummies()
1 loop, best of 3: 914 ms per loop

Vectorized way to create a frequency vector from a set of observations in R?

Question
I have a vector of observations with their year of occurrence, and I want to create a vector of frequencies over a longer period for the purposes of curve fitting. I can do this easily with a function, but is there a simpler method or one that uses inherent vectorization? It may be I'm forgetting something simple.
Reproducible example
Data
Events <- data.frame(c(1991, 1991, 1995, 1999, 2007, 2007, 2010, 2010, 2010, 2014), seq(1100, 2000, 100))
names(Events) <- c("Year", "Loss")
Period <- seq(1990, 2014)
Function
FreqV <- function(Period, Observations){
  n <- length(Period)
  F <- double(n)
  for(i in seq_len(n)) {
    F[i] = sum(Observations == Period[i])
  }
  return(F)
}
Expected Results
FreqV(Period, Events$Year)
[1] 0 2 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 3 0 0 0 1
Post acceptance update
It bothered me that the C++ version of the algorithm (see the comments under the accepted answer) was so much slower, and I finally realized the reason: it is a naïve translation of FreqV above. If there are n periods and m events, it has to do n*m comparisons. Even in C++ this is slow.
tabulate probably uses a one-pass algorithm, and when I code a simple one-pass algorithm in C++, it is between 5 and 8 times faster than tabulate:
Naïve C++ Code
// [[Rcpp::export]]
std::vector<int> FV_C(std::vector<int> P, std::vector<int> O) {
  int n = P.size();
  std::vector<int> F(n);
  for (int i = 0; i < n; ++i){
    F[i] = std::count(O.begin(), O.end(), P[i]);
  }
  return(F);
}
One-pass C++ Code
// [[Rcpp::export]]
std::vector<int> FV_C2(std::vector<int> P, std::vector<int> O) {
  int n = P.size();
  int m = O.size();
  int MinP = *std::min_element(P.begin(), P.end());
  std::vector<int> F(n, 0);
  for (int i = 0; i < m; ++i){
    int offset = O[i] - MinP;
    F[offset] += 1;
  }
  return(F);
}
Speed test
Tests were done on an i7-2600K overclocked to 4.6 GHz with 16 GB RAM, running Windows 7 64-bit and R 3.1.2 compiled with OpenBLAS 2.13.
set.seed(1)
vals <- sample(sample(10000, 100), 100000, TRUE)
period <- 1:10000
f1a <- function() tabulate(factor(vals, period), nbins = length(period))
f1b <- function() tabulate((vals-period[1])+1, nbins = length(period))
f2 <- function() unname(table(c(period, vals))-1)
library(microbenchmark)
all.equal(f1a(), f1b(), f2(), FV_C(period, vals), FV_C2(period, vals))
[1] TRUE
microbenchmark(f1a(), f1b(), f2(), FV_C(period, vals), FV_C2(period, vals), times = 100L)
Unit: microseconds
expr min lq mean median uq max neval
f1a() 26998.194 27812.6250 29515.375 28167.645 28703.4515 55456.079 100
f1b() 640.049 712.4235 1291.356 800.136 1522.0890 27814.561 100
f2() 34228.449 35746.6655 39686.660 36210.395 36768.3900 65295.374 100
FV_C(period, vals) 647577.794 647927.3040 648729.027 648221.417 648848.5090 659463.813 100
FV_C2(period, vals) 140.877 147.7270 169.085 158.449 170.3625 1095.738 100
I would recommend factor and table or tabulate.
Here's tabulate:
tabulate(factor(Events$Year, Period))
# [1] 0 2 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 3 0 0 0 1
It might even be faster to do something like:
tabulate((Events$Year-Period[1])+1)
For both of these, you should probably specify nbins (nbins = length(Period)) in case the maximum value in "Events$Year" is less than the maximum value in "Period".
Here's a performance comparison:
set.seed(1)
vals <- sample(sample(10000, 100), 100000, TRUE)
period <- 1:10000
f1a <- function() tabulate(factor(vals, period), nbins = length(period))
f1b <- function() tabulate((vals-period[1])+1, nbins = length(period))
f2 <- function() unname(table(c(period, vals))-1)
library(microbenchmark)
microbenchmark(f1a(), f1b(), f2())
# Unit: microseconds
# expr min lq mean median uq max neval
# f1a() 41784.904 43665.394 46789.753 44278.093 45654.546 95032.59 100
# f1b() 884.465 1162.254 2261.118 1275.154 2756.922 46641.87 100
# f2() 54837.666 57615.562 71386.516 58863.272 100893.389 130235.33 100
You can solve this problem with table:
table(c(Period,Events$Year))-1
# 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
# 0 2 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0
# 2010 2011 2012 2013 2014
# 3 0 0 0 1
To get rid of the names, use:
unname(table(c(Period,Events$Year))-1)
# [1] 0 2 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 3 0 0 0 1
You could try
colSums(Vectorize(function(x) x==Events$Year)(Period))
#[1] 0 2 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 3 0 0 0 1
Or
colSums(outer(Events$Year, Period, FUN=function(x,y) x==y))
#[1] 0 2 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 3 0 0 0 1
Or using data.table
library(data.table)
CJ(Period, Events$Year)[, V3:=V1][, sum(V1==V2), V3]$V1
#[1] 0 2 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 3 0 0 0 1
Or if it is ordered
c(0,diff(findInterval(Period,Events$Year)))
#[1] 0 2 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 3 0 0 0 1
Or using a combination of tabulate with fmatch
library(fastmatch)
tabulate(fmatch(Events$Year, Period), nbins=length(Period))
#[1] 0 2 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 3 0 0 0 1

How to find TP,TN, FP and FN values from 8x8 Confusion Matrix

I have confusion matrix as follow:
a b c d e f g h <-- classified as
1086 7 1 0 2 4 0 0 | a
7 1064 8 6 0 2 2 0 | b
0 2 1091 2 3 0 1 1 | c
0 8 0 1090 1 1 0 0 | d
1 1 1 1 597 2 2 0 | e
4 2 1 0 3 1089 0 1 | f
0 2 1 3 0 0 219 0 | g
0 0 1 0 1 4 1 443 | h
Now, how do I find the True Positive, True Negative, False Positive and False Negative values from this confusion matrix?
Weka gives me the TP Rate; is that the same as the True Positive value?
You have 8 classes in total: a, b, c, d, e, f, g, h. You will thus get 8 different TP, FP, FN, and TN numbers, one set per class. For instance, in the case of class a,
TP (instance belongs to a, classified as a) = 1086
FP (instance belongs to others, classified as a) = 7 + 0 + 0 + 1 + 4 + 0 + 0 = 12
FN (instance belongs to a, classified as others) = 7 + 1 + 0 + 2 + 4 + 0 + 0 = 14
TN (instance belongs to others, classified as others) = Total instance - (TP + FP + FN)
The TP Rate is not the same as TP. It is the Recall, i.e. TP/(TP+FN).
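For illustration, here is a small NumPy sketch of the same bookkeeping (not from the original answer), assuming the matrix is stored with rows as actual classes and columns as predicted classes.
import numpy as np

# The 8x8 confusion matrix from the question (rows = actual, columns = predicted).
cm = np.array([
    [1086,    7,    1,    0,    2,    4,    0,    0],  # a
    [   7, 1064,    8,    6,    0,    2,    2,    0],  # b
    [   0,    2, 1091,    2,    3,    0,    1,    1],  # c
    [   0,    8,    0, 1090,    1,    1,    0,    0],  # d
    [   1,    1,    1,    1,  597,    2,    2,    0],  # e
    [   4,    2,    1,    0,    3, 1089,    0,    1],  # f
    [   0,    2,    1,    3,    0,    0,  219,    0],  # g
    [   0,    0,    1,    0,    1,    4,    1,  443],  # h
])

tp = np.diag(cm)                # diagonal: instances of the class classified as the class
fp = cm.sum(axis=0) - tp        # column sum minus diagonal: other classes classified as this class
fn = cm.sum(axis=1) - tp        # row sum minus diagonal: this class classified as other classes
tn = cm.sum() - (tp + fp + fn)  # everything else

tp_rate = tp / (tp + fn)        # Weka's "TP Rate" = per-class recall
print(tp[0], fp[0], fn[0])      # class a: 1086 12 14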