Understanding cost-sensitive evaluation in Weka (cost matrix) - weka

I am using Weka 3.7.1
I am attempting to analyze sport predictions for baseball using weka. I would like to use a cost matrix because the cost of different outcomes is not the same at a sportsbook where I gamble on the game. My data set is simple: it is a set of predictions with a nominal class {WIN,LOSS}. For this question, the attributes are not a concern.
In the WEKA Explorer, after loading my ARFF file, I can set up a cost matrix from Classify->More Options...->Cost-sensitive evaluation->Set... A 2x2 grid appears in the cost-sensitive evaluation dialog after I set the number of classes to 2.
Here are the values I would like to enter in to the cost matrix:
Correctly classified as loss, cost is 0 (I did not wager)
Incorrectly classified as loss, cost is 0 (I did not wager)
Correctly classified as win, cost is -.909 (I won .909 dollars)
Incorrectly classified as win, cost is 1.0 (I lost a dollar)
Note that, to stay true to the idea of a 'cost matrix', I entered my profit as a negative value (a profit is the opposite of a cost) and my loss as a positive value (because losing the wager costs me money).
After some reflection I decided to use the following grid. I have no idea whether I did this correctly, so please let me know:
  a      b      <---- "classified as"
  0      1.0    a=LOSS
  0     -.909   b=WIN
And here is my probably faulty logic: (col, row)
(0,0) of grid=0: classified as LOSS, and was LOSS
(0,1) of grid=0: classified as LOSS, but was WIN
(1,0) of grid=1.0: classified as WIN, but was LOSS
(1,1) of grid=-.909: classified as WIN, and was WIN
Of course, (0,0) and (0,1) represent the classifier predicting a LOSS. In these cases I do not wager, and therefore there is no cost.
On the other hand, (1,0) and (1,1) represent the classifier predicting a WIN. In these cases I place a wager, and therefore there is a cost associated.
One other item that is of great confusion: after I setup the cost matrix and execute a classifier, the output report contains the following:
Evaluation cost matrix:
0 1
0 0.91 <--- notice that this is not a negative value!
And as you can see, in the report (1,1) is 0.91 when I had actually entered -.909. I did find another post about this topic, but it does not explain why the negative value became positive.
Thank you in advance. Please note that these are answerable questions; however, if you want to provide some guidance I would be very happy as I am a newbie still trying to build a framework of understanding.

A cost matrix is a way to change the threshold value of the decision boundary.
It is explained in the following paper:
http://research.ijcaonline.org/volume44/number13/pxc3878677.pdf
Looking at your cost matrix, it seems that a small correction is required, e.g.
0 cost
cost 0
To explain, consider the following cost matrix:
a b
c d
This is the general format of a cost matrix that I have seen for two-class problems.
When an instance falls at location a or d, it has been classified correctly, so there is no cost to incorporate.
The point is that cost comes into the picture only when there is a misclassification, i.e. at location b or c.
Because you have written a negative value as the cost at position d, it creates confusion. (Please explain what you mean by a negative cost.)
An example cost matrix could be:
0 1
10 0
which says that the cost of a false positive is 10 times higher than the cost of a false negative for a similar example. Moreover, there is no cost when examples are classified correctly.
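As a rough illustration of how such a matrix enters an evaluation, the total cost can be taken as the sum, over every cell, of the confusion-matrix count times the corresponding cost. The sketch below is plain Python, not Weka code. It assumes rows are the actual class and columns the predicted class (the "classified as" layout from the question), and the confusion counts are made up for illustration.

```python
def total_cost(confusion, cost):
    """Sum count * cost over all cells (rows: actual class, columns: predicted class)."""
    return sum(
        confusion[i][j] * cost[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )

# Hypothetical confusion counts, for illustration only.
confusion = [[50, 5],
             [3, 42]]

# The example matrix from the answer: only misclassifications carry a cost.
example_cost = [[0, 1],
                [10, 0]]

# The matrix from the question: a negative cost stands for a profit.
wager_cost = [[0, 1.0],
              [0, -0.909]]

example_total = total_cost(confusion, example_cost)  # 5*1 + 3*10
wager_total = total_cost(confusion, wager_cost)      # 5*1.0 + 42*(-0.909)
print(example_total, round(wager_total, 3))
```

Under the sign convention used in the question, a negative total for the wagering matrix would correspond to a net profit.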

Related

Understanding the loss function in Yolo v1 research paper

I'm not able to understand the following piece of text from YOLO v1 research paper:
"We use sum-squared error because it is easy to optimize,
however it does not perfectly align with our goal of
maximizing average precision. It weights localization error
equally with classification error which may not be ideal.
Also, in every image many grid cells do not contain any
object. This pushes the “confidence” scores of those cells
towards zero, often overpowering the gradient from cells
that do contain objects. This can lead to model instability,
causing training to diverge early on.
To remedy this, we increase the loss from bounding box
coordinate predictions and decrease the loss from confidence
predictions for boxes that don’t contain objects. We
use two parameters, lambda(coord) and lambda(noobj) to accomplish this. We
set lambda(coord) = 5 and lambda(noobj) = .5"
What is the meaning of "overpowering" in the first paragraph? And why would we decrease the loss from confidence predictions (must it not already be low, especially for boxes that don't contain any object) and increase the loss from bounding box predictions?
Some grid cells contain objects and many do not. The model is often very confident about the absence of an object in a cell (confidence around zero). Because there are so many such cells, their combined gradient is much greater than the gradient from the cells that do contain objects but are predicted with only moderate confidence (e.g. around 0.7-0.8), so it overpowers them.
We therefore want to treat the confidence score for empty cells as less important, because its contribution is not very "fair". To implement this, we make the weight for coordinate predictions greater than the weight for confidence predictions on boxes without objects.
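A toy sketch of this weighting, in plain Python, may make the effect clearer. It is not the full YOLO loss: the class-probability term and the square roots on box width and height are left out, and the per-cell squared errors are invented numbers. Only lambda(coord) = 5 and lambda(noobj) = 0.5 come from the paper; the function name `loss_parts` is hypothetical.

```python
def loss_parts(cells, lambda_coord, lambda_noobj):
    """Return (loss from cells with an object, loss from empty cells).

    Each cell is (has_obj, box_sq_err, conf_sq_err), i.e. squared errors
    that are assumed to have been computed already.
    """
    obj_part = 0.0
    noobj_part = 0.0
    for has_obj, box_sq_err, conf_sq_err in cells:
        if has_obj:
            obj_part += lambda_coord * box_sq_err + conf_sq_err
        else:
            noobj_part += lambda_noobj * conf_sq_err
    return obj_part, noobj_part

# A 7x7 grid with one object cell and 48 empty cells (made-up errors).
cells = [(True, 0.04, 0.09)] + [(False, 0.0, 0.01)] * 48

plain = loss_parts(cells, 1.0, 1.0)     # unweighted sum-squared error
weighted = loss_parts(cells, 5.0, 0.5)  # weights from the paper
print(plain, weighted)
```

With these numbers the empty cells dominate the unweighted loss even though each one has a small error, while the weighted version lets the single object cell contribute the larger share.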

Using class weight to balance data set lowers accuracy in RBF SVM

I have been using sklearn to learn on some data. This is a binary classification task and I am using an RBF kernel. My data set is quite unbalanced (80:20) and I'm using only 120 samples, with 10ish features (I've been experimenting with a few less). Since I set class_weight="auto", the accuracy I've calculated from a cross-validated (10 folds) grid search has dropped dramatically. Why?
I will include a couple of validation accuracy heatmaps to demonstrate the difference.
NOTE: top heatmap is before classweight was changed to auto.
Accuracy is not the best metric to use with an unbalanced dataset. Say you have 99 positive examples and 1 negative example: if you predict every output to be positive, you still get 99% accuracy, even though you have misclassified the only negative example. You probably got high accuracy in the first case because your predictions fell on the side of the class with more samples.
When you set class_weight="auto", the imbalance is taken into account, so your predictions may have moved towards the center. You can cross-check this by plotting histograms of the predictions.
My suggestion: don't use accuracy as the performance metric; use something like F1 score or AUC.
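The 99-to-1 example can be checked with a few lines of plain Python. This is only a sketch with hand-written metric functions (the names are illustrative, not sklearn's API); it computes accuracy and the F1 score of the minority class for a classifier that always predicts the majority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class (0.0 if it is never found)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1] * 99 + [0]   # 99 positive examples, 1 negative example
y_pred = [1] * 100        # always predict the majority class

acc = accuracy(y_true, y_pred)
minority_f1 = f1_for_class(y_true, y_pred, 0)
print(acc, minority_f1)
```

Accuracy looks excellent here, while the F1 score for the minority class shows that the classifier never finds it.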

What does eigen value of structure tensor matrix denote?

It is known that a good feature point across two images can be determined properly if the two eigenvalues of the above matrix are greater than 0. Can someone explain what it means for both eigenvalues to be greater than 0, and why the feature point is not good if either of them is approximately equal to 0?
Note that this matrix always has nonnegative eigenvalues. Basically this rule says that one should favor rapid change in all directions, that is corners are better features than edges or flat surfaces.
The biggest eigenvalue corresponds to the eigenvector pointing towards the direction of the most significant change in the image at the point u.
If the two eigenvalues are small the image at point u does not change much.
If one of the eigenvalues is large and the other is small, this point might lie on an edge in the image, but it will be difficult to figure out where exactly on that edge.
If both are large, the point is like a corner.
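The three cases can be reproduced on tiny synthetic patches. The sketch below is plain Python and makes several simplifying assumptions: gradients are central differences, the window is the whole patch interior with uniform weights (no Gaussian weighting), and the eigenvalues come from the closed form for a symmetric 2x2 matrix. The function name is hypothetical.

```python
import math

def structure_tensor_eigenvalues(patch):
    """Return (larger, smaller) eigenvalue of the 2x2 structure tensor of a patch."""
    h, w = len(patch), len(patch[0])
    sxx = sxy = syy = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix = (patch[y][x + 1] - patch[y][x - 1]) / 2.0  # horizontal gradient
            iy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0  # vertical gradient
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
    # Eigenvalues of [[sxx, sxy], [sxy, syy]] in closed form.
    mean = (sxx + syy) / 2.0
    dev = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    return mean + dev, mean - dev

flat = [[1.0] * 6 for _ in range(6)]
edge = [[0.0 if x < 3 else 1.0 for x in range(6)] for _ in range(6)]
corner = [[1.0 if (x >= 3 and y >= 3) else 0.0 for x in range(6)] for y in range(6)]

for name, patch in (("flat", flat), ("edge", edge), ("corner", corner)):
    print(name, structure_tensor_eigenvalues(patch))
```

On these patches the flat region gives two zero eigenvalues, the vertical edge gives one positive and one zero eigenvalue, and the corner gives two positive eigenvalues.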
There is a nice presentation with examples in the panoramic stitching slide deck from a course taught by Rajesh Rao at the University of Washington.
Here E(u,v) denotes the Euclidean distance between the two areas in the vicinities of pixels shifted by the vector (u,v) from each other. This distance tells how easy it is to distinguish the two pixels from one another.
Edit: The matrix of image derivatives is denoted H in this illustration, probably because of its relation to the Harris corner detection algorithm.
This is related to the concept of texturedness in the paper by Shi and Tomasi, "Good features to track".
The idea of texturedness is to provide a rating of texture that makes features (within a window) identifiable and unique. For instance, lines are not good features since they are not unique (see Figure 3.9a).
To solve the optical flow equation, it must be possible to invert J (the Hessian matrix). In practice, the following conditions must be satisfied:
Eigenvalues of J cannot differ by several orders of magnitude.
Eigenvalues of J must exceed the image noise level λnoise, which implies that both eigenvalues of J must be large.
For the first condition, we know that the greatest eigenvalue cannot be arbitrarily large because intensity variations in a window are bounded by the maximum allowable pixel value.
Regarding the second condition, with λ1 and λ2 being the two eigenvalues of J, the following situations may arise (see Figure 3.10):
• Two small eigenvalues λ1 and λ2: a roughly constant intensity profile within the window (pink region). This is the problem of Figure 3.9-b.
• A large and a small eigenvalue: a unidirectional texture pattern (violet or gray region). This is the problem of Figure 3.9-a.
• λ1 and λ2 both large: can represent a corner, salt-and-pepper texture, or any other pattern that can be tracked reliably (green region).
Some references:
1 - ORTIZ CAYON, R. J. (2013). Online video stabilization for UAV. Motion estimation and compensation for unnamed aerial vehicles.
2 - Shi, J., & Tomasi, C. (1994, June). Good features to track. In Computer Vision and Pattern Recognition, 1994. Proceedings CVPR'94., 1994 IEEE Computer Society Conference on (pp. 593-600). IEEE.
3 - Richard Szeliski. Image alignment and stitching: a tutorial. Found. Trends. Comput. Graph. Vis., 2(1):1–104, January 2006.

Deciding about dimensionality reduction with PCA

I have 2D data (I have a zero mean normalized data). I know the covariance matrix, eigenvalues and eigenvectors of it. I want to decide whether to reduce the dimension to 1 or not (I use principal component analysis, PCA). How can I decide? Is there any methodology for it?
I am looking for something like: if you look at this ratio and it is high, then it is logical to go on with dimensionality reduction.
PS 1: Does PoV (proportion of variation) stand for it?
PS 2: Here is an answer: https://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-explained Is it a criterion to test this?
PoV (proportion of variation) represents how much of the information in the data remains relative to using all components. It may be used for that purpose. If PoV is high, then less information will be lost.
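For the 2D case in the question, the proportion of variance kept by the first principal component is the largest eigenvalue of the covariance matrix divided by the sum of the eigenvalues. A minimal sketch in plain Python, with made-up eigenvalues and an illustrative function name:

```python
def proportion_of_variance(eigenvalues, k):
    """Share of total variance kept by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:k]) / sum(ordered)

# Hypothetical eigenvalues of a 2x2 covariance matrix.
eigenvalues = [0.5, 4.5]

pov = proportion_of_variance(eigenvalues, 1)
print(pov)
```

How high the ratio has to be before dropping the second dimension is a judgment call that depends on the application; the sketch only computes the ratio.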
You want to sort your eigenvalues by magnitude then pick the highest 1 or 2 values. Eigenvalues with a very small relative value can be considered for exclusion. You can then translate data values and using only the top 1 or 2 eigenvectors you'll get dimensions for plotting results. This will give a visual representation of the PCA split. Also check out scikit-learn for more on PCA. Precisions, recalls, F1-scores will tell you how well it works
from http://sebastianraschka.com/Articles/2014_pca_step_by_step.html...
Step 1: 3D Example
"For our simple example, where we are reducing a 3-dimensional feature space to a 2-dimensional feature subspace, we are combining the two eigenvectors with the highest eigenvalues to construct our d×k-dimensional eigenvector matrix W.
matrix_w = np.hstack((eig_pairs[0][1].reshape(3,1),
eig_pairs[1][1].reshape(3,1)))
print('Matrix W:\n', matrix_w)
>>>Matrix W:
[[-0.49210223 -0.64670286]
[-0.47927902 -0.35756937]
[-0.72672348 0.67373552]]"
Step 2: 3D Example
"
In the last step, we use the 2×3-dimensional matrix W that we just computed to transform our samples onto the new subspace via the equation
y=W^T×x
transformed = matrix_w.T.dot(all_samples)
assert transformed.shape == (2,40), "The matrix is not 2x40 dimensional.""

How to exploit periodicity to reduce noise of a signal?

100 periods have been collected from a 3 dimensional periodic signal. The wavelength slightly varies. The noise of the wavelength follows Gaussian distribution with zero mean. A good estimate of the wavelength is known, that is not an issue here. The noise of the amplitude may not be Gaussian and may be contaminated with outliers.
How can I compute a single period that approximates 'best' all of the collected 100 periods?
Time-series, ARMA, ARIMA, Kalman Filter, autoregression and autocorrelation seem to be keywords here.
UPDATE 1: I have no idea how time-series models work. Are they prepared for varying wavelengths? Can they handle non-smooth true signals? If a time-series model is fitted, can I compute a 'best estimate' for a single period? How?
UPDATE 2: A related question is this. Speed is not an issue in my case. Processing is done off-line, after all periods have been collected.
Origin of the problem: I am measuring acceleration during human steps at 200 Hz. After that I am trying to double integrate the data to get the vertical displacement of the center of gravity. Of course the noise introduces a HUGE error when you integrate twice. I would like to exploit periodicity to reduce this noise. Here is a crude graph of the actual data (y: acceleration in g, x: time in second) of 6 steps corresponding to 3 periods (1 left and 1 right step is a period):
My interest is now purely theoretical, as http://jap.physiology.org/content/39/1/174.abstract gives a pretty good recipe what to do.
We have used wavelets for noise suppression with a similar signal measured from cows while walking.
I don't think the noise is so much of a problem here; the biggest peaks represent actual changes in the acceleration during walking.
I suppose that the angle of the leg, and thus of the accelerometer, changes during your experiment, and you need to account for that in order to calculate the distance, i.e. you need to know the orientation of the accelerometer at each time step. See e.g. this technical note for one way to account for the angle.
If you need accurate measures of the position, the best solution would be to get an accelerometer with a magnetometer, which also measures orientation. Something like this should work: http://www.sparkfun.com/products/10321.
EDIT: I have looked into this a bit more in the last few days because a similar project is on my to-do list as well... We have not used gyros in the past, but we are doing so in the next project.
The inaccuracy in the positioning doesn't come from the white noise, but from the inaccuracy and drift of the gyro. The error then accumulates very quickly due to the double integration. Intersense has a product called Navshoe that addresses this problem by zeroing the error after each step (see this paper). And this is a good introduction to inertial navigation.
A periodic signal without noise has the following property:
f(a) = f(a+k), where k is the wavelength.
The next piece of information needed is that your signal is composed of separate samples. Everything you have collected is based on samples, which are values of the function f(). From the 100 samples taken at the same phase a, one per period, you can get the mean value:
1/n * sum(s_i), where i is in range [0..n-1] and n = 100.
This needs to be done for every dimension of your data. If you use 3D data, it is applied 3 times, and the results are (x,y,z) points. You can find the value of s_i from the periodic signal equation simply by doing
s_i(a).x = f(a+k*i).x
s_i(a).y = f(a+k*i).y
s_i(a).z = f(a+k*i).z
If the wavelength is not accurate, this adds a further source of error, or you will need to adjust it to match the real wavelength of each period. Since
k*i = k+k+...+k
if the wavelength varies, you'll need to use
k_1+k_2+k_3+...+k_i
instead of k*i.
Unfortunately, with errors in the wavelength, there will be big problems keeping this k_1..k_i chain in sync with the actual data. You would actually need to know how to recognize the starting position of each period in your actual data. You may need to mark them by hand.
Now, all the mean values you calculated form a function like this:
m(a) :: R->(x,y,z)
This is a curve in 3D space. More complex error models are left as an exercise for the reader.
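The averaging described above can be sketched for one dimension as follows. This is only an illustration under stated assumptions: the start index of every period (`starts`) is assumed to be known already, each period is resampled to a common number of phase points by linear interpolation to cope with the varying wavelength, and a plain mean is used, so outliers are not handled. The function names are hypothetical.

```python
def resample_period(samples, n_points):
    """Linearly interpolate one period onto n_points evenly spaced phases."""
    m = len(samples)
    out = []
    for j in range(n_points):
        pos = j * m / n_points      # fractional index for phase j / n_points
        lo = int(pos)
        hi = min(lo + 1, m - 1)     # clamp at the end of the period
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out

def mean_period(signal, starts, n_points):
    """Average all periods of a 1D signal, phase by phase.

    starts lists the index where each period begins, plus one final
    index marking the end of the last period.
    """
    periods = [signal[a:b] for a, b in zip(starts, starts[1:])]
    resampled = [resample_period(p, n_points) for p in periods]
    n = len(resampled)
    return [sum(p[j] for p in resampled) / n for j in range(n_points)]

# Two noisy copies of the period [0, 1, 0, -1] (made-up data).
signal = [0.1, 1.1, 0.1, -0.9, -0.1, 0.9, -0.1, -1.1]
estimate = mean_period(signal, [0, 4, 8], 4)
print(estimate)
```

For the 3D signal in the question, the same function would be applied to the x, y and z components separately. Replacing the mean by a median per phase is one possible way to reduce the influence of outliers.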
If you have a copy of Curve Fitting Toolbox, localized regression might be a good choice.
Curve Fitting Toolbox supports both lowess and loess localized regression models for curve and surface fitting.
There is an option for robust localized regression.
The following blog post shows how to use cross-validation to estimate an optimal span parameter for a localized regression model, as well as techniques to estimate confidence intervals using a bootstrap.
http://blogs.mathworks.com/loren/2011/01/13/data-driven-fitting/