Trouble with classification using the J48 algorithm in Weka

I'm trying to use the J48 classifier in Weka, but it classified everything as 0.
This is my DataSet:
@relation 'SimpleRules-weka.filters.unsupervised.attribute.Reorder-R1,2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,19,6-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.unsupervised.attribute.Remove-R3-weka.filters.unsupervised.attribute.Remove-R1-2'
@attribute R1 numeric
@attribute R2 numeric
@attribute R3 numeric
@attribute R4 numeric
@attribute R5 numeric
@attribute R6 numeric
@attribute R7 numeric
@attribute R8 numeric
@attribute R9 numeric
@attribute Rank numeric
@attribute R1R5R6 numeric
@attribute R1R6R7 numeric
@attribute CombinedRules numeric
@attribute Demoday {0,1}
@data
1,1,1,1,1,0,1,1,0,11,12,0,0,0
1,1,1,1,0,1,0,1,0,72,1,0,0,0
0,0,0,1,0,1,1,1,0,47,7,0,0,1
1,1,0,1,1,1,0,1,1,68,12,1,0,0
1,1,1,1,1,1,0,1,1,21,7,1,0,0
1,1,1,1,0,1,1,1,1,63,11,0,1,0
1,1,0,1,0,1,1,1,0,19,7,0,1,0
1,0,1,1,0,0,1,1,1,11,7,0,0,0
0,1,1,1,0,1,0,1,0,107,12,0,0,0
1,1,1,1,0,1,0,1,0,99,12,0,0,1
0,1,1,1,0,1,1,1,1,238,2,0,0,0
1,1,1,1,1,0,1,1,0,147,7,0,0,0
1,1,1,1,1,1,0,1,1,30,7,1,0,1
1,1,1,1,0,0,0,1,1,124,5,0,0,0
0,1,1,1,1,1,0,1,1,54,5,0,0,0
0,0,0,1,0,1,0,1,0,153,5,0,0,0
1,1,1,1,0,1,0,1,1,33,5,0,0,0
1,1,1,1,1,1,0,1,0,143,3,1,0,0
1,0,1,1,0,1,1,1,0,28,3,0,1,0
0,1,1,1,0,1,1,0,0,83,8,0,0,0
1,1,1,1,1,1,1,1,0,31,7,1,1,0
1,1,1,1,0,0,0,1,0,91,12,0,0,0
1,1,1,1,0,0,1,1,0,7,7,0,0,0
1,1,1,1,0,1,0,1,1,4,1,0,0,0
1,1,0,1,1,1,0,1,0,41,1,1,0,0
0,1,1,1,0,1,0,1,1,84,5,0,0,0
1,1,0,1,0,1,1,1,0,81,1,0,1,1
0,1,1,1,1,1,0,1,1,8,6,0,0,0
1,1,1,1,0,1,1,1,1,172,11,0,1,0
1,1,1,1,1,0,0,1,1,142,12,0,0,1
0,1,1,1,0,1,1,1,1,35,11,0,0,0
1,1,1,1,0,1,0,1,1,130,11,0,0,0
1,1,1,1,0,1,1,1,1,62,7,0,1,0
0,1,1,1,0,1,1,1,1,34,7,0,0,0
0,1,1,1,0,1,1,1,1,108,3,0,0,0
0,1,1,1,0,1,0,1,1,11,12,0,0,0
0,1,1,1,1,0,0,1,1,129,3,0,0,0
1,1,0,1,0,1,1,1,1,24,10,0,1,1
1,1,1,1,0,1,1,1,0,50,8,0,1,0
1,1,1,1,1,1,0,1,1,12,12,1,0,1
0,1,1,1,1,1,1,1,0,111,3,0,0,0
1,1,0,1,0,1,0,1,1,55,11,0,0,0
1,1,1,1,0,1,0,1,1,239,11,0,0,0
0,1,1,1,1,1,0,1,0,131,2,0,0,0
1,1,1,1,0,1,0,1,1,328,8,0,0,0
1,1,1,1,0,1,1,1,1,12,12,0,1,1
1,1,1,1,0,1,0,1,1,113,8,0,0,0
0,1,1,1,0,1,0,1,0,96,1,0,0,0
1,1,1,1,0,0,0,1,1,75,7,0,0,0
1,1,1,1,1,1,0,1,1,67,1,1,0,1
1,1,1,1,1,1,0,1,0,112,11,1,0,0
1,1,1,1,0,0,1,1,1,109,3,0,0,0
1,0,1,1,1,0,0,1,0,47,12,0,0,0
1,1,1,1,0,1,0,1,1,47,7,0,0,0
1,1,1,1,0,1,0,1,1,2,6,0,0,0
0,0,0,1,0,1,1,1,0,16,2,0,0,0
1,1,1,1,0,1,0,1,0,18,12,0,0,0
0,1,1,1,1,1,1,1,0,58,3,0,0,0
0,0,0,1,1,1,0,1,0,156,7,0,0,0
1,1,1,1,1,0,1,1,0,279,2,0,0,0
1,1,1,1,0,1,0,1,0,2,12,0,0,0
0,0,1,1,0,1,0,1,1,163,6,0,0,0
1,1,1,1,1,1,0,1,1,10,3,1,0,1
0,0,1,1,1,1,0,1,0,3,12,0,0,0
1,1,1,1,1,1,0,1,1,101,7,1,0,0
1,1,1,1,0,1,0,1,0,136,9,0,0,0
0,1,1,1,1,0,0,1,0,31,8,0,0,0
1,0,1,1,1,1,0,1,0,155,8,1,0,0
0,1,1,1,0,1,1,1,0,158,12,0,0,0
0,1,0,1,0,1,0,1,0,101,1,0,0,0
0,1,0,1,0,1,0,1,1,7,7,0,0,0
1,0,0,1,1,1,0,1,0,23,1,1,0,0
1,0,0,1,1,0,0,1,1,99,1,0,0,0
1,1,1,1,1,1,0,1,1,73,3,1,0,0
1,1,1,1,1,1,0,1,0,15,3,1,0,0
1,1,1,1,0,1,1,1,0,97,8,0,1,0
1,1,1,1,0,1,1,1,1,93,8,0,1,0
1,1,1,1,1,1,1,1,0,44,7,1,1,1
0,1,1,1,0,1,0,1,0,239,7,0,0,0
0,0,0,1,1,1,0,1,1,35,1,0,0,0
0,1,1,1,0,1,0,1,0,90,12,0,0,0
1,1,1,1,1,1,0,1,1,37,7,1,0,0
1,1,1,1,1,0,0,1,1,25,12,0,0,1
1,1,1,1,0,0,0,1,0,83,2,0,0,0
1,1,1,1,1,1,1,1,1,22,10,1,1,1
1,1,1,1,1,0,1,1,1,2,10,0,0,0
1,0,1,1,0,1,1,1,1,65,5,0,1,0
0,1,1,1,0,1,1,1,1,25,3,0,0,0
1,0,1,1,0,0,1,1,0,180,8,0,0,0
0,1,0,1,0,1,1,1,1,49,10,0,0,0
0,0,1,1,0,1,0,1,0,67,8,0,0,0
1,1,1,1,1,0,1,1,0,14,11,0,0,0
1,0,0,1,1,1,0,1,0,36,11,1,0,0
0,0,0,1,0,0,1,1,1,97,9,0,0,0
0,0,0,1,0,1,1,1,0,193,1,0,0,0
0,0,1,1,1,1,1,1,1,83,6,0,0,0
0,1,1,1,0,1,0,1,1,13,12,0,0,0
1,1,1,1,0,1,0,1,0,49,5,0,0,0
1,0,1,1,1,1,1,1,1,1,8,1,1,1
1,0,1,1,0,1,0,1,1,159,10,0,0,0
1,1,1,1,1,1,1,1,0,51,7,1,1,0
1,1,1,1,1,1,0,1,1,168,6,1,0,0
0,1,1,1,0,1,0,1,1,100,5,0,0,0
0,0,0,1,0,0,1,1,0,30,3,0,0,0
1,1,0,1,0,1,1,1,0,27,12,0,1,0
1,1,1,1,0,1,0,1,1,34,11,0,0,0
0,1,0,1,0,1,1,1,1,101,3,0,0,0
1,0,1,1,0,1,0,1,1,111,11,0,0,0
1,1,1,1,1,1,0,1,0,51,2,1,0,0
1,1,1,1,0,0,0,1,0,233,12,0,0,0
1,1,1,1,1,0,0,1,1,98,11,0,0,0
0,1,1,1,0,1,0,1,0,24,1,0,0,0
1,1,1,1,0,0,1,1,1,181,2,0,0,0
1,1,1,1,1,1,0,1,1,14,6,1,0,0
0,1,1,1,1,1,1,1,1,96,1,0,0,0
1,1,1,1,0,1,1,1,1,139,12,0,1,1
1,1,1,1,1,1,1,1,1,155,8,1,1,0
1,1,1,1,1,1,0,1,0,53,7,1,0,1
0,1,1,1,0,1,1,1,0,17,8,0,0,0
1,1,1,1,0,1,1,1,0,39,6,0,1,0
0,0,1,1,0,1,0,1,0,282,12,0,0,0
1,0,1,1,1,0,0,1,1,132,7,0,0,0
1,1,1,1,0,0,0,1,0,57,11,0,0,0
1,0,0,1,0,1,1,1,1,165,7,0,1,1
0,1,0,1,0,1,1,1,1,74,10,0,0,0
0,1,1,1,0,1,1,1,0,150,7,0,0,0
1,0,1,1,1,1,0,1,1,53,2,1,0,0
1,1,1,1,1,1,0,1,1,42,12,1,0,1
1,1,1,1,1,0,1,1,1,234,7,0,0,0
1,1,1,1,0,0,1,1,1,164,10,0,0,0
1,1,1,1,0,0,0,1,0,69,3,0,0,0
1,1,1,1,0,0,1,1,0,38,5,0,0,0
1,0,0,1,0,1,0,1,1,56,7,0,0,0
1,1,0,1,0,1,0,1,1,63,1,0,0,0
1,1,1,1,1,1,1,1,0,9,1,1,1,0
1,0,1,1,0,1,0,1,1,23,11,0,0,0
1,1,1,1,1,0,0,1,0,46,7,0,0,0
1,1,1,1,0,0,0,1,1,59,12,0,0,0
1,1,0,1,0,1,0,1,1,27,1,0,0,0
0,1,1,1,1,1,0,1,1,4,12,0,0,0
1,1,0,1,0,1,0,1,0,132,12,0,0,0
1,1,0,1,1,1,1,1,1,78,5,1,1,0
1,1,1,1,0,1,1,1,1,32,12,0,1,0
0,1,1,1,1,1,0,1,0,104,7,0,0,0
1,1,1,1,0,1,1,1,0,117,12,0,1,0
0,1,0,1,0,1,0,1,1,185,7,0,0,0
1,1,0,1,0,1,1,1,0,38,4,0,1,0
1,1,0,1,1,0,1,1,1,8,12,0,0,0
0,1,1,1,1,1,1,1,0,80,4,0,0,1
1,0,0,1,1,0,1,1,0,12,11,0,0,0
0,0,1,1,0,1,1,1,0,70,12,0,0,0
1,1,1,1,1,0,0,1,1,76,3,0,0,0
0,1,1,1,0,1,1,1,0,23,11,0,0,0
1,1,0,1,1,1,0,1,0,40,7,1,0,0
1,1,1,1,0,0,1,1,1,159,12,0,0,0
1,1,1,1,0,1,0,1,0,49,12,0,0,1
0,0,1,1,0,1,1,1,1,37,7,0,0,1
1,1,0,1,0,1,1,1,1,147,9,0,1,0
1,1,1,1,0,0,0,1,0,87,3,0,0,0
1,1,1,1,1,1,0,1,0,7,1,1,0,0
0,0,1,1,0,1,1,1,0,167,3,0,0,0
0,1,1,1,0,1,1,1,0,6,3,0,0,0
0,1,1,1,1,1,0,1,0,39,7,0,0,0
1,1,1,1,1,0,0,1,1,88,11,0,0,0
0,0,1,1,1,1,1,1,0,175,12,0,0,0
1,1,1,0,0,1,0,1,0,127,12,0,0,0
1,1,1,1,0,1,1,1,0,1,11,0,1,0
1,1,1,1,0,0,0,1,1,77,7,0,0,0
1,1,1,1,1,0,0,1,1,122,5,0,0,0
1,0,1,1,0,1,1,1,0,155,8,0,1,1
1,1,0,0,0,1,1,1,1,114,4,0,1,0
0,1,1,1,1,0,0,1,0,106,7,0,0,1
1,1,1,1,1,1,1,1,0,16,7,1,1,1
1,0,0,1,0,1,0,1,0,176,6,0,0,0
1,0,1,1,0,0,1,1,1,47,2,0,0,0
0,0,0,1,0,1,0,1,0,95,6,0,0,0
1,1,1,1,0,1,1,1,0,233,11,0,1,0
1,1,1,1,0,1,1,1,0,27,1,0,1,0
1,1,1,1,0,1,0,1,1,85,8,0,0,1
0,0,0,0,0,1,0,1,1,58,3,0,0,0
1,0,1,1,1,1,0,1,1,102,11,1,0,0
1,1,1,1,1,0,0,1,1,33,12,0,0,0
0,1,1,1,0,1,0,1,0,92,12,0,0,0
1,0,1,1,1,1,0,1,0,20,5,1,0,0
1,1,1,1,1,1,1,1,1,8,8,1,1,1
1,1,1,1,1,1,1,1,1,3,12,1,1,0
1,1,0,1,0,1,1,1,0,16,12,0,1,0
1,1,1,1,0,1,1,1,0,143,12,0,1,0
1,1,0,1,0,1,0,1,1,84,3,0,0,0
1,1,1,1,0,1,1,1,1,149,7,0,1,0
1,1,1,1,0,0,1,1,0,14,3,0,0,0
1,0,1,1,0,1,1,1,0,37,9,0,1,0
0,1,1,1,0,0,0,1,1,137,1,0,0,0
1,0,1,1,0,1,0,1,1,121,1,0,0,0
1,0,0,1,0,1,1,1,1,21,3,0,1,0
1,1,1,1,1,0,1,1,1,23,5,0,0,0
1,0,1,1,0,1,1,1,0,40,11,0,1,0
1,1,1,1,0,1,1,1,1,82,6,0,1,1
1,1,1,1,0,0,1,1,1,106,12,0,0,0
0,0,1,1,1,0,0,1,1,62,7,0,0,0
1,1,1,1,0,1,0,1,0,90,1,0,0,0
1,1,1,1,0,1,1,1,0,26,12,0,1,1
0,1,1,1,0,1,1,1,0,49,11,0,0,0
0,1,1,1,0,1,0,1,1,67,7,0,0,0
1,1,1,1,0,0,1,1,1,120,3,0,0,0
1,1,1,1,1,1,1,1,0,92,1,1,1,0
1,1,0,1,1,1,0,1,0,22,5,1,0,1
1,1,1,1,0,0,1,1,0,130,1,0,0,0
1,1,1,1,0,1,1,1,1,135,3,0,1,0
1,1,0,1,0,1,0,1,1,94,6,0,0,0
0,1,1,1,1,0,0,1,0,63,3,0,0,0
1,1,1,1,0,1,1,1,1,40,3,0,1,0
1,1,1,1,0,1,0,1,1,512,12,0,0,0
1,1,0,1,0,1,0,1,1,60,10,0,0,1
0,0,0,1,0,0,1,1,0,154,11,0,0,0
1,1,1,1,1,0,1,1,0,117,3,0,0,1
1,1,1,1,1,1,0,1,1,198,3,1,0,0
1,0,1,1,1,1,1,1,0,51,2,1,1,0
1,0,0,1,1,0,1,1,1,53,1,0,0,0
1,1,0,1,0,1,1,1,0,115,12,0,1,0
1,1,1,1,1,0,0,1,1,86,1,0,0,0
1,1,1,1,1,1,1,1,0,65,5,1,1,0
0,1,1,1,1,1,1,1,1,51,6,0,0,0
1,1,1,1,0,0,0,1,0,41,2,0,0,0
1,1,1,1,0,1,1,1,1,104,3,0,1,0
0,1,1,1,1,0,0,1,0,44,9,0,0,0
1,0,0,1,1,0,0,1,1,145,2,0,0,0
1,1,1,1,0,1,1,1,0,199,10,0,1,1
1,1,0,1,0,1,1,1,1,3,3,0,1,0
1,1,1,1,0,0,0,1,0,10,5,0,0,0
1,1,1,1,1,1,0,1,1,81,7,1,0,0
1,1,0,0,0,1,0,1,0,164,6,0,0,0
0,1,1,1,0,1,1,1,1,122,5,0,0,0
1,1,1,1,1,1,0,1,1,188,3,1,0,0
1,1,1,1,0,0,0,1,1,149,5,0,0,0
1,1,1,1,0,0,0,1,1,152,12,0,0,0
1,1,1,1,0,1,1,1,1,5,10,0,1,0
1,0,1,1,1,0,0,1,0,35,5,0,0,0
1,1,1,1,0,0,1,1,1,12,4,0,0,0
And here are the results after running the J48 algorithm with 10-fold cross-validation:
Correctly Classified Instances 205 85.7741 %
Incorrectly Classified Instances 34 14.2259 %
Kappa statistic 0.0266
Mean absolute error 0.2346
Root mean squared error 0.3465
Relative absolute error 100.0672 %
Root relative squared error 101.7226 %
Coverage of cases (0.95 level) 99.5816 %
Mean rel. region size (0.95 level) 99.3724 %
Total Number of Instances 239
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0,986 0,969 0,868 0,986 0,923 0,044 0,492 0,865 0
0,031 0,014 0,250 0,031 0,056 0,044 0,492 0,135 1
Weighted Avg. 0,858 0,841 0,785 0,858 0,807 0,044 0,492 0,767
=== Confusion Matrix ===
a b <-- classified as
204 3 | a = 0
31 1 | b = 1
And the tree that it generates is this: https://www.dropbox.com/s/qzjukr8klffwl90/Captura%20de%20pantalla%202014-04-15%2022.33.38.png
I hope you can help me solve this issue.
Thanks a lot.

It is a classical class imbalance problem. Have a look at your class distribution: Demoday=1 has 32 records, Demoday=0 has 207 records. Almost every machine learning algorithm is designed to achieve the best overall accuracy. So, in your case, if it assigns 0 to (almost) every instance it already gets accuracy in the region of the 85.8% you see, and that is essentially the one-leaf tree you get. The problem is, of course, that the minority class is usually of greater interest. Google unbalanced classes, or class imbalance, for more info.
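To make that arithmetic concrete (a worked check added here, with the 207/32 split counted from the data section above): always predicting the majority class 0 would already score

\[
\frac{207}{239} \approx 86.6\%,
\]

and the reported 85.77% with a kappa of only 0.0266 sits in the same region, so the tree has learned essentially nothing useful about Demoday=1.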
I have already posted some quick solutions, as well as some model performance indicators suitable for this case, as an answer here: how to edit weka configurations to find "1"
Good luck

Related

Function for testing system stability, which receives predicted time series as input

I want to write a function that gets a time series and a standard deviation as parameters and returns an adjusted time series which looks like a forecast.
With this function I want to test the stability of a system that takes a forecasted weather time series as an input parameter.
My approach for such a function is described below:
vector<tuple<datetime, double>> get_adjusted_timeseries(vector<tuple<datetime, double>>& timeseries_original, const double stddev, const double dist_mid)
{
    auto timeseries_copy(timeseries_original);
    // Randomly choose whether to start by deviating upwards or downwards.
    int sign = randInRange(0, 1) == 0 ? 1 : -1;
    // Limits (via the inverse CDF) that decide when the sign is flipped.
    auto left_limit = normal_cdf_inverse(0.5 - dist_mid, 0, stddev);
    auto right_limit = normal_cdf_inverse(0.5 + dist_mid, 0, stddev);
    for (auto& pair : timeseries_copy)
    {
        double nd_value;
        do
        {
            // Draw until the value has the currently required sign.
            nd_value = normal_distribution_r(0, stddev);
        } while ((sign == -1 && nd_value > 0.0) || (sign == 1 && nd_value < 0.0));
        // Adjust the original value by nd_value percent.
        pair = make_tuple(get<0>(pair), get<1>(pair) + (nd_value / 100) * get<1>(pair));
        // Flip the sign when the deviation is small, i.e. the adjusted value stays close to the original.
        if ((nd_value > 0.0 && nd_value < right_limit) || (nd_value < 0.0 && nd_value > left_limit))
        {
            sign = sign == -1 ? 1 : -1;
        }
    }
    return timeseries_copy;
}
Make a copy of the original time series, which is also of type vector<tuple<datetime, double>>.
Get a random number that is either 0 or 1 and use it to set the sign.
Use the inverse cumulative distribution function to get the limits, which indicate when the sign is changed. The sign is changed when the value of the copied time series is close to the original value. The implementation of the inverse CDF is shown here.
For-loop for each item in the time series:
get a normally distributed value, which should be below zero when sign == -1 and above zero when sign == 1
adjust the old value of the time series according to the normally distributed value
change the sign if the normally distributed value is close to the original value.
The result for a low standard deviation, for example, can be seen here in yellow:
If the mean absolute percentage error (MAPE) of the two time series is calculated, the following relationship results:
stddev: 5 -> MAPE: ~0.04
stddev: 10 -> MAPE: ~0.08
stddev: 15 -> MAPE: ~0.12
stddev: 20 -> MAPE: ~0.16
What do you think of this approach?
Can this function be used to test a system that has to deal with predicted time series?
You want to generate time series data that behave like some existing time series data that you have from real phenomena (weather and stock exchange). That generated time series data will be fed into some system to test its stability.
What you could do is: fit some model to your existing data, and then use that model to generate data that follow the model, and hence your existing data. Fitting data to a model yields a set of model parameters and a set of deviations (differences not explained by the model). The deviations may follow some known density function, but not necessarily. Given the model parameters and deviations, you can generate data that look like the original data. Note that if the model does not explain the data well, deviations will be large, and the data generated with the model will not look like the original data.
For example, if you know your data is linear, you fit a line through them, and your model would be:
y = M x + B + E
where E is a random variable that follows the distribution of the error around the line that fits your data, and where M and B are the model parameters. You can now use that model to generate (x, y) coordinates that are roughly linear. When sampling the random variable E, you can assume that it follows some known distribution like a normal distribution, or use a histogram, to generate deviations that follow arbitrary density functions.
There are several time series models that you could use to fit your weather and stock exchange data. You could look at exponential smoothing. It has several different models. I am sure you can find many other models on Wikipedia.
If a model does not fit your data well, you can also treat its parameters as random variables. In our example above, suppose that we have observed data where it seems that the slope is changing. We would fit several lines and obtain a distribution for M. We would then sample that variable along with E when generating data.
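To make the linear-model idea concrete, here is a minimal C++ sketch. The names (fit_line, generate_from_model) and the choice of a normal distribution for E are my own illustrative assumptions, not from the question; a histogram/bootstrap of the residuals could replace the normal draw, as noted above.

#include <cmath>
#include <random>
#include <vector>

struct LinearFit { double M; double B; double resid_stddev; };

// Least-squares fit of y = M*x + B, plus the standard deviation of the residuals.
LinearFit fit_line(const std::vector<double>& x, const std::vector<double>& y)
{
    const std::size_t n = x.size();
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    const double M = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    const double B = (sy - M * sx) / n;
    double ss = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const double e = y[i] - (M * x[i] + B);   // residual not explained by the line
        ss += e * e;
    }
    return { M, B, std::sqrt(ss / (n > 2 ? n - 2 : 1)) };
}

// Generate a synthetic series that follows the fitted model: y_i = M*x_i + B + E,
// with E drawn from a normal distribution whose spread matches the observed residuals.
std::vector<double> generate_from_model(const LinearFit& fit,
                                        const std::vector<double>& x,
                                        unsigned seed = 42)
{
    std::mt19937 gen(seed);
    std::normal_distribution<double> E(0.0, fit.resid_stddev);
    std::vector<double> y;
    y.reserve(x.size());
    for (double xi : x)
        y.push_back(fit.M * xi + fit.B + E(gen));
    return y;
}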

How to make a condition based on the number of decimals a number has with the IFS and AND functions in Google Sheets?

I'm trying to do the following in Google Sheets:
If the number in cell A1 has 4 decimals (for example "1.0001"), and if cell A1>B1 (for example A1=1.0001 and B1=0.0001), and if the string in cell C1 = "Good", then return (A1-B1)*10000.
Additionally:
If the number in cell A2 has 2 decimals (for example "1.01"), and if cell A2>B2 (for example A2=1.01 and B2=0.01), and if the string in cell C2 = "Great", then return (A2-B2)*100.
So far, I've come up with this IFS function:
=IFS((AND(A1>B1, C1="Good")), (A1-B1)*10000,(AND(A2>B2, C2="Great")),(A2-B2)*100,TRUE,"ERROR")
which handles the two pairs of conditions, A1>B1 with C1="Good" and A2>B2 with C2="Great", inside the AND formulas.
How can I add the decimal argument to the AND statement?
I thought of setting for something like :
=IFS((AND(A1>B1, C1="Good", A1=(a number with 4 decimals))), (A1-B1)*10000,(AND(A2>B2, C2="Great", A2=(a number with 2 decimals))),(A2-B2)*100,TRUE,"ERROR")
Where the statements:
A1=(a number with 4 decimals)
and
A2=(a number with 2 decimals)
would do the trick.
How do you formulate those missing "decimal statements"?
It depends on whether you mean exactly 4 decimal places, like 1.0001, or at least 4 decimals, which would include 1.000123.
You can test for decimals by using round (or roundup or rounddown): round(A1,3)<>A1 is only true when A1 has more than 3 decimal places, i.e. at least 4 (for example round(1.0001,3)=1.000, which differs from 1.0001). So for the at-least-4-decimals case:
=ifs(and(round(A1,3)<>A1,A1>B1,C1="Good"),(A1-B1)*10000,and(round(A2,1)<>A2,A2>B2,C2="Great"),(A2-B2)*100)
But if you wanted it to be exactly 4 decimals:
=ifs(and(round(A1,3)<>A1,round(A1,4)=A1,A1>B1,C1="Good"),(A1-B1)*10000,and(round(A2,1)<>A2,round(A2,2)=A2,A2>B2,C2="Great"),(A2-B2)*100)
You can solve this in one go with an array formula:
=ARRAYFORMULA(
IF((LEN(IFERROR(REGEXEXTRACT(TO_TEXT(A1:A), "\.(.*)")))=4) *
(A1:A>B1:B) * (C1:C="Good"), (A1:A-B1:B)*10000,
IF((LEN(IFERROR(REGEXEXTRACT(TO_TEXT(A1:A), "\.(.*)")))=2) *
(A1:A>B1:B) * (C1:C="Great"), (A1:A-B1:B)*100, )))

How to check if a value within a range is multiple of a value from another range?

Let's say I have a system that distributes 8820 values into 96 values, rounding with Banker's Rounding (call these pulses). The formula is:
pulse = BankerRound(8820 * i / 96), with i ∈ [0, 96)
Thus, this is the list of pulses:
0
92
184
276
368
459
551
643
735
827
919
1011
1102
1194
1286
1378
1470
1562
1654
1746
1838
1929
2021
2113
2205
2297
2389
2481
2572
2664
2756
2848
2940
3032
3124
3216
3308
3399
3491
3583
3675
3767
3859
3951
4042
4134
4226
4318
4410
4502
4594
4686
4778
4869
4961
5053
5145
5237
5329
5421
5512
5604
5696
5788
5880
5972
6064
6156
6248
6339
6431
6523
6615
6707
6799
6891
6982
7074
7166
7258
7350
7442
7534
7626
7718
7809
7901
7993
8085
8177
8269
8361
8452
8544
8636
8728
Now, suppose the system doesn't send me these pulses directly. Instead, it sends each pulse as a fraction of 8820 (call these ticks):
tick = pulse * 1/8820
The list of ticks I get becomes:
0
0.010430839
0.020861678
0.031292517
0.041723356
0.052040816
0.062471655
0.072902494
0.083333333
0.093764172
0.104195011
0.11462585
0.124943311
0.13537415
0.145804989
0.156235828
0.166666667
0.177097506
0.187528345
0.197959184
0.208390023
0.218707483
0.229138322
0.239569161
0.25
0.260430839
0.270861678
0.281292517
0.291609977
0.302040816
0.312471655
0.322902494
0.333333333
0.343764172
0.354195011
0.36462585
0.375056689
0.38537415
0.395804989
0.406235828
0.416666667
0.427097506
0.437528345
0.447959184
0.458276644
0.468707483
0.479138322
0.489569161
0.5
0.510430839
0.520861678
0.531292517
0.541723356
0.552040816
0.562471655
0.572902494
0.583333333
0.593764172
0.604195011
0.61462585
0.624943311
0.63537415
0.645804989
0.656235828
0.666666667
0.677097506
0.687528345
0.697959184
0.708390023
0.718707483
0.729138322
0.739569161
0.75
0.760430839
0.770861678
0.781292517
0.791609977
0.802040816
0.812471655
0.822902494
0.833333333
0.843764172
0.854195011
0.86462585
0.875056689
0.88537415
0.895804989
0.906235828
0.916666667
0.927097506
0.937528345
0.947959184
0.958276644
0.968707483
0.979138322
0.989569161
Unfortunately, in between these ticks it also sends me fake ticks that aren't multiples of the original pulses, such as 0.029024943, which corresponds to 256 (0.029024943 * 8820 ≈ 256), and 256 isn't in the pulse list.
How can I find from this list which ticks are valid and which are fake?
I don't have the pulse list to compare with during the process, since 8820 will change over time, so I don't have a list to compare against step by step. I need to deduce it from the ticks at each iteration.
What's the best mathematical approach to this? Maybe reasoning only in ticks and not pulses.
I've thought of finding the closest error between the nearest integer pulse and the prev/next tick. Here it is in C++:
double pulse = tick * 96.;
double prevpulse = (tick - 1/8820.) * 96.;
double nextpulse = (tick + 1/8820.) * 96.;
int pulseRounded=round(pulse);
int buffer=lrint(tick * 8820.);
double pulseABS = abs(pulse - pulseRounded);
double prevpulseABS = abs(prevpulse - pulseRounded);
double nextpulseABS = abs(nextpulse - pulseRounded);
if (nextpulseABS > pulseABS && prevpulseABS > pulseABS) {
// is pulse
}
but, for example, tick 0.0417234 (pulse 368) fails, since the previous tick's error seems closer: prevpulseABS (0.00543795) is smaller than pulseABS (0.0054464).
That's because this comparison doesn't take the rounding into account, I guess.
NEW POST:
Alright. Based on what I now understand, here's my revised answer.
You have the information you need to build a list of good values. Each time you switch to a new track:
vector<double> good_list;
good_list.reserve(96);
for(int i = 0; i < 96; i++)
good_list.push_back(BankerRound(8820.0 * i / 96.0) / 8820.0);
Then, each time you want to validate the input:
auto iter = find(good_list.begin(), good_list.end(), input);
if(iter != good_list.end()) //It's a match!
cout << "Happy days! It's a match!" << endl;
else
cout << "Oh bother. It's not a match." << endl;
The problem with mathematically determining the correct pulses is the BankerRound() function, which will introduce an ever-growing error the higher the values you input. You would then need a formula for a formula, and that's getting out of my wheelhouse. Or, you could keep track of the differences between successive values. Most of them would be the same. You'd only have to check between two possible errors. But that falls apart if you can jump tracks or jump around in one track.
OLD POST:
If I understand the question right, the only information you're getting should be coming in the form of (p/v = y) where you know 'y' (that's each element in your list of ticks you get from the device) and you know that 'p' is the Pulse and 'v' is the Values per Beat, but you don't know what either of them are. So, pulling one point of data from your post, you might have an equation like this:
p/v = 0.010430839
'v', in all the examples you've used thus far, is 8820, but from what I understand, that value is not a guaranteed constant. The next question then is: Do you have a way of determining what 'v' is before you start getting all these decimal values? If you do, you can work out mathematically what the smallest error can be (1/v) then take your decimal information, multiply it by 'v', round it to the nearest whole number and check to see if the difference between its rounded form and its non-rounded form falls in the bounds of your calculated error like so:
double input; //let input be elements in your list of doubles, such as 0.010430839
double allowed_error = 1.0 / values_per_beat;
double proposed = input * values_per_beat;
double rounded = std::round(proposed);
if(abs(rounded - proposed) < allowed_error){cout << "It's good!" << endl;}
If, however, you are not able to ascertain the values_per_beat ahead of time, then this becomes a statistical question. You must accumulate enough data samples, remove the outliers (the few that vary from the norm) and use that data. But that approach will not be realtime, and given the terms you've been using (values per beat, bpm, the value 44100), it sounds like realtime might be what you're after.
Playing around with Excel, I think you want to multiply up to (what should be) whole numbers rather than looking for closest pulses.
Tick          Pulse         i              Error                     OK
              Tick*8820     Pulse*96/8820  ABS( i - INT( i+0.05 ) )  Error < 0.01
------------  ------------  -------------  ------------------------  ------------
0.029024943   255.9999973   2.786394528    0.786394528               FALSE
0.0417234     368.000388    4.0054464      0.0054464                 TRUE
0             0             0              0                         TRUE
0.010430839   91.99999998   1.001360544    0.001360544               TRUE
0.020861678   184           2.002721088    0.002721088               TRUE
0.031292517   275.9999999   3.004081632    0.004081632               TRUE
0.041723356   367.9999999   4.005442176    0.005442176               TRUE
0.052040816   458.9999971   4.995918336    0.004081664               TRUE
0.062471655   550.9999971   5.99727888     0.00272112                TRUE
0.072902494   642.9999971   6.998639424    0.001360576               TRUE
0.083333333   734.9999971   7.999999968    3.2E-08                   TRUE
The table shows your two "problem" cases (the real wrong value, 256, and the one your code gets wrong, 368) followed by the first few "good" values.
If both 8820s vary at the same time, then obviously they will cancel out, and i will just be Tick*96.
The Error term is the difference between the calculated i and the nearest integer; if this is less than 0.01, then it is a "good" value.
NOTE: the 0.05 and 0.01 values were chosen somewhat arbitrarily (aka inspired first time guess based on the numbers): adjust if needed. Although I've only shown the first few rows, all the 96 "good" values you gave show as TRUE.
The code (completely untested) would be something like:
double pulse = tick * 8820.0 ;
double i = pulse * 96.0 / 8820.0 ;
double error = fabs( i - floor( i + 0.05 ) ) ;  // ABS( i - INT( i+0.05 ) ) from the table
if( error < 0.01 ) {                             // the 0.01 threshold used in the table
// is pulse
}
I assume you're initializing your pulses in a for-loop, using int i as the loop variable; then the problem is this line:
BankerRound(8820 * i/96);
8820 * i / 96 is an all-integer operation and the result is an integer again, cutting off the remainder (so in effect always rounding towards zero already), and BankerRound no longer has anything to round. For example, for i = 1 it yields 8820 * 1 / 96 = 91 instead of 91.875. Try this instead:
BankerRound(8820 * i / 96.0);
The same problem applies if you are trying to calculate the prev and next pulse, as you actually subtract and add 0 (again, 1/8820 is an all-integer division and results in 0).
Edit:
From what I read in the comments, the 'system' is not – as I assumed previously – modifiable. Actually, it calculates ticks in the form of n / 96.0, n ∈ [0, 96) in ℕ,
however including some kind of internal rounding apparently independent of the sample frequency, so there is some difference from the true value of n/96.0, and the ticks multiplied by 96 do not deliver exactly the integral values in [0, 96) (thanks KarstenKoop). And some of the delivered samples are simply invalid...
So the task is to detect, if tick * 96 is close enough to an integral value to be accepted as valid.
So we need to check:
double value = tick * 96.0;
bool isValid
= value - floor(value) < threshold
|| ceil(value) - value < threshold;
with some appropriately defined threshold. Assuming the values really are calculated as
double tick = round(8820*i/96.0)/8820.0;
then the maximal deviation would be slightly greater than 0.00544 (see below for a more exact value), thresholds somewhere in the sizes of 0.006, 0.0055, 0.00545, ... might be a choice.
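A short derivation of that bound (added for completeness; it is not spelled out in the original answer): the rounding changes 8820·i/96 by at most half a pulse, and one pulse corresponds to 96/8820 after scaling back:

\[
\bigl|\,96\cdot\mathrm{tick} - i\,\bigr|
  = \frac{96}{8820}\,\Bigl|\mathrm{BankerRound}\bigl(\tfrac{8820\,i}{96}\bigr) - \tfrac{8820\,i}{96}\Bigr|
  \;\le\; \frac{96}{8820}\cdot\frac{1}{2}
  = \frac{48}{8820}
  \approx 0.0054422
\]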
Rounding might be a matter of the internally used number of bits for the sensor value (if we have 13 bits available, ticks might actually be calculated as floor(8192 * i / 96.0) / 8192.0 with 8192 being 1 << 13, and floor accounting for the integer division; just a guess...).
The exact value of the maximal deviation, using 8820 as factor, as exact as representable by double, was:
0.00544217687075132516838493756949901580810546875
The multiplication by 96 is actually not necessary, you can compare directly with the threshold divided by 96, which would be:
0.0000566893424036596371706764330156147480010986328125
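A minimal sketch of that variant (the helper name is mine, not from the thread), assuming the ticks are built as described above; the default threshold is 0.006/96, i.e. one of the suggested thresholds scaled down by 96:

#include <cmath>

// Validate a tick without first scaling it up by 96: compare it against the
// nearest multiple of 1/96, using the threshold divided by 96.
bool is_valid_tick(double tick, double threshold = 0.006 / 96.0)
{
    const double nearest = std::round(tick * 96.0) / 96.0; // closest n/96
    return std::fabs(tick - nearest) < threshold;
}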

Stata: compare coefficients of factor variables using foreach (or forvalues)

I am using an ordinal independent variable in an OLS regression as a categorical variable using the factor variable technique in Stata (i.e, i.ordinal). The variable can take on values of the integers from 0 to 9, with 0 being the base category. I am interested in testing if the coefficient of each variable is greater (or less) than that which succeeds it (i.e. _b[1.ordinal] >= _b[2.ordinal], _b[2.ordinal] >= _b[3.ordinal], etc.). I've started with the following pseudocode based on FAQ: One-sided t-tests for coefficients:
foreach i in 1 2 3 5 6 7 8 {
test _b[`i'.ordinal] - _b[`i+'.ordinal] = 0
gen sign_`i'`i+' = sign(_b[`i'.ordinal] - _b[`i+'.ordinal])
display "Ho: i <= i+ p-value = " ttail(r(df_r), sign_`i'`i+'*sqrt(r(F)))
display "Ho: i >= i+ p-value = " 1-ttail(r(df_r), sign_`i'`i+'*sqrt(r(F)))
}
where I want `i+' to mean the next value of i in the sequence (so if i is 3 then `i+' is 5). Is this even possible to do? Of course, if you have any cleaner suggestions to test the coefficients in this manner, please advise.
Note: The model only uses a sub-sample of my dataset for which there are no observations for 4.ordinal, which is why I use foreach instead of forvalues. If you have suggestions for developing a general code that can be used regardless of missing variables, please advise.
There are various ways to do this. Note that there is little obvious point to creating a new variable just to hold one constant. Code not tested.
forval i = 1/8 {
local j = `i' + 1
capture test _b[`i'.ordinal] - _b[`j'.ordinal] = 0
if _rc == 0 {
local sign = sign(_b[`i'.ordinal] - _b[`j'.ordinal])
display "Ho: `i' <= `j' p-value = " ttail(r(df_r), `sign' * sqrt(r(F)))
display "Ho: `i' >= `j' p-value = " 1-ttail(r(df_r), `sign' * sqrt(r(F)))
}
}
The capture should eat errors.

WEKA : Cost Matrix Interpretation

How do we interpret the cost matrix in WEKA? If I have 2 classes to predict (class 0 and class 1) and want to penalize classification of class 0 as class 1 more (say double the penalty), what exactly is the matrix format?
Is it :
0 10
20 0
or is it
0 20
10 0
The source of confusion is the following two references:
1) The JavaDoc for Weka CostMatrix says:
The element at position i,j in the matrix is the penalty for classifying an instance of class j as class i.
2) However, the answer in this post seems to indicate otherwise.
http://weka.8497.n7.nabble.com/cost-matrix-td5821.html
Given the first cost matrix, the post says "Misclassifying an instance of class 0 incurs a cost of 10. Misclassifying an instance of class 1 is twice as costly."
Thanks.
I know my answer is coming very late, but it might help somebody so here it is:
To boost the cost of classifying an item of class 0 as class 1, the correct format is the second one.
The evidence:
Cost Matrix I used:
0 1.0
1000.0 0
Confusion matrix (from cross-validation):
a b <-- classified as
565 20 | a = ignored
54 204 | b = not_ignored
Cross-validation output:
...
Total Cost 54020
...
That's a cost of 54 * 1000 + 20 * 1 = 54020, which matches the confusion matrix above.
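As a worked check (added here, not part of the original answer), reading both matrices with rows as the true class and columns as the predicted class reproduces the reported total:

\[
\text{total cost} = 565\cdot 0 + 20\cdot 1 + 54\cdot 1000 + 204\cdot 0 = 54020
\]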