Trouble with classification using the J48 algorithm in Weka
I'm trying to use the J48 classifier in Weka, but it classified everything as 0.
This is my DataSet:
@relation 'SimpleRules-weka.filters.unsupervised.attribute.Reorder-R1,2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,19,6-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.unsupervised.attribute.Remove-R3-weka.filters.unsupervised.attribute.Remove-R1-2'
@attribute R1 numeric
@attribute R2 numeric
@attribute R3 numeric
@attribute R4 numeric
@attribute R5 numeric
@attribute R6 numeric
@attribute R7 numeric
@attribute R8 numeric
@attribute R9 numeric
@attribute Rank numeric
@attribute R1R5R6 numeric
@attribute R1R6R7 numeric
@attribute CombinedRules numeric
@attribute Demoday {0,1}
@data
1,1,1,1,1,0,1,1,0,11,12,0,0,0
1,1,1,1,0,1,0,1,0,72,1,0,0,0
0,0,0,1,0,1,1,1,0,47,7,0,0,1
1,1,0,1,1,1,0,1,1,68,12,1,0,0
1,1,1,1,1,1,0,1,1,21,7,1,0,0
1,1,1,1,0,1,1,1,1,63,11,0,1,0
1,1,0,1,0,1,1,1,0,19,7,0,1,0
1,0,1,1,0,0,1,1,1,11,7,0,0,0
0,1,1,1,0,1,0,1,0,107,12,0,0,0
1,1,1,1,0,1,0,1,0,99,12,0,0,1
0,1,1,1,0,1,1,1,1,238,2,0,0,0
1,1,1,1,1,0,1,1,0,147,7,0,0,0
1,1,1,1,1,1,0,1,1,30,7,1,0,1
1,1,1,1,0,0,0,1,1,124,5,0,0,0
0,1,1,1,1,1,0,1,1,54,5,0,0,0
0,0,0,1,0,1,0,1,0,153,5,0,0,0
1,1,1,1,0,1,0,1,1,33,5,0,0,0
1,1,1,1,1,1,0,1,0,143,3,1,0,0
1,0,1,1,0,1,1,1,0,28,3,0,1,0
0,1,1,1,0,1,1,0,0,83,8,0,0,0
1,1,1,1,1,1,1,1,0,31,7,1,1,0
1,1,1,1,0,0,0,1,0,91,12,0,0,0
1,1,1,1,0,0,1,1,0,7,7,0,0,0
1,1,1,1,0,1,0,1,1,4,1,0,0,0
1,1,0,1,1,1,0,1,0,41,1,1,0,0
0,1,1,1,0,1,0,1,1,84,5,0,0,0
1,1,0,1,0,1,1,1,0,81,1,0,1,1
0,1,1,1,1,1,0,1,1,8,6,0,0,0
1,1,1,1,0,1,1,1,1,172,11,0,1,0
1,1,1,1,1,0,0,1,1,142,12,0,0,1
0,1,1,1,0,1,1,1,1,35,11,0,0,0
1,1,1,1,0,1,0,1,1,130,11,0,0,0
1,1,1,1,0,1,1,1,1,62,7,0,1,0
0,1,1,1,0,1,1,1,1,34,7,0,0,0
0,1,1,1,0,1,1,1,1,108,3,0,0,0
0,1,1,1,0,1,0,1,1,11,12,0,0,0
0,1,1,1,1,0,0,1,1,129,3,0,0,0
1,1,0,1,0,1,1,1,1,24,10,0,1,1
1,1,1,1,0,1,1,1,0,50,8,0,1,0
1,1,1,1,1,1,0,1,1,12,12,1,0,1
0,1,1,1,1,1,1,1,0,111,3,0,0,0
1,1,0,1,0,1,0,1,1,55,11,0,0,0
1,1,1,1,0,1,0,1,1,239,11,0,0,0
0,1,1,1,1,1,0,1,0,131,2,0,0,0
1,1,1,1,0,1,0,1,1,328,8,0,0,0
1,1,1,1,0,1,1,1,1,12,12,0,1,1
1,1,1,1,0,1,0,1,1,113,8,0,0,0
0,1,1,1,0,1,0,1,0,96,1,0,0,0
1,1,1,1,0,0,0,1,1,75,7,0,0,0
1,1,1,1,1,1,0,1,1,67,1,1,0,1
1,1,1,1,1,1,0,1,0,112,11,1,0,0
1,1,1,1,0,0,1,1,1,109,3,0,0,0
1,0,1,1,1,0,0,1,0,47,12,0,0,0
1,1,1,1,0,1,0,1,1,47,7,0,0,0
1,1,1,1,0,1,0,1,1,2,6,0,0,0
0,0,0,1,0,1,1,1,0,16,2,0,0,0
1,1,1,1,0,1,0,1,0,18,12,0,0,0
0,1,1,1,1,1,1,1,0,58,3,0,0,0
0,0,0,1,1,1,0,1,0,156,7,0,0,0
1,1,1,1,1,0,1,1,0,279,2,0,0,0
1,1,1,1,0,1,0,1,0,2,12,0,0,0
0,0,1,1,0,1,0,1,1,163,6,0,0,0
1,1,1,1,1,1,0,1,1,10,3,1,0,1
0,0,1,1,1,1,0,1,0,3,12,0,0,0
1,1,1,1,1,1,0,1,1,101,7,1,0,0
1,1,1,1,0,1,0,1,0,136,9,0,0,0
0,1,1,1,1,0,0,1,0,31,8,0,0,0
1,0,1,1,1,1,0,1,0,155,8,1,0,0
0,1,1,1,0,1,1,1,0,158,12,0,0,0
0,1,0,1,0,1,0,1,0,101,1,0,0,0
0,1,0,1,0,1,0,1,1,7,7,0,0,0
1,0,0,1,1,1,0,1,0,23,1,1,0,0
1,0,0,1,1,0,0,1,1,99,1,0,0,0
1,1,1,1,1,1,0,1,1,73,3,1,0,0
1,1,1,1,1,1,0,1,0,15,3,1,0,0
1,1,1,1,0,1,1,1,0,97,8,0,1,0
1,1,1,1,0,1,1,1,1,93,8,0,1,0
1,1,1,1,1,1,1,1,0,44,7,1,1,1
0,1,1,1,0,1,0,1,0,239,7,0,0,0
0,0,0,1,1,1,0,1,1,35,1,0,0,0
0,1,1,1,0,1,0,1,0,90,12,0,0,0
1,1,1,1,1,1,0,1,1,37,7,1,0,0
1,1,1,1,1,0,0,1,1,25,12,0,0,1
1,1,1,1,0,0,0,1,0,83,2,0,0,0
1,1,1,1,1,1,1,1,1,22,10,1,1,1
1,1,1,1,1,0,1,1,1,2,10,0,0,0
1,0,1,1,0,1,1,1,1,65,5,0,1,0
0,1,1,1,0,1,1,1,1,25,3,0,0,0
1,0,1,1,0,0,1,1,0,180,8,0,0,0
0,1,0,1,0,1,1,1,1,49,10,0,0,0
0,0,1,1,0,1,0,1,0,67,8,0,0,0
1,1,1,1,1,0,1,1,0,14,11,0,0,0
1,0,0,1,1,1,0,1,0,36,11,1,0,0
0,0,0,1,0,0,1,1,1,97,9,0,0,0
0,0,0,1,0,1,1,1,0,193,1,0,0,0
0,0,1,1,1,1,1,1,1,83,6,0,0,0
0,1,1,1,0,1,0,1,1,13,12,0,0,0
1,1,1,1,0,1,0,1,0,49,5,0,0,0
1,0,1,1,1,1,1,1,1,1,8,1,1,1
1,0,1,1,0,1,0,1,1,159,10,0,0,0
1,1,1,1,1,1,1,1,0,51,7,1,1,0
1,1,1,1,1,1,0,1,1,168,6,1,0,0
0,1,1,1,0,1,0,1,1,100,5,0,0,0
0,0,0,1,0,0,1,1,0,30,3,0,0,0
1,1,0,1,0,1,1,1,0,27,12,0,1,0
1,1,1,1,0,1,0,1,1,34,11,0,0,0
0,1,0,1,0,1,1,1,1,101,3,0,0,0
1,0,1,1,0,1,0,1,1,111,11,0,0,0
1,1,1,1,1,1,0,1,0,51,2,1,0,0
1,1,1,1,0,0,0,1,0,233,12,0,0,0
1,1,1,1,1,0,0,1,1,98,11,0,0,0
0,1,1,1,0,1,0,1,0,24,1,0,0,0
1,1,1,1,0,0,1,1,1,181,2,0,0,0
1,1,1,1,1,1,0,1,1,14,6,1,0,0
0,1,1,1,1,1,1,1,1,96,1,0,0,0
1,1,1,1,0,1,1,1,1,139,12,0,1,1
1,1,1,1,1,1,1,1,1,155,8,1,1,0
1,1,1,1,1,1,0,1,0,53,7,1,0,1
0,1,1,1,0,1,1,1,0,17,8,0,0,0
1,1,1,1,0,1,1,1,0,39,6,0,1,0
0,0,1,1,0,1,0,1,0,282,12,0,0,0
1,0,1,1,1,0,0,1,1,132,7,0,0,0
1,1,1,1,0,0,0,1,0,57,11,0,0,0
1,0,0,1,0,1,1,1,1,165,7,0,1,1
0,1,0,1,0,1,1,1,1,74,10,0,0,0
0,1,1,1,0,1,1,1,0,150,7,0,0,0
1,0,1,1,1,1,0,1,1,53,2,1,0,0
1,1,1,1,1,1,0,1,1,42,12,1,0,1
1,1,1,1,1,0,1,1,1,234,7,0,0,0
1,1,1,1,0,0,1,1,1,164,10,0,0,0
1,1,1,1,0,0,0,1,0,69,3,0,0,0
1,1,1,1,0,0,1,1,0,38,5,0,0,0
1,0,0,1,0,1,0,1,1,56,7,0,0,0
1,1,0,1,0,1,0,1,1,63,1,0,0,0
1,1,1,1,1,1,1,1,0,9,1,1,1,0
1,0,1,1,0,1,0,1,1,23,11,0,0,0
1,1,1,1,1,0,0,1,0,46,7,0,0,0
1,1,1,1,0,0,0,1,1,59,12,0,0,0
1,1,0,1,0,1,0,1,1,27,1,0,0,0
0,1,1,1,1,1,0,1,1,4,12,0,0,0
1,1,0,1,0,1,0,1,0,132,12,0,0,0
1,1,0,1,1,1,1,1,1,78,5,1,1,0
1,1,1,1,0,1,1,1,1,32,12,0,1,0
0,1,1,1,1,1,0,1,0,104,7,0,0,0
1,1,1,1,0,1,1,1,0,117,12,0,1,0
0,1,0,1,0,1,0,1,1,185,7,0,0,0
1,1,0,1,0,1,1,1,0,38,4,0,1,0
1,1,0,1,1,0,1,1,1,8,12,0,0,0
0,1,1,1,1,1,1,1,0,80,4,0,0,1
1,0,0,1,1,0,1,1,0,12,11,0,0,0
0,0,1,1,0,1,1,1,0,70,12,0,0,0
1,1,1,1,1,0,0,1,1,76,3,0,0,0
0,1,1,1,0,1,1,1,0,23,11,0,0,0
1,1,0,1,1,1,0,1,0,40,7,1,0,0
1,1,1,1,0,0,1,1,1,159,12,0,0,0
1,1,1,1,0,1,0,1,0,49,12,0,0,1
0,0,1,1,0,1,1,1,1,37,7,0,0,1
1,1,0,1,0,1,1,1,1,147,9,0,1,0
1,1,1,1,0,0,0,1,0,87,3,0,0,0
1,1,1,1,1,1,0,1,0,7,1,1,0,0
0,0,1,1,0,1,1,1,0,167,3,0,0,0
0,1,1,1,0,1,1,1,0,6,3,0,0,0
0,1,1,1,1,1,0,1,0,39,7,0,0,0
1,1,1,1,1,0,0,1,1,88,11,0,0,0
0,0,1,1,1,1,1,1,0,175,12,0,0,0
1,1,1,0,0,1,0,1,0,127,12,0,0,0
1,1,1,1,0,1,1,1,0,1,11,0,1,0
1,1,1,1,0,0,0,1,1,77,7,0,0,0
1,1,1,1,1,0,0,1,1,122,5,0,0,0
1,0,1,1,0,1,1,1,0,155,8,0,1,1
1,1,0,0,0,1,1,1,1,114,4,0,1,0
0,1,1,1,1,0,0,1,0,106,7,0,0,1
1,1,1,1,1,1,1,1,0,16,7,1,1,1
1,0,0,1,0,1,0,1,0,176,6,0,0,0
1,0,1,1,0,0,1,1,1,47,2,0,0,0
0,0,0,1,0,1,0,1,0,95,6,0,0,0
1,1,1,1,0,1,1,1,0,233,11,0,1,0
1,1,1,1,0,1,1,1,0,27,1,0,1,0
1,1,1,1,0,1,0,1,1,85,8,0,0,1
0,0,0,0,0,1,0,1,1,58,3,0,0,0
1,0,1,1,1,1,0,1,1,102,11,1,0,0
1,1,1,1,1,0,0,1,1,33,12,0,0,0
0,1,1,1,0,1,0,1,0,92,12,0,0,0
1,0,1,1,1,1,0,1,0,20,5,1,0,0
1,1,1,1,1,1,1,1,1,8,8,1,1,1
1,1,1,1,1,1,1,1,1,3,12,1,1,0
1,1,0,1,0,1,1,1,0,16,12,0,1,0
1,1,1,1,0,1,1,1,0,143,12,0,1,0
1,1,0,1,0,1,0,1,1,84,3,0,0,0
1,1,1,1,0,1,1,1,1,149,7,0,1,0
1,1,1,1,0,0,1,1,0,14,3,0,0,0
1,0,1,1,0,1,1,1,0,37,9,0,1,0
0,1,1,1,0,0,0,1,1,137,1,0,0,0
1,0,1,1,0,1,0,1,1,121,1,0,0,0
1,0,0,1,0,1,1,1,1,21,3,0,1,0
1,1,1,1,1,0,1,1,1,23,5,0,0,0
1,0,1,1,0,1,1,1,0,40,11,0,1,0
1,1,1,1,0,1,1,1,1,82,6,0,1,1
1,1,1,1,0,0,1,1,1,106,12,0,0,0
0,0,1,1,1,0,0,1,1,62,7,0,0,0
1,1,1,1,0,1,0,1,0,90,1,0,0,0
1,1,1,1,0,1,1,1,0,26,12,0,1,1
0,1,1,1,0,1,1,1,0,49,11,0,0,0
0,1,1,1,0,1,0,1,1,67,7,0,0,0
1,1,1,1,0,0,1,1,1,120,3,0,0,0
1,1,1,1,1,1,1,1,0,92,1,1,1,0
1,1,0,1,1,1,0,1,0,22,5,1,0,1
1,1,1,1,0,0,1,1,0,130,1,0,0,0
1,1,1,1,0,1,1,1,1,135,3,0,1,0
1,1,0,1,0,1,0,1,1,94,6,0,0,0
0,1,1,1,1,0,0,1,0,63,3,0,0,0
1,1,1,1,0,1,1,1,1,40,3,0,1,0
1,1,1,1,0,1,0,1,1,512,12,0,0,0
1,1,0,1,0,1,0,1,1,60,10,0,0,1
0,0,0,1,0,0,1,1,0,154,11,0,0,0
1,1,1,1,1,0,1,1,0,117,3,0,0,1
1,1,1,1,1,1,0,1,1,198,3,1,0,0
1,0,1,1,1,1,1,1,0,51,2,1,1,0
1,0,0,1,1,0,1,1,1,53,1,0,0,0
1,1,0,1,0,1,1,1,0,115,12,0,1,0
1,1,1,1,1,0,0,1,1,86,1,0,0,0
1,1,1,1,1,1,1,1,0,65,5,1,1,0
0,1,1,1,1,1,1,1,1,51,6,0,0,0
1,1,1,1,0,0,0,1,0,41,2,0,0,0
1,1,1,1,0,1,1,1,1,104,3,0,1,0
0,1,1,1,1,0,0,1,0,44,9,0,0,0
1,0,0,1,1,0,0,1,1,145,2,0,0,0
1,1,1,1,0,1,1,1,0,199,10,0,1,1
1,1,0,1,0,1,1,1,1,3,3,0,1,0
1,1,1,1,0,0,0,1,0,10,5,0,0,0
1,1,1,1,1,1,0,1,1,81,7,1,0,0
1,1,0,0,0,1,0,1,0,164,6,0,0,0
0,1,1,1,0,1,1,1,1,122,5,0,0,0
1,1,1,1,1,1,0,1,1,188,3,1,0,0
1,1,1,1,0,0,0,1,1,149,5,0,0,0
1,1,1,1,0,0,0,1,1,152,12,0,0,0
1,1,1,1,0,1,1,1,1,5,10,0,1,0
1,0,1,1,1,0,0,1,0,35,5,0,0,0
1,1,1,1,0,0,1,1,1,12,4,0,0,0
And here are the results after running the J48 algorithm with 10-fold cross-validation:
Correctly Classified Instances 205 85.7741 %
Incorrectly Classified Instances 34 14.2259 %
Kappa statistic 0.0266
Mean absolute error 0.2346
Root mean squared error 0.3465
Relative absolute error 100.0672 %
Root relative squared error 101.7226 %
Coverage of cases (0.95 level) 99.5816 %
Mean rel. region size (0.95 level) 99.3724 %
Total Number of Instances 239
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0,986 0,969 0,868 0,986 0,923 0,044 0,492 0,865 0
0,031 0,014 0,250 0,031 0,056 0,044 0,492 0,135 1
Weighted Avg. 0,858 0,841 0,785 0,858 0,807 0,044 0,492 0,767
=== Confusion Matrix ===
a b <-- classified as
204 3 | a = 0
31 1 | b = 1
And the tree that it generates is this: https://www.dropbox.com/s/qzjukr8klffwl90/Captura%20de%20pantalla%202014-04-15%2022.33.38.png
I hope you can help me solve this issue.
Thanks a lot.
This is a classical class imbalance problem. Have a look at your class distribution: Demoday = 1 has 32 records, Demoday = 0 has 207 records. Almost every machine learning algorithm is designed to achieve the best overall accuracy. So, in your case, a tree that assigns 0 to (almost) every instance already reaches about 86% accuracy, and that is essentially the one-leaf tree you get. The problem, of course, is that the minority class is usually the one of greater interest. Google "unbalanced class" or "class imbalance" for more info.
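As one concrete illustration of a quick mitigation (a minimal sketch only, assuming the ARFF above is saved as SimpleRules.arff; exact method names can vary across Weka versions), you could wrap J48 in a FilteredClassifier with the supervised Resample filter, so each training fold is rebalanced towards a uniform class distribution:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.supervised.instance.Resample;

public class DemodayImbalance {
    public static void main(String[] args) throws Exception {
        // "SimpleRules.arff" is a placeholder path for the dataset above.
        Instances data = DataSource.read("SimpleRules.arff");
        data.setClassIndex(data.numAttributes() - 1); // Demoday is the last attribute

        // Supervised resampling biased towards a uniform class distribution;
        // wrapping it in FilteredClassifier means only training folds get rebalanced.
        Resample resample = new Resample();
        resample.setBiasToUniformClass(1.0);
        resample.setSampleSizePercent(100.0);

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(resample);
        fc.setClassifier(new J48());

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(fc, data, 10, new Random(1));
        System.out.println(eval.toMatrixString());
        System.out.println(eval.toClassDetailsString());
    }
}

Looking at the per-class recall and the confusion matrix, rather than raw accuracy, then shows whether the Demoday = 1 class is actually being picked up.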
I have already posted some quick solutions, as well as some model performance indicators suitable for this case, as an answer here: how to edit weka configurations to find "1"
Good luck
Related
Function for testing system stability, which receives predicted time series as input
I want to write a function that gets a time series and a standard deviation as parameters and returns an adjusted time series which looks like a forecast. With this function I want to test a system for stability which gets a forecasted weather time series as an input parameter. My approach for such a function is described below:

vector<tuple<datetime, double>> get_adjusted_timeseries(vector<tuple<datetime, double>>& timeseries_original, const double stddev, const double dist_mid)
{
    auto timeseries_copy(timeseries_original);
    int sign = randInRange(0, 1) == 0 ? 1 : -1;
    auto left_limit = normal_cdf_inverse(0.5 - dist_mid, 0, stddev);
    auto right_limit = normal_cdf_inverse(0.5 + dist_mid, 0, stddev);
    for (auto& pair : timeseries_copy)
    {
        double nd_value;
        do
        {
            nd_value = normal_distribution_r(0, stddev);
        } while (sign == -1 && nd_value > 0.0 || sign == 1 && nd_value < 0.0);
        pair = make_tuple(get<0>(pair), get<1>(pair) + (nd_value / 100) * get<1>(pair));
        if (nd_value > 0.0 && nd_value < right_limit || nd_value < 0.0 && nd_value > left_limit)
        {
            sign = sign == -1 ? 1 : -1;
        }
    }
    return timeseries_copy;
}

Make a copy of the original time series, which is also of type vector<tuple<datetime, double>>.
Get a random number that is either 0 or 1 and use it to set the sign.
Use the inverse cumulative distribution function to get the limits, which indicate when the sign is changed. The sign is changed when the value of the copied time series is close to the original value. The implementation of the inverse CDF is shown here.
For each item in the time series:
  get a normally distributed value, which should be below zero when sign == -1 and above zero when sign == 1
  adjust the old value of the time series according to the normally distributed value
  change the sign if the normally distributed value is close to the original value
The result for a low standard deviation, for example, can be seen here in yellow. If the mean absolute percentage error (MAPE) of the two time series is calculated, the following relationship results:
stddev: 5 -> MAPE: ~0.04
stddev: 10 -> MAPE: ~0.08
stddev: 15 -> MAPE: ~0.12
stddev: 20 -> MAPE: ~0.16
What do you think of this approach? Can this function be used to test a system that has to deal with predicted time series?
You want to generate time series data that behaves like some existing time series data that you have from real phenomena (weather and stock exchange). That generated time series data will be fed into some system to test its stability. What you could do is: fit some model to your existing data, and then use that model to generate data that follow the model, and hence your existing data. Fitting data to a model yields a set of model parameters and a set of deviations (differences not explained by the model). The deviations may follow some known density function, but not necessarily. Given the model parameters and deviations, you can generate data that look like the original data. Note that if the model does not explain the data well, the deviations will be large, and the data generated with the model will not look like the original data.

For example, if you know your data is linear, you fit a line through it, and your model would be:

y = M x + B + E

where E is a random variable that follows the distribution of the error around the line that fits your data, and where M and B are the model parameters. You can now use that model to generate (x, y) coordinates that are roughly linear. When sampling the random variable E, you can assume that it follows some known distribution like a normal distribution, or use a histogram, to generate deviations that follow arbitrary density functions.

There are several time series models that you could use to fit your weather and stock exchange data. You could look at exponential smoothing; it has several different models. I am sure you can find many other models on Wikipedia.

If a model does not fit your data well, you can also see its parameters as random variables. In our example above, suppose that we have observed data where it seems that the slope is changing. We would fit several lines and obtain a distribution for M. We would then sample that variable along with E when generating data.
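A minimal sketch of that idea (the slope, intercept, and noise level below are made-up illustrative numbers, not values fitted to any real data):

import java.util.Random;

public class LinearModelSampler {
    public static void main(String[] args) {
        // Assumed, made-up parameters standing in for a fitted model y = M*x + B + E.
        double M = 0.8;           // fitted slope
        double B = 3.0;           // fitted intercept
        double errorStddev = 2.0; // spread of the residuals around the fitted line

        Random rng = new Random(42);
        for (int x = 0; x < 10; x++) {
            // Sample the deviation E from a normal distribution; a histogram of the
            // observed residuals could be used instead for arbitrary error shapes.
            double e = rng.nextGaussian() * errorStddev;
            double y = M * x + B + e;
            System.out.printf("x = %d, y = %.3f%n", x, y);
        }
    }
}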
How to make a condition based on the number of decimals a number has with the IFS and AND functions in Google Sheets?
I'm trying to do the following in Google Sheets: If the number in cell A1 has 4 decimals (for example "1.0001"), and if cell A1>B1 (for example A1=1.0001 and B1=0.0001), and if the string in cell C1 = "Good", then return (A1-B1)*10000. Additionally: If the number in cell A2 has 2 decimals (for example "1.01"), and if cell A2>B2 (for example A2=1.01 and B2=0.01), and if the string in cell C2 = "Great", then return (A2-B2)*100. So far, I've come up with this IFS function: =IFS((AND(A1>B1, C1="Good")), (A1-B1)*10000,(AND(A2>B2, C2="Great")),(A2-B2)*100,TRUE,"ERROR") which handles the two conditions A1>B1, C1="Good" / A2>B2, C2="Great" within the AND formula. How can I add the decimal condition to the AND statement? I thought of something like: =IFS((AND(A1>B1, C1="Good", A1=(a number with 4 decimals))), (A1-B1)*10000,(AND(A2>B2, C2="Great", A2=(a number with 2 decimals))),(A2-B2)*100,TRUE,"ERROR") where the statements A1=(a number with 4 decimals) and A2=(a number with 2 decimals) would do the trick. How do you formulate those missing "decimal statements"?
It depends whether you mean exactly 4 decimal places, like 1.0001, or at least 4 decimals, which would include 1.000123. You can test for decimals by using ROUND (or ROUNDUP or ROUNDDOWN), so for the "at least 4 decimals" case: =ifs(and(round(A1,3)<>A1,A1>B1,C1="Good"),(A1-B1)*10000,and(round(A2,1)<>A2,A2>B2,C2="Great"),(A2-B2)*100) But if you want it to be exactly 4 decimals: =ifs(and(round(A1,3)<>A1,round(A1,4)=A1,A1>B1,C1="Good"),(A1-B1)*10000,and(round(A2,1)<>A2,round(A2,2)=A2,A2>B2,C2="Great"),(A2-B2)*100)
You can solve this in one go with an array formula: =ARRAYFORMULA( IF((LEN(IFERROR(REGEXEXTRACT(TO_TEXT(A1:A), "\.(.*)")))=4) * (A1:A>B1:B) * (C1:C="Good"), (A1:A-B1:B)*10000, IF((LEN(IFERROR(REGEXEXTRACT(TO_TEXT(A1:A), "\.(.*)")))=2) * (A1:A>B1:B) * (C1:C="Great"), (A1:A-B1:B)*100, )))
How to check if a value within a range is a multiple of a value from another range?
Let's say I have a system that distributes 8820 values into 96 values, rounding with banker's rounding (call these pulses). The formula is: pulse = BankerRound(8820 * i/96), with i ∈ [0, 96). Thus, this is the list of pulses:

0 92 184 276 368 459 551 643 735 827 919 1011 1102 1194 1286 1378 1470 1562 1654 1746 1838 1929 2021 2113 2205 2297 2389 2481 2572 2664 2756 2848 2940 3032 3124 3216 3308 3399 3491 3583 3675 3767 3859 3951 4042 4134 4226 4318 4410 4502 4594 4686 4778 4869 4961 5053 5145 5237 5329 5421 5512 5604 5696 5788 5880 5972 6064 6156 6248 6339 6431 6523 6615 6707 6799 6891 6982 7074 7166 7258 7350 7442 7534 7626 7718 7809 7901 7993 8085 8177 8269 8361 8452 8544 8636 8728

Now, suppose the system doesn't send me these pulses directly. Instead, it sends each pulse as a fraction of 8820 (call these ticks): tick = value * 1/8820. The list of ticks I get becomes:

0 0.010430839 0.020861678 0.031292517 0.041723356 0.052040816 0.062471655 0.072902494 0.083333333 0.093764172 0.104195011 0.11462585 0.124943311 0.13537415 0.145804989 0.156235828 0.166666667 0.177097506 0.187528345 0.197959184 0.208390023 0.218707483 0.229138322 0.239569161 0.25 0.260430839 0.270861678 0.281292517 0.291609977 0.302040816 0.312471655 0.322902494 0.333333333 0.343764172 0.354195011 0.36462585 0.375056689 0.38537415 0.395804989 0.406235828 0.416666667 0.427097506 0.437528345 0.447959184 0.458276644 0.468707483 0.479138322 0.489569161 0.5 0.510430839 0.520861678 0.531292517 0.541723356 0.552040816 0.562471655 0.572902494 0.583333333 0.593764172 0.604195011 0.61462585 0.624943311 0.63537415 0.645804989 0.656235828 0.666666667 0.677097506 0.687528345 0.697959184 0.708390023 0.718707483 0.729138322 0.739569161 0.75 0.760430839 0.770861678 0.781292517 0.791609977 0.802040816 0.812471655 0.822902494 0.833333333 0.843764172 0.854195011 0.86462585 0.875056689 0.88537415 0.895804989 0.906235828 0.916666667 0.927097506 0.937528345 0.947959184 0.958276644 0.968707483 0.979138322 0.989569161

Unfortunately, between these ticks it also sends me fake ticks that don't correspond to the original pulses, such as 0.029024943, which corresponds to 256, a value that isn't in the pulse list. How can I find from this list which ticks are valid and which are fake? I don't have the pulse list to compare with during the process, since 8820 will change over time, so I don't have a list to compare against step by step; I need to deduce it from the ticks at each iteration. What's the best mathematical approach to this? Maybe reasoning only in ticks and not pulses. I've thought of checking whether the current tick is closer to the nearest integer pulse than the previous/next tick would be. Here it is in C++:

double pulse = tick * 96.;
double prevpulse = (tick - 1/8820.) * 96.;
double nextpulse = (tick + 1/8820.) * 96.;
int pulseRounded=round(pulse);
int buffer=lrint(tick * 8820.);
double pulseABS = abs(pulse - pulseRounded);
double prevpulseABS = abs(prevpulse - pulseRounded);
double nextpulseABS = abs(nextpulse - pulseRounded);
if (nextpulseABS > pulseABS && prevpulseABS > pulseABS) {
    // is pulse
}

But, for example, tick 0.0417234 (pulse 368) fails, since the previous tick's error seems to be closer: the prevpulseABS error (0.00543795) is smaller than the pulseABS error (0.0054464). That's because this comparison doesn't take the rounding into account, I guess.
NEW POST: Alright. Based on what I now understand, here's my revised answer. You have the information you need to build a list of good values. Each time you switch to a new track:

vector<double> good_list;
good_list.reserve(96);
for(int i = 0; i < 96; i++)
    good_list.push_back(BankerRound(8820.0 * i / 96.0) / 8820.0);

Then, each time you want to validate the input:

auto iter = find(good_list.begin(), good_list.end(), input);
if(iter != good_list.end()) // It's a match!
    cout << "Happy days! It's a match!" << endl;
else
    cout << "Oh bother. It's not a match." << endl;

The problem with mathematically determining the correct pulses is the BankerRound() function, which will introduce an ever-growing error the higher the values you input. You would then need a formula for a formula, and that's getting out of my wheelhouse. Or, you could keep track of the differences between successive values. Most of them would be the same; you'd only have to check between two possible errors. But that falls apart if you can jump tracks or jump around within one track.

OLD POST: If I understand the question right, the only information you're getting should be coming in the form of (p/v = y), where you know 'y' (each element in the list of ticks you get from the device) and you know that 'p' is the pulse and 'v' is the values per beat, but you don't know what either of them are. So, pulling one point of data from your post, you might have an equation like this:

p/v = 0.010430839

'v', in all the examples you've used so far, is 8820, but from what I understand, that value is not a guaranteed constant. The next question then is: do you have a way of determining what 'v' is before you start getting all these decimal values? If you do, you can work out mathematically what the smallest error can be (1/v), then take your decimal value, multiply it by 'v', round it to the nearest whole number and check whether the difference between its rounded form and its non-rounded form falls within the bounds of your calculated error, like so:

double input; // let input be an element of your list of doubles, such as 0.010430839
double allowed_error = 1.0 / values_per_beat;
double proposed = input * values_per_beat;
double rounded = std::round(proposed);
if(abs(rounded - proposed) < allowed_error){cout << "It's good!" << endl;}

If, however, you are not able to ascertain values_per_beat ahead of time, then this becomes a statistical question. You must accumulate enough data samples, remove the outliers (the few that vary from the norm) and use that data. But that approach will not be realtime, and given the terms you've been using (values per beat, bpm, the value 44100), it sounds like realtime might be what you're after.
Playing around with Excel, I think you want to multiply up to (what should be) whole numbers rather than looking for the closest pulses.

Tick          Pulse          i              Error                      OK
              Tick*8820      Pulse*96/8820  ABS( i - INT( i+0.05 ) )   Error < 0.01
------------  ------------   -------------  ------------------------   ------------
0.029024943   255.9999973    2.786394528    0.786394528                FALSE
0.0417234     368.000388     4.0054464      0.0054464                  TRUE
0             0              0              0                          TRUE
0.010430839   91.99999998    1.001360544    0.001360544                TRUE
0.020861678   184            2.002721088    0.002721088                TRUE
0.031292517   275.9999999    3.004081632    0.004081632                TRUE
0.041723356   367.9999999    4.005442176    0.005442176                TRUE
0.052040816   458.9999971    4.995918336    0.004081664                TRUE
0.062471655   550.9999971    5.99727888     0.00272112                 TRUE
0.072902494   642.9999971    6.998639424    0.001360576                TRUE
0.083333333   734.9999971    7.999999968    3.2E-08                    TRUE

The table shows your two "problem" cases (the real wrong value, 256, and the one your code gets wrong, 368) followed by the first few "good" values. If both 8820s vary at the same time, then obviously they will cancel out, and i will just be Tick*96. The Error term is the difference between the calculated i and the nearest integer; if it is less than 0.01, then it is a "good" value. NOTE: the 0.05 and 0.01 values were chosen somewhat arbitrarily (aka an inspired first-time guess based on the numbers): adjust if needed. Although I've only shown the first few rows, all 96 of the "good" values you gave show as TRUE. The code (completely untested) would be something like:

double pulse = tick * 8820.0 ;
double i = pulse * 96.0 / 8820.0 ;
double error = abs( i - floor( i + 0.05 ) ) ;
if( error < 0.05 )
{
    // is pulse
}
I assume you're initializing your pulses in a for-loop, using int i as the loop variable; then the problem is this line:

BankerRound(8820 * i/96);

8820 * i / 96 is an all-integer operation and the result is an integer again, cutting off the remainder (so in effect always rounding towards zero already), and BankerRound actually has nothing left to round. Try this instead:

BankerRound(8820 * i / 96.0);

The same problem applies if you are trying to calculate the previous and next pulse, as you actually subtract and add 0 (again, 1/8820 is all-integer and results in 0).

Edit: From what I read in the comments, the 'system' is not – as I assumed previously – modifiable. Actually, it calculates ticks in the form of n / 96.0, n ∊ [0, 96) in ℕ, however including some kind of internal rounding apparently independent of the sample frequency, so there is some difference to the true value of n/96.0, and the ticks multiplied by 96 do not deliver exactly the integral values in [0, 96) (thanks KarstenKoop). And some of the delivered samples are simply invalid... So the task is to detect whether tick * 96 is close enough to an integral value to be accepted as valid. So we need to check:

double value = tick * 96.0;
bool isValid = value - floor(value) < threshold || ceil(value) - value < threshold;

with some appropriately defined threshold. Assuming the values really are calculated as

double tick = round(8820*i/96.0)/8820.0;

then the maximal deviation would be slightly greater than 0.00544 (see below for a more exact value); thresholds somewhere around 0.006, 0.0055, 0.00545, ... might be a choice. The rounding might be a matter of the number of bits used internally for the sensor value (if we have 13 bits available, ticks might actually be calculated as floor(8192 * i / 96.0) / 8192.0, with 8192 being 1 << 13 and floor accounting for the integer division; just a guess...).

The exact value of the maximal deviation, using 8820 as the factor, as exactly as representable by a double, was:

0.00544217687075132516838493756949901580810546875

The multiplication by 96 is actually not necessary; you can compare directly with the threshold divided by 96, which would be:

0.0000566893424036596371706764330156147480010986328125
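As a small cross-check of that figure (an illustrative sketch only; Math.rint is used because it rounds half to even, standing in for BankerRound here):

public class MaxTickDeviation {
    public static void main(String[] args) {
        double maxDev = 0.0;
        for (int i = 0; i < 96; i++) {
            // Math.rint rounds half to even, mimicking banker's rounding.
            double tick = Math.rint(8820.0 * i / 96.0) / 8820.0;
            double dev = Math.abs(tick * 96.0 - i);
            if (dev > maxDev) {
                maxDev = dev;
            }
        }
        // Prints a value of roughly 0.00544, in line with the threshold discussed above.
        System.out.println("max deviation: " + maxDev);
        System.out.println("max deviation / 96: " + maxDev / 96.0);
    }
}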
Stata: compare coefficients of factor variables using foreach (or forvalues)
I am using an ordinal independent variable in an OLS regression as a categorical variable using the factor variable technique in Stata (i.e., i.ordinal). The variable can take on integer values from 0 to 9, with 0 being the base category. I am interested in testing whether the coefficient of each level is greater (or less) than the one that succeeds it (i.e., _b[1.ordinal] >= _b[2.ordinal], _b[2.ordinal] >= _b[3.ordinal], etc.). I've started with the following pseudocode based on the FAQ "One-sided t-tests for coefficients":

foreach i in 1 2 3 5 6 7 8 {
    test _b[`i'.ordinal] - _b[`i+'.ordinal] = 0
    gen sign_`i'`i+' = sign(_b[`i'.ordinal] - _b[`i+'.ordinal])
    display "Ho: i <= i+ p-value = " ttail(r(df_r), sign_`i'`i+'*sqrt(r(F)))
    display "Ho: i >= i+ p-value = " 1-ttail(r(df_r), sign_`i'`i+'*sqrt(r(F)))
}

where I want `i+' to mean the next value of i in the sequence (so if i is 3 then `i+' is 5). Is this even possible to do? Of course, if you have any cleaner suggestions for testing the coefficients in this manner, please advise. Note: The model only uses a sub-sample of my dataset for which there are no observations for 4.ordinal, which is why I use foreach instead of forvalues. If you have suggestions for developing general code that can be used regardless of missing levels, please advise.
There are various ways to do this. Note that there is little obvious point to creating a new variable just to hold one constant. Code not tested.

forval i = 1/8 {
    local j = `i' + 1
    capture test _b[`i'.ordinal] - _b[`j'.ordinal] = 0
    if _rc == 0 {
        local sign = sign(_b[`i'.ordinal] - _b[`j'.ordinal])
        display "Ho: `i' <= `j' p-value = " ttail(r(df_r), `sign' * sqrt(r(F)))
        display "Ho: `i' >= `j' p-value = " 1-ttail(r(df_r), `sign' * sqrt(r(F)))
    }
}

The capture should eat errors.
WEKA : Cost Matrix Interpretation
How do we interpret the cost matrix in WEKA? If I have 2 classes to predict (class 0 and class 1) and want to penalize classification of class 0 as class 1 more (say, double the penalty), what exactly is the matrix format? Is it:

0 10
20 0

or is it:

0 20
10 0

The source of confusion is the following two references: 1) The JavaDoc for Weka CostMatrix says: "The element at position i,j in the matrix is the penalty for classifying an instance of class j as class i." 2) However, the answer in this post seems to indicate otherwise: http://weka.8497.n7.nabble.com/cost-matrix-td5821.html Given the first cost matrix, the post says "Misclassifying an instance of class 0 incurs a cost of 10. Misclassifying an instance of class 1 is twice as costly." Thanks.
I know my answer is coming very late, but it might help somebody, so here it is: To boost the cost of classifying an item of class 0 as class 1, the correct format is the second one. The evidence: Cost matrix I used:

0      1.0
1000.0 0

Confusion matrix (from cross-validation):

a b <-- classified as
565 20 | a = ignored
54 204 | b = not_ignored

Cross-validation output:

... Total Cost 54020 ...

That's a cost of 54 * 1000 + 20 * 1, which matches the confusion matrix above.
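For completeness, here is a rough sketch of how such a cost matrix could be set up through the API rather than the GUI (illustrative only: data.arff is a placeholder path, and CostMatrix method names such as setCell may differ between Weka releases, so check the API of your version):

import java.util.Random;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostMatrixDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Rows = actual class, columns = predicted class (the convention the
        // evidence above supports): misclassifying an actual class-1 instance
        // as class 0 costs 1000, the opposite mistake costs 1.
        CostMatrix costs = new CostMatrix(2);
        costs.setCell(0, 1, 1.0);
        costs.setCell(1, 0, 1000.0);

        CostSensitiveClassifier csc = new CostSensitiveClassifier();
        csc.setClassifier(new J48());
        csc.setCostMatrix(costs);

        Evaluation eval = new Evaluation(data, costs);
        eval.crossValidateModel(csc, data, 10, new Random(1));
        System.out.println(eval.toMatrixString());
        System.out.println("Total cost: " + eval.totalCost());
    }
}

With the answer's numbers, this row = actual / column = predicted reading gives exactly the reported total cost of 54 * 1000 + 20 * 1 = 54020.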