I've run JRip and I got a set of rules in this form:
(something >= 0.076) and (someotherthing >= 0.013) => class=1 (944.0/42.0)
I'm trying to understand what the numbers 944.0 and 42.0 mean.
I'm almost sure 944.0 is the number of instances covered by the rule, but I can't really understand the second one.
The second number is the number of instances misclassified by the rule.
The result (A/B) indicates:
A: the weight of all instances supported by the rule
B: the weight of all instances misclassified
(A - B): the weight of all instances correctly classified, thus A >= B
Note that, if you are not using weights, each instance has weight 1, and the value represents the number of instances.
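As a quick illustration of the (A/B) reading above, here is a minimal Python sketch that pulls the two numbers out of the rule text and derives the correctly classified weight. The helper name `parse_jrip_counts` is made up for this example and is not part of Weka.

```python
import re

def parse_jrip_counts(rule: str):
    """Extract (covered, misclassified) from the trailing '(A/B)' of a JRip rule string."""
    match = re.search(r"\(([\d.]+)/([\d.]+)\)\s*$", rule)
    if match is None:
        raise ValueError("no (A/B) suffix found")
    return float(match.group(1)), float(match.group(2))

rule = "(something >= 0.076) and (someotherthing >= 0.013) => class=1 (944.0/42.0)"
covered, wrong = parse_jrip_counts(rule)

correct = covered - wrong          # A - B: weight of correctly classified instances
rule_accuracy = correct / covered  # share of covered instances the rule gets right

print(covered, wrong, correct, round(rule_accuracy, 4))
```

With unweighted data this reads as: the rule fires on 944 instances and gets 42 of them wrong.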
Can someone help me interpret the AWS Personalize solution version metrics in layman’s terms or, at the very least, tell me what these metrics should ideally look like?
I have no knowledge of Machine Learning and wanted to take advantage of Personalize as it is marketed as a 'no-previous-knowledge-required' ML SaaS. However, the “Solution version metrics” in my solution results seem to require a fairly high level of math knowledge.
My Solution version metrics are as follows:
Normalized discounted cumulative gain
At 5: 0.9881, At 10: 0.9890, At 25: 0.9898
Precision
At 5: 0.1981, At 10: 0.0993, At 25: 0.0399
Mean reciprocal rank
At 25: 0.9833
Research
I have looked through the Personalize Developer's Guide which includes a short definition of each metric on page 72. I also attempted to skim through the Wikipedia articles on discounted cumulative gain and mean reciprocal rank. From reading, this is my interpretation of each metric:
NDCG = Consistency of relevance of recommendations; Is the first recommendation as relevant as the last?
Precision = Relevance of recommendations to user; How relevant are your recommendations to users across the board?
MRR = Relevance of first recommendation in the list versus the others in the list; How relevant is your first recommendation to each user?
If these interpretations are right, then my solution metrics indicate that I am highly consistent about recommending irrelevant content. Is that a valid conclusion?
Alright, my company has Developer Tier Support so I was able to get an answer to this question from AWS.
Answer Summary
The metrics are better the closer they are to '1'. My interpretation of my metrics was pretty much correct but my conclusion was not.
Apparently, these metrics (and Personalize in general) do not take into account how much a user likes an item. Personalize only cares how soon a relevant recommendation gets to the user. This makes sense because if you get the 25th item in a queue and don't like anything you've seen, you are not likely to continue looking.
Given this, what's happening in my solution is that the first-ish recommendation is relevant but none of the others are.
Detailed Answer from AWS
I will start with relatively easier question first: What are the ideal values for these metrics, so that a solution version can be preferred over another solution version?
The answer to the above question is that, for each metric, higher numbers are better. [1] If you have more than one solution version, prefer the one with higher values for these metrics. Please note that you can create several solution versions by overriding the default recipe parameters [2] and by using hyperparameters [3].
The second question: How to understand and interpret the metrics for AWS Personalize Solution version?
I can confirm from my research that the definitions and interpretations you provided for these metrics in the case are valid.
Before I explain each metric, here is a primer on one of the main concepts in machine learning: how are these metrics calculated?
The Model training step during the creation of solution version splits the input dataset into two parts, a training dataset (~70%) and test dataset (~30%). The training dataset is used during the Model training. Once the model is trained, it is used to predict the values for test dataset. Once the prediction is made it is validated against the known (and correct) value in the test dataset. [4]
I researched further to find more resources to understand the concept behind these metrics and also elaborate further an example provided in the AWS documentation. [1]
"mean_reciprocal_rank_at_25"
Let’s first understand Reciprocal Rank:
For example, a movie streaming service uses a solution version to predict a list of 5 recommended movies for a specific user, i.e. A, B, C, D, E.
Once these 5 recommended movies are compared against the actual movies liked by that user (in the test dataset) we find out that only movie B and E are actually liked by the user.
The Reciprocal Rank will only consider the first relevant (correct according to test dataset) recommendation which is movie B located at rank 2 and it will ignore the movie E located at rank 5. Thus the Reciprocal Rank will be 1/2 = 0.5
Now let’s expand the above example to understand Mean Reciprocal Rank: [5] Let’s assume that we ran predictions for three users and the movies below were recommended.
User 1: A, B, C, D, E (user liked B and E, thus the Reciprocal Rank is 1/2)
User 2: F, G, H, I, J (user liked H and I, thus the Reciprocal Rank is 1/3)
User 3: K, L, M, N, O (user liked K, M and N, thus the Reciprocal Rank is 1)
The Mean Reciprocal Rank is the sum of all the individual Reciprocal Ranks divided by the total number of queries run for predictions, which is 3.
(1/2 + 1/3 + 1)/3 = (0.5+0.33+1)/3 = (1.83)/3 = 0.61
In case of AWS Personalize Solution version metrics, the mean of the reciprocal ranks of the first relevant recommendation out of the top 25 recommendations over all queries is called “mean_reciprocal_rank_at_25”.
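The three-user example above can be reproduced with a short Python sketch. The function names are illustrative only and this is not how Personalize computes the metric internally; it simply follows the definition given here, with a cutoff of 25.

```python
def reciprocal_rank(recommended, relevant, k=25):
    """Return 1/position of the first relevant item in the top k, or 0.0 if there is none."""
    for position, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(queries, k=25):
    """Average the reciprocal rank over all (recommended, relevant) pairs."""
    return sum(reciprocal_rank(rec, rel, k) for rec, rel in queries) / len(queries)

# (recommended list, set of movies the user actually liked)
queries = [
    (list("ABCDE"), {"B", "E"}),       # first hit at rank 2 -> 1/2
    (list("FGHIJ"), {"H", "I"}),       # first hit at rank 3 -> 1/3
    (list("KLMNO"), {"K", "M", "N"}),  # first hit at rank 1 -> 1
]

mrr = mean_reciprocal_rank(queries)
print(round(mrr, 2))
```

Only the first relevant item per user matters, which is why a value near 1 says the first recommendation is usually a hit and says nothing about the rest of the list.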
"precision_at_K"
Precision at K is the fraction of the top K recommendations that are relevant. It can be stated as the capability of a model to deliver the relevant elements with the fewest recommendations.
The concept of precision is described in the following free video available at Coursera. [6] A very good article on the same topic can be found here. [7]
Let’s consider the same example: a movie streaming service uses a solution version to predict a list of 5 recommended movies for a specific user, i.e. A, B, C, D, E. Once these 5 recommended movies are compared against the actual movies liked by that user (correct values in the test dataset), we find out that only movies B and E are actually liked by the user.
The precision_at_5 will be 2 correctly predicted movies out of total 5 movies and can be stated as 2/5=0.4
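The same calculation as a minimal Python sketch; `precision_at_k` is a name chosen for this example, and the movie data is the example from the answer.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that appear in the relevant set."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

recommended = ["A", "B", "C", "D", "E"]
liked = {"B", "E"}

p_at_5 = precision_at_k(recommended, liked, 5)
print(p_at_5)
```

Because the result is divided by K, a user with only one relevant item in the test data can score at most 1/K, so low precision values at larger K are expected in that situation.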
"normalized_discounted_cumulative_gain_at_K"
This metric uses logarithms and a logarithmic scale to assign a weighting factor to relevant items (correct values in the test dataset). A full description of logarithms and logarithmic scales is beyond the scope of this document. The main purpose of a logarithmic scale is to compress wide-ranging quantities into a small range.
discounted_cumulative_gain_at_K
Let’s consider the same example: a movie streaming service uses a solution version to predict a list of 5 recommended movies for a specific user, i.e. A, B, C, D, E. Once these 5 recommended movies are compared against the actual movies liked by that user (correct values in the test dataset), we find out that only movies B and E are actually liked by the user.
To produce the discounted cumulative gain (DCG) at 5, each relevant item is assigned a weighting factor (using a logarithmic scale) based on its position in the top 5 recommendations. The value produced by this formula is called the “discounted value”.
The formula is 1/log(1 + position)
As B is at position 2, its discounted value is 1/log(1 + 2)
As E is at position 5, its discounted value is 1/log(1 + 5)
The discounted cumulative gain (DCG) is calculated by adding the discounted values of both relevant items: DCG = 1/log(1 + 2) + 1/log(1 + 5)
normalized_discounted_cumulative_gain_at_K
First of all, what is “ideal DCG”?
In the above example the ideal predictions would look like B, E, A, C, D. Thus, in the ideal case, the relevant items are at positions 1 and 2. To produce the “ideal DCG” at 5, each relevant item is assigned a weighting factor (using a logarithmic scale) based on its position in the top 5 recommendations. The value produced by this formula is called the “discounted value”.
The formula is 1/log(1 + position).
As B is at position 1, its discounted value is 1/log(1 + 1)
As E is at position 2, its discounted value is 1/log(1 + 2)
The ideal DCG is calculated by adding the discounted values of both relevant items: ideal DCG = 1/log(1 + 1) + 1/log(1 + 2)
The normalized discounted cumulative gain (NDCG) is the DCG divided by the “ideal DCG”.
DCG / ideal DCG = (1/log(1 + 2) + 1/log(1 + 5)) / (1/log(1 + 1) + 1/log(1 + 2)) = 0.6241
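To check the arithmetic, here is a small Python sketch of the same calculation. The text does not state the base of the logarithm; base 2 is assumed here. The choice affects the DCG value but not the NDCG, since the base cancels out in the ratio. Function names are illustrative.

```python
import math

def dcg_at_k(recommended, relevant, k):
    """Sum 1/log2(1 + position) over the relevant items in the top k (log base 2 is an assumption)."""
    return sum(
        1.0 / math.log2(1 + position)
        for position, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )

def ndcg_at_k(recommended, relevant, k):
    """Divide the DCG by the ideal DCG, where all relevant items are ranked first."""
    ideal_hits = min(len(relevant), k)
    ideal = sum(1.0 / math.log2(1 + position) for position in range(1, ideal_hits + 1))
    return dcg_at_k(recommended, relevant, k) / ideal if ideal else 0.0

recommended = ["A", "B", "C", "D", "E"]
liked = {"B", "E"}

ndcg = ndcg_at_k(recommended, liked, 5)
print(round(ndcg, 4))
```

Reordering the list to B, E, A, C, D gives an NDCG of exactly 1, which is the best possible value.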
I hope the information provided above is helpful in understanding the concept behind these metrics.
[1] https://docs.aws.amazon.com/personalize/latest/dg/working-with-training-metrics.html
[2] https://docs.aws.amazon.com/personalize/latest/dg/customizing-solution-config.html
[3] https://docs.aws.amazon.com/personalize/latest/dg/customizing-solution-config-hpo.html
[4] https://medium.com/@m_n_malaeb/recall-and-precision-at-k-for-recommender-systems-618483226c54
[5] https://www.blabladata.com/2014/10/26/evaluating-recommender-systems/
[6] https://www.coursera.org/lecture/ml-foundations/optimal-recommenders-4EQc2
[7] https://medium.com/@bond.kirill.alexandrovich/precision-and-recall-in-recommender-systems-and-some-metrics-stuff-ca2ad385c5f8
I have a group of treated firms in a country, and for each firm I would like to find the closest match in terms of industry, size and profitability in the rest of the country. I am working in Stata. All I need is to form a control group. Could anybody guide me with the code? That'd be greatly appreciated! I currently have the following, which doesn't get me what I need:
psmatch2 (logpension) (treated sector logassets logebitda), logit ate
Here's how you might match on x1 and x2 using Mahalanobis distance as a metric, to get the effect on y from treatment t:
use http://ssc.wisc.edu/sscc/pubs/files/psm, clear
psmatch2 t, mahalanobis(x1 x2) outcome(y) ate
The variable _n1 stores the observation number of the matched control observation for every treatment observation.
The following is a full set of code you can run to find the average treatment effect on the treated (your most important result) and to check whether the data is balanced (whether your result is valid). Before you run it, make sure your treatment variable is coded as follows: 0 for the control group and 1 for the experimental/treatment group. "neighbor(1)" means I chose nearest-neighbor matching with one neighbor. It pairs each treated observation with the control observation whose propensity score is closest in absolute value.
psmatch2 treated sector logassets logebitda, outcome (logpension) neighbor(1) common
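To make the nearest-neighbor idea concrete outside Stata, here is a conceptual Python sketch. It is not psmatch2's implementation: the propensity scores are made up, the names are hypothetical, and matching with replacement (a control can be reused) is an assumption of this example.

```python
def nearest_neighbor_match(treated_scores, control_scores):
    """Pair each treated unit with the control whose propensity score is closest in absolute value."""
    matches = {}
    for treated_id, treated_score in treated_scores.items():
        matches[treated_id] = min(
            control_scores,
            key=lambda control_id: abs(control_scores[control_id] - treated_score),
        )
    return matches

# Made-up propensity scores for illustration only.
treated = {"T1": 0.62, "T2": 0.35}
controls = {"C1": 0.30, "C2": 0.58, "C3": 0.90}

matches = nearest_neighbor_match(treated, controls)
print(matches)
```

The matched controls together form the control group; in psmatch2 the corresponding information is stored in generated variables after the command runs.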
After running psmatch2, you need to make sure your data is balanced. So you need to run this:
pstest sector logassets logebitda, treated(treated)
If any t-test shows a p-value below 0.05, the treated and control groups still differ on that covariate, which means your data is not balanced. To check the balance of your data visually, you can also run
psgraph
right after your psmatch2 command.
Good luck!
In Weka I have seen the F-measure of the 'yes' class and the 'no' class separately. But what is the advantage of using the weighted average F-measure to compare the performance of the models? Please help me to find the answer :)
Let's start with a smart example, classifying protein interactions in text using machine learning, where our classifier has attempted to classify sentences into two classes: (1) positive class (2) negative class. Positive class contains sentences that describe protein interactions and negative class comprises sentences that do not describe protein interactions. As a researcher, my focus will be the F-score of my classifiers for positive class. Why? Because I am interested to see my classifier's performance on classifying sentences that contain protein interactions and I do not care about its ability to classify negative sentences. Therefore, I will consider only the F-score of the positive class.
However, for another classical problem like spam classification, where our classifier classifies emails into two classes: (1) hams and (2) spams, the scenario is a bit different. As a researcher, I would like to know my classifier's ability to classify hams as well as spams. At that point, I can either check the F-scores of each class independently or in an aggregated fashion. The weighted average of F-scores of ham and spam class is a means to check the performance of our classifier for both (in this case both, for multi-class problems read all) classes. Because the weighted F-measure is just the sum of all F-measures, each weighted according to the number of instances with that particular class label and for two classes, it is calculated as follows:
Weighted F-Measure = ((F-Measure for class n × number of instances of class n) + (F-Measure for class y × number of instances of class y)) / total number of instances in the dataset
So, the bottom line is: if the classification is sensitive for all the classes, use the weighted average of the F-scores of all classes.
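The weighted average formula can be written as a few lines of Python. The per-class F-scores and instance counts below are made-up numbers for a ham/spam dataset, and the function name is illustrative rather than anything Weka exposes.

```python
def weighted_f_measure(f_by_class, count_by_class):
    """Average the per-class F-measures, weighting each by its number of instances."""
    total = sum(count_by_class.values())
    return sum(f_by_class[label] * count_by_class[label] for label in f_by_class) / total

# Hypothetical per-class results.
f_scores = {"ham": 0.95, "spam": 0.80}
counts = {"ham": 900, "spam": 100}

weighted_f = weighted_f_measure(f_scores, counts)
unweighted_mean = sum(f_scores.values()) / len(f_scores)

print(round(weighted_f, 3), round(unweighted_mean, 3))
```

The weighted value sits closer to the F-score of the larger class, so on an imbalanced dataset it is worth looking at the per-class F-scores as well.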
As far as I remember, it can better handle "extreme" precision or recall (P, R) numbers, when one or both are close to either 0 or 1. (They are generally negatively correlated.)
This might happen when you apply different algorithms to a dataset and end up with some precision/recall numbers that you need to compare.
It turns out that the simple average (P+R)/2 is too simplistic.
If you have a dataset where either precision or recall is close to 1 or 0, the F-measure, being the harmonic mean 2PR/(P+R), still takes the other one into account and is pulled toward the smaller of the two.
(The name itself does not mean anything).
Andrew Ng explains it well in his machine-learning course, week 6 "Handling skewed data"
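A small numeric comparison shows why the simple average is misleading. This sketch assumes the usual F1 definition (harmonic mean of precision and recall); the precision and recall values are invented to represent a classifier that labels almost everything as positive.

```python
def simple_average(precision, recall):
    """Arithmetic mean of precision and recall."""
    return (precision + recall) / 2.0

def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Hypothetical degenerate classifier: perfect recall, very poor precision.
precision, recall = 0.02, 1.0

avg = simple_average(precision, recall)
f1 = f1_score(precision, recall)

print(round(avg, 3), round(f1, 3))
```

The average suggests a mediocre model, while the F-measure stays close to the poor precision value, which is the behaviour you want when comparing algorithms.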
While creating train, test and cross-validation samples in Python, I see the default method as:
1. Reading the dataset, after skipping headers
2. Creating the train, test and cross-validation samples
import csv

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead.
from sklearn.model_selection import train_test_split

# Read the dataset, skipping the header row.
with open('C:/Users/Train/Trainl.csv', 'r', newline='') as f1:
    next(f1)
    reader = csv.reader(f1, delimiter=',')
    input_set = [row for row in reader]

# 60% train, then split the remaining 40% evenly into cross-validation and test.
train, intermediate_set = train_test_split(input_set, train_size=0.6, test_size=0.4)
cv, test = train_test_split(intermediate_set, train_size=0.5, test_size=0.5)
My problem though is that I have a field say "A" in the csv file that I read into the numpy array, and all sampling should respect this field. That is, all entries with similar values for "A" should go in one sample .
Line #|A | B | C | D
1 |1 |
2 |1 |
3 |1 |
4 |1 |
5 |2 |
6 |2 |
7 |2 |
Required: lines 1, 2, 3, 4 should all go in one sample, and lines 5, 6, 7 should all go in one sample.
Value of column A is a unique id, corresponding to one single entity(could be seen as a cross section data points on one SINGLE user, so it MUST go in one unique sample of train, test, or cv), and there are many such entities, so a grouping by entity id is required.
B, C,D columns may have any values, but a grouping preservation is not required on them. (Bonus: can I group the sampling for multiple fields?)
What I tried :
A. Finding all unique values of A and treating these as my sample, I distribute them among train, intermediate, cv and test, and then put the rest of the rows for each value of "A" into the corresponding file.
That is, if train had the entry for "3", test for "2" and cv for "1", then all rows with a value of A of 3 go in train, all with 2 go in test and all with 1 go in cv.
Of course, this approach is not scalable.
And I suspect it may have introduced bias into the datasets, since the number of 1's in column A, the number of 2's, etc. is not equal, meaning this approach will not work!
B. I also tried numpy.random.shuffle and numpy.random.permutation as per the thread here - Numpy: How to split/partition a dataset (array) into training and test datasets for, e.g., cross validation? - but it did not meet my requirement.
C. A third option, of course, is writing a custom function that does this grouping and then balances the training, test and cv datasets based on the number of data points in each group. But I'm just wondering if there's already an efficient way to implement this?
Note my data set is huge, so ideally I would like to have a deterministic way to partition my datasets, without having multiple eye-ball-scans to be sure that the partition is correct.
EDIT Part 2:
Since I did not find anything that fit my sampling criteria, I actually wrote a module to sample with grouping constraints. This is the GitHub code for it. The code was not written with very large data in mind, so it's not very efficient. Should you fork this code, please point out how I can improve the run-time.
https://github.com/ekta1007/Sampling-techniques/blob/master/sample_expedia.py
By forcing such constraints you will introduce bias into your procedure either way. So an approach based on partitioning the "users" and then collecting their respective "measurements" does not seem bad. And it will scale just fine: this is an O(n) method, so the only reason for it not scaling would be a bad implementation, not a bad method.
The reason there is no such functionality in existing methods (like the sklearn library) is that it looks highly artificial and runs counter to the idea behind machine learning models. If these rows somehow form one entity, then they should not be treated as separate data points. If you need this separate representation, then requiring that a particular entity cannot be partly in the test set and partly in the training set will surely bias the whole model.
To sum up, you should really analyze deeply whether your approach is reasonable from the machine learning point of view. If you are sure about it, I think the only possibility is to write the segmentation yourself because, despite having used many ML libraries in the past, I've never seen such functionality.
In fact, I am not sure whether the problem of partitioning a set of N numbers (the sizes of the entities) into K (=3) subsets with given proportions of the total, drawn uniformly at random, is not NP-hard in itself. If you cannot guarantee a uniform distribution, then your datasets cannot be used as a statistically correct basis for training/testing/validating your model. Even if it has a reasonable polynomial solution, it can still scale badly (much worse than linear methods). This doubt applies if your constraints are "strict"; if they are "weak", you can always use a "generate and reject" approach, which should have amortized linear complexity.
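For the "weak" constraint case, a "generate and reject" loop could look like the sketch below. Everything here is an assumption made for illustration: the function name, the tolerance of 5 percentage points, the retry limit and the synthetic group sizes. Each entity is assigned to one subset as a whole, and an assignment is rejected if the resulting row proportions drift too far from the targets.

```python
import random

def split_groups(group_sizes, proportions, tolerance=0.05, max_tries=1000, seed=0):
    """Randomly assign whole groups to subsets; retry until row shares are within tolerance."""
    rng = random.Random(seed)
    total_rows = sum(group_sizes.values())
    group_ids = sorted(group_sizes)
    subset_indices = range(len(proportions))
    for _ in range(max_tries):
        # Generate: each group picks a subset with probability equal to the target share.
        assignment = {g: rng.choices(subset_indices, weights=proportions)[0] for g in group_ids}
        rows_per_subset = [0] * len(proportions)
        for g, subset in assignment.items():
            rows_per_subset[subset] += group_sizes[g]
        # Reject unless every subset's share of rows is close enough to its target.
        if all(abs(rows_per_subset[i] / total_rows - p) <= tolerance
               for i, p in enumerate(proportions)):
            return assignment
    raise RuntimeError("no acceptable split found; loosen the tolerance")

# Synthetic example: 300 entities with between 1 and 7 rows each.
sizes = {entity_id: 1 + (entity_id % 7) for entity_id in range(300)}
targets = [0.6, 0.2, 0.2]  # train, cv, test
assignment = split_groups(sizes, targets)
```

Each pass is linear in the number of entities, and with many entities of moderate size only a few passes are usually needed. With a handful of very large entities the tolerance may never be met, which is the "strict" situation described above.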
I was also facing similar kind of issue, though my coding is not too good I came up with the solution as given below:
Created a new data frame that only contains the Unique Id of the df and removed duplicates.
new = df[["Unique_Id"]].copy()
New_DF = new.drop_duplicates()
Created training and test set on the basis of New_DF
train, test = train_test_split(New_DF, test_size=0.2)
And then merged those training and test set with original df.
df_Test = pd.merge(df, test, how='inner', on="Unique_Id")
df_Train = pd.merge(df, train, how='inner', on="Unique_Id")
Similarly, we can create sample for the validation part too.
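If you need the assignment to be reproducible without storing a shuffle, one variation on the same idea is to hash each ID into a bucket. This is only a sketch under assumptions: the 60/20/20 percentages, the function name and the toy rows are made up, and MD5 is used purely as a stable hash, not for security.

```python
import hashlib

def assign_split(unique_id, train_pct=60, cv_pct=20):
    """Map an ID to 'train', 'cv' or 'test' using a stable hash, so the result never changes."""
    digest = hashlib.md5(str(unique_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable value in 0..99 for this ID
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + cv_pct:
        return "cv"
    return "test"

# Toy rows: (Unique_Id, payload). All rows sharing an ID land in the same split.
rows = [(1, "a"), (1, "b"), (1, "c"), (1, "d"), (2, "e"), (2, "f"), (2, "g")]

splits = {"train": [], "cv": [], "test": []}
for row in rows:
    splits[assign_split(row[0])].append(row)
```

The split proportions are only approximate, since they depend on how the hashed IDs fall and on how many rows each ID has.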
Cheers.
I am trying to make a simple artificial neural network work with the backpropagation algorithm. I have created an ANN and I believe I have implemented the BP algorithm correctly, but I may of course be wrong.
Right now, I am trying to train the network by giving it two random numbers (a, b) between 0 and 0.5, and having it add them. Then, of course, each time the output the network gives is compared to the theoretical answer of a + b (which will always be achievable by the sigmoid function).
Strangely, the output always converges to a number between 0 and 1 (as it must, because of the sigmoid function), but the random numbers I'm putting in seem to have no effect on it.
Edit: Sorry, it appears it doesn't converge. Here is an image of the output:
The weights are randomly distributed between -1 and 1, but I have also tried between 0 and 1.
I also tried giving it two constant numbers (0.35,0.9) and trying to train it to spit out 0.5. This works and converges very fast to 0.5. I have also trained it to spit out 0.5 if I give it any two random numbers between 0 and 1, and this also works.
If instead, my target is:
vector<double> target;
target.push_back(.5);
Then it converges very quickly, even with random inputs:
I have tried a couple different networks, since I made it very easy to add layers to my network. The standard one I am using is one with two inputs, one layer of 2 neurons, and a second layer of only one neuron (the output neuron). However, I have also tried adding a few layers, and adding neurons to them. It doesn't seem to change anything. My learning rate is equal to 1.0, though I tried it equal to 0.5 and it wasn't much different.
Does anyone have any idea of anything I could try?
Is this even something an ANN is capable of? I can't imagine it wouldn't be, since they can be trained to do such complicated things.
Any advice? Thanks!
Here is where I train it:
//Initialize it. This will be one with 2 layers, the first having 2 Neurons and the second (output layer) having 1.
vector<int> networkSize;
networkSize.push_back(2);
networkSize.push_back(1);
NeuralNetwork myNet(networkSize,2);
for(int i = 0; i<5000; i++){
    double a = randSmallNum();
    double b = randSmallNum();
    cout << "\n\n\nInputs: " << a << ", " << b << " with expected target: " << a + b;
    vector<double> myInput;
    myInput.push_back(a);
    myInput.push_back(b);
    vector<double> target;
    target.push_back(a + b);
    cout << endl << "Iteration " << i;
    vector<double> output = myNet.backPropagate(myInput,target);
    cout << "Output gotten: " << output[0];
    resultPlot << i << "\t" << abs(output[0] - target[0]) << endl;
}
Edit: I set up my network and have been following from this guide: A pdf. I implemented "Worked example 3.1" and got the same exact results they did, so I think my implementation is correct, at least as far as theirs is.
As @macs states, the maximum output of the standard sigmoid is 1, so if you try to add n numbers from [0, 1], your target should be normalized, i.e. sum(A1, A2, ..., An) / n.
In a model such as this, the sigmoid function (both in the output and in the intermediate layers) is used mainly for producing something that resembles a 0/1 toggle while still being a continuous function, so using it to represent a range of numbers is not what this kind of network is designed to do. This is because it is designed mostly with classification problems in mind.
There are, of course, other NN models that can do that sort of thing (for example, dropping the sigmoid on the output and just keeping it as a sum of its children).
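As a rough illustration of the "no sigmoid on the output" idea, here is a minimal Python sketch of a single linear output neuron trained with plain stochastic gradient descent to add two numbers in [0, 0.5]. It is not the asker's network: there is no hidden layer, and the learning rate, step count and seed are assumptions picked for this example.

```python
import random

def train_linear_adder(steps=5000, learning_rate=0.5, seed=0):
    """Train output = w1*a + w2*b + bias (no sigmoid) on targets a + b using per-sample SGD."""
    rng = random.Random(seed)
    w1, w2, bias = rng.uniform(-1, 1), rng.uniform(-1, 1), 0.0
    for _ in range(steps):
        a, b = rng.uniform(0, 0.5), rng.uniform(0, 0.5)
        output = w1 * a + w2 * b + bias
        error = output - (a + b)  # derivative of 0.5 * error**2 with respect to output
        w1 -= learning_rate * error * a
        w2 -= learning_rate * error * b
        bias -= learning_rate * error
    return w1, w2, bias

w1, w2, bias = train_linear_adder()
prediction = w1 * 0.2 + w2 * 0.3 + bias
print(round(w1, 3), round(w2, 3), round(bias, 3), round(prediction, 3))
```

The weights should settle near 1, 1 and 0, because addition is exactly representable by a linear unit. A sigmoid output can only approximate it over a limited range.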
If you can redefine your model in terms of classifying the input, you'll probably get better results.
Some examples of similar tasks for which the network will be more suitable:
Test whether the output is bigger or smaller than a certain constant - this should be very easy.
Output: a series of outputs, each representing a different potential value (for example, one output for each of the values between 0 and 10, one for 'more than 10', and one for 'less than 0'). You will want your network to round the result to the nearest integer.
A tricky one will be to try and create a boolean representation of the output by having multiple output nodes.
None of these will give you the precision that you may want, though, since by nature NNs are more 'fuzzy'.