Weka - no class and no cluster assigned

I am working on a file to predict treatments for patients diagnosed with diabetes (the level ranges from 1 to 10).
There are 8 different treatment recommendations (256 possible outcomes) and I need to cluster them (I have 21 attributes from the original file). So I ran SimpleKMeans with k = 19. The problem is that I get "no class" assigned for some clusters.
Also, when I classify the result for evaluation, I have the same problem, "no cluster" assigned to a class, and I also lose some of the data: for example, there are 940 instances but only 876 remain after classifying.
The confusion matrix, however, displays the exact numbers. I don't know if it is related, but it might help to solve the question: I have used the AddCluster filter because all my attributes are numeric and I need an additional column in the original file in order to display "Treatment Cluster" (a 22nd attribute). So I run SimpleKMeans and cross-validation with this new additional attribute, which is also my class.
Thanks a lot for your help!!!

It appears that a class can only be assigned to zero or one clusters. As a result, for example, class 9 is assigned to cluster 7, but the class 9 values that fall into cluster 8 are not counted, because that cluster has been allocated to another class. The SimpleKMeans classes-to-clusters evaluation appears to assign each cluster the single class that generates the minimum classification error on the supplied data.
This problem has been raised before here, where the solution appears to be overriding the evaluation model to allow for one-to-many allocations.
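If you want to reproduce the Explorer's classes-to-clusters evaluation programmatically (for example, to post-process the cluster-to-class assignments yourself), a minimal sketch with the Weka Java API could look like the following. The file name is made up and k = 19 is taken from the question; the rest is ordinary Weka usage:

    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class ClassesToClusters {
        public static void main(String[] args) throws Exception {
            // Load the data and mark the last attribute ("Treatment Cluster") as the class.
            Instances data = DataSource.read("treatments.arff");   // hypothetical file name
            data.setClassIndex(data.numAttributes() - 1);

            // SimpleKMeans must not see the class attribute while clustering,
            // so build the model on a copy with the class removed.
            Remove remove = new Remove();
            remove.setAttributeIndices("" + (data.classIndex() + 1));
            remove.setInputFormat(data);
            Instances noClass = Filter.useFilter(data, remove);

            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(19);
            km.buildClusterer(noClass);

            // ClusterEvaluation performs the classes-to-clusters mapping: each cluster
            // gets the single class that minimizes the error on the supplied data,
            // which is why some classes can end up with no cluster at all.
            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(km);
            eval.evaluateClusterer(data);
            System.out.println(eval.clusterResultsToString());
        }
    }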

Amazon Sagemaker Groundtruth: Cannot get active learning to work

I am trying to test SageMaker Ground Truth's active learning capability, but cannot figure out how to get the auto-labeling part to work. I started a previous labeling job with an initial model that I had to create manually. This allowed me to retrieve the model's ARN as a starting point for the next job. I uploaded 1,758 dataset objects and labeled 40 of them. I assumed the auto-labeling would take it from there, but the job in SageMaker just says "complete" and only displays the labels that I created. How do I make the auto-labeler work?
Do I have to manually label 1,000 dataset objects before it can start working? I saw this post: Information regarding Amazon Sagemaker groundtruth, where the representative said that some of the 1,000 objects can be auto-labeled, but how is that possible if it needs 1,000 objects to start auto-labeling?
Thanks in advance.
I'm an engineer at AWS. In order to understand the "active learning"/"automated data labeling" feature, it will be helpful to start with a broader recap of how SageMaker Ground Truth works.
First, let's consider the workflow without the active learning feature. Recall that Ground Truth annotates data in batches [https://docs.aws.amazon.com/sagemaker/latest/dg/sms-batching.html]. This means that your dataset is submitted for annotation in "chunks." The size of these batches is controlled by the API parameter MaxConcurrentTaskCount [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HumanTaskConfig.html#sagemaker-Type-HumanTaskConfig-MaxConcurrentTaskCount]. This parameter has a default value of 1,000. You cannot control this value when you use the AWS console, so the default value will be used unless you alter it by submitting your job via the API instead of the console.
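If you do end up submitting the job through the API, the batch size is just one field of HumanTaskConfig. As a rough illustration only (this sketch assumes the AWS SDK for Java v2 and shows nothing but that one field; a real CreateLabelingJob request also needs the workteam ARN, UI config, pre-annotation and consolidation Lambda ARNs, input/output configuration, and so on):

    import software.amazon.awssdk.services.sagemaker.model.HumanTaskConfig;

    public class BatchSizeSketch {
        public static void main(String[] args) {
            // Override the default batch size of 1,000 with 250.
            // All other required HumanTaskConfig fields are omitted in this sketch.
            HumanTaskConfig humanTaskConfig = HumanTaskConfig.builder()
                    .maxConcurrentTaskCount(250)
                    .build();
            System.out.println(humanTaskConfig);
        }
    }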
Now, let's consider how active learning fits into this workflow. Active learning runs in between your batches of manual annotation. Another important detail is that Ground Truth will partition your dataset into a validation set and an unlabeled set. For datasets smaller than 5,000 objects, the validation set will be 20% of your total dataset; for datasets larger than 5,000 objects, it will be 10%. Once the validation set is collected, any data that is subsequently annotated manually constitutes the training set. The collection of the validation set and training set proceeds according to the batch-wise process described in the previous paragraph. A longer discussion of active learning is available in [https://docs.aws.amazon.com/sagemaker/latest/dg/sms-automated-labeling.html].
That last paragraph was a bit of a mouthful, so I'll provide an example using the numbers you gave.
Example #1
Default MaxConcurrentTaskCount ("batch size") of 1,000
Total dataset size: 1,758 objects
Computed validation set size: 0.2 * 1758 = 351 objects
Batch schedule:
1. Annotate 351 objects to populate the validation set (1,407 remaining).
2. Annotate 1,000 objects to populate the first iteration of the training set (407 remaining).
3. Run active learning. This step may, depending on the accuracy of the model at this stage, result in the annotation of zero, some, or all of the remaining 407 objects.
4. (Assuming no objects were automatically labeled in step 3) Annotate the remaining 407 objects. End of labeling job.
Example #2
Non-default MaxConcurrentTaskCount ("batch size") of 250
Total dataset size: 1,758 objects
Computed validation set size: 0.2 * 1758 = 351 objects
Batch schedule:
1. Annotate 250 objects to begin populating the validation set (1,508 remaining).
2. Annotate 101 objects to finish populating the validation set (1,407 remaining).
3. Annotate 250 objects to populate the first iteration of the training set (1,157 remaining).
4. Run active learning. This step may, depending on the accuracy of the model at this stage, result in the annotation of zero, some, or all of the remaining 1,157 objects. All else being equal, we would expect this model to be less accurate than the model in example #1 at this stage, because the training set here contains only 250 objects.
5. Repeat, alternating between annotating batches of 250 objects and running active learning.
Hopefully these examples illustrate the workflow and help you understand the process a little better. Since your dataset consists of 1,758 objects, the upper bound on the number of automated labels that can be supplied is 407 objects (assuming you use the default MaxConcurrentTaskCount).
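If it helps, the bookkeeping in the two examples can be written down as a few lines of plain Java (no AWS dependencies; the percentages and batch sizes are exactly the ones discussed above):

    public class GroundTruthBatchMath {
        public static void main(String[] args) {
            printSchedule(1758, 1000);  // example #1
            printSchedule(1758, 250);   // example #2
        }

        static void printSchedule(int datasetSize, int maxConcurrentTaskCount) {
            // 20% validation set for datasets under 5,000 objects, 10% otherwise.
            double validationFraction = datasetSize < 5000 ? 0.20 : 0.10;
            int validationSet = (int) (validationFraction * datasetSize);

            // The first training batch is always annotated manually; everything
            // beyond it is the most that active learning could possibly label.
            int firstTrainingBatch = Math.min(maxConcurrentTaskCount, datasetSize - validationSet);
            int autoLabelUpperBound = datasetSize - validationSet - firstTrainingBatch;

            System.out.printf("dataset=%d batch=%d validation=%d first training batch=%d auto-label upper bound=%d%n",
                    datasetSize, maxConcurrentTaskCount, validationSet, firstTrainingBatch, autoLabelUpperBound);
        }
    }

With your numbers this prints a validation set of 351 objects and an auto-label upper bound of 407 objects for the default batch size.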
Ultimately, 1,758 objects is still a relatively small dataset. We typically recommend at least 5,000 objects to see meaningful results [https://docs.aws.amazon.com/sagemaker/latest/dg/sms-automated-labeling.html]. Without knowing any other details of your labeling job, it's difficult to gauge why your job didn't result in more automated annotations. A useful starting point might be to inspect the annotations you received, and to determine the quality of the model that was trained during the Ground Truth labeling job.
Best regards from AWS!

Preserve Order for Cross Validation in Weka

I am using the Weka GUI for classifying sensor data.
I have measurements from 10 people, and the data is sorted: the first 10% correspond to participant 1, the second 10% to participant 2, and so on.
I would like to use 10-fold cross-validation to build a model on 9 participants and test it on the remaining participant. In my case I believe I could accomplish this by simply not randomizing the data splits.
How would I best go about doing this?
I don't know how to do this in the Explorer.
In the KnowledgeFlow GUI, there is a CrossValidationFoldMaker component used to create cross-validation folds. It has a "Preserve instances order" option, which keeps the order of instances rather than randomly shuffling them.
There's a video describing the KnowledgeFlow interface here:
https://www.youtube.com/watch?v=sHSgoVX9z-8&t=7s
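If you'd rather script it, the Weka Java API also lets you build order-preserving folds yourself: Instances.trainCV() and testCV() simply slice the data in its current order, so as long as you skip the usual randomize() call, each fold stays contiguous. A rough sketch (the file name, the J48 classifier, and the class being the last attribute are assumptions):

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class OrderPreservingCV {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("sensor.arff");   // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);

            int folds = 10;                        // one fold per participant
            Evaluation eval = new Evaluation(data);
            for (int i = 0; i < folds; i++) {
                // No randomize()/stratify() here: folds are taken in file order,
                // so fold i holds the data of participant i + 1.
                Instances train = data.trainCV(folds, i);
                Instances test = data.testCV(folds, i);

                J48 classifier = new J48();
                classifier.buildClassifier(train);
                eval.evaluateModel(classifier, test);
            }
            System.out.println(eval.toSummaryString("=== Leave-one-participant-out ===", false));
        }
    }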

Imbalance between errors in data summary and tree visualization in Weka

I tried to run a simple classification on the iris.arff dataset in Weka, using the J48 algorithm. I used cross-validation with 10 folds and - if I'm not wrong - all the default settings for J48.
The result is a 96% accuracy with 6 incorrectly classified instances.
Here's my question: according to this, the second number in the tree visualization is the number of wrongly classified instances in each leaf, but then why is their sum 3 rather than 6?
EDIT: running the algorithm with different test options, I obtain different results in terms of accuracy (and therefore number of errors), but when I visualize the tree I always get the same tree with the same 3 errors. I still can't explain why.
The second number in the tree visualization is not the number of the wrongly classified instances in each leaf - it's the total weight of those wrongly classified instances.
Did you, by any chance, weigh some of those instances with 0.5 instead of 1?
Another option is that you are actually executing two different models. One where you use the full training set to build the classifier (classifier.buildClassifier(instances)), and another where you run cross-validation (eval.crossValidateModel(...)) with 10 train/test folds. The first model will produce the visualised tree with fewer errors (larger training set), while the second model from CV produces the output statistics with more errors. This would explain why you get different stats when changing the test options but still the same tree, which is built on the full set.
For the record: if you train (and visualise) the tree with the full dataset, you will appear to have fewer errors, but your model will actually be overfitted and the obtained performance measures will probably not be realistic. As such, your results from CV are much more useful, and you should visualise the tree from that model.
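You can see the two models side by side in a few lines of the Weka Java API; the cross-validation statistics come from one object, the visualised tree from another, both built on the same data:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TwoModels {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Model 1: 10-fold cross-validation. These are the summary statistics
            // (accuracy, confusion matrix) that the Explorer reports.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString());

            // Model 2: a single tree built on the full training set. This is the
            // tree you visualise; its leaf counts reflect errors on the training
            // data only, which is why they add up to 3 rather than 6.
            J48 fullTree = new J48();
            fullTree.buildClassifier(data);
            System.out.println(fullTree);
        }
    }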

AIMMS artificial variable constraint

I have difficulty formulating the constraints correctly. Dumbed down version of the problem:
There are 12 time units and 3 products. The demand d_{i,t} for product i at time t is known in advance, and the resources r_{i,t} (all 8; product i uses the non-i resources) necessary for product i at time t are known as well. We need to minimize holding costs h_i by deciding how much of product i to produce at time t; call this x_{i,t}. The starting inventory for every product is 6. To help, I introduce the stock level s_{i,t}. This equates to the following formulation:
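In rough terms (leaving the resource constraints aside, since they are only sketched above), the stock part is:

    \begin{align*}
    \min_{x,\,s}\quad & \sum_{i=1}^{3} \sum_{t=1}^{12} h_i \, s_{i,t} \\
    \text{s.t.}\quad  & s_{i,1} = 6 + x_{i,1} - d_{i,1} && \forall i \\
                      & s_{i,t} = s_{i,t-1} + x_{i,t} - d_{i,t} && \forall i,\; t > 1 \\
                      & x_{i,t} \ge 0,\quad s_{i,t} \ge 0 && \forall i, t
    \end{align*}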
I got this working using the Excel solver, but I need to do this in AIMMS. The stock variable s is giving me issues: I can't get it to work using an if statement to condition on t = 1, nor would I know how to split it into two constraints, given that the first iteration of the second constraint would need to reference the first constraint.
You can specify the index domain in your constraint attributes as follows:
(t,i) | t > 1
An if t > 1 statement should work if the set of time instances is a subset of the integers. If not, you should use ord(t) > 1, i.e.
if ord(t) > 1 then
    Your_Constraint
endif

How to do prediction with weka

I'm using Weka to do some text mining, and I'm a little bit confused, so I'm here to ask: with a set of comments that are classified in some way (as notes, status of work, non-conformity, or warning), how can I predict whether a new comment belongs to a specific class? With all the comments (9,551) I've done a preprocessing step, obtaining a vector of tokens with the StringToWordVector filter, and then I've used SimpleKMeans to obtain a number of clusters.
So the question is: if a user posts a new comment, can I predict with those data which category of comment it belongs to?
Sorry if my question is a little bit confused, but so am I.
Thank you.
Trivial Training-validation-test
Create two datasets from your labelled instances. One will be the training set and the other will be the validation set. The training set will contain about 60% of the labelled data and the validation set will contain the remaining 40%. There is no hard and fast rule for this split, but a 60-40 split is a good choice.
Use K-means (or any other clustering algorithm) on your training data. Develop a model. Record the model's error on the training set. If the error is low and acceptable, you are fine. Save the model.
For now, your validation set will be your test dataset. Apply the model you saved to your validation set. Record the error. What is the difference between the training error and the validation error? If both are low, the model's generalization is "seemingly" good.
Prepare a test dataset that has all the features of your training and validation datasets but whose class/cluster is unknown.
Apply the model to the test data.
10-fold cross validation
Use all of your labelled data instances for this task.
Apply K-means (or any other algorithm of your choice) with a 10-fold CV setup.
Record the training error and the CV error. Are they low? Is the difference between the errors low? If yes, then save the model and apply it to the test data whose class/cluster is unknown.
NB: The training/validation/test errors and their differences will give you a "very initial" idea of whether your model overfits or underfits. They are sanity tests. You need to perform other tests, like learning curves, to see whether your model overfits, underfits, or fits well. If there appears to be an overfitting or underfitting problem, you need to try many different techniques to overcome it.
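Putting the clustering workflow into code may make it more concrete. The sketch below uses the Weka Java API; the file names are made up, and the choice of 4 clusters (one per category) is only an assumption. It builds the StringToWordVector dictionary on the labelled comments, clusters them with SimpleKMeans, and then pushes new, unseen comments through the same trained filter so they can be assigned to clusters:

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class CommentClusterer {
        public static void main(String[] args) throws Exception {
            // Training comments: an ARFF file with a single string attribute.
            Instances raw = DataSource.read("comments-train.arff");        // hypothetical file

            // Build the token dictionary on the training data only.
            StringToWordVector s2wv = new StringToWordVector();
            s2wv.setInputFormat(raw);
            Instances vectors = Filter.useFilter(raw, s2wv);

            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(4);    // assumption: one cluster per comment category
            km.buildClusterer(vectors);

            // New, unseen comments in the same ARFF structure.
            Instances newComments = DataSource.read("comments-new.arff");  // hypothetical file

            // Second batch through the SAME filter: the dictionary learned from the
            // training data is reused, so the attributes line up with the clusterer.
            Instances newVectors = Filter.useFilter(newComments, s2wv);
            for (int i = 0; i < newVectors.numInstances(); i++) {
                System.out.println("comment " + i + " -> cluster "
                        + km.clusterInstance(newVectors.instance(i)));
            }
        }
    }

For a single comment typed by a user at prediction time, you would do the same thing with a one-row Instances object in the same structure and pass it through the trained filter before calling clusterInstance().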