How are confidence scores calculated in AWS SageMaker GroundTruth?

AWS's SageMaker/GroundTruth Labelling jobs return a confidence score for each human-annotated label.
However, the score is not a direct function of the responses of the N workers who labeled the task.
For example, on tasks where all three workers assigned different labels, the score still varies (0.61, 0.55, 0.68), and on tasks where two of the three workers agreed, it varies as well (0.95, 0.91).
"Automated data labelling" is disabled, which indicates that all items are labeled by a human rather than being fully or partially classified automatically.
How does AWS calculate these confidence scores?

I can't find the details, so I'm leaving this question open in the hope of a definitive answer. Here is what I have been able to find out so far:
Each labelling job has an AnnotationConsolidationConfig parameter which lets you control how the confidence score is calculated, using an AWS Lambda function.
The default for single-image classification is described as:
"a variant of the Expectation Maximisation approach. It estimates parameters for each worker and uses Bayesian inference to estimate the true class based on the class annotations from individual workers."
However, it appears that regular AWS users cannot view the function itself due to a lack of permissions.
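For reference, the consolidation Lambda is specified through the CreateLabelingJob API. Below is a minimal Python/boto3 sketch of where that setting lives; the ARNs are placeholders (not the real AWS-managed defaults), and most other required fields are omitted.

import boto3

sagemaker = boto3.client("sagemaker")

# Fragment of the HumanTaskConfig passed to create_labeling_job.
# All ARNs below are placeholders, not the AWS-managed defaults.
human_task_config = {
    "WorkteamArn": "arn:aws:sagemaker:us-east-1:123456789012:workteam/private-crowd/my-team",
    "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:123456789012:function:my-pre-task",
    "AnnotationConsolidationConfig": {
        # Point this at your own Lambda to control how worker responses
        # are merged into a single label and confidence score.
        "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-1:123456789012:function:my-consolidation"
    },
    "NumberOfHumanWorkersPerDataObject": 3,
    # ... remaining required fields (UiConfig, TaskTitle, etc.) omitted
}

# sagemaker.create_labeling_job(..., HumanTaskConfig=human_task_config)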

Related

Amazon CloudWatch Metric Math - Rolling Average

So I am trying to visualize an AWS CloudWatch metric with tolerance bands via AWS Managed Grafana.
For example, on my current graph I want to add tolerance lines so I can see which spikes go outside of the expected range.
I could technically do this by enabling CloudWatch anomaly detection and using ANOMALY_DETECTION_BAND(a) as my metric math but I am trying to replicate a dashboard we currently have which uses a 6 week rolling average with a simple multiplier as the upper and lower thresholds.
My thought was that I can accomplish this using metric math by leveraging a combination of SLICE, RUNNING_SUM, and DATAPOINT_COUNT but no matter what combination I try I can't seem to find the right mix.
Does anyone know how I can use Metric Math to create a time series where each data point is either:
The average of the data points from the last x amount of time (e.g. the last 6 days' worth of data points), or
The average of the last x data points.
If I can figure out either of these solutions I can do the rest, but I am having a hard time just getting "the last x data points" instead of referencing the entire query when doing any metric math operation.
I could maybe find a way to do this with built-in Grafana functionality as well, but I couldn't find a great way to do it (I'm still new to Grafana).
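No Metric Math expression is given here, but as a hedged client-side sketch of the same idea (a rolling average with multiplier thresholds), the series can be pulled with GetMetricData and the band computed outside CloudWatch. The namespace, metric name, window size, and multiplier below are all placeholders.

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
start = end - timedelta(weeks=6)  # look back far enough to seed the rolling window

resp = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        "Id": "m1",
        "MetricStat": {
            "Metric": {
                "Namespace": "MyApp",          # placeholder namespace
                "MetricName": "RequestCount",  # placeholder metric
            },
            "Period": 3600,                    # hourly datapoints
            "Stat": "Average",
        },
        "ReturnData": True,
    }],
    StartTime=start,
    EndTime=end,
    ScanBy="TimestampAscending",
)

values = resp["MetricDataResults"][0]["Values"]

# Rolling average over the last N datapoints, with upper/lower multiplier thresholds.
WINDOW = 144       # e.g. 6 days of hourly datapoints
MULTIPLIER = 1.5

for i in range(WINDOW, len(values)):
    window = values[i - WINDOW:i]
    avg = sum(window) / len(window)
    upper, lower = avg * MULTIPLIER, avg / MULTIPLIER
    # values[i] outside [lower, upper] would be flagged as a spike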

Amazon Sagemaker Groundtruth: Cannot get active learning to work

I am trying to test Sagemaker Groundtruth's active learning capability, but cannot figure out how to get the auto-labeling part to work. I started a previous labeling job with an initial model that I had to create manually. This allowed me to retrieve the model's ARN as a starting point for the next job. I uploaded 1,758 dataset objects and labeled 40 of them. I assumed the auto-labeling would take it from here, but the job in Sagemaker just says "complete" and is only displaying the labels that I created. How do I make the auto-labeler work?
Do I have to manually label 1,000 dataset objects before it can start working? I saw this post: Information regarding Amazon Sagemaker groundtruth, where the representative said that some of the 1,000 objects can be auto-labeled, but how is that possible if it needs 1,000 objects to start auto-labeling?
Thanks in advance.
I'm an engineer at AWS. In order to understand the "active learning"/"automated data labeling" feature, it will be helpful to start with a broader recap of how SageMaker Ground Truth works.
First, let's consider the workflow without the active learning feature. Recall that Ground Truth annotates data in batches [https://docs.aws.amazon.com/sagemaker/latest/dg/sms-batching.html]. This means that your dataset is submitted for annotation in "chunks." The size of these batches is controlled by the API parameter MaxConcurrentTaskCount [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HumanTaskConfig.html#sagemaker-Type-HumanTaskConfig-MaxConcurrentTaskCount]. This parameter has a default value of 1,000. You cannot control this value when you use the AWS console, so the default value will be used unless you alter it by submitting your job via the API instead of the console.
Now, let's consider how active learning fits into this workflow. Active learning runs in between your batches of manual annotation. Another important detail is that Ground Truth will partition your dataset into a validation set and an unlabeled set. For datasets smaller than 5,000 objects, the validation set will be 20% of your total dataset; for datasets larger than 5,000 objects, the validation set will be 10% of your total dataset. Once the validation set is collected, any data that is subsequently annotated manually constitutes the training set. The collection of the validation set and training set proceeds according to the batch-wise process described in the previous paragraph. A longer discussion of active learning is available in [https://docs.aws.amazon.com/sagemaker/latest/dg/sms-automated-labeling.html].
That last paragraph was a bit of a mouthful, so I'll provide an example using the numbers you gave.
Example #1
Default MaxConcurrentTaskCount ("batch size") of 1,000
Total dataset size: 1,758 objects
Computed validation set size: 0.2 * 1,758 ≈ 351 objects
Batch #
1. Annotate 351 objects to populate the validation set (1,407 remaining).
2. Annotate 1,000 objects to populate the first iteration of the training set (407 remaining).
3. Run active learning. This step may, depending on the accuracy of the model at this stage, result in the annotation of zero, some, or all of the remaining 407 objects.
4. (Assume no objects were automatically labeled in step #3) Annotate 407 objects. End labeling job.
Example #2
Non-default MaxConcurrentTaskCount ("batch size") of 250
Total dataset size: 1,758 objects
Computed validation set size: 0.2 * 1,758 ≈ 351 objects
Batch #
1. Annotate 250 objects to begin populating the validation set (1,508 remaining).
2. Annotate 101 objects to finish populating the validation set (1,407 remaining).
3. Annotate 250 objects to populate the first iteration of the training set (1,157 remaining).
4. Run active learning. This step may, depending on the accuracy of the model at this stage, result in the annotation of zero, some, or all of the remaining 1,157 objects. All else being equal, we would expect the model to be less accurate than the model in example #1 at this stage, because our training set is only 250 objects here.
5. Repeat alternating steps of annotating batches of 250 objects and running active learning.
Hopefully these examples illustrate the workflow and help you understand the process a little better. Since your dataset consists of 1,758 objects, the upper bound on the number of automated labels that can be supplied is 407 objects (assuming you use the default MaxConcurrentTaskCount).
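To make the arithmetic in the two examples reproducible, here is a small hedged Python sketch of the schedule as described above. It only mirrors the percentages and batch sizes from this answer; it is not an official Ground Truth algorithm.

import math

def groundtruth_schedule(total_objects, batch_size=1000):
    """Rough schedule per the description above: validation set first,
    then one manual training batch, then active learning on the rest."""
    # Assumed cutoff: 20% validation below 5,000 objects, 10% otherwise.
    val_fraction = 0.2 if total_objects < 5000 else 0.1
    validation = math.floor(val_fraction * total_objects)
    remaining = total_objects - validation

    first_training_batch = min(batch_size, remaining)
    # Upper bound on objects that could be auto-labeled afterwards.
    auto_label_upper_bound = remaining - first_training_batch

    return validation, first_training_batch, auto_label_upper_bound

print(groundtruth_schedule(1758))        # (351, 1000, 407)  -> Example #1
print(groundtruth_schedule(1758, 250))   # (351, 250, 1157)  -> Example #2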
Ultimately, 1,758 objects is still a relatively small dataset. We typically recommend at least 5,000 objects to see meaningful results [https://docs.aws.amazon.com/sagemaker/latest/dg/sms-automated-labeling.html]. Without knowing any other details of your labeling job, it's difficult to gauge why your job didn't result in more automated annotations. A useful starting point might be to inspect the annotations you received, and to determine the quality of the model that was trained during the Ground Truth labeling job.
Best regards from AWS!

Understanding GCP Dataproc billing and how it is affected by labels

I'm trying to make sure I have a clear understanding of how my organisation gets billed for Google Cloud Platform Dataproc.
We have exported our billing history to BigQuery so that we can analyse it. This morning we had two dataproc clusters running and the screenshot below shows a subset of the billing history for those two clusters. I have filtered on labels.key = "goog-dataproc-cluster-uuid" or labels.key = "goog-dataproc-cluster-name" or labels.key = "goog-dataproc-location". Here is a subset of the results
I've drawn boxes around the costs for two kinds of SKU. Let's take a look at the Standard Intel N1 16 VCPU running in EMEA items.
I only have two clusters, yet for each of those two clusters there are three lines. The reason is that there are three labels applied to each Dataproc cluster, hence the costs 1.271852 and 3.815556 each appear three times.
My simple question then is: how do I get the total cost of my Dataproc clusters? Do I add up all of these numbers (implying that the total cost is split equally over all of the labels), or do I take just one of the values (implying that the cost is repeated for each label)?
Here's another way of phrasing my question. Does this query give the total cost of running cluster data-dev-dataplatform-dataproc for one day:
SELECT sum(cost)
FROM [dh-billing-179310:billing.gcp_billing_export_XXXXXXXX]
WHERE labels.key = "goog-dataproc-cluster-name"
and labels.value = "data-dev-dataplatform-dataproc"
and usage_start_time >= "2018-07-05 00:00:00"
and usage_end_time <= "2018-07-06 00:00:00"
or do I need to include other labels in order to get the total cost?
In that flattened view of billing export data, the cost is repeated for each label; you should pick a single label value for any particular calculation. If you're trying to calculate the Dataproc total, it's probably most convenient to use one of the Dataproc-inserted "goog-dataproc-*" labels.
The idea here is that you can use different sets of labels to easily organize your total Dataproc-related costs attributed to any given subproject, so that you can then filter your billing queries along different dimensions.
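As a hedged illustration of "pick a single label" in code, here is a Python sketch using the BigQuery client library. It assumes the standard-SQL billing export schema where labels is a repeated key/value field, and it mirrors the (partially elided) table name from the question.

from google.cloud import bigquery

client = bigquery.Client()

# Table name mirrors the question; the XXXXXXXX part is left as-is.
query = """
SELECT SUM(cost) AS total_cost
FROM `dh-billing-179310.billing.gcp_billing_export_XXXXXXXX`,
     UNNEST(labels) AS label
WHERE label.key = 'goog-dataproc-cluster-name'
  AND label.value = 'data-dev-dataplatform-dataproc'
  AND usage_start_time >= TIMESTAMP '2018-07-05 00:00:00'
  AND usage_end_time <= TIMESTAMP '2018-07-06 00:00:00'
"""

# Filtering on a single label key avoids counting the same cost row
# once per attached label.
for row in client.query(query).result():
    print(row.total_cost)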

How to classify a small and peculiar subset out of a large database?

I have to perform a data mining task on a database containing information about insurance policies. Each tuple holds data about a single policy, along with information about the agency that issued it, the customer it refers to, and other fields. It is like a product between hypothetical tables Policies, Customers and Agencies. The fields are the following:
Policy Type, ID Number, Policy Status, Product Description, Product Combinations, Issue Date, Effective Date, Maturity Date, Policy Duration, Loan Duration, Cancellation Date, Reason for cancellation, Total Premium, Splitter Premium, ID Partners, ID Agency, Country Agency, ID Zone, Agency potential, Sex Contractor, Birth Year Contractor, Job Contractor, Sex Insured, Job Insured, Birth Year Insured, Product Area, Legal Form, ID Claim, Year Claim, Status Claim, Provision Claim, Payments Claim
This is an academic task and our professor wants us to identify churn rates, cross-selling and up-selling. I am not really familiar with the field, so I looked those terms up on Wikipedia. I started with churn rate, and it appears that in this case I have to characterize the properties of customers whose Policy Status is set to "canceled" and whose Reason for cancellation is "customer cancellation".
With RapidMiner, I tried to apply decision trees and rule mining, but the subset of interest is so small that the output model, despite having good accuracy overall, has very poor accuracy in predicting canceled policies. This happens because the subset of canceled policies is really small. I also tried to apply the MetaCost operator with a cost matrix in which the cost of misclassifying canceled policies is outrageously high with respect to the others (like a million times higher), but this did not change the result at all.
My best option now is to use the sequential covering algorithm for rule mining, but RapidMiner does not implement it and I would have to code it manually.
Do you have any suggestion on how to build a good model for that small subset of canceled policies, so that we could use it to identify customers that would potentially cancel their policy in the future?
N.B.: since it comes from a real source, albeit anonymized, I cannot disclose the database or any data contained within.
Did you try Naive Bayes? It works well with small sets of data. You could also try a variant of it, such as AODE. AODE is not available in RapidMiner; you need to install the Weka extension to access it.
You need to balance your dataset, so that the classes (cancelled / not cancelled) are the same size. This means (temporarily) discarding lots of data.
You can use the Sample operator with the Balance Labels checkbox to do this.
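Outside RapidMiner, the same balance-then-train idea can be sketched in Python with scikit-learn. This is only an illustration on synthetic, imbalanced data, not the policy dataset itself.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Synthetic, heavily imbalanced stand-in for "not cancelled" vs "cancelled".
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.97, 0.03], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Balance the training set by downsampling the majority class.
minority_idx = np.where(y_train == 1)[0]
majority_idx = np.where(y_train == 0)[0]
majority_down = np.random.RandomState(0).choice(majority_idx,
                                                size=len(minority_idx),
                                                replace=False)
balanced_idx = np.concatenate([minority_idx, majority_down])

clf = GaussianNB()
clf.fit(X_train[balanced_idx], y_train[balanced_idx])

# Recall on the rare "cancelled" class is the number to watch.
print(classification_report(y_test, clf.predict(X_test)))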

Show CloudWatch metric with unit Seconds in Hours

I have a custom CloudWatch metric with the unit Seconds (representing the age of a cache).
As usual values are around 125,000, I'd like to convert them into hours for better readability.
Is that possible?
This has changed with the addition of Metric Math. You can do all sorts of transformations on your data, both manually (from the console) and from CloudFormation dashboard templates.
From the console: the CloudWatch Metric Math documentation says:
To add a math expression to a graph
1. Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
2. Create or edit a graph or line widget.
3. Choose Graphed metrics.
4. Choose Add a math expression. A new line appears for the expression.
5. For the Details column, type the math expression. The tables in the following section list the functions you can use in the expression.
6. To use a metric or the result of another expression as part of the formula for this expression, use the value shown in the Id column. For example, m1+m2 or e1-MIN(e1).
From a CloudFormation Template
You can add new metrics which are Metric Math expressions, transforming existing metrics. You can add, subtract, multiply, etc., metrics and scalars. In your case, you probably just want division, like in this example:
Say you have the following bucket request latency metrics object in your template:
"metrics":[
["AWS/S3","TotalRequestLatency","BucketName","MyBucketName"]
]
The latency default is in milliseconds. Let's plot it in seconds, just for fun. 1s = 1,000ms so we'll add the following:
"metrics":[
["AWS/S3","TotalRequestLatency","BucketName","MyBucketName",{"id": "timeInMillis"}],
[{"expression":"timeInMillis / 1000", "label":"LatencyInSeconds","id":"timeInSeconds"}]
]
Note that the expression has access to the ID of the other metrics. Helpful naming can be useful when things get more complicated, but the key thing is just to match the variables you put in the expression to the ID you assign to the corresponding metric.
This leaves us with a graph with two metrics on it: one in milliseconds, the other in seconds. If we want to lose the milliseconds, we can, but we need to keep the metric values around to compute the math expression, so we use the following workaround:
"metrics":[
["AWS/S3","TotalRequestLatency","BucketName","MyBucketName",{"id": "timeInMillis","visible":false}],
[{"expression":"timeInMillis / 1000", "label":"LatencyInSeconds","id":"timeInSeconds"}]
]
Making the metric invisible takes it off the graph while still allowing us to compute our expression off of it.
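The same divide-by-a-scalar trick applies to the original question: seconds to hours is a division by 3,600. As a hedged sketch outside the dashboard template, here is the equivalent expression via boto3's GetMetricData; the namespace and metric name are placeholders.

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "ageInSeconds",
            "MetricStat": {
                "Metric": {"Namespace": "MyApp", "MetricName": "CacheAge"},  # placeholders
                "Period": 300,
                "Stat": "Maximum",
            },
            "ReturnData": False,  # hide the raw seconds series
        },
        {
            "Id": "ageInHours",
            "Expression": "ageInSeconds / 3600",
            "Label": "CacheAgeInHours",
        },
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
    EndTime=datetime.now(timezone.utc),
)
print(resp["MetricDataResults"])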
CloudWatch does not do any unit conversion (i.e. seconds into hours, etc.), so you cannot use the AWS console to display your 'Seconds' datapoint values converted to hours.
You could publish your metric values already converted to hours (leaving the Unit field blank or setting it to 'None').
Otherwise, if you still want to publish the datapoints with the unit 'Seconds', you could retrieve the datapoints (using the GetMetricStatistics API) and graph the converted values using some other dashboard/graphing solution.
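For that older, pre-Metric-Math route, a hedged boto3 sketch of pulling the datapoints with GetMetricStatistics and converting to hours client-side could look like this (namespace and metric name are again placeholders):

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

stats = cloudwatch.get_metric_statistics(
    Namespace="MyApp",        # placeholder
    MetricName="CacheAge",    # placeholder
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
    EndTime=datetime.now(timezone.utc),
    Period=3600,
    Statistics=["Maximum"],
    Unit="Seconds",
)

# Convert each datapoint from seconds to hours before graphing elsewhere.
for dp in sorted(stats["Datapoints"], key=lambda d: d["Timestamp"]):
    print(dp["Timestamp"], dp["Maximum"] / 3600, "hours")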