Understanding GCP Dataproc billing and how it is affected by labels - google-cloud-platform

I'm trying to make sure I have a clear understanding of how my organisation gets billed for Google Cloud Platform Dataproc.
We have exported our billing history to BigQuery so that we can analyse it. This morning we had two Dataproc clusters running, and the screenshot below shows a subset of the billing history for those two clusters. I have filtered on labels.key = "goog-dataproc-cluster-uuid", labels.key = "goog-dataproc-cluster-name", or labels.key = "goog-dataproc-location". Here is a subset of the results.
I've drawn boxes around the costs for two kinds of SKU. Let's take a look at the "Standard Intel N1 16 VCPU running in EMEA" items.
I only have two clusters, yet for each of those two clusters there are three lines. The reason is that there are three labels applied to each Dataproc cluster, hence the costs 1.271852 & 3.815556 appear three times each.
My simple question then is: how do I get the total cost of my Dataproc clusters? Do I add up all of these numbers (thus implying that the total cost is split equally over all of the labels) or do I take just one of the values (implying that the cost is repeated for each label)?
Here's another way of phrasing my question. Does this query give the total cost of running cluster data-dev-dataplatform-dataproc for one day:
SELECT sum(cost)
FROM [dh-billing-179310:billing.gcp_billing_export_XXXXXXXX]
WHERE labels.key = "goog-dataproc-cluster-name"
and labels.value = "data-dev-dataplatform-dataproc"
and usage_start_time >= "2018-07-05 00:00:00"
and usage_end_time <= "2018-07-06 00:00:00"
or do I need to include other labels in order to get the total cost?

In that flattened view of billing export data, the cost is repeated for each label; you should pick a single label value for any particular calculation. If you're trying to calculate the Dataproc total, it's probably most convenient to use one of the Dataproc-inserted "goog-dataproc-*" labels.
The idea here is that you can use different sets of labels to easily organize your total Dataproc-related costs attributed to any given subproject, so that you can then filter your billing queries along different dimensions.
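For example, a query along the following lines adds up each cost row exactly once by filtering on a single goog-dataproc-* label. This is only a sketch, shown via the Python BigQuery client with standard SQL (where labels is a repeated key/value field); the table name just mirrors the placeholder in the question.

from google.cloud import bigquery

client = bigquery.Client()

# Filter on exactly ONE of the goog-dataproc-* labels so that each cost row is
# counted once rather than once per attached label.
sql = """
SELECT SUM(cost) AS total_cost
FROM `dh-billing-179310.billing.gcp_billing_export_XXXXXXXX`,
     UNNEST(labels) AS label
WHERE label.key = "goog-dataproc-cluster-name"
  AND label.value = "data-dev-dataplatform-dataproc"
  AND usage_start_time >= TIMESTAMP("2018-07-05 00:00:00")
  AND usage_end_time   <= TIMESTAMP("2018-07-06 00:00:00")
"""

for row in client.query(sql).result():
    print(row.total_cost)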

Related

AWS CloudWatch interpreting insights graph -- how many read/write IOs will be billed?

Introduction
We are trying to "measure" the cost of usage of a specific use case on one of our Aurora DBs that is not used very often (we use it for staging).
Yesterday at 18:18 hrs. UTC we issued some representative queries to it and today we were examining the resulting graphs via Amazon CloudWatch Insights.
Since we are being billed USD 0.22 per million read/write IOs, we need to know how many of those there were during our little experiment yesterday.
A complicating factor is that in the cost explorer it is not possible to group the final billed costs for read/write IOs per DB instance! Therefore, the only thing we can think of to estimate the cost is the read/write volume IO graphs on CloudWatch Insights.
So we went to CloudWatch Insights and selected the graphs for read/write IOs. Then we selected the period of time in which we did our experiment. Finally, we examined the graphs with different options: "Number" and "Lines".
Graph with "number"
This shows us the picture below, suggesting a total billable IO count of 266+510=776. Since we have chosen the "Sum" metric, we assume this would indicate a cost of about USD 0.00017 in total.
Graph with "lines"
However, if we choose the "Lines" option, then we see another picture with 5 points on the line: the first and last around 500 (for read IOs) and the last one at approx. 750, suggesting a total of around 5,000 read/write IOs.
Our question
We are not really sure which interpretation to go with and the difference is significant.
So our question is now: how much did our little experiment cost us and, equivalently, how should we interpret these graphs?
Edit:
Using 5-minute intervals (as suggested in the comments) we get (see below) a horizontal line with points at 255 (read IOs) for a whole hour around the time we did our experiment. But the experiment took less than 1 minute at 19:18 (UTC).
Will the (read) billing be for 12 * 255 IOs, or 255 ... (or something else altogether)?
Note: This question triggered another follow-up question created here: AWS CloudWatch insights graph — read volume IOs are up much longer than actual reading
From Aurora RDS documentation
VolumeReadIOPs
The number of billed read I/O operations from a cluster volume within a 5-minute interval.
Billed read operations are calculated at the cluster volume level, aggregated from all instances in the Aurora DB cluster, and then reported at 5-minute intervals. The value is calculated by taking the value of the Read operations metric over a 5-minute period. You can determine the amount of billed read operations per second by taking the value of the Billed read operations metric and dividing by 300 seconds. For example, if the Billed read operations returns 13,686, then the billed read operations per second is 45 (13,686 / 300 = 45.62).
You accrue billed read operations for queries that request database pages that aren't in the buffer cache and must be loaded from storage. You might see spikes in billed read operations as query results are read from storage and then loaded into the buffer cache.
Imagine AWS reports these data points every 5 minutes:
[100,150,200,70,140,10]
and you use the Sum statistic over a 15-minute period, like you had in the image.
First, the "number" visualization represents the whole selected duration, aggregated into a single value, which here would be the total of (100+150+200+70+140+10).
The "line" visualization represents all of the aggregated groups, which in this case would be 2 points: (100+150+200) and (70+140+10).
It can be a little bit hard to understand at first if you are not used to data points and aggregations, so I suggest that you set your "line" chart to Sum over 5-minute periods; you then take the value of each point, divide by 300 as suggested by the doc, and sum them all.
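As a rough illustration of the aggregation itself, here is a small plain-Python sketch using the made-up 5-minute samples above and the USD 0.22 per million IOs price from the question:

# Hypothetical billed-IO samples reported every 5 minutes (from the example above).
samples = [100, 150, 200, 70, 140, 10]

# "Number" view with the Sum statistic over the whole selected duration: one value.
number_view = sum(samples)                                   # 670

# "Lines" view with the Sum statistic over 15-minute periods: one point per period.
group = 3                                                    # 3 x 5-minute samples per period
lines_view = [sum(samples[i:i + group]) for i in range(0, len(samples), group)]
# -> [450, 220]; the points sum to the same 670 total as the "Number" view.

price_per_million_ios = 0.22
print(number_view, lines_view, number_view * price_per_million_ios / 1_000_000)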
Added images for easier visualization

How are confidence scores calculated in AWS SageMaker GroundTruth?

AWS's SageMaker/GroundTruth Labelling jobs return a confidence score for each human-annotated label.
However, the score is not a direct function of the responses of the N workers who labeled the task.
For example, on tasks where all three workers assign different labels, the score varies (0.61, 0.55, 0.68), and where 2 out of 3 agree, the score also varies (0.95, 0.91).
"Automated data labelling" is disabled, which indicates that all items are labeled by a human, rather than being fully/partially automatically classified.
How does AWS calculate these confidence scores?
I can't find the details, so I'm leaving this question open in the hope of a real answer, but here is what I've been able to find out so far:
Each labelling job has an AnnotationConsolidationConfig parameter which lets you control how the confidence score is calculated, using an AWS Lambda function.
The default for single-image classification is described as:
"a variant of the Expectation Maximisation approach. It estimates parameters for each worker and uses Bayesian inference to estimate the true class based on the class annotations from individual workers."
However, it appears that regular AWS users are not able to view the function itself due to lack of permissions.
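For reference, this is roughly where that parameter sits when creating a labelling job with boto3. Everything below is a sketch with placeholder names, paths and ARNs (not a working job definition); only the shape of the call and the AnnotationConsolidationConfig field reflect the actual API.

import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_labeling_job(
    LabelingJobName="my-image-classification-job",                 # placeholder
    LabelAttributeName="label",
    InputConfig={"DataSource": {"S3DataSource": {
        "ManifestS3Uri": "s3://my-bucket/manifest.json"}}},         # placeholder
    OutputConfig={"S3OutputPath": "s3://my-bucket/output/"},        # placeholder
    RoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",       # placeholder
    HumanTaskConfig={
        "WorkteamArn": "arn:aws:sagemaker:...:workteam/...",        # placeholder
        "UiConfig": {"UiTemplateS3Uri": "s3://my-bucket/template.liquid"},
        "PreHumanTaskLambdaArn": "arn:aws:lambda:...",              # AWS-provided, per task type
        "TaskTitle": "Classify the image",
        "TaskDescription": "Pick the single best label",
        "NumberOfHumanWorkersPerDataObject": 3,
        "TaskTimeLimitInSeconds": 300,
        # This is the parameter mentioned above: the consolidation Lambda that
        # merges the N worker annotations and produces the confidence score.
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:..."  # AWS-provided ACS-* function
        },
    },
)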

Google Cloud Datastore - Delete pattern

I'm using Google Datastore to store a lot of objects, millions of them. At some point I no longer want to keep storing rows in the database. The deletion criterion: delete all rows older than 10 days.
I saw that Google provides two options for this job:
Send delete commands in batches. Of course, you would have to GET all the IDs first. That sounds like a very slow approach when you have to remove millions of rows, and it's also expensive.
Use the Google Dataflow product, which provides an option to bulk-delete from Datastore. The problem here is simply the price, which is high.
The problem with both options above is the pricing. I calculated that deleting 16M rows in a month would cost $480 (Datastore read operations + delete operations), which is too much money for a small task. On top of that you have to add the Dataflow operation costs.
It seems that there is no cheap option to delete data from Datastore. Am I wrong?
You don't have to read entities in order to delete them. Deletes are based on keys, so all you need to do is identify the keys. For this you can run a keys-only query, which is much cheaper (just one operation for the entire projection, although there may be a limit on how many keys can be fetched at a time with a projection query).
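A minimal sketch of that approach with the Python client library; the kind name ("MyKind") and the "created" timestamp property are hypothetical stand-ins for whatever expresses "older than 10 days" in your schema.

from datetime import datetime, timedelta, timezone
from google.cloud import datastore

client = datastore.Client()
cutoff = datetime.now(timezone.utc) - timedelta(days=10)

query = client.query(kind="MyKind")
query.add_filter("created", "<", cutoff)   # hypothetical timestamp property
query.keys_only()                          # fetch only keys, not full entities

keys = [entity.key for entity in query.fetch()]

# Delete in batches; a single commit accepts up to 500 keys.
for i in range(0, len(keys), 500):
    client.delete_multi(keys[i:i + 500])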
Also, how did you compute $480? As per
https://cloud.google.com/datastore/pricing
for a Multi-region, it costs $0.06 for 100,000 reads and $0.02 for 100,000 deletes. Using these numbers, I get the following for 16M.
16*10^6 * ( (1/1000) * 0.06/10^5 + 0.02 / 10^5) = $3.2096
Here the 1/1000 factor reflects a single read operation for every 1,000 keys read using a keys-only query.
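The same arithmetic as a quick check in Python (same prices and the same one-read-per-1,000-keys assumption):

entities = 16_000_000
read_cost = (entities / 1000) * 0.06 / 100_000    # keys-only query: ~1 read op per 1,000 keys
delete_cost = entities * 0.02 / 100_000           # one delete op per entity
print(read_cost + delete_cost)                    # ~3.2096 (USD)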

Auto labeling for Text Data with Amazon Sagemaker ground truth

What is the minimum number of text rows needed for Ground Truth to do auto-labelling? I have a text file which contains 1,000 rows; is this good enough to get started with auto-labelling by SageMaker Ground Truth?
I'm a product manager on the Amazon SageMaker Ground Truth team, and I'm happy to help you with this question. The minimum system requirement is 1,000 objects. In practice with text classification, we typically see meaningful results (% of data auto-labeled) only once you have 2,000 to 3,000 text objects. Remember performance is variable and depends on your dataset and the complexity of your task.
From the documentation,
You should use automated data labeling only on large datasets. The neural networks used with active learning require a significant amount of data for every new dataset. With larger datasets there is more potential to automatically label the data and therefore reduce the total cost of labeling. We recommend that you use thousands of data objects when using automated data labeling. You must use at least 5,000 data objects
https://docs.aws.amazon.com/sagemaker/latest/dg/sms-automated-labeling.html

How to use Amazon MWS to indicate two different shipping times on items?

I have a bit of a unique problem here. I currently have two warehouses that I ship items out of for selling on Amazon, my primary warehouse and my secondary warehouse. Shipping out of the secondary warehouse takes significantly longer than shipping from the main warehouse, hence why it is referred to as the "secondary" warehouse.
Some of our inventory is split between the two warehouses. Usually this is not a problem, but we keep running into one particular issue. Allow me to explain:
Let's say that I have 10 red cups in the main warehouse, and an additional 300 in the secondary warehouse. Let's also say it's Christmas time, so I have all 310 listed. However, from what I've seen, Amazon only allows one shipping time to be listed for the inventory, so the entire 310 get listed under the primary warehouse's shipping time (2 days), with no account taken of the secondary warehouse's ship time, rather than being split the way they should be: 10 at 2 days and 300 at 15 days.
The problem comes in when someone orders an amount that would have to be split across the two warehouses, such as if someone were to order 12 of said red cups. The first 10 would come out of the primary warehouse, and the remaining two would come out of the secondary warehouse. Due to the secondary warehouse's shipping time, the remaining two cups would have to be shipped out at a significantly different date, but Amazon marks the entire order as needing to be shipped within those two days.
For a variety of reasons, it is not practical to keep all of one product in one warehouse, nor is it practical to increase the secondary warehouse's shipping time. Changing the overall shipping date for the product to the longest ship time causes us to lose the buy box for the listing, which really defeats the purpose of us trying to sell it.
So my question is this: is there some way in MWS to indicate that the inventory is split up in terms of shipping times? If so, how?
Any assistance in this matter would be appreciated.
Short answer: No.
There is no way to specify two values for FulfillmentLatency, just as there is no way to specify two values for the quantity in stock. You can only ever have one inventory with them (plus FBA stock).
Longer answer: You could.
Sign up twice with Amazon:
"MySellerName" has an inventory of 10 and a fulfillment latency of 2 days
"MySellerName Overseas Warehouse" has an inventory of 300 and a fulfillment latency of 30 days
I haven't tried it, but I believe Amazon will then automatically direct the customer to the best seller for them, which should be "MySellerName" for small orders and "MySellerName Overseas Warehouse" for larger quantities.