Dataflow resource usage - google-cloud-platform

After following the Dataflow tutorial, I used the Pub/Sub Topic to BigQuery template to parse JSON records into a table. The job has been streaming for 21 days. During that time I have ingested about 5,000 JSON records, each containing 4 fields (around 250 bytes).
After the bill came this month I started to look into resource usage. I have used 2,017.52 vCPU hr, memory 7,565.825 GB hr, Total HDD 620,407.918 GB hr.
This seems absurdly high for the tiny amount of data I have been ingesting. Is there a minimum amount of data I should have before using Dataflow? It seems overpowered for small cases. Is there another preferred method for ingesting data from a Pub/Sub topic? Is there a different configuration when setting up a Dataflow job that uses fewer resources?

The numbers you mention correspond to not customizing the job resources. By default, streaming jobs use an n1-standard-4 machine:
Streaming worker defaults: 4 vCPU, 15 GB memory, 400 GB Persistent Disk.
4 vCPU x 24 hrs x 21 days = 2,016 vCPU hr
15 GB x 24 hrs x 21 days = 7,560 GB hr
If you really need streaming in Dataflow, you will need to pay for resources allocated even if there is nothing to process.
Options:
Optimizing Dataflow
Considering that the number and size of the JSON records you need to process are really small, you can reduce the cost to approximately 1/4 of the current charge. You just need to set the job to use an n1-standard-1 machine, which has 1 vCPU and 3.75 GB memory. Also be careful with the maximum number of workers: unless you are planning to increase the load, one node may be enough.
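As an illustration only, here is a rough Python sketch of launching the same Pub/Sub-to-BigQuery template with a smaller worker through the Dataflow API; the project, region, topic, and table names are placeholders, not values from your setup:

    from googleapiclient.discovery import build

    # Build a client for the Dataflow API (uses Application Default Credentials).
    dataflow = build("dataflow", "v1b3")

    response = dataflow.projects().locations().templates().launch(
        projectId="my-project",            # placeholder project ID
        location="us-central1",            # placeholder region
        gcsPath="gs://dataflow-templates/latest/PubSub_to_BigQuery",
        body={
            "jobName": "pubsub-to-bq-small",
            "parameters": {
                "inputTopic": "projects/my-project/topics/my-topic",
                "outputTableSpec": "my-project:my_dataset.my_table",
            },
            "environment": {
                "machineType": "n1-standard-1",  # 1 vCPU / 3.75 GB instead of the default n1-standard-4
                "maxWorkers": 1,                 # cap autoscaling at a single node
            },
        },
    ).execute()

    print(response["job"]["id"])

If you prefer the command line, gcloud dataflow jobs run accepts --worker-machine-type and --max-workers flags for the same purpose.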
Your own way
If you don't really need streaming (not likely), you can just create a function that pulls messages using synchronous pull and then writes them to BigQuery. You can schedule it according to your needs.
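A minimal sketch of that idea, assuming the google-cloud-pubsub and google-cloud-bigquery client libraries and placeholder project, subscription, and table names:

    import json
    from google.cloud import bigquery, pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    bq = bigquery.Client()

    subscription = subscriber.subscription_path("my-project", "my-subscription")  # placeholder
    table_id = "my-project.my_dataset.my_table"                                   # placeholder

    # Synchronously pull a batch of messages instead of running a streaming pipeline.
    response = subscriber.pull(request={"subscription": subscription, "max_messages": 100})
    rows = [json.loads(msg.message.data) for msg in response.received_messages]

    if rows:
        errors = bq.insert_rows_json(table_id, rows)  # streaming insert into BigQuery
        if not errors:
            # Only acknowledge the messages once they are safely in BigQuery.
            subscriber.acknowledge(request={
                "subscription": subscription,
                "ack_ids": [msg.ack_id for msg in response.received_messages],
            })

You could run this on a schedule with Cloud Scheduler, for instance.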
Cloud functions (my recommendation)
You can create a serverless, event-driven Cloud Function with a Cloud Pub/Sub trigger. This way, considering your low volume, you can take advantage of the Free Tier and keep the real-time processing:
"Cloud Functions provides a perpetual free tier for compute-time resources, which includes an allocation of both GB-seconds and GHz-seconds. In addition to the 2 million invocations, the free tier provides 400,000 GB-seconds, 200,000 GHz-seconds of compute time and 5GB of Internet egress traffic per month."[1]
[1] https://cloud.google.com/functions/pricing
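For instance, a sketch of such a background function (the dataset and table names are placeholders); it can then be deployed with --trigger-topic pointing at your topic:

    # main.py - Pub/Sub-triggered Cloud Function sketch; table name is a placeholder.
    import base64
    import json
    from google.cloud import bigquery

    bq = bigquery.Client()
    TABLE_ID = "my-project.my_dataset.my_table"

    def pubsub_to_bq(event, context):
        """Triggered by each message published to the configured Pub/Sub topic."""
        record = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
        errors = bq.insert_rows_json(TABLE_ID, [record])
        if errors:
            # Raising marks the invocation as failed (and retried, if retries are enabled).
            raise RuntimeError(f"BigQuery insert failed: {errors}")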

Related

AWS Dynamo DB Free Tier Limits

I am new to DynamoDB, and I have a small in-house application which will be used by my parents for their small business. I just have to keep records of 10 to 20 rows daily, and there will be a few edits, 5 to 10 at most.
Will I be able to use the Free Tier of DynamoDB for the same?
I am using Heroku to host my LWC OSS (Node.js) application, which is again a free version. If not, then any heads up on a particular type of database which can fulfil my needs would be appreciated.
Will I be able to use the Free Tier of DynamoDB for the same?
Yes, depending on the size of the data you want to input and the rate at which you want to input it.
Amazon DynamoDB offers a free tier with the following provisions, which is enough to handle up to 200M requests per month:
25 GB of Storage
25 provisioned Write Capacity Units (WCU)
25 provisioned Read Capacity Units (RCU)
Just be aware of the fact that:
25 WCU is 25 writes per second for items up to 1 KB, or 5 writes per second for items up to 5 KB, etc.
25 RCU is 50 (eventually consistent) reads per second for items up to 4 KB, or 10 reads per second for items up to 20 KB, etc.
If your API calls fall within the above criteria, you'll be within the free tier.
The main costed aspects of DynamoDB are how much you read and write to the tables. AWS calls them "Read capacity units" (RCU) and "Write capacity units" (WCU).
When you create a DynamoDB table there are many options to choose from, but it's roughly accurate to say that:
One RCU gives you one strongly consistent read request per second
One WCU gives you one standard write request per second
So if you create a standard class table with 1 RCU and 1 WCU (the lowest possible), that would already easily accommodate what you predict you will need.
According to the AWS DynamoDB pricing page you can get 25 WCUs and 25 RCUs in the free tier.
So I would say: choose a DynamoDB standard class table with provisioned capacity, no auto scaling, and customized to 1 RCU and 1 WCU as below, and your usage will remain well within the free tier.
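As a rough boto3 sketch of that setup (the table and key names are placeholders):

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Standard table class, provisioned capacity, 1 RCU / 1 WCU, no auto scaling attached.
    dynamodb.create_table(
        TableName="small-business-records",                                           # placeholder
        AttributeDefinitions=[{"AttributeName": "record_id", "AttributeType": "S"}],  # placeholder key
        KeySchema=[{"AttributeName": "record_id", "KeyType": "HASH"}],
        BillingMode="PROVISIONED",
        ProvisionedThroughput={"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
        TableClass="STANDARD",
    )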

How to lower Data Transfer costs on an AWS course platform?

I am calculating the operation costs for a platform we want to develop for a client in AWS. The platform is an online course solution where users can subscribe and access different multimedia contents.
My initial thought was to store the videos in an S3 bucket and simply "feed" them to my back end solution so that the front end can access and show them.
My problem is that when doing the cost estimate I am getting huge cost estimates for Outbound Data Transfer. I don't really know how much traffic the client is expecting, so I estimated it the following way (this platform is going to be supported by the state, so it will have some traffic):
20 MB for every minute of video
2 hours per week for every user
200 users per month
20 MB/min * 60 min/h * 2 h/week * 4 weeks * 200 users = 1.92 TB per month
This, at 0.09 USD/GB, gives me 184.23 USD per month...
I don't know if I am not designing a well-made solution, or if my estimate is wrong... but I find this to be very expensive. Adding other costs, this means I have to pay nearly 2 USD per user. If someone knows a way to reduce costs, please let me know!
Thank you

AWS Personalize in Cost Explorer

I am using 4 dataset groups, for example:
Movies
Mobile
Laptops
AC
In each dataset group, we have 3 datasets named Users, Item and Item_User_INTERACTIONS.
We also have one solution and campaign for each dataset group.
I am also sending real-time events to AWS Personalize using the API (putEvent).
The above cost me about 100 USD in two days, showing 498 TPS-hours used, and I am unable to find the real reason for this much cost.
Or does AWS Personalize simply cost this much?
As your billing tells you, you have used 498 TPS-hours; let's check whether that should come to $100.
According to official Amazon Personalize pricing:
https://aws.amazon.com/personalize/pricing/
For the first 20K TPS-hours per month you pay $0.20 per TPS-hour.
You have used 498 TPS-hours in two days, which gives us:
$0.20 * 498 = $99.60 in total.
The answer is: yes, it's expensive.
Another question is:
How is TPS usage calculated?
They charge you for each TPS that is currently reserved. So if you have a campaign with 1 TPS and it exists for 24 hours, you will be charged for 24 h x 1 TPS = 24 TPS-hours = $4.80.
The problem is that $0.20 doesn't look expensive, but once you multiply it by hours it becomes very expensive.
For testing purposes you should always set TPS to 1, since you cannot set it to 0. 1 TPS still allows you to get 3,600 recommendations per hour, which is a lot anyway.
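For example, a boto3 sketch of that practice (the names and ARNs are placeholders):

    import boto3

    personalize = boto3.client("personalize")

    def create_test_campaign(solution_version_arn: str) -> str:
        """Reserve only the 1 TPS minimum; Personalize bills on reserved TPS-hours."""
        response = personalize.create_campaign(
            name="test-campaign",                     # placeholder name
            solutionVersionArn=solution_version_arn,  # placeholder ARN passed in by the caller
            minProvisionedTPS=1,                      # lowest allowed value
        )
        return response["campaignArn"]

    def delete_campaign_when_done(campaign_arn: str) -> None:
        """Run once the campaign is ACTIVE and no longer needed, so it stops accruing TPS-hours."""
        personalize.delete_campaign(campaignArn=campaign_arn)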
The reason for such a high price is the created campaign, which exists and is therefore running (this part of AWS Personalize uses more resources than uploading data to S3 or creating a model; it is billed on the TPS-hours per month metric).
E.g. suppose you uploaded a dataset with 100,000 rows:
Training will cost you about $0.24 * 2 ≈ $0.50 (assuming training took 2 hours)
Uploading to S3 and creating the dataset: almost free
A created campaign that allows 1 request per second will cost $0.20 * 24 * 30 = $144 per month
If in production you set a campaign to support 20 requests per second, it will be $2,880 per month.
So definitely, if these are your first steps with AWS Personalize, only create campaigns that support 1 request per second and make sure you delete unused resources on time.
In the case of the SIMS recipe, there is also another way which might save you some money. Check how much it would cost to just retrain the model every 3 days, for example, and create batch recommendations for your items. Using this strategy we now spend only $50 per month per e-shop instead of $1,000 per month.
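As a rough illustration of that batch approach (all ARNs and S3 paths below are placeholders):

    import boto3

    personalize = boto3.client("personalize")

    # Generate recommendations offline instead of keeping an always-on campaign.
    personalize.create_batch_inference_job(
        jobName="sims-batch-recs",
        solutionVersionArn="arn:aws:personalize:us-east-1:123456789012:solution/sims-solution/2",  # placeholder
        roleArn="arn:aws:iam::123456789012:role/personalize-batch-role",                           # placeholder
        jobInput={"s3DataSource": {"path": "s3://my-bucket/batch/input/items.json"}},              # placeholder
        jobOutput={"s3DataDestination": {"path": "s3://my-bucket/batch/output/"}},                 # placeholder
    )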
You can find more details in the AWS docs.

BigQuery Data Transfer Service benchmarks for Campaign Manager data

There's some good info here on general transfer times over the wire for data to/from various sources.
Besides the raw data transfer time, I am trying to estimate roughly how long it would take to import ~12TB/day into BigQuery using the BigQuery Data Transfer service for DoubleClick Campaign Manager.
Is this documented anywhere?
In the first link you've shared, there's an image that shows the transfer speed (estimated) depending on the network bandwidth.
So let's say you have a bandwidth of 1 Gbps; then the data will be available in your GCP project in ~30 hours, as you are transferring 12 TB, which is close to the 10 TB figure. That makes it about a day and a quarter to transfer.
If you really want to transfer 12 TB/day because you need that data to be available each day, and increasing bandwidth is not a possibility, I would recommend batching the data and creating a different transfer job for each batch. As an example:
Split 12 TB into 12 batches of 1 TB -> 12 transfer jobs of 1 TB each
Each batch will take about 3 hours to complete, so you will have 8 of the 12 TB available each day.
This can be applied to smaller batches of data if you want to have a more fine-grained solution.
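In numbers, using the ~3 hours per 1 TB figure at 1 Gbps from that chart:

    # Back-of-the-envelope check of the batching maths, using ~3 hours per 1 TB at 1 Gbps.
    hours_per_tb_batch = 3
    batches_per_day = 24 // hours_per_tb_batch   # 8 sequential 1 TB batches fit in a day
    tb_available_per_day = batches_per_day * 1   # -> 8 of the 12 TB land within the same day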

How control parallel job runs count in AWS batch?

AWS Batch supports up to 10,000 jobs in one array. But what if each job writes to DynamoDB? The write rate needs to be controlled in this situation. How can I do that? Is there a setting to keep only N jobs in the running state and not launch the others?
The easiest way would be to send the DynamoDB jobs to an SQS queue, and have workers/Lambdas poll this queue at a rate you specify. That is the classic approach to rate limiting in the AWS world. I would do some calculations as to what rate this should be in capacity units and configure your table's capacity accordingly with the queue polling rate.
Keep in mind that you may have other processes accessing your DynamoDB table and using up its capacity, and note the retention time of the queue you set up. You may benefit immensely, speed- and cost-wise, from some caching for read jobs; have a look at DAX for that.
Edit: Just to address your comments. As you say, if you have 20 units for your table, you can only execute 10 jobs per second if each job uses 2 units in 1 second. Say you submit 10,000 jobs; at 10 jobs a second that will take 1,000 seconds to process them all. If, however, you submit more than 3,456,000 jobs, that will take more than 4 days to process at 10 jobs a second. The default retention time for SQS is 4 days, so you would start losing messages/jobs at this rate.
And as I mentioned, you could have other processes accessing your table which could push its usage past 20 units, so you will need to be very careful when approaching your table's limit.
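A minimal sketch of such a rate-limited worker, assuming boto3 and placeholder queue/table names (the throttle value should match the WCU budget you reserve for it):

    import json
    import time
    import boto3

    sqs = boto3.client("sqs")
    table = boto3.resource("dynamodb").Table("jobs-table")                      # placeholder table
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs-queue"   # placeholder queue
    WRITES_PER_SECOND = 10   # e.g. 10 jobs/s at 2 WCU per job stays within a 20 WCU table

    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
        for message in resp.get("Messages", []):
            table.put_item(Item=json.loads(message["Body"]))   # one DynamoDB write per queued job
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
            time.sleep(1 / WRITES_PER_SECOND)                  # crude throttle to cap the write rate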