How to set a GCP Cloud Monitoring (Stackdriver) alert policy period greater than 24 hours? - google-cloud-platform

Currently, 24 hours is the maximum period that can be set for a Cloud Monitoring (formerly Stackdriver) alert policy.
However, if you have a daily activity, like a database backup, it might take slightly more or less time each day (e.g. 1 hour 10 minutes one day, 1 hour 12 minutes the next). In that case, you might not see your completion indicator until 24 hours and 2 minutes after the prior indicator, and Cloud Monitoring will issue an alert (because you are 2 minutes over the alerting window limit).
Is there a way to better handle the variance in these alerts, such as a 25-hour look-back period?

Currently, there is no way to increase the period beyond 24 hours.
However, there is already a Feature Request open for this.
You can follow it at this public link [1].
Cheers,
[1] https://issuetracker.google.com/175703606

I found a workaround for this problem:
Create a metric for when your job starts (e.g. started_metric).
Create a metric for when your job finishes (e.g. completed_metric).
Now create a two-part alert policy:
1. Require that started_metric occurs at least once per 24 hours.
2. Require that completed_metric occurs at least once per 24 hours.
Trigger only if both (1) and (2) are violated (i.e. both metrics have been absent for more than 24 hours).
This works around the 24-hour job jitter issue: the job might take more than 24 hours to complete, but it should always start (e.g. via cron) within 24 hours.
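A rough sketch of what that two-condition policy could look like with the google-cloud-monitoring Python client, assuming the two metrics are log-based metrics named started_metric and completed_metric; the project ID and filters are placeholders you would adapt:

```python
# Sketch only: builds an alert policy with two metric-absence conditions
# combined with AND, so it fires only when BOTH metrics have been absent
# for 24 hours. Project ID and metric names are placeholders.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

project_id = "my-project"  # placeholder
client = monitoring_v3.AlertPolicyServiceClient()


def absence_condition(display_name: str, metric_name: str) -> monitoring_v3.AlertPolicy.Condition:
    # Assumes a user-defined log-based metric; adjust the filter to match
    # however your start/completion metrics are actually recorded.
    return monitoring_v3.AlertPolicy.Condition(
        display_name=display_name,
        condition_absent=monitoring_v3.AlertPolicy.Condition.MetricAbsence(
            filter=f'metric.type = "logging.googleapis.com/user/{metric_name}"',
            duration=duration_pb2.Duration(seconds=86400),  # 24 h, the current limit
        ),
    )


policy = monitoring_v3.AlertPolicy(
    display_name="Daily job did not start AND did not complete",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
    conditions=[
        absence_condition("started_metric absent for 24h", "started_metric"),
        absence_condition("completed_metric absent for 24h", "completed_metric"),
    ],
)

created = client.create_alert_policy(
    name=f"projects/{project_id}", alert_policy=policy
)
print(created.name)
```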

Related

Do I have to wait for 30 minutes when an on-demand DynamoDB table is throttled?

I am using an on-demand DynamoDB table and I have read the doc https://aws.amazon.com/premiumsupport/knowledge-center/on-demand-table-throttling-dynamodb/. It says: "You might experience throttling if you exceed double your previous traffic peak within 30 minutes." This suggests DynamoDB adjusts the RCU/WCU based on the last 30 minutes.
Let's say my table is throttled. Do I have to wait up to 30 minutes until the table adjusts its RCU/WCU? Or does the table update the RCU immediately, or within a few minutes?
The reason I am asking is that I'd like to add a retry in my application code whenever a DB action is throttled. How should I choose the sleep interval between retries?
Capacity on an on-demand table is always managed to support double any previous peak throughput, but if you grow faster than that, the table will add physical capacity (physical partitions).
When DynamoDB adds partitions it can take between 5 minutes and 30 minutes for that capacity to be available for use.
It has nothing to do with RCUs/WCUs because On Demand tables don't have capacity units.
Note: You may stay throttled if you've designed a hot partition key in either the base table or a GSI.
During the throttle period, requests are still getting handled (and handled at a good rate). It's like seeing a line at the grocery store checkout: you get in line. Don't design the code to come back in 30 minutes hoping there's no line after checkers have been added. The grocery store will be "adding checkers" when it notices the load is high, but it also keeps processing the existing work.
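If you do add retries in your application code, rather than hand-rolling sleep intervals you can lean on boto3's built-in retry modes, which back off exponentially while keeping requests flowing; a minimal sketch (the table name and item are placeholders):

```python
# Sketch: let botocore retry throttled DynamoDB calls with exponential
# backoff instead of waiting a fixed 30 minutes. Table name is a placeholder.
import boto3
from botocore.config import Config

config = Config(
    retries={
        "max_attempts": 10,   # total attempts, including the first call
        "mode": "adaptive",   # client-side rate limiting + exponential backoff
    }
)

dynamodb = boto3.client("dynamodb", config=config)

dynamodb.put_item(
    TableName="my-table",  # placeholder
    Item={"pk": {"S": "example-key"}, "payload": {"S": "example-value"}},
)
```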

CloudWatch Logs Insights query scans very slowly

I'm looking for help with an issue I'm struggling with.
I have created a new log group in CloudWatch, created a few streams, and written up to 1500 log events in total via the PutLogEvents API. When I run a simple query to return just the timestamp and message, the query takes about 20-25 seconds to scan all 1500 events, which is slow compared to other log groups. The scan rate fluctuates, starting at 300 records per second and dropping to 60 records per second.
In other log groups with an even larger size (e.g. 10,000+ events), the same query took about 5 seconds. I'm clueless as to why the query takes such a long time to scan.
Any assistance is highly appreciated, TIA.

AWS Personalize in Cost Explorer

I am using 4 dataset groups, for example:
Movies
Mobile
Laptops
AC
In each dataset group, we have 3 datasets named Users, Item and Item_User_INTERACTIONS.
We also have one solution and one campaign for each dataset group.
I am also sending real-time events to AWS Personalize using the API (PutEvents).
The above cost me about 100 USD in two days, showing 498 TPS-hours used, and I am unable to find the real reason for such a high cost.
Or does AWS Personalize simply cost this much?
As your billing shows, you have used 498 TPS-hours; let's check whether that should come to $100.
According to official Amazon Personalize pricing:
https://aws.amazon.com/personalize/pricing/
For the first 20K TPS-hours per month, you pay $0.20 per TPS-hour.
You have used 498 TPS-hours in two days, which gives us:
$0.20 * 498 = $99.60 in total.
The answer is: yes, it's expensive.
Another question is:
How is TPS usage calculated?
You are charged for each TPS that is currently reserved. So if you have a campaign with 1 TPS and it exists for 24 hours, you will be charged 24 h x 1 TPS = 24 TPS-hours = $4.80.
The problem is that $0.20 doesn't look expensive, but once you multiply it by hours it becomes very expensive.
For testing purposes you should always set the TPS to 1, since you cannot set it to 0. 1 TPS still allows you to get 3600 recommendations per hour, which is a lot anyway.
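As a quick sanity check, the arithmetic above in a few lines of Python (the 498 TPS-hours figure comes from the question; the $0.20 rate is the first pricing tier quoted above):

```python
# Reproduce the TPS-hour cost arithmetic from the answer above.
PRICE_PER_TPS_HOUR = 0.20            # first 20K TPS-hours per month

tps_hours_used = 498                 # from the question's billing
print(PRICE_PER_TPS_HOUR * tps_hours_used)   # -> 99.6 USD

# A 1-TPS campaign left running for 24 hours:
print(1 * 24 * PRICE_PER_TPS_HOUR)           # -> 4.8 USD

# A 1-TPS campaign left running for a 30-day month:
print(1 * 24 * 30 * PRICE_PER_TPS_HOUR)      # -> 144.0 USD
```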
The reason for such a high price is the created campaign, which exists and is therefore running (this part of AWS Personalize uses more resources than uploading data to S3 or training a model; it is billed on the TPS-hours per month metric).
E.g. suppose you uploaded a dataset with 100,000 rows:
Training will cost you about $0.24 * 2 ≈ $0.50 (assuming training took 2 hours).
Uploading to S3 and creating the dataset: almost free.
A campaign that allows 1 request per second will cost $0.20 * 24 * 30 = $144 per month.
If in production you set a campaign to support 20 requests per second, that will be $2,880 per month.
So definitely, if these are your first steps with AWS Personalize, create campaigns that support only 1 request per second, and make sure you delete unused resources promptly.
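For example, with boto3 you can pin a test campaign to 1 TPS when creating it and delete it as soon as you are done; the ARNs below are placeholders:

```python
# Sketch: keep test campaigns at the 1 TPS minimum and clean them up promptly.
# The solution version ARN (and therefore the campaign ARN) is a placeholder.
import boto3

personalize = boto3.client("personalize")

# Create a campaign that reserves only the minimum throughput.
response = personalize.create_campaign(
    name="test-campaign",
    solutionVersionArn="arn:aws:personalize:...:solution/my-solution/version-id",
    minProvisionedTPS=1,   # 1 TPS ~= up to 3600 reserved requests per hour
)
campaign_arn = response["campaignArn"]

# ... run your tests ...

# Delete the campaign as soon as it is no longer needed to stop TPS-hour charges.
personalize.delete_campaign(campaignArn=campaign_arn)
```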
In the case of the SIMS recipe, there is another way that might save you some money: check how much it would cost to just retrain the model every 3 days, for example, and create batch recommendations for your items. Using this strategy we now spend only $50 per month per e-shop instead of $1,000 per month.
You can find more details in the AWS docs.

How to control the number of parallel job runs in AWS Batch?

AWS Batch supports up to 10,000 jobs in one array. But what if each job writes to DynamoDB? The write rate needs to be controlled in that situation. How can that be done? Is there a setting to keep only N jobs in the running state and not launch the others?
The easiest way would be to send the DynamoDB writes to an SQS queue and have workers/Lambdas poll this queue at a rate you specify. That is the classic approach to rate limiting in the AWS world. I would do some calculations as to what this rate should be in capacity units and configure your table's capacity accordingly, together with the queue polling rate.
Keep in mind that other processes may be accessing your DynamoDB table and using up its capacity, and note the retention time of the queue you set up. You may also benefit immensely, speed- and cost-wise, from some caching for read jobs; have a look at DAX for that.
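A rough sketch of such a worker, with the queue URL, table name, and the 10 writes/second cap all being placeholders you would derive from your own capacity numbers:

```python
# Sketch: drain an SQS queue and write to DynamoDB at a capped rate,
# instead of letting 10,000 Batch array jobs hit the table at once.
# Queue URL, table name, and the writes/second cap are placeholders.
import json
import time

import boto3

sqs = boto3.client("sqs")
dynamodb = boto3.resource("dynamodb")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/batch-writes"  # placeholder
table = dynamodb.Table("my-table")  # placeholder
WRITES_PER_SECOND = 10  # derive this from your table's provisioned capacity

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling
    )
    for msg in resp.get("Messages", []):
        # Each message body is assumed to be a JSON item destined for the table.
        table.put_item(Item=json.loads(msg["Body"]))
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
        time.sleep(1.0 / WRITES_PER_SECOND)  # crude rate limit
```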
Edit: just to address your comments. As you say, if you have 20 units for your table, you can only execute 10 jobs per second if each job uses 2 units in 1 second. Say you submit 10,000 jobs; at 10 jobs a second that will take 1,000 seconds to process them all. If, however, you submit more than 3,456,000 jobs, it will take more than 4 days to process them at 10 jobs a second. The default retention time for SQS is 4 days, so you would start losing messages/jobs at that rate.
And as I mentioned, you could have other processes accessing your table that could push its usage past 20 units, so you will need to be very careful when approaching your table's limit.

AWS: Execute a task after 1 year has elapsed

Basically, I have a web service that receives a small JSON payload (an event) a few times per minute, say 60. This event must be sent to an SQS queue only after 1 year has elapsed (it's OK for it to happen a few hours sooner or later, but the day of the month should be exactly the same).
This means I'll have to store more than 31 million events somewhere before the first one should be sent to the SQS queue.
I thought about using SQS message timers, but they have a limit of only 15 minutes, and as pointed out by Charlie Fish, it's weird to have an element lurking around on a queue for such a long time.
A better possibility could be to schedule a lambda function using a Cron expression for each event (I could end up with millions or billions of scheduled lambda functions in a year, if I don't hit an AWS limit well before that).
Or I could store these events on DynamoDB or RDS.
What would be the recommended / most cost-effective way to handle this using AWS services? Scheduled lambda functions? DynamoDB? PostgreSQL on RDS? Or something entirely different?
And what if I have 31 billion events per year instead of 31 million?
I cannot afford to lose ANY of those events.
DynamoDB is a reasonable option, as is RDS; SQS for long-term storage is not a good choice. However, if you want to keep your costs down, I might suggest another approach: accumulate the events for a single 24-hour period (or a smaller interval if that is desirable), and write that set of data out as an S3 object instead of keeping it in DynamoDB. You could employ DynamoDB or RDS (or just about anything else) as a place to accumulate events for the day (or hour) before writing that data to S3 as a single set of data for the interval.
Each S3 object could be named appropriately, either indicating the date/time it was created or the date/time it needs to be used, e.g. 20190317-1400 to indicate that on March 17th, 2019 at 2 PM this file needs to be used.
I would imagine a Lambda function, called by a CloudWatch Events rule triggered every 60 minutes, that scans your S3 bucket looking for files that are due to be used, reads in the JSON data, puts the events into an SQS queue for further processing, and moves the processed S3 object to another 'already processed' bucket.
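A hedged sketch of what that hourly Lambda could look like, where the bucket names, the queue URL, and the key naming scheme (the YYYYMMDD-HHMM due time suggested above) are all assumptions:

```python
# Sketch of the hourly Lambda: find S3 objects whose key (e.g. "20190317-1400")
# is due, push each stored event to SQS, then move the object to a
# "processed" bucket. Bucket names and queue URL are placeholders.
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

PENDING_BUCKET = "my-pending-events"      # placeholder
PROCESSED_BUCKET = "my-processed-events"  # placeholder
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/due-events"  # placeholder


def handler(event, context):
    now_key = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=PENDING_BUCKET):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key > now_key:  # keys are named by due time, so a lexical compare works
                continue
            body = s3.get_object(Bucket=PENDING_BUCKET, Key=key)["Body"].read()
            for item in json.loads(body):  # one object holds a day/hour of events
                sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(item))
            # "Move" = copy then delete; S3 has no atomic move.
            s3.copy_object(
                Bucket=PROCESSED_BUCKET,
                Key=key,
                CopySource={"Bucket": PENDING_BUCKET, "Key": key},
            )
            s3.delete_object(Bucket=PENDING_BUCKET, Key=key)
```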
Your storage costs would be minimal (especially if you batch them up by day or hour), S3 has 11 9's of durability, and you can archive older events off to Glacier if you want to keep them around even after they are processed.
DynamoDB is a great product; it provides redundant storage and super high performance. But I see nothing in your requirements that would warrant incurring that cost or requiring the performance of DynamoDB, and why keep millions of records in an 'always on' database when you know in advance that you don't need to use or see the records until a year from now?
I mean, you could store some form of the data in DynamoDB and run a daily Lambda task to query for all the items that are more than a year old, remove them from DynamoDB, and import them into SQS.
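A rough sketch of that daily task, assuming a hypothetical table keyed by pk with a numeric due_at attribute (epoch seconds) and the payload stored as a JSON string; in practice you would probably want a GSI or date-bucketed keys rather than a full scan:

```python
# Sketch: daily job that finds events whose 1-year delay has elapsed,
# pushes them to SQS, and removes them from DynamoDB.
# Table name, queue URL, and the "pk"/"due_at"/"payload" attributes are assumptions.
import time

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")

table = dynamodb.Table("delayed-events")  # placeholder
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ready-events"  # placeholder


def handler(event, context):
    now = int(time.time())
    scan_kwargs = {"FilterExpression": Attr("due_at").lte(now)}
    while True:
        resp = table.scan(**scan_kwargs)
        for item in resp["Items"]:
            # "payload" is assumed to already be a JSON string.
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=item["payload"])
            table.delete_item(Key={"pk": item["pk"]})
        if "LastEvaluatedKey" not in resp:
            break
        scan_kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
```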
As you mentioned, SQS doesn't have this functionality built in, so you need to store the data using some other technology. DynamoDB seems like a reasonable choice based on what you have mentioned above.
Of course, you also have to think about whether doing a cron task once per day is sufficient for your task. Do you need it to be exactly after 1 year? Is it acceptable to have it be one year and a few days? Or one year and a few weeks? What window is acceptable for importing into SQS?
Finally, the other question you have to think about is whether SQS is even reasonable for your application. Having a queue with a 1-year delay seems kind of strange. I could be wrong, but you might want to consider something besides SQS, because SQS is meant for much more immediate tasks. See the examples on this page (decouple live user requests from intensive background work: let users upload media while resizing or encoding it; allocate tasks to multiple worker nodes: process a high number of credit card validation requests; etc.). None of those examples involve anything close to a year of wait time before executing. At the end of the day it depends on your use case, but off the top of my head I can't think of a situation where it makes sense to delay entry into an SQS queue for a year. There seem to be much better ways to handle this, but again, I don't know your specific use case.
EDIT: Another question is whether your data is consistent. Is the amount of data you need to store consistent? How about the format? What about the number of events per second? You mention that you don't want to lose any data, so for sure build in error handling and backup systems. But DynamoDB doesn't scale the best if one moment you store 5 items and the next moment you want to store 5 million items. If you set your capacity to account for 5 million, then it is fine. But the question is whether the amount of data and the frequency will be consistent or not.