How control parallel job runs count in AWS batch? - amazon-web-services

Aws batch supports up to 10000 job in one array. But what if each job writes to DynamoDb? It is needed to control rate in this situation. How to do that? Is there a setting to keep only N job in the running state and do not launch others?

Easiest way would be to send DyanmoDB jobs to an SQS queue, and have workers/lambdas poll this queue at a rate you specify. That is the classic approach to rate-limiting in AWS world. I would do some calculations as to what rate this should be in capacity units and configure your Tables' capacity accordingly with the queue polling rate.
Keep in mind that you may have other processes accessing your DynamoDB using up your Table's capacity as well as noting the retention time of the queue you setup. You may benefit immensely speed and cost wise with some caching implemented for read jobs, have a look at DAX for that.
Edit Just to address your comments. So as you say if you have 20 units for your table, you can only execute 10 jobs per second if each job is using 2 units in 1 second. Say you submit 10,000 jobs, at 10 jobs a second that will be 1,000 seconds to process all those jobs. If, however you submit more than 3,456,000 jobs, that will take more than 4 days to process at 10 jobs a second. The default retention time for SQS is 4 days, so you would start losing messages/jobs at this rate.
And as I mentioned you could have other processes accessing your table which could blow it's usage past 20 units, so you will need to be very careful when approaching your Table's limit.

Related

Dynamodb streams: small number of items per batch

I have a very large dynamodb table, and I want to use lambda function triggered by a stream. I would like to work in big batches, of at least 1000 items. But when I connect the lambda, I see it is invoked with tiny batches of 1 or 2 items. I increased the window to 15 seconds, and it doesn't help.
I assume it's because the table has a lot of shards, and every batch gathers items from only one shard. Is this correct?
What can be done in order to increase the batch size?
I wrote a deep-dive blog post about the integration of DynamoDB Streams an Lambda (disclaimer, written by me on the company blog - very relevant to the question) - the images are taken from there.
DynamoDB Streams consist of shards that store a record of changes sequentially. Each storage partition in the table maps to at least one shard of a DynamoDB stream. The shards get split if a shard is full or the throughput is too high.
Conceptually, this is how the Lambda Service polls the stream shards:
Crucially, polling the shards happens in parallel, but batching is always per shard in order to maintain the order of changes and have consistent scale-out behavior.
This diagram shows how the configuration options in the event source mapping influence how processing happens.
Let's focus on your situation. If you have a large number of items, and relatively high throughput, chances are that DynamoDB allocates many storage partitions to handle that throughput. That automatically leads to a large number of stream shards (#shards >= #storage_partitions).
If your changes are well distributed over the table (which is what you want to distribute the load evenly), this means there aren't many changes written to any single shard at any point in time. So for a batch window of a few seconds (15 in your case), the actual batch size may be low. If the changes are focused on some partitions, you should see a relatively high variance in the batch size (unfortunately, there's no metric for it afaik).
The only thing you can control directly here (without larger architectural changes) is the batch window. If you increase that, you should see larger batch sizes at the expense of higher processing latency.
You could consider having a lambda function write these changes to a kinesis firehose delivery stream, configure it to write records in batches to S3, and have another Lambda respond to objects written to S3. This would increase your latency again, but allows for much larger batch sizes.
(I also considered writing to SQS, but the max batch size you can request from there is 10.)

Do I have to wait for 30 minutes when a on-demand Dynamodb table is throttled?

I am using on-demand Dynamodb table and I have read the doc https://aws.amazon.com/premiumsupport/knowledge-center/on-demand-table-throttling-dynamodb/. It says You might experience throttling if you exceed double your previous traffic peak within 30 minutes. It means Dynamodb adjust the RCU/WCU based on the last 30 minutes.
Let's say my table is throttled, do I have to wait for maximum 30 minutes until the table adjust its RCU/WCU? Or does the table update RCU immediately? or in a few minutes?
The reason I am asking is that I'd like to put a retry on my application code to retry the DB action whenever there is a throttle. How can I add sleep interval between the retry?
Capacity is always managed with an On Demand table to support double any previous peak throughput, but if you grow faster than that, the table will add physical capacity (physical partitions).
When DynamoDB adds partitions it can take between 5 minutes and 30 minutes for that capacity to be available for use.
It has nothing to do with RCUs/WCUs because On Demand tables don't have capacity units.
Note: You may stay throttled if you've designed a hot partition key in either the base table or a GSI.
During the throttle period requests are still getting handled (and handled at a good rate). Just like if you see a line at the grocery store check out, you get in line. Don't design the code to come back in 30 minutes hoping there's no line after adding checkers. The grocery store will be "adding checkers" when it notices the load is high, but it also keeps the existing work processing.

Why is it possible to go beyond DynamoDB burst capacity?

I have created a DynamoDB table with 1 RCU (manual provisioned capacity).
I have inserted some items to read in that table.
I can launch a scan on my table (which consumes 82 RCUs according to the response).
I understand this is possible because of the burst capacity.
What I don't understand though, is why am I able to keep consuming huge numbers of RCUs for long periods of time.
As you can see on this screenshot, despite the RCU being 1, I have been
consuming around 150 or 200 RCU per minute for more than 1 hour (we can barely see the 1 RCU red line at the bottom).
Why is that? (some of the requests are of course throttled but why so little ?)
How much data do you have in that table?
When you try scan operation from console, it will read items from the table that will consume RCUs.
There are options to configure baseline read/write capacity units and enable autoscaling if you expect variable reads/write requests. If the load starts to increase, dynamo db service will gradually scale to fulfil those requests instead of throttling.
https://aws.amazon.com/blogs/database/amazon-dynamodb-auto-scaling-performance-and-cost-optimization-at-any-scale/

DynamoDB write is too slow

I use Lambda to read from a JSON Api and write in DynamoDB via http request. The JSON Api is very big (has 200k objects) and my function is extremely slow with writing to DynamoDB. I used the regular write function and after 10 min I could only populate 5k rows in my DynamoDB table. I was thinking about using BatchWriteItem but since it can only do 25 puts in one batch, it would still take too much time to write all 200k rows. Is there any better solution?
This will be because you're being throttled.
For Lambda
There are a maximum number of concurrent invocations of Lambdas that can be running at a time, the default limit is 1000 concurrent requests.
If you have more than 1000 concurrent requests at the same time you will need to reach out to AWS Support to increase this, you will also need to provide a business use case for why it needs to support this.
For DynamoDB
Whether you use batch or single PutItem your DynamoDB table is configured with a number of WCU (Write Credit Units) and RCU (Read Credit Units).
A single write credit unit covers 1 write of an item 1Kb or less (every extra kb is another unit). If you exceed this you will start to be throttled for write requests, if you're using the SDK it may use exponential backoff as well to keep attempting to write.
As a solution for this you should do one of the following:
If this is a one time process you can adjust the WCU as a fixed number, then wait 5 minutes for it to increase and then scale down.
If this is a natural flow on your app then enable DynamoDB autoscaling to naturally increase and decrease throughout the day
In addition look at your data modelling as this can lead to throttling too.
In extreme cases, throttling can occur if a single partition receives more than 3,000 RCUs or 1,000 WCUs

AWS: Execute a task after 1 year has elapsed

Basically, I have a web service that receives a small json payload (an event) a few times per minute, say 60. This event must be sent to an SQS queue only after 1 year has elapsed (it's ok to have it happen a few hours sooner or later, but the day of month should be exactly the same).
This means I'll have to store more than 31 million events somewhere before the first one should be sent to the SQS queue.
I thought about using SQS message timers, but they have a limit of only 15 minutes, and as pointed out by #Charlie Fish, it's weird to have an element lurking around on a queue for such a long time.
A better possibility could be to schedule a lambda function using a Cron expression for each event (I could end up with millions or billions of scheduled lambda functions in a year, if I don't hit an AWS limit well before that).
Or I could store these events on DynamoDB or RDS.
What would be the recommended / most cost-effective way to handle this using AWS services? Scheduled lambda functions? DynamoDB? PostgreSQL on RDS? Or something entirely different?
And what if I have 31 billion events per year instead of 31 million?
I cannot afford to loose ANY of those events.
DynamoDB is a reasonable option, as is RDS - SQS for long term storage is not a good choice. However - if you want to keep your costs down, I may suggest another: accumulate the events for a single 24 hour period (or a smaller interval if that is desirable), and write that set of data out as an S3 object instead of keeping it in DynamoDB. You could employ dynamodb or rds (or just about anything else) as a place to accumulate events for the day (or hour) before it then writes out that data to S3 as a single set of data for the interval.
Each S3 object could be named appropriately, either indicating the date/time it was created, or the data/time it needs to be used, i.e. 20190317-1400 to indicate that on March 17th, 2019 at 2PM this file needs to be used.
I would imagine a lambda function, called by a cloudwatch event that is triggered every 60 minutes, scans your s3 bucket looking for files that are due to be used, and it then reads in the json data and puts them into an SQS queue for further processing and moves the processed s3 object to another 'already processed' bucket
Your storage costs would be minimal (especially if you batch them up by day or hour), S3 has 11 9's of durability, and you can archive older events off to Glacier if you want to keep them around even after the are processed.
DynamoDB is a great product, it provides redundant storage, and super high performance - but I see nothing in your requirements to that would warrant incurring that cost or requiring the performance of DynamoDB; and why keep millions of records of data in a 'always on' database when you know in advance that you don't need to use or see the records until a year from now.
I mean you could store some form of data in DynamoDB, and run some daily Lambda task to query for all the items that are greater than a year old, remove those from DynamoDB and import it into SQS.
As you mentioned SQS doesn't have this functionality built in. So you need to store the data using some other technology. DynamoDB seems like a responsible choice based on what you have mentioned above.
Of course you also have to think about if doing a cron task once per day is sufficient for your task. Do you need it to be exactly after 1 year? Is it acceptable to have it be one year and a few days? Or one year and a few weeks? What is the window that is acceptable for importing into SQS?
Finally, the other question you have to think about is if SQS is even reasonable for your application. Having a queue that has a 1 year delay seems kinda strange. I could be wrong, but you might want to consider something besides SQS because SQS is meant for much more instantaneous tasks. See the examples on this page (Decouple live user requests from intensive background work: let users upload media while resizing or encoding it, Allocate tasks to multiple worker nodes: process a high number of credit card validation requests, etc.). None of those examples are really meant for a year of wait time before executing. At the end of the day it depends on your use case, but off the top of my head I can't think of a situation that makes sense for delaying entry into an SQS queue for a year. There seem to be much better ways to handle this, but again I don't know your specific use case.
EDIT another question is if your data is consistent? Is the amount of data you need to store consistent? How about the format? What about the number of events per second? You mention that you don’t want to lose any data. For sure build in error handling and backup systems. But for DynamoDB it doesn’t scale the best if one moment you store 5 items then the next moment you want to store 5 million items. If you set your capacity to account for 5 million then it is fine. But the question is will the amount of data and frequency be consistent or not?