Improving DynamoDB Write Operation - amazon-web-services

I am trying to call the DynamoDB write operation to write around 60k records.
I have tried setting 1000 write capacity units for provisioned write capacity, but my write operation is still taking a lot of time. Also, when I check the metrics, I can see the consumed write capacity units are only around 10 per second.
My record size is definitely less than 1 KB.
Is there a way to speed up the write operation for DynamoDB?

So here is what I figured out.
I changed my call to use batchWrite, and my consumed write capacity units increased significantly, up to 286 write capacity units.
The complete write operation also finished within a couple of minutes.
As mentioned in the answers above, using putItem to load a large number of records suffers from per-request latency and keeps your consumed capacity low. It is always better to use batchWrite.
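For reference, a minimal sketch of what that change can look like with boto3 (the table name and item shape are placeholders, not from the original post):

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("my-table")          # placeholder table name

    items = [{"pk": str(i), "payload": "..."} for i in range(60000)]   # example records

    # batch_writer() groups puts into BatchWriteItem calls (25 items each)
    # and automatically resends any unprocessed items.
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item=item)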

DynamoDB performance, like that of most databases, is highly dependent on how it is used.
From your question, it is likely that you are using only a single DynamoDB partition. Each partition can support up to 1000 write capacity units and up to 10 GB of data.
However, you also mention that your metrics show only 10 write units consumed per second, which is very low. Check all the metrics visible for the table in the AWS console (there is a Metrics tab for each table under the DynamoDB pages). Check for throttling and any errors, and confirm that consumed capacity stays below provisioned capacity on the charts.
It is possible that there is some other bottleneck in your process.
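If you prefer to check those metrics from code rather than the console, a rough sketch using the CloudWatch API (the table name is a placeholder):

    import boto3
    from datetime import datetime, timedelta

    cloudwatch = boto3.client("cloudwatch")

    # Pull consumed write capacity and throttle events for the last hour.
    for metric in ("ConsumedWriteCapacityUnits", "WriteThrottleEvents"):
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/DynamoDB",
            MetricName=metric,
            Dimensions=[{"Name": "TableName", "Value": "my-table"}],   # placeholder name
            StartTime=datetime.utcnow() - timedelta(hours=1),
            EndTime=datetime.utcnow(),
            Period=60,
            Statistics=["Sum"],
        )
        print(metric, stats["Datapoints"])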

It looks like you can send more requests per second. You can perform more requests, but if you send them one at a time in a loop like this:

    for item in items:
        table.put_item(Item=item)   # one network round trip per item

you need to mind the round-trip latency of each request.
You can use two tricks:
First, upload data from multiple threads/machines.
Second, you can use the BatchWriteItem method, which allows you to write up to 25 items in one request:
The BatchWriteItem operation puts or deletes multiple items in one or more tables. A single call to BatchWriteItem can write up to 16 MB of data, which can comprise as many as 25 put or delete requests. Individual items to be written can be as large as 400 KB.
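As a rough sketch of what that looks like with the low-level API, assuming a simple string key schema (the table name and attributes are placeholders): each call carries at most 25 put requests, and any UnprocessedItems are resent. The multi-thread trick above can wrap this same function.

    import boto3

    client = boto3.client("dynamodb")
    TABLE_NAME = "my-table"                      # placeholder

    def put_batch(chunk):
        # Low-level BatchWriteItem uses typed attribute values and caps at 25 items.
        requests = [
            {"PutRequest": {"Item": {"pk": {"S": item["pk"]}, "payload": {"S": item["payload"]}}}}
            for item in chunk
        ]
        response = client.batch_write_item(RequestItems={TABLE_NAME: requests})
        # Resend anything DynamoDB could not process (e.g. because of throttling).
        while response.get("UnprocessedItems"):
            response = client.batch_write_item(RequestItems=response["UnprocessedItems"])

    items = [{"pk": str(i), "payload": "..."} for i in range(60000)]   # example data
    for i in range(0, len(items), 25):
        put_batch(items[i:i + 25])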

Related

Dynamodb streams: small number of items per batch

I have a very large DynamoDB table, and I want to use a Lambda function triggered by a stream. I would like to work in big batches of at least 1000 items, but when I connect the Lambda, I see it is invoked with tiny batches of 1 or 2 items. I increased the batch window to 15 seconds, and it doesn't help.
I assume it's because the table has a lot of shards, and every batch gathers items from only one shard. Is this correct?
What can be done in order to increase the batch size?
I wrote a deep-dive blog post about the integration of DynamoDB Streams and Lambda (disclaimer: written by me on the company blog, but very relevant to the question).
DynamoDB Streams consist of shards that store a record of changes sequentially. Each storage partition in the table maps to at least one shard of a DynamoDB stream. The shards get split if a shard is full or the throughput is too high.
Conceptually, the Lambda service polls the stream shards in the following way.
Crucially, polling the shards happens in parallel, but batching is always per shard in order to maintain the order of changes and have consistent scale-out behavior.
The configuration options in the event source mapping (batch size and batch window) influence how that processing happens.
Let's focus on your situation. If you have a large number of items, and relatively high throughput, chances are that DynamoDB allocates many storage partitions to handle that throughput. That automatically leads to a large number of stream shards (#shards >= #storage_partitions).
If your changes are well distributed over the table (which is what you want in order to spread the load evenly), this means there aren't many changes written to any single shard at any point in time. So for a batch window of a few seconds (15 in your case), the actual batch size may be low. If the changes are concentrated on a few partitions, you should see relatively high variance in the batch size (unfortunately, there's no metric for it as far as I know).
The only thing you can control directly here (without larger architectural changes) is the batch window. If you increase that, you should see larger batch sizes at the expense of higher processing latency.
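If you want to adjust that from code rather than the console, a sketch with boto3 might look like this (the mapping UUID and numbers are placeholders):

    import boto3

    lambda_client = boto3.client("lambda")

    # UUID of the existing event source mapping between the stream and the function
    # (placeholder value; look it up with list_event_source_mappings).
    lambda_client.update_event_source_mapping(
        UUID="11111111-2222-3333-4444-555555555555",
        BatchSize=1000,                       # upper bound, not a guarantee
        MaximumBatchingWindowInSeconds=30,    # wait longer to accumulate larger batches
    )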
You could consider having a Lambda function write these changes to a Kinesis Data Firehose delivery stream, configure it to write records in batches to S3, and have another Lambda respond to objects written to S3. This would increase your latency again, but it allows for much larger batch sizes.
(I also considered writing to SQS, but the max batch size you can request from there is 10.)
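A minimal sketch of the first hop of that alternative (the stream-triggered Lambda forwarding records to Firehose); the delivery stream name and record shape are assumptions:

    import json
    import boto3

    firehose = boto3.client("firehose")

    def handler(event, context):
        # Forward the raw stream records to a Firehose delivery stream,
        # which buffers them and writes large batches to S3.
        records = [{"Data": (json.dumps(r) + "\n").encode()} for r in event["Records"]]
        for i in range(0, len(records), 500):   # put_record_batch accepts at most 500 records
            firehose.put_record_batch(
                DeliveryStreamName="ddb-changes",   # placeholder stream name
                Records=records[i:i + 500],
            )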

DynamoDB write is too slow

I use Lambda to read from a JSON API and write to DynamoDB via HTTP request. The JSON API is very big (it has 200k objects) and my function is extremely slow at writing to DynamoDB. I used the regular write function, and after 10 minutes I had only populated 5k rows in my DynamoDB table. I was thinking about using BatchWriteItem, but since it can only do 25 puts in one batch, it would still take too much time to write all 200k rows. Is there any better solution?
This will be because you're being throttled.
For Lambda
There is a maximum number of concurrent Lambda invocations that can be running at a time; the default limit is 1000 concurrent executions.
If you have more than 1000 concurrent requests at the same time, you will need to reach out to AWS Support to increase this, and you will also need to provide a business use case for why it needs to be supported.
For DynamoDB
Whether you use batch or single PutItem calls, your DynamoDB table is configured with a number of WCUs (Write Capacity Units) and RCUs (Read Capacity Units).
A single write capacity unit covers one write of an item of 1 KB or less (every extra KB costs another unit). If you exceed this you will start to be throttled on write requests; if you're using the SDK, it may also use exponential backoff to keep attempting the write.
As a solution for this you should do one of the following:
If this is a one-time process, you can set the WCU to a fixed higher number, wait about 5 minutes for the change to take effect, run the load, and then scale back down (see the sketch at the end of this answer).
If this is a natural flow in your app, then enable DynamoDB auto scaling so capacity increases and decreases naturally throughout the day.
In addition, look at your data modelling, as this can lead to throttling too.
In extreme cases, throttling can occur if a single partition receives more than 3,000 RCUs or 1,000 WCUs
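For the one-time-load option above, a rough sketch of bumping and then lowering provisioned throughput with boto3 (table name and numbers are placeholders; this only applies to tables in provisioned mode, and decreases are limited per table per day):

    import boto3

    client = boto3.client("dynamodb")
    TABLE_NAME = "my-table"   # placeholder

    # Raise write capacity before the bulk load...
    client.update_table(
        TableName=TABLE_NAME,
        ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 1000},
    )
    client.get_waiter("table_exists").wait(TableName=TABLE_NAME)   # wait until ACTIVE again

    # ... run the bulk load here ...

    # ...and scale back down afterwards.
    client.update_table(
        TableName=TABLE_NAME,
        ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    )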

DynamoDb throttling write requests even though DB is under provisioned capacity

In the DynamoDB table, we are getting a lot of throttled write requests, at a rate of ~2500/min. I am confused about a few things here:
The consumed write capacity is much lower than the provisioned write capacity, e.g. consumed write capacity = 500 and provisioned write capacity = 800. Why is the throttling happening anyway?
There is another metric called throttled write events, which is around 100/min. What does this mean, and how is it different from throttled write requests?
The capacity parameters I can change are target utilization, minimum provisioned capacity, and maximum provisioned capacity for writes. They all look good to me. I am using auto scaling, so I am not sure what to increase to fix this issue.
What happens to the throttled write requests? Will they result in an exception in the code?
How much do the partition keys vary when you write your data?
I previously ran into a similar issue when inserting a lot of data for the same partition key in many consecutive BatchWriteItem calls. The problem was that there were too many write requests for the same partition key, overwhelming that partition, which caused the throttling.
The solution is to structure the data in a BatchWriteItem call in such a way that the partition keys do not repeat as much.
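As a rough illustration of that idea, one way to interleave items by partition key before batching them (the item shape and key name are made up for the example):

    from itertools import chain, zip_longest
    from collections import defaultdict

    def interleave_by_partition_key(items, key_name="pk"):
        # Reorder items so consecutive ones use different partition key values,
        # spreading each 25-item batch across partitions instead of hammering one.
        groups = defaultdict(list)
        for item in items:
            groups[item[key_name]].append(item)
        # Round-robin across the per-key groups.
        mixed = chain.from_iterable(zip_longest(*groups.values()))
        return [item for item in mixed if item is not None]

    # Example: items heavily skewed toward a few partition keys.
    items = [{"pk": "user#1", "sk": str(i)} for i in range(50)] + \
            [{"pk": "user#2", "sk": str(i)} for i in range(50)]
    reordered = interleave_by_partition_key(items)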

Aws Dynamo db performance is slow

For my application I am using a free tier AWS account. I have given 5 read capacity units and 5 write capacity units (I can't increase the capacity because they will charge me if I do) to the DynamoDB table, and I am using a Scan operation. The API takes between 10 and 20 seconds to load.
I have used a parallel scan too, but the API takes the same time to load. Is there an alternative service in AWS?
It is not a good idea to use a Scan on a NoSQL database.
DynamoDB is optimized for Query requests. The data will come back very quickly, guaranteed (within the allocated capacity).
However, when using a Scan, the database must read every item in the table, and each item read consumes read capacity. So, if you have a table with 1000 items, a Query for one item would consume one unit, whereas a Scan would consume 1000 units.
So, either increase the capacity units (and cost) or, best of all, use a Query rather than a Scan. Indexes can also help.
You might need to re-think how you store your data if you always need to do a Scan.
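To make the difference concrete, a small sketch with boto3 (the table, key, and attribute names are assumptions):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("my-table")   # placeholder name

    # Scan: reads (and bills for) every item in the table, a page at a time.
    all_items = table.scan()["Items"]

    # Query: reads only the items under one partition key, so it consumes
    # capacity proportional to the data actually returned.
    one_users_items = table.query(
        KeyConditionExpression=Key("pk").eq("user#1")      # placeholder key/value
    )["Items"]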

aws dynamo db throughput

There's something I can't understand about AWS DynamoDB throughput.
Let's consider strongly consistent reads.
Now, I understand that in this case, 1 unit of capacity means I can read up to 4 KB per second.
It's the "per second" bit that slightly confuses me. If you know exactly how quickly you want to read data, then you can set the units appropriately. But what if you're not too fussy about the read time?
Say I have only 1 read unit assigned to my table and I try to read an item that is more than 4 KB. Surely that just means my read is going to take more than 1 second? That would be fine, but the documentation talks about requests failing. How can AWS determine that I used too many units when I didn't request that the data be read within a particular time?
Maybe I am missing something obvious. Can someone help clear this up?
DynamoDB can consume up to 300 seconds of unused throughput as burst capacity.
The maximum item size in DynamoDB is 400KB and 1 RCU gives you a read of up to 4KB.
Let's say you want to read an item that is 400 KB in size and you have 1 RCU on your table. You could retrieve that item once every 100 seconds.
Because of burst capacity, there will always be a time when you can read that item, because in fact you can use up to 300 RCUs in one go, not just 1.
Imagine starting the table with that 400 KB item already in it. You need to wait 100 seconds without spending any RCUs so that you've accrued enough burst capacity to get the item. After 101 seconds you make the request, spend 100 RCUs, and get the item. Five seconds later you make the request again, but get denied with a throttling exception.
So no, DynamoDB will not increase request latency to meet your RCU provision. It either returns your results as fast as possible, or throws an exception.
EDIT: By the way, I should mention that all AWS DynamoDB SDKs handle throttling exceptions for you. If you try to read an item but get denied because you don't have enough throughput available, the SDK backs off and tries again. So unless your table really is under-provisioned, you shouldn't have to worry about handling throttling exceptions.
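If you want to tune that SDK behavior in boto3, a minimal sketch (the retry numbers and names are just illustrative):

    import boto3
    from botocore.config import Config

    # Ask botocore to retry throttled calls with adaptive backoff
    # instead of surfacing ProvisionedThroughputExceededException immediately.
    config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
    dynamodb = boto3.resource("dynamodb", config=config)

    table = dynamodb.Table("my-table")                          # placeholder name
    item = table.get_item(Key={"pk": "user#1"}).get("Item")     # placeholder key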