While doing an initial import of data (the first ingest after creating the table), write throughput maxes out for the duration of the import.
Is there a way to seed your DynamoDB table with data so that the load is not subject to the regular write throughput settings?
Or are we expected to set a very high provisioned write throughput capacity for a few minutes during the data import process?
I'm not sure what the convention is here.
Are you using the BatchWriteItem API to do the initial load? That can be enough sometimes.
Otherwise, the unfortunate answer is that you need to temporarily increase the write throughput. The SDKs also have built-in retry logic, so you could tune that as well to ensure everything is written.
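For reference, here is a minimal sketch of a batched initial load with boto3 (the table name and attributes are placeholders); the batch_writer helper wraps BatchWriteItem and resends any unprocessed items for you:

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("my-table")  # hypothetical table with a "pk" hash key

    # Items to seed; in practice you'd stream these from your source file.
    items = [{"pk": str(i), "payload": "example"} for i in range(1000)]

    # batch_writer() buffers puts into BatchWriteItem calls of up to 25 items
    # and automatically retries any unprocessed items.
    with table.batch_writer() as writer:
        for item in items:
            writer.put_item(Item=item)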
There is no such option - for seeding just temporarily increase your write throughput if you need it to go faster (does speed really matter, or can you live with this being slower?). Also, I'd recommend increasing the maximum number of retries on the retry strategy in the DDB ClientConfiguration.
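If you're not on the Java SDK, the retry setting goes by a different name; for example, in boto3 the rough equivalent is a botocore Config, as in this sketch (the numbers are illustrative):

    import boto3
    from botocore.config import Config

    # Raise the retry ceiling so throttled writes
    # (ProvisionedThroughputExceededException) are retried with backoff
    # instead of failing immediately.
    retry_config = Config(retries={"max_attempts": 10, "mode": "standard"})
    dynamodb = boto3.resource("dynamodb", config=retry_config)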
Be careful with the "very high" throughput option, as this can cause repartitioning on the AWS side, and can cause throughput dilution on your table when you reduce it afterwards.
All read and write requests are subject to the provisioned throughput. Increase the provisioned write throughput while importing your data and decrease it afterwards.
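For example, a hedged boto3 sketch of bumping write capacity around the import (table name and capacity numbers are placeholders, and this only applies to provisioned-capacity tables):

    import boto3

    client = boto3.client("dynamodb")

    # Raise write capacity before the bulk import.
    client.update_table(
        TableName="my-table",
        ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 1000},
    )
    client.get_waiter("table_exists").wait(TableName="my-table")  # wait for ACTIVE

    # ... run the import ...

    # Dial it back down afterwards; note that decreases are rate-limited.
    client.update_table(
        TableName="my-table",
        ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 10},
    )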
I'm currently using ~500 concurrent executions, and this tends to reach up to 5000 easily. Is this a long-term problem, or is it relatively easy to make a quota increase request to AWS?
Getting quota increases is not difficult, but it’s also not instantaneous. In some cases the support person will ask for more information on why you need the increase (often to be sure you aren’t going too far afoul of best practices), which can slow things down. Different support levels have different response times too. So if you are concerned about it you should get ahead of it and get the increase before you think you’ll need it.
To request an increase through the console (a scripted alternative is sketched after these steps):
In the AWS management console, select Service Quotas
Click AWS Lambda
Select Concurrent executions
Click Request quota increase
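If you'd rather script it, the Service Quotas API exposes the same request; here's a hedged boto3 sketch (looking the quota up by its display name rather than hard-coding the quota code is just for illustration):

    import boto3

    sq = boto3.client("service-quotas")

    # Find the Lambda "Concurrent executions" quota instead of hard-coding its code.
    quotas = sq.list_service_quotas(ServiceCode="lambda")["Quotas"]
    concurrency = next(q for q in quotas if q["QuotaName"] == "Concurrent executions")

    # File the same increase request you would submit through the console.
    sq.request_service_quota_increase(
        ServiceCode="lambda",
        QuotaCode=concurrency["QuotaCode"],
        DesiredValue=5000.0,
    )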
I've been using AWS DMS to perform ongoing replication from MySQL Aurora to Redshift. However, the ongoing replication is causing a constant 25-30% CPU load on the target. This is because it produces many small files on S3 and loads/processes them non-stop. Redshift is not really designed for handling a large number of small tasks.
In order to optimize, I've made it so that the process starts at the beginning of each hour, waits until the target is in sync, and then stops. So, instead of working continually, it works for 5-8 minutes at the beginning of each hour. Even so, it is still very slow and unoptimized, because it still has to process hundreds of small S3 files, only in a shorter timespan.
Can this be optimized further? Is there a way to tell DMS to buffer these changes for a larger period of time and produce fewer, larger S3 files instead of many small ones? We really don't mind having a higher target latency.
The amount of data transferred between Aurora and Redshift is rather small. There are around ~20K changes per hour, and we're using a 4-node dc1.large Redshift cluster. It should be able to handle those 20K changes in a matter of seconds, not minutes.
Maybe you can try BatchApplyTimeoutMin and BatchApplyTimeoutMax.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.ChangeProcessingTuning.html
BatchApplyTimeoutMin sets the minimum amount of time in seconds that AWS DMS waits between each application of batch changes. The default value is 1.
You can change the value to 1200, even 3600.
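If you want to apply that from code, here is a hedged boto3 sketch (the task ARN is a placeholder; the task normally has to be stopped before it can be modified, and batch apply has to be enabled for these settings to matter):

    import json
    import boto3

    dms = boto3.client("dms")
    task_arn = "arn:aws:dms:region:account:task:EXAMPLE"  # placeholder

    # Read the current task settings, then widen the batch-apply window so DMS
    # accumulates changes for longer before writing a batch to the target.
    task = dms.describe_replication_tasks(
        Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
    )["ReplicationTasks"][0]
    settings = json.loads(task["ReplicationTaskSettings"])
    settings["ChangeProcessingTuning"]["BatchApplyTimeoutMin"] = 1200
    settings["ChangeProcessingTuning"]["BatchApplyTimeoutMax"] = 3600

    dms.modify_replication_task(
        ReplicationTaskArn=task_arn,
        ReplicationTaskSettings=json.dumps(settings),
    )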
Bump up maxFileSize in the target settings - https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.Redshift.html
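As a sketch, the same setting can be changed on the target endpoint with boto3 (the endpoint ARN and size are illustrative; MaxFileSize is in KB, and per the docs the default is around 32 MB):

    import boto3

    dms = boto3.client("dms")

    # Produce fewer, larger intermediate files on S3 per batch load.
    dms.modify_endpoint(
        EndpointArn="arn:aws:dms:region:account:endpoint:EXAMPLE",  # placeholder
        RedshiftSettings={"MaxFileSize": 250000},  # ~250 MB, value in KB
    )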
A SideInput is sort of like a broadcast variable in Spark, meaning you are caching data on local worker machines for fast lookup to reduce network/shuffle overhead. It seems logical that there should be a limit to how much you can cache, and that it should fit in the heap. The Dataflow documentation says the limit is 20K shards. What does this mean? How big is a shard?
To answer your original question, you can configure the amount of in-memory caching done by a Dataflow worker via the --workerCacheSizeMb option on the command line, which is setWorkerCacheSizeMb if you are invoking a pipeline programmatically. The default is 100 MB.
We've been experiencing problems where batch update requests take in excess of 60 seconds. We're updating a few KB of data, quite some way short of the 5 MB limit.
What is surprising me is not so much the time taken to index the data, but the time taken for the update request itself. Just uploading ~65 KB of data can take over a minute.
We're making frequent updates with small quantities of data. Could it be that we're being throttled?
Not sure if this applies to your problem, but Amazon's documentation recommends sending large batches if possible.
Important
Whenever possible, you should group add and delete operations in batches that are close to the maximum batch size. Submitting a large volume of single-document batches to the document service can increase the time it takes for your changes to become visible in search results.
It talks about search result availability, but I don't think it's a stretch to say that this also would affect SDF processing performance.
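For example, a hedged boto3 sketch of grouping many single-document adds into one batch before calling the document service (the domain endpoint, ids and fields are placeholders; each batch still has to stay under the 5 MB limit):

    import json
    import boto3

    # The document-service endpoint is a placeholder for your search domain's.
    client = boto3.client(
        "cloudsearchdomain",
        endpoint_url="https://doc-mydomain-example.us-east-1.cloudsearch.amazonaws.com",
    )

    # One batch of many add operations instead of many single-document batches.
    batch = [
        {"type": "add", "id": f"doc-{i}", "fields": {"title": f"Title {i}"}}
        for i in range(1000)
    ]

    client.upload_documents(
        documents=json.dumps(batch).encode("utf-8"),
        contentType="application/json",
    )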
From what I gather, with Amazon DynamoDB you pay for provisioned throughput.
Application1 does a more or less consistent rate of writes, the data is ideal for a key/value store, and it doesn't ever read the data back. At the moment it's 3-5k writes/hour, but that's bound to increase once we launch.
Application2 reads (from the data written by application1) one hour worth of records every hour, and reads one days worth of records every day. Eventual consistency is acceptable.
So am I right to assume DynamoDB isn't well suited for me? As in, I would have to provision a high read rate, even if I hit that rate only for a few seconds every hour? Is there a way to dump records?
At the moment, I'm using master/slave on MongoDB. I use the slave for my batch reads, so that it doesn't affect the master... but I'd much rather let someone else handle the db infrastructure.
So am I right to assume DynamoDB isn't well suited for me?
Good question - I wouldn't necessarily come to your conclusion, though you'll need to account for DynamoDB's specific cost/performance characteristics, which may or may not outweigh the benefits you are looking for.
As in, I would have to provision a high read rate, even if I hit that rate only for a few seconds every hour?
That's correct. You pay a flat, hourly rate based on the capacity you reserve (see Pricing), and in order to avoid being throttled you must provision capacity for the maximum throughput you'll encounter, i.e. for reading one hour's worth of records in a short burst.
In addition, you'll need to adjust the provisioned capacity for the daily spike of reading one day's worth of records. As usual for AWS, there is an API available to do this, but be aware of the related FAQ items, e.g.:
Is there any limit on how much I can change my provisioned throughput with a single request?
How often can I change my provisioned throughput?
The latter is particularly tough, insofar as you can increase your provisioned throughput as often as you want, but you can only decrease it once per day!
Obviously you should review the other available FAQ items related to Provisioned Throughput as well, as there might be more subtleties still.
Given the complexities involved, it's probably unavoidable to fully grasp the concept of Provisioned Throughput in Amazon DynamoDB, insofar as one must account for it architecture-wise in order to achieve the desired results. Calculating the cost and performance details for a particular use case is apparently going to be a non-trivial exercise for DynamoDB ;)
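To make that exercise concrete, here is a back-of-the-envelope sketch using the numbers from the question (the average item size and the target read window are assumptions):

    import math

    # Rough capacity estimate; item size and batch read window are assumptions.
    writes_per_hour = 5_000
    item_size_kb = 1            # assumed average item size
    read_window_seconds = 60    # how quickly the hourly batch read should finish

    # Writes cost 1 unit per 1 KB item; the steady write load is tiny (~1.4/sec).
    write_capacity_units = writes_per_hour / 3600 * math.ceil(item_size_kb / 1)

    # Reading the same 5,000 items back in a one-minute burst, with eventually
    # consistent reads costing 0.5 unit per 4 KB read:
    reads_per_second = writes_per_hour / read_window_seconds
    read_capacity_units = reads_per_second * math.ceil(item_size_kb / 4) * 0.5

    print(round(write_capacity_units, 1))  # ~1.4 write units for the steady load
    print(round(read_capacity_units, 1))   # ~41.7 read units just for the burst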