I am trying to export data from DynamoDB to S3 using Data Pipeline. My table uses on-demand capacity mode and contains about 10 GB of data. How many RCUs will the export consume? Is there a way to limit how much the RCU consumption scales up, even if that increases the time the transfer takes?
You can switch between on-demand and provisioned capacity at any time. You should do it, since provisioned is cheaper and it lets you hard-set the RCUs, which is what you want.
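If it helps, here is a minimal boto3 sketch of that switch; the table name and throughput numbers are placeholders, not values from your setup:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Switch from on-demand to provisioned mode and pin the throughput.
# Low RCU/WCU values slow the export down but cap what it can consume.
dynamodb.update_table(
    TableName="my-table",  # placeholder
    BillingMode="PROVISIONED",
    ProvisionedThroughput={
        "ReadCapacityUnits": 100,  # tune to what you're willing to spend
        "WriteCapacityUnits": 10,
    },
)
```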
As another option, DynamoDB recently released a new feature to export your data to an S3 bucket. It supports DynamoDB JSON and Amazon Ion - see the documentation on how to use it at:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.html
This process does not consume your table's read capacity, which removes this concern entirely.
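For reference, a minimal boto3 sketch of kicking off that export (the table ARN and bucket name are placeholders, and the feature requires point-in-time recovery to be enabled on the table):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Start a table export to S3; this reads from the PITR backup data,
# not from the live table, so it consumes no RCUs.
response = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/my-table",  # placeholder
    S3Bucket="my-export-bucket",  # placeholder
    ExportFormat="DYNAMODB_JSON",  # or "ION"
)
print(response["ExportDescription"]["ExportStatus"])  # IN_PROGRESS until it finishes
```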
Related
I have an Amazon DynamoDB table used for a project that will be on hold for at least 6 months. I want to archive the database to remove as much of the day-to-day cost as reasonably possible. I'm hoping I can export the data to some backup format and store it in Amazon S3. I would then want to be able to restore the data to resurrect the DynamoDB table at some future date.
You have several options, but first it's important to understand the DynamoDB free tier:
You get 25 GB of storage free per region
You get 25 RCU/WCU free per region
So if it's a single table under 25 GB, you can keep that table in DynamoDB for free.
Backup options:
AWS Backup : Takes a snapshot of your table and handles the storage medium. You can also choose to keep it in cold storage for a lower cost. No S3 costs for putting or getting the data later.
Export to S3 : This built-in DynamoDB feature lets you export the table to S3 in DynamoDB JSON or Ion format. You will incur export costs as well as S3 PUT costs. Later you can use the import-from-S3 feature to restore the table (sketched below).
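To illustrate the restore side of that second option, here is a rough boto3 sketch; the bucket, prefix, table name and key schema are placeholders for whatever your export actually produced:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Import creates a brand-new table from the exported files in S3;
# it does not consume write capacity on an existing table.
dynamodb.import_table(
    S3BucketSource={
        "S3Bucket": "my-archive-bucket",     # placeholder
        "S3KeyPrefix": "exports/my-table/",  # placeholder
    },
    InputFormat="DYNAMODB_JSON",
    TableCreationParameters={
        "TableName": "my-table-restored",    # placeholder
        "AttributeDefinitions": [{"AttributeName": "pk", "AttributeType": "S"}],
        "KeySchema": [{"AttributeName": "pk", "KeyType": "HASH"}],
        "BillingMode": "PAY_PER_REQUEST",
    },
)
```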
I have a bunch of tables in DynamoDB (80 right now, but the number can grow) and I'm looking to sync data from these tables to either Redshift or S3 (with Glue on top of it, to query using Athena) for running analytics queries.
There are also very frequent updates (to existing entries) and deletes in the DynamoDB tables, which I want to sync along with the addition of newer entries.
I checked the write capacity units (WCU) consumed across all tables, and the rate comes out to around 35-40 WCU per second at peak times.
Solutions I considered:
Use Kinesis Firehose along with a Lambda (which reads updates from DynamoDB Streams) to push data to Redshift in small batches. (Issue: it cannot support updates and deletes and is only good for adding new entries, because it uses the Redshift COPY command under the hood to upload data to Redshift.)
Use a Lambda (which reads updates from DynamoDB Streams) to copy data to S3 directly as JSON, with each entry being a separate file. This can support updates and deletes if the S3 file path is the same as the primary key of the DynamoDB table; a sketch of this approach appears after this list. (Issue: it will result in tons of small files in S3, which might not scale for querying with AWS Glue.)
Use a Lambda (which reads updates from DynamoDB Streams) to write data to Redshift directly as soon as an update happens. (Issue: too many small writes can cause scaling issues, as Redshift is better suited to batch writes/updates.)
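To make the second option concrete, here is a rough sketch of such a Lambda handler; it assumes the stream is configured with new images and that the table has a single string partition key called id (both assumptions, not facts from the question):

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "ddb-sync-bucket"  # placeholder

def handler(event, context):
    # Each record describes one insert/update/delete on the source table.
    for record in event["Records"]:
        keys = record["dynamodb"]["Keys"]
        # Assumes a single string partition key named "id".
        table_name = record["eventSourceARN"].split("/")[1]
        object_key = f"{table_name}/{keys['id']['S']}.json"
        if record["eventName"] == "REMOVE":
            # Deletes in DynamoDB become deletes in S3.
            s3.delete_object(Bucket=BUCKET, Key=object_key)
        else:
            # INSERT/MODIFY overwrite the same object, so the latest image wins.
            s3.put_object(
                Bucket=BUCKET,
                Key=object_key,
                Body=json.dumps(record["dynamodb"]["NewImage"]),
            )
```

Because every write for a given key overwrites the same S3 object, the bucket always holds the latest image, but you still end up with one small object per item, which is the scaling concern noted above.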
We're building a Lambda architecture on the AWS stack. A lack of DevOps knowledge forces us to prefer AWS managed solutions over custom deployments.
Our workflow:
[Batch layer]
Kinesis Firehose -> S3 -Glue-> EMR (Spark) -Glue-> S3 views ---------+
                                                                     |===> Serving layer (ECS) => Users
Kinesis -> EMR (Spark Streaming) -> DynamoDB/ElastiCache views ------+
[Speed layer]
We are already using 3 datastores: ElastiCache, DynamoDB and S3 (queried with Athena). The batch layer produces from 500,000 up to 6,000,000 rows each hour. Only the last hour's results need to be queried by the serving layer, with low-latency random reads.
None of our databases fits both the batch-insert and random-read requirements. DynamoDB doesn't fit batch inserts - it's too expensive because of the throughput required for them. Athena is MPP and moreover has a limit of 20 concurrent queries. ElastiCache is used by the streaming layer, and I'm not sure it's a good idea to perform batch inserts there as well.
Should we introduce a fourth storage solution or stay with the existing ones?
Considered options:
Persist the batch output to DynamoDB and ElastiCache (the part of the data that is updated rarely and can be compressed/aggregated goes to DynamoDB; the frequently updated data, ~8 GB/day, goes to ElastiCache).
Introduce another database (HBase on EMR over S3, or Amazon Redshift?) as a solution.
Use S3 Select over Parquet to overcome Athena's concurrent query limit. That would also reduce query latency. But does S3 Select have any concurrent query limits? I can't find any related info (a sketch of such a call is shown below).
The first option is bad because of the batch inserts into the ElastiCache instance already used by the streaming layer. Also, does it even follow the Lambda architecture to keep batch and speed layer views in the same data stores?
The second solution is bad because it adds a fourth database store, isn't it?
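For reference, this is roughly the kind of S3 Select call I have in mind; the bucket, key and column names are made up for illustration:

```python
import boto3

s3 = boto3.client("s3")

# Query a single Parquet object directly, bypassing Athena entirely.
response = s3.select_object_content(
    Bucket="batch-views-bucket",                    # placeholder
    Key="views/2019-01-01T10/part-0000.parquet",    # placeholder
    ExpressionType="SQL",
    Expression="SELECT s.user_id, s.score FROM s3object s WHERE s.user_id = 'u123'",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"JSON": {}},
)

# The result comes back as an event stream of Records payloads.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
```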
In this case you might want to use something like HBase or Druid; not only can they handle batch inserts and very low latency random reads, they could even replace the DynamoDB/ElastiCache component from your solution, since you can write directly to them from the incoming stream (to a different table).
Druid is probably superior for this, but as per your requirements, you'll want HBase, as it is available on EMR with the Amazon Hadoop distribution, whereas Druid doesn't come in a managed offering.
I am asking this in the context of loading data from DynamoDB into Redshift. Per the Redshift docs:
To avoid consuming excessive amounts of provisioned read throughput, we recommend that you not load data from Amazon DynamoDB tables that are in production environments.
My data is in Production, so how do I get it out of there?
Alternatively, is DynamoDB Streams a better overall choice for moving data from DynamoDB into Redshift? (I understand this does not add to my RCU cost.)
The warning is due to the fact that the export could consume much of your read capacity for a period of time, which would impact your production environment.
Some options:
Do it at night when you don't need as much capacity
Set READRATIO to a low value so that the load consumes less of the capacity (see the sketch after this list)
Temporarily increase the Read Capacity Units of the table when performing the export (you can decrease capacity four times a day)
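As a sketch of the READRATIO option, something along these lines should work; the cluster, database, role and table names are placeholders, and the Redshift Data API is just one convenient way to submit the COPY:

```python
import boto3

redshift_data = boto3.client("redshift-data")

# COPY pulls straight from the DynamoDB table; READRATIO caps the share of
# the table's provisioned read capacity the load is allowed to use.
copy_sql = """
    COPY public.my_table
    FROM 'dynamodb://MyDynamoDBTable'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    READRATIO 25;
"""

redshift_data.execute_statement(
    ClusterIdentifier="my-redshift-cluster",  # placeholder
    Database="analytics",                     # placeholder
    DbUser="admin",                           # placeholder
    Sql=copy_sql,
)
```

A READRATIO of 25 tells the COPY to use at most roughly a quarter of the table's provisioned read throughput, leaving the rest for production traffic.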
DynamoDB Streams provides a stream of data representing changes to a DynamoDB table. You would need to process these streams using AWS Lambda to send the data somewhere for loading into Redshift. For example, you could populate another DynamoDB table and use it for importing into Redshift. Or, you could write the data to Amazon S3 and import from there into Redshift. However, this involves lots of moving parts.
Using AWS Data Pipeline, you can do a bulk copy of data from DynamoDB to a new or existing Redshift table.
We are looking for a solution that uses the minimum read/write units of the DynamoDB table for performing full backup, incremental backup and restore operations. The backups should be stored in AWS S3 (open to other alternatives). We have thought of a few options:
1) Using Python multiprocessing and the boto module, we were able to perform full backup and restore operations; it performs well, but consumes a lot of DynamoDB read/write units.
2) Using the AWS Data Pipeline service, we were able to perform full backup and restore operations.
3) Using DynamoDB Streams with the Kinesis Adapter, or DynamoDB Streams with a Lambda function, we were able to perform incremental backups.
Are there other alternatives for full backup, incremental backup and restore operations? The main limitation/need is a scalable solution that uses minimal read/write units of the DynamoDB table.
Options #1 and #2 are almost the same - both do a Scan operation on the DynamoDB table, thereby consuming the maximum number of RCUs.
Option #3 will save RCUs, but restoring becomes a challenge. If a record is updated more than once, you'll have multiple copies of it in the S3 backup, because each update appears as a separate entry in the DynamoDB stream. So while restoring you need to pick the latest version of each record. You also need to handle deleted records correctly.
You should choose option #3 if you restore infrequently, in which case you can run an EMR job over the incremental backups when needed. Otherwise, choose #1 or #2.
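To show why the Scan-based options are expensive, and how you can at least spread the cost out, here is a rough sketch of a throttled full backup; the table/bucket names, page size and sleep interval are illustrative only:

```python
import json
import time
import boto3

dynamodb = boto3.client("dynamodb")
s3 = boto3.client("s3")

def full_backup(table_name, bucket, prefix, page_size=100, pause_seconds=1.0):
    # Scan the whole table in small pages and dump each page to S3.
    # Every page still costs RCUs; the small page size plus the sleep only
    # spreads the cost out so production traffic isn't starved.
    paginator = dynamodb.get_paginator("scan")
    pages = paginator.paginate(
        TableName=table_name,
        PaginationConfig={"PageSize": page_size},  # maps to the Scan Limit
    )
    for page_number, page in enumerate(pages):
        key = f"{prefix}/part-{page_number:06d}.json"
        # default=str is a crude fallback for binary or unusual attribute values.
        body = json.dumps(page["Items"], default=str)
        s3.put_object(Bucket=bucket, Key=key, Body=body)
        time.sleep(pause_seconds)

full_backup("my-table", "my-backup-bucket", "full/2019-01-01")  # placeholder names
```

Even throttled, a full Scan still reads every item once, so the total RCU cost is the same; only the rate at which it is consumed changes.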
On-demand backups are a feature built into the DynamoDB service (accessible via the API, AWS Management Console and CLI as usual) that lets you take a full backup of a table at a point in time.
This has no impact on the performance or availability of your tables. All backups are automatically encrypted, cataloged, easily discoverable, and retained until you explicitly delete them.
Additionally, you can restore these backups to a new table at any point.
Along with data, the following is included in the backups:
Global secondary indexes (GSIs)
Local secondary indexes (LSIs)
Streams
Provisioned read and write capacity
The following is NOT included in the backups:
Auto scaling policies
AWS Identity and Access Management (IAM) policies
Amazon CloudWatch metrics and alarms
Tags
Stream settings
Time To Live (TTL) settings
I've blogged more information and a walkthrough here: https://www.abhayachauhan.com/2017/12/dynamodb-scheduling-on-demand-backups/
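As a rough boto3 sketch of taking and later restoring such an on-demand backup (the table and backup names are placeholders):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Take a full on-demand backup; this does not consume any read capacity.
backup = dynamodb.create_backup(
    TableName="my-table",              # placeholder
    BackupName="my-table-2019-01-01",  # placeholder
)
backup_arn = backup["BackupDetails"]["BackupArn"]

# Later, restore the backup into a brand-new table.
dynamodb.restore_table_from_backup(
    TargetTableName="my-table-restored",  # placeholder
    BackupArn=backup_arn,
)
```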