I am asking this in the context of loading data from DynamoDB into Redshift. Per the Redshift docs:
To avoid consuming excessive amounts of provisioned read throughput, we recommend that you not load data from Amazon DynamoDB tables that are in production environments.
My data is in Production, so how do I get it out of there?
Alternatively, is DynamoDB Streams a better overall choice for moving data from DynamoDB into Redshift? (I understand this does not add to my RCU cost.)
The warning exists because the export could consume much of your table's provisioned read capacity for a period of time, which would impact your production environment.
Some options:
Do it at night when you don't need as much capacity
Set READRATIO to a low value so that the COPY consumes less of the capacity (a sketch of such a COPY follows this list)
Temporarily increase the Read Capacity Units of the table when performing the export (you can decrease capacity four times a day)
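For illustration, here is a minimal sketch of such a throttled COPY issued through the Redshift Data API with boto3. Every identifier (cluster, database, user, IAM role, and both table names) is a placeholder, and READRATIO 25 limits the COPY to roughly a quarter of the DynamoDB table's provisioned read capacity:

    import boto3

    # Hypothetical identifiers - replace with your own cluster, database, role and tables.
    SQL = """
        COPY events_staging
        FROM 'dynamodb://ProductionEvents'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        READRATIO 25;
    """

    client = boto3.client("redshift-data")
    response = client.execute_statement(
        ClusterIdentifier="my-cluster",   # cluster to load into
        Database="analytics",
        DbUser="loader",
        Sql=SQL,
    )
    print(response["Id"])  # statement id; poll describe_statement() to check completion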
DynamoDB Streams provides a stream of data representing changes to a DynamoDB table. You would need to process these streams using AWS Lambda to send the data somewhere for loading into Redshift. For example, you could populate another DynamoDB table and use it for importing into Redshift. Or, you could write the data to Amazon S3 and import from there into Redshift. However, this involves lots of moving parts.
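As a rough sketch of the Streams-to-S3 path (assuming the stream is configured with the NEW_IMAGE view type; the bucket and key prefix are hypothetical), a Lambda subscribed to the stream could append each batch of changes to S3 as newline-delimited JSON, and you would then COPY those files into Redshift on a schedule:

    import json
    import time
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-ddb-changes"   # hypothetical bucket for staged changes
    PREFIX = "events/"          # hypothetical key prefix

    def handler(event, context):
        # Each invocation receives a batch of DynamoDB Streams records.
        lines = []
        for record in event["Records"]:
            if record["eventName"] in ("INSERT", "MODIFY"):
                # NewImage uses DynamoDB's typed JSON ({"S": ...}, {"N": ...});
                # flatten it before COPYing if your Redshift schema expects plain JSON.
                lines.append(json.dumps(record["dynamodb"]["NewImage"]))
        if lines:
            key = f"{PREFIX}{int(time.time() * 1000)}.json"
            s3.put_object(Bucket=BUCKET, Key=key, Body="\n".join(lines).encode("utf-8"))
        return {"written": len(lines)}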
Using AWS Data Pipeline, you can bulk-copy data from DynamoDB to a new or existing Redshift table.
Related
I have a bunch of tables in DynamoDB (80 right now, but the count can grow in future) and I am looking to sync data from these tables to either Redshift or S3 (with Glue on top of it so I can query using Athena) for running analytics queries.
Very frequent updates (to existing entries) and deletes also happen in the DynamoDB tables, and I want to sync those along with the addition of newer entries.
I checked the write capacity units (WCU) consumed for all tables, and the rate comes out to around 35-40 WCU per second at peak times.
Solutions I considered:
Use Kinesis Firehose along with Lambda (which reads updates from DDB streams) to push data to Redshift in small batches. (Issue: it cannot support updates and deletes and is only good for adding new entries, because it uses the Redshift COPY command under the hood to upload data to Redshift.)
Use Lambda (which reads updates from DDB streams) to copy data to S3 directly as JSON, with each entry being a separate file. This can support updates and deletes if we make the S3 file path the same as the primary key of the DynamoDB table; a sketch of this keying scheme follows the list. (Issue: it will result in tons of small files in S3, which might not scale for querying with AWS Glue.)
Use Lambda (which reads updates from DDB streams) to write data to Redshift directly as soon as an update happens. (Issue: too many small writes can cause scaling issues for Redshift, which is better suited to batch writes/updates.)
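To make option 2 concrete, here is a rough sketch of that keying scheme (the bucket name, the key derivation, and the NEW_IMAGE stream view type are all assumptions): each item maps to one S3 object named after its primary key, so INSERT/MODIFY overwrite the object and REMOVE deletes it.

    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-ddb-mirror"    # hypothetical bucket
    TABLE_PREFIX = "orders/"    # hypothetical prefix, one per source table

    def handler(event, context):
        for record in event["Records"]:
            keys = record["dynamodb"]["Keys"]   # e.g. {"pk": {"S": "order#123"}}
            # Naive key derivation: concatenate the primary-key values.
            object_key = TABLE_PREFIX + "_".join(
                v[next(iter(v))] for v in keys.values()
            ) + ".json"
            if record["eventName"] == "REMOVE":
                s3.delete_object(Bucket=BUCKET, Key=object_key)
            else:
                # INSERT and MODIFY overwrite the same object, giving upsert semantics.
                body = json.dumps(record["dynamodb"]["NewImage"]).encode("utf-8")
                s3.put_object(Bucket=BUCKET, Key=object_key, Body=body)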
I am trying to export data from DynamoDB to S3 using Data Pipeline. My table uses on-demand capacity and contains about 10 GB of data. How many RCUs will the export consume? Is there a way to limit the RCU usage, even if that increases the transfer time?
You can switch between on-demand and provisioned capacity (capacity-mode switches are limited to once per 24 hours per table). It is worth doing here, since provisioned capacity is cheaper and lets you hard-set the RCUs, which is what you want.
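For illustration, switching the table to provisioned mode with a deliberately low read capacity might look like this with boto3 (the table name and capacity figures are placeholders; the lower you set the RCUs, the longer the export will take):

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Hypothetical table name and capacity values - a low RCU figure throttles
    # the export (making it slower) instead of letting on-demand scale up the cost.
    dynamodb.update_table(
        TableName="my-production-table",
        BillingMode="PROVISIONED",
        ProvisionedThroughput={
            "ReadCapacityUnits": 100,
            "WriteCapacityUnits": 10,
        },
    )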
As another option, DynamoDB recently released a new feature to export your data to an S3 bucket. It supports DynamoDB JSON and Amazon Ion - see the documentation on how to use it at:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.html
This process does not consume your table's read capacity, which removes this concern entirely.
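A minimal sketch of kicking off such an export with boto3 (the table ARN and bucket are placeholders, and the feature requires point-in-time recovery to be enabled on the table):

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Hypothetical ARN and bucket. Requires PITR to be enabled on the table.
    response = dynamodb.export_table_to_point_in_time(
        TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/my-production-table",
        S3Bucket="my-export-bucket",
        ExportFormat="DYNAMODB_JSON",   # or "ION"
    )
    print(response["ExportDescription"]["ExportStatus"])  # IN_PROGRESS until it completes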
I'm storing some events in DynamoDB. I have to sync (i.e. copy incrementally) the data to Redshift. Ultimately, I want to be able to analyze the data through AWS QuickSight.
I've come across multiple solutions, but they are either one-off (using the COPY command) or real-time (a streaming data pipeline using Kinesis Firehose).
The real-time solution seems superior to hourly sync, but I'm worried about performance and complexity. I was wondering if there's an easier way to batch the updates on an hourly basis.
What you are looking for is DynamoDB Streams (official docs). This can flow seamlessly into Kinesis Firehose, as you have correctly pointed out.
This is the optimal way and provides the best balance between cost, operational overhead, and functionality. Allow me to explain how:
DynamoDB Streams: a stream record is emitted whenever activity happens on the table. Unlike a process that scans the data periodically and consumes read capacity even when there is no update, you are notified only when there is new data.
Kinesis Firehose: you can configure Firehose to batch data either by size or by time. If you have a good inflow, you can have the stream batch the records received in each two-minute interval and then issue just one COPY command to Redshift. The same goes for the size of the data in the stream buffer; see the Firehose buffering documentation for details.
The ideal way to load data into Redshift is via the COPY command, and Kinesis Firehose does just that. You can also configure it to automatically back the data up to S3.
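As a rough sketch of the glue between the two services (the delivery stream name is a placeholder, and the stream is assumed to carry new images), a Lambda subscribed to the table's stream could forward records to Firehose like this:

    import json
    import boto3

    firehose = boto3.client("firehose")
    DELIVERY_STREAM = "ddb-events-to-redshift"   # hypothetical Firehose delivery stream

    def handler(event, context):
        records = [
            {"Data": (json.dumps(r["dynamodb"]["NewImage"]) + "\n").encode("utf-8")}
            for r in event["Records"]
            if r["eventName"] in ("INSERT", "MODIFY")
        ]
        if records:
            # Firehose buffers these and issues a single COPY per buffer flush.
            firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=records)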
Remember that a reactive, push-based system is almost always more performant and less costly than a poll-based one. You save the compute capacity needed to run a cron process that continuously scans for updates.
I have a data warehouse maintained in AWS Redshift. The data volume and velocity have both increased lately. One option is to keep scaling the cluster horizontally, at the expense of a higher cost of course. I was wondering if there are any archiving options available so that I can query the entire dataset as usual (maybe with a compromise in query time) but with little or no additional cost?
One option would be to use external tables and query the data directly from S3, but the tools for achieving this, like Athena and Glue, have their own cost, and on a per-query basis at that.
Easy options:
Ensure all tables are compressed (check with SELECT * FROM svv_table_info;); a sketch for spotting uncompressed tables follows this list.
Maximize compression by changing large tables to use ENCODE zstd.
Switch small tables (< ~50k rows, it depends) to DISTSTYLE ALL (yes, this saves space!).
Switch from SSD-based nodes (dc2) to HDD-based nodes (ds2), which have 8x more storage space.
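For example, a rough sketch using boto3 and the Redshift Data API (cluster, database, and user names are placeholders) to pull the relevant columns from svv_table_info and find large, poorly compressed tables:

    import boto3

    # svv_table_info reports table size (in 1 MB blocks), whether columns are
    # encoded, and the distribution style - enough to pick compression targets.
    SQL = """
        SELECT "table", encoded, diststyle, size, pct_used
        FROM svv_table_info
        ORDER BY size DESC
        LIMIT 20;
    """

    client = boto3.client("redshift-data")
    response = client.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="analytics",
        DbUser="admin",
        Sql=SQL,
    )
    print(response["Id"])  # fetch rows later with get_statement_result(Id=...)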
Less easy options:
UNLOAD older data from Redshift to S3 and query it using Redshift Spectrum (a sketch of such an UNLOAD follows this list).
Convert the unloaded data to Parquet or ORC format using AWS Glue or AWS EMR, then query it using Redshift Spectrum.
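As an illustration, here is a sketch of such an UNLOAD issued through the Redshift Data API with boto3 (the table, date predicate, bucket, IAM role, and cluster details are all placeholders). Newer Redshift releases let UNLOAD write partitioned Parquet directly, which skips the separate Glue/EMR conversion for freshly unloaded data:

    import boto3

    # Hypothetical names throughout - adjust table, predicate, bucket and role.
    SQL = """
        UNLOAD ('SELECT * FROM events WHERE event_date < ''2018-01-01''')
        TO 's3://my-archive-bucket/events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
        FORMAT AS PARQUET
        PARTITION BY (event_date);
    """

    boto3.client("redshift-data").execute_statement(
        ClusterIdentifier="my-cluster",
        Database="analytics",
        DbUser="admin",
        Sql=SQL,
    )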
Please experiment with Redshift Spectrum. Query performance is typically very good and gets even better if your data is in a columnar format (Parquet/ORC).
Data stored within Amazon Redshift will provide the highest performance.
However, if you have data that is accessed less frequently, you could export (UNLOAD) it to Amazon S3, preferably as compressed, partitioned data; storing it as Parquet or ORC would be even better!
You could then use Amazon Redshift Spectrum to query that external data in Amazon S3. You can even join external data with Redshift data, so you could query historical and current information in a single query.
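For illustration, wiring that up might look like the sketch below, using boto3 and the Redshift Data API; the Glue Data Catalog database, IAM role, and table names are placeholders, and the external table is assumed to already exist in the catalog:

    import boto3

    # One-time setup: expose a Glue Data Catalog database to Redshift Spectrum.
    DDL = """
        CREATE EXTERNAL SCHEMA IF NOT EXISTS archive
        FROM DATA CATALOG DATABASE 'events_archive'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
    """

    # Join external (S3) data with a local Redshift table in one query.
    QUERY = """
        SELECT c.customer_id, COUNT(*)
        FROM archive.events_2017 e                          -- external (S3) data
        JOIN customers c ON c.customer_id = e.customer_id   -- local Redshift table
        GROUP BY c.customer_id
    """

    client = boto3.client("redshift-data")
    client.execute_statement(
        ClusterIdentifier="my-cluster", Database="analytics", DbUser="admin", Sql=DDL,
    )
    # The Data API is asynchronous, so in real code wait (describe_statement)
    # for the DDL to finish before issuing the query.
    client.execute_statement(
        ClusterIdentifier="my-cluster", Database="analytics", DbUser="admin", Sql=QUERY,
    )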
Alternatively, you could use Amazon Athena to query the data directly from Amazon S3. This is similar to Redshift Spectrum, but does not require Redshift. Amazon Athena is based on Presto, so it is super-fast, especially if data is stored as compressed, partitioned, Parquet/ORC.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
Please note that Redshift Spectrum and Amazon Athena charge based upon the amount of data scanned in S3. Therefore, compressed, partitioned Parquet/ORC is both cheaper and faster.
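For completeness, running such an Athena query from code might look like this with boto3 (the database, table, and results bucket are placeholders):

    import boto3

    athena = boto3.client("athena")

    # Hypothetical database, table and results bucket. Filtering on the partition
    # column keeps the amount of data scanned (and therefore the cost) down.
    response = athena.start_query_execution(
        QueryString=(
            "SELECT event_type, COUNT(*) "
            "FROM events_archive.events "
            "WHERE event_date >= DATE '2020-01-01' "
            "GROUP BY event_type"
        ),
        QueryExecutionContext={"Database": "events_archive"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    print(response["QueryExecutionId"])  # poll get_query_execution / get_query_results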
I am considering using AWS DynamoDB for an application we are building. I understand that setting up a backup job that exports data from DynamoDB to S3 involves a Data Pipeline with EMR. But my question is: do I need to worry about having a backup job set up on day 1? What are the chances of data loss?
There are multiple use cases for copying DynamoDB table data elsewhere:
(1) Create a backup in S3 on a daily basis, in order to restore in case of accidental deletion of data or, worse yet, a dropped table (code bugs?)
(2) Create a backup in S3 to become the starting point of your analytics workflows. Once this data is backed up in S3, you can combine it with, say, your RDBMS system (RDS or on-premise) or other S3 data from log files. Data integration workflows could involve EMR jobs whose output is ultimately loaded into Redshift (ETL) for BI queries, or you could load the data directly into Redshift in a more ELT style, so transforms happen within Redshift
(3) Copy (the whole set or a subset of) data from one table to another (either within the same region or another region) - so the old table can be garbage collected for controlled growth and cost containment. This table-to-table copy could also be used as a readily consumable backup table in case of, say region-specific availability issues. Or, use this mechanism to copy data from one region to another to serve it from an endpoint closer to the DynamoDB client application that is using it.
(4) Periodic restore of data from S3, possibly as a way to load post-analytics data back into DynamoDB for serving it in online applications with high-concurrency, low-latency requirements.
AWS Data Pipeline helps schedule all these scenarios with flexible data transfer solutions (using EMR underneath).
One caveat when using these solutions: they are not point-in-time backups, so any changes to the underlying table that happen during the backup might be captured inconsistently.
This is really subjective. IMO, you shouldn't worry about backups 'now'.
You can also use simpler solutions than Data Pipeline. Perhaps that would be a good place to start.
After running DynamoDB as our main production database for more than a year, I can say it has been a great experience. No data loss and no downtime. The only things we have had to care about are the occasional SDK misbehaviour and tweaking provisioned throughput.
Data Pipeline is only available in a limited set of regions.
https://docs.aws.amazon.com/general/latest/gr/rande.html#datapipeline_region
I would recommend setting up a Data Pipeline to back up to an S3 bucket on a daily basis, if you want to be really safe.
DynamoDB itself might be very reliable, but nobody can protect you from your own accidental deletions (what if you or a colleague deleted a table from the console by mistake?). So I would suggest setting up a daily backup; it doesn't cost much in any case.
You can tell the pipeline to consume only, say, 25% of the read capacity while the backup is running, so that your real users don't see any delay. Every backup is "full" (not incremental), so you can periodically delete old backups if you are concerned about storage.