Incremental exports from Amazon DynamoDB to Amazon S3 - amazon-web-services

We need to run an analysis of the data in Amazon DynamoDB. Since doing it in the DDB isn't an option due to DDB's limitations with analysis, based on the recommendations I am leaning towards DDB -?> S3 -> Athena.
It is a data-heavy application with data streaming from AWS IoT devices and is also a multi-tenant application. Now, to sync data from DDB to Amazon S3, it will be probably a couple of times a day. How do we set up incremental exports for this purpose?

There is an Athena connector to be able to query your data in DynamoDB table directly using SQL query.
https://docs.aws.amazon.com/athena/latest/ug/athena-prebuilt-data-connectors-dynamodb.html
https://dev.to/jdonboch/finally-dynamodb-support-in-aws-quicksight-sort-of-2lbl

Another solution for this use case is you can write an AWS Step Functions workflow that when invoked, can read data from an Amazon DynamoDB table and then format the data to the way you want it and place the data into an Amazon S3 bucket (an example that shows a similar use case will be available soon):
This is the reverse (here the source is an Amazon S3 bucket and the target is an Amazon DynamoDB table) but you can build the Workflow so the target is an Amazon S3 bucket. Because it's a workflow, you can use a Lambda function that is scheduled to fire a few times a day based on a CRON expression. The job of this Lambda function is to invoke the workflow using the Step Functions API.

Related

Data Pipeline (DynamoDB to S3) - How to format S3 file?

I have a Data Pipeline that exports my DynamoDB table to an S3 bucket so I can use the S3 file for services like QuickSight, Athena and Forecast.
However, for my S3 file to work with these services, I need the file to be formatted in a csv like so:
date, journal, id
1589529457410, PLoS Genetics, 10.1371/journal.pgen.0030110
1589529457410, PLoS Genetics, 10.1371/journal.pgen.1000047
But instead, my exported file looks like this:
{"date":{"s":"1589529457410"},"journal":{"s":"PLoS Genetics"},"id":{"s":"10.1371/journal.pgen.0030110"}}
{"date":{"s":"1589833552714"},"journal":{"s":"PLoS Genetics"},"id":{"s":"10.1371/journal.pgen.1000047"}}
How can I specify the format for my exported file in S3 so I can operate with services like QuickSight, Athena and Forecast? I'd preferably do the data transformation using Data Pipeline as well.
Athena can read JSON data.
You can also use DynamoDB streams to stream the data to S3. Here is a link to a blog post with best practice and design patterns for streaming data from DynamoDB to S3 to be used with Athena.
You can use DynamoDB streams to trigger an AWS Lambda function, which can transform the data and store it in Amazon S3, Amazon Redshift etc. With AWS Lambda you could also trigger Amazon Forecast to retrain, or pass the data to Amazon Forecast for a prediction.
Alternatively you could use Amazon Data Pipeline to write the data to an S3 bucket as you currently have it. Then use a cloud watch event scheduled to run a lambda function, or an S3 event notification to run a lambda function. The lambda function can transform the file and store it in another S3 bucket for further processing.

AWS Glue ETL : transfer data to S3 Bucket

I wish to transfer data in a database like MySQL[RDS] to S3 using AWS Glue ETL.
I am having difficulty trying to do this the documentation is really not good.
I found this link here on stackoverflow:
Could we use AWS Glue just copy a file from one S3 folder to another S3 folder?
SO based on this link, it seems that Glue does not have an S3 bucket as a data Destination, it may have it as a data Source.
SO, i hope i am wrong on this.
BUT if one makes an ETL tool, one of the first basics on AWS is for it to tranfer data to and from an S3 bucket, the major form of storage on AWS.
So hope someone can help on this.
You can add a Glue connection to your RDS instance and then use the Spark ETL script to write the data to S3.
You'll have to first crawl the database table using Glue Crawler. This will create a table in the Data Catalog which can be used in the job to transfer the data to S3. If you do not wish to perform any transformation, you may directly use the UI steps for autogenerated ETL scripts.
I have also written a blog on how to Migrate Relational Databases to Amazon S3 using AWS Glue. Let me know if it addresses your query.
https://ujjwalbhardwaj.me/post/migrate-relational-databases-to-amazon-s3-using-aws-glue
Have you tried https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-copyrdstos3.html?
You can use AWS Data Pipeline - it has standard templates for full as well incrementation copy to s3 from RDS.

What is the most optimal way to automate data (csv file) transfer from s3 to Redshift without AWS Pipeline?

I am trying to take sql data stored in a csv file in an s3 bucket and transfer the data to AWS Redshift and automate that process. Would writing etl scripts with lambda/glue be the best way to approach this problem, and if so, how do I get the script/transfer to run periodically? If not, what would be the most optimal way to pipeline data from s3 to Redshift.
Tried using AWS Pipeline but that is not available in my region. I also tried to use the AWS documentation for Lambda and Glue but don't know where to find the exact solution to the problem
All systems (including AWS Data Pipeline) use the Amazon Redshift COPY command to load data from Amazon S3.
Therefore, you could write an AWS Lambda function that connects to Redshift and issues the COPY command. You'll need to include a compatible library (eg psycopg2) to be able to call Redshift.
You can use Amazon CloudWatch Events to call the Lambda function on a regular schedule. Or, you could get fancy and configure Amazon S3 Events so that, when a file is dropped in an S3 bucket, it automatically triggers the Lambda function.
If you don't want to write it yourself, you could search for existing code on the web, including:
The very simply Python-based christianhxc/aws-lambda-redshift-copy: AWS Lambda function that runs the copy command into Redshift
A more fully-featured node-based A Zero-Administration Amazon Redshift Database Loader | AWS Big Data Blog

Backup only new records from DynamoDB to S3 and load them into RedShift

I saw that similar questions already exist:
Backup AWS Dynamodb to S3
Copying only new records from AWS DynamoDB to AWS Redshift
Loading data from Amazon dynamoDB to redshift
Unfortunately most of them are outdated (since amazon introduced new services) and/or have different answers.
In my case I have two databases (RedShift and DynamoDB) and I have to:
Keep RedShift database up-to-date
Store database backup on S3
To do that I want to use that approach:
Backup only new/modified records
from DynamoDB to S3 at the end of the day. (1 file per day)
Update RedShift database using file from S3
So my question is what is the most efficient way to do that?
I read this tutorial but I am not sure that AWS Data Pipeline could be configured to "catch" only new records from DynamoDB. If that is not possible, scanning entire database every time is not an option.
Thank you in advance!
you can use Amazon Lambda with dynamodb stream (documentation)
you can configure your lambda function to get updated records (from dynamodb stream) and then update redshift db

Periodically moving query results from Redshift to S3 bucket

I have my data in a table in Redshift cluster. I want to periodically run a query against the Redshift table and store the results in a S3 bucket.
I will be running some data transformations on this data in the S3 bucket to feed into another system. As per AWS documentation I can use the UNLOAD command, but is there a way to schedule this periodically? I have searched a lot but I haven't found any relevant information around this.
You can use a scheduling tool like Airflow to accomplish this task. Airflow seem-lessly connects to Redshift and S3. You can have a DAG action, which polls Redshift periodically and unloads the data from Redshift onto S3.
I don't believe Redshift has the ability to schedule queries periodically. You would need to use another service for this. You could use a Lambda function, or you could schedule a cron job on an EC2 instance.
I believe you are looking for AWS data pipeline service.
You can copy data from redshift to s3 using the RedshiftCopyActivity (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html).
I am copying the relevant content from the above URL for future purposes:
"You can also copy from Amazon Redshift to Amazon S3 using RedshiftCopyActivity. For more information, see S3DataNode.
You can use SqlActivity to perform SQL queries on the data that you've loaded into Amazon Redshift."
Let me know if this helped.
You should try AWS Data Pipelines. You can schedule them to run periodically or on demand. I am confident that it would solve your use case