I've found a tutorial here for getting item-level changes from DynamoDB into S3 via Kinesis Firehose,
but how do I get these into a Redshift table? If an item is updated, a new record for it is created and posted to S3, so is there a tutorial or guidance on how to take these item-level changes and read them into a table?
Kinesis Firehose has multiple destinations that you can choose from. S3 is only one of them, and Redshift is another.
You can use the following configuration to set up Redshift as the destination.
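For example, the equivalent delivery stream can be created with boto3 along these lines. This is only a rough sketch: the role ARN, JDBC URL, credentials, staging bucket, and table name are placeholders you would replace with your own resources.

```python
import boto3

firehose = boto3.client("firehose")

# Minimal sketch: all ARNs, the JDBC URL, the credentials, and the table
# name below are placeholders. Firehose stages the records in the S3
# bucket and then issues a COPY into the Redshift table.
firehose.create_delivery_stream(
    DeliveryStreamName="ddb-changes-to-redshift",
    DeliveryStreamType="DirectPut",
    RedshiftDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "ClusterJDBCURL": "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
        "CopyCommand": {
            "DataTableName": "item_changes",
            "CopyOptions": "json 'auto'",
        },
        "Username": "firehose_user",
        "Password": "REPLACE_ME",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::my-firehose-staging-bucket",
        },
    },
)
```

Note that Firehose still uses the S3 bucket as a staging area and then runs a COPY into the Redshift table, so the S3Configuration is required even when Redshift is the destination.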
I'm looking at the diagram here, and from my understanding, a DynamoDB stream into a Redshift table via Kinesis Firehose will send the updates to the table as Redshift commands (i.e. update, insert, etc.). So this will keep a Redshift version of a DynamoDB table in sync.
But how do you deal with the historical data? Is there a good process for filling the Redshift table with the data to date, which can then be kept in sync via a DynamoDB stream? It's not trivial, because depending on the timing, some updates may be lost if I manually copy the data into a Redshift table and then switch on a DynamoDB stream.
Also, regarding the diagram, it shows Kinesis Firehose delivering data to S3, queryable by Athena. I feel like I'm missing something, because if the data going to S3 are only updates and new records, it doesn't seem like something that works well for Athena (a partitioned snapshot of the entire table makes more sense).
So if I have a DynamoDB table that is currently receiving data, and I want to create a new Redshift table that contains all the same data up to a given time and then receives all subsequent updates via a DynamoDB stream, how do I go about doing that?
I have a requirement to read a CSV batch file that was uploaded to an S3 bucket, encrypt the data in some columns, and persist this data in a DynamoDB table. While persisting each row in the DynamoDB table, I need to generate an ID based on the data in that row and store it in the DynamoDB table too. It seems AWS Data Pipeline allows creating a job to import S3 bucket files into DynamoDB, but I can't find a way to add custom logic there to encrypt some of the column values in the file, or to generate the ID mentioned above.
Is there any way I can achieve this requirement using AWS Data Pipeline? If not, what would be the best approach I could follow using AWS services?
We also have a situation where we need to fetch data from S3 and populate it into DynamoDB after performing some transformations (business logic).
We also use AWS Data Pipeline for this process.
We first trigger an EMR cluster from Data Pipeline, where we fetch the data from S3, transform it, and populate DynamoDB (DDB). You can include all the logic you require in the EMR cluster.
We have a timer set in the pipeline that triggers the EMR cluster once a day to perform the task.
This can incur additional costs too.
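As a rough illustration of the kind of custom logic such a step can run, the sketch below reads a CSV from S3, encrypts one column with KMS, derives an ID from the row contents, and writes each row to DynamoDB. The bucket, table, key alias, and column names are all assumptions, not part of the original setup.

```python
import base64
import csv
import hashlib
import io

import boto3

s3 = boto3.client("s3")
kms = boto3.client("kms")
table = boto3.resource("dynamodb").Table("customer_records")  # assumed table name

KMS_KEY_ID = "alias/my-column-key"  # assumed KMS key alias


def load_file(bucket, key):
    """Read a CSV from S3, encrypt a sensitive column, derive an ID, store each row."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    for row in csv.DictReader(io.StringIO(body)):
        # Encrypt an assumed sensitive column with KMS; store ciphertext as base64 text.
        ciphertext = kms.encrypt(KeyId=KMS_KEY_ID,
                                 Plaintext=row["ssn"].encode())["CiphertextBlob"]
        row["ssn"] = base64.b64encode(ciphertext).decode()

        # Derive a deterministic ID from the row contents (example rule only;
        # "email" and "signup_date" are assumed column names).
        row["id"] = hashlib.sha256(
            f"{row['email']}|{row['signup_date']}".encode()
        ).hexdigest()[:16]

        table.put_item(Item=row)
```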
In AWS Glue jobs, there are two approaches to retrieving data from a DB or S3: 1) using a crawler, or 2) using a direct connection to the DB or S3.
So my question is: how is a crawler better than connecting directly to a database and retrieving the data?
AWS Glue crawlers do not retrieve the actual data. A crawler accesses your data stores, progresses through a prioritized list of classifiers to extract the schema of your data and other statistics, and then populates the Glue Data Catalog with this metadata. Crawlers can be scheduled to run periodically, so they will detect newly available data as well as changes to existing data, including changes to table definitions. Crawlers automatically add new tables, new partitions to existing tables, and new versions of table definitions.
The AWS Glue Data Catalog becomes a common metadata repository between Amazon Athena, Amazon Redshift Spectrum, and Amazon S3. AWS Glue crawlers help build this metadata repository.
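For example, a crawler over an S3 prefix can be created and scheduled with boto3 roughly like this (the crawler name, role, database name, path, and schedule are placeholders):

```python
import boto3

glue = boto3.client("glue")

# Minimal sketch: the role, catalog database, and S3 path are placeholders.
# The crawler only writes schema and partition metadata to the Data Catalog;
# it does not copy the underlying data anywhere.
glue.create_crawler(
    Name="sales-data-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="analytics_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/sales/"}]},
    Schedule="cron(0 2 * * ? *)",  # run daily at 02:00 UTC
)
glue.start_crawler(Name="sales-data-crawler")
```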
I need to upgrade my application to handle streaming data. My application has different kinds of data that are stored in different MySQL tables.
So I want to create an AWS Kinesis Firehose delivery stream and an AWS Lambda function to receive, transform, and load my data to S3 as CSV files.
All the information I have found by googling explains very well how to implement this, but only by storing the data in one single CSV. I assume that with only one CSV, it will be interpreted by Athena as one table.
I have not found any information on creating and storing multiple CSV files (which would represent tables in Athena) using Kinesis Firehose and an AWS Lambda function.
Should I create a new Kinesis Firehose instance for each table I have in my MySQL database, or is there some way to store these data in different CSV files?
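For reference, the data-transformation Lambda that Firehose invokes follows a fixed contract: it receives base64-encoded records and must return each recordId with a result and the re-encoded data. The sketch below converts JSON records to CSV lines under that contract; the field names are assumptions about the MySQL rows.

```python
import base64
import csv
import io
import json

# Minimal sketch of a Firehose data-transformation Lambda. Each incoming
# record carries base64-encoded data, and the function must return the same
# recordId with result "Ok" and the transformed data re-encoded in base64.
# The field names below ("id", "name", "amount") are assumed column names.
FIELDS = ["id", "name", "amount"]


def handler(event, context):
    output = []
    for record in event["records"]:
        row = json.loads(base64.b64decode(record["data"]))

        # Convert the JSON row into a single CSV line.
        buf = io.StringIO()
        csv.writer(buf).writerow([row.get(f, "") for f in FIELDS])

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(buf.getvalue().encode()).decode(),
        })
    return {"records": output}
```

Whether you route everything through one delivery stream or one stream per table is a separate design decision; this only shows the per-record transformation step.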
I saw that similar questions already exist:
Backup AWS Dynamodb to S3
Copying only new records from AWS DynamoDB to AWS Redshift
Loading data from Amazon dynamoDB to redshift
Unfortunately, most of them are outdated (since Amazon has introduced new services) and/or have different answers.
In my case I have two databases (Redshift and DynamoDB) and I have to:
Keep the Redshift database up to date
Store a database backup on S3
To do that, I want to use the following approach:
Back up only new/modified records from DynamoDB to S3 at the end of the day (one file per day)
Update the Redshift database using the file from S3
So my question is: what is the most efficient way to do that?
I read this tutorial, but I am not sure whether AWS Data Pipeline can be configured to "catch" only new records from DynamoDB. If that is not possible, scanning the entire database every time is not an option.
Thank you in advance!
You can use AWS Lambda with a DynamoDB stream (documentation).
You can configure your Lambda function to get the updated records (from the DynamoDB stream) and then update the Redshift database.
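A minimal sketch of such a Lambda handler, assuming the target table is reachable through the Redshift Data API (the cluster, database, secret ARN, and the "items" table with "id"/"payload" columns are all placeholders):

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Minimal sketch: cluster, database, secret ARN, and the table/column names
# are placeholders. The function is triggered by a DynamoDB stream; for each
# INSERT/MODIFY event it replaces the corresponding row in Redshift via the
# Redshift Data API. Delete-then-insert keeps the example simple; a staged
# COPY plus merge would scale better for high-volume streams.
CLUSTER = "my-cluster"
DATABASE = "dev"
SECRET_ARN = "arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds"


def run_sql(sql, params):
    redshift_data.execute_statement(
        ClusterIdentifier=CLUSTER,
        Database=DATABASE,
        SecretArn=SECRET_ARN,
        Sql=sql,
        Parameters=params,
    )


def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"]["NewImage"]   # DynamoDB attribute-value map
        item_id = image["id"]["S"]
        payload = image.get("payload", {}).get("S", "")

        run_sql("DELETE FROM items WHERE id = :id;",
                [{"name": "id", "value": item_id}])
        run_sql("INSERT INTO items (id, payload) VALUES (:id, :payload);",
                [{"name": "id", "value": item_id},
                 {"name": "payload", "value": payload}])
```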