I want Amazon Redshift to push any newly inserted rows to an Amazon Kinesis Firehose delivery stream, which will transform the data with a Lambda function.
Can this be done? If so, can you point me to an example and documentation?
No. There is no trigger mechanism within Amazon Redshift to cause other things to happen (either within Redshift or external to it).
Related
I have gone through a couple of Stack Overflow questions about hourly backups from DDB to S3, where the best solution turned out to be enabling a DDB Stream, subscribing a Lambda function to it, and pushing to S3.
I am trying to understand whether it is fine to push directly from Lambda to S3, or whether I should go from Lambda to Kinesis Firehose and then to S3. Can someone explain the advantage of introducing Firehose in between? We only trigger the Lambda function after a specific batch window anyway, which means we are already buffering there.
Thanks in advance.
Firehose lets you convert and compress your data before it lands in S3. In addition, you can attach an AWS Glue table that describes the data, so you can query it with Athena.
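If Firehose sits between the Lambda function and S3, the function only has to forward records. Here is a minimal sketch of that, assuming the stream is configured to include new images and that a delivery stream already exists. The name "ddb-backup-stream" and the helper names are placeholders, not anything from the question. PutRecordBatch accepts at most 500 records per call, so the records are sent in chunks:

```python
import json

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_firehose_records(stream_event):
    """Turn a DynamoDB Streams event into newline-delimited JSON records."""
    records = []
    for rec in stream_event.get("Records", []):
        image = rec.get("dynamodb", {}).get("NewImage")
        if image is None:  # e.g. REMOVE events carry no new image
            continue
        line = json.dumps(image, sort_keys=True) + "\n"
        records.append({"Data": line.encode("utf-8")})
    return records


def forward(firehose, stream_name, records):
    """Send records in chunks of MAX_BATCH; return how many were rejected."""
    failed = 0
    for i in range(0, len(records), MAX_BATCH):
        resp = firehose.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=records[i:i + MAX_BATCH],
        )
        failed += resp.get("FailedPutCount", 0)
    return failed


def handler(event, context):
    import boto3  # imported here so the helpers above can be tested offline

    firehose = boto3.client("firehose")
    # "ddb-backup-stream" is a placeholder delivery stream name.
    failed = forward(firehose, "ddb-backup-stream", to_firehose_records(event))
    if failed:
        raise RuntimeError(f"{failed} records were not accepted by Firehose")
```

Raising on a partial failure makes Lambda retry the whole batch, which can produce duplicates in S3. Whether that is acceptable depends on how the backup is consumed.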
You can write a Lambda function that reads a DynamoDB table, gets a result set, encodes the data in some format (e.g., JSON), and then places that JSON into an Amazon S3 bucket. You can use scheduled events to fire off the Lambda function on a regular schedule.
Here is an AWS tutorial that shows you how to use scheduled events to invoke a Lambda function:
Creating scheduled events to invoke Lambda functions
This AWS tutorial also shows you how to read data from an Amazon DynamoDB table from a Lambda function.
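As a rough sketch of the function described above (not the code from the tutorial), the following scans a table page by page and writes newline-delimited JSON to S3. The table, bucket and key names are placeholders. The whole export is held in memory, so this only suits tables that fit comfortably in the function's memory:

```python
import json
from decimal import Decimal


def scan_all(table):
    """Yield every item, following LastEvaluatedKey pagination."""
    kwargs = {}
    while True:
        page = table.scan(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        kwargs["ExclusiveStartKey"] = last_key


def encode_item(value):
    """json.dumps fallback: boto3 returns DynamoDB numbers as Decimal."""
    if isinstance(value, Decimal):
        return int(value) if value == value.to_integral_value() else float(value)
    if isinstance(value, set):
        return sorted(value)
    raise TypeError(f"cannot serialize {type(value).__name__}")


def export_table(table, s3, bucket, key):
    """Write the table as newline-delimited JSON to s3://bucket/key."""
    lines = [
        json.dumps(item, default=encode_item, sort_keys=True)
        for item in scan_all(table)
    ]
    body = "".join(line + "\n" for line in lines).encode("utf-8")
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return len(lines)


def handler(event, context):
    import boto3  # imported here so the helpers above can be tested offline

    # Placeholder names: replace with your own table, bucket and key.
    table = boto3.resource("dynamodb").Table("my-table")
    s3 = boto3.client("s3")
    return export_table(table, s3, "my-backup-bucket", "exports/my-table.json")
```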
Firehose is fully managed whereas Streams is manually managed.
If other people are aware of other major differences, please add them. I'm just learning.
Thanks..
Amazon Kinesis Data Firehose can send data to:
Amazon S3
Amazon Redshift
Amazon Elasticsearch Service
Splunk
To do the same thing with Amazon Kinesis Data Streams, you would need to write an application that consumes data from the stream and then connects to the destination to store data.
So, think of Firehose as a pre-configured streaming application with a few specific options. Anything outside of those options would require you to write your own code.
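To give a feel for the code you would own with Kinesis Data Streams, here is a minimal single-shard reader, assuming a boto3 Kinesis client is passed in. The `handle` callback stands in for whatever writes to your destination, and `max_empty_polls` and `poll_interval` are illustrative parameters. It does no checkpointing and ignores resharding, which a real consumer (for example one built on the Kinesis Client Library) would have to deal with:

```python
import time


def read_shard(kinesis, stream_name, shard_id, handle,
               max_empty_polls=3, poll_interval=1.0):
    """Read one shard from its oldest record, passing each payload to handle()."""
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    delivered = 0
    empty_polls = 0
    # Stop when the shard is closed (no next iterator) or looks idle.
    while iterator and empty_polls < max_empty_polls:
        resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
        records = resp.get("Records", [])
        for record in records:
            handle(record["Data"])  # your code: write to the destination
            delivered += 1
        empty_polls = 0 if records else empty_polls + 1
        iterator = resp.get("NextShardIterator")
        time.sleep(poll_interval)  # stay under the per-shard read rate limit
    return delivered
```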
I am trying to setup a sync between AWS Aurora and Redshift. What is the best way to achieve this sync?
Possible ways to sync:
Query the table to find changes (since I am only doing inserts, updates don't matter), export these changes to a flat file in an S3 bucket, and use the Redshift COPY command to insert them into Redshift.
Use a Python publisher and Boto3 to publish changes into a Kinesis stream, then consume this stream in Firehose, from where I can copy directly into Redshift.
Use the Kinesis Agent to detect changes in the binlog (is it possible to detect changes in the binlog using the Kinesis Agent?), publish them to Firehose, and from there copy into Redshift.
I haven't explored AWS Data Pipeline yet.
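For the first option, the moving parts could look roughly like the sketch below. It assumes the table has a monotonically increasing id column to use as a high-water mark. The table name, query, S3 URI and IAM role are all hypothetical. Running the query against Aurora and uploading the CSV to S3 are left out:

```python
import csv
import io

# Hypothetical incremental query: fetch only rows newer than the last sync.
# %s is the placeholder style used by common MySQL/PostgreSQL drivers.
NEW_ROWS_SQL = "SELECT id, customer, amount FROM orders WHERE id > %s ORDER BY id"


def rows_to_csv(rows):
    """Serialize fetched rows as CSV text, ready to upload to S3."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def next_watermark(rows, current, id_index=0):
    """Return the highest id seen so far, to store for the next run."""
    return max([current] + [row[id_index] for row in rows])


def build_copy_sql(table, s3_uri, iam_role_arn):
    """Build the Redshift COPY statement that loads the uploaded CSV file."""
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' FORMAT AS CSV;"
    )
```

The watermark should only be saved after the COPY succeeds, otherwise a failed load would silently skip rows on the next run.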
As pointed out by @Mark B, the AWS Database Migration Service can migrate data between databases. This can be done as a one-off exercise, or it can run continuously, keeping two databases in sync.
The documentation shows that Amazon Aurora can be a source and Amazon Redshift can be a target.
AWS has just announced this new feature: Amazon Aurora zero-ETL integration with Amazon Redshift
This natively provides near real-time synchronization (within seconds) from Aurora to Redshift.
You can also use federated queries: https://docs.aws.amazon.com/redshift/latest/dg/federated-overview.html
I'm investigating Redshift for our data warehouse, and I'm trying to work out how to architect a solution.
I have an instance of Amazon Kinesis Firehose as a delivery stream which writes to my Redshift database, and all that works fine.
Now my issue is how do I automate the creation of dimensions and fact tables.
Can I use a Lambda function in the delivery stream to write to the fact table and update the dimensions?
The Data Transformation capability of AWS Lambda on an Amazon Kinesis Firehose is purely to modify or exclude streaming data. It cannot be used to create other tables.
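To illustrate what such a transformation function can and cannot do, here is a minimal sketch. Firehose passes in base64-encoded records, and each one must be returned with the same recordId and a result of Ok, Dropped or ProcessingFailed. The "heartbeat" filter and the added "source" field are made-up examples:

```python
import base64
import json


def handler(event, context):
    """Firehose transformation: modify or drop records, nothing else."""
    output = []
    for record in event["records"]:
        record_id = record["recordId"]
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Undecodable input: hand it back unchanged and flag the failure.
            output.append({"recordId": record_id,
                           "result": "ProcessingFailed",
                           "data": record["data"]})
            continue

        if payload.get("event_type") == "heartbeat":  # hypothetical filter
            output.append({"recordId": record_id,
                           "result": "Dropped",
                           "data": record["data"]})
            continue

        payload["source"] = "firehose"  # hypothetical enrichment
        data = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
        output.append({"recordId": record_id,
                       "result": "Ok",
                       "data": data.decode("ascii")})
    return {"records": output}
```

The function only sees the records in flight and returns them to Firehose. It has no handle on the destination, which is why it cannot maintain other tables.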
If you wish to create dimension and fact tables, or otherwise perform ETL, you'll need to trigger it externally, such as having a scheduled task run SQL commands on your Amazon Redshift instance. This task would connect via JDBC/ODBC to run the commands.
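A scheduled task of that kind could be as simple as the sketch below, which runs a fixed list of SQL statements in one transaction over any Python DB-API connection. The table and column names (staging_events, dim_customer, fact_sales) are hypothetical, and sqlite3 is used only as a stand-in so the example runs locally. A real job would open the connection with a Redshift-compatible driver instead:

```python
import sqlite3  # stand-in only; use a Redshift driver connection in practice

ETL_STATEMENTS = [
    # 1. Add customers that are not yet in the dimension table.
    """
    INSERT INTO dim_customer (customer_name)
    SELECT DISTINCT s.customer_name
    FROM staging_events s
    WHERE NOT EXISTS (
        SELECT 1 FROM dim_customer d
        WHERE d.customer_name = s.customer_name
    )
    """,
    # 2. Load facts, resolving each row to its dimension key.
    """
    INSERT INTO fact_sales (customer_id, amount)
    SELECT d.customer_id, s.amount
    FROM staging_events s
    JOIN dim_customer d ON d.customer_name = s.customer_name
    """,
    # 3. Clear the staging table that Firehose loads into.
    "DELETE FROM staging_events",
]


def run_etl(conn):
    """Run all statements in one transaction; roll back on any error."""
    cur = conn.cursor()
    try:
        for sql in ETL_STATEMENTS:
            cur.execute(sql)
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Here Firehose would deliver into the staging table only, and the scheduled job would be the single writer to the dimension and fact tables.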
When I read about AWS Data Pipeline, an idea immediately struck me: produce statistics to Kinesis and create a job in the pipeline that consumes data from Kinesis and COPYs it to Redshift every hour. All in one go.
But it seems there is no node in the pipeline that can consume from Kinesis. So now I have two possible plans of action:
Create an instance that consumes the Kinesis data and writes it to S3, split by hour. The pipeline then copies from there to Redshift.
Consume from Kinesis and issue COPY commands directly to Redshift on the spot.
What should I do? Is there no way to connect Kinesis to redshift using AWS services only, without custom code?
It is now possible to do so without user code via a new managed service called Kinesis Firehose. It manages the desired buffering intervals, temporary uploads to S3, uploads to Redshift, error handling, and automatic throughput management.
That is already done for you!
If you use the Kinesis Connector Library, there is a built-in connector to Redshift:
https://github.com/awslabs/amazon-kinesis-connectors
Depending on the processing logic you need, the connector can be really easy to implement.
You can create and orchestrate a complete pipeline with InstantStack to read data from Kinesis, transform it, and push it into Redshift or S3.