AWS Redshift ETL Process - amazon-web-services

I'm investigating Redshift for our data warehouse, and I'm trying to think through how to architect a solution.
I have an instance of Amazon Kinesis Firehose as a delivery stream which writes to my Redshift database, and all that works fine.
Now my issue is how to automate the creation of dimension and fact tables.
Can I use a Lambda function in the delivery stream to write to the fact table and update the dimensions?

The Data Transformation capability of AWS Lambda on an Amazon Kinesis Firehose is purely to modify or exclude streaming data. It cannot be used to create other tables.
If you wish to create dimension and fact tables, or otherwise perform ETL, you'll need to trigger it externally, such as having a scheduled task run SQL commands on your Amazon Redshift instance. This task would connect via JDBC/ODBC to run the commands.
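To make that concrete, here is a minimal sketch of such a scheduled task in Python. The staging, dimension, and fact table names (stg_events, dim_customer, fact_sales) are hypothetical, and psycopg2 stands in for whatever JDBC/ODBC driver you prefer; treat it as an illustration, not a prescribed implementation.

# Scheduled ETL sketch: connect to Redshift and maintain dimension/fact tables
# from a staging table that Firehose loads into. All names are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical endpoint
    port=5439,
    dbname="dw",
    user="etl_user",
    password="...",
)

with conn, conn.cursor() as cur:
    # Add any customers we have not seen before to the dimension table
    cur.execute("""
        INSERT INTO dim_customer (customer_id, customer_name)
        SELECT DISTINCT s.customer_id, s.customer_name
        FROM stg_events s
        WHERE NOT EXISTS (
            SELECT 1 FROM dim_customer d WHERE d.customer_id = s.customer_id
        );
    """)
    # Append the new events to the fact table
    cur.execute("""
        INSERT INTO fact_sales (customer_id, sale_ts, amount)
        SELECT customer_id, event_ts, amount
        FROM stg_events;
    """)

conn.close()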

Related

Transfer Data from Amazon Timestream to DynamoDB

I am working in the IoT space with two databases: Amazon Timestream and Amazon DynamoDB.
My sensor data comes into Timestream via AWS IoT Core and MQTT. I set up a rule that transfers the incoming data directly into Timestream.
What I need to do now is run some operations on the data and save the results of these operations into DynamoDB.
I know DynamoDB has a feature called DynamoDB Streams. Is there a solution like Streams in Timestream as well? Or does anybody have an idea how I can automatically transfer the results of the operations from Timestream to DynamoDB?
Timestream does not have Change Data Capture capabilities.
The best thing to do is to write the data into DynamoDB from wherever you are performing your operations on the Timestream data. For example, if you are using AWS Glue to analyze your Timestream data, you can sink the results directly from Glue using the DynamoDB sink.
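As a rough illustration of that Glue-to-DynamoDB sink (the DynamoDB table name and the toy DataFrame below are placeholders for whatever your Timestream analysis actually produces):

# Glue job sketch: write analysis results to DynamoDB using the DynamoDB sink.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Stand-in for the results of your operations on the Timestream data
results_df = spark.createDataFrame([("sensor-1", 21.4)], ["device_id", "avg_temp"])
results_dyf = DynamicFrame.fromDF(results_df, glue_context, "results")

glue_context.write_dynamic_frame.from_options(
    frame=results_dyf,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": "sensor_aggregates",  # hypothetical DynamoDB table
    },
)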
Timestream has the concept of Scheduled Queries. When a query has run, you can be notified via an SNS topic. You could subscribe a Lambda function to that SNS topic to retrieve the query results and store them in DynamoDB.
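A sketch of what that Lambda could look like, assuming the scheduled query materialises its results into a Timestream table "iot"."hourly_averages" and a DynamoDB table sensor_aggregates (both hypothetical):

# Lambda subscribed to the SNS topic notified by the Timestream scheduled query.
# It reads recent rows from the query's target table and stores them in DynamoDB.
import boto3

timestream = boto3.client("timestream-query")
dynamodb = boto3.resource("dynamodb")
target = dynamodb.Table("sensor_aggregates")  # hypothetical DynamoDB table


def handler(event, context):
    # The SNS payload only tells us the scheduled query ran; pull its latest
    # results from the (hypothetical) target table it writes to.
    result = timestream.query(
        QueryString='SELECT device_id, avg_temp, time '
                    'FROM "iot"."hourly_averages" WHERE time > ago(1h)'
    )
    columns = [c["Name"] for c in result["ColumnInfo"]]
    for row in result["Rows"]:
        item = dict(zip(columns, [d.get("ScalarValue") for d in row["Data"]]))
        target.put_item(Item=item)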

AWS Glue - Tracking Processed Data on DocumentDB

I have a DocumentDB as the data source.
I am running an AWS Glue job that pulls all the data from a certain table and then inserts it into a Redshift cluster.
Is it possible to avoid adding duplicate data?
I have seen that AWS Glue supports job bookmarks, but this does not seem to work with DocumentDB as the data source.
Thanks.

Periodically moving query results from Redshift to S3 bucket

I have my data in a table in a Redshift cluster. I want to periodically run a query against the Redshift table and store the results in an S3 bucket.
I will be running some data transformations on this data in the S3 bucket to feed into another system. As per AWS documentation I can use the UNLOAD command, but is there a way to schedule this periodically? I have searched a lot but I haven't found any relevant information around this.
You can use a scheduling tool like Airflow to accomplish this task. Airflow connects seamlessly to Redshift and S3. You can have a DAG task that runs periodically and unloads the data from Redshift to S3.
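For reference, a minimal Airflow DAG along those lines might look like the following. It assumes the apache-airflow-providers-amazon package and pre-configured Airflow connections for Redshift and AWS; the bucket, schema, and table names are made up.

# Hourly DAG that unloads a Redshift table to S3 via the Amazon provider's operator.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.redshift_to_s3 import RedshiftToS3Operator

with DAG(
    dag_id="redshift_unload_to_s3",
    schedule_interval="@hourly",       # adjust to however often you need the export
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    RedshiftToS3Operator(
        task_id="unload_table",
        schema="public",
        table="my_table",                   # hypothetical source table
        s3_bucket="my-transform-bucket",    # hypothetical destination bucket
        s3_key="exports/my_table/",
    )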
I don't believe Redshift has the ability to schedule queries periodically. You would need to use another service for this. You could use a Lambda function, or you could schedule a cron job on an EC2 instance.
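If you go the Lambda route, one possible shape (assuming the Redshift Data API is available to you; otherwise the same UNLOAD can be issued over a JDBC/ODBC connection) is an EventBridge/CloudWatch Events schedule invoking a handler like this. The cluster, database user, IAM role, table, and bucket names are hypothetical.

# Scheduled Lambda sketch: issue an UNLOAD against Redshift via the Data API.
import boto3

redshift_data = boto3.client("redshift-data")

UNLOAD_SQL = """
UNLOAD ('SELECT * FROM public.my_table')
TO 's3://my-transform-bucket/exports/my_table_'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET;
"""


def handler(event, context):
    # The Data API runs the statement asynchronously; we just kick it off.
    response = redshift_data.execute_statement(
        ClusterIdentifier="my-cluster",   # hypothetical cluster
        Database="dev",
        DbUser="unload_user",
        Sql=UNLOAD_SQL,
    )
    return {"statement_id": response["Id"]}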
I believe you are looking for the AWS Data Pipeline service.
You can copy data from Redshift to S3 using the RedshiftCopyActivity (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html).
I am copying the relevant content from the above URL for future reference:
"You can also copy from Amazon Redshift to Amazon S3 using RedshiftCopyActivity. For more information, see S3DataNode.
You can use SqlActivity to perform SQL queries on the data that you've loaded into Amazon Redshift."
Let me know if this helped.
You should try AWS Data Pipeline. You can schedule pipelines to run periodically or on demand. I am confident it would solve your use case.

Stream data from EC2 web server to Redshift

We would like to stream data directly from an EC2 web server to Redshift. Do I need to use Kinesis? What is the best practice? I do not plan to do any special analysis on this data before storage. I would like a cost-effective solution (it might be costly to use DynamoDB as temporary storage before loading).
If cost is your primary concern, then the exact number of records per second combined with the record sizes can be important.
If you are talking about a very low volume of messages, a custom app running on a t2.micro instance to aggregate the data is about as cheap as you can go, but it won't scale. The bigger downside is that you are responsible for monitoring, maintaining, and managing that EC2 instance.
The modern approach would be to use a combination of Kinesis + Lambda + S3 + Redshift to have the data stream in with no EC2 instances to manage!
The approach is described in this blog post: A Zero-Administration Amazon Redshift Database Loader
What that blog post doesn't mention is that now, with API Gateway, if you do need any kind of custom authentication or data transformation, you can do it without an EC2 instance by using Lambda to broker the data into Kinesis.
This would look like:
API Gateway -> Lambda -> Kinesis -> Lambda -> S3 -> Redshift
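A rough sketch of the first Lambda in that chain (the broker behind API Gateway); the stream name and payload shape are hypothetical:

# Lambda behind API Gateway: validate the incoming payload, then forward it
# into a Kinesis stream for the rest of the pipeline to pick up.
import json

import boto3

kinesis = boto3.client("kinesis")


def handler(event, context):
    payload = json.loads(event["body"])  # body as delivered by the API Gateway proxy integration

    # Any custom authentication or transformation mentioned above would go here.
    if "user_id" not in payload:
        return {"statusCode": 400, "body": "missing user_id"}

    kinesis.put_record(
        StreamName="web-events",                      # hypothetical stream
        Data=json.dumps(payload).encode("utf-8"),
        PartitionKey=str(payload["user_id"]),
    )
    return {"statusCode": 200, "body": "accepted"}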
Redshift is best suited for batch loading using the COPY command. A typical pattern is to land data in DynamoDB, S3, or Kinesis, then aggregate the events before using COPY to load them into Redshift.
See also this useful SO Q&A.
I implemented such a system last year inside my company using Kinesis and the Kinesis connector. The Kinesis connector is just a standalone app released by AWS that we run on a bunch of Elastic Beanstalk servers as Kinesis consumers. The connector aggregates messages to S3 every so often or after a certain number of messages, and then triggers the COPY command on Redshift to load the data periodically. Since it runs on Elastic Beanstalk, you can tune the auto-scaling conditions to make sure the cluster grows and shrinks with the volume of data from the Kinesis stream.
BTW, AWS just announced Kinesis Firehose yesterday. I haven't played with it yet, but it definitely looks like a managed version of the Kinesis connector.

How to copy data in bulk from Kinesis -> Redshift

When I read about AWS Data Pipeline, the idea immediately struck me: produce statistics to Kinesis and create a job in Data Pipeline that will consume data from Kinesis and COPY it to Redshift every hour. All in one go.
But it seems there is no node in Data Pipeline that can consume from Kinesis. So now I have two possible plans of action:
Create an instance where the Kinesis data will be consumed and sent to S3, split by hour. Data Pipeline will copy from there to Redshift.
Consume from Kinesis and issue the COPY directly to Redshift on the spot.
What should I do? Is there no way to connect Kinesis to Redshift using AWS services only, without custom code?
It is now possible to do so without user code via a new managed service called Kinesis Firehose. It manages the desired buffering intervals, temporary uploads to S3, the load into Redshift, error handling, and throughput management automatically.
That is already done for you!
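For what it's worth, the producer side is then a single API call; the delivery stream name and record shape below are made up, and Firehose handles the buffering, S3 staging, and COPY into Redshift once the stream is configured with a Redshift destination.

# Producer sketch: push a record into a Firehose delivery stream that is
# configured to deliver to Redshift.
import json

import boto3

firehose = boto3.client("firehose")

record = {"event": "page_view", "user_id": 42, "ts": "2016-10-07T12:00:00Z"}
firehose.put_record(
    DeliveryStreamName="stats-to-redshift",  # hypothetical delivery stream
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)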
If you use the Kinesis Connector Library, there is a built-in connector to Redshift
https://github.com/awslabs/amazon-kinesis-connectors
Depending on the logic you have to process, the connector can be really easy to implement.
You can create and orchestrate a complete pipeline with InstantStack to read data from Kinesis, transform it, and push it into Redshift or S3.