Best practice for reading data from Kafka to AWS Redshift

What is the best practice for moving data from a Kafka cluster to a Redshift table?
We have continuous data arriving on Kafka and I want to write it to tables in Redshift (it doesn't have to be in real time).
Should I use a Lambda function?
Should I write a Redshift connector (consumer) that will run on a dedicated EC2 instance? (The downside is that I need to handle redundancy.)
Is there some AWS pipeline service for that?

Kafka Connect is commonly used for streaming data from Kafka to (and from) data stores. It does useful things like automagically managing scale-out, failover, schemas, serialisation, and so on.
This blog shows how to use the open-source JDBC Kafka Connect connector to stream to Redshift. There is also a community Redshift connector, but I've not tried this.
This blog shows another approach, not using Kafka Connect.
Disclaimer: I work for Confluent, who created the JDBC connector.
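For a feel of what this looks like in practice, here's a rough sketch of registering a JDBC sink connector through the Kafka Connect REST API. The Connect host, cluster endpoint, credentials, and topic name below are placeholders, and the exact config keys depend on your connector version:

```python
import requests

# Hypothetical Kafka Connect worker REST endpoint -- replace with your own.
CONNECT_URL = "http://connect-host:8083/connectors"

connector = {
    "name": "redshift-jdbc-sink",
    "config": {
        # Confluent JDBC sink connector class
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "tasks.max": "1",
        "topics": "orders",
        # Redshift's JDBC endpoint (port 5439 by default); the matching JDBC
        # driver must be on the Connect worker's classpath.
        "connection.url": "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
        "connection.user": "etl_user",
        "connection.password": "********",
        "auto.create": "true",   # let the connector create the target table
        "insert.mode": "insert",
    },
}

resp = requests.post(CONNECT_URL, json=connector)
resp.raise_for_status()
print(resp.json())
```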

Related

Writing data to AWS Aurora using StreamSets

I've got a requirement where we have to write real-time data to AWS Aurora (PostgreSQL) using StreamSets Data Collector. I have never worked with StreamSets, but I have learned that it's a data connector. I tried searching for something on this topic, but no luck. Any idea how StreamSets can be used to write data to Aurora?
You can use StreamSets Data Collector's JDBC Producer destination to write data to Aurora. Data Collector includes the JDBC driver required for PostgreSQL.
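The JDBC Producer itself is configured in the Data Collector UI (a JDBC connection string of the form jdbc:postgresql://&lt;aurora-endpoint&gt;:5432/&lt;db&gt;, plus credentials and the target table). As a quick sanity check of the endpoint and credentials you plan to use, here's a minimal sketch using psycopg2 (all names below are placeholders) that issues the same kind of INSERT the JDBC Producer would:

```python
import psycopg2

# Placeholder Aurora PostgreSQL cluster endpoint and credentials.
conn = psycopg2.connect(
    host="my-aurora-cluster.cluster-abc123.us-east-1.rds.amazonaws.com",
    port=5432,
    dbname="appdb",
    user="writer",
    password="********",
)

with conn, conn.cursor() as cur:
    # One row per record, as the JDBC Producer destination would write.
    cur.execute(
        "INSERT INTO events (event_id, payload) VALUES (%s, %s)",
        ("evt-001", '{"source": "streamsets-test"}'),
    )
conn.close()
```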

Migrate from Oracle RDBMS to AWS S3 with Kinesis

Any suggested architecture?
1. For the first full load using Kinesis, how do I automate it so that it creates different streams for different tables? (Is this the way to do it?)
2. In case there is a new additional table, how do I create a new stream automatically?
3. How do I load to Kinesis incrementally (whenever the data is populated)?
Any resources/architectures will definitely be helpful. I'm using Kinesis because multiple other downstream consumers might access this data in the future.
I recommend looking into the AWS Schema Conversion Tool (AWS SCT) and AWS Database Migration Service (AWS DMS). DMS does not necessarily use Kinesis, but it is specifically designed for this use case.
Start with the walkthrough in this blog post: "How to Migrate Your Oracle Data Warehouse to Amazon Redshift Using AWS SCT and AWS DMS"
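If you go the DMS route, a replication task can be created from the console or the API. A rough boto3 sketch (the endpoint and replication instance ARNs below are placeholders and must already exist):

```python
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Include every table in every schema; DMS tracks tables for you, so there is
# no need to hand-create a separate stream per table.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-s3",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",  # placeholder
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",  # placeholder
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",  # placeholder
    MigrationType="full-load-and-cdc",  # initial full load plus ongoing changes
    TableMappings=json.dumps(table_mappings),
)
```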

What are the differences between Amazon Redshift and the new AWS Glue datawarehousing services?

I am confused about these two services. It looks like they offer the same service. Probably the only difference is that the Glue catalog can contain a wider range of data sources. Does that mean AWS Glue can replace Redshift?
The comment is right: these two services are not the same. AWS Glue is an ETL service, while Amazon Redshift is a data warehousing service.
According to the AWS documentation:
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. It allows you to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.
According to the AWS documentation:
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores
You can refer to the documentation provided by AWS for details, but essentially these are totally different services.
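To make the distinction concrete, a typical Glue job is PySpark code that reads from the Glue Data Catalog, transforms the data, and loads it into a warehouse such as Redshift. A rough sketch (the database, connection, table, and bucket names are placeholders):

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table that a Glue crawler has already catalogued (placeholder names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders")

# Glue does the ETL work (here just mapping/casting columns)...
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "double", "amount", "double")])

# ...and Redshift is simply the warehouse it loads the result into.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-connection",  # a Glue connection to the cluster
    connection_options={"dbtable": "orders", "database": "dev"},
    redshift_tmp_dir="s3://my-temp-bucket/glue/")

job.commit()
```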

AWS Redshift ETL Process

I'm investigating Redshift for our data warehouse, and I'm trying to think of how to architect a solution.
I have an instance of Amazon Kinesis Firehose as a delivery stream which writes to my Redshift database, and all that works fine.
Now my issue is: how do I automate the creation of dimension and fact tables?
Can I use a Lambda function in the delivery stream to write to the fact table and update the dimensions?
The Data Transformation capability of AWS Lambda on an Amazon Kinesis Firehose is purely to modify or exclude streaming data. It cannot be used to create other tables.
If you wish to create dimension and fact tables, or otherwise perform ETL, you'll need to trigger it externally, such as having a scheduled task run SQL commands on your Amazon Redshift instance. This task would connect via JDBC/ODBC to run the commands.
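For example, a scheduled job (cron, CloudWatch Events, etc.) could run the dimension and fact loads as plain SQL against the cluster. A minimal sketch, assuming psycopg2 and placeholder staging, dimension, and fact table names:

```python
import psycopg2

# Placeholder Redshift endpoint, credentials, and table names.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="etl_user", password="********")

DIM_SQL = """
INSERT INTO dim_customer (customer_id, customer_name)
SELECT DISTINCT s.customer_id, s.customer_name
FROM staging_events s
LEFT JOIN dim_customer d ON d.customer_id = s.customer_id
WHERE d.customer_id IS NULL;
"""

FACT_SQL = """
INSERT INTO fact_sales (customer_id, sale_ts, amount)
SELECT customer_id, sale_ts, amount
FROM staging_events;
"""

with conn, conn.cursor() as cur:
    cur.execute(DIM_SQL)                      # add any new dimension members first
    cur.execute(FACT_SQL)                     # then load the facts
    cur.execute("TRUNCATE staging_events;")   # clear the Firehose landing table
conn.close()
```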

Stream data from EC2 web server to Redshift

We would like to stream data directly from an EC2 web server to Redshift. Do I need to use Kinesis? What is the best practice? I do not plan to do any special analysis on this data before it is stored. I would like a cost-effective solution (it might be costly to use DynamoDB as temporary storage before loading).
If cost is your primary concern, then the exact number of records per second combined with the record sizes is important.
If you are talking about a very low volume of messages, a custom app running on a t2.micro instance to aggregate the data is about as cheap as you can go, but it won't scale. The bigger downside is that you are responsible for monitoring, maintaining, and managing that EC2 instance.
The modern approach would be to use a combination of Kinesis + Lambda + S3 + Redshift to have the data stream in, with no EC2 instances to manage!
The approach is described in this blog post: A Zero-Administration Amazon Redshift Database Loader
What that blog post doesn't mention is that, with API Gateway, if you do need any kind of custom authentication or data transformation, you can do it without an EC2 instance by using Lambda to broker the data into Kinesis.
This would look like:
API Gateway -> Lambda -> Kinesis -> Lambda -> S3 -> Redshift
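The first Lambda in that chain is just a small broker. A rough sketch of a handler that forwards the API Gateway payload into a Kinesis stream (the stream name and partition key field are placeholders):

```python
import json
import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "web-events"   # placeholder stream name

def handler(event, context):
    # With an API Gateway proxy integration, the request body arrives as a string.
    body = event.get("body") or "{}"
    record = json.loads(body)

    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record.get("user_id", "anonymous"),  # placeholder key field
    )
    return {"statusCode": 200, "body": json.dumps({"accepted": True})}
```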
Redshift is best suited for batch loading using the COPY command. A typical pattern is to load data into DynamoDB, S3, or Kinesis, then aggregate the events before using COPY to load them into Redshift.
See also this useful SO Q&A.
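The COPY itself is plain SQL run against the cluster once a batch has been staged in S3. A minimal sketch assuming psycopg2, with placeholder bucket, table, and IAM role names:

```python
import psycopg2

# Placeholder cluster endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="etl_user", password="********")

# COPY pulls the staged batch from S3 in parallel across the cluster nodes,
# which is far more efficient than row-by-row INSERTs.
COPY_SQL = """
COPY events
FROM 's3://my-staging-bucket/events/2024/01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS JSON 'auto'
GZIP;
"""

with conn, conn.cursor() as cur:
    cur.execute(COPY_SQL)
conn.close()
```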
I implemented such a system last year inside my company using Kinesis and the Kinesis connector. The Kinesis connector is just a standalone app released by AWS; we run it on a bunch of Elastic Beanstalk servers as Kinesis consumers. The connector aggregates messages to S3 every so often, or after a certain number of messages, and then triggers the COPY command to load the data into Redshift periodically. Since it runs on Elastic Beanstalk, you can tune the auto-scaling conditions to make sure the cluster grows and shrinks with the volume of data from the Kinesis stream.
BTW, AWS just announced Kinesis Firehose yesterday. I haven't played with it yet, but it definitely looks like a managed version of the Kinesis connector.