How to copy data in bulk from Kinesis -> Redshift

How to copy data in bulk from Kinesis -> Redshift - amazon-web-services

When i read about AWS data pipeline the idea immediately struck - produce statistics to kinesis and create a job in pipeline that will consume data from kinesis and COPY it to redshift every hour. All in one go.
But it seems there is no node in pipeline that can consume kinesis. So now i have two possible plans of action:
Create instance where Kinesis's data will be consumed and sent to S3 split by hours. Pipeline will copy from there to Redshift.
Consume from Kinesis and produce COPY directly to Redshift on the spot.
What should I do? Is there no way to connect Kinesis to redshift using AWS services only, without custom code?

It is now possible to do so without user-code via a new managed service called Kinesis Firehose. It manages the desired buffering intervals, temp uploads to s3, upload to Redshift, error handling and auto throughput management.

That is already done for you!
If you use the Kinesis Connector Library, there is a built-in connector to Redshift
https://github.com/awslabs/amazon-kinesis-connectors
Depending on the logic you have to process connector can be really easy to implement.

You can create and orchestrate complete pipeline with InstantStack to read data from Kinesis, transform it and push it into any Redshift or S3.

Related

Streaming Data From different Sources to AWS S3

I have different data sources and I need to publish them to S3 in real-time. I also need to process and validate data before delivering them to S3 buckets. I know that AWS Kinesis Data Stream offers Real-time data streaming and I can process data using AWS lambda before sending them to S3. However, it is not clear for me that can we use AWS Glue Streaming instead of AWS Kinesis Data Stream and AWS Lambda? I have seen some documentations about using AWS Glue Streaming for processing real-time data on the fly and send them to S3. So, what is the real differences here? Is AWS Glue Streaming ETL a good choice for streaming and processing data in real-time and store them into S3?

Kinesis data stream with lambda consumer will fit as long as the lambda execution environment limits is sufficient
15 mins execution time
Memory config
Concurrency limits
When going with glue consumer, your glue jobs can run longer and also supports Apache spark for massive parallel processing
You can also use Kinesis firehose which has native integration to deliver data to S3, ElasticSearch etc..., which doesn't require any changes to data. You can also have a lambda to do minimal processing intercepting the data before delivering using firehose.

Looking for guidance/alternatives on ETL process on AWS

I'm working on an ETL pipeline using Apache NiFi, the flow runs hourly and is something like this:
data provider API->Apache Nifi->S3 landing
->Athena Query to transform the data->S3 stage
->Athena Query to change field types and join with another data so it be ready for analysis->S3 trusted
->Glue->Redshift
I found GLUE to be expensive to send data to redshift, will code something ad-hoc to use the COPY command.
The question I would like to ask is if you can guide me if there is a better tool/way to do something better/cheaper/scalable, specially on steps 2 and 3.
I'm looking for ways, to optimize this process and make it ready to recieve millions of registries per hour.
Thank you!

Interesting workflow.
You can actually use some neat combinations to automatically get data from s3 into redshift.
You can do S3 (Raw Data) -> Lambda (Off PUT notification) -> Kinesis Firehose -> S3 (batched & transformed with firehose transformer) -> Kinesis Redshift Copy
This flow will completely automate updates based on your data. You can read more about it here. Hope this helps.

You can save your data in partitioned fashion in s3.
Then use glue spark jobs to transform the data and implementing joins and aggregations as that will be fast if written in optimized way.
This will also save you cost as glue will process the data faster then expected and then to move data to redshift copy command is the best approach.

Read AWS GLUE https://aws.amazon.com/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console

Kinesis to S3 custom partitioning

Question
I've read this and this and this articles. But they provide contradictory answers to the question: how to customize partitioning on ingesting data to S3 from Kinesis Stream?
More details
Currently, I'm using Firehose to deliver data from Kinesis Streams to Athena. Afterward, data will be processed with EMR Spark.
From time to time I have to handle historical bulk ingest into Kinesis Streams. The issue is that my Spark logic hardly depends on data partitioning and order of event handling. But Firehouse supports partitioning only by ingestion_time (into Kinesis Stream), not by any other custom field (I need by event_time).
For example, under Firehouse's partition 2018/12/05/12/some-file.gz I can get data for the last few years.
Workarounds
Could you please help me to choose between the following options?
Copy/partition data from Kinesis Steam with help of custom lambda. But this looks more complex and error-prone for me. Maybe because I'm not very familiar with AWS lambdas. Moreover, I'm not sure how well it will perform on bulk load. At this article it was said that Lambda option is much more expensive than Firehouse delivery.
Load data with Firehouse, then launch Spark EMR job to copy the data to another bucket with right partitioning. At least it sounds simpler for me (biased, I just starting with AWS Lambas). But it has the drawback of double-copy and additional spark Job.
At one hour I could have up to 1M rows that take up to 40 MB of memory (at compressed state). From Using AWS Lambda with Amazon Kinesis I know that Kinesis to Lambda event sourcing has a limitation of 10,000 records per batch. Would it be effective to process such volume of data with Lambda?

While Kinesis does not allow you to define custom partitions, Athena does!
The Kinesis stream will stream into a table, say data_by_ingestion_time, and you can define another table data_by_event_time that has the same schema, but is partitioned by event_time.
Now, you can make use of Athena's INSERT INTO capabilities to let you repartition data without needing to write Hadoop or a Spark job and you get the serverless scale-up of Athena for your data volume. You can use SNS, cron, or a workflow engine like Airflow to run this at whatever interval you need.
We dealt with this at my company and go in-to more depth details of the trade-offs of using EMR or a streaming solution, but now you don't need to introduce anymore systems like Lambda or EMR.
https://radar.io/blog/custom-partitions-with-kinesis-and-athena

you may use the kinesis stream, and create the partitions like you wants.
you create a producer, and in your consumer, create the partitions.
https://aws.amazon.com/pt/kinesis/data-streams/getting-started/

AWS Redshift ETL Process

I'm investigating redshift for our Data Warehouse, and I'm trying to think of how to architect a solution.
I have an instance of Amazon Kinesis Firehose as a delivery stream which writes to my Redshift database, and all that works fine.
Now my issue is how do I automate the creation of dimensions and fact tables.
Can I use a Lambda function in the delivery stream to write to the fact table and update the dimensions?

The Data Transformation capability of AWS Lambda on an Amazon Kinesis Firehose is purely to modify or exclude streaming data. It cannot be used to create other tables.
If you wish to create dimension and fact tables, or otherwise perform ETL, you'll need to trigger it externally, such as having a scheduled task run SQL commands on your Amazon Redshift instance. This task would connect via JDBC/ODBC to run the commands.

Stream data from EC2 web server to Redshift

We would like to stream data directly from EC2 web server to RedShift. Do I need to use Kinesis? What is the best practice? I do not plan to do any special analysis before the storage on this data. I would like a cost effective solution (it might be costly to use DynamoDB as a temporary storage before loading).

If cost is your primary concern than the exact number of records/second combined with the record sizes can be important.
If you are talking very low volume of messages a custom app running on a t2.micro instance to aggregate the data is about as cheap as you can go, but it won't scale. The bigger downside is that you are responsible for monitoring, maintaining, and managing that EC2 instance.
The modern approach would be to use a combination of Kinesis + Lambda + S3 + Redshift to have the data stream in requiring no EC2 instances to mange!
The approach is described in this blog post: A Zero-Administration Amazon Redshift Database Loader
What that blog post doesn't mention is now with API Gateway if you do need to do any type of custom authentication or data transformation you can do that without needing an EC2 instance by using Lambda to broker the data into Kinesis.
This would look like:
API Gateway -> Lambda -> Kinesis -> Lambda -> S3 -> Redshift

Redshift is best suited for batch loading using the COPY command. A typical pattern is to load data to either DynamoDB, S3, or Kinesis, then aggregate the events before using COPY to Redshift.
See also this useful SO Q&A.

I implemented a such system last year inside my company using Kinesis and Kinesis connector. Kinesis connector is just a standalone app released by AWS we are running in a bunch of ElasticBeanStalk servers as Kinesis consumers, then the connector will aggregate messages to S3 every a while or every amount of messages, then it will trigger the COPY command from Redshift to load data into Redshift periodically. Since it's running on EBS, you can tune the auto-scaling conditions to make sure the cluster grows and shrinks with the volume of data from Kinesis stream.
BTW, AWS just announced Kinesis Firehose yesterday. I haven't played it but it definitely looks like a managed version of the Kinesis connector.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js