AWS Kinesis Connector Library

I am developing a real-time streaming application that needs to send data to AWS Kinesis streams and from there to AWS Redshift. Based on my reading of the documentation, the following are the options for pushing data from Kinesis Streams to Redshift:
Kinesis Streams -> Lambda Function -> Redshift
Kinesis Streams -> Lambda Function -> Kinesis Firehose -> Redshift
Kinesis Streams -> Kinesis Connector Library -> Redshift (https://github.com/awslabs/amazon-kinesis-connectors)
I found the Kinesis Connector option to be the best one for moving data from Streams to Redshift. However, I am not able to understand where this library gets deployed and how it runs. Does it need to run as a Lambda function, or as a Java application on an EC2 instance? I cannot get that information from the README. If anyone has worked with the connectors successfully, I would very much appreciate the insight.

If you're using the Kinesis Connector Library, you deploy it on an EC2 instance. However, using a Lambda function without the Connector Library is a lot easier and better in my opinion: Lambda handles batching, scaling out invocations, and retries for you. Dead-letter queues are probably coming soon for Lambda + Kinesis too.
Basically, it's a lot easier to scale and deal with failures in Lambda.
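As a rough sketch of that approach, here is what a Python Lambda handler wired to the stream as an event source might look like. It forwards each batch to a Kinesis Firehose delivery stream that is assumed to be configured to COPY into Redshift; the delivery stream name is illustrative, not from the original question.

    import base64

    import boto3

    firehose = boto3.client("firehose")

    # Hypothetical Firehose delivery stream configured to COPY into Redshift.
    DELIVERY_STREAM = "example-to-redshift"

    def handler(event, context):
        """Lambda entry point; 'event' carries one batch of Kinesis records."""
        batch = []
        for record in event["Records"]:
            # Kinesis payloads arrive base64-encoded.
            payload = base64.b64decode(record["kinesis"]["data"])
            batch.append({"Data": payload})

        # Firehose accepts at most 500 records per put_record_batch call.
        for i in range(0, len(batch), 500):
            firehose.put_record_batch(
                DeliveryStreamName=DELIVERY_STREAM,
                Records=batch[i:i + 500],
            )

If an invocation fails, Lambda retries the whole batch for you, which is the behavior the Connector Library would otherwise have to implement itself.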

Related

Buffer S3 object inputs

Does anyone know whether, other than Kinesis Firehose, there is any other AWS service that can catch the S3 object-created event? I am trying to do some analysis on VPC Flow Logs; the current setup is CloudWatch Logs -> Kinesis Firehose -> S3 -> Athena.
The problem is that Kinesis Firehose can only buffer up to 128 MB, which is too small for me.
Events from Amazon S3 can go to:
AWS Lambda functions
Amazon SNS topic
Amazon SQS queue
So, you could send the messages to an SQS queue and then have a regular process (every hour?) that retrieves many messages and writes them to a single file.
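A minimal sketch of that periodic job in Python with boto3, assuming the queue receives standard S3 event notifications (the queue URL and bucket names are made up):

    import json
    import time

    import boto3

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-s3-events"  # hypothetical
    OUTPUT_BUCKET = "example-aggregated"  # hypothetical

    def aggregate_once():
        """Drain pending S3 event messages and concatenate the referenced objects."""
        parts = []
        while True:
            resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
            messages = resp.get("Messages", [])
            if not messages:
                break
            for msg in messages:
                body = json.loads(msg["Body"])
                for rec in body.get("Records", []):
                    # S3 event notifications carry the bucket and key of each new object.
                    bucket = rec["s3"]["bucket"]["name"]
                    key = rec["s3"]["object"]["key"]
                    parts.append(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

        if parts:
            # Write everything collected this run as a single combined object.
            s3.put_object(
                Bucket=OUTPUT_BUCKET,
                Key="aggregated/%d.log" % int(time.time()),
                Body=b"".join(parts),
            )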
Alternatively, you could keep your current setup but run Amazon Athena on a regular basis to join multiple files by using CREATE TABLE AS. This selects from the existing files and stores the output in a new table (with a new location). You could even use it to transform the files into a format that is easier to query in Athena (e.g. Snappy-compressed Parquet). The hard part is ensuring that each input file is included in this concatenation process only once (possibly using SymlinkTextInputFormat).
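For the Athena route, the CREATE TABLE AS statement can be started on a schedule; here is a hedged sketch with boto3, where the database, tables, and bucket locations are all illustrative:

    import boto3

    athena = boto3.client("athena")

    # All names and locations below are illustrative.
    CTAS = """
    CREATE TABLE vpc_logs_compacted
    WITH (
        format = 'PARQUET',
        parquet_compression = 'SNAPPY',
        external_location = 's3://example-compacted/vpc_logs/'
    ) AS
    SELECT * FROM vpc_logs_raw
    """

    athena.start_query_execution(
        QueryString=CTAS,
        QueryExecutionContext={"Database": "example_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )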

What is the difference between Kinesis Streams and Kinesis Firehose?

Firehose is fully managed, whereas Streams must be managed manually.
If other people are aware of other major differences, please add them. I'm just learning.
Thanks.
Amazon Kinesis Data Firehose can send data to:
Amazon S3
Amazon Redshift
Amazon Elasticsearch Service
Splunk
To do the same thing with Amazon Kinesis Data Streams, you would need to write an application that consumes data from the stream and then connects to the destination to store data.
So, think of Firehose as a pre-configured streaming application with a few specific options. Anything outside of those options would require you to write your own code.
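The difference shows up in the producer API as well. A short sketch (both stream names are illustrative): a Firehose put is fire-and-forget because delivery is managed, while a Streams put needs a partition key and a consumer application of your own downstream.

    import boto3

    firehose = boto3.client("firehose")
    kinesis = boto3.client("kinesis")

    record = b'{"user": 1, "event": "click"}\n'

    # Firehose: one call, then the service buffers and delivers to S3/Redshift/etc.
    firehose.put_record(
        DeliveryStreamName="example-firehose",  # illustrative name
        Record={"Data": record},
    )

    # Streams: you choose a partition key, and you must also write and run
    # a consumer application to move the data anywhere.
    kinesis.put_record(
        StreamName="example-stream",  # illustrative name
        Data=record,
        PartitionKey="user-1",
    )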

Use AWS Lambda to read Kinesis and save to S3

I am very new to AWS. Up to now I have been able to send CSV data to Kinesis Streams using the AWS .NET SDK. Now I have to save this data to S3 from Lambda using the S3 Emitter (this is the most common way I found on many websites). When I create a Lambda function for this, it asks for Node.js or Java 8 code.
I don't understand what code needs to be uploaded or how to use the S3 Emitter code.
I cannot use Kinesis Firehose because the streaming data is going to EMR for processing.
Please help me here.
If there is any alternate way, please suggest it.
You need to write code that gets the events from the Kinesis stream and writes them to S3 (or, even easier, to Kinesis Firehose). This code should be in one of the programming languages currently supported in Lambda (JavaScript, Java, Python).
Here is a tutorial for reading from Kinesis: http://docs.aws.amazon.com/lambda/latest/dg/with-kinesis-example.html
It is relatively easy to read the events and batch them to S3, or even easier to write them to Firehose to get more optimized batches in S3 (larger, compressed, encrypted...).
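As a minimal sketch (in Python, so the S3 Emitter's Java code is not needed at all), a Lambda handler could concatenate one invocation's batch of CSV records into a single S3 object; the bucket and key scheme are assumptions:

    import base64
    import uuid

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-kinesis-archive"  # hypothetical bucket

    def handler(event, context):
        """Write one Lambda invocation's batch of Kinesis records as one S3 object."""
        lines = []
        for record in event["Records"]:
            # Kinesis payloads are base64-encoded; here each one is a CSV line.
            lines.append(base64.b64decode(record["kinesis"]["data"]))

        if lines:
            s3.put_object(
                Bucket=BUCKET,
                Key="incoming/%s.csv" % uuid.uuid4(),
                Body=b"".join(lines),
            )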

Stream data from EC2 web server to Redshift

We would like to stream data directly from an EC2 web server to Redshift. Do I need to use Kinesis? What is the best practice? I do not plan to do any special analysis on this data before storage. I would like a cost-effective solution (it might be costly to use DynamoDB as temporary storage before loading).
If cost is your primary concern, then the exact number of records per second combined with the record sizes becomes important.
If you are talking about a very low volume of messages, a custom app running on a t2.micro instance to aggregate the data is about as cheap as you can go, but it won't scale. The bigger downside is that you are responsible for monitoring, maintaining, and managing that EC2 instance.
The modern approach is to use a combination of Kinesis + Lambda + S3 + Redshift to have the data stream in with no EC2 instances to manage!
The approach is described in this blog post: A Zero-Administration Amazon Redshift Database Loader
What that blog post doesn't mention is that now, with API Gateway, if you do need any kind of custom authentication or data transformation, you can do it without an EC2 instance by using Lambda to broker the data into Kinesis.
This would look like:
API Gateway -> Lambda -> Kinesis -> Lambda -> S3 -> Redshift
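The first Lambda in that chain is just a thin broker. A hedged sketch, with the stream name and the authentication step as placeholders:

    import json

    import boto3

    kinesis = boto3.client("kinesis")
    STREAM = "example-ingest"  # illustrative stream name

    def handler(event, context):
        """Invoked by API Gateway; validates and forwards the payload to Kinesis."""
        body = json.loads(event["body"])

        # Any custom authentication or transformation would happen here.

        kinesis.put_record(
            StreamName=STREAM,
            Data=json.dumps(body).encode(),
            PartitionKey=str(body.get("user_id", "anonymous")),
        )
        return {"statusCode": 200, "body": json.dumps({"ok": True})}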
Redshift is best suited for batch loading using the COPY command. A typical pattern is to load data to either DynamoDB, S3, or Kinesis, then aggregate the events before using COPY to Redshift.
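The COPY step of that pattern looks roughly like this; a sketch using psycopg2 in which the cluster endpoint, table, IAM role, and S3 prefix are all made up:

    import psycopg2

    # All connection details below are illustrative.
    conn = psycopg2.connect(
        host="example.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="loader",
        password="...",  # placeholder
    )

    COPY_SQL = """
    COPY events
    FROM 's3://example-staging/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    CSV GZIP
    """

    # The connection context manager commits on success and rolls back on error.
    with conn, conn.cursor() as cur:
        cur.execute(COPY_SQL)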
See also this useful SO Q&A.
I implemented such a system last year inside my company using Kinesis and the Kinesis Connector Library. The connector is just a standalone Java application released by AWS; we run it on a fleet of Elastic Beanstalk servers as Kinesis consumers. The connector aggregates messages to S3 every so often, or after a given number of messages, and then triggers the COPY command to load the data into Redshift periodically. Since it runs on Elastic Beanstalk, you can tune the auto-scaling conditions to make sure the cluster grows and shrinks with the volume of data from the Kinesis stream.
BTW, AWS just announced Kinesis Firehose yesterday. I haven't played with it yet, but it definitely looks like a managed version of the Kinesis connector.

How to copy data in bulk from Kinesis -> Redshift

When I read about AWS Data Pipeline, the idea immediately struck me: produce statistics to Kinesis and create a job in Data Pipeline that consumes data from Kinesis and COPYs it to Redshift every hour. All in one go.
But it seems there is no node in Data Pipeline that can consume from Kinesis. So now I have two possible plans of action:
Create an instance where Kinesis data will be consumed and sent to S3, split by hour. Data Pipeline will copy from there to Redshift.
Consume from Kinesis and issue the COPY directly to Redshift on the spot.
What should I do? Is there no way to connect Kinesis to Redshift using AWS services only, without custom code?
It is now possible to do this without user code via a new managed service called Kinesis Firehose. It manages the desired buffering intervals, temporary uploads to S3, the upload to Redshift, error handling, and automatic throughput management.
That is already done for you!
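Setting it up is a one-time call (or a few console clicks). A hedged boto3 sketch, in which every name, ARN, URL, and credential is a placeholder; Firehose stages the data in S3 and then issues the COPY into Redshift for you:

    import boto3

    firehose = boto3.client("firehose")

    # Every identifier below is a placeholder.
    firehose.create_delivery_stream(
        DeliveryStreamName="stats-to-redshift",
        RedshiftDestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
            "ClusterJDBCURL": "jdbc:redshift://example.abc123.us-east-1.redshift.amazonaws.com:5439/analytics",
            "CopyCommand": {
                "DataTableName": "stats",
                "CopyOptions": "CSV GZIP",
            },
            "Username": "loader",
            "Password": "...",
            # Intermediate S3 staging area used before the COPY runs.
            "S3Configuration": {
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
                "BucketARN": "arn:aws:s3:::example-staging",
                "CompressionFormat": "GZIP",
            },
        },
    )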
If you use the Kinesis Connector Library, there is a built-in connector to Redshift:
https://github.com/awslabs/amazon-kinesis-connectors
Depending on the logic you have to process, the connector can be really easy to implement.
You can create and orchestrate a complete pipeline with InstantStack to read data from Kinesis, transform it, and push it into Redshift or S3.