Use AWS Lambda to read Kinesis and save to S3 - amazon-web-services

I am very new to AWS. Up to now I have been able to send CSV data to Kinesis streams using the AWS .NET SDK. Now I have to save this data to S3 using Lambda with the S3 Emitter (this is the most common approach I found on many websites). When I create a Lambda function for it, it asks for Node.js or Java 8 code.
I don't understand what code needs to be uploaded from here, or how to use the S3 Emitter code.
I cannot use Kinesis Firehose because the streaming data is going to EMR for processing.
Please help me here.
If there is any alternate way please suggest.

You need to write code that reads the events from the Kinesis stream and writes them to S3 (or, even easier, to Kinesis Firehose). This code should be in one of the programming languages currently supported by Lambda (JavaScript, Java, Python).
Here is a tutorial for reading from Kinesis: http://docs.aws.amazon.com/lambda/latest/dg/with-kinesis-example.html
It is relatively easy to read the events and batch them to S3, or even easier to write them to Firehose, which produces more optimized batches in S3 (larger, compressed, encrypted...).
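For example, a minimal Python sketch of such a Lambda handler could look like the one below (the bucket name and key prefix are placeholders, not something from the original question; Kinesis delivers record payloads base64-encoded):

```python
import base64
import time

import boto3

s3 = boto3.client("s3")
BUCKET = "my-raw-data-bucket"  # placeholder bucket name

def lambda_handler(event, context):
    # Kinesis record payloads arrive base64-encoded in the event
    lines = [
        base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        for record in event["Records"]
    ]
    # Write the whole batch as a single S3 object, keyed by time and request id
    key = f"raw/{int(time.time())}-{context.aws_request_id}.csv"
    s3.put_object(Bucket=BUCKET, Key=key, Body="\n".join(lines).encode("utf-8"))
    return {"records_written": len(lines)}
```

Each invocation produces one object per batch; Firehose would instead handle the batching and delivery for you.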

Related

How to store data into s3 bucket using an AWS SQS Queue

I have live streaming data coming my way from different sources. We have created an AWS SQS queue to receive that data. I was pushing this incoming data directly to DynamoDB using a Lambda function, but I have been asked to store the raw files in an S3 bucket first, so that we do not have any data loss and can later use the data for further transformations.
I have an AWS SQS queue and I want to store the messages from this queue in an S3 bucket in JSON or Parquet form. Or is there any alternative other than S3 for storing the raw data in AWS?
I tried searching over the net but couldn't find anything concrete. I am new to this, please help me solve this.
I tried searching over the net but couldn't find anything concrete.
That's correct, because there is no such AWS-provided feature or tool. You have to implement a custom solution for that yourself.
Since you are already using a Lambda function that reads from the queue, you can add code to that function to write the data to S3 before writing it to DynamoDB.
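As a rough sketch (assuming a Python Lambda triggered by the SQS queue; the bucket and table names are placeholders), the handler could archive each raw message to S3 before the existing DynamoDB write:

```python
import json

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("my-processed-table")  # placeholder table
RAW_BUCKET = "my-raw-messages-bucket"  # placeholder bucket

def lambda_handler(event, context):
    for record in event["Records"]:
        body = record["body"]
        # 1. Archive the raw message to S3 first so nothing is lost
        s3.put_object(
            Bucket=RAW_BUCKET,
            Key=f"raw/{record['messageId']}.json",
            Body=body.encode("utf-8"),
        )
        # 2. Then write the parsed item to DynamoDB as before
        table.put_item(Item=json.loads(body))
```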

Streaming Data From different Sources to AWS S3

I have different data sources and I need to publish their data to S3 in real time. I also need to process and validate the data before delivering it to S3 buckets. I know that AWS Kinesis Data Streams offers real-time data streaming and that I can process the data using AWS Lambda before sending it to S3. However, it is not clear to me whether we can use AWS Glue Streaming instead of AWS Kinesis Data Streams and AWS Lambda. I have seen some documentation about using AWS Glue Streaming to process real-time data on the fly and send it to S3. So, what are the real differences here? Is AWS Glue Streaming ETL a good choice for streaming and processing data in real time and storing it in S3?
A Kinesis data stream with a Lambda consumer will fit as long as the Lambda execution environment limits are sufficient (a minimal wiring sketch follows the list below):
15 mins execution time
Memory config
Concurrency limits
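Purely as an illustration, wiring a Lambda consumer onto a Kinesis stream comes down to one event source mapping call (the ARN, function name, and tuning values below are placeholders):

```python
import boto3

lambda_client = boto3.client("lambda")

# Attach an already deployed function as a consumer of the stream
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
    FunctionName="my-stream-consumer",
    StartingPosition="LATEST",
    BatchSize=500,
    MaximumBatchingWindowInSeconds=30,  # wait up to 30 s to build larger batches
)
```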
When going with a Glue consumer, your Glue jobs can run longer and also support Apache Spark for massively parallel processing.
You can also use Kinesis Firehose, which has native integrations to deliver data to S3, Elasticsearch, etc., and doesn't require any changes to the data. You can also have a Lambda function do minimal processing, intercepting the data before Firehose delivers it.
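If you go the Firehose-plus-Lambda route, the transformation function has to follow Firehose's record contract: return each record with its recordId, a result status, and re-encoded data. A sketch, assuming JSON payloads (the added flag just stands in for your own validation logic):

```python
import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        # Firehose hands each record to the transformation Lambda base64-encoded
        payload = json.loads(base64.b64decode(record["data"]))
        payload["validated"] = True  # placeholder for real processing/validation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(
                json.dumps(payload).encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```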

Writing to S3 via Kinesis Stream or Firehose

I have events that keep coming in which I need to put into S3. I am trying to evaluate whether I must use a Kinesis Stream or Firehose. I also want to wait a few minutes before writing to S3 so that the object is fairly full.
Based on my reading about Kinesis Data Streams, I have to create an analytics app which will then be used to invoke a Lambda. I will then have to use the Lambda to write to S3. Or can Kinesis Data Streams write directly to Lambda somehow? I could not find anything indicating that.
Firehose is not charged by the hour (while a stream is). So is Firehose a better option for me?
Or can Kinesis Data Streams write directly to Lambda somehow?
Data Streams can't write directly to S3. Firehose, however, can do this:
delivering real-time streaming data to destinations such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elasticsearch Service (Amazon ES), Splunk, and any custom HTTP endpoint or HTTP endpoints owned by supported third-party service providers, including Datadog, MongoDB, and New Relic.
What's more, Firehose allows you to buffer the records before writing them to S3. The writing can happen based on buffer size or time. In addition to that, you can process the records using a Lambda function before writing to S3.
Thus, collectively it seems that Firehose is better suited to your use case than Data Streams.
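For instance, the buffering behaviour is just configuration on the delivery stream; a sketch with boto3 (all names, ARNs, and thresholds here are placeholders) might look like:

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="events-to-s3",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-to-s3",
        "BucketARN": "arn:aws:s3:::my-events-bucket",
        "BufferingHints": {
            "SizeInMBs": 64,           # flush when the buffer reaches 64 MB...
            "IntervalInSeconds": 300,  # ...or after 5 minutes, whichever comes first
        },
        "CompressionFormat": "GZIP",
    },
)
```

Producers then call PutRecord/PutRecordBatch on the delivery stream and Firehose takes care of batching the objects into S3.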

When do I need to use Kinesis Data Streams together with Kinesis Firehose?

I want to build a use case where I do real-time analytics. I am not sure when it is necessary to use Kinesis Data Streams before Kinesis Firehose. The documentation says that Kinesis Firehose can get its data from Kinesis Data Streams, but the use cases are not clear.
https://aws.amazon.com/kinesis/data-firehose/faqs/?nc=sn&loc=5
So the benefit of using Kinesis Firehose to have data passed from Kinesis Data Streams is that it integrates directly with the following services: S3, Redshift, Elasticsearch Service, Splunk.
If you want your streamed data delivered to any of those endpoints, you can pass it to Firehose and have it do the work for you.
Traditionally you'd write your own consumer, which would be another piece of code to develop and maintain when it breaks. Using Firehose, you can rely on AWS to do this part for you.
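As a sketch of what that wiring looks like (ARNs and names are placeholders), a Firehose delivery stream can be created with the existing data stream as its source:

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="stream-to-s3",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-read-stream",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-to-s3",
        "BucketARN": "arn:aws:s3:::my-delivery-bucket",
    },
)
```

From that point on, everything written to the data stream is delivered to the bucket without any consumer code of your own.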

Stream new files in S3 bucket into Kinesis

I am trying to evaluate using Kinesis for stream processing of log files. There is a separate process that uploads new logs into an S3 bucket - I can't touch that process. I want to know if there's a good way to stream new files that show up in the S3 log bucket into a Kinesis stream for processing. All the documentation I've found so far covers using S3 as an output for the stream.
My current solution is to have a machine that constantly polls S3 for new files, downloads the new file to the local machine and streams it in using the Log4j appender. This seems inefficient. Is there a better way?
I realize this is a really old question, but have a look at AWS Lambda. It's perfect for your use case, as illustrated here.
In your case, you would set up the S3 event such that each new object added to the bucket invokes your Lambda function. In the Lambda function you then write a few lines of code that read in the file and send the contents to the PutRecord (or PutRecords, for batches) method of the Kinesis stream.
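A minimal Python sketch of such a function (the stream name is a placeholder, and real code should also check the PutRecords response for partial failures):

```python
import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")
STREAM_NAME = "log-processing-stream"  # placeholder stream name

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Read the newly uploaded log file
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        lines = [line for line in body.splitlines() if line]
        # PutRecords accepts at most 500 records per call, so send in chunks
        for i in range(0, len(lines), 500):
            kinesis.put_records(
                StreamName=STREAM_NAME,
                Records=[
                    {"Data": line.encode("utf-8"), "PartitionKey": key}
                    for line in lines[i:i + 500]
                ],
            )
```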
Not only will this work for your use case, but it's also awesome since it checks off a few buzzwords: "serverless" and "realtime"!