Is there a way to recover Kinesis Data if Lambda function fails? - amazon-web-services

Context: I'm scraping data from a third-party source using a Lambda function (this Lambda is invoked by a CloudWatch EventBridge rule, so it runs asynchronously), then writing that data to Kinesis Firehose, which writes it to an S3 bucket. This allows for data buffering and ensures the data reaches S3 even if an S3 write fails (Kinesis holds on to the data and retries the writes). I'm scraping data from the third-party source in chunks (meaning I make multiple HTTP calls) and writing them to the Firehose as I go.
Question: If my Lambda fails midway while getting data from the third-party source, is there a way to re-invoke the Lambda and poll Kinesis to see what data already exists there, so that I'm not rewriting the same data to Kinesis? Essentially, I want the Lambda to pick up fetching data from the point where it failed.

If my Lambda fails midway while getting data from the third-party source, is there a way that I can re-invoke the Lambda and poll Kinesis to see what data exists there to ensure that I'm not rewriting the same data to Kinesis?
No. Kinesis Firehose is not Kinesis Data Streams, and you can't read from Firehose the way you can from Data Streams. I think the easiest way for you would be to set up DynamoDB (or any equivalent) to store some kind of "bookmark" telling you what you have recently processed.
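A minimal sketch of that bookmark idea in Python with boto3, assuming a hypothetical DynamoDB table named scrape-checkpoints with a string partition key job_id and a numeric last_chunk attribute; the table, stream, and function names are illustrative, not anything the answer prescribes:

    import boto3

    # Hypothetical checkpoint table: partition key "job_id" (string),
    # attribute "last_chunk" (number). Names here are illustrative only.
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("scrape-checkpoints")
    firehose = boto3.client("firehose")

    TOTAL_CHUNKS = 20                       # placeholder: however many chunks the scrape produces
    DELIVERY_STREAM = "my-delivery-stream"  # placeholder Firehose name


    def fetch_chunk(index):
        """Placeholder for the existing HTTP call to the third-party source."""
        raise NotImplementedError


    def load_checkpoint(job_id):
        """Return the index of the last chunk that was written, or -1 if none."""
        item = table.get_item(Key={"job_id": job_id}).get("Item")
        return int(item["last_chunk"]) if item else -1


    def save_checkpoint(job_id, chunk_index):
        """Record that this chunk made it into Firehose."""
        table.put_item(Item={"job_id": job_id, "last_chunk": chunk_index})


    def handler(event, context):
        job_id = "daily-scrape"                 # one bookmark per scrape job
        start = load_checkpoint(job_id) + 1     # resume after the last good chunk
        for chunk_index in range(start, TOTAL_CHUNKS):
            data = fetch_chunk(chunk_index)
            firehose.put_record(
                DeliveryStreamName=DELIVERY_STREAM,
                Record={"Data": data.encode("utf-8")},
            )
            save_checkpoint(job_id, chunk_index)

Because the checkpoint is saved only after put_record succeeds, a re-invoked Lambda resumes at the first chunk that never reached Firehose; a chunk whose checkpoint update failed after the write could still be sent twice, so truly duplicate-free delivery would need idempotent handling downstream.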

Related

Writing to S3 via Kinesis Stream or Firehose

I have events that keep coming in, which I need to put to S3. I am trying to evaluate whether I must use Kinesis Data Streams or Firehose. I also want to wait a few minutes before writing to S3 so that each object is fairly full.
Based on my reading of Kinesis Data Streams, I would have to create an analytics app which would then be used to invoke a Lambda, and I would then have to use the Lambda to write to S3. Or can Kinesis Data Streams write to a Lambda directly somehow? I could not find anything indicating that.
Firehose is not charged by the hour (while a stream is). So is Firehose a better option for me?
Or can Kinesis Data Streams write to a Lambda directly somehow?
Data Streams can't write directly to S3. Firehose, on the other hand, is built for this:
delivering real-time streaming data to destinations such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elasticsearch Service (Amazon ES), Splunk, and any custom HTTP endpoint or HTTP endpoints owned by supported third-party service providers, including Datadog, MongoDB, and New Relic.
What's more, Firehose lets you buffer the records before writing them to S3; the write happens based on buffer size or time, whichever threshold is reached first. In addition, you can process the records with a Lambda function before they are written to S3.
Thus, collectively, it seems that Firehose is better suited to your use case than Data Streams.
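To make the buffering point concrete, here is a rough boto3 sketch of creating a delivery stream that flushes to S3 at 5 MiB or 300 seconds, whichever comes first; the stream name, bucket ARN, and role ARN are placeholders:

    import boto3

    firehose = boto3.client("firehose")

    # Placeholder names and ARNs: substitute your own bucket and IAM role.
    firehose.create_delivery_stream(
        DeliveryStreamName="events-to-s3",
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration={
            "BucketARN": "arn:aws:s3:::my-events-bucket",
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            "BufferingHints": {
                "SizeInMBs": 5,            # flush once the buffer holds 5 MiB...
                "IntervalInSeconds": 300,  # ...or after 5 minutes, whichever comes first
            },
        },
    )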

When do I need to use Kinesis Data Streams together with Kinesis Firehose?

I want to build a use case where I do real-time analytics. I am not sure when it is necessary to put Kinesis Data Streams in front of Kinesis Firehose. The documentation says that Kinesis Firehose can get its data from Kinesis Data Streams, but the use cases are not clear.
https://aws.amazon.com/kinesis/data-firehose/faqs/?nc=sn&loc=5
So the benefit of having Kinesis Firehose read data from Kinesis Data Streams is that Firehose integrates directly with the following services: S3, Redshift, Elasticsearch Service, Splunk.
If you want your streamed data delivered to any of those endpoints, passing it through Firehose lets Firehose do that work for you.
Traditionally you'd write your own consumer, which is another piece of code to develop and to maintain when it breaks. Using Firehose, you can rely on AWS to do this part for you.
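As a rough illustration of that wiring, a delivery stream can be created with an existing Data Stream as its source, so Firehose acts as the managed consumer; all names and ARNs below are placeholders:

    import boto3

    firehose = boto3.client("firehose")

    # Placeholder names and ARNs for the source stream, IAM roles, and destination bucket.
    firehose.create_delivery_stream(
        DeliveryStreamName="events-archive",
        DeliveryStreamType="KinesisStreamAsSource",
        KinesisStreamSourceConfiguration={
            "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/events",
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-read-stream-role",
        },
        S3DestinationConfiguration={
            "BucketARN": "arn:aws:s3:::my-archive-bucket",
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-write-s3-role",
        },
    )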

Configure Firehose so it writes only one record per S3 object?

I'm using a Firehose delivery stream to write JSONs to S3. These JSONs represent calls. The stream will often receive a new version of a JSON that brings new info about the represented call.
I would like my Firehose to write each JSON record to a separate S3 object, not group them together as it seems to do by default. Each JSON would be written to an S3 key that identifies the call, so that when a new version of a JSON shows up, Firehose replaces the previous version in S3. Is this possible?
I see that I can set the buffer size that triggers writing to S3, but can I explicitly configure my Firehose stream to write exactly one S3 object per record?
There's no Redshift involved.
This is not possible with Amazon Kinesis Data Firehose. It is a simplified service that only has a few configuration options.
Instead, you could use Amazon Kinesis Data Streams:
Send data to the stream
Create an AWS Lambda function that will be triggered whenever data is received by the stream
Code the Lambda function to write the data to the appropriate Amazon S3 object
See: Using AWS Lambda with Amazon Kinesis - AWS Lambda
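A minimal sketch of step 3, assuming each record is a JSON document carrying a call_id field that becomes the S3 key; the field name and bucket are illustrative. Writing every version of a call to the same key is what makes newer versions overwrite older ones:

    import base64
    import json

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-calls-bucket"  # placeholder bucket name


    def handler(event, context):
        # A Kinesis trigger delivers a batch of records; write each one to its own object.
        for record in event["Records"]:
            payload = base64.b64decode(record["kinesis"]["data"])
            call = json.loads(payload)
            # Same call id -> same key, so a newer version of the JSON replaces the old one.
            key = f"calls/{call['call_id']}.json"
            s3.put_object(Bucket=BUCKET, Key=key, Body=payload)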

How to replay in a stream data pushed to S3 from AWS Firehose?

There are plenty of examples of how data is stored by AWS Firehose to an S3 bucket and, in parallel, passed to some processing app.
But I can't find anything about good practice for replaying this data from the S3 bucket in case the processing app crashed and we need to supply it with historical data, which we have in S3 but which is no longer in the Firehose.
I can think of replaying it with Firehose or Lambda, but:
Kinesis Firehose cannot consume from an S3 bucket
Lambda would need to deserialize the .parquet files to send them to Firehose or a Kinesis Data Stream, and I'm confused by this implicit deserialization, because Firehose serialized the data explicitly.
Or maybe there is some other way to put data back from s3 to stream which I completely miss?
EDIT: Moreover, if we run a Lambda to push records to the stream, it will probably have to run for more than 15 minutes. So another option is a script that does this, running on a separate EC2 instance. But this method of extracting data from S3 looks so much more complicated than storing it there with Firehose that it makes me think there should be some easier approach.
The problem that tripped me up was actually that I expected some more advanced serialization than just converting to JSON (Kafka supports Avro, for example).
Regarding replaying records from the S3 bucket: this part of the solution is significantly more complicated than the part needed for archiving records. While we can archive the stream with out-of-the-box Firehose functionality, replaying it requires two Lambda functions and two streams.
Lambda 1 (pushes filenames to a stream)
Lambda 2 (triggered for every filename in the first stream, pushes records from the files to a second stream)
The first Lambda is triggered manually, scans through all the S3 bucket files, and writes their names to the first stream. The second Lambda is triggered by every event in the stream of file names, reads all the records in the file, and sends them to the final stream, from which they can be consumed by Kinesis Data Analytics or another Lambda.
This solution expects that there are multiple files generated per day, and there are multiple records in every file.
Similar to this solution, but the destination in my case is Kinesis instead of the DynamoDB used in the article.
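A condensed sketch of the two-Lambda replay described above, assuming for brevity that the archived objects are newline-delimited JSON rather than Parquet (Parquet files would additionally need a reader such as pyarrow); bucket and stream names are illustrative:

    import base64

    import boto3

    s3 = boto3.client("s3")
    kinesis = boto3.client("kinesis")

    ARCHIVE_BUCKET = "my-firehose-archive"   # placeholder: bucket Firehose archived into
    FILENAME_STREAM = "replay-filenames"     # placeholder: first stream (object keys)
    RECORD_STREAM = "replay-records"         # placeholder: final stream consumers read


    def lambda_1(event, context):
        """Triggered manually: push every archived object key into the filename stream."""
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=ARCHIVE_BUCKET):
            for obj in page.get("Contents", []):
                kinesis.put_record(
                    StreamName=FILENAME_STREAM,
                    Data=obj["Key"].encode("utf-8"),
                    PartitionKey=obj["Key"],
                )


    def lambda_2(event, context):
        """Triggered by the filename stream: re-publish every record from each file."""
        for record in event["Records"]:
            key = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
            body = s3.get_object(Bucket=ARCHIVE_BUCKET, Key=key)["Body"].read()
            for line in body.splitlines():   # assumes newline-delimited JSON records
                if line.strip():
                    kinesis.put_record(
                        StreamName=RECORD_STREAM,
                        Data=line,
                        PartitionKey=key,
                    )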

Write to a specific folder in S3 bucket using AWS Kinesis Firehose

I would like to route data sent to Kinesis Firehose based on the content of the data. For example, if I sent this JSON data:
    {
        "name": "John",
        "id": 345
    }
I would like to filter the data based on id and send it to a subfolder of my s3 bucket like: S3://myS3Bucket/345_2018_03_05. Is this at all possible with Kinesis Firehose or AWS Lambda?
The only way I can think of right now is to create a Kinesis stream for every one of my possible IDs, point them all at the same bucket, and then send my events to those streams from my application, but I would like to avoid that since there are many possible IDs.
You probably want to use an S3 event notification that fires each time Firehose places a new file in your S3 bucket (a PUT). The notification should invoke a custom Lambda function that you write, which reads the contents of the S3 file, splits it up, and writes the pieces out to the separate locations. Keep in mind that each S3 file is likely going to contain many records, not just one.
https://aws.amazon.com/blogs/aws/s3-event-notification/
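A rough sketch of that pattern, assuming the Firehose output objects are newline-delimited JSON and that per-id prefixes in the same bucket are acceptable; the id field and key layout are illustrative:

    import json
    from datetime import datetime, timezone
    from urllib.parse import unquote_plus

    import boto3

    s3 = boto3.client("s3")


    def handler(event, context):
        # Fired by the S3 PUT notification for each object Firehose delivers.
        # Scope the notification to the Firehose prefix so these writes don't re-trigger us.
        for notification in event["Records"]:
            bucket = notification["s3"]["bucket"]["name"]
            key = unquote_plus(notification["s3"]["object"]["key"])
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

            today = datetime.now(timezone.utc).strftime("%Y_%m_%d")
            # A Firehose object usually holds many records, so split and re-route each one.
            for i, line in enumerate(l for l in body.splitlines() if l.strip()):
                record = json.loads(line)
                out_key = f"{record['id']}_{today}/{key.rsplit('/', 1)[-1]}_{i}.json"
                s3.put_object(Bucket=bucket, Key=out_key, Body=line.encode("utf-8"))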
This is not possible out of the box, but here are some ideas...
You can write a Data Transformation Lambda that Amazon Kinesis Firehose triggers for every record. You could code the Lambda to save the data to a specific file in S3, rather than having Firehose do it. However, you'd miss out on the record-aggregation features of Firehose.
You could use Amazon Kinesis Analytics to look at the record and send the data to a different output stream based on the content. For example, you could have a separate Firehose stream per delivery channel, with Kinesis Analytics queries choosing the destination.
If you use a Lambda to save the data, you would end up with duplicate data in S3: one copy stored by the Lambda and another stored by Firehose, since the transformation Lambda returns the data to Firehose. Unless there is a way to prevent the transformed data from the Lambda from being re-added to the stream; I am not aware of one.