API Gateway and IoT Core to Kinesis Data Stream

I want to build a Kinesis data stream that will trigger my Lambda function.
The function needs inputs that will come from IoT Core (sensor data) and an API Gateway (weather data). The two inputs also have to share the same timestamp, meaning the temperature sensor reading at 12:00:00 has to be paired with the weather reading at 12:00:00.
I know that I can use an IoT rule to push data to a Kinesis data stream, and I can use a REST API to push records to a stream as well.
My questions are:
Can I put both on the same stream, so the Lambda function will have all the inputs it needs?
If yes, how can I pair the two records based on their timestamp values?
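For illustration, one way the consuming Lambda could do the pairing, assuming both producers include shared timestamp and source fields, and using a hypothetical DynamoDB table named pairs to buffer whichever half of a pair arrives first:

import base64, json
import boto3

# Hypothetical DynamoDB table keyed by 'timestamp', used to buffer
# the first half of each pair until the matching record arrives
table = boto3.resource('dynamodb').Table('pairs')

def handler(event, context):
    for record in event['Records']:
        data = json.loads(base64.b64decode(record['kinesis']['data']))
        ts = data['timestamp']    # assumed present in both sources
        source = data['source']   # assumed: 'sensor' or 'weather'
        item = table.get_item(Key={'timestamp': ts}).get('Item')
        if item and item['source'] != source:
            # The other half already arrived: process the pair
            process_pair(json.loads(item['payload']), data)
        else:
            table.put_item(Item={'timestamp': ts, 'source': source,
                                 'payload': json.dumps(data)})

def process_pair(first, second):
    print('paired records for', first['timestamp'])  # placeholder logic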

Related

Transfer Data from AWS Timestream to DynamoDB

I am working in the IoT space with two databases: AWS Timestream and AWS DynamoDB.
My sensor data comes into Timestream via AWS IoT Core and MQTT. I set up a rule that transfers the incoming data directly into Timestream.
What I need to do now is run some operations on the data and save the results of these operations in DynamoDB.
I know DynamoDB has a feature called DynamoDB Streams. Is there a solution like Streams for Timestream as well? Or does anybody have an idea how I can automatically transfer the results of the operations from Timestream to DynamoDB?
Timestream does not have Change Data Capture capabilities.
The best thing to do is to write the data into DynamoDB from wherever you are doing your operations on Timestream. For example, if you are using AWS Glue to analyze your Timestream data, you can sink the results directly from Glue using the DynamoDB sink.
Timestream also has the concept of a Scheduled Query. When a query has run, you can be notified via an SNS topic. You could connect a Lambda to that SNS topic to retrieve the query result and store it in DynamoDB.
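A rough sketch of that Lambda, assuming the scheduled query writes into a results table, and using hypothetical database, table, and key names:

import boto3

tsq = boto3.client('timestream-query')
ddb = boto3.resource('dynamodb').Table('sensor-aggregates')  # hypothetical target table

def handler(event, context):
    # Invoked by the SNS notification for the scheduled query run;
    # read the results table and copy each row into DynamoDB
    result = tsq.query(QueryString='SELECT * FROM "mydb"."query_results"')  # hypothetical names
    columns = [c['Name'] for c in result['ColumnInfo']]
    for row in result['Rows']:
        item = {col: datum['ScalarValue']
                for col, datum in zip(columns, row['Data'])
                if 'ScalarValue' in datum}
        ddb.put_item(Item=item)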

Is there a way to recover Kinesis Data if Lambda function fails?

Context: I'm scraping data from a third-party source using a Lambda function (this Lambda function is invoked by an EventBridge rule, so it's asynchronous), then writing that data to Kinesis Firehose, which writes it to an S3 bucket. This allows for data buffering and ensures that the data is written to S3 regardless of S3 connection failures (since Kinesis will hold on to the data and retry the writes). I'm scraping data from the third-party source in chunks (meaning that I make multiple HTTP calls) and simultaneously writing them to the Firehose.
Question: If my Lambda fails midway while getting data from the third-party source, is there a way that I can re-invoke the Lambda and poll Kinesis to see what data exists there, to ensure that I'm not rewriting the same data to Kinesis? Essentially, I want the Lambda to pick up fetching data from the same point where it failed.
If my Lambda fails midway while getting data from the third-party source, is there a way that I can re-invoke the Lambda and poll Kinesis to see what data exists there, to ensure that I'm not rewriting the same data to Kinesis?
No. Kinesis Firehose is not Kinesis Data Streams, and you can't read from Firehose as you do from Data Streams. I think the easiest way for you would be to set up a DynamoDB table (or any equivalent) that stores some kind of "bookmark" allowing you to see what you have recently processed.
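A minimal sketch of that bookmark pattern, with a hypothetical scrape-bookmarks table, delivery stream name, and chunked fetch:

import boto3

ddb = boto3.resource('dynamodb').Table('scrape-bookmarks')  # hypothetical table
firehose = boto3.client('firehose')
TOTAL_CHUNKS = 10  # hypothetical number of HTTP calls per run

def fetch_chunk(chunk):
    return b'...'  # placeholder for the real third-party HTTP call

def handler(event, context):
    job_id = 'daily-scrape'  # hypothetical job identifier
    bookmark = ddb.get_item(Key={'job_id': job_id}).get('Item', {})
    for chunk in range(int(bookmark.get('next_chunk', 0)), TOTAL_CHUNKS):
        firehose.put_record(DeliveryStreamName='scrape-stream',  # hypothetical
                            Record={'Data': fetch_chunk(chunk)})
        # Persist progress only after the write succeeds, so a crash
        # resumes from the first unwritten chunk
        ddb.put_item(Item={'job_id': job_id, 'next_chunk': chunk + 1})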

How do I setup AWS Kinesis Data Stream which gets data from an existing API?

I currently have a GET API endpoint which gives me realtime data of an object. I want to set up AWS Kinesis Data Stream such that it requests data from the API every 5 seconds and directs the output to AWS RDS. How do I get kinesis to query the API every 5 seconds?
How do I get kinesis to query the API every 5 seconds?
You can't, as Kinesis does not have such functionality. You have to implement it yourself, for example with a Lambda function that queries your endpoint every 5 seconds and injects records into the stream.
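A sketch of that approach: EventBridge schedules fire at most once per minute, so one pattern is to trigger the Lambda every minute and loop inside the invocation to hit the 5-second interval (the endpoint URL and stream name below are placeholders):

import time, urllib.request
import boto3

kinesis = boto3.client('kinesis')

def handler(event, context):
    # One invocation covers a minute: poll the API twelve times, 5 s apart
    for _ in range(12):
        with urllib.request.urlopen('https://example.com/realtime') as resp:  # placeholder
            payload = resp.read()
        kinesis.put_record(StreamName='api-poller-stream',  # placeholder
                           Data=payload,
                           PartitionKey='object-1')
        time.sleep(5)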

Kinesis Analytics Destination Guidance: Lambda vs Kinesis Stream to Lambda

After Kinesis Analytics does its job, the next step is to send that information off to a destination. AWS currently offers three destination choices:
Kinesis stream
Kinesis Firehose delivery stream
AWS Lambda function
For my use case, Kinesis Firehose delivery stream is not what I want so I am left with:
Kinesis stream
AWS Lambda function
If I set the destination to a Kinesis Stream, I would then attach a Lambda to that stream to process the records.
AWS also offers the ability to set the destination to a Lambda, bypassing the Kinesis Stream step of this process. In doing some digging for docs I found this:
Using a Lambda Function as Output
Specifically in those docs under Lambda Output Invocation Frequency it says:
If records are emitted to the destination in-application stream within the data analytics application as a continuous query or a sliding window, the AWS Lambda destination function is invoked approximately once per second.
My Kinesis Analytics output qualifies under this scenario. So I can assume that my Lambda will be invoked "approximately once per second".
I'm trying to understand the difference between using these 2 destinations as it pertains to using a Lambda.
Using AWS Lambda with Kinesis states that:
You can subscribe Lambda functions to automatically read batches of records off your Kinesis stream and process them if records are detected on the stream. AWS Lambda then polls the stream periodically (once per second) for new records.
So it sounds like the invocation interval is the same in either case: approximately one second.
So I think the guidance is:
If the next stage in the pipeline only needs one consumer, then use the AWS Lambda function destination. If, however, you need multiple different consumers to do different things with the same data sent to the destination, then a Kinesis Stream is more appropriate.
Is this a correct assumption on how to choose a destination? Again, for my use case I am excluding the Kinesis Firehose delivery stream.
If the next stage in the pipeline only needs one consumer, then use the AWS Lambda function destination. If, however, you need multiple different consumers to do different things with the same data sent to the destination, then a Kinesis Stream is more appropriate.
• I would always use Kinesis Stream with one shard and batch size = 1 (for example) if I wanted the items to be consumed one by one with no concurrency.
If there are multiple consumers, increase the number of shards; one Lambda is launched in parallel for each shard when there are items to process. If it makes sense, also increase the batch size.
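For example, wiring the stream to a Lambda with batch size 1 via boto3 might look like this (the ARN and function name are placeholders):

import boto3

lambda_client = boto3.client('lambda')

# One record per invocation; concurrency is bounded by the shard count
lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:kinesis:us-east-1:123456789012:stream/my-stream',  # placeholder
    FunctionName='my-consumer-function',  # placeholder
    StartingPosition='LATEST',
    BatchSize=1,
)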
But read the highlighted phrase again:
If, however, you need multiple different consumers to do different things with the same data sent to the destination, then a Kinesis Stream is more appropriate.
If you have one or more producers and many consumers of exactly the same item, I guess you need to use SNS. The producer writes the item to one topic; then all the Lambdas listening to the topic will process that item.
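A minimal sketch of the producer side of that pattern (the topic ARN is a placeholder); every Lambda subscribed to the topic then receives its own copy of the item:

import json
import boto3

sns = boto3.client('sns')

# Each subscriber to this topic gets an independent copy of the message
sns.publish(
    TopicArn='arn:aws:sns:us-east-1:123456789012:item-fanout',  # placeholder
    Message=json.dumps({'id': 42, 'payload': 'example item'}),
)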
If this does not answer your question, please clarify it. There is a little ambiguity.

How to fanout an AWS kinesis stream?

I'd like to fan out/chain/replicate an input AWS Kinesis stream to N new Kinesis streams, so that each record written to the input stream will appear in each of the N streams.
Is there an AWS service or an open source solution?
I prefer not to write code to do that if there's a ready-made solution. AWS Kinesis Firehose is not a solution because it can't output to Kinesis. Perhaps an AWS Lambda solution, if that won't be too expensive to run?
There are two ways you could accomplish fan-out of an Amazon Kinesis stream:
Use Amazon Kinesis Analytics to copy records to additional streams
Trigger an AWS Lambda function to copy records to another stream
Option 1: Using Amazon Kinesis Analytics to fan-out
You can use Amazon Kinesis Analytics to generate a new stream from an existing stream.
From the Amazon Kinesis Analytics documentation:
Amazon Kinesis Analytics applications continuously read and process streaming data in real-time. You write application code using SQL to process the incoming streaming data and produce output. Then, Amazon Kinesis Analytics writes the output to a configured destination.
Fan-out is mentioned in the Application Code section:
You can also write SQL queries that run independent of each other. For example, you can write two SQL statements that query the same in-application stream, but send output into different in-application streams.
I managed to implement this as follows:
Created three streams: input, output1, output2
Created two Amazon Kinesis Analytics applications: copy1, copy2
The Amazon Kinesis Analytics SQL application looks like this:
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM"
(log VARCHAR(16));
CREATE OR REPLACE PUMP "COPY_PUMP1" AS
INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM "log" FROM "SOURCE_SQL_STREAM_001";
This code creates a pump (think of it as a continual select statement) that selects from the input stream and outputs to the output1 stream. I created another identical application that outputs to the output2 stream.
To test, I sent data to the input stream:
#!/usr/bin/env python
import json, time
from boto import kinesis

kinesis = kinesis.connect_to_region("us-west-2")

# Send a numbered record to the 'input' stream every two seconds
i = 0
while True:
    data = {}
    data['log'] = 'Record ' + str(i)
    i += 1
    print(data)
    kinesis.put_record("input", json.dumps(data), "key")
    time.sleep(2)
I let it run for a while, then displayed the output using this code:
from boto import kinesis

kinesis = kinesis.connect_to_region("us-west-2")

# Read the first five records from the only shard of 'output1',
# starting from the oldest available record (TRIM_HORIZON)
iterator = kinesis.get_shard_iterator('output1', 'shardId-000000000000', 'TRIM_HORIZON')['ShardIterator']
records = kinesis.get_records(iterator, 5)
print([r['Data'] for r in records['Records']])
The output was:
[u'{"LOG":"Record 0"}', u'{"LOG":"Record 1"}', u'{"LOG":"Record 2"}', u'{"LOG":"Record 3"}', u'{"LOG":"Record 4"}']
I ran it again for output2 and the identical output was shown.
Option 2: Using AWS Lambda
If you are fanning-out to many streams, a more efficient method might be to create an AWS Lambda function:
Triggered by Amazon Kinesis stream records
That writes records to multiple Amazon Kinesis 'output' streams
You could even have the Lambda function self-discover the output streams based on a naming convention (e.g. any stream named app-output-*).
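A minimal sketch of such a function, with two fixed output streams for brevity (the naming-convention discovery is left out; the stream names are hypothetical):

import base64
import boto3

kinesis = boto3.client('kinesis')
OUTPUT_STREAMS = ['app-output-1', 'app-output-2']  # hypothetical names

def handler(event, context):
    # Re-publish every incoming record to each output stream,
    # preserving the original partition keys
    records = [{'Data': base64.b64decode(r['kinesis']['data']),
                'PartitionKey': r['kinesis']['partitionKey']}
               for r in event['Records']]
    for stream in OUTPUT_STREAMS:
        kinesis.put_records(StreamName=stream, Records=records)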
There is a GitHub repo from AWS Labs providing fan-out using Lambda: https://github.com/awslabs/aws-lambda-fanout . Also read "Transforming a synchronous Lambda invocation into an asynchronous one" at https://medium.com/retailmenot-engineering/building-a-high-throughput-data-pipeline-with-kinesis-lambda-and-dynamodb-7d78e992a02d , which is critical to building a truly asynchronous processing pipeline.
There are two AWS-native solutions for fanning out Kinesis streams that don't require AWS Firehose or AWS Lambda.
Similar to Kafka consumer groups, Kinesis has the application name. Every consumer of the stream can provide a unique application name. If two consumers have the same application name, messages are distributed between them. To fan out the stream, give a different application name to each consumer that should receive the same messages from the stream. Under the hood, Kinesis creates a new DynamoDB table for each new application name to keep track of each consumer, so that they can consume messages at different rates, etc.
Use Kinesis Enhanced Fan-Out for higher throughput (up to 2MiB per second) and this does not count towards your global read limit. At the time of writing, there is a limit of 20 "enhanced fan-out" consumers per stream.
One caveat, as far as I am aware, with these two options is that you need to use the Kinesis Client Library (KCL), not the raw AWS SDK.
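For reference, registering an enhanced fan-out consumer with the raw boto3 SDK looks roughly like this (the ARNs are placeholders; checkpointing and shard discovery, which the KCL handles for you, are omitted):

import boto3

kinesis = boto3.client('kinesis')
stream_arn = 'arn:aws:kinesis:us-east-1:123456789012:stream/my-stream'  # placeholder

# The consumer must reach ACTIVE status before subscribing
consumer = kinesis.register_stream_consumer(StreamARN=stream_arn,
                                            ConsumerName='my-consumer')['Consumer']

response = kinesis.subscribe_to_shard(
    ConsumerARN=consumer['ConsumerARN'],
    ShardId='shardId-000000000000',
    StartingPosition={'Type': 'LATEST'},
)
for event in response['EventStream']:
    print(event['SubscribeToShardEvent']['Records'])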