AWS DynamoDB triggers let us capture all changes to a DynamoDB table. Lambda assigns each Lambda function to a shard of the DynamoDB stream.
I want to retrieve the shard ID inside the Lambda function to keep some data processing tasks consistent, but I can't find a way to get it. Has anyone managed to do this?
AWS Lambda abstracts the details of shard processing and serialization away from you. The only way to process a stream and be aware of the shard ID at the same time is to consume the stream yourself using the KCL and the DynamoDB Streams adapter.
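For reference, here is a minimal sketch of what a Lambda handler does receive for a DynamoDB Streams trigger. The field names follow the documented event shape; the helper name `summarize_record` and all sample values are made up for illustration. None of the fields carries a shard ID.

```python
def summarize_record(record):
    """Pick out the identifying fields Lambda exposes for one stream record."""
    body = record["dynamodb"]
    return {
        "event_name": record["eventName"],          # INSERT, MODIFY or REMOVE
        "stream_arn": record["eventSourceARN"],     # identifies the stream, not the shard
        "sequence_number": body["SequenceNumber"],  # ordering token of the stream record
        "keys": body["Keys"],                       # primary key of the modified item
    }


def handler(event, context):
    return [summarize_record(record) for record in event["Records"]]


# Hypothetical sample event, trimmed to the fields used above.
SAMPLE_EVENT = {
    "Records": [
        {
            "eventID": "1",
            "eventName": "INSERT",
            "eventSource": "aws:dynamodb",
            "eventSourceARN": "arn:aws:dynamodb:us-east-1:123456789012:table/ExampleTable/stream/2024-01-01T00:00:00.000",
            "dynamodb": {
                "Keys": {"Id": {"N": "101"}},
                "SequenceNumber": "111",
                "StreamViewType": "KEYS_ONLY",
            },
        }
    ]
}
```

If per-shard awareness is a hard requirement, this is the point where the KCL route mentioned above becomes necessary.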
Related
I am working in the IoT space with two databases: Amazon Timestream and Amazon DynamoDB.
My sensor data arrives in Timestream via AWS IoT Core and MQTT. I set up a rule that grants permission to write the incoming data directly into Timestream.
What I need to do now is run some operations on the data and save the results in DynamoDB.
I know DynamoDB has a feature called DynamoDB Streams. Is there something like Streams in Timestream as well? Or does anybody have an idea how I can automatically transfer the results of the operations from Timestream to DynamoDB?
Timestream does not have Change Data Capture capabilities.
The best thing to do is to write the data into DynamoDB from wherever you run your operations on Timestream. For example, if you use AWS Glue to analyze your Timestream data, you can write the results directly from Glue using the DynamoDB sink.
Timestream has the concept of scheduled queries. When a scheduled query has run, you can be notified via an SNS topic. You could subscribe a Lambda function to that SNS topic to retrieve the query result and store it in DynamoDB.
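A rough sketch of the SNS-triggered Lambda idea, under several assumptions: the SNS message body is assumed to be JSON, `fetch_rows` is a hypothetical callable that turns the notification into result rows (for example by querying Timestream), and `table` is anything with a `put_item(Item=...)` method, such as a boto3 DynamoDB `Table` resource. Only the SNS envelope (`Records[].Sns.Message`) is the standard Lambda event shape.

```python
import json
from decimal import Decimal


def to_dynamodb_item(row):
    """Convert floats to Decimal, since the boto3 DynamoDB resource rejects float values."""
    return {k: Decimal(str(v)) if isinstance(v, float) else v for k, v in row.items()}


def handle_notification(event, fetch_rows, table):
    """Store the rows produced for each SNS notification and return how many were written.

    fetch_rows: hypothetical callable, notification dict -> list of row dicts.
    table: object exposing put_item(Item=...), e.g. boto3.resource("dynamodb").Table(name).
    """
    written = 0
    for record in event["Records"]:
        # Assumption: the notification body is a JSON document.
        notification = json.loads(record["Sns"]["Message"])
        for row in fetch_rows(notification):
            table.put_item(Item=to_dynamodb_item(row))
            written += 1
    return written
```

Passing the table and the row fetcher in as arguments keeps the handler logic testable without AWS credentials.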
I was assuming I would:
• create a table, enable the stream, and get its ARN
• create a Kinesis stream
• configure something to tell the DynamoDB stream to write to the Kinesis stream
I was looking at working with https://github.com/harlow/kinesis-consumer, but this reads from Kinesis. Can I use the ARN to read directly from the DynamoDB stream instead?
The more I look, the more I think I have to write a Lambda to read from DynamoDB and write to Kinesis. Is that correct?
thanks
Hey, can you provide a bit more information about your target setup? Do you plan to have some sort of ETL process for your DynamoDB table? AFAIK, when you bind a Kinesis stream to a DynamoDB table, every time you add, remove, or update rows in the table, a new event is published to the associated Kinesis stream, which you can consume and use however you want.
Maybe worth checking this one:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.html
DynamoDB now supports Kinesis Data Streams natively:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/kds.html
You can choose either DynamoDB Streams or Kinesis Data Streams for your Change Data Capture (CDC).
| Properties | Kinesis Data Streams for DynamoDB | DynamoDB Streams |
| --- | --- | --- |
| Data retention | Up to 1 year. | 24 hours. |
| Kinesis Client Library (KCL) support | Supports KCL versions 1.X and 2.X. | Supports KCL version 1.X. |
| Number of consumers | Up to 5 simultaneous consumers per shard, or up to 20 simultaneous consumers per shard with enhanced fan-out. | Up to 2 simultaneous consumers per shard. |
| Throughput quotas | Unlimited. | Subject to throughput quotas by DynamoDB table and AWS Region. |
| Record delivery model | Pull model over HTTP using GetRecords and, with enhanced fan-out, Kinesis Data Streams pushes the records over HTTP/2 by using SubscribeToShard. | Pull model over HTTP using GetRecords. |
| Ordering of records | The timestamp attribute on each stream record can be used to identify the actual order in which changes occurred in the DynamoDB table. | For each item that is modified in a DynamoDB table, the stream records appear in the same sequence as the actual modifications to the item. |
| Duplicate records | Duplicate records might occasionally appear in the stream. | No duplicate records appear in the stream. |
| Stream processing options | Process stream records using AWS Lambda, Kinesis Data Analytics, Kinesis Data Firehose, or AWS Glue streaming ETL. | Process stream records using AWS Lambda or DynamoDB Streams Kinesis adapter. |
| Durability level | Availability zones to provide automatic failover without interruption. | Availability zones to provide automatic failover without interruption. |
You can use Amazon Kinesis Data Streams to capture changes to Amazon DynamoDB. According to the AWS documentation:
Kinesis Data Streams captures item-level modifications in any DynamoDB table and replicates them to a Kinesis data stream. Your applications can access this stream and view item-level changes in near-real time. You can continuously capture and store terabytes of data per hour. You can take advantage of longer data retention time—and with enhanced fan-out capability, you can simultaneously reach two or more downstream applications. Other benefits include additional audit and security transparency.
You can also enable streaming to Kinesis from your DynamoDB table.
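As a sketch of that last step, the DynamoDB API exposes an `EnableKinesisStreamingDestination` operation, available in boto3 as `enable_kinesis_streaming_destination`. The wrapper name `enable_kinesis_destination`, the table name, and the ARN below are placeholders, and the client is passed in so the function itself needs no AWS setup.

```python
def enable_kinesis_destination(dynamodb_client, table_name, stream_arn):
    """Ask DynamoDB to replicate item-level changes into an existing Kinesis data stream.

    dynamodb_client: a low-level client such as boto3.client("dynamodb").
    Returns the destination status reported by DynamoDB, if present (e.g. "ENABLING").
    """
    response = dynamodb_client.enable_kinesis_streaming_destination(
        TableName=table_name,
        StreamArn=stream_arn,
    )
    return response.get("DestinationStatus")


# Example call (placeholder names, requires boto3 and AWS credentials):
#   import boto3
#   enable_kinesis_destination(
#       boto3.client("dynamodb"),
#       "ExampleTable",
#       "arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
#   )
```

With this in place, no intermediate Lambda is needed to copy changes from the table into Kinesis.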
After Kinesis Analytics does its job, the next step is to send that information to a destination. AWS currently offers three destination choices:
Kinesis stream
Kinesis Firehose delivery stream
AWS Lambda function
For my use case, Kinesis Firehose delivery stream is not what I want so I am left with:
Kinesis stream
AWS Lambda function
If I set the destination to a Kinesis Stream, I would then attach a Lambda to that stream to process the records.
AWS also offers the ability to set the destination to a Lambda, bypassing the Kinesis Stream step of this process. In doing some digging for docs I found this:
Using a Lambda Function as Output
Specifically in those docs under Lambda Output Invocation Frequency it says:
If records are emitted to the destination in-application stream within the data analytics application as a continuous query or a sliding window, the AWS Lambda destination function is invoked approximately once per second.
My Kinesis Analytics output qualifies under this scenario. So I can assume that my Lambda will be invoked, "approximately once per second".
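For context, a minimal sketch of a Lambda used directly as the Kinesis Analytics output. It assumes the output format is JSON and that the event and response follow the record model described in the "Using a Lambda Function as Output" docs (a `records` list with `recordId` and base64 `data`, answered with `Ok` or `DeliveryFailed` per record). The `process` function is a hypothetical placeholder for the real work.

```python
import base64
import json


def process(payload):
    """Hypothetical placeholder for the real per-record work."""
    print(payload)


def handler(event, context):
    results = []
    for record in event["records"]:
        try:
            # Assumption: the application is configured to emit JSON rows.
            payload = json.loads(base64.b64decode(record["data"]))
            process(payload)
            results.append({"recordId": record["recordId"], "result": "Ok"})
        except Exception:
            # Report the failure for this record instead of failing the whole batch.
            results.append({"recordId": record["recordId"], "result": "DeliveryFailed"})
    return {"records": results}
```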
I'm trying to understand the difference between using these 2 destinations as it pertains to using a Lambda.
Using AWS Lambda with Kinesis states that:
You can subscribe Lambda functions to automatically read batches of records off your Kinesis stream and process them if records are detected on the stream. AWS Lambda then polls the stream periodically (once per second) for new records.
So it sounds like the invocation interval is the same in either case: approximately 1 second.
So I think the guidance is:
If the next stage in the pipeline only needs one consumer, then use the AWS Lambda function destination. If, however, you need multiple different consumers to do different things with the same data sent to the destination, then a Kinesis stream is more appropriate.
Is this a correct assumption on how to choose a destination? Again, for my use case I am excluding the Kinesis Firehose delivery stream.
If the next stage in the pipeline only needs one consumer, then use the AWS Lambda function destination. If, however, you need multiple different consumers to do different things with the same data sent to the destination, then a Kinesis stream is more appropriate.
• I would always use a Kinesis stream with one shard and batch size = 1 (for example) if I wanted the items to be consumed one by one with no concurrency.
If there are multiple consumers, increase the number of shards; one Lambda is launched in parallel for each shard when there are items to process. If it makes sense, also increase the batch size.
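A small sketch of the Lambda side of that setup, assuming the stream carries JSON payloads. The event shape is the standard Kinesis trigger format: each record's payload is base64-encoded under `kinesis.data`, and `eventID` has the form `shardId-...:<sequence number>`. The helper name `decode_records` is made up for illustration.

```python
import base64
import json


def decode_records(event):
    """Yield (shard_id, payload) for each record in a Kinesis-triggered Lambda event."""
    for record in event["Records"]:
        # eventID looks like "shardId-000000000000:<sequence number>".
        shard_id = record["eventID"].split(":", 1)[0]
        # Assumption: producers put JSON documents on the stream.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        yield shard_id, payload


def handler(event, context):
    for shard_id, payload in decode_records(event):
        print(shard_id, payload)
```

With a batch size of 1, `event["Records"]` holds a single record per invocation; larger batch sizes just make the loop run more than once.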
But look again at the highlighted phrase below:
If, however, you need multiple different consumers to do different things with the same data sent to the destination, then a Kinesis stream is more appropriate.
If you have one or more producers and many consumers of exactly the same item, I guess you need to use SNS. The producer writes the item to one topic, and all the Lambdas listening to that topic will process the item.
If this does not answer your question, please clarify it. There is a little ambiguity.
I can't seem to find the documentation on which kinds of DynamoDB events can trigger a Lambda function. All I can find are mentions of when a new record is added to a table or a record is updated. Are those the only two actions/events available? Or could I also trigger a Lambda function when I request a record that does not exist (which is what I need in my case, where I will be using DynamoDB as a cache)?
Triggering AWS Lambda through events happening in DynamoDB is done by utilizing DynamoDB Streams.
As stated in the documentation:
DynamoDB Streams captures a time-ordered sequence of item-level modifications in any DynamoDB table, and stores this information in a log for up to 24 hours.
So streams only capture operations that modify data, which excludes read operations.
Triggering a Lambda function automatically because somebody queried for a key that doesn't exist is not supported by DynamoDB. You would have to handle that in your querying code.
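Handling the miss in the querying code could look roughly like the sketch below. `table` is assumed to behave like a boto3 DynamoDB `Table` resource (`get_item` returns a dict with no `"Item"` key when nothing matches), and `load_from_origin` is a hypothetical callable that fetches the real data on a miss.

```python
def get_or_load(table, key, load_from_origin):
    """Return the cached item, or load it from the origin and cache it on a miss.

    table: object with get_item(Key=...) and put_item(Item=...),
           e.g. boto3.resource("dynamodb").Table(name).
    load_from_origin: hypothetical callable, key dict -> full item dict (including the key).
    """
    response = table.get_item(Key=key)
    item = response.get("Item")
    if item is not None:
        return item  # cache hit

    # Cache miss: this is the spot where a "trigger" would otherwise have fired.
    item = load_from_origin(key)
    table.put_item(Item=item)
    return item
```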
Firehose->S3 uses the current date as a prefix for creating keys in S3. So this partitions the data by the time the record is written. My firehose stream contains events which have a specific event time.
Is there a way to create S3 keys containing this event time instead? Processing tools downstream depend on each event being in an "hour-folder" related to when it actually happened. Or would that have to be an additional processing step after Firehose is done?
The event time could be in the partition key or I could use a Lambda function to parse it from the record.
Kinesis Firehose doesn't (yet) allow clients to control how the date suffix of the final S3 objects is generated.
Your only option is to add a post-processing layer after Kinesis Firehose. For example, you could schedule an hourly EMR job, using Data Pipeline, that reads all files written in the last hour and publishes them to the correct S3 destinations.
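Whatever tool runs the post-processing, the core step is deriving the destination key from each event's own timestamp rather than its arrival time. A minimal sketch, assuming the event time is an ISO 8601 string with a UTC offset; the `events` prefix and the function name `hourly_key` are placeholders. The layout mirrors the UTC `YYYY/MM/DD/HH` folders Firehose writes by default.

```python
from datetime import datetime, timezone


def hourly_key(event_time_iso, object_name, prefix="events"):
    """Build an S3 key whose hour folder reflects when the event happened (in UTC)."""
    # Assumption: the timestamp carries an explicit offset, e.g. "+00:00".
    ts = datetime.fromisoformat(event_time_iso).astimezone(timezone.utc)
    return f"{prefix}/{ts:%Y/%m/%d/%H}/{object_name}"


print(hourly_key("2021-08-31T23:15:00+00:00", "part-0001.json"))
# prints: events/2021/08/31/23/part-0001.json
```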
This is not an answer to the question, but I would like to explain a little of the idea behind storing records according to event arrival time.
First, a few words about streams. Kinesis is just a stream of data, and it has a concept of consuming. One can reliably consume a stream only by reading it sequentially. There is also the idea of checkpoints as a mechanism for pausing and resuming consumption. A checkpoint is just a sequence number that identifies a position in the stream. By specifying this number, one can start reading the stream from a specific event.
Now back to the default S3 Firehose setup. Since the capacity of a Kinesis stream is quite limited, one most probably needs to store the data from Kinesis somewhere to analyze it later. The Firehose-to-S3 setup does this right out of the box: it just stores raw data from the stream in S3 buckets. But logically this data is still the same stream of records. To be able to reliably consume (read) this stream, one needs those sequential numbers for checkpoints, and these numbers are the records' arrival times.
What if I want to read records by creation time? It looks like the proper way to accomplish this is to read the S3 stream sequentially, dump it into some [time series] database or data warehouse, and do creation-time-based reads against that storage. Otherwise there will always be a non-zero chance of missing some batches of events while reading the S3 stream. So I would not suggest reordering the S3 buckets at all.
You'll need to do some post-processing or write a custom streaming consumer (such as Lambda) to do this.
We dealt with a huge event volume at my company, so writing a Lambda function didn't seem like a good use of money. Instead, we found batch-processing with Athena to be a really simple solution.
First, you stream into an Athena table, events, which can optionally be partitioned by arrival time.
Then, you define another Athena table, say, events_by_event_time, which is partitioned by the event_time attribute on your event, or however it's been defined in the schema.
Finally, you schedule a process to run an Athena INSERT INTO query that takes events from events and automatically repartitions them into events_by_event_time. Your events are now partitioned by event_time without requiring EMR, data pipelines, or any other infrastructure.
You can do this with any attribute on your events. It's also worth noting you can create a view that does a UNION of the two tables to query real-time and historic events.
I actually wrote more about this in a blog post here.
For future readers - Firehose supports Custom Prefixes for Amazon S3 Objects
https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html
AWS started offering "Dynamic Partitioning" in Aug 2021:
Dynamic partitioning enables you to continuously partition streaming data in Kinesis Data Firehose by using keys within data (for example, customer_id or transaction_id) and then deliver the data grouped by these keys into corresponding Amazon Simple Storage Service (Amazon S3) prefixes.
https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html
Look at https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html. You can implement a Lambda function that takes your records, processes them, sets the partition key, and then returns them to Firehose for delivery. You would also have to change the Firehose configuration to enable this partitioning and define your custom partition key/prefix/suffix.
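A sketch of such a transformation Lambda, assuming each record is a JSON document with a hypothetical `event_time` field holding epoch seconds. The response shape (`recordId`, `result`, `data`, and `metadata.partitionKeys`) follows the dynamic partitioning docs linked above; the key names `year`, `month`, `day`, and `hour` are a free choice.

```python
import base64
import json
from datetime import datetime, timezone


def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Assumption: "event_time" is epoch seconds; adjust to your own schema.
        ts = datetime.fromtimestamp(payload["event_time"], tz=timezone.utc)
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": record["data"],  # payload passed through unchanged
            "metadata": {
                "partitionKeys": {
                    "year": f"{ts:%Y}",
                    "month": f"{ts:%m}",
                    "day": f"{ts:%d}",
                    "hour": f"{ts:%H}",
                }
            },
        })
    return {"records": output}
```

The delivery stream's S3 prefix would then reference these keys, for example `!{partitionKeyFromLambda:year}/!{partitionKeyFromLambda:month}/!{partitionKeyFromLambda:day}/!{partitionKeyFromLambda:hour}/`, so each event lands in the hour folder of when it actually happened.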