I'm currently using a Kinesis data stream as a command queue for a high number of instances. The original stream is using the instance-id as the partition key.
The goal is to get the ordered events to a worker pool. Each worker pool can only handle a specific "instance origin", e.g. Azure, GCP, DataCenter001 - DataCenter100. Since each worker pool can only process a subset, I cannot read directly from this single stream: a single partition read could return commands from instances with different origins (I could of course simply skip these events, but each pool would then ignore about 90% of the commands). To keep the whole process near real-time, my idea was to split the original "uber stream" by "instance origin" into one Kinesis stream per origin. Each worker pool can then consume its own Kinesis stream.
{
    "instanceID": "9f858f6c-f924-490a-93c7-0b9c037cccc1",
    "origin": "DataCenter007",
    "command": "cd /etc"
}
The original stream partitions on instanceID. Kinesis Analytics filters on origin and forwards to a new stream, again partitioned on instanceID.
However, I couldn't find out whether the KinesisStreamsSink retains the partition key order.
DataStream<InstanceEvent> streamA = originalStream
        .filter(value -> value.getOrigin().equalsIgnoreCase("AZURE"));
DataStream<InstanceEvent> streamB = originalStream
        .filter(value -> value.getOrigin().equalsIgnoreCase("GCP"));

streamA.sinkTo(KinesisStreamsSink.<InstanceEvent>builder()
        .setStreamName("DataCenter001")
        .setSerializationSchema(new InstanceEventSerializationSchema())
        .setKinesisClientProperties(producerProperties)
        .setPartitionKeyGenerator(InstanceEvent::getInstanceId)
        .build());

streamB.sinkTo(KinesisStreamsSink.<InstanceEvent>builder()
        .setStreamName("DataCenter002")
        .setSerializationSchema(new InstanceEventSerializationSchema())
        .setKinesisClientProperties(producerProperties)
        .setPartitionKeyGenerator(InstanceEvent::getInstanceId)
        .build());
...
Will streams DataCenter001 and DataCenter002 keep the ordering guarantee from originalStream if they use the same partition key? Or can individual batch failures lead to a wrong order?
Related
I have a DynamoDB table linked with one stream, and that stream is linked with one Lambda function which processes it.
Question 1 - With the above setup, if an event comes to the stream and is ingested by the Lambda, does that event still reside in the stream, or does it get popped out as soon as it is ingested by the Lambda, just like a queue?
Question 2 - Can someone kindly explain the inner workings of a DDB stream and how it passes the data to Lambda? Are there any states for the stream events?
P.S.: AWS documentation says that events stay in the stream for a 24-hour window.
There are two concepts to understand here
Streams
Triggers
Whenever there is a change in the table, such as an addition, update, or deletion, the DynamoDB Streams feature stores that change for a period of 24 hrs. What gets stored is controlled by one of four stream view types (a short sketch of enabling a stream with a view type follows this list):
Keys only:- only the key attributes of the changed item are stored
New image:- the entire item as it appears after the change is stored
Old image:- the entire item as it appeared before the change is stored
New and old:- self-explanatory
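As a rough sketch (assuming the AWS SDK for Java v2, with a placeholder table name), enabling a stream on an existing table and picking one of these view types could look like this:

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.StreamSpecification;
import software.amazon.awssdk.services.dynamodb.model.StreamViewType;
import software.amazon.awssdk.services.dynamodb.model.UpdateTableRequest;

public class EnableStreamExample {
    public static void main(String[] args) {
        try (DynamoDbClient dynamoDb = DynamoDbClient.create()) {
            // Enable the stream on an existing table and choose which image(s)
            // of the changed item get written to the stream.
            dynamoDb.updateTable(UpdateTableRequest.builder()
                    .tableName("my-table") // placeholder table name
                    .streamSpecification(StreamSpecification.builder()
                            .streamEnabled(true)
                            .streamViewType(StreamViewType.NEW_AND_OLD_IMAGES)
                            .build())
                    .build());
        }
    }
}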
To associate a Lambda function with your stream, a feature called Triggers is used. The changes invoke the trigger, which in turn invokes the Lambda function associated with the change.
Part 1 of your question:-
By default, Lambda invokes your function as soon as records are available in the stream. If the batch it reads from the stream only has one record in it, Lambda only sends one record to the function. To avoid invoking the function with a small number of records, you can tell the event source to buffer records for up to 5 minutes by configuring a batch window. Before invoking the function, Lambda continues to read records from the stream until it has gathered a full batch, or until the batch window expires. If the Lambda fails, it will try to process that batch indefinitely (or until the records expire), keeping other messages from being processed as a result. To avoid stalled shards (I'll talk about this later), you can configure the event source mapping to retry with a smaller batch size, limit the number of retries, or discard records that are too old (you can set the maximum age of a record that Lambda will read).
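As a rough illustration of these knobs (assuming the AWS SDK for Java v2; the stream ARN, function name, and limit values are placeholders, not recommendations), creating the event source mapping could look like this:

import software.amazon.awssdk.services.lambda.LambdaClient;
import software.amazon.awssdk.services.lambda.model.CreateEventSourceMappingRequest;
import software.amazon.awssdk.services.lambda.model.EventSourcePosition;

public class StreamTriggerSetup {
    public static void main(String[] args) {
        try (LambdaClient lambda = LambdaClient.create()) {
            lambda.createEventSourceMapping(CreateEventSourceMappingRequest.builder()
                    // placeholder DynamoDB stream ARN
                    .eventSourceArn("arn:aws:dynamodb:us-east-1:123456789012:table/my-table/stream/2024-01-01T00:00:00.000")
                    .functionName("my-processor")           // placeholder function name
                    .startingPosition(EventSourcePosition.TRIM_HORIZON)
                    .batchSize(100)                         // max records per invocation
                    .maximumBatchingWindowInSeconds(300)    // buffer records for up to 5 minutes
                    .maximumRetryAttempts(2)                // don't retry a failing batch forever
                    .maximumRecordAgeInSeconds(3600)        // discard records older than 1 hour
                    .bisectBatchOnFunctionError(true)       // split a failing batch to isolate the bad record
                    .build());
        }
    }
}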
Part 2 of your question:-
The streams we are talking about work like Kinesis streams. They are a feature meant to be used by multiple producers and consumers. Here the producer is DynamoDB and the consumer is Lambda. Consumers have dedicated read throughput, so they don't have to compete with other consumers of the same data. With such consumers, Kinesis pushes records to Lambda over an HTTP/2 connection, which can also reduce the latency between adding a record and the function invocation.
The capacity of a stream is determined by the number of shards it contains. Shards are small units of capacity in the stream; hence, the more shards, the higher the capacity.
I guess I have explained the workings in Part 1 of this answer. Feel free to ask follow-up questions.
How are checkpoints and trimming related in AWS KCL library?
The documentation page Handling Startup, Shutdown, and Throttling says:
By default, the KCL begins reading records from the tip of the stream, which is the most recently added record. In this configuration, if a data-producing application adds records to the stream before any receiving record processors are running, the records are not read by the record processors after they start up.

To change the behavior of the record processors so that it always reads data from the beginning of the stream, set the following value in the properties file for your Amazon Kinesis Streams application:
initialPositionInStream = TRIM_HORIZON
The documentation page Developing an Amazon Kinesis Client Library Consumer in Java says:
Streams requires the record processor to keep track of the records that have already been processed in a shard. The KCL takes care of this tracking for you by passing a checkpointer (IRecordProcessorCheckpointer) to processRecords. The record processor calls the checkpoint method on this interface to inform the KCL of how far it has progressed in processing the records in the shard. In the event that the worker fails, the KCL uses this information to restart the processing of the shard at the last known processed record.
The first page seems to say that the KCL resumes at the tip of the stream, the second page at the last known processed record (that was marked as processed by the RecordProcessor using the checkpointer). In my case, I definitely need to restart at the last known processed record. Do I need to set the initialPositionInStream to TRIM_HORIZON?
With a Kinesis stream you have two options: you can read the newest records, or start from the oldest (TRIM_HORIZON).
But once your application has started, it simply reads from the position where it stopped, using its checkpoints.
You can see those checkpoints in DynamoDB (usually the table name is the same as the application name).
So if you restart your app, it will usually continue from where it stopped.
The answer is no, you don't need to set the initialPositionInStream to TRIM_HORIZON.
When you are reading events from a Kinesis stream, you have 4 options:
TRIM_HORIZON - the oldest events that are still in the stream shards before they are automatically trimmed (default 1 day, but can be extended up to 7 days). You will use this option if you want to start a new application that will process all the records that are available in the stream, but it will take a while until it is able to catch up and start processing the events in real-time.
LATEST - the newest events in the stream, ignoring all past events. You will use this option if you start a new application and want to process events in real time immediately.
AT/AFTER_SEQUENCE_NUMBER - the sequence number is usually the checkpoint that you keep while you are processing the events. These checkpoints allow you to process the events reliably, even in cases of reader failure or when you want to update its version and continue processing all the events without losing any of them. The difference between AT and AFTER is whether your checkpoint was taken before or after you processed the events successfully.
Please note that this is the only shard-specific option, as all the other options are global to the stream. When you are using the KCL, it manages a DynamoDB table for that application with a record for each shard holding the "current" sequence number for that shard.
AT_TIMESTAMP - the approximate time the event was put into the stream. You will use this option if you want to find specific events to process based on their timestamp. For example, when you know that you had a real-life event in your service at a specific time, you can develop an application that will process these specific events, even if you don't have the sequence number.
See more details in Kinesis documentation here: https://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetShardIterator.html
You should use "TRIM_HORIZON". It will only have an effect the first time your application starts to read records from the stream.
After that, it will continue from the last known position.
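For example, with the KCL 1.x, a minimal configuration sketch could look like this (the application, stream, and worker names are placeholders); the initial position is only consulted when no checkpoint exists yet:

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;

public class ConsumerConfig {
    public static KinesisClientLibConfiguration build() {
        // TRIM_HORIZON only matters on the very first start, when no checkpoints
        // exist yet in the KCL's DynamoDB lease table. After that, the worker
        // resumes from the last checkpointed sequence number per shard.
        return new KinesisClientLibConfiguration(
                "my-consumer-app",                        // also used as the DynamoDB checkpoint table name
                "my-stream",                              // placeholder stream name
                new DefaultAWSCredentialsProviderChain(),
                "worker-1")                               // placeholder worker id
                .withInitialPositionInStream(InitialPositionInStream.TRIM_HORIZON);
    }
}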
I have checked all of the AWS documentation on Kinesis. All I have found is how the producer streams data onto Kinesis streams and how the consumer consumes the stream once initialized (kind of a FIFO model). Since the data sent to the stream stays in the shard for 24 hours, I would like to access a particular value multiple times. However, I did not find a suitable mechanism to do that. Is there a way to scan the Kinesis stream rather than processing it like a FIFO model?
No, unfortunately you can't do that.
If you know the position of your data (i.e. checkpoint value) you can start reading your shard starting from that place. But otherwise, there is no search mechanism.
If you really need to catch a specific value and process it multiple times; you might want to use some in-memory database-like cache structure on your consumer application. Redis, Memcache or maybe VoltDB can be helpful if you have such large data moving at high speed.
When you are putting a record into Kinesis, the producer gets back a sequence number and a shard ID (see the API for PutRecord here: http://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecord.html).
Response Syntax:
{
    "SequenceNumber": "string",
    "ShardId": "string"
}
You can use this sequence ID and shard ID to fetch the record from the kinesis stream on the consumers side (see the API for GetShardIterator here: http://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetShardIterator.html).
Request Syntax:
{
    "ShardId": "string",
    "ShardIteratorType": "string",
    "StartingSequenceNumber": "string",
    "StreamName": "string"
}
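Putting the two calls together, a rough sketch with the AWS SDK for Java v2 (the stream name, partition key, and payload are placeholders) could look like this:

import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.GetRecordsRequest;
import software.amazon.awssdk.services.kinesis.model.GetShardIteratorRequest;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;
import software.amazon.awssdk.services.kinesis.model.PutRecordResponse;
import software.amazon.awssdk.services.kinesis.model.ShardIteratorType;

public class ReReadRecord {
    public static void main(String[] args) {
        try (KinesisClient kinesis = KinesisClient.create()) {
            // Produce a record and remember where it landed.
            PutRecordResponse put = kinesis.putRecord(PutRecordRequest.builder()
                    .streamName("my-stream")               // placeholder stream name
                    .partitionKey("instance-42")
                    .data(SdkBytes.fromUtf8String("{\"command\":\"cd /etc\"}"))
                    .build());

            // Later (within the retention period) re-read from that exact position.
            String iterator = kinesis.getShardIterator(GetShardIteratorRequest.builder()
                    .streamName("my-stream")
                    .shardId(put.shardId())
                    .shardIteratorType(ShardIteratorType.AT_SEQUENCE_NUMBER)
                    .startingSequenceNumber(put.sequenceNumber())
                    .build()).shardIterator();

            kinesis.getRecords(GetRecordsRequest.builder()
                            .shardIterator(iterator)
                            .limit(1)
                            .build())
                    .records()
                    .forEach(r -> System.out.println(r.data().asUtf8String()));
        }
    }
}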
Please note that if you are looking for more of a pub-sub model, you should use SNS rather than Kinesis, which is optimized for event stream processing (mainly in FIFO order) in near real time.
I am trying to make a Kinesis Consumer Client. To work on it I went through the Developer Guide of Kinesis and AWS Document http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-record-processor-implementation-app-java.html.
I was wondering: is it possible to get data from two different streams and process it accordingly?
Say I have two different streams, stream1 and stream2.
Is it possible to get Data from both stream and process individually ?
Why not? Do get_records from both streams.
If each of your streams has only a single shard, you will see all the events, since it is recommended to process each shard with a single worker. If your logic is somehow to join events from different sources/streams, you can implement it with a single worker reading from both streams.
Note that if you have streams with multiple shards, each one of your workers will see only a part of the events. You can have the following options:
Both streams have a single shard each - in this case you can read from both streams with a single worker and see all events from both streams. You can add timestamps or other keys to allow you to "join" these events in the worker (a minimal polling sketch for this case follows at the end of this answer).
One stream (stream1) with one shard and the second stream (stream2) with multiple shards - in this case you can read stream1 from all your workers, while each of them also processes a single shard of stream2. Each one of your workers will see all the events of stream1 and its share of the events of stream2. Note that you have a limit on the speed at which you can read events from stream1 with its single shard (2 MB/second or 5 reads/second), and if you have many shards in stream2, this can become a real limit.
Both streams can have multiple shards - in this case it will be more complex to ensure that you are able to "join" these events, as you need to sync both the writes and the reads to these streams. You could read from all shards of both streams with a single worker, but this is not good practice as it limits your ability to scale, since you no longer have a distributed system. Another option is to use the same partition_key in both streams, with the same number of shards and the same partition definition for both streams, and to verify that you are reading from the "right" shard of each stream in each of your workers, and that you do this correctly every time one of your workers fails and restarts, which might be a bit complex.
Another option that you can consider is to write both types of events in a single stream, again using the same partition_key, and then filter them on the reader side if you need to process them differently (for example, to write them to different log files in S3).
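As a minimal sketch of the single-shard case (option 1 above), one consumer can simply poll both streams with plain GetRecords calls (AWS SDK for Java v2; the stream names and shard ID are placeholders, and for anything beyond a toy setup the KCL is the better fit):

import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.GetRecordsRequest;
import software.amazon.awssdk.services.kinesis.model.GetRecordsResponse;
import software.amazon.awssdk.services.kinesis.model.GetShardIteratorRequest;
import software.amazon.awssdk.services.kinesis.model.ShardIteratorType;

public class TwoStreamReader {
    public static void main(String[] args) throws InterruptedException {
        try (KinesisClient kinesis = KinesisClient.create()) {
            String itA = iteratorFor(kinesis, "stream1"); // placeholder stream names,
            String itB = iteratorFor(kinesis, "stream2"); // each assumed to have one shard

            while (true) {
                GetRecordsResponse respA = kinesis.getRecords(GetRecordsRequest.builder().shardIterator(itA).build());
                GetRecordsResponse respB = kinesis.getRecords(GetRecordsRequest.builder().shardIterator(itB).build());

                respA.records().forEach(r -> process("stream1", r.data().asUtf8String()));
                respB.records().forEach(r -> process("stream2", r.data().asUtf8String()));

                // Continue from where each stream left off.
                itA = respA.nextShardIterator();
                itB = respB.nextShardIterator();
                Thread.sleep(1000); // stay under the per-shard read limits
            }
        }
    }

    private static String iteratorFor(KinesisClient kinesis, String streamName) {
        return kinesis.getShardIterator(GetShardIteratorRequest.builder()
                .streamName(streamName)
                .shardId("shardId-000000000000")          // single-shard assumption
                .shardIteratorType(ShardIteratorType.LATEST)
                .build()).shardIterator();
    }

    private static void process(String source, String payload) {
        System.out.println(source + ": " + payload);
    }
}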
How do I tell what percentage of the data in a Kinesis stream a reader has already processed? I know each reader has a per-shard checkpoint sequence number, and I can also get the StartingSequenceNumber of each shard from describe-stream, however, I don't know how far along in my data the reader currently is (I don't know the latest sequence number of the shard).
I was thinking of getting a LATEST iterator for each shard and getting the last record's sequence number, however that doesn't seem to work if there's no new data since I got the LATEST iterator.
Any ideas or tools for doing this out there?
Thanks!
I suggest you implement a custom metric or metrics in your applications to track this.
For example, you could append a message send time within your Kinesis message, and on processing the message, record the time difference as an AWS CloudWatch custom metric. This would indicate how close your consumer is to the front of the stream.
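A rough sketch of such a custom metric with the AWS SDK for Java v2 (the namespace and metric name are illustrative) might look like this:

import java.time.Duration;
import java.time.Instant;
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.MetricDatum;
import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataRequest;
import software.amazon.awssdk.services.cloudwatch.model.StandardUnit;

public class LagMetric {
    private final CloudWatchClient cloudWatch = CloudWatchClient.create();

    // Call this for each consumed record, passing the send time that the
    // producer embedded in the message payload.
    public void reportLag(Instant sentAt) {
        long lagMillis = Duration.between(sentAt, Instant.now()).toMillis();
        cloudWatch.putMetricData(PutMetricDataRequest.builder()
                .namespace("MyApp/KinesisConsumer")       // illustrative namespace
                .metricData(MetricDatum.builder()
                        .metricName("ConsumerLagMillis")  // illustrative metric name
                        .unit(StandardUnit.MILLISECONDS)
                        .value((double) lagMillis)
                        .build())
                .build());
    }
}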
You could also record the number of messages pushed (at the pushing application) and messages received at the Kinesis consumer. If you compare these in a chart on CloudWatch, you could see that the curves roughly follow each other indicating that the consumer is doing a good job at keeping up with the workload.
You could also try monitoring your Kinesis consumer to see how often it idly waits for records (i.e., no results are returned by Kinesis, suggesting it is at the front of the stream and all records are processed).
Also note that there is no way to track a "percent" processed of the stream, since Kinesis messages expire after 24 hours (so the total number of messages is constantly rolling). There is also no direct API call to count the number of messages inside your stream (unless you have recorded this yourself as above).
If you use the KCL, you can do that by comparing IncomingRecords from the built-in CloudWatch metrics of Kinesis with RecordsProcessed, which is a custom metric published by the KCL.
Then you select a time range and an interval of, say, 1 day.
You would then get the following type of graph:
As you can see, many more records were added than processed. By looking at the values at each point, you will know exactly whether your processor is falling behind or not.
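If you prefer to pull the same numbers programmatically instead of reading them off the chart, a rough sketch with the AWS SDK for Java v2 could look like this; the stream name and KCL application name are placeholders, and depending on your KCL metrics level you may need to add Operation/ShardId dimensions to the RecordsProcessed query:

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsRequest;
import software.amazon.awssdk.services.cloudwatch.model.Statistic;

public class BacklogCheck {
    public static void main(String[] args) {
        try (CloudWatchClient cloudWatch = CloudWatchClient.create()) {
            Instant end = Instant.now();
            Instant start = end.minus(1, ChronoUnit.DAYS);

            // Records added to the stream (built-in Kinesis metric).
            double incoming = sum(cloudWatch, GetMetricStatisticsRequest.builder()
                    .namespace("AWS/Kinesis")
                    .metricName("IncomingRecords")
                    .dimensions(Dimension.builder().name("StreamName").value("my-stream").build()) // placeholder
                    .startTime(start).endTime(end).period(3600)
                    .statistics(Statistic.SUM)
                    .build());

            // Records the KCL application reports as processed
            // (the KCL publishes its metrics under the application name by default).
            double processed = sum(cloudWatch, GetMetricStatisticsRequest.builder()
                    .namespace("my-consumer-app")         // placeholder application name
                    .metricName("RecordsProcessed")
                    .startTime(start).endTime(end).period(3600)
                    .statistics(Statistic.SUM)
                    .build());

            System.out.printf("incoming=%.0f processed=%.0f backlog=%.0f%n",
                    incoming, processed, incoming - processed);
        }
    }

    private static double sum(CloudWatchClient cloudWatch, GetMetricStatisticsRequest request) {
        return cloudWatch.getMetricStatistics(request).datapoints().stream()
                .mapToDouble(d -> d.sum())
                .sum();
    }
}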