AWS Kinesis read from past - amazon-web-services

How do we read from an AWS Kinesis stream going back in time?
With an AWS Kinesis stream, one can send a stream of events and a consumer application can read them. The Kinesis stream worker fetches the records and passes them to IRecordProcessor#processRecords, starting from the last checkpoint.
However, if I need to read records going back in time, for example to start processing records from 2 hours ago, how do I configure my Kinesis worker to fetch such records?

You can start your Kinesis consumer again (or a different one) with different shard iterator settings.
See GetShardIterator.
The usual setting is LATEST or TRIM_HORIZON (oldest):
{
  "ShardId": "ShardId",
  "ShardIteratorType": "LATEST",
  "StreamName": "StreamName"
}
But you can change it to a specific timestamp (within the stream's retention period, 24 hours by default):
{
  "ShardId": "ShardId",
  "ShardIteratorType": "AT_TIMESTAMP",
  "StreamName": "StreamName",
  "Timestamp": "2016-06-29T19:58:46.480-00:00"
}
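For reference, a minimal boto3 sketch of the same AT_TIMESTAMP request; the stream and shard names below are placeholders:

```python
import boto3
from datetime import datetime, timedelta, timezone

kinesis = boto3.client("kinesis")

# Placeholder identifiers; substitute your own stream and shard.
stream_name = "StreamName"
shard_id = "shardId-000000000000"

# Ask for an iterator positioned roughly two hours in the past.
response = kinesis.get_shard_iterator(
    StreamName=stream_name,
    ShardId=shard_id,
    ShardIteratorType="AT_TIMESTAMP",
    Timestamp=datetime.now(timezone.utc) - timedelta(hours=2),
)
shard_iterator = response["ShardIterator"]

# Read one batch of records from that point onward.
batch = kinesis.get_records(ShardIterator=shard_iterator, Limit=100)
for record in batch["Records"]:
    print(record["SequenceNumber"], record["Data"])
```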
Keep in mind that the Kinesis consumer usually saves its checkpoints in a DynamoDB table, so if you are reusing the same Kinesis application you need to delete those checkpoints first.

Related

Reprocessing AWS Kinesis data stream messages using a KCL consumer

A KCL consumer runs on EC2 machines in an Auto Scaling Group (ASG) sized according to the number of provisioned shards of the Kinesis data stream. That means if the data stream has n provisioned shards, at most n EC2 machines can be configured to consume messages, one per shard, as per this document link.
Messages are processed in near real time as soon as they arrive in the Kinesis data stream, because the shard iterator type for the KCL consumer is set to LATEST. For more info check here.
A DynamoDB table is configured for the KCL consumer, with a checkpoint entry for each provisioned shard, to keep track of which shards of the Kinesis data stream are being leased and processed by the workers of the KCL consumer application.
If we want to reprocess every message still present in the Kinesis data stream, i.e. everything within its retention period (24 hours by default, extendable up to 365 days), is there any simple and easy mechanism to do it?
Possible approaches (which may be incorrect or improvable):
First Approach
Stop KCL consumer workers.
Delete the DynamoDB table holding the checkpoints for each provisioned shard so that the workers start picking up messages from the Kinesis data stream again.
Restart the KCL consumer service.
Second Approach
Stop the KCL consumer
Edit/update the checkpoint value for each shard to one corresponding to a previous/old timestamp. Is there any conversion formula? I don't know. Could we instead write some dummy value that will be overwritten by the KCL consumer?
Restart KCL consumer service
Any other approach?
Please feel free to suggest or comment on how we can reprocess Kinesis data stream messages effectively and without problems.
To reprocess all the stream data with your first approach, you would need to change the shard iterator type from LATEST to TRIM_HORIZON before deleting the table and restarting the KCL consumer; otherwise you would only process new arrivals to the stream.
The second approach is also possible: you would need to get a shard iterator for each shard, again using the TRIM_HORIZON iterator type. You can also specify a timestamp (AT_TIMESTAMP) if you need to reprocess less data than the full retention of your stream. This AWS reference documentation can be useful.
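As a rough illustration of the first approach, here is a hedged boto3 sketch; the lease table name "my-kcl-app" is a placeholder for whatever application name your KCL consumer uses (the KCL names its checkpoint table after the application):

```python
import boto3

# Hypothetical KCL application name; the KCL creates its lease/checkpoint
# table with the same name as the application.
LEASE_TABLE = "my-kcl-app"

dynamodb = boto3.client("dynamodb")

# 1. Stop the KCL workers (outside this script), then drop the checkpoint
#    table so no old lease entries survive.
dynamodb.delete_table(TableName=LEASE_TABLE)
dynamodb.get_waiter("table_not_exists").wait(TableName=LEASE_TABLE)

# 2. Restart the KCL consumer with its initial position set to TRIM_HORIZON
#    (e.g. initialPositionInStream = TRIM_HORIZON in an amazon-kclpy
#    .properties file) so it replays everything still within retention.
```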

How to wire a DynamoDB stream to a Kinesis stream?

I was assuming I
create a table and enable streams, and I now have an ARN
create a Kinesis stream
configure somewhere to tell the DynamoDB stream to write to the Kinesis stream
I was looking at working with https://github.com/harlow/kinesis-consumer, but this reads from Kinesis. Or can I use the ARN to read directly from the DynamoDB stream?
The more I look, the more I think I have to write a Lambda to read from DynamoDB and write to Kinesis. Is that correct?
Thanks
Hey, can you provide a bit more information about your target setup? Do you plan to have some sort of ETL process for your DynamoDB table? AFAIK, when you bind a Kinesis stream to a DynamoDB table, every time you add, remove, or update rows in DynamoDB, a new event will be published to the associated Kinesis stream, which you can consume and use in whatever way you want.
maybe worth checking this one:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.html
DynamoDB now supports Kinesis Data Streams natively:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/kds.html
You can choose either DynamoDB Streams or Kinesis Data Streams for your Change Data Capture (CDC).
| Properties | Kinesis Data Streams for DynamoDB | DynamoDB Streams |
| --- | --- | --- |
| Data retention | Up to 1 year. | 24 hours. |
| Kinesis Client Library (KCL) support | Supports KCL versions 1.X and 2.X. | Supports KCL version 1.X. |
| Number of consumers | Up to 5 simultaneous consumers per shard, or up to 20 simultaneous consumers per shard with enhanced fan-out. | Up to 2 simultaneous consumers per shard. |
| Throughput quotas | Unlimited. | Subject to throughput quotas by DynamoDB table and AWS Region. |
| Record delivery model | Pull model over HTTP using GetRecords; with enhanced fan-out, Kinesis Data Streams pushes the records over HTTP/2 using SubscribeToShard. | Pull model over HTTP using GetRecords. |
| Ordering of records | The timestamp attribute on each stream record can be used to identify the actual order in which changes occurred in the DynamoDB table. | For each item that is modified in a DynamoDB table, the stream records appear in the same sequence as the actual modifications to the item. |
| Duplicate records | Duplicate records might occasionally appear in the stream. | No duplicate records appear in the stream. |
| Stream processing options | Process stream records using AWS Lambda, Kinesis Data Analytics, Kinesis Data Firehose, or AWS Glue streaming ETL. | Process stream records using AWS Lambda or the DynamoDB Streams Kinesis adapter. |
| Durability level | Availability Zones to provide automatic failover without interruption. | Availability Zones to provide automatic failover without interruption. |
You can use Amazon Kinesis Data Streams to capture changes to Amazon DynamoDB. According to the AWS documentation:
Kinesis Data Streams captures item-level modifications in any DynamoDB table and replicates them to a Kinesis data stream. Your applications can access this stream and view item-level changes in near-real time. You can continuously capture and store terabytes of data per hour. You can take advantage of longer data retention time—and with enhanced fan-out capability, you can simultaneously reach two or more downstream applications. Other benefits include additional audit and security transparency.
You can also enable streaming to Kinesis from your DynamoDB table.
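If it helps, here is a minimal boto3 sketch of enabling that streaming destination; the table and stream names are placeholders:

```python
import boto3

dynamodb = boto3.client("dynamodb")
kinesis = boto3.client("kinesis")

# Placeholder names; substitute your own table and stream.
table_name = "my-table"
stream_name = "my-table-cdc"

# Create the destination Kinesis data stream (skip if it already exists).
kinesis.create_stream(StreamName=stream_name, ShardCount=1)
kinesis.get_waiter("stream_exists").wait(StreamName=stream_name)
stream_arn = kinesis.describe_stream(StreamName=stream_name)[
    "StreamDescription"]["StreamARN"]

# Wire the table's change data capture to the Kinesis stream.
dynamodb.enable_kinesis_streaming_destination(
    TableName=table_name,
    StreamArn=stream_arn,
)
```

After this call, item-level changes to the table start flowing into the Kinesis stream, and you can consume them with the KCL, Lambda, or any other Kinesis consumer.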

Kinesis Analytics Destination Guidance: Lambda vs Kinesis Stream to Lambda

After Kinesis Analytics does its job, the next step is to send that information off to a destination. AWS currently offers 3 destination choices:
Kinesis stream
Kinesis Firehose delivery stream
AWS Lambda function
For my use case, Kinesis Firehose delivery stream is not what I want so I am left with:
Kinesis stream
AWS Lambda function
If I set the destination to a Kinesis Stream, I would then attach a Lambda to that stream to process the records.
AWS also offers the ability to set the destination to a Lambda, bypassing the Kinesis Stream step of this process. In doing some digging for docs I found this:
Using a Lambda Function as Output
Specifically in those docs under Lambda Output Invocation Frequency it says:
If records are emitted to the destination in-application stream within the data analytics application as a continuous query or a sliding window, the AWS Lambda destination function is invoked approximately once per second.
My Kinesis Analytics output qualifies under this scenario. So I can assume that my Lambda will be invoked, "approximately once per second".
I'm trying to understand the difference between using these 2 destinations as it pertains to using a Lambda.
Using AWS Lambda with Kinesis states that:
You can subscribe Lambda functions to automatically read batches of records off your Kinesis stream and process them if records are detected on the stream. AWS Lambda then polls the stream periodically (once per second) for new records.
So it sounds like the invocation interval is the same in either case: approximately 1 second.
So I think the guidance is:
If the next stage in the pipeline only needs one consumer, then use the AWS Lambda function destination. If, however, you need multiple different consumers to do different things with the same data sent to the destination, then a Kinesis Stream is more appropriate.
Is this a correct assumption on how to choose a destination? Again, for my use case I am excluding the Kinesis Firehose delivery stream.
If the next stage in the pipeline only needs one consumer, then use the AWS Lambda function destination. If, however, you need multiple different consumers to do different things with the same data sent to the destination, then a Kinesis Stream is more appropriate.
I would always use a Kinesis Stream with one shard and batch size = 1 (for example) if I wanted the items to be consumed one by one with no concurrency.
If there are multiple consumers, increase the number of shards; one Lambda is launched in parallel for each shard when there are items to process. If it makes sense, also increase the batch size.
But read the highlighted phrase again:
If, however, you need multiple different consumers to do different things with the same data sent to the destination, then a Kinesis Stream is more appropriate.
If you have one or more producers and many consumers of the exact same item, I guess you need to use SNS. The producer writes the item to one topic, and then all the Lambdas listening to that topic will process the item.
If this does not answer your question, please clarify it. There is a little ambiguity.
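Whichever destination you choose, the Lambda that ends up processing the stream records receives the data base64-encoded. A minimal handler sketch for the Kinesis-stream-plus-Lambda option (a Lambda subscribed to the stream), assuming the producer writes JSON payloads:

```python
import base64
import json

def handler(event, context):
    # Each invocation delivers a batch of Kinesis records.
    for record in event["Records"]:
        # Kinesis record data arrives base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        message = json.loads(payload)  # assumes the producer wrote JSON
        print(record["kinesis"]["sequenceNumber"], message)
```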

Sending and receiving logs with AWS Kinesis

How can I send logs to (and receive logs from) different Kinesis stream shards using Python boto3? I can send and receive when there is only one shard, but I cannot figure out how it will work if I specify multiple shards for my Kinesis stream.
You can use the Kinesis Agent to push logs to Kinesis streams, and the KCL Python library to read records from the stream. The KCL handles a lot of things for you, such as reading from multiple shards in parallel, resharding, etc.
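As a rough boto3-only sketch (the stream name is a placeholder): the partition key on the producer side decides which shard each record lands on, and on the consumer side you can enumerate the shards and read each one, which is essentially what the KCL automates for you.

```python
import json
import boto3

kinesis = boto3.client("kinesis")
stream = "my-log-stream"  # placeholder stream name

# Producing: the partition key determines the target shard, so spreading
# keys (e.g. per host or per log source) spreads records across shards.
kinesis.put_record(
    StreamName=stream,
    Data=json.dumps({"level": "INFO", "msg": "hello"}).encode(),
    PartitionKey="host-42",
)

# Consuming without the KCL: list the shards and read each one.
for shard in kinesis.list_shards(StreamName=stream)["Shards"]:
    iterator = kinesis.get_shard_iterator(
        StreamName=stream,
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    batch = kinesis.get_records(ShardIterator=iterator, Limit=25)
    for record in batch["Records"]:
        print(shard["ShardId"], record["Data"])
```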

Kafka-like offset on Kinesis Stream?

I have worked a bit with Kafka in the past, and lately there is a requirement to port part of the data pipeline to an AWS Kinesis stream. I have read that Kinesis is effectively a fork of Kafka and shares many similarities.
However, I have failed to see how we can have multiple consumers reading from the same stream, each with its own offset. There is a sequence number given to each data record, but I couldn't find anything specific to a consumer (a Kafka group ID?).
Is it really possible to have different consumers with different ingestion rate over same AWS Kinesis Stream?
Yes.
You can have multiple Kinesis Consumer Applications. Let's say you have 2.
The first consumer application (I think this is a "consumer group" in Kafka?) can be "first-app" and store its positions in the DynamoDB table "first-app-table". It can have as many nodes (EC2 instances) as you want.
The second consumer application can also work on the same stream and store its positions in another DynamoDB table, say "second-app-table".
Each table will contain "what is the last processed position on shard X for app Y" information. So the 2 applications store checkpoints for the same shards in a different place, which makes them independent.
Regarding the ingestion rate, consumer applications using the KCL have an "idleTimeBetweenReadsInMillis" value, which is the polling interval for the Amazon Kinesis Get operations. For example, the first application can have a poll interval of "2000", so it will poll the stream's shards every 2 seconds to see whether any new records have arrived.
I don't know Kafka well, but as far as I remember, a Kafka "partition" is a "shard" in Kinesis, and likewise a Kafka "offset" is a "sequence number" in Kinesis. The Kinesis Client Library uses the term "checkpoint" for the stored sequence numbers. As you said, the concepts are similar.
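To make the "offset" analogy concrete, here is a hedged boto3 sketch (the stream name and shard ID are placeholders) of two consumers keeping their own checkpoints for the same shard; the KCL does exactly this for you, but persists the checkpoints in DynamoDB:

```python
import boto3

kinesis = boto3.client("kinesis")
STREAM = "shared-stream"  # placeholder stream name

def read_from_checkpoint(shard_id, checkpoint):
    """Read a batch starting from this consumer's own stored position."""
    if checkpoint is None:
        # No checkpoint yet: start from the oldest record still retained.
        iterator = kinesis.get_shard_iterator(
            StreamName=STREAM, ShardId=shard_id,
            ShardIteratorType="TRIM_HORIZON",
        )["ShardIterator"]
    else:
        # Resume just after the last processed sequence number, which is
        # the Kinesis counterpart of a Kafka offset.
        iterator = kinesis.get_shard_iterator(
            StreamName=STREAM, ShardId=shard_id,
            ShardIteratorType="AFTER_SEQUENCE_NUMBER",
            StartingSequenceNumber=checkpoint,
        )["ShardIterator"]

    batch = kinesis.get_records(ShardIterator=iterator)
    for record in batch["Records"]:
        print(shard_id, record["SequenceNumber"], record["Data"])
        checkpoint = record["SequenceNumber"]
    return checkpoint  # persist this per application, like the KCL does

# Two "applications" keep separate checkpoints for the same shard, so they
# advance through the same stream independently and at their own pace.
checkpoint_app_a = read_from_checkpoint("shardId-000000000000", None)
checkpoint_app_b = read_from_checkpoint("shardId-000000000000", None)
```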