Resume reading from kinesis after a KCL consumer outage [duplicate] - amazon-web-services

I can't find anything in the official AWS Kinesis documentation that explicitly explains how TRIM_HORIZON relates to the checkpoint, or how LATEST relates to the checkpoint.
Can you confirm my theory:
TRIM_HORIZON - If the application-name is new, I will read all the records available in the stream. If the application-name was already used, I will read from my last checkpoint.
LATEST - If the application-name is new, I will read only the records added to the stream after I subscribed to it. If the application-name was already used, I will read from my last checkpoint.
So the difference between TRIM_HORIZON and LATEST only matters when the application-name is new.

AT_TIMESTAMP
-- from a specific timestamp
TRIM_HORIZON
-- all the available messages in the Kinesis stream from the beginning (same as earliest in Kafka)
LATEST
-- from the latest messages, i.e. messages arriving in Kinesis/Kafka from the moment you start reading, and all incoming messages from that point onwards

From GetShardIterator documentation (which lines up with my experience using Kinesis):
In the request, you can specify the shard iterator type AT_TIMESTAMP to read records from an arbitrary point in time, TRIM_HORIZON to cause ShardIterator to point to the last untrimmed record in the shard in the system (the oldest data record in the shard), or LATEST so that you always read the most recent data in the shard.
Basically, the difference is whether you want to start from the oldest record (TRIM_HORIZON), or from "right now" (LATEST - skipping data between latest checkpoint and now).
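As a concrete illustration, here is a minimal sketch using the AWS SDK for Java v2 (the stream name and shard id are placeholders); swapping ShardIteratorType.TRIM_HORIZON for ShardIteratorType.LATEST is the only change between "start from the oldest record" and "start from right now":

import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.GetRecordsRequest;
import software.amazon.awssdk.services.kinesis.model.GetShardIteratorRequest;
import software.amazon.awssdk.services.kinesis.model.ShardIteratorType;

KinesisClient kinesis = KinesisClient.create();

// TRIM_HORIZON: start at the oldest untrimmed record in the shard.
// LATEST: start at records added after this iterator was obtained.
String iterator = kinesis.getShardIterator(GetShardIteratorRequest.builder()
        .streamName("my-stream")                 // placeholder
        .shardId("shardId-000000000000")         // placeholder
        .shardIteratorType(ShardIteratorType.TRIM_HORIZON)
        .build())
    .shardIterator();

// A single GetRecords call; a real consumer loops using nextShardIterator().
kinesis.getRecords(GetRecordsRequest.builder()
        .shardIterator(iterator)
        .limit(100)
        .build())
    .records()
    .forEach(r -> System.out.println(r.sequenceNumber()));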

The question clearly asks how these options relate to the checkpoint. However, none of the existing answers addresses the checkpoint at all.
An authoritative answer to this question by Justin Pfifer appears in a GitHub issue here.
The most relevant portion is
The KCL will always use the value in the lease table if it's present. It's important to remember that Kinesis itself doesn't track the position of consumers. Tracking is provided by the lease table. Leases in the KCL serve double duty. They provide both mutual exclusion, and position tracking. So for mutual exclusion a lease needs to be created, and to satisfy the position tracking an initial value must be selected.
(Emphasis added by me.)
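To make that concrete, below is a hedged sketch of a KCL 1.x configuration (application, stream, and worker names are placeholders). The initial position set here is only consulted for shards that have no checkpoint in the lease table yet; once a checkpoint exists, it takes precedence:

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;

// Sketch only: the initial position applies per shard, and only while the lease
// table holds no checkpoint for that shard; afterwards the checkpoint wins.
KinesisClientLibConfiguration config =
        new KinesisClientLibConfiguration(
                "my-application-name",                  // lease table is keyed by this name (placeholder)
                "my-stream",                            // placeholder
                new DefaultAWSCredentialsProviderChain(),
                "worker-1")                             // placeholder worker id
        .withInitialPositionInStream(InitialPositionInStream.TRIM_HORIZON);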

I think choosing between the two is a trade-off between starting from the most recent data and starting from the oldest data in Kinesis that hasn't been processed yet.
Imagine a scenario where there is a bug in your Lambda function: it throws an exception on the first record it gets and returns an error to Kinesis, so none of the records in the stream get processed and they sit there for the retention period (one day). After you fix the bug and redeploy, your Lambda starts receiving all the messages Kinesis has been holding, and your downstream service has to work through old data before it reaches the most recent data. This can add unwanted latency to your application if you choose TRIM_HORIZON.
If you use LATEST instead, you can ignore all those previously stuck messages and have your Lambda start processing from new events/messages, improving the latency your system provides.
So you have to decide which is more important for your customers: is losing a few data points acceptable (and what is your tolerance limit), or do you always need exact results, for example when computing a sum or a counter?
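For reference, a Lambda consumer's starting position is chosen on its event source mapping. A rough sketch with the AWS SDK for Java v2 follows; the function name, stream ARN, and the choice of LATEST are illustrative assumptions, not part of the original answer:

import software.amazon.awssdk.services.lambda.LambdaClient;
import software.amazon.awssdk.services.lambda.model.CreateEventSourceMappingRequest;
import software.amazon.awssdk.services.lambda.model.EventSourcePosition;

LambdaClient lambda = LambdaClient.create();

// LATEST skips whatever is already sitting in the stream; TRIM_HORIZON replays
// everything still within the retention period. All names are placeholders.
lambda.createEventSourceMapping(CreateEventSourceMappingRequest.builder()
        .functionName("my-consumer-function")
        .eventSourceArn("arn:aws:kinesis:us-east-1:123456789012:stream/my-stream")
        .startingPosition(EventSourcePosition.LATEST)
        .batchSize(100)
        .build());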

Related

Realtime-ness of S3 event notification

I am interested in the lifecycle of objects in a bucket (i.e. when the objects were created and deleted).
One approach is to perform a periodic scan of the bucket, explicitly track the lastModifiedTime, and diff against the previous scan result to identify deleted objects.
Another alternative I was considering was to enable S3 event notifications. However, the data in the notification does not contain the lastModifiedTime of the object. Can the eventTime be used as a proxy instead? Is there any guarantee of how quickly the event is sent? In my case it is acceptable if delivery of the event is delayed, as long as eventTime is not significantly later than the modificationTime of the object.
Also, are there any other alternatives for capturing the lifecycle of S3 objects?
Yeah, the eventTime is a pretty good approximation of the lastModifiedTime of an object. One caveat: the definition of lastModifiedTime is
Object creation date or the last modified date, whichever is the latest.
So in order to use eventTime as an approximation, you probably need a trigger that covers all the events where an object is either created or modified. Regarding your question of how quickly the event is sent, here is a quote from the S3 documentation:
Amazon S3 event notifications are designed to be delivered at least once. Typically, event notifications are delivered in seconds but can sometimes take a minute or longer.
If you want the accurate lastModifiedTime, you need to do a headObject operation for each object.
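For example, a minimal AWS SDK for Java v2 sketch (bucket and key are placeholders) that fetches the authoritative Last-Modified timestamp without downloading the object body:

import java.time.Instant;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;

S3Client s3 = S3Client.create();

// HEAD returns only metadata, including the exact Last-Modified timestamp.
Instant lastModified = s3.headObject(HeadObjectRequest.builder()
        .bucket("my-bucket")    // placeholder
        .key("my-object-key")   // placeholder
        .build())
    .lastModified();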
Your first approach (periodic scans) could work, but be careful not to do it naively if you have millions of objects; that is, don't just call listObjects in a while loop. That doesn't scale at all, and the listObjects API is fairly expensive. If you only need to do this traffic analysis once a day or once a week, I recommend using S3 Inventory instead; the lastModifiedTime is included in the inventory report. [ref]
There is no guarantee for how long it takes to deliver the events. From the docs:
Amazon S3 event notifications are designed to be delivered at least once. Typically, event notifications are delivered in seconds but can sometimes take a minute or longer.
Also, events occurring at the same time may end up represented by a single event:
If two writes are made to a single non-versioned object at the same time, it is possible that only a single event notification will be sent. If you want to ensure that an event notification is sent for every successful write, you can enable versioning on your bucket. With versioning, every successful write will create a new version of your object and will also send an event notification.
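If you go that route, enabling versioning is a one-off configuration change. A hedged sketch with the AWS SDK for Java v2 (bucket name is a placeholder):

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.BucketVersioningStatus;
import software.amazon.awssdk.services.s3.model.PutBucketVersioningRequest;
import software.amazon.awssdk.services.s3.model.VersioningConfiguration;

S3Client s3 = S3Client.create();

// With versioning enabled, each successful write creates a new object version
// and therefore triggers its own event notification.
s3.putBucketVersioning(PutBucketVersioningRequest.builder()
        .bucket("my-bucket")    // placeholder
        .versioningConfiguration(VersioningConfiguration.builder()
                .status(BucketVersioningStatus.ENABLED)
                .build())
        .build());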

Google Dataflow and Pubsub - can not achieve exactly-once delivery

I'm trying to achieve exactly-once delivery using Google Dataflow and PubSub using Apache Beam SDK 2.6.0.
Use case is quite simple:
'Generator' dataflow job sends 1M messages to PubSub topic.
GenerateSequence
    .from(0)
    .to(1000000)
    .withRate(100000, Duration.standardSeconds(1L));
'Archive' dataflow job reads messages from PubSub subscription and saves to Google Cloud Storage.
pipeline
    .apply("Read events",
        PubsubIO.readMessagesWithAttributes()
            // this is to achieve exactly-once delivery
            .withIdAttribute(ATTRIBUTE_ID)
            .fromSubscription("subscription")
            .withTimestampAttribute(TIMESTAMP_ATTRIBUTE))
    .apply("Window events",
        Window.<Dto>into(FixedWindows.of(Duration.millis(options.getWindowDuration())))
            .triggering(Repeatedly.forever(AfterWatermark.pastEndOfWindow()))
            .withAllowedLateness(Duration.standardMinutes(15))
            .discardingFiredPanes())
    .apply("Events count metric", ParDo.of(new CountMessagesMetric()))
    .apply("Write files to archive",
        FileIO.<String, Dto>writeDynamic()
            .by(Dto::getDataSource).withDestinationCoder(StringUtf8Coder.of())
            .via(Contextful.of((msg, ctx) -> msg.getData(), Requirements.empty()), TextIO.sink())
            .to(archiveDir)
            .withTempDirectory(archiveDir)
            .withNumShards(options.getNumShards())
            .withNaming(dataSource ->
                new SyslogWindowedDataSourceFilenaming(dataSource, archiveDir, filenamePrefix, filenameSuffix)
            ));
I added withIdAttribute to both PubsubIO.Write (the 'Generator' job) and PubsubIO.Read (the 'Archive' job) and expected that it would guarantee exactly-once semantics.
I would like to test the 'negative' scenario:
'Generator' dataflow job sends 1M messages to PubSub topic.
'Archive' dataflow job starts working, but I stop it in the middle of processing by clicking 'Stop job' -> 'Drain'. Some portion of the messages has been processed and saved to Cloud Storage, let's say 400K messages.
I start the 'Archive' job again and expect that it will pick up the unprocessed messages (600K), so that eventually I will see exactly 1M messages saved to Storage.
What I got in fact: all messages are delivered (at-least-once is achieved), but on top of that there are a lot of duplicates, somewhere in the neighborhood of 30-50K per 1M messages.
Is there any solution to achieve exactly-once delivery?
Dataflow does not enable you to persist state across runs. If you use Java you can update a running pipeline in a way that does not cause it to lose the existing state, allowing you to deduplicate across pipeline releases.
If this doesn't work for you, you may want to archive messages in a way where they are keyed by ATTRIBUTE_ID, e.g. in Spanner, or in GCS using this as the file name.
So, I've never done it myself, but reasoning about your problem, this is how I would approach it...
My solution is a bit convoluted, but I failed to identify other ways to achieve this without involving additional external services. So, here goes nothing.
You could have your pipeline read both from Pub/Sub and GCS and then combine them to de-duplicate the data. The tricky part here is that one would be a bounded PCollection (GCS) and the other an unbounded one (Pub/Sub). You can add timestamps to the bounded collection and then window the data. During this stage you could potentially drop GCS data older than ~15 minutes (the duration of the window in your previous implementation). These two steps (i.e. adding timestamps properly and dropping data that is probably old enough to not create duplicates) are by far the trickiest parts.
Once this has been solved, append the two PCollections and then use a GroupByKey on an id that is common to both sets of data. This will yield a PCollection<KV<Long, Iterable<YOUR_DATUM_TYPE>>>. Then you can use an additional DoFn that drops all but the first element in the resulting Iterable and also removes the KV<> boxing. From there on you can simply continue processing the data as you normally would.
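A rough sketch of that grouping and de-duplication step, assuming the merged, keyed collection is called keyedById and the datum type is Dto as in the question (both names are placeholders):

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// keyedById is assumed to be the merged PCollection<KV<Long, Dto>> built from
// the Pub/Sub and GCS inputs described above.
PCollection<Dto> deduplicated = keyedById
    .apply("Group by id", GroupByKey.create())
    .apply("Keep first per id", ParDo.of(
        new DoFn<KV<Long, Iterable<Dto>>, Dto>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                // All values sharing an id are duplicates; emit only the first one.
                c.output(c.element().getValue().iterator().next());
            }
        }));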
Finally, this additional work should only be necessary for the first Pub/Sub window after restarting the pipeline. After that you should re-assign the GCS PCollection to an empty PCollection so the GroupByKey doesn't do too much additional work.
Let me know what you think and if this could work. Also, if you decide to pursue this strategy, please post your mileage :).
In the meantime, Pub/Sub has gained support for exactly-once delivery.
It is currently in the pre-GA launch stage, so unfortunately it is not ready for production use yet.

Is there a way to purge a dynamoDB stream?

Hi, I am working with DynamoDB Streams and Lambda triggers over them. I've got myself into a fix: my Lambda reads records from TRIM_HORIZON, and it failed to process the very first record. Now the Lambda is hell-bent on retrying the processing of that specific record. Is there a way to purge the stream so that new records start flowing and can be processed?
If you only want new records (those coming in now, rather than historical records), use LATEST instead of TRIM_HORIZON.
To answer the question directly: no, there is currently no way to purge a Kinesis/DynamoDB stream.
I think there is no way to purge the DynamoDB stream. One workaround is to delete the stream and recreate it (strongly discouraged in production, as it incurs data loss).

How to read the oldest unprocessed record in Kinesis Data Stream

I'm new to AWS and would like some guidance.
I want to process the oldest unprocessed record, but I cannot seem to get the params right.
Current Architecture
For the shard iterator:
I've tried TRIM_HORIZON, which gave me all the records since the beginning.
I've also tried LATEST, which only gave me the single latest record.
Not sure if these additional details will help but...
I'm putting my own records in through Lambda on the AWS console
I'm debugging this by looking at the log files in CloudWatch
I'm getting records through the shard iterator (TRIM_HORIZON and LATEST)
My getRecords limit is set at 100
Thanks in advance!
There is no "oldest unprocessed record", as Kinesis doesn't know what you've processed (for example, you may have fetched the records but not done anything with them).
If you're using Kinesis, I strongly recommend using the Kinesis Client Library (KCL), which has the concept of checkpoints. These are essentially a nice wrapper on top of the AFTER_SEQUENCE_NUMBER shard iterator, which translates to "oldest uncheckpointed record", or as close as you'll get to "oldest unprocessed record".
(You could always implement this logic yourself, but why not reuse work that Amazon has already done for you?)
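If you do decide to track positions yourself, the raw building block is the AFTER_SEQUENCE_NUMBER iterator type. A hedged sketch with the AWS SDK for Java v2, where the stream name, shard id, and the persisted sequence number are placeholders you would supply from your own bookkeeping:

import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.GetShardIteratorRequest;
import software.amazon.awssdk.services.kinesis.model.ShardIteratorType;

KinesisClient kinesis = KinesisClient.create();

String lastProcessedSequenceNumber = "...";  // the sequence number you checkpointed yourself

// Resume right after the last record your own bookkeeping marked as processed.
String iterator = kinesis.getShardIterator(GetShardIteratorRequest.builder()
        .streamName("my-stream")                 // placeholder
        .shardId("shardId-000000000000")         // placeholder
        .shardIteratorType(ShardIteratorType.AFTER_SEQUENCE_NUMBER)
        .startingSequenceNumber(lastProcessedSequenceNumber)
        .build())
    .shardIterator();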

Amazon S3 conditional put object

I have a system in which I receive a lot of messages. Each message has a unique ID, but it can also receive updates during its lifetime. As the time between a message being sent and being handled can be very long (weeks), the messages are stored in S3. For each message only the last version is needed. My problem is that occasionally two messages with the same ID arrive together, but as two versions (older and newer).
Is there a way for S3 to have a conditional PutObject request where I can declare "put this object unless I have a newer version in S3"?
I need an atomic operation here
That's not the use-case for S3, which is eventually-consistent. Some ideas:
You could try to partition your messages - all messages that start with A-L go to one box, M-Z go to another box. Then each box locally checks that there are no duplicates.
Your best bet is probably some kind of database. Depending on your use case, you could use a regular SQL database, or maybe a simple RAM-only database like Redis. Write to multiple Redis DBs at once to avoid SPOF.
There is SWF which can make a unique processing queue for each item, but that would probably mean more HTTP requests than just checking in S3.
David's idea about turning on versioning is interesting. You could have a daemon that periodically trims off the old versions. When reading, you would have to do "read repair" where you search the versions looking for the newest object.
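A hedged sketch of that read-repair step with the AWS SDK for Java v2 (bucket and key are placeholders): list the versions stored under the key and keep the one with the newest modification time:

import java.util.Comparator;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectVersionsRequest;
import software.amazon.awssdk.services.s3.model.ObjectVersion;

S3Client s3 = S3Client.create();

// With versioning enabled, every PUT keeps its own version; pick the newest one.
ObjectVersion newest = s3.listObjectVersions(ListObjectVersionsRequest.builder()
        .bucket("my-bucket")        // placeholder
        .prefix("message-id-123")   // placeholder message key
        .build())
    .versions()
    .stream()
    .max(Comparator.comparing(ObjectVersion::lastModified))
    .orElseThrow();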
Couldn't this be solved by using tags, and using a Condition on that when using PutObject? See "Example 3: Allow a user to add object tags that include a specific tag key and value" here: https://docs.aws.amazon.com/AmazonS3/latest/dev/object-tagging.html#tagging-and-policies