Recurring warning - unable to clean up lease on KCL 2.3

I performed a re-sharding from 1 shard to 2 on my Kinesis stream, after which my KCL consumer keeps logging the following warning every minute.
Unable to clean up lease shardId-000000000006 for newStream due to LeaseCleanupManager.LeaseCleanupResult(cleanedUpCompletedLease=false, cleanedUpGarbageLease=false, wereChildShardsPresent=true, wasResourceNotFound=false)
shardId-000000000006 is the parent shard that was split into the two child shards 7 and 8. My DynamoDB lease table entries are like so: [lease table entries omitted]
A restart of the consumer does not help. Is this a cause for concern, and why is the worker unable to clean up the lease on shardId-000000000006?

This seems to be the expected behavior. Once both child shards had checkpointed, the shardId-000000000006 lease was cleaned up without any issues.
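For context, the parent shard's lease only becomes eligible for cleanup after its processor checkpoints at shard end and the child shards have started checkpointing. Here is a minimal sketch of the KCL 2.x shard-end hook, mirroring the standard ShardRecordProcessor contract (the class name and the stubbed methods are illustrative):

    import software.amazon.kinesis.exceptions.InvalidStateException;
    import software.amazon.kinesis.exceptions.ShutdownException;
    import software.amazon.kinesis.lifecycle.events.InitializeInput;
    import software.amazon.kinesis.lifecycle.events.LeaseLostInput;
    import software.amazon.kinesis.lifecycle.events.ProcessRecordsInput;
    import software.amazon.kinesis.lifecycle.events.ShardEndedInput;
    import software.amazon.kinesis.lifecycle.events.ShutdownRequestedInput;
    import software.amazon.kinesis.processor.ShardRecordProcessor;

    public class RunRecordProcessor implements ShardRecordProcessor {
        @Override public void initialize(InitializeInput input) { }
        @Override public void processRecords(ProcessRecordsInput input) { /* per-record work */ }
        @Override public void leaseLost(LeaseLostInput input) { }
        @Override public void shutdownRequested(ShutdownRequestedInput input) { }

        @Override
        public void shardEnded(ShardEndedInput input) {
            try {
                // Mandatory: marks the (parent) shard as fully consumed so its
                // lease becomes eligible for cleanup once the children checkpoint.
                input.checkpointer().checkpoint();
            } catch (ShutdownException | InvalidStateException e) {
                throw new RuntimeException("Checkpoint at shard end failed", e);
            }
        }
    }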

Related

Kinesis vs SQS, which is the best for this particular case?

I have been reading about Kinesis vs SQS differences and when to use each but I'm struggling to know which is the appropriate solution for this particular problem:
Strava-like app where users record their runs
50 incoming runs per second
The processing of each run takes exactly 1 minute
I want the user to have their results in less than 5 minutes
A run is just a GUID; the job that processes it will get all the info from S3
If I understand correctly, in Kinesis you can have 1 worker per shard, correct? That would mean 1 run per minute per shard. Since I have 3,000 incoming runs per minute, meeting the 5-minute deadline would mean I need to have 600 shards with 1 worker each.
Is this assumption correct?
With SQS I can just have 1 queue and as many workers as I like, up to SQS's limit of 120,000 inflight messages.
If 1 run errors during processing I want to reprocess it a few more times and then store it for further inspection.
I don't need to process messages in order, and duplicates are totally fine.
1 worker per message; after it's processed I no longer care about the message
In that case, a queuing service such as SQS should be used. Kinesis is a streaming service, which persists data: multiple workers can read messages from a stream for as long as the records are retained, and none of your workers can remove a message from the stream.
Also, with SQS you can set up a dead-letter queue, which lets you capture messages that fail to process after a predefined number of attempts.
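As a rough sketch of wiring this up with the AWS SDK for Java v2 (the queue names and the maxReceiveCount of 3 are illustrative assumptions):

    import java.util.Map;
    import software.amazon.awssdk.services.sqs.SqsClient;
    import software.amazon.awssdk.services.sqs.model.QueueAttributeName;

    public class CreateQueues {
        public static void main(String[] args) {
            try (SqsClient sqs = SqsClient.create()) {
                // Create the dead-letter queue first and look up its ARN.
                String dlqUrl = sqs.createQueue(b -> b.queueName("runs-dlq")).queueUrl();
                String dlqArn = sqs.getQueueAttributes(b -> b
                            .queueUrl(dlqUrl)
                            .attributeNames(QueueAttributeName.QUEUE_ARN))
                        .attributes().get(QueueAttributeName.QUEUE_ARN);

                // Main queue: after 3 failed receives, SQS moves the message to the DLQ.
                String redrivePolicy = String.format(
                        "{\"deadLetterTargetArn\":\"%s\",\"maxReceiveCount\":\"3\"}", dlqArn);
                sqs.createQueue(b -> b
                        .queueName("runs")
                        .attributes(Map.of(QueueAttributeName.REDRIVE_POLICY, redrivePolicy)));
            }
        }
    }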

Long-running Dataflow job fails with no errors in user code

After running for 17 hours, my Dataflow job failed with the following message:
The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures.
The 4 failures consist of 3 workers losing contact with the service, and one worker reported dead:
****-q15f Root cause: The worker lost contact with the service.
****-pq33 Root cause: The worker lost contact with the service.
****-fzdp Root cause: The worker ****-fzdp has been reported dead. Aborting lease 4624388267005979538.
****-nd4r Root cause: The worker lost contact with the service.
I don't see any errors in the worker logs for the job in Stackdriver. Is this just bad luck? I don't know how frequently work items need to be retried, so I don't know what the probability is that a single work item will fail 4 times over the course of a 24-hour job.
But this same type of job failure happens frequently for this long-running job, so it seems like we need some way to either decrease the failure rate of work items or increase the allowed number of retries. Is either possible?
This doesn't seem related to my pipeline code, but in case it's relevant, I'm using the Python SDK with apache-beam==2.15.0. I'd appreciate any advice on how to debug this.
Update: The "STACK TRACES" section in the console is totally empty.
I was having the same problem and it was solved by scaling up my worker resources. Specifically, I set --machine_type=n1-highcpu-96 in my pipeline options. See the Compute Engine machine types documentation for a more extensive list of options.
Edit: set it to highcpu or highmem depending on the requirements of your pipeline.

Kinesis client library record processor failure

According to AWS docs:
The worker invokes record processor methods using Java ExecutorService tasks. If a task fails, the worker retains control of the shard that the record processor was processing. The worker starts a new record processor task to process that shard. For more information, see Read Throttling.
According to another page on AWS docs:
The Kinesis Client Library (KCL) relies on your processRecords code to handle any exceptions that arise from processing the data records. Any exception thrown from processRecords is absorbed by the KCL. To avoid infinite retries on a recurring failure, the KCL does not resend the batch of records processed at the time of the exception. The KCL then calls processRecords for the next batch of data records without restarting the record processor. This effectively results in consumer applications observing skipped records. To prevent skipped records, handle all exceptions within processRecords appropriately.
Aren't these two statements contradictory? One says that the record processor restarts, and the other says that the records are skipped.
What exactly does the KCL do when a record processor fails? How does a KCL worker come to know that a record processor failed?
Based on my experience writing, debugging, and supporting KCL-based applications, the second statement is more clear/accurate/useful for describing how you should consider error handling.
First, a bit of background:
KCL record processing is designed to run from multiple hosts. Say you have 3 hosts and 12 shards to process - each host runs a single worker, and will own processing for 4 shards.
If, during processing for one of those shards, an exception is thrown, KCL will absorb the exception and treat it as if all records were processed - effectively "skipping" any records that weren't processed.
Remember, this is your code that threw the exception, so you can handle it before it escapes to the KCL.
When a KCL worker itself fails or is stopped, its shards are transferred to another worker. For example, if you scale down to two hosts, the 4 shards that were being worked by the third worker are transferred to the other two.
The first statement is trying (not very clearly) to say that when a KCL task fails, that instance of the worker will keep control of the shards it's processing (and not transfer them to another worker).
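In practice this means catching everything inside processRecords yourself. A minimal KCL 2.x sketch (class and helper names are illustrative; the other lifecycle methods are stubbed):

    import software.amazon.kinesis.lifecycle.events.InitializeInput;
    import software.amazon.kinesis.lifecycle.events.LeaseLostInput;
    import software.amazon.kinesis.lifecycle.events.ProcessRecordsInput;
    import software.amazon.kinesis.lifecycle.events.ShardEndedInput;
    import software.amazon.kinesis.lifecycle.events.ShutdownRequestedInput;
    import software.amazon.kinesis.processor.ShardRecordProcessor;
    import software.amazon.kinesis.retrieval.KinesisClientRecord;

    public class SafeRecordProcessor implements ShardRecordProcessor {
        @Override public void initialize(InitializeInput input) { }
        @Override public void leaseLost(LeaseLostInput input) { }
        @Override public void shardEnded(ShardEndedInput input) { /* checkpoint here */ }
        @Override public void shutdownRequested(ShutdownRequestedInput input) { }

        @Override
        public void processRecords(ProcessRecordsInput input) {
            for (KinesisClientRecord record : input.records()) {
                try {
                    handle(record);
                } catch (Exception e) {
                    // If this escaped, the KCL would absorb it and move on to the
                    // next batch, silently skipping these records. Retry, log, or
                    // dead-letter the record here instead.
                }
            }
        }

        private void handle(KinesisClientRecord record) {
            // hypothetical per-record business logic
        }
    }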

Does the Spring SqsListener wait until the last message is processed (or completed) from the current poll before the next poll of messages happens?

I have an SQS listener with a max message count of 10. When my consumer receives a batch of 10 messages they all get processed, but sometimes (depending on the message) the processing will take 5-6 hours and some take as little as 5 minutes. I have 3 consumers (3 different JVMs) polling from the queue with a maxMessageCount of 10. Here is my issue:
If one of those 10 messages takes 5 hours to process it seems as though the listener is waiting to do the next poll of 10 messages until all of the previous messages are 100% complete. Is there a way to allow it to poll a new batch of messages even though another is still being processed?
I'm guessing that I am missing something small here. I am using the Spring Cloud library and the SqsListener annotation. Has anybody run across this before?
Also, I don't think this should matter, but the queue is AWS SQS and the JVMs are running on an ECS cluster.
If you run the task on the poller thread, the next poll won't happen until the current one completes.
You can use an ExecutorChannel or QueueChannel to hand the work off to another thread (or threads) but you risk message loss if you do that.
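To make the hand-off concrete, here is a minimal sketch that uses a plain ExecutorService inside the listener rather than a Spring Integration channel; it illustrates the same trade-off. The queue name, the pool size, and the pre-Awspring org.springframework.cloud.aws import path are assumptions:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.springframework.cloud.aws.messaging.listener.annotation.SqsListener;
    import org.springframework.stereotype.Component;

    @Component
    public class RunListener {
        // Sized independently of the poll batch; 20 is an arbitrary example.
        private final ExecutorService workers = Executors.newFixedThreadPool(20);

        @SqsListener("run-queue") // hypothetical queue name
        public void onMessage(String runId) {
            // Returning immediately frees the poller thread for the next batch,
            // but the message may be deleted before processing completes, so a
            // crash here loses the work (the message-loss risk mentioned above).
            workers.submit(() -> process(runId));
        }

        private void process(String runId) {
            // long-running work
        }
    }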
Your situation is rather unusual; 5 hours is a long time to process a message.
You should perhaps consider redesigning your application to persist these "long running" requests to a database or similar, instead of processing them directly from the message. Or, perhaps put them in a different queue so that they don't impact the shorter tasks.

Amazon KCL Checkpoints and Trim Horizon

How are checkpoints and trimming related in AWS KCL library?
The documentation page Handling Startup, Shutdown, and Throttling says:
By default, the KCL begins reading records from the tip of the stream, which is the most recently added record. In this configuration, if a data-producing application adds records to the stream before any receiving record processors are running, the records are not read by the record processors after they start up.
To change the behavior of the record processors so that it always reads data from the beginning of the stream, set the following value in the properties file for your Amazon Kinesis Streams application:
initialPositionInStream = TRIM_HORIZON
The documentation page Developing an Amazon Kinesis Client Library Consumer in Java says:
Streams requires the record processor to keep track of the records that have already been processed in a shard. The KCL takes care of this tracking for you by passing a checkpointer (IRecordProcessorCheckpointer) to processRecords. The record processor calls the checkpoint method on this interface to inform the KCL of how far it has progressed in processing the records in the shard. In the event that the worker fails, the KCL uses this information to restart the processing of the shard at the last known processed record.
The first page seems to say that the KCL resumes at the tip of the stream, the second page at the last known processed record (that was marked as processed by the RecordProcessor using the checkpointer). In my case, I definitely need to restart at the last known processed record. Do I need to set the initialPositionInStream to TRIM_HORIZON?
With a Kinesis stream you have two options: you can read the newest records, or start from the oldest (TRIM_HORIZON).
But once your application has started, it simply resumes from the position where it stopped, using its checkpoints.
You can see those checkpoints in DynamoDB (usually the table name matches the application name).
So if you restart your app, it will usually continue from where it stopped.
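To peek at those checkpoints programmatically, here is a sketch using the AWS SDK for Java v2; the table name is an assumption (the KCL defaults to the application name), while leaseKey and checkpoint are the standard lease attributes:

    import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
    import software.amazon.awssdk.services.dynamodb.model.ScanRequest;

    public class ShowCheckpoints {
        public static void main(String[] args) {
            try (DynamoDbClient ddb = DynamoDbClient.create()) {
                ddb.scan(ScanRequest.builder().tableName("my-kcl-app").build())
                   .items()
                   .forEach(lease -> System.out.printf("%s -> %s%n",
                           lease.get("leaseKey").s(),      // the shard id
                           lease.get("checkpoint").s()));  // sequence number, or SHARD_END
            }
        }
    }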
The answer is no, you don't need to set the initialPositionInStream to TRIM_HORIZON.
When you are reading events from a Kinesis stream, you have 4 options:
TRIM_HORIZON - the oldest events that are still in the stream shards, before they are automatically trimmed (default 1 day, but this can be extended up to 7 days). You will use this option if you want to start a new application that processes all the records available in the stream, though it will take a while until it catches up and starts processing events in real time.
LATEST - the newest events in the stream, ignoring all past events. You will use this option if you start a new application and want it to process events in real time immediately.
AT/AFTER_SEQUENCE_NUMBER - the sequence number is usually the checkpoint that you keep while you are processing the events. These checkpoints allow you to process events reliably, even in cases of reader failure or when you want to deploy a new version and continue processing all the events without losing any of them. The difference between AT and AFTER is whether the checkpoint was taken before or after the events were processed successfully.
Please note that this is the only shard-specific option; all the other options are global to the stream. When you are using the KCL, it manages a DynamoDB table for that application, with a record for each shard holding the "current" sequence number for that shard.
AT_TIMESTAMP - the approximate time the event was put into the stream. You will use this option if you want to find specific events to process based on their timestamp. For example, when you know that you had a real-life event in your service at a specific time, you can develop an application that processes these specific events, even if you don't have the sequence number.
See more details in Kinesis documentation here: https://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetShardIterator.html
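These options correspond to the ShardIteratorType values accepted by GetShardIterator. A sketch using the AWS SDK for Java v2 (the stream and shard names are illustrative):

    import software.amazon.awssdk.services.kinesis.KinesisClient;
    import software.amazon.awssdk.services.kinesis.model.GetShardIteratorRequest;
    import software.amazon.awssdk.services.kinesis.model.ShardIteratorType;

    public class IteratorExample {
        public static void main(String[] args) {
            try (KinesisClient kinesis = KinesisClient.create()) {
                String iterator = kinesis.getShardIterator(GetShardIteratorRequest.builder()
                            .streamName("newStream")
                            .shardId("shardId-000000000007")
                            // Or LATEST, AT_SEQUENCE_NUMBER / AFTER_SEQUENCE_NUMBER
                            // (with .startingSequenceNumber(...)), or AT_TIMESTAMP
                            // (with .timestamp(...)).
                            .shardIteratorType(ShardIteratorType.TRIM_HORIZON)
                            .build())
                        .shardIterator();
                System.out.println(iterator);
            }
        }
    }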
You should use TRIM_HORIZON. It only has an effect the first time your application starts reading records from the stream.
After that, it will continue from the last known position.