Rebalancing redundancy in cooperative-sticky partition assignment strategy - librdkafka

I tried to log the partitions passed to RdKafka::RebalanceCb::rebalance_cb() for the RdKafka::ERR__ASSIGN_PARTITIONS and RdKafka::ERR__REVOKE_PARTITIONS cases. When I run my consumer application, I get the following log:
[03-14-22 16:48:31:366]::MyConsumerApp1(6F7FE700)::INFO: Rebalance assignment event received using COOPERATIVE rebalancing protocol
[03-14-22 16:48:31:366]::MyConsumerApp1(6F7FE700)::INFO: topic 1001 partition 0
[03-14-22 16:48:31:367]::MyConsumerApp1(6F7FE700)::INFO: topic 1001 partition 1
[03-14-22 16:48:31:367]::MyConsumerApp1(6F7FE700)::INFO: topic 1001 partition 2
[03-14-22 16:48:31:367]::MyConsumerApp1(6F7FE700)::INFO: topic 1001 partition 3
[03-14-22 16:48:31:367]::MyConsumerApp1(6F7FE700)::INFO: topic 1001 partition 4
[03-14-22 16:48:31:367]::MyConsumerApp1(6F7FE700)::INFO: topic 1001 partition 5
[03-14-22 16:48:32:450]::MyConsumerApp1(92FFD700)::INFO: Rebalance assignment event received using COOPERATIVE rebalancing protocol
[03-14-22 16:48:32:450]::MyConsumerApp1(92FFD700)::INFO: topic 1002 partition 0
[03-14-22 16:48:32:450]::MyConsumerApp1(92FFD700)::INFO: topic 1002 partition 1
[03-14-22 16:48:32:450]::MyConsumerApp1(92FFD700)::INFO: topic 1002 partition 2
[03-14-22 16:48:32:451]::MyConsumerApp1(92FFD700)::INFO: topic 1002 partition 3
[03-14-22 16:48:32:451]::MyConsumerApp1(92FFD700)::INFO: topic 1002 partition 4
[03-14-22 16:48:32:451]::MyConsumerApp1(92FFD700)::INFO: topic 1002 partition 5
When I run another instance of my consumer application, I get this log:
[03-14-22 16:50:04:622]::MyConsumerApp1(90FF9700)::INFO: Rebalance revocation event received using COOPERATIVE rebalancing protocol.
[03-14-22 16:50:04:622]::MyConsumerApp1(90FF9700)::INFO: topic 1001 partition 0
[03-14-22 16:50:04:622]::MyConsumerApp1(90FF9700)::INFO: topic 1001 partition 1
[03-14-22 16:50:04:622]::MyConsumerApp1(90FF9700)::INFO: topic 1001 partition 2
[03-14-22 16:50:04:622]::MyConsumerApp1(90FF9700)::INFO: topic 1001 partition 3
[03-14-22 16:50:04:622]::MyConsumerApp1(90FF9700)::INFO: topic 1001 partition 4
[03-14-22 16:50:04:622]::MyConsumerApp1(90FF9700)::INFO: topic 1001 partition 5
[03-14-22 16:50:04:624]::MyConsumerApp1(90FF9700)::INFO: Rebalance assignment event received using COOPERATIVE rebalancing protocol
[03-14-22 16:50:06:211]::MyConsumerApp1(90FF9700)::INFO: Rebalance revocation event received using COOPERATIVE rebalancing protocol.
[03-14-22 16:50:06:212]::MyConsumerApp1(90FF9700)::INFO: topic 1002 partition 1
[03-14-22 16:50:06:212]::MyConsumerApp1(90FF9700)::INFO: topic 1002 partition 3
[03-14-22 16:50:06:212]::MyConsumerApp1(90FF9700)::INFO: topic 1002 partition 5
[03-14-22 16:50:06:213]::MyConsumerApp1(90FF9700)::INFO: Rebalance assignment event received using COOPERATIVE rebalancing protocol
[03-14-22 16:50:06:213]::MyConsumerApp1(90FF9700)::INFO: topic 1001 partition 0
[03-14-22 16:50:06:213]::MyConsumerApp1(90FF9700)::INFO: topic 1001 partition 2
[03-14-22 16:50:06:218]::MyConsumerApp1(90FF9700)::INFO: topic 1001 partition 4
[03-14-22 16:50:08:279]::MyConsumerApp1(6D7FA700)::INFO: Rebalance assignment event received using COOPERATIVE rebalancing protocol
[03-14-22 16:50:08:927]::MyConsumerApp1(927FC700)::INFO: Rebalance revocation event received using COOPERATIVE rebalancing protocol.
[03-14-22 16:50:08:927]::MyConsumerApp1(927FC700)::INFO: topic 1001 partition 0
[03-14-22 16:50:08:927]::MyConsumerApp1(927FC700)::INFO: topic 1001 partition 2
[03-14-22 16:50:08:927]::MyConsumerApp1(927FC700)::INFO: topic 1001 partition 4
[03-14-22 16:50:08:927]::MyConsumerApp1(927FC700)::INFO: topic 1002 partition 0
[03-14-22 16:50:08:927]::MyConsumerApp1(927FC700)::INFO: topic 1002 partition 2
[03-14-22 16:50:08:927]::MyConsumerApp1(927FC700)::INFO: topic 1002 partition 4
[03-14-22 16:50:10:759]::MyConsumerApp1(91FFB700)::INFO: Rebalance assignment event received using COOPERATIVE rebalancing protocol
[03-14-22 16:50:10:759]::MyConsumerApp1(91FFB700)::INFO: topic 1001 partition 0
[03-14-22 16:50:10:759]::MyConsumerApp1(91FFB700)::INFO: topic 1001 partition 2
[03-14-22 16:50:10:759]::MyConsumerApp1(91FFB700)::INFO: topic 1001 partition 4
[03-14-22 16:50:10:759]::MyConsumerApp1(91FFB700)::INFO: topic 1002 partition 0
[03-14-22 16:50:10:759]::MyConsumerApp1(91FFB700)::INFO: topic 1002 partition 2
[03-14-22 16:50:10:759]::MyConsumerApp1(91FFB700)::INFO: topic 1002 partition 4
From the logs, it is evident that several rebalancing events were triggered by the creation of the second consumer instance. What confounds me, though, is the seemingly inefficient way the rebalancing plays out. At 03-14-22 16:50:06:218 the rebalancing should already have been complete, with consumer 1 holding partitions 0, 2 and 4 of topics 1001 and 1002. However, at 03-14-22 16:50:08:927 librdkafka revokes all of these partitions, only to reassign them again at 03-14-22 16:50:10:759. So my question is whether this behavior is expected and, if so, why the redundancy in revocation and assignment?
Another thing that puzzles me is why some rebalance events involve no partitions at all, for instance at 03-14-22 16:50:04:624 and 03-14-22 16:50:08:279.
I am using librdkafka 1.6.2, Apache Kafka 2.8.0, and CentOS 7, with "cooperative-sticky" as the consumer's "partition.assignment.strategy".
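For context, here is a minimal sketch of how such a rebalance callback is typically implemented for the cooperative protocol (illustrative only, not my exact code; the class name and logging are placeholders):

```cpp
#include <librdkafka/rdkafkacpp.h>
#include <iostream>
#include <string>
#include <vector>

// Illustrative rebalance callback for partition.assignment.strategy=cooperative-sticky.
class ExampleRebalanceCb : public RdKafka::RebalanceCb {
 public:
  void rebalance_cb(RdKafka::KafkaConsumer *consumer, RdKafka::ErrorCode err,
                    std::vector<RdKafka::TopicPartition *> &partitions) override {
    const bool cooperative = consumer->rebalance_protocol() == "COOPERATIVE";

    if (err == RdKafka::ERR__ASSIGN_PARTITIONS) {
      for (const auto *tp : partitions)
        std::cerr << "assign " << tp->topic() << " [" << tp->partition() << "]\n";
      if (cooperative) {
        // Cooperative protocol: add only the listed partitions to the current assignment.
        RdKafka::Error *e = consumer->incremental_assign(partitions);
        if (e) { std::cerr << e->str() << "\n"; delete e; }
      } else {
        // Eager protocol: replace the whole assignment.
        consumer->assign(partitions);
      }
    } else if (err == RdKafka::ERR__REVOKE_PARTITIONS) {
      for (const auto *tp : partitions)
        std::cerr << "revoke " << tp->topic() << " [" << tp->partition() << "]\n";
      if (cooperative) {
        // Cooperative protocol: remove only the listed partitions, keep the rest.
        RdKafka::Error *e = consumer->incremental_unassign(partitions);
        if (e) { std::cerr << e->str() << "\n"; delete e; }
      } else {
        // Eager protocol: drop the entire assignment.
        consumer->unassign();
      }
    }
  }
};
```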

Related

SQS issue in which the approximate age of oldest message alarm is high

I have an SQS queue with a visibility timeout of 30 minutes, and we have set up an alarm on the approximate age of the oldest message with a threshold of 14000 for 1 data point within 1 minute. The alarm fires every time, the issue keeps recurring, and I am not sure what to do. Any suggestions I can follow? The queue's default retention period is 1 day. Thanks.

AWS SQS FIFO queue behaviour with AWS Lambda reserved concurrency

I have 10 message groups in a FIFO queue with 2 messages per group, and Lambda reserved concurrency set to 5. (The Lambda completes execution in 1 minute and the SQS visibility timeout is set to 2 minutes.)
When all 20 messages are pushed to the queue, the number of in-flight messages goes to 10; after the execution time, 5 messages are processed successfully and the other 5 move to the DLQ.
On the next round the in-flight count goes to 5 (matching the reserved Lambda concurrency of 5) and processing proceeds as expected. (Is this the expected behaviour?)
Any particular reason why this is happening?

Will an SQS delay apply at the queue level or message group level of a FIFO queue?

An AWS SQS FIFO queue has a batch setting of 1 and a delay of 1 second. Every item received is associated with a MessageGroup.
All at once the queue receives 30 messages across 10 different message groups with each message group accounting for 3 messages...
Will the delay of one second apply at the queue level i.e. the 30 messages will take an elapsed time of 30 seconds to deliver?
Or will the queue spin up 10 consumers, one for each message group, emptying the queue in 3 seconds?
Will the delay of one second apply at the queue level i.e. the 30 messages will take an elapsed time of 30 seconds to deliver?
For FIFO, the delay is applied at queue level:
FIFO queues don't support per-message delays, only per-queue delays. If your application sets the same value of the DelaySeconds parameter on each message, you must modify your application to remove the per-message delay and set DelaySeconds on the entire queue instead.
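For illustration, here is a minimal sketch of setting DelaySeconds on the whole queue with the AWS SDK for C++ (the queue URL is a placeholder and error handling is mostly elided; assume the queue already exists):

```cpp
#include <aws/core/Aws.h>
#include <aws/sqs/SQSClient.h>
#include <aws/sqs/model/SetQueueAttributesRequest.h>
#include <iostream>

int main() {
  Aws::SDKOptions options;
  Aws::InitAPI(options);
  {
    Aws::SQS::SQSClient sqs;

    // Apply the 1-second delay to the whole FIFO queue rather than per message.
    Aws::SQS::Model::SetQueueAttributesRequest request;
    request.SetQueueUrl("https://sqs.us-east-1.amazonaws.com/123456789012/example-queue.fifo");
    request.AddAttributes(Aws::SQS::Model::QueueAttributeName::DelaySeconds, "1");

    auto outcome = sqs.SetQueueAttributes(request);
    if (!outcome.IsSuccess())
      std::cerr << outcome.GetError().GetMessage() << std::endl;
  }
  Aws::ShutdownAPI(options);
  return 0;
}
```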
The number of consumer Lambdas working in parallel does not need to be 10, as described below:
In SQS FIFO queues, using more than one MessageGroupId enables Lambda to scale up and process more items in the queue using a greater concurrency limit. Total concurrency is equal to or less than the number of unique MessageGroupIds in the SQS FIFO queue.
Thus you can still have fewer consumer Lambdas than groups, but in the ideal situation you would have 10 Lambdas working in parallel. How this works with Lambda is explained in the following AWS blog post:
New for AWS Lambda – SQS FIFO as an event source

Maximum number of concurrent tasks in 1 DPU in AWS Glue

A standard DPU in AWS Glue comes with 4 vCPU and 2 executors.
I am confused about the maximum number of concurrent tasks that can run in parallel with this configuration. Is it 4 or 8 on a single DPU with 4 vCPUs and 2 executors?
I had a similar discussion with the AWS Glue support team about this; I'll share what they told me about Glue configuration. Take the Standard and G.1X configurations as examples.
Standard DPU Configuration:
1 DPU reserved for MasterNode
1 executor reserved for Driver/ApplicationMaster
Each DPU is configured with 2 executors
Each executor is configured with 5.5 GB memory
Each executor is configured with 4 cores
G.1X WorkerType Configuration:
1 DPU added for MasterNode
1 DPU reserved for Driver/ApplicationMaster
Each worker is configured with 1 executor
Each executor is configured with 10 GB memory
Each executor is configured with 8 cores
If we have, for example, a job with the Standard configuration and 21 DPUs, that means we have:
1 DPU reserved for Master
20 DPU x 2 = 40 executors
40 executors - 1 Driver/AM = 39 executors
We therefore end up with a total of 156 cores (39 executors × 4 cores each), meaning your job has 156 slots for execution. If, for example, you read files from S3, you will be able to process 156 input files in parallel.
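As a quick illustration, the arithmetic above can be restated as a tiny helper (hypothetical code, not a Glue API):

```cpp
#include <iostream>

// Restates the arithmetic above for the Standard worker type (not a Glue API):
// 1 DPU reserved for the master node, 2 executors per remaining DPU,
// minus 1 executor for the driver/ApplicationMaster, 4 cores per executor.
int standardExecutionSlots(int totalDpus) {
  int workerDpus = totalDpus - 1;       // master node
  int executors  = workerDpus * 2 - 1;  // driver/AM takes one executor
  return executors * 4;                 // 4 task slots per executor
}

int main() {
  std::cout << standardExecutionSlots(21) << std::endl;  // prints 156
  return 0;
}
```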
Hope it helps.

Is there a way to monitor AWS SQS traffic between specific time interval in a day using AWS CloudWatch?

In my project, I add messages to my AWS SQS queue only between 4 am and 7 pm Pacific time. After 7 pm Pacific time and until 4 am the next day, I do not add any messages to the queue.
So I would like to monitor the queue only between 4 am and 7 pm Pacific time; otherwise, one of my monitoring conditions triggers an alarm during a valid no-message period (after 7 pm).
My monitoring conditions are:
(1) If no messages are added to the queue for more than 10 minutes between 4 am and 7 pm Pacific time, raise an alarm.
(2) If no messages are removed from the queue for more than 10 minutes between 4 am and 7 pm Pacific time, raise an alarm.
Is this feasible? If so, please explain how?
I appreciate your time spent on my question.
How about using AWS Lambda (or a cron job on an instance) to do this?
Create a Lambda that runs at 3:50 am to set up (enable) the CloudWatch alarm for the SQS traffic.
Create a Lambda that runs at 7 pm to disable this CloudWatch alarm.
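For illustration, here is a minimal sketch of the calls those two scheduled jobs would make, using the AWS SDK for C++ (the alarm name is a placeholder):

```cpp
#include <aws/core/Aws.h>
#include <aws/monitoring/CloudWatchClient.h>
#include <aws/monitoring/model/EnableAlarmActionsRequest.h>
#include <aws/monitoring/model/DisableAlarmActionsRequest.h>

// Enable the alarm just before the 4 am window opens, disable it at 7 pm.
// "sqs-no-traffic-alarm" is a placeholder alarm name.
void setAlarmEnabled(bool enabled) {
  Aws::CloudWatch::CloudWatchClient cw;
  if (enabled) {
    Aws::CloudWatch::Model::EnableAlarmActionsRequest req;
    req.AddAlarmNames("sqs-no-traffic-alarm");
    cw.EnableAlarmActions(req);
  } else {
    Aws::CloudWatch::Model::DisableAlarmActionsRequest req;
    req.AddAlarmNames("sqs-no-traffic-alarm");
    cw.DisableAlarmActions(req);
  }
}

int main() {
  Aws::SDKOptions options;
  Aws::InitAPI(options);
  setAlarmEnabled(true);    // what the 3:50 am job would do
  // setAlarmEnabled(false); // what the 7 pm job would do
  Aws::ShutdownAPI(options);
  return 0;
}
```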