Event Hubs: send events to random partitions, but each event to exactly one partition - azure-eventhub

I have an Event Hubs publisher, but it is duplicating messages across random partitions multiple times. I want to publish a large number of incoming messages in parallel; each message should go to a random partition, but to exactly one partition, from which the consumer reads the data.
How do I do that? Right now the messages are being duplicated.
EventHubProducerClientOptions producerClientOptions = new EventHubProducerClientOptions
{
RetryOptions = new EventHubsRetryOptions
{
Mode = EventHubsRetryMode.Exponential,
MaximumRetries = 30,
TryTimeout = TimeSpan.FromSeconds(5),
Delay = TimeSpan.FromSeconds(10),
MaximumDelay = TimeSpan.FromSeconds(15),
}
};
using EventDataBatch eventBatch = await producerClient.CreateBatchAsync();
// Add events to the batch. An event is represented by a collection of bytes and metadata.
eventBatch.TryAdd(eventMessage);
string logInfo = $"[PUBLISHED - [{EventId}]] =======> {message}";
logger.LogInformation(logInfo);
// Use the producer client to send the batch of events to the event hub
await producerClient.SendAsync(eventBatch);

Your code sample is publishing your batch to the Event Hubs gateway, where events will be routed to a partition. For a successful publish operation, each event will be sent to one partition only.
"Successful" is the key word in that phrase. You're configuring your retry policy with a TryTimeout of 5 seconds and allowing 30 retries. The duplication that you're seeing is most likely caused by your publish request timing out due to that very short interval: the service receives the batch successfully but cannot acknowledge success before the client gives up, so the client treats the operation as a failure and retries, publishing the same events again.
By default, the TryTimeout interval is 60 seconds. I'm not sure why you've chosen to restrict the timeout to such a small value, but I'd strongly advise reconsidering it. Respectfully, unless you've done profiling and measuring to prove that you need to make changes, I'd advise using the default retry values in their entirety.

Related

Why am I getting 50% of GCP Pub/Sub messages duplicated?

I'm running an analytics pipeline.
Throughput is ~11 messages per second.
My Pub/Sub topic holds around 2M scheduled messages.
80 GCE instances are pulling messages in parallel.
Here is my topic and the subscription:
gcloud pubsub topics create pipeline-input
gcloud beta pubsub subscriptions create pipeline-input-sub \
--topic pipeline-input \
--ack-deadline 600 \
--expiration-period never \
--dead-letter-topic dead-letter
Here is how I pull messages:
import { PubSub, Message } from '@google-cloud/pubsub'
const pubSubClient = new PubSub()
const queue: Message[] = []
const populateQueue = async () => {
const subscription = pubSubClient.subscription('pipeline-input-sub', {
flowControl: {
maxMessages: 5
}
})
const messageHandler = async (message: Message) => {
queue.push(message)
}
subscription.on('message', messageHandler)
}
const processQueueMessage = () => {
const message = queue.shift()
try {
...
message.ack()
} catch {
...
message.nack()
}
processQueueMessage()
}
processQueueMessage()
Processing time is ~7 seconds.
Here is one of the many similar dup cases.
The same message is delivered 5 (!!!) times to different GCE instances:
03:37:42.377
03:45:20.883
03:48:14.262
04:01:33.848
05:57:45.141
All 5 times the message was successfully processed and .ack()ed. The output includes 50% more messages than the input! I'm well aware of the "at least once" behavior, but I thought it may duplicate like 0.01% of messages, not 50% of them.
The topic input is 100% free of duplicates. I verified both the topic input method AND the number of un-acked messages through the Cloud Monitor. Numbers match: there are no duplicates in the pub/sub topic.
UPDATE:
It looks like all those duplicates were created due to ack deadline expiration. I'm 100% sure that I'm acknowledging 99.9% of messages before the 600-second deadline.
Some duplicates are expected, though a 50% duplicate rate is definitely high. The first question is, are these publish-side duplicates or subscribe-side duplicates? The former are created when a publish of the same message is retried, resulting in multiple publishes of the same message. These messages will have different message IDs. The latter is caused by redeliveries of the same message to the subscriber. These messages have the same message ID (though different ack IDs).
It sounds like you have verified that these are subscribe-side duplicates. Therefore, the likely cause, as you mention, is an expired ack deadline. The question is, why are the messages exceeding the ack deadline? One thing to note is that when using the client library, the ack deadline set on the subscription is not the one used. Instead, the client library tries to optimize ack deadlines based on client library settings and the 99th percentile ack latency. It then renews leases on messages for up to the max_lease_duration property of the FlowControl object passed into the subscribe method, which defaults to one hour.
Therefore, in order for messages to remain leased, it is necessary for the client library to be able to send modifyAckDeadline requests to the server. One possible cause of duplicates would be the inability of the client to send these requests, possibly due to overload on the machine. Are the machines running this pipeline doing any other work? If so, it is possible they are overloaded in terms of CPU, memory, or network and are unable to send the modifyAckDeadline requests and unable to process messages in a timely fashion.
It is also possible that message batching could be affecting your ability to ack messages. As an optimization, the Pub/Sub system stores acknowledgements for batches of messages instead of individual messages. As a result, all messages in a batch must be acknowledged in order for any of them to be considered acknowledged. Therefore, if you have five messages in a batch and acknowledge four of them, but then do not ack the final message, all five will be redelivered. There are some caches in place to try to minimize this, but it is still a possibility. There is a Medium post that discusses this in more detail (see the "Message Redelivery & Duplication Rate" section). It might be worth checking that all messages are acked and not nacked in your code by printing out the message ID as soon as the message is received and right before the calls to ack and nack. If your messages were published in batches, it is possible that a single nack is causing redelivery of more messages.
This coupling between batching and duplicates is something we are actively working on improving. I would expect this issue to stop at some point. In the meantime, if you have control over the publisher, you could set the max_messages property in the batch settings to 1 to prevent the batching of messages.
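If you do control the publisher, a minimal sketch of that setting might look like the following, assuming the Java client library (the project/topic names and thresholds are placeholders; the other client libraries expose an equivalent batching option):
// Hedged sketch: disable publish-side batching so each message is published on
// its own, the equivalent of setting max_messages to 1.
import com.google.api.gax.batching.BatchingSettings;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;
import org.threeten.bp.Duration;

public class UnbatchedPublisher {
    public static void main(String[] args) throws Exception {
        TopicName topic = TopicName.of("my-project", "pipeline-input"); // assumed names
        BatchingSettings noBatching = BatchingSettings.newBuilder()
                .setElementCountThreshold(1L)            // publish every message alone
                .setRequestByteThreshold(1024L)          // effectively unused with a count of 1
                .setDelayThreshold(Duration.ofMillis(1))
                .build();
        Publisher publisher = Publisher.newBuilder(topic)
                .setBatchingSettings(noBatching)
                .build();
        try {
            PubsubMessage msg = PubsubMessage.newBuilder()
                    .setData(ByteString.copyFromUtf8("payload"))
                    .build();
            publisher.publish(msg).get(); // wait for the publish to complete
        } finally {
            publisher.shutdown();
        }
    }
}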
If none of that helps, it would be best to open up a support case and provide the project name, subscription name, and message IDs of some duplicated messages. Engineers can investigate in more detail why individual messages are getting redelivered.

How to check if the topic-queue is empty and then terminate the subscriber?

In my business application I have to batch-process all the messages from a topic periodically because it is cheaper than processing them in a first-come-first-serve fashion. The current way I am planning to do it is have a cronjob that runs the subscriber every T hours. The problem that I am currently solving is how to terminate the subscriber once all the messages have been processed. I want to fire up the cronjob every T hours, let the subscriber consume all the messages in the topic-queue and terminate. From what I understand, there is no pub-sub Java API that tells me whether the topic-queue is empty or not. I have come up with the following 2 solutions:
Create a subscriber that pulls asynchronously. Sleep for t minutes while it consumes all the messages and then terminate it using subscriber.stopAsync().awaitTerminated();. In this approach, there is a possibility I might not consume all the messages before terminating the subscriber. A Google example is here
Use Pub/Sub Cloud monitoring to find the value of the metric subscription/num_undelivered_messages. Then pull that many messages using the synchronous pull example provided by Google here. Then terminate the Subscriber.
Is there a better way to do this?
Thanks!
It might be worth considering whether or not Cloud Pub/Sub is the right technology to use for this case. If you want to do batch processing, you might be better off storing the data in Google Cloud Storage or in a database. Cloud Pub/Sub is really best for continuous pulling/processing of messages.
The two suggestions you have are trying to determine when there are no more messages to process. There isn't really a clean way to do this. Your first suggestion is possible, though keep in mind that while most messages will be delivered extremely quickly, there can be outliers that take longer to be sent to your subscriber. If it is critical that all outstanding messages be processed, then this approach may not work. However, if it is okay for messages to occasionally be processed the next time you start up your subscriber, then you could use this approach. It would be best to set up a timer that tracks the time since you last received a message, as guillaum blaquiere suggests, though I would use a timeout on the order of 1 minute and not 100ms.
Your second suggestion of monitoring the number of undelivered messages and then sending a pull request to retrieve that many messages would not be as viable an approach. First of all, the max_messages property of a pull request does not guarantee that all available messages up to max_messages will be returned. It is possible to get zero messages back in a pull response and still have undelivered messages. Therefore, you'd have to keep a count of messages received and try to match the num_undelivered_messages metric. You'd also have to account for duplicate delivery in this scenario and for the fact that the Stackdriver monitoring metrics can lag behind the actual values. If the value is too large, you may keep pulling, trying to get messages you won't receive. If the value is too small, you may not get all of the messages.
Of the two approaches, the one that tracks how long since the last message has been received is the better one, but with the caveats mentioned.
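For reference, here is a minimal, hedged Java sketch of that first approach, assuming the modern google-cloud-pubsub client: a streaming-pull Subscriber that is stopped once no message has arrived for about a minute. The project and subscription names are placeholders.
// Hedged sketch: stop the subscriber after ~1 minute with no new messages.
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;
import java.util.concurrent.atomic.AtomicLong;

public class DrainingSubscriber {
    public static void main(String[] args) throws Exception {
        ProjectSubscriptionName subscription =
                ProjectSubscriptionName.of("my-project", "my-subscription"); // assumed names
        AtomicLong lastMessageAt = new AtomicLong(System.currentTimeMillis());

        MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
            lastMessageAt.set(System.currentTimeMillis());
            // ... process the message ...
            consumer.ack();
        };

        Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
        subscriber.startAsync().awaitRunning();

        // Terminate once no message has arrived for one minute.
        while (System.currentTimeMillis() - lastMessageAt.get() < 60_000) {
            Thread.sleep(1_000);
        }
        subscriber.stopAsync().awaitTerminated();
    }
}
As noted above, outlier messages can still arrive after the idle window; anything missed is simply picked up by the next scheduled run.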
I did this same implementation in Go a few months ago. My assumptions were the following:
If there are messages in the queue, the app consumes them very quickly (less than 100ms between two messages).
If the queue is empty (my app has finished consuming all the messages), new messages can still arrive, but more than 100ms apart.
So I implemented this:
* Each time I receive a message:
* I suspend the 100ms timeout
* I process and ack the message
* I reset the 100ms timeout
* If the 100ms timeout fires, I terminate my pull subscription
In my use case, I schedule my processing every 10 minutes. So I set a global timeout of 9m30s to finish the processing and let the next app instance continue it.
One tricky detail: for the first message, set the timeout to 2s. The first message takes longer to arrive because of connection establishment, so set a flag when you initialize your timeout to track whether you are still waiting for the first message.
I can share my Go code if it helps with your implementation.
EDIT
Here is my Go code for the message handling:
func (pubSubService *pubSubService) Received() (msgArray []*pubsub.Message, err error) {
	ctx := context.Background()
	cctx, cancel := context.WithCancel(ctx)

	// Connect to Pub/Sub
	client, err := pubsub.NewClient(cctx, pubSubService.projectId)
	if err != nil {
		log.Fatalf("Impossible to connect to pubsub client for project %s", pubSubService.projectId)
	}

	// Put all the messages in an array. It will be processed at the end (stored to BQ, as is)
	msgArray = []*pubsub.Message{}

	// Channel to receive messages
	var receivedMessage = make(chan *pubsub.Message)

	// Handler to receive messages (through the channel) or cancel the context if the timeout is reached
	go func() {
		// Initial timeout, because the first receive takes longer than the others
		timeOut := time.Duration(3000)
		for {
			select {
			case msg := <-receivedMessage:
				// After the first receive, the timeout is changed
				timeOut = pubSubService.waitTimeOutInMillis // Environment variable = 200
				msgArray = append(msgArray, msg)
			case <-time.After(timeOut * time.Millisecond):
				log.Debug("Cancel by timeout")
				cancel()
				return
			}
		}
	}()

	// Global timeout
	go func() {
		globalTimeOut := pubSubService.globalWaitTimeOutInMillis // Environment variable = 750
		time.Sleep(globalTimeOut * time.Second)
		log.Debug("Cancel by global timeout")
		cancel()
	}()

	// Connect to the subscription and pull from it until the context is canceled
	sub := client.Subscription(pubSubService.subscriptionName)
	err = sub.Receive(cctx, func(ctx context.Context, msg *pubsub.Message) {
		receivedMessage <- msg
		msg.Ack()
	})
	return msgArray, err
}

SQS returns more messages than queue size

I created a basic non-FIFO queue, and put only 1 message on it. I retrieve the message using the following code:
ReceiveMessageRequest request = new ReceiveMessageRequest();
request.setQueueUrl(queueUrl);
request.setMaxNumberOfMessages(10);
request.withMessageAttributeNames("All");
ReceiveMessageResult result = sqsClient.receiveMessage(request);
List<Message> messages = result.getMessages();
messages.size() gives 3
They have:
Same MessageId
Same body and attributes
Same MD5OfBody
Different ReceiptHandle
Changing MaxNumberOfMessages from 10 to 1 fixed it, but I want to receive in batch of 10 in the future.
Can someone explain why it is retrieving more messages than it should?
Below is my queue configuration:
Default visibility timeout = 0
message retention = 4 days
max message size = 256kb
delivery delay = 0
receive message wait time = 0
no redrive policy
Details / complement to Michael - sqlbot's comment.
Setting the SQS visibility timeout to a small value is not going to fix your problem; you are going to hit it again. Use 30 seconds or more in order to give your program time to consume the message. (To cater for program crashes or unexpected delays, you should also create a redrive policy to mitigate these issues.)
AWS has mentioned this in At-Least-Once Delivery:
Amazon SQS stores copies of your messages on multiple servers for redundancy and high availability. On rare occasions, one of the servers that stores a copy of a message might be unavailable when you receive or delete a message.
If this occurs, the copy of the message will not be deleted on that unavailable server, and you might get that message copy again when you receive messages. You should design your applications to be idempotent (they should not be affected adversely when processing the same message more than once).
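For completeness, a hedged sketch of the 30-second suggestion above using the AWS SDK for Java v1 (the queue URL is a placeholder); the same attribute can be set in the console or at queue creation:
// Hedged sketch: raise the queue's default visibility timeout to 30 seconds so a
// received message stays hidden while the consumer processes and deletes it.
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.SetQueueAttributesRequest;
import java.util.Collections;

public class RaiseVisibilityTimeout {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"; // placeholder
        sqs.setQueueAttributes(new SetQueueAttributesRequest()
                .withQueueUrl(queueUrl)
                .withAttributes(Collections.singletonMap("VisibilityTimeout", "30")));
    }
}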
Changing the Default Visibility Timeout from 0 to 1 second fixed the issue.

Cloud pubsub slow poll rate

I have a pubsub topic, with one subscription, and two different subscribers are pulling from it.
Using stackdriver, I can see that the subscription has ~1000 messages.
Each subscriber runs the following poll loop:
client = pubsub.Client()
topic = client.topic(topic_name)
subscription = pubsub.Subscription(subscription_name)
while True:
    messages = subscription.pull(return_immediately=True, max_messages=100, client=client)
    print len(messages)
    # put messages in a local queue for later processing. Those processes will ack the subscription
My issue is a slow poll rate - even though I have plenty of messages waiting to be pulled, I'm getting only a few messages each time, and lots of responses come back without any messages. According to Stackdriver, my message pull rate is ~1.5 messages/sec.
I tried using return_immediately=False, and it improved things a bit - the pull rate increased to ~2.5 messages/sec, but that is still not the rate I would expect.
Any ideas how to increase pull rate? Any pubsub poll best practices?
In order to increase your pull rate, you need to have more than one outstanding pull request at a time. How many depends on how fast and from how many places you publish. You'll need at least a few outstanding at all times. As soon as one of them returns, create another pull request. That way, whenever Cloud Pub/Sub is ready to deliver messages to your subscriber, you have requests waiting to receive messages.
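For what it's worth, the newer client libraries do this for you: the streaming-pull Subscriber keeps requests outstanding continuously, and you can open several streams in parallel. A hedged Java sketch of that idea (project and subscription names assumed; the loop in the question uses the older Python API):
// Hedged sketch: a streaming-pull Subscriber with several parallel streams, so
// there are always outstanding requests ready to receive messages.
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class ParallelPullSubscriber {
    public static void main(String[] args) {
        ProjectSubscriptionName subscription =
                ProjectSubscriptionName.of("my-project", "my-subscription"); // assumed names
        MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
            // ... hand the message to your local processing queue ...
            consumer.ack();
        };
        Subscriber subscriber = Subscriber.newBuilder(subscription, receiver)
                .setParallelPullCount(4) // keep several streaming pulls open at once
                .build();
        subscriber.startAsync().awaitRunning();
        subscriber.awaitTerminated(); // block while messages are delivered
    }
}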

Subscribing to AWS SQS Messages

I have a large number of messages in an AWS SQS queue. These messages are pushed to it constantly by other sources, and there is no predictable pattern for how often they arrive. Currently, I poll SQS every second and check whether any messages are available. Is there a better way of handling this, like receiving a notification from SQS or SNS that messages are available, so that I only query SQS when needed instead of constantly polling?
The way to do what you want is to use long polling - rather than constantly poll every second, you open a request that stays open until it either times out or a message comes into the queue. Take a look at the documentation for ReceiveMessageRequest
ReceiveMessageRequest req = new ReceiveMessageRequest()
.withWaitTimeSeconds(Integer.valueOf(20)); // set long poll timeout to 20 sec
// set other properties on the request as well
ReceiveMessageResult result = amazonSQS.receiveMessage(req);
A common usage pattern for this is to have a background thread running the long poll and pushing the results into an internal queue (such as LinkedBlockingQueue or an ExecutorService) for a worker thread to read from.
PS. Don't forget to call deleteMessage once you're done processing the result so you don't end up receiving it again.
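A hedged Java sketch of that pattern (AWS SDK for Java v1; the queue URL is a placeholder): one thread long-polls SQS and feeds a LinkedBlockingQueue, and a worker thread processes and deletes each message.
// Hedged sketch: long-poll in a background thread, process in a worker thread.
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SqsLongPollWorker {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"; // placeholder
        BlockingQueue<Message> internalQueue = new LinkedBlockingQueue<>();

        Thread poller = new Thread(() -> {
            ReceiveMessageRequest req = new ReceiveMessageRequest(queueUrl)
                    .withWaitTimeSeconds(20)       // long poll
                    .withMaxNumberOfMessages(10);
            while (!Thread.currentThread().isInterrupted()) {
                sqs.receiveMessage(req).getMessages().forEach(internalQueue::add);
            }
        });
        poller.start();

        Thread worker = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Message msg = internalQueue.take();
                    // ... process msg.getBody() ...
                    sqs.deleteMessage(queueUrl, msg.getReceiptHandle()); // avoid redelivery
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        worker.start();
    }
}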
You can also use the worker functionality in AWS Elastic Beanstalk. It allows you to build a worker to process each message, and when you use Elastic Beanstalk to deploy it to an EC2 instance, you can define it as subscribed to a specific queue. Each message will then be POSTed to the worker, without you needing to call receive-message on the queue.
It makes your system wiring much easier. You can also have auto scaling rules that spawn multiple workers to handle more messages in times of peak load and scale back down to a single worker when the load is low. It will also delete the message automatically if you respond with OK from your worker.
See more information about it here: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html
You could also have a look at Shoryuken and the property delay:
delay: 25 # The delay in seconds to pause a queue when it's empty
But to be honest, we use delay: 0 here; the cost of SQS is low:
First 1 million Amazon SQS Requests per month are free
$0.50 per 1 million Amazon SQS Requests per month thereafter ($0.00000050 per SQS Request)
A single request can have from 1 to 10 messages, up to a maximum total payload of 256KB.
Each 64KB ‘chunk’ of payload is billed as 1 request. For example, a single API call with a 256KB payload will be billed as four requests.
You will probably spend less than 10 dollars monthly polling messages every second, 24x7, on a single host.
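As a rough, hedged back-of-the-envelope check (assuming one ReceiveMessage call per second, each under 64KB and therefore billed as a single request):
1 request/sec x 86,400 sec/day x 30 days ≈ 2.6 million requests/month
2.6 million - 1 million free tier = 1.6 million billable requests
1.6 million x $0.50 per million ≈ $0.80 per month per host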
One of the advantages of Shoryuken is that it fetches messages in batches, so it saves some money compared with fetch-per-message solutions.