Azure Webjobs SDK and Queue handling errors - azure-webjobs

I'm working on a WebJob that checks a web server for some data. Sometimes the web server is busy and responds with a "try again in a few seconds" message.
The problem is that if I just ignore this and let the function complete, the message will be deleted from the queue and I won't be able to retry the call.
My current solution (which I don't like at all) is to throw an exception, which increases the dequeue count and puts the message back on the queue. This seems very brutal, however, and requires the thread the WebJob is running on to restart.
Is there any other way of handling this?

The WebJobs SDK will retry queue-triggered functions if they throw an exception. The number of retries is configured through the JobHostConfiguration.Queues.MaxDequeueCount property as described here.
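For reference, here is a minimal sketch of where that setting lives when you build the host yourself (this assumes the classic JobHost-based WebJobs SDK; 5 is just an example value):
using Microsoft.Azure.WebJobs;

class Program
{
    static void Main()
    {
        var config = new JobHostConfiguration();

        // After this many failed dequeues the message is moved to the poison queue.
        config.Queues.MaxDequeueCount = 5; // example value

        var host = new JobHost(config);
        host.RunAndBlock();
    }
}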
If you don't want to rely on the SDK's retry policy, you can always handle the exception in your own code and retry: just catch the exception (or whatever happens when the server is busy) and retry the operation until it succeeds. However, this is exactly what the SDK does behind the scenes, and by doing it yourself you lose the automatic move-to-poison-queue functionality.
Unless the function cannot be made reentrant, I recommend letting the SDK do the retrying. If you still decide to handle it yourself, here is what the code could look like:
public static void WebJobFunction([QueueTrigger("temp")] string message)
{
    int retriesLeft = 5;
    while (true)
    {
        try
        {
            --retriesLeft;
            DoServerOperation(message);
            break; // success, stop retrying
        }
        catch (TimeoutException)
        {
            // Out of retries: rethrow so the SDK increments the dequeue count
            // and eventually moves the message to the poison queue.
            if (retriesLeft == 0)
            {
                throw;
            }
        }
    }
}

Related

How to add an event to grpc completion queue in C++?

I am trying to implement an async gRPC server where, whenever the client makes a call, it gets an indefinite stream of messages. I read through the official documentation; it doesn't cover the scenario where I want to keep the stream open for each RPC. This article - https://www.gresearch.co.uk/article/lessons-learnt-from-writing-asynchronous-streaming-grpc-services-in-c/ - addresses the issue of keeping the stream open by basically putting the callback handler object back in the completion queue.
The article suggests:
if (HasData())
{
    responder_.Write(reply_, this);
}
else
{
    grpc::Alarm alarm;
    alarm.Set(completion_queue_, gpr_now(gpr_clock_type::GPR_CLOCK_REALTIME), this);
}
I tried the Alarm approach suggested in the article, but for some reason the next Next() call on the completion queue returns ok as false - GPR_ASSERT(cq_->Next(&tag, &ok));. As a result, I have to close the server and cannot wait on the stream until further data is available.
I am able to receive data fine as long as the else branch is not hit.
Could someone please help me identify what I might be doing wrong? I am unable to find many C++ resources on gRPC. Thanks!
When the Alarm goes out of scope, it generates a Cancel(), which is why you get !ok in Next().
If you want to use this approach, you need to put the Alarm in class scope and arm it there:
std::unique_ptr<Alarm> alarm_;   // class member, so it survives after the handler returns

alarm_.reset(new Alarm);
alarm_->Set(cq_, grpc_timeout_seconds_to_deadline(10), this);
From the docs on Alarm::Set:
Trigger an alarm instance on completion queue cq at the specified time. Once the alarm expires (at deadline) or it's cancelled (see Cancel), an event with tag tag will be added to cq. If the alarm expired, the event's success bit will be true, false otherwise (ie, upon cancellation).
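Putting that together, here is a rough sketch of the member-alarm pattern (CallData and WaitAndRetryLater are illustrative names, not from the question; gpr_time_add is used here instead of the grpc_timeout_seconds_to_deadline test helper):
#include <grpc/support/time.h>
#include <grpcpp/alarm.h>
#include <grpcpp/grpcpp.h>
#include <memory>

// Illustrative per-RPC handler fragment, not a complete server.
struct CallData {
    grpc::ServerCompletionQueue* cq_;      // the queue driving the async server
    std::unique_ptr<grpc::Alarm> alarm_;   // member, so it is not cancelled by scope exit

    void WaitAndRetryLater() {
        alarm_.reset(new grpc::Alarm);
        // Fire in roughly one second; when the alarm expires, cq_->Next()
        // returns this object as the tag with ok == true, and the handler
        // can check for new data again.
        gpr_timespec deadline = gpr_time_add(
            gpr_now(GPR_CLOCK_MONOTONIC),
            gpr_time_from_seconds(1, GPR_TIMESPAN));
        alarm_->Set(cq_, deadline, this);
    }
};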

Exception handling in a batch of Event Hub events using Azure WebJobs Sdk

I use the Event Hub support of the Azure WebJobs SDK to process events. Because of the throughput I decided to go for batch processing of those events, e.g. my method looks like this:
public static void HandleBatchRaw([EventHubTrigger("std")] EventData[] events) {...}
Now one of those events within a batch might cause an exception - what's the right way to handle that? When I leave the exception uncaught, processing stops and the remaining events in the EventData[] parameter are lost.
Options:
1. Catch the exception manually, forward the event to some place else, and continue.
2. Let the SDK do the magic, e.g. it should just 'ACK' the events processed until then (I would probably have to do that myself), mark this event as 'poisoned', exit the method, and continue on the next call of the function.
3. Move to single-event handling - but for performance reasons I don't feel that's right.
4. I missed the point and should think of another strategy.
How should I approach this?
There are only four choices in any messaging solution:
1. Stop
2. Drop
3. Retry
4. Dead letter
You have to do that yourself; I don't believe the SDK will retry anything. Recall that there is no ACK for an Event Hubs read - you just read.
How are you checkpointing?
Your best bet is probably your option #1. The WebJobs Event Hub binding doesn't give you many options here. Feel free to file an issue at https://github.com/Azure/azure-webjobs-sdk/issues to request better error-handling support.
If you want to see exactly what it's doing under the hood, here's the spot in the WebJobs SDK EventHub binding that receives events via EventProcessorHost:
https://github.com/Azure/azure-webjobs-sdk/blob/dev/src/Microsoft.Azure.WebJobs.ServiceBus/EventHubs/EventHubListener.cs#L86
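For what it's worth, option #1 from the question could look roughly like this (ProcessEvent and ForwardToErrorStore are placeholder names, not SDK APIs, and the exact EventData namespace depends on your SDK version):
public static void HandleBatchRaw([EventHubTrigger("std")] EventData[] events)
{
    foreach (var ev in events)
    {
        try
        {
            ProcessEvent(ev); // normal per-event processing
        }
        catch (Exception ex)
        {
            // Forward the failing event somewhere (blob storage, another hub,
            // a log, ...) and keep going so the rest of the batch is not lost.
            ForwardToErrorStore(ev, ex);
        }
    }
}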

Wait for async calls to return before exiting program when using AWS-cpp-sdk

I am doing a POC on using the AWS C++ SDK, for which I wrote a simple program that sends messages to an SQS queue.
I am using the SendMessageAsync method to send the messages, as shown below.
sqsClient->SendMessageAsync(sendMessageRequest, &sendMessageCallBack);
My program crashes because it exits before the async method returns, and Aws::ShutdownAPI(options); terminates the threads created by the async call.
I found that the AWS SDK for Java suggests the following for exactly this scenario:
/**
 * Shuts down the client, releasing all managed resources. This includes
 * forcibly terminating all pending asynchronous service calls. Clients who
 * wish to give pending asynchronous service calls time to complete should
 * call getExecutorService().shutdown() prior to calling this method.
 */
@Override
public void shutdown() {
    super.shutdown();
    executorService.shutdownNow();
}
I am unable to find anything equivalent in the AWS C++ SDK.
Can someone suggest the best way to fix this issue?
You are responsible for making sure your requests are done before calling ShutdownAPI(). This is usually only an issue in contrived "sample app" sorts of scenarios where you are performing the operation directly inside your main() function. You also need to make sure the SQS client is deleted before ShutdownAPI is called.
One option is to use a std::condition_variable (as a semaphore) to synchronize before exiting. You can pass it to your callback and call notify_one() at the end of the callback. Then, before shutting down, you can wait() on it, as in the sketch below.
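A rough sketch of that pattern with the C++ SDK (this assumes sqsClient is an Aws::SQS::SQSClient* and that the lambda matches SendMessageAsync's handler signature; it belongs between Aws::InitAPI and Aws::ShutdownAPI):
#include <aws/core/Aws.h>
#include <aws/sqs/SQSClient.h>
#include <aws/sqs/model/SendMessageRequest.h>
#include <condition_variable>
#include <memory>
#include <mutex>

// ... inside main(), after Aws::InitAPI(options) and after building sendMessageRequest ...

std::mutex mtx;
std::condition_variable cv;
bool done = false;

auto sendMessageCallBack =
    [&](const Aws::SQS::SQSClient*,
        const Aws::SQS::Model::SendMessageRequest&,
        const Aws::SQS::Model::SendMessageOutcome& /*outcome*/,
        const std::shared_ptr<const Aws::Client::AsyncCallerContext>&) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            done = true;        // the async call has finished
        }
        cv.notify_one();        // wake up the waiting main thread
    };

sqsClient->SendMessageAsync(sendMessageRequest, sendMessageCallBack);

// Block until the callback has fired; only then destroy the client
// and call Aws::ShutdownAPI(options).
std::unique_lock<std::mutex> lock(mtx);
cv.wait(lock, [&] { return done; });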

Duplicated account login checking in server

The communication is socket-based and the connection is keep-alive. Users log in with an account name, and I need to implement a feature where, when two users log in with the same account, the earlier one is kicked off.
Code that needs to be updated:
void session::login(const std::string& accountname) // callback when the server receives a login request
{
    boost::shared_ptr<UserData> d = database.get_user_data(accountname);
    this->data = d;
    this->send(login_success); // send a "login success" response
}

boost::shared_ptr<UserData> Database::get_user_data(const std::string& accountname)
{
    // read from db and return the data
}
The simplest way is to improve Database::get_user_data(accountname):
boost::shared_ptr<UserData> Database::get_user_data(const std::string& accountname)
{
    // acquire a boost::unique_lock<> here
    // look for a session (or cached user data) with the same accountname;
    // if one is found, kick that session offline first, then continue
    // read from db and return the data
}
This modification has 2 problems:
1. Poor concurrency, even though the scenario rarely happens. If I need to check whether an account is online, I must cache that somewhere (in the user data or the session), which means writing to a container that needs an exclusive lock whether the account is a duplicate or not. So the concurrency can hardly be improved.
2. Kicking the other one off by calling "other_session->offline()" from "this thread" might run concurrently with other operations executing on that session in another thread.
If I add a lock inside offline(), then all the other functions belonging to session also need that lock, which is obviously not good. Alternatively, I can post an event to other_session and let it handle the event itself, which ensures "offline" executes in its own thread. But that makes "offline" asynchronous, and the code that follows "kick the other one offline" must execute only after "offline" has completed.
I use boost::asio, but I have tried to describe the problem in general terms because I think it is a common problem in server writing. Is there a pattern to solve this? Note that the problem gets more complex when N logins with the same account happen at the same time.
If this scenario rarely happens, I wouldn't worry about it. Locking and releasing a mutex are not long operations that the user would notice (if you had to do it thousands of times a second it could be a problem).
In general trying to fix performance issues that are not there is a bad idea.
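For completeness, the simple mutex approach could be sketched like this (the online_ map, load_from_db and the requester parameter are illustrative additions, not from the original code):
#include <boost/shared_ptr.hpp>
#include <boost/thread/mutex.hpp>
#include <map>
#include <string>

class UserData { /* ... */ };

class session {
public:
    void offline() { /* close the socket, mark the session dead, etc. */ }
};

class Database {
public:
    boost::shared_ptr<UserData> get_user_data(const std::string& accountname,
                                               const boost::shared_ptr<session>& requester)
    {
        boost::mutex::scoped_lock lock(mutex_); // one lock around the whole check-and-replace

        std::map<std::string, boost::shared_ptr<session> >::iterator it = online_.find(accountname);
        if (it != online_.end())
            it->second->offline();              // kick the previous session first, as in the question

        online_[accountname] = requester;       // remember who owns the account now
        return load_from_db(accountname);       // read from db and return the data
    }

private:
    boost::shared_ptr<UserData> load_from_db(const std::string&)
    {
        return boost::shared_ptr<UserData>(new UserData()); // placeholder for the real DB read
    }

    boost::mutex mutex_;
    std::map<std::string, boost::shared_ptr<session> > online_;
};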

Why do SQS messages sometimes remain in-flight on the queue

I'm using Amazon SQS queues in a very simple way. Usually, messages are written and are immediately visible and read. Occasionally, a message is written and remains In Flight (Not Visible) on the queue for several minutes; I can see it from the console. The receive-message wait time is 0, and the default visibility timeout is 5 seconds. The message will stay that way for several minutes, or until a new message gets written that somehow releases it. A few seconds of delay is OK, but more than 60 seconds is not.
There are 8 reader threads that are always long polling, so it's not that nothing is trying to read the message; something is.
Edit: To be clear, none of the consumer reads return any messages at all, and this happens regardless of whether or not the console is open. In this scenario, only one message is involved, and it is just sitting in the queue, invisible to the consumers.
Has anyone else seen this behavior, and what can I do to improve it?
Here is the Java SDK I am using:
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk</artifactId>
    <version>1.5.2</version>
</dependency>
Here is the code that does the reading (max=10, maxWait=0 in the startup config):
void read(MessageConsumer consumer) {
    List<Message> messages = read(max, maxWait);
    for (Message message : messages) {
        if (tryConsume(consumer, message)) {
            delete(message.getReceiptHandle());
        }
    }
}

private List<Message> read(int max, int maxWait) {
    AmazonSQS sqs = getClient();
    ReceiveMessageRequest rq = new ReceiveMessageRequest(queueUrl);
    rq.setMaxNumberOfMessages(max);
    rq.setWaitTimeSeconds(maxWait);
    List<Message> messages = sqs.receiveMessage(rq).getMessages();
    if (messages.size() > 0) {
        LOG.info("read {} messages from SQS queue", messages.size());
    }
    return messages;
}
The log line for "read .." never appears when this is happening, which is what prompts me to go into the console and check whether the message is there or not, and it is.
It sounds like you are misinterpreting what you are seeing.
Messages "in flight" are not pending delivery, they're messages that have already been delivered but not further acted on by the consumer.
Messages are considered to be in flight if they have been sent to a client but have not yet been deleted or have not yet reached the end of their visibility window.
— https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-available-cloudwatch-metrics.html
When a consumer receives a message, it has to, at some point, either delete the message or request an extension of the visibility timeout for that message; otherwise the message automatically becomes visible again once the timeout expires. The visibility timeout is how long the consumer has before one of these things must be done.
Messages should not be "in flight" without something having already received them -- but that "something" can include the console itself, as you'll note on the pop-up you see when you choose "View/Delete Messages" in the console (unless you already checked the "Don't show this again" checkbox):
Messages displayed in the console will not be available to other applications until the console stops polling for messages.
Messages displayed in the console are "in flight" while the console is observing the queue from the "View/Delete Messages" screen.
The part that does not make obvious sense is messages being in flight "for several minutes" if your default visibility timeout is only 5 seconds and nothing in your code is increasing that timeout... however... that could be explained almost perfectly by your consumers not properly disposing of the message, causing it to timeout and immediately be redelivered, giving the impression that a single instance of the message was remaining in-flight, when in fact, the message is briefly transitioning back to visible, only to be claimed almost immediately by another consumer, taking it back to in-flight again.
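As a side note, with the 1.x Java SDK the question uses, the two options the consumer has look roughly like this (sqs, queueUrl and message as in the question's code, with the usual com.amazonaws.services.sqs.model imports):
// Either delete the message once it has been processed successfully ...
sqs.deleteMessage(new DeleteMessageRequest(queueUrl, message.getReceiptHandle()));

// ... or ask for more time before it becomes visible to other consumers again
// (60 seconds here is just an example value).
sqs.changeMessageVisibility(new ChangeMessageVisibilityRequest(
        queueUrl, message.getReceiptHandle(), 60));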
This can happen when you send or lock a message and then try to get a fresh list of messages within a few seconds. Amazon SQS stores the data on multiple servers and in multiple data centers: http://aws.amazon.com/sqs/faqs/#How_reliably_is_my_data_stored_in_Amazon_SQS
To get around these issues you need to wait a bit longer, so that the queue has time to return consistent results.