TryTimeout for EventHubProducerClient SendAsync - azure-eventhub

We can set a timeout value for EventHubProducerClient.SendAsync() at client creation time using EventHubProducerClientOptions.
I have initialized EventHubProducerClientOptions.TryTimeout to 10 seconds and EventHubProducerClientOptions.MaximumRetries to 3.
Does the TryTimeout value apply to each individual retry (10 seconds * 3 retries), or is the TryTimeout value shared across all retries?
Example:
If each individual retry takes 5 seconds before failing, the first approach would time out after 15 seconds, versus the second approach, which would time out after 10 seconds with 2 retries.

The TryTimeout governs a single service operation. In the case of transient failures with retries, the EventHubsRetryOptions for MaximumDelay, Mode, and (sometimes) Delay control the delay between those attempts.
You'll see the call take up to (MaximumRetries * TryTimeout) plus the retry policy delays. With the default exponential retry pattern, delays contain a small amount of random jitter, so you cannot reasonably predict the exact total delay.
To illustrate, let's take your scenario and assume that each service operation times out. The pattern that you would see is:
Your application makes a call to SendAsync
The service operation times out after 10 seconds (TryTimeout)
The failure is deemed transient, the retry policy is consulted and specifies a delay. This is governed by the retry options and will be <= the MaximumDelay value.
After the retry delay, the service operation is made a second time and times out after 10 seconds.
The failure is deemed transient, the retry policy specifies a delay.
After the retry delay, the service operation is made for a third time and times out after 10 seconds.
The failure is deemed transient, the retry policy recognizes that it is out of retries (MaximumRetries was set to 3) and does not specify a delay.
The call to SendAsync returns control to your application, throwing a TimeoutException.
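To make the arithmetic concrete, the worst-case pattern above can be modeled with a small sketch. This is a hypothetical model of the timing described in the answer, not the SDK's actual retry implementation; the function name, base delay, and jitter range are assumptions:

```python
import random

def worst_case_send_duration(try_timeout, maximum_retries, base_delay, maximum_delay):
    """Worst-case wall-clock time for a SendAsync-style call in which every
    attempt times out, under an exponential backoff policy with jitter.
    (Hypothetical model mirroring the pattern described above, not the SDK.)"""
    total = 0.0
    for attempt in range(1, maximum_retries + 1):
        total += try_timeout                   # each attempt runs for the full TryTimeout
        if attempt < maximum_retries:          # no delay after the final attempt
            delay = min(base_delay * (2 ** (attempt - 1)), maximum_delay)
            delay *= random.uniform(0.8, 1.2)  # small random jitter, as the answer notes
            total += delay
    return total

# With the question's settings (TryTimeout=10 s, MaximumRetries=3), the call
# takes a bit over 30 seconds in the worst case: 3 * 10 s plus two backoff delays.
duration = worst_case_send_duration(try_timeout=10, maximum_retries=3,
                                    base_delay=0.8, maximum_delay=60)
```

The exact figure varies run to run because of the jitter, which is precisely why the answer says you cannot predict the total delay exactly.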

Related

When a state of a step function times out, does the lambda execution correlated to it continue to be performed?

I want to know whether a Lambda execution continues to run even if the Step Function state correlated to it times out. If it does, how can I stop it?
There is no way to kill a running Lambda. However, you can set its concurrency limit to 0 to stop it from starting further executions.
Standard StepFunctions have a max timeout of 1 year. (yes! One year)
As such any individual task also has a max timeout of 1 year.
(Express StepFunctions have a timeout of 30 seconds mind you)
Lambdas have a max timeout of 15 minutes.
If you need your lambda to complete in a certain amount of time, you are best served by setting your lambda timeout to that - not your state machine. (I see in your comments you say you cannot pass a value for this? If you cannot change it, then you have no choice but to let it run its course.)
Consider StepFunctions and state machines to be orchestrators, but they have very little control over the individual components. They tell each component when to act, but otherwise are stuck waiting on those components to reply before continuing.
If your lambda times out, it will cause your StateMachine to fail that task, as it receives a lambda service error. You can then handle that in the StepFunction without failing the entire process, see:
https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html
You could specifically use TimeoutSecondsPath in your definition to set a specific result if the task times out.
But as stated, no, once a lambda begins execution it will continue until it finishes or it times out at 15 mins / its set timeout.
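The error-handling approach described above can be sketched as a Task state that catches States.Timeout. This is a hypothetical fragment, written as a Python dict for brevity (the real definition would be the equivalent Amazon States Language JSON); the state names and ARN are assumptions:

```python
import json

# Hypothetical Task state: time out the *state* after 60 s and route the
# resulting States.Timeout error to a recovery state instead of failing
# the whole execution. (TimeoutSecondsPath could be used instead to read
# the timeout from the state input, as mentioned above.)
invoke_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",  # assumed Lambda-invoke resource
    "TimeoutSeconds": 60,
    "Catch": [
        {
            "ErrorEquals": ["States.Timeout"],     # raised when the task times out
            "Next": "HandleTimeout"                # hypothetical recovery state
        }
    ],
    "Next": "Done"
}

definition = json.dumps(invoke_state)
```

Note that catching the timeout in the state machine does not stop the Lambda itself; as the answer says, the function keeps running until it finishes or hits its own timeout.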

handling failure after maximum number of retries in google app engine task queues

I am using google-api-python-client and I am using google app engine task queues for some async operations.
For the specific task queue, I am also setting the max number of times that the task should be retried (in my case retries are less likely to be successful, so I want to limit them).
Is there a way to write a handler which can handle the case where the task is still failing even after the specified number of retries?
Basically, if my retry limit is 5, then after 5 unsuccessful retries I want to move the task to a different queue where it can be retried more times with a larger interval between retries; that way it is more likely to succeed.
From here I believe that I can use the X-AppEngine-TaskExecutionCount header in each retry and write some custom logic to know when the task is going to execute for the last time, and achieve this, but I am trying to find out if there is a cleaner way.
By the way, X-AppEngine-TaskExecutionCount specifies (from the doc): "The number of times this task has previously failed during the execution phase. This number does not include failures due to a lack of available instance."
At least presently there is no support for automatically moving a task from one queue to another.
One option is to keep the task on the same queue, increase the max number of retries and use the retry_parameters to customize the retry backoff policy (i.e. the increase of time between retries):
retry_parameters
Optional. Configures retry attempts for failed tasks. This addition allows you to specify the maximum number of times to retry failed tasks in a specific queue. You can also set a time limit for retry attempts and control the interval between attempts.
The retry parameters can contain the following subelements:
task_retry_limit
The maximum number of retry attempts for a failed task. If specified with task_age_limit, App Engine retries the task until both limits are reached. If 0 is specified, the task will not be retried.
task_age_limit (push queues)
The time limit for retrying a failed task, measured from when the task was first run. The value is a number followed by a unit of time, where the unit is s for seconds, m for minutes, h for hours, or d for days. For example, the value 5d specifies a limit of five days after the task's first execution attempt. If specified with task_retry_limit, App Engine retries the task until both limits are reached.
min_backoff_seconds (push queues)
The minimum number of seconds to wait before retrying a task after it fails.
max_backoff_seconds (push queues)
The maximum number of seconds to wait before retrying a task after it fails.
max_doublings (push queues)
The maximum number of times that the interval between failed task retries will be doubled before the increase becomes constant. The constant is: 2**max_doublings * min_backoff_seconds.
But the increase will be gradual (doubling after each failure); you can't get a significant "step"-like jump in the time between retries. Still, it may be a good enough solution for which no additional coding is required. Personally I'd go for this approach.
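The backoff pattern described by those parameters can be sketched as follows. This is my reading of the documented behavior, not code from the App Engine runtime, so treat the exact schedule as an assumption:

```python
def retry_backoff_schedule(retries, min_backoff, max_backoff, max_doublings):
    """Sketch of the push-queue backoff described above: the interval doubles
    after each failure until max_doublings doublings have occurred, then grows
    by the constant increment 2**max_doublings * min_backoff, capped at
    max_backoff. (Interpretation of the docs, not App Engine's actual code.)"""
    intervals = []
    interval = min_backoff
    for i in range(retries):
        intervals.append(min(interval, max_backoff))
        if i < max_doublings:
            interval *= 2                                # still in the doubling phase
        else:
            interval += 2 ** max_doublings * min_backoff  # constant linear increase
    return intervals

# e.g. min_backoff_seconds=10, max_doublings=2: 10, 20, 40, then +40 each retry
schedule = retry_backoff_schedule(retries=6, min_backoff=10,
                                  max_backoff=3600, max_doublings=2)
```

This shows why the answer calls the increase "gradual": even past the doubling phase, each retry only adds a fixed increment rather than jumping to a much slower cadence.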
Another approach is to add that logic to determine if that execution is the final retry of the original task and, if so, enqueue a new corresponding task on a different queue which has the desired "slower" retry policy. I'm unsure if this is what you were referring to in the question and wanted to avoid.
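That second approach can be sketched as a push-task handler that inspects X-AppEngine-TaskExecutionCount and, on the final permitted attempt, re-enqueues to the slower queue. The function signature, queue name, and the exact off-by-one between the header value and task_retry_limit are assumptions here:

```python
TASK_RETRY_LIMIT = 5  # must match task_retry_limit on the fast queue (assumed)

def handle_task(headers, body, do_work, enqueue):
    """Hypothetical push-task handler sketch. `do_work` is the real task logic
    and `enqueue(queue_name, payload)` re-enqueues a task; both are injected
    here so the sketch stays framework-agnostic."""
    # X-AppEngine-TaskExecutionCount = number of *previous* failed executions
    execution_count = int(headers.get('X-AppEngine-TaskExecutionCount', 0))
    try:
        do_work(body)
        return 200
    except Exception:
        if execution_count >= TASK_RETRY_LIMIT - 1:
            # Final allowed attempt failed: hand off to the slow-retry queue
            # and return success so the fast queue stops retrying.
            enqueue('slow-retry-queue', body)
            return 200
        raise  # re-raise so App Engine retries on the same (fast) queue
```

Returning a success status on the hand-off is the key trick: it ends the fast queue's retry cycle while the copied task carries on under the slower policy.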

S3 Lambda trigger double invocation after exactly 10 minutes

We are experiencing double Lambda invocations of Lambdas triggered by S3 ObjectCreated events. Those double invocations happen exactly 10 minutes after the first invocation: not 10 minutes after the first try completes, but 10 minutes after the first invocation happened. The original invocation takes anything in the range of 0.1 to 5 seconds. No invocations result in errors; they all complete successfully.
We are aware of the fact that SQS for example does not guarantee exactly-once but at-least-once delivery of messages and we would accept some of the lambdas getting invoked a second time due to results of the distributed system underneath. A delay of 10 minutes however sounds very weird.
Of about 10k messages 100-200 result in double invocations.
The AWS Support basically says "the 10 minute wait time is by design but we cannot tell you why", which is not at all helpful.
Has anyone else experienced this behaviour before?
How did you solve the issue or did you simply ignore it (which we could do)?
One proposed solution is not to use direct S3-lambda-triggers, but let S3 put its event on SNS and subscribe a Lambda to that. Any experience with that approach?
example log: two invocations, 10 minutes apart, same RequestId
START RequestId: f9b76436-1489-11e7-8586-33e40817cb02 Version: 13
2017-03-29 14:14:09 INFO ImageProcessingLambda:104 - handle 1 records
and
START RequestId: f9b76436-1489-11e7-8586-33e40817cb02 Version: 13
2017-03-29 14:24:09 INFO ImageProcessingLambda:104 - handle 1 records
After a couple of rounds with the AWS support and others and a few isolated trial runs it seems like this is simply "by design". It is not clear why, but it simply happens. The problem is neither S3 nor SQS / SNS but simply the lambda invocation and how the lambda service dispatches the invocations to lambda instances.
The double invocations happen on somewhere between 1% and 3% of all invocations, 10 minutes after the first invocation. Surprisingly, there are even triple (and probably quadruple) invocations, at rates that are powers of the base probability (so roughly 0.09% for triples). The triple invocations happened 20 minutes after the first one.
If you encounter this, you simply have to work around it using whatever you have access to. We for example now store the already processed entities in a Cassandra with a TTL of 1 hour and only responding to messages from the lambda if the entity has not been processed yet. The double and triple invocations all happen within this one hour timeframe.
Not wanting to spin up a data store like Dynamo just to handle this, I did two things to solve our use case:
Write a lock file per function into S3 (which we were already using for this one) and check for its existence on function entry, aborting if present; for this function we only ever want one of it running at a time. The lock file is removed before we call callback on error or success.
Write a request time in the initial event payload and check the request time on function entry; if the request time is too old then abort. We don't want Lambda retries on error unless they're done quickly, so this handles the case where a duplicate or retry is sent while another invocation of the same function is not already running (which would be stopped by the lock file) and also avoids the minimal overhead of the S3 requests for the lock file handling in this case.

Multiple Timers in C++ / MySQL

I've got a service system that gets requests from another system. A request contains information that is stored on the service system's MySQL database. Once a request is received, the server should start a timer that will send a FAIL message to the sender if the time has elapsed.
The problem is, it is a dynamic system that can get multiple requests from the same, or various sources. If a request is received from a source with a timeout limit of 5 minutes, and another request comes from the same source after only 2 minutes, it should be able to handle both. Thus, a timer needs to be enabled for every incoming message. The service is a web-service that is programmed in C++ with the information being stored in a MySQL database.
Any ideas how I could do this?
A way I've seen this often done: Use a SINGLE timer, and keep a priority queue (sorted by target time) of every timeout. In this way, you always know the amount of time you need to wait until the next timeout, and you don't have the overhead associated with managing hundreds of timers simultaneously.
Say at time 0 you get a request with a timeout of 100.
Queue: [100]
You set your timer to fire in 100 seconds.
Then at time 10 you get a new request with a timeout of 50.
Queue: [60, 100]
You cancel your timer and set it to fire in 50 seconds.
When it fires, it handles the timeout, removes 60 from the queue, sees that the next time is 100, and sets the timer to fire in 40 seconds. Say you get another request with a timeout of 100, at time 80.
Queue: [100, 180]
In this case, since the head of the queue (100) doesn't change, you don't need to reset the timer. Hopefully this explanation makes the algorithm pretty clear.
Of course, each entry in the queue will need some link to the request associated with the timeout, but I imagine that should be simple.
Note however that this all may be unnecessary, depending on the mechanism you use for your timers. For example, if you're on Windows, you can use CreateTimerQueue, which I imagine uses this same (or very similar) logic internally.
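The single-timer algorithm above can be sketched language-agnostically; the version below is in Python for brevity (a C++ implementation would use a std::priority_queue the same way). The class and method names are mine, and the "timer" is represented only by the returned seconds-until-next-firing:

```python
import heapq

class TimerQueue:
    """One conceptual timer plus a min-heap of deadlines, as described above.
    Entries are (absolute_deadline, request_id) pairs."""
    def __init__(self):
        self.heap = []

    def add(self, now, timeout, request_id):
        """Register a timeout; return seconds until the next firing, i.e.
        what the single real timer should now be armed for."""
        heapq.heappush(self.heap, (now + timeout, request_id))
        return self.heap[0][0] - now

    def fire(self, now):
        """Pop every entry whose deadline has passed; return their request IDs
        (these are the requests that should receive a FAIL message)."""
        expired = []
        while self.heap and self.heap[0][0] <= now:
            expired.append(heapq.heappop(self.heap)[1])
        return expired
```

Walking through the worked example: adding a 100-second timeout at time 0 arms the timer for 100 s; adding a 50-second timeout at time 10 re-arms it for 50 s; after firing at time 60, a 100-second timeout added at time 80 leaves the head deadline at 100, so the timer needs only 20 more seconds.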

Do any boost::asio async calls automatically time out?

I have a client and server using boost::asio asynchronously. I want to add some timeouts to close the connection and potentially retry if something goes wrong.
My initial thought was that any time I call an async_ function I should also start a deadline_timer to expire after I expect the async operation to complete. Now I'm wondering if that is strictly necessary in every case.
For example:
async_resolve presumably uses the system's resolver which has timeouts built into it (e.g. RES_TIMEOUT in resolv.h possibly overridden by configuration in /etc/resolv.conf). By adding my own timer, I may conflict with how the user wants his resolver to work.
For async_connect, the connect(2) syscall has some sort of timeout built into it
etc.
So which (if any) async_ calls are guaranteed to call their handlers within a "reasonable" time frame? And if an operation [can|does] timeout would the handler be passed the basic_errors::timed_out error or something else?
So I did some testing. Based on my results, it's clear that they depend on the underlying OS implementation. For reference, I tested this with a stock Fedora kernel: 2.6.35.10-74.fc14.x86_64.
The bottom line is that async_resolve() looks to be the only case where you might be able to get away without setting a deadline_timer. It's practically required in every other case for reasonable behavior.
async_resolve()
A call to async_resolve() resulted in 4 queries 5 seconds apart. The handler was called 20 seconds after the request with the error boost::asio::error::host_not_found.
My resolver defaults to a timeout of 5 seconds with 2 attempts (resolv.h), so it appears to send twice the number of queries configured. The behavior is modifiable by setting options timeout and options attempts in /etc/resolv.conf. In every case the number of queries sent was double whatever attempts was set to and the handler was called with the host_not_found error afterwards.
For the test, the single configured nameserver was black-hole routed.
async_connect()
Calling async_connect() with a black-hole-routed destination resulted in the handler being called with the error boost::asio::error::timed_out after ~189 seconds.
The stack sent the initial SYN and 5 retries. The first retry was sent after 3 seconds, with the retry timeout doubling each time (3+6+12+24+48+96=189). The number of retries can be changed:
% sysctl net.ipv4.tcp_syn_retries
net.ipv4.tcp_syn_retries = 5
The default of 5 is chosen to comply with RFC 1122 (4.2.3.5):
[The retransmission timers] for a SYN
segment MUST be set large enough to
provide retransmission of the segment
for at least 3 minutes. The
application can close the connection
(i.e., give up on the open attempt)
sooner, of course.
3 minutes = 180 seconds, though the RFC doesn't appear to specify an upper bound. There's nothing stopping an implementation from retrying forever.
async_write()
As long as the socket's send buffer wasn't full, this handler was always called right away.
My test established a TCP connection and set a timer to call async_write() a minute later. During the minute where the connection was established but prior to the async_write() call, I tried all sorts of mayhem:
Setting a downstream router to black-hole subsequent traffic to the destination.
Clearing the session in a downstream firewall so it would reply with spoofed RSTs from the destination.
Unplugging my Ethernet
Running /etc/init.d/network stop
No matter what I did, the next async_write() would immediately call its handler to report success.
In the case where the firewall spoofed the RST, the connection was closed immediately, but I had no way of knowing that until I attempted the next operation (which would immediately report boost::asio::error::connection_reset). In the other cases, the connection would remain open and not report errors to me until it eventually timed out 17-18 minutes later.
The worst case for async_write() is if the host is retransmitting and the send buffer is full. If the buffer is full, async_write() won't call its handler until the retransmissions time out. Linux defaults to 15 retransmissions:
% sysctl net.ipv4.tcp_retries2
net.ipv4.tcp_retries2 = 15
The time between the retransmissions increases after each (and is based on many factors such as the estimated round-trip time of the specific connection) but is clamped at 2 minutes. So with the default 15 retransmissions and worst-case 2-minute timeout, the upper bound is 30 minutes for the async_write() handler to be called. When it is called, error is set to boost::asio::error::timed_out.
async_read()
This should never call its handler as long as the connection is established and no data is received. I haven't had time to test it.
Those two calls MAY have timeouts that get propagated up to your handlers, but you might be surprised at the length of time it takes before either of them times out. (I know I have let a connection just sit and try to connect on a single connect call for over 10 minutes with boost::asio before killing the process.) Also, the async_read and async_write calls do not have timeouts associated with them, so if you wish to have timeouts on your reads and writes, you will still need a deadline_timer.
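The "own deadline per async call" pattern from the question can be sketched compactly; the snippet below uses Python's asyncio as a stand-in for boost::asio (a C++ version would start a deadline_timer alongside each async_ call and cancel whichever completes second). The helper name and error message are assumptions:

```python
import asyncio

async def with_deadline(operation, timeout_seconds):
    """Race an async operation against a deadline and cancel the loser,
    mirroring the deadline_timer-per-async_-call pattern discussed above."""
    try:
        return await asyncio.wait_for(operation, timeout_seconds)
    except asyncio.TimeoutError:
        # In the asio version this is where you'd close the socket and
        # optionally schedule a retry.
        raise TimeoutError("operation exceeded its deadline; close/retry the connection")
```

As the tests above show, relying on the OS defaults means waiting minutes (connect) to half an hour (write with a full send buffer), so an explicit deadline like this is usually the only way to get "reasonable" failure times.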