Problem: I am calling URLFetch with a deadline of 480 seconds from within a TaskQueue, but it is timing out after only 60 seconds.
The original question was asked in the official group more than a year ago but is still unanswered.
The bug was confirmed, but there has been no response from technical support or the GAE developers. Maybe they're here?
While there is information on that old thread that suggests otherwise, I don't believe this is a bug that will be fixed (or that it is a bug at all). It's unfortunate that the issue has not been updated or closed.
A URLFetch, regardless of where you make it from within the App Engine world, has a maximum deadline of 60 seconds.
Requests on frontend instances within App Engine also have a maximum lifetime of 60 seconds.
Requests within the context of the Task Queue, however, have a lifetime of up to 10 minutes. That does not mean a URLFetch made from within the Task Queue context can exceed the 60-second deadline.
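For illustration only, here is a minimal Python sketch of a task handler (assuming the Python 2.7 runtime with webapp2; the handler route and target URL are made up). The most you can do is cap the fetch deadline at 60 seconds and split the remote work into pieces the other side can serve within that limit:

from google.appengine.api import urlfetch
import webapp2

class WorkerHandler(webapp2.RequestHandler):
    # Runs as a push-queue task: the request itself may live up to 10 minutes,
    # but each individual URLFetch is still capped at a 60-second deadline.
    def post(self):
        result = urlfetch.fetch(
            url='https://example.com/slow-endpoint',  # hypothetical remote endpoint
            deadline=60)  # asking for more than 60 is not honored
        self.response.write(result.status_code)

app = webapp2.WSGIApplication([('/tasks/worker', WorkerHandler)])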
Related
We are experiencing double Lambda invocations of Lambdas triggered by S3 ObjectCreated events. The double invocations happen exactly 10 minutes after the first invocation; not 10 minutes after the first try completes, but 10 minutes after the first invocation started. The original invocation takes anywhere from 0.1 to 5 seconds. No invocation results in an error; they all complete successfully.
We are aware that SQS, for example, guarantees at-least-once rather than exactly-once delivery of messages, and we would accept some of the Lambdas being invoked a second time as a consequence of the distributed system underneath. A delay of 10 minutes, however, seems very strange.
Of about 10k messages, 100-200 result in double invocations.
AWS Support basically says "the 10 minute wait time is by design but we cannot tell you why", which is not at all helpful.
Has anyone else experienced this behaviour before?
How did you solve the issue, or did you simply ignore it (which we could do)?
One proposed solution is not to use direct S3-to-Lambda triggers, but to have S3 publish its event to SNS and subscribe a Lambda to that topic. Any experience with that approach?
Example log: two invocations, 10 minutes apart, same RequestId:
START RequestId: f9b76436-1489-11e7-8586-33e40817cb02 Version: 13
2017-03-29 14:14:09 INFO ImageProcessingLambda:104 - handle 1 records
and
START RequestId: f9b76436-1489-11e7-8586-33e40817cb02 Version: 13
2017-03-29 14:24:09 INFO ImageProcessingLambda:104 - handle 1 records
After a couple of rounds with AWS Support and others, and a few isolated trial runs, it seems like this is simply "by design". It is not clear why, but it simply happens. The problem is neither S3 nor SQS / SNS, but the Lambda invocation itself and how the Lambda service dispatches invocations to Lambda instances.
The double invocations happen somewhere between 1% and 3% of all invocations, 10 minutes after the first invocation. Surprisingly, there are even triple (and probably quadruple) invocations, at rates that are roughly powers of the base probability, so about 0.09%, and so on. The triple invocations happened 20 minutes after the first one.
If you encounter this, you simply have to work around it using whatever you have access to. We, for example, now store the already processed entities in Cassandra with a TTL of 1 hour and only act on a message in the Lambda if the entity has not been processed yet. The double and triple invocations all happen within this one-hour timeframe.
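For illustration, a rough Python sketch of that check using the cassandra-driver package; the keyspace, table, and column names here are assumptions, not the ones we actually use:

from cassandra.cluster import Cluster

# Assumed schema: CREATE TABLE dedup.processed (record_id text PRIMARY KEY);
session = Cluster(['127.0.0.1']).connect('dedup')

def is_duplicate(record_id):
    # A lightweight transaction: the insert only happens if the row does not
    # already exist, and the TTL makes the marker expire after one hour.
    result = session.execute(
        "INSERT INTO processed (record_id) VALUES (%s) IF NOT EXISTS USING TTL 3600",
        (record_id,))
    return not result.was_applied  # True when an earlier invocation already stored it

In the Lambda handler you would then return early whenever is_duplicate comes back True.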
Not wanting to spin up a data store like Dynamo just to handle this, I did two things to solve our use case (a combined sketch follows below):
Write a lock file per function into S3 (which we were already using for this function) and check for its existence on function entry, aborting if it is present; for this function we only ever want one instance running at a time. The lock file is removed before we call the callback, on error or success.
Write a request time into the initial event payload and check the request time on function entry; if the request time is too old, abort. We don't want Lambda retries on error unless they happen quickly, so this handles the case where a duplicate or retry is sent while another invocation of the same function is not already running (which would otherwise be stopped by the lock file), and it also avoids the small overhead of the S3 requests for the lock-file handling in that case.
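A rough boto3 sketch combining both checks; the bucket name, lock key, requestTime field, and process() call are placeholders rather than our real names:

import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
BUCKET = 'my-processing-bucket'            # hypothetical bucket
LOCK_KEY = 'locks/image-processing.lock'   # one lock per function
MAX_AGE_SECONDS = 60                       # ignore duplicates/retries older than this

def handler(event, context):
    # Check 2: reject events whose embedded request time is too old.
    if time.time() - event.get('requestTime', time.time()) > MAX_AGE_SECONDS:
        return
    # Check 1: abort if the lock file already exists (another invocation is running).
    try:
        s3.head_object(Bucket=BUCKET, Key=LOCK_KEY)
        return
    except ClientError as err:
        if err.response['Error']['Code'] != '404':
            raise
    s3.put_object(Bucket=BUCKET, Key=LOCK_KEY, Body=b'')   # take the lock
    try:
        process(event)   # placeholder for the real work
    finally:
        s3.delete_object(Bucket=BUCKET, Key=LOCK_KEY)      # always release the lock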
I have been trying to load test my API server using Locust.io on EC2 compute-optimized instances. It provides an easy-to-configure option for setting the wait time between consecutive requests and the number of concurrent users. In theory, RPS = #_users / wait time. However, while testing, this rule breaks down above a fairly low threshold of #_users (in my experiment, around 1200 users). The variables hatch_rate and #_of_slaves, including in a distributed test setting, had little to no effect on the RPS.
Experiment info
The test was done on a c3.4xlarge AWS EC2 compute node (AMI image) with 16 vCPUs, General Purpose SSD, and 30 GB RAM. During the test, CPU utilization peaked at 60% (depending on the hatch rate, which controls how many concurrent processes are spawned), staying under 30% on average.
Locust.io
Setup: uses pyzmq, with each vCPU core set up as a slave. A single POST request with a request body of ~20 bytes and a response body of ~25 bytes. Request failure rate: < 1%, with a mean response time of 6 ms.
Variables: time between consecutive requests set to 450 ms (min: 100 ms, max: 1000 ms), hatch rate at a comfortable 30 per second, and RPS measured by varying #_users.
The RPS follows the equation as predicted for up to 1000 users. Increasing #_users beyond that has diminishing returns, with a cap reached at roughly 1200 users. #_users isn't the only variable here; changing the wait time affects the RPS as well. However, changing the experiment setup to a 32-core instance (c3.8xlarge) or 56 cores (in a distributed setup) doesn't affect the RPS at all.
So really, what is the way to control the RPS? Is there something obvious I am missing here?
(one of the Locust authors here)
First, why do you want to control the RPS? One of the core ideas behind Locust is to describe user behavior and let that generate load (requests in your case). The question Locust is designed to answer is: How many concurrent users can my application support?
I know it is tempting to go after a certain RPS number and sometimes I "cheat" as well by striving for an arbitrary RPS number.
But to answer your question: are you sure your Locust users don't end up in a deadlock? That is, do they complete a certain number of requests and then become idle because they have no other task to perform? It's hard to tell what's happening without seeing the test code.
Distributed mode is recommended for larger production setups, and most real-world load tests I've run have been on multiple but smaller instances. But it shouldn't matter if you are not maxing out the CPU. Are you sure you are not saturating a single CPU core? I'm not sure what OS you are running, but if it's Linux, what is your load value?
While there is no direct way of controlling RPS, you can try the constant_pacing and constant_throughput options for wait_time.
From the docs:
https://docs.locust.io/en/stable/api.html#locust.wait_time.constant_throughput
In the following example the task will always be executed once every second, no matter the task execution time:
from locust import User, constant_throughput

class MyUser(User):
    wait_time = constant_throughput(1)
constant_pacing is the inverse of this.
So if you run with 100 concurrent users, the test will run at 100 RPS (assuming each request takes less than 1 second in the first place).
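For example, a minimal sketch of that setup; the host, endpoint, and payload below are placeholders:

from locust import HttpUser, task, constant_throughput

class ApiUser(HttpUser):
    host = "http://localhost:8000"        # placeholder target
    wait_time = constant_throughput(1)    # each simulated user aims for 1 request/second

    @task
    def post_item(self):
        # With 100 users this works out to roughly 100 RPS overall,
        # provided each request completes well within one second.
        self.client.post("/endpoint", json={"ping": 1})   # hypothetical endpoint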
A bit of a puzzle to solve in CF here.
You have a timestamp passed as an argument to the server, for example 2012-5-10 14:55.
How would you check whether 30 seconds have passed since the given time and return true or false?
Rules:
You cannot use the current server time to check whether 30 seconds have passed, because the timestamp can come from anywhere in the world and the comparison will always be wrong.
In this case you can always trust the argument timestamp passed to the server.
How would this be possible in ColdFusion 9 (if at all)?
Hmmm... Your problem is that you don't know the latency. You can count 30 seconds from the time you receive the timestamp, but that could be seconds after the timestamp was created.
You can count 30 seconds easily enough with...
sleep(30000); // 1000 milliseconds = 1 second, so 30000 = 30 seconds
So as soon as you get the variable you could "wait" for thirty seconds and then do something. But from your question it seems like you need exactly 30 seconds from the time the timestamp was created on the client. You probably cannot be that exact because:
The two clocks are not in sync.
You cannot figure out the latency of the request.
Even if you could, since you don't control the client you will have trouble guaranteeing the results because HTTP is stateless and not real time.
If you can dictate an HTML5 browser you could use websockets for this - it's practically begging for it :) But that's the only real solution I can think of.
You don't specify whether the argument passed to the server is an API request, a form submit, an AJAX request, etc., so there will be differences in the implementation on the other end. However, the ColdFusion server end should be essentially the same either way and is bound by three main issues: time zones, time disparity, and latency.
To solve the time zone issue you can require the argument to be passed as UTC time. When comparing the times with ColdFusion you would do something like:
<cfif DateDiff("s", Variables.PassedTimestamp, DateConvert( "Local2UTC", Now() )) GTE 30>
<!--- At least 30 seconds difference between the timestamps --->
</cfif>
But you don't know whether the time sent to you is synchronized with the time on your server. To help with this you can provide a method for the other end to query your time, and either adjust their time accordingly or echo back your time along with their own.
That time syncing, and the originally discussed sending of the timestamp, will both suffer from latency. Any communication over a network suffers from latency, just varying amounts of it. Location and Internet connection type can have more of an impact than protocol type. You could ping the other end to get the response time, halve that, and add it to the timestamp, but that is still an approximation, and not all servers will respond to a ping.
Additionally, the code on your server and on the other end will also introduce latency. To reduce this you want the other end to calculate the timestamp as close to the end of the request as possible, and your code to capture the timestamp as soon as possible. This can be done in ColdFusion by setting the request time as near the top of your Application.cfm|cfc as possible:
<cfset Request.Now = Now()>
And then change the checking code to:
DateDiff("s", Variables.PassedTimestamp, DateConvert( "Local2UTC", Request.Now ))
If, instead, you want to loop until 30 seconds have passed and then respond, note that your response won't appear to arrive exactly 30 seconds later, because latency will delay the response going back to the other end.
Getting exactly 30 seconds is impossible, but you can take steps that will get you closer. If you just need to see whether approximately 30 seconds have passed, that will be good enough.
I am profiling some AS code by measuring wall-clock time. In order to minimize the error I need to run the code for a long period of time. However, Flash seems to protect itself from unresponsive scripts by throwing an exception after some period of unresponsiveness, namely: Error #1502: A script has executed for longer than the default timeout period of 15 seconds.
Is there any way to disable this protection, or at least extend the timeout period?
If you are publishing with Adobe Flash CS4/CS5 etc.:
Go to the publish settings and select "Flash". At the bottom of this screen there is a textbox labelled "Script Timeout". I know you can increase this; I think the limit is 90 seconds, even though you can enter any value here.
Can you move execution of the script across separate frames, and add a timer to advance the frame before the timeout period has elapsed? I believe the error only occurs when you've dwelled on a single frame for more than 15 seconds.
I was just trying the SetTimer method in Win32 with some low values, such as 10 ms, as the timeout period. I calculated the time it took to receive 500 timer events and expected it to be around 5 seconds. Surprisingly, I found that it took about 7.5 seconds to get that many events, which means it was timing out at about every 16 ms. Is there any limitation on the value we can set for the timeout period (I couldn't find anything on MSDN)? Also, do other processes running on my system affect these timer messages?
OnTimer is based on the WM_TIMER message, which has a low priority, meaning it will be sent only when there is no other message waiting.
Also, MSDN explains that you cannot set an interval less than USER_TIMER_MINIMUM, which is 10 ms.
Regardless of that, the scheduler will honor the time quantum.
Windows is not a real-time OS and can't handle that kind of precision (10 ms intervals). Having said that, there are multiple kinds of timers and some have better precision than others.
You can alter the granularity of the system timer down to 1 ms using the multimedia timer API (timeBeginPeriod); this facility was originally intended for MIDI work.
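As a rough illustration of that granularity effect, here is a small Python/ctypes sketch; it is Windows-only, it measures sleep granularity driven by the same system timer rather than SetTimer itself, and the size of the difference will depend on the Windows and Python versions:

import ctypes
import time

winmm = ctypes.windll.winmm  # multimedia timer API

def avg_one_ms_sleep(samples=50):
    # Measure how long a requested 1 ms sleep actually takes, on average.
    total = 0.0
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(0.001)
        total += time.perf_counter() - start
    return total / samples * 1000.0  # milliseconds

print("default timer granularity: ~%.1f ms" % avg_one_ms_sleep())

winmm.timeBeginPeriod(1)      # request 1 ms system timer resolution
try:
    print("with timeBeginPeriod(1): ~%.1f ms" % avg_one_ms_sleep())
finally:
    winmm.timeEndPeriod(1)    # always restore the previous resolution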
Basically, my experience on Win2k is that any requested wait period under 13 ms returns a wait which oscillates randomly between two values, 0 ms and 13 ms. Timers longer than that are generally very accurate. Of your 500 timer events, some were 0 ms and some were 13 ms (assuming 13 ms is still correct), so the total didn't match what you expected.
As stated, Windows is not a real-time OS. Asking it to do anything and expecting it to happen at a specific time later is a fool's errand. Setting a timer asks Windows nicely to fire the WM_TIMER event as soon after the requested time as possible. This may be after other threads have been dealt with. Therefore the actual time at which you see the WM_TIMER event can't be realistically predicted; all you know is that it is greater than the time you set.
Check out this article on Windows time.