GAE - how to avoid service request timing out after 1 day - python-2.7

As I explained in this post, I'm trying to scrape tweets from Twitter.
I implemented the suggested solution with services, so that the actual heavy lifting happens in the backend.
The problem is that after about one day, I get this error
"Process terminated because the request deadline was exceeded. (Error code 123)"
I guess this is because the manual scaling has the requests timing out after 24 hours.
Is it possible to make it run for more than 24 hours?

You can't make a single request / task run for more than 24 hours but you can split up your request into different parts and each lasting a day. It's unwise to have a request run indefinitely that's why app engine closes them after a certain time to prevent idling / loopy request that lasts indefinitely.
I would recommend having your task fire a call at the end to trigger the queuing of the next task, that way it's automatic and you don't have to queue a task daily. Make sure there's some cursor or someway for your task to communicate progress so it won't duplicate work.

Related

gUnicorn/Flask/GAE - two processes started for processing the same http request

I have an app on Google AppEngine (Python39 standard env) running on gUnicorn and Flask. I'm making a request to the server from client-side app for a long-running operation and seeing that the request processed twice. The second process (worker) started after a while (a hour and a half) after the first one has been working.
I'm not sure is it related to gUnicorn specifically or to GAE.
The server controller has logging at the beginning :
#app.route("/api/campaign/generate", methods=["GET"])
def campaign_generate():
logging.info('Entering campaign_generate');
# some very long processing here
The controller is called by clicking a button from the UI app. I checked the network in DevTools in the browser that only one request fired. And I can see that there's only one request in server logs at the moment of executing of workers (more on this follow).
The whole app.yaml is like this:
runtime: python39
default_expiration: 0
instance_class: B2
basic_scaling:
max_instances: 1
entrypoint: gunicorn -b :$PORT server.server:app --timeout 0 --workers 2
So I have 2 workers with infinite timeouts, basic scaling with max instances = 1.
I expect while the app is processing one request for a long-running operation, another worker is available for serving.
I don't expect the second worker will used to processing the same request, it's a nonsense (if only the user won't start another operation from another browser).
Thanks to timeout=0 I expect gUnicorn will wait indefinitely till the controller finishes. And only one thing that can hinder is GAE'e timeout. But thanks to basic-scaling it's 24 hours. So I expect the app should process requests for several hours without problem.
But what I'm seeing instead is that after the processing the request for a while another execution is started. Here's simplified logs I see in Cloud Logging:
13:00:58 GET /api/campaign/generate
13:00:59 Entering campaign_generate
..skipped
13:39:13 Starting generating zip-archive (it's something that takes a while)
14:25:49 Entering campaign_generate
So, at 14:25, 1:25 after the current request came another processing of the same request started!
And now there're two request processings running in parallel.
Needless to say that this increase memory pressure and doubles execution time.
When the first "worker" finished (14:29:28 in our example) its processing, its result isn't being returned to the client. It looks like gUnicorn or GAE simply abandoned the first request. And the client has to wait till the second worker finishes processing.
Why is it happening?
And how can I fix it?
Regarding http requests records in the log.
I did see only one request in Cloud Logging (the first one) when the processing was active, and even after the controller was called for the second time ('Entering campaign_generate' in logs appeared) there was not any new GET-request in the logs. But after that everything completed (actually the second processing returned a response) a mysterious second GET-request appeared. So technically after everything is done, from the server logs' view (Cloud Logging) it looks like there were two subsequent requests from the client. But there weren't! There was only one, and I can see it in the browser's DevTools.
Those two requests have different traceId and requestId http headers.
It's very hard to understand what's going on, I tried running the app locally (on the same data) but it works as intended.

How to handle long requests in Google Cloud Run?

I have hosted my node app in Cloud Run and all of my requests served within 300 - 600ms time. But one endpoint that gets data from a 3rd party service so that request takes 1.2s - 2.5s to complete the request.
My doubts regarding this are
Is 1.2s - 2.5s requests suitable for cloud run? Or is there any rule that the requests should be completed within xx ms?
Also see the screenshot, I got a message along with the request in logs "The request caused a new container instance to be started and may thus take longer and use more CPU than a typical request"
What caused a new container instance to be started?
Is there any alternative or work around to handle long requests?
Any advice / suggestions would be greatly appreciated.
Thanks in advance.
I don't think that will be an issue unless you're worried about the cost of the CPU/memory time, which honestly should only matter if you're getting 10k+ requests/day. So, probably doesn't matter and cloud run can handle that just fine (my own app does requests longer than that with no problem)
It's possible that your service was "scaled to zero" meaning that there were no containers left running to serve requests. In that case, it would be necessary to start up a new instance and wait for whatever initializing/startup costs are associated with that process. It's also possible that it was auto-scaled due to all other instances being at their request limits. Make sure that your setting for max concurrent requests per instance is set greater than one - Node/Express can handle multiple requests at once. Plus, you'll only get charged for the total time spend, not per request:
In situations where you get very long (30 seconds, minutes+) operations, it may be a good idea to switch to some different data transfer method. You could use polling, where the client makes a request every 5 seconds and checks if the response is ready. You could also switch to some kind of push-based system like WebSockets, but Cloud Run doesn't have support for that.
TL;DR longer requests (~10-30 seconds) should be fine unless you're worried about the cost of the increased compute time they may occur at scale.

Celery/SQS task retry gone haywire - how to get rid of it?

We've got Celery/SQS set up for asynchronous task management. We're running Django for our framework. We have a celery task that has a self.retry() in it. Max_retries is set to 15. The retry is happening with an exponential backoff and takes 182 hours to complete all 15 retries.
Last week, this task went haywire, I think due to a bug in our code not properly handling a service outage. It resulted in exponential creation (retrying?) of the same celery task. It eventually used up all available memory and the worker crashed. Restarting the worker results in another crash a couple hours later, since all those tasks (and their retries) keep retrying and spawning new retries until we run out of memory again. Ultimately we ended up with nearly 600k tasks created!
We need our workers to ignore all the tasks with a specific celery GUID. Ideally we could just get rid of them for good. I was going to use revoke() but, per documentation (http://docs.celeryproject.org/en/3.1/userguide/workers.html#commands), this is only implemented for Redis and RabbitMQ, not SQS. Furthermore, when I go to the SQS service in the AWS console, it's showing zero messages in flight so it's not like I can just flush it.
Is there a way to delete or revoke a specific message from SQS using the Celery task ID? Or is there another way to fix this problem? Obviously we need to fix our code so we don't get into this situation again, but first we need to get our worker up and running because without it our website has reduced functionality. Thanks!

Is there an AWS / Pagerduty service that will alert me if it's NOT notified

We've got a little java scheduler running on AWS ECS. It's doing what cron used to do on our old monolith. it fires up (fargate) tasks in docker containers. We've got a task that runs every hour and it's quite important to us. I want to know if it crashes or fails to run for any reason (eg the java scheduler fails, or someone turns the task off).
I'm looking for a service that will alert me if it's not notified. I want to call the notification system every time the script runs successfully. Then if the alert system doesn't get the "OK" notification as expected, it shoots off an alert.
I figure this kind of service must exist, and I don't want to re-invent the wheel trying to build it myself. I guess my question is, what's it called? And where can I go to get that kind of thing? (we're using AWS obviously and we've got a pagerDuty account).
We use this approach for these types of problems. First, the task has to write a timestamp to a file in S3 or EFS. This file is the external evidence that the task ran to completion. Then you need an http based service that will read that file and calculate if the time stamp is valid ie has been updated in the last hour. This could be a simple php or nodejs script. This process is exposed to the public web eg https://example.com/heartbeat.php. This script returns a http response code of 200 if the timestamp file is present and valid, or a 500 if not. Then we use StatusCake to monitor the url, and notify us via its Pager Duty integration if there is an incident. We usually include a message in the response so a human can see the nature of the error.
This may seem tedious, but it is foolproof. Any failure anywhere along the line will be immediately notified. StatusCake has a great free service level. This approach can be used to monitor any critical task in same way. We've learned the hard way that critical cron type tasks and processes can fail for any number of reasons, and you want to know before it becomes customer critical. 24x7x365 monitoring of these types of tasks is necessary, and helps us sleep better at night.
Note: We always have a daily system test event that triggers a Pager Duty notification at 9am each day. For the truly paranoid, this assures that pager duty itself has not failed in some way eg misconfiguratiion etc. Our support team knows if they don't get a test alert each day, there is a problem in the notification system itself. The tech on duty has to awknowlege the incident as per SOP. If they do not awknowlege, then it escalates to the next tier, and we know we have to have a talk about response times. It keeps people on their toes. This is the final piece to insure you have robust monitoring infrastructure.
OpsGene has a heartbeat service which is basically a watch dog timer. You can configure it to call you if you don't ping them in x number of minutes.
Unfortunately I would not recommend them. I have been using them for 4 years and they have changed their account system twice and left my paid account orphaned silently. I have to find a new vendor as soon as I have some free time.

Is there a way to set a walltime on AWS Batch jobs?

Is there a way to set a maximum running time for AWS Batch jobs (or queues)? This is a standard setting in most batch managers, which avoids wasting resources when a job hangs for whatever reason.
As of April, 2018, AWS Batch now supports setting a Job Timeout when submitting a Job, or in the job definition.
https://aws.amazon.com/about-aws/whats-new/2018/04/aws-batch-adds-support-for-automatic-termination-with-job-execution-timeout/
You specify an attemptDurationSeconds parameter, which must be at least 60 seconds, either in your job definition, or when you submit the job. When this number of seconds has passed following the job attempt's startedAt timestamp, AWS Batch terminates the job. On the compute resource, your job's container receives a SIGTERM signal to give your application a chance to shut down gracefully; if the container is still running after 30 seconds, a SIGKILL signal is sent to forcefully shut down the container.
Source: https://docs.aws.amazon.com/batch/latest/userguide/job_timeouts.html
POST /v1/submitjob HTTP/1.1
Content-type: application/json
{
...
"timeout": {
"attemptDurationSeconds": number
}
}
AFAIK there is no feature to do this. However, a workaround was suggested in the forum for a similar question.
One idea is to call Batch as an Activity from Step Functions, pingback
back on a schedule (e.g. every minute) from that job. If it stops
responding then you can detect that situation as a Timeout in the
activity and act accordingly (terminate the job etc.). Not an ideal
solution (especially if the job continues to ping back as a "zombie"),
but it's a start. You'd also likely have to store activity tokens in a
database to trace them to Batch job id.
Alternatively, you split that setup into 2 steps, and schedule a Batch
job from a Lambda in the first state, then pass the Batch job id to
the second step which then polls Batch (from another Lambda) for its
state with Retry and IntervalSeconds (e.g. once every minute, or even
with exponential backoff), and MaxAttempts calculated based on your
timeout. This way, you don't need any external state storage
mechanism, long polling or even a "ping back" from the job (it CAN be
a zombie), but the downside is more steps.
There is no option to set timeout on batch job but you can setup a lambda function that triggers every 1 hour or so and deletes jobs created before say 24 hours.
working with aws for some time now and could not find a way to set a maximum running time for batch jobs.
However there are some alternative way which you could utilize.
AWS Forum
Sadly there is no way to set the limit execution time on AWS Batch.
One solution may be to edit the docker's entry point to schedule the execution time limit.