python-rq worker closes automatically - python-2.7

I am implementing python-rq to pass domains through a queue and scrape them using Beautiful Soup, so I am running multiple workers to get the job done. I started 22 workers, and all 22 are registered in the rq dashboard. But after some time a worker stops by itself and is no longer displayed in the dashboard, even though webmin still shows all the workers as running. The crawling speed has also decreased, i.e. the workers are not actually running. I tried running the workers using supervisor and with nohup; in both cases the workers stop by themselves.
What is the reason for this? Why do the workers stop by themselves? And how many workers can we start on a single server?
Also, whenever a worker is unregistered from the rq dashboard, the failed count increases, and I don't understand why.
Please help me with this. Thank you.

Okay I figured out the problem. It was because of worker timeout.
try:
    --my code goes here--
except Exception, ex:  # Python 2.7 syntax; note this catches *every* exception
    self.error += 1
    with open("error.txt", "a") as myfile:
        # log the exception type and the URL being scraped
        myfile.write('\n%s' % sys.exc_info()[0] + "{}".format(self.url))
    pass
According to my code, the next domain is dequeued once 200 URLs have been fetched from the current domain. But for some domains there were not enough URLs for that condition to be met (only 1 or 2 URLs).
Since the code catches every exception and appends it to the error.txt file, even the rq timeout exception rq.timeouts.JobTimeoutException was caught and appended to the file instead of being allowed to propagate. That left the worker waiting for x amount of time, which led to the termination of the worker.
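A minimal sketch of how the except block can be narrowed so the timeout is not swallowed - the re-raise of JobTimeoutException is the important part (the rest is the same fragment as above):

from rq.timeouts import JobTimeoutException

try:
    --my code goes here--
except JobTimeoutException:
    # let rq terminate the job instead of logging it and carrying on
    raise
except Exception, ex:
    self.error += 1
    with open("error.txt", "a") as myfile:
        myfile.write('\n%s' % sys.exc_info()[0] + "{}".format(self.url))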

Related

gUnicorn/Flask/GAE - two processes started for processing the same http request

I have an app on Google AppEngine (Python39 standard env) running on gUnicorn and Flask. I'm making a request to the server from a client-side app for a long-running operation and I'm seeing that the request is processed twice. The second process (worker) started a while (an hour and a half) after the first one had started working on it.
I'm not sure whether this is related to gUnicorn specifically or to GAE.
The server controller has logging at the beginning:
@app.route("/api/campaign/generate", methods=["GET"])
def campaign_generate():
    logging.info('Entering campaign_generate')
    # some very long processing here
The controller is called by clicking a button in the UI app. I checked the Network tab in the browser's DevTools and only one request was fired. And I can see only one request in the server logs at the time the workers were executing (more on this below).
The whole app.yaml is like this:
runtime: python39
default_expiration: 0
instance_class: B2
basic_scaling:
  max_instances: 1
entrypoint: gunicorn -b :$PORT server.server:app --timeout 0 --workers 2
So I have 2 workers with infinite timeouts, basic scaling with max instances = 1.
I expect that while the app is processing one request for a long-running operation, the other worker stays available for serving.
I don't expect the second worker to be used to process the same request; that makes no sense (unless the user starts another operation from another browser).
Thanks to timeout=0 I expect gUnicorn to wait indefinitely until the controller finishes. The only thing that could get in the way is GAE's timeout, but thanks to basic scaling that is 24 hours. So I expect the app to be able to process a request for several hours without problems.
But what I'm seeing instead is that after the request has been processed for a while, another execution of the same handler starts. Here are the simplified logs I see in Cloud Logging:
13:00:58 GET /api/campaign/generate
13:00:59 Entering campaign_generate
..skipped
13:39:13 Starting generating zip-archive (it's something that takes a while)
14:25:49 Entering campaign_generate
So at 14:25, an hour and 25 minutes after the original request came in, another processing of the same request started!
And now there are two request processings running in parallel.
Needless to say, this increases memory pressure and doubles the execution time.
When the first "worker" finished its processing (at 14:29:28 in our example), its result was not returned to the client. It looks like gUnicorn or GAE simply abandoned the first request, and the client has to wait until the second worker finishes processing.
Why is it happening?
And how can I fix it?
Regarding the HTTP request records in the log:
I saw only one request in Cloud Logging (the first one) while the processing was active, and even after the controller was called for the second time ('Entering campaign_generate' appeared in the logs) there was no new GET request in the logs. But after everything completed (actually, after the second processing returned a response), a mysterious second GET request appeared. So technically, once everything is done, from the server logs' point of view (Cloud Logging) it looks like there were two subsequent requests from the client. But there weren't! There was only one, and I can see it in the browser's DevTools.
Those two requests have different traceId and requestId HTTP headers.
It's very hard to understand what's going on. I tried running the app locally (on the same data) and it works as intended.
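One way to narrow down which gunicorn worker handles each execution (a diagnostic sketch, not part of the original handler; X-Cloud-Trace-Context is the trace header GAE attaches to incoming requests) is to log the worker PID and trace id at the top of the controller:

import logging
import os

from flask import request

# `app` is the existing Flask app (server.server:app)
@app.route("/api/campaign/generate", methods=["GET"])
def campaign_generate():
    # Record which gunicorn worker process picked the request up and its trace id,
    # so duplicate executions can be told apart in Cloud Logging.
    logging.info('Entering campaign_generate, pid=%s, trace=%s',
                 os.getpid(), request.headers.get('X-Cloud-Trace-Context'))
    # some very long processing here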

Celery 4.4.0 - Disable prefetch while using SQS as broker

I've looked at multiple SO questions about this but still haven't found a solution, and I've been working on this for several days now, so any help would be appreciated!
I'm using celery 4.4.0, SQS as a broker, and Django 2.2, most of my tasks are pretty long (1 to 4 hours).
This is the command to start the workers:
celery worker -A config.celeryconfig:app -Ofair --prefetch-multiplier=1 --max-tasks-per-child=2
And this configuration within my Django config file:
CELERYD_PREFETCH_MULTIPLIER = 1
CELERY_ACKS_LATE = True
task_acks_late = True  # I wasn't sure which name the acks-late setting uses.
BROKER_TRANSPORT_OPTIONS = {
    'polling_interval': 3,
    'region': 'us-east-1',
    'visibility_timeout': 3600,  # 1 hour
}
Scenario - let's say we have 2 workers (1 and 2), each with one sub-process, and three tasks - a (longer than 1 hour), b, and c - which I dispatched at around the same time:
worker 1 picked up task a
worker 1 picked up task c (but not executing)
worker 2 picked up task b
worker 2 finished with task b
worker 2 picked up task c (because of the visibility_timeout)
worker 2 finished up task c
worker 1 finished with task a
worker 1 started with task c
worker 1 finished with task c
So:
Workers are still prefetching tasks.
(much worse) The same task can be executed twice (or even more times, if there are more workers and the execution time is longer than the visibility_timeout). This is discussed in detail here: https://github.com/celery/celery/issues/4400.
I've realized the solution to the above problems would be disabling the prefetching behaviour, but so far I haven't been able to achieve that.
I'm so frustrated, so if you have any idea on how to solve this - please let me know!
A few notes:
*I saw this in my production env as well, where we have more than 1 subprocess per worker.
*I used to dispatch tasks using countdown (which resulted in ETA tasks), but I've disabled it and the issue still persists.
*I'm not using a result_backend; I'm handling everything with a model in the DB that I've created.
*I don't want to set CONCURRENCY=1 because I want to have one worker per machine with subprocesses = CPU cores of the machine.
*I've tried many combinations of the celery settings mentioned above, but none worked (the one listed above is based on this comment - https://stackoverflow.com/a/58958823).
*Another possible solution, which I prefer not to use, is increasing the visibility_timeout to a very big number. But this could result in one worker picking up all the long tasks one after the other, with no distribution of them between the workers.
*Not sure if related - I'm deploying celery using ElasticBeanstalk on an EC2 machine.
*Another possible solution that I'm currently considering is checking the status of the task (we use Pending/In Progress, etc. statuses) and only continuing if it's Pending. This might not cover every case because of a possible race condition, but it should cover most of them - see the sketch below.
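A rough sketch of that status check, assuming a hypothetical Django model Task with a status field (the model, statuses, and task name are illustrative, not the project's actual code):

from celery import shared_task

from myapp.models import Task  # hypothetical model with a `status` field

@shared_task(bind=True, acks_late=True)
def process(self, task_id):
    # Guard against duplicate delivery (e.g. SQS re-delivering the message
    # after visibility_timeout expires): only start if the task is still Pending.
    claimed = Task.objects.filter(id=task_id, status='PENDING') \
                          .update(status='IN_PROGRESS')
    if not claimed:
        return  # another worker already picked this task up
    # ... the long-running work goes here ...

Using a single conditional update() makes the claim atomic at the database level, which narrows the race window mentioned in the last note.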
Well, what I understood from the Celery docs is that CELERYD_PREFETCH_MULTIPLIER only changes the behaviour to prefetching one task at a time per process; it doesn't disable prefetching altogether. So if you have concurrency > 1, it will still prefetch 1 task for each child process. Try setting --concurrency=1.
How many messages to prefetch at a time multiplied by the number of concurrent processes. The default is 4 (four messages for each process).
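In other words, with the documented formula (prefetch limit = multiplier x concurrency), a small worked example (the concurrency value is just an illustration):

# prefetch limit = prefetch multiplier * number of worker child processes
prefetch_multiplier = 1
concurrency = 4                          # e.g. one child per CPU core
reserved = prefetch_multiplier * concurrency
print(reserved)                          # -> 4 messages held by this worker
# with --concurrency=1 the same worker holds at most 1 message at a time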

GAE - how to avoid service request timing out after 1 day

As I explained in this post, I'm trying to scrape tweets from Twitter.
I implemented the suggested solution with services, so that the actual heavy lifting happens in the backend.
The problem is that after about one day, I get this error
"Process terminated because the request deadline was exceeded. (Error code 123)"
I guess this is because the manual scaling has the requests timing out after 24 hours.
Is it possible to make it run for more than 24 hours?
You can't make a single request / task run for more than 24 hours, but you can split your work into parts that each last less than a day. It's unwise to have a request run indefinitely; that's why App Engine closes requests after a certain time, to prevent idle or loopy requests that last forever.
I would recommend having your task fire a call at the end to trigger the queuing of the next task; that way it's automatic and you don't have to queue a task daily. Make sure there's a cursor or some other way for your task to communicate progress so it won't duplicate work.
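A rough sketch of that chaining pattern using Cloud Tasks (the project, location, queue, and endpoint names are made up for illustration, and the cursor is whatever marker your scraper uses to resume):

import json

from google.cloud import tasks_v2

def enqueue_next_chunk(cursor):
    # Called at the end of the current run: queue another request that
    # resumes scraping from `cursor`, so no single request exceeds 24 hours.
    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path("my-project", "us-central1", "scrape-queue")
    task = {
        "app_engine_http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "relative_uri": "/tasks/scrape",
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"cursor": cursor}).encode(),
        }
    }
    client.create_task(parent=parent, task=task)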

Celery/SQS task retry gone haywire - how to get rid of it?

We've got Celery/SQS set up for asynchronous task management. We're running Django for our framework. We have a celery task that has a self.retry() in it. Max_retries is set to 15. The retry is happening with an exponential backoff and takes 182 hours to complete all 15 retries.
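For context, a retry set up along those lines presumably looks something like this (the task body, exception, and backoff formula are illustrative, not our actual code):

from celery import shared_task

@shared_task(bind=True, max_retries=15)
def sync_with_service(self, payload):
    try:
        call_external_service(payload)      # hypothetical helper
    except ServiceOutageError as exc:       # hypothetical exception
        # exponential backoff: each retry waits twice as long as the previous one
        raise self.retry(exc=exc, countdown=60 * 2 ** self.request.retries)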
Last week, this task went haywire, I think due to a bug in our code not properly handling a service outage. It resulted in exponential creation (retrying?) of the same celery task. It eventually used up all available memory and the worker crashed. Restarting the worker results in another crash a couple hours later, since all those tasks (and their retries) keep retrying and spawning new retries until we run out of memory again. Ultimately we ended up with nearly 600k tasks created!
We need our workers to ignore all the tasks with a specific celery GUID. Ideally we could just get rid of them for good. I was going to use revoke() but, per documentation (http://docs.celeryproject.org/en/3.1/userguide/workers.html#commands), this is only implemented for Redis and RabbitMQ, not SQS. Furthermore, when I go to the SQS service in the AWS console, it's showing zero messages in flight so it's not like I can just flush it.
Is there a way to delete or revoke a specific message from SQS using the Celery task ID? Or is there another way to fix this problem? Obviously we need to fix our code so we don't get into this situation again, but first we need to get our worker up and running because without it our website has reduced functionality. Thanks!

Django-celery project, how to handle results from result-backend?

1) I am currently working on a web application that exposes a REST API and uses Django and Celery to handle requests. To serve a request, a set of celery tasks has to be submitted to an amqp queue so that they get executed on workers (located on other machines). Each task is very CPU intensive and takes a very long time (hours) to finish.
I have configured Celery to also use amqp as the result backend, and I am using RabbitMQ as Celery's broker.
Each task returns a result that needs to be stored afterwards in a DB, but not by the workers directly. Only the "central node" - the machine running django-celery and publishing tasks to the RabbitMQ queue - has access to this storage DB, so the results from the workers somehow have to get back to this machine.
The question is: how can I process the results of the task executions afterwards? After a worker finishes, its result gets stored in the configured result backend (amqp), but I don't know what the best way is to get the results from there and process them.
All I could find in the documentation is that you can either check on the result's status from time to time with:
result.state
which basically means that I need a dedicated piece of code that runs this check periodically, keeping a whole thread/process busy with just that, or I can block everything with:
result.get()
until a task finishes, which is not what I want.
The only solution I can think of is to have an extra thread on the "central node" that periodically runs a function which checks the async_results returned by each task at submission time, and takes action once a task reaches a finished state.
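A minimal sketch of that polling thread, assuming the AsyncResult objects are collected in a list called pending at submission time (save_to_db is a hypothetical helper for the storage DB):

import threading
import time

def poll_results(pending, interval=30):
    # `pending` holds (task_args, AsyncResult) pairs collected at submission.
    while pending:
        for item in list(pending):
            args, res = item
            if res.ready():
                save_to_db(args, res.get())   # hypothetical DB writer
                pending.remove(item)
        time.sleep(interval)

# started once on the "central node":
# poller = threading.Thread(target=poll_results, args=(pending,))
# poller.daemon = True
# poller.start()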
Does anyone have any other suggestion?
Also, since the result backend's processing takes place on the "central node", what I'm aiming for is to minimize the impact of this operation on that machine.
What would be the best way to do that?
2) How do people usually deal with the results returned by the workers and stored in the result backend? (assuming a result backend has been configured)
I'm not sure if I fully understand your question, but take into account that each task has a task id. If tasks are being sent by users, you can store the ids and then check for the results using JSON as follows:
#urls.py
from djcelery.views import is_task_successful

urlpatterns += patterns('',
    url(r'(?P<task_id>[\w\d\-\.]+)/done/?$', is_task_successful,
        name='celery-is_task_successful'),
)
Another related concept is that of signals: each finished task emits a signal, and a successfully finished task emits a task_success signal. More can be found in the Celery docs on real-time processing.
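A small sketch of connecting to that signal (note that the handler runs inside the worker process that executed the task, and the print is just a placeholder for whatever processing you need):

from celery.signals import task_success

@task_success.connect
def on_task_success(sender=None, result=None, **kwargs):
    # `sender` is the task that just finished, `result` is its return value.
    print('Task %s finished with result: %r' % (sender.name, result))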