Set dynamic scheduling celerybeat - django

I have send_time field in my Notification model. I want to send notification to all mobile clients at that time.
What i am doing right now is, I have created a task and scheduled it for every minute
tasks.py
#app.task(name='app.tasks.send_notification')
def send_notification():
# here is logic to filter notification that fall inside that 1 minute time span
cron.push_notification()
settings.py
CELERYBEAT_SCHEDULE = {
'send-notification-every-1-minute': {
'task': 'app.tasks.send_notification',
'schedule': crontab(minute="*/1"),
},
}
All things are working as expected.
Question:
is there any way to schedule task as per send_time field, so i don't have to schedule task for every minute.
More specifically i want to create a new instance of task as my Notification model get new entry and schedule it according to send_time field of that record.
Note: i am using new integration of celery with django not django-celery package

To execute a task at specified date and time you can use eta attribute of apply_async while calling task as mentioned in docs
After creation of notification object you can call your task as
# here obj is your notification object, you can send extra information in kwargs
send_notification.apply_async(kwargs={'obj_id':obj.id}, eta=obj.send_time)
Note: send_time should be datetime.

You have to use PeriodicTask and CrontabSchedule to schedule task that can be imported from djcelery.models.
So the code will be like:
from djcelery.models import PeriodicTask, CrontabSchedule
crontab, created = CrontabSchedule.objects.get_or_create(minute='*/1')
periodic_task_obj, created = PeriodicTask.objects.get_or_create(name='send_notification', task='send_notification', crontab=crontab, enabled=True)
Note: you have to write full path to the task like 'app.tasks.send_notification'
You can schedule the notification task in post_save of Notification Model like:
#post_save
def schedule_notification(sender, instance, *args, **kwargs):
"""
instance is notification model object
"""
# create crontab according to your notification object.
# there are more options you can pass like day, week_day etc while creating Crontab object.
crontab, created = CrontabSchedule.objects.get_or_create(minute=instance.send_time.minute, hour=instance.send_time.hour)
periodic_task_obj, created = PeriodicTask.objects.get_or_create(name='send_notification', task='send_notification_{}'.format(instance.pk))
periodic_task_obj.crontab = crontab
periodic_task_obj.enabled = True
# you can also pass kwargs to your task like this
periodic_task_obj.kwargs = json.dumps({"notification_id": instance.pk})
periodic_task_obj.save()

Related

Saving a celery task (for re-running) in database

Our workflow is currently built around an old version of celery, so bear in mind things are already not optimal. We need to run a task and save a record of that task run in the database. If that task fails or hangs (it happens often), we want to re run, exactly as it was run the first time. This shouldn't happen automatically though. It needs to be triggered manually depending on the nature of the failure and the result needs to be logged in the DB to make that decision (via a front end).
How can we save a complete record of a task in the DB so that a subsequent process can grab the record and run a new identical task? The current implementation saves the path of the #task decorated function in the DB as part of a TaskInfo model. When the task needs to be rerun, we have a get_task() method on the TaskInfo model that gets the path from the DB, imports it using getattr, and another rerun() method that runs the task again with *args, **kwargs (also saved in the DB).
Like so (these are methods on the TaskInfo model instance):
def get_task(self):
"""Returns the task's decorated function, which can be delayed."""
module_name, object_name = self.path.rsplit('.', 1)
module = import_module(module_name)
task = getattr(module, object_name)
if inspect.isclass(task):
task = task()
# task = current_app.tasks[self.path]
return task
def rerun(self):
"""Re-run the task, and replace this one.
- A new task is scheduled to run.
- The new task's TaskInfo has the same parent as this TaskInfo.
- This TaskInfo is deleted.
"""
args, kwargs = self.get_arguments()
celery_task = self.get_task()
celery_task.delay(*args, **kwargs)
defaults = {
'path': self.path,
'status': Status.PENDING,
'timestamp': timezone.now(),
'args': args,
'kwargs': kwargs,
'parent': self.parent,
}
TaskInfo.objects.update_or_create(task_id=celery_task.id, defaults=defaults)
self.delete()
There must be a cleaner solution for saving a task in the DB to rerun later, right?
The latest version of Celery (4.4.0) included a param extended_result. You can set it to True, then the table (it is named celery_taskmeta by default) in the Result Backend Database will store the args and kwargs of the task.
Here is a demo:
app = Celery('test_result_backend')
app.conf.update(
broker_url='redis://localhost:6379/10',
result_backend='db+mysql://root:passwd#localhost/celery_toys',
result_extended=True
)
#app.task(bind=True, name='add')
def add(self, x, y):
self.request.task_name = 'add' # For saving the task name.
time.sleep(5)
return x + y
With the task info recorded in MySQL, you are able to re-run your task easily.

How can I set a timeout on Dataflow?

I am using Composer to run my Dataflow pipeline on a schedule. If the job is taking over a certain amount of time, I want it to be killed. Is there a way to do this programmatically either as a pipeline option or a DAG parameter?
Not sure how to do it as a pipeline config option, but here is an idea.
You could launch a taskqueue task with countdown set to your timeout value. When the task does launch, you could check to see if your task is still running:
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/list
If it is, you can call update on it with job state JOB_STATE_CANCELLED
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/update
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs#jobstate
This is done through the googleapiclient lib: https://developers.google.com/api-client-library/python/apis/discovery/v1
Here is an example of how to use it
class DataFlowJobsListHandler(InterimAdminResourceHandler):
def get(self, resource_id=None):
"""
Wrapper to this:
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/list
"""
if resource_id:
self.abort(405)
else:
credentials = GoogleCredentials.get_application_default()
service = discovery.build('dataflow', 'v1b3', credentials=credentials)
project_id = app_identity.get_application_id()
_filter = self.request.GET.pop('filter', 'UNKNOWN').upper()
jobs_list_request = service.projects().jobs().list(
projectId=project_id,
filter=_filter) #'ACTIVE'
jobs_list = jobs_list_request.execute()
return {
'$cursor': None,
'results': jobs_list.get('jobs', []),
}

Push results of one to another job django-rq

I'm building an application in which users' information are frequently updated from external APIs. For that, I'm using django-rq.
The first job is scheduled once a day to get users who need to update their profiles.
With each result returned by the first job, schedule another job to get new user information from remote API and update the user in my database.
# tasks.py
import requests
from django_rq import job
#job('default')
def get_users_to_update():
"""Get a list of users to who need update"""
users = User.objects.filter(some_condition_here...)
return users
#job('default')
def remote_update_user(user):
"""Calls external API to get new user information"""
url = 'http://somwhere.io/api/users/{}'.format(user.id)
headers= {'Authorization': "Bearer {}".format(user.access_token)}
# Send the request, probably takes long time
r = requests.get(url, headers=headers)
data = r.json() # new user data
for key, value in data.items():
setattr(user, key, value)
user.save()
return user
I can schedule those two jobs like following:
# update_info.py
import django_rq
scheduler = django_rq.get_scheduler('default')
scheduler.schedule(
scheduled_time=datetime.utcnow(),
func=tasks.get_users_to_update,
interval=24 * 60 * 60
)
scheduler.schedule(
scheduled_time=datetime.utcnow(),
func=tasks.remote_update_user,
interval=24 * 60 * 60
)
However, this is certainly not what I want. I'm wondering if there is any way in django-rq to notify when get_users_to_update is finished, get its results and schedule the remote_update_user.
rq allows depends_on for declaring the dependent job a submitted job, but it seems that such functionality isn't available in django-rq.

Celery group multiple tasks in one design

I just getting familiar with Celery and have a question. My setup is Django-Redis-Celery
Lets take an example of a task sending email:
TASKS
#task
def send_email(message):
mailserver.sendOneMessage(message)
VIEWS
class newaccount(APIView):
def post(self, request, format=None):
send_email.delay(request.data.email)
This works perfectly, Django sends messages to Redis and those are picked up by Celery then to execute task. But I want to improve the system so that Celery picks up all messages from Redis at certain intervals and executes a single task with multiple messages. This because, connecting to the email server is slow and sending multiple messages as a single request will result in a faster process.
I want something like this to work:
TASKS
#task
def send_emails(messages):
mailserver.sendMultipleMessages(messages)
Thoughts?
Since i am using redis as a cache (django-redis) already i implemented the following workflow:
Step 1. Create a task that adds new emails to cache
#shared_task()
def add_email(user_id):
cache.set("email#{}".format(user_id), None, timeout=None)
Step 2. Create a periodic task that runs every second and looks up for new emails in the cache
class ProcessEmailsTask(PeriodicTask):
run_every = timedelta(seconds=1)
def run(self, **kwargs):
call_email()
def call_email():
item_exists = True
ids = []
while item_exists:
try:
key = next(cache.iter_keys("email#*"))
ids.append(key.split("email#")[1])
cache.delete_pattern(key)
except:
item_exists = False
if len(ids) > 0:
send_emails_to(ids)
Step 3. Run both celery workers and celery beat and profit!

How to clear Django RQ jobs from a queue?

I feel a bit stupid for asking, but it doesn't appear to be in the documentation for RQ. I have a 'failed' queue with thousands of items in it and I want to clear it using the Django admin interface. The admin interface lists them and allows me to delete and re-queue them individually but I can't believe that I have to dive into the django shell to do it in bulk.
What have I missed?
The Queue class has an empty() method that can be accessed like:
import django_rq
q = django_rq.get_failed_queue()
q.empty()
However, in my tests, that only cleared the failed list key in Redis, not the job keys itself. So your thousands of jobs would still occupy Redis memory. To prevent that from happening, you must remove the jobs individually:
import django_rq
q = django_rq.get_failed_queue()
while True:
job = q.dequeue()
if not job:
break
job.delete() # Will delete key from Redis
As for having a button in the admin interface, you'd have to change django-rq/templates/django-rq/jobs.html template, who extends admin/base_site.html, and doesn't seem to give any room for customizing.
The redis-cli allows FLUSHDB, great for my local environment as I generate a bizzallion jobs.
With a working Django integration I will update. Just adding $0.02.
You can empty any queue by name using following code sample:
import django_rq
queue = "default"
q = django_rq.get_queue(queue)
q.empty()
or even have Django Command for that:
import django_rq
from django.core.management.base import BaseCommand
class Command(BaseCommand):
def add_arguments(self, parser):
parser.add_argument("-q", "--queue", type=str)
def handle(self, *args, **options):
q = django_rq.get_queue(options.get("queue"))
q.empty()
As #augusto-men method seems not to work anymore, here is another solution:
You can use the the raw connection to delete the failed jobs. Just iterate over rq:job keys and check the job status.
from django_rq import get_connection
from rq.job import Job
# delete failed jobs
con = get_connection('default')
for key in con.keys('rq:job:*'):
job_id = key.decode().replace('rq:job:', '')
job = Job.fetch(job_id, connection=con)
if job.get_status() == 'failed':
con.delete(key)
con.delete('rq:failed:default') # reset failed jobs registry
The other answers are outdated with the RQ updates implementing Registries.
Now, you need to do this to loop through and delete failed jobs. This would work any particular Registry as well.
import django_rq
from rq.registry import FailedJobRegistry
failed_registry = FailedJobRegistry('default', connection=django_rq.get_connection())
for job_id in failed_registry.get_job_ids():
try:
failed_registry.remove(job_id, delete_job=True)
except:
# failed jobs expire in the queue. There's a
# chance this will raise NoSuchJobError
pass
Source
You can empty a queue from the command line with:
rq empty [queue-name]
Running rq info will list all the queues.