Push results of one job to another job with django-rq

I'm building an application in which users' information is frequently updated from external APIs. For that, I'm using django-rq.
The first job is scheduled once a day to get the users who need to update their profiles.
For each result returned by the first job, I want to schedule another job that fetches the new user information from the remote API and updates the user in my database.
# tasks.py
import requests

from django_rq import job


@job('default')
def get_users_to_update():
    """Get a list of users who need an update."""
    users = User.objects.filter(some_condition_here...)
    return users


@job('default')
def remote_update_user(user):
    """Call the external API to get new user information."""
    url = 'http://somwhere.io/api/users/{}'.format(user.id)
    headers = {'Authorization': "Bearer {}".format(user.access_token)}
    # Send the request; this will probably take a long time
    r = requests.get(url, headers=headers)
    data = r.json()  # new user data
    for key, value in data.items():
        setattr(user, key, value)
    user.save()
    return user
I can schedule those two jobs like the following:
# update_info.py
from datetime import datetime

import django_rq

import tasks

scheduler = django_rq.get_scheduler('default')

scheduler.schedule(
    scheduled_time=datetime.utcnow(),
    func=tasks.get_users_to_update,
    interval=24 * 60 * 60
)

scheduler.schedule(
    scheduled_time=datetime.utcnow(),
    func=tasks.remote_update_user,
    interval=24 * 60 * 60
)
However, this is certainly not what I want. I'm wondering if there is any way in django-rq to be notified when get_users_to_update is finished, get its results, and then schedule remote_update_user.
rq allows depends_on for declaring that a submitted job depends on another job, but it seems that such functionality isn't available in django-rq.
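One pattern that would sidestep this, sketched below, is to have the scheduled job fan out the follow-up work itself: get_users_to_update enqueues one remote_update_user job per user it finds. This is only a sketch; the needs_update filter stands in for the real query, and django_rq.get_queue() returns a plain rq Queue, so depends_on could also be used there if explicit job dependencies are preferred.
# Sketch: fan out follow-up jobs from inside the scheduled job.
# The `needs_update` filter is a placeholder for the real condition.
import django_rq
from django_rq import job


@job('default')
def get_users_to_update():
    """Find users who need an update and enqueue one update job per user."""
    users = User.objects.filter(needs_update=True)  # placeholder condition
    for user in users:
        # Enqueue the second task with each result of the first one.
        django_rq.enqueue(remote_update_user, user)
    return users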

Related

Django API sudden degradation in performance

I am working on a project where all REST APIs are implemented in Django and the UI is handled by Vue.js. Since last week, our dev and QA environments have been seeing long TTFB response times even for simple select queries against the Postgres database. To investigate this I added some logging to see how much time it takes to execute the query. Here is the code:
import datetime

from django.core.signals import request_started, request_finished
from django.conf import settings


class TenantListViewSet(APIView):
    """
    API endpoint that returns all tenants present in member_master table.
    """

    @ldap_login_required
    @method_decorator(login_required)
    def get(self, request):
        settings.get_start_time = datetime.datetime.now()
        print(f'Total time from started to get begin {settings.get_start_time-settings.request_start_time}')
        errors = list()
        messages = list()
        try:
            settings.db_query_start_time = datetime.datetime.now()
            tenant_list_objects = list(MemberMaster.objects.all())
            print(MemberMaster.objects.all().explain(verbose=True, analyze=True))
            settings.db_query_end_time = datetime.datetime.now()
            print(f'Total time taken to execute DB query {settings.db_query_end_time-settings.db_query_start_time}')
            db_start = datetime.datetime.now()
            serializer = MemberMasterSerializer(tenant_list_objects, many=True)
            db_end = datetime.datetime.now()
            print(f'Total time taken to serialize data {db_end-db_start}')
            tenant_list = serializer.data
            response = result_service("tenant_list", tenant_list, errors, messages)
            settings.get_end_time = datetime.datetime.now()
            print(f'Total time taken to finish get call {settings.get_end_time-settings.get_start_time}')
            return JsonResponse(response.result["result_data"], status=response.result["status_code"])
        except Exception as e:
            if isinstance(e, dict) and 'messages' in e:
                errors.extend(e.messages)
            else:
                errors.extend(e.args)
            response = result_service("tenant_list", [], errors, messages)
            return JsonResponse(response.result['result_data'], status=400)


def started(sender, **kwargs):  # First method that gets invoked when the GET API is called
    settings.request_start_time = datetime.datetime.now()


def finished(sender, **kwargs):  # Last method that gets invoked before sending the response back
    return_from_get = datetime.datetime.now() - settings.get_end_time
    total = datetime.datetime.now() - settings.request_start_time
    print(f'Total time from get return to finished {return_from_get}')
    print(f'API total time {total}')


request_started.connect(started)  # Hooked into Django signals to check the time
request_finished.connect(finished)  # Hooked into Django signals to check the time
These are the logs for the above API which has 126 rows in the table member_master
Total time taken to check user is authenticated 0:00:00.000374
Total time from started to get begin 0:00:02.060007
Seq Scan on member_master (cost=0.00..11.20 rows=120 width=634) (actual time=0.004..0.032 rows=126 loops=1)
Output: member_master_id, member_name, mbr_sk, mbr_uid, member_guid
Planning time: 0.028 ms
Execution time: 0.065 ms
Total time taken to execute DB query 0:00:02.026839
Total time taken to serialize data 0:00:00.000132
Total time taken to finish get call 0:00:02.267809
Total time from get return to finished 0:00:00.001454
API total time 0:00:04.329273
As you can see, the execution time of the query itself is 0.065 ms, but the time taken by Django is about 2 seconds. I tried a raw query instead of list(MemberMaster.objects.all()) and the issue still exists. Also, I am not able to reproduce this in my local environment, but it happens in both the dev and QA environments; hence I did not use django-debug-toolbar. No changes were made recently to any model, and this issue exists for all APIs in the project. Any idea what might be the issue here?
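Since the planner and execution times reported by Postgres are tiny while the ORM call takes about two seconds, one thing worth timing separately is connection setup versus the query itself. This is only a diagnostic sketch, not part of the original view; run it in a shell on the affected environment:
# Diagnostic sketch: is the time spent opening the connection or running the query?
import datetime

from django.db import connection

start = datetime.datetime.now()
connection.ensure_connection()  # opens a new connection if none is open yet
print('Connection setup:', datetime.datetime.now() - start)

start = datetime.datetime.now()
with connection.cursor() as cursor:
    cursor.execute('SELECT * FROM member_master')
    rows = cursor.fetchall()
print('Query on the open connection:', datetime.datetime.now() - start)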
Model code
class MemberMaster(models.Model):
    member_master_id = models.AutoField(primary_key=True)
    member_name = models.CharField(max_length=256)
    mbr_sk = models.ForeignKey('Member', on_delete=models.DO_NOTHING, db_column='mbr_sk')
    mbr_uid = models.CharField(max_length=32, blank=True, null=True)
    member_guid = models.UUIDField(blank=True, null=True)

    class Meta:
        managed = False
        db_table = 'member_master'
        app_label = 'queryEditor'

Saving a Celery task (for re-running) in the database

Our workflow is currently built around an old version of Celery, so bear in mind things are already not optimal. We need to run a task and save a record of that task run in the database. If that task fails or hangs (it happens often), we want to re-run it exactly as it was run the first time. This shouldn't happen automatically, though; it needs to be triggered manually depending on the nature of the failure, and the result needs to be logged in the DB to make that decision (via a front end).
How can we save a complete record of a task in the DB so that a subsequent process can grab the record and run a new identical task? The current implementation saves the path of the @task-decorated function in the DB as part of a TaskInfo model. When the task needs to be rerun, we have a get_task() method on the TaskInfo model that reads the path from the DB and imports the function using import_module and getattr, and a rerun() method that runs the task again with the *args and **kwargs that were also saved in the DB.
Like so (these are methods on the TaskInfo model instance):
def get_task(self):
    """Return the task's decorated function, which can be delayed."""
    module_name, object_name = self.path.rsplit('.', 1)
    module = import_module(module_name)
    task = getattr(module, object_name)
    if inspect.isclass(task):
        task = task()
    # task = current_app.tasks[self.path]
    return task

def rerun(self):
    """Re-run the task, and replace this one.

    - A new task is scheduled to run.
    - The new task's TaskInfo has the same parent as this TaskInfo.
    - This TaskInfo is deleted.
    """
    args, kwargs = self.get_arguments()
    celery_task = self.get_task()
    async_result = celery_task.delay(*args, **kwargs)  # keep the result to get the new task id
    defaults = {
        'path': self.path,
        'status': Status.PENDING,
        'timestamp': timezone.now(),
        'args': args,
        'kwargs': kwargs,
        'parent': self.parent,
    }
    TaskInfo.objects.update_or_create(task_id=async_result.id, defaults=defaults)
    self.delete()
There must be a cleaner solution for saving a task in the DB to rerun later, right?
The latest version of Celery (4.4.0) includes a config option, result_extended. If you set it to True, the table in the result backend database (named celery_taskmeta by default) will store the args and kwargs of the task.
Here is a demo:
import time

from celery import Celery

app = Celery('test_result_backend')
app.conf.update(
    broker_url='redis://localhost:6379/10',
    result_backend='db+mysql://root:passwd@localhost/celery_toys',
    result_extended=True
)


@app.task(bind=True, name='add')
def add(self, x, y):
    self.request.task_name = 'add'  # For saving the task name.
    time.sleep(5)
    return x + y
With the task info recorded in MySQL, you are able to re-run your task easily.
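For example, here is a sketch under the assumption that you keep the original task_id in your own records (such as the TaskInfo model above): with result_extended=True the stored name, args and kwargs are exposed on AsyncResult, so a failed task can be resubmitted without importing the function by its path.
# Sketch: re-run a recorded task from the extended result metadata.
# `task_id` is assumed to come from your own bookkeeping (e.g. TaskInfo).
from celery.result import AsyncResult

result = AsyncResult(task_id, app=app)
if result.failed():
    # result.name / result.args / result.kwargs are populated because
    # result_extended=True is set on the app.
    app.send_task(result.name, args=result.args, kwargs=result.kwargs)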

How can I set a timeout on Dataflow?

I am using Composer to run my Dataflow pipeline on a schedule. If the job takes longer than a certain amount of time, I want it to be killed. Is there a way to do this programmatically, either as a pipeline option or a DAG parameter?
Not sure how to do it as a pipeline config option, but here is an idea.
You could launch a Task Queue task with its countdown set to your timeout value. When that task runs, you can check whether your Dataflow job is still running:
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/list
If it is, you can call update on it with the job state JOB_STATE_CANCELLED:
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/update
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs#jobstate
This is done through the googleapiclient lib: https://developers.google.com/api-client-library/python/apis/discovery/v1
Here is an example of how to use it
class DataFlowJobsListHandler(InterimAdminResourceHandler):
    def get(self, resource_id=None):
        """
        Wrapper to this:
        https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/list
        """
        if resource_id:
            self.abort(405)
        else:
            credentials = GoogleCredentials.get_application_default()
            service = discovery.build('dataflow', 'v1b3', credentials=credentials)
            project_id = app_identity.get_application_id()
            _filter = self.request.GET.pop('filter', 'UNKNOWN').upper()
            jobs_list_request = service.projects().jobs().list(
                projectId=project_id,
                filter=_filter)  # e.g. 'ACTIVE'
            jobs_list = jobs_list_request.execute()
            return {
                '$cursor': None,
                'results': jobs_list.get('jobs', []),
            }
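To actually kill the job once it shows up in that list, the same client can call projects.jobs.update with requestedState set to JOB_STATE_CANCELLED. A sketch follows; project_id and job_id are assumed to come from the list call above:
# Sketch: cancel a running Dataflow job through the REST API.
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
service = discovery.build('dataflow', 'v1b3', credentials=credentials)

service.projects().jobs().update(
    projectId=project_id,  # assumed known
    jobId=job_id,          # assumed known, e.g. from the jobs list above
    body={'requestedState': 'JOB_STATE_CANCELLED'},
).execute()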

Set dynamic scheduling with celerybeat

I have a send_time field in my Notification model. I want to send the notification to all mobile clients at that time.
What I am doing right now: I have created a task and scheduled it to run every minute.
tasks.py
@app.task(name='app.tasks.send_notification')
def send_notification():
    # logic here filters notifications that fall inside that 1-minute time span
    cron.push_notification()
settings.py
CELERYBEAT_SCHEDULE = {
    'send-notification-every-1-minute': {
        'task': 'app.tasks.send_notification',
        'schedule': crontab(minute="*/1"),
    },
}
Everything works as expected.
Question:
Is there any way to schedule the task as per the send_time field, so I don't have to run the task every minute?
More specifically, I want to create a new task instance whenever my Notification model gets a new entry, and schedule it according to the send_time field of that record.
Note: I am using the new integration of Celery with Django, not the django-celery package.
To execute a task at a specified date and time you can use the eta argument of apply_async when calling the task, as mentioned in the docs.
After creating the notification object you can call your task like this:
# here obj is your notification object; you can send extra information in kwargs
send_notification.apply_async(kwargs={'obj_id': obj.id}, eta=obj.send_time)
Note: send_time should be a datetime.
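The task then needs to accept that keyword argument. A minimal sketch, assuming the Notification model and the existing push logic from the question:
# Sketch: the task receives the notification id passed via kwargs above.
@app.task(name='app.tasks.send_notification')
def send_notification(obj_id):
    notification = Notification.objects.get(pk=obj_id)  # assumed model from the question
    cron.push_notification()  # in real code, push only this notification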
You have to use PeriodicTask and CrontabSchedule to schedule the task; these can be imported from djcelery.models.
So the code will be like:
from djcelery.models import PeriodicTask, CrontabSchedule
crontab, created = CrontabSchedule.objects.get_or_create(minute='*/1')
periodic_task_obj, created = PeriodicTask.objects.get_or_create(name='send_notification', task='send_notification', crontab=crontab, enabled=True)
Note: you have to write the full path to the task, like 'app.tasks.send_notification'.
You can schedule the notification task in a post_save handler of the Notification model like this:
import json

from django.db.models.signals import post_save
from django.dispatch import receiver


@receiver(post_save, sender=Notification)
def schedule_notification(sender, instance, *args, **kwargs):
    """
    instance is the Notification model object
    """
    # create a crontab according to your notification object.
    # there are more options, like day_of_week, day_of_month etc., you can pass when creating the CrontabSchedule object.
    crontab, created = CrontabSchedule.objects.get_or_create(minute=instance.send_time.minute, hour=instance.send_time.hour)
    # name must be unique per notification; task is the full path mentioned in the note above
    periodic_task_obj, created = PeriodicTask.objects.get_or_create(name='send_notification_{}'.format(instance.pk), task='app.tasks.send_notification')
    periodic_task_obj.crontab = crontab
    periodic_task_obj.enabled = True
    # you can also pass kwargs to your task like this
    periodic_task_obj.kwargs = json.dumps({"notification_id": instance.pk})
    periodic_task_obj.save()

Celery group multiple tasks in one design

I'm just getting familiar with Celery and have a question. My setup is Django-Redis-Celery.
Let's take the example of a task that sends email:
TASKS
@task
def send_email(message):
    mailserver.sendOneMessage(message)
VIEWS
class newaccount(APIView):
    def post(self, request, format=None):
        send_email.delay(request.data['email'])
This works perfectly: Django sends messages to Redis, and those are picked up by Celery, which then executes the task. But I want to improve the system so that Celery picks up all messages from Redis at certain intervals and executes a single task with multiple messages. This is because connecting to the email server is slow, and sending multiple messages in a single request will result in a faster process.
I want something like this to work:
TASKS
@task
def send_emails(messages):
    mailserver.sendMultipleMessages(messages)
Thoughts?
Since I am already using Redis as a cache (django-redis), I implemented the following workflow:
Step 1. Create a task that adds new emails to the cache:
@shared_task()
def add_email(user_id):
    cache.set("email#{}".format(user_id), None, timeout=None)
Step 2. Create a periodic task that runs every second and looks for new emails in the cache:
class ProcessEmailsTask(PeriodicTask):
    run_every = timedelta(seconds=1)

    def run(self, **kwargs):
        call_email()


def call_email():
    item_exists = True
    ids = []
    while item_exists:
        try:
            key = next(cache.iter_keys("email#*"))
            ids.append(key.split("email#")[1])
            cache.delete_pattern(key)
        except StopIteration:
            item_exists = False
    if len(ids) > 0:
        send_emails_to(ids)
Step 3. Run both celery workers and celery beat and profit!
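The send_emails_to() call in Step 2 is left undefined above; a minimal sketch, where User, build_message and mailserver.sendMultipleMessages are placeholders or taken from the question, could be:
# Sketch: a batching task for the ids collected by call_email().
@shared_task()
def send_emails_to(user_ids):
    users = User.objects.filter(pk__in=user_ids)
    messages = [build_message(user) for user in users]  # hypothetical helper
    # One connection to the mail server for the whole batch.
    mailserver.sendMultipleMessages(messages)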