I'm running on a development environment, so this may be different in production, but when I run a task from Django Celery, it seems to only fetch tasks from the broker every 10-20 seconds. I'm only testing at this point, but let's say I'm sending around 1,000 tasks; at that rate it will take over 5 hours to complete.
Is this normal? Should it be quicker? Or am I doing something wrong?
This is my task
import logging

from celery import Task  # or `from celery.task import Task` on older Celery versions

from .models import Gateway, Message  # model import path assumed


class SendMessage(Task):
    name = "Sending SMS"
    max_retries = 10
    default_retry_delay = 3

    def run(self, message_id, gateway_id=None, **kwargs):
        logging.debug("About to send a message.")

        # Because we don't always have control over transactions
        # in our calling code, we will retry up to 10 times, every 3
        # seconds, in order to try to allow for the commit to the database
        # to finish. That gives the server 30 seconds to write all of
        # the data to the database, and finish the view.
        try:
            message = Message.objects.get(pk=message_id)
        except Exception as exc:
            raise self.retry(exc=exc)

        if not gateway_id:
            if hasattr(message.billee, 'sms_gateway'):
                gateway = message.billee.sms_gateway
            else:
                gateway = Gateway.objects.all()[0]
        else:
            gateway = Gateway.objects.get(pk=gateway_id)

        # response = gateway._send(message)
        print(message_id)

        logging.debug("Done sending message.")
which gets run from my view
for e in Contact.objects.filter(contact_owner=request.user etc etc):
    SendMessage.delay(e.id, message)
Yes, this is normal. This is the default worker behaviour; the default is set conservatively so that task processing will not affect the performance of the rest of the app.
There is a way to change it. The task decorator can take a number of options that change the way the task behaves, and any keyword argument passed to the task decorator will be set as an attribute of the resulting task class.
You can set a rate limit, which limits the number of tasks that can be run in a given time frame.
# "100/m" means one hundred tasks per minute; "/s" (second) and "/h" (hour) also work
CELERY_DEFAULT_RATE_LIMIT = "100/m"  # set in settings.py
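For example, the same limit can also be set per task; a minimal sketch assuming the decorator-based API and a Celery app instance named app (both illustrative, since the question's code uses a class-based task):

from celery import Celery

app = Celery('sms')  # illustrative app name

# Any keyword argument passed to the decorator becomes an attribute of the resulting
# task class, so rate_limit plays the same role as CELERY_DEFAULT_RATE_LIMIT, per task.
@app.task(rate_limit="100/m")
def send_message(message_id, gateway_id=None):
    ...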
I'm using the apscheduler lib to run a cron job every 5 minutes so that it will send push notifications, but it is sending them twice.
from datetime import datetime

from apscheduler.schedulers.background import BackgroundScheduler

# `app` (the Flask app) and `schedule_notifications` come from elsewhere in the project.

job_defaults = {
    'coalesce': True,  # accumulated (missed) runs are collapsed into a single run
    'max_instances': 1000
}
schedule = BackgroundScheduler(job_defaults=job_defaults)


@app.before_first_request
@schedule.scheduled_job('interval', minutes=5)
def timed_job():
    with app.app_context():
        time = datetime.now().replace(microsecond=0)
        print("calling create {0}".format(time))
        schedule_notifications.create()
I have tried this solution:
apscheduler in Flask executes twice
But it is still being called twice.
I have also updated this in the Flask app:
app.run(use_reloader=False)
I suspect the master and child processes are both calling it; is there a way to stop this?
I have a script that loops through the rows of an external csv file (about 12,000 rows) and executes a single Model.objects.get() query to retrieve each item from the database (the final product will be much more complicated, but right now it's stripped down to the barest functionality possible to try to figure this out).
For now, the path to the local csv file is hardcoded into the script. When I run the script through the shell using py manage.py runscript update_products_from_csv, it runs in about 6 seconds.
The ultimate goal is to be able to upload the csv through the admin and then have the script run from there. I've already been able to accomplish that, but the runtime when I do it that way takes more like 160 seconds. The view for that in the admin looks like...
from django import forms
from django.contrib import admin, messages
from django.http import HttpResponseRedirect
from django.urls import path, reverse

from apps.products.models import Product

from .scripts import update_products_from_csv


class CsvUploadForm(forms.Form):
    csv_file = forms.FileField(label='Upload CSV')


@admin.register(Product)
class ProductAdmin(admin.ModelAdmin):
    # list_display, list_filter, fieldsets, etc

    def changelist_view(self, request, extra_context=None):
        extra_context = extra_context or {}
        extra_context['csv_upload_form'] = CsvUploadForm()
        return super().changelist_view(request, extra_context=extra_context)

    def get_urls(self):
        urls = super().get_urls()
        new_urls = [path('upload-csv/', self.upload_csv), ]
        return new_urls + urls

    def upload_csv(self, request):
        if request.method == 'POST':
            # csv_file = request.FILES['csv_file'].file
            # result_string = update_products_from_csv.run(csv_file)
            # I commented out the above two lines and added the below line to rule out
            # the possibility that the csv upload itself was the problem. Whether I execute
            # the script using the uploaded file or let it use the hardcoded local path,
            # the results are the same. It works, but takes more than 20 times longer
            # than executing the same script from the shell.
            result_string = update_products_from_csv.run()
            print(result_string)
            messages.success(request, result_string)
        return HttpResponseRedirect(reverse('admin:products_product_changelist'))
Right now the actual running parts of the script are about as simple as this...
import csv
from time import time

from apps.products.models import Product

CSV_PATH = 'path/to/local/csv_file.csv'


def run():
    csv_data = get_csv_data()
    update_data = build_update_data(csv_data)
    update_handler(update_data)
    return 'Finished'


def get_csv_data():
    with open(CSV_PATH, 'r') as f:
        return [d for d in csv.DictReader(f)]


def build_update_data(csv_data):
    update_data = []
    # Code that loops through csv data, applies some custom logic, and builds a list of
    # dicts with the data cleaned and formatted as needed
    return update_data


def update_handler(update_data):
    query_times = []
    for upd in update_data:
        iter_start = time()
        product_obj = Product.objects.get(external_id=upd['external_id'])
        # external_id is not the primary key but is an indexed field in the Product model
        query_times.append(time() - iter_start)
    # Code to export query_times to an external file for analysis
update_handler() has a bunch of other code checking field values to see if anything needs to be changed, and building the objects when a match does not exist, but that's all commented out right now. As you can see, I'm also timing each query and logging those values. (I've been dropping time() calls in various places all day and have determined that the query is the only part that's noticeably different.)
When I run it from the shell, the average query time is 0.0005 seconds and the total of all query times comes out to about 6.8 seconds every single time.
When I run it through the admin view and then check the queries in Django Debug Toolbar it shows the 12,000+ queries as expected, and shows a total query time of only about 3900ms. But when I look at the log of query times gathered by the time() calls, the average query time is 0.013 seconds (26 times longer than when I run it through the shell), and the total of all query times always comes out at 156-157 seconds.
The queries in Django Debug Toolbar when I run it through the admin all look like SELECT ••• FROM "products_product" WHERE "products_product"."external_id" = 10 LIMIT 21, and according to the toolbar they are mostly all 0-1ms. I'm not sure how I would check what the queries look like when running it from the shell, but I can't imagine they'd be different? I couldn't find anything in django-extensions runscript docs about it doing query optimizations or anything like that.
One additional interesting facet is that when running it from the admin, from the time I see result_string print in the terminal, it's another solid 1-3 minutes before the success message appears in the browser window.
I don't know what else to check. I'm obviously missing something fundamental, but I don't know what.
Somebody on Reddit suggested that running the script from the shell might be automatically spinning up a new thread where the logic can run unencumbered by the other Django server processes, and this seems to be the answer. If I run the script in a new thread from the admin view, it runs just as fast as it does when I run it from the shell.
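A minimal sketch of that thread-based approach, reusing the upload_csv view and the update_products_from_csv script from the question (the exact wiring is an assumption, not the poster's code):

import threading

class ProductAdmin(admin.ModelAdmin):
    # ... same admin configuration as in the question ...

    def upload_csv(self, request):
        if request.method == 'POST':
            # Run the long-running import in its own thread so it is not held back
            # by, and does not block, the admin request/response cycle.
            threading.Thread(target=update_products_from_csv.run).start()
            messages.success(request, 'CSV import started in the background.')
        return HttpResponseRedirect(reverse('admin:products_product_changelist'))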
Currently I have a method that retrieves all ~119,000 Gmail accounts and writes them to a csv file, using the Python code below with the Admin SDK enabled and OAuth 2.0:
def get_accounts(self):
    students = []
    page_token = None
    params = {'customer': 'my_customer'}

    while True:
        try:
            if page_token:
                params['pageToken'] = page_token
            current_page = self.dir_api.users().list(**params).execute()
            students.extend(current_page['users'])

            # write each page of data to a file
            csv_file = CSVWriter(students, self.output_file)
            csv_file.write_file()

            # clear the list for the next page of data
            del students[:]

            page_token = current_page.get('nextPageToken')
            if not page_token:
                break
        except errors.HttpError as error:
            break
I would like to retrieve all 119,000 as a lump sum, that is, without having to loop, or as a batch call. Is this possible, and if so, can you provide example Python code? I have run into communication issues and have to rerun the process multiple times to obtain the ~119,000 accounts successfully (it takes about 10 minutes to download). I would like to minimize communication errors. Please advise if a better method exists or if a non-looping method is possible.
There's no way to do this as a batch because you need to know each pageToken and those are only given as the page is retrieved. However, you can increase your performance somewhat by getting larger pages:
params = {'customer': 'my_customer', 'maxResults': 500}
since the default page size when maxResults is not set is 100, adding maxResults: 500 will reduce the number of API calls by a factor of five. While each call may take slightly longer, you should notice performance increases because you're making far fewer API calls and HTTP round trips.
You should also look at using the fields parameter to specify only the user attributes you need to read from the list. That way you're not wasting time and bandwidth retrieving details about your users that your app never uses. Try something like:
my_fields = 'nextPageToken,users(primaryEmail,name,suspended)'
params = {
    'customer': 'my_customer',
    'maxResults': 500,
    'fields': my_fields
}
Last of all, if your app retrieves the list of users fairly frequently, turning on caching may help.
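A sketch of what enabling caching could look like with the older oauth2client/httplib2 stack that the Admin SDK Python samples used; the credentials storage file and cache directory names are assumptions:

import httplib2
from apiclient.discovery import build
from oauth2client.file import Storage

# Reuse whatever OAuth 2.0 credentials your existing code already obtains;
# the storage file name here is purely illustrative.
credentials = Storage('admin_directory.dat').get()

# Passing a directory name to httplib2.Http enables an on-disk HTTP response cache.
http = credentials.authorize(httplib2.Http(cache='.api_cache'))
dir_api = build('admin', 'directory_v1', http=http)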
I understand that Django's cache functions expire after a specified time interval has elapsed (e.g. 1 minute, 1 hour, etc.), but I have some content that changes on a daily basis (e.g. "message of the day"). Ideally this would be cached for 24 hours, but if I set the timeout to 24 hours there's no guarantee that the cache will expire precisely at midnight. What is the best practice for handling this situation?
Two easy options spring to mind, both involving a scheduled task that needs to run at (say) midnight.
1) Get ahead of the game: Schedule some code to run (eg a custom management command) that asks for your 'message of the day' content at midnight, with a 24hr expiry. (This assumes the relevant cache key is not set yet.)
2) Go nuclear: schedule a cache purge at midnight
or, combining the two:
Don't go nuclear; just schedule a call that only deletes the MOTD key (eg cache.delete('motd_key')) at midnight, then cache the new one instead.
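A minimal sketch of that combined option as a custom management command to schedule at midnight; the key name, timeout, and the way the new message is obtained are all illustrative:

from django.core.cache import cache
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Replace the cached message of the day at midnight."

    def handle(self, *args, **options):
        cache.delete('motd_key')                   # drop yesterday's message
        motd = "New message of the day"            # or load it from the database
        cache.set('motd_key', motd, 60 * 60 * 24)  # cache it for the next 24 hours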
Alternatively, if you use Redis as your cache backend, you could cache the MOTD, then make an EXPIREAT call to set that cached MOTD entry to expire at 23:59:59. redis.py will let you do that in a Pythonic way.
If you're using Memcached as your backend, you don't get that level of control.
(And if you're using locmem://, you're Doing It Wrong ;o) )
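For the Redis route mentioned above, a minimal sketch using redis-py directly; the key name and message are illustrative:

from datetime import datetime, time

import redis

r = redis.Redis()
r.set('motd_key', 'Message of the day')

# EXPIREAT takes an absolute Unix timestamp; expire the key at 23:59:59 today.
expire_at = datetime.combine(datetime.now().date(), time(23, 59, 59))
r.expireat('motd_key', int(expire_at.timestamp()))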
Why not just implement a custom cache instead of introducing another side effect like scheduled jobs?
Create a cache class like so:
from datetime import datetime, timedelta

from django.core.cache.backends.locmem import LocMemCache


class MidnightCacher(LocMemCache):
    def __init__(self, name, params):
        super().__init__(name, params)

    def get_backend_timeout(self, timeout=None):
        # Return an absolute expiry timestamp: the coming midnight (UTC)
        return (datetime.utcnow() + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0).timestamp()
Configure your cache in settings.py
CACHES = {
    'midnight': {
        'BACKEND': 'backend.midnight_cache.MidnightCacher',
        'LOCATION': 'unique-snowflake',
    }
}
And finally, decorate your view:
@cache_page(1, cache="midnight")
def get_something(request):
    pass
I am trying to get Django-cron running, but it seems to only run after hitting the site once. I am using Virtualenv.
Any ideas why it only runs once?
On my PATH, I added the location of django_cron: '/Users/emilepetrone/Workspace/zapgeo/zapgeo/django_cron'
My cron.py file within my Django app:
from django_cron import cronScheduler, Job

from products.views import index


class GetProducts(Job):
    run_every = 60

    def job(self):
        index()


cronScheduler.register(GetProducts)


class GetLocation(Job):
    run_every = 60

    def job(self):
        index()


cronScheduler.register(GetLocation)
The first possible reason
There is a variable in django_cron/base.py:
# how often to check if jobs are ready to be run (in seconds)
# in reality if you have a multithreaded server, it may get checked
# more often that this number suggests, so keep an eye on it...
# default value: 300 seconds == 5 min
polling_frequency = getattr(settings, "CRON_POLLING_FREQUENCY", 300)
So the minimum interval at which django-cron checks whether it is time to start your task is polling_frequency. You can change it with a setting in your project's settings.py:
CRON_POLLING_FREQUENCY = 100 # use your custom value in seconds here
To start a job, hit your server at least once after starting the Django web server.
The second possible reason
Your job has an error and is not queued (the queued flag is set to 'f' if your job raises an exception). In this case the string value 'f' is stored in the queued field of the django_cron_job table. You can check it with this query:
select queued from django_cron_job;
If you change the code of your job, the field may stay at 'f'. So if you correct the error in your job, you should manually set the queued field back to 't'. Alternatively, the executing flag in the django_cron_cron table may be 't', which means your app server was stopped while your task was in progress. In that case you should manually set it back to 'f'.
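A sketch of resetting those flags from Python rather than raw SQL, assuming the table and column names mentioned above (run it from manage.py shell):

from django.db import connection

with connection.cursor() as cursor:
    # Re-queue jobs that were flagged 'f' after raising an exception, once the job code is fixed.
    cursor.execute("UPDATE django_cron_job SET queued = 't' WHERE queued = 'f'")
    # Clear a stale 'executing' flag left behind if the server was stopped mid-run.
    cursor.execute("UPDATE django_cron_cron SET executing = 'f' WHERE executing = 't'")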