Django: Script that executes many queries runs massively slower when executed from Admin view than when executed from shell

I have a script that loops through the rows of an external csv file (about 12,000 rows) and executes a single Model.objects.get() query to retrieve each item from the database (final product will be much more complicated but right now it's stripped down to the barest functionality possible to try to figure this out).
For right now the path to the local csv file is hardcoded into the script. When I run the script through the shell using py manage.py runscript update_products_from_csv it runs in about 6 seconds.
The ultimate goal is to be able to upload the csv through the admin and then have the script run from there. I've already been able to accomplish that, but the runtime when I do it that way takes more like 160 seconds. The view for that in the admin looks like...
from django import forms
from django.contrib import admin, messages
from django.http import HttpResponseRedirect
from django.urls import path, reverse

from apps.products.models import Product
from .scripts import update_products_from_csv

class CsvUploadForm(forms.Form):
    csv_file = forms.FileField(label='Upload CSV')

@admin.register(Product)
class ProductAdmin(admin.ModelAdmin):
    # list_display, list_filter, fieldsets, etc

    def changelist_view(self, request, extra_context=None):
        extra_context = extra_context or {}
        extra_context['csv_upload_form'] = CsvUploadForm()
        return super().changelist_view(request, extra_context=extra_context)

    def get_urls(self):
        urls = super().get_urls()
        new_urls = [path('upload-csv/', self.upload_csv),]
        return new_urls + urls

    def upload_csv(self, request):
        if request.method == 'POST':
            # csv_file = request.FILES['csv_file'].file
            # result_string = update_products_from_csv.run(csv_file)
            # I commented out the above two lines and added the below line to rule out
            # the possibility that the csv upload itself was the problem. Whether I execute
            # the script using the uploaded file or let it use the hardcoded local path,
            # the results are the same. It works, but takes more than 20 times longer
            # than executing the same script from the shell.
            result_string = update_products_from_csv.run()
            print(result_string)
            messages.success(request, result_string)
        return HttpResponseRedirect(reverse('admin:products_product_changelist'))
Right now the actual running parts of the script are about as simple as this...
import csv
from time import time

from apps.products.models import Product

CSV_PATH = 'path/to/local/csv_file.csv'

def run():
    csv_data = get_csv_data()
    update_data = build_update_data(csv_data)
    update_handler(update_data)
    return 'Finished'

def get_csv_data():
    with open(CSV_PATH, 'r') as f:
        return [d for d in csv.DictReader(f)]

def build_update_data(csv_data):
    update_data = []
    # Code that loops through csv data, applies some custom logic, and builds a list of
    # dicts with the data cleaned and formatted as needed
    return update_data

def update_handler(update_data):
    query_times = []
    for upd in update_data:
        iter_start = time()
        product_obj = Product.objects.get(external_id=upd['external_id'])
        # external_id is not the primary key but is an indexed field in the Product model
        query_times.append(time() - iter_start)
    # Code to export query_times to an external file for analysis
update_handler() has a bunch of other code checking field values to see if anything needs to be changed, and building the objects when a match does not exist, but that's all commented out right now. As you can see, I'm also timing each query and logging those values. (I've been dropping time() calls in various places all day and have determined that the query is the only part that's noticeably different.)
When I run it from the shell, the average query time is 0.0005 seconds and the total of all query times comes out to about 6.8 seconds every single time.
When I run it through the admin view and then check the queries in Django Debug Toolbar it shows the 12,000+ queries as expected, and shows a total query time of only about 3900ms. But when I look at the log of query times gathered by the time() calls, the average query time is 0.013 seconds (26 times longer than when I run it through the shell), and the total of all query times always comes out at 156-157 seconds.
The queries in Django Debug Toolbar when I run it through the admin all look like SELECT ••• FROM "products_product" WHERE "products_product"."external_id" = 10 LIMIT 21, and according to the toolbar they are mostly all 0-1ms. I'm not sure how I would check what the queries look like when running it from the shell, but I can't imagine they'd be different? I couldn't find anything in django-extensions runscript docs about it doing query optimizations or anything like that.
One additional interesting facet is that when running it from the admin, from the time I see result_string print in the terminal, it's another solid 1-3 minutes before the success message appears in the browser window.
I don't know what else to check. I'm obviously missing something fundamental, but I don't know what.

Somebody on Reddit suggested that running the script from the shell might be automatically spinning up a new thread where the logic can run unencumbered by the other Django server processes, and this seems to be the answer. If I run the script in a new thread from the admin view, it runs just as fast as it does when I run it from the shell.
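For anyone who wants to replicate that, here is a minimal sketch of what spawning the script in its own thread from the admin view could look like (upload_csv and update_products_from_csv are the names from the question; the "import started" message is an assumption, since the view no longer waits for result_string):

import threading

def upload_csv(self, request):
    if request.method == 'POST':
        # Hand the long-running import off to a separate thread so it is not
        # executed inside the request/response cycle of the admin view.
        worker = threading.Thread(target=update_products_from_csv.run, daemon=True)
        worker.start()
        messages.success(request, 'CSV import started in the background.')
    return HttpResponseRedirect(reverse('admin:products_product_changelist'))

Note that with a plain thread the view no longer waits for (or reports) the script's result, and the work dies with the server process; for anything production-critical, a task queue such as Celery is the more robust way to move this off the request thread.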

Related

Does a local Python 'long time task' function stop running when the Flask app URL refreshes or changes, without using a task queue

Let me use the following to explain my question:
In my Flask app, the URL '/long_time_task' produces a result based on the output of foo(). foo() will take a long time to run.
I'd like to know:
While foo() is running (i.e. the URL was clicked) but not yet finished:
1> If the user refreshes the link, it will start another foo(). Will both results eventually be (saved in the db and) visible at the URL /show_db_result (if no error)?
2> If the user goes to other links, or logs out, will the result eventually be (saved in the db and) visible at the URL /show_db_result (if no error)?
I've tested the application myself, and all seems to work fine (i.e. yes to both questions) without Celery or any other MQ support.
from flask import Flask, session

app = Flask(__name__)

@app.route('/long_time_task')
def long_time_task():
    if session.get('logged_in'):
        foo()
        return 'some results'
    return 'please log in'

def foo():
    # also require log in
    # time consuming task, print('calculating')
    # when finished, save result to a database
    return 'msg'

@app.route('/page1')
def page1():
    return 'page1'

@app.route('/page2')
def page2():
    return 'page2'

@app.route('/show_db_result')
def show_db_result():
    # require log in to see user's result
    return 'all foo()\'s result row by row'
I understand that there are many articles about Flask long_time_task and Celery. I'd like to know more about the basic mechanisms.

Best way to update my django model coming from an external api source?

I am getting my data by requesting an API source, then I put it in my Django model. However, the data updates daily, so how can I update it without rendering the view every time?
import requests
from django.shortcuts import render

from .models import Coincap

def index(request):
    session = requests.Session()
    df = session.get('https://api.coincap.io/v2/assets')
    response = df.json()
    final_result = response['data']  # list of dicts straight from the API
    for coin in final_result:
        obj, created = Coincap.objects.update_or_create(
            symbol=coin['symbol'],
            name=coin['name'],
            defaults={
                'price': coin['priceUsd']
            })
    return render(request, 'home.html')
Right now, I have to go to /home.html if I want my data updated. However, my goal is to later serialize it and expose it as REST API data, so I wouldn't touch Django templates anymore. Is there any way for it to update internally once a day after I do manage.py runserver?
For those that are looking for an example:
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    def handle(self, *args, **kwargs):
        # Your API request here (builds final_result as in the view above)
        for coin in final_result:
            obj, created = Coincap.objects.update_or_create(
                symbol=coin['symbol'],
                name=coin['name'],
                defaults={
                    'price': coin['priceUsd']})
Then you run it with cron just as Nikita suggested.
One simple and common solution is to create a custom Django admin command and use Cron to run it at specified intervals. You can write a command's code to your liking and it can have access to all of the models, settings and other parts of your Django project.
You would put your code making the request and writing data to the DB, using your Django models, in your new Command class's handle() method (obviously the request parameter is no longer needed). And then, if for example you have named your command update_some_data, you can run it as python manage.py update_some_data.
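For reference, Django discovers such a command from a management/commands/ package inside one of your installed apps; the app name below is just a placeholder:

your_app/
    management/
        __init__.py
        commands/
            __init__.py
            update_some_data.py   # contains the Command class shown above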
Assuming cron exists and is running on the machine, you could then set up cron to run this command for you at specified intervals. For example, create a file /etc/cron.d/your_app_name and put
0 4 * * * www-data /usr/local/bin/python /path/to/your/manage.py update_some_data >> /var/log/update_some_data.log 2>&1
This would make your update run every day at 04:00. If your command provides any output, it will be written to the /var/log/update_some_data.log file.
Of course this is just an example, so your server user running your app (www-data here) and path to the Python executable on the server (/usr/local/bin/python here) should be adjusted for particular use.
See links for further guidance.

flask how to keep database queries references up to date

I am creating a Flask app with two panels, one for the admin and the other for users. In the app's structure I have a utilities file where I keep most of the redundant variables besides other functions (by redundant I mean I use them in many different parts of the application).
utilities.py
# ...
opening_hour = db_session.query(Table.column).one()[0] # 10:00 AM
# ...
The Table.column, or let's say the opening_hour variable's value above, is entered into the database by the admin through his/her web panel. This value prevents the users from accessing certain functionalities of the application before the specified hour.
The problem is:
If the admin changes that value through his/her web panel, let's say to 11:00 AM, the change is not reflected directly in the users' panel, even though it was entered into the database!
If I want the new opening_hour value to take effect, I have to manually shut down the app and restart it (and sometimes even this doesn't work).
I have tried adding gc.collect()... it did nothing. There must be a way around this other than shutting down and restarting the app manually. First, I doubt the admin will be able to do that. Second, even if he/she can, that would be really frustrating.
If someone can relate to this please explain why is this occurring and how to get around it. Thanks in advance :)
You are trying to add advanced logic to a simple variable: you want to query the DB only once, and periodically force the variable to update by re-loading the module. That's not how modules and the import mechanism are supposed to be used.
If you want to access a possibly changing value from the database, you have to read it over and over again.
The solution is, instead of a variable, to define a function opening_hours that executes the DB query every time you check the value:
def opening_hours():
    return (
        db_session.query(Table.column).one()[0],  # 10:00 AM
        db_session.query(Table.column).one()[1]   # 5:00 PM
    )
Now you may not want to have to query the Database every time you check the value, but maybe cache it for a few minutes. The easiest would be to use cachetools for that:
import cachetools

cache = cachetools.TTLCache(maxsize=10, ttl=60)  # Cache for 60 seconds

@cachetools.cached(cache)
def opening_hours():
    return (
        db_session.query(Table.column).one()[0],  # 10:00 AM
        db_session.query(Table.column).one()[1]   # 5:00 PM
    )
Also, since you are using Flask, you can create a route decorator that controls access to your views depending on the time of day
from datetime import datetime, time
from functools import wraps

from flask import g, request, render_template

def only_within_office_hours(f):
    @wraps(f)
    def decorated_function(*args, **kwargs):
        start_time, stop_time = opening_hours()
        # Reject the request when the current time falls outside office hours
        if not (start_time <= datetime.now().time() <= stop_time):
            return render_template('office_hours_error.html')
        return f(*args, **kwargs)
    return decorated_function
that you can use like
@app.route('/secret_page')
@login_required
@only_within_office_hours
def secret_page():
    pass

python code for directory api to batch retrieve all users from domain

Currently I have a method that retrieves all ~119,000 Gmail accounts and writes them to a csv file using the Python code below, with the Admin SDK enabled and OAuth 2.0:
def get_accounts(self):
    students = []
    page_token = None
    params = {'customer': 'my_customer'}
    while True:
        try:
            if page_token:
                params['pageToken'] = page_token
            current_page = self.dir_api.users().list(**params).execute()
            students.extend(current_page['users'])
            # write each page of data to a file
            csv_file = CSVWriter(students, self.output_file)
            csv_file.write_file()
            # clear the list for the next page of data
            del students[:]
            page_token = current_page.get('nextPageToken')
            if not page_token:
                break
        except errors.HttpError as error:
            break
I would like to retrieve all 119,000 in one lump sum, that is, without having to loop, or via a single batch call. Is this possible, and if so, can you provide example Python code? I have run into communication issues and have to rerun the process multiple times to obtain the ~119,000 accounts successfully (it takes about 10 minutes to download), so I would like to minimize communication errors. Please advise if a better method exists or if a non-looping method is possible.
There's no way to do this as a batch because you need to know each pageToken and those are only given as the page is retrieved. However, you can increase your performance somewhat by getting larger pages:
params = {'customer': 'my_customer', 'maxResults': 500}
since the default page size when maxResults is not set is 100, adding maxResults: 500 will reduce the number of API calls by a factor of 5. While each call may take slightly longer, you should notice a performance increase because you're making far fewer API calls and HTTP round trips.
You should also look at using the fields parameter to only specify user attributes you need to read in the list. That way you're not wasting time and bandwidth retrieving details about your users that your app never uses. Try something like:
my_fields = 'nextPageToken,users(primaryEmail,name,suspended)'
params = {
    'customer': 'my_customer',
    'maxResults': 500,
    'fields': my_fields
}
Last of all, if your app retrieves the list of users fairly frequently, turning on caching may help.
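A minimal sketch of that last idea, assuming the google-api-python-client with oauth2client-style credentials (the cache directory name is arbitrary): httplib2 can keep a local HTTP cache and reuse responses when the API marks them as cacheable.

import httplib2
from googleapiclient.discovery import build

# 'credentials' is assumed to be an already-obtained OAuth 2.0 credentials object
http = credentials.authorize(httplib2.Http(cache='.api_cache'))
dir_api = build('admin', 'directory_v1', http=http)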

Optimizing database queries in Django

I have a bit of code that is causing my page to load pretty slow (49 queries in 128 ms). This is the landing page for my site -- so it needs to load snappily.
The following is my views.py that creates a feed of latest updates on the site and is causing the slowest load times from what I can see in the Debug toolbar:
def product_feed(request):
    """ Return all site activity from friends, etc. """
    latestparts = Part.objects.all().prefetch_related('uniparts').order_by('-added')
    latestdesigns = Design.objects.all().order_by('-added')
    latest = list(latestparts) + list(latestdesigns)
    latestupdates = sorted(latest, key=lambda x: x.added, reverse=True)
    latestupdates = latestupdates[0:8]

    # only get the unique avatars that we need to put on the page so we're not
    # pinging for images for each update
    uniqueusers = User.objects.filter(id__in=Part.objects.values_list('adder', flat=True))

    return render_to_response("homepage.html", {
        "uniqueusers": uniqueusers,
        "latestupdates": latestupdates
    }, context_instance=RequestContext(request))
The query that takes the most time seems to be:
latest = list(latestparts) + list(latestdesigns) (25ms)
There are two others, at 17ms (sitewide announcements) and 25ms (adding tagged items on each product feed item) respectively, that I am also investigating.
Does anyone see any ways in which I can optimize the loading of my activity feed?
You never need more than 8 items, so limit your queries. And don't forget to make sure that added in both models is indexed.
latestparts = Part.objects.all().prefetch_related('uniparts').order_by('-added')[:8]
latestdesigns = Design.objects.all().order_by('-added')[:8]
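For example, indexing added could look like this on both models (a minimal sketch; DateTimeField is an assumption about the field type):

from django.db import models

class Part(models.Model):
    # ... other fields ...
    added = models.DateTimeField(db_index=True)  # indexed so order_by('-added') can use the index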
For bonus marks, eliminate the magic number.
After making those queries a bit faster, you might want to check out memcache to store the most common query results.
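A minimal sketch of that, using Django's low-level cache API (the cache key, the 5-minute timeout, and the helper name get_latest_updates are arbitrary choices; the memcached backend would be configured in settings.CACHES):

from django.core.cache import cache

LATEST_COUNT = 8  # the magic number, pulled out as a constant

def get_latest_updates():
    updates = cache.get('latest_updates')
    if updates is None:
        parts = list(Part.objects.prefetch_related('uniparts').order_by('-added')[:LATEST_COUNT])
        designs = list(Design.objects.order_by('-added')[:LATEST_COUNT])
        updates = sorted(parts + designs, key=lambda x: x.added, reverse=True)[:LATEST_COUNT]
        cache.set('latest_updates', updates, 60 * 5)  # cache the feed for 5 minutes
    return updates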
Moreover, I believe adder is a ForeignKey to the User model.
Part.objects.distinct().values_list('adder', flat=True)
The above line is a QuerySet with the unique adder values. I believe you meant exactly that.
It saves you performing a subquery.