Right way to delay file download in Django - django

I have a class-based view which triggers the composition and downloading of a report for a user.
Normally in def get of the class I just compile the report, add response['Content-Disposition'] = 'attachment; filename="somefilename.pdf"' and return response to a user.
The problem is that some reports are large and while they are compiling the request timeout happens.
I know that the right way of dealing with this is to delegate it to a background process (like Celery). But the problem is that it means that instead of creating a temporary file which ceases to exist the moment the user downloads a report, I have to store these reports somewhere, and write a cronjob which will regularly clean the reports directory.
Is there any more elegant way in Django to deal with this issue?

One solution less fancy than using celery is to use is Django's StreamingHttpResponse:
(https://docs.djangoproject.com/en/2.0/ref/request-response/#django.http.StreamingHttpResponse
With this, you use a generator function, which is a python function that uses yield to return its results as an iterator. This allows you to return the data as you generate it, rather than all at once at after you're finished. You can yield after each line or section of the report.. thus keeping a flow of data back to the browser.
But.. this only works if you are building up the finished file bit by bit.. for example, a CSV file. If you're returning something that you need to format all at once, for example if you're using something like wkhtmltopdf to generate a pdf file after you're done, then it's not as easy.
But there's still a solution:
What you can do in that case is, use StreamingHttpReponse along with a generator function to generate your report into a temporary file, instead of back to the browser. But as you are doing this, yield HTML snippets back to the browser which lets the user know the progress, eg:
def get(self, request, **kwargs):
# first you need a tempfile name.. do that however you like
tempfile = "kfjkdsjfksjfks"
# then you need to create a view which will open that file and serve it
# but I won't show that here.
# For security reasons it has to serve only out of one directory
# that is dedicated to this.
fetchurl = reverse('reportgetter_url') + '?file=' + tempfile
def reportgen():
yield 'Starting report generation..<br>'
# do some stuff to generate your report into the tempfile
yield 'Doing this..<br>'
# do this
yield 'Doing that..<br>'
# do that
yield 'Finished.<br>'
# when the browser receives this script, it'll go to fetchurl where
# you will send them the finished report.
yield '<script>document.location="%s";</script>' % fetchurl
return http.StreamingHttpResponse(reportgen())
That's not a complete example obviously, but should give you the idea.
When your user fetches this view, they will see the progress of the report as it comes along. At the end, you're sending the javacript which redirect the browser to the other view you will have to write which returns the response containing the finished file. When the browser gets this javacript, if the view returning the tempfile is setting the response Content-Disposition as an attachment before returning it, eg:
response['Content-Disposition'] = 'attachment; filename="%s"' % filename
..then the browser will stay on the current page showing your progress.. and simply pop up a file save dialog for the user.
For cleanup, you'll need a cron job regardless.. because if people don't wait around, they'll never pick up the report. Sometimes things don't work out... So you could just clean up files older than let's say 1 hour. For a lot of systems this is acceptable.
But if you want to clean up right away, what you can do, if you are on unix/linux, is to use an old unix filesystem trick: Files which are deleted while they are open are not really gone until they are closed. So, open your tempfile.. then delete it. Then return your response. As soon as the response has finished sending, the space used by the file will be freed.
PS: I should add.. that if you take this second approach, you could use one view to do both jobs.. just:
if `file` in request.GET:
# file= was in the url.. they are trying to get an already generated report
with open(thepathname) as f:
os.unlink(f)
# file has been 'deleted' but f is still a valid open file
response = HttpResponse( etc etc etc)
response['Content-Disposition'] = 'attachment; filename="thereport"'
response.write(f)
return response
else:
# generate the report
# as above

This is not really a Django question but a general architecture question.
You can always increase your server time outs but it would still, IMO, give you a bad user experience if the user has to sit watching the browser just spin.
Doing this on a background task is the only way to do it right. I don’t know how large the reports are, but using email can be a good solution. The background task simply generates the report, sends it via email and deletes it.
If the files are too large to send via email, then you will have to store them. Maybe send an email with a link and a message indicating the link will not work after X days/hours. Once you have a background worker, creating a daily or hourly clean up task would be very easy.
Hope it helps

Related

What is the advantage of flash() over using render_template() args?

Today I learned Flask's flash() method.
What I understood was, it tells template to show message once.
But it can also be done by render_template() method like below:
#app.route('/')
def index():
if condition:
error = "Some error messages to flash"
else:
error = None
return render_template('index.html', error=error)
This makes flash() method look like an unnecessary duplicate.
What kind of advantages or difference in practice are there to use flash over this?
When you issue a redirect to, say, /profile, how would you know if the user visited that page himself, or he's redirected as a result of an action (like after a login)?
flash() method saves the message to the user session, so that they don't get lost due to navigation.
You can't use render_template(name, message="success") in this case, because when you redirect the user, the browser doesn't care about the response, when it sees a 30x response, it reads the Location header and loads that URL in a new request, so whatever context you've set, or whatever you render, is all discarded. That's why you need to flash messages to pass messages between two consequent requests.
Also, read the explanation in flask docs, it's quite clear.
https://flask.palletsprojects.com/en/2.0.x/patterns/flashing/

PyMongo find query returns empty/partial cursor when running in a Django+uWsgi project

We developed a REST API using Django & mongoDB (PyMongo driver). The problem is that, on some requests to the API endpoints, PyMongo cursor returns a partial response which contains less documents than it should (but it’s a completely valid JSON document).
Let me explain it with an example of one of our views:
def get_data(key):
return collection.find({'key': key}, limit=24)
def my_view(request):
key = request.POST.get('key')
query = get_data(key)
res = [app for app in query]
return JsonResponse({'list': res})
We're sure that there is more than 8000 documents matching the query, but in
some calls we get less than 24 results (even zero). The first problem we've
investigated was that we had more than one MongoClient definition in our code. By resolving this, the number of occurrences of the problem decreased, but we still had it in a lot of calls.
After all of these investigations, we've designed a test in which we made 16 asynchronous requests at the same time to the server. With this approach, we could reproduce the problem. On each of these 16 requests, 6-8 of them had partial results. After running this test we reduced uWsgi’s number of processes to 6 and restarted the server. All results were good but after applying another heavy load on the server, the problem began again. At this point, we restarted uwsgi service and again everything was OK. With this last experiment we have a clue now that when the uwsgi service starts running, everything is working correctly but after a period of time and heavy load, the server begins to return partial or empty results again.
The latest investigation we had was to run the API using python manage.py with DEBUG=False, and we had the problem again after a period of time in this situation.
We can't figure out what the problem is and how to solve it. One reason that we can think of is that Django closes pymongo’s connections before completion. Because the returned result is a valid JSON.
Our stack is:
nginx (with no cache enabled)
uWsgi
MemCached (disabled during debugging procedure)
Django (v1.8 on python 3)
PyMongo (v3.0.3)
Your help is really appreciated.
Update:
Mongo version:
db version v3.0.7
git version: 6ce7cbe8c6b899552dadd907604559806aa2e9bd
We are running single mongod instance. No sharding/replicating.
We are creating connection using this snippet:
con = MongoClient('localhost', 27017)
Update 2
Subject thread in Pymongo issue tracker.
Pymongo cursors are not thread safe elements. So using them like what I did in a multi-threaded environment will cause what I've described in question. On the other hand Python's list operations are mostly thread safe, and changing snippet like this will solve the problem:
def get_data(key):
return list(collection.find({'key': key}, limit=24))
def my_view(request):
key = request.POST.get('key')
query = get_data(key)
res = [app for app in query]
return JsonResponse({'list': res})
My very speculative guess is that you are reusing a cursor somewhere in your code. Make sure you are initializing your collection within the view stack itself, and not outside of it.
For example, as written, if you are doing something like:
import ...
import con
collection = con.documents
# blah blah code
def my_view(request):
key = request.POST.get('key')
query = collection.find({'key': key}, limit=24)
res = [app for app in query]
return JsonResponse({'list': res})
You could end us reusing a cursor. Better to do something like
import ...
import con
# blah blah code
def my_view(request):
collection = con.documents
key = request.POST.get('key')
query = collection.find({'key': key}, limit=24)
res = [app for app in query]
return JsonResponse({'list': res})
EDIT at asker's request for clarification:
The reason you need to define the collection within the view stack and not when the file loads is that the collection variable has a cursor, which is basically how the database and your application talk to each other. Cursors do things like keep track of where you are in a long list of data, in addition to a bunch of other stuff, but thats the important part.
When you create the collection cursor outside the view method, it will re-use the cursor on each request if it exists. So, if you make one request, and then another really, really fast right after that (like what happened when you applied high load), the cursor might only be half way through talking to the database, and so some of your data goes to the first request, and some to the second. The reason you would get NO data in a request would be if a cursor finished fetching data but hadn't been closed yet, so the next request tried to fetch data from the cursor, and there was none left to fetch in the query.
By moving the collection definition (and by association, the cursor definition) into the view stack, you will ALWAYS get a new cursor when you process a new request. You wont get any cross talking between your cursors and different requests, as each request cycle will have its own.

How to render a Django template with a transient variable?

I have an application that enforces a strict page sequence. If the user clicks the Back button, the application detects an out-of-order page access and sends the user back to the start.
I'd like to make this a bit more friendly, by redirecting the user back to the correct page and displaying a pop-up javascript alert box telling them not to use the Back button.
I'm already using a function that does a lot of validity checking which returns None if the request is okay, or an HttpResponseRedirect to another page (generally the error page or login page) if the request is invalid. All of my views have this code at the top:
response = validate(request)
if response:
return response
So, since I have this validate() function already, it seems like a good place to add this extra code for detecting out-of-order access.
However, since the out-of-order detection flag has to survive across a redirect, I can't just set a view variable; I have to set the flag in the session data. But of course I don't want the flag to be set in the session data permanently; I want to remove the flag from the session data after processing the template.
I could add code like this to all of my render calls:
back_button = request.session.get('back_button', False)
response = render(request, 'foo.html', { 'back_button': back_button } )
if back_button:
del request.session['back_button']
return response
But this seems a bit messy. Is there some way to automatically remove the session key after processing the template? Perhaps a piece of middleware?
I'm using function-based views, not class-based, btw.
The session object uses the dictionary interface, so you can use pop instead of get to retrieve and delete the key at the same time:
back_button = request.session.pop('back_button', False)

Multiprogramming in Django, writing to the Database

Introduction
I have the following code which checks to see if a similar model exists in the database, and if it does not it creates the new model:
class BookProfile():
# ...
def save(self, *args, **kwargs):
uniqueConstraint = {'book_instance': self.book_instance, 'collection': self.collection}
# Test for other objects with identical values
profiles = BookProfile.objects.filter(Q(**uniqueConstraint) & ~Q(pk=self.pk))
# If none are found create the object, else fail.
if len(profiles) == 0:
super(BookProfile, self).save(*args, **kwargs)
else:
raise ValidationError('A Book Profile for that book instance in that collection already exists')
I first build my constraints, then search for a model with those values which I am enforcing must be unique Q(**uniqueConstraint). In addition I ensure that if the save method is updating and not inserting, that we do not find this object when looking for other similar objects ~Q(pk=self.pk).
I should mention that I ham implementing soft delete (with a modified objects manager which only shows non-deleted objects) which is why I must check for myself rather then relying on unique_together errors.
Problem
Right thats the introduction out of the way. My problem is that when multiple identical objects are saved in quick (or as near as simultaneous) succession, sometimes both get added even though the first being added should prevent the second.
I have tested the code in the shell and it succeeds every time I run it. Thus my assumption is if say we have two objects being added Object A and Object B. Object A runs its check upon save() being called. Then the process saving Object B gets some time on the processor. Object B runs that same test, but Object A has not yet been added so Object B is added to the database. Then Object A regains control of the processor, and has allready run its test, even though identical Object B is in the database, it adds it regardless.
My Thoughts
The reason I fear multiprogramming could be involved is that each Object A and Object is being added through an API save view, so a request to the view is made for each save, thus not a single request with multiple sequential saves on objects.
It might be the case that Apache is creating a process for each request, and thus causing the problems I think I am seeing. As you would expect, the problem only occurs sometimes, which is characteristic of multiprogramming or multiprocessing errors.
If this is the case, is there a way to make the test and set parts of the save() method a critical section, so that a process switch cannot happen between the test and the set?
Based on what you've described, it seems reasonable to assume that multiple Apache processes are a source of problems. Are you able to replicate if you limit Apache to a single worker process?
Maybe the suggestions in this thread will help: How to lock a critical section in Django?
An alternative approach could be utilizing a queue. You'd just stick your objects to be saved into the queue and have another process doing the actual save. That way you could guarantee that objects were processed sequentially. This wouldn't work well if your application depends on having the object saved by the time the response is returned unless you also had the request processes wait on the result (watching a finished queue for example).
Updated
You may find this info useful. Mr. Dumpleton does a much better job of laying out the considerations than I could attempt to summarize here:
http://code.google.com/p/modwsgi/wiki/ProcessesAndThreading
http://code.google.com/p/modwsgi/wiki/ConfigurationGuidelines especially the Defining Process Groups section.
http://code.google.com/p/modwsgi/wiki/QuickConfigurationGuide Delegation to Daemon Process section
http://code.google.com/p/modwsgi/wiki/IntegrationWithDjango
Find the section of text toward the bottom of the page that begins with:
Now, traditional wisdom in respect of
Django has been that it should
perferably only be used on single
threaded servers. This would mean for
Apache using the single threaded
'prefork' MPM on UNIX systems and
avoiding the multithreaded 'worker'
MPM.
and read until the end of the page.
I have found a solution that I think might work:
import threading
def save(self, *args, **kwargs):
lock = threading.Lock()
lock.acquire()
try:
# Test and Set Code
finally:
lock.release()
It doesn't seam to break the save method like that decorator and thus far I have not seen the error again.
Unless anyone can say that this is not a correct solution, I think this works.
Update
The accepted answer was the inspiration for this change.
I seams I was under the impressions that locks were some sort of special voodoo that were exempt by normal logic. Here the lock = threading.Lock() is run each time, thus instantiating a new unlocked lock which may always be acquired.
I needed a single central lock for the purpose, but were could that go unless I had a thread running all the time holding the lock? The answer seamed to be to use file locks explained in this answer to the StackOverflow question mentioned in the accepted answer.
The following is that solution modified to suit my situation:
The Code
Th following is my modified DjangoLock. I wished to keep locks relative to the Django root, to do this I put a custom variable into the settings.py file.
# locks.py
import os
import fcntl
from django.conf import settings
class DjangoLock:
def __init__(self, filename):
self.filename = os.path.join(settings.LOCK_DIR, filename)
self.handle = open(self.filename, 'w')
def acquire(self):
fcntl.flock(self.handle, fcntl.LOCK_EX)
def release(self):
fcntl.flock(self.handle, fcntl.LOCK_UN)
def __del__(self):
self.handle.close()
And now the additional LOCK_DIR settings variable:
# settings.py
import os
PATH = os.path.abspath(os.path.dirname(__file__))
# ...
LOCK_DIR = os.path.join(PATH, 'locks')
That will now put locks in a folder named locks relative to the root of the Django project. Just make sure you give apache write access, in my case:
sudo chown www-data locks
And finally the usage is much the same as before:
import locks
def save(self, *args, **kwargs):
lock = locks.DjangoLock('ClassName')
lock.acquire()
try:
# Test and Set Code
finally:
lock.release()
This is now the implementation I am using and it seams to be working really well. Thanks to all who have contributed to the process of arriving at this end.
You need to use synchronization on the save method. I haven't tried this yet, but here's a decorator that can be used to do so.

Django File Upload Handler Errors

In Django, I want to stop any file uploads that don't end in my designated extension as soon as they're received. To do this, I define a custom Upload Handler and write new_file to look something like this:
def new_file(self, field_name, file_name, content_type, content_length, charset=None):
basename, extension = os.path.splitext(file_name)
if extension != ".txt":
raise StopUpload(True) # Error: Only .txt files are accepted
This works fine for stopping the upload but it also clears request.FILES. Now, when my view receives the request, it has no way to telling whether the upload handler caused the file to be missing and I can't display a useful message to the user.
Is there any way to propagate messages from the Upload Handler to the corresponding view, such that I can display the error to the user? I've tried using the request object, but it is immutable.
I've found a reason for why I couldn't get it working and a solution
as well.
It would appear that Django uses lazy evaluation for request.FILES to
determine when the upload handler is called. Therefore, the upload
handler is only evoked when and if you attempt to access
request.FILES. Additionally, the request object I am using
(WSGIRequest in my case) has made GET and POST immutable dictionaries,
so we can't pass information through there. However, META is still
available to add information to.
My combined solution has the line "request.FILES" in the view that
handles uploads, which forces the upload handler to begin. When the
error is captured in new_files, I set self.request.META['error'] to
the error message and raise StopUpload, which pushes us back into the
view without a file. Finally, I check for request.META['error'] in the
view and display that message when there is a problem.
I hope this helps!