Django FileField, how to avoid long file copy delays? - django

I have the following class:
class VideoFile(models.Model):
    media_file = models.FileField(upload_to=update_filename, null=True)
And when I try to upload large files to it (from 100 MB up to 2 GB) using the following request, it can take quite a long time after the upload itself has finished, during the VideoFile.save() call.
def upload(request):
    video_file = VideoFile.objects.create(uploader=request.user.profile)
    video_file.media_file = uploaded_file
    video_file.save()
On my MacBook Pro (Core i7, 8 GB RAM), a 300 MB uploaded file can take around 20 seconds to get through video_file.save().
I suspect this delay is related to a disk copy operation from /tmp to the file's permanent location. I've confirmed this by running watch ls -l on the target directory: as soon as video_file.save() runs, I can see the file appear and grow throughout the delay.
Is there a way to eliminate this file transfer delay? Either by uploading the file directly to the target filename or just by moving the original file instead of copying? This is not the only upload operation across the site however so any solution needs to be localized to this model.
Thanks for any advice!
UPDATE:
Just further evidence to support a copy rather than a move: I can watch lsof during the upload and see a file within /private/var/folders/... being written from Python, which maps exactly to the upload progress. After the upload is complete, another lsof entry appears for the file's ultimate location, which grows over time. After that's complete, both entries disappear.

Ok after a bit of digging I've come up with a solution. It turns out Django's default storage already attempts to move the file instead of copying it, which it first tests for with:
hasattr(content, 'temporary_file_path')
This attribute exists on the class TemporaryUploadedFile, which is the object returned to the upload view; however, the field itself is created as the class specified by FileField.attr_class.
So instead I decided to subclass FieldFile and FileField and slot in a temporary_file_path attribute:
class VideoFieldFile(FieldFile):
    _temporary_file_path = None

    def temporary_file_path(self):
        return self._temporary_file_path


class VideoFileField(FileField):
    attr_class = VideoFieldFile
Finally in the view, before saving the model I manually assigned the temp path:
video_file.media_file._temporary_file_path = uploaded_file.temporary_file_path()
This now means my 1.1 GB test file becomes available in about 2-3 seconds rather than the 50 seconds or so I was seeing before. It also comes with the added benefit that if the source and destination are on different file systems, it appears to fall back to the copy operation.
As a side note however, my site is not utilizing MemoryFileUploadHandler, which some sites may use to handle smaller file uploads, so I'm not sure how well my solution would work with that, but I'm sure it would be simple enough to detect the uploaded file's class and act accordingly.
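For example, a minimal sketch of that class check might look like the following (the helper function is hypothetical; only the TemporaryUploadedFile branch matches what I actually do above):

from django.core.files.uploadedfile import TemporaryUploadedFile

def attach_media_file(video_file, uploaded_file):
    # Only temporary (on-disk) uploads have a real path that can be moved;
    # in-memory uploads simply fall back to the normal copy behaviour.
    video_file.media_file = uploaded_file
    if isinstance(uploaded_file, TemporaryUploadedFile):
        video_file.media_file._temporary_file_path = uploaded_file.temporary_file_path()
    video_file.save()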

I would caution that there are quite a few reasons why uploading to /tmp and then copying over is considered best practice, and that uploading large files directly to their target location is a potentially dangerous operation.
But, what you're asking is absolutely possible. Django defines upload handlers:
You can write custom handlers that customize how Django handles files. You could, for example, use custom handlers to enforce user-level quotas, compress data on the fly, render progress bars, and even send data to another storage location directly without storing it locally.
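For instance, a custom handler that streams the upload straight into the destination directory could look roughly like this (the handler name and DESTINATION_DIR are assumptions for the sketch, not anything Django provides):

import os
import uuid

from django.core.files.uploadedfile import UploadedFile
from django.core.files.uploadhandler import FileUploadHandler

class DirectToDestinationHandler(FileUploadHandler):
    """Writes incoming chunks straight into DESTINATION_DIR instead of /tmp."""

    DESTINATION_DIR = '/srv/media/videos'  # assumed target directory

    def new_file(self, *args, **kwargs):
        super().new_file(*args, **kwargs)
        self.path = os.path.join(self.DESTINATION_DIR,
                                 '%s_%s' % (uuid.uuid4().hex, self.file_name))
        self.destination = open(self.path, 'wb')

    def receive_data_chunk(self, raw_data, start):
        self.destination.write(raw_data)
        return None  # chunk consumed; don't pass it to later handlers

    def file_complete(self, file_size):
        self.destination.close()
        return UploadedFile(file=open(self.path, 'rb'), name=self.file_name,
                            content_type=self.content_type, size=file_size)

To keep the change localized to one model, you could insert the handler into request.upload_handlers at the top of that single upload view (before request.POST or request.FILES is accessed) rather than adding it to FILE_UPLOAD_HANDLERS globally.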

Related

Django Request input from User

I'm having a Django stack process issue and am wondering if there's a way to request user input.
To start with, the user loads sample data (Oxygen, CHL, Nutrients, etc.) which typically comes from an Excel file. The user clicks a button on the webpage indicating what type of sample is being loaded and gets a dialog to choose the file to load. The webpage passes the file to Python via VueJS/Django, where Python passes the file down to the appropriate parser for that specific sample type. The file is processed and the sample data is written to a database.
Issues (e.g. a sample ID is outside the expected range of IDs because it was keyed in wrong when the sample was taken) get passed back to the webpage, as well as written to the database, to tell the user when something happened:
(e.g. "No Bottle exists for sample with ID 495619, expected range is 495169 - 495176" or "356 is an unreasonable salinity, check data for sample 495169"). Maybe there are fifty samples without bottle IDs because the required bottle file wasn't loaded before the sample data. Generally you have one big 10L bottle with water in it; the ocean depth (pressure) at which the bottle was closed and the bottle's ID are recorded in a bottle file, samples are placed into vials labelled with that bottle's unique ID, and the vials are run through different machines and tests to produce the sample files.
My issue occurs when the user picks a file that contains data that has already been loaded. If the filename matches the existing file the data was loaded from, I clear the data associated with that file and reload the database with the new data. Sometimes, though, data is spread over several files that were already loaded, and an uploader will merge all of those files, including some that weren't uploaded, together.
A protocol for uploading data is for the uploader to append/prepend their initials onto a copy of a file if corrections were made and not to modify the original file; a chain of custody. Sometimes a file can be modified by multiple people and each person will create a copy and append/prepend their initials so people will know who all touched the data. (I don't make the rules I just work with what I have)
So we get all the way back to the parser and it's discovered the data already exists (for a given sample ID), but the filename is different. At this point I want to ask the user: do you want to reload all of the data loaded from the other file, update the existing data with the new file, or ignore the existing data and only append new data?
Is there a way for Django to make a request to the webpage to ask the user how it should handle this data, without terminating the current request (which the webpage is waiting on for a response saying whether the data was loaded and what errors with the data might have been found)?
My current thoughts are to:
Ask the user before every file upload how a collision should be handled, if it happens
Or
Abort the data load and pass an error with a code back to the webpage; the error code indicates to the webpage that the user has to decide what to do. Once the user answers, the load process is restarted with the same file, but with a flag telling the parser what to do when the issue is eventually encountered again.
Nothing is written to the database until a whole file is read, so there's no problem aborting the process and restarting it if the parser doesn't know what to do, but I feel like there might be a better way.
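For what it's worth, a minimal sketch of that second option might look like this (the view, the load_samples helper, and the SampleCollision exception are all hypothetical names used purely to illustrate the flow):

from django.http import JsonResponse

class SampleCollision(Exception):
    """Raised when parsed sample IDs already exist under a different filename."""

def upload_samples(request):
    # collision_mode is None on the first attempt; after the user answers the
    # prompt, the page resubmits the same file with 'reload', 'update' or 'append'.
    mode = request.POST.get('collision_mode')
    try:
        # load_samples stands in for the existing parser entry point.
        load_samples(request.FILES['sample_file'], collision_mode=mode)
    except SampleCollision as exc:
        # HTTP 409 tells the page to prompt the user and retry with a mode set.
        return JsonResponse({'status': 'collision', 'detail': str(exc)}, status=409)
    return JsonResponse({'status': 'ok'})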

Shiny fileinput datapath may be deleted after user uploads a new file - is there a workaround?

I'm building an app where the user uploads 1-4 files to Shiny through fileInput. An issue arises where, should the user not drag/select multiple files in one go, the app will not be able to access them all. For example, say the user has 4 files saved locally in 4 different folders and they try uploading them one by one; the app will not function. This happens because when the files are uploaded, fileInput creates a dataframe where one column (datapath) contains the path to a temp file which you can then reference in the server. In the documentation it states...
datapath
The path to a temp file that contains the data that was uploaded. This file may be deleted if the user performs another upload operation.
https://shiny.rstudio.com/reference/shiny/1.6.0/fileInput.html
Is there any way around this problem to prevent this datapath being deleted or perhaps find a way to store the temp file so it won't be lost should a user upload another file?
I had considered multiple fileInput boxes but that just makes the app messy.
There is a reproducible example in the example section of the documentation above.

Will S3 create objects itself when we save a file?

I created a dataframe and selected some columns, say col1, col2 and col3, using df.select():
df1 = df.select("col1", "col2", "col3")
I am writing this into a parquet file and saving it to s3.
df1.write.partitionBy("col1").format("parquet").save('s3a://myBucket/fol1/subfolder')
Currently there is no location like 's3a://myBucket/fol1/subfolder' in my S3; the only thing I have is 's3a://myBucket'. My question is: since there are no objects named fol1 and subfolder, will it create the objects itself and save the file, or will the code fail?
I think you're asking whether save('s3a://myBucket/fol1/subfolder') will create the fol1/subfolder structure in S3, and if it doesn't, whether you need to.
The bottom line is that you don't need to worry about creating the intermediate folder structure, because the Hadoop FS API creates it for you as needed.
@SteveLoughran's answer provides much more detail and deserves to be the accepted answer.
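As a quick illustration of why no pre-creation is needed (using boto3 purely for the demonstration, with the bucket and prefix from the question), writing an object under a prefix that has never existed just works, because S3 keys are flat names rather than real directories:

import boto3

s3 = boto3.client("s3")
# No mkdir step: "fol1/subfolder/" is simply part of the object key.
s3.put_object(Bucket="myBucket",
              Key="fol1/subfolder/part-00000.parquet",
              Body=b"...parquet bytes...")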
Although S3 is an object store, Spark, Hive and the rest all pretend it's a filesystem and use the Hadoop filesystem API.
Some early actions of a Spark save() are:
1. call FileSystem.exists(dest) and fail if there's something there (unless you have enabled appending to existing data)
2. call FileSystem.mkdirs(dest)
3. set up some _temporary dir underneath for the job, renaming things into place when the job is committed.
Action #2 triggers a scan for any entry in the path /a/b/c/dest being a file (failure), then creates an empty directory marker object /a/b/c/dest/. That marker will be deleted as soon as a child directory (i.e. _temporary) is created.
At the end of the job there won't be any parent marker entries, but they go in there just to keep quiet all those bits of code which expect that, after a mkdirs() call, the created directory exists.
Finally, be advised: the whole commit-by-rename mechanism is broken when it comes to S3, as it is (a) slow and (b) at risk of losing data due to inconsistent directory listings. You need a consistent listing layer (EMR: Consistent S3, Apache Hadoop: S3Guard, Databricks: something also DynamoDB based) and, for maximum performance atop Apache Hadoop 3.1, switch to a specific zero-rename S3A committer.

How to use Matplotlib images in Django templates?

Background: I'm starting off with Django, and have limited experience with Python, so please bear with me. I've written a Python script that runs periodically (in a cron job) to store data into a SQLite3 database, from which I'd like to read from and generate images with Matplotlib (more specifically, with Basemap). This started off as an interest in learning Python and building an "interesting" enough project. I'm picking the Django framework because it seems reasonably-well documented, although I was pleasantly surprised by web.py because of its "lightweightness" in its requirements (but web.py's sparse documentation makes it a bit harder to start off with); but at the moment, I'm not entirely dead-set on a framework.
The example in question 1874642 is almost what I'm looking for, with an image being generated on the fly without having to write it to disk (and thus having to deal with periodically cleaning up the generated files).
However, what is not clear to me is how the generated image can be incorporated in a template, instead of having the browser simply show the image. From the tutorial material, I'm guessing that it should be possible to pass the variables incorporated in some django.template.Context into the django.http.HttpResponse, but the referenced example shortcuts that by responding directly with a MIME object instead of building the response with a Context.
So what I'm asking is:
Is it necessary to invoke print_png on the generated Matplotlib FigureCanvas object? Or is the FigureCanvas copied "unprinted" into the Context, so that in the Django template I explicitly write the HTML img tag and fill in the tag's attributes by hand?
I'm under the impression that I have to write the canvas to disk (i.e. do a canvas.print_figure("image.png")) so that the HTML img tag in the Django template can see it. But I want to be sure that there isn't a "more manageable" way -- i.e. passing the image in the Context and having the template "magically" generate it. If it's really necessary to write to disk, I suppose I could use Django's filesystem caching facility to store the generated images in some way (checking whether an image was already written for a given input parameter set, of course). I welcome your suggestions in this regard, since it's not yet clear at this point what the size and number of generated images will be, and thus I'm looking to avoid spending disk space, preferring instead to wait for an image to be generated (even if it takes a few seconds).
Thank you in advance.
You can pass a StringIO object to pyplot.savefig() and get the PNG file content with StringIO.getvalue().
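Building on that idea, here is a rough sketch of how the image could be passed through the template context as a base64 data URI instead of being written to disk (the view name, the plot.html template, and the chart context key are made up for the example; on Python 3 you would use io.BytesIO rather than StringIO):

import base64
from io import BytesIO

from django.shortcuts import render
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure

def plot_page(request):
    fig = Figure()
    FigureCanvasAgg(fig)  # attach an Agg canvas so savefig() can render the PNG
    ax = fig.add_subplot(111)
    ax.plot([1, 2, 3], [4, 5, 6])

    buf = BytesIO()
    fig.savefig(buf, format='png')
    encoded = base64.b64encode(buf.getvalue()).decode('ascii')

    # The template then embeds it directly, with no file on disk:
    # <img src="data:image/png;base64,{{ chart }}">
    return render(request, 'plot.html', {'chart': encoded})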
This view would serve a PNG image. Just bind it to some URL like "img.png" and use that in an img tag.
from django.http import HttpResponse
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure

def create_fig(request):
    fig = Figure()
    FigureCanvasAgg(fig)  # MPL stuff: attach a canvas and draw onto fig here
    response = HttpResponse(content_type='image/png')
    fig.savefig(response, format='png')
    return response
Of course, that assumes that you can generate the image independent of the main view. You can pass arguments to the image, like (in urls.py):
url(r'^img(?P<nr>\d+).png$', create_fig),
which passes a (string repr. of) a number nr to create_fig.

Django return large file

I am trying to find the best way (most efficient way) to return large files from Django back to an http client.
receive http get request
read large file from disk
return the content of that file
I don't want to read the whole file and then build the response with HttpResponse, as the file content would first be stored in RAM, if I am correct. How can I do that efficiently?
Laurent
Look into mod_xsendfile on Apache (or its equivalents for nginx, etc.) if you'd like to use Django for authentication. Otherwise, there's no need to hit Django at all; just serve straight from Apache.
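A rough sketch of that approach (the view name is arbitrary, and it assumes Apache is running with mod_xsendfile enabled, i.e. "XSendFile On"): Django only performs the permission check and sets a header, and Apache streams the actual file.

from django.contrib.auth.decorators import login_required
from django.http import HttpResponse

@login_required
def protected_download(request, path):
    # Django authenticates; Apache (mod_xsendfile) reads and sends the file,
    # so the file content never passes through Python at all.
    response = HttpResponse()
    response['X-Sendfile'] = path  # absolute path to the file on disk
    response['Content-Type'] = 'application/octet-stream'
    return response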
There is a ticket that aims to deal with this problem here: http://code.djangoproject.com/ticket/2131
It adds an HttpResponseSendFile class that uses sendfile() to send the file, which transparently sends the file as it's read.
However, the standard HttpResponse is implemented as an iterator, so if you pass it a file-like object, it will follow its iteration semantics; presumably you could create a file-like object wrapper that chunks the file into small enough pieces before sending them out.
I believe the semantics of iterating over a standard file object in Python are that it reads line by line, which most likely won't solve your problem if you're dealing with binary files.
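As a hedged sketch of that chunking idea (the view name and chunk size are arbitrary; StreamingHttpResponse appeared in later Django versions, and on older ones the same generator can be passed to HttpResponse), a generator keeps only one chunk in memory at a time:

import os

from django.http import StreamingHttpResponse

def download(request, path):
    # Hypothetical view: stream the file in fixed-size binary chunks so the
    # whole thing never sits in RAM at once.
    def chunks(file_path, chunk_size=64 * 1024):
        with open(file_path, 'rb') as fh:
            while True:
                data = fh.read(chunk_size)
                if not data:
                    break
                yield data

    response = StreamingHttpResponse(chunks(path),
                                     content_type='application/octet-stream')
    response['Content-Length'] = os.path.getsize(path)
    return response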
Of course, you could always put the static files in another location and serve that with a normal web server, unless you require intricate control (like access control requiring knowledge of the Django database)
My preference for all of this is to integrate Django with your HTTP server so that when you want to serve static files, you simply refer to a path that will never reach Django. The strategy will look something like this:
Configure the HTTP server so that some requests go to Django and some go to a static document root
link to static documents from any web pages that obviously need the static documents (e.g. css, javascript, etc.)
for any non-obvious return of a static document, use an HttpResponseRedirect("/web-path/to/doc").
If you need to include the static document inside a dynamic document (maybe a page-viewer wrapping a large text or binary file), then return a wrapper page that populates a div with an ajax call to your static document.