I can't upload multiple large files - Django

I am developing an application with Django and AWS S3, hosted on Heroku.
At one point users have to upload multiple large files, totaling around 150MB each time.
I have tried various approaches.
1st attempt: directly call the save method of the Django form.
Result: the request takes more than 30 seconds and returns a timeout.
2nd attempt: temporarily save the file to a Heroku directory and read it from a Celery task.
Result: it cannot save the file because it throws FileNotFoundError: [Errno 2] No such file or directory in production.
3rd attempt: pass the uploaded files (in-memory files) to a Celery task, but the bytes cannot be serialized and passed to the task with either JSON or pickle.
Could anyone help me, please?
Thanks in advance.

Another approach could be the following:
Expose an API to generate a presigned URL for the frontend (steps are here).
Upload the files to that URL from the frontend asynchronously. That offloads the work from your backend.
After a successful upload you will get the URL of the file's location. Save that S3 URL, along with the other field data, to your Django model.
You can upload files larger than 150MB with this method, and your system will scale.
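As an illustration, here is a minimal sketch of such a presigned-URL endpoint, assuming boto3 and a placeholder bucket name and key prefix (none of these names come from the answer itself):

import boto3
from django.http import JsonResponse

def presigned_upload_url(request):
    # Return a short-lived presigned POST so the browser can upload straight to S3.
    s3 = boto3.client("s3")
    key = "uploads/" + request.GET["filename"]   # key prefix is an assumption
    presigned = s3.generate_presigned_post(
        Bucket="my-upload-bucket",               # assumed bucket name
        Key=key,
        ExpiresIn=600,                           # URL valid for 10 minutes
    )
    # The frontend POSTs the file to presigned["url"] together with presigned["fields"],
    # then sends the resulting S3 key back to Django to store on the model.
    return JsonResponse(presigned)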

Related

Django rest framework large file upload

Is there an easy way to upload large files from the client side to a Django REST framework endpoint? In my application, users will be uploading very large files (>4 GB). Browsers have an upload limit; here's the chart.
My current idea is to upload the file in chunks from the client side and receive the chunks at the REST endpoint. But how do I do that? I have seen libraries like resumable.js, tus.js, flow.js, etc., but how do I handle the chunks on the backend? Is there any actively maintained library for a problem like this? Please help me.
Maybe this module could help: https://github.com/jkeifer/drf-chunked-upload.
The module is used in a sample Django app at the link, with example code for the implementation. Here is the typical usage flow the module supports, without the sample code for simplicity (the code is at the link if you want it):
1. An initial PUT request is sent to the URL linked to ChunkedUploadView (or any subclass) with the first chunk of the file. The name of the chunk file can be overridden in the view (class attribute field_name).
2. In return, the server will respond with the URL of the upload and the current offset.
3. Repeatedly PUT subsequent chunks to the URL returned from the server.
4. The server will continue responding with the URL and current offset.
5. Finally, when the upload is complete, POST a request to the returned URL. This request must include the checksum (hex) of the entire file.
6. If everything is OK, the server will respond with status code 200 and the data returned by the method get_response_data (if any).
If you want to upload a file as a single chunk, this is also possible! Simply make the first request a POST and include the checksum digest for the file. You don't need to include the Content-Range header if uploading a whole file.
Based on these instructions, it seems that the server handles the upload by tracking the chunk offsets through the received "Content-Range" headers, together with the upload's URL, storing the uploaded chunks in .part files. It then responds like so:
{
    'id': 'f64ebd67-83a3-45b6-8acd-c749ea1ed4cd',
    'url': 'https://your-host/<path_to_view>/f64ebd67-83a3-45b6-8acd-c749ea1ed4cd',
    'file': 'https://your-host/<path_to_file>/f64ebd67-83a3-45b6-8acd-c749ea1ed4cd.part',
    'filename': 'example.bin',
    'offset': 10000,
    'created_at': '2021-05-18T17:12:50.318718Z',
    'status': 1,
    'completed_at': None,
    'user': 1
}
When the full file has been uploaded, as determined by the received headers, the .part files are combined into the final upload. This also allows you to resume uploads if they are interrupted, because the existing .part files persist until the upload finishes.
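To make the flow above concrete, here is a rough client-side sketch against a drf-chunked-upload endpoint using the requests library. The endpoint path and the field names "file" and "md5" are assumptions based on the library's defaults, so check them against the linked documentation:

import hashlib
import os
import requests

CHUNK_SIZE = 10 * 1024 * 1024                        # 10 MB per chunk (arbitrary)
ENDPOINT = "https://your-host/api/chunked_upload/"   # placeholder URL

def upload_in_chunks(path):
    total = os.path.getsize(path)
    md5 = hashlib.md5()
    url, offset = ENDPOINT, 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(CHUNK_SIZE)
            if not chunk:
                break
            md5.update(chunk)
            # Tell the server which byte range this chunk covers.
            headers = {"Content-Range": "bytes %d-%d/%d" % (offset, offset + len(chunk) - 1, total)}
            resp = requests.put(url, headers=headers,
                                files={"file": (os.path.basename(path), chunk)})
            resp.raise_for_status()
            data = resp.json()
            url, offset = data["url"], data["offset"]  # the server says where to continue
    # Completion request: POST the hex checksum of the whole file to the returned URL.
    requests.post(url, data={"md5": md5.hexdigest()}).raise_for_status()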
https://stackoverflow.com/a/26278960/12776116
Maybe this can help you. As you mentioned, the file is uploaded by breaking it into small parts.
You should do it with Celery tasks.
Take a look at this link; it explains how to upload a file using Django and Celery.

Django: how to upload a file directly to a third-party storage server, like Cloudinary or S3

Now, I have realized the uploading process works like this:
1. Django builds the HTTP request object and populates request.FILES using an upload handler.
2. In views.py, the FieldFile instance (the file-level counterpart of FileField) calls storage.save() to upload the file.
So, as you can see, Django always passes the data through memory or disk; if your file is too large, this costs too much time.
The design I have in mind to solve this is a custom upload handler that calls storage.save() on the raw input data as it arrives. The only question is: how can I modify the behaviour of FileField?
Thanks for any help.
You can use this package. It adds direct-to-S3 upload functionality, with a progress bar, to file input fields.
https://github.com/bradleyg/django-s3direct
You can use one of the following packages:
https://github.com/cloudinary/pycloudinary
http://django-storages.readthedocs.io/en/latest/backends/amazon-S3.html
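For the django-storages route, a minimal S3 configuration looks roughly like this (the bucket name, region and credential handling are placeholders, not from the answers); once it is in place, every FileField saves through the S3 backend instead of the local disk:

# settings.py (also add "storages" to INSTALLED_APPS)
import os

DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
AWS_STORAGE_BUCKET_NAME = "my-media-bucket"   # placeholder
AWS_S3_REGION_NAME = "eu-west-1"              # placeholder
# Credentials can also come from the environment or an instance role.
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY")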

Django upload data into memory

Is it possible in Django to load uploaded data into memory, process it, and return it to the user?
Example:
User uploads an image -> a Python script processes it (e.g. converts it to black and white) -> the user sees the processed image on the same web page
(Other examples would be all those online conversion sites like pdf2doc.com.)
Is it a bad idea to load it into memory?
Would a queue plus saving it to a CDN be a better solution?
I'm trying to understand the options for handling files from the user which don't need to be saved. I appreciate any further ideas/considerations.
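A minimal sketch of the in-memory flow described above, assuming Pillow is installed and the form field is named "image" (both are assumptions, not from the question):

from io import BytesIO

from django.http import HttpResponse
from PIL import Image

def grayscale_view(request):
    # Small uploads arrive as InMemoryUploadedFile objects, so nothing touches disk here.
    upload = request.FILES["image"]           # assumed field name
    img = Image.open(upload).convert("L")     # "L" = single-channel grayscale
    buf = BytesIO()
    img.save(buf, format="PNG")
    return HttpResponse(buf.getvalue(), content_type="image/png")

For anything heavier than a quick conversion, the queue-plus-CDN approach you mention is usually safer, because long-running work inside the request cycle ties up a worker.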

How to mix Django, Uploadify, and S3Boto Storage Backend?

Background
I'm doing fairly big file uploads on Django. File size is generally 10MB-100MB.
I'm on Heroku and I've been hitting the request timeout of 30 seconds.
The Beginning
In order to get around the limit, Heroku's recommendation is to upload from the browser DIRECTLY to S3.
Amazon documents this by showing you how to write an HTML form to perform the upload.
Since I'm on Django, rather than write the HTML by hand, I'm using django-uploadify-s3 (example). This provides me with an SWF object, wrapped in JS, that performs the actual upload.
This part is working fine! Hooray!
The Problem
The problem is in tying that data back to my Django model in a sane way.
Right now the data comes back as a simple URL string, pointing to the file's location.
However, I was previously using S3 Boto from django-storages to manage all of my files as FileFields, backed by the delightful S3BotoStorageFile.
To reiterate, S3 Boto is working great in isolation, Uploadify is working great in isolation, the problem is in putting the two together.
My understanding is that the only way to populate the FileField is by providing both the filename AND the file content. When you're uploading files from the browser to Django, this is no problem, as Django has the file content in a buffer and can do whatever it likes with it. However, when doing direct-to-S3 uploads like me, Django only receives the file name and URL, not the binary data, so I can't properly populate the FieldFile.
Cry For Help
Anyone know a graceful way to use S3Boto's FileField in conjunction with direct-to-S3 uploading?
Else, what's the best way to manage an S3 file just based on its URL? Including setting expiration, key id, etc.
Many thanks!
Use a URLField.
I had a similar issue where I wanted to either store the file on S3 directly using a FileField, or give the user the option to input the URL directly. To handle that, I used two fields in my model, one FileField and one URLField. In the template I could then use 'or' to pick whichever one exists, like {{ instance.filefield or instance.url }}.
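A minimal sketch of that two-field model (the field names are illustrative, not from the answer):

# models.py
from django.db import models

class Upload(models.Model):
    # Populated when the file goes through Django itself.
    filefield = models.FileField(upload_to="uploads/", blank=True)
    # Populated when the browser uploads directly to S3 and hands back a URL.
    url = models.URLField(blank=True)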
This is untested, but you should be able to use:
from django.core.files.storage import default_storage
f = default_storage.open('name_you_expect_in_s3', 'r')
# f is an instance of S3BotoStorageFile, and can be assigned to a field
obj, created = YourObject.objects.get_or_create(**stuff_you_know)
obj.s3file_field = f
obj.save()
I think this should set up the local pointer to S3 and save it, without overwriting the content.
ETA: you should only do this after the upload completes on S3 and you know the key in S3.
Check out django-filetransfers. It looks like it plays nicely with django-storages.
I've never used django, so ymmv :) but why not just write a single byte to populate the content? That way, you can still use FieldFile.
I'm thinking that writing actual SQL may be the easiest solution here. Alternatively, you could subclass S3BotoStorage, override the _save method, and allow an optional filepath kwarg that sidesteps all the other saving logic and just returns the cleaned name.

Trouble uploading lots of files in Django

I'm having problems uploading lots of files in Django. The context is the following: I have a spreadsheet where one or more columns contain image filenames; those images are uploaded through a form with an input of type=file and the multiple option.
With a few rows - say 70 - everything goes fine. But with more rows, and consequently more images, an IOError happens at random positions.
I've checked several questions about file/image upload in Django but couldn't find any that relate to my problem.
The model I'm using is the Product model of LFS (www.getlfs.com). We are developing a system that is based on LFS, and to facilitate the creation of dozens of products in a batch we wrote some views and templates to receive the main product properties through a spreadsheet. Each row is a product and the columns are the desired properties.
LFS uses a custom class, ImageWithThumbsField(ImageField), to store the product's image, and when the product instance (built from the spreadsheet) is saved, all the thumbnails are generated. This is a time- (CPU-) consuming task, and my initial guess is that for some reason the temporary files are deleted before all the processing has occurred.
Is there a way to keep these uploaded files around longer? Is there any other suggested approach for processing hundreds of uploaded files? Any hints on what might be happening?
I hope you can understand my question. I can post code if needed.
Links to relevant portions of LFS code:
where thumbnails are generated:
https://github.com/diefenbach/django-lfs/blob/master/lfs/core/fields/thumbs.py
product model
https://github.com/diefenbach/django-lfs/blob/master/lfs/catalog/models.py
Thanks in advance!
It sounds like you are running out of memory. When Django processes uploads, until the form is validated all of the files are either:
1. Kept in memory inside the Python/WSGI process/worker (the usual mode of operation for runserver).
In this case, you are uploading enough photos to fill up the process memory and you run out of space. As you can imagine, where the IOError happens will be non-deterministic (GC dependent).
2. Temporarily stored in /tmp/ (the usual setup for Apache).
In this case, the web server's ramfs fills up with images that have not yet been written to disk, and the IOError should happen around the same place each time.
In either case, you should not be bulk uploading images in this way anyway. Apache/Django is not designed for it. Try uploading a single product/image per request/response, and all your problems will go away.
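For context, which of the two cases you hit is governed by a couple of Django settings; the sketch below shows the relevant knobs with Django's documented defaults (the temp directory path is just an example), though the one-upload-per-request advice above is still the real fix:

# settings.py
# Uploads smaller than this stay in memory; larger ones are streamed to temporary files.
FILE_UPLOAD_MAX_MEMORY_SIZE = 2621440              # Django's default, 2.5 MB
# Where TemporaryFileUploadHandler writes spilled files (None means the system temp dir).
FILE_UPLOAD_TEMP_DIR = "/var/tmp/django_uploads"   # example path
FILE_UPLOAD_HANDLERS = [
    "django.core.files.uploadhandler.MemoryFileUploadHandler",
    "django.core.files.uploadhandler.TemporaryFileUploadHandler",
]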