Django REST Framework large file upload - django

Is there an easy way to upload large files from the client side to a Django REST Framework endpoint? In my application, users will be uploading very large files (>4 GB). Browsers have an upload limit; here's the chart.
My current idea is to upload the file in chunks from the client side and receive the chunks at the REST endpoint. But how would I do that? I saw some libraries like resumable.js, tus.js, flow.js, etc., but how would I handle the chunks on the backend? Is there any actively maintained library for a problem like this? Please help me.

Maybe this module could help: https://github.com/jkeifer/drf-chunked-upload.
The module is used in a sample Django app at the link, with example code for the implementation. Here is the typical usage flow the module provides, without the sample code for simplicity (the code is at the link if you want it):
1. An initial PUT request is sent to the URL linked to ChunkedUploadView (or any subclass) with the first chunk of the file. The name of the chunk file can be overridden in the view (class attribute field_name).
2. In return, the server will respond with the URL of the upload and the current offset.
3. Repeatedly PUT subsequent chunks to the URL returned from the server.
4. The server will continue responding with the URL and current offset.
5. Finally, when the upload is complete, POST a request to the returned URL. This request must include the checksum (hex) of the entire file.
6. If everything is OK, the server will respond with status code 200 and the data returned by the method get_response_data (if any).
If you want to upload a file as a single chunk, this is also possible! Simply make the first request a POST and include the checksum digest for the file. You don't need to include the Content-Range header if uploading a whole file.
Based on these instructions, it seems that the server handles the upload by tracking the offset of each chunk through the received "Content-Range" header, as well as the upload's URL, storing the uploaded chunks in .part files. It then responds like so:
{
    'id': 'f64ebd67-83a3-45b6-8acd-c749ea1ed4cd',
    'url': 'https://your-host/<path_to_view>/f64ebd67-83a3-45b6-8acd-c749ea1ed4cd',
    'file': 'https://your-host/<path_to_file>/f64ebd67-83a3-45b6-8acd-c749ea1ed4cd.part',
    'filename': 'example.bin',
    'offset': 10000,
    'created_at': '2021-05-18T17:12:50.318718Z',
    'status': 1,
    'completed_at': None,
    'user': 1
}
When the full file has been uploaded, as determined by the received headers, the .part files are combined into the final upload. This also allows you to resume uploads if they are interrupted, because the existing .part files persist until the upload finishes.
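A rough client-side sketch of that flow using Python's requests library; the "file" field name and the "md5" completion field are assumptions based on the library's defaults, so check the repo before relying on them:
import hashlib
import os

import requests

CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB per request; tune to your needs

def upload_in_chunks(path, upload_url):
    md5 = hashlib.md5()
    total = os.path.getsize(path)
    offset = 0
    url = upload_url  # the URL routed to your ChunkedUploadView
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            md5.update(chunk)
            headers = {"Content-Range": f"bytes {offset}-{offset + len(chunk) - 1}/{total}"}
            resp = requests.put(
                url,
                headers=headers,
                files={"file": (os.path.basename(path), chunk)},
            )
            resp.raise_for_status()
            data = resp.json()
            url = data["url"]        # keep PUTting to the URL the server hands back
            offset = data["offset"]  # next expected byte offset
    # Signal completion with the checksum of the whole file.
    done = requests.post(url, data={"md5": md5.hexdigest()})
    done.raise_for_status()
    return done.json()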

https://stackoverflow.com/a/26278960/12776116
Maybe this can help you. As you mentioned, the file is uploaded by breaking it into small parts.

You should do it with Celery tasks.
Take a look at this link; it explains how to upload a file using Django and Celery. A rough sketch of the pattern is shown below.
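This is not the code from that link, just a hedged sketch of the usual pattern (the view, task, and field names are placeholders, and it assumes the default FileSystemStorage plus a configured Celery worker):
import os

from celery import shared_task
from django.core.files.storage import default_storage
from rest_framework.response import Response
from rest_framework.views import APIView

@shared_task
def process_upload(path):
    # Do the heavy lifting (checksums, transcoding, pushing to S3, ...) off-request.
    ...

class UploadView(APIView):
    def post(self, request):
        upload = request.FILES["file"]
        # Persist the file first so the worker can reach it by path.
        saved_name = default_storage.save(os.path.join("uploads", upload.name), upload)
        process_upload.delay(default_storage.path(saved_name))
        return Response({"status": "queued", "file": saved_name})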

Related

Correct way to fetch data from an AWS server into a Flutter app?

I have a general understanding question. I am building a Flutter app that relies on a content library containing text files, LaTeX equations, images, PDFs, videos, etc.
The content lives on an AWS Amplify backend. Depending on the user's navigation in the app, the corresponding data is fetched and displayed.
I am not sure about the correct way of fetching the data. The current method (which works) is that the data is stored in an S3 bucket; when data is requested, it is downloaded to a temporary directory and then opened and processed in the app. This is actually not slow, but I feel it is not the way it should be done.
When data is downloaded, a file transfer notification pops up, which bothers me because it is shown all the time. I would also like to read the data directly with something like a GET request, without downloading the file first (especially for text files, which I would like to read directly into a String). But here I don't know how it works, because I don't see how you can store data in a file system with the other Amplify services like DataStore or the REST API. Also, the S3 bucket is an intuitive way of storing data that is easy for my company's content creators to use, so to me the S3 bucket seems the way to go. However, with S3 I have only figured out the download method for fetching data.
Could someone give me a hint on what is the correct approach for this use case? Thank you very much!

I can't upload multiple large files

I am developing an application with Django and AWS S3, hosted on Heroku.
At one point, users have to upload multiple large files, totaling around 150 MB each time.
I have tried various approaches.
1st attempt: directly call the save method of the Django form:
Result: the request takes more than 30 seconds and returns a timeout.
2nd attempt: temporarily save the file to a Heroku directory and read it from a Celery task.
Result: it cannot save because it throws FileNotFoundError: [Errno 2] No such file or directory in production.
3rd attempt: pass the uploaded files (in-memory files) to a Celery task, but the bytes cannot be serialized and passed to the task with either JSON or pickle.
Could anyone help me please?
Thanks in advance.
Another approach could be the following (a minimal boto3 sketch follows this list):
Expose an API to generate a presigned URL for the frontend (steps are here).
Upload files using that URL from the frontend asynchronously. That offloads the work from your backend.
After a successful upload, you will get the URL of the file's location. Now save that S3 URL, along with the other field data, to your Django model.
You can upload files larger than 150 MB with this method, and your system will be scalable.
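This is not the exact code from those steps, just a hedged sketch of what the URL-issuing endpoint could look like with boto3 and DRF; the bucket name, expiry, and view name are assumptions:
import boto3
from rest_framework.response import Response
from rest_framework.views import APIView

class PresignedUploadURL(APIView):
    def post(self, request):
        key = request.data["filename"]  # object key the client wants to upload to
        s3 = boto3.client("s3")
        url = s3.generate_presigned_url(
            ClientMethod="put_object",
            Params={"Bucket": "my-upload-bucket", "Key": key},  # bucket name is a placeholder
            ExpiresIn=3600,  # URL stays valid for one hour
        )
        return Response({"upload_url": url, "key": key})
The frontend then PUTs the file straight to upload_url and, once that succeeds, sends the key (or the resulting S3 URL) back to Django to store on the model.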

How to facilitate downloading both CSV and PDF from API Gateway connected to S3

In the app I'm working on, we have a process whereby a user can download a CSV or PDF version of their data. The generation works great, but I'm trying to get it to download the file and am running into all sorts of problems. We're using API Gateway for all the requests, and the generation happens inside a Lambda on a POST request. The GET endpoint takes a file_name parameter, constructs the path in S3, and then makes the request directly there. The problem I'm having is when I try to transform the response. I get a 500 error, and when I look at the logs it says: Execution failed due to configuration error: Unable to transform response. So, clearly that's where I've spent most of my time. I've tried at least 50 different iterations of templates and combinations with little success. The closest I've gotten is the following code, where the CSV downloads fine but the PDF is no longer a valid PDF:
CSV:
#set($contentDisposition = "attachment;filename=${method.request.querystring.file_name}")
$input.body
#set($context.responseOverride.header.Content-Disposition = $contentDisposition)
PDF:
#set($contentDisposition = "attachment;filename=${method.request.querystring.file_name}")
$util.base64Encode($input.body)
#set($context.responseOverride.header.Content-Disposition = $contentDisposition)
where contentHandling = CONVERT_TO_TEXT. My binaryMediaTypes just has application/pdf and that's it. My goal is to get this working without having to offload the problem onto a Lambda, so we don't have that overhead at the download step. Any ideas on how to do this right?
Just as another comment, I've tried CONVERT_TO_BINARY and just leaving it as Passthrough. I've tried it with text/csv as another binary media type, and I've tried different combinations of encoding and decoding base64 and so on. I know the data is coming back right from S3, but the transformation is where it's breaking. I'm happy to post more logs if need be. Also, I'm pretty sure this makes sense on Stack Overflow, but if it would fit better on another Stack Exchange site, please let me know.
Resources I've looked at:
https://docs.aws.amazon.com/apigateway/latest/developerguide/request-response-data-mappings.html
https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-mapping-template-reference.html#util-template-reference
https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-payload-encodings-workflow.html
https://docs.amazonaws.cn/en_us/apigateway/latest/developerguide/api-gateway-payload-encodings-configure-with-control-service-api.html.
(But they're all so confusing...)
EDIT: One idea I've had is to use CONVERT_TO_BINARY and somehow base64-encode the CSVs in the transformation, but I can't figure out how to do it right. I keep feeling like I'm misunderstanding the order of things, specifically when the "CONVERT" part happens, if that makes any sense.
EDIT 2: So, I got rid of the $util.base64Encode in the PDF template and now I get a PDF that's empty. The actual file in S3 definitely has content, but for some reason CONVERT_TO_TEXT is not handling it right, or I'm still not understanding how this all works.
I had similar issues. One major thing is the Accept header. I was testing in Chrome, which sends an Accept header like text/html,application/xhtml.... API Gateway ignores everything except the first one (text/html). It will then convert any response from S3 to base64 to try to conform to text/html.
At last, after trying everything else, I tried it via Postman, which defaults the Accept header to */*. Also set your content handling on the integration response to Passthrough, and everything was working!
One other thing is to pass the Content-Type and Content-Length headers through (add them in the method response first and then in the integration response); a boto3 sketch of this wiring follows:
Content-Length: integration.response.header.Content-Length
Content-Type: integration.response.header.Content-Type
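The same wiring expressed with boto3, just as a hedged sketch (the API, resource, and method identifiers are placeholders; doing it in the console is equivalent):
import boto3

apigw = boto3.client("apigateway")
common = dict(restApiId="abc123", resourceId="xyz789", httpMethod="GET", statusCode="200")

# 1. Declare the headers on the method response so API Gateway will let them out.
apigw.put_method_response(
    responseParameters={
        "method.response.header.Content-Type": False,
        "method.response.header.Content-Length": False,
    },
    **common,
)

# 2. Map them from the S3 integration response; omitting contentHandling here
#    leaves the body as a passthrough, matching the setting described above.
apigw.put_integration_response(
    responseParameters={
        "method.response.header.Content-Type": "integration.response.header.Content-Type",
        "method.response.header.Content-Length": "integration.response.header.Content-Length",
    },
    **common,
)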

Trouble uploading lots of file in Django

I'm having problems uploading lots of files in Django. The context is the following: I have a spreadsheet in which one or more columns contain image filenames; those images are being uploaded through a form with an input of type=file and the multiple option.
With few lines, say 70, everything goes fine. But with more lines, and consequently more images, an IOError happens at random positions.
I've checked several questions about file/image upload in Django but couldn't find any that is related to my problem.
The model I'm using is the Product model of LFS (www.getlfs.com). We are developing a system based on LFS, and to facilitate the creation of dozens of products in batch, we wrote some views and templates to receive the main product properties through a spreadsheet. Each line is a product and the columns are the desired properties.
LFS uses a custom class, ImageWithThumbsField(ImageField), to store the product's image, and when the product instance (built from the spreadsheet) is saved, all thumbnails are generated. This is a time- (CPU-) consuming task, and my initial guess is that for some reason the temporary file is deleted before all processing has occurred.
Is there a way to keep these uploaded files for more time? Any other approach suggested to be able to process hundreds of uploaded files? Any hints on what can be happening?
Hope you can understand my question. I can post code if needed.
Links to relevant portions of LFS code:
where thumbnails are generated:
https://github.com/diefenbach/django-lfs/blob/master/lfs/core/fields/thumbs.py
product model
https://github.com/diefenbach/django-lfs/blob/master/lfs/catalog/models.py
Thanks in advance!
It sounds like you are running out of memory. When Django processes uploads, until the form is validated all of the files are either:
kept in memory inside the Python/WSGI process/worker (the usual mode of operation for runserver). In this case, you are uploading enough photos to fill up the process memory and are running out of space. As you can imagine, where the IOError happens will be non-deterministic (GC-dependent); or
temporarily stored in /tmp/ (the usual setup for Apache). In this case, the web server's ramfs fills up with images that have not yet been written to disk, and the IOError should occur around the same place every time.
In either case, you should not be bulk uploading images this way anyway; Apache/Django is not designed for it. Try uploading a single product/image per request/response, and all your problems will go away.
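If you do need to keep large multi-file uploads, here is a hedged sketch of settings that force Django to spill every upload straight to disk instead of RAM (the threshold and temp directory values are only examples):
# settings.py
FILE_UPLOAD_HANDLERS = [
    # Skip MemoryFileUploadHandler entirely so nothing is buffered in the worker process.
    "django.core.files.uploadhandler.TemporaryFileUploadHandler",
]
FILE_UPLOAD_MAX_MEMORY_SIZE = 5 * 1024 * 1024     # only matters if the memory handler is kept
FILE_UPLOAD_TEMP_DIR = "/var/tmp/django_uploads"  # point at a partition with enough space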

Best way to write an image to a Django HttpResponse()

I need to serve images securely to validated users only (i.e. they can't be served as static files). I currently have the following Python view in my Django project, but it seems inefficient. Any ideas for a better way?
def secureImage(request, imagePath):
    response = HttpResponse(mimetype="image/png")
    img = Image.open(imagePath)
    img.save(response, 'png')
    return response
(Image is imported from PIL.)
Well, re-encoding is sometimes needed (e.g. applying a watermark over an image while keeping the original untouched), but for the simplest cases you can use:
try:
    with open(valid_image, "rb") as f:
        return HttpResponse(f.read(), content_type="image/jpeg")
except IOError:
    # Fall back to a 1x1 red placeholder (RGB, since JPEG has no alpha channel)
    red = Image.new('RGB', (1, 1), (255, 0, 0))
    response = HttpResponse(content_type="image/jpeg")
    red.save(response, "JPEG")
    return response
Make use of FileResponse
A cleaner way: here we don't have to worry about the Content-Length and Content-Type headers; they are added automatically, guessed from the contents of open().
from django.http import FileResponse

def send_file(request):
    img = open('media/hello.jpg', 'rb')
    response = FileResponse(img)
    return response
I just stumbled on the somewhat bad advice above (for production) and thought I would mention X-Sendfile, which works with both Apache and Nginx, and probably other web servers too.
https://pythonhosted.org/xsendfile/
Modern Web servers like Nginx are generally able to serve files faster, more efficiently and more reliably than any Web application they host. These servers are also able to send to the client a file on disk as specified by the Web applications they host. This feature is commonly known as X-Sendfile.
This simple library makes it easy for any WSGI application to use X-Sendfile, so that they can control whether a file can be served or what else to do when a file is served, without writing server-specific extensions. Use cases include:
Restrict document downloads to authenticated users.
Log who's downloaded a file.
Force a file to be downloaded instead of rendered by the browser, or serve it with a name different from the one on disk, by setting the Content-Disposition header.
The basic idea is that you open the file and pass that handle back to the web server, which then returns the bytes to the client, freeing your Python code to handle the next request. This is far more performant than the solutions above, since with those a slow client on the other end could hang your Python thread for as long as it takes to download the file.
Here is a repo that shows how to do this for various web servers; although it is pretty old, it will at least give you an idea of what you need to do: https://github.com/johnsensible/django-sendfile
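For Nginx specifically (where the flavour of this is the X-Accel-Redirect header), here is a minimal sketch of the idea; the /protected/ internal location is an assumption you would have to declare in nginx.conf yourself:
from django.contrib.auth.decorators import login_required
from django.http import HttpResponse

@login_required
def secure_image(request, image_name):
    response = HttpResponse()                                   # empty body; nginx streams the file
    response["Content-Type"] = ""                               # let nginx pick the MIME type
    response["X-Accel-Redirect"] = f"/protected/{image_name}"   # internal-only nginx location
    return response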