How to speed up Django FileField upload speed? - django

I have a FileField that uses django-storages' S3BotoBackend to upload audio files to Amazon S3. The audio files can be up to 10MB in size, and a user can upload multiple files in the same form. The upload can take a very long time and blocks the request. In order to speed up processing, I thought about writing a custom storage backend that inherits S3BotoBackend and submits jobs to a beanstalk queue before uploading to S3.
Are there any easier alternatives to speed up the user experience?

If you want to speed things up, you'll want to have your web server more engaged with handling uploads. You can check out the Nginx upload module, though you can accomplish much the same using any web server.
For this approach, you'll configure a view that's going to receive a request once a file has been successfully uploaded by the user, which would then be the opportune moment to queue the file to be uploaded to S3.
This will allow you to asynchronously receive multiple uploads from a user and asynchronously send the files to S3, which should cover just about everything you could do to improve the file upload experience.
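As a rough sketch of that hand-off view: assume the web server is configured to pass the path of the already-stored upload in a form field (the name file_path below is just a placeholder), and that some queue worker exists to push the file to S3. Celery is used here purely as an example; the beanstalk queue mentioned in the question would work the same way.

```python
# Minimal sketch of the view the web server calls once the upload is on
# disk. The field name and the Celery task are assumptions, not part of
# the upload module's fixed API.
from django.http import HttpResponse, HttpResponseBadRequest
from django.views.decorators.csrf import csrf_exempt

from myapp.tasks import upload_to_s3  # hypothetical background task


@csrf_exempt
def upload_complete(request):
    # Path where the web server stored the uploaded file.
    path = request.POST.get("file_path")
    if not path:
        return HttpResponseBadRequest("missing file_path")

    # Queue the transfer to S3 instead of doing it in-request.
    upload_to_s3.delay(path)
    return HttpResponse("queued")
```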

Related

How to upload Django files in the background? Or how to increase file upload speed?

I need to increase file upload speed in Django. Are there any ways to do this? I'm thinking about uploading files in the background: when the user sends a POST request, I just redirect them to some page and start uploading the files. Is there a way to do this? Or do you know any other ways to increase upload speed? Thank you in advance.
Low upload speed can be the result of several issues:
The client's connection simply can't upload any faster; this is a normal situation.
Your server instance uses an old HDD and can't write quickly.
Your server is busy with another pool of requests and serves your clients as fast as it can, but it is overloaded.
Your instance has no free space left on the hard drive.
Your server redirects the file as a stream somewhere else.
Your upload handler code isn't optimized.
You aren't using a proxy server that handles slow clients well and hands the file to Django in a moment once it has fully arrived on the proxy's side.
You are trying to upload a very big file.
etc.
Could you share more details about how you handle the uploads and about your environment?
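If the slow part turns out to be the transfer to remote storage rather than the client's connection, one common pattern matching what you describe is to accept the file quickly, respond immediately, and push it to S3 afterwards. A minimal sketch, assuming Celery and django-storages are available and that the default storage is local disk (field, task and URL names are illustrative only):

```python
# Sketch: save the upload locally (fast), then hand the slow S3 transfer
# to a background worker. All names here are made up.
from celery import shared_task
from django.core.files import File
from django.core.files.storage import default_storage
from django.http import HttpResponse, HttpResponseRedirect


@shared_task
def push_to_s3(local_name):
    # Read the locally stored file and write it through the S3 backend
    # provided by django-storages.
    from storages.backends.s3boto3 import S3Boto3Storage
    s3 = S3Boto3Storage()
    with default_storage.open(local_name, "rb") as fh:
        s3.save(local_name, File(fh))


def upload_view(request):
    if request.method != "POST":
        return HttpResponse(status=405)
    f = request.FILES["upload"]
    local_name = default_storage.save(f.name, f)  # fast local write
    push_to_s3.delay(local_name)                  # slow part in background
    return HttpResponseRedirect("/uploads/thanks/")
```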

Zip images on web server and return the url

I am currently looking for a way to improve the traffic flow of an app.
Currently the user uploads his data via the app, using Google Cloud Platform as the storage provider. Other users can then download this data again.
This works well so far, but since the download traffic at GCP is relatively expensive, I had the idea of outsourcing this to a cheap web server.
The idea is that the user requests the file(s) at GCP. There it is checked if the file(s) are already on the web server. If not, the file(s) will be uploaded to the server.
At the server the files are zipped and the link is sent back to GCP, where it is emailed to the user.
TL;DR: My question is, how can I zip a specific selection of files on a web server without Node.js etc. and send the link of the generated file back to GCP?
I'm open to other ideas as well.
This is a particular case, covered by Google Cloud CDN (Content Delivery Network) service.
As you can read here, there already is a way to connect the CDN to a Storage bucket, and it will do exactly what you've thought to do with your own web server. The only difference is that it's already production ready. It handles cache misses, cache hits and so on.
You can compare the prices: here you can find CDN prices, and here you can find Storage prices. The important difference is that Storage egress is billed per TB, whereas CDN egress is billed per 10 TB, and the CDN price is still lower.
Of course, you can still stick to your idea. I would implement it by developing a REST API. The API, with just one endpoint, will serve the file if it is present on the web server. If it is not present, it will:
perform a redirect to the direct link for the file hosted in Storage;
start to fetch the file from Storage and put it in the cache (sketched below).
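A minimal sketch of such an endpoint, assuming Flask and the google-cloud-storage client (the bucket name, cache directory and route are placeholders, not anything from your setup):

```python
# Sketch of the cache-or-redirect endpoint: serve the local copy if we
# have it, otherwise redirect to Cloud Storage and warm the cache in the
# background for the next request.
import os
import threading

from flask import Flask, redirect, send_from_directory
from google.cloud import storage

app = Flask(__name__)
CACHE_DIR = "/var/cache/files"   # local cache on the cheap web server
BUCKET_NAME = "my-bucket"        # hypothetical GCS bucket


def fetch_to_cache(key):
    dest = os.path.join(CACHE_DIR, key)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    client = storage.Client()
    client.bucket(BUCKET_NAME).blob(key).download_to_filename(dest)


@app.route("/files/<path:key>")
def serve(key):
    if os.path.exists(os.path.join(CACHE_DIR, key)):
        # Cache hit: serve the local copy.
        return send_from_directory(CACHE_DIR, key)
    # Cache miss: warm the cache in the background and redirect this
    # request to the direct Cloud Storage link.
    threading.Thread(target=fetch_to_cache, args=(key,), daemon=True).start()
    return redirect(f"https://storage.googleapis.com/{BUCKET_NAME}/{key}")
```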
You would still need to handle the cache: what happens when somebody changes a file? That's something related to the way you're working with those files, so it strictly depends on your app's functional domain; in any case, Cloud CDN would solve it without any further development.

How to fix page doing 4 extra queries when using easy_thumbnails in combination with Amazon S3?

I'm setting up Amazon S3 to use as my media server for serving image files. I use easy_thumbnails for thumbnailing the images. easy_thumbnails does the cropping before sending the images to S3, therefore storing 4 images, each with a different size. Without Amazon S3, the page does 2 queries to load the page. With Amazon S3 it uses 6 queries for the same page. The queries show that the original file is queried as well as the cropped files. This shouldn't be necessary, I believe. How can I decrease the number of requests it makes when using S3?
This image shows the queries with Amazon S3
This image shows the queries without Amazon S3
Edit
I noticed easy_thumbnails is not optimized for remote storages, according to Django Packages. So, an alternative to easy_thumbnails that is optimized would help me as well!
It looks like easy_thumbnails requests the same image files every time the page is loaded (perhaps caching that doesn't work for easy_thumbnails). As I read that easy_thumbnails isn't optimized for remote storage, I looked for alternatives and tried sorl-thumbnail. This seems to do the job! It doesn't send requests on each page load, and the number of queries therefore decreased a lot!
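For reference, sorl-thumbnail's low-level call looks roughly like this (the model and field names below are made up); it records the generated thumbnail in its key/value store, so subsequent page loads reuse it instead of hitting the remote storage again:

```python
# Sketch of sorl-thumbnail usage; "photo.image" is a hypothetical
# ImageField stored on S3. The thumbnail is generated once and cached.
from sorl.thumbnail import get_thumbnail


def photo_thumb_url(photo):
    thumb = get_thumbnail(photo.image, "300x300", crop="center", quality=85)
    return thumb.url
```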

Uploaded images with malicious code in Amazon S3

I have a custom php web application where users will be able to upload images.
I know that there is a security concern with image files, as a hacker can add malicious code to them and trigger it through the URL of the image file.
So I'm no longer storing images in the web server and uploading them directly into Amazon S3. I was wondering if it is still possible for a hacker to achieve the same results with a malicious image even if the image files are stored in a completely separate place like Amazon S3.
If you upload files to S3, there is no need to worry about server-side exploits like RCE, since S3 is object storage and the files won't be executed, but you do need to take care of client-side vulnerabilities like XSS...
i.e., even in your case of image uploads, an attacker cannot harm the server-side setup directly by exploiting unrestricted file upload, but they can embed a client-side script into the image and exploit that... as #dy10 mentioned, setting the proper content type would help...
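The question is about PHP, but to illustrate that last point in Python (since the exact upload code isn't shown, everything here is a placeholder): force a whitelisted content type and an attachment disposition when writing to S3, so a crafted "image" can't be served back as HTML or script.

```python
# Sketch: upload with an explicit, whitelisted Content-Type and force
# download instead of inline rendering. Bucket and key are placeholders.
import boto3

ALLOWED_TYPES = {"image/jpeg", "image/png", "image/gif"}


def store_image(fileobj, content_type, key, bucket="my-uploads-bucket"):
    if content_type not in ALLOWED_TYPES:
        raise ValueError("unsupported image type")
    s3 = boto3.client("s3")
    s3.upload_fileobj(
        fileobj,
        bucket,
        key,
        ExtraArgs={
            "ContentType": content_type,
            "ContentDisposition": "attachment",
        },
    )
```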

Architecture design: how web-tier knows when worker-tier is done processing?

I'm building a small application for educational purposes using Amazon AWS.
The web application has two parts:
A form for uploading an image.
A grid showing all thumbnails of uploaded images.
The flow of the application:
The user opens the web page.
The user chooses an image to upload.
An AJAX request is sent to the web-tier to generate a pre-signed S3 URL (see the sketch after this list).
Upon receiving the URL, an AJAX PUT request is initiated and the image is uploaded directly to S3.
Upon upload completion, S3 sends an SQS queue message with the image's key.
One of the workers receives that message and creates a thumbnail.
Upon processing completion, the worker uploads the thumbnail to S3.
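(For context, step 3 above is typically just a short view like the following; the bucket name, key scheme and Django-style view are placeholders, not part of my actual setup.)

```python
# Sketch of step 3: the web tier hands the browser a pre-signed PUT URL
# so the image goes straight to S3. Names are illustrative only.
import uuid

import boto3
from django.http import JsonResponse


def presign_upload(request):
    key = f"uploads/{uuid.uuid4()}.jpg"
    s3 = boto3.client("s3")
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "my-images-bucket", "Key": key},
        ExpiresIn=300,  # valid for five minutes
    )
    return JsonResponse({"url": url, "key": key})
```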
Now, the web-tier uses a db.json file for keeping the links to all existing thumbnails. Using that file, the client-side web page renders all the thumbnails in the grid.
The problem is, how would the web-tier know when to update the db.json containing the link for the new thumbnail?
Ideally, the web-tier would accomplish the following:
Refresh the json only when needed (if the web-tier refreshed the json then it must have been modified).
Serve the updated db.json once it's updated (if a thumbnail was added at time x and another user requests the web page at time x+1, that user is aware of the new thumbnail).
A few approaches:
For every index.html request, list the S3 bucket and serve the latest thumbnails (violates item 1 from previous section).
List the S3 bucket on interval basis (violates both items).
Set a timer once a pre-signed URL was requested and assume that the worker is done processing the new image when the timer rings (this is not even a solution, for two main reasons: the web tier has more than one instance, and the timer might ring before processing is done).
Use S3 Events and set up a Lambda function that sends an HTTP GET request to a special endpoint on my web-tier (also not a solution, as this request will be directed by the load balancer to a single instance; what about the other instances?).
I have no idea how to solve this problem.
What do you suggest I do?
Edit
As this is an educational exercise, DB services are out of scope.
The question is a bit ridiculous, with the notion of storing everything in a JSON file that is continuously being updated, but the solution seems obvious enough... another S3 event notification.
Anytime you have a system that hands you the magical gift of events, relieving you of having to poll anything, you'd be remiss to overlook the value that brings.
If each web server keeps its own copy of the json file and needs to update it, that's easily solved, too.
S3 event fires on thumbnail creation (S3 notifications can match prefixes rather than apply to the whole bucket) > S3 event publishes to an SNS topic > the SNS topic fans out to multiple SQS queues, one for each web server. A process on the web server subscribes to that server's queue with a single thread, and each time a message comes in, the json file is modified on that server by the local listener. Each server gets a copy of each notification.
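The per-server listener can stay tiny. A minimal sketch, assuming standard (non-raw) SNS-to-SQS delivery and a local db.json; the queue URL, bucket URL and file path are placeholders:

```python
# Sketch of the single-threaded listener each web server runs: read the
# S3 event (S3 -> SNS -> this server's SQS queue), append the new
# thumbnail to the local db.json, delete the message.
import json

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/web-1-thumbnails"
DB_PATH = "/srv/app/db.json"
BUCKET_URL = "https://my-bucket.s3.amazonaws.com"


def add_thumbnail(key):
    with open(DB_PATH) as fh:
        db = json.load(fh)
    db.setdefault("thumbnails", []).append(f"{BUCKET_URL}/{key}")
    with open(DB_PATH, "w") as fh:
        json.dump(db, fh)


def listen():
    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            envelope = json.loads(msg["Body"])       # SNS envelope
            event = json.loads(envelope["Message"])  # original S3 event
            for record in event.get("Records", []):
                add_thumbnail(record["s3"]["object"]["key"])
            sqs.delete_message(
                QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
            )
```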
I have an old legacy system where web site template changes (not code, just templates) are made live by committing the template changes to Subversion, followed by svn up on the servers. Because this Subversion repo exists for that purpose, the web servers read the templates directly from the checkout directory. Strange as it sounds, it's served well for many years. I recently enhanced it by setting up an arrangement reminiscent of what's described above, but without S3. The "post-commit hook" fires a shell script on the Subversion server when anything is committed. This, in turn, publishes a message about the changed file to an SNS topic, which fans out to several SQS queues -- one for each web server -- and a simple script on each server listens to the SQS queue for that server. One listener, one thread, for each server, so there are no concurrency issues. The listener runs "svn up" on the newly-committed file, deletes the queue message, then listens for the next one. Real-time event fan-out, why not?
Is the db.json file stored on one of the web servers? How can you coordinate updates to the db.json file across multiple web servers? How can you prevent multiple worker servers from updating the db.json file at the same time and stepping on each other?
I would suggest storing the existence of the thumbnails somewhere other than a flat file. DynamoDB would be a great place to store this. PostgreSQL or one of the MySQL flavors on RDS would also work.
To serve the JSON data to the UI that contains the list of thumbnails I would create a dynamic page that queries the database and renders JSON data. This would also allow you to implement things like paging of the data, which will be a requirement once your set of images gets very large.
To prevent the web tier from being overloaded by requests for the JSON data I would place a CDN such as CloudFront or CloudFlare in front of the web tier. To prevent the database from being overloaded with queries for the thumbnail list I would implement a caching layer (Redis) between the web tier and the database.
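A sketch of the dynamic JSON endpoint backed by DynamoDB, with cursor-based paging; the table name, key attribute and page size are assumptions rather than anything from the question:

```python
# Sketch: query DynamoDB for thumbnail URLs and render them as JSON,
# paging with DynamoDB's LastEvaluatedKey. All names are illustrative.
import boto3
from django.http import JsonResponse

TABLE = boto3.resource("dynamodb").Table("thumbnails")
PAGE_SIZE = 50


def thumbnails_json(request):
    kwargs = {"Limit": PAGE_SIZE}
    cursor = request.GET.get("cursor")
    if cursor:
        # Assumes the table's partition key is a string named "image_key".
        kwargs["ExclusiveStartKey"] = {"image_key": cursor}
    resp = TABLE.scan(**kwargs)
    return JsonResponse({
        "thumbnails": [item["thumbnail_url"] for item in resp["Items"]],
        # Pass this back as ?cursor=... to fetch the next page.
        "next_cursor": resp.get("LastEvaluatedKey", {}).get("image_key"),
    })
```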