How to serve content with Django that can't be served via nginx or Apache?

I need to serve some content that has to be preprocessed before it is served. There is a huge volume of files (for example, 500,000 files of around 1 GB each). My server is written in Django. I can do the preprocessing in Python (in a Django view, for example), but I can't do it in nginx or other static file servers. Is there any way to implement a Django view that serves these files efficiently with random access? Are there any modules for this purpose? What should I take care of when implementing it?
P.S. The final files are not saved anywhere. There are around 2,000 source files, and all the other files can be generated from these 2,000 files in the Python code (a Django view). I don't want to buy that much disk space to preprocess and store all 500,000 final files.
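One possible direction, sketched under assumptions rather than offered as a tested solution: random access over HTTP usually means honouring the Range request header, so the view only generates the bytes that were asked for. The sketch assumes the preprocessing can produce an arbitrary byte window on demand; generated_size and generate_window are hypothetical placeholders for that code, and serve_generated is an illustrative view name. An unsatisfiable range simply falls back to the full file here, where a stricter server would answer 416.

```python
import re

CHUNK = 64 * 1024  # bytes generated per iteration


def parse_range(header, total):
    """Parse a single 'bytes=start-end' Range header into an inclusive (start, end).

    Returns None when the header is missing, malformed or unsatisfiable,
    so the caller can fall back to sending the whole file.
    """
    m = re.fullmatch(r"bytes=(\d*)-(\d*)", header or "")
    if not m or (m.group(1) == "" and m.group(2) == ""):
        return None
    first, last = m.groups()
    if first == "":  # suffix form: the last N bytes
        n = int(last)
        if n == 0:
            return None
        start, end = max(total - n, 0), total - 1
    else:
        start = int(first)
        end = min(int(last), total - 1) if last else total - 1
    if start > end or start >= total:
        return None
    return start, end


def iter_range(generate, start, end):
    """Yield bytes start..end (inclusive) by calling generate(offset, length)."""
    pos = start
    while pos <= end:
        length = min(CHUNK, end - pos + 1)
        yield generate(pos, length)
        pos += length


def generated_size(name):
    """Hypothetical placeholder: total size in bytes of the virtual file."""
    return 1000


def generate_window(name, offset, length):
    """Hypothetical placeholder: build bytes [offset, offset + length) from the sources."""
    return bytes((offset + i) % 256 for i in range(length))


def serve_generated(request, name):
    # Imported lazily so the helpers above can be tried without Django installed.
    from django.http import StreamingHttpResponse

    total = generated_size(name)
    rng = parse_range(request.headers.get("Range"), total)
    start, end = rng if rng else (0, total - 1)
    body = iter_range(lambda off, n: generate_window(name, off, n), start, end)
    response = StreamingHttpResponse(
        body, status=206 if rng else 200, content_type="application/octet-stream"
    )
    response["Accept-Ranges"] = "bytes"
    response["Content-Length"] = str(end - start + 1)
    if rng:
        response["Content-Range"] = f"bytes {start}-{end}/{total}"
    return response
```

Streaming in fixed-size chunks keeps memory use flat regardless of file size. Whether this is efficient enough depends mostly on how cheaply the preprocessing can jump to an arbitrary offset.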

Related

Django / Docker, manage a million images and many large files

The project I am working on relies on many static files. I am looking for guidance on how to deal with the situation. First I will explain the situation, then I will ask my questions.
The Situation
The files that need management:
Roughly 1.5 million .bmp images, about 100 GB
Roughly 100 .h5 files, 250 MB each, about 25 GB
bmp files
The images are part of an image library; the user can filter through them based on several different kinds of metadata. The metadata is spread out over multiple models such as Printer, PaperType, and Source.
In development the images sit in the static folder of the Django project, this works fine for now.
h5 files
Each app has its own set of .h5 files. They are used to inspect user-generated images. The results of this inspection are stored in the database; the image itself is stored on disk.
Moving to production
Now that you know a bit about the problem it is time to ask my questions.
Please note that I have never pushed a Django project to production before. I am also new to Docker.
Docker
The project needs to be deployed on multiple machines; to make this easier I decided to use Docker. I managed to build the image and run the container without the .bmp and .h5 files. So far so good!
How do I deal with the .h5 files? It does not seem like a good idea to build an Image that is 25 GB in size. Is there a way to download the .h5 file at a later point in time? As in building a Docker Image that only contains the code and downloads the .h5 files later.
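A minimal sketch of the "download later" idea, assuming the .h5 files are reachable over HTTP and that the container has a volume mounted for them: a small script run at container start fetches only the files that are missing. The MODEL_DIR variable, the /data/models path, the FILES mapping and its URL are all hypothetical names chosen for illustration.

```python
import os
import urllib.request

# Hypothetical location of a mounted volume and hypothetical download URLs.
MODEL_DIR = os.environ.get("MODEL_DIR", "/data/models")
FILES = {
    "example.h5": "https://files.example.com/example.h5",
}


def ensure_file(path, url, fetch=urllib.request.urlretrieve):
    """Download url to path unless the file already exists. Returns True if downloaded."""
    if os.path.exists(path):
        return False
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    tmp = path + ".part"
    fetch(url, tmp)        # write under a temporary name first
    os.replace(tmp, path)  # rename, so a partial download is never taken for a complete file
    return True


if __name__ == "__main__":
    # Intended to run before the Django process starts, e.g. from the container entrypoint.
    for name, url in FILES.items():
        downloaded = ensure_file(os.path.join(MODEL_DIR, name), url)
        print(name, "downloaded" if downloaded else "already present")
```

Because the files live on a volume rather than in the image, the image contains only the code, and a restarted container skips anything it has already fetched.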
Image files
I'm pretty sure that Django's collectstatic command is not meant for moving the number of images the project uses. I'm thinking along the lines of directly uploading the images to some kind of image server.
If there are specialized image servers I would love to hear your suggestions.

Static Files and Upload, process, and download in Django

I've made a desktop app in Python that processes .xls files with openpyxl, tkinter and other libraries. Now I'm looking to run this app on www.pythonanywhere.com. I was expecting to build an app that uploads the file to the server, processes it and then returns the modified file to the user. But after long months of struggling with Django I've run into the problem of static media. As I understand it, Django doesn't serve files in production mode. Does this mean that I can't upload, process and download as I was planning? Or can I run the processing algorithm on the uploaded file (request.FILES) in the view function and then return the modified file, regardless of static files not being served? Am I missing something? Does Flask have the same issue? What's the optimal solution? Apologies for the many questions.
P.S. I took the time to read many similar questions, and it seems to be possible to upload and download, but I'm still missing something about how static files are handled and what that means.
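For what it's worth, returning a file from a view is an ordinary HTTP response and does not go through static file serving at all. A rough sketch of processing an upload in memory and sending the result back follows; the form field name "file", the view name, and the bytes.upper stand-in for the real openpyxl processing are all assumptions made for illustration.

```python
import io


def process_bytes(data, transform):
    """Apply transform to the uploaded bytes and return the result as an in-memory file."""
    return io.BytesIO(transform(data))


def upload_process_download(request):
    # Imported lazily so process_bytes can be tried without Django installed.
    from django.http import FileResponse, HttpResponseBadRequest

    uploaded = request.FILES.get("file")  # "file" is an assumed form field name
    if request.method != "POST" or uploaded is None:
        return HttpResponseBadRequest("POST a file in the 'file' field")

    # bytes.upper is only a placeholder for the real processing step.
    result = process_bytes(uploaded.read(), transform=bytes.upper)
    return FileResponse(result, as_attachment=True, filename="processed_" + uploaded.name)
```

Nothing is written to disk here, so no static or media configuration is involved; that only matters once files need to be kept and served again later.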

Large project files in Heroku on Django

I have some large files that I need to store in my project. Their sum total size is 157 MB. When my application starts, I have an AppConfig which runs. It reads the files into memory and they stay there for the lifetime of the application. By the way, these files aren't static/media assets like images; they're numpy arrays for some server-side logic.
So I want a solution which accomplishes the following:
Keeps my files out of source control / the Heroku slug
Does not require me to read from some cloud service S3/Google frequently
Has my files readily available during the stage where AppConfigs are initialized.
Here is one solution that I thought of: Store the files in S3/Google, and download them from the server when AppConfig is initialized. The downside of this is that it would make restarting my server really slow because downloading 157 MB is pretty time-consuming, and it might cost a bit too.
Is there a better way?
You would have a hard time finding an ideal solution on Heroku (without paying someone).
Here are some thoughts:
Keep the datasets in the database (rows are not free on Heroku)
Keep the datasets in memcached/Redis (these instances are pretty expensive on Heroku)
OR
Host your site on a cheap VPS :)

Django Whitenoise cache busting control

I've run manage.py collectstatic and WhiteNoise has post-processed all of the static files. I'm not quite sure what I should do now if I want to change or update some of the files, for example my .css stylesheet. Should I run manage.py collectstatic every time files have been changed? I'm asking because my development server takes about 45 minutes to finish that task, and I'm not sure that's normal, since I have only 550 static files, 250 MB in total.
Secondly, as WhiteNoise doesn't support serving media files, I use Amazon CloudFront for that. How can I control cache busting for the media files that users have uploaded? This is very important for me.
Yes, you'll need to run collectstatic every time your files change.
It's very unusual to have 250MB of static files. Also, because Django's cache busting creates a copy of each file with a unique name you'll end up with two copies of each file so that's 500MB already. On top of this WhiteNoise will be creating gzip-compressed versions of each file so you could be heading towards 1GB of files.
One quick way of speeding up this process would be to tell WhiteNoise not to compress your PDF files, which you can do by adding .pdf to the WHITENOISE_SKIP_COMPRESS_EXTENSIONS setting.
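As a sketch of what that setting might look like in settings.py: as far as I can tell, defining it replaces WhiteNoise's built-in list rather than extending it, and entries are written without the leading dot. The other extensions below are an illustrative list of already-compressed formats, not a guaranteed copy of the defaults, so it is worth checking them against the installed version.

```python
# settings.py (fragment)
# Defining this setting replaces the built-in list, so formats that are
# already compressed are listed again alongside "pdf".
WHITENOISE_SKIP_COMPRESS_EXTENSIONS = (
    "jpg", "jpeg", "png", "gif", "webp",
    "zip", "gz", "tgz", "bz2", "tbz", "xz", "br",
    "swf", "flv", "woff", "woff2",
    "pdf",  # added: do not spend time compressing PDF files
)
```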
It sounds like your brochures would be better stored as user media though, rather than static assets.
To control caching you should make your code generate a unique name for each file when it's uploaded (adding a random string as a prefix to the filename should do the trick). You can then set caching headers on these files for as long as you like.
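One way the random-prefix idea could look, as a sketch: Django's FileField accepts a callable for upload_to, which receives the model instance and the original filename. The function name unique_upload_path and the "uploads/" directory are illustrative choices, not something from the question.

```python
import os
import uuid


def unique_upload_path(instance, filename):
    """Return 'uploads/<random hex>_<original name>' so each upload gets its own URL."""
    base = os.path.basename(filename)  # drop any directory part sent by the client
    return f"uploads/{uuid.uuid4().hex}_{base}"


# Assumed usage in a model definition:
#     brochure = models.FileField(upload_to=unique_upload_path)
```

Since a re-uploaded file ends up under a new name, the old URL can be cached for a long time without ever serving stale content.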

Django media files under GIT

We have just started using a Git account for our Django website project so that the team can collaborate on the source code.
I have heard different things about what should be done with the /media directory. We currently keep the /static directory under version control so that the whole project can be cloned and recreated. However, the website also contains a large amount (>400 MB) of uploaded images for galleries, which will likely grow over time.
Should this be under Git as well? Is there a reasonable size limit to be aware of when using Git? And is there some other method for dealing with the /media folder that is used by the Django community?
Any guidance would be much appreciated.
You should exclude your media folder in .gitignore. Keeping it in Git causes some problems:
If the checked-in files are modified on the server (for example by an upload script), you can no longer pull cleanly.
Whenever you need your sources, you have to download all of the media files as well.
You have to commit new files on your server every time.
So we use Git without the media files. But if you have automated deployment and enough time, you can do it.
Definitely don't put all your uploaded files from the live site in the source code. It's not where they belong. At the very least you should back up your /media directory to an external location e.g. another server, a local NAS, some backup provider etc.
If your development team wants access to the files during development, you should consider putting a small subset of these files in your source tree and using fixtures to create a standard set of test data for the development environment.