Django / Docker, manage a million images and many large files

The project I am working on relies on many static files, and I am looking for guidance on how to deal with them. First I will explain the situation, then I will ask my questions.
The Situation
The files that need management:
Roughly 1.5 million .bmp images, about 100 GB
Roughly 100 .h5 files, 250 MB each, about 25 GB
bmp files
The images are part of an image library; the user can filter through them based on multiple kinds of metadata. The metadata is spread out over multiple models, such as Printer, PaperType, and Source.
In development the images sit in the static folder of the Django project, this works fine for now.
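For concreteness, here is a minimal sketch of how such models might fit together; the field names and relationships are assumptions for illustration, not taken from the project:

from django.db import models

class Printer(models.Model):
    name = models.CharField(max_length=100)

class PaperType(models.Model):
    name = models.CharField(max_length=100)

class Source(models.Model):
    name = models.CharField(max_length=100)

class LibraryImage(models.Model):
    # The database stores only the path; the .bmp itself lives on disk
    # (or, in production, on an external store).
    file = models.ImageField(upload_to="library/")
    printer = models.ForeignKey(Printer, on_delete=models.CASCADE)
    paper_type = models.ForeignKey(PaperType, on_delete=models.CASCADE)
    source = models.ForeignKey(Source, on_delete=models.CASCADE)

Filtering then happens in the ORM, e.g. LibraryImage.objects.filter(printer__name="HP", paper_type__name="glossy") (illustrative values).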
h5 files
Each app has its own set of .h5 files. They are used to inspect user-generated images. Results of this inspection are stored in the database; the image itself is stored on disk.
Moving to production
Now that you know a bit about the problem, it is time to ask my questions.
Please note that I have never pushed a Django project to production before. I am also new to Docker.
Docker
The project needs to be deployed on multiple machines; to make this easier I decided to use Docker. I managed to build the Image and run the Container without the .bmp and .h5 files. So far so good!
How do I deal with the .h5 files? It does not seem like a good idea to build an Image that is 25 GB in size. Is there a way to download the .h5 files at a later point in time? As in, building a Docker Image that only contains the code and downloads the .h5 files later.
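One common pattern is to ship only the code in the Image and fetch the .h5 files when the Container first starts, e.g. from an entrypoint script. A minimal sketch, with the base URL, file names, and target directory all placeholders:

# fetch_h5.py - run from the container's entrypoint before the app starts
import urllib.request
from pathlib import Path

H5_BASE_URL = "https://files.example.com/models/"  # placeholder: wherever the files are hosted
H5_DIR = Path("/data/h5")                          # placeholder: ideally a mounted volume
H5_FILES = ["model_a.h5", "model_b.h5"]            # placeholder file names

def fetch_missing():
    H5_DIR.mkdir(parents=True, exist_ok=True)
    for name in H5_FILES:
        target = H5_DIR / name
        if not target.exists():  # only download files that are not already present
            urllib.request.urlretrieve(H5_BASE_URL + name, str(target))

if __name__ == "__main__":
    fetch_missing()

If H5_DIR is a Docker volume, the download happens only on the first start; subsequent container restarts reuse the files already on disk.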
Image files
I'm pretty sure that Django's collectstatic command is not meant for moving the number of images this project uses. I'm thinking along the lines of directly uploading the images to some kind of image server.
If there are specialized image servers I would love to hear your suggestions.
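One route is to treat the images as media/data rather than static assets and push them to an object store (e.g. S3, or a self-hosted alternative such as MinIO), then serve them from there via nginx or a CDN. A rough sketch of a one-off bulk upload with boto3, with the bucket name and source directory as placeholders:

# bulk_upload.py - one-off script to push the .bmp library to object storage
from pathlib import Path
import boto3

BUCKET = "my-image-library"          # placeholder bucket name
SOURCE_DIR = Path("static/images")   # placeholder: where the .bmp files sit in development

s3 = boto3.client("s3")

for path in SOURCE_DIR.rglob("*.bmp"):
    key = str(path.relative_to(SOURCE_DIR))  # keep the directory layout as the object key
    s3.upload_file(str(path), BUCKET, key)

Once the files are in a bucket, packages like django-storages can point Django's media handling at it, so the app never has to move the images itself.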

Related

Images and Thumbnails in Django

I want to store the images along with their thumbnails. I am storing the images in the file system using Django. At first, the user will be able to see the thumbnails, and after clicking one the original image can be seen. I am using a postgres database. Also, I have already installed the Pillow library. Thumbnail size will be approximately 200×200.
Now my questions are:
How should I store the thumbnails? (in the database or in the file system)
How do I convert the images to their thumbnails? (a Python library or something else)
If anything better is possible for the mentioned feature, please let me know.
P.S.: High performance and low page load time are required.
There are third-party apps that do the heavy lifting, like sorl-thumbnail or easy-thumbnails.
For the first question, storing the image in the file system or on a CDN and the path in the database is the best approach, and that's what Django does by default.
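For the second question, Pillow (already installed here) can generate the thumbnails directly; the third-party apps above mostly wrap this. A minimal sketch, with the paths as placeholders:

from PIL import Image

def make_thumbnail(src_path, dst_path, size=(200, 200)):
    im = Image.open(src_path)
    im.thumbnail(size)  # resizes in place, preserving the aspect ratio
    im.save(dst_path)

make_thumbnail("media/photos/cat.jpg", "media/thumbs/cat.jpg")

Image.thumbnail never enlarges and keeps the aspect ratio, so a 200×200 box is an upper bound rather than an exact output size.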

Image files as inputs with AWS Elastic Transcoder?

Here's my situation. I've been working on building a service at work that takes dynamically generated images and outputs animations as mp4 or gif. The user has the options of setting dimensions, time for each frame, etc.
I have this working currently with ffmpeg. It works OK, but it is difficult (and potentially expensive) to scale, largely due to ffmpeg's CPU/memory requirements.
I just spent some time experimenting with AWS's Elastic Transcoder. It doesn't seem to like static image files (jpg, png) as source material in jobs. The file types aren't listed under the available Preset options either.
I'm sure that I could adapt the existing architecture to save the static images as video files (sound isn't needed) and upload those. That will still require ffmpeg to be in the pipeline though.
Are there any other AWS services that might work with my needs and allow the use of Elastic Transcoder?

Large project files in Heroku on Django

I have some large files that I need to store in my project. Their total size is 157 MB. When my application starts, an AppConfig runs that reads the files into memory, where they stay for the lifetime of the application. By the way, these files aren't static/media assets like images; they're numpy arrays for some server-side logic.
So I want a solution which accomplishes the following:
Keeps my files out of source control / the Heroku slug
Does not require me to read from some cloud service S3/Google frequently
Makes my files readily available at the stage where AppConfigs are initialized.
Here is one solution that I thought of: Store the files in S3/Google, and download them from the server when AppConfig is initialized. The downside of this is that it would make restarting my server really slow because downloading 157 MB is pretty time-consuming, and it might cost a bit too.
Is there a better way?
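For concreteness, a minimal sketch of the download-on-init approach described above, with a local cache so files already on disk are not re-fetched (bucket access is stubbed out as a hypothetical helper, and the file names are placeholders):

# apps.py
from pathlib import Path

import numpy as np
from django.apps import AppConfig

CACHE_DIR = Path("/tmp/model-data")    # note: Heroku's filesystem is ephemeral
FILES = ["weights.npy", "lookup.npy"]  # placeholder file names

class CoreConfig(AppConfig):
    name = "core"
    arrays = {}

    def ready(self):
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        for name in FILES:
            path = CACHE_DIR / name
            if not path.exists():
                download_from_bucket(name, path)  # hypothetical helper, e.g. wrapping boto3
            self.arrays[name] = np.load(path)

Because Heroku dynos have an ephemeral filesystem, the cache only helps within the life of a dyno; every dyno restart pays the full download cost again, which is exactly the slowness the asker wants to avoid.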
You would have a hard time finding an ideal solution on Heroku (without paying someone).
Here are some thoughts:
Keep datasets in the database (rows are not free on Heroku)
Keep datasets in memcached/redis (these instances are pretty expensive on Heroku; see the sketch after this list)
OR
Host your site on a cheap VPS :)
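If the redis route is taken, numpy arrays serialize to bytes cleanly. A minimal sketch using redis-py, with the connection and key names as placeholders:

import numpy as np
import redis

r = redis.Redis()  # placeholder; on Heroku you would connect via REDIS_URL

def save_array(key, arr):
    # store the raw bytes plus enough metadata to reconstruct the array
    r.set(key, arr.tobytes())
    r.set(key + ":dtype", str(arr.dtype))
    r.set(key + ":shape", ",".join(map(str, arr.shape)))

def load_array(key):
    dtype = np.dtype(r.get(key + ":dtype").decode())
    shape = tuple(int(n) for n in r.get(key + ":shape").decode().split(","))
    return np.frombuffer(r.get(key), dtype=dtype).reshape(shape)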

How to serve content via Django that can't be served by nginx or apache?

I need to serve some content that should be preprocessed before getting served. There is a huge volume of files (500,000 files with sizes around 1 GB each, for example). My server is written in Django. I can do the preprocessing in Python (a Django view, for example) but I can't do it in nginx or other static file servers. Is there any way to implement a Django view that serves these files efficiently, with random access? Are there any modules for this purpose? What should I take care of to implement it?
P.S. The files are not saved anywhere; there are around 2,000 base files, and all the other files can be generated from those 2,000 in the Python code (a Django view). I don't want to buy that much hard disk to preprocess and store all 500,000 final files.
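One way to get random access is to honour HTTP Range requests inside the view itself. A minimal sketch, where generate_bytes() is a hypothetical stand-in for the preprocessing step; a real implementation would generate only the requested slice instead of the whole file:

import re
from django.http import HttpResponse

def serve_generated(request, name):
    data = generate_bytes(name)  # hypothetical: build the file from the ~2,000 base files
    total = len(data)
    match = re.match(r"bytes=(\d+)-(\d*)", request.headers.get("Range", ""))
    if match:
        # serve the requested byte range as a 206 Partial Content response
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else total - 1
        response = HttpResponse(data[start:end + 1], status=206,
                                content_type="application/octet-stream")
        response["Content-Range"] = f"bytes {start}-{end}/{total}"
    else:
        response = HttpResponse(data, content_type="application/octet-stream")
    response["Accept-Ranges"] = "bytes"
    return response

For gigabyte-sized responses, StreamingHttpResponse with a generator would keep the data from ever sitting fully in memory.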

Sitecore Database and App_Data Size

We have 5 relatively small sites running on top of Sitecore. The oldest has been hosted within the environment for 3 years. Currently the master and web databases are roughly 8 GB apiece - surprising not only in size, but also in that they are nearly identical in size (I would expect the web database to be much smaller). Also surprising is that App_Data is over 50 GB in size (MediaCache is 15 GB and MediaFiles is 37 GB). Any ideas or suggestions on ways to reduce files on disk - even temporarily?
Media Files - media items stored on disk (keep this folder)
Media Cache - where Sitecore caches image versions (e.g. resized images)
You can delete all the contents of the Media Cache folder. Sitecore will gradually recreate the image cache for images that are being used on the sites.
If you use item versioning, you can use the Version Manager and archive old versions. However, as your Master and Web databases are almost the same size, I don't think that will help you. The web database only holds one version of each item.
The last thing would be to crawl through the media library, find items that don't have any referrers in the LinkDatabase, and delete them. Make sure you back everything up first.
http://trac.sitecore.net/VersionManager
If you are storing media assets in the database (I assume you are, based on the sizes), I believe you can delete the App_Data folder and Sitecore will re-add files there as needed. If you're storing media assets on the file system, they're stored in App_Data, which would explain the large size. Is it possible you're storing some assets in the DB and others on the file system? In any case, you should analyze what types of files are in App_Data and compare it to an out-of-the-box Sitecore instance to see what is site content versus generated cache files.
You can try to shrink the databases as well:
DBCC UPDATEUSAGE (web);        -- correct the page and row counts reported for the web database
DBCC SHRINKDATABASE (web, 0);  -- then shrink the web database, leaving 0% free space