I have some large files that I need to store in my project; their combined size is 157 MB. When my application starts, an AppConfig runs which reads the files into memory, and they stay there for the lifetime of the application. By the way, these files aren't static/media assets like images; they're numpy arrays used for some server-side logic.
So I want a solution which accomplishes the following:
Keeps my files out of source control / the Heroku slug
Does not require me to read from some cloud service S3/Google frequently
Makes my files readily available during the stage where AppConfigs are initialized.
Here is one solution I thought of: store the files in S3/Google Cloud Storage, and download them on the server when the AppConfig is initialized. The downside is that it would make restarting my server really slow, because downloading 157 MB is pretty time-consuming, and it might cost a bit too.
Is there a better way?
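For what it's worth, the download-in-AppConfig approach can at least be made restart-friendly by caching the files on local disk and only downloading what's missing. This is just a sketch: `fetch` is a placeholder for your real downloader (e.g. boto3's `download_file`), and the file names are made up.

```python
# Minimal sketch of the "download at startup" idea with a local disk
# cache, so a process restart only re-downloads files that are missing.
# `fetch(name, path)` is a placeholder for a real downloader, e.g.
# s3_client.download_file(bucket, name, path).
import os
import numpy as np

def load_arrays(names, cache_dir, fetch):
    """Return {name: ndarray}, downloading each missing file via fetch(name, path)."""
    os.makedirs(cache_dir, exist_ok=True)
    arrays = {}
    for name in names:
        path = os.path.join(cache_dir, name)
        if not os.path.exists(path):
            fetch(name, path)  # only hit the network for missing files
        arrays[name] = np.load(path)
    return arrays
```

In a Django project you would call something like `load_arrays` from `AppConfig.ready()` and stash the result on the app config instance. Note that on Heroku the dyno filesystem is ephemeral, so a disk cache only helps across in-dyno restarts, not across deploys or dyno cycles.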
You would have a hard time finding an ideal solution for Heroku (without paying someone).
Here are some thoughts:
Keep the datasets in a database (rows are not free on Heroku)
Keep the datasets in memcached/Redis (these instances are pretty expensive on Heroku)
OR
Host your site on a cheap VPS :)
Related
I'm building a web API by following the YouTube video below, and everything made sense to me until the AWS S3 bucket setup. He first deploys everything locally; then, after making sure everything works, he transfers all the static files to AWS, and for the database he switches from SQLite3 to Postgres.
django portfolio
I still don't understand this part: why do we need to put our static files on AWS and create a PostgreSQL database when Django already ships with a default SQLite3 database? I'm thinking that since I'm the only admin, just connecting my GitHub repo to Heroku should be enough, and any time I change something in the API I only need to push those changes to the GitHub master branch, and that should be it.
Why do we need to use AWS for the static file location, set up an RDS (relational database service), and do all these things from the beginning? Still not getting it!
Can anybody help explain this?
Thanks
Databases
There are several reasons a video guide would encourage you to switch from SQLite to a database server such as MySQL or PostgreSQL:
SQLite is great but doesn't scale well if you're expecting a lot of traffic
SQLite doesn't work if you want to distribute your app across multiple servers. Going back to Heroku: if you serve your app with multiple dynos, you'll have a problem, because each dyno will use a distinct SQLite database. If you edit something through the admin, the change will land in one of these databases at random, leading to inconsistencies
Some Django features aren't available on SQLite
SQLite is the default database in Django because it works out of the box, and is extremely fast and easy to use in local/development environments for prototyping.
However, it is usually not suited for production websites. Additionally, while it can be tempting to store your sqlite.db file along with your code, for instance in a git repository, it is considered a bad practice because your database can contain sensitive data (such as passwords, usernames, emails, etc.). Hence, a strict separation between your code and data is a good practice.
Another way to put it is that your code and your data have different lifecycles. You want to be able to edit data in your database without redeploying your code, and update your code without touching your database.
Even if you can restrict public access to some files on GitHub, this is still not a good practice: when you work in a team, developers may have access to the code but not to the production data, because the data is usually sensitive. If you work with 5 people and each of them has a copy of your database, the risk of losing it or having it stolen is 5x higher ;)
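To make the database switch concrete: on Heroku, the server database is advertised through a `DATABASE_URL` environment variable, and your settings derive the connection from it. Most projects use the dj-database-url package for this; below is a stdlib-only sketch of the same idea, so you can see there's no magic involved.

```python
# Sketch: deriving Django's DATABASES setting from a Heroku-style
# DATABASE_URL such as postgres://user:pass@host:5432/name.
# In practice the dj-database-url package does this for you.
from urllib.parse import urlparse

def database_from_url(url):
    parsed = urlparse(url)
    return {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": parsed.path.lstrip("/"),
        "USER": parsed.username,
        "PASSWORD": parsed.password,
        "HOST": parsed.hostname,
        "PORT": parsed.port or 5432,  # default Postgres port
    }

# settings.py would then do something like:
# DATABASES = {"default": database_from_url(os.environ["DATABASE_URL"])}
```

Locally, where `DATABASE_URL` is unset, you can keep falling back to SQLite; that is exactly the code/data separation described above.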
Static files
When you work locally, Django's built-in runserver command handles the serving of static assets such as CSS, JavaScript and images for you.
However, this server is not designed for production use either. It works great in development, but it will start to fail very quickly on a production website, which has to handle far more requests than your local version.
Because of that, you need to host these static files somewhere else, and AWS is one place where you can do that. AWS will serve those files for you, in a very efficient way. There are other options available, for instance configuring a reverse proxy with Nginx to serve the files for you, if you're using a dedicated server.
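For the AWS route, the settings change itself is small. A sketch using the django-storages package (the bucket name and region below are placeholders, not from the original post):

```python
# settings.py sketch: serve static files from S3 via django-storages
# (assumes django-storages and boto3 are installed; the bucket name and
# region are placeholders, and credentials normally come from env vars).
STATICFILES_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
AWS_STORAGE_BUCKET_NAME = "my-app-static"   # placeholder
AWS_S3_REGION_NAME = "eu-west-1"            # placeholder
AWS_QUERYSTRING_AUTH = False  # plain public URLs, cacheable by browsers/CDNs
```

After that, `python manage.py collectstatic` uploads to the bucket instead of a local folder, and the templates' `{% static %}` URLs point at S3.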
As far as I can tell, the progression the video describes takes you from a local development environment to a more efficient and scalable production setup. That is to be expected: it's less daunting to start with something really simple (SQLite, Django's built-in runserver) and move on to more complex and abstract topics and tools later.
The project I am working on relies on many static files. I am looking for guidance on how to deal with the situation. First I will explain the situation, then I will ask my questions.
The Situation
The files that need management:
Roughly 1.5 million .bmp images, about 100 GB
Roughly 100 .h5 files, 250 MB each, about 25 GB
bmp files
The images are part of an image library; the user can filter through them based on several kinds of metadata. The metadata is spread out over multiple Models, such as: Printer, PaperType, and Source.
In development the images sit in the static folder of the Django project; this works fine for now.
h5 files
Each app has its own set of .h5 files. They are used to inspect user-generated images. The results of this inspection are stored in the database; the image itself is stored on disk.
Moving to production
Now that you know a bit about the problem it is time to ask my questions.
Please note that I have never pushed a Django project to production before. I am also new to Docker.
Docker
The project needs to be deployed on multiple machines, and to make this easier I decided to use Docker. I managed to build the image and run the container without the .bmp and .h5 files. So far so good!
How do I deal with the .h5 files? It does not seem like a good idea to build an image that is 25 GB in size. Is there a way to download the .h5 files at a later point in time? As in, building a Docker image that contains only the code and downloads the .h5 files later.
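Yes, downloading at a later point is a common pattern: keep the image code-only and pull the model files from object storage in the container's entrypoint. A hedged sketch of that entrypoint logic; `fetch_file` is a stand-in for whatever downloader you actually use (e.g. `aws s3 cp` or `curl`), and the paths are assumptions:

```shell
#!/bin/sh
# Sketch of an entrypoint helper: the Docker image ships only code, and
# any .h5 file that is missing gets downloaded at container start.
# fetch_file is a placeholder, e.g.:
#   fetch_file() { aws s3 cp "s3://my-bucket/models/$1" "$2"; }
download_missing() {
    dir="$1"; shift
    mkdir -p "$dir"
    for name in "$@"; do
        if [ ! -f "$dir/$name" ]; then
            fetch_file "$name" "$dir/$name"
        fi
    done
}
```

The entrypoint would call `download_missing /srv/models model_a.h5 model_b.h5 ...` before `exec`-ing the app server. If you mount `/srv/models` as a volume, the 25 GB survives container rebuilds and can be shared by containers on the same machine.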
Image files
I'm pretty sure that Django's collectstatic command is not meant for moving the amount of images the project uses. I'm thinking along the lines of directly uploading the images to some kind of image server.
If there are specialized image servers I would love to hear your suggestions.
I understand other files being ignored. But why would I want to ignore the SQLite database file if that holds data needed to run the website? How can the website function without a database?
You probably only want to write to one instance of the file. This means it either lives in production or in your sandbox. If you change data in production, it's now newer than what you are tracking in git, and it will presumably be overwritten on the next deploy, causing data loss.
A couple of minor issues:
git doesn't perform well when you store large binary files in it.
git can track binary files (like images), but you don't get as much value from it, e.g. being able to diff your .sqlite file before/after a change.
Because you should use different databases in your testing and production environments.
How can I get Django to serve media files in production when DEBUG = False on a Heroku server?
I know that it’s better not to do this and that this will lead to a loss of performance, but my application is used only by me and my cat, so I don't think that this will be unjustified in my case.
The reason this won't work is that the Heroku filesystem is ephemeral: any files uploaded after your app code is pushed will be lost whenever your app is restarted. This leaves your app with image links in the DB that point to non-existent files.
You can read more about it here:
https://help.heroku.com/K1PPS2WM/why-are-my-file-uploads-missing-deleted
Your best bet is using a bucket like Amazon S3 to upload your files to. It costs almost nothing for small use, and is very reliable.
https://blog.theodo.com/2019/07/aws-s3-upload-django/
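The settings change for pushing uploads to S3 is small once django-storages is installed. A sketch (the bucket name and region are placeholders; the keys come from Heroku config vars, never from the repo):

```python
# settings.py sketch: store user uploads on S3 via django-storages
# (assumes django-storages and boto3 are installed; bucket name and
# region are placeholders).
import os

DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
AWS_STORAGE_BUCKET_NAME = "my-app-media"   # placeholder
AWS_S3_REGION_NAME = "eu-west-1"           # placeholder
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY")
AWS_S3_FILE_OVERWRITE = False  # don't clobber uploads that share a name
```

With this in place, `FileField`/`ImageField` uploads go straight to the bucket, so they survive dyno restarts.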
I have a website that consists of one main website and several subsites (and more are coming).
The thing is, the main site and the subsites have the same layout, use the same JS, etc., so I want to ask if it's possible for all the sites to share a single static folder.
The static folder is 130 MB at the moment. I find it kind of redundant that I need to copy that folder every time a new site is created. With 200 sites (a somewhat realistic goal), that would be roughly 26 GB of space wasted on duplicate files.
So is there a way to do this? I know it is somewhat against good Django practice (no use of collectstatic).
In a situation like this, I would use Amazon S3 and CloudFront. You can transparently upload all of your static files to your S3 bucket using django-storages when you run collectstatic, by replacing the default file storage mechanism with boto + S3, like so:
DEFAULT_FILE_STORAGE = 'storages.backends.s3boto.S3BotoStorage'
As @AdamKG stated, if all of these sites share the same code with different content, you're probably better off using django-CMS or moving these sites to database records rather than deploying the same code over and over.
AdamKG gave me the "right" answer - at least for my needs.
I might move to S3 at some point, when it's more relevant.
"Well, the easy hacks are symlinks & related. The question you should be asking, though, is why you're using django projects as a unit of what seems to be (going by count ~ 200) a commodity. IOW: why do the sites have separate anything, including static media, instead of just being rows in a table? "