Sitecore Database and App_Data Size - sitecore

We have 5 relatively small sites running on top of Sitecore. The oldest has been hosted within the environment for 3 years. Currently both the master and web databases are roughly 8 GB apiece. The size itself is surprising, and so is the fact that the two are nearly identical in size (I would expect the web database to be much smaller). Also surprising is that App_Data is over 50 GB in size (MediaCache is 15 GB and MediaFiles is 37 GB). Any ideas or suggestions on ways to reduce files on disk, even temporarily?

Media Files - media items stored on disk (keep this folder)
Media Cache - where Sitecore caches image variants (e.g. resized images)
You can delete all the contents of the Media Cache folder. Sitecore will gradually recreate the cached variants of the images that are actually used on the sites.
If you use item versioning, you can use the Version Manager to archive old versions. However, as your master and web databases are almost the same size, I don't think that will help you much. The web database only holds 1 version of each item.
The last thing would be to crawl through the media library and find items that don't have any referrers in the LinkDatabase and delete them. Make sure you back everything up first.
http://trac.sitecore.net/VersionManager
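Clearing the Media Cache folder, as suggested above, can be scripted. The following is a minimal sketch rather than a Sitecore tool: the function name `clear_directory` and the example path are assumptions, and it simply deletes everything inside a folder while leaving the folder itself in place.

```python
import shutil
from pathlib import Path


def clear_directory(folder: Path) -> int:
    """Delete everything inside `folder` but keep the folder itself.

    Returns the number of top-level entries that were removed.
    """
    removed = 0
    for entry in folder.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a whole cached sub-tree
        else:
            entry.unlink()  # remove a single file or a symlink
        removed += 1
    return removed


if __name__ == "__main__":
    # Hypothetical location; point this at the real App_Data/MediaCache folder.
    media_cache = Path(r"C:\inetpub\wwwroot\MySite\App_Data\MediaCache")
    if media_cache.is_dir():
        print(f"Removed {clear_directory(media_cache)} entries")
```

Only run something like this against MediaCache, never against MediaFiles, and take a backup first.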

If you are storing media assets in the database (which I assume you are, based on the sizes), I believe you can delete the App_Data folder and Sitecore will re-add files there as needed. If you're storing media assets on the file system, they're stored in App_Data, which would explain the large size. Is it possible you're storing some assets in the DB and others on the file system? In any case, you should analyze what types of files are in App_Data and compare it to an out-of-the-box Sitecore instance to see what is site content vs. generated cache files.
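One way to analyze what is in App_Data is to total the bytes per file extension. This is a rough sketch under the assumption that a plain directory walk is enough; `size_by_extension` and the example path are hypothetical names, not part of Sitecore.

```python
import os
from collections import Counter
from pathlib import Path


def size_by_extension(root: Path) -> Counter:
    """Return total bytes per lower-cased file extension under `root`."""
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            ext = path.suffix.lower() or "(none)"
            totals[ext] += path.stat().st_size
    return totals


if __name__ == "__main__":
    # Hypothetical location; adjust to the real App_Data folder.
    app_data = Path(r"C:\inetpub\wwwroot\MySite\App_Data")
    if app_data.is_dir():
        for ext, size in size_by_extension(app_data).most_common(10):
            print(f"{ext:10} {size / 1024 ** 3:8.2f} GB")
```

Running the same script against a fresh Sitecore install gives a baseline to compare against.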

You can try to shrink databases as well
DBCC UPDATEUSAGE (web)
DBCC SHRINKDATABASE(web, 0);

Related

Django / Docker, manage a million images and many large files

The project I am working on relies on many static files. I am looking for guidance on how to deal with the situation. First I will explain the situation, then I will ask my questions.
The Situation
The files that need management:
Roughly 1.5 million .bmp images, about 100 GB
Roughly 100 .h5 files, 250 MB each, about 25 GB
bmp files
The images are part of an image library; the user can filter through them based on several kinds of metadata. The metadata is spread out over multiple models such as Printer, PaperType, and Source.
In development the images sit in the static folder of the Django project, this works fine for now.
h5 files
Each app has its own set of .h5 files. They are used to inspect user-generated images. Results of this inspection are stored in the database; the image itself is stored on disk.
Moving to production
Now that you know a bit about the problem it is time to ask my questions.
Please note that I have never pushed a Django project to production before. I am also new to Docker.
Docker
The project needs to be deployed on multiple machines; to make this easier I decided to use Docker. I managed to build the Image and run the Container without the .bmp and .h5 files. So far so good!
How do I deal with the .h5 files? It does not seem like a good idea to build an Image that is 25 GB in size. Is there a way to download the .h5 file at a later point in time? As in building a Docker Image that only contains the code and downloads the .h5 files later.
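One possible shape for "download the .h5 files later" is a small startup script that fetches only the files that are missing from a mounted directory. This is a sketch under assumptions, not an established recipe: the base URL, file names, destination path and the function `ensure_models` are all hypothetical, and it presumes the files are reachable over plain HTTP(S).

```python
import shutil
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, List


def _download(url: str, target: Path) -> None:
    """Stream `url` to a temporary file, then move it into place."""
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(target)  # a half-finished download never looks complete


def ensure_models(
    base_url: str,
    names: Iterable[str],
    dest: Path,
    download: Callable[[str, Path], None] = _download,
) -> List[Path]:
    """Fetch every file in `names` that is not already in `dest`."""
    dest.mkdir(parents=True, exist_ok=True)
    fetched = []
    for name in names:
        target = dest / name
        if target.exists():
            continue  # already present, e.g. on a persistent volume
        download(f"{base_url.rstrip('/')}/{name}", target)
        fetched.append(target)
    return fetched


if __name__ == "__main__":
    # Hypothetical values; replace with the real file host and file list.
    ensure_models("https://files.example.com/models", ["app1.h5"], Path("/data/models"))
```

If `/data/models` is a Docker volume, the download happens once per machine rather than once per container start, and the image itself stays code-only.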
Image files
I'm pretty sure that Django's collectstatic command is not meant for moving the amount of images the project uses. I'm thinking along the lines of directly uploading the images to some kind of image server.
If there are specialized image servers I would love to hear your suggestions.

How can I get Django to process media files in production?

How can I get Django to process media files in production when DEBUG = False on a Heroku server?
I know that it’s better not to do this and that this will lead to a loss of performance, but my application is used only by me and my cat, so I don't think that this will be unjustified in my case.
This won't work because the Heroku file system is ephemeral: any files uploaded after your app code is pushed are lost whenever your app is restarted. This will leave your app with image links in the DB that point to non-existent files.
You can read more about it here:
https://help.heroku.com/K1PPS2WM/why-are-my-file-uploads-missing-deleted
Your best bet is using a bucket like Amazon S3 to upload your files to. It costs almost nothing for small use, and is very reliable.
https://blog.theodo.com/2019/07/aws-s3-upload-django/
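As a rough illustration of the S3 approach, the settings fragment below routes uploaded media to a bucket through django-storages. It assumes Django 4.2 or newer (for the `STORAGES` setting) and that `django-storages` and `boto3` are installed; the bucket name and region defaults are placeholders, not values from the question.

```python
import os

# settings.py fragment (sketch). Assumes Django 4.2+ and
# `pip install django-storages boto3`.
STORAGES = {
    # Uploaded media goes to S3 instead of the ephemeral dyno disk.
    "default": {"BACKEND": "storages.backends.s3boto3.S3Boto3Storage"},
    # Static files stay on Django's default backend in this sketch.
    "staticfiles": {
        "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage"
    },
}

# Placeholder defaults; set the real values as Heroku config vars.
AWS_STORAGE_BUCKET_NAME = os.environ.get("AWS_STORAGE_BUCKET_NAME", "my-media-bucket")
AWS_S3_REGION_NAME = os.environ.get("AWS_S3_REGION_NAME", "eu-west-1")
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY")
```

Keeping the credentials in environment variables means they never end up in the repository.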

Large project files in Heroku on Django

I have some large files that I need to store in my project. Their sum total size is 157 MB. When my application starts, I have an AppConfig which runs. It reads the files into memory and they stay there for the lifetime of the application. By the way, these files aren't static/media assets like images; they're numpy arrays for some server-side logic.
So I want a solution which accomplishes the following:
Keeps my files out of source control / the Heroku slug
Does not require me to read from some cloud service S3/Google frequently
Has my files readily available during the stage where AppConfigs are initialized.
Here is one solution that I thought of: Store the files in S3/Google, and download them from the server when AppConfig is initialized. The downside of this is that it would make restarting my server really slow because downloading 157 MB is pretty time-consuming, and it might cost a bit too.
Is there a better way?
You will have a hard time finding an ideal solution on Heroku (without paying someone).
Here are some thoughts:
Keep datasets in database ( rows are not free on Heroku )
Keep datasets in memcached/redis ( these instances are pretty expensive on heroku )
OR
Host your site on cheap VPS :)

Using same static folder on multiple django sites

I have a website that consists of one main website and several subsites (and more are coming).
The thing is, the main site and the subsites have the same layout, use the same JS etc., so I want to ask if it's possible for all the sites to share a single static folder?
The static folder is 130 MB at the moment. I find it kind of redundant that I need to copy that folder every time a new site is created. With 200 sites (a somewhat realistic goal), that would be roughly 26 GB of space wasted on duplicate files.
So is there a way to do this? I know it is somewhat against good Django practice (no use of collectstatic).
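In a situation like this, I would use Amazon S3 and CloudFront. You can transparently upload all of your static files to your S3 bucket using django-storages when you run collectstatic, by pointing Django's storage settings at the boto + S3 backend as such:
DEFAULT_FILE_STORAGE = 'storages.backends.s3boto.S3BotoStorage'
STATICFILES_STORAGE = 'storages.backends.s3boto.S3BotoStorage'
Note that collectstatic uses STATICFILES_STORAGE; DEFAULT_FILE_STORAGE only covers uploaded media files. Newer releases of django-storages ship this backend as storages.backends.s3boto3.S3Boto3Storage.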
As @AdamKG stated, if all of these sites share the same code with different content, you're probably better off using Django-CMS or moving these sites to database records rather than deploying the same code over and over.
AdamKG gave me the "right" answer - at least for my needs.
I might move to S3 at some point, when it's more relevant.
"Well, the easy hacks are symlinks & related. The question you should be asking, though, is why you're using django projects as a unit of what seems to be (going by count ~ 200) a commodity. IOW: why do the sites have separate anything, including static media, instead of just being rows in a table? "
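The symlink hack mentioned in the quote could look roughly like this. It is a sketch under assumptions: the directory layout, the link name `static` and the helper `link_shared_static` are illustrative, and creating symlinks on Windows may need extra privileges.

```python
import os
from pathlib import Path
from typing import Iterable, List


def link_shared_static(
    shared: Path, site_dirs: Iterable[Path], link_name: str = "static"
) -> List[Path]:
    """Create `<site>/<link_name>` symlinks that all point at `shared`.

    Sites that already have something at that path are left untouched.
    """
    created = []
    for site in site_dirs:
        link = Path(site) / link_name
        if link.is_symlink() or link.exists():
            continue  # never overwrite an existing folder or link
        os.symlink(shared, link, target_is_directory=True)
        created.append(link)
    return created


if __name__ == "__main__":
    # Hypothetical layout: one shared folder, one directory per site.
    sites_root = Path("/srv/sites")
    if sites_root.is_dir():
        sites = [p for p in sites_root.iterdir() if p.is_dir()]
        print(link_shared_static(Path("/srv/shared_static"), sites))
```

Every site then sees the same files on disk, so a new site costs one link instead of another 130 MB copy.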

What is the best way to backup a django project?

I maintain a couple of low-traffic sites that have reasonable user uploaded media files and semi big databases. My goal is to backup all the data that is not under version control in a central place.
My current approach
At the moment I use a nightly cronjob that uses dumpdata to dump all the DB content into JSON files in a subdirectory of the project. The media uploads are already in the project directory (in media).
After the DB is dumped, the files are copied with rdiff-backup (makes an incremental backup) into another location. I then download the rdiff-backup directory on a regular basis with rsync to store a local copy.
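The nightly job described above might be sketched as two commands run in sequence. The paths and function names are assumptions for illustration; `--indent` and `--output` are standard dumpdata options, and the rdiff-backup call uses the classic `rdiff-backup <source> <destination>` form, which newer releases may report as deprecated.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def dumpdata_command(manage_py: Path, out_file: Path) -> List[str]:
    """Build the command that dumps the whole database to one JSON file."""
    return [
        sys.executable, str(manage_py), "dumpdata",
        "--indent", "2",
        "--output", str(out_file),
    ]


def rdiff_command(source: Path, destination: Path) -> List[str]:
    """Build the incremental-backup command (classic rdiff-backup syntax)."""
    return ["rdiff-backup", str(source), str(destination)]


def nightly_backup(project: Path, backup_root: Path) -> None:
    """Dump the DB into the project, then back up the project incrementally."""
    dump_dir = project / "dumps"
    dump_dir.mkdir(exist_ok=True)
    subprocess.run(
        dumpdata_command(project / "manage.py", dump_dir / "db.json"), check=True
    )
    subprocess.run(rdiff_command(project, backup_root), check=True)


if __name__ == "__main__":
    # Hypothetical paths; cron would call this script once per night.
    project_dir = Path("/srv/mysite")
    if project_dir.is_dir():
        nightly_backup(project_dir, Path("/var/backups/mysite"))
```

A fixed dump file name is used on purpose, so rdiff-backup stores only the differences between consecutive nights.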
Your Ideas?
What do you use to back up your data? Please post your backup solution, whether you only have a few hits per day on your site or you maintain a high-traffic one with sharded databases and multiple fileservers :)
Thanks for your input.
Recently, I found a solution called Django-Backup, and it has worked for me. You can even schedule the backup of the databases or media files with a cronjob.
Regards,
My backup solution works the following way:
Every night, dump the data to a separate directory. I prefer to keep data dump directory distinct from the project directory (one reason being that project directory changes with every code deployment).
Run a job to upload the data to my Amazon S3 account and another location using rsync.
Send me an email with the log.
To restore a backup locally, I use a script that downloads the data from S3 and loads it into my local environment.
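The upload step of such a workflow could be sketched as follows. The key layout, bucket name and helper names are assumptions rather than the answerer's actual script; `boto3` is a third-party package and is imported lazily, so the key helper works without it.

```python
import datetime as dt
from pathlib import Path


def backup_key(site: str, dump: Path, day: dt.date) -> str:
    """Build a dated S3 object key such as 'blog/2024/01/31/db.json'."""
    return f"{site}/{day:%Y/%m/%d}/{dump.name}"


def upload_dump(dump: Path, bucket: str, key: str) -> None:
    """Upload one dump file to S3 (requires `pip install boto3`)."""
    import boto3  # imported here so backup_key() has no third-party dependency

    boto3.client("s3").upload_file(str(dump), bucket, key)


if __name__ == "__main__":
    # Hypothetical values; credentials come from the usual AWS config or env vars.
    dump_file = Path("/var/backups/mysite/db.json")
    if dump_file.is_file():
        key = backup_key("mysite", dump_file, dt.date.today())
        upload_dump(dump_file, "my-backup-bucket", key)
        print(f"Uploaded {dump_file} to s3://my-backup-bucket/{key}")
```

Dated keys keep every night's dump side by side in the bucket, which makes it easy to pick a specific day when restoring.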