Wise to switch Media Library storage from File to Database? - sitecore

I have my Media Library stored as physical files. When a Sitecore user publishes an item, the files are dispersed to a number of CD servers using WebDeploy.
I would like to switch to Database storage due to some performance issues with WebDeploy, but I'm concerned that it may be too late. I have hundreds of physical Media Library files already attached to items in Sitecore.
How will Sitecore react to switching storage after the fact? Can it handle the two modes simultaneously, or must I migrate all my files into the DB?

I would make the switch. Keeping the media in the database causes fewer problems and leaves fewer things to keep track of when running in a multi-server environment.
See more pros and cons here
You can very easily convert all existing media items to database media.
I have used this tool for the migration:
https://marketplace.sitecore.net/en/Modules/Media_Conversion_Tool.aspx

Related

Retrieving data from AWS S3 too slow in Shiny app

I know that this question can be mostly answered generally for any Web App, but because I am specifically using Shiny I figured that your answers may be considerably more useful.
I have made a relatively complex app. The data is not complex, but the user interface is.
I am storing the data in S3 using the aws.s3 package, and have built my app using golem. Because most shiny apps are used to analyse or enter some data, they usually deal with a couple of datasets, and a relational database is very useful and fast for that type of app.
However, my app is quite UI/UX extensive. Users can have their own/shared whiteboard space(s) where they drag around items. The coordinates of the items are stored in rds files in my S3 bucket, for each user. They can customise many aspects of the app just for them, font size, colours of various experimental groups (it's a research app), experimental visits that are storing pdf files, .html files and .rds files.
The .rds files stored can contain variables, lists, data.frames, reactiveValues, renderUI() objects etc., so they differ widely.
As such I have dozens of rds files stored in a bucket, and every time the app loads, each of these .rds files needs to be read one by one in order to recreate the environment appropriate for each user. The number of files/folders in directories is queried to know how many divs need to be generated for the user to click inside their files etc.
The range of objects stored is too wide for me to use a relational database - but my app is taking at least 40 seconds to load. It is also generally slow when submitting data, mostly because the data entered often modifies many UI elements that need to be pushed to S3 again. Because I have no background in proper Web Dev, I have no idea what the best way is to store user-related UX/UI elements and how to retrieve them seamlessly.
Could anyone please recommend appropriate resources for me to learn more about it?
Am I doing it completely wrong? I honestly do not know how else to store and retrieve all these R objects.
Thank you in advance for your help with the above.

What are the downsides of using filesystem for storing uploaded files in django?

I know about s3 storage but I am trying to see if things can work out by only using filesystem
The main reason to use a service like S3 is scalability. Imagine that you use the file system of a single server to store files. Then everyone who visits your site and wants to access a file has to hit the same server. If there are enough visitors, this will eventually render the system unresponsive.
Scalable storage services will store the same data on multiple servers to allow serving the content when the number of requests increases. Furthermore one normally hits a server that is close to the location of that user which minimizes the delay to fetch a file.
Finally such storage services are more reliable. If you use a single disk to store all the files, it is possible that eventually the disk fails losing all the data. By storing the data on multiple locations, it is less likely that the files are completely lost.

Is there an implementation of a single instance blob store for Django?

I am new to Django so I apologize if I missed something. I would like to have a library that gives me a single-instance data store for Blob / Binary data. I want a library that masks whether or not the files are stored in the database, file system or some kind of back end like S3 on Amazon. I want a single API that lets me add files, and get back URLs to serve those files. Also it would be nice if the implementation supported some kind of migration if I had blobs in a database for a site when it just started out and then move those blobs to an S3 bucket behind the scenes without me needing to change how my application stores and serves the data.
An important sub-aspect of this is that the files have to be only shown to properly authorized users (i.e. just putting them in an open /media/ folder as files is not sufficient).
Perhaps I am asking too much - but I find this kind of service very useful in my applications. The main reason that I am asking is that unless I find such a thing - I will wander off and build my own library - I just don't want to waste the time if this kind of thing already exists.

Caching situation for images stored in a database

The Django recommendation for dealing with user uploads is to store them on the filesystem and store the filesystem path in a database column. This works, but presents some problems I do not want to deal with:
No transactions
No simple way to keep the filesystem and database in sync
Complicates backups since data is stored in 2 places
My solution is to store the image as a base64 encoded string in a text column (https://djangosnippets.org/snippets/1669/). This requires more space, but makes replication dead simple.
The concern with this approach is performance. Hitting the database for every image request is not desirable. I need some kind of server-side caching system together with reasonable caching headers. For example, if someone requests "/media/documents/earth.jpg", the cache should be consulted first and if the file is not found there the database should be hit.
Questions:
What is a good cache tool for my purpose?
Given these requirements is it required that every image request goes through my Django application? Or is there a caching tool that I can use to prevent this? I have certain files that can be accessed only by certain people. For these I assume the request must go through the application since there would be no other way to check for authorization.
If this tool caches the files to the filesystem, then are hashed directories enough to mitigate the problem of having too many files in one directory? For example, a hashed directory path for elephant.gif could be /e/el/elephant.gif.
tl;dr: stop worrying and deliver, "premature optimization is the root of all evil"
The Django recommendation for dealing with user uploads is to store them on the filesystem and store the filesystem path in a database column.
The recommendation for using the file system is that you can have the images served directly by the web server instead of served by the application - web servers are very, very good at serving static files.
My solution is to store the image as a base64 encoded string in a text column (https://djangosnippets.org/snippets/1669/). This requires more space, but makes replication dead simple.
In general, replication is seldom used for static content. For a high traffic website, you have a dedicated server for static content - Django makes this very easy, that is what MEDIA_URL and STATIC_URL are for. Even if you are starting with the media served by the same web server, it is good practice to have it done by a separate virtual host (for example, have the app at http://www.example.com and the media at http://static.example.com even if serving both from the same machine).
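As a rough illustration of that split, a settings fragment might look like the following. The hostnames are the ones from the example above; the MEDIA_ROOT directory is purely an assumption for the sketch.

```python
# settings.py (fragment) - a sketch, not a complete configuration.
# The hostnames follow the example above; the directory is a hypothetical path.

# Directory on disk where uploaded files are written.
MEDIA_ROOT = "/var/www/example/media/"

# Public URL prefix for uploaded files, served by the static virtual host.
MEDIA_URL = "http://static.example.com/media/"

# Public URL prefix for the CSS/JS/images shipped with the application.
STATIC_URL = "http://static.example.com/static/"
```

With the URLs pointing at a separate host name from day one, moving the media to another machine or a CDN later only means changing these values.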
Web servers are so good at serving static content that you will hardly need more than one. In practice you rarely hit the point where a dedicated server is not handling the load anymore, because by that time you will be using a CDN to cut your bandwidth bill, and the CDN will take most of the heat off the server.
If you choose to follow the "store on the file system" recommendation, don't worry about this until deployment; when the time arrives, have a deployment expert at your side.
The concern with this approach is performance.
The performance hit you take when storing static content in the database is serving the image: it is somewhat negligible for small files - but for a large file, one app instance (or thread) will be stuck until the download finishes. Don't worry unless your images take too long to download.
Hitting the database for every image request is not desirable.
Honestly, why is that? Databases are designed to take hits. When you choose to store images in the database, performance is in the hands of the DBA now; as a developer you should stop thinking about it. When (and if) you hit any performance bottleneck related to database issues, consult a professional DBA, who will fix it.
1 - What is a good cache tool for my purpose?
Short story: this is static content, do the cache at the network layer (CDN, reverse caching proxy, etc). It is a problem for a professional network engineer, not for the developer.
There are many popular cache backends for Django, IMHO they are overkill for static content.
2 - Given these requirements is it required that every image request goes through my Django application? Or is there a caching tool that I can use to prevent this? I have certain files that can be accessed only by certain people. For these I assume the request must go through the application since there would be no other way to check for authorization.
Use a URL scheme that is unique and hard to guess, for example, with a path component made from a SHA2 hash of the file contents plus some secret token. Restrict service to requests referred by your site to avoid someone re-publishing the file URL. Use expiration headers if appropriate.
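A minimal sketch of such a path component, assuming HMAC-SHA256 as one way of combining the content hash with the secret. The function name, the `/media/` prefix and the token value are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in a real project it would live outside version control.
SECRET_TOKEN = b"replace-with-a-long-random-secret"


def unguessable_path(file_bytes: bytes, filename: str) -> str:
    """Build a media path whose first component is an HMAC-SHA256 of the contents."""
    # The digest depends on both the file contents and the secret, so it
    # cannot be derived by someone who only has a copy of the file.
    digest = hmac.new(SECRET_TOKEN, file_bytes, hashlib.sha256).hexdigest()
    return f"/media/{digest}/{filename}"
```

The same file always maps to the same URL, which keeps the path cache-friendly. Note that anyone who has been given the URL can still pass it on, which is why the referrer restriction and expiration are mentioned alongside it.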
3 - If this tool caches the files to the filesystem, then are hashed directories enough to mitigate the problem of having too many files in one directory? For example, a hashed directory path for elephant.gif could be /e/el/elephant.gif.
Again, ask yourself why you are concerned. The cache layer should be transparent to the developer. I'm not aware of any popular cache solution for Django that doesn't have such a basic concern very well covered.
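For reference, the layout described in the question can be sketched in a few lines. The function name is hypothetical, and this is not taken from any particular cache tool.

```python
import posixpath


def hashed_path(filename: str, root: str = "/") -> str:
    """Nest a file under directories named after its first one and two characters."""
    name = filename.lower()
    return posixpath.join(root, name[:1], name[:2], filename)


print(hashed_path("elephant.gif"))  # /e/el/elephant.gif
```

Prefixes taken from the name itself tend to cluster, since many names start with the same letters. A common variation takes the prefixes from a hash of the name instead.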
[update]
Very good points. I understand that replication is seldom used for static content. That's not the point though. How often other people use replication for files has no effect on the fact that not replicating/backing up your database is wrong. Other people may be fine with losing ACID just because some bit of data is binary; I'm not. As far as I'm concerned these files are "of the database" because there are database columns whose values reference the files. If backing up hard drives is something seldom done, does that mean I shouldn't back up my hard drive? NO!
Your concern is valid, I was just trying to explain why Django developers have a bias for this arrangement (dedicated webserver for static content), Django started at the news publishing industry where this approach works well because of its ratio of one trusted publisher for thousands of readers.
It is important to note that the recommended approach (IMHO) is not in ACID violation. Ok, Django does not erase older images stored in the filesystem when the record changes or is deleted - but PostgreSQL doesn't really erase tuples from disk immediately when you delete records either, they are just marked to be vacuumed later. It is a pity that Django lacks a built-in "vacuum" for images, but it is very hard to write a general one, so I side with the core team - data safety comes first. Look for example at database migrations: it took so long to have them incorporated in Django because it is a hard problem as well. While writing a generic solution is hard, writing specific ones is trivial - for some projects I have a "garbage collector" process that I run from crontab in the low traffic hours. This script simply deletes all files that are not referenced by metadata in the database - and this dirty cron job is enough consistency for me.
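A minimal sketch of what such a garbage collector could look like, using only the standard library. The function names are hypothetical. The `referenced` set is assumed to come from the database, for instance from a query such as `set(Document.objects.values_list("file", flat=True))`, where `Document` and `file` are made-up names.

```python
import os


def find_orphans(media_root: str, referenced: set) -> list:
    """Return files under media_root whose relative path is not in `referenced`."""
    orphans = []
    for dirpath, _dirnames, filenames in os.walk(media_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Compare using forward-slash relative paths, as stored in the database.
            rel = os.path.relpath(full, media_root).replace(os.sep, "/")
            if rel not in referenced:
                orphans.append(full)
    return sorted(orphans)


def collect_garbage(media_root: str, referenced: set, dry_run: bool = True) -> list:
    """Delete unreferenced files; with dry_run=True only report them."""
    orphans = find_orphans(media_root, referenced)
    if not dry_run:
        for path in orphans:
            os.remove(path)
    return orphans
```

Defaulting to a dry run is a deliberate choice in this sketch: a wrong `referenced` set would otherwise delete live files.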
If you choose to store images at the database that is all fine. There are trade-offs, but rest assured you don't have to worry about them as a developer, it is a problem for the "ops" part of DevOps.

Sitecore media items and race conditions

How does Sitecore deal with race conditions when publishing media items?
Scenario:
A non versioned media item with a 500mb mpg file (stored as blob) is being downloaded by a site visitor.
The download will take at best a few minutes, at worst could be measured in hours (if they're on a low bandwidth connection).
While the user is downloading, an author uploads a new version of the mpg on the media item and publishes.
What happens, and why?
Other variations include:
The security settings on the media item change to block access from the visitor downloading
The media item is deleted and the change published
I'm guessing that in all these cases the download is aborted, but if so, what response does the server send?
I don't have an exact answer, but Sitecore caches blob assets on the file system under /App_Data/MediaCache/ so perhaps the existing asset is still in that cache. I'm not sure how Sitecore's media caching mechanism works but I bet it purges/re-caches on the next request to the new assets once the asset is completely there.
Just a guess. Maybe decompile the kernel to find the code that handles caching media.
(Not really an answer.. just comment was too big for the box :P)
This is a very interesting question.. Sitecore's media performance relies a lot on caching a copy to disk and then delivering it from there on subsequent requests (this also covers scaled copies of originals such as thumbnails etc). The file is flushed once the original item is edited in some way and then re-published.
I am uncertain (and intrigued) how this would affect a large file. I think a lot of people assume media is probably smaller files such as images or pdfs etc that a user would just re-request if broken, so I don't know how this affects a file currently being streamed when the item itself is updated. I'm sure a lot of the work at that point is IIS/ASP.NET streaming rather than Sitecore itself.
I'm not sure if Sitecore's cache would protect / shield against that but this should be pretty simple enough to test with a larger media file. Interested in the results (as larger files I've delivered personally have been done by CDN or a dedicated streaming partner)
This is not a definitive answer, and I agree w/Stephen about a dedicated streaming partner. I wonder how such systems handle this.
It seems that Sitecore creates a new media cache file for each published and accessed revision, so the HTTP transmit can continue reading the old file while the system writes the new file. Not sure if/how that works if you disable caching (I didn't try disabling caching). Otherwise, trying to write while reading could be blocked or interfere with read.
Note that you get a new revision ID even if you don't version. And it might be the publication that causes a new cache entry, not the occurrence of a new revision.
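To illustrate why a per-revision cache file would let an in-flight download finish, here is a small sketch. It is not Sitecore code: the file naming, function names and IDs are all hypothetical, and it only demonstrates the general mechanism guessed at above.

```python
import os
import tempfile


def cache_path(cache_dir: str, item_id: str, revision: str) -> str:
    """One cache file per (item, revision) pair, so a new revision never overwrites an old one."""
    return os.path.join(cache_dir, f"{item_id}_{revision}.bin")


def write_revision(cache_dir: str, item_id: str, revision: str, data: bytes) -> str:
    """Write the blob for one revision to its own cache file and return the path."""
    path = cache_path(cache_dir, item_id, revision)
    with open(path, "wb") as f:
        f.write(data)
    return path


cache_dir = tempfile.mkdtemp()
old_file = write_revision(cache_dir, "video", "rev1", b"old bytes")

reader = open(old_file, "rb")   # a visitor starts downloading revision 1
first = reader.read(3)

# An author publishes revision 2 while the download is still in progress.
write_revision(cache_dir, "video", "rev2", b"new bytes")

rest = reader.read()            # the download finishes from the untouched old file
reader.close()
```

Because the two revisions live in different files, the reader never sees a mix of old and new bytes. What happens to the old file afterwards (and when it is purged) is a separate question this sketch does not address.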