How does Sitecore deal with race conditions when publishing media items?
Scenario:
A non-versioned media item with a 500 MB MPG file (stored as a blob) is being downloaded by a site visitor. The download will take a few minutes at best, and at worst could be measured in hours (if they're on a low-bandwidth connection). While the user is downloading, an author uploads a new version of the MPG to the media item and publishes.
What happens, and why?
Other variations include:
The security settings on the media item change to block access for the visitor who is downloading
The media item is deleted and the change published
I'm guessing that in all these cases the download is aborted, but if so, what response does the server send?
I don't have an exact answer, but Sitecore caches blob assets on the file system under /App_Data/MediaCache/, so perhaps the existing asset is still served from that cache. I'm not sure exactly how Sitecore's media caching mechanism works, but I'd bet it purges/re-caches on the next request to the new asset once that asset is completely there.
Just a guess. Maybe decompile the kernel to find the code that handles caching media.
(Not really an answer; the comment was just too big for the box :P)
This is a very interesting question. A lot of Sitecore's media performance comes from caching a copy to disk and then delivering it from there on subsequent requests (this also covers cached scaled copies of originals, such as thumbnails). The file is flushed once the original item is edited in some way and then re-published.
I am uncertain (and intrigued) how this would affect a large file. I think a lot of people assume media means smaller files such as images or PDFs, which a user would simply re-request if the download broke, so it's unclear how this affects a file that is being streamed at the moment the item is updated. I'm sure a lot of the work at that point is IIS/ASP.NET streaming rather than Sitecore itself.
I'm not sure if Sitecore's cache would protect or shield against that, but it should be simple enough to test with a larger media file. I'd be interested in the results (the larger files I've delivered personally have been handled by a CDN or a dedicated streaming partner).
This is not a definitive answer, and I agree with Stephen about a dedicated streaming partner. I wonder how such systems handle this.
It seems that Sitecore creates a new media cache file for each published and accessed revision, so the HTTP transfer can continue reading the old file while the system writes the new one. I'm not sure if or how that works if you disable caching (I didn't try it). Otherwise, trying to write while reading could be blocked or could interfere with the read.
Note that you get a new revision ID even if you don't version, and it might be the publication that causes a new cache entry rather than the appearance of a new revision.
Related
Since S3 is eventually consistent, how long does it take for the data to become consistent? If my user uploads some media and I have to show the same content on the website, can I expect a scenario where some users can see the content while other users cannot for some time?
It's eventually consistent, but in my experience that usually means under a few seconds. So yes, it's possible, but only for a very, very small window of time.
Yes, it is very quick - quick enough that any limitation on update speed would not be caused by S3. I host a static website in S3, and I can upload/update a file and see the new contents in less than 1 second (i.e. refreshing the page as fast as I can).
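If you want to measure this yourself, a small script along these lines times how long a freshly written object takes to become visible on a read. This is only a sketch using boto3; the bucket and key names are placeholders.

```python
import time
import boto3

s3 = boto3.client("s3")
bucket, key = "my-test-bucket", "consistency-check.txt"  # placeholders

# Write a payload that is unique to this run.
payload = str(time.time()).encode("utf-8")
s3.put_object(Bucket=bucket, Key=key, Body=payload)

# Poll until a fresh GET returns the bytes we just wrote.
start = time.time()
while True:
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    if body == payload:
        break
    time.sleep(0.1)

print(f"New content visible after {time.time() - start:.2f} seconds")
```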
I'm setting up output caching for Sitecore using the guide provided on SDN.
I've ticked the Cacheable, Clear on Index Update, and Vary By Data options.
However, I noticed that on every hard refresh the full image gets requested, i.e. 4 MB.
Is this expected behaviour?
The output cache stores generated HTML instead of executing the process of rendering your component.
It has nothing to do with sending images and caching them in the browser cache.
Read the "How the Sitecore ASP.NET CMS Caches Output" blog post by JW for more details, and see the links in the comments.
I am going to assume that by saying "hard refresh" you mean bypassing your browser cache.
What you are seeing is not specific to Sitecore, or any server-side technology. It is your browser that uses caching to keep local copies of images and other "static" resources that it has loaded in the past. This cache is used to speed up page loads and reduce network traffic.
When you perform a hard refresh, the browser will ignore its cache and load all resources from the server. This is why there's a request for your image after a hard refresh.
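To make the difference concrete, here is a rough sketch (using the Python requests library and a made-up URL) comparing a normal refresh, which revalidates with If-None-Match and typically gets a 304 with no body, against a hard refresh, which sends Cache-Control: no-cache and pulls the full image down again - assuming the server emits an ETag:

```python
import requests

# Hypothetical image URL; the point is only to compare the two request styles.
url = "https://example.com/-/media/hero-image.jpg"

# First request: the server returns the full image plus validators (ETag/Last-Modified).
first = requests.get(url)
etag = first.headers.get("ETag")

# A normal refresh revalidates; if the ETag still matches, the server answers 304 with no body.
revalidate = requests.get(url, headers={"If-None-Match": etag} if etag else {})

# A hard refresh sends Cache-Control: no-cache, so the full 4 MB body comes down the wire again.
hard_refresh = requests.get(url, headers={"Cache-Control": "no-cache"})

print(first.status_code, revalidate.status_code, hard_refresh.status_code)
```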
This is not normal behaviour.
Sitecore stores the media cache on the file system, unlike all the other caches, which are stored in RAM. Media items are stored in the database, so the media cache is needed to reduce database calls and serve media files faster to the end user. Let's look at how Sitecore's media cache mechanism works.
Please check the following link for details: http://sitecoreblog.patelyogesh.in/2014/04/how-sitecore-media-cache-is-works.html
The Django recommendation for dealing with user uploads is to store them on the filesystem and store the filesystem path in a database column. This works, but presents some problems I do not want to deal with:
No transactions
No simple way to keep the filesystem and database in sync
Complicates backups since data is stored in 2 places
My solution is to store the image as a base64 encoded string in a text column (https://djangosnippets.org/snippets/1669/). This requires more space, but makes replication dead simple.
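For illustration, here is a minimal sketch of that idea as a Django model (the model and field names are made up here, not taken from the linked snippet):

```python
import base64

from django.db import models


class StoredImage(models.Model):
    """Minimal sketch: the file lives in a text column as base64."""

    name = models.CharField(max_length=255, unique=True)
    content_type = models.CharField(max_length=100)
    data_b64 = models.TextField()  # base64-encoded file contents

    def set_file(self, raw_bytes: bytes) -> None:
        self.data_b64 = base64.b64encode(raw_bytes).decode("ascii")

    def get_file(self) -> bytes:
        return base64.b64decode(self.data_b64)
```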
The concern with this approach is performance. Hitting the database for every image request is not desirable. I need some kind of server-side caching system together with reasonable caching headers. For example, if someone requests "/media/documents/earth.jpg", the cache should be consulted first and if the file is not found there the database should be hit.
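A rough sketch of that cache-first serving path, using Django's cache framework with the hypothetical model above (the view name, key scheme, and timeouts are arbitrary):

```python
from django.core.cache import cache
from django.http import Http404, HttpResponse

from .models import StoredImage  # the hypothetical model sketched above


def serve_media(request, name):
    """Consult the cache first; fall back to the database on a miss."""
    cache_key = f"media:{name}"
    cached = cache.get(cache_key)
    if cached is None:
        try:
            image = StoredImage.objects.get(name=name)
        except StoredImage.DoesNotExist:
            raise Http404(name)
        cached = (image.content_type, image.get_file())
        cache.set(cache_key, cached, timeout=60 * 60)  # keep for an hour

    content_type, body = cached
    response = HttpResponse(body, content_type=content_type)
    response["Cache-Control"] = "private, max-age=3600"
    return response
```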
Questions:
What is a good cache tool for my purpose?
Given these requirements, is it required that every image request goes through my Django application? Or is there a caching tool that I can use to prevent this? I have certain files that can be accessed only by certain people. For these I assume the request must go through the application, since there would be no other way to check for authorization.
If this tool caches the files to the filesystem, then are hashed directories enough to mitigate the problem of having too many files in one directory? For example, a hashed directory path for elephant.gif could be /e/el/elephant.gif.
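For what it's worth, I picture the hashed path being derived roughly like this (just a sketch; the hash, prefix depth, and resulting layout are arbitrary, so elephant.gif might land under something like 3a/7f/elephant.gif rather than e/el/elephant.gif):

```python
import hashlib
import os


def hashed_media_path(filename: str, depth: int = 2) -> str:
    """Spread files across subdirectories derived from a hash of the name."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * 2:(i + 1) * 2] for i in range(depth)]
    return os.path.join(*parts, filename)


print(hashed_media_path("elephant.gif"))
```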
tl;dr: stop worrying and deliver, "premature optimization is the root of all evil"
The Django recommendation for dealing with user uploads is to store them on the filesystem and store the filesystem path in a database column.
The recommendation for using the file system is that you can have the images served directly by the web server instead of served by the application - web servers are very, very good at serving static files.
My solution is to store the image as a base64 encoded string in a text column (https://djangosnippets.org/snippets/1669/). This requires more space, but makes replication dead simple.
In general, replication is seldom used for static content. For a high-traffic website, you have a dedicated server for static content - Django makes this very easy; that is what MEDIA_URL and STATIC_URL are for. Even if you start with the media served by the same web server, it is good practice to have it done by a separate virtual host (for example, have the app at http://www.example.com and the media at http://static.example.com, even if both are served from the same machine).
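For example, a sketch of the relevant settings (host names and paths are placeholders):

```python
# settings.py (sketch) - the host names and paths below are placeholders.
# The app lives at www.example.com, while media and static assets are
# served from a separate virtual host, even if it is the same machine.
MEDIA_URL = "https://static.example.com/media/"
STATIC_URL = "https://static.example.com/static/"

MEDIA_ROOT = "/srv/example/media"
STATIC_ROOT = "/srv/example/static"
```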
Web servers are so good at serving static content that you will hardly need more than one. In practice you rarely hit the point where a dedicated server can't handle the load anymore, because by that time you will be using a CDN to cut your bandwidth bill, and the CDN will take most of the heat off the server.
If you choose to follow the "store on the file system" recommendation, don't worry about this until deployment; when the time arrives, have a deployment expert at your side.
The concern with this approach is performance.
The performance hit you take when storing static content in the database is serving the image: it is somewhat negligible for small files - but for a large file, one app instance (or thread) will be stuck until the download finishes. Don't worry unless your images take too long to download.
Hitting the database for every image request is not desirable.
Honestly, why is that? Databases are designed to take hits. When you choose to store images in the database, performance is in the hands of the DBA; as a developer you should stop thinking about it. When (and if) you hit any performance bottleneck related to database issues, consult a professional DBA and they will fix it.
1 - What is a good cache tool for my purpose?
Short story: this is static content, so do the caching at the network layer (CDN, reverse caching proxy, etc.). It is a problem for a professional network engineer, not for the developer.
There are many popular cache backends for Django; IMHO they are overkill for static content.
2 - Given these requirements is it required that every image request goes through my Django application? Or is there a caching tool that I can use to prevent this. I have certain files that can be accessed only by certain people. For these I assume the request must go through the application since there would be no other way to check for authorizaton.
Use a URL scheme that is unique and hard to guess, for example with a path component made from a SHA-2 hash of the file contents plus some secret token. Restrict service to requests referred by your site to avoid someone re-publishing the file URL. Use expiration headers if appropriate.
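A sketch of what such a scheme might look like (the token name and URL layout are invented purely for illustration):

```python
import hashlib

# Secret kept server-side, e.g. in settings; the name here is made up.
SECRET_MEDIA_TOKEN = b"change-me"


def protected_media_url(file_bytes: bytes, filename: str) -> str:
    """Build a hard-to-guess path from the file contents plus a secret token."""
    digest = hashlib.sha256(SECRET_MEDIA_TOKEN + file_bytes).hexdigest()
    return f"/media/{digest}/{filename}"


print(protected_media_url(b"...jpeg bytes...", "earth.jpg"))
```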
3 - If this tool caches the files to the filesystem, then are hashed directories enough to mitigate the problem of having too many files in one directory? For example, a hashed directory path for elephant.gif could be /e/el/elephant.gif.
Again, ask yourself why you are concerned. The cache layer should be transparent to the developer. I'm not aware of any popular cache solution for Django that doesn't have such a basic concern very well covered.
[update]
Very good points. I understand that replication is seldom used for static content. That's not the point though. How often other people use replication for files has no effect on the fact that not replicating/backing up your database is wrong. Other people may be fine with losing ACID just because some bit of data is binary; I'm not. As far as I'm concerned these files are "of the database" because there are database columns whose values reference the files. If backing up hard drives is something seldom done, does that mean I shouldn't back up my hard drive? NO!
Your concern is valid; I was just trying to explain why Django developers have a bias towards this arrangement (a dedicated web server for static content). Django started in the news publishing industry, where this approach works well because of its ratio of one trusted publisher to thousands of readers.
It is important to note that the recommended approach (IMHO) is not an ACID violation. Ok, Django does not erase older images stored in the filesystem when the record changes or is deleted - but PostgreSQL doesn't really erase tuples from disk immediately when you delete records either; they are just marked to be vacuumed later. It's a pity that Django lacks a built-in "vacuum" for images, but it is very hard to write a general one, so I side with the core team - data safety comes first. Look at database migrations, for example: it took so long for migrations to be incorporated into Django because that is a hard problem as well. While writing a generic solution is hard, writing a specific one is trivial - for some projects I have a "garbage collector" process that I run from crontab in the low-traffic hours; this script simply deletes all files that are not referenced by metadata in the database, and this dirty cron job is enough consistency for me.
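For what it's worth, such a cron script can be as small as the sketch below (the project, app, model, and field names are all hypothetical; adapt and test it before pointing it at real data):

```python
import os

import django

# Standalone sketch of the "garbage collector" idea: delete files under
# MEDIA_ROOT that no database row references.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
django.setup()

from django.conf import settings
from myapp.models import Document  # hypothetical model with a FileField "upload"

# Relative paths (as stored by FileField) that are still referenced.
referenced = set(
    Document.objects.exclude(upload="").values_list("upload", flat=True)
)

for root, _dirs, files in os.walk(settings.MEDIA_ROOT):
    for name in files:
        full_path = os.path.join(root, name)
        relative = os.path.relpath(full_path, settings.MEDIA_ROOT)
        if relative not in referenced:
            os.remove(full_path)
```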
If you choose to store images in the database, that is all fine. There are trade-offs, but rest assured you don't have to worry about them as a developer; it is a problem for the "ops" part of DevOps.
My Django backend is always dynamic. It serves an iOS app similar to that of Instagram and Vine where users upload photos/videos and their followers can comment and like the content. Just for the sake of this question, imagine my backend serves an iOS app that is exactly like Instagram.
Many sources claim that using memcached can improve performance because it decreases the amount of hits that are made to the database.
My question is: for a backend that is already dynamic in nature (always changing, since users are uploading new pictures, commenting, liking, following new users, etc.), what can I possibly cache?
It's a problem I've been thinking about for quite some time. I could cache the user profile data, but other than that, I don't know where else memcached would be useful.
Other sources mentioned using it everywhere in the backend where a 'GET' call is made, but then I would need to set a suitable time limit to expire the cache, since the app is always dynamic. What are your solutions and suggestions for getting around this problem?
You would cache whatever is being most frequently accessed from your Database. Make a list of the most frequent requests to get data from the database and cache the data in that priority.
Cache the most frequent requests based on category of the pictures
Cache based on users - power users (those who do a lot of data access) go into the cache
Cache the most recent inserts (in case you have a page which shows the recently added posts/pictures)
I am sure you can come up with more scenarios. I am positive memcached (or any other caching) will help, even though your app is very 'dynamic'.
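As a concrete illustration of the "cache the most recent inserts" idea, here is a minimal sketch using Django's cache framework (which can be backed by memcached); the model and field names are invented:

```python
from django.core.cache import cache

from .models import Post  # hypothetical model for uploaded photos/videos


def recent_posts(limit=20):
    """Serve the 'recent posts' feed from the cache when possible."""
    key = f"recent_posts:{limit}"
    posts = cache.get(key)
    if posts is None:
        posts = list(Post.objects.order_by("-created_at")[:limit])
        # Short timeout: the feed is dynamic, so stale data only lives briefly.
        cache.set(key, posts, timeout=30)
    return posts
```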
I have my Media Library stored as physical files. When a Sitecore user publishes an item, the files are dispersed to a number of CD servers using WebDeploy.
I would like to switch to Database storage due to some performance issues with WebDeploy, but I'm concerned that it may be too late. I have hundreds of physical Media Library files already attached to items in Sitecore.
How will Sitecore react to switching storage after the fact? Can it handle the two modes simultaneously, or must I migrate all my files into the DB?
I would make the switch; it causes fewer problems having the media in the database, and there are fewer things to keep track of when running in a multi-server environment.
See more pros and cons here
You can very easily convert all existing media items to database media.
I have used this tool to make the migration:
https://marketplace.sitecore.net/en/Modules/Media_Conversion_Tool.aspx