Can I expect a delay if I am showing S3 content in a website - amazon-web-services

Since S3 is eventually consistent, how long does it take for the data to become consistent? If my user uploads some media and I have to show the same content on the website, can I expect a scenario where some users can see the content while other users cannot see it for some time?

It's eventually consistent, but in my experience that usually means under a few seconds - so yes, it's possible, but only for a very, very small amount of time.

Yes, it is very quick - quick enough that any limitation on update speed would not be caused by S3. I host a static website in S3, and I can upload/update a file and see the contents in less than 1 second (i.e. refreshing the page as fast as I can).

Related

Cloud File Storage with Bandwidth Limits

I want to develop an app for a friend's small business that will store/serve media files. However, I'm afraid of a piece of media going viral, or of getting DDoS'd. The bill could go up quite easily with a service like S3, and I really want to avoid surprise expenses like that. Ideally I'd like some kind of max-bandwidth limit.
Now, a solution for S3 has been posted here, but it requires quite a few steps. So I'm wondering if there is a cloud storage solution that makes this simpler, i.e. where I don't need to create a custom microservice. I've talked to support at Digital Ocean and they don't support this either.
So in the interest of saving time, and perhaps for anyone else who finds themselves in a similar dilemma, I want to ask this question here. I hope that's okay.
Thanks!
Not an out-of-the-box solution, but you could:
Keep the content private
When rendering a web page that contains the file or links to the file, have your back-end generate an Amazon S3 pre-signed URL to grant time-limited access to the object (sketch below)
The back-end could keep track of the "popularity" of the file and, if it exceeds a certain rate (e.g. 1,000 requests over 15 minutes), it could instead point to a small file with a message of "please try later"
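A rough sketch of that approach with boto3 (the bucket/key names, the in-memory counter, and the fallback object are all placeholders - a real back-end would track popularity in Redis or a database):

    import time
    import boto3

    s3 = boto3.client("s3")

    # Naive in-memory popularity counter, only for illustration.
    request_log = {}  # key -> list of request timestamps

    RATE_LIMIT = 1000          # max requests...
    WINDOW_SECONDS = 15 * 60   # ...per 15 minutes

    def media_url(bucket, key):
        """Return a time-limited pre-signed URL, or a fallback object if the file is too popular."""
        now = time.time()
        hits = [t for t in request_log.get(key, []) if now - t < WINDOW_SECONDS]
        hits.append(now)
        request_log[key] = hits

        target_key = key if len(hits) <= RATE_LIMIT else "static/please-try-later.html"
        return s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": bucket, "Key": target_key},
            ExpiresIn=300,  # URL valid for 5 minutes
        )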

How to use CloudFront efficiently for less popular website?

We are building a website which contains a lot of images and data. We have optimized a lot to make the website faster. Then we decided to use AWS CloudFront also to make it faster for all regions around the world. The app works faster after the integration of CloudFront.
But later we found that the data will load to CloudFront cache only when the website asks for it. So we are afraid that the initial load will take the same time as it used to take without the CDN because it loads from S3 to CDN first and then to the user.
Also, we used the default TTL values (i.e., 24 hours). In our case, a user may log in once or twice per week to this website. So in that case the advantage of caching won't really apply either, because the cache expires after 24 hours. Will raising the TTL (Maximum TTL) to a larger value solve the issue? Does it cost more money? I also read that increasing to a longer TTL is not a good idea, as it has some disadvantages for updating the data in S3.
CloudFront will cache the response only after the first user requests it. So it will be slow for the first user, but it will be significantly faster for every subsequent user. So it does make sense to use CloudFront.
Using the default TTL value is okay, since most users will see the same content and the website has a lot of static components as well. Every user except the first will see a fast response from your website. You could even reduce this to 10-12 hours, depending on how often you expect your data to change.
There is no additional cost to increasing your TTL. However, invalidation requests are charged, so if you want to remove a cached object early, there will be a cost added to it. I would therefore keep the TTL roughly as short as the interval at which your data is expected to change, so you don't have to invalidate existing caches when your data changes, while the maximum number of users can still benefit from your CDN.
No additional charge for the first 1,000 paths requested for invalidation each month. Thereafter, $0.005 per path requested for invalidation.
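If you ever do need to purge a cached object before its TTL expires, a minimal sketch with boto3 looks something like this (the distribution ID and path are placeholders):

    import time
    import boto3

    cloudfront = boto3.client("cloudfront")

    # Invalidate a single path; each path counts against the 1,000 free invalidation paths per month.
    cloudfront.create_invalidation(
        DistributionId="E1EXAMPLE123",  # placeholder distribution ID
        InvalidationBatch={
            "Paths": {"Quantity": 1, "Items": ["/images/banner.jpg"]},
            "CallerReference": str(time.time()),  # any unique string to de-duplicate requests
        },
    )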
UPDATE: In the event that you only have 1 user using the website over a long period of time (1 week or so), it might not be of much benefit to use CloudFront at all. CloudFront and all caching services are only effective when there are multiple users requesting the same resources.
However, you might still get a marginal benefit from CloudFront, as requests will be routed from the edge location to S3 over AWS's backbone network, which is much faster than the public internet. Whether this is cost-effective for you depends on how many users are using the website and how slow it currently is.
Aside from CloudFront, you could also try S3 Cross-Region Replication to increase your overall speed. Cross-Region Replication copies objects to a bucket in a different region as and when they are added in the source region. This can help to minimize latency for users in other regions.
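For completeness, a sketch of enabling Cross-Region Replication with boto3, assuming versioning is already enabled on both buckets and an IAM replication role exists (all names are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # Versioning must already be enabled on both buckets, and the role must allow replication.
    s3.put_bucket_replication(
        Bucket="my-source-bucket",  # placeholder bucket name
        ReplicationConfiguration={
            "Role": "arn:aws:iam::123456789012:role/my-replication-role",
            "Rules": [
                {
                    "ID": "replicate-everything",
                    "Prefix": "",            # replicate all objects
                    "Status": "Enabled",
                    "Destination": {"Bucket": "arn:aws:s3:::my-destination-bucket"},
                }
            ],
        },
    )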

S3 cloudfront expiry on images - performance very slow

I recently started serving my website images from the CloudFront CDN instead of S3, thinking that it would be much faster. It is not. In fact it is much, much slower. After a lot of investigation I'm being hinted that setting an expiry date on image objects is the key, as CloudFront will then know how long to keep cached static content. Makes sense. But this is poorly documented by AWS and I can't figure out how to change the expiry date. People have said "you can change this on the AWS console" - please show how dumb I am, because I cannot see it. I've been at it for several hours. Needless to say I'm quite frustrated fumbling around on this. Anyway, any hints, as small as they might be, would be terrific. I like AWS, and what CloudFront promised, but so far it's not what it seems.
EDIT - ADDITIONAL DETAIL:
Added expiry date headers per the answer. In my case I had no headers at all. My hypothesis was that my slow CloudFront performance serving images had to do with having NO expiry in the header. Having set an expiry date as shown in the screenshot, and as described in the answer, I'm seeing no noticeable difference in performance (going from no headers to adding an expiry date only). My site takes on average 7s to load, with 10 core images (each under 60KB). Those 10 images (served via CloudFront) account for 60-80% of the load-time latency, depending on the performance tool used. Obviously something is wrong, given that serving the files from my VPS is faster. I hate to conclude that CloudFront is the problem given that so many people use it, and I'd hate to break off from EC2 and S3, but right now testing MaxCDN is showing better results. I'll keep testing over the next 24 hrs, but my conclusion is that the expiry date header is just a confusing detail with no favorable impact. Hopefully I'm wrong, because I'd like to keep it all in the AWS family. Perhaps I'm barking up the wrong tree on the expiry date thing?
You will need to set it in the meta-data while uploading the file into S3. This article describes how you can achieve this.
The format for the expiry date is the RFC1123 date which is formatted like this:
Expires: Thu, 01 Dec 1994 16:00:00 GMT
Setting this to a far-future date will enable caches like CloudFront to hold the file for a long time and, in this case, speed up delivery, as the individual edge locations (servers all over the world delivering content for CloudFront) already have the data and don't need to fetch it again and again.
Even with a far future expiry header the first request of an object will be slow as the edge-location has to fetch the object once before it can be served from the cache.
Alternatively, you can omit the Expires header and use Cache-Control instead. CloudFront understands that one too, and it gives you more flexibility with the expiry. In that case you can, for example, state that the object should be held for one day from the first request the edge location made for the object:
Cache-Control: public, max-age=86400
In that case the time is given using seconds instead of a fixed date.
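For reference, a sketch of setting those headers when uploading with boto3 (the bucket, key, and file name are placeholders; you typically only need one of Cache-Control or Expires):

    from datetime import datetime, timedelta, timezone
    import boto3

    s3 = boto3.client("s3")

    # Upload an image with caching headers; CloudFront and browsers will honor them.
    with open("logo.png", "rb") as f:
        s3.put_object(
            Bucket="my-static-site-bucket",        # placeholder bucket name
            Key="images/logo.png",
            Body=f,
            ContentType="image/png",
            CacheControl="public, max-age=86400",  # cache for one day
            Expires=datetime.now(timezone.utc) + timedelta(days=365),  # or a far-future Expires date
        )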
While setting Cache-Control or Expires headers will improve the cache-ability of your objects, it won't improve your 60k/sec download speed. This requires some help from AWS. Therefore, I would recommend posting to the AWS CloudFront Forums with some sample response headers, traceroutes and resolver information.

Caching situation for images stored in a database

The Django recommendation for dealing with user uploads is to store them on the filesystem and store the filesystem path in a database column. This works, but presents some problems I do not want to deal with:
No transactions
No simple way to keep the filesystem and database in sync
Complicates backups since data is stored in 2 places
My solution is to store the image as a base64 encoded string in a text column (https://djangosnippets.org/snippets/1669/). This requires more space, but makes replication dead simple.
The concern with this approach is performance. Hitting the database for every image request is not desirable. I need some kind of server-side caching system together with reasonable caching headers. For example, if someone requests "/media/documents/earth.jpg", the cache should be consulted first, and only if the file is not found there should the database be hit.
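Roughly what I have in mind, as a sketch (Document, its image_base64 column, and the cache timeout are just placeholder names of mine):

    import base64

    from django.core.cache import cache
    from django.http import Http404, HttpResponse

    from myapp.models import Document  # hypothetical model with path and image_base64 columns

    def serve_media(request, path):
        """Serve an image from cache if possible, otherwise decode it from the database."""
        data = cache.get(path)
        if data is None:
            try:
                doc = Document.objects.get(path=path)
            except Document.DoesNotExist:
                raise Http404
            data = base64.b64decode(doc.image_base64)
            cache.set(path, data, timeout=60 * 60)  # keep the decoded bytes for an hour

        response = HttpResponse(data, content_type="image/jpeg")
        response["Cache-Control"] = "private, max-age=3600"
        return response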
Questions:
What is a good cache tool for my purpose?
Given these requirements, is it required that every image request goes through my Django application? Or is there a caching tool that I can use to prevent this? I have certain files that can be accessed only by certain people. For these I assume the request must go through the application, since there would be no other way to check for authorization.
If this tool caches the files to the filesystem, then are hashed directories enough to mitigate the problem of having too many files in one directory? For example, a hashed directory path for elephant.gif could be /e/el/elephant.gif.
tl;dr: stop worrying and deliver, "premature optimization is the root of all evil"
The Django recommendation for dealing with user uploads is to store them on the filesystem and store the filesystem path in a database column.
The recommendation for using the file system is that you can have the images served directly by the web server instead of served by the application - web servers are very, very good at serving static files.
My solution is to store the image as a base64 encoded string in a text column (https://djangosnippets.org/snippets/1669/). This requires more space, but makes replication dead simple.
In general, replication is seldom used for static content. For a high traffic website, you have a dedicated server for static content - Django makes this very easy, that is what MEDIA_URL and STATIC_URL are for. Even if you are starting with the media served by the same web server, it is good practice to have it done by a separate virtual host (for example, have the app at http://www.example.com and the media at http://static.example.com even if serving both from the same machine).
Web servers are so good at serving static content that you will hardly ever need more than one. In practice you rarely hit the point where a dedicated server can no longer handle the load, because by that time you will be using a CDN to cut your bandwidth bill, and the CDN will take most of the heat off the server.
If you choose to follow the "store on the file system" recommendation, don't worry about this until deployment; when the time arrives, have a deployment expert at your side.
The concern with this approach is performance.
The performance hit you take when storing static content in the database is serving the image: it is somewhat negligible for small files - but for a large file, one app instance (or thread) will be stuck until the download finishes. Don't worry unless your images take too long to download.
Hitting the database for every image request is not desirable.
Honestly, why is that? Databases are designed to take hits. When you choose to store images in the database, performance is now in the hands of the DBA; as a developer you should stop thinking about it. When (and if) you hit any performance bottleneck related to database issues, consult a professional DBA; they will fix it.
1 - What is a good cache tool for my purpose?
Short story: this is static content, do the cache at the network layer (CDN, reverse caching proxy, etc). It is a problem for a professional network engineer, not for the developer.
There are many popular cache backends for Django, IMHO they are overkill for static content.
2 - Given these requirements, is it required that every image request goes through my Django application? Or is there a caching tool that I can use to prevent this? I have certain files that can be accessed only by certain people. For these I assume the request must go through the application, since there would be no other way to check for authorization.
Use a URL scheme that is unique and hard to guess, for example with a path component made from a SHA-2 hash of the file contents plus some secret token. Restrict service to requests referred by your site to avoid someone re-publishing the file URL. Use expiration headers if appropriate.
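A sketch of what such a path component could look like (the secret token and helper name are just illustrations):

    import hashlib
    import hmac

    SECRET_TOKEN = b"change-me"  # illustrative secret; keep it out of source control in practice

    def protected_media_path(file_bytes, filename):
        """Build a hard-to-guess path from a SHA-256 digest of the contents keyed with a secret."""
        digest = hmac.new(SECRET_TOKEN, file_bytes, hashlib.sha256).hexdigest()
        return f"/media/{digest}/{filename}"

    # Example: protected_media_path(b"...file bytes...", "earth.jpg")
    # -> "/media/3f5a.../earth.jpg"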
3 - If this tool caches the files to the filesystem, then are hashed directories enough to mitigate the problem of having too many files in one directory? For example, a hashed directory path for elephant.gif could be /e/el/elephant.gif.
Again, ask yourself why you are concerned. The cache layer should be transparent to the developer. I'm not aware of any popular cache solution for Django that doesn't have such a basic concern very well covered.
[update]
Very good points. I understand that replication is seldom used for static content. That's not the point though. How often other people use replication for files has no effect on the fact that not replicating/backing up your database is wrong. Other people may be fine with losing ACID just because some bit of data is binary; I'm not. As far as I'm concerned these files are "of the database" because there are database columns whose values reference the files. If backing up hard drives is something seldom done, does that mean I shouldn't back up my hard drive? NO!
Your concern is valid, I was just trying to explain why Django developers have a bias for this arrangement (dedicated webserver for static content), Django started at the news publishing industry where this approach works well because of its ratio of one trusted publisher for thousands of readers.
It is important to note that the recommended approach (IMHO) is not an ACID violation. OK, Django does not erase older images stored in the filesystem when the record changes or is deleted - but PostgreSQL doesn't really erase tuples from disk immediately when you delete records either; they are just marked to be vacuumed later. It's a pity that Django lacks a built-in "vacuum" for images, but it is very hard to write a general one, so I side with the core team - data safety comes first. Look for example at database migrations: it took so long to have them incorporated into Django because that is a hard problem as well. While writing a generic solution is hard, writing specific ones is trivial - for some projects I have a "garbage collector" process that I run from crontab in the low-traffic hours; this script simply deletes all files that are not referenced by metadata in the database - and this dirty cron job is enough consistency for me.
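A rough sketch of that kind of cron script, assuming a Django model named Document with a FileField called file (all names are placeholders):

    import os

    import django

    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")  # placeholder settings module
    django.setup()

    from django.conf import settings
    from myapp.models import Document  # hypothetical model with a FileField named "file"

    def collect_garbage(media_subdir="documents"):
        """Delete files on disk that no Document row references any more."""
        referenced = set(Document.objects.values_list("file", flat=True))
        root = os.path.join(settings.MEDIA_ROOT, media_subdir)
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full_path = os.path.join(dirpath, name)
                relative = os.path.relpath(full_path, settings.MEDIA_ROOT)
                if relative not in referenced:
                    os.remove(full_path)  # unreferenced by the database, safe to drop

    if __name__ == "__main__":
        collect_garbage()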
If you choose to store images at the database that is all fine. There are trade-offs, but rest assured you don't have to worry about them as a developer, it is a problem for the "ops" part of DevOps.

Sitecore media items and race conditions

How does Sitecore deal with race conditions when publishing media items?
Scenario:
A non-versioned media item with a 500 MB mpg file (stored as a blob) is being downloaded by a site visitor.
The download will take, at best, a few minutes; at worst it could be measured in hours (if they're on a low-bandwidth connection).
While the user is downloading, an author uploads a new version of the mpg on the media item and publishes.
What happens, and why?
Other variations include:
The security settings on the media item change to block access from the visitor downloading
The media item is deleted and the change published
I'm guessing that in all these cases the download is aborted, but if so, what response does the server send?
I don't have an exact answer, but Sitecore caches blob assets on the file system under /App_Data/MediaCache/, so perhaps the existing asset is still in that cache. I'm not sure how Sitecore's media caching mechanism works, but I bet it purges/re-caches on the next request for the new asset once it is completely there.
Just a guess. Maybe decompile the kernel to find the code that handles caching media.
(Not really an answer.. just comment was too big for the box :P)
This is a very interesting question. A lot of Sitecore's media performance comes from it caching a copy to disk and then delivering it from there on subsequent requests (also for caching scaled copies of originals, such as thumbnails etc.). The file is flushed once the original item is edited in some way and then re-published.
I am uncertain (and intrigued) how this would affect a large file, as I think a lot of people assume media means smaller files such as images or PDFs that a user would just re-request if broken, and how this affects a file currently being streamed when the item itself is updated. I'm sure a lot of the work at that point is IIS/ASP.NET streaming rather than Sitecore itself.
I'm not sure if Sitecore's cache would protect/shield against that, but this should be simple enough to test with a larger media file. I'm interested in the results (the larger files I've delivered personally have been handled by a CDN or a dedicated streaming partner).
This is not a definitive answer, and I agree with Stephen about a dedicated streaming partner. I wonder how such systems handle this.
It seems that Sitecore creates a new media cache file for each published and accessed revision, so the HTTP transfer can continue reading the old file while the system writes the new one. I'm not sure if/how that works if you disable caching (I didn't try disabling it). Otherwise, trying to write while reading could be blocked or interfere with the read.
Note that you get a new revision ID even if you don't version. And it might be the publication that causes a new cache entry, not the occurrence of a new revision.