Scheduling a future version of AWS S3 files - amazon-web-services

I'd like to queue up a collection of new versions of web site assets, and make them all go live at nearly the same time.
I've got a series of related files and directories that need to go live at a future time, all at once. In other words, a collection of AWS S3 files in a given bucket need to be updated at nearly the same time. Some of these files are large, and they could originate from locations where Internet access is unreliable and slow. That means they need to be staged somewhere, possibly in another bucket.
I want to be able to roll back to previous version(s) of individual files, or a set of files.
Suggestions or ideas? Bash code is preferred.

One option would be to put Amazon CloudFront in front of the Amazon S3 bucket. The CloudFront distribution can be "pointed" to an origin, such as an S3 bucket and path.
So, the update could be done just by changing one configuration in the CloudFront distribution.
If you are sticking with S3 exclusively, the updated files would need to be copied to the appropriate location (either from another bucket or from elsewhere in the same bucket). The time to make this happen would depend upon the size of the objects. You could do a parallel copy to make them copy faster.
Or, if the data is being accessed via a web page, then you could have the new version of the files already in place, then just update the web page that references the files. This means that all the pages (with different names) could be sitting there ready to be used and you just update the home page, which points to the other pages. Think of it as directing people through a different front door.
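For the S3-only route, a rough Bash sketch could look like this. It assumes the AWS CLI is installed and configured, that the new files are staged under a separate prefix in the same bucket, and that bucket versioning is turned on so older versions can be restored. The bucket name, prefixes, and function names are placeholders I made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: stage assets under one prefix, then promote them to the live prefix.
# BUCKET, STAGING_PREFIX and LIVE_PREFIX are hypothetical placeholder values.
BUCKET="my-site-bucket"
STAGING_PREFIX="staging"
LIVE_PREFIX="live"

# Turn on versioning once, so overwritten objects keep their old versions.
enable_versioning() {
  aws s3api put-bucket-versioning \
    --bucket "$BUCKET" \
    --versioning-configuration Status=Enabled
}

# Slow or unreliable uploads go to the staging prefix ahead of time.
stage_assets() {
  local src_dir="$1"
  aws s3 sync "$src_dir" "s3://$BUCKET/$STAGING_PREFIX/"
}

# At go-live, copy the staged objects over the live ones (S3-to-S3 copy).
promote_release() {
  aws s3 sync "s3://$BUCKET/$STAGING_PREFIX/" "s3://$BUCKET/$LIVE_PREFIX/"
}

# Roll one file back by copying an older version on top of the current one.
rollback_object() {
  local key="$1" version_id="$2"
  aws s3api copy-object \
    --bucket "$BUCKET" \
    --key "$key" \
    --copy-source "$BUCKET/$key?versionId=$version_id"
}
```

The promote step could be run from cron or at at the chosen go-live time. Version IDs for a rollback can be looked up with aws s3api list-object-versions --bucket ... --prefix ..., and the copy concurrency can be raised with aws configure set default.s3.max_concurrent_requests if the copy is too slow.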

Related

Best way to s3 sync changed files in a folder only

I have a job that clones a repo then s3 syncs changes files over to an s3 bucket. I'd like to sync only changed files. Since the repo is cloned first, the files always have a new timestamp so s3 sync will always upload them. I thought about using "--size-only", but my understanding is that this can potentially miss files that have legitimately changed. What's the best way to go about this?
There is no out-of-the-box way to sync only changed files when the mtime cannot be counted on. As you point out, with the "--size-only" flag aws s3 sync will skip any file whose content changed but whose size did not. To my mind there are two basic paths; the one you use will depend on your exact needs.
Take advantage of Git
First off, you could use the fact that you have the files stored in git to help update the modified time. Git itself will not store this metadata; the maintainers have a philosophy that doing so is a bad idea. I won't argue for or against this, but there are two basic ways around it:
You could store this metadata in git. There are multiple approaches to doing this; one is metastore, a tool installed alongside git that stores the metadata and applies it later. This does require every user of your git repo to install the tool, which may or may not be acceptable.
Another option is to attempt to recreate the mtime from metadata that's already in git. For instance, git-restore-mtime does this by using the timestamp of the most recent commit that modified the file. This would require running an external tool before running the sync command, but it shouldn't require any other workflow changes.
Using either of these options would allow a basic aws s3 sync command to work, since the timestamps would be consistent from one run to another.
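To give an idea of the second option, here is a minimal sketch that does roughly what git-restore-mtime does, using only plain git commands. It is not that tool's implementation. It assumes GNU touch (for the "@epoch" date form) and a full clone, since a shallow clone does not have the history needed to find each file's last commit. The function name is made up.

```shell
#!/usr/bin/env bash
# Sketch: set each tracked file's mtime to the time of the last commit that
# touched it. Assumes GNU touch and a non-shallow clone.
restore_mtimes() {
  local repo_dir="$1" file ts
  git -C "$repo_dir" ls-files -z | while IFS= read -r -d '' file; do
    # %ct is the committer date as a Unix timestamp.
    ts=$(git -C "$repo_dir" log -1 --format=%ct -- "$file")
    if [ -n "$ts" ]; then
      touch -d "@$ts" "$repo_dir/$file"
    fi
  done
}
```

You would run restore_mtimes on the fresh clone and then the usual aws s3 sync. It runs one git log per file, so on a large repo the dedicated tool is likely to be much faster.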
Do your own thing
Fundamentally, you want to upload files that have changed. aws s3 sync uses file size and modification timestamps to detect changes, but you could write a script or program that enumerates all the files you want to upload and uploads them along with a small bit of extra metadata, such as a sha256 hash. Then on future runs, you can enumerate the files in S3 using list-objects and call head-object on each object in turn to read the metadata and see whether the hash has changed.
Alternatively, you could use the "etag" of each object in S3, as that is returned in the list-objects call. As I understand it, the etag formula isn't documented and is subject to change. That said, it is known, and you can find implementations of it here on Stack Overflow and elsewhere. You could calculate the etag for your local files, then see if the remote files differ and need to be updated. That would save you from having to call head-object on each object as you check for changes.
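A minimal sketch of the hash-in-metadata idea, assuming the AWS CLI and sha256sum are available. The metadata key name "sha256" and the function names are my own choices, not anything S3 prescribes.

```shell
#!/usr/bin/env bash
# Sketch: upload a file only when its SHA-256 differs from the hash stored in
# the object's user metadata. The metadata key "sha256" is an assumption.
local_sha256() {
  sha256sum "$1" | cut -d' ' -f1
}

remote_sha256() {
  local bucket="$1" key="$2"
  # Fails (and prints nothing useful) if the object does not exist yet.
  aws s3api head-object --bucket "$bucket" --key "$key" \
    --query 'Metadata.sha256' --output text 2>/dev/null
}

upload_if_changed() {
  local file="$1" bucket="$2" key="$3" local_hash remote_hash
  local_hash=$(local_sha256 "$file")
  remote_hash=$(remote_sha256 "$bucket" "$key")
  if [ "$local_hash" = "$remote_hash" ]; then
    echo "unchanged: $key"
  else
    # Store the hash alongside the object for the next run.
    aws s3 cp "$file" "s3://$bucket/$key" --metadata "sha256=$local_hash"
  fi
}
```

Called in a loop over the files in the clone, this makes one head-object request per file, which is the cost mentioned above.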
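As a sketch of the etag comparison, the following covers only the simple case: an object uploaded in a single part without SSE-KMS, where the etag is the hex MD5 of the content. Multipart etags (the ones ending in "-N") need a different calculation that is not attempted here. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compare a local file's MD5 with an S3 ETag.
# Assumption: single-part upload without SSE-KMS, so ETag == hex MD5.
local_etag() {
  md5sum "$1" | cut -d' ' -f1
}

# The API returns ETags wrapped in double quotes; strip them.
normalize_etag() {
  local etag="$1"
  etag="${etag#\"}"
  etag="${etag%\"}"
  printf '%s\n' "$etag"
}

# Succeeds (exit 0) when the local file differs from the remote ETag.
needs_upload() {
  local file="$1" remote_etag
  remote_etag=$(normalize_etag "$2")
  [ "$(local_etag "$file")" != "$remote_etag" ]
}
```

The remote values can be fetched in one call with something like aws s3api list-objects-v2 --bucket ... --prefix ... --query 'Contents[].[Key,ETag]' --output text. Keep in mind that the CLI switches to multipart uploads above its configured threshold (8 MB by default), so larger files would not match this simple MD5 check.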

Hiding file source in an S3 bucket

I'm running an S3 bucket with a Cloudfront distribution. Everything works except the ability to read the source code is still there.
So the bucket is at mybucket.domain.com and that works okay. However, navigating to mybucket.domain.com/script.js or mybucket.domain.com/style.css will reveal the contents of each file.
I have searched far and wide for a solution but seem to be coming up blank every time. I've tried things with the bucket policy and Cloudfront settings to no avail. Any thoughts are appreciated. Thanks.
There's no way to prevent this. The web browser has to be able to download those files to the local computer in order to render your website. In order for the web browser to download those files they have to be publicly available. There's no way to stop someone from viewing the source of files that are publicly available. Since there are copies of these files on every computer that has visited your website, there is absolutely no way to keep people from viewing the source of those files.
You shouldn't place anything in those files that shouldn't be publicly available.

Continuous Delivery issues with S3 and AWS CloudFront

I'm building out a series of content websites, and I've built a working CodePipeline that allows me to push edits to HTML files on github that instantly reflect in the S3 bucket, and consequently on the live website.
I created a cloudfront distro to get HTTPS for my website. The certificate and distro work fine, and it populates with my index.html in my S3 bucket, but the changes made via my github pipeline to the S3 bucket are reflected in the S3 bucket but not the CloudFront Distribution.
From what I've read, the edge locations used in cloudfront don't update their caches super often, and when they do, they might not update the edited index.html file because it has the same name as the old version.
I don't want to manually rename my index.html file in S3 every time one of my writers needs to post a top 10 Tractor Brands article or implement an experimental, low-effort clickbait idea, so that's pretty much off the table.
My overall objective is to build something where teams can quickly add an article with a few images to the website that goes live in minutes, and I've been able to do it so far but not with HTTPS.
If any of you know a good way of instantly updating CloudFront Distributions without changing file names, that would be great. Otherwise I'll probably have to start over because I need my sites secured and the ability to update them instantly.
You people are awesome. Thanks a million for any help.
You need to invalidate files from the edge caches. It's a simple and quick process.
You can automate the process yourself in your pipeline, or you could potentially use a third-party tool such as aws-cloudfront-auto-invalidator.
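If you automate it yourself, the pipeline step could be as small as the sketch below. It assumes the AWS CLI is available in the build environment with permission to create invalidations. The function name and the distribution ID in the usage example are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: invalidate cached paths after a deploy step.
# The distribution ID is a placeholder you would supply yourself.
invalidate_paths() {
  local distribution_id="$1"
  shift
  aws cloudfront create-invalidation \
    --distribution-id "$distribution_id" \
    --paths "$@"
}
```

For example, invalidate_paths E123EXAMPLE "/index.html" refreshes just the home page, while passing "/*" refreshes everything in the distribution.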

How to set no cache AT ALL on AWS S3?

I started using AWS S3 to give my users a fast way to download the installation files of my Win32 apps. Each install file is about 60MB and the downloads are very fast.
However, when I upload a new version of the app, S3 keeps serving the old file instead! I just rename the old file and upload the new version under the old file's name. After the upload, when I try to download, the old version is downloaded instead.
I searched for some solutions and here is what I tried:
Edited all TTL values on CloudFront to 0
Edited the 'Cache-Control' metadata to the value 'max-age=0' for each file in the bucket
Neither of these fixed the issue; AWS keeps serving the old file instead of the new one!
I will upload new versions often, so when users download I need S3 to never use a cache at all.
Please help.
I think this behavior might be because S3 uses an eventually consistent model, meaning that updates and deletes will propagate eventually, but this is not guaranteed to happen immediately, or even within a specific amount of time (see here for the specifics of their consistency approach). Specifically, they say "Amazon S3 offers eventual consistency for overwrite PUTS and DELETES in all Regions", and I think the case you're describing would be an overwrite PUT. There appears to be a good answer on a similar issue here: "How long does it take for AWS S3 to save and load an item?", which touches on the consistency issue and how to get around it. Hopefully that's helpful.

Setting up Amazon Cloudfront without S3

I want to use Cloudfront to serve images and CSS from my static website. I have read countless articles showing how to set it up with Amazon S3 but I would like to just host the files on my host and use cloud front to speed up delivery of said files, I'm just unsure on how to go about it.
So far I have created a distribution on CloudFront with my Origin Domain and CName and deployed it.
Origin Domain: example.me CName media.example.me
I added the CNAME for my domain:
media.mydomain.com with destination xxxxxx.cloudfront.net
Now this is where I'm stuck. Do I need to update the links in my HTML to that CNAME, so that if the stylesheet was http://example.me/stylesheets/screen.css I change that to http://media.example.me/stylesheets/screen.css
and images within the stylesheet that were ../images/image1.jpg to http://media.example.me/images/image1.jpg?
Just finding it a little confusing how to link everything it's the first time I have really dabbled in using a CDN.
Thanks
Yes, you will have to update the paths in your HTML to point to the CDN. If you have a deployment/build process, the links can typically be changed at that point, so that development can keep using the local files.
Another important thing to handle here is versioning of the CSS/JS and similar assets. You might change your CSS/JS frequently, and a change typically takes up to 24 hours to show up on the CDN (CloudFront's default TTL). Another option is invalidating files on the CDN, but this is charged explicitly and is discouraged. The suggested method is to generate a path like "media.example.me/XYZ/stylesheets/screen.css" and change XYZ to a different number for each deployment (a timestamp or epoch value will do). This way, with each deployment you need to invalidate only the HTML; the other files are on a new path anyway and will load fresh. This technique is generally called fingerprinting the URLs.
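A small sketch of how a build step might generate those versioned paths. The host names are the ones from the question. The epoch-seconds version scheme and the function names are only one way to do it.

```shell
#!/usr/bin/env bash
# Sketch: build versioned ("fingerprinted") asset URLs for one deployment.
# CDN_HOST is taken from the question; the version scheme is an assumption.
CDN_HOST="media.example.me"

# Use the deploy time as the version segment (the "XYZ" in the path).
new_version() {
  date +%s
}

# Build one versioned URL from a version and an asset path.
versioned_url() {
  local version="$1" path="$2"
  printf 'http://%s/%s/%s\n' "$CDN_HOST" "$version" "${path#/}"
}

# Print an HTML file with origin-host links rewritten to versioned CDN URLs.
rewrite_html() {
  local version="$1" html_file="$2"
  sed "s#http://example\.me/#http://$CDN_HOST/$version/#g" "$html_file"
}
```

A deploy would call new_version once and reuse that value for every file. The origin also has to answer requests under the versioned path, for instance by deploying the assets into a directory named after the version or by stripping that segment with a rewrite rule.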
Yes, you would update the references to your CSS files to load via the CDN domain. If image paths within CSS do not include a domain, they will also automatically load via cloudfront.