Issue with update of objects in AWS S3 bucket - amazon-web-services

While building an AWS website for one of my clients, I am having issues with the eventual consistency of the S3 bucket while updating an object.
In one of the features that we have developed, the user can update their profile picture. We save the profile picture in the S3 bucket and save its public URL in the DB for later retrieval.
Now for new objects it is working fine, but for updates it takes time (~5-10 mins) for the update to happen. I have explored the internet and could not find a solution to this. Some people suggested using versioned paths like v1/filename and v2/filename and, on update, taking the data from the latest version directory, but this is too impractical.
Can anyone please suggest what to do?

Enable versioning on the bucket and use the versioning features to get the latest version, rather than altering the path. S3 will handle the number of copies. See
https://forums.aws.amazon.com/thread.jspa?threadID=263531 for a discussion of this feature and consistency.
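As a rough sketch of the versioning suggestion, the AWS CLI can turn versioning on and report the version ID that a plain GET of the unchanged key currently resolves to. The bucket name and key in the usage comments are placeholders, not values from the question.

```shell
#!/bin/sh
# Turn on versioning for a bucket.
# $1 = bucket name (placeholder)
enable_versioning() {
    aws s3api put-bucket-versioning \
        --bucket "$1" \
        --versioning-configuration Status=Enabled
}

# Print the version ID of the current (latest) version of a key.
# $1 = bucket name, $2 = object key
latest_version_id() {
    aws s3api head-object \
        --bucket "$1" --key "$2" \
        --query VersionId --output text
}

# Usage (placeholder names):
#   enable_versioning my-bucket
#   latest_version_id my-bucket profiles/42.jpg
```

With versioning enabled, the object URL stored in the DB does not need to change; each overwrite simply creates a new version behind the same key.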

AEM/Adobe Experience Manager upload only some assets to AWS S3

My company is using AEM 6.5 and we were thinking to get some better performance out of our systems.
The idea we had is to upload only some assets (for example videos) to an S3 bucket and keep the other assets locally; we do not want to upload all the assets/datastore to S3. I know I can switch the datastore to S3, but that would mean all the assets go to S3, and we don't want this.
Restriction: we want the video upload to be done seamlessly from within AEM Author. The editor should upload the video normally and, behind the scenes, the transfer to S3 should happen.
I checked as much documentation as I could find, and there is no mention of this partial asset upload to S3: you either go full S3 or nothing at all (we already tested the full S3 datastore; it works, but we do not want it).
So, my question is: has anyone managed to do something like this?
Thanks
Have you looked into writing an Adobe Experience Manager workflow that reads a list of assets to upload and then uploads only those specified assets? You could control which assets are uploaded to an Amazon S3 bucket before running the AEM workflow.
You can create a custom workflow step as discussed here. In your use case, however, you would use the S3 Java API inside the custom workflow step. This is one way you can control which assets are uploaded to an Amazon S3 bucket from AEM.
https://helpx.adobe.com/experience-manager/using/message_service_gateway_api_64.html
Technically, it is possible to upload assets to S3 when they are uploaded to AEM, instead of storing them in JCR. Nevertheless, this probably won't work as you expect and would require a lot of refactoring of AEM itself to make it work properly.
Just because the binary is stored in S3 does not mean that AEM's internals are aware of that and can deal with it.
Take asset preview on the author for example: this part of AEM would expect the binary to be stored in JCR. Now you have to rewrite this whole part of AEM to go look for those assets in S3. This would be a massive headache; overlaying those parts of AEM is already deprecated, etc. And this is just one example of hundreds that you would need to find a solution for.
It is not worth the effort.
You probably need to go "all-in" with S3 or leave it as is. Not sure what the reasoning is behind this drive to only use S3 "partially" for videos instead of all assets. Videos are probably already the largest assets you have, so it can't be cost. We run pure asset installations with S3 datastore that have 20TB-60TB of data which is totally fine.

Continuous Delivery issues with S3 and AWS CloudFront

I'm building out a series of content websites, and I've built a working CodePipeline that allows me to push edits to HTML files on GitHub that instantly reflect in the S3 bucket, and consequently on the live website.
I created a CloudFront distribution to get HTTPS for my website. The certificate and distribution work fine, and it populates with the index.html from my S3 bucket, but the changes made via my GitHub pipeline show up in the S3 bucket and not in the CloudFront distribution.
From what I've read, the edge locations used in CloudFront don't update their caches super often, and when they do, they might not update the edited index.html file because it has the same name as the old version.
I don't want to manually rename my index.html file in S3 every time one of my writers needs to post a top 10 Tractor Brands article or implement an experimental, low-effort clickbait idea, so that's pretty much off the table.
My overall objective is to build something where teams can quickly add an article with a few images to the website that goes live in minutes, and I've been able to do it so far but not with HTTPS.
If any of you know a good way of instantly updating CloudFront distributions without changing file names, that would be great. Otherwise I'll probably have to start over, because I need my sites secured and the ability to update them instantly.
You people are awesome. Thanks a million for any help.
You need to invalidate files from the edge caches. It's a simple and quick process.
You can automate the process yourself in your pipeline, or you could potentially use a third-party tool such as aws-cloudfront-auto-invalidator.
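A minimal sketch of automating this in a pipeline step with the AWS CLI. The distribution ID in the usage comments is made up, and the function name is only illustrative.

```shell
#!/bin/sh
# Invalidate one or more paths on a CloudFront distribution.
# $1 = distribution ID, remaining arguments = paths (each starting with "/")
invalidate_paths() {
    dist_id="$1"
    shift
    aws cloudfront create-invalidation \
        --distribution-id "$dist_id" \
        --paths "$@"
}

# Usage (E2EXAMPLE123 is a placeholder distribution ID):
#   invalidate_paths E2EXAMPLE123 /index.html
#   invalidate_paths E2EXAMPLE123 '/*'    # quote the wildcard so the shell does not expand it
```

Running this as the last step after the deploy to S3 means writers never have to rename files; the edge caches fetch the new index.html on the next request.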

How to set no cache AT ALL on AWS S3?

I started to use AWS S3 to give my users a fast way to download the installation files of my Win32 apps. Each install file is about 60MB and the download works very fast.
However, when I upload a new version of the app, S3 keeps serving the old file instead! I just rename the old file and upload the new version with the same name as the old one. After I upload, when I try to download, the old version is downloaded instead.
I searched for some solutions and here is what I tried:
Edited all TTL values on CloudFront to 0
Edited the metadata 'Cache-control' with the value 'max-age=0' for each file in the bucket
None of these fixed the issue; AWS keeps serving the old file instead of the new one!
I will often upload new versions, so I need S3 to never use a cache at all when users try to download.
Please help.
I think this behavior might be because S3 uses an eventually consistent model, meaning that updates and deletes will propagate eventually, but it is not guaranteed that this happens immediately, or even within a specific amount of time (see here for the specifics of their consistency approach). Specifically, they say "Amazon S3 offers eventual consistency for overwrite PUTS and DELETES in all Regions", and I think the case you're describing would be an overwrite PUT. Note that S3 has since (December 2020) moved to strong read-after-write consistency for overwrite PUTs and DELETEs, so on current S3 a stale download is more likely to come from a cache in front of the bucket, such as CloudFront. There appears to be a good answer on a similar issue here: "How long does it take for AWS S3 to save and load an item?", which touches on the consistency issue and how to get around it. Hopefully that's helpful.

Backup strategies for AWS S3 bucket [closed]

I'm looking for some advice or best practices for backing up an S3 bucket.
The purpose of backing up data from S3 is to prevent data loss because of the following:
1. An S3 issue
2. An issue where I accidentally delete this data from S3
After some investigation I see the following options:
1. Use versioning http://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html
2. Copy from one S3 bucket to another using the AWS SDK
3. Back up to Amazon Glacier http://aws.amazon.com/en/glacier/
4. Back up to a production server, which is itself backed up
Which option should I choose, and how safe would it be to store data only on S3? I'd like to hear your opinions.
Some useful links:
Data Protection Documentation
Data Protection FAQ
Originally posted on my blog: http://eladnava.com/backing-up-your-amazon-s3-buckets-to-ec2/
Sync Your S3 Bucket to an EC2 Server Periodically
This can be easily achieved by utilizing multiple command line utilities that make it possible to sync a remote S3 bucket to the local filesystem.
s3cmd
At first, s3cmd looked extremely promising. However, after trying it on my enormous S3 bucket -- it failed to scale, erroring out with a Segmentation fault. It did work fine on small buckets, though. Since it did not work for huge buckets, I set out to find an alternative.
s4cmd
The newer, multi-threaded alternative to s3cmd. Looked even more promising, however, I noticed that it kept re-downloading files that were already present on the local filesystem. That is not the kind of behavior I was expecting from the sync command. It should check whether the remote file already exists locally (hash/filesize checking would be neat) and skip it in the next sync run on the same target directory. I opened an issue (bloomreach/s4cmd/#46) to report this strange behavior. In the meantime, I set out to find another alternative.
awscli
And then I found awscli. This is Amazon's official command line interface for interacting with their different cloud services, S3 included.
It provides a useful sync command that quickly and easily downloads the remote bucket files to your local filesystem.
$ aws s3 sync s3://your-bucket-name /home/ubuntu/s3/your-bucket-name/
Benefits:
Scalable - supports huge S3 buckets
Multi-threaded - syncs the files faster by utilizing multiple threads
Smart - only syncs new or updated files
Fast - thanks to its multi-threaded nature and smart sync algorithm
Accidental Deletion
Conveniently, the sync command won't delete files in the destination folder (local filesystem) if they are missing from the source (S3 bucket), and vice-versa. This is perfect for backing up S3 -- in case files get deleted from the bucket, re-syncing it will not delete them locally. And in case you delete a local file, it won't be deleted from the source bucket either.
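For reference, this behavior comes from the CLI's `--delete` flag being off by default. Here is a small sketch, with placeholder bucket and directory names, that previews a sync before running it and contrasts it with an exact mirror:

```shell
#!/bin/sh
# Show what a sync would download, without transferring anything.
# $1 = bucket name, $2 = local directory
preview_sync() {
    aws s3 sync "s3://$1" "$2" --dryrun
}

# Mirror the bucket exactly, removing local files that are gone from S3.
# This defeats the accidental-deletion protection described above.
mirror_sync() {
    aws s3 sync "s3://$1" "$2" --delete
}

# Usage (placeholder names):
#   preview_sync your-bucket-name /home/ubuntu/s3/your-bucket-name/
```

For a backup, the plain sync without `--delete` is the one to schedule; `mirror_sync` is shown only to make the difference explicit.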
Setting up awscli on Ubuntu 14.04 LTS
Let's begin by installing awscli. There are several ways to do this, however, I found it easiest to install it via apt-get.
$ sudo apt-get install awscli
Configuration
Next, we need to configure awscli with our Access Key ID & Secret Key, which you must obtain from IAM, by creating a user and attaching the AmazonS3ReadOnlyAccess policy. This will also prevent you or anyone who gains access to these credentials from deleting your S3 files. Make sure to enter your S3 region, such as us-east-1.
$ aws configure
Preparation
Let's prepare the local S3 backup directory, preferably in /home/ubuntu/s3/{BUCKET_NAME}. Make sure to replace {BUCKET_NAME} with your actual bucket name.
$ mkdir -p /home/ubuntu/s3/{BUCKET_NAME}
Initial Sync
Let's go ahead and sync the bucket for the first time with the following command:
$ aws s3 sync s3://{BUCKET_NAME} /home/ubuntu/s3/{BUCKET_NAME}/
Assuming the bucket exists, the AWS credentials and region are correct, and the destination folder is valid, awscli will start to download the entire bucket to the local filesystem.
Depending on the size of the bucket and your Internet connection, it could take anywhere from a few seconds to hours. When that's done, we'll go ahead and set up an automatic cron job to keep the local copy of the bucket up to date.
Setting up a Cron Job
Go ahead and create a sync.sh file in /home/ubuntu/s3:
$ nano /home/ubuntu/s3/sync.sh
Copy and paste the following code into sync.sh:
#!/bin/sh
# Echo the current date and time
echo '-----------------------------'
date
echo '-----------------------------'
echo ''
# Echo script initialization
echo 'Syncing remote S3 bucket...'
# Actually run the sync command (replace {BUCKET_NAME} with your S3 bucket name)
/usr/bin/aws s3 sync s3://{BUCKET_NAME} /home/ubuntu/s3/{BUCKET_NAME}/
# Echo script completion
echo 'Sync complete'
Make sure to replace {BUCKET_NAME} with your S3 bucket name, twice throughout the script.
Pro tip: You should use /usr/bin/aws to link to the aws binary, as crontab executes commands in a limited shell environment and won't be able to find the executable on its own.
Next, make sure to chmod the script so it can be executed by crontab.
$ sudo chmod +x /home/ubuntu/s3/sync.sh
Let's try running the script to make sure it actually works:
$ /home/ubuntu/s3/sync.sh
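The output should show the current date and time between two dashed lines, followed by "Syncing remote S3 bucket...", any files being downloaded, and finally "Sync complete".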
Next, let's edit the current user's crontab by executing the following command:
$ crontab -e
If this is your first time executing crontab -e, you'll need to select a preferred editor. I'd recommend selecting nano as it's the easiest for beginners to work with.
Sync Frequency
We need to tell crontab how often to run our script and where the script resides on the local filesystem by writing a command. The format for this command is as follows:
m h dom mon dow command
The following command configures crontab to run the sync.sh script every hour (specified via the minute:0 and hour:* parameters) and to have it pipe the script's output to a sync.log file in our s3 directory:
0 * * * * /home/ubuntu/s3/sync.sh > /home/ubuntu/s3/sync.log
You should add this line to the bottom of the crontab file you are editing. Then, go ahead and save the file to disk by pressing Ctrl + O and then Enter. You can then exit nano by pressing Ctrl + X. crontab will now run the sync task every hour.
Pro tip: You can verify that the hourly cron job is being executed successfully by inspecting /home/ubuntu/s3/sync.log, checking its contents for the execution date & time, and inspecting the logs to see which new files have been synced.
All set! Your S3 bucket will now get synced to your EC2 server every hour automatically, and you should be good to go. Do note that over time, as your S3 bucket gets bigger, you may have to increase your EC2 server's EBS volume size to accommodate new files. You can always increase your EBS volume size by following this guide.
Taking into account the related link, which explains that S3 has 99.999999999% durability, I would discard your concern #1. Seriously.
Now, if #2 is a valid use case and a real concern for you, I would definitely stick with options #1 or #3. Which one of them? It really depends on some questions:
Do you need any other of the versioning features or is it only to avoid accidental overwrites/deletes?
Is the extra cost imposed by versioning affordable?
Amazon Glacier is optimized for data that is infrequently accessed and for which retrieval times of several hours are suitable. Is this OK for you?
Unless your storage use is really huge, I would stick with bucket versioning. This way, you won't need any extra code/workflow to backup data to Glacier, to other buckets, or even to any other server (which is really a bad choice IMHO, please forget about it).
How about using the readily available Cross Region Replication feature on the S3 buckets themselves? Here are some useful articles about the feature:
https://aws.amazon.com/blogs/aws/new-cross-region-replication-for-amazon-s3/
http://docs.aws.amazon.com/AmazonS3/latest/UG/cross-region-replication.html
You can back up your S3 data using the following methods:
1. Schedule a backup process using AWS Data Pipeline. It can be done in the 2 ways mentioned below:
a. Using the CopyActivity of Data Pipeline, with which you can copy from one S3 bucket to another S3 bucket.
b. Using the ShellCommandActivity of Data Pipeline and "S3DistCp" commands to do a recursive copy of S3 folders from one bucket to another (in parallel).
2. Use versioning inside the S3 bucket to maintain different versions of the data.
3. Use Glacier to back up your data. Use it when you don't need to restore the backup quickly to the original bucket (it takes some time to get the data back from Glacier) or when you want to save some cost by not using another S3 bucket for backup. This option can easily be set using a lifecycle rule on the S3 bucket you want to back up.
Option 1 can give you more security, say in case you accidentally delete your original S3 bucket. Another benefit is that you can store your backups in date-wise folders in another S3 bucket; this way you know what data you had on a particular date and can restore a specific date's backup. It all depends on your use case.
You'd think there would be an easier way by now to just hold some sort of incremental backups in a different region.
None of the suggestions above are really simple or elegant solutions. I don't really consider Glacier an option, as I think that's more of an archival solution than a backup solution. When I think backup, I think disaster recovery from a junior developer recursively deleting a bucket, or perhaps an exploit or bug in your app that deletes stuff from S3.
To me, the best solution would be a script that just backs up one bucket to another region, one daily and one weekly, so that if something terrible happens you can just switch regions. I don't have a setup like this. I've looked into it but just haven't gotten around to doing it, because it would take a bit of effort, which is why I wish there was some stock solution to use.
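A script along those lines might look roughly like the sketch below. It assumes a backup bucket already exists in the second region; every bucket name, region, and prefix shown is a placeholder.

```shell
#!/bin/sh
# Copy a bucket into a backup bucket in another region, under a prefix
# such as "daily" or "weekly".
# $1 = source bucket, $2 = source region,
# $3 = backup bucket, $4 = backup region, $5 = prefix
backup_bucket() {
    aws s3 sync "s3://$1" "s3://$3/$5" \
        --source-region "$2" \
        --region "$4"
}

# Usage (placeholder names), e.g. from two separate cron jobs:
#   backup_bucket my-bucket us-east-1 my-bucket-backup eu-west-1 daily
#   backup_bucket my-bucket us-east-1 my-bucket-backup eu-west-1 weekly
```

Because sync is run without `--delete`, objects removed from the source bucket stay in the backup copy, which is the recovery scenario described above.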
While this question was posted some time ago, I thought it important to mention MFA delete protection with the other solutions. The OP is trying to solve for the accidental deletion of data. Multi-factor authentication (MFA) manifests in two different scenarios here -
Permanently deleting object versions - Enable MFA delete on the bucket's versioning.
Accidentally deleting the bucket itself - Set up a bucket policy denying delete without MFA authentication.
Couple this with cross-region replication and versioning to reduce the risk of data loss and improve the recovery scenarios.
Here is a blog post on this topic with more detail.
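As an illustration of the first scenario, MFA delete is switched on through the bucket's versioning configuration. This is a sketch with placeholder values; as far as I know it has to be run with the bucket owner's root credentials and that account's MFA device.

```shell
#!/bin/sh
# Enable versioning together with MFA delete on a bucket.
# $1 = bucket name
# $2 = serial number or ARN of the MFA device
# $3 = current code shown on the MFA device
enable_mfa_delete() {
    aws s3api put-bucket-versioning \
        --bucket "$1" \
        --versioning-configuration Status=Enabled,MFADelete=Enabled \
        --mfa "$2 $3"
}

# Usage (all values are placeholders):
#   enable_mfa_delete my-bucket arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456
```

Once this is set, permanently deleting an object version requires the same `--mfa` argument on the delete call.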
As this topic was created a long time ago and is still relevant, here is some updated news:
External backup
Nothing has changed: you can still use the CLI, or any other tool, to schedule a copy somewhere else (in or out of AWS).
There are tools to do that, and previous answers were very specific.
"Inside" backup
S3 now supports lifecycle rules for previous versions. It means that you can create and use a bucket normally and let S3 manage the lifecycle in the same bucket.
An example of possible config, if you delete a file, would be:
File marked as deleted (still available but "invisible" to normal operations)
File moved to Glacier after 7 days
File removed after 30 days
You first need to activate versioning, then go to the Lifecycle configuration. It's pretty straightforward: select previous versions only, and deletion is what you want.
Then, define your policy. You can add as many actions as you want (but each transition costs you). You can't store in Glacier less than 30 days.
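The example policy above (previous versions to Glacier after 7 days, removed after 30 days) could be expressed with the CLI roughly as follows. The rule ID and bucket name are made up, and the numbers simply mirror the example, so check Glacier's minimum storage duration charges before copying them.

```shell
#!/bin/sh
# Print a lifecycle configuration that only touches noncurrent (previous)
# versions: transition to Glacier after 7 days, delete after 30 days.
lifecycle_json() {
    cat <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-previous-versions",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "NoncurrentVersionTransitions": [
        { "NoncurrentDays": 7, "StorageClass": "GLACIER" }
      ],
      "NoncurrentVersionExpiration": { "NoncurrentDays": 30 }
    }
  ]
}
EOF
}

# Apply the configuration to a versioned bucket.
# $1 = bucket name (placeholder)
apply_lifecycle() {
    aws s3api put-bucket-lifecycle-configuration \
        --bucket "$1" \
        --lifecycle-configuration "$(lifecycle_json)"
}

# Usage:
#   apply_lifecycle my-bucket
```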
If you already have a bucket with a lot of data, the first sync will take a long time. In my case, I had 400GB and it took 3 hours the first time. So I think replication is a good solution for S3 bucket backup.

Updating uploaded content on Amazon S3?

We have a problem with updating our uploaded content on Amazon S3. We keep our software updates on Amazon S3. We overwrite the old version of our software on S3 with new versions. Sometimes our users get old versions of files, even when the new versions were uploaded over 10 hours ago.
Step-by-step actions of our team:
1. We upload our file (about 300 MB) to S3.
2. This file stays on S3 for some time: more than a day, usually some weeks.
3. We upload a new version of the file to S3, overwriting the old version of this file.
4. We start testing downloads. Some people get the new version, but other people get the old version.
How can we solve this problem?
You should use different file names for different versions; this makes sure that some crazy proxy won't serve a cached old file.
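One way to apply the different-file-name-per-version idea when uploading is sketched below. The naming scheme, file names, and bucket are placeholders chosen for illustration.

```shell
#!/bin/sh
# Build an object key that embeds the release version, so that every
# release gets its own URL.
# $1 = base name, $2 = version, $3 = extension
versioned_key() {
    printf '%s-%s.%s\n' "$1" "$2" "$3"
}

# Upload a release under its versioned key.
# $1 = local file, $2 = bucket, $3 = base name, $4 = version, $5 = extension
upload_release() {
    aws s3 cp "$1" "s3://$2/$(versioned_key "$3" "$4" "$5")"
}

# Usage (placeholder names):
#   upload_release ./setup.exe my-bucket setup 2.4.1 exe
#   uploads to s3://my-bucket/setup-2.4.1.exe
```

Since the URL changes with each release, no intermediate cache can hold a stale copy under the new name; the download link just needs to point at the latest key.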
I'd suggest you try to use S3 Object Versioning, and place CloudFront in the solution to expose a short TTL Expiry to make it clear to caches to dismiss it ASAP.
Just a note for CloudFront: Make sure to Invalidate the CloudFront Cache for the Object when releasing a new version