Update Hard Drive backup on AWS S3

I would like to run an aws s3 sync command daily to update my hard drive backup on S3. Most of the time there will be no changes. The problem is that the s3 sync command takes days to check for changes (for a 4 TB HDD). What is the quickest way to update a hard drive backup on S3?

If you want to back up your own computer to Amazon S3, I would recommend using a backup utility that knows how to use S3. These utilities can do smart things like compress data, track files that have changed, and set an appropriate Storage Class.
For example, I use Cloudberry Backup on a Windows computer. It does regular checking for new/changed files and uploads them to S3. If I delete a file locally, it waits 90 days before deleting it from S3. It can also handle multiple versions of files, rather than always overwriting files.
I would recommend only backing up data folders (e.g. My Documents). There is no benefit to backing up your Operating System or temporary files, because you would not restore the OS from a remote backup.
While some backup utilities can compress files individually or in groups, experience has taught me never to do so, since it can make restoration difficult if you do not have the original backup software (and remember -- backups last years!). The great thing about S3 is that it is easy to access from many devices -- I have often grabbed documents from my S3 backup via my phone when I'm away from home.
Bottom line: Use a backup utility that knows how to do backups well. Make sure it knows how to use S3.

I would recommend using a backup tool that can synchronize with Amazon S3. For example, for Windows you can use Uranium Backup. It syncs with several clouds, including Amazon S3.
It can be scheduled to perform daily backups and also incremental backups (in case there are changes).
I think this is the best way, considering the tediousness of daily manual syncing. Plus, it runs in the background and notifies you of any error or success logs.
This is the solution I use; I hope it helps you.

Related

Best way to s3 sync changed files in a folder only

I have a job that clones a repo then s3 syncs changed files over to an s3 bucket. I'd like to sync only changed files. Since the repo is cloned first, the files always have a new timestamp, so s3 sync will always upload them. I thought about using "--size-only", but my understanding is that this can potentially miss files that have legitimately changed. What's the best way to go about this?
There are no out-of-the-box answers that will sync changed files if the mtime cannot be counted on. As you point out, if a file does not change in size, the "--size-only" flag will cause aws s3 sync to skip it even if its content changed. To my mind there are two basic paths; the solution you use will depend on your exact needs.
Take advantage of Git
First off, you could use the fact that you have the files stored in git to help update the modified time. git itself will not store this metadata; the maintainers' philosophy is that doing so is a bad idea. I won't argue for or against this, but there are two basic ways around it:
You could store this metadata in git. There are multiple approaches to doing this; one such is metastore, which uses a tool that's installed alongside git to store the metadata and apply it later. This does require adding a tool for all users of your git repo, which may or may not be acceptable.
Another option is to attempt to recreate the mtime from metadata that's already in git. For instance, git-restore-mtime does this by using the timestamp of the most recent commit that modified the file. This would require running an external tool before running the sync command, but it shouldn't require any other workflow changes.
Using either of these options would allow a basic aws sync command to work, since the timestamps would be consistent from one run to another.
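As a minimal sketch of the second option (assuming git-restore-mtime is installed and available as a git subcommand -- the exact package name and invocation may differ on your system -- and {BUCKET_NAME} is a placeholder):
$ cd /path/to/cloned/repo
$ git restore-mtime
$ aws s3 sync . s3://{BUCKET_NAME}/ --exclude ".git/*"
Once git restore-mtime rewrites modification times from commit history, repeated sync runs see consistent timestamps and only upload files whose commit-derived mtime actually changed.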
Do your own thing
Fundamentally, you want to upload files that have changed. aws sync attempts to use file size and modification timestamps to detect changes, but if you wanted to, you could write a script or program to enumerate all files you want to upload, and upload them along with a small bit of extra metadata including something like a sha256 hash. Then on future runs, you can enumerate the files in S3 using list-objects and use head-object on each object in turn to get the metadata to see if the hash has changed.
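As a rough sketch of that idea (not a drop-in solution): the bucket name, key prefix, and the sha256 metadata key below are my own arbitrary choices, and the script assumes a configured AWS CLI:
#!/bin/sh
# Upload a file only when its SHA-256 differs from the hash recorded as
# object metadata during the previous upload.
BUCKET=my-backup-bucket   # placeholder
PREFIX=repo               # placeholder
find . -type f -not -path './.git/*' | while read -r file; do
  key="$PREFIX/${file#./}"
  local_hash=$(sha256sum "$file" | awk '{print $1}')
  # head-object fails if the object does not exist yet; ignore that error.
  remote_hash=$(aws s3api head-object --bucket "$BUCKET" --key "$key" \
      --query 'Metadata.sha256' --output text 2>/dev/null)
  if [ "$local_hash" != "$remote_hash" ]; then
    aws s3api put-object --bucket "$BUCKET" --key "$key" \
        --body "$file" --metadata sha256="$local_hash"
  fi
done
Doing a head-object call per file is simple but adds a round-trip per object; the etag alternative described next avoids that.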
Alternatively, you could use the "etag" of each object in S3, as that is returned in the list-objects call. As I understand it, the etag formula isn't documented and is subject to change. That said, it is known; you can find implementations of it on Stack Overflow and elsewhere. You could calculate the etag for your local files, then see if the remote files differ and need to be updated. That would save you having to do a head-object call on each object as you check for changes.

Push into S3 or Pull into S3 which is faster for a small file

So I have a use case where I need to put files from on-prem FTP to S3.
The size of each file (XML) is 5 KB max.
The number of files is about 100 per minute.
Now, the use case is such that as soon as files arrive at the FTP location, I need to put them into the S3 bucket immediately.
What would be the best way to achieve that?
Here are my options:
Using the AWS CLI at my FTP location (push mechanism).
Using Lambda (pull mechanism).
Writing a Java application to put the files into S3 from FTP.
Or is there anything built-in that I can leverage?
Basically, I need to put the files in S3 as soon as possible, because a UI is built on top of S3 and if a file does not arrive immediately I might be in trouble.
The easiest would be to use the AWS Command-Line Interface (CLI), or an API call if you wish to do it from application code.
It doesn't really make sense to do it via Lambda, because Lambda would need to somehow retrieve the file from FTP and then copy it to S3 (so it is doing double work).
You can certainly write a Java application to do it, or simply call the AWS CLI (written in Python) since it will work out-of-the-box.
You could either use aws s3 sync to copy all new/updated files, or copy specific files with aws s3 cp. Since you have so many files, it's probably best to specify the files to copy; otherwise the sync will waste time scanning many historical files that don't need to be copied.
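If you go the push route, one possible sketch is to watch the FTP drop directory and copy each file the moment it is fully written. This assumes inotify-tools and a configured AWS CLI on the FTP server; the directory and bucket names are placeholders:
#!/bin/sh
# Push each newly written file to S3 immediately.
WATCH_DIR=/data/ftp    # placeholder
BUCKET=my-bucket       # placeholder
inotifywait -m -e close_write --format '%w%f' "$WATCH_DIR" |
while read -r file; do
  aws s3 cp "$file" "s3://$BUCKET/$(basename "$file")"
done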
The ultimate best case would be for the files to be sent to S3 directly, without involving FTP at all!

Backing up symlinks using AWS s3 sync

I'm attempting to back up our system using the aws s3 sync command, however this will either back up the entire directory behind a symlink (default behaviour), or not back up the symlink at all.
I'd like some way of backing up the symlink so it can be restored from S3 if need be.
I don't want to archive the entire directory first, else I'll lose the ability to only backup the changed files.
My current thought is to scan the dir for symlinks, and create metadata files containing the symlink's target, which, after restore, could be read to rebuild the symlink, but I'm not quite sure how to do this.
Any advice would be very welcome. Thanks in advance.
As is, S3 has no standard way to represent a symlink. Note that you could define a custom representation and store it in the metadata of an empty S3 object, but you would be on your own. AFAIK, aws s3 doesn't do that.
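To make that concrete, one possible custom representation (purely a sketch -- the bucket name, key prefix and metadata key are my own choices, and aws s3 sync knows nothing about them) is to store each symlink as an empty object whose metadata records the link target:
#!/bin/sh
# Record every symlink under the current directory as an empty S3 object
# carrying the link target in its metadata.
BUCKET=my-backup-bucket   # placeholder
find . -type l | while read -r link; do
  target=$(readlink "$link")
  aws s3api put-object --bucket "$BUCKET" --key "symlinks/${link#./}" \
      --metadata symlink-target="$target"
done
# On restore, read the target back and recreate the link, e.g.:
#   aws s3api head-object --bucket my-backup-bucket --key symlinks/path/to/link \
#       --query 'Metadata."symlink-target"' --output text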
Now, for purpose of backing up to S3 (and Glacier), you may want to take a look at OpenDedup. It does use the same type of rolling checksum as used in rsync to minimize the actual storage used (and the bandwidth).
I've been doing a lot of cp -rl and rsync custom scripts to backup my own system to local drives, but was always frustrated about the unnecessary extra storage due to many duplicate files I may have. Imagine what happens in those simple schemes when you rename a directory (mv dirA dirB): the next backup typically stores a brand new copy of that dir.
With OpenDedup (and other similar systems, such as bup, zpaq, etc.), the content is stored uniquely (thanks to the rolling checksum approach). I like that.
Right now, Amazon S3 does not support symbolic links. It will follow them when uploading from the local disk to S3. According to the AWS documentation, the contents of the file the symlink points to are copied or sync'd under the name of the symlink.
The rsync command does have options for symbolic links. With --links, a symlink is copied as a symlink, so if your symlinks use absolute paths (/my/absolute/path) the restored link will point to that path on your local box, and if you use relative paths (../../path) the restored link will point to that relative path. With --copy-links, rsync instead copies the file or directory the link points to.
Rsync would be a way to keep your symlinks for use after restoring the files back to your local box.
Another method would be to use an AWS S3 sync or backup service, such as NetApp’s Cloud Sync, which would catalog your data with each operation. Each service provider offers different features, so how symbolic links would be handled depends on the vendor chosen.

Restoring from Glacier after Synology NAS

Have been using Synology NAS for backup for quite some time.
Due to changes in structure, I will be migrating to a new system and need to restore all files from Glacier.
Tried using CloudBerry Backup (desktop free); however, after synchronizing there were no files listed for restoration.
Also tried using Fast Glacier; however, this resulted in errors preventing the backup from functioning.
Any and all suggestions more than welcome.
The easiest approach would be to copy the files in question from your Synology NAS to the new system, whether it is a new Synology NAS or something else. It is the fastest and most cost-efficient way.
If you are migrating to a new Synology, you should just copy the files from the old one. If that is somehow not possible, you should use the "Retrieve task" button on the "Restore" tab.
Beware, it will take a long time just to retrieve the task. When retrieving the task, the Synology NAS downloads the files related to the task. These files contain all the information about your backup task: which files are backed up, versions, file location, size and so on. When the task is retrieved, you can begin to recover the files. Use this calculator to calculate the restore price before you begin - it is far from cheap.
Also, hope that the restore succeeds (sometimes backup integrity fails, which makes a restore impossible).
In case you restore to some system other than Synology, you should definitely copy the files from your old Synology. If you no longer have access to the old Synology NAS, buy a new Synology (or maybe use XPenology), restore to it, and then copy the files to the new system. Amazon Glacier backup on a Synology NAS uses a proprietary format, which to my knowledge is only readable by other Synology devices.
Good luck with the migration.

Backup strategies for AWS S3 bucket [closed]

I'm looking for some advice or best practice to back up S3 bucket.
The purpose of backing up data from S3 is to prevent data loss because of the following:
S3 issue
issue where I accidentally delete this data from S3
After some investigation I see the following options:
Use versioning http://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html
Copy from one S3 bucket to another using the AWS SDK
Back up to Amazon Glacier http://aws.amazon.com/en/glacier/
Back up to a production server, which is itself backed up
Which option should I choose, and how safe would it be to store data only on S3? I want to hear your opinions.
Some useful links:
Data Protection Documentation
Data Protection FAQ
Originally posted on my blog: http://eladnava.com/backing-up-your-amazon-s3-buckets-to-ec2/
Sync Your S3 Bucket to an EC2 Server Periodically
This can be easily achieved by utilizing multiple command line utilities that make it possible to sync a remote S3 bucket to the local filesystem.
s3cmd
At first, s3cmd looked extremely promising. However, after trying it on my enormous S3 bucket -- it failed to scale, erroring out with a Segmentation fault. It did work fine on small buckets, though. Since it did not work for huge buckets, I set out to find an alternative.
s4cmd
The newer, multi-threaded alternative to s3cmd. Looked even more promising, however, I noticed that it kept re-downloading files that were already present on the local filesystem. That is not the kind of behavior I was expecting from the sync command. It should check whether the remote file already exists locally (hash/filesize checking would be neat) and skip it in the next sync run on the same target directory. I opened an issue (bloomreach/s4cmd/#46) to report this strange behavior. In the meantime, I set out to find another alternative.
awscli
And then I found awscli. This is Amazon's official command line interface for interacting with their different cloud services, S3 included.
It provides a useful sync command that quickly and easily downloads the remote bucket files to your local filesystem.
$ aws s3 sync s3://your-bucket-name /home/ubuntu/s3/your-bucket-name/
Benefits:
Scalable - supports huge S3 buckets
Multi-threaded - syncs the files faster by utilizing multiple threads
Smart - only syncs new or updated files
Fast - thanks to its multi-threaded nature and smart sync algorithm
Accidental Deletion
Conveniently, the sync command won't delete files in the destination folder (local filesystem) if they are missing from the source (S3 bucket), and vice-versa. This is perfect for backing up S3 -- in case files get deleted from the bucket, re-syncing it will not delete them locally. And in case you delete a local file, it won't be deleted from the source bucket either.
Setting up awscli on Ubuntu 14.04 LTS
Let's begin by installing awscli. There are several ways to do this, however, I found it easiest to install it via apt-get.
$ sudo apt-get install awscli
Configuration
Next, we need to configure awscli with our Access Key ID & Secret Key, which you must obtain from IAM, by creating a user and attaching the AmazonS3ReadOnlyAccess policy. This will also prevent you or anyone who gains access to these credentials from deleting your S3 files. Make sure to enter your S3 region, such as us-east-1.
$ aws configure
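If you prefer to create that IAM user from the command line as well, a minimal sketch (the user name s3-backup is a placeholder):
$ aws iam create-user --user-name s3-backup
$ aws iam attach-user-policy --user-name s3-backup \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
$ aws iam create-access-key --user-name s3-backup
Feed the access key ID and secret key from the last command into aws configure.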
Preparation
Let's prepare the local S3 backup directory, preferably in /home/ubuntu/s3/{BUCKET_NAME}. Make sure to replace {BUCKET_NAME} with your actual bucket name.
$ mkdir -p /home/ubuntu/s3/{BUCKET_NAME}
Initial Sync
Let's go ahead and sync the bucket for the first time with the following command:
$ aws s3 sync s3://{BUCKET_NAME} /home/ubuntu/s3/{BUCKET_NAME}/
Assuming the bucket exists, the AWS credentials and region are correct, and the destination folder is valid, awscli will start to download the entire bucket to the local filesystem.
Depending on the size of the bucket and your Internet connection, it could take anywhere from a few seconds to hours. When that's done, we'll go ahead and set up an automatic cron job to keep the local copy of the bucket up to date.
Setting up a Cron Job
Go ahead and create a sync.sh file in /home/ubuntu/s3:
$ nano /home/ubuntu/s3/sync.sh
Copy and paste the following code into sync.sh:
#!/bin/sh
# Echo the current date and time
echo '-----------------------------'
date
echo '-----------------------------'
echo ''
# Echo script initialization
echo 'Syncing remote S3 bucket...'
# Actually run the sync command (replace {BUCKET_NAME} with your S3 bucket name)
/usr/bin/aws s3 sync s3://{BUCKET_NAME} /home/ubuntu/s3/{BUCKET_NAME}/
# Echo script completion
echo 'Sync complete'
Make sure to replace {BUCKET_NAME} with your S3 bucket name, twice throughout the script.
Pro tip: You should use /usr/bin/aws to link to the aws binary, as crontab executes commands in a limited shell environment and won't be able to find the executable on its own.
Next, make sure to chmod the script so it can be executed by crontab.
$ sudo chmod +x /home/ubuntu/s3/sync.sh
Let's try running the script to make sure it actually works:
$ /home/ubuntu/s3/sync.sh
Next, let's edit the current user's crontab by executing the following command:
$ crontab -e
If this is your first time executing crontab -e, you'll need to select a preferred editor. I'd recommend selecting nano as it's the easiest for beginners to work with.
Sync Frequency
We need to tell crontab how often to run our script and where the script resides on the local filesystem by writing a command. The format for this command is as follows:
m h dom mon dow command
The following command configures crontab to run the sync.sh script every hour (specified via the minute:0 and hour:* parameters) and to have it pipe the script's output to a sync.log file in our s3 directory:
0 * * * * /home/ubuntu/s3/sync.sh > /home/ubuntu/s3/sync.log
You should add this line to the bottom of the crontab file you are editing. Then, go ahead and save the file to disk by pressing Ctrl + O and then Enter. You can then exit nano by pressing Ctrl + X. crontab will now run the sync task every hour.
Pro tip: You can verify that the hourly cron job is being executed successfully by inspecting /home/ubuntu/s3/sync.log, checking its contents for the execution date & time, and inspecting the logs to see which new files have been synced.
All set! Your S3 bucket will now get synced to your EC2 server every hour automatically, and you should be good to go. Do note that over time, as your S3 bucket gets bigger, you may have to increase your EC2 server's EBS volume size to accommodate new files. You can always increase your EBS volume size by following this guide.
Taking into account the related link, which explains that S3 has 99.999999999% durability, I would discard your concern #1. Seriously.
Now, if #2 is a valid use case and a real concern for you, I would definitely stick with options #1 or #3. Which one of them? It really depends on some questions:
Do you need any other of the versioning features or is it only to avoid accidental overwrites/deletes?
Is the extra cost imposed by versioning affordable?
Amazon Glacier is optimized for data that is infrequently accessed and for which retrieval times of several hours are suitable. Is this OK for you?
Unless your storage use is really huge, I would stick with bucket versioning. This way, you won't need any extra code or workflow to back up data to Glacier, to other buckets, or even to any other server (which is really a bad choice IMHO, please forget about it).
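For reference, bucket versioning can be enabled with a single CLI call ({BUCKET_NAME} is a placeholder):
$ aws s3api put-bucket-versioning --bucket {BUCKET_NAME} \
    --versioning-configuration Status=Enabled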
How about using the readily available Cross Region Replication feature on the S3 buckets itself? Here are some useful articles about the feature
https://aws.amazon.com/blogs/aws/new-cross-region-replication-for-amazon-s3/
http://docs.aws.amazon.com/AmazonS3/latest/UG/cross-region-replication.html
You can back up your S3 data using the following methods:
Schedule the backup process using AWS Data Pipeline. It can be done in the two ways mentioned below:
a. Using CopyActivity of Data Pipeline, with which you can copy from one S3 bucket to another S3 bucket.
b. Using ShellActivity of Data Pipeline and the "s3distcp" command to do a recursive copy of S3 folders from one bucket to another (in parallel).
Use versioning inside the S3 bucket to maintain different versions of the data.
Use Glacier to back up your data (use it when you don't need to restore the backup to the original buckets quickly, since it takes some time to get the data back from Glacier, or when you want to save some cost by avoiding another S3 bucket for the backup); this option can easily be set using a lifecycle rule on the S3 bucket you want to back up.
Option 1 can give you more security in case you accidentally delete your original S3 bucket, and another benefit is that you can store your backup in date-wise folders in another S3 bucket; this way you know what data you had on a particular date and can restore a backup from a specific date. It all depends on your use case.
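As a sketch of the date-wise folder idea (both bucket names are placeholders), a scheduled job could run something like:
$ aws s3 sync s3://my-source-bucket s3://my-backup-bucket/$(date +%F)/
Each run lands in a folder named after the current date, so restoring the state from a particular day is just a sync back from that prefix.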
You'd think there would be an easier way by now to just hold some sort of incremental backup in a different region.
All the suggestions above are not really simple or elegant solutions. I don't really consider Glacier an option, as I think that's more of an archival solution than a backup solution. When I think backup, I think disaster recovery from a junior developer recursively deleting a bucket, or perhaps an exploit or bug in your app that deletes things from S3.
To me, the best solution would be a script that just backs up one bucket to another region, one daily and one weekly, so that if something terrible happens you can just switch regions. I don't have a setup like this; I've looked into it but haven't gotten around to doing it because it would take a bit of effort, which is why I wish there were some stock solution to use.
While this question was posted some time ago, I thought it important to mention MFA delete protection with the other solutions. The OP is trying to solve for the accidental deletion of data. Multi-factor authentication (MFA) manifests in two different scenarios here -
Permanently deleting object versions - Enable MFA delete on the bucket's versioning.
Accidentally deleting the bucket itself - Set up a bucket policy denying delete without MFA authentication.
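For the first scenario, MFA Delete can be enabled together with versioning from the CLI. As a sketch, with the bucket name, account ID, device name, and code all placeholders; note that this call generally has to be made with the root account's credentials and MFA device:
$ aws s3api put-bucket-versioning --bucket {BUCKET_NAME} \
    --versioning-configuration Status=Enabled,MFADelete=Enabled \
    --mfa "arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456"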
Couple these with cross-region replication and versioning to reduce the risk of data loss and improve the recovery scenarios.
Here is a blog post on this topic with more detail.
As this topic was created a long time ago and is still quite relevant, here is some updated news:
External backup
Nothing has changed; you can still use the CLI, or any other tool, to schedule a copy somewhere else (in or out of AWS).
There are tools to do that, and previous answers were very specific.
"Inside" backup
S3 now supports versioning, which keeps previous versions of objects. It means that you can create and use a bucket normally and let S3 manage the lifecycle in the same bucket.
An example of a possible configuration, if you delete a file, would be:
File marked as deleted (still available but "invisible" to normal operations)
File moved to Glacier after 7 days
File removed after 30 days
You first need to activate versioning, then go to the Lifecycle configuration. It is pretty straightforward: target previous versions only, and choose the deletion behaviour you want.
Then, define your policy. You can add as many actions as you want (but each transition costs you). You can't store objects in Glacier for less than 30 days.
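As a CLI sketch of the example above ({BUCKET_NAME} is a placeholder, versioning must already be enabled, and the day counts simply mirror the example):
$ aws s3api put-bucket-lifecycle-configuration --bucket {BUCKET_NAME} \
    --lifecycle-configuration '{
      "Rules": [{
        "ID": "expire-previous-versions",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "NoncurrentVersionTransitions": [
          {"NoncurrentDays": 7, "StorageClass": "GLACIER"}
        ],
        "NoncurrentVersionExpiration": {"NoncurrentDays": 30}
      }]
    }'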
If you have a lot of data and the bucket already exists, the first sync will take a long time. In my case, I had 400 GB and it took 3 hours the first time. So I think making a replica is a good solution for S3 bucket backup.