AWS S3 sync between buckets overwriting newer destination files

We have two S3 buckets and a sync cron job that should copy changes from bucket1 to bucket2.
aws s3 sync s3://bucket1/images/ s3://bucket2/images/
When a new image is added to bucket1, it correctly gets copied over to bucket2.
However, if we upload a new version of that image to bucket2, when the sync job next runs it actually copies the older version from bucket1 over to bucket2, replacing the newer version we just put there.
This is part of a migration process; in time the only place images will be uploaded to will be bucket2, but for the time being they may be uploaded to either, and we only want changes from bucket1 to be copied up to bucket2, NOT the other way round.
Why does the aws sync job seem to think that the file on bucket1 has changed? Does it not know that the file in bucket2 is newer, so it should be left alone?

The AWS Command-Line Interface (CLI) aws s3 sync command copies content from the Source location to the Destination location. It only copies files that have been added or changed since the last sync.
It is designed as a one-way sync, not a two-way sync. Your file is being overwritten because the copy in the Destination no longer matches the copy in the Source (for example, its size differs), so sync 'corrects' the Destination. As far as the command is concerned, this is correct behavior.
There is limited scope to tweak this behavior; the available options include (from the sync command documentation):
--exact-timestamps (boolean) When syncing from S3 to local, same-sized items will be ignored only when the timestamps match exactly. The default behavior is to ignore same-sized items unless the local version is newer than the S3 version.
However, there does not appear to be an option that prevents a file from being overwritten merely because one with the same name already exists, nor one that prefers to keep whichever file is newer.
If you want a two-way sync with more specific rules, you will need to code it yourself.
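For example, here is a minimal sketch of that kind of "keep whichever is newer" copy in Python with boto3. The bucket and prefix names are taken from the question, but the newer-wins rule itself is my interpretation of the requirement, not an AWS feature:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
SRC_BUCKET, DST_BUCKET, PREFIX = "bucket1", "bucket2", "images/"

def dest_last_modified(key):
    # Return the destination object's LastModified, or None if it does not exist.
    try:
        return s3.head_object(Bucket=DST_BUCKET, Key=key)["LastModified"]
    except ClientError:
        return None

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        dst_time = dest_last_modified(obj["Key"])
        # Copy only when the destination object is missing or older than the source.
        if dst_time is None or obj["LastModified"] > dst_time:
            s3.copy_object(
                Bucket=DST_BUCKET,
                Key=obj["Key"],
                CopySource={"Bucket": SRC_BUCKET, "Key": obj["Key"]},
            )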

Related

Copy limited number of files from S3?

We are using an S3 bucket to store a growing number of small JSON files (~1KB each) that contain some build-related data. Part of our pipeline involves copying these files from S3 and putting them into memory to do some operations.
That copy operation is done via S3 cli tool command that looks something like this:
aws s3 cp s3://bucket-path ~/some/local/path/ --recursive --profile dev-profile
The problem is that the number of json files on S3 is getting pretty large since more are being made every day. It's nothing even close to the capacity of the S3 bucket since the files are so small. However, in practical terms, there's no need to copy ALL these JSON files. Realistically the system would be safe just copying the most recent 100 or so. But we do want to keep older ones around for other purposes.
So my question boils down to: is there a clean way to copy a specific number of files from S3 (maybe sorted by most recent)? Is there some kind of pruning policy we can set on an S3 bucket to delete files older than X days or something?
The aws s3 sync command in the AWS CLI sounds perfect for your needs.
It will copy only files that are New or Modified since the last sync. However, it means that the destination will need to retain a copy of the 'old' files so that they are not copied again.
Alternatively, you could write a script (eg in Python) that lists the objects in S3 and then only copies objects added since the last time the copy was run.
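For instance, a rough sketch of such a script in Python with boto3 (the bucket name, local path and the cut-off of 100 files are placeholder values):

import os
import boto3

s3 = boto3.client("s3")
BUCKET, LOCAL_DIR, KEEP = "my-build-bucket", "/tmp/build-data", 100  # placeholders

# List every object, then keep only the most recently modified ones.
objects = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    objects.extend(page.get("Contents", []))

newest = sorted(objects, key=lambda o: o["LastModified"], reverse=True)[:KEEP]

os.makedirs(LOCAL_DIR, exist_ok=True)
for obj in newest:
    s3.download_file(BUCKET, obj["Key"], os.path.join(LOCAL_DIR, os.path.basename(obj["Key"])))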
You can set lifecycle policies on the S3 bucket, which will remove objects after a certain period of time.
To copy only objects newer than a certain number of days, you will need to write a script.
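For the lifecycle-policy route mentioned above, here is a sketch of an expiry rule applied with boto3 (the bucket name and the 30-day window are just example values; the same rule can be created in the console or with the CLI):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-build-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-build-json",
                "Filter": {"Prefix": ""},    # empty prefix = whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": 30},  # example retention period
            }
        ]
    },
)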

AWS s3 sync to upload if file does not exist in target

I have uploaded about 1,000,000 files from my local directory to s3 buckets/subfolders and some of them have failed.
I would like to use the 'sync' option to capture those that did not make it the first time. The s3 modified date is the date/time my file was uploaded (which differs from my source file date/times).
As I understand, sync will upload a file to the target if it does not exist, if the file date has changed, or if the size is different.
Can I modify the command line to NOT use the file date as a consideration for syncing? I ONLY want to copy a file if it does not exist.
aws s3 sync \localserver\localshare\folder s3://mybucket/Folder1
aws s3 sync will compare the "last modified time".
For objects in S3, there is only one timestamp, LastModified, which is the time you uploaded the file.
Your local files (assuming a POSIX Linux file system) have three timestamps: last-access, last-modified and last-status-change. Only the last-modified time is used for the comparison.
Now suppose you uploaded 1M files and some of them failed. The files that did upload successfully will not be uploaded again by another sync, because their S3 LastModified time (the upload time) is not older than the local last-modified time (sync still has to check every object, which will take a considerable time for 1M objects).
In the meantime, you can use the aws s3 sync --size-only argument. It fits what you described, but be sure to check whether it is really what you need: in many cases files can keep the same size even after being modified (intentionally or accidentally), and --size-only will skip such same-size files.
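If the only criterion really is "the key does not exist yet", a small script can do that check itself. A rough sketch in Python with boto3 (the bucket, prefix and local folder are placeholders based on your command):

import os
import boto3

s3 = boto3.client("s3")
BUCKET, PREFIX = "mybucket", "Folder1/"
LOCAL_DIR = r"\localserver\localshare\folder"  # placeholder local path

# Collect the keys that already exist in the bucket.
existing = set()
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        existing.add(obj["Key"])

# Upload only the local files whose key is not present yet.
for root, _, files in os.walk(LOCAL_DIR):
    for name in files:
        path = os.path.join(root, name)
        key = PREFIX + os.path.relpath(path, LOCAL_DIR).replace(os.sep, "/")
        if key not in existing:
            s3.upload_file(path, BUCKET, key)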

How to keep both directory and s3 bucket updated via aws-cli?

I want to synchronize a directory on my machine with a bucket in S3. The problem is that the aws cli sync option does not seem to do what I expected.
The behavior that I'm looking for is such that when I run the command it evaluates the content of the local directory and the content of the s3 bucket and updates the one with the old content with the changes in the other.
I can't work out the best approach. Thanks in advance.
I think it's a one-way sync: if anything in the source is new or updated, it will be synced to the destination. Even if you have a newer file in the destination, it won't be synced down to your source; you would need to run another sync from the destination back to the source.
You can't do that with the AWS CLI; aws s3 sync s3://bucket . will only synchronize the data in one direction.
If you want a tool that behaves according to your needs, you will have to develop a Python script using https://github.com/boto/boto3
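For instance, a bare-bones "newer side wins" loop with boto3 might look like this (the bucket and folder names are placeholders, deletions are ignored, and files that exist only locally would still need a separate pass):

import os
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET, LOCAL_DIR = "my-bucket", "/path/to/folder"  # placeholders

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        path = os.path.join(LOCAL_DIR, key)
        remote_time = obj["LastModified"]
        local_time = None
        if os.path.exists(path):
            local_time = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
        if local_time is None or remote_time > local_time:
            os.makedirs(os.path.dirname(path), exist_ok=True)
            s3.download_file(BUCKET, key, path)  # remote copy is newer
        elif local_time > remote_time:
            s3.upload_file(path, BUCKET, key)    # local copy is newer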
The npm package (and CLI command) s3-cli can sync a local folder and an S3 bucket, in either direction.
s3-cli sync [--delete-removed] /path/to/folder/ s3://bucket/key/on/s3/
s3-cli sync [--delete-removed] s3://bucket/key/on/s3/ /path/to/folder/

Spark doesn't output .crc files on S3

When I use Spark locally, writing data to my local filesystem, it creates some useful .crc files.
Running the same job on AWS EMR and writing to S3, the .crc files are not written.
Is this normal? Is there a way to force the writing of .crc files on S3?
Those .crc files are just created by the low-level bits of the Hadoop FS binding so that it can identify when a block is corrupt and, on HDFS, switch to another datanode's copy of the data for the read and kick off re-replication of one of the good copies.
On S3, stopping corruption is left to AWS.
What you can get off S3 is the ETag of a file, which is the MD5 checksum on a small upload; on a multipart upload it is some other string, which again changes when you re-upload it.
You can get at this value with the Hadoop 3.1+ version of the S3A connector, though it's off by default because distcp gets very confused when uploading from HDFS. For earlier versions you can't get at it, nor does the aws s3 command show it. You'd have to try some other S3 library (it's just a HEAD request, after all).
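For what it's worth, that HEAD request is easy to do outside Hadoop; for example with boto3 (the bucket and key names are placeholders), bearing in mind the value is only an MD5 for single-part, unencrypted uploads:

import boto3

s3 = boto3.client("s3")
# HEAD request; no data is downloaded.
response = s3.head_object(Bucket="my-bucket", Key="output/part-00000.csv")
print(response["ETag"])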

Copy all objects to another S3 bucket in different region with different structure

I have an S3 bucket in Region A structured like this:
ProviderA-1-1
    31423423.jpg
ProviderB-1-1
    32423432.jpg
The top level folder is a unique image identifier. The filename is the version of the image.
I want to copy the images to a bucket in Region B, structured like this:
ProviderA-1-1.jpg
ProviderB-1-1.jpg
I.e. I don't care about the version; I just want the folder name (which is unique) to be the filename.
The reason i'm doing this is to have a flat structure to make use of image services like Imgix / ImageKit. (they provide on the fly image transformation for images, given a flat source origin)
So, my requirements are:
I need to copy a lot of images (millions of files, ~10 TB)
The destination bucket is in another region
I need to 'flatten' the structure, and change the name of each image to the name of the folder it is in (folder names aren't fixed)
I've seen a few answers here suggesting the aws cli is the best approach, but I'm not sure how I can achieve requirement 3 with that.
It sounds like I need to loop through the images one by one, changing the name before I copy. If a script is suggested, I'm most comfortable with .NET, so perhaps the AWS .NET SDK?
This is a one-off job, where I need to move the images as quickly and cheaply as possible.
Advice please?
Thanks :)
Yes, a script is required because you are moving and renaming the files.
If you're comfortable with .NET, then use that!
The basic program would be:
Create two S3 clients -- one for the source bucket (to obtain the listing) and one for the destination bucket, because you are using a different region and copy commands are sent to the destination bucket, which then pulls the file from the source bucket
Use ListObjects() to obtain a list of the source bucket. Note that it will return 1000 files at a time, so use NextMarker to request the subsequent batch.
Loop through each file and use CopyObject() to simultaneously copy and rename the file. Use your own logic to take the folder name and convert it into a filename. Each file will be copied directly between the buckets, without needing to download/upload (a sketch of this loop is shown below).
Continue, looping through the list of 1000 files and then get the next 1000 files, etc.
The process could be sped up by using multi-threading but the logic gets a bit hard. It might be easier to simply run a few copies of the program at the same time, each handling a different Prefix range (effectively, folder names).
It's a one-off job, so optimization isn't important.
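A minimal sketch of those steps in Python with boto3, for illustration (the bucket names, destination region and renaming rule here are assumptions based on the example keys; the AWS SDK for .NET offers equivalent ListObjectsV2/CopyObject calls):

import boto3

SRC_BUCKET = "source-bucket-region-a"   # placeholder
DST_BUCKET = "target-bucket-region-b"   # placeholder

src = boto3.client("s3")                              # source bucket's region
dst = boto3.client("s3", region_name="eu-west-1")     # placeholder destination region

paginator = src.get_paginator("list_objects_v2")      # handles the 1000-object pages
for page in paginator.paginate(Bucket=SRC_BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]                              # e.g. "ProviderA-1-1/31423423.jpg"
        folder, _, filename = key.partition("/")
        if not filename:
            continue                                  # skip keys that are not inside a folder
        new_key = folder + "." + filename.rsplit(".", 1)[-1]   # e.g. "ProviderA-1-1.jpg"
        dst.copy_object(
            Bucket=DST_BUCKET,
            Key=new_key,
            CopySource={"Bucket": SRC_BUCKET, "Key": key},
        )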
If you are adding more files in future, the best method would be to create an AWS Lambda function that is triggered whenever a new file is created in S3. The Lambda function would then copy the file to the destination, then exit.
Assuming you have no location constraints set up for your buckets, flattening would simply be:
aws s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/
This assumes you have the CLI installed and the required credentials set up correctly. Or you can pass them on the command line:
aws --profile profile_A2B --region XXX s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/ --acl yyy
You don't mention any performance requirements. There are many ways of making the transfer faster, depending on many factors. A few blind hints I can give:
See if transfer acceleration can help you.
In general S3 to S3 transfer is faster than S3 to/from non-S3 location.
See if you can create parallel batches by prefix like:
# Wildcards aren't supported in the S3 URI itself, so filter with --exclude/--include
for prefix in {a..z}
do
    aws s3 cp --recursive --exclude "*" --include "${prefix}*" s3://source_bucket/foo/ s3://target_bucket/ &
done
wait
If this is not a one time transfer and the transfer acceleration isn't cutting it for you, consider:
download from S3 (in region A) to a local HDD residing in region A.
transfer from local HDD in region A to a local HDD in region B using other methods like Aspera or FileCatalyst or whatever else you can find.
upload from local HDD in region B to S3 (in region B).
I have no practical data to share except that Aspera blows things like FTP out of water, it's not even a competition. YMMV.
John already covered the pseudo code. I'll just make one change to it: write two separate programs, one to fetch the list of filenames and a second to do the copying. It takes a lot of time to list files if you have millions of them.
Once you've listed the file names in a file, say one per line, it is pretty easy to parallelize, since you can split the file into chunks (e.g. split -l 1000 file_list).
Use xargs -P or GNU parallel to run multiple aws s3 cp commands at once, if you're using the shell instead of .NET.
Finally, don't forget to set the ACL (and other attributes like TTL etc.) on the target files during the copy. Doing that after the copy will take a long time.