Comparing files between two S3 buckets

I would like to compare the file contents of two S3-compatible buckets and identify files that are missing or that differ.
Should I use checksums to do this?

It appears that your requirement is to compare the contents of two Amazon S3 buckets and identify files that are missing or differ between the buckets.
To do this, you could use:
Object name: This, of course, will help find missing files.
Object size: A different size indicates different contents and the size is given with each bucket listing.
ETag: For objects uploaded in a single part, the ETag is an MD5 checksum of the object's contents. If the same file has a different ETag, then the contents are different. (For multipart uploads the ETag is not a plain MD5 of the whole object, so only compare ETags between objects that were uploaded the same way.)
Creation date: This is not actually a reliable way to identify differences, but it can be used with other metadata to determine whether you want to update a file. For example, if two files differ and the object in the destination bucket has a newer date than the object in the source bucket, you probably don't need to copy the file across. But if the source file was modified after the destination file, it's likely a candidate for re-copying.
Instead of implementing all the above logic yourself, you can also use the AWS Command-Line Interface (CLI). Its aws s3 sync command compares files between the source and destination, and then copies files that are modified or missing.
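If you do want to do the comparison yourself rather than rely on aws s3 sync, a minimal boto3 sketch of the key/size/ETag comparison could look like the following. The bucket names are placeholders, and the ETag caveat above applies: only compare ETags between objects uploaded the same way.

# Sketch: compare two buckets by key, size and ETag.
# Bucket names are placeholders; for an S3-compatible (non-AWS) service,
# pass endpoint_url when creating the client.
import boto3

def list_bucket(s3, bucket):
    """Return {key: (size, etag)} for every object in the bucket."""
    objects = {}
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            objects[obj["Key"]] = (obj["Size"], obj["ETag"])
    return objects

s3 = boto3.client("s3")
source = list_bucket(s3, "source-bucket")
destination = list_bucket(s3, "destination-bucket")

missing = sorted(k for k in source if k not in destination)
differing = sorted(k for k in source
                   if k in destination and source[k] != destination[k])
print("Missing from destination:", missing)
print("Different size or ETag:", differing)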

Related

Copy ~200,000 S3 files to new prefixes

I have ~200,000 S3 files that I need to partition, and have made an Athena query to produce a target S3 key for each of the original S3 keys. I can clearly create a script out of this, but how do I make the process robust/reliable?
I need to partition csv files using info inside each csv so that each file is moved to a new prefix in the same bucket. The files are mapped 1-to-1, but the new prefix depends on the data inside the file
The copy command for each would be something like:
aws s3 cp s3://bucket/top_prefix/file.csv s3://bucket/top_prefix/var1=X/var2=Y/file.csv
And I can make a single big script to copy them all using Athena and a bit of SQL, but I am concerned about doing this reliably so that I can be sure that everything is copied across, without the script failing, timing out, etc. Should I "just run the script"? From my machine, or is it better to run it on an EC2 instance first? Those kinds of questions.
This is a one-off, as the application code producing the files in s3 will start outputting directly to partitions.
If each file contains data for only one partition, then you can simply move the files as you have shown. This is quite efficient because the contents of the files do not need to be processed.
If, however, lines within the files each belong to different partitions, then you can use Amazon Athena to 'select' lines from an input table and output the lines to a destination table that resides in a different path, with partitioning configured. However, Athena does not "move" the files -- it simply reads them and then stores the output. If you were to do this for new data each time, you would need to use an INSERT statement to copy the new data into an existing output table, then delete the input files from S3.
Since it is one-off, and each file belongs in only one partition, I would recommend you simply "run the script". It will go slightly faster from an EC2 instance, but the data is not uploaded/downloaded -- it all stays within S3.
I often create an Excel spreadsheet with a list of input locations and output locations. I create a formula to build the aws s3 cp <input> <output_path> commands, copy them to a text file and execute it as a batch. Works fine!
You mention that the destination depends on the data inside the object, so it would probably work well as a Python script that would loop through each object, 'peek' inside the object to see where it belongs, then issue a copy_object() command to send it to the right destination. (smart-open · PyPI is a great library for reading from an S3 object without having to download it first.)
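A minimal sketch of that approach, assuming each CSV belongs to exactly one partition and (hypothetically) that the partition values live in columns named var1 and var2:

# Sketch: peek at each CSV's first row to decide its partition, then
# copy the object to the new prefix. Bucket, prefix and column names
# are assumptions based on the example above.
import csv
import boto3
from smart_open import open as s3_open   # pip install "smart_open[s3]"

s3 = boto3.client("s3")
bucket = "bucket"
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=bucket, Prefix="top_prefix/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".csv") or "var1=" in key:
            continue                       # skip non-CSV and already-copied keys
        with s3_open(f"s3://{bucket}/{key}", "r") as f:
            row = next(csv.DictReader(f))  # read only the first data row
        new_key = (f"top_prefix/var1={row['var1']}/var2={row['var2']}/"
                   f"{key.rsplit('/', 1)[-1]}")
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})
        print(key, "->", new_key)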

AWS s3 sync to upload if file does not exist in target

I have uploaded about 1,000,000 files from my local directory to s3 buckets/subfolders and some of them have failed.
I would like to use the 'sync' option to capture those that did not make it the first time. The s3 modified date is the date/time my file was uploaded (which differs from my source file date/times).
As I understand, sync will upload a file to the target if it does not exist, if the file date has changed, or if the size is different.
Can I modify the command line to NOT use the file date as a consideration for syncing? I ONLY want to copy a file if it does not exist.
aws s3 sync \localserver\localshare\folder s3://mybucket/Folder1
aws s3 sync will compare the "last modified time".
For the objects in S3, there is only one timestamp LastModified, which should be when you uploaded the files.
For your local files (assuming a POSIX/Linux file system), there are three timestamps: last-access, last-modified and last-status-change. Only the last-modified time is used for the comparison.
Now suppose you uploaded 1M files and some of them failed. The files that were uploaded successfully have an S3 LastModified newer than their local last-modified time, so another sync will not upload them again (sync still has to check whether those files are identical, and that validation will take considerably long for 1M objects).
Alternatively, you can use aws s3 sync --size-only. It fits what you described, but be sure to check whether it is really what you need: files can keep the same size even after being modified (intentionally or accidentally), and --size-only will skip such same-size files.
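If you really only want to upload files whose keys don't exist in the bucket yet, regardless of dates and sizes, a small boto3 sketch along these lines can do it (the bucket, prefix and local path are taken from your example and are assumptions):

# Sketch: upload only the local files whose keys are not already present
# in the bucket, ignoring timestamps and sizes entirely.
import os
import boto3

bucket = "mybucket"
prefix = "Folder1/"
local_root = r"\localserver\localshare\folder"   # your source path

s3 = boto3.client("s3")

# Collect the keys that already exist under the prefix.
existing = set()
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        existing.add(obj["Key"])

# Upload anything that is missing.
for dirpath, _dirs, files in os.walk(local_root):
    for name in files:
        path = os.path.join(dirpath, name)
        key = prefix + os.path.relpath(path, local_root).replace(os.sep, "/")
        if key not in existing:
            s3.upload_file(path, bucket, key)
            print("uploaded", key)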

Copy all objects to another S3 bucket in different region with different structure

I have an S3 bucket in Region A structured like this:
ProviderA-1-1/
    31423423.jpg
ProviderB-1-1/
    32423432.jpg
The top level folder is a unique image identifier. The filename is the version of the image.
I want to copy the images to a bucket in Region B, structured like this:
ProviderA-1-1.jpg
ProviderB-1-1.jpg
I.e. I don't care about the version. I just want the folder name (which is unique) to be the filename.
The reason I'm doing this is to have a flat structure to make use of image services like Imgix / ImageKit. (They provide on-the-fly image transformation for images, given a flat source origin.)
So, my requirements are:
I need to copy lots of images (millions of images, ~10 TB)
The destination bucket is in another region
I need to 'flatten' the structure and change the name of each image to the name of the folder it is in (the folder names aren't fixed)
I've seen a few answers here suggesting the aws cli is the best approach, but I'm not sure how I can achieve requirement 3 with that.
Sounds like I need to loop through the images one by one, changing the name before I copy. If a script is suggested, I'm most comfortable with .NET - so perhaps the AWS .NET SDK?
This is a one-off job, where I need to move the images as quickly and cheaply as possible.
Advice please?
Thanks :)
Yes, a script is required because you are moving and renaming the files.
If you're comfortable with .NET, then use that!
The basic program would be:
Because you are using a different region, create two S3 clients -- one for the source bucket (to obtain the listing) and one for the destination bucket (copy commands are sent to the destination bucket, which pulls the file from the source bucket)
Use ListObjects() to obtain a list of the source bucket. Note that it returns up to 1000 objects at a time, so use NextMarker to request the subsequent batches.
Loop through each file and use CopyObject() to simultaneously copy and rename the file. Use your own logic to take the folder name and convert it to a filename. Each file will be copied directly between the buckets, without needing to download/upload.
Continue, looping through the list of 1000 files and then get the next 1000 files, etc.
The process could be sped up by using multi-threading but the logic gets a bit hard. It might be easier to simply run a few copies of the program at the same time, each handling a different Prefix range (effectively, folder names).
It's a one-off job, so optimization isn't important.
If you are adding more files in future, the best method would be to create an AWS Lambda function that is triggered whenever a new file is created in S3. The Lambda function would then copy the file to the destination, then exit.
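For reference, here is a rough sketch of those steps in Python/boto3 (the .NET SDK calls, ListObjectsV2 and CopyObject, are analogous); the bucket names, regions and the "folder name + original extension" renaming rule are assumptions:

# Sketch: list the source bucket, flatten/rename, and copy cross-region.
import boto3

SOURCE_BUCKET = "source-bucket"                  # in Region A
DEST_BUCKET = "dest-bucket"                      # in Region B

src = boto3.client("s3", region_name="us-east-1")    # Region A (example)
dst = boto3.client("s3", region_name="eu-west-1")    # Region B (example)

# The paginator handles the 1000-objects-per-page listing for you.
for page in src.get_paginator("list_objects_v2").paginate(Bucket=SOURCE_BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]                         # e.g. "ProviderA-1-1/31423423.jpg"
        folder, _, filename = key.partition("/")
        if not filename:
            continue                             # skip anything not inside a folder
        new_key = f"{folder}.{filename.rsplit('.', 1)[-1]}"   # "ProviderA-1-1.jpg"
        # The copy request goes to the destination region and pulls from the source.
        dst.copy_object(Bucket=DEST_BUCKET, Key=new_key,
                        CopySource={"Bucket": SOURCE_BUCKET, "Key": key})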
Assuming you have no location constraints set up for your buckets, flattening would simply be:
aws s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/
This assumes you have the CLI installed and the required credentials set up correctly. Or you can pass them on the command line:
aws --profile profile_A2B --region XXX s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/ --acl yyy
You don't mention any performance requirements. There are many ways of making the transfer faster, depending on many factors. A few blind hints I can give are:
See if transfer acceleration can help you.
In general S3 to S3 transfer is faster than S3 to/from non-S3 location.
See if you can create parallel batches by prefix like:
# aws s3 cp does not expand wildcards in S3 paths, so filter with --exclude/--include
for prefix in {a..z}
do
  aws s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/ \
    --exclude "*" --include "${prefix}*" &
done
If this is not a one time transfer and the transfer acceleration isn't cutting it for you, consider:
download from S3 (in region A) to a local HDD residing in region A.
transfer from local HDD in region A to a local HDD in region B using other methods like Aspera or FileCatalyst or whatever else you can find.
upload from local HDD in region B to S3 (in region B).
I have no practical data to share except that Aspera blows things like FTP out of the water; it's not even a competition. YMMV.
John already covered the pseudo-code. I'll just make one change to it: write two separate programs, one to fetch the list of filenames and a second to do the copying. Listing takes a lot of time if you have millions of files.
Once you've listed the file names into a file, say one per line, it is pretty easy to parallelize, given you can split the file (say, split -l 1000 file_list).
Use xargs -P or GNU parallel to run multiple aws s3 cp commands at once, if you're using shell instead of .NET.
Finally don't forget to set the ACL (and other attributes like TTL etc) on target files during the copy. Doing that after the copy will take a long time.
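If you end up in Python rather than shell, the same "list first, then copy in parallel" idea looks roughly like this; the key file name, bucket names, ACL and flattening rule are assumptions:

# Sketch: read pre-listed keys from a file and copy them in parallel.
from concurrent.futures import ThreadPoolExecutor
import boto3

SOURCE_BUCKET = "source-bucket"
DEST_BUCKET = "dest-bucket"
s3 = boto3.client("s3")          # a single client can be shared across threads

def copy_one(key):
    new_key = key.split("/", 1)[0] + ".jpg"      # assumed flattening rule
    s3.copy_object(Bucket=DEST_BUCKET, Key=new_key,
                   CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
                   ACL="public-read")            # set the ACL during the copy
    return new_key

with open("file_list") as f:                     # produced by the listing step
    keys = [line.strip() for line in f if line.strip()]

with ThreadPoolExecutor(max_workers=32) as pool:
    for new_key in pool.map(copy_one, keys):
        print("copied", new_key)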

AWS Redshift: Load data from many buckets on S3

I am trying to load data from two different S3 buckets into a Redshift table. In each bucket, there are directories with dates in their names, and each directory contains many files, but there is no manifest.
Example S3 structure:
# Bucket 1
s3://bucket1/20170201/part-01
s3://bucket1/20170201/part-02
s3://bucket1/20170202/part-01
s3://bucket1/20170203/part-00
s3://bucket1/20170203/part-01
# Bucket 2
s3://bucket2/20170201/part-00
s3://bucket2/20170202/part-00
s3://bucket2/20170202/part-01
s3://bucket2/20170203/part-00
Let's say that data from both buckets for 20170201 and 20170202 should be loaded. One solution would be to run the COPY command four times - once per bucket-date pair. But I'm curious whether it could be done within a single COPY call. I've seen that a manifest file allows specifying several different files (including files from different buckets). However:
is there an option to use a prefix instead of the full object path in the manifest,
and can I somehow use the manifest in SQL by passing it as a string instead of a file location? I want to avoid creating temporary files on S3.
You can use a manifest file to specify different buckets, paths and files.
The Using a Manifest to Specify Data Files documentation shows an example:
{
  "entries": [
    {"url":"s3://mybucket-alpha/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-alpha/2013-10-05-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-05-custdata", "mandatory":true}
  ]
}
The documentation also says:
The URL in the manifest must specify the bucket name and full object path for the file, not just a prefix.
The intent of using a manifest file is to know exactly which files have been loaded into Amazon Redshift. This is particularly useful when loading files that become available on a regular basis. For example, if files appear every 5 minutes and a COPY command was run to load the data from a given prefix, then it is unclear which files have already been loaded. This leads to potentially double-loading files.
The remedy is to use a manifest file that clearly specifies exactly which files to load. This obviously needs some code to find the files, create the manifest file and then trigger the COPY command.
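For example, a rough boto3 sketch of that find-files / build-manifest / COPY step might look like this (bucket names, dates, table name and IAM role are placeholders):

# Sketch: build a manifest covering both buckets for the wanted dates,
# upload it to S3, and print the COPY statement to run against Redshift.
import json
import boto3

s3 = boto3.client("s3")
dates = ["20170201", "20170202"]
entries = []

for bucket in ("bucket1", "bucket2"):
    for date in dates:
        pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=f"{date}/")
        for page in pages:
            for obj in page.get("Contents", []):
                entries.append({"url": f"s3://{bucket}/{obj['Key']}", "mandatory": True})

manifest_key = "manifests/load_20170201_20170202.manifest"
s3.put_object(Bucket="bucket1", Key=manifest_key,
              Body=json.dumps({"entries": entries}).encode())

print(f"""
COPY my_table
FROM 's3://bucket1/{manifest_key}'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
MANIFEST;
""")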
It is not possible to load content from different buckets/paths without using a manifest file.

Copy files from S3 bucket to local machine using file index

I need to copy files from many subdirectories in an S3 bucket to my local machine. The file names are auto-generated and would be difficult to obtain without first using ls, but I do know that the target file is always the 2nd file in the subfolder by creation date.
Is there a way to reference a file in an S3 bucket subfolder by index?
I am envisioning doing this with aws cli, though I'm open to other suggestions.
I'm not aware of any way within S3 to list the second oldest object without listing all objects at a given prefix and then explicitly sorting that list by date. If you need to do this then here are a few ideas:
if objects are only ever added (never deleted), then you could perhaps use a key naming convention when objects are uploaded that allows you to easily locate the 2nd oldest object, e.g. 0001-xxx, 0002-xxx. Then you can find the 2nd oldest object by listing objects with the prefix 0002.
maintain an independent index of the objects in an RDBMS or KV database that allows you to easily locate the S3 key of the 2nd oldest object in any part of your S3 hierarchy. Possibly the DB is maintained via a Lambda function called when objects are put or deleted.
use a Lambda function triggered on object PUT that enumerates all of the objects in the relevant 'folder' and writes the key of the 2nd oldest object back to a kind of index object in that same folder (or as metadata on a known index object). Then you can find the 2nd oldest by getting the contents of the index object (or its metadata).
Option #2 might be the best, as it's simple, fast, and flexible (what if, as your app changes over time, you find that you also need to know the 4th oldest object, or the 2nd newest object?).
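For completeness, a rough Lambda sketch of option #3, triggered on s3:ObjectCreated:* and writing the key of the 2nd-oldest object into an index object in the same 'folder' (the index object name is an assumption):

# Sketch: keep a per-folder index object that names the 2nd-oldest object.
from urllib.parse import unquote_plus
import boto3

s3 = boto3.client("s3")
INDEX_NAME = "_second_oldest.idx"

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.endswith(INDEX_NAME):
            continue                          # ignore our own index writes
        folder = key.rsplit("/", 1)[0] + "/" if "/" in key else ""

        # List the folder and sort by creation time (LastModified).
        resp = s3.list_objects_v2(Bucket=bucket, Prefix=folder, Delimiter="/")
        objs = [o for o in resp.get("Contents", [])
                if not o["Key"].endswith(INDEX_NAME)]
        objs.sort(key=lambda o: o["LastModified"])

        if len(objs) >= 2:
            s3.put_object(Bucket=bucket, Key=folder + INDEX_NAME,
                          Body=objs[1]["Key"].encode())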
You could use this method to obtain the key of the second object (in key-name order) in a given bucket:
aws s3api list-objects-v2 --bucket BUCKET-NAME --query 'Contents[1].Key' --output text
This also works within a path if you add --prefix PATH. Note that the listing is sorted by key name rather than by creation date, so if you specifically need the 2nd file by date you will have to sort the results by LastModified.
However, you mention that you have many subdirectories, so you would have to know the names of all those subdirectories if you want to avoid doing a full bucket listing.
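If you don't mind listing, a boto3 sketch can discover the 'folders' with a delimiter listing and sort each one by LastModified to find the 2nd file by creation date (the bucket name and the local download directory are placeholders):

# Sketch: for each top-level 'folder', download its 2nd-oldest object.
import os
import boto3

BUCKET = "BUCKET-NAME"
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Enumerate the subfolders via a delimiter listing.
prefixes = []
for page in paginator.paginate(Bucket=BUCKET, Delimiter="/"):
    prefixes += [p["Prefix"] for p in page.get("CommonPrefixes", [])]

os.makedirs("downloads", exist_ok=True)
for prefix in prefixes:
    objs = []
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        objs += page.get("Contents", [])
    objs.sort(key=lambda o: o["LastModified"])    # oldest first
    if len(objs) >= 2:
        key = objs[1]["Key"]                      # the 2nd file by creation date
        s3.download_file(BUCKET, key, os.path.join("downloads", key.replace("/", "_")))
        print("downloaded", key)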