Copy limited number of files from S3? - amazon-web-services

We are using an S3 bucket to store a growing number of small JSON files (~1KB each) that contain some build-related data. Part of our pipeline involves copying these files from S3 and putting them into memory to do some operations.
That copy operation is done with an AWS CLI command that looks something like this:
aws s3 cp s3://bucket-path ~/some/local/path/ --recursive --profile dev-profile
The problem is that the number of json files on S3 is getting pretty large since more are being made every day. It's nothing even close to the capacity of the S3 bucket since the files are so small. However, in practical terms, there's no need to copy ALL these JSON files. Realistically the system would be safe just copying the most recent 100 or so. But we do want to keep older ones around for other purposes.
So my question boils down to: is there a clean way to copy a specific number of files from S3 (maybe sorted by most recent)? Is there some kind of pruning policy we can set on an S3 bucket to delete files older than X days or something?

The aws s3 sync command in the AWS CLI sounds perfect for your needs.
It will copy only files that are New or Modified since the last sync. However, it means that the destination will need to retain a copy of the 'old' files so that they are not copied again.
Alternatively, you could write a script (eg in Python) that lists the objects in S3 and then only copies objects added since the last time the copy was run.
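As a rough illustration of the scripted approach, here is a minimal Python sketch that picks the most recently modified objects and downloads only those. It assumes boto3; the function names, the default limit of 100, and the choice to flatten key paths into one local directory are illustrative assumptions, not something from the question. S3 returns listings in key order rather than by date, so the whole listing still has to be read before sorting.

```python
import os
from operator import itemgetter


def newest_keys(s3_client, bucket, prefix="", limit=100):
    """Return the keys of the `limit` most recently modified objects."""
    paginator = s3_client.get_paginator("list_objects_v2")
    objects = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches.
        objects.extend(page.get("Contents", []))
    objects.sort(key=itemgetter("LastModified"), reverse=True)
    return [obj["Key"] for obj in objects[:limit]]


def download_newest(s3_client, bucket, dest_dir, prefix="", limit=100):
    """Download the newest objects into dest_dir, using only the file name."""
    os.makedirs(dest_dir, exist_ok=True)
    for key in newest_keys(s3_client, bucket, prefix, limit):
        target = os.path.join(dest_dir, os.path.basename(key))
        s3_client.download_file(bucket, key, target)


# Hypothetical usage (bucket name and local path are placeholders):
#
# import boto3
# session = boto3.Session(profile_name="dev-profile")
# download_newest(session.client("s3"), "bucket-name", "/some/local/path")
```

Listing is cheap compared with downloading every file, but it still grows with the bucket; a date-based key prefix would let the listing itself be narrowed.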

You can set a lifecycle policy on the S3 bucket, which will remove objects after a certain period of time.
To copy only objects of a certain age (for example, those from the last few days), you will need to write a script.
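For reference, a lifecycle expiration rule can also be set from a script. The sketch below assumes boto3; the rule ID, the helper names, and the idea of scoping the rule by prefix are assumptions made for illustration. Note that this call replaces any lifecycle configuration already on the bucket, and that expiration deletes objects, which conflicts with keeping older files around.

```python
def expiration_rule(days, prefix="", rule_id="expire-old-objects"):
    """Build a lifecycle configuration that expires objects after `days` days."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},  # empty prefix = whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_expiration(s3_client, bucket, days, prefix=""):
    """Replace the bucket's lifecycle configuration with a single expiry rule."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=expiration_rule(days, prefix),
    )


# Hypothetical usage:
#
# import boto3
# apply_expiration(boto3.client("s3"), "bucket-name", days=30)
```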

Related

How to migrate data from s3 bucket to glacier?

I have a TB-sized S3 bucket with PDF files. I need to migrate the old files to Glacier. I know that I can create a lifecycle rule to migrate files that are older than a certain number of days. But in my case the bucket currently contains both old and new PDF files, and they were all added at the same time, so they may have the same upload date. In this case a lifecycle rule won't be useful.
In the PDF files there is a field called capture_date, so I need to migrate the files based on the capture_date (i.e. migrate all PDF files where capture_date < 2015-05-21, and so on).
Would a Fargate job be useful here? If so, please give a brief idea.
Please suggest your ideas. Thanks in advance
S3 by itself will not read your PDF files. Thus you have to read them yourself, extract the data that determines which ones are old and which are new, and use the AWS SDK (or CLI) to move them to Glacier.
Since the files are not too big, you could use S3 Batch Operations along with a Lambda function that changes the storage class to Glacier.
Alternatively, you could do this on an EC2 instance, using S3 Inventory's CSV list of your objects (assuming there is a large number of them).
And the most traditional way is to just list your bucket and iterate over each object.
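A minimal sketch of the "iterate and change the storage class" step, assuming boto3. How capture_date is read out of each PDF is not covered here: the sketch assumes you have already built a mapping from object key to capture date by some means of your own, and both function names are hypothetical. The storage class is changed by copying each object onto itself, which works with a single copy_object call for objects up to 5 GB.

```python
from datetime import date


def keys_to_archive(capture_dates, cutoff):
    """Return keys whose capture date is strictly before `cutoff`.

    `capture_dates` maps object key -> datetime.date, however you extracted it.
    """
    return sorted(key for key, captured in capture_dates.items() if captured < cutoff)


def move_to_glacier(s3_client, bucket, keys):
    """Rewrite each object in place with the GLACIER storage class."""
    for key in keys:
        s3_client.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
            StorageClass="GLACIER",
            MetadataDirective="COPY",  # keep the existing metadata
        )


# Hypothetical usage:
#
# import boto3
# old = keys_to_archive(capture_dates, date(2015, 5, 21))
# move_to_glacier(boto3.client("s3"), "my-pdf-bucket", old)
```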

AWS: Speed up copy of large number of very small files

I have a single bucket with a large number of very small text files (between 500 bytes and 1.2 KB). This bucket currently contains over 1.7 million files and will keep growing.
The way that I add data to this bucket is by generating batches of files (on the order of 50,000 files) and transferring those files into the bucket.
Now the problem is this. If I transfer the files one by one in a loop, it takes an unbearably long time. So if all the files are in a directory origin_directory, I would do
aws s3 cp origin_directory/filename_i s3://my_bucket/filename_i
I would do this command 50000 times.
Right now I'm testing this on a set of about 280K files. Doing this would take approximately 68 hours according to my calculations. However I found out that I can sync:
aws s3 sync origin_directory s3://my_bucket/
Now this works much, much faster (it will take about 5 hours, according to my calculations). However, the sync needs to figure out what to copy (files present in the directory and not present in the bucket). Since the number of files in the bucket will keep increasing, I'm thinking that this will take longer and longer as time moves on.
However, since I delete the information after every sync, I know that the sync operation needs to transfer all files in that directory.
So my question is, is there a way to start a "batch copy" similar to the sync, without actually doing the sync?
You can use:
aws s3 cp --recursive origin_directory/ s3://my_bucket/
This is the same as a sync, but it will not check whether the files already exist.
Also, see Use of Exclude and Include Filters to learn how to specify wildcards (eg all *.txt files).
When copying a large number of files using aws s3 sync or aws s3 cp --recursive, the AWS CLI will parallelize the copying, making it much faster. You can also play with the AWS CLI S3 Configuration to potentially optimize it for your typical types of files (eg copy more files simultaneously).
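If you would rather script the upload than tune the CLI, the same idea (many small uploads in flight at once, with no existence check) can be sketched in Python. This is not how the AWS CLI is implemented; it is a minimal illustration assuming boto3, with hypothetical function names and an arbitrary default of 16 worker threads.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def local_files(root):
    """Yield (local_path, s3_key) pairs for every file under `root`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Use forward slashes in keys regardless of the local OS.
            key = os.path.relpath(path, root).replace(os.sep, "/")
            yield path, key


def upload_directory(s3_client, root, bucket, workers=16):
    """Upload every file under `root` in parallel; return the uploaded keys."""
    def upload(item):
        path, key = item
        s3_client.upload_file(path, bucket, key)
        return key

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces iteration, so the first failed upload raises here.
        return list(pool.map(upload, local_files(root)))


# Hypothetical usage:
#
# import boto3
# upload_directory(boto3.client("s3"), "origin_directory", "my_bucket")
```

Threads suit this workload because each upload spends most of its time waiting on the network rather than using the CPU.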
Try using https://github.com/mondain/jets3t
It does the same thing but works in parallel, so it will complete the job much faster.

Copy all objects to another S3 bucket in different region with different structure

I have an S3 bucket in Region A structured like this:
ProviderA-1-1
31423423.jpg
ProviderB-1-1
32423432.jpg
The top level folder is a unique image identifier. The filename is the version of the image.
I want to copy the images to a bucket in Region B, structured like this:
ProviderA-1-1.jpg
ProviderB-1-1.jpg
I.e. I don't care about the version. I just want the folder name (which is unique) to be the filename.
The reason I'm doing this is to have a flat structure to make use of image services like Imgix / ImageKit (they provide on-the-fly image transformation, given a flat source origin).
So, my requirements are:
I need to copy lots of images (millions, ~10TB)
The destination bucket is in another region
I need to 'flatten' the structure, and change the name of each image to be the name of the folder it is in (folder names aren't fixed)
I've seen a few answers here suggesting the AWS CLI is the best approach, but I'm not sure how I can achieve requirement 3 with that.
Sounds like I need to loop through the images one by one, changing the name before I copy. If a script is suggested, I'm most comfortable with .NET, so perhaps the AWS .NET SDK?
This is a one-off job, where I need to move the images as quickly and cheaply as possible.
Advice please?
Thanks :)
Yes, a script is required because you are moving and renaming the files.
If you're comfortable with .NET, then use that!
The basic program would be:
Create two S3 clients: one for the source bucket (to obtain the listing) and one for the destination bucket. Two are needed because the buckets are in different regions, and copy commands are sent to the destination bucket, which pulls the file from the source bucket.
Use ListObjects() to obtain a list of the source bucket. Note that it will return 1000 files at a time, so use NextMarker to request the subsequent batch.
Loop through each file and use CopyObject() to simultaneously copy and rename the file. Use your own logic to take the folder name and convert it to a filename. Each file will be copied directly between the buckets, without needing to download/upload
Continue, looping through the list of 1000 files and then get the next 1000 files, etc.
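The steps above can be sketched as follows. The question asks about .NET, but the sketch is in Python with boto3 for brevity; the same calls exist in the AWS .NET SDK. It uses the list_objects_v2 paginator instead of ListObjects() with NextMarker, which is a substitution on my part, and flat_key() and copy_flattened() are hypothetical names. If a folder holds more than one version, later copies overwrite earlier ones under the same flattened name.

```python
import posixpath


def flat_key(key):
    """Map 'ProviderA-1-1/31423423.jpg' to 'ProviderA-1-1.jpg'.

    Returns None for keys that are not inside a top-level folder.
    """
    folder, sep, rest = key.partition("/")
    if not sep or not rest:
        return None  # object at the bucket root, or a folder placeholder
    extension = posixpath.splitext(rest)[1]
    return folder + extension


def copy_flattened(src_client, dst_client, src_bucket, dst_bucket):
    """Copy every object to the destination under its flattened name."""
    paginator = src_client.get_paginator("list_objects_v2")
    copied = 0
    for page in paginator.paginate(Bucket=src_bucket):
        for obj in page.get("Contents", []):
            new_key = flat_key(obj["Key"])
            if new_key is None:
                continue
            # The copy request goes to the destination region's client;
            # S3 pulls the data from the source bucket server-side.
            dst_client.copy_object(
                Bucket=dst_bucket,
                Key=new_key,
                CopySource={"Bucket": src_bucket, "Key": obj["Key"]},
            )
            copied += 1
    return copied


# Hypothetical usage (region names are placeholders):
#
# import boto3
# src = boto3.client("s3", region_name="us-east-1")
# dst = boto3.client("s3", region_name="eu-west-1")
# copy_flattened(src, dst, "source-bucket", "target-bucket")
```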
The process could be sped up by using multi-threading but the logic gets a bit hard. It might be easier to simply run a few copies of the program at the same time, each handling a different Prefix range (effectively, folder names).
It's a one-off job, so optimization isn't important.
If you are adding more files in future, the best method would be to create an AWS Lambda function that is triggered whenever a new file is created in S3. The Lambda function would then copy the file to the destination, then exit.
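A Lambda function of that kind might look roughly like the sketch below, assuming the Python runtime and an S3 "object created" event notification. The function name copy_new_objects, the rename parameter, and the bucket name in the commented wiring are hypothetical. Object keys arrive URL-encoded in S3 event notifications, so they are decoded before use.

```python
from urllib.parse import unquote_plus


def copy_new_objects(event, s3_client, dest_bucket, rename=lambda key: key):
    """Copy each object named in an S3 event notification to dest_bucket.

    `rename` turns the source key into the destination key; by default the
    key is kept unchanged.
    """
    copied = []
    for record in event.get("Records", []):
        src_bucket = record["s3"]["bucket"]["name"]
        # Keys in S3 event notifications are URL-encoded.
        src_key = unquote_plus(record["s3"]["object"]["key"])
        dest_key = rename(src_key)
        s3_client.copy_object(
            Bucket=dest_bucket,
            Key=dest_key,
            CopySource={"Bucket": src_bucket, "Key": src_key},
        )
        copied.append(dest_key)
    return copied


# Hypothetical wiring inside the Lambda function itself:
#
# import boto3
# s3 = boto3.client("s3")
#
# def lambda_handler(event, context):
#     return copy_new_objects(event, s3, "target-bucket")
```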
Assuming you have no location constraints set up for your buckets, flattening would simply be:
aws s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/
This assumes you have the CLI installed and the required credentials set up correctly. Or you can pass them on the command line:
aws --profile profile_A2B --region XXX s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/ --acl yyy
You don't mention any performance requirements. There are many ways to make the transfer faster, depending on many factors. A few blind hints I can give are:
See if transfer acceleration can help you.
In general S3 to S3 transfer is faster than S3 to/from non-S3 location.
See if you can create parallel batches by prefix like:
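# S3 URIs do not expand wildcards, so select each prefix with --exclude/--include
for prefix in {a..z}
do
aws s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/ --exclude "*" --include "${prefix}*" &
done
wait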
If this is not a one time transfer and the transfer acceleration isn't cutting it for you, consider:
download from S3 (in region A) to a local HDD residing in region A.
transfer from local HDD in region A to a local HDD in region B using other methods like Aspera or FileCatalyst or whatever else you can find.
upload from local HDD in region B to S3 (in region B).
I have no practical data to share except that Aspera blows things like FTP out of water, it's not even a competition. YMMV.
John already covered the pseudo code. I'll just make one change to it. Write two separate programs, one to fetch the list of filenames and second to copy. It takes a lot of time to list files if you have millions of them.
Once you've listed the file names in a file, say one per line, it would be pretty easy to parallelize given you can split the file (say split -l 1000 file_list splits).
Use xargs -P or GNU parallel to run multiple aws s3 cp commands at once, if you're using shell instead of .NET.
Finally don't forget to set the ACL (and other attributes like TTL etc) on target files during the copy. Doing that after the copy will take a long time.
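The two-program split can be sketched like this in Python, assuming boto3 for the listing. The first function writes one key per line to a file; the second reads that file back in fixed-size batches that can be handed to separate workers or processes. The function names and the batch size of 1000 are assumptions, and the one-key-per-line format would break on keys containing newlines.

```python
from itertools import islice


def write_key_list(s3_client, bucket, list_path):
    """Program 1: dump every object key in the bucket to a file, one per line."""
    paginator = s3_client.get_paginator("list_objects_v2")
    count = 0
    with open(list_path, "w", encoding="utf-8") as out:
        for page in paginator.paginate(Bucket=bucket):
            for obj in page.get("Contents", []):
                out.write(obj["Key"] + "\n")
                count += 1
    return count


def read_batches(list_path, batch_size=1000):
    """Program 2: yield lists of up to `batch_size` keys from the file."""
    with open(list_path, encoding="utf-8") as src:
        keys = (line.rstrip("\n") for line in src)
        while True:
            batch = list(islice(keys, batch_size))
            if not batch:
                return
            yield batch


# Hypothetical usage:
#
# import boto3
# write_key_list(boto3.client("s3"), "source-bucket", "file_list")
# for batch in read_batches("file_list"):
#     ...  # hand each batch to a worker that issues the copies
```

Keeping the list on disk also means a failed run can resume from a known batch instead of listing the bucket again.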

S3 move files year/month wise

I have a bucket (s3://Bucket1) and there are millions of files in that with format like below:
s3://Bucket1/yyyy-mm-dd/
I want to move these files like
s3://Bucket1/year/mm
Any help, script, method will be really helpful.
I have tried aws s3 cp s3://Bucket1/ s3://Bucket1/ --include "2017-01-01*", but this does not work well, and I would also have to add extra steps to delete the files.
The basic steps are:
Get a list of objects
Copy the objects to the new name
Delete the old objects
Get a list of objects
Given that you have millions of files, the best way to start is to use Amazon S3 Inventory to obtain a CSV file of all the objects.
Copy the objects to the new name
Then, write a script that reads the CSV file and issues a copy() command to copy the file to the new location. This could be written in any language that has an AWS SDK (eg Python).
Delete the old objects
Rather than individually deleting the objects, use S3 object lifecycle management to delete the old files. The benefits of using this method are:
There is no charge for the delete (whereas issuing millions of delete commands would involve a charge)
It can be done after the files have been copied, providing a chance to verify that all the files have been correctly copied (by checking the next S3 inventory output)
You could use the AWS CLI to issue an aws s3 mv command, which combines the copy and the delete, effectively providing a rename function. However, shell scripts aren't that easy to write, and if things fail halfway the files will be left in a mixed state. That's why I prefer the "copy all objects, and only then delete" method.
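The copy step might be scripted along these lines in Python with boto3. The sketch assumes the keys have already been read (and decoded) from the inventory CSV into an iterable, and it assumes the target layout is year/mm/ followed by the rest of the original key; the question does not say what should happen to the day component, so files with the same name on different days of a month would collide under this mapping. monthly_key() and copy_to_monthly_layout() are hypothetical names.

```python
import re

# Matches keys such as "2017-01-01/path/to/file.txt".
DATE_PREFIX = re.compile(r"^(\d{4})-(\d{2})-\d{2}/(.+)$")


def monthly_key(key):
    """Map '2017-01-01/file.txt' to '2017/01/file.txt'; None if no date prefix."""
    match = DATE_PREFIX.match(key)
    if match is None:
        return None
    year, month, rest = match.groups()
    return f"{year}/{month}/{rest}"


def copy_to_monthly_layout(s3_client, bucket, keys):
    """Copy each dated key to its year/month location; return the number copied."""
    copied = 0
    for key in keys:
        new_key = monthly_key(key)
        if new_key is None:
            continue  # leave anything that does not match the dated layout
        s3_client.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={"Bucket": bucket, "Key": key},
        )
        copied += 1
    return copied
```

Nothing is deleted here, which matches the "copy everything first, verify, then delete" order described above.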

AWS S3 Sync very slow when copying to large directories

When syncing data to an empty directory in S3 using AWS-CLI, it's almost instant. However, when syncing to a large directory (several million folders), it takes a very long time before even starting to upload / sync the files.
Is there an alternative method? It looks like it's trying to take account of all files in an S3 directory before syncing - I don't need that, and uploading the data without checking beforehand would be fine.
The sync command will need to enumerate all of the files in the bucket to determine whether a local file already exists in the bucket and if it is the same as the local file. The more documents you have in the bucket, the longer it's going to take.
If you don't need this sync behavior just use a recursive copy command like:
aws s3 cp --recursive . s3://mybucket/
and this should copy all of the local files in the current directory to the bucket in S3.
If you use the unofficial s3cmd from S3 Tools, you can use the --no-check-md5 option while using sync to disable the MD5 sums comparison to significantly speed up the process.
--no-check-md5 Do not check MD5 sums when comparing files for [sync].
Only size will be compared. May significantly speed up
transfer but may also miss some changed files.
Source: https://s3tools.org/usage
Example: s3cmd --no-check-md5 sync /directory/to/sync s3://mys3bucket/