AWS CLI - S3 how to replace a folder atomically? - amazon-web-services

Let's say I have a folder called /example in S3. This folder contains a file called a.txt.
Using the AWS CLI, how do I upload a local folder, also called example, and replace the current S3 /example atomically? The local folder contains a file called b.txt.
So, I want the behaviour to be that the new S3 /example folder only contains b.txt.
Basically, is there a way to atomically replace an entire folder in S3 with a new one via the AWS CLI?
Thank you!

No, you can't do that.
For starters, S3 was designed as an eventually consistent platform. That means that right after you do a write, you could still get old data back from S3. In practice this converged quickly (seconds), but there was no upper bound. (Amazon has always provided consistency guarantees for some sequences of operations, and since December 2020 S3 offers strong read-after-write consistency for individual objects. That guarantee is per object, though, so replacing a whole set of objects is still not atomic.)
Secondly, S3 does not have a concept of "folder" or "directory". The S3 namespace is flat. The only thing that the objects /example/a.txt and /example/b.txt have in common is that they start with the same string, just like /foobar.txt and /foobaz.txt begin with the same string. (The user interface does cheat a bit by treating the / character differently, giving the illusion of directories.)
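Because a "folder" is only a shared key prefix, replacing one comes down to two separate, non-atomic steps: upload the new objects, then delete the keys that no longer belong. Here is a minimal Python sketch of that planning step, assuming you already have the local relative paths and the keys currently stored under the prefix. The function name and its arguments are hypothetical, not part of any AWS tool.

```python
def plan_replace(local_files, remote_keys, prefix="example/"):
    """Return (keys_to_upload, keys_to_delete) for replacing one prefix.

    local_files: relative paths in the local folder, e.g. ["b.txt"]
    remote_keys: keys currently stored under `prefix` in the bucket
    """
    wanted = {prefix + name for name in local_files}
    # Every wanted key is uploaded (overwriting any object with the same key).
    to_upload = sorted(wanted)
    # Anything else under the prefix is stale and needs its own delete request;
    # readers can observe the bucket between the uploads and the deletes.
    to_delete = sorted(
        k for k in remote_keys if k.startswith(prefix) and k not in wanted
    )
    return to_upload, to_delete


uploads, deletes = plan_replace(["b.txt"], ["example/a.txt"])
print(uploads)  # ['example/b.txt']
print(deletes)  # ['example/a.txt']
```

With the CLI, aws s3 sync ./example s3://your-bucket/example --delete (bucket name is a placeholder) performs the same upload-then-delete sequence. It is convenient, but it is still not atomic.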

Related

Move large number of folders and files inside a GCS bucket

I have a bucket on GCP and at the top level of this bucket, I have a bunch of folders.
I want to create a new folder and move all of the other ones into it.
However, I've mounted my bucket with gcsfuse and tried traditional Linux mv commands. This is not allowed, apparently.
Likewise, I have also tried gsutil -m mv gs://mybucket/* gs://mybucket/new_folder/ and have received the command error that wildcards are not allowed in this operation.
What's the best option to get this large number of files moved into a new directory?
Posting this as a Community Wiki answer, based on the comments provided by @JohnHanley.
A few concepts to note for Cloud Storage.
Objects are immutable, which means you cannot rename them. You must copy objects and delete the originals to emulate changing the name.
Directories/Folders do not exist. The namespace is flat, all objects are in the root directory. The appearance of folders is just a part of the object name.
Cloud Storage supports internal object copy. Be careful not to use a feature which first downloads the object and then uploads it.
Considering this information, you will need to use a tool such as gsutil to rename and move the files as you would like.
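To make the copy-then-delete idea concrete, here is a small Python sketch that only computes the old-name to new-name mapping for such a move. It makes no Cloud Storage calls; the function name and the sample object names are hypothetical.

```python
def plan_move(object_names, new_folder="new_folder/"):
    """Map each existing object name to its name under `new_folder`.

    Objects that already live under `new_folder` are left alone.
    """
    moves = {}
    for name in object_names:
        if name.startswith(new_folder):
            continue
        moves[name] = new_folder + name
    return moves


moves = plan_move(["a/1.txt", "b/2.txt", "new_folder/c/3.txt"])
for old, new in moves.items():
    # Each pair becomes one server-side copy followed by one delete.
    print(f"copy {old} -> {new}, then delete {old}")
```

Each pair in the mapping would then be handed to whichever tool performs the internal copy and the delete of the original.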

Replace content in all files inside s3 bucket

I have an S3 bucket which is mapped to a domain, say xyz.com. Whenever a user registers on xyz.com, a file is created and stored in the S3 bucket. Now I have 1000s of files in S3 and I want to replace some text in those files. All files share a common name prefix, e.g. abc-{rand}.txt
The safest way of doing this would be to regenerate them again through the same process you originally used.
Personally I would try to avoid find and replace as it could lead to modifying parts that you did not intend.
Run multiple generations in parallel and overwrite the existing files. This ensures the files you generate match your expectations and will not need to be modified again.
As a suggestion, enable versioning before any of these operations if you want the ability to roll back quickly should the change need to be reverted.
Sadly, you can't do this in place in S3. You have to download them, change their content and re-upload.
This is because S3 is an object storage system, not a regular file system.
To simplify working with S3 files, you can use the third-party tool s3fs-fuse. It makes S3 appear like a filesystem on your OS.
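The download, change, re-upload cycle can be sketched in Python as below. This is only an illustration: the function name is hypothetical, and the s3 argument is assumed to behave like a boto3 S3 client, of which only get_object and put_object are used.

```python
def replace_in_object(s3, bucket, key, old, new, encoding="utf-8"):
    """Download one object, replace text, and upload it under the same key.

    `s3` is assumed to behave like a boto3 S3 client.
    Returns True if the object was rewritten, False if `old` was not found.
    """
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode(encoding)
    if old not in body:
        return False
    updated = body.replace(old, new)
    # put_object overwrites the whole object; S3 has no partial update.
    s3.put_object(Bucket=bucket, Key=key, Body=updated.encode(encoding))
    return True


# Hypothetical usage with boto3 (bucket and key names are placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   replace_in_object(s3, "my-bucket", "abc-123.txt", "old text", "new text")
```

Note that a plain put_object does not carry over the old object's metadata or content type, so pass those again if they matter.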

Copy all objects to another S3 bucket in different region with different structure

I have an S3 bucket in Region A structured like this:
ProviderA-1-1/
    31423423.jpg
ProviderB-1-1/
    32423432.jpg
The top level folder is a unique image identifier. The filename is the version of the image.
I want to copy the images to a bucket in Region B, structured like this:
ProviderA-1-1.jpg
ProviderB-1-1.jpg
I.e., I don't care about the version. I just want the folder name (which is unique) to be the filename.
The reason I'm doing this is to have a flat structure to make use of image services like Imgix / ImageKit (they provide on-the-fly image transformation, given a flat source origin).
So, my requirements are:
1. I need to copy lots of images (millions, ~10TB)
2. The destination bucket is in another region
3. I need to 'flatten' the structure, and change the name of each image to the name of the folder it is in (folder names aren't fixed)
I've seen a few answers here suggesting the AWS CLI is the best approach, but I'm not sure how I can achieve 3. with that.
Sounds like I need to loop through the images one by one, changing the name before I copy. If a script is suggested, I'm most comfortable with .NET, so perhaps the AWS .NET SDK?
This is a once-off job, where I need to move the images as quickly and cheaply as possible.
Advice please?
Thanks :)
Yes, a script is required because you are moving and renaming the files.
If you're comfortable with .NET, then use that!
The basic program would be:
Create two S3 clients: one for the source bucket (to obtain the listing) and one for the destination bucket. You need both because the buckets are in different regions, and copy commands are sent to the destination bucket, which pulls the file from the source bucket.
Use ListObjects() to obtain a list of the source bucket. Note that it will return 1000 files at a time, so use NextMarker to request the subsequent batch.
Loop through each file and use CopyObject() to simultaneously copy and rename the file. Use your own logic to take the folder name and convert it to a filename. Each file will be copied directly between the buckets, without needing to download/upload
Continue, looping through the list of 1000 files and then get the next 1000 files, etc.
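The steps above can be sketched roughly as follows. This is in Python rather than .NET, and it swaps ListObjects()/NextMarker for the newer list_objects_v2 call with continuation tokens. The function names are hypothetical, and src and dst are assumed to behave like boto3 S3 clients created for the two regions.

```python
def flattened_key(key):
    """'ProviderA-1-1/31423423.jpg' -> 'ProviderA-1-1.jpg' (folder name + extension)."""
    folder, _, filename = key.partition("/")
    _, dot, ext = filename.rpartition(".")
    return folder + (dot + ext if dot else "")


def copy_flattened(src, dst, src_bucket, dst_bucket):
    """List every object in src_bucket and copy it to dst_bucket under a flat key.

    `src` and `dst` are assumed to behave like boto3 S3 clients, one per region.
    Returns the number of objects copied.
    """
    copied = 0
    kwargs = {"Bucket": src_bucket}
    while True:
        page = src.list_objects_v2(**kwargs)  # at most 1000 keys per call
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip zero-byte "folder" placeholder objects
            # The copy request goes to the destination; S3 pulls from the source.
            dst.copy_object(
                Bucket=dst_bucket,
                Key=flattened_key(key),
                CopySource={"Bucket": src_bucket, "Key": key},
            )
            copied += 1
        if not page.get("IsTruncated"):
            return copied
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

If a folder holds more than one version of an image, they all map to the same destination key and the last one copied wins, so decide up front which version you want to keep.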
The process could be sped up by using multi-threading but the logic gets a bit hard. It might be easier to simply run a few copies of the program at the same time, each handling a different Prefix range (effectively, folder names).
It's a one-off job, so optimization isn't important.
If you are adding more files in future, the best method would be to create an AWS Lambda function that is triggered whenever a new file is created in S3. The Lambda function would then copy the file to the destination, then exit.
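Assuming you have no location constraints set up for your buckets, a bulk copy between them would simply be the following. Note that cp --recursive keeps each key's path relative to foo/, so on its own it copies the objects without renaming them: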
aws s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/
This assumes you have the CLI installed and the required credentials set up correctly. Alternatively, you can pass a profile and region on the command line:
aws --profile profile_A2B --region XXX s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/ --acl yyy
You don't mention any performance requirements. There are many ways of making the transfer faster, depending on many factors. A few blind hints I can give are:
See if transfer acceleration can help you.
In general S3 to S3 transfer is faster than S3 to/from non-S3 location.
See if you can create parallel batches by prefix like:
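for prefix in {a..z}
do
  # aws s3 cp does not expand wildcards in S3 paths; filter with --exclude/--include instead
  aws s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/ --exclude "*" --include "${prefix}*" &
done
wait  # block until all background copies have finished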
If this is not a one time transfer and the transfer acceleration isn't cutting it for you, consider:
download from S3 (in region A) to a local HDD residing in region A.
transfer from local HDD in region A to a local HDD in region B using other methods like Aspera or FileCatalyst or whatever else you can find.
upload from local HDD in region B to S3 (in region B).
I have no practical data to share except that Aspera blows things like FTP out of the water; it's not even a competition. YMMV.
John already covered the pseudo code. I'll just make one change to it. Write two separate programs, one to fetch the list of filenames and second to copy. It takes a lot of time to list files if you have millions of them.
Once you've listed the file names in a file, say one per line, it would be pretty easy to parallelize given you can split the file (say split -l 1000 file_list splits).
If you're using shell instead of .NET, use xargs -P or GNU parallel to run multiple aws s3 cp commands at once.
Finally don't forget to set the ACL (and other attributes like TTL etc) on target files during the copy. Doing that after the copy will take a long time.
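For the "list first, copy second" split, the copy program could look something like this sketch. It uses Python's standard thread pool instead of xargs or GNU parallel. The copy_all name and the copy_one callback are hypothetical; copy_one stands for whatever performs a single S3 copy (including setting the ACL).

```python
from concurrent.futures import ThreadPoolExecutor


def copy_all(keys, copy_one, workers=16):
    """Run copy_one(key) for every key using a pool of worker threads.

    Returns a list of (key, error) pairs for the copies that raised.
    """
    def attempt(key):
        try:
            copy_one(key)
            return None
        except Exception as exc:  # collect failures so they can be retried later
            return (key, exc)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(attempt, keys)
        return [r for r in results if r is not None]


# Hypothetical usage: file_list holds one key per line, written by the listing program.
#   with open("file_list") as fh:
#       keys = [line.rstrip("\n") for line in fh if line.strip()]
#   failures = copy_all(keys, copy_one=my_copy_function)
```

Threads are a reasonable fit here because each copy spends its time waiting on S3 rather than using the CPU. The returned failures can be written to a new list and fed through the same function again.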

Is there anything to be gained by using 'folders' in an s3 bucket?

I am moving a largish number of jpgs (several hundred thousand) from a static filesystem to amazon s3.
On the old filesystem, I grouped files into subfolders to keep the total number of files per folder manageable.
For example, a file
4aca29c7c0a76c1cbaad40b2693e6bef.jpg
would be saved to:
/4a/ca/29/4aca29c7c0a76c1cbaad40b2693e6bef.jpg
From what I understand, S3 doesn't respect hierarchical namespaces. So if I were to use 'folders' on S3, the object, including the /'s, would really just be in a flat namespace.
Still, according to the docs, Amazon recommends mimicking a structured filesystem when working with S3.
So I am wondering: Is there anything to be gained using the above folder structure to organize files on s3? Or in this case am I better off just adding the files to s3 without any kind of 'folder' structure.
Performance is not impacted by the use (or non-use) of folders.
Some systems can use folders for easier navigation of the files. For example, Amazon Athena can scan specific sub-directories when querying data rather than having to read every file.
If your bucket is being used for one specific purpose, there is no reason to use folders. However, if it contains different types of data, then you might consider at least a top-level set of folders to keep data separated.
Another potential reason for using folders is security. A bucket policy can grant access to objects based upon a prefix (which is a folder name). However, this is likely not relevant for your use case.
Using "folders" has no performance impact on S3, either way. It doesn't make it faster, and it doesn't make it slower.
The value of delimiting your object keys with / is in organization, both machine-friendly and human-friendly.
If you're trolling through a bucket in the console, troubleshooting, those meaningless noise-filled keys are a hassle to paginate through, only a few dozen at a time.
The console automatically groups objects into imaginary folders based on the / delimiters, so finding your object to inspect it (check headers, metadata, etc.) is much easier if you can just click on 4a, then ca, then 29.
The S3 ListObjects APIs support requesting all the objects with a certain key prefix, but they also support finding all the common prefixes before the next delimiter, so you can send API requests to list prefix 4a/ca/ with delimiter / and it will only return the "folders" one level deep, which it refers to as "common prefixes."
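To show what that grouping means, here is a small local emulation in Python of a prefix-plus-delimiter listing over a flat set of keys. It is not the S3 API itself, only an illustration of the rule. The function name and all sample keys except the first are made up.

```python
def list_one_level(keys, prefix="", delimiter="/"):
    """Emulate locally how a prefix + delimiter listing groups flat keys.

    Returns (objects, common_prefixes): keys directly under `prefix`, and the
    distinct "folders" one level below it.
    """
    objects, common = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Everything up to and including the next delimiter is rolled up.
            common.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common)


keys = [
    "4a/ca/29/4aca29c7c0a76c1cbaad40b2693e6bef.jpg",
    "4a/ca/3f/4aca3f.jpg",
    "4a/ca/readme.txt",
    "4b/00/11/4b0011.jpg",
]
print(list_one_level(keys, prefix="4a/ca/"))
# (['4a/ca/readme.txt'], ['4a/ca/29/', '4a/ca/3f/'])
```

Against a real bucket, the equivalent boto3 request would be list_objects_v2 with Prefix="4a/ca/" and Delimiter="/"; the rolled-up entries come back under CommonPrefixes in the response.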
This is less meaningful if your object keys are fully opaque and convey nothing more about the objects, as opposed to using key prefixes like images/ and thumbnails/ and videos/.
Having been an admin and working with S3 for a number of years, and having worked with buckets with key naming schemes designed by different teams, I would definitely recommend using some / delimiters for organization purposes. The buckets without them become more of a hassle to navigate over time.
Note that the console does allow you to "create folders," but this is more of an illusion -- there is no need to actually do this, unless you're loading a bucket manually. When you create a folder in the console, it just creates an empty object with a / at the end.

Top level solution to rename AWS bucket item's folder names?

I've inherited a project at work. It's essentially a niche content repository, and we use S3 to store the content. The project was severely outdated, and I'm in the process of a thorough update.
For some unknown and undocumented reason, the content is stored in an AWS S3 bucket with the pattern web_cl_000000$DB_ID$CONTENT_NAME. So, one particular folder can be named web_cl_0000003458zyxwv. This makes no sense, and requires a bit of transformation logic to construct a URL to serve up the content!
I can write a Python script using the boto3 library to do an item-by-item rename, but would like to know if there's a faster way to do so. There are approximately 4M items in that bucket, which will take quite a long time.
That isn't possible, because the folders are an illusion derived from the strings between / delimiters in the object keys.
Amazon S3 has a flat structure with no hierarchy like you would see in a typical file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using key name prefixes for objects. (emphasis added)
http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
The console contributes to the illusion by allowing you to "create" a folder, but all that actually does is create a 0-byte object with / as its last character, which the console will display as a folder whether there are other objects with that prefix or not, making it easier to upload objects manually with some organization.
But any tool or technique that allows renaming folders in S3 will in fact be making a copy of each object with the modified name, then deleting the old object, because S3 does not actually support rename or move, either -- objects in S3, including their key and metadata, are actually immutable. Any "change" is handled at the API level with a copy/overwrite or copy-then-delete.
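In code, such a "rename" looks roughly like the sketch below. The function names are hypothetical, the s3 argument is assumed to behave like a boto3 S3 client (only copy_object and delete_object are used), and the mapping from old to new prefix is left to the caller.

```python
def rename_object(s3, bucket, old_key, new_key):
    """Emulate a rename: server-side copy to the new key, then delete the old one."""
    if old_key == new_key:
        return
    s3.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": old_key},
    )
    # Delete only after the copy call has returned without raising.
    s3.delete_object(Bucket=bucket, Key=old_key)


def rename_prefix(s3, bucket, keys, old_prefix, new_prefix):
    """Rename every key that starts with old_prefix; returns the number renamed."""
    count = 0
    for key in keys:
        if key.startswith(old_prefix):
            rename_object(s3, bucket, key, new_prefix + key[len(old_prefix):])
            count += 1
    return count


# Hypothetical usage (bucket name and new prefix are placeholders):
#   rename_prefix(s3, "my-bucket", keys, "web_cl_0000003458zyxwv/", "new-name/")
```

A single copy_object call handles objects up to 5 GB; anything larger needs a multipart copy instead.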
Worth noting, S3 should be able to easily sustain 100 such requests per second, so with asynchronous requests or multi-threaded code, or even several processes each handling a shard of the keyspace, you should be able to do the whole thing in a few hours.
Note also that the less sorted (more random) the new keys are in the requests, the harder you can push S3 during a mass-write operation like this. Sending the requests so that the new keys are in lexical order will be the most likely scenario in which you might see 503 Slow Down errors... in which case, you just back off and retry... but if the new keys are not ordered, S3 can more easily accommodate a large number of requests.
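The "back off and retry" part can be sketched as a small wrapper with exponential backoff and jitter. Everything here is an assumption for illustration: the function name, the default delays, and the is_slow_down predicate that the caller supplies to recognise a 503 Slow Down error.

```python
import random
import time


def with_backoff(call, is_slow_down, max_attempts=6, base_delay=0.5, sleep=time.sleep):
    """Call `call()`; on a throttling error, wait and retry with exponential backoff.

    is_slow_down(exc) decides whether an exception is a retryable 503 Slow Down.
    Any other exception, or the last failed attempt, is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            if not is_slow_down(exc) or attempt == max_attempts - 1:
                raise
            # The delay doubles on each attempt; jitter spreads out retries
            # coming from parallel workers.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))


# Hypothetical predicate for boto3 (assumes the error code string is "SlowDown"):
#   from botocore.exceptions import ClientError
#   def is_slow_down(exc):
#       return isinstance(exc, ClientError) and exc.response["Error"]["Code"] == "SlowDown"
```

Each copy or delete request would then be wrapped as with_backoff(lambda: ..., is_slow_down), so a throttled worker slows itself down instead of failing the whole run.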