I have an S3 bucket which is mapped to a domain, say xyz.com. Whenever a user registers on xyz.com, a file is created and stored in the S3 bucket. I now have thousands of files in S3 and I want to replace some text in those files. All files share a common name prefix, e.g. abc-{rand}.txt
The safest way of doing this would be to regenerate the files through the same process you originally used.
Personally I would try to avoid find-and-replace, as it could lead to modifying parts that you did not intend.
Run multiple generations in parallel and overwrite the existing files. This will ensure the files you generate match your expectations and will not need to be modified again.
As a suggestion, enable versioning before any of these operations if you want the ability to roll back quickly should anything need to be reverted.
Sadly, you can't do this in place in S3. You have to download them, change their content and re-upload.
This is because S3 is an object storage system, not a regular file system.
To simplify working with S3 files, you can use the third-party tool s3fs-fuse, which makes S3 appear like a filesystem on your OS.
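If you do go the download/edit/re-upload route, here is a minimal sketch with the AWS CLI and GNU sed, assuming the bucket name and the old/new strings below are placeholders:

# Pull down only the matching files (bucket name is a placeholder).
aws s3 cp s3://your-bucket/ ./work --recursive --exclude "*" --include "abc-*.txt"

# Replace the text in every downloaded file (GNU sed shown).
find ./work -name 'abc-*.txt' -exec sed -i 's/old-text/new-text/g' {} +

# Upload the edited files back, overwriting the originals.
aws s3 cp ./work s3://your-bucket/ --recursive --exclude "*" --include "abc-*.txt"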
I have an S3 Bucket called Facebook
The structure is like this:
Facebook/AUS/transformedfiles/YYYYMMDDHH/payments.csv
Facebook/IND/transformedfiles/YYYYMMDDHH/payments.csv
Facebook/SEA/transformedfiles/YYYYMMDDHH/payments.csv
Is there a way to copy all payments.csv to AWS Redshift?
something like:
copy payments Facebook/*/transformedfiles/YYYYMMDDHH/payments.csv
No. The COPY command's FROM clause accepts an object prefix and implies a trailing wildcard, so there is no way to express a wildcard in the middle of the path.
If you want to load specific files, you'll need to use a manifest file. You would build this manifest by calling ListObjects and programmatically selecting the files you want.
A manifest file is also necessary if you're creating the files and immediately uploading them, because S3 is eventually consistent -- if you rely on it selecting files with a prefix, it might miss some.
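As a hedged sketch (the run hour, manifest location, and IAM role below are placeholders), you could build the manifest with the AWS CLI and jq and then point COPY at it:

# Collect every payments.csv for one run hour into a Redshift manifest.
RUN_HOUR=2023010100   # placeholder for YYYYMMDDHH

aws s3api list-objects-v2 \
    --bucket Facebook \
    --query "Contents[?ends_with(Key, 'transformedfiles/${RUN_HOUR}/payments.csv')].Key" \
    --output json |
  jq '{entries: [.[] | {url: ("s3://Facebook/" + .), mandatory: true}]}' \
  > payments.manifest

aws s3 cp payments.manifest s3://Facebook/manifests/payments-${RUN_HOUR}.manifest

# Then in Redshift (role ARN is a placeholder):
#   COPY payments
#   FROM 's3://Facebook/manifests/payments-2023010100.manifest'
#   IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
#   MANIFEST CSV;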
I have a REST API and I'm using S3 to store images, .zip files, and other media like video. Is it common practice to use a single bucket for everything? Or divide buckets by file type?
For example, here are some of the kinds of content I have:
.jpg
.png
.zip
.mov
.maya
.obj
A single Amazon S3 bucket can contain any number of objects.
Reasons for using separate buckets are typically:
Desire for creating a bucket in a different region
Separation of duties (eg keep HR information separate)
Separation of purpose (eg keep test files separate from Production files)
Separation of systems (eg Inventory system vs Customer Service system)
There is no reason to use a different bucket for different file types, unless those file types are used for different purposes (like the above).
You can store all your files in a single bucket without a problem. If you wish to have more separation, S3 keys can contain slashes, similar to URL paths. For example you can have:
<your_bucket_name>/images/<key1>.png
<your_bucket_name>/images/<key2>.jpg
<your_bucket_name>/videos/<key3>.mp4
This will keep all your files in a single bucket, but in the AWS console they will be shown split up, similar to a file system with folders. Note that in order to access a resource stored in S3 you will need to use the full path, e.g. /images/<key1>.png
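For illustration only (bucket and file names are placeholders), the "folders" are nothing more than key prefixes, so nothing needs to be created in advance:

# Upload under different prefixes; the "folders" appear automatically.
aws s3 cp ./texture.png s3://<your_bucket_name>/images/texture.png
aws s3 cp ./clip.mp4 s3://<your_bucket_name>/videos/clip.mp4

# Listing by prefix shows only that "folder".
aws s3 ls s3://<your_bucket_name>/images/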
I have an S3 bucket in Region A structured like this:
ProviderA-1-1/
    31423423.jpg
ProviderB-1-1/
    32423432.jpg
The top level folder is a unique image identifier. The filename is the version of the image.
I want to copy the images to a bucket in Region B, structured like this:
ProviderA-1-1.jpg
ProviderB-1-1.jpg
I.e. I don't care about the version; I just want the folder name (which is unique) to be the filename.
The reason I'm doing this is to have a flat structure to make use of image services like Imgix / ImageKit (they provide on-the-fly image transformation, given a flat source origin).
So, my requirements are:
I need to copy a lot of images (millions of them, ~10TB)
The destination bucket is in another region
I need to 'flatten' the structure and change the name of each image to be the name of the folder it is in (folder names aren't fixed)
I've seen a few answers here suggesting the AWS CLI is the best approach, but I'm not sure how I can achieve requirement 3 with that.
It sounds like I need to loop through the images one by one, changing the name before I copy. If a script is suggested, I'm most comfortable with .NET - so perhaps the AWS .NET SDK?
This is a one-off job, where I need to move the images as quickly and cheaply as possible.
Advice please?
Thanks :)
Yes, a script is required because you are moving and renaming the files.
If you're comfortable with .NET, then use that!
The basic program would be:
Because you are using a different region, create two S3 clients -- one for the source bucket (to obtain the listing) and one for the destination bucket (copy commands are sent to the destination bucket, which pulls the file from the source bucket)
Use ListObjects() to obtain a list of the source bucket. Note that it will return 1000 files at a time, so use NextMarker to request the subsequent batch.
Loop through each file and use CopyObject() to copy and rename the file in one step, using your own logic to turn the folder name into a filename (see the CLI sketch after these steps). Each file is copied directly between the buckets, without needing to download/upload
Continue, looping through the list of 1000 files and then get the next 1000 files, etc.
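If it helps to see the flow before writing the .NET version, the same steps expressed with the AWS CLI look roughly like this (bucket names are placeholders, and it assumes every key looks like ProviderA-1-1/31423423.jpg):

SOURCE_BUCKET=source-bucket-in-region-a   # placeholder
DEST_BUCKET=dest-bucket-in-region-b       # placeholder

# list-objects-v2 returns 1000 keys per call; the CLI pages through the
# results for you when reading them all like this.
aws s3api list-objects-v2 --bucket "$SOURCE_BUCKET" --query 'Contents[].Key' --output text |
  tr '\t' '\n' |
  while read -r key; do
    new_key="${key%%/*}.jpg"    # folder name becomes the filename
    # Server-side copy: the object goes bucket-to-bucket, no download/upload.
    aws s3api copy-object \
        --copy-source "$SOURCE_BUCKET/$key" \
        --bucket "$DEST_BUCKET" \
        --key "$new_key" >/dev/null
  done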
The process could be sped up by using multi-threading but the logic gets a bit hard. It might be easier to simply run a few copies of the program at the same time, each handling a different Prefix range (effectively, folder names).
It's a one-off job, so optimization isn't important.
If you are adding more files in future, the best method would be to create an AWS Lambda function that is triggered whenever a new file is created in S3. The Lambda function would then copy the file to the destination, then exit.
Assuming you have no location constraints set up for your buckets, flattening would simply be:
aws s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/
This assumes you have the CLI installed and the required credentials set up correctly. Or you can pass them on the command line:
aws --profile profile_A2B --region XXX s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/ --acl yyy
You don't mention any performance requirements. There are many ways of making the transfer faster, depending on many factors. A few blind hints I can give:
See if transfer acceleration can help you.
In general S3 to S3 transfer is faster than S3 to/from non-S3 location.
See if you can create parallel batches by prefix (note that the s3:// URI itself doesn't accept wildcards, so the prefix goes into an --include filter), like:
for prefix in {a..z}
do
    aws s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/ \
        --exclude "*" --include "${prefix}*" &
done
wait
If this is not a one time transfer and the transfer acceleration isn't cutting it for you, consider:
download from S3 (in region A) to a local HDD residing in region A.
transfer from local HDD in region A to a local HDD in region B using other methods like Aspera or FileCatalyst or whatever else you can find.
upload from local HDD in region B to S3 (in region B).
I have no practical data to share except that Aspera blows things like FTP out of the water; it's not even a competition. YMMV.
John already covered the pseudocode. I'll just make one change to it: write two separate programs, one to fetch the list of filenames and a second to copy. It takes a lot of time to list files if you have millions of them.
Once you've listed the file names in a file, say one per line, it would be pretty easy to parallelize given you can split the file (say split -l 1000 file_list splits).
Use xargs -P or GNU parallel to run multiple aws s3 cp commands at once, if you're using shell instead of .NET.
Finally don't forget to set the ACL (and other attributes like TTL etc) on target files during the copy. Doing that after the copy will take a long time.
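A hedged sketch of that list-then-parallelize approach (bucket names, the concurrency level, and the private ACL are placeholders):

# Step 1: list once and save the keys to a file, one key per line.
aws s3api list-objects-v2 --bucket source-bucket \
    --query 'Contents[].Key' --output text | tr '\t' '\n' > file_list

# Step 2: fan the copies out, 8 at a time; each copy renames the object to
# its folder name and sets the ACL in the same request.
xargs -P 8 -I {} sh -c '
    key="$1"
    aws s3api copy-object \
        --copy-source "source-bucket/$key" \
        --bucket target-bucket \
        --key "${key%%/*}.jpg" \
        --acl private >/dev/null
' _ {} < file_list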
I am moving a largish number of jpgs (several hundred thousand) from a static filesystem to Amazon S3.
On the old filesystem, I grouped files into subfolders to keep the number of files per folder manageable.
For example, a file
4aca29c7c0a76c1cbaad40b2693e6bef.jpg
would be saved to:
/4a/ca/29/4aca29c7c0a76c1cbaad40b2693e6bef.jpg
From what I understand, S3 doesn't respect hierarchical namespaces. So if I were to use 'folders' on S3, the objects, including the /'s, would really just be in a flat namespace.
Still, according to the docs, Amazon recommends mimicking a structured filesystem when working with S3.
So I am wondering: Is there anything to be gained using the above folder structure to organize files on s3? Or in this case am I better off just adding the files to s3 without any kind of 'folder' structure.
Performance is not impacted by the use (or non-use) of folders.
Some systems can use folders for easier navigation of the files. For example, Amazon Athena can scan specific sub-directories when querying data rather than having to read every file.
If your bucket is being used for one specific purpose, there is no reason to use folders. However, if it contains different types of data, then you might consider at least a top-level set of folders to keep data separated.
Another potential reason for using folders is for security. A bucket policy can grant access to buckets based upon a prefix (which is a folder name). However, this is likely not relevant for your use-case.
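Purely for illustration (bucket name, account ID, and role name are placeholders), such a prefix-scoped bucket policy might look like:

aws s3api put-bucket-policy --bucket my-bucket --policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "AllowHrPrefixOnly",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::123456789012:role/HrRole"},
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::my-bucket/hr/*"
  }]
}'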
Using "folders" has no performance impact on S3, either way. It doesn't make it faster, and it doesn't make it slower.
The value of delimiting your object keys with / is in organization, both machine-friendly and human-friendly.
If you're trawling through a bucket in the console while troubleshooting, those meaningless noise-filled keys are a hassle to paginate through, only a few dozen at a time.
The console automatically groups objects into imaginary folders based on the / delimiters, so finding your object to inspect it (check headers, metadata, etc.) is much easier if you can just click on 4a then ca then 29.
The S3 ListObjects APIs support requesting all the objects with a certain key prefix, but they also support finding all the common prefixes before the next delimiter, so you can send API requests to list prefix 4a/ca/ with delimiter / and it will only return the "folders" one level deep, which it refers to as "common prefixes."
This is less meaningful if your object keys are fully opaque and convey nothing more about the objects, as opposed to using key prefixes like images/ and thumbnails/ and videos/.
Having been an admin and working with S3 for a number of years, and having worked with buckets with key naming schemes designed by different teams, I would definitely recommend using some / delimiters for organization purposes. The buckets without them become more of a hassle to navigate over time.
Note that the console does allow you to "create folders," but this is more of an illusion -- there is no need to actually do this, unless you're loading a bucket manually. When you create a folder in the console, it just creates an empty object with a / at the end.
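To make the common-prefix listing concrete, a hedged example (bucket name and prefix are placeholders):

# List only the "folders" one level below 4a/ca/, not every object under it.
aws s3api list-objects-v2 \
    --bucket your-image-bucket \
    --prefix '4a/ca/' \
    --delimiter '/' \
    --query 'CommonPrefixes[].Prefix'
# Returns e.g. ["4a/ca/29/", "4a/ca/2a/", ...]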
So,
Let's say I have a folder called /example in S3. This folder contains a file called a.txt.
Using the AWS CLI, how do I upload a local folder, also called example, and replace the current S3 /example atomically? The local folder contains a file called b.txt.
So, I want the behaviour to be that the new S3 /example folder only contains b.txt.
Basically, is there a way to atomically replace an entire folder in S3 with a new one via the AWS CLI?
Thank you!
No, you can't do that.
For starters, S3 is an eventually consistent platform. That means that right after you do a write, you can still get old data back from S3. Practically, this converges quickly (seconds), but there is no upper bound. (They do provide consistency guarantees for some sequences of operations, but generally speaking, it's not strongly consistent.)
Secondly, S3 does not have a concept of a "folder" or "directory". The S3 namespace is flat. The only thing that objects /example/a.txt and /example/b.txt have in common is that they start with the same string, just like /foobar.txt and /foobaz.txt begin with the same string. (The user interface does cheat a bit by treating the / character differently, giving the illusion of directories.)