Rest API: Amazon S3 Best Practices - amazon-web-services

I have a REST API and I'm using S3 to store images .zip files, and other media like video. Is it common practice to use a single bucket for everything? Or divide buckets by file type?
For example, here are some of the kinds of content I have:
.jpg
.png
.zip
.mov
.maya
.obj

A single Amazon S3 bucket can contain any number of objects.
Reasons for using separate buckets are typically:
Desire for creating a bucket in a different region
Separation of duties (eg keep HR information separate)
Separation of purpose (eg keep test files separate to Production files)
Separation of systems (eg Inventory system vs Customer Service system)
There is no reason to use a different bucket for different file types, unless those file types are used for different purposes (like the above).

You can store all your files in a single bucket without a problem. If you wish to have more separation, keys in S3 are composite similar to URLs. For example you can have:
<your_bucket_name>/images/<key1>.png
<your_bucket_name>/images/<key2>.jpg
<your_bucket_name>/videos/<key3>.mp4
This will keep all your files in a single bucket, but in the AWS console they will be split similar to a file system - inside folders. Note that in order to access the resource stored in S3 you will need to use the full path e.g /images/key1.png

Related

dynamically create / append to zip from multiple instances

I have a situation where thousands o files are created for a user by multiple backend instances, and then they're uploaded to AWS S3 / Azure Storage. After all the files are created, the user wants to download them as a zip. I can create the zip and then get a pre-signed URL, but I tried few archiving solutions and all of them are just taking too much time (hours).
Is there any way of creating the zip dynamically from the multiple backend instances? I want append to zip after each file creation, from any backend instance.
Zip itself supports the use case you want. For example, zip command in Linux:
When given the name of an existing zip archive, zip will replace identically named entries in the zip archive (matching the relative names as stored in the archive) or add entries for new names.
You need to persist the working zip file somewhere in a file system though. The most obvious choice I can think of is EFS, so that multiple instances can mount the file system and access the zip file.
If you don't want to modify the existing instances/workloads, you can even mount EFS on Lambda. Then set S3 trigger for the Lambda to update zip file every time a new file is uploaded.
I think you can not use only S3 for this, because you cannot update S3 objects. Then you need to download/upload for every new file, which is really not ideal.

How I Can Search Unknown Folders in S3 Bucket. I Have millions of object in my bucket I only want Folder List?

I Have a bucket with 3 million objects. I Even don't know how many folders are there in my S3 bucket and even don't know the names of folders in my bucket.I want to show only list of folders of AWS s3. Is there any way to get list of all folders ?
I would use AWS CLI for this. To get started - have a look here.
Then it is a matter of almost standard linux commands (ls):
aws s3 ls s3://<bucket_name>/path/to/search/folder/ --recursive | grep '/$' > folders.txt
where:
grep command just reads what aws s3 ls command has returned and searches for entries with ending /.
ending > folders.txt saves output to a file.
Note: grep (if I'm not wrong) is unix only utility command. But I believe, you can achieve this on windows as well.
Note 2: depending on the number of files there this operation might (will) take a while.
Note 3: usually in systems like AWS S3, term folder is there only for user to maintain visual similarity with standard file systems however inside it does treat it as a part of a key. You can see in your (web) console when you filter by "prefix".
Amazon S3 buckets with large quantities of objects are very difficult to use. The API calls that list bucket contents are limited to returning 1000 objects per API call. While it is possible to request 'folders' (by using Delimiter='/' and looking at CommonPrefixes), this would take repeated calls to obtain the hierarchy.
Instead, I would recommend using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You can then play with that CSV file from code (or possibly Excel? Might be too big?) to obtain your desired listings.
Just be aware that doing anything on that bucket will not be fast.

Replace content in all files inside s3 bucket

I have a s3 bucket which is mapped to a domian say xyz.com . When ever a user register on xyz.com a file is created and stored in s3 bucket. Now i have 1000 of files in s3 and I want to replace some text in those files. All files have common name in start ex abc-{rand}.txt
The safest way of doing this would be to regenerate them again through the same process you originally used.
Personally I would try to avoid find and replace as it could lead to modifying parts that you did not intend.
Run multiple generations in parallel and override the existing files. This will ensure the files you generate will match your expectation and will not need to be modified again.
As a suggestion enable versioning before any of these interactions if you want the ability to rollback quickly in a scenario where it needs to be reverted.
Sadly, you can't do this in place in S3. You have to download them, change their content and re-upload.
This is because S3 is an object storage system, not regular file system.
To simply working with S3 files, you can use third part tool s3fs-fuse. The tool will make the S3 appear like a filesystem on your os.

How find difference of two text files in s3 using lambda services

I have to compare two files in aws s3 bucket and generate a new file with only the difference.
I have tried to do using Java, NodeJs and Python, but i couldn't find way to do that.For example we have some libraries in nodejs and python, but it requires input as 'path', but when you retrieve from s3 bucket its coming in different format.
Your AWS Lambda function could:
Download the two files to /tmp/
Use the difflib — Helpers for computing deltas module to find differences
Save the results to a file in /tmp/
Upload the results file to Amazon S3
Delete temporary files that were generated (in case the container is reused, since there is a 500MB limit in /tmp/)

Is there anything to be gained by using 'folders' in an s3 bucket?

I am moving a largish number of jpgs (several hundred thousand) from a static filesystem to amazon s3.
On the old filesytem, I grouped files into subfolders to keep the total number of files / folder manageable.
For example, a file
4aca29c7c0a76c1cbaad40b2693e6bef.jpg
would be saved to:
/4a/ca/29/4aca29c7c0a76c1cbaad40b2693e6bef.jpg
From what I understand, s3 doesn't respect hierarchial namespaces. So if I were to use 'folders' on s3, the object, including the /'s, would really just be in a flat namesapce.
Still, according to the docs, amazon recommends mimicking a structured filesytem when working with s3.
So I am wondering: Is there anything to be gained using the above folder structure to organize files on s3? Or in this case am I better off just adding the files to s3 without any kind of 'folder' structure.
Performance is not impacted by the use (or non-use) of folders.
Some systems can use folders for easier navigation of the files. For example, Amazon Athena can scan specific sub-directories when querying data rather than having to read every file.
If your bucket is being used for one specific purpose, there is no reason to use folders. However, if it contains different types of data, then you might consider at least a top-level set of folders to keep data separated.
Another potential reason for using folders is for security. A bucket policy can grant access to buckets based upon a prefix (which is a folder name). However, this is likely not relevant for your use-case.
Using "folders" has no performance impact on S3, either way. It doesn't make it faster, and it doesn't make it slower.
The value of delimiting your object keys with / is in organization, both machine-friendly and human-friendly.
If you're trolling through a bucket in the console, troubleshooting, those meaningless noise-filled keys are a hassle to paginate through, only a few dozen at a time.
The console automatically groups objects into imaginary folders based on the / delimiters, so you can find your object to inspect it (check headers, metadata, etc.) is much easier if you can just click on 4a then ca then 29.
The S3 ListObjects APIs support requesting all the objects with a certain key prefix, but they also support finding all the common prefixes before the next delimiter, so you can send API requests to list prefix 4a/ca/ with delimiter / and it will only return the "folders" one level deep, which it refers to as "common prefixes."
This is less meaningful if your object keys are fully opaque and convey nothing more about the objects, as opposed to using key prefixes like images/ and thumbnails/ and videos/.
Having been an admin and working with S3 for a number of years, and having worked with buckets with key naming schemes designed by different teams, I would definitely recommend using some / delimiters for organization purposes. The buckets without them become more of a hassle to navigate over time.
Note that the console does allow you to "create folders," but this is more of the illusion -- there is no need to actually do this, unless you're loading a bucket manually. When you create a folder in the console, it just creates an empty object with a / at the end.