Amazon AWS S3 Glacier: is there a file hierarchy?

Does Amazon AWS S3 Glacier support some semblance of file hierarchy inside a Vault for Archives?
For example, in AWS S3, objects are given hierarchy via /. For example: all_logs/some_sub_category/log.txt
I am storing multiple .tar.gz files, and would like:
All files in the same Vault
Within the Vault, files are grouped into several categories (as opposed to flat structure)
I could not find how to do this documented anywhere. If file hierarchy inside S3 Glacier is possible, can you provide brief instructions for how to do so?

Does Amazon AWS S3 Glacier support some semblance of file hierarchy inside a Vault for Archives?
No, there's no hierarchy other than "archives exist inside a vault".
For example, in AWS S3, objects are given hierarchy via /. For example: all_logs/some_sub_category/log.txt
This is actually incorrect.
S3 doesn't have any inherent hierarchy. The character / is absolutely no different than any other character valid for the key of an S3 Object.
The S3 Console — and most S3 client tools, including AWS's CLI — treat the / character in a special way. But notice that it is a client-side thing. The client will make sure that listing happens in such a way that a / behaves as most people would expect, that is, as a "hierarchy separator".
If file hierarchy inside S3 Glacier is possible, can you provide brief instructions for how to do so?
You need to keep track of your hierarchy separately. For example, when you store an archive in Glacier, you could write metadata about that archive in a database (RDS, DynamoDB, etc).
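As a rough illustration, here is a minimal boto3 sketch of that idea -- the vault, table, and attribute names are placeholders, and the DynamoDB table is assumed to already exist:

```python
import boto3

glacier = boto3.client("glacier")
dynamodb = boto3.resource("dynamodb")
index = dynamodb.Table("glacier-archive-index")  # hypothetical, pre-created table

def store_archive(vault_name, local_path, category, logical_name):
    # Upload the archive to the Glacier vault.
    with open(local_path, "rb") as f:
        response = glacier.upload_archive(
            vaultName=vault_name,
            archiveDescription=logical_name,
            body=f,
        )

    # Glacier only knows "vault -> archive ID", so record the logical
    # hierarchy (category/name) yourself, alongside the archive ID.
    index.put_item(Item={
        "archive_id": response["archiveId"],
        "vault": vault_name,
        "category": category,      # e.g. "all_logs/some_sub_category"
        "name": logical_name,      # e.g. "logs-2024-01.tar.gz"
        "checksum": response["checksum"],
    })
    return response["archiveId"]
```

To "list" a category later, you query this table (e.g. via a key or index on category) rather than Glacier itself.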
As a side note, be careful about .tar.gz in Glacier, especially if you're talking about (1) a very large archive (2) that is composed of a large number of small individual files (3) which you may want to access individually.
If those conditions are met (and in my experience they often are in real-world scenarios), then using .tar.gz will often lead to excessive costs when retrieving data.
The reason is that you pay per request as well as for the amount of data retrieved. So while having one huge .tar.gz file may reduce your costs in terms of number of requests, the fact that gzip uses DEFLATE, which is a non-splittable compression format, means that you'll have to retrieve the entire .tar.gz archive, decompress it, and finally get the one file that you actually want.
An alternative approach that solves the problem I described above — and that, at the same time, relates back to your question and my answer — is to actually first gzip the individual files, and then tar them together. The reason this solves the problem is that when you tar the files together, the individual files actually have clear bounds inside the tarball. And then, when you request a retrieval from glacier, you can request only a range of the archive. E.g., you could say, "Glacier, give me bytes between 105MB and 115MB of archive X". That way you can (1) reduce the total number of requests (since you have a single tar file), and (2) reduce the total size of the requests and storage (since you have compressed data).
Now, to know which range you need to retrieve, you'll need to store metadata somewhere — usually the same place where you will keep your hierarchy! (like I mentioned above, RDS, DynamoDB, Elasticsearch, etc).
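For what it's worth, a minimal sketch of that packing step in Python might look like this -- gzip each file, tar the gzipped files together without further compression, and record each member's byte range in a manifest (all paths and names are placeholders):

```python
import gzip
import json
import shutil
import tarfile
from pathlib import Path

def build_archive(src_dir, tar_path, manifest_path):
    # 1. gzip each file individually, so each one stays independently
    #    decompressible once we know its byte range inside the tar.
    gz_files = []
    for path in Path(src_dir).iterdir():
        if not path.is_file():
            continue
        gz_path = path.with_suffix(path.suffix + ".gz")
        with open(path, "rb") as fin, gzip.open(gz_path, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        gz_files.append(gz_path)

    # 2. tar the gzipped files together *without* further compression.
    with tarfile.open(tar_path, "w") as tar:
        for gz_path in gz_files:
            tar.add(gz_path, arcname=gz_path.name)

    # 3. Record each member's byte range so a later retrieval can ask
    #    Glacier for just that slice of the archive.
    manifest = {}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar.getmembers():
            manifest[member.name] = {
                "start": member.offset_data,
                "end": member.offset_data + member.size - 1,  # inclusive
            }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
```

Keep in mind that Glacier range retrievals have to be megabyte-aligned, so in practice you round the recorded range outward to MB boundaries when initiating the retrieval job.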
Anyways, just an optimization that could save a tremendous amount of money in the future (and I've worked with a ton of customers who wasted a lot of money because they didn't know about this).

Related

How to track / inventory large number of files in S3

The ListObjectsV2 API call is used to get the list of objects in an S3 bucket. Unfortunately, it is limited to 1,000 objects per call -- not particularly useful for buckets with millions of files. It is particularly painful because it only returns objects in lexicographic key order -- meaning that if you are looking for the 'most recent' files, you need to call ListObjectsV2 with a continuation token and work your way through every single file in the bucket matching your prefix before getting to your 'most recent' file.
So, S3 Inventory is the tool for dealing with filesets >1000 objects... Unfortunately, S3 Inventory only 'runs' daily, effectively making it useless for situations where the real-time status of your files matters. Though -- perhaps I'm missing something here?
What tool or mechanism exists to help with real-time listing of larger S3 bucket contents? Specifically, I have buckets with millions of files whose keys include a device identifier and then the date (i.e. foobar___2022-02-02T12:00:00Z), and the most common access pattern is needing to see the most recent X number of files. The second most common pattern is needing to see all files between date Y and date Z.
One option would be to put each filename/key into a DB row and then use that for listing objects and other queries.... but that seems janky and it significantly complicates the entire system - now needing to keep the DB 'in sync' with reality. Surely, there is a tool or standard methodology for this relatively normal access pattern.
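For reference, the pagination dance described above looks roughly like this with boto3 (bucket name and prefix are placeholders); the whole prefix has to be walked just to find the lexicographically greatest key:

```python
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Keys come back in lexicographic order, 1,000 per page, so finding the
# most recent object for a device means walking the entire prefix.
latest = None
for page in paginator.paginate(Bucket="my-bucket", Prefix="foobar___"):
    for obj in page.get("Contents", []):
        latest = obj  # last object seen == lexicographically greatest key
print(latest)
```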

Storing many small files (on S3)?

I have 2 million zipped HTML files (100-150KB) being added each day that I need to store for a long time.
Hot data (70-150 million files) is accessed semi-regularly; anything older than that is barely ever accessed.
This means each day I'm storing an additional 200-300GB worth of files.
Now, S3 Standard storage costs $0.023 per GB and Glacier costs $0.004 per GB.
While Glacier is cheap, the problem with it is that it has additional costs, so it would be a bad idea to dump 2 million files into Glacier:
PUT requests to Glacier $0.05 per 1,000 requests
Lifecycle Transition Requests into Glacier $0.05 per 1,000 requests
Is there a way of gluing the files together, but keeping them accessible individually?
An important point: if you need quick access to these files, note that Glacier can take up to 12 hours to make a file available. So the best you can do is use S3 Standard-Infrequent Access ($0.0125 per GB, with millisecond access) instead of S3 Standard, and maybe Glacier for data that is hardly ever used. But it still depends on how fast you need that data.
With that in mind, I'd suggest the following:
as HTML (text) files compress well, you can pack historical data into big zip files (daily, weekly or monthly), since together they can compress even better;
keep an index file or database that records which archive each HTML file is stored in;
read only the desired HTML files from the archives without unpacking the whole zip file; a Python sketch of how to do that is below.
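A minimal sketch of that last step, using Python's zipfile module (paths and names are placeholders):

```python
import zipfile

def read_html_from_zip(zip_path, member_name):
    # Read a single HTML file out of a zip archive without extracting
    # the rest of its contents.
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member_name) as member:
            return member.read()

# e.g. read_html_from_zip("2022-02-02.zip", "pages/some-page.html")
```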
Glacier costs are extremely sensitive to the number of files. The best method would be to create a Lambda function that handles the zip and unzip operations for you.
Consider this approach:
A Lambda function creates archive_date_hour.zip files from that day's 2 million files, batched by hour; this solves the "per object" cost problem by producing 24 large archive files instead.
Set a lifecycle policy on the S3 bucket to move objects older than 1 day to Glacier (a sketch of such a rule follows this list).
Use an unzipping Lambda function to fetch and extract potentially hot items from within the zip files stored in Glacier.
Keep the main S3 bucket for hot files with frequent access, as a working area for the zip/unzip operations, and for collecting new files daily.
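For the lifecycle policy mentioned above, a boto3 sketch might look like this (bucket name and prefix are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Transition objects under an "archives/" prefix to Glacier once they are
# more than a day old (bucket name and prefix are placeholders).
s3.put_bucket_lifecycle_configuration(
    Bucket="my-content-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archives-to-glacier",
                "Filter": {"Prefix": "archives/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 1, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```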
Your files are just too small. You will probably need to combine them, for example in an ETL pipeline such as AWS Glue. You can also use the Range header (e.g. Range: bytes=1000-2000) to download part of an object from S3.
If you do that you'll need to figure out the best way to track the byte ranges, such as recording the range for each file after combining them, and changing the clients to use ranges as well.
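A minimal sketch of such a ranged read with boto3 (bucket, key, and range are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Fetch one small file back out of a combined object, using the byte range
# recorded when the files were concatenated.
response = s3.get_object(
    Bucket="my-combined-bucket",
    Key="combined/2022-02-02.bin",
    Range="bytes=1000-2000",  # same syntax as the HTTP Range header
)
fragment = response["Body"].read()
```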
The right approach though depends on how this data is accessed and figuring out the patterns. If somebody who looks at TinyFileA also looks at TinyFileB you could combine them together and just send them both along with other files they are likely to use. I would be figuring out logical groupings of files which make sense to consumers and will reduce the number of requests they need, without sending too much irrelevant data.

How to diff very large buckets in Amazon S3?

I have a use case where I have to back up a 200+ TB, 18M-object S3 bucket that changes often (it is used in batch processing of critical data) to another account. I need to add a verification step, but due to the bucket size, object count, and frequency of change, this is tricky.
My current thoughts are to pull the ETags from the original bucket and the archive bucket, and then write a streaming diff tool to compare the values. Has anyone here had to approach this problem, and if so, did you come up with a better answer?
Firstly, if you wish to keep two buckets in sync (once you've done the initial sync), you can use Cross-Region Replication (CRR).
To do the initial sync, you could try using the AWS Command-Line Interface (CLI), which has an aws s3 sync command. However, it might have some difficulties with a large number of files -- I suggest you give it a try. It uses keys, dates and file size to determine which files to sync.
If you do wish to create your own sync app, then the ETag is a definitive way to compare files.
To make things simple, activate Amazon S3 Inventory, which can provide a daily listing of all files in a bucket, including eTag. You could then do a comparison between the Inventory files to discover which remaining files require synchronization.
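As a rough sketch, once you have the two inventory reports downloaded locally as CSVs (and know which columns hold the key and ETag -- that depends on the fields you selected), the comparison itself is simple:

```python
import csv

def load_inventory(csv_path, key_col="Key", etag_col="ETag"):
    # Assumes an inventory report already downloaded and un-gzipped, with a
    # header row; the actual column names depend on the fields you selected.
    with open(csv_path, newline="") as f:
        return {row[key_col]: row[etag_col] for row in csv.DictReader(f)}

source = load_inventory("source-inventory.csv")
backup = load_inventory("backup-inventory.csv")

missing    = [k for k in source if k not in backup]
extra      = [k for k in backup if k not in source]
mismatched = [k for k in source if k in backup and source[k] != backup[k]]
```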
For anyone looking for a way to solve this problem in an automated way (as was I),
I created a small python script that leverages S3 Inventories and Athena to do the comparison somewhat efficiently. (This is basically automation of John Rosenstein's suggestion)
You can find it here https://github.com/forter/s3-compare

Is there anything to be gained by using 'folders' in an s3 bucket?

I am moving a largish number of jpgs (several hundred thousand) from a static filesystem to amazon s3.
On the old filesystem, I grouped files into subfolders to keep the total number of files per folder manageable.
For example, a file
4aca29c7c0a76c1cbaad40b2693e6bef.jpg
would be saved to:
/4a/ca/29/4aca29c7c0a76c1cbaad40b2693e6bef.jpg
From what I understand, S3 doesn't respect hierarchical namespaces. So if I were to use 'folders' on S3, the object keys, including the /'s, would really just live in a flat namespace.
Still, according to the docs, Amazon recommends mimicking a structured filesystem when working with S3.
So I am wondering: Is there anything to be gained using the above folder structure to organize files on s3? Or in this case am I better off just adding the files to s3 without any kind of 'folder' structure.
Performance is not impacted by the use (or non-use) of folders.
Some systems can use folders for easier navigation of the files. For example, Amazon Athena can scan specific sub-directories when querying data rather than having to read every file.
If your bucket is being used for one specific purpose, there is no reason to use folders. However, if it contains different types of data, then you might consider at least a top-level set of folders to keep data separated.
Another potential reason for using folders is for security. A bucket policy can grant access to buckets based upon a prefix (which is a folder name). However, this is likely not relevant for your use-case.
Using "folders" has no performance impact on S3, either way. It doesn't make it faster, and it doesn't make it slower.
The value of delimiting your object keys with / is in organization, both machine-friendly and human-friendly.
If you're trolling through a bucket in the console, troubleshooting, those meaningless noise-filled keys are a hassle to paginate through, only a few dozen at a time.
The console automatically groups objects into imaginary folders based on the / delimiters, so finding your object in order to inspect it (check headers, metadata, etc.) is much easier if you can just click on 4a, then ca, then 29.
The S3 ListObjects APIs support requesting all the objects with a certain key prefix, but they also support finding all the common prefixes before the next delimiter, so you can send API requests to list prefix 4a/ca/ with delimiter / and it will only return the "folders" one level deep, which it refers to as "common prefixes."
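A quick boto3 sketch of that delimiter-based listing (bucket name is a placeholder):

```python
import boto3

s3 = boto3.client("s3")

# List only the "folders" one level below 4a/ca/ -- S3 returns them as
# CommonPrefixes rather than as objects.
response = s3.list_objects_v2(
    Bucket="my-image-bucket",
    Prefix="4a/ca/",
    Delimiter="/",
)
for prefix in response.get("CommonPrefixes", []):
    print(prefix["Prefix"])  # e.g. "4a/ca/29/"
```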
This is less meaningful if your object keys are fully opaque and convey nothing more about the objects, as opposed to using key prefixes like images/ and thumbnails/ and videos/.
Having been an admin and working with S3 for a number of years, and having worked with buckets with key naming schemes designed by different teams, I would definitely recommend using some / delimiters for organization purposes. The buckets without them become more of a hassle to navigate over time.
Note that the console does allow you to "create folders," but this is part of the same illusion -- there is no need to actually do it unless you're loading a bucket manually. When you create a folder in the console, it just creates an empty object with a / at the end.

Top level solution to rename AWS bucket item's folder names?

I've inherited a project at work. It's essentially a niche content repository, and we use S3 to store the content. The project was severely outdated, and I'm in the process of a thorough update.
For some unknown and undocumented reason, the content is stored in an AWS S3 bucket with the pattern web_cl_000000$DB_ID$CONTENT_NAME. So, one particular folder can be named web_cl_0000003458zyxwv. This makes no sense, and requires a bit of transformation logic to construct a URL to serve up the content!
I can write a Python script using the boto3 library to do an item-by-item rename, but would like to know if there's a faster way to do so. There are approximately 4M items in that bucket, so renaming them one by one will take quite a long time.
That isn't possible, because the folders are an illusion derived from the strings between / delimiters in the object keys.
Amazon S3 has a flat structure with no hierarchy like you would see in a typical file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using key name prefixes for objects. (emphasis added)
http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
The console contributes to the illusion by allowing you to "create" a folder, but all that actually does is create a 0-byte object with / as its last character, which the console will display as a folder whether there are other objects with that prefix or not, making it easier to upload objects manually with some organization.
But any tool or technique that allows renaming folders in S3 will in fact be making a copy of each object with the modified name, then deleting the old object, because S3 does not actually support rename or move, either -- objects in S3, including their key and metadata, are actually immutable. Any "change" is handled at the API level with a copy/overwrite or copy-then-delete.
Worth noting, S3 should be able to easily sustain 100 such requests per second, so with asynchronous requests or multi-threaded code, or even several processes each handling a shard of the keyspace, you should be able to do the whole thing in a few hours.
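A rough multi-threaded sketch of that copy-then-delete approach with boto3 -- the bucket name, prefix, and key transformation are all placeholders:

```python
import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")
BUCKET = "my-content-bucket"  # placeholder

def new_key_for(old_key):
    # Placeholder for whatever transformation undoes the
    # web_cl_000000$DB_ID$CONTENT_NAME naming scheme.
    return "content/" + old_key

def rename_object(old_key):
    # S3 has no real rename: copy to the new key, then delete the old one.
    s3.copy_object(
        Bucket=BUCKET,
        CopySource={"Bucket": BUCKET, "Key": old_key},
        Key=new_key_for(old_key),
    )
    s3.delete_object(Bucket=BUCKET, Key=old_key)

paginator = s3.get_paginator("list_objects_v2")
with ThreadPoolExecutor(max_workers=20) as pool:
    for page in paginator.paginate(Bucket=BUCKET, Prefix="web_cl_"):
        for obj in page.get("Contents", []):
            pool.submit(rename_object, obj["Key"])
```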
Note also that the less sorted (more random) the new keys are in the requests, the harder you can push S3 during a mass-write operation like this. Sending the requests so that the new keys are in lexical order will be the most likely scenario in which you might see 503 Slow Down errors... in which case, you just back off and retry... but if the new keys are not ordered, S3 can more easily accommodate a large number of requests.