Retrieve content from specified list of S3 directories - amazon-web-services

Is there a way to retrieve a list of all files from a specified list of directories in a specific S3 bucket by invoking the cloud API only once?
For example, let's say that I have the following structure in my S3 bucket:
A/
    AA/
        XXX/
B/
    BB/
        /EMPTY
C/
    /EMPTY
D/
    DD/
        XXX/
And that I also have a list of directories from which I wish to retrieve content:
Requested Paths: {
    "A/AA/XXX",
    "B/BB/XXX",
    "C/CC/XXX",
    "D/DD/XXX"
}
I would like to create a map of key/value pairs where the key is a specific directory path and the value is its content. If a path does not exist, then the corresponding key/value pair should not exist either. Something like this:
Map {
    "A/AA/XXX" : Content
    "D/DD/XXX" : Content
}
Note that there are no keys corresponding to B/BB/XXX or C/CC/XXX, since XXX is not part of the B/BB/ path and CC/XXX is not part of the C/ path either.

Not with a single call, no - particularly if you have enough objects to trigger paginated results. ListObjects takes a ListObjectsInput where Prefix is a single string, not a slice/array.
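As a workaround, the usual pattern is one paginated ListObjectsV2 call per requested prefix, keeping only the prefixes that actually return keys. A minimal boto3 sketch (the bucket name and prefixes are illustrative, and the original poster may well be using a different SDK):

import boto3

s3 = boto3.client('s3')
bucket = 'my-bucket'  # hypothetical bucket name
requested = ['A/AA/XXX/', 'B/BB/XXX/', 'C/CC/XXX/', 'D/DD/XXX/']

contents = {}
paginator = s3.get_paginator('list_objects_v2')
for prefix in requested:
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj['Key'] for obj in page.get('Contents', []))
    if keys:
        # Only prefixes that exist (return at least one key) end up in the map
        contents[prefix] = keys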

Related

Copy file from s3 subfolder in another subfolder in same bucket

I'd like to copy a file from a subfolder into another subfolder in the same S3 bucket. I've read lots of questions on SO and finally came up with this code. It has an issue: when I run it, it works, but it doesn't copy only the file; it copies the folder that contains the file into the wanted destination, so I end up with the file but inside a folder (root). How do I copy only the files inside that subfolder?
XXXBUCKETNAME:
-- XXXX-input/ # I want to copy from here
-- XXXX-archive/ # to here
import boto3
from botocore.config import Config
s3 = boto3.resource('s3', config=Config(proxies={'https': getProperty('Proxy', 'Proxy.Host')}))
bucket_obj = s3.Bucket('XXX')
destbucket = 'XXX'
jsonfiles = []
for obj in bucket_obj.objects.filter(Delimiter='/', Prefix='XXXX-input/'):
    if obj.key.endswith('json'):
        jsonfiles.append(obj.key)

for k in jsonfiles:
    if k.split("_")[-1:][0] == "xxx.txt":
        dest = s3.Bucket(destbucket)
        source = {'Bucket': destbucket, 'Key': k}
        dest.copy(source, "XXXX-archive/" + k)
It gives:
XXXBUCKETNAME:
-- XXXX-input/
-- XXXX-archive/
-- XXXX-input/file.txt
I want:
XXXBUCKETNAME:
-- XXXX-input/
-- XXXX-archive/
-- file.txt
In S3 there really aren't any "folders." There are buckets and objects, as explained in the documentation. The UI may make it seem like there are folders, but the key for an object is the entire path. So if you want to copy one item, you will need to parse its key and build the destination key yourself, keeping the destination prefix but using only the file-name portion of the source key.
In Amazon S3, buckets and objects are the primary resources, and objects are stored in buckets. Amazon S3 has a flat structure instead of a hierarchy like you would see in a file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. It does this by using a shared name prefix for objects (that is, objects have names that begin with a common string). Object names are also referred to as key names.
In your code you are pulling out each object's key, so that means the key already contains the full "path" even though there isn't really a path. So you will want to split the key on the / character instead and then take the last element in the resulting list and append that as the file name:
dest.copy(source, "XXXX-archive/" + k.split("/")[-1])
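Put back into the poster's loop (same placeholder names as above; this is a sketch of the fix, not code from the answer):

for k in jsonfiles:
    if k.split("_")[-1:][0] == "xxx.txt":
        dest = s3.Bucket(destbucket)
        source = {'Bucket': destbucket, 'Key': k}
        # Keep only the file-name part of the key so the object lands
        # directly under XXXX-archive/ instead of XXXX-archive/XXXX-input/...
        dest.copy(source, "XXXX-archive/" + k.split("/")[-1])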

S3 limit to objects in a directory of a bucket

I want to save all of my images in a directory of a bucket. Is the number of objects in the same directory unlimited?
for example:
/imgs/10000000.jpg
/imgs/10000001.jpg
/imgs/10000002.jpg
....
/imgs/99999999.jpg
Yes, the number of objects is unlimited. As John mentioned elsewhere, the entire S3 "file path" is really just one string internally, the use of the / as a path separator is just convention.
One suggestion I'd have to make use of this effectively is to name each image a ULID - https://github.com/ulid/spec - this gives you a couple of advantages:
you don't need to worry about uniqueness, even if you put images in from multiple servers
because ULIDs are lexicographically sortable and time-based, you can query S3 directly to see which images were uploaded when (you can generate ULIDs for the start and end timestamps and call S3's LIST to get the images between them - see the sketch below).
it's easier to handle reports and metrics – you can easily find out which images are new, because they'll have a ULID after that period's timestamp.
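A minimal boto3 sketch of that time-window query. The bucket name and the two boundary strings are placeholders, not real ULIDs; in practice you would generate them from the window's start and end timestamps with a ULID library.

import boto3

s3 = boto3.client('s3')
bucket = 'my-images'  # hypothetical bucket name

# Placeholder boundary keys: real ULIDs encode the timestamp in their first
# 10 characters, so keys between these two fall inside the chosen time window.
start_key = 'imgs/01HF0000000000000000000000'
end_key = 'imgs/01HG0000000000000000000000'

uploaded_in_window = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix='imgs/', StartAfter=start_key):
    keys = [obj['Key'] for obj in page.get('Contents', [])]
    uploaded_in_window.extend(k for k in keys if k < end_key)
    if keys and keys[-1] >= end_key:
        break  # already past the end of the window; stop paging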
There is no limit on the number of objects stored in Amazon S3.
Amazon S3 does not actually have 'directories'. Instead, the Key (filename) of an object includes the full path of the object. This actually allows you to create/upload objects to a non-existent directory. Doing so will make the directory 'appear' to exist, but that is simply to make things easier for us humans to understand.
Therefore, there is also no limit on the number of objects stored in a 'directory'.

Printing all keys of files and folders recursively doesn't work as expected

We have stored several files and folders in Amazon S3.
We are using the following code to iterate over all the files and folders for a given root folder:
ObjectListing listing = s3.listObjects(bucketName, prefix);
List<S3ObjectSummary> summaries = listing.getObjectSummaries();
while (listing.isTruncated()) {
    listing = s3.listNextBatchOfObjects(listing);
    summaries.addAll(listing.getObjectSummaries());
}
Assume the root folder has 1000 files and 10 folders. One of the folders has 100 sub-folders, and each sub-folder has 500 files.
The above program works fine and traverses all the files.
The problem is that it is not printing the keys of all the sub-folders.
The interesting thing is that it prints the first sub-folder.
For example:
Root folder: Emp
Folders under the root folder: FolderA, FolderB, FolderC
Sub-folders under FolderA: 0, 1, 2, 3, 4, 5 ... 100
Each of these (0, 1, 2, ...) has 500 files.
What could be the problem? Is there any limitation in AWS, should folder names not be numeric, or is there a logical issue?
When using the above code, FolderA/0/ comes back as a key, whereas FolderA/1 ... FolderA/10 do not.
Thanks.
There is no such thing as folders or directories in Amazon S3. Amazon S3 is a flat key/value object store. Folders and sub-folders are a human interpretation of the "/" character in object keys. S3 doesn't know or care about them.
You can "fake" the creation of an empty folder in S3 by creating a 0-byte object that ends with the "/" character.
When iterating over the list of objects, these 0-byte "folders" will be included.
However, you may also have objects such as "folder1/object1" where, in your mind, "folder1" is a sub-folder off the root. But in S3, there may not be such an object as "folder1/". In this case, you will not see "folder1/" output in your result list on its own.
If you need to get a list of all "sub-folders", then you need to not only look for objects that end with the "/" character, but you also need to examine all objects for a "/" character and infer a sub-folder from the object's key because there may not be that 0-byte object for the folder itself.
For example:
folder1/object1
folder2/
folder2/object1
In this example, there's only one sub-folder object, but you could say there are actually two sub-folders.
Java code (AWS SDK for Java v1, same API as the question's snippet) to get the sub-folders:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import java.util.*;

// Returns the names of the sub-folders directly under currentFolder,
// inferred from the object keys (there may be no 0-byte "folder" objects).
static Set<String> getSubFolders(AmazonS3 s3, String bucketName, String currentFolder) {
    // Use the current folder as the S3 prefix
    String prefix = currentFolder;

    // Get all objects under the prefix (handling truncated/paginated results)
    ObjectListing listing = s3.listObjects(bucketName, prefix);
    List<S3ObjectSummary> summaries = new ArrayList<>(listing.getObjectSummaries());
    while (listing.isTruncated()) {
        listing = s3.listNextBatchOfObjects(listing);
        summaries.addAll(listing.getObjectSummaries());
    }

    // Split the listing into files in the current folder and sub-folders
    Set<String> subFolders = new LinkedHashSet<>();   // a set removes duplicate entries
    List<String> files = new ArrayList<>();
    for (S3ObjectSummary summary : summaries) {
        // The key includes the prefix, so remove it
        String key = summary.getKey().substring(prefix.length());
        // If the remainder contains a / character, the object is inside a
        // sub-folder; keep just the sub-folder name. Otherwise it is a file
        // directly in the current folder.
        int slashIndex = key.indexOf('/');
        if (slashIndex >= 0) {
            subFolders.add(key.substring(0, slashIndex));
        } else if (!key.isEmpty()) {
            files.add(key);
        }
    }
    return subFolders;
}
The folders with numeric names were not being populated properly when listed recursively.
I currently resolved it as follows:
Iterate all the folders under a path.
Iterate all the files under each path recursively.
Getting all the files and folders recursively in one pass doesn't work. However, processing the folders separately and then iterating each one recursively works fine.
This seems to be a slightly more expensive operation, but it works.

Amazon S3 - different lifecycle rule for "subdirectory" than for parent "directory"

Let's say I have the following data structure:
/
/foo
/foo/bar
/foo/baz
Is it possible to assign to it the following life-cycle rules:
/ (1 month)
/foo (2 months)
/foo/bar (3 months)
/foo/baz (6 months)
The official documentation is unfortunately self-contradictory in this regard. It doesn't seem to work with the AWS console, which makes me somewhat doubtful that the SDKs/REST would be any different ;)
Failing that, my root problem is: I have 4 types of projects. The most rudimentary type has a few thousand projects; the other ones have a few dozen. I am obligated to store each type for a different period of time. Each project contains hundreds of thousands of objects. It looks more or less like this:
type A, 90% of projects, x storage required
type B, 6% of projects, 2x storage required
type C, 3% of projects, 4x storage required
type D, 1% of projects, 8x storage required
So far so simple. However, projects may be upgraded or downgraded from one type to another. And as I said, I have a few thousand instances of the first type, so I can't write specific rules for every one of them (remember the 1000-rule limit per bucket). And since they may be upgraded from one type to another, I can't simply put them into their own folders (e.g. one folder per type) or buckets either. Or so I think? Are there any other options open to me besides iterating over every object every time I want to purge expired files, which I would seriously rather not do because of the sheer number of objects?
Maybe some kind of file "move/transfer" between buckets that doesn't modify the creation time metadata, and isn't costly for our server to process?
Would be much obliged :)
Lifecycle policies are based on prefix, not "subdirectory."
So if objects matching the foo/ prefix are to be deleted in 2 months, it is not logical to ask for objects with a prefix of foo/bar/ to be deleted in 3 months, because they're going to be deleted after 2 months... since they also match the prefix foo/. Prefix means prefix. Delimiters are not a factor in lifecycle rules.
Also note that keys and prefixes in S3 do not begin with /. A policy affecting the entire bucket uses the empty string as a prefix, not /.
You do, also, probably want to remember the trailing slashes when you specify prefixes, because foo/bar matches the file foo/bart.jpg while foo/bar/ does not.
Iterating over objects for deletion is not as bad as you make it out to be, since the List Objects API call returns 1000 objects per request (or fewer, if you want), and allows you to specify both prefix and delimiter (usually, you'll use / as the delimiter if you want the responses grouped using the pseudo-folder model the console uses to create the hierarchical display)... and each object's key and datestamp is provided in the response XML. There's also an API request to delete multiple objects in one call.
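A hedged boto3 sketch of that iterate-and-purge approach (the bucket name, prefix, and 90-day cutoff are illustrative):

from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client('s3')
bucket = 'my-bucket'   # hypothetical
prefix = 'foo/bar/'    # hypothetical
cutoff = datetime.now(timezone.utc) - timedelta(days=90)

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    expired = [{'Key': obj['Key']}
               for obj in page.get('Contents', [])
               if obj['LastModified'] < cutoff]
    if expired:
        # DeleteObjects removes up to 1000 keys per call,
        # which matches the page size of the listing above
        s3.delete_objects(Bucket=bucket, Delete={'Objects': expired})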
Any kind of move, transfer, copy, etc. will always reset the creation date of the object. Even modifying the metadata, because objects are immutable. Any time you move, transfer, copy, or "rename" an object (which is actually copy and delete), or modify metadata (which is actually copy to the same key, with different metadata) you are actually creating a new object.
I ran into the same issue and bypassed it using tags.
This solution has two steps:
Use a Lambda function to tag each object, linking the tag value to the object's prefix
Use the "tag" filter of your lifecycle rule
Example of lambda function
In your use case, you want for example a 6-month expiration time for the objects with the /foo/baz prefix.
You can write a Lambda function similar to this:
import json
import urllib.parse
import boto3
import re

print('Loading function')

s3 = boto3.client('s3')

def lambda_handler(event, context):
    print("Received event: " + json.dumps(event, indent=2))
    # Get the object from the event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    tags = {
        "delete_after_six_months": "true" if re.match(pattern=r".*\/foo\/baz\/.*", string=key) else "false"
    }
    # apply the tags
    try:
        response = s3.put_object_tagging(
            Bucket=bucket,
            Key=key,
            Tagging={
                'TagSet': [{'Key': k, 'Value': v} for k, v in tags.items()]
            }
        )
    except Exception as e:
        print(e)
        print('Error applying tags to {}'.format(key))
        raise e
The trigger is to be adapted to the user's needs.
Using this, all objects with /foo/baz/ prefix will have a delete_after_six_months: true tag, and you can easily define the proper associated expiration policy.
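For completeness, here is a hedged sketch of what the matching tag-based lifecycle rule could look like when set with boto3 (the rule ID, bucket name, and 180-day expiration are illustrative):

import boto3

s3 = boto3.client('s3')
s3.put_bucket_lifecycle_configuration(
    Bucket='my-bucket',  # hypothetical bucket name
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'expire-foo-baz-after-six-months',
            'Status': 'Enabled',
            # Apply the rule only to objects carrying the tag set by the Lambda
            'Filter': {'Tag': {'Key': 'delete_after_six_months', 'Value': 'true'}},
            'Expiration': {'Days': 180},
        }]
    }
)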
@Zardii you can use unique S3 object tags [1] for the objects under these prefixes.
Then you can apply the lifecycle policy by tag, with varying retention/deletion periods.
[1] https://docs.aws.amazon.com/AmazonS3/latest/dev/object-tagging.html
Prefix        S3 Tag
/             delete_after_one_month
/foo          delete_after_two_months
/foo/bar      delete_after_three_months
/foo/baz      delete_after_six_months

Does the ListObjects command guarantee the results are sorted by key?

When calling the S3 ListObjects command (via either REST or SOAP API), is the result set returned in any particular order? I would expect, given the nature of object keys and markers, that the result set is always sorted by object key. But I haven't seen any documentation confirming this.
Update: Amazon has changed their documentation as shown below.
Yes - list results are always returned in UTF-8 binary order (which is effectively alphabetical for plain ASCII keys). See http://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysUsingAPIs.html
Amazon S3 exposes a list operation that lets you enumerate the keys contained in a bucket. Keys are selected for listing by bucket and prefix. For example, consider a bucket named 'dictionary' that contains a key for every English word. You might make a call to list all the keys in that bucket that start with the letter "q". List results are always returned in UTF-8 binary order.
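As a quick illustration of what UTF-8 binary order means in practice (plain Python string sorting, which matches byte order for these ASCII-only keys): uppercase letters sort before all lowercase letters.

keys = ['apple.jpg', 'Zebra.jpg', 'banana/01.jpg']

# Byte-wise (UTF-8 binary) order, as S3 returns it: 'Z' (0x5A) sorts before 'a' (0x61)
print(sorted(keys))                 # ['Zebra.jpg', 'apple.jpg', 'banana/01.jpg']

# Case-insensitive "alphabetical" order, for comparison
print(sorted(keys, key=str.lower))  # ['apple.jpg', 'banana/01.jpg', 'Zebra.jpg']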