Get all versions of an object in an AWS S3 bucket? - amazon-web-services

I've enabled object versioning on a bucket. I want to get all versions of a key inside that bucket. But I cannot find a method to do this; how would one accomplish it using the S3 APIs?

So, I ran into this brick wall this morning. This seemingly trivial thing is incredibly difficult to do, it turns out.
The API you want is the GET Bucket Object versions API, but it is sadly non-trivial to use.
First, you have to steer clear of some non-solutions: KeyMarker, which is documented by boto3 as,
KeyMarker (string) -- Specifies the key to start with when listing objects in a bucket.
…does not start with the specified key when listing objects in a bucket; rather, it starts immediately after that key, which makes it somewhat useless here.
The best restriction this API provides is Prefix; this isn't going to be perfect, since there could be keys that are not our key of interest but that nonetheless start with it.
Also beware of MaxKeys: it is tempting to think that, lexicographically, our key comes first and all keys that have it as a prefix follow, so we could trim the extras with MaxKeys; sadly, MaxKeys limits not the number of distinct keys in the response but the number of versions returned. (And I'm going to presume that isn't known in advance.)
So, Prefix is the best it seems that can be done. Also note that, at least in some languages, the client library will not handle pagination for you, so you'll additionally need to deal with that.
As an example in boto3:
response = client.list_object_versions(
    Bucket=bucket_name, Prefix=key_name,
)
while True:
    # Process `response`
    ...
    # Check if the results got paginated:
    if response['IsTruncated']:
        response = client.list_object_versions(
            Bucket=bucket_name, Prefix=key_name,
            KeyMarker=response['NextKeyMarker'],
            VersionIdMarker=response['NextVersionIdMarker'],
        )
    else:
        break
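For what it's worth, boto3 also ships a paginator for this API, which does the KeyMarker/VersionIdMarker bookkeeping for you. Here is a minimal sketch (the bucket and key names are placeholders, and the exact-match check is my own addition to drop sibling keys that merely share the prefix):
import boto3

client = boto3.client('s3')
bucket_name = 'my-bucket'      # placeholder
key_name = 'path/to/my-key'    # placeholder

paginator = client.get_paginator('list_object_versions')
for page in paginator.paginate(Bucket=bucket_name, Prefix=key_name):
    for version in page.get('Versions', []):
        # skip sibling keys that merely start with our key's name
        if version['Key'] == key_name:
            print(version['VersionId'], version['LastModified'])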

AWS supports getting all object versions by prefix, so you can just use your key as the prefix; it works fine, please try it.

You can use the AWS CLI to get a list of all versions in a bucket:
aws s3api list-object-versions --bucket bucketname
Using Python:
import boto3

session = boto3.Session(aws_access_key_id, aws_secret_access_key)
s3 = session.client('s3')
bucket_name = 'bucketname'
versions = s3.list_object_versions(Bucket=bucket_name)
print(versions.get('Versions'))
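If you only care about a single key, as in the original question, the same call accepts a Prefix, and you can then keep the exact matches. A short sketch reusing s3 and bucket_name from above ('path/to/key' is a placeholder, and pagination is ignored for brevity):
key_name = 'path/to/key'  # placeholder for the key you care about
versions = s3.list_object_versions(Bucket=bucket_name, Prefix=key_name)
matches = [v for v in versions.get('Versions', []) if v['Key'] == key_name]
print(matches)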

Related

Need to export the path/url of each file in Amazon S3 server

I have an Amazon S3 server filled with multiple buckets, each bucket containing multiple subfolders. There are easily 50,000 files in total. I need to generate an excel sheet that contains the path/url of each file in each bucket.
For example, if I have a bucket called b1, and it has a file called f1.txt, I want to be able to export the path of f1 as b1/f1.txt.
This needs to be done for every one of the 50,000 files.
I have tried using S3 browsers like Expandrive and Cyberduck; however, they require you to select each and every file to copy its URL.
I also tried exploring the boto3 library in Python; however, I did not come across any built-in functions to get the file URLs.
I am looking for any tool I can use, or even a script I can execute to get all the urls. Thanks.
Do you have access to the aws cli? aws s3 ls --recursive {bucket} will list all nested files in a bucket.
Eg this bash command will list all buckets, then recursively print all files in each bucket:
aws s3 ls | while read x y bucket; do aws s3 ls --recursive $bucket | while read x y z path; do echo $path; done; done
(the 'read's are just to strip off uninteresting columns).
NB: I'm using the v1 CLI.
What you should do is have a look again at the boto3 documentation, as it is what you are looking for. It is fairly simple to do what you are asking, but it may take you a bit of reading if you are new to it. Since there are multiple steps involved, I will try to steer you in the right direction.
In boto3 for S3 the method you are looking for is list_objects_v2(). This will give you the 'Key' (object path) of every object. You will notice that it returns the entire JSON blob for each object. Since you are only interested in the Key, you can target it the same way you would access key/values in a dict. E.g. list_objects_v2()['Contents'][0]['Key'] should return only the object path of the very first object.
If you've got that working the next step is to try to loop and get all values. You can either use a for loop to do this or there is an awesome python package I regularly use called jmespath - https://jmespath.org/
Here is how you can retrieve all object paths up to 1000 objects in one line.
import boto3
import jmespath

bucket_name = 'im-a-bucket'
s3_client = boto3.client('s3')
bucket_object_paths = jmespath.search('Contents[*].Key', s3_client.list_objects_v2(Bucket=bucket_name))
Now since your buckets may have more than 1000 objects, you will need to use the paginator to do this. Have a look at this to understand it.
How to get more than 1000 objects from S3 by using list_objects_v2?
Basically, only 1000 objects can be returned per call. To overcome this we use a paginator, which treats the 1000-object limit as page boundaries and lets you iterate over the pages in a for loop to get all the results you are looking for.
Once you get this working for one bucket, store the result in a variable which will be of type list and repeat for the rest of the buckets. Once you have all this data you could easily just copy paste it into an excel sheet or use python to do it. (Haven't tested the code snippets but they should work).
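Putting those pieces together, here is a minimal sketch of the whole flow, assuming your credentials can list every bucket (the output filename is a placeholder): paginate through each bucket and write the bucket/key paths to a CSV you can open in Excel.
import csv
import boto3

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')

with open('object_paths.csv', 'w', newline='') as f:  # placeholder filename
    writer = csv.writer(f)
    for bucket in s3_client.list_buckets()['Buckets']:
        for page in paginator.paginate(Bucket=bucket['Name']):
            for obj in page.get('Contents', []):
                # write "bucket/key", e.g. b1/f1.txt
                writer.writerow([bucket['Name'] + '/' + obj['Key']])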
Amazon S3 Inventory can help you with this use case.
Do evaluate that option; refer to https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html

Boto3, S3 check if keys exist

Right now I do know how to check if a single key exists within my S3 bucket using Boto 3:
res = s3.list_objects_v2(Bucket=record.bucket_name, Prefix='back.jpg', Delimiter='/')
for obj in res.get('Contents', []):
    print(obj)
However I'm wondering if it's possible to check if multiple keys exist within a single API call. It feels a bit of a waste to do 5+ requests for that.
You could either use head_object() to check whether a specific object exists, or retrieve the complete bucket listing using list_objects_v2() and then look through the returned list to check for multiple objects.
Please note that list_objects_v2() only returns 1000 objects at a time, so it might need several calls to retrieve a list of all objects.
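A minimal sketch of the second approach (the bucket and key names are placeholders): list the bucket once, via the paginator, then test membership for each key you care about.
import boto3

s3 = boto3.client('s3')
bucket = 'my-bucket'                            # placeholder
wanted = {'back.jpg', 'front.jpg', 'side.jpg'}  # placeholder keys

existing = set()
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket):
    existing.update(obj['Key'] for obj in page.get('Contents', []))

for key in wanted:
    print(key, 'exists' if key in existing else 'missing')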

When to use a boto3 client and when to use a boto3 resource?

I am trying to understand when I should use a Resource and when I should use a Client.
The definitions provided in boto3 docs don't really make it clear when it is preferable to use one or the other.
boto3.resource is a high-level service class that wraps boto3.client.
It is meant to attach to a connected resource (for example, a specific bucket) so you can later act on it without repeating the original resource identifier.
import boto3
s3 = boto3.resource("s3")
bucket = s3.Bucket('mybucket')
# now `bucket` is attached to the S3 bucket named "mybucket"
print(bucket)
# s3.Bucket(name='mybucket')
print(dir(bucket))
# shows all the actions you can perform on this resource
OTOH, boto3.client is low level: you don't have an "entry-class object", so you must explicitly specify the exact resource it connects to for every action you perform.
It depends on individual needs. However, boto3.resource doesn't wrap all of the boto3.client functionality, so sometimes you need to call boto3.client directly, or use boto3.resource.meta.client, to get the job done.
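As a concrete illustration of that last point, a resource exposes its underlying client via .meta.client, so you can drop down to low-level calls without creating a second client. A small sketch with 'mybucket' as a placeholder:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')  # placeholder bucket name

# high-level call through the resource
for obj in bucket.objects.limit(5):
    print(obj.key)

# low-level call through the same session's client
response = s3.meta.client.list_objects_v2(Bucket='mybucket', MaxKeys=5)
print(response.get('KeyCount'))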
If possible, use client over resource, especially when dealing with S3 object listings and then trying to get basic information on those objects themselves.
For 10,000 objects, client calls S3 10,000/1000 = 10 times and gives you a lot of information on each object in each call.
Resource, I assume, calls S3 10,000 times (or maybe the same as client?), but if you take an object and try to do something with it, that is probably another call to S3, making this about 20x slower than client.
My test revealed the following results.
s3 = boto3.resource("s3")
s3bucket = s3.Bucket(myBucket)
s3obj_list = s3bucket.objects.filter(Prefix=key_prefix)
tmp_list = [s3obj.key for s3obj in s3obj_list]
(tmp_list = [s3obj for s3obj in s3obj_list] gives same ~9min results)
When trying to get a list of 150,000 files, took ~9 minutes. If s3obj_list is indeed pulling 1000 files a call and buffering it, s3obj.key is probably not part of it and makes another call.
The client-side loop looks like this (ContinuationToken drives the pagination):
response = client.list_objects_v2(Bucket=bucket, Prefix=prefix)
while True:
    # process response["Contents"] here
    ...
    if response.get("IsTruncated"):
        response = client.list_objects_v2(
            Bucket=bucket,
            Prefix=prefix,
            ContinuationToken=response["NextContinuationToken"],
        )
    else:
        break
Client took ~30 seconds to list the 150,000 files.
I don't know if resource buffers 1000 files at a time but if it doesn't that is a problem.
I also don't know if it is possible for resource to buffer the information attached to the object, but that is another problem.
I also don't know if using pagination could make client faster/easier to use.
Anyone who knows the answers to the three questions above, please share; I'd be very interested to know.
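On the pagination question, at least: boto3 ships a built-in paginator that wraps exactly the ContinuationToken loop above, so it is easier to use even if it presumably makes the same number of calls. A sketch reusing the client, bucket and prefix placeholders from above:
paginator = client.get_paginator('list_objects_v2')
keys = []
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    keys.extend(obj['Key'] for obj in page.get('Contents', []))
print(len(keys))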

python boto for aws s3, how to get sorted and limited files list in bucket?

If there are too many files in a bucket and I want to get only the 100 newest files,
how can I get only that list?
s3.bucket.list does not seem to have that function. Does anybody know how to do this?
Please let me know. Thanks.
There is no way to do this type of filtering on the service side. The S3 API does not support it. You might be able to accomplish something like this by using prefixes in your object names. For example, if you named all of your objects using a pattern like this:
YYYYMMDD/<objectname>
20140618/foobar (as an example)
you could use the prefix parameter of the ListBucket request in S3 to return only the objects that were stored today. In boto, this would look like:
import boto
s3 = boto.connect_s3()
bucket = s3.get_bucket('mybucket')
for key in bucket.list(prefix='20140618'):
    # do something with the key object
    pass
You would still have to retrieve all of the objects with that prefix and then sort them locally based on their last_modified timestamps, but that would be much easier than listing all of the objects in the bucket and then sorting.
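For instance, a sketch of that local sort in legacy boto, reusing the bucket and example prefix from above (the ISO-format last_modified strings sort correctly as plain strings):
keys = list(bucket.list(prefix='20140618'))
# newest first; take however many you need, e.g. the 100 newest
newest_first = sorted(keys, key=lambda k: k.last_modified, reverse=True)
for key in newest_first[:100]:
    print(key.name)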
The other option would be to store metadata about the S3 objects in a database like DynamoDB and then query that database to find the objects to retrieve from S3.
You can find out more about hierarchical listing in S3 here
Try this code; it worked for me.
import boto, time

con = boto.connect_s3()
key_repo = []
bucket = con.get_bucket('<your bucket name>')
bucket_keys = bucket.get_all_keys()
for obj in bucket_keys:
    t = (obj.key, time.strptime(obj.last_modified[:19], "%Y-%m-%dT%H:%M:%S"))
    key_repo.append(t)
key_repo.sort(key=lambda item: item[1], reverse=True)
for key in key_repo[:10]:  # top 10 items in the list
    print key[0], ' ', key[1]
PS: I am a beginner in Python, so the code might not be optimized. Feel free to edit the answer to provide better code.

Is there a way to build AWS S3 bucket endpoints automatically, regardless of region?

So I have an app in Node that accesses stuff in buckets. I want it to be able to use buckets in any region, transparently. Unfortunately, the way of building the URL for the endpoint differs based on what region you're in.
If it's in US-Standard, I can say http://s3.amazonaws.com/BUCKETNAME/path/to/file. If it's anywhere else, that doesn't work (non-coincidentally, you're limited to domain-allowed characters (lowercase and numbers only) for bucket names in non-US Standard) and you use http://BUCKETNAME.s3.amazonaws.com/path/to/file.
(Note you can get more complicated and say
I'm thinking this is not a unique problem, so I want to put it out there.
http://bucketname.s3.amazonaws.com/path/to/file works in US-Standard also, so you should be able to use this single construct on any bucket anywhere (unless I'm missing something in your question).
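If it helps, here is a tiny sketch of that single construct in Python for illustration (the bucket and key are placeholders; quote() is there because keys can contain characters that need URL-encoding, and it leaves '/' alone):
from urllib.parse import quote

def object_url(bucket_name, key):
    # virtual-hosted-style URL, as described above
    return "http://{}.s3.amazonaws.com/{}".format(bucket_name, quote(key))

print(object_url('b1', 'path/to/file'))
# http://b1.s3.amazonaws.com/path/to/file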