python boto for aws s3, how to get sorted and limited files list in bucket? - amazon-web-services

There are many files in a bucket, and I want to get only the 100 newest files.
How can I get just that list?
s3.bucket.list does not seem to have that functionality. Does anyone know how to do this?
Please let me know. Thanks.

There is no way to do this type of filtering on the service side. The S3 API does not support it. You might be able to accomplish something like this by using prefixes in your object names. For example, if you named all of your objects using a pattern like this:
YYYYMMDD/<objectname>
20140618/foobar (as an example)
you could use the prefix parameter of the ListBucket request in S3 to return only the objects that were stored today. In boto, this would look like:
import boto

s3 = boto.connect_s3()
bucket = s3.get_bucket('mybucket')
for key in bucket.list(prefix='20140618'):
    # do something with the key object, e.g.:
    print(key.name)
You would still have to retrieve all of the objects with that prefix and then sort them locally based on their last_modified attribute, but that is much easier than listing all of the objects in the bucket and then sorting.
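As a rough sketch of that local sort, reusing the date-based prefix above and the 100-object limit from the question:

import boto

s3 = boto.connect_s3()
bucket = s3.get_bucket('mybucket')

# Pull every key under today's prefix into a list
keys = list(bucket.list(prefix='20140618'))

# last_modified is an ISO 8601 string, so sorting it lexicographically
# is the same as sorting chronologically; newest first
keys.sort(key=lambda k: k.last_modified, reverse=True)

for key in keys[:100]:
    print(key.name, key.last_modified)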
The other option would be to store metadata about the S3 objects in a database like DynamoDB and then query that database to find the objects to retrieve from S3.
You can find out more about hierarchical listing in S3 (listing keys using a prefix and delimiter) in the S3 documentation.
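As a very rough sketch of the DynamoDB option mentioned above (using boto3; the table name, key schema, and attribute names here are all hypothetical, not anything S3 or DynamoDB defines for you):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('s3-object-index')  # hypothetical table

# Record each object's key and upload time whenever you write to S3
table.put_item(Item={
    'bucket': 'mybucket',                    # partition key (assumed schema)
    'uploaded_at': '2014-06-18T12:34:56Z',   # sort key (assumed schema)
    's3_key': '20140618/foobar',
})

# Query the 100 newest entries for the bucket, newest first
response = table.query(
    KeyConditionExpression=Key('bucket').eq('mybucket'),
    ScanIndexForward=False,  # descending order on the sort key
    Limit=100,
)
newest_keys = [item['s3_key'] for item in response['Items']]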

Can you try this code? It worked for me.
import boto
import time

con = boto.connect_s3()
key_repo = []
bucket = con.get_bucket('<your bucket name>')
bucket_keys = bucket.get_all_keys()  # note: returns at most 1000 keys per call
for bucket_key in bucket_keys:
    t = (bucket_key.key, time.strptime(bucket_key.last_modified[:19], "%Y-%m-%dT%H:%M:%S"))
    key_repo.append(t)
key_repo.sort(key=lambda item: item[1], reverse=True)
for key in key_repo[:10]:  # top 10 items in the list
    print(key[0], key[1])
PS: I am a beginner in Python, so the code might not be optimized. Feel free to edit the answer to provide better code.

Related

Need to export the path/url of each file in Amazon S3 server

I have an Amazon S3 server filled with multiple buckets, each bucket containing multiple subfolders. There are easily 50,000 files in total. I need to generate an excel sheet that contains the path/url of each file in each bucket.
For eg, If I have a bucket called b1, and it has a file called f1.txt, I want to be able to export the path of f1 as b1/f1.txt.
This needs to be done for every one of the 50,000 files.
I have tried using S3 browsers like Expandrive and Cyberduck; however, they require you to select each and every file to copy its URL.
I also tried exploring the boto3 library in Python, but I did not come across any built-in functions to get the file URLs.
I am looking for any tool I can use, or even a script I can execute, to get all the URLs. Thanks.
Do you have access to the aws cli? aws s3 ls --recursive {bucket} will list all nested files in a bucket.
Eg this bash command will list all buckets, then recursively print all files in each bucket:
aws s3 ls | while read x y bucket; do aws s3 ls --recursive $bucket | while read x y z path; do echo $path; done; done
(The 'read's are just there to strip off the uninteresting columns.)
NB: I'm using the v1 CLI.
What you should do is have another look at the boto3 documentation, as it has what you are looking for. It is fairly simple to do what you are asking, but it may take a bit of reading if you are new to it. Since there are multiple steps involved, I will try to steer you in the right direction.
In boto3, the S3 method you are looking for is list_objects_v2(). This will give you the 'Key', or object path, of every object. You will notice that it returns an entire JSON blob for each object. Since you are only interested in the Key, you can target it the same way you would access key/values in a dict. E.g. list_objects_v2()['Contents'][0]['Key'] should return only the object path of the very first object.
If you've got that working, the next step is to loop and get all the values. You can either use a for loop to do this, or use an awesome Python package I regularly use called jmespath - https://jmespath.org/
Here is how you can retrieve all object paths up to 1000 objects in one line.
import boto3
import jmespath

bucket_name = 'im-a-bucket'
s3_client = boto3.client('s3')
bucket_object_paths = jmespath.search('Contents[*].Key', s3_client.list_objects_v2(Bucket=bucket_name))
Now, since your buckets may have more than 1000 objects, you will need to use the paginator to do this. Have a look at this question to understand it:
How to get more than 1000 objects from S3 by using list_objects_v2?
Basically, only 1000 objects can be returned per call. To overcome this we use a paginator, which treats the 1000-object limit as page boundaries, so you just need to loop over its pages to get all the results you are looking for.
Once you get this working for one bucket, store the result in a variable (it will be of type list) and repeat for the rest of the buckets. Once you have all this data, you could easily just copy and paste it into an Excel sheet, or use Python to write it out, as sketched below. (I haven't tested the code snippets, but they should work.)
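A rough sketch of that paginator approach which also writes the paths out to a CSV you can open in Excel (the output file name is just a placeholder):

import csv
import boto3

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')

with open('s3_paths.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['path'])
    # Walk every bucket, then every page of objects in that bucket
    for bucket in s3_client.list_buckets()['Buckets']:
        name = bucket['Name']
        for page in paginator.paginate(Bucket=name):
            for obj in page.get('Contents', []):
                writer.writerow([name + '/' + obj['Key']])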
Amazon S3 Inventory can help you with this use case.
Do evaluate that option. Refer to: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html

AWS Lambda create folder in S3 bucket

I have a Lambda that runs when files are uploaded to bucket S3-A and moves those files to another bucket, S3-B. The challenge is that I need to create a folder inside the S3-B bucket named after the date the files were uploaded, and move the files into that folder. Any help or ideas are greatly appreciated. It might sound confusing, so feel free to ask questions. Thank you!
Here's a Lambda function that can be triggered by an Amazon S3 Event and move the object to another bucket:
import urllib.parse
from datetime import date

import boto3

DEST_BUCKET = 'bucket-b'

def lambda_handler(event, context):
    s3_client = boto3.client('s3')
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
    dest_key = str(date.today()) + '/' + key
    s3_client.copy_object(
        Bucket=DEST_BUCKET,
        Key=dest_key,
        CopySource=f'{bucket}/{key}'
    )
The only thing to consider is timezones. The Lambda function runs in UTC and you might be expecting a slightly different date in your timezone, so you might need to adjust the time accordingly.
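For example, a minimal sketch of pinning the date to a specific timezone instead of the Lambda's UTC clock (the timezone name here is just an example):

from datetime import datetime
from zoneinfo import ZoneInfo  # available on Python 3.9+ runtimes

key = 'example.txt'  # the object key from the event, as in the handler above

# Use the local date in a chosen timezone rather than UTC
local_date = datetime.now(ZoneInfo('Australia/Sydney')).date()
dest_key = str(local_date) + '/' + key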
Just to clear up some confusion: in S3 there is no such thing as a folder. What you see in the interface is actually the result of running ListObjects with a prefix. The prefix is what you are seeing as the folder hierarchy.
To help illustrate this, an object might have a key (which is a piece of metadata that defines its name) of folder/subfolder/file.txt; in the console you're actually browsing with a prefix of folder/subfolder/. This makes sense if you think of S3 more like a key-value store, where the value is the object itself.
For this reason, you can create a key under a prefix that has never existed before, without creating any other hierarchical structure first.
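A small sketch of how this looks through the API; listing with a Prefix (and optionally a Delimiter) is what gives the console its folder view (the bucket and key names are made up):

import boto3

s3_client = boto3.client('s3')

# List 'inside' folder/subfolder/ by using it as a prefix
response = s3_client.list_objects_v2(
    Bucket='my-bucket',
    Prefix='folder/subfolder/',
    Delimiter='/',
)
for obj in response.get('Contents', []):
    print(obj['Key'])        # e.g. folder/subfolder/file.txt
for cp in response.get('CommonPrefixes', []):
    print(cp['Prefix'])      # any 'subfolders' one level down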
In your Lambda function, you will need to download the files locally and then upload them to their new object key (remembering to delete the old object). Some SDKs have an automated function that performs all of these steps for you (such as Boto3 with the copy function), as sketched below.
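A hedged sketch of that move using boto3's managed copy followed by a delete (the bucket names, keys, and dated prefix are placeholders):

import boto3

s3 = boto3.resource('s3')

source = {'Bucket': 'bucket-a', 'Key': 'incoming/file.txt'}
dest_key = '2021-01-01/incoming/file.txt'  # hypothetical dated 'folder'

# copy() handles the download/upload (including multipart transfers) for you
s3.Bucket('bucket-b').copy(source, dest_key)

# Then remove the original to complete the 'move'
s3.Object('bucket-a', 'incoming/file.txt').delete()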

When to use S3 API pagination

I'm using the boto3 client to access data stored in an Amazon S3 bucket. After reading the docs, I see that I can make a request with this code:
s3 = boto3.resource('s3')
bucket = s3.Bucket(TARGET_BUCKET)
for obj in bucket.objects.filter(Bucket=TARGET_BUCKET, Prefix=TARGET_KEYS + KEY_SEPARATOR):
    print(obj)
I tested against a bucket where I've stored 3000 objects, and this fragment of code retrieves references to all of them. I've read that all the API calls to S3 return at most 1000 entries.
But reading the paginator section of the boto3 documentation, I see that some S3 operations need pagination to retrieve all their results. I don't understand why the code above works unless it is using a paginator under the hood. And this is my question: can I safely assume that the code above will always retrieve all the results?
According to the documentation here, the pagination is handled for you.
A collection provides an iterable interface to a group of resources.
Collections behave similarly to Django QuerySets and expose a similar
API. A collection seamlessly handles pagination for you, making it
possible to easily iterate over all items from all pages of data.
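If you ever drop down to the low-level client instead of the resource collection, the explicit-pagination equivalent would look roughly like this (the bucket and prefix values are placeholders standing in for the ones in the question):

import boto3

TARGET_BUCKET = 'my-bucket'      # placeholder
TARGET_PREFIX = 'some/prefix/'   # placeholder

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')

# The paginator issues repeated list_objects_v2 calls, following the
# continuation token until every matching object has been returned
for page in paginator.paginate(Bucket=TARGET_BUCKET, Prefix=TARGET_PREFIX):
    for obj in page.get('Contents', []):
        print(obj['Key'])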

Does emrfs support custom query parameters in s3 url?

Is it possible to add custom query parameters to an S3 URL?
We would like to add some custom metadata to S3 objects, but we would like it to be transparent to EMRFS.
Something like:
s3://bucket-name/object-name?x-amz-meta-tag=magic-tag
Then in our PySpark or hadoop job, we would like to write:
data.write.csv('s3://bucket-name/object-name?x-amz-meta-tag=magic-tag')
Trying this on EMRFS shows that it treats "object-name?x-amz-meta-tag=magic-tag" as the entire object name instead of ignoring the query parameters.
I can't speak for the closed-source EMRFS, but for the ASF S3 connectors the answer is "no". Interesting proposal, though; maybe you should think about contributing it to the ASF. Of course, that adds a new problem: what if existing users are creating files with ? in their names; how would you retain compatibility?

Get all versions of an object in an AWS S3 bucket?

I've enabled object versioning on a bucket. I want to get all versions of a key inside that bucket. But I cannot find a method to do this; how would one accomplish it using the S3 APIs?
So, I ran into this brick wall this morning. This seemingly trivial thing is incredibly difficult to do, it turns out.
The API you want is the GET Bucket Object versions API, but it is sadly non-trivial to use.
First, you have to steer clear of some non-solutions: KeyMarker, which is documented by boto3 as,
KeyMarker (string) -- Specifies the key to start with when listing objects in a bucket.
…does not start with the specified key when listing objects in a bucket; rather, it starts immediately after that key, which makes it somewhat useless here.
The best restriction this API provides is Prefix; this isn't going to be perfect, since there could be keys that are not our key of interest that nonetheless contain our key.
Also beware of MaxKeys; it is tempting to think that, lexicographically, our key should be first, and all keys which have our key as a prefix of their key name would follow, so we could trim them using MaxKeys; sadly, MaxKeys controls not how many keys are returned in the response, but rather the number of versions. (And I'm going to presume that isn't known in advance.)
So, Prefix is the best it seems that can be done. Also note that, at least in some languages, the client library will not handle pagination for you, so you'll additionally need to deal with that.
As an example in boto3:
response = client.list_object_versions(
    Bucket=bucket_name, Prefix=key_name,
)
while True:
    # Process `response`
    ...
    # Check if the results got paginated:
    if response['IsTruncated']:
        response = client.list_object_versions(
            Bucket=bucket_name, Prefix=key_name,
            KeyMarker=response['NextKeyMarker'],
            VersionIdMarker=response['NextVersionIdMarker'],
        )
    else:
        break
AWS supports getting all object versions by prefix, so you can just use your key as the prefix; it works fine, please try it.
You can use the AWS CLI to get a list of all versions in a bucket:
aws s3api list-object-versions --bucket bucketname
Using Python:
import boto3

session = boto3.Session(aws_access_key_id, aws_secret_access_key)
s3 = session.client('s3')
bucket_name = 'bucketname'
versions = s3.list_object_versions(Bucket=bucket_name)
print(versions.get('Versions'))
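Since list_object_versions only narrows results by prefix, you may want to filter the response down to the exact key. Building on the snippet above (the key name is a placeholder):

key_name = 'path/to/my-object.txt'  # placeholder

versions = s3.list_object_versions(Bucket=bucket_name, Prefix=key_name)
exact_versions = [v for v in versions.get('Versions', []) if v['Key'] == key_name]
for v in exact_versions:
    print(v['VersionId'], v['LastModified'], v['IsLatest'])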