AS the title says, what is a key in boto?
What does it encapsulate (fields, data structures, methods etc.)?
How does one access the file contents for files in an AWS bucket using a key/boto?
I was not able to find this information on their official documentation or on any other third party website. Could anybody provide this info?
Here are some examples of the usage of the key object:
def download_file(key_name, storage):
    key = bucket.get_key(key_name)
    try:
        storage.append(key.get_contents_as_string())
    except:
        print "Some error message."
and:
for key in keys_to_process:
    pool.spawn_n(download_file, key.key, file_contents)
pool.waitall()
In your code example, key is the object reference to the unique identifier within a bucket.
Think of a bucket as a table in a database, and think of keys as the rows in that table; you reference a key (better known as an object) within the bucket.
In boto (not boto3), it often works like this:
from boto.s3.connection import S3Connection
connection = S3Connection() # assumes you have a .boto or boto.cfg setup
bucket = connection.get_bucket('my_bucket_name_here')  # like the table name in SQL: SELECT object FROM tablename
key = bucket.get_key('my_key_name_here')  # this is the OBJECT in the SQL analogy above
Key names are just strings, and there is a convention that says if you put a '/' in the name, a viewer/tool should treat it like a path/folder for the user. For example, my/object_name/is_this is really just a key inside the bucket, but most viewers will show a my folder containing an object_name folder, and then what looks like a file called is_this, simply by UI convention.
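To answer the "how do I access the file contents" part: once you have a key object, boto's Key class has methods that read the object's data. A minimal sketch, reusing the hypothetical bucket and key names above:
from boto.s3.connection import S3Connection

connection = S3Connection()                       # assumes credentials in .boto / boto.cfg
bucket = connection.get_bucket('my_bucket_name_here')
key = bucket.get_key('my_key_name_here')          # returns None if the key does not exist

if key is not None:
    data = key.get_contents_as_string()               # the object's contents as a string of bytes
    key.get_contents_to_filename('/tmp/local_copy')   # or download straight to a local file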
Since you appear to be talking about Simple Storage Service (S3), you'll find that information on Page 1 of the S3 documentation.
Each object is stored and retrieved using a unique developer-assigned key.
A key is the unique identifier for an object within a bucket. Every object in a bucket has exactly one key. Because the combination of a bucket, key, and version ID uniquely identify each object, Amazon S3 can be thought of as a basic data map between "bucket + key + version" and the object itself. Every object in Amazon S3 can be uniquely addressed through the combination of the web service endpoint, bucket name, key, and optionally, a version. For example, in the URL http://doc.s3.amazonaws.com/2006-03-01/AmazonS3.wsdl, "doc" is the name of the bucket and "2006-03-01/AmazonS3.wsdl" is the key.
http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html
The key is just a string -- the "path and filename" of the object in the bucket, without a leading /.
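For example, with boto3, that string together with the bucket name is all you need to fetch the object. A small sketch (the bucket name and key below are made up):
import boto3

s3 = boto3.client('s3')
# 'my-bucket' and 'path/to/report.csv' are placeholder names; note there is no leading '/'.
response = s3.get_object(Bucket='my-bucket', Key='path/to/report.csv')
body = response['Body'].read()   # the object's contents as bytes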
Related
According to the ListObjectsV2 - Amazon Simple Storage Service documentation, when I specify a Prefix and a Delimiter, I should get a Contents element in the response with an ETag for the prefix.
<Contents>
    <Key>photos/2006/</Key>
    <LastModified>2016-04-30T23:51:29.000Z</LastModified>
    <ETag>"d41d8cd98f00b204e9800998ecf8427e"</ETag>
    <Size>0</Size>
    <StorageClass>STANDARD</StorageClass>
</Contents>
I have tried to run this using the python sdk (boto3).
client.list_objects_v2(Bucket='bucketname', Prefix = "folder1-folder2-", Delimiter = "-")
But in the response dict, I don't find a Contents key. All the other fields from the example response are present.
dict_keys(['ResponseMetadata', 'IsTruncated', 'Name', 'Prefix', 'Delimiter', 'MaxKeys', 'CommonPrefixes', 'EncodingType', 'KeyCount'])
Is this something that is no longer in the response of the API call, or is it something the SDK doesn't show?
And a follow-up question: if it is something on the SDK side, how do I make an API call that returns this field?
When a Prefix and a Delimiter are provided, the directories within that Prefix are returned in CommonPrefixes.
So, if there is an object called folder1-folder2-folder3-file.txt, then your return response should contain a CommonPrefixes list that includes folder3-.
Since you are using boto3, it's easier to look at the boto3 documentation for list_objects_v2(). It shows how the fields are provided in the response.
You can access values like this:
response = s3_client.list_objects_v2(Bucket='bucketname', Prefix='folder1-folder2-', Delimiter='-')

# Objects
for object in response['Contents']:
    print(object['Key'])

# Folders
for folder in response['CommonPrefixes']:
    print(folder['Prefix'])
When a user clicks Create Folder in the Amazon S3 management console, it creates a zero-length object with the same name as the 'folder'. This is because Amazon S3 does not actually use folders, but it can simulate them via Delimiter and CommonPrefixes. By creating a zero-length object, it forces that folder name to appear as a CommonPrefix. It also causes the zero-length object itself to appear as an object in the list_objects() API call.
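If those zero-length "folder" markers are not wanted in a listing, one option is to skip any object whose key ends with the delimiter. A sketch, using the same hypothetical bucket and prefix names as above:
import boto3

s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket='bucketname', Prefix='folder1-folder2-', Delimiter='-')

for obj in response.get('Contents', []):   # 'Contents' is absent when nothing matches
    if obj['Key'].endswith('-'):           # skip zero-length markers created by "Create Folder"
        continue
    print(obj['Key'], obj['Size'])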
Is it possible to use AWS Athena to query S3 Object Tagging? For example, if I have an S3 layout such as this
bucketName/typeFoo/object1.txt
bucketName/typeFoo/object2.txt
bucketName/typeFoo/object3.txt
bucketName/typeBar/object1.txt
bucketName/typeBar/object2.txt
bucketName/typeBar/object3.txt
And each object has an S3 Object Tag such as this
#For typeFoo/object1.txt and typeBar/object1.txt
id=A
#For typeFoo/object2.txt and typeBar/object2.txt
id=B
#For typeFoo/object3.txt and typeBar/object3.txt
id=C
Then is it possible to run an AWS Athena query to get any object with the associated tag such as this
select * from myAthenaTable where tag.id = 'A'
# returns typeFoo/object1.txt and typeBar/object1.txt
This is just an example and doesn't reflect my actual S3 bucket/object-prefix layout. Feel free to use any layout you wish in your answers/comments.
Ultimately I have a plethora of objects that could be in different buckets and folder paths, but they are related to each other, and my goal is to tag them so that I can query for a particular id value and get all objects related to that id. The id value would be a GUID, and that GUID would map to many different types of related objects. For example, I could have a video file, a picture file, a metadata file, and a JSON file, and I want to get all of those files using their common id value. Please feel free to offer suggestions too, because I have the ability to structure this as I see fit.
Update - Note
S3 Object Metadata and S3 Object Tagging are two different things.
Athena does not support querying based on S3 object tags.
One workaround is to create a meta file that contains the tag-to-file mapping and maintain it with a Lambda function: whenever a new file arrives in S3, the Lambda updates a mapping file in S3 with the tag and key details.
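A rough sketch of such a Lambda handler, assuming an S3 "object created" trigger. The manifest bucket and key names are placeholders, and concurrent-update issues are ignored:
import json
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client('s3')
MANIFEST_BUCKET = 'my-manifest-bucket'   # placeholder bucket for the mapping file
MANIFEST_KEY = 'tag-manifest.json'       # placeholder name for the mapping file

def handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = unquote_plus(record['s3']['object']['key'])   # keys arrive URL-encoded in events

        # Read the new object's tags.
        tags = s3.get_object_tagging(Bucket=bucket, Key=key)['TagSet']

        # Load the existing manifest, or start a new one.
        try:
            body = s3.get_object(Bucket=MANIFEST_BUCKET, Key=MANIFEST_KEY)['Body'].read()
            manifest = json.loads(body)
        except s3.exceptions.NoSuchKey:
            manifest = {}

        manifest['s3://{}/{}'.format(bucket, key)] = {t['Key']: t['Value'] for t in tags}

        s3.put_object(Bucket=MANIFEST_BUCKET, Key=MANIFEST_KEY,
                      Body=json.dumps(manifest).encode('utf-8'))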
I have a bucket with the following key structure:
path/to/file1
path/to/file2
path/of/file3
path/of/file4
And I would like to be able to get the list of "folders" inside path. The actual use case has many "subfolders", so I need to filter the listing. Ideally, I only want to receive two entries: to and of.
Using boto3, I was expecting the two following calls being basically equal, i.e. that the listing of both yields the same result:
Using the bucket returned by the S3 resource
s3 = boto3.resource('s3')
bucket = s3.Bucket('bucketname')
bucket.objects.filter(Prefix='path/', Delimiter='/').all()
and the underlying client
s3 = boto3.resource('s3')
s3.meta.client.list_objects(Bucket='bucketname', Prefix='path/', Delimiter='/')
However, the first returns an empty list, while the second returns a JSON response whose CommonPrefixes key contains the two entries.
Question: What am I missing?
from https://github.com/boto/boto3/issues/134#issuecomment-116766812
The reason that it is not included in the list of objects returned is
that the values that you are expecting when you use the delimiter are
prefixes (e.g. Europe/, North America) and prefixes do not map
into the object resource interface.
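In other words, to get the "folder" names you have to go through the client rather than the Bucket resource. A minimal sketch, assuming the same bucket layout as in the question:
import boto3

client = boto3.client('s3')

paginator = client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='bucketname', Prefix='path/', Delimiter='/')

for page in pages:
    for prefix in page.get('CommonPrefixes', []):
        print(prefix['Prefix'])   # e.g. 'path/to/' and 'path/of/'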
I have to store lots of photos (over 1,000,000, each up to 5 MB), and I have a database where every record has 5 photos. What is the best solution:
Create a directory for each record's slug/id, and upload the photos inside it
Put all photos into one directory, and include the record's id or slug in each name
Put all photos into one directory, and add a field with the photo names to each record in the database
I use Amazon S3.
I would suggest naming your photos like this when uploading in batch:
user1/image1.jpeg
user2/image2.jpeg
These names do not affect the way objects are stored on S3; they are simply the 'keys' of the 'objects', as there is no folder-like hierarchical structure in S3. But naming them this way will make the objects appear in folders, which helps segregate images easily if you want to do so later.
For example, suppose you stored all images under unique names and you are using a unique UUID to map records in the database to images in your bucket.
Now suppose you later want all 5 photos of a particular user. What you would have to do is:
scan the database for the particular username
retrieve the UUIDs for that user's images
and then fetch the images from S3 using those UUIDs
But if you name images by prefixing the username, you can fetch images directly from S3 without making any reference to your database.
For example, to list all photos of user1, you can use this small code snippet in Python:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket_name')
for obj in bucket.objects.filter(Prefix='user1/'):
    print(obj.key)
Whereas if you don't use any user id in the object key, then you have to refer to the database to map photos to records, even just to get a list of a particular user's images.
A lot of this depends on your use-case, such as how the database and the photos will be used. There is not enough information here to give a definitive answer.
However, some recommendations for the storage side...
The easiest option is just to use a UUID for each photo. This is effectively a random name that has no meaning. Store that name in your database and your system will know which image relates to which record. There is no need to ever rename the images because the names are just Unique IDs and convey no further information.
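A minimal sketch of that approach (the bucket name and local file path are placeholders, and the database write is only indicated in a comment):
import uuid
import boto3

s3 = boto3.client('s3')

def store_photo(local_path, record_id):
    photo_key = str(uuid.uuid4()) + '.jpg'                     # random, meaningless object name
    s3.upload_file(local_path, 'my-photo-bucket', photo_key)   # 'my-photo-bucket' is a placeholder
    # Store photo_key against record_id in your database here, e.g.
    # UPDATE records SET photo_key = ... WHERE id = record_id
    return photo_key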
When you want to provide access to a particular image, your application can generate an Amazon S3 pre-signed URL that grants time-limited access to an object. After the expiry time, the URL does not work so the object remains private. Granting access in this manner means that there is no need to group images into directories by "owner", since access is granted per-object rather than per-owner.
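A sketch of generating such a URL with boto3 (the bucket and key names are placeholders; the link below stops working after one hour):
import boto3

s3 = boto3.client('s3')
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'my-photo-bucket', 'Key': 'e5f1c2d4-example.jpg'},   # placeholder names
    ExpiresIn=3600,   # seconds
)
print(url)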
Also, please note that Amazon S3 doesn't actually support folders. Rather, the Key ("filename") of the object is the entire path (e.g. user-2/foo.jpg). This makes it more human-readable (because the objects 'appear' to be in folders), but doesn't actually impact the way data is stored behind the scenes.
Bottom line: It doesn't really matter how you store the images. What matters is that you store the image name in your database so you know which image matches which record. Avoid situations where you need to rename images - just give them a name and keep it.
Boto's S3 Key object contains last_modified date (available via parse_ts) but the base_field "date" (i.e., ctime) doesn't seem to be accessible, even though it's listed in key.base_fields.
Based on the table at http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html, it does seem that it is always automatically created (and I can't imagine a reason why it wouldn't be). It's probably just a simple matter of finding it somewhere in the object attributes, but I haven't been able to find it so far, although I did find the base_fields attribute, which contains 'date'. (It's just a set and doesn't seem to have any available methods, and I haven't been able to find documentation on ways to inspect it.)
For example, Amazon S3 maintains object creation date and size metadata and uses this information as part of object management.
Interestingly, create_time (system metadata field "Date" in link above) does not show up in the AWS S3 console, either, although last_modified is visible.
TL;DR: Because overwriting an S3 object is essentially creating a new one, the "last modified" and "creation" timestamp will always be the same.
Answering the old question, just in case others run into the same issue.
Amazon S3 maintains only the last modified date for each object.
For example, the Amazon S3 console shows the Last Modified date in the object Properties pane. When you initially create a new object, this date reflects the date the object is created. If you replace the object, the date changes accordingly. So when we use the term creation date, it is synonymous with the term last modified date.
Reference: https://docs.aws.amazon.com/AmazonS3/latest/dev/intro-lifecycle-rules.html
I suggest using key.last_modified, since key.date seems to return the last time you viewed the file.
So something like this:
key = bucket.get_key(key.name)
print(key.last_modified)
After additional research, it appears that S3 key objects returned from a list() may not include this metadata field!
The Key objects returned by the iterator are obtained by parsing the results of a GET on the bucket, also known as the List Objects request. The XML returned by this request contains only a subset of the information about each key. Certain metadata fields such as Content-Type and user metadata are not available in the XML. Therefore, if you want these additional metadata fields you will have to do a HEAD request on the Key in the bucket. (docs)
In other words, looping through keys:
for key in conn.get_bucket(bucket_name).list():
    print(key.date)
... does not return the complete key with creation date and some other system metadata. (For example, it's also missing ACL data).
Instead, to retrieve the complete key metadata, use this method:
key = bucket.get_key(key.name)
print (key.date)
This necessitates an additional HTTP request as the docs clearly state above. (See also my original issue report.)
Additional code details:
import boto
# get connection
conn = boto.connect_s3()
# get first bucket
bucket = conn.get_all_buckets()[0]
# get first key in first bucket
key = list(bucket.list())[0]
# get create date if available
print (getattr(key, "date", False))
# (False)
# access key via bucket.get_key instead:
k = bucket.get_key(key.name)
# check again for create_date
getattr(k, "date", False)
# 'Sat, 03 Jan 2015 22:08:13 GMT'
# Wait, that's the current UTC time..?
# Also print last_modified...
print (k.last_modified)
# 'Fri, 26 Apr 2013 02:41:30 GMT'
If you have versioning enabled for your S3 bucket, you can use list_object_versions and find the smallest date for the object you're looking for, which should be the date it was created.
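A sketch with boto3 (the bucket and key names are placeholders; this only works on a versioned bucket):
import boto3

s3 = boto3.client('s3')

# 'my-bucket' and 'path/to/file' are placeholder names.
paginator = s3.get_paginator('list_object_versions')
versions = []
for page in paginator.paginate(Bucket='my-bucket', Prefix='path/to/file'):
    versions.extend(v for v in page.get('Versions', []) if v['Key'] == 'path/to/file')

if versions:
    created = min(v['LastModified'] for v in versions)
    print('Oldest version (approximate creation date):', created)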