AWS S3 Object search - amazon-web-services

I have created an S3 bucket and uploaded about 50,000 objects, which are PDF files. The PDFs are saved as the customer's first name, last name, and zip code (for example, John Smith 90005.pdf). My issue is that S3 objects can be searched by prefix only. For example, you can search for John Smith by typing John (first name) and hitting enter, which brings up all customers with the first name John. If a customer's last name is John, he will not appear in the search. You can't find John Smith by typing only Smith (last name) or the zip code, since the search is prefix-only. How can I search for a customer by last name or zip code? I can't use AWS Athena since my files are PDFs. Any suggestions?
I tried using AWS Athena to query S3, but it didn't work.

It is not possible to use the Amazon S3 management console to 'search' for objects by a partial name match.
The API call to Amazon S3 that lists the objects only returns a maximum of 1000 objects. Thus, listing objects in large buckets can be quite slow, and the API calls do not support 'search' (only Prefix, as you mention).
If you wish to 'search' for objects in an S3 bucket, it is better to maintain your own index/database of the objects.
You can also activate Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You can then use this information in a program to locate objects.
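For illustration, here is a minimal sketch (in Python) of searching a downloaded inventory report for a last name or zip code. It assumes the inventory was configured for CSV output and that the object key is the second column; the exact layout depends on the fields selected when configuring the inventory:

import csv

def search_inventory(csv_path, term):
    # S3 Inventory CSV rows have no header row; the bucket name is the first
    # column and the object key the second (assumed layout).
    matches = []
    with open(csv_path, newline='') as f:
        for row in csv.reader(f):
            key = row[1]
            if term.lower() in key.lower():  # substring match on the key
                matches.append(key)
    return matches

# e.g. every customer named Smith, or everyone in zip code 90005
print(search_inventory('inventory.csv', 'Smith'))
print(search_inventory('inventory.csv', '90005'))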

Related

How can I search unknown folders in an S3 bucket? I have millions of objects in my bucket and only want the folder list.

I have a bucket with 3 million objects. I don't even know how many folders there are in my S3 bucket, or even the names of the folders. I want to show only the list of folders in AWS S3. Is there any way to get a list of all folders?
I would use the AWS CLI for this. To get started, have a look here.
Then it is a matter of almost-standard Linux commands (ls):
aws s3 ls s3://<bucket_name>/path/to/search/folder/ --recursive | grep '/$' > folders.txt
where:
the grep command reads what the aws s3 ls command returned and keeps only the entries ending in /.
the trailing > folders.txt saves the output to a file.
Note: grep (if I'm not wrong) is a Unix-only utility, but I believe you can achieve this on Windows as well.
Note 2: depending on the number of files there, this operation might (will) take a while.
Note 3: in systems like Amazon S3, the term "folder" exists only to give users visual similarity with standard file systems; internally, S3 treats it simply as part of the object's key. You can see this in your (web) console when you filter by "prefix".
Amazon S3 buckets with large quantities of objects are very difficult to use. The API calls that list bucket contents are limited to returning 1000 objects per API call. While it is possible to request 'folders' (by using Delimiter='/' and looking at CommonPrefixes), this would take repeated calls to obtain the hierarchy.
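For reference, this is roughly what that repeated-call approach looks like with boto3 (the bucket name is a placeholder); each level of 'folders' is returned in CommonPrefixes:

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

def list_folders(bucket, prefix=''):
    # Yield the 'folder' prefixes directly under the given prefix.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
        for cp in page.get('CommonPrefixes', []):
            yield cp['Prefix']

# One paginated listing per level; walking a deep hierarchy means many calls.
for folder in list_folders('bucket_name'):
    print(folder)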
Instead, I would recommend using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You can then play with that CSV file from code (or possibly Excel? Might be too big?) to obtain your desired listings.
Just be aware that doing anything on that bucket will not be fast.

Query Amazon S3 objects by date

I have an Amazon S3 bucket with the following structure:
%patientId%/%sessionId%/ followed by files whose names are datetimes.
Patient id and session id are unique.
Example bucket with two patients:
patient1/session1/2021-05-29T061445Z.xxx
patient1/session1/2021-05-30T061445Z.xxx
patient2/session2/2021-05-31T061445Z.xxx
Each session may contain thousands of files.
The file name is a date, and I prefer (unless there is no other choice) not to use Amazon S3's "last modified time", because the two dates might differ.
I would like to query by patient/session and time (name of the file), e.g. all files of patient1, session 1 between 2021-05-20 and 2021-05-29.
I understand that using standard Amazon S3 list objects, it is not possible.
I checked AWS Athena, but it seems more suitable for querying the contents of Amazon S3 files, not their names.
So, what is the recommended solution for it?
Thanks,
If you have a large number of objects, you might consider maintaining your own database of objects. This database should be updated when objects are added/removed. It might sound like a lot of work, but it will perform very well for your application.
You can populate the initial list by using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects in a bucket.
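As a rough illustration (the table name, key schema, and attribute names are all assumptions, and the table would be kept current via S3 event notifications), a DynamoDB table keyed by patient/session with the file datetime as the sort key makes this range query trivial:

import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical index table: partition key 'patient_session' (e.g. 'patient1/session1'),
# sort key 'file_datetime' (an ISO-8601 string, so it sorts lexicographically).
table = boto3.resource('dynamodb').Table('s3-file-index')

def files_between(patient_id, session_id, start, end):
    # Return the S3 keys for one patient/session within a datetime range.
    response = table.query(
        KeyConditionExpression=Key('patient_session').eq(f'{patient_id}/{session_id}')
                               & Key('file_datetime').between(start, end)
    )
    return [item['s3_key'] for item in response['Items']]

# e.g. all files of patient1, session1 between 2021-05-20 and 2021-05-29
keys = files_between('patient1', 'session1', '2021-05-20', '2021-05-29T235959Z')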

Use AWS Athena To Query S3 Object Tagging

Is it possible to use AWS Athena to query S3 Object Tagging? For example, if I have an S3 layout such as this
bucketName/typeFoo/object1.txt
bucketName/typeFoo/object2.txt
bucketName/typeFoo/object3.txt
bucketName/typeBar/object1.txt
bucketName/typeBar/object2.txt
bucketName/typeBar/object3.txt
And each object has an S3 Object Tag such as this
#For typeFoo/object1.txt and typeBar/object1.txt
id=A
#For typeFoo/object2.txt and typeBar/object2.txt
id=B
#For typeFoo/object3.txt and typeBar/object3.txt
id=C
Then is it possible to run an AWS Athena query to get any object with the associated tag such as this
select * from myAthenaTable where tag.id = 'A'
# returns typeFoo/object1.txt and typeBar/object1.txt
This is just an example and doesn't reflect my actual S3 bucket/object-prefix layout. Feel free to use any layout you wish in your answers/comments.
Ultimately I have a plethora of objects that could be in different buckets and folder paths, but they are related to each other, and my goal is to tag them so that I can query for a particular id value and get all objects related to that id. The id value would be a GUID, and that GUID would map to many different types of related objects, e.g. a video file, a picture file, a metadata file, and a JSON file, and I want to get all of those files using their common id value. Please feel free to offer suggestions too, because I have the ability to structure this as I see fit.
Update - Note
S3 Object Metadata and S3 Object Tagging are two different things.
Athena does not support querying based on S3 object tags.
One workaround is to maintain a meta file containing the tag-to-file mapping, using Lambda: whenever a new file arrives in S3, the Lambda function updates a file in S3 with the tag and key details, as sketched below.
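A minimal sketch of such a Lambda function (the meta bucket and file names are assumptions; note that rewriting a single JSON file is not safe under concurrent uploads, so a real implementation might prefer DynamoDB):

import json
import boto3

s3 = boto3.client('s3')

META_BUCKET = 'my-meta-bucket'  # assumed bucket holding the mapping file
META_KEY = 'tag-index.json'     # assumed name of the mapping file

def handler(event, context):
    # Triggered by s3:ObjectCreated:* events; records each new object's tags.
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        tags = s3.get_object_tagging(Bucket=bucket, Key=key)['TagSet']

        # Load the current mapping, or start fresh if it doesn't exist yet.
        try:
            body = s3.get_object(Bucket=META_BUCKET, Key=META_KEY)['Body'].read()
            index = json.loads(body)
        except s3.exceptions.NoSuchKey:
            index = {}

        index[f'{bucket}/{key}'] = {t['Key']: t['Value'] for t in tags}
        s3.put_object(Bucket=META_BUCKET, Key=META_KEY,
                      Body=json.dumps(index).encode('utf-8'))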

Optimize photo storage nomenclature on Amazon S3

I have to store lots of photos (over 1,000,000; each max 5 MB) and I have a database where every record has 5 photos. What is the best solution:
Create a directory for each record's slug/id, and upload its photos inside it
Put all photos into one directory, and include the record's id or slug in each name
Put all photos into one directory, and add a field with the photo names to each database record.
I use Amazon S3.
I would suggest you name your photos like this when uploading in batch:
user1/image1.jpeg
user2/image2.jpeg
Though these names do not affect the way objects are stored on S3 (they are simply the "keys" of "objects", as there is no folder-like hierarchical structure in S3), naming them this way makes the objects appear in folders, which will help to segregate the images easily if you later want to do so.
For example, suppose you stored all images under unique names and used a unique UUID to map records in your database to images in your bucket.
But suppose later you want all 5 photos of a particular user; then you would have to:
scan the database for that particular username,
retrieve the UUIDs for that user's images,
and then fetch the images from S3 using those UUIDs.
But if you name images with the username as a prefix, you can fetch them directly from S3 without making any reference to your database.
For example, to list all photos of user1, you can use this small code snippet in Python:
import boto3

# List every object whose key begins with 'user1/'
s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket_name')
for obj in bucket.objects.filter(Prefix='user1/'):
    print(obj.key)
If you don't use any user ID in the object's key, then you have to consult the database to map photos to records even just to get a list of a particular user's images.
A lot of this depends on your use-case, such as how the database and the photos will be used. There is not enough information here to give a definitive answer.
However, some recommendations for the storage side...
The easiest option is just to use a UUID for each photo. This is effectively a random name that has no meaning. Store that name in your database and your system will know which image relates to which record. There is no need to ever rename the images because the names are just Unique IDs and convey no further information.
When you want to provide access to a particular image, your application can generate an Amazon S3 pre-signed URL that grants time-limited access to an object. After the expiry time, the URL does not work so the object remains private. Granting access in this manner means that there is no need to group images into directories by "owner", since access is granted per-object rather than per-owner.
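Putting those two ideas together, a sketch might look like this (the bucket name, local file, and expiry time are assumptions):

import uuid
import boto3

s3 = boto3.client('s3')
BUCKET = 'my-photo-bucket'  # assumed bucket name

# Store a photo under a random UUID key; record the key in your database.
photo_key = f'{uuid.uuid4()}.jpg'
s3.upload_file('local-photo.jpg', BUCKET, photo_key)

# Later, grant a user time-limited access to that specific photo.
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': BUCKET, 'Key': photo_key},
    ExpiresIn=3600,  # the link stops working after one hour
)
print(url)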
Also, please note that Amazon S3 doesn't actually support folders. Rather, the Key ("filename") of the object is the entire path (eg user-2/foo.jpg). This makes it more human-readable (because the objects 'appear' to be in folders), but doesn't actually impact the way data is stored behind-the-scenes.
Bottom line: It doesn't really matter how you store the images. What matters is that you store the image name in your database so you know which image matches which record. Avoid situations where you need to rename images - just give them a name and keep it.

Easy way to create dated subdirectories on AWS S3

I'm trying to create a web service that can store user-uploaded files in S3. The problem is that we want the files stored in "dated directories".
For example, if a user uploads a.txt on 12/1/2017 at 9:15am, the file should look like this in S3:
https://s3-eu-west-1.amazonaws.com/test-bucket/uploaded/2017/12/1/9/a.txt
Does S3 have any API to help us achieve this, or do we need to hand-craft this solution?
There is no such API in S3. Think of Amazon S3 as a storage service, not an application or database.
It is the responsibility of your application to store the data in the desired naming format -- just like storing data on a disk.
By the way, your naming format could do with some improvement:
Always expand fields to the correct number of digits (use 01 for January rather than 1) so that they sort correctly.
Think about your use-case -- if you will be scanning documents by year, then the /2017/12/01/09/a.txt naming format makes sense since you can look in the 2017 directory (not that directories really exist in S3). If not, then simply store it as /2017-12-01-09-a.txt.
Make it very clear which one is month vs day -- the USA is the only country in the world that treats "12/1/2017" as December 1st. The rest of the world reads it as "12 January". Using the format of 2017-12-01 makes it clear that it is 1-December-2017.
What about naming conflicts? Can only one person upload a file with a given name on a given day? How are you going to differentiate between different users uploading a file with the same name?
The reality is, the filename is totally irrelevant -- your application should use a database to keep track of objects that users upload and assign each of them a unique name. When a file is later requested, look up the filename in the database and then provide that file. Do not use S3 filenames as a pseudo-database where the name conveys particular meaning, otherwise you'll often have to rename files to add more meaning!
Directories don't actually exist in S3 -- they are just part of the filename. So, you can create a file in a given directory just by storing it -- there is no need to pre-create directories.
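A minimal sketch of the application-side approach (the helper function is hypothetical), using zero-padded UTC date components so the keys sort correctly:

from datetime import datetime, timezone
import boto3

s3 = boto3.client('s3')

def upload_with_dated_key(local_path, bucket, filename):
    # Build a zero-padded, sortable dated prefix from the upload time.
    now = datetime.now(timezone.utc)
    key = now.strftime('uploaded/%Y/%m/%d/%H/') + filename
    s3.upload_file(local_path, bucket, key)
    return key

# e.g. stored as 'uploaded/2017/12/01/09/a.txt' for an upload at 09:15 UTC on 2017-12-01
key = upload_with_dated_key('a.txt', 'test-bucket', 'a.txt')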
AWS S3 does not provide you with such logic. But it should be fairly easy to use your application's time information to create such an S3 object key ("path").
Good luck!