I have an Amazon S3 bucket with the following structure:
%patientId%/%sessionId%/, containing files whose names are datetimes.
Patient id and session id are unique.
Example bucket with two patients:
patient1/session1/2021-05-29T061445Z.xxx
patient1/session1/2021-05-30T061445Z.xxx
patient2/session2/2021-05-31T061445Z.xxx
Each session may contain thousands of files.
The file name is a date, and I prefer (unless there is no other choice) not to use the "last modified time" from Amazon S3, because the two dates might differ.
I would like to query by patient/session and time (name of the file), e.g. all files of patient1, session 1 between 2021-05-20 and 2021-05-29.
I understand that this is not possible with the standard Amazon S3 list-objects calls.
I checked AWS Athena, but it seems more suited to querying the content of Amazon S3 files, not their names.
So, what is the recommended solution for it?
Thanks,
If you have a large number of objects, you might consider maintaining your own database of objects. This database should be updated when objects are added/removed. It might sound like a lot of work, but it will perform very well for your application.
You can populate the initial list by using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects in a bucket.
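As a minimal sketch of one way such an index could look, assume a hypothetical DynamoDB table named s3-file-index with a partition key pk holding "patientId/sessionId" and a sort key filename holding the datetime-based object name (the table and attribute names are illustrative, not prescriptive):

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("s3-file-index")  # assumed table name

    def index_object(patient_id, session_id, filename):
        """Record an object when it is added to the bucket (e.g. from an S3 event)."""
        table.put_item(Item={
            "pk": f"{patient_id}/{session_id}",
            "filename": filename,
            "s3_key": f"{patient_id}/{session_id}/{filename}",
        })

    def query_by_time_range(patient_id, session_id, start, end):
        """Return all indexed files for one patient/session whose name falls in [start, end].
        This works because ISO-8601-style names sort lexicographically in time order."""
        response = table.query(
            KeyConditionExpression=Key("pk").eq(f"{patient_id}/{session_id}")
            & Key("filename").between(start, end)
        )
        return [item["s3_key"] for item in response["Items"]]

    # e.g. all files of patient1/session1 from 2021-05-20 up to and including 2021-05-29;
    # the upper bound "2021-05-30" sorts after every 2021-05-29T... file name
    keys = query_by_time_range("patient1", "session1", "2021-05-20", "2021-05-30")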
I am new to Amazon AWS S3.
One of my applications processes 40000 updates an hour with a unique identifier for each update.
This identifier is basically a string.
At runtime, I want to store the ID of every update in an S3 bucket.
But, as far as I understand, S3 only stores files.
Is there any way around this?
Should I store a file, then read it each time, append the new ID, and store it again?
Any direction would be very helpful.
Thanks in advance.
I want it to be stored like:
Id1
Id2
Id3
...
Edit: Thanks for the responses; I have added what was asked.
I want to be able to just fetch all these IDs if and when a problem occurs in our system.
I am open to using something other than S3 as well. I was also looking into DynamoDB, with the ID as the primary key. But these IDs might be repeated in 1-2% of cases.
In S3, there is no concept of files and folders. All you have is a bucket and objects inside the bucket. However, the AWS console groups objects with common prefixes so that they appear to be in the same folder.
Also, there is no such thing as appending to a file in S3. Since S3 stores whole objects, a so-called append actually deletes the previous object and creates a new object containing the previous object's data plus the new data.
So, one way to do what I think you're trying to do is:
Suppose you have all the IDs written at 10:00 in an S3 object called data_corresponding_to_10_00_00. For the next hour (and the next 40,000 updates), if they are all new IDs, you can write them to another S3 object named data_corresponding_to_11_00_00.
However, if you do not want duplicate entries across the files, and you need to update the previous file itself, S3 is not a great fit. Rather, use a database indexed on ID so that lookups are fast.
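As a rough sketch of the hourly-object approach (the bucket name is made up, and the in-memory buffer could just as well be a queue or a local file):

    import boto3
    from datetime import datetime, timezone

    s3 = boto3.client("s3")
    BUCKET = "my-updates-bucket"   # assumed bucket name

    # Collect IDs in memory and flush once per hour as a single object,
    # instead of rewriting one ever-growing S3 object 40,000 times an hour.
    buffered_ids = []

    def record_id(update_id):
        buffered_ids.append(update_id)

    def flush_hourly_batch():
        """Write the buffered IDs as one newline-delimited object,
        e.g. data_corresponding_to_11_00_00, then clear the buffer."""
        if not buffered_ids:
            return
        key = datetime.now(timezone.utc).strftime("data_corresponding_to_%H_00_00")
        body = "\n".join(dict.fromkeys(buffered_ids))  # drop duplicates within the batch
        s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
        buffered_ids.clear()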
I am trying to find possible orphans in an S3 bucket. What I mean is that we might delete something from the DB and, for whatever reason, it doesn't get cleared from S3. This could be a bug in our system or something of that nature. I want to double-check against our API that each object in S3 maps to something that exists; the naming convention lets us map things together like that.
Scanning an entire bucket every X days seems unscalable. I was thinking that each object in the bucket could add itself to an SQS queue so that the relevant check happens every 30 days or so.
I've only found events around uploads and specific modifications over at https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html. Is there anything more general that I'm missing? Any creative solutions to this problem?
You should activate Amazon S3 Inventory, which can provide a regular CSV file (as often as daily) that contains a list of every object in the Amazon S3 bucket.
You could then trigger some code that compares the contents of the CSV file against the database to find 'orphan' objects.
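A small sketch of that comparison, assuming you already have the set of keys your database knows about and a downloaded inventory CSV (the file and variable names are hypothetical):

    import csv

    def find_orphans(inventory_csv_path, known_keys):
        """Return the object keys from an S3 Inventory CSV that have no matching
        record in the database. Inventory CSV rows start with: bucket, key, ..."""
        orphans = []
        with open(inventory_csv_path, newline="") as f:
            for row in csv.reader(f):
                key = row[1]
                if key not in known_keys:
                    orphans.append(key)
        return orphans

    # e.g. orphans = find_orphans("inventory.csv", {"images/a.png", "images/b.png"})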
Is it possible to use AWS Athena to query S3 Object Tagging? For example, if I have an S3 layout such as this
bucketName/typeFoo/object1.txt
bucketName/typeFoo/object2.txt
bucketName/typeFoo/object3.txt
bucketName/typeBar/object1.txt
bucketName/typeBar/object2.txt
bucketName/typeBar/object3.txt
And each object has an S3 Object Tag such as this
#For typeFoo/object1.txt and typeBar/object1.txt
id=A
#For typeFoo/object2.txt and typeBar/object2.txt
id=B
#For typeFoo/object3.txt and typeBar/object3.txt
id=C
Then is it possible to run an AWS Athena query to get any object with the associated tag such as this
select * from myAthenaTable where tag.id = 'A'
# returns typeFoo/object1.txt and typeBar/object1.txt
This is just an example and doesn't reflect my actual S3 bucket/object-prefix layout. Feel free to use any layout you wish in your answers/comments.
Ultimately, I have a plethora of objects that could live in different buckets and folder paths, but they are related to each other, and my goal is to tag them so that I can query for a particular id value and get back all objects related to that id. The id value would be a GUID, and that GUID would map to many different types of related objects: for example, a video file, a picture file, a metadata file, and a JSON file, and I want to fetch all of those files by their common id value. Please feel free to offer suggestions, too, because I have the ability to structure this as I see fit.
Update - Note
S3 Object Metadata and S3 Object Tagging are two different things.
Athena does not support querying based on S3 object tags.
One workaround is to create a meta file that holds the tag-to-file mapping, maintained with a Lambda function: whenever a new file arrives in S3, the Lambda updates a file in S3 with the tag and name details.
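A minimal Lambda sketch of that workaround might look like the following (the manifest bucket and key names are assumptions, and error handling is omitted):

    import json
    import boto3

    s3 = boto3.client("s3")
    MANIFEST_BUCKET = "my-metadata-bucket"   # assumed
    MANIFEST_KEY = "tag-manifest.json"       # assumed

    def lambda_handler(event, context):
        """Triggered by an S3 ObjectCreated event: look up the new object's tags
        and merge them into a manifest file that can be queried later."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]

            tags = s3.get_object_tagging(Bucket=bucket, Key=key)["TagSet"]
            tag_map = {t["Key"]: t["Value"] for t in tags}

            # Load the existing manifest (or start a new one) and add this object.
            try:
                body = s3.get_object(Bucket=MANIFEST_BUCKET, Key=MANIFEST_KEY)["Body"].read()
                manifest = json.loads(body)
            except s3.exceptions.NoSuchKey:
                manifest = {}
            manifest[f"{bucket}/{key}"] = tag_map

            s3.put_object(Bucket=MANIFEST_BUCKET, Key=MANIFEST_KEY,
                          Body=json.dumps(manifest).encode("utf-8"))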
I'm testing out S3 Select and as far as I understand from the examples, you can treat a single object (CSV or JSON) as a data store.
I wanted to have a single JSON document per S3 object and search the entire bucket as a 'database'. I'm saving each 'file' as <ID>.json and each file has JSON documents with the same schema.
Is it possible to search multiple objects in a single call? i.e. Find all JSON documents where customerId = 123 ?
It appears that Amazon S3 Select operates on only one object.
You can use Amazon Athena to run queries across paths, which will include all files within that path. It also supports partitioning.
Simple: just iterate over the prefix (the "folder") that holds all the files, grab each object's key, and run S3 Select against it.
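A rough sketch of that loop, assuming one JSON document per object under a hypothetical prefix and a customerId field in each document:

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-bucket"        # assumed
    PREFIX = "customers/"       # assumed prefix holding the <ID>.json objects

    def select_matching_documents(customer_id):
        """List every object under the prefix and run the same S3 Select query
        against each one, collecting whatever records come back."""
        results = []
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
            for obj in page.get("Contents", []):
                response = s3.select_object_content(
                    Bucket=BUCKET,
                    Key=obj["Key"],
                    ExpressionType="SQL",
                    Expression=f"SELECT * FROM S3Object s WHERE s.customerId = '{customer_id}'",
                    InputSerialization={"JSON": {"Type": "DOCUMENT"}},
                    OutputSerialization={"JSON": {}},
                )
                for event in response["Payload"]:
                    if "Records" in event:
                        results.append(event["Records"]["Payload"].decode("utf-8"))
        return results

    # e.g. matches = select_matching_documents("123")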
In an S3 bucket, I have thousands and thousands of files stored with names having a structure that comes down to prefix and number:
A-0001
A-0002
A-0003
B-0001
B-0002
C-0001
C-0002
C-0003
C-0004
C-0005
New objects for a given prefix may arrive with varying frequency, or not at all. Older objects may disappear.
Is there a way to efficiently query S3 for the highest number of every prefix, i.e. without listing the entire bucket? The result I want is:
A-0003
B-0002
C-0005
The S3 API itself does not seem to offer anything usable for that. However, perhaps another service, like Athena, could do it? So far I have only found it capable of searching within objects, but all I care about are their key names. If it can report on the contents of objects in the bucket, can't it on the bucket itself?
I would be okay with the latest modification date per prefix, but I want to avoid having to switch to a versioned bucket with just the prefixes as names to achieve that.
I think this is what you are looking for: Athena exposes the source object's key in a pseudo-column called "$path", and you can apply a regexp to it to match the pattern you are querying, e.g.:
WHERE regexp_extract(sp."$path", '[^/]+$') like concat('%',cast(current_date - interval '1' day as varchar),'.csv')
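For completeness, here is one way that fragment could sit inside a full query run from code (the table name files_table, the alias sp, and the results location are all assumptions; any Athena table defined over the bucket will do, since only the "$path" pseudo-column is used):

    import boto3

    athena = boto3.client("athena")

    QUERY = """
    SELECT sp."$path"
    FROM   files_table sp
    WHERE  regexp_extract(sp."$path", '[^/]+$')
           LIKE concat('%', cast(current_date - interval '1' day as varchar), '.csv')
    """

    execution = athena.start_query_execution(
        QueryString=QUERY,
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # assumed bucket
    )
    # Poll get_query_execution / get_query_results with this id to read the matches.
    print(execution["QueryExecutionId"])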
The S3 API itself does not seem to offer anything usable for that.
However, perhaps another service, like Athena, could do it?
Yes, at the moment there is no direct way of doing it with Amazon S3 alone. Even with Athena, the query will still scan the files' content, but it will be easier thanks to Athena's standard SQL support and faster because the queries run in parallel.
So far I have only found it capable of searching within objects, but all I care about are their key names.
Both Athena and S3 Select are for querying by content, not by key.
The best approach I can recommend is to use AWS DynamoDB to keep the metadata of the files, including the file names, for faster querying.
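As a sketch of that approach, assume a hypothetical DynamoDB table s3-key-index with partition key prefix and sort key name, populated whenever an object is created (for example from an S3 event notification):

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("s3-key-index")   # assumed table name

    def record_key(key):
        """Index an object key such as 'C-0005' when it is created."""
        prefix = key.split("-")[0]
        table.put_item(Item={"prefix": prefix, "name": key})

    def highest_key_for_prefix(prefix):
        """Return the lexicographically highest key for a prefix; with zero-padded
        numbers like 'C-0005' that is also the highest number."""
        response = table.query(
            KeyConditionExpression=Key("prefix").eq(prefix),
            ScanIndexForward=False,   # sort descending by the sort key
            Limit=1,
        )
        items = response["Items"]
        return items[0]["name"] if items else None

    # e.g. highest_key_for_prefix("C") -> "C-0005"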