How can I detect orphaned objects in S3 that aren't mapped to our database?

I am trying to find possible orphans in an S3 bucket. What I mean is that we might delete something from the DB and, for whatever reason, it doesn't get cleared from S3. This could be due to a bug in our system or something of that nature. I want to double-check against our API that each object in S3 maps to something that exists; the naming convention lets us map things together like that.
Scraping an entire bucket every X days seems unscalable. I was thinking that for each object in the bucket, it can add itself to an SQS queue for the relevant checking to happen, every 30 days or so.
I've only found events around uploads and specific modifications over at https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html. Is there anything more generalized I can't find? Any creative solutions to this problem?

You should activate Amazon S3 Inventory, which can provide a regular CSV file (as often as daily) that contains a list of every object in the Amazon S3 bucket.
You could then trigger some code that compares the contents of the CSV file against the database to find 'orphan' objects.
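The comparison step could look roughly like the sketch below. It assumes the inventory CSV has no header row and lists the bucket name and the URL-encoded key as its first two columns (the real column order is given by the inventory manifest), and that `known_keys` is a set loaded from the database. The function name `find_orphans` and the sample data are hypothetical.

```python
import csv
import io
from urllib.parse import unquote


def find_orphans(inventory_csv, known_keys):
    """Return keys listed in the inventory that the database does not know."""
    orphans = []
    for row in csv.reader(inventory_csv):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        # Assumption: column 2 holds the URL-encoded key. If your files
        # encode spaces as '+', use urllib.parse.unquote_plus instead.
        key = unquote(row[1])
        if key not in known_keys:
            orphans.append(key)
    return orphans


# Stand-in data: in practice the rows come from the inventory file and
# known_keys from a database query.
sample = io.StringIO(
    '"my-bucket","uploads/1.png","1024"\n'
    '"my-bucket","uploads/2.png","2048"\n'
    '"my-bucket","uploads/old%20report.pdf","512"\n'
)
known_keys = {"uploads/1.png", "uploads/2.png"}
orphans = find_orphans(sample, known_keys)
print(orphans)
```

For a very large bucket the set of known keys may not fit in memory, in which case loading the inventory into a database table and running a join is probably the better fit.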

Related

Storing S3 Urls vs calling listObjects

I have an app that has an attachments feature for users. They can upload documents to S3 and then revisit and preview and/or Download said attachments.
I was planning on storing the S3 urls in DB and then pre-signing them when the User needs them. I'm finding a caveat here is that this can lead to edge cases between S3 and the DB.
I.e. if a file gets removed from S3 but its url does not get removed from DB (or vice-versa). This can lead to data inconsistency and may mislead users.
I was thinking of just getting the urls via the network by using listObjects in the s3 client SDK. I don't really need to store the urls and this guarantees the user gets what's actually in S3.
Only con here is that it makes 1 API request (as opposed to DB hit)
Any insights?
Thanks!
Using a database to store an index to files is a good idea, especially once the volume of objects increases. The ListObjects() API returns at most 1000 objects per call. This might be okay if every user has their own path (so you can use ListObjects(Prefix='user1/')), but that's not ideal if you want to allow document sharing between users.
Using a database will definitely be faster to obtain a listing, and it has the advantage that you can filter on attributes and metadata.
The two systems will only get "out of sync" if objects are created/deleted outside of your app, or if there is an error in the app. If this concerns you, use Amazon S3 Inventory to provide a regular listing of objects in the bucket, and write some code to compare it against the database entries. This will highlight if anything is going wrong.
While Amazon S3 is an excellent NoSQL database (Key = filename, Value = contents), it isn't good for searching/listing a large quantity of objects.
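To illustrate the 1000-object page limit mentioned above, here is a minimal sketch of listing everything under one prefix by following continuation tokens. It assumes `client` is a boto3 S3 client (or anything exposing the same `list_objects_v2` call); the helper name `iter_keys` and the bucket and prefix names are hypothetical.

```python
def iter_keys(client, bucket, prefix=""):
    """Yield every key under prefix, one page (up to 1000 keys) at a time."""
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = client.list_objects_v2(**kwargs)
        for obj in page.get("Contents", []):
            yield obj["Key"]
        if not page.get("IsTruncated"):
            break  # last page reached
        # Ask for the next page on the following call.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


# Usage with a real client (requires boto3 and AWS credentials):
#   import boto3
#   keys = list(iter_keys(boto3.client("s3"), "my-bucket", "user1/"))
```

Each page is a separate API request, which is why a database index tends to win once a listing spans many pages.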

Store list of Strings in S3

I am new to Amazon AWS S3.
One of my applications processes 40000 updates an hour with a unique identifier for each update.
This identifier is basically a string.
At runtime, I want to store the ID in an S3 bucket for all updates.
But, as far as I understood, we need to store files in s3.
Is there any way around this?
Should I store a file, then read that file each time, append the ID, and store it again?
Any direction would be very helpful.
Thanks in advance.
I want it to be stored like:
Id1
Id2
Id3
...
Edit: Thanks for the responses, I have added what is asked..
I want to be able to just fetch all these IDs if and when a problem occurs in our system.
I am open to using anything other than s3 as well. I was also looking into DynamoDB. With the ID as the primary key. But, these ID's might be repetitive in 1-2% cases.
S3 has no concept of files and folders. All you have is a bucket and the objects inside it. However, the AWS console groups objects with common prefixes so that they appear to be in the same folder.
Also, there is no such thing as appending to a file in S3. Since S3 stores whole objects, a so-called append actually replaces the previous object with a new object containing the previous object's data plus the additional data.
So, one way to do what I think you're trying is :
Suppose you have all the IDs written at 10:00 in an S3 object called data_corresponding_to_10_00_00. For the next hour (and its 40000 updates), if they all have new IDs, you can write them to another S3 object named data_corresponding_to_11_00_00.
However, if you do not want duplicate entries across the files and need to update the previous file itself, S3 is not a great fit. Use a database indexed on ID instead, so that lookups and updates are fast.
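The hourly-object idea could be sketched as follows. The helper names `hourly_key` and `batch_body` are hypothetical, and the key follows the naming used above; a real key would presumably also carry the date so that each day does not overwrite the previous one.

```python
from datetime import datetime, timezone


def hourly_key(moment):
    """Build an object name like data_corresponding_to_11_00_00."""
    return moment.strftime("data_corresponding_to_%H_00_00")


def batch_body(ids):
    """Join IDs one per line, dropping repeats but keeping first-seen order."""
    return "\n".join(dict.fromkeys(ids)) + "\n"


key = hourly_key(datetime(2024, 1, 1, 11, 42, tzinfo=timezone.utc))
body = batch_body(["Id1", "Id2", "Id1", "Id3"])
print(key)
print(body, end="")

# Uploading the batch with boto3 (requires AWS credentials) would then be:
#   boto3.client("s3").put_object(Bucket="my-bucket", Key=key, Body=body.encode())
```

This only removes repeats within one hourly batch; removing them across batches is the case where a database indexed on ID is the better tool.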

How to diff very large buckets in Amazon S3?

I have a use case where I have to back up a 200+ TB, 18M-object S3 bucket, which changes often (it is used in batch processing of critical data), to another account. I need to add a verification step, but the bucket size, object count, and frequency of change make this tricky.
My current thinking is to pull the eTags from the original bucket and the archive bucket, and then write a streaming diff tool to compare the values. Has anyone here had to approach this problem, and if so, did you come up with a better answer?
Firstly, if you wish to keep two buckets in sync (once you've done the initial sync), you can use Cross-Region Replication (CRR).
To do the initial sync, you could try using the AWS Command-Line Interface (CLI), which has an aws s3 sync command. However, it might have some difficulties with a large number of files -- I suggest you give it a try. It uses keys, dates and file size to determine which files to sync.
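If you do wish to create your own sync app, then eTag is a reliable way to compare files. Be aware that for multipart uploads the eTag depends on the part size, so identical content uploaded differently can produce different eTags.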
To make things simple, activate Amazon S3 Inventory, which can provide a daily listing of all files in a bucket, including eTag. You could then do a comparison between the Inventory files to discover which remaining files require synchronization.
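Once both inventories are parsed, the comparison itself is small. The sketch below assumes each inventory has already been reduced to a mapping of key to eTag; the function name `diff_listings` and the sample values are hypothetical.

```python
def diff_listings(source, archive):
    """Compare two {key: etag} mappings built from inventory files."""
    # Present in the source bucket but not yet in the archive.
    missing = sorted(k for k in source if k not in archive)
    # Present in both, but with different eTags.
    changed = sorted(k for k in source if k in archive and source[k] != archive[k])
    # Present only in the archive (deleted from the source since the backup).
    extra = sorted(k for k in archive if k not in source)
    return {"missing": missing, "changed": changed, "extra": extra}


source = {"a.dat": "etag-1", "b.dat": "etag-2", "c.dat": "etag-3"}
archive = {"a.dat": "etag-1", "b.dat": "etag-9", "d.dat": "etag-4"}
report = diff_listings(source, archive)
print(report)
```

With 18M objects, holding both mappings in memory may be impractical; a merge over two key-sorted listings, or a join in a query engine, applies the same logic without loading everything at once.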
For anyone looking for a way to solve this problem in an automated way (as was I),
I created a small python script that leverages S3 Inventories and Athena to do the comparison somewhat efficiently. (This is basically automation of John Rosenstein's suggestion)
You can find it here https://github.com/forter/s3-compare

Is there a way to query S3 object key names for the latest per prefix?

In an S3 bucket, I have thousands and thousands of files stored with names having a structure that comes down to prefix and number:
A-0001
A-0002
A-0003
B-0001
B-0002
C-0001
C-0002
C-0003
C-0004
C-0005
New objects for a given prefix should come in with varying frequency, but might not. Older objects may disappear.
Is there a way to efficiently query S3 for the highest number of every prefix, i.e. without listing the entire bucket? The result I want is:
A-0003
B-0002
C-0005
The S3 API itself does not seem to offer anything usable for that. However, perhaps another service, like Athena, could do it? So far I have only found it capable of searching within objects, but all I care about are their key names. If it can report on the contents of objects in the bucket, can't it on the bucket itself?
I would be okay with the latest modification date per prefix, but I want to avoid having to switch to a versioned bucket with just the prefixes as names to achieve that.
I think this is what you are looking for:
Athena exposes each object's path as the "$path" pseudo-column, and you can apply a regular expression to it to match the pattern you are querying, for example:
WHERE regexp_extract(sp."$path", '[^/]+$') like concat('%',cast(current_date - interval '1' day as varchar),'.csv')
"The S3 API itself does not seem to offer anything usable for that. However, perhaps another service, like Athena, could do it?"
Yes. At the moment there is no direct way to do this with S3 alone. Athena will still go through the files to query their contents, but its standard SQL support makes the query easier to write, and it is faster because the queries run in parallel.
"So far I have only found it capable of searching within objects, but all I care about are their key names."
Both Athena and S3 Select query object contents, not keys.
The best approach I can recommend is to use Amazon DynamoDB to keep the metadata of the files, including the file names, for faster querying.
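Whichever store ends up holding the key names, the reduction to "highest number per prefix" is straightforward. This is a minimal sketch that assumes keys have the form prefix-number as in the question and that the list of keys comes from such a metadata store or listing; the function name `latest_per_prefix` is hypothetical.

```python
def latest_per_prefix(keys):
    """Return, for each prefix, the key with the highest numeric suffix."""
    latest = {}  # prefix -> (number, key)
    for key in keys:
        prefix, sep, number = key.rpartition("-")
        if not sep or not number.isdigit():
            continue  # ignore keys that do not match prefix-number
        if prefix not in latest or int(number) > latest[prefix][0]:
            latest[prefix] = (int(number), key)
    return [latest[prefix][1] for prefix in sorted(latest)]


keys = [
    "A-0001", "A-0002", "A-0003",
    "B-0001", "B-0002",
    "C-0001", "C-0002", "C-0003", "C-0004", "C-0005",
]
print(latest_per_prefix(keys))
```

If the metadata lives in a database, the same result can usually be obtained with a grouped maximum instead of scanning every row in application code.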

Top level solution to rename AWS bucket item's folder names?

I've inherited a project at work. It's essentially a niche content repository, and we use S3 to store the content. The project was severely outdated, and I'm in the process of a thorough update.
For some unknown and undocumented reason, the content is stored in an AWS S3 bucket with the pattern web_cl_000000$DB_ID$CONTENT_NAME. So, one particular folder can be named web_cl_0000003458zyxwv. This makes no sense, and requires a bit of transformation logic to construct a URL to serve up the content!
I can write a Python script using the boto3 library to do an item-by-item rename, but would like to know if there's a faster way to do so. There are approximately 4M items in that bucket, which will take quite a long time.
That isn't possible, because the folders are an illusion derived from the strings between / delimiters in the object keys.
Amazon S3 has a flat structure with no hierarchy like you would see in a typical file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using key name prefixes for objects. (emphasis added)
http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
The console contributes to the illusion by allowing you to "create" a folder, but all that actually does is create a 0-byte object with / as its last character, which the console will display as a folder whether there are other objects with that prefix or not, making it easier to upload objects manually with some organization.
But any tool or technique that allows renaming folders in S3 will in fact be making a copy of each object with the modified name, then deleting the old object, because S3 does not actually support rename or move, either -- objects in S3, including their key and metadata, are actually immutable. Any "change" is handled at the API level with a copy/overwrite or copy-then-delete.
Worth noting, S3 should be able to easily sustain 100 such requests per second, so with asynchronous requests or multi-threaded code, or even several processes each handling a shard of the keyspace, you should be able to do the whole thing in a few hours.
Note also that the less sorted (more random) the new keys are in the requests, the harder you can push S3 during a mass-write operation like this. Sending the requests so that the new keys are in lexical order will be the most likely scenario in which you might see 503 Slow Down errors... in which case, you just back off and retry... but if the new keys are not ordered, S3 can more easily accommodate a large number of requests.
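Putting the points above together, a multi-threaded copy-then-delete might be sketched like this. It assumes `client` is a boto3 S3 client (`copy_object` and `delete_object` are real boto3 calls, but a single `copy_object` only handles objects up to 5 GB), and that `make_new_key` is a function you supply to map an old key to its new name. The helper names and the worker count are hypothetical, and error handling and retries on 503 Slow Down are left out.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def rename_object(client, bucket, old_key, new_key):
    """'Rename' by copying to the new key, then deleting the old one."""
    client.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": old_key},
    )
    client.delete_object(Bucket=bucket, Key=old_key)
    return new_key


def rename_all(client, bucket, old_keys, make_new_key, workers=32):
    """Rename many objects concurrently and return the sorted new keys."""
    pairs = [(key, make_new_key(key)) for key in old_keys]
    random.shuffle(pairs)  # avoid sending the new keys in lexical order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(rename_object, client, bucket, old, new)
            for old, new in pairs
        ]
        # result() re-raises any exception from a worker thread.
        return sorted(future.result() for future in futures)


# Usage with a real client (requires boto3 and AWS credentials):
#   import boto3
#   rename_all(boto3.client("s3"), "my-bucket", keys, lambda k: "content/" + k)
```

The old object is deleted only after its copy succeeds, so a failure partway through leaves duplicates rather than losing data.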