Deleting millions of files from S3

I need to delete 64 million objects from a bucket, leaving about the same number of objects untouched. I have created an inventory of the bucket and used that to create a filtered inventory that has only the objects that need to be deleted.
I created a Lambda function that uses NodeJS to 'async' delete the objects that are fed to it.
I have created smaller inventories (10s, 100s and 1000s of objects) from the filtered one, and used S3 Batch Operation jobs to process these, and those all seem to check out: the expected files were deleted, and all other files remained.
Now, my questions:
Am I doing this right? Is this the preferred method to delete millions of files, or did my Googling misfire?
Is it advised to just create one big batch job and let that run, or is it better to break it up into chunks of, say, a million objects?
How long will this take (approx. of course)? Will S3 Batch go through the list and do each file sequentially? Or does it automagically scale out and do a whole bunch in parallel?
What am I forgetting?
Any suggestions, thoughts or criticisms are welcome. Thanks!
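For reference, a handler of the kind described can be sketched as follows; this is a rough Python equivalent of the asker's NodeJS function, assuming the standard S3 Batch Operations Lambda invocation/response schema (task fields such as s3Key and s3BucketArn, and per-task result codes):

```python
import urllib.parse
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Delete each object that S3 Batch Operations hands to this invocation."""
    results = []
    for task in event["tasks"]:
        bucket = task["s3BucketArn"].split(":::")[-1]
        key = urllib.parse.unquote_plus(task["s3Key"])  # manifest keys are URL-encoded
        try:
            s3.delete_object(Bucket=bucket, Key=key)
            result_code, result_string = "Succeeded", ""
        except Exception as exc:
            result_code, result_string = "PermanentFailure", str(exc)
        results.append({
            "taskId": task["taskId"],
            "resultCode": result_code,
            "resultString": result_string,
        })
    return {
        "invocationSchemaVersion": event["invocationSchemaVersion"],
        "treatMissingKeysAs": "PermanentFailure",
        "invocationId": event["invocationId"],
        "results": results,
    }
```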

You might have a look at the Step Functions Distributed Map feature. I do not know your specific use case, but it could help you get the proper scaling.
Here is a short blog entry on how you can achieve it.
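Whichever way you orchestrate it, the per-worker delete itself can be batched; below is a minimal boto3 sketch (bucket and key list are placeholders) using DeleteObjects, which accepts up to 1,000 keys per request:

```python
import boto3

s3 = boto3.client("s3")

def delete_keys(bucket, keys):
    """Delete keys in chunks of 1,000, the DeleteObjects per-request limit."""
    errors = []
    for start in range(0, len(keys), 1000):
        chunk = keys[start:start + 1000]
        response = s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": key} for key in chunk], "Quiet": True},
        )
        errors.extend(response.get("Errors", []))
    return errors  # anything S3 refused to delete

# Hypothetical usage with one shard of the filtered inventory:
# failed = delete_keys("my-bucket", keys_from_manifest_shard)
```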

Related

Does amazon s3 have a limit to the number of subfolders you create?

I am thinking of using Amazon S3 to implement my own backup solution. The idea is to have a script that accepts a directory and recursively uploads all files underneath that directory into S3. However, I am not sure if it would work, because of the following reasons.
s3 apparently doesn't have folders.
s3 imposes a limit on the size of the name of objects (1024 characters).
I take this to mean that if an object is identified as "/foo/bar/baz.txt", then the "/foo/bar/" portion of that "filepath" is actually part of the object's name and counts towards the character limit on object names. If this is true, then I could see this becoming an issue when uploading deeply nested files with long filepaths (although 1024 characters does seem fairly generous).
Am I understanding things correctly?
Yes, this is accurate.
S3 is a key/value store, not a filesystem, though backups are certainly something its authors expect it to be used for (as evidenced by the documentation's choice of example keys being mostly filepaths!). If your computer has directory structures and filenames so long and so deeply nested that their entire path exceeds a thousand characters, I'd strongly recommend reorganising your hard drive!
If you can't do that and have lots of long paths, you may wish to try something other than a one-to-one mapping between the two. For example, you could store data blobs (the contents of a file) under a key that is some GUID, and keep a separate key/value store that maps GUIDs to filepaths (although that doesn't help you with reverse lookup). Basically, do the same thing you'd do if you were trying to structure this efficiently in code, using algorithms and data structures, because that's really what you're doing here, too!
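A minimal sketch of that idea, assuming a hypothetical bucket and leaving the GUID-to-path index up to you (a DynamoDB table, a local index file, etc.):

```python
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "my-backup-bucket"  # hypothetical bucket name

def backup_file(local_path):
    """Upload the file under an opaque GUID key and return the mapping record."""
    blob_key = f"blobs/{uuid.uuid4()}"
    s3.upload_file(local_path, BUCKET, blob_key)
    # Persist this mapping in a separate key/value store so you can
    # resolve filepath -> GUID (and back) at restore time.
    return {"path": local_path, "key": blob_key}
```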
Putting backups aside and speaking more generally, if you were using subdirectories on disk only as a sort of metadata, there are other metadata properties you can use in S3 for that. But your object keys would still have to be unique across the whole dataset.
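For example (key and metadata values are hypothetical), user-defined metadata rides along with the object without becoming part of its key:

```python
import boto3

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-backup-bucket",            # hypothetical bucket name
    Key="reports-2024-summary.txt",       # the key must still be unique bucket-wide
    Body=b"...",
    Metadata={"original-path": "/very/long/nested/path/summary.txt"},
)
```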
You can read more about S3 objects in the AWS documentation.

Triggering Lambda on basis of multiple files

I'm a bit confused, as I need to run an AWS Glue job when multiple specific files are available in S3. On every file put event in S3, I trigger a Lambda which writes that file's metadata to DynamoDB. In DynamoDB I also maintain a counter which counts the number of required files present.
But when multiple files are uploaded at once, multiple Lambdas are triggered and they write to DynamoDB at nearly the same time, which affects the counter; hence the counter is not able to count accurately.
I need a better way to start the job when the specific (multiple) files are made available in S3.
Kindly suggest a better way.
Dynamo is eventually consistent by default. You need to request a strongly consistent read to guarantee you are reading the same data that was written.
See this page for more information, or for a more concrete example, see the ConsistentRead flag in the GetItem docs.
It's worth noting that this will only minimise your problem: there is still a small window between the read and the write in which another function can read or write the same item. You should think about only allowing one function to run at a time, or some other logic that guarantees mutually exclusive access to the table.
It sounds like you are getting the current count, incrementing it in your Lambda function, then updating DynamoDB with the new value. Instead you need to be using DynamoDB Atomic Counters, which will ensure that multiple concurrent updates will not cause the problems you are describing.
By using Atomic counters you simply send DynamoDB a request to increment your counter by 1. If your Lambda needs to check if this was the last file you were waiting on before doing other work, then you can use the return value from the update call to check what the new count is.
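A minimal sketch of the atomic-counter approach, with a hypothetical table, key and expected count:

```python
import boto3

EXPECTED_FILE_COUNT = 5  # hypothetical number of files you are waiting for

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("file-arrival-tracker")  # hypothetical table name

def record_arrival(batch_id):
    # ADD is applied atomically on the server, so concurrent Lambdas cannot
    # lose increments the way a read-modify-write cycle can.
    response = table.update_item(
        Key={"batch_id": batch_id},
        UpdateExpression="ADD files_received :one",
        ExpressionAttributeValues={":one": 1},
        ReturnValues="UPDATED_NEW",
    )
    new_count = int(response["Attributes"]["files_received"])
    # Exactly one invocation sees the final count (barring duplicate events)
    # and can safely kick off the Glue job.
    return new_count == EXPECTED_FILE_COUNT
```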
Not sure what you mean by "specific" (multiple) files.
If you are expecting specific file names (or "patterns"), then you could just check for all the expected files as the first instruction of your Lambda function. I.e. if you expect the files A.txt, B.txt and C.txt, test whether your S3 bucket contains those 3 specific files (or 3 *.txt files, or whatever suits your requirements). If it does, keep processing; if not, return from the function. This would technically work even with concurrent calls.
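A minimal sketch of that check, with bucket and key names as placeholders:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
EXPECTED_KEYS = ["incoming/A.txt", "incoming/B.txt", "incoming/C.txt"]  # hypothetical

def all_expected_files_present(bucket):
    """Return True only if every expected key already exists in the bucket."""
    for key in EXPECTED_KEYS:
        try:
            s3.head_object(Bucket=bucket, Key=key)
        except ClientError as err:
            if err.response["Error"]["Code"] == "404":
                return False
            raise
    return True

def handler(event, context):
    if not all_expected_files_present("my-landing-bucket"):  # hypothetical bucket
        return  # not everything is there yet; a later invocation will proceed
    # ... start the Glue job here ...
```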

Rails ActiveStorage vs AWS S3 tiers

My application stores MANY MANY images in S3 - we use Rails 5.2 ActiveStorage for that. The images are used a lot for 6 to 9 months. Then they are used VERY rarely until they are 15 months old and deleted automatically by ActiveStorage.
To save some money I'd like to move the files from 'S3-Standard' to 'S3-Infrequent Access (S3-IA)' 9 months after the file's creation (this can be done automatically in AWS).
My question is: Will ActiveStorage still be able to find/display the image in 'S3-IA' in the rare case someone wants to see it? Will ActiveStorage still be able to find the file to delete it at 15 months? Bottom line: I don't want ActiveStorage to lose track of the file when it goes from 'S3-Standard' to 'S3-IA'.
S3-IA just changes the pricing of an object. It doesn't change the visibility of the object, or the time needed to retrieve it (unlike GLACIER storage class).
One thing to be aware of is that IA pricing is based on a minimum object size of 128k. If you have a lot of objects that are smaller, then your costs may actually increase if you save them as IA.
docs
I haven’t tested, but Active Storage should be able to find the object as long as its name doesn’t change.
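For reference, the automatic transition the question mentions is an S3 lifecycle rule; a minimal boto3 sketch (the bucket name is a placeholder, and the empty prefix applies the rule to the whole bucket):

```python
import boto3

s3 = boto3.client("s3")

# Transition objects to STANDARD_IA 270 days (~9 months) after creation;
# deletion at 15 months is still left to ActiveStorage.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-activestorage-bucket",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "standard-to-ia-after-9-months",
                "Filter": {"Prefix": ""},  # empty prefix = whole bucket
                "Status": "Enabled",
                "Transitions": [{"Days": 270, "StorageClass": "STANDARD_IA"}],
            }
        ]
    },
)
```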

How to combine multiple S3 objects in the target S3 object w/o leaving S3

I understand that the minimum part size for a multipart upload to an S3 bucket is 5MB.
Is there any way to have this changed on a per-bucket basis?
The reason I'm asking is that we have a list of raw objects in S3 which we want to combine into a single object in S3.
Using PUT part/copy we are able to "glue" objects into a single one, provided that all objects except the last are >= 5MB. However, sometimes our raw objects are not big enough, and in that case, when we try to complete the multipart upload, we get the famous "Your proposed upload is smaller than the minimum allowed size" error from AWS S3.
Any other idea how we could combine S3 objects without downloading them first?
"However sometimes our raw objects are not big enough... "
You can have a 5MB garbage object sitting in S3 and do the concatenation with it, where part 1 = the 5MB garbage object and part 2 = the file you want to concatenate. Keep repeating this for each fragment, and finally use a ranged copy to strip out the 5MB of garbage.
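A rough boto3 sketch of that trick, with bucket and key names as placeholders (every part except the last must be at least 5MB, which is why the pad goes first):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"               # hypothetical
PAD_KEY = "pad/5mb-garbage"        # pre-created 5MB filler object
SMALL_KEY = "raw/small-fragment"   # the object that is < 5MB
GLUED_KEY = "tmp/padded-result"
FINAL_KEY = "combined/result"
PAD_SIZE = 5 * 1024 * 1024

def copy_part(dst_key, upload_id, part_number, src_key, src_range=None):
    """Server-side copy of src_key (optionally a byte range) as one part."""
    kwargs = dict(
        Bucket=BUCKET, Key=dst_key, UploadId=upload_id,
        PartNumber=part_number, CopySource=f"{BUCKET}/{src_key}",
    )
    if src_range:
        kwargs["CopySourceRange"] = src_range
    response = s3.upload_part_copy(**kwargs)
    return {"PartNumber": part_number, "ETag": response["CopyPartResult"]["ETag"]}

# 1) Glue: part 1 = the 5MB pad, part 2 = the too-small object (last part may be < 5MB).
mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=GLUED_KEY)
parts = [
    copy_part(GLUED_KEY, mpu["UploadId"], 1, PAD_KEY),
    copy_part(GLUED_KEY, mpu["UploadId"], 2, SMALL_KEY),
]
s3.complete_multipart_upload(
    Bucket=BUCKET, Key=GLUED_KEY, UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)

# 2) Strip the pad with a ranged copy of everything after the first 5MB.
glued_size = s3.head_object(Bucket=BUCKET, Key=GLUED_KEY)["ContentLength"]
mpu2 = s3.create_multipart_upload(Bucket=BUCKET, Key=FINAL_KEY)
tail = copy_part(FINAL_KEY, mpu2["UploadId"], 1, GLUED_KEY,
                 src_range=f"bytes={PAD_SIZE}-{glued_size - 1}")
s3.complete_multipart_upload(
    Bucket=BUCKET, Key=FINAL_KEY, UploadId=mpu2["UploadId"],
    MultipartUpload={"Parts": [tail]},
)
```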
There is no way to have the minimum part size changed.
You may want to either:
Stream them together to AWS (which does not seem like an option, otherwise you would already be doing this).
Pad the file so it fills the minimum size of 5MB (which may or may not be feasible for you, since this will increase your bill). You have the option to use either Infrequent Access (if you access these files rarely) or Reduced Redundancy (if you can recover lost files) for these specific files, in order to reduce the impact.
Use an external service that will zip your files (or "glue" them together) and then re-upload them to S3. I don't know if such a service exists, but I am pretty sure you could implement it yourself using a Lambda function (I have even tried something like this in the past: https://github.com/gammasoft/zipper-lambda).

AWS boto3 -- Difference between `batch_writer` and `batch_write_item`

I'm currently using boto3 with DynamoDB, and I noticed that there are two types of batch write:
batch_writer is used in the tutorial, and it seems like you can just iterate through different JSON objects to do inserts (this is just one example, of course).
batch_write_item seems to me to be a DynamoDB-specific function. However, I'm not 100% sure about this, and I'm not sure what the difference between these two functions is (performance, methodology, what not).
Do they do the same thing? If they do, why have 2 different functions? If they don't, what's the difference? How do they compare in performance?
As far as I understand and use these APIs, with batch_write_item() you can even handle data for more than one table in one call, whereas with batch_writer() the actions you specify apply to a single table only. I think that is the most basic difference I can point out.
batch_writer creates a context manager for writing objects to Amazon DynamoDB in batch.
The batch writer will automatically handle buffering and sending items in batches. In addition, the batch writer will also automatically handle any unprocessed items and resend them as needed. All you need to do is call put_item for any items you want to add, and delete_item for any items you want to delete.
In addition, you can specify auto_dedup if the batch might contain duplicated requests and you want this writer to handle de-dup for you.
source
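For illustration, a minimal sketch of both styles (table, key and attribute names are hypothetical); batch_writer buffers and retries unprocessed items for you against one table, while batch_write_item takes the raw request structure, can span several tables per call (up to 25 requests total), and leaves UnprocessedItems retries to you:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
client = boto3.client("dynamodb")

# Resource-level: batch_writer() buffers puts/deletes for ONE table
# and automatically resends unprocessed items.
table = dynamodb.Table("users")  # hypothetical table
with table.batch_writer() as batch:
    for i in range(100):
        batch.put_item(Item={"user_id": str(i), "name": f"user-{i}"})

# Client-level: batch_write_item() uses the low-level attribute-value format,
# can target several tables in one call, and returns anything it could not
# process for you to retry yourself.
response = client.batch_write_item(
    RequestItems={
        "users": [
            {"PutRequest": {"Item": {"user_id": {"S": "100"}, "name": {"S": "user-100"}}}},
        ],
        "orders": [  # hypothetical second table
            {"DeleteRequest": {"Key": {"order_id": {"S": "o-1"}}}},
        ],
    }
)
unprocessed = response.get("UnprocessedItems", {})
```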