Bulk AWS S3 file existence verification?

What is the best/fastest approach to check whether multiple files exist in an AWS S3 bucket?
For example, I have metadata for 100k files in my local DB. I would like to make sure all of them exist in the S3 bucket. I can do 'aws s3 ls' for a particular file, but that would mean 100k AWS requests. Is there a better approach to this?

If you are just doing a general audit, you could use Amazon S3 Inventory to obtain a complete daily dump of all object keys and associated metadata.
You could then write some code to compare the contents of the Inventory file against the DB entries.
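For instance, here is a minimal Python sketch of that comparison, assuming the inventory was delivered in the default CSV format and that your DB keys have been exported to a text file (the file names used here are just placeholders):
import csv
import gzip
from urllib.parse import unquote

# Hypothetical file names -- adjust to your actual inventory data file and DB export.
INVENTORY_CSV_GZ = "inventory-data.csv.gz"   # S3 Inventory CSV data files are delivered gzipped
DB_KEYS_FILE = "db_keys.txt"                 # one object key per line, exported from your DB

# In the default CSV layout the first column is the bucket and the second is the
# (URL-encoded) object key.
with gzip.open(INVENTORY_CSV_GZ, mode="rt", newline="") as f:
    inventory_keys = {unquote(row[1]) for row in csv.reader(f)}

with open(DB_KEYS_FILE) as f:
    db_keys = {line.strip() for line in f if line.strip()}

missing = db_keys - inventory_keys
print(f"{len(missing)} of {len(db_keys)} keys are missing from S3")
for key in sorted(missing):
    print(key)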

If you want to retrieve all keys in a specific bucket with a single command, you can use:
aws s3api list-objects --bucket <bucket-name>
The CLI paginates through the results automatically; note that adding --no-paginate would limit the output to the first 1,000 keys. Once you have that list, you can process it with custom code.
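Equivalently, a short boto3 sketch (the bucket name is a placeholder) that pages through the whole bucket and collects every key:
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

keys = []
# The paginator issues as many ListObjectsV2 calls as needed (1,000 keys per call).
for page in paginator.paginate(Bucket="my-bucket"):
    for obj in page.get("Contents", []):
        keys.append(obj["Key"])

print(f"Found {len(keys)} objects")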

If you would like to make sure your local files are on S3, you can try the aws s3 sync command.
You can also check out which files are there currently with Commandeer, which supports S3 file browsing in a nice tree view.

Related

S3 Bucket AWS CLI takes forever to get specific files

I have a log archive bucket, and that bucket has 2.5m+ objects.
I am looking to download files from a specific time period. I have tried different methods, but all of them are failing.
My observation is that these queries start from the oldest files, but the files I am looking for are the newest ones, so it takes forever to find them.
aws s3 sync s3://mybucket . --exclude "*" --include "2021.12.2*" --include "2021.12.3*" --include "2022.01.01*"
Am I doing something wrong?
Is it possible to make these queries start from the newest files so they take less time to complete?
I also tried using S3 Browser and CloudBerry: same problem. I also tried from an EC2 instance inside the same AWS network: same problem.
2.5m+ objects in an Amazon S3 bucket is indeed a large number of objects!
When listing the contents of an Amazon S3 bucket, the S3 API only returns 1,000 objects per API call. Therefore, when the AWS CLI (or CloudBerry, etc.) lists the objects in the S3 bucket, it requires 2,500+ API calls. This is most probably the reason why the request is taking so long (and possibly failing due to lack of memory to store the results).
You can possibly reduce the time by specifying a Prefix, which reduces the number of objects returned from the API calls. This would help if the objects you want to copy are all in a sub-folder.
Failing that, you could use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You could then extract from that CSV file a list of objects you want to copy (e.g. use Excel or write a program to parse the file). Then, specifically copy those objects using aws s3 cp or from a programming language. For example, a Python program could parse the file and then use download_file() to download each of the desired objects.
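As a rough illustration of that last step, here is a hedged boto3 sketch; the inventory file name, column positions and key prefixes are assumptions based on the example dates above, so adapt them to your own inventory configuration:
import csv
import gzip
import boto3

s3 = boto3.client("s3")
BUCKET = "mybucket"                                  # bucket from the question
WANTED = ("2021.12.2", "2021.12.3", "2022.01.01")    # key prefixes to download

# Assumed inventory layout: gzipped CSV with the bucket in column 1 and the key in column 2.
with gzip.open("inventory-data.csv.gz", mode="rt", newline="") as f:
    keys = [row[1] for row in csv.reader(f)]

for key in keys:
    if key.startswith(WANTED):
        # Keep the key as the local file name, flattening any 'folder' separators.
        s3.download_file(BUCKET, key, key.replace("/", "_"))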
The simple fact is that a flat-structure Amazon S3 bucket with 2.5m+ objects will always be difficult to list. If possible, I would encourage you to use 'folders' to structure the bucket so that you would only need to list portions of the bucket at a time.

Cheapest way in AWS to move/obtain a subset of files based on date

The main question I have is:
How can I move files based on a date range without incurring client-side API calls that cost money?
Background:
I want to download a subset of files from an AWS S3 bucket onto a Linux server, but there are millions of them in ONE folder, with nothing differentiating them except a sequence number, and I need a subset of these based on creation date. (Well, actually, inside the files is an event timestamp, so I want to reduce the bulk first by creation date.)
Frankly, I have no idea what costs I am incurring every time I do an ls on that dataset, e.g. for testing.
Right now I am considering:
aws s3api list-objects --bucket "${S3_BUCKET}" --prefix "${path_from}" --query "Contents[?LastModified>='${low_extract_date}'].{Key: Key}"
but that is client-side, if I understand correctly. So I would like to first move just the relevant files to a different folder, based on creation date.
Then just run aws s3 ls on that set.
Is that possible?
Because in that case, I would either:
move files to another folder, while limiting to the date range that I am interested in (2-5%)
list all these files (as I understand it, this is where the costs are incurred?) and subsequently extract them (and move them to archive)
remove the subset folder
or:
sync bucket into a new bucket
delete all files I don't need from that bucket (older than date X)
run ls on the remaining set
or:
some other way?
And: is that cheaper than listing the files using the query?
thanks!
PS: To clarify, I wish to do a server-side operation to reduce the set first and then list the result.
I believe a good approach to this would be the following:
If your instance is in a VPC, create a VPC endpoint for S3 to allow a direct private connection to Amazon S3 rather than going across the internet.
Move the object keys that you want so that they include a prefix (preferably Y/m/d), e.g. prefix/randomfile.txt might become 2020/07/04/randomfile.txt. If you're planning on scrapping the rest of the files, then move them to a new bucket rather than keeping them in the same bucket.
Get objects based on the prefix (for all files for this month, the prefix would be 2020/07).
From the CLI, you can move a file using the following syntax:
aws s3 mv s3://bucketname/prefix/randomfile.txt s3://bucketname/2020/07/04/randomfile.txt
To copy the files for a specific prefix, you could run the following on the CLI:
aws s3 cp s3://bucketname/2020/07 . --recursive
To list the objects last modified on a specific date, you can run the command below, substituting the date you want (note that with single quotes around the --query expression, a shell variable such as $DATE would not be expanded):
aws s3api list-objects-v2 --bucket bucketname --query 'Contents[?contains(LastModified, `2020-07-04`)]'
You would then feed the resulting keys into the aws s3 mv / aws s3 cp commands above.
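If you would rather script the whole thing, here is a minimal boto3 sketch; the bucket name, source prefix and date are placeholders taken from the example above. It lists objects last modified on a given day and copies them server-side under a Y/m/d prefix:
import boto3
from datetime import date

s3 = boto3.client("s3")
BUCKET = "bucketname"            # placeholder bucket from the example above
SOURCE_PREFIX = "prefix/"        # where the flat files currently live
TARGET_DAY = date(2020, 7, 4)    # the day you want to pull out

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=SOURCE_PREFIX):
    for obj in page.get("Contents", []):
        if obj["LastModified"].date() == TARGET_DAY:
            new_key = TARGET_DAY.strftime("%Y/%m/%d/") + obj["Key"].rsplit("/", 1)[-1]
            # Server-side copy: the object data never leaves S3.
            s3.copy_object(
                Bucket=BUCKET,
                CopySource={"Bucket": BUCKET, "Key": obj["Key"]},
                Key=new_key,
            )
Note that the listing still costs one API request per 1,000 objects; it is only the copy itself that avoids pulling data through the client.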

Download millions of records from s3 bucket based on modified date

I am trying to download millions of records from an S3 bucket to a NAS. Because there is no particular pattern to the filenames, I can rely only on the modified date to execute multiple CLI commands in parallel for a quicker download. I am unable to find any help on downloading files based on modified date. Any inputs would be highly appreciated!
Someone mentioned using s3api, but I am not sure how to use s3api with the cp or sync command to download files.
current command:
aws --endpoint-url http://example.com s3 cp s3://objects/EOB/ \\images\OOSS\EOB --exclude "*" --include "Jun" --recursive
I think this is wrong, because --include here refers to 'Jun' appearing within the file name, not the modified date.
The AWS CLI will copy files in parallel.
Simply use aws s3 sync and it will do all the work for you. (I'm not sure why you are providing an --endpoint-url)
Worst case, if something goes wrong, just run the aws s3 sync command again.
It might take a while for the sync command to gather the list of objects, but just let it run.
If you find that there is a lot of network overhead due to so many small files, then you might consider:
Launch an Amazon EC2 instance in the same region (make it fairly big to get large network bandwidth; cost isn't a factor since it won't run for more than a few days)
Do an aws s3 sync to copy the files to the instance
Zip the files (probably better in several groups rather than one large zip)
Download the zip files via scp, or copy them back to S3 and download from there
This way, you are minimizing the chatter and bandwidth going in/out of AWS.
I'm assuming you're looking to sync arbitrary date ranges, and not simply maintain a local synced copy of the entire bucket (which you could do with aws s3 sync).
You may have to drive this from an Amazon S3 Inventory. Use the inventory list, and specifically the last modified timestamps on objects, to build a list of objects that you need to process. Then partition those somehow and ship sub-lists off to some distributed/parallel process to get the objects.
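As a sketch of that approach (the file name, column layout and date range are assumptions, and the inventory must be configured to include the last-modified field), this reads an inventory CSV, keeps the objects modified in a given window and splits them into chunks for parallel workers:
import csv
import gzip
from datetime import datetime, timezone

START = datetime(2021, 6, 1, tzinfo=timezone.utc)
END = datetime(2021, 7, 1, tzinfo=timezone.utc)
CHUNK_SIZE = 10_000   # keys per worker list

# Assumed column layout: bucket, key, size, last_modified (ISO 8601).
# Match this to the optional fields you enabled on the inventory.
selected = []
with gzip.open("inventory-data.csv.gz", mode="rt", newline="") as f:
    for bucket, key, size, last_modified, *rest in csv.reader(f):
        ts = datetime.fromisoformat(last_modified.replace("Z", "+00:00"))
        if START <= ts < END:
            selected.append(key)

# Write one key list per worker; each worker downloads its chunk in parallel.
for i in range(0, len(selected), CHUNK_SIZE):
    with open(f"keys_{i // CHUNK_SIZE:04d}.txt", "w") as out:
        out.write("\n".join(selected[i:i + CHUNK_SIZE]) + "\n")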

AWS S3 LISTING is slow

I am trying to execute the following command using AWS CLI on an S3 bucket:
aws s3 ls s3://bucket-name/folder_name --summarize --human-readable --recursive
I am trying to get the size of the folder, but given there are multiple levels and a huge number of files, it runs for hours.
Is there an efficient way to quickly get the size at folder level on Amazon S3?
You can use Amazon S3 Inventory:
Amazon S3 inventory provides comma-separated values (CSV) or Apache optimized row columnar (ORC) output files that list your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (that is, objects that have names that begin with a common string).
You would need to parse the file, but all information is provided.
It is only updated daily, so if you need something faster then you'd have to make the calls yourself.
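For example, a small Python sketch along these lines (the file name and column layout are assumptions; enable the Size field in the inventory configuration) that totals the bytes under each top-level 'folder':
import csv
import gzip
from collections import defaultdict

totals = defaultdict(int)

# Assumed columns: bucket, key, size -- adjust to your inventory configuration.
with gzip.open("inventory-data.csv.gz", mode="rt", newline="") as f:
    for bucket, key, size, *rest in csv.reader(f):
        top_level = key.split("/", 1)[0] if "/" in key else "(root)"
        totals[top_level] += int(size)

for prefix, total in sorted(totals.items()):
    print(f"{prefix}: {total / 1024 ** 3:.2f} GiB")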

How to check if the S3 service is available or not in AWS via the CLI?

We have options to:
1. Copy file/object to another S3 location or local path (cp)
2. List S3 objects (ls)
3. Create bucket (mb) and move objects to bucket (mv)
4. Remove a bucket (rb) and remove an object (rm)
5. Sync objects and S3 prefixes
and many more.
But before using these commands, we need to check whether the S3 service is available in the first place. How can we do that?
Is there a command like :
aws S3 -isavailable
and get a response like:
0 - S3 is available; I can go ahead and upload objects/create buckets, etc.
1 - S3 is not available; you can't upload objects, etc.?
You should assume that Amazon S3 is available. If there is a problem with S3, you will receive an error when making a call with the Amazon CLI.
If you are particularly concerned, then run a simple CLI command first, e.g. aws s3 ls, and throw away the results. But that's really the same concept. Alternatively, many AWS CLI commands offer a dry-run option (--dry-run on many service commands, --dryrun on the aws s3 commands) that checks or previews the request without actually running it.
It is more likely that you will have an error in your configuration (eg wrong region, credentials not valid) than S3 being down.
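If you really want a scripted check, a minimal boto3 sketch (the bucket name is a placeholder) is to attempt a cheap call and inspect the error, mirroring the 0/1 exit code you described:
import sys
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

def s3_reachable(bucket: str) -> bool:
    s3 = boto3.client("s3")
    try:
        s3.head_bucket(Bucket=bucket)   # lightweight request; no object data is transferred
        return True
    except EndpointConnectionError:
        return False                    # could not reach the S3 endpoint at all
    except ClientError as err:
        # A 403/404 still means S3 answered, i.e. the service itself is up.
        print(f"S3 responded with an error: {err}", file=sys.stderr)
        return True

if __name__ == "__main__":
    sys.exit(0 if s3_reachable("my-bucket") else 1)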