S3 Bucket AWS CLI takes forever to get specific files

I have a log archive bucket, and that bucket has 2.5m+ objects.
I am looking to download files from a specific time period. I have tried different methods for this, but all of them are failing.
My observation is that these queries start from the oldest files, but the files I am looking for are the newest ones, so it takes forever to find them.
aws s3 sync s3://mybucket . --exclude "*" --include "2021.12.2*" --include "2021.12.3*" --include "2022.01.01*"
Am I doing something wrong?
Is it possible to make these queries start from the newest files so they might take less time to complete?
I also tried using S3 Browser and CloudBerry: same problem. I tried from an EC2 instance inside the same AWS network: same problem.

2.5m+ objects in an Amazon S3 bucket is indeed a large number of objects!
When listing the contents of an Amazon S3 bucket, the S3 API only returns 1000 objects per API call. Therefore, when the AWS CLI (or CloudBerry, etc) is listing the objects in the S3 bucket it requires 2500+ API calls. This is most probably the reason why the request is taking so long (and possibly failing due to lack of memory to store the results).
You can possibly reduce the time by specifying a Prefix, which reduces the number of objects returned from the API calls. This would help if the objects you want to copy are all in a sub-folder.
Failing that, you could use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You could then extract from that CSV file a list of objects you want to copy (e.g. use Excel or write a program to parse the file). Then, specifically copy those objects using aws s3 cp or from a programming language. For example, a Python program could parse the file and then use download_file() to download each of the desired objects.
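A minimal sketch of that approach with boto3, assuming the inventory report has been downloaded locally as inventory.csv with a bucket,key,size,last_modified column layout (the exact columns depend on how the inventory is configured) and that the wanted objects are identified by their date-based key prefixes:

import csv
import boto3

s3 = boto3.client("s3")

BUCKET = "mybucket"   # the bucket from the question
WANTED_PREFIXES = ("2021.12.2", "2021.12.3", "2022.01.01")

# Assumed inventory layout: bucket,key,size,last_modified
with open("inventory.csv", newline="") as f:
    for bucket, key, size, last_modified in csv.reader(f):
        if key.startswith(WANTED_PREFIXES):
            # Flatten the key into a local filename and download it
            s3.download_file(BUCKET, key, key.replace("/", "_"))

This avoids listing the bucket at all; the only S3 calls are for the objects you actually want.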
The simple fact is that a flat-structure Amazon S3 bucket with 2.5m+ objects will always be difficult to list. If possible, I would encourage you to use 'folders' to structure the bucket so that you would only need to list portions of the bucket at a time.

Related

Moving File from S3 to S3 Glacier using C#

I have uploaded 365 files (one file per day) to an S3 bucket, all in one go, so now all the files have the same upload date. I want to move the files that are more than 6 months old to S3 Glacier, but an S3 lifecycle policy will only take effect after 6 months because all the files have the same upload date in S3. The actual upload date of each file is stored in a DynamoDB table along with the S3KeyUrl.
I want to know the best way to move the files to S3 Glacier. I came up with the following approaches:
Create an S3 lifecycle policy to move files to S3 Glacier, which will only take effect after 6 months.
Create an app to query the DynamoDB table for the list of files that are more than 6 months old, download each file from S3, and use ArchiveTransferManager (Amazon.Glacier.Transfer) to upload the file to an S3 Glacier vault (since it allows uploading files from a local directory).
In the production scenario there will be some 10 million files, so the solution needs to be reliable.
There are two versions of Glacier:
The 'original' Amazon Glacier, which uses Vaults and Archives
The Amazon S3 Storage Classes of Glacier and Glacier Deep Archive
Trust me... You do not want to use the 'original' Glacier. It is slow and difficult to use. So, avoid anything that mentions Vaults and Archives.
Instead, you simply want to change the Storage Class of the objects in Amazon S3.
Normally, the easiest way to do this is to "Edit storage class" in the S3 management console. However, you mention millions of objects, so this wouldn't be feasible.
Instead, you will need to copy objects over themselves, while changing the storage class. This can be done with the AWS CLI:
aws s3 cp s3://<bucket-name>/ s3://<bucket-name>/ --recursive --storage-class <storage_class>
Note that this would change the storage class for all objects in the given bucket/path. Since you only wish to selectively change the storage class, you would either need to issue lots of the above commands (each for only one object), or you could use an AWS SDK to script the process. For example, you could write a Python program that loops through the list of objects, checks DynamoDB to determine whether the object is '6 months old' and then copies it over itself with the new Storage Class.
See: StackOverflow: How to change storage class of existing key via boto3
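A minimal sketch of that Python approach, not a definitive implementation; it assumes (hypothetically) that the DynamoDB table is called FileUploads, is keyed on S3KeyUrl, and stores the real upload date in an UploadDate attribute as an ISO 8601 string:

from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("FileUploads")   # hypothetical table name

BUCKET = "my-bucket"
CUTOFF = datetime.now(timezone.utc) - timedelta(days=180)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        # Look up the real upload date recorded in DynamoDB (attribute names assumed)
        item = table.get_item(Key={"S3KeyUrl": key}).get("Item")
        if not item:
            continue
        uploaded = datetime.fromisoformat(item["UploadDate"])   # assumed ISO 8601 string
        if uploaded.tzinfo is None:
            uploaded = uploaded.replace(tzinfo=timezone.utc)
        if uploaded < CUTOFF:
            # Copy the object over itself, changing only the storage class
            # (objects larger than 5 GB would need a multipart copy instead)
            s3.copy_object(
                Bucket=BUCKET,
                Key=key,
                CopySource={"Bucket": BUCKET, "Key": key},
                StorageClass="GLACIER",
            )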
If you have millions of objects, it can take a long time to merely list the objects. Therefore, you could consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You could then use this CSV file as the 'input list' for your 'copy' operation rather than having to list the bucket itself.
Or, just be lazy (which is always more productive!) and archive everything to Glacier. Then, if somebody actually needs one of the files in the next 6 months, simply restore it from Glacier before use. So simple!
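If you do take the lazy route, a lifecycle rule will archive everything for you without copying anything yourself. A sketch with boto3, where the bucket name, rule ID and the 0-day threshold are just placeholders:

import boto3

s3 = boto3.client("s3")

# Transition every object in the bucket to the GLACIER storage class as soon as
# the rule runs (Days=0); the bucket name and rule ID are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-everything-to-glacier",
                "Filter": {"Prefix": ""},      # empty prefix = the whole bucket
                "Status": "Enabled",
                "Transitions": [{"Days": 0, "StorageClass": "GLACIER"}],
            }
        ]
    },
)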

Cheapest way in aws to move/obtain a subset of files based on date

The main question I have is:
How can I move files based on a date range, without incurring client-side API calls that cost money?
Background:
I want to download a subset of files from an AWS S3 bucket onto a Linux server, but there are millions of them in ONE folder, with nothing differentiating them except a sequence number, and I need a subset of these based on creation date. (Well, actually, inside the files is an event timestamp, so I want to reduce the bulk first by creation date.)
Frankly, I have no idea what costs I am incurring every time I do an ls on that dataset, e.g. for testing.
Right now I am considering:
aws s3api list-objects --bucket "${S3_BUCKET}" --prefix "${path_from}" --query "Contents[?LastModified>='${low_extract_date}'].{Key: Key}"
but that is client-side if I understand correctly. So I would like to just move the relevant files to a different folder first, based on creation date.
Then just run aws s3 ls on that set.
Is that possible?
Because in that case, I would either:
move files to another folder, while limiting to the date range that I am interested in (2-5%)
list all these files (as I understand it, this is where the costs are incurred?) and subsequently extract them (and move them to archive)
remove the subset folder
or:
sync bucket into a new bucket
delete all files I don't need from that bucket (older than date X)
run ls on the remaining set
or:
some other way?
And: is that cheaper than listing the files using the query?
thanks!
PS: To clarify, I wish to do a server-side operation to reduce the set initially and then list the result.
I believe a good approach to this would be the following:
If your instance is in a VPC, create a VPC endpoint for S3 to allow a direct private connection to Amazon S3 rather than going across the internet.
Move the objects you want so that their keys include a date prefix (preferably Y/m/d), e.g. prefix/randomfile.txt might become 2020/07/04/randomfile.txt. If you're planning on scrapping the rest of the files, move them to a new bucket rather than within the same bucket.
Get objects based on the prefix (for all files for this month the prefix would be 2020/07).
From the CLI you can move a file using the current syntax
aws s3 mv s3://bucketname/prefix/randomfile.txt s3://bucketname/2020/07/04/randomfile.txt
To copy the files for a specific prefix you could run the following on the CLI
aws s3 cp s3://bucketname/2020/07 . --recursive
To list the objects with a specific LastModified date (held in a shell variable $DATE) you can run the below
aws s3api list-objects-v2 --bucket bucketname --query "Contents[?contains(LastModified, \`$DATE\`)]"
The results of this listing would then need to be fed back into CLI copy/move commands (or handled with an SDK).
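For completeness, here is a minimal boto3 sketch of the same idea, using placeholder bucket/prefix names and a hypothetical date range: it lists the objects under the existing prefix once, then copies the ones whose LastModified falls in the range under a Y/m/d prefix so that later listings only need to touch that prefix:

from datetime import datetime, timezone
import boto3

s3 = boto3.client("s3")

BUCKET = "bucketname"                        # placeholder, as in the commands above
LOW = datetime(2020, 7, 1, tzinfo=timezone.utc)
HIGH = datetime(2020, 8, 1, tzinfo=timezone.utc)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix="prefix/"):
    for obj in page.get("Contents", []):
        if LOW <= obj["LastModified"] < HIGH:
            # Re-key the object under a Y/m/d prefix; later listings can then
            # be restricted to that prefix instead of the whole folder
            new_key = obj["LastModified"].strftime("%Y/%m/%d/") + obj["Key"].rsplit("/", 1)[-1]
            s3.copy_object(
                Bucket=BUCKET,
                Key=new_key,
                CopySource={"Bucket": BUCKET, "Key": obj["Key"]},
            )

Note that S3 has no server-side "filter by date", so one full listing (or an S3 Inventory report) is still needed for the initial split; list requests are cheap (they are billed per request, and each request returns up to 1,000 keys), so the listing cost is usually dwarfed by the cost of transferring the objects themselves.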

Download millions of records from s3 bucket based on modified date

I am trying to download millions of records from an S3 bucket to a NAS. Because there is no particular pattern to the filenames, I can rely only on the modified date to execute multiple CLI commands in parallel for a quicker download. I am unable to find any help on downloading files based on modified date. Any input would be highly appreciated!
Someone mentioned using s3api, but I am not sure how to use s3api with the cp or sync commands to download files.
current command:
aws --endpoint-url http://example.com s3 cp s3://objects/EOB/ \\images\OOSS\EOB --exclude "*" --include "Jun" --recursive
I think this is wrong because --include here refers to the inclusion of 'Jun' within the file name, not the modified date.
The AWS CLI will copy files in parallel.
Simply use aws s3 sync and it will do all the work for you. (I'm not sure why you are providing an --endpoint-url)
Worst case, if something goes wrong, just run the aws s3 sync command again.
It might take a while for the sync command to gather the list of objects, but just let it run.
If you find that there is a lot of network overhead due to so many small files, then you might consider:
Launch an Amazon EC2 instance in the same region (make it fairly big to get large network bandwidth; cost isn't a factor since it won't run for more than a few days)
Do an aws s3 sync to copy the files to the instance
Zip the files (probably better in several groups rather than one large zip)
Download the zip files via scp, or copy them back to S3 and download from there
This way, you are minimizing the chatter and bandwidth going in/out of AWS.
I'm assuming you're looking to sync arbitrary date ranges, and not simply maintain a local synced copy of the entire bucket (which you could do with aws s3 sync).
You may have to drive this from an Amazon S3 Inventory. Use the inventory list, and specifically the last modified timestamps on objects, to build a list of objects that you need to process. Then partition those somehow and ship sub-lists off to some distributed/parallel process to get the objects.
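As a rough sketch of what that could look like in Python (the bucket name, NAS path, date range and inventory column layout below are all assumptions), using a thread pool as the parallel process:

import csv
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone
import boto3

s3 = boto3.client("s3")

BUCKET = "my-bucket"                               # placeholder
LOW = datetime(2021, 6, 1, tzinfo=timezone.utc)    # desired modified-date range
HIGH = datetime(2021, 7, 1, tzinfo=timezone.utc)

def wanted(row):
    # Assumed inventory layout: bucket,key,size,last_modified (ISO 8601)
    last_modified = datetime.fromisoformat(row[3].replace("Z", "+00:00"))
    return LOW <= last_modified < HIGH

with open("inventory.csv", newline="") as f:
    keys = [row[1] for row in csv.reader(f) if wanted(row)]

def download(key):
    # Flatten the key into a filename on the NAS mount (path is a placeholder)
    s3.download_file(BUCKET, key, "/mnt/nas/" + key.replace("/", "_"))

# Download in parallel; consuming the map() iterator surfaces any exceptions
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(download, keys))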

Run AWS lambda function on existing S3 images

I wrote an AWS Lambda function in Node.js for image resizing, and it is triggered when images are uploaded.
I already have more than 1,000,000 images in the bucket.
I want to run this Lambda function on those existing images but have not found a way to do it yet.
How can I run an AWS Lambda function on the existing images in an S3 bucket?
Note: I know this question has already been asked on Stack Overflow, but none of those questions has a working solution yet.
Unfortunately, Lambda cannot be triggered automatically for objects that already exist in an S3 bucket.
You will have to invoke your Lambda function manually for each image in your S3 bucket.
First, you will need to list existing objects in your S3 bucket using the ListObjectsV2 action.
For each object in your S3 bucket, you must then invoke your Lambda function and provide the S3 object's information as the Payload.
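A minimal boto3 sketch of that approach; the bucket and function names are placeholders, and the payload only mimics the parts of an S3 event that a typical resize handler reads (adjust it to whatever fields your handler actually uses):

import json
import boto3

s3 = boto3.client("s3")
lam = boto3.client("lambda")

BUCKET = "my-image-bucket"       # placeholder names
FUNCTION = "image-resize"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        # Minimal S3-event-shaped payload for one object
        payload = {
            "Records": [
                {"s3": {"bucket": {"name": BUCKET}, "object": {"key": obj["Key"]}}}
            ]
        }
        lam.invoke(
            FunctionName=FUNCTION,
            InvocationType="Event",      # asynchronous, fire-and-forget
            Payload=json.dumps(payload).encode("utf-8"),
        )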
Yes, it's completely true that Lambda cannot be triggered by objects already present in your S3 bucket, but invoking your Lambda manually for each object is a completely dumb idea.
With some clever techniques you can perform your task on those images easily:
The hard way is to write a program locally that does exactly the same thing as your Lambda function, with two additions: iterate over every object in your bucket, run your resizing code on it, and save the result to the destination path in S3. In other words, for all the images already stored in your S3 bucket, instead of using Lambda you resize the images locally on your computer and save them back to the S3 destination.
The easiest way is to first make sure that you have configured the S3 notification event type Object Created (All) as the trigger for your Lambda.
Then move all your already-stored images to a new temporary bucket, and then move those images back to the original bucket; that way your Lambda will get triggered for each image automatically. You can do the moving easily with the SDKs provided by AWS, for example with boto3 in Python (see the sketch below).
Instead of moving (i.e. cut and paste), you can use copy operations too.
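A minimal boto3 sketch of the move-out-and-back-in approach; the bucket names are placeholders, and note that copy_object only handles objects up to 5 GB (larger ones would need a multipart copy):

import boto3

s3 = boto3.client("s3")

SOURCE = "my-image-bucket"       # placeholder bucket names
TEMP = "my-image-bucket-temp"

def move_all(src, dst):
    # Copy every object from src to dst, then delete it from src
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=src):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            s3.copy_object(Bucket=dst, Key=key,
                           CopySource={"Bucket": src, "Key": key})
            s3.delete_object(Bucket=src, Key=key)

# Moving the images out and back in re-fires the ObjectCreated trigger on SOURCE
move_all(SOURCE, TEMP)
move_all(TEMP, SOURCE)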
In addition to Mausam Sharma's comment, you can run the copy between buckets using the AWS CLI:
aws s3 sync s3://SOURCE-BUCKET-NAME s3://DESTINATION-BUCKET-NAME --source-region SOURCE-REGION-NAME --region DESTINATION-REGION-NAME
from here:
https://medium.com/tensult/copy-s3-bucket-objects-across-aws-accounts-e46c15c4b9e1
You can simply copy each object back onto itself in the same bucket with the CLI, which rewrites the object and triggers the Lambda as a result. Note that the command is aws s3 cp (there is no aws s3 copy), and that S3 only accepts an in-place copy if something about the object changes, hence the --metadata-directive REPLACE flag:
aws s3 cp s3://SOURCE-BUCKET-NAME s3://SOURCE-BUCKET-NAME --recursive --metadata-directive REPLACE
You can also use --include/--exclude glob patterns to selectively run against, say, a particular day or specific file extensions:
aws s3 cp s3://SOURCE-BUCKET-NAME s3://SOURCE-BUCKET-NAME --recursive --metadata-directive REPLACE --exclude "*" --include "2020-01-15*"
It's worth noting that, like many of the other answers here, this will incur S3 request costs for the reads/writes, so apply it cautiously to buckets containing lots of files.

AWS S3 LISTING is slow

I am trying to execute the following command using AWS CLI on an S3 bucket:
aws s3 ls s3://bucket-name/folder_name --summarize --human-readable --recursive
I am trying to get the size of the folder, but given that there are multiple levels and a huge number of files, it has been running for hours.
Is there an efficient way to quickly get the size at folder level on Amazon S3?
You can use Amazon S3 Inventory:
Amazon S3 inventory provides comma-separated values (CSV) or Apache optimized row columnar (ORC) output files that list your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (that is, objects that have names that begin with a common string).
You would need to parse the file, but all information is provided.
It is only updated daily, so if you need something faster then you'd have to make the calls yourself.
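If you only need per-"folder" totals, a short script over the inventory file is enough. A sketch, assuming the report has been downloaded locally as inventory.csv with a bucket,key,size,last_modified column layout (the actual columns depend on the inventory configuration):

import csv
from collections import defaultdict

sizes = defaultdict(int)

# Assumed inventory layout: bucket,key,size,last_modified
with open("inventory.csv", newline="") as f:
    for bucket, key, size, last_modified in csv.reader(f):
        # Aggregate by top-level 'folder' (the first path segment of the key)
        folder = key.split("/", 1)[0]
        sizes[folder] += int(size)

for folder, total in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{folder}\t{total / 1024**3:.2f} GiB")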