I wrote a lambda function to copy files in an s3 bucket into another s3 bucket and I need to move a very large number of these files. To try and meet the volume requirements I was looking for a way to send these requests in large batches to S3 to cut down on overhead. However I cannot find any information on how to do this in Python. There's a Batch class in the boto3 documentation but I can't make sense of how it works or even what it actually does.
There is no underlying Amazon S3 API call that can copy multiple files in one request.
The best option is to issue requests in parallel so that they will execute faster.
The boto3 Transfer Manager might be able to assist with this effort.
Side-note: There is no such thing as 'move' command for S3. Instead, you will need to copy, then delete. Just mentioning it for other readers.
Related
I have a log archive bucket, and that bucket has 2.5m+ objects.
I am looking to download some specific time period files. For this I have tried different methods but all of them are failing.
My observation is those queries start from oldest file, but the files I seek are the newest ones. So it takes forever to find them.
aws s3 sync s3://mybucket . --exclude "*" --include "2021.12.2*" --include "2021.12.3*" --include "2022.01.01*"
Am I doing something wrong?
Is it possible to make these query start from newest files so it might take less time to complete?
I also tried using S3 Browser and CloudBerry. Same problem. Tried with a EC2 that is inside the same AWS network. Same problem.
2.5m+ objects in an Amazon S3 bucket is indeed a large number of objects!
When listing the contents of an Amazon S3 bucket, the S3 API only returns 1000 objects per API call. Therefore, when the AWS CLI (or CloudBerry, etc) is listing the objects in the S3 bucket it requires 2500+ API calls. This is most probably the reason why the request is taking so long (and possibly failing due to lack of memory to store the results).
You can possibly reduce the time by specifying a Prefix, which reduces the number of objects returned from the API calls. This would help if the objects you want to copy are all in a sub-folder.
Failing that, you could use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You could then extract from that CSV file a list of objects you want to copy (eg use Excel or write a program to parse the file). Then, specifically copy those objects using aws s3 cp or from a programming language. For example, a Python program could parse the script and then use download_file() to download each of the desired objects.
The simple fact is that a flat-structure Amazon S3 bucket with 2.5m+ objects will always be difficult to list. If possible, I would encourage you to use 'folders' to structure the bucket so that you would only need to list portions of the bucket at a time.
I need to move some large file(s) (1 terabyte to 5 terabyte) from one S3 location to a different directory in the same bucket or to a different bucket.
There are few ways that I can think of doing it more robustly.
Trigger a lambda function based on ObjectCreated:Put trigger and use boto3 to copy the file to new location and delete source file. Plain and simple. But if there is any error while copying the files, I lose the event. I have to design some sort of tracking system along with this.
Use same-region-replication and delete the source once the replication is completed. I do not think there is any event emitted once the object is replicated so I am not sure.
Trigger a Step function and have Copy and Delete as separate steps. This way if for some reason Copy or Delete steps fail, I can rerun the state machine. Here again the problem is what if the file size is too big for lambda to copy?
Trigger a lambda function based on ObjectCreated:Put trigger and create a data pipeline and move the file using aws s3 mv. This can get little expensive.
What is the right way of doing this to make this reliable?
I am looking for advise on the right approach. I am not looking for code. Please do not post aws s3 cp or aws s3 mv or aws s3api copy-object one line commands.
Your situation appears to be:
New objects are being created in Bucket A
You wish to 'move' them to Bucket B (or move them to a different location in Bucket A)
The move should happen immediately after object creation
The simplest solution, of course, would be to create the objects in the correct location without needing to move them. I will assume you have a reason for not being able to do this.
To respond to your concepts:
Using an AWS Lambda function: This is the easiest and most-responsive method. The code would need to do a multi-part copy since the objects can be large. If there is an unrecoverable error, the original object would be left in the source bucket for later retry.
Using same-region replication: This is a much easier way to copy the objects to a desired destination. S3 could push the object creation information to an Amazon SQS queue, which could be consulted for later deletion of the source object. You are right that timing would be slightly tricky. If you are fine with keeping some of the source files around for a while, the queue could be processed at regular intervals (eg every 15 minutes).
Using a Step Function: You would need something to trigger the Step Function (another Lambda function?). This is probably overkill since the first option (using Lambda) could delete the source object after a successful copy, without needing to invoke a subsequent step. However, Step Functions might be able to provide some retry functionality.
Use Data Pipeline: Don't. Enough said.
Using an AWS Lambda function to copy an object would require it to send a Copy command for each part of an object, thereby performing a multi-part copy. This can be made faster by running multiple requests in parallel through multiple threads. (I haven't tried that in Lambda, but it should work.)
Such multi-threading has already been implemented in the AWS CLI. So, another option would be to trigger an AWS Lambda function (#1 above) that calls out to run the AWS CLI aws s3 mv command. Yes, this is possible, see: How to use AWS CLI within a Lambda function (aws s3 sync from Lambda) :: Ilya Bezdelev. The benefit of this method is that the code already exists, it works, using aws s3 mv will delete the object after it is successfully copied, and it will run very fast because the AWS CLI implements multi-part copying in parallel.
I have a requirement to transfer data(one time) from on prem to AWS S3. The data size is around 1 TB. I was going through AWS Datasync, Snowball etc... But these managed services are better to migrate if the data is in petabytes. Can someone suggest me the best way to transfer the data in a secured way cost effectively
You can use the AWS Command-Line Interface (CLI). This command will copy data to Amazon S3:
aws s3 sync c:/MyDir s3://my-bucket/
If there is a network failure or timeout, simply run the command again. It only copies files that are not already present in the destination.
The time taken will depend upon the speed of your Internet connection.
You could also consider using AWS Snowball, which is a piece of hardware that is sent to your location. It can hold 50TB of data and costs $200.
If you have no specific requirements (apart from the fact that it needs to be encrypted and the file-size is 1TB) then I would suggest you stick to something plain and simple. S3 supports an object size of 5TB so you wouldn't run into trouble. I don't know if your data is made up of many smaller files or 1 big file (or zip) but in essence its all the same. Since the end-points or all encrypted you should be fine (if your worried, you can encrypt your files before and they will be encrypted while stored (if its backup of something). To get to the point, you can use API tools for transfer or just file-explorer type of tools which have also connectivity to S3 (e.g. https://www.cloudberrylab.com/explorer/amazon-s3.aspx). some other point: cost-effectiviness of storage/transfer all depends on how frequent you need the data, if just a backup or just in case. archiving to glacier is much cheaper.
1 TB is large but it's not so large that it'll take you weeks to get your data onto S3. However if you don't have a good upload speed, use Snowball.
https://aws.amazon.com/snowball/
Snowball is a device shipped to you which can hold up to 100TB. You load your data onto it and ship it back to AWS and they'll upload it to the S3 bucket you specify when loading the data.
This can be done in multiple ways.
Using AWS Cli, we can copy files from local to S3
AWS Transfer using FTP or SFTP (AWS SFTP)
Please refer
There are tools like cloudberry clients which has a UI interface
You can use AWS DataSync Tool
I need to create a python flask application that moves a file from s3 storage to s3 glacier. I cannot use the lifetime policy to do this as I need to use glacier vault lock which isn't possible with the lifetime policy method since I won't be able to use any glacier features on those files. The files will be multiple GBs in size so I need to download these files and then upload them on glacier. I was thinking of adding a script on ec2 that will be triggered by flask and will start downloading and uploading files to glacier.
This is the only solution I have come up with and it doesn't seem very efficient but I'm not sure. I am pretty new to AWS so any tips or thoughts will be appreciated.
Not posting any code as I don't really have a problem with the coding, just the approach I should take.
It appears that your requirement is to use Glacier Vault Lock on some objects to guarantee that they cannot be deleted within a certain timeframe.
Fortunately, similar capabilities have recently been added to Amazon S3, called Amazon S3 Object Lock. This works at the object or bucket level.
Therefore, you could simply use Object Lock instead of moving the objects to Glacier.
If the objects will be infrequently accessed, you might also want to change the Storage Class to something cheaper before locking it.
See: Introduction to Amazon S3 Object Lock - Amazon Simple Storage Service
Suppose I have a couple of terabytes worth of data files that have accumulated on an EC2 instance's block storage.
What would be the most efficient way of downloading them to a local machine? scp? ftp? nfs? http? rsync? Going through an intermediate s3 bucket? Torrent via multiple machines? Any special tools or scripts out there for this particular problem?
As I did not really receive a convincing answer, I decided to make a small measurement myself. Here are the results I got:
More details here.
Please follow these rules:
Move as one file, tar everything into a single archive file.
Create S3 bucket in the same region as your EC2/EBS.
Use AWS CLI S3 command to upload file to S3 bucket.
Use AWS CLI to pull the file to your local or wherever another storage is.
This will be the easiest and most efficient way for you.
Some more info about this usecase is needed. I hope below concepts are helpfull:
HTTP - fast, easy to implement, versatile and has small overhead.
Resilio (formerly BitTorrent Sync) - fast, easy to deploy, decentralized, and secure. Can handle transfer interruptions. Works if both endpoints are behind NAT.
rsync - old school and well known solution. Can resume transfer and fast in syncing big amounts of data.
Upload to S3 and get from there - Upload to S3 is fast. Next You can use HTTP(S) or BitTorrent to get data localy.