Background: We are currently trying to cut over from one AWS account to another. This includes getting a full copy of the S3 buckets into the new account (including all historical versions and timestamps). We first initiated replication to the new account's S3 buckets, ran a batch job to copy the historical data, and then tested against it. Afterward, we emptied the bucket to remove the data added during testing, and then tried to redo the replication/batch job.
Now it seems AWS will not replicate the objects because it sees that they existed in the destination bucket at one point. Looking at the batch job's output, every object shows this:
{bucket} {key} {version} failed 500 DstObjectHardDeleted Currently object can't be replicated if this object previously existed in the destination but was recently deleted. Please try again at a later time
After seeing this, I deleted the destination bucket completely and recreated it, in the hope that it would flush out any previous traces of the data, and then I retried it. The same error occurs.
I cannot find any information on this error or even an acknowledgement in the AWS docs that this is expected or a potential issue.
Can anyone tell me how long we have to wait before replicating again? An hour? 24 hours?
Is there any documentation on this error in AWS?
Is there any way to get around this limitation?
Update: I retried periodically throughout the day and never got an object to replicate. I also tried replicating to a third bucket instead, and then initiating replication from that new bucket to the original target. It throws the same error.
Update2: This post was made on a Friday. Retried the jobs today (the following Monday), and the error remains unchanged.
Update3: Probably the last update. The short version is that I gave up and made a different bucket to replicate into. If anyone has information on this I'm still interested; I just can't waste any more time on it.
Batch Replication does not support re-replicating objects that were hard-deleted (deleted by specifying the object's version ID) from the destination bucket.
Below are possible workarounds for this limitation:
Copy the source objects in place with a Batch Copy job. Copying those objects in place will create new versions of the objects in the source and initiate replication automatically to the destination. You may also use a custom script to do an in-place copy in the source bucket (a rough sketch of such an in-place copy appears after this list).
Re-replicate these source objects to a different/new destination bucket.
Run the aws s3 sync command. It will copy objects to the destination bucket with new version IDs (the version IDs will differ between the source and destination buckets). If you are syncing a large number of objects, run it at the prefix level and estimate how long it will take to copy everything given your network throughput. Run the command in the background by adding "&" at the end. You may also do a dry run before the actual copy; refer to the AWS CLI documentation for more options.
aws s3 sync s3://SOURCE-BUCKET/prefix1 s3://DESTINATION-BUCKET/prefix1 --dryrun > output.txt
aws s3 sync s3://SOURCE-BUCKET/prefix1 s3://DESTINATION-BUCKET/prefix1 > output.txt &
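For reference, here is a minimal boto3 sketch of the in-place copy mentioned in the first workaround; the bucket name and prefix are placeholders, and it assumes the source bucket is versioned with a replication rule already configured. An in-place copy must change something about the object, so it re-applies the existing metadata with MetadataDirective='REPLACE'.

import boto3

# Placeholders: set the bucket name and prefix for your environment.
SOURCE_BUCKET = 'my-source-bucket'
PREFIX = ''  # e.g. 'data/' to limit the copy to one prefix

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=PREFIX):
    for obj in page.get('Contents', []):
        key = obj['Key']
        head = s3.head_object(Bucket=SOURCE_BUCKET, Key=key)
        # A copy of an object onto itself must change something, so re-apply
        # the existing metadata with MetadataDirective='REPLACE'. This writes
        # a new version, which the replication rule then ships across.
        # Note: copy_object handles objects up to 5 GB; larger objects need a
        # multipart copy (e.g. boto3's managed s3.copy()).
        s3.copy_object(
            Bucket=SOURCE_BUCKET,
            Key=key,
            CopySource={'Bucket': SOURCE_BUCKET, 'Key': key},
            Metadata=head.get('Metadata', {}),
            ContentType=head.get('ContentType', 'binary/octet-stream'),
            MetadataDirective='REPLACE',
        )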
In summary, you can use S3 Batch Copy or S3 replication to an existing destination bucket only for objects with new version IDs. To replicate the existing object versions of the source bucket, you will have to use a different/new destination bucket.
We encountered the same thing and tried the same process you outlined. We did get some of the buckets to succeed with the replication batch job in the second account, but the largest of those held just under 2 million objects. We have had to use the AWS CLI to sync the data or use the DataSync service (this process is still ongoing and may have to run many times, breaking the records up).
It appears that when you delete large buckets in the first account, the metadata about them hangs around for a long time. We moved about 150 buckets with varying amounts of data, and only about half made it to the second account using the two-step replication. So the lesson I learned is: if you can control the names of your buckets and change them during the move, do that.
Related
I have a job that needs to transfer ~150GB from one folder into another. This runs once a day.
def copy_new_data_to_official_location(bucket_name):
    # retrieve_aws_connection() is a local helper that returns a boto3 S3 client
    s3 = retrieve_aws_connection('s3')

    # Note: list_objects returns at most 1,000 keys per call
    objects_to_move = s3.list_objects(
        Bucket=bucket_name, Prefix='my/prefix/here')

    for item in objects_to_move['Contents']:
        print(item['Key'])
        copy_source = {
            'Bucket': bucket_name,
            'Key': item['Key']
        }
        # Re-key the object to a segment of the original key and perform a
        # server-side copy within the same bucket
        original_key_name = item['Key'].split('/')[2]
        s3.copy(copy_source, bucket_name, original_key_name)
That is what I have currently. This process takes a bit of time and, if I'm reading correctly, I'm also paying transfer fees to copy the objects.
Is there a better way?
Flow:
Run a large-scale job on Spark to feed in data from folder_1 and an external source
Copy output to folder_2
Delete all contents from folder_1
Copy contents of folder_2 to folder_1
Repeat the above flow on a daily cadence.
Spark is a bit strange, so I need to copy the output to folder_2; writing directly to folder_1 causes a data wipe before the job even kicks off.
There are no Data Transfer fees if the source and destination buckets are in the same Region. Since you are simply copying within the same bucket, there would be no Data Transfer fee.
150 GB is not very much data, but it can take some time to copy if there are many objects. The overhead of initiating the copy can sometimes take more time than actually copying the data. When using the copy() command, all data is transferred within Amazon S3 -- nothing is copied down to the computer where the command is issued.
There are several ways you could make the process faster:
You could issue the copy() commands in parallel. In fact, this is how the AWS Command-Line Interface (CLI) works when using aws s3 cp --recursive and aws s3 sync. A rough sketch of this approach appears after this list.
You could use the AWS CLI to copy the objects rather than writing your own program.
Instead of copying objects once per day, you could configure replication within Amazon S3 so that objects are copied as soon as they are created. (Although I haven't tried this with the same source and destination bucket.)
If you need to be more selective about the objects to copy immediately, you could configure Amazon S3 to trigger an AWS Lambda function whenever a new object is created. The Lambda function could apply some business logic to determine whether to copy the object, and then it can issue the copy() command.
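As promised above, here is a rough sketch of the parallel-copy idea from the first bullet, using boto3 and a thread pool; the bucket name, prefixes and worker count are placeholders, not a definitive implementation.

from concurrent.futures import ThreadPoolExecutor

import boto3

# Placeholders: bucket, prefixes and worker count are illustrative only.
BUCKET = 'my-bucket'
SOURCE_PREFIX = 'my/prefix/here/'
DEST_PREFIX = 'official/location/'

s3 = boto3.client('s3')

def copy_one(key):
    # Server-side copy: no object data passes through this machine.
    dest_key = DEST_PREFIX + key[len(SOURCE_PREFIX):]
    s3.copy({'Bucket': BUCKET, 'Key': key}, BUCKET, dest_key)

# Use a paginator so more than 1,000 objects are handled.
keys = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=BUCKET, Prefix=SOURCE_PREFIX):
    keys.extend(obj['Key'] for obj in page.get('Contents', []))

# Issue the copies in parallel, roughly what aws s3 cp --recursive does.
with ThreadPoolExecutor(max_workers=20) as pool:
    list(pool.map(copy_one, keys))

Because copy() is a server-side operation, the threads only issue API calls; the time saved comes from overlapping the per-object request overhead.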
I need to move some large files (1 to 5 terabytes) from one S3 location to a different directory in the same bucket or to a different bucket.
There are a few ways I can think of to do this more robustly.
Trigger a Lambda function based on an ObjectCreated:Put trigger and use boto3 to copy the file to the new location and delete the source file. Plain and simple. But if there is any error while copying the files, I lose the event. I have to design some sort of tracking system along with this.
Use same-region replication and delete the source once the replication is completed. I do not think there is any event emitted once the object is replicated, so I am not sure.
Trigger a Step Function and have Copy and Delete as separate steps. This way, if the Copy or Delete step fails for some reason, I can rerun the state machine. The problem here again is: what if the file is too big for Lambda to copy?
Trigger a Lambda function based on an ObjectCreated:Put trigger, create a data pipeline, and move the file using aws s3 mv. This can get a little expensive.
What is the right way of doing this to make this reliable?
I am looking for advice on the right approach. I am not looking for code. Please do not post aws s3 cp or aws s3 mv or aws s3api copy-object one-line commands.
Your situation appears to be:
New objects are being created in Bucket A
You wish to 'move' them to Bucket B (or move them to a different location in Bucket A)
The move should happen immediately after object creation
The simplest solution, of course, would be to create the objects in the correct location without needing to move them. I will assume you have a reason for not being able to do this.
To respond to your concepts:
Using an AWS Lambda function: This is the easiest and most-responsive method. The code would need to do a multi-part copy since the objects can be large. If there is an unrecoverable error, the original object would be left in the source bucket for later retry.
Using same-region replication: This is a much easier way to copy the objects to a desired destination. S3 could push the object creation information to an Amazon SQS queue, which could be consulted for later deletion of the source object. You are right that the timing would be slightly tricky. If you are fine with keeping some of the source files around for a while, the queue could be processed at regular intervals (eg every 15 minutes); a sketch of such a queue-draining job appears after this list.
Using a Step Function: You would need something to trigger the Step Function (another Lambda function?). This is probably overkill since the first option (using Lambda) could delete the source object after a successful copy, without needing to invoke a subsequent step. However, Step Functions might be able to provide some retry functionality.
Use Data Pipeline: Don't. Enough said.
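As a rough illustration of the replication option above, here is a sketch of a scheduled job that drains the SQS queue and deletes source objects once S3 reports them as replicated. The queue URL is hypothetical, and the exact ReplicationStatus values should be checked against the current S3 documentation.

import json
import urllib.parse

import boto3

# Hypothetical queue URL; the queue receives the bucket's s3:ObjectCreated events.
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/replicated-objects'

sqs = boto3.client('sqs')
s3 = boto3.client('s3')

def drain_queue_once():
    """Run on a schedule (e.g. every 15 minutes) to clean up replicated sources."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                   WaitTimeSeconds=5)
        messages = resp.get('Messages', [])
        if not messages:
            return
        for msg in messages:
            body = json.loads(msg['Body'])
            all_done = True
            for record in body.get('Records', []):
                bucket = record['s3']['bucket']['name']
                key = urllib.parse.unquote_plus(record['s3']['object']['key'])
                try:
                    head = s3.head_object(Bucket=bucket, Key=key)
                except s3.exceptions.ClientError:
                    continue  # already deleted on an earlier run
                # S3 reports replication progress on the source object itself.
                if head.get('ReplicationStatus') in ('COMPLETE', 'COMPLETED'):
                    s3.delete_object(Bucket=bucket, Key=key)
                else:
                    all_done = False  # leave this message for the next run
            if all_done:
                sqs.delete_message(QueueUrl=QUEUE_URL,
                                   ReceiptHandle=msg['ReceiptHandle'])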
Using an AWS Lambda function to copy an object would require it to send a Copy command for each part of an object, thereby performing a multi-part copy. This can be made faster by running multiple requests in parallel through multiple threads. (I haven't tried that in Lambda, but it should work.)
Such multi-threading has already been implemented in the AWS CLI. So, another option would be to trigger an AWS Lambda function (#1 above) that calls out to run the AWS CLI aws s3 mv command. Yes, this is possible, see: How to use AWS CLI within a Lambda function (aws s3 sync from Lambda) :: Ilya Bezdelev. The benefit of this method is that the code already exists, it works, using aws s3 mv will delete the object after it is successfully copied, and it will run very fast because the AWS CLI implements multi-part copying in parallel.
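For completeness, here is a minimal sketch of option #1: a Lambda handler that performs a managed (multipart) copy and deletes the source only after the copy succeeds. The destination bucket name is a placeholder, and for multi-terabyte objects you would need to confirm that the copy completes within Lambda's 15-minute limit.

import urllib.parse

import boto3

# Placeholder destination; the source bucket and key come from the S3 event.
DEST_BUCKET = 'my-destination-bucket'

s3 = boto3.client('s3')

def lambda_handler(event, context):
    for record in event['Records']:
        src_bucket = record['s3']['bucket']['name']
        src_key = urllib.parse.unquote_plus(record['s3']['object']['key'])

        # boto3's managed copy performs a multipart copy under the hood for
        # large objects, and the copy itself happens entirely within S3.
        s3.copy({'Bucket': src_bucket, 'Key': src_key}, DEST_BUCKET, src_key)

        # Delete the source only after the copy succeeded; on any exception
        # the original object stays in place for a later retry.
        s3.delete_object(Bucket=src_bucket, Key=src_key)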
I have the following currently created in the AWS us-east-1 region and, per the request of our AWS architect, I need to move it all to us-east-2, completely, and continue developing in us-east-2 only. What are the easiest options, requiring the least work and coding (as this is a one-time deal), to move everything?
S3 bucket with a ton of folders and files.
Lambda function.
AWS Glue database with a ton of crawlers.
AWS Athena with a ton of tables.
Thank you so much for taking a look at my little challenge :)
There is no easy answer for your situation. There are no simple ways to migrate resources between regions.
Amazon S3 bucket
You can certainly create another bucket and then copy the content across, either using the AWS Command-Line Interface (CLI) aws s3 sync command or, for a huge number of files, using S3DistCp running under Amazon EMR.
If there are previous Versions of objects in the bucket, it's not easy to replicate them. Hopefully you have Versioning turned off.
Also, it isn't easy to get the same bucket name in the other region. Hopefully you will be allowed to use a different bucket name. Otherwise, you'd need to move the data elsewhere, delete the bucket, wait a day, create the same-named bucket in another region, then copy the data across.
AWS Lambda function
If it's just a small number of functions, you could simply recreate them in the other region. If the code is stored in an Amazon S3 bucket, you'll need to move the code to a bucket in the new region.
AWS Glue
Not sure about this one. If you're moving the data files, you'll need to recreate the database anyway. You'll probably need to create new jobs in the new region (but I'm not that familiar with Glue).
Amazon Athena
If your data is moving, you'll need to recreate the tables anyway. You can use the Athena interface to show the DDL commands required to recreate a table. Then, run those commands in the new region, pointing to the new S3 bucket.
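If you prefer to script this rather than use the console, a rough boto3 sketch follows; the database, table, and results-bucket names are placeholders.

import time

import boto3

# Placeholders: database, table and results bucket are illustrative only.
athena = boto3.client('athena', region_name='us-east-1')

resp = athena.start_query_execution(
    QueryString='SHOW CREATE TABLE my_table',
    QueryExecutionContext={'Database': 'my_database'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results-bucket/ddl/'},
)
query_id = resp['QueryExecutionId']

# Wait for the query to finish.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

# Each result row holds one line of the CREATE TABLE statement.
rows = athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']
ddl = '\n'.join(row['Data'][0].get('VarCharValue', '') for row in rows)

# Edit the LOCATION in this DDL to point at the new bucket, then run it in us-east-2.
print(ddl)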
AWS Support
If this is an important system for your company, it would be prudent to subscribe to AWS Support. They can provide advice and guidance for these types of situations, and might even have some tools that can assist with a migration. The cost of support would be minor compared to the savings in your time and effort.
Is it possible for you to create CloudFormation stacks (from existing resources) using the console, then copy the contents of those stacks and run them in the other region (replacing values where needed)?
See this link: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/resource-import-new-stack.html
I am trying to download millions of records from an S3 bucket to a NAS. Because there is no particular pattern to the filenames, I can rely only on the modified date to run multiple CLI commands in parallel for a quicker download. I am unable to find any help on downloading files based on modified date. Any input would be highly appreciated!
Someone mentioned using s3api, but I am not sure how to use s3api with the cp or sync commands to download files.
Current command:
aws --endpoint-url http://example.com s3 cp s3://objects/EOB/ \\images\OOSS\EOB --exclude "*" --include "Jun" --recursive
I think this is wrong because --include here refers to 'Jun' appearing in the file name, not in the modified date.
The AWS CLI will copy files in parallel.
Simply use aws s3 sync and it will do all the work for you. (I'm not sure why you are providing an --endpoint-url)
Worst case, if something goes wrong, just run the aws s3 sync command again.
It might take a while for the sync command to gather the list of objects, but just let it run.
If you find that there is a lot of network overhead due to so many small files, then you might consider:
Launch an Amazon EC2 instance in the same region (make it fairly big to get large network bandwidth; cost isn't a factor since it won't run for more than a few days)
Do an aws s3 sync to copy the files to the instance
Zip the files (probably better in several groups rather than one large zip)
Download the zip files via scp, or copy them back to S3 and download from there
This way, you are minimizing the chatter and bandwidth going in/out of AWS.
I'm assuming you're looking to sync arbitrary date ranges, and not simply maintain a local synced copy of the entire bucket (which you could do with aws s3 sync).
You may have to drive this from an Amazon S3 Inventory. Use the inventory list, and specifically the last modified timestamps on objects, to build a list of objects that you need to process. Then partition those somehow and ship sub-lists off to some distributed/parallel process to get the objects.
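For buckets small enough to list directly, here is a rough boto3 sketch of filtering on LastModified and downloading the matches; the bucket, prefix, date range, and local path are placeholders. For millions of objects, driving the same filter from an S3 Inventory report, as described above, avoids the listing cost.

from datetime import datetime, timezone
import os

import boto3

# Placeholders: bucket, prefix, date range and local destination are illustrative.
BUCKET = 'my-bucket'
PREFIX = 'objects/EOB/'
LOCAL_DIR = r'\\images\OOSS\EOB'
SINCE = datetime(2023, 6, 1, tzinfo=timezone.utc)
UNTIL = datetime(2023, 7, 1, tzinfo=timezone.utc)

# Pass endpoint_url='http://example.com' here if you use a custom endpoint.
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get('Contents', []):
        # LastModified is a timezone-aware datetime returned by the API.
        if SINCE <= obj['LastModified'] < UNTIL:
            dest = os.path.join(LOCAL_DIR, os.path.basename(obj['Key']))
            s3.download_file(BUCKET, obj['Key'], dest)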
Trying to sync a large (millions of files) S3 bucket from the cloud to local storage seems to be a troublesome process for most S3 tools, as virtually everything I've seen so far uses the GET Bucket operation, patiently getting the whole list of files in the bucket, then diffing it against a list of local files, then performing the actual file transfer.
This looks extremely suboptimal. For example, if one could list the files in a bucket that were created or changed since a given date, this could be done quickly, as the list of files to be transferred would include just a handful, not millions.
However, given that the answer to this question is still true, it's not possible to do so with the S3 API.
Are there any other approaches to do periodic incremental backups of a given large S3 bucket?
On AWS S3 you can configure event notifications (e.g. s3:ObjectCreated:*) to request notification when an object is created. Notifications can be delivered to SNS, SQS, and Lambda. So you can have an application that listens for the event and updates the statistics. You may also want to add a timestamp as part of the statistics. Then just "query" the result for a certain period of time and you will get your delta.
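As a rough sketch of that idea (the DynamoDB table name and attributes are hypothetical), a Lambda function subscribed to the bucket's ObjectCreated events could record each key with its event timestamp; the incremental backup job then queries that table for the time window it cares about.

import urllib.parse

import boto3

# Hypothetical DynamoDB table with 'key' as the partition key; it stores the
# event timestamp so the backup job can later query a time window for deltas.
TABLE_NAME = 's3-object-created-log'

table = boto3.resource('dynamodb').Table(TABLE_NAME)

def lambda_handler(event, context):
    for record in event['Records']:
        table.put_item(
            Item={
                'key': urllib.parse.unquote_plus(record['s3']['object']['key']),
                'bucket': record['s3']['bucket']['name'],
                'event_time': record['eventTime'],  # ISO 8601 timestamp from S3
                'size': record['s3']['object'].get('size', 0),
            }
        )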