Amazon S3 sync millions of files to local for incremental backup

Trying to sync a large (millions of files) S3 bucket from the cloud to local storage seems to be a troublesome process for most S3 tools, as virtually everything I've seen so far uses the GET Bucket (list) operation, patiently retrieving the whole list of files in the bucket, then diffing it against a list of local files, and only then performing the actual file transfer.
This looks extremely suboptimal. For example, if one could list only the files in a bucket that were created or changed since a given date, the sync could be done quickly, as the list of files to be transferred would include just a handful, not millions.
However, given that the answer to this question is still true, this is not possible with the S3 API.
Are there any other approaches to do periodic incremental backups of a given large S3 bucket?

On AWS S3 you can configure event notifications (e.g. s3:ObjectCreated:*) to be notified whenever an object is created. Notifications can be delivered to SNS, SQS, or Lambda. So you can have an application that listens for these events and updates an index of changed objects, adding a timestamp to each entry. Then just query that index for the period of interest and you will get your delta.
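As a minimal sketch of such a listener (the DynamoDB table name s3-object-index and its attribute names are placeholders, not something from the answer above), a Lambda wired to the bucket's s3:ObjectCreated:* notifications could record each new key with its event time:

import boto3
from urllib.parse import unquote_plus

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('s3-object-index')  # hypothetical table, partition key 'key'

def handler(event, context):
    # Invoked by s3:ObjectCreated:* notifications (directly or via SNS/SQS).
    for record in event['Records']:
        table.put_item(Item={
            'key': unquote_plus(record['s3']['object']['key']),  # keys arrive URL-encoded
            'bucket': record['s3']['bucket']['name'],
            'event_time': record['eventTime'],  # ISO-8601 timestamp of the create event
        })

Getting the delta for a backup run then becomes a query against this index for event_time values since the last run, instead of a full bucket listing.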

Related

aws sqs to s3 using lambda

Our upstream system sends JSON messages to our SQS queue; we will have 5 million messages per day.
I need to persist these messages to an S3 bucket for archiving and analytics purposes. I need to dequeue the messages and write 100K messages to an S3 file using a Lambda function. We will have multiple small files created in S3 buckets to facilitate quick processing. The Lambda would be triggered a few times a day. Any sample code for the Lambda function that I can use, or any pointers, would be appreciated.
Processing millions of objects in Amazon S3 is not advisable.
Software or services that attempt to use these objects will be very slow. For example, simply listing the contents of an Amazon S3 bucket can only return 1000 objects per API call. Even services such as Amazon Athena that process multiple files in parallel will be very slow in listing and reading that many objects.
An alternative approach would be to send the messages to an Amazon Kinesis Data Firehose delivery stream, which can combine multiple messages based on size or elapsed time. It can then store many messages together in a single file, thereby reducing the number of objects created in the S3 bucket.
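As a rough sketch of that approach (the delivery stream name is a placeholder, not from the answer above), an SQS-triggered Lambda could simply forward each message body to Firehose and let Firehose do the batching into S3 objects:

import boto3

firehose = boto3.client('firehose')

def handler(event, context):
    # Invoked with a batch of SQS messages; forward each body to Firehose,
    # which buffers by size/time and writes combined objects to S3.
    records = [{'Data': (msg['body'] + '\n').encode('utf-8')} for msg in event['Records']]
    response = firehose.put_record_batch(
        DeliveryStreamName='json-archive-stream',  # hypothetical delivery stream
        Records=records,
    )
    if response['FailedPutCount'] > 0:
        # Raising lets Lambda/SQS retry the batch rather than silently dropping records.
        raise RuntimeError('Some records failed to be delivered to Firehose')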
If you are dealing with 100K+ objects in Amazon S3, also consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects.
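Setting up such an inventory is a one-time configuration. A hedged boto3 sketch (bucket names and the configuration Id are placeholders) might look like:

import boto3

s3 = boto3.client('s3')

# Write a daily CSV inventory of 'source-bucket' into 'inventory-bucket'.
s3.put_bucket_inventory_configuration(
    Bucket='source-bucket',
    Id='daily-objects',
    InventoryConfiguration={
        'Id': 'daily-objects',
        'IsEnabled': True,
        'IncludedObjectVersions': 'Current',
        'Schedule': {'Frequency': 'Daily'},
        'OptionalFields': ['Size', 'LastModifiedDate'],
        'Destination': {
            'S3BucketDestination': {
                'Bucket': 'arn:aws:s3:::inventory-bucket',  # destination must be an ARN
                'Format': 'CSV',
                'Prefix': 'inventory',
            },
        },
    },
)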

Costs Related to Individual Bucket Items in S3

My AWS S3 costs have been going up pretty quickly for usage type "DataTransfer-Out-Bytes". I have thousands of images in this one bucket and I can't seem to find a way to drill down into the bucket to see which individual bucket items might be causing the increase. Is there a way to see which individual files are contributing to the higher data transfer cost?
Use CloudFront if you can; it's cheaper than serving directly from S3 (if you set your cache headers properly!), and CloudFront includes a popular objects report, which would answer your question.
If you're using S3 alone, you need to enable server access logging on the bucket (more storage cost) and then crunch the data in the logs (more data transfer cost) to get your answer. You can use AWS Athena to process the S3 access logs, or use Unix command-line tools like grep/wc/uniq/cut on the log files locally or from a server to find the culprits.
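If you go the do-it-yourself route, the idea is just to total the bytes-sent field per key across the access logs. A rough Python sketch (the regular expression only pulls out the operation, key, and bytes-sent fields, following the documented S3 server access log field order; treat it as a starting point, not a full parser):

import re
import sys
from collections import Counter

# bucket_owner bucket [time] ip requester request_id operation key "request_uri" status error bytes_sent ...
LOG_RE = re.compile(
    r'^\S+ \S+ \[[^\]]+\] \S+ \S+ \S+ (?P<op>\S+) (?P<key>\S+) "[^"]*" \d+ \S+ (?P<bytes>\S+)'
)

bytes_per_key = Counter()
for path in sys.argv[1:]:  # pass downloaded access-log files on the command line
    with open(path) as f:
        for line in f:
            m = LOG_RE.match(line)
            if m and m.group('op') == 'REST.GET.OBJECT' and m.group('bytes') != '-':
                bytes_per_key[m.group('key')] += int(m.group('bytes'))

# Print the 20 keys responsible for the most data transferred out.
for key, total in bytes_per_key.most_common(20):
    print(f'{total:>15} {key}')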

AWS S3 replication DstObjectHardDeleted error during replication

Background: We are currently trying to cut over from one AWS account to another. This includes getting a full copy of the S3 buckets into the new account (including all historical versions and timestamps). We first initiated replication to the new account's S3 buckets, ran a batch job to copy the historical data, and then tested against it. Afterward, we emptied the bucket to remove the data added during testing, and then tried to redo the replication/batch job.
Now it seems AWS will not replicate the objects because it sees they did at one point exist in the bucket. Looking at the batch job's output, every object shows this:
{bucket} {key} {version} failed 500 DstObjectHardDeleted Currently object can't be replicated if this object previously existed in the destination but was recently deleted. Please try again at a later time
After seeing this, I deleted the destination bucket completely and recreated it, in the hope that it would flush out any previous traces of the data, and then I retried it. The same error occurs.
I cannot find any information on this error or even an acknowledgement in the AWS docs that this is expected or a potential issue.
Can anyone tell me how long we have to wait before replicating again? An hour? 24 hours?
Is there any documentation on this error in AWS?
Is there any way to get around this limitation?
Update: I retried periodically throughout the day and never got an upload to replicate. I also tried replicating to a third bucket instead, and then initiating replication from that new bucket to the original target. It throws the same error.
Update2: This post was made on a Friday. Retried the jobs today (the following Monday), and the error remains unchanged.
Update3: Probably the last update. The short version is that I gave up and created a different bucket to replicate into. If anyone has information on this, I'm still interested; I just can't waste any more time on it.
Batch Replication does not support re-replicating objects that were hard-deleted (deleted with the version of the object) from the destination bucket.
Below are possible workarounds for this limitation:
Copy the source objects in place with a Batch Copy job. Copying those objects in place will create new versions of the objects in the source and initiate replication automatically to the destination. You may also use a custom script to do an in-place copy in the source bucket (a sketch of such a script appears after these workarounds).
Re-replicate these source objects to a different/new destination bucket.
Run the aws s3 sync command. It will copy objects to the destination bucket with new version IDs (version IDs will differ between the source and destination buckets). If you are syncing a large number of objects, run it at the prefix level and estimate how long replicating all objects will take based on your network throughput. Run the command in the background by adding "&" at the end. You may also do a dry run before the actual copy; see the aws s3 sync documentation for more options.
aws s3 sync s3://SOURCE-BUCKET/prefix1 s3://DESTINATION-BUCKET/prefix1 --dryrun > output.txt
aws s3 sync s3://SOURCE-BUCKET/prefix1 s3://DESTINATION-BUCKET/prefix1 > output.txt &
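Going back to the first workaround, here is a hedged sketch of the in-place copy script referenced above (the bucket name is a placeholder; note that objects larger than 5 GB need a multipart copy rather than a single copy_object call):

import boto3
from datetime import datetime, timezone

s3 = boto3.client('s3')
SOURCE_BUCKET = 'source-bucket'  # placeholder

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=SOURCE_BUCKET):
    for obj in page.get('Contents', []):
        # Copying an object onto itself requires changing metadata (or storage class);
        # this creates a new version, which replication then copies to the destination.
        s3.copy_object(
            Bucket=SOURCE_BUCKET,
            Key=obj['Key'],
            CopySource={'Bucket': SOURCE_BUCKET, 'Key': obj['Key']},
            MetadataDirective='REPLACE',
            Metadata={'replication-touch': datetime.now(timezone.utc).isoformat()},
        )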
In summary, you can use S3 Batch Copy or S3 replication to an existing destination bucket only for objects with new version IDs. To replicate the existing version IDs of the source bucket's objects, you will have to use a different/new destination bucket.
We encountered the same thing and tried the same process you outlined. We did get some of the buckets to succeed in the second account's replication batch job, but the largest that worked held just under 2 million objects. We have had to use the AWS CLI to sync the data or use the DataSync service (this process is still ongoing and may have to run many times, breaking up the records).
It appears that when large buckets are deleted in the first account, the metadata about them hangs around for a long time. We moved about 150 buckets with varying amounts of data, and only about half made it to the second account via the two-step replication. So the lesson I learned is: if you can control the names of your buckets and change them during the move, do that.

Periodic Read from AWS S3 and publish to SQS

I have an S3 bucket with different files. I need to read those files and publish SQS msg for each row in the file.
I cannot use S3 events as the files need to be processed with a delay - put to SQS after a month.
I can write a scheduler to do this task (read and publish). But can I use AWS for this purpose?
AWS Batch, AWS Data Pipeline, or Lambda?
I need to pass the date (file name) of the data to be read and published.
Edit: The data volume to be dealt with is huge.
I can think of two ways to do this entirely using AWS serverless offerings without even having to write a scheduler.
You could use S3 events to start a Step Function that waits for a month before reading the S3 file and sending messages through SQS.
With a little more work, you could use S3 events to trigger a Lambda function which writes the messages to DynamoDB with a TTL of one month in the future. When the TTL expires, you can have another Lambda that listens to the DynamoDB streams, and when there’s a delete event, it publishes the message to SQS. (A good introduction to this general strategy can be found here.)
While the second strategy might require more effort, you might find it less expensive than using Step Functions depending on the overall message throughput and whether or not the S3 uploads occur in bursts or in a smooth distribution.
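A minimal sketch of the second strategy's stream consumer follows; the queue URL and the payload attribute name are placeholders, and it assumes the table's stream includes old images. It reacts only to the REMOVE records DynamoDB emits when TTL deletes an item, and forwards the stored message to SQS:

import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/delayed-messages'  # placeholder

def handler(event, context):
    # Triggered by the DynamoDB stream of the TTL table.
    for record in event['Records']:
        if record['eventName'] != 'REMOVE':
            continue
        # TTL deletions are performed by the DynamoDB service principal; without this
        # check, manually deleted items would also be forwarded.
        identity = record.get('userIdentity', {})
        if identity.get('principalId') != 'dynamodb.amazonaws.com':
            continue
        old_image = record['dynamodb']['OldImage']  # requires OLD_IMAGE in the stream view
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=old_image['payload']['S'])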
At the core, you need to do two things: enumerate all of the objects in an S3 bucket, and perform some action on any object uploaded more than a month ago.
Can you use Lambda or Batch to do this? Sure. A Lambda could be set to trigger once a day, enumerate the files, and post the results to SQS.
Should you? No clue. A lot depends on your scale, and on what you plan to do if it takes a long time to perform this work. If your S3 bucket has hundreds of objects, it won't be a problem. If it has billions, your Lambda will need to be able to handle being interrupted and continue paging through files from a previous run.
Alternatively, you could use S3 events to trigger a simple Lambda that adds a row to a database. Then, again, some Lambda could run on a cron job that asks the database for old rows, and publishes that set to SQS for others to consume. That's slightly cleaner, maybe, and can handle scaling up to pretty big bucket sizes.
Or, you could do the paging through files, deciding what to do, and processing of old files all on a t2.micro if you just need to do some simple work on a few dozen files every day.
It all depends on your workload and needs.
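As a concrete (but hedged) sketch of the daily-Lambda option described above, with the bucket and queue names as placeholders:

import json
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client('s3')
sqs = boto3.client('sqs')
BUCKET = 'incoming-files'  # placeholder
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/month-old-files'  # placeholder

def handler(event, context):
    # Runs once a day (e.g. on an EventBridge schedule) and queues every object
    # that was uploaded more than ~30 days ago.
    cutoff = datetime.now(timezone.utc) - timedelta(days=30)
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get('Contents', []):
            if obj['LastModified'] < cutoff:
                sqs.send_message(
                    QueueUrl=QUEUE_URL,
                    MessageBody=json.dumps({'bucket': BUCKET, 'key': obj['Key']}),
                )

As the answer notes, this is fine for buckets with hundreds or thousands of objects; at a much larger scale you would need checkpointing across invocations, or the database-backed variant.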

Identifying and deleting S3 Objects that are not being accessed?

I have recently joined a company that uses S3 Buckets for various different projects within AWS. I want to identify and potentially delete S3 Objects that are not being accessed (read and write), in an effort to reduce the cost of S3 in my AWS account.
I read this, which helped me to some extent.
Is there a way to find out which objects are being accessed and which are not?
There is no native way of doing this at the moment, so all the options are workarounds depending on your use case.
You have a few options:
Tag each S3 object with its last-accessed date (e.g. 2018-10-24). First turn on object-level logging for your S3 bucket and set up CloudWatch Events for CloudTrail. The tag can then be updated by a Lambda function that runs on a CloudWatch Event fired by each Get request (a sketch of such a function appears after these options). Then create a function that runs on a scheduled CloudWatch Event to delete all objects with a date tag prior to today.
Query CloudTrail logs: write a custom function to query the last access times from object-level CloudTrail logs. This could be done with Athena, or with a direct query against S3.
Create a Separate Index, in something like DynamoDB, which you update in your application on read activities.
Use a Lifecycle Policy on the S3 Bucket / key prefix to archive or delete the objects after x days. This is based on upload time rather than last access time, so you could copy the object to itself to reset the timestamp and start the clock again.
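To make the first option concrete, here is a hedged sketch of the tag-updating Lambda. It assumes the object-level CloudTrail events reach the function as "AWS API Call via CloudTrail" events with the usual detail.requestParameters shape, and the tag key last-accessed is a placeholder:

from datetime import datetime, timezone

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    # The CloudTrail-sourced event carries the bucket and key of the GetObject call.
    params = event['detail']['requestParameters']
    bucket, key = params['bucketName'], params['key']
    # Note: put_object_tagging replaces the whole tag set; merge with existing tags
    # first if your objects carry other tags you need to keep.
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={'TagSet': [
            {'Key': 'last-accessed', 'Value': datetime.now(timezone.utc).date().isoformat()},
        ]},
    )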
No objects in Amazon S3 are required by other AWS services, but you might have configured services to use the files.
For example, you might be serving content through Amazon CloudFront, providing templates for AWS CloudFormation or transcoding videos that are stored in Amazon S3.
If you didn't create the files and you aren't knowingly using the files, you can probably delete them. But you would be the only person who would know whether they are necessary.
There is a recent AWS blog post that I found very interesting; it describes a cost-optimized approach to solving this problem.
Here is the description from the AWS blog:
The S3 server access logs capture S3 object requests. These are generated and stored in the target S3 bucket.
An S3 inventory report is generated for the source bucket daily. It is written to the S3 inventory target bucket.
An Amazon EventBridge rule is configured that will initiate an AWS Lambda function once a day, or as desired.
The Lambda function initiates an S3 Batch Operations job to tag objects in the source bucket that must be expired, using the following logic:
Capture the number of days (x) configuration from the S3 Lifecycle configuration.
Run an Amazon Athena query that gets the list of objects from the S3 inventory report and the server access logs. Create a delta list of objects that were created earlier than 'x' days ago but not accessed during that time (a hedged sketch of such a query appears at the end of this section).
Write a manifest file with the list of these objects to an S3 bucket.
Create an S3 Batch operation job that will tag all objects in the manifest file with a tag of "delete=True".
The Lifecycle rule on the source S3 bucket will then expire all objects that were created more than 'x' days ago and that carry the "delete=True" tag applied by the S3 Batch Operations job.
Expiring Amazon S3 Objects Based on Last Accessed Date to Decrease Costs
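To illustrate the delta step in that flow, here is a hedged sketch of kicking off the Athena query from the Lambda. The database, table, and column names are placeholders that would have to match your own inventory and access-log table definitions:

import boto3

athena = boto3.client('athena')

# Objects older than 30 days that do not appear in the last 30 days of access logs.
DELTA_QUERY = """
SELECT inv.bucket, inv.key
FROM s3_inventory inv
LEFT JOIN s3_access_logs logs
  ON logs.key = inv.key
 AND logs.request_date > date_add('day', -30, current_date)
WHERE inv.last_modified_date < date_add('day', -30, current_date)
  AND logs.key IS NULL
"""

athena.start_query_execution(
    QueryString=DELTA_QUERY,
    QueryExecutionContext={'Database': 's3_audit'},  # placeholder Glue/Athena database
    ResultConfiguration={'OutputLocation': 's3://athena-results-bucket/delta/'},  # placeholder
)

The query results can then be turned into the manifest file for the S3 Batch Operations tagging job described in the steps above.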