Does S3 Replication/S3 Batch Ops offer Data Integrity? - amazon-web-services

I have a use case where we want to transfer data between AWS accounts. I would like to use S3 Replication, S3 Batch Operations, or DataSync, provided they ensure data integrity, so that I don't have to run additional checks after the data is transferred.

I have synced S3 objects between two different AWS accounts using the AWS CLI. My use case was to push data from a bucket in one account to a bucket in another account, so I used the CLI's sync command.
I was satisfied with the data integrity of this process. On subsequent runs, the sync transferred only the items that had been newly created in the source S3 bucket.
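If an explicit check after the transfer is still wanted, one option is to compare the object listings of the two buckets. The following is only a minimal sketch, not something the sync command does for you: the function names and the profile arguments are hypothetical, and it assumes boto3 is installed and configured. Keep in mind that an ETag is an MD5 digest only for objects uploaded in a single part without SSE-KMS or SSE-C encryption, so an ETag mismatch is a reason to look closer rather than proof of corruption.

```python
from typing import Dict, List, Tuple

# key -> (size in bytes, ETag without the surrounding quotes)
Listing = Dict[str, Tuple[int, str]]


def list_bucket(bucket: str, profile: str) -> Listing:
    """List every object in a bucket as key -> (size, ETag).

    Needs boto3 and credentials; `profile` is a hypothetical AWS CLI profile name.
    """
    import boto3  # imported lazily so compare_listings works without it

    s3 = boto3.Session(profile_name=profile).client("s3")
    listing: Listing = {}
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            listing[obj["Key"]] = (obj["Size"], obj["ETag"].strip('"'))
    return listing


def compare_listings(source: Listing, dest: Listing) -> Dict[str, List[str]]:
    """Report source keys that are missing from dest or differ in size or ETag."""
    missing = sorted(k for k in source if k not in dest)
    different = sorted(k for k in source if k in dest and source[k] != dest[k])
    return {"missing": missing, "different": different}


# Example with in-memory listings instead of real buckets:
src = {"a.txt": (3, "abc"), "b.txt": (5, "def")}
dst = {"a.txt": (3, "abc")}
print(compare_listings(src, dst))
```

With real buckets you would pass the results of two list_bucket calls, one per account profile, to compare_listings.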

Related

Moving objects from one GCS bucket to another Bucket using Terraform

I'd like to use Terraform to move multiple GCS objects from one bucket to another bucket in a different location.
I read through the Terraform documentation but couldn't find anything substantial.
The Terraform provider for Cloud Storage only handles the creation of objects. As a workaround, you can use Terraform with Storage Transfer Service, which schedules a job that transfers multiple objects to a GCS bucket from either AWS S3 or another GCS bucket.
Since this is a GCS to GCS transfer, take note of the following:
In the transfer_spec block, specify gcs_data_source as the only data source to indicate that it is a GCS to GCS transfer; the destination bucket goes in gcs_data_sink.
The schedule block specifies when the transfer will start. If you intend to run it just once, set schedule_end_date to the same date as the start date.
Storage Transfer Service can also be set up through the Google Cloud Console, should you want to try it out:
https://cloud.google.com/storage-transfer/docs/create-manage-transfer-console#configure
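To make the shape of such a job concrete, here is a rough sketch of the request body for a one-time GCS to GCS transfer job, written as a Python dictionary. The camelCase fields are the Storage Transfer Service REST counterparts of the Terraform arguments mentioned above; the function name, project ID and bucket names are placeholders I made up, and the optional delete flag is my assumption of how a "move" (rather than a copy) would be expressed.

```python
import datetime
from typing import Any, Dict


def _date(d: datetime.date) -> Dict[str, int]:
    """Convert a date into the year/month/day object used by the schedule block."""
    return {"year": d.year, "month": d.month, "day": d.day}


def one_time_gcs_transfer(
    project_id: str,
    source_bucket: str,
    sink_bucket: str,
    run_on: datetime.date,
    delete_from_source: bool = False,
) -> Dict[str, Any]:
    """Build a transfer job body that runs once, on `run_on`."""
    job: Dict[str, Any] = {
        "description": f"Copy {source_bucket} to {sink_bucket}",
        "projectId": project_id,
        "status": "ENABLED",
        # Same start and end date: the job is executed a single time.
        "schedule": {
            "scheduleStartDate": _date(run_on),
            "scheduleEndDate": _date(run_on),
        },
        # A GCS source plus a GCS sink makes this a GCS to GCS transfer.
        "transferSpec": {
            "gcsDataSource": {"bucketName": source_bucket},
            "gcsDataSink": {"bucketName": sink_bucket},
        },
    }
    if delete_from_source:
        # Assumption: removing source objects after transfer turns the copy into a move.
        job["transferSpec"]["transferOptions"] = {
            "deleteObjectsFromSourceAfterTransfer": True
        }
    return job


# Placeholder names, for illustration only:
job = one_time_gcs_transfer("my-project", "old-bucket", "new-bucket", datetime.date(2030, 1, 2))
print(job["schedule"])
```

In Terraform the same values would go into the schedule and transfer_spec blocks of the transfer job resource.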

DynamoDB replication across AWS accounts

I am looking for a better way to replicate data from a DynamoDB table in one AWS account to another account.
I know this can be done using Lambda triggers and streams.
Is there something like global tables in AWS that we can use for replication across accounts?
I think the best way to migrate data between accounts is AWS Data Pipeline. This process essentially takes a backup (export) of your DynamoDB table in account A to an S3 bucket of account B via Data Pipeline. Then a second Data Pipeline job in account B imports the data from S3 into the target DynamoDB table.
The step-by-step guide is given in this document:
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb.html
You will also need cross-account access to the S3 bucket that stores the DynamoDB table data from account A, so the bucket (or the files) must be shared between your source account (A) and your destination account (B) until the migration is complete.
Refer to this document for the permissions: https://docs.aws.amazon.com/AmazonS3/latest/dev/example-walkthroughs-managing-access-example2.html
Another approach is to use a script. There is no direct API for migration, so you will have to use two clients, one for each account: one client scans the data in the source table and the other writes that data into the table in the other account.
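A rough sketch of such a script is below. It assumes boto3, two hypothetical AWS CLI profiles (one per account), and a destination table that already exists with the same key schema; all function names are mine. It uses the Table resource on both sides so that items are read and written in the same Python format, and it follows LastEvaluatedKey to page through the whole scan.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Item = Dict[str, Any]


def scan_pages(table: Any) -> Iterator[Dict[str, Any]]:
    """Yield every page of a full table scan, following LastEvaluatedKey."""
    kwargs: Dict[str, Any] = {}
    while True:
        page = table.scan(**kwargs)
        yield page
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return
        kwargs["ExclusiveStartKey"] = last_key


def copy_items(pages: Iterable[Dict[str, Any]], put_item: Callable[[Item], None]) -> int:
    """Send every scanned item to put_item and return how many were copied."""
    count = 0
    for page in pages:
        for item in page.get("Items", []):
            put_item(item)
            count += 1
    return count


def copy_table(table_name: str, source_profile: str, dest_profile: str) -> int:
    """Copy a table between accounts; the profile names are hypothetical."""
    import boto3  # imported lazily so the helpers above work without it

    source = boto3.Session(profile_name=source_profile).resource("dynamodb").Table(table_name)
    dest = boto3.Session(profile_name=dest_profile).resource("dynamodb").Table(table_name)
    # batch_writer buffers the puts and sends them as batch writes.
    with dest.batch_writer() as batch:
        return copy_items(scan_pages(source), lambda item: batch.put_item(Item=item))
```

A full scan consumes read capacity on the source table and this is a one-off copy, not ongoing replication, so for continuous replication the streams and Lambda route from the question still applies.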
I think there is also an import-export tool in the AWSLabs repository, although I have never tried it.
https://github.com/awslabs/dynamodb-import-export-tool

Temporary s3 buckets or different storage methodologies to support AWS batch execution

I want to pass data to and from an AWS batch instance.
The current workflow relies on an S3 bucket for support: I have to take care of uploading and disposing of the S3 data, and a bucket must already exist just for this data exchange.
Is it possible to create a temporary s3 bucket or common workspace with automatic object/bucket disposal? Are there other techniques that can be used for simple data exchange between local code and an AWS Batch instance?

AWS S3 replication without versioning

I have enabled S3 replication in one account to replicate its S3 data to another account, and it all works fine. However, I don't want to use S3 versioning because of its additional cost.
So is there any other way to accommodate this scenario?
The automated Same-Region Replication (SRR) and Cross-Region Replication (CRR) features require versioning to be enabled because of the way data is replicated between S3 buckets. For example, a new version of an object might be uploaded while the bucket is still being replicated, which could lead to problems if there were no separate versions.
If you do not wish to retain other versions, you can configure Amazon S3 Lifecycle Rules to expire (delete) older versions.
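As an illustration, a lifecycle rule of that kind could look roughly like the sketch below. The rule ID, function names and the one-day default are my own choices, and boto3 is assumed for the call that applies it. Be aware that put_bucket_lifecycle_configuration replaces whatever lifecycle configuration the bucket already has.

```python
from typing import Any, Dict


def noncurrent_expiration_rule(days: int = 1) -> Dict[str, Any]:
    """Build a lifecycle configuration that deletes noncurrent versions after `days`."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": "expire-noncurrent-versions",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: applies to the whole bucket
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
            }
        ]
    }


def apply_rule(bucket: str, days: int = 1) -> None:
    """Apply the rule to a bucket; overwrites any existing lifecycle configuration."""
    import boto3  # imported lazily so the rule builder works without it

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=noncurrent_expiration_rule(days),
    )


print(noncurrent_expiration_rule(1))
```

The rule would be needed on each versioned bucket where old versions should not accumulate, which in a replication setup means both the source and the destination.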
An alternative method would be to run the AWS CLI aws s3 sync command at regular intervals to copy the data between buckets. This command would need to be run on an Amazon EC2 instance or even on your own computer. It could be triggered by a cron schedule (Linux) or a Scheduled Task (Windows).

Identifying and deleting S3 Objects that are not being accessed?

I have recently joined a company that uses S3 Buckets for various different projects within AWS. I want to identify and potentially delete S3 Objects that are not being accessed (read and write), in an effort to reduce the cost of S3 in my AWS account.
I read this, which helped me to some extent.
Is there a way to find out which objects are being accessed and which are not?
There is no native way of doing this at the moment, so all the options are workarounds that depend on your use case.
You have a few options:
Tag each S3 object with a date (e.g. 2018-10-24). First turn on object-level logging for your S3 bucket and set up CloudWatch Events for CloudTrail. The tag can then be updated by a Lambda function that runs on a CloudWatch Event fired on a Get event. Finally, create a function that runs on a scheduled CloudWatch Event to delete all objects with a date tag prior to today.
Query the CloudTrail logs: write a custom function to query the last access times from the object-level CloudTrail logs. This could be done with Athena, or with a direct query against the logs in S3.
Create a Separate Index, in something like DynamoDB, which you update in your application on read activities.
Use a Lifecycle Policy on the S3 Bucket / key prefix to archive or delete the objects after x days. This is based on upload time rather than last access time, so you could copy the object to itself to reset the timestamp and start the clock again.
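The first option above (tagging on access) could be sketched roughly as the Lambda handler below. It assumes the function is triggered by CloudTrail S3 data events delivered through CloudWatch Events, where the bucket and key sit under detail.requestParameters; the tag name last-accessed and the function names are hypothetical. Note that put_object_tagging replaces the object's whole tag set, so any existing tags would have to be merged in first.

```python
import datetime
from typing import Any, Dict, Optional, Tuple

TAG_KEY = "last-accessed"  # hypothetical tag name


def accessed_object(event: Dict[str, Any]) -> Optional[Tuple[str, str]]:
    """Return (bucket, key) for a GetObject CloudTrail event, else None."""
    detail = event.get("detail") or {}
    if detail.get("eventName") != "GetObject":
        return None
    params = detail.get("requestParameters") or {}
    bucket, key = params.get("bucketName"), params.get("key")
    if not bucket or not key:
        return None
    return bucket, key


def date_tagging(day: datetime.date) -> Dict[str, Any]:
    """Build the Tagging argument that records the access date."""
    return {"TagSet": [{"Key": TAG_KEY, "Value": day.isoformat()}]}


def handler(event: Dict[str, Any], context: Any) -> None:
    """Lambda entry point: stamp the accessed object with today's date."""
    import boto3  # available in the Lambda runtime; imported lazily for local testing

    target = accessed_object(event)
    if target is None:
        return
    bucket, key = target
    # Overwrites the existing tag set of the object.
    boto3.client("s3").put_object_tagging(
        Bucket=bucket, Key=key, Tagging=date_tagging(datetime.date.today())
    )
```

A second scheduled function would then list the objects, read the tag, and delete those whose date is older than your threshold.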
No objects in Amazon S3 are required by other AWS services, but you might have configured services to use the files.
For example, you might be serving content through Amazon CloudFront, providing templates for AWS CloudFormation or transcoding videos that are stored in Amazon S3.
If you didn't create the files and you aren't knowingly using them, you can probably delete them. But you would be the only person who knows whether they are necessary.
There is a recent AWS blog post describing an approach to this problem that I found very interesting and cost-optimized.
Here is the description from the AWS blog:
1. The S3 server access logs capture S3 object requests. These are generated and stored in the target S3 bucket.
2. An S3 inventory report is generated for the source bucket daily. It is written to the S3 inventory target bucket.
3. An Amazon EventBridge rule is configured that will initiate an AWS Lambda function once a day, or as desired.
4. The Lambda function initiates an S3 Batch Operations job to tag the objects in the source bucket that must be expired, using the following logic:
   a. Capture the number of days (x) configured in the S3 Lifecycle configuration.
   b. Run an Amazon Athena query that gets the list of objects from the S3 inventory report and the server access logs. Create a delta list of the objects that were created more than 'x' days ago but not accessed during that time.
   c. Write a manifest file with the list of these objects to an S3 bucket.
   d. Create an S3 Batch Operations job that tags all objects in the manifest file with the tag "delete=True".
5. The Lifecycle rule on the source S3 bucket will expire all objects that were created more than 'x' days ago and that carry the "delete=True" tag applied by the S3 Batch Operations job.
Expiring Amazon S3 Objects Based on Last Accessed Date to Decrease Costs
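To show the selection logic of that approach in isolation, here is a small sketch in plain Python. It is not the blog's implementation: the blog uses an Athena query, whereas this works on in-memory dictionaries that stand in for the inventory report (key to creation date) and the access logs (key to last access date). The function names and the rule ID are mine; the manifest uses the bucket,key CSV layout with URL-encoded keys, and the lifecycle rule filters on the delete=True tag from the description.

```python
import datetime
from typing import Any, Dict, List
from urllib.parse import quote


def stale_keys(
    created: Dict[str, datetime.date],
    last_access: Dict[str, datetime.date],
    today: datetime.date,
    days: int,
) -> List[str]:
    """Keys created more than `days` ago and not accessed within the last `days`."""
    cutoff = today - datetime.timedelta(days=days)

    def is_stale(key: str) -> bool:
        accessed = last_access.get(key)  # None means no access was logged
        return created[key] < cutoff and (accessed is None or accessed < cutoff)

    return sorted(key for key in created if is_stale(key))


def manifest_csv(bucket: str, keys: List[str]) -> str:
    """Render a CSV manifest with one 'bucket,key' line per object, keys URL-encoded."""
    return "".join(f"{bucket},{quote(key)}\n" for key in keys)


def tagged_expiration_rule(days: int) -> Dict[str, Any]:
    """Lifecycle configuration that expires objects tagged delete=True after `days`."""
    return {
        "Rules": [
            {
                "ID": "expire-tagged-objects",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Tag": {"Key": "delete", "Value": "True"}},
                "Expiration": {"Days": days},
            }
        ]
    }


created = {
    "old-unused": datetime.date(2024, 1, 1),
    "old-used": datetime.date(2024, 1, 1),
    "new": datetime.date(2024, 3, 25),
}
last_access = {"old-used": datetime.date(2024, 3, 28)}
keys = stale_keys(created, last_access, today=datetime.date(2024, 4, 1), days=30)
print(manifest_csv("source-bucket", keys))
```

In this example only old-unused ends up in the manifest: old-used was read within the last 30 days and new is younger than 30 days.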