DynamoDB replication across AWS accounts - amazon-web-services

I am looking for a better way to replicate data from a DynamoDB table in one AWS account to another account.
I know this can be done using Lambda triggers and streams.
Is there something like Global Tables in AWS that we can use for replication across accounts?

I think the best way to migrate data between accounts is AWS Data Pipeline. This process essentially takes a backup (export) of your DynamoDB table in account A to an S3 bucket in account B via Data Pipeline. Then a second Data Pipeline job in account B imports the data from S3 into the target DynamoDB table.
The step-by-step manual is given in this document:
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb.html
You will also need cross-account access to the S3 bucket that stores the DynamoDB table data from account A, so the bucket (or the files) must be shared between your source account (A) and your destination account (B) until the migration is complete.
Refer to this doc for permissions: https://docs.aws.amazon.com/AmazonS3/latest/dev/example-walkthroughs-managing-access-example2.html
Another approach is to use a script. There is no direct API for migration, so you will have to use two clients, one for each account. One client scans the data and the other writes that data into the table in the other account.
I also think they have an import-export tool, as mentioned in their AWSLabs repo, although I have never tried it.
https://github.com/awslabs/dynamodb-import-export-tool
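The two-client script idea above could look roughly like the sketch below. It is only an illustration: the function name `copy_table` is made up, and it assumes both arguments behave like boto3 DynamoDB `Table` resources, each created from a session holding credentials for its own account. It reads the whole table with a paginated scan, so it is meant for small tables.

```python
def copy_table(source_table, dest_table):
    """Copy every item from source_table to dest_table and return the count.

    Both arguments are assumed to behave like boto3 DynamoDB Table resources,
    one per account (hypothetical setup, see the usage comment below).
    """
    copied = 0
    scan_kwargs = {}
    # batch_writer buffers the puts and sends them as BatchWriteItem requests
    with dest_table.batch_writer() as batch:
        while True:
            page = source_table.scan(**scan_kwargs)
            for item in page.get("Items", []):
                batch.put_item(Item=item)
                copied += 1
            last_key = page.get("LastEvaluatedKey")
            if not last_key:
                break  # no more pages to read
            scan_kwargs["ExclusiveStartKey"] = last_key
    return copied

# Usage (profile and table names are placeholders):
# import boto3
# src = boto3.Session(profile_name="account-a").resource("dynamodb").Table("my-table")
# dst = boto3.Session(profile_name="account-b").resource("dynamodb").Table("my-table")
# print(copy_table(src, dst))
```

The destination table has to exist already with a matching key schema; the script only moves items.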

Related

Does S3 Replication/S3 Batch Ops offer Data Integrity?

I have a use case where we want to transfer data between AWS Accounts. I want to use the S3 Replication/S3 Batch Ops/DataSync provided they can ensure the data integrity so that I don't have to use additional checks after data is transferred.
I have synced S3 documents across two different AWS accounts with the AWS CLI. My use case was to push data from one bucket to another bucket in a different AWS account, so I used the AWS CLI.
I was satisfied with the data integrity in this process. On subsequent runs, the sync transferred only the items newly created in the source S3 bucket.

How do I import a DynamoDB table exported to S3 as a new table?

I don't want to use Data Pipeline because it is too cumbersome. I also have a relatively small table, so it would be heavy-handed to use Data Pipeline for it; I could run a script locally to do the import because it's so small.
I used the fully managed Export to S3 feature to export a table to a bucket (in a different account): https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.html
What are my options now for importing that to a new table in the other account?
If there isn't a managed feature for this, does AWS provide a canned script I can point at an S3 folder and give the name of the new table I want to create from it?
According to this blog post, Amazon DynamoDB now supports importing data from S3 buckets into new DynamoDB tables.
The steps for importing data from S3 buckets can be found in their developer guide.
As of 18 August 2022, this feature is now built into DynamoDB and you need no other services or code.
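For a small table, the built-in import can also be started from a script through the `ImportTable` API. The sketch below only builds the request arguments; the helper name `build_import_request`, the single string partition key, and on-demand billing are assumptions for illustration, and it assumes the export was written as gzip-compressed DynamoDB JSON.

```python
def build_import_request(bucket, prefix, table_name, key_name, bucket_owner=None):
    """Build keyword arguments for the DynamoDB ImportTable API.

    Assumes a table with one string partition key and on-demand billing;
    adjust TableCreationParameters to match the real schema.
    """
    source = {"S3Bucket": bucket, "S3KeyPrefix": prefix}
    if bucket_owner:
        # Only needed when the bucket belongs to a different account
        source["S3BucketOwner"] = bucket_owner
    return {
        "S3BucketSource": source,
        "InputFormat": "DYNAMODB_JSON",
        "InputCompressionType": "GZIP",
        "TableCreationParameters": {
            "TableName": table_name,
            "AttributeDefinitions": [
                {"AttributeName": key_name, "AttributeType": "S"}
            ],
            "KeySchema": [{"AttributeName": key_name, "KeyType": "HASH"}],
            "BillingMode": "PAY_PER_REQUEST",
        },
    }

# Usage (bucket, prefix, and table names are placeholders):
# import boto3
# request = build_import_request("my-export-bucket", "exports/data/", "NewTable", "pk")
# boto3.client("dynamodb").import_table(**request)
```

The import always creates a new table, so the table name must not already exist in the target account.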
Another AWS-blessed option is a cross-account DynamoDB table replication that uses Glue in the target account to import the S3 extract and Dynamo Streams for ongoing replication.
You may want to create an AWS Data Pipeline, which already has a recommended template for importing DynamoDB data from S3.
This is the closest you can get to a "managed feature" where you select the S3 prefix and the DynamoDB table.

Copy data from S3 and post process

There is a service that generates data in an S3 bucket that is used for warehouse querying. Data is written to S3 daily.
I am interested in copying that data from S3 to my service account to further classify it. The classification needs to happen in my AWS service account because it is specific to my team/service and is based on information present in my account. The service generating the data in S3 is neither concerned with the classification nor has the data to make classification decisions.
Each S3 file consists of JSON objects (records). For every record, I need to look it up in a DynamoDB table. Based on whether the data exists in the DynamoDB table, I need to add an attribute to the JSON object and store the list in another S3 bucket in my account.
The way I am considering doing this:
Trigger a scheduled CW event periodically to invoke a Lambda that copies the files from the source S3 bucket into a bucket (let's say Bucket A) in my account.
Then, use another scheduled CW event to invoke a Lambda that reads the records in the JSON, compares them with the DynamoDB table to determine the classification, and writes the updated records to another bucket (let's say Bucket B).
I have a few questions regarding this:
Are there better alternatives for achieving this?
Would using aws s3 sync in the first Lambda be a good way to achieve this? My concerns revolve around the Lambdas timing out due to the large amount of data, especially the second Lambda, which needs to check DDB for every record.
Rather than setting up scheduled events, you can trigger the AWS Lambda functions in real-time.
Use Amazon S3 Events to trigger the Lambda function as soon as a file is created in the source bucket. The Lambda function can call CopyObject() to copy the object to Bucket-A for processing.
Similarly, an Event on Bucket-A could then trigger another Lambda function to process the file. Some things to note:
Lambda functions run for a maximum of 15 minutes.
You can increase the memory assigned to a Lambda function, which will also increase the amount of CPU assigned. So, this might speed up the function if it is taking longer than 15 minutes.
By default, 512 MB of temporary storage space (/tmp) is available to a Lambda function; this can be configured up to 10,240 MB.
If the data is too big, or takes too long to process, then you will need to find a way to do it outside of AWS Lambda. For example, using Amazon EC2 instances.
If you can export the data from DynamoDB (perhaps on a regular basis), you might be able to use Amazon Athena to do all the processing, but that depends on what you're trying to do. If it is simple SELECT/JOIN queries, it might be suitable.
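The event-driven copy step described above could be sketched as follows. The function name `copy_new_objects` and the bucket name `bucket-a` are placeholders, and the S3 client is passed in so the logic can be exercised without AWS access. Object keys arrive URL-encoded in S3 event notifications, which is why the key is decoded before the copy.

```python
from urllib.parse import unquote_plus

DEST_BUCKET = "bucket-a"  # hypothetical destination bucket name

def copy_new_objects(event, s3_client, dest_bucket=DEST_BUCKET):
    """Copy every object named in an S3 event notification to dest_bucket."""
    copied = []
    for record in event.get("Records", []):
        src_bucket = record["s3"]["bucket"]["name"]
        # Keys in S3 event notifications are URL-encoded ("+" for spaces, etc.)
        key = unquote_plus(record["s3"]["object"]["key"])
        s3_client.copy_object(
            Bucket=dest_bucket,
            Key=key,
            CopySource={"Bucket": src_bucket, "Key": key},
        )
        copied.append(key)
    return copied

# Lambda entry point (sketch):
# import boto3
# def lambda_handler(event, context):
#     return copy_new_objects(event, boto3.client("s3"))
```

The copy happens server-side, so the function does not download the file, which keeps it well within the time and storage limits listed above for ordinary objects.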

Copy DynamoDB table data cross account real time

What is the easiest approach to copy data from a DDB table in one account to another? By easiest I mean low service maintenance overhead, preferably serverless (so no scheduled jobs using Data Pipeline).
I was exploring the possibility of using DynamoDB Streams; however, this old answer mentions that it is not possible, and I could not find current documentation confirming or disproving this. Is that still the case?
Another option I was considering: Update the Firehose transform lambda that manipulates and then inserts data into the DynamoDB table to publish this to a Kinesis stream with cross account delivery enabled triggering a Lambda that will further process data as required.
This should be possible:
configure the DynamoDB table in the source account with a Stream enabled
create a Lambda function in the same account (source account) and integrate it with the DDB Stream
create a cross-account role, e.g. DynamoDBCrossAccountRole, in the destination account with permissions to perform the necessary operations on the destination DDB table (this role and the destination DDB table are in the same account)
add sts:AssumeRole permissions to your Lambda function's execution role, in addition to the logs permissions for CloudWatch, so that it can assume the cross-account role
call sts:AssumeRole from within your Lambda function and configure the DynamoDB client with the returned credentials, for example:
import boto3

# Assume the cross-account role that lives in the destination account
sts = boto3.client('sts')
sts_response = sts.assume_role(
    RoleArn='arn:aws:iam::<999999999999>:role/DynamoDBCrossAccountRole',
    RoleSessionName='AssumePocRole', DurationSeconds=900)
creds = sts_response['Credentials']
# Build a DynamoDB resource that uses the temporary credentials
dynamodb = boto3.resource(service_name='dynamodb', region_name='<region>',
    aws_access_key_id=creds['AccessKeyId'],
    aws_secret_access_key=creds['SecretAccessKey'],
    aws_session_token=creds['SessionToken'])
now your Lambda function should be able to operate on the DynamoDB table in the destination account from the source account
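Once the client for the destination account exists, the stream handler itself could be sketched like this. The function name `replicate_stream_records` is made up, and the sketch assumes the stream's view type includes new images (NEW_IMAGE or NEW_AND_OLD_IMAGES). Stream records carry items in DynamoDB's typed attribute format, so they are passed to a low-level client rather than to a `Table` resource.

```python
def replicate_stream_records(event, dynamodb_client, table_name):
    """Apply DynamoDB Stream records to a table through a low-level client.

    Returns the number of records applied. Assumes the stream view type
    includes the new item image.
    """
    applied = 0
    for record in event.get("Records", []):
        change = record["dynamodb"]
        if record["eventName"] in ("INSERT", "MODIFY"):
            # NewImage is already in typed form, e.g. {"id": {"S": "1"}}
            dynamodb_client.put_item(TableName=table_name, Item=change["NewImage"])
        elif record["eventName"] == "REMOVE":
            dynamodb_client.delete_item(TableName=table_name, Key=change["Keys"])
        else:
            continue  # unknown event type, skip it
        applied += 1
    return applied

# Lambda entry point (sketch, reusing the assumed-role resource from above):
# def lambda_handler(event, context):
#     return replicate_stream_records(event, dynamodb.meta.client, "DestinationTable")
```

Replaying a batch is harmless here because puts and deletes of whole items are idempotent, which matters since Lambda may retry a stream batch.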
We created a similar cross-account replication system using DynamoDB Streams and Lambda for a hackathon task.
You might see some delay in the records, though, because of Lambda cold starts.
There are ways to tackle this problem too, depending on how busy you are going to keep the Lambda; here is the link.
We actually created a CloudFormation template and a JAR that anyone in our organisation can use to start replication on any table. I won't be able to share them due to security concerns.
Please check out this link for more details.

Identifying and deleting S3 Objects that are not being accessed?

I have recently joined a company that uses S3 Buckets for various different projects within AWS. I want to identify and potentially delete S3 Objects that are not being accessed (read and write), in an effort to reduce the cost of S3 in my AWS account.
I read this, which helped me to some extent.
Is there a way to find out which objects are being accessed and which are not?
There is no native way of doing this at the moment, so all the options are workarounds depending on your use case.
You have a few options:
Tag each S3 Object (e.g. 2018-10-24). First turn on Object Level Logging for your S3 bucket. Set up CloudWatch Events for CloudTrail. The Tag could then be updated by a Lambda Function which runs on a CloudWatch Event, which is fired on a Get event. Then create a function that runs on a Scheduled CloudWatch Event to delete all objects with a date tag prior to today.
Query CloudTrail logs: write a custom function to query the last access times from object-level CloudTrail logs. This could be done with Athena, or with a direct query to S3.
Create a Separate Index, in something like DynamoDB, which you update in your application on read activities.
Use a Lifecycle Policy on the S3 Bucket / key prefix to archive or delete the objects after x days. This is based on upload time rather than last access time, so you could copy the object to itself to reset the timestamp and start the clock again.
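The lifecycle option from the list above could be set up with a configuration like the following sketch. The helper name `expiration_rule`, the rule ID, and the example prefix are placeholders; the dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` expects.

```python
def expiration_rule(prefix, days, rule_id="expire-old-objects"):
    """Build a lifecycle configuration that expires objects under a prefix.

    The age is counted from object creation, not from last access.
    """
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }

# Usage (bucket name and prefix are placeholders):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket",
#     LifecycleConfiguration=expiration_rule("logs/", 90),
# )
```

Note that this call replaces the bucket's whole lifecycle configuration, so any existing rules need to be included in the same request.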
No objects in Amazon S3 are required by other AWS services, but you might have configured services to use the files.
For example, you might be serving content through Amazon CloudFront, providing templates for AWS CloudFormation or transcoding videos that are stored in Amazon S3.
If you didn't create the files and you aren't knowingly using the files, you can probably delete them. But you would be the only person who would know whether they are necessary.
There is a recent AWS blog post with an approach to this problem that I found very interesting and cost-optimized.
Here is the description from AWS blog:
The S3 server access logs capture S3 object requests. These are generated and stored in the target S3 bucket.
An S3 inventory report is generated for the source bucket daily. It is written to the S3 inventory target bucket.
An Amazon EventBridge rule is configured that will initiate an AWS Lambda function once a day, or as desired.
The Lambda function initiates an S3 Batch Operation job to tag objects in the source bucket. These must be expired using the following logic:
Capture the number of days (x) configuration from the S3 Lifecycle configuration.
Run an Amazon Athena query that will get the list of objects from the S3 inventory report and server access logs. Create a delta list with objects that were created earlier than 'x' days, but not accessed during that time.
Write a manifest file with the list of these objects to an S3 bucket.
Create an S3 Batch operation job that will tag all objects in the manifest file with a tag of "delete=True".
The Lifecycle rule on the source S3 bucket will expire all objects that were created prior to 'x' days. They will have the tag given via the S3 batch operation of "delete=True".
Expiring Amazon S3 Objects Based on Last Accessed Date to Decrease Costs
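The "delta list" step in that description is done with an Athena query in the blog post. Purely to illustrate the selection logic, here is a plain-Python stand-in; the function name `stale_keys` and the input shapes (pairs of key and timezone-aware datetime) are assumptions, not the blog's actual data format.

```python
from datetime import datetime, timedelta, timezone

def stale_keys(inventory, access_log, days, now=None):
    """Return keys created more than `days` ago and not accessed since then.

    inventory:  iterable of (key, creation datetime) pairs
    access_log: iterable of (key, request datetime) pairs
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    # Keys with at least one request inside the last `days` days
    recently_accessed = {key for key, when in access_log if when >= cutoff}
    return sorted(
        key
        for key, created in inventory
        if created < cutoff and key not in recently_accessed
    )
```

The resulting list corresponds to the manifest that the S3 Batch Operations job would then tag with "delete=True".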