Copy DynamoDB table data cross account real time - amazon-web-services

What is the easiest approach (easiest meaning the least service maintenance overhead; a serverless approach is preferred if possible) to copy data from a DDB table in one account to another in near real time, preferably without scheduled jobs using Data Pipeline?
I was exploring the possibility of using DynamoDB Streams, however this old answer mentions that it is not possible. I could not find more recent documentation confirming or disproving this. Is that still the case?
Another option I was considering: update the Firehose transform Lambda that manipulates and then inserts data into the DynamoDB table so that it also publishes to a Kinesis stream with cross-account delivery enabled, triggering a Lambda that will further process the data as required.

This should be possible:
1. Configure the DynamoDB table in the source account with a stream enabled.
2. Create a Lambda function in the same (source) account and integrate it with the DDB stream.
3. Create a cross-account role, e.g. DynamoDBCrossAccountRole, in the destination account with permissions to perform the necessary operations on the destination DDB table (this role and the destination DDB table are in the same account).
4. Add sts:AssumeRole permissions to your Lambda function's execution role, in addition to the CloudWatch Logs permissions, so that it can assume the cross-account role.
5. Call sts:AssumeRole from within your Lambda function and configure a DynamoDB client with the returned credentials, for example:
import boto3

# Assume the cross-account role in the destination account
client = boto3.client('sts')
sts_response = client.assume_role(RoleArn='arn:aws:iam::999999999999:role/DynamoDBCrossAccountRole',
                                  RoleSessionName='AssumePocRole', DurationSeconds=900)

# Create a DynamoDB resource using the temporary credentials returned by AssumeRole
dynamodb = boto3.resource(service_name='dynamodb', region_name='<region>',
                          aws_access_key_id=sts_response['Credentials']['AccessKeyId'],
                          aws_secret_access_key=sts_response['Credentials']['SecretAccessKey'],
                          aws_session_token=sts_response['Credentials']['SessionToken'])
Now your Lambda function should be able to operate on the DynamoDB table in the destination account from the source account.
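For context, here is a minimal sketch (not part of the original answer) of what the stream-handling Lambda could look like end to end; the role ARN, the destination table name dest-table, and the region placeholder are assumptions:

import boto3
from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()

def lambda_handler(event, context):
    # Temporary credentials obtained via sts:AssumeRole, as in the snippet above
    creds = boto3.client('sts').assume_role(
        RoleArn='arn:aws:iam::999999999999:role/DynamoDBCrossAccountRole',
        RoleSessionName='ReplicationSession')['Credentials']
    dynamodb = boto3.resource('dynamodb', region_name='<region>',
                              aws_access_key_id=creds['AccessKeyId'],
                              aws_secret_access_key=creds['SecretAccessKey'],
                              aws_session_token=creds['SessionToken'])
    table = dynamodb.Table('dest-table')  # assumed destination table name

    for record in event['Records']:
        if record['eventName'] in ('INSERT', 'MODIFY'):
            # Stream images are in DynamoDB JSON; convert them to plain Python types first
            item = {k: deserializer.deserialize(v) for k, v in record['dynamodb']['NewImage'].items()}
            table.put_item(Item=item)
        elif record['eventName'] == 'REMOVE':
            key = {k: deserializer.deserialize(v) for k, v in record['dynamodb']['Keys'].items()}
            table.delete_item(Key=key)

Note that the stream view type would need to include NEW_IMAGE for the NewImage attribute to be present.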

We built a similar cross-account replication system using DynamoDB Streams and Lambda for a hackathon task.
You might see some delay in the records, though, because of Lambda's cold-start issue.
There are ways to tackle this problem too, depending on how busy you are going to keep the Lambda; here is the link.
We actually created a CloudFormation template and a JAR which can be used by anyone internal to our organisation to start replication on any table. I won't be able to share it due to security concerns.
Please check out this link for more details.

Related

Not able to send records from kinesis firehose to private redshift

I have a use case where my Redshift cluster is private and only supports VPN connections into its VPC. I need to send data to it from Kinesis Firehose, which is in another VPC. I found out that we would need to make Redshift public or attach an internet gateway to make this happen, but I can't use an internet gateway. I need to connect to Redshift from Kinesis Firehose over VPN only, and I am not able to figure out any way to do this.
As you are already aware, you cannot use a private Redshift cluster in a VPC as a target for Firehose without Internet access. There is no direct solution for this as detailed here and here.
That said, I can think of at least two workarounds that might suffice.
You can have Firehose target S3. Then set up PrivateLink access to S3 from the private VPC and set up an event to copy the data into the Redshift cluster on an acceptable cadence. I think this is probably the best option.
You MIGHT be able to set up Firehose with a Lambda processor that feeds the records into Redshift. The reason I say "might" is because the Lambda will also need to be within the VPC and will need to be able to keep up with the Firehose flow. This could be fraught with performance issues and potentially expensive, and as a data warehouse Redshift isn't really meant to handle a high rate of write transactions. This is the worst option.
Firehose aggregates data in S3 and then triggers a COPY command in Redshift. As you don't have a network path from Firehose to Redshift, this fails. However, Firehose can just stop at placing the data in S3.
Now you just need a way to trigger Redshift to COPY the data. There are a number of ways to do this, but the easiest is to use a Lambda (in your Redshift VPC) to issue the COPY commands. You will need to decide when the Lambda should run. Firehose uses two parameters to determine when a COPY should be issued: time since the last COPY and the amount of data not yet copied. You can emulate this behavior if you like, but the simplest way is to just issue COPYs on some regular time interval, like every 5 minutes.
To do this you set up CloudWatch to trigger your Lambda every 5 minutes. The Lambda then:
1. looks in the Firehose location in S3 and lists all the files,
2. renames (moves) all these files to put them in a new, uniquely named S3 "subfolder",
3. issues the COPY command to Redshift to ingest from this "subfolder".
Upon successful ingestion these files can be moved again, left in the above "subfolder", or deleted.
The reason to rename/move the files in S3 is to ensure that each run of the Lambda is operating on a unique set of files and that files aren't ingested more than once.
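A minimal sketch of such a Lambda, assuming the Redshift Data API is used to issue the COPY (the bucket name, prefixes, cluster details, IAM role, and target table are placeholders, not from the answer above):

import uuid
import boto3

s3 = boto3.client('s3')
redshift_data = boto3.client('redshift-data')  # Redshift Data API avoids managing DB connections

BUCKET = 'my-firehose-bucket'           # placeholder
INCOMING_PREFIX = 'firehose/incoming/'  # where Firehose writes (placeholder)
COPY_ROLE_ARN = 'arn:aws:iam::111111111111:role/RedshiftCopyRole'  # placeholder

def lambda_handler(event, context):
    batch_prefix = f'firehose/batches/{uuid.uuid4()}/'

    # 1. List the files Firehose has delivered since the last run
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=INCOMING_PREFIX)
    keys = [obj['Key'] for obj in listing.get('Contents', [])]
    if not keys:
        return

    # 2. "Move" them into a uniquely named batch prefix so no file is ingested twice
    for key in keys:
        new_key = batch_prefix + key.split('/')[-1]
        s3.copy_object(Bucket=BUCKET, CopySource={'Bucket': BUCKET, 'Key': key}, Key=new_key)
        s3.delete_object(Bucket=BUCKET, Key=key)

    # 3. Issue the COPY against the batch prefix
    redshift_data.execute_statement(
        ClusterIdentifier='my-cluster',    # placeholder
        Database='dev', DbUser='awsuser',  # placeholders
        Sql=f"COPY my_table FROM 's3://{BUCKET}/{batch_prefix}' "
            f"IAM_ROLE '{COPY_ROLE_ARN}' FORMAT AS JSON 'auto';")

Pagination of the S3 listing and error handling around the COPY are omitted for brevity.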

DynamoDB replication across AWS accounts

I am looking for a better way to replicate data from one AWS account DynamoDB to another account.
I know this can be done using Lambda triggers and streams.
Is there something like Global Tables in AWS that we can use for replication across accounts?
I think the best way to migrate data between accounts is using AWS Data Pipeline. This process essentially takes a backup (export) of your DynamoDB table in account A to an S3 bucket in account B via Data Pipeline. Then one more Data Pipeline job in account B imports the data from S3 into the target DynamoDB table.
The step-by-step manual is given in this document:
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb.html.
Also, you will need cross-account access to the S3 bucket that you will be using to store your DynamoDB table data from account A, so the bucket (or the files) must be shared between your account (A) and the destination account (B) until the migration completes.
Refer to this doc for the permissions: https://docs.aws.amazon.com/AmazonS3/latest/dev/example-walkthroughs-managing-access-example2.html
Another approach you can take is a script. There is no direct API for migration, so you would use two clients, one for each account: one client scans the data and the other writes it into the table in the other account, as in the sketch below.
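A rough sketch of that script, assuming credentials for both accounts are available as named profiles (the profile names, region placeholder, and table names are assumptions):

import boto3

# Two sessions, one per account (profile names are placeholders)
source = boto3.Session(profile_name='account-a').resource('dynamodb', region_name='<region>')
dest = boto3.Session(profile_name='account-b').resource('dynamodb', region_name='<region>')

source_table = source.Table('source-table')
dest_table = dest.Table('dest-table')

def migrate():
    scan_kwargs = {}
    # batch_writer buffers and sends BatchWriteItem requests automatically
    with dest_table.batch_writer() as batch:
        while True:
            page = source_table.scan(**scan_kwargs)
            for item in page['Items']:
                batch.put_item(Item=item)
            if 'LastEvaluatedKey' not in page:
                break
            scan_kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']

migrate()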
Also, I think they have an import-export tool in their AWSLabs repo, although I have never tried it.
https://github.com/awslabs/dynamodb-import-export-tool

Copy data from S3 and post process

There is a service that generates data in an S3 bucket that is used for warehouse querying. Data is inserted into S3 on a daily basis.
I am interested in copying that data from S3 into my service account to further classify it. The classification needs to happen in my AWS account because it is based on information present in my account and is specific to my team/service. The service generating the data in S3 is neither concerned with the classification nor has the data to make the classification decision.
Each S3 file consists of JSON objects (records). For every record, I need to look up a DynamoDB table. Based on whether data exists in the DynamoDB table, I need to add an additional attribute to the JSON object and store the resulting list in another S3 bucket in my account.
The way I am considering doing this:
Trigger a scheduled CloudWatch event periodically to invoke a Lambda that will copy the files from the source S3 bucket into a bucket (let's say Bucket A) in my account.
Then use another scheduled CloudWatch event to invoke a Lambda to read the records in the JSON, compare them with the DynamoDB table to determine the classification, and write the updated records to another bucket (let's say Bucket B).
I have a few questions regarding this:
Are there better alternatives for achieving this?
Would using aws s3 sync in the first Lambda be a good way to achieve this? My concerns revolve around the Lambdas timing out due to the large amount of data, especially for the second Lambda that needs to compare against DDB for every record.
Rather than setting up scheduled events, you can trigger the AWS Lambda functions in real-time.
Use Amazon S3 Events to trigger the Lambda function as soon as a file is created in the source bucket. The Lambda function can call CopyObject() to copy the object to Bucket-A for processing.
Similarly, an Event on Bucket-A could then trigger another Lambda function to process the file. Some things to note:
Lambda functions run for a maximum of 15 minutes
You can increase the memory assigned to a Lambda function, which also increases the amount of CPU assigned. So this might speed up the function if it is at risk of taking longer than 15 minutes.
There is a maximum of 512 MB of temporary storage (/tmp) made available to a Lambda function.
If the data is too big, or takes too long to process, then you will need to find a way to do it outside of AWS Lambda. For example, using Amazon EC2 instances.
If you can export the data from DynamoDB (perhaps on a regular basis), you might be able to use Amazon Athena to do all the processing, but that depends on what you're trying to do. If it is simple SELECT/JOIN queries, it might be suitable.
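To illustrate the event-driven flow described above, here is a minimal sketch of the two Lambda functions; the bucket names (bucket-a, bucket-b), the DynamoDB table classification-lookup with key id, the added classified attribute, and the newline-delimited JSON file format are all assumptions for the example:

import json
import boto3

s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')
lookup_table = dynamodb.Table('classification-lookup')  # assumed table name

def copy_handler(event, context):
    # Triggered by an S3 event on the source bucket: copy each new object into Bucket-A
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        s3.copy_object(CopySource={'Bucket': bucket, 'Key': key},
                       Bucket='bucket-a', Key=key)  # 'bucket-a' is a placeholder

def classify_handler(event, context):
    # Triggered by an S3 event on Bucket-A: enrich each record and write to Bucket-B
    for record in event['Records']:
        key = record['s3']['object']['key']
        body = s3.get_object(Bucket='bucket-a', Key=key)['Body'].read()
        records = [json.loads(line) for line in body.splitlines() if line.strip()]

        for item in records:
            resp = lookup_table.get_item(Key={'id': item['id']})  # assumed key schema
            item['classified'] = 'Item' in resp                   # assumed attribute name

        s3.put_object(Bucket='bucket-b', Key=key,                 # 'bucket-b' is a placeholder
                      Body='\n'.join(json.dumps(r) for r in records).encode())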

DynamoDB triggering a Lambda function in another Account

I have a DynamoDB table in Account A and an AWS Lambda function in Account B. I need to trigger the Lambda function when there are changes in the DynamoDB table.
I came across "aws lambda - It is possible to Access AWS DynamoDB streams accross accounts? - Stack Overflow", which says it is not possible. But then I found "amazon web services - Cross account role for an AWS Lambda function - Stack Overflow", which says it is possible. I am not sure which one is correct.
Has somebody tried the scenario I am trying to achieve?
The first link that is being pointed to is correct: triggers from a stream-based event source to Lambda are limited to the same AWS account and the same region.
However, there is a way you will be able to achieve your goal.
Pre-requisite: I assume you already have a DynamoDB (DDB) table (let's call it Table_A) created in AWS account A. Also, you have a processing Lambda (let's call it Processing_Lambda) in AWS account B.
Steps:
1. Create a new proxy Lambda (let's call it Proxy_Lambda) in account A. This Lambda will broadcast the events it processes.
2. Enable a DynamoDB stream on table Table_A. This stream will contain all insert/update/delete events on the table.
3. Create a Lambda trigger so that Proxy_Lambda reads events from the stream of Table_A.
4. Create a new SNS topic (let's call it AuditEventFromTableA) in account A.
5. Add code in Proxy_Lambda to publish the events read from the stream to the SNS topic AuditEventFromTableA (see the sketch after these steps).
6. Create an SQS queue (it can also be a FIFO queue if your use case requires ordered events) in account B. Let's call this queue AuditEventQueue-TableA-AccountA.
7. Create a subscription from the SNS topic AuditEventFromTableA in account A to the SQS queue AuditEventQueue-TableA-AccountA in account B. This allows the SNS events from account A to be received in the SQS queue of account B.
8. Create a trigger for Processing_Lambda in account B to consume messages from the SQS queue AuditEventQueue-TableA-AccountA.
Result: this way you will be able to trigger the Lambda in account B based on changes to the DynamoDB table in account A.
Note: if your use case demands strict ordering of events, you may prefer publishing the update events from Proxy_Lambda directly to a Kinesis stream in account B instead of the SNS-SQS path.
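A minimal sketch of Proxy_Lambda for step 5 (the topic ARN with its account ID and region is a placeholder, and the stream record is simply forwarded as JSON):

import json
import boto3

sns = boto3.client('sns')
TOPIC_ARN = 'arn:aws:sns:<region>:111111111111:AuditEventFromTableA'  # placeholder

def lambda_handler(event, context):
    # Forward each DynamoDB stream record to the SNS topic in account A
    for record in event['Records']:
        sns.publish(TopicArn=TOPIC_ARN,
                    Message=json.dumps(record, default=str),
                    Subject=record['eventName'])  # INSERT / MODIFY / REMOVE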
Simple!
Create a proxy Lambda A in account A and permit A to call the target Lambda B in account B.
The DDB stream triggers Lambda A, and Lambda A calls Lambda B.
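For illustration, a minimal sketch of Lambda A invoking Lambda B (the function ARN is a placeholder; Lambda B would need a resource-based policy allowing account A to invoke it, and Lambda A's role would need lambda:InvokeFunction on that ARN, which are assumptions beyond what the answer states):

import json
import boto3

lambda_client = boto3.client('lambda')
TARGET_ARN = 'arn:aws:lambda:<region>:999999999999:function:lambda-b'  # placeholder

def lambda_handler(event, context):
    # Forward the DynamoDB stream batch to Lambda B in account B
    lambda_client.invoke(FunctionName=TARGET_ARN,
                         InvocationType='Event',  # asynchronous fire-and-forget
                         Payload=json.dumps(event, default=str).encode())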

AWS Cross-Account, Cross-Region (to China) access? No Lambda in CN - alternatives?

In our "standard" AWS account, I have a system that does something like this:
CloudWatch Rule (Scheduled Event) -> Lambda function (accesses DynamoDB table, makes computations, writes metrics) -> CloudWatch Alarm (consume metrics, etc.)
However, in our separate CN account, we need to do a similar thing, but in CN, there's no Lambda...
Is there any way we can do something similar to what was done above using the services available in CN? For example, is it possible to create a rule and have it trigger a Lambda function in our "standard/non-CN" AWS account that accesses the other account's DynamoDB table?
I ultimately accomplished this by having the Lambda and the CloudWatch alarm live in the non-CN account, and then having the Lambda access the DynamoDB table across accounts and across regions.
This actually ended up working, though it did involve using IAM user credentials instead of a role, which I would have been able to use had it not been CN.
If anyone is interested in more details on this solution, feel free to comment and I can add more.
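For reference, a minimal sketch of that cross-partition access, assuming the CN-account IAM user's access keys are supplied to the Lambda via environment variables (the region and table name are placeholders):

import os
import boto3

# Credentials of an IAM user in the CN account, injected via environment variables (assumption)
cn_dynamodb = boto3.resource(
    'dynamodb',
    region_name='cn-north-1',  # placeholder CN region
    aws_access_key_id=os.environ['CN_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['CN_SECRET_ACCESS_KEY'])

table = cn_dynamodb.Table('my-cn-table')  # placeholder table name

def lambda_handler(event, context):
    # Read from the CN-account table, compute metrics, then publish them (not shown)
    items = table.scan(Limit=100).get('Items', [])
    return {'item_count': len(items)}

Storing long-lived user access keys needs care; the environment-variable approach here is only for illustration.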
You can mix and match AWS resources across regions. In your code, you need to make sure the regions are correctly configured for those resources.
With respect to the trigger, have the trigger wherever you have your Lambda. That will simplify things.
Hope it helps.