I want to move my DynamoDB table (which has approx. 1000 rows) to S3, and every time the DynamoDB table gets updated, the file in S3 should be updated automatically.
What's a good way to implement this? Is my approach correct?
Initial step:
DynamoDB -> Glue/Data Pipeline -> S3
Updating S3:
DynamoDB (or DynamoDB Stream?) -> Lambda -> S3
Is it better to use Glue or Data Pipeline for moving? And is there a helpful link on how I can write a Lambda function for this case?
Since the number of rows is small, it is better to go with a Lambda-based solution to avoid the bootstrap time of Glue/Data Pipeline (which uses EMR) and the associated costs.
Doing CRUD operations on files is difficult, so the Lambda function can do a full table scan every time, write the results to a file, and push it to S3, as in the sketch below.
This function can run on a schedule, with a frequency of 1 minute or whatever interval suits you.
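Here's a minimal sketch of such a function, assuming hypothetical table, bucket, and key names (replace them with yours):

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

TABLE_NAME = "my-table"              # hypothetical table name
BUCKET = "my-export-bucket"          # hypothetical bucket name
KEY = "exports/my-table.json"

def handler(event, context):
    table = dynamodb.Table(TABLE_NAME)

    # A full scan is fine at ~1000 rows; paginate anyway in case the table grows.
    items = []
    response = table.scan()
    items.extend(response["Items"])
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])

    # Overwrite the same S3 object with the latest snapshot of the table.
    s3.put_object(
        Bucket=BUCKET,
        Key=KEY,
        Body=json.dumps(items, default=str),  # default=str handles Decimal values
    )
    return {"count": len(items)}
```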
Hope this helps!
Related
I want to periodically insert data from S3 (or other sources) into Amazon Redshift, i.e., when data is added to my S3 bucket, I want an option to add it automatically to my Amazon Redshift cluster.
My preferred method for doing this is to set up a trigger that fires every time a file is created in a given part of a bucket. This trigger creates an event that invokes a Lambda function, which issues the desired SQL to Redshift. (Or, if the work needed in Redshift is complex or long-running, I will use a Step Function, but this is rare.)
Example setups for this:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html
https://64lines.medium.com/building-a-aws-lambda-function-to-run-aws-redshift-sql-scripts-in-python-7468b7c2fdea
I'd start simple if you can and work up to the Redshift Data API and Step Functions.
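For example, here is a minimal sketch of a Lambda that runs a COPY via the Redshift Data API when an object lands in S3. The cluster, database, user, table, and role names below are all placeholders:

```python
import urllib.parse
import boto3

redshift_data = boto3.client("redshift-data")

# Placeholders: use your own cluster, database, user, and role.
CLUSTER_ID = "my-cluster"
DATABASE = "dev"
DB_USER = "awsuser"
COPY_ROLE_ARN = "arn:aws:iam::123456789012:role/my-redshift-copy-role"

def handler(event, context):
    # Pull the bucket and key of the newly created object out of the S3 event.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    sql = (
        f"COPY my_table FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{COPY_ROLE_ARN}' FORMAT AS CSV IGNOREHEADER 1;"
    )

    # execute_statement is asynchronous; the Lambda does not wait for the COPY to finish.
    resp = redshift_data.execute_statement(
        ClusterIdentifier=CLUSTER_ID,
        Database=DATABASE,
        DbUser=DB_USER,
        Sql=sql,
    )
    return {"statement_id": resp["Id"]}
```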
You can automate the insertion of data from S3 with a scheduled Lambda that triggers periodically. This might be a better solution than invoking a Lambda on every object upload, especially if you are receiving lots of files continuously.
I'm working on a task of copying CSV files from an S3 bucket to Redshift. I've found multiple ways to do so, but I'm not sure which one will be the best possible way to do it. Here's the scenario:
On regular intervals, multiple CSV files of size around 500 MB - 1 GB will be added to my S3 bucket. The data can contain duplicates. The task is to copy the data to a Redshift table while ensuring that the duplicate data is not present in Redshift.
Here are the ways I found which can be used:
Create an AWS Lambda function which will be triggered whenever a file is added to the S3 bucket.
Use AWS Kinesis
Use AWS Glue
I understand Lambda should not be used for jobs that take more than 5 minutes. So should I use it or just eliminate this option?
Kinesis can handle large amounts of data, but is it the best way to do it?
I'm not familiar with Glue and Kinesis. But I read that Glue can be slow.
If anyone can point me to the right direction, it will be really helpful.
You can definitely make it work with Lambda if you leverage Step Functions and the S3 Select option to filter subsets of data into smaller chunks. You'd have your Step Functions manage your ETL orchestration, wherein you execute your Lambdas that selectively pull from the large data file via the S3 Select option. Your pre-process state (see links below) could be used to determine execution requirements, then execute multiple Lambdas, even in parallel, if you wish. Those Lambdas would process the subsets of data to remove duplicates and perform any other ETL operations you might require. Then, you'd take the processed data and write it to Redshift. Here are links that will help you put that architecture together:
Trigger State Machine Execution from S3 Event
Manage Lambda Processing Executions and workflow state
Use S3 Select to pull subsets from large data objects
Also, here's a link to a Python ETL pipeline example for the CDK that I built. You'll see an example of an S3 event-driven Lambda along with data processing and DDB or MySQL writes. It will give you an idea of how you can build out comprehensive Lambdas for ETL operations. You would just need to add a psycopg2 layer to your deployment for Redshift. Hope this helps.
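For reference, here is a small sketch of pulling a subset of a large CSV with S3 Select via boto3. The bucket, key, and filter expression are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key; only the rows matching the WHERE clause come back.
resp = s3.select_object_content(
    Bucket="my-data-bucket",
    Key="incoming/large-file.csv",
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s WHERE s._1 = 'some-partition-value'",
    InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; Records events carry chunks of the result.
rows = []
for event in resp["Payload"]:
    if "Records" in event:
        rows.append(event["Records"]["Payload"].decode("utf-8"))
print("".join(rows))
```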
There is a service that generates data in S3 bucket that is used for warehouse querying. Data is inserted into S3 as daily mechanism.
I am interested in copying that data from S3 to my service account to further classify the data. The classification needs to happen in my AWS account because it is based on information that is specific to my team/service and only present in my account. The service generating the data in S3 is neither concerned with the classification nor has the data to make the classification decision.
Each S3 file consists of JSON objects (records). For every record, I need to look it up in a DynamoDB table. Based on whether the data exists in the DynamoDB table, I need to add an additional attribute to the JSON object and store the updated records in another S3 bucket in my account.
The way I am considering doing this:
Trigger a scheduled CloudWatch event periodically to invoke a Lambda that will copy the files from the source S3 bucket into a bucket (let's say Bucket A) in my account.
Then, use another scheduled CloudWatch event to invoke a Lambda to read the records in the JSON, compare them with the DynamoDB table to determine the classification, and write the updated records to another bucket (let's say Bucket B).
I have a few questions regarding this:
Are there better alternatives for achieving this?
Would using aws s3 sync in the first Lambda be a good way to achieve this? My concerns revolve around the Lambdas timing out due to the large amount of data, especially for the second Lambda that needs to compare against DDB for every record.
Rather than setting up scheduled events, you can trigger the AWS Lambda functions in real-time.
Use Amazon S3 Events to trigger the Lambda function as soon as a file is created in the source bucket. The Lambda function can call CopyObject() to copy the object to Bucket-A for processing.
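A minimal sketch of that first copy function, with hypothetical bucket names:

```python
import urllib.parse
import boto3

s3 = boto3.client("s3")

DESTINATION_BUCKET = "bucket-a"  # hypothetical name for Bucket-A

def handler(event, context):
    for record in event["Records"]:
        source_bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Server-side copy; the object bytes never pass through the Lambda.
        s3.copy_object(
            Bucket=DESTINATION_BUCKET,
            Key=key,
            CopySource={"Bucket": source_bucket, "Key": key},
        )
```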
Similarly, an Event on Bucket-A could then trigger another Lambda function to process the file. Some things to note:
Lambda functions run for a maximum of 15 minutes
You can increase the memory assigned to a Lambda function, which will also increase the amount of CPU assigned. So, this might speed up the function if it is taking longer than 15 minutes.
There is a maximum of 512MB of storage space made available for a Lambda function.
If the data is too big, or takes too long to process, then you will need to find a way to do it outside of AWS Lambda. For example, using Amazon EC2 instances.
If you can export the data from DynamoDB (perhaps on a regular basis), you might be able to use Amazon Athena to do all the processing, but that depends on what you're trying to do. If it is simple SELECT/JOIN queries, it might be suitable.
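And here is a sketch of the second Lambda the question describes, triggered by an event on Bucket-A: it reads the JSON records, looks each one up in DynamoDB, tags it, and writes the result to Bucket-B. The table name, key attribute, and file layout (a JSON array) are all assumptions here:

```python
import json
import urllib.parse
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("classification-table")  # hypothetical table

BUCKET_B = "bucket-b"  # hypothetical destination bucket

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        items = json.loads(body)  # assumes the file is a JSON array of records

        for item in items:
            # "id" is an assumed key attribute; classify on presence in DynamoDB.
            found = table.get_item(Key={"id": item["id"]}).get("Item")
            item["classification"] = "known" if found else "unknown"

        s3.put_object(Bucket=BUCKET_B, Key=key, Body=json.dumps(items))
```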
I want to move(export) data from DynamoDB to S3
I have seen this tutorial, but I'm not sure if the extracted DynamoDB data will be deleted or will coexist in DynamoDB and S3 at the same time.
What I expect is that the data will be deleted from DynamoDB and stored in S3 (after being stored in DynamoDB for X time).
The main purpose of the project could be similar to this
Is there any way to do this without having to develop a Lambda function?
In summary, I have found these 2 different ways:
DynamoDB -> Pipeline -> S3 (is the DynamoDB data deleted?)
DynamoDB -> DynamoDB TTL + DynamoDB Stream -> Lambda -> Firehose -> S3 (this appears to be more difficult)
Is this post currently valid for this purpose?
What would be the simplest and fastest way?
In your first option, by default, data is not removed from DynamoDB. You could design the pipeline to make this work, but I think that is not the best solution.
In your second option, you must evaluate the solution based on your expected data volume:
If the data volume that will expire under the TTL definition is not very large, you can use Lambda to persist the removed data to S3 without Firehose. You can design a simple Lambda function to be triggered by the DynamoDB Stream and persist each stream event as an S3 object. You can even trigger another Lambda function to consolidate the objects into a single file at the end of the day, week, or month. But again, this depends on your expected volume.
If you have a lot of data expiring at the same time and you must perform transformations on it, the best solution is to use Firehose. Firehose can transform, encrypt, and compress your data before sending it to S3. If the volume of data is too big, consolidating with functions at the end of the day, week, or month may not be feasible, so it's better to perform all these steps before persisting it.
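For the Lambda-only variant, here is a minimal sketch of a function triggered by the DynamoDB Stream that archives TTL-expired items to S3. The bucket name and key layout are hypothetical, and the stream must be configured to include old images (e.g. NEW_AND_OLD_IMAGES):

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "expired-items-archive"  # hypothetical archive bucket

def handler(event, context):
    for record in event["Records"]:
        # Only archive deletions performed by the TTL process, not manual deletes.
        if record["eventName"] != "REMOVE":
            continue
        if record.get("userIdentity", {}).get("principalId") != "dynamodb.amazonaws.com":
            continue

        # OldImage is the expired item in DynamoDB-JSON form.
        old_image = record["dynamodb"]["OldImage"]
        key = f"expired/{record['dynamodb']['SequenceNumber']}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(old_image))
```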
You can use AWS Data Pipeline to dump the DynamoDB table to S3, and the data will not be deleted from DynamoDB.
I'm trying to set up ElasticSearch import process from DynamoDB table. I have already created AWS Lambda and enabled DynamoDB stream with trigger that invokes my lambda for every added/updated record. Now I want to perform initial seed operation (import all records that are currently in my DynamoDB table to ElasticSearch). How do I do that? Is there any way to make all records in a table be "reprocessed" and added to stream (so they can be processed by my lambda)? Or is it better to write a separate function that will manually read all data from the table and send it to ElasticSearch - so basically have 2 lambdas: one for initial data migration (executed only once and triggered manually by me), and another one for syncing new records (triggered by DynamoDB stream events)?
Thanks for all the help :)
Depending on how large your dataset is, you may not be able to seed your database from Lambda, as there is a max timeout of 300 seconds (EDIT: this is now 15 minutes, thanks #matchish).
You could spin up an EC2 instance and use the SDK to perform a DynamoDB scan operation and batch write to your Elasticsearch instance.
You could also use Amazon EMR to perform a Map Reduce Job to export to S3 and from there process all your data.
I would write a script that touches each record in DynamoDB. For each item in your DynamoDB table, add a new property called migratedAt or whatever you wish. Adding this property will trigger the DynamoDB stream, which in turn will trigger your Lambda handler. Based on your question, your Lambda handler already handles the update, so there is no change there.
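A minimal sketch of such a script, assuming a hypothetical table name and partition key:

```python
import time
import boto3

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical table name

def touch_all_items():
    # Scan the full table, paginating through all results.
    response = table.scan()
    items = response["Items"]
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])

    for item in items:
        # Writing a new attribute produces a MODIFY event on the stream,
        # which re-invokes the existing Lambda and re-indexes the item.
        table.update_item(
            Key={"id": item["id"]},  # "id" is an assumed partition key
            UpdateExpression="SET migratedAt = :ts",
            ExpressionAttributeValues={":ts": int(time.time())},
        )

if __name__ == "__main__":
    touch_all_items()
```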