We have a hundreds of thousands of .csv files stored in S3 that contain at least several data records each. (each record is its own row)
I am trying to design a migration strategy to transform all the records in the .csv files and put them into DynamoDB. During the migration, I'd also like to ensure that if any new .csv gets added to the S3 bucket, we automatically trigger a lambda or something to do the transformation and write to DynamoDB as well.
Eventually we'd stop writing to S3 entirely, but initially we need to keep those writes and any writes to S3 to also trigger a write to DynamoDB. Does anyone know of any good strategies for doing this? (Is there something like DynamoDB streams except for S3?) Any strategies for getting the existing things in .csv in S3 over to DynamoDB in general?
AWS has many tools you can use to solve this problem. Here are a few.
You could use AWS Database Migration Service. It supports migrating data from S3 and into DynamoDB. This AWS product is designed specifically for your use case, and it handles pretty much everything.
Once the migration has started, DMS manages all the complexities of the migration process including automatically replicating data changes that occur in the source database during the migration process.
S3 can publish events to trigger a lambda function which can be used to continuously replicate the data to DynamoDB.
AWS Data Pipelines basically does batch ETL jobs, which could move your data all at once from S3 to DynamoDB. You might also be able to run periodic sync jobs if you can tolerate a delay in replicating data to DynamoDB.
AWS Glue can crawl your data, process it, and store it in another location. I think it would provide you with an initial load plus the ongoing replication. While it could work, it’s designed more for unstructured data, and you have CSV files which are usually structured.
I’d recommend using AWS Database Migration Service because it’s the one-stop solution, but if you can’t use it for some reason, there are other options.
I don't know if DynamoDB has "load records from CSV" feature (RedShift does).
If it does not, then you could roll your own. Write a Python function that imports the csv and boto3 modules, takes as input an S3 path (inside an event dictionary). The function would them download the file from S3 to temp dir, parse it with csv, then use boto3 to insert into DynamoDB.
To get the history loaded, write a function that uses `boto3' to read the list of objects in S3, then call the first function to upload to DynamoDB.
To get future files loaded, install the first function as a Lambda function, and add a trigger from S3 Object Creation events to run the function whenever a new object is put onto S3.
Related
I want to periodically insert data from S3 (or other fonts) into Amazon Redshift, i.e., when data is added to my S3 bucket, I want an option to add it automatically to my Amazon Redshift cluster.
My preferred method for doing this is to establish a trigger that fire every time a file is created in a part of a bucket. This trigger creates an event that initiates a Lambda function that issues the desired SQL to Redshift. (Or if the work that is needed in Redshift is complex or long running I will use a step function but this is rare.)
Example setups for this:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html
https://64lines.medium.com/building-a-aws-lambda-function-to-run-aws-redshift-sql-scripts-in-python-7468b7c2fdea
I'd start simple if you can and work up to Redshift Data API and Step functions.
You can automate the insertion of data from S3 with a scheduled Lambda that triggers periodically. This might be a better solution than invoking a Lambda on every object upload, especially if you are receiving lots of files continuously.
I have a transactional table that I load from SQL server to Amazon S3 S3 using AWS DMS. For handling updates I move the old files to archive and then process only the incremental records everytime.
This is fine when I have only insert operation in my database. But problem comes when we need to accomodate the updates. Right now for doing any updates we read the entire S3 file and make changes to the records which are updated as a part of incremental load. Then as the data keeps on increasing the process of reading the entire file in S3 bucket and updating it would take more time and in future the job might not be able to end in time (Considering that job needs to end in 1 hour).
This can be handled using Databricks where we can use Delta Table to update the records and finally overwrite the existing file. But Databricks is bit expensive.
How do we handle the same using AWS Glue?
I'm working on a task of copying csv files from s3 bucket to redshift. I've found multiple ways to do so but I'm not sure which one will be the best possible way to do it. Here's the scenario:
On regular intervals, multiple CSV files of size around 500 MB - 1 GB, will be added to my s3 bucket. The data can contain duplicates. The task is to copy the data to redshift table while ensuring that the duplicate data is not present in redshift.
Here are the ways I found which can be used:
Create a AWS Lambda function which will be triggered whenever a file is added to s3 bucket.
Use AWS Kinesis
Use AWS Glue
I understand Lambda should not be used for jobs that takes more than 5 minutes. So should I use it or just eliminate this option?
Kinesis can handle large amount of data but is it the best way to do it?
I'm not familiar with Glue and Kinesis. But I read that Glue can be slow.
If anyone can point me to the right direction, it will be really helpful.
You can definitely make it work with Lambda, if you leverage StepFunctions and the S3 Select option to filter subsets of data into smaller chunks. You'd have your Step Functions manage your ETL orchestration wherein you execute your lambdas that selectively pull from the large data file via the S3 select option. Your pre-process state--see links below--could be used to determine execution requirements, then execute multiple Lambdas, even in parallel, if you wish. Those lambdas would process the subsets of data to remove dups and perform any other ETL operations you might require. Then, you'd take the processed data and write to Redshift. Here are links that will help you put that architecture together:
Trigger State Machine Execution from S3 Event
Manage Lambda Processing Executions and workflow state
Use S3 Select to pull subsets from large data objects
Also, here's a link to a Python ETL pipeline example for the CDK that I built. You'll see an example of an S3 event-driven lambda along with data processing and DDB or MySQL writes. Will give you an idea as to how you can build out comprehensive Lambdas for ETL operations. You would just need to add a psycopg2 layer to your deployment for Redshift. Hope this helps.
I'm doing designing some ETL data pipelines with Airflow. Data transformations is done by provisioning an AWS EMR Spark cluster and sending its some jobs. The jobs read data from S3, process them and write them back to S3 using date as a partition.
For my last step, I need to load the S3 data to a datawarehouse using SQL scripts that are submitted to Redshift using Python script, however I cannot find a clean way to get retrieve which data need to be loaded, ie. which date partitions have been generated during Spark transformations (can only be known during the execution of the job and not beforehand).
Note that everything is orchestrated through a Python script using boto3 library that is run from a corporate VM that cannot be accessed from outside.
What would be the best way to fetch this information from EMR?
For now I'm thinking about different solutions:
- Write the information into a log file. Get the data from Spark master node using SSH through Python script
- Write the information to an S3 file
- Write the information to a database (RDS?)
I'm struggling to determine what are the pros and the cons of these solutions. I'm also wondering what would be the best way to inform that data transformations is over and that metadata can be fetched.
Thanks in advance
The most straightforward is to use S3 as your temporary storage. After finishing your Spark execution (Writing result to S3), you can add one more step writing data to S3 bucket which you want to get in next step.
The approach with RDS should be the similar to S3, but it requires more implementations than S3. You need to setup RDS, maintain Schema, implementation to work with RDS...
With S3 tmp file, after EMR terminated and AF running next step, using Boto to fetch that tmp file (S3 Path depends on your requirement) and that is it.
I have a use case wherein I want to take a data from DynamoDB and do some transformation on the data. After this I want to create 3 csv files (there will be 3 transformations on the same data) and dump them to 3 different s3 locations.
My architecture would be sort of following:
Is it possible to do so? I can't seem to find any documentation regarding it. If it's not possible using pipeline, are there any other services which could help me with my use case?
These dumps will be scheduled daily. My other consideration was using aws lamda. But according to my understanding, it's event based triggered rather time based scheduling, is that correct?
Yes it is possible but not using HiveActivity instead EMRActivity. If you look into Data pipeline documentation for HiveActivity, it clearly states its purpose and not suits your use case:
Runs a Hive query on an EMR cluster. HiveActivity makes it easier to set up an Amazon EMR activity and automatically creates Hive tables based on input data coming in from either Amazon S3 or Amazon RDS. All you need to specify is the HiveQL to run on the source data. AWS Data Pipeline automatically creates Hive tables with ${input1}, ${input2}, and so on, based on the input fields in the HiveActivity object.
Below is how your data pipeline should look like. There is also a inbuilt template Export DynamoDB table to S3 in UI for AWS Data Pipeline which creates the basic structure for you, and then you can extend/customize to suit your requirements.
To your next question using Lambda, Of course lambda can be configured to have event based triggering or schedule based triggering, but I wouldn't recommend using AWS Lambda for any ETL operations as they are time bound & usual ETLs are longer than lambda time limits.
AWS has specific optimized feature offerings for ETLs, AWS Data Pipeline & AWS Glue, I would always recommend to choose between one of two. In case your ETL involves data sources not managed within AWS compute and storage services OR any speciality use case which can't be sufficed by above two options, then AWS Batch will be my next consideration.
Thanks amith for your answer. I have been busy for quite some time now. I did some digging after you posted your answer. Turns out we can dump the data to different s3 locations using Hive activity as well.
This is how the data pipeline would like in that case.
But I believe writing multiple hive activities, when your input source is DynamoDB table, is not a good idea since hive doesn't load any data in memory. It does all the computations on the actual table which could deteriorate the performance of the table. Even documentation suggests to export the data incase you need to make multiple queries to same data. Reference
Enter a Hive command that maps a table in the Hive application to the data in DynamoDB. This table acts as a reference to the data stored in Amazon DynamoDB; the data is not stored locally in Hive and any queries using this table run against the live data in DynamoDB, consuming the table’s read or write capacity every time a command is run. If you expect to run multiple Hive commands against the same dataset, consider exporting it first.
In my case I needed to perform different type of aggregations on the same data once a day. Since dynamoDB doesn't support aggregations, I turned to Data pipeline using Hive. In the end we ended up using AWS Aurora which is My-SQL based.