ETL architecture with AWS Glue and Data Pipeline - amazon-web-services

I'm trying to decide whether to use AWS Glue or AWS Data Pipeline for our ETL. I need to incrementally copy several tables to Redshift. Almost all tables need to be copied with no transformation. One table requires a transformation that could be done using Spark.
Based on my understanding of these two services, the best solution is to use a combination of the two. Data Pipeline can copy everything to S3. From there, if no transformation is needed, Data Pipeline can use Redshift COPY to move the data to Redshift. Where a transformation is required, a Glue job can apply the transformation and copy the data to Redshift.
Is this a sensible strategy or am I misunderstanding the applications of these services?

I'm guessing it's long past the project deadline, but for people looking at this:
Use only AWS Glue. You can define Redshift as both a source and a target connection, meaning that you can read from it and write into it. Before you do that, however, you'll need to use a Crawler to create the Glue-specific schema.
All of this can also be done with Data Pipeline alone, using SqlActivity objects, although setting everything up might take significantly longer and would not be that much cheaper.
rant: I'm honestly surprised that AWS focused solely on big data solutions without providing a decent tool for small/medium/large data sets. Glue is overkill and Data Pipeline is cumbersome/terrible to use. There should be a simple SQL-type Lambda!
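For reference, the split described in the question (plain Redshift COPY for untransformed tables, a Glue job for the one that needs Spark) might be sketched as below. This is only a planning sketch under stated assumptions: `TableSpec`, `plan_loads`, the bucket layout and the IAM role ARN are all hypothetical names, the files are assumed to be staged as CSV, and nothing here talks to AWS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableSpec:
    """Hypothetical description of one table to load."""
    name: str
    needs_transform: bool = False


def copy_statement(table: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift COPY statement for CSV files staged under s3_prefix/table/."""
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}/{table}/' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV;"
    )


def plan_loads(tables, s3_prefix, iam_role):
    """Route each table to either a direct COPY or a (separately defined) Glue job."""
    plan = []
    for t in tables:
        if t.needs_transform:
            # The Glue job itself is assumed to exist elsewhere; only record the routing.
            plan.append((t.name, "glue_job", None))
        else:
            plan.append((t.name, "redshift_copy", copy_statement(t.name, s3_prefix, iam_role)))
    return plan
```

The generated COPY text could then be run by whatever executes SQL against the cluster, for example a SqlActivity as mentioned above.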

Related

AWS Glue as an ETL tool?

Why does AWS describe Glue as an ETL tool? We need to code everything to pull data; Glue provides no built-in functionality for it. Are there any benefits to using Glue instead of NiFi or some other ingestion tool?
Glue is a good ETL tool within AWS, especially for big data workloads. After all, it runs on Spark.
Glue can generate some basic transformation code automatically, e.g. move data from A to B and remap column names.
However, it's the flexibility to write custom code that really sets it apart. Using the Glue code editor, or the PyCharm IDE, you can script any transformations you need using PySpark and/or Scala.
The benefits of Glue are really gained when it is used in conjunction with other AWS services. The Glue Data Catalog is shared with Athena and even Amazon EMR, so you end up with a central point for your big data ecosystem.
One limitation of Glue I have found is writing large datasets (10 million rows or more) to MS SQL Server. Glue uses JDBC drivers and, as of 2020, there is yet to be a Microsoft JDBC connection that avails of bulk copy. So, effectively, you are writing an insert statement for each row, and performance can suffer once you get into the tens of millions of rows.

Amazon EMR vs Amazon Redshift

For the majority of use cases, Spark transformations can be done on streaming data or bounded data (say, from Amazon S3) using Amazon EMR, and the transformed data can then be written back to S3.
The transformations can also be achieved in Amazon Redshift by loading the different data sets from S3 into different Redshift tables, and then loading the data from those tables into a final table. (Now, with Redshift Spectrum, we can also select and transform data directly from S3.)
With that said, I see that the transformations can be done in both EMR and Redshift, with Redshift loads and transformations needing less development time.
So, should EMR be used mainly for use cases involving streaming/unbounded data? For what other use cases is EMR preferable? (I am aware that Spark provides other core, SQL and ML libraries as well.) For transformations alone (involving joins/reducers), I don't see a use case for EMR other than streaming, when the same transformation can also be achieved in Redshift.
Please provide use cases showing when to use EMR transformations vs Redshift transformations.
In the first instance I prefer to use Redshift for transformations, because:
- Development is easier: SQL rather than Spark
- Maintenance/monitoring is easier
- Infrastructure costs are lower, assuming you can run during "off-peak" times
Sometimes EMR is a better option. I would consider it in these circumstances:
- When you want to have both raw and transformed data on S3, e.g. a "data lake" strategy
- Complex transformations are required. Some transformations are just not possible using Redshift, such as:
  - managing complex and large JSON columns
  - pivoting data dynamically (variable number of attributes)
- Third-party libraries are required
- Data sizes are so large that a much bigger Redshift cluster would be needed to process the transformations
There are options other than Redshift and EMR, and these should also be considered. For example:
- Standard Python or another scripting language, to:
  - create dynamic transformation SQL, which can be run in Redshift
  - process CSV to Parquet or similar
  - handle scheduling (e.g. Airflow)
- Amazon Athena:
  - can be used with S3 (e.g. Parquet) input and output
  - uses SQL (so some advantages in development time) with Presto syntax, which in some cases is more powerful than Redshift SQL
  - can have significant cost benefits, as there are no permanent infrastructure costs; you pay per use
- AWS Batch and AWS Lambda should also be considered.
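As an illustration of the "create dynamic transformation SQL" option, here is a minimal sketch that generates a pivot query for a variable list of attributes using plain `MAX(CASE ...)` expressions. The function name and the table/column names in the usage are hypothetical, the attribute list is assumed to have been fetched beforehand (for example with a `SELECT DISTINCT`), and names are restricted to simple identifiers so the generated text stays safe to run.

```python
import re

# Only allow simple identifiers, since the names are spliced into SQL text.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_pivot_sql(table, key_col, attr_col, value_col, attributes):
    """Return a SELECT that turns one row per (key, attribute) into one column per attribute."""
    if not attributes:
        raise ValueError("at least one attribute is required")
    for name in (table, key_col, attr_col, value_col, *attributes):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    pivot_cols = ",\n".join(
        f"    MAX(CASE WHEN {attr_col} = '{a}' THEN {value_col} END) AS {a}"
        for a in attributes
    )
    return (
        f"SELECT\n    {key_col},\n{pivot_cols}\n"
        f"FROM {table}\nGROUP BY {key_col};"
    )
```

Calling `build_pivot_sql("product_attrs", "product_id", "attr_name", "attr_value", ["color", "size"])` yields a query with one output column per attribute; rerunning the generator when new attributes appear is what makes the pivot "dynamic".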

Copy data from PostgreSQL to S3 using AWS Data Pipeline

I am trying to copy all the tables from a schema (PostgreSQL, 50+ tables) to Amazon S3.
What is the best way to do this? I am able to create 50 different copy activities, but is there a simple way to copy all tables in a schema or write one pipeline and loop?
I think the old method is:
1. Unload your data from PostgreSQL to a CSV file first, using something like psql
2. Then just copy the CSV to S3
But AWS gives you a script to do this, RDSToS3CopyActivity. See this link from AWS
Since you have a large number of tables, I would recommend using AWS Glue rather than AWS Data Pipeline. Glue is easily configurable, with crawlers etc. that give you the flexibility to choose columns, define etc. Moreover, the underlying jobs in AWS Glue are PySpark jobs that scale really well, giving you really good performance.
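If you do stay with Data Pipeline, one way to avoid hand-writing 50 copy activities is to generate the pipeline definition from the table list. The sketch below is only that: the object types (`SqlDataNode`, `S3DataNode`, `CopyActivity`) and field names follow my reading of the pipeline definition format and should be checked against the documentation, while `SourceDatabase`, `Ec2Instance`, the bucket name and the `export/` prefix are hypothetical placeholders for objects you would define yourself.

```python
import json


def copy_objects_for(table, bucket):
    """Return the three pipeline objects needed to copy one table to S3."""
    return [
        {
            "id": f"Src_{table}",
            "name": f"Src_{table}",
            "type": "SqlDataNode",
            "table": table,
            "selectQuery": f"select * from {table}",
            "database": {"ref": "SourceDatabase"},  # assumed to be defined elsewhere
        },
        {
            "id": f"Dst_{table}",
            "name": f"Dst_{table}",
            "type": "S3DataNode",
            "directoryPath": f"s3://{bucket}/export/{table}/",
        },
        {
            "id": f"Copy_{table}",
            "name": f"Copy_{table}",
            "type": "CopyActivity",
            "input": {"ref": f"Src_{table}"},
            "output": {"ref": f"Dst_{table}"},
            "runsOn": {"ref": "Ec2Instance"},  # assumed to be defined elsewhere
        },
    ]


def build_definition(tables, bucket):
    """Assemble one definition containing a copy activity per table."""
    objects = []
    for table in tables:
        objects.extend(copy_objects_for(table, bucket))
    return {"objects": objects}


if __name__ == "__main__":
    print(json.dumps(build_definition(["customers", "orders"], "example-bucket"), indent=2))
```

The table list itself could come from a query against `information_schema.tables` for the schema in question.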

AWS data pipeline: dump data to 3 s3 nodes

I have a use case where I want to take data from DynamoDB and apply some transformations to it. After this I want to create 3 CSV files (there will be 3 transformations on the same data) and dump them to 3 different S3 locations.
My architecture would be something like the following:
Is it possible to do so? I can't seem to find any documentation regarding it. If it's not possible using Data Pipeline, are there any other services which could help me with my use case?
These dumps will be scheduled daily. My other consideration was using AWS Lambda. But according to my understanding, it is triggered by events rather than scheduled by time. Is that correct?
Yes, it is possible, but with EmrActivity rather than HiveActivity. If you look at the Data Pipeline documentation for HiveActivity, it clearly states its purpose, which does not suit your use case:
Runs a Hive query on an EMR cluster. HiveActivity makes it easier to set up an Amazon EMR activity and automatically creates Hive tables based on input data coming in from either Amazon S3 or Amazon RDS. All you need to specify is the HiveQL to run on the source data. AWS Data Pipeline automatically creates Hive tables with ${input1}, ${input2}, and so on, based on the input fields in the HiveActivity object.
Below is how your data pipeline should look. There is also a built-in template in the AWS Data Pipeline UI, Export DynamoDB table to S3, which creates the basic structure for you; you can then extend or customize it to suit your requirements.
To your next question about Lambda: of course Lambda can be configured with event-based or schedule-based triggering, but I wouldn't recommend AWS Lambda for ETL operations, as it is time-bound and typical ETL jobs run longer than Lambda's time limit.
AWS has offerings optimized specifically for ETL, AWS Data Pipeline and AWS Glue, and I would always recommend choosing one of the two. If your ETL involves data sources not managed within AWS compute and storage services, or any speciality use case that can't be served by those two options, then AWS Batch would be my next consideration.
Thanks amith for your answer. I have been busy for quite some time now. I did some digging after you posted your answer. It turns out we can dump the data to different S3 locations using HiveActivity as well.
This is how the data pipeline would look in that case.
But I believe writing multiple Hive activities when your input source is a DynamoDB table is not a good idea, since Hive doesn't load any data into memory. It does all the computations on the actual table, which could degrade the table's performance. Even the documentation suggests exporting the data if you need to run multiple queries against the same data. Reference
Enter a Hive command that maps a table in the Hive application to the data in DynamoDB. This table acts as a reference to the data stored in Amazon DynamoDB; the data is not stored locally in Hive and any queries using this table run against the live data in DynamoDB, consuming the table’s read or write capacity every time a command is run. If you expect to run multiple Hive commands against the same dataset, consider exporting it first.
In my case I needed to perform different types of aggregations on the same data once a day. Since DynamoDB doesn't support aggregations, I turned to Data Pipeline using Hive. In the end we used Amazon Aurora, which is MySQL-based.
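To make the "export first, then query the export" advice concrete, here is a minimal sketch that takes one exported snapshot and derives three CSV outputs from it, so the live table is only read once. Everything specific is an assumption for illustration: the field names (`customer`, `region`, `amount`), the three aggregations, and the output keys standing in for the three S3 locations. The upload step is left out on purpose.

```python
import csv
import io
from collections import defaultdict


def to_csv(header, rows):
    """Serialise a header and rows to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def total_by(items, key, value):
    """Sum `value` per distinct `key`, sorted by key."""
    totals = defaultdict(int)
    for item in items:
        totals[item[key]] += item[value]
    return sorted(totals.items())


def count_by(items, key):
    """Count items per distinct `key`, sorted by key."""
    counts = defaultdict(int)
    for item in items:
        counts[item[key]] += 1
    return sorted(counts.items())


def build_outputs(items):
    """Run three aggregations over the same exported snapshot."""
    return {
        "totals_by_region.csv": to_csv(["region", "total"], total_by(items, "region", "amount")),
        "totals_by_customer.csv": to_csv(["customer", "total"], total_by(items, "customer", "amount")),
        "orders_per_customer.csv": to_csv(["customer", "orders"], count_by(items, "customer")),
    }
```

Each value in the returned dictionary would then be written to its own S3 location by whatever runs the job on the daily schedule.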

Need strategy advice for migrating large tables from RDS to DynamoDB

We have a couple of MySQL tables in RDS that are huge (over 700 GB) and that we'd like to migrate to a DynamoDB table. Can you suggest a strategy, or a direction, for doing this in a clean, parallelized way? Perhaps using EMR or AWS Data Pipeline.
You can use AWS Data Pipeline. There are two basic templates, one for moving RDS tables to S3 and the second for importing data from S3 to DynamoDB. You can create your own pipeline using both templates.
One thing to consider with such large data is whether DynamoDB is the best option.
If this is statistical data or otherwise "big data", check out Amazon Redshift, which might be better suited to your situation.
We have done similar work, and there is probably a better strategy for this: using AWS DMS and some prep tables within your source instance.
It involves two steps:
1. Create new tables within your source instance that exactly match the DynamoDB schema, e.g. by merging multiple tables into one.
2. Set up a DMS task with the prep tables as the source and DynamoDB as the target. Since the prep tables and the target schema now match, it should be pretty straightforward from this point.