What is the difference between AWS Glue ETL Job and AWS EMR? - amazon-web-services

If I had to perform ETL on a huge dataset (say 1 TB) stored in S3 as CSV files, both AWS Glue ETL jobs and AWS EMR steps could be used. How, then, is AWS Glue different from AWS EMR, and which is the better solution in this case?

Most of the differences are already listed, so I'll focus on use-case specifics.
When to choose AWS Glue:
Data size is huge but structured, i.e. it is tabular and in a known format (CSV, Parquet, ORC, JSON).
Lineage is required: if you need the data lineage graph while developing your ETL job, prefer developing it with the Glue native libraries.
The developers don't need to tweak performance parameters such as the number of executors, per-executor memory and so on.
You don't want the overhead of managing a large cluster, and you want to pay only for what you use.
When to use EMR:
Data is huge but semi-structured or unstructured, so you can't benefit from the Glue catalog.
You care only about the outputs, and lineage is not required.
You need to define more memory per executor, depending on the type of job and its requirements.
You can manage the cluster easily, or you have many jobs that can run concurrently on the cluster, saving you money.
For structured data, you should use EMR when you want more Hadoop capabilities such as Hive or Presto for further analytics.
So it depends on what your use case is. Both are great services.

Glue allows you to submit ETL scripts directly in PySpark/Python/Scala, without the need for managing an EMR cluster. All setup/tear-down of infrastructure is managed.
There are also a few other managed components, like Crawlers and the Glue Data Catalog, which make it easier to work on your data.
You could use either for your use case. Glue would be faster; however, you may not have the flexibility you get with EMR.

Glue uses EMR under the hood. This is evident when you ssh into the driver of your Glue dev-endpoint.
Since Glue is a managed Spark environment (or, say, a managed EMR environment), it comes with reduced flexibility. The types of workers that you can choose are limited. The number of language libraries that you can use in your Spark code is limited. Glue did not support packages like pandas and numpy until recently. Apps like Presto can't be integrated with Glue, although Athena is a good alternative to a separate Presto installation.
The main issue, however, is that Glue jobs have a cold start time of anywhere between 1 and 15 minutes.
EMR is a good choice for exploratory data analysis but for a production environment with CI/CD, Glue seems to be the better choice.
EDIT - Glue jobs no longer have a cold start wait time

From the AWS Glue FAQ:
AWS Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. AWS Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs.
Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.
Source: https://aws.amazon.com/glue/faqs/

AWS Glue is an ETL service from AWS. AWS Glue will generate ETL code in Scala or Python to extract data from the source, transform the data to match the target schema, and load it into the target.
AWS EMR is a service where you can process large amounts of data; it is a big data platform. It supports Hadoop, Spark, Flink, Presto, Hive, etc. You can spin up EC2 instances with the software listed above and build a similar ecosystem.
In your case, you want to process 1 TB of data. If you want to do computations on the same data, you can use EMR, and if you want to run analytics on the transformed data, use Glue.

The following is something I compiled after working on analytics projects (a lot of it depends on the use case, but generally speaking):
| Criteria | Glue | EMR |
| --- | --- | --- |
| Costs | Comparatively costlier | Much cheaper (due to Spot Instance functionality; there have been cases with savings of up to 50% over Glue costs, and even more depending on the use case) |
| Orchestration | Built in (Glue Workflows & Triggers) | Through CloudWatch triggers & Step Functions |
| Infra work required | No infra setup; select a worker type. However, roles & permissions are needed | Identify the type of node needed, set up autoscaling rules, etc. |
| Cluster resiliency & robustness | Highly resilient (AWS managed) | If Spot Instances are used, interruptions might occur with a 2-minute notification (though the system recovers automatically; e.g. job times might lengthen) |
| Skill sets needed | PySpark & intermediate AWS knowledge | DevOps to set up & manage EMR; intermediate knowledge of orchestration via CloudWatch & Step Functions; PySpark |
| Applicable use cases | Attractive option when: 1. You are not worried about costs but need highly resilient infra. 2. Batch setups where the job completes in a fixed time. 3. Short real-time streaming jobs that need to run for, say, some hours during the day. | 1. The use case is volatile clusters, mostly used for batch processing (day-minus scenarios), making it a cost-effective solution for batch jobs. 2. Attractive option for 24/7 Spark Streaming programs. 3. You need a Hadoop ecosystem & related tools (like HDFS, Hive, Hue, Impala, etc.). 4. You need to run Flink programs, etc. 5. You need control over the infra & its tuning parameters. |
Also, going back to the OP's use case of processing 1 TB of data: if it's one-time processing, Glue should suffice; if it's a once-daily batch, both EMR and Glue will be good (depending on how the job is tuned, Glue can be an attractive option); if it's a job that runs multiple times a day, then EMR is the better option (considering the balance of performance and cost).

Related

Joining many large files on AWS

I am looking for advice on which service I should use. I am new to big data and confused by the differences between the services on AWS.
Use case:
I receive 60-100 CSV files daily (each one can be from a few MB to a few GB). There are six corresponding schemas, and each file can be treated as part of only one table.
I need to load those files into the six database tables, execute joins between them and generate a daily output. After the output is generated, the data in the database is no longer needed, so we can truncate those tables and wait for the next day.
Files have predictable naming patterns:
A_<timestamp1>.csv goes to A table
A_<timestamp2>.csv goes to A table
B_<timestamp1>.csv goes to B table
etc ...
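The naming convention above can be turned into a small routing step that decides which table each incoming file belongs to. This is only a minimal sketch: it assumes the table name is whatever precedes the first underscore and that the timestamp format is unknown, and `table_for_file` and `group_by_table` are hypothetical helper names, not part of any AWS SDK.

```python
import re
from collections import defaultdict

# Assumed pattern: <TABLE>_<timestamp>.csv, e.g. "A_20200101.csv".
# The exact timestamp format is not given in the question, so anything
# between the first underscore and ".csv" is accepted.
FILE_PATTERN = re.compile(r"^(?P<table>[A-Za-z0-9]+)_(?P<timestamp>.+)\.csv$")


def table_for_file(filename):
    """Return the table name encoded in a file name, or None if it does not match."""
    match = FILE_PATTERN.match(filename)
    return match.group("table") if match else None


def group_by_table(filenames):
    """Group file names by target table, ignoring files that do not match."""
    groups = defaultdict(list)
    for name in filenames:
        table = table_for_file(name)
        if table is not None:
            groups[table].append(name)
    return dict(groups)
```

Whichever service ends up doing the joins, a grouping like this gives one list of input files per table, which is the shape that a bulk load or a Spark read of several paths would expect.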
Which service could be used for that purpose?
AWS Redshift (execute here joins)
AWS Glue (load to redshift)
AWS EMR (spark)
or maybe something else? I heard that spark could be used to do the joins, but what is the proper, optimal and performant way of doing that?
Edit:
Thanks for the responses. I see two options for now:
Use AWS Glue: set up 6 crawlers which, on trigger, will load the files into specific AWS Glue Data Catalog tables, then execute the SQL joins with Athena.
Use AWS Glue: set up 6 crawlers which, on trigger, will load the files into specific AWS Glue Data Catalog tables, then trigger a Spark job (AWS Glue in serverless form) to do the SQL joins and write the output to S3.
Edit 2:
But according to https://carbonrmp.com/knowledge-hub/tech-engineering/athena-vs-spark-lessons-from-implementing-a-fully-managed-query-system/:
Presto is designed for low latency and uses a massively parallel processing (MPP) approach which is fast but requires everything to happen at once and in memory. It’s all or nothing, if you run out of memory, then “Query exhausted resources at this scale factor”. Spark is designed for scalability and follows a map-reduce design [1]. The job is split and processed in chunks, which are generally processed in batches. If you double the workload without changing the resource, it should take twice as long instead of failing [2]
So Athena (aka Presto) is not as scalable as I need. I've seen "Query exhausted resources at this scale factor" in my case.
Is there any possibility of changing the file type to a columnar format like Parquet? Then you can use AWS EMR, and Spark should be able to handle the joins easily. Obviously, you need to optimize the query depending on the data size, cluster size, etc.

ML Pipeline on AWS SageMaker: How to create long-running query/preprocessing tasks

I'm a software engineer transitioning toward machine learning engineering, but need some assistance.
I'm currently using AWS Lambda and Step Functions to run query and preprocessing jobs for my ML pipeline, but am constrained by Lambda's 15-minute runtime limit.
We're a strictly AWS shop, so I'm kind of stuck with SageMaker and other AWS tools for the time being. Later on we'll consider experimenting with something like Kubeflow if it looks advantageous enough.
My current process
I have my data scientists write Python scripts (in a git repo) for the query and preprocessing steps of a model, and deploy them (via Terraform) as Lambda functions, then use Step Functions to sequence the ML pipeline steps as a DAG (query -> preprocess -> train -> deploy).
The Query Lambda pulls data from our data warehouse (Redshift) and writes the unprocessed dataset to S3.
The Preprocessing Lambda loads the unprocessed dataset from S3, manipulates it as needed, and writes it as training & validation datasets to a different S3 location.
The Train and Deploy tasks use the SageMaker Python API to train and deploy the models as SageMaker Endpoints.
Do I need to be using Glue and SageMaker Processing jobs? From what I can tell, Glue seems more targeted towards ETLs than for writing to S3, and SageMaker Processing jobs seem a bit more complex to deploy to than Lambda.
There is a solution that just came out for long-running actions in Redshift: the Redshift Data API. https://aws.amazon.com/about-aws/whats-new/2020/09/announcing-data-api-for-amazon-redshift/
This allows Lambdas in a Step Function to issue SQL statements to Redshift and poll to see when the SQL is done. The run time of your Lambda is then only as long as it takes to launch the SQL.
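The submit-then-poll pattern could look roughly like the sketch below. It assumes `client` is a boto3 Redshift Data API client (`boto3.client("redshift-data")`) and uses its `execute_statement` and `describe_statement` calls. The function names, the returned dictionary shape, and the idea of looping via a Wait/Choice pair in the state machine are illustrative assumptions rather than the only way to wire this up.

```python
# Statuses after which a statement will not change any more.
TERMINAL_STATUSES = {"FINISHED", "FAILED", "ABORTED"}


def submit_query(client, cluster_id, database, db_user, sql):
    """Start the SQL asynchronously and return the statement id right away."""
    response = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )
    return {"statement_id": response["Id"]}


def check_query(client, statement_id):
    """Report the statement status so the state machine can decide whether to wait again."""
    description = client.describe_statement(Id=statement_id)
    status = description["Status"]
    return {
        "statement_id": statement_id,
        "status": status,
        "done": status in TERMINAL_STATUSES,
        # "Error" is only present when the statement failed.
        "error": description.get("Error"),
    }
```

Each function maps to one short Lambda invocation: the first returns as soon as Redshift accepts the statement, and the second is called repeatedly until `done` is true, so no single invocation comes near the 15-minute limit.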
As for the processing steps, I'd recommend doing as much of the processing as possible inside Redshift before unloading the data to S3 (I hope you are not pulling lots of data through a SELECT statement). This will be much faster than processing in Lambda and can benefit from the Data API as well. There will likely be some processing steps that you cannot do in Redshift, and Lambda is a good option for those. One additional benefit of UNLOAD is that you can set the maximum output file size. This way you can launch a Lambda per output file, and you then have many shorter-running Lambdas.
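To illustrate the file-size point, here is a hedged sketch of a helper that builds an UNLOAD statement with a `MAXFILESIZE` cap. `build_unload_sql` is a hypothetical name, the CSV output format is just one possible choice, and the bounds check approximates the documented 5 MB to 6.2 GB range as 5 to 6200 MB.

```python
def build_unload_sql(select_sql, s3_prefix, iam_role_arn, max_file_size_mb=100):
    """Build a Redshift UNLOAD statement that caps the size of each output file."""
    # Assumption: the documented MAXFILESIZE range (5 MB to 6.2 GB) is
    # approximated here as 5 to 6200 MB.
    if not 5 <= max_file_size_mb <= 6200:
        raise ValueError("max_file_size_mb must be between 5 and 6200")
    # UNLOAD takes the query as a quoted string, so embedded single quotes are doubled.
    escaped_select = select_sql.replace("'", "''")
    return (
        f"UNLOAD ('{escaped_select}') "
        f"TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"FORMAT AS CSV "
        f"MAXFILESIZE {max_file_size_mb} MB"
    )
```

The resulting string can be submitted like any other SQL statement. A smaller cap produces more, smaller files, which is what keeps each per-file Lambda short.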
You could attempt to break up the work and have many Lambdas running in series, but processing large amounts of data at once is not a strength of Lambda. Whether you can do this will depend on the data processing you are doing.
You could use Glue for this, but it is likely complete overkill: it is a whole new service to learn and, since it is an EMR wrapper, it can get costly. To be honest, Glue is not my favorite AWS service, as it only does the most basic things easily and anything even slightly complex becomes a battle. Still, if it is a tool you know and like, go for it.

AWS Glue as an ETL tool?

Why does AWS describe Glue as an ETL tool? We need to code everything to pull data; Glue provides no built-in functionality. Are there any benefits of using Glue instead of NiFi or some other ingestion tool?
Glue is a good ETL tool within AWS, especially for big data workloads. After all, it is running on Spark.
Glue does have the ability to produce some basic automated transformation code -> Move data from A to B and remap column names etc.
However, it's the flexibility to write custom code that really sets it apart. Using the Glue code editor, or the PyCharm IDE, you can script any transformations you need using PySpark and/or Scala.
The benefits of Glue are really gained when it is used in conjunction with other AWS services. The Glue Data Catalog is shared with Athena and even AWS EMR, so you end up with a central point for your big data ecosystem.
One limitation of Glue I have found is writing large datasets to MS SQL Server (10 million rows+). Glue uses JDBC drivers, and as of 2020 there is not yet a Microsoft JDBC connection that makes use of bulk copy. So, effectively, you are writing an insert statement for each row, and performance can suffer once you get into the tens of millions of rows.

ETL approaches to bulk load data in Cloud SQL

I need to ETL data into my Cloud SQL instance. This data comes from API calls. Currently, I'm running custom Java ETL code in Kubernetes with CronJobs that make requests to collect this data and load it into Cloud SQL. The problem comes with managing the ETL code and monitoring the ETL jobs. The current solution may not scale well when more ETL processes are incorporated. In this context, I need to use an ETL tool.
My Cloud SQL instance contains two types of tables: common transactional tables and tables that contain data that comes from the API. The second type is mostly read-only from an "operational database perspective", and a large part of those tables is bulk updated every hour (in batch) to discard the old data and refresh the values.
Considering this context, I noticed that Cloud Dataflow is the ETL tool provided by GCP. However, it seems that this tool is more suitable for big data applications that need to do complex transformations and ingest data in multiple formats. Also, in Dataflow, the data is processed in parallel and worker nodes are scaled as needed. Since Dataflow is a distributed system, the ETL process might incur overhead when allocating resources to do a simple bulk load. In addition to that, I noticed that Dataflow doesn't have a dedicated sink for Cloud SQL. This probably means that Dataflow isn't the correct tool for simple bulk load operations into a Cloud SQL database.
For my current needs, I only have to do simple transformations and bulk load the data. However, in the future, we might want to handle other sources of data (PNG, JSON, CSV files) and sinks (Cloud Storage and maybe BigQuery). Also, in the future, we might want to ingest streaming data and store it in Cloud SQL. In this sense, the underlying Apache Beam model is really interesting, since it offers a unified model for batch and streaming.
Given all this context, I can see two approaches:
1) Use an ETL tool like Talend in the Cloud to help monitoring ETL jobs and maintenance.
2) Use Cloud Dataflow, since we may need streaming capabilities and integration with all kinds of sources and sinks.
The problem with the first approach is that I may end up using Cloud Dataflow anyway when future requirements arrive, and that would be bad for my project in terms of infrastructure costs, since I would be paying for two tools.
The problem with the second approach is that Dataflow doesn't seem to be suitable for simple bulk-loading operations into a Cloud SQL database.
Is there something I am getting wrong here? Can someone enlighten me?
You can use Cloud Dataflow just for loading operations. Here is a tutorial on how to perform ETL operations with Dataflow. It uses BigQuery but you can adapt it to connect to your Cloud SQL or other JDBC sources.
More examples can be found on the official Google Cloud Platform github page for Dataflow analysis of user generated content.
You can also have a look at this GCP ETL architecture example that automates the tasks of extracting data from operational databases.
For simpler ETL operations, Dataprep is an easy tool to use and provides flow scheduling as well.

Comparing AWS [Athena, S3, Lambda ...] vs Hortonworks [HDFS, Hive, Oozie ...]

What are the advantages/disadvantages of using a 'plain' Hortonworks Hadoop cluster with components such as HDFS, Hive and Oozie versus services on AWS like S3/Athena/Lambda?
My scenario's data flow:
Source data comes from IoT sensors for analytics, and sometimes I need to query by deviceid & datetime with Hive/Athena (the data is partitioned on these conditions).
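As an illustration of the partitioning mentioned above, the sketch below builds a Hive-style `key=value` prefix, a layout that both Hive and Athena can map to partition columns. The partition names `deviceid` and `dt`, the day-level granularity, and the helper name `partition_prefix` are assumptions made for the example. The question does not state the actual layout.

```python
from datetime import datetime


def partition_prefix(base, device_id, event_time):
    """Return a Hive-style partition prefix like base/deviceid=<id>/dt=<YYYY-MM-DD>/."""
    day = event_time.strftime("%Y-%m-%d")
    return f"{base.rstrip('/')}/deviceid={device_id}/dt={day}/"
```

With such a layout, a query that filters on both partition columns only has to read the objects under the matching prefixes, whether the files sit in HDFS or in S3.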
The disadvantages of installing Hadoop yourself on any cloud provider are obviously cost and a little bit of maintenance.
For example, when the HDFS disks get full, you have to add more volumes. You need to upgrade and patch software yourself. You're charged for every machine hour, for every machine, and turning off just the namenode of the cluster will render it unusable for a period of time. If you do not have any business use case for running the cluster overnight, you're wasting money.
Therefore, the advantages of storing data in the cloud are:
While slower than HDFS, object storage in S3 is significantly cheaper and more scalable.
Triggering actions via Lambda or another scheduler can actually happen faster than Oozie launching a YARN job. Your code isn't tied to Hadoop, either, so your functions should be able to be smaller, although you may be limited in language options. If you combine Lambda or other filesystem triggers with container schedulers like Kubernetes, you can open up lots of options.
Querying your data any time you want with tools like AWS Glue and Athena decouples the maintenance of a Hive metastore from that of a compatible query engine, whether that's Hive, Presto, Impala, Drill, etc. Anyone with AWS access can run an Athena query without needing to know the address of your HiveServer and how to connect to it appropriately (for example, you should secure it and make it highly available).