AWS Data Pipeline vs Step Functions

I am working on a problem where we intend to perform multiple transformations on data using EMR (SparkSQL).
After going through the documentation of AWS Data Pipeline and AWS Step Functions, I am slightly confused as to what use case each tries to solve. I looked around but did not find an authoritative comparison between the two. There are multiple resources that show how I can use either to schedule and trigger Spark jobs on an EMR cluster.
Which one should I use for scheduling and orchestrating my processing EMR jobs?
More generally, in what situation would one be a better choice over the other as far as ETL/data processing is concerned?

Yes, there are many ways to achieve the same thing, and the difference is in the details and in your use case. I am even going to offer yet one more alternative :)
If you are doing a sequence of transformations and all of them run on an EMR cluster, maybe all you need is either to create the cluster with the steps defined up front, or to submit the steps to the cluster via the API. Steps will execute in order on your cluster.
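As a rough sketch of that second option (assuming boto3 and an already-running cluster; the cluster ID, bucket, and script names below are hypothetical placeholders):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Submit two Spark steps to an existing cluster; EMR executes them in order.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
    Steps=[
        {
            "Name": "transform-step-1",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/jobs/transform1.py"],
            },
        },
        {
            "Name": "transform-step-2",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/jobs/transform2.py"],
            },
        },
    ],
)
print(response["StepIds"])
```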
If you have different sources of data, or you want to handle more complex scenarios, then both AWS Data Pipeline and AWS Step Functions would work. AWS Step Functions is a generic way of implementing workflows, while Data Pipeline is a workflow service specialized for working with data.
That means that Data Pipeline will be better integrated when it comes to dealing with data sources and outputs, and to working directly with tools like S3, EMR, DynamoDB, Redshift, or RDS. So for a pure data pipeline problem, chances are AWS Data Pipeline is a better candidate.
Having said that, AWS Data Pipeline is not very flexible. If the data source you need is not supported, or if you want to execute some activity which is not integrated, then you need to hack your way around with shell scripts.
On the other hand, AWS Step Functions is not specialized, and it has good integration with a number of AWS services and with AWS Lambda, meaning you can easily integrate with almost anything via serverless APIs.
So it really depends on what you need to achieve and the type of workload you have.

Related

ML Pipeline on AWS SageMaker: How to create long-running query/preprocessing tasks

I'm a software engineer transitioning toward machine learning engineering, but need some assistance.
I'm currently using AWS Lambda and Step Functions to run query and preprocessing jobs for my ML pipeline, but am constrained by Lambda's 15-minute runtime limit.
We're a strictly AWS shop, so I'm kind of stuck with SageMaker and other AWS tools for the time being. Later on we'll consider experimenting with something like Kubeflow if it looks advantageous enough.
My current process
I have my data scientists write Python scripts (in a git repo) for the query and preprocessing steps of a model, and deploy them (via Terraform) as Lambda functions, then use Step Functions to sequence the ML pipeline steps as a DAG (query -> preprocess -> train -> deploy); a sketch of such a definition follows the list below.
The Query lambda pulls data from our data warehouse (Redshift), and writes the unprocessed dataset to S3
The Preprocessing lambda loads the unprocessed dataset from S3, manipulates it as needed, and writes it as training & validation datasets to a different S3 location
The Train and Deploy tasks use the SageMaker Python API to train and deploy the models as SageMaker Endpoints
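For illustration only, a minimal sketch of what that DAG could look like as a Step Functions state machine; the function ARNs, state machine name, and role are hypothetical placeholders, not the actual pipeline:

```python
import json
import boto3

# Hypothetical Amazon States Language definition chaining the four Lambda tasks.
definition = {
    "Comment": "ML pipeline DAG (hypothetical sketch)",
    "StartAt": "Query",
    "States": {
        "Query": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:query",
            "Next": "Preprocess",
        },
        "Preprocess": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:preprocess",
            "Next": "Train",
        },
        "Train": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:train",
            "Next": "Deploy",
        },
        "Deploy": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:deploy",
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="ml-pipeline",  # hypothetical name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-role",  # hypothetical role
)
```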
Do I need to be using Glue and SageMaker Processing jobs? From what I can tell, Glue seems more targeted towards ETL than towards writing to S3, and SageMaker Processing jobs seem a bit more complex to deploy than Lambda.
There is a solution that just came out for long-running actions in Redshift - the Redshift Data API. https://aws.amazon.com/about-aws/whats-new/2020/09/announcing-data-api-for-amazon-redshift/
This allows Lambdas in a Step Function to issue a set of SQL statements to Redshift and poll to see when the SQL is done. Now the runtime of your Lambda is only as long as it needs to launch the SQL.
As for the processing steps - I'd recommend doing as much of the processing as possible inside Redshift before unloading the data to S3 (I hope you are not pulling lots of data through a SELECT statement). This will be much faster than processing in Lambda and can benefit from the Data API as well. There will likely be some processing steps that you cannot do in Redshift, and Lambda is a good option for those. One additional benefit of UNLOAD is that you can set the output file size; this way you can launch a Lambda per output file, and then you have many shorter-running Lambdas.
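A minimal sketch of that pattern, assuming the Redshift Data API via boto3; the cluster, database, user, query, and S3 path are hypothetical placeholders:

```python
import time
import boto3

rsd = boto3.client("redshift-data")

# Hypothetical UNLOAD that does the heavy lifting in Redshift and caps the
# output file size so a Lambda can be launched per output file.
sql = """
UNLOAD ('SELECT * FROM analytics.training_source WHERE ds = ''2020-09-01''')
TO 's3://my-bucket/unprocessed/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
FORMAT CSV
MAXFILESIZE 100 MB;
"""

resp = rsd.execute_statement(
    ClusterIdentifier="my-cluster",  # hypothetical cluster
    Database="analytics",            # hypothetical database
    DbUser="etl_user",               # hypothetical user
    Sql=sql,
)

# Poll until the statement finishes; in a Step Functions workflow this polling
# could live in a Wait/Choice loop instead of a single long-running Lambda.
while True:
    status = rsd.describe_statement(Id=resp["Id"])["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(10)
```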
You could attempt to break up the work and have many Lambdas running in series, but processing large amounts of data at once is not a strength of Lambda. Whether you can do this will depend on the data processing you are doing.
You could use Glue for this, but it is likely complete overkill: a whole new service to learn, and since it is a wrapper around EMR it can get costly. To be honest, Glue is not my favorite AWS service, as it only does the most basic things easily and anything even slightly complex becomes a battle. So if this is a tool you know and like, go for it.

Is AWS Lambda preferred over AWS Glue Job?

In an AWS Glue job, we can write a script and execute it as a job.
In AWS Lambda, too, we can write the same script and execute the same logic as in the job above.
So my query is not what the difference is between an AWS Glue job and AWS Lambda, but I am trying to understand when an AWS Glue job should be preferred over AWS Lambda, especially when both can do the same job. If both do the same job, then ideally I would blindly prefer using AWS Lambda itself, right?
Please try to understand my query.
Additional points:
Per this source, the Lambda FAQ, and the Glue FAQ:
Lambda can use a number of different languages (Node.js, Python, Go, Java, etc.) vs. Glue can only execute jobs using Scala or Python code.
Lambda can execute code from triggers by other services (SQS, Kafka, DynamoDB, Kinesis, CloudWatch, etc.) vs. Glue, which can be triggered by Lambda events, other Glue jobs, manually, or on a schedule.
Lambda runs much faster for smaller tasks vs. Glue jobs which take longer to initialize due to the fact that it's using distributed processing. That being said, Glue leverages its parallel processing to run large workloads faster than Lambda.
Lambda looks to require more complexity/code to integrate with data sources (Redshift, RDS, S3, DBs running on ECS instances, DynamoDB, etc.) while Glue can easily integrate with these. However, with the addition of Step Functions, multiple Lambda functions can be written and ordered sequentially to reduce complexity and improve modularity, where each function could integrate with an AWS service (Redshift, RDS, S3, DBs running on ECS instances, DynamoDB, etc.)
Glue looks to have a number of additional components, such as the Data Catalog, which is a central metadata repository to view your data; a flexible scheduler that handles dependency resolution, job monitoring, and retries; AWS Glue DataBrew for cleaning and normalizing data with a visual interface; AWS Glue Elastic Views for combining and replicating data across multiple data stores; and the AWS Glue Schema Registry to validate streaming data schemas.
There are other examples I am missing, so feel free to comment and I can update.
Lambda has a maximum lifetime of fifteen minutes. It can be used to trigger a Glue job as an event-based activity; that is, when a file lands in S3, for example, we can have an event trigger that runs a Glue job. Glue is a managed service for all data processing.
If the data volume is very low, maybe you can do it in Lambda, but if for some reason the process goes beyond fifteen minutes, the data processing would fail.
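A minimal sketch of that event-based trigger, assuming a Lambda wired to an S3 put event and an existing Glue job; the job name and argument key are hypothetical:

```python
import boto3

glue = boto3.client("glue")

def handler(event, context):
    # Pull the newly landed object out of the S3 event notification.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Start the (hypothetical) Glue job and hand it the new file's location.
    glue.start_job_run(
        JobName="my-etl-job",
        Arguments={"--input_path": f"s3://{bucket}/{key}"},
    )
```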
The answer to this can involve some foundational design decisions. What is this job doing? What kind of data are you dealing with? Is there a decision to be made whether the task should be executed in a batch or event oriented paradigm?
Batch
This may be necessary or desirable because the task:
Is being done over large monolithic data (e.g., binary).
Relies on context of multiple records in a dataset such that they must be loaded into a single job.
Depends on the order of records.
I feel like just as often I see batch handling chosen by default because "this is the way we've always done it," but breaking from this approach could be worth considering.
Glue is built for batch operations. With a current maximum execution time of 15 minutes and maximum memory of 10 GB, Lambda has become capable of processing fairly large datasets in a single execution as well. It can be difficult to pin down a direct cost comparison without specifics of the workload. When it comes to development, I feel that Lambda has the edge as far as tooling to build, test, and deploy.
Event
In the case where your data consists of a set of records, it might behoove you to parse and "stream" them into Lambda. Consider a flow like:
CSV lands in S3.
S3 event triggers Lambda.
Lambda reads and parses the CSV into discrete events, then submits them to another Lambda or publishes them to SNS for downstream processing. Concurrent instances of this Lambda can be employed to speed up ingest, where each instance is responsible for certain lines of the S3 object.
This pushes all logic and error handling, as well as the resources required, down to the individual event/record level. Often mechanisms such as dead-letter queues are employed for remediation. While the context of a given container persists across invocations - assuming the container has not been idle and torn down - Lambda should generally be considered stateless, such that the processing of an event/record is thought of as occurring within its own scope, outside that of others in the dataset.
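A minimal sketch of such a flow, assuming an S3-triggered Lambda that fans records out to an SNS topic; the bucket, key, and topic ARN are hypothetical and come from the event and environment:

```python
import csv
import io
import json
import os
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")

def handler(event, context):
    # Locate the CSV that landed in S3.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Parse the file and publish each row as a discrete event for
    # downstream, per-record processing.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    for row in csv.DictReader(io.StringIO(body)):
        sns.publish(
            TopicArn=os.environ["TOPIC_ARN"],  # hypothetical topic
            Message=json.dumps(row),
        )
```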

What is the difference between AWS Glue ETL Job and AWS EMR?

If I had to perform ETL on a huge dataset (say 1 TB) stored in S3 as CSV files, both an AWS Glue ETL job and AWS EMR steps could be used. How is AWS Glue different from AWS EMR, and which is the better solution in this case?
Most of the differences are already listed, so I'll focus more on use-case specifics.
When to choose AWS Glue
Data size is huge but structured, i.e. it is in a table structure and in a known format (CSV, Parquet, ORC, JSON).
Lineage is required: if you need the data lineage graph while developing your ETL job, prefer developing the ETL using Glue native libraries.
The developers don't need to tweak performance parameters like the number of executors, per-executor memory, and so on.
You don't want the overhead of managing a large cluster, and you want to pay only for what you use.
When to use EMR
Data is huge but semi-structured or unstructured, so you can't take any benefit from the Glue Data Catalog.
You believe only in the outputs, and lineage is not required.
You need to define more memory per executor depending upon the type of your job and its requirements.
You can manage the cluster easily, or you have so many jobs that can run concurrently on the cluster that it saves you money.
In the case of structured data, you should use EMR when you want more Hadoop capabilities like Hive or Presto for further analytics.
So it depends on what your use case is. Both are great services.
Glue allows you to submit ETL scripts directly in PySpark/Python/Scala, without the need for managing an EMR cluster. All setup/tear-down of infrastructure is managed.
There are also a few other managed components like Crawlers, Glue Data Catalog, etc which make it easier to work on your data.
You could use either for your use case; Glue would be faster, however you may not have the flexibility you get with EMR.
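To give a feel for what that looks like, here is a minimal sketch of a Glue PySpark job script; the catalog database, table, dropped field, and output path are hypothetical placeholders:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Glue passes job parameters on the command line; resolve the job name.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the Glue Data Catalog (hypothetical database/table), drop a
# field, and write the result to S3 as Parquet.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)
cleaned = dyf.drop_fields(["_corrupt_record"])
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet",
)
job.commit()
```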
Glue uses EMR under the hood. This is evident when you ssh into the driver of your Glue dev-endpoint.
Now, since Glue is a managed Spark environment, or say a managed EMR environment, it comes with reduced flexibility. The types of workers that you can choose are limited. The set of language libraries that you can use in your Spark code is limited. Glue did not support packages like pandas and NumPy until recently. Apps like Presto can't be integrated with Glue, although Athena is a good alternative to a separate Presto installation.
The main issue, however, is that Glue jobs have a cold start time of anywhere between 1 and 15 minutes.
EMR is a good choice for exploratory data analysis but for a production environment with CI/CD, Glue seems to be the better choice.
EDIT - Glue jobs no longer have a cold start wait time
From the AWS Glue FAQ:
AWS Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. AWS Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs.
Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.
Source: https://aws.amazon.com/glue/faqs/
AWS Glue is an ETL service from AWS. AWS Glue will generate ETL code in Scala or Python to extract data from the source, transform the data to match the target schema, and load it into the target.
AWS EMR is a service where you can process large amounts of data; it is a supporting big data platform. It supports Hadoop, Spark, Flink, Presto, Hive, etc. You could spin up EC2 instances with the above-listed software and build a similar ecosystem.
In your case, you want to process 1 TB of data. If you want to do computations on the same data, you can use EMR, and if you want to run analytics on the transformed data, use Glue.
The following is something that I compiled after working on analytics projects (though a lot of it depends on the use case) - but generally speaking:
Costs: Glue is comparatively costlier. EMR is much cheaper due to Spot Instance functionality - there have been cases of savings of up to 50% compared to Glue costs, and even more depending upon the use case.
Orchestration: Glue has it built in (Glue Workflows & Triggers). EMR is orchestrated through CloudWatch triggers & Step Functions.
Infra work required: Glue needs no infra setup - you just select the worker type - however, roles & permissions are needed. For EMR you have to identify the type of node needed, set up autoscaling rules, etc.
Cluster resiliency & robustness: Glue is highly resilient (AWS managed). With EMR, if Spot Instances are used then interruptions might occur with a 2-minute notification (though the system recovers automatically - for example, job times might elongate).
Skill sets needed: Glue needs PySpark & intermediate AWS knowledge. EMR needs DevOps to set up and manage the cluster, intermediate knowledge of orchestration via CloudWatch & Step Functions, and PySpark.
Applicable use cases: Glue is an attractive option when: 1. you are not worried about costs but need highly resilient infra; 2. you have batch setups wherein the job completes in a fixed time; 3. you have short real-time streaming jobs which need to run for, let's say, a few hours during a day. EMR is a better fit when: 1. the use case is volatile clusters - mostly used for batch processing (day-minus scenarios) - thereby making it a cost-effective solution for batch jobs; 2. you have 24/7 Spark streaming programs; 3. you need a Hadoop ecosystem & related tools (like HDFS, Hive, Hue, Impala, etc.); 4. you need to run Flink programs, etc.; 5. you need control over the infra & its tuning parameters.
Also, going back to the OP's use case of 1 TB of data processing: if it's one-time processing, Glue should suffice; if it's a once-daily batch, EMR and Glue will both be good (depending on how the job is tuned, Glue can be an attractive option); if it's a job that runs multiple times daily, then EMR is the better option (considering the balance of performance and cost).

Dependency based ETL flow in AWS

We want to create a dynamic flow based on input data in S3. Based on the data available in S3, along with its metadata, we want to create dynamic clusters and dynamic tasks/transformation jobs in the system, and some jobs are dependency based. Here I am sharing the expected flow; I want to know how efficiently we can do this using AWS services and environment.
I am exploring AWS SWF, Data Pipeline, and Lambda, but I am not sure how to take care of dynamic tasks and dynamic dependencies. Any thoughts around this?
The data flow is explained in the attached image (refer to the ETL Flow diagram).
Amazon Step Functions with S3 triggers should get the job done in a cost-effective and scalable manner.
All steps are defined with the Amazon States Language.
https://states-language.net/spec.html
You can run jobs in parallel and wait for them to finish before you start your next job.
Below is an example along the lines of the AWS Step Functions samples.
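As a rough stand-in for such a sample, here is a minimal sketch of a definition that runs two transformation jobs in parallel and then a final step; the Lambda ARNs are hypothetical placeholders:

```python
import json

# Hypothetical state machine: two parallel branches, then an aggregation step
# that runs only after both branches finish.
definition = {
    "StartAt": "ParallelTransforms",
    "States": {
        "ParallelTransforms": {
            "Type": "Parallel",
            "Branches": [
                {
                    "StartAt": "TransformA",
                    "States": {
                        "TransformA": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform-a",
                            "End": True,
                        }
                    },
                },
                {
                    "StartAt": "TransformB",
                    "States": {
                        "TransformB": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform-b",
                            "End": True,
                        }
                    },
                },
            ],
            "Next": "Aggregate",
        },
        "Aggregate": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:aggregate",
            "End": True,
        },
    },
}

print(json.dumps(definition, indent=2))
```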
If you use the AWS Flow Framework, which is part of the official SWF client, then modeling such a dynamic flow is pretty straightforward. You define its object model, write code that instantiates it based on your pipeline definition, and execute it using the framework. See the Deployment Sample for an example of such a dynamic workflow implementation.

AWS : What's the difference between Simple Workflow Service and Data Pipeline?

What's the difference between Amazon Simple Workflow Service and Amazon Data Pipeline? It seems that they are pretty much the same product. Data Pipeline has a nice web-based diagram editor, though.
Cheers!
From http://aws.amazon.com/datapipeline/faqs/
Q: How is AWS Data Pipeline different from Amazon Simple Workflow Service?
While both services provide execution tracking, retry and exception-handling capabilities, and the ability to run arbitrary actions, AWS Data Pipeline is specifically designed to facilitate the specific steps that are common across a majority of data-driven workflows – in particular, executing activities after their input data meets specific readiness criteria, easily copying data between different data stores, and scheduling chained transforms. This highly specific focus means that its workflow definitions can be created very rapidly and with no code or programming knowledge.
Data Pipeline is a service used to transfer data between various AWS services. For example, you can use Data Pipeline to read the log files from your EC2 instances and periodically move them to S3.
Simple Workflow Service is a very powerful service; you can even write your workflow logic using it. For example, most e-commerce systems have scalability problems in their order systems; you can write code in SWF to implement the ordering workflow process itself.
The AWS Big Data Blog does a wonderful job of explaining the key features of SWF, Data Pipeline, and Lambda.
The diagram below is copied from the blog.