I was wondering if it is possible to run a large number of "jobs" (or "pipelines", or whatever the right term is) to execute some modelling tasks in parallel.
What I plan to do is an ETL process and EDA, and after that, when the data is ready, I would like to fire 2,000 modelling jobs. We have 2,000 products, and each job can start with its own data (SELECT * FROM DATA WHERE PROD_ID='xxxxxxxxx'). My idea is to run these training jobs in parallel (there is no dependency between them, so it makes sense to me).
First of all: 1) Can it be done in AWS SageMaker? 2) What would be the right approach? 3) Are there any special considerations I need to be aware of?
Thanks a lot in advance!
It's possible to run this on SageMaker with SageMaker Pipelines, which will orchestrate a SageMaker Processing job followed by a Training job. You can define PROD_ID as a String parameter of the SageMaker Pipeline, then run multiple pipeline executions concurrently (the default soft limit is 200 concurrent executions).
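As a rough sketch of firing those executions from a script (the pipeline name "product-training-pipeline" and the parameter name "ProdId" are assumptions, not real resources), a boto3 loop could look like this:

```
import boto3

sm = boto3.client("sagemaker")

# Assumed: a pipeline named "product-training-pipeline" already exists and
# declares a String parameter "ProdId" used by its Processing/Training steps.
product_ids = ["PROD_0001", "PROD_0002"]  # ... up to 2,000 IDs

for prod_id in product_ids:
    # Each call starts an independent execution of the same pipeline,
    # passing a different product ID to the SELECT in the processing step.
    sm.start_pipeline_execution(
        PipelineName="product-training-pipeline",
        PipelineParameters=[{"Name": "ProdId", "Value": prod_id}],
    )
```

In practice you'd throttle this loop (or request a quota increase) so that you stay under the concurrent-execution limit.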
As you have a very high number of jobs (2,000) that you want to run in parallel, and perhaps want to optimize compute usage, you might also want to look at AWS Batch, which lets you queue up tasks for a fleet of instances that starts containers to perform these jobs. AWS Batch also supports Spot Instances, which could reduce your instance cost by 70%-90%. Another advantage of AWS Batch is that jobs reuse the same running instance (only the container stops/starts), while in SageMaker there's a ~2 minute overhead per job to start the instance. Additionally, AWS Batch takes care of retries and allows you to chain all 2,000 jobs together and run a "finisher" job when all of them have completed.
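If you go the AWS Batch route, an array job maps naturally onto the 2,000 products: each child job gets an AWS_BATCH_JOB_ARRAY_INDEX it can translate into a PROD_ID, and a job that depends on the whole array acts as the "finisher". A hedged sketch (the queue and job definition names are assumptions):

```
import boto3

batch = boto3.client("batch")

# One array job with 2,000 children; each child reads AWS_BATCH_JOB_ARRAY_INDEX
# inside its container and maps it to a PROD_ID before querying the data.
train = batch.submit_job(
    jobName="train-all-products",
    jobQueue="training-queue",          # assumed queue backed by a Spot compute environment
    jobDefinition="product-trainer:1",  # assumed job definition
    arrayProperties={"size": 2000},
    retryStrategy={"attempts": 2},
)

# The "finisher" starts only after every child of the array job has completed.
batch.submit_job(
    jobName="aggregate-results",
    jobQueue="training-queue",
    jobDefinition="finisher:1",
    dependsOn=[{"jobId": train["jobId"]}],
)
```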
Limits increase - For either service, you'll need to increase your service quota limits. This can be done from the Service Quotas console for most services, or by contacting AWS Support. Some services have hard limits.
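Quota increases can also be requested programmatically through the Service Quotas API; a small sketch (the quota code below is a placeholder, look up the real one with list_service_quotas first):

```
import boto3

sq = boto3.client("service-quotas")

# List the quotas for the service you care about to find the right QuotaCode;
# the exact quota names vary per service.
for quota in sq.list_service_quotas(ServiceCode="sagemaker")["Quotas"]:
    print(quota["QuotaCode"], quota["QuotaName"], quota["Value"])

# Then request an increase (QuotaCode here is a placeholder, not a real code).
sq.request_service_quota_increase(
    ServiceCode="sagemaker",
    QuotaCode="L-XXXXXXXX",
    DesiredValue=2000.0,
)
```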
If I had to perform ETL on a huge dataset (say 1 TB) stored in S3 as CSV files, both an AWS Glue ETL job and AWS EMR steps could be used. How is AWS Glue different from AWS EMR, and which is the better solution in this case?
Most of the differences are already listed, so I'll focus more on the use-case specifics.
When to choose AWS Glue
The data is huge but structured, i.e. it is in a tabular structure and in a known format (CSV, Parquet, ORC, JSON).
Lineage is required: if you need a data lineage graph while developing your ETL job, prefer developing the ETL using Glue native libraries.
The developers don't need to tweak performance parameters like the number of executors, memory per executor, and so on.
You don't want the overhead of managing a large cluster and want to pay only for what you use.
When to use EMR
The data is huge but semi-structured or unstructured, so you can't take any benefit from the Glue Data Catalog.
You only care about the outputs, and lineage is not required.
You need to define more memory per executor depending on the type of your job and its requirements.
You can manage the cluster easily, or you have many jobs that can run concurrently on the same cluster, saving you money.
In the case of structured data, you should use EMR when you want more Hadoop capabilities like Hive or Presto for further analytics.
So it depends on what your use case is. Both are great services.
Glue allows you to submit ETL scripts directly in PySpark/Python/Scala without the need to manage an EMR cluster. All setup/tear-down of infrastructure is managed for you.
There are also a few other managed components, like Crawlers, the Glue Data Catalog, etc., which make it easier to work with your data.
You could use either for your use case; Glue would be faster to get going with, however you may not have the flexibility you get with EMR.
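To give a feel for how little boilerplate a Glue script needs, here is a minimal PySpark sketch (the database, table, and bucket names are made up):

```
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the crawled CSV data through the Glue Data Catalog (names are placeholders).
src = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_csv"
)

# Example transformation: keep only the columns we need, then write out as Parquet.
cleaned = src.select_fields(["prod_id", "ts", "amount"])
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/"},
    format="parquet",
)
job.commit()
```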
Glue uses EMR under the hood. This is evident when you SSH into the driver of your Glue dev endpoint.
Now, since Glue is a managed Spark environment, or say a managed EMR environment, it comes with reduced flexibility. The type of workers that you can choose is limited. The number of language libraries that you can use in your Spark code is limited; Glue did not support packages like pandas and NumPy until recently. Apps like Presto can't be integrated with Glue, although Athena is a good alternative to a separate Presto installation.
The main issue, however, is that Glue jobs have a cold start time of anywhere between 1 minute and 15 minutes.
EMR is a good choice for exploratory data analysis, but for a production environment with CI/CD, Glue seems to be the better choice.
EDIT - Glue jobs no longer have a cold start wait time
From the AWS Glue FAQ:
AWS Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. AWS Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs.
Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.
Source: https://aws.amazon.com/glue/faqs/
AWS Glue is an ETL service from AWS. AWS Glue will generate ETL code in Scala or Python to extract data from the source, transform the data to match the target schema, and load it into the target.
AWS EMR is a service where you can process large amounts of data; it is a supporting big data platform. It supports Hadoop, Spark, Flink, Presto, Hive, etc. You can spin up EC2 instances with the software listed above and build a similar ecosystem.
In your case, you want to process 1 TB of data. Now, if you want to do computations on the same data, you can use EMR, and if you want to run analytics on the transformed data, use Glue.
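For comparison, spinning up a transient EMR cluster that runs one Spark step and then terminates might look roughly like this with boto3 (the instance types, script location, release label, and roles are assumptions):

```
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="etl-1tb-csv",
    ReleaseLabel="emr-6.15.0",              # assumed release
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-bucket/emr-logs/",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1, "Market": "ON_DEMAND"},
            {"InstanceRole": "CORE", "InstanceType": "m5.2xlarge",
             "InstanceCount": 4, "Market": "SPOT"},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate after the step finishes
    },
    Steps=[{
        "Name": "transform-csv-to-parquet",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/scripts/etl_job.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```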
The following is something that I compiled after working on analytics projects (though a lot of it depends on the use case), but generally speaking:
| Criteria | Glue | EMR |
| --- | --- | --- |
| Costs | Comparatively costlier | Much cheaper (due to Spot Instance functionality; there have been cases of savings of up to 50% over Glue costs, and even more depending on the use case) |
| Orchestration | Inbuilt (Glue Workflows & Triggers) | Through CloudWatch triggers & Step Functions |
| Infra work required | No infra setup - just select the worker type; however, roles & permissions are needed | Identify the type of node needed & set up autoscaling rules, etc. |
| Cluster resiliency & robustness | Highly resilient (AWS managed) | If Spot Instances are used, interruptions might occur with a 2-minute notification (though the system recovers automatically; e.g. job times might elongate) |
| Skill sets needed | PySpark & intermediate AWS knowledge | DevOps to set up & manage EMR; intermediate knowledge of orchestration via CloudWatch & Step Functions; PySpark |
| Applicable use cases | An attractive option when: 1. you are not worried about costs but need highly resilient infra; 2. batch setups where the job completes in a fixed time; 3. short real-time streaming jobs which need to run for, say, a few hours during the day | 1. Volatile clusters, mostly used for batch processing (day-minus scenarios), making it a cost-effective solution for batch jobs; 2. an attractive option for 24/7 Spark streaming programs; 3. you need a Hadoop ecosystem & related tools (like HDFS, Hive, Hue, Impala, etc.); 4. you need to run Flink programs, etc.; 5. you need control over the infra & its tuning parameters |
Also, going back to the OP's use case of processing 1 TB of data: if it's one-time processing, Glue should suffice; if it's a once-daily batch, EMR and Glue will both be good (depending on how the job is tuned, Glue can be an attractive option); if it's a job run multiple times a day, then EMR is the better option (considering the balance of performance and cost).
Here is the scenario:
There is historical data in Hive which keeps growing on a daily basis.
Fresh data comes daily in the form of batches, which are merged into the data mentioned above.
As of now, the above process runs on a Hadoop cluster as scheduled Hive jobs.
The above setup is to be migrated to the AWS cloud.
Now the problem is how to set up the cluster:
If we set up a continuously running EMR cluster, then the cost will be too high, and most of the time the cluster would be unused (as data comes in batches once or twice a day).
If we use a scheduled EMR cluster (instance fleet), then at processing time we will have to copy the full historical data from storage (let's say S3) to the cluster to merge it with the fresh data, which will take too much time as it's huge in volume; and after processing, it has to be stored back in S3 again.
Please suggest which kind of EMR cluster would be suitable here.
The docs say that AWS allocates 10 DPUs per ETL job and 5 DPUs per development endpoint by default, even though both can be configured with a minimum of 2 DPUs.
It's also mentioned that crawling is priced in one-second increments with a 10-minute minimum per run, but nowhere is it specified how many DPUs are allocated. Jobs and development endpoints can be configured in the Glue console to consume fewer DPUs, but I haven't seen any such configuration for crawlers.
Is there a fixed number of DPUs per crawler? Can we control that amount?
This is my conversation with AWS Support about this subject:
Hello, I'd like to know how many DPUs a crawler uses in order to
calculate my costs with crawlers.
Their answer:
Dear AWS Customer,
Thank you for reaching out today. My name is Safari, I will assist
with your case.
I understand that while compiling the cost of your Glue crawlers,
you'd like to know the amount of DPUs a particular crawler uses.
Unfortunately, there is no direct way to find out the DPU consumption
by a given crawler. I apologize for the inconvenience. However, you
may see the total DPU consumption across all crawlers in your detailed
bill under the section AWS Service Charges > Glue > {region} > AWS
Glue CrawlerRun. Additionally, you can add tags to your crawlers and
then enable "Cost Allocation Tags" from your AWS Billing and Cost
Management console. This would allow AWS to generate a cost allocation
report grouped by the predefined tags. For more on this, please see
the documentation link below [1].
I hope this helps. Please let me know if I can provide you with any
other assistance.
References [1]:
https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html
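If you go the tagging route they suggest, the Glue API lets you attach tags to a crawler by ARN; a small sketch (the region, account ID, crawler name, and tag key are placeholders):

```
import boto3

glue = boto3.client("glue")

# Tag the crawler so its DPU charges show up under a cost allocation tag.
# The ARN below is a placeholder; build it from your region, account ID, and crawler name.
glue.tag_resource(
    ResourceArn="arn:aws:glue:eu-west-1:123456789012:crawler/my-crawler",
    TagsToAdd={"cost-center": "data-platform"},
)
```

After tagging, the tag still has to be activated as a cost allocation tag in the Billing and Cost Management console before it appears in the cost reports.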
I discussed this with the AWS support team as well, and currently it's not possible to modify or view the DPU configuration for Glue crawlers. But do crawlers use DPUs at all?
AWS Glue documentation regarding pricing reads:
A Glue ETL job requires a minimum of 2 DPUs. By default, AWS Glue
allocates 10 DPUs to each ETL job. You are billed $0.44 per DPU-Hour
in increments of 1 minute, rounded up to the nearest minute, with a
10-minute minimum duration for each ETL job.
I want to reduce the number of DPUs allocated to my ETL job. I searched for this option in the Glue console but didn't find it. Can you please let me know how to do that?
Thanks
To reduce the number of DPUs, go to the AWS Glue jobs console. Select the job and, under Action, choose Edit job. Under "Script libraries and job parameters", you should see "Concurrent DPUs per job run". You can provide an integer value there to increase or reduce the number of DPUs.
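The same setting can also be controlled through the API; for example, MaxCapacity on StartJobRun overrides the DPU count for a single run (the job name and value below are assumptions, and MaxCapacity can't be combined with WorkerType/NumberOfWorkers):

```
import boto3

glue = boto3.client("glue")

# Run an existing job with only 2 DPUs for this particular run.
glue.start_job_run(
    JobName="my-etl-job",   # assumed job name
    MaxCapacity=2.0,        # DPUs for this run; 2 is the minimum for a Spark ETL job
)
```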