Running multiple jobs in SageMaker

Running multiple jobs in SageMaker - amazon-web-services

I was wondering if it is possible to run large number of "jobs" (or "pipeline" or whatever is the right way) to execute some modelling tasks in parallel.
So what I planned to do is to do a ETL process and EDA done and after that when the data is ready, I would like to fire 2000 modelling jobs. We have 2000 products and each job can start with a data (SELECT * FROM DATA WHERE PROD_ID='xxxxxxxxx') and my idea is to run these training jobs in parallel (there is no dependency between them - so it makes sense to me).
First of all - 1) Can it be done in AWS SageMaker? 2) What would be the right approach? 3) Any special considerations I need to be aware of?
Thanks a lot in advance!

it's possible to run this on SageMaker, with SageMaker pipelines that will orchestrate a SageMaker Processing job, followed by a Training job. You can define the PROD_ID as a String parameter to the SageMaker Pipeline, then run multiple pipelines executions concurrently (default soft limit is 200 concurrent executions).
As you have a very high numbers of jobs (2K) which you want to run in parallel, and perhaps optimize compute usage, you might also want to look at AWS Batch, which allows you to queue up tasks, for a fleet of instances that starts containers to perform these jobs. AWS Batch also support Spot instances which could reduce your instance cost by 70%-90%. Another advantage of AWS Batch is that jobs reuse the same running instance (only container stop/start), while in SageMaker there's a ~2 minute overhead to start the instance per job. Additionally, AWS Batch also takes care of retries and allowing you to chain all 2,000 jobs together and run a "finisher" job when all jobs have completed.
Limits increase - For any service, you'll need to increase your service quota limits. It can be done from the console "Quotas" for most services, or by contacting AWS support. Some services has hard limits.

Related

Where is FlexRS type in dataflow part of batch or stream processing?

I have a question about the FlexRS type while I was looking at the dataflow of Google Cloud Platform. :)
As far as I know, dataflow supports batch and stream, but I was curious to know that there is a FlexRS type.
Simply understanding FlexRS was difficult to identify in the document except that it was cheaper than a typical workflow.
Can I ask you to explain FlexRS?
Thank you.

Dataflow is a fully managed batch and streaming analytics service that minimises latency, processing time through autoscaling.
Regarding FlexRS for Dataflow, ou can use it in batch processing pipelines which are not time-critical, such as daily or weekly jobs that can be completed within a certain time-window. Normally, Dataflow uses both and preemptible and regular workers to execute your job. It takes into account the availability of preemptible VM`s and you are charged according to the documentation.
On the other hand, FlexRS offers a discounted rate for CPU and memory pricing for batch processing. It can delay your Dataflow batch job within a 6-hour window to identify the best point in time to start the job, based on the availability of resources. When enabled, FlexRS selects preemptible VMs for 90% of workers in the worker pool by default.
Therefore, FlexRS is used only for non time-critical batch workloads.

What is the difference between AWS Glue ETL Job and AWS EMR?

If I had to perform ETL on a huge dataset(say 1Tb) stored in S3 as csv files, Both AWS Glue ETL job and AWS EMR steps can be used. Then how is AWS Glue different from AWS EMR. And which is the better solution in this case.

Most of the differences are already listed so I'll focus more on the use case specific.
When to choose aws glue
Data size is huge but structured i.e. it is in the table structure and is of known format (CSV, parquet, orc, json).
Lineage is required, if you need the data lineage graph while developing your etl job prefer developing the etl using glue native libraries.
The developers don't need to tweak the performance parameters like setting number of executors, per executor memory and so on.
You don't want the overhead of managing large cluster and pay only for what you use.
When to use EMR
Data is huge but semi-structured or unstructured where you can't take any benefit from Glue catalog.
You believe only in the outputs and lineage is not required.
You need to define more memory per executor depending upon the type of your job and requirement.
You can manage the cluster easily or if you have so many jobs which can run concurrently on the cluster saving you money.
In case of structured data, you should use EMR when you want more Hadoop capabilities like hive, presto for further analytics.
So it depends on what your use case is. Both are great service.

Glue allows you to submit ETL scripts directly in PySpark/Python/Scala, without the need for managing an EMR cluster. All setup/tear-down of infrastructure is managed.
There are also a few other managed components like Crawlers, Glue Data Catalog, etc which make it easier to work on your data.
You could use either for your use-case, Glue would be faster however you may not have the flexibility you get with EMR.

Glue uses EMR under the hood. This is evident when you ssh into the driver of your Glue dev-endpoint.
Now since Glue is a managed spark environment or say managed EMR environment, it comes with reduced flexibility. The type of workers that you can chose is limited. The number of language libraries that you can use in your spark code is limited. Glue did not support packages like pandas, numpy until recently. Apps like presto cant be integrated with Glue although Athena is a good alternative to a separate presto installation.
The main issue however is that Glue jobs have a cold start time from anywhere between 1 minute to 15 minutes.
EMR is a good choice for exploratory data analysis but for a production environment with CI/CD, Glue seems to be the better choice.
EDIT - Glue jobs no longer have a cold start wait time

From the AWS Glue FAQ:
AWS Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. AWS Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs.
Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.
Source: https://aws.amazon.com/glue/faqs/

AWS Glue is a ETL service from AWS. AWS Glue will generate ETL code in Scala or Python to extract data from the source, transform the data to match the target schema, and load it into the target
AWS EMR is a service where you can process large amount of data , its a supporting big data platform .It Supports Hadoop,Spark,Flink,Presto, Hive etc.You can spin up EC2 with the above listed softwares and make a similar ecosystem.
In your case , you want to process 1 TB of data .Now if you want do computations on the same data , you can use EMR and if you want to run the analytics on the transformed data , use Glue .

Following is something that i compiled post working on analytics projects (though a lot of it depends on use case) - but generally speaking :
Criteria
Glue
EMR
Costs
Comparatively Costlier
Much Cheaper (Due to Spot Instance Functionality, There have been cases when there are saving of upto 50% over top-off glue costs - even more depending upon the use case)
Orchestration
Inbuilt(Glue WorkFlows & Triggers)
Through Cloud Watch Triggers & Step Functions
Infra Work Required
No Infra Setup - Select Worker Type However,Roles & Permissions are needed
Identify the Type of Node Needed & Setup Autoscaling rules etc
Cluster Resiliency & Robustness
Highly Resilient (AWS MANAGED)
If Spot Instances are used then interruption might occur with 2 min notification(Though the System Recovers Automatically - For eg - Job Times might elongate)
Skill Sets Needed
PySpark & Intermediate AWS Knowledge
DevOps to Setup EMR & Manage, Intermediate Knowledge of Orchestration via Cloud Watch & Step Function, PySpark
Applicable Use Cases
Attractive Option in event: 1. You are not worried about Costs but need highly resilient infra2. Batch Setups wherein the Job might complete in fixed time3. Short RealTime Streaming Jobs which need to run for let's say hrs during a day
1. Use Case is of Volatile Clusters - Mostly Used for Batch Processing (Day MINUS Scenarios) - Thereby making a costs effective solution for Batch Jobs2. Attractive option for 24/7 Spark Streaming Programs3. You Need a Hadoop Ecosystem & Related tools (like HDFS, HIVE, HUE, Impala etc)4. You need to run Flink Programs etc5. You need control over Infra & It's tuning parameters
Also going back to OP's use case of 1TB of data processing. If its one time processing Glue should suffice, if its a Daily Once Batch EMR & GLUE will both be good (depending on how job is tuned Glue can be an attractive option), if its a multiple time daily job - then EMR is a better option (Considering balance of performance and cost)

What to use AWS Fargate or AWS Beanstalk

I have a java application that reads from a SQS queue and does some business processing and finally writes it to a datastore. As the SQS queue grows I want to be able to scale to read more messages and process them. Each SQS message will take about 15 to 20 minutes to process. I was looking at a service like AWS Fargate or AWS Beanstalk to deploy my application. Money is not a concern but usability is. What would be the best platform?

Fargate would be an ideal solution, as it has following advantages over Beanstalk:
It's serverless
More fine-grained control for custom application architectures.
No need to write EB extensions.
Build and Test image locally and Promote same to Fargate.
With application autoscaling, you can scale on the go.
Pricing is per second with a 1-minute minimum
FAQ:
https://aws.amazon.com/fargate/faqs/
Pricing:
https://aws.amazon.com/fargate/pricing/

I've had a very similar use case to this and I used Batch. (which was not available in 2014 when the question was asked)
https://aws.amazon.com/batch/
In my case I was processing audio and video files from the queue.
You can set a lambda to fire on the SQS queue and have that drop the job onto batch for processing.
If you have the minimum cluster size set to zero then you will have no servers running when there is no work to do, but you can have them autoscale up to process as much work as you require when the jobs come in.
The advantage compared to lambda is that the code that executes can be any container with as much resource as you want to throw at it.
For your use case it will be perfect, but for anything that can complete processing in a a few seconds or a minute it's worth making each job process more than one task per execution or all of the time will be spent firing up and shutting down containers.

Can Cloud Dataflow streaming job scale to zero?

I'm using Cloud Dataflow streaming pipelines to insert events received from Pub/Sub into a BigQuery dataset. I need a few ones to keep each job simple and easy to maintain.
My concern is about the global cost. Volume of data is not very high. And during a few periods of the day, there isn't any data (any message on pub/sub).
I would like that Dataflow scale to 0 worker, until a new message is received. But it seems that minimum worker is 1.
So minimum price for each job for a day would be : 24 vCPU Hour... so at least $50 a month/job. (without discount for monthly usage)
I plan to run and drain my jobs via api a few times per day to avoid 1 full time worker. But this does not seem to be the right form for a managed service like DataFlow.
Is there something I missed?

Dataflow can't scale to 0 workers, but your alternatives would be to use Cron, or Cloud Functions to create a Dataflow streaming job whenever an event triggers it, and for stopping the Dataflow job by itself, you can read the answers to this question.
You can find an example here for both cases (Cron and Cloud Functions), note that Cloud Functions is not in Alpha release anymore and since July it's in General Availability release.

A streaming Dataflow job must always have a single worker. If the volume of data is very low, perhaps batch jobs fit the use case better. Using a scheduler or cron you can periodically start a batch job to drain the topic and this will save on cost.

Scheduling long-running tasks using AWS services

My application heavily relies on AWS services, and I am looking for an optimal solution based on them. Web Application triggers a scheduled job (assume repeated infinitely) which requires certain amount of resources to be performed. Single run of the task normally will take maximum 1 min.
Current idea is to pass jobs via SQS and spawn workers on EC2 instances depending on the queue size. (this part is more or less clear)
But I struggle to find a proper solution for actually triggering the jobs at certain intervals. Assume we are dealing with 10000 jobs. So for a scheduler to run 10k cronjobs (the job itself is quite simple, just passing job description via SQS) at the same time seems like a crazy idea. So the actual question would be, how to autoscale the scheduler itself (given the scenarios when scheduler is restarted, new instance is created etc. )?
Or the scheduler is redundant as an app and it is wiser to rely on AWS Lambda functions (or other services providing scheduling)? The problem with using Lambda functions is the certain limitation and the memory provided 128mb provided by single function is actually too much (20mb seems like more than enough)
Alternatively, the worker itself can wait for a certain amount of time and notify the scheduler that it should trigger the job one more time. Let's say if the frequency is 1 hour:
1. Scheduler sends job to worker 1
2. Worker 1 performs the job and after one hour sends it back to Scheduler
3. Scheduler sends the job again
The issue here however is the possibility of that worker will be get scaled in.
Bottom Line I am trying to achieve a lightweight scheduler which would not require autoscaling and serve as a hub with sole purpose of transmitting job descriptions. And certainly should not get throttled on service restart.

Lambda is perfect for this. You have a lot of short running processes (~1 minute) and Lambda is for short processes (up until five minutes nowadays). It is very important to know that CPU speed is coupled to RAM linearly. A 1GB Lambda function is equivalent to a t2.micro instance if I recall correctly, and 1.5GB RAM means 1.5x more CPU speed. The cost of these functions is so low that you can just execute this. The 128MB RAM has 1/8 CPU speed of a micro instance so I do not recommend using those actually.
As a queueing mechanism you can use S3 (yes you read that right). Create a bucket and let the Lambda worker trigger when an object is created. When you want to schedule a job, put a file inside the bucket. Lambda starts and processes it immediately.
Now you have to respect some limits. This way you can only have 100 workers at the same time (the total amount of active Lambda instances), but you can ask AWS to increase this.
The costs are as follows:
0.005 per 1000 PUT requests, so $5 per million job requests (this is more expensive than SQS).
The Lambda runtime. Assuming normal t2.micro CPU speed (1GB RAM), this costs $0.0001 per job (60 seconds, first 300.000 seconds are free = 5000 jobs)
The Lambda requests. $0.20 per million triggers (first million is free)
This setup does not require any servers on your part. This cannot go down (only if AWS itself does).
(don't forget to delete the job out of S3 when you're done)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js