Training job runtime exceeded MaxRuntimeInSeconds provided - amazon-web-services

I would like to train my model for 30 days using an AWS SageMaker training job, but its maximum runtime is 5 days. How can I resume the earlier training job to proceed further?

Follow these steps:
Open a support ticket to increase the "Longest run time for a training job" quota to 2419200 seconds (28 days). (This cannot be adjusted through Service Quotas in the AWS web console.)
Using the SageMaker Python SDK, set max_run=2419200 when creating the Estimator.
Implement resume-from-checkpoint logic in your training script.
Also, the questions in @rok's answer are very relevant to consider.
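As a rough illustration of the SDK and checkpoint settings above, here is a minimal sketch using the SageMaker Python SDK; the image URI, IAM role, bucket names, and instance type are placeholders, not values from the question:

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

# All identifiers below are placeholders; substitute your own image, role, and bucket.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    max_run=2419200,  # 28 days; only effective once the quota increase is granted
    # SageMaker syncs this local path to S3 during training and restores it on restart,
    # so the training script can resume from the latest checkpoint.
    checkpoint_s3_uri="s3://my-bucket/checkpoints/my-job/",
    checkpoint_local_path="/opt/ml/checkpoints",
    sagemaker_session=session,
)

estimator.fit({"training": "s3://my-bucket/training-data/"})
```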

According to the documentation here, the maximum allowed runtime is 28 days, not 5; please check your configuration. You are right: according to the documentation here, the maximum runtime for a training job is 5 days. There are multiple things you can do: use more powerful (or multiple) GPUs to reduce training time, or save checkpoints and restart training from them. Anyway, 30 days looks like a very long training time (with the associated cost); are you sure you need that?
Actually, you could ask for a service quota increase from here, but as you can see, "Longest run time for a training job" is not adjustable there. So I don't think you have any other choice than either using checkpoints or larger GPUs.
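For the checkpoint route, here is a rough sketch of what resume-from-checkpoint logic inside the training script could look like; PyTorch is used purely as an example, the model and optimizer objects are assumed to exist elsewhere, and /opt/ml/checkpoints matches SageMaker's default checkpoint local path:

```python
import os
import torch

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # synced with checkpoint_s3_uri by SageMaker
CHECKPOINT_PATH = os.path.join(CHECKPOINT_DIR, "latest.pt")

def load_checkpoint(model, optimizer):
    """Resume from the last saved state if a checkpoint exists, otherwise start at epoch 0."""
    if os.path.exists(CHECKPOINT_PATH):
        state = torch.load(CHECKPOINT_PATH)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["epoch"] + 1
    return 0

def save_checkpoint(model, optimizer, epoch):
    """Persist model, optimizer, and progress so a restarted job can continue."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "epoch": epoch},
        CHECKPOINT_PATH,
    )

# Inside the training loop (model, optimizer, num_epochs defined elsewhere):
# start_epoch = load_checkpoint(model, optimizer)
# for epoch in range(start_epoch, num_epochs):
#     train_one_epoch(model, optimizer)
#     save_checkpoint(model, optimizer, epoch)
```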

Related

Reaching batch prediction quota limit when not submitting that many batch predictions

I'm using Vertex AI batch predictions using a custom XGBoost model with Explainable AI using Shapley values.
The explanation part is quite computationally intensive, so I've tried to split up the input dataset into chunks and submit 5 batch prediction jobs in parallel. When I do this I receive a "Quota exhausted. Please reach to ai-platform-unified-feedback@google.com for batch prediction quota increase" error.
I don't understand why I'm hitting the quota. According to the docs there is a limit on the number of concurrent jobs for AutoML models but it doesn't mention custom models.
Is the quota perhaps on the number of instances the batch predictions are running on? I'm using a n1-standard-8 instance for my predictions.
I've tried changing the instance type and launching fewer jobs in parallel but still getting the same error.
According to the Google documentation for Vertex AI, for custom models the quota is on the number of concurrent machines running in the specified region.
You can request a quota increase following the information mentioned in the error message.
For more information on custom-trained model quotas refer to this documentation.
After reaching out to Google support regarding this issue, it was explained to me that the quota is based on the number of vCPUs used in the batch prediction job. The formula to calculate this is:
number of vCPUs per machine × number of machines (× 3 if explanations are enabled, because a separate node is spun up in this case, which requires additional resources)
For example, using 50 e2-standard-4 machines to run a batch prediction with explanations results in 50 * 4 * 3 = 600 vCPUs in total being used.
The default quota for a Google project is 2,200 vCPUs for the europe-west2 region. Moreover, this limit is not visible in the user's Google project, but instead in a hidden project only visible to Google engineers. Thus, it is required to raise a support ticket if you need the quota to be increased.
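To make the arithmetic concrete, here is a small helper that applies the formula above; the vCPU counts follow the standard machine-type naming, the per-job machine counts in the second example are hypothetical, and the 2,200 figure is the europe-west2 default quoted above:

```python
# vCPUs for a few common machine types, taken from their standard naming.
VCPUS_PER_MACHINE = {"e2-standard-4": 4, "n1-standard-8": 8}

DEFAULT_QUOTA_EUROPE_WEST2 = 2200  # default vCPU quota quoted above for europe-west2

def batch_prediction_vcpus(machine_type: str, machine_count: int, explanations: bool) -> int:
    """vCPUs counted against the quota for one batch prediction job, per the formula above."""
    vcpus = VCPUS_PER_MACHINE[machine_type] * machine_count
    return vcpus * 3 if explanations else vcpus

# Example from the answer: 50 e2-standard-4 machines with explanations -> 600 vCPUs.
print(batch_prediction_vcpus("e2-standard-4", 50, explanations=True))

# Hypothetical scenario: 5 parallel jobs, each on 30 n1-standard-8 machines with
# explanations -> 5 * 30 * 8 * 3 = 3600 vCPUs, which would exceed the 2,200 default.
total = 5 * batch_prediction_vcpus("n1-standard-8", 30, explanations=True)
print(total, total > DEFAULT_QUOTA_EUROPE_WEST2)
```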

How can I calculate costs for Apache beam pipelines running on Google Dataflow?

I wanted to get an estimate of my Apache Beam pipeline costs when running on Google Cloud Dataflow. I am currently running Apache Beam code that scales the pipeline automatically using a for loop, and I am storing the data in Beam itself for around 12 hours before aggregating and processing it in the pipeline. Any ideas on how I can estimate the cost would be appreciated, as well as ways to optimize and minimize that cost.
Thanks!
For cost calculation of your Dataflow job, you can find the resource metrics in the detail page of your job (the one showing the DAG and steps), on the right side:
Resource metrics
Current vCPUs: 2
Current memory: 8 GB
Current HDD PD: 25 GB
Current SSD PD: 0 B
Total DCU usage: 0.14
https://cloud.google.com/dataflow/docs/guides/using-monitoring-intf
Then there is a link that lets you calculate the cost of your job based on your resource metrics (workers, vCPUs, memory, disk usage...):
https://cloud.google.com/products/calculator
The price calculator is available for both classic Dataflow jobs and Dataflow Prime.
Dataflow Prime is a new optimized execution engine that allows vertical autoscaling within a worker, along with other features: https://cloud.google.com/dataflow/docs/guides/enable-dataflow-prime
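If you want to try Dataflow Prime from a Beam Python pipeline, a minimal sketch looks like the following; the project, region, and bucket are placeholders, and the key part is the enable_prime service option described in the linked guide:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, region, and bucket; enable_prime turns on Dataflow Prime.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=europe-west2",
    "--temp_location=gs://my-bucket/temp",
    "--dataflow_service_options=enable_prime",
])

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Create" >> beam.Create([1, 2, 3])
        | "Print" >> beam.Map(print)
    )
```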
You can also check this link : https://cloud.google.com/dataflow/pricing
Example with Dataflow Prime: for a job with an average duration of 7.5 hours per month, the calculator gives the estimated monthly cost (result shown as a screenshot in the original answer).
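If you prefer a rough back-of-the-envelope estimate in code instead of the calculator, a sketch like the one below works; the unit prices are placeholders (they vary by region and change over time), so take the real rates from the pricing page linked above:

```python
# Rough cost estimate for a classic batch Dataflow job from its resource metrics.
# The unit prices below are PLACEHOLDERS; look up the region-specific rates on
# https://cloud.google.com/dataflow/pricing before relying on the result.
VCPU_PER_HOUR = 0.056          # placeholder $/vCPU-hour
MEMORY_GB_PER_HOUR = 0.0036    # placeholder $/GB-hour
HDD_PD_GB_PER_HOUR = 0.000054  # placeholder $/GB-hour

def estimate_job_cost(vcpus, memory_gb, hdd_gb, hours):
    """Approximate cost of one run, using the metrics from the job detail page."""
    return hours * (
        vcpus * VCPU_PER_HOUR
        + memory_gb * MEMORY_GB_PER_HOUR
        + hdd_gb * HDD_PD_GB_PER_HOUR
    )

# Metrics shown above: 2 vCPUs, 8 GB memory, 25 GB HDD, running ~7.5 hours per month.
print(f"~${estimate_job_cost(2, 8, 25, 7.5):.2f} per month")
```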
Pricing for Dataflow is based on resource usage, which is billed per second of usage (and varies by location). For pricing details, please visit this link. If you want to know how to check your resources, see the Dataflow resource monitoring interface.
Also, for sample use cases and quotations, you can contact Google's sales team.

Serverless python requests with long timeouts?

I have several Python scripts that follow a similar format: you pass in a date, and the script either checks my S3 bucket for the file with that date in the filename and parses it, or runs some analysis on the file for that date (which takes over 1 hour to run).
I am looking for a serverless solution that would let me call these functions on a range of dates and run them all in parallel. Because of the long duration of my Python scripts, services like AWS Lambda and Google Cloud Functions don't work because of their timeouts (15 minutes and 9 minutes respectively). I have looked at Google Cloud Dataflow, but I am not sure whether it is overkill for my relatively simple use case.
Something with the lowest possible outages is important, so I am leaning towards something from AWS, Google Cloud, etc.
I also would like to be able to see a dashboard of the progress of each job with logs, so I can see which dates have completed and which dates had a bug (plus what the bug is)
As you said, with Google Cloud Functions you can configure the timeout for up to 9 minutes during the deployment.
Solutions other than Dataflow that allow higher timeouts:
App Engine Flex
Another GCP product that allows higher timeouts (up to 60 minutes) is the App Engine Flex environment (link).
Cloud Tasks
Cloud Tasks is also similar, but asynchronous, with timeouts of up to 30 minutes. It is a task queue: you put a task in the queue and it returns quickly; then the worker (or workers) of the queue process the tasks one by one.
The usual pattern with Cloud Tasks is to send emails or to save the results into Cloud Storage (link).
With this solution, you can add a task for each file/filename to process, and each of these tasks gets the 30-minute timeout.
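As a rough sketch of that pattern with the google-cloud-tasks client, where the project, region, queue name, and worker URL are placeholders and the worker behind the URL is assumed to do the per-date processing:

```python
import json
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
# Placeholder project, region, and queue name.
parent = client.queue_path("my-project", "us-central1", "date-processing-queue")

def enqueue_date(date_str: str) -> None:
    """Create one task per date; the worker endpoint performs the actual processing."""
    task = {
        "http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "url": "https://my-worker.example.com/process",  # placeholder worker endpoint
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"date": date_str}).encode(),
        }
    }
    client.create_task(request={"parent": parent, "task": task})

for date in ["2021-01-01", "2021-01-02", "2021-01-03"]:
    enqueue_date(date)
```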
Longer request durations are planned on the Cloud Run roadmap, but there is no date for that yet.
Today, the best recommended approach is to use App Engine together with Task Queue. With a push queue, you can run processes up to 24 hours long when you deploy in manual scaling mode. But be careful: manual scaling doesn't scale to 0!
If you prefer containers, I know of 2 "strange" workarounds on GCP:
Use Cloud Build. Cloud Build allows you to run a custom builder in a container. Do whatever you want in this container, even if it's not for building something. Remember to set a suitable timeout for your processing step. You get 120 build-minutes per day for free with Cloud Build (shared across the entire organization; it's not a free tier per project!). You can run up to 10 build jobs in parallel.
Use AI Platform Training. Similarly to Cloud Build, AI Platform Training allows you to run a custom container to perform processing, even though it was originally intended for training. It's just a container, so you can run whatever you want in it. There is no free tier here. You are limited to 20 concurrent vCPUs by default, but you can ask for the limit to be increased up to 450 concurrent vCPUs.
Sadly, neither is as easy to use as a Cloud Function or Cloud Run: you don't get an HTTP endpoint that you simply call with the date you want. But you can wrap either of them in a function that performs the API calls to Cloud Build or AI Platform Training.

Optimizing apache beam / cloud dataflow startup

I have done a few tests with Apache Beam using both autoscaled workers and a single worker, and each time I see a startup time of around 2 minutes. Is it possible to reduce that time, and if so, what are the suggested best practices for doing so?
IMHO, two minutes is very fast for a product like Cloud Dataflow. Remember, Google is launching a powerful, autoscaling Big Data service for you.
Compare that time to other cloud vendors. I have seen some (Hadoop) clusters take 15 minutes to come up. In any event, you do not control the initialization process for Dataflow, so there is nothing for you to improve.

ML Units exceed the allowed maximum

When I try to submit a job for training a model in Google Cloud ML, I get the error below.
RESOURCE_EXHAUSTED: Quota failure for project my_project.
The requested 16.536900000000003 ML Units exceed the allowed maximum of 15. To read more about Cloud ML Engine quota, see https://cloud.google.com/ml-engine/quotas.
- '@type': type.googleapis.com/google.rpc.QuotaFailure
  violations:
  - description: The requested 16.536900000000003 ML Units exceed the allowed maximum of 15.
    subject: my_project
Now my question is: will this quota reset after a few hours or days? Or do I need to ask for an increase in ML Units? If so, how do I do that?
lleo's answer is correct; this section states the following:
The typical project, when first using Cloud ML Engine, is limited in the number of concurrent processing resources:
Concurrent number of ML training units: 15.
But to directly answer your question: that is a per-job quota (not something that "refills" after a few hours or days). So the immediate solution is to simply submit a job with fewer workers.
Having access to limited resources with new projects is pretty common on Google Cloud Platform, but you can email cloudml-feedback@google.com to see if you're eligible for an increase.
As stated by Guoqing Xu in the comments, there was an issue, which has been resolved on their side, and I can submit my job successfully now.
For me, there was only one job, and a light one at that. Moreover, there were no parallel computations going on. Hence, I was puzzled as to why the quota had been reached.
It's resolved now and working fine. Thanks, Google team, for resolving it. :)
You can read about the default limits here.
You may ask for a quota increase (on the API Manager page in the console). It looks like your issue is about the concurrent ML Units used, so you might either refactor your training pipeline or ask for an increased quota.