I'm running Vertex AI batch predictions with a custom XGBoost model and Explainable AI (Shapley values).
The explanation part is quite computationally intensive, so I've tried splitting the input dataset into chunks and submitting 5 batch prediction jobs in parallel. When I do this I receive a "Quota exhausted. Please reach to ai-platform-unified-feedback@google.com for batch prediction quota increase" error.
I don't understand why I'm hitting the quota. According to the docs there is a limit on the number of concurrent jobs for AutoML models but it doesn't mention custom models.
Is the quota perhaps on the number of instances the batch predictions are running on? I'm using an n1-standard-8 machine type for my predictions.
I've tried changing the instance type and launching fewer jobs in parallel, but I still get the same error.
According to the Google documentation for Vertex AI, for custom models the quota is on the number of machines running concurrently in the specified region.
You can request a quota increase following the instructions in the error message.
For more information on custom-trained model quotas, refer to this documentation.
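The number of concurrent machines is determined by the machine type and replica counts you pass when submitting the batch prediction job. A minimal sketch with the Vertex AI Python SDK (project, region, model ID and GCS paths below are placeholders):

```python
# Minimal sketch with the Vertex AI SDK; project, region, model ID and GCS
# paths are placeholders. The machine type and replica counts are what count
# against the concurrent-machine quota.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="europe-west2")

model = aiplatform.Model("projects/my-project/locations/europe-west2/models/1234567890")

model.batch_predict(
    job_display_name="xgb-batch-explain-chunk-1",
    gcs_source="gs://my-bucket/chunks/chunk-1.jsonl",
    gcs_destination_prefix="gs://my-bucket/predictions/chunk-1/",
    machine_type="n1-standard-8",
    starting_replica_count=1,
    max_replica_count=5,        # upper bound on concurrent machines for this job
    generate_explanation=True,  # Shapley explanations add extra resource usage
)
```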
After I reached out to Google support regarding this issue, they explained that the quota is based on the number of vCPUs used by the batch prediction job. The formula to calculate this is:
number of vCPUs per machine × number of machines (× 3 if explanations are enabled, because a separate node is spun up in that case, which requires additional resources)
For example, running a batch prediction with explanations on 50 e2-standard-4 machines results in 50 × 4 × 3 = 600 vCPUs being used in total.
The default quota for a Google project is 2,200 vCPUs for the europe-west2 region. Moreover, this limit is not visible in the user's Google project, but instead in a hidden project only visible to Google engineers. Thus, it is required to raise a support ticket if you need the quota to be increased.
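To turn this into a quick sanity check before launching several jobs in parallel, here is a minimal sketch of the formula above. The machine vCPU counts and the 2,200 vCPU quota are the figures mentioned in this answer; adjust them for your own project and region:

```python
# Apply the support team's formula: vCPUs per machine x machine count
# (x 3 when explanations are enabled). Values below are illustrative.
VCPUS = {"e2-standard-4": 4, "n1-standard-8": 8}
DEFAULT_QUOTA = 2200  # europe-west2 default, per Google support

def job_vcpus(machine_type: str, machine_count: int, with_explanations: bool) -> int:
    multiplier = 3 if with_explanations else 1
    return VCPUS[machine_type] * machine_count * multiplier

# The worked example: 50 e2-standard-4 machines with explanations -> 600 vCPUs.
print(job_vcpus("e2-standard-4", 50, True))

# Five such jobs in parallel would request 3,000 vCPUs, exceeding the default quota.
total = 5 * job_vcpus("e2-standard-4", 50, True)
print(total, "exceeds quota" if total > DEFAULT_QUOTA else "within quota")
```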
I wanted to get an estimate of my Apache Beam pipeline costs running on Google Cloud Dataflow. I am currently running an Apache Beam pipeline that scales automatically, using a for loop, and I am storing the data in Beam itself for around 12 hours before aggregating and processing it in the pipeline. Any ideas on how I can estimate the cost would be appreciated, as well as ways to optimize and minimize it.
Thanks!
To calculate the cost of your Dataflow job, you can find the resource metrics on the job detail page (the one showing the DAG and steps), on the right side:
Resource metrics:
Current vCPUs: 2
Current memory: 8 GB
Current HDD PD: 25 GB
Current SSD PD: 0 B
Total DCU usage: 0.14
https://cloud.google.com/dataflow/docs/guides/using-monitoring-intf
Then there is a link allowing you to calculate the cost of your job based on your resource metrics (workers, vCPUs, memory, disk usage, ...):
https://cloud.google.com/products/calculator
The price calculator covers both classic Dataflow jobs and Dataflow Prime.
Dataflow Prime is a newer, optimized execution engine that allows vertical autoscaling within a worker, among other features: https://cloud.google.com/dataflow/docs/guides/enable-dataflow-prime
You can also check this link : https://cloud.google.com/dataflow/pricing
For example, with Dataflow Prime and an average job duration of 7.5 hours per month, the calculator returns the estimated monthly cost for the resource metrics entered.
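If you want a rough back-of-the-envelope figure without the calculator, you can multiply the resource metrics by the per-hour rates from the pricing page. A minimal sketch for a classic (non-Prime) batch job; the rates below are placeholders, so substitute the current values for your region:

```python
# Rough cost estimate for a classic Dataflow batch job from its resource
# metrics. The rates are PLACEHOLDERS; look up the current per-region prices
# at https://cloud.google.com/dataflow/pricing.
VCPU_RATE_PER_HOUR = 0.056       # $/vCPU-hour (placeholder)
MEMORY_RATE_PER_HOUR = 0.003557  # $/GB-hour (placeholder)
HDD_PD_RATE_PER_HOUR = 0.000054  # $/GB-hour (placeholder)

job_hours = 7.5                       # average monthly runtime from the example above
vcpus, memory_gb, hdd_gb = 2, 8, 25   # values from the resource metrics panel

estimated_cost = job_hours * (
    vcpus * VCPU_RATE_PER_HOUR
    + memory_gb * MEMORY_RATE_PER_HOUR
    + hdd_gb * HDD_PD_RATE_PER_HOUR
)
print(f"Estimated monthly cost: ${estimated_cost:.2f}")
```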
Pricing for Dataflow is based on resource usage, billed per second (rates vary by location). For pricing details, please visit this link. If you want to know how to check your resource usage, see the Dataflow resource monitoring interface.
Also, for sample use cases and quotations you can contact Google's sales team.
I would like to train my model for 30 days using an AWS SageMaker training job, but its maximum runtime is 5 days. How can I resume the earlier job and proceed further?
Follow these steps:
Open a support ticket to increase "Longest run time for a training job" to 2419200 seconds (28 days). (This can't be adjusted using Service Quotas in the AWS web console.)
Using the SageMaker Python SDK, when creating an Estimator, set max_run=2419200.
Implement resuming from checkpoints in your training script (see the sketch below).
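A minimal sketch of steps 2 and 3 with the SageMaker Python SDK; the image URI, role ARN, instance type and S3 paths are placeholders, and max_run only takes effect once the quota from step 1 has been raised:

```python
# Minimal sketch: an Estimator with an extended max_run and checkpointing.
# SageMaker syncs /opt/ml/checkpoints to checkpoint_s3_uri, so a restarted
# job can load the latest checkpoint and resume.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",            # placeholder
    role="<sagemaker-execution-role-arn>",       # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",               # example instance type
    max_run=2419200,                             # 28 days, in seconds
    checkpoint_s3_uri="s3://<your-bucket>/checkpoints/",
)
estimator.fit({"training": "s3://<your-bucket>/train-data/"})
```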
Also, the questions in @rok's answer are very relevant to consider.
According to the documentation here, the maximum allowed runtime is 28 days, not 5; please check your configuration. You are right, according to the documentation here the maximum runtime for a training job is 5 days. There are several things you can do: use a more powerful GPU (or multiple GPUs) to reduce training time, or save checkpoints and restart training from them. In any case, 30 days looks like a very long training time (with the associated cost); are you sure you need that?
Actually, you could ask for a service quota increase from here, but as you can see, "Longest run time for a training job" is not adjustable. So I don't think you have any choice other than using checkpoints or more powerful GPUs.
I'm using AWS SageMaker Studio and I need to launch an ml.p2.xlarge instance for a Training Job to run the fit() function of a model. I need to run it multiple times, and I want to know whether AWS charges me every time I launch an instance or just for the minutes I use it.
For example, if I need to run it three times, would it be cheaper to launch an ml.p2.xlarge instance once and run the training job three times in the span of an hour, or launch the instance three times in that span for 6 minutes each?
The answer is generally to run 3 training jobs. This way you only pay for what you use, and there's no idle time wasted. One thing to note is that, per job, you also pay for the overhead of loading the training container, loading data into the training container, and stopping the instance. As long as this overhead is relatively small, it's worth it.
Example: (6 min net training + 4 min overhead) = 10 min × 3 = 30 min, vs. 60 min.
Another benefit of having one job per training run is separate metadata and results per job (metrics, logs, hyperparameters), easier comparison between jobs, the ability to quickly clone a job, per-job status, etc.
Empirically: you can run one training job and multiply its billable time (and cost) by 3 to estimate the total.
In SageMaker Training you pay by the second ("billable seconds"). You can see this figure in the training job details in the web console (or via describe-training-job API call).
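For example, a minimal sketch with boto3 to pull those figures for a finished job (the job name is a placeholder):

```python
# Fetch the actual and billed training durations of a completed training job.
# "my-training-job" is a placeholder job name.
import boto3

sm = boto3.client("sagemaker")
job = sm.describe_training_job(TrainingJobName="my-training-job")

print("Training seconds:", job.get("TrainingTimeInSeconds"))
print("Billable seconds:", job.get("BillableTimeInSeconds"))
```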
In GCP AutoML Natural Language, the training items limit is 1,000,000 items. I will need to increase that to much more than 1 million. How do I do that? Is it done through GCP support, or is there another option? I do not see it listed under editable quotas.
The difference between a quota and a limit is that a limit cannot be adjusted, whereas you can request a quota increase. If I understand correctly, the 1,000,000 training items is a limit for AutoML and thus cannot be changed.
When I'm trying to submit a job for training a model in Google Cloud-ML, I'm getting the below error.
RESOURCE_EXHAUSTED: Quota failure for project my_project.
The requested 16.536900000000003 ML Units exceed the allowed maximum of 15. To read more about Cloud ML Engine quota, see https://cloud.google.com/ml-engine/quotas.
- '@type': type.googleapis.com/google.rpc.QuotaFailure
  violations:
  - description: The requested 16.536900000000003 ML Units exceed the allowed maximum of 15.
    subject: my_project
Now my question is, will this quota reset after a few hours or days? Or do I need to ask for an increase in ML Units? If so, how do I do that?
lleo's answer is correct; this section states the following:
The typical project, when first using Cloud ML Engine, is limited in the number of concurrent processing resources:
Concurrent number of ML training units: 15.
But to directly answer your question: that is a per-job quota (not something that "refills" after a few hours or days). So the immediate solution is to simply submit a job with fewer workers.
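For reference, the number of ML training units a job consumes is driven by the machines in its TrainingInput, so reducing the worker count (or choosing smaller machine types) brings the job under the quota. A hypothetical sketch of such a spec (field names follow the Cloud ML Engine / AI Platform Training REST API; the bucket and module names are placeholders):

```python
# A hypothetical TrainingInput with a reduced worker count to stay under the
# 15 concurrent ML training unit quota. Bucket and module names are placeholders.
# This dict would be passed as the "trainingInput" field of projects.jobs.create.
training_input = {
    "scaleTier": "CUSTOM",
    "masterType": "standard",
    "workerType": "standard",
    "workerCount": 4,                 # fewer workers -> fewer ML units consumed
    "parameterServerType": "standard",
    "parameterServerCount": 2,
    "region": "us-central1",
    "pythonModule": "trainer.task",
    "packageUris": ["gs://<your-bucket>/packages/trainer.tar.gz"],
}
```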
Having access to limited resources with new projects is pretty common on Google Cloud Platform, but you can email cloudml-feedback@google.com to see if you're eligible for an increase.
As stated by Guoqing Xu in the comments, there was an issue on their side, which has now been resolved, and I can submit my job successfully.
In my case there was only one job, and a light one at that; moreover, there were no parallel computations going on. Hence, I was puzzled as to why the quota had been reached.
It's resolved now and working fine. Thanks to the Google team for resolving it. :)
You can read about the default limits here.
You may ask for an increase to the quota (on the API Manager page in the console). It looks like your issue is with the number of concurrent ML units used, so you can either refactor your training pipeline or ask for an increased quota.