AutoML Training items limit - google-cloud-platform

In GCP AutoML Natural Language, the training items limit is 1,000,000 items. I need to increase that to well over 1 million. How do I do that? Is it done through GCP support, or is there another option? I do not see it listed under Edit Quotas.

The difference between a quota and a limit is that a limit cannot be adjusted, whereas you can request a quota increase. If I understand correctly, the 1,000,000 training items is a limit for AutoML and thus cannot be changed.

Related

Reaching batch prediction quota limit when not submitting that many batch predictions

I'm running Vertex AI batch predictions with a custom XGBoost model and Explainable AI (Shapley values).
The explanation part is quite computationally intensive, so I've tried to split the input dataset into chunks and submit 5 batch prediction jobs in parallel. When I do this I receive "Quota exhausted. Please reach to ai-platform-unified-feedback@google.com for batch prediction quota increase".
I don't understand why I'm hitting the quota. According to the docs, there is a limit on the number of concurrent jobs for AutoML models, but it doesn't mention custom models.
Is the quota perhaps on the number of instances the batch predictions are running on? I'm using an n1-standard-8 instance for my predictions.
I've tried changing the instance type and launching fewer jobs in parallel, but I'm still getting the same error.
According to the Vertex AI documentation, for custom models the quota is on the number of machines running concurrently in the specified region.
You can request a quota increase by following the instructions in the error message.
For more information on custom-trained model quotas, refer to this documentation.
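For reference, the machine type and replica counts that count against this concurrent-machine quota are set when the batch prediction job is created. A minimal sketch with the google-cloud-aiplatform SDK, where the project, region, model ID, bucket paths, and replica counts are all placeholders:
    from google.cloud import aiplatform

    # Placeholder project and region.
    aiplatform.init(project="my-project", location="europe-west2")
    # Placeholder model resource name.
    model = aiplatform.Model("projects/my-project/locations/europe-west2/models/1234567890")
    # machine_type and max_replica_count determine how many machines (and therefore vCPUs)
    # the job may run concurrently, which is what the quota check applies to.
    job = model.batch_predict(
        job_display_name="xgb-batch-chunk-1",
        gcs_source="gs://my-bucket/chunks/chunk-1.jsonl",      # placeholder input
        gcs_destination_prefix="gs://my-bucket/predictions/",  # placeholder output
        instances_format="jsonl",
        machine_type="n1-standard-8",
        starting_replica_count=1,
        max_replica_count=2,
        generate_explanation=True,  # Shapley explanations, as in the question
        sync=False,                 # return immediately so several chunks can run in parallel
    )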
After I reached out to Google support regarding this issue, they explained that the quota is based on the number of vCPUs used in the batch prediction job. The formula to calculate this is:
vCPUs per machine × number of machines (× 3 if explanations are enabled, because a separate node is spun up in that case, which requires additional resources)
For example, running a batch prediction with explanations on 50 e2-standard-4 machines uses 50 * 4 * 3 = 600 vCPUs in total.
The default quota for a Google project is 2,200 vCPUs in the europe-west2 region. Moreover, this limit is not visible in your own Google project, but only in a hidden project visible to Google engineers. You therefore have to raise a support ticket if you need the quota increased.
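As a quick back-of-the-envelope check against that formula (the per-machine vCPU counts below are the usual values for those shapes, but confirm them against the current machine-type documentation):
    # Rough vCPU estimate for a Vertex AI batch prediction job, per the formula above.
    VCPUS_PER_MACHINE = {  # common shapes; verify against the current GCP machine-type docs
        "e2-standard-4": 4,
        "n1-standard-8": 8,
    }
    def batch_prediction_vcpus(machine_type: str, machine_count: int, explanations: bool) -> int:
        vcpus = VCPUS_PER_MACHINE[machine_type] * machine_count
        # Support described a 3x multiplier when explanations are enabled,
        # because a separate node is spun up for the explanation work.
        return vcpus * 3 if explanations else vcpus
    print(batch_prediction_vcpus("e2-standard-4", 50, True))  # 600, as in the example above
    print(batch_prediction_vcpus("n1-standard-8", 5, True))   # 120 for five single-machine jobs
Comparing that number against the hidden regional quota (e.g. 2,200 vCPUs for europe-west2) shows how quickly parallel jobs with explanations add up.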

Training job runtime exceeded MaxRuntimeInSeconds provided

I would like to train my model for 30 days using an AWS SageMaker training job, but its maximum runtime is 5 days. How can I resume the earlier job and continue training?
Follow these steps:
1. Open a support ticket to increase "Longest run time for a training job" to 2419200 seconds (28 days). (This cannot be adjusted via Service Quotas in the AWS web console.)
2. Using the SageMaker Python SDK, when creating an Estimator, set max_run=2419200.
3. Implement resume-from-checkpoints in your training script (see the sketch below).
Also, the questions in @rok's answer are very relevant to consider.
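A rough sketch of steps 2 and 3 with the SageMaker Python SDK; the image URI, role, bucket paths, and checkpoint layout are placeholders, and /opt/ml/checkpoints is the local directory SageMaker keeps in sync with checkpoint_s3_uri:
    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri="<your-training-image>",                # placeholder
        role="<your-sagemaker-execution-role>",           # placeholder
        instance_count=1,
        instance_type="ml.p3.2xlarge",                    # placeholder
        max_run=2419200,                                  # 28 days, once the quota increase is granted
        checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # placeholder; synced to /opt/ml/checkpoints
        output_path="s3://my-bucket/output/",             # placeholder
    )
    estimator.fit("s3://my-bucket/training-data/")        # placeholder input channel
Inside the training script, write checkpoints to /opt/ml/checkpoints periodically and, on startup, load the most recent one if it exists (load_checkpoint and build_fresh_model are hypothetical helpers for your own framework):
    import os

    CHECKPOINT_DIR = "/opt/ml/checkpoints"
    existing = sorted(os.listdir(CHECKPOINT_DIR)) if os.path.isdir(CHECKPOINT_DIR) else []
    if existing:
        model = load_checkpoint(os.path.join(CHECKPOINT_DIR, existing[-1]))  # hypothetical helper
    else:
        model = build_fresh_model()                                          # hypothetical helper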
According to the documentation here, the maximum allowed runtime is 28 days, not 5; please check your configuration. You are right, according to the documentation here the maximum runtime for a training job is 5 days. There are a few things you can do: use more powerful (or multiple) GPUs to reduce training time, or save checkpoints and restart training from them. In any case, 30 days is a very long training time (with the associated cost); are you sure you need that?
Actually, you could ask for a service quota increase from here, but as you can see, "Longest run time for a training job" is not adjustable there. So I don't think you have any choice other than using checkpoints or more powerful GPUs.

How can I increase the CPU quota of a VM instance in Google Compute Engine when the requests keep getting rejected?

I have a school project about parallel programming. To get results and see whether they are good enough, I need to increase my CPU quota. I applied for CPU quota increases of several amounts (16, 24, 96) and in several regions, but the requests were always rejected.
The ways that I have tried:
All Quotas > select the CPUs > select the region > Edit Quotas > enter the new quota > send.
Edit the VM instance > select a higher-CPU option > Error: "Quota CPUs exceeded. Limit is 8 in region east-west-6."
Write to the Sales Support Team (they haven't answered yet).
Write to Google Cloud Platform Support; they say: "However after careful evaluation, we have determined that we are unable to grant your quota increase due to insufficient service usage history within your preferred project. We suggest for you to make use of your current quotas and other resources readily available to serve your purposes for the meantime. To discuss further options on higher quota eligibility and to answer your questions, please reach out to your Sales team [1]" (Actually, I have used this project for a while, at least 5-6 months, not every day, but frequently.)
I currently have 8 CPUs in my VM instance in Compute Engine. I was using free credit, by the way, but my credit card is added too. I need a higher quota. So, what steps should I follow to get my CPU quota increased?
Try upgrading your free trial account to a paid account. I checked on my own account and indeed there are per-region quota limitations (24 CPUs for the N1 type, 8 for other types such as N2, N2D, and C2).
Maybe with a paid account your quota increase request will be accepted. In my case, it was accepted within 2 minutes (N2 type, request for 16 CPUs, all regions).
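If it helps while waiting on support, you can at least confirm your current per-region CPU quotas and usage programmatically; a small sketch with the google-cloud-compute client, where the project and region are placeholders:
    from google.cloud import compute_v1

    def print_cpu_quotas(project: str, region: str) -> None:
        region_info = compute_v1.RegionsClient().get(project=project, region=region)
        for quota in region_info.quotas:
            # CPU-related metrics include CPUS, N2_CPUS, C2_CPUS, ...
            if "CPUS" in quota.metric:
                print(f"{quota.metric}: {quota.usage} used of {quota.limit}")

    print_cpu_quotas("my-school-project", "europe-west6")  # placeholder project and region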

ML Units exceed the allowed maximum

When I try to submit a job to train a model in Google Cloud ML Engine, I get the error below.
RESOURCE_EXHAUSTED: Quota failure for project my_project.
The requested 16.536900000000003 ML Units exceed the allowed maximum of 15. To read more about Cloud ML Engine quota, see https://cloud.google.com/ml-engine/quotas.
- '@type': type.googleapis.com/google.rpc.QuotaFailure
  violations:
  - description: The requested 16.536900000000003 ML Units exceed the allowed maximum of 15.
    subject: my_project
Now my question is, will this quota reset after a few hours or days? Or do I need to ask for an increase in ML units? If so, how do I do that?
lleo's answer is correct; this section states the following:
The typical project, when first using Cloud ML Engine, is limited in the number of concurrent processing resources:
Concurrent number of ML training units: 15.
But to directly answer your question: that is a per-job quota (not something that "refills" after a few hours or days). So the immediate solution is to simply submit a job with fewer workers.
Having access to limited resources with new projects is pretty common on Google Cloud Platform, but you can email cloudml-feedback@google.com to see if you're eligible for an increase.
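The ML-unit footprint of a job comes from its scale tier and machine configuration, so the quickest fix is to submit a smaller job. A rough sketch of a CUSTOM-tier job spec with fewer workers, submitted through the ML Engine REST API via googleapiclient (the job ID, package, module, bucket, and region are placeholders):
    from googleapiclient import discovery

    ml = discovery.build("ml", "v1")
    job_spec = {
        "jobId": "my_training_job_small",  # placeholder
        "trainingInput": {
            "scaleTier": "CUSTOM",
            "masterType": "standard",
            "workerType": "standard",
            "workerCount": 2,              # fewer workers -> fewer concurrent ML units
            "parameterServerType": "standard",
            "parameterServerCount": 1,
            "packageUris": ["gs://my-bucket/trainer-0.1.tar.gz"],  # placeholder
            "pythonModule": "trainer.task",                        # placeholder
            "region": "us-central1",                               # placeholder
        },
    }
    ml.projects().jobs().create(parent="projects/my_project", body=job_spec).execute()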
As stated by Guoqing Xu in the comments, there was an issue on their side. It has been resolved and I can submit my job successfully now.
In my case, there was only one job, and a light one at that. Moreover, there were no parallel computations going on, so I was puzzled as to why the quota had been reached.
It's resolved now and working fine. Thanks to the Google team for resolving it. :)
You can read about the default limits here.
You may ask for a quota increase (on the API Manager page in the console). It looks like your issue is about the concurrent ML units in use, so you can either refactor your training pipeline or ask for an increased quota.

Is there a way to realistically model or estimate AWS usage?

This question is specifically about AWS and S3, but it could apply to other cloud services as well.
Amazon charges for S3 by storage (which is easily estimated as the amount of data stored times the price).
But it also charges for requests, which is really hard to estimate: a page that has one image stored in S3 technically generates 1 request per user per visit, but caching reduces that. Furthermore, how can I understand the costs with 1,000 users?
Are there tools that will extrapolate from my current usage to give me estimates?
As you mention, it depends on a lot of different factors. Calculating the cost per GB is not that hard, but estimating the number of requests is a lot more difficult.
There are no tools that I know of that will calculate AWS S3 costs based on historic access logs or the like. Such calculations would not be very accurate anyway.
The best you can do is calculate the costs based on the worst-case scenario: assume that nothing will be cached and that you will get peak requests all the time. In 99% of cases, the real costs will come in below that worst-case estimate.
If the outcome of that calculation is acceptable pricing-wise, you're good to go. If it is way more than your budget allows, then you should think about ways to lower those costs (caching being one of them).
Cost calculation beforehand is purely to indicate whether the project or environment can realistically stay below budget; it's not meant to provide a 100% accurate estimate up front. The most important thing is to keep track of the costs after everything has been deployed: set up billing/budget alerts and check for possible savings.
The AWS pricing calculator should help you get started: https://calculator.aws/
Besides using the calculator, I tend to prefer the actual pricing pages of each individual service and calculating the costs in a spreadsheet. This gives me a more in-depth overview of the actual costs.
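As an illustration of that worst-case approach, here is a tiny sketch of the kind of spreadsheet math involved; the prices are placeholders, so take the current numbers from the S3 pricing page, and data transfer out is ignored for brevity:
    # Worst-case monthly S3 cost: no caching, every page view fetches every object.
    STORAGE_PRICE_PER_GB = 0.023  # placeholder USD per GB-month; check the S3 pricing page
    GET_PRICE_PER_1000 = 0.0004   # placeholder USD per 1,000 GET requests

    def worst_case_monthly_cost(storage_gb, users, visits_per_user, objects_per_page):
        storage_cost = storage_gb * STORAGE_PRICE_PER_GB
        get_requests = users * visits_per_user * objects_per_page
        request_cost = get_requests / 1000 * GET_PRICE_PER_1000
        return storage_cost + request_cost

    # 50 GB stored, 1,000 users, 30 visits each per month, 20 S3-hosted objects per page:
    print(round(worst_case_monthly_cost(50, 1000, 30, 20), 2))  # 1.39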