ZONE_RESOURCE_POOL_EXHAUSTED for Dataflow & Dataprep - google-cloud-platform

Alright team... I have Dataprep jobs writing into BigQuery, and I cannot for the life of me figure out why I have been hitting the ZONE_RESOURCE_POOL_EXHAUSTED error for the past 5 hours. Last night everything was working great, but today I am having some serious issues.
Can anyone give any insight into how to change the resource pool for the Dataflow jobs that Dataprep launches? I can't even get a basic column transform to push through.
Looking forward to anyone helping me with this, because honestly this issue is one of those "just change this and maybe that will fix it, and if not, maybe in a few weeks it'll work" situations.
Here is the error in a screenshot: https://i.stack.imgur.com/Qi4Dg.png
UPDATE:
I believe some of my issue may be related to GCP Compute incident 18012, especially since it is a us-central issue affecting instance creation.

The incident you mentioned was actually resolved on November 5th and was only affecting the us-central1-a zone. Seeing that your question was posted on November 10th and other users in the comments got the error in the us-central1-b zone, the error is not related to the incident you linked.
As the error message suggests, this is a resource availability issue. These scenarios are rare and are usually resolved quickly. If this ever happens in the future, using Compute Engine instances in other regions/zones will solve the issue. To do so using Dataprep, as mentioned in the comment, after the job is launched from Dataprep, you can re-run the job from Dataflow while specifying the region/zone you would like to run the job in.
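If it helps, here is a minimal sketch of how the region/zone can be pinned when you end up submitting the pipeline yourself with the Apache Beam Python SDK on the Dataflow runner (rather than cloning the Dataprep-generated job in the console). The project, bucket, region and zone values below are placeholders, and the --worker_zone option assumes a reasonably recent Beam SDK (older SDKs use --zone):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder project/bucket; pick a region/zone that currently has capacity.
    options = PipelineOptions([
        "--runner=DataflowRunner",
        "--project=my-project",
        "--region=us-east1",
        "--worker_zone=us-east1-b",   # optional: pin workers to a specific zone
        "--temp_location=gs://my-bucket/tmp",
    ])

    # Trivial pipeline, just to show where the options are passed.
    with beam.Pipeline(options=options) as p:
        _ = p | beam.Create(["hello"]) | beam.Map(print)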

Related

AlphaFold on VertexAI - Stuck in setting up notebook for 2 hours

I am trying to run AlphaFold on VertexAI as explained here. However, my instance creation has been stuck in this state for roughly two hours now. There is no error message either. I am wondering if something has gone wrong or if this is just the expected time it takes to set up a new instance.
I actually tried with two different notebooks. One is the default one linked in the above article and the other is https://raw.githubusercontent.com/deepmind/alphafold/main/notebooks/AlphaFold.ipynb
Both are in the same state for roughly the same time.
I finally gave up and canceled the notebook creation. When I went back to the Workbench screen, only THEN did it display this error message:
So, it turns out that the new Google Cloud account I created has no quota for GPUs. In order to increase the quota, I first had to upgrade to a full GCP account. And now I need to wait a couple of days before I can actually request the quota increase, because I got this automated response when I submitted the quota increase request.
I have also contacted Sales via the link given at the end of that email to see if they can escalate the process in any way.
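For anyone hitting the same wall, a quick sanity check before creating the notebook is to look at the GPU quotas through the Compute Engine API. Here is a rough sketch with the google-cloud-compute Python client; the project and region are placeholders, and the exact quota metric names can vary per account:

    from google.cloud import compute_v1

    PROJECT = "my-project"   # placeholder project ID
    REGION = "us-central1"   # placeholder region

    # Per-region quotas (e.g. NVIDIA_T4_GPUS): usage vs. limit.
    region = compute_v1.RegionsClient().get(project=PROJECT, region=REGION)
    for quota in region.quotas:
        if "GPU" in str(quota.metric):
            print(f"{quota.metric}: {quota.usage}/{quota.limit}")

    # Project-wide quotas; brand-new accounts typically have GPUS_ALL_REGIONS = 0.
    project = compute_v1.ProjectsClient().get(project=PROJECT)
    for quota in project.quotas:
        if "GPU" in str(quota.metric):
            print(f"{quota.metric}: {quota.usage}/{quota.limit}")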

AWS SageMaker training job stuck in progress state

I created a training job yesterday, same as usual, just adding a bit more training data. I haven't had any problem with this in the last 2 years (the exact same procedure and code). This time, after roughly 14 hours, it simply stalled.
The training job is still "in progress", but CloudWatch has not logged anything since then. Right now 8 more hours have passed and there is no new entry in the logs, no errors, no crash.
Can someone explain this? Unfortunately I don't have an AWS support plan.
As you can see from the picture below, after 11am there is nothing.
The training job is supposed to complete in the next couple of hours, but now I'm not sure whether it is actually running (in which case this would be a CloudWatch problem) or not.
UPDATE
Suddenly the training job failed, without any further log. The reason is
ClientError: Artifact upload failed:Error 7: The credentials received
have been expired
But there is still nothing in the logs after 11am. Very weird.
For future readers, I can confirm this is something that can happen, although very rarely (I haven't experienced it again since then), and it's on AWS's side. Same data, same algorithm.
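If you run into this and want to check whether the job is really still running without relying on CloudWatch, the status, secondary status and failure reason can be pulled straight from the SageMaker API. A small sketch with boto3; the job name and region are placeholders:

    import boto3

    # Placeholder training job name and region.
    sm = boto3.client("sagemaker", region_name="us-east-1")
    job = sm.describe_training_job(TrainingJobName="my-training-job")

    print(job["TrainingJobStatus"])   # InProgress / Completed / Failed / Stopped
    print(job["SecondaryStatus"])     # e.g. Training, Uploading, ...
    print(job.get("FailureReason", "no failure reason reported"))

    # The secondary status transitions give a rough timeline of where it got stuck.
    for t in job["SecondaryStatusTransitions"]:
        print(t["Status"], t["StartTime"], t.get("StatusMessage", ""))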

istio-operator image is no longer available in the public istio-release registry on GCR

We have recently come across an issue on one of our cluster pods, which caused an outage on our application and impacted our customers.
Here is the thing: we had been able to pull the gke.gcr.io/istio/operator:1.6.3 image from GCR, but it started failing overnight.
Eventually, we noticed that this image is no longer available in the public istio-release registry on gcr.io, causing an ImagePullBackOff failure. However, we are still able to find it on docker.io.
Having said that, we're sticking with the workaround of pulling the image from docker.io/istio/operator:1.6.3, which is pretty straightforward for now. Nevertheless, we're still skeptical and wondering why this image suddenly vanished from gcr.io.
Has anyone been facing something similar?
Best regards.
I did some research but I can't find anything related.
As I mentioned in the comments, I strongly suggest you keep all critical images in a private container registry. Using this approach you can avoid incidents like this and gain some extra control over the images, such as versioning, security, etc.
There are many guides on the internet for setting up your own private container registry, such as Nexus; if you want to use a managed service, you can try Google Container Registry.
Keep in mind that when you are working in a critical environment, you need to try to minimize the variables to keep your service as resilient as possible.
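As a rough illustration of the mirroring idea, here is a sketch using the Docker SDK for Python to copy the image from Docker Hub into your own registry. The gcr.io path is a placeholder and you need push access configured for it (e.g. via gcloud auth configure-docker):

    import docker

    SOURCE = "docker.io/istio/operator"
    MIRROR = "gcr.io/my-project/istio/operator"   # placeholder private registry path
    TAG = "1.6.3"

    client = docker.from_env()

    # Pull the public image, retag it for the private registry, then push it.
    image = client.images.pull(SOURCE, tag=TAG)
    image.tag(MIRROR, tag=TAG)
    for line in client.images.push(MIRROR, tag=TAG, stream=True, decode=True):
        print(line)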
I noticed a short downtime on one of our services deployed to GKE, and istio-operator was listed with a red warning.
The log was:
Back-off pulling image "gke.gcr.io/istio/operator:1.6.4": ImagePullBackOff
Since istio-operator is a workload that GKE manages, I was hesitant, but the downtime repeated a couple of times for a couple of minutes each, so I also edited the workload YAML and updated the image to the docker.io one.
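For reference, that image swap can also be done programmatically. A minimal sketch with the official Kubernetes Python client, assuming the operator runs as a Deployment named istio-operator in the istio-operator namespace (adjust both names to your cluster, and keep in mind GKE may reconcile managed workloads back to the original image):

    from kubernetes import client, config

    config.load_kube_config()   # or config.load_incluster_config() inside the cluster

    # Point the container at the docker.io copy of the image.
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": "istio-operator",
                         "image": "docker.io/istio/operator:1.6.4"}
                    ]
                }
            }
        }
    }

    client.AppsV1Api().patch_namespaced_deployment(
        name="istio-operator", namespace="istio-operator", body=patch)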

Does not have enough resources available to fulfill the request; try a different zone

All of my machines in different zones have the same issue and cannot run.
"Starting VM instance "home-1" failed.
Error:
The zone 'projects/extreme-pixel-208800/zones/us-west1-b' does not have enough resources available to fulfill the request. Try a different zone, or try again later."
I am having the same issue. I emailed Google and found out this has nothing to do with quotas. However, you can try to decrease what your instance needs (e.g. less RAM, fewer CPUs or GPUs). It might work if you are lucky.
Secondly, if you email Google again, you will get a message based on the following template.
Good day! This is XX from Google Cloud Platform Support and I'll be
glad to help you from here. First, my apologies that you’re
experiencing this issue. Rest assured that the team is working hard to
resolve it.
Our goal is to make sure that there are available resources in all
zones. This type of issue is rare, when a situation like this occurs
or is about to occur, our team is notified immediately and the issue
is investigated.
We recommend deploying and balancing your workload across multiple
zones or regions to reduce the likelihood of an outage. Please review
our documentation [1] which outlines how to build resilient and
scalable architectures on Google Cloud Platform.
Again, we want to offer our sincerest apologies. We are working hard
to resolve this and make this an exceptionally rare event. I'll be
keeping this case open for one (1) business day in case you have
additional question related to this matter, otherwise you may
disregard this email for this ticket to automatically close.
All the best,
XXXX Google Cloud Platform Support
[1] https://cloud.google.com/solutions/scalable-and-resilient-apps
So, if you ask me how long you should expect to wait and when this issue is likely to happen:
I waited somewhere between 1.5 and 3 days on average.
During weekend daytime EST (roughly Friday to Sunday), GCP has a high probability of not having resources available.
Usually, when one instance has this issue, the others do too. For me, retrying in different regions wasted my time (but maybe I just didn't have any luck).
The error message "The zone 'projects/[...]' does not have enough resources available to fulfill the request. Try a different zone, or try again later." is always in reference to a shortage of resources in a zone.
Google recommends spreading your workload across different zones to reduce the impact of these issues. Otherwise, there isn't much else to do other than wait or try another zone/region.
I faced this issue yesterday [01/Aug/2020] when my GCP free credit was over, and the steps below helped me work around it.
I was in the asia-south-c zone and moved to a us zone.
Go to Google Cloud Platform >>> Compute Engine.
Go to Snapshots >>> create a snapshot >>> select your Compute Engine instance.
Once the snapshot is complete, click on it.
You end up under "Snapshot details". There, at the top, just click "Create instance". Here you are basically creating an instance with a copy of your disk.
Select your new zone, don't forget to attach GPUs and all your previous settings, and give it a new name.
Click Create; that's it, your image should now be running in your new zone.
No need to worry about losing your configuration either.
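The same snapshot-and-recreate recipe can be scripted if you have to do it more than once. Below is a rough sketch using the google-cloud-compute Python client; the project, zones, disk name and machine type are placeholders, and settings such as GPUs, networks and service accounts still have to be carried over by hand:

    from google.cloud import compute_v1

    PROJECT = "my-project"          # placeholder
    SOURCE_ZONE = "us-west1-b"      # zone that ran out of resources
    TARGET_ZONE = "us-central1-b"   # zone to move to
    DISK = "home-1"                 # boot disk of the stuck instance
    SNAPSHOT = "home-1-snapshot"

    # 1. Snapshot the boot disk and wait for it to finish.
    compute_v1.DisksClient().create_snapshot(
        project=PROJECT, zone=SOURCE_ZONE, disk=DISK,
        snapshot_resource=compute_v1.Snapshot(name=SNAPSHOT),
    ).result()

    # 2. Recreate the VM in the new zone from that snapshot.
    boot_disk = compute_v1.AttachedDisk(
        boot=True,
        auto_delete=True,
        initialize_params=compute_v1.AttachedDiskInitializeParams(
            source_snapshot=f"global/snapshots/{SNAPSHOT}"),
    )
    instance = compute_v1.Instance(
        name="home-1-moved",
        machine_type=f"zones/{TARGET_ZONE}/machineTypes/n1-standard-4",
        disks=[boot_disk],
        network_interfaces=[
            compute_v1.NetworkInterface(network="global/networks/default")],
    )
    compute_v1.InstancesClient().insert(
        project=PROJECT, zone=TARGET_ZONE, instance_resource=instance,
    ).result()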

Why do my #AWS EC2 spot instance requests get stuck in pending-evaluation status?

When I try to launch an EC2 spot instance, the instance almost immediately goes into status = pending-evaluation and stays there indefinitely.
My bid price is far above the current spot price, and I have no trouble launching dedicated instances.
Why is this happening? Has anyone had a similar problem?
Can't answer the "why" but regarding "has anyone had a similar problem" - yes, many people have had this same issue over many years.
Search the AWS support forums for "pending-evaluation" and you'll have a lot of threads to read up on: https://forums.aws.amazon.com/search.jspa?mbtc=70747e76394e723be38d774e355fd542ab723cc7d767e298a5626f55fd475590&threadID=&q=%22pending-evaluation%22&objID=&userID=&dateRange=all&numResults=30&rankBy=9
Notably, quite a few of them have had responses from AWS support saying something along the lines of "we found an issue with your account, it should be fixed now". However, many posts have remained unanswered, so unless you're paying for support it seems you might need luck on your side to get your issue resolved.
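If you want to see what a request is actually waiting on, the status code and message are exposed through the EC2 API. A small sketch with boto3; the region and request ID are placeholders:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Placeholder spot request ID; the Status block explains pending-evaluation
    # and any capacity, price, or limit problems the request runs into.
    resp = ec2.describe_spot_instance_requests(
        SpotInstanceRequestIds=["sir-0123456789abcdef0"])

    for req in resp["SpotInstanceRequests"]:
        print(req["SpotInstanceRequestId"], req["State"])
        print(req["Status"]["Code"], "-", req["Status"]["Message"])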