GCP VM can't start or move TERMINATED instance - google-cloud-platform

I'm running into a problem starting my Google Cloud VM instance. I wanted to restart the instance, so I hit the stop button, but that was just the beginning of a bigger problem.
The start then failed with an error that the zone did not have enough capacity. Message:
The zone 'XXX' does not have enough resources available to fulfill the request. Try a different zone, or try again later.
I tried and retried until I decided to move it to another zone and ran:
gcloud compute instances move VM_NAME --destination-zone NEW_ZONE
I then get the error:
Instance cannot be moved while in state: TERMINATED
What am I supposed to do???
I'm assuming that this is a basic enough issue that there's a common way to solve it.
Thanks
Edit: I have since managed to start the instance but would like to know what to do next time

The correct solution depends on your criteria.
I assume you're using preemptible instances for their cost savings, but as you've seen, there's a price: non-preemptible workloads are given priority, and (more frequently than for regular cores) there are insufficient preemptible cores available.
While it's reasonable to want to, you cannot move stopped instances between zones in a region.
I think there are a few options:
Don't use Preemptible. You'll pay more but you'll get more flexibility.
Use Managed Instance Groups (MIGs) to maintain ~1 instance in the region or zone (a gcloud sketch follows this list)
(for completeness) consider using containers and perhaps Cloud Run or Kubernetes
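For option 2, here is a minimal sketch of a regional MIG that keeps a single instance alive; the template name, group name, machine type, image, and region are all placeholder assumptions:
# Create a template describing the instance to maintain.
gcloud compute instance-templates create my-template \
    --machine-type=e2-medium \
    --image-family=debian-12 --image-project=debian-cloud
# A regional MIG of size 1 can recreate the instance in whichever zone
# of the region currently has capacity.
gcloud compute instance-groups managed create my-mig \
    --region=us-central1 \
    --template=my-template \
    --size=1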
You describe wanting to restart your instance. Perhaps this was because you made some changes to it. If this is the case, you may wish to consider treating your instances as being more disposable.
When you wish to make changes to the workload:
IMPORTANT: ensure you're preserving any important state outside of the instance (e.g. on persistent disks, in Cloud Storage, or in a database)
create a new instance (at this time, you will be able to find a zone with capacity for it; a gcloud sketch follows the note below)
once the new instance is running correctly, delete the prior version
NB Both options 2 (MIGs) and 3 (Cloud Run|Kubernetes) above implement this practice.
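For the recreate flow above, here is a minimal gcloud sketch using a disk snapshot; every name and zone is a placeholder assumption:
# Snapshot the stopped instance's boot disk.
gcloud compute disks snapshot my-disk --zone=us-central1-a \
    --snapshot-names=my-snapshot
# Recreate the disk from the snapshot in a zone that has capacity...
gcloud compute disks create my-new-disk \
    --source-snapshot=my-snapshot --zone=us-central1-b
# ...then boot a new instance from it.
gcloud compute instances create my-new-vm --zone=us-central1-b \
    --disk=name=my-new-disk,boot=yes
# Once the new instance is running correctly, delete the old one.
gcloud compute instances delete my-old-vm --zone=us-central1-a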

Related

Google Cloud VM Instance Stuck on resizing suggested by Console

I had a VM instance running on Google Cloud, and the console suggested: "you should resize the instance to 2 CPUs and 16GB RAM from 4 CPUs and 16GB RAM".
I pressed Apply to set the new config. The instance stopped and has been stuck in the resize process for an hour; it neither shows as resized in gcloud instance list nor starts up.
Even trying to take a snapshot of that VM's disk shows an error that "it's being used in some operations".
I tried to force stop it via gcloud, but no luck. The notification pop-up shows only the resizing operation.
Please help me here.
The main reason for this issue is GCP resource availability, which depends on user demand and is therefore dynamic. As a result, issues like this can happen when you use cloud resources on demand without a reservation.
Let's have a look at the cause of this issue:
when you stop an instance it releases some resources like vCPU and memory;
when you start an instance it requests resources like vCPU and memory back;
when you resize your VM it does both, releasing the old resources and requesting the new ones.
If there aren't enough resources available in the zone, you'll get an error message:
The zone 'projects/xyz-project-272905/zones/asia-south1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.
You can find more details in the documentation:
If you receive a resource error (such as ZONE_RESOURCE_POOL_EXHAUSTED
or ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS) when requesting new
resources, it means that the zone cannot currently accommodate your
request. This error is due to Compute Engine resource obtainability,
and is not due to your Compute Engine quota.
There are a few ways to solve your issue:
Move your instance to another zone by following these instructions.
Wait for a while and try to resize your VM instance again.
Reserve resources for your VM by following the documentation, to avoid such issues in the future (extra payment will be required); a gcloud sketch follows the quoted documentation below:
Create reservations for Virtual Machine (VM) instances in a specific
zone, using custom or predefined machine types, with or without
additional GPUs or local SSDs, to ensure resources are available for
your workloads when you need them. After you create a reservation, you
begin paying for the reserved resources immediately, and they remain
available for your project to use indefinitely, until the reservation
is deleted.
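For option 3, a minimal sketch of creating such a reservation with gcloud; the name, zone, count, and machine type are placeholder assumptions:
# Reserve capacity for one e2-medium VM in a specific zone.
# Billing for the reserved resources starts as soon as this succeeds.
gcloud compute reservations create my-reservation \
    --zone=asia-south1-a \
    --vm-count=1 \
    --machine-type=e2-medium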

GCP Compute Engine Resource Not Available

I have 4 VM instances in the asia-south1 region, in two zones (asia-south1-a & asia-south1-c) out of the three. I can't boot up all four instances due to the below error about resources not being available in the current zone or region.
I can't move the instances to another region, because the public IP would change, and in the asia-south1 region all zones show the same error.
I even tried to create a new instance in a different zone like asia-south1-b, but got the same error.
The zone 'projects/some-projectname/zones/asia-south1-c' does not have enough resources available to fulfill the request. Try a different zone, or try again later.
EDIT: Tried all three zones (a, b, and c) of asia-south1.
What can I do?
Yes, I have faced the same issue. I ended up switching to asia-south1-c. This is a recent error; I first encountered it 1.5 weeks ago. If you must use asia-south1-a, then you can raise a support ticket. Either way, the issue should be fixed soon, since many users are facing the same problem.
In case you have snapshots of the instances' disks, you can use the snapshots as existing disks to create new instances in any region/zone. If you do not have snapshots, you can still create them now; snapshotting a disk does not require the instance to be running.
Edit:
We are continuously adding more resources to avoid situations like this. If your workload is predictable long term, you may want to purchase a commitment and reserve the resources you will use at a discounted price; a gcloud sketch follows below.
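A minimal sketch of purchasing such a commitment with gcloud; the name, region, plan, and resource amounts are placeholder assumptions:
# Commit to 4 vCPUs and 16GB of memory for 12 months in exchange for
# a discounted rate on those resources.
gcloud compute commitments create my-commitment \
    --region=asia-south1 \
    --plan=12-month \
    --resources=vcpu=4,memory=16GB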

Problems with Memory and CPU limits in AWS ECS cluster running on reserved EC2 instance

I am running an ECS cluster that currently has 3 services on a T3 medium instance. Each of those services runs only one task, which has a soft memory limit of 1GB; the hard limit is different for each (but that should not be the problem). I will always have enough memory to run one newly deployed task (the new one will also take 1GB, and the T3 medium can handle it since it has 4GB total). After the new task is up and running, the old one is stopped and I again have 1GB free for the next deployment. I did similar with CPU (2048 CPU units, each task gets 512, and 512 stays free for new deployments).
So everything runs fine now, but I am not completely satisfied with this setup for the future. What will happen if I need to add another service with another task? I would need to redeploy all existing tasks and modify their task definitions to use less CPU and memory in order to make room for the new task (and its deployments). I am planning to get a reserved EC2 instance, so it will not be easy to swap the current EC2 instance for a larger one.
Is there a way to spin up another EC2 instance in the same ECS cluster to handle bursts in my tasks? Deployments are also not ideal: I can only deploy one task at a time, then wait for the old one to be killed before deploying the next one without downtime.
And my biggest concern: what if I need a new service and task? I would again have to adjust all the others in order to run the new one, which is not very maintainable. And what if I cannot lower CPU and memory any further because I have already reached the lowest point at which the tasks run smoothly?
I was thinking about having another EC2 instance in the same cluster to handle bursts, deployments, and new services/tasks, but I'm not sure if that's possible or if it's the best way of doing this. I was also thinking about Fargate, but it is much more expensive and I cannot afford it for now. What do you think? Any ideas, suggestions, and hints would be helpful, since I am desperate to find the best way to avoid the problems mentioned above.
Thanks in advance!
So unfortunately, there is no out-of-the-box solution to ensure that all your tasks run on the minimum possible number of instances (i.e. one). You can use our new feature called Capacity Providers (CP), which will allow you to ensure the minimum number of EC2 instances required to run all your tasks. The major difference between a CP and a plain ASG is that a CP gives more weight to task placement (whereas an ASG scales in/out based on resource utilization, which isn't ideal in your case).
However, it's not an ideal solution. Just as you said in your comment, when the service needs to scale out during a deployment, CP will spin up another instance, the new task will be placed on it and once it gets to Running state, the old task will be stopped.
But now you have an "extra" EC2 instance, because there is no way to replace a running task. The only way I can think of would be to use a Lambda function that drains the new instance, which will move all the service tasks to the other instance. After about 15 minutes, the CP will terminate this instance, as no tasks are running on it.
A couple of caveats:
CPs are new, a little rough around the edges, and you can't delete/modify them; you can only create or deactivate them.
A CP needs an underlying ASG, and they must have a 1-1 relationship.
Make sure to enable managed scaling when creating the CP.
Choose a 100% capacity target.
Don't forget to add a default capacity provider strategy for the cluster (all of this is sketched below).
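A minimal sketch of wiring this up with the AWS CLI; the capacity provider name, cluster name, and ASG ARN are placeholder assumptions:
# Create a capacity provider backed by an existing ASG, with managed
# scaling enabled and a 100% capacity target.
aws ecs create-capacity-provider \
    --name my-cp \
    --auto-scaling-group-provider "autoScalingGroupArn=arn:aws:autoscaling:us-east-1:123456789012:autoScalingGroup:11111111-2222-3333-4444-555555555555:autoScalingGroupName/my-asg,managedScaling={status=ENABLED,targetCapacity=100}"
# Make it the cluster's default capacity provider strategy.
aws ecs put-cluster-capacity-providers \
    --cluster my-cluster \
    --capacity-providers my-cp \
    --default-capacity-provider-strategy capacityProvider=my-cp,weight=1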
Minimizing EC2 instances used:
If you're using a capacity provider, the 'binpack' placement strategy minimises the number of EC2 hosts that are used (see the sketch at the end of this section).
However, there are some scale-in scenarios where you can end up with a single task running on its own EC2 instance. As Ali mentions in their answer, ECS will not replace this running task, but depending on your setup, it may be fairly easy to replace it yourself by configuring your task to voluntarily 'quit'.
In my case, I always have at least 2 tasks running per service, so I just added some logic to my tasks' healthchecks to make them report as unhealthy after ~6 hours. ECS will spot the 'unhealthy' task, remove it from the load balancer, and spin up a replacement (according to the binpack strategy).
Note: If you take this approach; add some variation to your timeout so you're less likely to have all of your tasks expire at the same time. Something like: expiry = now + timedelta(hours=random.uniform(5.5,6.5))
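A minimal sketch of a service that uses binpack placement on memory; the cluster, service, and task definition names are placeholder assumptions:
# Pack tasks onto as few container instances as possible, by memory.
aws ecs create-service \
    --cluster my-cluster \
    --service-name my-service \
    --task-definition my-task:1 \
    --desired-count 2 \
    --placement-strategy type=binpack,field=memory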
Sharing memory 'headspace' with soft-limits:
If you set both soft and hard memory limits, ECS will place your tasks based on the soft limit. If your tasks' memory usage varies with load, it's fairly easy for your EC2 instance to start swapping.
For example: say you have a task defined with a soft limit of 900mb and a hard limit of 1800mb. You spin up a service with 4 running tasks, and ECS provisions all 4 of them on a single t3.medium. Notice that each task thinks it can safely use up to 1800mb, when in fact there's very little free memory on the host server. When you hit your service with some traffic, each task tries to use some more memory, and your t3.medium is incapacitated as it starts swapping memory to disk. ECS does not recover from this type of failure very well: it notices that the tasks are no longer available and attempts to provision replacements, but the capacity provider is very slow to replace the swapping t3.medium.
My suggestion:
Configure your service to auto-scale based on memory usage (this will be a percentage of your soft-limit), for example: a target memory usage of 70%
Configure your tasks' healthchecks so that they report as unhealthy when they are nearing their soft limit. This way, your tasks still have some headroom for quick spikes of memory usage, while giving your load balancer a chance to drain and gracefully replace tasks that are getting greedy. This is fairly easy to do by reading the value in /sys/fs/cgroup/memory/memory.usage_in_bytes; a minimal sketch follows.
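A minimal healthcheck sketch along these lines; the 900mb soft limit is a placeholder assumption, and the cgroup v1 path is the one referenced above:
#!/bin/sh
# Report unhealthy once memory usage nears the task's soft limit.
SOFT_LIMIT_BYTES=$((900 * 1024 * 1024))
THRESHOLD=$((SOFT_LIMIT_BYTES * 90 / 100))
USAGE=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
# Exit non-zero (unhealthy) above ~90% of the soft limit.
[ "$USAGE" -le "$THRESHOLD" ]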

The zone X does not have enough resources

An instance group is attempting to create an instance, but can't, failing with this error:
The zone 'projects/myproject/zones/us-central1-f' does not have enough resources available to fulfill the request
How do I fix this, and is it something I should expect from GCE on a regular basis?
Thanks
On some rare occasions, a zone might not have enough resources available to fulfill a request. This happens because each zone keeps enough resources in hand so that existing users can keep running their applications.
This type of issue is noticed immediately, and this one is currently being investigated. For the moment, you can try one of the following:
Keep trying to do the deployment until the zone has enough resources.
Relax the requirements of the instance you are creating (i.e. less CPU/disk/memory...)
Try to deploy to another zone within the same region, for example deploy to us-central1-a. You can see the full list of available zones/regions in this documentation
I would recommend the third option, as you will be able to create the instances immediately, with the resources you need, and you probably won't be affected by the zone change; a quick way to list the candidate zones is sketched below.
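For example, assuming the group's region is us-central1, you can list the candidate zones with:
# Show all zones belonging to the instance group's region.
gcloud compute zones list --filter="region:us-central1"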

Is it possible to do a temporary upgrade of an AWS micro instance to test what would be ok?

I have a free micro instance on AWS, and quite often my CPU is throttled, making it very hard to use.
I want to know if there is any way to test a bigger instance so I can see which one would be OK.
Side questions:
Can I go back to the free micro if I want?
Can I limit the cost of the testing, or get an estimate of it? I don't want to end up with a surprise bill as a result of the testing.
You can of course launch a new instance of a larger size, run your tests, then terminate the instance. It will not affect your running micro instance in any way at all.
AWS publishes their pricing data, so you can either calculate the cost manually or use the cost calculator: http://calculator.s3.amazonaws.com/calc5.html
There is no way to hard-cap your AWS spend, though you can configure billing alerts to be notified when estimated charges cross a threshold.
Mike Ryan's answer is correct as such, but there might be a better way to achieve your goal, because it is possible to upgrade your Amazon EC2 t1.micro instance in place. This process (and its few constraints) is summarized in Eric Hammond's article Moving an EC2 Instance to a Larger (or Smaller) Instance Type:
When you discover that the entry level t1.micro instance size is
simply not cutting it for your growing application needs, you may want
to try upgrading it to a larger instance type, perhaps an m1.small or
even a c1.medium.
Instead of starting a new instance and having to configure it from
scratch, you may be able to simply resize the existing instance by
asking Amazon to move it to better hardware for you. Of course, since
this is AWS, you don’t have to actually talk to anybody—just type a
few commands and the job is done automatically.
Eric describes how to achieve this via the command line, but the same can be done via the AWS Management Console as well if you prefer; the instance menu features a corresponding command, Change Instance Type (only enabled when the instance is stopped). A CLI sketch follows.
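A minimal sketch of the in-place resize with the AWS CLI; the instance ID and target type are placeholder assumptions:
# The instance must be stopped before its type can be changed.
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
# Change the instance type in place (e.g. t1.micro -> m1.small).
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 \
    --instance-type "{\"Value\": \"m1.small\"}"
# Start it again on the new hardware.
aws ec2 start-instances --instance-ids i-0123456789abcdef0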
Alternatively, you might also want to get acquainted with the ease of duplicating an EBS-backed EC2 instance by means of an Amazon Machine Image (AMI), which allows you to start any number of exact duplicates of your current instance. This process is outlined in Creating Amazon EBS-Backed AMIs Using the Console, for example.