Why am I getting inconsistent results when attempting to update my instance group using `gcloud`? - google-cloud-platform

I have an instance group in GCP, and I am working on automating the deployment process. The instances in this group are based on a tagged GCR image. When a new image is pushed to the container registry, we have been manually triggering an upgrade by navigating to the instance group in console.cloud.google.com, clicking "Restart/Replace VMs", and setting these options:
Operation: replace
Maximum surge: 3
Maximum unavailable: 0
Here is my gcloud command for doing the same thing (based on Google's documentation for this command):
gcloud beta compute instance-groups managed rolling-action start-update my-instance-group \
--version=template=my-template-with-image \
--replacement-method=substitute \
--max-surge=3 \
--max-unavailable=0 \
--region=us-central1
Manually, the process always works. But the gcloud command is flaky. It always appears to succeed from the command line, but the instance groups are not always restarted. I have even tried adding these two flags, and the restart attempt was still unreliable:
--minimal-action=replace \
--most-disruptive-allowed-action=replace \
There is quite a lot of output from the gcloud command (which I can provide, if necessary), but here are the only parts of the output that differ between a successful and unsuccessful attempt:
Good:
currentActions:
  creating: 1
status:
  isStable: false
  versionTarget:
    isReached: false
Bad:
currentActions:
  creating: 0
status:
  isStable: true
  versionTarget:
    isReached: true
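For the automation, I suppose one way to distinguish these two outcomes from a script (a sketch only, reusing the group name and region from my command above) would be to poll the same fields, or to block until the group reports stable again:

gcloud compute instance-groups managed describe my-instance-group \
  --region=us-central1 \
  --format="value(status.isStable,status.versionTarget.isReached)"

# Or wait for the rollout to settle (wait-until may require the beta track,
# depending on your gcloud version):
gcloud compute instance-groups managed wait-until my-instance-group \
  --stable \
  --region=us-central1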
That is pretty much the extent of my knowledge at this point. I am not sure how to move forward in automating the build process, and I have been unable to find answers from the documentation so far.
I hope I was not too verbose, and thank you in advance to anyone who spends time on this :)

Related

GKE cluster creator in GCP

How can we get the cluster owner details in GKE? The logs only contain entries for service-account operations, and there is no entry with the principal email or user ID anywhere.
It seems very difficult to get the name of the user who created the GKE cluster.
We have exported the complete JSON file of logs, but could not find an entry for the user who actually clicked the create-cluster button. I think knowing the GKE cluster creator is a very common use case; not sure if we are missing something.
Query:
resource.type="k8s_cluster"
resource.labels.cluster_name="clusterName"
resource.labels.location="us-central1"
-protoPayload.methodName="io.k8s.core.v1.configmaps.update"
-protoPayload.methodName="io.k8s.coordination.v1.leases.update"
-protoPayload.methodName="io.k8s.core.v1.endpoints.update"
severity=DEFAULT
-protoPayload.authenticationInfo.principalEmail="system:addon-manager"
-protoPayload.methodName="io.k8s.apiserver.flowcontrol.v1beta1.flowschemas.status.patch"
-protoPayload.methodName="io.k8s.certificates.v1.certificatesigningrequests.create"
-protoPayload.methodName="io.k8s.core.v1.resourcequotas.delete"
-protoPayload.methodName="io.k8s.core.v1.pods.create"
-protoPayload.methodName="io.k8s.apiregistration.v1.apiservices.create"
I have referred to the link below, but it did not help either.
https://cloud.google.com/blog/products/management-tools/finding-your-gke-logs
You want Audit Logs, and specifically Admin Activity audit logs.
And, there's a "trick": The activity audit log entries include the API method. You can find the API method that interests you. This isn't super straightforward but it's relatively easy. You can start by scoping to the service. For GKE, the service is container.googleapis.com.
NOTE In APIs Explorer, look at the Kubernetes Engine API (really container.googleapis.com) and its projects.locations.clusters.create method. The mechanism breaks down a little here, as the protoPayload.methodName is a variant of the underlying REST method name.
And so you can use logs explorer with the following very broad query:
logName="projects/{PROJECT}/logs/cloudaudit.googleapis.com%2Factivity"
container.googleapis.com
NOTE Replace {PROJECT} with your project ID.
And then refine this based on what's returned:
logName="projects/{PROJECT}/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.serviceName="container.googleapis.com"
protoPayload.methodName="google.container.v1beta1.ClusterManager.CreateCluster"
NOTE I mentioned that it isn't super straightforward because, as you can see above, I'd used gcloud beta container clusters create and so I need the google.container.v1beta1.ClusterManager.CreateCluster method, but it was easy to determine this from the logs.
And, who dunnit?
protoPayload: {
  authenticationInfo: {
    principalEmail: "{me}"
  }
}
So:
PROJECT="[YOUR-PROJECT]"
FILTER="
logName=\"projects/${PROJECT}/logs/cloudaudit.googleapis.com%2Factivity\"
protoPayload.serviceName=\"container.googleapis.com\"
protoPayload.methodName=\"google.container.v1beta1.ClusterManager.CreateCluster\"
"
gcloud logging read "${FILTER}" \
--project=${PROJECT} \
--format="value(protoPayload.authenticationInfo.principalEmail)"
For those who are looking for a quick answer:
Use the log filter below in Logs Explorer to check the creator of the cluster.
resource.type="gke_cluster"
protoPayload.authorizationInfo.permission="container.clusters.create"
resource.labels.cluster_name="your-cluster-name"
From the gcloud command below, you can get the creation date of the cluster.
gcloud container clusters describe YOUR_CLUSTER_NAME --zone ZONE
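If you only want the timestamp, a small sketch (cluster name and zone are placeholders) that narrows the describe output to the creation time:

gcloud container clusters describe YOUR_CLUSTER_NAME --zone ZONE \
  --format="value(createTime)"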

Question when I import an .ova file into Google Cloud to create a VM instance

When I try to import the .ova file, I get the following error.
The problem is:
[import-ovf] 2020/08/19 13:02:35 step "import-boot-disk" run error: step "wait-for-signal" run error: WaitForInstancesSignal FailureMatch found for "inst-importer-import-ovf-import-boot-disk-3qlqg": "ImportFailed: Failed to resize disk. The Compute Engine default service account needs the role: roles/compute.storageAdmin'"
ERROR
ERROR: build step 0 "gcr.io/compute-image-tools/gce_ovf_import:release" failed: step exited with non-zero status: 1
ERROR: (gcloud.compute.instances.import) build 0683583e-5157-4d13-972b-b5e3f5f75f2b completed with status "FAILURE"
"
The command I am using is: gcloud compute instances import eve --os=ubuntu-1604 --source-uri=gs://evenamespace/EVE.ova --zone=northamerica-northeast1-a --custom-memory=25GB --custom-cpu=4.
I have already added the Compute Storage Admin role to the Compute Engine default service account, as shown in the picture.
As mentioned in this answer to a very similar question, you should make sure that your Compute Engine service account has the roles roles/compute.storageAdmin & roles/storage.objectViewer. This is also mentioned in the documentation here.
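For reference, here is a rough sketch of granting both roles from the CLI; PROJECT_ID is a placeholder, and the filter assumes the default service account still has its standard display name:

PROJECT_ID="your-project-id"
SA_EMAIL=$(gcloud iam service-accounts list \
  --project="${PROJECT_ID}" \
  --filter="displayName:'Compute Engine default service account'" \
  --format="value(email)")
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/compute.storageAdmin"
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/storage.objectViewer"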
Also, it is important to make sure that you have enough Disk quota in the region where you're importing the disk into.
The import process uses SSD disks for performance during the import. So, in case you don't have quota for SSD disks available, you may also face issues when importing an appliance. I suggest also checking your current SSD quota.
Could you please try the recommendations and let me know if this works to complete your import?
If not, maybe you could retrieve all of the logs by inspecting the console output of gcloud compute instances import. There will be a URL for the scratch bucket. You can load that URL, navigate to the logs directory, and download all of the logs. More hints about the cause may be found in these logs.
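Something along these lines (the bucket name is a placeholder; use the scratch-bucket URL from your own import output) would pull everything down for inspection:

gsutil ls gs://YOUR-SCRATCH-BUCKET/
gsutil -m cp -r gs://YOUR-SCRATCH-BUCKET/logs ./import-logs/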
Using this command, without the / at the end of the .ova file name, I got it to work:
$ gcloud beta compute machine-images import <my-machine-image> --source-uri=gs://my-virtual-appliances-bucket/my-va-file.ova --os=ubuntu-1804
Let me know if this helps anyone trying to attempt this; check out this link as well.

Error with gcloud beta command for streaming assets to bigquery

This might be a bit bleeding edge but hopefully someone can help. The problem is a catch 22.
So what we're trying to do is create a continuous stream of inventory changes in each GCP project to BigQuery dataset tables that we can create reports from, to get a better idea of what we're paying for, what's turned on, what's in use, what isn't, etc.
Error: Error running command 'gcloud beta asset feeds create asset_change_feed --project=project_id --pubsub-topic=asset_change_feed': exit status 2. Output: ERROR: (gcloud.beta.asset.feeds.create) argument (--asset-names --asset-types): Must be specified.
Usage: gcloud beta asset feeds create FEED_ID --pubsub-topic=PUBSUB_TOPIC (--asset-names=[ASSET_NAMES,...] --asset-types=[ASSET_TYPES,...]) (--folder=FOLDER_ID | --organization=ORGANIZATION_ID | --project=PROJECT_ID) [optional flags]
optional flags may be --asset-names | --asset-types | --content-type |
--folder | --help | --organization | --project
For detailed information on this command and its flags, run:
gcloud beta asset feeds create --help
Using terraform we tried creating a dataflow job and a pubsub topic called asset_change_feed.
We get an error trying to create the pubsub topic because the gcloud beta asset feeds create command wants a parameter that includes all the asset names to monitor...
Well... this kind of defeats the purpose. The whole point is to monitor all the asset names that change, appear and disappear. It's like creating a feed that monitors all the new baby names that appear over the next year but the feed command requires that we know them in advance somehow. WTF? What's the point then? Are we re-inventing the wheel here?
We were going by this documentation here:
https://cloud.google.com/asset-inventory/docs/monitoring-asset-changes#creating_a_feed
As per the gcloud beta asset feeds create documentation, it is required to specify at least one of --asset-names and --asset-types:
At least one of these must be specified:
--asset-names=[ASSET_NAMES,…] A comma-separated list of the full names of the assets to receive updates. For example:
//compute.googleapis.com/projects/my_project_123/zones/zone1/instances/instance1.
See
https://cloud.google.com/apis/design/resource_names#full_resource_name
for more information.
--asset-types=[ASSET_TYPES,…] A comma-separated list of types of the assets to receive updates. For example:
compute.googleapis.com/Disk,compute.googleapis.com/Network See
https://cloud.google.com/resource-manager/docs/cloud-asset-inventory/overview
for all supported asset types.
Therefore, when we don't know the names a priori we can monitor all resources of the desired types by only passing --asset-types. You can see the list of supported asset types here or use the exportAssets API method (gcloud asset export) to retrieve the types used at an organization, folder or project level.
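So, for example, something along these lines should create the feed without naming individual assets up front (the project ID, topic path, and asset types below are placeholders):

gcloud beta asset feeds create asset_change_feed \
  --project=project_id \
  --pubsub-topic=projects/project_id/topics/asset_change_feed \
  --asset-types="compute.googleapis.com/Instance,compute.googleapis.com/Disk" \
  --content-type=resource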

AWS Waiter TasksStopped failed: taskId length should be one of

For some reason I am getting the following error:
Waiter TasksStopped failed: taskId length should be one of [32,36]
I really don't know what taskId is supposed to mean and aws documentation isn't helping. Does anyone know what is going wrong in this pipeline script?
- step:
    name: Run DB migrations
    script:
      - >
        export BackendTaskArn=$(aws cloudformation list-stack-resources \
          --stack-name=${DEXB_PRODUCTION_STACK} \
          --output=text \
          --query="StackResourceSummaries[?LogicalResourceId=='BackendECSTask'].PhysicalResourceId")
      - >
        SequelizeTask=$(aws ecs run-task --cluster=${DEXB_PRODUCTION_ECS_CLUSTER} --task-definition=${BackendTaskArn} \
          --overrides='{"containerOverrides":[{"name":"NodeBackend","command":["./node_modules/.bin/sequelize","db:migrate"]}]}' \
          --launch-type=EC2 --output=text --query='tasks[0].taskArn')
      - aws ecs wait tasks-stopped --cluster=${DEXB_PRODUCTION_ECS_CLUSTER} --tasks ${SequelizeTask}
AWS introduced a new ARN format for tasks, container instances, and services. This format now contains the cluster name, which might break scripts and applications that were counting on the ARN only containing the task resource ID.
# Previous format (taskId contains hyphens)
arn:aws:ecs:$region:$accountID:task/$taskId
# New format (taskId does not contain hyphens)
arn:aws:ecs:$region:$accountId:task/$clusterName/$taskId
Until March 31, 2021, it will be possible to opt-out of this change per-region, using https://console.aws.amazon.com/ecs/home?#/settings. In order to change the behavior for the whole account, you will need to use the Root IAM user.
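If you'd rather not opt out, a rough workaround is to strip the ARN down to the bare task ID yourself before handing it to anything that validates the 32/36-character length (the ARN below is made up):

TASK_ARN="arn:aws:ecs:us-east-1:123456789012:task/my-cluster/0123456789abcdef0123456789abcdef"
TASK_ID="${TASK_ARN##*/}"   # drop everything up to and including the last '/'
echo "${TASK_ID}"           # 0123456789abcdef0123456789abcdef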
It turns out I had a duplicate task running in the background. I went to the ECS clusters page and stopped the duplicate task. However this may be dangerous to do if you have used cloudformation to set up your tasks and services. Proceed cautiously if you're in the same boat.
We were bitten by this cryptic error message, and what it actually means is that the task_id you're sending to the CloudFormation script is invalid. Task IDs must have a length of 32 or 36 chars.
In our case, an undocumented change in the way AWS sent back taskArn key value was causing us to grab the incorrect value, and sending an unrelated string as the task_id. AWS detected this and blew up. So double check the task_id string and you should be good.

Global GPU quota needed but can't request increase

I'm trying to use GCloud's deep learning VM image. My request for 8 Tesla K80s was approved. But when I try to create an instance with even a single GPU, I get an error saying the Global GPU limit of 0 is exceeded.
The error statement in specific:
ERROR: (gcloud.compute.instances.create) Could not fetch resource: - Quota 'GPUS_ALL_REGIONS' exceeded. Limit: 0.0 globally.
The code I wrote to create the VM is this:
export IMAGE_FAMILY="tf-latest-cu92"
export ZONE="us-west1-b"
export INSTANCE_NAME="my-new-instance"
export INSTANCE_TYPE="n1-standard-8"
gcloud compute instances create $INSTANCE_NAME \
--zone=$ZONE \
--image-family=$IMAGE_FAMILY \
--image-project=deeplearning-platform-release \
--maintenance-policy=TERMINATE \
--accelerator="type=nvidia-tesla-k80,count=1" \
--machine-type=$INSTANCE_TYPE \
--boot-disk-size=120GB \
--metadata="install-nvidia-driver=True"
This code snippet is drawn from:
https://cloud.google.com/deep-learning-vm/docs/quickstart-cli
Thank you for your time and effort.
I had this same thing happen a while ago. You have to increase the Tesla K80 quota as well as a global quota called GPUS_ALL_REGIONS. I'm not sure how to do this from the command line, but you can do it through the web console by going into your IAM settings, selecting "Quotas" from the side bar. In the dropdown labeled "Metric", deselect everything except for "GPUs (all regions)". You will now need to increase this quota to 8 as well. Once it is approved, you will be able to use all of your GPUs.
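If you want to check the current value from the command line before retrying (just a sketch; YOUR_PROJECT_ID is a placeholder), the project-level quota list includes this metric:

gcloud compute project-info describe --project YOUR_PROJECT_ID \
  --format="yaml(quotas)" | grep -C 2 "GPUS_ALL_REGIONS"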
UPDATE 2022:
Here is how to do it in the 2022 Gcloud UI:
Simply type GPUS_ALL_REGIONS into the quota filter input and then edit the selected quota.
Although this was already answered by Alex Krantz, here is where the corresponding page lives in the Google Cloud console UI.
You can navigate to it through "IAM & Admin", then "Quotas".