GCP Compute Engine won't show memory metrics

I want my Compute Engine VM to show memory usage metrics in the console. I went to this page, installed the Ops Agent, restarted the service, and went to the VM's Observability section, but I still saw a message that the agent is not installed (on the memory usage chart):
I thought maybe memory usage metrics are not enabled by default (it's not mentioned anywhere, just a guess) and I need to modify the config. I went to these docs and added this code to /etc/google-cloud-ops-agent/config.yaml:
metrics:
  receivers:
    agent.googleapis.com/memory/bytes_used:
      type: hostmetrics
      collection_interval: 1m
According to the docs, this config will be merged with the built-in configuration when the agent restarts.
I restarted the agent service and went back to the dashboard, but it still shows the message "Requires Ops Agent".
I don't know what I'm doing wrong; the documentation is really poor on this topic IMO, and I couldn't find any example of how to turn on memory usage metrics.
EDIT
Running sudo systemctl status google-cloud-ops-agent"*", I can see this error message:
otelopscol[2763]: 2022-05-02T14:07:02.780Z  error  collector@v0.26.1-0.20220307211504-dc45061a44f9/metrics.go:235  could not export time series to GCM  {"error": "rpc error: code = InvalidArgument desc = Name must begin with '{resource_container_type}/{resource_container_id}', got: projects/"}
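Guessing from the error, the empty projects/ path suggests the collector could not resolve a project ID. One way to see what the metadata server returns for it (a standard GCE metadata query that works on any VM):
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/project/project-id"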
EDIT2
If I click INSTALL via the console, I see these installation instructions:
:> agents_to_install.csv && \
echo '"projects/<project>/zones/europe-west1-b/instances/<instance>","[{""type"":""ops-agent""}]"' >> agents_to_install.csv && \
curl -sSO https://dl.google.com/cloudagents/mass-provision-google-cloud-ops-agents.py && \
python3 mass-provision-google-cloud-ops-agents.py --file agents_to_install.csv
It's different from the one here: https://cloud.google.com/monitoring/agent/monitoring/installation#joint-install
curl -sSO https://dl.google.com/cloudagents/add-monitoring-agent-repo.sh
sudo bash add-monitoring-agent-repo.sh --also-install
Not sure which script installed what; I tried both.
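To see which agent actually ended up installed and running, a quick check (assuming systemd; stackdriver-agent is the legacy Monitoring agent's service name):
sudo systemctl status google-cloud-ops-agent"*"
sudo systemctl status stackdriver-agent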

Regarding your questions "I couldn't find any example of how to turn on memory usage metrics" and "Is it installed but the configuration needs to be modified for the memory usage metrics?", the answer is yes: you need to choose which group or groups of metrics to enable, as specified here. The full metric type strings carry the prefix agent.googleapis.com/ plus a group name (agent/, memory/, and so on); that prefix is omitted from the entries in the tables I'm referring to. The agent's own memory metrics, for example, are:
agent.googleapis.com/agent/memory_usage
agent.googleapis.com/agent/memory_utilization
Now, select the metric based on the target VM you need metrics from; the VM's memory metrics live in the memory group, and some of them are Linux only, for example:
agent.googleapis.com/memory/percent_used
You can also use the other variants in that group, for example:
agent.googleapis.com/memory/bytes_used
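If you are on the Ops Agent (as in your config.yaml above), the memory metrics come from the built-in hostmetrics receiver, which is enabled by default, so an override only needs to name that receiver rather than an individual metric type. A minimal sketch that mirrors the documented built-in configuration (the 60s interval is the default; change it if you need to):
# Overwrite the user config with an explicit hostmetrics pipeline, then restart
sudo tee /etc/google-cloud-ops-agent/config.yaml >/dev/null <<'EOF'
metrics:
  receivers:
    hostmetrics:
      type: hostmetrics
      collection_interval: 60s
  service:
    pipelines:
      default_pipeline:
        receivers: [hostmetrics]
EOF
sudo systemctl restart google-cloud-ops-agent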
Ensure that you didn't miss anything in the agent's installation; follow these instructions to install it from the CLI. Then go to:
Resources -> Instances: you should see your VM instance.
Click on your instance -> click on Agent -> scroll down, and you will see your memory and swap usage.
Finally, you can follow this troubleshooting guide for Ops Agent issues, and these threads for more empirical cases and solutions: Memory Usage Monitoring in GCP Compute Engine and No metric found.

Related

Dataproc custom image: Cannot complete creation

For a project, I have to create a Dataproc cluster that runs one of the outdated image versions (for example, 1.3.94-debian10) that contain the vulnerabilities in the Apache Log4j 2 utility. The goal is to trigger the related alert (DATAPROC_IMAGE_OUTDATED) in order to check how SCC works (it is just for a test environment).
I tried to run the command gcloud dataproc clusters create dataproc-cluster --region=us-east1 --image-version=1.3.94-debian10 but got the following message ERROR: (gcloud.dataproc.clusters.create) INVALID_ARGUMENT: Selected software image version 1.3.94-debian10 is vulnerable to remote code execution due to a log4j vulnerability (CVE-2021-44228) and cannot be used to create new clusters. Please upgrade to image versions >=1.3.95, >=1.4.77, >=1.5.53, or >=2.0.27. For more information, see https://cloud.google.com/dataproc/docs/guides/recreate-cluster, which makes sense, in order to protect the cluster.
I did some research and discovered that I will have to create a custom image with said version and generate the cluster from that. The thing is, I have tried to read the documentation and find some tutorial, but I still can't understand how to start or how to run the generate_custom_image.py file, for example, since I am not comfortable with Cloud Shell (I prefer the console).
Can someone help? Thank you
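For what it's worth, the custom-images tool is run from a clone of its repo, roughly like this (a sketch based on the GoogleCloudDataproc/custom-images README; every value below is a placeholder, and the customization script can be a no-op if you only need the base version):
git clone https://github.com/GoogleCloudDataproc/custom-images
cd custom-images
python3 generate_custom_image.py \
    --image-name=dataproc-1-3-94-test \
    --dataproc-version=1.3.94-debian10 \
    --customization-script=noop.sh \
    --zone=us-east1-b \
    --gcs-bucket=gs://my-staging-bucket
Whether cluster creation also blocks custom images built from a blocked base version is something you would have to verify.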

Application information missing in Spinnaker after re-adding GKE accounts - using spinnaker-for-gcp

I am using a Spinnaker implementation set up on GCP using the spinnaker-for-gcp tools. My initial setup worked fine. However, we recently had to reconfigure our GKE clusters (independently of Spinnaker). Consequently, I deleted and re-added our GKE accounts. After doing that, the Spinnaker UI appears to show the existing GKE-based applications, but if I click on any of them, there are no clusters or load balancers listed anymore! Here are the spinnaker-for-gcp commands that I executed:
$ hal config provider kubernetes account delete company-prod-acct
$ hal config provider kubernetes account delete company-dev-acct
$ ./add_gke_account.sh # for gke_company_us-central1_company-prod
$ ./add_gke_account.sh # for gke_company_us-west1-a_company-dev
$ ./push_and_apply.sh
When the above didn't work, I did an experiment where I deleted the two accounts and added an account with a different name (but the same GKE cluster) and ran push_and_apply. As before, the output messages seem to indicate that everything worked, but the Spinnaker UI continued to show all the old account names, despite the fact that I deleted them and added new ones (which did not show up). And, as before, no details could be seen for any of the applications. Also note that hal config provider kubernetes account list did show the new account name and did not show the old ones.
Any ideas for what I can do, other than completely recreating our Spinnaker installation? Is there anything in particular that I should look for in the Spinnaker logs in GCP to provide more information?
Thanks in advance.
-Mark
The problem turned out to be that the data in my .kube/config file in Cloud Shell was obsolete. Removing that file, recreating it (via the appropriate kubectl commands), and then re-running the commands from my original description fixed the problem.
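For anyone hitting the same thing, the refresh amounted to something like this (cluster names and locations are inferred from the kubectl context names above):
rm ~/.kube/config
gcloud container clusters get-credentials company-prod --region us-central1
gcloud container clusters get-credentials company-dev --zone us-west1-a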
Note, though, that it took a lot of shell-script and GCP log reading by our team to figure out the problem. Ultimately, it would have been nice if the add_gke_account.sh or push_and_apply.sh scripts could have detected the issue, presumably by verifying that the expected changes did, in fact, correctly occur in the running Spinnaker.

What is the trace-token option for gcloud used for?

The help definition is not clear to me:
Token used to route traces of service requests for investigation of issues.
Could you provide a simple example of how to use it?
I tried:
gcloud compute instances create vm3 --trace-token xyz123
I can find "vm3" string in logs, but not my token xyz123.
The only use of it seems to be in grep:
history| grep xyz123
The --trace-token flag is intended to be used by support agents when there is an error that is difficult to track down from the logs. The Google Cloud Platform Support agent provides a time-bound token, which expires after a specified time, and asks the user to run the command for the specific product in which the user is facing the issue. It then becomes easier for the support agent to trace the error using that --trace-token.
For example:
A user faced an error while creating a Compute Engine instance and contacted the Google Cloud Platform Support team. The support agent inspected the logs and other resources but could not find the root cause of the issue. The agent then provided a --trace-token, say abcdefgh, and asked the user to run the command with it:
gcloud compute instances create my-vm --trace-token abcdefgh
After the user runs the above command, the support agent can find the error by analysing the trace in depth with the help of that --trace-token.
Please note that when the --trace-token flag is used, the contents of the trace may include sensitive information such as auth tokens and the contents of any accessed files. Hence it should only be used for manual testing and not in production environments.

How to apply rolling updates in VM instances instead of using Managed Instance group in GCP?

Problem: I want to apply patch updates to a VM instance which is not part of a Managed Instance Group. The patch update could be:
A change in the version of the current OS of a VM instance, that is, a change from Ubuntu-16-v1 to Ubuntu-16-v2.
An upgrade of the boot OS, that is, changing from Ubuntu 16 to Ubuntu 18.
Installation of a new package on the existing machine.
Exploration:
For Problems 1 & 2 stated above
I have explored and tried the rolling-update feature of Managed Instance Groups in Google Cloud Platform, and it seems to be a good approach for the stated problem, but what would be the best approach, with best practices, for someone not using a Managed Instance Group? You may find the details here.
For Problem 3 stated above
I have tried the OS patch management service of GCP, but is there any other method that I could use?
Create an "image" from the boot disks of your existing Compute Engine instances.
For updating with newer configurations and software, group images in "image family" which always points to the latest image.
See https://cloud.google.com/compute/docs/images/create-delete-deprecate-private-images#setting_families
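As a sketch with gcloud (names and zone are placeholders; stop the instance first, or pass --force, before imaging its boot disk, which usually has the same name as the instance):
# Create an image from an existing boot disk and put it in an image family
gcloud compute images create ubuntu-16-v2 \
    --source-disk=my-instance \
    --source-disk-zone=us-central1-a \
    --family=my-ubuntu-16

# The family name always resolves to its newest non-deprecated image
gcloud compute instances create my-new-vm \
    --image-family=my-ubuntu-16 \
    --zone=us-central1-a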
For your use case, I think you should use an IaC tool like Terraform to recreate similar VMs with the same name, disk, internal address, etc., and call the script from the repo directly on a scheduled date automatically, or provide self-patch instructions.
Here is the likely process:
- Send an email notification to all the VM owners that auto-patching is scheduled on XYZ.
- The email content should include the list of instances to be patched/updated, the list of actions, and the patch team's contact details.
- The email should also include a link for skipping this auto-update and performing the "self patching" instructions instead.
- The self-patching document should contain a command that calls the auto-patch wrapper script, like:
curl -u "encrypted-auth:x-oauth-basic" -k -H 'Accept: application/vnd.github.VERSION.raw' 'https://github.com/api/v3/repos/xyz/images/contents/gcp/patch_OS_update.sh?ref=master' | bash -s -- -q
The above script could also offer other options, like querying the patch sets available for a particular VM or scanning the VM for pending updates.
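The wrapper itself can be simple; here is a hypothetical sketch of what a patch_OS_update.sh with a quiet mode (-q) and a scan-only mode (-s) might look like on a Debian/Ubuntu VM (flag names and behavior are made up for illustration):
#!/bin/bash
# Hypothetical auto-patch wrapper: -q = quiet, -s = scan only (report, don't apply).
set -euo pipefail

MODE="apply"
QUIET=""
while getopts "qs" opt; do
  case "$opt" in
    q) QUIET="-qq" ;;
    s) MODE="scan" ;;
  esac
done

sudo apt-get $QUIET update
if [ "$MODE" = "scan" ]; then
  # Report pending updates without applying them
  apt list --upgradable
else
  sudo DEBIAN_FRONTEND=noninteractive apt-get $QUIET -y upgrade
fi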

Command to create google cloud backend service fails - what am I doing wrong?

I am currently working through the Google Cloud "load balancing" code lab:
https://codelabs.developers.google.com/codelabs/cpo200-load-balancing
On page 4 of the lab, it requires me to run the following command in Cloud Shell to create a backend service (for load balancing a group of web server, i.e. HTTP, instances):
gcloud compute backend-services create \
guestbook-backend-service \
--http-health-checks guestbook-health-check
However, running this command results in the following error:
ERROR: (gcloud.compute.backend-services.create) Some requests did not succeed:
- Invalid value for field 'resource.loadBalancingScheme': 'EXTERNAL'.
Backend Service based Network Load Balancing is not yet supported.
Assuming that all the preceding steps in the code lab are correct (which I have no reason to suspect is not the case), this appears to be a bug in the code lab.
I have submitted a bug report for this; however, since I am not expecting any response to the bug report any time soon, and I do want to continue with this lab, what command should I be running instead?
I presume there has been some sort of API change but the code lab has not caught up and the documentation does not appear to indicate any relevant changes.
I realize I could probably work out how to do this with the Cloud Console, but I would really like to learn the command line actions.
Does anyone have any ideas?
Thanks in advance!
And, as is the nature of these things, shortly after posting this I discovered the answer for myself...
The command should be:
gcloud compute backend-services create \
guestbook-backend-service \
--http-health-checks guestbook-health-check \
--global
It appears that what the error message is actually complaining about is that regional backend-services are not supported; they must be global.
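A quick way to double-check the scope afterwards (standard gcloud commands; --global is also required by describe for a global service):
gcloud compute backend-services list
gcloud compute backend-services describe guestbook-backend-service --global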
Leaving aside the fact that the lab directions are inadequate, it would be nice if this were detailed in the documentation, but I guess we can't have everything...