I would like to be able to monitor VMs (logs, performance metrics) in Azure and other clouds using Google Cloud Logging and Monitoring.
As a proof of concept, I'm using an Ubuntu 20.04 instance in Azure. I have installed the Ops Agent and put a key file in place for a service account with the required roles (Logging, Monitoring).
When I check the status of the Ops Agent, I see the following (mildly redacted):
● google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2022-02-16 22:39:22 UTC; 1min 5s ago
Process: 2730195 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=otel -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
Process: 2730208 ExecStart=/opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --config=${RUNTIME_DIRECTORY}/otel.yaml (code=exited, status=1/FAILURE)
Main PID: 2730208 (code=exited, status=1/FAILURE)
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-opentelemetry-collector.service: Scheduled restart job, restart counter is at 5.
Feb 16 22:39:22 HOSTNAME systemd[1]: Stopped Google Cloud Ops Agent - Metrics Agent.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-opentelemetry-collector.service: Start request repeated too quickly.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-opentelemetry-collector.service: Failed with result 'exit-code'.
Feb 16 22:39:22 HOSTNAME systemd[1]: Failed to start Google Cloud Ops Agent - Metrics Agent.
● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2022-02-16 22:39:22 UTC; 1min 5s ago
Process: 2730194 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fluentbit -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} -state ${STATE_DIRECTORY} (code=exited, status=0/SUCCESS)
Process: 2730207 ExecStart=/opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config ${RUNTIME_DIRECTORY}/fluent_bit_main.conf --parser ${RUNTIME_DIRECTORY}/fluent_bit_parser.conf --log_file ${LOGS_DIRECTORY}/logging-module.log --storage_path ${STATE_DIRECTORY}/buffers (co>
Main PID: 2730207 (code=exited, status=255/EXCEPTION)
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-fluent-bit.service: Scheduled restart job, restart counter is at 5.
Feb 16 22:39:22 HOSTNAME systemd[1]: Stopped Google Cloud Ops Agent - Logging Agent.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-fluent-bit.service: Start request repeated too quickly.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'exit-code'.
Feb 16 22:39:22 HOSTNAME systemd[1]: Failed to start Google Cloud Ops Agent - Logging Agent.
● google-cloud-ops-agent.service - Google Cloud Ops Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
Active: active (exited) since Wed 2022-02-16 22:39:21 UTC; 1min 7s ago
Process: 2730090 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -in /etc/google-cloud-ops-agent/config.yaml (code=exited, status=0/SUCCESS)
Process: 2730102 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
Main PID: 2730102 (code=exited, status=0/SUCCESS)
Feb 16 22:39:21 HOSTNAME systemd[1]: Starting Google Cloud Ops Agent...
Feb 16 22:39:21 HOSTNAME systemd[1]: Finished Google Cloud Ops Agent.
The Ops Agent logs show:
[2022/02/16 22:39:22] [ info] [engine] started (pid=2730207)
[2022/02/16 22:39:22] [ info] [storage] version=1.1.5, initializing...
[2022/02/16 22:39:22] [ info] [storage] root path '/var/lib/google-cloud-ops-agent/fluent-bit/buffers'
[2022/02/16 22:39:22] [ info] [storage] normal synchronization mode, checksum enabled, max_chunks_up=128
[2022/02/16 22:39:22] [ info] [storage] backlog input plugin: storage_backlog.2
[2022/02/16 22:39:22] [ info] [cmetrics] version=0.2.2
[2022/02/16 22:39:22] [ info] [input:storage_backlog:storage_backlog.2] queue memory limit: 47.7M
[2022/02/16 22:39:22] [ info] [output:stackdriver:stackdriver.0] metadata_server set to http://metadata.google.internal
[2022/02/16 22:39:22] [ warn] [output:stackdriver:stackdriver.0] client_email is not defined, using a default one
[2022/02/16 22:39:22] [ warn] [output:stackdriver:stackdriver.0] private_key is not defined, fetching it from metadata server
[2022/02/16 22:39:22] [ warn] [net] getaddrinfo(host='metadata.google.internal', err=-2): Name or service not known
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] failed to create metadata connection
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] can't fetch token from the metadata server
[2022/02/16 22:39:22] [ warn] [output:stackdriver:stackdriver.0] token retrieval failed
[2022/02/16 22:39:22] [ warn] [net] getaddrinfo(host='metadata.google.internal', err=-2): Name or service not known
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] failed to create metadata connection
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] can't fetch project id from the metadata server
[2022/02/16 22:39:22] [error] [output] failed to initialize 'stackdriver' plugin
[2022/02/16 22:39:22] [ info] [input] pausing fluentbit_metrics.0
[2022/02/16 22:39:22] [ info] [input] pausing tail.1
[2022/02/16 22:39:22] [ info] [input] pausing storage_backlog.2
I notice "private_key is not defined, fetching it from metadata server", which suggests that the key file is not being picked up.
The documentation says "The Ops Agent is the primary agent for collecting telemetry from your Compute Engine instances."
Can the Ops Agent only be run on Compute Engine instances or is it reasonable to expect that it could be run anywhere if properly configured?
When google-cloud-ops-agent.service is started, it starts google-cloud-ops-agent-fluent-bit.service and google-cloud-ops-agent-opentelemetry-collector.service and then exits. Environment variables added as overrides to google-cloud-ops-agent.service do not persist to the others.
I found that I had to add GOOGLE_APPLICATION_CREDENTIALS to google-cloud-ops-agent-opentelemetry-collector.service and GOOGLE_SERVICE_CREDENTIALS to google-cloud-ops-agent-fluent-bit.service. You can override the systemd units non-interactively:
SYSTEMD_EDITOR=tee systemctl edit google-cloud-ops-agent-fluent-bit.service <<'EOF'
[Service]
Environment='GOOGLE_SERVICE_CREDENTIALS=/etc/google/auth/application_default_credentials.json'
EOF
SYSTEMD_EDITOR=tee systemctl edit google-cloud-ops-agent-opentelemetry-collector.service <<'EOF'
[Service]
Environment='GOOGLE_APPLICATION_CREDENTIALS=/etc/google/auth/application_default_credentials.json'
EOF
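systemctl edit performs a daemon-reload when the editor exits, so after adding both overrides it should be enough to restart the agent and then check that all three units come up (the glob matches all of them):
sudo systemctl restart google-cloud-ops-agent.service
sudo systemctl status "google-cloud-ops-agent*"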
The Ops Agent is looking for credentials and not finding them. This means that either the service account key file was not copied to the correct location with the correct file access permissions, or the GOOGLE_APPLICATION_CREDENTIALS environment variable does not point to it. The agent then falls back to the metadata service, which on Azure does not serve Google OAuth access tokens (Azure provides MSI credentials instead, if set up).
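In practice that means putting the key file where the agent (or the environment variables from the other answer) will look for it. A minimal sketch, assuming your downloaded key is key.json and using the same path as the overrides above; the exact permissions are an assumption, the agent processes just need to be able to read the file:
sudo mkdir -p /etc/google/auth
sudo cp key.json /etc/google/auth/application_default_credentials.json
sudo chown root:root /etc/google/auth/application_default_credentials.json
sudo chmod 0400 /etc/google/auth/application_default_credentials.json  # readable by root; loosen if your agent runs as another user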
Related
I want to set up an NFS server on GCP, so I used a VM, mounted a GCS bucket on /vol using gcsfuse, and installed the nfs-kernel-server package on the VM. I created a directory nfs_share under /vol and added the entry in /etc/exports, but while restarting the nfs-kernel-server service I ran into the error below:
sudo systemctl status nfs-kernel-server
● nfs-server.service - NFS server and services
Loaded: loaded (/lib/systemd/system/nfs-server.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sat 2022-06-04 15:28:54 UTC; 10s ago
Process: 2139 ExecStopPost=/usr/sbin/exportfs -f (code=exited, status=0/SUCCESS)
Process: 2138 ExecStopPost=/usr/sbin/exportfs -au (code=exited, status=0/SUCCESS)
Process: 2137 ExecStartPre=/usr/sbin/exportfs -r (code=exited, status=1/FAILURE)
Jun 04 15:28:54 g-non-prod-nfs-test-vm systemd[1]: Starting NFS server and services...
Jun 04 15:28:54 g-non-prod-nfs-test-vm exportfs[2137]: exportfs: /vol/nfs_share requires fsid= for NFS export
Jun 04 15:28:54 g-non-prod-nfs-test-vm systemd[1]: nfs-server.service: Control process exited, code=exited status=1
Jun 04 15:28:54 g-non-prod-nfs-test-vm systemd[1]: nfs-server.service: Failed with result 'exit-code'.
Jun 04 15:28:54 g-non-prod-nfs-test-vm systemd[1]: Stopped NFS server and services.
Filestore, GCP's managed NFS service, needs to be deployed with a minimum of 1 TB of storage, so I'm looking for alternatives. The approach above looks feasible, but I'm unable to get the NFS service up and running.
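The exportfs message above points at the immediate fix: FUSE mounts such as gcsfuse have no stable device number, so the export needs an explicit fsid. A hypothetical /etc/exports entry (the client range and options are placeholders for whatever your setup needs):
/vol/nfs_share 10.0.0.0/8(rw,sync,no_subtree_check,fsid=1)
sudo exportfs -ra
sudo systemctl restart nfs-kernel-server
Note that exporting a FUSE filesystem usually also requires mounting gcsfuse with -o allow_other, since FUSE otherwise denies access to every user other than the one who mounted it, including the NFS server.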
I am new to Prometheus. While trying to install Alertmanager for Prometheus, I got the following error when checking with systemctl status alertmanager:
alertmanager.service - AlertManager Service
Loaded: loaded (/etc/systemd/system/alertmanager.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sat 2021-05-01 11:23:07 UTC; 21s ago
Process: 51547 ExecStart=/usr/local/bin/alertmanager --config.file /etc/alertmanager/alertmanager.yml -web.external-url=http://0.0.0.0:9093 (code=exited, status=1/>
Main PID: 51547 (code=exited, status=1/FAILURE)
May 01 11:23:07 STEP-Test systemd[1]: Started AlertManager Service.
May 01 11:23:07 STEP-Test alertmanager[51547]: alertmanager: error: unknown short flag '-w', try --help
May 01 11:23:07 STEP-Test systemd[1]: alertmanager.service: Main process exited, code=exited, status=1/FAILURE
May 01 11:23:07 STEP-Test systemd[1]: alertmanager.service: Failed with result 'exit-code'.
I have tried removing and reinstalling, but the result was the same. I checked my configuration file, but I can't figure out the problem.
You have an error in your flags for alertmanager: use --web.external-url rather than -web.external-url (the parser reads the single-dash form as the unknown short flag -w, as the log above shows).
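With that fix, the ExecStart line from the status output above would read as follows; reload systemd and restart afterwards:
ExecStart=/usr/local/bin/alertmanager --config.file /etc/alertmanager/alertmanager.yml --web.external-url=http://0.0.0.0:9093
sudo systemctl daemon-reload
sudo systemctl restart alertmanager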
I am trying to install Redis on my AWS server running Ubuntu 18.04, following the steps in a DigitalOcean article. When I run sudo systemctl status redis, I get the error below:
[screenshot of the error]
I tried editing the /etc/systemd/system/redis.service file and added Type=forking under the [Service] section, but I'm still getting the same error. Can anyone suggest how I can get this fixed?
Based on the same DigitalOcean tutorial, Redis is actually running fine. If you run sudo systemctl restart redis.service, you get this (note "failed" on the last line):
● redis.service - Redis In-Memory Data Store
Loaded: loaded (/etc/systemd/system/redis.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2021-06-28 12:03:11 +03; 1min 0s ago
Process: 20428 ExecStart=/usr/local/bin/redis-server /etc/redis/redis.conf (code=exited, status=
Main PID: 20428 (code=exited, status=203/EXEC)
Jun 28 12:03:11 XYZ systemd[1]: redis.service: Service hold-off time over, scheduling restar
Jun 28 12:03:11 XYZ systemd[1]: redis.service: Scheduled restart job, restart counter is at
Jun 28 12:03:11 XYZ systemd[1]: Stopped Redis In-Memory Data Store.
Jun 28 12:03:11 XYZ systemd[1]: redis.service: Start request repeated too quickly.
Jun 28 12:03:11 XYZ systemd[1]: redis.service: Failed with result 'exit-code'.
Jun 28 12:03:11 XYZ systemd[1]: Failed to start Redis In-Memory Data Store.
But if you run sudo service redis-server status, you get this (note "running" on the third line):
● redis-server.service - Advanced key-value store
Loaded: loaded (/lib/systemd/system/redis-server.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2021-06-28 11:50:13 +03; 19min ago
Docs: http://redis.io/documentation,
man:redis-server(1)
Process: 19278 ExecStop=/bin/kill -s TERM $MAINPID (code=exited, status=0/SUCCESS)
Process: 19371 ExecStart=/usr/bin/redis-server /etc/redis/redis.conf (code=exited, status=0/SUCC
Main PID: 19382 (redis-server)
Tasks: 4 (limit: 4915)
CGroup: /system.slice/redis-server.service
└─19382 /usr/bin/redis-server 127.0.0.1:6379
Jun 28 11:50:13 XYZ systemd[1]: Starting Advanced key-value store...
Jun 28 11:50:13 XYZ systemd[1]: redis-server.service: Can't open PID file /var/run/redis/red
Jun 28 11:50:13 XYZ systemd[1]: Started Advanced key-value store.
After searching for hours, it seems it's just a difference between systemctl and service and nothing more; the actual Redis server is running fine. Correct me if that's not the case. Here's the link: https://askubuntu.com/questions/903354/difference-between-systemctl-and-service-commands
You can also check that Redis is working by running redis-cli ping; it should print PONG.
I also encountered this problem and went back over my steps. I found that when I set permissions on /var/lib/redis, I had entered the wrong command, leaving the redis account with no access to /var/lib/redis. Fixing the ownership and restarting solved it:
sudo chown redis:redis /var/lib/redis
sudo systemctl restart redis
After that, the restart succeeded.
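You can verify the fix with:
ls -ld /var/lib/redis   # owner and group should be redis after the chown
redis-cli ping          # should print PONG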
On one of my AWS EC2 instances running Ubuntu 16.04, I'm getting the following errors filling up /var/log/syslog:
Jul 17 18:11:21 Mysql-Slave systemd[1]: Stopped The CloudWatch Logs agent.
Jul 17 18:11:21 Mysql-Slave systemd[1]: Started The CloudWatch Logs agent.
Jul 17 18:11:26 Mysql-Slave systemd[1]: awslogs.service: Main process exited, code=exited, status=255/n/a
Jul 17 18:11:26 Mysql-Slave systemd[1]: awslogs.service: Unit entered failed state.
Jul 17 18:11:26 Mysql-Slave systemd[1]: awslogs.service: Failed with result 'exit-code'.
Jul 17 18:11:26 Mysql-Slave systemd[1]: awslogs.service: Service hold-off time over, scheduling restart.
Jul 17 18:11:26 Mysql-Slave systemd[1]: Stopped The CloudWatch Logs agent.
Jul 17 18:11:26 Mysql-Slave systemd[1]: Started The CloudWatch Logs agent.
Jul 17 18:11:32 Mysql-Slave systemd[1]: awslogs.service: Main process exited, code=exited, status=255/n/a
Jul 17 18:11:32 Mysql-Slave systemd[1]: awslogs.service: Unit entered failed state.
Jul 17 18:11:32 Mysql-Slave systemd[1]: awslogs.service: Failed with result 'exit-code'.
Jul 17 18:11:32 Mysql-Slave systemd[1]: awslogs.service: Service hold-off time over, scheduling restart.
Jul 17 18:11:32 Mysql-Slave systemd[1]: Stopped The CloudWatch Logs agent.
Jul 17 18:11:32 Mysql-Slave systemd[1]: Started The CloudWatch Logs agent.
The /var/log/awslogs.log contains these messages:
database is locked
2018-07-17 20:59:01,055 - cwlogs.push - INFO - 27074 - MainThread - Missing or invalid value for use_gzip_http_content_encoding config. Defaulting to using gzip encoding.
2018-07-17 20:59:01,055 - cwlogs.push - INFO - 27074 - MainThread - Using default logging configuration.
database is locked
2018-07-17 20:59:06,549 - cwlogs.push - INFO - 27104 - MainThread - Missing or invalid value for use_gzip_http_content_encoding config. Defaulting to using gzip encoding.
2018-07-17 20:59:06,549 - cwlogs.push - INFO - 27104 - MainThread - Using default logging configuration.
database is locked
2018-07-17 20:59:12,054 - cwlogs.push - INFO - 27110 - MainThread - Missing or invalid value for use_gzip_http_content_encoding config. Defaulting to using gzip encoding.
2018-07-17 20:59:12,054 - cwlogs.push - INFO - 27110 - MainThread - Using default logging configuration.
Any pointers for troubleshooting this would be a great help.
A similar issue was posted in this thread: https://forums.aws.amazon.com/thread.jspa?threadID=165134
I did the following:
a) Stopped the awslogs service
$ service awslogs stop ## Amazon Linux
OR
$ service awslogsd stop ## Amazon Linux 2
b) Deleted the agent-state file in /var/awslogs/state/ (I renamed it in my case)
$ cd /var/awslogs/state && mv agent-state agent-state.old ## Amazon Linux
OR
$ cd /var/lib/awslogs && mv agent-state agent-state.old ## Amazon Linux 2
c) Restarted the awslogs service
$ service awslogs start ## Amazon Linux
OR
$ sudo systemctl start awslogsd ## Amazon Linux 2
A new agent-state file was created as a result, and the errors mentioned in my post disappeared.
Please try the following commands, based on your Linux version:
sudo service awslogs start
If you are running Amazon Linux 2, try the command below:
sudo systemctl start awslogsd
It took me two hours to figure this out.
In my case, I found duplicate entries for some properties in the /etc/awslogs/awslogs.conf file. (Not all were duplicates; some of the properties were commented out, and I had uncommented them to set values.) That didn't work, so I scrolled to the bottom of the file and found the following entries. Setting the values on these properties worked:
[/var/log/messages]
datetime_format = %b %d %H:%M:%S
file = /home/ec2-user/application.log
buffer_duration = 5000
log_stream_name = {instance_id}
initial_position = start_of_file
log_group_name = MyProject
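After setting these values (and removing the duplicates higher up), restart the agent so it re-reads the config. Assuming the standard service names:
sudo service awslogs restart    ## Amazon Linux / Ubuntu
sudo systemctl restart awslogsd ## Amazon Linux 2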
I'm using Docker Tools on Windows. The create command was working perfectly last week, and I managed to create a number of machines on DigitalOcean. Then I tried today with no success. I repeated the same command with different regions, and I always get the same result:
λ docker-machine create -d digitalocean --digitalocean-access-token=MYTOKEN --digitalocean-region=ams2 vmname
Running pre-create checks...
Creating machine...
(fernu) Creating SSH key...
(fernu) Creating Digital Ocean droplet...
(fernu) Waiting for IP address to be assigned to the Droplet...
Waiting for machine to be running, this may take a few minutes...
Detecting operating system of created instance...
Waiting for SSH to be available...
Detecting the provisioner...
Provisioning with ubuntu(systemd)...
Installing Docker...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
Error creating machine: Error running provisioning: ssh command error:
command : sudo systemctl -f start docker
err : exit status 1
output : Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
If I execute the suggested command:
root@fernu:~# systemctl status docker.service
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/docker.service.d
└─10-machine.conf
Active: inactive (dead) (Result: exit-code) since Fri 2017-06-30 20:56:13 UTC; 8min ago
Docs: https://docs.docker.com
Process: 4943 ExecStart=/usr/bin/docker daemon -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --storage-driver aufs --tlsverify --tlscacert /etc/docker/ca.pem --tlscert /etc/docker/server.pem --tlskey /etc/docker/server-key.pem --label provider=digitalocean (code=exited, status=1/FAILURE)
Main PID: 4943 (code=exited, status=1/FAILURE)
Jun 30 20:56:13 fernu systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Jun 30 20:56:13 fernu systemd[1]: Failed to start Docker Application Container Engine.
Jun 30 20:56:13 fernu systemd[1]: docker.service: Unit entered failed state.
Jun 30 20:56:13 fernu systemd[1]: docker.service: Failed with result 'exit-code'.
Jun 30 20:56:13 fernu systemd[1]: docker.service: Service hold-off time over, scheduling restart.
Jun 30 20:56:13 fernu systemd[1]: Stopped Docker Application Container Engine.
Jun 30 20:56:13 fernu systemd[1]: docker.service: Start request repeated too quickly.
Jun 30 20:56:13 fernu systemd[1]: Failed to start Docker Application Container Engine.
Any help would be appreciated.
Update
It's working with Ubuntu 14.04 (--digitalocean-image=ubuntu-14-04-x64), so it seems like a problem with the default image (ubuntu-16-04-x64).
This seems to be hitting a lot of people. TL;DR: There is a bug in docker-machine v0.12.0 and this issue can be resolved by upgrading.
Logging in to the DigitalOcean instance and running journalctl -xe provides more information:
-- Unit docker.service has begun starting up.
Jul 07 20:03:52 docker-sandbox docker[4930]: `docker daemon` is not supported on Linux. Please run `do
Jul 07 20:03:52 docker-sandbox systemd[1]: docker.service: Main process exited, code=exited, status=1/
Jul 07 20:03:52 docker-sandbox systemd[1]: Failed to start Docker Application Container Engine.
-- Subject: Unit docker.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
The key here is "docker daemon is not supported on Linux". A bug in docker-machine's version comparison code caused an incorrect systemd unit file (located at /etc/systemd/system/docker.service.d/10-machine.conf) to be generated on certain versions of Ubuntu.
A fix has been committed and a new release (v0.12.1) was made.
You can grab the latest release at: https://github.com/docker/machine/releases/tag/v0.12.1
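If you can't upgrade right away, a workaround consistent with the explanation above is to fix the generated drop-in on the droplet by hand so it launches dockerd instead of the removed docker daemon form (a sketch; the drop-in path comes from the status output above):
sudo sed -i 's|/usr/bin/docker daemon|/usr/bin/dockerd|' /etc/systemd/system/docker.service.d/10-machine.conf
sudo systemctl daemon-reload
sudo systemctl restart docker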