How to notify/alert when the SSH/RDP service is not running in a GCE VM - google-cloud-platform

GCP Operations suite does not have a pre-built metric to monitor whether the SSH (Linux) / RDP (Windows) service is running in a GCE VM. I have also searched Cloud Logging but could not find any logs about the service running in the OS.
Is there any other way to monitor the health of the SSH / RDP service in GCE VMs, and to send an alert when it is not in a running state?

You can use Cloud Logging to monitor the status of the SSH daemon and connections to it on your VM instances:
install the logging agent that sends logs to Cloud Logging (formerly Stackdriver):
curl -sSO https://dl.google.com/cloudagents/add-logging-agent-repo.sh
sudo bash add-logging-agent-repo.sh
sudo apt-get update
sudo apt-get install google-fluentd
sudo apt-get install -y google-fluentd-catch-all-config
edit the logging agent configuration located at /etc/google-fluentd/config.d/syslog.conf to collect entries from auth.log by adding:
<source>
  type tail
  # Parse the timestamp, but still collect the entire line as 'message'
  format /^(?<message>(?<time>[^ ]*\s*[^ ]* [^ ]*) .*)$/
  path /var/log/auth.log
  pos_file /var/lib/google-fluentd/pos/auth.log.pos
  read_from_head false
  tag auth
</source>
restart google-fluentd to apply the updated configuration:
sudo service google-fluentd restart
go to Cloud Console -> Logging -> Logs Explorer and look for events related to the SSH daemon (see the example query below)
configure alerts; in addition, have a look at the 3rd-party examples here and here.
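For example, a minimal Logs Explorer query sketch, assuming the agent ingests auth.log under the auth tag from the configuration above (so a single "message" field becomes textPayload); replace <project-id> with your project:
resource.type="gce_instance"
log_name="projects/<project-id>/logs/auth"
textPayload:"sshd"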
You can use the same approach for checking and monitoring RDP on Windows:
install the logging agent
edit the logging agent configuration located at C:\Program Files (x86)\Stackdriver\LoggingAgent\fluent.conf
restart the logging agent
go to Cloud Console -> Logging -> Logs Explorer and look for events related to the RDP service (see the example query below)
configure alerts
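As a rough sketch for Windows (the log name is an assumption about how the legacy agent forwards the Windows Event Log, and may differ by agent version), you could search for Remote Desktop Services (TermService) events:
resource.type="gce_instance"
log_name="projects/<project-id>/logs/windows_event_log"
"TermService"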
EDIT
In order for the incidents not to close automatically, you can create two log-based metrics:
one that counts SSH termination signals and one that counts SSH startups.
Then create an alert that fires when the first one is above zero AND the second one is below one.
That way, the incident will last until sshd is started again.
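For example, a sketch of creating the two metrics with gcloud (the metric names and match strings are assumptions; adjust the filters to the exact sshd messages you see in your auth log):
gcloud logging metrics create sshd-terminated \
  --description="sshd received a termination signal" \
  --log-filter='resource.type="gce_instance" log_name="projects/<project-id>/logs/auth" textPayload:"Received signal 15"'
gcloud logging metrics create sshd-started \
  --description="sshd started listening" \
  --log-filter='resource.type="gce_instance" log_name="projects/<project-id>/logs/auth" textPayload:"Server listening on"'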

Related

Dataproc YARN container logs location

I'm aware of the existence of this thread:
where are the individual dataproc spark logs?
However, if I SSH into a worker node VM and navigate to the /tmp folder, this is all I see:
Is anyone able to point me to the exact location?
Also, for some reason I can't navigate directly from the UI to the stdout/stderr of a single task: it says I'm unable to reach the site whenever I try to access the logs from the link in the UI.
The previous answer looks to be outdated.
If you are talking about the container logs, then:
On clusters with a 1.5 or newer image, Yarn log aggregation is enabled by default and the remote log directory is set to be the temp bucket for the cluster. You can look the location up under /etc/hadoop/conf/yarn-site.xml, and the configuration is yarn.nodemanager.remote-app-log-dir.
On clusters with a 1.4 or older image, log aggregation is not enabled by default, so the container logs will be under /var/log/hadoop-yarn/userlogs on the worker nodes where the containers were run.
In 1.5 or newer versions, dataproc:yarn.log-aggregation.enabled is set to true by default. Under the hood, the yarn.log-aggregation-enable property in /etc/hadoop/conf/yarn-site.xml is set to true, and the container logs are controlled by the yarn.nodemanager.remote-app-log-dir property which is set to gs://<cluster-tmp-bucket>/<cluster-uuid>/yarn-logs by default. Check this doc for more details on Dataproc tmp bucket.
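For example, to confirm the setting on a cluster node and then browse the aggregated logs (the bucket path is illustrative and follows the default layout described above):
# On any cluster node: check where YARN aggregates container logs
grep -A1 'yarn.nodemanager.remote-app-log-dir' /etc/hadoop/conf/yarn-site.xml
# From anywhere with access to the temp bucket: list the aggregated logs
gsutil ls gs://<cluster-tmp-bucket>/<cluster-uuid>/yarn-logs/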
In addition to dumping the logs at the location, there are several other ways to view the logs:
YARN CLI: If the cluster has not been deleted, SSH into the master node, then run yarn logs -applicationId <app-id>. If you are not sure about the app ID, run yarn application -list -appStates ALL to list all apps. This method works only when log aggregation is enabled.
YARN Application Timeline server: If you enabled Component Gateway and the cluster has not been deleted, open the cluster's "YARN Application Timeline" link in the "WEB INTERFACES" tab of the cluster's web UI, find the application attempt and its containers, and click the "Logs" link. This method works only when log aggregation is enabled.
Cloud Logging: YARN container logs are available in Cloud Logging even after the cluster is deleted.
3.1) When dataproc:dataproc.logging.stackdriver.job.yarn.container.enable is false (the default), or the job is submitted through the CLI (e.g., spark-submit) instead of the Dataproc jobs API, the logs are under the projects/<project-id>/logs/yarn-userlogs log name of the cluster resource:
resource.type="cloud_dataproc_cluster"
resource.labels.cluster_name=<cluster-name>
resource.labels.cluster_uuid=<cluster-uuid>
log_name="projects/<project-id>/logs/yarn-userlogs"
3.2) When dataproc:dataproc.logging.stackdriver.job.yarn.container.enable is true, the logs are under the projects/<project-id>/logs/dataproc.job.yarn.container log name of the job resource:
resource.type="cloud_dataproc_job"
resource.labels.job_id=<job_id>
resource.labels.job_uuid=<job_uuid>
log_name="projects/<project-id>/logs/dataproc.job.yarn.container"
In Dataproc 1.4 (deprecated) or older versions, the yarn.log-aggregation-enable property in /etc/hadoop/conf/yarn-site.xml is set to false by default, and the container logs are controlled by the yarn.nodemanager.log-dirs property, which is set to /var/log/hadoop-yarn/userlogs by default.
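So on a 1.4 or older worker node, a quick way to find a container's output is something like the following sketch (application and container IDs are placeholders; the files present usually include stdout, stderr and syslog):
ls /var/log/hadoop-yarn/userlogs/<application-id>/<container-id>/
cat /var/log/hadoop-yarn/userlogs/<application-id>/<container-id>/stderr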

Amazon Cloudwatch Agent won't start

After installing the CloudWatch agent on an Amazon Linux 2 EC2 instance, I ran cloudwatch-agent-ctl status.
The command shows the status as stopped, and it also prints the following message:
cwagent-otel-collector will not be started "as it has not been configured yet"
I'm not sure whether the above message is what keeps CWAgent from starting. Any pointers on how to find out why it won't start?
Before you can start your CloudWatch agent, you must configure it. From the docs:
Before running the CloudWatch agent on any servers, you must create a CloudWatch agent configuration file.
Follow the docs on how to set up the configuration file before running the agent.
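As a sketch of what that looks like on Amazon Linux 2, using the agent's default install paths (the config location shown is the wizard's default suggestion):
# Generate a config file interactively
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
# Load the config and start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json
# Verify
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a status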

configured logging driver does not support reading : Docker

I am running my Docker container in AWS ECS. When I try to execute the command below to read the logs from the container, I get the following error.
command: docker logs -f "Container ID"
Error response from daemon: configured logging driver does not support reading.
Any feasible solutions are welcome.
According to the information commented by David Maze, your container must be running with the awslogs log driver.
Here is an introduction to the setting.
After changing the log driver to json-file, you can get logs by executing docker logs <container-id/name>.
But still note this:
If using the Fargate launch type, the only supported value is awslogs.
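If you are on the EC2 launch type and want docker logs to work, a rough sketch of the container's logConfiguration in the task definition could look like this (the option values are illustrative):
"logConfiguration": {
  "logDriver": "json-file",
  "options": {
    "max-size": "10m",
    "max-file": "3"
  }
}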
You are using the awslogs log driver, and the docker logs command is not available for that driver.
From the docs:
The docker logs command is not available for drivers other than json-file and journald.
See the limitations of logging drivers.
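Since awslogs ships the container output to CloudWatch Logs, you can read it there instead. A minimal sketch with AWS CLI v2 (the log group name is a placeholder; use whatever your task definition's awslogs-group option specifies):
aws logs tail /ecs/<your-log-group> --follow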
I have had this same issue before.
In my specific case, the Task Definition for that service in ECS had the Log Configuration's logDriver set to fluentd.
To resolve the issue, I created another Task Definition with all the same parameters, except I left the Log Configuration section's Log Driver at the default. Note that if you click on it you'll see other log drivers, including fluentd.
After that I pointed the service to this task definition in my ECS cluster, and now I can see the logs with docker logs <container id>.
I got this issue because Docker was out of storage.
I fixed it by removing old Docker images with docker rmi <image_id_1> <image_id_2>.

Installing Amazon Inspector Service

I'm about to install and use Amazon Inspector. We have many EC2 instances behind an ELB, plus some EC2 instances that are launched via Auto Scaling.
My question: does Amazon Inspector do its work locally or globally? In other words, is the monitoring done only on the instance it is installed on, or can it be configured to include all the instances in the infrastructure?
If Inspector has to be installed on every EC2 instance, can Auto Scaling be configured to launch new instances with Inspector already installed, and if so, how can I do that?
I asked a similar question on the Amazon forum but got no response.
In the end I used the following feature to customise the EC2 instances that my application gets deployed to:
https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/customize-containers-ec2.html
Basically, at the root of your .war file you need a folder named '.ebextensions', and in there a .config file containing the commands to install the Inspector agent.
So my file 'inspector-agent.config' looks like this:
# Errors get logged to /var/log/cfn-init.log. See also /var/log/eb-tools.log
commands:
  # Download the agent installation script
  "01-agent-repository":
    command: sudo wget https://inspector-agent.amazonaws.com/linux/latest/install
  # Run the installation script
  "02-run-installation-script":
    command: sudo bash install
I've found the answer and the solution: you have to install Amazon Inspector on each EC2 instance in order to inspect them all with Amazon Inspector.
As for Auto Scaling, I applied Amazon Inspector on the main EC2 servers and took an image of them (after inspecting all the EC2s and fixing all the issues). Then I configured Auto Scaling to launch from the new AMIs (the inspected AMIs).
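Alternatively (an untested sketch), instead of baking an AMI you could have the Auto Scaling launch configuration install the agent at first boot through user data, using the same installer script as above:
#!/bin/bash
# Runs once at first boot of each new instance
wget https://inspector-agent.amazonaws.com/linux/latest/install -O /tmp/install
sudo bash /tmp/install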

'No hosts succeeded' error on AWS CodeDeploy service

I am trying to set up AWS CodeDeploy for my PHP web app. I have created a CodeDeploy app and a deployment group on the AWS console. I have created the necessary revision bundle with the appspec yaml file. The revision bundle is stored on Amazon S3.
When I click the 'Deploy this revision' button on the AWS console, it gives me a 'No hosts succeeded' error. I went through the Technical FAQ and could not find any answers. How can I resolve this error?
UPDATE: I now understand that this error has something to do with the Minimum Healthy Hosts count. But I still do not understand how AWS calculates the healthiness of a host.
Basically what it's saying is "The CodeDeploy service on your EC2 instance is not running"...
As for why a deployment failed: host health is fairly simple. A host is healthy if it succeeded in deploying the last deployment to it. A host is unhealthy if it failed. A host is unknown if it was skipped and had no previous deployment.
There are other aspects of host health that affect the order in which hosts are deployed to in the next deployment, but that's not going to affect your deployment failing with "No hosts succeeded".
A host can fail its individual deployment if any of its lifecycle events failed. A lifecycle event can fail due to a service-side timeout waiting for the agent to respond, or because the host agent reports an error executing the command. You can check the host agent log for more details on exactly why the host agent reported a failure.
If you are hitting the service-side timeouts, you should check that the host agent is running and is able to poll for commands correctly. You might have accidentally restricted access in your VPC configuration, or not granted the instance profile the appropriate permissions to poll for commands.
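To check that from the instance itself on Linux, something like this (the log path is the agent's usual default location):
# Is the agent installed and running?
sudo service codedeploy-agent status
# The agent log normally shows why a lifecycle event or poll failed
sudo tail -n 100 /var/log/aws/codedeploy-agent/codedeploy-agent.log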
This error message means you are not running the CodeDeploy agent on the EC2 instance targeted by your deployment group.
1) Download the latest version of the CodeDeploy agent from S3 (choose your region)
PS> Read-S3Object -BucketName aws-codedeploy-eu-west-1 -Key latest/codedeploy-agent.msi -File c:\temp\codedeploy-agent.msi
2) Install the CodeDeploy agent
cmd> c:\temp\codedeploy-agent.msi /quiet /l c:\temp\host-agent-install-log.txt
3) Start the CodeDeploy agent
PS> Start-Service -Name codedeployagent
AWS CodeDeploy guide: http://docs.aws.amazon.com/codedeploy/latest/userguide/how-to-run-agent.html#how-to-run-agent-install-windows
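To verify the agent is running afterwards, something like:
PS> Get-Service -Name codedeployagent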
I just ran into this issue myself. My solution was to run:
ntpdate-debian
If you are running CentOS, it's something like:
ntpdate pool.ntp.org
For me the time was off, which was causing issues with the CodeDeploy agent.
Now, if this doesn't solve your problem, first make sure the problem really is that your CodeDeploy agent is not registering. I have had this issue before and it was because one of my instances was in a failed state from a botched deployment, so be sure to double check (ELB status, tests, etc.).
Then you should enable logging for your CodeDeploy agent by setting log_aws_wire and verbose to true in /etc/codedeploy-agent/conf/codedeployagent.yml, and then restart the CodeDeploy agent. Tail the logs and you should see the reason for your problems.
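A rough sketch of those steps (the key names below are assumptions about the agent's YAML config format; check the file before editing):
sudo sed -i 's/:verbose: false/:verbose: true/' /etc/codedeploy-agent/conf/codedeployagent.yml
sudo sed -i 's/:log_aws_wire: false/:log_aws_wire: true/' /etc/codedeploy-agent/conf/codedeployagent.yml
sudo service codedeploy-agent restart
sudo tail -f /var/log/aws/codedeploy-agent/codedeploy-agent.log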