I have a project on GCP with a VM instance in it (CentOS 7). I want to monitor the status of some services running on the VM. Is there a way to monitor them through the Ops Agent?
The objective would then be to have alerts based on the status of the service (using Grafana). Using agent.googleapis.com/processes/cpu_time in the GCP process metrics does show the processes currently running on the VM, but an alert based on the CPU time of a process is not as clear-cut as an alert based on the status of a service.
I also have a hard time finding an answer to what the difference between a service and a process is in UNIX. Based on this answer https://superuser.com/questions/1235647/what-is-the-difference-between-a-service-and-a-process it seems that a service differs in that it runs continuously(?)
Does this mean that monitoring the processes associated with the service is not the same as monitoring the service itself, since a process may be killed while the service continues running?
You can set up a custom alert on any running process on GCP.
In the alert policy you need to add the process name (the path of the executable).
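For example, an alert condition on the per-process metric mentioned in the question might use a filter along these lines (a sketch only: the process name "nginx" is hypothetical, and the exact label key should be confirmed against the metric descriptor in Metrics Explorer):

```
resource.type = "gce_instance"
AND metric.type = "agent.googleapis.com/processes/cpu_time"
AND metric.labels.process = "nginx"
```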
Please go through the video below, which explains this in detail.
https://youtu.be/aaa_kwM7zkA
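If what you actually need is the systemd unit state rather than a process-level metric, one common workaround (not an Ops Agent feature, just a pattern) is a small script, run from cron or a systemd timer, that asks systemd for the unit's status and writes it to a custom metric you can then alert on from Cloud Monitoring or Grafana. A minimal Python sketch, assuming the google-cloud-monitoring client library; the project ID, unit name, and metric type custom.googleapis.com/service/up are placeholders:

```python
import subprocess
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"   # placeholder: your GCP project ID
UNIT = "nginx.service"      # placeholder: the systemd unit to watch


def unit_is_active(unit: str) -> bool:
    # `systemctl is-active --quiet` exits 0 only when the unit is active
    return subprocess.run(["systemctl", "is-active", "--quiet", unit]).returncode == 0


client = monitoring_v3.MetricServiceClient()

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/service/up"  # placeholder metric type
series.metric.labels["service"] = UNIT
series.resource.type = "global"
series.resource.labels["project_id"] = PROJECT_ID

# One data point: 1 if the unit is active, 0 otherwise
now = int(time.time())
series.points = [
    monitoring_v3.Point(
        {
            "interval": {"end_time": {"seconds": now}},
            "value": {"int64_value": 1 if unit_is_active(UNIT) else 0},
        }
    )
]

client.create_time_series(name=f"projects/{PROJECT_ID}", time_series=[series])
```

An alert that fires when this metric is 0 (or when data stops arriving) then tracks the service state directly, regardless of which processes the service happens to own at the moment.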
I am trying to create a simple C++ program that sends something like a real-time status message to AWS CloudWatch to report that it is up and running, with the status going offline when it is closed (a real-time online/offline status). The C++ program will be installed on multiple users' computers, so there would be a dashboard of sorts on CloudWatch. Is this even possible? I am lost on AWS between Alarms/Logs/Metrics/Events, etc.
I also want to send some stats from each PC where the program is installed, such as CPU usage. Is it possible to make a dashboard on CloudWatch to monitor this as well? Am I free to create a dashboard with whatever data I want? All the tutorials I found talk about integrating CloudWatch with other AWS services (like Lambda and EC2), which isn't my case.
Thank you in advance.
The best way to monitor a process is with the CloudWatch agent's procstat plugin. First, create a CloudWatch agent configuration file that identifies the process on the EC2 instance (for example, by its PID file) and monitor the memory_rss metric of the process. You can read more here.
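A minimal sketch of such a configuration, assuming the process writes a PID file at a hypothetical path (procstat can also match by exe or pattern instead of pid_file):

```json
{
  "metrics": {
    "metrics_collected": {
      "procstat": [
        {
          "pid_file": "/var/run/myapp.pid",
          "measurement": ["cpu_usage", "memory_rss"]
        }
      ]
    }
  }
}
```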
For the stats, you can install the CloudWatch agent on each machine and collect the necessary metrics. You can read more here.
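For the online/offline part of the question, one approach (not the only one) is to publish a custom heartbeat metric from each machine; the PutMetricData call used below via Python's boto3 also exists in the AWS SDK for C++. The namespace, metric, and dimension names are hypothetical:

```python
import socket

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # region is a placeholder

# Publish a 1 every minute while the program runs; a CloudWatch alarm configured to
# treat missing data as breaching then flips to ALARM when the heartbeats stop.
cloudwatch.put_metric_data(
    Namespace="MyApp/Heartbeat",  # hypothetical namespace
    MetricData=[
        {
            "MetricName": "Online",
            "Dimensions": [{"Name": "Hostname", "Value": socket.gethostname()}],
            "Value": 1.0,
            "Unit": "Count",
        }
    ],
)
```

Any custom metrics published this way can be placed on a CloudWatch dashboard, so you are free to build the dashboard from whatever data you send.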
We have set up a fluentd agent on a GCP VM to push logs from a syslog server (the VM) to GCP's Google Cloud Logging. The current setup is working fine and is pushing more than 300k log entries to Stackdriver (Google Cloud Logging) per hour.
Due to increased traffic, we are planning to increase the number of VMs behind the load balancer. However, the new VM with the fluentd agent is not able to push logs to Stackdriver. When the VM is first activated, it sends a few entries to Stackdriver, and after that it stops working.
I tried the options below to set up the fluentd agent and to resolve the issue:
Create a new VM from scratch and install the fluentd logging agent using this Google Cloud documentation.
Duplicate the already-working VM (with the logging agent) by creating an image.
Restart the VM.
Reinstall the logging agent.
Debugging I did:
Checked all the configuration for the google-fluentd agent. Everything is correct and exactly matches the currently working VM instance.
Checked "/var/log/google-fluentd/google-fluentd.log" for logging errors; there are none.
Checked that the Logging API is enabled. As there are already a few million logs per day, I assume we are fine on that front.
Checked the CPU and memory consumption. It is close to 0.
Tried all the solutions I could find on Google (there are not many).
It would be great if someone could help me identify where exactly I am going wrong. I have checked the configuration/setup files multiple times and they look fine.
Troubleshooting steps to resolve the issue:
Check whether you are using the latest version of the fluentd agent. If not, try upgrading it; refer to Upgrading the agent for information.
If you are running very old Compute Engine instances, or Compute Engine instances created without the default credentials, you must complete the Authorizing the agent procedures.
Another point to check is how you are configuring an HTTP proxy. If you are using an HTTP proxy for proxying requests to the Logging and Monitoring APIs, check whether the metadata server is reachable. The metadata server has to be reachable directly (not through the proxy) when you configure an HTTP proxy.
Check whether you have any log exclusions configured that are preventing the logs from arriving. Refer to Exclusion filters for information.
Try uninstalling the fluentd agent and using the Ops Agent instead (note that it collects syslog logs with no extra setup), then check whether you can see the logs. The Ops Agent combines logging and metrics into a single agent, using Fluent Bit for logs (which supports high-throughput logging) and the OpenTelemetry Collector for metrics. Refer to Ops Agent for more information.
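For reference, the Ops Agent's logging pipeline is configured in /etc/google-cloud-ops-agent/config.yaml; a sketch along the lines of the documented default syslog collection looks like this:

```yaml
logging:
  receivers:
    syslog:
      type: files
      include_paths:
        - /var/log/messages
        - /var/log/syslog
  service:
    pipelines:
      default_pipeline:
        receivers: [syslog]
```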
We are looking into adding autoscaling to a portion of our application using AWS Simple Queue Service (SQS), which would launch EC2 On-Demand or Spot Instances based on queue backlog.
One question I had is: how do you deal with collecting logs from autoscaled instances? New instances are spun up from an image and then shut down when their work is complete. Currently, if there is an issue with one of our services that causes it to crash, we have a system to automatically restart the service, and the logs and core dump files are there to review. If we switch to an autoscaling system where new instances are spun up, how do you get logs and core dump files when there is a failure, particularly if the instance is spun down?
Good practice is to ship these logs and aggregate them somewhere else, and there are many services, such as Datadog and Rapid7, that will do this for you at a cost.
AWS, however, provides CloudWatch Logs, which gives you a central place to store and view logs. It also allows you to give users access to logs in the AWS console without them having to SSH onto a server.
Shipping your logs to CloudWatch Logs requires installing the CloudWatch agent on your server and specifying in the config which logs to ship.
You could install the CloudWatch agent once and create an AMI of that server to use in your Auto Scaling group, or install and configure the CloudWatch agent in user data every time a server is spun up.
All the information you need to get started can be found here:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html
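As a starting point, a minimal sketch of the logs section of the agent configuration; the file path and log group name are hypothetical, and {instance_id} is a placeholder the agent substitutes itself:

```json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/myservice/app.log",
            "log_group_name": "myservice",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
```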
Can we use an existing EC2 instance's details while configuring Data Pipeline? If that is possible, what EC2 details do we need to provide while creating a pipeline?
Yes, it is possible. According to AWS support:
"You can install Task Runner on computational resources that you manage, such as an Amazon EC2 instance, or a physical server or workstation. Task Runner can be installed anywhere, on any compatible hardware or operating system, provided that it can communicate with the AWS Data Pipeline web service.
This approach can be useful when, for example, you want to use AWS Data Pipeline to process data that is stored inside your organization’s firewall. By installing Task Runner on a server in the local network, you can access the local database securely and then poll AWS Data Pipeline for the next task to run. When AWS Data Pipeline ends processing or deletes the pipeline, the Task Runner instance remains running on your computational resource until you manually shut it down. The Task Runner logs persist after pipeline execution is complete."
I did this myself, as it takes a while for the pipeline to start up; the startup time can be 10-15 minutes depending on unknown factors.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-how-task-runner-user-managed.html
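Per that documentation, after installing Java and downloading the Task Runner jar, you start it on your own instance with your credentials and a worker group name, which is the main detail your pipeline activities then reference. A sketch of the invocation (the jar version, file names, and worker group below are illustrative; check the linked page for the exact options):

```
java -jar TaskRunner-1.0.jar --config ~/credentials.json --workerGroup=wg-12345 --region=us-east-1 --logUri=s3://mybucket/taskrunner-logs
```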
We are using Bamboo to spin up (through PowerShell on the build server) an AWS Windows 2008 R2 instance as a target for a web deployment.
What is the best method for determining that the target instance is ready for a deployment (all services up and running, etc.)?
There's no really easy way to do this with Windows instances. Your best bet is to write an infrastructure test that checks whether a service is running in the target environment and keeps retrying until it either verifies that the service is available or a timeout occurs. At that point you can start your deployment.
I generally do this with a Cucumber script that checks a service's status continually until it gets an answer.
You could also simply wait for a fixed, appropriate amount of time before deploying, although this option wouldn't be my recommendation.
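As an illustration of the retry-until-ready idea, here is a minimal Python sketch that polls a TCP port on the target instance from the build server until the service accepts connections or a deadline passes. The host, port, and timings are placeholders, and reaching the port is only a proxy for "the service is up":

```python
import socket
import time


def wait_for_service(host: str, port: int, timeout: float = 600, interval: float = 10) -> bool:
    """Poll host:port until it accepts a TCP connection or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return True  # the service answered; safe to deploy
        except OSError:
            time.sleep(interval)  # not up yet; wait and retry
    return False


if __name__ == "__main__":
    # Placeholders: the new instance's address and the port the web service listens on
    if not wait_for_service("ec2-198-51-100-1.compute-1.amazonaws.com", 80):
        raise SystemExit("Target instance never became ready; aborting deployment")
```

The same loop structure works for richer checks, such as querying a Windows service's state remotely instead of a port, which is closer to what the Cucumber approach above does.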