Monitoring the Airflow web server when using Google Cloud Composer - google-cloud-platform

How can I monitor the Airflow web server when using Google Cloud Composer? If the web server goes down or crashes due to an error, I would like to receive an alert.

You can use Stackdriver Monitoring: https://cloud.google.com/composer/docs/how-to/managing/monitoring-environments. Alerts can also be set in Stackdriver.

At this time, fine-grained metrics for the Airflow web server are not exported to Stackdriver, so it cannot be monitored like other resources in a Cloud Composer environment (such as the GKE cluster, GCE instances, etc.). This is because the web server runs in a tenant project, separate from the customer project where most of your environment's resources live.
However, as of March 11, 2019, Airflow web server logs in Composer are visible in Stackdriver. That means that, for the time being, you can configure logs-based metrics on the web server log (matching on lines that contain Traceback, for example).
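As a rough sketch of that approach, the snippet below creates such a logs-based metric with the Python Cloud Logging client. The project ID, metric name, and especially the filter are assumptions; check in the Logs Viewer how the web server log is actually named in your environment before relying on it.

# Sketch: a logs-based metric counting Airflow web server log lines that
# contain "Traceback". Filter and names are assumptions, not verified values.
from google.cloud import logging

client = logging.Client(project="my-project")  # placeholder project ID

metric = client.metric(
    "airflow-webserver-tracebacks",  # hypothetical metric name
    filter_=(
        'resource.type="cloud_composer_environment" '
        'AND logName:"airflow-webserver" '  # assumed web server log name
        'AND textPayload:"Traceback"'
    ),
    description="Traceback lines in the Airflow web server log",
)
if not metric.exists():
    metric.create()

Once the metric exists, you can attach a Cloud Monitoring alerting policy to it so that a burst of tracebacks (or the web server going down) triggers a notification.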

Related

Google Cloud Composer 2 shows Unhealthy

Does anybody know why this is happening? Which API does Composer call to calculate health? What are these events?
GCP Cloud Composer 2 shows unhealthy, but everything works. It seems the service account cannot access an API used to calculate health metrics.
Composer version composer-2.0.25-airflow-2.2.5
"Error from server (Forbidden): events is forbidden: User "xxxxx.iam.gserviceaccount.com'
cannot list resource "events" in API group "" in the namespace airflow-2-2-5-11111111":
GKEAutopilot authz: the request was sent before policy enforcement is enabled" timestamp : '2022-10-07T15:48:08.834865950Z"
The environment health metric depends on a Composer-managed DAG named airflow_monitoring which is triggered periodically by the airflow-monitoring pod.
The airflow_monitoring DAG is a per-environment liveness prober/healthcheck that is used to populate the Cloud Composer monitoring metric environment/healthy. It is an indicator of the overall health of your environment, or more specifically, of its ability to schedule DAGs and run tasks. This allows you to use Google Cloud Monitoring features such as metric graphs, or to set alerts for when your environment becomes unhealthy.
Assuming this DAG hasn't been deleted, you can check the airflow-monitoring logs to see if there are any problems related to reading the DAG's run statuses. You can also try troubleshooting the error in Cloud Logging using the filter:
resource.type="cloud_composer_environment"
severity=ERROR
You can find more information about the metric on the GCP Metrics List, and can explore the metric in Cloud Monitoring.
Also, you can refer to this documentation for information on Cloud Composer’s environment health metric.
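To make the alerting part concrete, here is a minimal sketch of an alerting policy on environment/healthy using the Python Cloud Monitoring client. The project ID, display names, threshold, and duration are assumptions, and in practice you would also attach notification channels to the policy.

# Sketch: alert when the boolean environment/healthy metric reports unhealthy
# samples for five minutes. Names and thresholds below are placeholders.
from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()

policy = monitoring_v3.AlertPolicy(
    display_name="Composer environment unhealthy",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="environment/healthy fraction below 1",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=(
                    'metric.type = "composer.googleapis.com/environment/healthy" '
                    'AND resource.type = "cloud_composer_environment"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_LT,
                threshold_value=1.0,
                duration={"seconds": 300},
                aggregations=[
                    monitoring_v3.Aggregation(
                        alignment_period={"seconds": 60},
                        # Turns the boolean metric into the per-minute fraction
                        # of "true" (healthy) samples.
                        per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_FRACTION_TRUE,
                    )
                ],
            ),
        )
    ],
)

client.create_alert_policy(name="projects/my-project", alert_policy=policy)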

Fluentd agent setup on GCP VM is not pushing logs to Logs Explorer

We have set up a fluentd agent on a GCP VM to push logs from a syslog server (the VM) to GCP's Google Cloud Logging. The current setup is working fine and is pushing more than 300k log entries to Stackdriver (Google Cloud Logging) per hour.
Due to increased traffic, we are planning to increase the number of VMs employed behind the load balancer. However, the new VM with the fluentd agent is not able to push logs to Stackdriver. When the VM is first activated, it sends a few entries to Stackdriver, but after that it stops working.
I tried the options below to set up the fluentd agent and resolve the issue:
Create a new VM from scratch and install fluentd logging agent using this Google Cloud documentation.
Duplicate the already working VM (with the logging agent) by creating an image
Restart the VM
Reinstall the logging agent
Debugging I did:
Checked all the configuration for the google-fluentd agent. Everything is correct and identical to the currently working VM instance.
I checked the "/var/log/google-fluentd/google-fluentd.log" for any logging errors. But there are none.
Checked if the logging API is enabled. As there are already a few million logs per day, I assume we are fine on that front.
Checked the CPU and memory consumption. It is close to 0.
Tried all the solutions I could find on Google (there are not many).
It would be great if someone could help me identify where exactly I am going wrong. I have checked the configuration/setup files multiple times and they look fine.
Troubleshooting steps to resolve the issue:
Check whether you are using the latest version of the fluentd agent. If not, try upgrading it. Refer to Upgrading the agent for information.
If you are running very old Compute Engine instances, or Compute Engine instances created without the default credentials, you must complete the Authorizing the agent procedure.
Another point to check is how you are configuring an HTTP proxy. If you are using an HTTP proxy to proxy requests to the Logging and Monitoring APIs, check whether the metadata server is reachable. The metadata server has to be reachable directly, without a proxy, when configuring an HTTP proxy.
Check whether you have any log exclusions configured that are preventing the logs from arriving. Refer to Exclusion filters for information.
Try uninstalling the fluentd agent and using the Ops Agent instead (it collects syslog logs with no extra setup), then check whether the logs appear. Combining logging and metrics into a single agent, the Ops Agent uses Fluent Bit for logs, which supports high-throughput logging, and the OpenTelemetry Collector for metrics. Refer to Ops Agent for more information.
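One additional way to narrow the problem down, sketched below under the assumption that Python and the google-cloud-logging client are available on the affected VM: write a test entry straight to the Logging API, bypassing the agent. If the entry shows up in the Logs Explorer while the agent's entries do not, credentials and API access are fine and the problem is inside the google-fluentd pipeline; if it does not show up, look at credentials, API enablement, or the proxy instead.

# Sketch: bypass google-fluentd and write directly to the Cloud Logging API
# from the affected VM. The log name is a placeholder.
from google.cloud import logging

client = logging.Client()  # uses the VM's default service-account credentials
logger = client.logger("fluentd-connectivity-test")
logger.log_text("Test entry from the new VM", severity="INFO")
print("Entry sent; look for the log 'fluentd-connectivity-test' in the Logs Explorer.")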

Stackdriver stopped logging from GKE Containers

Logs from Spring Boot applications deployed to GKE stopped showing up in Stackdriver Logging after February 2nd 2020. What happened around that time is that Stackdriver moved to a new UI, more integrated with the GCP console - could that have anything to do with it?
I do have other projects in GKE, such as a Node.js based backend, where logging to Stackdriver has continued without interruption, but there is just silence from the Spring Boot apps.
If I select "Kubernetes Container" instead of "GKE Container" in the GCP console at "Stackdriver Logging -> Logs Viewer" I do see some log statements, specifically errors like:
WARNING: You do not appear to have access to project [my-project] or it does not exist.
and
Error while fetching metric descriptors for kube-proxy: Get https://monitoring.googleapis.com/v3/projects/my-project/metricDescriptors?...
and
Error while sending request to Stackdriver Post https://monitoring.googleapis.com/v3/projects/my-project/timeSeries?...
OK, so that seems to start explaining the problem, but I haven't changed any IAM permissions, and when I compare them to those in the project hosting the Node.js GKE deployments, which continue logging fine, they seem to be the same.
Should I be changing some permissions in the project hosting the Spring Boot GKE deployments, to get rid of those Stackdriver errors? What IAM member affects those? What roles would be required?
It turns out that the GKE cluster had Legacy Stackdriver Logging and Legacy Stackdriver Monitoring enabled, and the problem was solved by setting those attributes to disabled and configuring the Stackdriver Kubernetes Engine Monitoring attribute instead.
But why Stackdriver Logging continued uninterrupted for the Node.js applications, with the legacy options enabled, is still a mystery to me.
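If you want to check the same settings on other clusters without clicking through the console, a small sketch with the Python GKE client is shown below. The project, location, and cluster names are placeholders; a value like "logging.googleapis.com" indicates the legacy integration, while the "/kubernetes" variants indicate Stackdriver Kubernetes Engine Monitoring.

# Sketch: inspect which logging/monitoring integration a GKE cluster uses.
# All resource names below are placeholders.
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()
cluster = client.get_cluster(
    name="projects/my-project/locations/europe-west1-b/clusters/my-cluster"
)
print("loggingService:   ", cluster.logging_service)
print("monitoringService:", cluster.monitoring_service)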

How to visualize AWS Elastic Beanstalk application logs

We are using AWS Elastic Beanstalk for deploying applications. Currently we have two Elastic Beanstalk applications and two worker processes (which pick messages from an AWS SQS queue and process them).
What are the best tools to view the combined logs from the Elastic Beanstalk applications and workers, plus a few more on-premises applications in the future?
Put the logs in AWS Elasticsearch and then use Kibana, which comes with Elasticsearch, to visualize them.
I followed the suggestion and configured CloudWatch Logs, Elasticsearch, and Kibana, but I am not getting all logs and all insights. I can see httpd access and error logs, and EBS access and error logs. It also involves a lot of AWS services and configuration. Since I am very new to AWS, I am having trouble setting things up.
Alternatively, as suggested by my boss, I tried New Relic. It was very simple to configure, and I can see a lot of insights into my EBS application in the New Relic console. I can also configure my browser, iOS app, Android app, and AWS infrastructure (AWS services) in one New Relic console. Some details are missing in the New Relic console, such as error stack traces and request params in POST requests; but I don't want to share such details with New Relic anyway, so that is OK.
For now I will use New Relic and CloudWatch Logs (for real-time investigation of failing HTTP REST services), but I will explore more options inside AWS: Elasticsearch and Kibana.
Many Thanks

How to integrate on premise logs with GCP stackdriver

I am evaluating stackdriver from GCP for logging across multiple micro services.
Some of these services are deployed on premise and some of them are on AWS/GCP.
Our services are either .NET or nodejs based apps and we are invested in winston for nodejs and nlog in .net.
I was looking at integrating our on-premises Node.js application with Stackdriver Logging. Looking at the documentation at https://cloud.google.com/logging/docs/setup/nodejs, it seems that we need to install the agent on any machine other than Google Compute Engine instances. Is this correct?
If we need to install the agent, is there any way I can test the logging during development? The development environment is either Windows 10 or Mac.
There's a new option for ingesting logs (and metrics) into Stackdriver, as most of the non-Google-environment agents look like they are being deprecated: https://cloud.google.com/stackdriver/docs/deprecations/third-party-apps
A Google post on logging on-premises resources with Stackdriver and Blue Medora:
https://cloud.google.com/solutions/logging-on-premises-resources-with-stackdriver-and-blue-medora
For logs, you still need to install an agent on each box to collect them; it's a BindPlane agent, not a Google agent.
For Node.js, you can use the @google-cloud/logging-winston and @google-cloud/logging-bunyan modules from anywhere (on-prem, AWS, GCP, etc.). You will need to provide the projectId and auth credentials manually if not running on GCP. Instructions on how to set these up are available in the linked pages.
When running on GCP we figure out the exact environment (App Engine, Compute Engine, etc.) automatically, and the logs should show up under those resources in the Logging UI. If you are going to use the modules from your development machines, we will report the logs against the 'global' resource by default. You can customize this by passing a specific resource descriptor yourself.
Let us know if you run into any trouble.
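The modules above are Node.js-specific, but the pattern they describe (explicit project, explicit credentials, explicit monitored resource when running outside GCP) is the same across runtimes. Purely as an illustration, here is a rough Python sketch of that setup; the project ID, key file path, and log name are placeholders.

# Sketch of the off-GCP setup with the Python Cloud Logging client: explicit
# project, explicit service-account credentials, explicit resource descriptor.
# Every name and path below is a placeholder.
from google.cloud import logging
from google.cloud.logging import Resource
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/path/to/service-account-key.json"
)
client = logging.Client(project="my-project", credentials=credentials)

logger = client.logger("on-prem-app")
logger.log_text(
    "Hello from outside GCP",
    severity="INFO",
    # Without this, entries from non-GCP machines land under the "global" resource.
    resource=Resource(type="global", labels={}),
)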
I tried setting this up on my local k8s cluster by following this: https://kubernetes.io/docs/tasks/debug-application-cluster/logging-stackdriver/
But I couldn't get it to work; the fluentd-gcp-v2.0-qhqzt pod keeps crashing.
Also, the page mentions that there are multiple issues with Stackdriver Logging if you DON'T use it on Google GKE.
I think Google is trying to lock you into GKE.