We have three PostgreSQL databases in GCP's Cloud SQL, and all three are backed up daily. I need to use Grafana to monitor those backups and alert when they fail.
Unfortunately, I'm not finding many resources to help with this task. Is there a way to create this kind of alerting rule?
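One possible starting point (not a full Grafana setup): the Cloud SQL Admin API exposes the status of past backup runs, which you could poll and feed into whatever alerting pipeline you use. This is only a rough sketch using the google-api-python-client discovery client; the project and instance names are placeholders and authentication relies on Application Default Credentials:

    # Hedged sketch: list recent Cloud SQL backup runs and flag non-successful ones.
    # "my-project" and "my-postgres-instance" are placeholders.
    from googleapiclient import discovery

    service = discovery.build("sqladmin", "v1beta4")
    runs = (
        service.backupRuns()
        .list(project="my-project", instance="my-postgres-instance", maxResults=10)
        .execute()
    )

    for run in runs.get("items", []):
        if run.get("status") != "SUCCESSFUL":
            print("Backup run", run.get("id"), "ended with status", run.get("status"))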
I had a failover on my PostgreSQL instance on GCP. I have logs and metrics about the instance and everything looks fine, but there is no log explaining the reason for the failover (such as a network failure or a zone outage). Is there any way to know the reason for the failover?
This information is only available if you have a (paid) support service. The steps are:
1. Pay for support.
2. Open a ticket to ask for the information.
I am working on creating monitoring based on SLOs. So far I have been using Google Cloud Monitoring solutions like Dashboards, Alerting, and Uptime Checks.
I have noticed GCP now has a Managed Service for Prometheus.
My question is: what would be the advantage of using Prometheus (not only the Google-managed one) for monitoring? Is there anything that could be achieved with Prometheus that I could not achieve with Google Cloud Monitoring?
Managed Service for Prometheus is a managed, automatically scalable Prometheus endpoint. You can query the metrics with the PromQL language instead of MQL (Monitoring Query Language).
What's the advantage? If you deploy an application instrumented with OpenTelemetry (for example), you don't have to change anything. On Kubernetes (GKE), the managed collector does the job for you; otherwise, you have to configure the collector to use Managed Service for Prometheus.
If you build an app from scratch and you want it to be portable, OpenTelemetry and Prometheus are the standard tools to instrument it (a minimal sketch follows below).
If not, use Cloud Monitoring!
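As a rough illustration of what instrumenting an app with Prometheus looks like, here is a minimal sketch using the official prometheus_client Python library; the metric name and port are arbitrary choices and nothing here is specific to Managed Service for Prometheus:

    # Minimal, self-contained sketch of a Prometheus-instrumented app.
    # "http_requests_total" and port 8000 are arbitrary placeholders.
    import random
    import time

    from prometheus_client import Counter, start_http_server

    REQUESTS = Counter("http_requests_total", "Total HTTP requests handled")

    def handle_request():
        REQUESTS.inc()                   # count each simulated request
        time.sleep(random.random() / 10)

    if __name__ == "__main__":
        start_http_server(8000)          # exposes /metrics for a Prometheus-compatible collector
        while True:
            handle_request()

Any Prometheus-compatible collector (the managed one on GKE, or a self-managed one) can then scrape the /metrics endpoint without code changes.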
Important note
That feature is very new and, for now, only the metrics ingested through Managed Service for Prometheus can be queried with PromQL. The other metrics must be queried with MQL. This could change in the future.
So, for now, if you can use the built-in Cloud Monitoring metrics, that's the better solution.
I am trying to autoscale GCP instances based on memory metrics, but I am unable to find how this can be done. I have tried to set this up through Stackdriver monitoring metrics, but had no luck. Can someone help with how this can be done?
A similar problem was posted on the Google forum, but there is no proper answer there either.
https://groups.google.com/forum/#!topic/gce-discussion/X6LA0-8mFak
You need to install the Stackdriver Monitoring Agent by following this documentation.
Once it is installed, you will get more options for configuring your autoscaler from your instance group page.
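For reference, once the agent is reporting memory metrics, an autoscaler on a managed instance group can target an agent metric such as agent.googleapis.com/memory/percent_used. The following is only a rough sketch with the google-cloud-compute client; the project, zone, instance group name, and the 60% target are placeholders, and the field values should be checked against the current API:

    # Hedged sketch: create a zonal autoscaler that scales a MIG on the
    # monitoring agent's memory metric. All names and numbers are placeholders.
    from google.cloud import compute_v1

    autoscaler = compute_v1.Autoscaler(
        name="memory-autoscaler",
        target=(
            "https://www.googleapis.com/compute/v1/projects/my-project"
            "/zones/us-central1-a/instanceGroupManagers/my-mig"
        ),
        autoscaling_policy=compute_v1.AutoscalingPolicy(
            min_num_replicas=3,
            max_num_replicas=10,
            custom_metric_utilizations=[
                compute_v1.AutoscalingPolicyCustomMetricUtilization(
                    metric="agent.googleapis.com/memory/percent_used",
                    utilization_target=60.0,
                    utilization_target_type="GAUGE",
                )
            ],
        ),
    )

    client = compute_v1.AutoscalersClient()
    operation = client.insert(
        project="my-project",
        zone="us-central1-a",
        autoscaler_resource=autoscaler,
    )
    operation.result()  # wait for the autoscaler to be created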
I have set up a Managed Instance Group with 3 initial instances (Lumen is installed inside, and the web server auto-starts) to be used with the GCP load balancer. The LB works great.
However, whenever I need to trace the Lumen logs, I have to SSH into every single instance to view them. Is there a best practice for one centralized storage I can refer to for the logs?
Can I mount the Lumen logs onto a centralized disk, e.g. a GCP Filestore volume or a Google Storage bucket, or use Fluentd to ship my logs into GCP Logging?
Please, I need to know the best industry practice. Thanks.
Stackdriver is the right option for your case:
https://cloud.google.com/logging/docs/agent/installation#joint-install
Install the Stackdriver Logging agent on your Compute Engine instances. You can then follow your logs live, and also build visualizations and useful analyses on top of them. Stackdriver is the industry standard for people using GCP, but keep the pricing in mind: please check the pricing details.
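Once the agent ships the instance logs to Cloud Logging, you can read them centrally instead of SSHing into each VM. Here is a rough sketch with the google-cloud-logging client; the project ID and the filter (in particular the log name containing "lumen") are assumptions to adapt to how the agent is configured to pick up your Lumen log file:

    # Hedged sketch: read Lumen logs from Cloud Logging across all MIG instances.
    # The project ID and the filter are placeholders.
    from google.cloud import logging

    client = logging.Client(project="my-project")
    log_filter = (
        'resource.type="gce_instance" '
        'AND logName:"lumen"'  # assumed log name from the agent configuration
    )

    for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
        print(entry.timestamp, entry.resource.labels.get("instance_id"), entry.payload)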
I am trying out the AWS Glue service to ETL some data from Redshift to S3. The crawler runs successfully and creates the meta table in the data catalog; however, when I run the ETL job (generated by AWS), it fails after around 20 minutes saying "Resource unavailable".
I cannot see AWS Glue logs or error logs in CloudWatch. When I try to view them, it says "Log stream not found. The log stream jr_xxxxxxxxxx could not be found. Check if it was correctly created and retry."
I would appreciate any guidance to resolve this issue.
So basically, the job you add to Glue will only run if there is not too much traffic in the region your Glue job is in. If there are no resources available, you need to either manually re-add the job or subscribe to the relevant CloudWatch events via SNS.
Also, there are parameters you can pass to the job, such as the maximum number of retries and the timeout (see the sketch below).
If you get a "Resource unavailable" error, it won't trigger a retry, because the job did not fail; it just never started. But if you set the timeout to, say, 60 minutes, it will raise an error after that time, decrement your retry pool, and re-launch the job.
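For reference, these correspond to the MaxRetries and Timeout parameters of a Glue job. A hedged sketch with boto3; the job name, IAM role, and script location are placeholders:

    # Hedged sketch: create a Glue job with explicit retry and timeout settings.
    # Job name, IAM role, and script path are placeholders.
    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    glue.create_job(
        Name="redshift-to-s3-etl",
        Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://my-bucket/scripts/redshift_to_s3.py",
        },
        MaxRetries=1,  # re-run automatically if the run fails
        Timeout=60,    # minutes; fail the run (and consume a retry) after this
    )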
The closest thing I see to Glue documentation on this is here:

If you encounter errors in AWS Glue, use the following solutions to help you find the source of the problems and fix them.

Note: The AWS Glue GitHub repository contains additional troubleshooting guidance in AWS Glue Frequently Asked Questions.

Error: Resource Unavailable
If AWS Glue returns a resource unavailable message, you can view error messages or logs to help you learn more about the issue. The following tasks describe general methods for troubleshooting.

• A custom DNS configuration without reverse lookup can cause AWS Glue to fail. Check your DNS configuration. If you are using Amazon Route 53 or Microsoft Active Directory, make sure that there are forward and reverse lookups. For more information, see Setting Up DNS in Your VPC (p. 23).

• For any connections and development endpoints that you use, check that your cluster has not run out of elastic network interfaces.
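If the suspicion is exhausted elastic network interfaces or IP addresses in the subnet used by the connection, one rough way to check is via the EC2 API. This sketch assumes boto3 and a placeholder subnet ID:

    # Hedged sketch: check how crowded the Glue connection's subnet is.
    # The subnet ID is a placeholder.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    subnet_id = "subnet-0123456789abcdef0"

    subnet = ec2.describe_subnets(SubnetIds=[subnet_id])["Subnets"][0]
    enis = ec2.describe_network_interfaces(
        Filters=[{"Name": "subnet-id", "Values": [subnet_id]}]
    )["NetworkInterfaces"]

    print("Free IP addresses in subnet:", subnet["AvailableIpAddressCount"])
    print("Network interfaces currently in subnet:", len(enis))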
I recently struggled with a Resource Unavailable error thrown by a Glue job.
I was also not able to make a direct connection in Glue using RDS; it said "no suitable security group found".
I faced this issue while trying to connect with AWS RDS and Redshift.
The problem was with the security group that Redshift was using: you need to add a self-referencing inbound rule to that security group.
For those who don't know what a self-referencing inbound rule is, follow these steps:
1) Go to the security group you are using (VPC -> Security Groups).
2) Under Inbound Rules, select Edit Inbound Rules.
3) Add a rule:
   a) Type: All Traffic
   b) Protocol: All
   c) Port Range: All
   d) Source: Custom, then in the field provided type the first characters of your security group's ID and select it.
   e) Save it.
That's it!
If this rule was missing from your security group's inbound rules, try creating the connection again; this time you should be able to create it, and the job should work as well. (A scripted equivalent of the rule is sketched below.)
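For reference, the same self-referencing rule can be added with boto3; this is a hedged sketch where the security group ID is a placeholder and "all traffic" is expressed with IpProtocol="-1":

    # Hedged sketch: add a self-referencing "all traffic" inbound rule.
    # The security group ID is a placeholder.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    sg_id = "sg-0123456789abcdef0"

    ec2.authorize_security_group_ingress(
        GroupId=sg_id,
        IpPermissions=[
            {
                "IpProtocol": "-1",  # all protocols, all ports
                "UserIdGroupPairs": [{"GroupId": sg_id}],  # source = the group itself
            }
        ],
    )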