Stackdriver Logging Client Libraries - What happens during Google Downtime? - google-cloud-platform

Suppose you embed the Stackdriver client library in your application and the Google Stackdriver API has downtime (Google's documentation indicates 99.95% availability, or roughly 21.92 minutes of downtime per month).
My question is: what will happen to my application during that downtime? Will logging info build up in memory? Will it cause application errors, or will the library discard the log data and continue on?

Logging API downtime can have different root causes and consequences. Google's system engineers have mechanisms in place to track outages and take mitigating action so that downtime and its consequences are minimal, but Google cannot guarantee that no log data will be lost in every Logging API outage.
Hopefully your application and pipeline can withstand up to roughly 21.92 minutes of expected downtime a month (99.95% SLA), as per the internal SLOs and SLAs of GCP.
The three scenarios you listed are all plausible. During such a period, your application sending the logs may receive 5xx responses from the API, so it has to be able to deal with this kind of failure.
If the logging data reaches Google's platform but an outage prevents it from being accessible, Google's team will do their best to release backlogs, repopulate data, and so on. They post general notices on https://status.cloud.google.com/.
If the issue is caused by the logging agent not sending data to Google's platform, then the logging data may not be retrievable. This could still be an infrastructure outage in one of the GCP products, but it could also be something other than an outage, such as your application or its underlying host running out of resources, or a corrupted logging agent; those cases are not covered by the GCP Stackdriver SLA [1].
If the pipeline that ingests data from the Logging API is backlogged, that can cause an outage, but the GCP team will do their best to make the data accessible after the outage ends.
If you suspect the Logging API is malfunctioning, contact support, file a new issue in the issue tracker [2], or inspect the open issues [3], where Google's product team provides live updates. Links below:
[1] SLA exclusions: https://cloud.google.com/stackdriver/sla#sla_exclusions
[2] Create a new incident: https://issuetracker.google.com/issues/new?component=187203&template=0
[3] Open issues: https://issuetracker.google.com/savedsearches/559764
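As a sketch of the defensive pattern described above, the snippet below wraps Cloud Logging writes in a bounded in-memory buffer and falls back to stderr when the API is unreachable. This is an illustration, not the client library's built-in behavior: the logger name, buffer size, and exception handling are assumptions.

import collections
import sys

from google.api_core import exceptions as gax_exceptions
from google.cloud import logging as cloud_logging

MAX_BUFFERED_ENTRIES = 1000  # assumption: cap chosen for this sketch

client = cloud_logging.Client()
logger = client.logger("app-log")  # hypothetical logger name

# Bounded buffer: if the API stays down, the oldest entries are dropped
# rather than letting memory grow without limit.
_buffer = collections.deque(maxlen=MAX_BUFFERED_ENTRIES)

def log_text(message):
    """Buffer the entry, then try to flush everything buffered so far."""
    _buffer.append(message)
    try:
        while _buffer:
            logger.log_text(_buffer[0])
            _buffer.popleft()
    except gax_exceptions.GoogleAPICallError as exc:
        # 5xx/unavailable: keep entries buffered and note it locally, so the
        # application itself keeps running through the outage.
        sys.stderr.write("Logging API unavailable (%s); %d entries buffered\n"
                         % (exc, len(_buffer)))

With a pattern like this, an outage within the SLA window costs at most the buffer's worth of memory; anything beyond that is dropped deliberately instead of crashing the application.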

Related

Need help building an uptime dashboard for a distributed system

I have a product for which I would like to create a dashboard to show
its availability/uptime over time and display any outages.
Specifically, I am looking for:
- the ability to report historical information on service uptime
- details on any service outages
The product runs on a fleet of Linux servers and connects to a DB running on a separate instance; we also have some dedicated instances that run nightly batch jobs. The system relies on some external services to provide additional functionality for select customers, and there is also a Redis cache for caching data for multiple customers.
We replicate all of the above (application servers, DB, job servers, Redis cache, etc.) into dedicated clusters for large customers. Small customers are placed on one of the shared clusters to keep costs low.
Currently we run health checks on the application servers only and publish that information on a simple HTML page. This is the go-to page for end users/customers and support teams.
Since the product is built from multiple systems/services, our current HTML page often says the system is up and running fine even while some of its components or external services are experiencing issues.
The current health check is a simple HTTP request that looks for a 200 status code; it runs every minute, and we plot the results in a simple chart covering the last 30 days. We also show a list of outages with timestamps and additional static information that is added manually.
We would like to build a more robust solution that monitors much more than the HTTP port and gives us more detail: which part of the system is having issues, how those issues are impacting the system, and which customers are affected.
Any guidance or help is appreciated. We prefer to build the solution using open source tools, since we don't have much budget. The goal is to improve things for my team members, who are already overloaded.
I'm not sure whether this will be overkill for your setup, given that I don't know your product, but have a look at the ELK Stack and see if you can use some of its components, or at least some ideas from it:
What is the ELK Stack?
The Complete Guide to the ELK Stack
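Whatever stack you choose, a first step toward component-level visibility is a health check that probes each dependency instead of returning a single HTTP 200. Here is a minimal sketch, assuming placeholder hostnames and ports for your DB, Redis, and an external service:

import json
import socket
import time

# Placeholder dependencies; substitute your real hosts and ports.
COMPONENTS = {
    "database": ("db.internal.example.com", 5432),
    "redis": ("redis.internal.example.com", 6379),
    "external-api": ("partner.example.com", 443),
}

def check_tcp(host, port, timeout=2.0):
    """Simple TCP connect probe; returns (ok, latency in ms)."""
    start = time.time()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, round((time.time() - start) * 1000, 1)
    except OSError:
        return False, None

def health_report():
    """Per-component status instead of an all-or-nothing 200."""
    report = {}
    for name, (host, port) in COMPONENTS.items():
        ok, latency = check_tcp(host, port)
        report[name] = {"up": ok, "latency_ms": latency}
    report["overall"] = all(c["up"] for c in report.values())
    return report

if __name__ == "__main__":
    print(json.dumps(health_report(), indent=2))

Running one of these per cluster and shipping the JSON into Elasticsearch (or even your existing HTML page) would let the dashboard show which component and cluster, and therefore which customers, are affected.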

Google Cloud Platform - Stack Driver Enabled - 100% Compute Errors

I developed and support a client's mobile app that uses Firebase services.
Google Cloud Platform logged this event yesterday at 4:17 am:
'<my account email> has executed
google.api.serviceusage.v1.ServiceUsage.EnableService
on stackdriver.googleapis.com'
I was sleeping at the time, and a review of the Google Admin Console login audit log does not show a login event around that time.
Immediately, 100% errors were being reported for 'compute'.
A look at the Stackdriver API overview page does not give any indication of activity.
My question, and my concern: how/why did this service get activated, and what activity is driving the compute errors at 100%?
During my efforts to understand, I clicked on Compute Engine API in the API library, which enabled the API (but no VMs, disks, etc. were created).
A short time later, Google Cloud Platform had several log entries:
google.devtools.cloudbuild.v1.CloudBuild.ListBuilds
was executed on builds
Number of returned items 1000
The 'compute' errors stopped.
When I disabled the Compute Engine API, the ListBuilds logs stopped, but the Compute errors returned to 100%.
I have not found a definitive answer to my question.
It's clear that Stackdriver API was enabled, but I don't know why.
When enabled, 100% Compute errors were being reported (orange line on graph) without any details.
While customizing the Google Cloud Platform dashboard for this account, I toggled/enabled the Compute Engine card/graph, hoping it might reveal some clues about the 100% errors. That action initialized the Compute Engine API. Almost immediately the Compute errors ended, but there was a surge of activity that has continued. Reviewing many resources, I found information suggesting this is normal behavior.
I would still like to fully understand how Stackdriver was enabled, why it was enabled, what value it provides, and whether I can simply disable it and the Compute Engine API, as this project will never require VM compute services.
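One way to investigate who or what enabled the service is to query the project's Admin Activity audit logs for the EnableService method seen in the log entry above. Below is a sketch using the google-cloud-logging Python client; the project ID is a placeholder, and the filter is an assumption based on the quoted entry:

from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="my-project-id")  # placeholder project

# Admin Activity audit logs record the caller identity even when there was no
# interactive login, e.g. an API key, a service account, or a linked service
# (such as Firebase) acting on the account's behalf.
FILTER = (
    'logName:"cloudaudit.googleapis.com%2Factivity" AND '
    'protoPayload.methodName='
    '"google.api.serviceusage.v1.ServiceUsage.EnableService"'
)

for entry in client.list_entries(filter_=FILTER, order_by=cloud_logging.DESCENDING):
    payload = entry.payload  # the audit protoPayload, dict-like
    print(entry.timestamp,
          payload.get("authenticationInfo", {}).get("principalEmail"),
          payload.get("resourceName"))

The principalEmail and request metadata in those entries usually distinguish a human action from an automated one.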

AWS RDS Performance Insights not showing SQL Queries

I enabled Performance Insights on an existing SQL database (MySQL 5.6.46) in AWS RDS.
But it still shows 0 sessions and "No active sessions in the selected time range", no matter what duration I select from the top list.
Is there some condition I need to meet in order to have my queries recorded in Performance Insights? What are the criteria? How can I troubleshoot this?
I created an AWS Support case, where an AWS engineer explained to me:
Unfortunately, this is a known issue from our end where Performance Insights does not get enabled when it is issued in the same API call as engine version upgrade as RDS follows a priority in executing multiple requests that have been submitted as part of the same API call - for example in this case, request to enable Performance Insights and request to upgrade the instance to 11.1 version. Performance Insights call is evaluated first followed by the engine upgrade. This means that when Performance Insights request was being considered, the instance was still on the previous incompatible version, hence the request did not go through successfully.
The workaround to resolve this issue is to disable Performance Insights, wait a few minutes, and then re-enable it.
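A sketch of that workaround with boto3 might look like the following; the instance identifier is a placeholder, and the parameters should be checked against the current boto3 documentation:

import time

import boto3

rds = boto3.client("rds")
INSTANCE = "my-db-instance"  # placeholder identifier

# Disable Performance Insights first...
rds.modify_db_instance(
    DBInstanceIdentifier=INSTANCE,
    EnablePerformanceInsights=False,
    ApplyImmediately=True,
)

# ...wait a few minutes, per the support recommendation...
time.sleep(300)

# ...then re-enable it, so the request is evaluated against the current,
# compatible engine version rather than racing an in-flight upgrade.
rds.modify_db_instance(
    DBInstanceIdentifier=INSTANCE,
    EnablePerformanceInsights=True,
    ApplyImmediately=True,
)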
Enabling/disabling Performance Insights does not cause an outage/downtime. The Performance Insights agent is designed to stay out of your database workloads' way. When Performance Insights detects heavy load or depleted resources, it backs off, still collecting data, but only when it is safe to do so.

WSO2 Throttling API

I have read the various questions involving throttling on Stack Overflow; however, I didn't find anyone with an issue similar to what I'm seeing. I have gone through the tutorials and the setup process on the WSO2 site regarding throttling.
This is what I have done:
- Set up an additional tier to allow 5 calls per minute at the following levels: Advanced Throttling, Application Throttling, Subscription Throttling.
- Edited the API and set the subscription tier to the new custom tier.
- Set the Application to the new tier level.
- Set the Advanced Throttling Policy to apply to the API, then saved and published.
- Ran 1100 HTTP requests from an application that calls the API on a one-second interval. Every request was processed successfully without any throttling.
- Installed version 1.9 of API Manager and set up the very same rules; there the requests were throttled correctly.
Any help would be greatly appreciated; I'm not really sure whether it is a bug or a configuration issue on my end.
Regards
After much digging in the WSO2 documentation, I found that in order to use the advanced throttling techniques (which are enabled by default), you must use the Traffic Manager (which is disabled by default).
There are instructions on how to use the Traffic Manager in the WSO2 documentation. If advanced throttling is disabled, the basic throttling works as expected.
It took some time to discover this, as the documentation doesn't make the distinction very clear.
I hope this helps someone having a similar issue.
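Once the Traffic Manager is running, a quick way to verify a 5-calls-per-minute tier is to fire a short burst of requests and watch for HTTP 429 responses. This is a minimal sketch; the gateway URL and access token are placeholders:

import time
import urllib.error
import urllib.request

URL = "https://gateway.example.com:8243/myapi/1.0/resource"  # placeholder
TOKEN = "access-token-here"  # placeholder

# With a 5-per-minute tier enforced, requests 6..10 within the same minute
# should come back throttled as HTTP 429.
for i in range(10):
    req = urllib.request.Request(URL, headers={"Authorization": "Bearer " + TOKEN})
    try:
        with urllib.request.urlopen(req) as resp:
            print(i + 1, resp.status)
    except urllib.error.HTTPError as err:
        print(i + 1, err.code)  # expect 429 once the tier limit is hit
    time.sleep(1)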

Google Cloud Pub/Sub on Google App Engine hits QPS limit too soon

Around 90 to 100 calls per second to
pubsub_client.projects().topics().publish(topic='projects/xxxx', body=body).execute(num_retries=0)
from a Google App Engine app to Google Cloud Pub/Sub results in
HttpError: <HttpError 429 when requesting https://pubsub.googleapis.com/v1/projects/xxxx:publish?alt=json returned "Request throttled due to user QPS limit being reached.">
I know there is a limit of 100 QPS on administrative operations, but surely publishing to a topic is not an administrative operation? I know Pub/Sub should support millions of operations per second, so I know something is wrong.
Any help or insight would be appreciated. I need to reach at least 300 publishes per second; I am trying to streamline an existing implementation using Pub/Sub, and I think this may be a bug in the implementation.
I am running this code on Google App Engine with the Python 2.7 standard runtime, not the flexible one, as that is not yet approved for production code.
Note that publisher quota is not expressed in QPS but in throughput. The default limit is 100MB/s; see the Quotas documentation for more details. Depending on the size of the messages you are sending, you may be running into these limits.
The "user QPS limit being reached" message on a publish usually means one of three things:
You are publishing at a throughput that is higher than the default 100MB/s quota. If that is the case, then you can apply for more quota by clicking on the "Apply for higher quota" on the Pub/Sub Quota page.
You are not authenticated against the correct Cloud project. If you are authenticated in or running your Google App Engine instances in a Cloud project that differs from the one your topic is defined in, the quota you run into may not be defined in the project you expect. More information can be found in the Google Application Defaults Credentials page.
You have manually set quota in the Quota page and that is the limit you are running into.
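If none of those three explanations fits and the limit turns out to be per-request rather than throughput, two common mitigations are batching several messages into one publish call and backing off on 429s. Here is a rough sketch against the same v1 REST client used in the question; the topic name and retry policy are placeholders:

import base64
import time

from googleapiclient.errors import HttpError

TOPIC = 'projects/xxxx/topics/mytopic'  # placeholder topic

def publish_batch(pubsub_client, payloads, max_retries=5):
    """Publish several messages in one request; back off on HTTP 429."""
    body = {
        'messages': [
            {'data': base64.b64encode(p).decode('ascii')} for p in payloads
        ]
    }
    delay = 1
    for attempt in range(max_retries):
        try:
            return pubsub_client.projects().topics().publish(
                topic=TOPIC, body=body).execute(num_retries=0)
        except HttpError as err:
            if err.resp.status != 429:
                raise
            time.sleep(delay)  # exponential backoff on throttling
            delay *= 2
    raise RuntimeError('publish still throttled after %d retries' % max_retries)

Batching 10 messages per call would turn 300 publishes per second into roughly 30 API requests per second, well under any per-request rate limit.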