Setting up custom HTTP Health Checks on GCP

Apparently I cannot figure out how to use a custom HTTP endpoint for health checks. Maybe I missed something, or GCP doesn't offer it yet.
The Elasticsearch health check page describes various ways to check the health of an ES cluster.
I was looking at the GCP health checks interface, and it doesn't let us add a URL endpoint, nor does it let us define a parser for the health check to match against a "green" cluster status.
What I was able to do is wire in port 9200 and use a config like:
port: 9200, timeout: 5s, check interval: 60s, unhealthy threshold: 2 attempts
But this is not the way to go for an ES cluster, as the cluster may respond while being in a yellow/red state.
An easier way, without parsing the output, would be to add a timeout check like:
GET /_cluster/health?wait_for_status=yellow&timeout=50s
Note: this will wait up to 50 seconds for the cluster to reach yellow status (if it reaches green or yellow before the 50 seconds elapse, it returns at that point).
Any suggestions?

GCP health checks are simple and use the HTTP status code to determine if the check passes (200) - https://cloud.google.com/compute/docs/load-balancing/health-checks
What you can do is implement a simple HTTP service that queries ES's health check endpoint, parses the output, and decides whether to return status code 200 or something else.
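For illustration, a minimal sketch of such a shim in Python (assuming Flask and requests are available; the /healthz path, port 8080, and the local ES address are placeholders to adapt):

import flask
import requests

app = flask.Flask(__name__)
ES_HEALTH_URL = "http://localhost:9200/_cluster/health"  # adjust to your cluster

@app.route("/healthz", methods=["GET"])
def healthz():
    try:
        status = requests.get(ES_HEALTH_URL, timeout=5).json().get("status")
    except (requests.RequestException, ValueError):
        return flask.Response("elasticsearch unreachable\n", status=503)
    # The GCP health check only looks at the HTTP status code, so map the
    # cluster color to one: green/yellow -> 200, red or unknown -> 503.
    code = 200 if status in ("green", "yellow") else 503
    return flask.Response(f"{status}\n", status=code)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

You would then point the GCP health check at port 8080 and the /healthz path instead of at Elasticsearch directly.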

Related

Google Cloud Run not scaling up despite large backlog and available instances

I am seeing something similar to this post. It looked like additional detail was needed to answer that question, so I'm re-asking with my details, since they weren't provided there.
I am running a modified version of the Google Cloud Run image processing tutorial example.
I am inserting tasks into a task queue using this create tasks snippet. The tasks from the queue get pushed to my Cloud Run instance.
The problem is that it isn't scaling up and making it through my tasks in a timely manner.
My Cloud Run service configuration:
I have tried setting a minimum of both 0 and 50 instances
I have tried a maximum of 100 and 1000 instances
I have tried --concurrency=1, 2, and 8
I have tried with --async and without --async
With 50 instances pre-allocated, even with concurrency set to 1, I typically see ~10 active container instances and ~40 idle container instances. I have ~30,000 tasks in the queue and it is getting through ~5 jobs/minute.
My task queue has the default settings. My containers aren't using much CPU, but they are using a lot of memory.
A process takes about a minute to complete. I'm only running one process per container instance. What additional parameters should be set to get higher throughput?
Edit - adding additional logs
I enabled the logs for the queue, and I'm seeing errors for some of the jobs. The errors look like this:
{
  insertId: "<my_id>"
  jsonPayload: {
    @type: "type.googleapis.com/google.cloud.tasks.logging.v1.TaskActivityLog"
    attemptResponseLog: {
      attemptDuration: "19.453155s"
      dispatchCount: "1"
      maxAttempts: 0
      responseCount: "0"
      retryTime: "2021-10-20T22:45:51.559121Z"
      scheduleTime: "2021-10-20T16:42:20.848145Z"
      status: "UNAVAILABLE"
      targetAddress: "POST <my_url>"
      targetType: "HTTP"
    }
    task: "<my_task>"
  }
  logName: "<my_log_name>"
  receiveTimestamp: "2021-10-20T22:45:52.418715942Z"
  resource: {
    labels: {
      location: "us-central1"
      project_id: "<my_project>"
      queue_id: "<my-queue>"
      target_type: "HTTP"
    }
    type: "cloud_tasks_queue"
  }
  severity: "ERROR"
  timestamp: "2021-10-20T22:45:51.459232147Z"
}
I don't see errors in the cloud run logs.
Edit - Additional Debug Information
I tried to take the queue out of the equation to determine whether the problem is Cloud Run or the queue. Instead, I used curl directly to POST to the URL. Some of the tasks ran successfully; for others I received an error. In the logs below, empty lines are successful:
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
This makes me think Cloud Run isn't handling all of the incoming requests.
Edit - task completion time test
I wanted to test whether the time it takes to complete a task causes any issues with Cloud Run and the queue scaling up and keeping up with the tasks.
In place of the task I actually want completed, I put a dummy task that just sleeps for n seconds and prints the task details to stdout (which I can read in the Cloud Run logs).
With n set to 0, 5, or 10 seconds, I see the number of instances scale up and keep up with the tasks being added to the queue. With n set to 20 seconds or more, fewer Cloud Run instances are instantiated and items accumulate in the task queue. I also see more errors with the UNAVAILABLE status in my logs.
According to this post:
Cloud Run offers a longer request timeout duration of up to 60 minutes
So it seems that long-running tasks are expected. Is this a Google bug, or am I missing some parameter?
I do not think this is a Cloud Run service problem. I think this is an issue with how you have Cloud Tasks set up.
The dates in the log entry look odd. Take a look at the receiveTimestamp and the scheduleTime. The task is scheduled for six hours before the receive time. Do you have a timezone problem?
According to the documentation, if the response_time is not set then the task was not attempted. It looks like you are scheduling tasks incorrectly and the tasks never run.
Search for the text "The status of a task attempt." in this link:
Types for Google Cloud Tasks
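For comparison, here is a minimal task-creation sketch with the google-cloud-tasks Python client that leaves schedule_time unset, so the task becomes eligible for dispatch immediately (the project, location, queue, and URL below are placeholders):

import json
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "my-queue")  # placeholders

task = {
    "http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "url": "https://my-service-xxxx-uc.a.run.app/process",  # placeholder URL
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"job": "example"}).encode(),
    }
    # Note: no "schedule_time" key. If schedule_time is set to a future
    # timestamp (or built from the wrong timezone), the first dispatch is
    # delayed accordingly, which would explain the six-hour gap in the log.
}

response = client.create_task(request={"parent": parent, "task": task})
print("Created task:", response.name)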

Sagemaker Batch Transform Error "Model container failed to respond to ping; Ensure /ping endpoint is implemented and responds with an HTTP 200 status"

My task is to do large scale inference via Sagemaker Batch Transform.
I have been following the tutorial: bring your own container, https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb
I have encountered many problems and solved them by searching Stack Overflow. However, there is one problem that still causes trouble.
When I run the same code on the same dataset using 20 EC2 instances simultaneously, sometimes I get the error "Model container failed to respond to ping; Please ensure /ping endpoint is implemented and responds with an HTTP 200 status", and sometimes I don't.
What I find most frustrating is that my /ping already does nothing (see the code below):
import flask

app = flask.Flask(__name__)

@app.route("/ping", methods=["GET"])
def ping():
    """Determine if the container is working and healthy.
    In this sample container, we declare it healthy if we can load the model successfully."""
    # health = ScoringService.get_model() is not None  # You can insert a health check here
    # status = 200 if health else 404
    status = 200
    return flask.Response(response="\n", status=status, mimetype="text/csv")
How could the error still happen?
I read in some posts (e.g., How can I add a health check to a Sagemaker Endpoint?) that the ping response should return within a 2-second timeout.
How can I increase the ping response timeout? And in general, what can I do to prevent the error from happening?
A quick clarification: SageMaker Batch Transform and creating a real-time endpoint are two different facets. For Batch Transform you do not use a persistent endpoint; rather, you create a transformer object that can perform inference on a large set of data. An example of the Bring Your Own approach with Batch can be seen here.
Regardless of Batch or Real-Time, the /ping check must pass. Make sure that you are loading your model in this route; generally, if your model is not loaded properly, this leads to that health check error being emitted. Here's another BYOC example; in the predictor.py you can see me loading the model in the /ping route, along the lines of the sketch below.
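As a sketch, the /ping from the question could be tightened along these lines (flask, app, and ScoringService are as in the question's predictor.py; the assumption here is that get_model() loads the model once and caches it, as in the scikit_bring_your_own example):

@app.route("/ping", methods=["GET"])
def ping():
    # Actually attempt the model load instead of returning a blanket 200.
    # If get_model() caches the model, the first ping pays the load cost and
    # later pings stay well under the ~2 second health check timeout.
    health = ScoringService.get_model() is not None
    status = 200 if health else 404
    return flask.Response(response="\n", status=status, mimetype="text/csv")

This way a broken model artifact surfaces as a failed health check rather than as an error on the first inference request.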
Lastly, I'm not sure what you mean by 20 instances simultaneously. Are you backing the endpoint with an instance count of 20?

Google Cloud Tasks enforcing rate limit on forwarding to Cloud Functions?

Cloud Tasks is saying:
App Engine is enforcing a processing rate lower than the maximum rate for this queue either because your application is returning HTTP 503 codes or because currently there is no instance available to execute a request.
However, I am forwarding the tasks to a Cloud Function using an HTTP POST request, similar to the one outlined in this tutorial. I don't see any 503s in the logs for the Cloud Function that it forwards to.
My queue.yaml is:
queue:
- name: task-queue-1
  rate: 3/s
  bucket_size: 500
  max_concurrent_requests: 100
  retry_parameters:
    task_retry_limit: 1
    min_backoff_seconds: 120
    task_age_limit: 7d
The problem seems to come from any exception, even though only 503 is listed: if the Cloud Function responds with any error, the task queue slows down the rate, and you have no control over that.
My solution was to swallow any errors to prevent them from propagating up to Google's automatic check.
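A minimal sketch of that approach for a Python HTTP Cloud Function (process_task is a hypothetical stand-in for the real work):

import logging
import functions_framework

def process_task(payload):
    ...  # the real work goes here (hypothetical)

@functions_framework.http
def handle_task(request):
    try:
        process_task(request.get_json(silent=True))
    except Exception:
        # Log the failure but still return 200, so Cloud Tasks doesn't count
        # the attempt as failed and throttle the queue's dispatch rate.
        logging.exception("Task failed; swallowing error to keep the queue rate up")
    return ("", 200)

The trade-off is that the queue will never retry a swallowed failure, so you need your own logging or dead-lettering if failed tasks matter.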

Stackdriver Alert based on service status

Is it possible to set up an alert based on the status of a custom service? For example, the stackdriver-agent service crashed at one point. When running "service stackdriver-agent status" I receive an "Active: inactive (dead)" response.
Is it possible to set up an alert based on the condition above? The stackdriver-agent service is just an example; in theory, I would like to set up this alert condition for any service.
The answer is yes. In Stackdriver you can set up an alert for any process on your machine. By selecting the option Add Process Health Condition you can configure alerts so that you are notified when your process starts or stops. Bear in mind that you first have to set up the Stackdriver agent on your machine, and that this option is only available in Stackdriver Premium.
Thrahir's answer is a good one, though the UI has changed since then (click the right arrow next to "Metric" and "Uptime Check" to see other condition types; "Process Health" is the very last one).
If your service is a server, you might rather use an uptime check (https://cloud.google.com/monitoring/uptime-checks/) to monitor its state; that gives you a better analog to what the service's users will see than directly monitoring your processes does.
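For example, a sketch of creating such an uptime check with the Cloud Monitoring Python client (the project, host, path, and timings below are placeholders, and the field layout follows the monitoring_v3 API as I understand it):

from google.cloud import monitoring_v3

client = monitoring_v3.UptimeCheckServiceClient()

config = monitoring_v3.UptimeCheckConfig(
    display_name="my-service-uptime",  # placeholder
    monitored_resource={
        "type": "uptime_url",
        "labels": {"project_id": "my-project", "host": "my-service.example.com"},
    },
    http_check=monitoring_v3.UptimeCheckConfig.HttpCheck(
        path="/healthz", port=443, use_ssl=True
    ),
    period={"seconds": 60},   # probe once a minute
    timeout={"seconds": 10},  # fail the probe after 10 seconds
)

created = client.create_uptime_check_config(
    request={"parent": "projects/my-project", "uptime_check_config": config}
)
print("Created uptime check:", created.name)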
Aaron Sher, Stackdriver engineer

What is the reason for the AWS Health status becoming RED?

I've deployed an application to AWS elastic beanstalk.
After starting the application, it runs well. But after 5 minutes (I set the health check to every 5 minutes), it fails; when I access the URL I get an HTTP 503 error back.
From the event info, I only get that the health status went from YELLOW to GREEN.
But how can I get more detailed info, and what can I do about this error?
BTW: I don't understand whether the RED health status causes the application to fail to start, or whether something else fails first, making the application fail and the health status turn RED.
Elastic Load Balancing has a health check daemon that checks the path you've provided for a 200-range HTTP status.
If there is a problem with the application, it's not returning a 2xx status code, or you've misconfigured the health check URL, the status will go RED.
Two things you can do to see what's going on:
Hit the hostname of an individual instance in your web browser — particularly the health check path. Are you seeing what you expected?
SSH into the instance and check the logs in /var/log and /opt/elasticbeanstalk/var/log. Are there any errors that you can find?
Without knowing more about your application, stack or container type, that's the best I can do.
I hope this helps! :)