I recently started using Kamon instrumentation and am facing issues with the refresh rate of the Kamon/Prometheus HTTP endpoint.
Preface:
using "io.kamon" %% "kamon-bundle" % "2.1.4" && "io.kamon" %% "kamon-prometheus" % "2.1.4"
exposing metrics on an HTTP endpoint so that Prometheus scrapes and evaluates them every 1 second
created custom Counter, Gauge and Histogram metrics, which are updated 2-3K times per second inside the Akka actor that processes incoming messages
The reason for using Kamon instead of the standard Prometheus client is to get thread safety
The refresh rate is configured with kamon.metric.tick-interval = 1 second and kamon.prometheus.refresh-interval = 1 second
Problem:
Custom metrics exposed at the endpoint (localhost:9095) are not refreshed every second. They are refreshed approximately every 60 seconds.
This is not a Prometheus configuration problem: I am checking the values directly on the HTTP endpoint exposed by Kamon by manually refreshing the page.
This was a misconfiguration issue. If you are seeing the same problem, make sure that the kamon configuration is at the top level of application.conf, not inside akka {..} as I had it.
I have deployed an application (frontend and backend) in App Engine. First of all, I am using the free tier and I chose the default F1 for the frontend and B2 for the backend. I don't exactly understand the difference between B and F instances but based on their names, I chose them for backend and frontend respectively.
My backend is a Flask application that reads some data from Firestore in @app.before_first_request and "pre-caches" it for all future requests. This takes about 20-30 seconds before the first request is served, so I really don't want the backend instance to become undeployed all the time.
Right now, my backend successfully serves one request (that I am making from the browser) and then immediately gets undeployed (basically I see no active instances in App Engine dashboard after the request is served). This means that every request once again has the same long delay upon server start that I don't want. I am not sure why this is happening because I've set idle timeout to 5 minutes. I know it is not a problem with my Flask application because it does not crash after a request on a local machine and I've done its memory profiling which is in bounds of B2 limits. This is my app.yaml for the backend:
runtime: python38
service: api
env_variables:
  PORT: 8080
instance_class: B2
basic_scaling:
  max_instances: 1
  idle_timeout: 5m
Any insight would be appreciated!
Based on the behavior you describe, both scaling models are behaving as designed.
“Automatic Scaling: It creates instances based on request rate, response latencies, and other application metrics. You can specify thresholds for each of these metrics, and a minimum number instances to keep running always.
Basic Scaling: Basic scaling creates instances only when your application receives requests. Each instance will be shut down when the application becomes idle. Basic scaling is ideal for work that is intermittent or driven by user activity.”
For reference on these and the other scaling models, see the documentation page How Instances are Managed.
Information added on 10/12/2021:
Hi,
I think the correct term is “shutdown” rather than “undeployed” (see Disabling your application). The documentation says the following. Instance States: "an instance of a manual or basic scaled service can be either running or stopped. All instances of the same service and version share the same state." Scaling types: "Basic scaling creates instances when your application receives requests. Each instance will be shut down when the application becomes idle. Basic scaling is ideal for work that is intermittent or driven by user activity." The Startup and shutdown row of the table, for basic scaling: "Instances are created on demand to handle requests and automatically shut down when idle, based on the idle_timeout configuration parameter. An instance that is manually stopped has 30 seconds to finish handling requests before it is forcibly terminated." Scaling down: "You can specify a minimum number of idle instances. Setting an appropriate number of idle instances for your application based on request volume allows your application to serve every request with little latency".
Could you please verify:
that the instance was not manually halted?
that the instance is actually becoming idle?
that there are no background threads?
whether the behavior is the same when max_instances is set to 2?
that there are no logs showing an instance shutdown?
that requests are reaching the version that has the updated idle_timeout set?
Here is my situation:
I have a rather slow tensorflow model that runs on GPU (2 to 3 seconds per prediction)
A prediction for a single 'entity' vs a prediction for 8 'entities' takes about the same time
This means I could be 8 times as efficient by simply combining multiple predictions in the same request
I have a service on AI platform serving requests to that model
The service works for slow request rates but has trouble scaling up (anything over 4 QPS is too much to handle)
My question then is:
Is there a standard way / best practice for batching live client requests:
When receiving a request, wait a little bit for other requests
After a while, or when the number of requests reaches a set number, forward the requests in a single "batch" to another service.
If traffic is low, the delay will expire before the batch is full, but since traffic is low, that's not an issue
If traffic is high, the batch will be full before the delay, and the client will have to wait less
I have an almost-working solution with App Engine + Firebase (for hosting the shared 'queue'), but implementing the delay is giving me trouble (App Engine doesn't seem to like Python's threading.Timer).
I'd appreciate something that could work with app engine, but at this point I'm open to any suggestions (as long as it is applicable on google cloud).
Thanks!
The ideal (but not the cheapest) solution is to use Dataflow.
When a prediction request comes in, publish it in PubSub
Deploy a Dataflow pipeline in streaming mode, with fixed windows of X minutes and an additional, non-accumulating trigger that fires after Y events in the window.
When a window trigger fires (either on the number of messages or on the timer), do the batch processing.
You can also imagine other designs that are simpler and cheaper.
Still publish the prediction requests in PubSub.
1) Schedule a Cloud Function or a Cloud Run service every X minutes to pull the PubSub subscription and then trigger the batch job. The drawback is that the interval is fixed.
2) When you publish a message in PubSub, also store, in Firestore for example, a counter that you increment and the date of the first message published in PubSub.
If the number of messages is above your threshold, send a request to your other process, which pulls the PubSub subscription and runs the batch processing (as in #1). Then reset the counter value and the message date value.
Set up a Cloud Scheduler job that checks, every minute, the date of the first message in Firestore. If it is older than your time limit, send a request to your other process, which pulls the PubSub subscription and runs the batch processing (as in #1). Then reset the counter value and the message date value.
Option #2 will generate a lot of Firestore reads and writes, but it will be cheaper than Dataflow.
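For comparison, the "flush when the batch is full or when the delay expires" rule from the question can be sketched inside a single process. This is only a minimal sketch under stated assumptions: it needs a long-lived Python process that is allowed to run a background thread (which, as the question notes, may not hold on App Engine), and MicroBatcher, handler, max_size and max_delay are hypothetical names, not part of any Google Cloud API.

```python
import queue
import threading
import time
from concurrent.futures import Future


class MicroBatcher:
    """Hypothetical helper: groups items and flushes on size or on timeout."""

    def __init__(self, handler, max_size=8, max_delay=0.05):
        self._handler = handler      # assumed: takes a list, returns a list of results
        self._max_size = max_size    # flush as soon as this many items are waiting
        self._max_delay = max_delay  # or this many seconds after the first item
        self._queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, item):
        """Queue one item and return a Future that will hold its result."""
        future = Future()
        self._queue.put((item, future))
        return future

    def _run(self):
        while True:
            batch = [self._queue.get()]  # block until the first request arrives
            deadline = time.monotonic() + self._max_delay
            while len(batch) < self._max_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break                # delay expired before the batch was full
                try:
                    batch.append(self._queue.get(timeout=remaining))
                except queue.Empty:
                    break
            try:
                results = self._handler([item for item, _ in batch])
                for (_, future), result in zip(batch, results):
                    future.set_result(result)
            except Exception as exc:
                # Propagate the failure to every caller waiting on this batch.
                for _, future in batch:
                    future.set_exception(exc)
```

A request handler would call `batcher.submit(entity).result()` and block until its batch has been processed; the handler passed to MicroBatcher would be the function that sends one combined prediction request to the model.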
Cloud Tasks is saying:
App Engine is enforcing a processing rate lower than the maximum rate for this queue either because your application is returning HTTP 503 codes or because currently there is no instance available to execute a request.
However, I am forwarding the tasks to a cloud function using an HTTP POST request, similar to the one outlined in this tutorial. I don't see any 503s in my logs for the cloud function that it forwards to.
My queue.yaml is:
queue:
- name: task-queue-1
  rate: 3/s
  bucket_size: 500
  max_concurrent_requests: 100
  retry_parameters:
    task_retry_limit: 1
    min_backoff_seconds: 120
    task_age_limit: 7d
The slowdown seems to be triggered by any error, even though only 503 is listed in the message. If the cloud function responds with any error, the task queue slows down the rate, and you have no control over that.
My solution was to swallow all errors so that they do not propagate up to Google's automatic check.
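A minimal sketch of what "swallowing" could look like in a Python HTTP handler, assuming the function may return a (body, status) tuple as Flask-style handlers do. The names swallow_errors and handle_task are hypothetical. Note the trade-off: since the handler always answers 200, the queue will treat a failed task as done and will not retry it, so the log line is the only trace of the failure.

```python
import logging
from functools import wraps


def swallow_errors(handler):
    """Wrap an HTTP handler so that any exception becomes a 200 response."""
    @wraps(handler)
    def wrapper(request):
        try:
            return handler(request)
        except Exception:
            # Log the full traceback so the failure is still visible in the logs.
            logging.exception("Task handler failed; returning 200 anyway")
            return ("error swallowed", 200)
    return wrapper


@swallow_errors
def handle_task(request):
    # Hypothetical task logic; it always fails here to demonstrate the wrapper.
    raise ValueError("boom")
```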
I am new to JMeter and am confused about how to set up a test. My test scenario:
1) Hit a REST URL in API Gateway
2) The request rate should be 100 requests per second
3) Conduct the test for 2 hrs
4) Evaluate the error / success percentage
What parameters should I set to achieve this combination? Any help will be appreciated.
Thanks in advance
Add Concurrency Thread Group to your Test Plan and configure it like:
Put ${__tstFeedback(jp@gc - Throughput Shaping Timer,500,1000,10)} into "Target Concurrency" input.
Put 120 into "Hold Target Rate Time (min)" input
Add HTTP Request Sampler to your Test Plan and configure it to send request to the REST URL
You might also need to add HTTP Header Manager to send Content-Type header with the value of application/json
Add Throughput Shaping Timer as a child of your HTTP Request sampler and configure it like:
Start RPS: 100
End RPS: 100
Duration: 7200
Run your test in command-line non-GUI mode like:
jmeter -n -t test.jmx -l result.csv
Open the JMeter GUI, add e.g. an Aggregate Report listener to your test plan and look at the metrics. You can also generate an HTML Reporting Dashboard to see extended results and charts.
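If you only need the error / success percentage, it can also be computed straight from the result file. The sketch below assumes result.csv was written in JMeter's CSV format with a header row that includes a success column holding true or false; the summarize function and the sample rows are made up for illustration. At 100 requests per second for 7200 seconds, a full run would contain roughly 720,000 rows.

```python
import csv
import io


def summarize(results_file):
    """Return (total, failed, error_percentage) from a JMeter CSV result stream."""
    total = failed = 0
    for row in csv.DictReader(results_file):
        total += 1
        if row["success"].strip().lower() != "true":
            failed += 1
    error_pct = 100.0 * failed / total if total else 0.0
    return total, failed, error_pct


# Tiny in-memory sample standing in for result.csv (made-up rows).
sample = io.StringIO(
    "timeStamp,elapsed,label,responseCode,success\n"
    "1600000000000,120,HTTP Request,200,true\n"
    "1600000000010,95,HTTP Request,200,true\n"
    "1600000000020,300,HTTP Request,500,false\n"
    "1600000000030,110,HTTP Request,200,true\n"
)
total, failed, error_pct = summarize(sample)
print(f"{total} samples, {failed} failed, {error_pct:.1f}% errors")
```

For a real run, pass an open file instead of the sample, for example `summarize(open("result.csv", newline=""))`.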
We are setting up an API Management portal for one of our Web APIs. We are using Event Hubs for logging the events, and we transfer the event messages to Azure Blob storage using Azure Functions.
We would like to know how we can find the time taken by API Management to provide the response for a message (we are capturing the time taken at the backend API layer, but not at the API Management layer).
Regards,
John
The simpler solution is to enable Azure Monitor Diagnostic Logs for the API Management service. You will get raw logs for each request, including:
durationMs - interval between receiving request line and headers from a client and writing last chunk of response body to a client. All writes and reads include network latency.
BackendTime - time spent waiting on backend response
ClientTime - time spent with client for request and response
CacheTime - time spent on fetching from cache
You can also refer to this video.
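Given those fields, the time a request spent outside the backend can be estimated by subtraction. The sketch below is only an illustration: the records are made up, and it assumes each log entry has already been parsed into a flat dictionary keyed by the field names listed above, which may not match the exact layout or casing of the exported logs.

```python
def non_backend_ms(record):
    """Time not spent waiting on the backend, in milliseconds."""
    return record["durationMs"] - record["BackendTime"]


# Made-up log records using the field names listed above.
records = [
    {"durationMs": 250, "BackendTime": 200, "ClientTime": 30, "CacheTime": 0},
    {"durationMs": 90, "BackendTime": 60, "ClientTime": 20, "CacheTime": 5},
]
overheads = [non_backend_ms(r) for r in records]
print(overheads)
```

The result still includes client and cache time, so it is an upper bound on the processing done by API Management itself.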
This is not the proper way of doing it, but it still gives an idea of how much time each request takes: use a context variable to record the start time in the inbound policy section, and then compute the elapsed time in the outbound section.