Prometheus exceeded maximum resolution of 11,000 points or empty result - Postman

I'm currently struggling with getting empty results from requests to my Prometheus endpoint:
{
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": []
    }
}
If I look at the data in Grafana for this stack/instance over a 7-day period, I can see there are metrics for this stack, but the metrics were only coming through for ~15 minutes (until the stack was destroyed). That's fine with me; I'd just like to get the metrics from the time frame when it was alive.
Grafana 7-day interval
If we take a look in Grafana, values start coming through when I specify a short enough time interval.
Grafana smaller time interval
Great, now all I want to do is get that value back from Prometheus with a Postman request.
Since I had trouble with Prometheus's query endpoint, I started using query_range so that I could specify start and end times. I assumed that if I didn't specify a time range, I'd be in the same situation as Grafana above: a large time interval over which no values appear.
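For context, a query_range call of that shape looks roughly like this (a minimal sketch in Python with the requests library; the server address and metric name are illustrative placeholders, not my actual setup):

import requests
import time

# Illustrative values only - substitute your own Prometheus address and metric.
end_time = time.time()                   # Unix timestamp in *seconds*
start_time = end_time - 7 * 24 * 3600    # 7 days back, like the Grafana view

resp = requests.get(
    "http://localhost:9090/api/v1/query_range",
    params={
        "query": "my_stack_metric",  # hypothetical metric name
        "start": start_time,
        "end": end_time,
        "step": "15s",  # 7 days at a 15s step is ~40,000 points, well over the 11,000 limit
    },
)
print(resp.json())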
After sending the request with a small step value, I get back:
exceeded maximum resolution of 11,000 points per timeseries. Try decreasing the query resolution (?step=XX)
Postman request, small step interval
This leads me to believe that I am getting metrics back, just too many for Prometheus to send over.
So I increase the step interval (I've tried different combinations of seconds/minutes), and it always gives back the following:
{
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": []
    }
}
Postman request valid step
I can't seem to get around this issue of getting back an empty result set. Any ideas on why this result set might be coming back empty? I've tried getting an average by wrapping the query in avg(), but that also gives back an empty result.
In short: I tried sending a Postman request to get Prometheus metrics, but I got back either an empty result or an error saying I exceeded the maximum resolution of 11,000 points per timeseries.

Just found out the solution. This is really dumb, but the Prometheus endpoint doesn't like it when you pass in too long a Unix timestamp: query_range's start and end parameters expect seconds, and my longer value was in milliseconds, so the query window landed far in the future where no data exists.
I realized this by troubleshooting the following:
1675106064891 (milliseconds) - did not work
1675106103 (seconds) - did work
My Grafana metrics start getting published around Mon Jan 30 2023 19:11:45 GMT+0000, which is before both of those Unix timestamps. So what was the difference? Well, I tried adding more digits onto the Unix timestamp that did work and, voilà, I stopped getting metric values back.
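A quick sanity check makes the difference obvious (a minimal Python sketch using the two values above; query_range's start and end accept Unix seconds, or RFC 3339 strings):

from datetime import datetime, timezone

start_ms = 1675106064891    # the timestamp that did not work (milliseconds)
start_s = start_ms // 1000  # 1675106064 - the seconds value Prometheus expects

print(datetime.fromtimestamp(start_s, tz=timezone.utc))
# 2023-01-30 19:14:24+00:00 - a sensible date, just after the metrics began

# Read as seconds, 1675106064891 lands tens of thousands of years in the
# future - a range with no data, hence the "successful" but empty result.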
Hope this response saves you some time if you encounter the same error. I wish Prometheus would give some context on why that value is empty or something.

Related

Receiving a 429 error when iterating through data using Newman with Postman collection

I am new to Postman and Newman. I am iterating through a CSV data file; at times the values are few (20 or fewer) and at times they are many (50 or more). When iterating through large sets of values (50 or more), I receive a 429 error. Is there a way to write a retry function on a single request when the status is not 200?
I am working with the Amazon SP-API, and reading through the documentation it appears there is nothing I can do about the x-amzn-RateLimit-Limit. My current limit is set at 5.0, which I believe means 5 requests per second.
Any advice would be helpful: a retry function, telling the requests to sleep/wait every X amount, another method I am not aware of, etc.
This is the error I receive:
{
    "errors": [
        {
            "code": "QuotaExceeded",
            "message": "You exceeded your quota for the requested resource.",
            "details": ""
        }
    ]
}
@Danny Dainton pointed me to the right place. By reading through the documentation I found that options.delayRequest lets me delay the time between requests. My final code looks like the sample below, and it works now:
newman.run({
    delayRequest: 3000, // wait 3000 ms between requests - comfortably under the 5-requests-per-second limit
    ...
})

How to get the current timestamp from the cursor position for Azure EventHub

I am using EventProcessorHost for reading messages from EventHub. It maintains checkpoints in blob storage in the following format:
{
    "PartitionId": "0",
    "Owner": "xxxxxxxxxxxxxxxxxx",
    "Token": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx",
    "Epoch": 7370,
    "Offset": "12116110271960",
    "SequenceNumber": 106597952
}
I want to know if there is a way to find out the timestamp of the events being read using the above information.
I am planning to use this to build a simple application that shows the read status per partition and alerts in case the backlog on a partition is growing.
You can compare the sequence number of the current message being processed against the last known sequence number generated for a partition. The difference between these numbers is how far behind the latest message your processing has fallen, and therefore how many messages need to be processed to catch up.
I wrote an article that shows how to achieve this using Azure Functions, but the concept is the same: calculate the number of messages in the backlog using the message-sequence / partition-sequence technique, turn that into a metric, and record it somewhere. That lets you visualise it on a dashboard like Grafana and alert when it breaches a threshold; I use Azure Monitor and a dynamic metric alert to do this. I also use the metric to scale out my processing logic, so it's useful to capture.
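The question uses the .NET EventProcessorHost, but the arithmetic is simple to sketch. Here is a rough illustration with the Python azure-eventhub v5 SDK (the connection string and hub name are placeholders; the checkpoint value is the SequenceNumber from the blob above):

from azure.eventhub import EventHubConsumerClient

# Placeholder connection details - substitute your own.
client = EventHubConsumerClient.from_connection_string(
    "Endpoint=sb://...;SharedAccessKeyName=...;SharedAccessKey=...",
    consumer_group="$Default",
    eventhub_name="my-hub",
)

checkpoint_sequence_number = 106597952  # "SequenceNumber" from the checkpoint blob

with client:
    # Ask the service for the partition's latest enqueued sequence number.
    props = client.get_partition_properties("0")
    backlog = props["last_enqueued_sequence_number"] - checkpoint_sequence_number
    print(f"Partition 0 backlog: {backlog} events")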

AWS Elasticsearch publishing wrong total request metric

We have an AWS Elasticsearch cluster set up. However, our error rate alarm goes off at regular intervals. The way we are trying to calculate our error rate is:
((sum(4xx) + sum(5xx))/sum(ElasticsearchRequests)) * 100
However, if you look at the screenshot below, at 7:15 the 4xx count was 4, yet the ElasticsearchRequests value is only 2. Based on the metrics info on the AWS Elasticsearch documentation page, ElasticsearchRequests should be the total number of requests, so it should clearly be greater than or equal to 4xx.
Can someone please help me understand what I am doing wrong here?
AWS definitions of these metrics are:
OpenSearchRequests (previously ElasticsearchRequests): The number of requests made to the OpenSearch cluster. Relevant statistics: Sum
2xx, 3xx, 4xx, 5xx: The number of requests to the domain that resulted in the given HTTP response code (2xx, 3xx, 4xx, 5xx). Relevant statistics: Sum
Please note the different terms used for the subjects of the metrics: cluster vs domain
To my understanding, OpenSearchRequests only counts requests that actually reach the underlying OpenSearch/Elasticsearch cluster, so some of the 4xx requests (e.g. 403 errors) might not be included, hence the difference in the metrics.
Also, AWS only recommends comparing 5xx to OpenSearchRequests:
5xx alarms >= 10% of OpenSearchRequests: One or more data nodes might be overloaded, or requests are failing to complete within the idle timeout period. Consider switching to larger instance types or adding more nodes to the cluster. Confirm that you're following best practices for shard and cluster architecture.
I know this was posted a while back, but I've also struggled with this issue, so maybe I can add a few pointers.
First off, make sure your metrics are properly configured. For instance, some responses (4xx, for example) can take up to 5 minutes to register, while OpenSearchRequests are refreshed every minute. This makes for a very wonky graph that will definitely throw off your error rate.
In the picture above, I send a request that returns 400 every 5 seconds and a request that returns 200 every 0.5 seconds, with the period set to 1 minute. On average this should work out to roughly a 10% error rate (12 errors against 132 total requests per minute, about 9%). As you can see from the green line, the requests sent are summed every minute, whereas the 4xx are summed every 5 minutes and are 0 in the minutes in between, which makes for an error rate spike every 5 minutes (since the OpenSearch requests are not multiplied by 5).
In the next image, the period is set to 5 minutes. Notice how this time the error rate is around 10 percent.
When I look at your graph, I see metrics that look like they are based off of a different period.
The second pointer I would add is to account for periods when no data is coming in. The alarm's behavior may vary based on how you define the "treat missing data" parameter. In some cases, if no data comes in, your expression might keep the alarm in the ALARM state when in fact there is simply no new data arriving. Some metrics return no value when no requests are made, while others return 0. In the former case, you can use the FILL(metric, value) function to specify what to return when no value is reported. Also experiment with what happens to your error rate if you divide by zero.
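As a rough sketch of the FILL idea (assuming boto3 and the AWS/ES namespace with DomainName/ClientId dimensions; the domain name, account ID, and fill defaults are placeholders to adapt):

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

def metric(metric_id, name):
    # Dimensions for a hypothetical domain - substitute your own values.
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/ES",
                "MetricName": name,
                "Dimensions": [
                    {"Name": "DomainName", "Value": "my-domain"},
                    {"Name": "ClientId", "Value": "123456789012"},
                ],
            },
            "Period": 300,  # one consistent period for every metric
            "Stat": "Sum",
        },
        "ReturnData": False,  # only the final expression is returned
    }

now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        metric("m4xx", "4xx"),
        metric("m5xx", "5xx"),
        metric("mreq", "OpenSearchRequests"),
        {
            # FILL() substitutes a value whenever a metric reports no data,
            # avoiding divide-by-zero and phantom spikes from missing points.
            "Id": "error_rate",
            "Expression": "100*(FILL(m4xx,0)+FILL(m5xx,0))/FILL(mreq,1)",
        },
    ],
    StartTime=now - timedelta(hours=3),
    EndTime=now,
)
print(resp["MetricDataResults"])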
Hope this message helps clarify a bit.

Catching timeout errors in AWS Api Gateway

Since the API Gateway time limit is 10 seconds to execute any request, I'm trying to deal with timeout errors, but I haven't found a way to catch them and respond with a custom message.
Context of the problem: I have a function that takes less than 2 seconds to execute, but when the function performs a cold start it sometimes takes more than 10 seconds to create a connection with DynamoDB in Java. I've already optimized my function using threads, but I still cannot stay within the 10-second limit for the initial call.
I need to find a way to deliver a response model like this:
{
    "error": "timeout"
}
To find a solution, I created a Lambda function that intentionally responds after 10 seconds of execution. Integrating it with API Gateway, I get this response:
Request: /example/lazy
Status:
Latency: ms
Response Body
{
    "logref": "********-****-****-****-1d49e75b73de",
    "message": "Timeout waiting for endpoint response"
}
In the documentation I found that you can catch these errors using an HTTP status regex in the Integration Response, but I haven't found a way to do so, and it seems nobody else on the Internet has this same problem, as I haven't found this specific message in any forum.
I have tried these regexes:
.*"message".*
Timeout.*
.*"status":400.*
.*"status":404.*
.*"status":504.*
.*"status":500.*
Does anybody know which regex I should use to capture this "message": "Timeout..."?
You are using the Test Invoke feature from the console, which has a timeout limit of 10 seconds. The deployed API's timeout, however, is 30 seconds, as mentioned here. That should be enough to handle the Lambda cold start case. Please deploy and then test using the API link. If that still times out because your endpoint takes more than 30 seconds, the response would be:
{"message": "Endpoint request timed out"}
To clarify: you can configure your method response based on the HTTP status code of the integration response, but in the case of a timeout there is no integration response, so you cannot use that feature to configure the method response for timeouts.
You can improve the cold start time by allocating more memory to your Lambda function. With the default 512MB, I am seeing cold start times of 8-9 seconds for functions written in Java. This improves to 2-3 seconds with 1536MB of memory.
Amazon says that it is really the CPU allocation that matters, but there is no way to increase it directly; CPU allocation increases proportionally with memory.
And if you want close to zero cold start times, keeping the function warm is the way to go, as described here.
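The question's function is Java, but the warm-keeping pattern is the same in any runtime. A minimal Python sketch (the warmup key and the schedule are illustrative assumptions, not a fixed convention):

import json

def handler(event, context):
    # A scheduled CloudWatch Events / EventBridge rule (e.g. rate(5 minutes))
    # invokes the function with a marker payload such as {"warmup": true}.
    if event.get("warmup"):
        # Short-circuit: the container is warm now; skip the real work.
        return {"statusCode": 200, "body": "warmed"}

    # ... normal request handling (DynamoDB access, etc.) goes here ...
    return {"statusCode": 200, "body": json.dumps({"ok": True})}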

Client errors with the YouTube API in Python

I have a Python program which queries YouTube to get video details, using the version 3 API. I run m separate Python processes, each with a pool of 10 worker processes.
from multiprocessing import Pool

songs_pool = Pool(processes=10)  # pool of 10 worker processes
return_pool = songs_pool.map(getVideo, songs_list)
I get some client errors when m is increased to more than 2 and the pool size to more than 5: I get forbidden errors. When I check the number of requests in the Google developer console, it shows around 250 requests per second, but according to the documentation the limit is 3,000 requests per second. I don't understand why I am getting the client errors. Is there a way to avoid these errors and run the program faster?
If m = 2 and processes = 10, I get no errors, but it takes a long time to complete.
But if I increase them, I get client errors on ~5% of the total requests.
The per-user limit is 3,000 requests per second from a single IP address, and as soon as you go above that in a given second you'll start getting the forbidden errors. The analytics you see in the developer console only report your average number of requests over a 5-minute period; therefore, if you had zero requests for 4 minutes and then started running your routine, the console may show only 250 requests per second (as an average), but your app is likely overrunning the limit in one or two given seconds.
It seems you're handling it in the best way possible if speed is your concern; you'll want to run it fast enough to get a very small number of errors (so you know you're staying up at your limit). Another option, though, might be to look into using etags; if you find yourself requesting info on the same videos a lot, etags can tell you whether or not any info has changed (and if the API responds that nothing has changed, it doesn't count against either your quota or your requests/sec).
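A rough sketch of the etag approach (in Python with the requests library; the API key and the in-memory cache are illustrative, and getVideo mirrors the helper named in the question):

import requests

API_KEY = "YOUR_API_KEY"  # placeholder
URL = "https://www.googleapis.com/youtube/v3/videos"

etag_cache = {}  # video_id -> (etag, cached response body)

def getVideo(video_id):
    headers = {}
    if video_id in etag_cache:
        # Ask the API to send data only if it changed since we last saw it.
        headers["If-None-Match"] = etag_cache[video_id][0]

    resp = requests.get(
        URL,
        params={"part": "snippet", "id": video_id, "key": API_KEY},
        headers=headers,
    )
    if resp.status_code == 304:
        # Not modified: reuse the cached copy; per the answer above, this
        # does not count against quota or requests/sec.
        return etag_cache[video_id][1]

    data = resp.json()
    etag_cache[video_id] = (data["etag"], data)
    return data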