Is it possible to get metrics on how many search requests were processed over a certain period on Elasticsearch at AWS? Something like the CloudWatch monitoring for CloudSearch, where you can check the number of successful requests per minute (RPM):
I just found the _stats endpoint, which lets you retrieve interesting metrics; basically you divide indices.search.query_time_in_millis by indices.search.query_total to get the average time per query.
I still don't know a good way to get real-time data to plot a monitoring graph.
Source #1
Source #2
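For reference, here's a minimal Python sketch of pulling those numbers from the _stats endpoint; the domain URL is a placeholder, and a domain with IAM-based access control would additionally need signed requests. Since query_total is cumulative, polling it on an interval and diffing consecutive samples also gives you a requests-per-minute figure.

import requests

# Placeholder endpoint; replace with your own AWS Elasticsearch domain URL.
ENDPOINT = "https://search-mydomain-abc123.us-east-1.es.amazonaws.com"

search = requests.get(ENDPOINT + "/_stats").json()["_all"]["total"]["search"]

query_total = search["query_total"]              # cumulative number of queries
query_time_ms = search["query_time_in_millis"]   # cumulative time spent on queries

if query_total:
    print("queries executed:", query_total)
    print("avg query time (ms):", query_time_ms / query_total)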
I'm trying to combine a number of similar metrics into a single alarm in AWS CloudWatch. For example, for data quality monitoring in SageMaker, one of the metrics emitted by a data quality monitoring job is the feature baseline drift distance for each column. Let's say I have 600 columns, so each column will have this metric. Is there a way to compress these metrics into a single CloudWatch alarm?
If not, is there any way to send the violation report as a message via AWS SNS?
While I'm not sure exactly what outcome you want when you refer to "compress the metrics into a single alarm", you can look at using metric math.
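A hedged sketch of what that could look like with boto3 is below. The namespace, metric names and dimensions are assumptions (check what your monitoring schedule actually emits in CloudWatch), and CloudWatch limits how many metrics a single math expression can reference, so 600 columns may need to be split across a few alarms or pre-aggregated. The expression alarms on the worst drift across the listed features, and AlarmActions can point at an SNS topic to cover the notification part of your question.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical feature names; in practice you would enumerate your columns.
features = ["age", "income", "zip_code"]

metric_queries = [
    {
        "Id": f"m{i}",
        "MetricStat": {
            "Metric": {
                # Namespace, metric name and dimensions are assumptions.
                "Namespace": "aws/sagemaker/Endpoints/data-metrics",
                "MetricName": f"feature_baseline_drift_{name}",
                "Dimensions": [
                    {"Name": "Endpoint", "Value": "my-endpoint"},
                    {"Name": "MonitoringSchedule", "Value": "my-schedule"},
                ],
            },
            "Period": 3600,
            "Stat": "Maximum",
        },
        "ReturnData": False,
    }
    for i, name in enumerate(features)
]

# Metric math: a single series that is the maximum drift across all features.
metric_queries.append({
    "Id": "worst_drift",
    "Expression": "MAX(METRICS())",
    "Label": "Max feature baseline drift",
    "ReturnData": True,
})

cloudwatch.put_metric_alarm(
    AlarmName="feature-baseline-drift-any-column",
    Metrics=metric_queries,
    ComparisonOperator="GreaterThanThreshold",
    Threshold=0.1,
    EvaluationPeriods=1,
    # Optional: notify an SNS topic when the alarm fires.
    # AlarmActions=["arn:aws:sns:us-east-1:123456789012:drift-alerts"],
)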
I keep reading that I can write a log query to sample a percentage of logs but I have found zero examples.
https://cloud.google.com/blog/products/gcp/preventing-log-waste-with-stackdriver-logging
You can also choose to sample certain messages so that only a percentage of the messages appear in Stackdriver Logs Viewer
How do I get 10% of all GCE load balancer logs with a log query? I know I can configure this on the backend, but I don't want that. I want to get 100% of logs in Stackdriver and create a Pub/Sub log sink with a log query that only captures 10% of them and sends those sampled logs somewhere else.
I suspect you'll want to create a Pub/Sub sink for Log Router. See Configure and manage sinks
Using Google's Logging query language, you can use the sample function to filter (include) a fraction of log entries.
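A sketch of doing that with the google-cloud-logging Python client, aimed at the Pub/Sub sink described in the question; the project, sink name and topic below are placeholders, and sample() keeps roughly the requested fraction of entries based on a hash of insertId.

import google.cloud.logging

client = google.cloud.logging.Client(project="my-project")  # placeholder project

# Keep ~10% of HTTP(S) load balancer log entries via the sample() function.
log_filter = 'resource.type="http_load_balancer" AND sample(insertId, 0.10)'

sink = client.sink(
    "lb-logs-10pct-sample",  # placeholder sink name
    filter_=log_filter,
    destination="pubsub.googleapis.com/projects/my-project/topics/lb-sample",
)
sink.create()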
We are trying to find out why there is such high latency from Google CDN.
Our site is behind Google's http_load_balancer with CDN turned on.
For example, by inspecting a sample GET request for a jpg file (43 KB), we can see from the http_load_balancer logs that around 30% of such requests have httpRequest.latency > 1 second, and a lot are taking much longer, like several or even hundreds of seconds.
This is just from looking at a 24h log sample (around 6K of the same request).
The httpRequest.cacheLookup and httpRequest.cacheHit for all of those requests are true.
Also, jsonpayload_type_loadbalancerlogentry.statusdetails is response_from_cache and the jsonpayload_type_loadbalancerlogentry.cacheid value shows the correct region.
When doing the same GET request manually in the browser we are getting expected results with TTFB around 15-20ms.
Any idea where to look for a clue?
The httpRequest.latency field measures the entire download duration, and is directly impacted by slow clients - e.g. a mobile device on a congested network or throttled data plan.
You can check this by looking at the frontend_tcp_rtt metric (which is the RTT between the client and Cloud CDN) in Cloud Monitoring, as well as the average, median and 90th percentile total_latencies, where the slow clients will show up as outliers: https://cloud.google.com/load-balancing/docs/https/https-logging-monitoring#monitoring_metrics_fors
You may find that slow clients are from a specific group of client_country values.
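If it helps, here's a rough sketch of pulling https/frontend_tcp_rtt from Cloud Monitoring with the Python client so you can compare client-side RTT against those latency numbers (swap the metric type for total_latencies to look at percentiles); the project ID is a placeholder and the client_country label is read only if present.

import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"  # placeholder project

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 24 * 3600}, "end_time": {"seconds": now}}
)

# Last 24h of client <-> Google frontend round-trip time, one series per label combination.
results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "loadbalancing.googleapis.com/https/frontend_tcp_rtt"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    country = series.metric.labels.get("client_country", "unknown")
    for point in series.points:
        # frontend_tcp_rtt is a distribution metric; print the mean RTT in ms.
        print(country, point.value.distribution_value.mean)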
Latency can be introduced:
Between the original client and the load balancer. You can see the latency of that segment with the metric https/frontend_tcp_rtt.
Between the load balancer and the backend instance. This can be reviewed with the metric https/backend_latencies (which also includes the app processing time in your backend).
By the software running on the instance itself. To investigate this, I would check the access/error logs of the backend software and the resource utilization of the VM instance.
Further information about each metric is in the GCP load balancer metrics doc.
httpRequest.latency log field description:
"The request processing latency on the server, from the time the request was received until the response was sent."
I'd like to know if it's possible to discover which resource is behind this cost in my Cost Explorer. Grouping by usage type, I can see it is Data Processing bytes, but I don't know which resource would be consuming this amount of data.
Does anyone have an idea how to track it down in CloudWatch?
This is almost certainly because something is writing more data to CloudWatch than in previous months.
As stated on this AWS Support page about unexpected CloudWatch Logs bill increases:
Sudden increases in CloudWatch Logs bills are often caused by an increase in ingested or storage data in a particular log group. Check data usage using CloudWatch Logs Metrics and review your Amazon Web Services (AWS) bill to identify the log group responsible for bill increases.
Your screenshot identifies the large usage type as APS2-DataProcessing-Bytes. I believe that the APS2 part is telling you it's about the ap-southeast-2 region, so start by looking in that region when following the instructions below.
Here's a brief summary of the steps you need to take to find out which log groups are ingesting the most data:
How to check how much data you're ingesting
The IncomingBytes metric shows you how much data is being ingested in your CloudWatch log groups in near-real time. This metric can help you to determine:
Which log group is the highest contributor towards your bill
Whether there's been a spike in the incoming data to your log groups or a gradual increase due to new applications
How much data was pushed in a particular period
To query a small set of log groups:
Open the Amazon CloudWatch console.
In the navigation pane, choose Metrics.
For each of your log groups, select the IncomingBytes metric, and then choose the Graphed metrics tab.
For Statistic, choose Sum.
For Period, choose 30 Days.
Choose the Graph options tab and choose Number.
At the top right of the graph, choose custom, and then choose Absolute. Select a start and end date that corresponds with the last 30 days.
For more details, and for instructions on how to query hundreds of log groups, read the full AWS support article linked above.
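If you'd rather script this than click through the console, a rough boto3 equivalent (daily IncomingBytes summed over the last 30 days, using the ap-southeast-2 region suggested by the APS2 usage type) could look like this:

import boto3
from datetime import datetime, timedelta, timezone

region = "ap-southeast-2"  # region implied by the APS2 usage type
logs = boto3.client("logs", region_name=region)
cloudwatch = boto3.client("cloudwatch", region_name=region)

end = datetime.now(timezone.utc)
start = end - timedelta(days=30)

totals = {}
for page in logs.get_paginator("describe_log_groups").paginate():
    for group in page["logGroups"]:
        name = group["logGroupName"]
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/Logs",
            MetricName="IncomingBytes",
            Dimensions=[{"Name": "LogGroupName", "Value": name}],
            StartTime=start,
            EndTime=end,
            Period=86400,          # one datapoint per day
            Statistics=["Sum"],
        )
        totals[name] = sum(dp["Sum"] for dp in stats["Datapoints"])

# Top 10 log groups by ingested bytes over the last 30 days.
for name, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{name}: {total / 1e9:.2f} GB ingested")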
Apart from the steps Gabe mentioned, what helped me identify the resource that was creating a large number of logs was:
heading over to CloudWatch
selecting the region shown in Cost Explorer
selecting Log Groups
from the settings under Log Groups, enabling the Stored bytes column to be visible
This showed me which service was causing a lot of logs to be written to CloudWatch.
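The same Stored bytes view can also be fetched with boto3 if you prefer a script: describe_log_groups returns storedBytes per log group, so sorting on it surfaces the heaviest writers (the region mirrors the Cost Explorer example above).

import boto3

logs = boto3.client("logs", region_name="ap-southeast-2")  # region from Cost Explorer

groups = []
for page in logs.get_paginator("describe_log_groups").paginate():
    groups.extend(page["logGroups"])

# Largest log groups by stored bytes first.
for group in sorted(groups, key=lambda g: g.get("storedBytes", 0), reverse=True)[:10]:
    print(f"{group['logGroupName']}: {group.get('storedBytes', 0) / 1e9:.2f} GB stored")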
I'm trying to estimate how much GuardDuty is going to cost me per month, and according to https://aws.amazon.com/guardduty/pricing/ I should look at how many CloudTrail logs I produce a month as well as how many GB of VPC Flow Logs I produce a month.
Using boto3 with S3 I can count how many logs are in my bucket, which tells me how much I am going to spend having GuardDuty read my logs. Now I want to find out how many GB of data my VPC Flow Logs are producing, but I can't figure out where to pull that kind of information from. I want to programmatically see how many GB of VPC Flow Logs I produce a month to best estimate how much I would spend on GuardDuty.
This code snippet shows how to get the size of the VPC flow logs associated with each network interface in the VPC. You will have to modify the script to fetch the logs for the entire month and sum them.
import boto3

logs = boto3.client('logs')

# List the log groups and identify the VPC flow log group
for log_group in logs.describe_log_groups()['logGroups']:
    print(log_group['logGroupName'])

# Get the log streams in 'vpc-flow-logs' and their stored size in bytes
for stream in logs.describe_log_streams(logGroupName='vpc-flow-logs')['logStreams']:
    print(stream['logStreamName'], stream['storedBytes'])
describe_log_streams
Lists the log streams for the specified log group. You can list all the log streams or filter the results by prefix. You can also control how the results are ordered.
This operation has a limit of five transactions per second, after which transactions are throttled.
Request Syntax
response = client.describe_log_streams(
    logGroupName='string',
    logStreamNamePrefix='string',
    orderBy='LogStreamName'|'LastEventTime',
    descending=True|False,
    nextToken='string',
    limit=123
)
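Building on the snippet above, a rough way to total the stored size of the whole (hypothetical) vpc-flow-logs group, with pagination, is shown below. Keep in mind that storedBytes reflects what is currently retained in the group rather than strictly one month of ingestion, so the IncomingBytes CloudWatch metric over a 30-day window may track GuardDuty's per-GB pricing more closely.

import boto3

logs = boto3.client("logs")
LOG_GROUP = "vpc-flow-logs"  # same hypothetical group name as above

total_bytes = 0
for page in logs.get_paginator("describe_log_streams").paginate(logGroupName=LOG_GROUP):
    for stream in page["logStreams"]:
        total_bytes += stream.get("storedBytes", 0)

print(f"{LOG_GROUP}: {total_bytes / 1e9:.2f} GB stored")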