Monitor AWS Service Status using Splunk - amazon-web-services

Problem
Dependency on AWS Services status
If you depend on AWS services to operate, you need to keep a close eye on their status. Amazon publishes it at http://status.aws.amazon.com/, which links to RSS feeds for specific services in specific regions.
Potential Errors
Our service uses S3, CloudFront, and other AWS services. We'd like to be informed when any of them goes down during our hours of operation, and to automate our response when something goes wrong.
Splunk Logging
We use Splunk to log all of our services.
Requirement
For instance, if errors occur in the application while writing to S3, we'd like to know whether they were caused by an outage in AWS.
How to monitor the Status RSS feed in Splunk?
Is there an HTTP client for that? A background service?

Solution
You can use the Syndication Input app to collect the RSS feed data from the AWS status page.
Create a query that fetches the RSS items reporting errors; the items are stored in Splunk indexes under the syndication sourcetype.
Create an alert based on the query, with a since field so that the alerts can be adjusted over time.
How
Ask your Splunk team to install the "Syndication Input" app on the environments you need.
After that, add each RSS feed you need under Settings -> Data Input -> Syndication Feed. Take the URLs from the Amazon status RSS feeds and use each one as a Splunk data input, filling out the form with a polling interval:
http://status.aws.amazon.com/rss/cloudfront.rss
http://status.aws.amazon.com/rss/s3-us-standard.rss
http://status.aws.amazon.com/rss/s3-us-west-1.rss
http://status.aws.amazon.com/rss/s3-us-west-2.rss
When you are finished, the Syndication App has the following:
Search for the errors when they occur, adjusting the "since" date so that you can create an alert from the results. I set it to a day in the past just for display purposes.
since should be the day you start monitoring AWS. With it, the query returns any new event that Amazon publishes after that day, with errors identified by the text Informational message:.
Until then, the query should not return anything, because no item is dated after since.
Amazon publishes resolutions as new RSS feed items carrying the token RESOLVED, so we exclude them from the alerts.
sourcetype=syndication "Informational message:" NOT "RESOLVED"
| eval since=strptime("2010-08-01", "%Y-%m-%d")
| eval date=strptime(published_parsed, "%Y-%m-%dT%H:%M:%SZ")
| rex field=summary_detail_base "rss\/(?<aws_object>.*)\.rss$"
| where date > since
| table aws_object, published_parsed, id, title, summary
| sort -published_parsed
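The question also asked whether an HTTP client could do this. As a rough sketch outside Splunk, the same filter as the query above can be applied to an RSS document that has already been downloaded. The function name open_incidents and the assumed item layout (title, description, pubDate, guid) are illustrative assumptions, not part of the Syndication Input app.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Same idea as the rex in the query: take the feed name out of the feed URL.
FEED_NAME = re.compile(r"rss/(?P<aws_object>.*)\.rss$")

# Mirrors the "since" value used in the query (assumed to be UTC here).
DEFAULT_SINCE = datetime(2010, 8, 1, tzinfo=timezone.utc)


def open_incidents(feed_url: str, rss_xml: str, since: datetime = DEFAULT_SINCE):
    """Return unresolved 'Informational message:' items published after `since`.

    Assumes each <item> has <title>, <description>, <pubDate> and <guid>
    children and that <pubDate> carries a timezone.
    """
    match = FEED_NAME.search(feed_url)
    aws_object = match.group("aws_object") if match else feed_url
    rows = []
    for item in ET.fromstring(rss_xml).iter("item"):
        title = item.findtext("title") or ""
        text = title + " " + (item.findtext("description") or "")
        if "Informational message:" not in text or "RESOLVED" in text:
            continue
        raw_date = item.findtext("pubDate")
        if not raw_date:
            continue  # cannot compare against `since` without a date
        published = parsedate_to_datetime(raw_date)
        if published > since:
            rows.append({
                "aws_object": aws_object,
                "published": published,
                "id": item.findtext("guid"),
                "title": title,
            })
    # Newest first, like "sort -published_parsed".
    rows.sort(key=lambda row: row["published"], reverse=True)
    return rows
```

The XML text would come from whatever HTTP client you prefer; fetching is left out so the filter can be tried on a saved copy of a feed.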
Create an Alert with the Query. For instance, to send an email:


Is there a way to publish custom metrics from AWS Glue jobs?

I'm using an AWS Glue job to move and transform data across S3 buckets, and I'd like to build custom accumulators to monitor the number of rows that I'm receiving and sending, along with other custom metrics. What is the best way to monitor these metrics? According to this document: https://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html I can keep track of general metrics on my Glue job, but there doesn't seem to be a good way to send custom metrics through CloudWatch.
I have done a lot of similar projects. Each micro-batch can be:
a file or a bunch of files
a time interval of data from an API
a partition of records from a database
etc.
Your use case can be broken down into four questions:
Given a batch of input, how do you define a task_id?
How do you define the metrics for your task? A simple dictionary structure is enough for the metrics data.
Which backend data store will hold the metrics data?
How will you query the metrics data?
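The first two questions can be sketched as follows. This is only one possible choice: the task ID is derived by hashing the sorted input paths, and the metric names rows_received and rows_sent are placeholders picked to match the question, not fields of any library.

```python
import hashlib
from dataclasses import asdict, dataclass


def make_task_id(input_paths):
    """Derive a stable ID from a micro-batch of input paths.

    Sorting first makes the ID independent of the order in which the
    paths were listed.
    """
    digest = hashlib.sha256()
    for path in sorted(input_paths):
        digest.update(path.encode("utf-8"))
        digest.update(b"\n")  # separator so ["ab"] and ["a", "b"] differ
    return digest.hexdigest()[:16]


@dataclass
class JobMetrics:
    """Plain container for the numbers a job wants to report."""
    rows_received: int = 0
    rows_sent: int = 0

    def to_dict(self):
        # A simple dictionary is easy to store in most backends.
        return asdict(self)
```

For a time-interval or database-partition batch, the same idea applies with the interval bounds or partition key as the hashed input.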
In some business use cases you also need to store status information to track each input (did it succeed, did it fail, is it in progress or stuck?), and you may want to control retries and concurrency (to avoid multiple workers processing the same input).
DynamoDB is well suited as a backend for this type of use case. It is a fast, no-ops, pay-as-you-go, automatically scaling key-value store.
There's a Python library that implements this pattern: https://github.com/MacHu-GWU/pynamodb_mate-project/blob/master/examples/patterns/status-tracker.ipynb
Here's an example:
Put your Glue ETL job's main logic in a function:
def glue_job() -> dict:
    ...
    return your_metrics
Given an input, calculate the task ID, then you just need:
tracker = Tracker.new(task_id)
# start the job, it will succeed
with tracker.start_job():
    # do some work
    your_metrics = glue_job()
    # save your metrics in dynamodb
    tracker.set_data(your_metrics)
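To make the pattern concrete without a DynamoDB table, here is a toy in-memory stand-in. It is not the pynamodb_mate API; the class InMemoryTracker, its status names, and the dict used as the "table" are all assumptions made for illustration.

```python
from contextlib import contextmanager


class InMemoryTracker:
    """Toy status tracker; a plain dict plays the role of the table."""

    def __init__(self, store, task_id):
        self.store = store
        self.task_id = task_id
        self.store.setdefault(
            task_id, {"status": "pending", "data": None, "error": None}
        )

    @contextmanager
    def start_job(self):
        record = self.store[self.task_id]
        if record["status"] == "in_progress":
            # Crude concurrency guard: refuse to run the same task twice.
            raise RuntimeError(f"task {self.task_id} is already running")
        record["status"] = "in_progress"
        try:
            yield self
        except Exception as exc:
            # Record the failure, then let the caller see the error.
            record["status"] = "failed"
            record["error"] = repr(exc)
            raise
        else:
            record["status"] = "succeeded"

    def set_data(self, data):
        # Store a copy of the metrics dictionary next to the status.
        self.store[self.task_id]["data"] = dict(data)
```

A real backend would replace the dict reads and writes with conditional updates so that two workers cannot both claim the same task.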
Consider enabling continuous logging on your AWS Glue job. This allows custom logging via CloudWatch, which can include information such as row counts.
More specifically:
Enable continuous logging for your Glue job.
Add logger = glueContext.get_logger() at the beginning of your Glue job.
Add logger.info("Custom logging message that will be sent to CloudWatch") wherever you want to log information to CloudWatch. For example, if I have a data frame named df, I could log its number of rows to CloudWatch by adding logger.info("Row count of df " + str(df.count())).
Your log messages will be located in the CloudWatch log group /aws-glue/jobs/logs-v2, in the log stream named after the Glue job run ID followed by -driver.
You can also reference the "Logging Application-Specific Messages Using the Custom Script Logger" section of the AWS documentation Enabling Continuous Logging for AWS Glue Jobs for more information on application specific logging.
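If several metrics are logged this way, emitting them as one JSON line can make them easier to search later. The helper below is a small sketch; the name metric_message and the shape of the JSON object are assumptions, not something Glue requires.

```python
import json


def metric_message(job_name, **metrics):
    """Serialize custom metrics as a single JSON log line."""
    return json.dumps({"job": job_name, "metrics": metrics}, sort_keys=True)


# Inside a Glue job, with the logger described above, this could be used as:
# logger.info(metric_message("my_job", rows_received=df.count()))
```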

Hasura on Google Cloud Run - Monitoring

I would like to monitor my Hasura API on Google Cloud Run. I'm currently using Google Cloud's monitoring, but it doesn't give me what I need. I get the count of requests with a 200 status code, but I want, for example, the number of requests for each query / mutation.
I want:
count 123 : /graphql/user
count 234 : /graphql/profil
I have :
count 357 : /graphql
If you have an idea.
Thanks
You can't do this with GraphQL unfortunately. All queries are sent to the /v1/graphql endpoint on Hasura, and the only way to distinguish the operations is by parsing the query parameter of the HTTP request and grabbing the operation name.
If Google Cloud allows you to query properties in logs of HTTP requests, you can set up filters on the body, something like:
"Where [request params].query includes 'MyQueryName'"
Otherwise your two options are:
Use Hasura Cloud (https://hasura.io/cloud), which gives you a count of all operations and detailed metrics (response time, variables, etc) on your console dashboard
Write and deploy a custom middleware server or a script for a reverse proxy that handles this
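For the second option, the core of such a middleware is pulling the operation name out of each request body and counting it. A minimal sketch, assuming the usual GraphQL-over-HTTP JSON body with a query string and an optional operationName; the function names and the "anonymous" label are illustrative.

```python
import json
import re
from collections import Counter

# Matches a leading "query Name", "mutation Name" or "subscription Name".
OPERATION = re.compile(
    r"^\s*(query|mutation|subscription)\s+([_A-Za-z][_0-9A-Za-z]*)"
)


def operation_name(body: str) -> str:
    """Return the operation name of one GraphQL request body."""
    payload = json.loads(body)
    if payload.get("operationName"):
        return payload["operationName"]
    match = OPERATION.match(payload.get("query") or "")
    return match.group(2) if match else "anonymous"


def count_operations(bodies) -> Counter:
    """Count requests per operation name."""
    return Counter(operation_name(body) for body in bodies)
```

This only works when clients name their operations; unnamed queries all end up under the same label.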

MTurk HITs created Through Java API are not showing on Manage Tab on UI

I am creating HITs on the MTurk Sandbox using the Java API. I am able to create HITs and also work on them by searching through the worker UI. But those HITs are not showing up in the "Manage" tab of the requester UI.
So as to interact with my own MTurk tasks I have developed a rudimentary management console to monitor, review, manage, and download data from API-launched tasks.
The management console is entirely JavaScript based and runs locally in your web browser. It replicates most of the basic functionality removed with the Manage HITs individually module in the December 2017 changes. You will need your API keys to use the tool but these are not stored nor transmitted to myself or any third-party.
You can download a copy from GitHub: https://github.com/jtjacques/mturk-manage/archive/master.zip
Please see the included README for comprehensive information about the tool, available on the GitHub project page https://github.com/jtjacques/mturk-manage
The "Manage" tab in the MTurk Requester Website is for managing batches created with the MTurk Requester Website (using the "Create" tab). If you want to view HITs that you create with the API, you can use the ListHITs API method, either with the API directly (from your Java code) or with the AWS Command Line Interface (CLI).
Here's a blog explaining how to do this with the AWS CLI:
https://blog.mturk.com/tutorial-managing-mturk-hits-with-the-aws-command-line-interface-56eaabb7fd4c
The blog shows how to use aws-shell, which is a more interactive shell that sits atop the AWS CLI. It has autocomplete and shows you inline "man" pages on each command. I personally prefer this.
The CLI and aws-shell will also let you write filters and formatters for results. So you can do things like this:
aws mturk list-hits --output table --query 'HITs[].{"1. HITId": HITId, "2. Title": Title, "3. Status":HITStatus}' --endpoint-url https://mturk-requester-sandbox.us-east-1.amazonaws.com --max-results 5
This calls ListHITs on the Sandbox (--endpoint-url), getting only 5 results (--max-results), formats the output as a table instead of the default JSON (--output), and filters that JSON for the HITs object (HITs[]), pulling down only the fields HITId, Title, and HITStatus while also setting titles for those fields as "1. HITId", "2. Title", and "3. Status".
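The same call can be sketched in Python. The function below takes the client as a parameter so it can be tried without credentials; in practice it would be a boto3 MTurk client, and the function name list_hit_summaries is an assumption for illustration.

```python
def list_hit_summaries(client, max_results=5):
    """Call ListHITs and keep only the ID, title and status of each HIT."""
    response = client.list_hits(MaxResults=max_results)
    return [
        {"HITId": hit["HITId"], "Title": hit["Title"], "Status": hit["HITStatus"]}
        for hit in response.get("HITs", [])
    ]


# With boto3 installed and credentials configured, a Sandbox client could be
# created like this (same endpoint as the CLI example above):
# import boto3
# client = boto3.client(
#     "mturk",
#     region_name="us-east-1",
#     endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
# )
# for row in list_hit_summaries(client):
#     print(row)
```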
There used to be a link in the MTurk Requester Website for a GUI to manage HITs individually which would show HITs from the API, but it was deprecated this month. There's a brief thread on it here: https://forums.aws.amazon.com/thread.jspa?threadID=267769&tstart=0

Is there an api to send notifications based on job outputs?

I know there are APIs to configure notifications when a job fails or finishes.
But what if, say, I run a Hive query that counts the number of rows in a table? If the returned result is zero, I want to send emails to the concerned parties. How can I do that?
Thanks.
You may want to look at Airflow and Qubole's operator for Airflow. We use Airflow to orchestrate all jobs run through Qubole and, in some cases, in non-Qubole environments. We use the DataDog API to report the success or failure of each task (Qubole or non-Qubole). DataDog can be replaced here by Airflow's email operator. Airflow also has some chat operators (such as Slack).
There is no direct API for triggering a notification based on the results of a query.
However, there is a way to do this using Qubole:
- Create a workflow in Qubole with the following steps:
1. Your query (any query), which writes its output to a particular location on S3.
2. A shell script that reads the result from S3 and fails the job based on any criteria. For instance, in your case, fail the job if the result returns 0 rows.
- Schedule this workflow using the "Scheduler" API to notify on failure.
You can also use the "Sendmail" shell command to send mail based on the results in step 2 above.
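The check in step 2 can be sketched as a small script. It assumes the query result has already been copied from S3 to a local file whose first non-empty line is the row count; the function names and that file layout are assumptions for illustration.

```python
import sys


def parse_count(result_text: str) -> int:
    """Read the row count from the first non-empty line of the result."""
    for line in result_text.splitlines():
        if line.strip():
            return int(line.strip())
    return 0  # an empty result is treated as zero rows


def exit_code_for(result_text: str) -> int:
    """Return 1 (failure) when the count is zero, otherwise 0."""
    return 1 if parse_count(result_text) == 0 else 0


def main(argv) -> int:
    # argv[1] is the path of the downloaded result file.
    with open(argv[1], encoding="utf-8") as handle:
        return exit_code_for(handle.read())


# When run as a script, finish with: sys.exit(main(sys.argv))
```

A non-zero exit code is what makes the workflow step fail, which in turn triggers the failure notification.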

Filter AWS Cloudwatch Lambda's Log

I have a Lambda function and its logs in Cloudwatch (Log group and Log Stream). Is it possible to filter (in Cloudwatch Management Console) all logs that contain "error"? For example logs containing "Process exited before completing request".
In Log Groups there is a button "Search Events". You must click on it first.
Then it "changes" to "Filter Streams":
Now you should just type your filter and select the beginning date-time.
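The same filter can be applied from code through the FilterLogEvents API. The sketch below takes the client as a parameter so it can be exercised without AWS access; in practice it would be a boto3 CloudWatch Logs client, and the function name find_error_events is an assumption.

```python
def find_error_events(client, log_group, start_ms, pattern='"error"'):
    """Collect all events in a log group that match a filter pattern.

    start_ms is the beginning date-time in milliseconds since the epoch.
    Follows nextToken until the service reports no further pages.
    """
    events = []
    token = None
    while True:
        kwargs = {
            "logGroupName": log_group,
            "filterPattern": pattern,
            "startTime": start_ms,
        }
        if token:
            kwargs["nextToken"] = token
        page = client.filter_log_events(**kwargs)
        events.extend(page.get("events", []))
        token = page.get("nextToken")
        if not token:
            return events


# With boto3 installed and credentials configured:
# import boto3
# logs = boto3.client("logs")
# hits = find_error_events(
#     logs, "/aws/lambda/my-function", 0,
#     pattern='"Process exited before completing request"',
# )
```

The log group name /aws/lambda/my-function is a placeholder; quoting the pattern makes it match the exact phrase.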
So this is kind of a side issue, but it was relevant for us. (I posted this to another answer on StackOverflow but thought it would be relevant to this conversation too)
We've noticed that tailing and searching logs gets really slow after a log group has a lot of Log Streams in it, like when an AWS Lambda Function has had a lot of invocations. This is because "tail" type utilities and searching need to connect to each log stream to run. Log Events get expired and deleted due to the policy you set on the Log Group itself, but the Log Streams never get cleaned up. I made a few little utility scripts to help with that:
https://github.com/four43/aws-cloudwatch-log-clean
Hopefully that saves you some agony waiting for those logs to be searched.
You can also use CloudWatch Insights (https://aws.amazon.com/about-aws/whats-new/2018/11/announcing-amazon-cloudwatch-logs-insights-fast-interactive-log-analytics/), an AWS extension to CloudWatch Logs that provides a pretty powerful query and analytics tool. However, it can be slow: some of my queries take up to a minute. That is acceptable if you really need the data.
You could also use a tool I created called SenseLogs. It downloads CloudWatch data to your browser, where you can run queries like the one you ask about. You can either use full-text search for "error" or, if your log data is structured (JSON), use a JavaScript-like expression language to filter by field, e.g.:
error == 'critical'
Posting an update as CloudWatch has changed since 2016:
In the Log Groups there is a Search all button for a full-text search
Then just type your search: