I have a process that uses AWS EMR to run a pyspark cluster.
I have an S3 location where all the process logs get stored.
Is there a way to filter out the ERROR logs and have them mailed to my inbox? I do not want to save any log files on my system.
Is there a Python library that can help me monitor the logs in real time? I have looked at boto3 and the EMR library, but I could not find an answer to my problem there.
The EMR logs will likely be buffered into chunks of a few minutes or of some size before being written to S3 (full disclosure: that's based on experience with other AWS S3 logging systems, not EMR itself).
If I were attempting to solve this problem, I'd use an AWS Lambda function to execute python that would read the S3 logs line by line and filter for the lines matching ERROR, and then use SNS to send the logs to your email address. You can use S3 events to automatically trigger the Lambda when objects are written to the S3 logging location for EMR, so this is as close to realtime as you're gonna get.
The architecture I am suggesting looks something like this:
EMR -> S3 -> Lambda -> SNS -> email inbox
The write of each EMR log to S3 triggers a Lambda, which uses boto3 to filter the log for error messages and sends alerts to an SNS topic for distribution to users.
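A minimal sketch of what such a Lambda could look like, assuming the function is subscribed to S3 "object created" events for the log prefix. The environment variable ALERT_TOPIC_ARN and the gzip check on the key suffix are assumptions for illustration, not something EMR or the answer prescribes. For simplicity it reads each object fully into memory rather than streaming it line by line.

```python
import gzip
import os
import urllib.parse

# Hypothetical environment variable holding the SNS topic ARN.
TOPIC_ARN = os.environ.get("ALERT_TOPIC_ARN", "")
MAX_SNS_BYTES = 256 * 1024  # SNS messages are limited to 256 KB


def error_lines(raw: bytes, key: str) -> list:
    """Return the lines of a log object that contain ERROR."""
    if key.endswith(".gz"):  # assumption: gzipped logs carry a .gz suffix
        raw = gzip.decompress(raw)
    text = raw.decode("utf-8", errors="replace")
    return [line for line in text.splitlines() if "ERROR" in line]


def handler(event, context):
    import boto3  # available in the Lambda Python runtime

    s3 = boto3.client("s3")
    sns = boto3.client("sns")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        hits = error_lines(body, key)
        if hits:
            # Truncate so that a very noisy log still fits in one message.
            message = "\n".join(hits).encode("utf-8")[:MAX_SNS_BYTES]
            sns.publish(
                TopicArn=TOPIC_ARN,
                Subject=f"EMR errors in {key}"[:100],
                Message=message.decode("utf-8", errors="ignore"),
            )
```

Nothing is written to local disk, and the email addresses live in the SNS topic's subscriptions rather than in the code.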
It may seem like a lot of moving parts but it won't require much to maintain it and should cost you only a few cents a month more than the S3 storage is already costing you. And the effort for the whole thing is actually pretty small.
Furthermore, you won't need:
  a place to execute your code, servers to manage, etc.
  a nontrivial deployment model for your project
  any parts not shown above, for that matter
And you'll get for free:
  Monitoring in the form of
    CloudWatch metrics for Lambda
    S3 logs (should you enable them)
    CloudWatch Logs that store your function's execution windows and stdout
  Easy integration into alerting through CloudWatch Alarms (these typically integrate well with PagerDuty and the like)
  Dead-simple extensibility, such as
    SNS can send SMS messages to your phone
    add more parsing options in the Lambda and redeploy
    expose CloudWatch metrics and add alarms for thresholds
    write the summary to S3 for pre-signed email or SMS links, or further processing now or later
You could send the email yourself through SES or just manually with python, but I would rather use SNS so that the subscriptions to the topic can vary independently from the python code.
Lambdas are a little intimidating to start with, but the Python runtime includes the boto3 SDK by default (which should obviate the need for a zip file with pip dependencies altogether), which simplifies creation.
For that matter, you can set all this up in the AWS console if you like doing things by dragging mouse pointers around or intend to do it only a few times, or you can express all of it in CloudFormation if you need something repeatable.
http://docs.aws.amazon.com/lambda/latest/dg/with-s3.html
http://docs.aws.amazon.com/lambda/latest/dg/python-programming-model-handler-types.html
http://docs.aws.amazon.com/sns/latest/dg/welcome.html
Related
We have a service that outputs application logs to CloudWatch. We structure the logs as JSON and write them to stdout, which Fluent Bit forwards to CloudWatch. We then have a stream set up to forward the logs from CloudWatch to S3, followed by Glue crawlers, Athena, and QuickSight for dashboards.
We got all this working, and I just saw today that there is a 256 KB limit in CloudWatch, which some of our application logs went over. How else can we get our logs out of our service to S3 (or maybe a different data store?) for analysis? Is CloudWatch not the right approach for this? The other option I thought of is to break up the application logs into multiple events, but then we need to plumb through a joinable ID, as well as write ETL logic that does more complex joins. I was hoping to avoid that unless it's considered a better practice than what we are doing.
Thanks!
I'd like to perform the following tasks on a regular basis (e.g. every day at 6 AM) using AWS:
get a new set of data using an API. This dataset is updated on a daily basis.
run a Python script that processes the obtained dataset using several Python libraries such as matplotlib, pandas, and plotly
automatically send the output of the script, which would be a single PDF file or an HTML dashboard, via email to a group of specified recipients
I know how to perform all of the above items locally - my goal is to automate this routine. I'm new to AWS and would appreciate some advice on how to perform these tasks in a straightforward way. Based on the reading I did so far, it looks like the serverless approach may be able to do the job and also reduce the complexity, but I'm not sure which functionalities exactly I should use.
For scheduling you can use Amazon EventBridge.
You can schedule either AWS Lambda or AWS Step Functions; both are serverless :).
You can have 3 Lambdas:
One to get the data and save it in S3/DynamoDB (if you want to persist the data).
A processor Lambda that builds the report and saves it to S3.
Another Lambda that reads the report from S3 and sends it by email using AWS SES.
If you don't want to use Step Functions, you can start each Lambda from an S3 put event, or you can trigger one Lambda from another using the AWS SDK.
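As a rough sketch of the scheduling part, the "every day at 6 AM" trigger can be expressed as an EventBridge cron rule that targets the first Lambda. The rule name, target Id, and function ARN are placeholders, and EventBridge cron rules are evaluated in UTC, so the hour may need adjusting for your time zone.

```python
def daily_cron(hour: int, minute: int = 0) -> str:
    """Build an EventBridge schedule expression for one run per day (UTC)."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # Fields: minutes hours day-of-month month day-of-week year.
    # One of day-of-month / day-of-week has to be "?".
    return f"cron({minute} {hour} * * ? *)"


def schedule_lambda(rule_name: str, function_arn: str, hour: int = 6) -> None:
    """Create (or update) the rule and point it at the Lambda."""
    import boto3  # imported here so daily_cron() works without the SDK

    events = boto3.client("events")
    events.put_rule(
        Name=rule_name,
        ScheduleExpression=daily_cron(hour),
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        # "fetch-data" is a hypothetical target Id.
        Targets=[{"Id": "fetch-data", "Arn": function_arn}],
    )
```

The Lambda also needs a resource-based permission that allows events.amazonaws.com to invoke it; the console adds this for you when you attach the trigger there.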
So there are different approaches you can take.
First off, I would create a Lambda. You can schedule the function to run on a cron schedule.
If the message you want to send is small:
I would create an SNS topic with an email fan-out.
Inside your Lambda you can then transform the data and send it out via SNS.
Otherwise:
I would use SES and send the mail via the SES SDK.
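Since SNS email notifications are plain text and cannot carry attachments, a PDF report would have to go through SES. Here is a minimal sketch of the SES route, assuming the report bytes have already been read (for example from S3). The addresses are placeholders, and the sender has to be a verified SES identity.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_report_email(sender, recipients, subject, body, pdf_bytes,
                       filename="report.pdf"):
    """Assemble a multipart message with a text body and one PDF attachment."""
    msg = MIMEMultipart()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.attach(MIMEText(body, "plain"))
    part = MIMEApplication(pdf_bytes, _subtype="pdf")
    part.add_header("Content-Disposition", "attachment", filename=filename)
    msg.attach(part)
    return msg


def send_report(msg, sender, recipients):
    """Hand the assembled message to SES as a raw email."""
    import boto3  # imported here so the message builder works without the SDK

    ses = boto3.client("ses")
    ses.send_raw_email(
        Source=sender,
        Destinations=recipients,
        RawMessage={"Data": msg.as_string()},
    )
```

While an account is still in the SES sandbox, the recipients have to be verified as well.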
I want to build an end to end automated system which consists of the following steps:
Getting data from source to landing bucket AWS S3 using AWS Lambda
Running some transformation job using AWS Lambda and storing in processed bucket of AWS S3
Running Redshift copy command using AWS Lambda to push the transformed/processed data from AWS S3 to AWS Redshift
Of the above, I've completed pulling the data, transforming it, and running the COPY command manually against Redshift from a SQL query tool.
Questions:
I've heard AWS CloudWatch can be used to schedule/automate things, but I have never worked with it. If I want to achieve the steps above in a streamlined fashion, how do I go about it?
Should I use Lambda to trigger the COPY and INSERT statements? Or are there better AWS services for this?
Any other suggestions on other AWS services and the like are most welcome.
Constraint: Want as many tasks as possible to be serverless (except for semantic layer, Redshift).
CloudWatch:
Your options here are either to use CloudWatch Alarms or Events.
With alarms, you can respond to any metric of your system (e.g. CPU utilization, disk IOPS, count of Lambda invocations, etc.) when it crosses some threshold, and when the alarm is triggered, invoke a Lambda function (or send an SNS notification, etc.) to perform a task.
With events, you can use either a cron expression or some AWS service event (e.g. an EC2 instance state change, an SNS notification, etc.) to trigger another service (e.g. Lambda), so you could, for example, run some kind of clean-up operation via Lambda on a regular schedule, or create a snapshot of an EBS volume when its instance is shut down.
Lambda itself is a very powerful tool, and should allow you to program a decent copy/insert function in a language you are familiar with. AWS has several GitHub repos with lots of examples too, see for example the serverless examples and many samples. There may be other services which could work for you in your specific case, but part of Lambda's power is its flexibility.
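For the COPY step specifically, one possible shape is a Lambda that submits the statement through the Redshift Data API, so no database driver has to be packaged. This is only a sketch: the table, bucket, role ARN, cluster, database, and user names are all made up, and the CSV options are an assumption about the processed files.

```python
def build_copy_sql(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Build a Redshift COPY statement for CSV files with a header row."""
    # Inputs are expected to come from trusted configuration, not user input.
    return (
        f"COPY {table} "
        f"FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )


def handler(event, context):
    import boto3  # available in the Lambda Python runtime

    client = boto3.client("redshift-data")
    sql = build_copy_sql(
        "analytics.sales",                                # hypothetical table
        "s3://my-processed-bucket/sales/",                # hypothetical prefix
        "arn:aws:iam::123456789012:role/redshift-copy",   # hypothetical role
    )
    response = client.execute_statement(
        ClusterIdentifier="my-cluster",  # hypothetical cluster
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )
    # The call is asynchronous; the Id can be polled with describe_statement.
    return response["Id"]
```

Triggering this function from an S3 event on the processed bucket, or from a scheduled rule, would chain it to the transformation step.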
We are hosting our services on AWS Beanstalk managed instances. That is forcing us to move away from file-based logging to database-based logging.
Is DynamoDB a good choice for replacing file-based logging? If so, what should the primary key be? I thought of using a timestamp, but multiple messages may be logged by the same service within the same timestamp, so that might not be reliable.
Any advice would be appreciated.
Don't use DynamoDB to store logs. You'll be paying for throughput and space needlessly.
Amazon CloudWatch has built-in logging capabilities.
http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/WhatIsCloudWatchLogs.html
Another alternative is a dedicated logging service such as Loggly which is cloud-based and can receive logs in many common formats, plus they have an API to send custom logs. In the web-based console, you can search and filter through the logs.
As an alternative, why don't you use cloudwatch? I ended up writing a whole app to consolidate logs across ec2 instances in a beanstalk app, then last year AWS opened up cloudwatch as a service, so I junked my stuff. You tell cloudwatch where your logs are on the instance, give it a log group and stream name, and all your logs are consolidated in one spot, in cloudwatch. You can also run alarms off them using the standard AWS setup. It's pretty slick, and easy - don't have to write a front end to do lookups, it's already there.
Don't know what you're using for logging - we are a node.js shop, used winston for logging, and there is a nice NPM module that works with Winston to log automatically, called winston-cloudwatch.
Amazon CloudWatch provides some very useful metrics for monitoring my EC2 instances, load balancers, ElastiCache and RDS databases, etc., and allows me to set alarms for a whole range of criteria; but is there any way to configure it to monitor my S3 buckets as well? Or are there any other monitoring tools (besides simply enabling logging) that will help me monitor the number of POST/GET requests and the data volumes for my S3 resources, and provide alarms for thresholds of activity or increased data storage?
AWS S3 is a managed storage service. The only metrics available in AWS CloudWatch for S3 are NumberOfObjects and BucketSizeBytes. In order to understand your S3 usage better you need to do some extra work.
I have recently written an AWS Lambda function to do exactly what you ask for and it's available here:
https://github.com/maginetv/s3logs-cloudwatch
It works by parsing S3 server access log files and aggregating/exporting metrics to AWS CloudWatch (CloudWatch allows you to publish custom metrics).
Example graphs that you will get in AWS CloudWatch after deploying this function on your AWS account are:
RestGetObject_RequestCount
RestPutObject_RequestCount
RestHeadObject_RequestCount
BatchDeleteObject_RequestCount
RestPostMultiObjectDelete_RequestCount
RestGetObject_HTTP_2XX_RequestCount
RestGetObject_HTTP_4XX_RequestCount
RestGetObject_HTTP_5XX_RequestCount
+ many others
Since metrics are exported to CloudWatch, you can easily set up alarms for them as well.
A CloudFormation template is included in the GitHub repo, so you can deploy this function very quickly to gain visibility into your S3 bucket usage.
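To give an idea of the approach, here is an independent sketch (not the code from the repository) that counts requests per operation and HTTP status class from S3 server access log lines and publishes them as custom metrics. The metric naming merely imitates the names listed above, and the namespace "S3AccessLogs" is an invented placeholder.

```python
import re
from collections import Counter

# Leading fields of an S3 server access log line:
# owner bucket [time] ip requester request-id operation key "request-uri" status ...
LINE_RE = re.compile(
    r'^\S+ \S+ \[[^\]]+\] \S+ \S+ \S+ (?P<operation>\S+) \S+ '
    r'(?:"[^"]*"|-) (?P<status>\d{3}|-)\s'
)


def metric_prefix(operation: str) -> str:
    """REST.GET.OBJECT -> RestGetObject"""
    return "".join(part.capitalize() for part in re.split(r"[._]", operation))


def count_requests(lines) -> Counter:
    """Aggregate request counts per operation and per HTTP status class."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not look like access log entries
        prefix = metric_prefix(match.group("operation"))
        counts[f"{prefix}_RequestCount"] += 1
        status = match.group("status")
        if status != "-":
            counts[f"{prefix}_HTTP_{status[0]}XX_RequestCount"] += 1
    return counts


def publish(counts: Counter, namespace: str = "S3AccessLogs") -> None:
    """Send the aggregated counts to CloudWatch as custom metrics."""
    import boto3  # imported here so the parser works without the SDK

    cloudwatch = boto3.client("cloudwatch")
    data = [
        {"MetricName": name, "Value": value, "Unit": "Count"}
        for name, value in counts.items()
    ]
    for start in range(0, len(data), 20):  # send in small batches
        cloudwatch.put_metric_data(
            Namespace=namespace, MetricData=data[start:start + 20]
        )
```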
EDIT 2016-12-10:
In November 2016 AWS added extra S3 request metrics to CloudWatch that can be enabled when needed. These include metrics like AllRequests, GetRequests, PutRequests, DeleteRequests, HeadRequests, etc. See the Monitoring Metrics with Amazon CloudWatch documentation for more details about this feature.
I was also unable to find any way to do this with CloudWatch. This question from April 2012 was answered by Derek@AWS as not having S3 support in CloudWatch. https://forums.aws.amazon.com/message.jspa?messageID=338089
The only thing I could think of would be to import the S3 access logs into a log service (like Splunk), then create a custom CloudWatch metric where you post the data that you parse from the logs. But then you have to filter out the polling of the access logs and…
And while you were at it, you could just create the alarms in Splunk instead of in CloudWatch.
If your use case is to simply alert when you are using it too much, you could set up an account billing alert for your S3 usage.
I think this might depend on where you are looking to track the access from. I.e. if you are trying to measure/watch usage of S3 objects from outside HTTP/HTTPS requests, then Anthony's suggestion of enabling S3 logging and then importing into Splunk (or Redshift) for analysis might work. You can also watch the billing status on requests every day.
If you are trying to gauge usage from within your own applications, there are some AWS SDK CloudWatch metrics:
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/metrics/package-summary.html
and
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/metrics/S3ServiceMetric.html
S3 is a managed service, meaning that you don't need to take action based on system events in order to keep it up and running (as long as you can afford to pay for the service's usage). The spirit of CloudWatch is to help with monitoring services that require you to take action in order to keep them running.
For example, EC2 instances (which you manage yourself) typically need monitoring to alert when they're overloaded or when they're underused or else when they crash; at some point action needs to be taken in order to spin up new instances to scale out, spin down unused instances to scale back in, or reboot instances that have crashed. CloudWatch is meant to help you do the job of managing these resources more effectively.
To enable Request and Data transfer metrics in your bucket you can run the below command. Be aware that these are paid metrics.
aws s3api put-bucket-metrics-configuration \
--bucket YOUR-BUCKET-NAME \
--id EntireBucket \
--metrics-configuration Id=EntireBucket
This tutorial describes how to do it in the AWS Console with a point-and-click interface.
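The same configuration can also be applied from Python with boto3. A small sketch, where the helper names are invented for illustration and the optional prefix filter is an extra that the CLI command above does not use:

```python
def metrics_config_kwargs(bucket, config_id="EntireBucket", prefix=None):
    """Build the arguments for put_bucket_metrics_configuration."""
    config = {"Id": config_id}
    if prefix:
        # Restrict the metrics to keys under this prefix.
        config["Filter"] = {"Prefix": prefix}
    # The top-level Id and the Id inside the configuration have to match.
    return {"Bucket": bucket, "Id": config_id, "MetricsConfiguration": config}


def enable_request_metrics(bucket, prefix=None):
    import boto3  # imported here so the helper above works without the SDK

    s3 = boto3.client("s3")
    s3.put_bucket_metrics_configuration(**metrics_config_kwargs(bucket, prefix=prefix))
```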