How do you route AWS Web Application Firewall (WAF) logs to an S3 bucket? Is this something I can quickly do through the AWS Console? Or would I have to use a Lambda function (invoked by a CloudWatch timer event) to query the WAF logs every n minutes?
UPDATE:
I'm interested in the ACL logs (source IP, URI, matched rule, request headers, action, time, etc.).
UPDATE (05/15/2017)
AWS doesn't provide an easy way to view/parse these logs. You can get a "random sample" via the get-sampled-requests command, which isn't acceptable...
Gets detailed information about a specified number of requests--a
sample--that AWS WAF randomly selects from among the first 5,000
requests that your AWS resource received during a time range that you
choose. You can specify a sample size of up to 500 requests, and you
can specify any time range in the previous three hours.
http://docs.aws.amazon.com/cli/latest/reference/waf/get-sampled-requests.html
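For what it's worth, here is a minimal sketch of pulling that sample programmatically with boto3's classic WAF client; the web ACL and rule IDs are placeholders:

```python
# Sketch: fetch a sample of recent WAF requests via the classic WAF API.
# WEB_ACL_ID and RULE_ID are placeholders for your own resources.
from datetime import datetime, timedelta

import boto3

waf = boto3.client("waf")

end = datetime.utcnow()
start = end - timedelta(hours=1)  # any window within the last three hours

resp = waf.get_sampled_requests(
    WebAclId="WEB_ACL_ID",
    RuleId="RULE_ID",
    TimeWindow={"StartTime": start, "EndTime": end},
    MaxItems=500,  # the hard cap on sample size
)

for item in resp["SampledRequests"]:
    req = item["Request"]
    print(item["Timestamp"], req["ClientIP"], req["URI"], item["Action"])
```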
I'm not the only one experiencing this issue, either:
https://forums.aws.amazon.com/thread.jspa?threadID=220202
I was looking for this functionality today and stumbled across the referenced thread. It was, coincidentally, updated today:
Hello,
Thanks for your input. I have submitted a feature request on your
behalf to export WAF events to S3 for long term analysis.
Best Regards, albertpataws
The lack of this feature strikes me as being almost as odd as the fact that I can't change time zones for graphs.
Related
I want requests-per-unit-time metrics at the route/path level from AWS Lambda.
For example: path = /admin/options, requests per second = 200.
My application serves many routes like this.
I have looked through the documentation, but requests per unit time is not available at the route level.
As already pointed out, you have to do it by analyzing ALB access logs. This process is much simpler now that Amazon Athena can query the ALB's logs directly in S3, as explained in Querying Application Load Balancer Logs.
This means you don't have to download the logs before processing or write any custom processing application. Instead, you can run Amazon Athena queries against the logs.
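As a rough sketch, assuming you have already created the alb_logs table as described in that guide (the database, table, and result-bucket names here are placeholders), you can even drive a per-path request count from boto3:

```python
# Sketch: run an Athena query counting ALB requests per path.
# Assumes an "alb_logs" table exists per "Querying Application Load
# Balancer Logs"; database/table/bucket names are placeholders.
import boto3

athena = boto3.client("athena")

query = """
SELECT request_url, count(*) AS requests
FROM alb_logs
WHERE time BETWEEN '2017-05-15T00:00:00' AND '2017-05-15T01:00:00'
GROUP BY request_url
ORDER BY requests DESC
"""

resp = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])  # poll get_query_execution for completion
```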
I am trying to configure a CloudFront distribution with a Lambda@Edge function attached to the origin request event. The function returns a very basic HTML page (the code is based on this example: Serving Static Content (Generated Response)). Once deployed, the distribution works as expected in locations close to the North Virginia region, but fails in other locations, returning the following error:
503: The Lambda function associated with the CloudFront distribution
was throttled. We can't connect to the server for this app or website
at this time. There might be too much traffic or a configuration
error. Try again later, or contact the app or website owner. If you
provide content to customers through CloudFront, you can find steps to
troubleshoot and help prevent this error by reviewing the CloudFront
documentation.
I already tried looking at the logs, but nothing is logged in CloudWatch when the 503 error is thrown, and the logs from the CloudFront distribution show the LambdaLimitExceeded error.
I have been jumping between different locations using a VPN, and I find it strange that it only works in places close to the us-east-1 region. I am creating all the resources using a federated account; I don't know whether this could be related to IAM permissions.
Another thing to point out: everything works as expected if I reproduce the same scenario using another AWS account and a regular user.
If you're seeing LambdaLimitExceeded, then you need to review the following for your Lambda@Edge function:
The number of function executions exceeded one of the quotas (formerly known as limits) that Lambda sets to throttle executions in an AWS Region (concurrent executions or invocation frequency).
The function exceeded the Lambda function timeout quota.
Remember that Lambda@Edge executes closer to the user; if you try to retrieve resources external to the region, you may time out due to geographical latency. Can you increase the timeout to account for this?
Do you have other Lambdas running in the regions where it executes? If you view the CloudWatch logs in one of the regions closest to the user's edge location, you will see the function's logs there and hopefully be able to identify the root cause (a sketch for locating them follows). If not, add more debugging.
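Since Lambda@Edge writes its logs to the region nearest the edge location that served the request, one way to find where it actually ran is to scan regions for the replicated log groups. A minimal sketch, where the function name is a placeholder for yours:

```python
# Sketch: locate the regional CloudWatch log groups for a Lambda@Edge
# function. Replicated log groups are named
# /aws/lambda/us-east-1.<function-name> in each region where it ran.
# FUNCTION_NAME is a placeholder.
import boto3

FUNCTION_NAME = "my-edge-function"
prefix = f"/aws/lambda/us-east-1.{FUNCTION_NAME}"

ec2 = boto3.client("ec2", region_name="us-east-1")
regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]

for region in regions:
    logs = boto3.client("logs", region_name=region)
    groups = logs.describe_log_groups(logGroupNamePrefix=prefix)["logGroups"]
    if groups:
        print(region, [g["logGroupName"] for g in groups])
```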
I have an AWS Lambda function that is triggered by SNS messages. Many times, it has reached the maximum duration allowed by AWS, and AWS killed it immediately.
I have to either dig into the Lambda logs or the Lambda duration chart to find out about the error.
Is there a better way to report this kind of error?
Yes, there are some third-party tools that help you monitor your environment and provide exactly that: filtering on specific errors and drilling down into what happened (the input event, the outgoing HTTP requests, etc.).
Moreover, you can also configure alerts on specific errors that you will receive via Slack/mail.
Disclosure: I work for Lumigo, a company that does exactly that.
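If you'd rather stay within AWS, a minimal sketch of a native alternative is a CloudWatch alarm on the function's Errors metric (timeouts count as errors); the function name and SNS topic ARN below are placeholders:

```python
# Sketch: alarm when a Lambda function reports any errors (including
# timeouts) and notify an SNS topic, which can fan out to email/Slack.
# Function name and topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="my-function-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],
    Statistic="Sum",
    Period=300,                # evaluate in 5-minute windows
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],
)
```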
We are hosting our services on AWS Elastic Beanstalk managed instances. That is forcing us to move away from file-based logging toward database-based logging.
Is DynamoDB a good choice for replacing file-based logging? If so, what should the primary key be? I thought of using a timestamp, but multiple messages may be logged by the same service within the same timestamp, so that might not be reliable.
Any advice would be appreciated.
Don't use DynamoDB to store logs. You'll be paying for throughput and space needlessly.
Amazon CloudWatch has built-in logging capabilities.
http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/WhatIsCloudWatchLogs.html
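If you want to write to CloudWatch Logs directly from application code rather than through the agent, a minimal boto3 sketch looks like this (the group and stream names are placeholders):

```python
# Sketch: write a log message straight to CloudWatch Logs with boto3.
# Group/stream names are placeholders.
import time

import boto3

logs = boto3.client("logs")
GROUP, STREAM = "my-service", "instance-1"

# Create the group/stream on first run; ignore if they already exist.
try:
    logs.create_log_group(logGroupName=GROUP)
except logs.exceptions.ResourceAlreadyExistsException:
    pass
try:
    logs.create_log_stream(logGroupName=GROUP, logStreamName=STREAM)
except logs.exceptions.ResourceAlreadyExistsException:
    pass

logs.put_log_events(
    logGroupName=GROUP,
    logStreamName=STREAM,
    logEvents=[{"timestamp": int(time.time() * 1000),
                "message": "service started"}],
)
```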
Another alternative is a dedicated logging service such as Loggly which is cloud-based and can receive logs in many common formats, plus they have an API to send custom logs. In the web-based console, you can search and filter through the logs.
As an alternative, why don't you use CloudWatch? I ended up writing a whole app to consolidate logs across EC2 instances in a Beanstalk app; then last year AWS opened up CloudWatch Logs as a service, so I junked my stuff. You tell CloudWatch where your logs are on the instance, give it a log group and stream name, and all your logs are consolidated in one spot, in CloudWatch. You can also run alarms off them using the standard AWS setup. It's pretty slick and easy: you don't have to write a front end to do lookups, it's already there.
I don't know what you're using for logging. We are a Node.js shop and used Winston; there is a nice npm module called winston-cloudwatch that works with Winston to ship logs automatically.
If I create a function to fetch web pages, will it execute on a different IP per execution so that my scraping requests don't get blocked?
I would use this AWS pipeline:
At the source you have an EC2 instance running Jaunt, which feeds the URLs or HTML pages into a Kinesis stream. A Lambda function does your HTML parsing and, via Firehose, stores everything in S3 or Redshift.
Jaunt can run through a standard web proxy service with a rotating IP.
Yes, Lambda by default executes from varying IPs. You can trigger it using something like EventBridge so you can run the script on a schedule, every hour or similar (a sketch follows). Others may recommend API Gateway; however, it is highly insecure to expose an API endpoint that anyone can trigger, so you would have to write additional logic to protect it, either with hard-coded headers or, say, OAuth.
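A rough sketch of that EventBridge schedule with boto3; the rule name, schedule, and function ARN are placeholders, and the Lambda also needs a resource-based permission (lambda add_permission) allowing events.amazonaws.com to invoke it:

```python
# Sketch: invoke a Lambda function every hour via an EventBridge rule.
# Rule name and function ARN are placeholders.
import boto3

events = boto3.client("events")

events.put_rule(
    Name="scrape-hourly",
    ScheduleExpression="rate(1 hour)",
    State="ENABLED",
)

events.put_targets(
    Rule="scrape-hourly",
    Targets=[{
        "Id": "scraper",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:scraper",
    }],
)
```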
AWS Lambda doesn't have a fixed source IP, as mentioned here.
However, I would guess the IP changes only when the execution environment is recycled after cooling down, not during the same invocation.
Your internal private IP address isn't what the web server sees; it sees a public AWS IP address. Just make sure you set the headers and signal to the web server that your application is a bot. See this link for some best practices on creating a web scraper:
https://www.blog.datahut.co/post/web-scraping-best-practices-tips
Just be respectful of how much data you pull and how often; that's the main thing, in my opinion.
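For instance, a minimal sketch of a handler that fetches a page while identifying itself honestly via the User-Agent header; the URL, bot name, and contact details are placeholders:

```python
# Sketch: a Lambda handler that fetches a page with a descriptive,
# honest User-Agent. URL and contact details are placeholders.
import urllib.request


def handler(event, context):
    req = urllib.request.Request(
        "https://example.com/page",
        headers={"User-Agent": "MyScraperBot/1.0 (contact: me@example.com)"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return {"status": resp.status, "length": len(body)}
```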
Lambda is triggered when a file is placed in S3, or when data is added to Kinesis or DynamoDB. That is often backwards from what a web scraper needs, though certainly something like S3 could serve as a queue/job runner.
Scraping on different IPs? Certainly Lambda is deployed on many machines, though that doesn't actually help you, since you can't control the machines or their IPs.