I'd like to use AWS AccessLogs for processing website impressions using an existing batch oriented ETL pipeline that grabs last finished hour of impressions and do a lot of further transformations with them.
The problem with AccessLog though is that :
Note, however, that some or all log file entries for a time period can
sometimes be delayed by up to 24 hours
So I would never know when all the logs for a particular hour are complete.
I unfortunately cannot use any streaming solution, I need to use existing pipeline that grabs hourly batches of data.
So my question is, is there any way to be notified that all logs has been delivered to s3 for a particular hour?
You have asked about S3, but your pull-quote is from the documentation for CloudFront.
Either way, though, it doesn't matter. This is just a caveat, saying that log delivery might sometimes be delayed, and that if it's delayed, this is not a bug -- it's a side effect of a massive, distributed system.
Both services operate an an incomprehensibly large scale, so periodically, things go wrong with small parts of the system, and eventually some stranded logs or backlogged logs may be found and delivered. Rarely, they can even arrive days or weeks later.
There is no event that signifies that all of the logs are finished, because there's no single point within such a system that is aware of this.
But here is the takeaway concept: the majority of logs will arrive within minutes, but this isn't guaranteed. Once you start running traffic and observing how the logging works, you'll see what I am referring to. Delayed logs are the exception, and you should be able to develop a sense, fairly rapidly, of how long you need to wait before processing the logs for a given wall clock hour. As long as you track what you processed, you can audit this against the bucket, later, to ensure that yout process is capturing a sufficient proportion of the logs.
Since the days before CloudFront had SNI support, I have been routing traffic to some of my S3 buckets using HAProxy in EC2 in the same region as the bucket. This gave me the ability to use custom hostnames, and SNI, but also gave me real-time logging of all the bucket traffic using HAProxy, which can stream copies of its logs to a log collector for real-time analysis over UDP, as well as writing it to syslog. There is no measurable difference in performance with this solution, and HAProxy runs extremely well on t2-class servers, so it is cost-effective. You do, of course, introduce more costs and more to maintain, but you can even deploy HAProxy between CloudFront and S3 as long as you are not using an origin access identity. One of my larger services does exactly this, a holdover from the days before Lambda#Edge.
Related
While I have worked with AWS for a bit, I'm stuck on how to correctly approach the following use case.
We want to design an uptime monitor for up to 10K websites.
The monitor should run from multiple AWS regions and ping websites if they are available and measure the response time. With a lambda function, I can ping the site, pass the result to a sqs queue and process it. So far, so good.
However, I want to run this function every minute. I also want to have the ability to add and delete monitors. So if I don't want to monitor website "A" from region "us-west-1" I would like to do that. Or the other way round, add a website to a region.
Ideally, all this would run serverless and deployable to custom regions with cloud formation.
What services should I go with?
I have been thinking about Eventbridge, where I wanted to make custom events for every website in every region and then send the result over SNS to a central processing Lambda. But I'm not sure this is the way to go.
Alternatively, I wanted to build a scheduler lambda that fetches the websites it has to schedule from a DB and then invokes the fetcher lambda. But I was not sure about the delay since I want to have the functions triggered every minute. The architecture should monitor 10K websites and even more if possible.
Feel free to give me any advise you have :)
Kind regards.
In my opinion Lambda is not the correct solution for this problem. Your costs will be very high and it may not scale to what you want to ultimately do.
A c5.9xlarge EC2 costs about USD $1.53/hour and has a 10gbit network. With 36 CPU's a threaded program could take care of a large percentage - maybe all 10k - of your load. It could still be run in multiple regions on demand and push to an SQS queue. That's around $1100/month/region without pre-purchasing EC2 time.
A Lambda, running 10000 times / minute and running 5 seconds every time and taking only 128MB would be around USD $4600/month/region.
Coupled with the management interface you're alluding to the EC2 could handle pretty much everything you're wanting to do. Of course, you'd want to scale and likely have at least two EC2's for failover but with 2 of them you're still less than half the cost of the Lambda. As you scale now to 100,000 web sites it's a matter of adding machines.
There are a ton of other choices but understand that serverless does not mean cost efficient in all use cases.
In Traditional Performance Automation Testing:
There is an application server where all the requests hits are received. So in this case; we have server configuration (CPU, RAM etc) with us to perform load testing (of lets say 5k concurrent users) using Jmeter or any load test tool and check server performance.
In case of AWS Serverless; there is no server - so to speak - all servers are managed by AWS. So code only resides in lambdas and it is decided by AWS on run time to perform load balancing in case there are high volumes on servers.
So now; we have a web app hosted on AWS using serverless framework and we want to measure performance of the same for 5K concurrent users. With no server backend information; only option here is to rely on the frontend or browser based response times - should this suffice?
Is there a better way to check performance of serverless applications?
I didn't work with AWS, but in my opinion performance testing in case serverless applications should perform pretty the same way as in traditional way with own physical servers.
Despite the name serverless, physical servers are still used (though are managed by aws).
So I will approach to this task with next steps:
send backend metrics (response time, count requests and so on) to some metrics system (graphite, prometheus, etc)
build dashboard in this metric system (ideally you should see requests count and response time per every instance and count of instances)
take a load testing tool (jmeter, gatling or whatever) and start your load test scenario
During the test and after the test you will see how many requests your app processing, it response times and how change count of instances depending of concurrent requests.
So in such case you will agnostic from aws management tools (but probably aws have some management dashboard and afterwards it will good to compare their results).
"Loadtesting" a serverless application is not the same as that of a traditional application. The reason for this is that when you write code that will run on a machine with a fixed amount CPU and RAM, many HTTP requests will be processed on that same machine at the same time. This means you can suffer from the noisy-neighbour effect where one request is consuming so much CPU and RAM that it is negatively affecting other requests. This could be for many reasons including sub-optimal code that is consuming a lot of resources. An attempted solution to this issue is to enable auto-scaling (automatically spin up additional servers if the load on the current ones reaches some threshold) and load balancing to spread requests across multiple servers.
This is why you need to load test a traditional application; you need to ensure that the code you wrote is performant enough to handle the influx of X number of visitors and that the underlying scaling systems can absorb the load as needed. It's also why, when you are expecting a sudden burst of traffic, you will pre-emptively spin up additional servers to help manage all that load ahead of time. The problem is you cannot always predict that; a famous person mentions your service on Facebook and suddenly your systems need to respond in seconds and usually can't.
In serverless applications, a lot of the issues around noisy neighbours in compute are removed for a number of reasons:
A lot of what you usually did in code is now done in a managed service; most web frameworks will route HTTP requests in code however API Gateway in AWS takes that over.
Lambda functions are isolated and each instance of a Lambda function has a certain quantity of memory and CPU allocated to it. It has little to no effect on other instances of Lambda functions executing at the same time (this also means if a developer makes a mistake and writes sub-optimal code, it won't bring down a server; serverless compute is far more forgiving to mistakes).
All of this is not to say its not impossible to do your homework to make sure your serverless application can handle the load. You just do it differently. Instead of trying to push fake users at your application to see if it can handle it, consult the documentation for the various services you use. AWS for example publishes the limits to these services and guarantees those numbers as a part of the service. For example, API Gateway has a limit of 10 000 requests per second. Do you expect traffic greater than 10 000 per second? If not, your good! If you do, contact AWS and they may be able to increase that limit for you. Similar limits apply to AWS Lambda, DynamoDB, S3 and all other services.
As you have mentioned, the serverless architecture (FAAS) don't have a physical or virtual server we cannot monitor the traditional metrics. Instead we can capture the below:
Auto Scalability:
Since the main advantage of this platform is Scalability, we need to check the auto scalability by increasing the load.
More requests, less response time:
When hitting huge amount of requests, traditional servers will increase the response time where as this approach will make it lesser. We need to monitor the response time.
Lambda insights in Cloudwatch:
There is an option to monitor the performance of multiple Lambda functions - Throttles, Invocations & Errors, Memory usage, CPU usage and network usage. We can configure the Lambdas we need and monitor in the 'Performance monitoring' column.
Container CPU and Memory usage:
In cloudwatch, we can create a dashboard with widgets to capture the CPU and memory usage of the containers, tasks count and LB response time (if any).
I'm hosting a static website in Amazon S3 with CloudFront. Is there a way to set a limit for how many reads (for example per month) will be allowed for my Amazon S3 bucket in order to make sure I don't go above my allocated budget?
If you are concerned about going over a budget, I would recommend Creating a Billing Alarm to Monitor Your Estimated AWS Charges.
AWS is designed for large-scale organizations that care more about providing a reliable service to customers than staying within a particular budget. For example, if their allocated budget was fully consumed, they would not want to stop providing services to their customers. They might, however, want to tweak their infrastructure to reduce costs in future, such as changing the Price Class for a CloudFront Distribution or using AWS WAF to prevent bots from consuming too much traffic.
Your static website will be rather low-cost. The biggest factor will likely be Data Transfer rather than charges for Requests. Changing the Price Class should assist with this. However, the only true way to stop accumulating Data Transfer charges is to stop serving content.
You could activate CloudTrail data read events for the bucket, create a CloudWatch Event Rule to trigger an AWS Lambda Function that increments the number of reads per object in an Amazon DynamoDB table and restrict access to the objects once a certain number of reads has been reached.
What you're asking for is a very typical question in AWS. Unfortunately with near infinite scale, comes near infinite spend.
While you can put a WAF, that is actually meant for security rather than scale restrictions. From a cost-perspective, I'd be more worried about the bandwidth charges than I would be able S3 requests cost.
Plus once you put things like Cloudfront or Lambda, it gets hard to limit all this down.
The best way to limit, is to put Billing Alerts on your account -- and you can tier them, so you get a $10, $20, $100 alerts, up until the point you're uncomfortable with. And then either manually disable the website -- or setup a lambda function to disable it for you.
I'm creating a simple web app that needs to be deployed to multiple regions in AWS. The application requires some dynamic configuration which is managed by a separate service. When the configuration is changed through this service, I need those changes to propagate to all web app instances across all regions.
I considered using cross-region replication with DynamoDB to do this, but I do not want to incur the added cost of running DynamoDB in every region, and the replication console. Then the thought occurred to me of using S3 which is inherently cross-region.
Basically, the configuration service would write all configurations to S3 as static JSON files. Each web app instance will periodically check S3 to see if the any of the config files have changed since the last check, and download the new config if necessary. The configuration changes are not time-sensitive, so polling for changes every 5/10 mins should suffice.
Have any of you used a similar approach to manage app configurations before? Do you think this is a smart solution, or do you have any better recommendations?
The right tool for this configuration depends on the size of the configuration and the granularity you need it.
You can use both DynamoDB and S3 from a single region to serve your application in all regions. You can read a configuration file in S3 from all the regions, and you can read the configuration records from a single DynamoDB table from all the regions. There is some latency due to the distance around the globe, but for reading configuration it shouldn't be much of an issue.
If you need the whole set of configuration every time that you are loading the configuration, it might make more sense to use S3. But if you need to read small parts of a large configuration, by different parts of your application and in different times and schedule, it makes more sense to store it in DynamoDB.
In both options, the cost of the configuration is tiny, as the cost of a text file in S3 and a few gets to that file, should be almost free. The same low cost is expected in DynamoDB as you have probably only a few KB of data and the number of reads per second is very low (5 Read capacity per second is more than enough). Even if you decide to replicate the data to all regions it will still be almost free.
I have an application I wrote that works in exactly the manner you suggest, and it works terrific. As it was pointed out, S3 is not 'inherently cross-region', but it is inherently durable across multiple availability zones, and that combined with cross region replication should be more than sufficient.
In my case, my application is also not time-sensitive to config changes, but none-the-less besides having the app poll on a regular basis (in my case 1 once per hour or after every long-running job), I also have each application subscribed to SNS endpoints so that when the config file changes on S3, an SNS event is raised and the applications are notified that a change occurred - so in some cases the applications get the config changes right away, but if for whatever reason they are unable to process the SNS event immediately, they will 'catch up' at the top of every hour, when the server reboots and/or in the worst case by polling S3 for changes every 60 minutes.
In the DynamoDB documentation and in many places around the internet I've seen that single digit ms response times are typical, but I cannot seem to achieve that even with the simplest setup. I have configured a t2.micro ec2 instance and a DynamoDB table, both in us-west-2, and when running the command below from the aws cli on the ec2 instance I get responses averaging about 250 ms. The same command run from my local machine (Denver) averages about 700 ms.
aws dynamodb get-item --table-name my-table --key file://key.json
When looking at the CloudWatch metrics in the AWS console it says the average get latency is 12 ms though. If anyone could tell me what I'm doing wrong or point me in the direction of information where I can solve this on my own I would really appreciate it. Thanks in advance.
The response times you are seeing are largely do to the cold start times of the aws cli. When running your get-item command the cli has to get loaded into memory, fetch temporary credentials (if using an ec2 iam role when running on your t2.micro instance), and establish a secure connection to the DynamoDB service. After all that is completed then it executes the get-item request and finally prints the results to stdout. Your command is also introducing a need to read the key.json file off the filesystem, which adds additional overhead.
My experience running on a t2.micro instance is the aws cli has around 200ms of overhead when it starts, which seems inline with what you are seeing.
This will not be an issue with long running programs, as they only pay a similar overhead price at start time. I run a number of web services on t2.micro instances which work with DynamoDB and the DynamoDB response times are consistently sub 20ms.
There are a lot of factors that go into the latency you will see when making a REST API call. DynamoDB can provide latencies in the single digit milliseconds but there are some caveats and things you can do to minimize the latency.
The first thing to consider is distance and speed of light. Expect to get the best latency when accessing DynamoDB when you are using an EC2 instance located in the same region. It is normal to see higher latencies when accessing DynamoDB from your laptop or another data center. Note that each region also has multiple data centers.
There are also performance costs from the client side based on the hardware, network connection, and programming language that you are using. When you are talking millisecond latencies the processing time on your machine can make a difference.
Another likely source of the latency will be the TLS handshake. Establishing an encrypted connection requires multiple round trips and computation on both sides to get the encrypted channel established. However, as long as you are using a Keep Alive for the connection you will only pay this overheard for the first query. Successive queries will be substantially faster since they do not incur this initial penalty. Unfortunately the AWS CLI isn't going to keep the connection alive between requests, but the AWS SDKs for most languages will manage this for you automatically.
Another important consideration is that the latency that DynamoDB reports in the web console is the average. While DynamoDB does provide reliable average low double digit latency, the maximum latency will regularly be in the hundreds of milliseconds or even higher. This is visible by viewing the maximum latency in CloudWatch.
They recently announced DAX (Preview).
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache for DynamoDB that delivers up to a 10x performance improvement – from milliseconds to microseconds – even at millions of requests per second. For more information, see In-Memory Acceleration with DAX (Preview).