We are trying to find out why there is high latency from Google Cloud CDN.
Our site is behind Google's HTTP(S) load balancer with Cloud CDN turned on.
For example, by inspecting a sample GET request for a jpg file (43 KB), we can see from the http_load_balancer logs that around 30% of such requests have httpRequest.latency > 1 second, and many take much longer, from several seconds up to hundreds of seconds.
This is just from a 24-hour log sample (around 6K of the same request).
The httpRequest.cacheLookup and httpRequest.cacheHit fields for all of those requests are true.
Also, jsonpayload_type_loadbalancerlogentry.statusdetails is response_from_cache, and the jsonpayload_type_loadbalancerlogentry.cacheid values show the correct region.
When doing the same GET request manually in the browser, we get the expected result with a TTFB of around 15-20 ms.
Any idea where to look for a clue?
The httpRequest.latency field measures the entire download duration, and is directly impacted by slow clients - e.g. a mobile device on a congested network or throttled data plan.
You can check this by looking at the frontend_tcp_rtt metric (which is the RTT between the client and Cloud CDN) in Cloud Monitoring, as well as the average, median and 90th percentile total_latencies, where the slow clients will show up as outliers: https://cloud.google.com/load-balancing/docs/https/https-logging-monitoring#monitoring_metrics_fors
You may find that slow clients are from a specific group of client_country values.
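For example, here is a minimal sketch of pulling frontend_tcp_rtt with the Cloud Monitoring Python client so you can eyeball per-country RTT. The project ID, the 24-hour window and the hourly 95th-percentile alignment are placeholder choices, and the client_country label and millisecond units are assumptions to confirm against the metric's documentation.

```python
# Minimal sketch: hourly p95 frontend RTT per time series over the last 24h.
# "my-project" is a placeholder; adjust the aggregation to taste.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 24 * 3600}}
)
# Align each series to hourly 95th percentiles so slow-client outliers stand out.
aggregation = monitoring_v3.Aggregation(
    {
        "alignment_period": {"seconds": 3600},
        "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_PERCENTILE_95,
    }
)

results = client.list_time_series(
    request={
        "name": "projects/my-project",  # placeholder project ID
        "filter": 'metric.type="loadbalancing.googleapis.com/https/frontend_tcp_rtt"',
        "interval": interval,
        "aggregation": aggregation,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    labels = dict(series.metric.labels)
    country = labels.get("client_country", "unknown")  # label name is an assumption
    latest = series.points[0].value.double_value       # most recent aligned point
    print(f"{country}: p95 frontend RTT ~ {latest:.1f}")
```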
Latency can be introduced:
Between the original client and the load balancer. You can see the latency of that segment with the metric https/frontend_tcp_rtt.
Between the load balancer and the backend instance. This segment can be reviewed with the metric https/backend_latencies (which also includes the application processing time on your backend).
By the software running on the instance itself. To investigate this, I would check the access/error logs of the backend software and the resource utilization of the VM instance.
Further information about these metrics is available in the GCP load balancer metrics documentation.
httpRequest.latency log field description:
"The request processing latency on the server, from the time the request was received until the response was sent."
Related
In Traditional Performance Automation Testing:
There is an application server where all requests are received. In this case, we know the server configuration (CPU, RAM, etc.), so we can perform load testing (of, let's say, 5K concurrent users) using JMeter or any other load testing tool and check server performance.
In the case of AWS serverless, there is no server, so to speak; all servers are managed by AWS. The code resides only in Lambdas, and AWS decides at runtime how to balance the load when volumes are high.
So now we have a web app hosted on AWS using the Serverless Framework, and we want to measure its performance for 5K concurrent users. With no backend server information, the only option here is to rely on frontend or browser-based response times. Should this suffice?
Is there a better way to check performance of serverless applications?
I haven't worked with AWS, but in my opinion performance testing of serverless applications should be done in much the same way as for traditional applications on your own physical servers.
Despite the name, serverless applications still run on physical servers (they are just managed by AWS).
So I would approach this task with the following steps:
send backend metrics (response time, request count and so on) to a metrics system (Graphite, Prometheus, etc.) - see the sketch below
build a dashboard in that metrics system (ideally you should see request count and response time per instance, as well as the number of instances)
take a load testing tool (JMeter, Gatling or whatever) and run your load test scenario
During and after the test you will see how many requests your app processes, its response times, and how the instance count changes depending on the number of concurrent requests.
This way you stay agnostic of the AWS management tools (though AWS probably has its own management dashboards, and it would be good to compare results with them afterwards).
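As a rough illustration of the first step, assuming a Python backend and the prometheus_client library (the metric names, the endpoint label and port 8000 are placeholders, not part of the original answer):

```python
# Sketch of exposing backend metrics for Prometheus to scrape.
# Metric names, the endpoint label and port 8000 are placeholder choices.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled", ["endpoint"])
LATENCY = Histogram("app_request_seconds", "Request processing time", ["endpoint"])

def handle_request(endpoint: str) -> None:
    """Stand-in for real request handling, instrumented with both metrics."""
    REQUESTS.labels(endpoint=endpoint).inc()
    with LATENCY.labels(endpoint=endpoint).time():
        time.sleep(0.05)  # simulate work

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://<host>:8000/metrics
    while True:
        handle_request("/calculate")
```

A dashboard built on these series (request rate, latency quantiles, instance count) then shows how the application behaves while the JMeter or Gatling scenario runs.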
"Loadtesting" a serverless application is not the same as that of a traditional application. The reason for this is that when you write code that will run on a machine with a fixed amount CPU and RAM, many HTTP requests will be processed on that same machine at the same time. This means you can suffer from the noisy-neighbour effect where one request is consuming so much CPU and RAM that it is negatively affecting other requests. This could be for many reasons including sub-optimal code that is consuming a lot of resources. An attempted solution to this issue is to enable auto-scaling (automatically spin up additional servers if the load on the current ones reaches some threshold) and load balancing to spread requests across multiple servers.
This is why you need to load test a traditional application; you need to ensure that the code you wrote is performant enough to handle the influx of X number of visitors and that the underlying scaling systems can absorb the load as needed. It's also why, when you are expecting a sudden burst of traffic, you will pre-emptively spin up additional servers to help manage all that load ahead of time. The problem is you cannot always predict that; a famous person mentions your service on Facebook and suddenly your systems need to respond in seconds and usually can't.
In serverless applications, a lot of the issues around noisy neighbours in compute are removed for a number of reasons:
A lot of what you usually did in code is now done in a managed service; most web frameworks route HTTP requests in code, whereas in AWS, API Gateway takes that over.
Lambda functions are isolated and each instance of a Lambda function has a certain quantity of memory and CPU allocated to it. It has little to no effect on other instances of Lambda functions executing at the same time (this also means if a developer makes a mistake and writes sub-optimal code, it won't bring down a server; serverless compute is far more forgiving to mistakes).
All of this is not to say you shouldn't do your homework to make sure your serverless application can handle the load. You just do it differently. Instead of trying to push fake users at your application to see if it can handle them, consult the documentation for the various services you use. AWS, for example, publishes the limits of these services and guarantees those numbers as part of the service. For example, API Gateway has a limit of 10,000 requests per second. Do you expect traffic greater than 10,000 requests per second? If not, you're good! If you do, contact AWS and they may be able to increase that limit for you. Similar limits apply to AWS Lambda, DynamoDB, S3 and all other services.
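If it helps, those published limits can also be read programmatically. Here is a hedged sketch using the AWS Service Quotas API via boto3; the region is a placeholder, and the quota names and values come back from the API itself rather than being hard-coded here.

```python
# Sketch: list the published quotas for API Gateway and Lambda with the
# Service Quotas API. The region is a placeholder.
import boto3

quotas = boto3.client("service-quotas", region_name="us-east-1")

for service_code in ("apigateway", "lambda"):
    paginator = quotas.get_paginator("list_service_quotas")
    for page in paginator.paginate(ServiceCode=service_code):
        for quota in page["Quotas"]:
            print(f'{service_code}: {quota["QuotaName"]} = {quota["Value"]}')
```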
As you have mentioned, since a serverless architecture (FaaS) doesn't have a physical or virtual server, we cannot monitor the traditional metrics. Instead, we can capture the following:
Auto scalability:
Since the main advantage of this platform is scalability, we need to verify the auto scaling by increasing the load.
More requests, lower response time:
When hit with a huge number of requests, traditional servers show increasing response times, whereas this approach should keep them lower. We need to monitor the response time.
Lambda Insights in CloudWatch:
There is an option to monitor the performance of multiple Lambda functions - throttles, invocations and errors, memory usage, CPU usage and network usage. We can configure the Lambdas we need and monitor them in the 'Performance monitoring' column.
Container CPU and memory usage:
In CloudWatch, we can create a dashboard with widgets to capture the CPU and memory usage of the containers, the task count and the load balancer response time (if any).
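As a small sketch of pulling the Lambda metrics mentioned above out of CloudWatch with boto3 (the function name and region are placeholders):

```python
# Sketch: fetch the last hour of Invocations, Errors, Throttles and Duration
# for one Lambda function. "my-function" and the region are placeholders.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

for metric, stat in [
    ("Invocations", "Sum"),
    ("Errors", "Sum"),
    ("Throttles", "Sum"),
    ("Duration", "Average"),
]:
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName=metric,
        Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=[stat],
    )
    points = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
    print(metric, [round(p[stat], 2) for p in points])
```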
I am developing a REST service using Spring Boot. The REST service takes an input file, does some operations on it, and returns the processed file.
I know that in Spring Boot we have the configuration "server.tomcat.max-threads", which can be a maximum of 400.
My REST application will be deployed on a cluster.
I want to understand how I should handle more than 400 requests in the case where my cluster has only one node.
Basically, I want to understand the standard way of serving more requests than "max-thread-per-node X N-nodes" in a cloud solution.
Welcome to AWS and cloud computing in general. What you have described is system elasticity, which is made very easy and accessible in this ecosystem.
Have a look at AWS Auto Scaling. It is a service that will monitor your application and automatically scale out to meet increasing demand and scale in to save costs when demand is low.
You can set triggers for this. For example, if you know that your application load is a function of memory usage, you can add nodes to the cluster whenever memory usage hits 80%. Read more about the various scaling policies here.
One such scaling metric is ALBRequestCountPerTarget. It scales the number of nodes in the cluster to maintain the average request count per node (target) in the cluster. With some buffer, you can set this to 300 and achieve what you are looking for. Read more about this in the docs.
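As a sketch of what that target-tracking policy looks like with boto3 (the Auto Scaling group name and the ALB/target group ResourceLabel are placeholders):

```python
# Sketch: target-tracking policy that keeps the average ALB request count
# per instance around 300. Group name and ResourceLabel are placeholders.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-asg",
    PolicyName="requests-per-target-300",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            # Format: app/<alb-name>/<alb-id>/targetgroup/<tg-name>/<tg-id>
            "ResourceLabel": "app/my-alb/1234567890abcdef/targetgroup/my-tg/0987654321fedcba",
        },
        "TargetValue": 300.0,
    },
)
```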
We were trying to implement an elastically scaling application on AWS. But due to the complexity of the application's processing, we have an issue with the current routing algorithm.
When the application receives a request (a request for a complex calculation), we immediately send a token back to the user and start calculating, so the user can return with the token any day and access the calculated results. When there are more calculation requests, they wait in a queue and get executed two at a time, as one calculation takes a considerable amount of CPU. As you can see, in this specific scenario:
The application's active connection count is very low, as we respond to the user with the token as soon as we get the request.
CPU usage looks normal, as we do the calculations two at a time.
Considering these facts, with load balancer routing we are facing a problem: elastic instances terminate before the full queue has finished calculating, and the queue grows really long because the load balancer has no idea about the queued requests.
To solve this, we either need to do the routing manually, or find a way to let the load balancer know the queued request count (maybe with an API call). If you have an idea of how to do this, please help me. (I'm new to AWS.)
Any idea is welcome.
Based on the comments.
An issue observed with the original approach was premature termination of instances, since their scale-in/scale-out was based on CPU utilization only.
A proposed solution to rectify the issue bases the scaling activities on the length of the job queue. An example of such a solution is shown in the following AWS link:
Using Target Tracking with the Right Metric
In the example, the scaling is based on the following metric:
The solution is to use a backlog per instance metric with the target value being the acceptable backlog per instance to maintain.
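A rough sketch of publishing such a metric, assuming the job queue is SQS and the workers run in an Auto Scaling group (the queue URL, group name, namespace and metric name are all placeholders, not part of the linked example):

```python
# Sketch: publish "backlog per instance" so a target-tracking policy can
# scale on it. Queue URL, ASG name, namespace and metric name are placeholders.
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
autoscaling = boto3.client("autoscaling", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/calc-jobs"
backlog = int(
    sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["ApproximateNumberOfMessages"]
    )["Attributes"]["ApproximateNumberOfMessages"]
)

group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=["calc-workers"]
)["AutoScalingGroups"][0]
instance_count = max(len(group["Instances"]), 1)

cloudwatch.put_metric_data(
    Namespace="Custom/Scaling",
    MetricData=[{
        "MetricName": "BacklogPerInstance",
        "Value": backlog / instance_count,
        "Unit": "Count",
    }],
)
```

Run on a schedule, this gives the Auto Scaling group a metric that reflects the queued work rather than CPU, so instances are not scaled in while the queue is still full.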
I have a Cloud Function that makes Vision API requests, specifically Document Text Detection requests. My peak request rate is usually around ~120-150 requests per minute on an average day.
I've suddenly been getting resource quota exceeded errors for Vision API requests at a request rate of 2,500 requests per minute. Some things to note:
I've had no code changes in 3 months
I deleted and redeployed the Cloud Function making these requests to stop any problematic image that was causing a runaway loop
Neither my code calling the API nor the Cloud Functions themselves were being retried, so there really wasn't a way for my request rate to increase exponentially overnight with no changes introduced.
The service account making the Vision calls is making the normal number of requests and is only used by the Cloud Function, i.e. it is not being used by someone's local script.
I've since turned on retries to mitigate this issue, since it'll "work" with exponential backoff, but this is expensive to do, especially with the Vision API. Is there anything I can do to find out the root cause of this issue?
To identify the specific quota being exceeded, the Stackdriver API helps by exposing Monitoring quota metrics, as explained here.
GCP lets you inspect which quota is being exceeded in greater depth using the Stackdriver API and UI, with quota metrics appearing in the Metrics Explorer.
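For instance, here is a hedged sketch of listing quota/exceeded samples with the Cloud Monitoring Python client; the project ID is a placeholder, and the exact metric type and resource labels for the Vision API should be confirmed in Metrics Explorer before relying on this filter.

```python
# Sketch: look for quota/exceeded samples over the last 24 hours.
# "my-project" is a placeholder; confirm the metric type in Metrics Explorer
# and narrow the filter to the Vision API's consumer_quota resource there.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 24 * 3600}}
)

results = client.list_time_series(
    request={
        "name": "projects/my-project",
        "filter": 'metric.type="serviceruntime.googleapis.com/quota/exceeded"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    print(dict(series.resource.labels), f"{len(series.points)} samples")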
I'd like to use AWS access logs for processing website impressions with an existing batch-oriented ETL pipeline that grabs the last finished hour of impressions and does a lot of further transformations on them.
The problem with the access logs, though, is that:
Note, however, that some or all log file entries for a time period can sometimes be delayed by up to 24 hours
So I would never know when all the logs for a particular hour are complete.
Unfortunately, I cannot use any streaming solution; I need to use the existing pipeline that grabs hourly batches of data.
So my question is: is there any way to be notified that all logs have been delivered to S3 for a particular hour?
You have asked about S3, but your pull-quote is from the documentation for CloudFront.
Either way, though, it doesn't matter. This is just a caveat, saying that log delivery might sometimes be delayed, and that if it's delayed, this is not a bug -- it's a side effect of a massive, distributed system.
Both services operate at an incomprehensibly large scale, so periodically things go wrong with small parts of the system, and eventually some stranded or backlogged logs may be found and delivered. Rarely, they can even arrive days or weeks later.
There is no event that signifies that all of the logs are finished, because there's no single point within such a system that is aware of this.
But here is the takeaway concept: the majority of logs will arrive within minutes, but this isn't guaranteed. Once you start running traffic and observing how the logging works, you'll see what I am referring to. Delayed logs are the exception, and you should be able to develop a sense, fairly rapidly, of how long you need to wait before processing the logs for a given wall-clock hour. As long as you track what you processed, you can audit this against the bucket later to ensure that your process is capturing a sufficient proportion of the logs.
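A simple sketch of that audit, assuming the logs land in S3 with the date and hour embedded in the object key (the bucket name, key layout and the processed-key set are placeholders):

```python
# Sketch: after a grace period, list the objects for one wall-clock hour and
# compare against the keys the hourly batch already ingested.
import boto3

s3 = boto3.client("s3")
bucket = "my-log-bucket"
hour_prefix = "cf-logs/EXAMPLEDISTID.2024-01-01-13"  # adjust to your key naming

already_processed = set()  # keys your hourly batch recorded as ingested

late_arrivals = []
for page in s3.get_paginator("list_objects_v2").paginate(
    Bucket=bucket, Prefix=hour_prefix
):
    for obj in page.get("Contents", []):
        if obj["Key"] not in already_processed:
            late_arrivals.append(obj["Key"])

print(f"{len(late_arrivals)} log objects not seen by the hourly batch")
```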
Since the days before CloudFront had SNI support, I have been routing traffic to some of my S3 buckets using HAProxy in EC2 in the same region as the bucket. This gave me the ability to use custom hostnames and SNI, but also gave me real-time logging of all the bucket traffic using HAProxy, which can stream copies of its logs to a log collector for real-time analysis over UDP, as well as writing them to syslog. There is no measurable difference in performance with this solution, and HAProxy runs extremely well on t2-class servers, so it is cost-effective. You do, of course, introduce more costs and more to maintain, but you can even deploy HAProxy between CloudFront and S3 as long as you are not using an origin access identity. One of my larger services does exactly this, a holdover from the days before Lambda@Edge.