Why does the "hatch rate" matter when performance testing? - web-services

I'm using Locust for performance testing. It has two parameters: the number of users and the rate at which the users are generated. But why are the users not simply generated all at once? Why does it make a difference?

Looking at the Locust configuration options, I think the option you are after is spawn-rate (the parameter was called hatch rate in older Locust versions).
Coming back to your question: in the performance-testing world the more common term is ramp-up.
The idea is to increase the load gradually, as this way you will be able to correlate other performance metrics like response time, throughput, etc. with the increasing load.
If you release 1000 users at once you will get a limited view: you will only be able to answer whether your system supports 1000 users or not. You won't be able to tell what the maximum number is, where the saturation point lies, etc.
When you increase the load gradually you can make statements like:
Up to 250 users the system behaves normally: response time stays the same and throughput increases along with the load
After 250 users response time starts growing
After 400 users response time starts exceeding acceptable thresholds
After 600 users errors start occurring
etc.
Also if you decrease the load gradually you can tell whether the system gets back to normal when the load decreases.
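As a rough sketch (my own helper name, not part of Locust's API), a gradual ramp-up just means the active user count grows linearly with the spawn rate until it reaches the target:

```python
def active_users(elapsed_seconds, total_users, spawn_rate):
    """Active users `elapsed_seconds` into a ramp-up of `spawn_rate` users/second."""
    return min(total_users, int(elapsed_seconds * spawn_rate))
```

With 1000 users and a spawn rate of 10 users/second, full load is only reached after 100 seconds, which gives you response-time and throughput readings at every intermediate load level along the way.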

Related

AWS Elasticsearch indexing memory usage issue

The problem: very frequent "403 Request throttled due to too many requests" errors during data indexing, which appears to be a memory usage issue.
The infrastructure:
Elasticsearch version: 7.8
t3.small.elasticsearch instance (2 vCPU, 2 GB memory)
Default settings
Single domain, 1 node, 1 shard per index, no replicas
There are 3 indices with searchable data. Two of them have roughly 1 million documents (500-600 MB) each and one has 25k (~20 MB). Indexing is not entirely simple (it has history tracking), so I've been testing refresh with true and wait_for values, or calling it separately when needed. The process uses search and bulk queries (I've tried sizes of 500 and 1000). There should be a 10 MB limit on the AWS side, so these are safely below that. I've also tested adding 0.5/1 second delays between requests, but none of this fiddling has had any noticeable benefit.
The project is currently in development so there is basically no traffic besides the indexing process itself. The smallest index generally needs an update once every 24 hours, larger ones once a week. Upscaling the infrastructure is not something we want to do just because indexing is so brittle. Even only updating the 25k data index twice in a row tends to fail with the above mentioned error. Any ideas how to reasonably solve this issue?
Update 2020-11-10
Did some digging in past logs and found that we used to have 429 circuit_breaking_exception-s (instead of the current 403) with a reason along the lines of [parent] Data too large, data for [<http_request>] would be [1017018726/969.9mb], which is larger than the limit of [1011774259/964.9mb], real usage: [1016820856/969.7mb], new bytes reserved: [197870/193.2kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=197870/193.2kb, accounting=4309694/4.1mb]. Used the cluster stats API to track memory usage during indexing, but didn't find anything that I could identify as a direct cause of the issue.
Ended up creating a solution based on the information that I could find. After some searching and reading it seemed like just trying again when running into errors is a valid approach with Elasticsearch. For example:
Make sure to watch for TOO_MANY_REQUESTS (429) response codes
(EsRejectedExecutionException with the Java client), which is the way
that Elasticsearch tells you that it cannot keep up with the current
indexing rate. When it happens, you should pause indexing a bit before
trying again, ideally with randomized exponential backoff.
The same guide also has useful information about refreshes:
The operation that consists of making changes visible to search -
called a refresh - is costly, and calling it often while there is
ongoing indexing activity can hurt indexing speed.
By default, Elasticsearch periodically refreshes indices every second,
but only on indices that have received one search request or more in
the last 30 seconds.
In my use case indexing is a single linear process that does not occur frequently so this is what I did:
Disabled automatic refreshes (index.refresh_interval set to -1)
Using refresh API and refresh parameter (with true value) when and where needed
When running into a "403 Request throttled due to too many requests" error, the program keeps retrying every 15 seconds until it succeeds or the time limit (currently 60 seconds) is hit. I will adjust the numbers/functionality if needed, but results have been good so far.
This way the indexing is still fast, but will slow down when needed to provide better stability.
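A minimal sketch of that retry loop in Python (`do_bulk_index` and the exception class stand in for whatever client call and error type your setup actually uses):

```python
import time

class ThrottledError(Exception):
    """Stand-in for the client exception raised on a 403/429 throttling response."""

def index_with_retry(do_bulk_index, wait_seconds=15, time_limit_seconds=60):
    """Call do_bulk_index(), retrying on throttling until success or the time limit."""
    deadline = time.monotonic() + time_limit_seconds
    while True:
        try:
            return do_bulk_index()
        except ThrottledError:
            # Give up once the next wait would push us past the deadline.
            if time.monotonic() + wait_seconds > deadline:
                raise
            time.sleep(wait_seconds)
```

The Elasticsearch guide quoted above recommends randomized exponential backoff rather than a fixed interval; the fixed 15-second wait here mirrors what the post describes.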

DynamoDb - How exactly does the throughput limit works?

Let's say I have:
A table with 100 RCUs
This table has 200 items
Each item is 4 KB
As far as I understand, RCUs are calculated per second and you spend 1 full RCU per 4 KB read (with a strongly consistent read).
1) Because of this, if I spend more than 100 RCUs in one second I should get a throttling error, right?
2) How can I predict that a certain request will require more than my provisioned throughput? It feels scary that at any time I could compromise the whole database by making an expensive request.
3) Let's say I want to do a scan on the whole table (get all items); that should require 200 RCUs. But that will depend on how fast DynamoDB does it, right? If it's too fast it will give me an error, but if it takes 2 seconds or more it should be fine. How do I account for this? How do I take DynamoDB's speed into account to know how many RCUs I will need? What is DynamoDB's "speed"?
4) What's the difference between throttling and throughput limit exceeded?
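As a sanity check on the arithmetic in the questions above, here is a sketch of the documented rounding rules (my own helper names, and a simplification: scans actually meter cumulative data read per page, not per item):

```python
import math

def rcu_per_read(item_size_bytes, strongly_consistent=True):
    """RCUs for one read: 1 RCU per 4 KB, rounded up; eventually consistent costs half."""
    units = math.ceil(item_size_bytes / 4096)
    return units if strongly_consistent else units / 2

def rcu_for_scan(item_count, item_size_bytes, strongly_consistent=True):
    """Rough total capacity consumed by reading every item."""
    return item_count * rcu_per_read(item_size_bytes, strongly_consistent)
```

For the table in the question: 200 items of 4 KB each is 200 RCUs of strongly consistent reads, so with 100 provisioned RCUs the scan has to be spread over at least 2 seconds (or paginated with a smaller page size) to avoid throttling.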
Most of your questions are theoretical at this point, because you now (as of Nov 2018) have the option of simply telling DynamoDB to use 'on demand' mode, where you no longer need to calculate or worry about RCUs. Simply enable this option and forget about it. I had similar problems in the past because of very uneven workloads (periods of no activity and then periods where I needed to do full table scans to generate a report) and struggled to get it all working seamlessly.
I turned on 'on demand' mode, cost went down by about 70% in my case, and there were no more throttling errors. Your cost profile may be different, but I would definitely check out this new option.
https://aws.amazon.com/blogs/aws/amazon-dynamodb-on-demand-no-capacity-planning-and-pay-per-request-pricing/

High rate of InternalServerError from DynamoDB

So according to Amazon's DynamoDB error handling docs, it's expected behavior that you might occasionally receive 500 errors (they do not specify why this might occur or how often). In such a case, you're supposed to implement retries with exponential backoff, starting at about 50 ms.
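The docs' advice amounts to something like this (my own sketch of a full-jitter variant; the 50 ms base comes from the docs, the cap is an arbitrary choice):

```python
import random

def backoff_delay(attempt, base=0.05, cap=20.0):
    """Seconds to wait before retry number `attempt` (0-based), with full jitter."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Jitter matters here: with 150 threads retrying in lockstep, un-jittered backoff makes every client hammer the table at the same instants, which can keep the error rate high.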
In my case, I'm processing and batch writing a very large amount of data in parallel (30 nodes, each running about 5 concurrent threads). I expect this to take a very long time.
My hash key is fairly well balanced (user_id) and my throughput is set to 20000 write capacity units.
When I spin things up, all starts off pretty well. I hit the throughput and start backing off and for a while I oscillate nicely around the max capacity. However, pretty soon I start getting tons of 500 responses with InternalServerError as the exception. No other information is provided, of course. I back off and I back off, until I'm waiting about 1 minute between retries and everything seizes up and I'm no longer getting 200s at all.
I feel like there must be something wrong with my queries, or perhaps specific queries, but I have no way to investigate. There's simply no information about my error coming back from the server beyond the "Internal" and "Error" part.
Halp?

Amount of Test Data needed for load testing of a web service

I am currently working on a project that requires load testing of web services.
One of the services is called 60,000 times in production during the busy day/busy hour.
{PerfTest Env=PROD}
Input Account Number
Output AccountDetails
Do I really need 60,000 unique account numbers (test data) for this LoadRunner script to simulate the production scenario?
If unique data is required, then for an endurance test I will have to prepare a lot of test data for each web service.
If I don't get that much test data, what is the chance of the load test being skewed by the application server's cache mechanism?
Can somebody help me?
Thanks
Ram
Are you simulating a day or the highest-volume hour in the last year? This can help you shape the amount of data that you need. Rarely would you start with a 24-hour test. Instead you would be looking at your high-water test of an hour with a ramp up and ramp down, so you would need approximately 1.333× your high-water hour's worth of data.
So this can drop your 60K to (potentially) around 20K. I am making an assumption that your worst hour over the last year is somewhere around 1/3 of your traditional day's volume. I have observed this pattern over and over again in different environments over the past two decades, but you will want to objectively verify it with log data or query data from your own environment.
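The sizing arithmetic above can be sketched like this (the 1/3-of-day ratio is the heuristic from this answer, not a rule, and the helper name is my own):

```python
def required_test_data(daily_volume, peak_hour_fraction=1/3, ramp_factor=1.333):
    """Rows of test data for a high-water-hour test with ramp up/down included."""
    peak_hour = daily_volume * peak_hour_fraction
    return round(peak_hour * ramp_factor)
```

For the 60,000-call day in the question this works out to roughly 26,700 rows rather than 60,000, before accounting for non-unique accounts.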
Next up, how many of these inquiries are actually unique? You are really going to need a log of the queries across a day (or your high water hour) to determine this. Log processing tools such as Microsoft Logparser or Splunk/Splunk Storm can help you to pull the observed distribution of unique account references within your data, including counts of those which are multiple. Once you know this you can simply use a data file with a fixed block size for each user for unique data and once the data is exhausted the user exits.
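Once you have extracted the account numbers from the logs, the uniqueness analysis described above is straightforward (a sketch; the field extraction itself depends on your log format):

```python
from collections import Counter

def account_distribution(account_numbers):
    """Counts per account, plus how many accounts appear once vs. repeatedly."""
    counts = Counter(account_numbers)
    appear_once = sum(1 for c in counts.values() if c == 1)
    repeated = len(counts) - appear_once
    return counts, appear_once, repeated
```

The `repeated` share tells you how much of the production traffic a smaller, reused data file can legitimately cover without artificially inflating cache hits.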

High CPU usage by Django App

I've created a pretty simple Django app that nonetheless produces a high CPU load: rendering a simple generic view with a list of 20 simple models and 5-6 SQL queries per page produces an Apache process that loads the CPU at 30-50%. While memory usage is fine (30 MB), that CPU load is not OK to my understanding, and it isn't down to Apache/WSGI settings or the like: the same CPU load occurs when I run the app via runserver.
Since I'm new to Django, I wanted to ask:
1) Are these 30-50% figures normal for a Django app? (Django 1.4, Ubuntu 12.04, Python 2.7.3)
2) How do I profile CPU load? I used the profiling middleware from http://djangosnippets.org/snippets/186/ but it only shows millisecond timings, not CPU load, and nothing stood out, so how do I identify what eats up so much CPU power?
CPU usage by itself doesn't tell you how efficient your app is. A more important performance metric is how many requests per second your app can process. The kind of processor your machine has naturally also has a huge effect on the results.
I suggest running ab with multiple concurrent requests and comparing the requests/second number to benchmarks (there are many around the net). ab will try to test maximum throughput, so it's natural that one of the resources will be fully utilized (the bottleneck); usually this is disk I/O. For example, if CPU usage is close to 100%, it may mean you are wasting CPU somewhere (requests/second is low) or that you have optimized disk I/O well (requests/second is high).
Looking at the %CPU column is not very accurate. I certainly see spikes of 50-100% CPU all the time; the column does not indicate how long the CPU is being used, just that we hit that value at that specific moment. These spikes fall under min/max figures, not your average CPU usage.
Another important piece: say you have 4 cores, as I do, which means the 30-50% figure in top is out of a maximum of 400%. 50% in top means 50% of one core, or 12.5% across all four, etc.
You can press 1 in top to see individual core cpu figures.
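For the original profiling question, millisecond-only middleware can be supplemented with Python's stdlib profiler, which shows where CPU time goes per function (a generic sketch with a hypothetical decorator name, not tied to that middleware snippet):

```python
import cProfile
import io
import pstats

def profiled(view_func):
    """Decorator that prints the top-10 cumulative-time functions for each call."""
    def wrapper(*args, **kwargs):
        profiler = cProfile.Profile()
        profiler.enable()
        try:
            result = view_func(*args, **kwargs)
        finally:
            profiler.disable()
            out = io.StringIO()
            pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
            print(out.getvalue())
        return result
    return wrapper
```

Applied to a Django view, the cumulative-time ranking usually points straight at the hot spot, e.g. template rendering vs. ORM query construction.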