High rate of InternalServerError from DynamoDB - amazon-web-services

According to Amazon's DynamoDB error handling docs, it's expected behavior that you might occasionally receive 500 errors (they do not specify why this might occur or how often). In such a case, you're supposed to implement retries with exponential backoff, starting at about 50 ms.
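For reference, the pattern the docs describe looks roughly like this boto3 sketch (the client setup, table name, item shape and thresholds here are placeholders, not my actual code):

import random
import time

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

RETRYABLE = ("InternalServerError", "ThrottlingException",
             "ProvisionedThroughputExceededException")

def batch_write_with_backoff(table_name, items, base_delay=0.05, max_attempts=10):
    """Retry batch writes with exponential backoff starting at ~50 ms."""
    request = {table_name: [{"PutRequest": {"Item": item}} for item in items]}
    for attempt in range(max_attempts):
        try:
            response = dynamodb.batch_write_item(RequestItems=request)
        except ClientError as err:
            if err.response["Error"]["Code"] not in RETRYABLE:
                raise  # not a retryable error, bail out
        else:
            request = response.get("UnprocessedItems", {})
            if not request:
                return  # every item was accepted
        # back off: 50 ms, 100 ms, 200 ms, ... plus a little jitter
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RuntimeError("batch still failing after %d attempts" % max_attempts)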
In my case, I'm processing and batch writing a very large amount of data in parallel (30 nodes, each running about 5 concurrent threads). I expect this to take a very long time.
My hash key is fairly well balanced (user_id) and my throughput is set to 20000 write capacity units.
When I spin things up, everything starts off pretty well. I hit the throughput limit and start backing off, and for a while I oscillate nicely around the max capacity. However, pretty soon I start getting tons of 500 responses with InternalServerError as the exception. No other information is provided, of course. I back off and I back off, until I'm waiting about 1 minute between retries, and everything seizes up and I'm no longer getting 200s at all.
I feel like there must be something wrong with my queries, or perhaps specific queries, but I have no way to investigate. There's simply no information about my error coming back from the server beyond the "Internal" and "Error" part.
Halp?

Related

Some DynamoDB query subsegments take way too long - how to debug?

I tried using https://github.com/shelfio/dynamodb-query-optimized to speed up DynamoDB response time when querying time-series data. Nothing fancy: the query uses a GSI (id and timestamp), and both attributes appear in the key condition expression. There's not much to filter out in the filter expression, so the index is well suited to the query.
For some reason, if I use the queryOptimized method (which queries both ways, i.e. scans forward and backward), some of the DynamoDB query subsegments start taking way too long, in the range of 16-25 seconds, which is killing performance and resulting in API Gateway timeouts.
The first 20-25 subsegments seem to start at approximately the same time and have decent performance (under 500 ms). After that, each subsegment starts about 300 ms later and most of them run for a really long time.
How do I debug what is going on here? The traces don't seem to give any further info: retries are 0 for all of those subsegments, and there are no failures or errors either. The table is configured with on-demand capacity, so I don't imagine capacity is the issue. The Lambda function also has the maximum 10 GB of memory allocated to it. Any ideas on how to figure out what's going on?
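For reference, as far as I can tell the two-direction query the library runs is roughly equivalent to this boto3 sketch (the table name, GSI name, key attribute names and values are placeholders):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my-timeseries-table")  # placeholder name

def query_both_ways(item_id, start_ts, end_ts):
    """Query the GSI over the same key range, newest-first and oldest-first."""
    common = dict(
        IndexName="id-timestamp-index",  # placeholder GSI name
        KeyConditionExpression=Key("id").eq(item_id)
        & Key("timestamp").between(start_ts, end_ts),
    )
    newest_first = table.query(ScanIndexForward=False, **common)
    oldest_first = table.query(ScanIndexForward=True, **common)
    return newest_first["Items"], oldest_first["Items"]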

Redshift: experiencing slow query performance between 2 segments

We’re experiencing slow query performance on AWS Redshift. Frequently we see that a query takes around 12 seconds to run, but very little of that time (<500 ms) is spent actually executing the query (according to the AWS Redshift console for an individual query).
Querying svl_compile, we can confirm that the query plan is already compiled.
In svl_query_report we see a long delay between the start times of two segments, accounting for the majority of the run time, although the segments themselves all execute very quickly (milliseconds).
There are a number of things that could be going on but I suspect network distribution is involved. Check STL_DIST.
Another possibility is that Redshift broke the query up and a subquery is running during that window. This can happen with very complex queries. Review the plan and see if there are any references to computer-generated table names (I think they begin with 't', but this is just from memory).
Spilling to disk could be happening, but this seems unlikely given what you have said so far. Queuing delays don't seem like a match either. Both are possible but not likely.
If you post more info about how the query is running, we can narrow things down. The actual execution report, explain plan, and/or logging table info would help home in on what is happening during this time window.
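If it helps, here is a rough sketch of how to pull the per-segment timings and the STL_DIST rows for that window (using psycopg2; the connection details and query id are placeholders):

import psycopg2

# Placeholder connection details
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="awsuser", password="...")

QUERY_ID = 123456  # the query id from the console / stl_query

with conn.cursor() as cur:
    # Per-segment timings: look for a large gap between one segment's
    # end_time and the next segment's start_time.
    cur.execute("""
        select segment, min(start_time), max(end_time)
        from svl_query_report
        where query = %s
        group by segment
        order by segment;
    """, (QUERY_ID,))
    for segment, start, end in cur.fetchall():
        print(segment, start, end)

    # Distribution/broadcast activity recorded for the same query
    cur.execute("select * from stl_dist where query = %s order by start_time;",
                (QUERY_ID,))
    for row in cur.fetchall():
        print(row)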

Why does the "hatch rate" matter when performance testing?

I'm using Locust for performance testing. It has two parameters: the number of users and the rate at which the users are generated. But why are the users not simply generated all at once? Why does it make a difference?
Looking at the Locust Configuration Options, I think the correct option is spawn-rate.
Coming back to your question: in the performance testing world the more common term is ramp-up.
The idea is to increase the load gradually, as this way you will be able to correlate other performance metrics like response time, throughput, etc. with the increasing load.
If you release 1000 users at once you will get a limited view and will only be able to answer the question of whether your system supports 1000 users or not. You won't be able to tell what the maximum is, where the saturation point is, etc.
When you increase the load gradually you can state, for example:
Up to 250 users the system behaves normally, i.e. response time is the same, throughput increases as the load increases
After 250 users response time starts growing
After 400 users response time starts exceeding acceptable thresholds
After 600 users errors start occurring
etc.
Also, if you decrease the load gradually you can tell whether the system gets back to normal once the load drops.
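As an illustration, a minimal locustfile plus a headless run that ramps up gradually might look like this (the host, endpoint and numbers are arbitrary examples):

# locustfile.py
from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    host = "https://example.com"   # arbitrary target
    wait_time = between(1, 3)      # each simulated user pauses 1-3 s between tasks

    @task
    def index(self):
        self.client.get("/")

# Ramp up to 1000 users at 10 users/second instead of spawning them all at once:
#   locust -f locustfile.py --headless --users 1000 --spawn-rate 10 --run-time 15m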

AWS Elasticsearch indexing memory usage issue

The problem: very frequent "403 Request throttled due to too many requests" errors during data indexing, which appears to be a memory usage issue.
The infrastructure:
Elasticsearch version: 7.8
t3.small.elasticsearch instance (2 vCPU, 2 GB memory)
Default settings
Single domain, 1 node, 1 shard per index, no replicas
There are 3 indices with searchable data. Two of them have roughly 1 million documents (500-600 MB) each and one has 25k (~20 MB). Indexing is not very simple (it involves history tracking), so I've been testing the refresh parameter with the true and wait_for values, or calling the refresh API separately when needed. The process uses search and bulk queries (I've been trying batch sizes of 500 and 1000). There's a 10 MB request limit on the AWS side, so these are safely below that. I've also tested adding 0.5-1 second delays between requests, but none of this fiddling has any noticeable benefit.
The project is currently in development, so there is basically no traffic besides the indexing process itself. The smallest index generally needs an update once every 24 hours, larger ones once a week. Upscaling the infrastructure is not something we want to do just because indexing is so brittle. Even just updating the 25k-document index twice in a row tends to fail with the above-mentioned error. Any ideas on how to reasonably solve this issue?
Update 2020-11-10
Did some digging in past logs and found that we used to get 429 circuit_breaking_exception errors (instead of the current 403) with a reason along the lines of [parent] Data too large, data for [<http_request>] would be [1017018726/969.9mb], which is larger than the limit of [1011774259/964.9mb], real usage: [1016820856/969.7mb], new bytes reserved: [197870/193.2kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=197870/193.2kb, accounting=4309694/4.1mb]. I used the cluster stats API to track memory usage during indexing, but didn't find anything I could identify as a direct cause of the issue.
I ended up creating a solution based on the information I could find. After some searching and reading, it seemed like simply retrying when running into errors is a valid approach with Elasticsearch. For example:
Make sure to watch for TOO_MANY_REQUESTS (429) response codes
(EsRejectedExecutionException with the Java client), which is the way
that Elasticsearch tells you that it cannot keep up with the current
indexing rate. When it happens, you should pause indexing a bit before
trying again, ideally with randomized exponential backoff.
The same guide also has useful information about refreshes:
The operation that consists of making changes visible to search -
called a refresh - is costly, and calling it often while there is
ongoing indexing activity can hurt indexing speed.
By default, Elasticsearch periodically refreshes indices every second,
but only on indices that have received one search request or more in
the last 30 seconds.
In my use case indexing is a single linear process that does not run frequently, so this is what I did:
Disabled automatic refreshes (index.refresh_interval set to -1)
Using refresh API and refresh parameter (with true value) when and where needed
When running into a "403 Request throttled due to too many requests" error, the program keeps retrying every 15 seconds until it succeeds or the time limit (currently 60 seconds) is hit. I'll adjust the numbers/functionality if needed, but results have been good so far.
This way the indexing is still fast, but will slow down when needed to provide better stability.
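A rough sketch of that flow with the elasticsearch-py client (the endpoint, index name and bulk payload are placeholders, and authentication against the AWS domain is omitted):

import time

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import TransportError

es = Elasticsearch("https://my-domain.eu-west-1.es.amazonaws.com")  # placeholder endpoint
INDEX = "my-index"       # placeholder index name
RETRY_DELAY = 15         # seconds between attempts
RETRY_LIMIT = 60         # give up after waiting this long in total

def bulk_with_retry(actions):
    """Retry the bulk request while the service answers 403/429 (throttled)."""
    waited = 0
    while True:
        try:
            return es.bulk(body=actions)
        except TransportError as err:
            if err.status_code not in (403, 429) or waited >= RETRY_LIMIT:
                raise
            time.sleep(RETRY_DELAY)
            waited += RETRY_DELAY

# 1. Disable automatic refreshes while the indexing job runs
es.indices.put_settings(index=INDEX, body={"index": {"refresh_interval": "-1"}})
# 2. Bulk index with retries, e.g. bulk_with_retry(actions)
# 3. Refresh explicitly and restore the default refresh interval
es.indices.refresh(index=INDEX)
es.indices.put_settings(index=INDEX, body={"index": {"refresh_interval": None}})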

Is there any way to specify a max number of retries when using s3cmd?

I've looked through the usage guide as well as the config docs and I'm just not seeing it. This is the output from my bash script that uses s3cmd sync, captured when S3 appeared to be down:
WARNING: Retrying failed request: /some/bucket/path/
WARNING: 503 (Service Unavailable):
WARNING: Waiting 3 sec...
WARNING: Retrying failed request: /some/bucket/path/
WARNING: 503 (Service Unavailable):
WARNING: Waiting 6 sec...
ERROR: The read operation timed out
It looks like it retried twice with increasing backoff delays, then failed. Surely there must be some way to explicitly state how many times s3cmd should retry a failed network call?
I don't think you can set the maximum retry count. I had a look at the source code on GitHub (https://github.com/s3tools/s3cmd/blob/master/S3/S3.py).
It looks like that value is hard-coded at 5:
Line 240:
## Maximum attempts of re-issuing failed requests
_max_retries = 5
And the retry interval is calculated as:
Line 1004:
def _fail_wait(self, retries):
    # Wait a few seconds. The more it fails the more we wait.
    return (self._max_retries - retries + 1) * 3
And the actual code that carries out the retries:
if response["status"] >= 500:
    e = S3Error(response)
    if response["status"] == 501:
        ## NotImplemented server error - no need to retry
        retries = 0
    if retries:
        warning(u"Retrying failed request: %s" % resource['uri'])
        warning(unicode(e))
        warning("Waiting %d sec..." % self._fail_wait(retries))
        time.sleep(self._fail_wait(retries))
        return self.send_request(request, retries - 1)
    else:
        raise e
So I think after the second retry some other error occurred (the read timeout), which caused it to break out of the retry loop.
It's very unlikely the 503 is because S3 is down; it's almost never, ever 'down'. More likely your account has been throttled because you are making too many requests in too short a period.
You should either slow down your requests, if you control the speed, or pick better keys, i.e. keys that don't all start with the same prefix - a nice wide range of key prefixes will allow S3 to spread the workload better.
From Jeff Barr's blog post:
Further, keys in S3 are partitioned by prefix.
As we said, S3 has automation that continually looks for areas of the
keyspace that need splitting. Partitions are split either due to
sustained high request rates, or because they contain a large number
of keys (which would slow down lookups within the partition). There is
overhead in moving keys into newly created partitions, but with
request rates low and no special tricks, we can keep performance
reasonably high even during partition split operations. This split
operation happens dozens of times a day all over S3 and simply goes
unnoticed from a user performance perspective. However, when request
rates significantly increase on a single partition, partition splits
become detrimental to request performance. How, then, do these heavier
workloads work over time? Smart naming of the keys themselves!
We frequently see new workloads introduced to S3 where content is
organized by user ID, or game ID, or other similar semi-meaningless
identifier. Often these identifiers are incrementally increasing
numbers, or date-time constructs of various types. The unfortunate
part of this naming choice where S3 scaling is concerned is two-fold:
First, all new content will necessarily end up being owned by a single
partition (remember the request rates from above…). Second, all the
partitions holding slightly older (and generally less ‘hot’) content
get cold much faster than other naming conventions, effectively
wasting the available operations per second that each partition can
support by making all the old ones cold over time.
The simplest trick that makes these schemes work well in S3 at nearly
any request rate is to simply reverse the order of the digits in this
identifier (use seconds of precision for date or time-based
identifiers). These identifiers then effectively start with a random
number – and a few of them at that – which then fans out the
transactions across many potential child partitions. Each of those
child partitions scales close enough to linearly (even with some
content being hotter or colder) that no meaningful operations per
second budget is wasted either. In fact, S3 even has an algorithm to
detect this parallel type of write pattern and will automatically
create multiple child partitions from the same parent simultaneously –
increasing the system’s operations per second budget as request heat
is detected.
https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/
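As a concrete illustration of the digit-reversal trick the post describes (the key layout and filenames here are made up):

def spread_key(sequential_id, filename="data.json"):
    """Reverse the digits of an incrementing id so keys start with effectively
    random characters and fan out across S3 partitions."""
    return "%s/%s" % (str(sequential_id)[::-1], filename)

print(spread_key(1000123))  # -> 3210001/data.json
print(spread_key(1000124))  # -> 4210001/data.json
print(spread_key(1000125))  # -> 5210001/data.json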