Elasticsearch - Keep hitting "429 Too Many Requests" error - amazon-web-services

Been running an AWS m4.large.elasticsearch Elasticsearch (service) instance with 2 data nodes for more than a year now without any severe issues. Because of the increased demand we have set up 2 additional r6g.large Elasticsearch instances (which have the same amount of vCPU and memory as the m4.large, but should even offer better performance according to the docs).
Ever since using these we have been getting "429 Too Many Requests" errors inside our application. After some digging on https://aws.amazon.com/es/premiumsupport/knowledge-center/resolve-429-error-es/ the following things have been tried without success:
Increasing the circuit breaker limit to 90% => Does not solve the issue
Switching to c6g.xlarge (Compute optimized instances with double the capacity) => Does not solve the issue
Enabled slow search logs + error logs in the hope of getting more info => Nothing is being logged
If anyone has an idea on how we could go around solving this that would be much appreciated!
PS: "Old" version is running Elasticsearch 7.7 while the new one is running 7.10, but would be astonished that this would be the cause.

A 429 error message as a write rejection indicates a bulk queue error.A bulk queue on each node can hold between 50 and 200 requests, depending on which Elasticsearch version you are using. There have been multiple reports on this and older version of Elasticache is the probable root cause.

Related

Elastic search 403 Request throttled due to too many requests /_bulk

I am trying to sync 1 million record to ES, and I am doing it using bulk API in batch of 2k.
But after inserting around 25k-32k, elastic search is giving following exception.
Unable to parse response body: org.elasticsearch.ElasticsearchStatusException
ElasticsearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host [**********], URI [/_bulk?timeout=1m], status line [HTTP/1.1 403 Request throttled due to too many requests]
403 Request throttled due to too many requests /_bulk]; nested: ResponseException[method [POST], host [************], URI [/_bulk?timeout=1m], status line [HTTP/1.1 403 Request throttled due to too many requests]
403 Request throttled due to too many requests /_bulk];
I am using aws elastic search.
I think, I need to implement wait strategy to handle it, something like keep checking es status and call bulk insert if status all of ES okay.
But not sure how to implement it? Does ES offers anything pre-build for it?
Or Anything better way to handle this?
Thanks in advance.
Update:
I am using AWS elastic search version 6.8
Thanks #dravit for including my previous SO answer in the comment, after following the comments it seems OP wants to improve the performance of bulk indexing and want exponential backoff, which i don't think Elasticsearch provides out of the box.
I see that you are putting a pause of 1 second after every second which will not work in all the cases, and if you have large number of batches and documents to be indexed, for sure it will take a lot of time. There are few more suggestions from my side to improve the performance.
Follow my tips to improve the reindex speed in Elasticsearch and see what all things listed here is applicable and doing them improves speed by what factor.
Find a batching strategy which best suits to your environment, I am not sure but this article from #spinscale who is the developer of java high level rest client might help or you can ask a question on https://discuss.elastic.co/, I remembered he shared a very good batching strategy in one of his webinar but couldn't find the link of it.
Notice various ES metrics apart from bulk threadpool and queue size, and see if your ES still has capacity can you increase the queue size and increase the rate by which you can send requests to ES.
Check the error handling guide here
If you receive persistent 403 Request throttled due to too many requests or 429 Too Many Requests errors, consider scaling vertically. Amazon Elasticsearch Service throttles requests if the payload would cause memory usage to exceed the maximum size of the Java heap.
Scale your application vertically or increase delay between requests.

How to fix CloudRun error 'The request was aborted because there was no available instance'

I'm using managed CloudRun to deploy a container with concurrency=1. Once deployed, I'm firing four long-running requests in parallel.
Most of the time, all works fine -- But occasionally, I'm facing 500's from one of the nodes within a few seconds; logs only provide the error message provided in the subject.
Using retry with exponential back-off did not improve the situation; the retries also end up with 500s. StackDriver logs also do not provide further information.
Potentially relevant gcloud beta run deploy arguments:
--memory 2Gi --concurrency 1 --timeout 8m --platform managed
What does the error message mean exactly -- and how can I solve the issue?
This error message can appear when the infrastructure didn't scale fast enough to catch up with the traffic spike. Infrastructure only keeps a request in the queue for a certain amount of time (about 10s) then aborts it.
This usually happens when:
traffic suddenly largely increase
cold start time is long
request time is long
We also faced this issue when traffic suddenly increased during business hours. The issue is usually caused by a sudden increase in traffic and a longer instance start time to accommodate incoming requests. One way to handle this is by keeping warm-up instances always running i.e. configuring --min-instances parameters in the cloud run deploy command. Another and recommended way is to reduce the service cold start time (which is difficult to achieve in some languages like Java and Python)
I also experiment the problem. Easy to reproduce. I have a fibonacci container that process in 6s fibo(45). I use Hey to perform 200 requests. And I set my Cloud Run concurrency to 1.
Over 200 requests I have 8 similar errors. In my case: sudden traffic spike and long processing time. (Short cold start for me, it's in Go)
I was able to resolve this on my service by raising the max autoscaling container count from 2 to 10. There really should be no reason that 2 would be even close to too low for the traffic, but I suspect something about the Cloud Run internals were tying up to 2 containers somehow.
Setting the Max Retry Attempts to anything but zero will remedy this, as it did for me.

A timeout occurred while waiting for memory resources to execute the query in resource pool 'SloDWPool'

I have a series of Azure SQL Data Warehouse databases (for our development/evaluation purposes). Due to a recent unplanned extended outage (due to an issue with the Tenant Ring associated with some of these databases), I decided to resume the canary queries I had been running before but had quiesced for a couple of months due to frequent exceptions.
The canary queries are not running particularly frequently on any specific database, say every 15 minutes. On one database, I've received two indications of issues completing the canary query in 24 hours. The error is:
Msg 110802, Level 16, State 1, Server adwscdev1, Line 1110802;An internal DMS error occurred that caused this operation to fail. Details: A timeout occurred while waiting for memory resources to execute the query in resource pool 'SloDWPool' (2000000007). Rerun the query.
This database is under essentially no load, running at more than 100 DWU.
Other databases on the same logical server may be running under a load, but I have not seen the error on them.
What is the explanation for this error?
Please open a support ticket for this issue, support will have full access to the DMS logs and be able to see exactly what is going on. this behavior is not expected.
While I agree a support case would be reasonable I think you should also try scaling up to say DWU400 and retrying. I would also consider trying largerc or xlargerc on DWU100 and DWU400 as described here. Note it gets more memory and resources per query.
Run the following then retry your query:
EXEC sp_addrolemember 'largerc', 'yourLoginName'

Elasticsearch percolation dead slow on AWS EC2

Recently we switched our cluster to EC2 and everything is working great... except percolation :(
We use Elasticsearch 2.2.0.
To reindex (and percolate) our data we use a separate EC2 c3.8xlarge instance (32 cores, 60GB, 2 x 160 GB SSD) and tell our index to include only this node in allocation.
Because we'll distribute it amongst the rest of the nodes later, we use 10 shards, no replicas (just for indexing and percolation).
There are about 22 million documents in the index and 15.000 percolators. The index is a tad smaller than 11GB (and so easily fits into memory).
About 16 php processes talk to the REST API doing multi percolate requests with 200 requests in each (we made it smaller because of the performance, it was 1000 per request before).
One percolation request (a real one, tapped off of the php processes running) is taking around 2m20s under load (of the 16 php processes). That would've been ok if one of the resources on the EC2 was maxed out but that's the strange thing (see stats output here but also seen on htop, iotop and iostat): load, cpu, memory, heap, io; everything is well (very well) within limits. There doesn't seem to be a shortage of resources but still, percolation performance is bad.
When we back off the php processes and try the percolate request again, it comes out at around 15s. Just to be clear: I don't have a problem with a 2min+ multi percolate request. As long as I know that one of the resources is fully utilized (and I can act upon it by giving it more of what it wants).
So, ok, it's not the usual suspects, let's try different stuff:
To rule out network, coordination, etc issues we also did the same request from the node itself (enabling the client) with the same pressure from the php processes: no change
We upped the processors configuration in elasticsearch.yml and restarted the node to fake our way to a higher usage of resources: no change.
We tried tweaking the percolate and get pool size and queue size: no change.
When we looked at the hot threads, we DiscovereUsageTrackingQueryCachingPolicy was coming up a lot so we did as suggested in this issue: no change.
Maybe it's the amount of replicas, seeing Elasticsearch uses those to do searches as well? We upped it to 3 and used more EC2 to spread them out: no change.
To determine if we could actually use all resources on EC2, we did stress tests and everything seemed fine, getting it to loads of over 40. Also IO, memory, etc showed no issues under high strain.
It could still be the batch size. Under load we tried a batch of just one percolator in a multi percolate request, directly on the data & client node (dedicated to this index) and found that it used 1m50s. When we tried a batch of 200 percolators (still in one multi percolate request) it used 2m02s (which fits roughly with the 15s result of earlier, without pressure).
This last point might be interesting! It seems that it's stuck somewhere for a loooong time and then goes through the percolate phase quite smoothly.
Can anyone make anything out of this? Anything we have missed? We can provide more data if needed.
Have a look at the thread on the Elastic Discuss forum to see the solution.
TLDR;
Use multiple nodes on one big server to get better resource utilization.

Amazon Elasticache Failover

We have been using AWS Elasticache for about 6 months now without any issues. Every night we have a Java app that runs which will flush DB 0 of our redis cache and then repopulate it with updated data. However we had 3 instances between July 31 and August 5 where our DB was successfully flushed and then we were not able to write the new data to the database.
We were getting the following exception in our application:
redis.clients.jedis.exceptions.JedisDataException:
redis.clients.jedis.exceptions.JedisDataException: READONLY You can't
write against a read only slave.
When we look at the cache events in Elasticache we can see
Failover from master node prod-redis-001 to replica node
prod-redis-002 completed
We have not been able to diagnose the issue and since the app was running fine for the past 6 months I am wondering if it is something related to a recent Elasticache release that was done on the 30th of June.
https://aws.amazon.com/releasenotes/Amazon-ElastiCache
We have always been writing to our master node and we only have 1 replica node.
If someone could offer any insight it would be much appreciated.
EDIT: This seems to be an intermittent problem. Some days it will fail other days it runs fine.
We have been in contact with AWS support for the past few weeks and this is what we have found.
Most Redis requests are synchronous including the flush so it will block all other requests. In our case we are actually flushing 19m keys and it takes more then 30 seconds.
Elasticache performs a health check periodically and since the flush is running the health check will be blocked, thus causing a failover.
We have been asking the support team how often the health check is performed so we can get an idea of why our flush is only causing a failover 3-4 times a week. The best answer we can get is "We think its every 30 seconds". However our flush consistently takes more then 30 seconds and doesn't consistently fail.
They said that they may implement the ability to configure the timing of the health check however they said this would not be done anytime soon.
The best advice they could give us is:
1) Create a completely new cluster for loading the new data on, and
instead of flushing the previous cluster, re-point your application(s)
to the new cluster, and remove the old one.
2) If the data that you are flushing is an update version of the data,
consider not flushing, but updating and overwriting new keys?
3) Instead of flushing the data, set the expiry of the items to be
when you would normally flush, and let the keys be reclaimed (possibly
with a random time to avoid thundering herd issues), and then reload
the data.
Hope this helps :)
Currently for Redis versions from 6.2 AWS ElastiCache has a new feature of thread monitoring. So the health check doesn't happen in the same thread as all other actions of Redis. Redis can continue to proceed a long command / lua script, but will still considered healthy. Because of this new feature failovers should happen less.