I'm trying to understand what's causing spikes of slow searches on AWS OpenSearch (Elasticsearch). Search times are typically 40-150 ms, but I see spikes of searches taking 5-15 seconds.
For now, I'm trying to understand how to read the monitors.
In the AWS dashboard, I'm looking at the Search latency monitor. It's described as "The average time that it takes a shard to complete a search operation." I see the spikes there. There are various drop-down options for Statistic. When I choose Maximum, the maximum shown during that bad spiking period is about 3,000 ms. (Frankly, I'm not sure how I can have an "average" of a "maximum".)
I also monitor the "took" value that is returned for every search request. Elasticsearch describes took as "Milliseconds it took Elasticsearch to execute the request." Here I'm seeing the 10+ second times. This also fits with what I'm measuring myself, comparing timestamps before and after I make the request.
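Roughly how I'm measuring it, as a minimal sketch (the endpoint, index, and query below are placeholders, and I'm using plain HTTP rather than any particular client):

```python
import time
import requests  # auth omitted for brevity

ENDPOINT = "https://my-domain.example.com"   # placeholder OpenSearch endpoint
QUERY = {"query": {"match_all": {}}}         # placeholder query

start = time.monotonic()
resp = requests.post(f"{ENDPOINT}/my-index/_search", json=QUERY, timeout=30)
wall_ms = (time.monotonic() - start) * 1000

body = resp.json()
# "took" is the server-side execution time in ms; wall_ms additionally includes
# network transfer and client overhead, so it should always be >= took.
print(f"took={body['took']} ms, wall clock={wall_ms:.0f} ms")
```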
So, why doesn't the dashboard monitor show the actual times? Is there a way for me to see it there?
I'm thinking of setting ES_SEARCH_TIMEOUT to a couple of seconds or so; no one is going to wait longer than that for a result anyway. I suspect the slow searches pile up and cause more slow searches, so it seems better to fail a few requests early and hopefully prevent more slow queries (although I'm not certain a timeout actually stops the query from continuing on the server). I'm considering options before I pay $600 a month more for the next larger instance size. But if AWS somehow reports 10-second searches as taking 3 seconds, then I'm not sure what value to set.
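For reference, ES_SEARCH_TIMEOUT is just our own application setting, not anything Elasticsearch defines. My understanding (not verified) is that there are really two different timeouts I could set: the `timeout` parameter inside the search body, which asks shards to return whatever partial results they have after that long, and a client-side request timeout, which just abandons the HTTP call. A rough sketch of both, again with placeholder names:

```python
import os
import requests

ENDPOINT = "https://my-domain.example.com"                   # placeholder endpoint
SEARCH_TIMEOUT = os.environ.get("ES_SEARCH_TIMEOUT", "2s")   # our own app env var

query = {
    "timeout": SEARCH_TIMEOUT,    # server-side: shards return partial results after this
    "query": {"match_all": {}},   # placeholder query
}

try:
    # client-side: give up on the HTTP call itself after a few seconds
    resp = requests.post(f"{ENDPOINT}/my-index/_search", json=query, timeout=5)
    body = resp.json()
    if body.get("timed_out"):
        print("partial results only (server-side timeout hit)")
except requests.Timeout:
    # this only abandons the HTTP request; whether the search keeps running
    # on the cluster is exactly my open question above
    print("client-side timeout")
```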
I would like to set a monthly threshold on the number of traces collected by AWS X-Ray (mainly to avoid unexpected expenses).
It seems that sampling rules allow limiting trace ingestion, but they use a one-second window.
https://docs.aws.amazon.com/xray/latest/devguide/xray-console-sampling.html
But setting a limit on the number of traces per second might cause me to lose some important traces. Basically, the one-second window seems unreasonably narrow, and I would rather set the limit for a whole month.
Is there any way to achieve that?
If not, does anyone know the reason why AWS does not enable that?
(Update)
The answer by Lei Wang confirms that it is not possible and speculates about the possible reasons (see the post for details).
Interestingly, Log Analytics workspaces in Azure have this functionality, so it should not be impossible to add something similar to AWS X-Ray.
X-Ray currently supports two basic sampling behaviors:
a fixed ratio (sample a percentage of requests)
a reservoir (a limit on the number of traces sampled per second)
These two can be combined into a third behavior: reservoir + ratio. For example, a 1/s reservoir plus a 5% rate means: sample at least 1 trace per second, and if throughput exceeds 1 request per second, sample an additional 5% of the remaining requests.
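For example, a rule combining the two could be created roughly like this with boto3 (the rule name and matching criteria are just placeholders):

```python
import boto3

xray = boto3.client("xray")

# Reservoir of 1 trace/second plus a 5% fixed rate on top of that.
xray.create_sampling_rule(
    SamplingRule={
        "RuleName": "one-per-second-plus-5pct",  # placeholder name
        "Priority": 100,
        "ReservoirSize": 1,      # sample at least 1 request per second
        "FixedRate": 0.05,       # then sample 5% of the remaining requests
        "ServiceName": "*",
        "ServiceType": "*",
        "Host": "*",
        "HTTPMethod": "*",
        "URLPath": "*",
        "ResourceARN": "*",
        "Version": 1,
    }
)
```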
As for why X-Ray does not support more sampling behaviors such as the per-month limit you mention: my guess is that it is technically not easy to implement, and it is not clear whether it is a common user requirement. X-Ray cannot guarantee that a customer won't restart the application within the month, and even if the user says the application will never restart, the X-Ray SDK would still need a communication mechanism to total up the traces across the fleet. So the only possible workaround is for the user's application to keep track of how many traces are in the X-Ray backend by querying it periodically.
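A very rough sketch of that workaround: periodically count how many traces landed in X-Ray, keep a running monthly total somewhere durable, and turn the rule's rate down to zero once the budget is exceeded. The budget, rule name, and polling window below are placeholders, and you would need to persist the running total yourself (e.g., in DynamoDB):

```python
import datetime
import boto3

xray = boto3.client("xray")

MONTHLY_BUDGET = 1_000_000               # placeholder trace budget
RULE_NAME = "one-per-second-plus-5pct"   # the rule to throttle (placeholder)

def count_recent_traces(minutes=5):
    """Count traces recorded in roughly the last `minutes` minutes."""
    end = datetime.datetime.utcnow()
    start = end - datetime.timedelta(minutes=minutes)
    total, token = 0, None
    while True:
        kwargs = {"StartTime": start, "EndTime": end}
        if token:
            kwargs["NextToken"] = token
        page = xray.get_trace_summaries(**kwargs)
        total += len(page["TraceSummaries"])
        token = page.get("NextToken")
        if not token:
            return total

def enforce_budget(running_total):
    # running_total must be persisted across restarts (that is the hard part
    # mentioned above); this function only sketches the check itself
    running_total += count_recent_traces()
    if running_total >= MONTHLY_BUDGET:
        # effectively stop sampling for the rest of the month
        xray.update_sampling_rule(
            SamplingRuleUpdate={
                "RuleName": RULE_NAME,
                "ReservoirSize": 0,
                "FixedRate": 0.0,
            }
        )
    return running_total
```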
There is an Elastic Container Service cluster running an application internally referred to as Deltaload. It compares the data in an Oracle production database with a dev database in Amazon RDS and loads whatever is missing into RDS. A CloudWatch rule is set up to trigger this process every hour.
Now, for some reason, every 20-30 hours there is one interval of a different length. Usually it's a ~25-minute gap, but on other occasions it can be 80-90 minutes instead of 60. I could understand a difference of 1-2 minutes, but being off by 30 minutes from an hourly schedule sounds really problematic, especially given that a full run takes ~45 minutes. Does anyone have any ideas on what could be the reason for that? Or at least how I can figure out why it happens?
The interesting part is that this glitch in the schedule either breaks or fixes the Deltaload app. What I mean is: if it has been running successfully every hour for a whole day and then the 20-minute interval happens, it will then crash every hour for the next day until the next glitch arrives, after which it will work again (the very same process, same container, same everything). It crashes because the connection to RDS times out. This 'day of crashes, day of runs' pattern has been going on since early February. I am not too proficient with AWS, and the Deltaload app is written in C#, which I don't know. The only thing I managed to do is increase the RDS connection timeout to 10 minutes, which did not fix the problem. The guy who wrote the app left the company a while ago and is unavailable, and there are no other developers on this project, as everyone got fired because of corona. So far, the best alternative I see is to just rewrite the whole thing in Python (which I do know). If anyone has any other thoughts on how to understand or fix it, I'd greatly appreciate any input.
To restate my actual question: why does a CloudWatch rule on a regular schedule fire at irregular intervals? How can I prevent this from happening?
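One thing I've been thinking of trying, just to narrow it down: the rule publishes its own CloudWatch metrics, so I could pull when and how often it actually fired and compare that against the gaps I'm seeing. A rough boto3 sketch (the rule name is a placeholder):

```python
import datetime
import boto3

cw = boto3.client("cloudwatch")

end = datetime.datetime.utcnow()
start = end - datetime.timedelta(days=3)

# AWS/Events publishes Invocations / FailedInvocations per rule
resp = cw.get_metric_statistics(
    Namespace="AWS/Events",
    MetricName="Invocations",
    Dimensions=[{"Name": "RuleName", "Value": "deltaload-hourly"}],  # placeholder
    StartTime=start,
    EndTime=end,
    Period=3600,
    Statistics=["Sum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```

If the invocations themselves turn out to be perfectly hourly, I assume the drift is happening on the ECS side (e.g., the task not being placed immediately), which would at least tell me where to look.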
I've started using Athena recently and it appears useful. However, one thing that bugs me is that Query Queuing times can sometimes be very long (around a minute). At other times, queries are executed almost immediately.
I have not been able to identify reasons why queries sometimes queue for so long, and not at other times. The only thing I noticed is that Table Creation and other DDL statements don't queue for long.
What are the factors that affect queuing time? Server load? Query length? Query complexity?
How can I reduce queuing time? There's no information on this in the documentation, as far as I'm aware.
Around a minute is not that long. We had a few weeks during which we had random queuing times of up to 10 minutes. After a lot of back-and-forth with support, they finally tweaked something, and queuing time was reduced to at most 1 minute, with an average of 10 seconds.
The queuing is unrelated to your specific query or even the "max concurrent queries" settings on your account; it's related to the overall Athena load in the region and many other hidden settings that AWS engineers can tweak.
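If you want to see exactly how much of a given query's latency was queueing rather than execution, Athena reports it per query execution. Something like this boto3 sketch (the execution ID is a placeholder returned by start_query_execution, and I'm assuming your SDK version exposes the QueryQueueTimeInMillis statistic):

```python
import boto3

athena = boto3.client("athena")

# placeholder ID returned by start_query_execution
resp = athena.get_query_execution(
    QueryExecutionId="00000000-0000-0000-0000-000000000000"
)

stats = resp["QueryExecution"]["Statistics"]
print("queued (ms):", stats.get("QueryQueueTimeInMillis"))
print("engine (ms):", stats.get("EngineExecutionTimeInMillis"))
print("total  (ms):", stats.get("TotalExecutionTimeInMillis"))
```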
I'm currently using ~500 concurrent executions, and this tends to reach up to 5,000 easily. Is this a long-term problem, or is it relatively easy to make a quota increase request to AWS?
Getting quota increases is not difficult, but it's also not instantaneous. In some cases the support person will ask for more information on why you need the increase (often to be sure you aren't running too far afoul of best practices), which can slow things down. Different support levels have different response times, too. So if you are concerned about it, you should get ahead of it and request the increase before you think you'll need it.
To request an increase:
In the AWS Management Console, select Service Quotas
Click AWS Lambda
Select Concurrent executions
Click Request quota increase
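If you prefer doing this programmatically, the Service Quotas API can submit the same request. A rough boto3 sketch (I'm assuming the quota is listed under the name "Concurrent executions"; verify the name and code in your own account rather than trusting this):

```python
import boto3

quotas = boto3.client("service-quotas")

# Look up the quota code for Lambda's "Concurrent executions" instead of
# hard-coding it. (If it isn't listed here, try list_aws_default_service_quotas.)
code = None
for quota in quotas.list_service_quotas(ServiceCode="lambda")["Quotas"]:
    if quota["QuotaName"] == "Concurrent executions":
        code = quota["QuotaCode"]
        print(code, "current value:", quota["Value"])

# Request the increase (5000 is just the target from the question)
if code:
    quotas.request_service_quota_increase(
        ServiceCode="lambda",
        QuotaCode=code,
        DesiredValue=5000,
    )
```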
We've been experiencing problems where batch update requests take in excess of 60 seconds. We're updating a few KB of data, well short of the 5 MB limit.
What surprises me is not so much the time taken to index the data, but the time taken for the update request itself. Just uploading ~65 KB of data can take over a minute.
We're making frequent updates with small quantities of data. Could it be that we're being throttled?
Not sure if this applies to your problem, but Amazon's documentation recommends sending large batches if possible.
Important
Whenever possible, you should group add and delete operations in batches that are close to the maximum batch size. Submitting a large volume of single-document batches to the document service can increase the time it takes for your changes to become visible in search results.
It talks about search result availability, but I don't think it's a stretch to say that this would also affect SDF processing performance.
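For what it's worth, a sketch of what that batching could look like with boto3 (the document endpoint and document shape are placeholders); the idea is just to buffer operations and flush them as one larger batch instead of sending one request per document:

```python
import json
import boto3

# The CloudSearch *document* endpoint for your domain (placeholder)
domain = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://doc-mydomain-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com",
)

MAX_BATCH_BYTES = 4 * 1024 * 1024  # stay comfortably under the 5 MB limit

def flush(batch):
    """Send one SDF batch of add operations to the document service."""
    if not batch:
        return
    domain.upload_documents(
        documents=json.dumps(batch).encode("utf-8"),
        contentType="application/json",
    )

def upload(docs):
    """docs: iterable of {'id': ..., 'fields': {...}} items (placeholder shape)."""
    batch, size = [], 2  # 2 bytes for the surrounding "[]"
    for doc in docs:
        op = {"type": "add", "id": doc["id"], "fields": doc["fields"]}
        op_size = len(json.dumps(op).encode("utf-8")) + 1
        if size + op_size > MAX_BATCH_BYTES:
            flush(batch)
            batch, size = [], 2
        batch.append(op)
        size += op_size
    flush(batch)
```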