We're having an issue with ElasticSearch on AWS.
The node has been in Red status for a couple of hours now, and I have no idea how to recover it.
I have tried a few suggestions:
curl -XGET -u 'username:password' 'host:443/_cluster/allocation/explain'
But all of the requests are coming back with:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "master_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}
The health dashboard is showing this:
Any ideas on how I can recover the instance?
UPDATE:
It looks like one of the nodes has disappeared (comparing the dashboard from 24 hours ago with now).
UPDATE:
Maybe there was too much RAM usage? How do I fix it? The node is not even listed in the list of nodes. Can I curl a specific node?
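One way to check which nodes the cluster can still see, and which one is currently elected master, is the _cat API; a minimal sketch using the same credentials and endpoint as the allocation/explain call above:
curl -XGET -u 'username:password' 'host:443/_cat/nodes?v'
curl -XGET -u 'username:password' 'host:443/_cat/master?v'
With no reachable master these may also come back with a 503, but when they do respond, the node list is a quick way to confirm whether a node has really dropped out.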
UPDATE:
Ended up just re-creating the instance from scratch. Apparently two master nodes is a no-go: you are supposed to have at least 3, because when you have 2 master nodes and one of them crashes, the other one does nothing to restore it.
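For reference, on the managed service the dedicated master count is changed at the domain level rather than from inside Elasticsearch itself. A rough sketch with the AWS CLI, assuming a hypothetical domain called my-domain and a master instance type that fits your budget:
# enable three dedicated masters so that losing one still leaves a quorum
aws es update-elasticsearch-domain-config \
  --domain-name my-domain \
  --elasticsearch-cluster-config "DedicatedMasterEnabled=true,DedicatedMasterCount=3,DedicatedMasterType=m5.large.elasticsearch"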
If you are getting the master_not_discovered_exception error, then there's something pretty wrong with your deployment.
You'd need to contact AWS Support for this, as that is what is managing the node deployment at the end of the day.
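Before opening the ticket it may be worth capturing the cluster health output to attach to the case; a small sketch against the same endpoint and credentials used in the question:
curl -XGET -u 'username:password' 'host:443/_cluster/health?pretty'
Even a 503 here is useful evidence that the cluster has no elected master.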
I'm trying to add a process instance using this POST request to /message:
{
  "messageName" : "DocumentReceived",
  "businessKey" : "3",
  "processVariables" : {
    "document" : { "value" : "This is a document...", "type" : "String" }
  }
}
But instead of getting 1 instance, I'm getting 2 instances with the same id and same everything. I tried creating a process directly from the webapp (TaskList), but it still creates 2 duplicates, and I noticed that one of the instances gets stuck on the user task while the other can just pass it without doing anything. I'll attach a screenshot after running the POST request above.
Check your process model carefully. I believe you accidentally have two outgoing sequence flows on the start event. One connects to the user task, the other connects directly to the gateway. Because the two flows overlap, it is hard to spot. However, when you look closely at the "Send the new document" user task, you can see a faint line passing "behind" the task. Move the user task model element 3 cm up and you will see what is wrong.
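If you want to confirm the duplication from the REST API rather than the TaskList/Cockpit UI, Camunda lets you filter running instances by business key; a small sketch, assuming the default engine-rest base path and port:
curl -X GET "http://localhost:8080/engine-rest/process-instance?businessKey=3"
This lists the running instances for business key 3, so you can see exactly what the message correlation created before and after removing the stray sequence flow.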
So I had a working configuration with fluent-bit on EKS pointing at the AWS Elasticsearch service, but for cost-saving purposes we deleted that Elasticsearch and created an instance with a standalone Elasticsearch, which is enough for dev purposes. The AWS service doesn't cope well with only one instance.
The issue is that during this migration the fluent-bit seems to have broken, and I get lots of "[warn] failed to flush chunk" and some "[error] [upstream] connection #55 to ES-SERVER:9200 timed out after 10 seconds".
My current configuration:
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Kube_Tag_Prefix kube.var.log.containers.
Merge_Log On
Merge_Log_Key log_processed
K8S-Logging.Parser On
K8S-Logging.Exclude Off
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
Parser docker
DB /var/log/flb_kube.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
Ignore_Older 1m
I think the issue is in one of those configuration sections; if I comment out the kubernetes filter I don't have the errors anymore, but I'm losing the fields in the indices...
I tried tweaking some parameters in fluent-bit to no avail. Does anyone have a suggestion?
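One way to surface the actual Elasticsearch rejection hiding behind "failed to flush chunk" is the Trace_Error option of the es output plugin, which is what the update below ends up doing; a minimal sketch, assuming the output section points at the same ES-SERVER host:
[OUTPUT]
    Name        es
    Match       *
    Host        ES-SERVER
    Port        9200
    # print the error responses returned by the Elasticsearch bulk API to the Fluent Bit log
    Trace_Error On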
So, the previous logs did not indicate anything, but I finally found something when activating Trace_Error in the elasticsearch output:
{"index":{"_index":"fluent-bit-2021.04.16","_type":"_doc","_id":"Xkxy 23gBidvuDr8mzw8W","status":400,"error":{"type":"mapper_parsing_exception","reas on":"object mapping for [kubernetes.labels.app] tried to parse field [app] as o bject, but found a concrete value"}}
Did someone get that error before and knows how to solve it?
So, after looking into the logs and finding the mapping issue, I seem to have resolved the issue. The logs are now correctly parsed and sent to Elasticsearch.
To resolve it I had to increase the output retry limit and add the Replace_Dots option.
[OUTPUT]
Name es
Match *
Host ELASTICSERVER
Port 9200
Index <fluent-bit-{now/d}>
Retry_Limit 20
Replace_Dots On
It seems that at the beginning I had issues with the content being sent; because of that, the error seemed to continue after the change until a new index was created, which made me think that the error was still not resolved.
I use the AWS Elasticsearch service version 7.1 and its built-in Kibana to manage application logs. New indexes are created daily by Logstash. My Logstash gets an error about the maximum shards limit being reached from time to time, and I have to delete old indexes for it to start working again.
I found from this document (https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-handling-errors.html) that I have an option to increase _cluster/settings/cluster.max_shards_per_node.
So I tried that by putting the following command in Kibana Dev Tools:
PUT /_cluster/settings
{
  "defaults" : {
    "cluster.max_shards_per_node": "2000"
  }
}
But I got this error
{
"Message": "Your request: '/_cluster/settings' payload is not allowed."
}
Someone suggested that this error occurs when I try to update settings that are not allowed by AWS, but this document (https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-supported-es-operations.html#es_version_7_1) tells me that cluster.max_shards_per_node is in the allowed list.
Please suggest how to update this setting.
You're almost there; you just need to rename defaults to persistent:
PUT /_cluster/settings
{
  "persistent" : {
    "cluster.max_shards_per_node": "2000"
  }
}
Beware though, that the more shards you allow per node, the more resources each node will need and the worse the performance can get.
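To double-check that the new value was actually applied, and that it landed under persistent rather than defaults, you can read the cluster settings back; a small sketch in the same Kibana Dev Tools format:
GET _cluster/settings?flat_settings=true
The response should list "cluster.max_shards_per_node" : "2000" under the persistent block.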
I am using the AWS ElasticSearch service, and am attempting to use a policy to transition indices to UltraWarm storage. However, each time the migration to UltraWarm begins, Kibana displays the error "Failed to start warm migration" for the managed index. The complete error message is below. The "cause" is not very helpful. I am looking for help on how to identify / resolve the root cause of this issue. Thanks!
{
"cause": "[753f6f14e4f92c962243aec39d5a7c31][10.212.32.199:9300][indices:admin/ultrawarm/migration/warm]",
"message": "Failed to start warm migration"
}
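There is no definitive root cause in the error itself, but a first diagnostic pass that sometimes narrows this down is to confirm the index and the cluster are healthy before the migration starts, since the warm migration needs the shards to be in a movable state; a hedged sketch in Kibana Dev Tools format, with <my-index-name> as a placeholder:
GET _cluster/health
GET _cat/indices/<my-index-name>?v
GET _cluster/allocation/explain
If those all look clean, the opaque node id in the "cause" field is likely something only AWS Support can map to a concrete node and failure.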
I have an index on AWS Elasticsearch whose shards were unassigned due to NODE_LEFT. Here's the output of _cat/shards:
rawindex-2017.07.04 1 p STARTED
rawindex-2017.07.04 3 p UNASSIGNED NODE_LEFT
rawindex-2017.07.04 2 p STARTED
rawindex-2017.07.04 4 p STARTED
rawindex-2017.07.04 0 p STARTED
Under normal circumstances, it would be easy to reassign these shards by using the _cluster or _settings APIs. However, these are the exact APIs that are not allowed by AWS. I get the following message:
{
Message: "Your request: '/_settings' is not allowed."
}
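For context, on a self-managed cluster the usual way out of a NODE_LEFT state with exhausted allocation retries is to ask Elasticsearch to retry the failed allocations via the reroute API, which falls under the _cluster endpoints blocked here; a sketch of what that call looks like outside of AWS:
POST /_cluster/reroute?retry_failed=true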
According to an answer to a very similar question, I can change the setting of an index using _index API, which is allowed by AWS. However, it seems like index.routing.allocation.disable_allocation is not valid for Elasticsearch 5.x, which I am running. I get the following error:
{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[enweggf][x.x.x.x:9300][indices:admin/settings/update]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "unknown setting [index.routing.allocation.disable_allocation] please check that any required plugins are installed, or check the breaking changes documentation for removed settings"
  },
  "status": 400
}
I tried prioritizing index recovery with a high index.priority as well as setting index.unassigned.node_left.delayed_timeout to 1 minute, but I am just not able to reassign them.
Is there any way (dirty or elegant) to achieve this on AWS managed ES?
Thanks!
I had a similar issue with AWS Elasticsearch version 6.3, namely 2 shards failed to be assigned, and the cluster had status RED. Running GET _cluster/allocation/explain showed that the reason was that they had exceeded the default maximum allocation retries of 5.
Running the query GET <my-index-name>/_settings revealed the few settings that can be changed per index. Note that all queries are in Kibana format which you have out of the box if you are using AWS Elasticsearch service. The following solved my problem:
PUT <my-index-name>/_settings
{
"index.allocation.max_retries": 6
}
Running GET _cluster/allocation/explain immediately afterwards returned an error with the following: "reason": "unable to find any unassigned shards to explain...", and after some time the problem was resolved.
There might be an alternative solution when the other solutions fail. If you have a managed Elasticsearch Instance on AWS the chances are high that you can "just" restore a snapshot.
Check for failed indexes.
You can use for e.g.:
curl -X GET "https://<es-endpoint>/_cat/shards"
or
curl -X GET "https://<es-endpoint>/_cluster/allocation/explain"
Check for snapshots.
To find snapshot repositories execute the following query:
curl -X GET "https://<es-endpoint>/_snapshot?pretty"
Next let's have a look at all the snapshots in the cs-automated repository:
curl -X GET "https://<es-endpoint>/_snapshot/cs-automated/_all?pretty"
Find a snapshot where failures: [ ] is empty or the index you want to restore is NOT in a failed state. Then delete the index you want to restore:
curl -XDELETE 'https://<es-endpoint>/<index-name>'
... and restore the deleted index like this:
curl -XPOST 'https://<es-endpoint>/_snapshot/cs-automated/<snapshot-name>/_restore' -d '{"indices": "<index-name>"}' -H 'Content-Type: application/json'
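Once the restore has been triggered you can watch the shards come back; a small sketch against the same endpoint, with <index-name> as a placeholder:
curl -X GET "https://<es-endpoint>/_cat/recovery/<index-name>?v"
curl -X GET "https://<es-endpoint>/_cat/indices/<index-name>?v"
When the recovery entries reach the done stage and the index health returns to green (or yellow on a single-node domain), the restore is complete.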
There is also some good documentation here:
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-managedomains-snapshots.html#es-managedomains-snapshot-restore
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-handling-errors.html#aes-handling-errors-red-cluster-status
I also faced a similar problem. The solution is pretty simple. You can solve it in 2 different ways.
First solution is to edit all indexes collectively:
PUT _all/_settings
{
"index.allocation.max_retries": 3
}
Second solution is to edit specific indexes:
PUT <myIndex>/_settings
{
"index.allocation.max_retries": 3
}