I have created an ES domain to search VPC Flow Logs and CloudTrail logs, with daily indexing.
Right now, the status is RED:
{
"cluster_name": "678628912247:test",
"status": "red",
"timed_out": false,
"number_of_nodes": 17,
"number_of_data_nodes": 17,
"active_primary_shards": 687,
"active_shards": 1374,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 8,
"number_of_pending_tasks": 0
}
Investigating further, I found that one index is RED:
red open cwl-2016.02.19 5 1 381700 102899 335.8mb 167.9mb
Looking into the shards:
cwl-2016.02.19 2 p UNASSIGNED
cwl-2016.02.19 2 r UNASSIGNED
cwl-2016.02.19 0 p UNASSIGNED
cwl-2016.02.19 0 r UNASSIGNED
cwl-2016.02.19 3 p STARTED 381700 167.9mb x.x.x.x Elektra Natchios
cwl-2016.02.19 3 r STARTED 381700 167.9mb x.x.x.x Chronos
cwl-2016.02.19 1 p UNASSIGNED
cwl-2016.02.19 1 r UNASSIGNED
cwl-2016.02.19 4 p UNASSIGNED
cwl-2016.02.19 4 r UNASSIGNED
I tried to reroute the shards to less-used nodes, but it gives me:
{"Message":"Your request: '/_cluster/reroute' is not allowed."}
Any advice on what I should do now, please?
Thanks & Regards.
A red cluster status means that at least one primary shard and its replicas are not allocated to a node.
Since you have already found the red index, the best option is to delete it.
If deletion is not possible, you can restore it from a snapshot (note that AWS automatically takes snapshots).
As a last resort, you can contact AWS support and they can restore it for you.
It's important to fix a red cluster, since once the cluster is red AWS stops taking automatic snapshots.
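For example, deleting the red index from your output is a single request (the endpoint is a placeholder; this permanently removes that index's data, so only do it if the data is expendable or a snapshot exists):
curl -XDELETE "https://<es-endpoint>/cwl-2016.02.19"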
A RED cluster means one or more primary shards are not available; that means data loss, and it is a serious issue which requires an immediate fix.
If you have a snapshot, try to recover the index from it.
In future, increase the replica count so that you don't lose a primary shard outright and it can easily be recovered from a replica shard.
Check the ES cluster logs and try to find out the reason for the missing primary shards.
See if the reroute API can be useful; it helps when you have shard data available on disk but no data node where ES can allocate it, so see if you can add a data node or create a configuration that can recover the primary shards.
Regarding the error when trying to run the reroute API, it seems to be a permission issue which you can solve by getting the proper access.
Elasticsearch Allocation API
The Allocation API will help you understand the cluster's allocation issues.
curl -XGET "localhost:9200/_cluster/allocation/explain"
Resolve the issues or reasons explained by the Allocation API, then re-initiate the allocation with the following:
curl -X POST "http://127.0.0.1:9200/_cluster/reroute?retry_failed=true"
Related
I am trying to cross join two DataFrames, apply a few transformations, and finally write the result to a temporary S3 location, but I always end up with the "No space left on device" error below. It looks like it is due to calling spill(). Could you please help me understand how to overcome this error with the correct configuration?
Configuration details:
Cluster: AWS EMR cluster
CORE nodes: 2 initially, scaling up to 15 nodes.
TASK nodes: 0 initially, scaling up to 15 on an on-demand basis.
Instance type: r4.2xlarge (8 cores, 61 GB RAM, 128 GB EBS)
DataFrame1 & DataFrame2 partition count: 26 partitions.
DataFrame1 record count: 115580
DataFrame2 record count: 94191
DataFrame1 column count: 53 (1 column holding JSON data)
DataFrame2 column count: 36
spark.sql.shuffle.partitions: 500
"spark.executor.memoryOverhead": "4852"
"spark.driver.memoryOverhead": "4852"
Error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 63 in stage 68.0 failed 4 times, most recent failure: Lost task 63.3 in stage 68.0 (TID 1640) (ip-10-66-199-71.ec2.internal executor 44):
org.apache.spark.memory.SparkOutOfMemoryError: error while calling spill() on org.apache.spark.shuffle.sort.ShuffleExternalSorter@7ea8a25 : No space left on device
Thanks in Advance..!!
Sekhar
It's a common issue, and AWS provides official documentation on how to solve it:
How do I resolve "no space left on device" stage failures in Spark on Amazon EMR?
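In short, the fixes usually come down to giving the nodes more local/EBS storage and making each task spill less. A rough sketch of the kind of spark-submit overrides involved (these are standard Spark settings, but the values are illustrative assumptions, not tuned for your job, and your_job.py is a placeholder):
# More, smaller shuffle partitions spread spill files over more tasks;
# fewer executor cores reduce how many tasks spill to the same local disk at once.
spark-submit \
  --conf spark.sql.shuffle.partitions=1000 \
  --conf spark.default.parallelism=1000 \
  --conf spark.executor.cores=4 \
  your_job.py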
I deployed 2 instances of Eureka Server and a total of 12 instances of microservices.
Renews (last min) is 24, as expected, but Renews threshold is always 0. Is this how it is supposed to be when self-preservation is turned on? I am also seeing this warning: THE SELF PRESERVATION MODE IS TURNED OFF. THIS MAY NOT PROTECT INSTANCE EXPIRY IN CASE OF NETWORK/OTHER PROBLEMS. What's the expected behavior in this case, and how do I resolve this if it is a problem?
As mentioned above, I deployed 2 instances of Eureka Server, but after running for a while, around 19-20 hours, one instance of Eureka Server always goes down. Why could that possibly be happening? I checked the running processes using the top command and found that Eureka Server is taking a lot of memory. What needs to be configured on Eureka Server so that it doesn't take a lot of memory?
Below is the configuration in the application.properties file of Eureka Server:
spring.application.name=eureka-server
eureka.instance.appname=eureka-server
eureka.instance.instance-id=${spring.application.name}:${spring.application.instance_id:${random.int[1,999999]}}
eureka.server.enable-self-preservation=false
eureka.datacenter=AWS
eureka.environment=STAGE
eureka.client.registerWithEureka=false
eureka.client.fetchRegistry=false
Below is the command that I am using to start the Eureka Server instances.
#!/bin/bash
java -Xms128m -Xmx256m -Xss256k -XX:+HeapDumpOnOutOfMemoryError -Dspring.profiles.active=stage -Dserver.port=9011 -Deureka.instance.prefer-ip-address=true -Deureka.instance.hostname=example.co.za -Deureka.client.serviceUrl.defaultZone=http://example.co.za:9012/eureka/ -jar eureka-server-1.0.jar &
java -Xms128m -Xmx256m -Xss256k -XX:+HeapDumpOnOutOfMemoryError -Dspring.profiles.active=stage -Dserver.port=9012 -Deureka.instance.prefer-ip-address=true -Deureka.instance.hostname=example.co.za -Deureka.client.serviceUrl.defaultZone=http://example.co.za:9011/eureka/ -jar eureka-server-1.0.jar &
Is this approach to create multiple instances of Eureka Server correct?
Deployment is on AWS. Is there any specific configuration needed for Eureka Server on AWS?
Spring Boot version: 2.3.4.RELEASE
I am new to all of this; any help or direction would be a great help.
Let me try to answer your questions one by one.
Renews (last min) is 24, as expected, but Renews threshold is always 0. Is this how it is supposed to be when self-preservation is turned on?
What's the expected behaviour in this case and how to resolve this if this is a problem?
I can see eureka.server.enable-self-preservation=false in your configuration. Disabling self-preservation is only needed if you want an already registered application to be removed from the registry as soon as it fails to renew its lease.
The self-preservation feature exists to prevent exactly that situation, because it can be caused by nothing more than a network hiccup. Say you have two services, A and B, both registered with Eureka, and suddenly B fails to renew its lease because of a temporary network hiccup. Without self-preservation, B would be removed from the registry and A would not be able to reach B, even though B is available.
So we can say that self-preservation is a resiliency feature of Eureka.
The Renews threshold is the expected number of renews per minute. The Eureka server enters self-preservation mode if the actual number of heartbeats in the last minute (Renews) is less than the expected number of renews per minute (Renews threshold).
You can, of course, control the Renews threshold by configuring renewal-percent-threshold (by default it is 0.85).
So in your case:
Total number of application instances = 12
You don't have eureka.instance.leaseRenewalIntervalInSeconds set, so the default value of 30s applies,
and eureka.client.registerWithEureka=false, so the Eureka servers themselves don't register and don't count,
so Renews (last min) will be 12 * 2 = 24.
You don't have renewal-percent-threshold configured, so the default value of 0.85 applies.
Number of renewals per application instance per minute = 2 (one every 30s),
so if self-preservation were enabled, the Renews threshold would be calculated as 2 * 12 * 0.85 ≈ 21 (rounded).
But in your case self-preservation is turned off, so Eureka doesn't calculate the Renews threshold at all, which is why it shows 0.
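If you do re-enable self-preservation, the threshold factor is configurable in the server's application.properties; a minimal sketch (0.85 is just the default mentioned above):
# re-enable self-preservation and keep (or adjust) the renewal threshold factor
eureka.server.enable-self-preservation=true
eureka.server.renewal-percent-threshold=0.85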
One instance of Eureka Server always goes down. Why could that possibly be happening?
I'm not able to answer this question for the time being; it can happen for multiple reasons.
You can usually find the reason in the logs, so if you can post the logs here that would be great.
What needs to be configured on Eureka Server so that it doesn't take a lot of memory?
From the information you have provided I cannot tell what is causing your memory issue; in addition, you have already specified -Xmx256m, and I haven't faced any memory issues with Eureka servers so far.
But I can say that top is not the right tool for checking the memory consumed by your Java process. When the JVM starts, it takes some memory from the operating system.
This is the amount of memory you see in tools like ps and top, so it is better to use jstat or jvmtop.
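For example, jstat can sample the heap of the running server (the PID is a placeholder you would get from jps or ps):
# prints heap and GC utilisation percentages every 5 seconds
jstat -gcutil <eureka-server-pid> 5000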
Is this approach to create multiple instances of Eureka Server correct?
It seems you are using the same hostname (eureka.instance.hostname) for both instances. Replication won't work if you use the same hostname.
Also make sure that you have the same application name in both instances.
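A sketch of the same start script with distinct hostnames (peer1.example.co.za and peer2.example.co.za are hypothetical DNS names for the two hosts; prefer-ip-address is dropped here so each instance registers under its hostname, and the other flags are kept from your command):
#!/bin/bash
# each server advertises its own hostname and points defaultZone at its peer
java -Xms128m -Xmx256m -Xss256k -XX:+HeapDumpOnOutOfMemoryError -Dspring.profiles.active=stage -Dserver.port=9011 -Deureka.instance.hostname=peer1.example.co.za -Deureka.client.serviceUrl.defaultZone=http://peer2.example.co.za:9012/eureka/ -jar eureka-server-1.0.jar &
java -Xms128m -Xmx256m -Xss256k -XX:+HeapDumpOnOutOfMemoryError -Dspring.profiles.active=stage -Dserver.port=9012 -Deureka.instance.hostname=peer2.example.co.za -Deureka.client.serviceUrl.defaultZone=http://peer1.example.co.za:9011/eureka/ -jar eureka-server-1.0.jar &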
Deployment is on AWS. Is there any specific configuration needed for Eureka Server on AWS?
Nothing specific to AWS as far as I know, other than making sure that the instances can communicate with each other.
I have an index on AWS Elasticsearch whose shards were unassigned due to NODE_LEFT. Here's the output of _cat/shards:
rawindex-2017.07.04 1 p STARTED
rawindex-2017.07.04 3 p UNASSIGNED NODE_LEFT
rawindex-2017.07.04 2 p STARTED
rawindex-2017.07.04 4 p STARTED
rawindex-2017.07.04 0 p STARTED
Under normal circumstances, it would be easy to reassign these shards using the _cluster or _settings APIs. However, these are exactly the APIs that are not allowed by AWS. I get the following message:
{
Message: "Your request: '/_settings' is not allowed."
}
According to an answer to a very similar question, I can change the settings of an index using the per-index _settings API, which is allowed by AWS. However, it seems that index.routing.allocation.disable_allocation is not valid for Elasticsearch 5.x, which I am running. I get the following error:
{
"error": {
"root_cause": [
{
"type": "remote_transport_exception",
"reason": "[enweggf][x.x.x.x:9300][indices:admin/settings/update]"
}
],
"type": "illegal_argument_exception",
"reason": "unknown setting [index.routing.allocation.disable_allocation] please check that any required plugins are installed, or check the breaking changes documentation for removed settings"
},
"status": 400
}
I tried prioritizing index recovery with a high index.priority, as well as setting index.unassigned.node_left.delayed_timeout to 1 minute, but I am just not able to reassign them.
Is there any way (dirty or elegant) to achieve this on AWS managed ES?
Thanks!
I had a similar issue with AWS Elasticsearch version 6.3, namely 2 shards failed to be assigned, and the cluster had status RED. Running GET _cluster/allocation/explain showed that the reason was that they had exceeded the default maximum allocation retries of 5.
Running the query GET <my-index-name>/_settings revealed the few settings that can be changed per index. Note that all queries are in Kibana format, which you have out of the box if you are using the AWS Elasticsearch service. The following solved my problem:
PUT <my-index-name>/_settings
{
"index.allocation.max_retries": 6
}
Running GET _cluster/allocation/explain immediately afterwards returned an error with the following: "reason": "unable to find any unassigned shards to explain...", and after some time the problem was resolved.
There might be an alternative solution when the other solutions fail. If you have a managed Elasticsearch instance on AWS, the chances are high that you can "just" restore a snapshot.
Check for failed indexes.
You can use, for example:
curl -X GET "https://<es-endpoint>/_cat/shards"
or
curl -X GET "https://<es-endpoint>/_cluster/allocation/explain"
Check for snapshots.
To find snapshot repositories execute the following query:
curl -X GET "https://<es-endpoint>/_snapshot?pretty"
Next let's have a look at all the snapshots in the cs-automated repository:
curl -X GET "https://<es-endpoint>/_snapshot/cs-automated/_all?pretty"
Find a snapshot where failures: [ ] is empty or the index you want to restore is NOT in a failed state. Then delete the index you want to restore:
curl -XDELETE 'https://<es-endpoint>/<index-name>'
... and restore the deleted index like this:
curl -XPOST 'https://<es-endpoint>/_snapshot/cs-automated/<snapshot-name>/_restore' -d '{"indices": "<index-name>"}' -H 'Content-Type: application/json'
There is also some good documentation here:
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-managedomains-snapshots.html#es-managedomains-snapshot-restore
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-handling-errors.html#aes-handling-errors-red-cluster-status
I also faced a similar problem. The solution is pretty simple; you can solve it in two different ways.
The first solution is to edit all indexes collectively:
PUT _all/_settings
{
"index.allocation.max_retries": 3
}
The second solution is to edit a specific index:
PUT <myIndex>/_settings
{
"index.allocation.max_retries": 3
}
I am trying to dump a list of docs to an AWS Elasticsearch instance. It was running fine; then, all of a sudden, it started throwing this error:
{ _index: '<my index name>',
_type: 'type',
_id: 'record id',
status: 403,
error:
{ type: 'cluster_block_exception',
reason: 'blocked by: [FORBIDDEN/8/index write (api)];' } }
I checked the forums. Most of them say that it is a JVM memory issue: if it goes above 92%, AWS will stop any writes to the cluster/index. However, when I checked the JVM memory, it shows less than 92%. Am I missing something here?
This error means the Amazon ES service is actively blocking writes to protect the cluster from reaching red or yellow status. It does this using index.blocks.write.
The two reasons are:
Low Memory
When the JVMMemoryPressure metric exceeds 92% for 30 minutes, Amazon ES triggers a protection mechanism and blocks all write operations to prevent the cluster from reaching red status. When the protection is on, write operations fail with a ClusterBlockException error, new indexes can't be created, and the IndexCreateBlockException error is thrown.
When the JVMMemoryPressure metric returns to 88% or lower for five minutes, the protection is disabled, and write operations to the cluster are unblocked.
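You can cross-check the heap numbers yourself with the standard node stats API (the endpoint is a placeholder); the per-node heap_used_percent field is roughly what the JVMMemoryPressure metric reports:
curl -XGET "https://<es-endpoint>/_nodes/stats/jvm?pretty"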
Low Disk Space
Elasticsearch has a default "low watermark" of 85%, meaning that once disk usage exceeds 85%, Elasticsearch no longer allocates shards to that node. Elasticsearch also has a default "high watermark" of 90%, at which point it attempts to relocate shards to other nodes.
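To see how close each node is to those watermarks, the standard _cat/allocation API (the endpoint is a placeholder) shows disk usage and shard counts per node:
curl -XGET "https://<es-endpoint>/_cat/allocation?v"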
This error indicates that AWS ElasticSearch has placed a block on your domain based upon disk space. At 85%, ES will not allow you to create any new indexes. At 90%, no new documents can be written.
ES can also apply a write block on an index during rollovers, or because of low disk space or memory.
To stop these errors, you need to remove the write block on the index by setting index.blocks.write to false:
curl -X PUT -H "Content-Type: application/json" \
'http://localhost:9200/{index_name}/_settings' \
-d '{ "index": { "blocks": { "write": "false" } } }'
The accepted solution was not enough in my case; I had to remove index.blocks.read_only_allow_delete as well:
PUT /my_index/_settings
{
"index.blocks.read_only_allow_delete": null,
"index.blocks.write": null
}
ES version 7.15
This can also happen if the index you're trying to write to has been marked as read only. I've had it happen due to an Index State Management misconfiguration which caused a weekly index to be moved to a warm state after one day.
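If you suspect ISM, its explain API shows which policy and state an index is currently in (the path shown is the Open Distro one used by older AWS Elasticsearch domains; on OpenSearch domains it is _plugins/_ism/explain instead, and the index name is a placeholder):
curl -XGET "https://<es-endpoint>/_opendistro/_ism/explain/<index-name>?pretty"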
I'm trying to create a new index in an AWS Elasticsearch cluster after increasing the cluster size, and I'm seeing an index_create_block_exception. How can I rectify this? I tried searching but did not find exact answers. Thank you.
curl -XPUT 'http://<aws_es_endpoint>/optimus/'
{"error":{"root_cause":[{"type":"index_create_block_exception","reason":"blocked by: [FORBIDDEN/10/cluster create-index blocked (api)];"}],"type":"index_create_block_exception","reason":"blocked by: [FORBIDDEN/10/cluster create-index blocked (api)];"},"status":403}
According to AWS, the above exception is thrown due to low memory:
For t2 instances, when the JVMMemoryPressure metric exceeds 92%,
Amazon ES triggers a protection mechanism by blocking all write
operations to prevent the cluster from getting into red status. When
the protection is on, write operations will fail with a
ClusterBlockException error, new indexes cannot be created, and the
IndexCreateBlockException error will be thrown.
I'm afraid the issue is still occurring.
You'll also get this error if you run out of disk space.
This should of course not happen after increasing the cluster size, but if you suddenly get this error it's worth checking that all your instances have storage left; i.e. don't look at the Total free storage space graph but at the Minimum free storage space one.
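If you prefer the CLI over the console graphs, the Minimum statistic of the FreeStorageSpace metric reports the least-free node (domain name, account ID and the time window are placeholders):
# minimum free storage across data nodes, sampled over 5-minute periods
aws cloudwatch get-metric-statistics \
  --namespace AWS/ES \
  --metric-name FreeStorageSpace \
  --dimensions Name=DomainName,Value=<domain-name> Name=ClientId,Value=<aws-account-id> \
  --statistics Minimum \
  --start-time <start-time> --end-time <end-time> \
  --period 300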