AWS ElasticSearchService index_create_block_exception

I'm trying to create a new index in an AWS ElasticSearch cluster after increasing the cluster size and am seeing index_create_block_exception. How can I rectify this? I tried searching but did not find exact answers. Thank you.
curl -XPUT 'http://<aws_es_endpoint>/optimus/'
{"error":{"root_cause":[{"type":"index_create_block_exception","reason":"blocked by: [FORBIDDEN/10/cluster create-index blocked (api)];"}],"type":"index_create_block_exception","reason":"blocked by: [FORBIDDEN/10/cluster create-index blocked (api)];"},"status":403}

According to AWS, the above exception is thrown when memory is running low:
For t2 instances, when the JVMMemoryPressure metric exceeds 92%, Amazon ES triggers a protection mechanism by blocking all write operations to prevent the cluster from getting into red status. When the protection is on, write operations will fail with a ClusterBlockException error, new indexes cannot be created, and the IndexCreateBlockException error will be thrown.
I'm afraid the protection is still in effect on your cluster.
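A quick way to confirm whether memory pressure is the trigger is to pull the JVMMemoryPressure metric from CloudWatch. A minimal sketch, assuming the AWS CLI is configured; the domain name, account id and time window are placeholders:
# Maximum JVM memory pressure over 5-minute windows for the domain.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ES \
  --metric-name JVMMemoryPressure \
  --dimensions Name=DomainName,Value=<domain_name> Name=ClientId,Value=<aws_account_id> \
  --statistics Maximum \
  --period 300 \
  --start-time <start_time_iso8601> \
  --end-time <end_time_iso8601>
# Sustained values above 92% match the write-block behavior described above.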

You'll also get this error if you run out of disk space.
This should of course not happen after increasing the cluster size, but if you suddenly get this error it is worth checking that all your instances have storage left - i.e. don't look at the Total free storage space graph but at the Minimum free storage space one.
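If you prefer to check from the cluster itself rather than the CloudWatch graphs, the _cat allocation API shows per-node disk usage; a small sketch (the endpoint is a placeholder):
# Shows disk.used, disk.avail and disk.percent per node, so a single full node stands out.
curl 'http://<aws_es_endpoint>/_cat/allocation?v'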

Related

Unable to start Spot Instance VM after Resize

I am unable to start my spot instance VM after resizing it from HHB120rs_v2 to HB120rs_v3, with the error:
Failed to start virtual machine 'VM2'. Error: Allocation failed. VM(s) with the following
constraints cannot be allocated, because the condition is too restrictive. Please remove some
constraints and try again. Constraints applied are:
Low Priority VMs
Networking Constraints (such as Accelerated Networking or IPv6)
Preemptible VMs (VM might be preempted by another VM with a higher priority)
VM Size
This is all done within the US East region. Interestingly, when resizing, HB120rs_v3 is not shown under the H-Series category, but instead under the Other category. It previously appeared under the H-Series category, when I was able to start it correctly. If I change the size back to HB120rs_v2 I am able to start the VM.
This is not a quota issue as I currently only have 1 VM and the VM was deallocated at the time of resizing. I have also previously successfully started this VM with the HB120rs_v3 size about 4 weeks earlier.
My questions are:
How can I determine the specific cause of the start failure?
What is the significance of the VM size being shown under the Other category?
We cannot change the generation of a VM after we create it.
How can I determine the specific cause of the start failure?
The reason for the VM start failure is that the last operation run on the VM failed after the input was accepted.
Since you are trying to change the gen2 VM to gen3, which is not supported, this might be the reason for the start failure.
What is the significance of the VM size being shown under the Other category?
When resizing, HB120rs_v3 is not shown under the H-Series category because the hardware cluster currently used for the HB120rs_v2 VM does not support the HB120rs_v3 VM, so HB120rs_v3 is shown under the Other category rather than under H-Series.
The error message should generally be taken at face value. It means that the platform could not satisfy all the given allocation (placement) constraints together. If you remove one or more constraints (some of which are expressed indirectly as feature/capability specifications in the VM model), the allocation may succeed. For a standalone VM (not in an Availability Set or a VMSS), common highly restrictive allocation constraints are IPv6 and Accelerated Networking, often in relation to the VM size. (But you can be certain the VM size by itself is compatible with Accelerated Networking; otherwise the request would've been rejected earlier, with a more specific error.)
'Spot' priority (referred to as "Preemptible VMs" in the error message) can also be restrictive since there must be ample excess capacity available to allow Spot deployments in a region/zone. A good resource for assessing Spot availability is the eviction history: https://learn.microsoft.com/en-us/azure/virtual-machines/spot-vms#pricing-and-eviction-history.
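To see whether the size itself is restricted for your subscription in that region (as opposed to a transient capacity issue), the az CLI can list the SKU along with any restrictions; a sketch assuming the US East region from the question:
# The Restrictions column flags sizes that are not available to the subscription in this location.
az vm list-skus --location eastus --size Standard_HB120rs_v3 --all --output table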

How to fix CloudRun error 'The request was aborted because there was no available instance'

I'm using managed CloudRun to deploy a container with concurrency=1. Once deployed, I'm firing four long-running requests in parallel.
Most of the time, all works fine, but occasionally I'm facing 500s from one of the nodes within a few seconds; the logs only provide the error message quoted in the subject.
Using retry with exponential back-off did not improve the situation; the retries also ended up with 500s. StackDriver logs also do not provide further information.
Potentially relevant gcloud beta run deploy arguments:
--memory 2Gi --concurrency 1 --timeout 8m --platform managed
What does the error message mean exactly -- and how can I solve the issue?
This error message can appear when the infrastructure didn't scale fast enough to catch up with a traffic spike. The infrastructure only keeps a request in the queue for a certain amount of time (about 10s) and then aborts it.
This usually happens when:
traffic suddenly increases sharply
cold start time is long
request time is long
We also faced this issue when traffic suddenly increased during business hours. The issue is usually caused by a sudden increase in traffic combined with a longer instance start time to accommodate the incoming requests. One way to handle this is to keep warmed-up instances always running, i.e. configure the --min-instances parameter in the Cloud Run deploy command, as sketched below. Another, recommended, way is to reduce the service's cold start time (which is difficult to achieve in some languages like Java and Python).
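A minimal sketch of the min-instances approach; the service name and region are placeholders:
# Keep a couple of warm instances around so spikes don't wait on cold starts.
gcloud run services update SERVICE \
  --platform managed \
  --region us-central1 \
  --min-instances 2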
I also experienced the problem, and it is easy to reproduce. I have a Fibonacci container that takes 6s to compute fibo(45). I used Hey to perform 200 requests, with my Cloud Run concurrency set to 1.
Over the 200 requests I got 8 similar errors. In my case: a sudden traffic spike and a long processing time. (Cold start is short for me, it's in Go.)
I was able to resolve this on my service by raising the max autoscaling container count from 2 to 10 (see the sketch below). There really should be no reason that 2 would be even close to too low for the traffic, but I suspect something about the Cloud Run internals was tying up the 2 containers somehow.
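A hedged sketch of raising that ceiling; the service name and region are placeholders:
# Raise the autoscaling limit so bursts are not queued behind a hard cap of 2 instances.
gcloud run services update SERVICE \
  --platform managed \
  --region us-central1 \
  --max-instances 10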
Setting the Max Retry Attempts to anything but zero will remedy this, as it did for me.

AWS elastic-search. FORBIDDEN/8/index write (api). Unable to write to index

I am trying to dump a list of docs to an AWS Elasticsearch instance. It was running fine; then, all of a sudden, it started throwing this error:
{ _index: '<my index name>',
_type: 'type',
_id: 'record id',
status: 403,
error:
{ type: 'cluster_block_exception',
reason: 'blocked by: [FORBIDDEN/8/index write (api)];' } }
I checked in forums. Most of them say that it is a JVM memory issue: if it goes above 92%, AWS will stop any writes to the cluster/index. However, when I checked, the JVM memory was below 92%. Am I missing something here?
This error is the Amazon ES service actively blocking writes to protect the cluster from reaching red or yellow status. It does this using index.blocks.write.
The two reasons being:
Low Memory
When the JVMMemoryPressure metric exceeds 92% for 30 minutes, Amazon ES triggers a protection mechanism and blocks all write operations to prevent the cluster from reaching red status. When the protection is on, write operations fail with a ClusterBlockException error, new indexes can't be created, and the IndexCreateBlockException error is thrown.
When the JVMMemoryPressure metric returns to 88% or lower for five minutes, the protection is disabled, and write operations to the cluster are unblocked.
Low Disk Space
Elasticsearch has a default "low watermark" of 85%, meaning that once disk usage exceeds 85%, Elasticsearch no longer allocates shards to that node. Elasticsearch also has a default "high watermark" of 90%, at which point it attempts to relocate shards to other nodes.
This error indicates that AWS ElasticSearch has placed a block on your domain based upon disk space. At 85%, ES will not allow you to create any new indexes. At 90%, no new documents can be written.
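To see the watermark thresholds your cluster is actually using, you can ask for the cluster settings with the defaults included; a small sketch (the endpoint is a placeholder):
# Effective disk watermarks; look for cluster.routing.allocation.disk.watermark.low / .high.
curl 'http://<aws_es_endpoint>/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep watermark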
ES can also apply a write block on an index during rollovers, or because of low disk space or memory.
To stop these errors you need to remove the write block on the index by setting index.blocks.write back to false:
curl -X PUT -H "Content-Type: application/json" \
'http://localhost:9200/{index_name}/_settings' \
-d '{ "index": { "blocks": { "write": "false" } } }'
The accepted solution was not enough in my case; I had to remove index.blocks.read_only_allow_delete as well:
PUT /my_index/_settings
{
"index.blocks.read_only_allow_delete": null,
"index.blocks.write": null
}
ES version 7.15
This can also happen if the index you're trying to write to has been marked as read only. I've had it happen due to an Index State Management misconfiguration which caused a weekly index to be moved to a warm state after one day.
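To see which blocks are currently set on an index (whether by ISM or by the service), you can dump its settings as flat keys; a sketch with a placeholder endpoint and index name:
# Any index.blocks.* entries set to true (write, read_only, read_only_allow_delete)
# explain the 403 responses.
curl 'http://<aws_es_endpoint>/my_index/_settings?flat_settings=true&pretty'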

What is reserved memory in YARN and why it shows spike?

I have a doubt about what exactly reserved memory in YARN is.
I do understand that it's YARN's way of balancing the memory requirements raised by the several jobs that are submitted, so that no job goes into starvation mode. It tries to reserve memory for another job as and when it is freed up.
We use AWS EMR for our operations. I have observed that at times, when a memory-intensive job is submitted on our cluster (say a spark-sql job), 1 TB of RAM gets reserved out of our total 3 TB. I have observed this even when just one job is submitted/running on the cluster and no others are waiting or queued up.
This spike in reserved memory is observed intermittently, at times in the range of 5-15 minutes, and then comes down to a manageable level or even 0.
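For reference, the reserved figure above is what the ResourceManager reports; a quick way to pull the same numbers, assuming the default RM web port 8088 on the EMR master node:
# Cluster-wide memory metrics from the ResourceManager REST API.
# The reservedMB, allocatedMB and availableMB fields show the spike described above.
curl -s 'http://<emr-master-dns>:8088/ws/v1/cluster/metrics'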
Can someone please explain whether this is normal behavior or not? If it's normal, please explain it in a little detail. Otherwise, if there is some probable configuration mistake that can trigger this, please help me resolve it.
Note: We have a 10-node r4.8xlarge cluster on EMR 5.0.0.
Thanks in advance
Manish Mehra

A timeout occurred while waiting for memory resources to execute the query in resource pool 'SloDWPool'

I have a series of Azure SQL Data Warehouse databases (for our development/evaluation purposes). Following a recent unplanned extended outage (caused by an issue with the Tenant Ring associated with some of these databases), I decided to resume the canary queries I had been running before but had quiesced for a couple of months due to frequent exceptions.
The canary queries are not running particularly frequently on any specific database, say every 15 minutes. On one database, I've received two indications of issues completing the canary query in 24 hours. The error is:
Msg 110802, Level 16, State 1, Server adwscdev1, Line 1110802;An internal DMS error occurred that caused this operation to fail. Details: A timeout occurred while waiting for memory resources to execute the query in resource pool 'SloDWPool' (2000000007). Rerun the query.
This database is under essentially no load, running at more than 100 DWU.
Other databases on the same logical server may be running under a load, but I have not seen the error on them.
What is the explanation for this error?
Please open a support ticket for this issue; support will have full access to the DMS logs and will be able to see exactly what is going on. This behavior is not expected.
While I agree a support case would be reasonable, I think you should also try scaling up to, say, DWU400 and retrying. I would also consider trying largerc or xlargerc on DWU100 and DWU400, as described in the documentation on resource classes. Note that a larger resource class gets more memory and resources per query.
Run the following then retry your query:
EXEC sp_addrolemember 'largerc', 'yourLoginName'
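If you would rather scale from the command line than the portal, the az CLI's sql dw group can change the service objective; a hedged sketch, assuming that command group is available in your CLI version (the resource group, server and database names are placeholders):
# Scale the warehouse to DW400; more DWU means more memory available to the resource pools.
az sql dw update \
  --resource-group <resource_group> \
  --server <server_name> \
  --name <database_name> \
  --service-objective DW400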