ElasticSearch Node Failure - amazon-web-services

My Elasticsearch cluster dropped from 2B documents to 900M records. On AWS it shows:
Relocating shards: 4
while showing
Active shards: 35
and
Active primary shards: 34
(Might not be relevant, but here's the rest of the stats:)
Number of nodes: 9
Number of data nodes: 6
Unassigned shards: 17
When running
GET /_cluster/allocation/explain
it returns:
{
  "index": "datauwu",
  "shard": 6,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "NODE_LEFT",
    "at": "2019-10-31T17:02:11.258Z",
    "details": "node_left[removedforsecuritybecimparanoid1]",
    "last_allocation_status": "no_valid_shard_copy"
  },
  "can_allocate": "no_valid_shard_copy",
  "allocate_explanation": "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
  "node_allocation_decisions": [
    {
      "node_id": "removedforsecuritybecimparanoid2",
      "node_name": "removedforsecuritybecimparanoid2",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid3",
      "node_name": "removedforsecuritybecimparanoid3",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid4",
      "node_name": "removedforsecuritybecimparanoid4",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid5",
      "node_name": "removedforsecuritybecimparanoid5",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid6",
      "node_name": "removedforsecuritybecimparanoid6",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid7",
      "node_name": "removedforsecuritybecimparanoid7",
      "node_decision": "no",
      "store": {
        "found": false
      }
    }
  ]
}
I'm a bit confused about what this exactly means. Does it mean my Elasticsearch cluster did not lose data and is instead relocating it onto different shards, or can it not find the shards at all?
If it cannot find the shards, does that mean my data was lost? If so, what could be the reason, and how can I prevent this from happening in the future?
I haven't set up replicas, as I was indexing data and replicas slow indexing down.
Also, as a side note, my record count dropped down to 400M at one point but then rose back up to 900M seemingly at random. I don't know what this means and any insight would be greatly appreciated.

"reason": "NODE_LEFT"
And:
I haven't set up replicas, as I was indexing data and replicas slow indexing down.
If the node holding the primary shards has gone away, then yes, your data is gone. After all, if there are no replicas, then where would the cluster retrieve the data from, if the primary (and only) shards are no longer part of the cluster? You will either need to bring the node holding those shards back up and add it into the cluster, or the data is gone.
The error message is saying "You want me to allocate a primary shard for this index that I know exists, but there used to be another version of that primary shard that can't be found anymore, I won't allocate it again in case the previous primary comes back."
You can force Elasticsearch to reallocate the primary shard (and explicitly accept that the data in the previous primary shard is gone) by performing a reroute with allocate_stale_primary (doc):
curl -H 'Content-Type: application/json' \
  -XPOST '127.0.0.1:9200/_cluster/reroute?pretty' -d '{
  "commands" : [
    {
      "allocate_stale_primary" : {
        "index" : "datauwu",
        "shard" : 6,
        "node" : "target-data-node-id",
        "accept_data_loss" : true
      }
    }
  ]
}'
Turning off replicas for anything but development with disposable data is usually a bad idea.
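If indexing throughput is the worry, a common compromise is to lower the replica count only for the duration of the bulk load and restore it afterwards, rather than running with no replicas at all. A minimal sketch using the Python requests library (the host and replica count are assumptions; the index name is taken from the question):

import requests

ES = "http://127.0.0.1:9200"   # assumed host, same as the reroute example
INDEX = "datauwu"              # index name from the question

# Drop replicas while bulk indexing to speed up writes.
requests.put(f"{ES}/{INDEX}/_settings",
             json={"index": {"number_of_replicas": 0}})

# ... run the bulk load here ...

# Restore at least one replica so losing a single data node
# no longer means losing the only copy of a shard.
requests.put(f"{ES}/{INDEX}/_settings",
             json={"index": {"number_of_replicas": 1}})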
Also, as a side note, my record count dropped down to 400M at one point but then rose back up to 900M seemingly at random. I don't know what this means and any insight would be greatly appreciated.
This happens when shards aren't visible to the cluster, which can occur if all copies of a shard are being allocated, relocated, or recovered at the same time. This corresponds to a RED cluster state. You can mitigate it by ensuring that you have at least 1 replica (though ideally enough replicas to survive the loss of N data nodes in the cluster). This lets Elasticsearch keep one copy serving as the primary while it moves the others around.
If you only have the primary and no replicas, then if a primary is being recovered or relocated, the data in that shard will not be visible in the cluster. Once the shard is active again, the documents in it become visible.
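To see whether the dips in document count line up with shards that are initializing or relocating, you can watch the cluster health and per-shard state; a rough sketch with the Python requests library (the host is an assumption):

import requests

ES = "http://127.0.0.1:9200"  # assumed host

# Overall status: green = everything assigned, yellow = replicas missing,
# red = at least one primary unassigned (its documents are not searchable).
health = requests.get(f"{ES}/_cluster/health").json()
print(health["status"], health["unassigned_shards"])

# Per-shard view: which shards are STARTED, INITIALIZING, RELOCATING
# or UNASSIGNED, and which node holds them.
print(requests.get(f"{ES}/_cat/shards/datauwu?v").text)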

When attempting to recover an unallocated shard with a missing primary using allocate_stale_primary as described by Chris Heald, you might get:
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "No data for shard [0] of index [xyz] found on any node"
}
This means the data is gone unless the missing node rejoins the cluster. Alternatively, you can allocate an empty primary shard in its place using the allocate_empty_primary command:
curl -H 'Content-Type: application/json' \
  -XPOST '127.0.0.1:9200/_cluster/reroute?pretty' -d '{
  "commands" : [
    {
      "allocate_empty_primary" : {
        "index" : "datauwu",
        "shard" : 6,
        "node" : "target-data-node-id",
        "accept_data_loss" : true
      }
    }
  ]
}'
This wipes the data and will overwrite the shard if the missing node later rejoins the cluster.
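After either reroute command, it is worth re-running the allocation explanation for the same shard to confirm the forced allocation took effect; a small check along these lines (the host is an assumption; index and shard are from the question):

import requests

ES = "http://127.0.0.1:9200"  # assumed host

# Once the forced allocation has been applied, current_state should no
# longer be "unassigned" for this shard.
resp = requests.get(
    f"{ES}/_cluster/allocation/explain",
    json={"index": "datauwu", "shard": 6, "primary": True},
)
print(resp.json().get("current_state"))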

Related

ElasticSearch 7.10 AWS Reindexing error es_rejected_execution_exception 429

Both my indices are on the same node. The source has about 200k documents. I'm using AWS and the instance type is "t3.small.search", so 2 vCPUs. I already tried slicing, but it just gives me the same error. Any ideas on what I can do to make this process finish successfully?
{
  "error" : {
    "root_cause" : [
      {
        "type" : "es_rejected_execution_exception",
        "reason" : "rejected execution of coordinating operation [shard_detail=[fulltext][0][C], shard_coordinating_and_primary_bytes=0, shard_operation_bytes=98296362, shard_max_coordinating_and_primary_bytes=105630] OR [node_coordinating_and_primary_bytes=0, node_replica_bytes=0, node_all_bytes=0, node_operation_bytes=98296362, node_max_coordinating_and_primary_bytes=105630924]"
      }
    ],
    "type" : "es_rejected_execution_exception",
    "reason" : "rejected execution of coordinating operation [shard_detail=[fulltext][0][C], shard_coordinating_and_primary_bytes=0, shard_operation_bytes=98296362, shard_max_coordinating_and_primary_bytes=105630] OR [node_coordinating_and_primary_bytes=0, node_replica_bytes=0, node_all_bytes=0, node_operation_bytes=98296362, node_max_coordinating_and_primary_bytes=105630924]"
  },
  "status" : 429
}
I ran into a similar problem. I was trying to reindex a couple of indices that had a lot of documents in them. I raised the JVM heap size from 512 MB to 2 GB and it fixed the problem.
Check the current JVM heap size:
GET {ES_URL}/_cat/nodes?h=heap*&v
Here's how you can change the settings: https://www.elastic.co/guide/en/elasticsearch/reference/current/advanced-configuration.html
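For context, the 429 above means a single coordinating (bulk/reindex) operation was larger than the node's indexing-pressure budget, which as far as I know defaults to roughly 10% of the JVM heap, so raising the heap also raises that budget. The same heap check can be scripted if that's more convenient; a small sketch with the Python requests library (the endpoint is a placeholder):

import requests

ES_URL = "https://your-domain-endpoint"  # placeholder for the AWS domain endpoint

# Same check as the _cat call above: current, percent-used and max heap per node.
resp = requests.get(f"{ES_URL}/_cat/nodes", params={"h": "heap*", "v": "true"})
print(resp.text)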
Hope this helps.

DynamoDB range keys exceeded size limit

When I do a table.put_item I get the error message "Aggregated size of all range keys has exceeded the size limit of 1024". What options do I have so I can save my data?
Change a setting in DynamoDB to allow a larger limit?
Split or compress the item and save to DynamoDB?
Store the item in s3?
Use another kind of database?
Other options?
Here is the specific code snippet:
import boto3

def put_record(item):
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('table_name')
    table.put_item(Item=item)
Here is an example of an item stored in DynamoDB. The two string variables p and r combined could be up to 4000 tokens.
{
  "uuid": {
    "S": "5bf19498-344c"
  },
  "p": {
    "S": "What is the next word after Mary had a"
  },
  "pp": {
    "S": "0"
  },
  "response_length": {
    "S": "632"
  },
  "timestamp": {
    "S": "04/03/2022 06:30:55 AM CST"
  },
  "s": {
    "S": "1"
  },
  "c": {
    "S": "test"
  },
  "f": {
    "S": "0"
  },
  "t": {
    "S": "0.7"
  },
  "to": {
    "S": "1"
  },
  "b": {
    "S": "1"
  },
  "r": {
    "S": "lamb"
  }
}
I read this
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ServiceQuotas.html
and couldn't figure out how the 1024 is calculated, but I'm assuming the two string variables are causing the error.
The put_item doesn't cause an error when the item is smaller; it only fails when the size is larger than the 1024 limit.
It is hard to estimate how many of the saves will be large, but I need to be able to save the large items, so from an architecture perspective I'm willing to consider any and all options.
Appreciate the assistance!
The error message "Aggregated size of all range keys has exceeded the size limit of 1024" is baffling because there can only be one sort key, so what does "aggregate" refer to? The following post was also surprised by this message: https://www.stuffofminsun.com/2019/05/07/dynamodb-keys/.
I am guessing you actually do try to write an item where its "p" (which you said is your sort key) itself is over 1024 characters. I don't see how it's the size of p+r combined that matters. You can take a look (and/or include in the question) at the one specific request that fails, and check what is the length of p itself. Please also double-check that you really set "p", and not something else, as the sort key.
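A quick way to confirm this is to measure the UTF-8 length of "p" right before the write; a hedged sketch around the question's put_record (1024 bytes is DynamoDB's documented limit for a sort key value; the table name is taken from the question):

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('table_name')

def put_record(item):
    # DynamoDB limits a sort key value to 1024 bytes (UTF-8).
    # If "p" is the table's sort key, anything longer is rejected.
    sort_key_bytes = len(item['p'].encode('utf-8'))
    if sort_key_bytes > 1024:
        raise ValueError(
            f"sort key 'p' is {sort_key_bytes} bytes, over the 1024-byte limit")
    table.put_item(Item=item)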
Finally, if you really need sort keys over 1024 characters in length, you can consider Scylla Alternator - an open-source DynamoDB-compatible database which doesn't have this specific limitation.

AWS Elasticsearch performance issue

I have an index which is search-heavy; RPM varies from 15-20k. The issue is that for the first few days the response time of a search query is around 15ms, but it gradually increases and touches ~70ms. Some of the requests start queuing (as per the Search thread pool graph in the AWS console), but there were no rejections. Queuing increases the latency of the search requests.
I learned that queuing happens when there is pressure on a resource. I think I have sufficient CPU and memory; please look at the config below.
I enabled slow query logs but didn't find any anomaly. Even though the average response time is around 16ms, I see a few queries going above 50ms, but there was no issue with those search queries. The number of searchable documents is around 8k.
I need your suggestions on how to improve performance here. The document mapping, search query, and ES config are given below. Is there any issue with the mapping or query?
Mapping:
{
  "data": {
    "mappings": {
      "_doc": {
        "properties": {
          "a": {
            "type": "keyword"
          },
          "b": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Search query:
{
  "size": 5000,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "a": [
              "all",
              "abc"
            ],
            "boost": 1
          }
        },
        {
          "terms": {
            "b": [
              "all",
              123
            ],
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "stored_fields": []
}
I'm using keyword in the mapping and terms in the search query since I want to search for exact values. Boost and adjust_pure_negative are added automatically; from what I read, they should not affect performance.
Index settings:
{
  "data": {
    "settings": {
      "index": {
        "number_of_shards": "1",
        "provided_name": "data",
        "creation_date": "12345678154072",
        "number_of_replicas": "7",
        "uuid": "3asd233Q9KkE-2ndu344",
        "version": {
          "created": "10499"
        }
      }
    }
  }
}
ES config:
Master node instance type: m5.large.search
Master nodes: 3
Data node instance type: m5.2xlarge.search
Data nodes: 8 (8 vcpu, 32 GB memory)

timeout with couchdb mapReduce when database is huge

Details:
Apache CouchDB v. 3.1.1
About 5 GB of Twitter data has been dumped in partitions
The map/reduce function that I have written:
{
  "_id": "_design/Info",
  "_rev": "13-c943aaf3b77b970f4e787be600dd240e",
  "views": {
    "trial-view": {
      "map": "function (doc) {\n emit(doc.account_name, 1);\n}",
      "reduce": "_count"
    }
  },
  "language": "javascript",
  "options": {
    "partitioned": true
  }
}
When I try the following command in Postman:
http://<server_ip>:5984/mydb/_partition/partition1/_design/Info/_view/trial-view?key="BT"&group=true
I am getting following error:
{
  "error": "timeout",
  "reason": "The request could not be processed in a reasonable amount of time."
}
Kindly help me understand how to apply MapReduce to such huge data.
So, I thought of answering my own question after realizing my mistake. The answer is simple: it just needed more time, as building the view index takes a long time. You can check the database metadata to see the data being indexed.
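If you would rather watch the indexer than guess, the design document's _info endpoint reports the state of its view index; a small sketch assuming the database and design document names from the question, CouchDB's default port, and whatever authentication your server requires:

import requests

COUCH = "http://<server_ip>:5984"  # same placeholder host as in the question

# The _info endpoint of a design document reports its view index state;
# fields such as updater_running and update_seq show whether the views
# are still being built after the initial dump.
info = requests.get(f"{COUCH}/mydb/_design/Info/_info",
                    auth=("admin", "password")).json()  # adjust credentials
print(info["view_index"])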

How to specify attributes to return from DynamoDB through AppSync

I have an AppSync pipeline resolver. The first function queries an ElasticSearch database for the DynamoDB keys. The second function queries DynamoDB using the provided keys. This was all working well until I ran into the 1 MB limit of AppSync. Since most of the data is in a few attributes/columns I don't need, I want to limit the results to just the attributes I need.
I tried adding AttributesToGet and ProjectionExpression (from here) but both gave errors like:
{
  "data": {
    "getItems": null
  },
  "errors": [
    {
      "path": [
        "getItems"
      ],
      "data": null,
      "errorType": "MappingTemplate",
      "errorInfo": null,
      "locations": [
        {
          "line": 2,
          "column": 3,
          "sourceName": null
        }
      ],
      "message": "Unsupported element '$[tables][dev-table-name][projectionExpression]'."
    }
  ]
}
My DynamoDB function request mapping template looks like (returns results as long as data is less than 1 MB):
#set($ids = [])
#foreach($pResult in ${ctx.prev.result})
  #set($map = {})
  $util.qr($map.put("id", $util.dynamodb.toString($pResult.id)))
  $util.qr($map.put("ouId", $util.dynamodb.toString($pResult.ouId)))
  $util.qr($ids.add($map))
#end
{
  "version" : "2018-05-29",
  "operation" : "BatchGetItem",
  "tables" : {
    "dev-table-name": {
      "keys": $util.toJson($ids),
      "consistentRead": false
    }
  }
}
I contacted the AWS people who confirmed that ProjectionExpression is not supported currently and that it will be a while before they will get to it.
Instead, I created a lambda to pull the data from DynamoDB.
To limit the results from DynamoDB, I used $ctx.info.selectionSetList in AppSync to get the list of requested columns, then used that list to specify the data to pull from DynamoDB. I needed to get multiple results while maintaining order, so I used BatchGetItem and then merged the results with the original list of IDs using LINQ (which put the DynamoDB results back in the correct order, since BatchGetItem in C# does not preserve sort order the way the AppSync version does).
Because I was using C# with a number of libraries, the cold start time was a little long, so I used Lambda Layers pre-JITed for Linux, which got the cold start time down from ~1.8 seconds to ~1 second (when using 1024 MB of RAM for the Lambda).
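The Lambda in the answer above was written in C#, but the same idea sketched in Python with boto3 looks roughly like this; the event shape, key names, and table name are assumptions based on the question's mapping template, so treat it as an outline rather than a drop-in handler:

import boto3

dynamodb = boto3.resource('dynamodb')
TABLE_NAME = 'dev-table-name'  # table name from the question's template

def handler(event, context):
    # AppSync passes the requested GraphQL fields in info.selectionSetList;
    # turn them into a ProjectionExpression so DynamoDB returns only those.
    fields = list(event['info']['selectionSetList'])
    if 'id' not in fields:
        fields.append('id')  # needed below to restore the original order
    names = {f'#f{i}': field for i, field in enumerate(fields)}
    projection = ', '.join(names)

    keys = [{'id': r['id'], 'ouId': r['ouId']} for r in event['prev']['result']]

    resp = dynamodb.batch_get_item(
        RequestItems={
            TABLE_NAME: {
                'Keys': keys,
                'ProjectionExpression': projection,
                'ExpressionAttributeNames': names,
            }
        }
    )
    items = {item['id']: item for item in resp['Responses'][TABLE_NAME]}

    # BatchGetItem does not preserve order, so put results back in key order.
    return [items.get(k['id']) for k in keys]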
AppSync doesn't support projection but you can explicitly define what fields to return in the response template instead of returning the entire result set.
{
  "id": "$ctx.result.get('id')",
  "name": "$ctx.result.get('name')",
  ...
}