We have an OpenSearch domain on AWS.
Sometimes the cluster status and the OpenSearch Dashboards health status go yellow for a few minutes, which is fine I guess.
But today the OpenSearch Dashboards health status went red and has stayed there for a few hours now. Everything else works except Dashboards, which returns error 503: {"Message":"Http request timed out connecting"}. The _cluster/health output is still green:
{
"cluster_name" : "779754160511:telemetry",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"discovered_master" : true,
"discovered_cluster_manager" : true,
"active_primary_shards" : 166,
"active_shards" : 332,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
I finally tried moving to an instance type with more RAM, but the Dashboards status is still red.
How can I solve this? Is there a way to restart the domain or debug it in some way?
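For reference, these are the kinds of checks I can run from the command line (the domain endpoint is a placeholder); the first call is where the green output above comes from, and the last one reproduces the 503:
curl -s 'https://my-domain.eu-west-1.es.amazonaws.com/_cluster/health?pretty'
curl -s 'https://my-domain.eu-west-1.es.amazonaws.com/_cat/indices?v&health=red'
curl -sv 'https://my-domain.eu-west-1.es.amazonaws.com/_dashboards/' -o /dev/null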
I tried to follow this example https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-data.html to load data into Neptune:
curl -X POST -H 'Content-Type: application/json' https://endpoint:port/loader -d '
{
"source" : "s3://source.csv",
"format" : "csv",
"iamRoleArn" : "role",
"region" : "region",
"failOnError" : "FALSE",
"parallelism" : "MEDIUM",
"updateSingleCardinalityProperties" : "FALSE",
"queueRequest" : "TRUE"
}'
{
"status" : "200 OK",
"payload" : {
"loadId" : "411ee078-3c44-4620-85ac-e22ef5466bbb"
}
}
I get status 200, but when I then check whether the data was loaded I get this:
curl -G 'https://endpoint:port/loader/411ee078-3c44-4620-85ac-e22ef5466bbb'
{
"status" : "200 OK",
"payload" : {
"feedCount" : [
{
"LOAD_FAILED" : 1
}
],
"overallStatus" : {
"fullUri" : "s3://source.csv",
"runNumber" : 1,
"retryNumber" : 1,
"status" : "LOAD_FAILED",
"totalTimeSpent" : 4,
"startTime" : 1617653964,
"totalRecords" : 10500,
"totalDuplicates" : 0,
"parsingErrors" : 0,
"datatypeMismatchErrors" : 0,
"insertErrors" : 10500
}
}
}
I had no idea why I got LOAD_FAILED, so I used the Get-Status API to see which errors caused the load failure and got this:
curl -X GET 'endpoint:port/loader/411ee078-3c44-4620-85ac-e22ef5466bbb?details=true&errors=true'
{
"status" : "200 OK",
"payload" : {
"feedCount" : [
{
"LOAD_FAILED" : 1
}
],
"overallStatus" : {
"fullUri" : "s3://source.csv",
"runNumber" : 1,
"retryNumber" : 1,
"status" : "LOAD_FAILED",
"totalTimeSpent" : 4,
"startTime" : 1617653964,
"totalRecords" : 10500,
"totalDuplicates" : 0,
"parsingErrors" : 0,
"datatypeMismatchErrors" : 0,
"insertErrors" : 10500
},
"failedFeeds" : [
{
"fullUri" : "s3://source.csv",
"runNumber" : 1,
"retryNumber" : 1,
"status" : "LOAD_FAILED",
"totalTimeSpent" : 1,
"startTime" : 1617653967,
"totalRecords" : 10500,
"totalDuplicates" : 0,
"parsingErrors" : 0,
"datatypeMismatchErrors" : 0,
"insertErrors" : 10500
}
],
"errors" : {
"startIndex" : 1,
"endIndex" : 10,
"loadId" : "411ee078-3c44-4620-85ac-e22ef5466bbb",
"errorLogs" : [
{
"errorCode" : "FROM_OR_TO_VERTEX_ARE_MISSING",
"errorMessage" : "Either from vertex, '1414', or to vertex, '70', is not present.",
"fileName" : "s3://source.csv",
"recordNum" : 0
},
What does this error even mean, and what is a possible fix?
It looks as if you were trying to load some edges. When an edge is loaded, the two vertices that the edge connects must already have been loaded/created. The message:
"errorMessage" : "Either from vertex, '1414', or to vertex, '70', is not present.",
is letting you know that one (or both) of the vertices with the ID values '1414' and '70' is missing. All vertices referenced by a CSV file containing edges must already exist (have been created or loaded) before the edges that reference them are loaded. If the CSV files for the vertices and the edges are in the same S3 location, the bulk loader can work out the order to load them in. If you only ask the loader to load a file containing edges, and the vertices are not yet loaded, you will get an error like the one you shared.
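As a sketch only (the bucket and file names below are placeholders, not taken from the question), you could load a vertex file first and then the edge file, or point the loader at the S3 prefix that contains both so it can order them itself:
curl -X POST -H 'Content-Type: application/json' https://endpoint:port/loader -d '
{
"source" : "s3://my-bucket/vertices.csv",
"format" : "csv",
"iamRoleArn" : "role",
"region" : "region",
"failOnError" : "FALSE"
}'
Then repeat the same call with "source" : "s3://my-bucket/edges.csv", or use "source" : "s3://my-bucket/" so the loader sees both files in one load request.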
I'm trying to restore an Elasticsearch snapshot that was taken from AWS managed Elasticsearch (version 5.6, instance type i3.2xlarge).
While restoring it on a VM, the cluster status immediately went red and all the shards are unassigned.
{
"cluster_name" : "es-cluster",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 8,
"number_of_data_nodes" : 5,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 480,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 0.0
}
When I use the allocation explain API, I get the response below.
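Roughly, the call was something like this (a sketch; with no request body it explains an arbitrary unassigned shard):
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"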
{
"node_id" : "3WEV1tHoRPm6OguKyxp0zg",
"node_name" : "node-1",
"transport_address" : "10.0.0.2:9300",
"node_decision" : "no",
"deciders" : [
{
"decider" : "replica_after_primary_active",
"decision" : "NO",
"explanation" : "primary shard for this replica is not yet active"
},
{
"decider" : "filter",
"decision" : "NO",
"explanation" : "node does not match index setting [index.routing.allocation.include] filters [instance_type:\"i2.2xlarge OR i3.2xlarge\"]"
},
{
"decider" : "throttling",
"decision" : "NO",
"explanation" : "primary shard for this replica is not yet active"
}
]
},
This is strange and something I have never faced before. Anyhow, the snapshot itself completed, so how can I ignore this setting while restoring? I even tried the query below, but I still have the same issue.
curl -X POST "localhost:9200/_snapshot/restore/awsnap/_restore?pretty" -H 'Content-Type: application/json' -d'
{"ignore_index_settings": [
"index.routing.allocation.include"
]
}'
I found the cause and the solution.
Detailed troubleshooting steps are here: https://thedataguy.in/restore-aws-elasticsearch-snapshot-failed-index-settings/
But I'm leaving this answer here so others can benefit from it.
The instance_type allocation filter is an AWS-specific setting, so I ignored it by its full name during the restore:
curl -X POST "localhost:9200/_snapshot/restore/awsnap/_restore?pretty" -H 'Content-Type: application/json' -d'
{"ignore_index_settings": [
"index.routing.allocation.include.instance_type"
]
}
'
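To confirm the restore afterwards, a quick sketch (same local endpoint as above):
# Cluster health should go back to yellow/green once the shards allocate:
curl -X GET "localhost:9200/_cluster/health?pretty"
# And the AWS-only filter should no longer appear in the restored index settings:
curl -s "localhost:9200/_settings?pretty" | grep instance_type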
I have an app using django-wkhtmltopdf. It works locally, but when I run it on a VPS I get the following error:
Command '['wkhtmltopdf', '--allow', 'True',
'--enable-local-file-access', '--encoding', 'utf8',
'--javascript-delay', '2000', '--page-height', '465mm',
'--page-width', '297mm', '--quiet', '/tmp/wkhtmltopdffyknjyrk.html',
'-']' returned non-zero exit status 1.
I'm guessing I need to set some permissions on the VPS to allow the temp file to be created (or set a directory for it), but I'm not sure how to do this.
Within settings.py I have:
WKHTMLTOPDF_CMD_OPTIONS = {
'quiet': True,
'allow' : '/tmp',
'javascript-delay' : 1000,
'enable-local-file-access' : True,
}
And within the Django view I've got:
cmd_options = {
#'window-status' : 'ready',
'javascript-delay': 2000,
'page-height' : '465mm',
'page-width' : '297mm',
'allow' : True,
#'T' : 0, 'R' : 0, 'B' : 0, 'L': 0,
}
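For what it's worth, my plan for debugging this on the VPS is to re-run the same command by hand without --quiet, so the real error is printed (the HTML path below is just a placeholder; '-' writes the PDF to stdout, so it is redirected to a file):
wkhtmltopdf --enable-local-file-access --encoding utf8 \
    --javascript-delay 2000 --page-height 465mm --page-width 297mm \
    /tmp/test.html - > /tmp/test.pdf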
I am using AWS Elasticsearch and the cluster receives ~600 search queries per second. This is causing periodic bursts of 503 Service Unavailable responses from Elasticsearch. As a result, I wanted to turn on query caching for the index (I verified that it is actually turned on by looking at <ES_DOMAIN>/<INDEX_NAME>).
However, when I check the query cache stats at <ES_DOMAIN>/_stats/query_cache?pretty&human, this is what I get:
"<index_name>" : {
"primaries" : {
"query_cache" : {
"memory_size" : "0b",
"memory_size_in_bytes" : 0,
"evictions" : 0,
"hit_count" : 0,
"miss_count" : 0
}
},
"total" : {
"query_cache" : {
"memory_size" : "0b",
"memory_size_in_bytes" : 0,
"evictions" : 0,
"hit_count" : 0,
"miss_count" : 0
}
}
}
Any suggestions on how I can turn the cache on?
Based on my reading and a similar experience (even after setting index.cache.query.enable: true in the index settings), I can only guess that AWS has disabled query caching, probably by setting indices.cache.query.size: 0% in config/elasticsearch.yml.
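For reference, this is roughly how I set it (a sketch; index.cache.query.enable is the 1.x setting name, and newer versions renamed it to index.queries.cache.enabled):
curl -X PUT '<ES_DOMAIN>/<INDEX_NAME>/_settings' -d '
{ "index.cache.query.enable": true }'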
UPDATE
After leaving the cluster running for a while and doing some heavy aggregations, I am seeing that the query_cache is starting to get used, although I am not sure why I am not seeing any cache hits:
GET _nodes/stats/indices/query_cache?pretty&human
{
"cluster_name": "XXXXXXXXXXXX:xxxxxxxxxxx",
"nodes": {
"q59YfHDdRQupousO9vh6KQ": {
"timestamp": 1465589579698,
"name": "Mongoose",
"indices": {
"query_cache": {
"memory_size": "37.2kb",
"memory_size_in_bytes": 38151,
"evictions": 0,
"hit_count": 0,
"miss_count": 45
}
}
},
"K3olMnkkRZW53tTw05UVhA": {
"timestamp": 1465589579692,
"name": "Meggan Braddock",
"indices": {
"query_cache": {
"memory_size": "47.3kb",
"memory_size_in_bytes": 48497,
"evictions": 0,
"hit_count": 0,
"miss_count": 53
}
}
}
}
}
I want to match multiple start strings in mongo. explain() shows that it's using the indexedfield index for this query:
db.mycol.find({indexedfield:/^startstring/,nonindexedfield:/somesubstring/});
However, the following query for multiple start strings is really slow. When I run explain I get an error. Judging by the faults I can see in mongostat (7k a second) it's scanning the entire collection. It's also alternating between 0% locked and 90-95% locked every few seconds.
db.mycol.find({indexedfield:/^(startstring1|startstring2)/,nonindexedfield:/somesubstring/}).explain();
JavaScript execution failed: error: { "$err" : "assertion src/mongo/db/key.cpp:421" } at src/mongo/shell/query.js:L128
Can anyone shed some light on how I can do this or what is causing the explain error?
UPDATE - more info
Ok, so I managed to get explain to work on the more complex query by limiting the number of results. The difference is this:
For a single start string, /^BA1/ (yes, they're postcodes):
"cursor" : "BtreeCursor pc_1 multi",
"isMultiKey" : false,
"n" : 10,
"nscannedObjects" : 10,
"nscanned" : 10,
"nscannedObjectsAllPlans" : 19,
"nscannedAllPlans" : 19,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"indexedfield" : [
[
"BA1",
"BA2"
],
[
/^BA1/,
/^BA1/
]
]
}
For multiple start strings, /^(BA1|BA2)/:
"cursor" : "BtreeCursor pc_1 multi",
"isMultiKey" : false,
"n" : 10,
"nscannedObjects" : 10,
"nscanned" : 1075276,
"nscannedObjectsAllPlans" : 1075285,
"nscannedAllPlans" : 2150551,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 5,
"nChunkSkips" : 0,
"millis" : 4596,
"indexBounds" : {
"indexedfield" : [
[
"",
{
}
],
[
/^(BA1|BA2)/,
/^(BA1|BA2)/
]
]
}
which doesn't look very good.
$or solves the problem in terms of using the indexes (thanks EddieJamsession). Queries are now lightning fast.
db.mycoll.find({$or: [{indexedfield:/^startstring1/},{indexedfield:/^startstring2/}],nonindexedfield:/somesubstring/})
However, I would still like to do this with a single regex if possible, so I'm leaving the question open, not least because I now have to refactor my application to take these types of queries into account.
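If it helps anyone, another form I'm considering (a sketch only, using the placeholder field names from above) is $in with several anchored regexes, which should use the index the same way $or does:
// Sketch: anchored regexes inside $in, so each alternative can use the index
db.mycoll.find({
    indexedfield: { $in: [ /^startstring1/, /^startstring2/ ] },
    nonindexedfield: /somesubstring/
})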