I am using the AWS Elasticsearch Service and am attempting to use a policy to transition indices to UltraWarm storage. However, each time the migration to UltraWarm begins, Kibana displays the error "Failed to start warm migration" for the managed index. The complete error message is below. The "cause" is not very helpful. I am looking for help on how to identify and resolve the root cause of this issue. Thanks!
{
  "cause": "[753f6f14e4f92c962243aec39d5a7c31][10.212.32.199:9300][indices:admin/ultrawarm/migration/warm]",
  "message": "Failed to start warm migration"
}
Intermittently getting the following error when connecting to an AWS keyspace using a lambda layer
All host(s) tried for query failed. First host tried, 3.248.244.53:9142: Host considered as DOWN. See innerErrors.
I am trying to query a table in a keyspace using a Node.js Lambda function as follows:
import cassandra from 'cassandra-driver';
import fs from 'fs';
export default class AmazonKeyspace {
  tpmsClient = null;
  constructor () {
    let auth = new cassandra.auth.PlainTextAuthProvider('cass-user-at-xxxxxxxxxx', 'zzzzzzzzz');
    let sslOptions1 = {
      ca: [ fs.readFileSync('/opt/utils/AmazonRootCA1.pem', 'utf-8')],
      host: 'cassandra.eu-west-1.amazonaws.com',
      rejectUnauthorized: true
    };
    this.tpmsClient = new cassandra.Client({
      contactPoints: ['cassandra.eu-west-1.amazonaws.com'],
      localDataCenter: 'eu-west-1',
      authProvider: auth,
      sslOptions: sslOptions1,
      keyspace: 'tpms',
      protocolOptions: { port: 9142 }
    });
  }
  getOrganisation = async (orgKey) => {
    const SQL = 'select * FROM organisation where organisation_id=?;';
    return new Promise((resolve, reject) => {
      this.tpmsClient.execute(SQL, [orgKey], {prepare: true}, (err, result) => {
        if (!err?.message) resolve(result.rows);
        else reject(err.message);
      });
    });
  };
}
I am basically following this recommended AWS documentation.
https://docs.aws.amazon.com/keyspaces/latest/devguide/using_nodejs_driver.html
It seems that around 10-20% of the time the lambda function (cassandra driver) cannot connect to the endpoint.
I am pretty familiar with Cassandra (I already use a 6 node cluster that I manage) and don't have any issues with that.
Could this be a timeout or do I need more contact points?
I have followed the recommended guides and checked the AWS console for errors, but none are shown.
UPDATE:
Occasionally (about 1 in 50 calls when I invoke the function in parallel with 5 concurrent calls) I get the error below:
"All host(s) tried for query failed. First host tried,
3.248.244.5:9142: DriverError: Socket was closed at Connection.clearAndInvokePending
(/opt/node_modules/cassandra-driver/lib/connection.js:265:15) at
Connection.close
(/opt/node_modules/cassandra-driver/lib/connection.js:618:8) at
TLSSocket.
(/opt/node_modules/cassandra-driver/lib/connection.js:93:10) at
TLSSocket.emit (node:events:525:35)\n at node:net:313:12\n at
TCP.done (node:_tls_wrap:587:7) { info: 'Cassandra Driver Error',
isSocketError: true, coordinator: '3.248.244.5:9142'}
This exception may be caused by throttling on the Keyspaces side, resulting in the DriverError that you are seeing sporadically.
I would suggest taking a look over this repo which should help you to put measures in place to either prevent the occurrence of this issue or at least reveal the true cause of the exception.
For some of the errors you see in the logs, you will need to investigate the Amazon CloudWatch metrics to see whether you have throttling or system errors. I've built this AWS CloudFormation template to deploy a CloudWatch dashboard with all the appropriate metrics. This will provide better observability for your application.
A system error indicates an event that must be resolved by AWS and is often part of normal operations. Activities such as timeouts, server faults, or scaling activity can result in server errors. A user error indicates an event that the user can often resolve, such as an invalid query or exceeding a capacity quota. Amazon Keyspaces passes a system error back as a Cassandra ServerError. In most cases this is a transient error, so you can retry the request until it succeeds. With the Cassandra driver's default retry policy, customers can also experience NoHostAvailableException or AllNodesFailedException, or messages like yours: "All host(s) tried for query failed". This is a client-side exception that is thrown once every host in the load balancing policy's query plan has attempted the request.
Take a look at this retry policy for Node.js, which should help resolve your "All hosts failed" exception or pass back the original exception.
The retry policies in the Cassandra drivers are pretty crude and cannot do more sophisticated things like circuit breaker patterns. You may eventually want to use a "fail fast" retry policy for the driver and handle the exceptions in your application code.
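To make the circuit breaker idea concrete, here is a minimal sketch of one kept in application code. It is written in Python for brevity, and the same structure should carry over to a Node.js wrapper around `execute`. `CircuitBreaker`, `CircuitOpenError`, and the thresholds are hypothetical names and values chosen for illustration; none of them come from a Cassandra driver.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Hypothetical application-level breaker; not part of any Cassandra driver."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds (assumed value)
        self.clock = clock                # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast without touching the backend.
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit again.
        self.failures = 0
        return result
```

With a fail-fast driver policy, each query would go through `breaker.call(...)`, so a burst of throttling errors stops generating further requests until the cool-down has passed.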
I am making API requests to Google Cloud Platform to create disks and get status code 200. But when I check whether the disk is ready, I get "error":{"code":404,"reason":"notFound","domain":"global"}. When I check the Google Cloud logs, I see the error below for the request: "status": { "code": 8, "message": "RATE_LIMIT_EXCEEDED" }. Can anyone suggest possible solutions, such as which exact quota limit should be increased? I have tried a retry mechanism with a pause of about 3 seconds; that reduced how often the error occurs, but the underlying issue is still there.
You can request an increase in the quota allocation using the GCP Console -> IAM & Admin -> Quotas. Find the Compute Engine quota that is showing up as exceeded and click on it to drill down to the specific operation types. I believe you were hitting the "Operation read requests" limit.
You may have hit an operation read request limit.
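Alongside a quota increase, replacing the fixed 3-second pause with exponential backoff and jitter may lower the number of read requests issued while the disk is still being created. The sketch below is only an illustration under assumptions: `poll_until_ready`, `NotReadyError`, and the delay values are hypothetical, and `check` stands in for whatever call you use to read the disk or operation status.

```python
import random
import time


class NotReadyError(Exception):
    """Hypothetical signal that the resource does not exist yet (e.g. HTTP 404)."""


def poll_until_ready(check, max_attempts=6, base_delay=1.0, max_delay=30.0,
                     sleep=time.sleep, rng=random.random):
    """Call check() until it returns, backing off exponentially with full jitter."""
    for attempt in range(max_attempts):
        try:
            return check()
        except NotReadyError:
            if attempt == max_attempts - 1:
                raise  # give up and surface the last error to the caller
            # Upper bound doubles each attempt: 1s, 2s, 4s, ... capped at max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random fraction of the cap so that
            # concurrent pollers do not all retry at the same moment.
            sleep(cap * rng())
```

Here `check` would raise `NotReadyError` on a 404 and return the disk description once it exists; any other exception propagates immediately.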
I use AWS Elasticsearch Service version 7.1 and its built-in Kibana to manage application logs. New indexes are created daily by Logstash. From time to time Logstash reports an error that the maximum shard limit has been reached, and I have to delete old indexes to get it working again.
I found from this document (https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-handling-errors.html) that I have an option to increase _cluster/settings/cluster.max_shards_per_node.
So I tried that by running the following command in Kibana Dev Tools:
PUT /_cluster/settings
{
  "defaults" : {
    "cluster.max_shards_per_node": "2000"
  }
}
But I got this error:
{
  "Message": "Your request: '/_cluster/settings' payload is not allowed."
}
Someone suggested that this error occurs when I try to update a setting that AWS does not allow, but this document (https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-supported-es-operations.html#es_version_7_1) says that cluster.max_shards_per_node is in the allowed list.
Please suggest how to update this setting.
You're almost there. You need to rename defaults to persistent:
PUT /_cluster/settings
{
  "persistent" : {
    "cluster.max_shards_per_node": "2000"
  }
}
Beware though, that the more shards you allow per node, the more resources each node will need and the worse the performance can get.
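To judge how far a given value goes before raising it further, a back-of-the-envelope estimate can help. In Elasticsearch 7.x the cluster-wide cap on open shards is cluster.max_shards_per_node multiplied by the number of data nodes, and replica shards count toward it. The sketch below only does that arithmetic; the function name and the node and shard counts in the usage note are made-up illustration values, not taken from the question.

```python
def days_until_shard_limit(data_nodes, primaries, replicas,
                           max_shards_per_node=1000, existing_shards=0):
    """Estimate how many daily indices fit before the cluster-wide shard cap."""
    # Cluster-wide cap: per-node limit times the number of data nodes.
    limit = data_nodes * max_shards_per_node
    # Each daily index contributes its primaries plus every replica copy.
    shards_per_index = primaries * (1 + replicas)
    return max(0, (limit - existing_shards) // shards_per_index)
```

For a hypothetical 3-node cluster whose daily index has 5 primaries and 1 replica, `days_until_shard_limit(3, 5, 1)` gives the number of daily indices that fit under the default limit of 1000, and passing `max_shards_per_node=2000` doubles that headroom.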
I am running a test job on AWS. I am reading CSV data from an S3 bucket, running a Glue ETL job on it, and storing the same data in Amazon Redshift. The Glue job just reads the data from S3 and stores it in Redshift without any modification. The job runs fine and I get the desired result in Redshift, but it logs an error that I am unable to understand.
Here is the error log:
18/11/14 09:17:31 WARN YarnClient: The GET request failed for the URL http://169.254.76.1:8088/ws/v1/cluster/apps/application_1542186720539_0001
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.HttpHostConnectException: Connect to 169.254.76.1:8088 [/169.254.76.1] failed: Connection refused (Connection refused)
It is a WARN rather than an error, but I want to understand what is causing it. I tried to search for the IP indicated in the WARN, but I am not able to find a machine with that IP.
I noticed the same message coming up in my AWS Glue job, and I found something from AWS that could be helpful:
This WARN message is not special and does not directly indicate a job failure or any error. I would guess there is another cause.
I would recommend that you enable continuous logging and check both the driver and executor logs for any suspicious behavior.
If you have enabled job bookmarks, please try disabling them and see how the job runs without them.
https://forums.aws.amazon.com/thread.jspa?messageID=927547
I had disabled bookmarks from the beginning. What I found was that my Glue job was writing data to S3 and hit a memory exception, so I repartitioned the data:
MyDynamicFrame.coalesce(100).write.partitionBy("month").mode("overwrite").parquet("s3://"+bucket+"/"+path+"/out_data")
So if you have any write operations, I'd recommend checking how you are writing to S3.
After running an AWS Elastic Beanstalk application for a few weeks, I suddenly can't open it. The page simply displays an error that doesn't give much information on how to fix it.
Error
A problem occurred while loading your page: AWS Query failed to deserialize response
(There is no more information, and Googling hasn't turned up an answer either.)
So before upgrading my subscription and paying Amazon a not insignificant amount of money to be able to contact their technical support, I thought I would ask here first whether anyone has encountered this issue.
Thanks for any suggestions.
After receiving this generic error, I was able to dig into the actual error message by using the EB CLI. In my case the CLI threw "ZIP does not support timestamps before 1980".
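That message matches the ValueError raised by Python's zipfile module when a file's modification time predates 1980, which the ZIP date format cannot represent. If you hit the same thing, a small scan of the project directory can point to the offending files. This is a sketch under the assumption that a bad file timestamp in the application bundle is the trigger; `files_older_than_zip_epoch` is a hypothetical helper name.

```python
import os
from datetime import datetime

# ZIP entries store dates in DOS format, which starts at 1980-01-01.
ZIP_EPOCH = datetime(1980, 1, 1).timestamp()


def files_older_than_zip_epoch(root):
    """Return paths under root whose modification time predates 1980-01-01."""
    offenders = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_mtime < ZIP_EPOCH:
                offenders.append(path)
    return sorted(offenders)
```

Resetting the modification time of each reported file to the current time, for example with `os.utime(path, None)`, should let the archive step proceed.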