Erroneous Aws::ECS::Errors::ClusterNotFoundException: what is happening?

I have an ECS cluster, an active service for it, and a task for this service. I am trying to call ListTasks with the Ruby AWS SDK.
When there is no active task, the call returns an empty list, as expected. But when there is a running task, I get Aws::ECS::Errors::ClusterNotFoundException.
I tried calling ListClusters and got a successful response:
{:cluster_arns=>["arn:aws:ecs:<region>:<account_num>:cluster/<cluster_name>"], :next_token=>nil}
I also tried calling DescribeClusters and got a successful response as well:
{:clusters=>[{:cluster_arn=>"arn:aws:ecs:<region>:<account_num>:cluster/<cluster_name>", :cluster_name=>"<cluster_name>", :status=>"ACTIVE", :registered_container_instances_count=>0, :running_tasks_count=>1, :pending_tasks_count=>0, :active_services_count=>1, :statistics=>[], :tags=>[], :settings=>[{:name=>"containerInsights", :value=>"enabled"}], :capacity_providers=>["FARGATE_SPOT", "FARGATE"], :default_capacity_provider_strategy=>[{:capacity_provider=>"FARGATE", :weight=>1, :base=>0}], :attachments=>nil, :attachments_status=>nil}], :failures=>[]}
In addition, I regularly call DescribeServices and UpdateService for the same cluster name successfully.
But the error persists for ListTasks.
Has anyone encountered something similar? What do you think is happening?
UPD: The code that generates the error:
require 'aws-sdk-ecs' # assumes the modular v3 SDK gem

# Class-level client (lives inside the enclosing class)
@@ecs_client = Aws::ECS::Client.new(
  region: Aws.config[:region],
  access_key_id: Aws.config[:credentials].access_key_id,
  secret_access_key: Aws.config[:credentials].secret_access_key
)
...
tasks = @@ecs_client.list_tasks({ cluster: '<cluster_name>' })

If you do not specify a cluster when calling the "ListTasks" API, the "default" cluster is assumed. Also, double check the region used in your script.
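One way to act on the region advice is to compare the region embedded in the cluster ARN returned by ListClusters with the region the client is configured for. The sketch below is in JavaScript rather than Ruby and uses only string handling. The helper names regionFromArn and clusterIsInRegion are hypothetical, and the ARN in the usage note is a made-up example.

```javascript
// An ARN has the shape arn:partition:service:region:account-id:resource,
// so the region is the fourth colon-separated field.
function regionFromArn(arn) {
  const parts = String(arn).split(':');
  if (parts.length < 6 || parts[0] !== 'arn') {
    throw new Error(`Not an ARN: ${arn}`);
  }
  return parts[3];
}

// True when the cluster lives in the region the client is pointed at.
function clusterIsInRegion(clusterArn, clientRegion) {
  return regionFromArn(clusterArn) === clientRegion;
}
```

For example, clusterIsInRegion('arn:aws:ecs:eu-west-1:123456789012:cluster/demo', 'us-east-1') returns false, which would point to a client created for the wrong region.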

Related

Cloudsql replication error: The instance or operation is not in an appropriate state to handle the request

I am trying to set up cloud replication for my master/slave db. The master resides on an external VPC and I want to set up a slave in Google Cloud SQL. I have followed the steps here to set up the databases.
They are set up fine and I can see initial replication taking place from my master. The data is synchronized. However, shortly afterwards the replica becomes disabled for replication. I cannot seem to start replication again, and each attempt gives the following error:
The instance or operation is not in an appropriate state to handle the request.
I checked the suggestions from here but that didn't work.
Running gcloud sql instances describe replica-instance1 gives me the following (excerpt):
state: RUNNABLE
replicaConfiguration:
  failoverTarget: false
  kind: sql#replicaConfiguration
I can update if you need more of the results but that all looks fine. Can anyone help?
Edit:
This is in the PostgreSQL logs:
{
  resource: {
    labels: {3}
    type: "cloudsql_database"
  }
  severity: "ERROR"
  textPayload: "2023-01-20 22:10:36.354 UTC [282]: [2-1] db=postgres,user=[unknown] ERROR: data stream ended"
  timestamp: "2023-01-20T22:10:36.354863Z"
}

Errors connecting to AWS Keyspaces using a lambda layer

I am intermittently getting the following error when connecting to an AWS keyspace using a Lambda layer:
All host(s) tried for query failed. First host tried, 3.248.244.53:9142: Host considered as DOWN. See innerErrors.
I am trying to query a table in a keyspace using a nodejs lambda function as follows:
import cassandra from 'cassandra-driver';
import fs from 'fs';

export default class AmazonKeyspace {
  tpmsClient = null;

  constructor () {
    let auth = new cassandra.auth.PlainTextAuthProvider('cass-user-at-xxxxxxxxxx', 'zzzzzzzzz');
    let sslOptions1 = {
      ca: [ fs.readFileSync('/opt/utils/AmazonRootCA1.pem', 'utf-8')],
      host: 'cassandra.eu-west-1.amazonaws.com',
      rejectUnauthorized: true
    };
    this.tpmsClient = new cassandra.Client({
      contactPoints: ['cassandra.eu-west-1.amazonaws.com'],
      localDataCenter: 'eu-west-1',
      authProvider: auth,
      sslOptions: sslOptions1,
      keyspace: 'tpms',
      protocolOptions: { port: 9142 }
    });
  }

  getOrganisation = async (orgKey) => {
    const SQL = 'select * FROM organisation where organisation_id=?;';
    return new Promise((resolve, reject) => {
      this.tpmsClient.execute(SQL, [orgKey], {prepare: true}, (err, result) => {
        if (!err?.message) resolve(result.rows);
        else reject(err.message);
      });
    });
  };
}
I am basically following this recommended AWS documentation.
https://docs.aws.amazon.com/keyspaces/latest/devguide/using_nodejs_driver.html
It seems that around 10-20% of the time the lambda function (cassandra driver) cannot connect to the endpoint.
I am pretty familiar with Cassandra (I already use a 6 node cluster that I manage) and don't have any issues with that.
Could this be a timeout or do I need more contact points?
I followed the recommended guides and checked the AWS console for errors, but none are shown.
UPDATE:
Occasionally (about 1 in 50 calls when I invoke the function with 5 concurrent calls) I get the error below:
All host(s) tried for query failed. First host tried, 3.248.244.5:9142: DriverError: Socket was closed
    at Connection.clearAndInvokePending (/opt/node_modules/cassandra-driver/lib/connection.js:265:15)
    at Connection.close (/opt/node_modules/cassandra-driver/lib/connection.js:618:8)
    at TLSSocket.<anonymous> (/opt/node_modules/cassandra-driver/lib/connection.js:93:10)
    at TLSSocket.emit (node:events:525:35)
    at node:net:313:12
    at TCP.done (node:_tls_wrap:587:7) { info: 'Cassandra Driver Error', isSocketError: true, coordinator: '3.248.244.5:9142'}
This exception may be caused by throttling on the Amazon Keyspaces side, which would result in the driver error that you are seeing sporadically.
I would suggest taking a look at this repo, which should help you put measures in place to either prevent this issue or at least reveal the true cause of the exception.
For some of the errors you see in the logs, you will need to investigate Amazon CloudWatch metrics to see whether you have throttling or system errors. I've built this AWS CloudFormation template to deploy a CloudWatch dashboard with all the appropriate metrics. This will provide better observability for your application.
A System Error indicates an event that must be resolved by AWS and is often part of normal operations. Activities such as timeouts, server faults, or scaling activity could result in server errors. A User Error indicates an event that can often be resolved by the user, such as an invalid query or exceeding a capacity quota. Amazon Keyspaces passes the System Error back as a Cassandra ServerError. In most cases this is a transient error, in which case you can retry your request until it succeeds. Using the Cassandra driver’s default retry policy, customers can also experience NoHostAvailableException or AllNodesFailedException, or messages like yours: "All host(s) tried for query failed". This is a client-side exception that is thrown once all hosts in the load balancing policy’s query plan have attempted the request.
Take a look at this retry policy for Node.js, which should help resolve your "All hosts failed" exception or pass back the original exception.
The retry policies in the Cassandra drivers are pretty crude and cannot do more sophisticated things like circuit breaker patterns. You may eventually want to use a "fail fast" retry policy for the driver and handle the exceptions in your application code.
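Handling the exceptions in application code could look roughly like the following. This is a minimal sketch, not part of cassandra-driver: the names isTransient, backoffDelayMs and withRetry are hypothetical, and the rule for what counts as a transient error (a socket error, or the "All host(s) tried" message) is an assumption based on the errors quoted in the question.

```javascript
// Assumption: socket-level failures and "all hosts failed" are worth retrying.
// Accepts either an Error-like object or a plain message string.
function isTransient(err) {
  if (!err) return false;
  const message = String(err.message || err);
  return err.isSocketError === true ||
    message.includes('All host(s) tried for query failed');
}

// Exponential backoff capped at maxDelayMs: base, 2*base, 4*base, ...
function backoffDelayMs(attempt, baseDelayMs = 100, maxDelayMs = 2000) {
  return Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run an async operation, retrying only failures that look transient.
async function withRetry(operation, maxAttempts = 3) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      const lastAttempt = attempt + 1 >= maxAttempts;
      // Give up and surface the original error to the caller.
      if (lastAttempt || !isTransient(err)) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

With the class from the question, a call might be wrapped as withRetry(() => keyspace.getOrganisation(orgKey)), where keyspace is an AmazonKeyspace instance. Non-transient errors such as an invalid query are rethrown immediately rather than retried.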

Scheduled Cloud build trigger throws 404 NOT_FOUND error

I recently created a scheduled trigger by following this Google page. But when I did a test run from the Scheduler's interface, the result was a NOT_FOUND error:
{
  @type: "type.googleapis.com/google.cloud.scheduler.logging.AttemptFinished"
  jobName: "projects/myproject/locations/australia-southeast1/jobs/trigger-schedule"
  status: "NOT_FOUND"
  targetType: "HTTP"
  url: "https://cloudbuild.googleapis.com/v1/projects/myproject/triggers/ca55b01d-f4e6-4b8b-b92b-b2e4f380788c:run"
}
I was worried about the location, which is App Engine related. Even though there are no instances, the location shows as australia-southeast1, which is correct.
What could be the cause of the error? And what exactly was not found: the job definition or the target?
After running gcloud beta builds triggers run TRIGGER, which is what the scheduled job runs, I found that cloudbuild.yaml does not exist in the targeted branch.
First, I wish the error in the scheduler were more meaningful and included some details.
Second, all triggers have conditions that determine when they fire. Maybe the POST HTTP call to the trigger could allow an empty body to use the default condition. In my case, the condition defined in the trigger was branch = test, while my scheduled job definition used branch = master. This mismatch caused the problem.
Hope this helps others debug scheduled triggers.
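The branch mismatch described above can be checked before the job runs. This is a minimal sketch, assuming the scheduler job body is a JSON object with a branchName field and that the trigger's branch condition is a regular expression. The function name branchMatchesTrigger is hypothetical, and JavaScript's RegExp is only an approximation of the regex dialect Cloud Build uses.

```javascript
// Returns true when the branch named in the scheduler job body
// satisfies the branch pattern configured on the trigger.
function branchMatchesTrigger(schedulerBody, triggerBranchPattern) {
  const { branchName } = JSON.parse(schedulerBody);
  if (!branchName) return false; // an empty body names no branch at all
  // Anchor the pattern so that "test" does not match "test-2" (assumption).
  return new RegExp(`^(?:${triggerBranchPattern})$`).test(branchName);
}
```

With the values from this answer, branchMatchesTrigger('{"branchName":"master"}', 'test') returns false, which flags the mismatch.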

Running amplify init in the terminal gives an error

I am getting this error while running amplify init. My main goal is to develop authentication through AWS Cognito, which uses aws-amplify:
? Do you want to use an AWS profile? Yes
? Please choose the profile you want to use default
init failed
Error: read ECONNRESET
    at TLSWrap.onStreamRead (internal/stream_base_commons.js:205:27) {
  message: 'read ECONNRESET',
  errno: 'ECONNRESET',
  code: 'NetworkingError',
  syscall: 'read',
  region: 'us-east-1',
  hostname: 'amplify.us-east-1.amazonaws.com',
  retryable: true,
  time: 2020-04-16T12:09:59.975Z
}
You may try the following strategies to eliminate the problem you are facing:
This looks more like a network problem, judging by the logs from your terminal. If you have a jittery connection, I recommend trying again on a stable internet connection.
I recommend running amplify delete in case there is some misconfiguration left over from the last time you ran amplify init, although the chances of this are low.
Check your AWS environment variables or configuration file; the credentials for your AWS account may be missing. Try running aws configure and resetting the values of your key, secret, and region.
I hope the above suggestions help.

How to run docker task with Amazon ECS - getting error `STOPPED (CannotStartContainerError: Error response from dae)`

My goal is to execute a benchmark deployed as a docker image. While doing so, I had too many issues, so I decided to first make something extremely trivial work.
So I decided to follow the guide in https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-task-definition.html
and use the "ping" example - it should just ping a domain a couple of times, and stop.
The problem is, I always receive this message in the task status:
STOPPED (CannotStartContainerError: Error response from dae)
I tried it with various subnets and security groups, but the result is always the same - the task starts, and after a minute or two fails with the message above.
I even tried it on a fresh new AWS account, using these steps:
in https://us-east-2.console.aws.amazon.com/ecs/ created new cluster (networking only)
in task definitions, created a taskdef
with docker image alpine:latest, command ping -c 4 google.com
then I select the cluster, switch to "tasks" tab, and enter the run dialog
with one of pre-created subnets
After executing:
the task appears in the cluster's tasks list in PENDING state
it takes a couple of minutes
eventually (using refresh button), it changes to the mentioned message - STOPPED (CannotStartContainerError: Error response from dae)
My guess is that the reason is:
either the task cannot download the image
or the instance cannot reach outside net
What can I be doing wrong? How to fix?
In my case too, the log group was the problem. The one I had configured wasn't working, so I enabled the "Auto-configure CloudWatch Logs" option in the "Log Configuration" of the container settings.
Also, if you open the stopped task, navigate to the container section, and expand it, you can see a detailed error message under the Details section.
It could be a problem with the entry point in the task definition, as pointed out in the comments on the question: Entrypoint: ["sh","-c"]
It could also be a bad reference, for example a wrong log group in the LogConfiguration or something similar.
I just created the log group in my CloudWatch console because it had not been created, and now everything works.
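Several of these answers point at the log configuration. Below is a minimal sketch of a pre-flight check, assuming the container definition uses the awslogs log driver with the usual option keys (awslogs-group, awslogs-region, awslogs-stream-prefix). The function name missingAwslogsOptions and the sample values are hypothetical. It only checks that the options are present, not that the log group actually exists in CloudWatch.

```javascript
// Option keys assumed to be needed for the awslogs driver on Fargate.
const REQUIRED_AWSLOGS_OPTIONS = [
  'awslogs-group',
  'awslogs-region',
  'awslogs-stream-prefix',
];

// Returns the option keys that are absent or empty in one container definition.
// Containers with no log configuration or a different driver yield an empty list.
function missingAwslogsOptions(containerDefinition) {
  const logConfig = containerDefinition.logConfiguration;
  if (!logConfig || logConfig.logDriver !== 'awslogs') return [];
  const options = logConfig.options || {};
  return REQUIRED_AWSLOGS_OPTIONS.filter((key) => !options[key]);
}
```

A container definition whose options contain only awslogs-group would be reported as missing awslogs-region and awslogs-stream-prefix. Whether the named group exists still has to be confirmed in the CloudWatch console, as the answers above describe.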