AWS ECS task fails to start because daemon can't create Logstream - amazon-iam

I have two versions of a service running in the same cluster, both using the awslogs log driver.
Logging works fine for v2, but the v1 task fails to start because it can't create a log stream.
The setup is identical between the services except for the container being used.
The log group exists, and the role has permission to create a log stream and put log events; this is pretty much the same setup as for v2, just in a different group.
CannotStartContainerError: Error response from daemon: failed to initialize logging driver: failed to create Cloudwatch log stream: RequestError: send request failed caused by: Post https://logs.eu-west-1-v1.amazonaws.com/: dial tcp: lookup logs.eu-west-1
I set up a new service and tried to spin it up again, but it failed too, so I thought the problem had to do with the container setup.
The official documentation here recommends adding this to the environment variables:
ECS_AVAILABLE_LOGGING_DRIVERS '["json-file","awslogs"]'
After adding this, it still failed. I've been searching on this for a while and would appreciate any help or, preferably, guidance.
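One detail in the error above may be worth checking: the request goes to logs.eu-west-1-v1.amazonaws.com. CloudWatch Logs endpoints normally have the form logs.<region>.amazonaws.com, so the awslogs driver seems to have been given "eu-west-1-v1" as its region, which does not look like a region name. This is a guess from the error text, not a confirmed diagnosis. A minimal Python sketch of that check follows; the helper names and the region regex are illustrative assumptions, not an AWS API.

```python
import re
from urllib.parse import urlparse

# Rough shape of an AWS region name, e.g. "eu-west-1" or "us-gov-west-1".
# This pattern is an approximation chosen for this sketch.
REGION_PATTERN = re.compile(r"^[a-z]{2}(-[a-z]+)+-\d+$")


def region_from_logs_endpoint(url):
    """Return the part between 'logs.' and '.amazonaws.com', or None."""
    host = urlparse(url).hostname or ""
    match = re.match(r"^logs\.(.+)\.amazonaws\.com$", host)
    return match.group(1) if match else None


def looks_like_region(name):
    """True if the string has the usual region-name shape."""
    return bool(REGION_PATTERN.match(name))


region = region_from_logs_endpoint("https://logs.eu-west-1-v1.amazonaws.com/")
print(region, looks_like_region(region))
print("eu-west-1", looks_like_region("eu-west-1"))
```

If the check fails for the value in the v1 task definition's awslogs-region option, comparing that option with the working v2 definition would be a reasonable next step.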

Related

AWS CDK unstable deployment of Lambda CustomResource

I use the CDK to deploy my AWS stack, a Next.js app with an RDS instance. I initialize the database using the CustomResource approach (a Lambda built from a Docker image), as suggested in that article.
Sometimes my deployment fails with this error message:
Received response status [FAILED] from custom resource. Message returned: Connection timed out after 120000ms
I'm sure it's because my database initialization takes too much time: I fill the database with "INSERT INTO" SQL queries that repeat about 5000 times.
Could you advise how to avoid that error? The deployment script is unstable and I can't rely on it. Many thanks.
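If the time is dominated by per-statement round trips (an assumption; the post does not include timings), one option is to group the roughly 5000 single-row inserts into a few multi-row statements. The sketch below only builds the SQL and parameter lists. The table name, column names, batch size, and the "%s" placeholder style (as used by drivers such as psycopg2 or PyMySQL) are hypothetical choices for illustration.

```python
def chunked(rows, size):
    """Yield consecutive slices of `rows`, each with at most `size` items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def build_batched_inserts(table, columns, rows, batch_size=500):
    """Return (sql, params) pairs, one multi-row INSERT per batch."""
    column_list = ", ".join(columns)
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    statements = []
    for batch in chunked(rows, batch_size):
        values_clause = ", ".join([row_placeholder] * len(batch))
        sql = f"INSERT INTO {table} ({column_list}) VALUES {values_clause}"
        # Flatten the batch so the values line up with the placeholders.
        params = [value for row in batch for value in row]
        statements.append((sql, params))
    return statements


# Hypothetical data: 5000 rows for a two-column table.
rows = [(i, f"name-{i}") for i in range(5000)]
statements = build_batched_inserts("items", ["id", "name"], rows)
print(len(statements))  # 10 statements instead of 5000
```

Each pair can then be passed to the driver's cursor.execute(sql, params) inside a single transaction.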

AWS Codebuild Project Unable to communicate with RDS db

I am attempting to have AWS CodeBuild run a Flyway migration. The DB and the CodeBuild project are created via Terraform (the pipeline runs as a GitHub action, if it matters).
That code is here.
I figured this solution would make the difference: AWS CodeBuild fails to interact with RDS instance
When the CodeBuild project is executed by my GitHub workflow (using the aws-actions/aws-codebuild-run-build action), the migration times out:
[Container] 2022/10/07 21:03:56 Running command flyway -user=$DB_USER -password=$DB_PASSWORD -url=jdbc:mariadb://$DB_HOST:$DB_PORT/$DB_NAME -createSchemas=true migrate
ERROR: Unable to obtain connection from database (jdbc:mariadb://***:***/***) for user '***': Could not connect to address=(host=***)(port=***)(type=master) : Socket fail to connect to host:***, port:***. connect timed out
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL State : 08000
Error Code : -1
Message : Could not connect to address=(host=***)(port=***)(type=master) : Socket fail to connect to host:***, port:***. connect timed out
Caused by: java.sql.SQLNonTransientConnectionException: Could not connect to address=(host=***)(port=***)(type=master) : Socket fail to connect to host:***, port:***. connect timed out
Caused by: java.sql.SQLNonTransientConnectionException: Socket fail to connect to host:***, port:***. connect timed out
Caused by: java.net.SocketTimeoutException: connect timed out
This tells me it's some sort of networking problem but I can't put my finger on what route might be missing. No NACLs other than the defaults. Just security groups. I have a similar pipeline in the AWS CDK that works. As near as I can tell, the security groups and IAM permissions are identical, as is the database config itself.
Looking for debugging tips or anything that's missing.
Consider setting the vpc_security_group_ids parameter on your aws_db_instance resource. That collection should include the security group you associated with your CodeBuild project. Currently your database doesn't appear to have an associated security group, so traffic coming from your CodeBuild project isn't whitelisted and cannot get through.
See the Terraform docs.
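As a debugging aid for the advice above, here is a minimal sketch that checks whether any security group attached to the database has an ingress rule admitting a given source group on the database port. It works on plain dictionaries shaped like the "SecurityGroups" list returned by EC2's DescribeSecurityGroups; the group IDs and the port are hypothetical, and fetching the real data (for example with boto3) is left out.

```python
def allows_ingress_from_group(db_security_groups, source_group_id, port):
    """True if some ingress rule admits TCP `port` from `source_group_id`."""
    for group in db_security_groups:
        for permission in group.get("IpPermissions", []):
            protocol = permission.get("IpProtocol")
            if protocol == "-1":
                # "-1" means all protocols and all ports.
                port_matches = True
            elif protocol == "tcp":
                low = permission.get("FromPort", 0)
                high = permission.get("ToPort", 65535)
                port_matches = low <= port <= high
            else:
                port_matches = False
            if not port_matches:
                continue
            pairs = permission.get("UserIdGroupPairs", [])
            if any(pair.get("GroupId") == source_group_id for pair in pairs):
                return True
    return False


# Hypothetical example: the DB group admits 3306 from the CodeBuild group.
db_groups = [{
    "GroupId": "sg-db-example",
    "IpPermissions": [{
        "IpProtocol": "tcp",
        "FromPort": 3306,
        "ToPort": 3306,
        "UserIdGroupPairs": [{"GroupId": "sg-codebuild-example"}],
    }],
}]
print(allows_ingress_from_group(db_groups, "sg-codebuild-example", 3306))
```

An empty list of database groups, or a list with no matching rule, returns False. Note that when the database and the CodeBuild project share one group, that group still needs a rule referencing itself for this check to pass.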

How to find out the reason for a failing elastic beanstalk deployment?

After eb deploy, the environment gets stuck in Health: 'Severe'.
It shows the following warning in recent events:
Environment health has transitioned from Info to Severe. ELB processes
are not healthy on all instances. Application update in progress on 1
instance. 0 out of 1 instance completed (running for 3 minutes). None
of the instances are sending data. ELB health is failing or not
available for all instances.
I'm not able to ssh into the instance: connection reset by peer. (I'm normally able to ssh into the instance without any issues).
The request logs function doesn't work because:
An error occurred requesting logs: Environment named
portal-api-staging is in an invalid state for this operation. Must be
Ready.
Cloudwatch logs only contain the same message from 'recent events'.
How do I figure out why the deployment fails?
The AWS documentation says I should check the logs or ssh into the instance, but neither of those options works.
We found the issue. The deployment was somehow getting stuck and rolled back. Changing the deployment policy to AllAtOnce and disabling RollingUpdateEnabled fixed it.
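For reference, the two settings named in that fix can be expressed as Elastic Beanstalk option settings. The sketch below only builds the list. The namespaces and option names are given as I understand them from the Elastic Beanstalk configuration options reference, and the commented-out call uses the environment name from the error message above; treat both as assumptions to verify against your own environment.

```python
def deployment_fix_option_settings():
    """Option settings for AllAtOnce deployments with rolling updates off."""
    return [
        {
            "Namespace": "aws:elasticbeanstalk:command",
            "OptionName": "DeploymentPolicy",
            "Value": "AllAtOnce",
        },
        {
            "Namespace": "aws:autoscaling:updatepolicy:rollingupdate",
            "OptionName": "RollingUpdateEnabled",
            "Value": "false",
        },
    ]


settings = deployment_fix_option_settings()
for option in settings:
    print(option["Namespace"], option["OptionName"], option["Value"])

# Applying them needs AWS credentials, so it is shown only as a comment:
# import boto3
# boto3.client("elasticbeanstalk").update_environment(
#     EnvironmentName="portal-api-staging",
#     OptionSettings=settings,
# )
```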

AWS fargate tasks won't start reliably

I have an ECS cluster with a bunch of different tasks in it (using the same docker image but with different environment variables).
Some of the tasks come up without problems, but others fail a lot even though I've used the same VPC, subnet and security group. The error message shows ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 3 time(s): RequestError: send request failed caused by: Post https://api.ecr..
The bizarre thing is that the same task sometimes comes up if I create a new task definition, or delete the ECR repository and re-upload the Docker image.
I'm unable to draw any conclusion from this.
Update: strangely, the task starts successfully when I deregister the task definition and recreate it with the same specs. But only once.
It turns out you have to select the taskExecution role for both Task Role - override and Task Execution Role - override in the Advanced Options section of Run Task when starting the task. I don't know why it worked sporadically when I tried things at random, or when I recreated the task definition each time.
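When starting tasks from code rather than the console, the same two overrides can be passed in the "overrides" argument of the ECS RunTask call. The sketch below only assembles the keyword arguments. The cluster name, task definition, role ARN, subnet and security group IDs are hypothetical placeholders, and the public IP setting is an arbitrary choice for this example.

```python
def build_run_task_kwargs(cluster, task_definition, role_arn,
                          subnets, security_groups):
    """Keyword arguments for ECS RunTask with both role overrides set."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "securityGroups": list(security_groups),
                "assignPublicIp": "DISABLED",
            }
        },
        # Mirrors the console's "Task Role - override" and
        # "Task Execution Role - override" fields.
        "overrides": {
            "taskRoleArn": role_arn,
            "executionRoleArn": role_arn,
        },
    }


kwargs = build_run_task_kwargs(
    cluster="example-cluster",
    task_definition="example-task:1",
    role_arn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    subnets=["subnet-example"],
    security_groups=["sg-example"],
)
print(kwargs["overrides"])

# With credentials configured, the call would be:
# import boto3
# boto3.client("ecs").run_task(**kwargs)
```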

AWS CloudWatch sending logs but not custom metrics to CloudWatch

First-time asker.
I've been trying to use AWS CloudWatch to monitor disk usage on an EC2 instance running EC2 Linux. I'm interested in doing this using just the CloudWatch agent, which I installed according to the how-to found here. The install runs fine, and I've made sure to create an IAM role for the instance as described here. Unfortunately, whenever I run amazon-cloudwatch-agent.service it only sends log files and not the custom used_percent measurement I specified. I get this error when I tail the logs:
2021-06-18T15:41:37Z E! WriteToCloudWatch failure, err: RequestError: send request failed
caused by: Post "https://monitoring.us-west-2.amazonaws.com/": dial tcp 172.17.1.25:443: i/o timeout
I've done my best Google-fu but have gotten nowhere so far. If you've got any advice, it would be appreciated.
Thank you
Belated answer to my own question. I had to create a security group that would accept traffic from that same security group!
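A possible reading of why that helped: in the error above, the monitoring endpoint was dialed at 172.17.1.25, which is a private address. That is what you would expect if the endpoint is reached from inside the VPC, for example through an interface VPC endpoint (an assumption; the post does not say), in which case the security group involved has to admit HTTPS from the instance. A small sketch that pulls the dialed address out of the log line and checks whether it is private:

```python
import ipaddress
import re


def dial_address(error_text):
    """Return (ip, port) from a 'dial tcp A.B.C.D:PORT' fragment, or None."""
    match = re.search(r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+)", error_text)
    if not match:
        return None
    return ipaddress.ip_address(match.group(1)), int(match.group(2))


error = ('caused by: Post "https://monitoring.us-west-2.amazonaws.com/": '
         'dial tcp 172.17.1.25:443: i/o timeout')
address, port = dial_address(error)
print(address, port, address.is_private)  # 172.17.1.25 443 True
```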
I had the same issue, and it definitely wasn't a network restriction, as I was still able to telnet to the monitoring endpoint.
From AWS docs: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create-iam-roles-for-cloudwatch-agent.html
One role or user enables CloudWatch agent to be installed on a server
and send metrics to CloudWatch. The other role or user is needed to
store your CloudWatch agent configuration in Systems Manager Parameter
Store. Parameter Store enables multiple servers to use one CloudWatch
agent configuration.
If you're using the default CloudWatch agent configuration wizard, your role may also need the CloudWatchAgentAdminPolicy managed policy (the one the docs attach to the role they call CloudWatchAgentAdminRole) for the agent to connect to the monitoring service.