AWS EKS logging to CloudWatch - how to send logs only, without metrics? - amazon-web-services

I would like to forward the logs of select services running on my EKS cluster to CloudWatch for cluster-independent storage and better observability.
Following the quickstart outlined at https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-EKS-quickstart.html I've managed to get the logs forwarded via the Fluent Bit service, but that has also generated 170 Container Insights metrics channels. Not only are those metrics not required, but they also appear to cost a fair bit.
How can I disable the collection of cluster metrics such as cpu / memory / network / etc, and only keep forwarding container logs to CloudWatch? I'm having a very hard time finding any documentation on this.

I think I figured it out: the cloudwatch-agent DaemonSet from the quickstart guide is what sends the metrics, and it is not required for log forwarding. None of the objects whose names relate to cloudwatch-agent in the quickstart YAML file are needed for log forwarding.
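If you have already applied the full quickstart manifest, the metrics pipeline can be removed by deleting those objects; a rough sketch, assuming the object names and the amazon-cloudwatch namespace used by the quickstart manifest (check with get first):
# List what the quickstart created, then remove only the agent objects; Fluent Bit keeps forwarding logs
kubectl -n amazon-cloudwatch get daemonsets,configmaps,serviceaccounts
kubectl -n amazon-cloudwatch delete daemonset cloudwatch-agent
kubectl -n amazon-cloudwatch delete configmap cwagentconfig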

As suggested by Toms Mikoss, you need to delete the metrics object in your configuration file. This file is the one that you pass to the agent when starting it.
This applies to "on-premises" "linux" installations. I haven't tested this on Windows or on EC2, but I imagine it will be similar. The AWS Documentation here says that you can also distribute the configuration via SSM, but again, I imagine the answer is still applicable.
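If you do distribute the configuration through the SSM Parameter Store instead of a local file, only the -c argument of fetch-config changes; a sketch, assuming a hypothetical parameter named AmazonCloudWatch-linux:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config \
    -m onPremise -s -c ssm:AmazonCloudWatch-linux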
Example of file with metrics:
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "root"
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/nginx.log",
            "log_group_name": "nginx",
            "log_stream_name": "{hostname}"
          }
        ]
      }
    }
  },
  "metrics": {
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_idle",
          "cpu_usage_iowait"
        ],
        "metrics_collection_interval": 60,
        "totalcpu": true
      }
    }
  }
}
Example of file without metrics:
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "root"
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/nginx.log",
            "log_group_name": "nginx",
            "log_stream_name": "{hostname}"
          }
        ]
      }
    }
  }
}
For reference, the command to start the agent on Linux on-premises servers:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config \
-m onPremise -s -c file:configuration-file-path
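Once the agent restarts with the metrics-free file, a quick way to confirm it is running cleanly (assuming the default Linux install path):
# Shows whether the agent is running and its configuration status
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a status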
More details in the AWS Documentation here

Related

AWS ECS does not start new tasks

AWS ECS cluster services do not start new tasks.
Already checked:
ECS EC2 instances are registered, active, full CPU and memory available, ECS agent is connected.
There are no events in the ECS service "Events" tab: nothing about registering, starting, or stopping, no errors; it is just empty.
Registered EC2 instances are set up correctly; in another cluster the same AMI works perfectly.
The task definition is correct; it worked the day before and nothing has changed since.
The service role contains all relevant policies.
Querying ECS with the AWS CLI (aws ecs describe-services --services my-service --cluster my-cluster) shows that the deployment rollout is stuck IN_PROGRESS.
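For reference, narrowing the output down to just the rollout state with a --query filter (same placeholder names) looks roughly like this:
aws ecs describe-services --cluster my-cluster --services my-service \
    --query 'services[0].deployments[].{status:status,rolloutState:rolloutState,runningCount:runningCount}'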
Full response with configuration is here (I've substituted real names and IDs):
{
  "serviceArn": "arn:aws:ecs:eu-central-1:my-account-id:service/my-cluster/my-service",
  "serviceName": "my-service",
  "clusterArn": "arn:aws:ecs:eu-central-1:my-account-id:cluster/my-cluster",
  "loadBalancers": [
    {
      "targetGroupArn": "arn:aws:elasticloadbalancing:eu-central-1:my-account-id:targetgroup/my-service-lb/load-balancer-id",
      "containerName": "my-service",
      "containerPort": 8065
    }
  ],
  "serviceRegistries": [
    {
      "registryArn": "arn:aws:servicediscovery:eu-central-1:my-account-id:service/srv-srv_id",
      "containerName": "my-service",
      "containerPort": 8065
    }
  ],
  "status": "ACTIVE",
  "desiredCount": 1,
  "runningCount": 0,
  "pendingCount": 0,
  "launchType": "EC2",
  "taskDefinition": "arn:aws:ecs:eu-central-1:my-account-id:task-definition/my-service:76",
  "deploymentConfiguration": {
    "deploymentCircuitBreaker": {
      "enable": false,
      "rollback": false
    },
    "maximumPercent": 200,
    "minimumHealthyPercent": 100
  },
  "deployments": [
    {
      "id": "ecs-svc/deployment_id",
      "status": "PRIMARY",
      "taskDefinition": "arn:aws:ecs:eu-central-1:my-account-id:task-definition/my-service:76",
      "desiredCount": 1,
      "pendingCount": 0,
      "runningCount": 0,
      "failedTasks": 0,
      "createdAt": "2022-06-28T09:15:08.241000+02:00",
      "updatedAt": "2022-06-28T09:15:08.241000+02:00",
      "launchType": "EC2",
      "rolloutState": "IN_PROGRESS",
      "rolloutStateReason": "ECS deployment ecs-svc/deployment_id in progress."
    }
  ],
  "roleArn": "arn:aws:iam::my-account-id:role/aws-service-role/ecs.amazonaws.com/AWSServiceRoleForECS",
  "events": [],
  "createdAt": "2022-06-28T09:15:08.241000+02:00",
  "placementConstraints": [],
  "placementStrategy": [
    {
      "type": "spread",
      "field": "attribute:ecs.availability-zone"
    }
  ],
  "healthCheckGracePeriodSeconds": 120,
  "schedulingStrategy": "REPLICA",
  "createdBy": "arn:aws:iam::my-account-id:role/my-role",
  "enableECSManagedTags": false,
  "propagateTags": "NONE",
  "enableExecuteCommand": false
}
The ECS service and the service discovery entry are created using Terraform, and the service definition is:
resource "aws_service_discovery_service" "ecs_discovery_service" {
name = var.service_name
dns_config {
namespace_id = var.service_discovery_hosted_zone_id
dns_records {
ttl = 10
type = "SRV"
}
}
health_check_custom_config {
failure_threshold = 1
}
}
resource "aws_ecs_service" "ecs_service" {
name = var.service_name
cluster = var.ecs_cluster_id
task_definition = var.task_definition_arn
desired_count = var.desired_count
deployment_minimum_healthy_percent = 100
deployment_maximum_percent = 200
health_check_grace_period_seconds = var.health_check_grace_period_seconds
target_group_arn = aws_lb_target_group.target_group.arn
container_name = var.service_name
container_port = var.service_container_port
ordered_placement_strategy {
type = "spread"
field = "attribute:ecs.availability-zone"
}
service_registries {
registry_arn = aws_service_discovery_service.ecs_discovery_service.arn
container_name = var.service_name
container_port = var.service_container_port
}
}
This code used to work fine; without any changes to the infrastructure, after destroying and re-applying the infrastructure code, ECS does not start any new tasks.
I could narrow the problem down to service discovery: if I remove the service_registries section, the tasks start as normal.
Removing service discovery works around the issue, but it is not a proper solution and I don't understand the reason for the problem.
Again, the service role has the permissions for service discovery:
"servicediscovery:DeregisterInstance",
"servicediscovery:Get*",
"servicediscovery:List*",
"servicediscovery:RegisterInstance",
"servicediscovery:UpdateInstanceCustomHealthStatus"
I can't find any way to trace this strange behaviour and want to ask you for help:
Could you give me any hints on what or where I could check? I've checked multiple troubleshooting guides, but all of them rely on events in the ECS service, and I don't have any there; everything else I had in mind has already been checked.
Do you know why service discovery could block ECS from starting new tasks? I thought ECS adds an SRV record to the registry when it starts the container and the container is healthy, but I could not see that any containers have been started at all.
I would be very thankful for any hints, and let me know if you need any details.
Have a nice day and best regards.

AWS Auto Scaling Group does not detect instance is unhealthy from ELB

I’m trying to get an AWS Auto Scaling Group to replace ‘unhealthy’ instances, but I can’t get it to work.
From the console, I've created a Launch Configuration and, from there, an Auto Scaling Group with an Application Load Balancer. I've kept all settings regarding the target group and listeners the same as the default settings. I've selected 'ELB' as an additional health check type for the Auto Scaling Group. I've consciously misconfigured the Launch Configuration to result in 'broken' instances: there is no web server listening on the port configured in the listener.
The Auto Scaling Group seems to be configured correctly and is definitely aware of the load balancer. However, it thinks the instance it has spun up is healthy.
// output of aws autoscaling describe-auto-scaling-groups:
{
  "AutoScalingGroups": [
    {
      "AutoScalingGroupName": "MyAutoScalingGroup",
      "AutoScalingGroupARN": "arn:aws:autoscaling:eu-west-1:<accountId>:autoScalingGroup:3edc728f-0831-46b9-bbcc-16691adc8f44:autoScalingGroupName/MyAutoScalingGroup",
      "LaunchConfigurationName": "MyLaunchConfiguration",
      "MinSize": 1,
      "MaxSize": 3,
      "DesiredCapacity": 1,
      "DefaultCooldown": 300,
      "AvailabilityZones": [
        "eu-west-1b",
        "eu-west-1c",
        "eu-west-1a"
      ],
      "LoadBalancerNames": [],
      "TargetGroupARNs": [
        "arn:aws:elasticloadbalancing:eu-west-1:<accountId>:targetgroup/MyAutoScalingGroup-1/1e36c863abaeb6ff"
      ],
      "HealthCheckType": "ELB",
      "HealthCheckGracePeriod": 300,
      "Instances": [
        {
          "InstanceId": "i-0b589d33100e4e515",
          // ...
          "LifecycleState": "InService",
          "HealthStatus": "Healthy",
          // ...
        }
      ],
      // ...
    }
  ]
}
The load balancer, however, is very much aware that the instance is unhealthy:
// output of aws elbv2 describe-target-health:
{
  "TargetHealthDescriptions": [
    {
      "Target": {
        "Id": "i-0b589d33100e4e515",
        "Port": 80
      },
      "HealthCheckPort": "80",
      "TargetHealth": {
        "State": "unhealthy",
        "Reason": "Target.Timeout",
        "Description": "Request timed out"
      }
    }
  ]
}
Did I just misunderstand the documentation? If not, what else needs to be done to get the Auto Scaling Group to understand that this instance is not healthy and refresh it?
To be clear, when instances are marked unhealthy manually (i.e. using aws autoscaling set-instance-health), they are refreshed as expected.
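For reference, that manual marking is a call along these lines (instance ID taken from the output above):
aws autoscaling set-instance-health --instance-id i-0b589d33100e4e515 --health-status Unhealthy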
Explanation
If you have consciously misconfigured the instance from the start and the ELB health check has never passed, then the Auto Scaling Group does not yet acknowledge that your ELB/Target Group is up and running. See this page of the documentation.
After at least one registered instance passes the health checks, it enters the InService state.
And
If no registered instances pass the health checks (for example, due to a misconfigured health check), ... Amazon EC2 Auto Scaling doesn't terminate and replace the instances.
I configured this from scratch and arrived at the same behavior you described. To verify that this is indeed the root cause, check the Target Group state in the ASG; it is probably Added instead of InService.
[cloudshell-user@ip-10-0-xx-xx ~]$ aws autoscaling describe-load-balancer-target-groups --auto-scaling-group-name test-asg
{
  "LoadBalancerTargetGroups": [
    {
      "LoadBalancerTargetGroupARN": "arn:aws:elasticloadbalancing:us-east-1:xxx:targetgroup/asg-test-1/abc",
      "State": "Added"
    }
  ]
}
Resolution
To achieve the desired behavior, what I did was:
Run a simple web service on port 80. Ensure the Security Group allows the ELB to talk to the EC2 instance (a minimal sketch follows these steps).
Wait until the ELB status is healthy. Ensure the server returns 200; you may need to create an empty index.html just to pass the health check.
Wait until the target group status has become InService in the ASG.
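For Steps 1 and 2, a minimal sketch on an Amazon Linux 2 instance, assuming Apache is acceptable (any server answering 200 on port 80 will do):
# Install and start a web server, and create a page so the target group health check gets a 200
sudo yum install -y httpd
echo "ok" | sudo tee /var/www/html/index.html
sudo systemctl start httpd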
For example, for Step 3:
[cloudshell-user@ip-10-0-xx-xx ~]$ aws autoscaling describe-load-balancer-target-groups --auto-scaling-group-name test-asg
{
  "LoadBalancerTargetGroups": [
    {
      "LoadBalancerTargetGroupARN": "arn:aws:elasticloadbalancing:us-east-1:xxx:targetgroup/test-asg-1-alb/abcdef",
      "State": "InService"
    }
  ]
}
Now that it is in service, turn off the web server and wait. Check often, though, as once the ASG detects the instance is unhealthy it will terminate it.
[cloudshell-user@ip-10-0-xx-xx ~]$ aws autoscaling describe-auto-scaling-groups
{
  "AutoScalingGroups": [
    {
      "AutoScalingGroupName": "test-asg",
      "AutoScalingGroupARN": "arn:aws:autoscaling:us-east-1:xxx:autoScalingGroup:abc-def-ghi:autoScalingGroupName/test-asg",
      ...
      "LoadBalancerNames": [],
      "TargetGroupARNs": [
        "arn:aws:elasticloadbalancing:us-east-1:xxx:targetgroup/test-asg-1-alb/abc"
      ],
      "HealthCheckType": "ELB",
      "HealthCheckGracePeriod": 300,
      "Instances": [
        {
          "InstanceId": "i-04bed6ef3b2000326",
          "InstanceType": "t2.micro",
          "AvailabilityZone": "us-east-1b",
          "LifecycleState": "Terminating",
          "HealthStatus": "Unhealthy",
          "LaunchTemplate": {
            "LaunchTemplateId": "lt-0452c90319362cbc5",
            "LaunchTemplateName": "test-template",
            "Version": "1"
          },
          ...
        },
        ...
      ]
    }
  ]
}

installing Cloudwatch Agent with Terrafom

Does anyone know a way to install CloudWatch agents automatically on EC2 instances while launching them through a launch template/configuration in Terraform?
I have just struggled through the process myself and would have benefited from a clear guide. So here's my attempt to provide one (for Amazon Linux 2 AMI):
Create your CloudWatch agent configuration JSON file, which defines the metrics you want to collect. The easiest way is to SSH onto your EC2 instance and run this command to generate the file using the wizard: sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard. This is what my file looks like; it is the most basic config, which only collects metrics on disk and memory usage every 60 seconds:
{
  "agent": {
    "metrics_collection_interval": 60,
    "region": "eu-west-1",
    "logfile": "/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log",
    "run_as_user": "root"
  },
  "metrics": {
    "metrics_collected": {
      "disk": {
        "measurement": [
          "used_percent"
        ],
        "metrics_collection_interval": 60,
        "resources": [
          "*"
        ]
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ],
        "metrics_collection_interval": 60
      }
    }
  }
}
Create a shell script template file which will run when the EC2 instance is created. This is what mine looks like; it is called userdata.sh.tpl:
Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0
--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/bash
# Install Cloudwatch agent
sudo yum install -y amazon-cloudwatch-agent
# Write Cloudwatch agent configuration file
sudo cat > /opt/aws/amazon-cloudwatch-agent/bin/config.json <<EOF
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "root"
  },
  "metrics": {
    "metrics_collected": {
      "disk": {
        "measurement": [
          "used_percent"
        ],
        "metrics_collection_interval": 60,
        "resources": [
          "*"
        ]
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ],
        "metrics_collection_interval": 60
      }
    }
  }
}
EOF
# Start Cloudwatch agent using the configuration file written above
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json
--==BOUNDARY==--
Create a directory called templates in your terraform module directory and store the userdata.sh.tpl file in there.
Create a data block in the appropriate .tf file as follows:
data "template_file" "user_data" {
template = file("${path.module}/templates/userdata.sh.tpl")
vars = {
...
}
}
In your aws_launch_configuration block, pass in the following value for the user_data variable:
resource "aws_launch_configuration" "example" {
name = "example_server_name"
image_id = data.aws_ami.ubuntu.id
instance_type = "t2.micro"
user_data = data.template_file.user_data.rendered
}
Add the CloudWatchAgentServerPolicy policy to the IAM role used by your EC2 server. This will give your role all the required service-level permissions e.g. "cloudwatch:PutMetricData".
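In Terraform this is typically an aws_iam_role_policy_attachment resource; if you just want to attach the managed policy from the CLI, a sketch (the role name is a placeholder):
aws iam attach-role-policy --role-name my-ec2-instance-role \
    --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy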
Relaunch your EC2 server, and SSH on to check that the CloudWatch agent is installed and running using systemctl status amazon-cloudwatch-agent.service
Navigate to the CloudWatch UI and select Metrics from the left-hand menu. You should see CWAgent in the list of namespaces.
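Alternatively, a quick CLI check that metrics are arriving under that namespace:
aws cloudwatch list-metrics --namespace CWAgent --max-items 5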
Yes, this can be achieved with a Bash script (assuming Linux).
Steps to consider:
Create a UserData.sh file
Use templatefile to link userdata.sh to the launch template
Write the user data to install the AWS CloudWatch agent (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-on-EC2-Instance.html)
Terminate/create the instance
Check the CloudWatch agent is installed, up and running: systemctl status amazon-cloudwatch-agent

aws quicksight create-analysis cli command

We have two different accounts:
one for development
another, the client's prod account
We have CloudFormation templates to deploy resources; during development of new features we first test on dev and then deploy to prod. But with QuickSight it is not so easy: there are no CloudFormation templates for QuickSight. We would need to recreate all reports in the prod account, and doing that manually is very hard. I found the QuickSight API and the create-analysis command, but I don't understand how to create an analysis via this command.
Maybe someone has examples or knows how to create an analysis with the CLI?
Slavik
It's not possible to create an entirely new analysis or dashboard via the API; however, it is possible to promote them across environments via the API. I found the following AWS blog post to be of some use:
AWS QuickSight Blog
Rich
First create an Analysis Template using:
aws quicksight create-template --aws-account-id 123456789123 --cli-input-json file://./create-template.json
Then, to create the analysis, you can use the following JSON (create-analysis-cli-input.json):
{
  "AwsAccountId":"123456789123",
  "AnalysisId":"TestAnalysis",
  "Name":"TestAnalysis-Report",
  "Parameters":{
    "StringParameters":[
      {
        "Name":"Parameters1",
        "Values":[
          "All"
        ]
      },
      {
        "Name":"Parameters2",
        "Values":[
          "All"
        ]
      }
    ],
    "IntegerParameters":[
      {
        "Name":"IntParameter1",
        "Values":[
          0
        ]
      },
      {
        "Name":"IntParameter2",
        "Values":[
          1000
        ]
      }
    ],
    "DateTimeParameters":[
      {
        "Name":"Date1",
        "Values":[
          20160327
        ]
      },
      {
        "Name":"Date2",
        "Values":[
          20160723
        ]
      }
    ]
  },
  "Permissions":[
    {
      "Principal":"arn:aws:quicksight:ap-southeast-2:123456789123:user/default/user-qs",
      "Actions":[
        "quicksight:UpdateDataSourcePermissions",
        "quicksight:DescribeDataSource",
        "quicksight:DescribeDataSourcePermissions",
        "quicksight:PassDataSource",
        "quicksight:UpdateDataSource",
        "quicksight:DeleteDataSource"
      ]
    }
  ],
  "SourceEntity":{
    "SourceTemplate":{
      "DataSetReferences":[
        {
          "DataSetPlaceholder":"Template-SRM-Payments Dataset",
          "DataSetArn":"arn:aws:quicksight:ap-southeast-2:123456789123:dataset/abc"
        },
        {
          "DataSetPlaceholder":"Template-SRM-DailyPayments Dataset",
          "DataSetArn":"arn:aws:quicksight:ap-southeast-2:123456789123:dataset/def"
        },
        {
          "DataSetPlaceholder":"Template-SRM-DateTable Dataset",
          "DataSetArn":"arn:aws:quicksight:ap-southeast-2:123456789123:dataset/ghi"
        }
      ],
      "Arn":"arn:aws:quicksight:ap-southeast-2:123456789123:template/report-template"
    }
  },
  "ThemeArn":"arn:aws:quicksight::aws:theme/SEASIDE",
  "Tags":[
    {
      "Key":"Name",
      "Value":"TestReport"
    }
  ]
}
The CLI command to run is:
aws quicksight create-analysis --aws-account-id 123456789123 --cli-input-json file://./create-analysis-cli-input.json
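To confirm the analysis was created (creation is asynchronous), you can follow up with describe-analysis; a sketch using the same placeholder IDs:
aws quicksight describe-analysis --aws-account-id 123456789123 --analysis-id TestAnalysis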

Autoscaling AWS EMR cluster to 0 nodes

Cross posting from: https://forums.aws.amazon.com/thread.jspa?messageID=766424
Hey,
Trying to apply this policy to a core instance group:
{
  "Constraints": {
    "MinCapacity": 0,
    "MaxCapacity": 2
  },
  "Rules": [
    {
      "Name": "ScaleUp",
      "Action": {
        "Market": "ON_DEMAND",
        "SimpleScalingPolicyConfiguration": {
          "AdjustmentType": "EXACT_CAPACITY",
          "ScalingAdjustment": 5,
          "CoolDown": 300
        }
      },
      "Trigger": {
        "CloudWatchAlarmDefinition": {
          "ComparisonOperator": "GREATER_THAN",
          "MetricName": "AppsPending",
          "Threshold": 0,
          "Period": 300
        }
      }
    },
    {
      "Name": "ScaleDown",
      "Action": {
        "Market": "ON_DEMAND",
        "SimpleScalingPolicyConfiguration": {
          "AdjustmentType": "EXACT_CAPACITY",
          "ScalingAdjustment": 0,
          "CoolDown": 300
        }
      },
      "Trigger": {
        "CloudWatchAlarmDefinition": {
          "ComparisonOperator": "LESS_THAN_OR_EQUAL",
          "MetricName": "AppsRunning",
          "Threshold": 0,
          "Period": 300
        }
      }
    }
  ]
}
But I'm getting this error:
An error occurred (ValidationException) when calling the
PutAutoScalingPolicy operation: Auto Scaling constraint parameter
minCapacity should be at least 1 for Core Instance Group.
I'm no expert in EMR, but from the docs I thought this would be possible (I can create a master-only cluster manually in the UI, so why does this difference exist?). The master node runs a job on a cron schedule; when that kicks in, it generates the work, autoscaling fires up the core instances to process it, and the cluster scales down when the job is done.
Any suggestions?
Thanks, Alex
PS. To clarify the functional requirements: I'm trying to run a Zeppelin dashboard service on the master, have it kick off a batch job every 24h which will need a few nodes, and then downscale back to 0 nodes for the rest of the time. Happy to consider other suggestions to achieve this if I've got the wrong end of the stick.
It's true that you can start a single-node, master-only cluster without any core nodes, but this is a special kind of "cluster" that runs everything on the master. It is not possible to transition from a multi-node cluster to a single-node cluster or vice versa. Because of this, the core instance group has a minimum of 1 instance, even when using autoscaling.
A single-node cluster is not scalable. You need at least one core node along with the master node, so when applying a scaling policy the minimum number of core nodes should be 1.
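A sketch of re-applying the policy once MinCapacity is raised to 1, assuming the JSON above is saved as policy.json and placeholder cluster/instance-group IDs:
# MinCapacity in policy.json must be at least 1 for a core instance group
aws emr put-auto-scaling-policy --cluster-id j-XXXXXXXXXXXXX \
    --instance-group-id ig-XXXXXXXXXXXXX --auto-scaling-policy file://policy.json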
For more details, please refer to the AWS documentation:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-scale-on-demand.html