Kibana health status is RED - amazon-web-services

I am using AWS ELK (Amazon managed Elasticsearch) and my Kibana health status is red. When I browse to the Kibana URL I get "Kibana server is not ready yet".
I have tried to fix the problem without luck. I think it all started when I changed my ELK settings from 1 Availability Zone with 1 instance to 2 Availability Zones; another possibility is that I streamed a large amount of data in the last day.
While trying to fix the problem I returned to 1 Availability Zone with 1 instance, but that didn't fix it.
I also enabled the error logs and saw the following in CloudWatch:
"publishing cluster state with version [68816] failed for the following nodes"
"failed to connect to node"
Any help solving this problem would be appreciated.
More info (about my current setup):
Domain status:Active
Elasticsearch version: 6.7
Availability zones:1
Instance type:r5.large.elasticsearch
Number of instances:1
Storage type:EBS
EBS volume type:General Purpose (SSD)
EBS volume size:1000 GB
Encryption at rest:Disabled
Node-to-node encryption:Disabled
Amazon Cognito for authentication:Disabled
Service software release:R20190724-P1
In the cluster health tab of the domain I can see:
Cluster status:green
MasterReachableFromNode:green
AutomatedSnapshotFailure:green
KibanaHealthyNodes:red
Also, about 60% of ElasticsearchRequests are InvalidHostHeaderRequests (but I guess that is unrelated). Other metrics:
CPUUtilization: about 8%
JVMMemoryPressure: about 20%
SysMemoryUtilization: 98%

KibanaHealthyNodes is red, so Kibana is possibly down. Have you upgraded to AWS Elasticsearch v6.7 recently? It looks like Kibana needs to be restarted on the Elasticsearch cluster, which the AWS support team can help you with. If you don't have a support plan, you could post on the AWS forum, where someone from AWS may take a look and assist you.
InvalidHostHeaderRequests will not cause the issue with Kibana. AWS ES throws this error when your application sends requests to the IP addresses of the nodes. Please check that your requests use the domain endpoint; otherwise this error will keep coming up.
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-managedomains.html
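To illustrate the point about node IPs versus the domain endpoint, here is a minimal Python sketch that flags request URLs whose host is a literal IP address. The function name and the example endpoint are invented for illustration; this only inspects the URL and says nothing about how AWS ES itself validates the Host header.

```python
import ipaddress
from urllib.parse import urlsplit


def targets_ip_address(url: str) -> bool:
    """Return True if the URL's host is a literal IP address, not a DNS name."""
    host = urlsplit(url).hostname
    if host is None:
        return False  # no host could be parsed from the string
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal, so presumably a hostname
    return True


# Hypothetical URLs: the IP and the endpoint name below are made up.
for candidate in (
    "https://10.0.12.34/_cluster/health",
    "https://search-example-abc123.us-east-1.es.amazonaws.com/_cluster/health",
):
    label = "node IP (likely rejected)" if targets_ip_address(candidate) else "hostname"
    print(candidate, "->", label)
```

Running a check like this over the URLs your clients are configured with may help locate which application is responsible for the InvalidHostHeaderRequests count.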

Related

AWS EC2 Instance Troubleshooting SSM Agent Ping

We manage an EC2 instance (Windows) that hosts our on-premises Power BI Gateway. For the last few days, we have been unable to RDP into the instance. When we check Fleet Manager, we see that "SSM Agent Ping Status" is Connection Lost.
After an instance reboot, the issue resolves itself. This has happened daily for three days now. I have verified that the SSM Agent version is not outdated. The version currently installed on the instance is 3.1.1575, which per the releases doc (below) was released on June 6, 2022 (seems fairly recent). One more recent version was released on 7-14-2022:
https://github.com/aws/amazon-ssm-agent/releases
I looked into the "Connection Lost" status and saw the following:
"If an instance fails a health check, AWS OpsWorks Stacks autoheals registered Amazon EC2 instances and changes the status of registered on-premises instances to connection lost."
This confuses me: I don't recall registering the instance with AWS OpsWorks Stacks, unless this happens automatically.
Is there any way to troubleshoot this and get to the bottom of it? We can reboot the instance as a temporary workaround, but I'd like to understand what's causing the issue. Thank you!

EKS Random "Error: ErrImagePull" "i/o timeout" when pulling Images

Running AWS "Managed Nodes" for an EKS Cluster across 2 AZ's.
3 Nodes in total. I get random timeouts when attempting to pull the containers down.
This has been so hard to trace because it does work (sometimes), so it's not like an ACL is blocking or a security group.
When I ssh into the nodes, sometimes I can pull down the image manually and sometimes I cannot. When I run curl -I https://hub.docker.com, it sometimes takes 2 minutes to get a response back. I'm guessing this is why the image pulls are timing out.
I don't know of a way to increase the timeout for k8s to pull the image, but also can't figure out why the latency is so bad in doing the curl request.
Any suggestions are greatly appreciated.
FYI, worker nodes in Private Subnet, proper routes to NAT Gateway in place. VPC Flow logs are good.
Random is the hardest thing to trace 🤷.
🥼 You could move your images to a private ECR registry, or simply run a registry in your cluster, to rule out an issue with your Kubernetes networking. Are you running the AWS CNI❓
It could also just be rate limiting from Docker Hub itself. Are you using the same external NAT IP address to pull from multiple nodes/clusters❓:
Docker will gradually impose download rate limits with an eventual limit of 300 downloads per six hours for anonymous users.
Logged in users will not be affected at this time. Therefore, we recommend that you log into Docker Hub as an authenticated user. For more information, see the following section How do I authenticate pull requests.
✌️
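If rate limiting is the suspect, Docker Hub's documentation describes `ratelimit-limit` and `ratelimit-remaining` response headers whose values look like `100;w=21600` (a count plus a window in seconds). The sketch below only parses such a value; fetching the headers from the registry is left out, the function name is my own, and the sample values are made up rather than taken from a real response.

```python
from typing import Optional, Tuple


def parse_ratelimit_header(value: str) -> Tuple[int, Optional[int]]:
    """Split a value such as '100;w=21600' into (count, window_seconds).

    The window is None when no 'w=' parameter is present.
    """
    count_part, _, params = value.partition(";")
    window = None
    for param in params.split(";"):
        key, _, val = param.strip().partition("=")
        if key == "w" and val.isdigit():
            window = int(val)
    return int(count_part.strip()), window


# Made-up sample values in the documented format.
limit, window = parse_ratelimit_header("100;w=21600")
remaining, _ = parse_ratelimit_header("3;w=21600")
print(f"{remaining} of {limit} pulls left in a {window}-second window")
```

A remaining count near zero on the NAT gateway's shared IP would point at rate limiting rather than at cluster networking.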

I want to send metric alert (in group setting) of AWS instance with stackdriver monitoring

My question is about the settings for monitoring AWS metrics with Stackdriver.
I tried the steps below, but the alert (policy) is not working.
How do I send an alert (policy) with group settings?
I don't want single-instance monitoring; I want group settings.
I completed the Stackdriver monitoring setup for the AWS account using role settings. Next, I configured a group alert (policy) on the metrics below.
load average > 5
disk usage > 80%
The targets are several EC2 instances, which make up the group.
After completing these settings, I ran a stress test.
I looked at the metrics, and the graph exceeded the threshold.
But no alert (policy) was sent, and no incidents were opened.
Details are below.
Alert(Policy) Creation
Go to [Alerting/ Policies/ TARGET POLICY]
[Add Condition], then select [Metric Threshold]
RESOURCE TYPE is Instance(EC2)
APPLIES TO is Group
Select the group. This group includes the EC2 instances.
CONDITION TRIGGERS IF: Any Member Violates
IF METRIC is [CPU Load Average(past 1m)]
CONDITION is above
THRESHOLD is 5 load
FOR is 1 minute
Enter a name and push [Save Policy]
Stress Test
SSH to the target instances.
Execute the stress test.
Confirm that the load average rose above 5.
But no alert (policy) was sent.
Confirm in Stackdriver
Confirm on the alert settings page that the load average rose above 5.
But no incidents were opened.
Other settings I tried
For GCP instances, alerts work correctly with both group and single-instance settings.
For AWS instances, alerts work in a single-instance configuration, but not with group settings.
Version info
stackdriver
stackdriver-agent version: stackdriver-agent.x86_64 5.5.2-366.amzn1
aws
OS: Amazon Linux
VERSION: 2016.03
ID_LIKE: rhel fedora
Please comment if you need more detail.
If the agent wasn't configured correctly and is sending metrics to the wrong project, this could lead to the behavior described: alerts work for single instances but not for a group of instances. It might work for GCP because monitoring GCE instances requires zero setup. A wrong project causes any alerts that use group filters to not work.
https://cloud.google.com/monitoring/agent/troubleshooting#verify-project
"If you are using an Amazon EC2 VM instance, or if you are using private-key credentials on your Google Compute Engine instance, then the credentials could be invalid or they could be from the wrong project. For AWS accounts, the project used by the agent must be the AWS connector project, typically named "AWS Link..."."
These instructions at https://cloud.google.com/monitoring/agent/troubleshooting#verify-running help verify that agent is sending metrics correctly.

Failure to launch Amazon EC2 non free instances

I need to launch an Amazon EC2 instance, in particular a GPU-accelerated one. I already tried the free tier using t2.micro instances and everything is fine. When I try to select a non-free one such as g2.2xlarge I get this error:
Launch Failed
You have requested more instances (1) than your current instance limit of 0 allows for the specified instance type. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
(Service: AmazonEC2; Status Code: 400; Error Code: InstanceLimitExceeded; Request ID: 4ebf71ee-e927-42c2-8377-697a3a6cfd4b)
I'm trying to use a machine with the Deep Learning AMI Ubuntu Version (but I also tried other ones). I get this error even though I'm not running any other instance, and according to the documentation the limit for this instance type is 5 at a time.
I have also tried to select different regions (my country is not among the choices) but it doesn't seem to change the result.
My only guess about this issue is that somehow I'm registered as a free user and I'm not allowed to use the priced services, but I'm not so sure about that.
Edit: I have a credit card on file on Amazon (they require it to register) and they should charge me from that.
Am I missing something?
Every Amazon account has limits, even big corporate accounts. These limits are set by Amazon, but you can request a limit increase. You can find your limits by clicking the Limits link on the top left-hand side of the EC2 Dashboard.
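For example, if an instance type has a limit of 1 and you already have one instance of that type running, you would not be able to launch another, since your limit of 1 has already been reached. Note that the error in the question reports a limit of 0 for the specified instance type, so even the first launch is rejected until the limit is raised.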
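The two numbers that matter in the error from the question are the requested count and the current limit. As a small sketch, assuming the message wording stays exactly as quoted in the question (which AWS does not guarantee), they can be pulled out in Python like this; the function name is my own.

```python
import re
from typing import Optional, Tuple

# Pattern based only on the message text quoted in the question.
_LIMIT_RE = re.compile(
    r"requested more instances \((\d+)\) than your current instance limit of (\d+)"
)


def parse_instance_limit_error(message: str) -> Optional[Tuple[int, int]]:
    """Return (requested, limit) from an InstanceLimitExceeded message, or None."""
    match = _LIMIT_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


message = (
    "You have requested more instances (1) than your current instance limit "
    "of 0 allows for the specified instance type."
)
parsed = parse_instance_limit_error(message)
if parsed is not None:
    requested, limit = parsed
    print(f"requested={requested}, limit={limit}, increase needed={requested > limit}")
```

A limit of 0 means a limit increase request is needed before any instance of that type can be launched, regardless of what else is running.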
More Info:
How do I manage my AWS service limits?
AWS FAQ Overview
Q: How many instances can I run in Amazon EC2?
To request a limit increase, submit a support request through the AWS Support Center
Yes, you must check your usage limits. More info is in the FAQ or the Limits section. Amazon is not very clear on this.
Amazon reference Forum

AWS: None of the Instances are sending data

I'm trying to set up an Elastic Beanstalk application with Amazon Web Services, but I'm receiving a load of errors with the message "None of the instances are sending data". I've tried deleting the Elastic Beanstalk application and the EC2 instance several times and trying again with the sample application, but I get the same error.
I also tried uploading a Flask application with the AWS Elastic Beanstalk command line tools, but then I received the error below:
Environment health has transitioned from Pending to Severe. 100.0 % of the requests to the ELB are failing with HTTP 5xx. Insufficient request rate (0.5 requests/min) to determine application health (7 minutes ago). ELB health is failing or not available for all instances. None of the instances are sending data
Why do I get this error and how do I fix it? Thanks.
You are using Enhanced Health Monitoring.
With enhanced health monitoring an agent installed on your EC2 instance monitors vital system and application level health metrics and sends them directly to Elastic Beanstalk.
When you see an error message like "None of the instances are sending data", it means either the agent on the instance has crashed or it is unable to post data to Elastic Beanstalk due to networking error or some other error.
For debugging this, I would recommend downloading "Full logs" from the AWS console. You can follow the instructions for getting logs in the section "Downloading Bundle Logs from Elastic Beanstalk Console" here.
If you are unable to download logs using the console for any reason you can also ssh to the instance and look at the logs in /var/log.
You will find logs for the health agent in /var/log/healthd/daemon.log.
Additional logs useful for this situation are /var/log/cfn-init.log, /var/log/eb-cfn-init.log and /var/log/eb-activity.log. Can you look at the logs and give more details of the errors you see?
This should hopefully give you more details regarding the error "None of the instances are sending data".
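When the logs are long, a quick filter can help narrow things down. The following is a rough Python sketch that keeps only lines containing a few failure keywords; the keyword list and function name are my own choices, not anything Elastic Beanstalk defines, and the sample lines are the two messages quoted elsewhere in this thread plus one invented benign line.

```python
from typing import Iterable, List

# Keywords chosen for illustration; adjust to whatever your logs contain.
ERROR_MARKERS = ("failed", "error", "exception")


def find_error_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that contain any marker, compared case-insensitively."""
    hits = []
    for line in lines:
        lowered = line.lower()
        if any(marker in lowered for marker in ERROR_MARKERS):
            hits.append(line.rstrip("\n"))
    return hits


sample = [
    "healthd daemon started\n",  # invented benign line
    "sending message(s) failed: (Aws::Healthd::Errors::GroupNotFoundException) "
    "Group 97c30ca2-5eb5-40af-8f9a-eb3074622172 does not exist\n",
    "ntpdate: Synchronizing with time server: [FAILED]\n",
]
for hit in find_error_lines(sample):
    print(hit)

# On an instance you could pass an open file instead, for example:
#   with open("/var/log/healthd/daemon.log") as fh:
#       print("\n".join(find_error_lines(fh)))
```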
Regarding other health "causes" you are seeing:
Environment health has transitioned from Pending to Severe - This is because your environment health status is initially Pending. If the instances do not become healthy within the grace period, the health status transitions to Severe. In your case, since none of the instances is healthy or sending data, the health transitioned to Severe.
100.0 % of the requests to the ELB are failing with HTTP 5xx. Insufficient request rate (0.5 requests/min) to determine application health (7 minutes ago).
Elastic Beanstalk monitors other resources in addition to your EC2 instances when using enhanced health monitoring. For example, it monitors CloudWatch metrics for your ELB. This error means that all requests sent to your environment CNAME/load balancer are failing with HTTP 5xx errors. At the same time, the request rate is very low, only 0.5 requests per minute, so even though all requests are failing, there are too few of them to determine application health. "7 minutes ago" means that the information about ELB metrics is slightly old: Elastic Beanstalk checks CloudWatch metrics every few minutes, so the data can be slightly stale. This is in contrast to the health data we get directly from the EC2 instances, which is "near real time". In your case, since the instances are not sending data, the only available source for health is the ELB metrics, which are delayed by about 7 minutes.
ELB health is failing or not available for all instances
Elastic Beanstalk is looking at the health of your ELB, i.e. it is checking how many instances are in service behind ELB. In your case either all instances behind ELB are out of service or the health is not available for some other reason. You should double check that your service role is correctly configured. You can read how to configure service role correctly here or in the documentation. It is possible that your application failed to start.
In your case I would suggest focusing on the first error "None of the instances are sending data". For this you need to look at the logs as outlined above. Let me know what you see in the logs. The agent is started fairly early in the bootstrap process on the instance. So if you see an error like "None of the instances are sending data", it is very likely that bootstrap failed or the agent failed to start for some reason. The logs should tell you more.
Also make sure you are using an instance profile with your environment. Instance profile allows the health agent running on your EC2 instance to authenticate with Elastic Beanstalk. If instance profile is not associated with your environment then the agent will not be able to send data to Elastic Beanstalk. Read more about Instance Profiles with Elastic Beanstalk here.
Update
One common reason for the health cause "None of the instances are sending data" can be that your instance is in a VPC and your VPC does not allow NTP access. Typical indicator of this problem is the following message in /var/log/messages: ntpdate: Synchronizing with time server: [FAILED]. When this happens the clock on your EC2 instance can get out of sync and the data is considered invalid. You should also see a health cause on the instances on the health page on the AWS web console that tells you that instance clock is out-of-sync. The fix is to make sure that your VPC allows access to NTP.
There can be many reasons why the health agent is not able to send any data, so this may not be the answer to your problem, but it was to mine and hopefully can help somebody else:
I got the same error and looking into /var/log/healthd/daemon.log the following was repeatedly reported:
sending message(s) failed: (Aws::Healthd::Errors::GroupNotFoundException) Group 97c30ca2-5eb5-40af-8f9a-eb3074622172 does not exist
This was caused by me making and using an AMI image from an EC2 instance inside an Elastic Beanstalk environment. That is, I created a temporary environment with one instance the same configuration as my production environment, went into the EC2 console and created an image of the instance, terminated the temporary environment, and then created yet another environment using the new custom AMI.
Of course (in hindsight) this meant some settings of the temporary environment were still being used. In this case specifically /etc/healthd/config.yaml, resulting in the health agent trying to send messages to a no longer existing health group.
To fix this and make sure there was no other stale configuration around, I instead started a new EC2 instance by hand from the default AMI used in the production environment (find it under the 'Instances' configuration page of your environment), provisioned it, then created a new image from it and used that image in my new EB environment.
Check whether your instance type's RAM is enough for app + OS + Amazon tooling. We suffered from this for a long time, until we discovered that t2.micro is barely enough for our use cases. The problem went away right after switching to t2.small (2GB).
I solved this by adding another security group (the default one for my Elastic Beanstalk).
It appears my problem was that I didn't associate a public IP address with my instance. After I set one, it worked just fine.
I was running an app in an Elastic Beanstalk environment with Docker as the platform. I got the same error that none of the instances are sending data, and I was unable to fetch logs as well.
Rebuilding the environment worked for me.
I just set the load balancer's health check path to a URL that responds with status code 200. This was only for a study environment.
For my real app, I use Actuator.
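For anyone without a ready-made health endpoint, here is a minimal sketch of a path that answers 200, using only the Python standard library. The `/health` path, the function names, and the port in the comment are arbitrary choices for illustration; a real application would normally expose this through its own web framework.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def health_response(path: str) -> Tuple[int, bytes]:
    """Decide the status code and body for a request path."""
    if path == "/health":
        return 200, b"ok"
    return 404, b"not found"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = health_response(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress per-request logging to keep output quiet


def make_server(host: str, port: int) -> HTTPServer:
    """Build (but do not start) a server that serves the health endpoint."""
    return HTTPServer((host, port), HealthHandler)


# To run it, call for example:
#   make_server("0.0.0.0", 8080).serve_forever()
```

The load balancer's health check path would then be set to `/health`. Keep in mind that a static 200 only proves the process is up, which is why a richer health check is preferable for a real app.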
If you see something like this where you don't get any enhanced metrics, check that you haven't accidentally removed the conf.d/elasticbeanstalk/healthd.conf include from your nginx config. This conf adds a machine-readable log format that is responsible for reporting that data in EB (see Enhanced health log format - AWS).
My instance profile's IAM Role was lacking elasticbeanstalk:PutInstanceStatistics permission.
I found this by looking at /var/log/healthd/daemon.log as suggested in one of the other answers.
I had to SSH into the machine directly to discover this, as the Get Logs function itself was failing due to missing S3 Write permissions.
If you're running a Worker Tier EB environment, you need to add this policy:
arn:aws:iam::aws:policy/AWSElasticBeanstalkWorkerTier
For anyone arriving here in 2022…
After launching a new environment that was identical to a current healthy environment and seeing no data, I raised an AWS Support ticket. I was informed:
Here, I would like to inform you that recently Elastic Beanstalk introduced new feature called EnhancedHealthAuthEnabled to increase security of your environment and help prevent health data spoofing on your behalf and this option will be enabled by default when you create new environment.
If you use managed policies for your instance profile, this feature is available for your new environment without any further configuration as Elastic Beanstalk instance profile managed policies contain permissions for the elasticbeanstalk:PutInstanceStatistics action. However, if you use a custom instance profile instead of a managed policy, your environment might display a No Data health status. This happens because a custom instance profile doesn't have the PutInstanceStatistics permission by default and instances aren't authorised for the action that communicates enhanced health data to the service. Hence, your environment health shows Unknown/No data status.
The policy that I needed to attach to my existing EC2 role (as advised by AWS Support) looked like:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ElasticBeanstalkHealthAccess",
            "Action": [
                "elasticbeanstalk:PutInstanceStatistics"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:elasticbeanstalk:*:*:application/*",
                "arn:aws:elasticbeanstalk:*:*:environment/*"
            ]
        }
    ]
}
Adding this policy to my EC2 role solved the issue for me.
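If you want to check an existing custom policy document before attaching anything, a simplified local check is sketched below in Python. It only looks for an Allow statement whose Action matches, using shell-style wildcards as an approximation; it ignores Deny statements, NotAction, Resource and Condition, so it is not a substitute for how IAM actually evaluates policies. The function names are my own.

```python
import json
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM allows a single string or object where a list is also accepted."""
    return value if isinstance(value, list) else [value]


def allows_action(policy: dict, action: str) -> bool:
    """True if any Allow statement has an Action pattern matching `action`."""
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        for pattern in _as_list(statement.get("Action", [])):
            # Action names are compared case-insensitively here.
            if fnmatchcase(action.lower(), pattern.lower()):
                return True
    return False


policy_text = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ElasticBeanstalkHealthAccess",
      "Action": ["elasticbeanstalk:PutInstanceStatistics"],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:elasticbeanstalk:*:*:application/*",
        "arn:aws:elasticbeanstalk:*:*:environment/*"
      ]
    }
  ]
}
"""
policy = json.loads(policy_text)
print(allows_action(policy, "elasticbeanstalk:PutInstanceStatistics"))
```

Running this against each policy attached to the instance profile's role gives a quick hint as to whether the permission is missing.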
In my case, the issue was resolved when I increased the RAM by changing the instance type (t2.micro to c5.xlarge).