EKS: random "Error: ErrImagePull" / "i/o timeout" when pulling images

Running AWS "Managed Nodes" for an EKS cluster across two AZs.
Three nodes in total. I get random timeouts when attempting to pull container images.
This has been hard to trace because it does work sometimes, so it isn't a case of an ACL or a security group blocking traffic outright.
When I SSH into the nodes, sometimes I can pull the image manually and sometimes I cannot. When I run curl -I https://hub.docker.com, it sometimes takes two minutes to get a response back. I'm guessing this is why the image pulls are timing out.
I don't know of a way to increase the timeout Kubernetes allows for pulling an image, and I also can't figure out why the latency on the curl request is so bad.
Any suggestions are greatly appreciated.
FYI, the worker nodes are in a private subnet with proper routes to a NAT gateway in place, and the VPC Flow Logs look fine.

Random is the hardest thing to trace 🤷.
You could move your images to a private ECR registry, or simply run a registry in your cluster, to rule out an issue with your Kubernetes networking. Are you running the AWS VPC CNI?
It could also just be rate limiting from Docker Hub itself. Are you using the same external NAT IP address to pull from multiple nodes/clusters?
Docker will gradually impose download rate limits with an eventual limit of 300 downloads per six hours for anonymous users.
Logged in users will not be affected at this time. Therefore, we recommend that you log into Docker Hub as an authenticated user. For more information, see the following section How do I authenticate pull requests.
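If rate limiting turns out to be the culprit, authenticating your pulls is straightforward. A minimal sketch, assuming the workloads run in the default namespace; the secret name and credentials are placeholders:

# Create a registry credential for Docker Hub.
kubectl create secret docker-registry dockerhub-creds \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=<your-user> \
  --docker-password=<your-access-token> \
  --docker-email=<you@example.com>
# Reference it per pod via spec.imagePullSecrets, or attach it to the default service account:
kubectl patch serviceaccount default \
  -p '{"imagePullSecrets": [{"name": "dockerhub-creds"}]}'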
✌️
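On the timeout itself: with the Docker runtime, the kubelet cancels a pull that makes no progress within --image-pull-progress-deadline (default one minute), so that flag can be raised. A rough sketch, assuming nodes whose user data calls the EKS bootstrap script; the cluster name and the 10m value are placeholders:

# Passed via the launch template / user data for the node group.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--image-pull-progress-deadline=10m'

That said, if single requests to Docker Hub are taking two minutes, fixing the underlying latency (DNS, NAT gateway, or rate limiting) matters more than raising the deadline.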

Related

AWS Fargate pods

Hi, I am new to Fargate and confused about its cost calculation.
How is the 'Average duration' calculated and charged? Is it calculated and charged only for the time between a request arriving and the response being returned, or are the pods continually running and charged 24*7*365?
Also, does Fargate fetch the image from ECR every time a request arrives?
Does Fargate cost anything even when there are no requests and nothing is processing?
What is the correct way of calculating the 'Average duration' section?
This can make a huge difference in cost.
You can learn more details from AWS Fargate Pricing and from the AWS Pricing Calculator. When you read the details in the first link I mentioned, you will find the explanation for duration, along with three examples.
How is the 'Average duration' calculated and charged? Is it calculated and charged only for the time between a request arriving and the response being returned, or are the pods continually running and charged 24*7*365?
Fargate is not a request-based service. Fargate runs your pod for the entire time you ask it to run that pod. It doesn't deploy pods when a request comes in; the pods are running 24/7 (or for as long as you have them configured to run).
Fargate is "serverless" in the sense that you don't have to manage the EC2 server the container(s) are running on yourself; Amazon manages the EC2 server for you.
Also, does Fargate fetch the image from ECR every time a request arrives?
Fargate pulls from ECR when a pod is deployed. It has to be deployed and running already in order to accept requests. It does not deploy a pod when a request comes in like you are suggesting.
Does Fargate cost anything even when there are no requests and nothing is processing?
Fargate charges for the amount of RAM and CPU you have allocated to your pod, regardless of whether it is actively processing requests. Fargate does not care about the number of requests. You could even use Fargate for things like back-end processing services that don't accept requests at all.
If you want an AWS service that only runs (and charges) when a request comes in, then you would have to use AWS Lambda.
You could also look at AWS App Runner, which is kind of a middle ground between Lambda and Fargate. It works like Fargate, but it suspends your containers when requests aren't coming in, in order to save some money on the CPU charges.
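To make the "pay for allocation, not for requests" point concrete, here is a back-of-the-envelope sketch. The per-vCPU and per-GB rates are placeholders; take the current numbers for your region from the AWS Fargate pricing page:

# Illustrative rates only -- check the Fargate pricing page for your region.
VCPU_PER_HOUR=0.04048   # USD per vCPU-hour (example value)
GB_PER_HOUR=0.004445    # USD per GB-hour (example value)
# A pod with 1 vCPU and 6 GB of memory running for a 30-day month (720 hours):
echo "1 * 720 * $VCPU_PER_HOUR + 6 * 720 * $GB_PER_HOUR" | bc -l
# -> roughly 29 + 19 = ~48 USD for the month, whether or not a single request ever arrives.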

How can I get AWS Batch to run more than 2 or 3 jobs at a time?

I'm just getting started with AWS. I have a (rather complicated) Python script which reads in some data from an S3 bucket, does some computation, and then exports some results to the same S3 bucket. I've packaged everything in a Docker container, and I'm trying to run it in parallel (say, 50 instances at a time) using AWS Batch.
I've set up a compute environment with the following parameters:
Type: MANAGED
Provisioning model: FARGATE
Maximum vCPUs: 256
I then set up a job queue using that compute environment.
Next, I set up a job definition using my Docker image with the following parameters:
vCpus: 1
Memory: 6144
Finally, I submitted a bunch of jobs using that job definition with slightly different commands and sent them to my queue.
As I submitted the first few jobs, I saw the status of the first 2 jobs go from RUNNABLE to STARTING to RUNNING. However, the rest of them just sat there in the RUNNABLE state until the first 2 were finished.
Does anyone have any idea what the bottleneck might be to running more than 2 or 3 jobs at a time? I'm aware that there are some account limitations, but I'm not sure which one might be the bottleneck.
Turns out there were 3 things at play here:
There was a service quota on my account of 5 public IP addresses, and each container was getting its own IP address so it could communicate with the S3 bucket. I made one of the subnets a private subnet and put all my containers in that subnet. I then set up a NAT gateway in a public subnet and routed all my traffic through the gateway. (More details at https://aws.amazon.com/premiumsupport/knowledge-center/nat-gateway-vpc-private-subnet/)
As Marcin pointed out, Fargate does scale slowly. I switched to using EC2, which scaled much more quickly but still stopped scaling at around 30 container instances.
There was a service quota on my account called "EC2 Instances / Instance Limit (All Standard (A, C, D, H, I, M, R, T, Z) instances)" which was set to 32. I reached out to AWS, and they raised the limit, so I am now able to run over 100 jobs at once.
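For reference, that instance quota can also be inspected and raised through the Service Quotas API instead of a support ticket. A sketch, assuming L-1216C47A is still the code for the "Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances" quota (newer accounts count it in vCPUs rather than instances, so double-check in your account):

# Show the current limit for running On-Demand Standard instances.
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-1216C47A
# Request a higher limit, e.g. 256, so Batch can scale out further.
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-1216C47A \
  --desired-value 256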

Reduce Cloud Run on GKE costs

It would be great if I could have answers to the following questions on Google Cloud Run.
If I create a cluster with resources upwards of 1 vCPU, will those extra vCPUs be utilized by my Cloud Run service, or is it always capped at 1 vCPU irrespective of my cluster configuration? In the docs here, this line has me confused: "Cloud Run allocates 1 vCPU per container instance, and this cannot be changed." I know this holds for managed Cloud Run, but does it also hold for Cloud Run on GKE?
If the resources specified for the cluster actually get utilized (say, I create a node pool of 2 n1-standard-4 nodes with 15 GB of memory each), then why am I asked to choose a memory allocation again when creating/deploying to Cloud Run on GKE? What is its significance?
(Screenshot: the 'Memory allocated' dropdown.)
If Cloud Run autoscales from 0 to N according to traffic, why can't I set the number of nodes in my cluster to 0 (I tried and started seeing error messages about unscheduled pods)?
I followed the docs on custom mapping and set it up. Can I limit the requests that a container instance handles by the domain name or IP they are coming from (even if it is only set up artificially by specifying a Host header as in the Cloud Run docs)?
curl -v -H "Host: hello.default.example.com" YOUR-IP
So that I don't incur charges if I get HTTP requests from anywhere but my verified domain?
Any help will be very much appreciated. Thank you.
1: The Cloud Run managed platform always allocates 1 vCPU per revision. On GKE this is also the default, but, only for GKE, you can override it with the --cpu param (see the example after this list).
https://cloud.google.com/sdk/gcloud/reference/beta/run/deploy#--cpu
2: Can you clarify exactly what is being asked, and when performing which operation?
3: Cloud Run is built on top of Kubernetes thanks to Knative. Cloud Run is in charge of scaling pods up and down based on traffic; Kubernetes is in charge of scaling pods and nodes based on CPU and memory usage. The mechanisms aren't the same. Moreover, node scaling is "slow" and can't keep up with spiky traffic. Finally, something has to run on your cluster to listen for incoming requests and serve/scale your pods correctly, and that something has to run on a cluster with more than 0 nodes.
4: Cloud Run doesn't allow you to configure this, and I think Knative can't either. But you can deploy an ESP (Extensible Service Proxy) in front to route requests to a specific Cloud Run service. That way you split the traffic up front and address it to different services, and thus scale them independently. Each service can have a different max-scale param and concurrency param, and ESP can implement rate limiting.
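A minimal sketch of the --cpu override from point 1, using the beta command linked above; the service, image, cluster, and zone names are placeholders:

# Deploy a revision with 2 vCPUs per container instance on Cloud Run on GKE.
gcloud beta run deploy hello \
  --image gcr.io/my-project/hello \
  --platform gke \
  --cluster my-cluster \
  --cluster-location us-central1-a \
  --cpu 2 \
  --memory 512Mi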

Elastic Beanstalk / ELB adding over 60ms latency

I'm running a microservice on AWS Elastic Beanstalk which is logging its responses internally at 1-4 ms, but the AWS Dashboard is showing an average of 68 ms (not even counting latency to/from AWS). Is this normal? It just seems odd that EB/ELB would add 60 ms of latency to every request.
It's configured to use a Docker container, which seems to sit behind nginx. It doesn't appear to be configured to log the TTFB in the access logs; this is all auto-configured by Amazon.
In testing I tried both a t2.micro and a t2.large instance, and that had no effect on the results. Is there something I can tweak on my end? I really need to get this under 10-20 ms (not counting RTT/ping distance) for the service to be useful.
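One way to see where the time goes is to time the hops separately with curl, hitting the load balancer and the instance directly (security group permitting); the hostnames and path here are placeholders:

curl -o /dev/null -s -w 'dns=%{time_namelookup} connect=%{time_connect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
  http://my-env.elasticbeanstalk.com/health
curl -o /dev/null -s -w 'dns=%{time_namelookup} connect=%{time_connect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
  http://<instance-public-ip>/health
# The difference between the two ttfb values approximates what the ELB + nginx layer adds.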
It appears to have been a problem on Amazon's side. It was averaging 69 ms on Friday; today (Monday morning) it's down to 3.9 ms.

AWS: None of the Instances are sending data

I'm trying to set up an Elastic Beanstalk application with Amazon Web Services; however, I'm receiving a load of errors with the message "None of the instances are sending data". I've tried deleting the Elastic Beanstalk application and the EC2 instance several times with the sample application and trying again, but I get the same error.
I also tried uploading a Flask application with the AWS Elastic Beanstalk command line tools, but then I received the error below:
Environment health has transitioned from Pending to Severe. 100.0 % of the requests to the ELB are failing with HTTP 5xx. Insufficient request rate (0.5 requests/min) to determine application health (7 minutes ago). ELB health is failing or not available for all instances. None of the instances are sending data
Why do I get this error and how do I fix it? Thanks.
You are using Enhanced Health Monitoring.
With enhanced health monitoring an agent installed on your EC2 instance monitors vital system and application level health metrics and sends them directly to Elastic Beanstalk.
When you see an error message like "None of the instances are sending data", it means either the agent on the instance has crashed or it is unable to post data to Elastic Beanstalk due to networking error or some other error.
For debugging this, I would recommend downloading "Full logs" from the AWS console. You can follow the instructions for getting logs in the section "Downloading Bundle Logs from Elastic Beanstalk Console" here.
If you are unable to download logs using the console for any reason you can also ssh to the instance and look at the logs in /var/log.
You will find logs for the health agent in /var/log/healthd/daemon.log.
Additional logs useful for this situation are /var/log/cfn-init.log, /var/log/eb-cfn-init.log and /var/log/eb-activity.log. Can you look at the logs and give more details of the errors you see?
This should hopefully give you more details regarding the error "None of the instances are sending data".
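A short sketch of pulling those logs, assuming you can reach the instance with the EB CLI or plain SSH:

eb ssh                                        # or: ssh ec2-user@<instance-ip>
sudo tail -n 50 /var/log/healthd/daemon.log   # health agent errors
sudo tail -n 50 /var/log/cfn-init.log         # bootstrap problems
sudo tail -n 50 /var/log/eb-activity.log      # deployment/activity problems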
Regarding other health "causes" you are seeing:
Environment health has transitioned from Pending to Severe - This is because initially your environment health status is Pending. If the instances do not go healthy within the grace period, the health status transitions to Severe. In your case, since none of the instances is healthy / sending data, the health transitioned to Severe.
100.0 % of the requests to the ELB are failing with HTTP 5xx. Insufficient request rate (0.5 requests/min) to determine application health (7 minutes ago).
Elastic Beanstalk monitors other resources in addition to your EC2 instances when using enhanced health monitoring. For example, it monitors CloudWatch metrics for your ELB. This error means that all requests sent to your environment CNAME/load balancer are failing with HTTP 5xx errors. At the same time, the request rate is very low (only 0.5 requests per minute), so even though all requests are failing, there aren't many of them. "7 minutes ago" means that the information about the ELB metrics is slightly old: Elastic Beanstalk polls CloudWatch metrics every few minutes, so the data can be somewhat stale, as opposed to the health data received directly from the EC2 instances, which is near real time. In your case, since the instances are not sending data, the only available source of health information is the ELB metrics, which are delayed by about 7 minutes.
ELB health is failing or not available for all instances
Elastic Beanstalk is looking at the health of your ELB, i.e. it is checking how many instances are in service behind ELB. In your case either all instances behind ELB are out of service or the health is not available for some other reason. You should double check that your service role is correctly configured. You can read how to configure service role correctly here or in the documentation. It is possible that your application failed to start.
In your case I would suggest focusing on the first error "None of the instances are sending data". For this you need to look at the logs as outlined above. Let me know what you see in the logs. The agent is started fairly early in the bootstrap process on the instance. So if you see an error like "None of the instances are sending data", it is very likely that bootstrap failed or the agent failed to start for some reason. The logs should tell you more.
Also make sure you are using an instance profile with your environment. The instance profile allows the health agent running on your EC2 instance to authenticate with Elastic Beanstalk. If an instance profile is not associated with your environment, the agent will not be able to send data to Elastic Beanstalk. Read more about Instance Profiles with Elastic Beanstalk here.
Update
One common reason for the health cause "None of the instances are sending data" can be that your instance is in a VPC and your VPC does not allow NTP access. Typical indicator of this problem is the following message in /var/log/messages: ntpdate: Synchronizing with time server: [FAILED]. When this happens the clock on your EC2 instance can get out of sync and the data is considered invalid. You should also see a health cause on the instances on the health page on the AWS web console that tells you that instance clock is out-of-sync. The fix is to make sure that your VPC allows access to NTP.
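A quick way to check for the NTP case from a shell on the instance (exact log wording may differ by AMI):

grep -i ntpdate /var/log/messages   # look for "Synchronizing with time server: [FAILED]"
ntpstat                             # reports whether the system clock is synchronised
# If it is failing, allow outbound UDP port 123 (NTP) from the instance's subnet.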
There can be many reasons why the health agent is not able to send any data, so this may not be the answer to your problem, but it was to mine and hopefully can help somebody else:
I got the same error, and looking into /var/log/healthd/daemon.log, the following was repeatedly reported:
sending message(s) failed: (Aws::Healthd::Errors::GroupNotFoundException) Group 97c30ca2-5eb5-40af-8f9a-eb3074622172 does not exist
This was caused by me making and using an AMI image from an EC2 instance inside an Elastic Beanstalk environment. That is, I created a temporary environment with one instance of the same configuration as my production environment, went into the EC2 console and created an image of the instance, terminated the temporary environment, and then created yet another environment using the new custom AMI.
Of course (in hindsight) this meant some settings of the temporary environment were still being used. In this case specifically /etc/healthd/config.yaml, resulting in the health agent trying to send messages to a no longer existing health group.
To fix this, and to make sure there was no other stale configuration around, I instead started a new EC2 instance by hand from the default AMI used in the production environment (you can find it under the 'Instances' configuration page of your environment), provisioned that, then created a new image from it and used that image in my new EB environment.
Check whether your instance type's RAM is enough for the app + OS + Amazon tooling. We suffered from this for a long time before discovering that a t2.micro is barely enough for our use case. The problem went away right after switching to a t2.small (2 GB).
I solved this by adding another security group (the default one for my Elastic Beanstalk).
It appears my problem was that I didn't associate a public IP address with my instance. After I set it, it worked just fine.
I was running an app in an Elastic Beanstalk environment with Docker as the platform. I got the same error that none of the instances are sending data, and I was unable to fetch logs as well.
Rebuilding the environment worked for me.
I just set the health check path on the load balancer to a URL that responds with status code 200; this was only for a study environment.
For my real app, I use Actuator.
If you see something like this where you don't get any enhanced metrics, check that you haven't accidentally removed the conf.d/elasticbeanstalk/healthd.conf include from your nginx config. This conf adds a machine-readable log format that is responsible for reporting that data in EB (see Enhanced health log format - AWS).
My instance profile's IAM Role was lacking elasticbeanstalk:PutInstanceStatistics permission.
I found this by looking at /var/log/healthd/daemon.log as suggested in one of the other answers.
I had to SSH into the machine directly to discover this, as the Get Logs function itself was failing due to missing S3 Write permissions.
If you're running a Worker Tier EB, need to add this policy:
arn:aws:iam::aws:policy/AWSElasticBeanstalkWorkerTier
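Attaching that managed policy from the CLI, assuming the default instance profile role name aws-elasticbeanstalk-ec2-role (yours may differ):

aws iam attach-role-policy \
  --role-name aws-elasticbeanstalk-ec2-role \
  --policy-arn arn:aws:iam::aws:policy/AWSElasticBeanstalkWorkerTier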
For anyone arriving here in 2022…
After launching a new environment that was identical to a current healthy environment and seeing no data, I raised an AWS Support ticket. I was informed:
Here, I would like to inform you that Elastic Beanstalk recently introduced a new feature called EnhancedHealthAuthEnabled to increase the security of your environment and help prevent health data spoofing on your behalf; this option is enabled by default when you create a new environment.
If you use managed policies for your instance profile, this feature is available for your new environment without any further configuration, as the Elastic Beanstalk instance profile managed policies contain permissions for the elasticbeanstalk:PutInstanceStatistics action. However, if you use a custom instance profile instead of a managed policy, your environment might display a No Data health status. This happens because a custom instance profile doesn't have the PutInstanceStatistics permission by default, so instances aren't authorised for the action that communicates enhanced health data to the service. Hence, your environment health shows an Unknown/No Data status.
The policy that I needed to attach to my existing EC2 role (as advised by AWS Support) looked like:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ElasticBeanstalkHealthAccess",
      "Action": [
        "elasticbeanstalk:PutInstanceStatistics"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:elasticbeanstalk:*:*:application/*",
        "arn:aws:elasticbeanstalk:*:*:environment/*"
      ]
    }
  ]
}
Adding this policy to my EC2 role solved the issue for me.
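If you prefer the CLI, the same statement can be attached as an inline policy; the role name and file name here are placeholders:

# health-policy.json contains the JSON statement shown above.
aws iam put-role-policy \
  --role-name my-eb-ec2-role \
  --policy-name ElasticBeanstalkHealthAccess \
  --policy-document file://health-policy.json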
In my case, increasing the RAM / instance type (t2.micro to c5.xlarge) resolved it.