Can't connect Burrow to my amazon MSK cluster - amazon-web-services

I've tried several ways to connect the burrow application from my EC2 instance to my kafka cluster to get the consumer lag metrics. I can console produce and consume messages from the instance but the moment I want to connect burrow it throws this error in the logs.
"name":"kafkatestingcluster","error":"kafka: client has run out of available brokers to talk to (Is your cluster reachable?)"
I have checked the bootstrap servers twice, and the zookeeper as well and they are okay. I have also tried with cluster running versions 1.1.0 and 2.2.1, and different client versions in the burrow's config file.
Am I missing a step?

Mind to share your configuration with us? did you enter the correct port?
Have you tried to run a simple telnet test from the host Burrow runs in to the Kafka brokers? Have you checked your inbound and outbound SG rules on AWS?
I would suggest testing those stuff first and if everything is good on that layer, switch Burrow logging level to debug and I'm sure it will give you a better understanding on what's going on.

Related

EC2 server losses internet connection and application fails to send email, sms and even yum updates

I have 5 EC2 servers in the same VPC and all of a sudden yesterday, all of my applications started failing to send email and sms. So I tried doing git pull of my project it also timed out. Then tried to install telnet using yum that to failed with Time out. I have checked almost everything including Network ACLs, Security Groups, Subnets, Iptables, etc and everything is correct. I am not sure why is this happening.
The weird thing is if I reboot the server once the internet comes for a brief amount of time and again it disconnects.
Attaching below are the errors I am facing:
Error while Generating the Tiny URL. Error: {"errno":-110,"code":"ETIMEDOUT","syscall":"connect","address":"XXX.XX.XXX.XX","port":443}
Error SendEmail UnknownEndpoint: Inaccessible host: `email.ap-south-1.amazonaws.com'. This service may not be available in the `ap-south-1' region.
Attaching screenshots of my Network ACLs, Security Groups, Subnets, and iptables:
Please help with what am I doing wrong or if is this an issue with AWS EC2? My goal is to make sure my application works without timeout and git and yum starts working.
Did you try terminating and reprovisioning the instances, rather than rebooting them? There may be some problem with the underlying hardware. When you terminate and recreate an instance, it will likely end up in a different rack in the datacenter, which may solve the problem.
If the above helps, you should consider setting up an application load balancer with an auto scaling group, with health checks enabled for both, so that the auto scaling group terminates unhealthy instances and replaces then with the new ones automatically.
You may also consider using Simple Notification Service and stop worrying about underlying compute for e-mail and sms distribution altogether!

request times out when pinging aws load balancer

I have a dockerized Node.JS express application that I am migrating to AWS from Google Cloud. I had done this before successfully on the same project before deciding Cloud Run was more cost effective because of their free tier. Now, wanting to switch back to Fargate, but am unable to do it again due what I'm guessing is a crucial step. For minimal setup, I used the following guide: https://docs.docker.com/cloud/ecs-integration/ Essentially, using docker compose up with aws context and project name to deploy to ECS and Fargate.
The Load Balancer gives me a public DNS name in the format: xxxxx.elb.us-west-2.amazonaws.com and I have defined a port of 5002 in my Docker container. I know the issue is not related to exposing port numbers or any code-related issue since I had this successfully running in Google Cloud Run. When I try to hit any of my express endpoints, by sending POST to xxxxx.elb.us-west-2.amazonaws.com:5002/my_endpoint, I end up with Error: Request Timed Out
Note: I have already verified that my inbound security rules have been set to all traffic.
I am very new to AWS, so would love guidance if I am missing a critical step.
Thanks!
EDIT (SOLUTION): Turns out everything was deploying correctly, but after checking CloudWatch Logs, it turns out Fargate can't read environment variables defined inside of docker-compose file. Instead, they need to be defined in .env files, then read in docker-compose through -env-file flag. My code was then trying to listen on a port that was in environment variable but was undefined, so was receiving the below error in CloudWatch.

AWS ECS Task can't connect to RDS Database

I'm a newer AWS user and today I got stuck while working on a sample project. I successfully created a docker container that runs a simple R script that connects to my AWS RDS MySQL Database and creates & writes some basic files to it. I built a public ECR repository, pushed my docker image there, and built a ECS cluster & task choosing Fargate and using the container image from my repository. My task ran and I could see the R code being executed when I went through the logs, but it was never able to connect to the SQL Database and exited afterwards.
I've had to whitelist my own IP address in the security group for the RDS Database so that I can connect to it, so I'm aware I probably have to do that for my ECS task to establish that connection too. But won't that IP address constantly change because I won't have a static IP for the Fargate Server that is executing my task? I'm trying to stay on the free tier so I'm not sure I want to setup an elastic IP address for this server.
These 2 articles seem close if not the same issue I'm having but I can't figure out a solution. I haven't found any other info.
https://aws.amazon.com/premiumsupport/knowledge-center/ecs-fargate-task-database-connection/
https://aws.amazon.com/premiumsupport/knowledge-center/ecs-fargate-static-elastic-ip-address/
The end goal is to get this sample project successfully running on a scheduled fixed interval, and then running actual scripts on there to help automate things and make my life easier, so this sample project is a first step towards that. Any help or info on the questions I'm having would be appreciated !
Yes, your task is ephemeral (whether you launch it manually or as part of an ECS service) and its private/public ip address may change over time if it gets replaced. The way you'd make the connectivity rules to stick is to assign a security group to the task (that may have inbound access on a specific port you need I assume and outbound to everything) and assign another security group to the RDS db that has inbound access on port 3306 for the security group you assigned to the task (this is the trick, the SG will not change and you are telling RDS to allow access to ALL traffic coming from that SG). I see the first article you posted doesn't talk about this part (it should).

Why are outbound SSH connections from Google CloudRun to EC2 instances unspeakably slow?

I have a Node API deployed to Google CloudRun and it is responsible for managing external servers (clean, new Amazon EC2 Linux VM's), including through SSH and SFTP. SSH and SFTP actually work eventually but the connections take 2-5 MINUTES to initiate. Sometimes they timeout with handshake timeout errors.
The same service running on my laptop, connecting to the same external servers, has no issues and the connections are as fast as any normal SSH connection.
The deployment on CloudRun is pretty standard. I'm running it with a service account that permits access to secrets, etc. Plenty of memory allocated.
I have a VPC Connector set up, and have routed all traffic through the VPC connector, as per the instructions here: https://cloud.google.com/run/docs/configuring/static-outbound-ip
I also tried setting UseDNS no in the /etc/ssh/sshd_config file on the EC2 as per some suggestions online re: slow SSH logins, but that has not make a difference.
I have rebuilt and redeployed the project a few dozen times and all tests are on brand new EC2 instances.
I am attempting these connections using open source wrappers on the Node ssh2 library, node-ssh and ssh2-sftp-client.
Ideas?
Cloud Run works only until you have a HTTP request active.
You proably don't have an active request during this on Cloud Run, as outside of the active request the CPU is throttled.
Best for this pipeline is Cloud Workflows and regular Compute Engine instances.
You can setup a Workflow to start a Compute Engine for this task, and stop once it finished doing the steps.
I am the author of article: Run shell commands and orchestrate Compute Engine VMs with Cloud Workflows it will guide you how to setup.
Executing the Workflow can be triggered by Cloud Scheduler or by HTTP ping.

Diagnosing Kafka Connection Problems

I have tried to build as much diagnostics into my Kafka connection setup as possible, but it still leads to mystery problems. In particular, the first thing I do is use the Kafka Admin Client to get the clusterId, because if this operation fails, nothing else is likely to succeed.
def getKafkaClusterId(describeClusterResult: DescribeClusterResult): Try[String] = {
try {
val clusterId = describeClusterResult.clusterId().get(futureTimeout.length / 2, futureTimeout.unit)
Success(clusterId)
} catch {
case cause: Exception =>
Failure(cause)
}
}
In testing this usually works, and everything is fine. It generally only fails when the endpoint is not reachable somehow. It fails because the future times out, so I have no other diagnostics to go by. To test these problems, I usually telnet to the endpoint, for example
$ telnet blah 9094
Trying blah...
Connected to blah.
Escape character is '^]'.
Connection closed by foreign host.
Generally if I can telnet to a Kafka broker, I can connect to Kafka from my server. So my questions are:
What does it mean if I can reach the Kafka brokers via telnet, but I cannot connect via the Kafka Admin Client
What other diagnostic techniques are there to troubleshoot Kafka broker connection problems?
In this particular case, I am running Kafka on AWS, via a Docker Swarm, and trying to figure out why my server cannot connect successfully. I can see in the broker logs when I try to telnet in, so I know the brokers are reachable. But when my server tries to connect to any of 3 brokers, the logs are completely silent.
This is a good article that explains the steps that happens when you first connect to a Kafka broker
https://community.hortonworks.com/articles/72429/how-kafka-producer-work-internally.html
If you can telnet to the bootstrap server then it is listening for client connections and requests.
However clients don't know which real brokers are the leaders for each of the partitions of a topic so the first request they always send to a bootstrap server is a metadata request to get a full list of all the topic metadata. The client uses the metadata response from the bootstrap server to know where it can then make new connections to each of Kafka brokers with the active leaders for each topic partition of the topic you are trying to produce to.
That is where your misconfigured broker problem comes into play. When you misconfigure the advertised.listener port the results of the first metadata request are redirecting the client to connect to unreachable IP addresses or hostnames. It's that second connection that is timing out, not the first one on the port you are telnet'ing into.
Another way to think of it is that you have to configure a Kafka server to work properly as both a bootstrap server and a regular pub/sub message broker since it provides both services to clients. Yours are configured correctly as a pub/sub server but incorrectly as a bootstrap server because the internal and external ip addresses are different in AWS (also in docker containers or behind a NAT or a proxy).
It might seem counter intuitive in small clusters where your bootstrap servers are often the same brokers that the client is eventually connecting to but it is actually a very helpful architectural design that allow kafka to scale and to failover seamlessly without needing to provide a static list of 20 or more brokers on your bootstrap server list, or maintain extra load balancers and health checks to know onto which broker to redirect the client requests.
If you do not configure listeners and advertised.listeners correctly, basically Kafka just does not listen. Even though telnet is listening on the ports you've configured, the Kafka Client Library silently fails.
I consider this a defect in the Kafka design which leads to unnecessary confusion.
Sharing Anand Immannavar's answer from another question:
Along with ADVERTISED_HOST_NAME, You need to add ADVERTISED_LISTENERS to container environment.
ADVERTISED_LISTENERS - Broker will register this value in zookeeper and when the external world wants to connect to your Kafka Cluster they can connect over the network which you provide in ADVERTISED_LISTENERS property.
example:
environment:
- ADVERTISED_HOST_NAME=<Host IP>
- ADVERTISED_LISTENERS=PLAINTEXT://<Host IP>:9092