How to connect to Kafka in Kubernetes from outside?

The challenge is to create a Kafka producer that connects, from outside the cluster, to a Kafka cluster that lives within Kubernetes. We have several RDBMS databases that sit on-premises, and we want to stream data directly to Kafka running in Kubernetes on AWS. We have tried a few things, including deploying the Confluent Open Source Platform, but nothing has worked so far. Does anyone have a clear answer to this problem?

You might have a look at deploying Kafka Connect inside Kubernetes. Since you want to replicate data from various RDBMS databases, you need to set up source connectors:
A source connector ingests entire databases and streams table updates
to Kafka topics. It can also collect metrics from all of your
application servers into Kafka topics, making the data available for
stream processing with low latency.
Depending on your source databases, you'd have to configure the corresponding connectors.
If you are not familiar with Kafka Connect, this article might be quite helpful as it explains the key concepts.
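For example, registering a JDBC source connector against the Kafka Connect REST API might look like the following sketch; the Connect host, database address, credentials, table, and connector name are all placeholders, and it assumes the Confluent JDBC connector is installed:
# Register a JDBC source connector via the Connect REST API (all names/hosts are placeholders)
curl -X POST http://connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "mysql-source",
    "config": {
      "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
      "connection.url": "jdbc:mysql://onprem-db:3306/inventory",
      "connection.user": "connect",
      "connection.password": "secret",
      "mode": "incrementing",
      "incrementing.column.name": "id",
      "table.whitelist": "orders",
      "topic.prefix": "mysql-"
    }
  }'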

Kafka clients need to connect to a specific node to produce or consume messages.
The Kafka protocol lets a client connect to any node to fetch metadata. The client then connects to the specific node that has been elected leader of the partition it wants to produce to or consume from.
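To see the metadata step in action, you can ask any reachable broker for the cluster metadata, for example with kcat (formerly kafkacat); the broker address below is a placeholder:
# Lists brokers, topics, and the leader of each partition
kcat -L -b kafka-0.mydomain.com:9094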
Each Kafka pod has to be individually accessible, so you need an L4 load balancer per pod. The advertised-listener config can be set in the Kafka config to advertise different IPs/hostnames to internal and external clients: configure the EXTERNAL advertised listener to use the load balancer, and the INTERNAL one to use the pod IP. The internal and external ports have to be different.
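As a rough sketch, the per-broker settings could look like this (hostnames, IPs, and ports are placeholders; each pod would advertise its own load balancer on EXTERNAL):
# Per-pod listener configuration; hostnames, IPs, and ports are placeholders
listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
inter.broker.listener.name=INTERNAL
listeners=INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9094
advertised.listeners=INTERNAL://10.244.1.5:9092,EXTERNAL://lb-kafka-0.example.com:9094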
Check out https://strimzi.io/, https://bitnami.com/stack/kafka, and https://github.com/confluentinc/cp-helm-charts
Update:
I tried installing Kafka on Kubernetes running on AWS EC2. Between
confluent-operator, bitnami-kafka, and strimzi, only strimzi configured
EXTERNAL in the Kafka settings to point at the load balancer.
bitnami-kafka used a headless service, which is not useful outside the
Kubernetes network. Confluent-operator configures the node's IP, which
makes Kafka accessible outside Kubernetes, but only to clients that can
reach the EC2 instances via their private IPs.

Related

How to process messages outside GCP in a Kafka server running on GCP

I have been trying to run a consumer on my local machine that connects to a Kafka server running inside GCP.
Kafka and ZooKeeper are running on the same GCP VM instance.
Step 1: Start Zookeeper
bin/zookeeper-server-start.sh config/zookeeper.properties
Step 2: Start Kafka
bin/kafka-server-start.sh config/server.properties
If I run a consumer inside the GCP VM instance it works fine:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
I verified the firewall rules, and I have access from my local machine: I can reach both the public IP and the port the Kafka server is running on.
I tested many options, changing Kafka's server.properties, for example:
advertised.host.name=public-ip
or
advertised.listeners=public-ip
I followed the answer on connecting-kafka-running-on-ec2-machine-from-my-local-machine, without success.
From the official documentation:
advertised.listeners
Listeners to publish to ZooKeeper for clients to use. In IaaS environments, this may
need to be different from the interface to which the broker binds. If
this is not set, the value for listeners will be used. Unlike
listeners it is not valid to advertise the 0.0.0.0 meta-address.
After testing many different options, this solution worked for me:
Setting up two listeners, one EXTERNAL with the public IP, and one INTERNAL with the private IP:
# Configure protocol map
listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
# Use plaintext for inter-broker communication
inter.broker.listener.name=INTERNAL
# Specify that Kafka listeners should bind to all local interfaces
listeners=INTERNAL://0.0.0.0:9027,EXTERNAL://0.0.0.0:9093
# Separately, specify the externally visible addresses
advertised.listeners=INTERNAL://localhost:9027,EXTERNAL://kafkabroker-n.mydomain.com:9093
Explanation:
In many scenarios, such as when deploying on AWS, the externally
advertised addresses of the Kafka brokers in the cluster differ from
the internal network interfaces that Kafka uses.
Also remember to set up your firewall rule to expose the port of the EXTERNAL listener in order to connect to it from an external machine.
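For example, on GCP the firewall rule and an end-to-end check from the external machine could look like this (rule name, port, and source range are placeholders):
# Open the EXTERNAL listener port; rule name and source range are placeholders
gcloud compute firewall-rules create allow-kafka-external --allow=tcp:9093 --source-ranges=203.0.113.0/24
# Then consume from outside using the advertised external address
bin/kafka-console-consumer.sh --bootstrap-server kafkabroker-n.mydomain.com:9093 --topic test --from-beginning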
Note: It's important to restrict access to authorized clients only.
You can use network firewall rules to restrict access. This guidance
applies to scenarios that involve both RFC 1918 and public IP;
however, when using public IP addresses, it's even more important to
secure your Kafka endpoint because anyone can access it.
Taken from Google Cloud solutions.

How to receive messages/events from public internet in Kafka/Nifi cluster hosted in the private subnet of AWS?

I am working on a project where lots of machines/sensors will send messages directly to a Kafka/NiFi cluster. These machines/sensors will push messages from the public internet, not from the corporate network. We are using a Hortonworks distribution on the AWS cloud.
My question is: what is the best architectural practice to set up a Kafka/NiFi cluster for such use cases? I don't want to put my cluster in a public subnet in order to receive messages from the public internet.
Can you please help me with this?
Obviously you shouldn't expose your Kafka to the world, so "sensor data directly to Kafka" is the wrong approach, IMO; at least not without some SSL channel.
You could allow a specific subnet of your external devices to reach the internal subnet, assuming you know that range. However, I think your better option here is to use either MiNiFi or StreamSets SDC: event collectors that sit on the sensors, encrypt traffic to an open NiFi or StreamSets cluster, and let that cluster forward events to the internal Kafka cluster. You already have NiFi, apparently, and MiNiFi was built for exactly this purpose.
Another option could be the Kafka REST proxy, but you'll still need to set up authentication/security layers around it.
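As a sketch of that option, producing a JSON record through the Confluent REST Proxy looks roughly like this (host, port, topic, and payload are placeholders, and authentication is omitted):
# Produce one JSON record via the REST proxy; host/topic/payload are placeholders
curl -X POST http://rest-proxy:8082/topics/sensor-events \
  -H "Content-Type: application/vnd.kafka.json.v2+json" \
  -d '{"records":[{"value":{"sensor_id":"s1","temp":21.5}}]}'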
Use AWS IoT to receive the devices' communication; this option gives you a security layer and isolates your HDF sandbox from the internet.
AWS IoT Core provides mutual authentication and encryption at all points of connection, so that data is never exchanged between devices and AWS IoT Core without a proven identity.
Then import the information with a NiFi processor.
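For illustration, a device could publish a reading to AWS IoT Core over mutually authenticated TLS with any MQTT client; the endpoint, certificate files, and topic below are placeholders:
# Publish over mutual-TLS MQTT; endpoint, certs, and topic are placeholders
mosquitto_pub -h a1b2c3d4e5-ats.iot.us-east-1.amazonaws.com -p 8883 \
  --cafile AmazonRootCA1.pem --cert device.pem.crt --key private.pem.key \
  -t 'sensors/machine-42/telemetry' -m '{"temp": 21.5}'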

How to set up Tomcat session state in AWS EC2 for failover and security

I am setting up a Tomcat application in EC2. For reliability, I am running two or more instances. If one server goes down, my users should be redirected to the other instance. This suggests that session state should be kept in an external source, or mirrored between the servers.
AWS offers a hosted service, ElastiCache, which seems like it would work well. I even found a nice library, memcached-session-manager. However, I soon ran into some issues.
Unless someone can convince me otherwise, I need the session state to be encrypted in transit; otherwise someone could intercept the network traffic and impersonate another user on my site. I don't see any built-in Amazon method to keep the traffic off the internet. (Is peering available here?)
The library mentioned earlier does have Redis support with SSL, but it does not support a Redis cluster. Someone submitted a pull request for this, but it has not been merged, and the library is a complex build. I may talk myself into living without the cluster, but that puts us back at a single point of failure.
Tomcat is running on EC2 in your VPC, and ElastiCache is in your VPC. Your AWS VPC is an isolated network: nobody can intercept the traffic between the EC2 and ElastiCache servers unless your VPC network is compromised in some way.
If you want to use Redis instead, with SSL connections, then I believe at this time you would need a Tomcat Session Manager implementation that uses Jedis. This one uses Jedis, but you would need to upgrade the version of Jedis it uses in order to use SSL connections.
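For reference, wiring memcached-session-manager to an ElastiCache Memcached endpoint is typically done in the webapp's context.xml; the endpoint below is a placeholder and the exact attributes depend on the library version:
<!-- Non-sticky sessions backed by ElastiCache Memcached; endpoint is a placeholder -->
<Manager className="de.javakaffee.web.msm.MemcachedBackupSessionManager"
         memcachedNodes="n1:my-cache.abc123.cfg.use1.cache.amazonaws.com:11211"
         sticky="false"
         sessionBackupAsync="false"
         requestUriIgnorePattern=".*\.(ico|png|gif|jpg|css|js)$" />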

How to browse to a specific instance behind an AWS load balancer

I have a monitor, JavaMelody, installed with my application. The application is running on 7 different instances in AWS, in an auto scaling group behind a load balancer. When I go to myapp.com/monitoring, I get statistics from JavaMelody. However, it only gives me specifics for the node the load balancer happens to direct me to. Is there a way I can specify which node I am browsing to in a web browser?
The load balancer will send you to an Amazon EC2 instance based upon a least-open-connections algorithm.
It is not possible to specify which instance you wish to be sent to.
You will need to connect to each instance specifically, or have the instances push their data to some central store.
You should use CloudWatch custom metrics to write data from your instances and their monitoring agent, and then use CloudWatch dimensions to aggregate this data for the relevant instances.
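For example, each instance could push its own metric value, tagged with its instance ID as a dimension (namespace, metric name, and values are placeholders):
# Push one per-instance data point; namespace, metric, and values are placeholders
aws cloudwatch put-metric-data --namespace "MyApp" \
  --metric-name ActiveSessions \
  --dimensions InstanceId=i-0123456789abcdef0 \
  --value 42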
I have not tried this myself, but you could create several listeners in your load balancer, each with a different listening port and a different target server. The monitoring report of instance #1 might then be available at http://...:81/monitoring, and so on for instances #2 through #n.
Otherwise, I think there are other solutions, such as:
host- or path-based load balancing rules (path-based rules would require adding net.bull.javamelody.ReportServlet to your webapp to listen on different paths; see the web.xml sketch after this list)
use a JavaMelody collector server to collect the data on a separate server, with monitoring reports per instance or aggregated across all instances
send some of the JavaMelody metrics to AWS CloudWatch or to Graphite
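For the path-based option, a rough web.xml sketch could expose the report on a distinct path per instance for the load balancer to route on; the servlet name and URL pattern are placeholders:
<!-- Expose the JavaMelody report on a per-instance path; names are placeholders -->
<servlet>
    <servlet-name>javamelodyReport</servlet-name>
    <servlet-class>net.bull.javamelody.ReportServlet</servlet-class>
</servlet>
<servlet-mapping>
    <servlet-name>javamelodyReport</servlet-name>
    <url-pattern>/monitoring-node1/*</url-pattern>
</servlet-mapping>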

Mule cluster configuration with Amazon cloud(AWS)

I am using the Amazon cloud (AWS) to create Mule server nodes. The issue with AWS is that it doesn't support multicast, but MuleSoft requires all the nodes to be on the same network with multicast enabled for clustering.
Amazon FAQ:
https://aws.amazon.com/vpc/faqs/
Q. Does Amazon VPC support multicast or broadcast?
A. No.
A Mule cluster doesn't show a proper heartbeat without multicast enabled; the mule_ee.log file should show:
Cluster OK
Members [2] {
Member [<IP-Node1>]:5701 this
Member [<IP-Node2>]:5701
}
but my cluster shows only:
Members [1] {
Member [<IP-Node1>]:5701 this
}
which is wrong according to MuleSoft standards. I created a sample poll-scheduler application and deployed it to the Mule cluster; it runs on both nodes because the cluster is not handled properly.
But my organization requires that we continue with AWS for the server configuration.
Questions:
1) Is there any other approach, instead of using a Mule cluster, by which I can use both Mule server nodes in an HA (active-active) configuration?
2) Is it possible to make one server up and running (active) and the other one passive, instead of Mule HA's active-active mode?
3) CloudHub and Anypoint MQ are deployed on AWS; how does MuleSoft handle the multicast issues with AWS?
According to the MuleSoft support team, they don't advise managing Mule HA in AWS; it doesn't matter whether we manage it with ARM or MMC.
The Mule instances communicate with each other and guarantee HA, as well as not processing a single request more than once, but that does not work on AWS because latency may cause the instances to disconnect from one another. We would need to have the servers on-prem to have the HA model.
Multicast and unicast are just used so that the nodes can discover each other automatically, as explained further in the documentation.
Mule cluster config
AWS known limitation: here
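If you still want to try forming the cluster on AWS, node discovery can reportedly be switched from multicast to unicast in {MULE_HOME}/.mule/mule-cluster.properties. A sketch, assuming the property names from the Mule HA documentation; the cluster ID, node ID, and IPs are placeholders:
# Unicast discovery instead of multicast; cluster ID, node ID, and IPs are placeholders
mule.clusterId=MyCluster
mule.clusterNodeId=1
mule.cluster.multicastenabled=false
mule.cluster.nodes=10.0.1.10,10.0.1.11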