Error while using Dataflow Kafka to BigQuery template - google-cloud-platform

I am using the Dataflow Kafka to BigQuery template. After launching the Dataflow job, it stays queued for some time and then fails with the error below:
Error occurred in the launcher container: Template launch failed. See console logs.
When looking at the logs, I see the following stack trace:
at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:192)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:317)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:303)
at com.google.cloud.teleport.v2.templates.KafkaToBigQuery.run(KafkaToBigQuery.java:343)
at com.google.cloud.teleport.v2.templates.KafkaToBigQuery.main(KafkaToBigQuery.java:222)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata
While launching the job, I provided the following parameters:
Kafka topic name
bootstrap server name
BigQuery output table name
service account email
zone
My Kafka topic contains only one message: hello
Kafka is installed on a GCP instance that is in the same zone and subnet as the Dataflow workers.

Adding this here as an answer for posterity:
"Timeout expired while fetching topic metadata" indicates that the the Kafka client is unable to connect to the broker(s) to fetch the metadata. This could be due to various reasons such as the worker VMs unable to talk to the broker (are you talking over public or private ips? Check incoming firewall settings if using public ips). It could also be due to an incorrect port or due to the broker requiring SSL connections. One way to confirm is to install the Kafka client on a GCE VM in the same subnet as Dataflow workers and then verify that the kafka client can connect to the Kafka brokers.
Refer to [1] to configure the SSL settings for the Kafka client (which you can test using the CLI on a GCE instance). The team that manages the broker(s) can tell you whether they require SSL connections; a sketch of the client-side settings follows the link.
[1] https://docs.confluent.io/platform/current/kafka/authentication_ssl.html#clients
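If the brokers do require SSL, the client-side properties file typically looks something like this (paths and passwords are placeholders; see [1] for the authoritative key names):
security.protocol=SSL
ssl.truststore.location=/path/to/kafka.client.truststore.jks
ssl.truststore.password=<truststore-password>
# Only needed when the brokers require mutual (client) authentication:
ssl.keystore.location=/path/to/kafka.client.keystore.jks
ssl.keystore.password=<keystore-password>
ssl.key.password=<key-password>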

Hey, thanks for the help. I was trying to access Kafka with the internal IP; it worked when I changed it to the public IP. Actually, I am running both the Kafka machines and the workers in the same subnet, so it should work with the internal IP as well... I am checking that now.

Related

Dataproc Hadoop/Spark job cannot connect to Cloud SQL via private IP

I am facing an issue setting up private IP access between Dataproc and Cloud SQL with a VPC network and peering setup. I would really appreciate help, since I have not been able to figure this out after the last two days of debugging, despite following pretty much all the docs.
The setup I have tried so far (with internal IP only):
enabled "Private Google Access" on the default subnet and used the default subnetwork for Dataproc and Cloud SQL.
created a new VPC network/subnetwork, used it to create the Dataproc cluster, and updated Cloud SQL to use that network.
created an IP range and a "private service connection" to the "Google Cloud Platform" service provider and enabled it, along with VPC network peering to "servicenetworking".
explicitly added the Cloud SQL Client role to the default Dataproc compute service account (even though I didn't need this for other VM connectivity to Cloud SQL using the same role, because it's an admin ("editor") role anyway).
All according to the doc: https://cloud.google.com/sql/docs/mysql/private-ip and the other links there.
Problem:
when I submit a Spark job on Dataproc that connects to this Cloud SQL instance, it fails with the following error: Communications link failure...
Caused by: java.net.ConnectException: Connection refused (Connection refused)
Test & debug:
connectivity tests all pass from the exact internal IP addresses on both sides (Dataproc node and Cloud SQL node)
the mysql command-line client can connect fine from the Dataproc master node
Cloud Logging does not show any denies or issues connecting to MySQL
screenshot of the connectivity test on both the default and the new VPC network.
Other Stack Overflow questions I referred to on using private IP:
Cannot connect to Cloud SQL from Cloud Run after enabling private IP and turning off public IP
How to access Cloud SQL from Dataproc?
PS: I want to avoid the Cloud SQL proxy route to connect to Cloud SQL from Dataproc, so I don't want to install the cloud_sql_proxy service via initialization actions.
A "Connection refused" normally means that nothing is listening on the other end. The logs also contain hints that the database connection is attempted to localhost, port 3307. This is the right port for the CloudSQL proxy, one higher than the usual MySQL port.
Check whether the metadata configuration for your cluster is correct; one way to inspect it is sketched below.
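The cluster name and region here are placeholders, and the format expression is an assumption about the describe output's field layout:
gcloud dataproc clusters describe my-cluster --region=us-central1 \
  --format="value(config.gceClusterConfig.metadata)"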
Workaround 1:
Check whether the proxy in the cluster that is having issues is a different version (1.xx). A difference in Cloud SQL proxy versions seems to be behind this issue; you can pin a suitable 1.xx version of the Cloud SQL proxy.
Workaround 2:
Run the command: journalctl -r -u cloud-sql-proxy.service | grep -i err
Based on the logs, check which SQL proxy instance is causing issues.
Check whether the root cause may be the data project hitting the "SQL queries per 100 seconds per user" quota.
Actions:
Increase the quota and restart the affected Cloud SQL proxy services (by monitoring the jobs running on the master nodes that failed).
This is similar to the linked issue, but with the quota error preventing startup instead of the network errors described there. With the updated quota, the Cloud SQL proxy should not have this recur.
Here's a recommended set of next steps:
Reboot any nodes that appear to have a defunct/broken Cloud SQL proxy -- systemd won't report the truth, but running "mysql --host ... --port ..." against the Cloud SQL proxy on the bad nodes would detect this.
Bump up the API quota immediately: in the Cloud Console go to "IAM and Admin", then "Quotas", search for the "Cloud SQL Admin API", and click through it; then click on the pencil to "edit", and you should be able to bump it to 300 as self-service without approval needed. If you want more than 300 per 100s, you might need to file an approval request.
If you look at the quota usage and it is approaching 100 per 100s from time to time, update the quota to 300.
It's possible that the extra Cloud SQL proxy instances on the worker nodes are causing more load than necessary compared to running the proxy only on the master node. If the cluster only uses a driver that runs on a master node, the worker nodes don't need to run the proxy.
To find the nodes that are broken, you can check which ones are responding on the Cloud SQL proxy port.
You can loop over each hostname, SSH to it, and run this command (a sketch of such a loop follows after the command):
nc -zv localhost 3307 || sudo systemctl restart cloud-sql-proxy
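A sketch of that loop, assuming hypothetical Dataproc node names and that gcloud SSH access is configured:
for host in my-cluster-m my-cluster-w-0 my-cluster-w-1; do
  # Restart the proxy on any node where nothing answers on 3307.
  gcloud compute ssh "$host" --command \
    'nc -zv localhost 3307 || sudo systemctl restart cloud-sql-proxy'
done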
or you could check the logs on each to see which ones have logged a quota message like this:
grep cloud_sql_proxy /var/log/syslog | tail
and see if the very last message they see says "Error 429: Quota exceeded for quota group 'default' and limit 'USER-100s' of service
'sqladmin.googleapis.com' for consumer ..."
The nodes that aren't running the Cloud SQL proxy could be rebooted to start from scratch, or you can restart the proxy with this command on each:
"sudo systemctl restart cloud-sql-proxy"

How to permit Google Cloud Data Fusion to connect to an AWS RDS MySQL database?

I'm getting an error when configuring a database connection in a Google Cloud Data Fusion pipeline.
"Encountered SQL error while getting query schema: Communications link failure The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server."
We can't connect from outside the company building, as the company IPs are whitelisted in the AWS security settings. I can query easily using MySQL Workbench inside the company, so I'm guessing I need to add some IPs to our AWS security groups to give Data Fusion permission? I can't find a guideline on this. Where can I find the IPs required for AWS? (Assuming that might fix it.)
I've added a MySQL plugin artifact using 'mysql-connector-java-8.0.17.jar', which is referred to by the plugin name 'mysql-connector-java'.
Set up a VPN between your GCP VPC and the AWS VPC where your RDS instance resides:
https://cloud.google.com/solutions/using-gcp-apis-from-an-external-network
https://cloud.google.com/solutions/automated-network-deployment-multicloud
A simpler way:
Create an HAProxy VM with a public IP (a config sketch follows):
Data Fusion --> HAProxy VM public IP --> AWS RDS private IP
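A minimal haproxy.cfg sketch for that TCP pass-through (the RDS endpoint is a hypothetical placeholder; lock the frontend down with firewall rules rather than leaving it open to the internet):
defaults
    mode tcp
    timeout connect 5s
    timeout client  1m
    timeout server  1m

frontend mysql_in
    bind *:3306
    default_backend rds_mysql

backend rds_mysql
    # Hypothetical RDS endpoint; HAProxy resolves it at startup.
    server rds1 mydb.abc123.us-east-1.rds.amazonaws.com:3306 check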

Can't connect Burrow to my Amazon MSK cluster

I've tried several ways to connect the Burrow application on my EC2 instance to my Kafka cluster to get consumer lag metrics. I can produce and consume messages from the instance with the console tools, but the moment I try to connect Burrow, it throws this error in the logs:
"name":"kafkatestingcluster","error":"kafka: client has run out of available brokers to talk to (Is your cluster reachable?)"
I have double-checked the bootstrap servers, as well as ZooKeeper, and they are okay. I have also tried clusters running versions 1.1.0 and 2.2.1, and different client versions in Burrow's config file.
Am I missing a step?
Would you mind sharing your configuration with us? Did you enter the correct port?
Have you tried running a simple telnet test from the host Burrow runs on to the Kafka brokers? Have you checked your inbound and outbound security group rules on AWS?
I would suggest testing those first, and if everything is good at that layer, switch Burrow's logging level to debug; I'm sure it will give you a better understanding of what's going on. A sketch of the relevant config is below.
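For reference, a minimal burrow.toml sketch with hypothetical broker and ZooKeeper addresses, assuming a plaintext listener on 9092 (an MSK cluster with TLS would instead use port 9094 plus a TLS profile); the key names are an assumption based on Burrow's documented config format:
[logging]
level="debug"

[zookeeper]
servers=[ "zk1.example.com:2181" ]

[client-profile.msk]
client-id="burrow-lagcheck"
kafka-version="2.2.1"

[cluster.kafkatestingcluster]
class-name="kafka"
servers=[ "b-1.mycluster.abc123.kafka.us-east-1.amazonaws.com:9092" ]
client-profile="msk"
topic-refresh=120
offset-refresh=30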

Diagnosing Kafka Connection Problems

I have tried to build as much diagnostics into my Kafka connection setup as possible, but it still leads to mystery problems. In particular, the first thing I do is use the Kafka Admin Client to get the clusterId, because if this operation fails, nothing else is likely to succeed.
import scala.util.{Failure, Success, Try}
import org.apache.kafka.clients.admin.DescribeClusterResult

// futureTimeout is a FiniteDuration configured elsewhere in the class.
def getKafkaClusterId(describeClusterResult: DescribeClusterResult): Try[String] = {
  try {
    // Block for at most half the configured timeout waiting for the cluster id future.
    val clusterId = describeClusterResult.clusterId().get(futureTimeout.length / 2, futureTimeout.unit)
    Success(clusterId)
  } catch {
    case cause: Exception =>
      Failure(cause)
  }
}
In testing, this usually works, and everything is fine. It generally only fails when the endpoint is not reachable somehow. It fails because the future times out, so I have no other diagnostics to go by. To test these problems, I usually telnet to the endpoint, for example:
$ telnet blah 9094
Trying blah...
Connected to blah.
Escape character is '^]'.
Connection closed by foreign host.
Generally, if I can telnet to a Kafka broker, I can connect to Kafka from my server. So my questions are:
What does it mean if I can reach the Kafka brokers via telnet but cannot connect via the Kafka Admin Client?
What other diagnostic techniques are there to troubleshoot Kafka broker connection problems?
In this particular case, I am running Kafka on AWS, via a Docker Swarm, and trying to figure out why my server cannot connect successfully. I can see in the broker logs when I try to telnet in, so I know the brokers are reachable. But when my server tries to connect to any of 3 brokers, the logs are completely silent.
This is a good article that explains the steps that happen when you first connect to a Kafka broker:
https://community.hortonworks.com/articles/72429/how-kafka-producer-work-internally.html
If you can telnet to the bootstrap server then it is listening for client connections and requests.
However, clients don't know which real brokers are the leaders for each of the partitions of a topic, so the first request they always send to a bootstrap server is a metadata request to get the full list of topic metadata. The client uses the metadata response from the bootstrap server to learn where it can then make new connections to each of the Kafka brokers hosting the active leaders for each topic partition of the topic it is trying to produce to.
That is where your misconfigured broker problem comes into play. When you misconfigure the advertised.listeners port, the results of the first metadata request redirect the client to connect to unreachable IP addresses or hostnames. It's that second connection that is timing out, not the first one on the port you are telnetting into.
Another way to think of it is that you have to configure a Kafka server to work properly as both a bootstrap server and a regular pub/sub message broker, since it provides both services to clients. Yours are configured correctly as pub/sub servers but incorrectly as bootstrap servers, because the internal and external IP addresses are different in AWS (as they are in Docker containers or behind a NAT or a proxy).
It might seem counterintuitive in small clusters, where your bootstrap servers are often the same brokers that the client eventually connects to, but it is actually a very helpful architectural design that allows Kafka to scale and fail over seamlessly without needing to provide a static list of 20 or more brokers in your bootstrap server list, or to maintain extra load balancers and health checks to know which broker to redirect client requests to. A sketch of the relevant broker settings follows.
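For illustration, the relevant lines in the broker's server.properties might look like this (the EC2 public DNS name is a hypothetical placeholder):
# Bind on all interfaces.
listeners=PLAINTEXT://0.0.0.0:9092
# Advertise an address clients can actually resolve and reach from outside.
advertised.listeners=PLAINTEXT://ec2-54-0-0-1.compute-1.amazonaws.com:9092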
If you do not configure listeners and advertised.listeners correctly, Kafka basically just does not respond to clients. Even though telnet shows the ports you've configured are listening, the Kafka client library fails silently.
I consider this a defect in the Kafka design which leads to unnecessary confusion.
Sharing Anand Immannavar's answer from another question:
Along with ADVERTISED_HOST_NAME, you need to add ADVERTISED_LISTENERS to the container environment.
ADVERTISED_LISTENERS - the broker registers this value in ZooKeeper, and when the outside world wants to connect to your Kafka cluster, it connects over the network you provide in the ADVERTISED_LISTENERS property.
example:
environment:
  - ADVERTISED_HOST_NAME=<Host IP>
  - ADVERTISED_LISTENERS=PLAINTEXT://<Host IP>:9092

Mule cluster configuration with Amazon Cloud (AWS)

I am using Amazon Web Services (AWS) to create Mule server nodes. The issue with AWS is that it doesn't support multicast, but MuleSoft requires all the nodes to be in the same network with multicast enabled for clustering.
Amazon FAQ:
https://aws.amazon.com/vpc/faqs/
Q. Does Amazon VPC support multicast or broadcast?
Ans: No.
The Mule cluster doesn't show a proper heartbeat without multicast enabled; the mule_ee.log file should show:
Cluster OK
Members [2] {
    Member [<IP-Node1>]:5701 this
    Member [<IP-Node2>]:5701
}
but my cluster shows as:
Members [1] {
    Member [<IP-Node1>]:5701 this
}
which is wrong according to MuleSoft standards. I created a sample poll scheduler application and deployed it to the Mule cluster; it runs on both nodes due to the improper handling of the Mule cluster.
But my organization needs AWS to continue with server configuration.
Questions:
1) Is there any other approach where, instead of using a Mule cluster, I can use both Mule server nodes in an HA cluster configuration (active-active)?
2) Is it possible to make one server up and running (active) and the other passive, instead of Mule HA (active-active) mode?
3) CloudHub and Anypoint MQ are deployed on AWS; how did MuleSoft handle the multicast issues with AWS?
According to the MuleSoft support team, they don't advise managing Mule HA in AWS; it doesn't matter whether we are managing it with ARM or MMC.
The Mule instances communicate with each other and guarantee HA, as well as not processing a single request more than once, but that does not work on AWS because latency may cause the instances to disconnect from one another. We would need to have the servers on-prem to have the HA model.
Multicast and unicast are just used for the nodes to be discoverable automatically, as explained further in the documentation (a configuration sketch follows the links below).
Mule cluster config
AWS known limitation: here
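For completeness, unicast discovery is configured in {MULE_HOME}/.mule/mule-cluster.properties. A sketch with hypothetical node IPs, assuming the cluster property names from the Mule HA documentation:
# Hypothetical cluster definition; both nodes list each other's private IPs.
mule.clusterId=my-cluster
mule.clusterNodeId=1
mule.cluster.nodes=172.16.9.24,172.16.9.51
mule.cluster.multicastenabled=false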