Kafka: Understanding Broker failure - amazon-web-services

I have a Kafka cluster with:
2 brokers b-1 and b-2.
2 topics with both: PartitionCount:1 ReplicationFactor:2 min.insync.replicas=1
Here is what happened:
%6|1613807298.974|FAIL|rdkafka#producer-2| [thrd:sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/2: Disconnected (after 3829996ms in state UP)
%3|1613807299.011|FAIL|rdkafka#producer-2| [thrd:sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/2: Connect to ipv4#172.31.18.172:9096 failed: Connection refused (after 36ms in state CONNECT)
%3|1613807299.128|FAIL|rdkafka#producer-2| [thrd:sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/2: Connect to ipv4#172.31.18.172:9096 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
%4|1613807907.225|REQTMOUT|rdkafka#producer-2| [thrd:sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/2: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 1 partially-sent requests
%3|1613807907.225|FAIL|rdkafka#producer-2| [thrd:sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/2: 1 request(s) timed out: disconnect (after 343439ms in state UP)
%5|1613807938.942|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: Timed out ProduceRequest in flight (after 60767ms, timeout #0)
%5|1613807938.943|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: Timed out ProduceRequest in flight (after 60459ms, timeout #1)
%5|1613807938.943|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: Timed out ProduceRequest in flight (after 60342ms, timeout #2)
%5|1613807938.943|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: Timed out ProduceRequest in flight (after 60305ms, timeout #3)
%5|1613807938.943|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: Timed out ProduceRequest in flight (after 60293ms, timeout #4)
%4|1613807938.943|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: Timed out 6 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
%3|1613807938.943|FAIL|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: 6 request(s) timed out: disconnect (after 4468987ms in state UP)
Within the code, I got this error when my producer performed a poll around that time:
2021-02-20 07:59:08,174 - ERROR - Failed to deliver message due to error: KafkaError{code=REQUEST_TIMED_OUT,val=7,str="Broker: Request timed out"}
Broker b-2 logs have this:
[2021-02-20 07:57:24,781] WARN Client session timed out, have not heard from server in 15103ms for sessionid 0x2000190b5d40001 (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:24,782] WARN Client session timed out, have not heard from server in 12701ms for sessionid 0x2000190b5d40000 (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:24,931] INFO Client session timed out, have not heard from server in 12701ms for sessionid 0x2000190b5d40000, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:24,932] INFO Client session timed out, have not heard from server in 15103ms for sessionid 0x2000190b5d40001, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:32,884] INFO Opening socket connection to server INTERNAL_ZK_DNS/INTERNAL_IP. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:32,910] INFO Opening socket connection to server INTERNAL_ZK_DNS/INTERNAL_IP. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:33,032] INFO Socket connection established to INTERNAL_ZK_DNS/INTERNAL_IP, initiating session (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:33,032] INFO Socket connection established to INTERNAL_ZK_DNS/INTERNAL_IP, initiating session (org.apache.zookeeper.ClientCnxn)
My understanding here is that:
(1) b-2 went down, i.e. it was unable to connect to ZooKeeper.
(2) Messages were produced to b-1 successfully during this time.
(3) b-1 was also trying to forward messages to b-2 during this downtime, because the replication factor is set to 2.
(4) All of these forwarded messages (ProduceRequests) timed out after 600s.
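To line up the client and broker logs, it can help to convert the epoch timestamps in the librdkafka lines to wall-clock time. A minimal sketch, assuming the broker and application logs are in UTC (the post does not say); the event labels are only for illustration:

```python
from datetime import datetime, timezone


def epoch_to_utc(ts):
    """Format a Unix epoch timestamp (seconds) as a UTC wall-clock string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


# Epoch timestamps copied from the librdkafka log lines in the question.
events = [
    ("b-2 first disconnect", 1613807298.974),
    ("b-2 request timeout", 1613807907.225),
    ("b-1 ProduceRequest timeouts", 1613807938.943),
]

for label, ts in events:
    print(f"{epoch_to_utc(ts)}  {label}")

# Seconds between the first b-2 disconnect and the b-2 request timeout.
gap_seconds = events[1][1] - events[0][1]
print(f"gap: {gap_seconds:.0f}s")
```

Under that assumption the b-1 timeouts fall at 07:58:58, about ten seconds before the delivery error the application logged at 07:59:08, and the two b-2 events are roughly 608 seconds apart, which is presumably the 600s mentioned in point (4).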
My questions:
Is my understanding correct, and how can I prevent this from happening again?
If I had 3 brokers here, would b-1 have tried to connect to b-3 right away rather than waiting for b-2? Is that a good workaround? (Assuming topic replication factor = 2 everywhere)
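One thing that can be tuned on the client side is how long the producer keeps retrying before it reports a delivery failure. The sketch below assumes the confluent-kafka Python client (the KafkaError format in the log suggests it, but the post does not confirm it). The property names are standard librdkafka settings; the values, the helper names build_producer_config and make_delivery_callback, and the topic name are illustrative assumptions, not recommendations.

```python
def build_producer_config(bootstrap_servers):
    """Return librdkafka producer settings that favour retrying over failing fast."""
    return {
        "bootstrap.servers": bootstrap_servers,
        # Wait for all in-sync replicas to acknowledge each write.
        "acks": "all",
        # Let the client retry without introducing duplicates.
        "enable.idempotence": True,
        # Upper bound for a single request to a broker.
        "request.timeout.ms": 30000,
        # Upper bound for delivering one message, retries included.
        "message.timeout.ms": 300000,
        # Pause between retries of a failed request.
        "retry.backoff.ms": 500,
    }


def make_delivery_callback(failed):
    """Build a delivery callback that records messages that were not delivered."""
    def on_delivery(err, msg):
        # err is None on success; otherwise it describes the failure.
        if err is not None:
            failed.append((msg.key(), err.str()))
    return on_delivery


# Usage with the real client (requires the confluent-kafka package and SASL
# settings for the cluster, which are omitted here):
#   from confluent_kafka import Producer
#   failed = []
#   producer = Producer(build_producer_config("b-1.example:9096,b-2.example:9096"))
#   producer.produce("my-topic", b"payload", on_delivery=make_delivery_callback(failed))
#   producer.flush()
```

Messages that still end up in the failed list after message.timeout.ms has elapsed would have to be re-sent or logged by the application.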

Related

Connection timeout on oauth2.googleapis.com and bigquery.googleapis.com, when trying to tabledata.insertAll

We are making streaming inserts directly into a BigQuery table and are randomly receiving timeouts. The Google Cloud status page doesn't show any problems, and we are respecting the quotas and limits.
Google\Cloud\Core\Exception\ServiceException: cURL error 7: Failed to connect to oauth2.googleapis.com port 443: Connection timed out (see https://curl.haxx.se/libcurl/c/libcurl-errors.html)
Google\Cloud\Core\Exception\ServiceException: cURL error 7: Failed to connect to bigquery.googleapis.com port 443: Connection timed out (see https://curl.haxx.se/libcurl/c/libcurl-errors.html)
Is anyone having the same problem?

Error reconnecting boost beast (asio) websocket and http connection after disconnect

I am creating a client application that connects to a server using an SSL WebSocket connection and an SSL HTTP (Keep-Alive) connection, and I am using the boost::beast package to do so. To detect a dead connection I have implemented a simple ping-pong mechanism. All of this works fine, but an issue comes up when handling a ping-pong failure. The issue is as follows:
To test my code I connected to the remote server, sent a few messages, and then turned off my Wi-Fi. As expected, after a certain period it detected that it had not received any message from the server, and it tried to do an async_shutdown for the HTTP connection and an async_close for the WebSocket connection. The first thing I noticed was that both of these calls block their respective strands until the Wi-Fi is back up.
After the Wi-Fi is back up, the application tries to reset the stream before reconnecting:
void HttpKeepAliveConnection::recreateSocket()
{
    _receivedPongForLastPing = true;
    _sslContext.reset(new boost::asio::ssl::context({boost::asio::ssl::context::sslv23_client}));
    _stream.reset(new HttpStream(_ioContext, *_sslContext));
}
And it resets the ws variable for the WebSocket:
void WebsocketConnection::recreateSocket()
{
    _receivedPongForLastPing = true;
    _sslContext.reset(new boost::asio::ssl::context({boost::asio::ssl::context::sslv23_client}));
    _ws.reset(new WebSocket(_ioContext, *_sslContext));
}
Unfortunately it fails at either on_connect or on_ssl_handshake. Following are my logs:
156 AsioConnectionBase.cpp:53 (2018-08-06 15:34:38.458536) [0x00007ffff601e700] : Started connect sequence. Connection Name: HttpKeepAliveConn
157 AsioConnectionBase.cpp:122 (2018-08-06 15:34:38.459802) [0x00007ffff481b700] : Failed establishing connection to destination. Connection failed. Connection Name: HttpKeepAliveConn. Host: xxxxxxxxx. Port: 443. Error: Operation canceled
158 APIManager.cpp:175 (2018-08-06 15:34:38.459886) [0x00007ffff481b700] : Received error callback from connection. Restarting connection in a sec. Connection Name: HttpKeepAliveConn
159 AsioConnectionBase.cpp:53 (2018-08-06 15:34:39.460009) [0x00007ffff481b700] : Started connect sequence. Connection Name: HttpKeepAliveConn
160 HttpKeepAliveConnection.cpp:32 (2018-08-06 15:34:39.460515) [0x00007ffff481b700] : Failed ssl handshake. Connection failed.Connection Name: HttpKeepAliveConn. Host: xxxxxxxxx. Port: 443. Error: Bad file descriptor
161 APIManager.cpp:175 (2018-08-06 15:34:39.460674) [0x00007ffff481b700] : Received error callback from connection. Restarting connection in a sec. Connection Name: HttpKeepAliveConn
So I have 2 questions:
1. How do we close a connection if the internet is down and a proper TCP close is not possible?
2. Before reconnecting, which variables in boost::beast (or, for that matter, boost::asio, as boost::beast is built on top of asio) need to be reset?
I have been stuck trying to debug this for a couple of hours. Any help is appreciated.
EDIT
So I figured out where I went wrong. Both Alan Birtles and Vinnie Falco were right. The way to close a dead ssl connection after your ping timer has expired (and none of the handlers have returned yet) is
1. In your timer handler, close the lowest layer:
    _stream->lowest_layer().close();
For the WebSocket:
    _ws->lowest_layer().close();
2. Wait for one of your handlers (typically the read handler) to return with an error (typically boost::asio::error::operation_aborted). From there, queue the start of the next reconnect. (Do not queue the reconnect immediately after step 1; that resulted in the memory issues I faced. I know this is asio 101, but it is easy to forget.)
3. To reset the socket, all that is required is for the stream to be reset:
    _stream.reset(new HttpStream(_ioContext, _sslContext));
For the WebSocket:
    _ws.reset(new WebSocket(_ioContext, _sslContext));
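The same order of operations (hard-close the transport, wait for the pending handler to finish, then reconnect with fresh objects) can be illustrated outside Boost. The sketch below is a Python asyncio analogy, not Boost.Beast code; the local echo server and the names handle_echo, read_until_closed, and demo are hypothetical stand-ins for the remote peer and the application's handlers.

```python
import asyncio


async def handle_echo(reader, writer):
    # Hypothetical local echo server standing in for the remote peer.
    try:
        while data := await reader.read(1024):
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass
    finally:
        writer.close()


async def read_until_closed(reader):
    # Pending read "handler": it returns once the connection is gone.
    while await reader.read(1024):
        pass


async def demo():
    server = await asyncio.start_server(handle_echo, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    pending_read = asyncio.create_task(read_until_closed(reader))

    # Step 1: the ping timer expired, so drop the transport without a
    # graceful shutdown (comparable to closing the lowest layer).
    writer.transport.abort()

    # Step 2: wait for the pending read to finish before doing anything else.
    await pending_read

    # Step 3: reconnect with brand-new reader and writer objects.
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"ping")
    await writer.drain()
    reply = await reader.readexactly(4)

    writer.close()
    await writer.wait_closed()
    server.close()
    return reply


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The point of awaiting the pending read before reconnecting is the same as in step 2: nothing from the old connection is still running when the new one is created.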
I don't think asio::ssl::stream can be used again after being closed.
How do we close a connection if internet is down and a proper tcp close is not possible.
Simply allow the socket or stream object to be destroyed.

Connection reset by peer error while using celery stats()

I'm trying to get stats for my Celery queue (RabbitMQ). I'm using the celery.app.control.Inspect().stats() API. I'm doing this on a web server, and I can get the stats only once. If I refresh the page, I get a "[Errno 104] Connection reset by peer" error. How can I deal with this?
/init.py
from celery import Celery

celtasks = Celery(app.name, "rabbit mq url")
/helpers.py
def get_stats():
    # inspect() returns an Inspect instance bound to this app
    stats = celtasks.control.inspect().stats()
    return stats
Whenever there is a request, the get_stats function is hit. It works only for the first request; after that, it fails with the "connection reset by peer" error.
If I assume the connection has been reset and try to create the connection again, I still get the error:
updated /helpers.py
def get_stats():
    # create a new app (and broker connection) on every request
    celtasks = Celery(app.name, "rabbit mq url")
    stats = celtasks.control.inspect().stats()
    return stats
Rabbitmq logs
=WARNING REPORT==== 10-Jul-2017::14:11:54 ===
closing AMQP connection <0.29185.6> (10.246.170.70:48618 -> 10.24.83.115:5672):
connection_closed_abruptly
=WARNING REPORT==== 10-Jul-2017::14:11:54 ===
closing AMQP connection <0.29197.6> (10.246.170.70:48620 -> 10.24.83.115:5672):
connection_closed_abruptly
"rabbit#oser000300.log-20170625" 9054L, 361662C
Most of the time, "Connection reset by peer" means that the server closed the connection on its own while the client was unaware of it. When the client then tries to communicate with the server over this broken connection, it receives this error. In your case, the idle time (the interval between two stats() calls) may be too long, so the server considers the connection useless and closes it.
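If the reset comes from a stale idle connection, one pragmatic workaround is to retry the call once or twice when the error appears. This is a generic sketch, not a Celery feature: call_with_retry is a hypothetical helper, and it assumes that a repeated call ends up on a fresh broker connection, which the post does not confirm.

```python
import time


def call_with_retry(fn, attempts=3, delay=0.5, retry_on=(ConnectionResetError,)):
    """Call fn(), retrying when it raises one of the retry_on exceptions."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on as exc:
            last_exc = exc
            # Pause before the next attempt, but not after the last one.
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_exc
```

In the web handler this could be used as stats = call_with_retry(get_stats), with get_stats being the function from the question.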

Issue in using FTP Connection in Informatica

I am using an FTP connection object in my workflow. However, after running for a few minutes it fails with this error:
A socket [29] failure is encountered: [Connection reset by peer].
Might be getting a timeout on the ftp server after inactivity

Could not write message to OutputStream: com.ctc.wstx.exc.WstxIOException: Connection timed out

We get this error after moving our database to a new server. With the application on the new server and the database on the old server, everything runs fine. But when we move the database to the new server as well, the server log shows this error.
Below is the server log. We are using jboss-5.1.0.GA.
2013-02-22 01:02:31,336 ERROR [org.apache.catalina.core.ContainerBase.[jboss.web].[localhost].[/]] (main) StandardWrapper.Throwable
org.springframework.ws.soap.saaj.SaajSoapMessageException: Could not write message to OutputStream: com.ctc.wstx.exc.WstxIOException: Connection timed out: connect; nested exception is javax.xml.soap.SOAPException: com.ctc.wstx.exc.WstxIOException: Connection timed out: connect
at org.springframework.ws.soap.saaj.SaajSoapMessage.writeTo(SaajSoapMessage.java:169)