Akka Cluster remove heartbeat connection message

What does the INFO message of
FailureDetector(akka://MyCluster) - Remove heartbeat connection [akka://MyCluster#127.0.0.1:35250]
in an Akka cluster mean? I can't seem to find anything in the documentation. I'm seeing this a fair bit when running lots of JVMs with actors on a test machine, but not sure if it's a bad sign requiring some kind of Akka or Linux tuning.
Akka 2.1.4 on Oracle JDK 1.7
Update:
Having followed @cmbaxter's advice, I investigated options for tuning heartbeats. I found that increasing or decreasing the timings associated with heartbeats had no effect on the presence of the 'Remove heartbeat connection' messages. However, I noticed the 'monitored-by-nr-of-members' configuration setting. I now believe the messages indicate that monitoring of heartbeats from a particular node is being handed off from one ActorSystem to another. In other words, the current system is simply stating that monitoring that node is no longer its responsibility, rather than signalling any kind of connectivity problem. Indeed, during system start-up the first node receives a great many 'First heartbeat' messages but then removes most of those connections, as per the 'monitored-by-nr-of-members' setting, as the load is passed to other nodes.

The message you are seeing is coming from the AccrualFailureDetector class in Akka. According to the docs:
The nodes in the cluster monitor each other by sending heartbeats to detect if a node is unreachable from the rest of the cluster. The heartbeat arrival times are interpreted by an implementation of The Phi Accrual Failure Detector.
My guess here is that a cluster node (running locally, on port 35250) has become unreachable enough times that it has been deemed to no longer be part of the cluster. When that happens, the heartbeat check to that node is removed and thus you see this message. If you believe that this node was not unreachable and thus should not have been removed from the cluster heartbeat, then you might have an issue. Take a look at the Cluster Docs here under the Failure Detector section for more info on how to tune the failure detection.
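For reference, these knobs live under akka.cluster.failure-detector in the cluster configuration. Here is a minimal sketch of overriding them programmatically; the values are illustrative only (not recommendations), and key names and defaults can differ between Akka versions, so check the reference.conf of your release:

import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object TunedClusterNode extends App {
  // Illustrative values only; consult your Akka version's reference.conf.
  val tuning = ConfigFactory.parseString("""
    akka.cluster.failure-detector {
      heartbeat-interval         = 1 s
      threshold                  = 8.0
      acceptable-heartbeat-pause = 3 s
      # How many cluster members send heartbeats to any given node.
      monitored-by-nr-of-members = 5
    }
  """)

  val system = ActorSystem("MyCluster", tuning.withFallback(ConfigFactory.load()))
}

Raising acceptable-heartbeat-pause makes the detector more tolerant of late heartbeats, while monitored-by-nr-of-members controls how many nodes share the monitoring duty, which is the hand-off behaviour described in the question's update.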

Related

CHAINLINK NODE - Your node is overloaded and may start missing jobs ERROR

Running a test node in GCP, using Docker 9.9.4, Ubuntu, Postgres db, Infura. I had issues with public/private IP, but once I cleared that up my node is up and running. I am now throwing the error below repeatedly, potentially due to the blockchain connection. How do I fix this?
[ERROR] HeadTracker: dropping head 26085153 with hash 0xf50e19099b7e343829935d70dd7d86c5bc0398286b7a4e4f32ac033ac60c3733 because queue is full. WARNING: Your node is overloaded and may start missing jobs. logger/default.go:155 stacktrace=github.com/smartcontractkit/chainlink/core/logger.Errorf
This log output is related to an overload of your blockchain connection.
This notification is usually related to the use of public websocket connections and/or a free tier of a third-party NaaS provider. To fix this connection issue you can either run your own full node, change the tier of your third-party NaaS provider, or switch providers. It is also recommended to use Chainlink version 0.10.8 or higher, as the HeadTracker has been revised there and performs more efficiently.
With regard to the question, let me try to give you a small technical overview, which may clarify the load a Chainlink node puts on its remote full node:
Your Chainlink node establishes a connection to a full node. Over that connection the Chainlink node initiates various subscriptions, which are a special feature of the websocket protocol that enables bidirectional communication. More precisely, this means that the Chainlink node is informed whenever a certain "state" of the subscription changes. Basically, the node interacts with the full node via JSON-RPC and uses the following methods to initiate and process various functions internally:
eth_getBlockByNumber, eth_getBalance, eth_getTransactionReceipt, eth_getTransactionCount, eth_getLogs, eth_subscribe, eth_unsubscribe, eth_sendRawTransaction and eth_call
https://ethereum.org/uk/developers/docs/apis/json-rpc/
The bulk of the Chainlink node's interactions happen during the syncing process, via the internal HeadTracker service. This service initiates a "head" subscription in order to react to every single incoming new block header.
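To make that concrete, here is a minimal, hypothetical sketch of such a "newHeads" subscription. It is written with the Akka HTTP WebSocket client purely for illustration (Chainlink's own HeadTracker is Go code, not this); it assumes akka-http is on the classpath, the endpoint URL is a placeholder, and a real client would parse the JSON frames instead of printing them:

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.ws.{Message, TextMessage, WebSocketRequest}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}

object NewHeadsSubscription extends App {
  implicit val system: ActorSystem = ActorSystem("eth-ws")
  implicit val mat: ActorMaterializer = ActorMaterializer() // not needed on Akka 2.6+

  // Standard JSON-RPC request to subscribe to new block headers.
  val subscribe = TextMessage(
    """{"jsonrpc":"2.0","id":1,"method":"eth_subscribe","params":["newHeads"]}""")

  // Print every frame the node pushes back: first the subscription id,
  // then one notification per new block header.
  val wsFlow: Flow[Message, Message, Any] =
    Flow.fromSinkAndSource(
      Sink.foreach[Message](msg => println(s"from node: $msg")),
      Source.single(subscribe).concat(Source.maybe[Message]) // keep the socket open
    )

  // "wss://your-node:8546" is a placeholder endpoint.
  Http().singleWebSocketRequest(WebSocketRequest("wss://your-node:8546"), wsFlow)
}

Every frame received on this subscription is what the HeadTracker reacts to, so the per-block JSON-RPC calls described next sit on top of this baseline traffic.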
During this syncing process it uses the JSON-RPC methods eth_getBlockByNumber and eth_getBalance to get all the necessary information from the block, so these two methods are executed for every block. The number of requests therefore depends on the average block time of the network the Chainlink node is connected to.
An example would be the Kovan Testnet:
The average block time there is 6.7 seconds, which means you get a daily request count of approximately 21,000.
While fulfilling job requests, the node also issues the following methods: eth_getTransactionReceipt, eth_sendRawTransaction, eth_getLogs, eth_subscribe, eth_unsubscribe, eth_getTransactionCount and eth_call, which increases the total significantly depending on the number of job requests.
It should also be noted that faster blockchains (e.g. Polygon) in particular put a very high load on the websocket connection, so you need to pay close attention to the quality of the full node connection, as many full nodes cannot sustain such a high number of requests permanently.

Why does the gossip protocol in Akka need to deliver its state twice for the state change to be registered?

I am having trouble understanding the cluster algorithm used in Akka.
In the description of the Akka Gossip Protocol it says that:
The recipient of the gossip state or the gossip status can use the gossip version (vector clock) to determine whether:
it has a newer version of the gossip state, in which case it sends that back to the gossiper
it has an outdated version of the state, in which case the recipient requests the current state from the gossiper by sending back its version of the gossip state
it has conflicting gossip versions, in which case the different versions are merged and sent back
Step two seems like a waste of communication, as the gossiper's state is sent twice: once when it is noticed that the recipient does not have the newest version, and again when the recipient requests the newest version by sending its own outdated version back.
I think I am misunderstanding this because my understanding of vector clocks and CRDTs is limited, the description given in the Akka documentation is short, and the Wikipedia article is too advanced. As far as I can interpret it, a vector clock is an implementation of a CRDT, but that might be incorrect.
But in the end I don't understand why the gossip node needs to communicate its state twice. Please clarify.
As stated in the Akka Cluster documentation, Akka implements:
A variation of push-pull gossip is used to reduce the amount of gossip information sent around the cluster. In push-pull gossip a digest is sent representing current versions but not actual values; the recipient of the gossip can then send back any values for which it has newer versions and also request values for which it has outdated versions.
Thus, simplifying: in case 2, when the recipient of a gossip message sees that it has an outdated version of the cluster state, it asks the gossiper for the latest state.
So the first message from the gossiper to the recipient carries only the versions, while the second one carries the actual states of the nodes.
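If it helps, here is a deliberately simplified sketch of that decision, assuming a toy VectorClock/GossipState model; the names and types are illustrative and are not Akka's internal API. The point is that the digest that travels first is just the clock, and the full state is sent at most once, by whichever side turns out to have the newer version:

object GossipSketch {
  final case class VectorClock(versions: Map[String, Long])
  final case class GossipState(clock: VectorClock, members: Set[String])

  sealed trait Comparison
  case object Newer    extends Comparison // my clock dominates theirs
  case object Older    extends Comparison // their clock dominates mine
  case object Same     extends Comparison
  case object Conflict extends Comparison // concurrent updates, neither dominates

  def compare(mine: VectorClock, theirs: VectorClock): Comparison = {
    val keys  = mine.versions.keySet ++ theirs.versions.keySet
    val diffs = keys.toSeq.map { k =>
      mine.versions.getOrElse(k, 0L) compare theirs.versions.getOrElse(k, 0L)
    }
    if (diffs.forall(_ == 0)) Same
    else if (diffs.forall(_ >= 0)) Newer
    else if (diffs.forall(_ <= 0)) Older
    else Conflict
  }

  sealed trait Reply
  final case class SendMyState(state: GossipState)         extends Reply
  final case class RequestTheirState(myClock: VectorClock) extends Reply
  case object NothingToSend                                 extends Reply

  // What a node does when it receives only the gossiper's vector clock (the digest).
  def onDigest(myState: GossipState, remoteClock: VectorClock): Reply =
    compare(myState.clock, remoteClock) match {
      // Cases 1 and 3: I hold something the gossiper lacks, so my actual state
      // goes back; in the conflicting case the gossiper merges the two versions.
      case Newer | Conflict => SendMyState(myState)
      // Case 2: I am outdated, so only my (small) clock goes back as a request
      // for the gossiper's current state -- the full state never travels twice.
      case Older            => RequestTheirState(myState.clock)
      case Same             => NothingToSend
    }
}

So the only time the full state crosses the wire is in the reply that actually delivers it to the side that needs it.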
Hope this helps.

Kafka and Akka Cluster

Following is my use case
Bunch of applications enqueue messages in Kafka under different topics.
Have a consumer for each topic distribute the work to a worker in a cluster. The work can be classified as long-running, memory-intensive, simple, etc., and the worker is chosen accordingly.
This has me exploring Akka cluster for work distribution, routing and scaling. I can use Akka "Supervisor" as a Kafka consumer and assign incoming work to the appropriate worker based on its classification.
But what I am still trying to understand is the correct way to implement resilient communication between the supervisor and the workers in the Akka cluster, because as soon as the supervisor consumes a message from Kafka, the Kafka offset is committed. If some error happens in processing after the offset commit, is the following an acceptable way to recover and start from where it last left off?
Make the supervisor a persistent actor by using a durable mailbox backed by Kafka. The supervisor enqueues work in Kafka, and the worker gets its work from Kafka and commits its offset only after completing the work.
As said by Jaakko, it really depends on the third-party library you are using.
As far as I'm concerned I have successfully used Akka Streams Kafka although I did enable offset auto-commit.
However, this library may meet your needs since it allows you to customize offset commit (see sections External Offset Storage and Offset Storage in Kafka).
The documentation says:
The Consumer.committableSource makes it possible to commit offset positions to Kafka. Compared to auto-commit this gives exact control of when a message is considered consumed.
In order to disable auto-commit, you have to complete your Akka application.conf file by adding an akka.kafka.consumer section:
akka.kafka.consumer {
  # Properties defined by org.apache.kafka.clients.consumer.ConsumerConfig
  # can be defined in this configuration section.
  kafka-clients {
    # Disable auto-commit by default
    enable.auto.commit = false
  }
}
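With auto-commit disabled, the consuming stream decides when each offset is committed. Below is a rough sketch against the 0.16-era API (the bootstrap servers, group id, topic name and the dispatchToWorker stub are placeholders of mine, not code from the library's documentation); later versions of the library express the same idea through Committer.sink:

import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer

import scala.concurrent.Future

object WorkConsumer extends App {
  implicit val system: ActorSystem = ActorSystem("supervisor")
  implicit val mat: ActorMaterializer = ActorMaterializer()
  import system.dispatcher

  val consumerSettings =
    ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092")
      .withGroupId("work-supervisor")
      .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

  // Placeholder for handing the record to a worker and waiting for the result.
  def dispatchToWorker(payload: String): Future[Unit] = Future.successful(())

  Consumer
    .committableSource(consumerSettings, Subscriptions.topics("work-topic"))
    .mapAsync(parallelism = 1) { msg =>
      // The offset is only surfaced once the work has actually completed.
      dispatchToWorker(msg.record.value()).map(_ => msg.committableOffset)
    }
    .mapAsync(parallelism = 1)(_.commitScaladsl())
    .runWith(Sink.ignore)
}

Because the offset is committed only after dispatchToWorker's Future completes, a crash before that point means the record is redelivered on restart rather than lost.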
The latest version of akka-stream-kafka_2.11 (0.16) is compatible with Akka 2.5.x, but you have to override the akka-stream_2.11 dependency with the one from the Akka toolkit. I am currently using this library with Akka 2.5.3 and it works really well.
Hope you will find what you are looking for :)

Why is Google Dataproc HDFS Name Node in Safemode?

I am trying to write to an HDFS directory at hdfs:///home/bryan/test_file/ by submitting a Spark job to a Dataproc cluster.
I get an error that the Name Node is in safe mode. I have a solution to get it out of safe mode, but I am concerned this could be happening for another reason.
Why is the Dataproc cluster in safe mode?
ERROR org.apache.spark.streaming.scheduler.JobScheduler: Error running job streaming job 1443726448000 ms.0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /home/bryan/test_file/_temporary/0. Name node is in safe mode.
The reported blocks 125876 needs additional 3093 blocks to reach the threshold 0.9990 of total blocks 129098.
The number of live datanodes 2 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
What safe mode means
The NameNode is in safemode until data nodes report on which blocks are online. This is done to make sure the NameNode does not start replicating blocks even though there is (actually) sufficient (but unreported) replication.
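As a sanity check, the numbers in the error above are consistent with this: the threshold is 0.9990 of the 129098 total blocks, i.e. roughly 128969 blocks must be reported before safe mode lifts, and 128969 - 125876 = 3093, exactly the number of additional blocks the NameNode says it is still waiting for.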
Why this happened
Generally this should not occur with a Dataproc cluster as you describe. In this case I'd suspect a virtual machine in the cluster did not come online properly or ran into an issue (networking or otherwise) and, therefore, the cluster never left safe mode. The bad news is that this means the cluster is in a bad state. Since Dataproc clusters are quick to start, I'd recommend you delete the cluster and create a new one. The good news is that these errors should be quite uncommon.
The reason is probably that you started the master node (housing the namenode) before starting the workers. If you shut down all the nodes, start the workers first and then start the master node, it should work. I suspect that when the master node starts first, it checks whether the workers are there; if they are offline, it goes into safe mode. In general this should not happen because of the heartbeat mechanism, but it is what it is, and a restart of the master node will resolve the matter. In my case it was with Spark on Dataproc.
HTH

Occasional high latency in qpid application

I'm hoping someone can help me with an issue I'm seeing with a Qpid C++ application I'm using. Essentially, we have one application publishing a status to a last_value_queue at about a 10Hz rate and a couple other applications continuously processing this status. The receivers also use the status as a kind of heartbeat and will complain if the status message isn't updated for a certain amount of time (500ms, to be exact.)
This works fine for about a day, after which we start seeing issues. Every couple hours, a single fetch call by a receiver will block for over 500ms (sometimes for up to 900ms.) This behavior will continue until we restart the broker.
I'm no expert, but I don't think I'm doing anything particularly dumb. I've been able to repeat this behavior with a pair of small applications that connect to the broker. Every 100ms the sender sends a std::chrono::time_point object set to the current time. The receiver fetches the message and calculates the delay to the millisecond. The delay is always 0ms or 1ms, except for the single spikes every hour or so after the initial day of everything being happy. The connection is created like so:
qpid::messaging::Connection c("host1:5672","{ reconnect: true}");
and the sender and receiver are both created with the string
"testQueue; { mode: browse, create: always, node: { type: queue, x-declare:{ arguments:{'qpid.last_value_queue_key':'key','qpid.replicate':'none'}}}}"
High availability replication is enabled on the broker, but I have it explicitly disabled for everything for the purpose of my testing. I see no difference in behavior when the broker and apps are running on the same host or different hosts on the LAN. Using qpid-stat, I can see that the broker replication queue is still transmitting quite a bit of data, but its message count is always at 0 so I don't think it's sending more than it can handle. Can anyone think of anything I might be missing that could cause this behavior? We're using the Qpid 0.26 and the C++ broker.