Sawtooth Transaction Processor message - blockchain

I am getting an error like:
Did not respond to the ping, removing transaction processor.
Can anyone explain what this error means, or whether there is a problem with my setup?

This is a message from the Hyperledger Sawtooth blockchain's Validator.
The Validator periodically pings all registered transaction processors to check their connections. If a transaction processor does not respond in time, it is removed from the list.
Some possible causes:
- The transaction processor (TP) process died. Check that it is still running (inside its Docker container, if you are running Docker).
- Network connectivity was lost. Check the connection if the TP is on another host or another virtual machine.
- The TP is "frozen", hanging, or has a bug. Check the message logs, and add logging statements (using LOGGER) to narrow down where it stalls, as in the sketch below.
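For that last point, here is a minimal sketch of a Python TP with logging wired in. The family name, namespace prefix, and validator URL are placeholders; the handler shape follows the Sawtooth Python SDK:

import logging

from sawtooth_sdk.processor.core import TransactionProcessor
from sawtooth_sdk.processor.handler import TransactionHandler

LOGGER = logging.getLogger(__name__)

class MyHandler(TransactionHandler):
    @property
    def family_name(self):
        return 'my-family'  # placeholder family name

    @property
    def family_versions(self):
        return ['1.0']

    @property
    def namespaces(self):
        return ['abcdef']  # placeholder 6-hex-char namespace prefix

    def apply(self, transaction, context):
        # Log entry and exit so a hang inside apply() shows up in the logs;
        # a TP stuck here may also be unable to answer the Validator's pings.
        LOGGER.debug('apply() called; payload is %d bytes',
                     len(transaction.payload))
        # ... transaction logic goes here ...
        LOGGER.debug('apply() finished')

if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    # Use tcp://validator:4004 (or similar) when the TP runs in Docker.
    processor = TransactionProcessor(url='tcp://localhost:4004')
    processor.add_handler(MyHandler())
    processor.start()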

Related

Worker role using event hubs gives 'No connection handler was found for virtual host'

I have a worker role that uses an EventProcessorHost to ingest data from an EventHub. I frequently receive error messages of the following kind:
Microsoft.ServiceBus.Messaging.MessagingCommunicationException:
No connection handler was found for virtual host 'myservicebusnamespace.servicebus.windows.net:42777'. Remote container id is 'f37c72ee313c4d658588ad9855773e51'. TrackingId:1d200122575745cc89bb714ffd533b6d_B5_B5, SystemTracker:SharedConnectionListener, Timestamp:8/29/2016 6:13:45 AM
at Microsoft.ServiceBus.Common.ExceptionDispatcher.Throw(Exception exception)
at Microsoft.ServiceBus.Common.Parallel.TaskHelpers.EndAsyncResult(IAsyncResult asyncResult)
at Microsoft.ServiceBus.Messaging.IteratorAsyncResult`1.StepCallback(IAsyncResult result)
I can't seem to find a way to catch this exception. It seems I can safely ignore the error, because everything works as expected. (I had previously mentioned here that messages were being dropped because of this error, but I have since found that a bug in the software that sends the messages caused that problem.) However, I would like to know what causes these errors, since they clog up my logging now and then.
Can anyone shed some light on the cause?
The Event Hub partitions are distributed across multiple servers. They sometimes move due to load balancing, upgrades, and other reasons. When this happens, the client connection is lost with this error. The connection is re-established very quickly, so you should not see any issues with message processing. It is safe to ignore this communication error.

Occasional high latency in qpid application

I'm hoping someone can help me with an issue I'm seeing with a Qpid C++ application I'm using. Essentially, we have one application publishing a status to a last_value_queue at about a 10Hz rate and a couple other applications continuously processing this status. The receivers also use the status as a kind of heartbeat and will complain if the status message isn't updated for a certain amount of time (500ms, to be exact.)
This works fine for about a day, after which we start seeing issues. Every couple hours, a single fetch call by a receiver will block for over 500ms (sometimes for up to 900ms.) This behavior will continue until we restart the broker.
I'm no expert, but I don't think I'm doing anything particularly dumb. I've been able to repeat this behavior with a pair of small applications that connect to the broker. Every 100ms the sender sends a std::chrono::time_point object set to the current time. The receiver fetches the message and calculates the delay to the millisecond. The delay is always 0ms or 1ms, except for the single spikes every hour or so after the initial day of everything being happy. The connection is created like so:
qpid::messaging::Connection c("host1:5672","{ reconnect: true}");
and the sender and receiver are both created with the string
"testQueue; { mode: browse, create: always, node: { type: queue, x-declare:{ arguments:{'qpid.last_value_queue_key':'key','qpid.replicate':'none'}}}}"
High-availability replication is enabled on the broker, but I have it explicitly disabled for everything in my testing. I see no difference in behavior whether the broker and apps run on the same host or on different hosts on the LAN. Using qpid-stat, I can see that the broker's replication queue is still transmitting quite a bit of data, but its message count is always 0, so I don't think it's sending more than it can handle. Can anyone think of anything I might be missing that could cause this behavior? We're using Qpid 0.26 and the C++ broker.
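For illustration, here is a rough Python analogue of that two-app repro, using the qpid.messaging bindings. The host and queue address are taken from the question; the 'status' key value is arbitrary. Run one copy with "send" as the argument and one with anything else:

import sys
import time

from qpid.messaging import Connection, Message

# Address string copied from the question.
ADDRESS = ("testQueue; { mode: browse, create: always, node: { type: queue, "
           "x-declare: { arguments: { 'qpid.last_value_queue_key': 'key', "
           "'qpid.replicate': 'none' }}}}")

connection = Connection("host1:5672", reconnect=True)
connection.open()
session = connection.session()

if sys.argv[1] == "send":
    sender = session.sender(ADDRESS)
    while True:
        # The 'key' property must match qpid.last_value_queue_key above.
        sender.send(Message(content=str(time.time()),
                            properties={"key": "status"}))
        time.sleep(0.1)  # ~10 Hz, as in the question
else:
    receiver = session.receiver(ADDRESS)
    while True:
        message = receiver.fetch()  # blocks until the next update; time it
        # Assumes sender and receiver clocks agree (e.g. same host).
        delay_ms = (time.time() - float(message.content)) * 1000.0
        if delay_ms > 500:
            print("spike: %.0f ms" % delay_ms)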

Spark - Remote Akka Client Disassociated

I am setting up Spark 0.9 on AWS and am finding that when launching the interactive Pyspark shell, my executors / remote workers are first being registered:
14/07/08 22:48:05 INFO cluster.SparkDeploySchedulerBackend: Registered executor:
Actor[akka.tcp://sparkExecutor@ip-xx-xx-xxx-xxx.ec2.internal:54110/user/
Executor#-862786598] with ID 0
and then disassociated almost immediately, before I have the chance to run anything:
14/07/08 22:48:05 INFO cluster.SparkDeploySchedulerBackend: Executor 0 disconnected,
so removing it
14/07/08 22:48:05 ERROR scheduler.TaskSchedulerImpl: Lost an executor 0 (already
removed): remote Akka client disassociated
Any idea what might be wrong? I've tried adjusting the JVM options spark.akka.frameSize and spark.akka.timeout, but I'm pretty sure this is not the issue since (1) I'm not running anything to begin with, and (2) my executors are disconnecting a few seconds after startup, which is well within the default 100s timeout.
Thanks!
Jack
I had a very similar problem, if not the same one.
It started working for me once the workers connected to the master using the very same name the master thought it had.
My log messages were something like:
ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@idc1-hrm1.heylinux.com:7078] -> [akka.tcp://sparkMaster@vagrant-centos64.vagrantup.com:7077]: Error [Association failed with [akka.tcp://sparkMaster@vagrant-centos64.vagrantup.com:7077]].
ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@192.168.121.127:7078] -> [akka.tcp://sparkMaster@idc1-hrm1.heylinux.com:7077]: Error [Association failed with [akka.tcp://sparkMaster@idc1-hrm1.heylinux.com:7077]]
WARN util.Utils: Your hostname, idc1-hrm1 resolves to a loopback address: 127.0.0.1; using 192.168.121.187 instead (on interface eth0)
So check the log of the master and see what name it thinks it has.
Then use that very same name on the workers.
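In Spark 0.9's standalone mode, that usually comes down to pinning the master's advertised name with SPARK_MASTER_IP in conf/spark-env.sh and starting each worker against exactly that name. A sketch (the host name here is hypothetical):

# conf/spark-env.sh on the master: the name the master binds to and advertises
SPARK_MASTER_IP=master.example.com

# on each worker, connect using exactly that name
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://master.example.com:7077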

How to tolerate RabbitMQ restarts in Langohr?

We have Clojure code which reads from a Rabbit queue. We would like to tolerate the case where the RabbitMQ server is down briefly, e.g. in the case of a restart (sudo service rabbitmq-server restart).
There appears to be some provision for reconnecting in Langohr. We adapted the example clojurewerkz.langohr.examples.recovery.example1 (Gist here). Slight differences vs. the published example include the connection parameters, and the removal of the lb/publish call (since we're filling the data with an external source).
We can successfully consume data from the queue and wait for more messages. However, when we restart RMQ (via the above sudo command on the VM hosting RabbitMQ), the following exception is thrown:
Caught an exception during connection recovery!
java.io.IOException
at com.rabbitmq.client.impl.AMQChannel.wrap(AMQChannel.java:106)
at com.rabbitmq.client.impl.AMQChannel.wrap(AMQChannel.java:102)
at com.rabbitmq.client.impl.AMQConnection.start(AMQConnection.java:378)
at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:516)
at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:545)
at com.novemberain.langohr.Connection.recoverConnection(Connection.java:166)
at com.novemberain.langohr.Connection.beginAutomaticRecovery(Connection.java:115)
at com.novemberain.langohr.Connection.access$000(Connection.java:18)
at com.novemberain.langohr.Connection$1.shutdownCompleted(Connection.java:93)
at com.rabbitmq.client.impl.ShutdownNotifierComponent.notifyListeners(ShutdownNotifierComponent.java:75)
at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:573)
Caused by: com.rabbitmq.client.ShutdownSignalException: connection error; reason: java.io.EOFException
at com.rabbitmq.utility.ValueOrException.getValue(ValueOrException.java:67)
at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:33)
at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:343)
at com.rabbitmq.client.impl.AMQConnection.start(AMQConnection.java:321)
... 8 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readUnsignedByte(DataInputStream.java:273)
at com.rabbitmq.client.impl.Frame.readFrom(Frame.java:95)
at com.rabbitmq.client.impl.SocketFrameHandler.readFrame(SocketFrameHandler.java:131)
at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:533)
It seems likely that Langohr's intended recovery mechanism is itself breaking when it kicks in. Is there an alternative pattern that is preferred for these "hard" restarts? Otherwise, I suppose we have to implement connection monitoring and retries ourselves. Any suggestions would be most welcome.
We used to see such stack traces, but we no longer see them with Langohr 2.9.0. After a restart, our Clojure clients reconnect and messages start flowing again.
We are using the defaults, which have connection and topology recovery turned on, as shown by this:
(infof "Automatic recovery enabled? %s" (rmq/automatic-recovery-enabled? connection))
(infof "Topology recovery enabled? %s" (rmq/automatic-topology-recovery-enabled? connection))

Akka Cluster remove heartbeat connection message

What does the INFO message of
FailureDetector(akka://MyCluster) - Remove heartbeat connection [akka://MyCluster@127.0.0.1:35250]
in an Akka cluster mean? I can't seem to find anything in the documentation. I'm seeing this a fair bit when running lots of JVMs with actors on a test machine, but not sure if it's a bad sign requiring some kind of Akka or Linux tuning.
Akka 2.1.4 on Oracle JDK 1.7
Update:
Having followed @cmbaxter's advice, I investigated options for tuning heartbeats. I found that increasing or decreasing the timings associated with heartbeats had no effect on the presence of the 'Remove heartbeat connection' messages. However, I noticed the 'monitored-by-nr-of-members' configuration setting. I now believe the messages indicate that responsibility for monitoring a particular node's heartbeats is being handed off from one ActorSystem to another. They are simply the current system stating that the monitoring is no longer its responsibility, not a connectivity warning. Indeed, during system start-up the first node receives a great many 'First heartbeat's but then removes most of them, per the 'monitored-by-nr-of-members' setting, as the load is passed to other nodes.
The message you are seeing is coming from the AccrualFailureDetector class in Akka. According to the docs:
The nodes in the cluster monitor each other by sending heartbeats to detect if a
node is unreachable from the rest of the cluster. The heartbeat arrival times are
interpreted by an implementation of The Phi Accrual Failure Detector.
My guess here is that a cluster node (running locally, on port 35250) has become unreachable enough times that it has been deemed to no longer be part of the cluster. When that happens, the heartbeat check to that node is removed and thus you see this message. If you believe that this node was not unreachable and thus should not have been removed from the cluster heartbeat, then you might have an issue. Take a look at the Cluster Docs here under the Failure Detector section for more info on how to tune the failure detection.
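For reference, those settings live under akka.cluster.failure-detector in your configuration. The values below are approximately the defaults of the Akka 2.1 era, shown only to indicate where the tuning happens; check the reference.conf of your exact version before relying on them:

akka.cluster.failure-detector {
  heartbeat-interval = 1 s           # how often heartbeats are sent
  threshold = 8.0                    # phi value above which a node is marked unreachable
  acceptable-heartbeat-pause = 3 s   # tolerated silence before phi starts to climb
  monitored-by-nr-of-members = 5     # how many peers monitor each node's heartbeats
}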