Worker role using event hubs gives 'No connection handler was found for virtual host' - azure-eventhub

I have a worker role that uses an EventProcessorHost to ingest data from an EventHub. I frequently receive error messages of the following kind:
Microsoft.ServiceBus.Messaging.MessagingCommunicationException:
No connection handler was found for virtual host 'myservicebusnamespace.servicebus.windows.net:42777'. Remote container id is 'f37c72ee313c4d658588ad9855773e51'. TrackingId:1d200122575745cc89bb714ffd533b6d_B5_B5, SystemTracker:SharedConnectionListener, Timestamp:8/29/2016 6:13:45 AM
at Microsoft.ServiceBus.Common.ExceptionDispatcher.Throw(Exception exception)
at Microsoft.ServiceBus.Common.Parallel.TaskHelpers.EndAsyncResult(IAsyncResult asyncResult)
at Microsoft.ServiceBus.Messaging.IteratorAsyncResult`1.StepCallback(IAsyncResult result)
I can't seem to find a way to catch this exception. It seems I can just ignore the error, because everything works as expected. (I had previously mentioned here that messages were being dropped because of this error, but I have since found out that a bug in the software that sends the messages caused that problem.) However, I would like to know what causes these errors, since they clog up my logging every now and then.
Can anyone shed some light on the cause?

The Event Hub partitions are distributed across multiple servers. They sometimes move due to load balancing, upgrades, and other reasons. When this happens, the client connection is lost with this error. The connection will be re-established very quickly, so you should not see any issues with message processing. It is safe to ignore this communication error.
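If the remaining concern is just the logging noise, one option in the newer SDKs is to route such transient errors to a low-severity log in the consumer's error callback. Below is a minimal sketch with the Python azure-eventhub SDK; it is an illustration only, not the asker's .NET EventProcessorHost setup, and the connection string and event hub name are placeholders:
import logging
from azure.eventhub import EventHubConsumerClient

logger = logging.getLogger("eventhub")

def on_event(partition_context, event):
    # Normal processing; checkpointing omitted for brevity.
    logger.info("Received event from partition %s", partition_context.partition_id)

def on_error(partition_context, error):
    # Transient link errors like the one above are retried by the client itself;
    # log them at DEBUG so they do not clog the error log.
    pid = partition_context.partition_id if partition_context else "N/A"
    logger.debug("Transient Event Hubs error on partition %s: %r", pid, error)

client = EventHubConsumerClient.from_connection_string(
    "<EVENT_HUB_CONNECTION_STRING>",   # placeholder
    consumer_group="$Default",
    eventhub_name="<EVENT_HUB_NAME>",  # placeholder
)

with client:
    client.receive(on_event=on_event, on_error=on_error)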

Related

MessageReceiver2topic/Subscriptions/mutationProcessor' is force detached

We are having issues with an import process, and after investigation it looks like it's related to issues with Service Bus.
Exception #1
The link '678ed7a8-31d5-4236-8cc1-a316c3329943;42:47:48:source(address:topic/Subscriptions/mutationProcessor):MessageReceiver2topic/Subscriptions/mutationProcessor' is force detached. Code: RenewToken. Details: Unauthorized access. 'Listen' claim(s) are required to perform this operation. Resource: 'sb://asb.servicebus.windows.net/topic/subscriptions/mutationprocessor'.. TrackingId:b352c1a9858b497b869a58a7be09ae2a_G12, SystemTracker:gateway7, Timestamp:2020-07-27T15:18:02
Expectation:
What happens to messages that are sent while the subscription(s) is detached?
Workarounds/fixes
GitHub links for reference:
https://github.com/Azure/azure-sdk-for-net/issues/11619
https://github.com/Azure/azure-sdk-for-net/issues/8884
Exception #2
The connection was inactive for more than the allowed 60000 milliseconds and is closed by container '1c7fe518-491a-47dd-aa5e-5ae96f0245df'.
Below is a link to Microsoft documentation which describes a similar, but not identical, message.
https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-amqp-troubleshoot
Expectation:
Would this cause any issues if messages were sent while the connection is inactive? If yes, please suggest a way to handle it.
GitHub links for reference:
https://github.com/Azure/azure-service-bus-java/issues/280
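Neither linked thread has an accepted answer, but as a rough illustration of one way to handle the second exception (an idle receive link being closed), a receive loop can catch connection-level errors and recreate the receiver, while treating the missing 'Listen' claim as a fatal configuration error. This is a sketch assuming the Python azure-servicebus v7 SDK; the connection string, topic, and subscription names are placeholders:
import time
import logging
from azure.servicebus import ServiceBusClient
from azure.servicebus.exceptions import ServiceBusAuthorizationError, ServiceBusError

logger = logging.getLogger("servicebus")

CONN_STR = "<SERVICE_BUS_CONNECTION_STRING>"  # the SAS policy must include the Listen claim
TOPIC, SUBSCRIPTION = "topic", "mutationProcessor"  # placeholders taken from the error text

while True:
    try:
        with ServiceBusClient.from_connection_string(CONN_STR) as client:
            with client.get_subscription_receiver(TOPIC, SUBSCRIPTION) as receiver:
                for message in receiver:
                    # process(message) ...
                    receiver.complete_message(message)
    except ServiceBusAuthorizationError:
        # Missing 'Listen' claim: fix the SAS policy or role assignment; retrying will not help.
        raise
    except ServiceBusError as exc:
        # Detached link or idle-timeout disconnect: log and reconnect.
        logger.warning("Service Bus connection dropped, reconnecting: %r", exc)
        time.sleep(5)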

Random “upstream connect error or disconnect/reset before headers” between services with Istio 1.3

This problem is happening randomly (it seems) and between different services.
For example, we have a service A which needs to talk to service B, and sometimes we get this error, but after a while the error goes away. This error doesn't happen too often.
When this happens, we see the error log in service A throwing the “upstream connect error” message, but none in service B. So we think it might be related to the sidecars.
One thing we noticed is that in service B we get a lot of these error messages in the istio-proxy container:
[src/istio/mixerclient/report_batch.cc:109] Mixer Report failed with: UNAVAILABLE:upstream connect error or disconnect/reset before headers. reset reason: connection failure
According to the documentation, when a request comes in, Envoy asks Mixer if everything is good (authorization and other things), and if Mixer doesn’t reply, the request does not succeed. That’s why there is an option called policyCheckFailOpen.
We have that set to false, which I guess is a sane default; we don’t want the request to go through if Mixer cannot be reached, but why can’t it be reached?
disablePolicyChecks: true
policyCheckFailOpen: false
controlPlaneSecurityEnabled: false
NOTE: istio-policy is running with the istio-proxy sidecar. Is that correct?
We don’t see that error in some other service which can also fail.
Another log entry that I see a lot, and this one happens in all the services not running as root with fsGroup defined in the YAML files, is:
watchFileEvents: "/etc/certs": MODIFY|ATTRIB
watchFileEvents: "/etc/certs/..2020_02_10_09_41_46.891624651": MODIFY|ATTRIB
watchFileEvents: notifying
One of the leads I'm chasing is about the default circuitBreakers values. Could that be related to this?
Thanks
The error you are seeing is caused by a failure to establish a connection to istio-policy.
Based on this GitHub issue, community members added two answers which could help with your issue:
If mTLS is enabled globally, make sure you set controlPlaneSecurityEnabled: true.
I was facing the same issue; then I read about protocol selection. I realised that the name of the port in the service definition should start with, for example, http- (see the sketch after this answer). This fixed the issue for me. If you still face the issue, you might need to look at the tls-check for the pods and resolve it using destination rules and policies.
istio-policy is running with the istio-proxy sidecar. Is that correct?
Yes, I just checked it and it's with sidecar.
Let me know if that helps.
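To make the protocol-selection point above concrete: in Istio 1.3 the Service port name must carry a protocol prefix such as http-. Purely as a hypothetical illustration, built with the official kubernetes Python client to show the shape of the manifest (the service and port names are made up):
from kubernetes import client

service = client.V1Service(
    metadata=client.V1ObjectMeta(name="service-b"),  # hypothetical name
    spec=client.V1ServiceSpec(
        selector={"app": "service-b"},
        ports=[
            # "http-web" rather than just "web": the http- prefix is what lets
            # Istio 1.3 select the right protocol for the sidecar.
            client.V1ServicePort(name="http-web", port=80, target_port=8080),
        ],
    ),
)

print(client.ApiClient().sanitize_for_serialization(service))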

[Amazon](500150) Error setting/closing connection: Connection refused

I have a Glue script which is supposed to write its results to a Redshift table in a for loop.
After many hours of processing, it raises this exception:
Py4JJavaError: An error occurred while calling o11362.pyWriteDynamicFrame.
: java.sql.SQLException: [Amazon](500150) Error setting/closing connection: Connection refused.
Why am I getting this exception?
It turns out that Redshift clusters have a maintenance window in which they are rebooted. This event of course causes the Glue job to fail when attempting to write to a table on that cluster.
It may be useful to reschedule the maintenance window: https://docs.aws.amazon.com/redshift/latest/mgmt/managing-clusters-console.html
This error can occur for many reasons. I'm sure after a few Google searches you've found that the most common cause is improper security group settings for your cluster (make sure your inbound settings are correct).
I would suggest that you make sure you're able to create a connection for even a short period of time before you try this longer process. If you are able to do so, then I bet the issue is that your connection is closing out after a timeout (since your process is so long). To solve this, you should look into connection pooling, which involves creating an instance of a connection and constantly checking to ensure it is still alive, thus allowing a process to continuously use the cluster connection.
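Since the writes happen inside a long for loop, another pragmatic mitigation is to retry a failed write with a generous wait, so that a brief cluster reboot (such as the maintenance window) does not fail the whole job. A rough sketch, not the asker's code, built around the usual glueContext.write_dynamic_frame.from_jdbc_conf call; the connection name, table, temp dir, and retry settings are made up:
import time

def write_frame_with_retry(glue_context, frame, attempts=5, wait_seconds=120):
    # Wrap each Redshift write in a bounded retry so a short outage does not kill the job.
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return glue_context.write_dynamic_frame.from_jdbc_conf(
                frame=frame,
                catalog_connection="my-redshift-connection",        # hypothetical
                connection_options={"dbtable": "public.my_table",   # hypothetical
                                    "database": "mydb"},            # hypothetical
                redshift_tmp_dir="s3://my-temp-bucket/redshift/",   # hypothetical
            )
        except Exception as exc:  # the Py4JJavaError shown in the question ends up here
            last_error = exc
            print(f"Write attempt {attempt} failed ({exc}); retrying in {wait_seconds}s")
            time.sleep(wait_seconds)
    raise last_error
Each write inside the loop would then go through write_frame_with_retry(glueContext, frame) instead of calling from_jdbc_conf directly.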

What to do when Celery broker is down?

I have a Celery server with a RabbitMQ broker. I use it to run background tasks in my Django project.
In one of my views a signal is triggered which then calls a celery task like this:
create_something.delay(pk)
The task is defined like this:
@task
def create_something(donation_pk):
    # do something
Everything works great, but:
If RabbitMQ is down when I call the task, no error is thrown during the create_something.delay(pk) call, but the view throws this error:
[Errno 111] Connection refused
(The stack trace is kind of useless; I think this is because of the signals used.)
The question now is: How can I prevent this error? Is there a possibility to perform retries of the create_something.delay(pk) when the broker is down?
Thanks in advance for any hints!
Celery tasks have a .run() method, which will execute the task as if it were part of the normal code flow.
create_something.run(pk)
You could catch the exception and execute .run() if needed.
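A minimal sketch of that fallback (the exception type is an assumption: with Celery 4+ a broker outage typically surfaces as kombu.exceptions.OperationalError, while older versions raise the underlying socket error):
from kombu.exceptions import OperationalError  # assumed exception type, see above

def create_something_safely(pk):
    try:
        # create_something is the task defined in the question.
        create_something.delay(pk)
    except OperationalError:
        # Broker unreachable: run the task body synchronously in this process.
        create_something.run(pk)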
Is there a possibility to perform retries of the create_something.delay(pk) when the broker is down?
The exception thrown when you call the .delay() method and you cannot connect can be caught just like any other exception:
try:
    foo.delay()
except <whatever exception is actually thrown>:
    # recover
You could build a loop around this to retry but you should take care not to keep the request up for very long. For instance, if it takes a whole second for your connectivity problem to get resolved, you don't want to hold up the request for a whole second. An option here may be to abort quickly but use the logging infrastructure so that an email is sent to the site administrators. A retry loop would be the last thing I'd do once I've identified what causes the connectivity issue and I have determined it cannot be helped. In most cases, it can be helped, and the retry loop is really a bandaid solution.
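For what it's worth, a bounded version of such a retry loop could look like this sketch (the exception type and the limits are assumptions, and the helper name is made up):
import logging
import time
from kombu.exceptions import OperationalError  # assumed, as above

logger = logging.getLogger(__name__)

def enqueue_with_retry(task, *args, attempts=3, wait_seconds=0.1):
    for attempt in range(1, attempts + 1):
        try:
            return task.delay(*args)
        except OperationalError:
            if attempt == attempts:
                # Give up quickly and alert the site administrators
                # (e.g. via Django's mail_admins logging handler).
                logger.exception("Could not reach the Celery broker after %d attempts", attempts)
                return None
            time.sleep(wait_seconds)
The view would then call enqueue_with_retry(create_something, pk) instead of create_something.delay(pk).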
How can I prevent this error?
By making sure your broker does not go down. To get a more precise answer, you'd have to give more diagnostic information in your question.
By the way, Celery has a notion of retrying tasks but that's for when the task is already known to the broker. It does not apply to the case where you cannot contact the broker.

How to tolerate RabbitMQ restarts in Langohr?

We have Clojure code which reads from a Rabbit queue. We would like to tolerate the case where the RabbitMQ server is down briefly, e.g. in the case of a restart (sudo service rabbitmq-server restart).
There appears to be some provision for reconnecting in Langohr. We adapted the example clojurewerkz.langohr.examples.recovery.example1 (Gist here). Slight differences vs. the published example include the connection parameters, and the removal of the lb/publish call (since we're filling the data with an external source).
We can successfully consume data from the queue and wait for more messages. However, when we restart RMQ (via the above sudo command on the VM hosting RabbitMQ), the following exception is thrown:
Caught an exception during connection recovery!
java.io.IOException
at com.rabbitmq.client.impl.AMQChannel.wrap(AMQChannel.java:106)
at com.rabbitmq.client.impl.AMQChannel.wrap(AMQChannel.java:102)
at com.rabbitmq.client.impl.AMQConnection.start(AMQConnection.java:378)
at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:516)
at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:545)
at com.novemberain.langohr.Connection.recoverConnection(Connection.java:166)
at com.novemberain.langohr.Connection.beginAutomaticRecovery(Connection.java:115)
at com.novemberain.langohr.Connection.access$000(Connection.java:18)
at com.novemberain.langohr.Connection$1.shutdownCompleted(Connection.java:93)
at com.rabbitmq.client.impl.ShutdownNotifierComponent.notifyListeners(ShutdownNotifierComponent.java:75)
at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:573)
Caused by: com.rabbitmq.client.ShutdownSignalException: connection error; reason: java.io.EOFException
at com.rabbitmq.utility.ValueOrException.getValue(ValueOrException.java:67)
at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:33)
at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:343)
at com.rabbitmq.client.impl.AMQConnection.start(AMQConnection.java:321)
... 8 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readUnsignedByte(DataInputStream.java:273)
at com.rabbitmq.client.impl.Frame.readFrom(Frame.java:95)
at com.rabbitmq.client.impl.SocketFrameHandler.readFrame(SocketFrameHandler.java:131)
at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:533)
It seems likely that the automatic recovery mechanism provided by Langohr is breaking when it kicks in. Is there an alternative pattern which is preferred in the case of these "hard" restarts? Alternatively, I suppose we have to implement connection monitoring and retries ourselves. Any suggestions would be most welcome.
We used to see such stack traces, but we no longer see them with Langohr 2.9.0. After a restart, our Clojure clients reconnect and messages start flowing again.
We are using the defaults, which have connection and topology recovery turned on, as shown by this:
(infof "Automatic recovery enabled? %s" (rmq/automatic-recovery-enabled? connection))
(infof "Topology recovery enabled? %s" (rmq/automatic-topology-recovery-enabled? connection))