I have a bunch of tasks within a Cloud Composer Airflow DAG, one of which is a KubernetesPodOperator. This task seems to get stuck in the scheduled state forever and so the DAG runs continuously for 15 hours without finishing (it normally takes about an hour). I have to manually mark it failed for it to end.
I've set the DAG timeout to 2 hours but it does not make any difference.
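For context, here is roughly how the two-hour limit is set via dagrun_timeout (the dag_id and schedule below are placeholders, not my actual values):

from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    dag_id="my_pipeline",               # placeholder
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",         # placeholder
    dagrun_timeout=timedelta(hours=2),  # the 2-hour limit mentioned above
)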
The Cloud Composer logs show the following error:
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server:
Connection refused
Is the server running on host "airflow-sqlproxy-service.default.svc.cluster.local" (10.7.124.107)
and accepting TCP/IP connections on port 3306?
The error log also gives me a link to this documentation about that error type: https://docs.sqlalchemy.org/en/13/errors.html#operationalerror
When the DAG is next triggered on schedule, it works fine without any fix required. This issue happens intermittently; we've not been able to reproduce it.
Does anyone know the cause of this error and how to fix it?
The issue is related to how Airflow uses SQLAlchemy: a session is created per thread and kept as a callable that can be reused later in the Airflow code. If there are long gaps between queries on a session, MySQL may close the connection on its side; the connection timeout is set to approximately 10 minutes.
Solutions:
- Use the airflow.utils.db.provide_session decorator. This decorator provides a valid session to the Airflow database via the session parameter and closes the session at the end of the function (see the sketch below).
- Do not use a single long-running function. Instead, move all database queries into separate functions, each decorated with airflow.utils.db.provide_session, so that sessions are closed automatically after the query results are retrieved.
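A minimal sketch of the first approach (the query and model here are only an example, not code from the affected environment):

from airflow.models import TaskInstance
from airflow.utils.db import provide_session

@provide_session
def count_running_tasks(dag_id, session=None):
    # The decorator injects a fresh `session` and closes it when the function
    # returns, so no connection stays open long enough for MySQL to drop it.
    return (
        session.query(TaskInstance)
        .filter(TaskInstance.dag_id == dag_id, TaskInstance.state == "running")
        .count()
    )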
I have a Glue script which is supposed to write its results to a Redshift table in a for loop.
After many hours of processing it raises this exception:
Py4JJavaError: An error occurred while calling o11362.pyWriteDynamicFrame.
: java.sql.SQLException: [Amazon](500150) Error setting/closing connection: Connection refused.
Why am I getting this exception?
It turns out that Redshift clusters have a maintenance window during which they are rebooted. This, of course, causes the Glue job to fail when it attempts to write to a table on that cluster.
It may be useful to reschedule the maintenance window: https://docs.aws.amazon.com/redshift/latest/mgmt/managing-clusters-console.html
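If you prefer to change it programmatically instead of through the console, a sketch along these lines should work (the cluster identifier, region, and window below are placeholders):

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Move the weekly maintenance window to a time when the Glue job is not running.
redshift.modify_cluster(
    ClusterIdentifier="my-cluster",
    PreferredMaintenanceWindow="sun:05:00-sun:05:30",
)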
This error can occur for many reasons. I'm sure after a few Google searches you've found that the most common cause is improper security group settings for your cluster (make sure your inbound rules are correct).
I would suggest making sure you can create a connection at all, even for a short period, before you try this longer process. If you can, then I bet the issue is that your connection is closing after a timeout (since your process runs for so long). To solve this, look into connection pooling: create a connection instance and periodically check that it is still alive, so that a long-running process can keep using the cluster connection.
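As a rough sketch of the "check the connection is still alive, reconnect if not" idea, using psycopg2 against the cluster endpoint (all connection details below are placeholders):

import psycopg2

def get_live_connection(conn=None):
    # Return an open connection, reconnecting if the old one has died.
    if conn is not None:
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")  # cheap liveness probe
            return conn
        except (psycopg2.OperationalError, psycopg2.InterfaceError):
            pass  # connection was dropped (timeout, reboot, ...); reconnect below
    return psycopg2.connect(
        host="my-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
        port=5439,
        dbname="dev",
        user="awsuser",
        password="...",  # placeholder
    )

# Inside the long-running loop, refresh the handle before each write:
# conn = get_live_connection(conn)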
I have a worker role that uses an EventProcessorHost to ingest data from an EventHub. I frequently receive error messages of the following kind:
Microsoft.ServiceBus.Messaging.MessagingCommunicationException:
No connection handler was found for virtual host 'myservicebusnamespace.servicebus.windows.net:42777'. Remote container id is 'f37c72ee313c4d658588ad9855773e51'. TrackingId:1d200122575745cc89bb714ffd533b6d_B5_B5, SystemTracker:SharedConnectionListener, Timestamp:8/29/2016 6:13:45 AM
at Microsoft.ServiceBus.Common.ExceptionDispatcher.Throw(Exception exception)
at Microsoft.ServiceBus.Common.Parallel.TaskHelpers.EndAsyncResult(IAsyncResult asyncResult)
at Microsoft.ServiceBus.Messaging.IteratorAsyncResult`1.StepCallback(IAsyncResult result)
I can't seem to find a way to catch this exception. It seems I can just ignore the error because everything works as expected. (I had previously mentioned here that messages were being dropped because of this error, but I have since found out that a bug in the software that sends the messages caused that problem.) However, I would like to know what causes these errors, since they clog up my logging every now and then.
Can anyone shed some light on the cause?
The Event Hub partitions are distributed across multiple servers, and they sometimes move due to load balancing, upgrades, and other reasons. When this happens, the client connection is lost with this error. The connection is re-established very quickly, so you should not see any issues with message processing. It is safe to ignore this communication error.
I have a Celery server with a RabbitMQ broker. I use it to run background tasks in my Django project.
In one of my views a signal is triggered which then calls a celery task like this:
create_something.delay(pk)
The task is defined like this:
@task
def create_something(donation_pk):
    # do something
Everything works great, but:
If RabbitMQ is down when I call the task, no error is thrown during the create_something.delay(pk) call, but the view throws this error:
[Errno 111] Connection refused
(The stack trace is kind of useless, I think this is because of the signals used)
The question now is: How can I prevent this error? Is there a possibility to perform retries of the create_something.delay(pk) when the broker is down?
Thanks in advance for any hints!
Celery tasks have a .run() method, which executes the task as if it were part of the normal code flow.
create_something.run(pk)
You could catch the exception and execute .run() if needed.
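A sketch of that fallback; exactly which exception .delay() raises when the broker is unreachable depends on your Celery/kombu version (older stacks surface a socket.error/OSError, newer ones raise kombu.exceptions.OperationalError), so adjust the except clause to whatever you actually see:

import socket

def create_something_safely(pk):
    try:
        # Normal path: hand the task to the broker.
        create_something.delay(pk)
    except (socket.error, OSError):  # adjust to the exception your stack raises
        # Broker unreachable: run the task body synchronously in this process.
        create_something.run(pk)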
Is there a possibility to perform retries of the create_something.delay(pk) when the broker is down?
The exception thrown when you call .delay() and cannot connect to the broker can be caught just like any other exception:
try:
    foo.delay()
except <whatever exception is actually thrown>:
    # recover
You could build a loop around this to retry but you should take care not to keep the request up for very long. For instance, if it takes a whole second for your connectivity problem to get resolved, you don't want to hold up the request for a whole second. An option here may be to abort quickly but use the logging infrastructure so that an email is sent to the site administrators. A retry loop would be the last thing I'd do once I've identified what causes the connectivity issue and I have determined it cannot be helped. In most cases, it can be helped, and the retry loop is really a bandaid solution.
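A sketch of that "fail fast, but tell the admins" approach (the exception class and attempt count are assumptions; whether the error log actually reaches the admins by email depends on your logging configuration, e.g. attaching Django's AdminEmailHandler to this logger):

import logging
import time

logger = logging.getLogger(__name__)

def enqueue_with_bounded_retry(pk, attempts=3, delay=0.1):
    for _ in range(attempts):
        try:
            create_something.delay(pk)
            return True
        except OSError:  # assumption: the broker failure surfaces as an OSError
            time.sleep(delay)  # keep this short so the request isn't held up
    # Give up quickly and let the logging infrastructure notify the admins.
    logger.error("Celery broker unreachable; task for pk=%s was not queued", pk)
    return False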
How can I prevent this error?
By making sure your broker does not go down. To get a more precise answer, you'd have to give more diagnostic information in your question.
By the way, Celery has a notion of retrying tasks but that's for when the task is already known to the broker. It does not apply to the case where you cannot contact the broker.
I'm running Celery in a Django app with RabbitMQ as the message broker. However, RabbitMQ keeps breaking down like so. First is the error I get from Django. The trace is mostly unimportant, because I know what is causing the error, as you will see.
Traceback (most recent call last):
...
File "/usr/local/lib/python2.6/dist-packages/amqplib/client_0_8/transport.py", line 85, in __init__
raise socket.error, msg
error: [Errno 111] Connection refused
I know that this is due to a corrupt rabbit_persister.log file. This is because after I kill all processes tied to RabbitMQ, I run "sudo rabbitmq-server start" to get the following crash:
...
starting queue recovery ...done
starting persister ...BOOT ERROR: FAILED
Reason: {{badmatch,{error,{{{badmatch,eof},
[{rabbit_persister,internal_load_snapshot,2},
{rabbit_persister,init,1},
{gen_server,init_it,6},
{proc_lib,init_p_do_apply,3}]},
{child,undefined,rabbit_persister,
{rabbit_persister,start_link,[]},
transient,100,worker,
[rabbit_persister]}}}},
[{rabbit_sup,start_child,2},
{rabbit,'-run_boot_step/1-lc$^1/1-1-',1},
{rabbit,run_boot_step,1},
{rabbit,'-start/2-lc$^0/1-0-',1},
{rabbit,start,2},
{application_master,start_it_old,4}]}
Erlang has closed
My current fix: every time this happens, I rename the corresponding rabbit_persister.log file to something else (rabbit_persister.log.bak) and can then restart RabbitMQ successfully. But the problem keeps recurring, and I can't tell why. Any ideas?
Also, as a disclaimer, I have no experience with Erlang; I'm only using RabbitMQ because it's the broker favored by Celery.
Thanks in advance, this problem is really annoying me because I keep doing the same fix over and over.
The persister is RabbitMQ's internal message database. That "log" is presumably like a database log and deleting it will cause you to lose messages. I guess it's getting corrupted by unclean broker shutdowns, but that's a bit beside the point.
It's interesting that you're getting an error in the rabbit_persister module. The last version of RabbitMQ that has that file is 2.2.0, so I'd strongly advise you to upgrade. The best version is always the latest, which you can get by using the RabbitMQ APT repository. In particular, the persister has seen a fairly large amount of fixes in the versions after 2.2.0, so there's a big chance your problem has already been resolved.
If you still see the problem after upgrading, you should report it on the RabbitMQ Discuss mailing list. The developers (of both Celery and RabbitMQ) make a point of fixing any problems reported there.
A. Because you are running an old version of RabbitMQ, earlier than 2.7.1.
B. Because RabbitMQ doesn't have enough RAM. You need to run RabbitMQ on a server all by itself and give that server enough RAM that it is 2.5 times the largest possible size of your persisted message log.
You might be able to fix this without any software changes just by adding more RAM and killing other services on the box.
Another approach is to build your own RabbitMQ from source and include the toke extension, which persists messages using Tokyo Cabinet. Make sure you use a local hard drive and not an NFS partition, because Tokyo Cabinet has corruption issues with NFS. And, of course, use version 2.7.1 for this. Depending on your message content, you might also benefit from Tokyo Cabinet's compression settings, which reduce the read/write activity of persisted messages.