upgrade to Airflow 1.10.10 - airflow-scheduler

I want to know if there are any major open issues around the core functionalities of Airflow, such as the Webserver, Scheduler and Worker, in Airflow 1.10.10 before we perform the actual upgrade.
Ideally we expect the Scheduler hang issue to subside, if not be resolved, in 1.10.10 (this bug gave us headaches in 1.10.3). Please confirm.
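Not a definitive answer, but one way to keep an eye on the scheduler after the upgrade is to poll the webserver's /health endpoint, which reports the scheduler heartbeat status in 1.10.x. A minimal sketch, assuming the requests library and a placeholder webserver address:
import time
import requests

# Minimal sketch: poll the Airflow webserver's /health endpoint and flag a
# stale scheduler heartbeat. The base URL below is a placeholder.
AIRFLOW_BASE_URL = "http://localhost:8080"

def scheduler_is_healthy():
    # /health returns JSON along the lines of:
    # {"metadatabase": {"status": "healthy"},
    #  "scheduler": {"status": "healthy", "latest_scheduler_heartbeat": "..."}}
    resp = requests.get(f"{AIRFLOW_BASE_URL}/health", timeout=10)
    resp.raise_for_status()
    return resp.json()["scheduler"]["status"] == "healthy"

if __name__ == "__main__":
    while True:
        if not scheduler_is_healthy():
            print("scheduler heartbeat looks stale - consider restarting the scheduler")
        time.sleep(60)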

Related

Airflow web-server produces temporary 502 errors in Cloud Composer

I'm encountering 502 errors on the Airflow (2.0.2) UI hosted in Cloud Composer (1.17.0).
Error: Server Error The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
They last for a few minutes and happen several times a day; after they're gone, everything works fine.
At the moment of the errors:
there is a gap in the logs, and afterwards we can see that logging resumes with messages about starting gunicorn:
[1133] [INFO] Starting gunicorn 19.10.0
there is a spike in resource usage of the web-server
I didn't spot any other suspicious activity in other parts of the system (workers, scheduler, DB).
I think this is the result of an OOM error, because we have DAGs with a big number of tasks (2k).
But I'd like to be sure, and I haven't found a way to connect to the App Engine VM in the tenant project (where the Airflow server is hosted by default) to get additional logs.
Does anyone know a way to get additional logs from the Airflow server VMs, or have any other idea?
The Cloud Composer documentation has a Troubleshooting DAGs section. It shows how to check individual workers' logs, and it even mentions OOM issues (direct link).
Generally the troubleshooting section is well documented, so you should be able to find a lot of useful information there. You can also use Cloud Monitoring and Cloud Logging to monitor Composer, but I am not sure how valuable that will be in this particular case (reference).
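If you want to pull those webserver logs without VM access, Cloud Logging usually has them. A rough sketch, assuming the google-cloud-logging Python client and that your Composer webserver logs appear under a log name containing "airflow-webserver" (verify the exact log name in the Logs Explorer first):
# Rough sketch: list recent Composer webserver log entries via Cloud Logging.
# The project id is a placeholder and the "airflow-webserver" log name is an
# assumption - check your environment's actual log names.
from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="my-gcp-project")

log_filter = (
    'resource.type="cloud_composer_environment" '
    'AND logName:"airflow-webserver"'
)

for entry in client.list_entries(
    filter_=log_filter,
    order_by=cloud_logging.DESCENDING,
    max_results=50,
):
    print(entry.timestamp, entry.payload)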

Airflow jobs on MWAA fail with no Log messages

I have been using Airflow on AWS (MWAA) for a couple of months now, and I've noticed that occasionally some Airflow tasks fail for no discernible reason and with no log messages in CloudWatch. I often have to clear and retry the tasks multiple times before they eventually succeed.
Does anyone know why this would be happening? Given MWAA's autoscaling feature, I wouldn't expect this to be a resource issue. Anyone with experience dealing with something like this, or with any ideas, would be greatly appreciated. Thanks.
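Not a root cause, but while investigating you can at least stop clearing tasks by hand by letting Airflow retry them automatically. A small sketch with placeholder values, assuming a standard Airflow 2.x DAG on MWAA:
# Sketch only: automatic retries work around the silent failures while you
# investigate, they do not explain the missing CloudWatch logs.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "retries": 3,                          # placeholder retry count
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
}

with DAG(
    dag_id="example_retrying_dag",         # hypothetical DAG
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    PythonOperator(
        task_id="sometimes_flaky_task",
        python_callable=lambda: print("did some work"),
    )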

on heroku, celery beat database scheduler doesn’t run periodic tasks

I have an issue where django_celery_beat's DatabaseScheduler doesn't run periodic tasks. Or I should say that celery beat doesn't find any tasks when the scheduler is DatabaseScheduler. If I use the standard scheduler, the tasks are executed regularly.
I set up celery on Heroku using one dyno for the worker and one for beat (and one for web, obviously).
I know that beat and the worker are connected to redis, and to postgres for task results.
Every periodic task I run from django admin by selecting a task and "run selected task" gets executed.
However, I've spent about two days trying to figure out why beat/worker never sees that I've scheduled a task to execute every 10 seconds, or via a cron (even restarting beat and the worker doesn't change anything).
I’m kind of desperate, and my next move would be to give redbeat a try.
Any help on how to troubleshoot this particular problem would be greatly appreciated. I suspect the problem is in the is_due method. I am using UTC (in celery and django), and all crons are UTC based. All I see in the beat log is "writing entries..." every now and then.
I've tried changing the celery version from 4.3 to 4.4, and django-celery-beat from 1.4.0 to 1.5.0 to 1.6.0.
Any help would be greatly appreciated.
In case it helps someone who's having, or will have, a similar problem to ours: to recreate this issue, it is possible to create a task as simple as:
@app.task(bind=True)
def test(self, arg):
    # note the mismatch: the signature takes a positional arg,
    # but the body references kwargs, which is never defined
    print(kwargs.get("notification_id"))
then, in django admin, edit the task and put something in the extra args field. Or, vice versa, the task could be:
@app.task(bind=True)
def test(self, **kwargs):
    # notification_id is never defined, so calling this raises a NameError
    print(notification_id)
and try to pass a positional argument. While locally this breaks immediately, on Heroku's beat and worker dynos it somehow slips by unnoticed, and django_celery_beat stops processing any task whatsoever from then on. The scheduler is completely broken by one "wrong" task.
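For anyone wanting to guard against this, a small sketch of a more forgiving signature: the task accepts whatever positional and keyword arguments the admin form sends, so a mismatch doesn't raise inside beat or the worker ("proj" and "notification_id" are only illustrative):
# Sketch of a defensive task signature: accept arbitrary args/kwargs so a
# mismatch typed into the django admin form cannot break scheduling.
from celery import Celery

app = Celery("proj")  # placeholder app name

@app.task(bind=True)
def test(self, *args, **kwargs):
    notification_id = kwargs.get("notification_id")
    print(f"args={args} notification_id={notification_id}")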

Django celery, celery-beat: fills the queue without control, scheduling troubles

I have a small project with a couple of tasks that run several times a day.
The project is based on Django 2.1, with celery 4.2.1 and django-celery-beat 1.3.0, and rabbitmq installed as the broker.
Each task lives inside its own project application. It runs, works, and gives some result.
The problem is: on a virtual server leased from a provider, if I set any task to run periodically (every hour, or two), it starts running immediately and without end, and, as I suppose, in some kind of parallel threads which interfere with each other.
The command rabbitmqctl list_queues name messages_unacknowledged always shows 8 messages in the celery queue. Purging the celery queue doesn't change anything. Neither does restarting the service.
But setting tasks to run at an exact time works, well, almost. Two tasks are scheduled to run at the beginning of different hours (even and odd), but both run about 30 minutes after the start of the same (odd) hour. At least the tasks don't run more times per day than scheduled, but something is still wrong.
As a newbie with rabbitmq and celery, I don't know where to look for a solution. The official celery docs didn't help me; maybe I wasn't looking in the right place. Any help or advice would be good. Thanks.
It seems this is a bug in django-celery-beat - https://github.com/celery/celery/issues/4041.
If anyone has already found a solution for this, please share it.
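In case it helps with the even/odd-hour drift, a sketch of defining the two schedules explicitly through django_celery_beat's models with the timezone spelled out (run it from a Django shell; the task dotted paths are placeholders, and it assumes a django-celery-beat version whose CrontabSchedule has a timezone field):
# Sketch: pin the two hourly schedules explicitly, with an explicit timezone,
# so beat's idea of "even/odd hours" matches yours. Task paths are placeholders.
from django_celery_beat.models import CrontabSchedule, PeriodicTask

even_hours, _ = CrontabSchedule.objects.get_or_create(
    minute="0", hour="*/2", timezone="UTC",
)
odd_hours, _ = CrontabSchedule.objects.get_or_create(
    minute="0", hour="1-23/2", timezone="UTC",
)

PeriodicTask.objects.update_or_create(
    name="task-on-even-hours",
    defaults={"crontab": even_hours, "task": "myapp.tasks.even_task"},
)
PeriodicTask.objects.update_or_create(
    name="task-on-odd-hours",
    defaults={"crontab": odd_hours, "task": "myapp.tasks.odd_task"},
)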

Why does RabbitMQ keep breaking from a corrupt persister log file?

I'm running Celery in a Django app with RabbitMQ as the message broker. However, RabbitMQ keeps breaking down like so. First is the error I get from Django. The trace is mostly unimportant, because I know what is causing the error, as you will see.
Traceback (most recent call last):
...
File "/usr/local/lib/python2.6/dist-packages/amqplib/client_0_8/transport.py", line 85, in __init__
raise socket.error, msg
error: [Errno 111] Connection refused
I know that this is due to a corrupt rabbit_persister.log file. This is because after I kill all processes tied to RabbitMQ, I run "sudo rabbitmq-server start" to get the following crash:
...
starting queue recovery ...done
starting persister ...BOOT ERROR: FAILED
Reason: {{badmatch,{error,{{{badmatch,eof},
[{rabbit_persister,internal_load_snapshot,2},
{rabbit_persister,init,1},
{gen_server,init_it,6},
{proc_lib,init_p_do_apply,3}]},
{child,undefined,rabbit_persister,
{rabbit_persister,start_link,[]},
transient,100,worker,
[rabbit_persister]}}}},
[{rabbit_sup,start_child,2},
{rabbit,'-run_boot_step/1-lc$^1/1-1-',1},
{rabbit,run_boot_step,1},
{rabbit,'-start/2-lc$^0/1-0-',1},
{rabbit,start,2},
{application_master,start_it_old,4}]}
Erlang has closed
My current fix: Every time this happens, I rename the corresponding rabbit_persister.log file to something else (rabbit_persister.log.bak) and am able to restart RabbitMQ with success. But the problem keeps occurring, and I can't tell why. Any ideas?
Also, as a disclaimer, I have no experience with Erlang; I'm only using RabbitMQ because it's the broker favored by Celery.
Thanks in advance, this problem is really annoying me because I keep doing the same fix over and over.
The persister is RabbitMQ's internal message database. That "log" is presumably like a database log and deleting it will cause you to lose messages. I guess it's getting corrupted by unclean broker shutdowns, but that's a bit beside the point.
It's interesting that you're getting an error in the rabbit_persister module. The last version of RabbitMQ that has that file is 2.2.0, so I'd strongly advise you to upgrade. The best version is always the latest, which you can get by using the RabbitMQ APT repository. In particular, the persister has seen a fairly large amount of fixes in the versions after 2.2.0, so there's a big chance your problem has already been resolved.
If you still see the problem after upgrading, you should report it on the RabbitMQ Discuss mailing list. The developers (of both Celery and RabbitMQ) make a point of fixing any problems reported there.
A. Because you are running an old version of RabbitMQ earlier than 2.7.1
B. Because RabbitMQ doesn't have enough RAM. You need to run RabbitMQ on a server all by itself and give that server enough RAM so that the RAM is 2.5 times the largest possible size of your persisted message log.
You might be able to fix this without any software changes just by adding more RAM and killing other services on the box.
Another approach is to build your own RabbitMQ from source and include the toke extension, which persists messages using Tokyo Cabinet. Make sure you are using a local hard drive and not an NFS partition, because Tokyo Cabinet has corruption issues with NFS. And, of course, use version 2.7.1 for this. Depending on your message content, you might also benefit from Tokyo Cabinet's compression settings to reduce the read/write activity of persisted messages.