I am trying to control which Airflow worker a task is executed on, but the queue parameter in my DAG definition is not being picked up by the scheduler.
I have defined a queue in my subdag operator:
xdata_run_etl = sub_dag_operator_with_celery_executor(
    subdag=build_xdata_etl_dag(dag, 'xdata_run_etl'),
    task_id='xdata_run_etl',
    dag=dag,
    trigger_rule='none_failed',
    queue='subdag'
)
I can see that the queue setting has been picked up: in the "Task Attributes" section of the UI, queue is set to subdag.
However, when I trigger the DAG, the scheduler still sends the task to the default queue, as seen in the scheduler logs:
[2020-04-02 20:38:49,581] {scheduler_job.py:1168} INFO - Sending ('run_etl', 'xdata_run_etl', datetime.datetime(2020, 4, 2, 17, 27, 38, 368220, tzinfo=<TimezoneInfo [UTC, GMT, +00:00:00, STD]>), 10) to executor with priority 2 and queue default
Expected behavior: the task is sent to the subdag queue and runs on an Airflow worker listening on that queue (airflow worker -q subdag). Actual behavior: all tasks are sent to the default queue, irrespective of the queue parameter being defined.
Airflow version: 1.10.9
This can happen when a DAG run is triggered manually, but you can also try setting the default_queue option (under [celery] in airflow.cfg) to your custom queue name, and it will work.
I've noticed the following in the docs:
Ideally task functions should be idempotent: meaning the function won’t cause unintended effects even if called multiple times with the same arguments. Since the worker cannot detect if your tasks are idempotent, the default behavior is to acknowledge the message in advance, just before it’s executed, so that a task invocation that already started is never executed again.
If your task is idempotent you can set the acks_late option to have the worker acknowledge the message after the task returns instead. See also the FAQ entry Should I use retry or acks_late?.
If set to True messages for this task will be acknowledged after the task has been executed, not just before (the default behavior).
Note: This means the task may be executed multiple times should the worker crash in the middle of execution. Make sure your tasks are idempotent.
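For reference, the per-task form of that option would look roughly like the sketch below (the app name and broker URL are illustrative, not taken from any real setup):

from celery import Celery

app = Celery('tasks', broker='amqp://guest@localhost//')  # illustrative broker URL

@app.task(acks_late=True)  # acknowledge only after the task body returns
def process(item):
    # The work must be idempotent: if the worker crashes mid-run,
    # the still-unacknowledged message can be delivered and executed again.
    ...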
Then there is the BROKER_TRANSPORT_OPTIONS = {'confirm_publish': True} option found here. I could not find official documentation for that.
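In the newer lowercase settings, the old-style BROKER_TRANSPORT_OPTIONS corresponds to broker_transport_options, so I assume the two options would be wired up roughly like this (a sketch assuming a RabbitMQ/pyamqp broker, since confirm_publish appears to be a transport-level option; I could not verify it against official documentation either):

# continuing the app from the sketch above
app.conf.task_acks_late = True                                 # global equivalent of acks_late=True on each task
app.conf.broker_transport_options = {'confirm_publish': True}  # reportedly enables RabbitMQ publisher confirms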
I want to be certain that tasks which are submitted to celery (1) arrive at celery and (2) eventually get executed.
Here is how I think it works:
Celery stores the information about which tasks should be executed in a broker (typically RabbitMQ or Redis).
The application (e.g. Django) submits a task to Celery, which immediately stores it in the broker. confirm_publish confirms that it was added (right?). If confirm_publish is set but the confirmation is missing, it retries (right?).
Celery takes messages from the broker. Now Celery behaves as a consumer for the broker. The consumer acknowledges (confirms) that it received a message, and the broker stores this information. If the consumer doesn't send an acknowledgement, the broker will retry sending the message.
Is that correct?
I have an Apache Airflow managed environment running in which a number of DAGs are defined and enabled. Some DAGs are scheduled, running on a 15 minute schedule, while others are not scheduled. All the DAGs are single-task DAGs. The DAGs are structured in the following way:
level 2 DAGs -> (triggers) level 1 DAG -> (triggers) level 0 DAG
The scheduled DAGs are the level 2 DAGs, while the level 1 and level 0 DAGs are unscheduled. The level 0 DAG uses ECSOperator to call a pre-defined Elastic Container Service (ECS) task, which runs a Python ETL script inside a Docker container defined in the ECS task. The level 2 DAGs wait on the level 1 DAG to complete, which in turn waits on the level 0 DAG to complete. The full Python logs produced by the ETL scripts are visible in the CloudWatch logs from the ECS task runs, while the Airflow task logs only show high-level logging.
The singular tasks in the scheduled (level 2) DAGs have depends_on_past set to False, and I expected that, as a result, successive scheduled runs of a level 2 DAG would not depend on each other, i.e. that if a particular run failed it would not prevent the next scheduled run from occurring. But what is happening is that Airflow overrides this: I can clearly see in the UI that a failure of a particular level 2 DAG run prevents the next run from being selected by the scheduler - the next scheduled run's state is set to None, and I have to manually clear the failed DAG run's state before the scheduler will schedule it again.
Why does this happen? As far as I know, there is no Airflow configuration option that should override the task-level setting of False for depends_on_past in the level 2 DAG tasks. Any pointers would be greatly appreciated.
Answering the question "why is this happening?": I understand that the behavior you are observing is explained by the tasks being defined with wait_for_downstream=True. The docs state the following about it:
wait_for_downstream (bool) -- when set to true, an instance of task X will wait for tasks immediately downstream of the previous instance of task X to finish successfully or be skipped before it runs. This is useful if the different instances of a task X alter the same asset, and this asset is used by tasks downstream of task X. Note that depends_on_past is forced to True wherever wait_for_downstream is used. Also note that only tasks immediately downstream of the previous task instance are waited for; the statuses of any tasks further downstream are ignored.
Keep in mind that the term previous instances of task X refers to the task_instance of the last scheduled dag_run, not the upstream Task (in a DAG with a daily schedule, that would be the task_instance from "yesterday").
This also explains why your tasks are executed once you clear the state of the previous DAG Run.
I hope this helps clear things up!
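If you want to rule this out in code, one option is to set the flag explicitly so nothing in default_args can force depends_on_past back on. A minimal sketch, assuming a simplified stand-in for one of your level 2 DAGs (the operator, names and schedule are illustrative; the BashOperator import path differs between Airflow 1.10 and 2.x):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x path; 1.10 uses airflow.operators.bash_operator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    # wait_for_downstream=True silently forces depends_on_past=True,
    # so make sure it is not set here or on the task itself.
    'wait_for_downstream': False,
}

with DAG(
    dag_id='level_2_dag',                      # illustrative name
    default_args=default_args,
    schedule_interval=timedelta(minutes=15),
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    trigger_level_1 = BashOperator(
        task_id='trigger_level_1',
        bash_command='echo "trigger the level 1 DAG here"',  # stand-in for the real trigger
    )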
I was using Airflow 1.10.10 with the Celery executor. I defined two DAGs, each with three tasks, and the same pool ID was used in both DAGs. The pool was configured with 3 slots. The first DAG (say High_priority) had priority_weight set to 10 for each task; the second DAG (say Low_priority) had the default priority_weight (that is, 1). I first submitted 5 Low_priority DAG runs and waited until 3 low-priority tasks had moved into the running state. Then I submitted 4 High_priority DAG runs. I expected that when a pool slot became available in the next scheduling round, a high-priority task would be moved into the QUEUED state, but the high-priority tasks remained in the SCHEDULED state. I repeated this 10-15 times and observed the same thing every time.
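A minimal sketch of the setup described above, in case it helps reproduce it (the DAG names, pool name and callable are illustrative, using Airflow 1.10 import paths):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10 import path

def do_work(**kwargs):
    pass  # placeholder for the real task logic

default_args = {'owner': 'airflow', 'start_date': datetime(2020, 1, 1)}

high_priority = DAG('High_priority', default_args=default_args, schedule_interval=None)
low_priority = DAG('Low_priority', default_args=default_args, schedule_interval=None)

for i in range(3):
    PythonOperator(
        task_id='task_{}'.format(i),
        python_callable=do_work,
        pool='shared_pool',       # same pool (3 slots) used by both DAGs
        priority_weight=10,       # tasks in the high-priority DAG
        dag=high_priority,
    )
    PythonOperator(
        task_id='task_{}'.format(i),
        python_callable=do_work,
        pool='shared_pool',
        priority_weight=1,        # default weight, kept explicit for clarity
        dag=low_priority,
    )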
However, this works fine when I move to the LocalExecutor.
Please suggest a fix/workaround for this priority_weight issue with the CeleryExecutor.
Hi, I am using sidekiq_cron in my Rails project. The job classes are ActiveJob classes. In my job file I have queue_as :default, and in the schedule.yml file I have queue: high_priority. Which queue will actually be used?
Are you talking about sidekiq-cron? It seems that it allows you to pick the queue, so one should expect the high_priority queue to be used when jobs are scheduled by cron, and the default queue to be used when the same jobs are scheduled by other means (other Ruby code that is not the cron).
You have a way to confirm this: spin up a local Sidekiq that does not process jobs in either of these queues, set your cron to every minute, and in the Sidekiq web dashboard you will see jobs accumulating on the Queues tab.
I have some Java code that calls Thread.sleep(100_000) inside a job running in SQS. In production, the job is often killed during the sleep and re-submitted as failed. On dev I can never recreate that. Does SQS in production kill long-running jobs?
SQS doesn't kill jobs - and I am not sure what you mean by having code 'running in SQS' - what SQS does do is assume that your job (which is running someplace other than SQS) has failed if you don't mark it completed within the timeout (the default visibility timeout) you set when you set up the queue.
Your job asks SQS for an item to work on (a message to process); your job is supposed to do that work and then tell SQS that the job is now done (DeleteMessage). If you don't tell it the job is done, SQS assumes the job has failed and puts the message back in the queue for another consumer to process.
If you need more time to complete the tasks, you can change the visibility timeout to a higher value.
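To make that flow concrete, here is a rough sketch of the receive/extend/delete cycle in Python with boto3 (the queue URL and timings are made up; the equivalent calls exist in the Java SDK):

import time

import boto3

sqs = boto3.client('sqs')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue'  # illustrative

resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=10)
for msg in resp.get('Messages', []):
    receipt = msg['ReceiptHandle']

    # If the work (e.g. a long sleep) may outlive the queue's visibility timeout,
    # extend it so the message is not handed to another consumer mid-processing.
    sqs.change_message_visibility(
        QueueUrl=queue_url, ReceiptHandle=receipt, VisibilityTimeout=300
    )

    time.sleep(100)  # stand-in for the long-running work

    # Only after the work succeeds do we tell SQS the message is done;
    # otherwise it reappears in the queue once the visibility timeout expires.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt)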