Add job to existing Airflow DAG without dependency on any job - airflow-scheduler

I am creating an Airflow job that will run as part of an existing DAG that has n jobs. I have to add this new job as an independent job.
My current job dependency is as below:
accountable_job >> dq_check >> dq_a1_validaton_job >> data_aggregation_job >> sync_job
I have to add another job, dq_b1_validaton_job, that will be independent, but the jobs after dq_a1_validaton_job will depend on dq_b1_validaton_job. In short, dq_a1_validaton_job and dq_b1_validaton_job will run in parallel, but dq_b1_validaton_job will not depend on any other job.

You just add it to the DAG, either with a context manager:
with DAG(...):
    independent_task = YourOperator()
or by passing the DAG as a parameter:
your_dag = DAG(...)
independent_task = YourOperator(..., dag=your_dag)
See https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html#declaring-a-dag
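For the layout described in the question, a minimal sketch (assuming the operators above are already defined in the same DAG) would give dq_b1_validaton_job no upstream dependency and fan both validation jobs into the aggregation step:
accountable_job >> dq_check >> dq_a1_validaton_job
# dq_b1_validaton_job has no upstream task; downstream jobs wait for both validations
[dq_a1_validaton_job, dq_b1_validaton_job] >> data_aggregation_job >> sync_job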

Related

Naming a Sagemaker Processing job using Sagemaker Pipelines ProcessingStep

I am running a Sagemaker Pipeline with the current processor:
from sagemaker.sklearn.processing import SKLearnProcessor
framework_version = "0.23-1"
sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name="pre-processing-job-name",
    role=role
)
and the processing step is:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep
step_process = ProcessingStep(
    name="AbaloneProcess",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="abalone/preprocessing.py",
)
It looks like the base_job_name does nothing, because the processing job that is created is pipelines-o6e2jn38g05j-AbaloneProcess-nc2OlXF8jA.
I want the processing job name to be defined manually. Does Sagemaker pipelines support this? I seem to be going around in circles.
Pipelines sets the processing job name to preserve caching behavior.
An override could be added here to allow manually defining job names, but that has been intentionally left out. Because SageMaker job names must be unique, this would become a sharp edge for users who statically define the name without realizing that any pipeline execution after the first will fail with ResourceAlreadyExists.
I'd suggest filing an issue explaining the use case and seeing if the team picks it up.

ignoreCommitterStrategy is not working in multibranch job-dsl generator job

I'm trying to implement the ignoreCommitterStrategy approach via a multibranch generator job (i.e., the job-dsl way), since we have too many existing multibranch-pipeline jobs and I'm trying to apply the ignoreCommitter strategy inside the branchSources block of the DSL.
After running the seed job (i.e., the multibranch generator job) I can see the ignoreCommitter strategy updated in the existing multibranch pipeline jobs, but the ignored author is still not added. This means that inside each multibranch pipeline job's config I have to manually click the Add button and add the ignored author list, which is a bit painful as we have many jobs.
buildStrategies {
    ignoreCommitterStrategy {
        ignoredAuthors("sathya@xyz.com")
        allowBuildIfNotExcludedAuthor(false)
    }
}
Note: I even tried with "au.com.versent.jenkins.plugins.ignoreCommitterStrategy"
I'm expecting the ignored author to be added to the existing multibranch pipeline jobs upon execution of the multibranch generator job.
Currently the build strategy cannot be set by using the Dynamic DSL because of JENKINS-26535. But a fix is currently in review and will hopefully be released soon.
The correct syntax would be
multibranchPipelineJob('example') {
    branchSources {
        branchSource {
            buildStrategies {
                ignoreCommitterStrategy {
                    ignoredAuthors('test@acme.org')
                    allowBuildIfNotExcludedAuthor(true)
                }
            }
        }
    }
}
Until the problem has been fixed, you can use a Configure Block to set the necessary options.

How can I set my Airflow DAG to wait for Dataflow jobs to complete?

I have a DAG that executes 3 Dataflow pipelines. I have set the dependency as such:
a > b > c
I have set the following default arguments:
default_dag_args = {
    'start_date': yesterday,
    'depends_on_past': True,
    'wait_for_downstream': True
}
However, it seems like all 3 pipelines are being scheduled at the same time. How can I set pipeline b to run only after pipeline a finishes? And similarly pipeline c to only run after pipeline b finishes?
Update:
I changed it to:
a >> b >> c
Now it seems that a will kick off and complete, but b never begins. The DAG is active ("On"). Task a is still in a state of 'running' on Airflow, but in Dataflow the job has completed. How do I get Airflow to recognize the Dataflow job has completed and proceed with task b?
I don't know if that's pseudocode, but your dependency should look like this (if you're talking about operators):
a >> b >> c
In the graph view, do lines appear between a --- b, and b --- c? If the dependency is not set properly you will see all three of those operators just 'on the graph' without lines, and they will therefore be scheduled together.
If you want each DAG to complete one at a time, then set
max_active_runs=1
in the DAG() definition (not default args).
max_active_runs defines how many running concurrent instances of a DAG there are allowed to be.
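A minimal sketch of where that parameter goes (the DAG id and schedule here are illustrative, not from the question):
from airflow import DAG

dag = DAG(
    'dataflow_pipelines',           # illustrative DAG id
    default_args=default_dag_args,
    schedule_interval='@daily',     # illustrative schedule
    max_active_runs=1,              # at most one running DAG run at a time
)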
Finally another thing you can do - create a pool with 1 slot and assign this DAG to that pool.
Use the following default_dag_args:
default_dag_args = {
    'start_date': yesterday,
    'depends_on_past': False,
    'wait_for_downstream': True
}
The depends_on_past flag actually looks at the previous run of the same task. So if the previous task instance failed, this task doesn't run. Example: if task A ran yesterday and failed, then today's run of task A won't start while depends_on_past is True.
depends_on_past: when set to true, task instances will run sequentially while relying on the previous task’s schedule to succeed. The task instance for the start_date is allowed to run.

Can you reserve a set amount of celery workers for specific tasks or set a task to higher priority when you delay it?

My Django application currently takes in a file and reads it line by line; for each line there's a Celery task that delegates processing of that line.
Here's roughly what it looks like:
File -> For each line in file -> celery_task.delay(line)
Now then, I also have other Celery tasks that can be triggered by the user, for example:
User input line -> celery_task.delay(line)
This of course isn't strictly the same task; the user can in essence invoke any Celery task depending on what they do (signals also invoke some tasks as well).
Now the problem that I'm facing is that when a user uploads a relatively large file, my Redis queue gets bogged down with processing the file; whatever the user then does, their task is delegated and executed only after the file's celery_task.delay() tasks are done executing. My question is: is it possible to reserve a set number of workers, or to delay a Celery task with a "higher" priority so it jumps ahead of the queued work?
Here's in general what the code looks like:
@app.task(name='process_line')
def process_line(line):
    some_stuff_with_line(line)
    do_heavy_logic_stuff_with_line(line)
    more_stuff_here(line)
    obj = Data.objects.make_data_from_line(line)
    serialize_object.delay(obj.id)
    return obj.id

@app.task(name='serialize_object')
def serialize_object(important_id):
    obj = Data.objects.get(id=important_id)
    obj.pre_serialized_json = DataSerializer(obj).data
    obj.save()

@app.task(name='process_file')
def process_file(file_id):
    ingested_file = IngestedFile.objects.get(id=file_id)
    for line in ingested_file.get_all_lines():
        process_line.delay(line)
Yes, you can create multiple queues, and you can route your tasks to those queues and run them on multiple workers or on a single worker. By default, all tasks go to the default queue, which is named celery. Check the Celery documentation on Routing Tasks for more information and some examples.
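As a rough sketch (assuming a Celery app object named app and the task names from the question; the queue name bulk is illustrative), you could route the file-processing tasks to a dedicated queue and leave everything else on the default one:
# Send the bulk file-processing tasks to their own queue;
# all other tasks keep going to the default 'celery' queue.
app.conf.task_routes = {
    'process_file': {'queue': 'bulk'},
    'process_line': {'queue': 'bulk'},
    'serialize_object': {'queue': 'bulk'},
}
You would then run one worker that consumes only the bulk queue and another that consumes only the default queue (for example celery -A yourproject worker -Q bulk and celery -A yourproject worker -Q celery), so user-triggered tasks are never stuck behind a large file import.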

Prioritize some workflow executions over others

I've been using the Flow framework for Amazon SWF, and I want to be able to run priority workflow executions and normal workflow executions. If there are priority tasks, then activities should pick up the priority tasks ahead of normal-priority tasks. What is the best way to accomplish this?
I'm thinking that the following might work but I wonder if there's a better/recommended approach.
I'll define two Activity Workers and two activity lists for the activity. One priority list and one normal list. Each worker will be using the same activity class.
Both workers will be run on the same host (ec2 instance).
On the workflow, I'll define two methods: startNormalWorkflow and startHighWorkflow. In the startHighWorkflow method, I can use ActivitySchedulingOptions to put the task on the high priority list.
The problem with this approach is that there is no guarantee that the high-priority task is scheduled before normal tasks.
It's a good question; it had me scratching my head for a while.
Of course, there is more than one way to skin this cat, and a number of valid solutions exist. I focused here on the simplest one I could conceive of, namely, executing tasks in order of priority within a single workflow.
The scenario goes as follows: I define one activity worker serving two task lists, default_tasks and urgent_tasks, with trivial logic:
If there are pending tasks on the urgent_tasks list, then pick one from there,
Otherwise, pick a task from default_tasks
Execute any task selected.
The question is how to check if any high priority tasks are pending? CountPendingActivityTasks API comes to the rescue!
I know you use Flow for development. My example is written using boto.swf.layer2 as Python is so much easier for prototyping - but the idea remains the same and can be extended to a more complex scenario with high and low priority workflow executions.
So, to accomplish the above using boto.swf follow these steps:
Export credentials to the environment
$ export AWS_ACCESS_KEY_ID=your access key
$ export AWS_SECRET_ACCESS_KEY= your secret key
Get the code snippets
For convenience, you can fork it from github:
$ git clone git@github.com:oozie/stackoverflow.git
$ cd stackoverflow/amazon-swf/priority_tasks/
To bootstrap the domain and the workflow:
# domain_setup.py
import boto.swf.layer2 as swf
DOMAIN = 'stackoverflow'
VERSION = '1.0'
swf.Domain(name=DOMAIN).register()
swf.ActivityType(domain=DOMAIN, name='SomeActivity', version=VERSION, task_list='default_tasks').register()
swf.WorkflowType(domain=DOMAIN, name='MyWorkflow', version=VERSION, task_list='default_tasks').register()
Decider implementation:
# decider.py
import boto.swf.layer2 as swf

DOMAIN = 'stackoverflow'
ACTIVITY = 'SomeActivity'
VERSION = '1.0'

class MyWorkflowDecider(swf.Decider):
    domain = DOMAIN
    task_list = 'default_tasks'
    version = VERSION

    def run(self):
        history = self.poll()
        print history
        if 'events' in history:
            # Get a list of non-decision events to see what event came in last.
            workflow_events = [e for e in history['events']
                               if not e['eventType'].startswith('Decision')]
            decisions = swf.Layer1Decisions()
            last_event = workflow_events[-1]
            last_event_type = last_event['eventType']
            if last_event_type == 'WorkflowExecutionStarted':
                # At the start, get the worker to fetch the first assignment.
                decisions.schedule_activity_task(ACTIVITY+'1', ACTIVITY, VERSION, task_list='default_tasks')
                decisions.schedule_activity_task(ACTIVITY+'2', ACTIVITY, VERSION, task_list='urgent_tasks')
                decisions.schedule_activity_task(ACTIVITY+'3', ACTIVITY, VERSION, task_list='default_tasks')
                decisions.schedule_activity_task(ACTIVITY+'4', ACTIVITY, VERSION, task_list='urgent_tasks')
                decisions.schedule_activity_task(ACTIVITY+'5', ACTIVITY, VERSION, task_list='default_tasks')
            elif last_event_type == 'ActivityTaskCompleted':
                # Complete workflow execution after 5 completed activities.
                closed_activity_count = sum(1 for wf_event in workflow_events
                                            if wf_event.get('eventType') == 'ActivityTaskCompleted')
                if closed_activity_count == 5:
                    decisions.complete_workflow_execution()
            self.complete(decisions=decisions)
        return True
Prioritizing worker implementation:
# worker.py
import boto.swf.layer2 as swf

DOMAIN = 'stackoverflow'
VERSION = '1.0'

class PrioritizingWorker(swf.ActivityWorker):
    domain = DOMAIN
    version = VERSION

    def run(self):
        urgent_task_count = swf.Domain(name=DOMAIN).count_pending_activity_tasks('urgent_tasks').get('count', 0)
        if urgent_task_count > 0:
            self.task_list = 'urgent_tasks'
        else:
            self.task_list = 'default_tasks'
        activity_task = self.poll()
        if 'activityId' in activity_task:
            print urgent_task_count, 'urgent tasks in the queue. Executing ' + activity_task.get('activityId')
            self.complete()
        return True
Run the workflow from three instances of an interactive Python shell
Run the decider:
$ python -i decider.py
>>> while MyWorkflowDecider().run(): pass
...
Start an execution:
$ python -i decider.py
>>> swf.WorkflowType(domain='stackoverflow', name='MyWorkflow', version='1.0', task_list='default_tasks').start()
Finally, kick off the worker and watch the tasks as they're getting executed:
$ python -i worker.py
>>> while PrioritizingWorker().run(): pass
...
2 urgent tasks in the queue. Executing SomeActivity2
1 urgent tasks in the queue. Executing SomeActivity4
0 urgent tasks in the queue. Executing SomeActivity5
0 urgent tasks in the queue. Executing SomeActivity1
0 urgent tasks in the queue. Executing SomeActivity3
It turns out that using a separate task list that you have to check first doesn't work well.
There are a couple of problems.
First, the count API doesn't update reliably. So you may get 0 tasks even when there are urgent tasks in the queue.
Second, the call that polls for tasks hangs if there are no tasks available. So when you poll for the non-urgent tasks, that will "stick" for either 2 minutes, or until you have a non-urgent task to do.
So this can cause all kinds of problems in your workflow.
For this to work, SWF would have to implement a polling API that could return the first task from a list of task lists. Then it would be much easier.