I am using BranchPythonOperator to create branches in Airflow. My use case is that I need to make two branches from the mainstream. Branch A (which has a few tasks) should be followed when somefile.csv is present; otherwise Branch B (which has no tasks) should be followed. At the end, both branches should merge back into the mainstream.
Right now I am able to follow either Branch A or Branch B, but the issue is: if Branch B is followed, the final mainstream tasks execute, whereas if Branch A is followed, the final mainstream tasks are skipped.
MainstreamTaskA.set_downstream(MainstreamTaskB)
MainstreamTaskB.set_downstream(BranchATaskA)
BranchATaskA.set_downstream(MainstreamTaskC)
MainstreamTaskB.set_downstream(MainstreamTaskC)
I have set the trigger rule to "all_done" on MainstreamTaskB and MainstreamTaskC.
Can somebody guide me through this?
I cannot see the other branch in your dependencies. The only branch is BranchATaskA.
Based on what you have described, you should have the following task dependencies, with two branch tasks, BranchATaskA and BranchATaskB:
MainstreamTaskA >> MainstreamTaskB
MainstreamTaskB >> BranchATaskA >> MainstreamTaskC
MainstreamTaskB >> BranchATaskB >> MainstreamTaskC
You should set the trigger rule to all_done on MainstreamTaskC.
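Here is a minimal sketch of that layout, assuming Airflow 2.3+ (EmptyOperator; use DummyOperator on older versions). The DAG name, file path and bash commands are placeholders; the key points are that the BranchPythonOperator returns the task_id of the branch to follow and that the join task carries the relaxed trigger rule:

import os
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator


def choose_branch():
    # Follow branch A only when the file is present; otherwise take the empty branch B.
    return "BranchATaskA" if os.path.exists("/tmp/somefile.csv") else "BranchATaskB"


with DAG("branch_and_merge", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    mainstream_a = BashOperator(task_id="MainstreamTaskA", bash_command="echo A")
    mainstream_b = BranchPythonOperator(task_id="MainstreamTaskB", python_callable=choose_branch)
    branch_a_task_a = BashOperator(task_id="BranchATaskA", bash_command="echo branch A")
    branch_a_task_b = EmptyOperator(task_id="BranchATaskB")  # stand-in for the empty branch B
    # The join must still run when one branch is skipped, hence the trigger rule.
    mainstream_c = BashOperator(
        task_id="MainstreamTaskC",
        bash_command="echo C",
        trigger_rule="all_done",
    )

    mainstream_a >> mainstream_b
    mainstream_b >> branch_a_task_a >> mainstream_c
    mainstream_b >> branch_a_task_b >> mainstream_c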
Is there a way to add an end step to a SageMaker pipeline that still runs at the very end (and runs code) even if previous steps fail? At first I thought we could make it a Fail step, but that only lets you return an error message and doesn't let you run code. If we made it a conditional step, how would we make sure it ran at the very end without depending on any previous steps? I thought of adding all previous steps as dependencies so that it runs at the end, but then the end step wouldn't run if any step before it failed.
I tried using the Fail step, but it doesn't let me provide code. I tried adding dependencies, but then it won't run if other steps fail before it. I tried having no dependencies, but then it won't run at the end.
SageMaker Pipelines doesn't currently have a finally construct like this. Any step failure stops the pipeline immediately.
This might be added in the future, but for now the best way to accomplish it is an EventBridge rule on pipeline status change that triggers a Lambda, a SageMaker pipeline, etc., which runs your failure logic.
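A rough sketch of that wiring with boto3 follows; the rule name, Lambda ARN and the "Failed" status filter are assumptions, and the event pattern fields should be double-checked against the SageMaker EventBridge documentation:

import json

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Hypothetical ARN of the Lambda that holds the "finally" logic.
CLEANUP_LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:pipeline-cleanup"

# Match SageMaker pipeline execution status changes, filtered to failed runs.
pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Building Pipeline Execution Status Change"],
    "detail": {"currentPipelineExecutionStatus": ["Failed"]},
}

rule = events.put_rule(Name="sagemaker-pipeline-failed", EventPattern=json.dumps(pattern))

# Point the rule at the cleanup Lambda...
events.put_targets(
    Rule="sagemaker-pipeline-failed",
    Targets=[{"Id": "cleanup-lambda", "Arn": CLEANUP_LAMBDA_ARN}],
)

# ...and allow EventBridge to invoke it.
lambda_client.add_permission(
    FunctionName=CLEANUP_LAMBDA_ARN,
    StatementId="allow-eventbridge-pipeline-failed",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)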
I come from experiences with Luigi, where if a file was produced successfully by a task and the task was also unmodified, then re-runs of the DAG would not re-run that task, but would reuse its previously-obtained output.
Is there any way to obtain the same behavior with Airflow?
Currently, if I re-run the DAG, it re-executes all the tasks, regardless of whether they produced a successful (and unchanged) output in the past. So, basically, I need a task to be marked as successful if its code was unchanged.
It is a crucial and deliberate feature of Airflow that all tasks are idempotent. This means that re-running a task on the same input should generally overwrite the output with a newly processed version of that data, so that tasks depending on it can be automatically re-run. The data might, however, be different after reprocessing than it was originally.
That's why Airflow has a backfill command, which basically means:
Please re-run this DAG for selected past runs (say, last week's worth of runs), but JUST reprocess starting from task X (which will re-run task X and ALL tasks that depend on its output).
This also means that when you want to re-run parts of past DAGs but you know you want to rely on the existing output of certain tasks, you only backfill the tasks that depend on the output of that task (but not the task itself).
This allows for much more flexibility in defining which tasks in past DAG runs should be re-run (you basically invalidate the outputs of certain tasks by making them the target of the backfill).
This covers more than the case you mention:
a) if you do not want to change the output of a certain task, you do not backfill that task, only the task(s) that follow from it
b) more importantly, if you want to reprocess a task even though neither its input nor the task itself was modified, you can still do it, by backfilling that task.
Case b) is often important, because some tasks might have implicit dependencies that change; even if the inputs and the task did not change, processing it again might produce a different (often better) result.
A good example I've heard is the re-processing of call records by telecom operators, where you have to determine phone models from the IMEIs of the phones. In this case you might have a single service that does the mapping, but it gets updated to a newer version when manufacturers refresh their model database; since new phones are introduced and the refresh happens with some delay, regularly reprocessing the last week's worth of data might give different results even if the input ("list of calls") and the task ("map IMEIs to phone models") did not change from the DAG's Python point of view.
Airflow almost always calls external services to run certain tasks, and those services themselves might improve over time. This means that limiting re-processing to cases where neither the input nor the task code has changed is very restrictive (but you can still deliberately decide on it by choosing the backfill scope, i.e. which tasks to reprocess).
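As a rough illustration of how scoping a re-run maps onto the dependency graph, here is a hypothetical three-task DAG; the names, dates and commands are made up, and the CLI invocations in the comments assume the Airflow 2.x "airflow tasks clear" syntax:

# Hypothetical pipeline: extract_calls -> map_imeis -> aggregate.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("call_records", start_date=datetime(2021, 1, 1), schedule_interval="@daily") as dag:
    extract = BashOperator(task_id="extract_calls", bash_command="echo extract")
    map_imeis = BashOperator(task_id="map_imeis", bash_command="echo map")
    aggregate = BashOperator(task_id="aggregate", bash_command="echo aggregate")
    extract >> map_imeis >> aggregate

# Case a) keep map_imeis' existing output and recompute only its consumer:
#   airflow tasks clear call_records -t "aggregate" -s 2021-01-01 -e 2021-01-07 -y
# Case b) invalidate map_imeis itself (and everything downstream of it):
#   airflow tasks clear call_records -t "map_imeis" --downstream -s 2021-01-01 -e 2021-01-07 -y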
We have multiple release branches in our product (currently this is unavoidable). For this question, suppose we have two:
master
release/r-858
Both have classic CI builds. Now I want to replace them with YAML builds. Our requirement is simple: have two distinct build definitions pointing to a YAML script, one for master and one for release/r-858.
At first I thought it would be a trivial exercise:
Create YAML build script in master. Set the CI trigger to master.
Cherry-pick (never mind why not merge) to release/r-858 - set the CI trigger to release/r-858.
Not ideal, because the two scripts differ only in their CI trigger. But "I am learning, it is good enough for now", I said to myself.
However, this simple scheme does not work! The build I created for release/r-858 is triggered on changes in master!
I double-checked every build setting I know about; everything looks correct.
Please observe:
The master build
The release/r-858 build
Uh oh, look at that. It shows the YAML on the master branch! Well, maybe it is an innocent presentation bug? Let us check the branch I am supposed to build from:
Yup, the file is different - I am playing with the trigger trying to solve the very same problem this question is about. The original code had release/r-858 instead of $(Build.SourceBranch) as the CI trigger, but since it did not help I started playing with all kinds of trigger values.
To remove any doubt, here is the proof the branch corresponds to release/r-858:
C:\xyz\58 [arch/shelve/798914 ≡ +0 ~17 -0 !]> git lg -2
cfdb6a9a86a | (HEAD -> arch/shelve/798914, origin/arch/shelve/798914) Rename azure-pipelines-ci-58.yml to azure-pipelines-ci.yml and sync it with the master version (68 seconds ago) [Kharitonov, Mark] (2020-08-14 09:09:46 -0400)
a931e3bd96b | (origin/release/r-858, release/r-858) Merged PR 90230: 793282 Work Assignments Merge (28 minutes ago) [Mihailichenco, Serghei] (2020-08-14 12:02:20 -0400)
C:\xyz\58 [arch/shelve/798914 ≡ +0 ~17 -0 !]>
Anyway, more build properties:
The problem
So a developer pushed some code to master and now the release/r-858 build is running:
Why is this? One of our guys asked a similar question in the Microsoft Developer Community forum, but that thread does not make sense to me.
What am I doing wrong here?
Edit 1
Imagine a big enterprise monolithic application. It is deployed in production at version 858. At the same time, developers work on the next version and also hot fixes and service packs for the version already deployed in prod.
A change can be made only in master or only in release/r-858 or in both (not at the same time, though). Many teams are working at the same time on many different aspects of the application and hence QA has many pods where the application is deployed. As I have mentioned above - about 150 pods for the bleeding edge (master) and about the same amount for the already released code, because there is active work to test hot fixes and service packs.
I appreciate that this arrangement is not ideal. It is this way not because we love it, but because one has to deal with decade-old decisions. We are working to change it, but it takes time.
Anyway, the current process is to have 2 build definitions (in reality there are more, for different reasons). So far we have used classic CI builds; now we want to migrate to YAML (which we already use for microservices, but not for the monolith).
Now I understand that we can have different release pipelines based off the same build definition, but with different branch filters.
And maybe we will. But I do not understand why it is wrong to have different build definitions here, given that each branch is a long living release branch.
Edit 2
You can ignore $(Build.SourceBranch) and imagine release/r-858 instead. The net result is exactly the same. In the scenario I describe above, code is committed to master, not release/r-858.
Edit 3
It is very confusing. Suppose I am creating a new YAML build. The dialog says "select YAML in any branch", but the point is that once selected, this branch becomes the default branch of the build. That is the branch we can see here:
If I have a single YAML file in the master branch, the build with the default branch release/r-858 cannot even use it, unless it is merged to release/r-858. I tried it - I:
created a new YAML build
selected the YAML file from the master branch
ran and right away cancelled the build
then went to edit the build and changed the branch of the build from master to release/r-858 - it allowed me to save the build, even though the YAML does not exist in that branch
But then when I tried to run the build again I got this:
An error occurred while loading the YAML build pipeline. File /Build/azure-pipelines-ci.yml not found in repository bla-bla-bla branch refs/heads/release/r-858 version 5893f559292e56cf6db48687fd910bd2916e3cef.
And indeed, looking at the raw build definition, the process section contains the YAML file path, but not the branch:
"process": {
"yamlFilename": "Build/azure-pipelines-ci.yml",
"type": 2,
"resources": {},
"target": null
},
The branch only appears in the repository section of the definition:
"repository": {
"defaultBranch": "refs/heads/release/r-858",
...
},
It is clear to me that a single build definition can be used to CI-build many branches. But the model I need to implement is one build definition per release branch. I cannot have a single build definition, for the following reasons:
Different release branches use different agent pools, because of the different development intensity. Remember, this is an on-prem Azure DevOps Server with self-hosted agents. Can we express this requirement with a single build definition?
Different build variable values, which we want to control without sending a pull request to the YAML file's repository. How do you do that with a single build definition? For example, one of the variables controls the version's Major.Minor, which is different in each release branch.
So I do not see any way to avoid multiple build definitions in our situation. The root cause for this is the release branches, but we cannot throw them away in the near future.
So, we have 2 build definitions. That forces us to have 2 YAML files, one per branch, because a build definition with the default branch release/r-858 expects to find the YAML in that branch; otherwise we cannot trigger the build manually, which is a must, even if the build has a CI trigger.
So, 2 build definitions, 2 YAMLs (one per branch). So far my hands were forced. But now I am told that the release branch build would be triggered by the master YAML just because the release branch build is linked to the same YAML file name, ignoring the default branch of the build!
Because this is what happens: a commit is checked in to master and the release branch build is invoked in addition to the master branch build! Both build definitions build exactly the same branch (master) using the master YAML script. But because the release branch build has a different set of variables, the end result is plain wrong.
This is not reasonable. I am going to create a dummy repo to reproduce it cleanly and post here.
Edit 4
As promised - a trivial reproduction. Given:
master branch build test-master-CI
release branch build test-r58-CI
Since having two build definitions necessarily means two YAMLs (one per branch), here they are:
C:\xyz\DevOps\Test [master ≡]> cat .\azure-pipelines.yml
trigger:
  branches:
    include:
    - master
name: $(BuildVersionPrefix).$(DayOfYear)$(Date:HH)
steps:
- script: echo master
C:\xyz\DevOps\Test [master ≡]> git co release/r-858
Switched to branch 'release/r-858'
Your branch is up to date with 'origin/release/r-858'.
C:\xyz\DevOps\Test [release/r-858 ≡]> cat .\azure-pipelines.yml
trigger:
  branches:
    include:
    - release/r-858
name: $(BuildVersionPrefix).$(DayOfYear)$(Date:HH)
steps:
- script: echo release/r-858
C:\xyz\DevOps\Test [release/r-858 ≡]>
Where BuildVersionPrefix = 59.0 for master and 58.3 for release/r-858
When I trigger each build manually I get this:
Now I commit a change to master. Lo and behold - both builds are triggered:
In both cases the YAML from the master branch is used. BUT the release branch defines BuildVersionPrefix = 58.3, so the master build executed by the release branch build definition gets a bogus version.
Is this really how the feature is supposed to work? That makes the CI YAML trigger useless for my scenario. Thank you, Matt, for helping me realize that.
I think I get where the confusion comes from. When you are configuring the pipeline, you are specifying the branch (notice the description says the file in any branch) and the file name.
What you are doing is just duplicating the monitoring, though. If you were to really inspect it, I think you would see that when you push to the release branch, it isn't triggering the master YAML pipeline; it is just triggering the release YAML steps a second time. That is because the pipeline is just monitoring changes to the repo and responding based on the YAML configuration. In this case, you pushed to release and it evaluated that there was a YAML that matched that trigger (the release branch's copy) and triggered both build definitions.
I verified this on a mocked-up pipeline. I had selected different branches on creation, but I believe the only thing that really impacts is the default branch used for scheduled builds. I created a simple echo statement in both of these, and it was using the release branch's YAML configuration.
I think if you really want to achieve the results you are expecting, you will want to use the override triggers that you define on the build definition instead of relying on what is in the YAML trigger.
I had the same issue and Matt helped me solve this.
I'm only writing this because the only way to get this working for me was to create a build YAML file on one branch (with the correct configuration), then create the other YAML file on the other branch, and then create the pipelines in the new shiny YAML editor within DevOps.
The key is, when in the "Configure" section of a new pipeline, select:
"Existing Azure Pipelines YAML file" which allows you to select a branch and a YAML file within that branch.
This allowed me to have the SystemOne branch build and test the system one site and the SystemTwo branch build and test the system two site.
I also added triggers inside SystemOne.yml using a wildcard, e.g.:
trigger:
  batch: true
  branches:
    include:
    - SystemOne/*
And the same for the SystemTwo.yml.
The requirement is to set the restrictions such that any new branches pushed to Stash from a developer's machine must follow our naming convention of
"feature/PPT-", "bugfix/PPT-", "hotfix/PPT-", "feature/QC",
"bugfix/QC*", or "hotfix/QC*".
We have the Yet Another Commit Checker pre-receive hook enabled, and it has an option to restrict pushes using:
Branch Regex -
If present, pushes to branches that don't match this regex will be blocked.
What is the format to be used here to meet my requirement?
Branch Name Regex
If present, only branches with names that match this regex will be allowed to be created. This affects both new branches being pushed and branches created within the Bitbucket Server UI.
For example, master|(?:(?:bugfix|hotfix|feature)/[A-Z]+-\d+-.+) would enforce that pushes should be done to branches that follow the Bitbucket Server Branching Model naming convention.
https://github.com/sford/yet-another-commit-checker
Anyone using this already?
master|develop|(?:(?:bugfix/QC.*|hotfix/QC.*|feature/QC.*)), master|develop|(?:(?:bugfix/PPT-.*|hotfix/PPT-.*|feature/PPT-.*))
This is the entry that needs to go into the pre-receive hook's branch regex.
This will block pushes from the developer's Stash/Atlassian SourceTree client to branches that don't match this requirement.
Example:
Pushing to a branch feature/PPT-Test from the local Atlassian SourceTree repo works.
However, pushing to a branch feature/PPTRandom from the local Atlassian SourceTree repo fails, as the regex doesn't match.
It's .* (dot-star) for the wildcard.
We needed a regex to enforce branch names to a certain pattern. I updated it to the following regex and it worked well for me:
feature/([a-zA-Z0-9_-]*)|bugfix/([a-zA-Z0-9_-]*)|hotfix/([.a-zA-Z0-9_-]*)|release/([.a-zA-Z0-9_-]*)
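A quick way to sanity-check a candidate pattern before putting it into the hook is to test it locally. The sketch below uses Python's re module with the regex above; the branch names are made up, and fullmatch is an assumption here, since the hook validates whole branch names:

import re

# The pattern from the answer above, compiled as-is.
pattern = re.compile(
    r"feature/([a-zA-Z0-9_-]*)|bugfix/([a-zA-Z0-9_-]*)"
    r"|hotfix/([.a-zA-Z0-9_-]*)|release/([.a-zA-Z0-9_-]*)"
)

# Illustrative branch names: the first three should be allowed, the last blocked.
for branch in ["feature/PPT-Test", "bugfix/QC-123", "hotfix/prod-fix", "random/branch"]:
    allowed = pattern.fullmatch(branch) is not None
    print(f"{branch}: {'allowed' if allowed else 'blocked'}")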
I have just started with Spark and am struggling with the concept of tasks.
Can anyone please help me understand when an action (say reduce) does not run in the driver program?
From the Spark tutorial:
"Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel."
I'm currently experimenting with an application which reads a directory of 'n' files and counts the number of words.
From the web UI, the number of tasks is equal to the number of files, and all the reduce functions appear to take place on the driver node.
Can you please describe a scenario where the reduce function won't execute on the driver? Does a task always include "transformation + action", or only "transformation"?
All the actions are performed on the cluster, and the results of the actions may end up on the driver (depending on the action).
Generally speaking, the Spark code you write around your business logic is not the program that actually runs; rather, Spark uses it to create a plan which executes your code on the cluster. The plan groups into one task all the operations that can be performed on a partition without the need to shuffle data around. Every time Spark needs the data arranged differently (e.g. after sorting), it creates a new task and a shuffle between the former and the latter tasks.
I'll take a stab at this, although I may be missing part of the question. A task is indeed always transformation(s) plus an action. The transformations are lazy and do not submit anything, hence the need for an action. You can always call .toDebugString on your RDD to see where each job will split; each level of indentation is a new stage. I think the reduce function showing on the driver is a bit of a misnomer, as it first runs in parallel and then merges the results. So I would expect that the task does indeed run on the workers as far as it can.
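A minimal PySpark sketch of the word-count scenario, assuming a local SparkContext and a made-up input path; the per-partition part of reduce runs in executor tasks, and only the final combination of the partial results happens on the driver:

from operator import add

from pyspark import SparkContext

sc = SparkContext(appName="word-count-example")

# Roughly one task per input partition (typically one per file/block in the directory).
lines = sc.textFile("/path/to/input/dir")          # hypothetical path
words = lines.flatMap(lambda line: line.split())   # lazy transformation, nothing runs yet
total_words = words.map(lambda w: 1).reduce(add)   # action: triggers the job on the executors

print("total words:", total_words)
print(words.toDebugString())  # shows the stage/lineage breakdown mentioned above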