How to pass a build number within the MultiJob plugin? - build

The MultiJob plugin is great and I want to use it for my build process, but there is one issue I have to solve before: There are three jobs A, B and C. SVN triggers job A and B (parallel execution) and job C starts when A and B have finished. Job C requires the artifacts from job A and B as an input.
        -> Job A (with A.zip)
Trigger                        -> Job C (use artifacts A.zip and B.zip)
        -> Job B (with B.zip)
Designing the workflow with the MultiJob plugin is easy, but I have no clue how to get the corresponding artifacts from job A and B in job C. Can I pass the build numbers to job C (buildNr(A) != buildNr(B))? Or is there a smarter way to solve the issue?

The multijob plugin sets the following environment variables per job (code):
<JOBNAME>_BUILD_NUMBER
<JOBNAME>_BUILD_RESULT
Where JOBNAME is derived from the name of the job, with every character that is not a letter or a digit replaced with _. Thus you can pass the build numbers as parameters to Job C.
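As a rough illustration of the naming rule described above, the following Python sketch derives the two variable names for a given job name. The helper name multijob_env_names and the job name "Job A" are made up for this example, and the regular expression is an assumption based only on the description here (letters and digits are kept, everything else becomes _). Whether the plugin also changes the case of the name is not covered here, so check the variables that are actually present in your build.

```python
import re


def multijob_env_names(job_name):
    """Return the assumed MultiJob variable names for a job (hypothetical helper)."""
    # Assumption: every character that is not a letter or a digit becomes "_".
    safe = re.sub(r"[^A-Za-z0-9]", "_", job_name)
    return {
        "build_number": safe + "_BUILD_NUMBER",
        "build_result": safe + "_BUILD_RESULT",
    }


if __name__ == "__main__":
    # "Job A" is a hypothetical job name used only for illustration.
    print(multijob_env_names("Job A")["build_number"])
```

The resulting name is what you would reference when defining the build number parameter that is handed to Job C.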

There's a workaround using EnvInject and a groovy script:
https://issues.jenkins-ci.org/browse/JENKINS-20241

Related

Making Airflow behave like Luigi: how to prevent tasks from being re-run in future runs of a DAG if their output only needs to be obtained once?

I come from Luigi, where if a file was produced successfully by a task and the task itself was unmodified, re-runs of the DAG would not re-run that task but would reuse its previously obtained output.
Is there any way to obtain the same behavior with Airflow?
Currently, if I re-run the DAG, it re-executes all the tasks, no matter whether they produced a successful (and unchanged) output in the past. So, basically, I need a task to be marked as successful if its code is unchanged.
A crucial feature of Airflow is that all tasks are expected to be idempotent. This means that re-running a task on the same input should generally overwrite the output with a newly processed version of that data, so that the tasks depending on it can be automatically re-run. But the data might be different after reprocessing than it was originally.
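A minimal sketch of what "idempotent" means for a single task, in plain Python without any Airflow imports. The function process_day, the file naming scheme and the sample records are all hypothetical; the only point is that the output location depends on the run being processed, so a re-run replaces the earlier result.

```python
import json
import tempfile
from pathlib import Path


def process_day(records, out_dir, day):
    """Write the processed result for one day, replacing any earlier output."""
    out_path = Path(out_dir) / f"result_{day}.json"
    total = sum(records)
    # The path depends only on the day being processed, so a re-run overwrites
    # the previous file instead of appending to it or creating a duplicate.
    out_path.write_text(json.dumps({"day": day, "total": total}))
    return out_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        first = process_day([1, 2, 3], tmp, "2020-01-01")
        second = process_day([1, 2, 3], tmp, "2020-01-01")
        # Both runs target the same file, and only one file exists afterwards.
        print(first == second, len(list(Path(tmp).iterdir())))
```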
That's why Airflow has a backfill command, which basically means:
Please re-run this DAG for selected past runs (say, the last week's worth of runs), but JUST reprocess starting from task X (which will re-run task X and ALL tasks that depend on its output).
This also means that when you want to re-run parts of past DAG runs but you know that you want to rely on the existing output of certain tasks there, you only backfill the tasks that depend on the output of that task (but not the task itself).
This allows for much more flexibility, because you define which tasks in past DAG runs should be re-run (you basically invalidate the outputs of certain tasks by making them the target of a backfill).
This covers more than the case you mention:
a) if you do not want to change the output of a certain task, you do not backfill that task, only the task(s) that follow from it
b) more importantly, if you want to re-process a task even if the task input and the task itself were not modified, you can still do it by backfilling that task.
Case b) is often important, because some tasks might have implicit dependencies that change: even if the inputs and the task did not change, processing it again might produce a different (often better) result.
A good example that I've heard of is telecom operators re-processing call records, where you have to determine phone models from the IMEI of the phones. In this case you might have a single service that does the mapping, but it might get updated to a newer version when manufacturers refresh their model database. When new phones are introduced, the refresh happens with some delay, so regularly reprocessing the last week's data might give different results even if the input ("list of calls") and the task ("execute map IMEIS to phone models") did not change from the DAG's Python point of view.
Airflow almost always calls external services to run certain tasks, and those services themselves might improve over time. This means that skipping re-processing whenever neither the input nor the task code has changed is very limiting (but you can still deliberately decide on it by choosing the backfill scope, i.e. which tasks to reprocess).
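To make the idea of a backfill scope concrete, here is a small plain-Python sketch (it does not use Airflow itself). It computes which tasks are re-run when you start from a given task: the task itself plus everything that depends on its output. The function tasks_to_rerun and the task names are hypothetical, and the DAG is just a dict that maps each task to its direct downstream tasks.

```python
from collections import deque


def tasks_to_rerun(downstream, start):
    """Return the start task plus every task that depends on its output."""
    seen = {start}
    queue = deque([start])
    while queue:
        task = queue.popleft()
        for child in downstream.get(task, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


if __name__ == "__main__":
    # Hypothetical DAG: extract -> map_imei -> report, and extract -> audit.
    dag = {"extract": ["map_imei", "audit"], "map_imei": ["report"]}
    # Starting from map_imei leaves the outputs of extract and audit untouched.
    print(sorted(tasks_to_rerun(dag, "map_imei")))
```

Starting from the first task in the sketch re-runs everything, while starting further down keeps the existing upstream outputs, which is case a) above.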

AWS CodeCommit: trigger 2 different projects

I want to set up a trigger, or any other mechanism, so that every change I make in one of the repos in CodeCommit triggers 2 different jobs.
Let's say I have repos A, B and C: whenever a change happens in A, I only want to build B and C.
A is a source of modules that don't need to be built.
The solution was a multi-source trigger.
I created a pipeline with 2 sources but 1 build.
That means one pipeline listens to changes in A and B, and another to A and C, but in the end one pipeline builds only B and the other only C. A isn't built at all.
Posting this in case someone faces the same problem in the future.
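The layout described above can be modelled in a few lines of Python to check which builds a change would trigger. This is only an illustration of the wiring, not AWS configuration: the pipeline names and the helper builds_triggered_by are made up.

```python
# Hypothetical layout: each pipeline watches two source repos but builds one.
PIPELINES = {
    "pipeline_b": {"sources": {"A", "B"}, "builds": "B"},
    "pipeline_c": {"sources": {"A", "C"}, "builds": "C"},
}


def builds_triggered_by(changed_repo):
    """Return the repos that get built when changed_repo receives a commit."""
    return sorted(
        p["builds"] for p in PIPELINES.values() if changed_repo in p["sources"]
    )


if __name__ == "__main__":
    for repo in ("A", "B", "C"):
        print(repo, "->", builds_triggered_by(repo))
```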

When does an action not run on the driver in Apache Spark?

I have just started with Spark and was struggling with the concept of tasks.
Can anyone please help me understand when an action (say, reduce) does not run in the driver program?
From the Spark tutorial:
"Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel."
I'm currently experimenting with an application which reads a directory of 'n' files and counts the number of words.
From the web UI, the number of tasks is equal to the number of files, and all the reduce functions are taking place on the driver node.
Can you please describe a scenario where the reduce function won't execute on the driver? Does a task always include "transformation+action" or only "transformation"?
All the actions are performed on the cluster and results of the actions may end up on the driver (depending on the action).
Generally speaking, the Spark code you write around your business logic is not the program that actually runs. Rather, Spark uses it to create a plan which will execute your code in the cluster. The plan creates a task out of all the actions that can be done on a partition without the need to shuffle data around. Every time Spark needs the data arranged differently (e.g. after sorting), it will create a new task and a shuffle between the first and the latter tasks.
I'll take a stab at this, although I may be missing part of the question. A task is indeed always transformation(s) and an action. The transformations are lazy and would not submit anything, thus the need for an action. You can always call .toDebugString on your RDD to see where each job split will be; each level of indentation is a new stage. I think the reduce function showing on the driver is a bit of a misnomer, as it will run first in parallel and then merge the results. So I would expect that the task does indeed run on the workers as far as it can.
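The "run first in parallel and then merge the results" idea can be imitated in plain Python. This is a simulation of the concept only, not Spark code: the partitions are ordinary lists, nothing actually runs in parallel, and the name simulated_reduce and the sample numbers are made up.

```python
from functools import reduce


def simulated_reduce(partitions, func):
    """Reduce each partition separately, then merge the partial results."""
    # Step 1: conceptually one task per partition, running on the workers.
    partials = [reduce(func, part) for part in partitions if part]
    # Step 2: the few partial results are merged in one place (the driver).
    return reduce(func, partials)


if __name__ == "__main__":
    # Hypothetical per-file word counts, one inner list per partition.
    word_counts_per_file = [[3, 5], [7], [2, 2, 1]]
    print(simulated_reduce(word_counts_per_file, lambda a, b: a + b))
```

The two-step split only gives the same answer as a single pass because the function is commutative and associative, which is the requirement quoted from the tutorial above.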

Wrapping code for Pig script, Hive queries and corresponding MapReduce code

I am working on 2 datasets. I have MapReduced those, then operated on the output by means of Pig and Hive. I want to execute all these steps at once, in sequence. How should I wrap these things into a single script, i.e. the MapReduce code, followed by the Pig script and finally a few Hive queries?
Thanks,
Ketan
You need to wrap those in an Oozie workflow.
Oozie enables you to run a collection of actions arranged in a DAG - check this link
They have good documentation, so you can start with that.

Removing old jobs from Jenkins

I'd like to shelve old builds in all of my jobs, for example
build numbers 1-10
I'm wondering if there is a way to do that from the Jenkins UI using a single command.
First of all, in order to make changes to a bulk of jobs, I would use something called the Configuration Slicing plugin.
You can get it from here: https://wiki.jenkins-ci.org/display/JENKINS/Configuration+Slicing+Plugin
Also, do you want to delete your builds or archive them? In case of deleting, I would use log rotation, either by date or by number of builds. In the configure section of the job, click on Discard Old Builds and you will see the options.
Finally, you can always use ArtifactDeployer and some other examples from that plugin.
Link here: https://wiki.jenkins-ci.org/display/JENKINS/ArtifactDeployer+Plugin
Link on how to use the CLI in Jenkins: https://wiki.jenkins-ci.org/display/JENKINS/Jenkins+CLI
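To show what log rotation "by number of builds" amounts to, here is a small Python sketch of the policy. It is not Jenkins code and does not talk to Jenkins; the function builds_to_discard and its parameter max_to_keep are hypothetical names chosen for this example.

```python
def builds_to_discard(build_numbers, max_to_keep):
    """Return the build numbers that fall outside the newest max_to_keep builds."""
    ordered = sorted(build_numbers, reverse=True)  # newest first
    return sorted(ordered[max_to_keep:])


if __name__ == "__main__":
    # A job with builds 1..15 that keeps only the 5 most recent ones.
    print(builds_to_discard(range(1, 16), 5))
```

With 15 builds and 5 kept, the discarded set is builds 1-10, which matches the example in the question.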
EDIT 1
Regarding the comments below, where you are asking about "shelving jobs":
I think the term you are looking for here is "archive" and not "shelving", which is very much a Visual Studio/TFS concept, so I am not personally aware of anything that does SHELVING per se.
As for the Groovy script, I believe you are now asking a different question, so it should be raised separately as its own question. But as far as Groovy scripts go, you can use the following link as an intro:
http://groovy.codehaus.org/