How to do branch grafting on separate tasks - ducttape

I want to write a ducttape task that grafts the outputs of tasks belonging to different branches. For example, take a scenario where you have training and test data: you use the training data to develop a model, and you evaluate the model on the test set.
Here's a dummy example. prep is run on each of the data sets, but trainModel should be executed only on the training data. The goal is for applyModel to use the model that trainModel produced from the training set and to evaluate it on the test set.
task prep
    < input=$corpus
    > output
{
  # Do some common preprocessing on all corpora
  cp $input $output
}

task trainModel
    < input=$output@prep
    > trainedModel
{
  echo "Trained..." > $trainedModel
  cat $input >> $trainedModel
}

task applyModel
    < trainedModel=$trainedModel@trainModel[Set:Train]
    < testInput=$output@prep[Set:Test]
    > out
{
  diff $trainedModel $testInput > $out
}
plan DoIt {
  # reach prep via (Set: *)
  reach trainModel via (Set: Train)
  reach applyModel via (Set: *)
}

global
{
  train=train.txt
  test=test.txt
  corpus=(Set: Train=$train Test=$test)
}
However, if you look at the execution plan:
ducttape 0.3
by Jonathan Clark
Loading workflow version history...
Have 0 previous workflow versions
Finding hyperpaths contained in plan...
Found 4 vertices implied by realization plan DoIt (Clause 1 of 2)
Found 4 vertices implied by realization plan DoIt (Clause 2 of 2)
Union of all planned vertices has size 5
Checking for completed tasks
Finding packages...
Found 0 packages
Checking for already built packages (if this takes a long time, consider switching to a local-disk git clone instead of a remote repository)...
Checking inputs...
Work plan (depth-first traversal):
RUN: ./prep/Baseline.baseline (Set.Train)
RUN: ./trainModel/Baseline.baseline (Set.Train)
RUN: ./prep/Set.Test (Set.Test)
RUN: ./applyModel/Baseline.baseline (Baseline.baseline)
RUN: ./trainModel/Set.Test (Set.Test)
you can see that trainModel will be executed twice, on both the training and the test data. How can I configure the work plan so that it trains only on one or more training sets and evaluates on one or more test sets?
(I realize it might be better in this example to break the $corpus variable into sub-branches, but in more complex scripts this problem still exists.)

This is exactly the sort of scenario that branches are aimed at. The issue in the sample above is that the graft for the training data is applied too late: since you want to run training only for the training data set, you should apply the graft at the training task.
You also mentioned the use case of having multiple test sets, so I've included an example of that as well, which makes use of nested branch points.
To keep things clear, I renamed your original branch point "Set" to "Section" (i.e. train vs test data sections) and introduced a new "TestSet" branch point.
task prep
    < input=$corpus
    > output
{
  # Do some common preprocessing on all corpora
  cp $input $output
}

task trainModel
    < input=$output@prep[Section:Train]   <----- EDITED
    > trainedModel
{
  echo "Trained..." > $trainedModel
  cat $input >> $trainedModel
}

task applyModel
    < trainedModel=$trainedModel@trainModel   <----- EDITED
    < testInput=$output@prep[Section:Test]
    > out
{
  diff $trainedModel $testInput > $out
}

plan DoIt {
  reach applyModel via (TestSet: *)   <----- EDITED
}

global
{
  train=train.txt
  test_a=test_a.txt
  test_b=test_b.txt
  corpus=(Section:
           Train=$train
           Test=(TestSet: A=$test_a B=$test_b)   <----- EDITED
         )
}

Related

Informatica Looping

I am looking for information on looping in Informatica. Specifically, I need to check if a source table has been loaded, if it has, move to next step, if not wait X minutes and check the status table again. I would prefer direction to a place I can learn this on my own, but I need to confirm this is even possible as I have not found anything on my google searches.
You can use a simple shell script to implement this wait-and-check capability.
#!/bin/sh
# Call it as script_name.sh.
# It waits 10 minutes between checks; in total it will wait 2 hours. Change these values if you want to.
# The source is assumed to be Oracle. Change it as per your source.
interval=600
loop_count=12
counter=0
while true
do
    counter=`expr $counter + 1`
    db_value=`sqlplus -s user/pass@local_SID <<EOF
set heading off
set feedback off
SELECT count(*) FROM my_source_table;
exit
EOF`
    if [ $db_value -gt 0 ]; then
        echo "Data Found."
        exit 0
    else
        if [ $counter -eq $loop_count ]
        then
            echo "No data found in source after 2 hours"
            exit 1
        else
            sleep $interval
        fi
    fi
done
Add this shell script (in a Command task) to the beginning of the workflow.
Then use an Informatica link condition: if status = 0, proceed; otherwise email that the wait time is over.
This will send a mail if the wait time is over and the data is still not in the source.
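As a language-neutral sketch of that link-condition pattern (branch on the polling task's exit status; check_source here is a made-up stand-in for the script above):

```shell
#!/bin/sh
# Hypothetical stand-in for the polling script above:
# exit status 0 = data found, 1 = wait time over.
check_source() { return 0; }

if check_source; then
    echo "status 0: data found, proceed with the workflow"
else
    echo "wait time over: send notification email"
fi
```

The link condition in the workflow plays the role of this if/else: one branch continues the load, the other triggers the notification.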
In general, looping is not supported in Informatica PowerCenter.
One way is to use scripts, as discussed by Koushik.
Another way to do that is to have a Continuously Running Workflow with a timer. This is configurable on the Scheduler tab of your workflow.
Such a configuration makes the workflow start again right after it succeeds, over and over again.
The workflow would look like:
Start -> s_check_source -> Decision -> timer
                               |-> s_do_other_stuff -> timer
This way it will check the source. If the source has not been loaded, it triggers the timer, succeeds, and gets triggered again.
If the source turns out to be loaded, it triggers the other session and completes; you would probably need another timer here to wait until the next day, or until whenever you'd like the workflow to be triggered again.

Linkages of Blocks in Hyperledger Blockchain

In case of Bitcoin Blockchain blocks are linked with hash of previous block. How blocks are linked in the Hyperledger Fabric blockchain? And how to visualize them?
You can find how blocks are structured in Hyperledger Fabric by inspecting the protobuf definitions here, where a block is defined as:
message Block {
    BlockHeader header = 1;
    BlockData data = 2;
    BlockMetadata metadata = 3;
}
Drilling down further to the definition of BlockHeader:
message BlockHeader {
    uint64 number = 1;        // The position in the blockchain
    bytes previous_hash = 2;  // The hash of the previous block header
    bytes data_hash = 3;      // The hash of the BlockData, by MerkleTree
}
you will see that blocks in Hyperledger Fabric are chained with hashes in a fashion very similar to Bitcoin and many other blockchain solutions. The hash H(N) of block N is computed by the following scheme:
H(N) = hash([transactions] || H(N-1) || N)
i.e., the content of the block, the hash of the previous block, and the block sequence number.
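As a minimal sketch of that chaining scheme (this is not Fabric's actual serialization; the field values below are made up for illustration):

```shell
#!/bin/sh
# Toy illustration of hash chaining: each block's hash covers its
# transactions, the previous block's hash, and its sequence number.
transactions="tx1,tx2,tx3"   # made-up block content
prev_hash="0000abcd"         # made-up hash of block N-1
number=7                     # block sequence number N

block_hash=$(printf '%s%s%s' "$transactions" "$prev_hash" "$number" | sha256sum | cut -d' ' -f1)
echo "H($number) = $block_hash"
```

Changing any input (a transaction, the previous hash, or the sequence number) changes the resulting hash, which is exactly what links the chain together.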
Many aspects are well covered in the "Getting Started" documentation.
If you are using IBM Bluemix, here is a simple tutorial with Go:
# Create the parent directories on your GOPATH
mkdir -p $GOPATH/src/github.com/hyperledger
cd $GOPATH/src/github.com/hyperledger
# Clone the appropriate release codebase into $GOPATH/src/github.com/hyperledger/fabric
# Note that the v0.5 release is a branch of the repository. It is defined below after the -b argument
git clone -b v0.5-developer-preview https://github.com/hyperledger-archives/fabric.git
If you are installing the IBM Bluemix v0.6 release, use this for your git clone command:
# The v0.6 release exists as a branch inside the Gerrit fabric repository
git clone -b v0.6 http://gerrit.hyperledger.org/r/fabric
If the fabric is not installed properly on your GOPATH, you will see errors when building your chaincode.
Apart from that, you can simply refer to the documentation for the linking process.

Jenkins Pipeline Job with Condition

I am creating a Jenkins pipeline. This pipeline is building three jobs (JobOne, JobTwo, JobThree). I am able to run the job with following code.
node {
    stage 'Stage 1'
    echo 'Hello World 1'
    build 'Pipeline-Test/JobOne'

    stage 'Stage 2'
    echo 'Hello World 2'
    build 'Pipeline-Test/JobTwo'

    stage 'Stage 3'
    echo 'Hello World 3'
    build 'Pipeline-Test/JobThree'
}
Now I want to put some conditions into it. For example, when JobOne fails, it has to restart once more. When JobTwo passes, I want to run the job again. And JobThree should run 10 minutes after JobTwo completes. I am not sure how to build the pipeline with these conditions; I am new to Jenkins Pipeline.
I checked a few Jenkins wiki pages but could not find a proper way to implement the conditions above. I tried the code below just to check how an 'if' condition can be implemented, but it fails.
node {
    stage 'Stage 1'
    echo 'Hello World 1'
    build 'Pipeline-Test/JobOne'
    post {
        always {
            build 'Pipeline-Test/JobOne'
        }
    }
}
Error:
java.lang.NoSuchMethodError: No such DSL method 'post' found among [archive, bat, build, catchError, checkout, checkpoint, deleteDir, dir, dockerFingerprintFrom, dockerFingerprintRun, echo, error, fileExists, git, input, isUnix, load, mail, node, parallel, properties, publishHTML, pwd, readFile, retry, sh, sleep, sshagent, stage, stash, step, teamconcert, timeout, tool, triggerRemoteJob, unarchive, unstash, waitUntil, withCredentials, withDockerContainer, withDockerRegistry, withDockerServer, withEnv, wrap, writeFile, ws]
at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:107)
Can someone please guide me through this?
Thanks in advance!
There is definitely a learning curve with Jenkins Pipelines, so don't get discouraged :)
I recommend checking out Jenkins' official documentation as well as the Pipeline Steps Reference page to anyone getting started with Jenkins Pipelines.
FYI, stage without a block argument is deprecated; you should define stages as follows:
stage('Name of Stage') {
// code
}
Pipelines have a retry step that you can use to retry the JobOne build if it fails. Note that its argument is the total number of attempts, so you need retry(2) to get one retry after a failure.
To wait 10 minutes between stages 2 and 3, you can use the sleep step.
if statements are written just like in Java, since Groovy compiles to JVM bytecode:
if (animal == 'dog' || isHungry == true) {
Combining each of these, I think this is what you can use:
node {
    stage('Stage 1') {
        echo 'Hello World 1'
        retry(2) { // up to 2 attempts in total, i.e. one retry on failure
            build 'Pipeline-Test/JobOne'
        }
    }
    stage('Stage 2') {
        echo 'Hello World 2'
        build 'Pipeline-Test/JobTwo'
    }
    sleep time: 10, unit: 'MINUTES'
    stage('Stage 3') {
        echo 'Hello World 3'
        build 'Pipeline-Test/JobThree'
    }
}

rawlist function hangs for hours and never completes on large folder

I am retrieving a list of files in a folder with tens of thousands of files. I've let the script execute for a few hours but it never seems to progress after this part. Is there a way to diagnose what's going on or the progress of the list? It used to take a few minutes to load the list, so I am not sure why it's all of a sudden hanging now.
Code
function EchoDump($echoOut = 0)
{
    if ($echoOut)
        echo str_repeat("<!-- Agent Smith -->", 1000);
    else
        return str_repeat("<!-- Agent Smith -->", 1000);
}
$sftp_connection = new Net_SFTP($website, 2222);
$login_result = $sftp_connection->login($credentials->login(), $credentials->password());
echo "<pre>Login Status: ".($login_result ? "Success" : "Failure")."</pre>";

if ($sftp_connection->chdir($photosFolder))
    echo "<pre>Changed to $photosFolder</pre>";
else
    echo "<pre>Failed to change to $photosFolder</pre>";

echo "<pre>Downloading List.</pre>";
EchoDump(1);
$rawlist = $sftp_connection->rawlist();
echo "<pre>List downloaded.</pre>";
EchoDump(1);
Output
Login Status: Success
Changed to /vspfiles/photos
Downloading List.
And then it will just sit with Downloading List as the last output forever.
Real-time logging might help provide some insight. You can enable it by adding define('NET_SSH2_LOGGING', 3); to the top of your script. Once you have that, maybe post it on pastebin.org for dissection.
That said, rawlist() doesn't return directory info on demand; it returns everything at once. It's not the most efficient design, but back in the phpseclib 1.0 days iterators didn't exist in PHP: phpseclib 1.0 works with PHP 4, so it had to work with what it had. The current master branch requires PHP 5.6 and is under active development, so maybe this behavior will change, but we'll see.

Linux - Detecting idleness

I need to detect when a computer is idle for a certain time period. My definition of idleness is:
No users logged in, either by remote methods or on the local machine
X server inactivity, with no movement of mouse or key presses
TTY keyboard inactivity (hopefully)
Since the majority of distros have now moved to logind, I should be able to use its DBUS interface to find out if users are logged in, and also to monitor logins/logouts. I have used xautolock to detect X idleness before, and I could continue using that, but xscreensaver is also available. Preferably however I want to move away from any specific dependencies like the screensaver due to different desktop environments using different components.
Ideally, I would also be able to base idleness on TTY keyboard inactivity, however this isn't my biggest concern. According to this answer, I should be able to directly query the /dev/input/* interfaces, however I have no clue how to go about this.
My previous attempts at making such a monitor have used Bash, due to the ease of changing a plain-text script file; however, I am happy to use C++ in case more advanced methods are required to accomplish this.
From a purely shell standpoint (since you tagged this bash), you can get really close to what you want.
#!/bin/sh
users_are_logged_in() {
    who | grep -q .
    return $?
}

x_is_blanked() {
    local DISPLAY=:0
    if xscreensaver-command -time | grep -q 'screen blanked'; then
        return 0 # we found a blanked xscreensaver: return true
    fi
    # no blanked xscreensaver. Look for DPMS modes
    xset -q | awk '
        /DPMS is Enabled/ { dpms = 1 }    # DPMS is enabled
        /Monitor is On$/  { monitor = 1 } # The monitor is on
        END { if (dpms && !monitor) { exit 0 } else { exit 1 } }'
    return $? # true when DPMS is enabled and the monitor is not on
}

nobody_here() {
    ! users_are_logged_in && x_is_blanked
    return $?
}

if nobody_here; then
    sleep 2m
    if nobody_here; then
        : # machine is idle; do something here
    fi
fi
This assumes that a user can log in within two minutes and that otherwise there is no TTY keyboard activity.
You should verify that who | grep works on your system (i.e. that there are no headers). I had originally grepped for / but that won't work on FreeBSD. If who has headers, maybe try [ $(who | grep -c .) -gt 1 ], which will tell you whether who outputs more than one line.
I share your worry about the screensaver part; xscreensaver likely isn't running in the login manager (any other form of X would involve a user logged in, which who would detect), e.g. GDM uses gnome-screensaver, whose syntax would be slightly different. The DPMS part may be good enough, giving a far larger buffer for graphical logins than the two minutes for console login.
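To illustrate that header-tolerant count check (using a canned sample in place of live who output; sample_who_output is a made-up helper):

```shell
#!/bin/sh
# Made-up stand-in for `who` on a system whose output includes a header line.
sample_who_output() {
    printf 'NAME     LINE\n'
    printf 'alice    tty1\n'
}

# With a header present, more than one line means a real user is logged in.
if [ "$(sample_who_output | grep -c .)" -gt 1 ]; then
    echo "users logged in"
else
    echo "nobody logged in"
fi
```

The same test against a header-only sample (no user lines) would report nobody logged in, which is why counting lines is more portable than grepping for a pattern like /.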
Using return $? on the last line of a function is redundant; I used it to clarify that we're actually using the return value from the previous line. nobody_here short-circuits, so if users are logged in, the more expensive check of X's status is skipped.
Side note: be careful with the term "idle", as it more typically refers to resource (i.e. hardware) consumption, such as CPU load. See the uptime command and its load averages for the most common way of determining system (resource) idleness. (This is why I named my function nobody_here instead of e.g. is_idle.)
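For reference, a quick way to look at that resource-idleness signal: on Linux, /proc/loadavg exposes the same three load averages that uptime prints (the fallback values below are made up, for illustration on systems without that file):

```shell
#!/bin/sh
# Read the 1-, 5-, and 15-minute load averages.
if [ -r /proc/loadavg ]; then
    read -r load1 load5 load15 _ < /proc/loadavg
else
    # Fallback sample values for non-Linux systems (made up for illustration)
    load1=0.15 load5=0.10 load15=0.05
fi
echo "load averages: $load1 $load5 $load15"
```

A load average near zero over all three windows is the usual indicator of resource idleness, which is a different question from whether anybody is at the keyboard.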