Can Snakemake be forced to rerun rules when files are missing?

When a file that was made earlier in the pipeline is removed, SnakeMake does not seem to consider that a problem, as long as later files are there:
rule All:
    input: "testC1.txt", "testC2.txt"

rule A:
    input: "{X}{Y}.txt"
    output: "{X}A{Y}.txt"
    shell: "cp {input} {output}"

rule B:
    input: "{X}A{Y}.txt"
    output: "{X}B{Y}.txt"
    shell: "cp {input} {output}"

rule C:
    input: "{X}B{Y}.txt"
    output: "{X}C{Y}.txt"
    shell: "cp {input} {output}"
Save this SnakeFile in test.sf and do this:
rm testA*.txt testB*.txt testC*.txt
echo "test1" >test1.txt
echo "test2" >test2.txt
snakemake -s test.sf
# Rerun:
snakemake -s test.sf
# SnakeMake says all is up to date, which it is.
# Remove intermediate results:
rm testA1.txt
# Rerun:
snakemake -s test.sf
SnakeMake says all is up to date. It does not detect missing testA1.txt.
I seem to recall something in the online SnakeMake manual about this, but I can no longer find it.
I assume this is expected SnakeMake behavior. It can sometimes be desired behavior, but sometimes you may want it to detect and rebuild the missing file. How can this be done?

As mentioned in this other answer, the -R parameter can help, but there are more options:
Force a rebuild of the whole workflow
When you call
snakemake -F
this triggers a rebuild of the whole pipeline. This basically means: forget all intermediate files and start anew. It will definitely (re)generate all intermediate files along the way. The downside: it might take some time.
Force a specific rule
This is the realm of the -R <rule> parameter. This re-runs the given rule and all rules that depend on it. So in your case
snakemake -R A -s test.sf
would rerun rule A (to build testA1.txt from test1.txt) and the rules B, C and All, since they depend on A. Mind that this runs all copies of rule A that are required, so in your example testA2.txt and everything that follows from it is also rebuilt.
If, in your example, you had removed testB1.txt instead, only rules B and C would have been rerun.
Why does this happen?
If I remember correctly, Snakemake detects whether a file needs to be rebuilt by its modification time (utime). So if you have a version of testA1.txt that is younger (as in more recently created) than testB1.txt, testB1.txt has to be rebuilt using rule B to ensure everything is up to date. Hence, you cannot easily rebuild only testA1.txt without also rebuilding all following files, unless you somehow change the files' modification times.
I have not tried this out, but it can be done with Snakemake's --touch parameter. If you manage to run only rule A and then run snakemake -R B -t, which touches all output files of rule B and the rules following it, you could get a valid workflow state without actually rerunning all the steps in between.
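A minimal sketch of that sequence (untested, using the test.sf example from the question):
# regenerate only the missing intermediate file
snakemake -s test.sf -f testA1.txt
# mark the outputs of rule B and everything downstream as up to date without rerunning them
snakemake -s test.sf -R B -t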

I found this thread a while ago about the --forcerun/-R parameter that might be informative.
Ultimately, snakemake will force execution of the entire pipeline if you want to regenerate that intermediate file without having a separate rule for it or including it as a target in all.

Indeed, it would be nice if Snakemake had a flag which looked for missing intermediate results and regenerated them if missing (and all its dependencies).
I'm not aware of such an option, but there are some workarounds.
Note, the -R option suggested by m00am and Jon Chung will regenerate all other files regardless of whether intermediate files are missing or not, so this is not ideal.
Workaround 1: Force recreation of file
Force recreation of the intermediate file using the -R or -f flag (help copied below). The key here is to explicitly target the file rather than the rule.
snakemake -s test.sf testA1.txt # only works if testA1.txt was deleted
# or
snakemake -s test.sf -R testA1.txt # testA1.txt can be present or absent
# or
snakemake -s test.sf -f testA1.txt
# or
snakemake -s test.sf -F testA1.txt
Note that, at least for the latter two, the pipeline needs to be run again to update the dependent files:
snakemake -s test.sf
Prevent update of dependent files (by touching files)
If you don't want the dependent files (i.e. testB1.txt, testC1.txt) to be updated there are also options.
You could regenerate testA1.txt and then "reset" its modification time, e.g. to that of the source file, which will prevent the pipeline from updating anything:
snakemake -s test.sf -f testA1.txt
touch testA1.txt -r test1.txt
snakemake -s test.sf now won't do anything, since testB1.txt is newer than testA1.txt.
Or you could mark the dependent files (i.e. testB1.txt, testC1.txt) as "newer" using --touch:
snakemake -s test.sf -f testA1.txt
snakemake -s test.sf --touch
Workaround 2: Creating a new rule
The snakefile could be extended by a new rule:
rule A_all:
    input: "testA1.txt", "testA2.txt"
Which could then be called like so:
snakemake A_all -s test.sf
This will only generate testA1.txt, similar to -f in the workaround above, so the pipeline needs to be rerun or the modification time needs to be changed.
A trick might be to "update" an intermediate file using --touch:
snakemake -s test.sf --touch testA1.txt -n
This will "update" testA1.txt. To recreate the dependent files snakemake needs to be run as normal afterwards:
snakemake -s test.sf
Note this will not work if testA1.txt was already deleted; it needs to be done instead of deleting the file.
Relevant help on used parameters:
--touch, -t           Touch output files (mark them up to date without
                      really changing them) instead of running their
                      commands. This is used to pretend that the rules were
                      executed, in order to fool future invocations of
                      snakemake. Fails if a file does not yet exist.
--force, -f           Force the execution of the selected target or the
                      first rule regardless of already created output.
--forceall, -F        Force the execution of the selected (or the first)
                      rule and all rules it is dependent on regardless of
                      already created output.
--forcerun [TARGET [TARGET ...]], -R [TARGET [TARGET ...]]
                      Force the re-execution or creation of the given rules
                      or files. Use this option if you changed a rule and
                      want to have all its output in your workflow updated.

Related

How to use docker to test multiple compiler versions

What is the idiomatic way to write a docker file for building against many different versions of the same compiler?
I have a project which tests against a wide-range of versions of different compilers like gcc and clang as part of a CI job. At some point, the agents for the CI tasks were updated/changed, resulting in newer jobs failing -- and so I've started looking into dockerizing these builds to try to guarantee better reliability and stability.
However, I'm having some difficulty understanding what a proper and idiomatic approach is to producing build images like this without causing a large amount of duplication caused by layers.
For example, let's say I want to build using the following toolset:
gcc 4.8, 4.9, 5.1, ... (various versions)
cmake (latest)
ninja-build
I could write something like:
# syntax=docker/dockerfile:1.3-labs
# Parameterizing here possible, but would cause bloat from duplicated
# layers defined after this
FROM gcc:4.8
ENV DEBIAN_FRONTEND noninteractive
# Set the work directory
WORKDIR /home/dev
COPY . /home/dev/
# Install tools (cmake, ninja, etc)
# this will cause bloat if the FROM layer changes
RUN <<EOF
apt update
apt install -y cmake ninja-build
rm -rf /var/lib/apt/lists/*
EOF
# Default command is to use CMake
CMD ["cmake"]
However, the installation of tools like ninja-build and cmake occurs after the base image, which changes per compiler version. Since these layers are built off of a different parent layer, this would (as far as I'm aware) result in layer duplication for each different compiler version that is used.
One alternative to avoid this duplication could hypothetically be using a smaller base image like alpine with separate installations of the compiler instead. The tools could be installed first so the layers remain shared, and only the compiler changes as the last layer -- however this presents its own difficulties, since it's often the case that certain compiler versions may require custom steps, such as installing certain keyrings.
What is the idiomatic way of accomplishing this? Would this typically be done through multiple docker files, or a single docker file with parameters? Any examples would be greatly appreciated.
I would separate preparing the compiler from doing the actual build, so the source doesn't become part of the Docker image.
Prepare Compiler
For preparing the compiler I would take the ARG approach, but without copying the data into the container. If you want fast retries and have enough resources, you can spin up multiple instances at the same time.
ARG COMPILER=gcc:4.8
FROM ${COMPILER}
ENV DEBIAN_FRONTEND noninteractive
# Install tools (cmake, ninja, etc)
# this will cause bloat if the FROM layer changes
RUN <<EOF
apt update
apt install -y cmake ninja-build
rm -rf /var/lib/apt/lists/*
EOF
# Set the work directory
VOLUME /src
WORKDIR /src
CMD ["cmake"]
Build it
Here you have a few options. You could either prepare a volume with the sources or use bind mounts together with docker exec like this:
# bash style
for compiler in gcc:4.9 gcc:4.8 gcc:5.1
do
    docker build -t mytag-${compiler} --build-arg COMPILER=${compiler} .
    # place to clean the target folder
    docker run -v $(pwd)/src:/src mytag-${compiler}
done
And because the source is not part of the docker image you don't have bloat. You can also have two mounts, one for a readonly source tree and one for the output files.
Note: If you remove the CMake command you could also spin up the docker containers in parallel and use docker exec to start the build. The downside of this is that you have to take care of out of source builds to avoid clashes on the output folder.
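For illustration, a hedged sketch of that docker exec variant (the container, image and build-directory names are illustrative; keeping the build directory inside the container avoids clashes on the mounted source tree):
docker run -d --name build-gcc48 -v "$(pwd)/src:/src" mytag-gcc:4.8 sleep infinity
docker exec build-gcc48 sh -c 'mkdir -p /tmp/build && cd /tmp/build && cmake /src && cmake --build .'
docker rm -f build-gcc48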
put an ARG before the FROM and then invoke the ARG as the FROM
so:
ARG COMPILER=gcc:4.8
FROM ${COMPILER}
# rest goes here
then you
docker build . -t test/clang-8 --build-arg COMPILER=clang-8
or similar.
If you want to automate it, just make a list of compilers and a bash script that loops over the lines in your file, passing each line as the image tag and the COMPILER build arg.
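For example, a small sketch of such a loop; compilers.txt is a hypothetical file with one base image per line (e.g. gcc:4.8):
while read -r compiler; do
    tag="test/$(echo "$compiler" | tr ':/' '--')"   # turn gcc:4.8 into a valid tag like test/gcc-4.8
    docker build . -t "$tag" --build-arg COMPILER="$compiler"
done < compilers.txt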
As for CMake, I'd just do:
RUN wget -qO- "https://cmake.org/files/v3.23/cmake-3.23.1-linux-"$(uname -m)".tar.gz" | tar --strip-components=1 -xz -C /usr/local
When copying, I find it cleaner to do
WORKDIR /app/build
COPY . .
As far as I know, there is no way to do that easily and safely. You could use a RUN --mount=type=cache, but the documentation clearly says that:
Contents of the cache directories persist between builder invocations without invalidating the instruction cache. Cache mounts should only be used for better performance. Your build should work with any contents of the cache directory as another build may overwrite the files or GC may clean it if more storage space is needed.
I have not tried it but I guess the layers are duplicated anyway, you just save time, assuming the cache is not emptied.
The other possible solution you have is similar to the one you mention in the question: starting with the tools installation and then customizing it with the gcc image. Instead of starting with an alpine image, you could start FROM scratch. scratch is basically the empty image; you could COPY the files generated by
RUN <<EOF
apt update
apt install -y cmake ninja-build
rm -rf /var/lib/apt/lists/*
EOF
Then you COPY the entire gcc filesystem. However, I am not sure it will work because the order of the initial layers is now reversed. This means that some files that were in the upper layer (coming from tools) now are in the lower layer and could be overwritten. In the comments, I asked you for a working Dockerfile because I wanted to try this out before answering. If you want, you can try this method and let us know. Anyway, the first step is extracting the files created from the tools layer.
How to extract changes from a layer?
Let's consider this Dockerfile and build it with docker build -t test .:
FROM debian:10
RUN apt update && apt install -y cmake && ( echo "test" > test.txt )
RUN echo "new test" > test.txt
Now that we have built the test image, we should find 3 new layers. You mainly have 2 ways to extract the changes from each layer:
the first is running docker inspect on the image and then finding the ids of the layers in the /var/lib/docker folder, assuming you are on Linux. Each layer has a diff subfolder containing the changes. Actually, I think it is more complex than this, which is why I would opt for...
skopeo: you can install it with apt install skopeo and it is a very useful tool to operate on docker images. The command you are interested in is copy, which extracts the layers of an image and exports them as .tar files:
skopeo copy docker-daemon:{image_name}:latest "dir:/home/test_img"
where image_name is test in this case.
Extracting layer content with Skopeo
In the specified folder, you should find some tar files and a configuration file (look at the skopeo copy command output and you will know which one that is). Then extract each {layer}.tar into a different folder and you are done.
Note: to find the layer containing your tools just open the configuration file (maybe using jq because it is json) and take the diff_id that corresponds to the RUN instruction you find in the history property. You should understand it once you open the JSON configuration. This is unnecessary if you have a small image that has, for example, debian as parent image and a single RUN instruction containing the tools you want to install.
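For example, a hedged sketch (the configuration file skopeo writes is named by its sha256 digest, so substitute the real path):
CONFIG=/home/test_img/<config-digest>   # placeholder for the JSON configuration file
jq '.history' "$CONFIG"                 # one entry per Dockerfile instruction
jq '.rootfs.diff_ids' "$CONFIG"         # layer digests, in the same order as the non-empty history entries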
Get GCC image content
Now that we have the tool layer content, we need to extract the gcc filesystem. We don't need skopeo for this one; docker export is enough:
create a container from gcc (with the tag you need):
docker create --name gcc4.8 gcc:4.8
export it as tar:
docker export -o gcc4.8.tar gcc4.8
finally extract the tar file.
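A short sketch of that last step, using the folder name referenced by the Dockerfile below:
mkdir -p gcc_4.x
tar -xf gcc4.8.tar -C gcc_4.x
docker rm gcc4.8   # the temporary container is no longer needed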
Putting all together
The final Dockerfile could be something like:
FROM scratch
COPY ./tools_layer/ /
COPY ./gcc_4.x/ /
In this way, the tools layer is always reused (unless you change the content of that folder, of course), but you can parameterize the gcc_4.x with the ARG instruction for example.
Read carefully: all of this is not tested but you might encounter 2 issues:
the gcc image overwrites some files you have changed in the tools layer. You could check if this happens by computing the diff between the gcc layer folder and the tools layer folder. If it happens, you can only keep track of that file/s and add it/them in the dockerfile after the COPY ./gcc ... with another COPY.
When a file is removed in an upper layer, docker marks that file with a .wh ("whiteout") marker (not sure if it is different with skopeo). If in the tools layer you delete a file that exists in the gcc layer, then that file will not be deleted using the above Dockerfile (the COPY ./gcc ... instruction would overwrite the .wh). In this case too, you would need to add an additional RUN rm ... instruction.
This is probably not the correct approach if you have a more complex image than the one you are showing us. In my opinion, you could give this a try and just see if this works out with a single Dockerfile. Obviously, if you have many compilers, each one having its own tools set, the maintainability of this approach could be a real burden. Instead, if the Dockerfile is more or less linear for all the compilers, this might be good (after all, you do not do this every day).
Now the question is: is avoiding layer replication so important that you are willing to complicate the image-building process this much?

Dynamically-created 'zip' command not excluding directories properly

I'm the author of a utility that makes compressing projects using zip a bit easier, especially when you have to compress regularly, such as for updating projects submitted to an application store (like Chrome's Web Store).
I'm attempting to make quite a few improvements, but have run into an issue, described below.
A Quick Overview
My utility's command format is similar to command OPTIONS DEST DIR1 {DIR2 DIR3 DIR4...}. It works by running zip -r DEST.zip DIR1; a fairly simple process. The benefit to my utility, however, is the ability to use a predetermined file (think .gitignore) to ignore specific files/directories, or files/directories which match a pattern.
It's pretty simple -- if the "ignorefile" exists in a target directory (DIR1, DIR2, DIR3, etc), my utility will add exclusions to the zip -r DEST.zip DIR1 command using the pattern -x some_file or -x some_dir/*.
The Issue
I am running into an issue with directory exclusion, however, and I can't quite figure out why (this is probably because I am still quite the sh novice). I'll run through some examples:
Let's say that I want to ignore two things in my project directory: .git/* and .gitignore. Running command foo.zip project_dir builds the following command:
zip -r foo.zip project -x project/.git/\* -x project/.gitignore
Woohoo! Success! Well... not quite.
In this example, .gitignore is not added to the compressed output file, foo.zip. The directory .git/*, however, and all of its subdirectories (and files), are added to the compressed output file.
Manually running the command:
zip -r foo.zip project_dir -x project/.git/\* -x project/.gitignore
Works as expected, of course, so naturally I am pretty puzzled as to why my identical, but dynamically-built command, does not work.
Attempted Resolutions
I have attempted a few different methods of resolving this to no avail:
Removing -x project/.git/\* from the command, and instead adding each subdirectory and file within that directory, such as -x project/.git/config -x project/.git/HEAD, etc (including children of subdirectories)
Removing the backslash before the asterisk, so that the resulting exclusion option within the command is -x project/.git/*
Bashing my head on the keyboard in angst (I'm really surprised this didn't work, it usually does)
Some notes
My utility uses /bin/sh; I would prefer to keep it that way for maximum compatibility.
I am aware of the git archive feature -- my use of .git/* and .gitignore in the above example is simply as an example; my utility is not dependent on git nor is used exclusively for projects which are git repositories.
I suspected the problem was in the evaluation of the generated command, since you said the same command worked correctly when executed directly.
So, as the comment section says, I think you already found the correct solution. This happens because if you expand that variable directly, things like globs can be expanded by the shell instead of being passed to the command, and arguments may get split in unexpected ways, depending on the situation.
Yes, in that case:
eval $COMMAND
is the way to go.
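For illustration, a minimal sketch of that approach with the names from the question; the point is that eval re-parses the string, so the escaped -x patterns reach zip intact:
EXCLUDES='-x project/.git/\* -x project/.gitignore'
COMMAND="zip -r foo.zip project_dir $EXCLUDES"
eval "$COMMAND"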

Git status showing weird untracked "path_of_file\r" files, how to remove by command line

I was working on a C++ program with Google unit test (gtest). I ran and built the projects.
At the end, when I ran git status, it gave some weird untracked files. I do not know where they came from, or how I should remove them. Using bash.
> git status
On branch A
Untracked files:
(use "git add <file>..." to include in what will be committed)
"../path_of_file1\r"
"../path_of_file2\r"
"../path_of_file3\r"
nothing added to commit but untracked files present (use "git add" to track)
This did not work:
rm -f "path_to_file\r"
Thank you.
I believe git clean should work in most scenarios. I tried the rm without the quotes, and it worked! Thank you all.
rm path_to_file\r (completed with tab completion)
You can always remove all untracked (and unignored) files with git clean -f. To be safe, run git clean -n first to see which files will be deleted.
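In other words (a dry run first, then the real thing):
git clean -n   # list the untracked files that would be removed
git clean -f   # actually remove them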
David's answer is a good one, assuming you want to do a full git clean.
Here is another option that lets you delete the files individually: Let your shell complete the file names for you, escaping them as necessary.
For example, if you type
rm path_to_file1
and press Tab, most shells will complete the filename with a proper escape sequence. The precise sequence will be shell-specific, and I'm not clear whether \r is the two characters \ and r or whether it's a single special character, but your shell will know for sure.
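If the stray character really is a literal carriage return, a bash-only alternative is to spell it out with $'...' quoting (a sketch using the paths from the question):
rm ../path_of_file1$'\r'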

Is there a "watch" / "monitor" / "guard" program for Makefile dependencies?

I've recently been spoiled by using nodemon in a terminal window, to run my Node.js program whenever I save a change.
I would like to do something similar with some C++ code I have. My actual project has lots of source files, but if we assume the following example, I would like to run make automatically whenever I save a change to sample.dat, program.c or header.h.
test: program sample.dat
	./program < sample.dat

program: program.c header.h
	gcc program.c -o program
Is there an existing solution which does this?
(Without firing up an IDE. I know lots of IDEs can do a project rebuild when you change files.)
If you are on a platform that supports inotifywait (to my knowledge, only Linux; but since you asked about Make, it seems there's a good chance you're on Linux; for OS X, see this question), you can do something like this:
inotifywait --exclude '.*\.swp|.*\.o|.*~' --event MODIFY -q -m -r . |
while read
do make
done
Breaking that down:
inotifywait
Listen for file system events.
--exclude '.*\.swp|.*\.o|.*~'
Exclude files that end in .swp, .o or ~ (you'll probably want to add to this list).
--event MODIFY
React only to events that modify a file; when one occurs, inotifywait prints the path of the affected file.
-q
Do not print startup messages (so make is not prematurely invoked).
-m
Listen continuously.
-r .
Listen recursively on the current directory. Then it is piped into a simple loop which invokes make for every line read.
Tailor it to your needs. You may find inotifywait --help and the manpage helpful.
Here is a more detailed script. I haven't tested it much, so use with discernment. It is meant to keep the build from happening again and again needlessly, such as when switching branches in Git.
#!/bin/sh
datestampFormat="%Y%m%d%H%M%S"
lastrun=$(date +$datestampFormat)
inotifywait --exclude '.*\.swp|.*\.o|.*~' \
            --event MODIFY \
            --timefmt $datestampFormat \
            --format %T \
            -q -m -r . |
while read modified; do
    if [ $modified -gt $lastrun ]; then
        make
        lastrun=$(date +$datestampFormat)
    fi
done

Is there a build tool based on inotify-like mechanism

In relatively big projects which use plain old make, even building the project when nothing has changed takes a few tens of seconds, especially with many executions of make -C, each of which carries new-process overhead.
The obvious solution to this problem is a build tool based on an inotify-like feature of the OS. It would watch for which files are changed and, based on that list, compile only those files.
Is there such machinery out there? Bonus points for open source projects.
You mean like Tup:
From the home page:
"Tup is a file-based build system - it inputs a list of file changes and a directed acyclic graph (DAG), then processes the DAG to execute the appropriate commands required to update dependent files. The DAG is stored in an SQLite database. By default, the list of file changes is generated by scanning the filesystem. Alternatively, the list can be provided up front by running the included file monitor daemon."
I am just wondering if it is stat()ing the files that takes so long. To check this here is a small systemtap script I wrote to measure the time it takes to stat() files:
# call-counts.stp
global calls, times

probe kernel.function(@1) {
    times[probefunc()] = gettimeofday_ns()
}

probe kernel.function(@1).return {
    now = gettimeofday_ns()
    delta = now - times[probefunc()]
    calls[probefunc()] <<< delta
}
And then use it like this:
$ stap -c "make -rC ~/src/prj -j8 -k" ~/tmp/count-calls.stp sys_newstat
make: Entering directory `/home/user/src/prj'
make: Nothing to be done for `all'.
make: Leaving directory `/home/user/src/prj'
calls["sys_newstat"] #count=8318 #min=684 #max=910667 #sum=26952500 #avg=3240
The project I ran it upon has 4593 source files and it takes ~27msec (26952500nsec above) for make to stat all the files along with the corresponding .d files. I am using non-recursive make though.
If you're using OSX, you can use fswatch
https://github.com/alandipert/fswatch
Here's how to use fswatch to watch for changes to a file and then run make if it detects any:
fswatch -o anyFile | xargs -n1 -I{} make
You can run fswatch from inside a makefile like this:
watch: $(FILE)
	fswatch -o $^ | xargs -n1 -I{} make
(Of course, $(FILE) is defined inside the makefile.)
make can now watch for changes in the file like this:
> make watch
You can watch another file like this:
> make watch FILE=anotherFile
Install inotify-tools and write a few lines of bash to invoke make when certain directories are updated.
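A minimal sketch of that idea (the watched directories and the exclude pattern are assumptions to adapt):
while inotifywait -qq -r -e modify,create,delete --exclude '\.(o|swp)$|~$' src include; do
    make
done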
As a side note, recursive make scales badly and is error prone. Prefer non-recursive make.
The change-dependency you describe is already part of Make, but Make is flexible enough that it can be used in an inefficient way. If the slowness really is caused by the recursion (make -C commands) -- which it probably is -- then you should reduce the recursion. (You could try putting in your own conditional logic to decide whether to execute make -C, but that would be a very inelegant solution.)
Roughly speaking, if your makefiles look like this
# main makefile
foo:
	make -C bar baz
and this
# makefile in bar/
baz: quartz
	do something
you can change them to this:
# main makefile
foo: bar/quartz
	cd bar && do something
There are many details to get right, but now if bar/quartz has not been changed, the foo rule will not run.