Is dvc.yaml supposed to be written by hand or generated by the dvc run command?

While trying to understand DVC, I've noticed that most tutorials mention generating dvc.yaml by running the dvc run command.
At the same time, dvc.yaml, which defines the DAG, is also well documented. The fact that it is YAML and human readable/writable suggests it is meant to be a DSL for specifying your data pipeline.
Can somebody clarify which is the better practice?
Writing dvc.yaml by hand, or letting it be generated by the dvc run command?
Or is it left to the user's choice, with no technical difference?

I'd recommend manual editing as the main route! (I believe that's officially recommended since DVC 2.0)
dvc stage add can still be very helpful for programmatic generation of pipeline files, but it doesn't support all the features of dvc.yaml, for example setting vars values or defining foreach stages.

Both, really.
Primarily, dvc run (or the newer dvc stage add followed by dvc exp run) is meant to manage your dvc.yaml file. For most (including casual) users, this is probably easiest & thus best. The format is guaranteed to be correct (similar to choosing between {git,dvc} config and directly modifying .{git,dvc}/config).
However, as you note, dvc.yaml is human-readable. This is intentional, so that more advanced users can manually edit the YAML (potentially short-circuiting some validation checks, or unlocking advanced functionality such as foreach stages).
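For example, a foreach stage like the one sketched below (the script and path names are made up) can only be expressed by editing dvc.yaml directly; dvc stage add cannot generate it:
stages:
  featurize:
    foreach:
      - us
      - eu
    do:
      cmd: python featurize.py ${item}
      deps:
        - data/${item}.csv
      outs:
        - features/${item}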

Related

Advice for Web-based Remote Build System

I'm interested in setting up a remote build system at work, initially for internal use, potentially for some customers going forward. We need to compile library code on several different machines (PC, Mac) and with multiple compilers, and it can be a real pain trying to get access to a full set. This is not our main build system, which is Jenkins-based and uses an approach that is not easily modified for the purpose envisaged here.
The idea would be that you could post your source to a website with some basic build parameters, it would compile the code and you could then download the generated code. Ideally users could pick which version of the underlying software they compiled their libraries against. I envisage it being supported by a virtual machine.
The reason I'm posting is that I want to avoid rolling my own as much as possible - longer term that has maintenance implications - and would prefer something as pre-existing as possible. Obviously one would expect some adaptation in terms of scripting.
Any suggestions? It would have to be supported on Mac and PC at absolute minimum.
This sounds like something you could do by creating a parameterized Jenkins job (the build params given as input to your web frontend could be passed on to the job, perhaps via the Jenkins API). Personally, I would see if you could skip the step of creating a new web frontend, and have users pass their build params directly to Jenkins.
To support downloading the resulting compiled code, you could have the Jenkins job archive the build as an artifact. Users could then download the files from the result page for that individual build.
As for how to make a Jenkins job accept source code to compile as input, perhaps you could use branches in your CM system? Your users could push their code to a branch, and then pass the branch as a build param. Otherwise, you might be able to use the file parameter feature of Jenkins.
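For example, once the job is parameterized, a user (or a thin web frontend) could trigger it through Jenkins' buildWithParameters endpoint; this is just a sketch, with the job name, parameter names, and credentials as placeholders:
curl -X POST "https://jenkins.example.com/job/remote-build/buildWithParameters" \
  --user alice:API_TOKEN \
  --data BRANCH=feature/my-lib --data PLATFORM=mac --data COMPILER=clang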

GNU Parallel host sticky jobs

I am writing a parallel build farm to build C++ cross-platform applications against various platforms / environments. Every time new code is pushed to a git repo, I build and test the latest code against all the platforms.
I've set up parallel to correctly distribute the jobs among several hosts using the --sshlogin option.
I transfer files, collect output and results. It's all working more than fine and I love the tool.
The build time being sometimes quite long for some platforms, I would like the build to be as incremental as possible.
My only issue is that the build is only incremental if the scheduler sends the jobs to the same machine and reuses the artefacts of the previous build on that specific host.
Say I have 3 hosts: I have 1 chance in 3 for the build to be incremental. If a host hasn't built a platform in a while, it might take a long time.
Is it possible to gain control over the host a specific input source will run on, and only fall back to the other hosts if that host is busy?
Ideally, I would love to see a tag system where I tag input source with a name and tag several hosts with a name, creating pools of jobs and pools of machines specialized into that type of build.
But a very simple implementation where the input sources are distributed in the same order as the order the sshlogins are defined could be a simple & quick fix in my situation.
I tried to find the source code to implement it myself but I only see doc generation when I browse the code on Savannah.
Any ideas?
Thanks,
M
There is currently no support for prioritizing a given argument to a given sshlogin. The source code is at https://savannah.gnu.org/git/?group=parallel
Feel free to join the mailing list and discuss the idea: https://lists.gnu.org/mailman/listinfo/parallel
The only prioritization in the code today is that when a job has failed on an sshlogin, GNU Parallel prefers to retry that job on another sshlogin. Maybe that could be extended?
If a job were marked as having failed -1 times on a given sshlogin, then GNU Parallel ought to prefer to run the job on that sshlogin.
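Until something like that exists, one crude workaround (a sketch only; the host names and build script are made up) is to run one parallel invocation per pool of hosts, so that each platform's jobs always land on the same machines:
parallel --sshlogin mac1.example.com,mac2.example.com ./build.sh {} ::: ios macos &
parallel --sshlogin linux1.example.com ./build.sh {} ::: linux android &
wait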
I've been trying to discuss this idea on the mailing list as you suggested but never had any response in more than 10 days... I guess you must be busy with other things at the moment. So I went ahead and forked the source code to make the necessary changes and make my solution work.
I pushed it there a week ago:
http://michakfromparis.github.io/gnu-parallel-sticky/
the source code is available on github here:
https://github.com/michaKFromParis/gnu-parallel-sticky
It wasn't exactly easy without any guidance, as the source code has a lot of history, so I tried to keep the changes surgical to ease merging with your future releases.
I've been using it in production for more than a week now and it works perfectly in my configuration.
It is also compatible with older formats, so it should be a drop-in replacement for usual parallel uses, with extra features on the side.
I would love to get feedback from other users though, as it might not be completely dry.
Thanks for sharing the original source code.
Best Regards,
M

buildbot vs hudson/jenkins for C++ continuous integration

I'm currently using jenkins/hudson for continuous integration of a large, mostly C++ project. We have separate projects for trunk and every branch. Also, there are some related projects for the Java code, but the setup for those is fairly basic right now (we may do more later though). The C++ projects do the following:
Builds everything with options for whether to reconfigure, do a clean build, or use a fresh checkout
Optionally builds and runs all tests
Optionally runs all tests using Valgrind's memcheck
Runs cppcheck
Generates doxygen documentation
Publishes reports: unit tests, valgrind, cppcheck, compiler warnings, SLOC, open tasks, and code coverage (using gcov, gcovr, and the cobertura plugin)
Deploys code nightly or on demand to a test environment and a package repository
Everything is configurable for automatic builds and optional for on-demand builds. Underneath, there's a bash script that controls much of this, which further depends on our build system, which uses automake and autoconf along with custom bash scripts.
We started using Hudson (at the time) because that's what the Java guys were using and we just wanted nightly builds. Since then, we've added a lot more and continue to add more. In some ways Hudson is great, but certainly isn't ideal.
I've looked at other solutions and the only one that looks like it could be a replacement is buildbot. Would buildbot be better for this situation? Is the investment worth it since we're already using Hudson? Why?
EDIT: Someone asked why I haven't found Hudson/Jenkins to be ideal. The short answer is that everything can be improved. I'm simply wondering if Jenkins is the best current solution for my use case or whether there is something better (buildbot?) that would be easier to maintain in the long run even as new requirements come up.
Both are open source projects, but you do not need to change buildbot code to "extend" it; it is actually quite easy to import your own packages in its configuration, in which you can sub-class most of the features with your own additions. Examples: your own compilation or test code, some parsing of outputs/errors to be given to the next steps, your own formatting of alert emails, etc. There are lots of possibilities.
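As a rough illustration (a sketch only, using the modern buildbot.plugins API; the class, repository, and command names are made up), sub-classing a step in master.cfg can look like this:
from buildbot.plugins import steps, util

class CppCheck(steps.ShellCommand):
    # Custom step: run cppcheck over the sources as part of the build.
    name = "cppcheck"
    command = ["cppcheck", "--enable=all", "src/"]

factory = util.BuildFactory()
factory.addStep(steps.Git(repourl="https://example.com/project.git", mode="incremental"))
factory.addStep(steps.Compile(command=["make", "-j4"]))
factory.addStep(CppCheck())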
Generally I would say that buildbot is the most "general purpose" automatic builds tools. Jenkins however might be the best related to running tests, especially for parsing and presenting results in nice ways (results, details, charts.. some clicks away), things that buildbot does not do "out-of-the-box". I'm actually thinking of using both to have sexier test result pages.. :-)
Also, as a rule of thumb, it should not be difficult to create a new tool's config: if the specification of what to do (configs, builds, tests) is too hard to switch from one tool to another, it is a (bad) sign that not enough configuration scripts are moved to the sources. Buildbot (or Jenkins) should only call simple commands. If it is simple to run tests, then developers will do it as well and this will improve the success rate, whereas if only the continuous integration system runs the tests, you will be running after it to fix the new code failures, and will lose its non-regression value. Just my 0.02€ :-)
Hope it'll help.
The 'result integration' is also in jenkins/hudson, and you can relatively easily capture build products without having to 'copy them elsewhere'.
For our instance, the coverage reports, unit test metrics, and javadoc for the Java code are all integrated. For our C++ code, the plugins are a little lacking, but you can still get most of it.
We have run buildbot since pre-0.7 and are now running 0.8; we are only now seeing any real reason to switch, as buildbot 0.8 neglected Windows slaves for an extended period of time and the support was pretty poor.
There are many other solutions out there, besides Jenkins/Hudson/BuildBot:
TeamCity by Jetbrains
Bamboo by Atlassian
Go by Thoughtworks
Cruise Control
OpenMake Meister
The specifics about what you are doing are not so important, in fact, as long as the agents (aka nodes) that you are doing them on support those tasks.
The beauty of a CI server is noticing when the code changes, triggering a new build (and test), publishing the artifacts, and publishing test results.
When you compare CI tools like those mentioned, consider features like the usability of the interface, how easy branching is (and features it might offer like automatic merging), notifications (like XMPP/Jabber), or an information radiator (like hooking up a monitor to always show status). Product support is another thing to consider - Jenkins' support is only as good as whoever is responding to community questions at the time you have questions.
My personal favorite is Bamboo, but it comes with a license fee.
I'm a long-time Jenkins user in the middle of evaluating Buildbot and would like to offer a few items for folks considering using Buildbot for multi-module solutions:
*) Buildbot doesn't have any out-of-the-box concept of file artifacts related to each build. It's not in the UI and it's not in any of the builtin "steps" modules as far as I can see:
http://docs.buildbot.net/current/manual/configuration/buildsteps.html
...and I see no third party plugin:
https://github.com/buildbot/buildbot/wiki/PluginList#steps
Buildbot does collect all the console output from a given build, but critically, you can't collect files related to it.
*) Given that artifacts are not supported, it's not easy to create "collector" projects that bring multiple modules into say, a single installer. Jenkins has a great feature that lets you parameterize a build with builds from other modules (the parameter type is a run).
*) Establishing dependencies between modules is trickier in Buildbot. Say you have a library that three binaries depend on, and you want those binaries to rebuild each time the library changes. Jenkins has triggers built into the UI. If you want triggers in Buildbot, you have to script them using schedulers.Dependent, and it causes a lot of item congestion in the Schedulers UI (see the sketch after this list).
*) When you're working in Buildbot, it seems that pretty much all of the configuration is done in master.cfg in code. This is awesome and frustrating.
*) Buildbot forces you to create a worker in addition to a master server. This is annoying for beginners and systems for which a single build server is sufficient.
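To make the schedulers.Dependent point concrete, a dependency chain in master.cfg looks roughly like this (a sketch only; the builder and scheduler names are illustrative, and c is the usual BuildmasterConfig dictionary):
from buildbot.plugins import schedulers, util

lib_sched = schedulers.SingleBranchScheduler(
    name="lib",
    change_filter=util.ChangeFilter(branch="master"),
    treeStableTimer=60,
    builderNames=["lib-build"])

# Rebuild the three binaries whenever the library build runs.
bin_sched = schedulers.Dependent(
    name="binaries-after-lib",
    upstream=lib_sched,
    builderNames=["bin-a", "bin-b", "bin-c"])

c['schedulers'] = [lib_sched, bin_sched]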
My impression after two days of Buildbot evaluation is that we'll stick with Jenkins, primarily due to it having artifacts. Buildbot is a tool we'd only use if we had more extensive customization needs, and the time to do it.
On the subject of buildbot and artifacts - I don't have enough user score to make a comment - you can get artifacts from the buildbot 2.x series pretty easily with built-in file/directory upload actions. However, you rarely want to just move files. Typically you make a triggered buildstep that does deployment directly off the worker for best results, e.g. push to cloud storage, containers, third-party services (Steam uploads), etc.
This way you can get metrics on the uploads and conditionally control them better (or even mix and match artifacts across worker machines).
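For reference, the built-in upload steps mentioned above look roughly like this in a master.cfg build factory (a sketch only; the paths and the factory variable are placeholders):
from buildbot.plugins import steps

factory.addStep(steps.FileUpload(
    workersrc="build/output/installer.exe",
    masterdest="artifacts/installer.exe"))
factory.addStep(steps.DirectoryUpload(
    workersrc="build/reports",
    masterdest="artifacts/reports"))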

Best Practices for Code/Web Application Deployment?

I would love to hear ideas on how to best move code from development server to production server.
A list of gotchas and don't-do-this items would be helpful.
Any tools to help automate steps such as:
Make backups of existing code, given a list of files
Record the deployment of these files from dev to production
Allow easier rollback if the deployment or app fails in any way...
I have never worked at a company that had a deployment process, other than a very manual process of FTPing files from dev to production.
What have you done in your companies, departments, etc?
Thank you...
Yes, I am a ColdFusion programmer, but files are files, and this should be a language-agnostic question.
OK, I'll bite. There's the technology aspect of this problem, which other answers have already covered. But the real issue is a process problem, where the focus should be on ensuring a meaningful software development life cycle (SDLC): planning, development, validation, and deployment. I'll cover each in turn. What you want is a repeatable activity at each phase.
Planning
Articulating and recording what's to be delivered. Often tickets or user stories are enough. Sometimes you do more, like a written requirements document that a customer signs off on, which is translated into various artifacts such as written use cases. Ultimately what you want is something recorded in an electronic system where you can associate code changes with it. Which leads me to...
Development
Remember that electronic system? Good. Now when you make changes to code (you're committing to source control, right?) you associate those changes with something in this electronic system - typically tickets. I like Trac, but have also heard good things about Atlassian's suite. This gives you traceability, so you can assert what's been done and how. Then you can use this system and source control to create a build - all the bits needed for whatever's changed - and tag that build in source control - that's your list of what's changed. Even better, have a build contain everything, so that it's a standalone entity that can easily be deployed on its own. The build is then delivered for...
Validation
Perhaps the most important step, and one that many shops ignore - at their own peril. Defects found in production are exponentially more expensive to fix than when they're discovered earlier in the process. And validation is often the only step where this occurs in many shops - so make sure yours does it.
This should not be done by the programmer! That's like the fox watching the hen house. And whoever is doing it should be following some sort of plan. We use TestLink. This means each build is validated the same way, so you can identify regression bugs. And this build should be deployed in the same way as you would deploy into production.
If all goes well (we usually need a minimum of 3 builds) the build is validated. And this goes to...
Deployment
This should be a non-event, because you're taking a validated build and following the same steps as you did in testing. It could be that it first hits a staging server, where there's an automated copying process, but the point is that it shouldn't be an issue at this stage, because you validated with the same process.
Conclusion
In terms of knowing what's where, what you really want is a logical way to group changes together. This is where the idea of a build comes in. It's really the unit that should segue between steps in the SDLC. If you already have that, then the ability to understand the state of a given system becomes trivial.
Check out Ant or Maven - these are build and deployment tools used in the Java world which can help you copy/FTP files, make backups, and even check out code from SVN.
You can automate your deployment steps using these tools; for example, Ant will allow you to declare a set of tasks as part of your deployment. So you could, for example (a rough sketch of such a target follows the list below):
Check out a revision using SVNAnt or similar to a directory
Copy (and perhaps zip first) these files to a backup directory
FTP all the files to your web server(s)
Create a report to email to the team illustrating the deployment
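A build.xml target along those lines might look roughly like this (a sketch only; the repository URL, paths, server, and credentials are made up, it shells out to the svn client instead of SVNAnt for brevity, and the ftp and mail tasks are optional Ant tasks that need extra libraries):
<project name="deploy" default="deploy">
  <target name="deploy">
    <!-- 1. Export a clean copy of the tagged release -->
    <exec executable="svn">
      <arg line="export svn://repo.example.com/project/tags/1.0 build/release"/>
    </exec>
    <!-- 2. Zip the current deployment as a backup -->
    <zip destfile="backups/site-backup.zip" basedir="/var/www/site"/>
    <!-- 3. FTP the new files to the web server -->
    <ftp server="www.example.com" userid="deploy" password="secret" remotedir="/site">
      <fileset dir="build/release"/>
    </ftp>
    <!-- 4. Email the team a short deployment report -->
    <mail from="build@example.com" tolist="team@example.com"
          subject="Deployment complete" message="Deployed tags/1.0 to www.example.com"/>
  </target>
</project>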
Really, you can do almost anything you wish to put time into using Ant. Maven is a little more structured (and newer), and you can see a discussion of the differences here.
Hope that helps!
In a nutshell...
You should start with some source control solution - probably Subversion or Git. Once that's in place you can create a script that generates a clean build of your source code and deploys it to your production server(s).
You could do this with a simple batch script or use something like Ant for more control. Here is a simple example of a batch file using Subversion:
svn copy svn://path/to/your/project/trunk -r HEAD svn://path/to/your/project/tags/%version%
svn checkout svn://path/to/your/project/trunk -r HEAD //path/to/target/directory
Ant makes it easy to do things like automatically run unit tests and sync directories. For example:
<sync todir="//path/to/target/directory" includeEmptyDirs="true" overwrite="true">
  <fileset dir="${basedir}">
    <exclude name="**/*.svn"/>
    <exclude name="**/test/"/>
  </fileset>
</sync>
This is really just a starting point. A next step might be a continuous integration solution like Hudson. I would also recommend reading "Pragmatic Project Automation: How to Build, Deploy, and Monitor Java Applications".
One ColdFusion specific gotcha is to make sure you clear the Application scope when required (to update any singleton components). A common approach here is to use a URL parameter that causes onRequestStart() to call onApplicationStart(). You may also have to clear the trusted cache.
We use a system called AnthillPro: http://www.anthillpro.com
It's commercial software, but it allows us to completely automate our deployment process across multiple servers and operating systems (we currently use it for both ColdFusion and Java, but it can be used for most languages). It has a ton of 3rd-party integrations:
http://www.anthillpro.com/html/products/anthillpro/tool-integrations.html

Configuration Storage

I want to store a lot of configuration data pertaining to cluster, process, IP addresses etc. I have worked on one such product earlier where LDAP was used for this purpose. Although it was a PITA to configure it the first time, I liked the transactional LDAP part, which helps with dynamic reloading of the configuration when there is a change. It can be done with a flat file using inotify, but that is not as good as transactional LDAP. But, as I said, the configuration was a real pain, and I also don't want to borrow the same LDAP idea in this product.
So can anyone suggest the next best replacement - something that makes entering configuration easy, supports dynamic configuration, and can notify my process whenever there is a change in the configuration file and tell it exactly what changed (directly or indirectly)?
I am planning to develop my product in C++ and C.
The configuration can be edited by an admin, or if he is too lazy he can automate it using some script. Also through a CLI, but not by a running process; that would land me in concurrency and locking issues.
My program is a daemon, some sort of cluster manager running on multiple nodes.
There is no wrapper provided for user to edit configuration.
I am only looking for Linux/Solaris platform.
You have not really given enough background information for a good answer to be given. So, here are some of the unasked questions, the answers to which will influence your choice:
How is the configuration file edited? By your process, or by hand-editing, or by some other program?
How is the main program running - in the foreground with a user interacting, or in the background as a daemon?
If you expect people to hand-edit the configuration, then you can provide a wrapper script for doing so which sends a signal (conventionally SIGHUP) to the daemon to tell it to reread its configuration file.
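A minimal sketch of the daemon side of that scheme in C (the reload function and config path are hypothetical; only the signal handling is shown):
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t reload_requested = 0;

static void on_sighup(int signo)
{
    (void)signo;
    reload_requested = 1;  /* only set a flag; do the real work in the main loop */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sighup;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGHUP, &sa, NULL);

    for (;;) {
        if (reload_requested) {
            reload_requested = 0;
            /* reload_config("/etc/mydaemon.conf"); -- hypothetical: re-read and apply */
        }
        sleep(1);  /* placeholder for the daemon's normal work loop */
    }
}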
If your main program is going to guide the user through the editing, then you really don't need to tell the program when the editing is complete. It already knows.
You mention Linux in the tags; can we assume that Windows portability is not an issue?
As for configuration file formats, you can go with the vogue (and bloat) of using XML. However, although that is a good tool for programs communicating, it is not very good for people to edit. You should look at Eric S. Raymond's "The Art of UNIX Programming", which is a good general read and has a chapter on different configuration file formats. You should probably adopt one of the schemes outlined there. Which scheme is best depends in part on what information you have to capture in your configuration file.
If you're going to embed an interpreter (Perl, Lua, Tcl/Tk, ...) into your program, you might use that language to handle the configuration file...or you might not.