Finding bugs in Subversion's mixed-revision working copies - C++

The project I work on has recently been switched from a horribly antiquated revision control system to Subversion. I felt like I had a fairly good understanding of Subversion a few years ago, but once I learned about Mercurial, I forgot about Subversion quickly.
My question is targeted at those who work with a sizable number (15+) of developers on the same branch in Subversion.
Let's say you check out rev N of the repository. You make some changes and then commit. Meanwhile, other developers have made changes to other files. You hear about another developer's changes to subsystem X and decide you need them immediately. You don't want to update the entire working copy because it would pull in all sorts of stuff and then you would have to do a lengthy compile (C++ project). The risk I see is that the developer updates only subsystem X, not realizing that the new code depends on a recent change in subsystem Y. The code compiles, but crashes at runtime.
How do you deal with this?
Does the developer report what they think might be a bug (even though it's not a bug)?
Do you require developers to update their entire working copy before reporting a bug? Wouldn't this deter bug reports?
Do you prevent this situation from occurring through some mechanism I haven't thought of?

Since you committed all your work in progress, you have no reason not to update your copy with the entire latest revision. The lengthy compile is part of the price of a large project. The compile time is almost always less than the time spent trying to determine whether you have a bug, or whether there's some obscure incompatibility because you didn't check everything out.
That project distributed compiles across all the workstations in the group. Since we had about 15 computers for the task, what would normally be a roughly 6-hour build took about 25 minutes.

The responsibility for keeping track of dependencies lies with the developer who introduces them.
In this case, the developer of subsystem X should make sure the change works with the current versions of the other subsystems, or at least document which versions it works with.
The ways I have seen that help developers deal with this are:
Include dependency checking in the build system. Many open source projects do this in a configure script.
Configure the version control system to handle this for you. I don't know how to do this in Subversion.
This is obviously not a cure for long compile times, but it helps to avoid unnecessary rebuilds.
A complex dependency graph is also an indication of questionable design. It may be a good idea to refactor the code to reduce coupling between subsystems.


Faster build times in C++ [duplicate]

I once worked on a C++ project that took about an hour and a half for a full rebuild. Small edit, build, test cycles took about 5 to 10 minutes. It was an unproductive nightmare.
What are the worst build times you have ever had to handle?
What strategies have you used to improve build times on large projects?
Update:
How much do you think the language used is to blame for the problem? I think C++ is prone to massive dependencies on large projects, which often means even simple changes to the source code can result in a massive rebuild. Which language do you think copes with large project dependency issues best?
1. Forward declaration
2. Pimpl idiom
3. Precompiled headers
4. Parallel compilation (e.g. MPCL add-in for Visual Studio).
5. Distributed compilation (e.g. Incredibuild for Visual Studio).
6. Incremental build
7. Split the build into several "projects" so you don't compile all the code if it's not needed.
[Later Edit]
8. Buy faster machines.
My strategy is pretty simple - I don't do large projects. The whole thrust of modern computing is away from the giant and monolithic and towards the small and componentised. So when I work on projects, I break things up into libraries and other components that can be built and tested independently, and which have minimal dependencies on each other. A "full build" in this kind of environment never actually takes place, so there is no problem.
One trick that sometimes helps is to include everything into one .cpp file. Since headers are re-parsed for every translation unit that includes them, combining everything means they are only processed once, which can save you a lot of time. (The downside is that this makes it impossible for the compiler to parallelize compilation of those files.)
You should be able to specify that multiple .cpp files should be compiled in parallel (-j with make on Linux, /MP on MSVC - MSVC also has an option to compile multiple projects in parallel. These are separate options, and there's no reason why you shouldn't use both.)
In the same vein, distributed builds (Incredibuild, for example), may help take the load off a single system.
SSD disks are supposed to be a big win, although I haven't tested this myself (but a C++ build touches a huge number of files, which can quickly become a bottleneck).
Precompiled headers can help too, when used with care. (They can also hurt you, if they have to be recompiled too often).
And finally, trying to minimize dependencies in the code itself is important. Use the pImpl idiom, use forward declarations, keep the code as modular as possible. In some cases, use of templates may help you decouple classes and minimize dependencies. (In other cases, templates can slow down compilation significantly, of course)
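To make the pImpl/forward-declaration point concrete, here is a minimal sketch (the Widget class and the heavy header are hypothetical, not from any particular project):
// widget.h - the header clients include; no heavy includes, Impl is only forward-declared
#include <memory>
class Widget {
public:
    Widget();
    ~Widget();                     // defined in the .cpp, where Impl is complete
    void draw();
private:
    struct Impl;                   // implementation details hidden behind the pointer
    std::unique_ptr<Impl> pimpl_;
};

// widget.cpp - the only file that pays for the expensive header
#include "widget.h"
#include "heavy_rendering_library.h"            // hypothetical expensive include
struct Widget::Impl { RenderState state; };     // RenderState comes from the heavy header
Widget::Widget() : pimpl_(new Impl) {}
Widget::~Widget() = default;
void Widget::draw() { /* work with pimpl_->state */ }
Editing widget.cpp or the heavy header now recompiles only that one translation unit; files that merely include widget.h are untouched.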
But yes, you're right, this is very much a language thing. I don't know of another language which suffers from the problem to this extent. Most languages have a module system that allows them to eliminate header files, which are a huge factor. C has header files, but is such a simple language that compile times are still manageable. C++ gets the worst of both worlds: a big, complex language and a terribly primitive build mechanism that requires a huge amount of code to be parsed again and again.
Multi-core compilation. Very fast with 8 cores compiling on the i7.
Incremental linking
External constants
Removed inline methods on C++ classes.
The last two gave us a reduced linking time from around 12 minutes to 1-2 minutes. Note that this is only needed if things have huge visibility, i.e. are seen "everywhere", and if there are many different constants and classes.
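As a rough sketch of the "external constants" idea (the names are made up): declare the constant extern in the widely included header and define it in exactly one .cpp, so changing its value no longer recompiles every client.
// constants.h - included "everywhere"; only a declaration lives here
extern const int kMaxConnections;

// constants.cpp - the single definition; editing the value recompiles only this file
#include "constants.h"
const int kMaxConnections = 64;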
Cheers
IncrediBuild
Unity Builds
Incredibuild
Pointer to implementation
forward declarations
compiling "finished" sections of the proejct into dll's
ccache & distcc (for C/C++ projects) -
ccache caches compiled output, using the pre-processed file as the 'key' for finding the output. This is great because pre-processing is pretty quick, and quite often the changes that force a recompile don't actually change the pre-processed source for many files. It also really speeds up a full re-compile. Another nice feature is that you can have a shared cache among team members, which means that only the first person to grab the latest code actually compiles anything.
distcc does distributed compilation across a network of machines. This is only good if you HAVE a network of machines to use for compilation. It goes well with ccache, and only moves the pre-processed source around, so the only thing you have to worry about on the compiler engine systems is that they have the right compiler (no need for headers or your entire source tree to be visible).
The best suggestion is to build makefiles that actually understand dependencies and do not automatically rebuild the world for a small change. But, if a full rebuild takes 90 minutes, and a small rebuild takes 5-10 minutes, odds are good that your build system already does that.
Can the build be done in parallel? Either with multiple cores, or with multiple servers?
Check in pre-compiled bits for pieces that really are static and do not need to be rebuilt every time. Third-party tools/libraries that are used but not altered are good candidates for this treatment.
Limit the build to a single 'stream' if applicable. The 'full product' might include things like a debug version, or both 32 and 64 bit versions, or may include help files or man pages that are derived/built every time. Removing components that are not necessary for development can dramatically reduce the build time.
Does the build also package the product? Is that really required for development and testing? Does the build incorporate some basic sanity tests that can be skipped?
Finally, you can re-factor the code base to be more modular and to have fewer dependencies. Large Scale C++ Software Design is an excellent reference for learning to decouple large software products into something that is easier to maintain and faster to build.
EDIT: Building on a local filesystem as opposed to an NFS-mounted filesystem can also dramatically speed up build times.
Fiddle with the compiler optimisation flags.
Use the -j4 option with gmake for parallel compilation (multicore or single core).
If you are using clearmake, use winking.
Take out the debug flags, in extreme cases.
Use some powerful servers.
The book Large-Scale C++ Software Design has very good advice I've used in past projects.
Minimize your public API
Minimize inline functions in your API. (Unfortunately this also increases linker requirements).
Maximize forward declarations.
Reduce coupling between code. For instance, pass two integers to a function for coordinates instead of your custom Point class that has its own header file (see the sketch after this list).
Use Incredibuild. But it has some issues sometimes.
Do NOT put code that gets exported from two different modules in the SAME header file.
Use the pImpl idiom. Mentioned before, but it bears repeating.
Use pre-compiled headers.
Avoid C++/CLI (i.e. managed C++). Linker times are impacted too.
Avoid using a global header file that includes 'everything else' in your API.
Don't put a dependency on a lib file if your code doesn't really need it.
Know the difference between including files with quotes and angle brackets.
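As a tiny sketch of the coupling point in the list above (Point, point.h and the function name are all hypothetical):
// Coupled: every caller must pull in point.h just to see this declaration.
#include "point.h"
double distance_from_origin(const Point& p);

// Decoupled: callers need no extra header; only built-in types appear in the interface.
double distance_from_origin(double x, double y);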
Powerful compilation machines and parallel compilers. We also make sure the full build is needed as little as possible. We don't alter the code to make it compile faster.
Efficiency and correctness are more important than compilation speed.
In Visual Studio, you can set the number of projects to build in parallel. The default value is 2; increasing it can reduce the build time somewhat.
This will help if you don't want to mess with the code.
This is the list of things we did for development under Linux:
As Warrior noted, use parallel builds (make -jN)
We use distributed builds (currently icecream, which is very easy to set up); with this we can have tens of processors at a given time. This also has the advantage of giving the builds to the most powerful and least loaded machines.
We use ccache, so when you do a make clean you don't have to actually recompile sources that didn't change; the output is copied from a cache.
Note also that debug builds are usually faster to compile since the compiler doesn't have to make optimisations.
We tried creating proxy classes once.
These are really a simplified version of a class that only includes the public interface, reducing the number of internal dependencies that need to be exposed in the header file. However, they came with a heavy price of spreading each class over several files that all needed to be updated when changes to the class interface were made.
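In spirit, such a proxy looked roughly like this (all names hypothetical): a thin header that exposes only the public operations and forwards to the real class, so client code never includes the real, heavy header.
// engine_proxy.h - clients include only this; the real Engine is just forward-declared
class Engine;
class EngineProxy {
public:
    explicit EngineProxy(Engine& e) : engine_(e) {}
    void start();                  // forwards to Engine::start()
    void stop();                   // forwards to Engine::stop()
private:
    Engine& engine_;
};

// engine_proxy.cpp - the only file that includes the heavy engine.h
#include "engine_proxy.h"
#include "engine.h"
void EngineProxy::start() { engine_.start(); }
void EngineProxy::stop()  { engine_.stop();  }
The price mentioned above is visible here: every change to Engine's public interface means touching the proxy header, the proxy source, and the real class.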
In general large C++ projects that I've worked on that had slow build times were pretty messy, with lots of interdependencies scattered through the code (the same include files used in most cpps, fat interfaces instead of slim ones). In those cases, the slow build time was just a symptom of the larger problem, and a minor symptom at that. Refactoring to make clearer interfaces and break code out into libraries improved the architecture, as well as the build time. When you make a library, it forces you to think about what is an interface and what isn't, which will actually (in my experience) end up improving the code base. If there's no technical reason to have to divide the code, some programmers through the course of maintenance will just throw anything into any header file.
Cătălin Pitiș covered a lot of good things. Other ones we do:
Have a tool that generates reduced Visual Studio .sln files for people working in a specific sub-area of a very large overall project
Cache DLLs and pdbs from when they are built on CI for distribution on developer machines
For CI, make sure that the link machine in particular has lots of memory and high-end drives
Store some expensive-to-regenerate files in source control, even though they could be created as part of the build
Replace Visual Studio's checking of what needs to be relinked by our own script tailored to our circumstances
It's a pet peeve of mine, so even though you already accepted an excellent answer, I'll chime in:
In C++, it's less the language as such than the language-mandated build model (which was great back in the seventies) and the header-heavy libraries.
The only thing that is wrong about Cătălin Pitiș' reply: "buy faster machines" should go first. It is the easiest way with the least impact.
My worst was about 80 minutes on an aging build machine running VC6 on W2K Professional. The same project (with tons of new code) now takes under 6 minutes on a machine with 4 hyperthreaded cores, 8 GB RAM, Win 7 x64, and decent disks. (A similar machine with about 10-20% less processor power, 4 GB RAM, and Vista x86 takes twice as long.)
Strangely, incremental builds are most of the time slower than full rebuilds now.
Full build is about 2 hours. I try to avoid making modifications to the base classes, and since my work is mainly on the implementation of these base classes, I only need to build small components (a couple of minutes).
Create some unit test projects to test individual libraries, so that if you need to edit low-level classes that would cause a huge rebuild, you can use TDD to know your new code works before you rebuild the entire app. The John Lakos book mentioned by Themis has some very practical advice for restructuring your libraries to make this possible.
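A minimal sketch of such a test, assuming a hypothetical low-level RingBuffer class and plain asserts rather than any particular test framework; the point is that it compiles and links against just the small library, so the edit-build-test loop stays short:
// test_ring_buffer.cpp - built against only the low-level library being edited
#include <cassert>
#include "ring_buffer.h"           // hypothetical header under edit

int main() {
    RingBuffer<int> buf(4);        // hypothetical fixed-capacity buffer
    buf.push(1);
    buf.push(2);
    assert(buf.size() == 2);
    assert(buf.pop() == 1);
    return 0;                      // green here before kicking off the full application rebuild
}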

Recompilation upon Comment Modification

I have a header which a lot of files depend on. I changed a comment and this caused a full recompilation. I heard that whether code requires recompilation depends on a comparison between the source file's timestamp and the time of the last compilation, but is there a way I can freely modify comments and keep VS2008 from recompiling everything?
There is no way to freely modify comments and keep Visual Studio (any version) from recompiling "because only comments have changed". Tracking what has actually changed is the job of the version control system (e.g. git or SVN).
Your question seems to arise from working on a solution that takes a long time to build (incrementally or fully), and there are effective ways of improving that situation.
This Visual C++ PCH howto helped me reduce our build times significantly. Also we apply all 3 points explained in this article and on top of all that, also use IncrediBuild (commercial product). Each of these steps helped us keep C++ build times in check.
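For reference, the basic shape of a Visual C++ precompiled header (the file names below are the usual convention, not taken from the linked howto): one header pulls in the stable, expensive includes, one source file is compiled with /Yc to create the PCH, and every other .cpp is compiled with /Yu and includes that header first.
// stdafx.h - only stable, expensive includes; changing anything here rebuilds the PCH
#include <windows.h>
#include <vector>
#include <string>

// stdafx.cpp - compiled with /Yc"stdafx.h" to create the precompiled header;
// all other .cpp files are compiled with /Yu"stdafx.h" and must include "stdafx.h" first
#include "stdafx.h"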

When is a full build required and when is a partial build sufficient?

Hi, I am trying to find out when a full build is required and when a partial build is sufficient.
There are many articles, but I am not able to find specific answers.
Below are my thoughts.
A full build is required when there is:
1. A change in the build of dependent modules - a change in build options or the use of optimization techniques.
2. A change in the object layout (see the sketch after this list):
- Any change in a header file, such as adding or deleting methods in a class.
- A change in object size from adding or removing variables or virtual functions.
- Data alignment changes using pragma pack.
3. Any change in global variables.
A partial build is sufficient when there is:
1. Any change in the logic, as long as it does not alter the specified interface.
2. A change in a stack variable.
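To make the object-layout cases above concrete, here is a small sketch (the types are hypothetical) of header changes that alter layout and therefore require rebuilding every translation unit that includes the header:
// packet.h - shared by many translation units
#pragma pack(push, 1)              // changing the packing value changes sizeof(Packet)
struct Packet {
    char type;
    int  length;                   // adding or removing a field changes the layout
};
#pragma pack(pop)

struct Handler {
    virtual void handle() {}       // adding the first virtual function introduces a vtable
                                   // pointer and changes the object size
};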
In an ideal world a full build should never be necessary, because the build tools would automatically detect whether any of their dependencies have changed.
But this is true only in the ideal world. In practice, build tools are written by humans, and humans:
make mistakes, so the tools may not take every possible change into account,
are lazy, so the tools may not take any change into account.
For you this means you have to have some experience with your build tools. A well-written makefile may take everything into account, and then you rarely have to do a full build. But in the 21st century a makefile is not really state of the art any more, and makefiles become complex very quickly. Today's development environments do a fairly good job of finding dependencies, but for larger projects you may have dependencies which are hard to fit into the concepts of your development environment, and you will end up writing a script.
So there is no single answer to your question. In practice it is good to do a full rebuild for every release, and this rebuild should be done by pressing just one button. Do a partial build for daily work, since nobody wants to wait 2 hours to see whether the code compiles or not. But even in daily work a full rebuild is sometimes necessary, because the linker/compiler/(your tool of choice here) has not recognized even the simplest change.

Edit and Continue on GDB

I know that E&C is a controversial subject and some say that it encourages a wrong approach to debugging, but still - I think we can agree that there are numerous cases when it is clearly useful - experimenting with different values of some constants, redesigning GUI parameters on-the-fly to find a good look... You name it.
My question is: Are we ever going to have E&C on GDB? I understand that it is a platform-specific feature and needs some serious cooperation with the compiler, the debugger and the OS (MSVC has this one easy as the compiler and debugger always come in one package), but... It still should be doable. I've even heard something about Apple having it implemented in their version of GCC [citation needed]. And I'd say it is indeed feasible.
Knowing all the hype about MSVC's E&C (my experience says it's the first thing MSVC users mention when asked "why not switch to Eclipse and gcc/gdb"), I'm seriously surprised that after quite some years GCC/GDB still doesn't have such a feature. Are there any good reasons for that? Is someone working on it as we speak?
It is a surprisingly non-trivial amount of work, encompassing many design decisions and feature tradeoffs. Consider: you are debugging. The debugee is suspended. Its image in memory contains the object code of the source, and the binary layout of objects, the heap, the stacks. The debugger is inspecting its memory image. It has loaded debug information about the symbols, types, address mappings, pc (ip) to source correspondences. It displays the call stack, data values.
Now you want to allow a particular set of possible edits to the code and/or data, without stopping the debuggee and restarting. The simplest might be to change one line of code to another. Perhaps you recompile that file, or just that function, or just that line. Now you have to patch the debuggee image to execute that new line of code the next time you step over it or otherwise run through it. How does that work under the hood? What happens if the code is larger than the line of code it replaced? How does it interact with compiler optimizations? Perhaps you can only do this on a target specially compiled for EnC debugging. Perhaps you will constrain the sites where it is legal to EnC. Consider: what happens if you edit a line of code in a function suspended down in the call stack? When the code returns there, does it run the original version of the function or the version with your line changed? If the original version, where does that source come from?
Can you add or remove locals? What does that do to the call stack of suspended frames? Of the current function?
Can you change function signatures? Add fields to / remove fields from objects? What about existing instances? What about pending destructors or finalizers? Etc.
There are many, many functionality details to attend to to make any kind of usable EnC work. Then there are many cross-tools integration issues necessary to provide the infrastructure to power EnC. In particular, it helps to have some kind of repository of debug information that can make available the before- and after-edit debug information and object code to the debugger. For C++, the incrementally updatable debug information in PDBs helps. Incremental linking may help too.
Looking from the MS ecosystem over into the GCC ecosystem, it is easy to imagine that the complexity and integration issues across GDB/GCC/binutils, the myriad of targets, the EnC-specific target abstractions that would be needed, and the "nice to have but inessential" nature of EnC are why it has not appeared yet in GDB/GCC.
Happy hacking!
(p.s. It is instructive and inspiring to look at what the Smalltalk-80 interactive programming environment could do. In St80 there was no concept of "restart" -- the image and its object memory were always live; if you edited any aspect of a class, everything just kept running. In such environments object versioning was not a hypothetical.)
I'm not familiar with MSVC's E&C, but GDB has some of the things you've mentioned:
http://sourceware.org/gdb/current/onlinedocs/gdb/Altering.html#Altering
17. Altering Execution
Once you think you have found an error in your program, you might want to find out for certain whether correcting the apparent error would lead to correct results in the rest of the run. You can find the answer by experiment, using the gdb features for altering execution of the program.
For example, you can store new values into variables or memory locations, give your program a signal, restart it at a different address, or even return prematurely from a function.
Assignment: Assignment to variables
Jumping: Continuing at a different address
Signaling: Giving your program a signal
Returning: Returning from a function
Calling: Calling your program's functions
Patching: Patching your program
Compiling and Injecting Code: Compiling and injecting code in GDB
This is a pretty good reference to the old Apple implementation of "fix and continue". It also references other working implementations.
http://sources.redhat.com/ml/gdb/2003-06/msg00500.html
Here is a snippet:
Fix and continue is a feature implemented by many other debuggers,
which we added to our gdb for this release. Sun Workshop, SGI ProDev
WorkShop, Microsoft's Visual Studio, HP's wdb, and Sun's Hotspot Java
VM all provide this feature in one way or another. I based our
implementation on the HP wdb Fix and Continue feature, which they
added a few years back. Although my final implementation follows the
general outlines of the approach they took, there is almost no shared
code between them. Some of this is because of the architectual
differences (both the processor and the ABI), but even more of it is
due to implementation design differences.
Note that this capability may have been removed in a later version of their toolchain.
UPDATE: Dec-21-2012
There is a GDB Roadmap PDF presentation that includes a slide describing "Fix and Continue" among other bullet points. The presentation is dated July-9-2012 so maybe there is hope to have this added at some point. The presentation was part of the GNU Tools Cauldron 2012.
Also, I get it that adding E&C to GDB or anywhere in Linux land is a tough chore with all the different components.
But I don't see E&C as controversial. I remember using it in VB5 and VB6, and it was probably there before that. It's also been in Office VBA since way back. And it's been in Visual Studio since VS2005; VS2003 was the only one that didn't have it, and I remember devs howling about it. They intended to add it back anyway, and they did with VS2005, and it's been there since. It works with C#, VB, and also C and C++. It's been in MS core tools for 20+ years, almost continuously (counting VB when it was standalone), subtracting VS2003. But you could still say they had it in Office VBA during the VS2003 period ;)
And JetBrains recently added it to their C# tool Rider. They bragged about it (rightly so, imo) on their Rider blog.

Files on XP: Is turning off "last access time" safe?

I'm desperately looking for cheap ways to lower the build times on my home PC. I just read an article about disabling the Last Access Time attribute of a file on Windows XP, so that simple reads don't write anything back to disk.
It's really simple too. At a DOS prompt, write:
fsutil behavior set disablelastaccess 1
Has anyone ever tried it in the context of building C++ projects? Any drawbacks?
[Edit] More on the topic here.
From SetFileTime's documentation:
"NTFS delays updates to the last access time for a file by up to one hour after the last access."
There's no real point in turning this off - the original article is wrong; the data is not written out on every access.
EDIT:
As to why the author of that article claimed a 10x speed-up, I think he attributed his speed-up to the wrong thing: he also disabled 8.3 filename generation. To generate an 8.3 filename for a file, NTFS has to basically generate each possibility in turn and then see if it's already in use (no reference; I'm sure Raymond has talked about it but I can't find a link). If your files all share the same first six characters, you will be bitten by this problem, and the corollary is that you should put characters which differentiate files in the first six characters so they don't clash. Turning off short name generation will prevent this.
I haven't tried this on a Windows box (I will be tonight, thanks) but the similar thing on Linux (noatime option when mounting the drive) sped things up considerably.
I can't think of any uses where the last access time would be useful other than for auditing purposes and, even then, does Windows store the user that accessed it? I know Linux doesn't.
I'd suggest you try it and see if it makes a difference.
However I'm pessimistic about this actually making any difference, since in the larger/clean builds you'll be writing out large amounts of data anyway, so adjusting the file access times wouldn't take that much time (plus it'd probably be cached anyway).
I'd love to be proven wrong though.
Results:
Ran a few builds on the code base at work in both debug and release configurations with the last access time enabled, and disabled.
Our source code is about 39 MB (48 MB on disk), and we build about half of that for the configuration I built for these tests. The debug build generated 1.76 GB of temporary and output files, while the release build generated about 600 MB of such data. We build on the command line using a combination of Ant and the Visual Studio command line build tools.
My machine is a Core 2 Duo 3 GHz with 4 GB of RAM and a 7200 RPM HDD, running Windows XP 32-bit.
Building with the last access time disabled:
Debug times = 6:17, 5:41
Release times = 6:07, 6:06
Building with the last access time enabled:
Debug times = 6:00, 5:47
Release times = 6:19, 5:48
Overall I did not notice any difference between the two modes, as in both cases the files are most likely in the system cache already so it should just be reading from memory.
I believe that you'll get the biggest bang for your buck by just implementing proper precompiled headers (not the automatically generated ones that Visual Studio creates in a project). We implemented this a few years ago at work (when the code base was far smaller) and it cut down our build time to a third of what it was.
It's a good alternative, but it will affect some tools, like the Remote Storage Service and other utilities that depend on file access statistics to optimize your file system (e.g. Norton Defrag).
It will improve performance a little. Other than that it won't do much more (you won't be able to see when a file was last accessed, of course). I have it turned off by default when I install Windows XP using nLite, to cut out the bloat I don't need.
I don't want to draw attention away from the "last access time" question, but there might be other ways to speed up your builds. Not knowing the context and your project setup, it's hard to say what might be slow, but there might be some things that might help:
Create "uber" builds. That is, create a single compilation uber.cpp file that contains a bunch of lines like
#include "file1.cpp"
#include "file2.cpp"
You might have trouble with conflicting static variable names, but those are generally easy to sort out. Initial setup is kind of a pain, but build times can drop dramatically. For us, the biggest drawback is that in Developer Studio you can't right-click a file and say 'compile' if that file is part of an uber build. It's not a big deal though. We have separate build configurations for 'uber' builds which compile the uber files but exclude the individual cpp files from the build process. If you need more info, leave a comment and I can get you that. Also, the optimizer tends to do a slightly better job with uber builds.
Also, do you have a large number of include files, or a lot of dependencies between include files? If so, that will drastically slow down build times.
Are you using precompiled headers? If not, you might look into that as a solution as that will help as well.
Slow build times are usually tracked down to lots of file I/O. That is by far the biggest time sink in a build -- just opening, reading and parsing all of the files. If you cut down file I/O, you will improve build times.
Anyway, sorry to derail the topic slightly, but the suggestion at hand to change how the last access time of a file is set seemed to be somewhat of a 'sledgehammer' solution.
For busy servers, disabling last access time is generally a good idea. The only potential downside is if there are scripts that use last access time to, for instance, tell that a file is no longer being written.
That said, if you're looking to improve build times on a C++ project, I highly recommend reading Recursive Make Considered Harmful. The article is about a decade old, but the points it makes about how recursive definitions in our build scripts cause long build times are still well worth understanding.
Disabling access time is useful when using SSDs (solid state drives - cards, USB drives, etc.) as it reduces the number of writes to the drive. All solid state storage devices have a lifetime which is measured by the number of writes that can be made to each individual address. Some media specify a minimum of hundreds of thousands, and some even a million. Operating systems and other executables can access many files in a single operation, in addition to user document accesses. This would apply to Eee PCs, embedded systems, and others.
To Mike Dimmick:
Try connecting a USB drive with many files and copying them to your internal drive. That is also a case where this applies, in addition to program compilation (which is described in the original post).