I have a C++ program that I'm building with Clang 3.9's profile-guided optimization feature. Here's what's supposed to happen:
I build the program with instrumentation enabled.
I run that program, which writes profile data to a file: prof.raw.
I use llvm-profdata to convert prof.raw to a new file, prof.data.
I create a new build of that same program, with a few changes:
When compiling each .cpp file to a .o file, I use the compiler flag -fprofile-use=prof.data.
When linking the executable, I also specify -fprofile-use.
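For concreteness, the manual workflow looks roughly like this for a single-file program (file names as described above; clang++-3.9 stands in for however your Clang is invoked):

clang++-3.9 -fprofile-instr-generate foo.cpp -o foo-gen      # instrumented build
LLVM_PROFILE_FILE=prof.raw ./foo-gen                         # training run writes prof.raw
llvm-profdata merge -output=prof.data prof.raw               # convert/merge the raw profile
clang++-3.9 -fprofile-use=prof.data foo.cpp -o foo-use       # optimized rebuild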
I have a GNU Makefile for this, and it works great. My problem arises now that I'm trying to port that Makefile to CMake (3.7, but I could upgrade). I need the solution to work with (at least) the Unix Makefiles generator, but ideally it would work for all generators.
In CMake, I've defined two executable targets: foo-gen and foo-use:
When foo-gen is executed, it creates the prof.raw file.
I use add_custom_command to create a rule to create prof.data from prof.raw.
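That rule looks something like this (a minimal sketch; the working directory and the dependency on prof.raw are assumptions):

add_custom_command(
  OUTPUT prof.data
  COMMAND llvm-profdata merge -output=prof.data prof.raw
  DEPENDS prof.raw
  WORKING_DIRECTORY ${CMAKE_BINARY_DIR}
  COMMENT "Converting prof.raw to prof.data")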
My problem is that I can't figure out how to tell CMake that each of the object files depended upon by foo-use has a dependency on the file prof.data.
The most promising idea I had was to (1) find a way to enumerate all of the .o files upon which foo-use depends, and then (2) iterate over each of those .o files, calling add_dependencies for each one.
The problem with this approach is I can't find an idiomatic way, in my CMakeLists.txt file, to enumerate the list of object files upon which an executable depends. This might be an open problem with CMake.
I also considered using set_source_files_properties to set the OBJECT_DEPENDS property on each of my .cpp files used by foo-use, adding prof.data to that property's list.
The problem with this (AFAICT) is that each of my .cpp files is used to create two different .o files: one for foo-gen and one for foo-use. I want the .o files that get linked into foo-use to have this compile-time dependency on prof.data; but the .o files that get linked into foo-gen must not have a compile-time dependency on prof.data.
And AFAIK, set_source_files_properties doesn't let me set the OBJECT_DEPENDS property to have one of two values, contingent on whether foo-gen or foo-use is the current target of interest.
Any suggestions for a clean(ish) way to make this work?
Discussion on author's idea #1
The most promising idea I had was to (1) find a way to enumerate all of the .o files upon which foo-use depends, and then (2) iterate over each of those .o files, calling add_dependencies for each one.
This shouldn't work according to the documentation for add_dependencies, which states:
Makes a top-level <target> depend on other top-level targets to ensure that they build before <target> does.
I.e., you can't use it to make a target depend on files, only on other targets.
Discussion on author's idea #2
I also considered using set_source_files_properties to set the OBJECT_DEPENDS property on each of my .cpp files used by foo-use, adding prof.data to that property's list.
The problem with this (AFAICT) is that each of my .cpp files is used to create two different .o files: one for foo-gen and one for foo-use. I want the .o files that get linked into foo-use to have this compile-time dependency on prof.data; but the .o files that get linked into foo-gen must not have a compile-time dependency on prof.data.
And AFAIK, set_source_files_properties doesn't let me set the OBJECT_DEPENDS property to have one of two values, contingent on whether foo-gen or foo-use is the current target of interest.
In the comments section, you mentioned that you could solve this if OBJECT_DEPENDS supported generator expressions, but it doesn't. As a side note, there is an issue ticket tracking this on the CMake GitLab repo. You can go give it a thumbs-up and describe your use case for their reference.
In the comments section you also mentioned a possible solution to this:
Potential other solution: (a) a double-project system, where the main, user-invoked project forwards its settings to a second PGO project that compiles the same sources again.
You can actually put this into the CMake project via ExternalProject so that it becomes part of the generated buildsystem: Make the top-level project include itself as an external project. The external project can be passed a cache variable to configure it to be the -gen version, and the top-level can be the -use version.
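A rough sketch of that self-inclusion, where PGO_MODE is a hypothetical cache variable the project checks to decide which profiling flags to use:

include(ExternalProject)
if(NOT PGO_MODE)
  # Top level: build the -use version, and nest ourselves as the -gen version.
  ExternalProject_Add(foo_pgo_training
    SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}
    CMAKE_ARGS -DPGO_MODE=GEN
    INSTALL_COMMAND "")
endif()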
Speaking from experience, this is a whole other rabbit hole of long CMake-documentation-reading and fiddling sessions if you have never done anything with ExternalProject before, so that answer might belong in a new question dedicated to it.
This can solve the problem of not having generator expressions in OBJECT_DEPENDS, but if you want the top-level project to be multi-config, with some of the configurations not built for PGO, then you will be back to square one.
Proposed Solution
Here's what I've found works to make sources re-compile when profile data changes:
To the custom command which runs the training executable and produces and re-formats the training data, add another COMMAND which produces a C++ header file containing a timestamp in a comment.
Include that header in all sources which you want to re-compile if the training has been re-run.
If you want to support non-PGO builds, wrap the timestamp header in a header which checks that it exists with __has_include and only includes it if it exists.
I'm pretty sure that with this approach CMake doesn't do the dependency checking of translation units on the profile data; instead, it's the generated buildsystem's header-dependency tracking which does that work. The rationale for including a timestamp comment in the header file, instead of just "touch"ing it to change its timestamp in the filesystem, is that the generated buildsystem might detect changes by file contents instead of by the filesystem timestamp.
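A minimal sketch of both pieces, assuming hypothetical file names pgo_stamp.h, pgo_stamp_guard.h, and stamp.cmake (the training and merge commands are the ones from the question):

# In CMakeLists.txt: regenerate the profile data and the stamp header together.
add_custom_command(
  OUTPUT prof.data pgo_stamp.h
  COMMAND foo-gen
  COMMAND llvm-profdata merge -output=prof.data prof.raw
  COMMAND ${CMAKE_COMMAND} -DSTAMP_FILE=pgo_stamp.h -P ${CMAKE_SOURCE_DIR}/stamp.cmake
  DEPENDS foo-gen)

# In stamp.cmake: write a header whose contents change on every training run.
string(TIMESTAMP ts UTC)
file(WRITE "${STAMP_FILE}" "// PGO training data regenerated: ${ts}\n")

The wrapper header for supporting non-PGO builds would then look like:

// pgo_stamp_guard.h: include this from every source that should re-compile
// after training. Harmless in builds where pgo_stamp.h was never generated.
#if defined(__has_include)
#  if __has_include("pgo_stamp.h")
#    include "pgo_stamp.h"
#  endif
#endif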
All the shortcomings of the proposed solution
The painfully obvious weakness of this approach is that you need to add a header include to all the .cpp files that you want to be re-compiled. Several problems can spawn from this (from least to most egregious):
You might not like it from an aesthetics standpoint.
It certainly opens up a hole for human error: forgetting to include the header in new .cpp files. I don't know how to solve that.
You might not be able to change some of the sources that you need to re-compile, such as sources from third-party static libraries that your library depends on. There may be workarounds if you're using ExternalProject by doing something with the patch step, but I don't know.
Unfortunately I don't know how to solve any of those problems. For my personal project, #1 and #2 are acceptable, and #3 happens to not be an issue.
Related
I'm working on a library where a header is provided for user convenience. When this header is changed, the post-build step that copies the header files is not run, because the build itself is not run: the target per se did not change. Is it possible to make it rebuild the target anyway when any of the headers in the explicit dependency list changes? Or at least, to trigger the post-build steps somehow?
A general solution would be to add a source file to the library which will do nothing but #include the convenience header file.
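That dummy source can literally be a single include (file names hypothetical):

// convenience_header_anchor.cpp: exists only so the build system sees a
// translation unit that depends on the convenience header.
#include "convenience/header.h"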
To solve this within CMake, you can specify the convenience header file as an additional object dependency for one of the library's sources:
set_property(SOURCE some/library_source.cpp APPEND PROPERTY OBJECT_DEPENDS /full/path/to/convenience/header.h)
Super-simple, totally boring setup: I have a directory full of .hpp and .cpp files. Some of these .cpp files need to be built into executables; naturally, these .cpp files #include some of the .hpp files in the same directory, which may then include others, etc. etc. Most of those .hpp files have corresponding .cpp files, which is to say: if some_application.cpp #includes foo.hpp, either directly or transitively, then chances are there's also a foo.cpp file that needs to be compiled and linked into the some_application executable.
Super-simple, but I'm still clueless about what the "best" way to build it is, either in SCons or CMake (neither of which I have any expertise in yet, other than staring at documentation for the last day or so and becoming sad). I fear that the sort of solution I want may actually be impossible (or at least grossly overcomplicated) to pull off in most build systems, but if so, it'd be nice to know that so I can just give up and be less picky. Naturally, I'm hoping I'm wrong, which wouldn't be surprising given how ignorant I am about build systems (in general, and about CMake and SCons in particular).
CMake and SCons can, of course, both automatically detect that some_application.cpp needs to be recompiled whenever any of the header files it depends on (either directly or transitively) changes, since they can "parse" C++ files well enough to pick out those dependencies. OK, great: we don't have to list each .cpp-#includes-.hpp dependency by hand. But: we still need to decide what subset of object files need to get sent to the linker when it's time to actually generate each executable.
As I understand it, the two most straightforward alternatives to dealing with that part of the problem are:
A. Explicitly and laboriously enumerating the "anything using this object file needs to use these other object files too" dependencies by hand, even though those dependencies are exactly mirrored by the corresponding-.cpp-transitively-includes-the-corresponding-.hpp dependencies that the build system already went to the trouble of figuring out for us. Why? Because computers.
B. Dumping all the object files in this directory into a single "library", and then having all executables depend on and link in that one library. This is much simpler, and what I understand most people would do, but it's also kinda sloppy. Most of the executables don't actually need everything in that library, and wouldn't actually need to be rebuilt if only the contents of one or two .cpp files changed. Isn't this setting up exactly the kind of unnecessary computation a supposed "build system" should be avoiding? (I suppose maybe they wouldn't need to be rebuilt if the library were dynamically linked, but suffice it to say I dislike dynamically linked libraries for other reasons.)
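For reference, option (B) is only a few lines in CMake (target and file names hypothetical):

add_library(everything STATIC foo.cpp bar.cpp baz.cpp)
add_executable(some_application some_application.cpp)
target_link_libraries(some_application everything)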
Can either CMake or SCons do better than this in any remotely straightforward fashion? I see a bunch of limited ways to twiddle the automatically generated dependency graph, but no general-purpose way to do so interactively ("OK, build system, what do you think the dependencies are? Ah. Well, based on that, add the following dependencies and think again: ..."). I'm not too surprised about that. I haven't yet found a special-purpose mechanism in either build system for dealing with the super-common case where link-time dependencies should mirror corresponding compile-time #include dependencies, though. Did I miss something in my (admittedly somewhat cursory) reading of the documentation, or does everyone just go with option (B) and quietly hate themselves and/or their build systems?
Your statement in point A), "anything using this object file needs to use these other object files too", is something that will indeed need to be done by hand. Compilers don't automatically find object files needed by a binary; you have to explicitly list them at link time. If I understand your question correctly, you don't want to have to explicitly list the objects needed by a binary, but want the build tool to find them automatically. I doubt there is any build tool that does this: SCons and CMake definitely don't.
If you have an application some_application.cpp that includes foo.hpp (or other headers used by these cpp files), and subsequently needs to link the foo.cpp object, then in SCons, you will need to do something like this:
env = Environment()
env.Program(target = 'some_application',
source = ['some_application.cpp', 'foo.cpp'])
This will re-link only when 'some_application.cpp', 'foo.hpp', or 'foo.cpp' have changed. Assuming g++, this will effectively translate to something like the following, independently of SCons or CMake.
g++ -c foo.cpp -o foo.o
g++ some_application.cpp foo.o -o some_application
You mention you have "a directory full of .hpp and .cpp files"; I would suggest you organize those files into libraries. Not all in one library, but logically organized into smaller, cohesive libraries. Then your applications/binaries would link only the libraries they need, thus minimizing recompilations due to unused objects.
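In SCons, that organization might look like this (library and file names are placeholders):

env = Environment()
# one small, cohesive library per logical group of sources
env.StaticLibrary('geometry', ['point.cpp', 'line.cpp'])
env.StaticLibrary('io', ['reader.cpp', 'writer.cpp'])
# each program links only the libraries it actually uses
env.Program('some_application', ['some_application.cpp'],
            LIBS=['geometry'], LIBPATH=['.'])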
I had more or less the same problem as you have and I solved it as follows:
import SCons.Scanner
import os

def header_to_source(header_file):
    """Specify the location of the source file corresponding to a given
    header file."""
    return header_file.replace('include/', 'src/').replace('.hpp', '.cpp')

def source_files(main_file, env):
    """Returns the list of source files the given main_file depends on. With
    the function header_to_source one must specify where to look for
    the source file corresponding to a given header. The resulting
    list is filtered for existing files. The resulting list contains
    main_file as its first element."""
    ## get the implicit (header) dependencies
    node = File(main_file)
    scanner = SCons.Scanner.C.CScanner()
    path = SCons.Scanner.FindPathDirs("CPPPATH")(env)
    deps = node.get_implicit_deps(env, scanner, path)
    ## collect the corresponding source files
    root_path = env.Dir('#').get_abspath()
    res = [main_file]
    for dep in deps:
        source_path = header_to_source(
            os.path.relpath(dep.get_abspath(), root_path))
        if os.path.exists(os.path.join(root_path, source_path)):
            res.append(source_path)
    return res
The header_to_source function is the one you need to modify so that it returns the source file corresponding to a given header file. Then the function source_files gives you all the source files you need to build the given main_file (including main_file as the first element). Files that don't exist are automatically filtered out. So the following should be sufficient to define the target for an executable:
env.Program(source_files('main.cpp', env))
I am not sure whether this works in all possible setups, but at least for me it works.
I'm starting to write a data-processing library and am quite confused about the proper structure of the project and its libraries.
Say I'd like to have a set of functions stored in a myfunclib library. My current setup (taken from multiple recommendations online) looks like this:
myproj/include/myfunclib.h - class declaration
myproj/include/myfunclib.cpp - class functionality
myproj/src/functest.cpp - test file to check functions
Firstly, it feels like this is a proper setup if I use myfunclib only for the myproj project, but say I want to reuse it; then I'd need to specify its path in each of the cpp files using it, or store multiple copies of it.
Secondly, compilation is a bit bulky in such a case:
g++ -I include include/myfunclib.cpp src/functest.cpp
Is it a normal practice to type all that stuff every time? What if I have many custom libraries I need? Is there a way to store them all separately, simply include as 'myfunclib.h' and not worry about recompiling etc?
Use a makefile to handle all of your dependencies and building your code. Google the syntax; it's pretty simple. Then you can just say "make" on the command line and it will build everything for you.
Here's a good tutorial:
http://mrbook.org/tutorials/make/
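For the layout in the question, a minimal makefile might look something like this (flags are assumptions, and note that recipe lines must be indented with tabs):

CXX      = g++
CXXFLAGS = -Iinclude

functest: src/functest.o include/myfunclib.o
	$(CXX) $^ -o $@

# both objects depend on the class declaration
src/functest.o include/myfunclib.o: include/myfunclib.h

%.o: %.cpp
	$(CXX) $(CXXFLAGS) -c $< -o $@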
Some things that bit me originally:
Remember that templated classes should only be included: what would normally be the source implementation should not be compiled into object files like a regular class implementation, so generally I put the whole template implementation within the include directory.
I keep include and source files separate. By source files I mean code (definitions) that needs to be compiled into object files for linking; includes are all the declarations, inline functions, etc. It just seems to make more sense to me.
Sometimes I'll have a header file that includes all relevant headers for a specific module, and in turn perhaps a header file higher up that includes all the main headers for the modules I am using.
Also, as said in the comments, you need to introduce yourself to some build tools and get comfortable with them. These will help you track dependencies within your project and, in most cases, avoid rebuilding an entire project when only a subset of dependencies has changed. (This can be a pain to get right in the beginning but is worthwhile learning. If you use make and g++, there is a way to get this working with g++ -MM; see the sketch below. I'm not sure how well it works for all cases.) I know that the way I organized my projects changed drastically the more I learnt about the build process, and the more complex my projects became (and the more flaws I had to fix).
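The g++ -MM idea usually ends up looking something like this in a makefile (a sketch; -MMD writes the .d dependency file as a side effect of compilation, and recipe lines must be tab-indented):

SRCS := $(wildcard src/*.cpp)
OBJS := $(SRCS:.cpp=.o)

%.o: %.cpp
	g++ -MMD -MP -c $< -o $@

# pull in the auto-generated header dependencies, if present
-include $(OBJS:.o=.d)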
This is how I generally keep my project directory structure when starting:
build - where all the built files will be stored
app - the main apps (can also be split into include/src)
include - include files
src - source files (compiled into objects and then linked with the main compiled app)
lib - any libs (usually 3rd-party libs; if any of my src is compiled into a library, it usually ends up in build/lib/target/...)
Hope some of this helps.
In my workplace we have a big C++ code base, and I think there's a problem with how header files are used.
There are many Visual Studio projects, but the problem is conceptual and not related to VS. Each project is a module performing a particular piece of functionality. Each project/module is compiled to a library or a binary. Each project has a directory containing all its source files - *.cpp and *.h. Some header files are the API of the module (I mean the subset of header files declaring the API of the created library), some are internal to it.
Now to the problem - when module A needs to work with module B, A adds B's source directory to its include search path. Therefore all of B's internal headers are visible to A at compilation time.
As a side effect, developers are not forced to think about the exact API of each module, which I consider a bad habit anyway.
I'm considering how it should have been done in the first place. I thought about creating, in each project, a dedicated directory containing interface header files only. A client module wishing to use the module would be permitted to include the interface directory only.
Is this approach OK? How is the problem solved in your place?
UPD: At my previous place, development was done on Linux with g++/gmake, and we indeed used to install API header files to a common directory, as some of the answers propose. Now we have a Windows (Visual Studio)/Linux (g++) project using CMake to generate project files. How do I force the pre-build install of API header files in Visual Studio?
It sounds like you're on the right track. Many third-party libraries do this same sort of thing. For example:
3rdParty/myLib/src/ -contains the headers and source files needed to compile the library
3rdParty/myLib/include/myLib/ - contains the headers needed for external applications to include
Some people/projects just put the headers to be included by external apps in 3rdParty/myLib/include, but adding the additional myLib directory can help to avoid name collisions.
Assuming you're using the structure 3rdParty/myLib/include/myLib/:
In the Makefile of the external app:
---------------
INCLUDE  = -I$(3RD_PARTY_PATH)/myLib/include
INCLUDE += -I$(3RD_PARTY_PATH)/myLib2/include
...
...
In the sources/headers of the external app:
#include "myLib/base.h"
#include "myLib/object.h"
#include "myLib2/base.h"
Wouldn't it be more intuitive to put the interface headers in the root of the project, and make a subfolder (call it 'internal' or 'helper' or something like that) for the non-API headers?
Where I work, we have a delivery folder structure created at build time. Header files that define libraries are copied out to an include folder. We use custom build scripts that let the developer denote which header files should be exported.
Our build is then rooted at a subst'ed drive; this allows us to use absolute paths for include directories.
We also have a network based reference build that allows us to use a mapped drive for include and lib references.
UPDATE: Our reference build is a network share on our build server. We use a reference build script that sets up the build environment and maps (using net use) the named share on the build server (i.e. \\BLD_SRV\REFERENCE_BUILD_SHARE). Then during a weekly build (or manually) we set the share (using net share) to point to the new build.
Our projects then use a list of absolute paths for include and lib references.
For example:
subst'ed local build drive: j:\
mapped drive to reference build: p:\
path to headers: root:\build\headers
path to libs: root:\build\release\lib
include path in project settings: j:\build\headers; p:\build\headers
lib path in project settings: j:\build\release\lib; p:\build\release\lib
This will pick up your local changes first; then, if you have not made any local changes (or at least haven't built them), it will use the headers and libs from your last build on the build server.
I've seen problems like this addressed by having a set of headers in module B that get copied over to the release directory along with the lib as part of the build process. Module A then only sees those headers and never has access to the internals of B. Usually I've only seen this in a large project that was released publicly.
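Since the question's update mentions CMake, one way to sketch that copy step is a post-build command on module B (paths and the target name are hypothetical):

add_custom_command(TARGET B POST_BUILD
  COMMAND ${CMAKE_COMMAND} -E copy_directory
          ${CMAKE_CURRENT_SOURCE_DIR}/api
          ${CMAKE_BINARY_DIR}/release/include/B)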
For internal projects, it just doesn't happen. What usually happens is that when they are small, it doesn't matter. And when they grow up, it's so messy to separate things out that no one wants to do it.
Typically I just see an include directory that all the interface headers get piled into. It certainly makes it easy to include headers. People still have to think about which modules they're taking dependencies on when they specify the modules for the linker.
That said, I kinda like your approach better. You could even avoid adding these directories to the include path, so that people can tell what modules a source file depends on just by the relative paths in the #includes at the top.
Depending on how your project is laid out, this can be problematic when including them from headers, though, since the relative path to a header is resolved from the .cpp file, not from the .h file; so the .h file doesn't necessarily know where its .cpp files are.
If your projects have a flat hierarchy, however, this will work. Say you have
base\foo\foo.cpp
base\bar\bar.cpp
base\baz\baz.cpp
base\baz\inc\baz.h
Now any header file can include
#include "..\baz\inc\baz.h
and it will work since all the cpp files are one level deeper than base.
In a group I had been working in, everything public was kept in a module-specific folder, while private stuff (private headers, cpp files, etc.) was kept in an _imp folder within it:
base\foo\foo.h
base\foo\_imp\foo_private.h
base\foo\_imp\foo.cpp
This way you could just poke around within your project's folder structure and get the header you want. You could grep for #include directives containing _imp and have a good look at them. You could also grab the whole folder, copy it somewhere, and delete all _imp subfolders, knowing you'd have everything ready to release an API.
Within a project, headers were usually included as
#include "foo/foo.h"
However, if the project had to use some API, the API headers would be copied/installed by the API's build to wherever they were supposed to go on that platform, and then be included as system headers:
#include <foo/foo.h>
If you have a C++ project with several source files and you hit compile, which file does the compiler start with?
I am asking because I am having some #include-dependency issues in a library.
Compiler would be: VC2003.
It should not be order-dependent. The only relevant steps are:
Each compilation unit includes what it depends on, and should be compilable individually. This means, first, that each .cpp file includes all the headers it depends on; and second, that each header should in turn include what it needs, so that it can compile even if it is the first one to be compiled.
A link step puts all the compiled object code together and builds the final binary.
It should not matter which file it starts with; the linker resolves external references after all the files have been compiled.
Irrelevant. Post the exact issue. The compilation order is non-deterministic and arbitrary, and must have no effect on the compilability of your project.
This depends on the environment. In general a "compiler" only works on a single source file at a time; you use higher-level tools to direct it and compute the proper build order.
Examples of such tools can be make, ant, CMake, SCons, Eclipse, and Visual Studio. A basic check is generally the modification date of the source code files, coupled with built-in and custom rules that define how various output files depend on the inputs.
The order the compiler compiles in shouldn't make a difference, as others have noted.
From the compiler's point of view, when you compile a file with a #include, the included file is inserted into the file being compiled at the point where the #include is, recursing as necessary.
Others have already said that the order shouldn't make a difference.
What you might not have realized is that the compiler compiles every .cpp or .cc file. It does not compile header files. And typically, you only #include header files, and never .cpp files, so the order does not matter. Every .cpp file is processed in isolation. It includes a number of headers, but these are never compiled separately, and it does not typically include other .cpp files either.
The only "include-dependency" problem I can think of is a recursive inclusion. For which the fix normally is guarding it with #ifdef
#ifndef INCLUDED_THEFILENAME_H
#define INCLUDED_THEFILENAME_H
/* content goes here */
#endif
But you better elaborate on the issue you're having.
As others have pointed out, conceptually it is not important which file it starts with. However, it can be useful to start with the most recently edited file (assuming more than one file has been edited) or with the file with the most dependencies. Some environments, such as Code::Blocks, actually allow you to give weightings to source files to give you some control over the compilation order.
A make tool builds a directed acyclic graph of the dependencies specified in the make file. This will normally say the executable depends on a number of object files. The object files depends on source files, each source file depends on headers, and so on.
This produces essentially a multi-way tree. The tree will have the executable as its root, and typically have mostly headers as the leaf nodes (though if you're using some sort of code generator, it could also have the input file for that code generator as a leaf).
It then walks that tree working its way from the leaf nodes to the root node and building as it goes. The answers that have said "it doesn't matter" are basically pointing out that it can pick any branch of that tree to build first. It does matter, however, that when it picks a branch, it builds in the order specified for that branch.
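As a toy example, consider this makefile (file names hypothetical); make may build foo.o and bar.o in either order, but always before linking app, and always after their own prerequisites:

app: foo.o bar.o
	g++ foo.o bar.o -o app

foo.o: foo.cpp common.h
	g++ -c foo.cpp

bar.o: bar.cpp common.h
	g++ -c bar.cpp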