I've noticed that when I use a Boost feature the app size tends to increase by about 0.1–0.3 MB. This may not seem like much, but compared to other external libraries I use it is (for me at least). Why is this?
Boost uses templates everywhere. These templates can be instantiated multiple times with the same parameters. A sufficiently smart linker will throw out all but one copy. However, not all linkers are sufficiently smart. Also, templates are instantiated implicitly sometimes and it's hard to even know how many times one has been instantiated.
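A sketch of what that looks like in practice (the header and function names here are made up for illustration):

// size_of.h -- a hypothetical header-only template
#include <cstddef>

template <typename T>
std::size_t size_of() { return sizeof(T); }

// a.cpp and b.cpp each contain:
//     #include "size_of.h"
//     std::size_t n = size_of<int>();
// Both object files then carry their own copy of size_of<int>().
// A sufficiently smart linker folds the copies into one; a less smart
// one ships both.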
"so much" is a comparative term, and I'm afraid you're comparing apples to oranges. Just because other libraries are smaller doesn't imply you should assume Boost is as small.
Look at the sheer amount of work Boost does for you!
I doubt a custom library with the same functionality would be considerably smaller. The only valid comparison to make is "Boost's library that does X" versus "Another library that does X", not "Boost's library that does X" versus "Another library that does Y."
The file system library is very powerful, and this means lots of functions, and lots of backbone code to provide you and me with a simple interface. Also, as others mentioned, templates in general can increase code size, but it's not as if that's avoidable. Templates or hand-coded, either one will result in code of roughly the same size. The only difference is that templates are much easier.
It all depends on how it is used. Since Boost is a bunch of templates, it causes a bunch of member functions to be compiled per type used. If you use boost with n types, the member functions are defined (by C++ templates) n times, one for each type.
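As a rough sketch (boost::shared_ptr is just a convenient example):

#include <boost/shared_ptr.hpp>
#include <string>

// Each distinct type parameter gets its own compiler-generated set of
// member functions:
boost::shared_ptr<int>         a(new int(42));           // instantiates shared_ptr<int>
boost::shared_ptr<std::string> b(new std::string("x"));  // instantiates shared_ptr<std::string>
// Use the template with n different types and roughly n copies of its
// member functions end up in your object code.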
Boost consists primarily of very generalized and sometimes quite complex templates, which means types and functions are created by the compiler as required per usage, and not simply by declaration. In other words, a small amount of source code can produce a significant quantity of object code to fulfill all variations of templates declared or used. Boost also depends on standard libraries, pulling in those dependencies as well. However, the most significant contribution is the fact that Boost source code lives almost entirely in include files. Including standard C include files (outside of the STL) typically pulls in very little source code, mostly prototypes, small macros or type declarations without their implementations. Boost keeps most of its implementations in its include files.
Related
In C++ the standard library is wrapped in the std namespace and the programmer is not supposed to define anything inside that namespace. Of course the standard include files don't step on each other's names inside the standard library (so it's never a problem to include a standard header).
Then why isn't the whole standard library included by default instead of forcing programmers to write for example #include <vector> each time? This would also speed up compilation as the compilers could start with a pre-built symbol table for all the standard headers.
Pre-including everything would also solve some portability problems: for example, when you include <map> it's defined which symbols are brought into the std namespace, but it's not guaranteed that other standard symbols are not also loaded into it, and (in theory) you could end up with std::vector becoming available too.
It happens sometimes that a programmer forgets to include a standard header but the program compiles anyway because of an include dependency of the specific implementation. When moving the program to another environment (or just another version of the same compiler), the same source code could however stop compiling.
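A minimal sketch of that situation (whether it compiles depends entirely on what your implementation's <iostream> happens to drag in):

#include <iostream>   // note: <string> is deliberately NOT included

// On some implementations <iostream> transitively includes <string>,
// so this compiles; on others it fails with an error such as
// "no type named 'string' in namespace 'std'".
std::string greeting = "hello";

int main() { std::cout << greeting << '\n'; }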
From a technical point of view I can imagine a compiler just preloading (with mmap) an optimal perfect-hash symbol table for the standard library.
This should be faster to do than loading and doing a C++ parse of even a single standard include file and should be able to provide faster lookup for std:: names. This data would also be read-only (thus probably allowing a more compact representation and also shareable between multiple instances of the compiler).
These are, however, just "should"s, as I never implemented this.
The only downside I see is that we C++ programmers would lose compilation coffee breaks and Stack Overflow visits :-)
EDIT
Just to clarify: the main advantage I see is for programmers who today, despite the C++ standard library being a single monolithic namespace, are required to know which sub-part (include file) contains which function/class. To add insult to injury, when they make a mistake and forget an include file, the code may still compile or not depending on the implementation (thus leading to non-portable programs).
The short answer is that this is not the way the C++ language is supposed to be used.
There are good reasons for that:
namespace pollution - even if this can be mitigated, because the std namespace is supposed to be self-coherent and programmers are not forced to write using namespace std;, including the whole library together with using namespace std; would certainly lead to a big mess...
forcing programmers to declare the modules they want to use helps avoid inadvertently calling the wrong standard function, because the standard library is now huge and not all programmers know all of its modules
history: C++ still has a strong inheritance from C, where namespaces do not exist and where the standard library is supposed to be used like any other library.
Going in the direction you suggest, the Windows API is an example where you have only one big include (windows.h) that loads many other smaller include files. And in fact, precompiled headers allow that to be fast enough.
So IMHO a new language deriving from C++ could decide to automatically declare the whole standard library. A new major release could also do it, but it could break code that makes intensive use of the using namespace directive or that has custom implementations using the same names as some standard modules.
But all common languages that I know of (C#, Python, Java, Ruby) require the programmer to declare the parts of the standard library they want to use, so I suppose that systematically making every piece of the standard library available is still more awkward than useful for the programmer, at least until someone finds a way to declare the parts that should not be loaded - that's why I spoke of a new derivative of C++.
Most of the C++ standard library is template-based, which means that the code it generates depends ultimately on how you use it. In other words, there is very little that could be compiled before you instantiate a template, as in std::vector<MyType> m_collection;.
Also, C++ is probably the slowest language to compile, and there is a lot of parsing work that compilers have to do when you #include a header file that itself includes other headers.
Well, first things first: C++ tries to adhere to "you only pay for what you use".
The standard-library is sometimes not part of what you use at all, or even of what you could use if you wanted.
Also, you can replace it if there's a reason to do so: See libstdc++ and libc++.
That means just including it all without question isn't actually such a bright idea.
Anyway, the committee are slowly plugging away at creating a module system (it takes lots of time, hopefully it will work for C++1z: C++ Modules - why were they removed from C++0x? Will they be back later on?), and when that's done most downsides to including more of the standard-library than strictly necessary should disappear, and the individual modules should more cleanly exclude symbols they need not contain.
Also, as those modules are pre-parsed, they should give the compilation-speed improvement you want.
You offer two advantages of your scheme:
Compile-time performance. But nothing in the standard prevents an implementation from doing what you suggest[*] with a very slight modification: that the pre-compiled table is only mapped in when the translation unit includes at least one standard header. From the POV of the standard, it's unnecessary to impose potential implementation burden over a QoI issue.
Convenience to programmers: under your scheme we wouldn't have to specify which headers we need. We do this in order to support C++ implementations that have chosen not to implement your idea of making the standard headers monolithic (which currently is all of them), and so from the POV of the C++ standard it's a matter of "supporting existing practice and implementation freedom at a cost to programmers considered acceptable". Which is kind of the slogan of C++, isn't it?
Since no C++ implementation (that I know of) actually does this, my suspicion is that in point of fact it does not grant the performance improvement you think it does. Microsoft provides precompiled headers (via stdafx.h) for exactly this performance reason, and yet it still doesn't give you an option for "all the standard libraries"; instead it requires you to say what you want included. It would be dead easy for this or any other implementation to provide an implementation-specific header defined to have the same effect as including all standard headers. This suggests to me that, at least in Microsoft's opinion, there would be no great overall benefit to providing that.
If implementations were to start providing monolithic standard libraries with a demonstrable compile-time performance improvement, then we'd discuss whether or not it's a good idea for the C++ standard to continue permitting implementations that don't. As things stand, it has to.
[*] Except perhaps for the fact that <cassert> is defined to have different behaviour according to the definition of NDEBUG at the point it's included. But I think implementations could just preprocess the user's code as normal, and then map in one of two different tables according to whether it's defined.
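For reference, that behaviour looks like this; the expansion of assert is fixed by whether NDEBUG is defined at the point <cassert> is included:

#define NDEBUG          // defined before inclusion...
#include <cassert>

int main() {
    assert(1 == 2);     // ...so this expands to nothing and never fires
    return 0;
}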
I think the answer comes down to C++'s philosophy of not making you pay for what you don't use. It also gives you more flexibility: you aren't forced to use parts of the standard library if you don't need them. And then there's the fact that some platforms might not support things like throwing exceptions or dynamically allocating memory (like the processors used in the Arduino, for example). And there's one other thing you said that is incorrect: as long as your class is not itself a template, you are allowed to add a specialization of std::swap for it to the std namespace.
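A minimal sketch of that exception (Widget is a made-up type; before C++20, a full specialization of std::swap that depends on a user-defined type is permitted):

#include <utility>

struct Widget { int* data; };

namespace std {
    // Allowed: a full specialization of std::swap for a user-defined,
    // non-template class.
    template <>
    void swap<Widget>(Widget& a, Widget& b) noexcept {
        int* tmp = a.data;
        a.data = b.data;
        b.data = tmp;
    }
}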
First of all, I am afraid that having a prelude is a bit late to the game. Or rather, seeing as preludes are not easily extensible, we have to content ourselves with a very thin one (built-in types...).
As an example, let's say that I have a C++03 program:
#include <boost/unordered_map.hpp>
using namespace std;
using boost::unordered_map;
static unordered_map<int, string> const Symbols = ...;
It all works fine, but suddenly when I migrate to C++11:
error: ambiguous symbol "unordered_map", do you mean:
- std::unordered_map
- boost::unordered_map
Congratulations, you have invented the least backward compatible scheme for growing the standard library (just kidding, whoever uses using namespace std; is to blame...).
Alright, let's not pre-include them, but still bundle the perfect hash table anyway. The performance gain would be worth it, right?
Well, I seriously doubt it. First of all because the Standard Library is tiny compared to most other header files that you include (hint: compare it to Boost). Therefore the performance gain would be... smallish.
Oh, not all programs are big; but the small ones compile fast already (by virtue of being small) and the big ones include much more code than the Standard Library headers so you won't get much mileage out of it.
Note: and yes, I did benchmark the file look-up in a project with "only" a hundred -I directives; the conclusion was that pre-computing the "include path" to "file location" map and feeding it to gcc resulted in a 30% speed-up (after using ccache already). Generating it and keeping it up-to-date was complicated, so we never used it...
But could we at least include a provision that the compiler could do it in the Standard?
As far as I know, it is already included. I cannot remember if there is a specific blurb about it, but the Standard Library is really part of the "implementation" so resolving #include <vector> to an internal hash-map would fall under the as-if rule anyway.
But they could do it, still!
And lose any flexibility. For example Clang can use either libstdc++ or libc++ on Linux, and I believe it to be compatible with the Dinkumware derivative that ships with VC++ (or if not completely, at least greatly).
This is another point of customization: if the Standard Library does not fit your needs, or your platforms, by virtue of being mostly treated like any other library you can replace part or most of it with relative ease.
But! But!
#include "stdafx.h"
If you work on Windows, you will recognize it. This is called a pre-compiled header. It must be included first (or all benefits are lost) and in exchange instead of parsing files you are pulling in an efficient binary representation of those parsed files (ie, a serialized AST version, possibly with some type resolution already performed) which saves off maybe 30% to 50% of the work. Yep, this is close to your proposal; this is Computer Science for you, there's always someone else who thought about it first...
Clang and gcc have a similar mechanism; though from what I've heard it can be so painful to use that people prefer the more transparent ccache in practice.
And all of these will come to naught with modules.
This is the true solution to this pre-processing/parsing/type-resolving madness. Because modules are truly isolated (ie, unlike headers, not subject to inclusion order), an efficient binary representation (like pre-compiled headers) can be pre-computed for each and every module you depend on.
This not only means the Standard Library, but all libraries.
Your solution, more flexible, and dressed to the nines!
One could use an alternative implementation of the C++ Standard Library to the one shipped with the compiler. Or wrap headers with one's definitions, to add, enable or disable features (see GNU wrapper headers). Plain text headers and the C inclusion model are a more powerful and flexible mechanism than a binary black box.
We are using a third-party library in our Visual Studio 2013 C++ project. This library, I guess, uses a different STL implementation than ours. We are getting some link errors. There are no compile-time errors. As I understand it, the STL is primarily template-based code, thus requiring no linking. It looks like there are some parts of the STL that do require linking with an STL library. I am wondering if there is a way to force the STL to be completely inline. Regards.
There is no single answer to this question. In general, C++ is well known to not be "link-compatible" (or "binary compatible") between different versions of compilers (or different brands of compilers, etc, etc). So someone supplying a library SHOULD specify exactly which compiler they expect to be used, and the user of the library will need to use that one. [I've even seen problems using exactly the same version, but two different "builds" of the compiler]
To POSSIBLY find a solution, you'd have to look at what it is the compiler/linker is attempting to find, and then see if that function is at all present in the sources for the STL - if it is, then possibly applying a liberal sprinkling (on the relevant functions) of "always_inline" or whatever that particular compiler uses for that functionality. But chances are that the functions you're missing aren't provided in the header files in the first place. This of course assumes that you actually have the source for the library, so that it can be recompiled [or you can convince the provider to recompile with new settings].
And you'll potentially still have problems with things that are implementation dependent. (The problem with "same version, different build" I mentioned above is that someone who wrote the STL implementation decided to change a constructor parameter from "unsigned int" to "size_t" at some point, which [could change or] changes the size of the data passed to the constructor -> not behaving the same when you call the function [but the shared library loader detects it and refuses to even load the executable/shared library combination].)
[As Lightness Races in Orbit says, the above applies to the "Standard Template Library", which is things like std::vector, std::map, std::random and many other things, but it does not include ALL of the runtime functionality that is required to write any non-trivial C++ program]
STL is primarily template-based code
The "STL" as the term is commonly used (which in itself is inaccurate) relates only to a very small subset of the C++ standard library, and plenty of the C++ standard library is not templates. And that still doesn't take into consideration any of the C++ runtime that your implementation has to link into your project.
So, assuming you mean either the C++ standard library or (more likely) the whole Visual Studio runtime … no, you can't inline it all.
You should probably rebuild the third-party library against your own toolchain.
Will modules make template compilation faster? Templates (usually) have to be header only, and end up residing in the translation unit of the #includer.
Related: Do precompiled headers make template compilation faster?
According to the modules proposal, in the very paper you cited, it's the first of the three primary goals for adding modules:
1 Introduction
Modules are a mechanism to package libraries and encapsulate their implementations. They differ from the traditional approach of translation units and header files primarily in that all entities are defined in just one place (even classes, templates, etc.). This paper proposes a module mechanism (somewhat similar to that of Modula-2) with three primary goals:
Significantly improve build times of large projects
Enable a better separation between interface and implementation
Provide a viable transition path for existing libraries
While these are the driving goals, the proposal also resolves a number of other longstanding practical C++ issues (initialization ordering, run-time performance, etc.).
So, how can they accomplish those goals? Well, from section 4.1:
Since header files are typically included in many other files, the growth in build cycles is generally superlinear with respect to the total amount of source code. If the issue is not addressed, it is likely to become worse as the use of templates increases and more powerful declarative facilities (like concepts, contract programming, etc.) are added to the language.

Modules address this issue by replacing the textual inclusion mechanism (whose processing time is roughly proportional to the amount of code included) by a precompiled module attachment mechanism (whose processing time—when properly implemented—is roughly proportional to the number of imported declarations). The property that client translation units need not be recompiled when private module definitions change can be retained.
In other words, at the very least, the time taken to parse these templates is only done once instead of N times, which is already a huge improvement.
Later sections describe improvements for things like explicit instantiation. The one thing this doesn't directly improve is automatic template instantiation, as section 5.8 acknowledges. Here all that can be guaranteed is exactly the same benefit you already get from precompiled headers: "Both modules Set and Reset must instantiate Lib::S and in fact both expose this instantiation in their interface file."

But the proposal then gives some possible technical solutions to the ODR problems, at least some of which also solve the multiple-instantiation problem and may not be possible in today's world. For example, the kind of queried instantiation suggested has been tried repeatedly and it's generally considered too hard to get right with today's model, but modules might make it feasible. There's no proof that it's impossible to get right today, just experience that it's hard, and there's no proof that it would be easier with modules, just the plausible possibility that it might be.
And that fits with a general implication that's never quite stated in the proposal, but is there in the background: Making compilation simpler means we might get new optimizations in the process (directly, because it's easier to reason about what's happening, or indirectly, because more people work on the problem once it's not such a huge pain).
In summary, modules can and will definitely make template compilation faster, if for no other reason than that the template definitions themselves only have to be parsed once. They may allow for other benefits that are either impossible or more difficult to do without modules, but that may not be guaranteeable.
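The design that eventually shipped in C++20 differs in detail from the proposal discussed here, but as a rough sketch of the mechanism (file names are illustrative):

// math.cppm -- module interface unit; its templates are parsed once
export module math;

export template <typename T>
T square(T x) { return x * x; }

// main.cpp -- importing attaches a precompiled representation instead of
// textually re-including and re-parsing the template definition:
//     import math;
//     int main() { return square(3) == 9 ? 0 : 1; }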
I don't know about modules, but I do know that gcc even now provides precompiled headers, as do many other compilers. A precompiled header can contain a very efficient machine-readable version of a template description, so when that is available upon inclusion of a header, many compiling steps can be skipped which would normally be required for a source-text-only uncompiled header.
The modules paper talks about precompiled interface files, so I assume that current precompiled headers and new precompiled interface files will provide comparable performance. Creating such a file from a plain text portable module description will probably be more efficient, as it can save time due to restrictions of the language syntax. And it will be more standardized, so more headers will get the benefit of precompilation. Current projects seldom precompile more than one header, and cross-project precompiled headers are even rarer in my experience.
Do precompiled headers make template compilation faster?
No; it makes templates not compile. Which is the entire point of both PCHs and modules: to stop compiling everything.
The idea is to turn "load C++ text and compile" into "load C++ symbols." Modules are a generalized form of PCHs.
Now, you still have the cost of instantiating templates (unless they were instantiated within a PCH/module). But the cost of compiling the C++ template code is removed.
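Even without modules or PCHs, C++11's extern template lets you pay the implicit-instantiation cost once per program instead of once per translation unit; a minimal sketch (file names are illustrative):

// widget.h
#include <vector>
struct Widget { int id; };
// Tell every includer not to instantiate this specialization itself:
extern template class std::vector<Widget>;

// widget.cpp -- the one translation unit that actually instantiates it:
//     #include "widget.h"
//     template class std::vector<Widget>;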
Why are all libraries in Boost not header-only?
To put it differently, what makes the use of .lib/.dll mandatory?
Is it when a class can't be a template or has static fields?
Different points, I guess.
Binary size. Could header-only put a size burden on the client?
Compilation times. Could header-only mean a significant decrease in compilation performance?
Runtime Performance. Could header-only give superior performance?
Restrictions. Does the design require header-only?
About binary size.
and a bit of security
If there's a lot of reachable code in the Boost library, or code the compiler cannot prove unreachable by the client, it has to be put into the final binary. (*)
On operating systems that have package management (e.g. RPM- or .deb-based), shared libraries can mean a big decrease in binary distribution size and have a security advantage: security fixes are distributed faster and are then automatically used by all .so/.DLL users. So you have one recompile and one redistribution, but N beneficiaries. With a header-only library, you have N recompiles and N redistributions for every fix, and some members of those N are huge in themselves already.
(*) reachable here means "potentially executed"
About compilation times.
Some Boost libraries are huge. If you #include it all, then each time you change a bit in your source file you have to recompile everything you #included.
This can be countered with cherry-picked headers, e.g.
#include <boost/huge-boost-library.hpp> // < BAD
#include <boost/huge-boost-library/just-a-part-of-it.hpp> // < BETTER
but sometimes the stuff you really need to include is already big enough to cripple your recompiles.
The countermeasure is to make it a static or shared library, in turn meaning "compile completely exactly once (until the next boost update)".
About runtime performance.
We are still not in an age where global optimization solves all of our C++ performance problems. To make sure you give the compiler all the information it needs, you can make stuff header-only and let the compiler make inlining decisions.
In that respect, note that inlining does not always give superior performance, because of caching and speculation issues on the CPU.
Note also that this argument is mostly with regards to boost libraries that might be used frequently enough, e.g. one could expect boost::shared_ptr<> to be used very often, and thus be a relevant performance factor.
But consider the real and only relevant reason boost::shared_ptr<> is header-only ...
About restrictions.
Some stuff in C++ cannot be put into compiled libraries, namely templates and enumerations.
But note that this is only halfway true. You can write typesafe, templated interfaces to your real data structures and algorithms, which in turn have their runtime-generic implementation in a library.
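A minimal sketch of that split, using the C runtime's qsort as the runtime-generic implementation living in a library, with a thin type-safe template facade in the header:

#include <cstdlib>   // std::qsort: not a template, lives in the runtime library

// Header-only, type-safe facade; only this thin wrapper is templated.
template <typename T>
void typed_sort(T* data, std::size_t count) {
    std::qsort(data, count, sizeof(T),
               [](const void* a, const void* b) -> int {
                   const T& x = *static_cast<const T*>(a);
                   const T& y = *static_cast<const T*>(b);
                   if (x < y) return -1;
                   if (y < x) return 1;
                   return 0;
               });
}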
Likewise, some stuff in C++ should be put into source files, and in case of boost, libraries. Basically, this is everything that would give "multiple definition" errors, like static member variables or global variables in general.
Some examples can also be found in the standard library: std::cout is defined in the standard as extern ostream cout;, and so cout basically requires the distribution of something (library or sourcefile) that defines it once and only once.
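The same single-definition requirement shows up with ordinary static data members (at least before C++17's inline variables); a minimal sketch:

// counter.h
struct Counter {
    static int value;   // declaration only, no storage yet
};

// counter.cpp -- exactly one translation unit must provide the definition;
// zero of them gives "undefined reference", two give "multiple definition":
//     int Counter::value = 0;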
For instance, why do most members in STL implementations have an _M_, _ or __ prefix?
Why is there so much boilerplate code?
What features is C++ lacking that would allow making the vector implementation (for instance) clearer and more concise?
Implementations use names starting with an underscore followed by an uppercase letter or two underscores to avoid conflicts with user-defined macros. Such names are reserved in C++.
For example, one could define a macro called Type and then #include <vector>. If vector implementations used Type as a template parameter name, it would break.
However, one is not allowed to define macros called _Type (or __type, type__ etc.). Therefore, vector can safely use such names.
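A sketch of the failure mode those reserved names protect against (the macro itself is perfectly legal user code):

#define Type int     // a legal user macro with an ordinary name
#include <vector>    // would break if <vector> named a template parameter "Type"

// User code may not define macros such as _Type or __type, so the
// implementation can safely use those spellings for its own names.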
Lots of STL implementations also include checking for debug builds, such as verifying that two iterators are from the same container when comparing them, and watching for iterators going out of bounds. This involves fairly complex code to track the container and validity of every iterator created, but is invaluable for finding bugs. This code is also all interwoven with the standard release code with #ifdefs - even in the STL algorithms. So it's never going to be as clear as their most basic operation. Sites like this one show the most basic functionality of STL algorithms, stating their functionality is "equivalent to" the code they show. You won't see that in your header files though.
In addition to the good reasons robson and AshleysBrain have already given, one reason that C++ standard library implementations have such terse names and compact code is that virtually every C++ program (compilation unit, really) includes a large number of the standard library headers, and they are thus repeatedly recompiled (remember that they're largely inlined and template-based, whereas the C standard library headers only contain a handful of function declarations). A standard library written to "industry standard" style guidelines would take longer to compile and thus lead to the perception that a particular compiler was "slow". By minimizing whitespace and using short identifier names, the lexer and parser have less work to do, and the whole compilation process completes a little bit faster.
Another reason worth mentioning is that many standard library implementations (e.g. Dinkumware, Rogue Wave (old), etc.) can be used with several different compilers with widely different standards compliance and quirks. There's frequently a lot of macro hackery aimed at satisfying each supported platform.