I have a few questions about C++ compilers:
Are C++ compilers required to be one-pass compilers? Does the Standard talk about it anywhere?
In particular, is GCC a one-pass compiler? If it is, then why does it generate the following error twice in this example (though the template argument is different in each error message)?
error: declaration of ‘adder<T> item’ shadows a parameter
error: declaration of ‘adder<char [21]> item’ shadows a parameter
A more general question
What are the advantages and disadvantages of one-pass and multi-pass compilers?
Useful links:
A List of C/C++ compilers (wikipedia)
An incomplete list of C++ compilers (Bjarne Stroustrup's site)
The standard sets no requirements whatsoever with regard to
how a compiler is implemented. But what do you mean by
"one-pass"? Most compilers today do only read the input file
once. They create an in memory representation (often in the
form of some sort of parse tree), and may make multiple passes
over that. And almost certainly make multiple passes over parts
of it. The compiler must make a "pass" over the internal
representation of a template each time it is instantiated, for
example; there's no way of avoiding that. G++ also makes
a "pass" over the template when it is defined, before any
instantiation, and reports some errors then. (The standard
committee expressly designed templates to allow a maximum of
error detection at the point of definition. This is the
motivation behind the requirement for typename in certain
places, for example.) Even without templates, a compiler will
generally have to make two passes over a class definition if
there are functions defined in it.
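As a minimal sketch of the class-definition case (the `Counter` class and its members are made up for illustration): the body of `next()` uses `value_` and `step_`, which are only declared further down, so the compiler cannot finish processing that body until it has seen the whole class.

```cpp
#include <cassert>

// Hypothetical class, for illustration only.
class Counter {
public:
    // The body refers to value_ and step_, which are declared below.
    // Member function bodies are processed as if they appeared after the
    // closing brace of the class, so this is well-formed.
    int next() { value_ += step_; return value_; }

private:
    int value_ = 0;
    int step_ = 2;
};

// Small usage check.
inline void counter_demo() {
    Counter c;
    assert(c.next() == 2);
    assert(c.next() == 4);
}
```

Calling `counter_demo()` from `main` should run without tripping either assertion.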
With regard to the more general question, again, I think you'd
have to define exactly what you mean by "one-pass". I don't
know of any compiler today which reads the source file several
times, but almost all will visit some or all of the nodes in the
parse tree more than once. Is this one-pass or multi-pass? The
distinction was more significant in the past, when memory wasn't
sufficient to maintain much of the source code in an internal
representation. Languages like Pascal and, to a lesser degree
C, were sometimes designed to be easy to implement with a single
pass compiler, since a single pass compiler would be
significantly faster. Today, this issue is largely irrelevant,
and modern languages, including C++, tend to ignore it; where
C++ seems to conform to the needs of a one-pass compiler, it's
largely for reasons of C compatibility, and where
C compatibility is not an issue (e.g. in a class definition), it
often makes order of declaration irrelevant.
From what I know, 30 years ago it was important for a compiler to be one-pass, because reads and writes to disk (or magnetic tape) were very slow and there was not enough memory to hold the whole program (thanks James Kanze). Also, single-pass processing is a requirement for scripting/interactive languages.
Nowadays compilers are usually not one-pass: there are several intermediate representations (e.g. an abstract syntax tree or static single assignment form) that the code is transformed into and then analyzed and optimized.
Some elements of C++ cannot be resolved without intermediate steps, e.g. in a class you can reference members which are defined only later in the class body. Also, all templates need to be remembered somehow for later access during instantiation.
What usually does not happen is parsing the source code several times; there is no need for that. So you should not see the same syntax error reported several times.
No, I would be surprised if you found a heavily used C++ single pass compiler.
No, it does multiple passes and even different optimizations based on the flags you pass it.
Advantages (single-pass): fast! Since all the source only needs to be examined once, the compilation phase (and thus the beginning of execution) can happen very quickly. It is also an attractive model because it makes the compiler easy to understand and often "easier" to implement. (I worked on a single-pass Pascal compiler once, but don't encounter them often, whereas single-pass interpreters are common.)
Disadvantages (single-pass): optimization, semantic/syntactic analysis. Sometimes a single look at the code lets things through that are easily caught by simple mechanisms in multiple passes. (That is roughly why we have things like JSLint.)
Advantages (multi-pass): optimizations, semantic/syntactic analysis. Even pseudo-interpreted languages like "JRuby" go through a compilation pipeline to get to Java/JVM bytecode before execution. You could consider this multi-pass, and the multiple looks at the varying representations of the code (and the resulting optimizations) can make it very fast.
Disadvantages (multi-pass): complexity, and sometimes time (depending on whether AOT or JIT compilation is used).
Also, single-pass is pretty common in academia to help learn the aspects of compiler design.
Walter Bright, the developer of the first native C++ compiler (Zortech C++), has stated that he believes it is not possible to compile C++ without at least 3 passes. And, yes, that means 3 full text-transforming passes over the source, not just traversals through an internal tree representation. See his Dr. Dobb's magazine article, "Why is C++ compilation so slow?" So any hope of finding a true one-pass compiler seems doomed. (I think this was part of the motivation Bright had to develop D, his C++ alternative.)
The compiler only needs to look at the sources once top down, but that does not mean that it does not have to process the parsed contents more than once. In particular with templates, it has to instantiate the templated code with the type, and that cannot happen until the template is used (or explicitly instantiated by the user), which is the reason for your duplicate errors:
When the template is defined, the compiler detects an error and at that point the type has not been substituted. When the actual instantiation occurs it substitutes the template arguments and processes the result, which is what triggers the second error. Note that if the template was specialized after the first definition, and before the instantiation, for that particular type, the second error need not occur.
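The two stages can be made visible with a small sketch. It is not the code from the question: `describe`, `probe` and `Widget` are hypothetical names, and it assumes a compiler that implements two-phase name lookup (some older compilers deferred all checking to instantiation). The non-dependent call is resolved when the template is defined, the dependent one only when it is instantiated.

```cpp
#include <cassert>
#include <string>

inline std::string describe(double) { return "double"; }

template <typename T>
std::string probe(T value) {
    // describe(1) does not depend on T, so it is checked and bound when the
    // template is defined. Only describe(double) is visible at this point.
    std::string fixed = describe(1);
    // value.label() depends on T, so it can only be checked once T is known,
    // i.e. when the template is instantiated.
    return fixed + "/" + value.label();
}

// Declared after the template definition: too late for the call inside probe().
inline std::string describe(int) { return "int"; }

// Hypothetical type used to instantiate the template.
struct Widget {
    std::string label() const { return "widget"; }
};

inline void two_phase_demo() {
    assert(probe(Widget{}) == "double/widget");  // bound at definition time
    assert(describe(1) == "int");                // ordinary call sees both overloads
}
```

An error in the non-dependent part would be reported once at the definition, while an error in the dependent part would be reported for each failing instantiation, which matches the two messages with different template arguments.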
Does the compiler optimize data structures when the input size is small?
unordered_set<int> TmpSet;
TmpSet.insert(1);
TmpSet.insert(2);
TmpSet.insert(3);
...
...
Since the size is small, hashing would not be required; we could simply store this in 3 variables. Do optimizations like this happen? If yes, who is responsible for them?
edit: Replaced set with unordered_set, as the former doesn't do hashing.
It is possible in theory to replace whole data structures with a different implementation (as long as escape analysis can show that a reference to it couldn't be passed to separately-compiled code).
But in practice what you've written is a call to a constructor, and then three calls to template functions which aren't simple. That's all the compiler sees, not the high-level semantics of a set.
Unless it's going to optimize stuff away, real compilers aren't going to use a different implementation which would have different object representations.
If you want to micro-optimize like this, in C++ you should do it yourself, e.g. with a std::bitset<32> and set bits to indicate set membership.
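A rough sketch of that do-it-yourself approach, assuming the values are known to lie in the range [0, 32). `SmallIntSet` is a made-up wrapper, not a standard type:

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Hypothetical helper: a set of small integers in the range [0, 32),
// stored as one bit per possible value instead of a hash table.
class SmallIntSet {
public:
    // std::bitset::set throws std::out_of_range if v >= 32.
    void insert(std::size_t v) { bits_.set(v); }

    bool contains(std::size_t v) const {
        return v < bits_.size() && bits_.test(v);
    }

    std::size_t size() const { return bits_.count(); }

private:
    std::bitset<32> bits_;
};

inline void small_set_demo() {
    SmallIntSet s;
    s.insert(1);
    s.insert(2);
    s.insert(3);
    s.insert(3);  // inserting a duplicate changes nothing
    assert(s.size() == 3);
    assert(s.contains(2));
    assert(!s.contains(4));
}
```

The trade-off is explicit here: membership tests are a single bit test, but the range of representable values is fixed when the code is written.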
It's far too complex a problem for a compiler to reliably do a good job. And besides, there could be serious quality-of-implementation issues if the compiler starts inventing different data-structures that have different speed/space tradeoffs, and it guesses wrong and uses one that doesn't suit the use-case well.
Programmers have a reasonable expectation that their compiled code will work somewhat like what they wrote, at least on a large scale. Including calling those standard header template functions with their implementation of the functions.
Maybe there's some scope for a C++ implementation adapting to the use-case, but I think that would need much debate and probably a special mechanism to allow it in practice. Perhaps name-recognition of std:: names could be enough, making those into builtins instead of header implementations, but currently the implementation strategy is to write those functions in C++ in .h headers, e.g. in libstdc++ or libc++.
I've been programming with C++ for quite a while and I enjoy using templates a lot. What I've been wondering recently due to my foray into embedded programming is how one should expect the linker to behave with respect to code duplication in template instances where the template parameter is different.
For multiple instances of the same template with same parameters this is well known to be optimized away during link time (see also: How does C++ link template instances)
However, in my case I'm interested in whether the linker will recognize any duplicated code between two templates that were instantiated with different parameters. As they are different types, I would assume that they would not automatically be collapsed. However, since they might have some functions that do not depend on the template parameters and would thus be identical between the two classes, one might assume that a linker could optimize those away and thus save space.
What would be the expected behaviour in this case?
The gold linker does exactly that.
Safe ICF: Pointer Safe and Unwinding Aware Identical Code Folding in Gold:
We have found that large C++ applications and shared libraries tend to have many functions whose code is identical with another function. As much as 10% of the code could theoretically be eliminated by merging such identical functions into a single copy. This optimization, Identical Code Folding (ICF), has been implemented in the gold linker. At link time, ICF detects functions with identical object code and merges them into a single copy.
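To illustrate the kind of code this targets, here is a sketch with a hypothetical `Buffer` template. `capacity_limit()` does not depend on `T`, so each instantiation is likely to compile to identical machine code and becomes a candidate for folding. `bytes_for()` depends on `sizeof(T)` and is identical only for types of the same size.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical template, for illustration only.
template <typename T>
class Buffer {
public:
    // Independent of T: Buffer<int>::capacity_limit and
    // Buffer<double>::capacity_limit are distinct functions in the
    // language, but their object code is likely the same.
    static std::size_t capacity_limit() { return 1024; }

    // Depends on sizeof(T): the object code differs between types of
    // different sizes, so these cannot be merged.
    static std::size_t bytes_for(std::size_t n) { return n * sizeof(T); }
};

inline void buffer_demo() {
    assert(Buffer<int>::capacity_limit() == Buffer<double>::capacity_limit());
    assert(Buffer<int>::bytes_for(4) == 4 * sizeof(int));
    assert(Buffer<double>::bytes_for(4) == 4 * sizeof(double));
}
```

With GCC this would typically be tried by linking with something like `-fuse-ld=gold -Wl,--icf=safe` (usually together with `-ffunction-sections`). Whether two particular functions are actually folded depends on the generated code, so the sketch deliberately does not compare function addresses.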
Why is it not possible to create a C++ decompiler that will function as accurately as those made for Java and C#?
There are several reasons:
Inlining. A lot of C++ code gets inlined in optimized builds. That plays havoc with any form of decompiler. To figure out that a function was inlined, the decompiler would have to analyze the specifics of the inlined code and match them up. And post-inlining optimization steps can make code very different, depending on where it was inlined.
Templates. Templates rely on inlining (the first point) exclusively, but they create additional problems. It is at least theoretically possible that a function that gets inlined in two places would compile to the same sequence of assembly instructions. But what about template code that was instantiated with different template arguments? Different instantiations will usually have to compile down to different sequences of instructions. And this becomes even more difficult, since template code can call different sets of functions based on the template parameters. And those functions themselves could be inlined.
Compile-time execution. Template metaprogramming allows the compiler to actually execute code. But C++11's constexpr provides a more natural way to do some computations at compile time. Obviously, compile-time function calls or metafunction instantiations cannot be part of the compiled executable. Only the results of them will be (since that's kinda the point).
Lack of comprehensive runtime reflection. C# and Java both lace their bytecode with a lot of information about the nature of the original source code. Object definitions are easily detectable, as are object names, member variable types and names, etc. C++ compiles down to machine language, which is not required to have any such information. And since it isn't required, compilers don't generate it. Even the reflection study group of the ISO C++ committee is focused on compile-time reflection, which is information that won't be available at runtime.
Even std::type_info doesn't offer anything. The reason being that, if the compiler does not detect that a particular type will have typeid called on it, then the compiler doesn't need to generate a std::type_info object for it. And even if it did, all that gives you is an object's name (and an identifier). Nothing more.
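A small sketch of how little `std::type_info` exposes (`Point` is a made-up type): you can compare types and ask for an implementation-defined name string, but nothing about members, their names or their layout.

```cpp
#include <cassert>
#include <string>
#include <typeinfo>

// Hypothetical type, for illustration only.
struct Point {
    int x;
    int y;
};

// Returns the implementation-defined name of Point. The exact string
// varies between compilers and is often mangled.
inline std::string point_type_name() { return typeid(Point).name(); }

inline void type_info_demo() {
    // Identity comparison is essentially all std::type_info offers
    // beyond the name; there is no way to enumerate x and y.
    assert(typeid(Point) == typeid(Point));
    assert(typeid(Point) != typeid(int));
    assert(typeid(Point).hash_code() == typeid(Point).hash_code());
}
```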
C++ compilers generally do not put any more information into the executable than they absolutely have to (especially not when compiling in release mode rather than a debug build), so the information you'd need to accurately decompile the program simply is not present in the executable.
Of course a C++ compiler could be made that does include all of the necessary information in the executable (e.g. in the most naive implementation, it could simply include a copy of the source code itself in the executable), but doing so would make the executables significantly larger, and most non-open-source C++ developers would prefer that other people not be able to decompile the executable, so there isn't a whole lot of demand for that functionality.
I understand that you can't declare a virtual method as templated, because the compiler would not know how many entries to reserve in the virtual table. This is, however, a technical limitation, rather than a language one. The compiler could know how many instances of the template are actually needed, and "go back" to allocate a proper vtable size.
Is there a planned technical solution in the upcoming standard?
The compiler can never know all of the possible instantiations of a template. Under the current compilation model, each translation unit is compiled separately and later linked. When compiling a template type in one translation unit, you do not know the instantiations of that type in another.
Imagine you're writing a library and you want a template function in it. You compile the library and then distribute it to your clients. Now the clients can instantiate your template function with whatever template arguments they like, but your library has already been compiled! It can't "go back" and change this.
You're assuming that when you compile the template function, you also have available every instantiation of that function. That's often not the case and, under the current compilation and linking model, cannot be known to be the case.
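A sketch of what is and is not allowed today, using hypothetical `Visitor` classes. The commented-out line is the declaration the question asks for. The overloads below it show the constraint in practice: the set of virtual functions has to be spelled out when the class is compiled.

```cpp
#include <cassert>
#include <string>

// Hypothetical classes, for illustration only.
struct Visitor {
    virtual ~Visitor() = default;

    // Ill-formed: a member function template cannot be virtual.
    // template <typename T> virtual std::string visit(const T&);

    // Allowed: a fixed set of virtual functions, known when this class
    // is compiled, so the number of virtual table entries is known too.
    virtual std::string visit(int) { return "int"; }
    virtual std::string visit(double) { return "double"; }
};

struct LoudVisitor : Visitor {
    std::string visit(int) override { return "INT"; }
    std::string visit(double) override { return "DOUBLE"; }
};

inline void visitor_demo() {
    LoudVisitor loud;
    Visitor& v = loud;  // dispatch through the base interface
    assert(v.visit(1) == "INT");
    assert(v.visit(1.0) == "DOUBLE");
}
```

A client of a precompiled library could not add `visit(std::string)` to this interface without recompiling `Visitor`, which is the same "cannot go back" problem in miniature.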
It's certainly possible to do this, given no requirements of working with existing linkers. That is, the linker could sift through all the instantiations of that template function and build the appropriate data structures. But one of the strengths of C++ is that it doesn't require specialized linkers; that makes it portable to systems where the linker is written in stone and cannot be changed. And, yes, that happens; the linker is where all the object code meets, and it has to be compatible with all the programming languages that the system supports, and that, in turn, means that it sometimes has grown old and crufty, and any change brings a substantial risk of breakage. So, while it's theoretically possible to do this, it ain't gonna happen.
There is nothing currently planned based on the C++ Standards Committee papers and core language issues. The C++ Standard specifies requirements for implementations of C++, but does not define the technical implementation itself. Hence, the lack of template virtual functions is explicitly not a technical limitation, but rather a limitation of the language defined by the standard. Nevertheless, the limitation of the language may be the result of the risk involved in changing existing implementations rather than being imposed as a result of an implementation's technical limitations.
For instance, why do most members in STL implementations have a _M_, _ or __ prefix?
Why is there so much boilerplate code?
What features is C++ lacking that would make the implementation of vector (for instance) clearer and more concise?
Implementations use names starting with an underscore followed by an uppercase letter or two underscores to avoid conflicts with user-defined macros. Such names are reserved in C++.
For example, one could define a macro called Type and then #include <vector>. If vector implementations used Type as a template parameter name, it would break.
However, one is not allowed to define macros called _Type (or __type, type__ etc.). Therefore, vector can safely use such names.
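A sketch of that scenario. The macro name `Type` comes from the example above, everything else is hypothetical. The user macro is defined before the include, and the header is unaffected in practice because library implementations spell their internal names with reserved identifiers (e.g. `_Tp` or `__x`).

```cpp
// A user macro with an ordinary, non-reserved name, defined before the include.
#define Type 42

#include <cassert>
#include <vector>  // still compiles: the header does not use the name Type

// Capture the macro's value, then remove the macro so later code is unaffected.
constexpr int user_macro_value = Type;
#undef Type

inline std::vector<int> make_values() {
    return {user_macro_value, user_macro_value + 1};
}

inline void macro_demo() {
    std::vector<int> values = make_values();
    assert(values.size() == 2);
    assert(values[0] == 42);
    assert(values[1] == 43);
}
```

Had the header declared something like `template <typename Type> class vector`, the preprocessor would have turned it into `template <typename 42> class vector` and the include itself would have failed.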
Lots of STL implementations also include checking for debug builds, such as verifying that two iterators are from the same container when comparing them, and watching for iterators going out of bounds. This involves fairly complex code to track the container and validity of every iterator created, but is invaluable for finding bugs. This code is also all interwoven with the standard release code with #ifdefs - even in the STL algorithms. So it's never going to be as clear as their most basic operation. Sites like this one show the most basic functionality of STL algorithms, stating their functionality is "equivalent to" the code they show. You won't see that in your header files though.
In addition to the good reasons robson and AshleysBrain have already given, one reason that C++ standard library implementations have such terse names and compact code is that virtually every C++ program (compilation unit, really) includes a large number of the standard library headers, and they are thus repeatedly recompiled (remember that they're largely inlined and template-based, whereas the C standard library headers only contain a handful of function declarations). A standard library written to "industry standard" style guidelines would take longer to compile and thus lead to the perception that a particular compiler was "slow". By minimizing whitespace and using short identifier names, the lexer and parser have less work to do, and the whole compilation process completes a little bit faster.
Another reason worth mentioning is that many standard library implementations (e.g. Dinkumware, Rogue Wave (old), etc.) can be used with several different compilers with widely different standards compliance and quirks. There's frequently a lot of macro hackery aimed at satisfying each supported platform.