How does C++ link template instances


If I define a function (maybe a class member function but not inlined) in a header file that is included by two different translation units I get a link error since that function is multiply defined. Not so with templates since they are not compilable types until the compiler resolves a declaration of an object of a templatized type. This made me realize I don't know where compiled template code resides and how it is linked since C++ does not just create multiple copies of code to define SomeTemplateClass. Any info would be appreciated.
Thanks!

There are 3 implementation schemes used by C++ compilers:
greedy instantiation, where the compiler generates an instantiation in each compilation unit that uses it, then the linker throws away all but one of them (this is not just a code-size optimization, it's required so that function addresses, static variables, and the like are unique). This is the most common model.
queried instantiation, where the compiler has a database of instantiations already done. When an instantiation is needed, the DB is checked and updated. The only compiler I know which uses this is Sun's, and it isn't used by default anymore.
iterated instantiation, where the instantiations are made by the linker (either directly or by assigning them to a compilation unit, which will then be recompiled). This is the model used by CFront -- i.e. historically it was the first one used -- and also by compilers using the EDG front-end (with some optimisations compared to CFront).
(See C++ Templates, The Complete Guide by David Vandevoorde and Nicolai Josuttis. Another online reference is http://www.bourguet.org/v2/cpplang/export.pdf, which is more concerned about the compilation model but still has descriptions of the instantiation mechanisms).
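To make the uniqueness requirement of greedy instantiation concrete, here is a minimal sketch. It is squeezed into a single file, so the two "unit" functions (hypothetical names, as is instance_counter) only stand in for code that would normally sit in two different .cpp files including the same header. After the linker has discarded the duplicates, both callers must see the same static variable:

```cpp
#include <cassert>

// Would normally live in a header included by several .cpp files.
template <typename T>
int& instance_counter() {
    static int count = 0;   // one object per instantiation, program-wide
    return count;
}

// Stand-ins for code in two different translation units (hypothetical names).
int bump_from_unit_a() { return ++instance_counter<int>(); }
int bump_from_unit_b() { return ++instance_counter<int>(); }

// Both callers share the single instance_counter<int>; a different
// instantiation such as instance_counter<long> has its own variable.
void check_shared_counter() {
    assert(&instance_counter<int>() == &instance_counter<int>());
    assert(&instance_counter<int>() != &instance_counter<long>());
}
```

If the linker kept one copy of instance_counter<int> per object file, the two callers would each increment their own counter, which is exactly what the standard forbids.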

All template functions are implicitly inline. Just as methods defined in the class declaration are implicitly inline.
When I say implicitly inline I mean the more modern usage of the word. See my lengthy description here.
In short, inline, static, and extern are all sibling linkage directives. inline tells the linker to ignore duplicate definitions of a function. Generally this means the linker will pick one definition and use it for all compilation units. I don't know of any compilers that do or did leave all duplicate template code in the final executable.
Where are template instantiations stored?
They are stored in the same way in the same place as inline functions. The details of that are compiler specific.

This is implementation specific.
Some compilers will generate the same template instances over and over for each translation unit they are instantiated in and let the linker fold the duplicates.
Templates got a bad reputation for "code bloat" when linkers weren't yet up to that task. Nowadays this is probably undeserved. Some implementations will even fold different instantiations when they compile to the same target machine code. (Like f<A*>() and f<B*>(), as pointer types are just addresses in the generated machine code.)
Others will defer template compilation until link-time instead, and there might be still other ways to deal with this. As I said, it's up to the implementation.
These all have different advantages and disadvantages. Absent a true module concept, I doubt anyone will come up with the perfect scheme.
With export there used to be a requirement for compilers to pre-compile template code and instantiate at request. However, except for one vendor nobody implemented export for their compiler, and now it's removed.

It actually does create multiple copies. Those copies are special and don't violate the one-definition rule. Some linkers will come along, remove the copies, and relink the functions using them; not all do.

Related

Why must an inline function be "defined" in a header if the compiler can just inline functions on its own accord

The reason we have to define inline functions in the header is that each compilation unit where that function is called must have the entire definition in order to replace the call, or substitute it. My question is why are we forced to put a definition in a header file if the compiler can and does do its own optimisations of inlining, which would require it to dig into the cpp files where the functions are defined anyway.
In other words, the compiler seems to me to have the ability to see the function "declaration" in a header file, go to the corresponding cpp file and pull the definition from it and paste it in the appropriate spot in the other cpp. Given that this is the case, why the insistence of defining the function in the header, implying as if the compiler can't "see" into other cpp files.
The MSDN says about the /Ob2 optimisation setting:
/Ob2 The default value. Allows expansion of functions marked as inline, __inline, or __forceinline, and any other function that the compiler chooses (My emphasis).

The reason we're forced to provide definitions of inline functions in header files (or at least, in some form that is visible to the implementation when inlining a function in a given compilation unit) is that the C++ standard requires it.
However, the standard does not go out of its way to prevent implementations (e.g. the toolchain or parts of it, such as the preprocessor, compiler proper, linker, etc) from doing things a little smarter.
Some particular implementations do things a little smarter, so can actually inline functions even in circumstances where they are not visible to the compiler. For example, in a basic "compile all the source files then link" toolchain, a smart linker may realise that a function is small and only called a few times, and elect to (in effect) inline it, even if the points where inlining occurs were not visible to the compiler (e.g. because the statements that called the functions were in separate compilation units, the function itself is in another compilation unit) so the compiler would not do inlining.
The thing is, the standard does not prevent an implementation from doing that. It simply states the minimum set of requirements for behaviour of ALL implementations.
Essentially, the requirement that the compiler have visibility of a function to be inlined is the minimum requirement from the standard. If a program is written in that way (e.g. all functions to be inlined are defined in their header file) then the standard guarantees that it will work with every (standard compliant) implementation.
But what does this mean for our smarter tool-chain? The smarter tool-chain must produce correct results from a program that is well-formed - including one that defines inlined functions in every compilation unit which uses those functions. Our toolchain is permitted to do things smarter (e.g. peeking between compilation units) but, if code is written in a way that REQUIRES such smarter behaviour (e.g. that a compiler peek between compilation units) that code may be rejected by another toolchain.
In the end, every C++ implementation (the toolchain, standard library, etc) is required to comply with requirements of the C++ standard. The reverse is not true - one implementation may do things smarter than the standard requires, but that doesn't generate a requirement that some other implementation do things in a compatible way.
Technically, inlining is not limited to being a function of the compiler. It may happen in the compiler or the linker. It may also happen at run time - for example "Just In Time" technology can, in effect, restructure executable code after it has been run a few times in order to enhance subsequent performance [this typically occurs in a virtual machine environment, which permits the benefits of such techniques while avoiding problems associated with self-modifying executables].

The inline keyword isn't just about expanding the implementation at the point of the call; it is in fact primarily about declaring that multiple definitions of a function may exist in a program, one in each translation unit that uses it.
This has been covered in other questions before, which can explain it much better than I can :)
Why are class member functions inlined?
Is "inline" implicit in C++ member functions defined in class definition

No, compilers traditionally can't do this. In the classic model, the compiler 'sees' only one cpp file at a time, and can't go to any other cpp files. Out of this cpp file the compiler produces a so-called object file in the platform's native format, which is then linked using what is effectively a linker from the 1970s, which is as dumb as a hammer.
This model is slowly evolving. With more and more effective link-time optimization (LTO), linkers become aware of what cpp code is, and can perform their own inlining. However, even in the link-time optimization model, compiler-done inlining and optimization are still far more effective than their link-time counterparts: a lot of important context is lost when cpp code is converted to an intermediate format suitable for linking.

It's much easier for the compiler to expand a function inline if it has seen the definition of that function. The easiest way to let the compiler see the definition of a function in every translation unit that uses that function is to put the definition in a header and #include that header wherever the function will be used. When you do that you have to mark the definition as inline so that the compiler (actually the linker) won't complain about seeing the definition of that function in more than one translation unit.
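As a minimal sketch of that arrangement (the header name square.h and all function names are made up for illustration, and everything is shown in one file here):

```cpp
#include <cassert>

// --- contents of a hypothetical header, say "square.h" ---
// `inline` permits this definition to appear in every translation unit that
// includes the header. Without it, two including .cpp files would each emit
// an ordinary definition and the link would fail with a duplicate symbol.
inline int square(int x) { return x * x; }

// --- hypothetical callers, one per .cpp file in a real project ---
int area_of_square(int side) { return square(side); }
int sum_of_squares(int a, int b) { return square(a) + square(b); }

// Small self-check showing the callers in use.
void check_square_callers() {
    assert(area_of_square(3) == 9);
    assert(sum_of_squares(3, 4) == 25);
}
```

Whether the calls to square() are actually expanded in place is still up to the optimizer; the keyword only guarantees that the repeated definition is legal.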

inline keyword for templates [duplicate]

This code will be placed in a header file:
template<typename TTT>
inline Permutation<TTT> operator * (const Cycle<TTT>& cy, const Permutation<TTT>& p)
{
    return Permutation<TTT>(cy)*p;
}
Is inline necessary to avoid a linker error?
If this function is not a template and the header file is used in more than one .cpp file, inline is necessary to avoid a linker error complaining about multiple definitions for a function. It seems the linker ignores this for templates.

Is inline necessary to avoid a linker error?
On a function template, no. Templates, like inline functions, are subject to a more relaxed One Definition Rule which allows multiple definitions - as long as the definitions are identical and in separate translation units.
As you say, inline would be necessary if you wanted to define a non-template function in a header; non-inline functions are subject to a more strict One Definition Rule, and can only have one definition in a program.
For the gory details, this is specified by C++11 3.2/5:
There can be more than one definition of a class type, inline function with
external linkage, class template, non-static function template, static data member
of a class template, member function of a class template, or template specialization for
which some template parameters are not specified in a program provided that each definition
appears in a different translation unit, and provided the definitions satisfy the following requirements.
(The "following requirements" basically say that the definitions must be identical).
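A small sketch of how this plays out in a header (the function names are invented for the example). One case that is easy to miss: a full specialization of a function template is not covered by the paragraph quoted above, since no template parameters are left unspecified, so it needs inline just like an ordinary function.

```cpp
#include <cassert>

// --- contents of a hypothetical header included by several .cpp files ---

// Function template: no `inline` needed. Identical definitions in several
// translation units are allowed.
template <typename T>
T twice(const T& value) { return value + value; }

// Ordinary function: needs `inline` to be defined in a header that more
// than one translation unit includes.
inline int twice_int(int value) { return value + value; }

// A full specialization behaves like an ordinary function again, so it
// needs `inline` when it is defined in a header.
template <>
inline double twice<double>(const double& value) { return 2.0 * value; }

// Small self-check showing all three in use.
void check_twice() {
    assert(twice(21) == 42);
    assert(twice_int(4) == 8);
    assert(twice(1.5) == 3.0);
}
```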

Consider that a template function (or function template if you prefer) is not a function at all. It is rather a recipe to create a function. The actual function is only created when and where the template is instantiated. So you do not need the inline keyword here, because template functions will not result in multiple-definition linker errors because they are not actually defined (from the linker's perspective) until they are used.
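The "recipe" view can be seen in a short sketch (describe, Widget and size_hint are hypothetical names). The template body makes demands that int cannot meet, yet nothing goes wrong, because no function describe<int> is ever created:

```cpp
#include <cassert>

// The body requires T to have a member function named size_hint().
// No function exists until the template is instantiated with a concrete T.
template <typename T>
int describe(const T& item) { return item.size_hint(); }

struct Widget {
    int size_hint() const { return 7; }
};

int use_describe() {
    // Instantiates describe<Widget> here. describe<int> is never created,
    // so the fact that `int` has no size_hint() is not an error.
    return describe(Widget{});
}

// Small self-check.
void check_describe() {
    assert(use_describe() == 7);
}
```

Adding a call such as describe(42) anywhere would instantiate describe<int> and only then produce a compile error.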

Expanding on Mike Seymour's answer -- the paragraph (3.2/5) he cites in the Standard refers to a concept called "vague linkage". Basically, it's a way of saying "we need this to exist somewhere in the resulting binary, but we don't have a clear-cut home for it in any specific object file we're emitting." On modern platforms (Windows, ELF systems such as Linux, and OS X), this is implemented using a mechanism known as COMDAT support that allows the compiler to simply generate the instantiations and other vague linkage items (vtables, typeinfos, and inline function bodies) as-needed -- the linker then is free to throw out the duplicates:
When used with GNU ld version 2.8 or later on an ELF system such as GNU/Linux or Solaris > 2, or on Microsoft Windows, duplicate copies of these constructs will be discarded at
link time. This is known as COMDAT support.
This is discussed in more detail in the GCC manual (quote snipped as the Cfront model is irrelevant for modern compilers):
C++ templates are the first language feature to require more intelligence from the
environment than one usually finds on a UNIX system. Somehow the compiler and linker
have to make sure that each template instance occurs exactly once in the executable if
it is needed, and not at all otherwise. There are two basic approaches to this problem,
which are referred to as the Borland model and the Cfront model.
Borland model
Borland C++ solved the template instantiation problem by adding the code equivalent
of common blocks to their linker; the compiler emits template instances in each
translation unit that uses them, and the linker collapses them together. The advantage
of this model is that the linker only has to consider the object files themselves; there
is no external complexity to worry about. The disadvantage is that compilation time is
increased because the template code is being compiled repeatedly. Code written for this
model tends to include definitions of all templates in the header file, since they must
be seen to be instantiated.

How does C++ partial compilation with templates work? [duplicate]

In C, partial compilation is possible since the entire *.c file can be compiled into machine code with resolution and relocation left for the linker to handle. This is just an issue of calculating the displacement certain instructions have in the final executable or knowing the absolute address for some global variable.
In C++ it would seem that almost the same can be done - there exists a fairly uncomplicated mapping between C++ code and equivalent C code (as far as mappings between programming languages go). However, templates seem to complicate things.
If I use, for instance, a std::vector<int> in 1.c, then, since the template class was specified by the <vector> header, the compiler can generate machine code for an int specialization. Suppose in the same project there is a file 2.c which also relies on a std::vector<int> specialization, and that 1.o and 2.o must be linked. Is partial compilation of 1.c and 2.c to their own *.o files to be linked later possible?
As mentioned in the linked question in the comments below, there are two commonly used methods for this problem: both generate std::vector<int> code, or the linker goes through another round of "dependency compilations" where a single vector<int> is compiled and then linked to both files.
Regarding "greedy compilation" - does this mean that every use of template class methods in every compilation unit must be put in the linker relocation table? Also, certain calls may not use long jumps (i.e., a template class is defined right above the method using it). However, if the linker is going to force a compilation unit to use the specialization it has selected, then a long jump would be necessary - but the instruction size would be too large to patch in.

This is a slightly more complicated question than what most people will realize.
In the general and simplest case, the template definition is present in the header, and it behaves as inline functions. The compiler will generate the code for those functions needed in each translation unit that needs them. Then the linker will resolve the duplicate symbols by removing all but one. Since the standard requires that they are exactly equivalent, the linker can pick any one from the list.
If the template need only work with a couple of types, you can move the definition to a single translation unit and explicitly instantiate the template for those types there. This would behave as a non-inline function in the general case.
Somewhere in between, if the template can be instantiated with any type but it is commonly instantiated with a few of them, the implementor of the template can use a mixed approach, where the template and the members are defined in the header, but explicit instantiations are also declared. Then in a single translation unit, those explicit instantiations can be done.
This approach can be used, for example, to minimize compile and link time when using std::string (which is really std::basic_string<char, std::char_traits<char>, std::allocator<char> >). The compiler can, in a single translation unit provide all of the functions for the common instantiation, but still provide the definition of the template functions in the header so that if you opt to use a different instantiation of the basic_string template it will still work for you. In all translation units that only use std::string, the compiler knows not to generate the code for all members as those will be available to the linker.
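The mixed approach could look roughly like the following. This is a sketch only: FixedStack, stack.h and stack.cpp are invented names, the container does no bounds checking, and the three "files" are shown back to back in a single file. It relies on the C++11 extern template syntax for the explicit instantiation declaration.

```cpp
#include <cassert>

// --- hypothetical header "stack.h" ---
template <typename T>
class FixedStack {
public:
    void push(const T& value);
    T pop();
    int size() const;
private:
    T items_[16];   // fixed capacity, no bounds checking in this sketch
    int count_ = 0;
};

template <typename T>
void FixedStack<T>::push(const T& value) { items_[count_++] = value; }

template <typename T>
T FixedStack<T>::pop() { return items_[--count_]; }

template <typename T>
int FixedStack<T>::size() const { return count_; }

// Explicit instantiation declaration: tells every includer that
// FixedStack<int> is instantiated elsewhere, so it need not emit the code.
extern template class FixedStack<int>;

// --- hypothetical source file "stack.cpp" ---
// Explicit instantiation definition: the one place the code is generated.
template class FixedStack<int>;

// --- any other .cpp file ---
int last_pushed() {
    FixedStack<int> s;      // relies on the instantiation from stack.cpp
    s.push(1);
    s.push(2);
    return s.pop();
}

int other_type_still_works() {
    FixedStack<long> s;     // not explicitly instantiated: implicit, as usual
    s.push(5L);
    return s.size();
}

// Small self-check.
void check_fixed_stack() {
    assert(last_pushed() == 2);
    assert(other_type_still_works() == 1);
}
```

Because the member definitions stay in the header, FixedStack<long> keeps working through ordinary implicit instantiation, while the common FixedStack<int> case is compiled only once.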

The compiler generates code for each instantiation of the template and makes sure that there are no name clashes. It is not like ordinary functions, where you get linker errors if the same function is defined in two compilation units.
It is possible to save some compilation time by explicitly instantiating the template in some compilation unit and use that template elsewhere, but this needs quite a bit of manual housekeeping (adding explicit instantiations for each new type that is used in the project). You can also save some compilation time by avoiding some unnecessary conversions by using the keyword explicit on templated constructors.

Code organization across files that has to deal with template functions and inlining

I'm maintaining a large library of template classes that perform algebraic computations based on either float or double type. Many of the classes have accessor methods (getters and setters) and other functions that run small amounts of code, therefore such functions need to be qualified as inline when the compiler locates their definitions. Other member functions, in contrast, contain sophisticated code and thus would better be called rather than inlined.
A substantial part of the function definitions are located in headers, actually in .inl files included by headers. But there are also many classes whose function definitions happily live in .cpp files by means of explicit instantiation for float and double, which is rather a good thing to do in case of a library (here explained why). And finally, there is a considerable number of classes whose function definitions are broken across .inl files (accessor methods) and .cpp files (constructors, destructors, and heavy computations), which makes them all pretty difficult to maintain.
I would have all my class implementations in .inl files only if I knew a reliable way to prevent some functions from being inlined, or in .cpp files if inline keyword could strongly suggest compiler to inline some of the functions, which, of course, it does not. I would really prefer all the function definitions in the library to reside in .cpp files, but since accessor methods are used extensively throughout the library, I have to make sure they are inlined whenever referenced, not called.
So, in this connection, my questions are:
Does it make any sense to mark the definition of a template function with inline in view of the fact that, as I've recently learnt here, it is going to be automatically qualified as inline by the compiler regardless of whether it's marked with inline or not?
And most importantly, since I would like to have the definitions of all the member functions of a template class gathered together in a single file, either it's .inl or .cpp (using explicit instantiation in case of .cpp), preferably still being able to hint the compiler (MSVC and GCC) which of the functions should be inlined and which shouldn't, sure if such thing is possible with template functions, how can I achieve this or, if there is really no way (I hope there is), what would be the most optimal compromise?
----------
EDIT1: I knew that inline keyword is just a suggestion to the compiler to inline a function.
EDIT2: I really do know. I like making suggestions to the compiler.
EDIT3: I still know. It's not what the question is about.
----------
In view of some new information, there is also third question that goes hand in hand with the second one.
3. If compilers are so smart these days that they can make better choices about which function should be inlined and which should be called and are capable of link-time code generation and link-time optimization, which effectively allows them looking into a .cpp-located function definition at link time to decide its fate about being inlined or called, then maybe a good solution would be simply moving all the definitions into respective .cpp files?
----------
So what's the conclusion?
First of all, I'm grateful to Daniel Trebbien and Jonathan Wakely for their structured and well-founded answers. Upvoted both but had to choose just one. None of the given answers, however, presented an acceptable solution to me, so the chosen answer happened to be the one that helped me slightly more than others in making the final decision, the details of which are explained next for anyone who's interested.
Well, since I've always valued the performance of code more than how convenient it is to maintain and develop, it appears to me that the most acceptable compromise would be to move all the accessor methods and other lightweight member functions of each of the template classes into the .inl file included by the respective header, marking these functions with the inline keyword in an attempt to provide the compiler with a good hint (or with a keyword for inline forcing), and move the rest of the functions into the respective .cpp file.
Having all member function definitions located in .cpp files would hinder inlining of lightweight functions while unleashing some problems with link-time optimization, as has been ascertained by Daniel Trebbien for MSVC (in an older stage of development) and by Jonathan Wakely for GCC (in its current stage of development). And having all function definitions located in headers (or .inl files) doesn't outweigh the summary benefit of having the implementation of each class sorted into .inl and .cpp files combined with a bonus side effect of this decision: it would ensure that only the code of primitive accessor methods is visible to a client of the library, while more juicy stuff is hidden in the binaries (ensuring this wasn't a major reason, however, but this plus was obvious for anyone who is familiar with software libraries). And any lightweight member function that doesn't need to be exposed by the include files of the library and is used privately by its class can have its definition in the .cpp file of the class, while its declaration/definition is spiced with inline to encourage the inline status of the function (don't know yet whether the keyword should be in both places or just one in this particular case).
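For anyone who wants to picture the resulting layout, here is a rough single-file sketch. Vec3 and the file names vec3.h, vec3.inl and vec3.cpp are made up for illustration and are not taken from the actual library:

```cpp
#include <cassert>
#include <cmath>

// --- hypothetical "vec3.h" ---
template <typename FloatT>
class Vec3 {
public:
    Vec3(FloatT x, FloatT y, FloatT z) : x_(x), y_(y), z_(z) {}
    FloatT x() const;          // lightweight: defined in vec3.inl
    FloatT length() const;     // heavier: defined in vec3.cpp
private:
    FloatT x_, y_, z_;
};

// --- hypothetical "vec3.inl", included at the end of vec3.h ---
template <typename FloatT>
inline FloatT Vec3<FloatT>::x() const { return x_; }

// --- hypothetical "vec3.cpp" ---
template <typename FloatT>
FloatT Vec3<FloatT>::length() const {
    return std::sqrt(x_ * x_ + y_ * y_ + z_ * z_);
}

// Only these two instantiations get a compiled length(). A client using
// any other element type would compile but fail to link.
template class Vec3<float>;
template class Vec3<double>;

// Small self-check.
void check_vec3() {
    assert(Vec3<double>(3.0, 4.0, 0.0).length() == 5.0);
    assert(Vec3<float>(1.0f, 2.0f, 2.0f).x() == 1.0f);
}
```

Clients of the library see only the accessor bodies, and the explicit instantiations for float and double are what make length() available at link time.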

In short: Put the template code in a header file. Use compiler-specific forceinline or noinline keywords if the optimizer fails to make good decisions about inlining.
You can and should put definitions of template members into header files. This ensures that the compiler has access to the definition at the point of use when it finds out what the actual template parameters are, and is able to perform implicit instantiation.
The inline keyword has very little impact on templates, since template functions are already exempted from the single definition requirement (The One Definition Rule still requires that all definitions be the same). It is a hint to the compiler that the function should be inlined. And you can omit it as a hint to the compiler to not inline the function. So use it that way. But the optimizer will still look at other factors (function size) and make its own choice on inlining.
Some compilers have special keywords, like __attribute__((always_inline)) or __declspec(noinline) to override the optimizer's choice.
Mostly, though, the compiler is smart enough not to inline "complex code that makes more sense as a function call". You shouldn't have to worry about it, just let the optimizer do its thing.
Portable inlining control isn't beneficial, because the trade-offs of inlining are very platform-specific. The optimizers should already be aware of those platform-specific tradeoffs, and if you do feel the need to override the compiler's choice, do so on a per-platform basis.
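If you do decide to override the optimizer per platform, one common way is to hide the extensions behind macros. The sketch below assumes MSVC or a GCC-compatible compiler. The macro names LIB_FORCE_INLINE and LIB_NO_INLINE and the Accumulator class are invented for the example; only the keywords and attributes they expand to are real compiler extensions.

```cpp
#include <cassert>

// Hypothetical macro names wrapping the compiler-specific extensions.
#if defined(_MSC_VER)
#  define LIB_FORCE_INLINE __forceinline
#  define LIB_NO_INLINE    __declspec(noinline)
#elif defined(__GNUC__)   // GCC and Clang
#  define LIB_FORCE_INLINE inline __attribute__((always_inline))
#  define LIB_NO_INLINE    __attribute__((noinline))
#else                     // unknown compiler: fall back to the plain hint
#  define LIB_FORCE_INLINE inline
#  define LIB_NO_INLINE
#endif

template <typename T>
class Accumulator {
public:
    // Tiny accessor: ask for it to be expanded at every call site.
    LIB_FORCE_INLINE T total() const { return total_; }
    // Loop-heavy member: ask for a real call instead.
    LIB_NO_INLINE void add_range(const T* first, const T* last);
private:
    T total_ = T();
};

template <typename T>
void Accumulator<T>::add_range(const T* first, const T* last) {
    for (; first != last; ++first) total_ += *first;
}

// Small self-check.
void check_accumulator() {
    Accumulator<int> acc;
    const int values[] = {1, 2, 3, 4};
    acc.add_range(values, values + 4);
    assert(acc.total() == 10);
}
```

Both definitions can stay in the header. The attributes change only the inlining decision, not how the template is instantiated or linked.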

1. Does it make any sense to mark the definition of a template function with inline in view of the fact that, as I've recently learnt, it is going to be automatically qualified as inline by the compiler regardless of whether it's marked with inline or not? Is the behavior compiler-specific?
I think you are referring to the fact that a member function defined in its class definition is always an inline function. This is per the C++ Standard, and has been since the first publication:
9.3 Member functions
...
A member function may be defined (8.4) in its class definition, in which case it is an inline member function (7.1.2)
So, in the following example, template <typename FloatT> my_class<FloatT>::my_function() is always an inline function:
template <typename FloatT>
class my_class
{
public:
    void my_function() // `inline` member function
    {
        //...
    }
};
template <>
class my_class<double> // specialization for doubles
{
public:
    void my_function() // `inline` member function
    {
        //...
    }
};
However, by moving the definition of my_function() outside of the definition of template <typename FloatT> my_class<FloatT>, it is not automatically an inline function:
template <typename FloatT>
class my_class
{
public:
    void my_function();
};
template <typename FloatT>
void my_class<FloatT>::my_function() // non-`inline` member function
{
    //...
}
template <>
void my_class<double>::my_function() // non-`inline` member function
{
    //...
}
In the latter example, it does make sense (as in, it's not redundant) to use the inline specifier with the definitions:
template <typename FloatT>
inline void my_class<FloatT>::my_function() // `inline` member function
{
    //...
}
template <>
inline void my_class<double>::my_function() // `inline` member function
{
    //...
}
2. And most importantly, since I would like to have the definitions of all the member functions of a template class gathered together in a single file, either it's .inl or .cpp (using explicit instantiation in case of .cpp), preferably still being able to hint the compiler (MSVC and GCC) which of the functions should be inlined and which shouldn't, sure if such thing is possible with template functions, how can I achieve this or, if there is really no way (I hope there is), what would be the most optimal compromise?
As you know, the compiler may elect to inline a function, whether or not it has the inline specifier; the inline specifier is just a hint.
There is no standard way to force inlining or prevent inlining; however, most C++ compilers support syntactic extensions for accomplishing just that. MSVC supports a __forceinline keyword to force inlining and #pragma auto_inline(off) to prevent it. G++ supports always_inline and noinline attributes for forcing and preventing inlining, respectively. You should refer to your compiler's documentation for details, including how to enable diagnostics when the compiler is unable to inline a function as requested.
If you use those compiler extensions, then you should be able to hint to the compiler whether a function is inlined or not.
In general, I recommend to have all "simple" member function definitions gathered together in a single file (usually the header), by which I mean, if the member function does not require very many more #includes above the set of #includes required to define the classes/templates. Sometimes, for example, a member function definition will require #include <algorithm>, but it is unlikely that the class definition requires <algorithm> to be included in order to be defined. Your compiler is able to skip over function definitions that it does not use, but the larger number of #includes can noticeably lengthen compile times, and it is unlikely that you will want to inline these non-"simple" functions anyway.
3. If compilers are so smart these days that they can make better choices about which function should be inlined and which should be called and are capable of link-time code generation and link-time optimization, which effectively allows them looking into a .cpp-located function definition at link time to decide its fate about being inlined or called, then maybe a good solution would be simply moving all the definitions into respective .cpp files?
If you place all of your function definitions into CPP files, then you will be relying on LTO for mostly all function inlining. This may not be what you want for the following reasons:
At least with MSVC's LTCG, you give up the ability to force inlining (See inline, __inline, __forceinline.)
If the CPP files are linked to a shared library, then programs linking with the shared libraries will not benefit from LTO inlining of library functions. This is because the compiler intermediate language (IL)—the input to LTO—has been discarded and is not available in the DLL or SO.
If Under The Hood: Link-time Code Generation is still correct, "calls to functions in static libraries can't be optimized".
The linker would be performing all inlining, which might be a lot slower than having the compiler perform some inlining at compile time.
The compiler's LTO implementation might have bugs that cause it to not inline certain functions.
Use of LTO might impose certain limitations on projects using your library. For example, according to Under The Hood: Link-time Code Generation, "precompiled headers and LTCG are incompatible". The /LTCG (Link-time Code Generation) MSDN page has other notes, such as "/LTCG is not valid for use with /INCREMENTAL".
If you keep the likely-to-be-inlined function definitions in the header files, then you could use both compiler inlining and LTO. On the other hand, moving all function definitions into CPP files will restrict compiler inlining to only within the translation units.

I don't know where you learnt that, but templates are not "automatically qualified as inline by the compiler regardless of whether it's marked with inline or not". Templates and inline functions both have what is sometimes called "vague linkage" meaning their definitions can be present in multiple objects without error and the linker will use one of the definitions and discard the others. But the fact templates and inline functions both have vague linkage doesn't mean templates are automatically inline. Lions and tigers are both big cats but that doesn't mean lions are tigers.
Unless you know all the instantiations you are using in advance you can't always use explicit instantiation e.g. if you're writing a template library for others to use then you can't provide all the explicit instantiations, so you must define the template in .h (or .inl) files that the user of the code can #include. If you do know all the instantiations in advance then using explicit instantiations in .cpp files has the advantage of improving compilation time, because the compiler only instantiates the templates once in the file containing the explicit instantiations, not in every file that uses them. But that has nothing to do with inlining. For a function to be inlined its definition must be visible to the code calling it, so if you only define function templates (or member functions of class templates) in a .cpp file then they can't be inlined anywhere except in that file. If you define them in a .cpp file and do qualify them as inline then you might cause problems trying to call them from other files, which can't see the inline keyword (if a function is declared inline in one translation unit it must be declared inline in all translation units in which it appears, [dcl.fct.spec]/4.)
For what it's worth, I don't generally bother using .inl files. I just define templates directly in .h files, which gives one less file to deal with. Everything's in one place, and it just works: all files that use the templates can see the definitions and choose to inline them if desired. You can still use explicit instantiations in that case too, to improve compilation time and reduce object file size, without sacrificing inlining opportunities.
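As a rough sketch of that combination (definition in the header plus explicit instantiation), the three pieces might look as follows. The file names and the largest and pick functions are hypothetical, and everything is shown in one listing only so that it compiles as a single unit; the extern template form requires C++11.

```cpp
// ---- largest.hh (hypothetical header) ----
// The definition stays in the header, so callers can still see it.
template <typename T>
T largest(T a, T b) { return a < b ? b : a; }

// Explicit instantiation declaration (C++11): files that include this
// header should not generate their own out-of-line copy of largest<int>.
extern template int largest<int>(int, int);

// ---- largest.cc (hypothetical source file) ----
// Explicit instantiation definition: the one place where the
// out-of-line copy of largest<int> is compiled.
template int largest<int>(int, int);

// ---- user.cc (hypothetical caller) ----
int pick(int a, int b) { return largest(a, b); }
```

Types that have no explicit instantiation, such as largest<double>, are still instantiated implicitly wherever they are used.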
Why would that be better than just defining your template code in headers, where it belongs? What exactly are you trying to achieve? If it's fewer files, put the template code in headers. That will always work, the compiler can choose to inline everything without needing LTO, and you only have one file per class template (and you can still use explicit instantiation to improve compilation times). If you're trying to move all your code into .cpp files (which I think you're focusing on too much) then go ahead and do it. I think it's a bad idea and will probably cause problems (link-time optimisation still has issues with the only compiler I've tried using it with, and certainly won't make compilation any faster), but if that's what you want, do whatever floats your boat.
It seems like your questions revolve around a misunderstanding here:
I would have all my class implementations in .inl files only if I knew a reliable way to prevent some functions from being inlined,
If all your template definitions are in header files you don't need "a reliable way to prevent some functions from being inlined". As I said above, templates are not automatically inline just because they're in headers, and if they're too large to inline the compiler won't inline them. First problem solved. Secondly:
or in .cpp files if inline keyword could strongly suggest compiler to inline some of the functions, which, of course, it does not, especially if a function marked with inline is located in a .cpp file.
As I said above, marking a function inline in a .cpp file is ill-formed unless it's also marked inline in the header and the function is never used in any other .cpp file. Doing this just makes life difficult and can cause linker errors. Why bother?
Again, all signs point to putting your template definitions in headers. You can still use explicit instantiation (as GCC does for std::string, as mentioned in the post you link to), so you get the best of both worlds. The only thing this doesn't achieve is hiding the implementations from users of the templates, but it doesn't sound like that's your aim anyway. If it is, provide a non-template function API, which can be implemented in terms of templates in a single .cpp file.
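If hiding the implementation were the goal, a non-template API over a private template could be sketched like this. The names sum_ints, sum_doubles and sum_impl are invented for the example, and the header and source file are shown together only so the listing is self-contained.

```cpp
#include <cstddef>

// ---- sum.hh (hypothetical public header): no templates visible ----
int sum_ints(const int* data, std::size_t n);
double sum_doubles(const double* data, std::size_t n);

// ---- sum.cc (hypothetical source file): the template lives here ----
namespace {
// Unnamed namespace gives internal linkage, so users of sum.hh never
// see this template and cannot instantiate it themselves.
template <typename T>
T sum_impl(const T* data, std::size_t n) {
    T total{};
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}
}  // namespace

// The public functions are thin wrappers that pick an instantiation.
int sum_ints(const int* data, std::size_t n) { return sum_impl(data, n); }
double sum_doubles(const double* data, std::size_t n) { return sum_impl(data, n); }
```

The cost of this layout is that callers in other files can no longer have the template body inlined without LTO.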

This is not a complete answer.
I read that clang and LLVM are able to do very comprehensive link-time optimization, including link-time inlining. To enable it, compile and link with -flto when using clang++ (older clang releases turned LTO on at -O4, but in current releases -O4 is equivalent to -O3). The object files will then contain LLVM bitcode instead of machine code, which is what makes this possible. This feature should therefore allow you to put all of your definitions in the .cpp files, knowing that they can still be inlined where the optimizer considers it worthwhile.
Btw, the length of a function body is not the only thing that determines whether it will be inlined. A lengthy function that is only called from one location can easily be inlined at that location.

Multiple definitions of a function template

Suppose a header file defines a function template. Now suppose two implementation files #include this header, and each of them has a call to the function template. In both implementation files the function template is instantiated with the same type.
// header.hh
template <typename T>
void f(const T& o)
{
    // ...
}

// impl1.cc
#include "header.hh"
void fimpl1()
{
    f(42);
}

// impl2.cc
#include "header.hh"
void fimpl2()
{
    f(24);
}
One might expect the linker to complain about multiple definitions of f(). Indeed, if f() weren't a template, that would be the case.
How come the linker doesn't complain about multiple definitions of f()?
Is it specified in the standard that the linker must handle this situation gracefully? In other words, can I always count on programs similar to the above to compile and link?
If the linker can be clever enough to disambiguate a set of function template instantiations, why can't it do the same for regular functions, given they are identical as is the case for instantiated function templates?

The GNU C++ compiler's manual has a good discussion of this. An excerpt:
C++ templates are the first language feature to require more intelligence from the environment than one usually finds on a UNIX system. Somehow the compiler and linker have to make sure that each template instance occurs exactly once in the executable if it is needed, and not at all otherwise. There are two basic approaches to this problem, which are referred to as the Borland model and the Cfront model.
Borland model
Borland C++ solved the template instantiation problem by adding the code equivalent of common blocks to their linker; the compiler emits template instances in each translation unit that uses them, and the linker collapses them together. The advantage of this model is that the linker only has to consider the object files themselves; there is no external complexity to worry about. The disadvantage is that compilation time is increased because the template code is being compiled repeatedly. Code written for this model tends to include definitions of all templates in the header file, since they must be seen to be instantiated.
Cfront model
The AT&T C++ translator, Cfront, solved the template instantiation problem by creating the notion of a template repository, an automatically maintained place where template instances are stored. A more modern version of the repository works as follows: As individual object files are built, the compiler places any template definitions and instantiations encountered in the repository. At link time, the link wrapper adds in the objects in the repository and compiles any needed instances that were not previously emitted. The advantages of this model are more optimal compilation speed and the ability to use the system linker; to implement the Borland model a compiler vendor also needs to replace the linker. The disadvantages are vastly increased complexity, and thus potential for error; for some code this can be just as transparent, but in practice it can be very difficult to build multiple programs in one directory and one program in multiple directories. Code written for this model tends to separate definitions of non-inline member templates into a separate file, which should be compiled separately.
When used with GNU ld version 2.8 or later on an ELF system such as GNU/Linux or Solaris 2, or on Microsoft Windows, G++ supports the Borland model. On other systems, G++ implements neither automatic model.

In order to support C++, the linker is smart enough to recognize that they are all the same function and throws out all but one.
EDIT: clarification:
The linker doesn't compare function contents and determine that they are the same.
Templated functions are marked as such and the linker recognizes that they have the same signatures.
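One observable consequence of keeping a single copy is that state inside an instantiation is shared across files. The following is only a sketch: counter, bump_from_impl1 and bump_from_impl2 are hypothetical names, and the "files" are collapsed into one listing, so here the sharing is trivially true. In a real build the two callers would live in separate .cpp files and the linker's merging is what makes them see the same variable.

```cpp
// ---- counter.hh (hypothetical header) ----
template <typename T>
int& counter() {
    static int n = 0;  // one variable per instantiation in the whole program
    return n;
}

// ---- impl1.cc (hypothetical) ----
void bump_from_impl1() { ++counter<int>(); }

// ---- impl2.cc (hypothetical) ----
void bump_from_impl2() { ++counter<int>(); }
```

After both functions have run once, counter<int>() refers to a single variable holding 2, while counter<long>() is a different instantiation with its own untouched variable.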

This is more or less a special case just for templates.
The compiler only generates the template instantiations that are actually used. Since it has no control over what code will be generated from other source files, it has to generate the template code once for each file, to make sure that the method gets generated at all.
Since it's difficult to solve this (the C++03 standard has an export keyword for templates, but g++ doesn't implement it, and C++11 removed it), the linker simply accepts the multiple definitions.
