Clang/GCC Compiler Intrinsics without corresponding compiler flag - c++

I know there are similar questions, but compiling different files with different flags is not an acceptable solution here, since it would complicate the codebase very quickly. An answer of "No, it is not possible" will do.
Is it possible, in any version of Clang or GCC, to compile intrinsic functions for SSE 2/3/3S/4.1 while only allowing the compiler to use the SSE instruction set for its own optimizations?
EDIT: For example, I want the compiler to turn _mm_load_si128() into movdqa, but the compiler must not emit this instruction anywhere other than at this intrinsic function, similar to how the MSVC compiler works.
EDIT2: I have a dynamic dispatcher in place and several versions of a single function written with intrinsics for different instruction sets. Using multiple files would make this much harder to maintain, as the same version of the code would span multiple files, and there are a lot of functions of this type.
EDIT3: Example source code as requested: https://github.com/AviSynth/AviSynthPlus/blob/master/avs_core/filters/resample.cpp or most files in that folder, really.

Here is an approach using gcc that might be acceptable. All source code goes into a single source file. The single source file is divided into sections. One section generates code according to the command line options used. Functions like main() and processor feature detection go in this section. Another section generates code according to a target override pragma. Intrinsic functions supported by the target override value can be used. Functions in this section should be called only after processor feature detection has confirmed the needed processor features are present. This example has a single override section for AVX2 code. Multiple override sections can be used when writing functions optimized for multiple targets.
// temporarily switch target so that all x64 intrinsic functions will be available
#pragma GCC push_options
#pragma GCC target ("arch=core-avx2")
#include <intrin.h>
// restore the target selection
#pragma GCC pop_options
//----------------------------------------------------------------------------
// the following functions will be compiled using default code generation
//----------------------------------------------------------------------------
int dummy1 (int a) {return a;}
//----------------------------------------------------------------------------
// the following functions will be compiled using core-avx2 code generation
// all x64 intrinsic functions are available
#pragma GCC push_options
#pragma GCC target ("arch=core-avx2")
//----------------------------------------------------------------------------
static __m256i bitShiftLeft256ymm (__m256i *data, int count)
{
    __m256i innerCarry, carryOut, rotate;
    innerCarry = _mm256_srli_epi64 (*data, 64 - count);                      // carry outs in bit 0 of each qword
    rotate = _mm256_permute4x64_epi64 (innerCarry, 0x93);                    // rotate ymm left 64 bits
    innerCarry = _mm256_blend_epi32 (_mm256_setzero_si256 (), rotate, 0xFC); // clear lower qword
    *data = _mm256_slli_epi64 (*data, count);                                // shift all qwords left
    *data = _mm256_or_si256 (*data, innerCarry);                             // propagate carries from low qwords
    carryOut = _mm256_xor_si256 (innerCarry, rotate);                        // clear all except lower qword
    return carryOut;
}
//----------------------------------------------------------------------------
// the following functions will be compiled using default code generation
#pragma GCC pop_options
//----------------------------------------------------------------------------
int main (void)
{
    return 0;
}
//----------------------------------------------------------------------------
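On newer GCC (4.9+) and Clang, the same effect is also available per function via the target attribute, without the pragma bracketing. A minimal sketch, with an invented function name (as with the pragma approach, such a function must only be called after runtime feature detection has confirmed SSE4.1 support):

```cpp
#include <immintrin.h>

// Per-function alternative to the pragma sections above: the target
// attribute enables SSE4.1 for this one function only, while the rest
// of the file is compiled with the default instruction set.
__attribute__((target("sse4.1")))
int horizontal_min_u16(const unsigned short* p)  // p points at 8 values
{
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    __m128i m = _mm_minpos_epu16(v);   // SSE4.1 instruction PHMINPOSUW
    return _mm_extract_epi16(m, 0);    // the minimum of the 8 values
}
```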

There is no way to control the instruction set used by the compiler other than the switches on the compiler itself. In other words, there are no pragmas or other features for this, just the overall compiler flags.
This means that the only viable solution for achieving what you want is to use -msseX and split your source into multiple files (of course, you can always use various clever #include tricks to keep a single text file as the main source, and just include the same file in multiple places).
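A hypothetical sketch of that #include trick, with invented file and function names. In a real build the shared body would live in its own file and each wrapper .cpp would #include it, compiled with its own -msseX flag; here both instantiations are shown in one file for brevity:

```cpp
// In a real layout, the body below would live in, say, resample_impl.h,
// and each wrapper .cpp would define the name macro before including it.
#define DEFINE_RESAMPLE(IMPL_NAME) \
    int IMPL_NAME(int x) { return x * 2; /* shared implementation body */ }

DEFINE_RESAMPLE(resample_sse2)   // wrapper TU built with -msse2
DEFINE_RESAMPLE(resample_sse41)  // wrapper TU built with -msse4.1
#undef DEFINE_RESAMPLE
```

This keeps one copy of the implementation text while still letting each translation unit be compiled with different flags.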
Of course, the source code of the compiler is available. I'm sure the maintainers of GCC and Clang/LLVM will happily take patches that improve on this. But bear in mind that the path from "parsing the source" to "emitting instructions" is quite long and complicated. What should happen if we do this:
#pragma use_sse=1
void func()
{
... some code goes here ...
}
#pragma use_sse=3
void func2()
{
...
func();
...
}
Now, func is short enough to be inlined; should the compiler inline it? If so, should it use SSE1 or SSE3 instructions for func()?
I understand that YOU may not care about that sort of difficulty, but the maintainers of Clang and GCC will indeed have to deal with this in some way.
Edit:
In the header files declaring the SSE intrinsics (and many other intrinsics), a typical function looks something like this:
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_ss (__m128 __A, __m128 __B)
{
return (__m128) __builtin_ia32_addss ((__v4sf)__A, (__v4sf)__B);
}
The __builtin_ia32_addss builtin is only available in the compiler when the -msse option is enabled. So if you convince the compiler to still let you use _mm_add_ss() when you have -mno-sse, it will give you an error: "__builtin_ia32_addss is not declared in this scope" (I just tried).
It would probably not be very hard to change this particular behaviour - there are probably only a few places where the code does the "introduce builtin functions". However, I'm not convinced that there are further issues in the code, later on when it comes to actually issuing instructions in the compiler.
I have done some work with "builtin functions" in a Clang-based compiler, and unfortunately, there are several steps involved in getting from the "parser" to the "code generation", where the builtin function gets involved.
Edit2:
Compared to GCC, solving this for Clang is even more complex, in that the compiler itself has understanding of SSE instructions, so it simply has this in the header file:
static __inline__ __m128 __attribute__((__always_inline__, __nodebug__))
_mm_add_ps(__m128 __a, __m128 __b)
{
return __a + __b;
}
The compiler will then know that to add a couple of __m128 values, it needs to produce the correct SSE instruction. I have just downloaded Clang (my work on Clang is not related to SSE at all, just builtin functions in general, but it was enough to understand roughly how builtin functions work).
However, from your perspective, the fact that it's not a builtin function makes it worse, because the operator+ translation is much more complicated. I'm pretty sure the compiler just turns it into an "add these two things" and passes that to LLVM for further work; LLVM is the part that understands SSE instructions and so on. For your purposes this makes it worse, because the fact that this was an "intrinsic function" is now pretty much lost: the compiler deals with it just as if you'd written a + b, where a and b happen to be types that are 128 bits wide. That makes it even more complicated to generate "the right instructions" here while keeping "all other" instructions at a different SSE level.

Related

Using MSVC's __popcnt in a constexpr function

Background: I am trying to repurpose some C++ code written for GCC in an MSVC project. I have been refactoring the code to make it compatible with the MSVC compiler.
Simplified, one of the functions originally was this:
[[nodiscard]] constexpr int count() const noexcept {
    return __builtin_popcountll(mask); // GCC-specific function
}
Where mask is a 64-bit member variable. The obvious conversion to MSVC is:
[[nodiscard]] constexpr int count() const noexcept {
    return __popcnt64(mask); // MSVC replacement
}
However, it doesn't compile, because __popcnt64 is not allowed in a constexpr function.
I am using C++17, and I would prefer to avoid having to switch to C++20 if possible.
Is there a way to make it work?
You cannot make a non-constexpr function become a constexpr one. If the implementation doesn't declare it constexpr, then that's it. You will have to write your own, which would be difficult in C++17.
It depends on the goal:
Just count bits at compile time, and possibly at runtime. Then just implement your own constexpr bit counting and don't use __popcnt64. You can look into Wikipedia's Hamming weight article for ideas.
Use the popcnt instruction at runtime. Then you need to implement a compile-time / runtime distinction, so as to use different compile-time and runtime implementations.
For the compile-time/runtime distinction in C++20, you would use if (std::is_constant_evaluated()) { ... } else { ... }.
In MSVC, std::is_constant_evaluated is implemented via the compiler magic __builtin_is_constant_evaluated(), which happens to compile and work properly in C++17 as well. So you can:
constexpr int popcount(unsigned long long x)
{
    if (__builtin_is_constant_evaluated())
    {
        return -1; // TODO: count bits
    }
    else
    {
        return __popcnt64(x);
    }
}
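A possible body for the TODO branch, assuming a portable Kernighan-style loop is acceptable (the function name here is invented; the runtime branch would still call __popcnt64 on MSVC):

```cpp
#include <cstdint>

// Portable constexpr bit count (Hamming weight): x &= x - 1 clears the
// lowest set bit, so the loop runs once per set bit.
constexpr int popcount_constexpr(std::uint64_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;  // clear the lowest set bit
        ++n;
    }
    return n;
}

static_assert(popcount_constexpr(0xFFULL) == 8, "eight bits set");
static_assert(popcount_constexpr(0) == 0, "no bits set");
```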
Note: __builtin_popcountll compiles into either the popcnt instruction or bit counting via bit hacks, depending on compilation flags. MSVC's __popcnt64 always compiles into the popcnt instruction. If the goal is to support older CPUs that do not have the popcnt instruction, you'd have to provide CPU detection (compile-time or runtime, again depending on goals) and a fallback, or not use __popcnt64 at all.
The question has already been answered. So, just a side note.
Building your own efficient popcount function would probably be the best solution.
And for that you may reuse the very old wisdom from the book "Hacker's Delight" by Henry S. Warren. This book is from the time when programmers and algorithm developers worked on solutions to minimize the use of precious assembler statements, both for ROM consumption (yes, indeed) and for performance.
You will find many very efficient solutions there, some completely loop-free with an astonishingly low number of assembler instructions, for example using the divide-and-conquer method.
It is worth a visit.

Practical approach to do vectorization on different CPUs [duplicate]

What is the best way to implement multiple versions of the same function that uses a specific CPU instructions if available (tested at run time), or falls back to a slower implementation if not?
For example, x86 BMI2 provides a very useful PDEP instruction. How would I write C code such that it tests BMI2 availability on the executing CPU at startup and uses one of two implementations: one that uses the _pdep_u64 intrinsic (available with -mbmi2), and another that does the bit manipulation "by hand" in C? Is there any built-in support for such cases? How would I make GCC compile for an older arch while still providing access to the newer intrinsic? I suspect execution is faster if the function is invoked via a global function pointer rather than an if/else every time?
You can declare a function pointer and point it at the correct version at program startup by calling cpuid to determine the current architecture.
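A sketch of that startup dispatch, using GCC/Clang's __builtin_cpu_supports in place of a hand-rolled CPUID check; the names are illustrative, and popcount stands in for the real work:

```cpp
#include <cstdint>

// Portable fallback, usable on any CPU.
static int popcount_generic(std::uint64_t x)
{
    int n = 0;
    while (x) { x &= x - 1; ++n; }  // clear the lowest set bit each pass
    return n;
}

// Fast version; the target attribute lets the builtin expand to POPCNT
// inside this function even if the file is compiled for an older arch.
__attribute__((target("popcnt")))
static int popcount_hw(std::uint64_t x)
{
    return __builtin_popcountll(x);
}

static int (*popcount_impl)(std::uint64_t) = popcount_generic;

// Call once at program startup, before using popcount_impl.
static void init_dispatch(void)
{
    __builtin_cpu_init();  // harmless if the runtime already ran it
    if (__builtin_cpu_supports("popcnt"))
        popcount_impl = popcount_hw;
}
```

After init_dispatch() runs, every call through popcount_impl goes straight to the selected version with no per-call branch.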
But it's better to utilize the support in many modern compilers. Intel's ICC has had automatic function dispatching to select the optimized version for each architecture for a long time. I don't know the details, but it looks like it only applies to Intel's libraries. Besides, it only dispatches to the efficient version on Intel CPUs, and hence would be unfair to other manufacturers. There are many patches and workarounds for that on Agner Fog's CPU blog.
Later, a feature called Function Multiversioning was introduced in GCC 4.8. It adds the target attribute that you declare on each version of your function:
__attribute__ ((target ("default")))
int foo() { return 0; }  // a default version is required

__attribute__ ((target ("sse4.2")))
int foo() { return 1; }

__attribute__ ((target ("arch=atom")))
int foo() { return 2; }

int main() {
    int (*p)() = &foo;
    return foo() + p();
}
That duplicates a lot of code and is cumbersome, so GCC 6 added target_clones, which tells GCC to compile a function into multiple clones. For example, __attribute__((target_clones("avx2","arch=atom","default"))) void foo() {} will create three different versions of foo. More information about them can be found in GCC's documentation on function attributes.
The syntax was then adopted by Clang and ICC. Performance can even be better than a global function pointer, because the function symbols can be resolved at process load time instead of at runtime. It's one of the reasons Intel's Clear Linux runs so fast. ICC may also create multiple versions of a single loop during auto-vectorization.
Function multi-versioning in GCC 6
Function Multi-Versioning
The - surprisingly limited - usefulness of function multiversioning in GCC
Generate code for multiple SIMD architectures
Here's an example from The one with multi-versioning (Part II), along with its demo. It's about popcnt, but you get the idea:
__attribute__((target_clones("popcnt","default")))
int runPopcount64_builtin_multiarch_loop(const uint8_t* bitfield, int64_t size, int repeat) {
    int res = 0;
    const uint64_t* data = (const uint64_t*)bitfield;
    for (int r = 0; r < repeat; r++)
        for (int i = 0; i < size / 8; i++) {
            // popcount64_builtin_multiarch_loop() is defined elsewhere in the linked demo
            res += popcount64_builtin_multiarch_loop(data[i]);
        }
    return res;
}
Note that PDEP and PEXT are very slow on current AMD CPUs, so they should only be enabled on Intel.
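The target_clones mechanism described above can be exercised with a minimal self-contained function (GCC 6+ or a recent Clang on x86 Linux; the function name here is invented):

```cpp
#include <cstdint>

// Two clones are emitted: one compiled with POPCNT enabled and a default
// one. The dynamic loader binds the best match once, via an ifunc
// resolver at load time, so callers pay no per-call branch.
__attribute__((target_clones("popcnt", "default")))
int popcount64(std::uint64_t x)
{
    return __builtin_popcountll(x);
}
```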

Is it possible to calculate function length at compile time in C++?

I have this piece of code:
constexpr static VOID fStart()
{
    auto a = 3;
    a++;
}

__declspec(naked)
constexpr static VOID fEnd() {};

static constexpr auto getFSize()
{
    return (SIZE_T)((PBYTE)fEnd - (PBYTE)fStart);
}

static constexpr auto fSize = getFSize();
static BYTE func[fSize];
Is it possible to declare the "func[fSize]" array size as the size of "fStart()" during compilation, without using any standard library facilities? It is necessary in order to copy the full code of fStart() into this array later.
There is no method in standard C++ to get the length of a function.
You'll need to use a compiler specific method.
One method is to have the linker create a segment, and place your function in that segment. Then use the length of the segment.
You may be able to use some assembly language constructs to do this; depends on the assembler and the assembly code.
Note: in embedded systems, there are reasons to move function code, such as to On-Chip memory or swap to external memory, or to perform a checksum on the code.
The following calculates the "byte size" of the fStart function. However, the size cannot be obtained as a constexpr this way, because casting loses the compile-time const'ness (see for example Why is reinterpret_cast not constexpr?), and the difference of two unrelated function pointers cannot be evaluated without some kind of casting.
#pragma runtime_checks("", off)
__declspec(code_seg("myFunc$a")) static void fStart()
{ auto a = 3; a++; }
__declspec(code_seg("myFunc$z")) static void fEnd(void)
{ }
#pragma runtime_checks("", restore)
constexpr auto pfnStart = fStart; // ok
constexpr auto pfnEnd = fEnd; // ok
// constexpr auto nStart = (INT_PTR)pfnStart; // error C2131
const auto fnSize = (INT_PTR)pfnEnd - (INT_PTR)pfnStart; // ok
// constexpr auto fnSize = (INT_PTR)pfnEnd - (INT_PTR)pfnStart; // error C2131
On some processors, and with some known compilers and ABI conventions, you could do the opposite:
generate machine code at runtime.
For x86-64 on Linux, I know of GNU lightning, asmjit, and libgccjit doing so.
The elf(5) format knows the size of functions.
On Linux, you can generate shared libraries (perhaps by generating C or C++ code at runtime, like RefPerSys does and GCC MELT did, and then compiling it with gcc -fPIC -shared -O) and later dlopen(3) / dlsym(3) them. And dladdr(3) is very useful. You'll use function pointers.
Read also a book on linkers and loaders.
But you usually cannot move machine code without doing some relocation, unless that machine code is position-independent code (and quite often PIC is slower to run than ordinary code).
A related topic is garbage collection of code (or even of agents). You need to read the garbage collection handbook and take inspiration from implementations like SBCL.
Remember also that a good optimizing C++ compiler is allowed to unroll loops, inline-expand function calls, remove dead code, do function cloning, etc. So it may happen that machine code for functions is not even contiguous: two C functions foo() and bar() could share dozens of common machine instructions.
Read the Dragon Book, and study the source code of GCC (and consider extending it with your own GCC plugin). Look also at the assembler code produced by gcc -O2 -Wall -fverbose-asm -S. Some experimental variants of GCC might be able to generate OpenCL code running on your GPGPU (and then the very notion of a function's end does not make sense).
With plugins generated through C and C++, you could carefully remove them using dlclose(3), if you use Ian Taylor's libbacktrace and dladdr to explore your call stack. In 99% of cases it is not worth the trouble, since in practice a Linux process (on current x86-64 laptops in 2021) can do perhaps a million dlopen(3) calls, as my manydl.c program demonstrates (it generates "random" C code, compiles it into a unique /tmp/generated123.so, dlopens that, and repeats many times).
The only reason (on desktop and server computers) to overwrite machine code is for long-lasting server processes that generate machine code every second. If that were your scenario, you should have mentioned it (and generating JVM bytecode and using Java classloaders could make more sense).
Of course, on 16 bits microcontrollers things are very different.
Is it possible to calculate function length at compile time in C++?
No, because at run time some functions do not exist anymore.
The compiler has somehow removed them. Or cloned them. Or inlined them.
And for C++ this is practically important with standard containers: a lot of template expansion occurs, including for useless code which has to be removed by your optimizing compiler at some point.
(Think, in 2021, of compiling everything and linking with a recent GCC 10.2 or 11 using gcc -O3 -flto -fwhole-program: a function foo23 might be defined but never called, and then it is not inside the ELF executable.)

How to set ICC attribute "fp-model precise" for a single function, to prevent associative optimizations?

I am implementing Kahan summation, in a project that supports compilation with gcc47, gcc48, clang33, icc13, and icc14.
As part of this algorithm, I would like to disable optimizations that take advantage of the associativity of addition of real numbers. (Floating point operations are not associative.)
I would like to disable those optimizations only in the relevant function. I have figured out how to do this under gcc, using the "no-associative-math" attribute. How can I do this in icc or clang? I have searched without luck.
class KahanSummation
{
    // GCC declaration
    void operator()(float observation) __attribute__((__optimize__("no-associative-math")))
    {
        // Kahan summation implementation
    }
};
Other GCC attributes that would imply no-associative-math are no-unsafe-math-optimizations or no-fast-math.
Looking at one Intel presentation (PDF, slide 8) and others (PDF, slide 11), I want to set "fp-model precise" in ICC for only this function. The compilers I care about are ICC 13 and ICC 14.
class KahanSummation
{
    // ICC or clang declaration
    void operator()(float observation) __attribute__((????))
    {
        // Kahan summation implementation
    }
};
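For context, the compensated loop being protected might look like this (a sketch only; the question does not show the real body). Reassociation would let a compiler cancel (t - sum_) - y to zero and delete the compensation entirely, which is exactly what the attribute must prevent:

```cpp
// Minimal Kahan summation sketch: c_ carries the low-order bits that
// float addition into sum_ would otherwise discard.
class KahanSummation
{
public:
    void operator()(float observation)
    {
        float y = observation - c_;  // apply stored compensation
        float t = sum_ + y;          // low-order bits of y are lost here
        c_ = (t - sum_) - y;         // recover them for the next step
        sum_ = t;
    }
    float sum() const { return sum_; }
private:
    float sum_ = 0.0f;
    float c_ = 0.0f;
};
```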
__attribute__ is a GCC extension. Clang also supports that syntax to maintain some GCC compatibility, but the only optimization-related attribute it seems to support is optnone, which turns off all optimizations. ICC has a few optimization pragmas, but nothing that does what you want, from what I can tell. It appears to support #pragma float_control for VC++ compatibility, although I can't find anything on exactly how that pragma is supposed to be used in ICC's documentation, so you'll have to use VC++'s.
What you can do, though, is to define the function you want in a separate translation unit (i.e., cpp file):
// Your header
class KahanSummation
{
    // Declaration
    void operator()(float observation);
};

// Separate cpp file - implements only this function
void KahanSummation::operator()(float observation)
{
    // Kahan summation implementation
}
You can then compile the separate file using whatever compiler option you need to use, and link the resulting object file to the rest of your program.

Will C++ linker automatically inline functions (without "inline" keyword, without implementation in header)?

Will the C++ linker automatically inline "pass-through" functions, which are NOT defined in the header, and NOT explicitly requested to be "inlined" through the inline keyword?
For example, the following happens so often, and should always benefit from "inlining", that it seems every compiler vendor should have "automatically" handled it through "inlining" through the linker (in those cases where it is possible):
//FILE: MyA.hpp
class MyA
{
public:
    int foo(void) const;
};

//FILE: MyB.hpp
class MyB
{
private:
    MyA my_a_;
public:
    int foo(void) const;
};

//FILE: MyB.cpp
// PLEASE SAY THIS FUNCTION IS "INLINED" BY THE LINKER, EVEN THOUGH
// IT WAS NOT IMPLICITLY/EXPLICITLY REQUESTED TO BE "INLINED"?
int MyB::foo(void) const
{
    return my_a_.foo();
}
I'm aware the MSVC linker will perform some "inlining" through its Link Time Code Generation (LTCG), and that the GCC toolchain also supports Link Time Optimization (LTO) (see: Can the linker inline functions?).
Further, I'm aware that there are cases where this cannot be "inlined", such as when the implementation is not "available" to the linker (e.g., across shared library boundaries, where separate linking occurs).
However, if this code is linked into a single executable that does not cross DLL/shared-lib boundaries, I'd expect the compiler/linker vendor to automatically inline the function, as a simple and obvious optimization (benefiting both performance and size).
Are my hopes too naive?
Here's a quick test of your example (with a MyA::foo() implementation that simply returns 42). All these tests were with 32-bit targets - it's possible that different results might be seen with 64-bit targets. It's also worth noting that using the -flto option (GCC) or the /GL option (MSVC) results in full optimization - wherever MyB::foo() is called, it's simply replaced with 42.
With GCC (MinGW 4.5.1):
gcc -g -O3 -o test.exe myb.cpp mya.cpp test.cpp
the call to MyB::foo() was not optimized away. MyB::foo() itself was slightly optimized to:
Dump of assembler code for function MyB::foo() const:
0x00401350 <+0>: push %ebp
0x00401351 <+1>: mov %esp,%ebp
0x00401353 <+3>: sub $0x8,%esp
=> 0x00401356 <+6>: leave
0x00401357 <+7>: jmp 0x401360 <MyA::foo() const>
That is, the entry prologue is left in place but immediately undone (the leave instruction), and the code jumps to MyA::foo() to do the real work. However, this is an optimization that the compiler (not the linker) is doing, since it realizes that MyB::foo() simply returns whatever MyA::foo() returns. I'm not sure why the prologue is left in.
MSVC 16 (from VS 2010) handled things a little differently:
MyB::foo() ended up as two jumps - one to a 'thunk' of some sort:
0:000> u myb!MyB::foo
myb!MyB::foo:
001a1030 e9d0ffffff jmp myb!ILT+0(?fooMyAQBEHXZ) (001a1005)
And the thunk simply jumped to MyA::foo():
myb!ILT+0(?fooMyAQBEHXZ):
001a1005 e936000000 jmp myb!MyA::foo (001a1040)
Again - this was largely (entirely?) performed by the compiler, since if you look at the object code produced before linking, MyB::foo() is compiled to a plain jump to MyA::foo().
So to boil all this down - it looks like without explicitly invoking LTO/LTCG, linkers today are unwilling/unable to perform the optimization of removing the call to MyB::foo() altogether, even if MyB::foo() is a simple jump to MyA::foo().
So I guess if you want link time optimization, use the -flto (for GCC) or /GL (for the MSVC compiler) and /LTCG (for the MSVC linker) options.
Is it common? Yes, for mainstream compilers.
Is it automatic? Generally not. MSVC requires the /GL switch; gcc and clang, the -flto flag.
How does it work ? (gcc only)
The traditional linker used in the gcc toolchain is ld, and it's kind of dumb. Therefore, and it might be surprising, link-time optimization is not performed by the linker in the gcc toolchain.
Gcc has a specific intermediate representation on which the optimizations are performed that is language agnostic: GIMPLE. When compiling a source file with -flto (which activates the LTO), it saves the intermediate representation in a specific section of the object file.
When invoking the linker driver (note: NOT the linker directly) with -flto, the driver will read those specific sections, bundle them together into a big chunk, and feed this bundle to the compiler. The compiler reapplies the optimizations as it usually does for a regular compilation (constant propagation, inlining, and this may expose new opportunities for dead code elimination, loop transformations, etc...) and produces a single big object file.
This big object file is finally fed to the regular linker of the toolchain (probably ld, unless you're experimenting with gold), which performs its linker magic.
Clang works similarly, and I surmise that MSVC uses a similar trick.
It depends. Most compilers (linkers, really) support this kind of optimization. But in order for it to be done, the entire code-generation phase pretty much has to be deferred to link time. MSVC calls the option link-time code generation (LTCG), and it is enabled by default in release builds, IIRC.
GCC has a similar option, under a different name, but I can't remember which -O levels, if any, enables it, or if it has to be enabled explicitly.
However, "traditionally", C++ compilers have compiled a single translation unit in isolation, after which the linker has merely tied up the loose ends, ensuring that when translation unit A calls a function defined in translation unit B, the correct function address is looked up and inserted into the calling code.
If you follow this model, then it is impossible to inline functions defined in another translation unit.
It is not just some "simple" optimization that can be done "on the fly", like, say, loop unrolling. It requires the linker and compiler to cooperate, because the linker will have to take over some of the work normally done by the compiler.
Note that the compiler will gladly inline functions that are not marked with the inline keyword. But only if it is aware of how the function is defined at the site where it is called. If it can't see the definition, then it can't inline the call. That is why you normally define such small trivial "intended-to-be-inlined" functions in headers, making their definitions visible to all callers.
Inlining is not a linker function.
The toolchains that support whole-program optimization (cross-TU inlining) do so by not actually compiling anything at compile time, just parsing and storing an intermediate representation of the code. The linker then invokes the compiler, which does the actual inlining.
This is not done by default, you have to request it explicitly with appropriate command-line options to the compiler and linker.
One reason it is not and should not be default, is that it increases dependency-based rebuild times dramatically (sometimes by several orders of magnitude, depending on code organization).
Yes, any decent compiler is fully capable of inlining that function if you have the proper optimisation flags set and the compiler deems it a performance bonus.
If you really want to know, add a breakpoint before your function is called, compile your program, and look at the assembly. It will be very clear if you do that.
The compiler must be able to see the body of a function for there to be any chance of inlining it. The chances of this happening can be increased through the use of unity files and LTCG.
The inline keyword only acts as guidance for the compiler to inline functions when optimizing. In g++, the optimization levels -O2 and -O3 generate different levels of inlining. The g++ documentation specifies the following: (i) if -O2 is specified, -finline-small-functions is turned on; (ii) if -O3 is specified, -finline-functions is turned on along with all options for -O2; (iii) there is one more relevant option, "no-default-inline", which makes member functions inline only if the inline keyword is added.
Typically, the size of the function (the number of instructions in the assembly) and whether recursive calls are used determine whether inlining happens. There are plenty more options defined in the link below for g++:
http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
Please take a look and see which ones you are using, because ultimately the options you use determine whether your function is inlined.
Here is my understanding of what the compiler will do with functions:
If the function definition is inside the class definition, and assuming no scenarios exist which prevent inlining the function (such as recursion), the function will be inlined.
If the function definition is outside the class definition, the function will not be inlined unless the function definition explicitly includes the inline keyword.
Here is an excerpt from Ivor Horton's Beginning Visual C++ 2010:
Inline Functions
With an inline function, the compiler tries to expand the code in the body of the function in place of a call to the function. This avoids much of the overhead of calling the function and, therefore, speeds up your code.
The compiler may not always be able to insert the code for a function inline (such as with recursive functions or functions for which you have obtained an address), but generally, it will work. It's best used for very short, simple functions, such as our Volume() in the CBox class, because such functions execute faster and inserting the body code does not significantly increase the size of the executable module.
With function definitions outside of the class definition, the compiler treats the functions as a normal function, and a call of the function will work in the usual way; however, it's also possible to tell the compiler that, if possible, you would like the function to be considered as inline. This is done by simply placing the keyword inline at the beginning of the function header. So, for this function, the definition would be as follows:
inline double CBox::Volume()
{
    return l * w * h;
}
}