C++: Include vs LoadLibrary()

I am having some trouble understanding why both #include and LoadLibrary() are needed in C++. In C++, "#include" forces the preprocessor to replace the #include line with the contents of the file you are including (usually a header file containing declarations). As far as I understand, this enables me to use the routines I might want in the external libraries the headers belong to.
Why do I then need LoadLibrary()? Can't I just #include the library itself?
Just as a side note: In C#, which I am more familiar with, I just Add a Reference to a DLL if I want to use types or routines from that DLL in my program. I do not have to #include anything, as the .NET framework apparently automatically searches all the referenced assemblies for the routines I want to use (as specified by the namespace)
Thank you very much in advance.
Edit: Used the word "definitions", but meant "declarations". Now fixed.
Edit 2: Tough to pick one answer, many good replies. Thanks for all contributions.

C++ uses a full separate compilation model; you can even compile
against code which hasn't been written. (This often occurs in
large projects.) When you include a file, all you are doing is
telling the compiler that the functions, etc. exist. You do not
provide an implementation (except for inline functions and
templates). In order to execute the code, you have to provide
the implementation, by linking it into your application. This
can occur in several different ways:
You have the source files; you compile them along with your
sources, and link in the resulting objects.
You have a static library; you must link against it.
You have a dynamic library. Here, what you must do will
depend on the implementation: under Windows, you must link
against a .lib stub, and put the .dll somewhere where the
runtime will find it when you execute. (Putting it in the same
directory as your application is usually a good solution.)
I don't quite understand your need to call LoadLibrary. The
only time I've needed this is when I've intentionally avoided
using anything in the library directly, and want to load it
conditionally, using GetProcAddress to get the addresses of the
functions I need.
EDIT:
Since I was asked to clarify "linking": program translation
(from the source to an executable) takes place in a number of
steps. In traditional terms, each translation unit is
"compiled" into an object file, which contains an image of the
machine instructions, but with unfilled spaces for external
references. For example, if you have:
extern void function();
in your source (probably via inclusion of a header), and you
call function, the compiler will leave the address field of
the call instruction blank, since it doesn't know where the
function will be located. Linking is the process of taking all
of the object files, and filling in these blanks. One of the
object files will define function, and the linker will
establish the actual address in the memory image, and fill in
the blank referring to function with the address of function
in that image. The result is a complete memory image of the
executable. On the early systems I worked on: literally. The
OS would simply copy the executable file directly into memory,
and then jump into it. Things like virtual memory and shared,
write protected code segments make this a little more
complicated today, but for statically linked libraries or object
files (my first two cases above), the differences aren't that
great.
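As a minimal sketch of the declaration/definition split described above (the names add and twice_sum are made up for illustration): the caller is compiled knowing only the declaration, and the definition, which would normally live in another translation unit or a library, is what the linker connects it to.

```cpp
// Normally supplied by a header: a declaration only. It tells the compiler
// the name, parameters and return type, but not where the code lives.
int add(int a, int b);

// This caller compiles with just the declaration in scope; the target of the
// call is resolved later, when the definition is linked in.
int twice_sum(int a, int b) {
    return 2 * add(a, b);
}

// The definition. In a real project this would usually sit in a different
// translation unit or in a library; it is placed here only so that the
// sketch links on its own.
int add(int a, int b) {
    return a + b;
}
```

If the definition of add were removed, the file would still compile, and the failure would only show up as an unresolved external at link time.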
Modern system technologies have blurred the lines somewhat. For
example, most Java (and I think C#) compilers don't generate
classical object files, with machine code, but rather byte code,
and the compile and link phases, above, don't take place until
runtime. Some C++ compilers also only generate byte code, which
will be compiled when the code is "linked". This is done to
permit cross-module optimizations. And all modern systems
support dynamic linking: some of the blank addresses are left
blank until execution time. And dynamic linking can be implicit
or explicit: when it is implicit, the link phase will insert
information into the executable concerning the libraries it
needs, and where to find them, and the OS will link them,
implicitly, either when the executable is loaded, or on demand,
triggered by the code attempting to use one of the unfilled
address slots. When it is explicit, you normally don't have any
explicit references to the name in your code. In the case of
function, above, for example, you wouldn't have any code which
directly called function. Your code would, however, load the
dynamic library using LoadLibrary (or dlopen under Unix),
then request the address of a name, using GetProcAddress (or
dlsym), and call the function indirectly through the pointer
it received.
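To make the explicit case concrete, here is a rough sketch that loads a C runtime library by hand and calls cos through a pointer. The library names ("msvcrt.dll", "libm.so.6") and the helper name call_cos_dynamically are assumptions chosen for illustration; they fit a typical Windows or glibc-based Linux system and may differ elsewhere.

```cpp
#ifdef _WIN32
#include <windows.h>
#else
#include <dlfcn.h>
#endif

// The signature still has to be known at compile time, even though the
// address is only discovered at run time.
using unary_math_fn = double (*)(double);

// Loads a math library explicitly and calls cos() through a pointer.
// Returns false if the library or the symbol could not be found.
bool call_cos_dynamically(double x, double* result) {
#ifdef _WIN32
    HMODULE lib = LoadLibraryA("msvcrt.dll");   // assumed library name
    if (!lib) return false;
    auto fn = reinterpret_cast<unary_math_fn>(GetProcAddress(lib, "cos"));
#else
    void* lib = dlopen("libm.so.6", RTLD_LAZY); // assumed library name
    if (!lib) return false;
    auto fn = reinterpret_cast<unary_math_fn>(dlsym(lib, "cos"));
#endif
    const bool ok = (fn != nullptr);
    if (ok) *result = fn(x);                    // indirect call via the pointer
#ifdef _WIN32
    FreeLibrary(lib);
#else
    dlclose(lib);
#endif
    return ok;
}
```

Nothing in this code names cos as a symbol the linker has to resolve; the name only appears as a string handed to the loader at run time. On older glibc versions the program may need to be linked with -ldl.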

The #include directive is, like all preprocessor functionality, merely a text replacement. The text "#include <file>" is replaced with the contents of that file.
Typically (but not necessarily), this is used to include a header file which declares the functions that you want to use, i.e. you tell the compiler (which runs after the preprocessor) how some functions that you intend to use are named, what parameters they take, and what the return type is. It does not define what the function is actually doing.
You then also need an implementation of these functions. Usually, if you do not implement them in your program, you leave this task to the link stage. You give the linker a list of libraries that your program depends on, and the linker divines via some implementation-defined way (such as an "import library") what it needs to do to "make it work". The linker will produce some glue code and write some information into the executable that will make the loader automatically load the required libraries. Everything "just works" without you having to do anything special.
In some cases, however, you want to postpone the linker stage and do the loading "fully dynamically" by hand rather than automatically. This is when you have to call LoadLibrary() and GetProcAddress. The former brings the DLL into memory and does some setup (e.g. relocation), the latter gives you the address of a function that you want to call.
The #include in your code is still necessary so the compiler knows what to do with that pointer. Without the declaration you would still obtain an address, but the compiler would not know the parameter and return types, so it would not be possible to call the function in a meaningful way.
One reason why one would want to load a library manually (using LoadLibrary) is that it is more failsafe. If you link a program against a library and the library cannot be found (or a symbol cannot be found), then your application will not start up and the user will see a more or less obscure error message.
If LoadLibrary fails or GetProcAddress doesn't work, your program can in principle still run, albeit with reduced functionality.
Another example of using LoadLibrary might be to load an alternative version of a function from a different library (some programs implement "plugins" that way). The function "looks" the same to the compiler, as declared in the include file, but may behave differently, depending on whatever is in the loaded binary.
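The "reduced functionality" idea could look roughly like the sketch below. The library names ("fastmath_hypothetical.dll", "libfastmath_hypothetical.so") and the exported symbol "fast_sqrt" are invented for illustration, so on a normal machine the lookup fails and the built-in fallback is used.

```cpp
#include <cmath>

#ifdef _WIN32
#include <windows.h>
#else
#include <dlfcn.h>
#endif

using sqrt_fn = double (*)(double);

// Tries to find an optional routine in a hypothetical plugin library.
// Returns nullptr if the library or the symbol is missing. The library is
// deliberately left loaded, since the returned pointer refers into it.
sqrt_fn find_optional_sqrt() {
#ifdef _WIN32
    HMODULE lib = LoadLibraryA("fastmath_hypothetical.dll");
    if (!lib) return nullptr;
    return reinterpret_cast<sqrt_fn>(GetProcAddress(lib, "fast_sqrt"));
#else
    void* lib = dlopen("libfastmath_hypothetical.so", RTLD_NOW);
    if (!lib) return nullptr;
    return reinterpret_cast<sqrt_fn>(dlsym(lib, "fast_sqrt"));
#endif
}

// Uses the plugin when it is available and std::sqrt otherwise, so the
// program keeps running either way.
double compute_sqrt(double x) {
    static const sqrt_fn optional = find_optional_sqrt();
    return optional ? optional(x) : std::sqrt(x);
}
```

Had the same library been linked implicitly, its absence would have stopped the program at startup instead of merely disabling one code path.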

#include brings in source code only: symbol declarations for the compiler. A library (or a DLL) is object code: Use either LoadLibrary or link to a lib file to bring in object code.

LoadLibrary() causes the code module to be loaded from disk into your application's memory space for execution. This allows for dynamically loading code at runtime. You would not use LoadLibrary(), for example, if the code you want to use is compiled into a statically linked library. In that case you would provide the name of the .lib file that contains the code to the linker and it gets resolved at link time: the code is linked into your .exe, and the .lib does not need to be distributed with the .exe for it to execute.
LoadLibrary() creates a dependency on an external DLL, which must be present at the path provided to the call for the .exe to execute properly. If LoadLibrary() fails, your code must handle it appropriately, by either exiting gracefully or providing some other execution alternative. If you link to the DLL at load time instead of calling LoadLibrary(), you must provide a .lib file to the linker the same as you would for the static library above. This .lib file, however, does not contain code, just entry points for the actual code that resides in the .dll.
In both cases you must #include the headers for the code you wish to execute. This is required by the compiler in order to build function call signatures properly based on the type information provided by the header.
C# assemblies contain both type information and IL. A single reference is sufficient to satisfy the need for header information and binding to the code itself.

#include is static; the substitution is done at compile time. LoadLibrary() lets you load a DLL at runtime, for example based on user input.

Related

Load-time dynamic link library dispatching

I'd like my Windows application to be able to reference an extensive set of classes and functions wrapped inside a DLL, but I need to be able to guide the application into choosing the correct version of this DLL before it's loaded. I'm familiar with using dllexport / dllimport and generating import libraries to accomplish load-time dynamic linking, but I cannot seem to find any information on the interwebs with regard to possibly finding some kind of entry point function into the import library itself, so I can, specifically, use CPUID to detect the host CPU configuration, and make a decision to load a particular DLL based on that information. Even more specifically, I'd like to build 2 versions of a DLL, one that is built with /ARCH:AVX and takes full advantage of SSE - AVX instructions, and another that assumes nothing is available newer than SSE2.
One requirement: Either the DLL must be linked at load-time, or there needs to be a super easy way of manually binding the functions referenced from outside the DLL, and there are many, mostly wrapped inside classes.
Bonus question: Since my libraries will be cross-platform, is there an equivalent for Linux based shared objects?
I recommend that you avoid dynamic resolution of your DLL from your executable if at all possible, since it is just going to make your life hard, especially since you have a lot of exposed interfaces and they are not pure C.
Possible Workaround
Create a "chooser" process that presents the necessary UI for deciding which DLL you need, or maybe it can even determine it automatically. Let that process move whatever DLL has been decided on into the standard location (and name) that your main executable is expecting. Then have the chooser process launch your main executable; it will pick up its DLL from your standard location without having to know which version of the DLL is there. No delay loading, no wonkiness, no extra coding; very easy.
If this just isn't an option for you, then here are your starting points for delay loading DLLs. It's a much rockier road.
Windows
LoadLibrary() to get the DLL in memory: https://msdn.microsoft.com/en-us/library/windows/desktop/ms684175(v=vs.85).aspx
GetProcAddress() to get pointer to a function: https://msdn.microsoft.com/en-us/library/windows/desktop/ms683212(v=vs.85).aspx
OR possibly special delay-loaded DLL functionality using a custom helper function, although there are limitations and potential behavior changes.. never tried this myself: https://msdn.microsoft.com/en-us/library/151kt790.aspx (suggested by Igor Tandetnik and seems reasonable).
Linux
dlopen() to get the SO in memory: http://pubs.opengroup.org/onlinepubs/009695399/functions/dlopen.html
dlsym() to get pointer to a function: http://man7.org/linux/man-pages/man3/dlsym.3.html
To add to qexyn's answer, one can mimic delay loading on Linux by generating a small static stub library which would dlopen on first call to any of its functions and then forward actual execution to the shared library. Such a stub library can be generated automatically by a custom project-specific script or by Implib.so:
# Generate stub
$ implib-gen.py libxyz.so
# Link it instead of -lxyz
$ gcc myapp.c libxyz.tramp.S libxyz.init.c

Do I need to distribute a header file and a lib file with a DLL?

I'm updating a DLL for a customer and, due to corporate politics - among other things - my company has decided to no longer share source code with the customer.
Previously, I assume they had all the source and imported it as a VC++6 project. Now they will have to link to the pre-compiled DLL. I would imagine, at minimum, that I'll need to distribute the *.lib file with the DLL so that DLL entry-points can be defined. However, do I also need to distribute the header file?
If I can get away with not distributing it, how would the customer go about importing the DLL into their code?
Yes, you will need to distribute the header along with your .lib and .dll
Why ?
At least two reasons:
because C++ needs to know the return type and arguments of the functions in the library (roughly speaking, most compilers use name mangling to map the C++ function signature to the library entry point).
because if your library uses classes, the C++ compiler needs to know their layout to generate code in the library client (e.g. how many bytes to put on the stack for parameter passing).
Additional note: If you're asking this question because you want to hide implementation details from the headers, you could consider the pimpl idiom. But this would require some refactoring of your code and could also have some consequences in terms of performance, so consider it carefully
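For reference, a bare-bones sketch of the pimpl idiom mentioned above. The class name Counter is made up; the point is only that the distributed header exposes a pointer to an incomplete type, so the private data layout stays in the library's own source file.

```cpp
#include <memory>

// --- What would go in the distributed header ---
class Counter {
public:
    Counter();
    ~Counter();                    // defined where Impl is a complete type
    void increment();
    int value() const;
private:
    struct Impl;                   // layout hidden from clients
    std::unique_ptr<Impl> impl_;
};

// --- What would stay in the library's private source file ---
struct Counter::Impl {
    int count = 0;
};

Counter::Counter() : impl_(new Impl) {}
Counter::~Counter() = default;
void Counter::increment() { ++impl_->count; }
int Counter::value() const { return impl_->count; }
```

The cost hinted at above is visible here: every object needs a heap allocation, and every member access goes through one extra pointer.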
However, do I also need to distribute the header file?
Yes. Otherwise, your customers will have to manually declare the functions themselves before they can use it. As you can imagine, that will be very error prone and a debugging nightmare.
In addition to what others explained about header/LIB file, here is different perspective.
The customer will be able to reverse-engineer the DLL anyway, with basic tools such as Dependency Walker, to find out what system DLLs your DLL is using and what functions are being used by your DLL (for example some function from AdvApi32.DLL).
If you want your DLL to be obscured, your DLL must:
Load all custom DLLs dynamically (and if not possible, do the next thing anyhow)
Call GetProcAddress on all functions you want to call (OpenProcessToken from ADVAPI32.DLL, for example)
This way, at least dependency walker (without tracing) won't be able to find what functions (or DLLs) are being used. You can load the functions of system DLL by ordinal, and not by name so it becomes more difficult to reverse-engineer by text search in DLL.
Debuggers will still be able to debug your DLL (among other tools) and reverse engineer it. You need to find techniques to prevent debugging the DLL. For example, the very basic API is IsDebuggerPresent. Other advanced approaches are available.
Why did I say all this? Well, if you intend not to deliver the header/LIB, the customer will still be able to find the exported functions and use them. You, as a DLL provider, must also provide programming elements with it. If you must hide, then hide it totally.
One alternative you could use is to pass only the DLL and have the customer load it dynamically using LoadLibrary() + GetProcAddress(). You would still need to let your customer know the signatures of the functions in the DLL, though.
More detailed sample here:
Dynamically load a function from a DLL

How to hide a C++ function definition before releasing source code

I am building a Windows (Visual Studio, C++ based) console application and would like to release the source code for it. However, I do not want the definition of a particular function to be visible in it. Is there a way to pre-compile it (just the file containing the definition) so that no one can view it, but the rest of the source is visible and can be built/run using the 'pre-compiled' function definition?
You have two basic choices:
have that function in a compiled form
if you only want to ship source code, you could compile a shared library, then find or write a utility to generate C/C++ source code with a character array containing the binary file content, then your program can ofstream::write() it to a file and link (e.g. dlopen()) the shared library on the fly at runtime (this doesn't buy you much over just shipping the shared library unless you have pressing reasons for needing a "source-only" release, and the above qualifies for your purposes)
obfuscate the source code in some way that removes the value or insight potentially gained by reading it
there are a number of source code obfuscation utilities around (specific recommendations are off-topic for S.O.)
one form of obfuscation is to generate an assembly language version (e.g. g++ -S, cl /FA) of the C++ sources, which would be verbose and harder for most programmers to understand and modify further
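The "character array written out at runtime" option from the first choice could be sketched as follows. The four bytes in kEmbeddedBlob are a placeholder, not a real library, and the function name write_embedded_blob is hypothetical; a generator utility would emit the full file contents instead.

```cpp
#include <fstream>
#include <string>

// Placeholder bytes standing in for a generated array; a real generator
// would emit the complete contents of the compiled shared library here.
static const unsigned char kEmbeddedBlob[] = { 0x7f, 'E', 'L', 'F' };

// Writes the embedded bytes to `path` so that the file can subsequently be
// loaded with dlopen() or LoadLibrary(). Returns false on I/O failure.
bool write_embedded_blob(const std::string& path) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(kEmbeddedBlob),
              sizeof kEmbeddedBlob);
    return static_cast<bool>(out.flush());
}
```

Anyone with the source can of course recover the binary from the array, so this hides no more than shipping the shared library directly would.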
No, you can't. When you release the source code, then the information about the function is there. If you release it in some way 'pre-compiled' and the function is still called, then the information is still there. If you require the function to stay a secret you must not release it.

loading multiple similar shared libraries on linux

I am working on code that creates 'models'. A model is created from an XML file and part of its representation is, generated on the fly, C code. This C code is compiled, on the fly, into a shared library that is dynamically loaded (using POCO shared lib class). The shared library mainly contains small functions, and part of a models creation is to populate function pointers to these functions. All this works fine. However, creating several models, at the same time, causes problems.
I believe it has to do with how the dynamic loading works on Linux, and the fact that each shared library contains functions with identical names. Does PIC cause this? The problem manifests itself as nonsense data being retrieved from the shared libraries' functions.
So question is, how to load multiple (thousands) shared libraries, containing identical function names, on linux?
The above works fine on windows, where it seems that dynamically loaded libraries data/functions are kept perfectly isolated from each other.
The Poco SharedLibrary open function accepts flags. The default is Poco::SharedLibrary::SHLIB_GLOBAL, corresponding to dlopen's RTLD_GLOBAL; the alternative is Poco::SharedLibrary::SHLIB_LOCAL, corresponding to RTLD_LOCAL. See http://linux.die.net/man/3/dlopen for more info.
Passing the Poco::SharedLibrary::SHLIB_LOCAL flag fixed the problem.
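In terms of the underlying POSIX calls rather than POCO, the fix amounts to something like the sketch below. The helper names open_isolated and find_in are made up for illustration.

```cpp
#include <dlfcn.h>

// Opens a shared object with RTLD_LOCAL, so its symbols are not made
// available for resolving references in libraries that are loaded later.
void* open_isolated(const char* path) {
    return dlopen(path, RTLD_NOW | RTLD_LOCAL);
}

// Looks a name up through one specific handle, so that each model can obtain
// its own copy of an identically named function.
void* find_in(void* handle, const char* name) {
    return handle ? dlsym(handle, name) : nullptr;
}
```

Each generated library would get its own handle from open_isolated, and the model's function pointers would be filled via find_in on that handle.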
First, you can dlopen many hundreds of thousands of shared objects. My manydl.c demonstrates that.
Then, you can also generate C code, compile it, and dlopen the shared object, all this from the same process. My (obsolete in 2017) MELT plugin (for GCC, MELT provides a high level language to extend GCC) does that (and so does my manydl.c example).
However, I don't think you should keep identical (defined) function names in them. I suggest avoiding that. You could
generate unique names (since the C code is generated, this is the best, most portable, and simplest solution)
compile with some -D flags to #define these names to unique names, so the source code could apparently contain duplicate names; that is if your generated code defines the foo function pass a -Dfoo=foo_123 (with foo_123 being globally unique) to the gcc command compiling it. (Of course you then dlsym for "foo_123").
add visibility("hidden") function attributes in your generated code. You could also pass -fvisibility=hidden option to gcc.
have only static functions (then the names don't matter much, so they can be duplicated), and have a constructor function which somehow binds the functions (e.g. stores their pointers somewhere, such as in some global table).
you might consider passing RTLD_LOCAL to dlopen(3). I'm not sure it is a good idea (and POCO probably doesn't know how to do that).
PS. I don't think it is related to position independent code (which is preferable, but not absolutely required, in shared objects; without -fPIC the dlopen would have to do a lot of inefficient relocations). It is related to Linux shared object linking & loading. Read Levine's Linkers & Loaders book for details.
See also this question, and read Drepper's paper on How to Write Shared Libraries.

How much source information is stored in c++ executables

Some days ago I accidentally opened a C++ executable of a commercial application in Notepad++ and found out that there's quite a lot of information about the original source code stored in the executable.
Inside the executable I could find file names (app.c, dlgstat.c, ...), function names (GetTickCount, DispatchMessageA, ...) and small pieces of source code, mostly conditions (szChar != TEXT('\0'), iRow < XTGetRows( hwndList )). After that I checked another Qt executable and: yes, again source file names and method signatures.
Because of that I am wondering how much source code information is really stored in a C/C++ executable (e.g., compiled using Qt or MinGW). Is this probably some kind of debug build still containing the original source? Is this information used for some reflection stuff? Is there any reason why publishers don't remove this stuff?
How much source code information is really stored in a C/C++ executable?
In practice, not much. The source code is not required at runtime. The strings you name come from two things:
The function names (e.g. GetTickCount) are the names of functions imported from other modules. The names are required at runtime because the functions are resolved dynamically (by calling GetProcAddress with the function name).
The conditions are likely assertions: the assert macro stringizes its argument so that when it fires you know what condition was not met.
If you build a DLL, it will also contain the names of all of the functions it exports, so they can be resolved at runtime (the same is likely true for other shared object formats).
Debug symbols may also contain some of the original source code, though it depends on the format used by the debug symbols. These symbols may be contained either in the binary itself or in an auxiliary file (for example, .pdb files used on Windows).
Windows function names: they probably are there just because they are being accessed dynamically - somewhere in your program there's a GetProcAddress to get their address. Still, no reason to worry, every application uses WinAPIs, so there's not much to discover about your executable from that information.
Conditions: probably from some assert-like macro; they are included to allow assert to print what failed condition triggered the failed assertion. Anyhow, in release mode assertions should be removed automatically.
Source file names and method signatures: probably from some usage of __FILE__ and __func__ macros; probably, again, from assert.
Another source of information about the inner structure of your program is RTTI, which has to provide some representation for every type that typeid could be working on. If you don't need its functionality, you can disable it (but I don't know if that is possible in Qt projects).
Mixed into the binary of a C++ app you will find the names of most global symbols (and debugging symbols if enabled in the compiler), but with extra 'decoration text' that encodes the calling signature of the symbol if it is a function or method. Likewise, the literals of character strings are embedded in clear text. But nowhere will you find anything like the actual source code that the compiler used to create the binary executable. That information is lost during the compilation process, and it is especially hard to reverse engineer if C++ templates are employed in the build.