I have a set of APIs that are explicitly designed to be used only in C++. I do not expect them to be used by a C program (or any other language for that matter), and as such I export namespace and class information as opposed to going the extern "C" route and using inlined utility functions to call the plain C functions.
Right now I am only working on dlls that are linked at compile time, which means importing the functions to the executable is very easy as it involves no work on my part. However, I plan on developing a plugin system which will require me to load dlls dynamically at run time. Will I be able to find name-mangled C++ functions using GetProcAddress()?
What you're doing is not necessarily a good idea unless you control the entire build chain and can ensure both your DLL and any apps using it are built with the same version of the same compiler.
With that said, yes, you can load name-mangled functions using GetProcAddress. Just use Dependency Walker or look at the generated .def file for your DLL, if your compiler is set to produce one, to get the mangled function name. Then you can GetProcAddress it. You cannot, however, call GetProcAddress with an unmangled name and expect it to find the right mangled name. For example, if your DLL's function is named Add, and is mangled to _Z3Addv, you would need to call GetProcAddress(myDLL, "_Z3Addv"); to access the function properly.
You would need to change the call to GetProcAddress every time you change the declaration of your function, since the mangled name would also change. Be aware you will also need to change the GetProcAddress call if you change the compiler your DLL is built with - MSVC's mangling is a lot different from GCC's, and clang's mangling is probably different from both of them. So you might want to reconsider the way you're doing this, since it seems rather prone to breaking somewhere along the way.
Related
I'm updating a DLL for a customer and, due to corporate politics - among other things - my company has decided to no longer share source code with the customer.
Previously. I assume they had all the source and imported it as a VC++6 project. Now they will have to link to the pre-compiled DLL. I would imagine, at minimum, that I'll need to distribute the *.lib file with the DLL so that DLL entry-points can be defined. However, do I also need to distribute the header file?
If I can get away with not distributing it, how would the customer go about importing the DLL into their code?
Yes, you will need to distribute the header along with your .lib and .dll
Why ?
At least two reasons:
because C++ needs to know the return type and arguments of the functions in the library (roughly said, most compilers use name mangling, to map the C++ function signature to the library entry point).
because if your library uses classes, the C++ compiler needs to know their layout to generate code in you the library client (e.g. how many bytes to put on the stack for parameter passing).
Additional note: If you're asking this question because you want to hide implementation details from the headers, you could consider the pimpl idiom. But this would require some refactoring of your code and could also have some consequences in terms of performance, so consider it carefully
However, do I also need to distribute the header file?
Yes. Otherwise, your customers will have to manually declare the functions themselves before they can use it. As you can imagine, that will be very error prone and a debugging nightmare.
In addition to what others explained about header/LIB file, here is different perspective.
The customer will anyway be able to reverse-engineer the DLL, with basic tools such as Dependency Walker to find out what system DLLs your DLL is using, what functions being used by your DLL (for example some function from AdvApi32.DLL).
If you want your DLL to be obscured, your DLL must:
Load all custom DLLs dynamically (and if not possible, do the next thing anyhow)
Call GetProcAddress on all functions you want to call (GetProcessToken from ADVAPI32.DLL for example
This way, at least dependency walker (without tracing) won't be able to find what functions (or DLLs) are being used. You can load the functions of system DLL by ordinal, and not by name so it becomes more difficult to reverse-engineer by text search in DLL.
Debuggers will still be able to debug your DLL (among other tools) and reverse engineer it. You need to find techniques to prevent debugging the DLL. For example, the very basic API is IsDebuggerPresent. Other advanced approaches are available.
Why did I say all this? Well, if you intend to not to deliver header/DLL, the customer will still be able to find exported functions and would use it. You, as a DLL provider, must also provide programming elements with it. If you must hide, then hide it totally.
One alternative you could use is to pass only the DLL and make the customer to load it dynamically using LoadLibrary() + GetProcAddress(). Although you still need to let your customer know the signature of the functions in the DLL.
More detailed sample here:
Dynamically load a function from a DLL
I am having some trouble understanding why both #include and LoadLibrary() is needed in C++. In C++ "#include" forces the pre-processor to replace the #include line with the contents of the file you are including (usually a header file containing declarations). As far as I understand, this enables me to use the routines I might want in the external libraries the headers belong to.
Why do I then need LoadLibrary()? Can't i just #include the library itself?
Just as a side note: In C#, which I am more familiar with, I just Add a Reference to a DLL if I want to use types or routines from that DLL in my program. I do not have to #include anything, as the .NET framework apparently automatically searches all the referenced assemblies for the routines I want to use (as specified by the namespace)
Thank you very much in advance.
Edit: Used the word "definitions", but meant "declarations". Now fixed.
Edit 2: Tough to pick one answer, many good replies. Thanks for all contributions.
C++ uses a full separate compilation model; you can even compile
against code which hasn't been written. (This often occurs in
large projects.) When you include a file, all you are doing is
telling the compiler that the functions, etc. exist. You do not
provide an implementation (except for inline functions and
templates). In order to execute the code, you have to provide
the implementation, by linking it into your application. This
can occur in several different ways:
You have the source files; you compile them along with your
sources, and link in the resulting objects.
You have a static library; you must link against it.
You have a dynamic library. Here, what you must do will
depend on the implemention: under Windows, you must link
against a .lib stub, and put the .dll somewhere where the
runtime will find it when you execute. (Putting it in the same
directory as your application is usually a good solution.)
I don't quite understand your need to call LoadLibrary. The
only time I've needed this is when I've intentionally avoided
using anything in the library directly, and want to load it
conditionally, use GetProcAddr to get the addresses of the
functions I need.
EDIT:
Since I was asked to clarify "linking": program translation
(from the source to an executable) takes place in a number of
steps. In traditional terms, each translation unit is
"compiled" into an object file, which contains an image of the
machine instructions, but with unfilled spaces for external
references. For example, if you have:
extern void function();
in your source (probably via inclusion of a header), and you
call function, the compiler will leave the address field of
the call instruction blank, since it doesn't know where the
function will be located. Linking is the process of taking all
of the object files, and filling in these blanks. One of the
object files will define function, and the linker will
establish the actual address in the memory image, and fill in
the blank referring to function with the address of function
in that image. The result is a complete memory image of the
executable. On the early systems I worked on: literally. The
OS would simply copy the executable file directly into memory,
and then jump into it. Things like virtual memory and shared,
write protected code segments make this a little more
complicated today, but for statically linked libraries or object
files (my first two cases above), the differences aren't that
great.
Modern system technologies have blurred the lines somewhat. For
example, most Java (and I think C#) compilers don't generate
classical object files, with machine code, but rather byte code,
and the compile and link phases, above, don't take place until
runtime. Some C++ compilers also only generate byte code, which
will be compiled when the code is "linked". This is done to
permit cross-module optimizations. And all modern systems
support dynamic linking: some of the blank addresses are left
blank until execution time. And dynamic linking can be implicit
or explicit: when it is implicit, the link phase will insert
information into the executable concerning the libraries it
needs, and where to find them, and the OS will link them,
implicitly, either when the executable is loaded, or on demand,
triggered by the code attempting to use one of the unfilled
address slots. When it is explicit, you normally don't have any
explicit referenced to the name in your code. In the case of
function, above, for example, you wouldn't have any code which
directly called function. Your code would, however, load the
dynamic library using LoadLibrary (or dlopen under Unix),
then request the address of a name, using GetProcAddr (or
dlsys), and call the function indirectly through the pointer
it received.
The #include directive is, like all preprocessor functionality, merely a text replacement. The text "#include " is replaced with the contents of that file.
Typically (but not necessarily), this is used to include a header file which declares the functions that you want to use, i.e. you tell the compiler (which runs after the preprocessor) how some functions that you intend to use are named, what parameters they take, and what the return type is. It does not define what the function is actually doing.
You then also need an implementation of these functions, too. Usually, if you do not implement them in your program, you leave this task to the link stage. You give a list of libraries that your program depends on to the linker, and the linker divines via some implementation-defined way (such as an "import library") what it needs to do to "make it work". The linker will produce some glue code and write some information into the executable that will make the loader automatically load the required libraries. Everything "just works" without you having to do something special.
In some cases, however, you want to postpone the linker stage and do the loading "fully dynamically" by hand rather than automatically. This is when you have to call LoadLibrary() and GetProcAddress. The former brings the DLL into memory and does some setup (e.g. relocation), the latter gives you the address of a function that you want to call.
The #include in your code is still necessary so the compiler knows what to do with that pointer. Otherwise, you could of course call the obtained function via its address, but it would not be possible to call the function in a meaningful way.
One reason why one would want to load a library manually (using LoadLibrary) is that it is more failsafe. If you link a program against a library and the library cannot be found (or a symbol cannot be found), then your application will not start up and the user will see a more or less obscure error message.
If LoadLibrary fails or GetProcAddress doesn't work, your program can in principle still run, albeit with reduced functionality.
Another example for using LoadLibrary might be to load an alternative version of a function from a different library (some programs implement "plugins" that way). The function "looks" the same to the compiler, as defined in the include file, but may behave differently, as by whatever is in the loaded binary.
#include brings in source code only: symbol declarations for the compiler. A library (or a DLL) is object code: Use either LoadLibrary or link to a lib file to bring in object code.
LoadLibrary() causes the code module to be loaded from disk into your applications memory space for execution. This allows for dynamically loading code at runtime. You would not use LoadLibrary(), for example, if the code you want to use is compiled into a statically linked library. In that case you would provide the name of the .lib file that contained the code to the linker and it gets resolved at link time - the code is linked in to your .exe and the .lib is not distributed with the .exe in order for it to execute.
LoadLibrary() creates a dependency on an external DLL which must be present on the path provided to the method call in order for the .exe to properly execute. If LoadLibrary() fails, you must ensure your code will handle it appropriately, by either exiting gracefully or providing some other execution alternative. You must provide a .lib file to the linker the same as you would for the static library above. This .lib file however does not contain code, just entry points for the actual code that resides in the .dll.
In both cases you must #include the headers for the code you wish to execute. This is required by the compiler in order to build function call signatures properly based on the type information provided by the header.
C# assemblies contain both type information and IL. A single reference is sufficient to satisfy the need for header information and binding to the code itself.
#include is static, the substitution is done at compile time. LoadLibrary() lets you load a DLL at runtime, for example based on user imput.
currently I am writing on a cpp-DLL. Afaik I have to put the functions into a class and a namespace if another cpp-program wants to use them. But I want to use the DLL with Labview too. Labview only recognizes the functions if they are free, e.g. neither in a namespace nor in a class. How can I implement this in my DLL? At the moment, I have set a #define-variable. If this variable is set, the functions are enclosed in a namespace and a class, if not, then they are free, but I have to compile the whole thing twice and I get two separate DLL files. So, what can I do if I only want one DLL file for both applications? (Please don't tell me to write the functions twice, the administrative outlay is even worse, I have tried this before).
Or do I simply have to call the DLL via LoadLibrary() when not using namespaces?
Thank you very much!
Afaik I have to put the functions into a class and a namespace if another cpp-program wants to use them.
This is plain wrong. You do not need to do this at all. On the contrary, DLL were originally introduced as libraries of C functions. C++ uses mangled names to represent namespace/class and types of parameters. There is no standard on this. Different compilers use their own scheme.
To summarize:
If you export simple C function from your dll, this will always work.
If you export classes or something from a namespace, this will
definitely work if other .exe/.dll is compiled with the same version of the
compiler. If not - this depends.
Regarding the LoadLibrary: it should be used when you do not know the name of the DLL or a name of the function in this DLL ahead or when you do not want to load this DLL at the beginning of your process. Otherwise (simple case) link your executable with implib for that DLL. This perfectly works for simple c-functions. LoadLibrary should be used when direct linking is not good for some reason.
In C, I'm used to being able to write a shared library that can be called from any client code that wishes to use it simply by linking the library and including the related header files. However, I've read that C++'s ABI is simply too volatile and nonstandard to reliably call functions from other sources.
This would lead me to believe that creating truly shared libraries that are as universal as C's is impossible in C++, but real-world implementations seem to indicate otherwise. For example, Node.js exposes a very simple module system that allows plain C++ functions (without extern "C") to be exported dynamically using the NODE_SET_METHOD function.
Which elements of a C++ API are safe to expose, if any, and what are the common methods of allowing C++ code to interact with other pieces of C++ code? Is it possible to create shared libraries that can expose C++ classes? Or must these classes be individually recompiled for each program due to the inconsistent ABI?
Yes, C++ interop is difficult and filled with traps. Cold hard rules are that you must use the exact same compiler version with the exact same compiler settings to build the modules and ensure that they share the exact same CRT and standard C++ libraries. Breaking those rules tend to get you C++ classes that don't have the same layout on either end of the divide and trouble with memory management when one module allocates an object using a different allocator from the module that deletes the object. Problems that lead to very hard to diagnose runtime failure when code uses the wrong offset to access a class member and leaks memory or corrupts the heap.
Node.js avoids these problems by first of all not exporting anything. NODE_SET_METHOD() doesn't do what you think it does, it simply adds a symbol to the Javascript engine's symbol table, along with a function pointer that's called when the function is called in script. Furthermore, it is an open source project so building everything with the same compiler and runtime library isn't a problem.
This
For example, Node.js exposes a very simple module system that allows
plain C++ functions (without extern "C") to be exported dynamically
using the NODE_SET_METHOD function.
Is wrong, you can see that they are using an an extern "C" there in the init() function, which is clearly what node.js is calling which is then forwarding the function on to which ever C++ function they want, which isn't exposed.
As explained in this question How does an extern "C" declaration work? - When the compiler compiles the code, it mangles the function names, class names and namespace names. The reason it does this is because there can very easily be name clashes, for instance with overloaded functions.
Read about it more here: http://en.wikipedia.org/wiki/Name_mangling
The only way to refer and lookup a function is if the extern "C" declaration is used, which forces the compiler to not mangle the name. I.e. in the example above, the function init will be called init where as the function foo will be called something like _ugAGE (I made this up, because it doesn't matter, it isn't for human consumption)
In summary, you can expose any C++ to any other language, but the entry point to the library must be one or more extern "C"'d global functions as they are the only way to refer to an unmangled name.
Neither the C nor the C++ standards define an ABI. That is entirely left up to the implementation. The reason it's harder to get shared/dynamic libraries working for C++, is that C++ added things like classes, polymorphism, templates, exceptions, function overloading, STL, ...
So, the real source of information for you, is your compilers' documentation, as well as a corresponding set of guidelines for your library API to avoid any issues with any of the implementations your library will be built for. It's harder in C++ (the set of guidelines will likely be quite a bit bigger than for C, and you might have to work with a subset of C++), but not impossible.
I have this habit always a C++ project is compiled and the release is built up. I always open the .EXE with a hexadecimal editor (usually HxD) and have a look at the binary information.
What I hate most and try to find a solution for is the fact that somewhere in the string table, relevant (at least, from my point of view) information is offered. Maybe for other people this sounds like a schizophrenia obsession but I just don't like when my executable contains, for example, the names of all the Windows functions used in the application.
I have tried many compilers to see which of them published the least information. For example, GCC leaves all this in all of its produced final exe
libgcj_s.dll._Jv_RegisterClasses....\Data.ald.rb.Error.Data file is corrupt!
....Data for the application not found!.€.#.ř.#.0.#.€.#.°.#.p.#.p.#.p.#.p.#.
¸.#.$.#.€.#°.#.std::bad_alloc..__gnu_cxx::__concurrence_lock_error.__gnu_cxx
::__concurrence_unlock_error...std::exception.std::bad_exception...pure virt
ual method called..../../runtime/pseudo-reloc.c....VirtualQuery (addr, &b, s
ize of(b))............................/../../../gcc-4.4.1/libgcc/../gcc/conf
ig/i386/cygming-shared-data.c...0 && "Couldn't retrieve name of GCClib share
d data atom"....ret->size == sizeof(__cygming_shared) && "GCClib shared data
size mismatch".0 && "Couldn't add GCClib shared data atom".....-GCCLIBCYGMI
NG-EH-TDM1-SJLJ-GTHR-MINGW32........
Here, you can see what compiler I used, and what version. Now, a few lines below you can see a list with every Windows function I used, like CreateMainWindow, GetCurrentThreadId, etc.
I wonder if there are ways of not displaying this, or encrypting, obfuscating it.
With Visual C++ this information is not published. Instead, it is not so cross-platform as GCC, which even between two Windows systems like 7 and XP, doesn't need C++ run-time, frameworks or whatever programs compiled with VC++ need. Moreover, the VC++ executables also contain those procedures entry points to the Windows functions used in the application.
I know that even NASM, for example, saves the name of the called Windows functions, so it looks like it's a Windows issue. But maybe they can be encrypted or there's some trick to not show them.
I will have a look over the GCC source code to see where are those strings specified to be saved in the executables - maybe that instruction can be skipped or something.
Well, this is one of my last paranoia and maybe it can be treated some way. Thanks for your opinions and answers.
If you compile with -nostdlib then the GCC stuff should go away but you also lose some of the C++ support and std::*.
On Windows you can create an application that only links to LoadLibrary and GetProcAddress and at runtime it can get the rest of the functions you need (The names of the functions can be stored in encrypted form and you decrypt the string before passing it to GetProcAddress) Doing this is a lot of work and the Windows loader is probably faster at this than your code is going to be so it seems pointless to me to obfuscate the fact that you are calling simple functions like GetLastError and CreateWindow.
Windows API functions are loaded from dlls, like kernel32.dll. In order to get the loaded API function's memory address, a table of exported function names from the dll is searched. Thus the presence of these names.
You could manually load any Windows API functions you reference with LoadLibrary. The you could look up the functions' addresses with GetProcAddress and functions names stored in some obfuscated form. Alternately, you could use each function's "ordinal" -- a numeric value that identifies each function in a dll). This way, you could create a set of function pointers that you will use to call API functions.
But, to really make it clean, you would probably have to turn off linking of default libraries and replace components of the C Runtime library that are implicitly used by the compiler. Doing this is a hasslse, though.