Can ODR violation be avoided by using hidden visibility? - c++

So for example, I have a slightly complicated case of library dependency in one of my projects:
        /------------------------\
        |                        |
 /---> GRPC <-------------\      |
 |                        |      |
 | (c++)                  |      |
 \----- A ---------> B ---/      |
        |     (rust)             |
        |                        |
        \-----------> c++ <------/
Rust by default will prefer to use static linkage. Executable A is also built to statically link lib(std)c++. So, to my understanding, there will be two copies of STL implementation in both A and B. This is exactly the pattern that https://developer.android.com/ndk/guides/cpp-support#sr suggests avoiding.
However, looking through the dynamic linkage table (via nm -D, for example) of B, I could see no exported lib(std)c++/grpc symbols. This is because Rust marks them hidden by default.
So, is it safe (or conforming to the ODR) if all common symbols in B are hidden?

conforming to the ODR
The one-definition rule is part of the C++ programming language; it is relevant to C++ and irrelevant to anything else. There is no ODR "outside" of C++: the C++ standard does not apply outside of C++, it is only about the C++ programming language.
There is no widely adopted, portable super-standard concerning language interoperability. These are just tools; there are no definitions and rules, so the ODR, or any other rule from C++, does not apply here.
Trying to apply C++ standard rules to unrelated contexts makes little sense. Rust is not part of C++, and RPC is platform specific, outside the scope of the C++ programming language.
Ergo, reasoning about whether some C++ rule will or will not be broken in a tool chain that uses platform-specific mechanisms - a shared library, a "dynamic linkage table" - and multiple programming languages just does not apply here.
In the sense of C++, this all is literally "undefined behavior" - there are no rules from C++ standard that could apply here.
Can ODR violation be avoided by using hidden visibility?
Sure.

The android doc says:
In this situation, the STL, including its global data and static constructors, will be present in both libraries. The runtime behavior of this application is undefined, and in practice crashes are very common. Other possible issues include:
Memory allocated in one library, and freed in the other, causing memory leakage or heap corruption.
You mentioned those symbols are hidden. However, does that also guarantee the global data is not present twice?
If the global data can be present twice, then "Memory allocated in one library, and freed in the other, causing memory leakage or heap corruption" may happen. For example, if some memory is allocated in Rust and we transfer it to C++, and C++ later frees it, then we are facing this situation. Of course your code may not have this case today, but if some code in the future violates this rule, or if one day you use a third-party library that violates it, then we are in trouble (and it seems hard to debug).
Indeed, what about letting the C++ side statically link the Rust code? Then you get a giant .so file containing your C++, your Rust, your dependencies, etc. However, even if you can do that, we may still need to be careful: is the giant .so file the only one in your whole Android app? In other words, are you sure you do not have, and will never have, any other native libraries? If not, IMHO we may still face the problem again.
Anyway, I am not an expert in Android/C++. I was having a similar problem a few months ago (replace "C++ code" with "the Flutter engine, which is written in C++", and so on) and made a workaround. So this post is not really an answer, but rather some (possibly wrong) thoughts and suggestions. Hope someone can correct me!

Related

Dynamic memory allocation in STD

Working a lot with microcontrollers and C++, it is important for me to know that I do not perform dynamic memory allocations. However, I would like to get the most out of the STD lib. What would be the best strategy to determine whether a function/class from the STD lib uses dynamic memory allocation?
So far I come up with these options:
Read and understand the STD code. This is of course possible, but let's be honest, it is not the easiest code to read and there is a lot of it.
A variation on reading the code could be to have a script search for memory allocations and highlight those parts to make them easier to read. This would still require figuring out where the allocating functions are used, and so forth.
Just testing what I would like to use and watching the memory with the debugger. So far I have been using this method, but it is a reactive approach. I would like to know beforehand, when designing code, what I can use from the STD lib. Also, there may be some (edge) cases where memory is allocated that do not show up in such a limited test.
Finally, the generated assembler code could be scanned regularly for memory allocations. I suspect this could be scripted and included in the toolchain, but again, it is a reactive method.
If you see any other options or have experience doing something similar, please let me know.
p.s. I work mainly with ARM Cortex-Mx chips at this moment compiling with GCC.
You have some very good suggestions in the comments, but no actual answers, so I will attempt an answer.
In essence you are implying some difference between C and C++ that does not really exist. How do you know that stdlib functions don't allocate memory?
Some STL functions are allowed to allocate memory, and they are supposed to use allocators. For example, std::vector takes a template parameter for an alternative allocator (pool allocators are common, for instance). There is even a standard trait, std::uses_allocator, for discovering whether a type uses an allocator.
But... some types like std::function sometimes use memory allocation and sometimes do not, depending on the size of the stored callable, so your paranoia is not entirely unjustified.
C++ allocates via new/delete, and new/delete typically allocate via malloc/free.
So the real question is: can you override malloc/free? The answer is yes; see this answer https://stackoverflow.com/a/12173140/440558. This way you can track all allocations and catch your error at run time, which is not bad.
You can go further, if you are really hardcore. You can edit the standard C runtime library to rename malloc/free to something else. This is possible with objcopy, which is part of the GCC toolchain. After renaming malloc/free to, say, ma11oc/fr33, any call to allocate or free memory will no longer link.
Link your executable with the -nostdlib and -nodefaultlibs options to gcc, and instead link your own set of libs, which you generated with objcopy.
To be honest, I've only seen this done successfully once, and by a programmer who did not trust objcopy, so he just manually found the labels "malloc" and "free" using a binary editor and changed them. It definitely works though.
Edit:
As pointed out by Fureeish (see comments), it is not guaranteed by the C++ standard that new/delete use the C allocator functions.
It is however, a very common implementation, and your question does specifically mention GCC. In 30 years of development, I have never seen a C++ program that runs two heaps (one for C, and one for C++) just because the standard allows for it. There would simply be no advantage in it. That doesn't preclude the possibility that there may be an advantage in the future though.
Just to be clear, my answer assumes new USES malloc to allocate memory. This doesn't mean you can assume that every new call calls malloc though, as there may be caching involved, and the operator new may be overloaded to use anything at all at the global level. See here for GCC/C++ allocator schemes.
https://gcc.gnu.org/onlinedocs/libstdc++/manual/memory.html
Yet another edit:
If you want to get technical - it depends on the version of libstdc++ you are using. You can find operator new in new_op.cc, in the (what I assume is the official) source repository
(I will stop now)
The options you listed are pretty comprehensive, I think I would just add some practical color to a couple of them.
Option 1: if you have the source code for the specific standard library implementation you're using, you can "simplify" the process of reading it by generating a static call graph and reading that instead. In fact the llvm opt tool can do this for you, as demonstrated in this question. If you were to do this, in theory you could just look at a given method and see if it goes to an allocation function of any kind. No source code reading required, purely visual.
Option 4: scripting this is easier than you think. Prerequisites: make sure you're building with -ffunction-sections and linking with -Wl,--gc-sections, which allows the linker to completely discard functions that are never called. When you generate a release build, you can simply use nm and grep on the ELF file to see if, for example, malloc appears in the binary at all.
For example I have a bare metal cortex-M based embedded system which I know for a fact has no dynamic memory allocation, but links against a common standard library implementation. On the debug build I can do the following:
$ nm Debug/Project.axf | grep malloc
700172bc T malloc
$
Here malloc is found because dead code has not been stripped.
On the release build it looks like this:
$ nm Release/Project.axf | grep malloc
$
grep here exits with status 0 if a match was found and non-zero if it wasn't, so if you were to use this in a script it would be something like:
nm Debug/Project.axf | grep malloc > /dev/null
if [ "$?" -eq 0 ]; then
    echo "error: something called malloc"
    exit 1
fi
There's a mountain of disclaimers and caveats that come with any of these approaches. Keep in mind that embedded systems in particular use a wide variety of different standard library implementations, and each implementation is free to do pretty much whatever it wants with regard to memory management.
In fact they don't even have to call malloc and free, they could implement their own dynamic allocators. Granted this is somewhat unlikely, but it is possible, and thus grepping for malloc isn't actually sufficient unless you know for a fact that all memory management in your standard library implementation goes through malloc and free.
If you're serious about avoiding all forms of dynamic memory allocation, the only sure way I know of (and have used myself) is simply to remove the heap entirely. On most bare metal embedded systems I've worked with, the heap start address, end address, and size are almost always provided as symbols in the linker script. You should remove or rename these symbols. If anything uses the heap, you'll get a linker error, which is what you want.
To give a very concrete example, newlib is a very common libc implementation for embedded systems. Its malloc implementation requires that the common sbrk() function be present in the system. For bare metal systems, sbrk() is just implemented by incrementing a pointer that starts at the end symbol provided by the linker script.
If you were using newlib and you didn't want to mess with the linker script, you could still replace sbrk() with a function that simply hard-faults, so you catch any attempt to allocate memory immediately. In my opinion this would still be much better than trying to stare at heap pointers on a running system.
Of course your actual system may be different, and you may be using a different libc implementation. This question can really only be answered to any reasonable satisfaction in the exact context of your system, so you'll probably have to do some of your own homework. Chances are it's pretty similar to what I've described here.
One of the great things about bare metal embedded systems is the amount of flexibility that they provide. Unfortunately this also means there are so many variables that it's almost impossible to answer questions directly unless you know all of the details, which we don't here. Hopefully this will give you a better starting point than staring at a debugger window.
To make sure you do NOT use dynamic memory allocation, you can override the global new operator so that it always throws an exception. Then run unit tests against all your use of the library functions you want to use.
You may need help from the linker to avoid use of malloc and free as technically you can't override them.
Note: This would be in the test environment. You are simply validating that your code does not use dynamic allocation. Once you have done that validation, you don't need the override anymore so it would not be in place in the production code.
Are you sure you want to avoid them?
Sure, you don't want to use dynamic memory management that is designed for generic systems. That would definitely be a bad idea.
BUT does the toolchain you use not come with an implementation that is specific to your hardware and does an intelligent job for that hardware? Or does it have some special way to compile that allows you to use only a known piece of memory that you have pre-sized and aligned for the data area?
Moving to containers: most STL containers allow you to specialize them with an allocator. You can write your own allocator that does not use dynamic memory.
Generally you can check (suitably thorough) documentation to see whether the function (e.g., a constructor) can throw std::bad_alloc. (The inverse is often phrased as noexcept, since that exception is often the only one risked by an operation.) There is the exception of std::inplace_merge, which becomes slower rather than throwing if allocation fails.
The gcc linker supports a -Map option which will generate a link map with all the symbols in your executable. If anything in your application does dynamic memory allocation unintentionally, you will find a section with *alloc and free functions.
If you start with a program with no allocation, you can check the map after every compile to see if you have introduced one through the library function calls.
I used this method to identify an unexpected dynamic allocation introduced by using a VLA.

adding some C++ to a library used by C programs?

I have a long-standing C library used by C and C++ programs.
It has a few compilation units (in other words, C++ source files) that are entirely C++, but this has never been a problem. My understanding is that linkers (at least on Linux, Windows, etc.) always work at the file-by-file level, so that an object file in a library that isn't referred to has no effect on the linking and isn't put in the binary. The C users of the library never refer to the C++ symbols, and the library doesn't internally, so the resulting linked app is C-only. So while it has always worked perfectly, I don't know whether that's because the C++ doesn't make it past the linking stage, or because, more deeply, this kind of mixing would always work even if I did mix languages.
For the first time I'm thinking of adding some C++ code to the existing C API's implementation.
For purposes of discussion let us say I have a C function that does something, and logs it via stdout, and since this is buffered separately from cout, the output can become confusing. So let us say this module has an option that can be set to log to cout instead of stdout. (This is a more general question, not merely about getting cout and stdout to cooperate.) The C++ code might or might not run, but the dependencies will definitely be there.
In what way would this impact users of this library? It is widely used so I cannot check with the entire user base, and as it's used for mission-critical apps it'd be unacceptable to make a change that makes links start failing, at least unless I supply a release note explaining the problem and solution.
Just as an example of a possible problem, I know compilers have "hidden" libraries of support functions that are necessary for C and C++ programs. There are obviously also the Standard C and C++ libraries, that normally you don't have to explicitly link to. My concerns are that the compiler might not know to do these things.

Historical reason for declaration before use, include and header/source split. Need to find suitable reference

TLDR: See the last paragraph of this question.
I'm a Computer Science student trying to finish the text of my master thesis about creating a transpiler (case study).
Now for this master thesis, a part of my text is about comparing the languages involved. One of the languages is C++.
Now I'm trying to explain the difference in import/include semantics and the historical reason why C++ did it that way. I know how it works in C/C++, so I don't really need a technical explanation.
Researching extensively on Google and Stackoverflow I came up with several stackoverflow explanations and other references on this topic:
Why are forward declarations necessary?
What are forward declarations in C++?
Why does C++ need a separate header file?
http://en.wikipedia.org/wiki/Include_directive
http://www.cplusplus.com/forum/articles/10627/
https://softwareengineering.stackexchange.com/questions/180904/are-header-files-actually-good
http://en.wikipedia.org/wiki/One-pass_compiler
Why have header files and .cpp files in C++?
And last but not least the book "Design and Evolution of C++ (1994)" of Bjarne Stroustrup (page 34 - 35).
If I understand correctly this way of doing imports/includes came from C and came to be because of the following reasons:
Computers were not as fast so a One pass compiler was preferable. The only way this was possible is by enforcing the declaration before use idiom. This is because C and C++ are programming languages that have a context-sensitive grammar: they need the right symbols to be defined in the symbol table in order to disambiguate some of the rules. This is opposed to modern compilers: nowadays a first pass is usually done to construct the symbol table and sometimes (in case the language has a context-free grammar) the symbol table is not required in the parsing stage because there are no ambiguities to resolve.
Memory was very limited and expensive in those days. Therefore it was not feasible to store a whole symbol table in memory in most computers. That's why C let programmers forward declare the function prototypes and global variables they actually needed. Headers were created to enable developers to keep those declarations centralized so they could easily be reused across modules that required those symbols.
Header files were a useful way to abstract interface from implementation
C++ tried to establish backwards compatibility with software and software libraries written in C. More importantly, C++ actually used to be transpiled to C (CFront), and a C compiler then compiled that code into machine code. This also enabled it to target a lot of different platforms right from the start, as each of those platforms already had a C compiler and C linker.
The above was an illustration of what I discovered by searching first ;) The problem is: I can't find a suitable reference to the historical reasons for this include strategy, aside from here on Stackoverflow. And I highly doubt my university will be happy with a stackoverflow link. The closest I've come is the "Design and Evolution of C++" reference, but it doesn't mention the hardware limitations being a reason for the include strategy. I think that's to be expected because the design of the feature came from C. Problem is that I didn't find any good source yet that describes this design decision in C, preferably with the hardware limitations in mind.
Can anyone point me in the right direction?
Thanks!
You're right that the reason C++ does it this way is because C did it this way. The reason C did it this way is also based in history; in the very beginning (B), there were no declarations. If you wrote f(), then the compiler assumed that f was a function somewhere. Which returned a word, since everything in B was a word; there were no types. When C was invented (to add types, since everything being a word isn't very efficient with byte-addressed machines), the basic principle didn't change, except that the function was assumed to return int (and to take arguments of the types you gave it). If it didn't return int, then you had to forward declare it with the return type. In the earlier days of C, it wasn't rare to see applications which didn't use include, and which simply redeclared e.g. char* malloc() in each source file that used malloc. The preprocessor was developed to avoid having to retype the same thing multiple times, and at the very beginning, its most important feature was probably #define. (In early C, all of the functions in <ctype.h>, and the character-based IO in <stdio.h>, were macros.)
As for why the declaration needed to precede the use: the main reason is doubtlessly that if it didn't, the compiler would assume an implicit declaration (function returning int, etc.). And at the time, compilers were generally one pass, at least for the parsing; it was considered too complicated to go back and "correct" an assumption that had already been made.
Of course, in C++, the language isn't constrained as much by this; C++ has always required functions to be declared, for example, and in certain contexts (in-class member function definitions, for example), doesn't require the declaration to precede the use. (Generally, however, I would consider in-class member function definitions to be a misfeature, to be avoided for readability reasons. The fact that function definitions must be in the class in Java is a major reason not to use that language in large projects.)

Can C++ code reliably interact with other C++ code?

In C, I'm used to being able to write a shared library that can be called from any client code that wishes to use it simply by linking the library and including the related header files. However, I've read that C++'s ABI is simply too volatile and nonstandard to reliably call functions from other sources.
This would lead me to believe that creating truly shared libraries that are as universal as C's is impossible in C++, but real-world implementations seem to indicate otherwise. For example, Node.js exposes a very simple module system that allows plain C++ functions (without extern "C") to be exported dynamically using the NODE_SET_METHOD function.
Which elements of a C++ API are safe to expose, if any, and what are the common methods of allowing C++ code to interact with other pieces of C++ code? Is it possible to create shared libraries that can expose C++ classes? Or must these classes be individually recompiled for each program due to the inconsistent ABI?
Yes, C++ interop is difficult and filled with traps. The cold hard rules are that you must use the exact same compiler version with the exact same compiler settings to build the modules, and ensure that they share the exact same CRT and standard C++ libraries. Breaking those rules tends to get you C++ classes that don't have the same layout on either end of the divide, and trouble with memory management when one module allocates an object using a different allocator from the module that deletes the object. These are problems that lead to very hard-to-diagnose runtime failures, when code uses the wrong offset to access a class member, leaks memory, or corrupts the heap.
Node.js avoids these problems by, first of all, not exporting anything. NODE_SET_METHOD() doesn't do what you think it does; it simply adds a symbol to the JavaScript engine's symbol table, along with a function pointer that's called when the function is invoked from script. Furthermore, it is an open source project, so building everything with the same compiler and runtime library isn't a problem.
This
For example, Node.js exposes a very simple module system that allows
plain C++ functions (without extern "C") to be exported dynamically
using the NODE_SET_METHOD function.
Is wrong: you can see that they are using an extern "C" there in the init() function, which is what node.js actually calls; init then forwards to whichever C++ function they want, and that function is not itself exposed.
As explained in this question How does an extern "C" declaration work? - When the compiler compiles the code, it mangles the function names, class names and namespace names. The reason it does this is because there can very easily be name clashes, for instance with overloaded functions.
Read about it more here: http://en.wikipedia.org/wiki/Name_mangling
The only way to refer to and look up a function is if the extern "C" declaration is used, which forces the compiler not to mangle the name. I.e., in the example above, the function init will be called init, whereas the function foo will be called something like _ugAGE (I made this up, because it doesn't matter; it isn't for human consumption).
In summary, you can expose any C++ to any other language, but the entry points to the library must be one or more extern "C" global functions, as they are the only way to refer to an unmangled name.
Neither the C nor the C++ standards define an ABI. That is entirely left up to the implementation. The reason it's harder to get shared/dynamic libraries working for C++, is that C++ added things like classes, polymorphism, templates, exceptions, function overloading, STL, ...
So, the real source of information for you, is your compilers' documentation, as well as a corresponding set of guidelines for your library API to avoid any issues with any of the implementations your library will be built for. It's harder in C++ (the set of guidelines will likely be quite a bit bigger than for C, and you might have to work with a subset of C++), but not impossible.

Do dynamic libraries break C++ standard?

The C++ standard 3.6.3 states
Destructors for initialized objects of static duration are called as a result of returning from main and as a result of calling exit
On Windows you have FreeLibrary, and on Linux you have dlclose, to unload a dynamically linked library. And you can call these functions before returning from main.
A side effect of unloading a shared library is that all destructors for static objects defined in the library are run.
Does this mean it violates the C++ standard, as these destructors have been run prematurely?
It's a meaningless question. The C++ standard doesn't say what dlclose does or should do.
If the standard were to include a specification for dlclose, it would certainly point out that dlclose is an exception to 3.6.3. So then 3.6.3 wouldn't be violated because it would be a documented exception. But we can't know that, since it doesn't cover it.
What effect dlclose has on the guarantees in the C++ standard is simply outside the scope of that standard. Nothing dlclose can do can violate the C++ standard because the standard says nothing about it.
(If this were to happen without the program doing anything specific to invoke it, then you would have a reasonable argument that the standard is being violated.)
Parapura, it may be helpful to keep in mind that the C++ standard is a language definition that imposes constraints on how the compiler converts source code into object code.
The standard does not impose constraints on the operating system, hardware, or anything else.
If a user powers off his machine, is that a violation of the C++ standard? Of course not. Does the standard need to say "unless the user powers off the device" as an "exception" to every rule? That would be silly.
Similarly, if an operating system kills a process or forces the freeing of some system resources, or even allows a third party program to clobber your data structures -- this is not a violation of the C++ standard. It may well be a bug in the OS, but the C++ language definition remains intact.
The standard is only binding on compilers, and forces the resulting executable code to have certain properties. Nevertheless, it does not bind runtime behavior, which is why we spend so much time on exception handling.
I'm taking this to be a bit of an open-ended question.
I'd say it's like this: The standard only defines what a program is. And a program (a "hosted" one, I should add) is a collection of compiled and linked translation units that has a unique main entry point.
A shared library has no such thing, so it doesn't even constitute a "program" in the sense of the standard. It's just a bunch of linked executable code without any sort of "flow". If you use load-time linking, the library becomes part of the program, and all is as expected. But if you use runtime linking, the situation is different.
Therefore, you may like to view it like this: global variables in the runtime-linked shared object are essentially dynamic objects which are constructed by the dynamic loader, and which are destroyed when the library is unloaded. The fact that those objects are declared like global objects doesn't change that, since the objects aren't part of a "program" at that point.
They are only run prematurely if you go to great effort to do so - the default behavior is standard conforming.
If it does violate the standard, who is the violator? The C++ compiler cannot be considered the violator (since things are being loaded dynamically via a library call); thus it must be the vendor of the dynamic loading functionality, a.k.a. the OS vendor. Are OS vendors bound by the C++ standard when designing their systems? That definitely seems to be outside the scope of the standard.
Or for another perspective, consider the library itself to be a separate program providing some sort of service. When this program is terminated (by whatever means the library is unloaded) then all associated service objects should disappear as well, static or not.
This is just one of the tons and tons of platform-specific "extensions" (for a target compiler, architecture, OS, etc) that are available. All of which "violate" the standard in all sorts of ways. But there is only one expected consequence for deviating from standard C++: you aren't portable anymore. (Unless you do a lot of #ifdef or something, but still, that particular code is locked in to that platform).
Since there is currently no standard/cross-platform notion of libraries, if you want the feature, you have to either not use it or re-implement it per-platform. Since similar things are appearing on most platforms, maybe the standard will one day find a clean way to abstract them so that the standard covers them. The advantage will be a cross-platform solution and it will simplify cross platform code.