I have a C++ application running bare-metal that I want to make as small as possible.
I am not using dynamic memory allocation anywhere. I am using no STL functions. I've also overridden all the varieties of "delete" and "new" with empty functions. Nonetheless, when I look at a sorted list of symbols I see that malloc() is still one of the largest items in my compiled binary. I could shrink my binary by about 25% if I could get rid of it.
Do C++ runtimes generally require malloc() for behind-the-scenes type work?
(I'm using Xilinx's fork of gcc for the Microblaze architecture, if that matters)
Reliance of a program on malloc() can occur in both C and C++, even if the program doesn't directly use it. It is a quality-of-implementation matter for the compiler and standard library rather than a requirement of the standards.
This really depends on how both the compiler startup code (the code that sets things up so main() can be called) works and how the standard library code is implemented.
In both C and C++, for example, startup code (in hosted environments) needs to collect information about command line arguments (possibly copying them to an allocated buffer) and connect to standard files/streams (like std::cout and std::cin in C++, and stdout and stdin in C). Any of these things can involve dynamic memory allocation (e.g. for buffers associated with standard streams) or execute code that is not actually needed by the program.
C++ has two kinds of implementations, hosted and freestanding. Hosted implementations do assume that malloc is present and often do use it for internal purposes. Freestanding implementations assume that only the operator new function is present, because it supports the C++ keyword new, but it is easy to ensure that this function doesn't get called.
The difference between the two is that in a freestanding implementation, you can control program startup and the set of required headers and libraries is limited. Program startup is controlled by setting the entry point.
g++ -ffreestanding -e _entry program.cpp
program.cpp might be:
extern "C" int entry()
{
return 0;
}
The extern "C" is necessary to prevent C++ name mangling, which might make it difficult to figure out what the name of entry is during linking. Then, don't use new, std::string, stream I/O, STL, or the like, and avoid atexit.
This should give you control over the startup and cleanup code and limit what the compiler can implicitly rely on being available from the standard library. Note, however, that this can be a challenging environment. Not only will you prevent initialization of heaps, I/O streams, and the like, but you will also prevent setup of exceptions, RTTI, and the calling of static storage object constructors, among other things. You will have to write code or use libraries to manually opt into several features of C++ you might want to use. If you go this route, you may want to peruse this wiki http://wiki.osdev.org/C%2B%2B. You may want to at least call global constructors, which is pretty easy to do.
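As a rough illustration of the "call global constructors" step, startup code typically walks a table of function pointers that the linker has collected. The sketch below assumes the GNU/ELF convention where that table lives in .init_array; the boundary symbol names in the comment (__init_array_start, __init_array_end) are an assumption about your linker script, and run_constructors is a hypothetical helper name.

```cpp
// Sketch: walk a table of constructor pointers, as custom startup code
// would do before handing control to the application.
using ctor_fn = void (*)();

// Calls every function in [begin, end) in order. Hypothetical helper name.
inline void run_constructors(ctor_fn* begin, ctor_fn* end) {
    for (ctor_fn* p = begin; p != end; ++p) {
        (*p)();
    }
}

// On a real target the bounds would come from the linker script, e.g.
// (assumed symbol names, GNU toolchain convention; check your own script):
//   extern "C" ctor_fn __init_array_start[];
//   extern "C" ctor_fn __init_array_end[];
//   run_constructors(__init_array_start, __init_array_end);
```

Keeping the loop separate from the linker symbols makes it possible to exercise the logic on a host machine with an ordinary array of function pointers.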
Explanation
The standard
C++ has the notion of a "freestanding implementation", in which fewer headers are available.
3.6.1 Main function [basic.start.main]
1 [snip] It is implementation-defined whether a program in a freestanding environment is required to define a main function.
17.6.1.3 Freestanding implementations [compliance]
1 Two kinds of implementations are defined: hosted and freestanding (1.4). For a hosted implementation, this
International Standard describes the set of available headers.
2 A freestanding implementation has an implementation-defined set of headers. This set shall include at least
the headers shown in Table 16.
3 The supplied version of the header <cstdlib> shall declare at least the functions abort, atexit, at_quick_exit, exit, and quick_exit (18.5). [snip]
Table 16 lists ciso646, cstddef, cfloat, limits, climits, cstdint, cstdlib, new, typeinfo, exception, initializer_list, cstdalign, cstdarg, cstdbool, type_traits, and atomic.
Most of the above headers contain simple definitions of types, constants, and templates. The only ones that may seem problematic are typeinfo, exception, cstdlib, and new. The first two support RTTI and exceptions, respectively, which you can ensure are disabled with additional compiler flags. cstdlib and new you can simply ignore by not calling exit/atexit and not using new expressions. The reason for avoiding atexit is that it might call new internally. Nothing else in the freestanding fragment should call anything in cstdlib or new.
The options
The most important option above is -e _entry, which gives you control over what runs when your program starts.
-ffreestanding tells the compiler and the standard library (rather, its freestanding fragment) not to assume that the entire standard library is present (even if it still is). It may prevent the generation of surprising code. Note that this option doesn't actually restrict which headers are available to you. You can still use iostream, for instance, though it may be a bad idea if you also changed the entry point. What it does is it prevents the code supporting the freestanding headers from calling anything outside the freestanding headers, and prevents the compiler from generating any such calls implicitly.
If you use any STL component such as std::map, or even std::cout, there will be dynamic memory allocations behind the scenes.
It will always need malloc, because the binary has to be loaded, along with any shared libraries.
Related
For example, I want to see the code of the function toupper() to understand how it works. Is there any way? I have searched and opened the string.h library, but didn't find anything.
From a strict language point of view, you cannot "see the code" of a standard function, because the C++ language standard only defines functions' prototypes and behaviours, not how they are implemented.
In fact, from a strict language point of view, a standard function like toupper does not even have to have source code, because a standard header, like <string.h> does not even have to be a file!
Of course, in practice, you will probably never encounter a C++ implementation in which standard headers are not files, because files are just a natural and simple implementation of headers. This means that in practice, for the header <string.h>, there is actually a C++ source file called "string.h" somewhere on your computer. Just find it and open it.
I have searched and opened string.h library, but didn't find anything.
Then you have not looked closely enough. Hint: This file most likely includes one or more other header files.
Note that if you actually looked for toupper, that function is not in <string.h> anyway. Look in <ctype.h> instead. cppreference.com is a good online reference to tell you which headers contain which functions.
http://en.cppreference.com/w/c/string/byte/toupper
Again, this does not mean that the corresponding header file of your compiler contains that function directly, but it may directly or indirectly include some other file which contains it.
In any case, beware of what you will see inside of your compiler's header files. It will usually be a lot more complicated than you may think, and, more importantly, it will often use constructs you are not allowed to use in your own code; after all, the code in those files is internal to the compiler implementation, and the compiler has a lot of privileges you don't have, for example using otherwise forbidden identifiers like _STD_BEGIN. Also expect a lot of completely non-standard #pragmas and other non-portable stuff.
Another important thing to keep in mind is that you are not supposed to dig through a function's implementation to find out what it does. In badly written software, i.e. software with confusing interfaces and no documentation (which exists everywhere in the real world), you unfortunately have to do this, provided you have access to the source code.
But C++ standard functions are perfectly documented and have, with some arguable exceptions, well-designed interfaces. It may be interesting, educational, and sometimes even necessary for debugging to look into their implementation on your system, but don't let this possibility keep you from learning two important software-engineering skills:
Reading documentation.
Programming to interfaces, not to implementations.
Yes, of course, you can (though maybe not for all implementations). For example, the glibc implementation defines the toupper function as:
#define __ctype_toupper \
((int32_t *) _NL_CURRENT (LC_CTYPE, _NL_CTYPE_TOUPPER) + 128)
int
toupper (int c)
{
return c >= -128 && c < 256 ? __ctype_toupper[c] : c;
}
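To make the lookup-table idea in that excerpt easier to follow, here is a simplified, ASCII-only sketch. It is not glibc's code: there is no locale support, the names upper_table and ascii_toupper are made up for illustration, and the table is offset by 128 only so that negative arguments (such as EOF or sign-extended chars) still index valid entries.

```cpp
#include <array>
#include <cstddef>

// Builds a 384-entry table covering c in [-128, 255]; entry index is c + 128.
// Every value maps to itself except 'a'..'z', which map to 'A'..'Z'.
inline const std::array<int, 384>& upper_table() {
    static const std::array<int, 384> table = [] {
        std::array<int, 384> t{};
        for (int c = -128; c < 256; ++c) {
            t[static_cast<std::size_t>(c + 128)] =
                (c >= 'a' && c <= 'z') ? c - 'a' + 'A' : c;
        }
        return t;
    }();
    return table;
}

// Same shape as the excerpt: range check first, then a table lookup.
inline int ascii_toupper(int c) {
    return (c >= -128 && c < 256)
               ? upper_table()[static_cast<std::size_t>(c + 128)]
               : c;
}
```

The real implementation swaps the table depending on the current locale, which is one reason reading the documentation tells you more than reading the three-line function body.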
On the std-proposals list, the following code was given:
#include <vector>
#include <algorithm>
#include <iostream>
void foo(const std::vector<int> &v) {
#ifndef _ALGORITHM
std::for_each(v.begin(), v.end(), [](int i){std::cout << i; });
#endif
}
Let's ignore, for the purposes of this question, why that code was given and why it was written that way (as there was a good reason but it's irrelevant here). It supposes that _ALGORITHM is a header guard inside the standard header <algorithm> as shipped with some known standard library implementation. There is no inherent intention of portability here.
Now, _ALGORITHM would of course be a reserved name, per:
[C++11: 2.11/3]: In addition, some identifiers are reserved for use by C++ implementations and standard libraries (17.6.4.3.2) and shall not be used otherwise; no diagnostic is required.
[C++11: 17.6.4.3.2/1]: Certain sets of names and function signatures are always reserved to the implementation:
Each name that contains a double underscore __ or begins with an underscore followed by an uppercase letter (2.12) is reserved to the implementation for any use.
Each name that begins with an underscore is reserved to the implementation for use as a name in the global namespace.
I was always under the impression that the intent of this passage was to prevent programmers from defining/mutating/undefining names that fall under the above criteria, so that the standard library implementors may use such names without any fear of conflicts with client code.
But, on the std-proposals list, it was claimed that this code is itself ill-formed for merely referring to such a reserved name. I can now see how the use of the phrase "shall not be used otherwise" from [C++11: 2.11/3] may indeed suggest that.
One practical rationale given was that the macro _ALGORITHM could expand to some code that wipes your hard drive, for example. However, taking into account the likely intention of the rule, I'd say that such an eventuality has more to do with the obvious implementation-defined* nature of the _ALGORITHM name, and less to do with it being outright illegal to refer to it.
* "implementation-defined" in its English language sense, not the C++ standard sense of the phrase
I'd say that, as long as we're happy that we are going to have implementation-defined results and that we should investigate what that macro means on our implementation (if it exists at all!), it should not be inherently illegal to refer to such a macro provided we do not attempt to modify it.
For example, code such as the following is used all over the place to distinguish between code compiled as C and code compiled as C++:
#ifdef __cplusplus
extern "C" {
#endif
and I've never heard a complaint about that.
So, what do you think? Does "shall not be used otherwise" include simply writing such a name? Or is it probably not intended to be so strict (which may point to an opportunity to adjust the standard wording)?
Whether it's legal or not is implementation-specific (and identifier-specific).
When the Standard gives the implementation the sole right to use these names, that includes the right to make the names available in user code. If an implementation does so, great.
But if an implementation doesn't expressly give you the right, it is clear from "shall not be used otherwise" that the Standard does not, and you have undefined behavior.
The important part is "reserved to the implementation". It means that the compiler vendor may use those names and even document them. Your code may then use those names as documented. This is often used for extensions like __builtin_expect, where the compiler vendor avoids any clash with your identifiers (that are declared by your code) by using those reserved names. Even the standard uses them for things like _Pragma to make sure it doesn't break existing (legal) code when adding new features.
http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_defects.html#1882
Each identifier that contains a double underscore __ or begins with an underscore followed by an uppercase letter is reserved to the implementation for any use.
any use. (similar text occurs both before and after that defect fix is applied)
__cplusplus is defined by the standard. _ALGORITHM is reserved by the standard to be used by implementations. These seem quite different? (The two sections of the standard do conflict, in that one states that __cplusplus is reserved for any use, and another uses it specifically, but I think that the winner of that conflict is clear).
The _ALGORITHM identifier could, under the standard, be used as part of a pre-processing step to say "replace this source code with hard drive deleting code". Its existence (prior to pre-processing, or after) could be sufficient to completely change your program behavior.
Now this is unlikely, but I do not think it results in a non-conforming implementation. It is a matter of quality of implementation only.
An implementation is free to document and define what _ALGORITHM means. For example, it could document that it is a header guard for <algorithm>, and indicates whether that header file has been included. Treating your current <algorithm> implementation as documentation is probably going too far.
I'd guess using __cplusplus in C mode is technically "just as bad" as using _ALGORITHM, but this question is a C++ question, not a C question. I haven't delved into the C standard to look for quotes about it.
The names in [cpp.predefined] are different. Those have a specified meaning, so an implementation can't reserve them for any use, and using them in a program has a well-defined portable meaning. Using an implementation-specific identifier like the example of _ALGORITHM is ill-formed because it violates a shall-rule.
Yes, I'm fully aware of multiple examples where the library specification uses "shall" to mean "this is a requirement on user code, and violations are UB, not ill-formed".
Regarding whether it's UB or implementation-defined, running an ill-formed program results in UB. The standard wording clearly says the program is ill-formed; UB occurs if the implementation still chooses to accept the program and run it.
So, if a program uses the identifier _ALGORITHM, that program is ill-formed, and running such a program is UB, but that does not mean it doesn't work fine on an implementation that uses _ALGORITHM as an include guard, nor does it mean that it doesn't work fine on an implementation that doesn't.
If users are concerned about such ill-formedness and potential UB, and said users want to write portable C++, they shouldn't use reserved identifiers in portable C++ programs. If users accept that regardless of the standard prohibiting such a use, no practical implementation will wipe your hard drive, they can freely use such reserved identifiers, but by the letter of the standard, such uses are still ill-formed.
Historically, the purpose of making the use of such tokens "undefined behavior" is that compilers are free to attach any meaning they want to any such tokens that are not defined within the C standard. For example, on some embedded processors, using __xdata as a storage class for a variable will ask that it be stored in an area of RAM which is slower to access than the normal variable-storage area, but is much larger. On typical processors of that family, storage for "normal" variables would be limited to about 100 bytes, but storage for xdata variables may be much larger, up to 64K. The standard says basically nothing about what compilers are allowed to do with such directives, although such tokens are typically ignored within code that is disabled using #if or similar directives (I'm not sure if the standard mandates this behavior, though I'm unaware of compilers violating it).
Some libraries' header files will start their own internal identifiers with something that starts with two underscores but includes a pattern that's unlikely to be used by a compiler for any purpose (e.g. version 23 of the Foozle library might prefix its identifiers with __FZ23). It would be perfectly legitimate for a future compiler to use identifiers starting with __FZ23 for other purposes, and if that were to happen the Foozle library would need to be changed to use something else. If, however, a major compiler upgrade would likely necessitate rewrites of the Foozle library for other reasons anyway, that risk may be acceptable compared to the risk of identifiers conflicting with outside code.
Note also that some project header files which are targeted toward a processor that requires __ directives may conditionally define macros with those names when compiled for other processors, for example:
#ifndef USE_XDATA
#define __XDATA
#endif
though a somewhat better pattern would generally be:
#ifdef USE_XDATA
#define XDATA __XDATA
#else
#define XDATA
#endif
When writing new code, the latter pattern is often better, but the former pattern may sometimes be useful when adapting existing code written on a platform that requires __XDATA so that it may be used both on platforms that use/require that directive and on platforms that do not.
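A short usage sketch of the second pattern may help. USE_XDATA is the build switch from the text and is left undefined here, so XDATA expands to nothing and the code compiles on an ordinary host. The names sample_buffer and store_and_load are invented for illustration, and where the qualifier must sit in a declaration is compiler-specific, so treat the placement below as an assumption.

```cpp
// USE_XDATA would be defined by the build for the embedded target only.
#ifdef USE_XDATA
#define XDATA __XDATA
#else
#define XDATA
#endif

// One declaration that works whether or not the platform has the directive.
static XDATA unsigned char sample_buffer[256];

// Writes a byte into the buffer and reads it back (index wraps at 256).
unsigned char store_and_load(unsigned index, unsigned char value) {
    sample_buffer[index % 256u] = value;
    return sample_buffer[index % 256u];
}
```

Because application code only ever spells XDATA, the reserved __XDATA token appears in exactly one place, which is the main advantage over redefining __XDATA itself.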
Whether or not it is legal is a matter of local law. Whether it means anything, and if so, what, is a matter for the language definition. When you use a name that's reserved to the implementation the behavior of your program is undefined. That means that the language definition does not tell you what the program does. Nothing more, nothing less. If the compiler you're using documents what a particular reserved identifier does, then you can use that identifier with that compiler. If you hunt through headers and guess what various un-documented identifiers mean you might be able to use them, but don't be surprised if your code breaks when a subsequent update changes something.
Don't get hung up on __cplusplus. It's core language, and the stuff about double underscores, etc. is library. If that's not convincing, just consider it a glitch. You can use __cplusplus in C++ programs; its meaning is well defined.
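As a small illustration of __cplusplus having a well-defined meaning, the sketch below maps its value to a readable name. The numeric values are the ones published with each standard revision; the function name cxx_standard_name is made up for this example. Note that some compilers need an extra switch to report the true value (MSVC reports 199711L unless /Zc:__cplusplus is given).

```cpp
// Maps the standard-defined __cplusplus macro to a human-readable name.
inline const char* cxx_standard_name() {
#if __cplusplus >= 202002L
    return "C++20 or later";
#elif __cplusplus >= 201703L
    return "C++17";
#elif __cplusplus >= 201402L
    return "C++14";
#elif __cplusplus >= 201103L
    return "C++11";
#else
    return "C++98/03";
#endif
}
```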
I know there are differences in the source code between C and C++ programs - this is not what I'm asking about.
I also know this will vary from CPU to CPU and OS to OS, depending on compiler.
I'm teaching myself C++ and I've seen numerous references to libraries that can be used by both languages. This has started me thinking - are there significant differences between the binary executables of the two languages?
For libraries to be easily used by both, I would think they'd have to be similar on an executable level.
Are there many situations where a person could examine an executable file and tell whether it was created from C or C++ source code? Or would the binaries be pretty similar?
In most cases, yes, it's pretty easy. Here are just a few clues that I've seen often enough to remember them easily:
A C++ program will typically end up with at least a few visible symbols that have been mangled.
A C++ program will typically have at least a few calls to virtual functions, which are quite distinctive compared to the code you'll usually see in C.
Many C++ compilers implement a calling convention for C++ that gives special consideration to passing the this pointer into C++ member functions. Again, since the this pointer simply doesn't exist in C, you'll rarely see a direct analog (though in some cases, they will use the same convention to pass some other pointer, so you need to be careful about this one).
An executable is an executable is an executable, no matter what language it's written in. If it's built for the target architecture, it'll run on the architecture.
The (arguably) most important difference between C and C++-compiled code, and the one relevant to libraries that can be linked both against C and C++ executables, is that of name mangling. Basically: when a library is compiled, it exports a set of symbols (function names, exported variables, etc.) that executables linked against the library can use. How these symbols are named is fairly compiler/linker-specific, and if the subsequent executable is linked using a linker with an incompatible convention, then symbols won't resolve correctly. In addition, C and C++ have slightly different conventions. The Wikipedia article linked above has more of the details; suffice to say, when declaring exported symbols in a header file, you'll usually see a construction like:
#ifdef __cplusplus
extern "C" {
#endif
/* exported declarations here */
#ifdef __cplusplus
}
#endif
__cplusplus is a preprocessor macro only defined when compiling C++ code. The idea here is that, when using the header in C++, the compiler is instructed to use the C way of naming exported symbols (inside the extern "C" { /* foo */ } block), so the library can be linked both in C and C++ correctly.
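Here is a minimal single-file sketch of that construction with one declaration filled in. The function name mylib_add and the idea of a header called mylib.h are hypothetical; in a real project the first part would sit in the header and the definition in the library's source file.

```cpp
// --- what would live in the shared header (hypothetical: mylib.h) ---
#ifdef __cplusplus
extern "C" {
#endif

int mylib_add(int a, int b);  /* exported under an unmangled C name */

#ifdef __cplusplus
}
#endif

// --- what would live in the library's source file ---
// The earlier declaration already gave this function C language linkage,
// so the definition keeps it without repeating extern "C".
int mylib_add(int a, int b) {
    return a + b;
}
```

Compiled as C++, the object file should export the plain symbol name rather than a mangled one, which you can check with nm on most Unix-like systems.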
I think I could tell if something is C++ or C from reading the disassembled binary code [for processor architectures that I'm familiar with, x86, x86_64 and ARM]. But in reality, there isn't much difference, you'd have to look pretty hard to know for sure.
Signs to look for are "indirect calls" (function pointer calls via a table) and this-pointers. Although C can have pointer to struct arguments and will often use function pointers, it's not usually set up in the way that C++ does it. Also, you'll notice, sometimes, that the compiler takes a pointer to a struct and adds a small offset - that's removing the outer layer of an inherited class. This CAN happen in C as well, but it won't be as common/distinctive.
Looking just at the binary [unless you can "do disassembly in your head"] would be a lot harder, especially if it's been stripped of symbols. That's like the guy who could tell you what piece of classical music was on an old vinyl record from looking at the tracks [with the label hidden]: not something most people can do, even if they are "good".
In practice, a C program (or a C++ program) is rarely only pure standard C (or C++) (for instance the C99 standard has no means to scan a directory). So programs use additional libraries.
On Linux, most binaries are dynamically linked. Use the ldd command to find out.
If the binary is linked to the libstdc++ library, the source code is likely C++.
If only the libc.so library is linked, the source code is probably only C (but you could statically link the libstdc++.a library).
You can also use tools working on binary files (e.g. objdump, readelf, strings, nm on Linux ....) to find more about them.
The code generated by C and C++ compilers is generally the same code. There are two important differences:
Name mangling: Each function and global variable becomes a symbol at compile time. In C these symbols' names are the same as their names in your source code. In C++ they are mangled to encode extra information such as namespaces and parameter types, which is what makes function overloading possible.
Calling conventions: If you call a method in C++ the this-pointer is passed as a hidden first parameter. Other conventions might also be different, such as call by reference, which does not exist in C.
You can use a block such as this to let the C++ compiler generate code compatible with C:
extern "C" {
/* code */
}
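The "hidden first parameter" point can be sketched by writing the same operation twice: once as a member function and once as a C-style function with the object pointer spelled out. The names Counter, add, and counter_add are invented for this example, and the claim that both compile to similar code is a general tendency on common ABIs, not a guarantee.

```cpp
struct Counter {
    int value;

    // Member function: the compiler passes a hidden `this` pointer.
    int add(int amount) {
        value += amount;
        return value;
    }
};

// Roughly the C-compatible equivalent: the object pointer is explicit,
// and extern "C" keeps the symbol name unmangled.
extern "C" int counter_add(Counter* self, int amount) {
    self->value += amount;
    return self->value;
}
```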
While compiling a multithreaded program we use gcc like below:
gcc -lpthread -D_REENTRANT -o someprogram someprogram.c
What exactly is the flag -D_REENTRANT doing here?
Defining _REENTRANT causes the compiler to use thread safe (i.e. re-entrant) versions of several functions in the C library.
You can search your header files to see what happens when it's defined.
Excerpt from the libc 8.2 manual:
Macro: _REENTRANT
Macro: _THREAD_SAFE
These macros are obsolete. They have the same effect as defining
_POSIX_C_SOURCE with the value 199506L.
Some very old C libraries required one of these macros to be defined
for basic functionality (e.g. getchar) to be thread-safe.
We recommend you use _GNU_SOURCE in new programs. If you don’t specify
the ‘-ansi’ option to GCC, or other conformance options such as
-std=c99, and don’t define any of these macros explicitly, the effect is the same as defining _DEFAULT_SOURCE to 1.
When you define a feature test macro to request a larger class of
features, it is harmless to define in addition a feature test macro
for a subset of those features. For example, if you define
_POSIX_C_SOURCE, then defining _POSIX_SOURCE as well has no effect. Likewise, if you define _GNU_SOURCE, then defining either
_POSIX_SOURCE or _POSIX_C_SOURCE as well has no effect.
JayM replied:
Defining _REENTRANT causes the compiler to use thread safe (i.e. re-entrant) versions of several functions in the C library.
You can search your header files to see what happens when it's defined.
Since OP and I were both interested in the question, I decided to actually post the answer. :) The following things happen with _REENTRANT on Mac OS X 10.11.6:
<math.h> gains declarations for lgammaf_r, lgamma_r, and lgammal_r.
On Linux (Red Hat Enterprise Server 5.10), I see the following changes:
<unistd.h> gains a declaration for the POSIX 1995 function getlogin_r.
So it seems like _REENTRANT is mostly a no-op, these days. It might once have declared a lot of new functions, such as strtok_r; but these days those functions are mostly mandated by various decades-old standards (C99, POSIX 95, POSIX.1-2001, etc.) and so they're just always enabled.
I have no idea why the two systems I checked avoid declaring lgamma_r resp. getlogin_r when _REENTRANT is not #defined. My wild guess is that this is just historical cruft that nobody ever bothered to go through and clean up.
Of course my observations on these two systems might not generalize to all systems your code might ever encounter. You should definitely still pass -pthread to the compiler (or, less good but okay, -lpthread -D_REENTRANT) whenever your program requires pthreads.
In multithreaded programs, you tell the compiler that you need this feature by defining the _REENTRANT macro before any #include lines in your program. This does three things, and does them so elegantly that usually you don’t even need to know what was done:
Some functions get prototypes for a re-entrant safe equivalent.
These are normally the same function name, but with _r appended so
that, for example, gethostbyname is changed to gethostbyname_r.
Some stdio.h functions that are normally implemented as macros
become proper re-entrant safe functions.
The variable errno, from errno.h, is changed to call a function,
which can determine the real errno value in a multithread safe way.
Taken from Beginning Linux Programming
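To show what a "_r" variant buys you, here is a sketch using strtok_r, the re-entrant counterpart of strtok. It is a POSIX function rather than ISO C, so this assumes a POSIX system whose <string.h> declares it (on current glibc it is declared without defining _REENTRANT). The function name split_pairs is made up for the example. Because each tokenizer keeps its state in a caller-supplied pointer, two of them can be nested, which strtok's single hidden static state does not allow.

```cpp
#include <string.h>  // strtok_r (POSIX, assumed available)
#include <string>
#include <vector>

// Splits "a=1;b=2" into {"a", "1", "b", "2"} with two nested tokenizers.
inline std::vector<std::string> split_pairs(std::string text) {
    std::vector<std::string> out;
    char* outer_state = nullptr;  // state for the ';' tokenizer
    for (char* pair = strtok_r(&text[0], ";", &outer_state);
         pair != nullptr;
         pair = strtok_r(nullptr, ";", &outer_state)) {
        char* inner_state = nullptr;  // independent state for the '=' tokenizer
        for (char* field = strtok_r(pair, "=", &inner_state);
             field != nullptr;
             field = strtok_r(nullptr, "=", &inner_state)) {
            out.emplace_back(field);
        }
    }
    return out;
}
```

The same property is what makes the function safe to call from several threads at once, as long as each thread uses its own state pointer and its own buffer.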
It simply defines _REENTRANT for the preprocessor. Somewhere in the associated code, you'll probably find #ifdef _REENTRANT or #if defined(_REENTRANT) in at least a few places.
Also note that the name "_REENTRANT" is in the implementer's name space (any name starting with an underscore followed by another underscore or a capital letter is), so defining it means you've stepped outside what the standard defines (at least the C or C++ standards).
I was really bothered by the inclusion of C stdlib functions in the global namespace and ended up writing things like ::snprintf or ::errno or struct ::stat, etc, to differentiate from some of my own functions in the enclosing namespace where those C stdlib functions were used.
Then I discovered that there is a way to declare every C stdlib function in the std namespace (as STL): just include < c(lib) > instead of < (lib).h > so I've edited my code to use those new "c for c++" includes.
On Debian/GCC 4.3.4 I had 2 problems:
1) #error This file requires compiler and library support for the upcoming \
ISO C++ standard, C++0x. This support is currently experimental, and must be \
enabled with the -std=c++0x or -std=gnu++0x compiler options.
2) using -std=c++0x my program compiles just fine, but I have not modified ::snprintf or ::time, etc.. every C stdlib function is still on the global namespace =(! (no, I'm not using namespace std not even once)
Any thoughts?
For example, how do I stop the C stdlib from invading my global namespace? Is < c(lib) > an experimental feature of the next C++ standard, or can it be used safely right now?
Then I have another question that perhaps deserves a new post: there is no cmalloc. I know the whole history about new replacing malloc and why, but for simple plain byte buffers there is no C++ equivalent of realloc. I know that memory blocks and reallocation are implementation/OS specific, but when there are contiguous free blocks of memory realloc works better than a new buffer allocation and memory copy.
Thanks =)!
For your first question, it depends on which headers you are trying to include. Most of the C headers are available in the c(lib) form in the existing version of C++. A few aren't, and may be added in C++0x. So if you tried to include any of those, you might have gotten that error.
Second, all the headers of this form guarantee that the functions will be made available in the std namespace. But they do not promise to leave the global namespace alone. Often, they put the symbols in both namespaces.
I'm not sure why ::snprintf bothers you more than std::snprintf though. You have to specify a prefix in both cases.
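For reference, a minimal sketch of the std-qualified form. Including <cstdio> guarantees std::snprintf (from C++11 on); whether the unqualified global ::snprintf is also visible after that include is left to the implementation. The helper name format_count is hypothetical.

```cpp
#include <cstdio>
#include <string>

// Formats an integer using the std-qualified C library function.
inline std::string format_count(int count) {
    char buf[32];
    int n = std::snprintf(buf, sizeof buf, "count=%d", count);
    if (n < 0) {
        return std::string();  // encoding error reported by snprintf
    }
    // snprintf NUL-terminates whenever the size is non-zero, so buf is
    // always a valid C string here (possibly truncated).
    return std::string(buf);
}
```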
As for realloc, a C++ equivalent doesn't exist, probably because it's more trouble than it's worth, especially with C++'s more complicated semantics for copying objects. (In particular, if you try to use it, don't store any non-POD objects in the buffer, as realloc will just memcpy them to a newly allocated buffer if necessary, which will break non-POD objects.)
Of course, you can still use the old realloc from C by including its header. But I'd say you're probably better off using new/delete, and simply figuring out a sensible buffer allocation strategy.
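If you do go with realloc for a plain byte buffer, a sketch like the following keeps the usual pitfall in check: realloc returns a null pointer on failure and leaves the old block valid, so the result must not be assigned straight back to the only pointer you hold. The helper name grow_buffer is made up, and the sketch assumes new_size is greater than zero, since a zero-size request has implementation-defined results.

```cpp
#include <cstddef>
#include <cstdlib>

// Resizes a raw byte buffer obtained from malloc/realloc.
// On failure the old block is untouched and still owned by the caller.
inline bool grow_buffer(unsigned char*& data, std::size_t new_size) {
    void* p = std::realloc(data, new_size);
    if (p == nullptr) {
        return false;
    }
    data = static_cast<unsigned char*>(p);
    return true;
}
```

The buffer must eventually be released with std::free, never with delete, and it should only hold trivially copyable data, for the reason given above.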
The malloc() function, in standard C is not declared in the "<malloc.h>" header. It is declared in <stdlib.h>. Same for realloc() and free().
I don't know about C++, but instead of
#include <cmalloc>
try
#include <cstdlib>
c<lib>, which roughly encloses <lib>.h in a namespace std { }, is a standard feature of C++. See §17.4.1.2 if you have access to either standard.
This is not an experimental feature at all -- what header file is giving you the compatibility problems?
Using malloc et al. is fine, but be sure never to mix them with new/delete. (e.g. don't delete a malloc()'ed buffer.)
Some of the standard C <lib.h> headers have not yet been transferred to <clib>. You probably used <cstdint> or the like somewhere.
With the current standard you have the c libraries listed here. Note that <cstdint> is not part of it.
I didn't find a reference describing if and when <cstdint> will be part of c++, but if I try to include it, I also get an error message telling me I should use -std=c++0x, so I assume it is planned to be included in the next c++ standard.