Can I expect to see a performance hit for the casting in this:
enum class myEnum {A,B,C};
myArray[(int)myEnum::A] = 123;
Compared to this?
enum myEnum {A,B,C};
myArray[A] = 123;
I'm leaning towards the new style enum classes for the type safety, but don't want to do it at the expense of performance.
It depends whether the enum value used as an index is known at compile time or passed in a variable.
That is, myArray[(int)myEnum::A] shall not incur any penalty, but myArray[(int)e] might, depending on the physical representation of e (i.e., it might be necessary to "extend" it).
On the other hand, a simple extension is a trivial operation that is unlikely to ever show up as a performance issue: things like branch prediction (in conditionals) and caching matter much more at the low level in most applications, and at a higher level algorithms matter most.
Note: to avoid the extension issue in the runtime scenario, you could define the underlying type of myEnum to be the natural type expected for array indexing; I believe ptrdiff_t would be the most appropriate here, though it is a big integer.
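For instance, a minimal sketch of pinning the underlying type as suggested (the array and the set function are illustrative, not from the question):
#include <cstddef>

enum class myEnum : std::ptrdiff_t { A, B, C };

int myArray[3];

void set(myEnum e) {
    // The index already has the width the indexing arithmetic wants,
    // so no widening/extension is needed before the subscript.
    myArray[static_cast<std::ptrdiff_t>(e)] = 123;
}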
Of course it's ultimately up to the compiler, but it's hard to see why any reasonable compiler would generate different code in these two cases.
I've tested this with g++ 4.7.2 on Intel, and they compile to identical assembly code.
No, it's not likely that the casting should have any impact on performance. This is all resolved at compile time.
I would expect NOT, but it would depend on the compiler implementation. Why don't you try both approaches and time them? Try a few different compiler optimisation settings too, so you know the answer won't change when it comes to production builds.
I doubt there would be any assembly instructions at all for the cast, since the entire expression (int)myEnum::A can be evaluated at compile time.
But if you really want to know, make a pair of sample programs, and analyze both by dumping disassembly, and/or measuring performance.
Related
I once had a very subtle bug in one of the projects I had to maintain. Essentially it was doing something like this:
union Value {
    int64_t int64;
    int32_t int32[2];
    int16_t shorts[4];
    int8_t  chars[8];
    float   floats[2];
    double  float64;
};
Value v;
// in one place (not sure about exact code, it could be just memcpy):
v.shorts[0] = <some short value>;
v.shorts[1] = <some other short value>;
// in another place:
float f = v.floats[0];
Now, as far as the standard is concerned, this is simply UB. In practice, this could mean anything, but I can hardly imagine a reasonable implementation that would cause the code above to start World War III or disintegrate my PC. In real life, I can only imagine two things happening:
The compiler may screw up something with optimizations, not realizing that it deals with the same memory here. Pretty unlikely in this case, since writes and reads happened in totally different places.
Nothing bad really happens, and the float value is simply read bit-by-bit.
In practice, it was almost always case 2, except once. After running the program compiled with MSVC 2010 in release mode on about 100-150 input files, in one of the files it generated an incorrect value that differed in exactly one bit from what it was supposed to be according to common sense. That was a pretty significant bit, too, so instead of, say, 1.5, I got something like 117.9. I was able to trace it down to that exact read, and after fixing the code to adhere to strict aliasing rules, everything worked fine.
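(For reference, a standards-conforming way to do that kind of reinterpretation is to copy the bytes with std::memcpy; the sketch below is illustrative, not the project's actual fix.)
#include <cstdint>
#include <cstring>

// Copy the first four bytes of the 64-bit slot into a float instead of reading
// a differently-typed union member; memcpy may always inspect an object's
// bytes, so this does not violate strict aliasing. (Assumes the float occupies
// the first four bytes, as on the original little-endian target.)
float read_first_float(const std::int64_t& storage) {
    float f;
    std::memcpy(&f, &storage, sizeof f);
    return f;
}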
Now the question is, purely from a low-level point of view, what could have caused that? Some peculiarities in how the CPU handles floating-point values? Hardware caching specifics? Compiler quirks? And why was only one value wrong?
The hardware was some old 2-core 64-bit Intel CPU running a 32-bit Windows 7, if that is of any help. The program is a single-threaded console application, nothing fancy. The problem was 100% reproducible, the same input files always produced the same output, and it's always the same value that was wrong.
From the point of view of the Standard, the code v.shorts[0] = something; takes a pointer value of type "short*", adds zero, and uses the resulting pointer to store a value. I think the authors of C89 intended that quality implementations for which aliasing support would be useful would recognize the aliasing in this case, but nothing in the letter of the Standard requires it. Note that when the rules were included in C89, compilers were only expected to apply them on a very local level; further, the rules generally don't pose severe problems for programmers unless they're applied at a more far-reaching level. Unfortunately, some compilers aggressively seek to extend the reach of the rules as far as possible.
If you were to place each array within a separate structure within the union and then do something like:
v.floats.arr[0] = value;
v.floats.arr[1] = value;
v.floats = v.floats; // Compiler knows that float* may alter float members,
// and that writing member of union may alter other
// members
... now use other stuff
a compiler should hopefully recognize that the assignment to v.floats need not generate any code, but a conforming compiler must still regard it as proper notice that other members of the union may have been altered. Note that the pattern does not seem to be reliable in gcc as of 6.2, however; in some cases where an assignment is not required to generate any code, the compiler will ignore the assignment, including its aliasing implications, altogether. I don't see any reason to bend over backward working around gcc's broken behavior, though; simply use -fno-strict-aliasing guilt-free unless or until gcc's aliasing logic gets fixed.
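(For illustration, the wrapped layout alluded to above might look like the following; the wrapping itself is a sketch, only the member names are taken from the earlier snippet.)
#include <cstdint>

// Each array is wrapped in its own struct so that v.floats, v.shorts, etc. are
// assignable lvalues of struct type; the self-assignment v.floats = v.floats
// then acts as notice to the compiler that the union's contents may have been
// repurposed.
union Value {
    struct { std::int64_t arr[1]; } int64;
    struct { std::int32_t arr[2]; } int32;
    struct { std::int16_t arr[4]; } shorts;
    struct { std::int8_t  arr[8]; } chars;
    struct { float        arr[2]; } floats;
    struct { double       arr[1]; } float64;
};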
GCC, MSVC, LLVM, and probably other toolchains have support for link-time (whole program) optimization to allow optimization of calls among compilation units.
Is there a reason not to enable this option when compiling production software?
I assume that by "production software" you mean software that you ship to the customers / goes into production. The answers at Why not always use compiler optimization? (kindly pointed out by Mankarse) mostly apply to situations in which you want to debug your code (so the software is still in the development phase -- not in production).
6 years have passed since I wrote this answer, and an update is necessary. Back in 2014, the issues were:
Link-time optimization occasionally introduced subtle bugs; see for example Link-time optimization for the kernel. I assume this is less of an issue as of 2020. To safeguard against these kinds of compiler and linker bugs, have appropriate tests that check the correctness of the software you are about to ship.
Increased compile time. There are claims that the situation has significantly improved since 2014, for example thanks to slim objects.
Large memory usage. This post claims that the situation has drastically improved in recent years, thanks to partitioning.
As of 2020, I would try to use LTO by default on any of my projects.
This recent question raises another possible (but rather specific) case in which LTO may have undesirable effects: if the code in question is instrumented for timing, and separate compilation units have been used to try to preserve the relative ordering of the instrumented and instrumenting statements, then LTO has a good chance of destroying the necessary ordering.
I did say it was specific.
If you have well-written code, it should only be advantageous. You may hit a compiler/linker bug, but this goes for all types of optimisation; it is rare.
The biggest downside is that it drastically increases link time.
Apart from this, consider a typical example from an embedded system:
void function1(void) { /*Do something*/} //located at address 0x1000
void function2(void) { /*Do something*/} //located at address 0x1100
void function3(void) { /*Do something*/} //located at address 0x1200
With functions at predefined addresses, they can be called through pointers to those fixed addresses, like below:
((void (*)(void))0x1000)(); // expected to call function1
((void (*)(void))0x1100)(); // expected to call function2
((void (*)(void))0x1200)(); // expected to call function3
LTO can lead to unexpected behavior.
Updated:
In automotive embedded SW development, multiple parts of the SW are compiled and flashed onto separate sections.
Boot-loader, application(s), and application-configuration are independently flashable units. The boot-loader has special capabilities to update the application and the application-configuration. At every power-on cycle the boot-loader checks the SW application's and application-configuration's compatibility and consistency via hard-coded locations for SW versions, CRCs, and many more parameters. Linker-definition files are used to hard-code the locations of those variables and of some functions.
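(For illustration, one common way to pin a function to a fixed location with the GNU toolchain is a named section plus a matching entry in the linker-definition file; the section name and address below are hypothetical.)
/* Place the function into a dedicated section; the linker-definition file then
   locates that section at a fixed address, e.g. with a fragment such as
       .app_entry 0x00001000 : { *(.app_entry) }                               */
__attribute__((section(".app_entry")))
void application_entry(void)
{
    /* ... */
}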
Given that the code is implemented correctly, then link time optimization should not have any impact on the functionality. However, there are scenarios where not 100% correct code will typically just work without link time optimization, but with link time optimization the incorrect code will stop working. There are similar situations when switching to higher optimization levels, like, from -O2 to -O3 with gcc.
That is, depending on your specific context (like, age of the code base, size of the code base, depth of tests, are you starting your project or are you close to final release, ...) you would have to judge the risk of such a change.
One scenario where link-time-optimization can lead to unexpected behavior for wrong code is the following:
Imagine you have two source files, read.c and client.c, which you compile into separate object files. In the file read.c there is a function read that does nothing other than read from a specific memory address. The content at this address should be marked volatile, but unfortunately that was forgotten. From client.c the function read is called several times from the same function. Since read only performs a single read from the address, and there is no optimization across the boundaries of the read function, every call actually accesses the respective memory location. Consequently, every time read is called from client.c, the code in client.c gets a freshly read value from the address, just as if volatile had been used.
Now, with link-time optimization, the tiny function read from read.c is likely to be inlined wherever it is called from client.c. Due to the missing volatile, the compiler will then realize that the code reads several times from the same address, and may therefore optimize away the memory accesses. Consequently, the code starts to behave differently.
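(A minimal sketch of that scenario; the register address and the function names are made up.)
/* read.c -- the volatile qualifier that belongs on the register was forgotten. */
#include <stdint.h>

#define STATUS_REG ((uint32_t *)0x40001000)   /* should be (volatile uint32_t *) */

uint32_t read(void)
{
    return *STATUS_REG;
}

/* client.c */
uint32_t read(void);

void wait_until_ready(void)
{
    /* With separate compilation, every iteration calls read() and re-reads the
       register. With LTO, read() can be inlined and the repeated loads merged
       into one, so the loop may never observe the register changing. */
    while (read() == 0) {
    }
}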
Rather than mandating that all implementations support the semantics necessary to accomplish all tasks, the Standard allows implementations intended to be suitable for various tasks to extend the language by defining semantics in corner cases beyond those mandated by the C Standard, in ways that would be useful for those tasks.
An extremely popular extension of this form is to specify that cross-module function calls will be processed in a fashion consistent with the platform's Application Binary Interface without regard for whether the C Standard would require such treatment.
Thus, if one makes a cross-module call to a function like:
uint32_t read_uint32_bits(void *p)
{
    return *(uint32_t*)p;
}
the generated code would read the bit pattern in a 32-bit chunk of storage at address p, and interpret it as a uint32_t value using the platform's native 32-bit integer format, without regard for how that chunk of storage came to hold that bit pattern. Likewise, if a compiler were given something like:
uint32_t read_uint32_bits(void *p);

uint32_t f1bits, f2bits;

void test(void)
{
    float f;
    f = 1.0f;
    f1bits = read_uint32_bits(&f);
    f = 2.0f;
    f2bits = read_uint32_bits(&f);
}
the compiler would reserve storage for f on the stack, store the bit pattern for 1.0f to that storage, call read_uint32_bits and store the returned value, store the bit pattern for 2.0f to that storage, call read_uint32_bits and store that returned value.
The Standard provides no syntax to indicate that the called function might read the storage whose address it receives using type uint32_t, nor to indicate that the pointer the function was given might have been written using type float, because implementations intended for low-level programming had already extended the language to support such semantics without any special syntax.
Unfortunately, adding in Link Time Optimization will break any code that relies upon that popular extension. Some people may view such code as broken, but if one recognizes the Spirit of C principle "Don't prevent programmers from doing what needs to be done", the Standard's failure to mandate support for a popular extension cannot be viewed as intending to deprecate its usage if the Standard fails to provide any reasonable alternative.
LTO could also reveal edge-case bugs in code-signing algorithms. Consider a code-signing algorithm based on certain expectations about the TEXT portion of some object or module. Now LTO optimizes the TEXT portion away, or inlines stuff into it in a way the code-signing algorithm was not designed to handle. Worst case scenario, it only affects one particular distribution pipeline but not another, due to a subtle difference in which encryption algorithm was used on each pipeline. Good luck figuring out why the app won't launch when distributed from pipeline A but not B.
LTO support is buggy, and LTO-related issues have the lowest priority for compiler developers. For example: mingw-w64-x86_64-gcc-10.2.0-5 works fine with LTO, while mingw-w64-x86_64-gcc-10.2.0-6 segfaults with a bogus address. We only noticed because our Windows CI stopped working.
Please refer to the following issue as an example.
I'm interested in writing good code from the beginning instead of optimizing the code later. Sorry for not providing a benchmark; I don't have a working scenario at the moment. Thanks for your attention!
What are the performance gains of using FunctionY over FunctionX?
There is a lot of discussion on Stack Overflow about this already, but I'm in doubt about the case of accessing sub-members (recursively) as shown below. Will the compiler (say VS2008) optimize FunctionX into something like FunctionY?
void FunctionX(Obj * pObj)
{
    pObj->MemberQ->MemberW->MemberA.function1();
    pObj->MemberQ->MemberW->MemberA.function2();
    pObj->MemberQ->MemberW->MemberB.function1();
    pObj->MemberQ->MemberW->MemberB.function2();
    ...
    pObj->MemberQ->MemberW->MemberZ.function1();
    pObj->MemberQ->MemberW->MemberZ.function2();
}
void FunctionY(Obj * pObj)
{
    W * localPtr = pObj->MemberQ->MemberW;
    localPtr->MemberA.function1();
    localPtr->MemberA.function2();
    localPtr->MemberB.function1();
    localPtr->MemberB.function2();
    ...
    localPtr->MemberZ.function1();
    localPtr->MemberZ.function2();
}
In case none of the member pointers are volatile or pointers to volatile, and you don't have operator -> overloaded for any member in the chain, the two functions are the same.
The optimization you suggested is widely known as Common Subexpression Elimination and has been supported by the vast majority of compilers for many decades.
In theory, you save on the extra pointer dereferences, HOWEVER, in the real world, the compiler will probably optimize it out for you, so it's a useless optimization.
This is why it's important to profile first, and then optimize later. The compiler is doing everything it can to help you, you might as well make sure you're not just doing something it's already doing.
If the compiler is good enough, it should translate FunctionX into something similar to FunctionY.
But you can get different results on different compilers, and on the same compiler with different optimization flags.
With a "dumb" compiler FunctionY should be faster, and IMHO it is also more readable and quicker to write, so stick with FunctionY.
PS: you should take a look at a code style guide; normally member and function names start with a lower-case letter.
I want to use C with templates in an embedded environment, and I wanted to know what the cost of compiling a C program with a C++ compiler is.
I'm interested in knowing whether there will be more code than what the C compiler would generate.
Note that as the program is a C program, I expect to invoke the C++ compiler with exception and RTTI support disabled.
Thanks,
Vicente
The C++ compiler may take longer to compile the code (since it has to build data structures for overload resolution, it can't know ahead of time that the program doesn't use overloads), but the resulting binary should be quite similar.
Actually, one potentially important optimization difference involves the aliasing rules, which differ in their details between C and C++ (for instance, C99 has the restrict keyword, which standard C++ lacks). This isn't likely to affect code size much, but it could affect correctness and performance significantly.
There's probably no 'cost', assuming that the two compilers are of equivalent quality. The traditional objection to this is that C++ is much more complex and so it's more likely that a C++ compiler will have bugs in it.
Realistically, this is much less of a problem than it used to be, and I tend to do most of my embedded stuff now as a sort of horrible C/C++ hybrid, taking advantage of stronger typing and easier variable declaration rules without incurring RTTI or exception handling overheads. If you're taking a given compiler (GCC, etc.) and switching it from C to C++ mode, then much of what you have to worry about is common to the two languages anyway.
The only way to really know is for you to try it with the compilers you care about. A quick experiment here on a trivial program shows that the output is the same.
Your program will be linked against the C++ runtime library, not the C one, and the C++ runtime is larger as well.
Also, there are a couple of differences between C and C++ (aliasing was already pointed out), so it may happen that your C code simply does not compile as C++.
If it's C, then you can expect it will be exactly the same.
To elaborate: both C and C++ will forward their parse tree into the same backend that generates code (possibly via another intermediate representation), which means that if the code is functionally identical, the output will look the same (or nearly so).
Templates do "inflate" code, but you would otherwise have to write the same code or use macros to the same effect, so this is no "extra cost". Contrarily, the compiler may be able to optimize templates better in some cases.
A C++ compiler cannot compile C code. It can only compile C++, including a very ugly language which is the intersection of C and C++ and the worst of both worlds. Some C code will fail to compile at all on a C++ compiler, for example:
char *s = malloc(len+1);
While other C code will be compiled to the wrong thing, for example:
sizeof 'a'
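In C a character literal has type int, so sizeof 'a' is typically 4, while in C++ it has type char, so it is 1. A quick way to see the difference (this snippet compiles as both C and C++):
#include <stdio.h>

int main(void)
{
    /* Prints sizeof(int) (typically 4) when built as C, but 1 when built as C++. */
    printf("%zu\n", sizeof 'a');
    return 0;
}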
I have found this extraordinary document, the Technical Report on C++ Performance. It has all the answers I was looking for.
Thanks to all that have answered this question.
There will be more code because that is what templates do. They are a stencil for generating (more) code.
Otherwise, you should see no differences between compiling a C program with a C compiler versus compiling with a C++ compiler.
If you don't use any of the extra "features" there should be no difference in size or behavior of the end result.
Although the C code will likely compile to something very similar (assuming there's no exception support enabled), using templates can very rapidly result in large binaries - you have to be careful, because every template instantiation can recursively result in other templates being implicitly instantiated as well.
There was a time when the C++ compiler linked in a bunch of C++ stuff even if the program didn't use it, and you would see binaries that were 10 to 100 times larger than what the C compiler would produce. I think a lot of that has gone away.
Since this is tagged "embedded", I assume it's for embedded systems?
In that case, the major difference between C and C++ is the way C++ treats structs. All structs will be treated like classes, meaning they will have constructors.
All instances of structs/classes declared at file scope or as static will then have their constructors called before main() is executed, in a manner similar to static initialization, which you already have whether the code is C or C++.
All these constructor calls at boot-up are a major efficiency disadvantage for embedded systems, where the code resides in NVM and not in RAM. Just like static initialization, they create an ugly, undesired workload peak at the start of the program, where values from NVM are copied into RAM.
There are ways around static initialization in C/C++: most embedded compilers have an option to disable it. But since that is a non-standard setup, all code using statics would then have to be written so that it never relies on initialization values and instead sets all static variables at runtime.
But as far as I know, there is no way around calling constructors, without violating the standard.
EDIT:
Here is source code executed in one such C++ system, Freescale HCS08 Codewarrior 6.3. This code is injected in the user program after static initialization, but before main() is executed:
static void Call_Constructors(void) {
    int i;
    ...
    i = (int)(_startupData.nofInitBodies - 1);
    while (i >= 0) {
        (&_startupData.initBodies->initFunc)[i](); /* call C++ constructors */
        i--;
    }
    ...
At the very least, this overhead code must be executed at program startup, no matter how efficient the compiler is at converting constructors into static initialization.
C++ runtime start-up differs slightly from C start-up because it must invoke the constructors for global static objects before main() is called. This call loop is trivial and should not add much.
In the case of C++ code that is also entirely C compilable no static constructors will be present so the loop will not iterate.
In most cases apart from that, you will normally see no significant difference, in C++ you only pay for what you use.
The following is an excerpt from Bjarne Stroustrup's book, The C++ Programming Language:
Section 4.6:
Some of the aspects of C++’s fundamental types, such as the size of an int, are implementation-defined (§C.2). I point out these dependencies and often recommend avoiding them or taking steps to minimize their impact. Why should you bother? People who program on a variety of systems or use a variety of compilers care a lot because if they don’t, they are forced to waste time finding and fixing obscure bugs. People who claim they don’t care about portability usually do so because they use only a single system and feel they can afford the attitude that “the language is what my compiler implements.” This is a narrow and shortsighted view. If your program is a success, it is likely to be ported, so someone will have to find and fix problems related to implementation-dependent features. In addition, programs often need to be compiled with other compilers for the same system, and even a future release of your favorite compiler may do some things differently from the current one. It is far easier to know and limit the impact of implementation dependencies when a program is written than to try to untangle the mess afterwards.
It is relatively easy to limit the impact of implementation-dependent language features.
My question is: How to limit the impact of implementation-dependent language features? Please mention implementation-dependent language features then show how to limit their impact.
A few ideas:
Unfortunately you will have to use macros to avoid some platform-specific or compiler-specific issues. You can look at the headers of the Boost libraries to see how cumbersome this can quickly become; for example look at the files:
boost/config/compiler/gcc.hpp
boost/config/compiler/intel.hpp
boost/config/platform/linux.hpp
and so on
The integer types tend to be messy among different platforms; you will have to define your own typedefs or use something like Boost's cstdint.hpp
If you decide to use any library, then check that the library is supported on the given platform
Use the libraries with good support and clearly documented platform support (for example Boost)
You can abstract yourself from some C++ implementation specific issues by relying heavily on libraries like Qt, which provide an "alternative" in sense of types and algorithms. They also attempt to make the coding in C++ more portable. Does it work? I'm not sure.
Not everything can be done with macros. Your build system will have to be able to detect the platform and the presence of certain libraries. Many would suggest autotools for project configuration; I, on the other hand, recommend CMake (a rather nice language, and no more M4)
Endianness and alignment might be an issue if you do some low-level meddling (i.e. reinterpret_cast and the like; "friends" was a bad choice of word in a C++ context).
Throw in a lot of warning flags for the compiler; for gcc I would recommend at least -Wall -Wextra. But there is much more; see the compiler's documentation or this question.
You have to watch out for everything that is implementation-defined and implementation-dependent. If you want the truth, only the truth, and nothing but the truth, then go to the ISO standard.
Well, the variable-size issue mentioned is a fairly well-known one, with the common workaround of providing typedef'ed versions of the basic types that have well-defined sizes (normally advertised in the typedef name). This is done using preprocessor macros to give different code visibility on different platforms. E.g.:
#ifdef __WIN32__
typedef int int32;
typedef char char8;
//etc
#endif
#ifdef __MACOSX__
//different typedefs to produce same results
#endif
Other issues are normally solved in the same way too (i.e. using preprocessor tokens to perform conditional compilation)
The most obvious implementation dependency is the size of integer types. There are many ways to handle this. The most obvious way is to use typedefs to create ints of the various sizes:
typedef signed short int16_t;
typedef unsigned short uint16_t;
The trick here is to pick a convention and stick to it. Which convention is the hard part: INT16, int16, int16_t, t_int16, Int16, etc. C99 has the stdint.h header, which uses the int16_t style. If your compiler has this header, use it.
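For example (a minimal sketch):
#include <cstdint>   // <stdint.h> in C or with older C++ toolchains

std::uint16_t crc;         // exactly 16 bits wherever the exact-width type exists
std::int32_t  file_offset; // exactly 32 bits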
Similarly, you should be pedantic about using other standard defines such as size_t, time_t, etc.
The other trick is knowing when not to use these typedefs. A loop control variable used to index an array should just use a raw int so the compiler will generate the best code for your processor. for (int32_t i = 0; i < x; ++i) could generate a lot of needless code on a 64-bit processor, just like using int16_t would on a 32-bit processor.
A good solution is to use common headers that define typedef'ed types as necessary.
For example, including sys/types.h is an excellent way to deal with this, as is using portable libraries.
There are two approaches to this:
define your own types with a known size and use them instead of built-in types (like typedef int int32 #if-ed for various platforms)
use techniques which are not dependent on the type size
The first is very popular; however the second, when possible, usually results in cleaner code. This includes:
do not assume a pointer can be cast to int
do not assume you know the byte size of individual types, always use sizeof to check it
when saving data to files or transferring them across network, use techniques which are portable across changing data sizes (like saving/loading text files)
One recent example of this is writing code which can be compiled for both x86 and x64 platforms. The dangerous part here is pointer and size_t size: be prepared for it to be 4 or 8 bytes depending on the platform. When casting pointers or taking pointer differences, never cast to int; use intptr_t and similar typedef'ed types instead.
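For example (a sketch; the function is purely illustrative):
#include <cstddef>
#include <cstdint>

void example(const char* begin, const char* end)
{
    std::ptrdiff_t count = end - begin;                            // pointer difference, not int
    std::intptr_t  addr  = reinterpret_cast<std::intptr_t>(begin); // pointer as integer, not int
    (void)count;
    (void)addr;
}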
One of the key ways of avoiding dependency on particular data sizes is to read and write persistent data as text, not binary. If binary data must be used, then all read/write operations must be centralised in a few methods, and approaches like the typedefs already described here should be used.
A second thing you can do is enable all of your compiler's warnings. For example, using the -pedantic flag with g++ will warn you of lots of potential portability problems.
If you're concerned about portability, things like the size of an int can be determined and dealt with without much difficulty. A lot of C++ compilers also support C99 features like the fixed-width integer types: int8_t, uint8_t, int16_t, uint32_t, etc. If yours doesn't support them natively, you can always include <cstdint> or <sys/types.h>, which, more often than not, has them typedef'ed. <limits.h> has the minimum and maximum values for all the basic types.
The standard only guarantees the minimum size of a type, which you can always rely on: sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long). char must be at least 8 bits. short and int must be at least 16 bits. long must be at least 32 bits.
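Those guarantees can even be checked at compile time; a sketch using C++11 static_assert:
#include <climits>

static_assert(CHAR_BIT >= 8,                  "char has at least 8 bits");
static_assert(sizeof(short) * CHAR_BIT >= 16, "short has at least 16 bits");
static_assert(sizeof(int)   * CHAR_BIT >= 16, "int has at least 16 bits");
static_assert(sizeof(long)  * CHAR_BIT >= 32, "long has at least 32 bits");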
Other things that might be implementation-defined include the ABI and name-mangling schemes (the behavior of extern "C++" specifically), but unless you're working with more than one compiler, that's usually a non-issue.
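(If you do need symbols with stable, unmangled names across compilers, the usual tool is C linkage; a sketch:)
// Declared with C linkage, so the symbol name is not subject to C++ name
// mangling and can be called from C or from code built with another compiler.
extern "C" int add_numbers(int a, int b)
{
    return a + b;
}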
The following is also an excerpt from Bjarne Stroustrup's book, The C++ Programming Language:
Section 10.4.9:
No implementation-independent guarantees are made about the order of construction of nonlocal objects in different compilation units. For example:
// file1.c:
Table tbl1;
// file2.c:
Table tbl2;
Whether tbl1 is constructed before tbl2 or vice versa is implementation-dependent. The order isn’t even guaranteed to be fixed in every particular implementation. Dynamic linking, or even a small change in the compilation process, can alter the sequence. The order of destruction is similarly implementation-dependent.
A programmer may ensure proper initialization by implementing the strategy that the implementations usually employ for local static objects: a first-time switch. For example:
class Zlib {
    static bool initialized;
    static void initialize() { /* initialize */ initialized = true; }
public:
    // no constructor
    void f()
    {
        if (initialized == false) initialize();
        // ...
    }
    // ...
};
If there are many functions that need to test the first-time switch, this can be tedious, but it is often manageable. This technique relies on the fact that statically allocated objects without constructors are initialized to 0. The really difficult case is the one in which the first operation may be time-critical so that the overhead of testing and possible initialization can be serious. In that case, further trickery is required (§21.5.2).
An alternative approach for a simple object is to present it as a function (§9.4.1):
int& obj() { static int x = 0; return x; } // initialized upon first use
First-time switches do not handle every conceivable situation. For example, it is possible to create objects that refer to each other during construction. Such examples are best avoided. If such objects are necessary, they must be constructed carefully in stages.