C++ class function in assembly - c++

Hello Community
I am look at C++ assembly, I have compiled a benchmark from the PARSEC suite and I am having difficulty knowing how do they name the class attribute functions in assembly language. for example if I have a class with some functions to manipulate it, in cpp we call them like test.increment();
After some investigation I found out that this function is
atomic_load_acq_ptr
represented as:
_ZL19atomic_load_acq_intPVj
in assembly, or at least this is what I have found out.
Let me know if I am wrong!
Is there some fixed rule for the mapping? or are they random?
Thanks

It's called name mangling, is necessary because of overloads and templates and such (i.e. the plain chars-and-numbers name isn't enough to identify a chunk of code unambiguously; embedding spaces or <> or :: in names usually isn't legal; copying the additional information in uncondensed, human-readable form would be wasteful), and it therefore depends on types, arity, etc.
The exact scheme can vary, but usually each compiler is self-consistent for a relatively long time (sometimes even several compilers can settle for one way).

That's called name mangling.. It is compiler dependant. No standard way, sorry :)

C++ allows function overloading, this means that one can have two functions with the same name but different parameters. Since your binary formats do not understand type this is a proble. The way that this is worked around is to use a scheme called name mangling. This adds a whole function of type information to the name used in the source file and ensures one calls the correct overload.
The extra letters etc that are added are governed by the particular Application Binary Interface (ABI) being used. Different compilers (and sometimes even different versions) may use different ABIs.

Yes there's a standard method for creating these symbols known as name mangling.

Related

Why can't C functions be name-mangled?

I had an interview recently and one question asked was what is the use of extern "C" in C++ code. I replied that it is to use C functions in C++ code as C doesn't use name-mangling. I was asked why C doesn't use name-mangling and to be honest I couldn't answer.
I understand that when the C++ compiler compiles functions, it gives a special name to the function mainly because we can have overloaded functions of the same name in C++ which must be resolved at compile time. In C, the name of the function will stay the same, or maybe with an _ before it.
My query is: what's wrong with allowing the C++ compiler to mangle C functions also? I would have assumed that it doesn't matter what names the compiler gives to them. We call functions in the same way in C and C++.
It was sort of answered above, but I'll try to put things into context.
First, C came first. As such, what C does is, sort of, the "default". It does not mangle names because it just doesn't. A function name is a function name. A global is a global, and so on.
Then C++ came along. C++ wanted to be able to use the same linker as C, and to be able to link with code written in C. But C++ could not leave the C "mangling" (or, lack there of) as is. Check out the following example:
int function(int a);
int function();
In C++, these are distinct functions, with distinct bodies. If none of them are mangled, both will be called "function" (or "_function"), and the linker will complain about the redefinition of a symbol. C++ solution was to mangle the argument types into the function name. So, one is called _function_int and the other is called _function_void (not actual mangling scheme) and the collision is avoided.
Now we're left with a problem. If int function(int a) was defined in a C module, and we're merely taking its header (i.e. declaration) in C++ code and using it, the compiler will generate an instruction to the linker to import _function_int. When the function was defined, in the C module, it was not called that. It was called _function. This will cause a linker error.
To avoid that error, during the declaration of the function, we tell the compiler it is a function designed to be linked with, or compiled by, a C compiler:
extern "C" int function(int a);
The C++ compiler now knows to import _function rather than _function_int, and all is well.
It's not that they "can't", they aren't, in general.
If you want to call a function in a C library called foo(int x, const char *y), it's no good letting your C++ compiler mangle that into foo_I_cCP() (or whatever, just made up a mangling scheme on the spot here) just because it can.
That name won't resolve, the function is in C and its name does not depend on its list of argument types. So the C++ compiler has to know this, and mark that function as being C to avoid doing the mangling.
Remember that said C function might be in a library whose source code you don't have, all you have is the pre-compiled binary and the header. So your C++ compiler can't do "it's own thing", it can't change what's in the library after all.
what's wrong with allowing the C++ compiler to mangle C functions also?
They wouldn't be C functions any more.
A function is not just a signature and a definition; how a function works is largely determined by factors like the calling convention. The "Application Binary Interface" specified for use on your platform describes how systems talk to each other. The C++ ABI in use by your system specifies a name mangling scheme, so that programs on that system know how to invoke functions in libraries and so forth. (Read the C++ Itanium ABI for a great example. You'll very quickly see why it's necessary.)
The same applies for the C ABI on your system. Some C ABIs do actually have a name mangling scheme (e.g. Visual Studio), so this is less about "turning off name mangling" and more about switching from the C++ ABI to the C ABI, for certain functions. We mark C functions as being C functions, to which the C ABI (rather than the C++ ABI) is pertinent. The declaration must match the definition (be it in the same project or in some third-party library), otherwise the declaration is pointless. Without that, your system simply won't know how to locate/invoke those functions.
As for why platforms don't define C and C++ ABIs to be the same and get rid of this "problem", that's partially historical — the original C ABIs weren't sufficient for C++, which has namespaces, classes and operator overloading, all of which need to somehow be represented in a symbol's name in a computer-friendly manner — but one might also argue that making C programs now abide by the C++ is unfair on the C community, which would have to put up with a massively more complicated ABI just for the sake of some other people who want interoperability.
MSVC in fact does mangle C names, although in a simple fashion. It sometimes appends #4 or another small number. This relates to calling conventions and the need for stack cleanup.
So the premise is just flawed.
It's very common to have programs which are partially written in C and partially written in some other language (often assembly language, but sometimes Pascal, FORTRAN, or something else). It's also common to have programs contain different components written by different people who may not have the source code for everything.
On most platforms, there is a specification--often called an ABI [Application Binary Interface] which describes what a compiler must do to produce a function with a particular name which accepts arguments of some particular types and returns a value of some particular type. In some cases, an ABI may define more than one "calling convention"; compilers for such systems often provide a means of indicating which calling convention should be used for a particular function. For example, on the Macintosh, most Toolbox routines use the Pascal calling convention, so the prototype for something like "LineTo" would be something like:
/* Note that there are no underscores before the "pascal" keyword because
the Toolbox was written in the early 1980s, before the Standard and its
underscore convention were published */
pascal void LineTo(short x, short y);
If all of the code in a project was compiled using the same compiler, it
wouldn't matter what name the compiler exported for each function, but in
many situations it will be necessary for C code to call functions that were
compiled using other tools and cannot be recompiled with the present compiler
[and may very well not even be in C]. Being able to define the linker name
is thus critical to the use of such functions.
I'll add one other answer, to address some of the tangential discussions that took place.
The C ABI (application binary interface) originally called for passing arguments on the stack in reverse order (i.e. - pushed from right to left), where the caller also frees the stack storage. Modern ABI actually uses registers for passing arguments, but many of the mangling considerations go back to that original stack argument passing.
The original Pascal ABI, in contrast, pushed the arguments from left to right, and the callee had to pop the arguments. The original C ABI is superior to the original Pascal ABI in two important points. The argument push order means that the stack offset of the first argument is always known, allowing functions that have an unknown number of arguments, where the early arguments control how many other arguments there are (ala printf).
The second way in which the C ABI is superior is the behavior in case the caller and callee do not agree on how many arguments there are. In the C case, so long as you don't actually access arguments past the last one, nothing bad happens. In Pascal, the wrong number of arguments is popped from the stack, and the entire stack is corrupted.
The original Windows 3.1 ABI was based on Pascal. As such, it used the Pascal ABI (arguments in left to right order, callee pops). Since any mismatch in argument number might lead to stack corruption, a mangling scheme was formed. Each function name was mangled with a number indicating the size, in bytes, of its arguments. So, on 16 bit machine, the following function (C syntax):
int function(int a)
Was mangled to function#2, because int is two bytes wide. This was done so that if the declaration and definition mismatch, the linker will fail to find the function rather than corrupt the stack at run time. Conversely, if the program links, then you can be sure the correct number of bytes is popped from the stack at the end of the call.
32 bit Windows and onward use the stdcall ABI instead. It is similar to the Pascal ABI, except push order is like in C, from right to left. Like the Pascal ABI, the name mangling mangles the arguments byte size into the function name to avoid stack corruption.
Unlike claims made elsewhere here, the C ABI does not mangle the function names, even on Visual Studio. Conversely, mangling functions decorated with the stdcall ABI specification isn't unique to VS. GCC also supports this ABI, even when compiling for Linux. This is used extensively by Wine, that uses it's own loader to allow run time linking of Linux compiled binaries to Windows compiled DLLs.
C++ compilers use name mangling in order to allow for unique symbol names for overloaded functions whose signature would otherwise be the same. It basically encodes the types of arguments as well, which allows for polymorphism on a function-based level.
C does not require this since it does not allow for the overloading of functions.
Note that name mangling is one (but certainly not the only!) reason that one cannot rely on a 'C++ ABI'.
C++ wants to be able to interop with C code that links against it, or that it links against.
C expects non-name-mangled function names.
If C++ mangled it, it would not find the exported non-mangled functions from C, or C would not find the functions C++ exported. The C linker must get the name it itself expects, because it does not know it is coming from or going to C++.
Mangling the names of C functions and variables would allow their types to be checked at link time. Currently, all (?) C implementations allow you to define a variable in one file and call it as a function in another. Or you can declare a function with a wrong signature (e.g. void fopen(double) and then call it.
I proposed a scheme for the type-safe linkage of C variables and functions through the use of mangling back in 1991. The scheme was never adopted, because, as other have noted here, this would destroy backward compatibility.

Why doesn't g++ generate "raw" symbols?

From C we know what legal variable names are. The general regex for the legal names looks similar to [\w_](\w\d_)*.
Using dlsym we can load arbitrary strings, and C++ mangles names that include # in the ABI..
My question is: can arbitrary strings be used? The documentation on dlsym does not seem to mention anything.
Another question that came up appears to imply that it is fully possible to have arbitrary null-terminated symbols. This inquires me to ask the following question:
Why doesn't g++ emit raw function signatures, with name and parameter list, including namespace and class membership?
Here's what I mean:
namespace test {
class A
{
int myFunction(const int a);
};
}
namespace test {
int A::myFunction(const int a){return a * 2;}
}
Does not get compiled to
int ::test::A::myFunction(const int a)\0
Instead, it gets compiled to - on my 64 bit machine, using g++ 4.9.2 -
0000000000000000 T _ZN4test1A10myFunctionEi
This output is read by nm. The code was compiled using g++ -c test.cpp -o out
I'm sure this decision was pragmatically made to avoid having to make any changes to pre-existing C linkers (quite possibly even originated from cfront). By emitting symbols with the same set of characters the C linker is used to you don't have to possibly make any number of updates and can use the linker off the shelf.
Additionally C and C++ are widely portable languages and they wouldn't want to risk breaking a more obscure binary format (perhaps on an embedded system) by including unexpected symbols.
Finally since you can always demangle (with something like gc++filt for example) it probably didn't seem worth using a full text representation.
P.S. You would absolutely not want to include the parameter name in the function name: People will not be happy if renaming a parameter breaks ABI. It's hard enough to keep ABI compatibility already.
GCC is compliant with the Itanium C++ ABI. If your question is “Why does the Itanium C++ ABI require names to be mangled that way?” then the answer is probably
because its designers thought this would b a good idea and
shorter symbols make for smaller object files and faster dynamic linking.
For the second point, there is a pretty good explanation in Ulrich Drepper's article How To Write Shared Libraries.
Because of limitations on the exported names imposed by a linker (and that includes the OS's dynamic linker) - character set, length. The very phenomenon of mangling arose because of this.
Corollary: in media where these limitations don't exist (various VMs that use their own linkers: e.g. .NET, Java), mangling doesn't exist, either.
Each compiler that produces exports that are incompatible with others must use a different scheme. Because linker (static or dynamic) doesn't care about ABIs, all it cares about is identifiers.
You basically answered your own question:
The general regex for the legal names looks similar to [\w_](\w\d_)*.
From the beginning, C++ used preexisting (C) linker / loader technology. There is nothing "C++" about either ld, ld-linux.so etc.
So linking is limited to what was legal in C already. That does not include colons, parenthesis, ampersands, asteriskes, and whatever else you would need to encode C++ identifiers in plain text.
(In this answer I ignore that you made several typos in your example of ::test::A::void myFunction(const int a)).
This format is:
not programmer-specific; consider that all these are the same, so why confuse people:
int ::test::A::myFunction(const int)
int ::test::A::myFunction(int const)
int test::A::myFunction(int const)
int test :: A :: myFunction (int const)
and so on…
unambiguous
terse; no parameter names or other unnecessary decorations
easier to parse (notice that the length of each component is present as a number)
Meanwhile, I see no benefit at all in choosing a human-readable looks-like-C++ format for a C++ ABI. This stuff is supposed to be optimised for machines. Why would you make it less optimal for machines, in order to make it more optimal for humans? And probably failing at the latter whilst doing so.
You say that your compiler does not emit "raw symbols". I posit that it does precisely that.

Is it possible to strip type names from executable while keeping RTTI enabled?

I recently disabled RTTI on my compiler (MSVC10) and the executable size decreased significantly. By comparing the produced executables using a text editor, I found that the RTTI-less version contains much less symbol names, explaining the saved space.
AFAIK, those symbol names are only used to fill the type_info structure associated with each the polymorphic type, and one can programmatically access them calling type_info::name().
According to the standard, the format of the string returned by type_info::name() is unspecified. That is, no one can rely one it to do serious things portably. So, it should be possible for an implementation to always return an empty string without breaking anything, thus reducing the executable size without disabling RTTI support (so we can still use the typeid operator & compare type_info's objects safely).
But... is it possible ? I'm using MSVC10 and I've not found any option to do that. I can either disable completely RTTI (/GR-), or enable it with full type names (/GR). Does any compiler provide such an option?
So, it should be possible for an implementation to always return an empty string without breaking anything, thus reducing the executable size without disabling RTTI support (so we can still use the typeid operator & compare type_info's objects safely).
You are misreading the standard. The intent of making the return value from type_info::name() unspecified (other than a null-terminated binary string) was to give the implementers of the compiler/library/run-time environment free reign to implement the RTTI requirements as they see best. You, the programmer, have no say in how the Application Binary Interface (if there is one) is designed or implemented.
You're asking three different questions here.
The initial question asks whether there's any way to get MSVC to not generate names, or whether it's possible with other compilers, or, failing that, whether there's any way to strip the names out of the generated type_info without breaking things.
Then you want to know whether it would be possible to modify the MS ABI (presumably not too radically) so that it would be possible to strip the names.
Finally, you want to know whether it would be possible to design an ABI that didn't have names.
Question #1 is itself a complex question. As far as I know, there's no way to get MSVC to not generate names. And most other compilers are aimed at ABIs that specifically define what typeid(foo).name() must return, so they also can't be made to not generate names.
The more interesting question is, what happens if you strip out the names. For MSVC, I don't know the answer. The best thing to do here is probably to try it—go into your DLLs and change the first character of each name to \0 and see if it breaks dynamic_cast, etc. (I know that you can do this with Mac and linux x86_64 executables generated by g++ 4.2 and it works, but let's put that aside for now.)
On to question #2, assuming blanking the names doesn't work, it wouldn't be that hard to modify a name-based system to no longer require names. One trivial solution is to use hashes of the names, or even ROT13-encoded names (remember that the original goal here is "I don't want casual users to see the embarrassing names of my classes"). But I'm not sure that would count for what you're looking for. A slightly more complex solution is as follows:
For "dllexport"ed classes, generate a UUID, put that in the typeinfo, and also put it in the .LIB import library that gets generated along with the DLL.
For "dllimport"ed classes, read the UUID out of the .LIB and use that instead.
So, if you manage to get the dllexport/dllimport right, it will work, because your exe will be using the same UUID as the dll. But what if you don't? What if you "accidentally" specify identical classes (e.g., an instantiation of the same template with the same parameters) in your DLL and your EXE, without marking one as dllexport and one as dllimport? RTTI won't see them as the same type.
Is this a problem? Well, the C++ standard doesn't say it is. And neither does any MS documentation. In fact, the documentation explicitly says that you're not allowed to do this. You cannot use the same class or function in two different modules unless you explicitly export it from one module and import it into another. The fact that this is very hard to do with class templates is a problem, and it's a problem they don't try to solve.
Let's take a realistic example: Create a node-based linkedlist class template with a global static sentinel, where every list's last node points to that sentinel, and the end() function just returns a pointer to it. (Microsoft's own implementation of std::map used to do exactly this; I'm not sure if that's still true.) New up a linkedlist<int> in your exe, and pass it by reference to a function in your dll that tries to iterate from l.begin() to l.end(). It will never finish, because none of the nodes created by the exe will point to the copy of the sentinel in the dll. Of course if you pass l.begin() and l.end() into the DLL, instead of passing l itself, you won't have this problem. You can usually get away with passing a std::string or various other types by reference, just because they don't depend on anything that breaks. But you're not actually allowed to do so, you're just getting lucky. So, while replacing the names with UUIDs that have to be looked up at link time means types can't be matched up at link-loader time, the fact that types already can't be matched up at link-loader time means this is irrelevant.
It would be possible to build a name-based system that didn't have these problems. The ARM C++ ABI (and the iOS and Android ABIs based on it) restricts what programmers can get away with much less than MS, and has very specific requirements on how the link-loader has to make it work (3.2.5). This one couldn't be modified to not be name-based because it was an explicit choice in the design that:
• type_info::operator== and type_info::operator!= compare the strings returned by type_info::name(), not just the pointers to the RTTI objects and their names.
• No reliance is placed on the address returned by type_info::name(). (That is, t1.name() != t2.name() does not imply that t1 != t2).
The first condition effectively requires that these operators (and type_info::before()) must be called out of line, and that the execution environment must provide appropriate implementations of them.
But it's also possible to build an ABI that doesn't have this problem and that doesn't use names. Which segues nicely to #3.
The Itanium ABI (used by, among other things, both OS X and recent linux on x86_64 and i386) does guarantee that a linkedlist<int> generated in one object and a linkedlist<int> generated from the same header in another object can be linked together at runtime and will be the same type, which means they must have equal type_info objects. From 2.9.1:
It is intended that two type_info pointers point to equivalent type descriptions if and only if the pointers are equal. An implementation must satisfy this constraint, e.g. by using symbol preemption, COMDAT sections, or other mechanisms.
The compiler, linker, and link-loader must work together to make sure that a linkedlist<int> created in your executable points to the exact same type_info object that a linkedlist<int> created in your shared object would.
So, if you just took out all the names, it wouldn't make any difference at all. (And this is pretty easily tested and verified.)
But how could you possibly implement this ABI spec? j_kubik effectively argues that it's impossible because you'd have to preserve some link-time information in the .so files. Which points to the obvious answer: preserve some link-time information in the .so files. In fact, you already have to do that to handle, e.g., load-time relocations; this just extends what you need to preserve. And in fact, both Apple and GNU/linux/g++/ELF do exactly that. (This is part of the reason everyone building complex linux systems had to learn about symbol visibility and vague linkage a few years ago.)
There's an even more obvious way to solve the problem: Write a C++-based link loader, instead of trying to make the C++ compiler and linker work together to trick a C-based link loader. But as far as I know, nobody's tried that since Be.
Requirements for type-descriptor:
Works correctly in multi compilation-unit and shared-library environment;
Works correctly for different versions of shared libraries;
Works correctly although different compilation units don't share any information about type, except it's name: usually one header is used for all compilation units to define same type, but it's not required; even if, it doesn't affect resulting object file.
Work correctly despite fact that template instantiations must be fully-defined (so including type_info data) in every library that uses them, and yet behave like one type if several such libs are used together.
The fourth rule essentially bans all non-name based type-descriptors like UUIDs (unless specifically mentioned in type definition, but that is just name-replacement at best, and probably requires standard-alterations).
Stroing thuse UUIDs in separate files like suggeste .LIB files also causes trouble: different library versions implementing new types would cause trouble.
Compilation units should be able to share the same type (and its type_info) without the need to involve linker - because it should stay free of any language-specifics.
So type-name can be only unique type descriptor without completely re-modeling compilation and linking (also dynamic). I could imagine it working, but not under current scheme.

Is there a portable wrapper for C++ type_info that standardizes type name string format?

The format of the output of type_info::name() is implementation specific.
namespace N { struct A; }
const N::A *a;
typeid(a).name(); // returns e.g. "const struct N::A" but compiler-specific
Has anyone written a wrapper that returns dependable, predictable type information that is the same across compilers. Multiple templated functions would allow user to get specific information about a type. So I might be able to use:
MyTypeInfo::name(a); // returns "const struct N::A *"
MyTypeInfo::base(a); // returns "A"
MyTypeInfo::pointer(a); // returns "*"
MyTypeInfo::nameSpace(a); // returns "N"
MyTypeInfo::cv(a); // returns "const"
These functions are just examples, someone with better knowledge of the C++ type system could probably design a better API. The one I'm interested in in base(). All functions would raise an exception if RTTI was disabled or an unsupported compiler was detected.
This seems like the sort of thing that Boost might implement, but I can't find it in there anywhere. Is there a portable library that does this?
There are some limitations to do such things in C++, so you probably won't find exactly what you want in the near future. The meta-information about the types that the compiler inserts in the compiled code is also implementation-specific to the RTL used by the compiler, so it'd be difficult for a third-party library to do a good job without relying to undocumented features of each specific compiler that might break in later versions.
The Qt framework has, to my knowledge, the nearest thing to what you intended. But they do that completely independent from RTTI. Instead, they have their own "compiler" that parses the source code and generates additional source modules with the meta-information. Then, you compile+link these modules along with your program and use their API to get the information. Take a look at http://doc.qt.nokia.com/latest/metaobjects.html
Jeremy Pack (from Boost Extension plugin framework) appears to have written such a thing:
http://blog.redshoelace.com/2009/06/resource-management-across-dll.html
3. RTTI does not always function as expected across DLL boundaries. Check out the type_info classes to see how I deal with that.
So you could have a look there.
PS. I remembered because I once fixed a bug in that area; this might still add information so here's the link: https://stackoverflow.com/a/5838527/85371
GCC has __cxa_demangle https://gcc.gnu.org/onlinedocs/libstdc++/manual/ext_demangling.html
If there are such extensions for all compilers you target, you could use them to write a portable function with macros to detect the compiler.

Is there a tool to add the "override" identifier to existing C++ code

The task
I am trying to work out how best to add C++0x's override identifier to all existing methods that are already overrides in a large body of C++ code, without doing it manually.
(We have many, many hundreds of thousands of lines of code, and doing it manually would be a complete non-starter.)
Current idea
Our coding standards say that we should add the virtual keyword against all implicitly virtual methods in derived classes, even though strictly unnecessary (to aid comprehension).
So if I were to script the addition myself, I'd write a script that read all our headers, found all functions beginning with virtual, and insert override before the following semi-colon. Then compile it on a compiler that supports override, and fix all the errors in base classes.
But I'd really much rather not use this home-grown way, as:
it's obviously going to be tedious and error-prone.
not everyone has remembered, every time, to add the virtual keyword, so this method would miss out some existing overrides
Is there an existing tool?
So, is there already a tool that parses C++ code, detects existing methods that overrides, and appends override to their declarations?
(I am aware of static analysis tools such as PC-lint that warn about functions that look like they should be overrides. What I'm after is something that would actually munge our code, so that future errors in overrides will be detected at compiler-time, rather than later on in static analysis)
(In case anyone is tempted to point out that C++03 doesn't support 'override'... In practice, I'd be adding a macro, rather than the actual "override" identifier, to use our code on older compilers that don't support this feature. So after the identifier was added, I'd run a separate script to replace it with whatever macro we're going to use...)
Thanks in advance...
There is a tool under development by the LLVM project called "cpp11-migrate" which currently has the following features:
convert loops to range-based for loops
convert null pointer constants (like NULL or 0) to C++11 nullptr
replace the type specifier in variable declarations with the auto type specifier
add the override specifier to applicable member functions
This tool is documented here and should be released as part of clang 3.3.
However, you can download the source and build it yourself today.
Edit
Some more info:
Status of the C++11 Migrator - a blog post, dated 2013-04-15
cpp11-migrate User’s Manual
Edit 2: 2013-09-07
"cpp11-migrate" has been renamed to "clang-modernize". For windows users, it is now included in the new LLVM Snapshot Builds.
Edit 3: 2020-10-07
"clang-modernize" has bee renamed to "Clang-Tidy".
Our DMS Software Reengineering Toolkit with its C++11-capable C++ Front End can do this.
DMS is a general purpose program transformation system for arbitrary programming languages; the C++ front end allows it to process C++. DMS parses, builds ASTs and symbol tables that are accurate (this is hard to do for C++), provides support for querying properties of the AST nodes and trees, allows procedural and source-to-source transformations on the tree. After all changes are made, the modified tree can be regenerated with comments retained.
Your problem requires that you find derived virtual methods and change them. A DMS source-to-source transformation rule to do that would look something like:
source domain Cpp. -- tells DMS the following rules are for C++
rule insert_virtual_keyword (n:identifier, a: arguments, s: statements):
method_declaration -> method_declaration " =
" void \n(\a) { \s } " -> " virtual void \n(\a) { \s }"
if is_implicitly_virtual(n).
Such rules match against the syntax trees, so they can't mismatch to a comment, string, or whatever. The funny quotes are not C++ string quotes; they are meta-quotes to allow the rule language to know that what is inside them has to be treated as target language ("Cpp") syntax. The backslashes are escapes from the target language text, allowing matches to arbitrary structures e.g., \a indicates a need for an "a", which is defined to be the syntactic category "arguments".
You'd need more rules to handle cases where the function returns a non-void result, etc. but you shouldn't need a lot of them.
The fun part is implementing the predicate (returning TRUE or FALSE) controlling application of the transformation: is_implicitly_virtual. This predicate takes (an abstract syntax tree for) the method name n.
This predicate would consult the full C++ symbol table to determine what n really is. We already know it is a method from just its syntactic setting, but we want to know in what class context.
The symbol table provides the linkage between the method and class, and the symbol table information for the class tells us what the class inherits from, and for those classes, which methods they contain and how they are declared, eventually leading to the discovery (or not) that the parent class method is virtual. The code to do this has to be implemented as procedural code going against the C++ symbol table API. However, all the hard work is done; the symbol table is correct and contains references to all the other data needed. (If you don't have this information, you can't possibly decide algorithmically, and any code changes will likely be erroneous).
DMS has been used to carry out massive changes on C++ code in the past using program transformations.(Check the Papers page at the web site for C++ rearchitecting topics).
(I'm not a C++ expert, merely the DMS architect, so if I have minor detail wrong, please forgive.)
I did something like this a few months ago with about 3 MB worth of code and while you say that "doing it manually would be a complete non-starter," I think it is the only way. The reason is that you should be applying the override keyword to the prototypes that are intended to override base class methods. Any tool that adds it will put it on the prototypes that actually override base class methods. The compiler already knows which methods those are so adding the keyword doesn't change anything. (Please note that I am not terribly familiar with the new standard and I am assuming the override keyword is optional. Visual Studio has supported override since at least VS2005.)
I used a search for "virtual" in the header files to find most of them and I still occasionally find another prototype that is missing the override keyword.
I found two bugs by going through that.
Eclipse CDT has a working C++ parser and semantic utilities. The latest version IIRC also has markers for overriding methods.
It wouldn't require much code to write a plug-in which would base on that and rewrite the code to contain the override tags where appropriate.
one option is to
Enable suggest-override compiler warning And then write a script
which can insert override keyword to location pointed by the emitted warnings