I'm trying to automatically resolve typedefs in arbitrary C++ or C projects.
Because some of the typedefs are defined in system header files (for example uint32), I'm currently trying to achieve this by running the gcc preprocessor on my code files and then scanning the preprocessed files for typedefs. I should then be able to replace the typedefs in the project's code files.
I'm wondering, if there is another, perhaps simpler way, I'm missing. Can you think of one?
The reason, why I want to do this: I'm extracting code metrics from the C/C++ projects with different tools. The metrics are method-based. After extracting the metrics, I have to merge the data, that is produced by the different tools. The problem is, that one of the tools resolves typedefs and others don't. If there are typedefs used for the parameter types of methods, I have metrics mapped to different method-names, which are actually referring to the same method in the source code.
Think of this method in the source code: int test(uint32 par1, int par2)
After running my tools I have metrics, mapped to a method named int test(uint32 par1, int par2) and some of my metrics are mapped to int test(unsigned int par1, int par2).
If you do not care about figuring out where they are defined, you can use objdump to dump the C++ symbol table which resolves typedefs.
lorien$ objdump --demangle --syms foo
foo: file format mach-o-i386
SYMBOL TABLE:
00001a24 g 1e SECT 01 0000 .text dyld_stub_binding_helper
00001a38 g 1e SECT 01 0000 .text _dyld_func_lookup
...
00001c7c g 0f SECT 01 0080 .text foo::foo(char const*)
...
This snippet is from the following structure definition:
typedef char const* c_string;
struct foo {
typedef c_string ntcstring;
foo(ntcstring s): buf(s) {}
std::string buf;
};
This does require that you compile everything and it will only show symbols in the resulting executable so there are a few limitations. The other option is to have the linker dump a symbol map. For GNU tools add -Wl,-map and -Wl,name where name is the name of the file to generate (see note). This approach does not demangle the names, but with a little work you can reverse engineer the compiler's mangling conventions. The output from the previous snippet will include something like:
0x00001CBE 0x0000005E [ 2] __ZN3fooC2EPKc
0x00001D1C 0x0000001A [ 2] __ZN3fooC1EPKc
You can decode these using the C++ ABI specification. Once you get comfortable with how this works, the mangling table included with the ABI becomes priceless. The derivation in this case is:
<mangled-name> ::= '_Z' <encoding>
<encoding> ::= <name> <bare-function-type>
<name> ::= <nested-name>
<nested-name> ::= 'N' <source-name> <ctor-dtor-name> 'E'
<source-name> ::= <number> <identifier>
<ctor-dtor-name> ::= 'C2' # base object constructor
<bare-function-type> ::= <type>+
<type> ::= 'P' <type> # pointer to
<type> ::= <cv-qualifier> <type>
<cv-qualifier> ::= 'K' # constant
<type> ::= 'c' # character
Note: it looks like GNU changes the arguments to ld so you may want to check your local manual (man ld) to make sure that the map file generation commands are -mapfilename in your version. In recent versions, use -Wl,-M and redirect stdout to a file.
You can use Clang (the LLVM C/C++ compiler front-end) to parse code in a way that preserves information on typedefs and even macros. It has a very nice C++ API for reading the data after the source code is read into the AST (abstract syntax tree). http://clang.llvm.org/
If you are instead looking for a simple program that already does the resolving for you (instead of the Clang programming API), I think you are out of luck, as I have never seen such a thing.
GCC-XML can help with resolving the typedefs, you'd have to follow the type-ids of <Typedef> elements until you resolved them to a <FundamentalType>, <Struct> or <Class> element.
For replacing the typedefs in your project you have a more fundamental problem though: you can't simply search and replace as you'd have to respect the scope of names - think of e.g. function-local typedefs, namespace aliases or using directives.
Depending on what you're actually trying to achieve, there has to be a better way.
Update: Actually, in the given context of fixing metrics data, the replacement for the typenames using gcc-xml should work fine if it supports your code-base.
Related
AVR g++ has a pointer size of 16 bits. However, my particular chip (the ATMega2560) has 256 KB of RAM. To support this, the compiler automatically generates trampoline sections in the same section of ROM as the current executing code that then contains the extended assembly code to jump into high memory or back. In order for trampolines to be generated, you must take the address of something that sits in high memory.
In my scenario, I have a bootloader that I have written sitting in high memory. The application code needs to be able to call a function in the bootloader. I know the address of this function and need to be able to directly address it by hard-coding the address in my code.
How can I get the compiler/linker to generate the appropriate trampoline for an arbitrary address?
Compiler and linker will only generate trampoline code when the far address is a symbolic address rather than a literal constant number already in code. something like (assuming the address you want to jump to is 0x20000).
extern void (*farfun)() = 0x20000;
farfun ();
Will definitely not work, it doesn't cause the linker to do anything because the address is already resolved.
You should be able to inject the symbol address in the linker command line like so:
extern void farfun ();
farfun ();
compiling "normally" and linking with
-Wl,--defsym,farfun=0x20000
I think it's clear that you need to make sure yourself that something sensible sits at farfun.
You will most probably also need --relax.
EDIT
Never tried this myself, but maybe:
You could probably try to store the function address in a table in high memory and declare it like this:
extern void (*farfunctable [10])();
(farfunctable [0])();
and use the very same linker command to resolve the external symbol (now your table at 0x20000 (in the bootloader) needs to look like this:
extern void func1();
extern void func2();
void ((*farfunctab [10])() = {
func1,
func2,....
};
I would recommend to put func1() ... func10() in a different module from farfunctab in order to make the linker know it has to generate trampolins.
I was planning on putting a dispatch struct (that is, a struct with function pointers to all the various functions). Your solution works well, but requires knowing all of the locations of all of the functions ahead of time. Is there a way to execute a function call to a far address that isn't known at compile time?
[...] My goal was to put the struct with pointers to the functions in a fixed location. That way, it would be a single thing that needed a fixed address rather than every external function.
So you have two applications, let's call them App and Boot, where Boot provides some functionalities that App wants to use. The following problems have to be addressed:
How to get addresses from Boot into App.
How to build a jump table for Boot.
Avoid constructs that will crash when App tries to use code from Boot, like: Using indirect calls or jumps, using static constructors or using static storage in Boot.
App uses Addresses of boot.elf directly
Linking with -Wl,-R,boot.elf
A simple way would be to just link app.elf against boot.elf be means of -Wl,-R,boot.elf. Option -R instructs the linker to use symbol values from the specified file without dragging any code. Problem is that there's no way to specify which symbols to use, for example this might lead to a situation where App uses libgcc functions from Boot.
Defining Symbols by means of -Wl,--defsym,symbol=value
A bit more control over which symbols are being defined can be implemented by following a specific naming convention. Suppose that all symbols from Boot that have "boot" in their name should be "exported", then you could just
> avr-nm -g boot.elf | grep ' T ' | awk '/boot/ { printf("--defsym %s=0x%s\n",$3,$1) }' > syms.opt
This prints global symbol values, and grep filters out symbols in the text section. awk then transforms lines like 00020102 T boot1 to lines like
--defsym boot1=0x00020102 which are written to an option file syms.opt. The option file can then be provided to the linker by means of -Wl,#syms.opt.
The advantage of an option file is that it is easier to provide than plain options in a build environment like make: app.elf would depend (amongst others) on syms.opt, which in turn would depend on boot.elf.
Defining Symbols in a Linker Script Snippet
An alternative would be to define the symbols in a linker script augmentation, which you would provide by means of -T syms.ld during link and which would contain
"boot1"=ABSOLUTE(0x00020102);
"boot2"=...
...
INSERT AFTER .text
Defining Symbols in an Assembly Module
Yet another way to define the symbols would be by means of an assembly module which contains definitions like .global boot1 together with boot1 = 0x00020102.
All these approaches have in common that all symbols must be defined, or otherwise the linker will throw an undefined symbol error. This means boot.elf must be available, and it does not matter whether just one symbol is undefined or whether dozends of symbols are undefined.
Let Boot provide a Dispatch Table
The problem with using boot.elf directly, like lined out in the previous section, is that it introduces a direct dependency. This means that if Boot is improved or refactored, then you'll also have to re-compile App each time, even if the interface did not change.
A solution is to let Boot provide a dispatch table whose position and layout are known ahead of time. Only when the interface itself changes, App will have to be rebuilt. Just refactoring Boot will not require to re-build App.
The Assembly Module with the Jump Table
As explained in the "Crash" section below, addresses in a dispatch table (and hence indirect jumps) won't work because EIND has a wrong value. Therefore, let's assume we have a table of JMPs to the desired Boot functions, like in an assembly module boot-table.sx that reads:
;;; Linker description file boot.ld locates input section .boot.table
;;; right after .vectors, hence the address of .boot_table will be
;;; text-section-start + _VECTORS_SIZE, where the latter is
;;; #define'd in <avr/io.h>.
;;; No "x" section flag so that the linker won't relax JMPs to RJMPs.
.section .boot.table,"a",#progbits
.global .boot_table
.type .boot_table,#object
boot_table:
jmp boot1
jmp boot2
.size boot_table, .-boot_table
In this example, we are going to locate the jump table right after .vectors, so that its location is known ahead of time. The respective symbol definitions in App's syms.opt will then read
--defsym boot1=0x20000+vectors_size+0*4
--defsym boot2=0x20000+vectors_size+1*4
provided Boot is located at 0x20000. Symbol vectors_size can be defined in a C/C++ module, here by abusing avr-gcc attribute "address":
#include <avr/io.h>
__attribute__((__address__(_VECTORS_SIZE)))
char vectors_size;
Locating the Jump Table
In order to locate input section .boot.table, we need an own linker description file, which you might already use for Boot anyways. We start with a linker script from avr-gcc installation at ./avr/lib/ldscripts/avr6.xn, copy it to boot.ld, and add the following 2 lines after vectors:
...
.text :
{
*(.vectors)
KEEP(*(.vectors))
*(.boot.table)
KEEP(*(.boot.table))
/* For data that needs to reside in the lower 64k of progmem. */
*(.progmem.gcc*)
...
Auto-Generating Boot's Jump Table Module and the Symbols for App
It's highly advisable to have an interface description used by both App and Boot, say common.h. Moreover, in order to keep Boot's boot-table.sx and App's syms.opt in sync with the interface, it's agood idea to auto-generate these two files from common.h. To that end, assume that common.h reads:
#ifndef COMMON_H
#define COMMON_H
#define EX __attribute__((__used__,__externally_visible__))
EX int boot1 /* #boot_table:0 */ (int);
EX int boot2 /* #boot_table:1 */ (void);
#endif /* COMMON_H */
For the matter of simplicity, let's assume that this is C code or the interfaces are extern "C" so that the symbols in source code match the assembly names, and there's no need to use mangled names. It' easy enough to generate boot-table.sx and syms.opt from common.h using the magic comments. The magic comment follows directly after the symbol, so a regex would retrieve the token left of the magic comment, something like Python:
# ... symbol /* #boot_table:index */...
pat = re.compile (r".*(\b\w+\b)\s*/\* #boot_table:(\d+) \*/.*")
for line in sys.stdin.readlines():
match = re.match (pat, line)
if match:
index = int (match.group(2))
symbol = match.group(1)
Output template for syms.opt would be something like:
asm_line = "--defsym {symbol}=0x20000+vectors_size+4*{index}\n"
Code that will crash
Using Boot code from App is subject to several restrictions:
Indirect Calls and Jumps
These will crash because the start addresses of App resp. Boot are in different 128KiB segments of flash. When the address of a code symbol is taken, the compiler does this per gs(symbol) which instructs the linker to generate a stub and resolve gs() to that stub in .trampolines if the target address is outside the 128KiB segment where the trampolines are located. An explanation of gs() can be found in this answer, there is however more to it: The startup code will effectively initialize
EIND = __vectors >> 17;
see gcrt1.S, the AVR-LibC bits of start-up code crt<device>.o. The compiler assumes EIND never changes during execution, see EIND and more than 128KiB of Flash in the GCC documentation.
This means code in Boot assumes EIND = 1 but is called with EIND = 0 and hence EICALL resp. EIJMP will target the wrong address. This means common code must avoid indirect calls and jumps, and should be compiled with -fno-jump-tables so that switch/case won't generate such tables.
This also implies that the dispatch table described above won't work if it would just held gs(symbol) entries, because App and Boot will disagree on EIND.
Data in Static Storage
If common Boot code is using data in static storage, the data might collide with App's static storage. One way out is to avoid static storage in respective parts of Boot and pass addresses to, say, some data buffer by means of pointer erguments of respective functions.
One could have completely separate RAM areas; one for Boot and one for App, but that would be a waste of RAM because the applications will never run at the same time.
Static Constructors
Boot's static constructors will be bypassed if App uses code from Boot. This includes:
C++ code in Boot that explicitly or implicitly generates such constructors.
C/C++ code in Boot that relies on __attribute__((__constructor__)) or code in section .initN which is supposed to run prior to main.
Start-up code that initializes static storage, EIND etc., which is also run by locating it in some .initN sections, but will be bypassed if App calls Boot code.
I have a templated class SafeInt<T> (By Microsoft).
This class in theory can be used in place of a POD integer type and can detect any integer overflows during arithmetic operations.
For this class I wrote some custom templatized overloaded arithmetic operator (+, -, *, /) functions whose both arguments are objects of SafeInt<T>.
I typedef'd all my integer types to SafeInt class type.
I want to search my codebase for instances of the said binary operators where both operands are of type SafeInt.
Some of the ways I could think of
String search using regex and weed through the code to detect operator usage instances where both operands are SafeInt objects.
Write a clang tool and process the AST to do this searching (I am yet to learn how to write such a tool.)
Somehow add a counter to count the number of times the custom overloaded operator is instantiated. I spent a lot of time trying this but doesn't seem to work.
Can anyone suggest a better way?
Please let me know if I need to clarify anything.
Thanks.
Short answer
You can do this using the clang-query command:
$ clang-query \
-c='m cxxOperatorCallExpr(callee(functionDecl(hasName("operator+"))), hasArgument(0, expr(hasType(cxxRecordDecl(hasName("SafeInt"))))), hasArgument(1, expr(hasType(cxxRecordDecl(hasName("SafeInt"))))))' \
use-si.cc --
Match #1:
/home/scott/wrk/learn/clang/clang-query1/use-si.cc:10:3: note: "root" binds here
x + y; // reported
^~~~~
1 match.
What is clang-query?
clang-query is a utility intended to facilitate writing clang-tidy checks. In particular it understands the language of AST Matchers and can be used to interactively explore what is matched by a given match expression. However, as shown here, it can also be used non-interactively to look for arbitrary AST tree patterns.
The blog post Exploring Clang Tooling Part 2: Examining the Clang AST with clang-query by Stephen Kelly provides a nice introduction to using clang-query.
The clang-query program is included in the pre-built LLVM binaries, or it can be built from source as described in the AST Matchers Tutorial.
How does the above command work?
The -c argument provides a command to run non-interactively. With whitespace added, the command is:
m // Match (and report) every
cxxOperatorCallExpr( // operator function call
callee(functionDecl( // where the callee
hasName("operator+"))), // is "operator+", and
hasArgument(0, // where the first argument
expr(hasType(cxxRecordDecl( // is a class type
hasName("SafeInt"))))), // called "SafeInt",
hasArgument(1, // and the second argument
expr(hasType(cxxRecordDecl( // is also a class type
hasName("SafeInt")))))) // called "SafeInt".
The command line ends with use-si.cc --, meaning to analyze use-si.cc and there are no extra compiler flags needed by clang to interpret it.
The clang-query command line has the same basic structure as that of clang-tidy, including the ability to pass -p compile_commands.json to scan many files at once, possibly with different compiler options per file.
Example input
For completeness, the input I used to test my matcher is use-si.cc:
// use-si.cc
#include "SafeInt.hpp" // SafeInt
void f1()
{
SafeInt<int> x(2);
SafeInt<int> y(3);
x + y; // reported
x + 2; // not reported
2 + x; // not reported
}
where SafeInt.hpp comes from https://github.com/dcleblanc/SafeInt , the repo named on the Microsoft SafeInt page.
To do this right, you clearly have to be able to identify individual uses of the operator which overload to a specific operator definition. Fundamentally, you need what the front end of a C++ compiler does: parsing and name resolution (including the overloads).
Obviously GCC and Clang have this basic capability. But you want to track/display all uses of the specific operator. You can probably bend Clang (or GCC, harder) to give you this information on a file-by-file basis.
Our DMS Software Reengineering Toolkit with its C++ Front End can be used for this, too.
DMS provides the generic parsing and symbol table support machinery; the C++ front end specializes DMS to handle C++ with full, accurate name resolution including overloads, for both GCC5 and MSVS2015. Its symbol table actually collects, for each declaration in a scope, the point of the declaration, and the list of uses of that declaration in terms of accurate source positions. The symbol scopes include an entry for each (overloaded) operator valid in the scope. You could just
go to the desired symbol table entry and enumerate/count the list of references to get a raw count. There are standard APIs for this available via DMS.
The same kind of symbol scope/definition/uses information is used by our Java Source Browser to build an HTML-based JavaDoc-like display with full HTML linkages between symbol declarations and uses. So for any symbol declaration, you can easily see the uses.
The C++ front end has a similar HTMLizer that operates on C++ source code. It isn't as mature/pretty, but it is robust. It presently doesn't show all the uses of a declared symbol, but that would be a pretty straightforward change to make to it. (I don't have a publicly visible instance of it. Contact me through my bio and I can send you a sample).
I'm looking for a way to parse a C structure in order to get the the name and the type of the variables.
For example I have a structure like this:
struct MyStruct {
int anInt ;
float aFloat ;
}
I need to get the types int and float and the 2 strings "anInt" and "aFloat".
After I have to use these values in another function:
addValue<int>("anInt") ;
add Value<float>("aFloat") ;
Do you know how to do this automatically, I guess, at compilation ?
Thanks.
You probably cannot do that with standard C++11 template code.
You could consider customizing your C++ compiler to get such information. For example, if compiling your code with a recent GCC, you might consider customizing it with your MELT extension (MELT is a domain specific language, implemented by a plugin, to customize GCC). You'll need to understand the details of GCC internal representation (so such an extension might take a week or more of your time).
The Qt MOC facility might perhaps be useful or inspirational. It is parsing a limited form of class declaration.
Alternatively you might consider generating the C++ struct or class representation from some other input; e.g. change your build procedure, perhaps your Makefile, to generate some .h header file (and perhaps some .cc C++ translation unit) from your higher level representation.
I would like to know how to use GCC as a library to parse C/C++/Java/Objective C/Ada code for my program.
I want to bypass prepocessing and prefix all the functions that are user written with a prefix My.
like so Print(); becomes MyPrint(); I also wish to do this with the variables.
You can look here:
http://codesynthesis.com/~boris/blog/2010/05/03/parsing-cxx-with-gcc-plugin-part-1/
This is description of how to use gcc plugin interface to parse C++ code. Other language should be handled in the same manner.
Also you can try pork from mozilla:
https://wiki.mozilla.org/Pork
When I tried it (pork), I spend hour or so to fix compile problems, but then
I can write scripts like this:
rewrite SyncPrimitiveUpgrade {
type PRLock* => Mutex*
call PR_NewLock() => new Mutex()
call PR_Lock(lock) => lock->Lock()
call PR_Unlock(lock) => lock->Unlock()
call PR_DestroyLock(lock) => delete lock
}
so it found all type PRLock and replate it with Mutex, also it search call of functions
like PR_NewLock and replace it with "new Mutex".
You might wish to investigate the sparse C parser. It understands a lot of C (all the C used in the Linux kernel sources, which is a fairly good subset of legal ANSI-C and GNU-C extensions) and provides a few sample compiler backends to provide a lint-like static analysis tool for type checking.
While the code looks very clean and thorough, your task might be easier done via another mechanism -- the example.c included with the sparse source that demonstrates a compiler is 1955 lines long.
For C, you cannot do that reliably. If you skip preprocessing you will -- in general -- not have valid C code to be parsed. E.g.
#define FOO
#define BAR
#define BAZ
FOO void BAR qux BAZ(void) { }
How is the parser supposed to recognize this a function definition of qux without doing the preprocessing?
First, GCC is not a library, and is not structured to be one (in contrast to LLVM).
Why (i.e. what for) do you want to parse C, C++, Ada source code?
I would consider (assuming a GCC 4.6 version) extending GCC either thru plugins written in C, or preferably using MELT, a high level domain specific language to extend GCC (disclaimer: I am the main author of MELT).
But using GCC as a library is not realistic at all.
I really think that for what you want to achieve, MELT is the right tool. However, it is poorly documented. Please use the gcc-melt#googlegroups.com list to ask questions.
And be aware that extending GCC does take some amount of work (more than a week perhaps), because you need to partly understand the GCC internal representations.
Our DMS Software Reengineering Toolkit can parse C, C++, Java and Ada code (not Objective C at this time) in a wide variety of dialects and carry out transformations on the code. DMS's C and C++ front ends include a preprocessor, so you can you can cause preprocessing before you parse.
I'm probably don't understand what you want to do, because it seems strange to rename every function and (global?) variable with a "My...." prefix. But you could do that with some DMS rules (a rough sketch of renames of user functions for GCC3:
domain C~GCC3.
rule rewrite_function_names(t: type_designator, i: IDENTIFIER, p: parameter_list, s: statements):
function_header->functionheader
"\t \i(\p) { \s } " -> "\t \renamed\(\i\) (\p) { \s }" ;
and a helper function "renames" that takes a tree node containing an identifer, and returns a tree node with the renamed identifier.
Because DMS patterns only match against the parse trees, you won't get any false positives.
You'd need some additional patterns to handle various different syntax cases within each langauge (e.g, for C, "void" return type, because "void" isn't a type designator in the syntax, and global variable declarations), and different rules for different languages (Ada's syntax is not the same as that of C).
This might seem like big hammer for your task, but if you really insist on doing this for a variety of languages in a reliable way, it seems hard to avoid the problem of getting decent parsers for all those languages. (And if you are really going to do this for all these languages, DMS can be taught to handle ObjectiveC the same we we have taught it to handle the other langauges).
Your alternative is some kind of string hacking solution, which might work 95% of the time. If you can live with that, then Perl or something similar is likely your answer.
forget about GCC, its made as a compiler's parser, not an analysis parser, you'd do way better using something like libclang, a C interface to clang, which can process both C & C++
I just want to ask your ideas regarding this matter. For a certain important reason, I must extract/acquire all function names of functions that were called inside a "main()" function of a C source file (ex: main.c).
Example source code:
int main()
{
int a = functionA(); // functionA must be extracted
int b = functionB(); // functionB must be extracted
}
As you know, the only thing that I can use as a marker/sign to identify these function calls are it's parenthesis "()". I've already considered several factors in implementing this function name extraction. These are:
1. functions may have parameters. Ex: functionA(100)
2. Loop operators. Ex: while()
3. Other operators. Ex: if(), else if()
4. Other operator between function calls with no spaces. Ex: functionA()+functionB()
As of this moment I know what you're saying, this is a pain in the $$$... So please share your thoughts and ideas... and bear with me on this one...
Note: this is in C++ language...
You can write a Small C++ parser by combining FLEX (or LEX) and BISON (or YACC).
Take C++'s grammar
Generate a C++ program parser with the mentioned tools
Make that program count the funcion calls you are mentioning
Maybe a little bit too complicated for what you need to do, but it should certainly work. And LEX/YACC are amazing tools!
One option is to write your own C tokenizer (simple: just be careful enough to skip over strings, character constants and comments), and to write a simple parser, which counts the number of {s open, and finds instances of identifier + ( within. However, this won't be 100% correct. The disadvantage of this option is that it's cumbersome to implement preprocessor directives (e.g. #include and #define): there can be a function called from a macro (e.g. getchar) defined in an #include file.
An option that works for 100% is compiling your .c file to an assembly file, e.g. gcc -S file.c, and finding the call instructions in the file.S. A similar option is compiling your .c file to an object file, e.g, gcc -c file.c, generating a disassembly dump with objdump -d file.o, and searching for call instructions.
Another option is finding a parser using Clang / LLVM.
gnu cflow might be helpful