A friend and I were discussing imaginary and real languages, and a question that came up was: if one of us wanted to generate headers for another language (perhaps D, which already has a tool), what would be an easy and reliable way to do this?
One of us suggested scanning C files and headers, ignoring function bodies, and just counting braces to figure out where a function ends. The counterargument was typedefs, defines (which can also contain braces, though defines were considered a trivial problem), and templates plus specialization.
Another suggestion was to read the binaries produced, not the actual exe but the object files the linker uses. The counterargument there was the format and its complexity. None of us knew anything about any object format, so we couldn't estimate the effort (we were thinking of gcc and Visual C++).
What do you guys think? Which is easier? This should be backed up with reasonable logic and fact.
A link to a helpful project would be useful: one that parses C files/headers and outputs the declarations, or one that reads in ELF data and displays its info. I tried googling but I didn't know what such a tool would be called. I found libelf, but at the moment I couldn't get it to compile. I might be able to soon.
You can use the clang libraries to parse C/C++ source code and extract any information you want, in particular function prototypes.
Thanks to clang's library-based architecture, it is easy to reuse just the parts you need; in your case these are the frontend libraries (liblex, libparse, libsema). I think this is a more feasible approach than a hand-written scanner, considering the difficulties that you mentioned (typedefs, defines, etc.).
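For illustration, here is a minimal sketch of that library approach using libclang, clang's stable C interface; the file name and build command are mine, not anything prescribed:
// list_funcs.cpp - print every function declaration in a source file.
// Build (illustrative): clang++ list_funcs.cpp -lclang -o list_funcs
#include <clang-c/Index.h>
#include <cstdio>

static CXChildVisitResult visit(CXCursor c, CXCursor, CXClientData) {
    if (clang_getCursorKind(c) == CXCursor_FunctionDecl) {
        CXString name = clang_getCursorSpelling(c);
        CXString type = clang_getTypeSpelling(clang_getCursorType(c));
        std::printf("%s : %s\n", clang_getCString(name), clang_getCString(type));
        clang_disposeString(name);
        clang_disposeString(type);
    }
    return CXChildVisit_Recurse; // keep descending to catch nested declarations
}

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    CXIndex idx = clang_createIndex(0, 0);
    CXTranslationUnit tu = clang_parseTranslationUnit(
        idx, argv[1], nullptr, 0, nullptr, 0, CXTranslationUnit_None);
    if (tu) {
        clang_visitChildren(clang_getTranslationUnitCursor(tu), visit, nullptr);
        clang_disposeTranslationUnit(tu);
    }
    clang_disposeIndex(idx);
    return 0;
}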
clang can also be used as a tool to parse the source code and output the AST in XML form. For example, if you have the file test.cpp:
void foo() {}
int main()
{
foo();
}
and invoke clang++ -Xclang -ast-print-xml -fsyntax-only test.cpp, you'll get the file test.xml similar to the following (irrelevant parts are skipped here for brevity):
<?xml version="1.0"?>
<CLANG_XML>
<TranslationUnit>
<Function id="_1D" file="f2" line="1" col="6" context="_2"
name="foo" type="_12" function_type="_1E" num_args="0">
</Function>
<Function id="_1F" file="f2" line="3" col="5" context="_2"
name="main" type="_21" function_type="_22" num_args="0">
</Function>
</TranslationUnit>
<ReferenceSection>
<Types>
<FunctionType result_type="_12" id="_1E"/>
<FundamentalType kind="int" id="_21"/>
<FundamentalType kind="void" id="_12"/>
<FunctionType result_type="_21" id="_22"/>
<PointerType type="_12" id="_10"/>
</Types>
<Files>
<File id="f2" name="test.cpp"/>
</Files>
</ReferenceSection>
</CLANG_XML>
I don't think that extracting this information from binaries is possible, at least not for symbols with C linkage, because their names are not mangled and so carry no type information.
ctags' output is quite easy to read/parse
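For instance, with Exuberant Ctags, something like ctags -x --c-kinds=pf yourheader.h prints a human-readable cross-reference of the function prototypes and definitions it finds, one per line (the file name is mine; check your ctags variant for the exact flags).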
If you simply want to generate a binding, try SWIG.
What you're talking about is compiling: the act of transforming code in one formal language into another. There's good, solid science behind this which, if followed carefully, will guarantee that your program halts with correct, analogous code.
Granted, you don't want to parse the whole of the C++ language (hooray for that!), so you just need to define the relevant grammar and define everything else as acceptable noise or comments.
Don't use regular expressions. These won't do because C++ is not a regular language.
One way to do this is to define your interfaces in an abstract language (an IDL), and generate headers for all languages that you're interested in. You can limit the scope of your IDL to those features that are possible in each target language.
Windows takes this approach in its MIDL language, for example.
Doxygen can help with this: it can parse your headers and emit XML describing the declarations it finds (set GENERATE_XML = YES in the Doxyfile). It's an advanced topic, and somewhat documented.
In a perfect world...
Each compiler of any programming language would emit such information as part of its output. It could be an ELF extension or a new, generally accepted file format that contains an ELF/COFF/whatever section.
This would spare the thousands of man-hours spent (across the globe) generating "language bindings" for dozens of languages. Dynamic languages such as Common Lisp would not need FFI libraries: it would all happen under the hood, and loading a shared library in that new format could automatically trigger inspection, making its functions available on the fly, without any further ado.
And all that would actually have to be stored as extra information for each (exported) function is the following (a hypothetical sketch in code follows the list):
Generic name
Return type
Argument list
Calling convention (there are not that many in practice)
A reference to the corresponding symbol table entry
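As a purely hypothetical sketch, a per-function record of this kind might look something like the following; every name here is invented for illustration, and nothing is a real ELF/COFF structure:
#include <cstdint>
#include <string>
#include <vector>

// Invented on-disk record for one exported function, mirroring the
// five fields listed above.
enum class CallingConvention : std::uint8_t { C, StdCall, FastCall, Vector };

struct ParameterInfo {
    std::string name;   // generic (unmangled) parameter name
    std::string type;   // language-neutral type description
};

struct ExportedFunctionInfo {
    std::string genericName;          // Generic name
    std::string returnType;           // Return type
    std::vector<ParameterInfo> args;  // Argument list
    CallingConvention convention;     // Calling convention
    std::uint32_t symtabIndex;        // Reference to the symbol table entry
};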
Why has it not been done during the past 30 years?
Because language designers still see the compiler output as their "private affair", whereas IMHO, this should be part of the OS/Runtime/Loader.
Many workarounds exist, some listed in other answers here (IDLs, binding generators such as SWIG, and a myriad of others, mostly ad hoc and of varying, typically insufficient, quality).
The ELF guys are Unix guys, and as such C guys, who still live in a bubble, thinking C is some sort of "golden standard". But even they would have no problem using the information to generate their beloved header files.
Rust might not have needed to invent crates in that perfect world, and you would not even care whether some shared library had been written in C or C++ or Rust or Zig or Haskell; anyone could just use it.
This domain is still dominated by "computer scientists", not engineers; while in theory there could be an infinite number of calling conventions, in practice those in active use are a countable few (LLVM supports many of those actually relevant).
Another reason it has not been done yet is the danger that some committee would create a monster in trying to make it "open, flexible, ...", and no one would eventually dare to use it. DCE (Distributed Computing Environment) gives an idea of how that goes.
I'm developing a new language in LLVM using the C++ API which compiles down to target the C ABI.
I would like to support modular compilation by allowing end users to build what are effectively static libraries. I noticed the LLVM C++ API has an llvm::Linker class that I can use during compilation to combine source files (llvm::Module); however, I want to guarantee library compatibility across separate compilation runs, via metadata version numbers or at least via the publicly exposed interface.
Much of the information available on metadata in LLVM suggests that it should only be used for extended information that would not break correctness when silently removed.
I wouldn't think this would be a deal breaker, as it could be global metadata, but it would be good to get a second opinion on that point.
I also know there is a method in IRReader, parseIRFile, so I can load previously built .bc files. I would be curious whether it would be reasonable practice to include size and CRC information for comparison when loading these files.
My language has concepts similar to C#, including interfaces. I figure I could allow modular compilation by importing/exporting an interface type along with external functions (much like C++, I don't restrict the language to only methods of classes).
This approach allows me to include language-specific information in the interface without needing to encode it in the IR, as both the library and the calling code would be required to build against the interface. This again requires the interfaces to be compatible.
One language feature that would require extended information is named parameters in functions.
My language is very type-safe and also mandates named parameters, so there is no predetermined function parameter order. This allows call sites to be more explicit, lets the compiler catch erroneous parameter usage, and gives authors more liberty in choosing default parameters, as they are not restricted to the last parameters of the function.
The compiler will need to know names, modifiers, defaults, etc. of these parameters to correctly map calls at compile time, so I figure the interface approach would work well here.
TL;DR
Does LLVM have any predefined facilities for building static libraries?
Is version number, size, and CRC information reasonable use cases for LLVM's metadata?
This is probably not QUITE an answer... Or at least not a complete answer.
I like this question, as I'm going to need a solution to this in the future too (some time in the next few months or years) for my Pascal compiler. It supports "units", which are meant to be separately compiled objects, but currently what I do is simply drag in the source file and compile it into the main llvm::Module. That's neither efficient nor flexible (I can't use the linker to choose between the "Linux" and "Windows" versions of some code, for example; not that I think there is a 5% chance my compiler will work on Windows without modification anyway...)
However, I'm not sure storing the "object" file as LLVM IR would be the right thing to do. I was thinking that a better way would be to store your AST in some serialized form; then:
you don't depend on LLVM versions changing the IR format;
you can add whatever metadata you like.
There won't be much difference between generating LLVM IR from this during your link phase and building the IR at compile time and then reading the IR back to check whether the metadata is correct. [The slow part, as you may have already found out, is the optimisation and MC generation, and you'd still have to do that either way.]
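If you do keep things in IR form, here is a minimal sketch of recording and checking a version number via named module-level metadata with the C++ API, which bears on the TL;DR question; the metadata name is invented, not an LLVM convention:
#include "llvm/ADT/StringRef.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/Casting.h"

// Attach a library version string to a module as named metadata.
// Module-level named metadata is serialized with the bitcode.
void writeVersion(llvm::Module &M, llvm::StringRef Version) {
    llvm::LLVMContext &Ctx = M.getContext();
    llvm::NamedMDNode *NMD =
        M.getOrInsertNamedMetadata("mylang.library.version"); // invented name
    NMD->addOperand(llvm::MDNode::get(Ctx, llvm::MDString::get(Ctx, Version)));
}

// Read the version back from a module loaded with e.g. parseIRFile;
// returns "" if the metadata is absent.
llvm::StringRef readVersion(const llvm::Module &M) {
    if (llvm::NamedMDNode *NMD = M.getNamedMetadata("mylang.library.version"))
        if (NMD->getNumOperands() > 0) {
            llvm::Metadata *Op = NMD->getOperand(0)->getOperand(0);
            if (auto *S = llvm::dyn_cast<llvm::MDString>(Op))
                return S->getString();
        }
    return "";
}
Size and CRC values could be stored the same way, as extra operands, and checked right after loading and before linking.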
As I said at the start, I'm not sure this is an answer, but it's my thinking so far on the subject. Now I'll go back to adding debug-symbol support to my Pascal compiler... Before Christmas I couldn't see the source in GDB. Now I can step, but no viewing of variables yet...
I'm looking to get an AST for C++ that I can then parse with an external program. What tools are out there that are good for generating an AST for C++? I don't care what language the tool is implemented in or what its output format is (so long as it is readily parseable).
My overall goal is to transform a C++ unit test bed to its corresponding C# wrapper test bed.
You can use clang, and especially libclang, to parse C++ code. It's a very high-quality, hand-written library for lexing, parsing and compiling C++ code, and it can also generate an AST.
Clang also supports C, Objective-C and Objective-C++. Clang itself is written in C++.
Actually, GCC will emit the AST at any stage in the pipeline that interests you, including the GENERIC and GIMPLE forms. Check out the (plethora of) command-line switches beginning with -fdump- — e.g. -fdump-tree-original-raw
This is one of the easier (…) ways to work, as you can use it on arbitrary code; just pass the appropriate CFLAGS or CXXFLAGS into most Makefiles:
make CXXFLAGS=-fdump-tree-original-raw all
… and you get “the works.”
Updated: Saw this neat little graphing system based on GCC ASTs while checking my flag name :-) Google FTW.
http://digitocero.com/en/blog/exporting-and-visualizing-gccs-abstract-syntax-tree-ast
Our C++ Front End, built on top of our DMS Software Reengineering Toolkit, can parse a variety of C++ dialects (including C++11 and Objective-C) and export the AST as an XML document with a command-line switch. See example ASTs produced by this front end.
As a practical matter, you will need more than the AST; you can't really do much with C++ (or any other modern language) without an understanding of the meaning and scope of each identifier. For C++, meaning and scope are particularly ugly. The DMS C++ front end handles all of that; it can build full symbol tables associating identifiers with explicit C++ types. That information isn't dumpable in XML with a command-line switch, but it is "technically easy" to code logic in DMS to walk the symbol table and spit out XML. (There is an option to dump this information, just not in XML format.)
I caution you against the idea of manipulating (or even just analyzing) the XML. First, XSLT isn't a particularly good way to understand the meaning of the ASTs, let alone transform them, because the ASTs represent context-sensitive language structures (that's why you want [nee MUST HAVE] the symbol table). You can read the XML into a DOM-like tree if you like and write your own procedural code to manipulate it. But source-to-source transformations are an easier way; you can write your transformations using C++ notation rather than buckets of code goo climbing over a tree data structure.
You'll have another problem: how to generate valid C++ code from the transformed XML. If you don't mind spitting out raw text, you can solve this in purely ad hoc ways, at the price of having no guarantee other than sweat that the generated code is syntactically valid. If you want to generate a C++ representation of your final result as an AST, and regenerate valid text from that, you'll need a prettyprinter, which is not technically hard but is still a lot of work to build, especially for a language as big as C++.
Finally, the reason that tools like DMS exist is to provide the vast amount of infrastructure it takes to process/manipulate complex structures such as C++ ASTs (parse, analyze, transform, prettyprint). You can try to replicate all this machinery yourself, but that is usually a poor time/cost/productivity tradeoff. The claim is that it is best to stay within the tool ecosystem rather than escape it and build bad versions of it yourself. If you haven't done this before, you'll find this out painfully.
FWIW, DMS has been used to carry out massive analysis and transformations on C++ source code. See Publications on DMS and check the papers by Akers on "Re-engineering C++ Component Models".
Clang is based on the same kind of philosophy; there's an ecosystem of tools.
YMMV, but I'd be surprised.
I'm working on some custom C++ static code analysis for my PhD thesis. As part of an extension to the C++ type system, I want to take a C++ code base and enumerate its available functions, methods, and classes, along with their type signatures, with minimal effort (it's just a prototype). What's the best approach to doing something like this quickly and easily? Should I be hacking on Clang to spit out the information I need? Should I look at parsing header files with something like SWIG? Or is there an even easier thing I could be doing?
GCCXML, based on GCC, might be the ticket.
As I understand it, it collects and dumps all definitions but not the content of functions/methods.
Others will likely mention Clang, which certainly parses code and must have access to the definitions of the symbols in a compilation unit. (I have no experience there.)
For completeness, you should know about our DMS Software Reengineering Toolkit with its C++ Front End. (The Clang answers seem to say "walk the AST".) The DMS solution provides an enumerable symbol table containing all the type information; you can walk the AST, too, if you want. Often a static analysis leads to a diagnosis and a desire to change the source code, and DMS can apply source-to-source program transformations to carry out such changes, conditioned by the analysis.
I heartily recommend LLVM for static analysis (see also the Clang Static Analyzer).
I think your best bet is hacking on clang and getting the AST. There is a good tutorial for that here. It's very easy to modify its syntax handling, and it also has a static analyzer.
At my work, I use the API from a software package called "Understand 4 C++" by SciTools. I use it to write all my static analysis tools. I even wrote a .NET API to wrap their C API, which I put on CodePlex.
Once you have that, dumping all class types is easy:
// Fetch every class type known to the Understand database and print each name.
ClassType[] allclasses = Database.GetAllClassTypes();
foreach (ClassType c in allclasses)
{
    Console.WriteLine("Class Name: {0}", c.NameLong);
}
Now for a little backstory about a task I had that is similar to yours.
In some years we have to keep our SDK binary backwards compatible with the previous year's SDK. In that case it's useful to compare the SDK code between releases to check for potential breaking changes. However, with a couple of hundred files and tens of thousands of lines of comments, that can be a big headache using a text diff tool like Beyond Compare or Araxis. What I really need to look at is actual code changes: not re-ordering, not moving code up and down in the file, not adding comments, etc.
So I wrote a tool to dump out all the code.
In one text file I dump all the classes. For each class I print its inheritance tree and its member functions, both virtual and non-virtual. For each virtual function I print which parent-class virtual methods it overrides (if any). I also print out its member variables.
Same goes with the structs.
In another file I print all the macros.
In another file I print all the typedefs.
Then I diff these files against the files from a previous release, and it becomes instantly apparent what has changed from release to release. For instance, it's easy to see where a function parameter was changed from TCHAR* to const TCHAR*.
You might consider developing a GCC plugin for your purpose.
And GCC MELT is a high-level domain-specific language (that I designed & implemented) for easily extending GCC.
The paper by Peter Collingbourne and Paul Kelly at the GROW09 workshop, A Compile-Time Infrastructure for GCC Using Haskell, might be relevant to your work.
I need to parse function headers from a .i file used by SWIG, which contains all sorts of garbage besides the function headers. (The final output would be a list of function declarations.)
The best option for me would be using the GNU toolchain (GCC, Binutils, etc.) to do so, but I might be missing an easy way of doing it with SWIG. If I am, please tell me!
Thanks :]
edit: I also don't know how to do this with the GCC toolchain; if you have an idea, that would be great.
I would try getting an XML dump of the abstract syntax tree, either from clang or from gccxml. From there you can easily extract the function declarations you are interested in.
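For example (flag spellings vary by version, so check your tool's help output): gccxml takes an output switch of the form gccxml input.cpp -fxml=out.xml, and clang can print a textual AST with clang -Xclang -ast-dump -fsyntax-only input.cpp; the file names here are illustrative.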
Our DMS Software Reengineering Toolkit provides general purpose program parsing, analysis, and transformation capability. It has front ends for a wide variety of languages, including C++.
It has been used to analyze and transform very complex C++ programs and their header files.
You aren't clear as to what you will do after you "parse the function headers"; normally people want to extract some information or produce another artifact. DMS with its C++ front end can do the parsing; you can configure DMS to do the custom stuff.
As a practical matter, this isn't usually an afternoon's exercise; DMS is a complex beast, because it has to deal with complex beasts such as C++, and I'd expect you to face the same kind of complexity with any tool that can handle C++. The GCC toolchain can clearly handle C++, so you might be able to do it with that (at the same level of complexity), but GCC is designed to be a compiler, and IMHO you will find it a fight to get it to do what you want.
Your "output function declarations" goal isn't clear. You want just the function names? You want a function signature? You want all the type declarations on which the function depends? You want all the type declarations on which the function depends, if they are not already present in an existing include file you intend to use?
The best way to extract function decls from the garbage that is C header files is to substitute out what constitutes the smelliest garbage: macros. You can do that with cpp, the C preprocessor.
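For instance, cpp -P decls.h (or, equivalently, gcc -E -P decls.h) expands all the macros and suppresses the #line markers, leaving plain declarations that are much easier to scan; the file name is illustrative.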
I'm just starting to explore C++, so forgive the newbieness of this question. I also beg your indulgence for how open-ended this question is. I think it could be broken down, but I also think this information belongs in the same place.
(FYI: I am working predominantly with the Qt SDK and mingw32-make right now, and I seem to have configured them correctly for my machine.)
I knew that there was a lot in the language which is compiler-driven; I've heard about pre-compiler directives, but it seems like someone could write books about the different C++ compilers and their respective parameters. In addition, there are commands which apparently precede make (qmake, for example; is that something only in Qt?).
I would like to know if there is any place which gives an overview of what compilers are out there and what their different options are. I'd also like to know how each of them views makefiles (it seems that there are differences in syntax between them?).
If there is no website on "Everything you need to know about C++ compilers but were afraid to ask," what would be the best way to go about learning the answers to these questions?
Concerning the "numerous options of the various compilers"
A piece of good news: you needn't worry about the details of most of these options. You will, in due time, delve into them, but only for the compiler you actually use, and maybe only for the options that pertain to a particular set of features. As a novice, generally trust the default options or the ones supplied with the make files.
The broad categories of these features (and I may be missing a few) are listed below; concrete gcc examples follow the list:
pre-processor defines (now, you may need a few of these)
code generation (target CPU, FPU usage...)
optimization (hints for the compiler to favor speed over size and such)
inclusion of debug info (which is extra data left in the object/binary and which enables the debugger to know where each line of code starts, what the variables names are etc.)
directives for the linker
output type (exe, library, memory maps...)
C/C++ language compliance and warnings (compatibility with previous version of the compiler, compliance to current and past C Standards, warning about common possible bug-indicative patterns...)
compile-time verbosity and help
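To put gcc flags to those categories (one compiler's spellings; others differ): -D sets preprocessor defines, -march=... steers code generation, -O2 requests optimization, -g includes debug info, -Wl,... forwards directives to the linker, -shared changes the output type, -std=... and -Wall govern language compliance and warnings, and -v raises compile-time verbosity.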
Concerning an inventory of compilers with their options and features
I know of no such list, but I'm sure something like it exists on the web. However, I suggest that, as a novice, you worry little about these "details": use whatever free compiler you can find (gcc is certainly a great choice), and build experience with the language and the build process. C professionals may argue, with good reason and at length, about the merits of various compilers and associated runtimes etc., but for generic purposes, and then some, the free stuff is all that is needed.
Concerning the build process
The most trivial applications, such as those made of a single unit of compilation (read: a single C/C++ source file), can be built with a simple batch file where the various compiler and linker options are hardcoded and the name of the file is specified on the command line.
For all other cases, it is very important to codify the build process so that it can be done
a) automatically and
b) reliably, i.e. with repeatability.
The "recipe" associated with this build process is often encapsulated in a make file or as the complexity grows, possibly several make files, possibly "bundled together in a script/bat file.
You need to get familiar with make file syntax even if you use alternatives to make/nmake, such as Apache Ant; the reason is that many (most?) source code packages include a make file.
In a nutshell, make files are text files that let you define targets and the associated commands for building each target. Each target is associated with its dependencies, which allows the make logic to decide which targets are out of date and should be rebuilt and, before rebuilding them, which of their dependencies should in turn be rebuilt. That way, when you modify, say, an include file (and if the make file is properly configured), any C file that uses this header will be recompiled, and any binary which links with the corresponding obj file will be rebuilt as well. make also includes options to force all targets to be rebuilt, which is sometimes handy to be sure that you truly have a current build (for example, in case some dependencies of a given object are not declared in the make file).
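For instance (names invented for illustration), a rule line prog: main.o util.o followed by an indented command g++ -o prog main.o util.o tells make that prog depends on those two object files; with matching rules for the .o files, editing util.cpp causes make to rebuild util.o and then relink prog, and nothing else.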
On the Pre-processor:
The pre-processor is the first step in compiling, although it is technically not part of the compilation itself. The purposes of this step are:
to remove comments and extraneous whitespace
to substitute any macro reference with the relevant C/C++ syntax. Some macros, for example, are used to define constant values, such as some email address used in the program; during pre-processing, any reference to this constant value (by convention, such constants are named in ALL_CAPS_AND_UNDERSCORES) is replaced by the actual C string literal containing the email address (see the sketch after this list).
to exclude all conditional-compilation branches that are not relevant (#ifdef and the like)
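To make the substitution point concrete, here is a tiny invented example of what the pre-processor does with such a constant:
// Before pre-processing (hypothetical source):
#define ADMIN_EMAIL "admin@example.com"  // ALL_CAPS by convention
const char *contact = ADMIN_EMAIL;
// After pre-processing, the compiler proper sees only:
// const char *contact = "admin@example.com";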
What's important to know about the pre-processor is that pre-processor directives are NOT part of the C language proper; they serve several important functions, such as the conditional compilation mentioned earlier (used, for example, to have multiple versions of the program, say for different operating systems, or indeed for different compilers).
Taking it from there...
After this manifesto of mine... I encourage you to read just a little more, and then to dive into programming and building binaries. It is a very good idea to try to get a broad picture of the framework etc., but this can be overdone, a bit akin to the exchange student who stays in his/her room reading the Webster dictionary to be "prepared" for meeting native speakers, rather than just doing it!
Ideally you shouldn't need to care which C++ compiler you are using. Compatibility with the standard has gotten much better in recent years (even from Microsoft).
Compiler flags obviously differ, but the same features are generally available; it's just a differently named option to, e.g., set the warning level on GCC and MS cl.
The build system is independent of the compiler; you can use any make with any compiler.
That is a lot of questions in one.
C++ compilers are a lot like hammers: They come in all sizes and shapes, with different abilities and features, intended for different types of users, and at different price points; ultimately they all are for doing the same basic task as the others.
Some are intended for highly specialized applications, like high-performance graphics, and have numerous extensions and libraries to assist the engineer with those types of problems. Others are meant for general purpose use, and aren't necessarily always the greatest for extreme work.
The technique for using each type of hammer varies from model to model—and version to version—but they all have a lot in common. The macro preprocessor is a standard part of C and C++ compilers.
A brief comparison of many C++ compilers is here. Also check out the list of C compilers, since many programs don't use any C++ features and can be compiled by an ordinary C compiler.
C++ compilers don't "view" makefiles. The rules of a makefile may invoke a C++ compiler, but also may "compile" assembly language modules (assembling), process other languages, build libraries, link modules, and/or post-process object modules. Makefiles often contain rules for cleaning up intermediate files, establishing debug environments, obtaining source code, etc., etc. Compilation is one link in a long chain of steps to develop software.
Also, many development environments abstract the makefile into a "project file" which is used by an integrated development environment (IDE) in an attempt to simplify or automate many programming tasks. See a comparison here.
As for learning: choose a specific problem to solve and dive in. The target platform (Linux/Windows/etc.) and problem space will narrow the choices pretty well. Which you choose is often linked to other considerations, such as working for a particular company, or being part of a team. C++ has something like 95% commonality among all its flavors. Learn any one of them well, and learning the next is a piece of cake.