How to handle different OCaml versions when generating AST with ppxlib - ocaml

I am making a ppx extension rewriter as part of a code library
Ideally the library will be usable with some range of OCaml versions
I have noticed that when building AST nodes to output from my rewriter it is unavoidable to have to construct some records, whose structure is specific to a particular OCaml AST version
For example, when building a variant type declaration we have to define a record like:
{
pcd_name = {txt = name; loc};
pcd_args = Pcstr_tuple [];
pcd_res = None;
pcd_loc = loc;
pcd_attributes = [];
}
Which is the constructor_declaration type
However this AST type differs between OCaml 4.13 and OCaml 4.14
I am hoping that mostly the ppxlib Ast_builder helpers take care of generating the correct AST version for whatever OCaml version I'm compiling my library under.
But in the places where I have to manually define one of these record instances then presumably I need to detect the current OCaml version and return the correct record format that way?
I found this:
utop # Sys.ocaml_version;;
- : string = "4.12.1"
So presumably I should parse this string into int * int * int so that I can do a safe comparison for versions newer than 4.14.0
Is there a better way, or something different I should do instead?

The very aim of ppxlib is to write ppx independently of the compiler version.
If you end up matching on the compiler version, you are duplicating the work done by ppxlib.
Ppxlib works by using an internal reference AST and migrating between this reference AST and the various versions of OCaml AST. In theory, you could write by hand the AST node corresponding to the reference AST, however this is very brittle.
This is why ppxlib provides two version-independent paths to build AST nodes, and all of them result in ppx that works on all versions of the compiler supported by ppxlib:
the quotation syntax provided by ppx.metaquot
the constructor functions in the Ast_builder module
If you find a corner case where you are required to write an AST node by hand this is a bug in ppxlib that you should report.

Related

How can I count number of times an overloaded operator was used in a code base with particular type of operands

I have a templated class SafeInt<T> (By Microsoft).
This class in theory can be used in place of a POD integer type and can detect any integer overflows during arithmetic operations.
For this class I wrote some custom templatized overloaded arithmetic operator (+, -, *, /) functions whose both arguments are objects of SafeInt<T>.
I typedef'd all my integer types to SafeInt class type.
I want to search my codebase for instances of the said binary operators where both operands are of type SafeInt.
Some of the ways I could think of
String search using regex and weed through the code to detect operator usage instances where both operands are SafeInt objects.
Write a clang tool and process the AST to do this searching (I am yet to learn how to write such a tool.)
Somehow add a counter to count the number of times the custom overloaded operator is instantiated. I spent a lot of time trying this but doesn't seem to work.
Can anyone suggest a better way?
Please let me know if I need to clarify anything.
Thanks.
Short answer
You can do this using the clang-query command:
$ clang-query \
-c='m cxxOperatorCallExpr(callee(functionDecl(hasName("operator+"))), hasArgument(0, expr(hasType(cxxRecordDecl(hasName("SafeInt"))))), hasArgument(1, expr(hasType(cxxRecordDecl(hasName("SafeInt"))))))' \
use-si.cc --
Match #1:
/home/scott/wrk/learn/clang/clang-query1/use-si.cc:10:3: note: "root" binds here
x + y; // reported
^~~~~
1 match.
What is clang-query?
clang-query is a utility intended to facilitate writing clang-tidy checks. In particular it understands the language of AST Matchers and can be used to interactively explore what is matched by a given match expression. However, as shown here, it can also be used non-interactively to look for arbitrary AST tree patterns.
The blog post Exploring Clang Tooling Part 2: Examining the Clang AST with clang-query by Stephen Kelly provides a nice introduction to using clang-query.
The clang-query program is included in the pre-built LLVM binaries, or it can be built from source as described in the AST Matchers Tutorial.
How does the above command work?
The -c argument provides a command to run non-interactively. With whitespace added, the command is:
m // Match (and report) every
cxxOperatorCallExpr( // operator function call
callee(functionDecl( // where the callee
hasName("operator+"))), // is "operator+", and
hasArgument(0, // where the first argument
expr(hasType(cxxRecordDecl( // is a class type
hasName("SafeInt"))))), // called "SafeInt",
hasArgument(1, // and the second argument
expr(hasType(cxxRecordDecl( // is also a class type
hasName("SafeInt")))))) // called "SafeInt".
The command line ends with use-si.cc --, meaning to analyze use-si.cc and there are no extra compiler flags needed by clang to interpret it.
The clang-query command line has the same basic structure as that of clang-tidy, including the ability to pass -p compile_commands.json to scan many files at once, possibly with different compiler options per file.
Example input
For completeness, the input I used to test my matcher is use-si.cc:
// use-si.cc
#include "SafeInt.hpp" // SafeInt
void f1()
{
SafeInt<int> x(2);
SafeInt<int> y(3);
x + y; // reported
x + 2; // not reported
2 + x; // not reported
}
where SafeInt.hpp comes from https://github.com/dcleblanc/SafeInt , the repo named on the Microsoft SafeInt page.
To do this right, you clearly have to be able to identify individual uses of the operator which overload to a specific operator definition. Fundamentally, you need what the front end of a C++ compiler does: parsing and name resolution (including the overloads).
Obviously GCC and Clang have this basic capability. But you want to track/display all uses of the specific operator. You can probably bend Clang (or GCC, harder) to give you this information on a file-by-file basis.
Our DMS Software Reengineering Toolkit with its C++ Front End can be used for this, too.
DMS provides the generic parsing and symbol table support machinery; the C++ front end specializes DMS to handle C++ with full, accurate name resolution including overloads, for both GCC5 and MSVS2015. Its symbol table actually collects, for each declaration in a scope, the point of the declaration, and the list of uses of that declaration in terms of accurate source positions. The symbol scopes include an entry for each (overloaded) operator valid in the scope. You could just
go to the desired symbol table entry and enumerate/count the list of references to get a raw count. There are standard APIs for this available via DMS.
The same kind of symbol scope/definition/uses information is used by our Java Source Browser to build an HTML-based JavaDoc-like display with full HTML linkages between symbol declarations and uses. So for any symbol declaration, you can easily see the uses.
The C++ front end has a similar HTMLizer that operates on C++ source code. It isn't as mature/pretty, but it is robust. It presently doesn't show all the uses of a declared symbol, but that would be a pretty straightforward change to make to it. (I don't have a publicly visible instance of it. Contact me through my bio and I can send you a sample).

parsing a C structure in C++

I'm looking for a way to parse a C structure in order to get the the name and the type of the variables.
For example I have a structure like this:
struct MyStruct {
int anInt ;
float aFloat ;
}
I need to get the types int and float and the 2 strings "anInt" and "aFloat".
After I have to use these values in another function:
addValue<int>("anInt") ;
add Value<float>("aFloat") ;
Do you know how to do this automatically, I guess, at compilation ?
Thanks.
You probably cannot do that with standard C++11 template code.
You could consider customizing your C++ compiler to get such information. For example, if compiling your code with a recent GCC, you might consider customizing it with your MELT extension (MELT is a domain specific language, implemented by a plugin, to customize GCC). You'll need to understand the details of GCC internal representation (so such an extension might take a week or more of your time).
The Qt MOC facility might perhaps be useful or inspirational. It is parsing a limited form of class declaration.
Alternatively you might consider generating the C++ struct or class representation from some other input; e.g. change your build procedure, perhaps your Makefile, to generate some .h header file (and perhaps some .cc C++ translation unit) from your higher level representation.

How to modify C++ code from user-input

I am currently writing a program that sits on top of a C++ interpreter. The user inputs C++ commands at runtime, which are then passed into the interpreter. For certain patterns, I want to replace the command given with a modified form, so that I can provide additional functionality.
I want to replace anything of the form
A->Draw(B1, B2)
with
MyFunc(A, B1, B2).
My first thought was regular expressions, but that would be rather error-prone, as any of A, B1, or B2 could be arbitrary C++ expressions. As these expressions could themselves contain quoted strings or parentheses, it would be quite difficult to match all cases with a regular expression. In addition, there may be multiple, nested forms of this expression
My next thought was to call clang as a subprocess, use "-dump-ast" to get the abstract syntax tree, modify that, then rebuild it into a command to be passed to the C++ interpreter. However, this would require keeping track of any environment changes, such as include files and forward declarations, in order to give clang enough information to parse the expression. As the interpreter does not expose this information, this seems infeasible as well.
The third thought was to use the C++ interpreter's own internal parsing to convert to an abstract syntax tree, then build from there. However, this interpreter does not expose the ast in any way that I was able to find.
Are there any suggestions as to how to proceed, either along one of the stated routes, or along a different route entirely?
What you want is a Program Transformation System.
These are tools that generally let you express changes to source code, written in source level patterns that essentially say:
if you see *this*, replace it by *that*
but operating on Abstract Syntax Trees so the matching and replacement process is
far more trustworthy than what you get with string hacking.
Such tools have to have parsers for the source language of interest.
The source language being C++ makes this fairly difficult.
Clang sort of qualifies; after all it can parse C++. OP objects
it cannot do so without all the environment context. To the extent
that OP is typing (well-formed) program fragments (statements, etc,.)
into the interpreter, Clang may [I don't have much experience with it
myself] have trouble getting focused on what the fragment is (statement? expression? declaration? ...). Finally, Clang isn't really a PTS; its tree modification procedures are not source-to-source transforms. That matters for convenience but might not stop OP from using it; surface syntax rewrite rule are convenient but you can always substitute procedural tree hacking with more effort. When there are more than a few rules, this starts to matter a lot.
GCC with Melt sort of qualifies in the same way that Clang does.
I'm under the impression that Melt makes GCC at best a bit less
intolerable for this kind of work. YMMV.
Our DMS Software Reengineering Toolkit with its full C++14 [EDIT July 2018: C++17] front end absolutely qualifies. DMS has been used to carry out massive transformations
on large scale C++ code bases.
DMS can parse arbitrary (well-formed) fragments of C++ without being told in advance what the syntax category is, and return an AST of the proper grammar nonterminal type, using its pattern-parsing machinery. [You may end up with multiple parses, e.g. ambiguities, that you'll have decide how to resolve, see Why can't C++ be parsed with a LR(1) parser? for more discussion] It can do this without resorting to "the environment" if you are willing to live without macro expansion while parsing, and insist the preprocessor directives (they get parsed too) are nicely structured with respect to the code fragment (#if foo{#endif not allowed) but that's unlikely a real problem for interactively entered code fragments.
DMS then offers a complete procedural AST library for manipulating the parsed trees (search, inspect, modify, build, replace) and can then regenerate surface source code from the modified tree, giving OP text
to feed to the interpreter.
Where it shines in this case is OP can likely write most of his modifications directly as source-to-source syntax rules. For his
example, he can provide DMS with a rewrite rule (untested but pretty close to right):
rule replace_Draw(A:primary,B1:expression,B2:expression):
primary->primary
"\A->Draw(\B1, \B2)" -- pattern
rewrites to
"MyFunc(\A, \B1, \B2)"; -- replacement
and DMS will take any parsed AST containing the left hand side "...Draw..." pattern and replace that subtree with the right hand side, after substituting the matches for A, B1 and B2. The quote marks are metaquotes and are used to distinguish C++ text from rule-syntax text; the backslash is a metaescape used inside metaquotes to name metavariables. For more details of what you can say in the rule syntax, see DMS Rewrite Rules.
If OP provides a set of such rules, DMS can be asked to apply the entire set.
So I think this would work just fine for OP. It is a rather heavyweight mechanism to "add" to the package he wants to provide to a 3rd party; DMS and its C++ front end are hardly "small" programs. But then modern machines have lots of resources so I think its a question of how badly does OP need to do this.
Try modify the headers to supress the method, then compiling you'll find the errors and will be able to replace all core.
As far as you have a C++ interpreter (as CERN's Root) I guess you must use the compiler to intercept all the Draw, an easy and clean way to do that is declare in the headers the Draw method as private, using some defines
class ItemWithDrawMehtod
{
....
public:
#ifdef CATCHTHEMETHOD
private:
#endif
void Draw(A,B);
#ifdef CATCHTHEMETHOD
public:
#endif
....
};
Then compile as:
gcc -DCATCHTHEMETHOD=1 yourfilein.cpp
In case, user want to input complex algorithms to the application, what I suggest is to integrate a scripting language to the app. So that the user can write code [function/algorithm in defined way] so the app can execute it in the interpreter and get the final results. Ex: Python, Perl, JS, etc.
Since you need C++ in the interpreter http://chaiscript.com/ would be a suggestion.
What happens when someone gets ahold of the Draw member function (auto draw = &A::Draw;) and then starts using draw? Presumably you'd want the same improved Draw-functionality to be called in this case too. Thus I think we can conclude that what you really want is to replace the Draw member function with a function of your own.
Since it seems you are not in a position to modify the class containing Draw directly, a solution could be to derive your own class from A and override Draw in there. Then your problem reduces to having your users use your new improved class.
You may again consider the problem of automatically translating uses of class A to your new derived class, but this still seems pretty difficult without the help of a full C++ implementation. Perhaps there is a way to hide the old definition of A and present your replacement under that name instead, via clever use of header files, but I cannot determine whether that's the case from what you've told us.
Another possibility might be to use some dynamic linker hackery using LD_PRELOAD to replace the function Draw that gets called at runtime.
There may be a way to accomplish this mostly with regular expressions.
Since anything that appears after Draw( is already formatted correctly as parameters, you don't need to fully parse them for the purpose you have outlined.
Fundamentally, the part that matters is the "SYMBOL->Draw("
SYMBOL could be any expression that resolves to an object that overloads -> or to a pointer of a type that implements Draw(...). If you reduce this to two cases, you can short-cut the parsing.
For the first case, a simple regular expression that searches for any valid C++ symbol, something similar to "[A-Za-z_][A-Za-z0-9_\.]", along with the literal expression "->Draw(". This will give you the portion that must be rewritten, since the code following this part is already formatted as valid C++ parameters.
The second case is for complex expressions that return an overloaded object or pointer. This requires a bit more effort, but a short parsing routine to walk backward through just a complex expression can be written surprisingly easily, since you don't have to support blocks (blocks in C++ cannot return objects, since lambda definitions do not call the lambda themselves, and actual nested code blocks {...} can't return anything directly inline that would apply here). Note that if the expression doesn't end in ) then it has to be a valid symbol in this context, so if you find a ) just match nested ) with ( and extract the symbol preceding the nested SYMBOL(...(...)...)->Draw() pattern. This may be possible with regular expressions, but should be fairly easy in normal code as well.
As soon as you have the symbol or expression, the replacement is trivial, going from
SYMBOL->Draw(...
to
YourFunction(SYMBOL, ...
without having to deal with the additional parameters to Draw().
As an added benefit, chained function calls are parsed for free with this model, since you can recursively iterate over the code such as
A->Draw(B...)->Draw(C...)
The first iteration identifies the first A->Draw( and rewrites the whole statement as
YourFunction(A, B...)->Draw(C...)
which then identifies the second ->Draw with an expression "YourFunction(A, ...)->" preceding it, and rewrites it as
YourFunction(YourFunction(A, B...), C...)
where B... and C... are well-formed C++ parameters, including nested calls.
Without knowing the C++ version that your interpreter supports, or the kind of code you will be rewriting, I really can't provide any sample code that is likely to be worthwhile.
One way is to load user code as a DLL, (something like plugins,)
this way, you don't need to compile your actual application, just the user code will be compiled, and you application will load it dynamically.

gcc for parsing code

I would like to know how to use GCC as a library to parse C/C++/Java/Objective C/Ada code for my program.
I want to bypass prepocessing and prefix all the functions that are user written with a prefix My.
like so Print(); becomes MyPrint(); I also wish to do this with the variables.
You can look here:
http://codesynthesis.com/~boris/blog/2010/05/03/parsing-cxx-with-gcc-plugin-part-1/
This is description of how to use gcc plugin interface to parse C++ code. Other language should be handled in the same manner.
Also you can try pork from mozilla:
https://wiki.mozilla.org/Pork
When I tried it (pork), I spend hour or so to fix compile problems, but then
I can write scripts like this:
rewrite SyncPrimitiveUpgrade {
type PRLock* => Mutex*
call PR_NewLock() => new Mutex()
call PR_Lock(lock) => lock->Lock()
call PR_Unlock(lock) => lock->Unlock()
call PR_DestroyLock(lock) => delete lock
}
so it found all type PRLock and replate it with Mutex, also it search call of functions
like PR_NewLock and replace it with "new Mutex".
You might wish to investigate the sparse C parser. It understands a lot of C (all the C used in the Linux kernel sources, which is a fairly good subset of legal ANSI-C and GNU-C extensions) and provides a few sample compiler backends to provide a lint-like static analysis tool for type checking.
While the code looks very clean and thorough, your task might be easier done via another mechanism -- the example.c included with the sparse source that demonstrates a compiler is 1955 lines long.
For C, you cannot do that reliably. If you skip preprocessing you will -- in general -- not have valid C code to be parsed. E.g.
#define FOO
#define BAR
#define BAZ
FOO void BAR qux BAZ(void) { }
How is the parser supposed to recognize this a function definition of qux without doing the preprocessing?
First, GCC is not a library, and is not structured to be one (in contrast to LLVM).
Why (i.e. what for) do you want to parse C, C++, Ada source code?
I would consider (assuming a GCC 4.6 version) extending GCC either thru plugins written in C, or preferably using MELT, a high level domain specific language to extend GCC (disclaimer: I am the main author of MELT).
But using GCC as a library is not realistic at all.
I really think that for what you want to achieve, MELT is the right tool. However, it is poorly documented. Please use the gcc-melt#googlegroups.com list to ask questions.
And be aware that extending GCC does take some amount of work (more than a week perhaps), because you need to partly understand the GCC internal representations.
Our DMS Software Reengineering Toolkit can parse C, C++, Java and Ada code (not Objective C at this time) in a wide variety of dialects and carry out transformations on the code. DMS's C and C++ front ends include a preprocessor, so you can you can cause preprocessing before you parse.
I'm probably don't understand what you want to do, because it seems strange to rename every function and (global?) variable with a "My...." prefix. But you could do that with some DMS rules (a rough sketch of renames of user functions for GCC3:
domain C~GCC3.
rule rewrite_function_names(t: type_designator, i: IDENTIFIER, p: parameter_list, s: statements):
function_header->functionheader
"\t \i(\p) { \s } " -> "\t \renamed\(\i\) (\p) { \s }" ;
and a helper function "renames" that takes a tree node containing an identifer, and returns a tree node with the renamed identifier.
Because DMS patterns only match against the parse trees, you won't get any false positives.
You'd need some additional patterns to handle various different syntax cases within each langauge (e.g, for C, "void" return type, because "void" isn't a type designator in the syntax, and global variable declarations), and different rules for different languages (Ada's syntax is not the same as that of C).
This might seem like big hammer for your task, but if you really insist on doing this for a variety of languages in a reliable way, it seems hard to avoid the problem of getting decent parsers for all those languages. (And if you are really going to do this for all these languages, DMS can be taught to handle ObjectiveC the same we we have taught it to handle the other langauges).
Your alternative is some kind of string hacking solution, which might work 95% of the time. If you can live with that, then Perl or something similar is likely your answer.
forget about GCC, its made as a compiler's parser, not an analysis parser, you'd do way better using something like libclang, a C interface to clang, which can process both C & C++

Is there a way to print user-defined datatypes in ocaml?

I can't use print_endline because it requires a string, and I don't (think) I have any way to convert my very simple user-defined datatypes to strings. How can I check the values of variables of these datatypes?
In many cases, it's not hard to write your own string_of_ conversion routine. That's a simple alternative that doesn't require any extra libraries or non-standard OCaml extensions. For the courses I teach that use OCaml, this is often the simplest mechanism for students.
(It would be nice if there were support for a generic conversion to strings though; perhaps the OCaml deriving stuff will catch on.)
There's nothing in the base language that does this for you. There is a project named OCaml Deriving (named after a feature of Haskell) that can automatically derive print functions from type declarations. I haven't used it, but it sounds excellent.
http://code.google.com/p/deriving/
Once you have a function for printing your type (derived or not), you can install it in the ocaml top-level. This can be handy, as the built-in top-level printing sometimes doesn't do quite what you want. To do this, use the #install-printer directive, described in Chapter 9 of the OCaml Manual.
There are third-party library functions like dump in OCaml Batteries Included or OCaml Extlib, that will generically convert any value to a string using all the runtime information it can get. But this won't be able to recover all information; for example, constructor names are lost and become just integers, so it will not look exactly the way you want. You will basically have to write your own conversion functions, or use some tool that will write them for you.
Along the lines of previous answers, ppx_sexp is a PPX for generating printers from type definitions. Here's an example of how to use it while using jbuilder as your build system, and using Base and Stdio as your stdlib.
First, the jbuild file which specifies how to do the build:
(jbuild_version 1)
(executables
((names (w))
(libraries (base stdio))
(preprocess (pps (ppx_jane ppx_driver.runner)))
))
And here's the code.
open Base
open Stdio
type t = { a: int; b: float * float }
[##deriving sexp]
let () =
let x = { a = 3; b = (4.5,5.6) } in
[%sexp (x : t)] |> Sexp.to_string_hum |> print_endline
And when you run it you get this output:
((a 3) (b (4.5 5.6)))
S-expression converters are present throughout Base and all the related libraries (Stdio, Core_kernel, Core, Async, Incremental, etc.), and so you can pretty much count on being able to serialize any data structure you encounter there, as well as anything you define on your own.