Interpreters: Handling includes/imports - c++

I've built an interpreter in C++ and everything works fine so far, but now I'm getting stuck with the design of the import/include/however you want to call it function.
I thought about the following:
Handling includes in the tokenizing process: When an include is found in the code, the tokenizing function is called recursively with the specified filename. The tokens of the included file are then spliced in at the position of the include.
Disadvantages: No conditional includes(!)
Handling includes during the interpreting process: I don't know how. All I know is that PHP must do it this way as conditional includes are possible.
Now my questions:
What should I do about includes?
How do modern interpreters (Python/Ruby) handle this? Do they allow conditional includes?

This problem is easy to solve if you have a clean design and you know what you're doing. Otherwise it can be very hard. I have written at least 6 interpreters that all have this feature, and it's fairly straightforward.
Your interpreter needs to maintain an environment that knows about all the global variables, functions, types and so on that have been defined. You might feel more comfortable calling this the "symbol table".
You need to define an internal function that reads a file and updates the environment. Depending on your language design, you might or might not do some evaluation the moment you read things in. My interpreters are very dynamic and evaluate each definition as soon as it is read in.
Your life will be infinitely easier if you structure your interpreter in layers:
Tokenizer (breaks input into tokens)
Parser (reads one token at a time, converts to abstract-syntax tree)
Evaluator (reads the abstract syntax and updates the environment)
The abstract-syntax tree is really the key. If you have this, when you encounter the import/include construct in the input, you just make a recursive call and get more abstract syntax back. You can do this in the parser or the evaluator. If you want conditional import, you have to do it in the evaluator, since only the evaluator can compute a condition.
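The layered design above can be sketched in C++ roughly as follows. This is only an illustration under stated assumptions: the toy language (`set`, `include`, `if NAME include FILE`), the names `Stmt`, `Env`, `Files`, `parse`, `eval` and `evalFile` are all made up for the example, and a `std::map` stands in for the file system so the sketch stays self-contained. The point it shows is that the evaluator makes the recursive call, so it can check a condition against the environment before including anything.

```cpp
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical toy language, one statement per line:
//   set NAME VALUE
//   include FILE
//   if NAME include FILE     (include only when NAME is defined and nonzero)
struct Stmt {
    enum Kind { Set, Include, IfInclude };
    Kind kind = Set;
    std::string name;  // variable for Set / IfInclude
    int value = 0;     // value for Set
    std::string file;  // target for Include / IfInclude
};

using Env = std::map<std::string, int>;            // the "symbol table"
using Files = std::map<std::string, std::string>;  // stand-in for the file system

// Tokenizer + parser: turns source text into abstract syntax.
std::vector<Stmt> parse(const std::string& src) {
    std::vector<Stmt> out;
    std::istringstream lines(src);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream toks(line);
        std::string kw;
        if (!(toks >> kw)) continue;  // skip blank lines
        Stmt s;
        if (kw == "set") {
            s.kind = Stmt::Set;
            toks >> s.name >> s.value;
        } else if (kw == "include") {
            s.kind = Stmt::Include;
            toks >> s.file;
        } else if (kw == "if") {
            std::string includeKw;  // the literal word "include"
            s.kind = Stmt::IfInclude;
            toks >> s.name >> includeKw >> s.file;
        } else {
            throw std::runtime_error("unknown statement: " + kw);
        }
        out.push_back(s);
    }
    return out;
}

void evalFile(const std::string& file, const Files& files, Env& env);

// Evaluator: walks the abstract syntax and updates the environment.
void eval(const std::vector<Stmt>& prog, const Files& files, Env& env) {
    for (const Stmt& s : prog) {
        switch (s.kind) {
        case Stmt::Set:
            env[s.name] = s.value;
            break;
        case Stmt::Include:
            evalFile(s.file, files, env);  // recursive call
            break;
        case Stmt::IfInclude: {
            // Only the evaluator knows the current value of the condition.
            auto it = env.find(s.name);
            if (it != env.end() && it->second != 0) evalFile(s.file, files, env);
            break;
        }
        }
    }
}

// The "internal function that reads a file and updates the environment".
void evalFile(const std::string& file, const Files& files, Env& env) {
    auto it = files.find(file);
    if (it == files.end()) throw std::runtime_error("cannot include " + file);
    eval(parse(it->second), files, env);
}
```

Calling `evalFile("main", files, env)` on a `Files` map that contains a `main` entry evaluates it and everything it includes. The sketch has no guard against files that include each other in a cycle; a real interpreter would probably track the set of files currently being evaluated.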
Source code for my interpreters is on the web. Two of them are written in C; the others are written in Standard ML.

Related

How to programmatically alter source code in C++ file?

I need a way to take a C/C++ source code file, inspect and perform some modifications to it, and then write the modified variant back to disk. Possible use cases that I have for it are:
Mutation testing, such as intentionally corrupting calculation in order to check if tests can catch it.
Altering visibility scope or annotating functions and methods. In order to split a large file into several smaller files but still being able to link them together, I want to turn some static functions into external functions so that the linker can find them later.
Generation of mock implementations of existing functions/methods. For all externally visible functions, create a function with an identical prototype but with an empty/dummy body so that other code can link against it.
Are there existing solutions for such workflow?
I am mostly interested in dealing with functions/methods. The rest of the information contained in a file, such as includes, type definitions etc., is less important to me, but it must be preserved in the output so that the end result remains syntactically correct.
A straightforward approach of applying a bunch of regular expressions to extract/modify the text is possible, but it is obviously not reliable in any way. I would like to avoid writing a full-blown C++ parser. Even having such a parser does not solve the follow-up task of saving the modified parse tree back to a file.
LibTooling and libclang are commonly used to develop such refactoring tools (clang-format, clang-tidy, etc.).

How to process c and c++ source code to calculate metrics for static code analysis?

I am extending a software tool to calculate metrics for software projects.
The metrics are then used to do a static code analysis.
My task is to implement the calculation of metrics for C and C++ projects.
During development I ran into problems that forced me to start over with a different tool or programming language.
I will describe the process, the problems, and what I tried in order to solve them, in chronological order and as well as I can.
Some metrics:
Lines of Code for Classes, Structs, Unions, Functions/Methods and Sourcefiles
Method Count for Classes and Structs
Complexity for Classes, Structs and Functions/Methods
Dependencies for/between Classes and Structs
Since C++ is a difficult language to parse and writing a C++ parser on my own is out of scope, I decided to use an existing C++ parser.
Therefore I began using libraries from the LLVM Project to gather syntactic and semantic information about a source file.
LLVM Tooling link: https://clang.llvm.org/docs/Tooling.html
First I started with LibTooling, written in C++, since it promised me "full control" over the Abstract Syntax Tree (AST).
I tried the RecursiveASTVisitor and the MatchFinder approaches without success.
So LibTooling was dismissed because I couldn't retrieve context information about the surroundings of a node in the AST.
I was only able to react to a callback when a specific node in the AST was visited, but I didn't know what context I was currently in.
E.g. when I visit a CXXRecordDecl (class, struct, union), I do not know whether it is a nested record or not.
But that information is needed to calculate the lines of code for a single class.
The second approach was using the libclang interface via its Python bindings.
With the libclang interface I was able to traverse the AST node by node recursively and store the needed context information on a stack.
Here I encountered a general problem with libclang:
Before creating the AST for a file, the preprocessor is started and resolves all preprocessor directives, just as it is supposed to do.
This is good, because if the preprocessor can't resolve all the include directives the resulting AST will be incomplete.
This is very bad, because I won't be able to provide all the include files or directories for an arbitrary C++ project.
This is also bad because code surrounded by conditional preprocessor directives is included in or left out of the AST depending on whether a preprocessor variable is defined. Parsing the same file multiple times with different sets of defined and undefined preprocessor variables is out of scope.
This led to the third and current attempt: a C++ parser generated by ANTLR from a C++14 grammar.
No preprocessor is executed before the parser.
This is good because the full source code is parsed and preprocessor directives are ignored.
The bad thing is that the parser does not seem to be very robust. It fails on code that compiles, leading to a broken AST. So this solution is not sufficient either.
My questions are:
Is there an option to deactivate the preprocessor before parsing a c/c++ source or header file with libClang?
So the source code is untouched and the AST is complete and detailed.
Is there a way to parse a c/c++ source code file without providing all the necessary include directories but still resulting in a detailed AST?
Since I am running out of options: what other approaches may be worth looking at when it comes to analysing/parsing C/C++ source code?
If you think this is not the right place to ask such questions feel free to redirect me to another place.
To answer your last question ("Since I am running out of options: what other approaches may be worth looking at when it comes to analysing/parsing C/C++ source code?"):
Another approach is to parse the source code as if it were merely text. This avoids the need to preprocess the source, and to bring in a complex parser. See this paper for an example/introduction: "The Conceptual Cohesion of Classes" by Andrian Marcus, Denys Poshyvanyk. You can still collect such information as LOC and number of methods from this approach, without needing a full parser.
This approach has drawbacks (as does any approach):
It either 1) parses comments along with the source code, or 2) requires that you remove comments from the source first, which is an easy step. Option 1 can be acceptable because comments also carry information about the code, which may help determine which modules are more closely coupled, etc.
It will lump local variables, method names, parameter names, etc. all into the "bag of words" that you are working with.
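As a rough illustration of the "source as text" approach, here is a minimal C++ sketch that removes comments and then counts non-blank lines. The function names `stripComments` and `countLoc` are invented for this example, and the scanner is deliberately simple: it understands `//` and `/* */` comments plus ordinary string and character literals, but not raw string literals or digit separators, so treat it as a starting point rather than a finished tool.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Remove // and /* */ comments while leaving string and character
// literals intact. Newlines inside block comments are kept so that
// line numbers in the output still match the input.
std::string stripComments(const std::string& src) {
    enum State { Code, LineComment, BlockComment, Str, Chr };
    State st = Code;
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        const char c = src[i];
        const char next = (i + 1 < src.size()) ? src[i + 1] : '\0';
        switch (st) {
        case Code:
            if (c == '/' && next == '/') { st = LineComment; ++i; }
            else if (c == '/' && next == '*') { st = BlockComment; ++i; }
            else {
                if (c == '"') st = Str;
                else if (c == '\'') st = Chr;
                out += c;
            }
            break;
        case LineComment:
            if (c == '\n') { st = Code; out += c; }
            break;
        case BlockComment:
            if (c == '*' && next == '/') { st = Code; ++i; out += ' '; }
            else if (c == '\n') out += c;
            break;
        case Str:
        case Chr:
            out += c;
            if (c == '\\' && next != '\0') { out += next; ++i; }  // escaped char
            else if ((st == Str && c == '"') || (st == Chr && c == '\'')) st = Code;
            break;
        }
    }
    return out;
}

// Count lines that still contain something other than whitespace
// once comments have been removed.
int countLoc(const std::string& src) {
    std::istringstream in(stripComments(src));
    std::string line;
    int count = 0;
    while (std::getline(in, line)) {
        if (line.find_first_not_of(" \t\r") != std::string::npos) ++count;
    }
    return count;
}
```

Because nothing is preprocessed, code inside every `#if` branch is counted, which is what the question asked for; the price is that the count is purely textual.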

What does embedding a language into another do?

This may be kind of basic but... here goes.
If I decide to embed some kind of scripting language like Lua or Ruby into a C++ program by linking its interpreter, what does that allow me to do in C++ then?
Would I be able to write Ruby or Lua code right into the cpp file or simply call scripts from the program?
If the latter is true, how would I do that?
Because they're scripting languages, the code is always going to be "interpreted." In reality, you aren't "calling" the script code inside your program, but rather when you reach that point, you're executing the interpreter in the context of that thread (the thread that reaches the scripting portion), which then reads the scripting language and executes the applicable machine code after interpreting it (JIT compiling kind of, but not really, there's no compiling involved).
Because of this, it's basically the same thing as forking the interpreter and running the script, unless you want to access your compiled program's variables from the script, or the script's variables from the compiled program. Because the interpreter runs on the thread that has your compiled program's context, you should be able to store script variables on the stack as well and access them when your thread stops running the interpreter (assuming you stored the variables on the stack).
Edit: response:
You would have to write it yourself. Think about it this way: if you want to use assembly in C++, you use the asm keyword. The C++ compiler then needs to parse the source file, get to the asm keyword, and switch to the assembler. The assembler then needs to continue until the closing bracket of the asm region and assemble this code.
If you want to do this, it will be a bit different, since assembly gets compiled, not interpreted (which is what you want to do). What you'll need to do is change the compiler you're using (let's say C++) so that it recognizes your own user-defined keyword. Let's say this keyword is scriptX{}. You need to change the C++ parser so that when it sees scriptX{}, it stores everything between the brackets in the read-only data section of your compiled program. You then need to add a hook in the compiled assembly file to switch the context of the thread to your script interpreter, and start the program counter at the beginning of your script section (which you put in the read-only data section of the object file).
Good luck with that...
A common reason to embed a scripting language into a program is to provide for the ability to control the program with scripts provided by the end user.
Probably the simplest example of such a script is a configuration file. Assume that your program has options, and needs to remember the options from run to run. You could write them out to a file as a binary image of your options structure, but that would be fragile, not easy to inspect or edit, and likely not portable across systems. Writing the options out in plain text with some sort of labels for which is which addresses most of those complaints, but now you need to parse that text and recover the options. Then some users want different options on Tuesdays, want to do simple arithmetic to compute one option from another, or to write one configuration file that they can use on both Windows and Linux, and pretty soon you find yourself inventing a little language to express all of those ideas and mechanisms with. At this point, there's a better way.
The languages Lua and TCL both grew out of essentially that scenario. Larger systems needed to be configured and controlled by end users. End users wanted to edit a simple text file and get immediate satisfaction, even (especially) when working with large systems that might have required hours to compile successfully.
One advantage here is that rather than inventing a programming language one feature at a time as users' needs change, you start with a complete language along with its documentation. The language designer has already made a number of tough decisions for you (how do I represent strings and numbers, what about lists, what about named values, what does if look like, etc.) and has generally also brought a carefully designed and debugged implementation to the table.
Lua is particularly easy to integrate. Reading a simple configuration file and extracting the settings from the Lua state can be done using a small subset of its C API. Once you have Lua available, it is attractive to use it for other purposes. In many cases, you will find that it is more productive to write only the innermost loops in C, and use Lua to glue those functions together and provide all the "business logic" of the application. This is how Adobe Lightroom is implemented, as well as many games on platforms ranging from simple set-top-boxes to iOS devices and even PCs.

How do I "compile" an expression in c++ at runtime? [duplicate]

Possible Duplicate:
compile and run c++ code runtime
I want to take as an input an expression from the user as a string and compile it into a callable c++ function. Are there any tools that allow you to do this easily?
Basically, How do I compile an Expression Tree into a callable method, C#? seems similar to what I want to do except that I need to do this in c++ and not c#.
I can certainly make a sort of generic evaluator using lex and yacc but I don't want to have to parse the string every time. Basically this expression will run in a critical inner loop so I'm looking for a way to "compile" it at run-time.
It's not easy... If you want my two cents, I would follow these steps:
Create an interface for the code that you must create at runtime. For example, the generated class must inherit from a pure virtual base class that represents your interface. Keep in mind that your program cannot use arbitrary code, only code created in a specific way, because it must know how to call it.
Call the compiler from inside your program. The compiler should build a library from your source code. You can keep a predefined project somewhere and replace its source file with your own, which makes it easy to produce a correct library.
Put your library in a specified location where you can find it.
Load the library at runtime. If you search, you will see that it's possible to load dynamic libraries at runtime, not only at link time (this is how, for example, plugins for programs can be created). So your program can load your library and use it. For example you can find some information here.
But, as others have said, it's not a trivial task.
EDIT: Another solution is to look at a parser library like boost::spirit::qi, which, used well, can give extremely helpful results.
You have to parse the expression to an abstract syntax tree and walk it or evaluate it in-place. Something like this should satisfy your needs for a simple mathematical expression.
You can write your own mini-interpreter that supports a subset of C++'s operations. Of course your compiler will optimize it, but I'm not sure how much. I did this for assembly in QBasic (mov, add, sub...) but it was quite slow because it was an interpreter running inside an interpreter :D
Did you think about Evolutionary computation and fitness functions? Worth looking at.
You can create a data structure that represents your parsed expression tree, and the overhead of evaluating that at runtime will be small compared to parsing the string every time.
Actually getting a callable method in C++ will be quite difficult, in that you would have to generate object code and dynamically load it into your program. This would duplicate a lot of what the whole compiler tool-chain does.
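The "parse once, evaluate many times" idea from the answers above can be sketched as follows. This is a minimal example under assumptions of my own: the grammar only covers `+ - * /`, parentheses, numeric literals and a single variable named `x`, and the `Node` and `Parser` names are made up for illustration. It walks a tree on every evaluation, so it avoids re-parsing but is not a substitute for generating machine code.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// One node of the parsed expression tree.
struct Node {
    char op = 'n';  // '+', '-', '*', '/', 'n' (number) or 'x' (variable)
    double value = 0.0;
    std::unique_ptr<Node> lhs, rhs;

    double eval(double x) const {
        switch (op) {
        case 'n': return value;
        case 'x': return x;
        case '+': return lhs->eval(x) + rhs->eval(x);
        case '-': return lhs->eval(x) - rhs->eval(x);
        case '*': return lhs->eval(x) * rhs->eval(x);
        default:  return lhs->eval(x) / rhs->eval(x);
        }
    }
};

// Recursive-descent parser: the string is parsed once, up front.
class Parser {
public:
    explicit Parser(std::string s) : src_(std::move(s)) {}

    std::unique_ptr<Node> parse() {
        auto n = expr();
        skipSpace();
        if (pos_ != src_.size()) throw std::runtime_error("unexpected input");
        return n;
    }

private:
    std::string src_;
    std::size_t pos_ = 0;

    void skipSpace() {
        while (pos_ < src_.size() &&
               std::isspace(static_cast<unsigned char>(src_[pos_]))) ++pos_;
    }
    bool accept(char c) {
        skipSpace();
        if (pos_ < src_.size() && src_[pos_] == c) { ++pos_; return true; }
        return false;
    }
    static std::unique_ptr<Node> make(char op, std::unique_ptr<Node> l,
                                      std::unique_ptr<Node> r) {
        auto n = std::make_unique<Node>();
        n->op = op;
        n->lhs = std::move(l);
        n->rhs = std::move(r);
        return n;
    }
    std::unique_ptr<Node> binary(char op, std::unique_ptr<Node> l,
                                 std::unique_ptr<Node> r) {
        return make(op, std::move(l), std::move(r));
    }
    std::unique_ptr<Node> expr() {  // expr := term (('+' | '-') term)*
        auto n = term();
        for (;;) {
            if (accept('+')) { auto r = term(); n = binary('+', std::move(n), std::move(r)); }
            else if (accept('-')) { auto r = term(); n = binary('-', std::move(n), std::move(r)); }
            else return n;
        }
    }
    std::unique_ptr<Node> term() {  // term := factor (('*' | '/') factor)*
        auto n = factor();
        for (;;) {
            if (accept('*')) { auto r = factor(); n = binary('*', std::move(n), std::move(r)); }
            else if (accept('/')) { auto r = factor(); n = binary('/', std::move(n), std::move(r)); }
            else return n;
        }
    }
    std::unique_ptr<Node> factor() {  // factor := '(' expr ')' | 'x' | number
        if (accept('(')) {
            auto n = expr();
            if (!accept(')')) throw std::runtime_error("expected ')'");
            return n;
        }
        if (accept('x')) return make('x', nullptr, nullptr);
        const char* start = src_.c_str() + pos_;
        char* end = nullptr;
        const double v = std::strtod(start, &end);
        if (end == start) throw std::runtime_error("expected number");
        pos_ += static_cast<std::size_t>(end - start);
        auto n = make('n', nullptr, nullptr);
        n->value = v;
        return n;
    }
};
```

In the inner loop you would call `Parser(text).parse()` once outside the loop and then only `tree->eval(x)` inside it, so the per-iteration cost is a tree walk rather than a parse.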

Is there a better (more modern) tool than lex/flex for generating a tokenizer for C++?

I recently added source file parsing to an existing tool that generated output files from complex command line arguments.
The command line arguments got to be so complex that we started allowing them to be supplied as a file that was parsed as if it was a very large command line, but the syntax was still awkward. So I added the ability to parse a source file using a more reasonable syntax.
I used flex 2.5.4 for Windows to generate the tokenizer for this custom source file format, and it worked. But I hated the code: global variables, weird naming conventions, and the C++ code it generated was awful. The existing code generation backend was glued to the output of flex - I don't use yacc or bison.
I'm about to dive back into that code, and I'd like to use a better/more modern tool. Does anyone know of something that:
Runs in Windows command prompt (Visual studio integration is ok, but I use make files to build)
Generates a proper encapsulated C++ tokenizer. (No global variables)
Uses regular expressions for describing the tokenizing rules (compatibility with lex syntax is a plus)
Does not force me to use the c-runtime (or fake it) for file reading. (parse from memory)
Warns me when my rules force the tokenizer to backtrack (or fixes it automatically)
Gives me full control over variable and method names (so I can conform to my existing naming convention)
Allows me to link multiple parsers into a single .exe without name collisions
Can generate a UNICODE (16bit UCS-2) parser if I want it to
Is NOT an integrated tokenizer + parser-generator (I want a lex replacement, not a lex+yacc replacement)
I could probably live with a tool that just generated the tokenizing tables if that was the only thing available.
Ragel: http://www.complang.org/ragel/ It fits most of your requirements.
It runs on Windows
It doesn't declare the variables, so you can put them inside a class or inside a function as you like.
It has nice tools for analyzing regular expressions to see when they would backtrack. (I don't know about this very much, since I never use syntax in Ragel that would create a backtracking parser.)
Variable names can't be changed.
Table names are prefixed with the machine name, and they're declared "const static", so you can put more than one in the same file and have more than one with the same name in a single program (as long as they're in different files).
You can declare the variables as any integer type, including UChar (or whatever UTF-16 type you prefer). It doesn't automatically handle surrogate pairs, though. It doesn't have special character classes for Unicode either (I think).
It only does regular expressions... has no bison/yacc features.
The code it generates interferes very little with a program. The code is also incredibly fast, and the Ragel syntax is more flexible and readable than anything I've ever seen. It's a rock solid piece of software. It can generate a table-driven parser or a goto-driven parser.
Flex also has a C++ output option.
The result is a set of classes that do that parsing.
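Just add the following to the head of your lex file:
%option c++
%option yyclass="Lexer"
Then in your source it is:
#include <fstream>
#include <FlexLexer.h>
// Lexer is your own class, derived from yyFlexLexer and named by yyclass above.
std::fstream file("config");
Lexer lexer(&file);
while (int token = lexer.yylex())
{
    // handle token here; yylex() returns 0 at end of input
}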
Boost.Spirit.Qi (parser-tokenizer) or Boost.Spirit.Lex (tokenizer only). I absolutely love Qi, and Lex is not bad either, but I just tend to take Qi for my parsing needs...
The only real drawback with Qi tends to be an increase in compile time, and it also runs slightly slower than hand-written parsing code. It is generally much faster than parsing with regex, though.
http://www.boost.org/doc/libs/1_41_0/libs/spirit/doc/html/index.html
There are two tools that come to mind, although you would need to find out for yourself which would be suitable: Antlr and GoldParser. Both tools have language bindings that let them be plugged into the C++ runtime environment.
boost.spirit and Yard parser come to my mind. Note that the lexer-generator approach is to some extent being replaced by C++ internal DSLs (domain-specific languages) for specifying tokens: the token rules are part of your code and need no external utility, you just follow a set of rules to specify your grammar.
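To make the "token rules as part of your code" idea concrete without depending on any of the libraries named above, here is a small sketch that uses `std::regex` from the standard library instead. The `Tokenizer` class, its `addRule`/`tokenize` methods and the convention that a rule named `skip` produces no token are all assumptions made for this example. It keeps all state inside one object (no globals) and reads from memory, but unlike lex it takes the first rule that matches rather than the longest match, and `std::regex` is not known for speed.

```cpp
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

class Tokenizer {
public:
    struct Token {
        std::string kind;
        std::string text;
    };

    // Rules are tried in the order they were added.
    void addRule(std::string kind, const std::string& pattern) {
        rules_.emplace_back(std::move(kind), std::regex(pattern));
    }

    // Tokenize an in-memory buffer; throws if no rule matches at some position.
    std::vector<Token> tokenize(const std::string& input) const {
        std::vector<Token> out;
        auto pos = input.cbegin();
        while (pos != input.cend()) {
            bool matched = false;
            for (const auto& rule : rules_) {
                std::smatch m;
                // match_continuous anchors the match at the current position.
                if (std::regex_search(pos, input.cend(), m, rule.second,
                                      std::regex_constants::match_continuous) &&
                    m.length(0) > 0) {
                    if (rule.first != "skip") out.push_back({rule.first, m.str(0)});
                    pos += m.length(0);
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw std::runtime_error("no rule matches at offset " +
                                         std::to_string(pos - input.cbegin()));
            }
        }
        return out;
    }

private:
    std::vector<std::pair<std::string, std::regex>> rules_;
};
```

Since each `Tokenizer` instance owns its own rules, several differently configured tokenizers can live in one executable without name collisions.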