Macro Replacement during Code Generation - C++

Presently I have some legacy code which generates op codes. If the input contains a large number of macros, code generation takes a very long time (hours!).
I have gone through the logic: macros are handled by searching for each one and replacing every variable inside it, something like inlining.
Is there a way that I can optimize it without manipulating the string?

You must tokenize your input before starting this kind of process. (I can't recommend the famous Dragon Book highly enough - even the ancient edition stood the test of time, and the updated 2006 version looks great.) Compiling is the sort of job that's best split up into smaller phases: if your first phase performs lexical analysis into tokens, splitting lines into keywords, identifiers, constants, and so on, then it's much simpler to find the references to macros and look them up in a symbol table. (It's also much easier to use a tool like lex or flex, or one of their modern equivalents, to do this job for you than to attempt it from scratch.)
The 'clue' seems to be that if the code has more macros, then code generation takes more time. That sounds like the process is linear in the number of macros, which is certainly too much. I'm assuming this process occurs one line at a time (if your language allows that, it is obviously an enormous advantage, since you don't need to treat the program as one huge string), and the pseudocode looks something like
for (each line in the program)
{
    for (each macro definition)
    {
        test if the macro appears;
        perform replacement if needed;
    }
}
That clearly scales with the number of macro definitions.
With tokenization, it looks something like this:
for (each line in the program)
{
    tokenize the line;
    for (each token in the line)
    {
        switch (based on the token type)
        {
        case (an identifier):
            lookup the identifier in the table of macro names;
            perform replacement as necessary;
            ....
        }
    }
}
which scales mostly with the size of the program (not the number of definitions) - the symbol table lookup can of course be done with more efficient data structures than looping through them all, so it stops being the significant factor. That second step is something that, again, programs like yacc and bison (and their more modern variants) can happily generate code to do.
Afterthought: when parsing the macro definitions, you can store those as a token stream as well, and mark the identifiers that are the 'placeholder' names for parameter replacement. When expanding a macro, switch to that token stream. (Again, something that tools like flex can easily do.)
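To make that concrete, here is a rough, untested C++ sketch of the tokenized approach. The Token type, the MacroDef layout and expand_line() are hypothetical placeholders for whatever your generator already has, and argument substitution is only hinted at:
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical token type - adapt to whatever your lexer produces.
enum class TokType { Identifier, Number, Operator, Other };
struct Token {
    TokType type;
    std::string text;
};

// A macro body stored as a pre-tokenized stream; parameter positions are
// marked during definition so expansion is a substitution, not a text search.
struct MacroDef {
    std::vector<std::string> params;
    std::vector<Token> body;   // identifiers matching a param act as placeholders
};

// One pass over the line: only identifier tokens are looked up, and the lookup
// is O(1) on average, so the cost no longer grows with the number of macros.
std::vector<Token> expand_line(const std::vector<Token>& line,
                               const std::unordered_map<std::string, MacroDef>& macros)
{
    std::vector<Token> out;
    for (const Token& tok : line) {
        if (tok.type == TokType::Identifier) {
            auto it = macros.find(tok.text);
            if (it != macros.end()) {
                // Real code would also collect the call's argument tokens here
                // and substitute them for the parameter placeholders.
                for (const Token& body_tok : it->second.body)
                    out.push_back(body_tok);
                continue;
            }
        }
        out.push_back(tok);
    }
    return out;
}
The point is that each line is scanned once, and only identifier tokens trigger a hash lookup, so adding more macro definitions no longer slows the scan down.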

I have an application which has its own grammar. It supports all the datatypes that a typical compiler supports (even macros). More precisely, it is a kind of compiler which generates opcodes by taking a program (written using that grammar) as input.
For handling macros, it uses text-replacement logic.
For Example:
Macro Add (a:int, b:int)
int c = a + b
End Macro
// Program Sum
..
int x = 10, y = 10;
Add(x, y);
..
// End of the program
After replacement it will be
// Program Sum
..
int x = 10, y = 10;
int c = x + y
..
// End of program
This text replacement, i.e. replacing the macro call with the macro logic, is taking a very long time.
Is there an optimal way to do it?

This is really hard to answer without knowing more about your preprocess/parse/compile process. One idea would be to store the macro names in a symbol table. When parsing, check text tokens against that table first. If you find a match, write the replacement into a new string and run that through the parser, then continue parsing the original text following the macro's closing paren.
Depending on your opcode syntax, another idea might be: when you encounter the macro definition while parsing, generate the opcodes, but put placeholders in place of the arguments. Then, when the parser encounters calls to the macro, generate the code for evaluating the arguments, and insert that code in place of the placeholders in the pre-generated macro code.
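As a very rough, untested illustration of that second idea - the Opcode struct, the "result location" bookkeeping and every name below are invented for the sketch, since your real instruction format isn't known:
#include <cstdint>
#include <vector>

// Completely hypothetical opcode: either a real instruction, or a placeholder
// meaning "the value of macro argument N goes here".
struct Opcode {
    uint32_t op = 0;        // the instruction word (meaningless in this sketch)
    int      arg_slot = -1; // >= 0 marks a placeholder for macro argument N
};

// Code generated once, when the macro definition is first seen.
struct CompiledMacro {
    std::vector<Opcode> body;
};

// At each call site: emit the code that evaluates the arguments, then copy the
// pre-generated body, patching the placeholders instead of re-parsing the macro text.
void emit_macro_call(std::vector<Opcode>& program,
                     const CompiledMacro& macro,
                     const std::vector<std::vector<Opcode>>& arg_code)
{
    std::vector<uint32_t> arg_locations;
    for (const auto& code : arg_code) {
        program.insert(program.end(), code.begin(), code.end());
        // Pretend the argument's result lives at the last emitted position.
        arg_locations.push_back(static_cast<uint32_t>(program.size() - 1));
    }
    for (Opcode op : macro.body) {
        if (op.arg_slot >= 0)
            op.op = arg_locations[op.arg_slot];  // patch the placeholder
        program.push_back(op);
    }
}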

Related

How to make input part of my code?

So I have this maths project where I have to write a program which calculates the definite integral of a given function within given boundaries. I've done this using C++ and CodeBlocks, but now I would like to make it possible to input the function from the command line when I run my code, just like I input the boundaries, so I don't have to edit a line of code every time I want to run it for a different function. I realised that this would require actually using the entered input (e.g. "sqrt(pow(x,2)-1)") as part of the code, and I really don't know how to do this or whether it is possible at all, so any help is welcome.
This is the part of the code which handles the function:
double Formula(double x)
{
    double a;
    a = sqrt(x);
    return a;
}
If you want to evaluate an expression like "sqrt(pow(x,2)-1)", you have to:
parse the string and generate an AST (Abstract syntax tree) which describes the operations to execute
use an evaluation function on the AST
For example, if you have "sqrt(pow(x,2)-1)" as input, the AST could be represented like this:
function - sqrt
    function - subtract
        function - pow
            variable - x
            integer - 2
        integer - 1
You have to define the structures which will be used to represent your AST.
Then, to parse the query string you have 2 choices:
parse it yourself, count the parentheses etc...
use a tool to generate the parser: yacc + lex, or under Linux bison + flex. These tools take some time to get used to.
If you have just a little project to do, you may want to try parsing the input yourself to generate the AST.
If the project is a compilation project, you should use bison + flex; they are made exactly for that (but they do take time to get used to!).
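If you do go the hand-written route, the AST plus a small recursive-descent parser can stay quite short. The following untested sketch (all names are my own, not from any library) handles numbers, the variable x, the operators + - * /, parentheses and calls such as sqrt(...) and pow(...,...); it is only meant to show the shape of the solution:
#include <cctype>
#include <cmath>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// A node is a constant, the variable x, or a named function/operator with children.
struct Node {
    enum Kind { Num, Var, Call } kind = Num;
    double value = 0;                        // for Num
    std::string name;                        // for Call: "+", "*", "sqrt", "pow", ...
    std::vector<std::unique_ptr<Node>> kids;
};

struct Parser {
    const std::string& s;
    size_t pos = 0;
    explicit Parser(const std::string& text) : s(text) {}

    void skip() { while (pos < s.size() && std::isspace((unsigned char)s[pos])) ++pos; }
    bool eat(char c) { skip(); if (pos < s.size() && s[pos] == c) { ++pos; return true; } return false; }

    // expr := term (('+' | '-') term)*
    std::unique_ptr<Node> expr() {
        auto left = term();
        for (;;) {
            skip();
            if (pos >= s.size() || (s[pos] != '+' && s[pos] != '-')) return left;
            char op = s[pos++];
            auto node = std::make_unique<Node>();
            node->kind = Node::Call;
            node->name = std::string(1, op);
            node->kids.push_back(std::move(left));
            node->kids.push_back(term());
            left = std::move(node);
        }
    }
    // term := factor (('*' | '/') factor)*
    std::unique_ptr<Node> term() {
        auto left = factor();
        for (;;) {
            char op = eat('*') ? '*' : (eat('/') ? '/' : '\0');
            if (op == '\0') return left;
            auto node = std::make_unique<Node>();
            node->kind = Node::Call;
            node->name = std::string(1, op);
            node->kids.push_back(std::move(left));
            node->kids.push_back(factor());
            left = std::move(node);
        }
    }
    // factor := number | 'x' | name '(' expr (',' expr)* ')' | '(' expr ')'
    std::unique_ptr<Node> factor() {
        if (eat('(')) {
            auto inner = expr();
            if (!eat(')')) throw std::runtime_error("expected )");
            return inner;
        }
        skip();
        auto node = std::make_unique<Node>();
        if (pos < s.size() && (std::isdigit((unsigned char)s[pos]) || s[pos] == '.')) {
            size_t used = 0;
            node->kind = Node::Num;
            node->value = std::stod(s.substr(pos), &used);
            pos += used;
            return node;
        }
        size_t start = pos;
        while (pos < s.size() && std::isalpha((unsigned char)s[pos])) ++pos;
        std::string name = s.substr(start, pos - start);
        if (name.empty()) throw std::runtime_error("unexpected character in expression");
        if (name == "x") { node->kind = Node::Var; return node; }
        node->kind = Node::Call;
        node->name = name;
        if (!eat('(')) throw std::runtime_error("expected ( after " + name);
        node->kids.push_back(expr());
        while (eat(',')) node->kids.push_back(expr());
        if (!eat(')')) throw std::runtime_error("expected )");
        return node;
    }
};

double eval(const Node& n, double x) {
    if (n.kind == Node::Num) return n.value;
    if (n.kind == Node::Var) return x;
    if (n.name == "+")    return eval(*n.kids[0], x) + eval(*n.kids[1], x);
    if (n.name == "-")    return eval(*n.kids[0], x) - eval(*n.kids[1], x);
    if (n.name == "*")    return eval(*n.kids[0], x) * eval(*n.kids[1], x);
    if (n.name == "/")    return eval(*n.kids[0], x) / eval(*n.kids[1], x);
    if (n.name == "sqrt") return std::sqrt(eval(*n.kids[0], x));
    if (n.name == "pow")  return std::pow(eval(*n.kids[0], x), eval(*n.kids[1], x));
    throw std::runtime_error("unknown function: " + n.name);
}
You then read the expression string once (e.g. with std::getline), build the tree with Parser::expr(), and call eval(*tree, x) for every x your integration loop needs, instead of editing Formula().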
Alternatively, integrate with a scripting language, make it do the function parsing and evaluation. It will be considerably slower though.
JavaScript interpreters are all over the place. Python is fairly popular, too. Some people like Lua.

Counting lines of code

I was doing some research on line counters for C++ projects and I'm very interested in the algorithms they use. Does anyone know where I can look at some implementations of such algorithms?
There's cloc, which is a free open-source source-lines-of-code counter. It has support for many languages, including C++. I personally use it to get the line count of my projects.
At its SourceForge page you can find the Perl source code for download.
Well, if by line counters you mean programs which count lines, then the algorithm is pretty trivial: just count the number of '\n' in the code. If, on the other hand, you mean programs which count C++ statements, or produce other metrics... Although not 100% accurate, I've gotten pretty good results in the past just by counting '}' and ';' (ignoring those in comments and string and character literals, of course). Anything more accurate would probably require parsing the actual C++.
You don't need to actually parse the code to count lines; it's enough to tokenise it.
The algorithm could look like:
int lastLine = -1;
int lines = 0;
for each token {
    if (isCode(token) && lastLine != token.line) {
        ++lines;
        lastLine = token.line;
    }
}
The only information you need to collect during tokenisation is:
what type of token it is (an operator, an identifier, a comment...). You don't need to get very precise here actually, as you only need to distinguish "non-code tokens" (comments) from "code tokens" (anything else)
at which line in the file the token occurs.
On how to tokenise, that's for you to figure out, but hand-writing a tokeniser for such a simple case shouldn't be hard. You could use flex, but that's probably overkill.
EDIT
I've mentioned "tokenisation", let me describe it for you quickly:
Tokenisation is the first stage of compilation. The input of tokenisation is text (multi-line program), and the output is a sequence of "tokens", as in: symbols with some meaning. For instance, the following program:
#include "something.h"
/*
This is my program.
It is quite useless.
*/
int main() {
    return something(2+3); // this is equal to 5
}
could look like:
PreprocessorDirective("include")
StringLiteral("something.h")
PreprocessorDirectiveEnd
MultiLineComment(...)
Keyword(INT)
Identifier("main")
Symbol(LeftParen)
Symbol(RightParen)
Symbol(LeftBrace)
Keyword(RETURN)
Identifier("something")
Symbol(LeftParen)
NumericLiteral(2)
Operator(PLUS)
NumericLiteral(3)
Symbol(RightParen)
Symbol(Semicolon)
SingleLineComment(" this is equal to 5")
Symbol(RightBrace)
Et cetera.
Tokens, depending on their type, may have arbitrary metadata attached to them (e.g. the symbol type, the operator type, the identifier text, or perhaps the number of the line where the token was found).
Such stream of tokens is then fed to the parser, which uses grammar production rules written in terms of these tokens, for instance, to build a syntax tree.
Writing a full parser that would give you a complete syntax tree of the code is challenging, and especially challenging if it's C++ we're talking about. However, tokenising (or "lexing" or "lexical analysis") is easier, especially when you're not concerned with much detail, and you should be able to write a tokeniser yourself using a finite state machine.
On how to actually use the output to count lines of code (i.e. lines in which at least one "code" token, meaning any token except a comment, starts) - see the algorithm I described earlier.
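For what it's worth, here is a minimal, untested C++ sketch of that idea as a single hand-rolled state machine; it only distinguishes comments, string/character literals and "everything else", and it deliberately ignores preprocessor subtleties, raw strings and line continuations:
#include <cctype>
#include <string>

// Count lines containing at least one character of real code, i.e. anything
// that is not whitespace and not inside a comment. String literals count as code.
int count_code_lines(const std::string& src)
{
    bool in_line_comment = false, in_block_comment = false;
    bool in_string = false, in_char = false;
    bool line_has_code = false;
    int lines = 0;

    for (size_t i = 0; i < src.size(); ++i) {
        char c = src[i];
        char next = (i + 1 < src.size()) ? src[i + 1] : '\0';

        if (c == '\n') {
            if (line_has_code) ++lines;
            line_has_code = false;
            in_line_comment = false;                  // line comments end at newline
            continue;
        }
        if (in_line_comment) continue;
        if (in_block_comment) {
            if (c == '*' && next == '/') { in_block_comment = false; ++i; }
            continue;
        }
        if (in_string || in_char) {
            line_has_code = true;
            if (c == '\\') ++i;                       // skip the escaped character
            else if (in_string && c == '"') in_string = false;
            else if (in_char && c == '\'') in_char = false;
            continue;
        }
        if (c == '/' && next == '/') { in_line_comment = true; ++i; continue; }
        if (c == '/' && next == '*') { in_block_comment = true; ++i; continue; }
        if (c == '"')  { in_string = true; line_has_code = true; continue; }
        if (c == '\'') { in_char = true; line_has_code = true; continue; }
        if (!std::isspace(static_cast<unsigned char>(c))) line_has_code = true;
    }
    if (line_has_code) ++lines;                       // last line without trailing newline
    return lines;
}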
I think part of the reason people are having so much trouble understanding your problem is because "Count the lines of c++" is itself an algorithm. Perhaps what you're trying to ask is "How do I identify a line of c++ in a file?" That is an entirely different question which Kos seems to have done a pretty good job trying to explain.

How do I associate changed lines with functions in a git repository of C code?

I'm attempting to construct a “heatmap” from a multi-year history stored in a git repository where the unit of granularity is individual functions. Functions should grow hotter as they change more times, more frequently, and with more non-blank lines changed.
As a start, I examined the output of
git log --patch -M --find-renames --find-copies-harder --function-context -- *.c
I looked at using Language.C from Hackage, but it seems to want a complete translation unit—expanded headers and all—rather than being able to cope with a source fragment.
The --function-context option is new since version 1.7.8. The foundation of the implementation in v1.7.9.4 is a regex:
PATTERNS("cpp",
/* Jump targets or access declarations */
"!^[ \t]*[A-Za-z_][A-Za-z_0-9]*:.*$\n"
/* C/++ functions/methods at top level */
"^([A-Za-z_][A-Za-z_0-9]*([ \t*]+[A-Za-z_][A-Za-z_0-9]*([ \t]*::[ \t]*[^[:space:]]+)?){1,}[ \t]*\\([^;]*)$\n"
/* compound type at top level */
"^((struct|class|enum)[^;]*)$",
/* -- */
"[a-zA-Z_][a-zA-Z0-9_]*"
"|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
"|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"),
This seems to recognize boundaries reasonably well but doesn’t always leave the function as the first line of the diff hunk, e.g., with #include directives at the top or with a hunk that contains multiple function definitions. An option to tell diff to emit separate hunks for each function changed would be really useful.
This isn’t safety-critical, so I can tolerate some misses. Does that mean I likely have Zawinski’s “two problems”?
I realise this suggestion is a bit tangential, but it may help in order to clarify and rank requirements. This would work for C or C++ ...
Instead of trying to find text blocks which are functions and comparing them, use the compiler to make binary blocks. Specifically, for every C/C++ source file in a change set, compile it to an object. Then use the object code as a basis for comparisons.
This might not be feasible for you, but IIRC there is an option on gcc to compile so that each function is compiled to an 'independent chunk' within the generated object code file (I believe this is -ffunction-sections). The linker can pull each 'chunk' into a program. (It is getting pretty late here, so I will look this up in the morning, if you are interested in the idea.)
So, assuming we can do this, you'll have lots of functions defined by chunks of binary code, so a simple 'heat' comparison is 'how much longer or shorter is the code between versions for any function?'
I am also thinking it might be practical to use objdump to reconstitute the assembler for the functions. I might use some regular expressions at this stage to trim off the register names, so that changes to register allocation don't cause too many false positives.
I might even try to sort the assembler instructions in the function bodies, and diff them to get a pattern of "removed" vs "added" between two function implementations. This would give a measure of change which is pretty much independent of layout, and even somewhat independent of the order of some of the source.
So it might be interesting to see if two alternative implementations of the same function (i.e. from different change sets) are the same instructions :-)
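A rough, untested sketch of that normalise-sort-diff step in C++; the regular expressions here are guesses tuned to x86 AT&T-style objdump output and would need adjusting for your toolchain:
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Read one disassembled function (e.g. the relevant lines of `objdump -d` output,
// extracted beforehand) and normalise each instruction: strip addresses and encoded
// bytes, and collapse register names so register-allocation changes don't count.
std::vector<std::string> normalised_instructions(const std::string& path)
{
    static const std::regex addr_prefix(R"(^\s*[0-9a-f]+:\s*([0-9a-f]{2}\s)*\s*)");
    static const std::regex reg_name(R"(%[a-z0-9]+)");   // %rax, %edi, ...
    std::vector<std::string> out;
    std::ifstream in(path);
    for (std::string line; std::getline(in, line); ) {
        line = std::regex_replace(line, addr_prefix, "");
        line = std::regex_replace(line, reg_name, "%r");
        if (!line.empty())
            out.push_back(line);
    }
    std::sort(out.begin(), out.end());   // order-independent comparison
    return out;
}

// "Heat" between two versions of a function: instructions present in one version
// but not in the other (a multiset symmetric difference over the sorted lists).
std::size_t instruction_churn(const std::vector<std::string>& a,
                              const std::vector<std::string>& b)
{
    std::vector<std::string> only_a, only_b;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(only_a));
    std::set_difference(b.begin(), b.end(), a.begin(), a.end(), std::back_inserter(only_b));
    return only_a.size() + only_b.size();
}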
This approach should also work for C++ because all names have been appropriately mangled, which should guarantee the same functions are being compared.
So, the regular expressions might be kept very simple :-)
Assuming all of this is straightforward, what might this approach fail to give you?
Side Note: This basic strategy could work for any language which targets machine code, as well as VM instruction sets like the Java VM Bytecode, .NET CLR code, etc too.
It might be worth considering building a simple parser, using one of the common tools, rather than just using regular expressions. Clearly it is better to choose something you are familiar with, or which your organisation already uses.
For this problem, a parser doesn't actually need to validate the code (I assume it is valid when it is checked in), and it doesn't need to understand the code, so it might be quite dumb.
It might throw away comments (retaining newlines), ignore the contents of text strings, and treat program text in a very simple way. It mainly needs to keep track of balanced '{' '}' and balanced '(' ')'; all the other valid program text is just individual tokens which can be passed 'straight through'.
Its output might be a separate file per function, to make tracking easier.
If the language is C or C++, and the developers are reasonably disciplined, they might never use 'non-syntactic macros'. If that is the case, then the files don't need to be preprocessed.
Then a parser is mostly just looking for the function name (an identifier) at file scope followed by ( parameter-list ) { ... code ... }
I'd SWAG it would be a few days' work using yacc & lex / flex & bison, and it might be so simple that there is no need for a parser generator at all.
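To illustrate just how dumb such a parser can afford to be, here is an untested C++ sketch that scans an already comment-stripped, unpreprocessed source file and reports each top-level name(...){...} block with its line range; it will happily misreport braced initialisers and struct definitions, which may be acceptable noise for a heatmap:
#include <cctype>
#include <string>
#include <vector>

struct FunctionSpan { std::string name; int first_line; int last_line; };

// Assumes comments and string literals were stripped beforehand, and that braces
// are balanced - i.e. the "disciplined, no non-syntactic macros" case described above.
std::vector<FunctionSpan> find_functions(const std::string& src)
{
    std::vector<FunctionSpan> result;
    int brace_depth = 0, paren_depth = 0, line = 1;
    std::string last_identifier, candidate;   // candidate = identifier seen just before '('
    FunctionSpan current{};

    for (size_t i = 0; i < src.size(); ++i) {
        char c = src[i];
        if (c == '\n') { ++line; continue; }
        if (brace_depth == 0) {
            if (std::isalpha((unsigned char)c) || c == '_') {
                size_t start = i;
                while (i < src.size() && (std::isalnum((unsigned char)src[i]) || src[i] == '_')) ++i;
                if (paren_depth == 0) last_identifier = src.substr(start, i - start);
                --i;
                continue;
            }
            if (c == '(') { if (paren_depth++ == 0) candidate = last_identifier; continue; }
            if (c == ')') { --paren_depth; continue; }
            if (c == ';') { candidate.clear(); continue; }   // a declaration, not a definition
        }
        if (c == '{') {
            if (brace_depth++ == 0) current = { candidate, line, line };
        } else if (c == '}') {
            if (--brace_depth == 0) { current.last_line = line; result.push_back(current); }
        }
    }
    return result;
}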
If the code is Java, then ANTLR is a possibility, and I think there was a simple Java parser example.
If Haskell is your focus, there may be student projects published which have made a reasonable stab at a parser.

How to clear comments and intermediate whitespace in a string containing a C function declaration?

In my program, written in C++, I need to take a set of strings, each containing the declaration of a C function, and perform a number of operations on them.
One of the operations is to compare whether one function is equal to another. To do that I plan to just prune away comments and intermediate whitespace which has no effect on the semantics of the function and then do a string comparison. However, I would like to retain whitespace within a string as removing that would change the output produced by the function.
I could write some code which iterates over the string characters and enters "string mode" whenever a quote (") is encountered, recognizing escaped quotes, but I wonder if there is a better way of doing this. One idea is to use a full-fledged C parser, run it over the function string, ignore all comments and excessive whitespace, and then convert the AST back to a string again. But looking around at some C parsers I get the feeling that most are a bitch to integrate with my source code (prove me wrong if I am). Perhaps I could try to use yacc or something with an existing C grammar and implement the parser myself...
So, any ideas on the best way to do this?
EDIT:
The program I'm writing takes an abstract model and converts it into C code. The model consists of a graph, where the nodes may or may not contain segments of C code (more precisely, a C function definition whose execution must be completely deterministic (i.e. no global state) and in which no memory operations are allowed). The program does pattern matching on the graph and merges and splits certain nodes which adhere to these patterns. However, these operations can only be performed if the nodes exhibit the same functionality (i.e. if their C function definitions are the same). This "checking that they are the same" will be done by simply comparing the strings which contain the C function declarations. If they are character-by-character identical, then they are equal.
Due to the nature of how the models are generated, this is quite a reasonable method of comparison, provided that the comments and excess whitespace are removed, as these are the only factors that may differ. This is the problem I'm facing -- how to do this with a minimal amount of implementation effort?
What do you mean by compare whether one function is equal to another? With a suitably precise meaning, that problem is known to be undecidable!
You did not tell us what your program is really doing. Parsing all real C programs correctly is not trivial (because the C language's syntax and semantics are not that simple!).
Did you consider using existing tools or libraries to help you? LLVM Clang is a possibility, or extending GCC through plugins, or even better with extensions coded in MELT.
But we cannot help you more without understanding your real goal. And parsing C code is probably more complex than what you imagine.
It looks like you can get away with a simple island grammar: strip comments, keep string literals intact, and collapse whitespace (tabs, '\n'). Since I'm working with AXE†, I wrote a quick grammar‡ for you. You can write a similar set of rules using Boost.Spirit.
#include <axe.h>
#include <string>

template<class I>
std::string clean_text(I i1, I i2)
{
    // rules for non-recursive comments, and no line continuation
    auto endl = axe::r_lit('\n');
    auto c_comment = "/*" & axe::r_find(axe::r_lit("*/"));
    auto cpp_comment = "//" & axe::r_find(endl);
    auto comment = c_comment | cpp_comment;

    // rules for string literals
    auto esc_backslash = axe::r_lit("\\\\");
    auto esc_quote = axe::r_lit("\\\"");
    auto string_literal = '"' & *(*(axe::r_any() - esc_backslash - esc_quote)
        & *(esc_backslash | esc_quote)) & '"';

    auto space = axe::r_any(" \t\n");
    auto dont_care = *(axe::r_any() - comment - string_literal - space);

    std::string result;

    // semantic actions
    // append everything matched
    auto append_all = axe::e_ref([&](I i1, I i2) { if(i1 != i2) result += std::string(i1, i2); });
    // append a single space
    auto append_space = axe::e_ref([&](I i1, I i2) { if(i1 != i2) result += ' '; });

    // island grammar for text
    auto text = *(dont_care >> append_all
        & *comment
        & *string_literal >> append_all
        & *(space % comment) >> append_space)
        & axe::r_end();

    if(text(i1, i2).matched)
        return result;
    else
        throw "error";
}
So now you can do the text cleaning:
std::string text; // this is your function
text = clean_text(text.begin(), text.end());
You might also need to create rules for superfluous ';', empty blocks {}, and the like. You might also need to merge string literals. How far you need to go depends on the way the functions were generated; you may end up writing a sizable portion of a C grammar.
† AXE library is soon to be released under boost license.
‡ I didn't test the code.
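If pulling in a parser library feels too heavy, the "string mode" state machine the question sketches is also workable by hand. Here is an untested sketch that drops comments, collapses whitespace outside literals and keeps string/character literals verbatim (note that it still distinguishes a+b from a + b; collapsing that too would need real tokenization):
#include <cctype>
#include <string>

// Normalise a C function body for textual comparison: strip comments, collapse
// whitespace outside string/char literals, keep literals verbatim.
// Simplifications: no trigraphs, no raw strings, no line continuations.
std::string normalise(const std::string& in)
{
    std::string out;
    bool in_string = false, in_char = false;
    bool pending_space = false;                    // emit at most one space between tokens

    for (size_t i = 0; i < in.size(); ++i) {
        char c = in[i];
        char next = (i + 1 < in.size()) ? in[i + 1] : '\0';

        if (in_string || in_char) {
            out += c;
            if (c == '\\' && i + 1 < in.size()) out += in[++i];   // keep escapes as-is
            else if (in_string && c == '"') in_string = false;
            else if (in_char && c == '\'') in_char = false;
            continue;
        }
        if (c == '/' && next == '/') {             // line comment
            while (i < in.size() && in[i] != '\n') ++i;
            pending_space = true;
            continue;
        }
        if (c == '/' && next == '*') {             // block comment
            i += 2;
            while (i + 1 < in.size() && !(in[i] == '*' && in[i + 1] == '/')) ++i;
            ++i;                                   // step past the closing '/'
            pending_space = true;
            continue;
        }
        if (std::isspace(static_cast<unsigned char>(c))) { pending_space = true; continue; }

        if (pending_space && !out.empty()) out += ' ';
        pending_space = false;
        if (c == '"') in_string = true;
        if (c == '\'') in_char = true;
        out += c;
    }
    return out;
}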
Perhaps your C functions that you want to parse are not as general (in their textual form, and also as parsed by a real compiler) as we are guessing.
You might consider doing things the other way round:
It could make sense to define a small domain-specific language (it could have a syntax much simpler to parse than C) and go the other way: instead of your tool parsing C code, the user would write in your DSL, and your tool would generate C code from it (to be compiled at a later stage by your usual C compiler).
Your DSL could actually be the description of your abstract model mixed with more procedural parts which are translated to C functions. Since the C functions you care about are quite specific, the DSL generating them could be small.
(Think of it this way: many parser generators like ANTLR or YACC or Bison are built on a similar idea.)
I actually did something quite similar in MELT (read notably my DSL2011 paper). You might find some useful tricks about designing a DSL translated to C.

How to store parsed function expressions for plugging-in many times?

As the topic indicates, my program needs to read several function expressions and plug in different values for the variable many times. Parsing the whole expression again every time I need to plug in a new value is definitely way too ugly, so I need a way to store the parsed expression.
The expression may look like 2x + sin(tan(5x)) + x^2. Oh, and the very important point -- I'm using C++.
Currently I have three ideas, but none of them are very elegant:
Storing the S-expression as a tree and evaluating it by recursion. It may be the old-school way to handle this, but it's ugly, and I would have to handle different numbers of parameters (like + vs. sin).
Composing anonymous functions with boost::lambda. It may work nicely, but personally I don't like boost.
Writing a small Python/Lisp script, using its native lambda expressions and calling it via IPC... Well, this is crazy.
So, any ideas?
UPDATE:
I did not try to implement support for parentheses or functions with only one parameter, like sin().
I tried the second way first, but I did not use boost::lambda; instead I used a feature of gcc which can be used to create (fake) anonymous functions, which I found here. The resulting code was 340 lines and did not work correctly because of scoping and a subtle issue with the stack.
Using boost::lambda might not have made it better, and I don't know whether it could handle scoping correctly. So sorry for not testing boost::lambda.
Storing the parsed string as S-expressions would definitely work, but the implementation would be even longer -- maybe ~500 lines? My project is not some gigantic project with tens of thousands of lines of code, so devoting that much energy to maintaining that kind of twisted code, which would not be used very often, does not seem like a good idea.
So finally I tried the third method -- it's awesome! The Python script has only 50 lines, pretty neat and easy to read. But, on the other hand, it would also make Python a prerequisite of my program. That's not so bad on *nix machines, but on Windows... I guess it would be very painful for non-programmers to install Python. The same goes for Lisp.
However, my final solution was to open bc as a subprocess. Maybe it's a bad choice for most situations; however, it fits me well.
On the other hand, for projects that work only under *nix or that already have Python as a prerequisite, I personally recommend the third way if the expression is simple enough to be parsed with a hand-written parser. If it's very complicated, like Hurkyl said, you could consider creating a mini-language.
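For what it's worth, the boost::lambda idea has a lighter modern equivalent: once the expression is parsed, fold it into nested C++11 lambdas wrapped in std::function, so the parse happens once and every plug-in is just a call. A hypothetical, untested sketch of the composition step (the parser that would drive it is omitted, and the expression is built by hand here):
#include <cmath>
#include <functional>
#include <iostream>

using Expr = std::function<double(double)>;   // a compiled expression of one variable x

// Building blocks a parser would combine while walking the input once.
Expr constant(double c)    { return [c](double)      { return c; }; }
Expr variable()            { return [](double x)     { return x; }; }
Expr add(Expr a, Expr b)   { return [a, b](double x) { return a(x) + b(x); }; }
Expr mul(Expr a, Expr b)   { return [a, b](double x) { return a(x) * b(x); }; }
Expr apply(Expr f, Expr a) { return [f, a](double x) { return f(a(x)); }; }

int main()
{
    Expr sin_ = [](double v) { return std::sin(v); };
    Expr tan_ = [](double v) { return std::tan(v); };

    // 2x + sin(tan(5x)) + x^2, composed by hand here; a parser would do this instead.
    Expr f = add(add(mul(constant(2), variable()),
                     apply(sin_, apply(tan_, mul(constant(5), variable())))),
                 mul(variable(), variable()));

    // The parse/composition happened once; plugging in a value is just a call.
    for (double x = 0; x < 3; x += 0.5)
        std::cout << "f(" << x << ") = " << f(x) << "\n";
}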
Why not use a scripting language designed for exactly this kind of purpose? There are several such languages floating around, but my experience is with lua.
I use lua to do this kind of thing "all the time". The code to embed and parse an expression like that is very small. It would look something like this (untested):
#include <iostream>
#include <string>
#include <lua.hpp>   // pulls in lua.h, lualib.h, lauxlib.h with C linkage

int main()
{
    std::string my_expression = "2*x + math.sin( math.tan( x ) ) + x * x";

    // Initialise lua and load the standard libraries (which include math).
    lua_State * L = luaL_newstate();
    luaL_openlibs(L);

    // Create your function and load it into lua
    std::string fn = "function myfunction(x) return " + my_expression + " end";
    luaL_dostring(L, fn.c_str());

    // Use your function
    for (int i = 0; i < 10; ++i)
    {
        // add the function to the stack
        lua_getglobal(L, "myfunction");
        // add the argument to the stack
        lua_pushnumber(L, i);
        // Make the call, using one argument and expecting one result.
        // stack looks like this : FN ARG
        lua_pcall(L, 1, 1, 0);
        // stack looks like this now : RESULT
        // so get the result and print it
        double result = lua_tonumber(L, -1);
        std::cout << i << " : " << result << std::endl;
        // The result is still on the stack, so clean it up.
        lua_pop(L, 1);
    }
    lua_close(L);
    return 0;
}