finding a function name and counting its LOC - c++

So you know off the bat, this is a project I've been assigned. I'm not looking for an answer in code, but more a direction.
What I've been told to do is go through a file and count the actual lines of code while at the same time recording the function names and individual lines of code for the functions. The problem I am having is determining a way when reading from the file to determine if the line is the start of a function.
So far, I can only think of maybe having a string array of data types (int, double, char, etc), search for that in the line and then search for the parenthesis, and then search for the absence of the semicolon (so i know it isn't just the declaration of the function).
So my question is, is this how I should go about this, or are there other methods in which you would recommend?
The code in which I will be counting will be in C++.

Three approaches come to mind.
Use regular expressions. This is fairly similar to what you're thinking of. Look for lines that look like function definitions. This is fairly quick to do, but can go wrong in many ways.
char *s = "int main() {"
is not a function definition, but sure looks like one.
char
* /* eh? */
s
(
int /* comment? // */ a
)
// hello, world /* of confusion
{
is a function definition, but doesn't look like one.
Good: quick to write, can work even in the face of syntax errors; bad: can easily misfire on things that look like (or fail to look like) the "normal" case.
Variant: First run the code through, e.g., GNU indent. This will take care of some (but not all) of the misfires.
Use a proper lexer and parser. This is a much more thorough approach, but you may be able to re-use an open source lexer/parsed (e.g., from gcc).
Good: Will be 100% accurate (will never misfire). Bad: One missing semicolon and it spews errors.
See if your compiler has some debug output that might help. This is a variant of (2), but using your compiler's lexer/parser instead of your own.

Your idea can work in 99% (or more) of the cases. Only a real C++ compiler can do 100%, in which case I'd compile in debug mode (g++ -S prog.cpp), and get the function names and line numbers from the debug information of the assembly output (prog.s).
My thoughts for the 99% solution:
Ignore comments and strings.
Document that you ignore preprocessor directives (#include, #define, #if).
Anything between a toplevel { and } is a function body, except after typedef, class, struct, union, namespace and enum.
If you have a class, struct or union, you should be looking for method bodies inside it.
The function name is sometimes tricky to find, e.g. in long(*)(char) f(int); .
Make sure your parser works with template functions and template classes.

For recording function names I use PCRE and the regex
"(?<=[\\s:~])(\\w+)\\s*\\([\\w\\s,<>\\[\\].=&':/*]*?\\)\\s*(const)?\\s*{"
and then filter out names like "if", "while", "do", "for", "switch". Note that the function name is (\w+), group 1.
Of course it's not a perfect solution but a good one.

I feel manually doing the parsing is going to be a quite a difficult task. I would probably use a existing tool such as RSM redirect the output to a csv file (assuming you are on windows) and then parse the csv file to gather the required information.

Find a decent SLOC count program, eg, SLOCCounter. Not only can you count SLOC, but you have something against which to compare your results. (Update: here's a long list of them.)
Interestingly, the number of non-comment semicolons in a C/C++ program is a decent SLOC count.

How about writing a shell script to do this? An AWK program perhaps.

Related

Tools to refactor names of types, functions and variables?

struct Foo{
Bar get(){
}
}
auto f = Foo();
f.get();
For example you decide that get was a very poor choice for a name but you have already used it in many different files and manually changing ever occurrence is very annoying.
You also can't really make a global substitution because other types may also have a method called get.
Is there anything for D to help refactor names for types, functions, variables etc?
Here's how I do it:
Change the name in the definition
Recompile
Go to the first error line reported and replace old with new
Goto 2
That's semi-manual, but I find it to be pretty easy and it goes quickly because the compiler error message will bring you right to where you need to be, and most editors can read those error messages well enough to dump you on the correct line, then it is a simple matter of telling it to repeat the last replacement again. (In my vim setup with my hotkeys, I hit F4 for next error message, then dot for repeat last change until it is done. Even a function with a hundred uses can be changed reliably* in a couple minutes.)
You could probably write a script that handles 90% of cases automatically too by just looking for ": Error: " in the compiler's output, extracting the file/line number, and running a plain text replace there. If the word shows up only once and outside a string literal, you can automatically replace it, and if not, ask the user to handle the remaining 10% of cases manually.
But I think it is easy enough to do with my editor hotkeys that I've never bothered trying to script it.
The one case this doesn't catch is if there's another function with the same name that might still compile. That should never happen if you do this change in isolation, because an ambiguous name wouldn't compile without it.
In that case, you could probably do a three-step compiler-assisted change:
Make sure your code compiles before. Then add #disable to the thing you want to rename.
Compile. Every place it complains about it being unusable for being disabled, do the find/replace.
Remove #disable and rename the definition. Recompile again to make sure there's nothing you missed like child classes (the compiler will then complain "method foo does not override any function" so they stand right out too.
So yeah, it isn't fully automated, but just changing it and having the compiler errors help find what's left is good enough for me.
Some limited refactoring support can be found in major IDE plugins like Mono-D or VisualD. I remember that Brian Schott had plans to add similar functionality to his dfix tool by adding dependency on dsymbol but it doesn't seem implemented yet.
Not, however, that all such options are indeed of a very limited robustness right now. This is because figuring out the fully qualified name of any given symbol is very complex task in D, one that requires full semantics analysis to be done 100% correctly. Think about local imports, templates, function overloading, mixins and how it all affects identifying the symbol.
In the long run it is quite certain that we need to wait before reference D compiler frontend becomes available as a library to implement such refactoring tool in clean and truly reliable way.
A good find all feature can be better than a bad refactoring which, as mentioned previously, requires semantic.
Personally I have a find all feature in Coedit which displays the context of a match and works on all the project sources.
It's fast to process the results.

Highlight arguments in function body in vim

A little something that could be borrowed from IDEs. So the idea would be to highlight function arguments (and maybe scoped variable names) inside function bodies. This is the default behaviour for some C:
Well, if I were to place the cursor inside func I would like to see the arguments foo and bar highlighted to follow the algorithm logic better. Notice that the similarly named foo in func2 wouldn't get highlit. This luxury could be omitted though...
Using locally scoped variables, I would also like have locally initialized variables highlit:
Finally to redemonstrate the luxury:
Not so trivial to write this. I used the C to give a general idea. Really I could use this for Scheme/Clojure programming better:
This should recognize let, loop, for, doseq bindings for instance.
My vimscript-fu isn't that strong; I suspect we would need to
Parse (non-regexply?) the arguments from the function definition under the cursor. This would be language specific of course. My priority would be Clojure.
define a syntax region to cover the given function/scope only
give the required syntax matches
As a function this could be mapped to a key (if very resource intensive) or CursorMoved if not so slow.
Okay, now. Has anyone written/found something like this? Do the vimscript gurus have an idea on how to actually start writing such a script?
Sorry about slight offtopicness and bad formatting. Feel free to edit/format. Or vote to close.
This is much harder than it sounds, and borderline-impossible with the vimscript API as it stands, because you don't just need to parse the file; if you want it to work well, you need to parse the file incrementally. That's why regular syntax files are limited to what you can do with regexes - when you change a few characters, vim can figure out what's changed in the syntax highlighting, without redoing the whole file.
The vim syntax highlighter is limited to dealing with regexes, but if you're hellbent on doing this, you can roll your own parser in vimscript, and have it generate a buffer-local syntax that refers to tokens in the file by line and column, using the \%l and \%c atoms in a regex. This would have to be rerun after every change. Unfortunately there's no autocmd for "file changed", but there is the CursorHold autocmd, which runs when you've been idle for a configurable duration.
One possible solution can be found here. Not the best way because it highlights every occurrence in the whole file and you have to give the command every time (probably the second one can be avoided, don't know about the first). Give it a look though.

How To Extract Function Name From Main() Function Of C Source

I just want to ask your ideas regarding this matter. For a certain important reason, I must extract/acquire all function names of functions that were called inside a "main()" function of a C source file (ex: main.c).
Example source code:
int main()
{
int a = functionA(); // functionA must be extracted
int b = functionB(); // functionB must be extracted
}
As you know, the only thing that I can use as a marker/sign to identify these function calls are it's parenthesis "()". I've already considered several factors in implementing this function name extraction. These are:
1. functions may have parameters. Ex: functionA(100)
2. Loop operators. Ex: while()
3. Other operators. Ex: if(), else if()
4. Other operator between function calls with no spaces. Ex: functionA()+functionB()
As of this moment I know what you're saying, this is a pain in the $$$... So please share your thoughts and ideas... and bear with me on this one...
Note: this is in C++ language...
You can write a Small C++ parser by combining FLEX (or LEX) and BISON (or YACC).
Take C++'s grammar
Generate a C++ program parser with the mentioned tools
Make that program count the funcion calls you are mentioning
Maybe a little bit too complicated for what you need to do, but it should certainly work. And LEX/YACC are amazing tools!
One option is to write your own C tokenizer (simple: just be careful enough to skip over strings, character constants and comments), and to write a simple parser, which counts the number of {s open, and finds instances of identifier + ( within. However, this won't be 100% correct. The disadvantage of this option is that it's cumbersome to implement preprocessor directives (e.g. #include and #define): there can be a function called from a macro (e.g. getchar) defined in an #include file.
An option that works for 100% is compiling your .c file to an assembly file, e.g. gcc -S file.c, and finding the call instructions in the file.S. A similar option is compiling your .c file to an object file, e.g, gcc -c file.c, generating a disassembly dump with objdump -d file.o, and searching for call instructions.
Another option is finding a parser using Clang / LLVM.
gnu cflow might be helpful

parser with scopes and conditionals

I'm writing a C/C++/... build system (I understand this is madness ;)), and I'm having trouble designing my parser.
My "recipes" look like this:
global
{
SOURCE_DIRS src
HEADER_DIRS include
SOURCES bitwise.c \
framing.c
HEADERS \
ogg/os_types.h \
ogg/ogg.h
}
lib static ogg_static
{
NAME ogg
}
lib shared ogg_shared
{
NAME ogg
}
(This being based on the super simple libogg source tree)
# are comments, \ are "newline escapes", meaning the line continues on the next line (see QMake syntac). {} are scopes, like in C++, and global are settings that apply to every "target". This is all background, and not that relevant... I really don't know how to work with my scopes. I will need to be able to have multiple scopes, and also a form of conditional processing, in the lines of:
win32:DEFINES NO_CRT_SECURE_DEPRECATE
The parsing function will need to know on what level of scope it's at, and call itself whenever the scope is increased. There is also the problem with the location of the braces ( global { or global{ or as in the example).
How could I go about this, using Standard C++ and STL? I understand this is a whole lot of work, and that's exactly why I need a good starting point. Thanks!
What I have already is the whole ifstream and internal string/stringstream storage, so I can read word per word.
I would suggest (and this is more or less right out of the compiler textbooks) that you approach the problem in phases. This breaks things down so that the problem is much more manageable in each phase.
Focus first on the lexer phase. Your lexing phase should take the raw text and give you a sequence of tokens, such as words and special characters. The lexer phase can take care of line continuations, and handle whitespace or comments as appropriate. By handling whitespace, the lexer can simplify your parser's task: you can write the lexer so that global{, global {, and even
global
{
will all yield two tokens: one representing global and one representing {.
Also note that the lexer can tack line and column numbers onto the tokens for use later if you hit errors.
Once you've got a nice stream of tokens flowing, work on your parsing phase. The parser should take that sequence of tokens and build an abstract syntax tree, which models the syntactic structures of your document. At this point, you shouldn't be worrying about ifstream and operator>>, since the lexer should have done all that reading for you.
You've indicated an interest in calling the parsing function recursively once you see a scope. That's certainly one way to go. As you'll see, the design decision you'll have to repeatedly make is whether you literally want to call the same parse function recursively
(allowing for constructions like global { global { ... } } which you may want to disallow syntactically), or whether you want to define a slightly (or even significantly) different set of syntax rules that apply inside a scope.
Once you find yourself having to vary the rules: the key is to reuse, by refactoring into functions, as much stuff as you can reuse between the different variants of syntax. If you keep heading in this direction – using separate functions that represent the different chunks of syntax you want to deal with and having them call each other (possibly recursively) where needed – you'll ultimately end up with what we call a recursive descent parser. The Wikipedia entry has got a good simple example of one; see http://en.wikipedia.org/wiki/Recursive_descent_parser .
If you find yourself really wanting to delve deeper into the theory and practice of lexers and parsers, I do recommend you get a good solid compiler textbook to help you out. The Stack Overflow topic mentioned in the comments above will get you started: Learning to write a compiler
boost::spirit is a good recursive descent parser generator that uses C++ templates as a language extension to describe parser and lexer. It works well for native C++ compilers, but won't compile under Managed C++.
Codeproject has a tutorial article that may help.
ANTLR (use ANTLRWorks), after that you can look for FLEX/BISON and others like lemon. There are many tools out there but ANTLR and flex/bison should be enough. I personally like ANTLRWorks too much to recommend something else.
LATER: With ANTLR you can generate parser/lexer code for a variety of languages.
Unless the point of the project is specifically learning how to write a lexer and shift-reduce parser, I'd recommending using Flex and Bison, which will handle much of the parsing grunt-work for you. Writing the grammar and semantic analysis will still be a whole lot of work, don't worry ;)

Getting svn diff to show C++ function during commit

Whenever I do a commit cycle in svn, I examine the diff when writing my comments. I thought it would be really nice to show the actual function that I made the modifications in when showing the diff.
I checked out this page, which mentioned that the -p option will show the C function that the change is in. When I tried using the -p option with some C++ code, however, it usually returns the access specifier (private, public, protected, etc), which isn't terribly handy.
I did notice that there is a -F option for diff that does the same as -p, but takes a user-specified regex. I was wondering: is there a simple regex to match a C++ function? It seems like that would be all that is necessary to get this to work.
I'd spend some time looking at this myself, but work is in crunch-mode and this seemed like something that a lot of people would find useful, so I figured I'd post it here.
EDIT: I'm not looking for something that's a slam-dunk catch-all regex, but something that would simply find the nearest function definition above the area diff would show. The fact that it would be nowhere near perfect, and somewhat buggy is okay with me. Just as long as it works right maybe like 60% of the time would be a significant productivity improvement IMHO.
Is there a simple regex to match a C++ function? No.
Is there a (complex) regex to match a C++. Maybe or could be possible to write one.
But I would say regular expressions neither are easily up to such a task (given you want some kind of excat match) nor are they the right tool for such a task.
Just think about case like this one. How would you handle this stuff.
void (*function(int, void (*)(int)))(int);
func1(int), func2(double); double func3(int);
The only real solution is to use a parser using yacc/lex. Which for your use case of course does nothing.
So either hack together some incomplete regex which fits most functions signatures in your code
If you're going to be applying this only to your commits I would recommend making a habit of adding a commit comment to the function, e.g:
void something ()
{
...
some thing = 1;
...
}
to
void something ()
// last change by me: a better value for thing
{
...
some thing = 2;
...
}
will display for you the function and your comment with the edits. As a bonus, other people will be able to understand what you're doing.
TortoiseSVN uses the following regexes for autocompletion support in its commit dialog for C++ files:
.h, .hpp, .hxx = ^\s*(?:class|struct)\s+([\w_]+)|\W([\w_]+)\(
.cpp, .c, .cxx = \W(([\w_]+)::([\w_]+))|\W([\w_]+)\(
I don't know how accurate they are, though.
I don't know of an option in SVN that will do this, and a regex-based solution will likely be one or more of the following:
a nightmare to code and maintain, with lots of special cases
incorrect, and missing several valid C++ functions
You need some sort of parser for this. It's technically possible to enumerate all of the possible regex cases, but writing a parser is the correct way to solve this. If you have time to roll your own solution I'd check out ANTLR, they have several C/C++ grammars available on this page:
ANTLR Grammar Lists