I was doing some research on line counters for C++ projects and I'm very interested in the algorithms they use. Does anyone know where I can look at some implementations of such algorithms?
There's cloc, which is a free open-source source lines of code counter. It has support for many languages, including C++. I personally use it to get the line count of my projects.
On its SourceForge page you can find the Perl source code for download.
Well, if by line counters you mean programs which count lines, then the algorithm is pretty trivial: just count the number of '\n' in the code. If, on the other hand, you mean programs which count C++ statements, or produce other metrics... Although not 100% accurate, I've gotten pretty good results in the past just by counting '}' and ';' (ignoring those in comments and in string and character literals, of course). Anything more accurate would probably require parsing the actual C++.
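To illustrate, here's a rough sketch of that heuristic; the function name and the set of states are my own, and it makes no attempt to handle raw string literals, trigraphs or the preprocessor:

#include <cstddef>
#include <string>

// Rough statement count: ';' and '}' outside comments and literals.
// This is a heuristic, not a parser.
int roughStatementCount(const std::string& src) {
    enum State { Code, LineComment, BlockComment, StringLit, CharLit } state = Code;
    int count = 0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        char c = src[i];
        char next = (i + 1 < src.size()) ? src[i + 1] : '\0';
        switch (state) {
        case Code:
            if (c == '/' && next == '/')      { state = LineComment; ++i; }
            else if (c == '/' && next == '*') { state = BlockComment; ++i; }
            else if (c == '"')                state = StringLit;
            else if (c == '\'')               state = CharLit;
            else if (c == ';' || c == '}')    ++count;
            break;
        case LineComment:
            if (c == '\n') state = Code;
            break;
        case BlockComment:
            if (c == '*' && next == '/') { state = Code; ++i; }
            break;
        case StringLit:
            if (c == '\\') ++i;              // skip the escaped character
            else if (c == '"') state = Code;
            break;
        case CharLit:
            if (c == '\\') ++i;
            else if (c == '\'') state = Code;
            break;
        }
    }
    return count;
}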
You don't need to actually parse the code to count lines; it's enough to tokenise it.
The algorithm could look like:
int lastLine = -1;
int lines = 0;
for each token {
    if (isCode(token) && lastLine != token.line) {
        ++lines;
        lastLine = token.line;
    }
}
The only information you need to collect during tokenisation is:
what type of token it is (an operator, an identifier, a comment...). You don't need to be very precise here, actually, as you only need to distinguish "non-code tokens" (comments) from "code tokens" (anything else)
at which line in the file the token occurs.
On how to tokenise, that's for you to figure out, but hand-writing a tokeniser for such a simple case shouldn't be hard. You could use flex, but that's probably more than you need here.
EDIT
I've mentioned "tokenisation", let me describe it for you quickly:
Tokenisation is the first stage of compilation. The input of tokenisation is text (multi-line program), and the output is a sequence of "tokens", as in: symbols with some meaning. For instance, the following program:
#include "something.h"
/*
This is my program.
It is quite useless.
*/
int main() {
    return something(2+3); // this is equal to 5
}
could, after tokenisation, look like this:
PreprocessorDirective("include")
StringLiteral("something.h")
PreprocessorDirectiveEnd
MultiLineComment(...)
Keyword(INT)
Identifier("main")
Symbol(LeftParen)
Symbol(RightParen)
Symbol(LeftBrace)
Keyword(RETURN)
Identifier("something")
Symbol(LeftParen)
NumericLiteral(2)
Operator(PLUS)
NumericLiteral(3)
Symbol(RightParen)
Symbol(Semicolon)
SingleLineComment(" this is equal to 5")
Symbol(RightBrace)
Et cetera.
Tokens, depending on their type, may have arbitrary metadata attached to them (e.g. the symbol type, the operator type, the identifier text, or perhaps the number of the line where the token was found).
Such a stream of tokens is then fed to the parser, which uses grammar production rules written in terms of these tokens, for instance, to build a syntax tree.
Writing a full parser that would give you a complete syntax tree of the code is challenging, and especially so if it's C++ we're talking about. However, tokenising (or "lexing" or "lexical analysis") is easier, especially when you're not concerned with too many details, and you should be able to write a tokeniser yourself using a finite state machine.
On how to actually use the output to count lines of code (i.e. lines on which at least one "code" token, i.e. any token other than a comment, starts), see the algorithm I described earlier.
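To make this concrete, here is one possible sketch in C++, with names of my own choosing: a deliberately coarse hand-written tokeniser (it only separates comments from everything else, and will be fooled by things like "//" inside string literals) feeding the counting loop from above.

#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// A "token" here is deliberately coarse: we only care whether it is code
// or a comment, and on which line it starts.
struct Token {
    bool isCode;
    int  line;
};

// Tiny hand-written scanner: whitespace is skipped, comments become
// non-code tokens, and any other maximal run of non-space characters
// becomes a code token. Far from a real C++ lexer, but enough for counting.
std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> tokens;
    int line = 1;
    std::size_t i = 0;
    while (i < src.size()) {
        char c = src[i];
        char next = (i + 1 < src.size()) ? src[i + 1] : '\0';
        if (c == '\n')                                          { ++line; ++i; }
        else if (std::isspace(static_cast<unsigned char>(c)))   { ++i; }
        else if (c == '/' && next == '/') {                     // line comment
            tokens.push_back({false, line});
            while (i < src.size() && src[i] != '\n') ++i;
        }
        else if (c == '/' && next == '*') {                     // block comment
            tokens.push_back({false, line});
            i += 2;
            while (i + 1 < src.size() && !(src[i] == '*' && src[i + 1] == '/')) {
                if (src[i] == '\n') ++line;
                ++i;
            }
            i += 2;
        }
        else {                                                  // run of code characters
            tokens.push_back({true, line});
            while (i < src.size() && !std::isspace(static_cast<unsigned char>(src[i]))
                   && !(src[i] == '/' && i + 1 < src.size()
                        && (src[i + 1] == '/' || src[i + 1] == '*')))
                ++i;
        }
    }
    return tokens;
}

// The counting loop from the earlier pseudocode.
int countCodeLines(const std::string& src) {
    int lastLine = -1, lines = 0;
    for (const Token& token : tokenize(src)) {
        if (token.isCode && lastLine != token.line) {
            ++lines;
            lastLine = token.line;
        }
    }
    return lines;
}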
I think part of the reason people are having so much trouble understanding your problem is that "count the lines of C++" is itself an algorithm. Perhaps what you're trying to ask is "How do I identify a line of C++ in a file?" That is an entirely different question, which Kos seems to have done a pretty good job of explaining.
Related
There is probably a very obvious answer to this, but I was wondering how the compiler knows which line of code my error is on. In some cases it even knows the column.
The only way I can think of to do this is to tokenize the input string into a 2D array that would store [lines][tokens].
C/C++ could be tokenized into one long 1D array, which would probably be more efficient. I am wondering what the usual parsing method is that keeps line information.
Actually, most of it is covered in the Dragon Book.
Compilers do lexing/parsing, i.e. they transform the source code into a tree representation.
When doing so, each keyword, variable, etc. is associated with a line and column number.
However, during parsing the exact origin of a failure might get lost and the reported information might be off.
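For instance, a scanner typically stamps each token with the position where it started, roughly like this (a sketch with made-up names, not how any particular compiler does it):

#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Each token remembers where it started, so that later phases (parser,
// semantic analysis) can say "error at line 12, column 7".
struct Token {
    std::string text;
    int line;
    int column;
};

std::vector<Token> scan(const std::string& src) {
    std::vector<Token> tokens;
    int line = 1, column = 1;
    std::size_t i = 0;
    while (i < src.size()) {
        char c = src[i];
        if (c == '\n') { ++line; column = 1; ++i; }
        else if (std::isspace(static_cast<unsigned char>(c))) { ++column; ++i; }
        else if (std::isalnum(static_cast<unsigned char>(c)) || c == '_') {
            // a word (identifier, keyword or number): record its start position
            Token t{"", line, column};
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) {
                t.text += src[i++];
                ++column;
            }
            tokens.push_back(t);
        }
        else {
            // single-character punctuation/operator token
            tokens.push_back({std::string(1, c), line, column});
            ++column; ++i;
        }
    }
    return tokens;
}

A later phase that rejects a token can then simply print token.line and token.column in its error message.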
This is the first step on the long, complicated path towards "Engineering a Compiler" or compiler theory.
The short answer to that is: there's a module called the "front end" that usually takes care of many phases:
Scanning
Parsing
IR generator
IR optimizer ...
The structure isn't fixed, so each compiler will have its own set of modules, but more or less the steps involved in front-end processing are:
Scanning - maps the character stream into words or tokens (and ignores whitespace/comments)
Parsing - this is where syntax and (some) semantic analysis take place, and where syntax errors are reported
To sum this up: the compiler knows the location of your error because when something doesn't fit into the structure being built, called the "abstract syntax tree" (i.e. it cannot be constructed), or doesn't follow any of the syntax-directed translation rules, well... there's something wrong, and the compiler reports the location where the mismatch was detected. If the grammar error is on just one word/token, then even a precise column can be reported, since nothing matched a terminal: a basic token like the if keyword in C/C++.
If you want to know more about this topic, my suggestion is to start with the classic academic approach of the "Compiler Book" or "Dragon Book" and then, later on, possibly study an open-source front end like Clang.
I'm attempting to construct a “heatmap” from a multi-year history stored in a git repository where the unit of granularity is individual functions. Functions should grow hotter as they change more times, more frequently, and with more non-blank lines changed.
As a start, I examined the output of
git log --patch -M --find-renames --find-copies-harder --function-context -- *.c
I looked at using Language.C from Hackage, but it seems to want a complete translation unit—expanded headers and all—rather than being able to cope with a source fragment.
The --function-context option is new since version 1.7.8. The foundation of the implementation in v1.7.9.4 is a regex:
PATTERNS("cpp",
/* Jump targets or access declarations */
"!^[ \t]*[A-Za-z_][A-Za-z_0-9]*:.*$\n"
/* C/++ functions/methods at top level */
"^([A-Za-z_][A-Za-z_0-9]*([ \t*]+[A-Za-z_][A-Za-z_0-9]*([ \t]*::[ \t]*[^[:space:]]+)?){1,}[ \t]*\\([^;]*)$\n"
/* compound type at top level */
"^((struct|class|enum)[^;]*)$",
/* -- */
"[a-zA-Z_][a-zA-Z0-9_]*"
"|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
"|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"),
This seems to recognize boundaries reasonably well but doesn’t always leave the function as the first line of the diff hunk, e.g., with #include directives at the top or with a hunk that contains multiple function definitions. An option to tell diff to emit separate hunks for each function changed would be really useful.
This isn’t safety-critical, so I can tolerate some misses. Does that mean I likely have Zawinski’s “two problems”?
I realise this suggestion is a bit tangential, but it may help to clarify and rank the requirements. It would work for C or C++ ...
Instead of trying to find text blocks which are functions and comparing them, use the compiler to make binary blocks. Specifically, for every C/C++ source file in a change set, compile it to an object. Then use the object code as a basis for comparisons.
This might not be feasible for you, but IIRC there is an option on gcc to compile so that each function is compiled to an 'independent chunk' within the generated object code file. The linker can pull each 'chunk' into a program. (It is getting pretty late here, so I will look this up in the morning, if you are interested in the idea. )
So, assuming we can do this, you'll have lots of functions defined by chunks of binary code, so a simple 'heat' comparison is 'how much longer or shorter is the code between versions for any function?'
I am also thinking it might be practical to use objdump to reconstitute the assembler for the functions. I might use some regular expressions at this stage to trim off the register names, so that changes to register allocation don't cause too many false positives.
I might even try to sort the assembler instructions in the function bodies, and diff them to get a pattern of "removed" vs "added" between two function implementations. This would give a measure of change which is pretty much independent of layout, and even somewhat independent of the order of some of the source.
So it might be interesting to see if two alternative implementations of the same function (i.e. from different change sets) are the same instructions :-)
This approach should also work for C++ because all names have been appropriately mangled, which should guarantee the same functions are being compared.
So, the regular expressions might be kept very simple :-)
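To illustrate the "trim the registers, sort, then diff" idea: here is a rough sketch, assuming you have already split the output of objdump -d into one list of instruction lines per function and per version; the regular expressions are only a first guess at what would need stripping for your target.

#include <algorithm>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Normalise one disassembled instruction line: drop the address/byte
// columns and blank out register names and literal operands, so that
// register allocation and code layout don't show up as changes.
// (These regexes are only a starting point.)
std::string normalise(const std::string& line) {
    static const std::regex addr("^\\s*[0-9a-f]+:\\s*([0-9a-f]{2}\\s)*\\s*");
    static const std::regex operand("%[a-z0-9]+|\\$?0x[0-9a-f]+");
    std::string s = std::regex_replace(line, addr, "");
    return std::regex_replace(s, operand, "_");
}

// "Heat" between two versions of one function: how many normalised
// instructions were added plus how many were removed, ignoring order.
int heat(std::vector<std::string> oldAsm, std::vector<std::string> newAsm) {
    for (auto& l : oldAsm) l = normalise(l);
    for (auto& l : newAsm) l = normalise(l);
    std::sort(oldAsm.begin(), oldAsm.end());
    std::sort(newAsm.begin(), newAsm.end());
    std::vector<std::string> removed, added;
    std::set_difference(oldAsm.begin(), oldAsm.end(),
                        newAsm.begin(), newAsm.end(), std::back_inserter(removed));
    std::set_difference(newAsm.begin(), newAsm.end(),
                        oldAsm.begin(), oldAsm.end(), std::back_inserter(added));
    return static_cast<int>(removed.size() + added.size());
}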
Assuming all of this is straightforward, what might this approach fail to give you?
Side note: this basic strategy could work for any language which targets machine code, as well as VM instruction sets like Java VM bytecode, .NET CLR code, etc.
It might be worth considering building a simple parser, using one of the common tools, rather than just using regular expressions. Clearly it is better to choose something you are familiar with, or which your organisation already uses.
For this problem, a parser doesn't actually need to validate the code (I assume it is valid when it is checked in), and it doesn't need to understand the code, so it might be quite dumb.
It might throw away comments (retaining newlines), ignore the contents of text strings, and treat program text in a very simple way. It mainly needs to keep track of balanced '{' '}' and balanced '(' ')'; all the other valid program text is just individual tokens which can be passed 'straight through'.
Its output might be a separate file per function to make tracking easier.
If the language is C or C++, and the developers are reasonably disciplined, they might never use 'non-syntactic macros'. If that is the case, then the files don't need to be preprocessed.
Then the parser is mostly just looking for the function name (an identifier) at file scope followed by ( parameter-list ) { ... code ... }
I'd SWAG it would be a few days' work using yacc & lex / flex & bison, and it might turn out so simple that there is no need for a parser generator at all.
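Something like the following hand-written scan is roughly what I have in mind; it assumes comments and string literals have already been stripped, and the struct and function names are my own. It will miss or misname some constructs (member initialiser lists, trailing const, function-try-blocks), but for a "heat" measure that may be acceptable.

#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct Function {
    std::string name;
    std::string body;   // text between the outermost '{' and '}'
};

// Looks for: identifier ( ... ) { ... } at file scope (brace depth 0).
std::vector<Function> extractFunctions(const std::string& src) {
    std::vector<Function> result;
    int depth = 0;
    std::string lastIdentifier;
    std::size_t i = 0;
    while (i < src.size()) {
        char c = src[i];
        if (std::isalpha(static_cast<unsigned char>(c)) || c == '_') {
            std::string word;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_'))
                word += src[i++];
            if (depth == 0) lastIdentifier = word;   // candidate function name
            continue;
        }
        if (c == '(' && depth == 0) {
            int parens = 0;                          // skip the parameter list
            do {
                if (src[i] == '(') ++parens;
                if (src[i] == ')') --parens;
                ++i;
            } while (i < src.size() && parens > 0);
            // skip whitespace; a '{' next means this was a definition
            while (i < src.size() && std::isspace(static_cast<unsigned char>(src[i]))) ++i;
            if (i < src.size() && src[i] == '{') {
                std::size_t start = ++i;
                int braces = 1;
                while (i < src.size() && braces > 0) {
                    if (src[i] == '{') ++braces;
                    if (src[i] == '}') --braces;
                    ++i;
                }
                result.push_back({lastIdentifier, src.substr(start, i - start - 1)});
            }
            continue;
        }
        if (c == '{') ++depth;
        if (c == '}') --depth;
        ++i;
    }
    return result;
}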
If the code is Java, then ANTLR is a possibility, and I think there was a simple Java parser example.
If Haskell is your focus, there may be student projects published which have made a reasonable stab at a parser.
I'm running an educational website which is teaching programming to kids (12-15 years old).
As they don't all speak English, in the source code of the solutions we are using French variable and function names.
However, we are planning to translate the content into other languages (German, Spanish, English). To do so I would like to translate the source code as quickly as possible.
We mostly have C/C++ code.
The solution I'm planning to use:
extract all variable/function names from the source code, with their positions in the file (where they are declared, used, called...)
remove all language keywords and library functions
ask the translator to provide translations for the remaining names
replace the names in the file
Is there already some open-source code/project that can do that (for points 1, 2 and 4)?
If there isn't, the most difficult point is the first one: using a C/C++ parser to build a syntax tree and then extracting the variables with their positions seems the way to go. Do you have other ideas?
Thank you for any advice.
Edit :
As noted in a comment, I will also need to take care of the comments, but there are only a few of them: the complete solution is already explained in plain text, and then we show the source code with self-explanatory variable/function names. The source code is rarely more than 30-40 lines long, and good names should make it understandable without comments if you already know what the code is doing.
Additional info: for the people interested, the website is a training platform for the International Olympiad in Informatics, and C/C++ (at least the minimum needed for programming contests) is not so difficult for a 12-year-old to learn.
Are you sure you need a full syntax tree for this? I think it would be enough to do lexical analysis to find the identifiers, which is much easier. Then exclude keywords and identifiers that also appear in the header files being included.
In principle it is possible that you want different variables with the same English name to be translated to different words in French/German -- but for educational use the risk of this arising is probably small enough to ignore at first. You could sidestep the issue by writing the original sources with some disambiguating quasi-Hungarian prefixes and then remove these with the same translation mechanism for display to English-speaking end users.
Be sure to let translators see the name they are translating with full context before they choose a translation.
I really think you can use clang (libclang) to parse your sources and do what you want (see here for more information). The good news is that it has Python bindings, which will make your life easier if you want to access a translation service or something like that.
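For what it's worth, here is a minimal sketch against libclang's C API (which the Python bindings wrap); it prints every function, parameter and variable declared in the given file together with its location, and skips declarations pulled in from headers. To also catch the places where the names are used, you would additionally look at CXCursor_DeclRefExpr cursors. Error handling is omitted.

#include <clang-c/Index.h>
#include <cstdio>

// Visitor: print the name and location of every variable, parameter and
// function declaration that appears in the main file.
static CXChildVisitResult visit(CXCursor cursor, CXCursor, CXClientData) {
    CXSourceLocation loc = clang_getCursorLocation(cursor);
    if (!clang_Location_isFromMainFile(loc))
        return CXChildVisit_Continue;        // skip declarations from headers
    CXCursorKind kind = clang_getCursorKind(cursor);
    if (kind == CXCursor_VarDecl || kind == CXCursor_ParmDecl ||
        kind == CXCursor_FunctionDecl) {
        CXString name = clang_getCursorSpelling(cursor);
        CXFile file;
        unsigned line, column;
        clang_getSpellingLocation(loc, &file, &line, &column, nullptr);
        CXString fileName = clang_getFileName(file);
        std::printf("%s declared at %s:%u:%u\n",
                    clang_getCString(name), clang_getCString(fileName), line, column);
        clang_disposeString(fileName);
        clang_disposeString(name);
    }
    return CXChildVisit_Recurse;
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;                  // usage: list_names file.cpp
    CXIndex index = clang_createIndex(0, 0);
    CXTranslationUnit tu = clang_parseTranslationUnit(
        index, argv[1], nullptr, 0, nullptr, 0, CXTranslationUnit_None);
    clang_visitChildren(clang_getTranslationUnitCursor(tu), visit, nullptr);
    clang_disposeTranslationUnit(tu);
    clang_disposeIndex(index);
    return 0;
}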
You don't really need a C/C++ parser, just a simple lexer that gives you the elements of the code one by one. Then you get a lot of {, [, 213, ) etc. that you simply ignore and write to the result file. You translate whatever consists only of letters (except keywords) and put that in the output.
Now that I think about it, it's as simple as this:
#include <istream>
#include <ostream>
#include <string>
using namespace std;

// translate() is whatever lookup you provide (e.g. a table filled in by the
// human translator); it is declared here so the sketch is self-contained.
string translate(const string &name);

bool is_letter(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

bool is_keyword(const string &s)
{
    return s == "if" || s == "else" || s == "void" /* rest of them */;
}

void translateCode(istream &in, ostream &out)
{
    int c;                               // int, so that EOF can be detected
    while ((c = in.get()) != EOF)
    {
        if (is_letter((char)c))
        {
            string name;
            do
            {
                name += (char)c;
                c = in.get();
            } while (c != EOF && is_letter((char)c));
            if (is_keyword(name))
                out << name;             // keywords pass through untranslated
            else
                out << translate(name);  // everything else gets translated
            if (c == EOF)                // the identifier ran to end of input
                break;
        }
        // c is now the non-letter character that ended the identifier (or the
        // non-letter we started with): copy it through unchanged so the
        // output keeps exactly the same layout as the input.
        out << (char)c;
    }
}
I wrote the code in the editor, so there may be minor errors. Tell me if there are any and I'll fix it.
Edit: Explanation:
What the code does is simply read the input character by character, outputting whatever non-letter characters it reads (including spaces, tabs and newlines). If it does see a letter, though, it starts collecting all the following letters into one string (until it reaches another non-letter). Then, if the string is a keyword, it outputs the keyword itself; if it is not, it translates it and outputs the translation.
The output would have the exact same format as the input.
I don't think replacing identifiers in the code is a good idea.
First, you are not going to get decent translations. A very important point here is that translation (especially automatic or pretty dumb translation) loses and distorts information. You may actually end up with something that's worse than the original.
Second, if the code is meant to be compiled again, the compiler may not be able to compile code containing non-English letters in the translated identifiers.
Third, if you replace identifiers with something else, you need to make sure you don't replace 2 or more different identifiers with the same word. That'll either make the code non-compilable or ruin its logic.
Fourth, you must make sure you don't translate reserved words or identifiers coming from the standard library of the language either. Translating those will make the code non-compilable and unreadable. It may not be a trivial task to differentiate the identifiers that the programmer has defined from those provided by the language and its standard library.
What I'd do instead of replacing identifiers with their translations is, provide the translations as comments next to them, for example:
void eat/*comer*/(int* food/*comida*/)
{
    if (*food/*comida*/ <= 0)
    {
        printf("nothing to eat!"/*no hay que comer!*/);
        exit/*salir*/(-1);
    }
    (*food/*comida*/)--;
}
This way you lose no information due to incorrect translation and don't break the code.
I have to use a parser and writer in C++. I am trying to implement the functions; however, I do not understand what a token is. One of my functions/operations is to check whether there are more tokens to produce:
bool Parser::hasMoreTokens()
How exactly do I go about this? Please help.
So: I am opening a text file with text in it; all words are lowercased. How do I go about checking whether it has more tokens?
This is what I have:
bool Parser::hasMoreTokens() {
    while (source.peek() != NULL) {
        return true;
    }
    return false;
}
Tokens are the output of lexical analysis and the input to parsing. Typically they are things like
numbers
variable names
parentheses
arithmetic operators
statement terminators
That is, roughly, the biggest things that can be unambiguously identified by code that just looks at its input one character at a time.
One note, which you should feel free to ignore if it confuses you: The boundary between lexical analysis and parsing is a little fuzzy. For instance:
Some programming languages have complex-number literals that look, say, like 2+3i or 3.2e8-17e6i. If you were parsing such a language, you could make the lexer gobble up a whole complex number and make it into a token; or you could have a simpler lexer and a more complicated parser, and make (say) 3.2e8, -, 17e6i be separate tokens; it would then be the parser's job (or even the code generator's) to notice that what it's got is really a single literal.
In some programming languages, the lexer may not be able to tell whether a given token is a variable name or a type name. (This happens in C, for instance.) But the grammar of the language may distinguish between the two, so that you'd like "variable foo" and "type name foo" to be different tokens. (This also happens in C.) In this case, it may be necessary for some information to be fed back from the parser to the lexer so that it can produce the right sort of token in each case.
So "what exactly is a token?" may not always have a perfectly well defined answer.
A token is whatever you want it to be. Traditionally (and for good reasons), language specifications broke the analysis into two parts: the first part broke the input stream into tokens, and the second parsed the tokens. (Theoretically, I think you can write any grammar in only a single level, without using tokens—or, what is the same thing, using individual characters as tokens. I wouldn't like to see the results of that for a language like C++, however.) But the definition of what a token is depends entirely on the language you are parsing: most languages, for example, treat white space as a separator (but not Fortran); most languages will predefine a set of punctuation/operators using punctuation characters, and not allow these characters in symbols (but not COBOL, where "abc-def" would be a single symbol). In some cases (including in the C++ preprocessor), what is a token depends on context, so you may need some feedback from the parser. (Hopefully not; that sort of thing is for very experienced programmers.)
One thing is probably sure (unless each character is a token): you'll have to read ahead in the stream. You typically can't tell whether there are more tokens by just looking at a single character. I've generally found it useful, in fact, for the tokenizer to read a whole token at a time, and keep it until the parser needs it. A function like hasMoreTokens would in fact scan a complete token.
(And while I'm at it: if source is an istream, istream::peek does not return a pointer, but an int.)
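Concretely, that arrangement might be sketched like this (here a "token" is just a whitespace-delimited word, which happens to match your lowercased text file; a real tokenizer would be more discriminating):

#include <cctype>
#include <istream>
#include <string>

class Parser {
public:
    explicit Parser(std::istream& in) : source(in) { advance(); }

    // True as long as the lookahead slot holds a token that was
    // successfully scanned from the stream.
    bool hasMoreTokens() const { return hasToken; }

    // Hand the current token to the caller and scan the next one.
    std::string nextToken() {
        std::string t = lookahead;
        advance();
        return t;
    }

private:
    // Scan one whole token ahead of time: skip whitespace, then collect
    // the next run of non-whitespace characters.
    void advance() {
        int c = source.get();                 // int, so EOF is representable
        while (c != EOF && std::isspace(c)) c = source.get();
        lookahead.clear();
        while (c != EOF && !std::isspace(c)) {
            lookahead += static_cast<char>(c);
            c = source.get();
        }
        hasToken = !lookahead.empty();
    }

    std::istream& source;
    std::string lookahead;
    bool hasToken = false;
};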
A token is the smallest unit of a programming language that has a meaning. A parenthesis (, a name foo, an integer 123, are all tokens. Reducing a text to a series of tokens is generally the first step of parsing it.
A token is usually akin to a word in spoken language. In C++, int, float, 5.523 and const would all be tokens. It is the minimal unit of text which constitutes a semantic element.
When you split a large unit (long string) into a group of sub-units (smaller strings), each of the sub-units (smaller strings) is referred to as a "token". If there are no more sub-units, then you are done parsing.
How do I tokenize a string in C++?
A token is a terminal in a grammar: a sequence of one or more symbols that is defined by the sequence itself, i.e. it does not derive from any other production defined in the grammar.
I frequently tutor fellow students in programming, most often in C++ or Java.
It is uniquely aggravating to try to verbally convey the essential syntax of a C++ expression. The speaker must give either an idiomatic translation into English, or a full specification of the code in verbal longhand, using explicit yet slow terms such as "opening parenthesis", "bitwise and", et cetera. Neither of these solutions is optimal.
In C++, there is a finite set of keywords—63—and operators—54, discounting named operators and treating compound assignment operators and prefix versus postfix auto-increment and decrement as distinct. There are just a few types of literal, a similar number of grouping symbols, and the semicolon. Unless I'm utterly mistaken, that's about it.
Would it not then be feasible to ascribe a concise, unique pronunciation to each of these distinct concepts (including one for whitespace, where it is required) and go from there? Programming languages are far more regular than natural languages, so the pronunciation could be standardised.
Instead of creating new "words" to describe them, for things such as "include" you could simply prefix it with "keyword" when saying it aloud. You could use words/phrases commonly known to say other parts as well. As with any new programmer, you have to literally describe everything anyway, so I don't think that requires special attention. I think creating new words is the harder method...
So, for example:
#include <iostream>
int main()
{
    if (1 < 2)
        return 1;
    else
        return 0;
}
Could be read out as:
(keyword) include iostream new-line (keyword) int main no params start block if number 1 (operator) less than number 2 new-line (keyword) return number 1 new-line (keyword) else new-line (keyword) return number 0 end block
Treat words in () as optional descriptive words, most likely to be used in more complex code. You could use the word 'literal' if you want them to actually write the descriptive word. For example
(keyword) if literal number (operator) less than literal keyword
becomes
if (number < keyword)
Other words could be given defined meanings as well, such as 'split-line' when you want them to continue on the next line, without closing any currently open parenthesis, etc.
I personally find this method quite simple to use and easy to teach. YMMV, as always.
Of course, this doesn't solve the internationalisation issue, but at worst, would result in 'new words' being used in the non-English languages, which is no worse than the proposed solution you offered.
As a blind developer, programming since I was 13, I found this question really interesting. First of all, as mentioned by other people, learning a new language to be able to understand code is not a practical solution, as it would probably take longer to learn the spoken utterances than it would to learn the actual programming language.
Reading the question/answers, two further points occurred to me:
Firstly, you'd be surprised how important "thinking time" is. I have previously programmed in C/C++/Java and now use C# as my primary language, and consider myself very competent. But when I did a couple of projects in Python, I found the reduced punctuation robbed me of my "thinking time" - subconsciously, I was using the punctuation to digest what I'd just heard - fascinating... However, the situation is a bit different when it comes to identifiers, as these aren't well known by the listener - I personally find it hard to listen to code with acronym variables (RGXRatio, RGVRatio) as I don't have time to figure out what they mean. On the flip side, Hungarian notation and initial underscores make code hard to listen to, as the length of the variables (in terms of time taken to speak them) is much longer than the more important operations being performed on those variables.
Another thing to consider is that the length of the audio stream is an end result, but not the root cause. The reason the audio is so long is that audio is a one-dimensional medium, whereas reading text is a 2D medium with the ability to jump around and skip past irrelevant/familiar text. It wouldn't work for a face-to-face lecture, but what if there were keyboard commands for controlling the speech? In text documents my screen reader lets me jump to the next line, but what if this were adapted to the semantics of a programming language? Some research, such as that by T. V. Raman at Google, includes using different voices for syntax highlighting, and audio cues to mark metadata like capitals.
I know the original question specifically related to a lecture given to a class, but if, like myself, you have to listen to entire files of source code, I also find the structure of the code makes a huge difference. I personally read code like a story - left to right, top to bottom - so it's very hard to trace through unfamiliar code when it's written bottom-up.
So would it not then be feasible to simply ascribe a concise, unique pronunciation to each of these distinct concepts (including one for whitespace, where it is required) and go from there? Programming languages are far more regular than natural languages, so the pronunciation could be standardised
Perhaps, but you've lost sight of your goal. The premise was that the person listening did not already know the language. If he does, we can simply say "include iostream" when we mean #include <iostream>, or "vector of int" when we mean std::vector<int>.
Your premise was that the person listening is not familiar enough with the language to understand what you read out loud unless you read out exactly what it says.
Now, inventing a whole new language just to describe the primitives that occur in your source code doesn't solve the problem. Instead, you still have to read out every syntactic token (with simpler, more "standardized" pronunciations, yes, but they still have to be read out loud), and the person listening still won't understand you, because if they don't know C++ well enough to understand "include iostream", they won't understand your standardized pronunciation either. And if you're going to teach them your pronunciation, why bother, when you could've just taught them to understand C++ syntax directly instead?
There's also the root problem that C++ code tends to consist of a lot of syntactic tokens. Take a line as simple as this:
std::vector<int> v;
I count 8 tokens. Not one of them can be omitted. If the person listening does not understand the code and syntax well enough to understand a high-level description such as "declare a vector of int, named v", then you'll have to read out all 8 tokens in some form. Even if you come up with simpler names than "namespace resolution operator" and "less than sign", you still have to list 8 token names. Which is a lot of work.
In short, no, I don't think it'd work. First, it's still too cumbersome, and second, it's presuming prior knowledge on the part of the person listening, when the motivation for this was that the person listening was a student without the prior knowledge that made it possible to understand a high-level description of the code.