How to build a sentence parser using only the C++ standard library?

I am designing a text-based game similar to Zork, and I would like it to be able to parse a sentence and pick out keywords such as TAKE, DROP, etc. The thing is, I would like to do this all through the C++ standard library... I have heard of external tools (such as flex/bison) that effectively accomplish this; however, I don't want to mess with those just yet.
What I am thinking of implementing is a token-based system that has a list of words the parser can recognize even when they appear in a sentence such as "take sword and kill monster": according to the parser's grammar rules, TAKE, SWORD, KILL and MONSTER would all be recognized as tokens and would produce output such as "Monster killed" or something to that effect. I have heard there is a function in the C++ standard library called strtok that does this, however I have also heard it's "unsafe". So if anyone here could lend a helping hand, I would greatly appreciate it.

The strtok function comes from the C standard library, and it has a few problems: it modifies the string in place, and it keeps internal static state, so it is not reentrant or thread-safe. You should instead look into using the IOStream classes within the C++ Standard Library, as well as the Standard Template Library (STL) containers and algorithms.
Example:
#include <algorithm>
#include <cctype>
#include <iostream>
#include <sstream>
#include <string>

using namespace std;

int main()
{
    string line;
    // grab a line from standard input
    while (getline(cin, line)) {
        // break the input into tokens using a space as the delimiter
        istringstream stream(line);
        string token;
        while (getline(stream, token, ' ')) {
            // convert the token to all caps
            transform(token.begin(), token.end(), token.begin(),
                      [](unsigned char c) { return static_cast<char>(toupper(c)); });
            // print each token on a separate line
            cout << token << endl;
        }
    }
}

Depending on how complicated this language is to parse, you may be able to use the regular expression library from C++ Technical Report 1 (standardized as std::regex in C++11).
If that's not powerful enough, then stringstreams may get you somewhere, but after a point you'll likely decide that a parser generator like Flex/Bison is the most concise way to express your grammar.
You'll need to pick your tool based on the complexity of the sentences you're parsing.
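For the simple command-style sentences in the question, a rough sketch using the C++11 spelling std::regex (TR1 exposes the same facility as std::tr1::regex) might look like this; the word list in the pattern is just an illustrative assumption:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string line = "take sword and kill monster";
    // Pull out the recognised verbs and nouns; everything else is ignored.
    std::regex word_re("take|drop|kill|sword|monster", std::regex::icase);

    for (auto it = std::sregex_iterator(line.begin(), line.end(), word_re);
         it != std::sregex_iterator(); ++it)
        std::cout << it->str() << '\n';   // prints: take, sword, kill, monster
}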

Unless your language is extremely simple, you will want to follow the usual steps of writing a parser.
Write a formal grammar. By formal I don't mean to scare you: write it on a napkin if that sounds less worrisome. I only mean get your grammar right, and don't advance to the next step before you do. For example:
action := ('caress' | 'kill') creature
creature := 'monster' | 'pony' | 'girlfriend'
Write a lexer. The lexer will, given a stream, take one character at a time until it can figure out which token comes next, and return that token. It will consume the characters that constitute that token and leave all other characters in the stream intact. For example, it can read the characters d, r, o and p, figure out that the next token is a DROP token, and return it.
Write a parser. I personally find recursive descent parsers fairly easy to write, because all you have to do is write exactly one function for each of your rules, and that function does exactly what the rule defines. The parser takes one token at a time (by calling the lexer). It knows exactly which token it is about to receive from the lexer (or else knows that the next token is one of a limited set of possible tokens), because it follows the grammar. If it receives an unexpected token, it reports a syntax error. (A minimal sketch follows these steps, below.)
Read the Dragon Book for details. The book talks about writing entire compiler systems, but you can skip the optimization phase and the code generation phase. These don't apply to you here, because you just want to interpret code and run it once, not write an executable which can then be executed to repeatedly run these instructions.
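To make the recursive-descent idea concrete, here is a minimal sketch for the toy grammar above; the Parser class and the whitespace-splitting "lexer" inside it are illustrative assumptions standing in for the character-by-character lexer described earlier:

#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>

class Parser {
public:
    explicit Parser(const std::string& input) : stream_(input) { advance(); }

    // action := ('caress' | 'kill') creature
    void action() {
        if (token_ == "caress" || token_ == "kill") {
            std::string verb = token_;
            advance();
            std::string target = creature();
            std::cout << target << (verb == "kill" ? " killed\n" : " caressed\n");
        } else {
            throw std::runtime_error("syntax error: expected 'caress' or 'kill'");
        }
    }

private:
    // creature := 'monster' | 'pony' | 'girlfriend'
    std::string creature() {
        if (token_ == "monster" || token_ == "pony" || token_ == "girlfriend") {
            std::string name = token_;
            advance();
            return name;
        }
        throw std::runtime_error("syntax error: expected a creature");
    }

    // stand-in lexer: grab the next whitespace-delimited word
    void advance() { if (!(stream_ >> token_)) token_.clear(); }

    std::istringstream stream_;
    std::string token_;
};

int main()
{
    Parser("kill monster").action();   // prints "monster killed"
}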

For a naive implementation using std::string, the std::set container and this tokenization approach (due to Alavoor Vasudevan), you can do this:
#include <iostream>
#include <set>
#include <string>

int main()
{
    /* You match the substrings found in the while loop (tokenization)
       against the ones contained in the dic(tionary) set. If there's a
       match, the substring is printed to the console. */
    std::set<std::string> dic;
    dic.insert("sword");
    dic.insert("kill");
    dic.insert("monster");

    std::string str = "take sword and kill monster";
    std::string delimiters = " ";

    std::string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    std::string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (std::string::npos != pos || std::string::npos != lastPos)
    {
        if (dic.find(str.substr(lastPos, pos - lastPos)) != dic.end())
            std::cout << str.substr(lastPos, pos - lastPos)
                      << " is part of the dic.\n";
        lastPos = str.find_first_not_of(delimiters, pos);
        pos = str.find_first_of(delimiters, lastPos);
    }
    return 0;
}
This will output:
sword is part of the dic.
kill is part of the dic.
monster is part of the dic.
Remarks:
The tokenization delimiter (whitespace) is very (too) simple for natural languages.
You could use some utilities in Boost (split, tokenizer).
If your dictionary (word list) were really big, using the hash-based version of set (unordered_set) could be useful.
With boost tokenizer, it could look like this (this may not be very efficient):

boost::tokenizer<> tok(str);
BOOST_FOREACH(const std::string& word, tok)
{
    if (dic.find(word) != dic.end())
        std::cout << word << " is part of the dic.\n";
}

If you do want to code the parsing yourself, I would strongly recommend using "something like Lex/Yacc". In fact, I strongly recommend Antlr. See my previously accepted answer to a similar question: What language should I use to write a text parser and display the results in a user friendly manner?
However, the best approach is probably to forget C++ altogether - unless you have a burning desire to learn C++; but even then, there are probably better projects on which to cut your teeth.
If what you want is to program a text adventure, then I recommend that you use one of the programming languages specifically designed for that purpose. There are many, see
http://www.brasslantern.org/writers/howto/chooselang.html
http://www.brasslantern.org/editorials/easyif.html
http://www.onlamp.com/pub/a/onlamp/2004/11/24/interactive_fiction.html
or google for "i-f programming language" ("i-f" standing for Interactive Fiction).
You will probably decide on TADS, Inform or Hugo (my personal vote goes to TADS).
You might get good advice if you post to rec.arts.int-fiction explaining what you hope to achieve and giving your level of programming ability.
Have fun!

Related

Expand escape sequences detected by flex

In my scanner.lex file I have this:
{Some rule that matches strings}    return STRING; // STRING is an enum value
In my C++ file I have this:
if (yylex() == STRING) {
    cout << "STRING: " << yytext << endl;
}
Obviously with some logic to take the input from stdin.
Now if this program gets the input "Hello\nWorld", my output is "STRING: Hello\nWorld", while I would want my output to be:
Hello
World
The same goes for other escape sequences such as \", \0, \x<hex_number>, \t, \\... But I'm not sure how to achieve this. I'm not even sure if that's a flex issue or if I can solve it using only C++ tools...
How can I get this done?
As @Some programmer dude mentions in a comment, there is an example of how to do this using start conditions in the Flex documentation. That example puts the escaping rules into a separate start condition; each rule is implemented by appending the unescaped text to a buffer. And that's the way it's normally done.
Of course, you might find an external library which unescapes C-style escaped strings, which you could call on the string returned by flex. But that would be both slower and less flexible than the approach suggested in the Flex manual: slower because it requires a second scan of the string, and less flexible because the library is likely to have its own idea of what escapes to handle.
If you're using C++, you might find it more elegant to modify that example to use a std::string buffer instead of an arbitrary fixed-size character array. You can compile a flex-generated scanner with C++, so there is no problem using C++ standard library objects in your scanner code.
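As a rough sketch (the STR start condition, the strbuf buffer and the STRING token name are assumptions made up for illustration, and the escape rules shown are not exhaustive), the scanner fragment could look something like this:

%x STR

%{
#include <cstdlib>
#include <string>
static std::string strbuf;   // collects the unescaped contents of the literal
%}

%%

\"                    { strbuf.clear(); BEGIN(STR); }

<STR>\"               { BEGIN(INITIAL); return STRING; /* strbuf now holds the text */ }
<STR>\\n              { strbuf += '\n'; }
<STR>\\t              { strbuf += '\t'; }
<STR>\\0              { strbuf += '\0'; }
<STR>\\\"             { strbuf += '"'; }
<STR>\\\\             { strbuf += '\\'; }
<STR>\\x[0-9a-fA-F]+  { strbuf += static_cast<char>(std::strtol(yytext + 2, nullptr, 16)); }
<STR>[^\\\"\n]+       { strbuf.append(yytext, yyleng); }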
Depending on the various semantic value types you are managing, you will probably want to modify the yylex prototype to either use an additional reference parameter or a more structured return type, in order to return the token value to the caller. Note that while it is OK to use yytext before the next call to yylex, it's not generally considered good style since it won't work with most parsers: in general, parsers require the ability to look one or more tokens ahead, and thus yytext is likely to be overwritten by the time your parser needs its value. The flex manual documents the macro hook used to modify the yylex() prototype.
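One hedged way to wire that up (the lval parameter name here is an assumption) is to redefine flex's YY_DECL macro so that yylex takes an output parameter, and copy the buffer into it when the closing quote is seen:

/* in the definitions section of scanner.lex */
%{
#include <string>
#define YY_DECL int yylex(std::string& lval)
%}

/* the closing-quote rule then becomes */
<STR>\"   { BEGIN(INITIAL); lval = strbuf; return STRING; }

// and the C++ driver declares the matching prototype and uses the value
int yylex(std::string& lval);

std::string value;
int tok;
while ((tok = yylex(value)) != 0) {
    if (tok == STRING)
        std::cout << "STRING: " << value << '\n';
}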

Improving efficiency of std::string in a compiler

I'm attempting to build a scanner for a compiler of a C-like language and am getting caught up on an efficient way to generate tokens... I have a scan function:
vector<Token> scan(string &input);
And also a main function, which reads in a lexically correct file and removes comments (the language does not support /* */ comments). I am using a DFA with maximal munch to generate tokens... and I'm pretty sure that part of the scanner is reasonably efficient. However, the scanner does not handle large files well, because they all end up in one string, and concatenating line 1001 onto the first 1000 lines of the file (and so on) is what breaks the scanner. Unfortunately my FSM cannot deal with comments because they are allowed to contain any Unicode and other odd characters. I was wondering... is there a better way to go from a file on stdin to a vector of tokens, keeping in mind that the function scan must take a single string and return a single vector, and all tokens must be in a single vector at the end of scanning? Anyway, here is the code which "scans". Please don't laugh at my bad idea too hard :)
string in = "";
string build;

while (true)
{
    getline(cin, build);
    if (cin.eof())
        break;
    if (build.find("//") != string::npos)
        build = build.substr(0, build.find("//", 0));
    in += " " + build;
}

try {
    vector<Token> wlpp = scan(in);
    ...
    ...
A couple of things that you might want to consider:
in += " " + build;
is very inefficient and probably not what you want in that loop, but that doesn't seem to be where you're running into problems. (At the very least, get some idea of the size of your inputs and do in.reserve(size) before the loop.)
A better design for your scanner might be a class that wraps the input file as an istream_iterator<Token> and implements an appropriate operator>> for Token. If you really wanted it in a vector, you could then do something like vector<Token> v{istream_iterator<Token>(cin), istream_iterator<Token>()}; (braces rather than parentheses, to avoid the most vexing parse) and be done with it. Your operator>> would then just swallow comments and populate a token before returning.
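As a hedged illustration of that design (the Token struct and its whitespace-based operator>> below are stand-ins; a real operator>> would run the maximal-munch DFA instead of splitting on whitespace):

#include <iostream>
#include <iterator>
#include <limits>
#include <string>
#include <vector>

struct Token {
    std::string text;   // a real token would also carry a kind, line number, etc.
};

// Read one token, swallowing "//" comments along the way.
std::istream& operator>>(std::istream& in, Token& tok)
{
    std::string word;
    while (in >> word) {
        std::string::size_type pos = word.find("//");
        if (pos != std::string::npos) {
            word.erase(pos);                                                  // drop the comment part of this word
            in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');    // and the rest of the line
        }
        if (!word.empty()) {
            tok.text = word;
            return in;
        }
    }
    return in;
}

int main()
{
    // Braces avoid the "most vexing parse" you would hit with parentheses here.
    std::vector<Token> tokens{std::istream_iterator<Token>(std::cin),
                              std::istream_iterator<Token>()};
    std::cout << "read " << tokens.size() << " tokens\n";
}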

Counting lines of code

I was doing some research on line counters for C++ projects and I'm very interested in the algorithms they use. Does anyone know where I can look at some implementations of such algorithms?
There's cloc, which is a free, open-source "source lines of code" counter. It has support for many languages, including C++. I personally use it to get the line count of my projects.
At its SourceForge page you can find the Perl source code for download.
Well, if by line counters you mean programs which count lines, then the algorithm is pretty trivial: just count the number of '\n' characters in the code. If, on the other hand, you mean programs which count C++ statements, or produce other metrics... Although not 100% accurate, I've gotten pretty good results in the past just by counting '}' and ';' (ignoring those in comments and string and character literals, of course). Anything more accurate would probably require parsing the actual C++.
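For the trivial "count the newlines" variant, a minimal sketch could be as short as this:

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>

int main(int argc, char* argv[])
{
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " file\n";
        return 1;
    }
    std::ifstream file(argv[1]);
    // Count physical lines by counting newline characters.
    std::cout << std::count(std::istreambuf_iterator<char>(file),
                            std::istreambuf_iterator<char>(), '\n')
              << '\n';
}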
You don't need to actually parse the code to count line numbers, it's enough to tokenise it.
The algorithm could look like this:

int lastLine = -1;
int lines = 0;
for each token {
    if (isCode(token) && lastLine != token.line) {
        ++lines;
        lastLine = token.line;
    }
}
The only information you need to collect during tokenisation is:
what type of token it is (an operator, an identifier, a comment...). You don't actually need to be very precise here, as you only need to distinguish "non-code tokens" (comments) from "code tokens" (anything else)
at which line in the file the token occurs.
As for how to tokenise, that's for you to figure out, but hand-writing a tokeniser for such a simple case shouldn't be hard. You could use flex, but that's probably overkill.
EDIT
I've mentioned "tokenisation", let me describe it for you quickly:
Tokenisation is the first stage of compilation. The input of tokenisation is text (multi-line program), and the output is a sequence of "tokens", as in: symbols with some meaning. For instance, the following program:
#include "something.h"

/*
   This is my program.
   It is quite useless.
*/
int main() {
    return something(2+3); // this is equal to 5
}
could look like:
PreprocessorDirective("include")
StringLiteral("something.h")
PreprocessorDirectiveEnd
MultiLineComment(...)
Keyword(INT)
Identifier("main")
Symbol(LeftParen)
Symbol(RightParen)
Symbol(LeftBrace)
Keyword(RETURN)
Identifier("something")
Symbol(LeftParen)
NumericLiteral(2)
Operator(PLUS)
NumericLiteral(3)
Symbol(RightParen)
Symbol(Semicolon)
SingleLineComment(" this is equal to 5")
Symbol(RightBrace)
Et cetera.
Tokens, depending on their type, may have arbitrary meta-data attached to them (e.g. the symbol type, the operator type, the identifier text, or perhaps the number of the line where the token was found).
Such stream of tokens is then fed to the parser, which uses grammar production rules written in terms of these tokens, for instance, to build a syntax tree.
Writing a full parser that would give you a complete syntax tree of the code is challenging, and especially challenging if it's C++ we're talking about. However, tokenising (or "lexing", or "lexical analysis") is easier, especially when you're not concerned with too many details, and you should be able to write a tokeniser yourself using a finite state machine.
As for how to actually use the output to count lines of code (i.e. lines in which at least one "code" token, that is any token other than a comment, starts) - see the algorithm I described earlier.
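To tie the two parts together, here is a minimal, hedged sketch of such a counter built on a tiny hand-written state machine. It only distinguishes comments, string/character literals and "everything else", counts a line when it contains at least one code character, and deliberately ignores trickier cases such as raw strings and line continuations:

#include <fstream>
#include <iostream>

int count_code_lines(std::istream& in)
{
    int lines = 0;
    bool lineHasCode = false;
    char c;

    auto end_line = [&] {
        if (lineHasCode) ++lines;
        lineHasCode = false;
    };

    while (in.get(c)) {
        if (c == '\n') { end_line(); continue; }
        if (c == ' ' || c == '\t' || c == '\r') continue;

        if (c == '/') {
            int next = in.peek();
            if (next == '/') {                     // line comment: skip to end of line
                while (in.get(c) && c != '\n') {}
                end_line();
                continue;
            }
            if (next == '*') {                     // block comment: skip to "*/"
                in.get(c);                         // consume the '*'
                char prev = '\0';
                while (in.get(c)) {
                    if (c == '\n') end_line();
                    else if (prev == '*' && c == '/') break;
                    prev = c;
                }
                continue;
            }
        }

        lineHasCode = true;                        // any other character counts as code

        if (c == '"' || c == '\'') {               // skip string/char literal, honouring escapes
            char quote = c;
            while (in.get(c) && c != quote) {
                if (c == '\\') in.get(c);          // skip the escaped character
            }
        }
    }
    end_line();                                    // the last line may lack a trailing '\n'
    return lines;
}

int main(int argc, char* argv[])
{
    if (argc > 1) {
        std::ifstream file(argv[1]);
        std::cout << count_code_lines(file) << " lines of code\n";
    } else {
        std::cout << count_code_lines(std::cin) << " lines of code\n";
    }
}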
I think part of the reason people are having so much trouble understanding your problem is that "count the lines of C++" is itself an algorithm. Perhaps what you're trying to ask is "How do I identify a line of C++ in a file?" That is an entirely different question, which Kos seems to have done a pretty good job of trying to explain.

How to clear comments and intermediate whitespace in a string containing a C function declaration?

In my program, written in C++, I need to take a set of strings, each containing the declaration of a C function, and perform a number of operations on them.
One of the operations is to compare whether one function is equal to another. To do that I plan to just prune away comments and intermediate whitespace which has no effect on the semantics of the function and then do a string comparison. However, I would like to retain whitespace within a string as removing that would change the output produced by the function.
I could write some code which iterates over the string characters and enters "string mode" whenever a quote (") is encountered, and recognizes escaped quotes, but I wonder if there is a better way of doing this. One idea is to use a full-fledged C parser, run it over the function string, ignore all comments and excessive whitespace, and then convert the AST back to a string again. But looking around at some C parsers I get the feeling that most are a bitch to integrate with my source code (prove me wrong if I am). Perhaps I could try to use yacc or something with an existing C grammar and implement the parser myself...
So, any ideas on the best way to do this?
EDIT:
The program I'm writing takes an abstract model and converts it into C code. The model consists of a graph, where the nodes may or may not contain segments of C code (more precisely, a C function definition where its execution must be completely deterministic (i.e. no global state) and no memory operations are allowed). The program does pattern matching on the graph and merges and splits certain nodes who adhere to these patterns. However, these operations can only be performed if the nodes exhibit the same functionality (i.e. if their C function definitions are the same). This "checking that they are the same" will be done by simply comparing the strings which contain the C function declarations. If they are character-by-character identical, then they are equal.
Due to the nature of how the models are generated, this is quite a reasonable method of comparison, provided that the comments and excess whitespace are removed, as these are the only factors that may differ. This is the problem I'm facing - how to do this with the minimal amount of implementation effort?
What do you mean by comparing whether one function is equal to another? With a suitably precise meaning, that problem is known to be undecidable!
You did not tell us what your program is really doing. Parsing all real C programs correctly is not trivial (because the C language's syntax and semantics are not that simple!).
Did you consider using existing tools or libraries to help you? LLVM Clang is a possibility, or extending GCC through plugins, or even better with extensions coded in MELT.
But we cannot help you more without understanding your real goal. And parsing C code is probably more complex than what you imagine.
It looks like you can get away with a simple island grammar that strips comments, keeps string literals intact, and collapses whitespace (tabs, '\n'). Since I'm working with AXE†, I wrote a quick grammar‡ for you. You can write a similar set of rules using Boost.Spirit.
#include <axe.h>
#include <string>

template<class I>
std::string clean_text(I i1, I i2)
{
    // rules for non-recursive comments, and no line continuation
    auto endl = axe::r_lit('\n');
    auto c_comment = "/*" & axe::r_find(axe::r_lit("*/"));
    auto cpp_comment = "//" & axe::r_find(endl);
    auto comment = c_comment | cpp_comment;

    // rules for string literals
    auto esc_backslash = axe::r_lit("\\\\");
    auto esc_quote = axe::r_lit("\\\"");
    auto string_literal = '"' & *(*(axe::r_any() - esc_backslash - esc_quote)
        & *(esc_backslash | esc_quote)) & '"';

    auto space = axe::r_any(" \t\n");
    auto dont_care = *(axe::r_any() - comment - string_literal - space);

    std::string result;

    // semantic actions
    // append everything matched
    auto append_all = axe::e_ref([&](I i1, I i2) { if(i1 != i2) result += std::string(i1, i2); });
    // append a single space
    auto append_space = axe::e_ref([&](I i1, I i2) { if(i1 != i2) result += ' '; });

    // island grammar for text
    auto text = *(dont_care >> append_all
        & *comment
        & *string_literal >> append_all
        & *(space % comment) >> append_space)
        & axe::r_end();

    if(text(i1, i2).matched)
        return result;
    else
        throw "error";
}
So now you can do the text cleaning:
std::string text; // this is your function
text = clean_text(text.begin(), text.end());
You might also need to create rules for superfluous ';', empty blocks {}, and the like. You might also need to merge string literals. How far you need to go depends on the way the functions were generated; you may end up writing a sizable portion of the C grammar.
† AXE library is soon to be released under boost license.
‡ I didn't test the code.
Perhaps your C functions that you want to parse are not as general (in their textual form, and also as parsed by a real compiler) as we are guessing.
You might consider doing things the other way round:
It could make sense to define a small domain-specific language (it could have a syntax much simpler to parse than C) and, instead of parsing C code, do it the other way around: the user would write in your DSL, and your tool would generate C code from it (to be compiled at a later stage by your usual C compiler).
Your DSL could actually be the description of your abstract model mixed with more procedural parts which are translated to C functions. Since the C functions you care about are quite specific, the DSL generating them could be small.
(Note that many parser generators like ANTLR, YACC or Bison are built on a similar idea.)
I actually did something quite similar in MELT (read notably my DSL2011 paper). You might find some useful tricks about designing a DSL translated to C.

Translating source code into a foreign language

I'm running an educational website which is teaching programming to kids (12-15 years old).
As they don't all speak English, we are using French variable and function names in the source code of the solutions.
However, we are planning to translate the content into other languages (German, Spanish, English). To do so I would like to translate the source code as quickly as possible.
We mostly have C/C++ code.
The solution I'm planning to use :
extract all variables/functions names from the source-code, with their position in the file (where they are declared, used, called...)
remove all language keywords and library functions
ask the translator to provide translations for the remaining names
replace the names in the file
Is there already some open-source code/project that can do that ? (For the points 1,2 and 4)
If there isn't, the most difficult point is the first one: using a C/C++ parser to build a syntax tree and then extracting the variables with their positions seems the way to go. Do you have other ideas?
Thank you for any advice.
Edit :
As noted in a comment, I will also need to take care of the comments, but there are only a few of them: the complete solution is already explained in plain text, and then we show the source code with self-explanatory variable/function names. The source code is rarely more than 30-40 lines long, and good names should make it understandable without comments if you already know what the code is doing.
Additional info: for the people interested, the website is a training platform for the International Olympiads in Informatics, and C/C++ (at least the minimum needed for programming contests) is not so difficult to learn for a 12-year-old.
Are you sure you need a full syntax tree for this? I think it would be enough to do lexical analysis to find the identifiers, which is much easier. Then exclude keywords and identifiers that also appear in the header files being included.
In principle it is possible that you want different variables with the same English name to be translated to different words in French/German -- but for educational use the risk of this arising is probably small enough to ignore at first. You could sidestep the issue by writing the original sources with some disambiguating quasi-Hungarian prefixes and then remove these with the same translation mechanism for display to English-speaking end users.
Be sure to let translators see the name they are translating with full context before they choose a translation.
I really think you can use clang (libclang) to parse your sources and do what you want (see here for more information), the good news is that they have python bindings, which will make your life easier if you want to access a translation service or something like that.
You don't really need a C/C++ parser, just a simple lexer that gives you the elements of the code one by one. Then you get a lot of {, [, 213, ), etc. that you simply ignore and write to the result file. You translate whatever consists only of letters (except keywords) and put that in the output.
Now that I think about it, it's as simple as this:
bool is_letter(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

bool is_keyword(const string &s)
{
    return s == "if" || s == "else" || s == "void" /* rest of them */;
}

void translateCode(istream &in, ostream &out)
{
    int c;
    while ((c = in.get()) != EOF)
    {
        if (is_letter(static_cast<char>(c)))
        {
            string name;
            do
            {
                name += static_cast<char>(c);
                c = in.get();
            } while (c != EOF && is_letter(static_cast<char>(c)));
            if (is_keyword(name))
                out << name;
            else
                out << translate(name);
            if (c == EOF)
                break;
        }
        out << static_cast<char>(c); // even if is_letter was true above, the character
                                     // that terminated the identifier has not been
                                     // written yet, so it is written here
    }
}
I wrote the code in the editor, so there may be minor errors. Tell me if there are any and I'll fix it.
Edit: Explanation:
What the code does is simply read the input character by character, outputting whatever non-letter characters it reads (including spaces, tabs and newlines). If it does see a letter, though, it will start putting all the following letters into one string (until it reaches another non-letter). Then, if the string was a keyword, it outputs the keyword itself; if it was not, it translates it and outputs the translation.
The output would have the exact same format as the input.
I don't think replacing identifiers in the code is a good idea.
First, you are not going to get decent translations. A very important point here is that translation (especially automatic or pretty dumb translation) loses and distorts information. You may actually end up with something that's worse than the original.
Second, if the code is meant to be compiled again, the compiler may not be able to compile code containing non-English letters in the translated identifiers.
Third, if you replace identifiers with something else, you need to make sure you don't replace 2 or more different identifiers with the same word. That'll either make the code non-compilable or ruin its logic.
Fourth, you must make sure you don't translate reserved words or identifiers coming from the standard library of the language either. Translating those will make the code non-compilable and unreadable. It may not be a trivial task to differentiate the identifiers that the programmer has defined from those provided by the language and its standard library.
What I'd do instead of replacing identifiers with their translations is, provide the translations as comments next to them, for example:
void eat/*comer*/(int* food/*comida*/)
{
    if (*food/*comida*/ <= 0)
    {
        printf("nothing to eat!"/*no hay que comer!*/);
        exit/*salir*/(-1);
    }
    (*food/*comida*/)--;
}
This way you lose no information due to incorrect translation and don't break the code.