Improving efficiency of std::string in a compiler - c++

I'm attempting to build a scanner for a compiler of a C-like language and am getting caught up on an efficient way to generate tokens... I have a scan function:
vector<Token> scan(string &input);
And also a main function, which reads in a lexically correct file and removes comments. (The language does not support /* */ comments.) I am using a DFA with maximal munch to generate tokens, and I'm pretty sure that part of the scanner is reasonably efficient. However, the scanner does not handle large files well, because they all end up in one string, and all of the concatenation of 1000 lines of a file with line 1001 is what breaks the scanner. Unfortunately my FSM cannot deal with comments, because they are allowed to contain any Unicode and other odd characters. I was wondering: is there a better way to go from a file in stdin to a vector of tokens, keeping in mind that scan must take a single string and return a single vector, and all tokens must be in a single vector at the end of scanning? Anyway, here is the code which "scans". Please don't laugh at my bad idea too hard :)
string in = "";
string build;
while(true)
{
    getline(cin, build);
    if( cin.eof() )
        break;
    if(build.find("//") != string::npos)
        build = build.substr(0, build.find("//", 0));
    in += " " + build;
}
try {
    vector<Token> wlpp = scan(in);
    ...
    ...

A couple of things that you might want to consider:
in += " " + build;
Is very inefficient and probably not what you want in that loop, but that doesn't seem to be where you're running into problems. (At the very least, get some idea of the size of your inputs and do in.reserve(size) before that.)
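For illustration, here is a minimal sketch of that read loop with the temporaries removed; the reserve size is a placeholder guess, not a measured figure:
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string in;
    in.reserve(1 << 20); // one up-front allocation; tune to your typical input size
    string build;
    while (getline(cin, build))
    {
        string::size_type pos = build.find("//");
        if (pos != string::npos)
            build.erase(pos); // strip the comment in place, no substring copy
        in += ' ';  // append directly instead of building a temporary " " + build
        in += build;
    }
    // ... hand `in` to scan() as before
}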
The better design for your scanner might be as a class that wraps the input file as an istream_iterator<Token> and implements an appropriate operator>> for Token. If you really wanted it in a vector, you could then do something like vector<Token> v((istream_iterator<Token>(cin)), istream_iterator<Token>()); (the extra parentheses avoid the most vexing parse) and be done with it. Your operator>> would then just swallow comments and populate a token before returning.
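A minimal sketch of that design might look like the following. The Token layout and the whitespace-based extraction are placeholders for your real DFA:
#include <iostream>
#include <iterator>
#include <limits>
#include <string>
#include <vector>

struct Token {
    std::string text; // placeholder: a real Token would carry a kind, lexeme, position, ...
};

std::istream& operator>>(std::istream& is, Token& t)
{
    if (!(is >> t.text))        // whitespace-delimited read stands in for the DFA
        return is;
    if (t.text.compare(0, 2, "//") == 0) { // swallow a "//" comment to end of line
        is.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
        return is >> t;                    // then try again on the next line
    }
    return is;
}

int main()
{
    std::vector<Token> v((std::istream_iterator<Token>(std::cin)),
                         std::istream_iterator<Token>());
    // v now holds every token from stdin, comments skipped
}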

Why do we need to use a stringstream when splitting a string?

Please note, I've never used streams before today, so my understanding of them remains rather vague. Apologies if I say something appallingly stupid.
Here I have a short bit of code that splits up a stringstream into a bunch of strings at each space.
vector<string> words;
stringstream ss("some random words that I wrote just now");
string word;
while(getline(ss, word, ' ')){
    words.push_back(word);
}
I'm wondering why we're using a stringstream here, rather than just a string.
This would work like:
1. Create a string object at memory location x.
2. When referenced, go through each character and check if it is a space. All previous characters should be saved somewhere temporary.
3. If it is a space, grab all the stuff we've just stored and stick it on the end of the vector, then clear the temporary storage. If it's not a space, go back to step 2.
What's storing "some random words that I wrote just now" as a stringstream going to do to help us here?
Is it just making a stream of characters so that we can check through them? Is this necessary? Are we always doing this, even in other languages?
I'm wondering why we're using a stringstream here, rather than just a string.
If this is the question, then one big reason why stringstream is used is simply that it works with little effort by the programmer. The less code you write, the less chance for bugs to occur.
Your method of using just std::string and searching for spaces requires the C++ programmer to write all of those steps (create a string, manually search for spaces, etc). It may be trivial to write, but even the best programmers can make mistakes in trivial code. The code may have bugs, may not cover all of the corner cases, etc.
As to ease of use:
When a C++ programmer sees stringstream used to separate a string on whitespace, the purpose of the code is immediately known.
If, on the other hand, a programmer decides to manually parse the data by using just string and searching for spaces, it is not immediately obvious what the code does when another programmer reads it. Sure, the other programmer may figure it out quickly, but I can bet they will ask "why didn't you use stringstream?".
What's storing "some random words that I wrote just now" as a stringstream going to do to help us here? Is it just making a stream of characters so that we can check through them? Is this necessary?
std::stringstream just allows you to use the usual input/output operations such as >> and std::getline on a string. You can't use std::getline to read directly from a std::string, so you put the string in a std::stringstream first. You can totally parse a string by looping over the characters yourself, as you described.
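For comparison, a hand-rolled version using only std::string::find might look like this sketch; note how much index bookkeeping the stringstream version spares you:
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::string line = "some random words that I wrote just now";
    std::vector<std::string> words;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type space = line.find(' ', start);
        if (space == std::string::npos) {
            words.push_back(line.substr(start)); // last word: no space after it
            break;
        }
        words.push_back(line.substr(start, space - start));
        start = space + 1; // resume just past the space
    }
    for (std::vector<std::string>::size_type i = 0; i < words.size(); ++i)
        std::cout << words[i] << '\n';
}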
Are we always doing this, even in other languages?
Not in Python at least. There you would just do words = line.split(' ').

Why on earth is my file reading function placing null-terminators where excess CR LF carriages should be?

Today I tried to put together a simple OpenGL shader class, one that loads text from a file, does a little bit of parsing to build a pair of vertex and fragment shaders according to some (pretty sweet) custom syntax (for example, writing ".varying [type] [name];" would allow you to define a varying variable in both shaders while only writing it once, same with ".version"), then compiles an OpenGL shader program using the two, and finally marks the shader class as 'ready' if and only if the shader code compiled correctly.
Now, I did all this, but then encountered the most bizarre (and frankly kinda scary) problems. I set everything up, declared a new 'tt::Shader' with some file containing valid shader code, only to have it tell me that the shader was invalid but then give me an empty string when I asked what the error was (which means OpenGL gave me an empty string as that's where it gets it from.)
I tried again, this time with obviously invalid shader code, and while it identified that the shader was invalid, it still gave me nothing in terms of what the error was, just an empty string (from which I assumed that the error identification portion was behaving just the same as before).
Confused, I re-wrote both shaders, the valid and invalid one, by hand as a string, compiling the classes again with the string directly, with no file access. Doing this, the error vanished, the first one compiled correctly, and the second one failed but correctly identified what the error was.
Even more confused, I started comparing the strings from the files to those I wrote myself. Turns out the former were a tad longer than the latter, despite printing the same. After doing a bit of counting, I realised that these extra characters must be Windows CR LF line-ending carriage characters that got cut off in the importing process.
To test this, I took the hand-written strings, inserted carriages where they would be cut off, and ran my string comparison tests again. This time, it evaluated their lengths to be the same, but also told me that the two were still not equal, which was quite puzzling.
So, I wrote a simple for-loop to iterate through the characters of the two strings and print them each next to one another, cast to integers so I could see their numeric values. I ran the program, looked through the (quite lengthy) list, and came to a very insightful though even less clarifying answer: the hidden characters were in the right places, but they weren't carriages ... they were null-terminators!
Here's the code for the file reading function I'm using. It's nothing fancy, just standard library stuff.
// Attempts to read the file with the given path, returning a string of its contents.
// If the file could not be found and read, an empty string will be returned.
// File strings are built by reading the file line by line and assembling a single
// string with newlines placed between the lines.
// Given this line-by-line method, take note that it will copy no more than 4096
// bytes from a single line before moving on.
inline std::string fileRead(const std::string& path) {
    if (!tt::fileExists(path))
        return "";
    std::ifstream a;
    a.open(path);
    std::string r;
    const tt::uint32 _LIMIT = 4096;
    char r0[_LIMIT];
    tt::uint32 i = 0;
    while (a.good()) {
        a.getline(r0, _LIMIT);
        if (i > 0)
            r += "\n";
        i++;
        r += std::string(r0, static_cast<tt::uint32>(a.gcount()));
    }
    // TODO: Ask StackOverflow why on earth our file reading function is placing
    // null characters where excess carriages should go.
    for (tt::uint32 i = 0; i < r.length(); i++)
        if (r[i] == '\0')
            r[i] = '\r';
    a.close();
    tt::printL("Reading file '" + path + "' ...");
    return r;
}
If y'all could take a read and tell me what the hell is going on with it, that'd be awesome, as I'm at a total loss for what it's doing to my string to cause this.
Lastly, I do get why the null-terminators didn't show up for me but did for OpenGL: the latter was using C-strings, while I was doing everything with std::string objects, which store things based on length, given that they're pretty much just fancy std::vector objects.
Read the documentation for the std::string constructor. The constructor std::string(const char*, size_t n) creates a string of size n regardless of input. It may contain a null character inside, or even more than one. Note that the size of a std::string doesn't include the terminating null character (so str[str.size()] == '\0' always).
So clearly the code simply copies the null character from the output buffer of the getline function.
Why would it do that? Go to the gcount() function documentation: it returns the number of characters extracted by the last operation. That is, it includes the extracted delimiter \n, which is replaced in the output by \0. Voilà: exactly one character more than the constructor should be asked for.
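A quick demonstration of that constructor behavior; the buffer contents here simulate what getline leaves behind after reading "hello" followed by a newline:
#include <iostream>
#include <string>

int main()
{
    // getline stored "hello", replaced the '\n' with '\0', and gcount() reported 6
    char buf[] = { 'h', 'e', 'l', 'l', 'o', '\0' };
    std::string s(buf, 6);
    std::cout << s.size() << '\n';       // prints 6, not 5
    std::cout << (s[5] == '\0') << '\n'; // prints 1: the null is *inside* the string
}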
So to fix it, simply replace that line with:
r += std::string(r0, static_cast<tt::uint32>(a.gcount()-1));
Or you could've simply used getline() with std::string as input instead of a buffer - and none of this would've happened.
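For reference, a sketch of that simpler version; tt::fileExists and tt::printL from the original are dropped here to keep the example self-contained:
#include <fstream>
#include <string>

inline std::string fileRead(const std::string& path)
{
    std::ifstream a(path.c_str());
    if (!a)
        return "";
    std::string r, line;
    bool first = true;
    while (std::getline(a, line)) { // no fixed-size buffer, no gcount() bookkeeping
        if (!first)
            r += '\n';
        first = false;
        r += line; // the delimiter is consumed but never stored, so no stray '\0'
    }
    return r;
}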

Why does the VS2008 std::string.erase() move its buffer?

I want to read a file line by line and capture one particular line of input. For maximum performance I could do this in a low-level way by reading the entire file in and just iterating over its contents using pointers, but this code is not performance critical, so I wish to use a more readable and typesafe std library style implementation.
So what I have is this:
std::string line;
line.reserve(1024);
std::ifstream file(filePath);
while(file)
{
    std::getline(file, line);
    if(line.substr(0, 8) == "Whatever")
    {
        // Do something ...
    }
}
While this isn't performance critical code I've called line.reserve(1024) before the parsing operation to preclude multiple reallocations of the string as larger lines are read in.
Inside std::getline the string is erased before having the characters from each line added to it. I stepped through this code to satisfy myself that the memory wasn't being reallocated each iteration; what I found fried my brain.
Deep inside string::erase rather than just resetting its size variable to zero what it's actually doing is calling memmove_s with pointer values that would overwrite the used part of the buffer with the unused part of the buffer immediately following it, except that memmove_s is being called with a count argument of zero, i.e. requesting a move of zero bytes.
Questions:
Why would I want the overhead of a library function call in the middle of my lovely loop, especially one that is being called to do nothing at all?
I haven't picked it apart myself yet, but under what circumstances would this call actually do something, rather than nothing, and start moving chunks of buffer around?
And why is it doing this at all?
Bonus question: What's with the C++ standard library tag?
This is a known issue I reported a year ago; to take advantage of the fix you'll have to upgrade to a future version of the compiler.
Connect Bug: "std::string::erase is stupidly slow when erasing to the end, which impacts std::string::resize"
The standard doesn't say anything about the complexity of any std::string functions, except swap.
std::string::clear() is defined in terms of std::string::erase(), and std::string::erase() does have to move all of the characters after the block which was erased. So why shouldn't it call a standard function to do so? If you've got some profiler output which proves that this is a bottleneck, then perhaps you can complain about it, but otherwise, frankly, I can't see it making a difference. (The logic necessary to avoid the call could end up costing more than the call.)
Also, you're not checking the result of the call to getline before using it. Your loop should be something like:
while ( std::getline( file, line ) ) {
    // ...
}
And if you're so worried about performance, creating a substring (a new std::string) just in order to do a comparison is far more expensive than a call to memmove_s. What's wrong with something like:
static std::string const target( "Whatever" );
if ( line.size() >= target.size()
     && std::equal( target.begin(), target.end(), line.begin() ) ) {
    // ...
}
I'd consider this the most idiomatic way of determining whether a string starts with a specific value.
(I might add that, from experience, the reserve here doesn't buy you much either. After you've read a couple of lines of the file, your string isn't going to grow much anyway, so there'll be very few reallocations after the first couple of lines. Another case of premature optimization?)
In this case, I think the idea you mention of reading the entire file and iterating over the result may actually give code that is about as simple. You're simply changing "read line, check for prefix, process" to "read file, scan for prefix, process":
size_t not_found = std::string::npos;
std::stringstream buffer;        // istringstream has no operator<<; use a stringstream
buffer << file.rdbuf();
std::string data = buffer.str(); // str() returns by value, so keep a copy
char const target[] = "\nWhatever";
size_t len = sizeof(target) - 1;
for (size_t pos = 0; not_found != (pos = data.find(target, pos)); pos += len)
{
    // process relevant line starting at data[pos+1]
}

Tokenizer efficiency question

I'm writing a compiler front end for a project and I'm trying to understand the best method of tokenizing the source code.
I can't choose between two ways:
1) the tokenizer reads all tokens:
bool Parser::ReadAllTokens()
{
    Token token;
    while( m_Lexer->ReadToken( &token ) )
    {
        m_Tokens->push_back( token );
        token.Reset(); // reset the token values..
    }
    return !m_Tokens->empty();
}
and then the parsing phase begins, operating on the m_Tokens list. In this way the methods getNextToken(), peekNextToken() and ungetToken() are relatively easy to implement with an iterator, and the parsing code is well written and clear, not broken up by token reading, e.g.:
getNextToken();
useToken();
getNextToken();
peekNextToken();
if( peeked is something )
    ungetToken();
..
..
2) the parsing phase begins, and when a token is needed, it is created and used (the code seems not so clear)
What's the best method, and why? And what about efficiency?
Thanks in advance for the answers.
Traditionally, compiler construction classes teach you to read tokens one by one, as you parse. The reason for that is that back in the day, memory resources were scarce. You had kilobytes at your disposal, not gigabytes as you do today.
Having said that, I don't mean to recommend that you read all tokens in advance and then parse from your list of tokens. Input is of arbitrary size. If you hog too much memory, the system will become slow. Since it looks like you only need one token of lookahead, I'd read one at a time from the input stream. The operating system will buffer and cache the input stream for you, so it'll be fast enough for most purposes.
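As a sketch of that shape, here is a one-token-lookahead wrapper; Lexer and Token are the asker's types from the code above, and only this wrapper is new:
class TokenSource
{
public:
    explicit TokenSource(Lexer* lexer) : m_Lexer(lexer), m_HasPeeked(false) {}

    // Look at the next token without consuming it.
    bool Peek(Token* out)
    {
        if (!m_HasPeeked)
            m_HasPeeked = m_Lexer->ReadToken(&m_Peeked);
        if (m_HasPeeked)
            *out = m_Peeked;
        return m_HasPeeked;
    }

    // Consume the next token.
    bool Get(Token* out)
    {
        if (m_HasPeeked) {
            *out = m_Peeked;
            m_HasPeeked = false;
            return true;
        }
        return m_Lexer->ReadToken(out);
    }

private:
    Lexer* m_Lexer;
    Token  m_Peeked;
    bool   m_HasPeeked;
};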
It would be better to use something like Boost::Spirit to tokenise. Why reinvent the wheel?
Your method (1) is generally overkill: it is not required to tokenize an entire file prior to parsing it.
A good way to go is to implement a buffered tokenizer, which stores in a list the tokens that were peeked or ungot, and which consumes elements of this list on "get", or reads tokens from the file when the list is empty (à la FILE*).
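A sketch of that buffered shape, again reusing the asker's Lexer and Token types; std::deque serves as the pushback list:
#include <deque>

class BufferedTokenizer
{
public:
    explicit BufferedTokenizer(Lexer* lexer) : m_Lexer(lexer) {}

    // Serve pushed-back tokens first; read from the lexer when the list is empty.
    bool Get(Token* out)
    {
        if (!m_Buffer.empty()) {
            *out = m_Buffer.front();
            m_Buffer.pop_front();
            return true;
        }
        return m_Lexer->ReadToken(out);
    }

    // Push a token back, FILE*-ungetc style; may be called repeatedly.
    void Unget(const Token& token)
    {
        m_Buffer.push_front(token);
    }

private:
    Lexer* m_Lexer;
    std::deque<Token> m_Buffer;
};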
The first method is better, as you can then also still understand the code 3 months later...

How to build a sentence parser using only the C++ standard library?

I am designing a text-based game similar to Zork, and I would like it to be able to parse a sentence and draw out keywords such as TAKE, DROP, etc. The thing is, I would like to do this all through the standard C++ library... I have heard of external libraries (such as flex/bison) that effectively accomplish this; however, I don't want to mess with those just yet.
What I am thinking of implementing is a token-based system that has a list of words that the parser can recognize even if they are in a sentence such as "take sword and kill monster", so that, according to the parser's grammar rules, TAKE, SWORD, KILL and MONSTER are all recognized as tokens and would produce the output "Monster killed" or something to that effect. I have heard there is a function in the c++ standard library called strtok that does this; however, I have also heard it's "unsafe". So if anyone here could lend a helping hand, I would greatly appreciate it.
The strtok function is from the C standard library, and it has a few problems. For example, it modifies the string in place and could cause security problems due to buffer overflows. You should instead look into using the IOStream classes within the C++ Standard Library, as well as the Standard Template Library (STL) containers and algorithms.
Example:
#include <algorithm>
#include <cctype>
#include <iostream>
#include <sstream>
#include <string>
using namespace std;

int main()
{
    string line;
    // grab a line from standard input
    while (getline(cin, line)) {
        // break the input into tokens using a space as the delimiter
        istringstream stream(line);
        string token;
        while (getline(stream, token, ' ')) {
            // convert string to all caps
            transform(token.begin(), token.end(), token.begin(), (int(*)(int)) toupper);
            // print each token on a separate line
            cout << token << endl;
        }
    }
}
Depending on how complicated this language is to parse, you may be able to use the regular expression library from C++ Technical Report 1.
If that's not powerful enough, then stringstreams may get you somewhere, but after a point you'll likely decide that a parser generator like Flex/Bison is the most concise way to express your grammar.
You'll need to pick your tool based on the complexity of the sentences you're parsing.
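If the regular-expression route is enough for your sentences, a keyword grab can be quite small. This sketch uses std::regex, the C++11 descendant of the TR1 library mentioned above; the keyword list is illustrative:
#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string input = "take sword and kill monster";
    std::regex keywords("\\b(take|drop|kill|sword|monster)\\b");
    for (std::sregex_iterator it(input.begin(), input.end(), keywords), end;
         it != end; ++it)
        std::cout << it->str() << '\n'; // take, sword, kill, monster
}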
Unless your language is extremely simple, you want to follow the usual steps of writing a parser.
Write a formal grammar. By formal I don't mean to scare you: write it on a napkin if that sounds less worrisome. I only mean get your grammar right, and don't advance to the next step before you do. For example:
action := ('caress' | 'kill') creature
creature := 'monster' | 'pony' | 'girlfriend'
Write a lexer. The lexer will, given a stream, take one character at a time until it can figure out which token is next, and return that token. It will discard the characters that constitute that token and leave all other characters in the stream intact. For example, it can get the character d, then r, then o and p, figure the next token is a DROP token and return that.
Write a parser. I personally find recursive descent parsers fairly easy to write, because all you have to do is write exactly one function for each of your rules, which does exactly what the rule defines (a minimal sketch follows this list). The parser takes one token at a time (by calling the lexer). It knows exactly what token it is about to receive from the lexer (or else knows that the next token is one of a limited set of possible tokens), because it follows the grammar. If it receives an unexpected token, it reports a syntax error.
Read the Dragon Book for details. The book talks about writing entire compiler systems, but you can skip the optimization phase and the code generation phase. These don't apply to you here, because you just want to interpret code and run it once, not write an executable which can then be executed to repeatedly run these instructions.
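To make the steps concrete, here is a minimal recursive descent parser for the napkin grammar above. The whitespace-split "lexer" and the output messages are simplifications, not a prescription:
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>

// One function per grammar rule:
//   action   := ('caress' | 'kill') creature
//   creature := 'monster' | 'pony' | 'girlfriend'
class Parser
{
public:
    explicit Parser(std::istream& in) : m_In(in) {}

    void Action()
    {
        std::string verb = Next();
        if (verb != "caress" && verb != "kill")
            throw std::runtime_error("expected a verb, got '" + verb + "'");
        std::string creature = Creature();
        std::cout << creature << (verb == "kill" ? " killed\n" : " caressed\n");
    }

private:
    std::string Creature()
    {
        std::string word = Next();
        if (word != "monster" && word != "pony" && word != "girlfriend")
            throw std::runtime_error("expected a creature, got '" + word + "'");
        return word;
    }

    std::string Next() // the "lexer": whitespace-split words stand in for real tokens
    {
        std::string token;
        if (!(m_In >> token))
            throw std::runtime_error("unexpected end of input");
        return token;
    }

    std::istream& m_In;
};

int main()
{
    std::istringstream input("kill monster");
    try {
        Parser(input).Action(); // prints "monster killed"
    } catch (const std::exception& e) {
        std::cerr << "syntax error: " << e.what() << '\n';
    }
}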
For a naive implementation using std::string, the std::set container and this tokenization function (Alavoor Vasudevan), you can do this:
#include <iostream>
#include <set>
#include <string>

int main()
{
    /* Match each substring found in the while loop (tokenization) against
       the ones contained in the dic(tionary) set. If there's a match,
       the substring is printed to the console. */
    std::set<std::string> dic;
    dic.insert("sword");
    dic.insert("kill");
    dic.insert("monster");

    std::string str = "take sword and kill monster";
    std::string delimiters = " ";
    std::string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    std::string::size_type pos = str.find_first_of(delimiters, lastPos);
    while (std::string::npos != pos || std::string::npos != lastPos)
    {
        if (dic.find(str.substr(lastPos, pos - lastPos)) != dic.end())
            std::cout << str.substr(lastPos, pos - lastPos)
                      << " is part of the dic.\n";
        lastPos = str.find_first_not_of(delimiters, pos);
        pos = str.find_first_of(delimiters, lastPos);
    }
    return 0;
}
This will output:
sword is part of the dic.
kill is part of the dic.
monster is part of the dic.
Remarks:
The tokenization delimiter (white space) is very (too) simple for natural languages.
You could use some utilities in Boost (split, tokenizer).
If your dictionary (word list) is really big, using the hash version of set (unordered_set) could be useful.
With boost tokenizer, it could look like this (this may not be very efficient):
boost::tokenizer<> tok(str);
BOOST_FOREACH(const std::string& word, tok)
{
    if (dic.find(word) != dic.end())
        std::cout << word << " is part of the dic.\n";
}
If you do want to code the parsing yourself, I would strongly recommend using "something like Lex/Yacc". In fact, I strongly recommend Antlr. See my previously accepted answer to a similar question at What language should I use to write a text parser and display the results in a user friendly manner?
However, the best approach is probably to forget C++ altogether, unless you have a burning desire to learn C++; but even then, there are probably better projects on which to cut your teeth.
If what you want is to program a text adventure, then I recommend that you use one of the programming languages specifically designed for that purpose. There are many; see
http://www.brasslantern.org/writers/howto/chooselang.html
http://www.brasslantern.org/editorials/easyif.html
http://www.onlamp.com/pub/a/onlamp/2004/11/24/interactive_fiction.html
or google for "i-f programming language" ("Interactive Fiction").
You will probably decide on TADS, Inform or Hugo (my personal vote goes to TADS).
You might get good advice if you post to rec.arts.int-fiction explaining what you hope to achieve and giving your level of programming ability.
Have fun!