Getting started with the code structure of a grammar parser in C++

I'm doing an assignment for school right now that focuses on building a tokenizer and parser for simple sets of instructions. This instruction set uses EBNF grammar which we have to output to the user. As a simple example, imagine this instruction set code looks like this:
set 0, 2 * (4 + 20)
halt
And the EBNF grammar is given like this:
<Program> -> <Statement> { <Statement> }
<Statement> -> <Set> | 'halt'
<Set> -> set ('write'|<Expr>), ('read'|<Expr>)
<Expr> -> <Term> {(+|-) <Term>}
<Term> -> <Factor> {(*|/|%) <Factor>}
<Factor> -> <Number> | 'D['<Expr>']' | '('<Expr>')'
<Number> -> 0 | (1...9){0...9}
In this case, <...> is what I output to the user, {...} means choose 0 or more, (...) means choose 1 or more and '...' is a string literal.
So in this case, when running the program, the correct output should look like this:
Program
Statement
Set // set
Expr // 0
Term // 0 is a term
Factor // 0 is a factor
Number // 0 is finally a number
Expr // go to the next string, "2*(4+20)"
Term // start with the number 2
Factor // 2 is a factor
Number // 2 is a number
Factor // after 2, we have a * so after that is a factor
Expr // the factor leads to a (...) which leads to an expr
Term // the expr leads to a term 4
Factor // 4 is a factor
Number // 4 is returned as a number
Term // we have a + so the next string (20) is a term
Factor // 20 is a factor
Number // return 20 as a number
Statement // halt is a statement, end
I've already built the tokenizer portion of the assignment, so running the above instruction set produces a std::vector<std::string> that looks like this (where each new string is separated by a new line):
set
0
2*(4+20)
halt
Now the parser is where I am stuck. I first push each string into a std::queue, examine the string, parse it, and then pop it from the queue. I already have it so that a set statement pushes the string "Set" in another std::vector<std::string> so I can print it out later.
I also have a function, parse_expression(std::string& str) to parse my expressions. The real trouble I have is with how I could correctly parse the 2 * (4 + 20) part of it.
My teacher told me to go character by character through the string, check whether it is a number (which I know how to do), or if it matches a '+', '-', '/', '*', '%' character, or if the next character is a 'D' then I should parse another expression.
I'm sort of confused on how I should attempt this though. If I go character by character, I can get the first number, go until I hit a non number character, and then just push out Term followed by Factor followed by Number. But how can I then go back to that original character and keep pushing onwards towards the next ones. Almost like, how do I go back to the original level I was at so that I can correctly determine whether or not I need to create another Expr, or whether or not I should just be doing another term.
I realize this is a long, confusing question, but I would appreciate any push in the right direction, whether it be something glaringly obvious that I'm doing wrong, or whether it be how a parser actually works.
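One common way to get the "go back to the level I was at" behaviour is recursive descent: write one function per grammar rule and let each function call the functions for the rules it refers to. When a call returns, you are automatically back at the level that made the call, with the read position already advanced. The following is only a minimal sketch under some assumptions: the expression string contains no spaces (as in the tokenizer output above), and the names ExprTracer and trace_expression are made up for illustration, not taken from the assignment.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper class: walks one expression string and records the
// grammar rule names in the order in which the rules are entered.
class ExprTracer {
public:
    explicit ExprTracer(const std::string& text) : text_(text) {}

    std::vector<std::string> run() {
        parse_expr();
        if (pos_ != text_.size())
            throw std::runtime_error("unexpected character");
        return trace_;
    }

private:
    // Current character, or '\0' once the whole string has been read.
    char peek() const { return pos_ < text_.size() ? text_[pos_] : '\0'; }

    void expect(char c) {
        if (peek() != c) throw std::runtime_error(std::string("expected ") + c);
        ++pos_;
    }

    // <Expr> -> <Term> {(+|-) <Term>}
    void parse_expr() {
        trace_.push_back("Expr");
        parse_term();
        while (peek() == '+' || peek() == '-') {
            ++pos_;
            parse_term();
        }
    }

    // <Term> -> <Factor> {(*|/|%) <Factor>}
    void parse_term() {
        trace_.push_back("Term");
        parse_factor();
        while (peek() == '*' || peek() == '/' || peek() == '%') {
            ++pos_;
            parse_factor();
        }
    }

    // <Factor> -> <Number> | 'D['<Expr>']' | '('<Expr>')'
    void parse_factor() {
        trace_.push_back("Factor");
        if (peek() == '(') {
            ++pos_;
            parse_expr();
            expect(')');
        } else if (peek() == 'D') {
            ++pos_;
            expect('[');
            parse_expr();
            expect(']');
        } else {
            parse_number();
        }
    }

    // <Number> -> 0 | (1...9){0...9}
    void parse_number() {
        if (!std::isdigit(static_cast<unsigned char>(peek())))
            throw std::runtime_error("number expected");
        trace_.push_back("Number");
        if (text_[pos_++] == '0') return;  // a leading 0 stands alone
        while (std::isdigit(static_cast<unsigned char>(peek()))) ++pos_;
    }

    std::string text_;
    std::size_t pos_ = 0;
    std::vector<std::string> trace_;
};

std::vector<std::string> trace_expression(const std::string& text) {
    return ExprTracer(text).run();
}
```

Calling trace_expression("2*(4+20)") should give Expr, Term, Factor, Number, Factor, Expr, Term, Factor, Number, Term, Factor, Number, which is the order listed in the expected output above. The loop in parse_term is what "stays at the Term level" after the first Factor returns, so no explicit backtracking is needed.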

Related

How does recursion get previous values?

I'm in the basic of the basic of learning c++, and ran into an example of recursion that I don't understand. The equation is for Fibonacci numbers, and is shown below:
int fibo(int f)
{
    if (f < 3)
    {
        return 1;
    }
    else
    {
        return fibo(f - 2) + fibo(f - 1);
    }
}
How does the "else" statement work? I know that it adds the two previous numbers to get the current Fibonacci number, but how does it know, without any prior information, where to start? If I want the 7th Fibonacci number, how does it know what the 6th and 5th numbers are?

The function keeps calling itself with smaller arguments. When you call it with 7, it calls itself for 7 - 2 = 5 and 7 - 1 = 6. It does not yet know those values either, so the call for 5 in turn asks for 3 and 4, and the call for 6 asks for 4 and 5.
This continues until f is less than 3, where the function returns 1 without calling itself again. Those base values are then added up on the way back to produce the final answer.

A recursive function will call itself as many times as it needs to compute the final value. For example, if you call fibo(3), it will call itself with fibo(2) and fibo(1).
You can understand it better if you write down a tree representing all the function calls (the numbers in brackets are the return values):
               fibo(3) [1+1]
                     |
          .--------------------.
          |                    |
     fibo(2) [1]          fibo(1) [1]
For fibo(7), you will have multiple calls like so:
                          fibo(7) [fibo(6) + fibo(5)]
                                       |
            .-----------------------------------------------.
            |                                               |
fibo(6) [fibo(5) + fibo(4)]                     fibo(5) [fibo(4) + fibo(3)]
            |                                               |
   .---------------------------------.                     ...
   |                                 |
fibo(5) [fibo(4) + fibo(3)]   fibo(4) [fibo(3) + fibo(2)]
   |                                 |
  ...                               ...
Each recursive call will execute the same code, but with a different value of f. And each recursive call will have to call their own "editions" of the sub-cases (smaller values). This happens until everyone reaches the base case (f < 3).
I didn't draw the entire tree. But I guess you can see this grows very quick. There's a lot of repetition (fibo(7) calls fibo(6) and fibo(5), then fibo(6) calls fibo(5) again). This is why we usually don't implement Fibonacci recursively, except for studying recursion.
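To make the repetition visible, here is a small sketch that counts how often the function is entered, next to a memoized variant that stores each value once. The counter parameter and the names fibo_counted, fibo_memo and fibo_fast are additions for illustration and are not part of the code in the question.

```cpp
#include <vector>

// Same recursion as in the question, plus a counter that records how many
// times the function is entered.
int fibo_counted(int f, long long& calls) {
    ++calls;
    if (f < 3) return 1;
    return fibo_counted(f - 2, calls) + fibo_counted(f - 1, calls);
}

// Memoized variant: each value is computed once and then reused.
// A cache entry of 0 means "not computed yet".
long long fibo_memo(int f, std::vector<long long>& cache) {
    if (f < 3) return 1;
    if (cache[f] != 0) return cache[f];
    cache[f] = fibo_memo(f - 2, cache) + fibo_memo(f - 1, cache);
    return cache[f];
}

long long fibo_fast(int f) {
    std::vector<long long> cache(f < 3 ? 3 : f + 1, 0);
    return fibo_memo(f, cache);
}
```

With these definitions, fibo_counted(7, calls) returns 13 and leaves calls at 25, even though only seven distinct values are involved. The memoized version trades a little memory for doing each sub-case once.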

How to use a string or a char vector (containing any chemical composition respectively formula) and calculate its molar mass?

I try to write a simple console application in C++ which can read any chemical formula and afterwards compute its molar mass, for example:
Na2CO3, or something like:
La0.6Sr0.4CoO3, or with brackets:
Fe(NO3)3
The problem is that I don't know in detail how I can deal with the input stream. I think that reading the input and storing it into a char vector may be in this case a better idea than utilizing a common string.
My very first idea was to check all elements (stored in a char vector), step by step: When there's no lowercase after a capital letter, then I have found e.g. an element like Carbon 'C' instead of "Co" (Cobalt) or "Cu" (Copper). Basically, I've tried with the methods isupper(...), islower(...) or isalpha(...).
// first idea, but it seems to be definitely the wrong way
// read input characters from char vector
// check if element contains only one or two letters
// ... and convert them to a string, store them into a new vector
// ... finally, compute the molar mass elsewhere
// but how to deal with the numbers... ?
for (unsigned int i = 0; i < char_vec.size()-1; i++)
{
    if (islower(char_vec[i]))
    {
        char arr[] = { char_vec[i - 1], char_vec[i] };
        string temp_arr(arr, sizeof(arr));
        element.push_back(temp_arr);
    }
    else if (isupper(char_vec[i]) && !islower(char_vec[i+1]))
    {
        char arrSec[] = { char_vec[i] };
        string temp_arrSec(arrSec, sizeof(arrSec));
        element.push_back(temp_arrSec);
    }
    else if (!isalpha(char_vec[i]) || char_vec[i] == '.')
    {
        char arrNum[] = { char_vec[i] };
        string temp_arrNum(arrNum, sizeof(arrNum));
        stoechiometr_num.push_back(temp_arrNum);
    }
}
I need a simple algorithm that can handle letters and numbers. Working with pointers might also be an option, but I am not yet familiar with that technique. I am open to learning it if someone would like to explain how I could use pointers here.
I would highly appreciate any support and of course some code snippets concerning this problem, since I am thinking for many days about it without progress… Please keep in mind that I am rather a beginner than an intermediate.

This problem is surely not for a beginner but I will try to give you some idea about how you can do that.
Assumption: I am not considering Isotopes case in which atomic mass can be different with same atomic number.
Model it to real world.
How will you solve that in real life?
Say, if I give you Chemical formula: Fe(NO3)3, What you will do is:
Convert this to something like this:
Total Mass => [1 of Fe] + [3 of NO3] => [1 of Fe] + [ 3 of [1 of N + 3 of O ] ]
=> 1 * Fe + 3 * (1 * N + 3 * O)
Then, you will search for individual masses of elements and then substitute them.
Total Mass => 1 * 56 + 3 * (1 * 14 + 3 * 16)
=> 242
Now, come to programming.
Trust me, you have to do the same in programming also.
Convert your chemical formula to the form discussed above, i.e. convert Fe(NO3)3 to Fe*1+(N*1+O*3)*3. I think this is the hardest part of the problem, but it can be done by breaking it down into steps.
Check whether every element has a number after it. If not, add "1" after it. For example, in this case O has a number after it, which is 3, but Fe and N don't.
After this step, your formula should change to Fe1(N1O3)3.
Now convert each number, say num, in the above formula to:
*num+ if an element or an opening parenthesis follows the current number.
*num if ')' or the end of the formula follows it.
After this, your formula should change to Fe*1+(N*1+O*3)*3.
Now, your problem is to solve the above formula. There is a very easy algorithm for this. Please refer to: https://www.geeksforgeeks.org/expression-evaluation/. In your case, your operands can be either a number (say 2) or an element (say Fe). Your operators can be * and +. Parentheses can also be present.
For finding individual masses, you may maintain a std::map<std::string, double> containing the element name as key and its mass as value (a floating-point type, since most atomic masses are not whole numbers).
Hope this helps a bit.
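As a possible shortcut, the rewrite into Fe*1+(N*1+O*3)*3 can be skipped by evaluating the formula directly with a small recursive function: a group in parentheses is handled by the same function that handles the whole formula. This is a different technique from the expression-evaluation approach linked above, shown only as a sketch. The function names and the tiny mass table (with the rounded values from the worked example) are assumptions for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Rounded masses, as in the worked example above; a real table would hold
// all elements with more precise values.
static const std::map<std::string, double> kMass = {
    {"Fe", 56.0}, {"N", 14.0}, {"O", 16.0}, {"Na", 23.0}, {"C", 12.0}};

// Reads an optional count such as "3" or "0.6"; returns 1 when none is present.
double read_count(const std::string& s, std::size_t& i) {
    std::size_t start = i;
    while (i < s.size() &&
           (std::isdigit(static_cast<unsigned char>(s[i])) || s[i] == '.'))
        ++i;
    return i == start ? 1.0 : std::stod(s.substr(start, i - start));
}

// Sums "symbol count" and "(group) count" items until ')' or end of input.
double read_group(const std::string& s, std::size_t& i) {
    double total = 0.0;
    while (i < s.size() && s[i] != ')') {
        double mass = 0.0;
        if (s[i] == '(') {
            ++i;  // skip '('
            mass = read_group(s, i);
            if (i >= s.size()) throw std::runtime_error("missing ')'");
            ++i;  // skip ')'
        } else if (std::isupper(static_cast<unsigned char>(s[i]))) {
            std::string symbol(1, s[i++]);
            if (i < s.size() && std::islower(static_cast<unsigned char>(s[i])))
                symbol += s[i++];
            auto it = kMass.find(symbol);
            if (it == kMass.end())
                throw std::runtime_error("unknown element " + symbol);
            mass = it->second;
        } else {
            throw std::runtime_error("unexpected character");
        }
        total += mass * read_count(s, i);
    }
    return total;
}

double molar_mass(const std::string& formula) {
    std::size_t i = 0;
    double total = read_group(formula, i);
    if (i != formula.size()) throw std::runtime_error("unmatched ')'");
    return total;
}
```

With the rounded table above, molar_mass("Fe(NO3)3") gives 242, matching the hand calculation. Fractional counts such as La0.6Sr0.4CoO3 go through read_count as well, provided the elements are added to the table.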

getline() Adding Character to Front of String? -- Actually substr syntax error

I'm writing a program that will balance Chemistry Equations; I thought it'd be a good challenge and help reinforce the information I've recently learned.
My program is set up to use getline(cin, std::string) to receive the equation. From there it separates the equation into two halves: a left side and right side by making a substring when it encounters a =.
I'm having issues which only concerns the left side of my string, which is called std::string leftSide. My program then goes into a for loop that iterates over the length of leftSide. The first condition checks to see if the character is uppercase, because chemical formulas are written with the element symbols and a symbol consists of either one upper case letter, or an upper case and one lower case letter. After it checks to see if the current character is uppercase, it checks to see if the next character is lower case; if it's lower case then I create a temporary string, combine leftSide[index] with leftSide[index+1] in the temp string then push the string to my vector.
My problem lies on the first iteration; I've been using CuFe3 = 8 (right side doesn't matter right now) to test it out. The only thing stored in std::string temp is C. I'm not sure why this happening; also, I'm still getting numbers in my final answer and I don't understand why. Some help fixing these two issues, along with an explanation, would be greatly appreciated.
[CODE]
int index = 0;
for (it = leftSide.begin(); it!=leftSide.end(); ++it, index++)
{
    bool UPPER_LETTER = isupper(leftSide[index]);
    bool NEXT_LOWER_LETTER = islower(leftSide[index+1]);
    if (UPPER_LETTER)// if the character is an uppercase letter
    {
        if (NEXT_LOWER_LETTER)
        {
            string temp = leftSide.substr(index, (index+1));//add THIS capital and next lowercase
            elementSymbol.push_back(temp); // add temp to vector
            temp.clear(); //used to try and fix problem initially
        }
        else if (UPPER_LETTER && !NEXT_LOWER_LETTER) //used to try and prevent number from getting in
        {
            string temp = leftSide.substr(index, index);
            elementSymbol.push_back(temp);
        }
    }
    else if (isdigit(leftSide[index])) // if it's a number
        num++;
}
[EDIT] When I entered in only ASDF, *** ***S ***DF ***F was the output.

string temp = leftSide.substr(index, (index+1));
substr takes the first index and then a length, rather than first and last indices. You want substr(index, 2). Since in your example index is 0 you're doing: substr(index, 1) which creates a string of length 1, which is "C".
string temp = leftSide.substr(index, index);
Since index is 0 this is substr(index, 0), which creates a string of length 0, that is, an empty string.
When you're processing parts of the string with a higher index, such as Fe in "CuFe3" the value you pass in as the length parameter is higher and so you're creating strings that are longer. F is at index 2 and you call substr(index, 3), which creates the string "Fe3".
Also the standard library usually uses half open ranges, so even if substr took two indices (which, again, it doesn't) you would do substr(index, index+2) to get a two character string.
bool NEXT_LOWER_LETTER = islower(leftSide[index+1]);
You might want to check that index+1 is a valid index. If you don't want to do that manually you might at least switch to using the bounds checked function at() instead of operator[].
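Putting both points together (a length as the second argument of substr, and a bounds check before looking at the next character), the loop could be sketched as follows. The function name element_symbols is made up for this example, and the sketch only collects symbols and ignores the digit counting from the question.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Collects element symbols (one uppercase letter, optionally followed by one
// lowercase letter) and skips everything else, such as digits.
std::vector<std::string> element_symbols(const std::string& side) {
    std::vector<std::string> symbols;
    for (std::size_t i = 0; i < side.size(); ++i) {
        if (!std::isupper(static_cast<unsigned char>(side[i]))) continue;
        // Only look at side[i + 1] when that index exists.
        bool next_is_lower =
            i + 1 < side.size() &&
            std::islower(static_cast<unsigned char>(side[i + 1]));
        // The second argument of substr is a length, not an end index.
        symbols.push_back(side.substr(i, next_is_lower ? 2 : 1));
    }
    return symbols;
}
```

For "CuFe3" this yields Cu and Fe, and for "ASDF" it yields A, S, D and F.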

Solving a linear equation in one variable

What would be the most efficient algorithm to solve a linear equation in one variable given as a string input to a function? For example, for input string:
"x + 9 - 2 - 4 + x = - x + 5 - 1 + 3 - x"
The output should be 1.
I am considering using a stack and pushing each string token onto it as I encounter spaces in the string. If the input was in polish notation then it would have been easier to pop numbers off the stack to get to a result, but I am not sure what approach to take here.
It is an interview question.

Solving the linear equation is (I hope) extremely easy for you once you've worked out the coefficients a and b in the equation a * x + b = 0.
So, the difficult part of the problem is parsing the expression and "evaluating" it to find the coefficients. Your example expression is extremely simple, it uses only the operators unary -, binary -, binary +. And =, which you could handle specially.
It is not clear from the question whether the solution should also handle expressions involving binary * and /, or parentheses. I'm wondering whether the interview question is intended:
to make you write some simple code, or
to make you ask what the real scope of the problem is before you write anything.
Both are important skills :-)
It could even be that the question is intended:
to separate those with lots of experience writing parsers (who will solve it as fast as they can write/type) from those with none (who might struggle to solve it at all within a few minutes, at least without some hints).
Anyway, to allow for future more complicated requirements, there are two common approaches to parsing arithmetic expressions: recursive descent or Dijkstra's shunting-yard algorithm. You can look these up, and if you only need the simple expressions in version 1.0 then you can use a simplified form of Dijkstra's algorithm. Then once you've parsed the expression, you need to evaluate it: use values that are linear expressions in x and interpret = as an operator with lowest possible precedence that means "subtract". The result is a linear expression in x that is equal to 0.
If you don't need complicated expressions then you can evaluate that simple example pretty much directly from left-to-right once you've tokenised it[*]:
x
x + 9
// set the "we've found minus sign" bit to negate the first thing that follows
x + 7 // and clear the negative bit
x + 3
2 * x + 3
// set the "we've found the equals sign" bit to negate everything that follows
3 * x + 3
3 * x - 2
3 * x - 1
3 * x - 4
4 * x - 4
Finally, solve a * x + b = 0 as x = - b/a.
[*] example tokenisation code, in Python:
def tokenise(input):
    acc = None
    for idx, ch in enumerate(input):
        if ch in '1234567890':
            if acc is None: acc = 0
            acc = 10 * acc + int(ch)
            continue
        if acc is not None:
            yield acc
            acc = None
        if ch in '+-=x':
            yield ch
        elif ch == ' ':
            pass
        else:
            raise ValueError('illegal character "%s" at %d' % (ch, idx))
    # emit a number that ends the input
    if acc is not None:
        yield acc
Alternative example tokenisation code, also in Python, assuming there will always be spaces between tokens as in the example. This leaves token validation to the parser:
return input.split()
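The left-to-right evaluation described in this answer, with one "minus" flag and one "after the equals sign" flag, could look roughly like this in C++. It is a sketch under the same restrictions as the simple example: only non-negative integers, x, +, - and a single =, with minus signs written as plain ASCII '-'. The function name solve_linear is an assumption.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Moves everything to the left-hand side, giving a*x + b = 0, then solves.
double solve_linear(const std::string& eq) {
    double a = 0.0, b = 0.0;
    int side = 1;  // becomes -1 after '=' to negate the right-hand side
    int sign = 1;  // sign of the next term, set by '+' or '-'
    for (std::size_t i = 0; i < eq.size();) {
        char ch = eq[i];
        if (std::isdigit(static_cast<unsigned char>(ch))) {
            double value = 0.0;
            while (i < eq.size() &&
                   std::isdigit(static_cast<unsigned char>(eq[i])))
                value = 10.0 * value + (eq[i++] - '0');
            b += side * sign * value;
            sign = 1;
            continue;
        }
        if (ch == 'x')      { a += side * sign; sign = 1; }
        else if (ch == '+') { sign = 1; }
        else if (ch == '-') { sign = -1; }
        else if (ch == '=') { side = -1; sign = 1; }
        else if (ch != ' ') throw std::invalid_argument("illegal character");
        ++i;
    }
    if (a == 0.0) throw std::invalid_argument("no unique solution");
    return -b / a;
}
```

For "x + 9 - 2 - 4 + x = - x + 5 - 1 + 3 - x" the loop ends with a = 4 and b = -4, so the function returns 1, in line with the step-by-step trace above.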

OK, here is some simple pseudocode that you could use to solve this problem:
function(stringToParse){
    arrayOfTokens = stringToParse.match(RegexMatching);
    foreach(arrayOfTokens as token)
    {
        //now step through the tokens and determine what they are
        //and store the necessary information.
    }
    //Use the above information to do the arithmetic.
    //count the number of times a variable appears positive and negative
    //do the arithmetic.
    //add up the numbers both positive and negative.
    //return the result.
}

The first step is to parse the string and identify the tokens (numbers, variables and operators), so that an expression tree can be built that gives each operator its proper precedence.
Regular expressions can help, but they are not the only method (grammar parsers like boost::spirit are good too, and you can even write your own: it is all "find and recurse").
The tree can then be reduced by evaluating the nodes whose operations involve only constants and by grouping the operations that involve the variable.
This continues recursively until you are left with one variable node and one constant node.
At that point the solution is trivial to calculate.
These are basically the same principles that lead to an interpreter or a compiler.

Consider:
from operator import add, sub
def ab(expr):
    a, b, op = 0, 0, add
    for t in expr.split():
        if t == '+': op = add
        elif t == '-': op = sub
        elif t == 'x': a = op(a, 1)
        else : b = op(b, int(t))
    return a, b
Given an expression like 1 + x - 2 - x... this converts it to a canonical form ax+b and returns a pair of coefficients (a,b).
Now, let's obtain the coefficients from both parts of the equation:
le, ri = equation.split('=')
a1, b1 = ab(le)
a2, b2 = ab(ri)
and finally solve the trivial equation a1*x + b1 = a2*x + b2:
x = (b2 - b1) / (a1 - a2)
Of course, this only solves this particular example, without operator precedence or parentheses. To support the latter you'll need a parser, presumably a recursive descent one, which would be simpler to code by hand.

Which Data Structure used to solve a simple math equation

Given an expression like (10+5*15), how would one best evaluate it while following the order of operations?
What kind of data structure is best?
Thanks.

I'd go with Dijkstra's Shunting yard algorithm to create the AST.

Try parsing the expression using recursive descent. This would give you a parse tree respecting order of operations.

The usual data structure for this task is a stack. When you're doing things like compiling, creating an abstract syntax tree is useful, but for simple evaluation it's usually overkill.

Think about it for a second - what is an operator? Pretty much every operator here (+, -, *, /) is a binary operator. Parentheses are depth constructors; you move one level deeper with each opening parenthesis.
In fact, constructing the tree of data you need to solve this problem is going to be your biggest hurdle.

It's in Java, but this seems to convert from infix to postfix and then evaluate using a stack-based approach. It puts numbers onto the stack and, on reaching an operator, pops two numbers from the stack and applies the operator (*, +, / or -) to them.
http://enel.ucalgary.ca/People/Norman/enel315_winter1999/lab_solutions/lab5sol/exF/Calculator.java
The conversion is as follows:
Initialise an empty stack.
Scan the infix string from left to right.
If the scanned character is an operand, add it to the postfix string.
If the scanned character is an operator and the stack is empty, push the character onto the stack.
If the scanned character is an operator and the stack is not empty, compare its precedence with that of the element on top of the stack (topStack). While the stack is not empty and topStack has higher precedence than the scanned character (or equal precedence, for left-associative operators), pop topStack and add it to the postfix string. Then push the scanned character onto the stack.
Repeat the three steps above until all characters are scanned.
After all characters are scanned, add topStack to the postfix string and pop the stack, repeating as long as the stack is not empty.
Return the postfix string.
Evaluate the postfix string.
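The conversion and evaluation steps above can be sketched in C++ as follows. This is not a translation of the linked Java file; it is a minimal shunting-yard style example that assumes non-negative integer operands, the four operators + - * /, and parentheses (which the step list above does not mention). The names precedence, to_postfix and eval_postfix are made up here.

```cpp
#include <cctype>
#include <cstddef>
#include <stack>
#include <stdexcept>
#include <string>
#include <vector>

int precedence(char op) { return (op == '*' || op == '/') ? 2 : 1; }

// Converts infix text to a list of postfix tokens (numbers and operators).
std::vector<std::string> to_postfix(const std::string& infix) {
    std::vector<std::string> out;
    std::stack<char> ops;
    for (std::size_t i = 0; i < infix.size();) {
        char ch = infix[i];
        if (std::isdigit(static_cast<unsigned char>(ch))) {
            std::size_t start = i;
            while (i < infix.size() &&
                   std::isdigit(static_cast<unsigned char>(infix[i])))
                ++i;
            out.push_back(infix.substr(start, i - start));
            continue;
        }
        if (ch == '(') {
            ops.push(ch);
        } else if (ch == ')') {
            while (!ops.empty() && ops.top() != '(') {
                out.push_back(std::string(1, ops.top()));
                ops.pop();
            }
            if (ops.empty()) throw std::runtime_error("unmatched ')'");
            ops.pop();  // discard '('
        } else if (ch == '+' || ch == '-' || ch == '*' || ch == '/') {
            // Pop operators of higher or equal precedence first.
            while (!ops.empty() && ops.top() != '(' &&
                   precedence(ops.top()) >= precedence(ch)) {
                out.push_back(std::string(1, ops.top()));
                ops.pop();
            }
            ops.push(ch);
        } else if (ch != ' ') {
            throw std::runtime_error("unexpected character");
        }
        ++i;
    }
    while (!ops.empty()) {
        if (ops.top() == '(') throw std::runtime_error("unmatched '('");
        out.push_back(std::string(1, ops.top()));
        ops.pop();
    }
    return out;
}

// Evaluates postfix tokens with a stack of intermediate values.
int eval_postfix(const std::vector<std::string>& tokens) {
    std::stack<int> values;
    for (const std::string& t : tokens) {
        if (std::isdigit(static_cast<unsigned char>(t[0]))) {
            values.push(std::stoi(t));
            continue;
        }
        if (values.size() < 2) throw std::runtime_error("missing operand");
        int rhs = values.top(); values.pop();
        int lhs = values.top(); values.pop();
        switch (t[0]) {
            case '+': values.push(lhs + rhs); break;
            case '-': values.push(lhs - rhs); break;
            case '*': values.push(lhs * rhs); break;
            default:
                if (rhs == 0) throw std::runtime_error("division by zero");
                values.push(lhs / rhs);
        }
    }
    if (values.size() != 1) throw std::runtime_error("malformed expression");
    return values.top();
}
```

For "(10+5*15)" the postfix tokens come out as 10 5 15 * +, and evaluating them gives 85.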

If you need to simply compute the result of the expression that is available as a string then I'd go with no data structure at all and just functions like:
//
// expression ::= addendum [ { "-" | "+" } addendum ]
// addendum ::= factor [ { "*" | "/" } factor ]
// factor ::= { number | sub-expression | "-" factor }
// sub-expression ::= "(" expression ")"
// number ::= digit [ digit ]
// digit ::= { "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" }
//
int calcExpression(const char *& p);
int calcDigit(const char *& p);
int calcNumber(const char *& p);
int calcFactor(const char *& p);
int calcAddendum(const char *& p);
where each function accepts a const char * by reference, reads from it (advancing the pointer) and returns the numeric value of what it parsed, throwing an exception in case of problems.
This approach doesn't need any data structure because it uses the C++ call stack for intermediate results. As an example...
#include <stdexcept>

int calcDigit(const char *& p)
{
    if (*p >= '0' && *p <= '9')
        return *p++ - '0';
    throw std::runtime_error("Digit expected");
}
int calcNumber(const char *& p)
{
    int acc = calcDigit(p);
    while (*p >= '0' && *p <= '9')
        acc = acc * 10 + calcDigit(p);
    return acc;
}
If you need instead to write a compiler that transforms a string (for example including variables or function calls) into code or bytecode then probably the best solution is to start either using a generic n-way tree or a tree with specific structures for the different AST node types.
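For completeness, here is one way the remaining functions from the declaration list might be filled in. This is a sketch, not the answerer's own code: it reads the grammar's addendum and expression rules as allowing any number of repetitions (so that 1+2+3 parses), it does not skip whitespace, and the wrapper calculate is an added convenience function.

```cpp
#include <stdexcept>

int calcExpression(const char*& p);

int calcDigit(const char*& p) {
    if (*p >= '0' && *p <= '9') return *p++ - '0';
    throw std::runtime_error("Digit expected");
}

int calcNumber(const char*& p) {
    int acc = calcDigit(p);
    while (*p >= '0' && *p <= '9') acc = acc * 10 + calcDigit(p);
    return acc;
}

// factor: a number, a parenthesised sub-expression, or "-" followed by a factor
int calcFactor(const char*& p) {
    if (*p == '-') { ++p; return -calcFactor(p); }
    if (*p == '(') {
        ++p;
        int value = calcExpression(p);
        if (*p != ')') throw std::runtime_error("')' expected");
        ++p;
        return value;
    }
    return calcNumber(p);
}

// addendum: factors joined by "*" or "/", evaluated left to right
int calcAddendum(const char*& p) {
    int value = calcFactor(p);
    while (*p == '*' || *p == '/') {
        char op = *p++;
        int rhs = calcFactor(p);
        if (op == '*') {
            value *= rhs;
        } else {
            if (rhs == 0) throw std::runtime_error("Division by zero");
            value /= rhs;
        }
    }
    return value;
}

// expression: addenda joined by "+" or "-", evaluated left to right
int calcExpression(const char*& p) {
    int value = calcAddendum(p);
    while (*p == '+' || *p == '-') {
        char op = *p++;
        int rhs = calcAddendum(p);
        value = (op == '+') ? value + rhs : value - rhs;
    }
    return value;
}

// Convenience wrapper (an addition, not part of the answer above).
int calculate(const char* text) {
    const char* p = text;
    int value = calcExpression(p);
    if (*p != '\0') throw std::runtime_error("Unexpected trailing input");
    return value;
}
```

With this, calculate("(10+5*15)") gives 85. Operator precedence falls out of the call structure: calcExpression only ever sees values that calcAddendum has already multiplied or divided.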
