Regular expression in C++ for mathematical expressions - c++

I have this trouble: I must verify the correctness of many mathematical expressions especially check for consecutive operators + - * /.
For example:
6+(69-9)+3
is ok while
6++8-(52--*3)
no.
I am not using the library <regex> since it is only compatible with C++11.
Is there an alternative method to solve this problem? Thanks.

You can use a regular expression to verify everything about a mathematical expression except the check that parentheses are balanced. That is, the regular expression will only ensure that open and close parentheses appear at the point in the expression they should appear, but not their correct relationship with other parentheses.
So you could check both that the expression matches a regex and that the parentheses are balanced. Checking for balanced parentheses is really simple if there is only one type of parenthesis:
bool check_balanced(const char* expr, char open, char close) {
    int parens = 0;
    for (const char* p = expr; *p; ++p) {
        if (*p == open) ++parens;
        else if (*p == close && parens-- == 0) return false;
    }
    return parens == 0;
}
To get the regular expression, note that mathematical expressions without function calls can be summarized as:
BEFORE* VALUE AFTER* (BETWEEN BEFORE* VALUE AFTER*)*
where:
BEFORE is a sub-regex which matches an open parenthesis or a prefix unary operator (if you have prefix unary operators; the question is not clear).
AFTER is a sub-regex which matches a close parenthesis or, in the case that you have them, a postfix unary operator.
BETWEEN is a sub-regex which matches a binary operator.
VALUE is a sub-regex which matches a value.
For example, for ordinary four-operator arithmetic on integers you would have:
BEFORE: [-+(]
AFTER: [)]
BETWEEN: [-+*/]
VALUE: [[:digit:]]+
and putting all that together you might end up with the regex:
^[-+(]*[[:digit:]]+[)]*([-+*/][-+(]*[[:digit:]]+[)]*)*$
If you have a Posix C library, you will have the <regex.h> header, which gives you regcomp and regexec. There's sample code at the bottom of the referenced page in the Posix standard, so I won't bother repeating it here. Make sure you supply REG_EXTENDED in the last argument to regcomp; REG_EXTENDED|REG_NOSUB, as in the example code, is probably even better since you don't need captures and not asking for them will speed things up.
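For completeness, here is a minimal sketch of how the two checks could be combined with the POSIX API. It assumes a POSIX system that provides <regex.h>; the function names balanced and is_valid_expression are made up for this illustration.

```cpp
#include <cstddef>
#include <regex.h>  // POSIX regular expressions, not the C++11 <regex> header

// Same idea as check_balanced above, specialised to round parentheses.
static bool balanced(const char* expr) {
    int depth = 0;
    for (const char* p = expr; *p; ++p) {
        if (*p == '(') ++depth;
        else if (*p == ')' && depth-- == 0) return false;
    }
    return depth == 0;
}

// Hypothetical helper: true if expr matches the token-order regex
// and its parentheses are balanced.
bool is_valid_expression(const char* expr) {
    static const char pattern[] =
        "^[-+(]*[[:digit:]]+[)]*([-+*/][-+(]*[[:digit:]]+[)]*)*$";
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;  // the pattern failed to compile
    bool matches = regexec(&re, expr, 0, NULL, 0) == 0;
    regfree(&re);
    return matches && balanced(expr);
}
```

Note that because BEFORE includes + and - as unary operators, a fragment such as 6++8 on its own is accepted (it reads as 6 + (+8)); the second example from the question is still rejected because of the --* sequence. Drop + and - from BEFORE if unary operators should not be allowed.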

You can loop over each char in your expression.
If you encounter a +, you can check whether it is followed by another +, /, *...
Additionally, you can group operators together to prevent code duplication.
for (int i = 0; expression[i] != '\0'; ++i) {
    switch (expression[i]) {
    case '+':
    case '*':
        // Do your syntax checks here, e.g. inspect expression[i + 1]
        break;
    }
}

Well, in the general case, you can't solve this with a regex. The "language" of arithmetic expressions can't be described with a regular grammar. It's a context-free grammar. So if what you want is to check the correctness of an arbitrary mathematical expression, then you'll have to write a parser.
However, if you only need to make sure that your string doesn't have consecutive +-*/ operators then regex is enough. You can write something like this [-+*/]{2,}. It will match substrings with 2 or more consecutive symbols from +-*/ set.
Or something like this ([-+*/]\s*){2,} if you also want to handle situations with spaces like 5+ - * 123
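Since the asker cannot use <regex>, the same check can be written by hand. This is only a sketch of the idea behind ([-+*/]\s*){2,}; the function name has_consecutive_operators is made up for this example.

```cpp
#include <cctype>
#include <cstring>

// True if two or more characters from "+-*/" occur in a row,
// ignoring any whitespace between them.
bool has_consecutive_operators(const char* expr) {
    bool previous_was_operator = false;
    for (const char* p = expr; *p; ++p) {
        if (std::isspace(static_cast<unsigned char>(*p)))
            continue;  // whitespace does not separate operators
        bool is_operator = std::strchr("+-*/", *p) != 0;
        if (is_operator && previous_was_operator)
            return true;
        previous_was_operator = is_operator;
    }
    return false;
}
```

With this, "6+(69-9)+3" passes and both "6++8-(52--*3)" and "5+ - * 123" are flagged. Like the regex, it says nothing about parentheses or about operators at the start or end of the string.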

Well, you will have to define some rules if possible. It's not possible to completely parse mathematical language with a regex, but given some lenience it may work.
The problem is that often the way we write math can be interpreted as an error, but it's really not. For instance:
5--3 can be 5-(-3)
So in this case, you have two choices:
Ensure that the input is parenthesized well enough that no two operators meet
If you find something like --, treat it as a special case and investigate it further
If the formulas are in fact in your favor (have well-defined parentheses), then you can just check for repeats. For instance:
--
+-
+*
-+
etc.
If you have a match, it means you have a poorly formatted equation and you can throw it out (or whatever you want to do).
You can check for this, using the following regex. You can add more constraints to the [..][..]. I'm giving you the basics here:
[+\-\*\\/][+\-\*\\/]
which will work for the following examples (and more):
6++8-(52--*3)
6+\8-(52--*3)
6+/8-(52--*3)
An alternative, probably a better one, is to just write a parser. It will process the equation step by step to check its validity. A parser will, if well written, be 100% accurate. A regex approach leaves you with a lot of constraints.

There is no real way to do this with a regex because mathematical expressions inherently aren't regular. Heck, even balancing parens isn't regular. Typically this will be done with a parser.
A basic approach to writing a recursive-descent parser (IMO the most basic parser to write) is:
Write a grammar for a mathematical expression. (These can be found online)
Tokenize the input into lexemes. (This will be done with a regex, typically).
Match the expressions based on the next lexeme you see.
Recurse based on your grammar
A quick Google search can provide many example recursive-descent parsers written in C++.
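As a rough sketch of those steps, here is a tiny recursive-descent validator. The grammar is an assumption chosen for illustration (integers, four binary operators, parentheses, no unary operators), tokenizing is done inline rather than with a regex, and all function names are made up.

```cpp
#include <cctype>

// Assumed grammar (unary operators are deliberately left out):
//   expr   := term   (('+' | '-') term)*
//   term   := factor (('*' | '/') factor)*
//   factor := NUMBER | '(' expr ')'
namespace {

bool parse_expr(const char*& p);  // forward declaration for the recursion

void skip_spaces(const char*& p) {
    while (std::isspace(static_cast<unsigned char>(*p))) ++p;
}

bool parse_factor(const char*& p) {
    skip_spaces(p);
    if (*p == '(') {
        ++p;
        if (!parse_expr(p)) return false;
        skip_spaces(p);
        if (*p != ')') return false;
        ++p;
        return true;
    }
    if (!std::isdigit(static_cast<unsigned char>(*p))) return false;
    while (std::isdigit(static_cast<unsigned char>(*p))) ++p;
    return true;
}

bool parse_term(const char*& p) {
    if (!parse_factor(p)) return false;
    for (skip_spaces(p); *p == '*' || *p == '/'; skip_spaces(p)) {
        ++p;  // consume the operator, then require another factor
        if (!parse_factor(p)) return false;
    }
    return true;
}

bool parse_expr(const char*& p) {
    if (!parse_term(p)) return false;
    for (skip_spaces(p); *p == '+' || *p == '-'; skip_spaces(p)) {
        ++p;  // consume the operator, then require another term
        if (!parse_term(p)) return false;
    }
    return true;
}

}  // namespace

// True if the whole string is one well-formed expression.
bool is_well_formed(const char* text) {
    const char* p = text;
    if (!parse_expr(p)) return false;
    skip_spaces(p);
    return *p == '\0';
}
```

Under this grammar, 6+(69-9)+3 is accepted and 6++8-(52--*3) is rejected. Unbalanced parentheses are rejected as a side effect, since a factor that opens a parenthesis must also close it. To allow something like 5--3, a unary minus would have to be added to factor.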

Related

Complicated Search and Replace using RegEx

I'm trying to convert a bunch of custom "recipes" from an old proprietary format to something that is ultimately compatible with C#. And I think that the easiest way to do this would be to use regular expressions. But I'm having trouble figuring out the expression. The piece that I need to convert with this RegEx is the IF statements. Here are a few examples of the original recipes...
IF(A = B,C,D)
IF(AA = BB,IF(E=F,G,H),DD)
IF(S1<>R1,ROUND(ROUND(S2/S1,R2)*S3,R3),R4)
The first one is straightforward... If A = B then C else D.
The second one is similar, except that the IF statements are nested.
And the third one includes additional ROUND function calls in the results.
I've stumbled across regex101.com and have managed to put together the following pattern which is getting close. It works for the first example, but not for the other two: (.*?)IF[^\S\r\n]*\((.*?),(.*?),(.*?)\)
Ultimately, what I want to do is use a regular expression to turn the three examples above into:
if (A == B) { C } else { D }
if (AA == BB) { if (E == F) { G } else { H } } else { DD }
if (S1 <> R1) { ROUND(ROUND(S2/S1,R2)*S3,R3) } else { R4 }
Note that the whitespace in the results is not particularly important. I just formatted it for readability. Also, the "ROUND" functions will be replaced separately with C# Math.Round() calls. No need to worry about those, here. (All I should need to do to them is add, "Math." and fix the capitalization.)
I'll keep plugging away at this, but if anyone out there has the RegEx experience to figure this out, I would appreciate it.
EDIT: With some additional effort, I've expanded my first expression into the following... (.*?)IF[^\S\r\n]*\((.*?),(([^\(]*)|(.*?\(.*?\))),(([^\(]*)|(.*?\(.*?\)))\) And with the following replace expression... $1if($2) {$3} else {$6} I'm almost there. It's just the nested IF statements that are left. (And although I'd prefer to do this in a single pass, if a recursive expression is not going to work, I could rig something up to run the results of the expression through it a second time to deal with the nested IF statements. It's not ideal, but if it's the best I have, I could live with it.)
The problem with using regexes to parse an arbitrary recursive grammar is that regexes are not particularly suited to recursion. Some regex implementations have limited support for recursion, but it's tricky to make it work for anything slightly more complicated than simple balanced parentheses.
That being said, for your particular case, although at first sight it looks like a recursive grammar, it might be possible to cheat.
In
IF(S1<>R1,ROUND(ROUND(S2/S1,R2)*S3,R3),R4)
if it is guaranteed that neither S1<>R1 nor R4 contains a comma, then you can use the following regex:
IF\(([^,]*),(.*),([^,]+)\)
Try it here: https://regexr.com/67r56
How it works: the first matching group greedily matches everything from the beginning of the string, until it encounters the first comma, then the second group greedily matches everything to the end, and starts backtracking, until the very last comma of the string is "released" from the second group. After that the third group matches the "released tail" of the string.
However, as I mentioned in the comments, if S1, R1 or R4 are expressions themselves, this regex trick won't work, and you'd need to use a proper recursive parser. Fortunately, there are plenty of parser/combinator libraries for user-defined grammars (or you might even find one that already works for your grammar). Once your expression is parsed into an AST, it's fairly easy to transform it into the desired form.
Alternatively, you can look into writing your own simple parser. It should be fairly straightforward, as you only care about nested parentheses and commas.
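To give an idea of how small such a parser can be, here is a sketch (in C++ for illustration; the same logic ports directly to C#). It only tracks parenthesis depth and top-level commas. The names split_arguments and convert_ifs are made up, and the sketch assumes IF is written immediately before its opening parenthesis and that no other identifier ends in IF. It does not rewrite = as == or touch ROUND.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits the text between the '(' at index `open` and its matching ')'
// into top-level, comma-separated arguments. Returns the index of the
// matching ')' or std::string::npos if the parentheses are unbalanced.
std::size_t split_arguments(const std::string& s, std::size_t open,
                            std::vector<std::string>& args) {
    int depth = 0;
    std::size_t start = open + 1;
    for (std::size_t i = open; i < s.size(); ++i) {
        if (s[i] == '(') {
            ++depth;
        } else if (s[i] == ')') {
            if (--depth == 0) {
                args.push_back(s.substr(start, i - start));
                return i;
            }
        } else if (s[i] == ',' && depth == 1) {
            args.push_back(s.substr(start, i - start));
            start = i + 1;
        }
    }
    return std::string::npos;
}

// Rewrites every IF(cond,a,b) in s, including nested ones, as
// "if (cond) { a } else { b }". Everything else is copied unchanged.
std::string convert_ifs(const std::string& s) {
    std::string out;
    std::size_t i = 0;
    while (i < s.size()) {
        if (s.compare(i, 3, "IF(") == 0) {
            std::vector<std::string> args;
            std::size_t close = split_arguments(s, i + 2, args);
            if (close != std::string::npos && args.size() == 3) {
                out += "if (" + convert_ifs(args[0]) + ") { " +
                       convert_ifs(args[1]) + " } else { " +
                       convert_ifs(args[2]) + " }";
                i = close + 1;
                continue;
            }
        }
        out += s[i++];
    }
    return out;
}
```

For the second example in the question, convert_ifs yields if (AA = BB) { if (E=F) { G } else { H } } else { DD }, and the ROUND calls in the third example pass through untouched because commas inside nested parentheses are ignored.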

Counterpart of regular expressions for parsing nested structures

Regular expressions are a standard tool used for parsing strings across many languages. However, their scope of applicability is limited. Regular expressions can only match a list. There is no way to describe arbitrarily deep nested structures using regular expressions. Question: what is a technology/framework as widely used and as standard as regular expressions that can match tree structures (produce an AST)?
Regular expressions describe a finite-state automaton.
Since the late 1960's, the "bread and butter" of parsing (though not necessarily the "state of the art") has been push-down automata generated by parser generators according to "LR" algorithms like LALR(1).
The connection to regular expressions is this: the parsing machine does in fact use rules very similar to regular expressions in order to recognize viable prefixes. The "shift" state transitions among the "core LR(0) items" constitute a finite automaton, and can be described by a regular expression. The recursion is handled thanks to the semantic action of pushing symbols onto a stack when doing the "shifts", and removing them ("reducing"). Reductions rewrite a portion of the stack, and perform a "goto" to another state. This type of goto, together with the stack, is absent in the regular expression automaton.
Parsing Expression Grammars are also related to regular expressions. Regular expressions themselves can be endowed with recursion. Firstly, we can take pieces of regular expressions and give them names, and then construct bigger regular expressions by writing expressions which invoke these names. (Such a feature is found in the lex tool, where you can define a named expression like letters [A-Za-z]+ and refer to it later as {letters}.) Now suppose you allow circular references, like letters [A-Za-z]{letters}?. You now have recursion; the only problem is to adjust the model in order to implement it.
Implementations of so-called "regular expressions" in various modern languages and environments in fact support recursion. Perl-compatible regular expressions (PCRE) support it, for instance.
Expressions that feature recursion or backreferencing are not handled by the classic NFA compilation route (possibly converted to a DFA); they cannot be.
How the above letters recursion can be handled is with actual recursion. The ? operator can be implemented as a function which tries to match its respective argument object. If it succeeds, then it consumes whatever it has matched, otherwise it consumes nothing. That is to say, the regular expression can be converted to a syntax tree, and interpreted "as is" rather than compiled to a state machine (or trivially compiled to functions corresponding to the nodes of the tree), and such interpretation can naturally handle recursion. The interpretation then constitutes, effectively, a syntax-directed recursive-descent parser. (Note how I avoided left recursion in defining letters to make that example compatible with this approach).
Example: parenthesis-matching regex:
par-match := ({par-match})|
This gets compiled to a tree:
          branch-op    <-- "par-match" name points at this node
          /       \
   catenate-op    <empty>
     /      \
   "("    catenate-op
            /      \
    {par-match}    ")"
This can then be converted to a recursive descent parser, or interpreted directly.
Pattern matching starts by invoking the top-level "branch-op". This operator simply tries all of the alternatives. Suppose the input is empty. Then the left alternative will fail: it demands an open parenthesis. So then the right alternative will succeed: empty matches empty. (The operators either "fail" or indicate "success" and consume input.)
But suppose your input is (()). The left catenate-op will in turn invoke its left subtree, which matches and consumes the left parenthesis, leaving ()). It will then invoke its right subtree, another catenate-op. This catenate-op matches its left subtree, which triggers recursion into the top level via the named par-match references. That recursion will match and consume (), leaving ). The catenate-op then invokes its right subtree which matches ). Control returns up to branch-op. (Though the left side of branch-op matched something, branch-op must still try the other alternative; more than one branch can match, and some can match longer than others.)
This is closely related to how Parsing Expression Grammars work.
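To make the walk-through concrete, here is a minimal sketch of par-match written directly as a recursive function rather than as an interpreted tree. The names match_par and is_nested_parens are invented for this example, and the "try every alternative" behaviour of branch-op is simplified to "try the left alternative, otherwise fall back to empty", which is sufficient for this particular pattern.

```cpp
#include <cstddef>
#include <string>

// Tries to match  par-match := "(" par-match ")" | <empty>  at position pos.
// Returns the position just past the match. The empty alternative always
// succeeds, so the result is never less than pos.
std::size_t match_par(const std::string& s, std::size_t pos) {
    // Left alternative: "(" then a recursive par-match, then ")".
    if (pos < s.size() && s[pos] == '(') {
        std::size_t inner_end = match_par(s, pos + 1);  // recursion via the name
        if (inner_end < s.size() && s[inner_end] == ')')
            return inner_end + 1;
    }
    // Right alternative: match the empty string, consuming nothing.
    return pos;
}

// Whole-string match: succeeds only if par-match consumes all of s.
bool is_nested_parens(const std::string& s) {
    return match_par(s, 0) == s.size();
}
```

On the input (()) this consumes all four characters. On ()() it consumes only the first pair, since par-match as defined describes nesting, not sequences.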
Practically speaking, the recursive definition could be encoded into the regex syntax somehow. Say we invent some new operator like (?name:definition) which means "match definition, which is allowed to contain invocations of itself via name". The invocation syntax could be (*name), so that we can write the par-match example as (?par-match:\((*par-match)\)|). The combinations (? and (* are invalid under "classic" regex syntax and so we can use them for extension.
As a final note, regexes correspond to grammars. That is the fundamental connection between regexes and parsing. That is to say, regexes correspond to a particular subset of grammars that describe only regular languages. An example of a grammar which describes a regular language:
S -> A | B
B -> b
A -> A a | c
Although there is A -> A ... recursion, this is still regular, and corresponds to the regex ca*|b, which is just a more compact way to denote the same language. The grammar lets us notate languages that aren't regular and for which we can't write a regex, but as we have seen, we can extend the regex notation and semantics to express some of these things. Regular expressions aren't separate from grammars. The two aren't counterparts, but rather one is a special case or subset of the other.
Parser generators like Yacc, Bison, and derivatives are what you're after. They aren't as widespread as regular expressions because they generate actual C code. There are translations like Jison for example which implement the Yacc/Bison syntax using javascript. I know there are similar tools for other languages.
I get the impression Parsing expression grammar systems are up and coming though.

How to write the regex for this expression

I want to match strings like the one below. I assume the input has the right elements; whether they are evaluable is left to the evaluator!
1+(2-3)*(4/5)
What is the regex for matching this? I tried something like ([0-9\+-\*/\(\)]+)? but it doesn't seem to work.
If you are only asking for a character validation, you can use
^[0-9+*/()-]*$
You don't need to escape these characters in a character class (inside square brackets). And if you must include a hyphen, put it at the start or the end of the class (or escape it); otherwise it is treated as the character range operator.
That said, keep in mind this will only guarantee you that you have no other characters. It will NOT validate the structure (regexes are not the right tool for that). However, since you stated an evaluator will then process the input, that might be right for you.
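If no regex library is at hand, the same character-only check is a one-liner with the C string functions. A small sketch in C++; the function name is made up:

```cpp
#include <cstring>

// True if every character of expr is a digit, one of + - * / or a
// parenthesis. This checks the alphabet only, not the structure.
bool has_only_expression_chars(const char* expr) {
    static const char allowed[] = "0123456789+-*/()";
    // strspn returns the length of the leading run made of allowed characters.
    return std::strspn(expr, allowed) == std::strlen(expr);
}
```

Like the regex, this accepts 1+(2-3)*(4/5) but also nonsense such as )(+*, so the evaluator still has to do the real validation.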
You can't; this is not a regular language. Some regexp implementations do provide additional features to match balanced parentheses, though.
Regular expressions can not match arbitrary arithmetic formulas. Regexps only describe regular languages, while arithmetic formulas use a recursive grammar. See http://en.wikipedia.org/wiki/Regular_expression#Formal_language_theory
A regex may be possible if you limit nesting depth, but if you want it all the way, with matching bracket detection, it will probably be very, very complicated.
If you want to match "1+(2-3)*(4/5)", then you can use this regular expression.
/1\+\(2-3\)\*\(4\/5\)/
What's that? That doesn't tell you what you want to know? Well, then what do you want to know? What information are you trying to extract from the string?
You can't just say "strings like this". Your question is not nearly clear enough.
If your question is how to check whether an equation is valid, then you will need a parser to tokenize the expression, then a grammar to check whether the expression is right.
You can't check whether the equation has balanced parentheses using a regex. This is because a regular expression is equivalent to a deterministic finite automaton. Since the automaton is finite, you will never have an automaton big enough to check arbitrarily deep nesting of parentheses.

Finding Elvis ?:

I have been tasked to find Elvis (using eclipse search). Is there any regex that I can use to find him?
The "Elvis Operator" (?:) is a shortening of Java's ternary operator.
I have tried \?[\s\S]*[:] but it doesn't match multiline.
Is there such a refactoring where I could change Elvis into an if-else block?
Edit
Sorry, I had posted a regex for the ternary operator, if your problem is multiline you could use this:
\?(\p{Z}|\r\n|\n)*:
You'll need to explicitly match line delimiters if you want to match across multiple lines. \R will match any of them (platform-independent), in Eclipse 3.4 anyway, or you can use the proper one for your file (\r, \n, \r\n). E.g. \?.*\R*.*: will work if there's only one line break. You can't use \R in a character class, though, so if you don't know how many lines the operator might span, you'd have to construct a character class with your line delimiter and any character that might appear in an operand. Something like ([-\r\n\w\s\[\](){}=!/%*+&^|."']*)\?([-\r\n\w\s\[\](){}=!/%*+&^|."']*):([-\r\n\w\s\[\](){}=!/%*+&^|."']*). I've included parentheses to capture the operands as groups so you could find and replace.
You've got a pretty big problem, though, if this is Java (and probably any other language). The ternary conditional ?: operator creates an expression, while an if statement is not an expression. Consider:
boolean even = true;
int foo = even ? 2 : 3;
int bar = if (even) 2 else 3;
The third line is syntactically incorrect; the two conditional constructs are not equivalent. (What you'd actually get from the second line if you used my regex to find and replace is if (int foo = even) 2 else 3; which has additional problems.)
So, you can find the ?: operators with the regex above (or something similar; I may have missed some characters you need to include in the class), but you won't necessarily be able to replace them with 'if' statements.

Regular expressions Lexical Analysis

Why can't repeated strings such as
[wcw|w is a string of a's and b's]
be denoted by regular expressions?
Please give me a detailed answer, as I am new to lexical analysis.
Thanks...
Regular expressions in their original form describe regular languages/grammars. Those cannot contain nested structures as those languages can be described by a simple finite state machine. Simplified you can picture that as if each word of the language grows strictly from left to right (or right to left), where repeating structures have to be explicitly defined and are static.
What this means is that no information from previous states, beyond the current state itself, can be carried over to later states (a few characters further in the input). So if you have your symbol w, you can't specify that the input must have exactly the same string w later in the sequence. Similarly, you can't ensure that each opening parenthesis has a matching closing parenthesis (so regular expressions themselves are not even a regular language and thus cannot be described by regular expressions :-)).
In theoretical computer science we worked with a very restricted set of regex operators, basically only consisting of sequence, alternative (|) and repetition (*), everything else can be described with those operations.
However, usually regex engines allow grouping of certain sub-patterns into matches which can then be referenced or extracted later. Some engines even allow to use such a backreference in the search expression string itself, thereby allowing the expression to describe more than just a regular language. If I remember correctly such use of backreferences can even yield languages that are not context-free.
Additional pointers:
This StackOverflow question
Wikipedia
It can be, you just can't assure that it's the same string of "a"s and "b"s because there's no way to retain the information acquired in traversing the first half for use in traversing the second.
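For contrast, here is a small sketch of a recognizer for wcw in ordinary code (the function name is_wcw is made up). It works only because it can look back at the first half of the input, which can be arbitrarily long. A finite automaton has a fixed number of states and so cannot remember an arbitrarily long w.

```cpp
#include <cstddef>
#include <string>

// True if s has the form w c w, where w consists only of 'a' and 'b'.
bool is_wcw(const std::string& s) {
    if (s.size() % 2 == 0) return false;  // |w| + 1 + |w| is always odd
    std::size_t half = s.size() / 2;      // index of the middle character
    if (s[half] != 'c') return false;
    for (std::size_t i = 0; i < half; ++i) {
        if (s[i] != 'a' && s[i] != 'b') return false;
        // The second half must repeat the first half exactly.
        if (s[i] != s[half + 1 + i]) return false;
    }
    return true;
}
```

Here abcab is accepted while abcba is rejected, even though both halves of the latter are valid strings of a's and b's.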