Complicated Search and Replace using RegEx - regex

I'm trying to convert a bunch of custom "recipes" from an old proprietary format to something that is ultimately compatible with C#. And I think that the easiest way to do this would be to use regular expressions. But I'm having trouble figuring out the expression. The piece that I need to convert with this RegEx is the IF statements. Here are a few examples of the original recipes...
IF(A = B,C,D)
IF(AA = BB,IF(E=F,G,H),DD)
IF(S1<>R1,ROUND(ROUND(S2/S1,R2)*S3,R3),R4)
The first one is straightforward... If A = B then C else D.
The second one is similar, except that the IF statements are nested.
And the third one includes additional ROUND function calls in the results.
I've stumbled across regex101.com and have managed to put together the following pattern which is getting close. It works for the first example, but not for the other two: (.*?)IF[^\S\r\n]*\((.*?),(.*?),(.*?)\)
Ultimately, what I want to do is use a regular expression to turn the three examples above into:
if (A == B) { C } else { D }
if (AA == BB) { if (E == F) { G } else { H } } else { DD }
if (S1 <> R1) { ROUND(ROUND(S2/S1,R2)*S3,R3) } else { R4 }
Note that the whitespace in the results is not particularly important. I just formatted it for readability. Also, the "ROUND" functions will be replaced separately with C# Math.Round() calls. No need to worry about those, here. (All I should need to do to them is add, "Math." and fix the capitalization.)
I'll keep plugging away at this, but if anyone out there has the RegEx experience to figure this out, I would appreciate it.
EDIT: With some additional effort, I've expanded my first expression into the following:
(.*?)IF[^\S\r\n]*\((.*?),(([^\(]*)|(.*?\(.*?\))),(([^\(]*)|(.*?\(.*?\)))\)
With this replacement expression:
$1if($2) {$3} else {$6}
I'm almost there. It's just the nested IF statements that are left. (Although I'd prefer to do this in a single pass, if a recursive expression is not going to work, I could rig something up to run the results through it a second time to deal with the nested IF statements. It's not ideal, but if it's the best I have, I could live with it.)

The problem with using regex to parse an arbitrary recursive grammar is that regexes are not well suited to recursion. Some regex implementations offer limited support for recursion, but it's tricky to make it work for anything more complicated than simple balanced parentheses.
That being said, although your particular case looks like a recursive grammar at first sight, it might be possible to cheat.
In
IF(S1<>R1,ROUND(ROUND(S2/S1,R2)*S3,R3),R4)
if it is guaranteed that neither S1<>R1 nor R4 contains a comma, then you can use the following regex:
IF\(([^,]*),(.*),([^,]+)\)
Try it here: https://regexr.com/67r56
How it works: the first group matches everything after IF( up to the first comma. The second group then greedily matches everything to the end of the string and backtracks until the very last comma of the string is "released" from it. After that, the third group matches the "released tail" of the string, up to the closing parenthesis.
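A quick way to check this behaviour outside of an online tester is a few lines of Python. This is only a sketch: the convert_flat_if name is made up for illustration, the replacement template simply mirrors the output format asked for in the question, and operators such as = and <> are left untouched.

```python
import re

# The pattern from the answer: the condition and the else-branch
# must not contain commas for this to work.
FLAT_IF = re.compile(r"IF\(([^,]*),(.*),([^,]+)\)")

def convert_flat_if(recipe):
    """Rewrite one IF(cond,then,else) call (hypothetical helper name)."""
    return FLAT_IF.sub(r"if (\1) { \2 } else { \3 }", recipe)

print(convert_flat_if("IF(S1<>R1,ROUND(ROUND(S2/S1,R2)*S3,R3),R4)"))
# if (S1<>R1) { ROUND(ROUND(S2/S1,R2)*S3,R3) } else { R4 }
```

For the nested example, only the outer IF is rewritten; the inner IF(E=F,G,H) ends up unchanged inside the "then" branch, so a second pass (or a real parser) would still be needed.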
However, as I mentioned in the comments, if S1, R1 or R4 are expressions themselves, this regex trick won't work, and you'd need to use a proper recursive parser. Fortunately, there are plenty of parser and parser-combinator libraries for user-defined grammars (or you might even find one that already works for your grammar). Once your expression is parsed into an AST, it's fairly easy to transform it into the desired form.
Alternatively, you can look into writing your own simple parser. It should be fairly straightforward, as you only care about nested parentheses and commas.
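As a rough illustration of such a hand-written parser, here is a minimal Python sketch. It assumes that IF always takes exactly three arguments and that parentheses are balanced; the function names are hypothetical, and operators such as = and <> are deliberately left as they are.

```python
import re

IF_CALL = re.compile(r"\bIF\s*\(")

def matching_paren(text, open_idx):
    """Return the index of the ')' that closes the '(' at open_idx."""
    depth = 0
    for i in range(open_idx, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced parentheses")

def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts

def convert_ifs(expr):
    """Recursively rewrite IF(cond,then,else) as if (cond) { then } else { else }."""
    m = IF_CALL.search(expr)
    if m is None:
        return expr  # nothing left to convert
    open_idx = m.end() - 1
    close_idx = matching_paren(expr, open_idx)
    args = split_top_level(expr[open_idx + 1:close_idx])
    if len(args) != 3:
        raise ValueError("IF expects exactly three arguments")
    # Each argument may itself contain IF calls, so convert it recursively.
    cond, then, other = (convert_ifs(a.strip()) for a in args)
    rest = convert_ifs(expr[close_idx + 1:])
    return f"{expr[:m.start()]}if ({cond}) {{ {then} }} else {{ {other} }}{rest}"

print(convert_ifs("IF(AA = BB,IF(E=F,G,H),DD)"))
# if (AA = BB) { if (E=F) { G } else { H } } else { DD }
```

Because the arguments are split by tracking parenthesis depth, commas inside ROUND(...) calls or nested IFs do not confuse it, which is exactly the case the single regex cannot handle.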

Related

Regex ignore redundant braces

I am building a lex program that will analyze something like the following...
function myFunc {
if a = b {
print "Cool"
}
}
Is it possible, specifically using flex, to create a regex that will single out everything in the first { },
so that I get
{ if a = b { print "Cool" } }
instead of
{ if a = b { print "Cool" }
Currently in my flex file I have this regex
{[^\0]*}
One problem with what you are trying to do is that RegEx is greedy by default (you could use some tricks to change that, but you'll still have problems), so you will match more than intended if you run this on a file with multiple functions in it. The deeper reason is that the syntax of most programming languages is at least a context-free (Type 2) grammar in the Chomsky hierarchy, while regular expressions describe only regular (Type 3) languages. It is fundamentally impossible to parse the former directly with the latter without a LOT of work. The full explanation for that is ... long. But it boils down to this: recognizing arbitrarily nested constructs requires memory that grows with the nesting depth, and a regular expression has only a fixed, finite amount of state. In your case, you don't want to match any ole' }, you want to match the } that corresponds to an open {, which involves counting the number of { and } you have seen so far.
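The counting idea can be sketched in a few lines of Python. This is not flex code, only an illustration of the logic; the first_brace_block name is made up, and the sketch assumes braces never appear inside string literals or comments.

```python
def first_brace_block(text):
    """Return the first balanced { ... } block in text, or None (hypothetical helper)."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                # This } closes the { found at start.
                return text[start:i + 1]
    return None  # ran out of input before the block was closed

source = 'function myFunc {\nif a = b {\nprint "Cool"\n}\n}\nfunction other {\n}'
print(first_brace_block(source))
```

The loop stops at the closing brace of myFunc and does not run on into the second function, which is what a greedy pattern would do.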
If you really want to do code parsing without having to re-invent the wheel, the plow, fire, steel, and all the way up to electricity, I would recommend that you go check out ANTLR over on GitHub. ANTLR will allow you to create a grammar (if one does not already exist) for the language you are trying to parse and provide the parsed source code to you in the form of a parse tree. Parse trees are very, very easy to use, and ANTLR already has grammars for almost every language imaginable, and plugins for several languages.
Other than that, both the online regex tester I used and Notepad++ with your sample code matched everything. You could try the RegEx {.*} which also matches everything.

Efficient regex to insert missing apostrophes?

I'm trying to build a regular expression and replacement string that I can use to insert missing apostrophes. Examples:
Dont -> Don't
Ill -> I'll
I can get this working with capture groups, but I'm trying to only have to call .Replace one time. Right now I have something like:
$apostropheregex = '\b((didn|won|ain|don)(t)|(i)(ll|m))\b'
$apostrophereplacement='$2$4''$3$5'
But it feels ugly to be mashing together both prefix groups and both postfix groups with the assumption that we only matched one or the other (either a "ll" or a "t" match)
Does anyone have any suggestions? Is there a better way to approach this problem? Should I indeed treat these as two separate scenarios and run replace twice with separate regexes and replacement strings?
Update: To clarify, I'm aware that this could have unintended consequences, replacing strings that shouldn't be replaced since English grammatical context is not considered. I'm running this manually after reviewing strings first and I still think this is an interesting question.
Just a note: This is ill suited for... ill suited, which becomes i'll suited.
But you asked for a better regex and you shall receive. I would use:
\b(?|(don)(t)|(won)(t)|(you)(re))\b
The replacement will be $1'$2.
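The main advantage of this regex is legibility. You should easily be able to add new alternatives.
It works by using the branch reset group (?| ), a feature of Perl and PCRE that not every regex engine supports. Inside it, every alternative numbers its groups from the same starting point, so each alternative uses $1 and $2 (instead of $(2n+1) and $(2n+2) for the n-th alternative).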
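Where branch reset is not available (Python's standard re module, for example, does not have it), a different technique gives the same single-call result: match whole words and let a replacement callback insert the apostrophe. The sketch below is an alternative to, not an implementation of, the regex above; the word list and names are hypothetical.

```python
import re

# Hypothetical word list: each key maps to the index where the apostrophe goes.
SPLIT_AT = {"dont": 3, "wont": 3, "youre": 3}
WORDS = re.compile(r"\b(?:" + "|".join(SPLIT_AT) + r")\b", re.IGNORECASE)

def insert_apostrophes(text):
    """Insert the missing apostrophe into every listed word, keeping its case."""
    def fix(match):
        word = match.group(0)
        k = SPLIT_AT[word.lower()]
        return word[:k] + "'" + word[k:]
    return WORDS.sub(fix, text)

print(insert_apostrophes("Dont worry, youre fine"))
# Don't worry, you're fine
```

Adding a word is a one-line change to the dictionary, and the original capitalization is preserved because the apostrophe is spliced into the matched text rather than replaced from a template.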

Regular expression in C++ for mathematical expressions

I have this problem: I must verify the correctness of many mathematical expressions, in particular checking for consecutive operators + - * /.
For example:
6+(69-9)+3
is ok while
6++8-(52--*3)
no.
I am not using the library <regex> since it is only compatible with C++11.
Is there an alternative method to solve this problem? Thanks.
You can use a regular expression to verify everything about a mathematical expression except the check that parentheses are balanced. That is, the regular expression will only ensure that open and close parentheses appear at the point in the expression they should appear, but not their correct relationship with other parentheses.
So you could check both that the expression matches a regex and that the parentheses are balanced. Checking for balanced parentheses is really simple if there is only one type of parenthesis:
bool check_balanced(const char* expr, char open, char close) {
    int parens = 0;
    for (const char* p = expr; *p; ++p) {
        if (*p == open) ++parens;
        else if (*p == close && parens-- == 0) return false;
    }
    return parens == 0;
}
To get the regular expression, note that mathematical expressions without function calls can be summarized as:
BEFORE* VALUE AFTER* (BETWEEN BEFORE* VALUE AFTER*)*
where:
BEFORE is a sub-regex which matches an open parenthesis or a prefix unary operator (if you have prefix unary operators; the question is not clear).
AFTER is a sub-regex which matches a close parenthesis or, in the case that you have them, a postfix unary operator.
BETWEEN is a sub-regex which matches a binary operator.
VALUE is a sub-regex which matches a value.
For example, for ordinary four-operator arithmetic on integers you would have:
BEFORE: [-+(]
AFTER: [)]
BETWEEN: [-+*/]
VALUE: [[:digit:]]+
and putting all that together you might end up with the regex:
^[-+(]*[[:digit:]]+[)]*([-+*/][-+(]*[[:digit:]]+[)]*)*$
If you have a Posix C library, you will have the <regex.h> header, which gives you regcomp and regexec. There's sample code at the bottom of the referenced page in the Posix standard, so I won't bother repeating it here. Make sure you supply REG_EXTENDED in the last argument to regcomp; REG_EXTENDED|REG_NOSUB, as in the example code, is probably even better since you don't need captures and not asking for them will speed things up.
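To experiment with the combined check without setting up regcomp, the same idea can be sketched in Python. Python's re module has no [[:digit:]] class, so [0-9] stands in for it here; the function names are made up for illustration.

```python
import re

# Same structure as the pattern above, with [0-9] in place of [[:digit:]].
SHAPE = re.compile(r"^[-+(]*[0-9]+[)]*([-+*/][-+(]*[0-9]+[)]*)*$")

def balanced(expr):
    """Check that every ')' has an earlier '(' and nothing is left open."""
    depth = 0
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def looks_valid(expr):
    """Regex check for the overall shape plus a separate balance check."""
    return SHAPE.match(expr) is not None and balanced(expr)

print(looks_valid("6+(69-9)+3"))      # True
print(looks_valid("6++8-(52--*3)"))   # False
```

Note that because BEFORE includes unary + and -, an input such as 6++8 on its own is accepted (it reads as 6 + (+8)); drop the signs from BEFORE if that should be rejected.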
You can loop over each char in your expression.
If you encounter a + you can check whether it is followed by another +, /, *...
Additionally you can group operators together to prevent code duplication.
int i = 0;
while (expression[i] != '\0') {
    switch (expression[i]) {
        case '+':
        case '*':
            // Do your syntax checks here, e.g. inspect expression[i + 1]
            break;
    }
    i++;
}
Well, in the general case, you can't solve this with regex. The "language" of arithmetic expressions can't be described with a regular grammar; it requires a context-free grammar. So if what you want is to check the correctness of an arbitrary mathematical expression, then you'll have to write a parser.
However, if you only need to make sure that your string doesn't have consecutive +-*/ operators then regex is enough. You can write something like this [-+*/]{2,}. It will match substrings with 2 or more consecutive symbols from +-*/ set.
Or something like this ([-+*/]\s*){2,} if you also want to handle situations with spaces like 5+ - * 123
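As a small illustration, here is how the second pattern could be used from Python (the function name is hypothetical):

```python
import re

# Two or more operators in a row, optionally separated by whitespace.
REPEATED_OPS = re.compile(r"([-+*/]\s*){2,}")

def has_consecutive_operators(expr):
    return REPEATED_OPS.search(expr) is not None

print(has_consecutive_operators("6+(69-9)+3"))     # False
print(has_consecutive_operators("6++8-(52--*3)"))  # True
print(has_consecutive_operators("5+ - * 123"))     # True
```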
Well, you will have to define some rules if possible. It's not possible to completely parse mathematical language with Regex, but given some lenience it may work.
The problem is that often the way we write math can be interpreted as an error, but it's really not. For instance:
5--3 can be 5-(-3)
So in this case, you have two choices:
Ensure that the input is parenthesized well enough that no two operators meet
If you find something like --, treat it as a special case and investigate it further
If the formulas are in fact in your favor (have well-defined parentheses), then you can just check for repeats. For instance:
--
+-
+*
-+
etc.
If you have a match, it means you have a poorly formatted equation and you can throw it out (or whatever you want to do).
You can check for this, using the following regex. You can add more constraints to the [..][..]. I'm giving you the basics here:
[+\-\*\\/][+\-\*\\/]
which will work for the following examples (and more):
6++8-(52--*3)
6+\8-(52--*3)
6+/8-(52--*3)
An alternative, probably a better one, is to just write a parser. It will process the equation step by step to check its validity. A parser will, if well written, be 100% accurate. A Regex approach leaves you with a lot of constraints.
There is no real way to do this with a regex because mathematical expressions inherently aren't regular. Heck, even balancing parens isn't regular. Typically this will be done with a parser.
A basic approach to writing a recursive-descent parser (IMO the most basic parser to write) is:
Write a grammar for a mathematical expression. (These can be found online)
Tokenize the input into lexemes. (This will be done with a regex, typically).
Match the expressions based on the next lexeme you see.
Recurse based on your grammar
A quick Google search can provide many example recursive-descent parsers written in C++.
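The steps above can be sketched as follows. The example is in Python rather than C++ to keep it short, and it only validates instead of evaluating. The grammar is an assumption chosen for the question: integers, the four binary operators and parentheses, with no unary operators, so consecutive operators are rejected.

```python
import re

# Grammar (assumed):
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'
NUMBER = re.compile(r"[0-9]+")

def is_valid(text):
    # Tokenize: numbers, operators and parentheses; any other
    # non-space character becomes a token that no rule accepts.
    tokens = re.findall(r"[0-9]+|[-+*/()]|\S", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def factor():
        tok = peek()
        if tok is not None and NUMBER.fullmatch(tok):
            advance()
            return True
        if tok == "(":
            advance()
            if not expr() or peek() != ")":
                return False
            advance()
            return True
        return False

    def term():
        if not factor():
            return False
        while peek() in ("*", "/"):
            advance()
            if not factor():
                return False
        return True

    def expr():
        if not term():
            return False
        while peek() in ("+", "-"):
            advance()
            if not term():
                return False
        return True

    # Valid only if one expression consumes every token.
    return expr() and pos == len(tokens)

print(is_valid("6+(69-9)+3"))     # True
print(is_valid("6++8-(52--*3)"))  # False
```

Each grammar rule becomes one function, and nesting is handled by factor() calling back into expr(), which is what gives the parser the "memory" a plain regex lacks.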

Do regex implementations actually need a split() function?

Is there any application for a regex split() operation that could not be performed by a single match() (or search(), findall() etc.) operation?
For example, instead of doing
subject.split('[|]')
you could get the same result with a call to
subject.findall('[^|]*')
And in nearly all regex engines (except .NET and JGSoft), split() can't do some things like "split on | unless they are escaped \|" because you'd need to have unlimited repetition inside lookbehind.
So instead of having to do something quite unreadable like this (nested lookbehinds!)
splitArray = Regex.Split(subjectString, @"(?<=(?<!\\)(?:\\\\)*)\|");
you can simply do (even in JavaScript which doesn't support any kind of lookbehind)
result = subject.match(/(?:\\.|[^|])*/g);
This has led me to wondering: Is there anything at all that I can do in a split() that's impossible to achieve with a single match()/findall() instead? I'm willing to bet there isn't, but I'm probably overlooking something.
(I'm defining "regex" in the modern, non-regular sense, i. e., using everything that modern regexes have at their disposal like backreferences and lookaround.)
The purpose of regular expressions is to describe the syntax of a language. These regular expressions can then be used to find strings that match the syntax of these languages. That’s it.
What you actually do with the matches depends on your needs. If you're looking for all matches, repeat the find process and collect the matches. If you want to split the string, repeat the find process and split the input string at the positions where the matches were found.
So basically, regular expression libraries can only do one thing: perform a search for a match. Everything else is just an extension.
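To make that concrete, here is a minimal Python sketch of a split built only from repeated match searches. The function name is made up, and the sketch assumes the pattern never matches the empty string.

```python
import re

def split_with_search(pattern, text):
    """Split text at every match of pattern, using only match positions."""
    pieces, last = [], 0
    for m in re.finditer(pattern, text):
        pieces.append(text[last:m.start()])  # text between two matches
        last = m.end()
    pieces.append(text[last:])               # tail after the final match
    return pieces

print(split_with_search(r"\|", "a|b||c"))
# ['a', 'b', '', 'c']
```

For this input the result is the same list that re.split(r"\|", "a|b||c") returns.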
A good example for this is JavaScript where there is RegExp.prototype.exec that actually performs the match search. Any other method that accepts regular expression (e. g. RegExp.prototype.test, String.prototype.match, String.prototype.search) just uses the basic functionality of RegExp.prototype.exec somehow:
// pseudo-implementations
RegExp.prototype.test = function(str) {
    // exec returns null when there is no match
    return RegExp(this).exec(str) !== null;
};
String.prototype.match = function(pattern) {
    return RegExp(pattern).exec(this);
};
String.prototype.search = function(pattern) {
    var m = RegExp(pattern).exec(this);
    return m ? m.index : -1;
};

Using regexp to evaluate search query

Is it possible to convert a properly formed (in terms of brackets) expression such as
((a and b) or c) and d
into a Regex expression and use Java or another language's built-in engine with an input term such as ABCDE (case-insensitive...)?
So far I've tried something along the lines of (b)(^.?)(a|e)* for the search b and (a or e) but it isn't really working out. I'm looking for it to match the characters 'b' and any of 'a' or 'e' that appear in the input string.
About the process - I'm thinking of splitting the input string into an array (based on this Regex) and receiving as output the characters that match (or none if the AND/OR conditions are not met). I'm relatively new to Regex and haven't spent a lot of time on it, so I'm sorry if what I'm asking about is not possible or the answer is really obvious.
Thanks for any replies.
The language of strings with balanced parentheses is not a regular language, which means no (pure) regular expression will match it.
That is because some kind of memory construct, usually a stack, is needed to maintain open parentheses.
That said, many languages offer recursive evaluation in regexes, notably Perl. I don't know the fine details, but I'm not going to bother with them because you can probably write your own parser.
Just iterate over every character in the string, keeping a stack of strings (its depth doubles as a counter of open parentheses). When you reach an open parenthesis, push a new empty string onto the stack; append characters that aren't parentheses to the string on top of the stack. When you reach a closing parenthesis, pop the top string, evaluate the expression you had built up, and append the result to the string that is now on top of the stack.
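A rough Python sketch of this stack-based approach follows. It works on whitespace-separated tokens rather than single characters and keeps lists instead of strings on the stack, which are simplifications of the description above. It also assumes that a term "matches" when it occurs anywhere in the input (case-insensitively) and that "and" binds tighter than "or"; all names are hypothetical.

```python
import re

def eval_flat(items):
    """Evaluate a parenthesis-free list such as [True, 'and', False, 'or', True]."""
    result, group = False, True
    for item in items:
        if item == "or":
            result = result or group   # close the current and-group
            group = True
        elif item != "and":
            group = group and item     # item is a bool
    return result or group

def evaluate_query(query, text):
    text = text.lower()
    stack = [[]]
    for tok in re.findall(r"\(|\)|[^\s()]+", query.lower()):
        if tok == "(":
            stack.append([])           # start a new sub-expression
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced parentheses")
            value = eval_flat(stack.pop())
            stack[-1].append(value)    # hand the result to the enclosing level
        elif tok in ("and", "or"):
            stack[-1].append(tok)
        else:
            stack[-1].append(tok in text)
    if len(stack) != 1:
        raise ValueError("unbalanced parentheses")
    return eval_flat(stack[0])

print(evaluate_query("((a and b) or c) and d", "ABCDE"))  # True
print(evaluate_query("((a and b) or c) and d", "ABC"))    # False
```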
Then again, I'm not fully sure I understand what you're doing. I apologize, then, if this is no help.
I'm not entirely certain I understand what you're trying to do, but here's something that might help. Start with something like
((a and b) or c) and d
And pass it through these substitution statements:
s/or/|/g
s/and| //g
s/([^()|])/(?=.*$1)/g
That will give you
(((?=.*a)(?=.*b))|(?=.*c))(?=.*d)
which is a regex that will match what you want.
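The same three substitutions can be tried out in Python. This is a sketch under the same assumptions as the answer: every search term is a single alphanumeric character, and the words "and" and "or" appear only as operators. The function names are made up.

```python
import re

def query_to_regex(query):
    """Turn a boolean query into a pattern built from lookaheads."""
    pattern = re.sub(r"or", "|", query)                  # s/or/|/g
    pattern = re.sub(r"and| ", "", pattern)              # s/and| //g
    return re.sub(r"([^()|])", r"(?=.*\1)", pattern)     # wrap each term

def query_matches(query, text):
    return re.match(query_to_regex(query), text, re.IGNORECASE) is not None

print(query_to_regex("((a and b) or c) and d"))
# (((?=.*a)(?=.*b))|(?=.*c))(?=.*d)
print(query_matches("((a and b) or c) and d", "ABCDE"))  # True
```

Each lookahead only asserts that its character occurs somewhere in the input, so the pattern tests the boolean condition but does not capture which characters matched.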
No. A regex isn't computationally powerful enough to make sure that the opening and closing parentheses match. You need something that can describe it using a formal grammar.