Regex ignore redundant braces

I am building a lex program that will analyze something like the following...
function myFunc {
if a = b {
print "Cool"
}
}
Is it possible, specifically using flex, to create a regex that will single out everything in the first { }
so I will get
{ if a = b { print "Cool" } }
instead of
{ if a = b { print "Cool" }
Currently in my flex file I have this regex:
{[^\0]*}

One problem with what you are trying to do is that regex matching is greedy by default (you could do some tricks to change that, but you'd still have problems), and you will match more than intended if you run this on a file with multiple functions in it. The deeper reason is a mismatch in the Chomsky hierarchy: regular expressions correspond to Type 3 (regular) grammars, while balanced braces need at least a Type 2 (context-free) grammar, and most programming languages are context-free or even context-sensitive (Type 1). It is fundamentally impossible to parse the stronger class directly with the weaker one. The full explanation for that is ... long, but it boils down to this: in the stronger grammars, what a given element means can depend on where you are in the input, while a regular expression has no memory of what it has seen. In your case, you don't want to match any ole' }, you want to match the } corresponding to an open {, which involves counting the number of { and } you have seen so far.
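To make the counting idea concrete, here is a minimal sketch in Python rather than flex (the helper name first_block is mine; in real flex you would typically do the same thing with a start condition and a depth counter in the action code):

def first_block(text):
    # Return the first balanced {...} block by tracking brace depth.
    start = text.index("{")
    depth = 0
    for i, c in enumerate(text[start:], start):
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None  # unbalanced input

src = 'function myFunc { if a = b { print "Cool" } }'
print(first_block(src))  # { if a = b { print "Cool" } }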
If you really want to do code parsing without having to re-invent the wheel, the plow, fire, steel, and all the way up to electricity, I would recommend that you go check out ANTLR over on GitHub. ANTLR will allow you to create a grammar (if one does not already exist) for the language you are trying to parse and will hand you the parsed source code in the form of a parse tree. Parse trees are very, very easy to use, ANTLR already has grammars for almost every language imaginable, and it has plugins for several languages.
Other than that, both the online regex tester I used and Notepad++ matched everything in your sample code with your pattern. You could also try the regex {.*}, which likewise matches everything.

Complicated Search and Replace using RegEx

I'm trying to convert a bunch of custom "recipes" from an old proprietary format to something that is ultimately compatible with C#. And I think that the easiest way to do this would be to use regular expressions. But I'm having trouble figuring out the expression. The piece that I need to convert with this RegEx is the IF statements. Here are a few examples of the original recipes...
IF(A = B,C,D)
IF(AA = BB,IF(E=F,G,H),DD)
IF(S1<>R1,ROUND(ROUND(S2/S1,R2)*S3,R3),R4)
The first one is straightforward... If A = B then C else D.
The second one is similar, except that the IF statements are nested.
And the third one includes additional ROUND function calls in the results.
I've stumbled across regex101.com and have managed to put together the following pattern which is getting close. It works for the first example, but not for the other two: (.*?)IF[^\S\r\n]*\((.*?),(.*?),(.*?)\)
Ultimately, what I want to do is use a regular expression to turn the three examples above into:
if (A == B) { C } else { D }
if (AA == BB) { if (E == F) { G } else { H } } else { DD }
if (S1 <> R1) { ROUND(ROUND(S2/S1,R2)*S3,R3) } else { R4 }
Note that the whitespace in the results is not particularly important. I just formatted it for readability. Also, the "ROUND" functions will be replaced separately with C# Math.Round() calls. No need to worry about those here. (All I should need to do to them is add "Math." and fix the capitalization.)
I'll keep plugging away at this, but if anyone out there has the RegEx experience to figure this out, I would appreciate it.
EDIT: With some additional effort, I've expanded my first expression into the following... (.*?)IF[^\S\r\n]*\((.*?),(([^\(]*)|(.*?\(.*?\))),(([^\(]*)|(.*?\(.*?\)))\) And with the following replace expression... $1if($2) {$3} else {$6} I'm almost there; it's just the nested IF statements that are left. (And although I'd prefer to do this with a single pass, if a recursive expression is not going to work, I could rig something up to run the results of the expression through it a second time to deal with the nested IF statements. It's not ideal, but if it's the best I have, I could live with it.)
The problem with using regexes to parse an arbitrary recursive grammar is that regexes are not particularly suited to recursion. Some regex implementations have limited support for recursion, but it's tricky to make it work for anything more complicated than simple balanced parentheses.
That being said, for your particular case, although at first sight it looks like a recursive grammar, it might be possible to cheat.
In
IF(S1<>R1,ROUND(ROUND(S2/S1,R2)*S3,R3),R4)
if it is guaranteed that both S1<>R1 and R4 don't contain a comma, then you can use the following regex:
IF\(([^,]*),(.*),([^,]+)\)
Try it here: https://regexr.com/67r56
How it works: the first matching group greedily matches everything from the beginning of the string, until it encounters the first comma, then the second group greedily matches everything to the end, and starts backtracking, until the very last comma of the string is "released" from the second group. After that the third group matches the "released tail" of the string.
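To see the backtracking at work, here is the same regex applied with Python's re.sub (a sketch; the replacement simply rewrites the three captured groups into the if/else form):

import re

s = "IF(S1<>R1,ROUND(ROUND(S2/S1,R2)*S3,R3),R4)"
out = re.sub(r"IF\(([^,]*),(.*),([^,]+)\)",
             r"if (\1) { \2 } else { \3 }", s)
print(out)  # if (S1<>R1) { ROUND(ROUND(S2/S1,R2)*S3,R3) } else { R4 }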
However, as I mentioned in the comments, if S1, R1 or R4 are expressions themselves, this regex trick won't work, and you'd need to use a proper recursive parser. Fortunately, there are plenty of parser-combinator libraries for user-defined grammars (or you might even find one that already works for your grammar). When your expression is parsed into an AST, it's fairly easy to transform it into the desired form.
Alternatively, you can look into writing your own simple parser. It should be fairly straightforward, as you only care about nested parentheses and commas.
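To give a sense of how little code that takes, here is a sketch in Python (the helper names split_args and convert are mine; it only handles top-level commas and parentheses, and it leaves operators such as = untouched, so converting = to == would be a separate substitution):

def split_args(s):
    # split on commas at parenthesis depth 0 only
    parts, depth, cur = [], 0, ""
    for c in s:
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
        if c == "," and depth == 0:
            parts.append(cur)
            cur = ""
        else:
            cur += c
    parts.append(cur)
    return parts

def convert(expr):
    expr = expr.strip()
    if expr.startswith("IF(") and expr.endswith(")"):
        cond, then, other = split_args(expr[3:-1])
        return ("if (" + convert(cond) + ") { " + convert(then) +
                " } else { " + convert(other) + " }")
    return expr

print(convert("IF(AA = BB,IF(E=F,G,H),DD)"))
# if (AA = BB) { if (E=F) { G } else { H } } else { DD }

Because convert() recurses on each argument, nested IF statements come out right in a single pass.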

How do I check if a string is a valid prefix of a regex?

So I'm using Rust to build a parser for my programming language and I need it to, well ... parse. I'm implementing a Stream struct that is responsible for some basic operations on the Vec<char> that it possesses. One of those functions is as follows:
fn consume_while_regex_matches(&self, regex: Regex, mut acc: String) -> (Self, String)
It's a recursive function that takes a regex and an accumulator, which it is supposed to keep filling up until the next char in the stream makes it no longer match the regex. I'd use this function to match simple tokens like integers, strings, etc.
I tried implementing it in the following kind of way (this is not the exact version, but it explains the issue):
acc.push(self.peek());
if !regex.is_match(&acc) {
    acc.pop();
    return (self.clone(), acc);
}
return self.consume_while_regex_matches(regex, acc);
The problem with this code is that it tests whether acc matches the whole regex. Imagine we want to consume the string -42 with a regex like ^-[0-9]+$. The algorithm would read the very first char -, the match would fail, and the accumulator would come back empty.
Is there a way to check that a string (e.g. acc) is a prefix of a potential regex match?
Like - is not a match on its own, but -42 is a match and - is a valid prefix.
And it'd be great if it's like a library way and it doesn't require me to produce my own regex engine.
Update: I'm not using the described function for parsing. I use it for lexing. I am aware that regex is not enough to parse a complex language.
What I'm asking is whether I can use some regex lib to match tokens gradually, as opposed to having to provide the whole string to match against. I'm looking for a way to check whether the underlying DFA is or isn't in the error state by the end of matching a string, without having to write my own regex parser and DFA implementation. If this were possible, I'd pass the - to the integer regex, check that after matching it didn't end up in the ERROR state, and if so, - is a valid prefix.
Taking the question at face value: the crate regex-automata (maintained by one of the regex authors) provides access to the lower-level details of building and running regexes. In particular, you can access and drive the internal DFA (deterministic finite automaton) yourself.
use regex_automata::{Regex, DFA}; // 0.1.9

#[derive(Copy, Clone, Debug, Eq, PartialEq)]
enum PotentialMatch {
    Match,
    CouldMatch,
    DoesntMatch,
}

fn potentially_matches(pattern: &Regex, input: &str) -> PotentialMatch {
    let input = input.as_bytes();
    let dfa = pattern.forward();
    // Walk the DFA byte by byte; a "dead" state means no continuation
    // of the input can ever match.
    let mut state = dfa.start_state();
    for byte in input {
        state = dfa.next_state(state, *byte);
        if dfa.is_dead_state(state) {
            return PotentialMatch::DoesntMatch;
        }
    }
    // Not dead: either we are already in a match state, or some
    // continuation of the input could still reach one.
    if dfa.is_match_state(state) {
        PotentialMatch::Match
    } else {
        PotentialMatch::CouldMatch
    }
}

fn main() {
    let pattern = Regex::new("-[0-9]+").unwrap();
    assert_eq!(potentially_matches(&pattern, ""), PotentialMatch::CouldMatch);
    assert_eq!(potentially_matches(&pattern, "-"), PotentialMatch::CouldMatch);
    assert_eq!(potentially_matches(&pattern, "-1"), PotentialMatch::Match);
    assert_eq!(potentially_matches(&pattern, "-12"), PotentialMatch::Match);
    assert_eq!(potentially_matches(&pattern, "-12a"), PotentialMatch::DoesntMatch);
}
You could probably integrate this state tracking into your implementation to be more performant than calling potentially_matches() from scratch over and over.
Regular expressions are basically a succinct format for state machines that work on inputs of characters/strings, whether finite-state machines or pushdown automata, depending on the complexity of the language you need to accept/reject.
Your approach is to build up the string and re-test it against the full regex each time, but regexes are typically compiled into state machines for efficiency. These state machines break the complex regex down into individual states and then, as they consume the input character by character (or token by token), transition between those states. A final regex may have many different states. As a (somewhat improper) FSM example, [+-][1-9][0-9]* becomes: start state s0 goes to s1 on + or -, s1 goes to s2 on a digit 1-9, s2 loops back to itself on any digit 0-9, and s2 is the accept state; any other input leads to a dead state.
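Hand-coding that machine makes the "valid prefix" idea concrete. A minimal sketch in Python (the helper names step and classify are mine; None plays the role of the dead/error state):

def step(state, ch):
    if state == "s0" and ch in "+-":
        return "s1"
    if state == "s1" and ch in "123456789":
        return "s2"
    if state == "s2" and ch in "0123456789":
        return "s2"
    return None  # dead state: no continuation can ever match

def classify(text):
    state = "s0"
    for ch in text:
        state = step(state, ch)
        if state is None:
            return "no match"
    return "match" if state == "s2" else "valid prefix"

print(classify("-"))    # valid prefix
print(classify("-42"))  # match
print(classify("-4a"))  # no match

A string is a valid prefix exactly when the walk never reaches the dead state, and a full match when it also ends in the accept state, which is the same distinction the regex-automata answer above draws between CouldMatch and Match.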
Is there a way to check that a string (e.g. acc) is a prefix of a potential regex match? Like - is not a match on its own, but -42 is a match and - is a valid prefix. And it'd be great if it's like a library way and it doesn't require me to produce my own regex engine.
You could use pure regex for simpler parsing problems, but if you're building a programming language, you're eventually going to want a parsing library that can handle nesting to arbitrary depth (which requires a system stronger than regexes). In Rust, one of the most popular crates for this is nom. Nom is a parser-combinator library, which means it builds larger parsers by putting smaller parsers together with functions called "combinators." Regular expressions are a weaker form of parser than these, due to the limited number and limited properties of their operators.
I will also note that neither lexing nor parsing does the full job of validating a proper program, which is why we bother going beyond them. Traditional programming languages sit somewhere between context-sensitive grammars and Turing-complete languages, so parsing alone won't satisfy your needs: there is additional context and semantic information that has to be checked by type checking, reference checking, etc.

Regex for extracting functions from C++ code

I have sample C++ code (http://pastebin.com/6q7zs7tc) from which I have to extract function names as well as the number of parameters each function requires. So far I have written this regex, but it's not working perfectly for me.
(?![a-z])[^\:,>,\.]([a-z,A-Z]+[_]*[a-z,A-Z]*)+[(]
You can't parse C++ reliably with regex.
In fact, you can't parse it with weak parsing technology (see Why can't C++ be parsed with a LR(1) parser?). If you expect to extract this information reliably from source files, you will need a time-tested C++ parser; see https://stackoverflow.com/a/28825789/120163
If you don't care that your extraction process is flaky, then you can use a regex and maybe some additional hackery. Your key problem for heuristic extraction is matching various kinds of brackets, e.g., [...], < ... > (which won't quite work for shift operators) and { ... }. Bracket matching requires you to keep a stack of seen brackets. And bracket matching may fail in the presence of macros and preprocessor conditionals.
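If flaky is acceptable, the heuristic is short. A sketch in Python (the helper name extract_functions is mine; it will happily report keywords like if or while as "functions" and is blind to comments, strings, macros, and templates):

import re

def extract_functions(src):
    results = []
    for m in re.finditer(r"\b([A-Za-z_]\w*)\s*\(", src):
        # walk forward to the matching ')' by counting paren depth
        depth, i, args = 1, m.end(), ""
        while i < len(src) and depth:
            c = src[i]
            depth += (c == "(") - (c == ")")
            if depth:
                args += c
            i += 1
        # parameter count = top-level commas + 1 (0 for an empty list)
        d = commas = 0
        for c in args:
            d += (c == "(") - (c == ")")
            if c == "," and d == 0:
                commas += 1
        results.append((m.group(1), 0 if not args.strip() else commas + 1))
    return results

print(extract_functions("int add(int a, int b) { return a + b; }"))
# [('add', 2)]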

Do regex implementations actually need a split() function?

Is there any application for a regex split() operation that could not be performed by a single match() (or search(), findall() etc.) operation?
For example, instead of doing
subject.split('[|]')
you could get the same result with a call to
subject.findall('[^|]*')
And in nearly all regex engines (except .NET and JGSoft), split() can't do some things like "split on | unless they are escaped \|" because you'd need to have unlimited repetition inside lookbehind.
So instead of having to do something quite unreadable like this (nested lookbehinds!)
splitArray = Regex.Split(subjectString, @"(?<=(?<!\\)(?:\\\\)*)\|");
you can simply do (even in JavaScript which doesn't support any kind of lookbehind)
result = subject.match(/(?:\\.|[^|])*/g);
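For instance, in Python (a sketch; note that a pattern that can match the empty string also produces empty hits at each delimiter and at the end, so they need filtering out, which in turn drops genuinely empty fields, one of the subtle ways this differs from a real split):

import re

s = r"a|b\|c|d"
tokens = [t for t in re.findall(r"(?:\\.|[^|])*", s) if t]
print(tokens)  # ['a', 'b\\|c', 'd']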
This has led me to wondering: Is there anything at all that I can do in a split() that's impossible to achieve with a single match()/findall() instead? I'm willing to bet there isn't, but I'm probably overlooking something.
(I'm defining "regex" in the modern, non-regular sense, i. e., using everything that modern regexes have at their disposal like backreferences and lookaround.)
The purpose of regular expressions is to describe the syntax of a language. These regular expressions can then be used to find strings that match the syntax of these languages. That’s it.
What you actually do with the matches depends on your needs. If you're looking for all matches, repeat the find process and collect the matches. If you want to split the string, repeat the find process and split the input string at the positions where the matches were found.
So basically, regular expression libraries can only do one thing: perform a search for a match. Everything else is just an extension.
A good example of this is JavaScript, where RegExp.prototype.exec actually performs the match search. Any other method that accepts a regular expression (e.g. RegExp.prototype.test, String.prototype.match, String.prototype.search) just uses the basic functionality of RegExp.prototype.exec somehow:
// pseudo-implementations
RegExp.prototype.test = function(str) {
    return RegExp(this).exec(str) !== null;
};
String.prototype.match = function(pattern) {
    return RegExp(pattern).exec(this);
};
String.prototype.search = function(pattern) {
    return RegExp(pattern).exec(this).index;
};

Efficiently querying one string against multiple regexes

Let's say that I have 10,000 regexes and one string, and I want to find out if the string matches any of them and get all the matches.
The trivial way to do it would be to just query the string one by one against all regexes. Is there a faster, more efficient way to do it?
EDIT:
I have tried substituting it with DFAs (lex).
The problem here is that a DFA will only give you a single matching pattern. If I have the string "hello" and the patterns "[H|h]ello" and ".{0,20}ello", the DFA will only match one of them, but I want both of them to hit.
This is the way lexers work.
The regular expressions are converted into a single nondeterministic finite automaton (NFA), possibly then transformed into a deterministic one (DFA).
The resulting automaton will try to match all the regular expressions at once and will succeed on one of them.
There are many tools that can help you here; they are called "lexer generators", and there are options for most languages.
You don't say which language you are using. For C programmers I would suggest having a look at the re2c tool. Of course, the traditional (f)lex is always an option.
I've come across a similar problem in the past. I used a solution similar to the one suggested by akdom.
I was lucky in that my regular expressions usually had some substring that must appear in every string they match. I was able to extract these substrings using a simple parser and index them in an FSA using the Aho-Corasick algorithm. The index was then used to quickly eliminate all the regular expressions that trivially don't match a given string, leaving only a few regular expressions to check.
I released the code under the LGPL as a Python/C module. See esmre on Google code hosting.
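The shape of that prefiltering idea, sketched in Python (hypothetical patterns; a plain substring test stands in for the Aho-Corasick index):

import re

# map each required substring to the regexes that need it
candidates = {
    "ello": [re.compile(r"[Hh]ello"), re.compile(r".{0,20}ello")],
    "cat":  [re.compile(r"cats?")],
}

def matching_regexes(s):
    hits = []
    for needle, regexes in candidates.items():
        if needle in s:  # cheap test rules out most patterns up front
            hits.extend(r for r in regexes if r.search(s))
    return hits

print([r.pattern for r in matching_regexes("hello")])
# ['[Hh]ello', '.{0,20}ello']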
We had to do this on a product I worked on once. The answer was to compile all your regexes together into a Deterministic Finite State Machine (also known as a deterministic finite automaton or DFA). The DFA could then be walked character by character over your string and would fire a "match" event whenever one of the expressions matched.
Advantages are that it runs fast (each character is compared only once) and that it does not get any slower if you add more expressions.
Disadvantages are that it requires a huge data table for the automaton, and there are many types of regular expressions that are not supported (for instance, back-references).
The one we used was hand-coded by a C++ template nut in our company at the time, so unfortunately I don't have any FOSS solutions to point you toward. But if you google regex or regular expression with "DFA" you'll find stuff that will point you in the right direction.
Martin Sulzmann has done quite a bit of work in this field.
He has a HackageDB project, explained briefly here, which uses partial derivatives and seems to be tailor-made for this.
The language used is Haskell, so it will be very hard to translate to a non-functional language if that is the desire (I would think translation to many other FP languages would still be quite hard).
The code is not based on converting to a series of automata and then combining them; instead it is based on symbolic manipulation of the regexes themselves.
Also, the code is very much experimental and Martin is no longer a professor but is in 'gainful employment'(1), so he may be uninterested or unable to supply any help or input.
(1) this is a joke - I like professors; the less the smart ones try to work, the more chance I have of getting paid!
10,000 regexen eh? Eric Wendelin's suggestion of a hierarchy seems to be a good idea. Have you thought of reducing the enormity of these regexen to something like a tree structure?
As a simple example: All regexen requiring a number could branch off of one regex checking for such, all regexen not requiring one down another branch. In this fashion you could reduce the number of actual comparisons down to a path along the tree instead of doing every single comparison in 10,000.
This would require decomposing the regexen provided into genres, each genre having a shared test which would rule them out if it fails. In this way you could theoretically reduce the number of actual comparisons dramatically.
If you had to do this at run time you could parse through your given regular expressions and "file" them into either predefined genres (easiest to do) or comparative genres generated at that moment (not as easy to do).
Your example of comparing "hello" to "[H|h]ello" and ".{0,20}ello" won't really be helped by this solution. A simple case where it could be useful: if you had 1,000 tests that can only return true when "ello" exists somewhere in the string, and your test string is "goodbye", you would only have to run the one test for "ello" to know that the 1,000 tests requiring it can't match, and so you wouldn't have to run them.
If you're thinking in terms of "10,000 regexes" you need to shift your thought processes. If nothing else, think in terms of "10,000 target strings to match". Then look for non-regex methods built to deal with "boatloads of target strings" situations, like Aho-Corasick machines. Frankly, though, it seems like something has gone off the rails much earlier in the process than the choice of matching machine, since 10,000 target strings sounds a lot more like a database lookup than a string match.
Aho-Corasick was the answer for me.
I had 2000 categories of things that each had lists of patterns to match against. String length averaged about 100,000 characters.
Main caveat: the patterns to match were all language patterns, not regex patterns, e.g. 'cat' vs r'\w+'.
I was using python and so used https://pypi.python.org/pypi/pyahocorasick/.
import ahocorasick

A = ahocorasick.Automaton()

patterns = [
    [['cat', 'dog'], 'mammals'],
    [['bass', 'tuna', 'trout'], 'fish'],
    [['toad', 'crocodile'], 'amphibians'],
]

for row in patterns:
    vals = row[0]
    for val in vals:
        A.add_word(val, (row[1], val))

A.make_automaton()

_string = 'tom loves lions tigers cats and bass'

def test():
    vals = []
    for item in A.iter(_string):
        vals.append(item)
    return vals
Running %timeit test() on my 2000 categories, with about 2-3 traces per category and a _string length of about 100,000, got me 2.09 ms versus 631 ms doing sequential re.search(): 315x faster!
You'd need to have some way of determining whether a given regex is "additive" compared to another one. Creating a regex "hierarchy" of sorts would allow you to determine that all regexes of a certain branch did not match.
You could combine them in groups of maybe 20.
(?=(regex1)?)(?=(regex2)?)(?=(regex3)?)...(?=(regex20)?)
As long as each regex has zero (or at least the same number of) capture groups, you can look at what was captured to see which pattern(s) matched.
If regex1 matched, capture group 1 would have its matched text. If not, it would be undefined/None/null/...
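Here is that trick in Python, using the two patterns from the question (a sketch; each alternative sits inside its own optional capture group within a lookahead):

import re

patterns = ["[Hh]ello", ".{0,20}ello"]
combined = re.compile("".join("(?=(%s)?)" % p for p in patterns))

print(combined.match("hello").groups())  # ('hello', 'hello')
print(combined.match("jello").groups())  # (None, 'jello')

Note that this tests every pattern at a single position; to scan a whole string you would still slide the combined pattern along it.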
If you're using real regular expressions (the ones that correspond to regular languages from formal language theory, and not some Perl-like non-regular thing), then you're in luck, because regular languages are closed under union. In most regex languages, pipe (|) is union. So you should be able to construct a string (representing the regular expression you want) as follows:
(r1)|(r2)|(r3)|...|(r10000)
where parentheses are for grouping, not matching. Anything that matches this regular expression matches at least one of your original regular expressions.
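For example, in Python, wrapping each alternative in its own group lets m.lastindex report which branch matched (a sketch with two toy patterns):

import re

rs = ["[Hh]ello", "[0-9]+"]
union = re.compile("|".join("(%s)" % r for r in rs))
for m in union.finditer("hello 42"):
    print(m.lastindex, m.group())
# 1 hello
# 2 42

As the question's edit points out, though, alternation still reports only one branch per match position, so overlapping patterns need something like the lookahead trick above or multiple passes.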
I would recommend Intel's Hyperscan if all you need is to know which regular expressions match; it is built for exactly this purpose. If the actions you need to take are more sophisticated, you can also use Ragel, although it produces a single DFA, which can mean many states and consequently a very large executable program. Hyperscan takes a hybrid NFA/DFA/custom approach to matching that handles large numbers of expressions well.
I'd say that it's a job for a real parser. A midpoint might be a Parsing Expression Grammar (PEG): a higher-level abstraction for pattern matching, one feature of which is that you can define a whole grammar instead of a single pattern. There are some high-performance implementations that work by compiling your grammar into bytecode and running it in a specialized VM.
Disclaimer: the only one I know is LPEG, a library for Lua, and it wasn't easy (for me) to grasp the basic concepts.
I'd almost suggest writing an "inside-out" regex engine - one where the 'target' was the regex, and the 'term' was the string.
However, it seems that your solution of trying each one iteratively is going to be far easier.
You could compile the regexes into a hybrid DFA/Büchi automaton where, each time the automaton enters an accept state, you flag which regex rule "hit".
Büchi is a bit of overkill for this, but modifying the way your DFA works could do the trick.
I use Ragel with a leaving action:
action hello {...}
action ello {...}
action ello2 {...}
main := /[Hh]ello/ % hello |
/.+ello/ % ello |
any{0,20} "ello" % ello2 ;
The string "hello" would call the code in the action hello block, then in the action ello block and lastly in the action ello2 block.
Ragel's regular expressions are quite limited, and its state machine language is preferred instead; the braces ({0,20}) from your example only work in the more general machine language.
Try combining them into one big regex?
I think that the short answer is that yes, there is a way to do this, and that it is well known to computer science, and that I can't remember what it is.
The short answer is that you might find your regex engine already deals with all of these efficiently when they're |'d together, or you might find one that does. If not, it's time for you to google string-matching and searching algorithms.
The fastest way to do it seems to be something like this (code is C#):
public static List<Regex> FindAllMatches(string s, List<Regex> regexes)
{
    List<Regex> matches = new List<Regex>();
    foreach (Regex r in regexes)
    {
        if (r.IsMatch(s))
        {
            matches.Add(r);
        }
    }
    return matches;
}
Oh, you meant the fastest code? I don't know then....