Let's say I have the line below:
one two three
Is it possible to write a regex that would return the line below?
one three
I can of course get each part in a separate group, but is it possible to capture that in a single match?
To put it simply: no, it can't be done (as discussed in the comments on your original question).
To see why, let's look at it a bit more generally. A regular expression can be modelled as a finite automaton (an NFA, or an equivalent and often much larger DFA). Most regex engines in practice are backtracking matchers rather than true DFAs, but the consequence is the same: the engine walks the input from left to right and checks whether the current character matches the current token of the pattern. If not, it backtracks and tries the remaining possibilities at that point (for example, the other branches of an alternation |). If none of them works, it halts and reports that it cannot match. Because the input is consumed in sequential order, a match is always one contiguous substring of the input, so what you're asking for (a single match that contains "one" and "three" but skips "two") is impossible by definition.
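As a rough illustration of the usual fallback, here is a minimal Python sketch that takes one contiguous match and assembles the wanted text from its groups outside the regex. The pattern and the variable names are only assumptions made for this example.

```python
import re

line = "one two three"

# A single match always covers one contiguous stretch of the input;
# the two wanted words are captured as groups inside it.
m = re.fullmatch(r"(\w+)\s+\w+\s+(\w+)", line)

whole_match = m.group(0)       # the contiguous match itself
joined = " ".join(m.groups())  # "one three" is built outside the regex

print(whole_match)
print(joined)
```

The joining step happens in the host language, which is exactly the part a single match cannot do on its own.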
I am trying to write a regex which excludes certain characters from a class based on the current content of capturing groups. The specific task that made me look for such a thing was to match lowercase letters in alphabetical order.
I searched through Rex's page (https://www.rexegg.com/regex-class-operations.html) to see if there was any way to change the class' content, but was unable to find anything.
Take the following attempt as a brief example: ([a-z])[a-z--[\1]]
Though it's not a correct regular expression, it demonstrates the concept. The idea is that it would match two letters that are not the same.
Note: the expression shown follows a Python-like syntax (set difference written as --), and the same idea can also be written as:
([a-z])[a-z&&[^\1]] or ([a-z])(?!\1)[a-z]
But I am going to use the Python syntax.
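For reference, the lookahead spelling of the "two different letters" idea, written as (?!\1), already works in engines without class set operations. A minimal sketch in Python's built-in re module (which, as far as I know, does not implement class set operations); the helper name is made up for this example:

```python
import re

# A letter, then any letter that is not the one just captured.
two_different = re.compile(r"([a-z])(?!\1)[a-z]")

def is_two_different_letters(s: str) -> bool:
    """True if s is exactly two distinct lowercase letters."""
    return two_different.fullmatch(s) is not None

print(is_two_different_letters("ab"))
print(is_two_different_letters("aa"))
```

This only excludes a single captured character; it does not provide the dynamic range that the later patterns in this question need.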
In the examples above the nested brackets are optional (in certain engines), but for the ultimate goal they are necessary. The pattern I am trying to match the ordered letters with would be something like this:
^(?:([a-z])([a-z--[a-(?(2)\2|\1)]])*+)?$
The first character class matches a letter, which is immediately captured by the first group, meaning that the letter will be excluded from the class containing the conditional. The first time the second group tries to match, the condition inside the conditional evaluates to false, since there has not been a second capture yet, so it "matches" the first group's content, which should result in the exclusion of the first letter from the class. In later steps the second group will be set, meaning that all the letters between 'a' and the most recently captured letter will be excluded.
I know, it seems complicated. Maybe refactoring the pattern will help; take a look at this one:
^(?:([a-z])([(?(2)\2|\1)-z])*+)?$
This example makes no use of set operations, but the idea is roughly the same. The first group matches a letter, then the class inside the second group matches anything between the captured letter and 'z', which is noted by the [(?(2)\2|\1)-z] part. The conditional is there to ensure that the lower boundary of the character interval is the most recently captured character.
This could also be written using subroutine calls, but I doubt it would solve the problem. The issue might be that the classes are precompiled (and so are subroutines), so they cannot change during the matching process.
Are you guys aware of a workaround or an engine that supports such operations? I am interested in the dynamic class operation itself rather than a different way to match alphabetically ordered letters.
In my PCRE regular expression I used an atomic group to reduce backtracks.
<\/?\s*\b(?>a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|em(?:bed)?|f(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|h(?:[1-6r]|ead(?:er)?|tml)|i(?:frame|mg|nput|ns)?|kbd|l(?:abel|egend|i(?:nk)?)|m(?:a(?:in|p|rk)|et(?:a|er))|n(?:av|o(?:frames|script))|o(?:bject|l|pt(?:group|ion)|utput)|p(?:aram|icture|re|rogress)?|q|r[pt]|ruby|s|s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|ul?|v(?:ar|ideo)|wbr)\b
REGEX101
But in the regex101 debugger, I see that after the f branch fails, the engine goes on to try the other alternatives. I'm trying to make it stop as soon as the f branch fails, so that it doesn't check the rest of the expression. What's wrong?
I will assume you know what you're doing by using regex here, since there's probably an argument to be made that PCRE is not the best approach to implementing this sort of matching in a "tree"-like fashion. But I'm not fussed about that.
The idea of using conditionals isn't bad, but it adds extra steps in the form of the conditions themselves. Also, you can only branch off in two directions per conditional.
PCRE has a feature called "backtracking control verbs" which allow you to do precisely what you want. They have varying levels of control, and the one I would suggest in this case is the strongest:
<\/?\s*\b(?>a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|em(?:bed)?|f(*COMMIT)(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|h(?:[1-6r]|ead(?:er)?|tml)|i(?:frame|mg|nput|ns)?|kbd|l(?:abel|egend|i(?:nk)?)|m(?:a(?:in|p|rk)|et(?:a|er))|n(?:av|o(?:frames|script))|o(?:bject|l|pt(?:group|ion)|utput)|p(?:aram|icture|re|rogress)?|q|r[pt]|ruby|s|s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|ul?|v(?:ar|ideo)|wbr)\b
https://regex101.com/r/p572K8/2
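Just by adding a single (*COMMIT) verb after the 'f' in the f branch, the number of steps required to find a failure in this case is cut in half.
(*COMMIT) tells the engine to commit to the match at that point. If the rest of the pattern then fails, the whole match fails outright: the engine does not try the remaining alternatives, and it won't even re-attempt the match from a later starting position. Keep in mind that this also means a tag appearing later in the same subject will not be found by that search.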
To fully optimize the expression, you'll have to add (*COMMIT) at every point after branching has occurred.
Another thing you can do is try to re-order your alternatives in such a way as to prioritize those that are found most commonly. That might be something else to consider in your optimization process.
Because that's how an atomic group works. The idea is:
at the current position, find the first sequence that matches the pattern inside atomic grouping and hold on to it.
(Source: Confusion with Atomic Grouping - how it differs from the Grouping in regular expression of Ruby?)
So as long as nothing inside the atomic group has matched, the engine will still iterate through all of its alternatives.
You can use conditionals instead:
</?\s*\b(?(?=a)a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|(?(?=b)b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|(?(?=c)c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|(?(?=d)d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|(?(?=e)em(?:bed)?|(?(?=f)f(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|(?(?=h)h(?:[1-6r]|ead(?:er)?|tml)|(?(?=i)i(?:frame|mg|nput|ns)?|(?(?=k)kbd|(?(?=l)l(?:abel|egend|i(?:nk)?)|(?(?=m)m(?:a(?:in|p|rk)|et(?:a|er))|(?(?=n)n(?:av|o(?:frames|script))|(?(?=o)o(?:bject|l|pt(?:group|ion)|utput)|(?(?=p)p(?:aram|icture|re|rogress)?|(?(?=q)q|(?(?=r)r[pt]|(?(?=r)ruby|(?(?=s)s|(?(?=s)s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|(?(?=t)t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|(?(?=u)ul?|(?(?=v)v(?:ar|ideo)|wbr))))))))))))))))))))))\b
Regex101
This question sounds like a duplicate, but I've looked at a LOT of similar questions, and none fit the bill, either because they restrict their question to a very specific example or to a specific use case (e.g. single chars only), or because you need substitution for a successful approach, or because you'd need to use a programming language (e.g. C#'s split, or Match().Value).
I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
For example, let's say I want to find the reverse of the Regex "over" in "The cow jumps over the moon", it would match The cow jumps and also match the moon.
That's only a simple example of course. The Regex could be something more messy such as "o.*?m", in which case the matches would be: The c, ps, and oon.
Here is one possible solution I found after ages of hunting. Unfortunately, it requires the use of substitution in the replace field, which I was hoping to keep clear. Also, everything else is matched, but only on a character-by-character basis instead of in big chunks.
Just to stress again, the answer should be general-purpose for any arbitrary Regex, and not specific to any particular example.
From post: I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
The answer -
A match is not discontinuous; it is continuous!
Each match is a continuous, unbroken substring. So, within each match there
is no skipping anything within that substring. Whatever matched the
regular expression is included in a particular match result.
So within a single match, there is no inverting (i.e. "match everything but this") that can extend past
the thing being excluded.
This is a tenet of regular expressions.
Further, in this case, since you only want all things NOT something, you have
to consume that something in the process.
This is easily done by just capturing what you want.
So, even with multiple matches, it's not good enough to say (?:(?!\bover\b).)+
because even though it will match up to (but not including) over, on the next match
it will match ver ....
There are ways to avoid this that are tedious, requiring variable length lookbehinds.
But, the easiest way is to match up to over, then over, then the rest.
Several constructs can help. One is \K.
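The pitfall and the "consume it, capture the rest" fix can be tried out in Python. This is only a sketch on the example sentence from the question; Python's built-in re has no \K, so a capture group is used instead, and the variable names are assumptions for this example.

```python
import re

text = "The cow jumps over the moon"

# Naive attempt: the second match restarts one character into "over".
naive = re.findall(r"(?:(?!\bover\b).)+", text)

# Consume "over" in one branch, capture everything else in the other,
# then keep only the matches where the capture group participated.
kept = [
    m.group(1)
    for m in re.finditer(r"\bover\b|((?:(?!\bover\b).)+)", text)
    if m.group(1) is not None
]

print(naive)  # ['The cow jumps ', 'ver the moon']
print(kept)   # ['The cow jumps ', ' the moon']
```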
Unfortunately, there is no magical recipe to negate a pattern.
As you mentioned in your question, when you have an efficient pattern that you use with a match method, the easiest (and most efficient) way to obtain the complement is to use a split method with the same pattern.
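A minimal Python sketch of the split approach, using the two example patterns from the question. The `complement` helper is a hypothetical name introduced here; it walks the match spans itself because `re.split` also returns the contents of any capture groups in the pattern.

```python
import re

text = "The cow jumps over the moon"

# Plain split works when the pattern has no capture groups.
print(re.split(r"over", text))    # ['The cow jumps ', ' the moon']
print(re.split(r"o.*?m", text))   # ['The c', 'ps ', 'oon']

def complement(pattern: str, subject: str) -> list:
    """Return the non-empty pieces of subject that lie between matches."""
    pieces, last = [], 0
    for m in re.finditer(pattern, subject):
        if m.start() > last:
            pieces.append(subject[last:m.start()])
        last = max(last, m.end())
    if last < len(subject):
        pieces.append(subject[last:])
    return pieces

# Capture groups in the pattern do not leak into the result here.
print(complement(r"(o)ver", text))
```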
To do it with the pattern itself, workarounds are:
1. consuming the characters that match the pattern
"other content" is the content until the next pattern or the end of the string.
alternation + capture group:
(pattern)|other content
Then you must check whether the capture group participated in the match to know which branch of the alternation succeeded.
"other content" can be for example described in this way: .*?(?=pattern|$)
With PCRE and Perl, you can use backtracking control verbs to avoid the capture group, but the idea is the same:
pattern(*SKIP)(*FAIL)|other content
With this variant, you don't need to check anything after, since the first branch is forced to fail.
or without alternation:
((?:pattern)*)(other content)
variant in PCRE, Perl, or Ruby with the \K feature:
(?:pattern)*\Kother content
Where \K removes everything to its left from the match result.
2. checking characters of the string one by one
(?:(?!pattern).)*
Although this way is very simple to write (if lookahead is available), it has the drawback of being slow, since every position of the string is tested with the lookahead.
The number of lookahead tests can be reduced if you can use the first character of the pattern (let's say "a"):
[^a]*(?:(?!pattern)a[^a]*)*
3. list all that is not the pattern.
using character classes
Let's say your pattern is /hello/:
([^h]|h(([^eh]|$)|e(([^lh]|$)|l(([^lh]|$)|l([^oh]|$))))*
This way quickly becomes tedious when the number of characters is large, but it can be useful for regex flavors that lack many features, like POSIX regex.
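Workarounds 1 and 2 can be sketched in Python's built-in re (which lacks \K and the backtracking control verbs, so only the portable variants are shown). The sample text and patterns are taken from the question; the variable names are assumptions for this example.

```python
import re

text = "The cow jumps over the moon"
pattern = r"o.*?m"

# Workaround 1: (pattern)|other content, where "other content" is
# .*?(?=pattern|$). Keep the matches where group 1 did not participate,
# and drop the empty match that can occur at the very end of the string.
alternation = re.compile(rf"({pattern})|.*?(?={pattern}|$)")
other = [
    m.group(0)
    for m in alternation.finditer(text)
    if m.group(1) is None and m.group(0)
]
print(other)

# Workaround 2: test each position with a lookahead, and the variant
# that only runs the lookahead at the pattern's first character ("o").
tempered = re.match(r"(?:(?!over).)*", text).group(0)
unrolled = re.match(r"[^o]*(?:(?!over)o[^o]*)*", text).group(0)
print(tempered == unrolled, repr(tempered))
```

Both forms in workaround 2 stop right before "over"; the second one simply gets there with fewer lookahead tests.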
I am trying to implement a pattern matching "syntax" and language.
I know of regular expressions, but these aren't enough for my purposes.
I have identified some "mathematical" operators.
In the examples that follow I will suppose that the subjects of pattern matching are character strings, but that isn't necessary.
Having read the description below, the question is: does anybody know of a mathematical theory that makes this explicit, or any language that takes the same approach in implementing it? I would like to look at it in order to get ideas!
Description of approach:
At first we have characters. Characters may be aggregated to form strings.
A pattern is:
a) a single character
b) an ordered group of patterns with the operator matchAny
c) an ordered group of patterns with the operator matchAll
d) other various operators to see later on.
Explanation:
We have a subject character string and a starting position.
If we check for a match of a single character, then if it matches it moves the current position forward by one position.
If we check for a match of an ordered group of patterns with the operator matchAny, then each element of the group is checked in sequence, and we get a proliferation of starting positions: one new starting position for each element that matches, advanced by the length of that match.
E.G suppose the group of patterns is { "a" "aba" "ab" "x" "dd" } and the string under examination is:
"Dabaxddc" with current position 2 ( counting from 1 ).
Then applying matchAny with the previous group, "a" matches, "aba" matches and "ab" matches, while "x" and "dd" do not match.
After having those matches there are 3 starting positions 3 4 5 ( corresponding to "a" "ab" "aba" ).
We may continue our pattern matching by accepting that there can be more than one starting position. So now we may continue to the next case under examination and check for a matchAll.
matchAll means that all patterns must match sequentially and are applied sequentially.
subcases of matchAll are match0+ match1+ etc.
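The description above can be sketched in Python by treating every pattern as a function from a set of starting positions to a set of resulting positions. This is only one possible reading of the description: the names `literal`, `match_any` and `match_all` are made up for the sketch, positions are 0-based (so position 2 in the text is index 1), and the trailing "x" in the matchAll example is my own addition.

```python
from typing import Callable, Set

# A matcher maps (subject, set of start positions) to a set of end positions.
Matcher = Callable[[str, Set[int]], Set[int]]

def literal(s: str) -> Matcher:
    def run(text: str, starts: Set[int]) -> Set[int]:
        return {p + len(s) for p in starts if text.startswith(s, p)}
    return run

def match_any(*alternatives: Matcher) -> Matcher:
    def run(text: str, starts: Set[int]) -> Set[int]:
        ends: Set[int] = set()
        for alt in alternatives:   # every alternative contributes positions
            ends |= alt(text, starts)
        return ends
    return run

def match_all(*parts: Matcher) -> Matcher:
    def run(text: str, starts: Set[int]) -> Set[int]:
        current = set(starts)
        for part in parts:         # parts are applied one after another
            current = part(text, current)
            if not current:
                break
        return current
    return run

group = match_any(*(literal(s) for s in ("a", "aba", "ab", "x", "dd")))
subject = "Dabaxddc"

after_any = group(subject, {1})
after_all = match_all(group, literal("x"))(subject, {1})
print(sorted(p + 1 for p in after_any))  # 1-based: [3, 4, 5]
print(sorted(p + 1 for p in after_all))  # 1-based: [6]
```

Keeping positions in a set means that duplicates collapse automatically, so the number of live positions never exceeds the length of the subject plus one.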
I have to add that the very act of trying to ask the question has already helped me and cleared up some things for me.
But I would like to know of similar approaches in order to study them.
Please suggest only languages you have actually used, not a bibliography!
I suggest you have a look at the paper "Parsing Permutation Phrases". It deals with recognizing a set of things in any order, where the "things" can be recognizers themselves. The presentation in the paper might be a little different from what you expect; they don't compile to a finite automaton. However, they do give an implementation in a functional language, and that should be helpful to you.
Your description of matching strings against patterns is exactly what a compiler does. In particular, your description of multiple potential matches is highly reminiscent of the way an LR parser works.
If the patterns are static and can be described by an EBNF, then you could use an LR parser generator (such as YACC) to generate a recogniser.
If the patterns are dynamic but can still be formulated as EBNF there are other tools that can be applied. It just gets a bit more complicated.
[In Australia at least, Computer Science was a University course in 1975, when I did mine. YACC dates from around 1970 in its original form. EBNF is even older.]
I'm trying to build a tool that uses something like regexes to find patterns in a string (not a text string, but that is not important right now). I'm familiar with automata theory, i.e. I know how to implement basic regex matching, and output true or false if the string matches my regex, by simulating an automaton in the textbook way.
Say I'm interested in all as that come before bs, with no more as before the bs, so, this regex: a[^a]*b. But I don't just want to find out if my string contains such a part, I want to get as output the a, so that I can inspect it (remember, I'm not actually dealing with text).
In summary: Let's say I mark the a with parentheses, like so: (a)[^a]*b and run it on the input string bcadacb then I want the second a as output.
Or, more generally, can one find out which characters in the input string match which part of the regex? How is it done in text editors? They at least know where the match started, because they can highlight the matches. Do I have to use a backtracking approach, or is there a smarter, less computationally expensive, way?
EDIT: Proper back references, i.e. capturing with parens and referencing with \1, etc. may not be necessary. I do know that back references do introduce the need for backtracking (or something similar) and make the problem (IIRC) NP-hard. My question, in essence, is: Is the capturing part, without the back referencing, less computationally expensive than proper back references?
Most text editors do this by using a backtracking algorithm, in which case recording the match locations is trivial to add.
It is possible to do with a direct NFA simulation too, by augmenting the state lists with parenthesis location information. This can be done in a way that preserves the linear time guarantee. See http://swtch.com/~rsc/regexp/regexp2.html#submatch.
Timos's answer is on the right track, but you cannot tag DFA states, because a DFA state corresponds to a collection of possible NFA states, and so one DFA state might represent the possibility of having passed a paren (but maybe something else too) and if that turns out not to be the case, it would be incorrect to record it as fact. You really need to work on the NFA simulation instead.
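To make the idea concrete, here is a deliberately tiny Python sketch of an NFA simulation whose threads carry the capture position. It is hand-compiled for the single pattern (a)[^a]*b from the question rather than a general engine, the function name is hypothetical, and it reports the earliest-ending match, which is a simplification of what real engines do.

```python
from typing import Dict, Optional, Sequence, Tuple

def find_a_before_b(seq: Sequence[str]) -> Optional[Tuple[int, int]]:
    """Search seq for (a)[^a]*b.

    Returns (index of the captured 'a', end index of the match) or None.
    """
    # At most one thread per NFA state, so each input symbol costs O(1).
    # State 1 means: the 'a' was consumed, we are inside [^a]* waiting for b.
    # The value stored for a state is the capture index its thread carries.
    threads: Dict[int, int] = {}
    for i, sym in enumerate(seq):
        nxt: Dict[int, int] = {}
        if 1 in threads:
            if sym == "b":
                return threads[1], i + 1   # accepting: report the capture
            if sym != "a":
                nxt[1] = threads[1]        # stay in the [^a]* loop
            # on 'a' this thread dies: [^a]* cannot consume it
        if sym == "a":
            nxt[1] = i                     # new thread, capture starts here
        threads = nxt
    return None

print(find_a_before_b("bcadacb"))
```

Because the input is just a sequence of symbols compared with ==, the same function works on a list of non-text tokens as well.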
After you have constructed your DFA for the matching, mark all states which correspond to the first state after an opening parenthesis in the regex. When you visit such a state, save the index of the current input character; when you visit a state which corresponds to a closing parenthesis, also save the index.
When you reach an accepting state, output the two indices. I am not sure if this is the algorithm used in text editors, but that's how I would do it.