Regex: Match all permutations [duplicate] - regex

First of all, I am aware that this is a problem you wouldn't usually use regex for, I am just trying to find out whether this is even possible.
That being said, what I am trying to do is match ALL occurrences of any permutation of a string (for now, I don't care whether overlapping occurrences match or not); for example, if I have the string abc, I want to match all occurrences of abc, acb, bac, bca, cab and cba.
What I have until now is the following regex: (?:([abc])(?!.{0,1}\1)){3} (note: I know that I could use + instead of {0,1}, but that only works for strings with length 3). This kind of works, but if there are two permutations next to each other where a letter of the first one is too close to a letter of the second one (eg. abc cba → c c), the first permutation does not match. Is it possible to solve this using regex?
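The failure described above can be reproduced with a minimal Python sketch. The helper name find_matches is hypothetical, and Python's re module is assumed only for illustration, since the question does not name an engine.

```python
import re

# The pattern from the question: each captured letter must not reappear
# within the next two characters.
PATTERN = re.compile(r"(?:([abc])(?!.{0,1}\1)){3}")

def find_matches(text):
    """Return the full text of every non-overlapping match."""
    return [m.group(0) for m in PATTERN.finditer(text)]

# The c of "abc" sees the c of "cba" two characters ahead, so "abc" is lost.
print(find_matches("abc cba"))   # ['cba']
# With one more separator character the lookahead no longer reaches it.
print(find_matches("abc, cba"))  # ['abc', 'cba']
```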

Direct Approach
[abc]{3} would match too many results since it would also match aab.
In order to not double match a you would need to remove a from the group that follows leaving you with a[bc]{2}.
a[bc]{2} would match too many results since it would also match abb.
In order to not double match b you would need to remove b from the group that follows, leaving you with ab[c]{1} or abc for short.
abc would not match all combinations so you would need another group.
(abc)|([abc]{3}) which would match too many combinations again.
This path leads you down the road of having all permutations listed explicitly in groups.
Can you create combinations so that you do not need to write out all combinations?
(abc)|(acb) could be written as a((bc)|(cb)).
(bc)|(cb) cannot be shortened any further.
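Listing every permutation explicitly does not have to be done by hand. As a minimal sketch (the function name permutation_pattern is hypothetical), the alternation can be generated in Python, assuming the letter set is small enough that the factorial growth stays manageable:

```python
import itertools
import re

def permutation_pattern(letters):
    """Build an alternation that lists every permutation of `letters`."""
    # A set removes duplicates if `letters` contains repeated characters.
    alternatives = sorted({"".join(p) for p in itertools.permutations(letters)})
    return re.compile("|".join(re.escape(alt) for alt in alternatives))

pattern = permutation_pattern("abc")
print(pattern.pattern)                       # abc|acb|bac|bca|cab|cba
print(pattern.findall("abc cba aab bca"))    # ['abc', 'cba', 'bca']
```

The generated pattern has n! alternatives for n distinct letters, so this only scales to short strings.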
Match too many and remove unwanted
Depending on the regex engine, you may be able to express AND as a lookahead so that you can remove matches: "THIS and not THAT" consumes THIS.
(?=[abc]{3})(?=(?!a.a))[abc]{3} would not match aca.
This problem is now similar to the one above, where you need to remove all combinations that would violate your permutations. In this example that is any expression containing the same character multiple times.
(.)\1+ uses a numbered backreference and, on its own, matches the same character repeated. However, numbered references require knowing how many groups exist in the expression, which makes them very brittle: adding a group kills the expression, so ((.)\1+) no longer matches. Relative backreferences exist, but they require knowledge of your specific regex engine; \k<-1> may be what you are looking for. I will assume .NET, since I happen to have a regex tester bookmarked for it.
The permutations that I want to exclude are: nn. n.n .nn nnn
So I create these patterns: ((?<1>.)\k<1>.) ((?<2>.).\k<2>) (.(?<3>.)\k<3>) ((?<4>.)\k<4>\k<4>)
Putting it all together gives me this expression. Note that I used relative back references as they are in .NET; your mileage may vary.
(?=[abc]{3})(?=(?!((?<1>.)\k<1>.)))(?=(?!((?<2>.).\k<2>)))(?=(?!(.(?<3>.)\k<3>)))(?=(?!((?<4>.)\k<4>\k<4>)))[abc]{3}
The answer is yes for a specific length.
Here is some testing data.
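The same "match too many, then exclude" idea can be sketched in Python, which is an assumption here since the answer targets .NET. Python's re module has no \k<...> syntax, so this sketch swaps in plain numbered backreferences inside negative lookaheads; the names NO_REPEATS and find_permutations are hypothetical.

```python
import re

NO_REPEATS = re.compile(
    r"(?!(.)\1)"     # reject nn. (this also rejects nnn)
    r"(?!(.).\2)"    # reject n.n
    r"(?!.(.)\3)"    # reject .nn
    r"[abc]{3}"      # then consume three characters from the allowed set
)

def find_permutations(text):
    """Return every non-overlapping three-letter permutation of a, b, c."""
    return [m.group(0) for m in NO_REPEATS.finditer(text)]

print(find_permutations("abc cba aab aca bca"))  # ['abc', 'cba', 'bca']
```

As in the .NET version, the pattern is tied to length 3: a longer string needs one exclusion per pair of positions.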

Related

Regex match a string within 2 different strings containing other characters

Given bar(alvin the chipmunk dude) and chipmunk(alvin the chipmunk dude), how would you match the word "chipmunk" only on the "bar" function?
Another question I just asked, but without the needed complexity I was looking for, is answered here. I do not believe this is a duplicate, given the answer to that question from @revo. That answer does answer the other question; however, I see no way to adapt it to ensure the match is contained within two different strings ("bar(" and ")").
chipmunk(?=[^\)\(\\]*(?:\\.[^\)\(\\]*)*\)) (courtesy of @revo) matches "chipmunk" inside of the parentheses, but I want to constrain it to being only within "bar(" and ")".
Test here.
Using JetBrains IDE which uses Java.
Since you are using a Java regex library, you may leverage the constrained-width lookbehind feature:
Java accepts quantifiers within lookbehind, as long as the length of the matching strings falls within a pre-determined range. For instance, (?<=cats?) is valid because it can only match strings of three or four characters. Likewise, (?<=A{1,10}) is valid.
You may use
(?<=bar\([^()]{0,1000})chipmunk
It matches any chipmunk string that is preceded by bar( followed by 0 to 1000 chars other than ( and ).
You may test it at RegexPlanet.com.
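For comparison, Python's re module rejects variable-width lookbehind, so the Java pattern above cannot be used there as-is. The following is a minimal sketch of a different, two-step technique: first find each bar(...) span, then search inside it. The function name chipmunks_in_bar is hypothetical, and unlike the Java pattern this sketch also requires the closing parenthesis.

```python
import re

def chipmunks_in_bar(text):
    """Return the start index of every 'chipmunk' found inside bar(...)."""
    positions = []
    # Step 1: locate each bar(...) call whose argument has no nested parentheses.
    for call in re.finditer(r"bar\(([^()]*)\)", text):
        # Step 2: search only inside the captured argument text.
        for hit in re.finditer(r"chipmunk", call.group(1)):
            positions.append(call.start(1) + hit.start())
    return positions

text = "bar(alvin the chipmunk dude) and chipmunk(alvin the chipmunk dude)"
print(chipmunks_in_bar(text))  # [14]
```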

Why can "a*a+" and "(a{2,3})*a{2,3}" match "aaaa" while "(a{2,3})*" cannot?

My understanding of * is that it consumes as many characters as possible (greedily) but "gives back" when necessary. Therefore, in a*a+, a* would give one (or maybe more?) character back to a+ so it can match.
However, in (a{2,3})*, why doesn't the first "instance" of a{2,3} gives a character to the second "instance" so the second one can match?
Also, in (a{2,3})*a{2,3} the first part does seem to give a character to the second part.
A simple workaround for your question is to match aaaa with regex ^(a{2,3})*$.
Your problem is that:
In the case of (a{2,3})*, regex doesn't seem to consume as much
character as possible.
I suggest not to think in giving back characters. Instead, the key is acceptance.
Once the regex accepts your string, the matching is over. The pattern a{2,3} only matches aa or aaa. So when matching aaaa with (a{2,3})*, the greedy engine matches aaa first. It then cannot match another a{2,3} because only one a remains. Although the regex engine could backtrack and match an extra a{2,3}, it does not: aaa is already accepted by the regex, so the engine skips the expensive backtracking.
If you add a $ to the end of the regex, it simply tells the regex engine that a partial match is unacceptable. Moreover, the (a{2,3})*a{2,3} case is easy to explain in terms of accepting and backtracking.
The main problem is this:
My understanding of * is that it consumes as many characters as possible (greedily) but "gives back" when necessary
This is completely wrong. It is not what greedy means.
Greedy simply means "use the longest possible match". It does not give anything back.
Once you interpret the expressions with this new understanding everything makes sense.
a*a+ - zero or more a followed by one or more a
(a{2,3})*a{2,3} - zero or more of either two or three a followed by either two or three a (note: the KEY THING to remember is "zero or more", the first part not matching any character is considered a match)
(a{2,3})* - zero or more of either two or three a (this means that after matching three as the last single a left cannot match)
Backtracking is done only if the match fails. However, aaa is a valid match. A negative lookahead (?!a) can be used to prevent the match from being followed by an a.
compare
(aaa?)*
and
(aaa?)*(?!a)
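The patterns discussed in this thread can be checked against aaaa with a short Python sketch. Python's re module is an assumption here (the thread does not name an engine), and the helper name matched_prefix is hypothetical.

```python
import re

def matched_prefix(pattern, text="aaaa"):
    """Return what `pattern` matches when anchored at the start of `text`."""
    m = re.match(pattern, text)
    return m.group(0) if m else None

for pattern in [
    r"a*a+",             # matches all four characters
    r"(a{2,3})*a{2,3}",  # matches all four characters
    r"(a{2,3})*",        # accepted after three characters
    r"^(a{2,3})*$",      # the $ forces a retry as aa + aa
    r"(aaa?)*",          # accepted after three characters
    r"(aaa?)*(?!a)",     # the lookahead forces a retry as aa + aa
]:
    print(pattern, "->", matched_prefix(pattern))
```

Only the two unanchored patterns with nothing after the star stop at aaa; every pattern that demands something after the repetition ends up covering all of aaaa.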

How to invert an arbitrary Regex expression

This question sounds like a duplicate, but I've looked at a LOT of similar questions, and none fit the bill, either because they restrict their question to a very specific example or a specific use case (e.g. single chars only), because you need substitution for a successful approach, or because you'd need to use a programming language (e.g. C#'s split, or Match().Value).
I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
For example, let's say I want to find the reverse of the Regex "over" in "The cow jumps over the moon", it would match The cow jumps and also match the moon.
That's only a simple example of course. The Regex could be something more messy such as "o.*?m", in which case the matches would be: The c, ps, and oon.
Here is one possible solution I found after ages of hunting. Unfortunately, it requires the use of substitution in the replace field which I was hoping to keep clear. Also, everything else is matched, but only a character by character basis instead of big chunks.
Just to stress again, the answer should be general-purpose for any arbitrary Regex, and not specific to any particular example.
From post: I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
The answer -
A match is not discontinuous; it is continuous!
Each match is a continuous, unbroken substring. So, within each match there
is no skipping anything within that substring. Whatever matched the
regular expression is included in a particular match result.
So within a single Match, there is no inverting (i.e. match not this only) that can extend past
a negative thing.
This is a tenet of Regular Expressions.
Further, in this case, since you only want all things NOT something, you have
to consume that something in the process.
This is easily done by just capturing what you want.
So, even with multiple matches, it's not good enough to say (?:(?!\bover\b).)+
because even though it will match up to (but not including) over, on the next match
it will match ver ....
There are ways to avoid this that are tedious, requiring variable length lookbehinds.
But, the easiest way is to match up to over, then over, then the rest.
Several constructs can help. One is \K.
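The pitfall with (?:(?!\bover\b).)+ described above is easy to see in a short Python sketch (Python's re module is an assumption; the answer does not name an engine):

```python
import re

text = "The cow jumps over the moon"

# Each character is consumed only if "over" does not start at that position.
# The scan stops right before "over", but the next attempt starts at the "v",
# where the lookahead no longer sees the whole word.
chunks = re.findall(r"(?:(?!\bover\b).)+", text)
print(chunks)  # ['The cow jumps ', 'ver the moon']
```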
Unfortunately, there is no magical recipe to negate a pattern.
As you mentioned in your question, when you have an efficient pattern that you use with a match method, the easiest (and most efficient) way to obtain the complement is to use a split method with the same pattern.
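A minimal sketch of the split approach in Python, using the two examples from the question. The function name complement_by_split is hypothetical, and the pattern is assumed to contain no capturing groups, because re.split would then also return the captured text.

```python
import re

def complement_by_split(pattern, text):
    """Return the pieces of `text` that lie between matches of `pattern`."""
    # Empty pieces appear when a match sits at the very start or end.
    return [piece for piece in re.split(pattern, text) if piece]

text = "The cow jumps over the moon"
print(complement_by_split(r"over", text))   # ['The cow jumps ', ' the moon']
print(complement_by_split(r"o.*?m", text))  # ['The c', 'ps ', 'oon']
```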
To do it with the pattern itself, workarounds are:
1. consuming the characters that match the pattern
"other content" is the content until the next pattern or the end of the string.
alternation + capture group:
(pattern)|other content
Then you must check if the capture group exists to know which part of the alternation succeeds.
"other content" can be for example described in this way: .*?(?=pattern|$)
With PCRE and Perl, you can use backtracking control verbs to avoid the capture group, but the idea is the same:
pattern(*SKIP)(*FAIL)|other content
With this variant, you don't need to check anything after, since the first branch is forced to fail.
or without alternation:
((?:pattern)*)(other content)
variant in PCRE, Perl, or Ruby with the \K feature:
(?:pattern)*\Kother content
Where \K removes all on the left from the match result.
2. checking characters of the string one by one
(?:(?!pattern).)*
While this approach is very simple to write (if lookahead is available), it has the drawback of being slow, since every position of the string is tested with the lookahead.
The number of lookahead tests can be reduced if you can use the first character of the pattern (let's say "a"):
[^a]*(?:(?!pattern)a[^a]*)*
3. list all that is not the pattern.
using character classes
Let's say your pattern is /hello/:
([^h]|h(([^eh]|$)|e(([^lh]|$)|l(([^lh]|$)|l([^oh]|$)))))*
This approach quickly becomes tedious when the number of characters is large, but it can be useful for regex flavors that lack many features, such as POSIX regex.
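Workaround 1 (alternation plus a capture group) can be sketched in Python as follows. The function name complement is hypothetical, and the sketch assumes the pattern never matches the empty string and contains no numbered backreferences, since wrapping it in a group shifts the group numbers.

```python
import re

def complement(pattern, text):
    """Return everything in `text` that `pattern` does not match."""
    # Branch 1 consumes a real match; branch 2 consumes "other content"
    # lazily, up to the next match or the end of the string.
    combined = re.compile(rf"({pattern})|.*?(?=(?:{pattern})|$)", re.DOTALL)
    pieces = []
    for m in combined.finditer(text):
        # Keep a piece only when branch 2 succeeded and consumed something.
        if m.group(1) is None and m.group(0):
            pieces.append(m.group(0))
    return pieces

text = "The cow jumps over the moon"
print(complement(r"over", text))   # ['The cow jumps ', ' the moon']
print(complement(r"o.*?m", text))  # ['The c', 'ps ', 'oon']
```

Checking whether group 1 participated is the "check if the capture group exists" step; the empty match that branch 2 produces at the end of the string is filtered out.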

Is it possible to match any wide character that appears more than once using only regexp?

For example, in this string with no \s:
abodnpjdcqe
only d should be matched.
But in my case there are thousands of different characters. Is it possible to use ONLY regexp to match all characters that appear in the string more than once? It seems that all other problems use other tools.
It is possible to find characters that appear twice in a string, as anubhava demonstrates, and I don't see any other regex pattern that does it.
However, there are problems with an only regex way:
1. The complexity of this kind of pattern is very high, and you will experience problems (with backtracking limits and execution time) if your string is long and there are few duplicates.
2. This way is unable to see whether a duplicate character has already been found. For example, with the string a123a456a789a, the pattern will return a three times instead of once. If your goal is to obtain a list of unique duplicate characters, this can be problematic (but easy to solve programmatically).
So, to answer your question: my answer is no.
A simple way to do it with code is to loop over the characters of your string and build an associative array where the keys are the characters and the values are the number of occurrences. Then remove each item whose value is 1 and extract the keys.
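The counting approach just described can be sketched in Python, with collections.Counter playing the role of the associative array. Python is an assumption here, and the function name repeated_characters is hypothetical.

```python
from collections import Counter

def repeated_characters(text):
    """Return each character that occurs more than once, in first-seen order."""
    counts = Counter(text)  # maps character -> number of occurrences
    return [ch for ch, n in counts.items() if n > 1]

print(repeated_characters("abodnpjdcqe"))    # ['d']
print(repeated_characters("a123a456a789a"))  # ['a']
```

This makes a single pass over the string, so it stays linear in the input length regardless of how many distinct characters there are.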
Note: you can solve the problem of duplicate results (2.) using this pattern:
(.)(?=(?:(?!\1).)*\1(?:(?!\1).)*$)
or if possessive quantifiers are available:
(.)(?=(?:(?!\1).)*+\1(?:(?!\1).)*+$)
but I'm afraid that the complexity may be even higher.
So using your favorite language remains by far the best way.
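A minimal Python sketch of the first pattern from the note (the one without possessive quantifiers) shows that each duplicated character is reported once. The names UNIQUE_DUPLICATES and unique_duplicates are hypothetical, and re.DOTALL is an added assumption so that newlines count as characters too.

```python
import re

# A character matches only if exactly one more copy of it follows,
# so only its second-to-last occurrence is reported.
UNIQUE_DUPLICATES = re.compile(r"(.)(?=(?:(?!\1).)*\1(?:(?!\1).)*$)", re.DOTALL)

def unique_duplicates(text):
    """Return each repeated character once, using only the regex."""
    return UNIQUE_DUPLICATES.findall(text)

print(unique_duplicates("a123a456a789a"))  # ['a']
print(unique_duplicates("abodnpjdcqe"))    # ['d']
```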
You can use this regex:
([a-zA-Z])(?=.*\1)
Explanation:
Regex uses ([a-zA-Z]) to match any letter and captures it as group #1 i.e. \1
A positive lookahead (?=.*\1) then makes sure this match is successful only when it is followed by at least one of the backreference \1 i.e. the character itself.
RegEx Demo
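As a quick check of this pattern in Python (an assumption, since the answer links to an online demo instead), the sketch below also shows the repeated results mentioned in the other answer and one way to collapse them. The names HAS_LATER_COPY and letters_seen_again are hypothetical.

```python
import re

HAS_LATER_COPY = re.compile(r"([a-zA-Z])(?=.*\1)")

def letters_seen_again(text):
    """Return every letter occurrence that is followed by another copy."""
    return HAS_LATER_COPY.findall(text)

print(letters_seen_again("abodnpjdcqe"))    # ['d']
print(letters_seen_again("a123a456a789a"))  # ['a', 'a', 'a']

# dict.fromkeys keeps first-seen order while dropping repeated results.
print(list(dict.fromkeys(letters_seen_again("a123a456a789a"))))  # ['a']
```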

How do I find words with all the specified characters, with repetition?

Is there a way to find the words containing all the given characters, include the repetitive ones, with regular expression? For example, I want to find all words from list
aabc, abbc, bbbc, aaac, aaab, baac, caab, abca
that contain exactly one 'b' and two 'a's, i.e. aabc, baac, caab, and abca (but NOT aaab as it has an additional 'a'). Word length doesn't matter.
While this question
GREP How do I only retrieve words with only the specified letters?
could give me some hints, I wasn't able to extend it so that it will find repetitive characters.
I am just playing with the re module from Python, but there is no restriction on language / tool for the question.
EDIT:
A better example / use case would be: Given a list of words, show only those that contain all the letters entered by a user, e.g. I would like to find all words containing exactly one 'a', two 'd's and one 's'. Is this something regex is capable of? (I already know how to do it without regex.)
To match exactly 2 a's and 1 b (in any order) in your input string use this regex:
(?=^(?:[^a]*a){2}[^a]*$)(?=^[^b]*b[^b]*$)^.+$
Here is a live demo for you.
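Since the question mentions Python's re module, here is a minimal sketch that applies this regex to the word list from the question, one word at a time. The names EXACT, words and matching are hypothetical.

```python
import re

# Exactly two a's and exactly one b, in any order; other letters are allowed.
EXACT = re.compile(r"(?=^(?:[^a]*a){2}[^a]*$)(?=^[^b]*b[^b]*$)^.+$")

words = ["aabc", "abbc", "bbbc", "aaac", "aaab", "baac", "caab", "abca"]
matching = [w for w in words if EXACT.match(w)]
print(matching)  # ['aabc', 'baac', 'caab', 'abca']
```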
If your regex flavor supports lookaheads, then you can use this:
\b(?=.*b)(?=([^a]*a){2}[^a]*\b)[abc]+\b
This requires at least one b and exactly 2 a's, and allows only a, b and c in the string. If you want to require exactly one b and exactly 4 characters in total, use this:
\b(?=[^b]*b[^b]*\b)(?=([^a]*a){2}[^a]*\b)[abc]{4}\b
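For the use case in the question's edit (letter counts entered by a user), the per-letter lookahead can be generated instead of written by hand. This is a minimal Python sketch under the assumption that each word is tested on its own; the function name exact_counts_pattern and the sample words are hypothetical.

```python
import re

def exact_counts_pattern(required):
    """Build a regex for words with exactly `required[ch]` copies of each ch."""
    lookaheads = []
    for ch, n in required.items():
        c = re.escape(ch)
        # Exactly n copies of c, each preceded by any run of other characters.
        lookaheads.append(rf"(?=(?:[^{c}]*{c}){{{n}}}[^{c}]*$)")
    return re.compile("^" + "".join(lookaheads) + ".+$")

pattern = exact_counts_pattern({"a": 1, "d": 2, "s": 1})
words = ["adds", "dads", "sad", "daddy", "saddle", "salads"]
print([w for w in words if pattern.match(w)])  # ['adds', 'dads', 'saddle']
```

Letters that are not listed in the dictionary are unconstrained, which is why saddle is accepted.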