Regular expression to match strings that do NOT contain all specified elements - regex

I'd like to find a regular expression that matches strings that do NOT contain all the specified elements, independently of their order. For example, given the following data:
one two three four
one three two
one two
one three
four
Passing the words two three to the regex should match the lines one two, one three and four.
I know how to implement an expression that matches lines that do not contain ANY of the words, matching only line four:
^((?!two|three).)*$
But for the case I'm describing, I'm lost.

Nice question. It looks like you are looking for some AND logic. I am sure someone can come up with something better, but I thought of two ways:
^(?=(?!.*\btwo\b)|(?!.*\bthree\b)).*$
See the online demo
Or:
^(?=.*\btwo\b)(?=.*\bthree\b)(*SKIP)(*F)|^.*$
See the online demo
In both cases, lookaheads mimic the AND logic: a line is rejected only when both words are present, irrespective of their position in the string. If just one of the words is present (or neither), the line matches. Note that the second pattern relies on the (*SKIP)(*F) backtracking verbs, which are a PCRE feature.
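As a rough illustration, the first pattern can be generated for any word list and checked against the sample data from the question. This is only a sketch in Python; the helper name `not_all_of` is made up for the example, and the second pattern is left out because Python's `re` module does not support `(*SKIP)(*F)`.

```python
import re

def not_all_of(words):
    """Compile a pattern matching lines that do NOT contain every word in `words`.

    Hypothetical helper, not part of any library.
    """
    # One negative lookahead per word, OR-ed together inside a positive
    # lookahead: the line passes if at least one word is missing.
    missing_any = "|".join(rf"(?!.*\b{re.escape(w)}\b)" for w in words)
    return re.compile(rf"^(?={missing_any}).*$")

lines = ["one two three four", "one three two", "one two", "one three", "four"]
pattern = not_all_of(["two", "three"])
matched = [line for line in lines if pattern.match(line)]
print(matched)  # ['one two', 'one three', 'four']
```

For the words two and three, the generated pattern is the same as the first expression above.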

Use this pattern:
(?!.*two.*three|.*three.*two)^.*$
See Demo

Related

Regex: Match all permutations [duplicate]

This question already has answers here:
Regex to match all permutations of {1,2,3,4} without repetition
(4 answers)
Closed 4 years ago.
First of all, I am aware that this is a problem you wouldn't usually use regex for, I am just trying to find out whether this is even possible.
That being said, what I am trying to do is match ALL occurrences of any permutation of a string (for now, I don't care if overlapping occurrences match or not); for example, if I have the string abc, I want to match all occurrences of abc, acb, bac, bca, cab and cba.
What I have until now is the following regex: (?:([abc])(?!.{0,1}\1)){3} (note: I know that I could use + instead of {0,1}, but that only works for strings with length 3). This kind of works, but if there are two permutations next to each other where a letter of the first one is too close to a letter of the second one (eg. abc cba → c c), the first permutation does not match. Is it possible to solve this using regex?
Direct Approach
[abc]{3} would match too many results since it would also match aab.
In order to not double match a you would need to remove a from the group that follows leaving you with a[bc]{2}.
a[bc]{2} would match too many results since it would also match 'abb'.
In order to not double match b you would need to remove b from the group that follows, leaving you with ab[c]{1} or abc for short.
abc would not match all combinations so you would need another group.
(abc)|([abc]{3}) which would match too many combinations again.
This path leads you down the road of having all permutations listed explicitly in groups.
Can you create combinations so that you do not need to write out all combinations?
(abc)|(acb) could be written as a((bc)|(cb)).
(bc)|(cb) I can not shorten that any further.
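If listing every permutation is acceptable, the alternation does not have to be written by hand. The sketch below generates it with `itertools.permutations`; the function name `permutation_pattern` is just an illustrative choice, and the pattern grows factorially with the number of characters.

```python
import re
from itertools import permutations

def permutation_pattern(chars):
    """Build an alternation listing every permutation of `chars` explicitly."""
    # set() drops duplicate orderings when `chars` has repeated letters;
    # sorted() just keeps the generated pattern deterministic.
    alternatives = sorted({"".join(p) for p in permutations(chars)})
    return re.compile("|".join(re.escape(a) for a in alternatives))

pattern = permutation_pattern("abc")
print(pattern.pattern)                 # abc|acb|bac|bca|cab|cba
print(pattern.findall("abc cba aab"))  # ['abc', 'cba']
```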
Match too many and remove unwanted
Depending on the regex engine you may be able to express AND as a look ahead so that you can remove matches. THIS and not THAT consume THIS.
(?=[abc]{3})(?=(?!a.a))[abc]{3} would not match aca.
This problem is now similar to the one above, where you need to remove all combinations that would violate your permutations. In this example, that is any expression containing the same character multiple times.
'(.)\1+' uses a back reference to its own group and matches the same character repeated consecutively, but it requires knowing how many groups exist in the expression and is very brittle: adding groups kills the expression, e.g. ((.)\1+) no longer matches. Relative back references exist, but they require knowledge of your specific regex engine; \k<-1> may be what you are looking for. I will assume .NET since I happen to have a regex tester bookmarked for that.
The permutations that I want to exclude are: nn. n.n .nn nnn
So I create these patterns: ((?<1>.)\k<1>.) ((?<2>.).\k<2>) (.(?<3>.)\k<3>) ((?<4>.)\k<4>\k<4>)
Putting it all together gives me this expression. Note that I used relative back references as they are in .NET; your mileage may vary.
(?=[abc]{3})(?=(?!((?<1>.)\k<1>.)))(?=(?!((?<2>.).\k<2>)))(?=(?!(.(?<3>.)\k<3>)))(?=(?!((?<4>.)\k<4>\k<4>)))[abc]{3}
The answer is yes for a specific length.
Here is some testing data.
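For readers without a .NET tester at hand, the same "match too many, then remove the unwanted" idea can be sketched in Python. This is an adaptation, not the expression above: it uses plain numbered groups inside a single negative lookahead, and the names `PERMUTATION_OF_ABC` and `find_permutations` are invented for the example.

```python
import re

# Reject any 3-character window that repeats a character:
#   (.)\1   -> positions 1 and 2 equal (also covers all three equal)
#   (.).\2  -> positions 1 and 3 equal
#   .(.)\3  -> positions 2 and 3 equal
PERMUTATION_OF_ABC = re.compile(r"(?!(.)\1|(.).\2|.(.)\3)[abc]{3}")

def find_permutations(text):
    # finditer + group(0) because findall would return the capture groups.
    return [m.group(0) for m in PERMUTATION_OF_ABC.finditer(text)]

print(find_permutations("abc cba"))  # ['abc', 'cba']
print(find_permutations("aab aca"))  # []
```

Because the lookahead only inspects the three characters about to be consumed, two permutations next to each other (the abc cba case from the question) do not interfere with one another.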

Regex to match strings containing two of any character but not three

I want a Regex to match strings containing the same character twice (not necessarily consecutive) but not if that character appears three times or more.
For example, given these two inputs:
abcbde
abcbdb
The first, abcbde would match because it contains b twice. However, abcbdb contains b three times, so that would not match.
I have created this Regex, however it matches both:
(\w).*\1{1}
I've also tried to use the ? modifier, however that still matches abcbdb, which I don't want it to.
You need two checks: a first check to ensure no character exists 3 times in the input, and a second check to look for one that exists 2 times:
^(?!.*(\w).*\1.*\1).*?(\w).*\2
This is horribly inefficient compared to, say, using your programming language to construct an array of character frequencies, requiring only 1 pass through the entire input. But it works.
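To make the comparison concrete, here is a small Python sketch with both approaches side by side: the regex above, and a frequency count built with `collections.Counter`. The function names are hypothetical, and the sketch assumes single-line input made of word characters.

```python
import re
from collections import Counter

TWICE_NOT_THRICE = re.compile(r"^(?!.*(\w).*\1.*\1).*?(\w).*\2")

def has_pair_regex(s):
    return TWICE_NOT_THRICE.search(s) is not None

def has_pair_counter(s):
    # Single pass: count every word character, then inspect the tallies.
    counts = Counter(re.findall(r"\w", s)).values()
    return 2 in counts and all(n < 3 for n in counts)

for sample in ("abcbde", "abcbdb"):
    print(sample, has_pair_regex(sample), has_pair_counter(sample))
# abcbde True True
# abcbdb False False
```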

Pattern matching for strings independent from symbols

I need an algorithm which can find pre-defined patterns in data (which is present in the form of strings) independently of the actual symbols/characters of the data and the pattern. I only care about the relations between the symbols, not the symbols themselves. It is also legal to have different pattern symbols for the same symbol in the data. The only thing the pattern matching algorithm has to enforce is that multiple occurrences of the same symbol in the pattern are preserved. To give you an example:
The pattern is abca, so the first and the last letter are the same. For my application, an equivalent way to write this would be 1 2 3 1, where the digits are just variables. The data I have is thistextisatest. The resulting algorithm should give me two correct matches here, text and test. Because only in these two cases, the first and the fourth letter are the same, as in the pattern.
As a second example, the pattern abcd should return 12 matches (one for each position in thistextisat). Since no variable in the pattern is repeated, it is trivially matched everywhere. Even in the case of text and test, because it is legal that the variables a and d of the pattern map to the same symbol.
The goal of this algorithm should be to detect similarities in written language. Imagine having a dictionary of the English language and parsing it with the pattern unseen or equivalently 1 2 3 4 4 2. You would then see that, for example, the word belittle contains the same pattern of letters.
So, now that I hopefully made clear what I need, I have some questions:
What is this algorithm called? Is it a well-known problem that has been solved?
Are there publications on the matter? It is really hard to find anything useful when you don't know the correct search terms to separate this problem from regular pattern matching.
Is there a ready implementation of this?
I have not used Regex for anything too complicated, so I don't know if anything like this would even be possible in Regex, when you basically do not care about the symbols as such, but only consider the pattern of their occurrences.
I'd really appreciate your help!
I don't think you need regular expressions here. Your search term:
unseen
123442
This has six characters, so index each word of your text into 6-mers
belittle
12,12,12,12,11,12,12 2-mers
123,123,123,122,112,123 3-mers
1234,1234,1233,1223,1123 4-mers
12345,12344,12334,12234 5-mers
123455,123442,123314 6-mers
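So just looking at the 6-mers, you've got a match. To allow for the abcd (1234) case matching an abca (1231) word, an n-mer should also count as a match whenever every pair of positions that share a digit in the search term also share a digit in the n-mer. A plain numeric "less than" comparison is not enough for this: 1223 is less than 1231, yet its first and fourth symbols differ.
So given a search term of n characters, just split each word into its constituent n-mers and check that each repeated symbol of the search term is repeated at the same positions in the n-mer.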
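A minimal sketch of this n-mer idea in Python follows. It checks the repetition structure position by position instead of comparing encodings numerically, since the question allows different pattern symbols to map to the same data symbol. All function names (`encode`, `fits`, `find_matches`) are made up for the illustration.

```python
def encode(word):
    """Number each symbol by order of first appearance, e.g. 'unseen' -> (1, 2, 3, 4, 4, 2)."""
    ids = {}
    return tuple(ids.setdefault(ch, len(ids) + 1) for ch in word)

def fits(pattern, window):
    """True if symbols repeated in `pattern` are repeated in `window` too."""
    mapping = {}
    for p, w in zip(pattern, window):
        # Each pattern symbol must always map to the same data symbol;
        # different pattern symbols may share a data symbol.
        if mapping.setdefault(p, w) != w:
            return False
    return True

def find_matches(pattern, data):
    """Return every n-mer of `data` (n = len(pattern)) that fits `pattern`."""
    n = len(pattern)
    windows = (data[i:i + n] for i in range(len(data) - n + 1))
    return [w for w in windows if fits(pattern, w)]

print(find_matches("abca", "thistextisatest"))       # ['text', 'test']
print(len(find_matches("abcd", "thistextisatest")))  # 12
print(find_matches("unseen", "belittle"))            # ['elittl']
```

`fits` works equally well on raw strings or on the tuples produced by `encode`, so the n-mers could be pre-encoded and indexed if a whole dictionary has to be searched repeatedly.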

Why does this regexp return infinite?

Regular expression [13579]?[13579]? returns infinite (as http://regexr.com/ says).
Why? I just want to find two adjacent odd digits (two, not more) 😒.
The ? quantifier in RegEx means zero or one of the preceding element. Since both character sets are optional, your regular expression also matches the empty string, which it finds at every position of the input, as well as one or two odd digits in a row.
You'll probably want something like:
[13579]{2}
Debuggex Demo
Which means exactly two of the preceding set, although the pair may still be part of a longer run of odd digits.
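The difference is easy to see in a quick Python session. The third pattern is an extra assumption on top of the answer: it adds lookarounds so that a pair is only reported when it is not part of a longer run of odd digits, which seems to be what "two, not more" asks for.

```python
import re

text = "a13b 135 7"

# Both sets optional: empty matches show up between the real ones.
optional = re.findall(r"[13579]?[13579]?", text)
# Exactly two odd digits, but also the first two digits of "135".
exactly_two = re.findall(r"[13579]{2}", text)
# Assumed refinement: no odd digit directly before or after the pair.
isolated_pair = re.findall(r"(?<![13579])[13579]{2}(?![13579])", text)

print("" in optional)  # True
print(exactly_two)     # ['13', '13']
print(isolated_pair)   # ['13']
```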

How do I find words with all the specified characters, with repetition?

Is there a way to find the words containing all the given characters, including repeated ones, with a regular expression? For example, I want to find all words from the list
aabc, abbc, bbbc, aaac, aaab, baac, caab, abca
that contain exactly one 'b' and two 'a's, i.e. aabc, baac, caab, and abca (but NOT aaab as it has an additional 'a'). Word length doesn't matter.
While this question
GREP How do I only retrieve words with only the specified letters?
could give me some hints, I wasn't able to extend it so that it handles repeated characters.
I am just playing with the re module from Python, but there is no restriction on language / tool for the question.
EDIT:
A better example / use case would be: Given a list of words, show only those that contain all the letters entered by a user, e.g. I would like to find all words containing exactly one 'a', two 'd's and one 's'. Is this something regex is capable of? (I already know how to do it without regex.)
To match exactly 2 a's and 1 b (in any order) in your input string use this regex:
(?=^(?:[^a]*a){2}[^a]*$)(?=^[^b]*b[^b]*$)^.+$
Here is a live demo for you.
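Since the question mentions Python's re module, here is a sketch that builds this kind of expression from user-supplied letter counts, one lookahead per letter. The helper name `exact_counts_pattern` is hypothetical, and each word is tested on its own rather than searched for inside a longer text.

```python
import re

def exact_counts_pattern(counts):
    """Build a pattern requiring each character exactly `counts[ch]` times."""
    lookaheads = []
    for ch, n in counts.items():
        c = re.escape(ch)
        # (?:[^c]*c){n}[^c]*$ -> exactly n occurrences of c up to the end.
        lookaheads.append(rf"(?=(?:[^{c}]*{c}){{{n}}}[^{c}]*$)")
    return re.compile("^" + "".join(lookaheads) + ".+$")

words = ["aabc", "abbc", "bbbc", "aaac", "aaab", "baac", "caab", "abca"]
pattern = exact_counts_pattern({"a": 2, "b": 1})
print([w for w in words if pattern.match(w)])
# ['aabc', 'baac', 'caab', 'abca']
```

Letters that are not listed in `counts` are unrestricted, so the use case from the EDIT would be `exact_counts_pattern({"a": 1, "d": 2, "s": 1})`.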
If your regex flavor supports lookaheads, then you can use this:
\b(?=.*b)(?=([^a]*a){2}[^a]*\b)[abc]+\b
This requires at least one b and exactly 2 a's, and allows only a, b and c in the string. If you want to require exactly one b and exactly 4 characters in total, use this:
\b(?=[^b]*b[^b]*\b)(?=([^a]*a){2}[^a]*\b)[abc]{4}\b