How to create a regular expression to match non-consecutive characters? - regex

How to create a regular expression for strings of a,b and c such that aa and bb will be rejected?
For example, abcabccababcccccab will be accepted and aaabc or aaabbcccc or abcccababaa will be rejected.

If this is not a purely academic question, you can simply search for aa and bb and negate your logic, for example:
import re
s = 'abcccabaa'
# continue if string does not match.
if re.search('(?:aa|bb)', s) is None:
    ...
or simply scan the string for the two patterns, avoiding expensive regular expressions:
if 'aa' not in s and 'bb' not in s:
    ...
For such an easy task RE is probably total overkill.
P.S.: The examples are in Python but the principle applies to other languages of course.

^(?!.*(?:aa|bb))[abc]+$
See it here on Regexr
This regex does two things:
verify that your string consists only of a, b and c
fail on aa and bb
^ matches the start of the string
(?!.*(?:aa|bb)) negative lookahead assertion, will fail if there is aa or bb in the string
[abc]+ character class, allows only a,b,c at least one (+)
$ matches the end of the string
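As a quick check, the pattern can be run against the strings from the question. This is only a minimal Python sketch; the names PATTERN and is_valid are made up for illustration.

```python
import re

# Pattern from the answer: only a, b, c and no "aa" or "bb" anywhere.
PATTERN = re.compile(r'^(?!.*(?:aa|bb))[abc]+$')

def is_valid(s):
    """Return True if s is made of a/b/c only and contains neither aa nor bb."""
    return PATTERN.search(s) is not None

# Strings taken from the question: the first should pass, the rest should fail.
for s in ['abcabccababcccccab', 'aaabc', 'aaabbcccc', 'abcccababaa']:
    print(s, is_valid(s))
```

Note that [abc]+ requires at least one character, so the empty string is rejected by this pattern.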

Using the & operator (intersection) and ~ (complement):
(a|b|c)*&~(.*(aa|bb).*)
Rewriting this without these operators is tricky. The usual approach is to break it into cases.
In this case it is not all that difficult.
Suppose that the letter c is taken out of the picture. The only sequences then which don't have aa and bb are:
e (empty string)
a
b
b?(ab)*a?
Next what we can do is insert some optional 'c' runs into all possible interior places:
e (empty string)
a
b
(bc*)?(ac*bc*)*a?
Next, we have to acknowledge that illegal sequences like aabb become accepted if non-optional 'c' runs are put in the middle, as in, for example, acacbcbc. We allow a final a and b. This pattern can take care of our lone a and b cases as well as matching the empty string:
(ac+|bc+)*(a|b)?
Then combine them together:
((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?)
We are almost there: we also need to recognize that this pattern can occur an arbitrary number of times, as long as there are dividing 'c'-s between the occurrences, and with arbitrary leading or trailing runs of c-s around the whole thing:
c*((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?)(c+((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?))*c*
Mr. Regex Philbin, I'm not coming up with any cases that this doesn't handle, so I'm leaving it as my final answer.
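One way to gain some confidence in the final pattern is to compare it by brute force against the simple "no aa, no bb" substring check for every short string over a, b, c. The sketch below does that in Python; the helper names and the length limit of 6 are arbitrary choices for illustration, and a clean run on short strings is evidence, not a proof.

```python
import re
from itertools import product

# The final pattern from the answer, split across two lines for readability.
LONG = re.compile(
    r'c*((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?)'
    r'(c+((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?))*c*'
)

def simple_check(s):
    """Reference answer: the string must not contain aa or bb."""
    return 'aa' not in s and 'bb' not in s

def find_mismatches(max_len=6):
    """Return every string over {a, b, c} up to max_len where the two checks disagree."""
    mismatches = []
    for n in range(max_len + 1):
        for chars in product('abc', repeat=n):
            s = ''.join(chars)
            if (LONG.fullmatch(s) is not None) != simple_check(s):
                mismatches.append(s)
    return mismatches

print(find_mismatches())
```

An empty list means the two checks agree on all tested strings, including the empty string, which this pattern accepts.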

Related

Regex match pair occurrences of a specific character

I've been trying to make a regex that satisfies this conditions:
The word consists of characters a,b
The number of b characters must be even (consecutive or not)
So for example:
abb -> accepted
abab -> accepted
aaaa -> rejected
baaab -> accepted
So far I got this: ([a]*)(((b){2}){1,})
As you can see, I know very little about the matter; this checks for pairs, but it still accepts words with an odd number of b's.
You could use this regex to check for some number of a's with an even number of b's:
^(?:a*ba*ba*)+$
This looks for 1 or more occurrences of 2 b's surrounded by some number (which may be 0) of a's.
Demo on regex101
Note this will match bb (or bbbb, bbbbbb etc.). If you don't want to do that, the easiest way is to add a positive lookahead for an a:
^(?=b*a)(?:a*ba*ba*)+$
Demo on regex101
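To see how the two patterns differ, here is a small Python sketch that runs both on the question's examples plus bb. The constant names are invented for this example.

```python
import re

# First pattern: one or more groups containing exactly two b's each.
EVEN_BS = re.compile(r'^(?:a*ba*ba*)+$')
# Second pattern: same, but the lookahead also demands at least one a.
EVEN_BS_WITH_A = re.compile(r'^(?=b*a)(?:a*ba*ba*)+$')

for word in ['abb', 'abab', 'aaaa', 'baaab', 'bb']:
    print(word, bool(EVEN_BS.search(word)), bool(EVEN_BS_WITH_A.search(word)))
```

Both patterns reject aaaa because the + requires at least one pair of b's; only the second one rejects bb.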
Checking an Array of Characters Against Two Conditionals
While you could do this using regular expressions, it would be simpler to check your two rules directly against an Array of characters created with String#chars. For example, using Ruby 3.1.2:
# Rule 1: string contains only the letters `a` and `b`
# Rule 2: the number of `b` characters in the word is even
#
# @return [Boolean] whether the word matches *both* rules
def word_matches_rules word
  char_array = word.chars
  char_array.uniq.sort == %w[a b] and char_array.count("b").even?
end

words = %w[abb abab aaaa baaab]
words.map { |word| [word, word_matches_rules(word)] }.to_h
#=> {"abb"=>true, "abab"=>true, "aaaa"=>false, "baaab"=>true}
Regular expressions are very useful, but string operations are generally faster and easier to conceptualize. This approach also allows you to add more rules or verify intermediate steps without adding a lot of complexity.
There are probably a number of ways this could be simplified further, such as using a Set or methods like Array#& or Array#-. However, my goal with this answer was to make the code (and the encoded rules you're trying to apply) easier to read, modify, and extend rather than to make the code as minimalist as possible.

Regex | Containing "bbb"

I'm trying to create a Regex with chars 'a' and 'b'.
The only rule is that the regex must contain the word 'bbb' somewhere.
These are possible: aabbbaaaaaababa, abbba, bbb, aabbbaa, abbabbba, ...
These are not possible: abba, a, abb, bba, abbaaaabbaaaabba, ...
I have no idea how I can express that.
Any ideas? Thanks in Advance!
Based on the tag "automata", I am guessing you are after the formal regular expression for this formal language. In that case, a regular expression is (a+b)*bbb(a+b)*. The anatomy of this regular expression is the following:
(a+b) gives either "a" or "b"
(a+b)* gives any string of "a"s and "b"s whatever
bbb gives the string bbb only
the whole regular expression describes any string that begins with anything, then has bbb, then ends with anything
To prove this regular expression is correct, note that:
This regular expression only generates strings that contain the substring bbb. This is due to the middle part.
This regular expression generates all strings that contain the substring bbb. Suppose there were some string containing the substring bbb that this regular expression didn't generate. The string either starts with bbb or it doesn't. If it does, then the string is generated by our regular expression by repeating the first (a+b) zero times and the second (a+b) n - 3 times, where n is the length of the string. Otherwise, if it doesn't start with bbb, consider the suffix of length n - 1 as a recursive case. Continue thusly until the subcase does begin with bbb (it eventually must). Because this suffix is describable by our regular expression, the original case must be too since we can just repeat the first (a+b) an additional number of times equal to the depth of recursion.
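For anyone who wants to try this in a programming language: in the formal notation + means union, so (a+b) corresponds to the character class [ab] in most regex engines. A minimal Python sketch under that translation (the function name is made up here):

```python
import re

# Formal (a+b)*bbb(a+b)* written in Python regex syntax.
CONTAINS_BBB = re.compile(r'[ab]*bbb[ab]*')

def contains_bbb(s):
    """True if s consists only of a and b and contains bbb somewhere."""
    return CONTAINS_BBB.fullmatch(s) is not None

# Examples from the question.
for s in ['aabbbaaaaaababa', 'abbba', 'bbb', 'abba', 'a', 'abb', 'bba']:
    print(s, contains_bbb(s))
```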
The pattern is quite simple:
/b{3}/g
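If you need it to match exactly three 'b's (not part of a longer run), an optional [^b]? is not enough, because it would still match inside bbbb. Assuming your engine supports lookbehind, you can use
/(?<!b)b{3}(?!b)/g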
You can use this expression:
(a+b)*(bbb)(a+b)*
The bbb in the middle guarantees that every generated string contains at least three consecutive b's,
and by taking the closure of (a+b) on both sides you can generate any string containing bbb.

regex that will check expression is valid arithmetic expression, for example 'a+b*c=d' is valid and accepted

I made separate regexes for identifiers and operators, but they are not giving the desired result. It should check the whole input string and return valid if it is valid, or invalid if it is not.
import re
identifiers = r'^[^\d\W]\w*\Z'
operators = r'[\+\*\-\/=]'
a = re.compile(identifiers, re.UNICODE)
b = re.compile(operators, re.UNICODE)
c = re.findall(a, 'a+b*c=d')
d = re.findall(b, 'a+b*c=d')
print(c, 'identifiers')
print(d, 'operators')
Result of this snippet is
[] identifiers
['+', '*', '='] operators
I want a result that says whether the input string is valid or invalid, by checking all characters of the input string with both regexes.
I think the issue you're having with your current code is that your identifiers pattern only works if it matches the whole string.
The problem is that the current pattern requires that both the beginning and end of the input be matched (by the ^ and \Z respectively). That usually prevents you from finding any identifiers, since only an input like "foo" would be matched: it is a single identifier that touches both the start and end of the string. (I'd also note that it is a bit odd to mix ^ and \Z together, though it is not invalid. It would just be more natural to pair ^ with $ or \A with \Z.)
I suspect that you don't actually want ^ and \Z in your pattern, but rather should be using \b in place of both. The \b escape matches "word breaks", which means either the start or end of the input, or a change between word-characters and non-word characters.
>>> re.findall(r'\b[^\d\W]\w*\b', 'a+b*c=d', re.U)
['a', 'b', 'c', 'd']
This still isn't going to do what you say you ultimately want (testing the string to ensure it's a valid expression). That's a much more difficult task, and regular expressions are not up to it in general. Certain specific forms of expressions can perhaps be matched with regex, but supporting things like parentheses will break the whole system in a hurry. To identify arbitrary arithmetic expressions, you'd need a more sophisticated parser, which might use regex in some of its steps, but not for the whole thing.
For the simple cases like your example an expression like this will work:
^[0-9a-z]+([+/*-][0-9a-z]+)+=[0-9a-z]+$
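A small Python sketch of how that pattern could be wrapped into a valid/invalid check. The function name is hypothetical, and the pattern only covers lowercase letters, digits, and at least one operator before the = sign; it is not a general expression parser.

```python
import re

# Pattern from the answer: operand (operator operand)+ = operand
SIMPLE_EXPR = re.compile(r'^[0-9a-z]+([+/*-][0-9a-z]+)+=[0-9a-z]+$')

def is_simple_expression(s):
    """Return True if s has the simple form handled by the pattern above."""
    return SIMPLE_EXPR.search(s) is not None

for s in ['a+b*c=d', 'a+*b=c', 'a+b', '(a+b)*c=d']:
    print(s, 'valid' if is_simple_expression(s) else 'invalid')
```

As written, the pattern also rejects a=b, because the group of operator and operand must occur at least once on the left-hand side.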

ReGex - matching a logical implication

Given the alphabet {a, b, c}, how can I create a simple regular expression which matches exactly those words which meet the following criteria:
If the string "aa" occurs, then consequently "cc" must also occur (Note the logical implication).
The order of occurrence doesn't matter ("cc" as well as "aa" can occur first).
Due to the former logical implication (if-then relationship), the string "cc" can occur even without "aa", but not vice versa.
I am looking for a way to implement this by using these syntax elements (., *, +, ?, |, ) as well as brackets.
Example what should be matched:
cc
abba
bccb
bccaa
ε (epsilon - empty string)
What shouldn't be matched:
aa
aacbcb
abaaba
baaa
caac
I have tried the following: a?b?(ba)*(ccaa)*(aacc)*c*b?
Not sure if I should answer this, since it's homework, but here goes...
First off, you identify the clauses P:"string 'aa' occurs" and Q:"string 'cc' occurs". Then you transform P -> Q into !P v Q. Then you translate this into the regular expression formed of the two parts:
^(((b|c)*(a(b|c)+)*a?)|(.*cc.*))$
The first part rejects any 'aa' groups, and goes like this: allow any number of b's and c's at the beginning (including none); then, if you find an a, force at least one different character after it. An a followed by non-a characters may occur any number of times. At the end we also have the a? to allow the string to end in an a, and we are sure it has no a before it because neither of the two preceding groups ends in a.
The second part is trivial.
Test it here: http://rubular.com/r/mAPHO8bulo
See it here: http://jex.im/regulex/...
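The pattern can also be checked offline against the examples from the question. The following Python sketch assumes the input only contains a, b and c (the . in the second part would otherwise let other characters through); the names are invented for illustration.

```python
import re

# "no aa anywhere" OR "cc somewhere", i.e. !P v Q.
IMPLICATION = re.compile(r'^(((b|c)*(a(b|c)+)*a?)|(.*cc.*))$')

def satisfies_implication(s):
    """True if s has no aa, or has cc (s is assumed to be over {a, b, c})."""
    return IMPLICATION.match(s) is not None

should_match = ['cc', 'abba', 'bccb', 'bccaa', '']
should_not_match = ['aa', 'aacbcb', 'abaaba', 'baaa', 'caac']

for s in should_match + should_not_match:
    print(repr(s), satisfies_implication(s))
```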

Regular expression matching any subset of a given set?

Is it possible to write a regular expression which will match any subset of a given set of characters a1 ... an ?
I.e. it should match any string where any of these characters appears at most once, there are no other characters and the relative order of the characters doesn't matter.
Some approaches that arise at once:
1. [a1,...,an]* or (a1|a2|...|an)*- this allows multiple presence of characters
2. (a1?a2?...an?) - no multiple presence, but relative order is important - this matches any subsequence but not subset.
3. ($|a1|...|an|a1a2|a2a1|...|a1...an|...|an...a1), i.e. write all possible subsequences (just hardcode all matching strings :)) of course, not acceptable.
I also have a guess that it may be theoretically impossible, because during parsing the string we will need to remember which character we have already met before, and as far as I know regular expressions can check out only right-linear languages.
Any help will be appreciated. Thanks in advance.
This doesn't really qualify for the language-agnostic tag, but...
^(?:(?!\1)a1()|(?!\2)a2()|...|(?!\n)an())*$
see a demo on ideone.com
The first time an element is matched, it gets "checked off" by the capturing group following it. Because the group has now participated in the match, a negative lookahead for its corresponding backreference (e.g., (?!\1)) will never match again, even though the group only captured an empty string. This is an undocumented feature that is nevertheless supported in many flavors, including Java, .NET, Perl, Python, and Ruby.
This solution also requires support for forward references (i.e., a reference to a given capturing group (\1) appearing in the regex before the group itself). This seems to be a little less widely supported than the empty-groups gimmick.
Can't think how to do it with a single regex, but this is one way to do it with n regexes: (I will use 1 2 ... m n etc. for your a's)
^[23..n]*1?[23..n]*$
^[13..n]*2?[13..n]*$
...
^[12..m]*n?[12..m]*$
If all the above match, your string is a subset of 12..mn.
How this works: each line requires the string to consist exactly of:
any number of characters drawn from the set, except a particular one
perhaps a particular one
any number of characters drawn from the set, except a particular one
If this passes when every element in turn is considered as a particular one, we know:
there is nothing else in the string except the allowed elements
there is at most one of each of the allowed elements
as required.
For completeness, I should say that I would only do this if I was under orders to "use regex"; if not, I'd track which allowed elements have been seen, and iterate over the characters of the string doing the obvious thing.
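The n-regex idea can be sketched in Python by generating one pattern per allowed character. This is only an illustration: the function name and the default set "abc" are assumptions, and re.escape is used so that special characters in the set are treated literally.

```python
import re

def is_subset_by_regexes(s, allowed="abc"):
    """Apply one regex per allowed character: others*, that character at most once, others*."""
    for ch in allowed:
        others = "".join(re.escape(c) for c in allowed if c != ch)
        # With a single allowed character there are no "others" to repeat.
        rest = f"[{others}]*" if others else ""
        pattern = f"{rest}(?:{re.escape(ch)})?{rest}"
        if re.fullmatch(pattern, s) is None:
            return False
    return True

for s in ["ba", "abac", "cc", "cba", "abcd", ""]:
    print(repr(s), is_subset_by_regexes(s))
```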
Not sure you can get an extended regex to do that, but it's pretty easy to do with a simple traversal of your string.
You use a hash (or an array, or whatever) to store if any of your allowed characters has already been seen or not in the string. Then you simply iterate over the elements of your string. If you encounter an element not in your allowed set, you bail out. If it's allowed, but you've already seen it, you bail out too.
In pseudo-code:
foreach char a in {a1, ..., an}
    hit[a] = false
foreach char c in string
    if c not in {a1, ..., an} => fail
    if hit[c] => fail
    hit[c] = true
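The same traversal written as runnable Python, using a set in place of the hit table. The function name is made up for this sketch.

```python
def is_subset(s, allowed):
    """True if every character of s is in allowed and none appears twice."""
    allowed = set(allowed)
    seen = set()
    for ch in s:
        if ch not in allowed:
            return False  # character outside the allowed set
        if ch in seen:
            return False  # allowed, but already encountered
        seen.add(ch)
    return True

for s in ["ba", "abac", "cc", "cba", "abcd", ""]:
    print(repr(s), is_subset(s, "abc"))
```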
Similar to Alan Moore's, using only \1, and doesn't refer to a capturing group before it has been seen:
#!/usr/bin/perl
my $re = qr/^(?:([abc])(?!.*\1))*$/;
foreach (qw(ba pabc abac a cc cba abcd abbbbc), '') {
    print "'$_' ", ($_ =~ $re) ? "matches" : "does not match", " \$re \n";
}
We match any number of blocks (the outer (?:)), where each block must consist of "precisely one character from our preferred set, which is not followed by a string containing that character".
If the string might contain newlines or other funny stuff, it might be necessary to play with some flags to make ^, $ and . behave as intended, but this all depends on the particular RE flavor.
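For comparison, here is a rough Python port of the same pattern, assuming Python's re module, which handles the backreference inside the negative lookahead the same way for these inputs. The names are invented, and fullmatch replaces the explicit ^ and $.

```python
import re

# Each character from the set must not reappear later in the string.
NO_REPEATS = re.compile(r'(?:([abc])(?!.*\1))*')

def is_subset_of_abc(s):
    """True if s uses only a, b, c and no character more than once."""
    return NO_REPEATS.fullmatch(s) is not None

for s in ['ba', 'pabc', 'abac', 'a', 'cc', 'cba', 'abcd', 'abbbbc', '']:
    print(repr(s), 'matches' if is_subset_of_abc(s) else 'does not match')
```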
Just for silliness, one can use a positive look-ahead assertion to effectively AND two regexps, so we can test for any permutation of abc by asserting that the above matches, followed by an ordinary check for 'is N characters long and consists of these characters':
my $re2 = qr/^(?=$re)[abc]{3}$/;
foreach (qw(ba pabc abac a cc abcd abbbbc abc acb bac bca cab cba), '') {
    print "'$_' ", ($_ =~ $re2) ? "matches" : "does not match", " \$re2 \n";
}
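The look-ahead AND trick carries over to Python as well. This sketch embeds the no-repeats pattern in a positive lookahead and then checks the length; the constant and function names are assumptions made for the example.

```python
import re

# Lookahead: no character from the set repeats before the end of the string.
NO_REPEATS = r'(?:([abc])(?!.*\1))*$'
# AND: the string is exactly three characters from the set.
PERMUTATION = re.compile(r'^(?=' + NO_REPEATS + r')[abc]{3}$')

def is_permutation_of_abc(s):
    """True if s is some ordering of the three characters a, b, c."""
    return PERMUTATION.match(s) is not None

for s in ['ba', 'abac', 'cc', 'abc', 'acb', 'bac', 'bca', 'cab', 'cba', '']:
    print(repr(s), is_permutation_of_abc(s))
```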