Regex match even number of occurrences of a specific character - regex

I've been trying to make a regex that satisfies these conditions:
The word consists of the characters a and b
The number of b characters must be even (consecutive or not)
So for example:
abb -> accepted
abab -> accepted
aaaa -> rejected
baaab -> accepted
So far I got this: ([a]*)(((b){2}){1,})
As you can see, I know very little about the matter. This checks for pairs, but it still accepts words with an odd number of b's.

You could use this regex to check for some number of as with an even number of bs:
^(?:a*ba*ba*)+$
This looks for 1 or more occurrences of 2 bs surrounded by some number (which may be 0) as.
Demo on regex101
Note this will match bb (or bbbb, bbbbbb etc.). If you don't want to do that, the easiest way is to add a positive lookahead for an a:
^(?=b*a)(?:a*ba*ba*)+$
Demo on regex101
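As a quick sanity check, the two patterns above can be run against the question's examples. This is only a minimal Python sketch; the names EVEN_BS, EVEN_BS_WITH_A and samples are made up here for illustration and are not part of the original answer.

```python
import re

# Pattern from the answer: one or more groups, each containing exactly two b's.
EVEN_BS = re.compile(r"^(?:a*ba*ba*)+$")
# Variant with a lookahead that additionally requires at least one a.
EVEN_BS_WITH_A = re.compile(r"^(?=b*a)(?:a*ba*ba*)+$")

# The question's four examples plus two extra edge cases (bb and abbb).
samples = ["abb", "abab", "aaaa", "baaab", "bb", "abbb"]
for word in samples:
    print(word, bool(EVEN_BS.match(word)), bool(EVEN_BS_WITH_A.match(word)))
```

Note that both patterns require at least two b's, which is why aaaa is rejected even though zero is an even number.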

Checking an Array of Characters Against Two Conditionals
While you could do this using regular expressions, it would be simpler to check your two rules against an Array of characters created with String#chars. For example, using Ruby 3.1.2:
# Rule 1: string contains only the letters `a` and `b`
# Rule 2: the number of `b` characters in the word is even
#
# @return [Boolean] whether the word matches *both* rules
def word_matches_rules word
  char_array = word.chars
  char_array.uniq.sort == %w[a b] and char_array.count("b").even?
end
words = %w[abb abab aaaa baaab]
words.map { |word| [word, word_matches_rules(word)] }.to_h
#=> {"abb"=>true, "abab"=>true, "aaaa"=>false, "baaab"=>true}
Regular expressions are very useful, but string operations are generally faster and easier to conceptualize. This approach also allows you to add more rules or verify intermediate steps without adding a lot of complexity.
There are probably a number of ways this could be simplified further, such as using a Set or methods like Array#& or Array#-. However, my goal with this answer was to make the code (and the encoded rules you're trying to apply) easier to read, modify, and extend rather than to make the code as minimalist as possible.

Related

Match same number of repetitions as previous group

I'm trying to match strings that are repeated the same number of times, like
abc123
abcabc123123
abcabcabc123123123
etc.
That is, I want the second group (123) to be matched the same number of times as the first group (abc). Something like
(abc)+(123){COUNT THE PREVIOUS GROUP MATCHED}
This is using the Rust regex crate https://docs.rs/regex/1.4.2/regex/
Edit: As I feared, and as pointed out by answers and comments, this is not possible to represent in regex, at least not without some sort of recursion, which the Rust regex crate doesn't support for the time being. In this case, as I know the input length is limited, I just generated a rule like
(abc123)|(abcabc123123)|(abcabcabc123123123)
Horribly ugly, but got the job done, as this wasn't "serious" code, just a fun exercise.
As others have commented, I don't think it's possible to accomplish this in a single regex. If you can't guarantee the strings are well-formed then you'd have to validate them with the regex, capture each group, and then compare the group lengths to verify they are of equal repetitions. However, if it's guaranteed all strings will be well-formed then you don't even need to use regex to implement this check:
fn matching_reps(string: &str, group1: &str, group2: &str) -> bool {
    // Assumes well-formed input: unwrap() panics if group2 never occurs.
    let group2_start = string.find(group2).unwrap();
    // Everything before group2_start is repetitions of group1.
    let group1_reps = group2_start / group1.len();
    // Everything from group2_start onward is repetitions of group2.
    let group2_reps = (string.len() - group2_start) / group2.len();
    group1_reps == group2_reps
}
fn main() {
    assert_eq!(matching_reps("abc123", "abc", "123"), true);
    assert_eq!(matching_reps("abcabc123", "abc", "123"), false);
    assert_eq!(matching_reps("abcabc123123", "abc", "123"), true);
    assert_eq!(matching_reps("abcabc123123123", "abc", "123"), false);
}
playground
Pure regular expressions are not able to represent that.
There may be some way to define back references, but I am not familiar with regexp syntax in Rust, and this would technically be a way to represent something more than a pure regular expression.
There is, however, a simple way to compute it:
use a regexp to make sure your string has the form ^((abc)*)((123)*)$
if your string matches, take the two captured substrings and compare their lengths
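The two steps above can be sketched as follows. This is a Python illustration rather than Rust, and the names SHAPE and same_repetitions are hypothetical. Because both units are three characters long, comparing the captured lengths is the same as comparing the repetition counts.

```python
import re

# Step 1: the whole string must be some abc's followed by some 123's.
SHAPE = re.compile(r"^((?:abc)*)((?:123)*)$")

def same_repetitions(text: str) -> bool:
    m = SHAPE.match(text)
    if m is None:
        return False
    # Step 2: both runs use 3-character units, so equal lengths mean equal counts.
    return len(m.group(1)) == len(m.group(2))

print(same_repetitions("abcabc123123"))
print(same_repetitions("abcabc123"))
```

As written, the empty string is accepted (zero repetitions of each); add a length check if that is not wanted.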
Building a pattern dynamically is also an option. Matching one, two or three nested abc and 123 is possible with
abc(?:abc(?:abc(?:)?123)?123)?123
See proof. (?:)? is redundant, it matches no text, (?:...)? matches an optional pattern.
Rust snippet:
let a = "abc"; // Prefix
let b = "123"; // Suffix
let level = 3; // Recursion (repetition) level
let mut result = "".to_string();
for _n in 0..level {
    result = format!("{}(?:{})?{}", a, result, b);
}
println!("{}", result);
// abc(?:abc(?:abc(?:)?123)?123)?123
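For comparison, here is a rough Python sketch of the same pattern-building idea, followed by a check of what the generated pattern accepts. The function name nested_pattern is hypothetical, and unlike the Rust loop this version starts from the innermost prefix+suffix pair, so the redundant empty (?:)? group is not emitted.

```python
import re

def nested_pattern(prefix: str, suffix: str, level: int) -> str:
    # Innermost level: one prefix immediately followed by one suffix.
    result = re.escape(prefix) + re.escape(suffix)
    # Each further level wraps the previous pattern in an optional group.
    for _ in range(level - 1):
        result = f"{re.escape(prefix)}(?:{result})?{re.escape(suffix)}"
    return result

pattern = nested_pattern("abc", "123", 3)
print(pattern)

# Anchor the pattern so it must cover the whole string.
matcher = re.compile(f"^{pattern}$")
for text in ["abc123", "abcabc123123", "abcabc123", "abcabcabcabc123123123123"]:
    print(text, bool(matcher.match(text)))
```

The pattern only covers nesting depths up to the chosen level, so a deeper input such as four repetitions is rejected when level is 3.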
Many regexp libraries support an extension, dating back to early Unix tools, called a back reference: it matches, literally, the text that an earlier group has already captured.
For example, say you have a number that must be equal to another (e.g. the score of a soccer game, when you are interested only in draws between the two teams). You can use the following regexp:
([0-9][0-9]*)-\1
Suppose we feed it "123-123": it will match. If we use "123-12" it will not match, as the text \1 has to match is not the same string as what was matched by the first group. Once the first group has matched, the engine treats \1 as the literal sequence of characters that the first group captured.
There is a catch with your sample, though: nothing delimits the first group. If you try
([0-9][0-9]*)\1
to match 123123, the greedy first group initially consumes all six digits, and a backtracking engine has to give digits back one at a time until \1 can match the remainder. A non-digit character after the group lets it end unambiguously.
For example, this means that you can use:
\+\(([0-9][0-9]*)\)\1(-\1)*
and this will match phone numbers in the form
+(358)358-358-358
or
+(1)1-1-1-1-1-1-1
(the number between the parentheses is captured as a sample, and then you use the group to build a sequence of that number separated by dashes. You can see the expression working in this demo.)
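To make the back-reference behaviour concrete, here is a small Python sketch. It is written in the syntax Python's re module expects, where \( is a literal parenthesis and ( opens a capturing group; the names DRAW and PHONE are illustrative only, and both patterns are anchored so they must cover the whole input.

```python
import re

# \1 must repeat exactly the text captured by group 1.
DRAW = re.compile(r"^([0-9]+)-\1$")
# The literal parentheses delimit the captured number; \1 then repeats it.
PHONE = re.compile(r"^\+\(([0-9]+)\)\1(?:-\1)*$")

for text in ["123-123", "123-12"]:
    print(text, bool(DRAW.match(text)))
for text in ["+(358)358-358-358", "+(1)1-1-1-1-1-1-1", "+(358)358-359"]:
    print(text, bool(PHONE.match(text)))
```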

Regex to match either range or list of numbers

I need a regex to match lists of numbers and another one to match ranges of numbers (expressions shall never fail in both cases). Ranges shall consist of a number, a dash, and another number (N-N), while lists shall consist of numbers separated by a comma (N,N,N). Here below are some examples.
Ranges:
'1-10' => OK
Whateverelse => NOK (e.g. '1-10 11-20')
List:
'1,2,3' => OK
Whateverelse => NOK
And here are my two regular expressions:
[0-9]+[\-][0-9]+
([0-9]+,?)+
... but I have a few problems with them... for example:
When evaluating '1-10', regex 2 matches 1... but it shouldn't match anything because the string does not contain a list.
Then, when evaluating '1-10 11-14', regex 1 matches 1-10... but it shouldn't match anything because the string contains more than just a range.
What am I missing? Thanks.
Try this:
^((\d+-(\*|\d+))|((\*|\d+)-\d+)|((\d+)(,\d+)+))$
Test results:
1-10 OK
1,2,3 OK
1-* OK
*-10 OK
1,2,3 1-10 NOK
1,2,3 2,3,4 NOK
*-* NOK
Visualization of the regex:
Edit: Added for wildcard * as per OP's comment.
This one is a little different. It's for ports on a Procurve switch.
^(((\d+)|(\d+-\d+))(,((\d+)|(\d+-\d+)))*)$
It's in perl.
1 OK
2 OK
3 OK
1-4 OK
0-A NOK
83-91 OK
14,15,16 OK
14,20-25,91 OK
a,b-c,5,5,5 NOK
this-is,5,7,9 NOK
9,8,1-2,1-7 OK
I didn't include the * from above. And what did you (@unlimit) use for that wonderful diagram?
-E
First, you should use anchors to make sure that the regex match encompasses the entire string and not just a substring:
^[0-9]+-[0-9]+$
Then, the comma is optional in your second regex. Try this instead:
^([0-9]+,)+[0-9]+$
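A short Python sketch of how the two anchored patterns behave on the inputs from the question. The names RANGE, LIST and classify are invented for this illustration.

```python
import re

# Anchored patterns from the answer above.
RANGE = re.compile(r"^[0-9]+-[0-9]+$")
LIST = re.compile(r"^([0-9]+,)+[0-9]+$")

def classify(text: str) -> str:
    # Because of the anchors, a pattern only matches when it covers the whole string.
    if RANGE.match(text):
        return "range"
    if LIST.match(text):
        return "list"
    return "neither"

for text in ["1-10", "1,2,3", "1-10 11-20", "1,2,3 1-10", "1"]:
    print(repr(text), classify(text))
```

With this list pattern a single number such as 1 is not a list, since at least one comma is required.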
The simplest solution to your issue is to wrap an extra set of brackets around the second result:
(([0-9]+,?)+)
As others have noted, if you are taking text input and that's the whole input, you should start and finish it with ^ and $:
^(([0-9]+,?)+)$
If you are searching a body of text to extract these values then you wouldn't need that.
The brackets mean a match group. It's also possible to mark the inner bracket as a "non-capturing group" if you add (?: to the start instead of (. This would leave you with:
((?:[0-9]+,?)+)
Which would mean the only captured value is the one you wanted. You could also just ignore the second capture.
I needed something to match a list of integers that are comma separated, such as 1,2,3,4 but also specify ranges such as 100-255 and combinations thereof, such as 1011,1100-1300,1111,1919-9999,2111. Basically the OP request and combinations of it.
For this, I use the following regular expression tested over at Regex101.com:
^\d+((\,|-)\d+)*$
You can think of this as:
1. From the start of the string
2. Expect 1 or more digits, and either...
3. A literal comma and 1 or more digits, or...
4. A hyphen and 1 or more digits
5. with (3) and (4) repeating zero or more times
6. Until the string end
This permits all the following to be valid:
2011,2100-2300
2011,2013
1014-2024
999
1011,1100-1300,1111,1919-9999,2111
Note: the global and multiline regex options /gm should be included if this is used on multiline input.
The downside is that something like 100-100-100 is still valid, even though other malformed inputs are rejected. I'm not sure how complex it would be to resolve that further, but it was good enough for my needs.
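One possible way to rule out chained ranges like 100-100-100 is to make each comma-separated item either a single number or exactly one range. The sketch below compares the pattern above with such a variant; STRICT is an assumption added here for illustration, not something the answer proposed.

```python
import re

# Pattern from the answer: digits, then any mix of ",digits" or "-digits".
LOOSE = re.compile(r"^\d+((\,|-)\d+)*$")
# Assumed stricter variant: each comma-separated item is a number or one range.
STRICT = re.compile(r"^\d+(?:-\d+)?(?:,\d+(?:-\d+)?)*$")

valid = [
    "2011,2100-2300",
    "2011,2013",
    "1014-2024",
    "999",
    "1011,1100-1300,1111,1919-9999,2111",
]
for text in valid + ["100-100-100"]:
    print(text, bool(LOOSE.match(text)), bool(STRICT.match(text)))
```

Neither pattern checks that a range is ascending; that would need a numeric comparison outside the regex.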

How to create a regular expression to match non-consecutive characters?

How to create a regular expression for strings of a,b and c such that aa and bb will be rejected?
For example, abcabccababcccccab will be accepted and aaabc or aaabbcccc or abcccababaa will be rejected.
If this is not a purely academic question, you can simply search for aa and bb and negate your logic, for example:
import re

s = 'abcccabaa'
# continue if string does not match.
if re.search('(?:aa|bb)', s) is None:
    ...
or simply scan the string for the two patterns, avoiding expensive regular expressions:
if 'aa' not in s and 'bb' not in s:
    ...
For such an easy task RE is probably total overkill.
P.S.: The examples are in Python but the principle applies to other languages of course.
^(?!.*(?:aa|bb))[abc]+$
See it here on Regexr
This regex does two things:
verifies that your string consists only of a, b and c
fails on aa and bb
^ matches the start of the string
(?!.*(?:aa|bb)) negative lookahead assertion, will fail if there is aa or bb in the string
[abc]+ character class, allows only a,b,c at least one (+)
$ matches the end of the string
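The pattern can be tried against the question's examples with a few lines of Python. The names NO_DOUBLES and accepted are illustrative only.

```python
import re

# Negative lookahead rejects any aa or bb; [abc]+ restricts the alphabet.
NO_DOUBLES = re.compile(r"^(?!.*(?:aa|bb))[abc]+$")

def accepted(text: str) -> bool:
    return NO_DOUBLES.match(text) is not None

for text in ["abcabccababcccccab", "aaabc", "aaabbcccc", "abcccababaa", "abd"]:
    print(text, accepted(text))
```

Because of the + quantifier, the empty string is rejected; use * instead if it should be accepted.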
Using the & operator (intersection) and ~ (complement):
(a|b|c)*&~(.*(aa|bb).*)
Rewriting this without these operators is tricky. The usual approach is to break it into cases.
In this case it is not all that difficult.
Suppose that the letter c is taken out of the picture. The only sequences then which don't have aa and bb are:
e (empty string)
a
b
b?(ab)*a?
Next what we can do is insert some optional 'c' runs into all possible interior places:
e (empty string)
a
b
(bc*)?(ac*bc*)*a?
Next, we have to acknowledge that illegal sequences like aabb become accepted if non-optional 'c' runs are put in the middle, as in, for example, acacbcbc. We allow a final a or b. This pattern can take care of our lone a and b cases as well as matching the empty string:
(ac+|bc+)*(a|b)?
Then combine them together:
((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?)
We are almost there: we also need to recognize that this pattern can occur an arbitrary number of times, as long as there are dividing 'c'-s between the occurrences, and with arbitrary leading or trailing runs of c-s around the whole thing:
c*((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?)(c+((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?))*c*
Mr. Regex Philbin, I'm not coming up with any cases that this doesn't handle, so I'm leaving it as my final answer.
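One way to gain some confidence in a hand-built pattern like this is to compare it by brute force with the simple "no aa and no bb" check on every short string over a, b and c. The sketch below does that in Python; SEGMENT, LONG, simple_check and disagreements are hypothetical names, and the length bound is an arbitrary choice.

```python
import itertools
import re

# The alternation copied from the combined pattern in the answer.
SEGMENT = r"((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?)"
# Final pattern: segments separated by runs of c, with optional c's around.
LONG = re.compile(rf"c*{SEGMENT}(c+{SEGMENT})*c*")

def simple_check(text: str) -> bool:
    return "aa" not in text and "bb" not in text

def disagreements(max_len: int) -> list:
    # Collect every string (up to max_len) on which the two checks differ.
    found = []
    for n in range(max_len + 1):
        for chars in itertools.product("abc", repeat=n):
            text = "".join(chars)
            if (LONG.fullmatch(text) is not None) != simple_check(text):
                found.append(text)
    return found

print(disagreements(6))
```

An exhaustive check over short strings is not a proof, but any mistake in the case analysis tends to show up quickly at small lengths.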

Regular expression matching any subset of a given set?

Is it possible to write a regular expression which will match any subset of a given set of characters a1 ... an ?
I.e. it should match any string where any of these characters appears at most once, there are no other characters and the relative order of the characters doesn't matter.
Some approaches that arise at once:
1. [a1,...,an]* or (a1|a2|...|an)*- this allows multiple presence of characters
2. (a1?a2?...an?) - no multiple presence, but relative order is important - this matches any subsequence but not subset.
3. ($|a1|...|an|a1a2|a2a1|...|a1...an|...|an...a1), i.e. write all possible subsequences (just hardcode all matching strings :)) of course, not acceptable.
I also have a guess that it may be theoretically impossible, because during parsing the string we will need to remember which character we have already met before, and as far as I know regular expressions can check out only right-linear languages.
Any help will be appreciated. Thanks in advance.
This doesn't really qualify for the language-agnostic tag, but...
^(?:(?!\1)a1()|(?!\2)a2()|...|(?!\n)an())*$
see a demo on ideone.com
The first time an element is matched, it gets "checked off" by the capturing group following it. Because the group has now participated in the match, a negative lookahead for its corresponding backreference (e.g., (?!\1)) will never match again, even though the group only captured an empty string. This is an undocumented feature that is nevertheless supported in many flavors, including Java, .NET, Perl, Python, and Ruby.
This solution also requires support for forward references (i.e., a reference to a given capturing group (\1) appearing in the regex before the group itself). This seems to be a little less widely supported than the empty-groups gimmick.
I can't think how to do it with a single regex, but this is one way to do it with n regexes (I will use 1, 2, ..., m, n for your a's):
^[23..n]*1?[23..n]*$
^[13..n]*2?[13..n]*$
...
^[12..m]*n?[12..m]*$
If all the above match, your string is a subset of 12..mn.
How this works: each line requires the string to consist exactly of:
any number of characters drawn from the set, except a particular one
perhaps a particular one
any number of characters drawn from the set, except a particular one
If this passes when every element in turn is considered as a particular one, we know:
there is nothing else in the string except the allowed elements
there is at most one of each of the allowed elements
as required.
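The n-regexes idea can be sketched by generating one pattern per allowed character. This is an illustrative Python version; the function name at_most_once_each is hypothetical, and it assumes the characters in allowed are distinct.

```python
import re

def at_most_once_each(text: str, allowed: str) -> bool:
    if not allowed:
        # With no allowed characters, only the empty string qualifies.
        return text == ""
    for ch in allowed:
        # Character class of every allowed character except ch.
        others = "".join(re.escape(c) for c in allowed if c != ch)
        rest = f"[{others}]*" if others else ""
        # others*, then ch at most once, then others* again.
        pattern = f"^{rest}(?:{re.escape(ch)})?{rest}$"
        if re.match(pattern, text) is None:
            return False
    return True

for text in ["ba", "pabc", "abac", "abc", ""]:
    print(repr(text), at_most_once_each(text, "abc"))
```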
For completeness, I should say that I would only do this if I were under orders to "use regex"; if not, I'd track which allowed elements have been seen, and iterate over the characters of the string doing the obvious thing.
Not sure you can get an extended regex to do that, but it's pretty easy to do with a simple traversal of your string.
You use a hash (or an array, or whatever) to store if any of your allowed characters has already been seen or not in the string. Then you simply iterate over the elements of your string. If you encounter an element not in your allowed set, you bail out. If it's allowed, but you've already seen it, you bail out too.
In pseudo-code:
foreach char a in {a1, ..., an}
    hit[a] = false
foreach char c in string
    if c not in {a1, ..., an} => fail
    if hit[c] => fail
    hit[c] = true
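The pseudo-code above translates almost line for line into Python. In this sketch a set of already-seen characters plays the role of the hit table; the function name is_subset_string is made up for illustration.

```python
def is_subset_string(text: str, allowed: str) -> bool:
    allowed_set = set(allowed)
    seen = set()
    for ch in text:
        if ch not in allowed_set:
            return False  # character outside the allowed set
        if ch in seen:
            return False  # allowed, but already seen once
        seen.add(ch)
    return True

for text in ["ba", "pabc", "abac", "cba", "abcd", ""]:
    print(repr(text), is_subset_string(text, "abc"))
```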
Similar to Alan Moore's, using only \1, and doesn't refer to a capturing group before it has been seen:
#!/usr/bin/perl
my $re = qr/^(?:([abc])(?!.*\1))*$/;
foreach (qw(ba pabc abac a cc cba abcd abbbbc), '') {
    print "'$_' ", ($_ =~ $re) ? "matches" : "does not match", " \$re \n";
}
We match any number of blocks (the outer (?:)), where each block must consist of "precisely one character from our preferred set, which is not followed by a string containing that character".
If the string might contain newlines or other funny stuff, it might be necessary to play with some flags to make ^, $ and . behave as intended, but this all depends on the particular RE flavor.
Just for silliness, one can use a positive look-ahead assertion to effectively AND two regexps, so we can test for any permutation of abc by asserting that the above matches, followed by an ordinary check for 'is N characters long and consists of these characters':
my $re2 = qr/^(?=$re)[abc]{3}$/;
foreach (qw(ba pabc abac a cc abcd abbbbc abc acb bac bca cab cba), '') {
    print "'$_' ", ($_ =~ $re2) ? "matches" : "does not match", " \$re2 \n";
}
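The same two patterns also work in Python's re module, which supports a back reference inside a lookahead. The sketch below is a rough port of the Perl snippets; the names AT_MOST_ONCE and PERMUTATION are hypothetical, and the first pattern is written out inline inside the second instead of being interpolated.

```python
import re

# Each block: one character from the set, not followed later by itself.
AT_MOST_ONCE = re.compile(r"^(?:([abc])(?!.*\1))*$")
# Lookahead ANDs the pattern above with "exactly three characters from the set".
PERMUTATION = re.compile(r"^(?=(?:([abc])(?!.*\1))*$)[abc]{3}$")

for text in ["ba", "pabc", "abac", "a", "cc", "cba", "abcd", "abbbbc", ""]:
    print(repr(text), bool(AT_MOST_ONCE.match(text)))
for text in ["abc", "acb", "bac", "bca", "cab", "cba", "ba", "aab"]:
    print(repr(text), bool(PERMUTATION.match(text)))
```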

How can I check if every substring of four zeros is followed by at least four ones using regular expressions?

How can I write a regular expression such that
whenever there is 0000, it is followed by 1111? For example:
00101011000011111111001111010 -> correct
0000110 -> incorrect
11110 -> correct
Thanks for any help.
If you are using Perl, you can use a zero-width negative-lookahead assertion:
#!/usr/bin/perl
use strict; use warnings;
my @strings = qw(
    00101011000011111111001111010
    00001111000011
    0000110
    11110
);
my $re = qr/0000(?!1111)/;
for my $s ( @strings ) {
    my $result = $s =~ $re ? 'incorrect' : 'correct';
    print "$s -> $result\n";
}
The pattern matches if there is a string of 0000 not followed by at least four 1s. So, a match indicates an incorrect string.
Output:
C:\Temp> s
00101011000011111111001111010 -> correct
00001111000011 -> incorrect
0000110 -> incorrect
11110 -> correct
While some languages' alleged "regular expressions" actually implement something quite different from (generally a superset of) what are called regular expressions in computer science (including e.g. pushdown automata or even arbitrary code execution within "regexes"), to answer in actual regex terms is, I think, best done as follows:
regular expressions are in general a good way to answer many questions of the form "is there any spot in the text in which the following pattern occurs" (with limitations on the pattern, of course -- for example, balancing of nested parentheses is beyond regexes' power, although of course it may well not be beyond the power of arbitrary supersets of regexes). "Does the whole text match this pattern" is obviously a special case of the question "does any spot in the text match this pattern", given the possibility to have special markers meaning "start of text" and "end of text" (typically ^ and $ in typical regex-pattern syntax).
However, the question "can you check that NO spot in the text matches this pattern" is not one which regex matching can directly answer... but adding (outside the regex) the logical operation not obviously solves the problem in practice, because "check that no spot matches" is clearly the same as "tell me if any spot matches" followed by "transform success into failure and vice versa" (the latter being the logical not part). This is the key insight in Sinan's answer, beyond the specific use of Perl's negative-lookahead (which is really just a shortcut, not an extension of regex power per se).
If your favorite language for using regexes in doesn't have negative lookahead but does have the {<number>} "count shortcut", parentheses, and the vertical bar "or" operation:
00001{0,3}([^1]|$)
i.e., "four 0s followed by zero to three 1s followed by either a non-1 character or end-of-text" is exactly a pattern such that, if the text matches it anywhere, violates your constraint (IOW, it can be seen as a slight expansion of the negative-lookahead shortcut syntax). Add a logical-not (again, in whatever language you prefer), and there you are!
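As an illustration of "match the violation, then apply a logical not", here is a minimal Python sketch using the lookahead-free pattern above. The names VIOLATION and is_valid are made up for this example.

```python
import re

# Four 0s, then at most three 1s, then a non-1 character or end of text.
VIOLATION = re.compile(r"00001{0,3}([^1]|$)")

def is_valid(bits: str) -> bool:
    # The logical "not" lives outside the regex: valid means no violation found.
    return VIOLATION.search(bits) is None

for bits in ["00101011000011111111001111010", "00001111000011", "0000110", "11110"]:
    print(bits, "->", "correct" if is_valid(bits) else "incorrect")
```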
There are several approaches that you can take here, and I will list some of them.
Checking that for all cases, the requirement is always met
In this approach, we simply look for 0000(1111)? and find all matches. Since ? is greedy, it will match the 1111 if possible, so we simply check that each match is 00001111. If it's only 0000 then we say that the input is not valid. If we didn't find any match that is only 0000 (perhaps because there's no match at all to begin with), then we say it's valid.
In pseudocode:
FUNCTION isValid(s:String) : boolean
    FOR EVERY match /0000(1111)?/ FOUND ON s
        IF match IS NOT "00001111" THEN
            RETURN false
    RETURN true
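The pseudocode above maps directly onto re.finditer in Python. This is a minimal sketch with a hypothetical function name, is_valid; it uses a non-capturing group so that group(0) is the whole match.

```python
import re

def is_valid(bits: str) -> bool:
    # Greedy ? takes the 1111 whenever it is there, so a bare 0000 is a violation.
    for m in re.finditer(r"0000(?:1111)?", bits):
        if m.group(0) != "00001111":
            return False
    return True

for bits in ["00101011000011111111001111010", "000011110000", "0000110", "11110"]:
    print(bits, is_valid(bits))
```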
Checking that there is a case where the requirement isn't met (then oppose)
In this approach, we use regex to try to find a violation instead. Thus, a successful match means we say the input is not valid. If there's no match, then there's no violation, so we say the input is valid (this is what is meant by "check-then-oppose").
In pseudocode:
isValid := NOT (/violationPattern/ FOUND ON s)
Lookahead option
If your flavor supports it, negative lookahead is the most natural way to express this pattern. Simply look for 0000(?!1111).
No lookahead option
If your flavor doesn't support negative lookahead, you can still use this approach. Now the pattern becomes 00001{0,3}(0|$). That is, we try to match 0000, followed by 1{0,3} (that is, between zero and three 1s), followed by either 0 or the end of string anchor $.
Fully spelled out option
This is equivalent to the previous option, but instead of using repetition and alternation syntax, you explicitly spell out what the violations are. They are
00000|000010|0000110|00001110|
0000$|00001$|000011$|0000111$
Checking that there ISN'T a case where the requirement isn't met
This relies on negative lookahead; it's simply taking the previous approach to the next level. Instead of:
isValid := NOT (/violationPattern/ FOUND ON s)
we can bring the NOT into the regex using negative lookahead as follows:
isValid := (/^(?!.*violationPattern)/ FOUND ON s)
That is, anchoring ourself at the beginning of the string, we negatively assert that we can match .*violationPattern. The .* allows us to "search" for the violationPattern as far ahead as necessary.
Attachments
Here are the patterns showcased on rubular:
Approach 1: Matching only 0000 means invalid
0000(?:1111)?
Approach 2: Match means invalid
0000(?!1111)
00001{0,3}(?:0|$)
00000|000010|0000110|00001110|0000$|00001$|000011$|0000111$
Approach 3: Match means valid
^(?!.*0000(?!1111)).*$
The input used is (annotated to show which ones are valid):
+ 00101011000011111111001111010
- 000011110000
- 0000110
+ 11110
- 00000
- 00001
- 000011
- 0000111
+ 00001111
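As a cross-check, the approach 2 and approach 3 patterns can be run over the annotated inputs to see whether they all agree. The sketch below is in Python rather than on rubular; ANNOTATED, VIOLATIONS, VALID and verdicts are names invented for this illustration.

```python
import re

# Inputs from the list above: True for "+", False for "-".
ANNOTATED = {
    "00101011000011111111001111010": True,
    "000011110000": False,
    "0000110": False,
    "11110": True,
    "00000": False,
    "00001": False,
    "000011": False,
    "0000111": False,
    "00001111": True,
}

# Approach 2: a match means the input is invalid.
VIOLATIONS = [
    r"0000(?!1111)",
    r"00001{0,3}(?:0|$)",
    r"00000|000010|0000110|00001110|0000$|00001$|000011$|0000111$",
]
# Approach 3: a match means the input is valid.
VALID = re.compile(r"^(?!.*0000(?!1111)).*$")

def verdicts(bits: str) -> list:
    # One boolean per pattern, each meaning "this pattern says the input is valid".
    results = [re.search(p, bits) is None for p in VIOLATIONS]
    results.append(VALID.search(bits) is not None)
    return results

for bits, expected in ANNOTATED.items():
    print(bits, expected, verdicts(bits))
```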
References
regular-expressions.info/Lookarounds and Flavor comparison