Does order not matter in regular expressions? - regex

I was looking at the question posed in this stackoverflow link (Regular expression for odd number of a's), which asks for a regular expression for strings that have an odd number of a's over Σ = {a,b}.
The answer given by the top comment which works is b*(ab*ab*)*ab*.
I am quite confused - a was placed just before the last b*, does this ordering actually matter? Why can't it be b*a(ab*ab*)*b* instead (where a is placed after the first b*), or any other permutation of it?
Another thing I am confused about is why it is (ab*ab*)* and not (b*ab*ab*)*. Isn't b*ab*ab* the more accurate definition of 'having exactly 2 a'?

Why can't it be b*a(ab*ab*)*b* instead?
b*a(ab*ab*)*b* does not work because, whenever the string has more than one a, it forces the first a to be followed immediately by a second a. For example, abaa has three a's and should match, but your proposed regex rejects it. Use the regex debugger on a site like Regex101 to see this for yourself.
On the other hand, moving the whole ab* part to the start (b*ab*(ab*ab*)*) works as well.
why it is (ab*ab*)* and not (b*ab*ab*)*?
(b*ab*ab*)* does work, but the first b* in the group is redundant: any b's it could match are already consumed, either by the b* in front of the group (on the first iteration) or by the trailing b* of the previous iteration.
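These claims can be checked by brute force: enumerate every string over {a, b} up to some length and compare each candidate with the plain definition "odd number of a's". The sketch below uses Python's re module; the helper name first_mismatch and the length bound of 8 are illustrative choices, not part of the answer above.

```python
import re
from itertools import product

def first_mismatch(pattern, max_len=8):
    """Return the first string over {a, b} (shortest first) on which
    `pattern` disagrees with 'has an odd number of a', or None."""
    compiled = re.compile(pattern)
    for length in range(max_len + 1):
        for chars in product("ab", repeat=length):
            s = "".join(chars)
            wanted = s.count("a") % 2 == 1
            matched = compiled.fullmatch(s) is not None
            if wanted != matched:
                return s
    return None

print(first_mismatch(r"b*(ab*ab*)*ab*"))    # the accepted expression
print(first_mismatch(r"b*a(ab*ab*)*b*"))    # the proposed reordering
print(first_mismatch(r"b*ab*(ab*ab*)*"))    # whole ab* block moved to the front
print(first_mismatch(r"b*(b*ab*ab*)*ab*"))  # extra b* inside the group
```

Only the second call should report a counterexample (abaa). A result of None means only that no disagreement was found up to the bound, which is evidence rather than a proof.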

There are infinitely many equivalent regular expressions which generate a given (infinite) regular language. A particular expression might be preferable in some cases and by certain authors: one might prefer a minimal expression, or one which shows structure or symmetry, or even one that simplifies the reasoning in a proof by induction.
Your particular suggestion to move the a is insufficient since, as noted above, that ensures the substring aa will appear in any string with more than one a. However, ab*ab* could be changed to b*ab*a to make that placement work. Choosing b*ab*ab* would work with either placement. You could even go for an expression like b*ab* + b*ab*ab*ab* + (b*ab*ab*)*a(b*ab*ab*)*, which might be nice to work with depending on your application. Something like b*(ab*ab*)*ab* has the advantage of being minimal (if it's not strictly minimal, it must be pretty close).
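To compare several such expressions at once, one option is to compute the set of strings each one accepts up to a fixed length and test the sets for equality. The expressions below are my reading of the alternatives mentioned in this answer (with the repeated group starred and + written as |, as Python's re requires); treat the exact forms as assumptions.

```python
import re
from itertools import product

def language(pattern, max_len=7):
    """Strings over {a, b} of length <= max_len fully matched by `pattern`."""
    compiled = re.compile(pattern)
    return {
        s
        for n in range(max_len + 1)
        for s in map("".join, product("ab", repeat=n))
        if compiled.fullmatch(s)
    }

reference = language(r"b*(ab*ab*)*ab*")

# Assumed readings of the alternatives; names are only labels for the printout.
variants = {
    "a moved, group rewritten": r"b*a(b*ab*a)*b*",
    "symmetric group, a first": r"b*a(b*ab*ab*)*b*",
    "symmetric group, a last": r"b*(b*ab*ab*)*ab*",
    "union form": r"b*ab*|b*ab*ab*ab*|(b*ab*ab*)*a(b*ab*ab*)*",
}

for name, pattern in variants.items():
    print(name, language(pattern) == reference)
```

Agreement up to length 7 does not prove equivalence, but a single differing string would disprove it.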

Related

find a regular expression where a is never immediately followed by b (Theory of formal languages)

I need to find a simplified regular expression for the language of all strings
of a's, b's, and c's where a is never immediately followed by b.
I tried something and reached till (a+c)*c(b+c)* + (b+c)*(a+c)*
Is this fine and if so can this be simplified?
Thanks in advance.
You are looking for a negative lookbehind:
(?<!a)b
This will find you all the b instances that are not immediately following a
Or a negative lookahead:
a(?!b)
This will find you all the a instances that are not immediately followed by b
Here is a regex101 example for the lookbehind:
https://regex101.com/r/RsqXbW/1
Here is a regex101 example for the lookahead:
https://regex101.com/r/qiDIZU/1
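For readers who want to try the lookarounds locally, here is a small Python sketch. The sample string and the whole-string pattern (?!.*ab)[abc]* are my own additions. Note that lookarounds are an engine feature, not part of the formal regular expressions the question asks about.

```python
import re

text = "abcbaac"

# Positions of every b that is not immediately preceded by a.
lookbehind_hits = [m.start() for m in re.finditer(r"(?<!a)b", text)]

# Positions of every a that is not immediately followed by b.
lookahead_hits = [m.start() for m in re.finditer(r"a(?!b)", text)]

def in_language(s):
    """Whole-string test: only a, b, c, and no 'ab' anywhere."""
    return re.fullmatch(r"(?!.*ab)[abc]*", s) is not None

print(lookbehind_hits, lookahead_hits)
print(in_language("acbac"), in_language("cab"))
```

The first two patterns locate individual characters; only the last one answers "is this string in the language?".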
Your solution contains only strings from the desired language. However, it does not contain all of them. For example, acbac is not contained. Your basic idea is fine, but you need to be able to iterate the possible factors. In:
(b+c)*(a (a)*(c(b+c)*)*)*
the first part generates all strings without a.
After the first a there comes either nothing, another a, or c. Another a leaves us with the same three options. c basically starts the game again. This is what the part after the first a formalizes. The many * are needed to possibly generate the empty string in all of the different options.
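Both expressions can be compared against the definition by enumerating all strings over {a, b, c} up to a small length. This is a sketch in Python, where union is written | instead of +; the helper names and the bound of 6 are arbitrary choices of mine.

```python
import re
from itertools import product

# '+' in the formal notation is union; Python's re writes it as '|'.
ASKER = r"(a|c)*c(b|c)*|(b|c)*(a|c)*"
ANSWER = r"(b|c)*(aa*(c(b|c)*)*)*"

def all_strings(max_len):
    """Every string over {a, b, c} of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product("abc", repeat=n):
            yield "".join(chars)

def missed(pattern, max_len=6):
    """Strings without 'ab' that `pattern` fails to match."""
    rx = re.compile(pattern)
    return [s for s in all_strings(max_len) if "ab" not in s and not rx.fullmatch(s)]

def wrongly_matched(pattern, max_len=6):
    """Strings containing 'ab' that `pattern` nevertheless matches."""
    rx = re.compile(pattern)
    return [s for s in all_strings(max_len) if "ab" in s and rx.fullmatch(s)]

print("acbac" in missed(ASKER), wrongly_matched(ASKER))
print(missed(ANSWER), wrongly_matched(ANSWER))
```

Within the bound, the asker's expression never accepts a bad string but misses some good ones, while the expression from this answer shows no disagreement in either direction.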

Automatically find short regexp to match a set of words?

I am not looking for a specific regular expression, but for software that finds them.
Let us say I have a file A and a file B: how do I find a regexp that matches all words of A, but does not match any of the words in B?
If A contains "truit fruit" and B contains "ridiculous", then the software could return something like ".ru." but '.r.' only would be invalid.
It is the "practical" aspect of another question [1], though what interests me is to find an actual software that solves it in practice.
Thanks for your help,
Nathann
[1] https://cstheory.stackexchange.com/questions/1854/is-finding-the-minimum-regular-expression-an-np-complete-problem
There is no algorithm to somehow "cleverly derive" a regular expression from examples. You can only implement a brute force attempt of an iteration through all permutations of common substrings of the words in A and tests B against it until you find a solution. You are not guaranteed to find a solution, though.
For the case that there are no common substrings of all words in A you could then extend that approach to introduce the "or" operator in regular expressions. But that gets really ugly and slow.
If that does not lead to a solution, then you'd have to go on extending your attempts such that also exclusion rules are added to the expression by iterating through all words in B and creating anti patterns from it. Horrible attempt.
And as said: you are never guaranteed to find a solution.
There is one thing though:
If you are not interested in what the final regular expression looks like you can do this: create a regex simply combining all words in a "whitespace padded version of A" with an "or" operation (so \struit\s|\sfruit\s in your example). Obviously that attempt creates huge expressions. You then would have to take care to exclude exact substrings that might occur in B again, which may lead to much longer expressions still.
Bottom line: there is no really elegant solution for this, simply because the question does not allow for that. Question is: why does it have to be a regular expression? Why can't you simply do string comparisons? That would probably not be more expensive anyway in such a vaguely defined scenario...
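The brute-force idea can be sketched in a few lines of Python. This is a toy under strong assumptions, not a real tool: candidates are built only from literal characters of A plus the wildcard ".", matching means an unanchored search, and the function name and length cap are made up for the example.

```python
import re
from itertools import product

def find_separating_pattern(positives, negatives, max_len=4):
    """Return the first pattern, shortest first, that is found somewhere in
    every positive word and in no negative word, or None if none is found."""
    # Candidate symbols: escaped literal characters from the positives, plus '.'.
    symbols = sorted({re.escape(c) for w in positives for c in w}) + ["."]
    for length in range(1, max_len + 1):
        for parts in product(symbols, repeat=length):
            pattern = "".join(parts)
            rx = re.compile(pattern)
            if all(rx.search(w) for w in positives) and not any(
                rx.search(w) for w in negatives
            ):
                return pattern
    return None

print(find_separating_pattern(["truit", "fruit"], ["ridiculous"]))
```

The number of candidates grows exponentially with max_len, which is the "ugly and slow" part, and a None result only means nothing was found within the cap.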

Regular expression with a given length with an exact number of a character

I'm looking for a regular expression for the language with the exact number of k a's in it.
I'm pretty much stuck at this. For a various length the solution would be easy with .
Does anybody have any advice on how I can achieve such a regex?
I'd use this one :
(b*ab*){k}
It just makes k blocks containing exactly one a. Therefore words have k a.
One of the b* can be factored out on the left or on the right.
There is no simple solution to this.
While that language is regular, it's ugly to describe. You can get it by intersecting the (trivial) DFAs for both languages ((a|b)^n and b*(ab*)^k) with each other, but you'll get a DFA with (n-k)*k states back. And transforming it into a regular expression won't make it better.
However, if you're looking for an actual implementation it gets much easier. You can simply test the input against both regexes, or you can use lookahead to compose them into one regex:
/^(?=[ab]{n}$)b*(ab*){k}$/
You can use a look ahead to enforce the overall length:
^(?=.{5}$)([^a]*a){2}[^a]*$
See this demonstrated on rubular
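The lookahead form is easy to generate for concrete n and k. Below is a minimal Python sketch (the function name is hypothetical) that builds the pattern and counts how many strings over {a, b} it accepts; for n = 5 and k = 2 the count should equal the binomial coefficient C(5, 2).

```python
import re
from itertools import product

def exact_count_pattern(n, k):
    """Pattern for strings over {a, b} of length n with exactly k a's."""
    # The lookahead fixes the total length; the rest counts the a's.
    return rf"^(?=[ab]{{{n}}}$)b*(ab*){{{k}}}$"

rx = re.compile(exact_count_pattern(5, 2))

# Try every string of length 0..7 and keep the ones that match.
matches = [
    s
    for length in range(8)
    for s in map("".join, product("ab", repeat=length))
    if rx.match(s)
]
print(exact_count_pattern(5, 2))
print(len(matches))
```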

Regex for password requirements

I want to require the following:
Is greater than seven characters.
Contains at least two digits.
Contains at least two special (non-alphanumeric) characters.
...and I came up with this to do it:
(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})
Now, I'd also like to make sure that no two sequential characters are the same. I'm having a heck of a time getting that to work though. Here's what I got that works by itself:
(\S)\1+
...but if I try to combine the two together, it fails.
I'm operating within the constraints of the application. Its default requirement is 1 character length, no regex, and no nonstandard characters.
Anyway...
Using this test harness, I would expect y90e5$ to match but y90e5$$ to not.
What am I missing?
This is a bad place for a regex. You're better off using simple validation.
Sometimes we cannot influence specifications and have to write the implementation regardless, i.e., when some ancient backoffice system has to be interfaced through the web but has certain restrictions on input, or just because your boss is asking you to.
EDIT: removed the regex that was based on the original regex of the asker.
altered original code to fit your description, as it didn't seem to really work:
EDIT: the q. was then updated to reflect another version. There are differences which I explain below:
My version: the two or more \W and \d can be repeated by each other, but cannot appear next to each other (this was my incorrect assumption). I fixed it for length>7, which is slightly more efficient to place as a typical "grab all" expression.
^(?!.*(?:(\S)\1|\s))(?=.*(\d.+){2,})(?=.*(\W.+){2,}).{8,}
New version in original question: the two or more \W and the \d are allowed to appear next to each other. This version currently supports length>=6, not length>7 as explained in the text.
The current answer, corrected, should be something like this, which takes into account the updated question, my comments on length>7, and the optimizations: ^(?!.*(?:(\S)\1|\s))(?=(.*\d){2,})(?=(.*\W){2,}).{8,}
Update: your original code doesn't seem to work, so I changed it a bit
Update: updated answer to reflect changes in question, spaces not allowed anymore
This may not be the most efficient but appears to work.
^(?!.*(\S)\1)(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})
Test strings:
ad2f#we1$ //match valid.
adfwwe12#$ //No Match repeated ww.
y90e5$$ //No Match repeated $$.
y90e5$ //No Match too Short and only 1 \W class value.
One of the comments pointed out that the above regex allows spaces which are typically not used for password fields. While this doesn't appear to be a requirement of the original post, as pointed out a simple change will disallow spaces as well.
^(?!.*(\S)\1|.*\s)(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})
Your regex engine may parse (?!.*(\S)\1|.*\s) differently. Just be aware and adjust accordingly.
All previous test results the same.
Test string with whitespace:
ad2f #we1$ //No match space in string.
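The test strings above can be run against the whitespace-rejecting pattern in Python to confirm the listed outcomes. This is only a check of that one pattern in one engine (other engines may differ, as noted above), and the variable names are mine.

```python
import re

# The whitespace-rejecting pattern from this answer, unchanged.
PASSWORD_RX = re.compile(
    r"^(?!.*(\S)\1|.*\s)(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})"
)

samples = ["ad2f#we1$", "adfwwe12#$", "y90e5$$", "y90e5$", "ad2f #we1$"]

# Map each sample to whether the pattern matches at the start of the string.
results = {s: bool(PASSWORD_RX.match(s)) for s in samples}

for sample, ok in results.items():
    print(f"{sample!r}: {'match' if ok else 'no match'}")
```

Note that this pattern only enforces a minimum of six characters, so it is looser than the "greater than seven" rule stated in the question.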
If the rule was that passwords had to be two digits followed by three letters or some such, of course a regular expression would work very nicely. But I don't think regexes are really designed for the sort of rule you actually have. Even if you get it to work, it would be pretty cryptic to the poor sucker who has to maintain it later -- possibly you. I think it would be a lot simpler to just write a quick function that loops through the characters and counts how many total and how many of each type. Then at the end check the counts.
Just because you know how to use regexes doesn't mean you have to use them for everything. I have a cool cordless drill but I don't use it to put in nails.
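A counting function along those lines might look like the following Python sketch. It reflects my reading of the rules in the question; in particular, "special" is taken to mean anything that is not a letter or digit, which (unlike \W) includes the underscore and also counts spaces.

```python
def is_valid_password(password):
    """Check the rules from the question without a regex."""
    if len(password) <= 7:  # must be greater than seven characters
        return False
    digits = sum(c.isdigit() for c in password)
    specials = sum(not c.isalnum() for c in password)
    if digits < 2 or specials < 2:
        return False
    # No two sequential characters may be the same.
    for previous, current in zip(password, password[1:]):
        if previous == current:
            return False
    return True

for candidate in ["ad2f#we1$", "adfwwe12#$", "y90e5$$", "y90e5$"]:
    print(candidate, is_valid_password(candidate))
```

Each rule is one readable line, so changing a threshold later does not require re-deriving a lookahead.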

How to determine if a regex is orthogonal to another regex?

I guess my question is best explained with an (simplified) example.
Regex 1:
^\d+_[a-z]+$
Regex 2:
^\d*$
Regex 1 will never match a string where regex 2 matches.
So let's say that regex 1 is orthogonal to regex 2.
As many people asked what I meant by orthogonal I'll try to clarify it:
Let S1 be the (infinite) set of strings where regex 1 matches.
S2 is the set of strings where regex 2 matches.
Regex 2 is orthogonal to regex 1 iff the intersection of S1 and S2 is empty.
The regex ^\d_a$ would be not orthogonal as the string '2_a' is in the set S1 and S2.
How can it be programmatically determined, if two regexes are orthogonal to each other?
Best case would be some library that implements a method like:
/**
 * @return True if the regex is orthogonal (i.e. "intersection is empty"), False otherwise or Null if it can't be determined
 */
public Boolean isRegexOrthogonal(Pattern regex1, Pattern regex2);
By "Orthogonal" you mean "the intersection is the empty set" I take it?
I would construct the regular expression for the intersection, then convert to a regular grammar in normal form, and see if it's the empty language...
Then again, I'm a theorist...
I would construct the regular expression for the intersection, then convert to a regular grammar in normal form, and see if it's the empty language...
That seems like shooting sparrows with a cannon. Why not just construct the product automaton and check if an accept state is reachable from the initial state? That'll also give you a string in the intersection straight away without having to construct a regular expression first.
I would be a bit surprised to learn that there is a polynomial-time solution, and I would not be at all surprised to learn that it is equivalent to the halting problem.
I only know of a way to do it which involves creating a DFA from a regexp, which is exponential time (in the degenerate case). It's reducible to the halting problem, because everything is, but the halting problem is not reducible to it.
If the last, then you can use the fact that any RE can be translated into a finite state machine. Two finite state machines are equal if they have the same set of nodes, with the same arcs connecting those nodes.
So, given what I think you're using as a definition for orthogonal, if you translate your REs into FSMs and those FSMs are not equal, the REs are orthogonal.
That's not correct. You can have two DFAs (FSMs) that are non-isomorphic in the edge-labeled multigraph sense, but accept the same languages. Also, were that not the case, your test would check whether two regexps accepted non-identical, whereas OP wants non-overlapping languages (empty intersection).
Also, be aware that the \1, \2, ..., \9 construction is not regular: it can't be expressed in terms of concatenation, union and * (Kleene star). If you want to include back substitution, I don't know what the answer is. Also of interest is the fact that the corresponding problem for context-free languages is undecidable: there is no algorithm which takes two context-free grammars G1 and G2 and returns true iff L(G1) ∩ L(G2) ≠ Ø.
It's been two years since this question was posted, but I'm happy to say this can be determined now simply by calling the "genex" program here: https://github.com/audreyt/regex-genex
$ ./binaries/osx/genex '^\d+_[a-z]+$' '^\d*$'
$
The empty output means there are no strings that match both regexes. If they have any overlap, it will output the entire list of overlaps:
$ runghc Main.hs '\d' '[123abc]'
1.00000000 "2"
1.00000000 "3"
1.00000000 "1"
Hope this helps!
The fsmtools can do all kinds of operations on finite state machines, your only problem would be to convert the string representation of the regular expression into the format the fsmtools can work with. This is definitely possible for simple cases, but will be tricky in the presence of advanced features like look{ahead,behind}.
You might also have a look at OpenFst, although I've never used it. It supports intersection, though.
Excellent point on the \1, \2 bit... that's context free, and so not solvable. Minor point: Not EVERYTHING is reducible to Halt... Program Equivalence for example.. – Brian Postow
[I'm replying to a comment]
IIRC, a^n b^m a^n b^m is not context free, and so (a*)(b*)\1\2 isn't either since it's the same. ISTR { ww | w ∈ L } not being "nice" even if L is "nice", for nice being one of regular, context-free.
I modify my statement: everything in RE is reducible to the halting problem ;-)
I finally found exactly the library that I was looking for:
dk.brics.automaton
Usage:
import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;

/**
 * @return true if the two regexes will never both match a given string
 */
public boolean isRegexOrthogonal(String regex1, String regex2) {
    // Build one automaton per regex, then test their intersection for emptiness.
    Automaton automaton1 = new RegExp(regex1).toAutomaton();
    Automaton automaton2 = new RegExp(regex2).toAutomaton();
    return automaton1.intersection(automaton2).isEmpty();
}
It should be noted that the implementation doesn't and can't support complex RegEx features like back references. See the blog post "A Faster Java Regex Package" which introduces dk.brics.automaton.
You can maybe use something like Regexp::Genex to generate test strings to match a specified regex and then use the test string on the 2nd regex to determine whether the 2 regexes are orthogonal.
Proving that one regular expression is orthogonal to another can be trivial in some cases, such as mutually exclusive character groups in the same locations. For any but the simplest regular expressions this is a nontrivial problem. For serious expressions, with groups and backreferences, I would go so far as to say that this may be impossible.
I believe kdgregory is correct: you're using Orthogonal to mean Complement.
Is this correct?
Let me start by saying that I have no idea how to construct such an algorithm, nor am I aware of any library that implements it. However, I would not be at all surprised to learn that no such algorithm exists for general regular expressions of arbitrary complexity.
Every regular expression defines a regular language of all the strings that can be generated by the expression, or if you prefer, of all the strings that are "matched by" the regular expression. Think of the language as a set of strings. In most cases, the set will be infinitely large. Your question asks whether the intersections of the two sets given by the regular expressions is empty or not.
At least to a first approximation, I can't imagine a way to answer that question without computing the sets, which for infinite sets will take longer than you have. I think there might be a way to compute a limited set and determine when a pattern is being elaborated beyond what is required by the other regex, but it would not be straightforward.
For example, just consider the simple expressions (ab)* and (aba)*b. What is the algorithm that will decide to generate abab from the first expression and then stop, without checking ababab, abababab, etc. because they will never work? You can't just generate strings and check until a match is found because that would never complete when the languages are disjoint. I can't imagine anything that would work in the general case, but then there are folks much better than me at this kind of thing.
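The generate-and-check idea, with an explicit length bound so that it always terminates, can be sketched in Python as follows. The function name, alphabets and bounds are arbitrary choices of mine. Finding a string proves the two patterns overlap; finding nothing proves only that no overlap exists up to the bound.

```python
import re
from itertools import product

def find_common_string(regex1, regex2, alphabet, max_len):
    """Return the shortest string over `alphabet` (up to max_len) that both
    patterns match in full, or None if the bounded search finds nothing."""
    rx1, rx2 = re.compile(regex1), re.compile(regex2)
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            s = "".join(chars)
            if rx1.fullmatch(s) and rx2.fullmatch(s):
                return s
    return None

print(find_common_string(r"(ab)*", r"(aba)*b", "ab", 8))
print(find_common_string(r"\d+_[a-z]+", r"\d*", "0_a", 5))
```

For the two expressions in this answer the search stops at abab. For the pair from the question it comes back empty, which is consistent with, but does not establish, disjointness.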
All in all, this is a hard problem. I would be a bit surprised to learn that there is a polynomial-time solution, and I would not be at all surprised to learn that it is equivalent to the halting problem. Although, given that regular expressions are not Turing complete, it seems at least possible that a solution exists.
I would do the following:
convert each regex to a FSA, using something like the following structure:
struct FSANode
{
    bool accept;
    Map<char, FSANode> links;
}
List<FSANode> nodes;
FSANode start;
Note that this isn't trivial, but for simple regex shouldn't be that difficult.
Make a new Combined Node like:
class CombinedNode
{
    CombinedNode(FSANode left, FSANode right)
    {
        this.left = left;
        this.right = right;
    }
    Map<char, CombinedNode> links;
    bool valid { get { return !left.accept || !right.accept; } }
    public FSANode left;
    public FSANode right;
}
Build up links based on following the same char on the left and right sides, and you get two FSANodes which make a new CombinedNode.
Then start at CombinedNode(leftStart, rightStart), and find the spanning set, and if there are any non-valid CombinedNodes, the set isn't "orthogonal."
Convert each regular expression into a DFA. From the accept state of one DFA create an epsilon transition to the start state of the second DFA. You will in effect have created an NFA by adding the epsilon transition. Then convert the NFA into a DFA. If the start state is not the accept state, and the accept state is reachable, then the two regular expressions are not "orthogonal." (Since their intersection is non-empty.)
There are known procedures for converting a regular expression to a DFA, and converting an NFA to a DFA. You could look at a book like "Introduction to the Theory of Computation" by Sipser for the procedures, or just search around the web. No doubt many undergrads and grads had to do this for one "theory" class or another.
I spoke too soon. What I said in my original post would not work out, but there is a procedure for what you are trying to do if you can convert your regular expressions into DFA form.
You can find the procedure in the book I mentioned in my first post: "Introduction to the Theory of Computation" 2nd edition by Sipser. It's on page 46, with details in the footnote.
The procedure would give you a new DFA that is the intersection of the two DFAs. If the new DFA had a reachable accept state then the intersection is non-empty.
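The product construction itself is short once the two DFAs exist. The Python sketch below assumes each DFA is written by hand as a (start, accepting states, transition table) triple, since no regex-to-DFA conversion is included. It explores pairs of states breadth-first and returns a shortest string accepted by both machines, or None when no accepting pair is reachable. All names are illustrative.

```python
from collections import deque

def intersection_witness(dfa1, dfa2):
    """Breadth-first search over the product automaton.

    Each DFA is a (start, accepting_states, transitions) triple, where
    transitions maps (state, symbol) -> state; missing entries are dead ends.
    Returns a shortest string accepted by both, or None if there is none.
    """
    start1, accept1, delta1 = dfa1
    start2, accept2, delta2 = dfa2
    # Only symbols that both machines can read are worth trying.
    symbols = sorted({sym for _, sym in delta1} & {sym for _, sym in delta2})
    start = (start1, start2)
    seen = {start}
    queue = deque([(start, "")])
    while queue:
        (s1, s2), word = queue.popleft()
        if s1 in accept1 and s2 in accept2:
            return word
        for sym in symbols:
            if (s1, sym) in delta1 and (s2, sym) in delta2:
                nxt = (delta1[(s1, sym)], delta2[(s2, sym)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, word + sym))
    return None

# Hand-built DFAs (an assumption of this sketch; no regex parser included).
AB_STAR = (0, {0}, {(0, "a"): 1, (1, "b"): 0})  # (ab)*
ABA_STAR_B = (0, {3}, {(0, "a"): 1, (1, "b"): 2, (2, "a"): 0, (0, "b"): 3})  # (aba)*b
A_STAR = (0, {0}, {(0, "a"): 0})  # a*
B_PLUS = (0, {1}, {(0, "b"): 1, (1, "b"): 1})  # b+

print(intersection_witness(AB_STAR, ABA_STAR_B))
print(intersection_witness(A_STAR, B_PLUS))
```

Because at most |Q1| * |Q2| state pairs exist, the search always terminates, so a None result here really does mean the intersection is empty for the given DFAs.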