Regular Expressions with repeated characters - regex

I need to write a regular expression that can detect a string that contains only the characters x,y, and z, but where the characters are different from their neighbors.
Here is an example
xyzxzyz = Pass
xyxyxyx = Pass
xxyzxz = Fail (repeated x)
zzzxxzz = Fail (adjacent characters are repeated)
I thought that this would work ((x|y|z)?)*, but it does not seem to work. Any suggestions?
EDIT
Please note, I am looking for an answer that does not allow for look ahead or look behind operations. The only operations allowed are alternation, concatenation, grouping, and closure

Usually for this type of question, if the regex is not simple enough to be derived directly, you can start by drawing a DFA and derive a regex from there.
You should be able to derive the following DFA. q1, q2, q3, q4 are end states, with q1 also being the start state. q5 is the failed/trap state.
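The DFA can also be written down as a small transition function. The sketch below is one reading of it, assuming q2, q3 and q4 mean "the last character read was x, y or z" respectively (which matches the equations for R2, R3 and R4 further down). The names `step` and `accepts` are made up for this illustration.

```python
# State meanings are an assumption: q2/q3/q4 = last character was x/y/z,
# q1 = start state, q5 = trap state.
LAST_CHAR_STATE = {"x": "q2", "y": "q3", "z": "q4"}
ACCEPTING = {"q1", "q2", "q3", "q4"}


def step(state, ch):
    """Return the next state after reading ch in the given state."""
    if state == "q5" or ch not in LAST_CHAR_STATE:
        return "q5"  # trap: once failed, always failed
    target = LAST_CHAR_STATE[ch]
    # Reading the same character twice in a row leads to the trap state.
    return "q5" if target == state else target


def accepts(word):
    state = "q1"
    for ch in word:
        state = step(state, ch)
    return state in ACCEPTING


for w in ["xyzxzyz", "xyxyxyx", "xxyzxz", "zzzxxzz"]:
    print(w, "=", "Pass" if accepts(w) else "Fail")
```

Run on the four strings from the question, this reports Pass for the first two and Fail for the last two.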
There are several methods to find Regular Expression for a DFA. I am going to use Brzozowski Algebraic Method as explained in section 5 of this paper:
For each state qi, the equation for Ri is a union of terms: for a transition on symbol a from qi to qj, the term is aRj. In other words, you look at all the outgoing edges of a state. If qi is a final state, λ is also one of the terms.
Let me quote the identities from the definition section of the paper, since they will come in handy later (λ is the empty string and ∅ is the empty set):
(ab)c = a(bc) = abc
λx = xλ = x
∅x = x∅ = ∅
∅ + x = x
λ + x* = x*
(λ + x)* = x*
Since q5 is a trap state, its equation refers only to itself and never reaches a final state, so you can drop it from the equations. If you include it anyway, it ends up as the empty set and disappears (explained in the appendix).
You will come up with:
R1 = xR2 + yR3 + zR4 + λ
R2 = yR3 + zR4 + λ
R3 = xR2 + zR4 + λ
R4 = xR2 + yR3 + λ
Solve the equation above with substitution and Arden's theorem, which states:
Given an equation of the form X = AX + B where λ ∉ A, the equation has the solution X = A*B.
You will get to the answer.
I don't have time and confidence to derive the whole thing, but I will show the first few steps of derivation.
Remove R4 by substitution, note that zλ becomes z due to the identity:
R1 = xR2 + yR3 + (zxR2 + zyR3 + z) + λ
R2 = yR3 + (zxR2 + zyR3 + z) + λ
R3 = xR2 + (zxR2 + zyR3 + z) + λ
Regroup them:
R1 = (x + zx)R2 + (y + zy)R3 + z + λ
R2 = zxR2 + (y + zy)R3 + z + λ
R3 = (x + zx)R2 + zyR3 + z + λ
Apply Arden's theorem to R3:
R3 = (zy)*((x + zx)R2 + z + λ)
= (zy)*(x + zx)R2 + (zy)*z + (zy)*
You can substitute R3 back to R2 and R1 and remove R3. I leave the rest as exercise. Continue ahead and you should reach the answer.
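For anyone who wants to check their own result: one way the remaining steps can go (my own continuation, not part of the answer above) is to substitute R3 into R2, apply Arden's theorem to get R2 = (zx + (y + zy)(zy)*(x + zx))*((y + zy)(zy)*(z + λ) + z + λ), and then substitute both into R1. The sketch below builds that expression from its pieces and compares it by brute force against the plain definition "no character equals its neighbor". The helper names are illustrative only.

```python
import re
from itertools import product

# Pieces of the derivation, written with alternation, concatenation,
# grouping and closure only; "(z|)" stands for z + λ.
P = "(y|zy)(zy)*"                 # (y + zy)(zy)*
U = "(x|zx)"                      # x + zx
T = "(" + P + "(z|)|(z|))"        # (y + zy)(zy)*(z + λ) + z + λ
R2 = "(zx|" + P + U + ")*" + T    # Arden's theorem applied to R2
R1 = "((" + U + "|" + P + U + ")" + R2 + "|" + T + ")"


def no_adjacent_repeat(word):
    """Direct definition: no character equals its right-hand neighbor."""
    return all(a != b for a, b in zip(word, word[1:]))


def mismatches(pattern, max_len):
    """Words over {x, y, z} up to max_len where pattern and definition disagree."""
    compiled = re.compile(pattern)
    bad = []
    for n in range(max_len + 1):
        for letters in product("xyz", repeat=n):
            word = "".join(letters)
            if (compiled.fullmatch(word) is not None) != no_adjacent_repeat(word):
                bad.append(word)
    return bad


print(R1)
print(mismatches(R1, 7))
```

An empty list from `mismatches` means the candidate agrees with the definition on every string up to the chosen length. That is a sanity check, not a proof.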
Appendix
We will explain why trap states can be discarded from the equations, since they will just disappear anyway. Let us use the state q5 in the DFA as an example here.
R5 = (x + y + z)R5
Use identity ∅ + x = x:
R5 = (x + y + z)R5 + ∅
Apply Arden's theorem to R5:
R5 = (x + y + z)*∅
Use identity ∅x = x∅ = ∅:
R5 = ∅
The identity ∅x = x∅ = ∅ will also take effect when R5 is substituted into other equations, causing the term with R5 to disappear.

This should do what you want:
^(?!.*(.)\1)[xyz]*$
(Obviously, only on engines with lookahead)
The content itself is handled by the second part: [xyz]* (any number of x, y, or z characters). The anchors ^...$ are here to say that it has to be the entirety of the string. And the special condition (no adjacent pairs) is handled by a negative lookahead (?!.*(.)\1), which says that there must not be a character followed by the same character anywhere in the string.
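A quick way to try it, here with Python's `re` module (chosen only for illustration; any engine with lookahead and backreferences should behave the same on these inputs):

```python
import re

NO_REPEAT = re.compile(r"^(?!.*(.)\1)[xyz]*$")

for word in ["xyzxzyz", "xyxyxyx", "xxyzxz", "zzzxxzz"]:
    verdict = "Pass" if NO_REPEAT.match(word) else "Fail"
    print(word, "=", verdict)
```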

I've had an idea while I was walking today and put it on regex and I have yet to find a pattern that it doesn't match correctly. So here is the regex :
^((y|z)|((yz)*y?|(zy)*z?))?(xy|xz|(xyz(yz|yx|yxz)*y?)|(xzy(zy|zx|zxy)*z?))*x?$
Here is a fiddle to go with it!
If you find a pattern mismatch tell me I'll try to modify it! I know it's a bit late but I was really bothered by the fact that I couldn't solve it.
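Since mismatches were asked for, here is one that seems to slip through: a block such as xyz followed by yx ends in x, and the trailing x? can then add a second x. The sketch below checks the string xyzyxx with Python's `re` module; the helper `has_adjacent_repeat` is only there for illustration.

```python
import re

PATTERN = re.compile(
    r"^((y|z)|((yz)*y?|(zy)*z?))?"
    r"(xy|xz|(xyz(yz|yx|yxz)*y?)|(xzy(zy|zx|zxy)*z?))*x?$"
)


def has_adjacent_repeat(word):
    return any(a == b for a, b in zip(word, word[1:]))


candidate = "xyzyxx"  # ends in a doubled x, so it should be rejected
print(bool(PATTERN.match(candidate)), has_adjacent_repeat(candidate))
```

Both values come out True, so the pattern accepts a string that contains a repeated neighbor.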

I understand this is quite an old question with an accepted solution, but here is one more quick option for a related case: checking whether a string contains the same word character three times in a row.
Use the regular expression below:
String regex = "\\b\\w*(\\w)\\1\\1\\w*";
The cases below show what the expression returns:
Case 1: abcdddd or 123444
Result: Matched
Case 2: abcd or 1234
Result: Unmatched
Case 3: &*%$$$ (Special characters)
Result: Unmatched
Hope this will be helpful...
Thanks:)
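For reference, the same pattern tried in Python (an illustration only; the answer uses a Java string literal, and `search` here plays the role of a "find" style match):

```python
import re

# Python spelling of the Java string literal "\\b\\w*(\\w)\\1\\1\\w*".
TRIPLE = re.compile(r"\b\w*(\w)\1\1\w*")

for text in ["abcdddd", "123444", "abcd", "1234", "&*%$$$"]:
    print(text, "Matched" if TRIPLE.search(text) else "Unmatched")
```

Note that the pattern needs three identical word characters in a row, so a string like aabb is not matched.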


Sympy - Simplify expression within domain

Can Sympy automatically simplify an expression that includes terms like this one:
cos(x)/(cos(x)**2)**(1/2)
which can be simplified to 1 in the domain that I am interested in 0 <= x <= pi/2 ?
(Examples of other terms that could be simplified in that domain: acos(cos(x)); sqrt(sin(x)**2); sqrt(cos(2*x) + 1); etc.)
If you know the functions that are in your expression (such as sin, cos and tan), you can do the following according to this stack overflow question:
from sympy import *
x = symbols("x", positive=True)
ex = cos(x)/(cos(x)**2)**(S(1)/2)
ex = refine(ex, Q.positive(sin(x)))
ex = refine(ex, Q.positive(cos(x)))
ex = refine(ex, Q.positive(tan(x)))
print(ex)
Note that Q.positive(x*(pi/2-x)) did not help in the process of simplification for trig functions even though this is exactly what you want in general.
But what if you might have crazy functions like polygamma? The following works for some arbitrary choices for ex according to my understanding.
It wouldn't be a problem if the expression was already generated before by SymPy, but if you are inputting the expression manually, I suggest using S(1)/2 or Rational(1, 2) to describe one half.
from sympy import *
# define everything as it would have come from previous code
# also define another variable y to be positive
x, y = symbols("x y", positive=True)
ex = cos(x)/(cos(x)**2)**(S(1)/2)
# If you can, always use S(1)/2 or Rational(1, 2)
# when you define the fraction one half.
# If it's already a pre-calculated variable in sympy,
# it will already understand it as a half, and you
# wouldn't have any problems.
# ex = cos(x)/(cos(x)**2)**(S(1)/2)
# if x = arctan(y) and both are positive,
# then we have implicitly that 0 < x < pi/2
ex = simplify(ex.replace(x, atan(y)))
# revert back to old variable x if x is still present
ex = simplify(ex.replace(y, tan(x)))
print(ex)
This trick can also be used to define other ranges. For example, if you wanted 1 < x, then you could have x = exp(y) where y = Symbol("y", positive=True).
I think subs() will also work instead of replace() but I just like to be forceful with substitutions, since SymPy can sometimes ignore the subs() command for some variable types like lists and stuff.
You can substitute for a symbol that has the assumptions you want:
In [27]: e = cos(x)/(cos(x)**2)**(S(1)/2) + cos(x)
In [28]: e
Out[28]:
            cos(x)
cos(x) + ────────────
            _________
           ╱    2
         ╲╱  cos (x)
In [29]: cosx = Dummy('cosx', positive=True)
In [30]: e.subs(cos(x), cosx).subs(cosx, cos(x))
Out[30]: cos(x) + 1

SymPy: unable to simplify rather simple expression

I have an expression (expr, see below) that I am unable to simplify in SymPy. For real and positive x, expr is equivalent to x**3 + 2*x, but simplify and refine do not simplify the expression at all. (Mathematica does the simplification without any effort.)
How to simplify this expression with SymPy?
from sympy import *
x = var('x')
expr = 16*x**3/(-x**2 + sqrt(8*x**2 + (x**2 - 2)**2) + 2)**2 - 2*2**(S(4)/5)*x*(-x**2 + sqrt(8*x**2 + (x**2 - 2)**2) + 2)**(S(3)/5) + 10*x
expr1 = simplify(expr) # does nothing
expr2 = refine(expr, Q.positive(x)) # does nothing
It can be done!
I rescind my earlier answer. Your expression can be simplified using Sympy. Here's how:
import sympy as sym
x = sym.symbols('x', positive=True)
expr = 16*x**3/(-x**2 + sym.sqrt(8*x**2 + (x**2 - 2)**2) + 2)**2 - 2*2**(sym.S(4)/5)*x*(-x**2 + sym.sqrt(8*x**2 + (x**2 - 2)**2) + 2)**(sym.S(3)/5) + 10*x
sym.simplify(sym.factor(sym.factor(sym.expand(sym.radsimp(expr))), deep=True))
Output:
x*(x**2 + 2)
Basically, I dug through all of the docs on sympy.simplify until I found that magic combination. Also, you have to define x as positive when you create the symbol, just as I did in the code above.
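As a sanity check that does not depend on SymPy at all: the radicand 8*x**2 + (x**2 - 2)**2 expands to (x**2 + 2)**2, so the bracketed term collapses to 4 and the whole expression to x**3 - 8*x + 10*x. The sketch below compares both sides numerically with plain floats; the sample points are arbitrary.

```python
import math


def expr(x):
    # The radicand equals (x**2 + 2)**2, so `inner` is always 4
    # (up to floating point rounding).
    inner = -x**2 + math.sqrt(8 * x**2 + (x**2 - 2) ** 2) + 2
    return 16 * x**3 / inner**2 - 2 * 2 ** (4 / 5) * x * inner ** (3 / 5) + 10 * x


for x in [0.5, 1.0, 3.0, 10.0]:
    print(x, expr(x), x * (x**2 + 2))
```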
Comment on Mathematica
"Mathematica does the simplification without any effort"
I don't think you should ever underestimate the quantity of time and money that has gone into making the heuristic nightmare that is Mathematica's Simplify seem like it "just works". Sadly, in a lot of ways SymPy is still in its infancy in comparison. sympy.simplify is one of those ways.

Matching coefficients with sympy

I am attempting to work a problem from a textbook in sympy, but sympy fails to find a solution which appears valid. For interest, it is the design of a PID controller using direct synthesis with a second order plus dead time model.
The whole problem can be reduced to finding K_C, tau_I and tau_D which will make
K_C*(s**2*tau_D*tau_I + s*tau_I + 1)/(s*tau_I)
= (s**2*tau_1*tau_2 + s*tau_1 + s*tau_2 + 1)/(K*s*(-phi + tau_c))
for given tau_1, tau_2, K and phi.
I have tried to solve this by matching coefficients:
import sympy
s, tau_c, tau_1, tau_2, phi, K = sympy.symbols('s, tau_c, tau_1, tau_2, phi, K')
target = (s**2*tau_1*tau_2 + s*tau_1 + s*tau_2 + 1)/(K*s*(-phi + tau_c))
K_C, tau_I, tau_D = sympy.symbols('K_C, tau_I, tau_D', real=True)
PID = K_C*(1 + 1/(tau_I*s) + tau_D*s)
eq = (target - PID).together()
eq *= sympy.denom(eq).simplify()
eq = sympy.poly(eq, s)
sympy.solve(eq.coeffs(), [K_C, tau_I, tau_D])
This returns an empty list. However, the textbook provides the following solution:
booksolution = {K_C: 1/K*(tau_1 + tau_2)/(tau_c - phi),
tau_I: tau_1 + tau_2,
tau_D: tau_1*tau_2/(tau_1 + tau_2)}
Which appears to satisfy the equations I'm trying to solve:
[c.subs(booksolution).simplify() for c in eq.coeffs()]
returns
[0, 0, 0]
Can I massage this into a form which sympy can solve? What am I doing wrong?
Edit: This finds the correct solution, but requires a little too much thought from my side to order the equations:
eqs = eq.coeffs()
solution = {}
solution[K_C] = sympy.solve(eqs[1], K_C)[0]
solution[tau_D] = sympy.solve(eqs[0], tau_D)[0].subs(solution)
solution[tau_I] = sympy.solve(eqs[2], tau_I)[0].subs(solution).simplify()
In SymPy 1.0 (to be released soon) I get this answer
In [25]: sympy.solve(eq.coeffs(), [K_C, tau_I, tau_D])
Out[25]:
⎡                  ⎧     -(τ₁ + τ₂)         τ₁⋅τ₂               ⎫⎤
⎢{K_C: 0, τ_I: 0}, ⎨K_C: ───────────, τ_D: ───────, τ_I: τ₁ + τ₂⎬⎥
⎣                  ⎩     K⋅(φ - τ_c)       τ₁ + τ₂              ⎭⎦
which looks like your textbook's solution.
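To double check that the textbook values really make the two transfer functions equal, and that the K_C in the solver output is the same value written with the opposite sign convention, here is a small exact-arithmetic check using only the standard library. The sample values are arbitrary assumptions picked for the check.

```python
from fractions import Fraction as F


def pid(s, K_C, tau_I, tau_D):
    return K_C * (1 + 1 / (tau_I * s) + tau_D * s)


def target(s, tau_1, tau_2, K, phi, tau_c):
    return (s**2 * tau_1 * tau_2 + s * tau_1 + s * tau_2 + 1) / (K * s * (tau_c - phi))


# Arbitrary sample values (assumptions chosen only for this check).
tau_1, tau_2, K, phi, tau_c = F(2), F(3), F(5), F(1, 2), F(4)

# Textbook solution.
K_C = (tau_1 + tau_2) / (K * (tau_c - phi))
tau_I = tau_1 + tau_2
tau_D = tau_1 * tau_2 / (tau_1 + tau_2)

# K_C as printed by the solver: -(tau_1 + tau_2)/(K*(phi - tau_c)).
K_C_solver = -(tau_1 + tau_2) / (K * (phi - tau_c))
print(K_C == K_C_solver)

for s in [F(1), F(7, 3), F(-2)]:
    print(s, pid(s, K_C, tau_I, tau_D) == target(s, tau_1, tau_2, K, phi, tau_c))
```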

How to convert a regular grammar to regular expression?

Is there an algorithm or tool to convert regular grammar to regular expression?
Answer from dalibocai:
My goal is to convert a regular grammar to a DFA. Finally, I found an excellent tool: JFLAP.
A tutorial is available here: https://www2.cs.duke.edu/csed/jflap/tutorial/framebody.html
The algorithm is pretty straightforward once you can compute an automaton from your regular expression. For instance, for (aa*b|c), an automaton would be (arrows go to the right):
          a
         / \
      a  \ /  b
-> 0 ---> 1 ---> 2 ->
    \___________/
          c
Then just "enumerate" your transitions as rules. Below, consider that 0, 1, and 2 are nonterminal symbols, and of course a, b and c are the tokens.
0: a1 | c2
1: a1 | b2
2: epsilon
or, if you don't want empty right-hand sides:
0: a1 | c
1: a1 | b
And of course, the route in the other direction provides one means to convert a regular grammar into an automaton, hence a rational expression.
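The "enumerate your transitions as rules" step can be sketched as follows. The dictionary encoding of the automaton and the function name `automaton_to_grammar` are assumptions made for this example, not part of any library.

```python
# Automaton for (aa*b|c) as described above: states 0, 1, 2; state 2 is final.
TRANSITIONS = {(0, "a"): 1, (0, "c"): 2, (1, "a"): 1, (1, "b"): 2}
FINAL = {2}


def automaton_to_grammar(transitions, final):
    """Map each state to the list of right-hand sides of its rules."""
    states = sorted({s for s, _ in transitions} | set(transitions.values()) | final)
    rules = {state: [] for state in states}
    for (source, symbol), target in sorted(transitions.items()):
        rules[source].append(f"{symbol}{target}")
    for state in final:
        rules[state].append("epsilon")  # final states may stop here
    return rules


for state, alternatives in automaton_to_grammar(TRANSITIONS, FINAL).items():
    print(f"{state}: {' | '.join(alternatives)}")
```

This reproduces the first rule set (the one that keeps the epsilon rule for state 2).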
From a theoretical point of view, an algorithm to solve this problem works by creating a regular expression from each rule in the grammar, and solving the resulting system of equations for the initial symbol.
For example, for regular grammar ({S,A},{a,b,c},P,S):
P:
S -> aA | cS | a | c
A -> aA | a | bS
Take each non-terminal symbol and generate a regular expression from its right-hand side:
S = aA + cS + a + c
A = aA + bS + a
Solve the equation system for the initial symbol S, using the rule that X = xX + y has the solution X = x*y:
A = aA + (bS + a)
A = a*(bS + a)
A = a*bS + a⁺
S = a(a*bS + a⁺) + cS + a + c
S = a⁺bS + aa⁺ + cS + a + c
S = (a⁺b + c)S + a⁺ + c (because aa⁺ + a = a⁺)
S = (a⁺b + c)*(a⁺ + c)
Because all modifications were equivalent, (a⁺b + c)*(a⁺ + c) is a regular expression that describes exactly the words which can be produced from the initial symbol. Thus this expression is equivalent to the language generated by the grammar whose initial symbol is S.
I purposefully picked a grammar including cycles to portray the way the algorithm works. The hardest part is recognizing that X = xX + y is equivalent to X = x*y (for example, S = xS + x gives S = x⁺), then just doing the substitutions.
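A candidate expression can be checked against the grammar by brute force. The sketch below enumerates every word the example grammar derives up to a length bound and compares that set with the words matched by (a⁺b + c)*(a⁺ + c), which is what solving the two equations with the rule X = xX + y ⇒ X = x*y gives. The rule encoding and function names are made up for this illustration.

```python
import re
from itertools import product

# Right-linear grammar from the example: (terminal, next nonterminal or None).
RULES = {
    "S": [("a", "A"), ("c", "S"), ("a", None), ("c", None)],
    "A": [("a", "A"), ("a", None), ("b", "S")],
}


def words_up_to(max_len, start="S"):
    """All words of length <= max_len that the grammar derives from start."""
    words = set()
    frontier = [("", start)]
    while frontier:
        prefix, nonterminal = frontier.pop()
        for terminal, nxt in RULES[nonterminal]:
            word = prefix + terminal
            if len(word) > max_len:
                continue
            if nxt is None:
                words.add(word)
            else:
                frontier.append((word, nxt))
    return words


CANDIDATE = re.compile(r"(?:a+b|c)*(?:a+|c)")  # (a⁺b + c)*(a⁺ + c)


def regex_words_up_to(max_len):
    return {
        "".join(letters)
        for n in range(max_len + 1)
        for letters in product("abc", repeat=n)
        if CANDIDATE.fullmatch("".join(letters))
    }


print(words_up_to(6) == regex_words_up_to(6))
```

Agreement up to a length bound is only evidence, not a proof, but it catches dropped terms quickly.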
I'll leave this as an answer to this old question, in case that anybody finds it useful:
I have recently released a library for exactly that purpose:
https://github.com/rindPHI/grammar2regex
You can precisely convert regular grammars, but also compute approximate regular expressions for more general context-free grammars. The output format can be configured to be a custom ADT type or the regular expression format of the z3 SMT solver (z3.ReRef).
Internally, the tool converts grammars to finite automata. If you're interested in the automaton itself, you can call the method right_linear_grammar_to_nfa.

Regular expression puzzle

This is not homework, but an old exam question. I am curious to see the answer.
We are given an alphabet S={0,1,2,3,4,5,6,7,8,9,+}. Define the language L as the set of strings w from this alphabet such that w is in L if:
a) w is a number such as 42 or w is the (finite) sum of numbers such as 34 + 16 or 34 + 2 + 10
and
b) The number represented by w is divisible by 3.
Write a regular expression (and a DFA) for L.
This should work:
^(?:0|(?:(?:[369]|[147](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\
+)*[369]0*)*\+?(?:0\+)*[258])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]|0*(?:
\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])|[
258](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0
\+)*[147])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]|0*(?:\+?(?:0\+)*[369]0*)
*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]))0*)+)(?:\+(?:0|(?:(?
:[369]|[147](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)
*\+?(?:0\+)*[258])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]|0*(?:\+?(?:0\+)*
[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])|[258](?:0*(?
:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])*
(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]|0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)
*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]))0*)+))*$
It works by having three states representing the sum of the digits so far modulo 3. It disallows leading zeros on numbers, and plus signs at the start and end of the string, as well as two consecutive plus signs.
Generation of regular expression and test bed:
a = r'0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*'
b = r'a[147]'
c = r'a[258]'
r1 = '[369]|[147](?:bc)*(?:c|bb)|[258](?:cb)*(?:b|cc)'
r2 = '(?:0|(?:(?:' + r1 + ')0*)+)'
r3 = '^' + r2 + r'(?:\+' + r2 + ')*$'
# Expand the placeholders b and c first, then a.
r = r3.replace('b', b).replace('c', c).replace('a', a)
print(r)
# Test on 10000 examples.
import random, re
random.seed(1)
r = re.compile(r)
for _ in range(10000):
    x = ''.join(random.choice('0123456789+') for j in range(random.randint(1,50)))
    # Invalid input: leading '+', '++', a number with a leading zero, or a trailing '+'.
    if re.search(r'(?:\+|^)(?:\+|0[0-9])|\+$', x):
        valid = False
    else:
        valid = eval(x) % 3 == 0
    result = re.match(r, x) is not None
    if result != valid:
        print('Failed for ' + x)
Note that my memory of DFA syntax is woefully out of date, so my answer is undoubtedly a little broken. Hopefully this gives you a general idea. I've chosen to ignore + completely. As AmirW states, abc+def and abcdef are the same for divisibility purposes.
Accept state is C.
A=1,4,7,BB,AC,CA
B=2,5,8,AA,BC,CB
C=0,3,6,9,AB,BA,CC
Notice that the above language uses all 9 possible ABC pairings. It will always end at either A,B,or C, and the fact that every variable use is paired means that each iteration of processing will shorten the string of variables.
Example:
1490 = AACC = BCC = BC = B (Fail)
1491 = AACA = BCA = BA = C (Success)
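The reduction can be sketched in a few lines of Python. The table below is copied from the three rules above; the function name `final_class` and the left-to-right folding order are choices made for this illustration.

```python
from functools import reduce

# Digit classes and pairing table taken from the rules above.
DIGIT_CLASS = {**dict.fromkeys("147", "A"), **dict.fromkeys("258", "B"),
               **dict.fromkeys("0369", "C")}
PAIR = {"BB": "A", "AC": "A", "CA": "A",
        "AA": "B", "BC": "B", "CB": "B",
        "AB": "C", "BA": "C", "CC": "C"}


def final_class(number):
    """Fold the digit classes left to right; '+' signs are ignored."""
    classes = [DIGIT_CLASS[d] for d in number if d != "+"]
    # Note: an input without any digit makes reduce() raise a TypeError.
    return reduce(lambda left, right: PAIR[left + right], classes)


for n in ["1490", "1491"]:
    print(n, final_class(n), "Success" if final_class(n) == "C" else "Fail")
```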
Not a full solution, just an idea:
(B) alone: The "plus" signs don't matter here. abc + def is the same as abcdef for the sake of divisibility by 3. For the latter case, there is a regexp here: http://blog.vkistudios.com/index.cfm/2008/12/30/Regular-Expression-to-determine-if-a-base-10-number-is-divisible-by-3
To combine this with requirement (A), we can take the solution of (B) and modify it:
First read character must be in 0..9 (not a plus)
Input must not end with a plus, so: Duplicate each state (will use S for the original state and S' for the duplicate to distinguish between them). If we're in state S and we read a plus we'll move to S'.
When reading a number we'll go to the new state as if we were in S. S' states cannot accept (another) plus.
Also, S' is not "accept state" even if S is. (because input must not end with a plus).
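A minimal sketch of that construction, assuming the three original states are the digit sum modulo 3 and that the start state behaves like a primed state (a digit has to come first). Unlike the regex answer above, this sketch does not reject leading zeros. The function name `accepts` is illustrative only.

```python
DIGITS = "0123456789"


def accepts(text):
    """State = (digit sum mod 3, primed); primed means "just read '+'" or "nothing read yet"."""
    remainder, primed = 0, True  # start acts like a primed state: a digit must come first
    for ch in text:
        if ch == "+":
            if primed:
                return False  # leading '+' or two '+' in a row
            primed = True
        elif ch in DIGITS:
            remainder = (remainder + int(ch)) % 3
            primed = False
        else:
            return False  # symbol outside the alphabet
    # Primed states never accept, so the input cannot end with '+'.
    return not primed and remainder == 0


for text in ["42", "34+2", "34+16", "+3", "3+", "3++3"]:
    print(text, accepts(text))
```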