Regular expression puzzle - regex

This is not homework, but an old exam question. I am curious to see the answer.
We are given an alphabet S={0,1,2,3,4,5,6,7,8,9,+}. Define the language L as the set of strings w from this alphabet such that w is in L if:
a) w is a number such as 42 or w is the (finite) sum of numbers such as 34 + 16 or 34 + 2 + 10
and
b) The number represented by w is divisible by 3.
Write a regular expression (and a DFA) for L.

This should work:
^(?:0|(?:(?:[369]|[147](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\
+)*[369]0*)*\+?(?:0\+)*[258])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]|0*(?:
\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])|[
258](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0
\+)*[147])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]|0*(?:\+?(?:0\+)*[369]0*)
*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]))0*)+)(?:\+(?:0|(?:(?
:[369]|[147](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)
*\+?(?:0\+)*[258])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]|0*(?:\+?(?:0\+)*
[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])|[258](?:0*(?
:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])*
(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]|0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)
*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]))0*)+))*$
It works by having three states representing the sum of the digits so far modulo 3. It disallows leading zeros on numbers, and plus signs at the start and end of the string, as well as two consecutive plus signs.
Generation of regular expression and test bed:
a = r'0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*'
b = r'a[147]'
c = r'a[258]'
r1 = '[369]|[147](?:bc)*(?:c|bb)|[258](?:cb)*(?:b|cc)'
r2 = '(?:0|(?:(?:' + r1 + ')0*)+)'
r3 = '^' + r2 + r'(?:\+' + r2 + ')*$'
# Expand the placeholders b and c first, then a (which b and c contain).
r = r3.replace('b', b).replace('c', c).replace('a', a)
print(r)

# Test on 10000 examples.
import random, re
random.seed(1)
r = re.compile(r)
for _ in range(10000):
    x = ''.join(random.choice('0123456789+') for j in range(random.randint(1,50)))
    # Invalid syntax: leading or doubled plus, leading zero, or trailing plus.
    if re.search(r'(?:\+|^)(?:\+|0[0-9])|\+$', x):
        valid = False
    else:
        valid = eval(x) % 3 == 0
    result = re.match(r, x) is not None
    if result != valid:
        print('Failed for ' + x)

Note that my memory of DFA syntax is woefully out of date, so my answer is undoubtedly a little broken. Hopefully this gives you a general idea. I've chosen to ignore + completely. As AmirW states, abc+def and abcdef are the same for divisibility purposes.
Accept state is C.
A=1,4,7,BB,AC,CA
B=2,5,8,AA,BC,CB
C=0,3,6,9,AB,BA,CC
Notice that the above language uses all 9 possible ABC pairings. It will always end at either A,B,or C, and the fact that every variable use is paired means that each iteration of processing will shorten the string of variables.
Example:
1490 = AACC = BCC = BC = B (Fail)
1491 = AACA = BCA = BA = C (Success)
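The pairing table above can be sketched in Python as a left-to-right fold over residue classes. This is only an illustration of the idea: the names RESIDUE, LETTER, digit_class, combine and reduce_word are my own, and, like the answer, the sketch skips + entirely and does no syntax checking.

```python
# A = 1 mod 3, B = 2 mod 3, C = 0 mod 3 (same letters as the table above).
RESIDUE = {"A": 1, "B": 2, "C": 0}
LETTER = {1: "A", 2: "B", 0: "C"}


def digit_class(ch):
    """Map a single digit character to its class letter."""
    return LETTER[int(ch) % 3]


def combine(left, right):
    """Reduce a pair of class letters to one letter, e.g. AA -> B."""
    return LETTER[(RESIDUE[left] + RESIDUE[right]) % 3]


def reduce_word(w):
    """Fold the digits of w pairwise from the left; '+' is skipped."""
    state = "C"  # the empty sum is 0
    for ch in w:
        if ch == "+":
            continue
        state = combine(state, digit_class(ch))
    return state


print(reduce_word("1490"))  # B (fail)
print(reduce_word("1491"))  # C (success)
```

Because combine is just addition modulo 3, the order in which pairs are reduced does not matter, which is why every reduction ends in a single A, B or C.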

Not a full solution, just an idea:
(B) alone: The "plus" signs don't matter here. abc + def is the same as abcdef for the sake of divisibility by 3. For the latter case, there is a regexp here: http://blog.vkistudios.com/index.cfm/2008/12/30/Regular-Expression-to-determine-if-a-base-10-number-is-divisible-by-3
to combine this with requirement (A), we can take the solution of (B) and modify it:
First read character must be in 0..9 (not a plus)
Input must not end with a plus, so: Duplicate each state (will use S for the original state and S' for the duplicate to distinguish between them). If we're in state S and we read a plus we'll move to S'.
When reading a number we'll go to the new state as if we were in S. S' states cannot accept (another) plus.
Also, S' is not "accept state" even if S is. (because input must not end with a plus).
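A minimal sketch of this duplicated-state idea, assuming the state is stored as a pair (digit sum modulo 3, "just read a plus" flag) instead of drawing S and S' explicitly. The function name accepts is made up for illustration. Like the idea above, it does not reject leading zeros, so it is not a complete solution for requirement (A).

```python
DIGITS = "0123456789"


def accepts(w):
    """Simulate the DFA: residue tracks the digit sum mod 3,
    after_plus marks the primed (S') states."""
    residue = 0
    # The start behaves like a primed state: the first character must be a digit.
    after_plus = True
    for ch in w:
        if ch == "+":
            if after_plus:
                return False  # S' cannot accept (another) plus
            after_plus = True  # move from S to S'
        elif ch in DIGITS:
            residue = (residue + int(ch)) % 3
            after_plus = False  # back to an unprimed state
        else:
            return False  # character outside the alphabet
    # S' states never accept, so the input cannot end with a plus.
    return (not after_plus) and residue == 0


print(accepts("34+2"))     # True  (36 is divisible by 3)
print(accepts("34+2+10"))  # False (46 is not)
print(accepts("3++3"))     # False (two consecutive plus signs)
```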


Second order combinatoric probabilities

Let's say I have a set of symbols s = {a, b, c, ...} and a corpus, corpus1:
a b g k.
o p a r b.
......
By simple counting I can calculate the probabilities p(sym), p(sym1,sym2) and p(sym1|sym2).
I will use upper case for the set S, for CORPUS2, and for the PROBABILITIES related to them.
Now I create the set S of all pair combinations of s, S = { ab, ac, ad, bc, bd, ... }, and I create CORPUS2 from corpus1 in the following manner:
ab ag ak, bg bk, gk.
op oa or ob, pa pr pb, ar ab, rb.
......
That is, all pair combinations; order does not matter (ab == ba). The commas are only for readability.
My question: is it possible to express the probabilities P(SYM), P(SYM1,SYM2), P(SYM1|SYM2) via p(sym), p(sym1,sym2), p(sym1|sym2), i.e. is there a formula?
PS: In my thinking I'm stuck at the following dilemma:
p(sym) = count(sym) / n
but there seems to be no way to calculate P(SYM) without materializing CORPUS2, because it depends on p(sub-sym1) and p(sub-sym2) multiplied by the length of the sequences they participate in, where SYM = sub-sym1:sub-sym2.
Maybe: ~P(SYM) = p(sub-sym1,sub-sym2) * p(sub-sym1) * p(sub-sym2) * avg-seq-len
P(SYM) =
    for seq in corpus1:
        # a sequence of length n yields n*(n-1)/2 pairs, e.g. 4 symbols give 6 pairs
        total += (len(seq) * (len(seq) - 1)) / 2
        for sub-sym1, sub-sym2 in combinations(seq, 2):
            if {sub-sym1, sub-sym2} == SYM:
                count += 1
    return count / total
So there is a condition and a hidden/random parameter (the sequence length) involved.
And what about P(SYM1,SYM2) and P(SYM1|SYM2)?
Probabilities are defined and calculated in the usual way, by counting: for lower-case symbols using corpus1, and for upper-case symbols using CORPUS2.
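For reference, the counting definition of P(SYM) can be sketched in Python without writing CORPUS2 out: the pairs are streamed from each line of corpus1 and only their counts are kept. This does not give the closed formula asked about; it only pins down what such a formula would have to reproduce. The function name pair_probabilities and the two-line sample corpus (the example lines with the periods dropped) are assumptions for illustration, and each symbol is assumed to occur at most once per line.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def pair_probabilities(corpus1):
    """Return P(SYM) for every unordered pair SYM, by counting."""
    counts = Counter()
    for line in corpus1:
        seq = line.split()
        # A line with n symbols contributes n*(n-1)/2 unordered pairs.
        for s1, s2 in combinations(seq, 2):
            counts[tuple(sorted((s1, s2)))] += 1  # ab == ba
    total = sum(counts.values())
    return {pair: Fraction(c, total) for pair, c in counts.items()}


corpus1 = ["a b g k", "o p a r b"]
P = pair_probabilities(corpus1)
print(P[("a", "b")])  # 1/8: ab occurs twice among 6 + 10 = 16 pairs
```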

Substitute numerical constants with symbols in sympy

I have a question similar to this one: How to substitute multiple symbols in an expression in sympy? but in reverse.
I have a sympy expression with numerical values and symbols alike. I would like to substitute all numerical values with symbolic constants. I appreciate that such query is uncommon for sympy. What can I try next?
For example, I have:
-0.5967695*sin(0.15280747*x0 + 0.89256966) + 0.5967695*sin(sin(0.004289882*x0 - 1.5390939)) and would like to replace all numbers with a, b, c etc. ideally in a batch type of way.
The goal is to then apply trig identities to simplify the expression.
I'm not sure if there is already such a function. If there is not, it's quite easy to build one. For example:
import string
from sympy import Number, Wild, sin, symbols

def num2symbols(expr):
    # wild symbol to select all numbers
    w = Wild("w", properties=[lambda t: isinstance(t, Number)])
    # extract the numbers from the expression
    n = expr.find(w)
    # get a lowercase alphabet (so at most 26 distinct numbers are supported)
    alphabet = list(string.ascii_lowercase)
    # create a symbol for each number
    s = symbols(" ".join(alphabet[:len(n)]))
    # create a dictionary mapping a number to a symbol
    d = {k: v for k, v in zip(n, s)}
    return d, expr.subs(d)

x0 = symbols("x0")
expr = -0.5967695*sin(0.15280747*x0 + 0.89256966) + 0.5967695*sin(sin(0.004289882*x0 - 1.5390939))
d, new_expr = num2symbols(expr)
print(new_expr)
# out: b*sin(c + d*x0) - b*sin(sin(a + f*x0))
print(d)
# {-1.53909390000000: a, -0.596769500000000: b, 0.892569660000000: c, 0.152807470000000: d, 0.596769500000000: e, 0.00428988200000000: f}
I feel like dict.setdefault was made for this purpose in Python :-)
>>> from sympy import Dummy, numbered_symbols, sign
>>> c = numbered_symbols('c',cls=Dummy)
>>> d = {}
>>> econ = expr.replace(lambda x:x.is_Float, lambda x: sign(x)*d.setdefault(abs(x),next(c)))
>>> undo = {v:k for k,v in d.items()}
Do what you want with econ and when done (after saving results to econ)
>>> econ.xreplace(undo) == expr
True
(But if you change econ the exact equivalence may no longer hold.) This uses abs to store symbols so if the expression has constants that differ by a sign they will appear in econ with +/-ci instead of ci and cj.

ROT 13 Cipher: Creating a Function Python

I need to create a function that replaces a letter with the letter 13 letters after it in the alphabet (without using encode). I'm relatively new to Python so it has taken me a while to figure out a way to do this without using Encode.
Here's what I have so far. When I use this to type in a normal word like "hello" it works but if I pass through a sentence with special characters I can't figure out how to JUST include letters of the alphabet and skip numbers, spaces or special characters completely.
def rot13(b):
    b = b.lower()
    a = [chr(i) for i in range(ord('a'),ord('z')+1)]
    c = []
    d = []
    x = a[0:13]
    for i in b:
        c.append(a.index(i))
    for i in c:
        if i <= 13:
            d.append(a[i::13][1])
        elif i > 13:
            y = len(a[i:])
            z = len(x)- y
            d.append(a[z::13][0])
    e = ''.join(d)
    return e
EDIT
I tried using .isalpha() but this doesn't seem to be working for me - characters are duplicating for some reason when I use it. Is the following format correct:
def rot13(b):
    b1 = b.lower()
    a = [chr(i) for i in range(ord('a'),ord('z')+1)]
    c = []
    d = []
    x = a[0:13]
    for i in b1:
        if i.isalpha():
            c.append(a.index(i))
            for i in c:
                if i <= 12:
                    d.append(a[i::13][1])
                elif i > 12:
                    y = len(a[i:])
                    z = len(x)- y
                    d.append(a[z::13][0])
        else:
            d.append(i)
    if message[0].istitle() == True:
        d[0] = d[0].upper()
    e = ''.join(d)
    return e
Following on from the comments: the OP was advised to use isalpha and is wondering why that causes duplication (see the OP's edit).
The duplication isn't tied to the use of isalpha; it comes from the second for loop:
for i in c:
That loop isn't necessary, and it is what causes the duplication, so you should remove it. Instead, take the index directly with index = a.index(i). You were already computing this, but appending it to a list instead, which caused the confusion.
Use the index variable anywhere you would have used i inside the for i in c loop. On a side note, try not to reuse the same variable name in nested for loops; it just causes confusion, but that's a matter for code review.
Assuming you do all that right, it should work.
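For comparison, here is one way the advice could look once applied: a single loop, the index taken directly, and non-letters passed through untouched. This is a sketch rather than the OP's final code; the modulo arithmetic replaces the slicing tricks, and handling upper case per character (instead of only the first letter) is my own assumption about the desired behaviour.

```python
import string


def rot13(text):
    """Rotate ASCII letters by 13 places; leave everything else as is."""
    alphabet = string.ascii_lowercase
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in alphabet:
            index = alphabet.index(lower)
            rotated = alphabet[(index + 13) % 26]  # wrap around after 'z'
            out.append(rotated.upper() if ch.isupper() else rotated)
        else:
            # digits, spaces and punctuation are copied unchanged
            out.append(ch)
    return "".join(out)


print(rot13("Hello, World! 123"))  # Uryyb, Jbeyq! 123
```

Applying rot13 twice returns the original text, which makes a handy sanity check.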

Understanding Recursive Function

I'm working through the book NLP with Python, and I came across this example from an 'advanced' section. I'd appreciate help understanding how it works. The function computes all possibilities of a number of syllables to reach a 'meter' length n. Short syllables "S" take up one unit of length, while long syllables "L" take up two units of length. So, for a meter length of 4, the return statement looks like this:
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
The function:
def virahanka1(n):
    if n == 0:
        return [""]
    elif n == 1:
        return ["S"]
    else:
        s = ["S" + prosody for prosody in virahanka1(n-1)]
        l = ["L" + prosody for prosody in virahanka1(n-2)]
        return s + l
The part I don't understand is how the 'SSL', 'SLS', and 'LSS' matches are made, if s and l are separate lists. Also in the line "for prosody in virahanka1(n-1)," what is prosody? Is it what the function is returning each time? I'm trying to think through it step by step but I'm not getting anywhere. Thanks in advance for your help!
Adrian
Let's just build the function from scratch. That's a good way to understand it thoroughly.
Suppose then that we want a recursive function to enumerate every combination of Ls and Ss to make a given meter length n. Let's just consider some simple cases:
n = 0: Only way to do this is with an empty string.
n = 1: Only way to do this is with a single S.
n = 2: You can do it with a single L, or two Ss.
n = 3: LS, SL, SSS.
Now, think about how you might build the answer for n = 4 given the above data. Well, the answer would either involve adding an S to a meter length of 3, or adding an L to a meter length of 2. So, the answer in this case would be LL, LSS from n = 2 and SLS, SSL, SSSS from n = 3. You can check that this is all possible combinations. We can also see that n = 2 and n = 3 can be obtained from n = 0,1 and n=1,2 similarly, so we don't need to special-case them.
Generally, then, for n ≥ 2, you can derive the strings for length n by looking at strings of length n-1 and length n-2.
Then, the answer is obvious:
if n = 0, return just an empty string
if n = 1, return a single S
otherwise, return the result of adding an S to all strings of meter length n-1, combined with the result of adding an L to all strings of meter length n-2.
By the way, the function as written is a bit inefficient because it recalculates a lot of values. That would make it very slow if you asked for e.g. n = 30. You can make it faster very easily by using functools.lru_cache, available since Python 3.2:
from functools import lru_cache

@lru_cache(maxsize=None)
def virahanka1(n):
    ...
This caches results for each n, making it much faster.
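A runnable sketch of the cached version, in case it helps. The name virahanka_cached is made up, and returning tuples instead of lists is my own choice so that a caller cannot accidentally modify a cached result.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def virahanka_cached(n):
    """Same recursion as virahanka1, but each n is computed only once."""
    if n == 0:
        return ("",)
    if n == 1:
        return ("S",)
    s = tuple("S" + prosody for prosody in virahanka_cached(n - 1))
    l = tuple("L" + prosody for prosody in virahanka_cached(n - 2))
    return s + l


print(list(virahanka_cached(4)))  # ['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
```

Note that the cache only removes the repeated work; the result list itself still grows like the Fibonacci numbers, so very large n will remain expensive.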
I tried to melt my brain. I added print statements to explain to me what was happening. I think the most confusing part about recursive calls is that it seems to go into the call forward but come out backwards, as you may see with the prints when you run the following code;
def virahanka1(n):
    if n == 4:
        print('Lets Begin for ', n)
    else:
        print('recursive call for ', n, '\n')
    if n == 0:
        print('n = 0 so adding "" to below')
        return [""]
    elif n == 1:
        print('n = 1 so returning S for below')
        return ["S"]
    else:
        print('next recursivly call ' + str(n) + '-1 for S')
        s = ["S" + prosody for prosody in virahanka1(n-1)]
        print('"S" + each string in s equals', s)
        if n == 4:
            print('**Above is the result for s**')
        print('n =', n, '\n', 'next recursivly call ' + str(n) + '-2 for L')
        l = ["L" + prosody for prosody in virahanka1(n-2)]
        print('\t', 'what was returned + each string in l now equals', l)
        if n == 4:
            print('**Above is the result for l**', '\n', '**Below is the end result of s + l**')
        print('returning s + l', s+l, 'for below', '\n', '='*70)
        return s + l

virahanka1(4)
Still confusing for me, but with this and Jocke's elegant explanation, I think I can understand what is going on.
How about you?
Below is what the code above produces;
Lets Begin for 4
next recursivly call 4-1 for S
recursive call for 3
next recursivly call 3-1 for S
recursive call for 2
next recursivly call 2-1 for S
recursive call for 1
n = 1 so returning S for below
"S" + each string in s equals ['SS']
n = 2
next recursivly call 2-2 for L
recursive call for 0
n = 0 so adding "" to below
what was returned + each string in l now equals ['L']
returning s + l ['SS', 'L'] for below
======================================================================
"S" + each string in s equals ['SSS', 'SL']
n = 3
next recursivly call 3-2 for L
recursive call for 1
n = 1 so returning S for below
what was returned + each string in l now equals ['LS']
returning s + l ['SSS', 'SL', 'LS'] for below
======================================================================
"S" + each string in s equals ['SSSS', 'SSL', 'SLS']
**Above is the result for s**
n = 4
next recursivly call 4-2 for L
recursive call for 2
next recursivly call 2-1 for S
recursive call for 1
n = 1 so returning S for below
"S" + each string in s equals ['SS']
n = 2
next recursivly call 2-2 for L
recursive call for 0
n = 0 so adding "" to below
what was returned + each string in l now equals ['L']
returning s + l ['SS', 'L'] for below
======================================================================
what was returned + each string in l now equals ['LSS', 'LL']
**Above is the result for l**
**Below is the end result of s + l**
returning s + l ['SSSS', 'SSL', 'SLS', 'LSS', 'LL'] for below
======================================================================
This function says that:
virahanka1(n) is the same as [""] when n is zero, ["S"] when n is 1, and s + l otherwise.
Where s is the same as the result of "S" prepended to each elements in the resulting list of virahanka1(n - 1), and l the same as "L" prepended to the elements of virahanka1(n - 2).
So the computation would be:
When n is 0:
[""]
When n is 1:
["S"]
When n is 2:
s = ["S" + "S"]
l = ["L" + ""]
s + l = ["SS", "L"]
When n is 3:
s = ["S" + "SS", "S" + "L"]
l = ["L" + "S"]
s + l = ["SSS", "SL", "LS"]
When n is 4:
s = ["S" + "SSS", "S" + "SL", "S" + "LS"]
l = ["L" + "SS", "L" + "L"]
s + l = ["SSSS", "SSL", "SLS", "LSS", "LL"]
And there you have it, step by step.
You need to know the results of the other function calls in order to calculate the final value, which can be pretty messy to do manually, as you can see. It is important, though, that you do not try to think recursively in your head. This would cause your mind to melt. I described the function in words so that you can see that these kinds of functions are descriptions, not sequences of commands.
The prosody you see in the definitions of s and l is a variable. It is used in a list comprehension, which is a way of building lists: it takes, in turn, each string returned by the recursive call. I've described earlier how this list is built.

Regular Expressions with repeated characters

I need to write a regular expression that can detect a string that contains only the characters x,y, and z, but where the characters are different from their neighbors.
Here is an example
xyzxzyz = Pass
xyxyxyx = Pass
xxyzxz = Fail (repeated x)
zzzxxzz = Fail (adjacent characters are repeated)
I thought that this would work ((x|y|z)?)*, but it does not seem to work. Any suggestions?
EDIT
Please note, I am looking for an answer that does not allow for look ahead or look behind operations. The only operations allowed are alternation, concatenation, grouping, and closure
Usually for this type of question, if the regex is not simple enough to be derived directly, you can start from drawing a DFA and derive a regex from there.
You should be able to derive the following DFA. q1, q2, q3, q4 are end states, with q1 also being the start state. q5 is the failed/trap state.
There are several methods to find Regular Expression for a DFA. I am going to use Brzozowski Algebraic Method as explained in section 5 of this paper:
For each state qi, the equation Ri is a union of terms: for a transition a from qi to qj, the term is aRj. Basically, you will look at all the outgoing edges from a state. If qi is a final state, λ is also one of the terms.
Let me quote the identities from the definition section of the paper, since they will come in handy later (λ is the empty string and ∅ is the empty set):
(ab)c = a(bc) = abc
λx = xλ = x
∅x = x∅ = ∅
∅ + x = x
λ + x* = x*
(λ + x)* = x*
Since q5 is a trap state, the formula will end up an infinite recursion, so you can drop it in the equations. It will end up as empty set and disappear if you include it in the equation anyway (explained in the appendix).
You will come up with:
R1 = xR2 + yR3 + zR4 + λ
R2 = yR3 + zR4 + λ
R3 = xR2 + zR4 + λ
R4 = xR2 + yR3 + λ
Solve the equation above with substitution and Arden's theorem, which states:
Given an equation of the form X = AX + B where λ ∉ A, the equation has the solution X = A*B.
You will get to the answer.
I don't have time and confidence to derive the whole thing, but I will show the first few steps of derivation.
Remove R4 by substitution, note that zλ becomes z due to the identity:
R1 = xR2 + yR3 + (zxR2 + zyR3 + z) + λ
R2 = yR3 + (zxR2 + zyR3 + z) + λ
R3 = xR2 + (zxR2 + zyR3 + z) + λ
Regroup them:
R1 = (x + zx)R2 + (y + zy)R3 + z + λ
R2 = zxR2 + (y + zy)R3 + z + λ
R3 = (x + zx)R2 + zyR3 + z + λ
Apply Arden's theorem to R3:
R3 = (zy)*((x + zx)R2 + z + λ)
= (zy)*(x + zx)R2 + (zy)*z + (zy)*
You can substitute R3 back to R2 and R1 and remove R3. I leave the rest as exercise. Continue ahead and you should reach the answer.
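Whatever expression you end up with, it helps to have the DFA itself to test against. Below is a small Python simulation, assuming the transitions implied by the equations for R1 to R4 (x leads to q2, y to q3, z to q4, and repeating the last character leads to the trap state q5). The names TRANSITIONS and dfa_accepts are made up for this sketch.

```python
# Transitions read off the equations R1..R4; anything missing goes to q5.
TRANSITIONS = {
    "q1": {"x": "q2", "y": "q3", "z": "q4"},  # start state
    "q2": {"y": "q3", "z": "q4"},             # last character was x
    "q3": {"x": "q2", "z": "q4"},             # last character was y
    "q4": {"x": "q2", "y": "q3"},             # last character was z
}


def dfa_accepts(s):
    """Run the DFA on s; q1..q4 accept, q5 is the trap state."""
    state = "q1"
    for ch in s:
        # q5 has no entry, so once trapped we stay trapped.
        state = TRANSITIONS.get(state, {}).get(ch, "q5")
    return state != "q5"


for word in ["xyzxzyz", "xyxyxyx", "xxyzxz", "zzzxxzz"]:
    print(word, "Pass" if dfa_accepts(word) else "Fail")
```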
Appendix
We will explain why trap states can be discarded from the equations, since they will just disappear anyway. Let us use the state q5 in the DFA as an example here.
R5 = (x + y + z)R5
Use identity ∅ + x = x:
R5 = (x + y + z)R5 + ∅
Apply Arden's theorem to R5:
R5 = (x + y + z)*∅
Use identity ∅x = x∅ = ∅:
R5 = ∅
The identity ∅x = x∅ = ∅ will also take effect when R5 is substituted into other equations, causing the term with R5 to disappear.
This should do what you want:
^(?!.*(.)\1)[xyz]*$
(Obviously, only on engines with lookahead)
The content itself is handled by the second part: [xyz]* (any number of x, y, or z characters). The anchors ^...$ are here to say that it has to be the entirety of the string. And the special condition (no adjacent pairs) is handled by a negative lookahead (?!.*(.)\1), which says that there must not be a character followed by the same character anywhere in the string.
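A quick way to try it in Python, using the examples from the question. The helper name is_valid is made up; the pattern is the one given above, unchanged.

```python
import re

# Negative lookahead rejects any character followed by itself.
NO_REPEAT = re.compile(r"^(?!.*(.)\1)[xyz]*$")


def is_valid(s):
    """True if s uses only x, y, z and no two neighbours are equal."""
    return NO_REPEAT.match(s) is not None


for word in ["xyzxzyz", "xyxyxyx", "xxyzxz", "zzzxxzz"]:
    print(word, "Pass" if is_valid(word) else "Fail")
```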
I've had an idea while I was walking today and put it on regex and I have yet to find a pattern that it doesn't match correctly. So here is the regex :
^((y|z)|((yz)*y?|(zy)*z?))?(xy|xz|(xyz(yz|yx|yxz)*y?)|(xzy(zy|zx|zxy)*z?))*x?$
Here is a fiddle to go with it!
If you find a pattern mismatch tell me I'll try to modify it! I know it's a bit late but I was really bothered by the fact that I couldn't solve it.
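One way to hunt for mismatches is to brute-force every string over x, y, z up to some length and compare the regex with a direct neighbour check. The sketch below does that in Python; find_mismatches and no_equal_neighbours are names invented here. If I trace the pattern correctly, xyzyxx slips through (xyz, then yx, then the trailing x?) even though it ends in xx, so the expression seems to need another revision.

```python
import itertools
import re

PATTERN = re.compile(
    r"^((y|z)|((yz)*y?|(zy)*z?))?"
    r"(xy|xz|(xyz(yz|yx|yxz)*y?)|(xzy(zy|zx|zxy)*z?))*x?$"
)


def no_equal_neighbours(s):
    """Reference check: no character equals the one right after it."""
    return all(a != b for a, b in zip(s, s[1:]))


def find_mismatches(max_len):
    """Return all strings up to max_len where regex and reference disagree."""
    mismatches = []
    for n in range(max_len + 1):
        for chars in itertools.product("xyz", repeat=n):
            s = "".join(chars)
            if (PATTERN.match(s) is not None) != no_equal_neighbours(s):
                mismatches.append(s)
    return mismatches


print(find_mismatches(6))
```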
I understand this is quite an old question and already has an accepted solution, but I am posting one more quick option for a related case: checking whether a string contains the same character three times in a row.
Use the regular expression below:
String regex = "\\b\\w*(\\w)\\1\\1\\w*";
Cases and the results this expression returns:
Case 1: abcdddd or 123444
Result: Matched
Case 2: abcd or 1234
Result: Unmatched
Case 3: &*%$$$ (Special characters)
Result: Unmatched
Hope this will be helpful...
Thanks:)