Second order combinatoric probabilities - combinations

Let's say I have a set of symbols s = {a,b,c, ... } and a corpus1:
a b g k.
o p a r b.
......
By simple counting I can calculate the probabilities p(sym), p(sym1,sym2), p(sym1|sym2).
I will use upper case for the set S, for CORPUS2 and for the PROBABILITIES related to them.
Now I create the set S of all pair combinations of s, S = { ab,ac,ad,bc,bd,... }, and build CORPUS2 from corpus1 in the following manner:
ab ag ak, bg bk, gk.
op oa or ob, pa pr pb, ar ab, rb.
......
i.e. all pair combinations within each sequence; order does not matter, so ab == ba. The commas are only a visual aid.
My question: is it possible to express the probabilities P(SYM), P(SYM1,SYM2), P(SYM1|SYM2) via p(sym), p(sym1,sym2), p(sym1|sym2), i.e. is there a formula?
PS: In my thinking I'm stuck at the following dilemma ...
p(sym) = count(sym) / n
but there seems to be no way to calculate P(SYM) without materializing CORPUS2, because it depends on p(sub-sym1) and p(sub-sym2) weighted by the lengths of the sequences they participate in. Here SYM = sub-sym1:sub-sym2.
maybe: ~P(SYM) = p(sub-sym1,sub-sym2) * p(sub-sym1) * p(sub-sym2) * avg-seq-len
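from itertools import combinations

def P(SYM, corpus1):
    count = total = 0
    for seq in corpus1:
        # a sequence of n symbols contributes n*(n-1)/2 unordered pairs
        total += len(seq) * (len(seq) - 1) // 2
        for sub_sym1, sub_sym2 in combinations(seq, 2):
            if {sub_sym1, sub_sym2} == set(SYM):
                count += 1
    return count / total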
There is a condition and a hidden/random parameter (the sequence length) involved ...
And what about P(SYM1,SYM2) and P(SYM1|SYM2)?
Probabilities are defined/calculated in the usual way by counting: for lower-case symbols using corpus1, and for upper-case symbols using CORPUS2.
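To make the counting definition concrete, here is a minimal sketch that materializes the pair counts of CORPUS2 for the two example sequences and reads P(SYM) off them. The function name pair_probabilities and the use of frozenset to make ab == ba are assumptions made for illustration; this only shows the counting baseline, not the closed-form formula being asked about.

```python
from collections import Counter
from itertools import combinations

def pair_probabilities(corpus):
    """Return P(SYM) for every unordered pair SYM, by counting over CORPUS2."""
    counts = Counter()
    for seq in corpus:
        # every unordered pair inside one sequence becomes one CORPUS2 token
        for x, y in combinations(seq, 2):
            counts[frozenset((x, y))] += 1
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}

# the two example sequences from the question
corpus1 = [list("abgk"), list("oparb")]
P = pair_probabilities(corpus1)
print(P[frozenset("ab")])  # 2 occurrences of ab out of 6 + 10 pairs -> 0.125
```

The two sequences produce 6 and 10 pairs respectively, which matches the CORPUS2 lines written out by hand above.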

Substitute numerical constants with symbols in sympy

I have a question similar to this one: How to substitute multiple symbols in an expression in sympy? but in reverse.
I have a sympy expression with numerical values and symbols alike. I would like to substitute all numerical values with symbolic constants. I appreciate that such a query is uncommon for sympy. What can I try next?
For example, I have:
-0.5967695*sin(0.15280747*x0 + 0.89256966) + 0.5967695*sin(sin(0.004289882*x0 - 1.5390939)) and would like to replace all numbers with a, b, c etc. ideally in a batch type of way.
The goal is to then apply trig identities to simplify the expression.
I'm not sure if there is already such a function. If there is not, it's quite easy to build one. For example:
import string
from sympy import Number, Wild, sin, symbols

def num2symbols(expr):
    # wild symbol to select all numbers
    w = Wild("w", properties=[lambda t: isinstance(t, Number)])
    # extract the numbers from the expression
    n = expr.find(w)
    # get a lowercase alphabet
    alphabet = list(string.ascii_lowercase)
    # create a symbol for each number
    s = symbols(" ".join(alphabet[:len(n)]))
    # create a dictionary mapping a number to a symbol
    d = {k: v for k, v in zip(n, s)}
    return d, expr.subs(d)

x0 = symbols("x0")
expr = -0.5967695*sin(0.15280747*x0 + 0.89256966) + 0.5967695*sin(sin(0.004289882*x0 - 1.5390939))
d, new_expr = num2symbols(expr)
print(new_expr)
# out: b*sin(c + d*x0) - b*sin(sin(a + f*x0))
print(d)
# {-1.53909390000000: a, -0.596769500000000: b, 0.892569660000000: c, 0.152807470000000: d, 0.596769500000000: e, 0.00428988200000000: f}
I feel like dict.setdefault was made for this purpose in Python :-)
>>> from sympy import Dummy, numbered_symbols, sign
>>> c = numbered_symbols('c',cls=Dummy)
>>> d = {}
>>> econ = expr.replace(lambda x:x.is_Float, lambda x: sign(x)*d.setdefault(abs(x),next(c)))
>>> undo = {v:k for k,v in d.items()}
Do what you want with econ; when you are done (after saving the results back to econ), undo restores the numbers:
>>> econ.xreplace(undo) == expr
True
(But if you change econ the exact equivalence may no longer hold.) This uses abs to store symbols so if the expression has constants that differ by a sign they will appear in econ with +/-ci instead of ci and cj.

ROT 13 Cipher: Creating a Function Python

I need to create a function that replaces a letter with the letter 13 letters after it in the alphabet (without using encode). I'm relatively new to Python so it has taken me a while to figure out a way to do this without using Encode.
Here's what I have so far. When I use this to type in a normal word like "hello" it works but if I pass through a sentence with special characters I can't figure out how to JUST include letters of the alphabet and skip numbers, spaces or special characters completely.
def rot13(b):
    b = b.lower()
    a = [chr(i) for i in range(ord('a'),ord('z')+1)]
    c = []
    d = []
    x = a[0:13]
    for i in b:
        c.append(a.index(i))
    for i in c:
        if i <= 13:
            d.append(a[i::13][1])
        elif i > 13:
            y = len(a[i:])
            z = len(x)- y
            d.append(a[z::13][0])
    e = ''.join(d)
    return e
EDIT
I tried using .isalpha() but this doesn't seem to be working for me - characters are duplicating for some reason when I use it. Is the following format correct:
def rot13(b):
    b1 = b.lower()
    a = [chr(i) for i in range(ord('a'),ord('z')+1)]
    c = []
    d = []
    x = a[0:13]
    for i in b1:
        if i.isalpha():
            c.append(a.index(i))
            for i in c:
                if i <= 12:
                    d.append(a[i::13][1])
                elif i > 12:
                    y = len(a[i:])
                    z = len(x)- y
                    d.append(a[z::13][0])
        else:
            d.append(i)
    if message[0].istitle() == True:
        d[0] = d[0].upper()
    e = ''.join(d)
    return e
Following on from the comments: the OP was advised to use isalpha and is wondering why that causes duplication (see the OP's edit).
This isn't tied to the use of isalpha; it has to do with the second for loop.
    for i in c:
isn't necessary and is causing the duplication: it sits inside the first loop, so every letter already stored in c is processed again each time a new letter is appended. You should remove it. Instead, you can get the same result with index = a.index(i). You were already computing this value, but appending it to a list instead, which caused the confusion.
Use the index variable anywhere you would have used i inside the for i in c loop. As a side note, try not to reuse the same variable name in nested for loops. It just causes confusion, but that's a matter for code review.
Assuming you do all that correctly, it should work.
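For reference, a minimal sketch of what the function could look like after those changes: a single loop, with the index looked up directly. It is not the OP's exact code. The modulo wrap-around and the case handling are additions made for illustration, and only ASCII letters are treated as letters.

```python
import string

def rot13(text):
    """Shift ASCII letters by 13 places; leave every other character untouched."""
    alphabet = string.ascii_lowercase
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in alphabet:
            # look the position up directly instead of collecting indices in a list
            index = alphabet.index(lower)
            shifted = alphabet[(index + 13) % 26]
            out.append(shifted.upper() if ch.isupper() else shifted)
        else:
            # digits, spaces and punctuation pass through unchanged
            out.append(ch)
    return "".join(out)

print(rot13("Hello, World! 123"))  # Uryyb, Jbeyq! 123
```

Applying the function twice returns the original text, which is a handy sanity check.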

Efficient way for the following code

I saw this problem on hackerrank.com. The task is to count the 4-letter palindromes that can be picked out of a given string, which may be a long one.
Constraint is as follows:
where, |s| is the length of the string and a,b,c,d are the positions of the corresponding letters in the palindrome.
I found a solution for this, but it isn't efficient enough: it fails with a 'time out' error during processing. The code is as follows:
s='kkkkkkz'
n=0
c_i,c_j,c_k,c_l=0,0,0,0
for i in range(len(s)):
    j=0;c_i+=1
    while j>=0 and j<len(s):
        c_j+=1
        if j>i:
            k=0
            while k>=0 and k<len(s):
                c_k+=1
                if k>j:
                    l=0
                    while l>=0 and l<len(s):
                        c_l+=1
                        if l>k:
                            a=s[i]+s[j]+s[k]+s[l]
                            if a[0]==a[3] and a[1]==a[2]: n+=1
                        l+=1
                k+=1
        j+=1
print(n)
I tried counting the number of times each loop runs, which right now is 7, 49, 147 and 245.
It is still better than the techniques I tried before, but I am not able to do better than this.
Suggestions, please?
One way is to use the following, but this will still not be efficient enough. It scores 12/40.
import itertools
s=WHATEVERSTRING
n=0
for a in itertools.combinations(s, 4):
    n += (a[0] == a[3])*(a[1]==a[2])
print(n)
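A working solution is to go down the following route: take each unique character of the string in turn as the outer letter of the palindrome, and keep a dictionary that counts the (outer letter, inner letter) pairs seen so far. Then count how often such a pair is followed by a second copy of the inner letter and closed by another copy of the outer letter.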
from collections import defaultdict as di
data = [x for x in s.strip()]
chars = set(data)
sum_a = 0
for c in chars:
    a = 0
    b = di(int)
    double_pairs = 0
    for d in data:
        if d == c:
            sum_a += double_pairs
            double_pairs += b[c]
            b[c]+=a
            a += 1
        else:
            double_pairs += b[d]
            b[d] += a
print(sum_a%(10**9+7))
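The same counting idea, wrapped in a function with descriptive names and cross-checked against the brute-force itertools version on small inputs. This is a sketch for illustration: the names count_fast, count_brute, pairs and triples are not from the original answer, and the brute-force check is only practical for short strings.

```python
from collections import defaultdict
from itertools import combinations

MOD = 10**9 + 7

def count_fast(s):
    """Count index tuples i<j<k<l with s[i]==s[l] and s[j]==s[k], modulo MOD."""
    total = 0
    for outer in set(s):
        seen_outer = 0            # occurrences of `outer` seen so far
        pairs = defaultdict(int)  # pairs[x]: (outer, x) index pairs seen so far
        triples = 0               # (outer, x, x) index triples seen so far
        for ch in s:
            if ch == outer:
                total += triples  # this `outer` closes every open triple
            triples += pairs[ch]
            pairs[ch] += seen_outer
            if ch == outer:
                seen_outer += 1
    return total % MOD

def count_brute(s):
    """Reference implementation: test every 4-element subsequence."""
    return sum(a == d and b == c for a, b, c, d in combinations(s, 4))

print(count_fast("kkkkkkz"))  # 15, the same as count_brute("kkkkkkz")
```

The fast version makes one pass over the string per distinct character, instead of visiting every 4-tuple.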

Split an equation string into a list of symbols with coefficient

I'd like a user to enter an equation such as :
"(-6/8) + (2/3)x + (-2/3)bar + (5/8) = (-2) + z + (-5/1245)foo"
and then get unordered lists of li elements such as
<li class='monome #{side}' data-value='-6/8' data-type='rationnal'></li>
or
<li class='monome #{side}' data-value='2/3' data-type='symbol' data-symbol='x'></li>
depending on the term's type, for each member of the equation...
An ugly solution would be:
member_as_html = (membre,side) ->
  html = "<ul>"
  for monome in membre
    m = monome.split(")")
    if m[1]
      html += "<li class='monome #{side}' data-value='#{m[0][1..]}' data-type='symbol' data-symbol='#{m[1]}'></li>"
    else
      html += "<li class='monome #{side}' data-value='#{m[0][1..]}' data-type='rationnal'></li>"
  html += "</ul>"

s = $( "#equation_string" ).val()
s = s.replace(/\s+/g, '').split("=")
ml = s[0].split("+")
mr = s[1].split("+")
ul_left = member_as_html(ml,"left")
ul_right = member_as_html(mr,"right")
but there's no verification of the string, nor any flexibility on symbol length.
Finally, to motivate people to help me with the regex, here's the link to my working project.
You can play with equations until you solve them; it's quite fun and useful for teachers:
http://jsfiddle.net/cphY2/
EDIT
For now, complex equations with arbitrary levels of parentheses and operator precedence, ln, exp and factorial would be too complicated for the current state of development. That's why I chose the convention of a simple equation made of a sum of terms. A term can be a rational, or a symbol (of any length) with a rational coefficient. Any (better) proposal about the convention used to enter the equation would be appreciated (and especially the fu##ing regex to go along with it!)
I don't know CoffeeScript, but here is a Python solution; maybe it will get you on the right track?
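s = "(-6/8) + (2/3)x + (-2/3)y + (5/8)"
terms = s.split(" + ")
D = []
for u in terms:
    if u[-1] == ')':
        # bare fraction such as "(5/8)"
        D.append((u, "frac"))
    else:
        # coefficient followed by a one-letter symbol such as "(2/3)x"
        D.append((u[:-1], u[-1]))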
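A slightly more flexible sketch of the same idea uses a regular expression, so that symbols can have any length and malformed terms are rejected. It assumes the convention from the question (terms joined by +, an optional parenthesised rational coefficient, an optional symbol). The names TERM, parse_side and parse_equation are made up for illustration, and giving a bare symbol the coefficient "1" is an assumption.

```python
import re

# optional "(coefficient)" followed by an optional symbol name
TERM = re.compile(r"(?:\((-?\d+(?:/\d+)?)\))?([A-Za-z]\w*)?")

def parse_side(side):
    """Split one member of the equation into (coefficient, symbol) tuples."""
    terms = []
    for raw in side.split("+"):
        raw = raw.strip()
        match = TERM.fullmatch(raw)
        if not raw or match is None:
            raise ValueError("cannot parse term: %r" % raw)
        coefficient, symbol = match.groups()
        # assumption: a bare symbol such as "z" has coefficient 1
        terms.append((coefficient or "1", symbol))
    return terms

def parse_equation(equation):
    """Return the parsed left and right members of 'left = right'."""
    left, right = equation.split("=")
    return parse_side(left), parse_side(right)

left, right = parse_equation(
    "(-6/8) + (2/3)x + (-2/3)bar + (5/8) = (-2) + z + (-5/1245)foo")
print(left)
print(right)
```

A symbol of None marks a plain rational term; turning each tuple into an li element is then a matter of string formatting.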

Regular expression puzzle

This is not homework, but an old exam question. I am curious to see the answer.
We are given an alphabet S={0,1,2,3,4,5,6,7,8,9,+}. Define the language L as the set of strings w from this alphabet such that w is in L if:
a) w is a number such as 42 or w is the (finite) sum of numbers such as 34 + 16 or 34 + 2 + 10
and
b) The number represented by w is divisible by 3.
Write a regular expression (and a DFA) for L.
This should work:
^(?:0|(?:(?:[369]|[147](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\
+)*[369]0*)*\+?(?:0\+)*[258])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]|0*(?:
\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])|[
258](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0
\+)*[147])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]|0*(?:\+?(?:0\+)*[369]0*)
*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]))0*)+)(?:\+(?:0|(?:(?
:[369]|[147](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)
*\+?(?:0\+)*[258])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]|0*(?:\+?(?:0\+)*
[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])|[258](?:0*(?
:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])*
(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]|0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)
*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]))0*)+))*$
It works by having three states representing the sum of the digits so far modulo 3. It disallows leading zeros on numbers, and plus signs at the start and end of the string, as well as two consecutive plus signs.
Generation of regular expression and test bed:
a = r'0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*'
b = r'a[147]'
c = r'a[258]'
r1 = '[369]|[147](?:bc)*(?:c|bb)|[258](?:cb)*(?:b|cc)'
r2 = '(?:0|(?:(?:' + r1 + ')0*)+)'
r3 = '^' + r2 + r'(?:\+' + r2 + ')*$'
r = r3.replace('b', b).replace('c', c).replace('a', a)
print(r)

# Test on 10000 examples.
import random, re
random.seed(1)
r = re.compile(r)
for _ in range(10000):
    x = ''.join(random.choice('0123456789+') for j in range(random.randint(1,50)))
    # reject leading/trailing/double plus signs and numbers with leading zeros
    if re.search(r'(?:\+|^)(?:\+|0[0-9])|\+$', x):
        valid = False
    else:
        valid = eval(x) % 3 == 0
    result = re.match(r, x) is not None
    if result != valid:
        print('Failed for ' + x)
Note that my memory of DFA syntax is woefully out of date, so my answer is undoubtedly a little broken. Hopefully this gives you a general idea. I've chosen to ignore + completely. As AmirW states, abc+def and abcdef are the same for divisibility purposes.
Accept state is C.
A=1,4,7,BB,AC,CA
B=2,5,8,AA,BC,CB
C=0,3,6,9,AB,BA,CC
Notice that the above language uses all 9 possible ABC pairings. It will always end at either A,B,or C, and the fact that every variable use is paired means that each iteration of processing will shorten the string of variables.
Example:
1490 = AACC = BCC = BC = B (Fail)
1491 = AACA = BCA = BA = C (Success)
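The reduction above can be written as a small table-driven sketch. The table entries are taken directly from the three rules; the names DIGIT_CLASS, COMBINE and final_state are made up for illustration, and + is simply skipped, as in the answer.

```python
from functools import reduce

# digit -> class, by remainder modulo 3 (A: 1, B: 2, C: 0)
DIGIT_CLASS = {d: "A" for d in "147"}
DIGIT_CLASS.update({d: "B" for d in "258"})
DIGIT_CLASS.update({d: "C" for d in "0369"})

# the nine pairings from the rules A=BB,AC,CA  B=AA,BC,CB  C=AB,BA,CC
COMBINE = {
    "BB": "A", "AC": "A", "CA": "A",
    "AA": "B", "BC": "B", "CB": "B",
    "AB": "C", "BA": "C", "CC": "C",
}

def final_state(w):
    """Reduce the digit classes of w left to right; C means divisible by 3."""
    classes = [DIGIT_CLASS[ch] for ch in w if ch != "+"]
    # starting from C is safe because C leaves the other class unchanged
    return reduce(lambda left, right: COMBINE[left + right], classes, "C")

print(final_state("1490"))  # B (fail)
print(final_state("1491"))  # C (success)
```

Reading the string left to right with this table is exactly a three-state DFA whose accept state is C.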
Not a full solution, just an idea:
(B) alone: The "plus" signs don't matter here. abc + def is the same as abcdef for the sake of divisibility by 3. For the latter case, there is a regexp here: http://blog.vkistudios.com/index.cfm/2008/12/30/Regular-Expression-to-determine-if-a-base-10-number-is-divisible-by-3
To combine this with requirement (A), we can take the solution of (B) and modify it:
First read character must be in 0..9 (not a plus)
Input must not end with a plus, so: Duplicate each state (will use S for the original state and S' for the duplicate to distinguish between them). If we're in state S and we read a plus we'll move to S'.
When reading a number we'll go to the new state as if we were in S. S' states cannot accept (another) plus.
Also, S' is not "accept state" even if S is. (because input must not end with a plus).
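The duplicated-state idea can be sketched as a DFA whose state is the pair (digit sum modulo 3, whether the last character was a plus). This is an illustration under assumptions: the name accepts is made up, the start state is treated like a primed state so that a leading plus is rejected, and leading zeros are not checked, since this answer does not cover them.

```python
def accepts(w):
    """Simulate the DFA with states S (after a digit) and S' (after a plus)."""
    remainder = 0
    primed = True  # assumption: the start state behaves like an S' state
    for ch in w:
        if ch == "+":
            if primed:
                return False  # S' states cannot read (another) plus
            primed = True     # S --plus--> S'
        elif ch in "0123456789":
            # reading a digit moves to the unprimed state for the new remainder
            remainder = (remainder + int(ch)) % 3
            primed = False
        else:
            return False      # character outside the alphabet
    # only unprimed states with remainder 0 accept
    return remainder == 0 and not primed

for example in ["42", "12+3", "34+16", "12+", "+12", "1++2"]:
    print(example, accepts(example))
```

With three remainders and two copies of each, the automaton has six states plus an implicit dead state for rejected input.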