I have the following expression in Sympy
s = e0*a01*d1**2*u0 - e0*a01*d1**2*u1 - e0*a11*d1**2*u0 - e0*a11*d1**2*u1 + e0*d0*a00*d1*u1 + e0*d0*a01*d1*u0 + e0*d0*a10*d1*u0 - e0*d0*a11*d1*u1 + e0*d0*b0*u0 - e0*d0*b1*u1 + e0*d1*a00*d1*u0 - e0*d1*a01*d1*u1 - e0*d1*a10*d1*u1 - e0*d1*a11*d1*u0 - e0*d1*b0*u1 - e0*d1*b1*u0 - e1*a00*d1**2*u0 + e1*a00*d1**2*u1 + e1*a10*d1**2*u0 + e1*a10*d1**2*u1 - e1*d0*a00*d1*u0 + e1*d0*a01*d1*u1 + e1*d0*a10*d1*u1 + e1*d0*a11*d1*u0 + e1*d0*b0*u1 + e1*d0*b1*u0 + e1*d1*a00*d1*u1 + e1*d1*a01*d1*u0 + e1*d1*a10*d1*u0 - e1*d1*a11*d1*u1 + e1*d1*b0*u0 - e1*d1*b1*u1
So first I sympify it:
s = sympify(s,locals=T)
(T contains all the symbols in the string, which are noncommutative.) I want to get the coefficient of
d1**2*u0
after "factoring" it. So I did the following:
e=sympify(d1**2*u0,locals=T)
collected_expr = collect(s,e,exact=True)
print(collected_expr)
coeff = collected_expr.coeff(e)
print(coeff)
The result of collected_expr is ok:
d1**2*u0*(e0*a01 - e0*a11 - e1*a00 + e1*a10) - e0*a01*d1**2*u1 - e0*a11*d1**2*u1 + e0*d0*a00*d1*u1 + e0*d0*a01*d1*u0 + e0*d0*a10*d1*u0 - e0*d0*a11*d1*u1 + e0*d0*b0*u0 - e0*d0*b1*u1 + e0*d1*a00*d1*u0 - e0*d1*a01*d1*u1 - e0*d1*a10*d1*u1 - e0*d1*a11*d1*u0 - e0*d1*b0*u1 - e0*d1*b1*u0 + e1*a00*d1**2*u1 + e1*a10*d1**2*u1 - e1*d0*a00*d1*u0 + e1*d0*a01*d1*u1 + e1*d0*a10*d1*u1 + e1*d0*a11*d1*u0 + e1*d0*b0*u1 + e1*d0*b1*u0 + e1*d1*a00*d1*u1 + e1*d1*a01*d1*u0 + e1*d1*a10*d1*u0 - e1*d1*a11*d1*u1 + e1*d1*b0*u0 - e1*d1*b1*u1
But coeff is not ok, as it returns 1, but I really want
e0*a01 - e0*a11 - e1*a00 + e1*a10
EDIT: I also tried
coeff = collected_expr.coeff(u0).coeff(d1).coeff(d1)
and
coeff = collected_expr.coeff(u0).coeff(d1**2)
But both things returned 0
The docstring of Expr.coeff says
When x is noncommutative, the coefficient to the left (default) or
right of x can be returned. The keyword 'right' is ignored when
x is commutative.
collect does not seem to be noncommutative-aware, however, so the factors that were on the right may collect to the left.
>>> var("A B", commutative=False)
(A, B)
>>> collect(A*B+B*A**2,B)
B*(A + A**2)
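As a rough illustration of a noncommutative-aware alternative, the sketch below walks the expanded terms and keeps only those whose noncommutative factors end with the target product, then sums what stands to the left. The helper name left_coeff is hypothetical (it is not a SymPy function), and the expression is a shortened version of the one in the question; only Add.make_args, expand and args_cnc are SymPy calls.

```python
from sympy import Add, Mul, expand, symbols

# Noncommutative symbols named as in the question (a subset, for brevity).
e0, e1, a00, a01, a10, a11, d0, d1, u0, u1 = symbols(
    "e0 e1 a00 a01 a10 a11 d0 d1 u0 u1", commutative=False
)

def left_coeff(expr, tail):
    """Hypothetical helper: sum the left factors of every term whose
    noncommutative part ends with the noncommutative product `tail`."""
    tail_nc = tail.args_cnc()[1]          # noncommutative factors, e.g. [d1**2, u0]
    k = len(tail_nc)
    pieces = []
    for term in Add.make_args(expand(expr)):
        c, nc = term.args_cnc()           # commutative / noncommutative factors
        if len(nc) >= k and nc[-k:] == tail_nc:
            # Keep the sign/number and whatever stands to the left of `tail`.
            pieces.append(Mul(*c) * Mul(*nc[:-k]))
    return Add(*pieces)

s = (e0*a01*d1**2*u0 - e0*a11*d1**2*u0 - e1*a00*d1**2*u0 + e1*a10*d1**2*u0
     + e0*d1*a00*d1*u0 - e0*a01*d1**2*u1)
print(left_coeff(s, d1**2*u0))
```

A term such as e0*d1*a00*d1*u0 is skipped, because its factors end in d1, u0 rather than d1**2, u0.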
I don't understand why the expression a * (... + 1) - a is not reduced during simplification. The example below shows the problem:
import sympy as sy
a,b,c = sy.symbols('a b c')
expr = a * (b - c + 1) - a + (b - c) * (a - b)
print(expr) # printed: a*(b - c + 1) - a + (a - b)*(b - c)
print(expr.simplify()) # printed: a*(b - c + 1) - a + (a - b)*(b - c)
On the other hand, if I change the expression to
expr = a * (b - c + 1) - a
and call simplify(), I will obtain the expected result a * (b - c).
Sympy version is 1.1rc1.
simplify can usually do only a limited amount of magic. Arguably it could do more in this case, but if you want that, you need to make a feature request. In any case it's better to tell SymPy which specific kind of transformation you want.
Here, the following will probably satisfy you:
print(expr.factor()) # (2*a - b)*(b - c)
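To make "tell SymPy what you want" concrete, here is a small sketch that applies two explicit transformations to the expression from the question. It only uses sympy.expand and sympy.factor; the variable names expanded and factored are just for illustration.

```python
import sympy as sy

a, b, c = sy.symbols("a b c")
expr = a * (b - c + 1) - a + (b - c) * (a - b)

expanded = sy.expand(expr)   # multiply everything out; the "+ a - a" cancels
factored = sy.factor(expr)   # rewrite the polynomial as a product of factors

print(expanded)
print(factored)
```

Either call states the intended form explicitly instead of leaving the choice to the heuristics inside simplify.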
I have two univariate functions, f(x) and g(x), and I'd like to substitute g(x) = y to rewrite f(x) as some f2(y).
Here is a simple example that works:
In [240]: x = Symbol('x')
In [241]: y = Symbol('y')
In [242]: f = abs(x)**2 + 6*abs(x) + 5
In [243]: g = abs(x)
In [244]: f.subs({g: y})
Out[244]: y**2 + 6*y + 5
But now, if I try a slightly more complex example, it fails:
In [245]: h = abs(x) + 1
In [246]: f.subs({h: y})
Out[246]: Abs(x)**2 + 6*Abs(x) + 5
Is there a general approach that works for this problem?
The expression abs(x)**2 + 6*abs(x) + 5 does not actually contain abs(x) + 1 anywhere, so there is nothing to substitute for.
One can imagine changing it to abs(x)**2 + 5*(abs(x) + 1) + abs(x), with the substitution result being abs(x)**2 + 5*y + abs(x). Or maybe changing it to abs(x)**2 + 6*(abs(x) + 1) - 1, with the result being abs(x)**2 + 6*y - 1. There are other choices too. What should the result be?
There is no general approach to this task because it's not a well-defined task to begin with.
In contrast, the substitution f.subs(abs(x), y-1) is a clear instruction to replace all occurrences of abs(x) in the expression tree with y-1. It returns 6*y + (y - 1)**2 - 1.
The substitution above of abs(x) + 1 in abs(x)**2 + 6*abs(x) + 5 is a clear instruction too: to find exact occurrences of the expression abs(x) + 1 in the syntax tree of the expression abs(x)**2 + 6*abs(x) + 5, and replace those subtrees with y. Since there are no such occurrences, nothing changes. There is a caveat about heuristics though.
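A short sketch of the two substitutions discussed above, using the f from the question. The names unchanged and rewritten are only for illustration.

```python
from sympy import Abs, Symbol, expand

x = Symbol("x")
y = Symbol("y")
f = Abs(x)**2 + 6*Abs(x) + 5

# Abs(x) + 1 is not a subtree of f, so subs has nothing to replace.
unchanged = f.subs(Abs(x) + 1, y)

# Abs(x) does occur in the tree, so every occurrence becomes y - 1.
rewritten = expand(f.subs(Abs(x), y - 1))

print(unchanged)
print(rewritten)
```

Solving g(x) = y for the subexpression by hand (here abs(x) = y - 1) and substituting that is what makes the task well defined.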
Aside: in addition to subs SymPy has a method .replace which supports wildcards, but I don't expect it to help here. In my experience, it is overeager to replace:
>>> a = Wild('a')
>>> b = Wild('b')
>>> f.replace(a*(abs(x) + 1) + b, a*y + b)
5*y/(Abs(x) + 1) + 6*y*Abs(x*y)/(Abs(x) + 1)**2 + (Abs(x*y)/(Abs(x) + 1))**(2*y/(Abs(x) + 1))
Eliminate a variable
There is no "eliminate" in SymPy. One can attempt to emulate it with solve by introducing another variable, e.g.,
fn = Symbol('fn')
solve([Eq(fn, f), Eq(abs(x) + 1, y)], [fn, x])
which attempts to solve for "fn" and "x", and therefore the solution for "fn" is an expression without x. If this works.
In fact, it does not work with abs(); solving for something that sits inside an absolute value is not implemented in SymPy. Here is a workaround.
fn, ax = symbols('fn ax')
solve([Eq(fn, f.subs(abs(x), ax)), Eq(ax + 1, y)], [fn, ax])
This outputs [(y*(y + 4), y - 1)] where the first term is what you want; a solution for fn.
What will be the output of the following code?
int x,a=3;
x=+ +a+ + +a+ + +5;
printf("%d %d",x,a);
The output is: 11 3. I want to know how, and what does the + sign after a mean?
I think DrYap has it right.
x = + + a + + + a + + + 5;
is the same as:
x = + (+ a) + (+ (+ a)) + (+ (+ 5));
The key points here are:
1) C and C++ don't have + as a postfix operator, so we know we have to interpret it as a prefix operator
2) monadic + binds more tightly (has higher precedence) than dyadic +
Funny, isn't it? If these were - signs it wouldn't look so strange. Monadic +/- is just a leading sign, or to put it another way, "+x" is the same as "0+x".
The + after a just gets seen as a + before the next value. If you use consistent spacing it is the same as:
x = + + a + + + a + + + 5;
But not all the +s are necessary so it will act the same as doing:
x = a + a + 5;
The value of a is unchanged because you have never used the incrementing operator which is ++ with no white space between the two + symbols. + and ++ are two separate operators.
Since the + operators are never two next to each other but always separated by a white space the statement
x=+ +a+ + +a+ + +5; is actually read as
x=+ (nothing)+a+(nothing) +(nothing) +a+(nothing) +(nothing) +5;
so basically the final equation becomes of the sort
x=a+a+5; and hence the result.
The code seems to be equivalent to:
x= (+(+(a)))+ (+ (+(a)))+ (+(+(5)));
That is, x = a + a + 5, which is 11. You know that you can put a + or - sign before a number, right? Here those + signs merely indicate the sign of the operand. Since the sign is +, the value remains unchanged: "+5" means "5", so "+a" means "a", and "+ +a" means "+(+a)", which means "a". In the same fashion you could write x = + + + 3 + + + + 3 + + + + 5. Or x = - + + - 3 + - + - 3 - - + 5;.
x=+ +a+ + +a+ + +5: this is equivalent to
x = + +a + + +a + + +5, or
we can write it as x = + (+ a) + (+ (+ a)) + (+ (+ 5))
and the +'s are only indicating the signs which will be finally evaluated as
x = a + a + 5.
Good night,
I have been working with fuzzy string matching for some time now, and using C with some pointers I could write a very fast (for my needs) implementation of the Levenshtein distance between two strings. I tried to port the code to C# using unsafe code and the fixed keyword, but the performance was much worse. So I chose to build a C++ DLL and use [DllImport] from C#, automatically marshalling every string. The problem is that, after profiling, this is still the most time-consuming part of my program, taking between 50% and 57% of the total running time. Since I think I will need to do some heavy work with lots of substrings of a text field coming from some 3 million database records, the time the Levenshtein distance is taking is almost unacceptable. That being so, I would like to know if you have any suggestions, either algorithmic or programming-related, for the code below, or if you know of any better algorithm to calculate this distance.
#define Inicio1 (*(BufferVar))
#define Inicio2 (*(BufferVar+1))
#define Fim1 (*(BufferVar+2))
#define Fim2 (*(BufferVar+3))
#define IndLinha (*(BufferVar+4))
#define IndCol (*(BufferVar+5))
#define CompLinha (*(BufferVar+6))
#define TamTmp (*(BufferVar+7))
int __DistanciaEdicao (char * Termo1, char * Termo2, int TamTermo1, int TamTermo2, int * BufferTab, int * BufferVar)
{
*(BufferVar) = *(BufferVar + 1) = 0;
*(BufferVar + 2) = TamTermo1 - 1;
*(BufferVar + 3) = TamTermo2 - 1;
while ((Inicio1 <= *(BufferVar + 2)) && (Inicio2 <= *(BufferVar + 3)) && *(Termo1 + Inicio1) == *(Termo2 + Inicio2))
Inicio1 = ++Inicio2;
if (Inicio2 > Fim2) return (Fim1 - Inicio1 + 1);
while ((Fim1 >= 0) && (Fim2 >= 0) && *(Termo1 + Fim1) == *(Termo2 + Fim2))
{ Fim1--; Fim2--;}
if (Inicio2 > Fim2) return (Fim1 - Inicio1 + 1);
TamTermo1 = Fim1 - Inicio1 + 1;
TamTermo2 = Fim2 - Inicio2 + 1;
CompLinha = ((TamTermo1 > TamTermo2) ? TamTermo1 : TamTermo2) + 1;
for (IndLinha = 0; IndLinha <= TamTermo2; *(BufferTab + CompLinha * IndLinha) = IndLinha++);
for (IndCol = 0; IndCol <= TamTermo1; *(BufferTab + IndCol) = IndCol++);
for (IndCol = 1; IndCol <= TamTermo1; IndCol++)
for (IndLinha = 1; IndLinha <= TamTermo2; IndLinha++)
*(BufferTab + CompLinha * IndLinha + IndCol) = ((*(Termo1 + (IndCol + Inicio1 - 1)) == *(Termo2 + (IndLinha + Inicio2 - 1))) ? *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)) : ((*(BufferTab + CompLinha * (IndLinha - 1) + IndCol) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + IndCol)) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + IndCol)) : ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)))) + 1);
return *(BufferTab + CompLinha * TamTermo2 + TamTermo1);
}
Please note that BufferVar and BufferTab are two external int * (in this case, int[] variables being marshalled from C#) which I do not instantiate in every function call, to make the whole process faster. Still, this code is pretty slow for my needs. Can anyone please give me some suggestions, or, if possible, provide some better code?
Edit: The distance can't be bounded, I need the actual distance.
Thank you very much,
1. Brute Force
Here is an implementation of the Levenshtein Distance in Python.
def levenshtein_matrix(lhs, rhs):
    # Flip between row index 0 and row index 1.
    def move(index): return (index+1)%2
    m = len(lhs)
    n = len(rhs)
    # Two rows only: the previous one and the one being filled in.
    states = [list(range(n+1)), [0,]*(n+1)]
    previous = 0
    current = 1
    for i in range(1, m+1):
        states[current][0] = i
        for j in range(1,n+1):
            add = states[current][j-1] + 1
            sub = states[previous][j] + 1
            # Replacement costs 0 when the characters match, 1 otherwise.
            repl = states[previous][j-1] + int(lhs[i-1] != rhs[j-1])
            states[current][j] = min( repl, min(add,sub) )
        previous = move(previous)
        current = move(current)
    return states[previous][n]
It's the typical dynamic programming algorithm, just taking advantage of the fact that, since one only needs the last row, keeping only two rows at a time is sufficient.
For a C++ implementation, you might look at LLVM's (lines 70-130); note the use of a stack-allocated array of fixed size, replaced by a dynamically allocated array only when necessary.
I just can't follow your code well enough to try and diagnose it... so let's change the angle of attack. Instead of micro-optimizing the distance, we'll change the algorithm altogether.
2. Doing better: using a Dictionary
The point is that you could do much better than computing each distance from scratch.
The first remark is that the distance is symmetric; though this doesn't change the overall complexity, it will halve the time necessary.
The second is that since you actually have a dictionary of known words, you can build on that: "actor" and "actual" share a common prefix ("act") and thus you need not recompute the first stages.
This can be exploited using a Trie (or any other sorted structure) to store your words. Next you will take one word, and compute its distance relatively to all of the words stored in the dictionary, taking advantage of the prefixes.
Let's take an example: dic = ["actor", "actual", "addict", "atchoum"] and we want to compute the distance for word = "atchoum" (we remove it from the dictionary at this point)
Initialize the matrix for the word "atchoum": matrix = [[0, 1, 2, 3, 4, 5, 6, 7]]
Pick the next word "actor"
Prefix = "a", matrix = [[0, 1, 2, 3, 4, 5, 6, 7], [1, 0, 1, 2, 3, 4, 5, 6]]
Prefix = "ac", matrix = [[0, 1, 2, 3, 4, 5, 6, 7], [1, 0, 1, 2, 3, 4, 5, 6], [2, 1, 1, 1, 2, 3, 4, 5]]
Prefix = "act", matrix = [[..], [..], [..], [..]]
Continue until "actor", you have your distance
Pick the next word "actual", rewind the matrix until the prefix is a prefix of our word, here up to "act"
Prefix = "actu", matrix = [[..], [..], [..], [..], [..]]
Continue until "actual"
Continue for the other words
What's important here is the rewind step: by preserving the computation done for the previous word, with which you share a good-length prefix, you effectively save a lot of work.
Note that this is trivially implemented with a simple stack and does not require any recursive call.
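The steps above can be sketched as follows. This is only a minimal illustration under a few assumptions: it uses a sorted list rather than an actual Trie (sorting is enough to put words with a shared prefix next to each other), the function names next_row and distances_to_all are made up for the example, and the stack of rows is a plain Python list.

```python
def next_row(prev_row, ch, target):
    """Add one matrix row for the dictionary-word character `ch`."""
    row = [prev_row[0] + 1]
    for j in range(1, len(target) + 1):
        cost = 0 if target[j - 1] == ch else 1
        row.append(min(prev_row[j] + 1,          # skip ch
                       row[j - 1] + 1,           # skip target[j-1]
                       prev_row[j - 1] + cost))  # match or replace
    return row

def distances_to_all(target, words):
    """Distance from `target` to every word, reusing rows of shared prefixes."""
    rows = [list(range(len(target) + 1))]  # row for the empty prefix
    prefix = ""
    result = {}
    for word in sorted(words):
        # Rewind: keep only the rows that belong to the common prefix.
        common = 0
        while (common < min(len(prefix), len(word))
               and prefix[common] == word[common]):
            common += 1
        del rows[common + 1:]
        # Extend the stack with the remaining characters of this word.
        for ch in word[common:]:
            rows.append(next_row(rows[-1], ch, target))
        prefix = word
        result[word] = rows[-1][-1]
    return result

print(distances_to_all("atchoum", ["actor", "actual", "addict"]))
```

Going from "actor" to "actual" drops only the rows for "o" and "r", so the three rows for "act" are computed once.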
Try the simple approach first - don't use pointers and unsafe code - just code plain ordinary C#... but use the correct algorithm.
There is a simple and efficient algorithm on Wikipedia that uses dynamic programming and runs in O(n*m) time, where n and m are the lengths of the inputs. I suggest you try implementing that algorithm first, as it is described there, and only start optimizing it after you've implemented it, measured the performance and found it to be insufficient.
See also the section Possible improvements where it says:
By examining diagonals instead of rows, and by using lazy evaluation, we can find the Levenshtein distance in O(m (1 + d)) time (where d is the Levenshtein distance), which is much faster than the regular dynamic programming algorithm if the distance is small
If I had to guess where the problem is I'd probably start by looking at this line that runs inside two loops:
*(BufferTab + CompLinha * IndLinha + IndCol) = ((*(Termo1 + (IndCol + Inicio1 - 1)) == *(Termo2 + (IndLinha + Inicio2 - 1))) ? *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)) : ((*(BufferTab + CompLinha * (IndLinha - 1) + IndCol) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + IndCol)) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + IndCol)) : ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)))) + 1);
There appears to be a lot of duplication there though it's hard for me to spot exactly what's going on. Could you factor some of that out? And you definitely need to make it more readable.
You shouldn't try all your possible words with the Levenshtein distance algorithm. You should use another, faster metric to filter out the likely candidates, and only then use the Levenshtein distance to remove ambiguity. The first sieve can be based on an n-gram frequency histogram (trigrams often work well) or a hash function.
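A rough sketch of such a first sieve, assuming a trigram histogram built with collections.Counter. The function names and the min_shared threshold are hypothetical and would need tuning on real data; this is a heuristic pre-filter, not a replacement for the distance itself.

```python
from collections import Counter

def trigrams(word):
    """Histogram of the overlapping 3-character substrings of `word`."""
    return Counter(word[i:i + 3] for i in range(len(word) - 2))

def shared_trigrams(a, b):
    """Number of trigrams the two words have in common (with multiplicity)."""
    return sum((trigrams(a) & trigrams(b)).values())

def candidates(query, words, min_shared=1):
    """Keep only the words sharing at least `min_shared` trigrams with `query`."""
    return [w for w in words if shared_trigrams(query, w) >= min_shared]

print(candidates("actor", ["actual", "addict", "atchoum"]))
```

Only the words that pass this cheap test would then be handed to the full Levenshtein computation.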