sympy solve vs. solveset vs. nsolve - sympy

I am trying to solve the following equation for r:
from sympy import pi, S, solve, solveset, nsolve, symbols
(n_go, P_l, T, gamma_w, P_g, r, R_mol) = symbols(
    'n_go, P_l, T, gamma_w, P_g, r, R_mol', real=True)
expr = -P_g + P_l - 3*R_mol*T*n_go/(4*r**3*pi) + 2*gamma_w/r
soln = solveset(expr, r, domain=S.Reals)
soln1 = solve(expr, r)
soln is of the form Complement(Intersection(FiniteSet(...))), which I really don't know what to do with.
soln1 is a list of 3 expressions, two of which are complex. In fact, if I substitute values for the symbols and compute the solutions for soln1, all are complex:
vdict = {n_go: 1e-09, P_l: 101325, T: 300, gamma_w: 0.07168596252716256, P_g: 3534.48011713030, R_mol: 8.31451457896800}
for result in soln1:
    print(result.subs(vdict).n())
returns:
-9.17942953565355e-5 + 0.000158143657514283*I
-9.17942953565355e-5 - 0.000158143657514283*I
0.000182122477993494 + 1.23259516440783e-32*I
Interestingly, substituting values first and then using solveset() or solve() gives a real result:
solveset(expr.subs(vdict), r, domain=S.Reals).n()
{0.000182122477993494}
Conversely, nsolve fails with this equation, unless the starting point contains the first 7 significant digits of the solution(!):
nsolve(expr.subs(vdict), r, 0.000182122)
ValueError: Could not find root within given tolerance. (9562985778.9619347103 > 2.16840434497100886801e-19)
It should not be that hard; here is the plot:
My questions:
Why is nsolve so useless here?
How can I use the solution returned from solveset to compute any numerical solutions?
Why can I not obtain a real solution from solve if I solve first and then substitute values?

The answer from Maelstrom is good but I just want to add a few points.
The values you substitute are all floats and with those values the polynomial is ill-conditioned. That means that the form of the expression that you substitute into can affect the accuracy of the returned results. That is one reason why substituting values into the solution from solve does not necessarily give exactly the same value that you get from substituting before calling solve.
Also before you substitute the symbols it isn't possible for solve to know which of the three roots is real. That's why you get three solutions from solve(expr, r) and only one solution from solve(expr.subs(vdict), r). The third solution which is real after the substitution is the same (ignoring the tiny imaginary part) as returned by solve after the substitution:
In [7]: soln1[2].subs(vdict).n()
Out[7]: 0.000182122477993494 + 1.23259516440783e-32⋅ⅈ
In [8]: solve(expr.subs(vdict), r)
Out[8]: [0.000182122477993494]
Because the polynomial is ill-conditioned and has a large gradient at the root, nsolve has a hard time finding this root. However, nsolve can find the root if given a narrow enough interval:
In [9]: nsolve(expr.subs(vdict), r, [0.0001821, 0.0001823])
Out[9]: 0.000182122477993494
Since this is essentially a polynomial your best bet is actually to convert it to a polynomial and use nroots. The quickest way to do this is using as_numer_denom although in this case that introduces a spurious root at zero:
In [26]: Poly(expr.subs(vdict).as_numer_denom()[0], r).nroots()
Out[26]: [0, 0.000182122477993494, -9.17942953565356e-5 - 0.000158143657514284⋅ⅈ, -9.17942953565356e-5 + 0.000158143657514284⋅ⅈ]
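To go one step further, the spurious zero and the complex pair can be filtered out afterwards; a minimal sketch (continuing from the question's expr, vdict and r, with variable names of my own):
from sympy import Poly

# expr, vdict and r as defined in the question
candidates = Poly(expr.subs(vdict).as_numer_denom()[0], r).nroots()
# drop the zero introduced by clearing the denominator and any complex roots
real_roots = [root for root in candidates if root.is_real and root != 0]
# -> [0.000182122477993494]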

Your expr is essentially a cubic equation.
Applying the subs before or after solving should not substantially change anything.
soln
soln is of the form Complement(Intersection(FiniteSet(<3 cubic solutions>), Reals), FiniteSet(0)) i.e. a cubic solution on a real domain excluding 0.
The following should give you a simple FiniteSet back but evalf does not seem to be implemented well for sets.
print(soln.subs(vdict).evalf())
Hopefully something will be done about it soon.
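In the meantime, a practical workaround is to substitute the numbers first and solve over the reals, which returns a plain FiniteSet that can be iterated directly; a small sketch (reusing expr, vdict and r from the question):
from sympy import S, solveset

# expr, vdict and r as defined in the question
numeric_set = solveset(expr.subs(vdict), r, domain=S.Reals)
numeric_roots = [float(sol) for sol in numeric_set]
# roughly [0.000182122477993494]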
1
The reason nsolve is not useful here is that the graph is almost vertical near the root. According to your graph, the gradient is roughly 1.0e8. I don't think nsolve copes well with such steep functions.
Plotting your substituted expression, and then zooming out, shows just how steep it is near the root (plots omitted).
This is a pretty wild function and I suspect nsolve uses an epsilon that is too large to be useful in this situation. To fix this, you could provide more reasonable numbers that are closer to 1 when substituting. (Consider providing different units of measurement, e.g. instead of meters/year consider km/hour.)
2
It is difficult to tell you how to deal with the output of solveset in general because every type of set needs to be dealt with in different ways. It's also not mathematically sensible since soln.args[0].args[0].args[0] should give the first cubic solution but it forgets that this must be real and nonzero.
You can use args or preorder_traversal or things to navigate the tree. Also reading documentation of various sets should help. solve and solveset need to be used "interactively" because there are lots of possible outputs with lots of ways to understand it.
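As an illustration, here is one way to dig the candidate roots out of the nested set and check them numerically; a sketch that assumes the Complement/Intersection/FiniteSet structure described above (the variable names are mine):
from sympy import FiniteSet, preorder_traversal

# soln and vdict as defined in the question
cubic_roots = next(node for node in preorder_traversal(soln)
                   if isinstance(node, FiniteSet) and len(node.args) == 3)
numeric = [root.subs(vdict).n(chop=True) for root in cubic_roots]
real_nonzero = [val for val in numeric if val.is_real and val != 0]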
3
soln1 has 3 solutions, which is consistent with your loop printing 3 lines. All of them are technically complex (as is the nature of floats here). However, the third solution you list has a very small imaginary component. To remove these kinds of finicky artifacts, there is an argument called chop which should help:
for result in soln1:
    print(result.subs(vdict).n(chop=True))
One of the results is 0.000182122477993494 which looks like your root.

Here is an answer to the underlying question: how to compute the roots of the above equation efficiently?
Based on the suggestion by @OscarBenjamin, we can do even better, and faster, by using Poly and roots instead of nroots. Below, sympy computes in no time the roots of the equation for 100 different values of P_g, while keeping everything else constant:
from sympy import pi, Poly, roots, solve, solveset, nsolve, nroots, symbols
(n_go, P_l, T, gamma_w, P_g, r, R_mol) = symbols(
    'n_go, P_l, T, gamma_w, P_g, r, R_mol', real=True)
vdict = {pi:pi.n(), n_go:1e-09, P_l:101325, T:300, gamma_w:0.0717, R_mol: 8.31451457896800}
expr = -P_g + P_l - 3*R_mol*T*n_go/(4*r**3*pi) + 2*gamma_w/r
expr_poly = Poly(expr.as_numer_denom()[0], n_go, P_l, T, gamma_w, P_g, r, R_mol, domain='RR[pi]')
result = [roots(expr_poly.subs(vdict).subs(P_g, val)).keys() for val in range(4000,4100)]
All that remains is to check whether the solutions fulfill our conditions (real and positive). Thank you, @OscarBenjamin!
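For instance, that last check could look like this (a sketch; real_positive is a hypothetical helper and the tolerance is arbitrary):
def real_positive(rootset, tol=1e-12):
    """Keep only the roots that are numerically real and positive."""
    keep = []
    for root in rootset:
        val = complex(root.evalf())
        if abs(val.imag) < tol and val.real > 0:
            keep.append(val.real)
    return keep

filtered = [real_positive(rootset) for rootset in result]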
PS: Should I expand the topic above to include nroots and roots?

Related

Sympy simplify Euler's formula not working

I am doing my physics homework and tried to simplify an expression using the Euler formula. The minimal non-working example looks like this:
from sympy import *
x, phi = symbols("x varphi", real=True)
simplify(x * (E**(I*phi) + E**(-I*phi)))
My Jupyter notebook outputs the exact same expression back, while the desired expression, using the Euler formula, is 2*x*cos(varphi).
However, sympy actually knows how to use the Euler formula to represent the cosine function, because it outputs the simplified expression nicely when the x is removed:
simplify(E**(I*phi) + E**(-I*phi))
gives 2*cos(varphi).
Since the distributive property of multiplication applies to complex numbers, I don't see why sympy can't figure out the desired simplification of the first expression.
Maybe it is by design. As a workaround you can do
expr = x*(E**(I*phi) + E**(-I*phi))
expr.rewrite(cos)
which gives
2*x*cos(varphi)

expecting ints or fractions, got % and % in sympy

I need to solve the differential equation y' = 6e^(2x-y).
I am trying to do that in sympy with dsolve().
sol = dsolve(Derivative(f(x), x) - 6 *(e**(2*x-f(x))), f(x))
But I always get the error:
expecting ints or fractions, got 7.38905609893065022723042746058 and 6
What is the problem?
Where did you get e from? It seems you used math.exp(1) or similar to get a floating-point value that the symbolic package cannot treat correctly.
Using sympy.exp instead works perfectly; even defining e = sympy.exp(1) is correctly recognized. Both give the result
Eq(f(x), log(C1 + 3*exp(2*x)))
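For completeness, a corrected version might look like this (a sketch; the result line is the one quoted above):
from sympy import Function, Derivative, dsolve, exp, symbols

x = symbols('x')
f = Function('f')

# use the symbolic exponential instead of a floating-point e
sol = dsolve(Derivative(f(x), x) - 6*exp(2*x - f(x)), f(x))
print(sol)  # Eq(f(x), log(C1 + 3*exp(2*x)))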

How to rewrite `sin(x)^2` to cos(2*x) form in Sympy

It is easy to obtain such rewrite in other CAS like Mathematica.
TrigReduce[Sin[x]^2]
(*1/2 (1 - Cos[2 x])*)
However, in Sympy, trigsimp with all methods tested returns sin(x)**2
trigsimp(sin(x)*sin(x),method='fu')
While dealing with a similar issue, reducing the order of sin(x)**6, I noticed that sympy can reduce the order of sin(x)**n for n = 2, 3, 4, 5, ... by using rewrite, expand, then rewrite again, followed by simplify, as shown here:
expr = sin(x)**6
expr.rewrite(sin, exp).expand().rewrite(exp, sin).simplify()
this returns:
-15*cos(2*x)/32 + 3*cos(4*x)/16 - cos(6*x)/32 + 5/16
That works for every power similarly to what Mathematica will do.
On the other hand, if you want to reduce sin(x)**2*cos(x), a similar strategy works. In that case you have to rewrite both cos and sin to exp and, as before, expand, rewrite, and simplify again:
(sin(x)**2*cos(x)).rewrite(sin, exp).rewrite(cos, exp).expand().rewrite(exp, sin).simplify()
that returns:
cos(x)/4 - cos(3*x)/4
The full "fu" method tries many different combinations of transformations to find "the best" result.
The individual transforms used in the Fu-routines can be used to do targeted transformations. You will have to read the documentation to learn what the different functions do, but just running through the functions of the FU dictionary identifies TR8 as your workhorse here:
>>> for f in FU.keys():
... print("{}: {}".format(f, FU[f](sin(var('x'))**2)))
...
8<---
TR8 -cos(2*x)/2 + 1/2
TR1 sin(x)**2
8<---
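Once TR8 is identified as the workhorse, it can also be imported and applied directly; a small sketch (the import path is where SymPy's Fu transforms live):
from sympy import sin, symbols
from sympy.simplify.fu import TR8

x = symbols('x')
print(TR8(sin(x)**2))  # -cos(2*x)/2 + 1/2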
Here is a silly way to get this job done.
trigsimp((sin(x)**2).rewrite(tan))
returns:
-cos(2*x)/2 + 1/2
It also works for
trigsimp((sin(x)**3).rewrite(tan))
returns
3*sin(x)/4 - sin(3*x)/4
but it does not work for
trigsimp((sin(x)**2*cos(x)).rewrite(tan))
returns:
4*(-tan(x/2)**2 + 1)*cos(x/2)**6*tan(x/2)**2

Calculating a relative Levenshtein distance - make sense?

I am using both Daitch-Mokotoff soundexing and Damerau-Levenshtein to find out if a user entry and a value in the application are "the same".
Is Levenshtein distance supposed to be used as an absolute value? If I have a 20 letter word, a distance of 4 is not so bad. If the word has 4 letters...
What I am now doing is taking the distance / length to get a distance that better reflects what percentage of the word has been changed.
Is that a valid/proven approach? Or is it plain stupid?
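For reference, the kind of normalisation I mean looks roughly like this (a sketch; the helper names and the choice of dividing by the longer length are arbitrary):
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def relative_distance(a: str, b: str) -> float:
    """Edit distance as a fraction of the longer word's length."""
    longest = max(len(a), len(b)) or 1
    return levenshtein(a, b) / longest

# relative_distance("cat", "scat") == 0.25
# relative_distance("difference", "differences") is roughly 0.09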
Is Levenshtein distance supposed to be used as an absolute value?
It seems like it would depend on your requirements. (To clarify: Levenshtein distance is an absolute value, but as the OP pointed out, the raw value may not be as useful for a given application as a measure that takes the length of the word into account. This is because we are really more interested in similarity than distance per se.)
I am using both Daitch-Mokotoff soundexing and Damerau-Levenshtein to find out if a user entry and a value in the application are "the same".
Sounds like you're trying to determine whether the user intended their entry to be the same as a given data value?
Are you doing spell-checking? or conforming invalid input to a known set of values?
What are your priorities?
Minimize false positives (try to make sure all suggested words are very "similar", and list of suggestions is short)
Minimize false negatives (try to make sure that the string the user intended is in the list of suggestions, even if it makes the list long)
Maximize average matching accuracy
You might end up using the Levenshtein distance in one way to determine whether a word should be offered in a suggestion list; and another way to determine how to order the suggestion list.
It seems to me, if I've inferred your purpose correctly, that the core thing you want to measure is similarity rather than difference between two strings. As such, you could use Jaro or Jaro-Winkler distance, which takes into account the length of the strings and the number of characters in common:
The Jaro distance dj of two given strings s1 and s2 is
(m / |s1| + m / |s2| + (m - t) / m) / 3
where:
m is the number of matching characters
t is the number of transpositions
Jaro–Winkler distance uses a prefix scale p which gives more favourable ratings to strings that match from the beginning for a set prefix length l.
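A direct transcription of that definition into code might look like this (a sketch, using the common convention that t counts half of the out-of-order matches):
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity of two strings, per the definition quoted above."""
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    s1_matched = [False] * len(s1)
    s2_matched = [False] * len(s2)
    m = 0
    # find matching characters within the allowed window
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not s2_matched[j] and s2[j] == c:
                s1_matched[i] = s2_matched[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # count transpositions: matched characters that line up in a different order
    k = transpositions = 0
    for i, c in enumerate(s1):
        if s1_matched[i]:
            while not s2_matched[k]:
                k += 1
            if c != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3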
The Levenshtein distance is a relative value between two words. Comparing the LD to the length is not relevant, e.g.
cat -> scat = 1 (75% similar??)
difference -> differences = 1 (90% similar??)
Both these pairs have a Levenshtein distance of 1, i.e. they differ by one character, but when compared to their lengths the second pair would appear to be 'more' similar.
I use soundexing to rank words that have the same Levenshtein distance, e.g.
cat and fat both have an LD of 1 relative to kat, but the intended word is more likely to be cat than fat when using soundex (assuming the word is incorrectly spelt, not incorrectly typed!).
So the short answer is just use the lev distance to determine the similarity.

Similar String algorithm

I'm looking for an algorithm, or at least theory of operation on how you would find similar text in two or more different strings...
Much like the question posed here: Algorithm to find articles with similar text, the difference being that my text strings will only ever be a handful of words.
Like say I have a string:
"Into the clear blue sky"
and I'm doing a compare with the following two strings:
"The color is sky blue" and
"In the blue clear sky"
I'm looking for an algorithm that can be used to match the text in the two, and decide on how close they match. In my case, spelling, and punctuation are going to be important. I don't want them to affect the ability to discover the real text. In the above example, if the color reference is stored as "'sky-blue'", I want it to still be able to match. However, the 3rd string listed should be a BETTER match over the second, etc.
I'm sure places like Google probably use something similar with the "Did you mean:" feature...
* EDIT *
In talking with a friend, he worked with a guy who wrote a paper on this topic. I thought I might share it with everyone reading this, as there are some really good methods and processes described in it...
Here's the link to his paper, I hope it is helpful to those reading this question, and on the topic of similar string algorithms.
Levenshtein distance will not completely work, because you want to allow rearrangements. I think your best bet is going to be to find the best rearrangement, with Levenshtein distance as the cost for each word.
Finding the cost of rearrangement is a bit like the pancake sorting problem. You can permute every combination of words (filtering out exact matches) against every combination of the other string, trying to minimize a combination of the permutation distance and the Levenshtein distance on each word pair.
edit:
Now that I have a second I can post a quick example (all 'best' guesses are on inspection and not actually running the algorithms):
original strings | best rearrangement w/ lev distance per word
Into the clear blue sky | Into the c_lear blue sky
The color is sky blue | is__ the colo_r blue sky
R_dist = dist( 3 1 2 5 4 ) --> 3 1 2 *4 5* --> *2 1 3* 4 5 --> *1 2* 3 4 5 = 3
L_dist = (2D+S) + (I+D+S) (Total substitutions: 2, deletions: 3, insertions: 1)
(notice all the flips include all elements in the range, and I use ranges where Xi - Xj = +/- 1)
Other example
original strings | best rearrangement w/ lev distance per word
Into the clear blue sky | Into the clear blue sky
In the blue clear sky | In__ the clear blue sky
R_dist = dist( 1 2 4 3 5 ) --> 1 2 *3 4* 5 = 1
L_dist = (2D) (Total substitutions: 0, deletions: 2, insertions: 0)
And to show all possible combinations of the three...
The color is sky blue | The colo_r is sky blue
In the blue clear sky | the c_lear in sky blue
R_dist = dist( 2 4 1 3 5 ) --> *2 3 1 4* 5 --> *1 3 2* 4 5 --> 1 *2 3* 4 5 = 3
L_dist = (D+I+S) + (S) (Total substitutions: 2, deletions: 1, insertions: 1)
Anyway you make the cost function the second choice will be lowest cost, which is what you expected!
One way to determine a measure of "overall similarity without respect to order" is to use some kind of compression-based distance. Basically, the way most compression algorithms (e.g. gzip) work is to scan along a string looking for string segments that have appeared earlier -- any time such a segment is found, it is replaced with an (offset, length) pair identifying the earlier segment to use. You can use measures of how well two strings compress to detect similarities between them.
Suppose you have a function string comp(string s) that returns a compressed version of s. You can then use the following expression as a "similarity score" between two strings s and t:
len(comp(s)) + len(comp(t)) - len(comp(s . t))
where . is taken to be concatenation. The idea is that you are measuring how much further you can compress t by looking at s first. If s == t, then len(comp(s . t)) will be barely any larger than len(comp(s)) and you'll get a high score, while if they are completely different, len(comp(s . t)) will be very near len(comp(s)) + len(comp(t)) and you'll get a score near zero. Intermediate levels of similarity produce intermediate scores.
Actually the following formula is even better as it is symmetric (i.e. the score doesn't change depending on which string is s and which is t):
2 * (len(comp(s)) + len(comp(t))) - len(comp(s . t)) - len(comp(t . s))
This technique has its roots in information theory.
Advantages: good compression algorithms are already available, so you don't need to do much coding, and they run in linear time (or nearly so) so they're fast. By contrast, solutions involving all permutations of words grow super-exponentially in the number of words (although admittedly that may not be a problem in your case as you say you know there will only be a handful of words).
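As a concrete illustration, here is roughly what that looks like with an off-the-shelf compressor (a sketch using zlib; note that for very short strings the compressor's fixed overhead can blur the scores, so this works best on longer texts):
import zlib

def clen(s: str) -> int:
    """Length of the compressed representation of s."""
    return len(zlib.compress(s.encode("utf-8")))

def similarity(s: str, t: str) -> int:
    # the symmetric score described above: higher means more shared structure
    return 2 * (clen(s) + clen(t)) - clen(s + t) - clen(t + s)

print(similarity("Into the clear blue sky", "In the blue clear sky"))
print(similarity("Into the clear blue sky", "The color is sky blue"))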
One way (although this is perhaps better suited a spellcheck-type algorithm) is the "edit distance", ie., calculate how many edits it takes to transform one string to another. A common technique is found here:
http://en.wikipedia.org/wiki/Levenshtein_distance
You might want to look into the algorithms used by biologists to compare DNA sequences, since they have to cope with many of the same things (chunks may be missing, or have been inserted, or just moved to a different position in the string).
The Smith-Waterman algorithm would be one example that'd probably work fairly well, although it might be too slow for your uses. Might give you a starting point, though.
I had a similar problem: I needed to get the percentage of characters in a string that were similar. It needed exact sequences, so for example "hello sir" and "sir hello", when compared, needed to give me five characters that are the same; in this case they would be the two "hello"s. It would then take the length of the longer of the two strings and give me a percentage of how similar they were. This is the code that I came up with:
#include <string>
using std::string;

int bigger(string a, string b);

// dispatch so that the longer string is always passed first
int compare(string a, string b){
    return (a.size() > b.size() ? bigger(a, b) : bigger(b, a));
}

int bigger(string a, string b){
    int maxcount = 0; // longest run of consecutive matching characters seen so far
    for(size_t i = 0; i < a.size(); ++i){ // i is the offset into the longer string
        int currentcount = 0; // length of the current run at this offset
        for(size_t j = 0; j < b.size() && i + j < a.size(); ++j){
            if(a[i + j] == b[j]){
                ++currentcount;
                if(currentcount > maxcount){
                    maxcount = currentcount;
                }
            }
            else{
                currentcount = 0;
            }
        }//end inner for loop
    }//end outer for loop
    return (int)(((float)maxcount / (float)a.size()) * 100);
}
I can't mark two answers here, so I'm going to answer and mark my own. The Levenshtein distance appears to be the correct method in most cases for this. But it is worth mentioning j_random_hacker's answer as well. I have used an implementation of LZMA to test his theory, and it proves to be a sound solution.
In my original question I was looking for a method for short strings (2 to 200 chars), where the Levenshtein distance algorithm will work. But, not mentioned in the question, was the need to compare two (larger) strings (in this case, text files of moderate size) and to perform a quick check to see how similar the two are. I believe that this compression technique will work well, but I have yet to study it to find at which point one becomes better than the other, in terms of the size of the sample data and the speed/cost of the operation in question.
I think a lot of the answers given to this question are valuable, and worth mentioning, for anyone looking to solve a similar string ordeal like I'm doing here. Thank you all for your great answers, and I hope they can be used to serve others well too.
There's another way: pattern recognition using convolution. Image A is run through a Fourier transform. Image B also. Now superimposing F(A) over F(B) and then transforming this back gives you a black image with a few white spots. Those spots indicate where A matches B strongly. The total sum of spots would indicate an overall similarity. Not sure how you'd run an FFT on strings, but I'm pretty sure it would work.
The difficulty would be to match the strings semantically.
You could generate some kind of value based on the lexical properties of the string, e.g. they both have blue, and sky, and they're in the same sentence, etc... But it won't handle cases like "Sky's jean is blue", or some other oddball English construction that uses the same words; for that you'd need to parse the English grammar...
To do anything beyond lexical similarity, you'd need to look at natural language processing, and there isn't going to be one single algorithm that will solve your problem.
Possible approach:
Construct a Dictionary with a string key of "word1|word2" for all combinations of words in the reference string. A single combination may happen multiple times, so the value of the Dictionary should be a list of numbers, each representing the distance between the words in the reference string.
When you do this, there will be duplication here: for every "word1|word2" dictionary entry, there will be a "word2|word1" entry with the same list of distance values, but negated.
For each combination of words in the comparison string (words 1 and 2, words 1 and 3, words 2 and 3, etc.), check the two keys (word1|word2 and word2|word1) in the reference string and find the closest value to the distance in the current string. Add the absolute value of the difference between the current distance and the closest distance to a counter.
If the closest reference distance between the words is in the opposite direction (word2|word1) as the comparison string, you may want to weight it smaller than if the closest value was in the same direction in both strings.
When you are finished, divide the sum by the square of the number of words in the comparison string.
This should provide some decimal value representing how closely each word/phrase matches some word/phrase in the original string.
Of course, if the original string is longer, it won't account for that, so it may be necessary to compute this both directions (using one as the reference, then the other) and average them.
I have absolutely no code for this, and I probably just re-invented a very crude wheel. YMMV.
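For illustration, here is a rough sketch of the scheme described above (the flat penalty for pairs that never appear in the reference, and the omission of the direction weighting, are simplifications of my own):
from itertools import combinations

def pair_distances(words):
    """Map 'word1|word2' -> list of signed distances between the word positions."""
    table = {}
    for (i, w1), (j, w2) in combinations(enumerate(words), 2):
        table.setdefault(w1 + "|" + w2, []).append(j - i)
        table.setdefault(w2 + "|" + w1, []).append(i - j)  # the duplicated, negated entry
    return table

def phrase_score(reference: str, comparison: str) -> float:
    """Lower is better: average positional mismatch of word pairs, per the scheme above."""
    ref = pair_distances(reference.split())
    comp_words = comparison.split()
    total = 0.0
    for (i, w1), (j, w2) in combinations(enumerate(comp_words), 2):
        dist = j - i
        candidates = ref.get(w1 + "|" + w2, [])
        if candidates:
            total += min(abs(dist - c) for c in candidates)
        else:
            # simplification: flat penalty when the pair never appears in the reference
            total += abs(dist)
    return total / (len(comp_words) ** 2)

print(phrase_score("Into the clear blue sky", "In the blue clear sky"))
print(phrase_score("Into the clear blue sky", "The color is sky blue"))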