How can I get a list of the matrix symbols in a sympy expression? - sympy

Following this similar question, I tried to retrieve the symbols of an expression containing symbolic matrix elements, without success:
import sympy as sy
P = sy.Matrix(sy.MatrixSymbol('P', 5, 2))
expr = P[2,:]*sy.transpose(P[3,:]) # Matrix([[P[2, 0]*P[3, 0] + P[2, 1]*P[3, 1]]])
expr.free_symbols # returns `{P}`
expr.atoms(sy.Symbol) # returns an empty set
How can I obtain from expr the sequence/set P[2, 0], P[2, 1], P[3, 0], P[3, 1]?

You need to know the type of the objects that you want to extract. You can use SymPy's srepr function to show what the expression consists of:
In [11]: expr
Out[11]: [P₂₀⋅P₃₀ + P₂₁⋅P₃₁]
In [12]: srepr(expr)
Out[12]: "ImmutableDenseMatrix([[Add(Mul(MatrixElement(MatrixSymbol(Str('P'), Integer(5), Integer(2)), Integer(2), Integer(0)), MatrixElement(MatrixSymbol(Str('P'), Integer(5), Integer(2)), Integer(3), Integer(0))), Mul(MatrixElement(MatrixSymbol(Str('P'), Integer(5), Integer(2)), Integer(2), Integer(1)), MatrixElement(MatrixSymbol(Str('P'), Integer(5), Integer(2)), Integer(3), Integer(1))))]])"
The parts that you want are the MatrixElement expressions, so:
In [16]: from sympy.matrices.expressions.matexpr import MatrixElement
In [17]: expr.atoms(MatrixElement)
Out[17]: {P₂₀, P₂₁, P₃₀, P₃₁}
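For completeness, here is the whole flow as a small self-contained sketch of the steps above (the printed set order may differ):
import sympy as sy
from sympy.matrices.expressions.matexpr import MatrixElement

P = sy.Matrix(sy.MatrixSymbol('P', 5, 2))
expr = P[2, :] * sy.transpose(P[3, :])

# atoms(MatrixElement) collects every indexed element of P in the expression
elements = expr.atoms(MatrixElement)
print(elements)  # {P[2, 0], P[2, 1], P[3, 0], P[3, 1]} (set order may vary)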

Related

Sympy: Is it possible use function collect() to IndexedBase variables?

I'm trying to use the function collect() to simplify my expression. My desired result is:
My code:
from sympy import *
#index
i = symbols('i' , integer = True )
#constants
a = symbols( 'a' )
#variables
alpha = IndexedBase('alpha', positive=True, domain=QQ)
index = (i, 1, 3)
rho = symbols( 'rho')
U = product( alpha[i]**(1/(rho-1)) , index )
U
My solution attempt:
U = U.subs(1/(rho-1),a)
collect(U,rho, evaluate=False)[1]
What am I doing wrong?
You must be using a fairly old version of SymPy because in recent versions the form that you wanted arises automatically. In any case you should be able to use powsimp:
In [9]: U
Out[9]:
        a         a         a
alpha[1] ⋅alpha[2] ⋅alpha[3]

In [10]: powsimp(U, force=True)
Out[10]:
                            a
(alpha[1]⋅alpha[2]⋅alpha[3])
https://docs.sympy.org/latest/tutorials/intro-tutorial/simplification.html#powsimp
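For reference, a self-contained sketch of the whole workflow (the domain=QQ argument from the question is omitted here, and a is just a stand-in symbol for the exponent 1/(rho - 1)):
from sympy import IndexedBase, symbols, product, powsimp

i = symbols('i', integer=True)
a, rho = symbols('a rho')
alpha = IndexedBase('alpha', positive=True)

# build the product and replace the common exponent by a single symbol
U = product(alpha[i]**(1/(rho - 1)), (i, 1, 3))
U = U.subs(1/(rho - 1), a)

# collect the powers over the common exponent
print(powsimp(U, force=True))  # (alpha[1]*alpha[2]*alpha[3])**a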

Plot a special integral function

I am a newbie with SymPy.
I am trying to plot the function:
plot(Integral((t*ln(abs(2+t)))/((1+t**2)),(t,0,x)))
The important check is to look at the behaviour around the "singular" point of the integrand.
I get strange behaviour from the tool at the -2 bound.
Is there a different plotting method in SymPy to obtain the global behaviour of the function without splitting the integral into 2 pieces?
What happens at the -2 bound that makes the SymPy algorithm break down?
Integration over (t, -3, -1) works.
Thanks to all.
Sadly, this is a case where the plotting module is unable to evaluate the expression. We can work around it: we use SymPy's lambdify to convert the symbolic expression to a numerical function, then use NumPy and Matplotlib to create the plot:
import matplotlib.pyplot as plt
import numpy as np
from sympy import symbols, Integral, ln, lambdify

t, x = symbols("t x")
expr = Integral((t*ln(abs(2 + t)))/(1 + t**2), (t, 0, x))

f1 = lambdify([x], expr)
# since the expression contains an integral, we need to vectorize
# the numerical function so that it will automatically evaluate
# a numpy array
f1 = np.vectorize(f1)
f2 = lambdify([x], expr.diff(x))
f2 = np.vectorize(f2)

xx = np.linspace(-4, 0, 1000)
yy1 = f1(xx)
yy2 = f2(xx)

fig, axs = plt.subplots(2, 1)

def grid_labels(ax, ylabel):
    ax.grid(which='major', axis='both', linewidth=0.75,
            linestyle='-', color='0.85')
    ax.grid(which='minor', axis='both', linewidth=0.25,
            linestyle='--', color='0.80')
    ax.minorticks_on()
    ax.set_xlabel("x")
    ax.set_ylabel(ylabel)

axs[0].plot(xx, yy1)
grid_labels(axs[0], "f(x)")
axs[1].plot(xx, yy2)
grid_labels(axs[1], "df/dx")
plt.show()

Merge generator objects to calculate frequency in NLTK

I am trying to count the frequency of various n-grams using the ngrams and FreqDist functions in NLTK.
Because the ngrams function returns a generator object, I would like to merge the output from each call before calculating the frequencies.
However, I am running into problems merging the various generator objects.
I have tried itertools.chain, which created an itertools object rather than merging the generators.
I finally settled on permutations, but parsing the objects afterwards seems redundant.
The working code thus far is:
import nltk
from nltk import word_tokenize, pos_tag
from nltk.collocations import *
from itertools import *
from nltk.util import ngrams
import re
corpus = 'testing sentences to see if if if this works'
token = word_tokenize(corpus)
unigrams = ngrams(token,1)
bigrams = ngrams(token,2)
trigrams = ngrams(token,3)
perms = list(permutations([unigrams,bigrams,trigrams]))
fdist = nltk.FreqDist(perms)
for x,y in fdist.items():
    for k in x:
        for v in k:
            words = '_'.join(v)
            print words, y
As you can see in the results, FreqDist is not counting the words from the individual generator objects properly, as each has a frequency of 1.
Is there a more Pythonic way to do this properly?
Use everygrams; it returns all the n-grams for a given range of n.
>>> from nltk import everygrams
>>> from nltk import FreqDist
>>> corpus = 'testing sentences to see if if if this works'
>>> everygrams(corpus.split(), 1, 3)
<generator object everygrams at 0x7f4e272e9730>
>>> list(everygrams(corpus.split(), 1, 3))
[('testing',), ('sentences',), ('to',), ('see',), ('if',), ('if',), ('if',), ('this',), ('works',), ('testing', 'sentences'), ('sentences', 'to'), ('to', 'see'), ('see', 'if'), ('if', 'if'), ('if', 'if'), ('if', 'this'), ('this', 'works'), ('testing', 'sentences', 'to'), ('sentences', 'to', 'see'), ('to', 'see', 'if'), ('see', 'if', 'if'), ('if', 'if', 'if'), ('if', 'if', 'this'), ('if', 'this', 'works')]
To combine the counts of the different orders of n-grams:
>>> from nltk import everygrams
>>> from nltk import FreqDist
>>> corpus = 'testing sentences to see if if if this works'.split()
>>> fd = FreqDist(everygrams(corpus, 1, 3))
>>> fd
FreqDist({('if',): 3, ('if', 'if'): 2, ('to', 'see'): 1, ('sentences', 'to', 'see'): 1, ('if', 'this'): 1, ('to', 'see', 'if'): 1, ('works',): 1, ('testing', 'sentences', 'to'): 1, ('sentences', 'to'): 1, ('sentences',): 1, ...})
Alternatively, FreqDist is essentially a collections.Counter subclass, so you can add counters together:
>>> from collections import Counter
>>> x = Counter([1,2,3,4,4,5,5,5])
>>> y = Counter([1,1,1,2,2])
>>> x + y
Counter({1: 4, 2: 3, 5: 3, 4: 2, 3: 1})
>>> from nltk import FreqDist
>>> FreqDist(['a', 'a', 'b'])
FreqDist({'a': 2, 'b': 1})
>>> a = FreqDist(['a', 'a', 'b'])
>>> b = FreqDist(['b', 'b', 'c', 'd', 'e'])
>>> a + b
FreqDist({'b': 3, 'a': 2, 'c': 1, 'e': 1, 'd': 1})
Alvas is right, nltk.everygrams is the perfect tool for this job. But merging several iterators is really not that hard, nor that uncommon, so you should know how to do it. The key is that any iterator can be converted to a list, but it's best to do that only once:
Make a list out of several iterators
Just use lists (simple but inefficient)
allgrams = list(unigrams) + list(bigrams) + list(trigrams)
Or build a single list, properly
allgrams = list(unigrams)
allgrams.extend(bigrams)
allgrams.extend(trigrams)
Or use itertools.chain(), then make a list
allgrams = list(itertools.chain(unigrams, bigrams, trigrams))
The above all produce identical results (as long as you don't try to reuse the iterators unigrams etc. -- you need to redefine them between examples).
Use the iterators themselves
Don't fight iterators, learn to work with them. Many Python functions accept them instead of lists, saving you much space and time.
You could form a single iterator and pass it to nltk.FreqDist():
fdist = nltk.FreqDist(itertools.chain(unigrams, bigrams, trigrams))
Or you can work with the iterators one at a time: FreqDist, like Counter, has an update() method that you can use to count things incrementally:
fdist = nltk.FreqDist(unigrams)
fdist.update(bigrams)
fdist.update(trigrams)
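Putting it together, a minimal end-to-end sketch of the incremental approach (using the corpus string from the question and str.split() instead of word_tokenize, so it runs without extra NLTK data):
import nltk
from nltk.util import ngrams

corpus = 'testing sentences to see if if if this works'
tokens = corpus.split()

# count the unigrams first, then fold in the bigrams and trigrams
fdist = nltk.FreqDist(ngrams(tokens, 1))
fdist.update(ngrams(tokens, 2))
fdist.update(ngrams(tokens, 3))

for gram, count in fdist.most_common(5):
    print('_'.join(gram), count)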

vectorizer.fit_transform gives NotImplementedError : adding a nonzero scalar to a sparse matrix is not supported

I am trying to create a term-document matrix using my custom analyser to extract features from the documents. The following is the code:
import re
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))
analyzer = vectorizer.build_analyzer()

def customAnalyzer(text):
    grams = analyzer(text)
    tgrams = [gram for gram in grams if not re.match(r"^[0-9\s]+$", gram)]
    return tgrams
This function is the custom analyser that the CountVectorizer calls to extract the features.
for i in xrange(0, num_rows):
    clean_query.append(review_to_words(inp["keyword"][i], units))

vectorizer = CountVectorizer(analyzer=customAnalyzer,
                             tokenizer=None,
                             ngram_range=(1, 2),
                             preprocessor=None,
                             stop_words=None,
                             max_features=n)

features = vectorizer.fit_transform(clean_query)
z = vectorizer.get_feature_names()
This call throws the following error:
(<type 'exceptions.NotImplementedError'>, 'python.py', 128,NotImplementedError('adding a nonzero scalar to a sparse matrix is not supported',))
This error comes when we call the vectorizer to fit and transform.
But the value of the variable clean_query is not scalar. I am using sklearn-0.17.1
np.isscalar(clean_query)
False
This is a small test which I did to reproduce the error, but it did not throw the same error for me. (This example has been taken from: scikit-learn Feature extraction.)
scikit-learn version : 0.19.dev0
In [1]: corpus = [
   ...:     'This is the first document.',
   ...:     'This is the second second document.',
   ...:     'And the third one.',
   ...:     'Is this the first document?',
   ...: ]
In [2]: from sklearn.feature_extraction.text import TfidfVectorizer
In [3]: vectorizer = TfidfVectorizer(min_df=1)
In [4]: vectorizer.fit_transform(corpus)
Out[4]:
<4x9 sparse matrix of type '<type 'numpy.float64'>'
with 19 stored elements in Compressed Sparse Row format>
In [5]: import numpy as np
In [6]: np.isscalar(corpus)
Out[6]: False
In [7]: type(corpus)
Out[7]: list
From the code above you can see that corpus is not a scalar and has type list.
I think the problem lies in how you construct the clean_query variable: it needs to be the kind of iterable of document strings that vectorizer.fit_transform expects.
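As an illustration, here is a minimal sketch of the custom-analyser setup with a hard-coded list of strings standing in for clean_query (the example documents are made up; review_to_words and the other variables from the question are not reproduced):
import re
from sklearn.feature_extraction.text import CountVectorizer

base = CountVectorizer(ngram_range=(1, 2))
analyzer = base.build_analyzer()

def customAnalyzer(text):
    # drop tokens that consist only of digits and whitespace
    return [gram for gram in analyzer(text) if not re.match(r"^[0-9\s]+$", gram)]

clean_query = [
    "first query about 42 things",
    "second query about other things",
]

vectorizer = CountVectorizer(analyzer=customAnalyzer, max_features=50)
features = vectorizer.fit_transform(clean_query)
print(features.shape)
print(vectorizer.vocabulary_)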

check_arrays() limiting array dimensions in scikit-learn?

I would like to use sklearn.learning_curves.py, available in scikit-learn 0.15. After I cloned this version, several functions no longer work because check_arrays() limits the dimension of the arrays to 2.
>>> from sklearn import metrics
>>> from sklearn.cross_validation import train_test_split
>>> import numpy as np
>>> X = np.random.random((10,2,2,2))
>>> y = np.random.random((10,2,2,2))
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=3)
>>> error "Found array with dim 4d. Expected <= 2"
Using the same X and y I get the same error.
>>> mse = metrics.mean_squared_error
>>> mse(X,y)
>>> error "Found array with dim 4d. Expected <= 2"
If I go to sklearn.utils.validation.py and comment out lines 272, 273, and 274 as shown below, everything works just fine.
# if array.ndim >= 3:
#     raise ValueError("Found array with dim %d. Expected <= 2" %
#                      array.ndim)
Why are the dimensions of the arrays being limited to 2?
Because scikit-learn uses a 2-d convention (n_samples × n_features) for all feature data. If any function or method lets a higher-d array through, that's usually just oversight and you can't really rely on it.
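If your samples are naturally higher-dimensional, the usual approach is to flatten everything after the sample axis into features before calling scikit-learn. A minimal sketch using the shapes from the question (the cross_validation import matches the older scikit-learn used there; newer releases use sklearn.model_selection):
import numpy as np
from sklearn import metrics
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer releases

X = np.random.random((10, 2, 2, 2))
y = np.random.random((10, 2, 2, 2))

# reshape (10, 2, 2, 2) -> (10, 8): one row per sample, one column per feature
X2 = X.reshape(len(X), -1)
y2 = y.reshape(len(y), -1)

X_train, X_test, y_train, y_test = train_test_split(X2, y2, test_size=0.5, random_state=3)
print(metrics.mean_squared_error(X2, y2))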