Creating Huge Sparse Matrices in python

I have been using ordinary NumPy matrices to store a matrix for a physics project. The size of the matrix is determined by the physical system.
For instance, if the system has parameters L = 4 and N = 2, the matrix has dimension 4C2 = 6, so it is a 6x6 matrix.
This is fine, except that I now need a larger size, i.e. 20C10 = 184,756. The required matrix is then 184756x184756, and trying to create an empty matrix of this size gives me a memory error (with 16 GB of RAM).
The resulting matrix consists mostly of diagonal and off-diagonal terms, so the large matrices contain a huge number of zeros. Hence sparse matrices seem like the correct approach.
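A quick back-of-the-envelope check shows why the dense allocation cannot fit. The sketch below assumes 8-byte float64 entries, which is what np.zeros produces by default; the helper names binomial and dense_bytes are illustrative only and are not part of the project code.

```python
from math import factorial

def binomial(n, k):
    """Number of ways to choose k items from n (the basis dimension here)."""
    return factorial(n) // (factorial(k) * factorial(n - k))

def dense_bytes(dim, itemsize=8):
    """Bytes needed for a dense dim x dim array (itemsize=8 assumes float64)."""
    return dim * dim * itemsize

D_small = binomial(4, 2)      # L = 4, N = 2
D_large = binomial(20, 10)    # L = 20, N = 10

print(D_small, dense_bytes(D_small), "bytes")
print(D_large, dense_bytes(D_large) / 1e9, "GB")
```

For L = 20, N = 10 this comes to roughly 273 GB, far beyond 16 GB of RAM, whereas the number of nonzero entries only grows with the number of allowed couplings per state.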
I have tried to get it to work by looking at other answers and just trying by myself from the python libraries, but to no avail.
Below is the code for my normal matrix:
def HamGen(L,N,delta,J):
    """
    Will Generate the hamiltonian matrix,
    Takes parameters:
    L : Number of sites
    N : Number of spin downs
    delta : anisotropy
    Each term is gotten by getting H(i,j) = <Set(i)|H|Set(j)>
    The term will be a number
    Where H is an operator that acts on elements of the set
    """
    D = BS.dimension(L,N) # Gets the dimension of the matrix, i.e. DxD matrix
    Hamiltonian = np.zeros((D,D)) # Creates empty matrix
    count1 = 0
    Set = BS.getSet(L,N) # The set of states to construct the hamiltonian
    for alpha in Set: #loop through the set (i)
        count2 = 0
        for beta in Set: # j
            """
            Compute ab = <alpha|Hamiltonian|beta>
            Then let Hamiltonian[a][b] = ab
            """
            if (alpha == beta):
                for i in range(L-1):
                    # Sz is just a function
                    Hamiltonian[count1][count2] += (J*delta*Sz(beta,i)*Sz(beta,i+1))
            b = check(alpha,beta)
            if b:
                del b[0]
                for j in b:
                    Hamiltonian[count1][count2] += (J*0.5*(Sp(beta,j)*Sm(beta,j+1) + Sm(beta,j)*Sp(beta,j+1)))
            count2 += 1
        count1 += 1
    return (np.asmatrix(Hamiltonian))
I mostly just need to know how to make the matrix without having to use as much memory, and then how to put the terms I calculate into the matrix.
Here is my attempt to make the matrix as a sparse matrix.
def SPHamGen(L,N,delta):
    """
    Will Generate the hamiltonian matrix,
    Takes parameters:
    L : Number of sites
    N : Number of spin downs
    delta : anisotropy
    """
    start = timeit.default_timer()
    D = BS.dimension(L,N)
    Ham = sp.coo_matrix((D,D))
    print Ham
    #data = ([0])*D
    count1 = 0
    Set = BS.getSet(L,N)
    data = ([0])*(D*D)
    rows = ([0])*(D*D)
    cols = ([0])*(D*D)
    for alpha in Set:
        count2 = 0
        for beta in Set:
            """
            Compute ab = <alpha|Hamiltonian|beta>
            Then let Hamiltonian[a][b] = ab
            """
            if (alpha == beta):
                for i in range(L-1):
                    #Hamiltonian[count1][count2] += (J*delta*Sz(beta,i)*Sz(beta,i+1))
                    data[count2] += (J*delta*Sz(beta,i)*Sz(beta,i+1))
                    rows[count2] = count1
                    cols[count2] = count2
            b = check(alpha,beta)
            if b:
                del b[0]
                for j in b:
                    #Hamiltonian[count1][count2] += (J*0.5*(Sp(beta,j)*Sm(beta,j+1) + Sm(beta,j)*Sp(beta,j+1)))
                    data[count2] += (J*0.5*(Sp(beta,j)*Sm(beta,j+1) + Sm(beta,j)*Sp(beta,j+1)))
                    rows[count2] = count1
                    cols[count2] = count2
            count2 += 1
        count1 += 1
    Ham = Ham + sp.coo_matrix((data,(rows,cols)), shape = (D,D))
    time = (timeit.default_timer() - start)
    print "\n"+str(time) +"s to calculate H"
    #return Ham
    return sparse.csr_matrix(Ham)
Thanks, Phil.
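One common pattern for this kind of problem is to collect only the nonzero entries as (row, column, value) triplets, so that nothing of size D*D is ever allocated. The sketch below is pure Python and uses a toy matrix with a diagonal and one off-diagonal band as a stand-in for the Hamiltonian; build_triplets, diag_value and hop_value are made-up placeholders, not the Sz/Sp/Sm terms from the question. The resulting lists have the layout expected by the sp.coo_matrix((data,(rows,cols)), shape = (D,D)) call already used in the attempt above.

```python
def build_triplets(D, diag_value, hop_value):
    """Collect the nonzero entries of a toy D x D band matrix as COO-style triplets."""
    rows, cols, data = [], [], []
    for i in range(D):
        # diagonal term <i|H|i>
        rows.append(i)
        cols.append(i)
        data.append(diag_value)
        # coupling between state i and state i + 1, stored symmetrically
        if i + 1 < D:
            rows.extend([i, i + 1])
            cols.extend([i + 1, i])
            data.extend([hop_value, hop_value])
    return rows, cols, data

rows, cols, data = build_triplets(6, diag_value=0.25, hop_value=0.5)
print(len(data), "stored entries instead of", 6 * 6)
```

The lists grow with the number of nonzero terms only. When a COO matrix is converted to CSR, SciPy sums entries that share the same (row, column) pair, so several contributions to one matrix element can simply be appended rather than accumulated by hand.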

Related

Component reconstruction for multivariate lagged time series

I am trying to write a multivariate Singular Spectrum Analysis with a Monte Carlo test. To this end I am working on a piece of code that can reconstruct the input series using the lagged trajectory matrix and the projection base (ST-PCs) that result from the PCA/SSA decomposition of the input series. The attached code works for a lagged univariate (that is, single) time series, but I am struggling to make this reconstruction work for a lagged multivariate time series. I don't quite get the procedure mathematically and, not surprisingly, I also did not manage to program it. Useful links are attached to the function descriptions of the accompanying code. Input data should be of the form (time * number of series), so say 288x3, implying 3 time series of 288 time levels.
I hope you can help me out!
import numpy as np
def lagged_covariance_matrix(data, M):
    """ Computes the lagged covariance matrix using the Broomhead & King method
    Background: Plaut, G., & Vautard, R. (1994). Spells of low-frequency oscillations and
    weather regimes in the Northern Hemisphere. Journal of the atmospheric sciences, 51(2), 210-236.
    Arguments:
    data : pxn time series, where p denotes the length of the time series and n the number of channels
    M : window length """
    # explicitly 'add' spatial dimension if input is a single time series
    if np.ndim(data) == 1:
        data = np.reshape(data,(len(data),1))
    T = data.shape[0]
    L = data.shape[1]
    N = T - M + 1
    X = np.zeros((T, L, M))
    for i in range(M):
        X[:,:,i] = np.roll(data, -i, axis = 0)
    X = X[:N]
    # X constitutes the trajectory matrix and is a stacked hankel matrix
    X = np.reshape(X, (N, M*L), order = 'C') # https://www.jstatsoft.org/article/viewFile/v067i02/v67i02.pdf
    # choose the smallest projection basis for computation of the covariance matrix
    if M*L >= N:
        return 1/(M*L) * X.dot(X.T), X
    else:
        return 1/N * X.T.dot(X), X
def sort_by_eigenvalues(eigenvalues, PCs):
    """ Sorts the PCs and eigenvalues by descending size of the eigenvalues """
    desc = np.argsort(-eigenvalues)
    return eigenvalues[desc], PCs[:,desc]
def Reconstruction(M, E, X):
    """ Reconstructs the series as the sum of M subseries.
    See: https://en.wikipedia.org/wiki/Singular_spectrum_analysis, 'Basic SSA' &
    the work of Vivien Sainte Fare Garnot on univariate time series (https://github.com/VSainteuf/mcssa)
    Arguments:
    M : window length
    E : eigenvector basis
    X : trajectory matrix """
    time = len(X) + M - 1
    RC = np.zeros((time, M))
    # step 3: grouping
    for i in range(M):
        d = np.zeros(M)
        d[i] = 1
        I = np.diag(d)
        Q = np.flipud(X @ E @ I @ E.T)
        # step 4: diagonal averaging
        for k in range(time):
            RC[k, i] = np.diagonal(Q, offset = -(time - M - k)).mean()
    return RC
#=====================================================================================================
# input data
data = None # placeholder: replace with an array of shape (time, number of series)
# number of lags a.k.a. window length
M = 45 # M = 1 means no lag
covmat, X = lagged_covariance_matrix(data, M)
# get the eigenvalues and vectors of the covariance matrix
vals, vecs = np.linalg.eig(covmat)
eig_data, eigvec_data = sort_by_eigenvalues(vals, vecs)
# component reconstruction
recons_data = Reconstruction(M, eigvec_data, X)

The following works but does not make direct use of the projection base (ST-PCs). Hence the original question still stands, but this already helps a great deal and solves the problem for me. This piece of code makes use of the similarity between the ST-PCs projection base and the u and vt matrices obtained from the singular value decomposition of the lagged trajectory matrix. I think it gives back the same answer as one would obtain using the ST-PCs projection base?
def lag_reconstruction(data, X, M, pairs = None):
    """ Reconstructs the series as the sum of M subseries using the lagged trajectory matrix.
    Based on equation 2.9 of Plaut, G., & Vautard, R. (1994). Spells of low-frequency oscillations and weather regimes in the Northern Hemisphere. Journal of Atmospheric Sciences, 51(2), 210-236.
    Inspired by work of R. van Westen and C. Wieners """
    time = data.shape[0] # number of time levels of the original series
    L = data.shape[1] # number of input series
    N = time - M + 1
    u, s, vt = np.linalg.svd(X, full_matrices = False)
    rc = np.zeros((time, L, M))
    for t in range(time):
        counter = 0
        for i in range(M):
            if t-i >= 0 and t-i < N:
                counter += 1
                if pairs:
                    for k in pairs:
                        rc[t,:,i] += u[t-i, k] * s[k] * vt[k, i*L : i*L + L]
                else:
                    for k in range(len(s)):
                        rc[t,:,i] += u[t-i, k] * s[k] * vt[k, i*L : i*L + L]
        rc[t] = rc[t]/counter
    return rc

Tensor contraction with Kronecker deltas in sympy

I'm trying to use sympy to do some index gymnastics for me. I'm trying to calculate the derivatives of a cost function that looks like
$$\text{cost} = \sum_i (M_{ii})^2$$
where M is given by a rotation
$$M_{ij} = U^{*}_{ki} M^{0}_{kl} U_{lj}$$
I've written up a parametrization for the rotation matrix, from which I get the derivatives as products of Kronecker deltas. What I've got so far is
from sympy import *
def Uder(p,q,r,s):
    return KroneckerDelta(p,r)*KroneckerDelta(q,s) - KroneckerDelta(p,s)*KroneckerDelta(q,r)
# Matrix size
n = symbols('n')
p = symbols('p')
i = Dummy('i')
k = Dummy('k')
l = Dummy('l')
# Matrix elements
M0 = IndexedBase('M')
U = IndexedBase('U')
# Indices
r, s = map(tensor.Idx, ['r', 's'])
# Derivative
cost_x = Sum(Sum(Sum(M0[i,i]*(Uder(k,i,r,s)*M0[k,l]*U[l,i] + U[k,i]*M0[k,l]*Uder(l,i,r,s)),(k,1,n)),(l,1,n)),(i,1,n))
print cost_x
but sympy is not evaluating the contractions for me, which should reduce to simple sums in terms of r and s, which are the rotation indices. Instead, what I get is
Sum(((-KroneckerDelta(_i, r)*KroneckerDelta(_k, s) + KroneckerDelta(_i, s)*KroneckerDelta(_k, r))*M[_k, _l]*U[_l, _i] + (-KroneckerDelta(_i, r)*KroneckerDelta(_l, s) + KroneckerDelta(_i, s)*KroneckerDelta(_l, r))*M[_k, _l]*U[_k, _i])*M[_i, _i], (_k, 1, n), (_l, 1, n), (_i, 1, n))
I'm using the latest git snapshot 4633fd5713c434c3286e3412a2399bd40fbd9569 of sympy.
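For reference, the contraction that the summation is expected to perform can be carried out by hand. The derivation below is a sketch that assumes 1 <= r, s <= n, so that each Kronecker delta picks out exactly one term of its sum, and it writes M for the IndexedBase named 'M' in the code (M0 in the text).

```latex
\begin{align}
\sum_{i,k,l} M_{ii}\,(\delta_{kr}\delta_{is} - \delta_{ks}\delta_{ir})\,M_{kl}\,U_{li}
  &= \sum_{l} \left( M_{ss} M_{rl} U_{ls} - M_{rr} M_{sl} U_{lr} \right), \\
\sum_{i,k,l} M_{ii}\,U_{ki}\,M_{kl}\,(\delta_{lr}\delta_{is} - \delta_{ls}\delta_{ir})
  &= \sum_{k} \left( M_{ss} U_{ks} M_{kr} - M_{rr} U_{kr} M_{ks} \right).
\end{align}
```

The two right-hand sides added together are the single sums in r and s that cost_x should collapse to once the deltas are contracted.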

Error using scipy.optimize nonlinear solvers

I am trying to solve a set of M simultaneous equations in M variables. I pass an M x 2 matrix to my function as the initial guess, and it returns an M x 2 matrix in which every entry would equal zero if my guess were correct. Thus my function can be represented as f_k(u1,u2,...uN) = 0 for k=1,2,...N. Below is the code for my function (for simplicity's sake I have left out the modules that go with this code, e.g. p. or phi.; I am mostly wondering whether anyone else has had this error before).
M = len(p.x_lat)
def main(u_A):
    ## unpack u_A
    u_P = u_total[:,0]
    u_W = u_total[:,1]
    ## calculate phi_A for all monomeric species
    G_W = exp(-u_W)
    phi_W = zeros(M)
    phi_W[1:] = p.phi_Wb * G_W[1:]
    ## calculate phi_A for all polymeric species
    G_P = exp(-u_P)
    G_P[0] = 0.
    G_fwd = phi.fwd_propagator(G_P,p.Np,0) #(function that takes G_P and propagates outward)
    G_bkwd = phi.bkwd_propagator(G_P,p.Np,0) #(function that takes G_P and propagates inward)
    phi_P = phi.phi_P(G_fwd,G_bkwd,p.norm_graft_density,p.Np) #(function that takes the two propagators and combines them to calculate a segment density at each point)
    ## calculate u_A components
    u_intW = en.u_int_AB(p.chi_PW,phi_P,p.phi_Pb) + en.u_int_AB(p.chi_SW,p.phi_S,p.phi_Sb) #(fxn that calculates new potential from the new segment densities)
    u_intW[0] = 0.
    u_Wprime = u_W - u_intW
    u_intP = en.u_int_AB(p.chi_PW,phi_W,p.phi_Wb) + en.u_int_AB(p.chi_PS,p.phi_S,p.phi_Sb) #(fxn that calculates new potential from the new segment densities)
    u_intP[0] = 0.
    u_Pprime = u_P - u_intP
    ## calculate f_A
    phi_total = p.phi_S + phi_W + phi_P
    u_prime = 0.5 * (u_Wprime + u_Pprime)
    f_total = zeros( (M, 2) )
    f_total[:,0] = 1. - 1./phi_total + u_prime - u_Wprime
    f_total[:,1] = 1. - 1./phi_total + u_prime - u_Pprime
    return f_total
I researched ways of solving nonlinear equations such as this one using python. I came across the scipy.optimize library with the several options for solvers http://docs.scipy.org/doc/scipy-0.13.0/reference/optimize.nonlin.html. I first tried to use the newton_krylov solver and received the following error message:
ValueError: Jacobian inversion yielded zero vector. This indicates a bug in the Jacobian approximation.
I also tried broyden1 solver and it never converged but simply stayed stagnant. Code for implementation of both below:
sol = newton_krylov(main, guess, verbose=1, f_tol=10e-7)
sol = broyden1(main, guess, verbose=1, f_tol=10e-7)
My initial guess is given below here:
## first guess of u_A(x)
u_P = zeros(M)
u_P[1] = -0.0001
u_P[M-1] = 0.0001
u_W = zeros(M)
u_W[1] = 0.0001
u_W[M-1] = -0.0001
u_total = zeros( (M,2) )
u_total[:,0] = u_P
u_total[:,1] = u_W
guess = u_total
Any help would be greatly appreciated!
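One thing that may be worth checking: in the function as posted, main receives u_A but unpacks the global u_total, so its return value does not depend on the iterate the solver passes in. A residual that never changes gives a zero finite-difference Jacobian, which is at least consistent with the newton_krylov message and with broyden1 stagnating. Whether this is the actual cause here is an assumption. The toy residuals below are hypothetical stand-ins written in pure Python; they are not the physics code from the question.

```python
def residual_ignoring_argument(u, u_global):
    """Mimics the posted pattern: the argument u is unused, u_global is read instead."""
    return [x * x - 1.0 for x in u_global]

def residual_using_argument(u):
    """Same toy residual, but computed from the argument."""
    return [x * x - 1.0 for x in u]

def directional_change(f, u, step=1e-6):
    """Largest change in f when every component of u is shifted by step."""
    shifted = [x + step for x in u]
    return max(abs(a - b) for a, b in zip(f(shifted), f(u)))

guess = [0.5, 2.0, -1.5]
frozen = lambda u: residual_ignoring_argument(u, guess)

print(directional_change(frozen, guess))                   # no response to the perturbation
print(directional_change(residual_using_argument, guess))  # responds to the perturbation
```

If the same kind of perturbation test on main returns exactly zero, unpacking u_A instead of u_total inside the function would be the first change to try.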

Incremental entropy computation

Let std::vector<int> counts be a vector of positive integers and let N := counts[0] + ... + counts[counts.size()-1] be the sum of the vector components. Setting pi := counts[i]/N, I compute the entropy using the classic formula H = -(p0*log2(p0) + ... + pn*log2(pn)).
The counts vector is changing --- counts are incremented --- and every 200 changes I recompute the entropy. After a quick google and stackoverflow search I couldn't find any method for incremental entropy computation. So the question: Is there an incremental method, like the ones for variance, for entropy computation?
EDIT: Motivation for this question was usage of such formulas for incremental information gain estimation in VFDT-like learners.
Resolved: See this mathoverflow post.

I derived update formulas and algorithms for entropy and Gini index and made the note available on arXiv. (The working version of the note is available here.) Also see this mathoverflow answer.
For the sake of convenience I am including simple Python code, demonstrating the derived formulas:
from math import log
from random import randint
# maps x to -x*log2(x) for x>0, and to 0 otherwise
h = lambda p: -p*log(p, 2) if p > 0 else 0
# update entropy if new example x comes in
def update(H, S, x):
    new_S = S+x
    return 1.0*H*S/new_S+h(1.0*x/new_S)+h(1.0*S/new_S)
# entropy of union of two samples with entropies H1 and H2
# (named update_union so that it does not shadow update above)
def update_union(H1, S1, H2, S2):
    S = S1+S2
    return 1.0*H1*S1/S+h(1.0*S1/S)+1.0*H2*S2/S+h(1.0*S2/S)
# compute entropy(L) using only `update' function
def test(L):
    S = 0.0 # sum of the sample elements
    H = 0.0 # sample entropy
    for x in L:
        H = update(H, S, x)
        S = S+x
    return H
# compute entropy using the classic equation
def entropy(L):
    n = 1.0*sum(L)
    return sum([h(x/n) for x in L])
# entry point
if __name__ == "__main__":
    L = [randint(1,100) for k in range(100)]
    M = [randint(100,1000) for k in range(100)]
    L_ent = entropy(L)
    L_sum = sum(L)
    M_ent = entropy(M)
    M_sum = sum(M)
    T = L+M
    print("Full = ", entropy(T))
    print("Update = ", update_union(L_ent, L_sum, M_ent, M_sum))

You could recompute the entropy from the updated counts, using a simple mathematical identity to simplify the entropy formula:
K = count.size();
N = count[0] + ... + count[K - 1];
H = -(count[0]/N * log2(count[0]/N) + ... + count[K - 1]/N * log2(count[K - 1]/N))
  = log2(N) - S / N
S = count[0] * log2(count[0]) + ... + count[K - 1] * log2(count[K - 1])
which holds because log2(a / b) == log2(a) - log2(b) and the fractions count[i]/N sum to 1.
Now, given an old vector count of the observations so far and another vector of 200 new observations called batch, you can do the following in C++11:
#include <cmath>
#include <numeric>
#include <vector>
void update_H(double& H, std::vector<int>& count, int& N, std::vector<int> const& batch)
{
    N += static_cast<int>(batch.size());
    for (auto b: batch)
        ++count[b];
    // S = sum of count[i] * log2(count[i]), treating 0 * log2(0) as 0
    double S = std::accumulate(count.begin(), count.end(), 0.0, [](double acc, int elem) {
        return elem > 0 ? acc + elem * std::log2(elem) : acc;
    });
    H = std::log2(N) - S / N;
}
Here I assume that you have encoded your observations as int. If you have some kind of symbol, you would need a symbol table std::map<Symbol, int>, and do a lookup for each symbol in batch before you update the count.
This seems the quickest way of writing some code for a general update. If you know that in every batch only a few counts actually change, you can do as @migdal does and keep track of the changing counts, subtract their old contribution to the entropy and add the new contribution.
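The idea of subtracting a count's old contribution and adding its new one can be sketched as follows. This is a minimal illustration that assumes the entropy is defined with the usual minus sign, so that H = log2(N) - S / N with S the sum of c * log2(c) over all counts; the class name RunningEntropy and its methods are invented for the example.

```python
from math import log2

def term(c):
    """c * log2(c), with the convention 0 * log2(0) = 0."""
    return c * log2(c) if c > 0 else 0.0

class RunningEntropy:
    """Keeps N = sum(counts) and S = sum(c * log2(c)) so that H = log2(N) - S / N."""

    def __init__(self, counts):
        self.counts = list(counts)
        self.N = sum(self.counts)
        self.S = sum(term(c) for c in self.counts)

    def increment(self, i, by=1):
        """Raise counts[i] and patch S in O(1): drop the old term, add the new one."""
        old = self.counts[i]
        self.counts[i] = old + by
        self.S += term(old + by) - term(old)
        self.N += by

    def entropy(self):
        return log2(self.N) - self.S / self.N if self.N > 0 else 0.0

def entropy_direct(counts):
    """Reference value from the classic formula, for comparison."""
    n = float(sum(counts))
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

tracker = RunningEntropy([3, 1, 4])
for i in [0, 2, 2, 1]:
    tracker.increment(i)
print(tracker.entropy(), entropy_direct(tracker.counts))
```

Each increment costs constant time regardless of the number of categories. Over very long runs the accumulated floating-point error in S may be worth resetting by an occasional full recomputation.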

Enumeration all possible matrices with constraints

I'm attempting to enumerate all possible matrices of size r by r with a few constraints.
Row and column sums must be in non-ascending order.
Starting from the top left element down the main diagonal, each row and column subset from that entry must be made up of combinations with replacements from 0 to the value in that upper left entry (inclusive).
The row and column sums must all be less than or equal to a predetermined n value.
The main diagonal must be in non-ascending order.
An important note is that I need every combination to be stored somewhere or, if written in C++, to be run through a few other functions after finding them.
r and n are values that range from 2 to say 100.
I've tried a recursive way to do this, along with an iterative, but keep getting hung up on keeping track column and row sums, along with all the data in a manageable sense.
I have attached my most recent attempt (which is far from completed), but may give you an idea of what is going on.
The function first_section(): builds row zero and column zero correctly, but other than that I don't have anything successful.
I need more than a push to get this going, the logic is a pain in the butt, and is swallowing me whole. I need to have this written in either python or C++.
import numpy as np
from itertools import combinations_with_replacement
global r
global n
r = 4
n = 8
printing = False # debug-output switch (not defined in the code as posted)
global myarray
myarray = np.zeros((r,r))
global arraysums
arraysums = np.zeros((r,2))
def first_section():
    bigData = []
    myarray = np.zeros((r,r))
    arraysums = np.zeros((r,2))
    for i in reversed(range(1,n+1)):
        myarray[0,0] = i
        stuff = []
        stuff = list(combinations_with_replacement(range(i),r-1))
        for j in range(len(stuff)):
            myarray[0,1:] = list(reversed(stuff[j]))
            arraysums[0,0] = sum(myarray[0,:])
            for k in range(len(stuff)):
                myarray[1:,0] = list(reversed(stuff[k]))
                arraysums[0,1] = sum(myarray[:,0])
                if arraysums.max() > n:
                    break
                bigData.append(np.hstack((myarray[0,:],myarray[1:,0])))
                if printing: print 'myarray \n%s' %(myarray)
    return bigData
def one_more_section(bigData,index):
    newData = []
    for item in bigData:
        if printing: print 'item = %s' %(item)
        upperbound = int(item[index-1]) # will need to have logic worked out
        if printing: print 'upperbound = %s' % (upperbound)
        for i in reversed(range(1,upperbound+1)):
            myarray[index,index] = i
            stuff = []
            stuff = list(combinations_with_replacement(range(i),r-1))
            for j in range(len(stuff)):
                myarray[index,index+1:] = list(reversed(stuff[j]))
                arraysums[index,0] = sum(myarray[index,:])
                for k in range(len(stuff)):
                    myarray[index+1:,index] = list(reversed(stuff[k]))
                    arraysums[index,1] = sum(myarray[:,index])
                    if arraysums.max() > n:
                        break
                    if printing: print 'index = %s' %(index)
                    newData.append(np.hstack((myarray[index,index:],myarray[index+1:,index])))
                    if printing: print 'myarray \n%s' %(myarray)
    return newData
bigData = first_section()
bigData = one_more_section(bigData,1)
A possible matrix could look like this:
r = 4, n >= 6
|3 2 0 0| = 5
|3 2 0 0| = 5
|0 0 2 1| = 3
|0 0 0 1| = 1
6 4 2 2

Here's a solution in numpy and python 2.7. Note that all the rows and columns are in non-increasing order, because you only specified that they should be combinations with replacement, and not their sortedness (and generating combinations is the simplest with sorted lists).
The code could be optimized somewhat by keeping row and column sums around as arguments instead of recomputing them.
import numpy as np
r = 2 #matrix dimension
maxs = 5 #maximum sum of row/column
def generate(r, maxs):
    # We create an extra row and column for the starting "dummy" values.
    # Filling in the matrix becomes much simpler when we do not have to treat cells with
    # one or two zero indices in special way. Thus, we start iteration from the
    # (1, 1) index.
    m = np.zeros((r + 1, r + 1), dtype = np.int32)
    m[0] = m[:,0] = maxs + 1
    def go(n, i, j):
        # Once the row index moves past r, every cell up to and including (r, r)
        # has been filled, so we yield a copy of the non-dummy parts.
        if i > r:
            yield m[1:, 1:].copy()
            return
        # We compute the next indices in row major order (the choice is arbitrary).
        (i2, j2) = (i + 1, 1) if j == r else (i, j + 1)
        # Computing the maximum possible value for the current cell.
        max_val = min(
            maxs - m[i, 1:].sum(),
            maxs - m[1:, j].sum(),
            m[i, j-1],
            m[i-1, j])
        for n2 in xrange(max_val, -1, -1):
            m[i, j] = n2
            for matrix in go(n2, i2, j2):
                yield matrix
    return go(maxs, 1, 1) #note that this is a generator object
# testing
for matrix in generate(r, maxs):
    print
    print matrix
If you'd like to have all the valid permutations in the rows and columns, this code below should work.
def generate(r, maxs):
    m = np.zeros((r + 1, r + 1), dtype = np.int32)
    rows = [0]*(r+1) # We avoid recomputing row/col sums on each cell.
    cols = [0]*(r+1)
    rows[0] = cols[0] = m[0, 0] = maxs
    def go(i, j):
        # Once the row index moves past r, the cell (r, r) has been filled too.
        if i > r:
            yield m[1:, 1:].copy()
            return
        (i2, j2) = (i + 1, 1) if j == r else (i, j + 1)
        max_val = min(rows[i-1] - rows[i], cols[j-1] - cols[j])
        if i == j:
            max_val = min(max_val, m[i-1, j-1])
        if (i, j) != (1, 1):
            max_val = min(max_val, m[1, 1])
        for n in xrange(max_val, -1, -1):
            m[i, j] = n
            rows[i] += n
            cols[j] += n
            for matrix in go(i2, j2):
                yield matrix
            rows[i] -= n
            cols[j] -= n
    return go(1, 1)
