Parallel Calculation of Distance Correlation (dcor) from DataFrame - python-2.7

I have a pandas DataFrame with 50 rows and 22000 columns, and I would like to calculate a distance correlation (dcor package) between each pair of columns. The code that I created (with serial processing and a portion of the data) is:
import pandas as pd
import dcor

DF = pd.DataFrame({'X': [0.72, -0.25, -1.2, -3],
                   'Y': [-0.128, 0.2, 2, 5.6],
                   'Z': [15, -0.425, -0.3, -5]})
DCOR_REZ = pd.DataFrame(index=['X', 'Y', 'Z'], columns=['X', 'Y', 'Z'])
col_names = DCOR_REZ.columns.tolist()
k = 0
for i in col_names:
    v1 = DF.loc[:, i].as_matrix()
    for j in col_names[k:]:
        v2 = DF.loc[:, j].as_matrix()
        rez = dcor.distance_correlation(v1, v2)
        DCOR_REZ.at[i, j] = rez
        DCOR_REZ.at[j, i] = rez
    k = k + 1
print DCOR_REZ
          X         Y         Z
X         1  0.981778  0.854349
Y  0.981778         1  0.726328
Z  0.854349  0.726328         1
Executing this code on the full DataFrame takes 21 hours!
Since my server has 40 processors, I was thinking to cut the time by 40 and get the results in ~30 minutes, but I don't know how to rewrite this code for parallel processing.
How can I rewrite the code?
Any help is appreciated.

I am the creator of the dcor package. One problem with this approach is that the pairwise distance matrices for each column are recomputed on every iteration, instead of computed just once. If you have enough memory, you could compute those matrices beforehand and then compute the distance correlation:
import pandas as pd
import numpy as np
import dcor
from scipy.spatial.distance import pdist, squareform

DF = pd.DataFrame({'X': [0.72, -0.25, -1.2, -3],
                   'Y': [-0.128, 0.2, 2, 5.6],
                   'Z': [15, -0.425, -0.3, -5]})
DCOR_REZ = pd.DataFrame(index=['X', 'Y', 'Z'], columns=['X', 'Y', 'Z'])
col_names = DCOR_REZ.columns.tolist()

def compute_matrix(i):
    # Compute the double-centered pairwise distance matrix of column i once
    v1 = DF.loc[:, i].as_matrix()
    v1_dist = squareform(pdist(v1[:, np.newaxis]))
    return (i, dcor.double_centered(v1_dist))

dict_centered_matrices = dict(map(compute_matrix, col_names))

k = 0
for i in col_names:
    v1_centered = dict_centered_matrices[i]
    for j in col_names[k:]:
        v2_centered = dict_centered_matrices[j]
        rez = np.sqrt(
            dcor.average_product(v1_centered, v2_centered) / np.sqrt(
                dcor.average_product(v1_centered, v1_centered) *
                dcor.average_product(v2_centered, v2_centered)))
        DCOR_REZ.at[i, j] = rez
        DCOR_REZ.at[j, i] = rez
    k = k + 1
print(DCOR_REZ)
This should make your code faster, at the expense of consuming more memory. I will consider adding convenience functions for this case, as it seems a common one. You can also try parallelizing the code using the multiprocessing module, and replacing the map function with the map method of a Pool instance.
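For example, a minimal sketch of that parallel variant (assuming a fork-based platform such as Linux, so that worker processes inherit the module-level DF, compute_matrix and col_names defined above):
import multiprocessing

# Compute each column's double-centered distance matrix in a worker process.
# Pool.map replaces the plain map call; the workers inherit the module-level
# DF, compute_matrix and col_names after the fork.
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=40)
    dict_centered_matrices = dict(pool.map(compute_matrix, col_names))
    pool.close()
    pool.join()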

Since dcor version 0.5, I have added a rowwise function with this explicit purpose in mind. It will parallelize the computation across the available cores when the right conditions are met (basically, by default, when the distance covariance/correlation is computed between random variables rather than random vectors). Sorry for the delay in implementing this.
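A minimal usage sketch (assuming dcor >= 0.5; the data here are random placeholders, and each row of x and y holds the samples of one random variable):
import numpy as np
import dcor

# One random variable per row: 10 variables, 50 samples each.
x = np.random.normal(size=(10, 50))
y = np.random.normal(size=(10, 50))

# Computes distance_correlation(x[i], y[i]) for every row i, dispatching
# to a compiled, parallel implementation when the conditions allow it.
result = dcor.rowwise(dcor.distance_correlation, x, y)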


Retrieving Pyomo solution without using for loop

I am struggling to find an efficient way of retrieving the solution to an optimization problem. The solution consists of around 200K variables that I would like in a pandas DataFrame. After searching online, the only approach I found for accessing the variables was through a for loop, which looks something like this:
instance = M.create_instance('input.dat')  # reading in a data file
results = opt.solve(instance, tee=True)
results.write()
instance.solutions.load_from(results)

for v in instance.component_objects(Var, active=True):
    print("Variable", v)
    varobject = getattr(instance, str(v))
    for index in varobject:
        print("   ", index, varobject[index].value)
I know I can use this for loop to store them in a DataFrame, but this is pretty inefficient.
I found out how to access the indexes by using
import pandas as pd
index = pd.DataFrame(instance.component_objects(Var, active=True))
but I don't know how to get the solution values.
There is actually a very simple and elegant solution, using pandas.DataFrame.from_dict combined with the Var.extract_values() method.
from pyomo.environ import *
import pandas as pd
m = ConcreteModel()
m.N = RangeSet(5)
m.x = Var(m.N, rule=lambda _, el: el**2) # x = [1,4,9,16,25]
df = pd.DataFrame.from_dict(m.x.extract_values(), orient='index', columns=[str(m.x)])
print(df)
yields
    x
1   1
2   4
3   9
4  16
5  25
Note that for Var we can use both get_values() and extract_values(); they appear to do the same thing. For Param, only extract_values() is available.
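For instance, a minimal sketch of pulling a Param into a DataFrame the same way (the model and names here are illustrative, not from the question):
from pyomo.environ import ConcreteModel, Param, RangeSet
import pandas as pd

m = ConcreteModel()
m.N = RangeSet(3)
m.p = Param(m.N, initialize={1: 10, 2: 20, 3: 30})

# extract_values() returns an {index: value} dict, which from_dict accepts.
df_p = pd.DataFrame.from_dict(m.p.extract_values(), orient='index', columns=[str(m.p)])
print(df_p)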
Of course you can use instance.some_var.pprint() to print it on the screen. But if you have a variable indexed by a large set, you can also write it to a separate file. The following code writes the result to a .txt file:
f = open('Result.txt', 'a')
instance.some_var.pprint(f)
f.close()
I had the same issue as Jasper and tried the suggested solutions. By doing so I noticed that the part writing the results takes most of the time. Maybe this is also true in Jasper's case.
results.write()
instance.solutions.load_from(results)
So I suggest suppressing these two lines if you can do so. Maybe someone has a suggestion on how to speed this up, or an alternative method.
I also saw that in this post (Pyomo: Save results to CSV files) the "for loop" method is recommended. A Pyomo developer states: "I think it's possible in option 2 for the indices and the variable slice to be iterated over in a different order which would invalidate your resulting array."
For simplicity of code and to largely avoid for loops, I found the pyomoio module in the urbs project, which has taken over the slightly deprecated code of pandaspyomo.py. It relies on each Pyomo object's iteritems() method and handles multiple dimensions elegantly. It can extract sets, parameters, and variables as pandas objects.
If I set up a small pyomo model
from pyomo.environ import *
import pyomoio as po
import pandas as pd

# Define a model with 200k values
m = ConcreteModel()
m.ix = RangeSet(200000)

def idem(model, i):
    return i

m.a = Param(m.ix, rule=idem)
I can read in the parameter with just one line of code
%%timeit
a_po = po.get_entity(m, 'a')
# 110 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
However, if I compare it to the approach in the original question, it is not faster; it is even a little slower:
%%timeit
val = []
ix = []
varobject = getattr(m, 'a')
for index in varobject:
    ix.append(index)
    val.append(varobject[index])
a = pd.Series(index=ix, data=val)
# 92.5 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

how to give the test size in stratified kfold sampling in python?

Using sklearn, I want to have 3 splits (i.e. n_splits=3) in the sample dataset and have a train/test ratio of 70:30. I'm able to split the set into 3 folds but not able to define the test size (similar to the train_test_split method). Is there a way to define the test sample size in StratifiedKFold?
from sklearn.model_selection import StratifiedKFold as SKF

skf = SKF(n_splits=3)
skf.get_n_splits(X, y)
for train_index, test_index in skf.split(X, y):
    # Loops over 3 iterations to have stratified train/test splits
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
StratifiedKFold performs, by definition, a K-fold split: on each iteration, the returned iterator yields K-1 folds for training and 1 fold for testing. K is controlled by n_splits, so it creates K groups of n_samples/K elements each and uses every combination of K-1 groups for training with the remaining group for testing. Refer to Wikipedia or search for K-fold cross-validation for more details.
In short, the size of the test set will be 1/K (i.e. 1/n_splits), so you can tune that parameter to control the test size (e.g. n_splits=3 will have a test split of size 1/3 = 33% of your data). However, StratifiedKFold will iterate over all K train/test combinations, which might not be what you want.
Having said that, you might be interested in StratifiedShuffleSplit, which returns a configurable number of splits with a configurable train/test ratio. If you just want a single split, you can set n_splits=1 and still keep test_size=0.3 (or whatever ratio you want).
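A minimal sketch of that approach (assuming X and y are arrays as in the question's snippet):
from sklearn.model_selection import StratifiedShuffleSplit

# 3 independent stratified splits, each holding out 30% of the data.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]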

PVLIB - DC Power From Irradiation - Simple Calculation

Dear pvlib users and developers,
I'm a researcher in computer science, not particularly expert in the simulation or modelling of solar panels. I'm interested in using pvlib, since we are trying to simulate the workings of a small solar panel used for IoT applications. In particular, the panel specs are the following:
12.8% max efficiency, Vmp = 5.82 V, size = 225 × 155 × 17 mm.
Before using pvlib, one of my collaborators wrote code that computes the irradiation directly from average monthly values calculated with PVWatts. I was not really satisfied, so we are starting to use pvlib.
In the old code, we have the power and current of the panel calculated as:
W = Irradiation * PanelSize(m^2) * Efficiency
A = W / Vmp
The irradiation in Madrid has been obtained with PVWatts; this is what my collaborator used:
DIrradiance = (2030.0, 2960.0, 4290.0, 5110.0, 5950.0, 7090.0, 7200.0, 6340.0, 4870.0, 3130.0, 2130.0, 1700.0)
I'm trying to understand whether pvlib computes values similar to the ones above, as averages over a day for each month, and the daily production curve.
I wrote this to compare pvlib with our old model:
import math
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import pandas as pd
import pvlib
from pvlib.location import Location

def irradiance(day, m):
    DIrradiance = (2030.0, 2960.0, 4290.0, 5110.0, 5950.0,
                   7090.0, 7200.0, 6340.0, 4870.0, 3130.0, 2130.0, 1700.0)
    madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
    times = pd.date_range(start=dt.datetime(2015, m, day, 0, 0),
                          end=dt.datetime(2015, m, day, 23, 59),
                          freq='60min')
    spaout = pvlib.solarposition.spa_python(times, madrid.latitude, madrid.longitude)
    spaout = spaout.assign(cosz=pd.Series(np.cos(np.deg2rad(spaout['zenith']))))
    z = np.array(spaout['cosz'])
    return z.clip(0) * DIrradiance[m - 1]

madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
times = pd.date_range(start=dt.datetime(2015, 8, 15, 0, 0),
                      end=dt.datetime(2015, 8, 15, 23, 59),
                      freq='60min')
old = irradiance(15, 8)           # old model
new = madrid.get_clearsky(times)  # pvlib irradiance

plt.plot(old, 'r-')               # compare them
plt.plot(old / 6.0, 'y-')         # old seems 6 times larger... I do not know why
plt.plot(new['ghi'].values, 'b-')
plt.show()
The code above computes the old irradiance using the zenith angle, and computes the GHI values using get_clearsky. I do not understand whether the GHI values must be multiplied by the cosine of the zenith too, or not. In any case, they are smaller by a factor of 6. What I'd like to have in the end is the power and current output of the panel (DC) without any inverter; we are not really interested in modelling it exactly, but we would at least like a reasonable curve. We are able to capture the amperes produced by the panel, and we want to compare the values measured with the panel on the rooftop to the values calculated by pvlib.
Any help on this would be really appreciated. Thanks.
Sorry Will, I do not care a lot about my previous model, since I'd like to move all the code to pvlib. I followed your suggestion and I'm using irradiance.total_irrad; the code now looks like this:
madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
times = pd.date_range(start=dt.datetime(2015, 1, 1, 0, 0),
                      end=dt.datetime(2015, 1, 1, 23, 59),
                      freq='60min')
ephem_data = pvlib.solarposition.spa_python(times, madrid.latitude,
                                            madrid.longitude)
irrad_data = madrid.get_clearsky(times)
AM = atmosphere.relativeairmass(ephem_data['apparent_zenith'])
total = irradiance.total_irrad(40, 180,
                               ephem_data['apparent_zenith'], ephem_data['azimuth'],
                               dni=irrad_data['dni'], ghi=irrad_data['ghi'],
                               dhi=irrad_data['dhi'], airmass=AM,
                               surface_type='urban')
poa = total['poa_global'].values
Now that I know the irradiance on the POA, I want to compute the output in amperes. Is it just
(poa * PANEL_EFFICIENCY * AREA) / VOLT_OUTPUT ?
It's not clear to me how you arrived at your values for DIrradiance or what the units are, so I can't comment much on the discrepancies between the values. I'm guessing it's some kind of monthly data, since there are 12 values. If so, you'd need to calculate ~hourly pvlib irradiance data and then integrate it to check for consistency.
If your module will be tilted, you'll need to convert your ~hourly irradiance GHI, DNI, DHI values to plane of array (POA) irradiance using a transposition model. The irradiance.total_irrad function is the easiest way to do that.
The next steps depend on the IV characteristics of your module, the rest of the circuit, and how accurate you need the model to be.
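As a rough first pass, the simple efficiency model from the question can be applied directly to the POA values. A minimal sketch, using the specs quoted in the question and assuming operation near the maximum power point (it ignores temperature, spectral, and low-light effects):
# poa is the 'poa_global' series computed above, in W/m^2.
PANEL_EFFICIENCY = 0.128
AREA = 0.225 * 0.155            # m^2, from the 225 x 155 mm panel size
VMP = 5.82                      # V

dc_power = poa * PANEL_EFFICIENCY * AREA  # W at each timestamp
dc_current = dc_power / VMP               # A, assuming the panel stays near Vmp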

Convert Python Dask Series to list or Dask DataFrame inside for loop

I am working with code in pandas that involves reading a lot of files and then performing various operations on each file inside a loop (which iterates over a file list).
I am trying to convert this from a pandas-based approach to a Dask-based approach, and have the following attempt so far. I am new to Dask and would like to ask whether this is a reasonable approach.
Here is what the input data looks like:
     A        X1        X2        X3  A_d  S_d
0  1.0  0.475220  0.839753  0.872468    1    1
1  2.0  0.318410  0.940817  0.526758    2    2
2  3.0  0.053959  0.056407  0.169253    3    3
3  4.0  0.900777  0.307995  0.689259    4    4
4  5.0  0.670465  0.939116  0.037865    5    5
Here is the code:
import dask.dataframe as dd
import numpy as np
import pandas as pd

def my_func(df, r):  # perform representative calculations
    q = df.columns.tolist()
    df2 = df.loc[:, q[1:]] / df.loc[:, q[1:]].sum()
    df2['A'] = df['A']
    df2 = df2[(df2['A'] >= r[0]) & (df2['A'] <= r[1])]
    c = q[1:-2]
    A = df2.loc[:, c].sum()
    tx = df2.loc[:, c].min() * df2.loc[:, c].max()
    return A - tx

list_1 = []
for j in range(1, 13):
    df = dd.read_csv('Test_file.csv')
    N = my_func(df, [751.7, 790.4])  # perform calculations
    out = ['X' + str(j) + '_2', df['A'].min()] + N.compute().tolist()
    list_1.append(out)
df_f = pd.DataFrame(list_1)
my_func returns a Dask Series N. Currently, I must call .compute() on the Dask Series before I can convert it into a list. I am having trouble overcoming this.
1. Is it possible to vertically append N (which is a Dask Series) as a row to a blank Dask DataFrame? E.g. in pandas, I tend to do this: df_N = pd.DataFrame() would go outside the for loop, and then something like df_N = pd.concat([df_N, N], axis=0). This would allow a Dask DataFrame to be built up in the for loop. After that (outside the loop), I could easily just horizontally concatenate the built-up Dask DataFrame to pd.DataFrame(list_1).
2. Another approach is to create a single-row Dask DataFrame from the Dask Series N, then vertically concatenate this single-row DataFrame to a blank Dask DataFrame (created outside the loop). Is it possible in Dask to create a single-row Dask DataFrame from a Series?
Additional information (if needed):
In my real code, I am reading from a *.csv file inside a loop. For this reason, when I generated a sample dataset, I wrote it to a *.csv file in order to use dd.read_csv() inside the loop.
df2['A'] = df['A'] - this line is needed since the line above it omits column A (during the normalization of each column to its sum) and produces a new DataFrame. df2['A'] = df['A'] adds column A back to the new DataFrame.
Is it possible to vertically append N (which is a Dask Series) as a row to a blank Dask DF?
You should never append rows to either a pandas DataFrame or a Dask DataFrame. This is very inefficient. Instead, it is better to collect many pandas/Dask DataFrames together and then call the pd.concat or dd.concat function.
Also, I note that you are calling compute within your for loop. It is recommended to call compute only after you have set up your entire computation, if possible; otherwise you are probably not getting much parallelism.
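A minimal sketch of that pattern, adapted to the loop in the question (the file name and arguments follow the question; this illustrates the collect-then-compute structure rather than being a drop-in replacement):
import dask
import dask.dataframe as dd
import pandas as pd

# Build up the lazy results first, without computing anything yet.
lazy_results = []
for j in range(1, 13):
    df = dd.read_csv('Test_file.csv')
    N = my_func(df, [751.7, 790.4])   # still a lazy Dask Series
    lazy_results.append(N)

# A single compute call lets Dask share work across all twelve tasks.
computed = dask.compute(*lazy_results)

# Concatenate the now-concrete pandas Series at the end.
df_f = pd.concat(computed, axis=1).T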
Note: I haven't actually gone through the trouble of understanding your code. I'm just responding to the questions at the end. Hopefully someone else comes along with a more comprehensive answer.

ODEINT with multiple parameters (time-dependent)

I'm trying to solve a single first-order ODE using odeint; the code follows. I expect to get one value of y per time point. The issue I'm struggling with is the ability to pass the nth value of mt and nt to calculate dydt. I think odeint passes all the values of mt and nt at once, instead of just the 0th, 1st, or 2nd, depending on the iteration. Because of this, I get this error:
RuntimeError: The size of the array returned by func (4) does not match the size of y0 (1).
Interestingly, if I replace the initial condition, which is (and should be) a single value, with a0 = [2]*4, the code works, but gives me a 4x4 matrix as the solution, which seems incorrect.
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint

mt = np.array([3, 7, 4, 2])      # Array of constants
nt = np.array([5, 1, 9, 3])      # Array of constants
c1, c2, c3 = [-0.3, 1.4, -0.5]   # coefficients
para = [mt, nt]                  # Packing parameters

# Test ODE function
def test(y, t, extra):
    m, n = extra
    dydt = c1*c2*m - c1*y - c3*n
    return dydt

a0 = [2]                 # Initial condition
tspan = range(len(mt))   # Define tspan

# Solving the ODE
yt = odeint(test, a0, tspan, args=(para,))

# Plotting the ODE
plt.plot(tspan, yt, 'g')
plt.title('Multiple Parameters Test')
plt.xlabel('Time')
plt.ylabel('Magnitude')
The first order differential equation is:
dy/dt = c1*(c2*mt-y(t)) - c3*nt
This equation represents part of a murine endocrine system, which I am trying to model. The system is analogous to a two-tank system, where the first tank receives a specific hormone [at an unknown rate], but our sensor will detect that level (mt) at specific time intervals (1 second). This tank then feeds into the second tank, where the level of this hormone (y) is detected by another sensor. I labeled the levels using separate variables because the sensors that detect the levels are independent of each other and are not calibrated to each other. 'c2' may be considered the coefficient that shows the correlation between the two levels. Also, the transfer of this hormone from tank 1 to tank 2 is diffusion-driven. This hormone is further consumed by a biochemical process (similar to a drain valve for the second tank). At the moment, it is unclear which parameters affect the consumption; however, another sensor can detect the amount of hormone (nt) being consumed at a specific time interval (1 second, in this case too).
Thus, mt and nt are the concentrations/levels of the hormone at specific time points. Although only 4 elements long in the code, these arrays are much longer in my study. All sensors report the concentrations at 1-second intervals, hence tspan consists of time points separated by 1 second.
The objective is to determine the concentration of this hormone in the second tank (y) mathematically and then optimize the values of these coefficients based on the experimental data. I was able to pass these arrays mt and nt to the defined ODE and solve it using ODE45 in MATLAB with no issue. I've been running into this RuntimeError while trying to replicate the code in Python.
As I mentioned in a comment, if you want to model this system using an ordinary differential equation, you have to make an assumption about the values of m and n between sample times. One possible model is to use linear interpolation. Here's a script that uses scipy.interpolate.interp1d to create the functions mfunc(t) and nfunc(t) based on the samples mt and nt.
import numpy as np
from scipy.integrate import odeint
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt

mt = np.array([3, 7, 4, 2])      # Array of constants
nt = np.array([5, 1, 9, 3])      # Array of constants
c1, c2, c3 = [-0.3, 1.4, -0.5]   # coefficients

# Create linear interpolators for m(t) and n(t).
sample_times = np.arange(len(mt))
mfunc = interp1d(sample_times, mt, bounds_error=False, fill_value="extrapolate")
nfunc = interp1d(sample_times, nt, bounds_error=False, fill_value="extrapolate")

# Test ODE function
def test(y, t):
    dydt = c1*c2*mfunc(t) - c1*y - c3*nfunc(t)
    return dydt

a0 = [2]   # Initial condition
tspan = np.linspace(0, sample_times.max(), 8*len(sample_times) + 1)
#tspan = sample_times

# Solving the ODE
yt = odeint(test, a0, tspan)

# Plotting the ODE
plt.plot(tspan, yt, 'g')
plt.title('Multiple Parameters Test')
plt.xlabel('Time')
plt.ylabel('Magnitude')
plt.show()
The script produces a plot of the solution curve (image not reproduced here). Note that instead of generating the solution only at sample_times (i.e. at times 0, 1, 2, and 3), I set tspan to a denser set of points. This shows the behavior of the model between sample times.