Matching Numpy and NetCDF4 indexing when creating a netCDF file - python-2.7

I'm trying to move values from a numpy array to a NetCDF file, which I am creating. Currently I'm trying to find the best way to emulate 'fancy indexing' of numpy arrays when creating a netCDF file, but the two indexing systems don't match when the dataset only has two points.
import netCDF4
import numpy as np
rootgrp = netCDF4.Dataset('Test.nc','w',format='NETCDF4')
time = rootgrp.createDimension('time',None)
dim1 = rootgrp.createDimension('dim1',100)
dim2 = rootgrp.createDimension('dim2',100)
dim3 = rootgrp.createDimension('dim3',100)
ncVar = rootgrp.createVariable('ncVar','f4',('time','dim1','dim2','dim3'))
npArr = np.arange(0,10000)
npArr = np.reshape(npArr,(100,100))
So this works just fine:
x,y=np.array(([1,75,10,99],[40,88,19,2]))
ncVar[0,x,y,0] = npArr[x,y]
While this does not:
x,y=np.array(([1,75],[40,88]))
ncVar[0,x,y,0] = npArr[x,y]
These assignments are part of a dynamic loop that determines x,y to create values for ncVar at ~1000 time-steps
EDIT: the issue seems to be that the first case recognizes x,y as defining a series of pts, and so returns a [4,] size array (despite the documentation on netCDF4 'fancy indexing'), while the second interprets them combinatorially and so returns a [2,2] size array (as stated in the documentation). Has anyone run into this or found a workaround?

Related

Parallel Calculation of Distance Correlation (dcor) from DataFrame

I have a pandas DataFrame with 50 rows and 22000 columns, and I would like to calculate a distance correlation (dcor package) between each pair of columns. The code that I created (with a serial-processing and a portion of the data) is:
import pandas as pd
import dcor
DF = pd.DataFrame({'X':[0.72,-0.25,-1.2,-3],'Y':[-0.128,0.2,2,5.6],'Z':[15,-0.425,-0.3,-5]})
DCOR_REZ=pd.DataFrame(index=['X','Y','Z'],columns=['X','Y','Z'])
col_names=DCOR_REZ.columns.tolist()
k=0
for i in col_names:
v1=DF.loc[:,i].as_matrix()
for j in col_names[k:]:
v2=DF.loc[:,j].as_matrix()
rez=dcor.distance_correlation(v1,v2)
DCOR_REZ.at[i,j]=rez
DCOR_REZ.at[j,i]=rez
k=k+1
print DCOR_REZ
X Y Z
X 1 0.981778 0.854349
Y 0.981778 1 0.726328
Z 0.854349 0.726328 1
To execute this code on a full DataFrame I need 21h!.
Since my server has 40 processors I was thinking to cut the time by 40 and get the results in ~30 minutes but I don't know how to rewrite this code for parallel processing.
How can I rewrite the code?
Any help is appreciated.
I am the creator of the dcor package. One problem of this approach is that the pairwise distance matrices for each column are computed on each iteration, instead of just once. If you have enough memory, you could compute those matrices beforehand, and then compute the distance correlation:
import pandas as pd
import dcor
import numpy as np
from scipy.spatial.distance import pdist, squareform
DF = pd.DataFrame({'X':[0.72,-0.25,-1.2,-3],'Y':[-0.128,0.2,2,5.6],'Z':[15,-0.425,-0.3,-5]})
DCOR_REZ=pd.DataFrame(index=['X','Y','Z'],columns=['X','Y','Z'])
col_names=DCOR_REZ.columns.tolist()
k=0
dict_centered_matrices = {}
def compute_matrix(i):
v1=DF.loc[:,i].as_matrix()
v1_dist = squareform(pdist(v1[:, np.newaxis]))
return (i, dcor.double_centered(v1_dist))
dict_centered_matrices = dict(map(compute_matrix, col_names))
for i in col_names:
v1_centered = dict_centered_matrices[i]
for j in col_names[k:]:
v2_centered = dict_centered_matrices[j]
rez=np.sqrt(
dcor.average_product(v1_centered, v2_centered)/np.sqrt(
dcor.average_product(v1_centered, v1_centered)*
dcor.average_product(v2_centered, v2_centered)))
DCOR_REZ.at[i,j]=rez
DCOR_REZ.at[j,i]=rez
k=k+1
print(DCOR_REZ)
This should make your code faster, at the expense of consuming more memory. I will consider adding convenience functions for this case, as it seems a common one. You can also try parallelizing the code using the multiprocessing module, and replacing the map function with the map method of a Pool instance.
Since dcor version 0.5 I have added a rowwise method with this explicit purpose in mind. It will parallelize the computation using available cores when the right conditions are met (basically, when the distance covariance/correlation is computed between random variables and not random vectors, by default). Sorry for the delay in implementing this.

Tensor flow shuffle a tensor for batch gradient

To whom it may concern,
I am pretty new to tensorflow. I am trying to solve the famous MNIST problem for CNN. But i have encountered difficulty when i have to resuffle the x_training data (which is a [40000, 28, 28, 1] shape data.
my code is as below:
x_train_final = tf.reshape(x_train_final, [-1, image_width, image_width, 1])
x_train_final = tf.cast(x_train_final, dtype=tf.float32)
perm = np.arange(num_training_example).astype(np.int32)
np.random.shuffle(perm)
x_train_final = x_train_final[perm]
Below errors happened:
ValueError: Shape must be rank 1 but is rank 2 for 'strided_slice_1371' (op: 'StridedSlice') with input shapes: [40000,28,28,1], [1,40000], [1,40000], [1].
Anyone can advise how can i work around this? Thanks.
I would suggest you to make use of scikit's shuffle function.
from sklearn.utils import shuffle
x_train_final = shuffle(x_train_final)
Also, you can pass in multiple arrays and shuffle function will reorganize(shuffle) the data in those multiple arrays maintaining same shuffling order in all those arrays. So with that, you can even pass in your label dataset as well.
Ex:
X_train, y_train = shuffle(X_train, y_train)

How can I save histogram plot in python?

I have following code that generates a histogram. How can I save the histogram automatically using the code? I tried what we do for other plot types but that did not work for histogram.a is a 'numpy.ndarray'.
a = [-0.86906864 -0.72122614 -0.18074998 -0.57190212 -0.25689268 -1.
0.68713553 0.29597819 0.45022949 0.37550592 0.86906864 0.17437203
0.48704826 0.2235648 0.72122614 0.14387731 0.94194514 ]
fig = pl.hist(a,normed=0)
pl.title('Mean')
pl.xlabel("value")
pl.ylabel("Frequency")
pl.savefig("abc.png")
This works for me:
import matplotlib.pyplot as pl
import numpy as np
a = np.array([-0.86906864, -0.72122614, -0.18074998, -0.57190212, -0.25689268 ,-1. ,0.68713553 ,0.29597819, 0.45022949, 0.37550592, 0.86906864, 0.17437203, 0.48704826, 0.2235648, 0.72122614, 0.14387731, 0.94194514])
fig = pl.hist(a,normed=0)
pl.title('Mean')
pl.xlabel("value")
pl.ylabel("Frequency")
pl.savefig("abc.png")
a in the OP is not a numpy array and its format also needs to be modified (it needs commas, not spaces as delimiters). This program successfully saves the histogram in the working directory. If it still does not work, supply it with a full path to the location where you want to save it like this
pl.savefig("/Users/atru/abc.png")
The pl.show() statement should not be placed before savefig() as it creates a new figure which makes savefig() save a blank figure instead of the desired one as explained in this post.

Convert Python Dask Series to list or Dask DataFrame inside for loop

I am working with a code in Pandas that involves reading a lot of files and then performing various operations on each file inside a loop (which iterates over a file list).
I am trying to convert this to a Dask-based approach instead of a Pandas-based approach and have the following attempt so far - I am new to Dask and need to ask about whether this is a reasonable approach.
Here is what the input data looks like:
A X1 X2 X3 A_d S_d
0 1.0 0.475220 0.839753 0.872468 1 1
1 2.0 0.318410 0.940817 0.526758 2 2
2 3.0 0.053959 0.056407 0.169253 3 3
3 4.0 0.900777 0.307995 0.689259 4 4
4 5.0 0.670465 0.939116 0.037865 5 5
Here is the code:
import dask.dataframe as dd
import numpy as np; import pandas as pd
def my_func(df,r): # perform representative calculations
q = df.columns.tolist()
df2 = df.loc[:,q[1:]] / df.loc[:,q()[1:]].sum()
df2['A'] = df['A']
df2 = df2[ ( df2['A'] >= r[0] ) & ( df2['A'] <= r[1] ) ]
c = q[1:-2]
A = df2.loc[:,c].sum()
tx = df2.loc[:,c].min() * df2.loc[:,c].max()
return A - tx
list_1 = []
for j in range(1,13):
df = dd.read_csv('Test_file.csv')
N = my_func(df,[751.7,790.4]) # perform calculations
out = ['X'+str(j)+'_2', df['A'].min()] + N.compute().tolist()
list_1.append(out)
df_f = pd.DataFrame(list_1)
my_func returns a Dask Series N. Currently, I must .compute() the Dask Series before I can convert it into a list. I am having trouble overcoming this.
Is it possible to vertically append N (which is a Dask Series) as a row to a blank Dask DF? eg. in Pandas, I tend to do
this: df_N = pd.DataFrame() would go outside the for loop and
then something like df_N = pd.concat([df_N,N],axis=0). This would
allow a Dask DF to be built up in the for loop. After that
(outside the loop), I could easily just horizontally concatenate the
built-up Dask DF to pd.DataFrame(list_1).
Another approach is to create a single row Dask DF from the Dask
series N. Then, vertically concatenate this single row DF to a
blank Dask DF (that was created outside the loop). Is it possible in Dask to create single row Dask DataFrame
from a Series?
Additional Information (if needed):
In my real code, I am reading from a *.csv file inside a loop. For this reason, when I generated a sample dastaset, I wrote it to a *.csv file in order to use dd.read_csv() inside the loop.
df2s['A'] = df['A'] - this line is needed since the line above it omits column A (during a normalization of each column to its sum) and produces new DataFrame. df2s['A'] = df['A'] adds column A back to the new DataFrame.
Is it possible to vertically append N (which is a Dask Series) as a row to a blank Dask DF? eg. in Pandas, I tend to do this: df_N = pd.DataFrame() would go outside the for loop and then something like df_N = pd.concat([df_N,N],axis=0). This would allow a Dask DF to be built up in the for loop. After that (outside the loop), I could easily just horizontally concatenate the built-up Dask DF to pd.DataFrame(list_1).
You should never append rows to either a Pandas dataframe or a Dask dataframe. This is very inefficient. Instead it is better to collect many pandas/dask dataframes together and then call the pd.concat or dd.concat function.
Also I note that you are calling compute within your for loop. It is recommended to call compute only after you have set up your entire computation if possible. Otherwise you are probably not getting much parallelism.
Note: I haven't actually gone through the trouble of understanding your code. I'm just responding to the questions at the end. Hopefully someone else comes along with a more comprehensive answer.

scikit-learn PCA doesn't have 'score' method

I am trying to identify the type of noise based on that article:
Model selection with Probabilistic (PCA) and Factor Analysis (FA)
I am using scikit-learn-0.14.1.win32-py2.7 on win8 64bit
I know that it refers on version 0.15, however at the version 0.14 documentation it mentions that the score method is available for PCA so I guess it should normally work:
sklearn.decomposition.ProbabilisticPCA
The problem is that no matter which PCA I will use for the *cross_val_score*, I always get a type error message saying that the estimator PCA does not have a score method:
*TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator PCA(copy=True, n_components=None, whiten=False) does not.*
Any ideas why is that happening?
Many thanks in advance
Christos
X has 1000 samples of 40 features
here is a portion of the code:
import numpy as np
import csv
from scipy import linalg
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.cross_validation import cross_val_score
from sklearn.grid_search import GridSearchCV
from sklearn.covariance import ShrunkCovariance, LedoitWolf
#read in the training data
train_path = '<train data path>/train.csv'
reader = csv.reader(open(train_path,"rb"),delimiter=',')
train = list(reader)
X = np.array(train).astype('float')
n_samples = 1000
n_features = 40
n_components = np.arange(0, n_features, 4)
def compute_scores(X):
pca = PCA()
pca_scores = []
for n in n_components:
pca.n_components = n
pca_scores.append(np.mean(cross_val_score(pca, X, n_jobs=1)))
return pca_scores
pca_scores = compute_scores(X)
n_components_pca = n_components[np.argmax(pca_scores)]
Ok, I think I found the problem. it is not working with PCA, but it does work with PPCA
However, by not providing a cv number the cross_val_score automatically sets 3-fold cross validation
that created 3 sets with sizes 334, 333 and 333 (my initial training set contains 1000 samples)
Since nympy.mean cannot make a comparison between sets with different sizes (334 vs 333), python rises an exception.
thx