I have a dataset similar to the following table:
The prediction target is the 'score' column. I'm wondering how I can divide the test set into subgroups, such as scores between 1 and 3, and then check the accuracy on each subgroup.
Now what I have is as follows:
from sklearn import tree, metrics
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = tree.DecisionTreeRegressor()
model.fit(X_train, y_train)
for i in (0, 1, 2, 3, 4):
    y_new = y_test[(y_test >= i) & (y_test <= i + 1)]
    y_new_pred = model.predict(X_test)
    print(metrics.r2_score(y_new, y_new_pred))
However, my code did not work and this is the traceback that I get:
Found input variables with inconsistent numbers of samples: [14279, 55955]
I have tried the solution provided below, and for the full score range (0-5) the r^2 is 0.67, but for the sub-ranges (0-1, 1-2, 2-3, 3-4, 4-5) the r^2 values are all significantly lower than for the full range. Shouldn't some of the sub-range r^2 values be higher than 0.67 and some of them be lower?
Could anyone kindly let me know where I went wrong? Thanks a lot for your help.
When you compute the metrics, you also have to filter the predicted values (based on your subset condition).
Basically you are trying to compute
metrics.r2_score([1,3],[1,2,3,4,5])
which raises an error:
ValueError: Found input variables with inconsistent numbers of samples: [2, 5]
Hence, my suggested solution would be
model.fit(X_train, y_train)

# Compute the prediction only once.
y_pred = model.predict(X_test)

for i in (0, 1, 2, 3, 4):
    # Compute the condition for the subset here
    subset = (y_test >= i) & (y_test <= i + 1)
    print(metrics.r2_score(y_test[subset], y_pred[subset]))
I am struggling to find an efficient way of retrieving the solution to an optimization problem. The solution consists of around 200K variables that I would like in a pandas DataFrame. After searching online, the only approach I found for accessing the variables was through a for loop, which looks something like this:
instance = M.create_instance('input.dat')  # reading in a data file
results = opt.solve(instance, tee=True)
results.write()
instance.solutions.load_from(results)

for v in instance.component_objects(Var, active=True):
    print("Variable", v)
    varobject = getattr(instance, str(v))
    for index in varobject:
        print("  ", index, varobject[index].value)
I know I can use this for loop to store them in a DataFrame, but this is pretty inefficient.
I found out how to access the indexes by using
import pandas as pd
index = pd.DataFrame(instance.component_objects(Var, active=True))
But I don't know how to get the solution values.
There is actually a very simple and elegant solution, using pandas.DataFrame.from_dict combined with the Var.extract_values() method.
from pyomo.environ import *
import pandas as pd
m = ConcreteModel()
m.N = RangeSet(5)
m.x = Var(m.N, rule=lambda _, el: el**2) # x = [1,4,9,16,25]
df = pd.DataFrame.from_dict(m.x.extract_values(), orient='index', columns=[str(m.x)])
print(df)
yields
x
1 1
2 4
3 9
4 16
5 25
Note that for Var we can use both get_values() and extract_values(); they seem to do the same thing. For Param there is only extract_values().
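For instance, a small sketch extending the model above with a Param (the parameter p below is purely illustrative, not part of the original answer):

m.p = Param(m.N, initialize=lambda m, i: 10 * i)  # illustrative parameter: p = [10, 20, 30, 40, 50]
# extract_values() works for Param too, so the same one-liner applies
df_p = pd.DataFrame.from_dict(m.p.extract_values(), orient='index', columns=[str(m.p)])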
Of course you can use instance.some_var.pprint() to print it on the screen.
But if you have a variable indexed by a large set, you can also write it to a separate file. The following code writes the result to a .txt file:
f = open('Result.txt', 'a')
instance.some_var.pprint(f)
f.close()
I had the same issue as Jasper and tried the suggested solutions. By doing so I noticed that the part writing the results takes most of the time. Maybe this is also true in Jasper's case.
results.write()
instance.solutions.load_from(results)
So I suggest suppressing these two lines if you can do so. Maybe someone has a suggestion on how to speed this up, or an alternative method?
Also, I saw that in this post (Pyomo: Save results to CSV files) the "for loop" method is recommended. A Pyomo developer states: "I think it's possible in option 2 for the indices and the variable slice to be iterated over in a different order which would invalidate your resulting array."
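As a hedged sketch of that for-loop approach, here is one way to collect every variable value into a DataFrame and write it to CSV (the column names are illustrative):

import pandas as pd

records = []
for v in instance.component_objects(Var, active=True):
    varobject = getattr(instance, str(v))
    for index in varobject:
        records.append((str(v), index, varobject[index].value))

df = pd.DataFrame(records, columns=['variable', 'index', 'value'])
df.to_csv('results.csv', index=False)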
For simplicity of code and to largely avoid for-loops, I found the pyomoio module in the urbs project, which has taken over the slightly deprecated code of pandaspyomo.py. It relies on each pyomo object's iteritem() method, and handles multiple dimensions elegantly. It can extract sets, parameters, variables as pandas objects.
If I set up a small pyomo model
from pyomo.environ import *
import pyomoio as po
import pandas as pd

# Define a model with 200k values
m = ConcreteModel()
m.ix = RangeSet(200000)

def idem(model, i):
    return i

m.a = Param(m.ix, rule=idem)
I can read in the parameter with just one line of code
%%timeit
a_po = po.get_entity(m, 'a')
# 110 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
However, if I compare it to the approach in the original question, it is not faster; it is even a little slower:
%%timeit
val = []
ix = []
varobject = getattr(m, 'a')
for index in varobject:
    ix.append(index)
    val.append(varobject[index])
a = pd.Series(index=ix, data=val)
# 92.5 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using sklearn, I want to have 3 splits (i.e. n_splits = 3) in the sample dataset and a train/test ratio of 70:30. I'm able to split the set into 3 folds but not able to define the test size (similar to the train_test_split method). Is there a way to define the test sample size in StratifiedKFold?
from sklearn.model_selection import StratifiedKFold as SKF

skf = SKF(n_splits=3)
skf.get_n_splits(X, y)

for train_index, test_index in skf.split(X, y):
    # Loops over 3 iterations to produce stratified train/test splits
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
StratifiedKFold does, by definition, a K-fold split: the returned iterator yields K-1 groups for training and 1 group for testing on each iteration. K is controlled by n_splits, so it creates groups of roughly n_samples/K and uses every combination of K-1 groups for training and the remaining group for testing. Refer to Wikipedia or search for K-fold cross-validation for more information.
In short, the size of the test set will be 1/K (i.e. 1/n_splits), so you can tune that parameter to control the test size (e.g. n_splits=3 gives a test split of 1/3 ≈ 33% of your data). However, StratifiedKFold will iterate over all K train/test combinations, which might not be what you want.
Having said that, you might be interested in StratifiedShuffleSplit, which returns a configurable number of splits with a configurable train/test ratio. If you just want a single split, you can set n_splits=1 and keep test_size=0.3 (or whatever ratio you want).
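For example, a minimal sketch matching the 70:30 ratio from the question (assuming X and y support integer-array indexing, as in the question's code):

from sklearn.model_selection import StratifiedShuffleSplit

# Three independent stratified splits, each with a 70:30 train/test ratio;
# random_state is optional and only added here for reproducibility.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]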
The output in Keras always comes in one-hot format and looks like:
[[ 0. 0. 0. 0. 0. 0. 1. 0]]
[[ 0. 1. 0. 0. 0. 0. 0. 0]]
But I want a probability for each class, for example:
[[ 0.54 0.80 0.34 0.041]]
My simple code is given below
import numpy as np
from keras.models import load_model
from new_get_data_for_test import load_my_data

try:
    X_test = load_my_data()
    model = load_model('/home/istiyak/development/pythonProject/my_save_model.h5')
    output = model.predict(X_test)
    print("After prediction our output is: ")
    print(output)
    print(np.argmax(output, axis=1))
except Exception as e:
    print("Some exception occurred:", e)
Use predict_proba. From the documentation:
predict_proba(self, x, batch_size=32, verbose=1)
Generates class probability predictions for the input samples batch by batch.
Returns a Numpy array of probability predictions.
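A minimal sketch of how that call could slot into the question's script (model and X_test are the names from the question):

probabilities = model.predict_proba(X_test, batch_size=32, verbose=0)
print(probabilities)                     # per-class probabilities for each sample
print(np.argmax(probabilities, axis=1))  # index of the most likely class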
I have been trying to work out how to replicate IDL's smooth function in Python and I just can't get anything like the same results. (Disclaimer: It is probably 10 years since I touched this kind of mathematical problem so it has been dumped to make way for information like where to find the cheapest local fuel). I am trying to code this:
smooth(b,w,/nan)
where b is a 2D float array containing NaNs (zeros, i.e. missing data, have also been converted to NaN).
From the IDL documents, it appears smooth uses a boxcar, so from scipy.ndimage.filters I have tried:
bsmooth = uniform_filter(b, w)
I am aware that there are some fundamental differences here:
The default edge behaviour from IDL is "the end points are copied from the original array to the result with no smoothing", whereas I don't seem to have the option to do this with the uniform filter.
Treatment of the NaN elements. In IDL, the /nan keyword seems to mean that where possible the NaN values will be filled by the result of the other points in the window; if there are no valid points to generate a result, the value comes from the MISSING keyword. I thought I could approximate this behaviour after the smoothing using scipy.interpolate's NearestNDInterpolator (thanks to the brilliant explanation by Alex here: filling gaps on an image using numpy and scipy); a short sketch of that idea follows below.
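A minimal sketch of that gap-filling step (assuming bsmooth is the smoothed array from below that still contains NaNs):

import numpy as np
from scipy.interpolate import NearestNDInterpolator

mask = np.isfinite(bsmooth)
# Build a nearest-neighbour interpolator from the (row, col) positions of the
# valid pixels, then fill the NaN positions from their nearest valid neighbour.
interp = NearestNDInterpolator(np.argwhere(mask), bsmooth[mask])
filled = bsmooth.copy()
filled[~mask] = interp(np.argwhere(~mask))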
Here is my test array:
>>> b
array([[ 0.97599638,  0.93114936,  0.87070072,  0.5379253 ],
       [ 0.34873217,         nan,  0.40985891,  0.22407863],
       [        nan,         nan,         nan,  0.67532134],
       [        nan,         nan,  0.85441768,         nan]])
My answers bore not the SLIGHTEST resemblance to IDL, whether I use the /nan keyword or not.
IDL> smooth(b,2,/nan)
0.97599638 0.93114936 0.87070072 0.53792530
0.34873217 0.70728749 0.60817236 0.22407863
NaN 0.53766960 0.54091913 0.67532134
NaN NaN 0.85441768 NaN
IDL> smooth(b,2)
0.97599638 0.93114936 0.87070072 0.53792530
0.34873217 -NaN -NaN 0.22407863
-NaN -NaN -NaN 0.67532134
-NaN -NaN 0.85441768 NaN
I confess I find the scipy documentation rather sparse on detail, so I have no idea if I am really doing what I think I am doing. The fact that the two Python approaches which I believed would both smooth the image give different answers suggests that things are not what I understood them to be.
>>>uniform_filter(b, 2)
array([[ 0.97599638, 0.95357287, 0.90092504, 0.70431301],
[ 0.66236428, nan, nan, nan],
[ nan, nan, nan, nan],
[ nan, nan, nan, nan]])
I thought it was a bit odd that the result was so empty, so I tried this with an array of 100 elements (still using a window of 2) and output the images. The results (the first image is 'b', the second is 'bsmooth') are not quite what I was hoping for:
Going back to the smaller array and following the examples in: http://scipy.github.io/old-wiki/pages/Cookbook/SignalSmooth which I thought would give the same output as uniform_filter, I tried:
>>> box = np.array([1,1,1,1])
>>> box = box.reshape(2,2)
>>> box
array([[1, 1],
[1, 1]])
>>> bsmooth = scipy.signal.convolve2d(b,box,mode='same')
>>> print bsmooth
[[ 0.97599638 1.90714574 1.80185008 1.40862602]
[ 1.32472855 nan nan 2.04256356]
[ nan nan nan nan]
[ nan nan nan nan]]
Obviously I have completely misunderstood the scipy functions, maybe even the IDL one. If anyone can help me to replicate the IDL smooth function as closely as possible, I would be extremely grateful. I am under considerable time pressure to get a solution for this that doesn't rely on IDL and I am tossing a coin to decide whether to code the function from scratch or develop a very contagious illness.
How can I perform the same smoothing in python?
First: please use matplotlib.pyplot.imshow with interpolation="none"; that's nicer to look at, and maybe use a greyscale colormap.
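For instance (b being the array from the question):

import matplotlib.pyplot as plt

plt.imshow(b, interpolation='none', cmap='gray')  # no interpolation between pixels
plt.colorbar()
plt.show()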
So for your example: there is actually no convolution (filter) within scipy and numpy that treats NaN as missing values (they propagate them through the convolution). At least I have found none so far, and your boundary treatment is also (to my knowledge) not implemented. But the boundary could just be replaced afterwards.
If you want to do convolution with NaNs you can, for example, use astropy.convolution.convolve. There, NaNs are interpolated using the kernel of your filter. But their convolution has some drawbacks as well: the border handling you want isn't implemented there either, your kernel must have an odd shape, and the sum of your kernel must not be zero (or very close to it).
For example:
from astropy.convolution import convolve
import numpy as np
array = np.random.uniform(10,100, (4,4))
array[1,1] = np.nan
kernel = np.ones((3,3))
convolve(array, kernel)
As an example, an initial array of
array([[ 97.19514587, 62.36979751, 93.54811286, 30.23567842],
[ 51.02184613, nan, 46.14769821, 60.08088041],
[ 20.86482452, 42.39661484, 36.96961278, 96.89180175],
[ 45.54453509, 76.61274347, 46.44485141, 25.40985372]])
will become:
array([[ 266.9009961 , 406.59680717, 348.69637399, 230.01236989],
[ 330.16243546, 506.82785931, 524.95440336, 363.87378443],
[ 292.75477064, 422.31693304, 487.26826319, 311.94469828],
[ 185.41871792, 268.83318211, 324.72547798, 205.71611967]])
If you want to "normalize" it, astropy offers the normalize_kernel parameter:
convolved = convolve(array, kernel, normalize_kernel=True)
array([[ 29.58753936, 42.09982189, 49.31793529, 33.00203873],
[ 49.87040638, 65.67695002, 66.10447436, 40.44026448],
[ 52.51126383, 63.03914444, 60.85474739, 35.88011742],
[ 39.40188443, 46.82350749, 40.1380926 , 22.46090152]])
If you want to replace the "edge" values with the ones from the original array just replace them:
convolved[0,:] = array[0,:]
convolved[-1,:] = array[-1,:]
convolved[:,0] = array[:,0]
convolved[:,-1] = array[:,-1]
So that's what the existing packages offer (as far as I know). If you want to learn a bit of Cython or numba, you can easily write your own convolution that is not much slower (only a factor of 2-10) than the numpy/scipy ones but does EXACTLY what you want without messing around.
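As an illustration only (not part of the original answer), here is a minimal numba sketch of a NaN-ignoring boxcar mean that leaves the border pixels untouched, assuming an odd window width w:

import numpy as np
from numba import njit

@njit
def nan_boxcar(arr, w):
    # NaN-aware moving average over a w x w window; border rows/columns are
    # copied unchanged, roughly mirroring IDL's default edge behaviour.
    out = arr.copy()
    half = w // 2
    rows, cols = arr.shape
    for r in range(half, rows - half):
        for c in range(half, cols - half):
            total, count = 0.0, 0
            for i in range(r - half, r + half + 1):
                for j in range(c - half, c + half + 1):
                    v = arr[i, j]
                    if not np.isnan(v):
                        total += v
                        count += 1
            out[r, c] = total / count if count > 0 else np.nan
    return out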
Since this is not something that is available in the python packages and because I saw the question asked several times during my research without satisfactory answers, here is how I solved the issue.
Provided is a test version of my function that I'm off to tidy up. I am sure there will be better ways to do the things I have done as I'm still fairly new to Python - please do recommend any appropriate changes.
Plots use autumn colourmap just because it allowed me to see the NaNs clearly.
My results:
IDL propagate
0.033369284 0.067915268 0.96602046 0.85623550
0.30435592 NaN NaN 100.00000
0.94065958 NaN NaN 0.90966976
0.018516513 0.044460904 0.051047217 NaN
python propagate
[[ 3.33692829e-02 6.79152655e-02 9.66020487e-01 8.56235492e-01]
[ 3.04355923e-01 nan nan 1.00000000e+02]
[ 9.40659566e-01 nan nan 9.09669768e-01]
[ 1.85165123e-02 4.44609040e-02 5.10472165e-02 nan]]
IDL replace
0.033369284 0.067915268 0.96602046 0.85623550
0.30435592 0.47452110 14.829881 100.00000
0.94065958 0.33833817 17.002417 0.90966976
0.018516513 0.044460904 0.051047217 NaN
python replace
[[ 3.33692829e-02 6.79152655e-02 9.66020487e-01 8.56235492e-01]
[ 3.04355923e-01 4.74521092e-01 1.48298812e+01 1.00000000e+02]
[ 9.40659566e-01 3.38338177e-01 1.70024175e+01 9.09669768e-01]
[ 1.85165123e-02 4.44609040e-02 5.10472165e-02 nan]]
My function:
#!/usr/bin/env python
# smooth.py

__version__ = 0.1
# Version 0.1 29 Feb 2016 ELH Test release

import numpy as np
import matplotlib.pyplot as mp

def Smooth(v1, w, nanopt):
    # v1 is the input 2D numpy array.
    # w is the width of the square window along one dimension.
    # nanopt can be 'replace' or 'propagate'.
    '''
    Example input:
    v1 = np.array(
        [[3.33692829e-02, 6.79152655e-02, 9.66020487e-01, 8.56235492e-01],
         [3.04355923e-01, np.nan        , 4.86013025e-01, 1.00000000e+02],
         [9.40659566e-01, 5.23314093e-01, np.nan        , 9.09669768e-01],
         [1.85165123e-02, 4.44609040e-02, 5.10472165e-02, np.nan        ]])
    w = 2
    '''

    mp.imshow(v1, interpolation='none', cmap='autumn')
    mp.show()

    # Make a copy of the array for the output:
    vout = np.copy(v1)

    # If w is even, add one so the window has a centre pixel.
    if w % 2 == 0:
        w = w + 1

    # Get the size of each dimension of the input:
    r, c = v1.shape

    # Assume that w, the width of the window, is always square.
    # Integer division keeps the indices valid in both Python 2 and 3.
    startrc = (w - 1) // 2
    stopr = r - ((w + 1) // 2) + 1
    stopc = c - ((w + 1) // 2) + 1

    # For all pixels within the border defined by the box size, calculate
    # the average in the window. There are two options:
    #   'replace'   - ignore NaNs and replace the value where possible.
    #   'propagate' - propagate the NaNs.
    for col in range(startrc, stopc):
        # Calculate the window start and stop columns
        startwc = col - (w // 2)
        stopwc = col + (w // 2) + 1
        for row in range(startrc, stopr):
            # Calculate the window start and stop rows
            startwr = row - (w // 2)
            stopwr = row + (w // 2) + 1
            # Extract the window
            window = v1[startwr:stopwr, startwc:stopwc]
            if nanopt == 'replace':
                # If we are replacing NaNs, select only the finite elements
                window = window[np.isfinite(window)]
            # Calculate the mean of the window
            vout[row, col] = np.mean(window)

    mp.imshow(vout, interpolation='none', cmap='autumn')
    mp.show()

    return vout