TypeError when attempting stacking - python-2.7

Would be grateful for some help here... I am trying to implement stacking, but with the code below I keep getting TypeError: object of type 'generator' has no len(). Would anyone know how to correct this? Many thanks.
y_lr = clf_lr.predict(X_test) # Linear
y_rf = forest.predict(X_test) # Random Forest
y_gb = clf_gb.predict(X_test) # Gradient Boosting
dtest = xgb.DMatrix(X_test)
y_xgb = clf_xgb.predict(dtest) # XGBoost
y_nn = model_nn.predict(X_test) # DNN
stack = pd.DataFrame(data={'lr':y_lr, 'gb':y_gb, 'xgb':y_xgb, 'nn':y_nn, 'true':y_test})
This is what I get when I do just data = {'rf':y_rf, 'gb':y_gb, 'xgb':y_xgb, 'nn':y_nn, 'true':y_test} and print data:
{'gb': array([ 5176163.73806255, 6717797.72382604, 7079943.66873864, ...,
12224999.12632363, 6632903.39968627, 7314008.41080324]),
'nn': <generator object _as_iterable at 0x7f535ca0a780>,
'rf': array([ 3525000. , 6713017.2, 5577500. , ..., 11708300. ,
6255000. , 6290000. ]),
'true': 1715      5200000.0
17126     6796548.0
28143     7300000.0
10037    12581315.0
16133     7500000.0
...
25682    15250000.0
8988      1000000.0
24637     7700000.0
Name: price_doc, dtype: float64,
'xgb': array([ 4634399. , 5984703. , 6499839.5, ..., 12502588. ,
6457020.5, 7572096. ], dtype=float32)}

After trying different iterations, this is the code that worked!
y_lr = list(clf_lr.predict(X_test)) # Linear
y_rf = list(forest.predict(X_test)) # Random Forest
y_gb = list(clf_gb.predict(X_test)) # Gradient Boosting
dtest = xgb.DMatrix(X_test)
y_xgb = list(clf_xgb.predict(dtest)) # XGBoost
y_nn = list(model_nn.predict(X_test)) # DNN
stack = pd.DataFrame({'lr':y_lr, 'gb':y_gb, 'xgb':y_xgb,'nn':y_nn, 'true':y_test})
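The reason this works: pandas needs the length of each column up front, and the 'nn' predictions came back as a generator (see the <generator object _as_iterable ...> entry in the printed dict above), which has no len(); list() materializes it. A minimal illustration, independent of the models above:
def fake_predict():  # stand-in for a predict() that yields lazily
    for x in (1.0, 2.0, 3.0):
        yield x

preds = fake_predict()
# len(preds)         # would raise TypeError: object of type 'generator' has no len()
preds = list(preds)  # materialize the generator
print len(preds)     # 3 -- now it works as a DataFrame column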

For sklearn-compatible estimators you can use StackingClassifier (or StackingRegressor, which matches the regression problem in your example) to make this job easier. It will merge the outputs of each model into a single dataset and then pass these on to a final meta-estimator (which is "stacked" on top of the other models).
As an alternative, you could try using a library called skdag (disclaimer: I am the author), which lets you compose your classifiers in any kind of workflow, including the stacking architecture you described in your example:
from skdag import DAGBuilder
from sklearn.linear_model import LinearRegression

stack = (
    DAGBuilder()
    .add_step("pass", "passthrough")
    .add_step("lr", clf_lr, deps=["pass"])
    .add_step("rf", forest, deps=["pass"])
    .add_step("gb", clf_gb, deps=["pass"])
    .add_step("xgb", clf_xgb, deps=["pass"])
    .add_step("meta", LinearRegression(), deps={
        "lr": [1], "rf": [1], "gb": [1], "xgb": [1]
    })
    .make_dag()
)
stack.fit(X_train, y_train)
stack.predict(X_test)
You can read more about this in the docs for skdag.
Both the StackingClassifier and the DAG are slightly different from your example in that they use predict_proba rather than predict as the inputs for your final meta-estimator, but maybe this is what you want anyway? A call to predict will apply thresholds and drop lots of valuable information.
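For completeness, a minimal sketch of the scikit-learn route for a regression target like yours (this assumes scikit-learn >= 0.22; the XGBoost and neural-network models would need sklearn-compatible wrappers, e.g. xgboost.XGBRegressor, to take part):
from sklearn.ensemble import StackingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

stack = StackingRegressor(
    estimators=[
        ('lr', LinearRegression()),
        ('rf', RandomForestRegressor()),
        ('gb', GradientBoostingRegressor()),
    ],
    final_estimator=LinearRegression(),  # the meta-estimator on top
)
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)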

Related

Unable to get the numerical result for a tanh function in Sympy

I am a Python newbie. I am trying to get the numerical results of the convection formula below, but the best code I've come up with outputs a symbolic list still containing the 'Lc' parameter, not the expected numerical result. Could anyone give me a helping hand, please?
from sympy import var, tanh, solve

def convection():
    m = 0.9
    Lc = var('Lc')
    rend = 0.8
    f = tanh(m*Lc)/(m*Lc) - rend
    return solve(f, [m, Lc, rend], positive=True)

# Got:      [(0.900000000000000, Lc, 1.11111111111111*tanh(0.9*Lc)/Lc)]
# Expected: [0.9, 0.986709867, 0.8] (or something like that)
Thank you in advance.
Your equation is:
In [33]: m = 0.9
In [34]: Lc = Symbol('Lc')
In [35]: rend = 0.8
In [36]: f = tanh(m*Lc)/(m*Lc)-rend
In [37]: f
Out[37]:
       1.11111111111111⋅tanh(0.9⋅Lc)
-0.8 + ─────────────────────────────
                    Lc
The solve function is intended to find analytic solutions but that is often impossible for a transcendental equation such as this.
You are also asking solve to solve for m and rend, which just confuses things. You should call it like this:
In [38]: solve(f, Lc)
---------------------------------------------------------------------------
NotImplementedError
...
NotImplementedError: multiple generators [Lc, exp(Lc/10)]
No algorithms are implemented to solve equation -4/5 + 10*(exp(9*Lc/10) - exp(-9*Lc/10))/(9*Lc*(exp(9*Lc/10) + exp(-9*Lc/10)))
This fails because the transcendental equation cannot be solved in explicit analytic form.
If what you want is a numeric solution instead, you can find it with nsolve:
In [41]: nsolve(f, Lc, 1)
Out[41]: 0.986683032622042
In [42]: nsolve(f, Lc, -1)
Out[42]: -0.986683032622042
Here we have to use an initial guess (e.g. 1 or -1) to seed the numeric solver but then we get a numeric answer.
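Putting it together, a minimal self-contained version that produces the three numbers the question expected (m and rend are fixed inputs, so only Lc needs solving):
from sympy import Symbol, tanh, nsolve

m = 0.9
rend = 0.8
Lc = Symbol('Lc')
f = tanh(m*Lc)/(m*Lc) - rend
Lc_value = nsolve(f, Lc, 1)  # initial guess 1 selects the positive root
print([m, float(Lc_value), rend])  # [0.9, 0.986683032622042, 0.8]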

How to build an inflation term structure in QuantLib?

This is what I've got, but I'm getting weird results. Can you spot an error?
#Zero Coupon Inflation Indexed Swap Data
zciisData = [(ql.Date(18,4,2020), 1.9948999881744385),
(ql.Date(18,4,2021), 1.9567999839782715),
(ql.Date(18,4,2022), 1.9566999673843384),
(ql.Date(18,4,2023), 1.9639999866485596),
(ql.Date(18,4,2024), 2.017400026321411),
(ql.Date(18,4,2025), 2.0074000358581543),
(ql.Date(18,4,2026), 2.0297999382019043),
(ql.Date(18,4,2027), 2.05430006980896),
(ql.Date(18,4,2028), 2.0873000621795654),
(ql.Date(18,4,2029), 2.1166999340057373),
(ql.Date(18,4,2031), 2.152100086212158),
(ql.Date(18,4,2034), 2.18179988861084),
(ql.Date(18,4,2039), 2.190999984741211),
(ql.Date(18,4,2044), 2.2016000747680664),
(ql.Date(18,4,2049), 2.193000078201294)]
def build_inflation_term_structure(calendar, observationDate):
    dayCounter = ql.ActualActual()
    yTS = build_yield_curve()
    lag = 3
    fixing_date = calendar.advance(observationDate, -lag, ql.Months)
    convention = ql.ModifiedFollowing
    cpiTS = ql.RelinkableZeroInflationTermStructureHandle()
    inflationIndex = ql.USCPI(False, cpiTS)
    # last observed CPI level
    fixing_rate = 252.0
    baseZeroRate = 1.8
    inflationIndex.addFixing(fixing_date, fixing_rate)
    observationLag = ql.Period(lag, ql.Months)
    zeroSwapHelpers = []
    for date, rate in zciisData:
        nextZeroSwapHelper = ql.ZeroCouponInflationSwapHelper(
            rate / 100, observationLag, date, calendar,
            convention, dayCounter, inflationIndex)
        zeroSwapHelpers = zeroSwapHelpers + [nextZeroSwapHelper]
    # the derived inflation curve
    derived_inflation_curve = ql.PiecewiseZeroInflation(
        observationDate, calendar, dayCounter, observationLag,
        inflationIndex.frequency(), inflationIndex.interpolated(),
        baseZeroRate, yTS, zeroSwapHelpers,
        1.0e-12, ql.Linear())
    cpiTS.linkTo(derived_inflation_curve)
    return inflationIndex, derived_inflation_curve, cpiTS, yTS

observation_date = ql.Date(17, 4, 2019)
calendar = ql.UnitedStates()
inflationIndex, derived_inflation_curve, cpiTS, yTS = build_inflation_term_structure(calendar, observation_date)
If I plot the inflationIndex zero rates, I get this: [plot not reproduced here]
I've been looking at the same problem, and first of all I don't think you need ZeroCouponInflationSwapHelper at all.
For what it's worth, I've exactly replicated OpenRiskEngine (ORE)'s inflation curve, which is itself built on QuantLib, and the idea is quite simple:
1. compute the base value I(0), which is the lagged CPI (presumably your value 252.0),
2. compute CPI(t) = I(T) = I(0)*(1+quote)^T, where T is the ZCIS expiry date and t = T - 3M.
I'm not sure where you took I(0) from, but since a lag is present, I(0) is not the latest available CPI value. Instead, I(0) = I(2019-04-17) is an interpolated value between CPI(2019-01) and CPI(2019-02).
Also, to build the CPI curve you don't need an interest rate yield curve at all, because the ZCIS exchanges a single cash flow at maturity T: the 'floating' I(T)/I(0) - 1 against the 'fixed' (1+quote)^T - 1. If you equate these, you can back out the 'fair' I(T), which is what I used above.
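Spelling that out: equating the two legs gives I(T)/I(0) - 1 = (1 + quote)^T - 1, hence I(T) = I(0) * (1 + quote)^T, which is exactly what the loop below computes.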
Assuming your value I(0)=252.0 is correct, your CPI curve would look like this:
import QuantLib as ql
import pandas as pd
fixing_rate = 252.0
observation_date = ql.Date(17, 4, 2019)
zciisData = [(ql.Date(18,4,2020), 1.9948999881744385),
(ql.Date(18,4,2021), 1.9567999839782715),
(ql.Date(18,4,2022), 1.9566999673843384),
(ql.Date(18,4,2023), 1.9639999866485596),
(ql.Date(18,4,2024), 2.017400026321411),
(ql.Date(18,4,2025), 2.0074000358581543),
(ql.Date(18,4,2026), 2.0297999382019043),
(ql.Date(18,4,2027), 2.05430006980896),
(ql.Date(18,4,2028), 2.0873000621795654),
(ql.Date(18,4,2029), 2.1166999340057373),
(ql.Date(18,4,2031), 2.152100086212158),
(ql.Date(18,4,2034), 2.18179988861084),
(ql.Date(18,4,2039), 2.190999984741211),
(ql.Date(18,4,2044), 2.2016000747680664),
(ql.Date(18,4,2049), 2.193000078201294)]
fixing_dates = []
CPI_computed = []
for tenor, quote in zciisData:
    fixing_dates.append(tenor - ql.Period('3M'))  # this is the 'fixing date' t
    pay_date = ql.ActualActual().yearFraction(observation_date, tenor)  # year fraction of the 'pay date' T
    CPI_computed.append(fixing_rate * (1 + quote/100)**pay_date)

results = pd.DataFrame({'date': pd.Series(fixing_dates).apply(ql.Date.to_date), 'CPI': CPI_computed})
display(results)
results.set_index('date')['CPI'].plot();

Tensorflow: list of tuples as placeholder

I want to use compute_gradients to generate local gradients. These gradients are to be averaged with the local gradients from other machines, after which apply_gradients will be called. I am using two session.run calls, with a feed_dict in the second one that feeds in the gradients. Since apply_gradients expects a list of tuples, I am looking for an efficient way to do this.
This is how I am generating the list-of-tuples placeholder:
grads = cifar10.train_part1(loss, global_step)
xx = [tf.placeholder(tf.float32, shape=grads[0][0].shape) for i in range(10)]
yy = [tf.placeholder(tf.float32, shape=grads[0][0].shape) for i in range(10)]
xyz = zip(xx,yy)
train_op = cifar10.train_part2(loss,global_step, xyz)
I get the following error:
NotImplementedError: ('Trying to optimize unsupported type ', <tf.Tensor 'Placeholder_10:0' shape=(5, 5, 3, 64) dtype=float32>)
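The error itself points at the cause: apply_gradients optimizes the variable in each (gradient, variable) pair, and zip(xx, yy) pairs a placeholder with another placeholder, which the optimizer cannot update. A sketch of the usual TF1-style pattern (opt, loss, global_step, sess and the averaging step stand in for your own code):
opt = tf.train.GradientDescentOptimizer(0.1)  # stand-in optimizer
grads_and_vars = opt.compute_gradients(loss)  # [(gradient, variable), ...]

# Pair each feed placeholder with the real variable, not with another
# placeholder -- apply_gradients can only update tf.Variable objects.
grad_placeholders = [(tf.placeholder(tf.float32, shape=v.get_shape()), v)
                     for g, v in grads_and_vars]
apply_op = opt.apply_gradients(grad_placeholders, global_step=global_step)

# First run: compute the local gradients.
# local_grads = sess.run([g for g, _ in grads_and_vars])
# ... average local_grads with the other machines' gradients ...
# Second run: feed the averaged gradients back in.
# feed = {p: g for (p, _), g in zip(grad_placeholders, averaged_grads)}
# sess.run(apply_op, feed_dict=feed)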

How to extract labels from a Binary Image in SimpleITK in python

I would like to extract the labels from the 2D Binary image I get using the following code:
image2DThresh = sitk.Threshold(image2D, lower=stats.GetMinimum(), upper=127.500)
cca = sitk.ConnectedComponentImageFilter()
cca_image = cca.Execute(image2DThresh)  # run CCA on the binarized 2D slice
# Get the shape statistics of the labels using
labelStats = sitk.LabelShapeStatisticsImageFilter()
The basic idea is to find the mean intensity, the area of the ROI, and the min/max indexes of each label in the main image. What I am trying to do is binarize the image with the threshold filter, then run CCA on it to get all the labels. Then I use LabelShapeStatisticsImageFilter() to get the physical attributes of every label (except label 0, of course) and check whether each label meets my conditions. The problem is that I am not able to get the average intensity of the main image where the label is. That is why I suggest using LabelIntensityStatisticsImageFilter, which, however, isn't available in my setup (Python 2.7, SimpleITK 0.10).
The two filters you may be interested in are the "LabelStatisticsImageFilter" and the "LabelIntensityStatisticsImageFilter". These are both available in SimpleITK 0.10; if not, you have a distribution problem. Both filters compute the mean, but the latter also computes a bounding box and many more advanced statistics.
Usage would go something like this:
In [1]: import SimpleITK as sitk
In [2]: print sitk.Version()
SimpleITK Version: 0.10.0 (ITK 4.10)
Compiled: Aug 16 2016 17:21:32
In [3]: img = sitk.ReadImage("cthead1.png")
In [4]: cc = sitk.ConnectedComponent(img>100)
In [5]: stats = sitk.LabelIntensityStatisticsImageFilter()
In [6]: stats.Execute(cc,img)
Out[6]: <SimpleITK.SimpleITK.Image; proxy of <Swig Object of type 'std::vector< itk::simple::Image >::value_type *' at 0x2a6b540> >
In [7]: for l in stats.GetLabels():
...: print("Label: {0} -> Mean: {1} Size: {2}".format(l, stats.GetMean(l), stats.GetPhysicalSize(l)))
...:
Label: 1 -> Mean: 157.494210868 Size: 3643.8348071
Label: 2 -> Mean: 151.347826087 Size: 2.86239969136
Label: 3 -> Mean: 123.75 Size: 0.497808641975
Label: 4 -> Mean: 106.0 Size: 0.248904320988
Label: 5 -> Mean: 104.0 Size: 0.124452160494
Label: 6 -> Mean: 106.0 Size: 0.124452160494
Label: 7 -> Mean: 103.0 Size: 0.124452160494
Label: 8 -> Mean: 121.5 Size: 1.49342592593
Label: 9 -> Mean: 106.0 Size: 0.124452160494
Instead of printing, you could build lists of labels to preserve or to relabel to 0 (i.e. erase). The ChangeLabelImageFilter can then be used to apply this change to the label image.
The combination of thresholding, statistics, and label selection is a powerful segmentation approach which can be used and customized for many tasks. It also serves as a starting point for more complicated methods.
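For instance, a small sketch of that relabelling step, assuming cc and stats are the connected-component image and the executed statistics filter from the session above:
# Erase every label smaller than 1.0 (in physical units) by mapping it to 0.
change_map = {l: 0 for l in stats.GetLabels()
              if stats.GetPhysicalSize(l) < 1.0}
cc_cleaned = sitk.ChangeLabel(cc, changeMap=change_map)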
So I solved the problem using numpy. I'm posting the code; maybe it helps someone else in the future!
import numpy as np
import matplotlib.pyplot as plt
import SimpleITK as sitk

def get_label(ccaimage, label, image2D):
    # labelImage is the mask for a particular label
    labelImage = sitk.Threshold(ccaimage, lower=label, upper=label)
    # sitk_show(labelImage)

    # get the images as arrays
    labelImageArray = sitk.GetArrayFromImage(labelImage)
    image2Darray = sitk.GetArrayFromImage(image2D)

    # ROI_1 marks the minimum of the original image where the mask equals the label
    ROI_1 = image2Darray == np.min(image2Darray[labelImageArray == label])
    plt.imshow(ROI_1)
    plt.show()

    # ROI_2 is the mask image
    ROI_2 = labelImageArray == label
    plt.imshow(ROI_2)
    plt.show()

    # AND keeps only those pixels which satisfy both conditions.
    ROI = np.logical_and(image2Darray == np.min(image2Darray[labelImageArray == label]),
                         labelImageArray == label)
    avg = np.mean(image2Darray[labelImageArray == label])  # mean intensity under the label
    print np.min(image2Darray[labelImageArray == label])
    print np.where(ROI)
    plt.imshow(ROI)
    plt.show()
    return avg

How to restore variables using CheckpointReader in Tensorflow

I'm trying to restore some variables from a checkpoint file if the same variable name exists in the current model.
I found that there is a way to do this, as shown in the TensorFlow GitHub repository.
So what I want to do is check the variable names in the checkpoint file using has_tensor("variable.name"), as below:
...
reader = tf.train.NewCheckpointReader(ckpt_path)
for v in tf.trainable_variables():
    print v.name
    if reader.has_tensor(v.name):
        print 'has tensor'
...
But I found that v.name returns the variable name plus a colon and a number. For example, for variables named W_o and b_o, v.name returns W_o:0 and b_o:0.
However, reader.has_tensor() requires the name without the colon and number, i.e. W_o and b_o.
My question is: how do I remove the colon and the number at the end of the variable name in order to read the variables?
Is there a better way to restore such variables?
You could use string.split() to get the tensor name:
...
reader = tf.train.NewCheckpointReader(ckpt_path)
for v in tf.trainable_variables():
    tensor_name = v.name.split(':')[0]
    print tensor_name
    if reader.has_tensor(tensor_name):
        print 'has tensor'
...
Next, let me use an example to show how I would restore every possible variable from a .ckpt file. First, let's save v2 and v3 in tmp.ckpt:
import tensorflow as tf

v1 = tf.Variable(tf.ones([1]), name='v1')
v2 = tf.Variable(2 * tf.ones([1]), name='v2')
v3 = tf.Variable(3 * tf.ones([1]), name='v3')
saver = tf.train.Saver({'v2': v2, 'v3': v3})

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    saver.save(sess, 'tmp.ckpt')
Here is how I would restore every variable (belonging to a new graph) that shows up in tmp.ckpt:
with tf.Graph().as_default():
    assert len(tf.trainable_variables()) == 0
    v1 = tf.Variable(tf.zeros([1]), name='v1')
    v2 = tf.Variable(tf.zeros([1]), name='v2')

    reader = tf.train.NewCheckpointReader('tmp.ckpt')
    restore_dict = dict()
    for v in tf.trainable_variables():
        tensor_name = v.name.split(':')[0]
        if reader.has_tensor(tensor_name):
            print('has tensor ', tensor_name)
            restore_dict[tensor_name] = v

    saver = tf.train.Saver(restore_dict)
    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        saver.restore(sess, 'tmp.ckpt')
        print(sess.run([v1, v2]))  # prints [array([ 0.], dtype=float32), array([ 2.], dtype=float32)]
Also, you may want to ensure that shapes and dtypes match.
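A sketch of that check, reusing the reader and restore_dict from above together with get_variable_to_shape_map() (described below):
saved_shapes = reader.get_variable_to_shape_map()
for v in tf.trainable_variables():
    tensor_name = v.name.split(':')[0]
    # only restore variables whose saved shape matches the graph's shape
    if (tensor_name in saved_shapes and
            v.get_shape().as_list() == saved_shapes[tensor_name]):
        restore_dict[tensor_name] = v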
tf.train.NewCheckpointReader is a nifty method that creates a CheckpointReader object. CheckpointReader has several very useful methods. The method most relevant to your question is get_variable_to_shape_map().
get_variable_to_shape_map() provides a dictionary with variable names and shapes:
saved_shapes = reader.get_variable_to_shape_map()
print 'fire9/squeeze1x1/kernels:', saved_shapes['fire9/squeeze1x1/kernels']
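For example, to dump every variable name and shape stored in a checkpoint (ckpt_path as in the snippets above):
reader = tf.train.NewCheckpointReader(ckpt_path)
for name, shape in reader.get_variable_to_shape_map().items():
    print name, shape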
Please take a look at this quick tutorial below:
Loading Variables from Existing Checkpoints
Simple answer:
reader = tf.train.NewCheckpointReader(checkpoint_file)
variable1 = reader.get_tensor('layer_name1/layer_type_name')
variable2 = reader.get_tensor('layer_name2/layer_type_name')
Now, after modifying these variables, you can assign them back.
layer_name1_var.set_weights([variable1, variable2])