feed pre-computed estimates to TfidfVectorizer - python-2.7

I trained an instance of scikit-learn's TfidfVectorizer and I want to persist it to disk. I saved the IDF matrix (the idf_ attribute) to disk as a numpy array and I saved the vocabulary (vocabulary_) to disk as a JSON object (I'm avoiding pickle, for security and other reasons). I'm trying to do this:
import json
from idf import idf # numpy array with the pre-computed IDFs
from sklearn.feature_extraction.text import TfidfVectorizer
# dirty trick so I can plug my pre-computed IDFs
# necessary because "vectorizer.idf_ = idf" doesn't work,
# it returns "AttributeError: can't set attribute."
class MyVectorizer(TfidfVectorizer):
    TfidfVectorizer.idf_ = idf
# instantiate vectorizer
vectorizer = MyVectorizer(lowercase = False,
                          min_df = 2,
                          norm = 'l2',
                          smooth_idf = True)
# plug vocabulary
vocabulary = json.load(open('vocabulary.json', mode = 'rb'))
vectorizer.vocabulary_ = vocabulary
# test it
vectorizer.transform(['foo bar'])
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1314, in transform
return self._tfidf.transform(X, copy=False)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1014, in transform
check_is_fitted(self, '_idf_diag', 'idf vector is not fitted')
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.py", line 627, in check_is_fitted
raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.utils.validation.NotFittedError: idf vector is not fitted
So, what am I doing wrong? I'm failing to fool the vectorizer object: somehow it knows that I'm cheating (i.e., passing it pre-computed data and not training it with actual text). I inspected the attributes of the vectorizer object but I can't find anything like 'istrained', 'isfitted', etc. So, how do I fool the vectorizer?

OK, I think I got it: the vectorizer instance has an attribute _tfidf, which in turn must have an attribute _idf_diag. The transform method calls a check_is_fitted function that checks whether _idf_diag exists. (I had missed it because it's an attribute of an attribute.) So I inspected the TfidfVectorizer source code to see how _idf_diag is created. Then I just added it to the _tfidf attribute:
import scipy.sparse as sp
# ... code ...
vectorizer._tfidf._idf_diag = sp.spdiags(idf,
                                         diags = 0,
                                         m = len(idf),
                                         n = len(idf))
And now the vectorization works.
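To see why the missing piece is a diagonal matrix: multiplying the term-count matrix by a diagonal matrix built from idf simply scales column j by idf[j], and norm = 'l2' then rescales every row to unit length. Below is a minimal pure-Python sketch of that arithmetic. The counts and IDF values are made up for illustration, and tfidf_rows is a hypothetical helper, not part of scikit-learn.

```python
import math

def tfidf_rows(counts, idf):
    """Scale each column j by idf[j], then L2-normalise every row."""
    if any(len(row) != len(idf) for row in counts):
        raise ValueError("each row needs one count per IDF entry")
    out = []
    for row in counts:
        scaled = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in scaled))
        # Leave all-zero rows untouched to avoid dividing by zero.
        out.append([v / norm for v in scaled] if norm else scaled)
    return out

# Hypothetical data: 2 documents, 3 vocabulary terms.
counts = [[1, 0, 2],
          [0, 3, 0]]
idf = [1.0, 2.0, 0.5]

rows = tfidf_rows(counts, idf)
```

The length check mirrors the constraint that the IDF vector needs one entry per vocabulary term, which is why m and n are both len(idf) in the spdiags call.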

Related

Saving data from traceplot in PyMC3

Below is the code for a simple Bayesian Linear regression. After I obtain the trace and the plots for the parameters, is there any way in which I can save the data that created the plots in a file so that if I need to plot it again I can simply plot it from the data in the file rather than running the whole simulation again?
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0,9,5)
y = 2*x + 5
yerr=np.random.rand(len(x))
def soln(x, p1, p2):
    return p1+p2*x
with pm.Model() as model:
    # Define priors
    intercept = pm.Normal('Intercept', 15, sd=5)
    slope = pm.Normal('Slope', 20, sd=5)
    # Model solution
    sol = soln(x, intercept, slope)
    # Define likelihood
    likelihood = pm.Normal('Y', mu=sol,
                           sd=yerr, observed=y)
    # Sampling
    trace = pm.sample(1000, nchains = 1)
pm.traceplot(trace)
print pm.summary(trace, ['Slope'])
print pm.summary(trace, ['Intercept'])
plt.show()
There are two easy ways of doing this:
Use a version after 3.4.1 (currently this means installing from master, with pip install git+https://github.com/pymc-devs/pymc3). There is a new feature that allows saving and loading traces efficiently. Note that you need access to the model that created the trace:
...
pm.save_trace(trace, 'linreg.trace')
# later
with model:
    trace = pm.load_trace('linreg.trace')
Use cPickle (or pickle in Python 3). Note that pickle is at least a little insecure, so don't unpickle data from untrusted sources:
import cPickle as pickle # just `import pickle` on python 3
...
with open('trace.pkl', 'wb') as buff:
    pickle.dump(trace, buff)
#later
with open('trace.pkl', 'rb') as buff:
    trace = pickle.load(buff)
Update for someone like me who is still coming over to this question:
The load_trace and save_trace functions were removed. Since version 4.0, even the deprecation warning for these functions has been removed.
The way to do it is now to use arviz:
with model:
    trace = pymc.sample(return_inferencedata=True)
trace.to_netcdf("filename.nc")
And it can be loaded with:
trace = arviz.from_netcdf("filename.nc")
This way works for me :
# saving trace
pm.save_trace(trace=trace_nb, directory=r"c:\Users\xxx\Documents\xxx\traces\trace_nb")
# loading saved traces
with model_nb:
    t_nb = pm.load_trace(directory=r"c:\Users\xxx\Documents\xxx\traces\trace_nb")

Type error: 'function' object is not subscriptable with tensor flow

I'm trying to execute the code from https://github.com/lucfra/RFHO, more specifically from RFHO starting example.ipynb. The only thing I want to change is doing it in forward mode instead of reverse mode. So this is the changed code:
import tensorflow as tf
import rfho as rf
from rfho.datasets import load_mnist
mnist = load_mnist(partitions=(.05, .01)) # 5% of data in training set, 1% in validation
# remaining in test set (change these percentages and see the effect on regularization hyperparameter)
x, y = tf.placeholder(tf.float32, name='x'), tf.placeholder(tf.float32, name='y')
# define the model (here use a linear model from rfho.models)
model = rf.LinearModel(x, mnist.train.dim_data, mnist.train.dim_target)
# vectorize the model, and build the state vector (augment by 1 since we are
# going to optimize the weights with momentum)
s, out, w_matrix = rf.vectorize_model(model.var_list, model.inp[-1], model.Ws[0],
                                      augment=0)
# (this function will print also some tensorflow infos and warnings about variables
# collections... we'll solve this)
# define error
error = tf.reduce_mean(rf.cross_entropy_loss(labels=y, logits=out), name='error')
constraints = []
# define training error by error + L2 weights penalty
rho = tf.Variable(0., name='rho') # regularization hyperparameter
training_error = error + rho*tf.reduce_sum(tf.pow(w_matrix, 2))
constraints.append(rf.positivity(rho)) # regularization coefficient should be positive
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(out, 1), tf.argmax(y, 1)),
                                  "float"), name='accuracy')
# define learning rates and momentum factor as variables, to be optimized
eta = tf.Variable(.01, name='eta')
#mu = tf.Variable(.5, name='mu')
# now define the training dynamics (similar to tf.train.Optimizer)
optimizer = rf.GradientDescentOptimizer.create(s, eta, loss=training_error)
# add constraints for learning rate and momentum factor
constraints += optimizer.get_natural_hyperparameter_constraints()
# we want to optimize the weights w.r.t. training_error
# and hyperparameters w.r.t. validation error (that in this case is
# error evaluated on the validation set)
# we are going to use ReverseMode
hyper_dict = {error: [rho, eta]}
hyper_opt = rf.HyperOptimizer(optimizer, hyper_dict, method=rf.ForwardHG)
# define helper for stochastic descent
ev_data = rf.ExampleVisiting(mnist.train, batch_size=2**8, epochs=200)
tr_suppl = ev_data.create_supplier(x, y)
val_supplier = mnist.validation.create_supplier(x, y)
test_supplier = mnist.test.create_supplier(x, y)
# Run all for some hyper-iterations and print progresses
def run(hyper_iterations):
    with tf.Session().as_default() as ss:
        ev_data.generate_visiting_scheme() # needed for remembering the example visited in forward pass
        for hyper_step in range(hyper_iterations):
            hyper_opt.initialize() # initializes all variables or reset weights to initial state
            hyper_opt.run(ev_data.T, train_feed_dict_supplier=tr_suppl,
                          val_feed_dict_suppliers=val_supplier,
                          hyper_constraints_ops=constraints)
            #
            # print('Concluded hyper-iteration', hyper_step)
            # print('Test accuracy:', ss.run(accuracy, feed_dict=test_supplier()))
            # print('Validation error:', ss.run(error, feed_dict=val_supplier()))
saver = rf.Saver('Staring example', collect_data=False)
with saver.record(rf.Records.tensors('error', fd=('x', 'y', mnist.validation), rec_name='valid'),
                  rf.Records.tensors('error', fd=('x', 'y', mnist.test), rec_name='test'),
                  rf.Records.tensors('accuracy', fd=('x', 'y', mnist.validation), rec_name='valid'),
                  rf.Records.tensors('accuracy', fd=('x', 'y', mnist.test), rec_name='test'),
                  rf.Records.hyperparameters(),
                  rf.Records.hypergradients(),
                  ): # a context to print some statistics.
    # If you execute again any cell containing the model construction,
    # restart the notebook or reset tensorflow graph in order to prevent errors
    # due to tensor namings
    run(20) # this will take some time... run it for less hyper-iterations for a quicker look
The problem is that I get TypeError: 'function' object is not subscriptable after the first iteration:
Traceback (most recent call last):
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydev_run_in_console.py", line 52, in run_file
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/Users/repierau/Documents/FSHO/RFHO-master/rfho/examples/simply_example.py", line 80, in <module>
run(20) # this will take some time... run it for less hyper-iterations for a quicker look
File "/Users/repierau/Documents/FSHO/RFHO-master/rfho/examples/simply_example.py", line 63, in run
hyper_constraints_ops=constraints)
File "/Users/repierau/Documents/FSHO/RFHO-master/rfho/save_and_load.py", line 624, in _saver_wrapped
res = f(*args, **kwargs)
File "/Users/repierau/Documents/FSHO/RFHO-master/rfho/hyper_gradients.py", line 689, in run
hyper_batch_step=self.hyper_batch_step.eval())
File "/Users/repierau/Documents/FSHO/RFHO-master/rfho/hyper_gradients.py", line 581, in run_all
return self.hyper_gradients(val_feed_dict_suppliers, hyper_batch_step)
File "/Users/repierau/Documents/FSHO/RFHO-master/rfho/hyper_gradients.py", line 551, in hyper_gradients
val_sup_lst.append(val_feed_dict_supplier[k])
TypeError: 'function' object is not subscriptable
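The last frame of the traceback indexes val_feed_dict_supplier[k], so that code path apparently expects a mapping rather than a bare callable. The sketch below reproduces the error in plain Python and shows the dict form. That the keys should be the validation-error tensors (something like {error: val_supplier}) is an assumption read off the traceback, not something confirmed against the RFHO documentation. Every name in the block is a stand-in.

```python
def val_supplier():
    """Stand-in for a feed-dict supplier: a plain zero-argument callable."""
    return {"x": [0.0], "y": [1.0]}

objective = "error"  # stand-in for the validation-error tensor used as key

def pick_supplier(suppliers, key):
    # Mirrors the failing line: the argument is indexed, not called.
    return suppliers[key]

# Passing the bare function fails in the same way as the traceback.
try:
    pick_supplier(val_supplier, objective)
    failed = False
except TypeError:
    failed = True

# Wrapping the function in a dict keyed by the objective works.
chosen = pick_supplier({objective: val_supplier}, objective)
feed = chosen()
```

If the assumption holds, the corresponding change in the script would be to pass a dict to val_feed_dict_suppliers instead of val_supplier itself.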

preprocessing error trying to use OneHot Encoder Python

I'm trying to run the code below in a virtual machine for a homework practice problem. I'm getting the error message below, and I'm trying to figure out if it's an issue with my code or the site. If anyone can point out whether it's an error in my code and how to fix it, I'd be grateful. If my code looks OK, then I'll let the course know they have a bug.
Code:
import numpy as np
import pandas as pd
# Load the dataset
X = pd.read_csv('titanic_data.csv')
# Limit to categorical data
X = X.select_dtypes(include=[object])
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# TODO: Create a LabelEncoder object, which will turn all labels present in
# in each feature to numbers.
# HINT: Use LabelEncoder()
df=pd.DataFrame(X)
le = preprocessing.labelEncoder()
# TODO: For each feature in X, apply the LabelEncoder's fit_transform
# function, which will first learn the labels for the feature (fit)
# and then change the labels to numbers (transform).
df2=df.apply(le.fit_transform)
#for feature in X:
# HINT: use fit_transform on X[feature] using the LabelEncoder() object
#X[feature] = label_encoder.fit_transform(X[feature])
# TODO: Create a OneHotEncoder object, which will create a feature for each
# label present in the data.
# HINT: Use OneHotEncoder()
ohe = preprocessing.OneHotEncoder()
# TODO: Apply the OneHotEncoder's fit_transform function to all of X, which will
# first learn of all the (now numerical) labels in the data (fit), and then
# change the data to one-hot encoded entries (transform).
# HINT: Use fit_transform on X using the OneHotEncoder() object
onehotlabels = enc.fit_transform(df2)
Error:
Traceback (most recent call last):
File "vm_main.py", line 33, in <module>
import main
File "/tmp/vmuser_zrkfroofmi/main.py", line 2, in <module>
import studentMain
File "/tmp/vmuser_zrkfroofmi/studentMain.py", line 3, in <module>
import OneHot
File "/tmp/vmuser_zrkfroofmi/OneHot.py", line 21, in <module>
le = preprocessing.labelEncoder()
NameError: name 'preprocessing' is not defined
Call OneHotEncoder without preprocessing. before the name, i.e. just do ohe = OneHotEncoder(). The same applies to the line in the traceback: use le = LabelEncoder() (note the capital L). The problem is in your import: what you have in your script would work only if you had done from sklearn import preprocessing.
The script uses from sklearn.preprocessing import OneHotEncoder, which binds only the name OneHotEncoder; the name preprocessing itself is never defined, and that caused the NameError. preprocessing is a module of sklearn and OneHotEncoder is one of the preprocessors it provides. So either import preprocessing as well (from sklearn import preprocessing) or call OneHotEncoder() directly.
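The difference between the two import forms can be checked without scikit-learn. This sketch uses the standard library's json module as a stand-in for sklearn.preprocessing and runs the statements in a fresh namespace, so that only the names created by each import are visible.

```python
namespace = {}

# Same shape as `from sklearn.preprocessing import OneHotEncoder`,
# with json.JSONDecoder standing in for the sklearn class.
exec("from json import JSONDecoder", namespace)
bound = sorted(n for n in namespace if not n.startswith("__"))

# Only the imported class is bound, so the module-qualified name fails.
try:
    exec("json.JSONDecoder()", namespace)
    qualified_ok = True
except NameError:
    qualified_ok = False

# Importing the module itself binds its name and makes the prefix usable.
exec("import json", namespace)
exec("decoder = json.JSONDecoder()", namespace)
```

After the first import, bound contains only the class name, which is the situation in the question: the prefix refers to a name that was never created.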

Keras:Vgg16 -- Error in `decode_predictions'

I am trying to perform an image classification task using a pre-trained VGG16 model in Keras. The code I wrote, following the instructions in the Keras application page, is:
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input, decode_predictions
import numpy as np
model = VGG16(weights='imagenet', include_top=True)
img_path = './train/cat.1.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
features = model.predict(x)
(inID, label) = decode_predictions(features)[0]
which is quite similar to the code shown in this question already asked in the forum. But in spite of having the include_top parameter as True, I am getting the following error:
Traceback (most recent call last):
File "vgg16-keras-classifier.py", line 14, in <module>
(inID, label) = decode_predictions(features)[0]
ValueError: too many values to unpack
Any help will be deeply appreciated! Thanks!
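It's because (according to the function definition, which can be found here) decode_predictions returns, for each input image, a list of (class_name, class_description, score) tuples, the top 5 by default. decode_predictions(features)[0] is therefore the whole list for the first image, and unpacking it into two names fails. This is why it complains that there are too many values to unpack. Take one entry and unpack all three fields, e.g. (inID, label, score) = decode_predictions(features)[0][0].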
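A small sketch of the unpacking, using a hand-written stand-in for the return value so that it runs without Keras. The class IDs, labels, and scores are invented for illustration only.

```python
# Made-up stand-in for decode_predictions(features): one inner list per
# input image, each holding (class_name, class_description, score) tuples.
decoded = [[
    ("n0000001", "tabby", 0.61),
    ("n0000002", "tiger_cat", 0.22),
    ("n0000003", "Egyptian_cat", 0.09),
]]

# decoded[0] is the whole candidate list for the first image, so
# unpacking it into two names fails with "too many values to unpack".
try:
    (inID, label) = decoded[0]
    unpacked = True
except ValueError:
    unpacked = False

# Take the top candidate and unpack all three fields instead.
inID, label, score = decoded[0][0]

# Or collect the description of every candidate for the first image.
labels = [description for _, description, _ in decoded[0]]
```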

Invalid parameter error when doing python scikit-learn grid-search method

I am trying to learn how to find the optimal hyperparameters for a decision tree classifier using the GridSearchCV() method from scikit-learn.
It works fine if I specify options for just one parameter, as in the following:
print(__doc__)
# Code source: Gael Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause
from sklearn import datasets
from sklearn.grid_search import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
# define classifier
dt = DecisionTreeClassifier()
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
# define parameter values that should be searched
min_samples_split_options = range(2, 4)
# create a parameter grid: map the parameter names to the values that should be saved
param_grid_dt = dict(min_samples_split= min_samples_split_options) # for DT
# instantiate the grid
grid = GridSearchCV(dt, param_grid_dt, cv=10, scoring='accuracy')
# fit the grid with param
grid.fit(X, y)
# view complete results
grid.grid_scores_
'''# examine results from first tuple
print grid.grid_scores_[0].parameters
print grid.grid_scores_[0].cv_validation_scores
print grid.grid_scores_[0].mean_validation_score'''
# examine the best model
print '*******Final results*********'
print grid.best_score_
print grid.best_params_
print grid.best_estimator_
Result:
None
*******Final results*********
0.68
{'min_samples_split': 3}
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=3, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
But when I add another parameter's options into the mix, it gives me an "invalid parameter" error, as follows:
print(__doc__)
# Code source: Gael Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause
from sklearn import datasets
from sklearn.grid_search import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
# define classifier
dt = DecisionTreeClassifier()
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
# define parameter values that should be searched
max_depth_options = range(10, 251) # for DT
min_samples_split_options = range(2, 4)
# create a parameter grid: map the parameter names to the values that should be saved
param_grid_dt = dict(max_depth=max_depth_options, min_sample_split=min_samples_split_options) # for DT
# instantiate the grid
grid = GridSearchCV(dt, param_grid_dt, cv=10, scoring='accuracy')
# fit the grid with param
grid.fit(X, y)
'''# view complete results
grid.grid_scores_
# examine results from first tuple
print grid.grid_scores_[0].parameters
print grid.grid_scores_[0].cv_validation_scores
print grid.grid_scores_[0].mean_validation_score
# examine the best model
print '*******Final results*********'
print grid.best_score_
print grid.best_params_
print grid.best_estimator_'''
Result:
None
Traceback (most recent call last):
File "C:\Users\KubiK\Desktop\GridSearch_ex6.py", line 31, in <module>
grid.fit(X, y)
File "C:\Users\KubiK\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 804, in fit
return self._fit(X, y, ParameterGrid(self.param_grid))
File "C:\Users\KubiK\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 553, in _fit
for parameters in parameter_iterable
File "C:\Users\KubiK\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 800, in __call__
while self.dispatch_one_batch(iterator):
File "C:\Users\KubiK\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 658, in dispatch_one_batch
self._dispatch(tasks)
File "C:\Users\KubiK\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 566, in _dispatch
job = ImmediateComputeBatch(batch)
File "C:\Users\KubiK\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 180, in __init__
self.results = batch()
File "C:\Users\KubiK\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "C:\Users\KubiK\Anaconda2\lib\site-packages\sklearn\cross_validation.py", line 1520, in _fit_and_score
estimator.set_params(**parameters)
File "C:\Users\KubiK\Anaconda2\lib\site-packages\sklearn\base.py", line 270, in set_params
(key, self.__class__.__name__))
ValueError: Invalid parameter min_sample_split for estimator DecisionTreeClassifier. Check the list of available parameters with `estimator.get_params().keys()`.
[Finished in 0.3s]
There's a typo in your code: it should be min_samples_split, not min_sample_split.
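Typos like this can be caught before fitting by comparing the grid's keys with the names the error message points to, estimator.get_params().keys(). The sketch below uses a short hard-coded list as a stand-in for that call so it runs without scikit-learn. unknown_keys is a hypothetical helper, not a scikit-learn function.

```python
import difflib

# Stand-in for estimator.get_params().keys(); only a few names are listed.
valid_params = ["max_depth", "min_samples_split", "min_samples_leaf"]

param_grid = {"max_depth": range(10, 251),
              "min_sample_split": range(2, 4)}  # typo kept on purpose

def unknown_keys(grid, valid):
    """Map each key that is not a valid parameter to its closest valid name."""
    suggestions = {}
    for key in grid:
        if key not in valid:
            match = difflib.get_close_matches(key, valid, n=1)
            suggestions[key] = match[0] if match else None
    return suggestions

problems = unknown_keys(param_grid, valid_params)
```

With a real estimator, valid_params would be list(dt.get_params().keys()), and an empty problems dict means every key in the grid is recognised.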