I'm trying to run the code below in a virtual machine for a homework practice problem. I'm getting the error message below, and I'm trying to figure out whether it's an issue with my code or with the site. If anyone can point out whether it's an error in my code and how to fix it, I'd be grateful. If my code looks OK, I'll let the course know they have a bug.
Code:
import numpy as np
import pandas as pd
# Load the dataset
X = pd.read_csv('titanic_data.csv')
# Limit to categorical data
X = X.select_dtypes(include=[object])
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# TODO: Create a LabelEncoder object, which will turn all labels present
# in each feature to numbers.
# HINT: Use LabelEncoder()
df=pd.DataFrame(X)
le = preprocessing.labelEncoder()
# TODO: For each feature in X, apply the LabelEncoder's fit_transform
# function, which will first learn the labels for the feature (fit)
# and then change the labels to numbers (transform).
df2=df.apply(le.fit_transform)
#for feature in X:
# HINT: use fit_transform on X[feature] using the LabelEncoder() object
#X[feature] = label_encoder.fit_transform(X[feature])
# TODO: Create a OneHotEncoder object, which will create a feature for each
# label present in the data.
# HINT: Use OneHotEncoder()
ohe = preprocessing.OneHotEncoder()
# TODO: Apply the OneHotEncoder's fit_transform function to all of X, which will
# first learn of all the (now numerical) labels in the data (fit), and then
# change the data to one-hot encoded entries (transform).
# HINT: Use fit_transform on X using the OneHotEncoder() object
onehotlabels = enc.fit_transform(df2)
Error:
Traceback (most recent call last):
  File "vm_main.py", line 33, in <module>
    import main
  File "/tmp/vmuser_zrkfroofmi/main.py", line 2, in <module>
    import studentMain
  File "/tmp/vmuser_zrkfroofmi/studentMain.py", line 3, in <module>
    import OneHot
  File "/tmp/vmuser_zrkfroofmi/OneHot.py", line 21, in <module>
    le = preprocessing.labelEncoder()
NameError: name 'preprocessing' is not defined
Call OneHotEncoder without preprocessing before the name, i.e. just ohe = OneHotEncoder(). The problem is in your imports: what you have in your script would work if you had done from sklearn import preprocessing.
The line from sklearn.preprocessing import OneHotEncoder imports the OneHotEncoder class itself, not the preprocessing package, so the name preprocessing is never defined; that is what causes the NameError. preprocessing is a package within sklearn, and OneHotEncoder is one of its preprocessors. Either import the package (from sklearn import preprocessing) and keep the prefix, or drop the prefix and use the classes you imported directly. Note the capitalization as well: the class is LabelEncoder, so even with the package imported, preprocessing.labelEncoder() would fail.
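For reference, here is what the corrected snippet could look like once the imports and names line up (a sketch, assuming the same titanic_data.csv file):
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Load the dataset and limit it to the categorical columns
X = pd.read_csv('titanic_data.csv')
X = X.select_dtypes(include=[object])

# LabelEncoder was imported directly, so no "preprocessing." prefix
# (and it must be LabelEncoder with a capital L, not labelEncoder)
le = LabelEncoder()
df2 = X.apply(le.fit_transform)

# Same for OneHotEncoder; also reuse the name it was bound to (ohe),
# rather than the undefined name enc
ohe = OneHotEncoder()
onehotlabels = ohe.fit_transform(df2)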
Related
Below is the code for a simple Bayesian linear regression. After I obtain the trace and the plots for the parameters, is there a way to save the data behind the plots to a file, so that if I need the plots again I can recreate them from the file rather than re-running the whole simulation?
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 9, 5)
y = 2*x + 5
yerr = np.random.rand(len(x))

def soln(x, p1, p2):
    return p1 + p2*x

with pm.Model() as model:
    # Define priors
    intercept = pm.Normal('Intercept', 15, sd=5)
    slope = pm.Normal('Slope', 20, sd=5)
    # Model solution
    sol = soln(x, intercept, slope)
    # Define likelihood
    likelihood = pm.Normal('Y', mu=sol, sd=yerr, observed=y)
    # Sampling
    trace = pm.sample(1000, chains=1)
    pm.traceplot(trace)
    print(pm.summary(trace, ['Slope']))
    print(pm.summary(trace, ['Intercept']))
    plt.show()
There are two easy ways of doing this:
Use a version after 3.4.1 (currently this means installing from master, with pip install git+https://github.com/pymc-devs/pymc3). There is a new feature that allows saving and loading traces efficiently. Note that you need access to the model that created the trace:
...
pm.save_trace(trace, 'linreg.trace')

# later
with model:
    trace = pm.load_trace('linreg.trace')
Use cPickle (or pickle in Python 3). Note that pickle is at least a little insecure; don't unpickle data from untrusted sources:
import cPickle as pickle  # just `import pickle` on Python 3

...
with open('trace.pkl', 'wb') as buff:
    pickle.dump(trace, buff)

# later
with open('trace.pkl', 'rb') as buff:
    trace = pickle.load(buff)
Update for anyone like me who still comes across this question: the load_trace and save_trace functions were removed, and since version 4.0 even the deprecation warning for them is gone.
The way to do it now is to use arviz:
import arviz  # used below to load the trace back
import pymc

with model:
    trace = pymc.sample(return_inferencedata=True)
trace.to_netcdf("filename.nc")
And it can be loaded with:
trace = arviz.from_netcdf("filename.nc")
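From there the trace can be re-plotted later without re-running the sampler, for example (a small sketch, assuming arviz and matplotlib are installed):
import arviz
import matplotlib.pyplot as plt

# reload the saved InferenceData and recreate the trace plots from it
trace = arviz.from_netcdf("filename.nc")
arviz.plot_trace(trace)
plt.show()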
This way works for me:
# saving trace
pm.save_trace(trace=trace_nb, directory=r"c:\Users\xxx\Documents\xxx\traces\trace_nb")

# loading saved traces
with model_nb:
    t_nb = pm.load_trace(directory=r"c:\Users\xxx\Documents\xxx\traces\trace_nb")
Following Pickle figures from matplotlib, I am trying to load a figure from a pickle. I am using the same code with the modifications that are suggested in the responses.
Saving script:
import numpy as np
import matplotlib.pyplot as plt
import pickle as pl
# Plot simple sinus function
fig_handle = plt.figure()
x = np.linspace(0,2*np.pi)
y = np.sin(x)
plt.plot(x,y)
# plt.show()
# Save figure handle to disk
pl.dump(fig_handle,file('sinus.pickle','wb'))
Loading script:
import matplotlib.pyplot as plt
import pickle as pl
import numpy as np
# Load figure from disk and display
fig_handle = pl.load(open('sinus.pickle', 'rb'))
fig_handle.show()
The saving script produces a file named "sinus.pickle", but the loading script does not show the anticipated figure. Any suggestions?
Python 2.7.13
matplotlib 2.0.0
numpy 1.12.1
P.S. Following a suggestion, I replaced fig_handle.show() with plt.show(), which produced an error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/matplotlib/backends/backend_macosx.py", line 109, in _set_device_scale
    self.figure.dpi = self.figure.dpi / self._device_scale * value
  File "/usr/local/lib/python2.7/site-packages/matplotlib/figure.py", line 416, in _set_dpi
    self.callbacks.process('dpi_changed', self)
  File "/usr/local/lib/python2.7/site-packages/matplotlib/cbook.py", line 546, in process
    if s in self.callbacks:
AttributeError: 'CallbackRegistry' object has no attribute 'callbacks'
What you call your "loading script" doesn't make any sense.
From the very link that you provided in your question, loading the pickled figure is as simple as:
# Load figure from disk and display
fig_handle2 = pl.load(open('sinus.pickle','rb'))
fig_handle2.show()
The final solution involved changing fig_handle.show() to plt.show(), and switching the backend to "TkAgg", based on advice given by ImportanceOfBeingErnest.
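Putting that together, a loading script along these lines displays the figure (a sketch; it forces the TkAgg backend before pyplot is imported, per the advice above):
import matplotlib
matplotlib.use('TkAgg')  # switch backends, as suggested
import matplotlib.pyplot as plt
import pickle as pl

# Load the figure; since it was originally created via pyplot,
# plt.show() can display it after unpickling
fig_handle = pl.load(open('sinus.pickle', 'rb'))
plt.show()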
I saw a post from a few days ago by someone else: pymc3 likelihood math with non-theano function. Even though I think the problem at its core is the same, I thought I would ask with a simpler example:
Inside logp_wrap, I put some made up definition of a likelihood function. It depends on the rv and an observation. In this case I could do this with theano operations, but let's say that I want this function to be more complex and so I cannot use theano.
The problem comes when I try to define the likelihood both in terms of an RV and observations. From what I have seen, this format would work if I was specifying everything in 'logp_wrap' as theano operations.
I have searched around for a solution to this, but haven't found anything where this problem is fully addressed.
The problem in my attempt to do this is actually that the logp_ function is correctly decorated, but the logp_wrap function is only correctly decorated for its input, and not for its output, so I get the error
TypeError: 'TensorVariable' object is not callable.
Would be great if someone had a solution - don't think I am the only one with this problem.
The theano version of this that works (and uses the same function-within-a-function definition) without the @as_op code is here: https://pymc-devs.github.io/pymc3/notebooks/lda-advi-aevb.html?highlight=densitydist (specifically the sections "Log-likelihood of documents for LDA" and "LDA model section").
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
"""
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pymc3 as pm
from theano import as_op
import theano.tensor as T
from scipy.stats import norm

# Some data that we observed
g_observed = [0.0, 1.0, 2.0, 3.0]

# Define a function to calculate the logp without using theano
# This as_op is where the problem is - the input is an rv but the output is a
# function.
@as_op(itypes=[T.dscalar], otypes=[T.dscalar])
def logp_wrap(rv):
    # We are not using theano so we wrap the function.
    @as_op(itypes=[T.dvector], otypes=[T.dscalar])
    def logp_(ob):
        # Some made up likelihood -
        # The key here is that lp depends on the rv input and the observations
        lp = np.log(norm.pdf(rv + ob))
        return lp
    return logp_

hb1_model = pm.Model()
with hb1_model:
    I_mean = pm.Normal('I_mean', mu=0.1, sd=0.05)
    xs = pm.DensityDist('x', logp_wrap(I_mean), observed=g_observed)

with hb1_model:
    step = pm.Metropolis()
    trace = pm.sample(1000, step)
I am trying to perform an image classification task using a pre-trained VGG16 model in Keras. The code I wrote, following the instructions in the Keras application page, is:
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input, decode_predictions
import numpy as np
model = VGG16(weights='imagenet', include_top=True)
img_path = './train/cat.1.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
features = model.predict(x)
(inID, label) = decode_predictions(features)[0]
which is quite similar to the code shown in this question already asked in the forum. But despite setting the include_top parameter to True, I am getting the following error:
Traceback (most recent call last):
  File "vgg16-keras-classifier.py", line 14, in <module>
    (inID, label) = decode_predictions(features)[0]
ValueError: too many values to unpack
Any help will be deeply appreciated! Thanks!
It's because (according to the function definition, which might be found here) decode_predictions returns, for each sample in the batch, a list of (class_name, class_description, score) tuples, not a single pair. That is why it complains that there are too many values to unpack.
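For example, iterating over those tuples instead of unpacking two values works (a small sketch reusing the features variable from the question):
# decode_predictions returns one list of (class_name, class_description, score)
# tuples per input sample; take the list for the first (and only) image
for class_name, class_description, score in decode_predictions(features)[0]:
    print(class_name, class_description, score)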
I trained an instance of scikit-learn's TfidfVectorizer and I want to persist it to disk. I saved the IDF matrix (the idf_ attribute) to disk as a numpy array and I saved the vocabulary (vocabulary_) to disk as a JSON object (I'm avoiding pickle, for security and other reasons). I'm trying to do this:
import json
from idf import idf  # numpy array with the pre-computed IDFs
from sklearn.feature_extraction.text import TfidfVectorizer

# dirty trick so I can plug my pre-computed IDFs
# necessary because "vectorizer.idf_ = idf" doesn't work,
# it returns "AttributeError: can't set attribute."
class MyVectorizer(TfidfVectorizer):
    TfidfVectorizer.idf_ = idf

# instantiate vectorizer
vectorizer = MyVectorizer(lowercase=False,
                          min_df=2,
                          norm='l2',
                          smooth_idf=True)

# plug vocabulary
vocabulary = json.load(open('vocabulary.json', mode='rb'))
vectorizer.vocabulary_ = vocabulary

# test it
vectorizer.transform(['foo bar'])
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1314, in transform
    return self._tfidf.transform(X, copy=False)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1014, in transform
    check_is_fitted(self, '_idf_diag', 'idf vector is not fitted')
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.py", line 627, in check_is_fitted
    raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.utils.validation.NotFittedError: idf vector is not fitted
So, what am I doing wrong? I'm failing to fool the vectorizer object: somehow it knows that I'm cheating (i.e., passing it pre-computed data and not training it with actual text). I inspected the attributes of the vectorizer object but I can't find anything like 'istrained', 'isfitted', etc. So, how do I fool the vectorizer?
OK, I think I got it: the vectorizer instance has an attribute _tfidf, which in turn must have an attribute _idf_diag. The transform method calls a check_is_fitted function that checks whether that _idf_diag exists. (I had missed it because it's an attribute of an attribute.) So I inspected the TfidfVectorizer source code to see how _idf_diag is created, and then just added it to the _tfidf attribute:
import scipy.sparse as sp

# ... code ...

vectorizer._tfidf._idf_diag = sp.spdiags(idf,
                                         diags=0,
                                         m=len(idf),
                                         n=len(idf))
And now the vectorization works.
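For anyone who wants the workaround in one place, a condensed sketch looks like this (assuming idf is the pre-computed 1-D numpy array of IDF weights and vocabulary.json maps terms to column indices, as above; the idf.npy filename is hypothetical). Building _idf_diag directly also makes the MyVectorizer subclass trick unnecessary:
import json
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

idf = np.load('idf.npy')  # hypothetical file holding the saved IDF array
with open('vocabulary.json') as f:
    vocabulary = json.load(f)

vectorizer = TfidfVectorizer(lowercase=False, min_df=2,
                             norm='l2', smooth_idf=True)

# plug in the vocabulary and the IDF diagonal matrix directly
vectorizer.vocabulary_ = vocabulary
vectorizer._tfidf._idf_diag = sp.spdiags(idf,
                                         diags=0,
                                         m=len(idf),
                                         n=len(idf))

print(vectorizer.transform(['foo bar']))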