Specifying select features to be categorical using OneHotEncoder in sklearn 0.14 - python-2.7

I am using the sklearn 0.14 module in Python to create a decision tree. I was hoping to use the OneHotEncoder to convert some features into categorical features. According to the documentation, I should be able to provide an array of indices to indicate which features should be converted. However, trying the following code:
import numpy
from sklearn import preprocessing

xs = [[64, 15230], [3, 67673], [16, 43678]]
encoder = preprocessing.OneHotEncoder(n_values='auto', categorical_features=[1], dtype=numpy.integer)
encoder.fit(xs)
I receive the following error:
Traceback (most recent call last):
  File "C:\Users\sara\Documents\Shipping Project\PythonSandbox\CarrierDecisionTree.py", line 35, in <module>
    encoder.fit(xs)
  File "C:\Python27\lib\site-packages\sklearn\preprocessing\data.py", line 892, in fit
    self.fit_transform(X)
  File "C:\Python27\lib\site-packages\sklearn\preprocessing\data.py", line 944, in fit_transform
    self.categorical_features, copy=True)
  File "C:\Python27\lib\site-packages\sklearn\preprocessing\data.py", line 795, in _transform_selected
    return sparse.hstack((X_sel, X_not_sel))
  File "C:\Python27\lib\site-packages\scipy\sparse\construct.py", line 417, in hstack
    return bmat([blocks], format=format, dtype=dtype)
  File "C:\Python27\lib\site-packages\scipy\sparse\construct.py", line 532, in bmat
    dtype = upcast( *tuple([A.dtype for A in blocks[block_mask]]) )
  File "C:\Python27\lib\site-packages\scipy\sparse\sputils.py", line 53, in upcast
    raise TypeError('no supported conversion for types: %r' % (args,))
TypeError: no supported conversion for types: (dtype('int32'), dtype('S6'))
If, instead, I provide the array [0, 1] to categorical_features, it works correctly and converts both features properly. The same correct behavior occurs when passing 'all' to categorical_features. However, I only want the second feature converted, not the first. I understand I could do this manually by converting one feature at a time, but I was hoping to use all the beauty of OneHotEncoder, as I will be using many more features later on.

Posting as an answer, for the record:
TypeError: no supported conversion for types: (dtype('int32'), dtype('S6'))
means something in the true xs (not the one shown in the code snippet) is a string: dtype('S6') is NumPy's length-six string type.
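A quick way to confirm this on the real data (a sketch; the astype cast at the end is a hypothetical fix that assumes the offending values are numeric strings):
import numpy as np

arr = np.asarray(xs)
print(arr.dtype)  # a dtype like 'S6' means a stray string coerced every value to a fixed-width string

# Hypothetical fix: cast back to integers before fitting.
encoder.fit(arr.astype(np.int64))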


pyomo mindtpy example program when run becomes unfeasible for binary variable

So I installed pyomo, glpk, and ipopt with anaconda.
When I run the example code from here: https://pyomo.readthedocs.io/en/stable/contributed_packages/mindtpy.html
from pyomo.environ import *
model = ConcreteModel()
model.x = Var(bounds=(1.0,10.0),initialize=5.0)
model.y = Var(within=Binary)
model.c1 = Constraint(expr=(model.x-3.0)**2 <= 50.0*(1-model.y))
model.c2 = Constraint(expr=model.x*log(model.x)+5.0 <= 50.0*(model.y))
model.objective = Objective(expr=model.x, sense=minimize)
SolverFactory('mindtpy').solve(model, mip_solver='glpk', nlp_solver='ipopt',tee=True)
model.objective.display()
model.display()
model.pprint()
I get the output that the binary variable has apparently become infeasible:
python minlpex.py
INFO: ---Starting MindtPy---
INFO: Original model has 2 constraints (2 nonlinear) and 0 disjunctions, with
2 variables, of which 1 are binary, 0 are integer, and 1 are continuous.
INFO: NLP 1: Solve relaxed integrality
INFO: NLP 1: OBJ: 1.0 LB: 1.0 UB: inf
INFO: ---MindtPy Master Iteration 0---
INFO: MIP 1: Solve master problem.
WARNING: Empty constraint block written in LP format - solver may error
Traceback (most recent call last):
File "minlpex.py", line 13, in <module>
op.SolverFactory('mindtpy').solve(model, mip_solver='glpk', nlp_solver='ipopt',tee=True)
File "/anaconda3/envs/py36/lib/python3.6/site-packages/pyomo/contrib/mindtpy/MindtPy.py", line 370, in solve
MindtPy_iteration_loop(solve_data, config)
File "/anaconda3/envs/py36/lib/python3.6/site-packages/pyomo/contrib/mindtpy/iterate.py", line 30, in MindtPy_iteration_loop
handle_master_mip_optimal(master_mip, solve_data, config)
File "/anaconda3/envs/py36/lib/python3.6/site-packages/pyomo/contrib/mindtpy/mip_solve.py", line 62, in handle_master_mip_optimal
config)
File "/anaconda3/envs/py36/lib/python3.6/site-packages/pyomo/contrib/gdpopt/util.py", line 199, in copy_var_list_values
v_to.set_value(value(v_from, exception=False))
File "/anaconda3/envs/py36/lib/python3.6/site-packages/pyomo/core/base/var.py", line 173, in set_value
if valid or self._valid_value(val):
File "/anaconda3/envs/py36/lib/python3.6/site-packages/pyomo/core/base/var.py", line 185, in _valid_value
"domain %s" % (val, type(val), self.domain))
ValueError: Numeric value `0.22709088987977885` (<class 'float'>) is not in domain Binary
So I was a little confused: since this is the provided example code, I would not expect it to error like this. Am I messing something up, or am I missing some required library?
Thanks a lot.
Looks like something must have been wrong with the conda pyomo or ipopt install.
When I reinstalled ipopt using pip and compiled pyomo from the GitHub source, everything worked fine.
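A small sanity check (my addition, not part of the original answer) to confirm both sub-solvers are visible to pyomo before rebuilding anything:
from pyomo.environ import SolverFactory

# A broken install typically shows up as available() == False here.
for name in ('glpk', 'ipopt'):
    print(name, 'available:', SolverFactory(name).available(exception_flag=False))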

How to fine-tune a network in caffe2

There is very little information about how to fine-tune parameters, and it really confuses me how to fine-tune a network in caffe2. Could anybody show me some code for the fine-tuning part? Many thanks.
By the way, in the link Food101 SqueezeNet Caffe2 number of iterations, it seems that the author has successfully fine-tuned the network.
Update: here is some code from my training part,
train_model = cnn.CNNModelHelper(order="NCHW", name="train")
train_model.param_init_net.AppendNet(core.Net(init_net))
train_model.net.AppendNet(core.Net(predict_net))
train_model.param_init_net.RunAllOnGPU(gpu_id=0)
train_model.net.RunAllOnGPU(gpu_id=0)
workspace.RunNetOnce(train_model.param_init_net)
AddTrainingOperators(train_model, 'softmaxout', 'label')
AddBookkeepingOperators(train_model)
workspace.RunNetOnce(train_model.param_init_net)
data, label = AddInput(train_model, batch_size=3,
db=os.path.join(data_folder, 'toy_train.lmdb'),
db_type='lmdb')
workspace.FeedBlob('data', data)
workspace.FeedBlob('label', label)
workspace.CreateNet(train_model.net)
However, when I run the code, the following error occurs:
Traceback (most recent call last):
File "lenetForChineseFinetune.py", line 62, in <module>
workspace.FeedBlob('data', data)
File "/opt/caffe2/caffe2/local/caffe2/python/workspace.py", line 262, in FeedBlob
return C.feed_blob(name, arr)
RuntimeError: [enforce fail at pybind_state.cc:825] . Unexpected type of argument - only numpy array or string are supported for feeding
How should I modify the code?
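(An aside, not an answer from the thread: per the traceback, FeedBlob only accepts a numpy array or a string, so it cannot be fed the symbolic blob handles that AddInput returns. A minimal valid call, with a hypothetical shape, would look like this:)
import numpy as np
from caffe2.python import workspace

# FeedBlob requires concrete data, not a BlobReference.
workspace.FeedBlob('data', np.zeros((3, 1, 28, 28), dtype=np.float32))  # hypothetical shape
workspace.FeedBlob('label', np.zeros(3, dtype=np.int32))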

Python RPy2 function with multiple input arguments

I am looking to call an rpy2 function with multiple input parameters. The R function I am trying to use is write.csv; it has multiple input parameters, and I need to specify more than one of them.
If I use it without the optional parameters row.names and column.names, it works like this:
r("write.csv")(d,file='myfilename.csv')
For my requirements, I must issue this command with the optional parameters row.names and column.names. So, I tried:
r('write.csv')(d, file='myfilename.csv', row.names=FALSE, column.names=FALSE)
but I got this error message:
File "/home/UserName/test.py", line 12
r("write.csv")(d,file='myfilename.csv',row.names=FALSE, column.names=FALSE)
SyntaxError: keyword can't be an expression
[Finished in 0.0s with exit code 1]
[shell_cmd: python -u "/home/UserName/test.py"]
[dir: /home/UserName]
[path: /home/UserName/bin:/home/UserName/.local/bin:/usr/local/sbin:/usr/local/bin:
.../usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin]
How can I achieve write.csv with row.names=FALSE and column.names=FALSE, in rpy2?
You can use Python's **.
See the note here: http://rpy2.readthedocs.io/en/version_2.8.x/robjects_functions.html#callable
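In other words (a sketch based on that note): R argument names that are not valid Python identifiers, such as row.names, can be packed into a dict and unpacked with **:
from rpy2.robjects import r

# 'row.names' cannot be written as a Python keyword argument, but ** works.
r("write.csv")(d, **{'file': 'myfilename.csv', 'row.names': False})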
One of my mistakes was that I should have replaced . with _, as shown in the docs here:
from rpy2.robjects.packages import importr
base = importr('base')
base.rank(0, na_last = True)
so I would analogously need row_names = False. However, the . in write.csv() itself still remained, so this only solved part of the question. OK, so I tried a few things to get an answer:
Generating sample data:
from rpy2.robjects import r, globalenv
from rpy2.robjects import IntVector, DataFrame
d = {'a': IntVector((1,2,3)), 'b': IntVector((4,5,6))}
dataf = DataFrame(d)
Attempts follow - 1. did not work, 2. and 3. did work:
1.
r('write_csv')(x=dataf,file='testing.csv',row_names=False)
Traceback (most recent call last):
File "C:\Users\UserName\FileD\test.py", line 18, in <module>
r('write_csv')(x=dataf,file='testing.csv',row_names=False)
File "C:\Python27\lib\site-packages\rpy2\robjects\__init__.py", line 321, in __call__
res = self.eval(p)
File "C:\Python27\lib\site-packages\rpy2\robjects\functions.py", line 178, in __call__
return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
File "C:\Python27\lib\site-packages\rpy2\robjects\functions.py", line 106, in __call__
res = super(Function, self).__call__(*new_args, **new_kwargs)
rpy2.rinterface.RRuntimeError: Error in eval(expr, envir, enclos) : object 'write_csv' not found
Error in eval(expr, envir, enclos) : object 'write_csv' not found
2.
r('''
write_csv <- function(x,verbose=FALSE)
write.csv(x,file='testing.csv',row.names=FALSE)
''')
r['write_csv'](dataf)
3.
globalenv['dataf'] = dataf
r("write.csv(dataf,file='testing2.csv',row.names=FALSE)")
I was really hoping attempt 1 would have worked. It seemed I had reproduced the example in the docs, base.rank(0, na_last = True), but something was still missing: r('write_csv') evaluates its argument as R code, so it looks for an R object literally named write_csv, which does not exist. The dot-to-underscore translation applies only to Python keyword argument names, not to the function name itself.
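For completeness, a sketch of what attempt 1 was reaching for, rewritten with the ** trick from above (same dataf as in the sample data):
r('write.csv')(x=dataf, **{'file': 'testing.csv', 'row.names': False})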

Different behaviour for io in pickle with string content

When working with pickled data I encountered different behavior between io.open and __builtin__.open. Consider the following simple example:
import pickle
payload = 'foo'
fn = 'test.pickle'
pickle.dump(payload, open(fn, 'w'))
a = pickle.load(open(fn, 'r'))
This works as expected. But running this code here:
import pickle
import io
payload = 'foo'
fn = 'test.pickle'
pickle.dump(payload, io.open(fn, 'w'))
a = pickle.load(io.open(fn, 'r'))
gives the following Traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\WinPython-32bit-2.7.8.1\python-2.7.8\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
execfile(filename, namespace)
File "D:/**.py", line 15, in <module>
pickle.dump(payload, io.open(fn, 'w'))
File "D:\WinPython-32bit-2.7.8.1\python-2.7.8\lib\pickle.py", line 1370, in dump
Pickler(file, protocol).dump(obj)
File "D:\WinPython-32bit-2.7.8.1\python-2.7.8\lib\pickle.py", line 224, in dump
self.save(obj)
File "D:\WinPython-32bit-2.7.8.1\python-2.7.8\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "D:\WinPython-32bit-2.7.8.1\python-2.7.8\lib\pickle.py", line 488, in save_string
self.write(STRING + repr(obj) + '\n')
TypeError: must be unicode, not str
As I want to be future-compatible, how can I circumvent this behavior? Or, what else am I doing wrong here?
I stumbled over this when dumping dictionaries with keys of type string.
My python version is:
'2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]'
The difference is not surprising, because io.open() explicitly deals with Unicode strings when using text mode. The documentation is quite clear about this:
Note: Since this module has been designed primarily for Python 3.x, you have to be aware that all uses of “bytes” in this document refer to the str type (of which bytes is an alias), and all uses of “text” refer to the unicode type. Furthermore, those two types are not interchangeable in the io APIs.
and
Python distinguishes between files opened in binary and text modes, even when the underlying operating system doesn’t. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as unicode strings, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.
You need to open files in binary mode. The fact that it worked with the built-in open() at all is more luck than wisdom; if your pickles contained data with \n and/or \r bytes, the pickle loading may well fail. The Python 2 default pickle protocol happens to be text-based, but the output should still be treated as binary.
In all cases, when writing pickle data, use binary mode:
pickle.dump(payload, open(fn, 'wb'))
a = pickle.load(open(fn, 'rb'))
or
pickle.dump(payload, io.open(fn, 'wb'))
a = pickle.load(io.open(fn, 'rb'))
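A variant of the same fix using context managers, so the file is flushed and closed deterministically (my formulation; the behavior is identical for both open variants):
import io
import pickle

payload = 'foo'
fn = 'test.pickle'

with io.open(fn, 'wb') as f:
    pickle.dump(payload, f)
with io.open(fn, 'rb') as f:
    a = pickle.load(f)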

error: unpack requires a string argument of length 8

I was running my script and I stumbled upon this error:
WARNING *** file size (24627) not 512 + multiple of sector size (512)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Traceback (most recent call last):
File "C:\Email Attachments\whatever.py", line 20, in <module>
main()
File "C:\Email Attachments\whatever.py", line 17, in main
csv_from_excel()
File "C:\Email Attachments\whatever.py", line 7, in csv_from_excel
sh = wb.sheet_by_name('B2B_REP_YLD_100_D_SQ.rpt')
File "C:\Python27\lib\site-packages\xlrd\book.py", line 442, in sheet_by_name
return self.sheet_by_index(sheetx)
File "C:\Python27\lib\site-packages\xlrd\book.py", line 432, in sheet_by_index
return self._sheet_list[sheetx] or self.get_sheet(sheetx)
File "C:\Python27\lib\site-packages\xlrd\book.py", line 696, in get_sheet
sh.read(self)
File "C:\Python27\lib\site-packages\xlrd\sheet.py", line 1055, in read
dim_tuple = local_unpack('<ixxH', data[4:12])
error: unpack requires a string argument of length 8
I was trying to process this Excel file:
https://drive.google.com/file/d/0B12NevhOGQGRMkRVdExuYjFveDQ/edit?usp=sharing
One solution that I found is to manually open the spreadsheet, save it, then close it before I run my script that converts .xls to .csv. I find this solution a bit cumbersome and clunky.
This kind of spreadsheet is saved daily to my drive via an Outlook macro. Unprocessed data is piling up, which is why I turned to scripting to ease the job.
Who made the Outlook macro that's dumping this file? xlrd uses byte-level unpacking to read in the Excel file, and is failing to read a field in this Excel file. There are ways to follow where it's failing, but none to automatically recover from this type of error.
The erroneous data seems to be at data[4:12] of a specific frame (we'll see later), which should be a bytestring that's parsed as such:
one integer (i)
2 pad bytes (xx)
unsigned short 2 byte integer (H).
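A minimal illustration of that format string, showing why exactly 8 bytes are required (the example values are mine, not from the file):
import struct

print(struct.calcsize('<ixxH'))  # 8 bytes: 4-byte int + 2 pad bytes + 2-byte unsigned short
print(struct.unpack('<ixxH', b'\x01\x00\x00\x00\xff\xff\x05\x00'))  # (1, 5)

# Anything shorter reproduces the error from the traceback:
# struct.unpack('<ixxH', b'\x01\x00\x00\x00')
# struct.error: unpack requires a string argument of length 8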
You can set xlrd to DEBUG mode, which will show you which bytes it's parsing, and exactly where in the file there is an error:
import xlrd
xlrd.DEBUG = 2
workbook = xlrd.open_workbook(u'/home/sparker/Downloads/20131117_040934_B2B_REP_YLD_100_D_LT.xls')
Here are the results, slightly trimmed down for the sake of SO:
parse_globals: record code is 0x0293
parse_globals: record code is 0x0293
parse_globals: record code is 0x0085
CODEPAGE: codepage 1200 -> encoding 'utf_16_le'
BOUNDSHEET: bv=80 data '\xfd\x04\x00\x00\x00\x00\x18\x00B2B_REP_YLD_100_D_SQ.rpt'
BOUNDSHEET: inx=0 vis=0 sheet_name=u'B2B_REP_YLD_100_D_SQ.rpt' abs_posn=1277 sheet_type=0x00
parse_globals: record code is 0x000a
GET_SHEETS: [u'B2B_REP_YLD_100_D_SQ.rpt'] [1277]
GET_SHEETS: sheetno = 0 [u'B2B_REP_YLD_100_D_SQ.rpt'] [1277]
reqd: 0x0010
getbof(): data='\x00\x06\x10\x00\xbb\r\xcc\x07\x00\x00\x00\x00\x06\x00\x00\x00'
getbof(): op=0x0809 version2=0x0600 streamtype=0x0010
getbof(): BOF found at offset 1277; savpos=1277
BOF: op=0x0809 vers=0x0600 stream=0x0010 buildid=3515 buildyr=1996 -> BIFF80
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "build/bdist.macosx-10.4-x86_64/egg/xlrd/__init__.py", line 457, in open_workbook
File "build/bdist.macosx-10.4-x86_64/egg/xlrd/__init__.py", line 1007, in get_sheets
File "build/bdist.macosx-10.4-x86_64/egg/xlrd/__init__.py", line 998, in get_sheet
File "build/bdist.macosx-10.4-x86_64/egg/xlrd/sheet.py", line 864, in read
struct.error: unpack requires a string argument of length 8
Specifically, you can see that it parses the sheet named u'B2B_REP_YLD_100_D_SQ.rpt'.
Let's check the source code. The traceback throws an error here, where we can see from the parent loop that we're trying to parse the XL_DIMENSION and XL_DIMENSION2 values. These directly correspond to the shape of the Excel sheet.
And that's where there's a problem in your workbook: it's not being made correctly. So, back to my original question: who made the Excel macro? It needs to be fixed. But that's for another SO question, some other time.