cPickle error with pathos multiprocessing? [duplicate] - python-2.7

I'm trying to use multiprocessing to speed up pandas Excel reading. However, when I use multiprocessing I get the error
cPickle.PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
when I try to run the following:
import glob
import dill
import pandas as pd
from pathos.multiprocessing import ProcessPool

class A(object):
    def __init__(self):
        self.files = glob.glob('*')

    def read_file(self, filename):
        return pd.read_excel(filename)

    def file_data(self):
        pool = ProcessPool(9)
        file_list = [filename for filename in self.files]
        df_list = pool.map(A().read_file, file_list)
        combined_df = pd.concat(df_list, ignore_index=True)
Isn't pathos.multiprocessing designed to fix this issue? Am I overlooking something here?
Edit:
The full error traces to:
Traceback (most recent call last):
  File "c:\users\zky3sse\appdata\local\continuum\anaconda2\lib\site-packages\pathos-0.2.0-py2.7.egg\pathos\multiprocessing.py", line 136, in map
    return _pool.map(star(f), zip(*args)) # chunksize
  File "C:\Users\ZKY3SSE\AppData\Local\Continuum\Anaconda2\lib\multiprocessing\pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "C:\Users\ZKY3SSE\AppData\Local\Continuum\Anaconda2\lib\multiprocessing\pool.py", line 567, in get
    raise self._value

It is possible that Pandas is using SWIG as a wrapper for its C code. If that is the case, dill may not work properly, and pathos would then fall back to pickle. There are workarounds, as shown here: How to make my SWIG extension module work with Pickle?
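For illustration only, the usual shape of that workaround is to give the wrapped class a __reduce__ method so pickle rebuilds the object from its constructor arguments instead of trying to serialize the C-level state. This is a sketch; SwigThing stands in for a hypothetical SWIG-wrapped class:

    class PicklableSwigThing(SwigThing):
        def __init__(self, *args):
            # remember the constructor arguments so we can replay them on load
            self._args = args
            SwigThing.__init__(self, *args)

        def __reduce__(self):
            # pickle will call PicklableSwigThing(*self._args) when unpickling
            return (self.__class__, self._args)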

Related

Tensorflow built-in function passed as a variable doesn't work. How can I make it work?

When I run the code below,
import numpy as np
import tensorflow as tf

class Config:
    activation = tf.nn.tanh

class Sample:
    def function(self, x):
        return self.config.activation(x)

    def __init__(self, config):
        self.config = config

if __name__ == "__main__":
    with tf.Graph().as_default():
        config = Config()
        sample = Sample(config)
        with tf.Session() as sess:
            a = tf.constant(2.0)
            print sess.run(sample.function(a))
I get this error message:
Traceback (most recent call last):
  File "test.py", line 27, in <module>
    print sess.run(sample.function(a))
  File "test.py", line 11, in function
    return self.config.activation(x)
  File "/Users/byungwookang/anaconda/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 2019, in tanh
    with ops.name_scope(name, "Tanh", [x]) as name:
  File "/Users/byungwookang/anaconda/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/Users/byungwookang/anaconda/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 4185, in name_scope
    with g.as_default(), g.name_scope(n) as scope:
  File "/Users/byungwookang/anaconda/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/Users/byungwookang/anaconda/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2839, in name_scope
    if name:
  File "/Users/byungwookang/anaconda/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 541, in __nonzero__
    raise TypeError("Using a `tf.Tensor` as a Python `bool` is not allowed. "
TypeError: Using a `tf.Tensor` as a Python `bool` is not allowed. Use `if t is not None:` instead of `if t:` to test if a tensor is defined, and use TensorFlow ops such as tf.cond to execute subgraphs conditioned on the value of a tensor.
In contrast, this code works as expected.
import numpy as np
import tensorflow as tf

class Config:
    activation = np.tanh

class Sample:
    def function(self, x):
        return self.config.activation(x)

    def __init__(self, config):
        self.config = config

if __name__ == "__main__":
    config = Config()
    sample = Sample(config)
    print sample.function(2.0)
    print np.tanh(2.0)
It gives
0.964027580076
0.964027580076
I am curious why one can't pass a tensorflow built-in function as a variable (as done in the first code above), and whether there is a way to avoid the above error. Especially given the second code where a numpy function is passed nicely as a variable, it seems very strange to me that tensorflow doesn't allow this.
The reason your first example does not work is that in your case

    print sample.config.activation  # <bound method Config.tanh of <__main__.Config instance at 0xXXX>>
    print tf.nn.tanh                # <function tanh at 0xXXX>

are not the same, whereas in your second case they match. tf.nn.tanh is a plain Python function, so storing it as a class attribute makes the lookup self.config.activation return a bound method: the Config instance is silently passed as tanh's first argument x, your tensor a lands in the name parameter, and the if name: check inside TensorFlow is what raises the TypeError. np.tanh is a NumPy ufunc, not a Python function, so no binding happens in the second case.
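A minimal sketch of that binding behaviour on its own, with no TensorFlow involved (plain Python 2; the names are made up for illustration):

    def f(a, b=None):
        return (a, b)

    class Config:
        activation = f  # a plain function stored as a class attribute

    config = Config()
    print config.activation(2.0)
    # (<__main__.Config instance at 0x...>, 2.0)
    # config was passed as 'a' and 2.0 landed in 'b', exactly how the
    # tensor ended up in tanh's 'name' parameter above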
It is hard for me to understand the purpose of all these classes and functions for such a simple job, so I just made the smallest change that gets it working:
import numpy as np
import tensorflow as tf

def config():
    return {'activation': tf.nn.tanh}

class Sample:
    def function(self, x):
        return self.config['activation'](x)

    def __init__(self, config):
        self.config = config

if __name__ == "__main__":
    with tf.Graph().as_default():  # this is also not needed
        sample = Sample(config())
        with tf.Session() as sess:
            a = tf.constant(2.0)
            print sess.run(sample.function(a))
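If you want to keep the original Config class instead, another option (an untested sketch) is to prevent the binding explicitly by wrapping the function in staticmethod:

    class Config(object):
        # staticmethod stops attribute lookup from turning the stored
        # function into a bound method
        activation = staticmethod(tf.nn.tanh)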

How to mock creation of text files in python 2.7 unittest framework?

I have a function that first checks whether a txt file exists and, if it does not, creates one. If the txt file already exists, it reads the info. I am trying to write unittests to examine whether the logic of the function is correct. I want to patch things like the existence, creation, and reading of files.
The function to be tested looks like this:
import json
import os.path

def read_create_file():
    filename = 'directory/filename.txt'
    info_from_file = []
    if os.path.exists(filename):
        with open(filename, 'r') as f:
            content = f.readlines()
            for i in range(len(content)):
                info_from_file.append(json.loads(content[i]))
        return info_from_file
    else:
        with open(filename, 'w') as f:
            pass
        return []
The unittest looks like this:
import unittest
import mock
from mock import patch

class TestReadCreateFile(unittest.TestCase):
    def setUp(self):
        pass

    def function(self):
        return read_create_file()

    @patch("os.path.exists", return_value=False)
    @mock.patch('directory/filename.txt.open', new=mock.mock_open())
    def test_file_does_not_exist(self, mock_existence, mock_open_patch):
        result = self.function()
        self.assertEqual(result, (True, []))
ERROR: ImportError: Import by filename is not supported.
or like this:
import unittest
import mock
from mock import patch

@patch("os.path.exists", return_value=False)
def test_file_not_exist_yet(self, mock_existence):
    m = mock.mock_open()
    with patch('__main__.open', m, create=True):
        handle = open('directory/filename.txt', 'w')
        result = self.function()
        self.assertEqual(result, (True, {}))
ERROR:
IOError: [Errno 2] No such file or directory: 'directory/filename.txt'
As a newbie I cannot seem to get my head around a solution; any help is greatly appreciated.
Thank you
You're mocking os.path.exists in the wrong place. When you patch, you patch the name in the namespace of the module under test:

    @patch("path_to_method_under_test.path.exists", return_value=False)
    def test_file_not_exist_yet(self, mock_existence):
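A minimal sketch of a complete test, assuming read_create_file lives in a module named mymodule (a hypothetical name), and noting that the function returns [] rather than (True, []):

    import unittest
    import mock
    from mock import patch

    import mymodule  # hypothetical module holding read_create_file

    class TestReadCreateFile(unittest.TestCase):
        @patch("mymodule.open", new_callable=mock.mock_open, create=True)
        @patch("mymodule.os.path.exists", return_value=False)
        def test_file_does_not_exist(self, mock_exists, mock_file):
            # decorators apply bottom-up, so mock_exists is the first argument
            result = mymodule.read_create_file()
            self.assertEqual(result, [])
            mock_file.assert_called_once_with('directory/filename.txt', 'w')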

Python3 pickle serialization with Cmd

I am new to Python and as my first project I am attempting to convert a Python2 script to Python3.
The script is failing when it attempts to serialize a class using pickle.
It seems as though it is failing as I am trying to save a class which uses the Cmd CLI.
This code works using Python2.
Can anyone tell me what is wrong with the script and how to fix it?
import sys
import cmd
try:
    import cPickle as pickle
except:
    import pickle
import os.path

def main():
    app = Labyrinth()
    turnfile = "turn0.lwot"
    app.Save(turnfile)

class CLI(cmd.Cmd):
    def __init__(self):
        cmd.Cmd.__init__(self)

class Labyrinth(cmd.Cmd):
    def __init__(self):
        cmd.Cmd.__init__(self)

    def Save(self, fname):
        with open(fname, 'wb') as f:
            pickle.dump(self, f, 2)
            f.close()
        print ("Save Successful!")
        sys.exit()

if __name__ == '__main__':
    main()
Not all objects are picklable. In particular, file objects are problematic because you can't generally restore their state later. cmd.Cmd holds stdin and stdout file objects and that should make them unpicklable. I was quite surprised that it worked in python 2, but it didn't really... Even though the stdin and stdout pickled, the unpickled object you get back later doesn't work, as in this example:
>>> import sys
>>> import pickle
>>> sys.stdout.write('foo\n')
foo
>>> serialized = pickle.dumps(sys.stdout, 2)
>>> stdout = pickle.loads(serialized)
>>> stdout.write('bar\n')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: I/O operation on closed file
>>>
So, even though this bit of code didn't fail, the object shouldn't be usable later. You can add a few special methods to an object that let you fix objects so they can be serialized. Here, I've stripped the bad attributes on save and added them back on restore. Now you can pickle, unpickle and it actually works when you are done.
import sys
import cmd
try:
    import cPickle as pickle
except:
    import pickle
import os.path

def main():
    app = Labyrinth()
    turnfile = "turn0.lwot"
    app.Save(turnfile)

class CLI(cmd.Cmd):
    def __init__(self):
        cmd.Cmd.__init__(self)

class Labyrinth(cmd.Cmd):
    def __init__(self):
        cmd.Cmd.__init__(self)

    def Save(self, fname):
        with open(fname, 'wb') as f:
            pickle.dump(self, f, pickle.HIGHEST_PROTOCOL)
            f.close()
        print ("Save Successful!")
        sys.exit()

    def __getstate__(self):
        # stdin/out are unpicklable. We'll get new ones on load
        return tuple(((k, v) for k, v in self.__dict__.items()
                      if k not in ('stdin', 'stdout')))

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.stdin = sys.stdin
        self.stdout = sys.stdout

if __name__ == '__main__':
    main()
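To see that the restored object is actually usable, a quick round-trip check (a sketch; it assumes the Save above has already produced turn0.lwot):

    with open("turn0.lwot", 'rb') as f:
        app = pickle.load(f)
    # stdin/stdout were re-attached by __setstate__, so the object works again
    app.stdout.write("restored\n")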
Playing with the protocol doesn't help. The full error message (which you should have included) is:
1027:~/mypy$ python3 stack41334887.py
Traceback (most recent call last):
  File "stack41334887.py", line 33, in <module>
    main()
  File "stack41334887.py", line 14, in main
    app.Save(turnfile)
  File "stack41334887.py", line 27, in Save
    pickle.dump(self,f, 3, fix_imports=True)
TypeError: cannot serialize '_io.TextIOWrapper' object
Python3 made some major changes in the io system. This TextIOWrapper is, I think, new to Py3.
https://docs.python.org/3.1/library/io.html#io.TextIOWrapper
The question Can I use multiprocessing.Pool in a method of a class? also ran into problems serializing a TextIOWrapper.
=========
So, inspired by @tdelaney, I checked the stdin for my Py3 session:
In [1212]: sys.stdin
Out[1212]: <_io.TextIOWrapper name='<stdin>' mode='r' encoding='UTF-8'>
So that's the thing that can't be serialized.

Pyro4 pickle serializer numpy array

Trying to serialize a numpy array with Pyro4 returns the following type error:
TypeError: don't know how to serialize class <type 'numpy.ndarray'>. Give it vars() or an appropriate __getstate__
The code is the following:
import numpy as np
import Pyro4

# set pickle serializer
Pyro4.config.SERIALIZERS_ACCEPTED = set(['pickle', 'json', 'marshal', 'serpent'])

@Pyro4.expose
class test(object):
    def get_array(self):
        return np.random.random((10, 10))

def main():
    # create a Pyro daemon
    daemon = Pyro4.Daemon()
    # register test instance
    uri = daemon.register(test())
    # print uri to connect to it in another console
    print uri
    # start the event loop of the server to wait for calls
    daemon.requestLoop()

if __name__ == "__main__":
    main()
Now open another console and try to call the test instance:
import Pyro4

Pyro4.config.SERIALIZERS_ACCEPTED = set(['pickle', 'json', 'marshal', 'serpent'])
# connect to the URI which is printed above; it must be something like
# 'PYRO:obj_c261949088104b839878255b98a9da90@localhost:57495'
p = Pyro4.Proxy(URI)
p.get_array()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/Pyro4/core.py", line 171, in __call__
    return self.__send(self.__name, args, kwargs)
  File "/usr/local/lib/python2.7/site-packages/Pyro4/core.py", line 438, in _pyroInvoke
    raise data
TypeError: don't know how to serialize class <type 'numpy.ndarray'>. Give it vars() or an appropriate __getstate__
This is mentioned in the manual, including what you can do to solve it: http://pythonhosted.org/Pyro4/tipstricks.html#pyro-and-numpy
In your code above, you didn't tell your client code to use pickle. You should use another config item for that (SERIALIZER). What you have there is meant for Pyro daemons instead.
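For the client side, a minimal sketch (SERIALIZERS_ACCEPTED only matters for the daemon; the client picks its wire format with SERIALIZER):

    import Pyro4

    # tell the *client* to use pickle; the server must accept it
    Pyro4.config.SERIALIZER = 'pickle'

    p = Pyro4.Proxy(uri)  # the URI printed by the server
    print p.get_array()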