Unpickling data from Python 2 with unicode strings in Python 3 - python-2.7

I have data from Python 2.7 that I pickled like this:
#!/usr/bin/env python2
# coding=utf-8
import datetime
import pickle
data = {1: datetime.date(2014, 3, 18),
        'string-key': u'ünicode-string'}
pickle.dump(data, open('file.pickle', 'wb'))
The only way I found to load this in Python 3.4 is:
data = pickle.load(open('file.pickle', "rb"), encoding='bytes')
Now my unicode strings are fine, but the dict keys are bytes. print(repr(data)) gives:
{1: datetime.date(2014, 3, 18), b'string-key': 'ünicode-string'}
Does anybody have an idea how to avoid rewriting my code to use data[b'string-key'], or converting all existing files?

This is not a real answer, only a workaround. It converts the pickled data so it can be loaded natively in Python 3, using Python 3.4 (it doesn't work in 3.3):
#!/usr/bin/env python3
import pickle, glob
def bytes_to_unicode(ob):
    t = type(ob)
    if t in (list, tuple):
        l = [str(i, 'utf-8') if type(i) is bytes else i for i in ob]
        l = [bytes_to_unicode(i) if type(i) in (list, tuple, dict) else i for i in l]
        ro = tuple(l) if t is tuple else l
    elif t is dict:
        byte_keys = [i for i in ob if type(i) is bytes]
        for bk in byte_keys:
            v = ob[bk]
            del ob[bk]
            ob[str(bk, 'utf-8')] = v
        for k in ob:
            if type(ob[k]) is bytes:
                ob[k] = str(ob[k], 'utf-8')
            elif type(ob[k]) in (list, tuple, dict):
                ob[k] = bytes_to_unicode(ob[k])
        ro = ob
    else:
        ro = ob
        print("unprocessed object: {0} {1}".format(t, ob))
    return ro

for fn in glob.glob('*.pickle'):
    data = pickle.load(open(fn, "rb"), encoding='bytes')
    ndata = bytes_to_unicode(data)
    pickle.dump(ndata, open(fn + '3', "wb"))
The Python docs say:
The pickle serialization format is guaranteed to be backwards compatible across Python releases.
I didn't find a way to pickle.load Python-2.7 pickled data in Python 3.3 -- not even data that contained only ints and dates.

Have a look at the implementation.
You can subclass the Unpickler and override the byte deserialization to produce strings.
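For example, a minimal sketch of that idea, relying on the pure-Python unpickler pickle._Unpickler and its internal _decode_string hook (CPython implementation details introduced around 3.4, so they may change between versions):
import pickle

class StringUnpickler(pickle._Unpickler):
    # _decode_string is called for Python-2 str payloads (STRING/BINSTRING opcodes)
    def _decode_string(self, value):
        try:
            return value.decode('utf-8')   # text keys/values become str
        except UnicodeDecodeError:
            return value                   # binary payloads (e.g. datetime state) stay bytes

with open('file.pickle', 'rb') as f:
    data = StringUnpickler(f).load()
print(repr(data))
# expected: {1: datetime.date(2014, 3, 18), 'string-key': 'ünicode-string'}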

Related

Cloud datastore client changes type from int to float

I was writing a script in Python using the google-cloud-datastore module to upload data from my CSV to Datastore. The script seems to work fine, but there is a problem I'm stuck with: the integer values from my CSV are being stored as floating point numbers. Is that the default way data is sent to Datastore, or am I doing something wrong?
Here's my code:
import sys
import getopt
import pandas as pd
from google.cloud import datastore
def write_dict_chunks(data, SIZE=100):
    log_count = 0
    datastore_client = datastore.Client()
    task_key = datastore_client.key(kind)
    for i in xrange(0, len(data), SIZE):
        entities = []
        for each_entry in data[i : i+SIZE]:
            nan_check = lambda v: v if str(v) != 'nan' else None
            string_check = lambda v: v.decode('utf-8') if isinstance(v, str) else v
            write_row = {k: nan_check(string_check(v)) for k, v in each_entry.iteritems()}
            entity = datastore.Entity(key=task_key)
            entity.update(write_row)
            entities.append(entity)
        datastore_client.put_multi(entities)
        log_count += len(entities)
        print 'Wrote {} entities to datastore'.format(log_count)

try:
    opts, args = getopt.getopt(sys.argv[1:], "ho:v", ["kind=", "filepath="])
    if len(args) > 0:
        for each in args:
            print 'Unrecognized argument: ' + each
        sys.exit(2)
except getopt.GetoptError as err:
    # print help information and exit:
    print str(err)  # will print something like "option -a not recognized"
    print 'Usage: python parse_csv.py --kind=kind_name --filepath=path_to_csv'

kind = None
filepath = None
for option, argument in opts:
    if option in '--kind':
        kind = argument
    elif option in '--filepath':
        filepath = argument

df = pd.read_csv(filepath)
df = df.to_dict(orient='records')
write_dict_chunks(df)
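One thing worth checking (this is a guess, not a confirmed diagnosis): pandas reads integer columns as float64 whenever they contain missing values, so the numbers may already be floats before they ever reach Datastore. A minimal sketch of the check, with a made-up file and column name:
import pandas as pd

df = pd.read_csv('data.csv')   # hypothetical file
print(df.dtypes)               # an int column containing NaN shows up as float64

# cast a column back to int, but only if it has no missing values
if not df['count'].isnull().any():     # 'count' is a made-up column name
    df['count'] = df['count'].astype(int)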

Extract elements from tuple to encode in python

I have a list of lists of tuples, with unicode problems.
I have been struggling to encode this into the equivalent characters, and I have been unsuccessful.
Here is a sample of my code:
import spaghetti as sgt
import codecs
f = codecs.open('output-data-pos', encoding='utf-8')
raw = f.read()
reviews = [raw.split()]
output_tagged = (sgt.pos_tag_sents(reviews))
Here is a sample of what output_tagged produces:
[[(u'cerramos', None), (u'igual', u'aq0cs0'), (u'arrancado', None), (u'estanter\xeda', None), (u'\xe9xito', u'ncms000'), (u'an\xe9cdotas', u'ncfp000')]]
My overall objective is to extract each value from the tuples and encode it in UTF-8, for a final result such as:
cerramos None
igual aq0cs0
arrancado None
estantería None
éxito ncms000
anécdotas ncfp000
Some of the strategies that I have tried so far, starting with the simple ones:
where I try to output the list and encode it directly
d = codecs.open('output-data-tagged', 'w', encoding='utf-8')
d.write(output_tagged)
or this approach
f = open('output-data-tagged', 'w')
for output in output_tagged:
    output.encode('utf-8')
    f.write(output)
f.close()
where I first try to map the list and then encode it:
list_of_lists = map(list, output_tagged)
print list_of_lists
where I try functions to encode the data
def reprunicode(u):
    return reprunicode(u).decode('raw_unicode_escape')
print u'[%s]' % u', '.join([u'(%s,)' % reprunicode(ti[0]) for ti in output_tagged])
this one too:
def utf8data(list):
    return [item.decode('utf8') for item in list]
print utf8data(output_tagged)
Considering my many attempts, how can I extract the elements from the tuples in the list of lists in order to arrive at my desired final encoding result?
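For what it's worth, one straightforward way to get the output shown above is to loop over the nested tuples and write each pair yourself. A sketch (Python 2, reusing the question's codecs approach, with a small hard-coded sample standing in for the real output_tagged):
# -*- coding: utf-8 -*-
import codecs

# sample shaped like the question's output_tagged: a list of sentences,
# each a list of (token, tag) tuples
output_tagged = [[(u'cerramos', None), (u'igual', u'aq0cs0'), (u'\xe9xito', u'ncms000')]]

with codecs.open('output-data-tagged', 'w', encoding='utf-8') as out:
    for sentence in output_tagged:
        for token, tag in sentence:
            out.write(u'{0} {1}\n'.format(token, tag))   # e.g. "exito ncms000", "cerramos None"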

Can't merge two lists into a dictionary

I can't merge two lists into a dictionary. I tried the following:
Map two lists into a dictionary in Python
I tried all solutions and I still get an empty dictionary
from sklearn.feature_extraction import DictVectorizer
from itertools import izip
import itertools
text_file = open("/home/vesko_/evnt_classification/bag_of_words", "r")
text_fiel2 = open("/home/vesko_/evnt_classification/sdas", "r")
lines = text_file.read().split('\n')
words = text_fiel2.read().split('\n')
diction = dict(itertools.izip(words,lines))
new_dict = {k: v for k, v in zip(words, lines)}
print new_dict
I get the following:
{'word': ''}
['word=']
The two lists are not empty.
I'm using python2.7
EDIT :
Output from the two lists (I'm only showing a few items because it's a vector with 11k features):
//lines
['change', 'I/O', 'fcnet2', 'ifconfig',....
//words
['word', 'word', 'word', .....
EDIT :
Now at least I have some output #DamianLattenero
{'word\n': 'XXAMSDB35:XXAMSDB35_NGCEAC_DAT_L_Drivei\n'}
['word\n=XXAMSDB35:XXAMSDB35_NGCEAC_DAT_L_Drivei\n']
I think the root of a lot of confusion is code in the example that is not relevant.
Try this:
text_file = open("/home/vesko_/evnt_classification/bag_of_words", "r")
text_fiel2 = open("/home/vesko_/evnt_classification/sdas", "r")
lines = text_file.read().split('\n')
words = text_fiel2.read().split('\n')
# remove any extra newline or whitespace from what was read in
lines = [line.rstrip() for line in lines]
words = [word.rstrip() for word in words]
new_dict = dict(zip(words, lines))
print new_dict
Python's builtin zip() returns an iterable of tuples built from its arguments. Giving this iterable of tuples to the dict() constructor creates a dictionary where each item in words becomes a key and the item at the same position in lines becomes the corresponding value.
Also note that if the words file has more items than the lines file (or vice versa), zip() stops at the shorter of the two, so the extra items are silently dropped.
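A tiny illustration of that truncation behaviour:
words = ['a', 'b', 'c']
lines = ['1', '2']
print(dict(zip(words, lines)))   # {'a': '1', 'b': '2'} -- 'c' is silently dropped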
I tried this and it worked for me. I created two files, added numbers 1 to 4 and letters a to d, and the code creates the dictionary fine. I didn't need to import itertools, and there is actually an extra line that isn't needed:
lines = [1,2,3,4]
words = ["a","b","c","d"]
diction = dict(zip(words,lines))
# new_dict = {k: v for k, v in zip(words, lines)}
print(diction)
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
If that works and your version doesn't, you must have a problem loading the lists; try loading them like this:
def create_list_from_file(file):
    with open(file, "r") as ins:
        my_list = []
        for line in ins:
            my_list.append(line)
        return my_list
lines = create_list_from_file("/home/vesko_/evnt_classification/bag_of_words")
words = create_list_from_file("/home/vesko_/evnt_classification/sdas")
diction = dict(zip(words,lines))
# new_dict = {k: v for k, v in zip(words, lines)}
print(diction)
Observation:
If your files look like this:
1
2
3
4
and
a
b
c
d
the result will have one key/value pair per line, with the newlines kept:
{'a\n': '1\n', 'b\n': '2\n', 'c\n': '3\n', 'd': '4'}
But if your files look like:
1 2 3 4
and
a b c d
the result will be {'a b c d': '1 2 3 4'}, a single key/value pair.
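As a small follow-up sketch, reading with splitlines() instead of split('\n') avoids the trailing '\n' seen in the keys above (the file paths are the question's, assumed to exist):
with open("/home/vesko_/evnt_classification/sdas") as f_words:
    words = f_words.read().splitlines()
with open("/home/vesko_/evnt_classification/bag_of_words") as f_lines:
    lines = f_lines.read().splitlines()
print(dict(zip(words, lines)))   # keys and values without '\n'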

Performance difference serializing pandas frames python 2.x / 3.x

I was experiencing some performance differences between Python 2.7 and 3.5 when serializing pandas frames to CSV.
So I did a quick search on Google and found this benchmark:
https://gist.github.com/GitRay/4001b4962eb9f3e09a9d456ee5a30aae
And modified it a bit for my needs:
import pandas as pd
from time import time
import platform
def timeit(func, n=5):
    start = time()
    for i in range(n):
        func()
    end = time()
    return (end - start) / n

def csvdumps(s):
    s.to_csv('foo')
    return 'foo'

def csvloads(fn):
    return pd.read_csv(fn)

def hdfdumps(s):
    s.to_hdf('foo', 'bar', mode='w')
    return ('foo', 'bar')

def hdfloads(path):
    return pd.read_hdf('foo', 'bar')

df = pd.DataFrame({'text': [str(i % 1000) for i in range(1000000)],
                   'numbers': range(1000000)})

keys = ['csv', 'hdfstore']
d = {'csv': [csvloads, csvdumps],
     'hdfstore': [hdfloads, hdfdumps]}

result = dict()
for name, (loads, dumps) in d.items():
    text = dumps(df.text)
    numbers = dumps(df.numbers)
    result[name] = {'text': {'dumps': timeit(lambda: dumps(df.text)),
                             'loads': timeit(lambda: loads(text))},
                    'numbers': {'dumps': timeit(lambda: dumps(df.numbers)),
                                'loads': timeit(lambda: loads(numbers))}}
########
# Plot #
########
# Much of this was taken from
# http://nbviewer.ipython.org/gist/mwaskom/886b4e5cb55fed35213d
# by Michael Waskom
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", font_scale=1.3)
w, h = 7, 7
f, (left, right) = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(w*2, h), squeeze=True)
df = pd.DataFrame({'loads': [result[key]['text']['loads'] for key in keys],
                   'dumps': [result[key]['text']['dumps'] for key in keys],
                   'storage': keys})
df = pd.melt(df, "storage", value_name="duration", var_name="operation")
sns.barplot("duration", "storage", "operation", data=df, ax=left)
left.set(xlabel="Duration (s)", ylabel="")
sns.despine(bottom=True)
left.set_title('Cost to Serialize Text')
left.legend(loc="lower center", ncol=2, frameon=True, title="operation")
df = pd.DataFrame({'loads': [result[key]['numbers']['loads'] for key in keys],
                   'dumps': [result[key]['numbers']['dumps'] for key in keys],
                   'storage': keys})
df = pd.melt(df, "storage", value_name="duration", var_name="operation")
sns.barplot("duration", "storage", "operation", data=df, ax=right)
right.set(xlabel="Duration (s)", ylabel="")
sns.despine(bottom=True)
right.set_title('Cost to Serialize Numerical Data')
right.legend(loc="lower center", ncol=2, frameon=True, title="operation")
plt.savefig('serialize_py'+'.'.join(platform.python_version_tuple())+'.png')
As you can see in the results, serializing in Python 3 is much slower:
       python 2.7   python 3.5   diff
load   0.3504s      0.329005s    +06.50%
dump   1.2784s      3.333152s    -61.65%
Does anybody know why?

Pythonic way to create empty map of vector of vector of vector

I have the following C++ code
std::map<std::string, std::vector<std::vector<std::vector<double> > > > details;
details["string"][index][index].push_back(123.5);
May I know what is the Pythonic way to declare an empty map of vector of vector of vector? :p
I tried:
self.details = {}
self.details["string"][index][index].add(value)
I am getting
KeyError: 'string'
Probably the best way would be to use a dict for the outside container, with string keys mapping to an inner dictionary whose tuple keys (the vector indices) map to doubles:
d = {'abc': {(0,0,0): 1.2, (0,0,1): 1.3}}
It's probably less efficient (less time-efficient at least, it's actually more space-efficient I would imagine) than actually nesting the lists, but IMHO cleaner to access:
>>> d['abc'][0,0,1]
1.3
Edit:
Adding keys as you go:
d = {} #start with empty dictionary
d['abc'] = {} #insert a new string key into outer dict
d['abc'][0,3,3] = 1.3 #insert new value into inner dict
d['abc'][5,3,3] = 2.4 #insert another value into inner dict
d['def'] = {} #insert another string key into outer dict
d['def'][1,1,1] = 4.4
#...
>>> d
{'abc': {(0, 3, 3): 1.3, (5, 3, 3): 2.4}, 'def': {(1, 1, 1): 4.4}}
Or, if you're using Python >= 2.5, an even more elegant solution is to use defaultdict: it works just like a normal dictionary, but can create values for keys that don't exist.
import collections
d = collections.defaultdict(dict) #The first parameter is the constructor of values for keys that don't exist
d['abc'][0,3,3] = 1.3
d['abc'][5,3,3] = 2.4
d['def'][1,1,1] = 4.4
#...
>>> d
defaultdict(<type 'dict'>, {'abc': {(0, 3, 3): 1.3, (5, 3, 3): 2.4}, 'def': {(1, 1, 1): 4.4}})
Python is a dynamic (latent-typed) language, so there is no such thing as a "map of vector of vector of vector" (or "dict of list of list of list" in Python-speak). Dicts are just dicts, and can contain values of any type. And an empty dict is simply: {}
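A small illustration of that point, mirroring the C++ line from the question with plain dicts and lists (the inner lists have to exist before anything can be appended, which is exactly why self.details["string"][index][index].add(value) raised KeyError above):
details = {}                                  # the "empty map"
details.setdefault("string", []).append([])   # create the outer list on first use
details["string"][0].append([])               # ... and the next level down
details["string"][0][0].append(123.5)         # now the append works
print(details)   # {'string': [[[123.5]]]}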
Create a dict that contains a nested list which in turn contains a nested list:
>>> dict1 = {'a': [[2, 4, 5], [3, 2, 1]]}
>>> dict1['a'][0][1]
4
Using collections.defaultdict, you can try the lambda trick shown below. Note that you'll encounter problems pickling these objects.
from collections import defaultdict
# Regular dict with default float value, 1D
dict1D = defaultdict(float)
val1 = dict1D["1"] # string key type; val1 == 0.0 by default
# 2D
dict2D = defaultdict(lambda: defaultdict(float))
val2 = dict2D["1"][2] # string and integer key types; val2 == 0.0 by default
# 3D
dict3D = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
val3 = dict3D[1][2][3] # val3 == 0.0 by default
# N-D, arbitrary nested defaultdicts
dict4D = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(str))))
val4 = dict4D["abc"][10][9][90] # val4 == '' by default
You can nest as many of these defaultdict collection types as you like. Also, note that they behave like regular Python dictionaries and accept the usual key types (immutable and hashable). Best of luck!
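As a small sketch of the pickling caveat mentioned above: defaultdicts built from lambdas can't be pickled, whereas ones built from module-level factory functions can (names here are made up for the example):
import pickle
from collections import defaultdict

def _inner():
    return defaultdict(float)

picklable_2d = defaultdict(_inner)             # default factory is a named, importable function
picklable_2d["1"][2] += 1.0
blob = pickle.dumps(picklable_2d)              # works

fragile_2d = defaultdict(lambda: defaultdict(float))
try:
    pickle.dumps(fragile_2d)                   # the lambda itself can't be pickled
except Exception as err:
    print(repr(err))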