pandas.DataFrame.to_pickle backward compatibility - python-2.7

pandas.DataFrame.to_pickle's compression parameter was introduced in pandas 0.20. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_pickle.html
Before pandas 0.20, there was no compression param that I needed to specify.
I have a webapp written using pandas 0.18 and to read the pickle file using pandas.read_pickle in version 0.18 without error, how should I pickle the file?
So far I have tried setting the compression parameter to None and 'gzip'. Both don't work.

It looks like you don't actually need to specify. The default compression='infer' should work.
But, why not just import and use pickle?
This is what I have been using
# import and save object as pickle
import pickle
pickle.dump(object, open('filename.pkl', 'wb'))
# and this is how to load them
loaded_object = pickle.load(open('filename.pkl', 'rb'))

Related

Why can't I import WD_ALIGN_PARAGRAPH from docx.enum.text?

I transferred some code from IDLE 3.5 (64 bits) to pycharm (Python 2.7). Most of the code is still working, for example I can import WD_LINE_SPACING from docx.enum.text, but for some reason I can't import WD_ALIGN_PARAGRAPH.
At first, nearly non of the imports worked, but after I did
pip install python-docx
instead of
pip install docx
most of the imports worked except for WD_ALIGN_PARAGRAPH.
# works
from __future__ import print_function
import xlrd
import xlwt
import os
import subprocess
from calendar import monthrange
import datetime
from docx import Document
from datetime import datetime
from datetime import date
from docx.enum.text import WD_LINE_SPACING
from docx.shared import Pt
# does not work
from docx.enum.text import WD_ALIGN_PARAGRAPH
I don't get any error messages but Pycharm marks the line as error:
"Cannot find reference 'WD_ALIGN_PARAGRAPH' in 'text.py'".
You can use this instead:
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT
and then substitute WD_PARAGRAPH_ALIGNMENT wherever WD_ALIGN_PARAGRAPH would have appeared before.
The reason this is happening is that the actual enum object is named WD_PARAGRAPH_ALIGNMENT, and a decorator is applied that also allows it to be referenced as WD_ALIGN_PARAGRAPH (which is a little shorter, and possibly clearer). I expect the syntax checker in PyCharm is operating on direct module attributes and doesn't pick up the alias, which is resolved by the Python parser/compiler.
Interestingly, I expect your code would work fine either way. But to get rid of the annoying message you can use the base name.
If someone uses pylint it can be easily suppressed with # pylint: disable=E0611 added at the end of the import line.

encoding in python and writing it to a YAML file in Python

I have a Unicode, which is read from a CSV file:
df.iloc[0,1]
Out[41]: u'EU-repr\xe6sentant udpeget'
In [42]: type(df_translated.iloc[0,1])
Out[42]: unicode
I would like to have it as EU-repræsentant udpeget. The final goal is to write this into a dictionary and then finally save that dict to a YAML file with PyYAML using safe_dump. However, I struggle with the encoding.
If you really need to use PyYAML you should provide the arguments
encoding='utf-8' and allow_unicode=True to the safe_dump()
routine.
If you ever intend to upgrade to YAML 1.2 and use ruamel.yaml
(disclaimer: I am the author of that package), those are the (much
more sensible) defaults:
import sys
import ruamel.yaml
yaml = ruamel.yaml.YAML()
data = [u'EU-repr\xe6sentant udpeget']
yaml.dump(data, sys.stdout)
which gives:
- EU-repræsentant udpeget

memory leak unpickle dict of pandas DataFrames

When I pickle a dictionary of dataframes and then unpickle them again, I experience a kind of memory leak. After the unpickled variable is dereferenced, the memory is only released partially. Calling gc.collect() does not help. I have created the following minimal exmaple:
import pickle
import numpy as np
import pandas as pd
new = np.zeros((1000, 100))
new = pd.DataFrame(new)
cc = {ix: new.copy() for ix in range(500)}
pickle.dump(cc, open('/tmp/test21', 'wb'))
Now I open a clean python session and do
import pickle
# memory consumption is around 40MB
data = pickle.load(open('/tmp/test21'))
# memory consumption goes to 991MB
data = None
# memory consumption goes to 776MB
This is pandas 0.19.2 and python 2.7.13. The problem seems to be the interaction between pickle, dictionary and pandas. If I remove the line new = pd.DataFrame(new), the problem does not occur. If I simply make a large df without a dictionary, the problem does not occur. If I don't pickle the dictionary and set cc = None, the problem does not occur. I have also tested the problem with pandas 0.14.1 and python 2.7.13. Finally the problem appears with both pickle and cPickle.
What could be the reason or a strategy to analyze this further? Any help is much appreciated!

Displaying PNG in matplotlib.pyplot framework in python 2.7

I am pulling PNG images from Jupyter Notebooks and manage to display with IPython.display.Image but not with matplotib.pyplot.plt. What am I missing? I use python 2.7.
I am using the following algorithm:
To open the notebook JSON content I do:
import nbformat
notebook_ = nbformat.read(file_notebook, 4)
After retrieving the relevant cell information I pull the png information from it using:
def cell_to_image(cell, out_value_item_number=1):
if "execution_count" in cell.keys(): # i.e version >=4
return cell["outputs"][out_value_item_number]['data']['image/png']
elif "prompt_number" in cell.keys(): # i.e version < 4
return cell["outputs"][out_value_item_number]['png']
return None
cell_image = cell_to_image(cell)
The first few characters of cell_image (which is unicode) looks like:
iVBORw0KGgoAAAANSUhEUgAAA64AAAFMCAYAAADLFeHSAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\n
AAALEgAACxIB0t1+/AAAIABJREFUeJzs3Xd8jef/x/HXyTjZiYQkCGrU3ruR0tr9oq2qGtGo0dbe
\nm5pVlJpFUSMoVb6UoEZ/lCpatWuPUiNEEiMDmef3R75OexonJKUO3s/HI4/mXPd1X/d1f+LRR965
\n7/u6DSaTyYSIiIiIiIiIjbJ70hMQERERERERyYiCq4iIiIiIiNg0BVcRERERERGxaQquIiIiIiIi
\nYtMUXEVERERERMSmKbiKiIiIiIiITVNwFRGRxyIkJIRixYqxfv36+24/e/YsxYoVo3jx4v/yzGxb
\naGgoderUIS4uDoBdu3bRsmVLKlasyCuvvMKgQYOIjo622CcsLIyGDRtSunRp6tSpw8KFC62OW7p0
\naRo2bJju53Lnzh1GjRrFyy+/TNmyZWnRogW//fbbQ835q6++olGjRpQvX5769eszc+ZMkpOTzdtT
\nU1OZNGkSNWrUoHTp0jRp0oTdu3enGyc2NpZOn
I can easily plot in my Jupityer notebook using
from IPython.display import Image
Image(cell_image)
And now to my question:
How can I manipulate cell_image to be plt.subplot friendly?
(Assuming import matplotlib.pyplot as plt).
I realise that plt.imshow wouldn't work because this would require an array, which is not my case (which is a string, as far as I understand).
If you have your image string representation in a variable string_rep, the following code should work.
from io import BytesIO
import matplotlib.image as mpimage
import matplotlib.pyplot as plt
with BytesIO(string_rep.decode('base64')) as byte_rep:
image = mpimage.imread(byte_rep)
plt.imshow(image)

Reading csv zipped files in python

I'm trying to get data from a zipped csv file. Is there a way to do this without unzipping the whole files? If not, how can I unzip the files and read them efficiently?
I used the zipfile module to import the ZIP directly to pandas dataframe.
Let's say the file name is "intfile" and it's in .zip named "THEZIPFILE":
import pandas as pd
import zipfile
zf = zipfile.ZipFile('C:/Users/Desktop/THEZIPFILE.zip')
df = pd.read_csv(zf.open('intfile.csv'))
If you aren't using Pandas it can be done entirely with the standard lib. Here is Python 3.7 code:
import csv
from io import TextIOWrapper
from zipfile import ZipFile
with ZipFile('yourfile.zip') as zf:
with zf.open('your_csv_inside_zip.csv', 'r') as infile:
reader = csv.reader(TextIOWrapper(infile, 'utf-8'))
for row in reader:
# process the CSV here
print(row)
A quick solution can be using below code!
import pandas as pd
#pandas support zip file reads
df = pd.read_csv("/path/to/file.csv.zip")
zipfile also supports the with statement.
So adding onto yaron's answer of using pandas:
with zipfile.ZipFile('file.zip') as zip:
with zip.open('file.csv') as myZip:
df = pd.read_csv(myZip)
Thought Yaron had the best answer but thought I would add a code that iterated through multiple files inside a zip folder. It will then append the results:
import os
import pandas as pd
import zipfile
curDir = os.getcwd()
zf = zipfile.ZipFile(curDir + '/targetfolder.zip')
text_files = zf.infolist()
list_ = []
print ("Uncompressing and reading data... ")
for text_file in text_files:
print(text_file.filename)
df = pd.read_csv(zf.open(text_file.filename)
# do df manipulations
list_.append(df)
df = pd.concat(list_)
Yes. You want the module 'zipfile'
You open the zip file itself with zipfile.ZipInfo([filename[, date_time]])
You can then use ZipFile.infolist() to enumerate each file within the zip, and extract it with ZipFile.open(name[, mode[, pwd]])
this is the simplest thing I always use.
import pandas as pd
df = pd.read_csv("Train.zip",compression='zip')
Supposing you are downloading a zip file that contains a CSV and you don't want to use temporary storage. Here is what a sample implementation looks like:
#!/usr/bin/env python3
from csv import DictReader
from io import TextIOWrapper, BytesIO
from zipfile import ZipFile
import requests
def all_tickers():
url = "https://simfin.com/api/bulk/bulk.php?dataset=industries&variant=null"
r = requests.get(url)
zip_ref = ZipFile(BytesIO(r.content))
for name in zip_ref.namelist():
print(name)
with zip_ref.open(name) as file_contents:
reader = DictReader(TextIOWrapper(file_contents, 'utf-8'), delimiter=';')
for item in reader:
print(item)
This takes care of all python3 bytes/str issues.
Modern Pandas since version 0.18.1 natively supports compressed csv files: its read_csv method has compression parameter : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
If you have a file name: my_big_file.csv and you zip it with the same name my_big_file.zip
you may simply do this:
df = pd.read_csv("my_big_file.zip")
Note: check your pandas version first (not applicable for older versions)