Pandas dataframe to existing excel workbook - python-2.7

I'm trying to use openpyxl to write data to an existing xlsx workbook and save it as a separate file. I would like to write a dataframe to a sheet called 'Data', write some values to another sheet called 'Summary', and then save the workbook object as 'test.xlsx'. I have the following code that writes the calculations I need to the summary sheet, but I get a TypeError for an unexpected keyword argument 'font' when I try to write the dataframe, which doesn't make much sense to me...
I'm using the following code, which I've adapted from here:
from openpyxl import load_workbook
import pandas as pd
book = load_workbook('template.xlsx')
# Write values to summary sheet
ws = book.get_sheet_by_name('Summary')
ws['A1'] = 'TEST'
book.save('test.xlsx')
# Write df
writer = pd.ExcelWriter('test.xlsx', engine='openpyxl')
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
df.to_excel(writer, sheet_name="Data", index=False)
writer.save()
The traceback:
File "C:\Users\me\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.py", line 1274, in to_excel
    startrow=startrow, startcol=startcol)
File "C:\Users\me\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\excel.py", line 778, in write_cells
    xcell.style = xcell.style.copy(**style_kwargs)
File "C:\Users\me\AppData\Local\Continuum\Anaconda\lib\site-packages\openpyxl-2.2.2-py2.7.egg\openpyxl\compat\__init__.py", line 67, in new_func
    return obj(*args, **kwargs)
TypeError: copy() got an unexpected keyword argument 'font'

I think the error arises from backwards-incompatible changes to the styles API in openpyxl 2.x, which this version of pandas does not yet support; that is why pandas installs an older version of openpyxl. You should be okay, therefore, if you remove openpyxl 2.2 and let pandas fall back to the older release it expects.
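As a minimal sketch, assuming openpyxl 2.2 has been removed and a pandas-compatible 1.x release is installed, the whole workflow can also go through a single ExcelWriter so both sheets are written and saved once at the end; the sample dataframe is a placeholder:
import openpyxl
import pandas as pd
from openpyxl import load_workbook

print(openpyxl.__version__)  # confirm the downgrade took effect (expect 1.x)

df = pd.DataFrame({'A': [1, 2, 3]})  # placeholder data for illustration

book = load_workbook('template.xlsx')
writer = pd.ExcelWriter('test.xlsx', engine='openpyxl')
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)

# Write the summary value and the dataframe through the same writer,
# then save once so the two writes don't clobber each other.
writer.sheets['Summary']['A1'] = 'TEST'
df.to_excel(writer, sheet_name='Data', index=False)
writer.save()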

Related

Django download excel file results in corrupted Excel file

I am trying to export a Pandas dataframe from a Django app as an Excel file. I currently export it to a CSV file, and this works fine except for the problem noted below: when the user opens the CSV file in the Excel app, a string that merely looks like a number, for example a cell with the value '111,1112' or '123E345' that is intended to be a string, ends up displayed as an error or in exponential notation in the Excel view, even if I make sure the Pandas column is not numeric.
This is how I export to CSV:
response = HttpResponse(content_type='text/csv')
filename = 'aFileName'
response['Content-Disposition'] = 'attachment; filename="' + filename + '"'
df.to_csv(response, encoding='utf-8', index=False)
return response
To export with an Excel content type, I saw several references recommending the following approach:
with BytesIO() as b:
    # Use the BytesIO object as the filehandle.
    writer = pd.ExcelWriter(b, engine='xlsxwriter')
    df.to_excel(writer, sheet_name='Sheet1')
    writer.save()
    return HttpResponse(b.getvalue(), content_type='application/vnd.ms-excel')
When I try this, it appears to export something, but if I try to open the file in Excel Office 2016, I get an Excel message that the file is corrupted. The file is generally a few KB at most.
Please advise what could be wrong with the second approach that is causing the bad export. I am using Django 2.2.1 and Pandas 0.25.1.
Thank you
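For reference, a variant of this approach that is often suggested (a sketch under assumptions, not a confirmed fix for this report) saves the writer before reading the buffer and sends the .xlsx MIME type together with a Content-Disposition header; the view name and placeholder dataframe are illustrative:
from io import BytesIO
from django.http import HttpResponse
import pandas as pd

def export_excel(request):
    # Placeholder data; in the real app this is the exported dataframe.
    df = pd.DataFrame({'code': ['111,1112', '123E345']})
    b = BytesIO()
    writer = pd.ExcelWriter(b, engine='xlsxwriter')
    df.to_excel(writer, sheet_name='Sheet1', index=False)
    writer.save()  # finalize the workbook before reading the buffer
    response = HttpResponse(
        b.getvalue(),
        content_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    )
    response['Content-Disposition'] = 'attachment; filename="aFileName.xlsx"'
    return response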

Flask response to excel file giving corrupt excel file

I have a website that has a button. When it's clicked, it writes a number of pandas dataframes into an Excel file and returns that Excel file automatically as a download.
It seems to work OK, except that when I open the file, it appears to be corrupted: Excel asks whether some of the tabs should be recovered. I'm using the code below. Any suggestions about what could be causing this are appreciated.
import io
from flask.helpers import make_response
from pandas.io.excel import ExcelWriter

output = io.BytesIO()
writer = ExcelWriter(output)
dfs = [df1, df2, ....]
tab_names = ['tab1', 'tab2', ....]
for df, tab_name in zip(dfs, tab_names):
    df.to_excel(writer, tab_name)
writer.close()
resp = make_response(output.getvalue())
resp.headers['Content-Disposition'] = 'attachment; filename=output.xlsx'
resp.headers["Content-type"] = "text/csv"
return resp
You'll need to add
output.seek(0)
after you close the writer.
You might also find it easier to write
return send_file(output, attachment_filename="output.xlsx", as_attachment=True)
(after importing send_file from flask)
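Putting both suggestions together, a minimal sketch of the corrected route (the route path and placeholder dataframes are illustrative, and attachment_filename is the keyword taken from the answer's Flask version):
import io
from flask import Flask, send_file
import pandas as pd

app = Flask(__name__)

@app.route('/download')
def download():
    df1 = pd.DataFrame({'a': [1, 2]})  # placeholder data
    df2 = pd.DataFrame({'b': [3, 4]})
    output = io.BytesIO()
    writer = pd.ExcelWriter(output)
    for df, tab_name in zip([df1, df2], ['tab1', 'tab2']):
        df.to_excel(writer, tab_name)
    writer.close()
    output.seek(0)  # rewind so the response is read from the start
    return send_file(output, attachment_filename='output.xlsx', as_attachment=True)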

Read text file via UDF in pyspark returns unexpected output

I have a pyspark dataframe, df, containing paths to text files. I want to create a new column with the content of those text files.
import pyspark.sql.functions as F
from pyspark.sql.types import *

def read_file(filepath):
    import s3fs
    s3 = s3fs.S3FileSystem()
    with s3.open(filepath) as f:
        return f.read()

read_file_udf = F.udf(read_file, StringType())
df.withColumn('raw_text', read_file_udf('filepath')).show()
+---------------------+-----------+
| file | raw_text|
+---------------------+-----------+
|s3://bucket/file1.txt| [B#aa2a4f3|
|s3://bucket/file2.txt|[B#138664c5|
|s3://bucket/file3.txt| [B#3bcc67e|
|s3://bucket/file4.txt|[B#70b735c4|
|s3://bucket/file5.txt|[B#6fad821d|
+---------------------+-----------+
Instead of getting the actual file content, I am getting these strange [B# codes. What are they, why am I getting them, and how do I fix this?
To answer my own question... I was getting [B# because the read_file() function was returning raw bytes rather than a string ([B@... is how the JVM renders a Java byte array). Defining:
def read_file(filepath):
    import s3fs
    s3 = s3fs.S3FileSystem()
    with s3.open(filepath) as f:
        return f.read().decode("utf-8")
will fix the issue.
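After redefining the function, the UDF has to be re-registered so the column picks up the decoding version; this step is implied rather than shown above:
# Re-register the corrected function; the UDF now returns a real
# string, matching the declared StringType.
read_file_udf = F.udf(read_file, StringType())
df.withColumn('raw_text', read_file_udf('filepath')).show()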

Load HDF file into list of Python Dask DataFrames

I have an HDF5 file that I would like to load into a list of Dask DataFrames. I have set this up using a loop following an abbreviated version of the Dask pipeline approach. Here is the code:
import pandas as pd
from dask import compute, delayed
import dask.dataframe as dd
import os, h5py

@delayed
def load(d, k):
    ddf = dd.read_hdf(os.path.join(d, 'Cleaned.h5'), key=k)
    return ddf

if __name__ == '__main__':
    d = 'C:\Users\User\FileD'
    loaded = [load(d, '/DF' + str(i)) for i in range(1, 10)]
    ddf_list = compute(*loaded)
    print(ddf_list[0].head(), ddf_list[0].compute().shape)
I get this error message:
C:\Python27\lib\site-packages\tables\group.py:1187: UserWarning: problems loading leaf ``/DF1/table``::
HDF5 error back trace
  File "..\..\hdf5-1.8.18\src\H5Dio.c", line 173, in H5Dread
    can't read data
  File "..\..\hdf5-1.8.18\src\H5Dio.c", line 543, in H5D__read
    can't initialize I/O info
  File "..\..\hdf5-1.8.18\src\H5Dchunk.c", line 841, in H5D__chunk_io_init
    unable to create file chunk selections
  File "..\..\hdf5-1.8.18\src\H5Dchunk.c", line 1330, in H5D__create_chunk_file_map_hyper
    can't insert chunk into skip list
  File "..\..\hdf5-1.8.18\src\H5SL.c", line 1066, in H5SL_insert
    can't create new skip list node
  File "..\..\hdf5-1.8.18\src\H5SL.c", line 735, in H5SL_insert_common
    can't insert duplicate key
End of HDF5 error back trace
Problems reading the array data.
The leaf will become an ``UnImplemented`` node.
  % (self._g_join(childname), exc))
The message mentions a duplicate key. I iterated over the first 9 keys to test out the code and, in the loop, I use each iteration to assemble a different key to pass to dd.read_hdf. Across all iterations, I keep the filename the same - only the key changes.
I need to use dd.concat(list, axis=0, ...) to vertically concatenate the contents of the file. My approach was to load them into a list first and then concatenate them.
I have installed PyTables and h5py, and have Dask version 0.14.3+2.
With Pandas 0.20.1, I seem to get this to work:
for i in range(1, 10):
    hdf = pd.HDFStore(os.path.join(d, 'Cleaned.h5'), mode='r')
    df = hdf.get('/DF{}'.format(i))
    print df.shape
    hdf.close()
Is there a way I can load this HDF5 file into a list of Dask DataFrames? Or is there another approach to vertically concatenate them together?
Dask.dataframe is already lazy, so there is no need to use dask.delayed to make it lazier. You can just call dd.read_hdf repeatedly:
ddfs = [dd.read_hdf(os.path.join(d, 'Cleaned.h5'), key=k)
        for k in keys]
ddf = dd.concat(ddfs)
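A slightly fuller sketch of the answer's approach, assuming the nine keys and the path from the question:
import os
import dask.dataframe as dd

d = r'C:\Users\User\FileD'                     # path from the question
keys = ['/DF' + str(i) for i in range(1, 10)]  # the nine keys being read

# Each dd.read_hdf call is already lazy, so no delayed wrapper is needed.
ddfs = [dd.read_hdf(os.path.join(d, 'Cleaned.h5'), key=k) for k in keys]
ddf = dd.concat(ddfs)  # vertical (row-wise) concatenation

print(ddf.head())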

Python 2.7 and Textblob - TypeError: The `text` argument passed to `__init__(text)` must be a string, not <type 'list'>

Update: Issue resolved (see the comment section below). Ultimately, the following two lines were required to transform my .csv to unicode and use TextBlob: row = [cell.decode('utf-8') for cell in row] and text = ' '.join(row).
Original question:
I am trying to use a Python library called TextBlob to analyze text from a .csv file. The error I receive when I call TextBlob in my code is:
Traceback (most recent call last):
  File "C:\Users\Marcus\Documents\Blog\Python\Scripts\Brooks\textblob_sentiment.py", line 30, in <module>
    blob = TextBlob(row)
  File "C:\Python27\lib\site-packages\textblob\blob.py", line 344, in __init__
    'must be a string, not {0}'.format(type(text)))
TypeError: The `text` argument passed to `__init__(text)` must be a string, not <type 'list'>
My code is:
#from __future__ import division, unicode_literals  # (This was recommended for Python 2.x, but didn't help in my case.)
# -*- coding: utf-8 -*-
import csv
from textblob import TextBlob

with open(u'items.csv', 'rb') as scrape_file:
    reader = csv.reader(scrape_file, delimiter=',', quotechar='"')
    for row in reader:
        row = [unicode(cell, 'utf-8') for cell in row]
        print row
        blob = TextBlob(row)
        print type(blob)
I have been working through UTF/unicode issues. I'd originally had a different subject, which I posed to this thread. (Since my code and the error have changed, I'm posting to a new thread.) Print statements indicate that the variable "row" is of type str, which I thought meant the reader object had been transformed as TextBlob requires. The source .csv file is saved as UTF-8. Can anyone provide feedback on how I can get unblocked, and on the flaws in my code?
Thanks so much for the help.
So maybe you can make the change below:
row = str([cell.encode('utf-8') for cell in row])
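Per the update at the top of the question, though, the approach that ultimately worked was to decode each cell and join the row into a single string before constructing the TextBlob; a Python 2 sketch (the sentiment call is just an illustrative use of the resulting blob):
import csv
from textblob import TextBlob

with open('items.csv', 'rb') as scrape_file:
    reader = csv.reader(scrape_file, delimiter=',', quotechar='"')
    for row in reader:
        row = [cell.decode('utf-8') for cell in row]
        text = ' '.join(row)  # TextBlob needs a single string, not a list
        blob = TextBlob(text)
        print blob.sentiment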