Load HDF file into list of Python Dask DataFrames - python-2.7

I have an HDF5 file that I would like to load into a list of Dask DataFrames. I have set this up using a loop, following an abbreviated version of the Dask pipeline approach. Here is the code:
import pandas as pd
from dask import compute, delayed
import dask.dataframe as dd
import os, h5py

@delayed
def load(d, k):
    ddf = dd.read_hdf(os.path.join(d, 'Cleaned.h5'), key=k)
    return ddf

if __name__ == '__main__':
    d = 'C:\Users\User\FileD'
    loaded = [load(d, '/DF' + str(i)) for i in range(1, 10)]
    ddf_list = compute(*loaded)
    print(ddf_list[0].head(), ddf_list[0].compute().shape)
I get this error message:
C:\Python27\lib\site-packages\tables\group.py:1187: UserWarning: problems loading leaf ``/DF1/table``::
  HDF5 error back trace
    File "..\..\hdf5-1.8.18\src\H5Dio.c", line 173, in H5Dread
      can't read data
    File "..\..\hdf5-1.8.18\src\H5Dio.c", line 543, in H5D__read
      can't initialize I/O info
    File "..\..\hdf5-1.8.18\src\H5Dchunk.c", line 841, in H5D__chunk_io_init
      unable to create file chunk selections
    File "..\..\hdf5-1.8.18\src\H5Dchunk.c", line 1330, in H5D__create_chunk_file_map_hyper
      can't insert chunk into skip list
    File "..\..\hdf5-1.8.18\src\H5SL.c", line 1066, in H5SL_insert
      can't create new skip list node
    File "..\..\hdf5-1.8.18\src\H5SL.c", line 735, in H5SL_insert_common
      can't insert duplicate key
  End of HDF5 error back trace
  Problems reading the array data.
  The leaf will become an ``UnImplemented`` node.
  % (self._g_join(childname), exc))
The message mentions a duplicate key. I iterated over the first 9 keys to test the code and, in the loop, I use each iteration to assemble a different key that I pass to dd.read_hdf. Across all iterations, I'm keeping the filename the same - only the key changes.
I need to use dd.concat(list,axis=0,...) in order to vertically concatenate the contents of the file. My approach was to load them into a list first and then concatenate them.
I have installed PyTables and h5py, and I have Dask version 0.14.3+2.
With Pandas 0.20.1, I seem to get this to work:
for i in range(1, 10):
    hdf = pd.HDFStore(os.path.join(d, 'Cleaned.h5'), mode='r')
    df = hdf.get('/DF{}'.format(i))
    print df.shape
    hdf.close()
Is there a way I can load this HDF5 file into a list of Dask DataFrames? Or is there another approach to vertically concatenate them together?

Dask.dataframe is already lazy, so there is no need to use dask.delayed to make it lazier. You can just call dd.read_hdf repeatedly:
ddfs = [dd.read_hdf(os.path.join(d, 'Cleaned.h5'), key=k)
        for k in keys]
ddf = dd.concat(ddfs)
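For completeness, here is a minimal end-to-end sketch, assuming the keys follow the /DF1 through /DF9 pattern from the question:
import os
import dask.dataframe as dd

d = r'C:\Users\User\FileD'                        # directory from the question
keys = ['/DF{}'.format(i) for i in range(1, 10)]  # key pattern taken from the question

# dd.read_hdf is lazy, so no dask.delayed wrapper is needed
ddfs = [dd.read_hdf(os.path.join(d, 'Cleaned.h5'), key=k) for k in keys]

# vertically concatenate into a single Dask DataFrame
ddf = dd.concat(ddfs)
print(ddf.head())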

Related

Django encoding error when reading from a CSV

When I try to run:
import csv
with open('data.csv', 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        pgd = Player.objects.get_or_create(
            player_name=row['Player'],
            team=row['Team'],
            position=row['Position']
        )
Most of my data gets created in the database, except for one particular row. When my script reaches the row, I receive the error:
ProgrammingError: You must not use 8-bit bytestrings unless you use a
text_factory that can interpret 8-bit bytestrings (like text_factory = str).
It is highly recommended that you instead just switch your application to Unicode strings.
The particular row in the CSV that causes this error is:
>>> row
{'Player': 'FR\xed\x8aD\xed\x8aRIC.ST-DENIS', 'Team': 'BOS', 'Position': 'G'}
I've looked at other Stack Overflow threads with the same or similar issues, but most aren't specific to using SQLite with Django. Any advice?
If it matters, I'm running the script by going into the Django shell by calling python manage.py shell, and copy-pasting it in, as opposed to just calling the script from the command line.
This is the stacktrace I get:
Traceback (most recent call last):
  File "<console>", line 4, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 108, in next
    row = self.reader.next()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 302, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 1674: invalid continuation byte
EDIT: I decided to just manually import this entry into my database, rather than try to read it from my CSV, based on Alastair McCormack's feedback
Based on the output from your question, it looks like the person who made the CSV mojibaked it - it doesn't seem to represent FRÉDÉRIC.ST-DENIS. You can try using windows-1252 instead of utf-8 but I think you'll end up with FRíŠDíŠRIC.ST-DENIS in your database.
I suspect you're using Python 2 - open() returns str objects, which are simply byte strings.
The error is telling you that you need to decode your text to Unicode before use.
The simplest method is to decode each cell:
with open('data.csv', 'r') as csvfile:  # 'U' means Universal line mode and is not necessary
    reader = csv.DictReader(csvfile)
    for row in reader:
        pgd = Player.objects.get_or_create(
            player_name=row['Player'].decode('utf-8'),
            team=row['Team'].decode('utf-8'),
            position=row['Position'].decode('utf-8')
        )
That'll work, but it's ugly to add decodes everywhere, and it won't work in Python 3. Python 3 improves things by opening files in text mode and returning Python 3 strings, which are the equivalent of Unicode strings in Python 2.
To get the same functionality in Python 2, use the io module. This gives you an open() function with an encoding option. Annoyingly, the Python 2.x csv module is broken with Unicode, so you also need to install a backported version:
pip install backports.csv
To tidy your code and future-proof it, do:
import io
from backports import csv

with io.open('data.csv', 'r', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        # now every row is automatically decoded from UTF-8
        pgd = Player.objects.get_or_create(
            player_name=row['Player'],
            team=row['Team'],
            position=row['Position']
        )
Encode the player name in UTF-8 using .encode('utf-8'):
import csv
with open('data.csv', 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        pgd = Player.objects.get_or_create(
            player_name=row['Player'].encode('utf-8'),
            team=row['Team'],
            position=row['Position']
        )
In Django, decode with latin-1: csv.DictReader(io.StringIO(csv_file.read().decode('latin-1'))). It will devour all the special characters and comma exceptions you get with utf-8.

Errno 22 when using shutil.copyfile on dictionary values in python

I am getting an error message that I can't seem to resolve. I have a CSV file that I am reading in order to generate PDF files based on the county each map falls in. If there is only one map in a county, I do not need to append files (that code is TBD once this hurdle is resolved, since I'm sure I'll run into the same issue when using PyPDF2) and simply want to copy the map to a new directory under a new name. shutil.copyfile does not seem to recognize the path as valid for County3, which meets the condition to execute this command.
Map.csv file
County Maps
County1 C:\maps\map1.pdf
County1 C:\maps\map2.pdf
County2 C:\maps\map1.pdf
County2 C:\maps\map3.pdf
County3 C:\maps\map3.pdf
County4 C:\maps\map2.pdf
County4 C:\maps\map3.pdf
County4 C:\maps\map4.pdf
My code:
import csv, os
import shutil
from PyPDF2 import PdfFileMerger, PdfFileReader, PdfFileWriter

merged_file = PdfFileMerger()
counties = {}

with open(r'C:\maps\Maps.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=",")
    for n, row in enumerate(reader):
        if not n:
            continue  # skip the header row
        county, location = row
        if county not in counties:
            counties[county] = list()
        counties[county].append(location)

for k, v in counties.items():
    newPdfFile = 'C:\maps\Maps\JoinedMaps\County-' + k + '.pdf'
    if len(str(v).split(',')) > 1:
        print newPdfFile
    else:
        shutil.copyfile(str(v), newPdfFile)
        print 'v: ' + str(v)
Feedback message:
C:\maps\Maps\JoinedMaps\County-County4.pdf
C:\maps\Maps\JoinedMaps\County-County1.pdf
v: ['C:\\maps\\map3.pdf']
Traceback (most recent call last):
  File "<module2>", line 22, in <module>
  File "C:\Python27\ArcGIS10.5\lib\shutil.py", line 82, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 22] invalid mode ('rb') or filename: "['C:\\\\maps\\\\map3.pdf']"
There are no blank lines in the csv file. In the csv file I tried changing the back slashes to forward slashes, double slashes, etc. I still get the error message. Is it because data is returned in brackets? If so, how do I strip these?
You are actually trying to open the file ['C:\maps\map3.pdf'] - you can tell because the error message shows the filename it's trying to read:
IOError: [Errno 22] invalid mode ('rb') or filename: "['C:\\\\maps\\\\map3.pdf']"
This value comes from the fact that you are converting the dictionary value - which is a list here - to a string:
shutil.copyfile(str(v),newPdfFile)
What you need to do is check whether the list has more than one member; if it doesn't, step through each member of the list (the v) and copy the file.
for k, v in counties.items():
    newPdfFile = r'C:\maps\Maps\JoinedMaps\County-' + k + '.pdf'
    if len(v) > 1:
        print newPdfFile
    else:
        for filename in v:
            shutil.copyfile(filename, newPdfFile)
            print('v: {}'.format(filename))
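The multi-map branch is still TBD in the question; as a hedged sketch (not part of the answer above), the len(v) > 1 case could merge each county's PDFs with PyPDF2's PdfFileMerger, which the question already imports:
import shutil
from PyPDF2 import PdfFileMerger

for k, v in counties.items():
    newPdfFile = r'C:\maps\Maps\JoinedMaps\County-' + k + '.pdf'
    if len(v) > 1:
        merger = PdfFileMerger()
        for filename in v:
            merger.append(filename)  # append each of the county's maps in CSV order
        merger.write(newPdfFile)
        merger.close()
    else:
        shutil.copyfile(v[0], newPdfFile)  # single map: just copy it under the new name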

scheduler produces empty files

I'm using PythonAnywhere for a simple scheduled task.
I want to download data from a link once a day and save CSV files. Later, once I have a decent time series, I'll figure out how I actually want to manage the data. It's not much data, so I don't need anything fancy like a database.
My script takes the data from the Google Sheets link, adds a log column and a time column, then writes a CSV with the date in the filename.
It works exactly as I want when I run it manually on PythonAnywhere, but the scheduler just creates empty CSV files, albeit with the correct names.
Any ideas what's up? I don't understand the log file - surely the error should also happen when the script is run manually?
script:
import pandas as pd
import time
import datetime

def write_today(df):
    date = time.strftime("%Y-%m-%d")
    df.to_csv('Properties_' + date + '.csv')

url = 'https://docs.google.com/spreadsheets/d/19h2GmLN-2CLgk79gVxcazxtKqS6rwW36YA-qvuzEpG4/export?format=xlsx'

df = pd.read_excel(url, header=1).rename(columns={'Unnamed: 1': 'code'})
source = pd.read_excel(url).columns[0]
df['source'] = source
df['time'] = datetime.datetime.now()

write_today(df)
The scheduler is set up like so: [screenshot of the scheduled-task configuration omitted]
log file:
Traceback (most recent call last):
  File "/home/abmoore/load_data.py", line 24, in <module>
    write_today(df)
  File "/home/abmoore/load_data.py", line 16, in write_today
    df.to_csv('Properties_'+date+'.csv')
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 1344, in to_csv
    formatter.save()
  File "/usr/local/lib/python2.7/dist-packages/pandas/formats/format.py", line 1551, in save
    self._save()
  File "/usr/local/lib/python2.7/dist-packages/pandas/formats/format.py", line 1638, in _save
    self._save_header()
  File "/usr/local/lib/python2.7/dist-packages/pandas/formats/format.py", line 1634, in _save_header
    writer.writerow(encoded_labels)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
Your problem there is the UnicodeEncodeError - you have some non-ASCII data in your spreadsheet (u'\xa3' is the £ sign), and on Python 2 the pandas to_csv function defaults to ASCII encoding. Try specifying utf8 instead:
def write_today(df):
    filename = 'Properties_{date}.csv'.format(date=time.strftime("%Y-%m-%d"))
    df.to_csv(filename, encoding='utf8')
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html

Python 2.7 and Textblob - TypeError: The `text` argument passed to `__init__(text)` must be a string, not <type 'list'>

Update: Issue resolved (see the comment section below). Ultimately, the following two lines were required to transform my .csv to Unicode and utilize TextBlob: row = [cell.decode('utf-8') for cell in row], and text = ' '.join(row).
Original question:
I am trying to use a Python library called TextBlob to analyze text from a .csv file. The error I receive when I call TextBlob in my code is:
Traceback (most recent call last):
  File "C:\Users\Marcus\Documents\Blog\Python\Scripts\Brooks\textblob_sentiment.py", line 30, in <module>
    blob = TextBlob(row)
  File "C:\Python27\lib\site-packages\textblob\blob.py", line 344, in __init__
    'must be a string, not {0}'.format(type(text)))
TypeError: The `text` argument passed to `__init__(text)` must be a string, not <type 'list'>
My code is:
#from __future__ import division, unicode_literals  # (This was recommended for Python 2.x, but didn't help in my case.)
# -*- coding: utf-8 -*-
import csv
from textblob import TextBlob

with open(u'items.csv', 'rb') as scrape_file:
    reader = csv.reader(scrape_file, delimiter=',', quotechar='"')
    for row in reader:
        row = [unicode(cell, 'utf-8') for cell in row]
        print row
        blob = TextBlob(row)
        print type(blob)
I have been working through UTF-8/Unicode issues. I'd originally had a different subject, which I posed to this thread. (Since my code and the error have changed, I'm posting a new thread.) Print statements indicate that the variable row is of type str, which I thought meant the reader object had been transformed as required by TextBlob. The source .csv file is saved as UTF-8. Can anyone provide feedback on how I can get unblocked, and on the flaws in my code?
Thanks so much for the help.
So maybe you can make the change below:
row = str([cell.encode('utf-8') for cell in row])
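For reference, here is a minimal sketch of the fix described in the question's update - decode each cell, then join the cells into a single string before handing it to TextBlob:
import csv
from textblob import TextBlob

with open('items.csv', 'rb') as scrape_file:
    reader = csv.reader(scrape_file, delimiter=',', quotechar='"')
    for row in reader:
        row = [cell.decode('utf-8') for cell in row]  # decode each cell to unicode
        text = ' '.join(row)                          # TextBlob wants a string, not a list
        blob = TextBlob(text)
        print type(blob)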

Pandas dataframe to existing excel workbook

I'm trying to use openpyxl to write data to an existing xlsx workbook and save it as a separate file. I would like to write a dataframe to a sheet called 'Data', write some values to another sheet called 'Summary', and then save the workbook object as 'test.xlsx'. The following code writes the calculations I need to the summary sheet, but I get a TypeError for an unexpected keyword argument 'font' when I try to write the dataframe, which doesn't make much sense to me...
I'm using the following code, which I've adapted from here:
from openpyxl import load_workbook
import pandas as pd

book = load_workbook('template.xlsx')

# Write values to summary sheet
ws = book.get_sheet_by_name('Summary')
ws['A1'] = 'TEST'
book.save('test.xlsx')

# Write df
writer = pd.ExcelWriter('test.xlsx', engine='openpyxl')
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
df.to_excel(writer, sheet_name="Data", index=False)
writer.save()
The traceback:
File "C:\Users\me\AppData\Local\Continuum\Anaconda\lib\site- packages\pandas\core\frame.py", line 1274, in to_excel
startrow=startrow, startcol=startcol)
File "C:\Users\me\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\excel.py", line 778, in write_cells
xcell.style = xcell.style.copy(**style_kwargs)
File "C:\Users\me\AppData\Local\Continuum\Anaconda\lib\site-packages\openpyxl-2.2.2-py2.7.egg\openpyxl\compat\__init__.py", line 67, in new_func
return obj(*args, **kwargs)
TypeError: copy() got an unexpected keyword argument 'font'
I think the error arises from breaking changes in the openpyxl styles API, which is why pandas installs an older version of openpyxl. You should be okay, therefore, if you remove openpyxl 2.2.
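A minimal sketch of that downgrade (the exact version bound is an assumption; the point is to fall back to a pre-2.2 release that this pandas version can drive):
pip uninstall openpyxl
pip install "openpyxl<2.2"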