error: unpack requires a string argument of length 8 - python-2.7

I was running my script and I stumbled upon this error:
WARNING *** file size (24627) not 512 + multiple of sector size (512)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Traceback (most recent call last):
File "C:\Email Attachments\whatever.py", line 20, in <module>
main()
File "C:\Email Attachments\whatever.py", line 17, in main
csv_from_excel()
File "C:\Email Attachments\whatever.py", line 7, in csv_from_excel
sh = wb.sheet_by_name('B2B_REP_YLD_100_D_SQ.rpt')
File "C:\Python27\lib\site-packages\xlrd\book.py", line 442, in sheet_by_name
return self.sheet_by_index(sheetx)
File "C:\Python27\lib\site-packages\xlrd\book.py", line 432, in sheet_by_index
return self._sheet_list[sheetx] or self.get_sheet(sheetx)
File "C:\Python27\lib\site-packages\xlrd\book.py", line 696, in get_sheet
sh.read(self)
File "C:\Python27\lib\site-packages\xlrd\sheet.py", line 1055, in read
dim_tuple = local_unpack('<ixxH', data[4:12])
error: unpack requires a string argument of length 8
I was trying to process this Excel file:
https://drive.google.com/file/d/0B12NevhOGQGRMkRVdExuYjFveDQ/edit?usp=sharing
One solution I found is to manually open the spreadsheet, save it, and close it before I run my script that converts .xls to .csv. I find this solution a bit cumbersome and clunky.
This kind of spreadsheet is saved daily to my drive via an Outlook macro. Unprocessed data keeps piling up, which is why I turned to scripting to ease the job.

Who made the Outlook macro that's dumping this file? xlrd uses byte-level unpacking to read in the Excel file, and it is failing to read a field in this file. There are ways to trace where it's failing, but none to automatically recover from this type of error.
The erroneous data seems to be at data[4:12] of a specific record (we'll see which one below), which should be a bytestring that's parsed as:
a 4-byte integer (i)
two pad bytes (xx)
an unsigned 2-byte short (H)
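For context, this is standard struct behaviour: the format '<ixxH' describes exactly 8 bytes (4 + 2 pad + 2), so unpack raises exactly this error whenever it is handed a shorter slice. A quick standalone demonstration (stdlib only, not xlrd code):
import struct

print(struct.calcsize('<ixxH'))  # 8
print(struct.unpack('<ixxH', '\x01\x00\x00\x00\x00\x00\x02\x00'))  # (1, 2)
struct.unpack('<ixxH', '\x01\x00\x00\x00')  # struct.error: unpack requires a string argument of length 8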
You can set xlrd to DEBUG mode, which will show you which bytes it's parsing and exactly where in the file the error occurs:
import xlrd
xlrd.DEBUG = 2
workbook = xlrd.open_workbook(u'/home/sparker/Downloads/20131117_040934_B2B_REP_YLD_100_D_LT.xls')
Here are the results, slightly trimmed down for the sake of SO:
parse_globals: record code is 0x0293
parse_globals: record code is 0x0293
parse_globals: record code is 0x0085
CODEPAGE: codepage 1200 -> encoding 'utf_16_le'
BOUNDSHEET: bv=80 data '\xfd\x04\x00\x00\x00\x00\x18\x00B2B_REP_YLD_100_D_SQ.rpt'
BOUNDSHEET: inx=0 vis=0 sheet_name=u'B2B_REP_YLD_100_D_SQ.rpt' abs_posn=1277 sheet_type=0x00
parse_globals: record code is 0x000a
GET_SHEETS: [u'B2B_REP_YLD_100_D_SQ.rpt'] [1277]
GET_SHEETS: sheetno = 0 [u'B2B_REP_YLD_100_D_SQ.rpt'] [1277]
reqd: 0x0010
getbof(): data='\x00\x06\x10\x00\xbb\r\xcc\x07\x00\x00\x00\x00\x06\x00\x00\x00'
getbof(): op=0x0809 version2=0x0600 streamtype=0x0010
getbof(): BOF found at offset 1277; savpos=1277
BOF: op=0x0809 vers=0x0600 stream=0x0010 buildid=3515 buildyr=1996 -> BIFF80
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "build/bdist.macosx-10.4-x86_64/egg/xlrd/__init__.py", line 457, in open_workbook
File "build/bdist.macosx-10.4-x86_64/egg/xlrd/__init__.py", line 1007, in get_sheets
File "build/bdist.macosx-10.4-x86_64/egg/xlrd/__init__.py", line 998, in get_sheet
File "build/bdist.macosx-10.4-x86_64/egg/xlrd/sheet.py", line 864, in read
struct.error: unpack requires a string argument of length 8
Specifically, you can see that it parses the name of the worksheet, u'B2B_REP_YLD_100_D_SQ.rpt'.
Let's check the source code. The traceback throws the error here, where we can see from the parent loop that we're trying to parse the XL_DIMENSION and XL_DIMENSION2 values. These directly correspond to the shape of the Excel sheet.
And that's where the problem in your workbook lies: it's not being written correctly. So, back to my original question: who made the Outlook macro? It needs to be fixed. But that's for another SO question, some other time.
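If fixing the macro isn't an option right away, the manual open/save/close workaround can at least be automated on Windows through Excel COM automation. This is my suggestion rather than anything built into xlrd, and it assumes the pywin32 package and a local Excel installation:
import win32com.client

def resave_with_excel(path):
    # Let a real Excel instance open and rewrite the file, so that
    # xlrd is handed a well-formed workbook afterwards.
    excel = win32com.client.Dispatch('Excel.Application')
    excel.DisplayAlerts = False  # suppress "overwrite?" prompts
    try:
        wb = excel.Workbooks.Open(path)
        wb.Save()
        wb.Close()
    finally:
        excel.Quit()
Call it with an absolute path before open_workbook() and the cumbersome manual step disappears.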


tensorflow object detection API: generate TF record of custom data set

I am trying to retrain the TensorFlow object detection API with my own data.
I have labelled my images with labelImg, but when I use the script create_pascal_tf_record.py, which is included in tensorflow/models/research, I get some errors and I don't really know why:
python object_detection/dataset_tools/create_pascal_tf_record.py --data_dir=/home/jim/Documents/tfAPI/workspace/training_cabbage/images/train/ --label_map_path=/home/jim/Documents/tfAPI/workspace/training_cabbage/annotations/label_map.pbtxt --output_path=/home/jim/Desktop/cabbage_pascal.record --set=train --annotations_dir=/home/jim/Documents/tfAPI/workspace/training_cabbage/images/train/ --year=merged
Traceback (most recent call last):
File "object_detection/dataset_tools/create_pascal_tf_record.py", line 185, in <module>
tf.app.run()
File "/home/jim/.virtualenvs/enrouteDeepDroneTF/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "object_detection/dataset_tools/create_pascal_tf_record.py", line 167, in main
examples_list = dataset_util.read_examples_list(examples_path)
File "/home/jim/Documents/tfAPI/models/research/object_detection/utils/dataset_util.py", line 59, in read_examples_list
lines = fid.readlines()
File "/home/jim/.virtualenvs/enrouteDeepDroneTF/local/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 188, in readlines
self._preread_check()
File "/home/jim/.virtualenvs/enrouteDeepDroneTF/local/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 85, in _preread_check
compat.as_bytes(self.__name), 1024 * 512, status)
File "/home/jim/.virtualenvs/enrouteDeepDroneTF/local/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: /home/jim/Documents/tfAPI/workspace/training_cabbage/images/train/VOC2007/ImageSets/Main/aeroplane_train.txt; No such file or directory
The train folder contains the XML and JPG files.
The annotations folder contains my label_map.pbtxt for my custom class.
And I want to write the TFRecord file to the desktop.
It seems that it can't find a file in my images and annotations folders, but I don't know why.
If someone has an idea, thank you in advance.
This error happens because you are using the code for PASCAL VOC, which requires a certain data folder structure. Basically, you need to download and unpack VOCdevkit to make the script work. As user phd pointed out, the script is looking for the file VOC2007/ImageSets/Main/aeroplane_train.txt.
I recommend writing your own script for TFRecord creation instead; it's not difficult. You need just two key components:
A loop over your data that reads the images and annotations.
A function that encodes the data into a tf.train.Example. For that you can pretty much re-use dict_to_tf_example.
Inside the loop, having created the tf_example, pass it to TFRecordWriter:
writer.write(tf_example.SerializeToString())
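A minimal sketch of such a script, assuming TF 1.x. The annotations iterable and the reduced set of feature keys here are hypothetical placeholders; for the object detection API you would keep the full set of keys that dict_to_tf_example produces:
import tensorflow as tf

def create_tf_example(image_path, width, height, label):
    # Hypothetical, reduced feature set for illustration only.
    with tf.gfile.GFile(image_path, 'rb') as fid:
        encoded_jpg = fid.read()
    feature = {
        'image/encoded': tf.train.Feature(bytes_list=tf.train.BytesList(value=[encoded_jpg])),
        'image/width': tf.train.Feature(int64_list=tf.train.Int64List(value=[width])),
        'image/height': tf.train.Feature(int64_list=tf.train.Int64List(value=[height])),
        'image/object/class/text': tf.train.Feature(bytes_list=tf.train.BytesList(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

writer = tf.python_io.TFRecordWriter('train.record')
for image_path, width, height, label in annotations:  # your loop over images and annotations
    writer.write(create_tf_example(image_path, width, height, label).SerializeToString())
writer.close()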
OK, for future reference, this is how I add background images to the dataset, allowing the model to train on them.
Functions used from: datitran/raccoon_dataset
Generate CSV file -> xml_to_csv.py
Generate TFRecord from CSV file -> generate_tfrecord.py
First Step - Creating an XML file for each background image
Example of a background image XML file:
<annotation>
    <folder>test/</folder>
    <filename>XXXXXX.png</filename>
    <path>your_path/test/XXXXXX.png</path>
    <source>
        <database>Unknown</database>
    </source>
    <size>
        <width>640</width>
        <height>640</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
</annotation>
Basically, you remove the entire <object> element (i.e. no annotations).
Second Step - Generate CSV file
Using xml_to_csv.py, I just add a little change to also handle XML files that do not have any annotations (the background images), like so:
From the original:
https://github.com/datitran/raccoon_dataset/blob/93938849301895fb73909842ba04af9b602f677a/xml_to_csv.py#L12-L22
I add:
value = None
for member in root.findall('object'):
    value = (root.find('filename').text,
             int(root.find('size')[0].text),
             int(root.find('size')[1].text),
             member[0].text,
             int(member[4][0].text),
             int(member[4][1].text),
             int(member[4][2].text),
             int(member[4][3].text)
             )
    xml_list.append(value)
if value is None:
    value = (root.find('filename').text,
             int(root.find('size')[0].text),
             int(root.find('size')[1].text),
             '-1',
             '-1',
             '-1',
             '-1',
             '-1'
             )
    xml_list.append(value)
I'm just adding negative values for the bounding-box coordinates if there is no <object> in the XML file, which is the case for the background images; this will be useful when generating the TFRecords.
Third and Final Step - Generating the TFRecords
Now, when creating the TFRecords, if the corresponding row/image has negative coordinates, I just add zero values to the record (before, this would not even have been possible).
So from the original:
https://github.com/datitran/raccoon_dataset/blob/93938849301895fb73909842ba04af9b602f677a/generate_tfrecord.py#L60-L66
I add:
for index, row in group.object.iterrows():
    if int(row['xmin']) > -1:
        xmins.append(row['xmin'] / width)
        xmaxs.append(row['xmax'] / width)
        ymins.append(row['ymin'] / height)
        ymaxs.append(row['ymax'] / height)
        classes_text.append(row['class'].encode('utf8'))
        classes.append(class_text_to_int(row['class']))
    else:
        xmins.append(0)
        xmaxs.append(0)
        ymins.append(0)
        ymaxs.append(0)
        classes_text.append('something'.encode('utf8'))  # this does not matter for the background
        classes.append(5000)
Note that for the class_text (in the else branch), since the background images have no bounding boxes, you can replace the string with whatever you like; for the background cases it will not appear anywhere.
And lastly, for the classes (in the else branch), you just need a numeric label that does not belong to any of your own classes.
For those who are wondering: I've used this procedure many times, and it currently works for my use cases.
Hope it helped in some way.

retrieving data saved under HDF5 group as CArray

I am new to the HDF5 file format, and I have data (images) saved in HDF5 format. The images are saved as CArrays under a group called 'data', which is under the root group. What I want to do is retrieve a slice of the saved images, for example the first 400 or something like that. The following is what I did.
h5f = h5py.File('images.h5f', 'r')
image_grp= h5f['/data/'] #the image group (data) is opened
print(image_grp[0:400])
but I am getting the following error
Traceback (most recent call last):
File "fgf.py", line 32, in <module>
print(image_grp[0:40])
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
(/feedstock_root/build_artefacts/h5py_1496410723014/work/h5py-2.7.0/h5py/_objects.c:2846)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
(/feedstock_root/build_artefacts/h5py_1496410723014/work/h5py
2.7.0/h5py/_objects.c:2804)
File "/..../python2.7/site-packages/h5py/_hl/group.py", line 169, in
__getitem__oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
File "/..../python2.7/site-packages/h5py/_hl/base.py", line 133, in _e name = name.encode('ascii')
AttributeError: 'slice' object has no attribute 'encode'
I am not sure why I am getting this error but I am not even sure if I can slice the images which are saved as individual datasets.
I know this is an old question, but it is the first hit when searching for 'slice' object has no attribute 'encode' and it has no solution.
The error happens because image_grp is a group, not a dataset; slicing only works on datasets. Indexing a group expects a member name, so h5py tries to encode the slice object as a string name, hence the error. You are looking for a dataset element inside the group.
You need to find/know the key for the item that contains the dataset.
One suggestion is to list all keys in the group, and then guess which one it is:
print(list(image_grp.keys()))
This will give you the keys in the group.
A common case is that the first element is the image, so you can do this:
image_grp = h5f['/data/']
image = image_grp[list(image_grp.keys())[0]]
print(image[0:400])
Yesterday I had a similar error and wrote this little piece of code to take my desired slice of an h5py file.
import h5py

def h5py_slice(h5py_file, begin_index, end_index):
    slice_list = []
    with h5py.File(h5py_file, 'r') as f:
        for i in range(begin_index, end_index):
            slice_list.append(f[str(i)][...])
    return slice_list
and it can be used like this:
the_desired_slice_list = h5py_slice('images.h5f', 0, 400)

scheduler produces empty files

I'm using pythonanywhere for a simple scheduled task.
I want to download data from a link once a day and save csv files. Later, once I have a decent time series, I'll figure out how I actually want to manage the data. It's not much data, so I don't need anything fancy like a database.
My script takes the data from the google sheets link, adds a log column and a time column, then writes a csv with the date in the filename.
It works exactly as I want it to when I run it manually in pythonanywhere, but the scheduler is just creating empty csv files, albeit with the correct names.
Any ideas what's up? I don't understand the log file. Surely the error should also happen when the script is run manually?
script:
import pandas as pd
import time
import datetime
def write_today(df):
    date = time.strftime("%Y-%m-%d")
    df.to_csv('Properties_' + date + '.csv')
url = 'https://docs.google.com/spreadsheets/d/19h2GmLN-2CLgk79gVxcazxtKqS6rwW36YA-qvuzEpG4/export?format=xlsx'
df = pd.read_excel(url, header=1).rename(columns={'Unnamed: 1':'code'})
source = pd.read_excel(url).columns[0]
df['source'] = source
df['time'] = datetime.datetime.now()
write_today(df)
The scheduler is set up as a daily task.
log file:
Traceback (most recent call last):
File "/home/abmoore/load_data.py", line 24, in <module>
write_today(df)
File "/home/abmoore/load_data.py", line 16, in write_today
df.to_csv('Properties_'+date+'.csv')
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 1344, in to_csv
formatter.save()
File "/usr/local/lib/python2.7/dist-packages/pandas/formats/format.py", line 1551, in save
self._save()
File "/usr/local/lib/python2.7/dist-packages/pandas/formats/format.py", line 1638, in _save
self._save_header()
File "/usr/local/lib/python2.7/dist-packages/pandas/formats/format.py", line 1634, in _save_header
writer.writerow(encoded_labels)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
Your problem there is the UnicodeEncodeError: you have some non-ASCII data in your spreadsheet (here a pound sign, u'\xa3'), and the pandas to_csv function defaults to ASCII encoding. That also explains the empty files: to_csv creates the file first, and the encoding error aborts the write before the header reaches it. Try specifying utf8 instead:
def write_today(df):
    filename = 'Properties_{date}.csv'.format(date=time.strftime("%Y-%m-%d"))
    df.to_csv(filename, encoding='utf8')
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
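Putting it together, the whole corrected script would look something like this (same URL and logic as in the question, with only the encoding fix applied):
import datetime
import time
import pandas as pd

url = 'https://docs.google.com/spreadsheets/d/19h2GmLN-2CLgk79gVxcazxtKqS6rwW36YA-qvuzEpG4/export?format=xlsx'

def write_today(df):
    filename = 'Properties_{date}.csv'.format(date=time.strftime("%Y-%m-%d"))
    df.to_csv(filename, encoding='utf8')  # utf8 copes with non-ASCII data such as u'\xa3' (£)

df = pd.read_excel(url, header=1).rename(columns={'Unnamed: 1': 'code'})
df['source'] = pd.read_excel(url).columns[0]
df['time'] = datetime.datetime.now()
write_today(df)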

how to use format read from xlrd and write thru xlsxwriter in python

I am reading an Excel file using xlrd, doing some macro replacement, and then writing it out through xlsxwriter. Without reading and copying the formatting info the code works, but when I add the formatting info I get an error (at the bottom).
The code snippet is below. I read an .xls file and, for each data row, replace token macros with their values and write the row back. When I try to close the output_workbook I get the error.
filePath = os.path.realpath(os.path.join(inputPath, filename))
input_workbook = open_workbook(filePath, formatting_info=True)
input_DataSheet = input_workbook.sheet_by_index(0)
data = [[input_DataSheet.cell_value(r, c) for c in range(input_DataSheet.ncols)] for r in range(input_DataSheet.nrows)]
output_workbook = xlsxwriter.Workbook(r'C:\Users\Manish\Downloads\Sunny\Drexel_Funding\MacroReplacer\demo.xlsx')
output_worksheet = output_workbook.add_worksheet()
for rowIndex, value in enumerate(data):
    copyItem = []
    for individualItem in value:
        tempItem = individualItem
        if isinstance(individualItem, basestring):
            tempItem = tempItem.replace("[{0}]".format(investorNameMacro), investorName)
            tempItem = tempItem.replace("[{0}]".format(investorPhoneMacro), investorPhone)
            tempItem = tempItem.replace("[{0}]".format(investorEmailMacro), investorEmail)
            tempItem = tempItem.replace("[{0}]".format(loanNumberMacro), loanNumber)
        copyItem.append(tempItem)
    for columnIndex, val in enumerate(copyItem):
        fmt = input_workbook.xf_list[input_DataSheet.cell(rowIndex, columnIndex).xf_index]
        output_worksheet.write(rowIndex, columnIndex, val, fmt)
output_workbook.close()
The error that I get is
Traceback (most recent call last):
File "C:/Users/Manish/Downloads/Sunny/Drexel_Funding/MacroReplacer/drexelfundingmacroreplacer.py", line 87, in
output_workbook.close()
File "build\bdist.win-amd64\egg\xlsxwriter\workbook.py", line 297, in close
File "build\bdist.win-amd64\egg\xlsxwriter\workbook.py", line 605, in _store_workbook
File "build\bdist.win-amd64\egg\xlsxwriter\packager.py", line 131, in _create_package
File "build\bdist.win-amd64\egg\xlsxwriter\packager.py", line 189, in _write_worksheet_files
File "build\bdist.win-amd64\egg\xlsxwriter\worksheet.py", line 3426, in _assemble_xml_file
File "build\bdist.win-amd64\egg\xlsxwriter\worksheet.py", line 4829, in _write_sheet_data
File "build\bdist.win-amd64\egg\xlsxwriter\worksheet.py", line 5015, in _write_rows
File "build\bdist.win-amd64\egg\xlsxwriter\worksheet.py", line 5183, in _write_cell
AttributeError: 'XF' object has no attribute '_get_xf_index'
any help is appreciated
Thanks
The xlrd and XlsxWriter format objects are different types and are not interchangeable.
If you wish to preserve formatting you will have to write some code that translates the properties from one to the other.
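As a rough illustration of what such translation code could look like, here is a sketch that carries over just two font properties (bold and italic); a complete version would also have to map borders, fills, number formats, alignment, and so on:
def convert_format(input_workbook, xf_index, output_workbook):
    # Translate (part of) an xlrd XF record into an xlsxwriter Format.
    xf = input_workbook.xf_list[xf_index]
    font = input_workbook.font_list[xf.font_index]
    props = {}
    if font.bold:
        props['bold'] = True
    if font.italic:
        props['italic'] = True
    return output_workbook.add_format(props)
In the loop above you would then write fmt = convert_format(input_workbook, input_DataSheet.cell(rowIndex, columnIndex).xf_index, output_workbook) instead of passing the raw XF object; in practice you would also cache the converted formats rather than creating a new one per cell.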

Different behaviour for io in pickle with string content

When working with pickled data I encountered different behaviour between io.open and __builtin__.open. Consider the following simple example:
import pickle
payload = 'foo'
fn = 'test.pickle'
pickle.dump(payload, open(fn, 'w'))
a = pickle.load(open(fn, 'r'))
This works as expected. But running this code here:
import pickle
import io
payload = 'foo'
fn = 'test.pickle'
pickle.dump(payload, io.open(fn, 'w'))
a = pickle.load(io.open(fn, 'r'))
gives the following Traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\WinPython-32bit-2.7.8.1\python-2.7.8\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
execfile(filename, namespace)
File "D:/**.py", line 15, in <module>
pickle.dump(payload, io.open(fn, 'w'))
File "D:\WinPython-32bit-2.7.8.1\python-2.7.8\lib\pickle.py", line 1370, in dump
Pickler(file, protocol).dump(obj)
File "D:\WinPython-32bit-2.7.8.1\python-2.7.8\lib\pickle.py", line 224, in dump
self.save(obj)
File "D:\WinPython-32bit-2.7.8.1\python-2.7.8\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "D:\WinPython-32bit-2.7.8.1\python-2.7.8\lib\pickle.py", line 488, in save_string
self.write(STRING + repr(obj) + '\n')
TypeError: must be unicode, not str
As I want to be future-compatible, how can I circumvent this misbehaviour? Or, what else am I doing wrong here?
I stumbled over this when dumping dictionaries with keys of type string.
My python version is:
'2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]'
The difference is not surprising, because io.open() explicitly deals with Unicode strings when using text mode. The documentation is quite clear about this:
Note: Since this module has been designed primarily for Python 3.x, you have to be aware that all uses of “bytes” in this document refer to the str type (of which bytes is an alias), and all uses of “text” refer to the unicode type. Furthermore, those two types are not interchangeable in the io APIs.
and
Python distinguishes between files opened in binary and text modes, even when the underlying operating system doesn’t. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as unicode strings, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.
You need to open files in binary mode. The fact that it worked with the built-in open() at all is more luck than wisdom; if your pickles contained data with \n and/or \r bytes, loading them may well fail. The default Python 2 pickle protocol happens to be a text protocol, but the output should still be treated as binary.
In all cases, when writing pickle data, use binary mode:
pickle.dump(payload, open(fn, 'wb'))
a = pickle.load(open(fn, 'rb'))
or
pickle.dump(payload, io.open(fn, 'wb'))
a = pickle.load(io.open(fn, 'rb'))
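Independently of which open you use, it is also tidier (and more future-proof) to close the files deterministically with a with block and to pin an explicit protocol; a small variant of the above:
import pickle

payload = 'foo'
fn = 'test.pickle'

with open(fn, 'wb') as f:
    pickle.dump(payload, f, protocol=2)  # protocol 2 is binary and readable by both Python 2 and 3

with open(fn, 'rb') as f:
    a = pickle.load(f)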