Different behaviour for io in pickle with string content - python-2.7

When working with pickled data I encountered different behavior between io.open and __builtin__.open. Consider the following simple example:
import pickle
payload = 'foo'
fn = 'test.pickle'
pickle.dump(payload, open(fn, 'w'))
a = pickle.load(open(fn, 'r'))
This works as expected. But running this code here:
import pickle
import io
payload = 'foo'
fn = 'test.pickle'
pickle.dump(payload, io.open(fn, 'w'))
a = pickle.load(io.open(fn, 'r'))
gives the following Traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\WinPython-32bit-2.7.8.1\python-2.7.8\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
execfile(filename, namespace)
File "D:/**.py", line 15, in <module>
pickle.dump(payload, io.open(fn, 'w'))
File "D:\WinPython-32bit-2.7.8.1\python-2.7.8\lib\pickle.py", line 1370, in dump
Pickler(file, protocol).dump(obj)
File "D:\WinPython-32bit-2.7.8.1\python-2.7.8\lib\pickle.py", line 224, in dump
self.save(obj)
File "D:\WinPython-32bit-2.7.8.1\python-2.7.8\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "D:\WinPython-32bit-2.7.8.1\python-2.7.8\lib\pickle.py", line 488, in save_string
self.write(STRING + repr(obj) + '\n')
TypeError: must be unicode, not str
As I want to be future-compatible, how can I circumvent this behavior? Or, what else am I doing wrong here?
I stumbled over this when dumping dictionaries with keys of type string.
My python version is:
'2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)]'

The difference is not surprising, because io.open() explicitly deals with Unicode strings when using text mode. The documentation is quite clear about this:
Note: Since this module has been designed primarily for Python 3.x, you have to be aware that all uses of “bytes” in this document refer to the str type (of which bytes is an alias), and all uses of “text” refer to the unicode type. Furthermore, those two types are not interchangeable in the io APIs.
and
Python distinguishes between files opened in binary and text modes, even when the underlying operating system doesn’t. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as unicode strings, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.
You need to open files in binary mode. The fact that it worked with the built-in open() at all is more luck than wisdom; if your pickles contained data with \n and/or \r bytes, loading them back may well fail. The default Python 2 pickle protocol happens to be a text protocol, but its output should still be treated as binary.
In all cases, when writing pickle data, use binary mode:
pickle.dump(payload, open(fn, 'wb'))
a = pickle.load(open(fn, 'rb'))
or
pickle.dump(payload, io.open(fn, 'wb'))
a = pickle.load(io.open(fn, 'rb'))
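For completeness, here is a minimal round-trip sketch (my addition, not part of the original answer) using with blocks, which also flush and close the files promptly; the open() one-liners in the question never close them:
import pickle
payload = 'foo'
fn = 'test.pickle'
# binary mode writes the pickle bytes through unchanged on every platform
with open(fn, 'wb') as f:
    pickle.dump(payload, f)
with open(fn, 'rb') as f:
    assert pickle.load(f) == payload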

Related

Django encoding error when reading from a CSV

When I try to run:
import csv
with open('data.csv', 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        pgd = Player.objects.get_or_create(
            player_name=row['Player'],
            team=row['Team'],
            position=row['Position']
        )
Most of my data gets created in the database, except for one particular row. When my script reaches the row, I receive the error:
ProgrammingError: You must not use 8-bit bytestrings unless you use a
text_factory that can interpret 8-bit bytestrings (like text_factory = str).
It is highly recommended that you instead just switch your application to Unicode strings.
The particular row in the CSV that causes this error is:
>>> row
{'Player': 'FR\xed\x8aD\xed\x8aRIC.ST-DENIS', 'Team': 'BOS', 'Position': 'G'}
I've looked at other similar Stack Overflow threads with the same or similar issues, but most aren't specific to using SQLite with Django. Any advice?
If it matters, I'm running the script by going into the Django shell by calling python manage.py shell, and copy-pasting it in, as opposed to just calling the script from the command line.
This is the stacktrace I get:
Traceback (most recent call last):
File "<console>", line 4, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 108, in next
row = self.reader.next()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 302, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 1674: invalid continuation byte
EDIT: I decided to just manually import this entry into my database, rather than try to read it from my CSV, based on Alastair McCormack's feedback
Based on the output from your question, it looks like the person who made the CSV mojibaked it - it doesn't seem to represent FRÉDÉRIC.ST-DENIS. You can try using windows-1252 instead of utf-8 but I think you'll end up with FRíŠDíŠRIC.ST-DENIS in your database.
I suspect you're using Python 2 - open() returns str objects, which are simply byte strings.
The error is telling you that you need to decode your text to Unicode strings before use.
The simplest method is to decode each cell:
with open('data.csv', 'r') as csvfile:  # 'U' means universal-newline mode and is not necessary
    reader = csv.DictReader(csvfile)
    for row in reader:
        pgd = Player.objects.get_or_create(
            player_name=row['Player'].decode('utf-8'),
            team=row['Team'].decode('utf-8'),
            position=row['Position'].decode('utf-8')
        )
That'll work, but it's ugly to add decodes everywhere, and it won't work in Python 3. Python 3 improves things by opening files in text mode and returning Python 3 strings, which are the equivalent of Unicode strings in Python 2.
To get the same functionality in Python 2, use the io module. This gives you an open() function that has an encoding option. Annoyingly, the Python 2.x csv module is broken with Unicode, so you need to install a backported version:
pip install backports.csv
To tidy your code and future-proof it, do:
import io
from backports import csv
with io.open('data.csv', 'r', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        # now every row is automatically decoded from UTF-8
        pgd = Player.objects.get_or_create(
            player_name=row['Player'],
            team=row['Team'],
            position=row['Position']
        )
Encode the player name in UTF-8 using .encode('utf-8'):
import csv
with open('data.csv', 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        pgd = Player.objects.get_or_create(
            player_name=row['Player'].encode('utf-8'),
            team=row['Team'],
            position=row['Position']
        )
In Django, decode with latin-1: csv.DictReader(io.StringIO(csv_file.read().decode('latin-1'))). It will accept all the special characters and all the comma exceptions that trip up UTF-8.
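A minimal sketch of how that one-liner might sit in a Django view, assuming Python 3 (where the csv module accepts text directly); the view and upload-field names here are hypothetical:
import csv
import io

def import_players(request):
    csv_file = request.FILES['file']  # hypothetical upload field
    # latin-1 maps every byte to a character, so .decode() never raises,
    # though bytes that were really UTF-8 can come out as mojibake
    reader = csv.DictReader(io.StringIO(csv_file.read().decode('latin-1')))
    for row in reader:
        ...  # create Player objects as above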

Error translating Tornado template with gettext

I've got this site running on top of Tornado and its template engine that I want to internationalize, so I thought of using gettext to help me with that.
Since my site is already in Portuguese, my message.po (template) file has all msgids in Portuguese as well (example):
#: base.html:30 base.html:51
msgid "Início"
msgstr ""
It was generated with xgettext:
xgettext -i *.html -L Python --from-code UTF-8
Later I used Poedit to generate the translation file en_US.po and later compile it as en_US.mo.
Stored in my translation folder:
translation/en_US/LC_MESSAGES/site.mo
So far, so good.
I've created a really simple RequestHandler that renders and returns the translated site.
import os
import logging
from tornado.web import RequestHandler
import tornado.locale as locale

LOG = logging.getLogger(__name__)

class SiteHandler(RequestHandler):
    def initialize(self):
        locale.load_gettext_translations(os.path.join(os.path.dirname(__file__), '../translations'), "site")

    def get(self, page):
        LOG.debug("PAGE REQUESTED: %s", page)
        self.render("site/%s.html" % page)
As far as I know that should work perfectly, but somehow I've encountered some issues:
1 - How do I tell Tornado that my template has its text in Portuguese so it won't go looking for a pt locale which I don't have?
2 - When asking for the site with en_US locale, it loads ok but when Tornado is going to translate, it throws an encoding exception.
TypeError: not all arguments converted during string formatting
ERROR:views.site:Could not load template
Traceback (most recent call last):
File "/Users/ademarizu/Dev/git/new_plugin/site/src/main/py/views/site.py", line 20, in get
self.render("site/%s.html" %page)
File "/Users/ademarizu/Dev/virtualEnvs/execute/lib/python2.7/site-packages/tornado/web.py", line 664, in render
html = self.render_string(template_name, **kwargs)
File "/Users/ademarizu/Dev/virtualEnvs/execute/lib/python2.7/site-packages/tornado/web.py", line 771, in render_string
return t.generate(**namespace)
File "/Users/ademarizu/Dev/virtualEnvs/execute/lib/python2.7/site-packages/tornado/template.py", line 278, in generate
return execute()
File "site/home_html.generated.py", line 11, in _tt_execute
_tt_tmp = _("Início") # site/base.html:30
File "/Users/ademarizu/Dev/virtualEnvs/execute/lib/python2.7/site-packages/tornado/locale.py", line 446, in translate
return self.gettext(message)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gettext.py", line 406, in ugettext
return self._fallback.ugettext(message)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gettext.py", line 407, in ugettext
return unicode(message)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
Any help?
Ah, I'm running python 2.7 btw!
1 - How do I tell Tornado that my template has its text in Portuguese so it won't go looking for a pt locale which I don't have?
This is what the set_default_locale method is for. Call tornado.locale.set_default_locale('pt') (or pt_BR, etc) once at startup to tell tornado that your template source is in Portuguese.
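A minimal startup sketch (using the paths and domain from the question); doing this once at application startup also avoids reloading the catalogues on every request, as the initialize() above does:
import os
import tornado.locale

# declare that the template source strings are Portuguese, then load the
# compiled .mo catalogues for the other locales
tornado.locale.set_default_locale('pt')
tornado.locale.load_gettext_translations(
    os.path.join(os.path.dirname(__file__), '../translations'), 'site')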
2 - When asking for the site with en_US locale, it loads ok but when Tornado is going to translate, it throws an encoding exception.
Remember that in Python 2, strings containing non-ascii characters need to be marked as unicode. Instead of _("Início"), use _(u"Início").
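A tiny illustration of the Python 2 semantics behind this:
# -*- coding: utf-8 -*-
s = "Início"   # str: raw UTF-8 bytes under the coding declaration above
u = u"Início"  # unicode: what _() should receive
# gettext's fallback calls unicode(message), which fails on the byte
# string with UnicodeDecodeError -- exactly the traceback above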

During migrating tool from windows to linux lxml error

I have developed a tool in Python 2.7 that takes an XSD file as input and writes the processed data to a text file.
While processing the XSD file with lxml, I am unable to resolve this sort of error:
AttributeError: 'Element' object has no attribute 'iterdescendants'
I don't know what's wrong with the lxml lib. Is there a Linux-compatible version of lxml for Python 2.7?
I import it in one file like this:
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree
The import lives only in that file; I pass the element tree on to another file that processes the elements. It works fine in the file where the import is declared, but raises the error in the other file.
The code that throws the error is:
for tdocNode in lincFileRootNode:
    rootNode = tdocNode.getroot()
    lchildren = rootNode.getchildren()
    for elt in lchildren:
        if 'complex' == elt.tag:
            if 'name' in elt.attrib:
                if 'element' == item.tag:
                    if 'type' in item.attrib:
                        if elt.attrib['name'] == item.attrib['type']:
                            for key in elt.iterdescendants(tag='element'):
                                bIsElemTypeSimple = False
                                bIsElemTypeSimple = process_elementtype(key, lincFileRootNode)
where lincFileRootNode is a list containing the parsed XSD file trees to be processed.
The error thrown is:
Traceback (most recent call last):
File "run.py", line 1210, in <module>
iret = xsd2dic_main()
File "run.py", line 71, in xsd2dic_main
iRet = yxsdtodic()
File "run.py", line 352, in yxsdtodic
iret = process_xsdfile(sXsdPath)
File "run.py", line 485, in xsdfile
sRet = process_dic_elementtype(item,lincFileRootNode)
File "run.py", line 817, in process_dic_elementtype
for key in elt.iterdescendants(tag='element'):
AttributeError: 'Element' object has no attribute 'iterdescendants'
I tried both cases:
1: writing all the code in the same file
2: splitting it across different files
and I still get the same error.
This is mostly a guess, but look into it.
You appear to be calling iterdescendants from lxml's implementation of the Element type. However, if lxml fails to import, you fall back on Python's built-in xml library instead. But its implementation of Element doesn't have an iterdescendants method at all. In other words, the two implementations have different public APIs. Add some print statements to see which library you're importing, and do some additional checking to see exactly what type elt is. If you want to be able to fall back on Python's built-in xml, you'll need to structure your code to accommodate the different APIs.
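One way to accommodate both APIs (a sketch under the assumption that iterating descendants is all you need; the helper name is mine):
try:
    from lxml import etree
    HAVE_LXML = True
except ImportError:
    import xml.etree.ElementTree as etree
    HAVE_LXML = False

def iter_descendants(elt, tag=None):
    # lxml Elements provide iterdescendants(); stdlib Elements do not
    if HAVE_LXML:
        return elt.iterdescendants(tag)
    # the stdlib iter() can yield the element itself, so skip it
    return (e for e in elt.iter(tag) if e is not elt)
Calling iter_descendants(elt, 'element') then works with either library.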

error: unpack requires a string argument of length 8

I was running my script and stumbled upon this error:
WARNING *** file size (24627) not 512 + multiple of sector size (512)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Traceback (most recent call last):
File "C:\Email Attachments\whatever.py", line 20, in <module>
main()
File "C:\Email Attachments\whatever.py", line 17, in main
csv_from_excel()
File "C:\Email Attachments\whatever.py", line 7, in csv_from_excel
sh = wb.sheet_by_name('B2B_REP_YLD_100_D_SQ.rpt')
File "C:\Python27\lib\site-packages\xlrd\book.py", line 442, in sheet_by_name
return self.sheet_by_index(sheetx)
File "C:\Python27\lib\site-packages\xlrd\book.py", line 432, in sheet_by_index
return self._sheet_list[sheetx] or self.get_sheet(sheetx)
File "C:\Python27\lib\site-packages\xlrd\book.py", line 696, in get_sheet
sh.read(self)
File "C:\Python27\lib\site-packages\xlrd\sheet.py", line 1055, in read
dim_tuple = local_unpack('<ixxH', data[4:12])
error: unpack requires a string argument of length 8
I was trying to process this Excel file:
https://drive.google.com/file/d/0B12NevhOGQGRMkRVdExuYjFveDQ/edit?usp=sharing
One solution that I found is to manually open the spreadsheet, save it, then close it before I run my script that converts .xls to .csv. I find this solution cumbersome and clunky.
This kind of spreadsheet is saved to my drive daily via an Outlook macro. Unprocessed data keeps piling up, which is why I turned to scripting to ease the job.
Who made the Outlook macro that's dumping this file? xlrd uses byte-level unpacking to read the Excel file, and it is failing to read a field in this file. There are ways to trace where it's failing, but none to automatically recover from this type of error.
The erroneous data seems to be at data[4:12] of a specific frame (we'll see which later), which should be a bytestring that's parsed as:
one 4-byte integer (i)
2 pad bytes (xx)
one 2-byte unsigned short (H)
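As a quick illustration of that format string (my sketch, not part of the original answer):
import struct
# '<ixxH' is little-endian: a 4-byte int, 2 pad bytes, a 2-byte unsigned short
print struct.calcsize('<ixxH')     # 8 -- unpack needs exactly 8 bytes
buf = struct.pack('<ixxH', 100, 20)
print struct.unpack('<ixxH', buf)  # (100, 20)
struct.unpack('<ixxH', buf[:5])    # struct.error: unpack requires a string argument of length 8
So a record shorter than 12 bytes makes the data[4:12] slice shorter than 8 bytes, which is exactly the error above.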
You can set xlrd to DEBUG mode, which will show you which bytes it's parsing, and exactly where in the file the error occurs:
import xlrd
xlrd.DEBUG = 2
workbook = xlrd.open_workbook(u'/home/sparker/Downloads/20131117_040934_B2B_REP_YLD_100_D_LT.xls')
Here are the results, slightly trimmed down for the sake of SO:
parse_globals: record code is 0x0293
parse_globals: record code is 0x0293
parse_globals: record code is 0x0085
CODEPAGE: codepage 1200 -> encoding 'utf_16_le'
BOUNDSHEET: bv=80 data '\xfd\x04\x00\x00\x00\x00\x18\x00B2B_REP_YLD_100_D_SQ.rpt'
BOUNDSHEET: inx=0 vis=0 sheet_name=u'B2B_REP_YLD_100_D_SQ.rpt' abs_posn=1277 sheet_type=0x00
parse_globals: record code is 0x000a
GET_SHEETS: [u'B2B_REP_YLD_100_D_SQ.rpt'] [1277]
GET_SHEETS: sheetno = 0 [u'B2B_REP_YLD_100_D_SQ.rpt'] [1277]
reqd: 0x0010
getbof(): data='\x00\x06\x10\x00\xbb\r\xcc\x07\x00\x00\x00\x00\x06\x00\x00\x00'
getbof(): op=0x0809 version2=0x0600 streamtype=0x0010
getbof(): BOF found at offset 1277; savpos=1277
BOF: op=0x0809 vers=0x0600 stream=0x0010 buildid=3515 buildyr=1996 -> BIFF80
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "build/bdist.macosx-10.4-x86_64/egg/xlrd/__init__.py", line 457, in open_workbook
File "build/bdist.macosx-10.4-x86_64/egg/xlrd/__init__.py", line 1007, in get_sheets
File "build/bdist.macosx-10.4-x86_64/egg/xlrd/__init__.py", line 998, in get_sheet
File "build/bdist.macosx-10.4-x86_64/egg/xlrd/sheet.py", line 864, in read
struct.error: unpack requires a string argument of length 8
Specifically, you can see that it parses the name of the worksheet, u'B2B_REP_YLD_100_D_SQ.rpt'.
Let's check the source code. The traceback throws an error here, where we can see from the parent loop that we're trying to parse the XL_DIMENSION and XL_DIMENSION2 values. These directly correspond to the shape of the Excel sheet.
And that's where there's a problem in your workbook: it's not being written correctly. So, back to my original question: who made the Excel macro? It needs to be fixed. But that's for another SO question, some other time.

Specifying select features to be categorical using OneHotEncoder in sklearn 0.14

I am using the sklearn 0.14 module in Python to create a decision tree. I was hoping to use the OneHotEncoder to convert some features into categorical features. According to the documentation, I should be able to provide an array of indices to indicate which features should be converted. However, trying the following code:
import numpy
from sklearn import preprocessing

xs = [[64, 15230], [3, 67673], [16, 43678]]
encoder = preprocessing.OneHotEncoder(n_values='auto', categorical_features=[1], dtype=numpy.integer)
encoder.fit(xs)
I receive the following error:
Traceback (most recent call last):
  File "C:\Users\sara\Documents\Shipping Project\PythonSandbox\CarrierDecisionTree.py", line 35, in <module>
    encoder.fit(xs)
  File "C:\Python27\lib\site-packages\sklearn\preprocessing\data.py", line 892, in fit
    self.fit_transform(X)
  File "C:\Python27\lib\site-packages\sklearn\preprocessing\data.py", line 944, in fit_transform
    self.categorical_features, copy=True)
  File "C:\Python27\lib\site-packages\sklearn\preprocessing\data.py", line 795, in _transform_selected
    return sparse.hstack((X_sel, X_not_sel))
  File "C:\Python27\lib\site-packages\scipy\sparse\construct.py", line 417, in hstack
    return bmat([blocks], format=format, dtype=dtype)
  File "C:\Python27\lib\site-packages\scipy\sparse\construct.py", line 532, in bmat
    dtype = upcast( *tuple([A.dtype for A in blocks[block_mask]]) )
  File "C:\Python27\lib\site-packages\scipy\sparse\sputils.py", line 53, in upcast
    raise TypeError('no supported conversion for types: %r' % (args,))
TypeError: no supported conversion for types: (dtype('int32'), dtype('S6'))
If instead I provide the array [0, 1] to categorical_features, it works correctly and converts both features properly. The same correct behavior occurs when using 'all' for categorical_features. However, I only want the second feature converted, not the first. I understand I could do this manually by converting one feature at a time, but I was hoping to use all the beauty of OneHotEncoder, as I will be using many more features later on.
Posting as an answer, for the record:
TypeError: no supported conversion for types: (dtype('int32'), dtype('S6'))
means something in the true xs (not the one shown in the code snippet) is a string: dtype('S6') is NumPy's length-six string type.
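A quick way to confirm (my sketch, not from the original answer): when any cell is a string, NumPy upcasts the whole array to a string dtype, which is what the sparse hstack then chokes on.
import numpy

xs_bad = [[64, '15230'], [3, '67673'], [16, '43678']]
print numpy.asarray(xs_bad).dtype        # a string dtype; your 'S6' arose this way
# if the strings are numeric, converting up front fixes the fit:
xs_ok = numpy.asarray(xs_bad).astype(numpy.int64)
print xs_ok.dtype                        # int64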