I'd like to log (or even print) some messages that are UTF-8 encoded Unicode; having run into this a lot, I collected a bunch of fixes, none of which actually seems to work in my case (Python 3 in a Jupyter notebook). What I have so far is:
#!/usr/bin/env LC_ALL=en_US.UTF-8 /usr/local/bin/python3
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)
!export PYTHONIOENCODING=UTF-8
import logging
logging.basicConfig(level=logging.INFO)
Then I try any of:
logging.info(u'שלום')
logging.info(unicode('שלום','utf-8'))
logging.info(u'שלום'.encode('utf-8'))
all of which hit the dreaded
'ascii' codec can't decode byte 0xd7 in position 10: ordinal not in range(128)
At this point I am willing to sacrifice a goat to the Unicode monkey-god if that would help. Can anyone weigh in (e.g., what kind of goat)?
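For what it's worth, under Python 3 the logging module accepts Unicode strings natively, so a minimal sketch (assuming a UTF-8-capable terminal, and with none of the stream wrapping above) would just be:
import logging
logging.basicConfig(level=logging.INFO)
logging.info('שלום')  # Python 3 str is already Unicode; no codec wrapping needed
If that still fails, the wrapping itself is a likely culprit: in Python 3, sys.stdout is already a text stream, so re-wrapping it with codecs.getwriter is unnecessary and can itself break output.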
Related
I'm communicating over a serial port and currently using Python 2 code, which I want to convert to Python 3. I want to make sure the bytes I send over the wire are the same, but I'm having trouble verifying that that's the case.
In the original code the commands are sent like this:
serial.Serial().write("\xaa\xb4" + chr(2))
If I print "\xaa\xb4" in python2 I get this: ��.
If I print("\xaa\xb4") in python3 I get this: ª´
Encoding and decoding seem opposite too:
Python2: print "\xaa".decode('latin-1') -> ª
Python3: print("\xaa".encode('latin-1')) -> b'\xaa'
Put crudely: what do I need to pass to serial.write() in Python 3 to make sure exactly the same sequence of 1s and 0s is sent down the wire?
Use a bytes sequence. The Python 2 payload "\xaa\xb4" + chr(2) becomes:
ser.write(b'\xaa\xb4' + bytes([2]))
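A quick sanity check (hypothetical variable name) that the Python 3 payload is byte-for-byte what the Python 2 code sent:
payload = b'\xaa\xb4' + bytes([2])  # Python 3 spelling of "\xaa\xb4" + chr(2)
assert payload == b'\xaa\xb4\x02'
assert list(payload) == [0xAA, 0xB4, 0x02]  # the exact octets on the wire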
In Python 2.7 I have the following and I debug through IDLE:
print 'Here'
import sys
reload(sys)
sys.setdefaultencoding('cp1252')
print 'There'
What I get in return is:
Here
So after I have set the default encoding, it does not print the desired output.
Could this be due to conflicts with the IDLE encoding?
It cannot find a reference to setdefaultencoding in sys, so the call raises an AttributeError and 'There' is never printed.
setdefaultencoding is intentionally removed from sys at interpreter startup, and one should never use it!
Have a look at the following link.
Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?
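Instead of changing the process-wide default, the usual alternative is to encode and decode explicitly at the boundaries; a minimal Python 2 sketch:
# -*- coding: utf-8 -*-
text = u'There'               # keep text as unicode internally
data = text.encode('cp1252')  # encode to bytes only at the output boundary
print data.decode('cp1252')   # decode back to unicode for display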
I have a function that reads in hex-ascii encoded data, which I built in Python 2.7. I am changing my code over to run on 3.x and hit an unforeseen issue; the function worked flawlessly under 2.7. Here is what I have:
# works with 2.7
data = open('hexascii_file.dat', 'rU').read()
When I run that under 3.x I get a UnicodeError:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 500594: invalid start byte
I thought the default codec under Python 2.7 was ascii, so I tried the following under 3.x:
data = open('hexascii_file.dat', 'rU', encoding='ascii')
This did not work (same error as above, but specifying 'ascii' instead of 'utf-8'). However, when I use the latin-1 codec, all works well:
data = open('hexascii_file.dat', 'rU', encoding='latin-1')
I guess I am looking for a quick sanity check here to ensure I have made the proper change to the script. Does this change make sense?
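For what it's worth, if the goal is to reproduce Python 2's byte-oriented read exactly, opening the file in binary mode sidesteps decoding altogether (a sketch, assuming the downstream code can handle bytes; note that 'rU' mode is also deprecated in Python 3):
with open('hexascii_file.dat', 'rb') as f:  # binary mode: no text decoding at all
    data = f.read()                         # bytes, matching Python 2's str byte-for-byte
latin-1 "works" because it maps every byte 0x00-0xFF to a code point, but it gives you text rather than bytes.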
Two microservices are communicating via a message queue (RabbitMQ). The data is encoded using MessagePack.
I have the following scenarios:
python3 -> python3: working fine
python2 -> python3: encoding issues
Encoding is done with:
umsgpack.packb(data)
Decoding with:
umsgpack.unpackb(body)
When doing encoding and decoding in python3 I get:
data={'sender': 'producer-big-red-tiger', 'json': '{"msg": "hi"}', 'servicename': 'echo', 'command': 'run'}
When doing encoding in python2 and decoding on python3 I get:
data={b'command': b'run', b'json': b'{"msg": ""}', b'servicename': b'echo', b'sender': b'bla-blah'}
Why is the data not "completely" decoded? What should I do on the sender/receiver to achieve compatibility between Python 2 and Python 3?
Look at the "Notes" section of the README from msgpack-python:
msgpack can now distinguish between string and binary types, but not the way Python 2 does: Python 2 added a unicode string type, whereas msgpack renamed raw to str and added a separate bin type, in order to keep compatibility with data created by old libraries (raw was used for text more than for binary).
Currently, while msgpack-python supports the new bin type, the default settings don't use it and decode raw as bytes instead of unicode (str in Python 3).
You can change this by passing the use_bin_type=True option to the Packer and the encoding='utf-8' option to the Unpacker:
>>> import msgpack
>>> packed = msgpack.packb([b'spam', u'egg'], use_bin_type=True)
>>> msgpack.unpackb(packed, encoding='utf-8')
['spam', u'egg']
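Note that newer releases of msgpack-python removed the encoding parameter; if the example above fails, the equivalent modern spelling should be:
import msgpack

packed = msgpack.packb([b'spam', u'egg'], use_bin_type=True)  # bytes -> bin, text -> str
msgpack.unpackb(packed, raw=False)  # raw=False decodes str entries as UTF-8
# -> [b'spam', 'egg']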
Why doesn't this work in the Python interpreter? I am running the Python 2.7 version of python.exe on Windows 7. My locale is en_GB.
open(u'黒色.txt')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 22] invalid mode ('r') or filename: u'??.txt'
The file does exist, and is readable.
And if I try
name = u'黒色.txt'
name
the interpreter shows
u'??.txt'
Additional:
Okay, I was trying to simplify my problem for the purposes of this forum. Originally the filename was arriving in a cgi script from a web page with a file picker. The idea was to let the web page user upload files to a server:
import cgi
import os

form = cgi.FieldStorage()
fileItems = form['attachment[]']
for fileItem in fileItems:
    if fileItem.file:
        fileName = os.path.split(fileItem.filename)[1]
        f = open(fileName, 'wb')
        while True:
            chunk = fileItem.file.read(100000)
            if not chunk:
                break
            f.write(chunk)
        f.close()
but the files created at the server side had corrupted names. I started investigating this in the Python interpreter, reproduced the problem (so I thought), and that is what I put into my original question. However, I think now that I managed to create a separate problem.
Thanks to the answers below, I fixed the cgi script by making sure the file name is treated as unicode:
fileName = unicode(os.path.split(fileItem.filename)[1])
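A more defensive variant of that fix (hypothetical, in case the browser delivers the name as a UTF-8 byte string, which a bare unicode() call would reject under the ASCII default):
fileName = os.path.split(fileItem.filename)[1]
if isinstance(fileName, str):            # Python 2 byte string from the form
    fileName = fileName.decode('utf-8')  # decode explicitly rather than via the ASCII default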
I never got my example in the interpreter to work. I suspect that is because my PC has the wrong locale for this.
Here's an example script that reads and writes the file. You can use any encoding for the source file that supports the characters you are writing, but make sure the #coding line matches. You can use any encoding for the data file as long as the encoding parameter matches.
#coding:utf8
import io

with io.open(u'黒色.txt', 'w', encoding='utf8') as f:
    f.write(u'黒色.txt content')
with io.open(u'黒色.txt', encoding='utf8') as f:
    print f.read()
Output:
黒色.txt content
Note the print will only work if the terminal running the script supports Japanese; otherwise, you'll likely get a UnicodeEncodeError. I am on Windows and use an IDE that supports UTF-8 output, since the Windows console uses a legacy US-OEM encoding that doesn't support Japanese.
Run IDLE if you want to work with Unicode strings interactively in Python. Then inputting or printing any characters will just work.
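For instance, in IDLE's shell the transcript that failed in the console should behave as hoped (assuming the font can display the characters):
>>> name = u'黒色.txt'
>>> print name
黒色.txt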