Encoding problems - unicode vs. UTF-8 in Python 2.7

My Python script is supposed to open all UTF-8 YAML files in a directory and show their content to the user. But accented French words, such as présenter, are shown like this: u'pr\xe9senter'. I need them to be displayed properly.
Here is my code:
import glob
import yaml

files = glob.glob("data/*.yaml")

def read_yaml_file(filename):
    with open(filename, 'r') as stream:
        try:
            print(yaml.safe_load(stream))
        except yaml.YAMLError as exc:
            print(exc)

for file in files:
    read_yaml_file(file)
I already tried to use the import from __future__, but it didn't work. Does anyone know how to solve it?

Unicode in Python 2.x is painful. If you can, use a current Python 3, in which text is unicode and prints without a 'u' prefix, while binary data is bytes, which now prints with a 'b' prefix.
>>> print(u"pr\xe9senter") # 3.8
présenter
You also need a system console/terminal or IDE that displays glyphs for the codepoints in your yaml files.
If you are a masochist or otherwise stuck on 2.7, use sys.stdout.write(). Note that you must explicitly write '\n's.
>>> import sys; sys.stdout.write(u"pr\xe9senter\n") # 2.7
présenter
This question is not really about IDLE. However, the above lines work in both standard interactive Python on Windows 10 and in IDLE. IDLE uses tkinter which uses tcl/tk. Tk itself can handle all Basic Multilingual Plane (BMP) characters (the first 64K), but only those. Which BMP characters it can display depends on your OS and its current fonts.
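To see where the escapes come from: printing a container shows the repr of each string, while printing the strings themselves displays the glyphs. A minimal sketch on Python 3, with a made-up dict standing in for what yaml.safe_load() would return (PyYAML is not needed for the demo):

```python
# Stand-in for the dict yaml.safe_load() would return from one of the
# files (hypothetical data).
data = {u"verb": u"pr\xe9senter"}

# On Python 2, printing the dict shows the repr u'pr\xe9senter';
# printing each string itself shows the accented glyph on 2 and 3.
for value in data.values():
    print(value)  # présenter
```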

Related

encoding in py_compile vs import

When a Python script with a non-ASCII character is compiled using py_compile.compile, it does not complain about the encoding. But when the same file is imported, Python 2.7 gives:
SyntaxError: Non-ASCII character '\xe2' in file
Why is this happening? What's the difference between importing and compiling with py_compile?
It seems that Python provides two variants of its lexer: one used internally when Python itself parses files, and one that is exposed to Python through e.g. __builtins__.compile or tokenize.generate_tokens. Only the former checks for non-ASCII characters, it seems. It's controlled by an #ifdef PGEN in Parser/tokenizer.c.
I have a qualified guess as to why they did it this way: in Python 3, non-ASCII characters are permitted in .py files and are interpreted as UTF-8, IIRC. By silently permitting UTF-8 in the lexer, 2.7's tokenize.generate_tokens() function can accept all valid Python 3 code.
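A minimal way to see the behavior described above: compile() happily accepts UTF-8 source, and on Python 2 the importer would require the coding declaration shown on the first line. The source string here is a made-up example:

```python
# Made-up source containing a non-ASCII character. The coding line is
# what Python 2's importer requires; Python 3 assumes UTF-8 by default.
source = u"# -*- coding: utf-8 -*-\ns = u'caf\\xe9'\n"

code = compile(source, "<example>", "exec")
namespace = {}
exec(code, namespace)
print(namespace["s"])  # café
```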

Why does termcolor not work in python27 windows?

I just installed termcolor for Python 2.7 on Windows 8.1. When I try to print colored text, I get strange output.
from termcolor import colored
print colored('Hello world','red')
Here is the result:
[31mHello world[0m
How do I get out of this problem? Thanks in advance.
See this Stack Overflow post.
It basically says that in order to get the escape sequences working in Windows, you need to run os.system('color') first.
For example:
import termcolor
import os
os.system('color')
print(termcolor.colored("Stack Overflow", "green"))
termcolor (or colored) works perfectly fine under Python 2.7, and I can't replicate your error on my Mac/Linux.
If you look into the source code of colored, it basically prints the string in the format
'\033[%dm%s\033[0m' % (COLORS[color], text)
Somehow your terminal environment does not recognise the non-printing escape sequences that are used on unix/linux systems for setting the foreground color of an xterm.
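A minimal sketch of what termcolor emits: the ANSI SGR escape \033[31m selects red and \033[0m resets. A terminal that honors ANSI sequences renders the text in red; one that doesn't (like an unprepared Windows console) prints the bracket sequences literally, which is exactly the output in the question:

```python
# ANSI SGR codes: 31 = red foreground, 0 = reset all attributes.
RED = "\033[31m"
RESET = "\033[0m"

message = RED + "Hello world" + RESET
print(message)  # shown in red on an ANSI-capable terminal
```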

Different base64 encoding between python versions

I'm having trouble sending an html code through JSON.
I'm noticing my string values are different between python versions (2.7 and 3.5)
My string being something like: <html><p>PAÇOCA</p></html>
on Python 2.7:
x = '<html><p>PAÇOCA</p></html>'
base64.b64encode(x)
=> PGh0bWw+PHA+UEGAT0NBPC9wPjwvaHRtbD4=
on Python 3.5:
x = '<html><p>PAÇOCA</p></html>'
base64.b64encode(x)
=> b'PGh0bWw+PHA+UEHDh09DQTwvcD48L2h0bWw+'
Why are these values different?
How can I make the 3.5 string equal to the 2.7?
This is causing me troubles with receiving e-mails due to the accents being lost.
Your example x values are not valid Python so it is difficult to tell where the code went wrong, but the answer is to use Unicode strings and explicitly encode them to get consistent answers. The below code gives the same answer in Python 2 and 3, although Python 3 decorates byte strings with b'' when printed. Save the source file in the encoding declared via #coding. The source code encoding can be any encoding that supports the characters used in the source file. Typically UTF-8 is used for non-ASCII source code, but I made it deliberately different to show it doesn't matter.
#coding:cp1252
from __future__ import print_function
import base64
x = u'<html><p>PAÇOCA</p></html>'.encode('utf8')
enc = base64.b64encode(x)
print(enc)
Output using Pylauncher to choose the major Python version:
C:\>py -2 test.py
PGh0bWw+PHA+UEHDh09DQTwvcD48L2h0bWw+
C:\>py -3 test.py
b'PGh0bWw+PHA+UEHDh09DQTwvcD48L2h0bWw+'

Writing py2.x and py3.x compatible code without six.text_type

Given the six.text_type function, it's easy to write I/O code for unicode text, e.g. https://github.com/nltk/nltk/blob/develop/nltk/parse/malt.py#L188
fout.write(text_type(line))
But without the six module, it would require a try-except gymnastics that looks like this:
try:
    fout.write(text_type(line))
except:
    try:
        fout.write(unicode(line))
    except:
        fout.write(bytes(line))
What is the pythonic way to resolve the file writing a unicode line and ensuring the python script is py2.x and py3.x compatible?
Is the try-except above the pythonic way to handle the py2to3 compatibility? What other alternatives are there?
For more details/context of this question: https://github.com/nltk/nltk/issues/1080#issuecomment-134542174
Do what six does, and define text_type yourself:
try:
    # Python 2
    text_type = unicode
except NameError:
    # Python 3
    text_type = str
In any case, never use blanket except clauses here; you'll be masking other issues entirely unrelated to using a different Python version.
It is not clear to me what kind of file object you are writing to, however. If you are using io.open() to open the file, you'll get a file object that always expects Unicode text, in both Python 2 and 3, and you should never need to convert text to bytes.
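Putting both suggestions together, a sketch that writes unicode text the same way on 2.x and 3.x (the temp-file path is only for the demo):

```python
import io
import os
import tempfile

# Shim from the answer: the name 'unicode' only exists on Python 2.
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3

line = u"pr\xe9senter"
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# io.open() expects unicode text on both major versions, so no
# manual conversion to bytes is needed.
with io.open(path, "w", encoding="utf-8") as fout:
    fout.write(text_type(line) + u"\n")
```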

Python to C++ Character encoding

I have a C++ program that uses the Python C/API to call Python scripts for DB info, but the data received is not encoded in the right way. This is in France, so my data has accents and other non-English characters.
In a python terminal with the sys.defaultencoding set to "utf-8", an example:
>>> robin = 'testé'
>>> robin
'test\x82'
>>> print robin
testé
>>> str(robin)
'test\x82'
If I call:
PyString_AsString(PyObject_Repr(PyObject_GetAttrString(/*PyObject of my Py_Init*/, "robin")));
I get a char* filled with the following: test\x82
And creating a string or wstring from that yields the same result.
I would like to be able to create a string that says "testé", and I'm guessing that starts with being able to output the variable correctly in the python terminal, as in:
>>> robin = 'testé'
>>> robin
'testé'
I tried encode()/decode(), sys.setdefaultencoding, sys.stdout.encoding, and even force_text and force_bytes from Django. Nothing seems to be able to get me a standard C++ string with my actual characters in it. Any help would be greatly appreciated.
FYI - Python 2.7, Windows 8 x64, VS2012 and C++9
EDIT to answer to comments:
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>> sys.getdefaultencoding()
'utf-8'
>>> robin = 'testé'
>>> robin
'test\x82'
>>> print robin
testé
I just want whatever 'print' does to display the information correctly...
This is not as simple as it seems. (I was wrong earlier: the acute e in UTF-8 is C3 A9.) Working with encodings in the Python interpreter from the console is hard; there are several things you have to get right.
First, your console's default code page (encoding). You can check this by issuing the chcp command. Mine says 437, but it depends heavily on your Windows installation.
The code page for latin-1 is 28591 and the code page for utf-8 is 65001. Oddly enough, it is complicated to use the Python interpreter when the console is set to code page 65001; it seems that code page has never been declared a synonym for utf-8 in Python's encoding machinery.
My point here is that you have to get the mental model right: if your console is in code page X, your input to the Python interpreter will be encoded in X, and you'll see output however X renders the bytes.
I suggest you use unicode strings instead of hard-coded byte strings in Python, and use escape sequences instead of literal characters. For example, you can declare robin like this:
robin = u'test\xe9'
U+00E9 is the code point for é. After that, robin is unicode and can be encoded into any encoding you want, e.g. robin.encode('utf-8'). This way you have control over the variable and can encode it appropriately for every possible output scenario.
To sum it up:
Figure out your console's encoding
Encode the robin variable according to that encoding
The console should then output it correctly
Hope this is helpful!
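The advice above can be sketched in a few lines: keep the text as unicode internally and encode only at the output boundary, using whatever code page the console reports. cp437 is the code page from the chcp example above, and it is exactly where the \x82 in the question comes from:

```python
# Keep text as unicode internally; U+00E9 is é.
robin = u"test\xe9"

# Encode only at the boundary, per the target's encoding.
print(repr(robin.encode("utf-8")))   # UTF-8: e9 becomes c3 a9
print(repr(robin.encode("cp437")))   # cp437: é is byte 0x82, as in the question
```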
You call PyObject_Repr, which is the same as repr(robin) in Python and produces the literal characters \x82. Leave it out of your chain of calls.