encoding in py_compile vs import - python-2.7

When a Python script containing a non-ASCII character is compiled with py_compile.compile, it does not complain about the encoding. But when the same script is imported in Python 2.7, it gives:
SyntaxError: Non-ASCII character '\xe2' in file
Why is this happening? What's the difference between importing and compiling with py_compile?

It seems that Python provides two variants of its lexer: one used internally when Python itself parses files, and one that is exposed to Python code through e.g. __builtins__.compile or tokenize.generate_tokens. Only the former checks for non-ASCII characters, it seems. This is controlled by an #ifdef PGEN in Parser/tokenizer.c.
I have an educated guess as to why they did it this way: in Python 3, non-ASCII characters are permitted in .py files and are interpreted as UTF-8, IIRC. By silently permitting UTF-8 in the lexer, 2.7's tokenize.generate_tokens() function can accept all valid Py3 code.
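To illustrate the Python 3 side of that guess, here is a minimal sketch (Python 3 only, since tokenize.detect_encoding() does not exist in 2.7). The one-line module and the name `word` are made up for the example. It only shows that Python 3 falls back to UTF-8 when a source has no coding comment; it does not reproduce the 2.7 behaviour.

```python
# Python 3 only: tokenize.detect_encoding() does not exist in 2.7.
import io
import tokenize

# Hypothetical one-line module: a non-ASCII character and no coding comment.
source_bytes = u"word = 'pr\u00e9senter'\n".encode("utf-8")

# With no BOM and no coding comment, Python 3 falls back to UTF-8.
encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)

# compile() accepts the raw bytes and decodes them with the same default.
namespace = {}
exec(compile(source_bytes, "<demo>", "exec"), namespace)
word = namespace["word"]

print(encoding)  # utf-8
```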

Related

Problems in codification - unicode vs. utf-8 in python 2.7

Well, my Python script is supposed to open all UTF-8 YAML files in a directory and show the content to the user. But there are words with accents, such as the French word présenter, which is shown like this: u"pr\xe9senter". I need it to be shown properly to the user.
Here is my code:
import glob
import yaml
files = glob.glob("data/*.yaml")
def read_yaml_file(filename):
    # Load one YAML file and print its parsed content (or the parse error).
    with open(filename, 'r') as stream:
        try:
            print(yaml.safe_load(stream))
        except yaml.YAMLError as exc:
            print(exc)
for file in files:
    read_yaml_file(file)
I already tried to use the import from __future__, but it didn't work. Does anyone know how to solve it?
Unicode in 2.x is painful. If you can, use a current Python 3, in which text is Unicode and is printed without a 'u' prefix, while bytes are now printed with a 'b' prefix.
>>> print(u"pr\xe9senter") # 3.8
présenter
You also need a system console/terminal or IDE that displays glyphs for the codepoints in your yaml files.
If you are a masochist or otherwise stuck on 2.7, use sys.stdout.write(). Note that you must explicitly write '\n's.
>>> import sys; sys.stdout.write(u"pr\xe9senter\n") # 2.7
présenter
This question is not really about IDLE. However, the above lines work in both standard interactive Python on Windows 10 and in IDLE. IDLE uses tkinter which uses tcl/tk. Tk itself can handle all Basic Multilingual Plane (BMP) characters (the first 64K), but only those. Which BMP characters it can display depends on your OS and its current fonts.
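Printing a whole dict or list in 2.7 shows the repr of every string inside it, which is where u"pr\xe9senter" comes from. One possible workaround, which is not part of the answer above, is to render the loaded data as text before writing it out. The sketch below uses json.dumps with ensure_ascii=False for that; the `data` dict is a hypothetical stand-in for the result of yaml.safe_load(stream), and `render` is a name invented for the example.

```python
import json

# Hypothetical stand-in for what yaml.safe_load(stream) might return.
data = {u"verb": u"pr\u00e9senter"}

def render(obj):
    """Return obj as readable text, keeping accented characters unescaped."""
    return json.dumps(obj, ensure_ascii=False, sort_keys=True)

rendered = render(data)
```

Writing `rendered` with print() or sys.stdout.write() should then show the accent itself, provided the terminal can display it.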

How to set the UTF-8 as default encoding in Odoo Build?

Can anyone tell me how to set UTF-8 as the default encoding in an Odoo build?
Note: I have put "# -*- coding: utf-8 -*-" in all the files, but it has no effect on the encoding I expect.
If you put # coding: utf-8 at the top of a Python module, this affects how Python interprets the source code. This matters if your code contains string literals with non-ASCII characters, so that they represent the correct characters.
However, since you talk about "default encoding", I assume you care about the encoding of text files opened for reading or writing. In Python 2.x, the default for reading and writing files is not to decode/encode at all. I don't think you can change this default (because the built-in function open simply doesn't support encoding), but you can use io.open() or codecs.open() to open files with an explicit encoding.
Thus, to read from a file encoded with UTF-8, open it as follows:
import io
with io.open(filename, encoding='utf-8') as f:
    for line in f:
        ...
In Python 3, built-in open() is the same as io.open(), and the default encoding is platform-dependent.
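As a self-contained sketch of the io.open() approach, the following writes a small UTF-8 file to a temporary path and reads it back as text. The helper name `read_utf8_lines` and the sample words are invented for the example; it should run unchanged on 2.7 and 3.x.

```python
import io
import os
import tempfile

def read_utf8_lines(path):
    """Read a UTF-8 text file and return its lines as unicode strings."""
    with io.open(path, encoding="utf-8") as f:
        return [line.rstrip(u"\n") for line in f]

# Create a throwaway file so the example does not depend on existing data.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(u"pr\u00e9senter\nPA\u00c7OCA\n")
    lines = read_utf8_lines(path)
finally:
    os.remove(path)
```

Passing the encoding explicitly on every call avoids depending on any interpreter-wide or platform default.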

Different base64 encoding between python versions

I'm having trouble sending HTML code through JSON.
I'm noticing my string values are different between Python versions (2.7 and 3.5).
My string is something like: <html><p>PAÇOCA</p></html>
on Python 2.7:
x = '<html><p>PAÇOCA</p></html>'
base64.b64encode(x)
=> PGh0bWw+PHA+UEGAT0NBPC9wPjwvaHRtbD4=
on Python 3.5:
x = '<html><p>PAÇOCA</p></html>'
base64.b64encode(x)
=> b'PGh0bWw+PHA+UEHDh09DQTwvcD48L2h0bWw+'
Why are these values different?
How can I make the 3.5 string equal to the 2.7?
This is causing me troubles with receiving e-mails due to the accents being lost.
Your example x values are not valid Python so it is difficult to tell where the code went wrong, but the answer is to use Unicode strings and explicitly encode them to get consistent answers. The below code gives the same answer in Python 2 and 3, although Python 3 decorates byte strings with b'' when printed. Save the source file in the encoding declared via #coding. The source code encoding can be any encoding that supports the characters used in the source file. Typically UTF-8 is used for non-ASCII source code, but I made it deliberately different to show it doesn't matter.
#coding:cp1252
from __future__ import print_function
import base64
x = u'<html><p>PAÇOCA</p></html>'.encode('utf8')
enc = base64.b64encode(x)
print(enc)
Output using Pylauncher to choose the major Python version:
C:\>py -2 test.py
PGh0bWw+PHA+UEHDh09DQTwvcD48L2h0bWw+
C:\>py -3 test.py
b'PGh0bWw+PHA+UEHDh09DQTwvcD48L2h0bWw+'
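To see why the two values in the question differ, it may help to Base64-encode the same text under two different byte encodings. The sketch below is an inference rather than something the asker stated: the 2.7 value is consistent with the Ç having been stored as the single byte 0x80, as in a DOS code page such as cp850, while the 3.5 value matches UTF-8. The helper `b64_text` is a name made up for the example.

```python
import base64

html = u"<html><p>PA\u00c7OCA</p></html>"  # \u00c7 is C with cedilla

def b64_text(text, codec):
    """Encode text to bytes with codec, then Base64; return ASCII text."""
    return base64.b64encode(text.encode(codec)).decode("ascii")

as_utf8 = b64_text(html, "utf-8")
# Assumption: the 2.7 session in the question used a DOS code page.
as_cp850 = b64_text(html, "cp850")

print(as_utf8)   # PGh0bWw+PHA+UEHDh09DQTwvcD48L2h0bWw+
print(as_cp850)  # PGh0bWw+PHA+UEGAT0NBPC9wPjwvaHRtbD4=
```

Base64 works on bytes, so the result is only stable across versions once the text-to-bytes step is made explicit.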

how to use stdscr.addstr() (curses) to print unicode characters

I know how to use the print() function to print unicode characters, but I do not know how to do it using stdscr.addstr()
I'm using python 2.7 on a Linux operating system
Thanks
I'm pretty sure you need to encode the string.
The docs read:
Since version 5.4, the ncurses library decides how to interpret non-ASCII data using the nl_langinfo function. That means that you have to call locale.setlocale() in the application and encode Unicode strings using one of the system’s available encodings.
This example worked for me in 2.7.12
import locale
locale.setlocale(locale.LC_ALL, '')
stdscr.addstr(0, 0, mystring.encode('UTF-8'))

Error thrown even if a line is commented

I have this line:
#str = u'Harsha: This has unicode character ♭.\n'
This line causes SyntaxError: Non-ASCII character '\xe2' even if it's commented.
If I remove this line, the error is gone. Can anyone tell me what's wrong here?
I'm using PyCharm as IDE.
You want to add the following line at the top of your source file:
# -*- coding: utf-8 -*-
This tells Python the encoding of your source file.
Source: Working with utf-8 encoding in Python source
You need to hint the proper file encoding.
As you know, the byte e2 is represented by the binary string
1110 ...
This is ambiguous because it could be the UTF-8 lead byte of a three-byte sequence (which is the case here: ♭ is encoded in UTF-8 as e2 99 ad), or a single extended-ASCII character.
Python 2 defaults to ASCII (7-bit characters) for source files, which means that without an encoding hint every byte above 7 bits is considered ambiguous and leads to this error, even inside a comment.
You should either escape that character (for example u'\u266d') or declare the source encoding in a comment on the first or second line of the file, as specified in PEP 263.
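A short sketch of where the '\xe2' in the error message comes from, using the escaped form of the character so that the snippet itself stays pure ASCII. The variable names are chosen for the example only.

```python
# MUSIC FLAT SIGN written as an ASCII-only escape, so no coding comment is needed.
flat = u"\u266d"

# bytearray yields integers on both Python 2 and Python 3.
utf8_bytes = bytearray(flat.encode("utf-8"))
hex_bytes = [format(b, "02x") for b in utf8_bytes]
lead_bits = format(utf8_bytes[0], "08b")

print(hex_bytes)  # ['e2', '99', 'ad']
print(lead_bits)  # 11100010
```

The leading bits 1110 mark the first byte of a three-byte UTF-8 sequence, which is the byte Python 2.7 reports when the file has no encoding declaration.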