Different base64 encoding between python versions - python-2.7

I'm having trouble sending an html code through JSON.
I'm noticing my string values are different between python versions (2.7 and 3.5)
My string being something like: <html><p>PAÇOCA</p></html>
on Python 2.7:
x = '<html><p>PAÇOCA</p></html>'
base64.b64encode(x)
=> PGh0bWw+PHA+UEGAT0NBPC9wPjwvaHRtbD4=
on Python 3.5:
x = '<html><p>PAÇOCA</p></html>'
base64.b64encode(x)
=> b'PGh0bWw+PHA+UEHDh09DQTwvcD48L2h0bWw+'
Why are these values different?
How can I make the 3.5 string equal to the 2.7?
This is causing me troubles with receiving e-mails due to the accents being lost.

Your example x values are not valid Python so it is difficult to tell where the code went wrong, but the answer is to use Unicode strings and explicitly encode them to get consistent answers. The below code gives the same answer in Python 2 and 3, although Python 3 decorates byte strings with b'' when printed. Save the source file in the encoding declared via #coding. The source code encoding can be any encoding that supports the characters used in the source file. Typically UTF-8 is used for non-ASCII source code, but I made it deliberately different to show it doesn't matter.
#coding:cp1252
from __future__ import print_function
import base64
x = u'<html><p>PAÇOCA</p></html>'.encode('utf8')
enc = base64.b64encode(x)
print(enc)
Output using Pylauncher to choose the major Python version:
C:\>py -2 test.py
PGh0bWw+PHA+UEHDh09DQTwvcD48L2h0bWw+
C:\>py -3 test.py
b'PGh0bWw+PHA+UEHDh09DQTwvcD48L2h0bWw+'
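Both outputs from the question can be reproduced by changing only the bytes handed to base64. The following is a minimal sketch, assuming the Python 2 session ran in a console using code page 437; that is an inference from the byte 0x80 in the 2.7 output, not something the question states. The helper name b64_text is hypothetical.

```python
import base64

# An escape is used instead of a literal so the source file encoding does not matter.
text = u"<html><p>PA\u00c7OCA</p></html>"


def b64_text(value, encoding):
    """Encode text to bytes explicitly, then return the base64 result as ASCII text."""
    return base64.b64encode(value.encode(encoding)).decode("ascii")


# UTF-8 turns the C with cedilla into two bytes (0xC3 0x87).
utf8_result = b64_text(text, "utf-8")
# Assumption: code page 437 maps the same character to the single byte 0x80.
cp437_result = b64_text(text, "cp437")

print(utf8_result)
print(cp437_result)
```

Decoding the base64 result to ASCII text also removes the b'' decoration in Python 3, so the printed value looks the same in both major versions.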

Related

How to add non ASCII characters in a python list?

I am a new learner of python. I want to have a list of strings with non-ASCII characters.
This answer suggested a way to do this, but when I tried the code, I got some weird results. Please see the following MWE:
#-*- coding: utf-8 -*-
mylist = ["अ,ब,क"]
print mylist
The output was ['\xe0\xa4\x85,\xe0\xa4\xac,\xe0\xa4\x95']
When I use ASCII characters in the list, let's say ["a,b,c"] the output also is ['a,b,c']. I want the output of my code to be ["अ,ब,क"]
How to do this?
PS - I am using python 2.7.16
You want to mark these as Unicode strings.
mylist = [u"अ,ब,क"]
Depending on what you want to accomplish, if the data is just a single string, it might not need to be in a list. Or perhaps you want a list of strings?
mylist = [u"अ", u"ब", u"क"]
Python 3 brings a lot of relief to working with Unicode (and doesn't need the u sigil in front of Unicode strings, because all strings are Unicode), and should definitely be your learning target unless you are specifically tasked with maintaining legacy software after Python 2 is officially abandoned at the end of this year.
Regardless of your Python version, there may still be issues with displaying Unicode on your system, in particular on older systems and on Windows.
If you are unfamiliar with encoding issues, you'll want to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and perhaps the Python-specific Pragmatic Unicode.
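Use:
#-*- coding: utf-8 -*-
mylist = ["अ,ब,क"]
# Name the encoding explicitly; plain unicode(i) assumes ASCII and fails on these bytes
print [unicode(i, 'utf-8') for i in mylist]
Or use:
#-*- coding: utf-8 -*-
mylist = ["अ,ब,क"]
print map(lambda s: unicode(s, 'utf-8'), mylist)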
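For comparison, here is a small sketch of the Python 3 behaviour described in the first answer, where every string literal is Unicode. It assumes a terminal that can display Devanagari, and the variable names are illustrative only.

```python
# Python 3: string literals are Unicode, so no u prefix is needed.
mylist = ["अ", "ब", "क"]

# Printing a list shows the repr of each item; Python 3 keeps printable
# non-ASCII characters as they are instead of escaping them.
as_list = repr(mylist)

# Joining the items gives the text itself, without quotes or brackets.
joined = ",".join(mylist)

print(as_list)
print(joined)
```

In Python 2, printing the list itself still shows escapes even for Unicode strings, because a list prints the repr of its items; printing each item, or a joined string, shows the characters.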

Problems in codification - unicode vs. utf-8 in python 2.7

Well, my Python script is supposed to open all UTF-8 YAML files in a directory and show the content to the user. But there are words with accents, such as the French word présenter, which is shown like this: u"pr\xe9senter". I need it to be shown properly to the user.
Here is my code:
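import glob
import yaml  # PyYAML
# Collect every YAML file in the data directory
files = glob.glob("data/*.yaml")
def read_yaml_file(filename):
    # Parse one file and print its content, or the parse error
    with open(filename, 'r') as stream:
        try:
            print(yaml.safe_load(stream))
        except yaml.YAMLError as exc:
            print(exc)
for file in files:
    read_yaml_file(file)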
I already tried to use the import from __future__, but it didn't work. Does anyone know how to solve it?
Unicode in 2.x is painful. If you can, use current python 3, in which text is unicode, printed without a 'u' prefix, instead of bytes, which is now printed with a 'b' prefix.
>>> print(u"pr\xe9senter") # 3.8
présenter
You also need a system console/terminal or IDE that displays glyphs for the codepoints in your yaml files.
If you are a masochist or otherwise stuck on 2.7, use sys.stdout.write(). Note that you must explicitly write '\n's.
>>> import sys; sys.stdout.write(u"pr\xe9senter\n") # 2.7
présenter
This question is not really about IDLE. However, the above lines work in both standard interactive Python on Windows 10 and in IDLE. IDLE uses tkinter which uses tcl/tk. Tk itself can handle all Basic Multilingual Plane (BMP) characters (the first 64K), but only those. Which BMP characters it can display depends on your OS and its current fonts.
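The u"pr\xe9senter" form in the question is the repr that Python 2 shows when a whole container is printed. A hedged sketch of one workaround is to print each value on its own. The dictionary below is a hypothetical stand-in for whatever yaml.safe_load returns, and format_entries is an illustrative name, not part of any library.

```python
from __future__ import print_function, unicode_literals

# Hypothetical stand-in for the result of yaml.safe_load(stream).
data = {"verb": "pr\xe9senter"}


def format_entries(mapping):
    """Build one text line per entry so each value is shown as text, not as repr."""
    return ["{}: {}".format(key, value) for key, value in sorted(mapping.items())]


lines = format_entries(data)
for line in lines:
    print(line)
```

Whether the accent is actually displayed still depends on the console and its fonts, as noted above.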

encoding in py_compile vs import

When a Python script with a non-ASCII character is compiled using py_compile.compile, it does not complain about the encoding. But when it is imported in Python 2.7, it gives:
SyntaxError: Non-ASCII character '\xe2' in file
Why is this happening? What's the difference between importing and compiling with py_compile?
It seems that Python provides two variants of its lexer: one used internally when Python itself parses files, and one that is exposed to Python code through e.g. __builtins__.compile or tokenize.generate_tokens. Only the former checks for non-ASCII characters, it seems. It's controlled by an #ifdef PGEN in Parser/tokenizer.c.
I have an educated guess as to why they did it this way: in Python 3, non-ASCII characters are permitted in .py files, and are interpreted as UTF-8, IIRC. By silently permitting UTF-8 in the lexer, 2.7's tokenize.generate_tokens() function can accept all valid Py3 code.
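To see the Python-visible tokenizer interface handle non-ASCII source, here is a small sketch. It is written for Python 3, where the module is named tokenize and generate_tokens takes a readline callable that returns text, so it illustrates the interface rather than reproducing the 2.7 behaviour discussed above.

```python
import io
import tokenize

# One line of source containing a non-ASCII character (e with acute accent).
source = u"name = 'caf\u00e9'\n"

# generate_tokens pulls lines from a readline callable and yields tokens.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Keep only the string literals found in the source.
string_literals = [tok[1] for tok in tokens if tok[0] == tokenize.STRING]
print(string_literals)
```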

Print special character from utf-8 encoded string

I'm having trouble dealing with encoding in Python:
I get some strings from a csv that I open using pandas.read_csv(), they are encoded in unicode so I encode it to utf-8 doing the following
# data is from my csv
string = data.encode('utf-8')
print string
However, when I print it, I get
"Parc d'Activit\xc3\xa9s des Gravanches"
and I would like to get
"Parc d'Activités des Gravanches"
It seems like an easy issue but I'm quite new to python and did not find anything close enough to my problem.
Note: I am using Python 2.7 and my file starts with
#!/usr/bin/env python2.7
# coding: utf8
EDIT: I just saw that you are using Python 2. Okay, I think the answer below is still valuable though.
In Python 2 this is even more complicated and inconsistent. Here you have str and unicode, and the default str doesn't support unicode stuff.
Anyways, the situation is more or less the same, use decode instead of encode to convert from str to unicode. That should fix it.
More info at: https://pythonhosted.org/kitchen/unicode-frustrations.html
This is a common source of confusion. The issue is a bit complex, but I'll try to simplify it. I'm talking about Python 3 here; I believe there are several differences with Python 2.
There's two types of what you would call a string: str and bytes.
str is the general string type in Python. It supports unicode seamlessly in Python 3, but the way it stores the actual data internally is not relevant; it's an object.
bytes is a byte array, like char* in C. It's a sequence of bytes.
Strings can be represented both ways, but you need to specify an encoding standard to translate between the two, as bytes needs to be interpreted, because it's just, again, a raw array of bytes.
encode converts a str into bytes, and that's the mistake you are making. Of course, if you print bytes it will just show its raw data, that is, the string encoded as utf-8.
decode does the opposite operation, that may be what you need.
However, if you open the file normally (open(file_name, 'r')) instead of in byte mode (open(file_name, 'rb')), which I doubt you are doing, you shouldn't need to do anything; printing data should just work as you want it to.
More info at: https://docs.python.org/3/howto/unicode.html
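The two directions described above can be sketched as follows, using the string from the question. The variable names are illustrative, and the snippet assumes Python 3.

```python
text = u"Parc d'Activit\u00e9s des Gravanches"

# encode: text -> bytes. The accented letter becomes two bytes in UTF-8.
raw = text.encode("utf-8")

# decode: bytes -> text. This is the direction needed to get readable text back.
restored = raw.decode("utf-8")

print(repr(raw))
print(restored)
```

Printing the bytes shows the \xc3\xa9 escapes seen in the question, while printing the decoded text shows the accent.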

Python to C++ Character encoding

I have a C++ program that uses the Python C/API to call Python scripts for DB info, but the data received is not encoded in the right way. This is in France, so my data has accents and other non-English characters.
In a python terminal with the sys.defaultencoding set to "utf-8", an example:
>>> robin = 'testé'
>>> robin
'test\x82'
>>> print robin
testé
>>> str(robin)
'test\x82'
If I call:
PyString_AsString(PyObject_Repr(PyObject_GetAttrString(/*PyObject of my Py_Init*/, "robin")));
I get a char* filled with the following: test\x82
And creating a string or wstring from that yields the same result.
I would like to be able to create a string that says "testé", and I'm guessing that starts with being able to output the variable correctly in the python terminal, as in:
>>> robin = 'testé'
>>> robin
'testé'
I tried encode() decode(), sys.setdefaultencoding, sys.stdout.encoding, and even some force_text and force_bytes from Django. Nothing seems to be able to get me a standard C++ string with my actual characters in it. Any help would be greatly appreciated.
FYI - Python 2.7, Windows 8 x64, VS2012 and C++9
EDIT to answer to comments:
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>> sys.getdefaultencoding()
'utf-8'
>>> robin = 'testé'
>>> robin
'test\x82'
>>> print robin
testé
I just want whatever 'print' does to display the information correctly...
This is not as simple as it seems. I was wrong: é in UTF-8 is c3 a9. Working with encodings from the console with the Python interpreter is hard. There are several things you have to get right.
First, your console's default code page (encoding). You can check this by issuing the chcp command. Mine says 437, but it depends heavily on your Windows installation.
The code page for latin-1 is 28591 and the code page for utf-8 is 65001. Oddly enough, it is complicated to use the Python interpreter when the console has code page 65001; it seems it has not been declared as a synonym for utf-8 in Python's encoding libraries.
My point here is that you have to get your mind right. If your console is in code page X, your input to the Python interpreter will be encoded in X, and you'll see the output the way X is able to render the bytes.
I suggest you use unicode instead of hard-coded encoded strings in Python, and use escape sequences instead of literal characters. For example, you can declare robin like this:
robin = u'test\xe9'
U+00E9 is the code point for é. After that, robin is unicode and can be encoded into any encoding you want, like this: robin.encode('utf-8'). This way you have control over the variable and can encode it for every possible output scenario.
To summarize:
Figure out your console's encoding
encode the robin variable according to this encoding
The console should output it right
Hope this is helpful!
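As an illustration of the steps above, this sketch encodes the same text for three targets. It assumes code page 437, the one this answer reports from chcp; the asker's console may use a different one. The variable names are illustrative.

```python
robin = u"test\xe9"  # U+00E9 is e with acute accent

# The same text produces different bytes depending on the target encoding.
as_cp437 = robin.encode("cp437")      # assumption: console code page 437
as_latin1 = robin.encode("latin-1")
as_utf8 = robin.encode("utf-8")

for name, value in [("cp437", as_cp437), ("latin-1", as_latin1), ("utf-8", as_utf8)]:
    print(name, repr(value))
```

Under that assumption, the cp437 bytes end in \x82, which matches the 'test\x82' shown in the question.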
You call PyObject_Repr which is the same as repr(robin) in Python, and produces the literal characters \x82. Leave it out from your chain of calls.
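The effect of repr can be seen from Python itself. This is a small sketch for Python 3, where a bytes object stands in for the Python 2 str from the question.

```python
raw = b"test\x82"  # five bytes; the last one is the encoded accented letter

# repr() builds a printable description: the single byte 0x82 is spelled out
# as the four characters backslash, x, 8, 2.
described = repr(raw)

print(len(raw))
print(described)
```

The described text is longer than the data it describes, which is why passing the repr result on to C++ yields the literal characters \x82 instead of the original byte.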