Unicode characters output from python I/O to files - python-2.7

I don't know if this is my misunderstanding of UTF-8 or of python, but I'm having trouble understanding how python writes Unicode characters to a file. I'm on a Mac under OSX by the way, if that makes a difference.
Let's say I have the following unicode string
foo=u'\x93Stuff in smartquotes\x94\n'
Here \x93 and \x94 are those awful smart-quotes.
Then I write it to a file:
with open('file.txt','w') as file:
    file.write(foo.encode('utf8'))
When I open the file in a text editor like TextWrangler or in a web browser file.txt seems like it was written as
\xc2\x93Stuff in smartquotes\xc2\x94\n
The text editor properly understands the file to be UTF8 encoded, but it renders \xc2\x93 as garbage. If I go in and manually strip out the \xc2 part, I get what I expect, and TextWrangler and Firefox render the utf characters as smartquotes.
This is exactly what I get when I read the file back into python without decoding it as 'utf8'. However, when I do read it in with the read().decode('utf8') method, I get back what I originally put in, without the \xc2 bit.
This is driving me bonkers, because I'm trying to parse a bunch of html files into text and the incorrect rendering of these unicode characters is screwing up a bunch of stuff.
I also tried it in python3 using the read/write methods normally, and it has the same behavior.
edit: Regarding stripping out the \xc2 manually, it turns out that it was rendering correctly when I did that because the browser and text editors were defaulting to Latin encoding.
Also, as a follow-up, Firefox renders the text as
☐Stuff in smartquotes☐
where the boxes are empty unicode values, while Chrome renders the text as
Stuff in smartquotes

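The problem is that u'\x93' and u'\x94' are not the Unicode code points for smart quotes. They are smart quotes in the Windows-1252 encoding, which is not the same as the latin1 encoding. In latin1 (and in Unicode, whose first 256 code points match latin1), those values are C1 control characters with no printable glyph, which is why they have no name below.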
>>> import unicodedata as ud
>>> ud.name(u'\x93')
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
ValueError: no such name
>>> import unicodedata as ud
>>> ud.name(u'\x94')
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
ValueError: no such name
>>> ud.name(u'\u201c')
'LEFT DOUBLE QUOTATION MARK'
>>> ud.name(u'\u201d')
'RIGHT DOUBLE QUOTATION MARK'
So you should use one of:
foo = u'\u201cStuff in smartquotes\u201d'
foo = u'\N{LEFT DOUBLE QUOTATION MARK}Stuff in smartquotes\N{RIGHT DOUBLE QUOTATION MARK}'
or in a UTF-8 source file:
#coding:utf8
foo = u'“Stuff in smartquotes”'
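To tie this back to the file-writing question, here is a minimal sketch of writing the corrected string to disk. It was written against Python 3 (io.open is also available on Python 2.7); the helper name write_text and the temporary path are assumptions made up for the illustration.

```python
import io
import os
import tempfile


def write_text(path, text):
    """Write a Unicode string to path as UTF-8 (hypothetical helper)."""
    # io.open encodes on write, so no manual .encode() call is needed.
    with io.open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(text)


foo = u"\u201cStuff in smartquotes\u201d\n"
path = os.path.join(tempfile.mkdtemp(), "file.txt")
write_text(path, foo)

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as handle:
    raw = handle.read()
print(repr(raw))
```

Each smart quote should come out as a three-byte UTF-8 sequence (\xe2\x80\x9c and \xe2\x80\x9d) rather than \xc2\x93 and \xc2\x94.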
Edit: If you somehow have a Unicode string with those incorrect bytes in it, here's a way to fix them. The first 256 Unicode codepoints map 1:1 with latin1 encoding, so it can be used to encode a mis-decoded Unicode string directly back to a byte string so the correct decoding can be used:
>>> foo = u'\x93Stuff in smartquotes\x94'
>>> foo
u'\x93Stuff in smartquotes\x94'
>>> foo = foo.encode('latin1').decode('windows-1252')
>>> foo
u'\u201cStuff in smartquotes\u201d'
>>> print foo
“Stuff in smartquotes”
If you have the UTF-8-encoded version of the incorrect Unicode characters:
>>> foo = '\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo = foo.decode('utf8').encode('latin1').decode('windows-1252')
>>> foo
u'\u201cStuff in smartquotes\u201d'
>>> print foo
“Stuff in smartquotes”
And in the very worst case, if you have the following Unicode string:
>>> foo = u'\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo.encode('latin1') # back to a UTF-8 encoded byte string.
'\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo.encode('latin1').decode('utf8') # Undo the UTF-8, but Unicode is still wrong.
u'\x93Stuff in smartquotes\x94'
>>> foo.encode('latin1').decode('utf8').encode('latin1') # back to a byte string.
'\x93Stuff in smartquotes\x94'
>>> foo.encode('latin1').decode('utf8').encode('latin1').decode('windows-1252') # Now decode correctly.
u'\u201cStuff in smartquotes\u201d'
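The repairs above can be wrapped in small helpers. This is only a sketch, written against Python 3; the function names are made up for the example, and it assumes the damage is exactly "Windows-1252 bytes decoded as latin1".

```python
def fix_cp1252_as_latin1(text):
    """Repair text whose Windows-1252 bytes were decoded as latin1."""
    # latin1 maps code points 0-255 straight back to the original bytes,
    # which can then be decoded with the encoding that was really used.
    return text.encode("latin1").decode("windows-1252")


def fix_from_utf8_bytes(data):
    """Same repair, starting from the UTF-8 encoding of the bad string."""
    return fix_cp1252_as_latin1(data.decode("utf-8"))


broken = u"\x93Stuff in smartquotes\x94"
fixed = fix_cp1252_as_latin1(broken)
fixed_from_bytes = fix_from_utf8_bytes(b"\xc2\x93Stuff in smartquotes\xc2\x94")
print(fixed == u"\u201cStuff in smartquotes\u201d")
```

Both helpers raise on input that does not fit the assumption: a code point above U+00FF cannot be encoded as latin1, and a few bytes (such as 0x81) are undefined in Python's windows-1252 codec.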

how to get python to recognize the ® symbol [duplicate]

Hi there, I am trying to make Python recognize ® as a symbol (in case it doesn't show up well here, it is the capital R within a circle, known as the 'registered' symbol).
I understand that it is not recognized in Python because of ASCII, but I was wondering if anyone knows of a way to use a different decoding system that includes this symbol, or a method to make Python 'ignore' it.
For some context:
I am trying to make an auto-checkout program for a website, so my program needs to match the item that the user wants. To do this I am using BeautifulSoup to scrape information; however, this symbol '®' is within the names of a few of the items, causing Python to crash.
Here is the current command that I am using but is not working due to ASCII:
for colour in soup.find_all('a', attrs={"class":"name-link"}, href=True):
    CnI.append(str(colour.text))
    Uhrefs.append(str(colour.get('href')))
Any help would be appreciated
Here is the entirety of the program so far(ignore the mess nowhere near done):
import time
import webbrowser
from selenium import webdriver
import mechanize
from bs4 import BeautifulSoup
import urllib2
from selenium.webdriver.support.ui import Select
CnI = []
item = []
colour = []
Uhrefs = []
Whrefs = []
FinalColours = []
selectItemindex = []
selectColourindex = []
#counters
Ccounter = 0
Icounter = 0
Splitcounter = 1
#wanted items suffix options:jackets, shirts, tops_sweaters, sweatshirts, pants, shorts, hats, bags, accessories, skate
suffix = 'accessories'
Wcolour = 'Black'
Witem = '2-Tone Nylon 6-Panel'
driver=webdriver.Chrome()
driver.get('http://www.supremenewyork.com/shop/all/'+suffix)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup)
for colour in soup.find_all('a', attrs={"class":"name-link"}, href=True):
    CnI.append(str(colour.text))
    Uhrefs.append(str(colour.get('href')))
    print(colour)
print('#############')
for each in CnI:
    each.split(',')
    print(each)
while Splitcounter<=len(CnI):
    item.append(CnI[Splitcounter-1])
    FinalColours.append(CnI[Splitcounter])
    Whrefs.append(Uhrefs[Splitcounter])
    Splitcounter+=2
print(Uhrefs)
for each in item:
    print(each)
for z in FinalColours:
    print(z)
for i in Whrefs:
    print(i)
##for i in item:
## hold = item.index(i)
## print(hold)
## if Witem == i and Wcolour == FinalColours[i]:
## print('correct')
##
##
for count,elem in enumerate(item):
    if Witem in elem:
        selectItemindex.append(count+1)
for count,elem in enumerate(FinalColours):
    if Wcolour in elem:
        selectColourindex.append(count+1)
print(selectColourindex)
print(selectItemindex)
for each in selectColourindex:
    if selectColourindex[Ccounter] in selectItemindex:
        point = selectColourindex[Ccounter]
        print(point)
    else:
        Ccounter+=1
web = 'http://www.supremenewyork.com'+Whrefs[point-1]
driver.get(web)
elem1 = driver.find_element_by_name('commit')
elem1.click()
time.sleep(1)
elem2 = driver.find_element_by_link_text('view/edit basket')
elem2.click()
time.sleep(1)
elem3 = driver.find_element_by_link_text('checkout now')
elem3.click()
"®" is a non-ASCII character (U+00AE), so in Python 2 calling str() on a Unicode string that contains it fails, because str() encodes with the default ASCII codec. Instead of using str(), use something like this:
unicode(input_string, 'utf8')
# or
unicode(input_string, 'unicode-escape')
Edit: Given the code surrounding the initial snippet that was posted later, and the fact that BeautifulSoup actually returns unicode already, it seems that removing str() might be the best course of action, and @MarkTolonen's answer is spot-on.
BeautifulSoup returns Unicode strings. Stop converting them back to byte strings. Best practice when dealing with text is to:
1. Decode incoming text to Unicode (what BeautifulSoup is doing).
2. Process all text using Unicode.
3. Encode outgoing Unicode text to bytes (to file, to database, to sockets, etc.).
Small example of your issue:
text = u'\N{REGISTERED SIGN}' # syntax to create a Unicode codepoint by name.
bytes = str(text)
Output:
Traceback (most recent call last):
File "test.py", line 2, in <module>
bytes = str(text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 0: ordinal not in range(128)
Note that the first line works and supports the character. Converting it to a byte string fails because the conversion defaults to encoding in ASCII. You can explicitly encode it with another encoding (e.g. bytes = text.encode('utf8')), but that breaks rule 2 above and creates other issues.
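The decode, process, encode pattern can be sketched as follows. This is an illustration only, written against Python 3; the function names and the sample item names are invented and are not taken from the asker's program.

```python
def decode_incoming(raw, encoding="utf-8"):
    """Step 1: turn incoming bytes into Unicode text."""
    return raw.decode(encoding)


def pick_items(text, wanted):
    """Step 2: work on Unicode only, with no str() round trips."""
    return [line for line in text.splitlines() if wanted in line]


def encode_outgoing(lines, encoding="utf-8"):
    """Step 3: encode once, at the boundary where bytes are required."""
    return u"\n".join(lines).encode(encoding)


# Hypothetical scraped data arriving as UTF-8 bytes.
page = u"Widget\N{REGISTERED SIGN} Cap\nPlain Cap\n".encode("utf-8")
matches = pick_items(decode_incoming(page), u"\N{REGISTERED SIGN}")
payload = encode_outgoing(matches)
print(matches)
```

With BeautifulSoup the first step is already done for you, so in the asker's loop the equivalent change is simply to append colour.text without wrapping it in str().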
Suggested reading:
https://nedbatchelder.com/text/unipain.html
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Get non-ASCII filename from S3 notification event in Lambda

The key field in an AWS S3 notification event, which denotes the filename, is URL escaped.
This is evident when the filename contains spaces or non-ASCII characters.
For example, I have uploaded the following filename to S3:
my file řěąλλυ.txt
The notification is received as:
{
  "Records": [
    {
      "s3": {
        "object": {
          "key": u"my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"
        }
      }
    }
  ]
}
I've tried to decode using:
key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf-8')
but that yields:
my file ÅÄÄλλÏ.txt
Of course, when I then try to get the file from S3 using Boto, I get a 404 error.
tl;dr
You need to convert the URL-encoded Unicode string to a byte str before URL-unquoting it and decoding the result as UTF-8.
For example, for an S3 object with the filename: my file řěąλλυ.txt:
>>> utf8_urlencoded_key = event['Records'][0]['s3']['object']['key'].encode('utf-8')
# encodes the Unicode string to utf-8 encoded [byte] string. The key shouldn't contain any non-ASCII at this point, but UTF-8 will be safer.
'my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt'
>>> key_utf8 = urllib.unquote_plus(utf8_urlencoded_key)
# the previous url-escaped UTF-8 are now converted to UTF-8 bytes
# If you passed a Unicode object to unquote_plus, you'd have got a
# Unicode with UTF-8 encoded bytes!
'my file \xc5\x99\xc4\x9b\xc4\x85\xce\xbb\xce\xbb\xcf\x85.txt'
# Decodes key_utf8 to a Unicode string
>>> key = key_utf8.decode('utf-8')
u'my file \u0159\u011b\u0105\u03bb\u03bb\u03c5.txt'
# Note the u prefix. The utf-8 bytes have been decoded to Unicode points.
>>> type(key)
<type 'unicode'>
>>> print(key)
my file řěąλλυ.txt
Background
AWS have committed the cardinal sin of changing the default encoding - https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/
The error you should've got from your decode() is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-19: ordinal not in range(128)
The value of key is a Unicode string. In Python 2.x you can call decode() on a Unicode string, even though it doesn't make sense. To do so, Python first encodes it to a [byte] str using the default encoding, then decodes that with the encoding you gave. In Python 2.x the default encoding should be ASCII, which of course can't represent the characters used.
Had you got the proper UnicodeEncodeError from Python, you may have found suitable answers. On Python 3, you wouldn't have been able to call .decode() at all.
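The garbling can be reproduced and undone in a few lines. This is a sketch written against Python 3, where unquote_plus takes an encoding argument; passing "latin-1" is used here only to imitate what Python 2's unquote_plus does with a Unicode argument (each %XX escape becomes its own code point), which is an assumption drawn from the explanation above.

```python
from urllib.parse import unquote_plus

escaped = "my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"

# Imitate the faulty path: every %XX escape turns into a separate character.
garbled = unquote_plus(escaped, encoding="latin-1")

# Undo it: back to the raw bytes, then decode them as the UTF-8 they really are.
repaired = garbled.encode("latin-1").decode("utf-8")

# The direct route: unquote as UTF-8 (the default in Python 3).
correct = unquote_plus(escaped)

print(repaired == correct)
```

The garbled value is the string from the question, displayed there as my file ÅÄÄλλÏ.txt, with the unprintable control characters dropped by the renderer.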
Just in case someone else comes here hoping for a JavaScript solution, here's what I ended up with:
function decodeS3EventKey (key = '') {
  return decodeURIComponent(key.replace(/\+/g, ' '))
}
With limited testing, it seems to work fine:
test+image+%C3%BCtf+%E3%83%86%E3%82%B9%E3%83%88.jpg decodes to test image ütf テスト.jpg
my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt decodes to my file řěąλλυ.txt
For python 3:
from urllib.parse import unquote_plus
result = unquote_plus('input/%D0%BF%D1%83%D1%81%D1%82%D0%BE%D0%B8%CC%86.pdf')
print(result)
# prints 'input/пустой.pdf'

Python 2.7 and Sublime 2 + unicode don't mix

First of all, I've looked here: Sublime Text 3, Python 3 and UTF-8 don't like each other and read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets but am still none the wiser to the following:
Running Python from a file created in Sublime (not compiling) and executing via command prompt on an XP machine
I have a couple of text files named with accents (German, Spanish and French mostly). I want to remove the accented characters (umlauts, acutes, graves, cedillas etc.) and replace them with their equivalent non-accented look-alikes.
I can strip the accents if they are in a string from within the script. But accessing a text file of the same name causes the stripAccents function to fail. I'm all out of ideas, as I think this is due to a conflict between Sublime and Python.
Here's my script
# -*- coding: utf-8 -*-
import unicodedata
import os
def stripAccents(s):
    try:
        us = unicode(s,"utf-8")
        nice = unicodedata.normalize("NFD", us).encode("ascii", "ignore")
        print nice
        return nice
    except:
        print ("Fail! : %s" %(s))
        return None
stripAccents("Découvrez tous les logiciels à télécharger")
# Decouvrez tous les logiciels a telecharger
stripAccents("Östblocket")
# Ostblocket
stripAccents("Blühende Landschaften")
# Bluhende Landschaften
root = "D:\\temp\\test\\"
for path, subdirs, files in os.walk(root):
    for name in files:
        x = name
        x = stripAccents(x)
For the record:
C:\>chcp
gets me 437
The error in full is:
C:\WINDOWS\system32>D:\LearnPython\unicode_accents.py
Decouvrez tous les logiciels a telecharger
Ostblocket
Bluhende Landschaften
Traceback (most recent call last):
File "D:\LearnPython\unicode_accents.py", line 37, in <module>
x = stripAccents(x)
File "D:\LearnPython\unicode_accents.py", line 8, in stripAccents
us = unicode(s,"utf-8")
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 2: invalid start byte
C:\WINDOWS\system32>
root = "D:\\temp\\test\\"
for path, subdirs, files in os.walk(root):
If you want to read Windows filenames in their native Unicode form, you have to ask for that specifically, by passing a Unicode string to filesystem functions:
root = u"D:\\temp\\test\\"
Otherwise Python will default to using the standard byte-based interfaces to the filesystem. On Windows, these return filenames to you encoded in the system's locale-specific legacy encoding (ANSI code page).
In stripAccents you try to decode the byte string you got from here using UTF-8, but the ANSI code page is never UTF-8, and the byte sequence you have doesn't happen to be a valid UTF-8 sequence so you get an error. You can decode from the ANSI code page using the pseudo-encoding mbcs, but it would be better to stick to Unicode filepath strings so you can include characters that don't fit in ANSI.
Always use Unicode strings to represent text in Python. Add from __future__ import unicode_literals at the top so that all "" literals would create Unicode strings. Or use u"" literals everywhere. Drop unicode(s, 'utf-8') from stripAccents(), always pass Unicode strings instead (try unidecode package, to transliterate Unicode to ascii).
Using Unicode solves several issues transparently:
there won't be a UnicodeDecodeError, because Windows provides a Unicode API for filenames: if you pass Unicode input, you get Unicode output
you won't get mojibake when a bytestring containing text encoded using your Windows encoding such as cp1252 is displayed in a console using cp437 encoding e.g., Blühende -> Blⁿhende (ü is corrupted)
you might be able to work with text that can't be represented in either cp1252 or cp437 e.g., '❤' (U+2764 HEAVY BLACK HEART).
To print Unicode text to Windows console, you could use win-unicode-console package.
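Putting the advice together, a Unicode-only version of the accent stripper might look like the sketch below. It is written against Python 3, where all str values are Unicode. The function names are made up, and it differs from the original script in one respect: it drops only combining marks instead of encoding to ASCII with "ignore", so characters that have no decomposition are kept rather than deleted.

```python
import os
import unicodedata


def strip_accents(text):
    """Remove combining marks from a Unicode string."""
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stripped_filenames(root):
    """Yield accent-free versions of the file names found under root.

    root must be a Unicode string so that os.walk returns Unicode names.
    """
    for _path, _subdirs, files in os.walk(root):
        for name in files:
            yield strip_accents(name)


print(strip_accents(u"Bl\u00fchende Landschaften"))
```

On Python 2.7 the same idea applies as long as the root passed to os.walk is a u"" literal, as described above.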

Why does ElementTree reject UTF-16 XML declarations with "encoding incorrect"?

In Python 2.7, when passing a unicode string to ElementTree's fromstring() method that has encoding="UTF-16" in the XML declaration, I'm getting a ParseError saying that the encoding specified is incorrect:
>>> from xml.etree import ElementTree
>>> data = u'<?xml version="1.0" encoding="utf-16"?><root/>'
>>> ElementTree.fromstring(data)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1300, in XML
parser.feed(text)
File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: encoding specified in XML declaration is incorrect: line 1, column 30
What does that mean? What makes ElementTree think so?
After all, I'm passing in unicode codepoints, not a byte string. There is no encoding involved here. How can it be incorrect?
Of course, one could argue that any encoding is incorrect, as these unicode codepoints are not encoded. However, then why is UTF-8 not rejected as "incorrect encoding"?
>>> ElementTree.fromstring(u'<?xml version="1.0" encoding="utf-8"?><root/>')
I can solve this problem easily either by encoding the unicode string into a UTF-16-encoded byte string and passing that to fromstring() or by replacing encoding="utf-16" with encoding="utf-8" in the unicode string, but I would like to understand why that exception is raised. The documentation of ElementTree says nothing about only accepting byte strings.
Specifically, I would like to avoid these additional operations because my input data can get quite big and I would like to avoid having them twice in memory and the CPU overhead of processing them more than absolutely necessary.
I'm not going to try to justify the behavior, but to explain why it's actually happening with the code as written.
In short: the XML parser that Python uses, expat, operates on bytes, not unicode characters. You MUST call .encode('utf-16-be') or .encode('utf-16-le') on the string before you pass it to ElementTree.fromstring:
ElementTree.fromstring(data.encode('utf-16-be'))
Proof: ElementTree.fromstring eventually calls down into pyexpat.xmlparser.Parse, which is implemented in pyexpat.c:
static PyObject *
xmlparse_Parse(xmlparseobject *self, PyObject *args)
{
    char *s;
    int slen;
    int isFinal = 0;
    if (!PyArg_ParseTuple(args, "s#|i:Parse", &s, &slen, &isFinal))
        return NULL;
    return get_parse_result(self, XML_Parse(self->itself, s, slen, isFinal));
}
So the unicode parameter you passed in gets converted using s#. The docs for PyArg_ParseTuple say:
s# (string, Unicode or any read buffer compatible object) [const char *, int (or Py_ssize_t, see below)] This variant on s stores into two C variables, the first one a pointer to a character string, the second one its length. In this case the Python string may contain embedded null bytes. Unicode objects pass back a pointer to the default encoded string version of the object if such a conversion is possible. All other read-buffer compatible objects pass back a reference to the raw internal data representation.
Let's check this out:
from xml.etree import ElementTree
data = u'<?xml version="1.0" encoding="utf-8"?><root>\u2163</root>'
print ElementTree.fromstring(data)
gives the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2163' in position 44: ordinal not in range(128)
which means that when you were specifying encoding="utf-8", you were just getting lucky that there weren't non-ASCII characters in your input when the Unicode string got encoded to ASCII. If you add the following before you parse, UTF-8 works as expected with that example:
import sys
reload(sys).setdefaultencoding('utf8')
However, it doesn't work to set the default encoding to 'utf-16-be' or 'utf-16-le', since the Python parts of ElementTree do direct string comparisons, which start to fail in UTF-16 land.
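For completeness, here is a small sketch of the encode-first route using a non-ASCII character. It was written against Python 3 and uses the plain "utf-16" codec, which prepends a byte order mark; that choice is an assumption of this example, not something the code above requires, and Python 2.7 is assumed to behave the same way.

```python
from xml.etree import ElementTree

data = u'<?xml version="1.0" encoding="utf-16"?><root>\u2163</root>'

# Hand the parser real UTF-16 bytes, so the declared encoding
# matches the bytes that expat actually sees.
encoded = data.encode("utf-16")
root = ElementTree.fromstring(encoded)

print(root.tag)
```

This does hold a second, encoded copy of the document in memory, which is the cost the question was hoping to avoid.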

"UnicodeEncodeError: 'ascii' codec can't encode character"

I'm trying to pass big strings of random html through regular expressions and my Python 2.6 script is choking on this:
UnicodeEncodeError: 'ascii' codec can't encode character
I traced it back to a trademark superscript on the end of this word: Protection™ -- and I expect to encounter others like it in the future.
Is there a module to process non-ascii characters? or, what is the best way to handle/escape non-ascii stuff in python?
Thanks!
Full error:
E
======================================================================
ERROR: test_untitled (__main__.Untitled)
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\Python26\Test2.py", line 26, in test_untitled
ofile.write(Whois + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 1005: ordinal not in range(128)
Full Script:
from selenium import selenium
import unittest, time, re, csv, logging
class Untitled(unittest.TestCase):
    def setUp(self):
        self.verificationErrors = []
        self.selenium = selenium("localhost", 4444, "*firefox", "http://www.BaseDomain.com/")
        self.selenium.start()
        self.selenium.set_timeout("90000")
    def test_untitled(self):
        sel = self.selenium
        spamReader = csv.reader(open('SubDomainList.csv', 'rb'))
        for row in spamReader:
            sel.open(row[0])
            time.sleep(10)
            Test = sel.get_text("//html/body/div/table/tbody/tr/td/form/div/table/tbody/tr[7]/td")
            Test = Test.replace(",","")
            Test = Test.replace("\n", "")
            ofile = open('TestOut.csv', 'ab')
            ofile.write(Test + '\n')
            ofile.close()
    def tearDown(self):
        self.selenium.stop()
        self.assertEqual([], self.verificationErrors)
if __name__ == "__main__":
    unittest.main()
You're trying to convert unicode to ascii in "strict" mode:
>>> help(str.encode)
Help on method_descriptor:
encode(...)
    S.encode([encoding[,errors]]) -> object
    Encodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
    'xmlcharrefreplace' as well as any other name registered with
    codecs.register_error that is able to handle UnicodeEncodeErrors.
You probably want something like one of the following:
s = u'Protection™'
print s.encode('ascii', 'ignore') # removes the ™
print s.encode('ascii', 'replace') # replaces with ?
print s.encode('ascii','xmlcharrefreplace') # turn into xml entities
print s.encode('ascii', 'strict') # throw UnicodeEncodeErrors
You're trying to pass a bytestring to something, but it's impossible (from the scarcity of info you provide) to tell what you're trying to pass it to. You start with a Unicode string that cannot be encoded as ASCII (the default codec), so you'll have to encode with some different codec (or transliterate it, as @R.Pate suggests) -- but it's impossible for us to say what codec you should use, because we don't know what you're passing the bytestring to and therefore don't know what that unknown subsystem is going to be able to accept and process correctly in terms of codecs.
In such total darkness as you leave us in, utf-8 is a reasonable blind guess (since it's a codec that can represent any Unicode string exactly as a bytestring, and it's the standard codec for many purposes, such as XML) -- but it can't be any more than a blind guess, until and unless you're going to tell us more about what you're trying to pass that bytestring to, and for what purposes.
Passing thestring.encode('utf-8') rather than bare thestring will definitely avoid the particular error you're seeing right now, but it may result in peculiar displays (or whatever it is you're trying to do with that bytestring!) unless the recipient is ready, willing and able to accept utf-8 encoding (and how could WE know, having absolutely zero idea about what the recipient could possibly be?!-)
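As a sketch of that blind guess applied to the script's output file: the helper below appends one line as UTF-8. It is written against Python 3 (io.open also exists on Python 2.6 and later); the name append_line and the temporary path are assumptions for the example, and whether UTF-8 is right still depends on whatever reads TestOut.csv afterwards.

```python
import io
import os
import tempfile


def append_line(path, text):
    """Append one line of Unicode text to path, encoded as UTF-8."""
    # The file object does the encoding, so the caller stays in Unicode.
    with io.open(path, "a", encoding="utf-8", newline="\n") as handle:
        handle.write(text + u"\n")


path = os.path.join(tempfile.mkdtemp(), "TestOut.csv")
append_line(path, u"Protection\u2122")

with open(path, "rb") as handle:
    written = handle.read()
print(repr(written))
```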
The "best" way always depends on your requirements; so, what are yours? Is ignoring non-ASCII appropriate? Should you replace ™ with "(tm)"? (Which looks fancy for this example, but quickly breaks down for other codepoints—but it may be just what you want.) Could the exception be exactly what you need; now you just need to handle it in some way?
Only you can really answer this question.
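If replacing specific symbols with ASCII look-alikes is what you want, one possible sketch is a small translation table followed by a lossy encode for everything else. This is written against Python 3; the table contents and the function name are made up for illustration, and the "replace" fallback is just one of the options listed earlier.

```python
# Hypothetical mapping from code points to ASCII stand-ins.
REPLACEMENTS = {
    0x2122: u"(tm)",  # TRADE MARK SIGN
    0x00AE: u"(R)",   # REGISTERED SIGN
}


def to_ascii(text):
    """Substitute known symbols, then replace anything else with '?'."""
    return text.translate(REPLACEMENTS).encode("ascii", "replace")


print(to_ascii(u"Protection\u2122"))
```

Anything not in the table still degrades to a question mark, so the table needs extending as new symbols turn up.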
First of all, try installing translations for English language (or any other if needed):
sudo apt-get install language-pack-en
which provides translation data updates for all supported packages (including Python).
And make sure you use the right encoding in your code.
For example:
open(foo, encoding='utf-8')
Then double-check your system configuration, such as the value of LANG or the locale configuration (/etc/default/locale), and don't forget to log out and back in afterwards.