Python 2.7 and Sublime 2 + unicode don't mix

First of all, I've looked here: Sublime Text 3, Python 3 and UTF-8 don't like each other and read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets but am still none the wiser to the following:
Running Python from a file created in Sublime (not compiling within Sublime) and executing it via the command prompt on an XP machine.
I have a couple of text files whose names contain accents (German, Spanish & French mostly). I want to remove the accented characters (umlauts, acutes, graves, cedillas etc.) and replace them with their equivalent non-accented look-alikes.
I can strip the accents if they are passed as a string from within the script. But accessing a text file of the same name causes the stripAccents function to fail. I'm all out of ideas, as I think this is due to a conflict between Sublime and Python.
Here's my script
# -*- coding: utf-8 -*-
import unicodedata
import os

def stripAccents(s):
    try:
        us = unicode(s, "utf-8")
        nice = unicodedata.normalize("NFD", us).encode("ascii", "ignore")
        print nice
        return nice
    except:
        print("Fail! : %s" % (s))
        return None

stripAccents("Découvrez tous les logiciels à télécharger")
# Decouvrez tous les logiciels a telecharger
stripAccents("Östblocket")
# Ostblocket
stripAccents("Blühende Landschaften")
# Bluhende Landschaften

root = "D:\\temp\\test\\"
for path, subdirs, files in os.walk(root):
    for name in files:
        x = name
        x = stripAccents(x)
For the record:
C:\>chcp
gets me 437
This is what the code produces for me; the error in full is:
C:\WINDOWS\system32>D:\LearnPython\unicode_accents.py
Decouvrez tous les logiciels a telecharger
Ostblocket
Bluhende Landschaften
Traceback (most recent call last):
  File "D:\LearnPython\unicode_accents.py", line 37, in <module>
    x = stripAccents(x)
  File "D:\LearnPython\unicode_accents.py", line 8, in stripAccents
    us = unicode(s,"utf-8")
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 2: invalid start byte

C:\WINDOWS\system32>

root = "D:\\temp\\test\\"
for path, subdirs, files in os.walk(root):
If you want to read Windows's filenames in their native Unicode form you have to ask for that specifically, by passing a Unicode string to the filesystem functions:
root = u"D:\\temp\\test\\"
Otherwise Python will default to using the standard byte-based interfaces to the filesystem. On Windows, these return filenames to you encoded in the system's locale-specific legacy encoding (ANSI code page).
In stripAccents you try to decode the byte string you got from here using UTF-8, but the ANSI code page is never UTF-8, and the byte sequence you have doesn't happen to be a valid UTF-8 sequence so you get an error. You can decode from the ANSI code page using the pseudo-encoding mbcs, but it would be better to stick to Unicode filepath strings so you can include characters that don't fit in ANSI.
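For illustration, here is a minimal sketch of the corrected loop (assuming the rest of the script stays the same): with a Unicode root, os.walk yields Unicode filenames, so stripAccents no longer needs to decode anything:

# -*- coding: utf-8 -*-
import os
import unicodedata

def stripAccents(us):
    # expects a Unicode string now; no decode step needed
    return unicodedata.normalize("NFD", us).encode("ascii", "ignore")

root = u"D:\\temp\\test\\"  # Unicode literal: os.walk yields Unicode names
for path, subdirs, files in os.walk(root):
    for name in files:
        print stripAccents(name)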

Always use Unicode strings to represent text in Python. Add from __future__ import unicode_literals at the top so that all "" literals create Unicode strings, or use u"" literals everywhere. Drop unicode(s, 'utf-8') from stripAccents() and always pass it Unicode strings instead (try the unidecode package, to transliterate Unicode to ASCII); a short sketch follows below.
Using Unicode solves several issues transparently:
there won't be a UnicodeDecodeError, because Windows provides a Unicode API for filenames: if you pass Unicode input, you get Unicode output
you won't get mojibake when a bytestring containing text encoded using your Windows encoding such as cp1252 is displayed in the console using the cp437 encoding, e.g., Blühende -> Blⁿhende (ü is corrupted)
you might be able to work with text that can't be represented using either the cp1252 or cp437 encoding, e.g., '❤' (U+2764 HEAVY BLACK HEART)
To print Unicode text to Windows console, you could use win-unicode-console package.
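A minimal sketch of that approach, assuming the third-party unidecode package is installed (pip install Unidecode):

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import os
from unidecode import unidecode  # third-party transliteration to ASCII

root = "D:\\temp\\test\\"  # a Unicode literal thanks to unicode_literals
for path, subdirs, files in os.walk(root):
    for name in files:
        print(unidecode(name))  # e.g. u"Blühende" -> "Bluhende"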


UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 12

I'm developing a chatbot with the chatterbot library. The chatbot is in my native language, Slovene, which has a lot of special characters (for example: š, č, ž). I'm using Python 2.7.
When I try to train the bot, the library has trouble with the characters mentioned above. For example, when I run the following code:
chatBot.set_trainer(ListTrainer)
chatBot.train([
    "Koliko imam še dopusta?",
    "Letos imate še 19 dni dopusta.",
])
it throws the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 12: invalid start byte
I added the # -*- coding: utf-8 -*- line to the top of my file. I also changed the encoding of all the files used to utf-8 via my editor (Sublime Text 3), and I changed the system default encoding with the following code:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
The strings are of type unicode.
When I try to get a response containing these special characters, it works; it has no issues with them. For example, running the following code in the same execution as the above training code (with 'š' changed to 's' and 'č' to 'c' in the train strings) throws no errors:
chatBot.set_trainer(ListTrainer)
chatBot.train([
    "Koliko imam se dopusta?",
    "Letos imate se 19 dni dopusta.",
])
chatBot.get_response("Koliko imam še dopusta?")
I can't find a solution to this issue. Any suggestions?
Thanks loads in advance. :)
EDIT: I used from __future__ import unicode_literals to make the strings of type unicode. I also checked that they really were unicode with type(myString).
I would also like to paste this link.
EDIT 2: @MallikarjunaraoKosuri's code works, but in my case I had one more thing inside the chatbot instance initialization, which is the following:
chatBot = ChatBot(
    'Test',
    trainer='chatterbot.trainers.ListTrainer',
    storage_adapter='chatterbot.storage.JsonFileStorageAdapter'
)
This was the cause of my error. The JSON storage file the chatbot creates is written in my local encoding and not in utf-8. The default storage (.sqlite3) doesn't seem to have this issue, so for now I'll just avoid the JSON storage. But I am still interested in finding a solution to this error.
The strings from your example are not of type unicode.
Otherwise Python would not throw the UnicodeDecodeError.
This type of error says that at a certain step of the program's execution Python tries to decode a byte-string into unicode but for some reason fails.
In your case the reason is that:
decoding is configured for utf-8
your source file is not in utf-8 and is almost certainly in cp1252:
import unicodedata

b = '\x9a'
# u = b.decode('utf-8')  # UnicodeDecodeError: 'utf8' codec can't decode
#                        # byte 0x9a in position 0: invalid start byte
u = b.decode('cp1252')
print unicodedata.name(u)  # LATIN SMALL LETTER S WITH CARON
print u                    # š
So, the 0x9a byte from your cp1252 source can't be decoded with utf-8.
The best solution is simply to convert your source file to utf-8.
With Sublime Text 3 you can easily do it via File -> Reopen with Encoding -> UTF-8.
But don't forget to copy (Ctrl+C) your source code before the conversion, because right after it all your š, č, ž chars will be replaced with ?.
Some of our friends have already suggested good partial solutions; however, I would like to combine all the solutions into one.
The author @gunthercox suggests some guidelines, which are described here: http://chatterbot.readthedocs.io/en/stable/encoding.html#how-do-i-fix-python-encoding-errors
# -*- coding: utf-8 -*-
from chatterbot import ChatBot

# Create a new chat bot named Test
chatBot = ChatBot(
    'Test',
    trainer='chatterbot.trainers.ListTrainer'
)

chatBot.train([
    "Koliko imam še dopusta?",
    "Letos imate še 19 dni dopusta.",
])
Python Terminal
>>> # -*- coding: utf-8 -*-
... from chatterbot import ChatBot
>>>
>>> # Create a new chat bot named Test
... chatBot = ChatBot(
... 'Test',
... trainer='chatterbot.trainers.ListTrainer'
... )
>>>
>>> chatBot.train([
... "Koliko imam še dopusta?",
... "Letos imate še 19 dni dopusta.",
... ])
List Trainer: [####################] 100%
>>>

Get non-ASCII filename from S3 notification event in Lambda

The key field in an AWS S3 notification event, which denotes the filename, is URL escaped.
This is evident when the filename contains spaces or non-ASCII characters.
For example, I have uploaded the following filename to S3:
my file řěąλλυ.txt
The notification is received as:
{
  "Records": [
    {
      "s3": {
        "object": {
          "key": "my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"
        }
      }
    }
  ]
}
I've tried to decode using:
key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf-8')
but that yields:
my file ÅÄÄλλÏ.txt
Of course, when I then try to get the file from S3 using Boto, I get a 404 error.
tl;dr
You need to convert the URL-encoded Unicode string to a byte str before URL-unquoting it, and then decode the result as UTF-8.
For example, for an S3 object with the filename: my file řěąλλυ.txt:
>>> utf8_urlencoded_key = event['Records'][0]['s3']['object']['key'].encode('utf-8')
# encodes the Unicode string to a utf-8 encoded [byte] string. The key shouldn't
# contain any non-ASCII at this point, but UTF-8 is safer.
'my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt'
>>> key_utf8 = urllib.unquote_plus(utf8_urlencoded_key)
# the previously url-escaped UTF-8 is now converted to UTF-8 bytes.
# If you passed a Unicode object to unquote_plus, you'd have got a
# Unicode string with UTF-8 encoded bytes in it!
'my file \xc5\x99\xc4\x9b\xc4\x85\xce\xbb\xce\xbb\xcf\x85.txt'
>>> key = key_utf8.decode('utf-8')
# decodes key_utf8 to a Unicode string
u'my file \u0159\u011b\u0105\u03bb\u03bb\u03c5.txt'
# Note the u prefix. The utf-8 bytes have been decoded to Unicode codepoints.
>>> type(key)
<type 'unicode'>
>>> print(key)
my file řěąλλυ.txt
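Putting it together in a handler, a hedged sketch (the bucket name and the use of boto3's get_object are illustrative assumptions, not from the original question):

import urllib
import boto3  # available in the AWS Lambda Python runtime

def handler(event, context):
    raw_key = event['Records'][0]['s3']['object']['key'].encode('utf-8')
    key = urllib.unquote_plus(raw_key).decode('utf-8')
    # with the key decoded back to proper Unicode, the lookup should not 404
    return boto3.client('s3').get_object(Bucket='my-bucket', Key=key)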
Background
AWS have committed the cardinal sin of changing the default encoding - https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/
The error you should've got from your decode() is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-19: ordinal not in range(128)
The value of key is a Unicode. In Python 2.x you could decode a Unicode, even though it doesn't make sense. In Python 2.x, to decode a Unicode, Python first tries to encode it to a [byte] str before decoding it using the given encoding. In Python 2.x the default encoding should be ASCII, which of course can't contain the characters used.
Had you got the proper UnicodeEncodeError from Python, you may have found suitable answers. On Python 3, you wouldn't have been able to call .decode() at all.
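A tiny Python 2 sketch of that quirk, reproducing the masked error:

u = u'my file \u0159\u011b\u0105\u03bb\u03bb\u03c5.txt'
try:
    u.decode('utf-8')  # Python 2 first encodes the unicode to ASCII...
except UnicodeEncodeError as e:
    print e  # ...so you get an *Encode* error: 'ascii' codec can't encode characters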
Just in case someone else comes here hoping for a JavaScript solution, here's what I ended up with:
function decodeS3EventKey (key = '') {
  return decodeURIComponent(key.replace(/\+/g, ' '))
}
With limited testing, it seems to work fine:
test+image+%C3%BCtf+%E3%83%86%E3%82%B9%E3%83%88.jpg decodes to test image ütf テスト.jpg
my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt decodes to my file řěąλλυ.txt
For Python 3:
from urllib.parse import unquote_plus
result = unquote_plus('input/%D0%BF%D1%83%D1%81%D1%82%D0%BE%D0%B8%CC%86.pdf')
print(result)
# prints 'input/пустой.pdf'

Best way to decode command line inputs to Unicode Python 2.7 scripts

All my scripts use Unicode literals throughout, with
from __future__ import unicode_literals
but this creates a problem when there is the potential for functions being called with bytestrings, and I'm wondering what the best approach is for handling this and producing clear helpful errors.
I gather that one common approach, which I've adopted, is to simply make this clear when it occurs, with something like
def my_func(somearg):
    """The 'somearg' argument must be Unicode."""
    if not isinstance(somearg, unicode):
        raise TypeError("Parameter 'somearg' should be a Unicode")
    # ...
for all arguments that need to be Unicode (and might be bytestrings). However, even if I do this, I encounter problems in my argparse command line script when supplied parameters correspond to such arguments, and I wonder what the best approach here is. It seems that I can simply check the encoding of such arguments and decode them using that encoding, with, for example
if __name__ == '__main__':
    parser = argparse.ArgumentParser(...)
    parser.add_argument('somearg', ...)
    # ...
    args = parser.parse_args()
    some_arg = args.somearg
    if not isinstance(some_arg, unicode):
        some_arg = some_arg.decode(sys.getfilesystemencoding())
    # ...
    my_func(some_arg, ...)
Is this combination of approaches a common design pattern for Unicode modules that may receive bytestring inputs? Specifically,
can I reliably decode command line arguments in this way, and
will sys.getfilesystemencoding() give me the correct encoding for command line arguments; or
does argparse provide some builtin facility for accomplishing this that I've missed?
I don't think getfilesystemencoding will necessarily get the right encoding for the shell; it depends on the shell (and can be customised by the shell, independently of the filesystem). The file system encoding is only concerned with how non-ASCII filenames are stored.
Instead, you should probably be looking at sys.stdin.encoding which will give you the encoding for standard input.
Additionally, you might consider using the type keyword argument when you add an argument:
import sys
import argparse as ap

def foo(str_, encoding=sys.stdin.encoding):
    return str_.decode(encoding)

parser = ap.ArgumentParser()
parser.add_argument('my_int', type=int)
parser.add_argument('my_arg', type=foo)
args = parser.parse_args()
print repr(args)
Demo:
$ python spam.py abc hello
usage: spam.py [-h] my_int my_arg
spam.py: error: argument my_int: invalid int value: 'abc'
$ python spam.py 123 hello
Namespace(my_arg=u'hello', my_int=123)
$ python spam.py 123 ollǝɥ
Namespace(my_arg=u'oll\u01dd\u0265', my_int=123)
If you have to work with non-ascii data a lot, I would highly recommend upgrading to python3. Everything is a lot easier there, for example, parsed arguments will already be unicode on python3.
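For comparison, a small sketch of the same thing on Python 3, where no decode step exists:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('my_arg')
args = parser.parse_args(['ollǝɥ'])   # argv items are already str (Unicode)
print(repr(args.my_arg))              # 'ollǝɥ'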
Since there is conflicting information about the command line argument encoding around, I decided to test it by changing my shell encoding to latin-1 whilst leaving the file system encoding as utf-8. For my tests I use the c-cedilla character which has a different encoding in these two:
>>> u'Ç'.encode('ISO8859-1')
'\xc7'
>>> u'Ç'.encode('utf-8')
'\xc3\x87'
Now I create an example script:
#!/usr/bin/python2.7
import argparse as ap
import sys

print 'sys.stdin.encoding is ', sys.stdin.encoding
print 'sys.getfilesystemencoding() is', sys.getfilesystemencoding()

def encoded(s):
    print 'encoded', repr(s)
    return s

def decoded_filesystemencoding(s):
    try:
        s = s.decode(sys.getfilesystemencoding())
    except UnicodeDecodeError:
        s = 'failed!'
    return s

def decoded_stdinputencoding(s):
    try:
        s = s.decode(sys.stdin.encoding)
    except UnicodeDecodeError:
        s = 'failed!'
    return s

parser = ap.ArgumentParser()
parser.add_argument('first', type=encoded)
parser.add_argument('second', type=decoded_filesystemencoding)
parser.add_argument('third', type=decoded_stdinputencoding)
args = parser.parse_args()
print repr(args)
Then I change my shell encoding to ISO/IEC 8859-1 (in the terminal emulator's settings) and call the script:
wim-macbook:tmp wim$ ./spam.py Ç Ç Ç
sys.stdin.encoding is ISO8859-1
sys.getfilesystemencoding() is utf-8
encoded '\xc7'
Namespace(first='\xc7', second='failed!', third=u'\xc7')
As you can see, the command line arguments were encoded in latin-1, and so the second command line argument (using sys.getfilesystemencoding) fails to decode. The third command line argument (using sys.stdin.encoding) decodes correctly.
sys.getfilesystemencoding() is the correct (but see the examples below) encoding for OS data such as filenames, environment variables, and command-line arguments.
You could see the logic behind the choice: sys.argv[0] may be the path to the script (the filename) and therefore it is natural to assume that it uses the same encoding as other filenames and that other items in the argv list use the same character encoding as sys.argv[0]. os.environ['PATH'] contains paths and therefore it is also natural that environment variables use the same encoding:
$ echo 'import sys; print(sys.argv)' >print_argv.py
$ python print_argv.py
['print_argv.py']
Note: sys.argv[0] is the script filename whatever other command-line arguments you might have.
"best way" depends on your specific use-case e.g., on Windows, you should probably use Unicode API directly (CommandLineToArgvW()). On POSIX, if all you need is to pass some argv items to OS functions back (such as os.listdir()) then you could leave them as bytes -- command-line argument can be arbitrary byte sequence, see PEP 0383 -- Non-decodable Bytes in System Character Interfaces:
import os, sys
os.execl(sys.executable, sys.executable, '-c',
         'import sys; print(sys.argv)',
         bytes(bytearray(range(1, 0x100))))
As you can see, POSIX allows passing any bytes (except zero).
Obviously, you can also misconfigure your environment:
$ LANG=C PYTHONIOENCODING=latin-1 python -c'import sys;
> print(sys.argv, sys.stdin.encoding, sys.getfilesystemencoding())' €
(['-c', '\xe2\x82\xac'], 'latin-1', 'ANSI_X3.4-1968') # Linux output
The output shows that € is encoded using utf-8 but both locale and PYTHONIOENCODING are configured differently.
The examples demonstrate that sys.argv may be encoded using a character encoding that does not correspond to any of the standard encodings, or it may even contain arbitrary binary data (anything except the zero byte) on POSIX, where no character encoding is involved. On Windows, I guess, you could paste a Unicode string that can't be encoded using the ANSI or OEM Windows encodings, but you might get the correct value using the Unicode API anyway (Python 2 probably drops data here).
Python 3 uses a Unicode sys.argv and therefore shouldn't lose data on Windows (the Unicode API is used), and it lets us demonstrate that sys.getfilesystemencoding(), not sys.stdin.encoding, is used to decode sys.argv on Linux (where sys.getfilesystemencoding() is derived from the locale):
$ LANG=C.UTF-8 PYTHONIOENCODING=latin-1 python3 -c'import sys; print(*map(ascii, sys.argv))' µ
'-c' '\xb5'
$ LANG=C PYTHONIOENCODING=latin-1 python3 -c'import sys; print(*map(ascii, sys.argv))' µ
'-c' '\udcc2\udcb5'
$ LANG=en_US.ISO-8859-15 PYTHONIOENCODING=latin-1 python3 -c'import sys; print(*map(ascii, sys.argv))' µ
'-c' '\xc2\xb5'
The output shows that LANG, which defines the locale here and thereby sys.getfilesystemencoding() on Linux, is what determines how the command-line arguments are decoded:
$ python3
>>> print(ascii(b'\xc2\xb5'.decode('utf-8')))
'\xb5'
>>> print(ascii(b'\xc2\xb5'.decode('ascii', 'surrogateescape')))
'\udcc2\udcb5'
>>> print(ascii(b'\xc2\xb5'.decode('iso-8859-15')))
'\xc2\xb5'

Unicode characters output from python I/O to files

I don't know if this is my misunderstanding of UTF-8 or of python, but I'm having trouble understanding how python writes Unicode characters to a file. I'm on a Mac under OSX by the way, if that makes a difference.
Let's say I have the following unicode string
foo=u'\x93Stuff in smartquotes\x94\n'
Here \x93 and \x94 are those awful smart-quotes.
Then I write it to a file:
with open('file.txt', 'w') as file:
    file.write(foo.encode('utf8'))
When I open the file in a text editor like TextWrangler or in a web browser, file.txt seems to have been written as
\xc2\x93Stuff in smartquotes\xc2\x94\n
The text editor properly understands the file to be UTF8 encoded, but it renders \xc2\x93 as garbage. If I go in and manually strip out the \xc2 part, I get what I expect, and TextWrangler and Firefox render the utf characters as smartquotes.
This is exactly what I get when I read the file back into python without decoding it as 'utf8'. However, when I do read it in with the read().decode('utf8') method, I get back what I originally put in, without the \xc2 bit.
This is driving me bonkers, because I'm trying to parse a bunch of html files into text and the incorrect rendering of these unicode characters is screwing up a bunch of stuff.
I also tried it in python3 using the read/write methods normally, and it has the same behavior.
edit: Regarding stripping out the \xc2 manually, it turns out that it was rendering correctly when I did that because the browser and text editors were defaulting to Latin encoding.
Also, as a follow-up, Firefox renders the text as
☐Stuff in smartquotes☐
where the boxes are empty unicode values, while Chrome renders the text as
Stuff in smartquotes
The problem is, u'\x93' and u'\x94' are not the Unicode codepoints for smart quotes. They are smart quotes in the Windows-1252 encoding, which is not the same as the latin1 encoding. In latin1, those values are not defined.
>>> import unicodedata as ud
>>> ud.name(u'\x93')
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
ValueError: no such name
>>> ud.name(u'\x94')
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
ValueError: no such name
>>> ud.name(u'\u201c')
'LEFT DOUBLE QUOTATION MARK'
>>> ud.name(u'\u201d')
'RIGHT DOUBLE QUOTATION MARK'
So you should use one of:
foo = u'\u201cStuff in smartquotes\u201d'
foo = u'\N{LEFT DOUBLE QUOTATION MARK}Stuff in smartquotes\N{RIGHT DOUBLE QUOTATION MARK}'
or in a UTF-8 source file:
#coding:utf8
foo = u'“Stuff in smartquotes”'
Edit: If you somehow have a Unicode string with those incorrect bytes in it, here's a way to fix them. The first 256 Unicode codepoints map 1:1 with latin1 encoding, so it can be used to encode a mis-decoded Unicode string directly back to a byte string so the correct decoding can be used:
>>> foo = u'\x93Stuff in smartquotes\x94'
>>> foo
u'\x93Stuff in smartquotes\x94'
>>> foo = foo.encode('latin1').decode('windows-1252')
>>> foo
u'\u201cStuff in smartquotes\u201d'
>>> print foo
“Stuff in smartquotes”
If you have the UTF-8-encoded version of the incorrect Unicode characters:
>>> foo = '\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo = foo.decode('utf8').encode('latin1').decode('windows-1252')
>>> foo
u'\u201cStuff in smartquotes\u201d'
>>> print foo
“Stuff in smartquotes”
And in the very worst case, if you have the following Unicode string:
>>> foo = u'\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo.encode('latin1') # back to a UTF-8 encoded byte string.
'\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo.encode('latin1').decode('utf8') # Undo the UTF-8, but Unicode is still wrong.
u'\x93Stuff in smartquotes\x94'
>>> foo.encode('latin1').decode('utf8').encode('latin1') # back to a byte string.
'\x93Stuff in smartquotes\x94'
>>> foo.encode('latin1').decode('utf8').encode('latin1').decode('windows-1252') # Now decode correctly.
u'\u201cStuff in smartquotes\u201d'

"UnicodeEncodeError: 'ascii' codec can't encode character"

I'm trying to pass big strings of random html through regular expressions and my Python 2.6 script is choking on this:
UnicodeEncodeError: 'ascii' codec can't encode character
I traced it back to a trademark superscript on the end of this word: Protection™ -- and I expect to encounter others like it in the future.
Is there a module to process non-ascii characters? or, what is the best way to handle/escape non-ascii stuff in python?
Thanks!
Full error:
E
======================================================================
ERROR: test_untitled (__main__.Untitled)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python26\Test2.py", line 26, in test_untitled
    ofile.write(Whois + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 1005: ordinal not in range(128)
Full Script:
from selenium import selenium
import unittest, time, re, csv, logging

class Untitled(unittest.TestCase):
    def setUp(self):
        self.verificationErrors = []
        self.selenium = selenium("localhost", 4444, "*firefox", "http://www.BaseDomain.com/")
        self.selenium.start()
        self.selenium.set_timeout("90000")

    def test_untitled(self):
        sel = self.selenium
        spamReader = csv.reader(open('SubDomainList.csv', 'rb'))
        for row in spamReader:
            sel.open(row[0])
            time.sleep(10)
            Test = sel.get_text("//html/body/div/table/tbody/tr/td/form/div/table/tbody/tr[7]/td")
            Test = Test.replace(",", "")
            Test = Test.replace("\n", "")
            ofile = open('TestOut.csv', 'ab')
            ofile.write(Test + '\n')
            ofile.close()

    def tearDown(self):
        self.selenium.stop()
        self.assertEqual([], self.verificationErrors)

if __name__ == "__main__":
    unittest.main()
You're trying to convert unicode to ascii in "strict" mode:
>>> help(str.encode)
Help on method_descriptor:

encode(...)
    S.encode([encoding[,errors]]) -> object

    Encodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
    'xmlcharrefreplace' as well as any other name registered with
    codecs.register_error that is able to handle UnicodeEncodeErrors.
You probably want something like one of the following:
s = u'Protection™'
print s.encode('ascii', 'ignore') # removes the ™
print s.encode('ascii', 'replace') # replaces with ?
print s.encode('ascii','xmlcharrefreplace') # turn into xml entities
print s.encode('ascii', 'strict') # throw UnicodeEncodeErrors
You're trying to pass a bytestring to something, but it's impossible (from the scarcity of info you provide) to tell what you're trying to pass it to. You start with a Unicode string that cannot be encoded as ASCII (the default codec), so you'll have to encode using some different codec (or transliterate it, as @R. Pate suggests) -- but it's impossible for us to say what codec you should use, because we don't know what you're passing the bytestring to, and therefore don't know what that unknown subsystem is going to be able to accept and process correctly in terms of codecs.
In such total darkness as you leave us in, utf-8 is a reasonable blind guess (since it's a codec that can represent any Unicode string exactly as a bytestring, and it's the standard codec for many purposes, such as XML) -- but it can't be any more than a blind guess, until and unless you're going to tell us more about what you're trying to pass that bytestring to, and for what purposes.
Passing thestring.encode('utf-8') rather than bare thestring will definitely avoid the particular error you're seeing right now, but it may result in peculiar displays (or whatever it is you're trying to do with that bytestring!) unless the recipient is ready, willing and able to accept utf-8 encoding (and how could WE know, having absolutely zero idea about what the recipient could possibly be?!-)
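Applied to the traceback above, here is a hedged sketch of the narrow fix, assuming TestOut.csv should hold UTF-8 text (the io.open variant is a suggested alternative, not part of the original script):

import io

test_text = u'Protection\u2122'  # the kind of Unicode value sel.get_text() returns

# Option 1: encode explicitly before writing to a byte-mode file
with open('TestOut.csv', 'ab') as ofile:
    ofile.write(test_text.encode('utf-8') + '\n')

# Option 2: open the file with an encoding and write Unicode directly
with io.open('TestOut.csv', 'a', encoding='utf-8') as ofile:
    ofile.write(test_text + u'\n')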
The "best" way always depends on your requirements; so, what are yours? Is ignoring non-ASCII appropriate? Should you replace ™ with "(tm)"? (Which looks fancy for this example, but quickly breaks down for other codepoints—but it may be just what you want.) Could the exception be exactly what you need; now you just need to handle it in some way?
Only you can really answer this question.
First of all, try installing translations for English language (or any other if needed):
sudo apt-get install language-pack-en
which provides translation data updates for all supported packages (including Python).
And make sure you use the right encoding in your code.
For example:
open(foo, encoding='utf-8')  # Python 3 syntax; on Python 2.7 use io.open(foo, encoding='utf-8')
Then double-check your system configuration, such as the value of LANG or the locale configuration (/etc/default/locale), and don't forget to log out and back in to your session.
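To verify the result, a quick sketch that asks Python what the locale machinery reports after the change:

import locale

locale.setlocale(locale.LC_ALL, '')    # adopt the environment's locale settings
print(locale.getpreferredencoding())   # e.g. 'UTF-8' once LANG is set correctly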