UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 12 - python-2.7

I'm developing a chatbot with the chatterbot library. The chatbot is in my native language --> Slovene, which has a lot of strange characters (for example: š, č, ž). I'm using python 2.7.
When I try to train the bot, the library has trouble with the characters mentioned above. For example, when I run the following code:
chatBot.set_trainer(ListTrainer)
chatBot.train([
"Koliko imam še dopusta?",
"Letos imate še 19 dni dopusta.",
])
it throws the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 12: invalid start byte
I added the # -*- coding: utf-8 -*- line to the top of my file, I also changed the encoding of all used files via my editor (Sublime text 3) to utf-8, I changed the system default encoding with the following code:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
The strings are of type unicode.
When I try to get a response, with these strange characters, it works, it has no issues with them. For example, running the following code in the same execution as the above training code(when I change 'š' to 's' and 'č' to 'c', in the train strings), throws no errors:
chatBot.set_trainer(ListTrainer)
chatBot.train([
"Koliko imam se dopusta?",
"Letos imate se 19 dni dopusta.",
])
chatBot.get_response("Koliko imam še dopusta?")
I can't find a solution to this issue. Any suggestions?
Thanks loads in advance. :)
EDIT: I used from __future__ import unicode_literals, to make strings of type unicode. I also checked if they really were unicode with the method type(myString)
I would also like to paste this link.
EDIT 2: #MallikarjunaraoKosuri - s code works, but in my case, I had one more thing inside the chatbot instance intialization, which is the following:
chatBot = ChatBot(
'Test',
trainer='chatterbot.trainers.ListTrainer',
storage_adapter='chatterbot.storage.JsonFileStorageAdapter'
)
This is the cause of my error. The json storage file the chatbot creates, is created in my local encoding and not in utf-8. It seems the default storage (.sqlite3), doesn't have this issue, so for now I'll just avoid the json storage. But I am still interested in finding a solution to this error.

The strings from your example are not of type unicode.
Otherwise Python would not throw the UnicodeDecodeError.
This type of error says that at a certain step of program's execution Python tries to decode byte-string into unicode but for some reason fails.
In your case the reason is that:
decoding is configured by utf-8
your source file is not in utf-8 and almost certainly in cp1252:
import unicodedata
b = '\x9a'
# u = b.decode('utf-8') # UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a
# in position 0: invalid start byte
u = b.decode('cp1252')
print unicodedata.name(u) # LATIN SMALL LETTER S WITH CARON
print u # š
So, the 0x9a byte from your cp1252 source can't be decoded with utf-8.
The best solution is to do nothing except convertation your source to utf-8.
With Sublime Text 3 you can easily do it by: File -> Reopen with Encoding -> UTF-8.
But don't forget to Ctrl+C your source code before the convertation beacuse just after that all your š, č, ž chars wil be replaced with ?.

Some of our friends are already suggested good part solutions, However again I would like combine all the solutions into one.
And author #gunthercox suggested some guidelines are described here http://chatterbot.readthedocs.io/en/stable/encoding.html#how-do-i-fix-python-encoding-errors
# -*- coding: utf-8 -*-
from chatterbot import ChatBot
# Create a new chat bot named Test
chatBot = ChatBot(
'Test',
trainer='chatterbot.trainers.ListTrainer'
)
chatBot.train([
"Koliko imam še dopusta?",
"Letos imate še 19 dni dopusta.",
])
Python Terminal
>>> # -*- coding: utf-8 -*-
... from chatterbot import ChatBot
>>>
>>> # Create a new chat bot named Test
... chatBot = ChatBot(
... 'Test',
... trainer='chatterbot.trainers.ListTrainer'
... )
>>>
>>> chatBot.train([
... "Koliko imam še dopusta?",
... "Letos imate še 19 dni dopusta.",
... ])
List Trainer: [####################] 100%
>>>

Related

How can I write characters like äöü in a Excel file without getting a UnicodeDecodeError

I want to use the xlwt to create an Excel file. However, some of the strings contain letters like ä, ü and ö. Therefore, I get the UnicodeDecodeError. Can this be fixed?
I transfered my code from 3.5 (IDLE) to 2.7 (Pycharm). It worked in 3.5, probably because I didn't need to put
# coding=utf-8
# -*- coding: iso-8859-1 -*-
at the beginning of the code in 3.5...
# coding=utf-8
# -*- coding: iso-8859-1 -*-
import xlwt
name_of_new_file = "Test.xls"
workbook_new = xlwt.Workbook()
worksheet = workbook_new.add_sheet("Testing")
worksheet.write(0, 0, "ä") # it works if I write an "a" instead
workbook_new.save("C:\\...\\Test123.xls")
I'm pretty sure the reason for the problem is about the first two lines. The Error message says:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
in python 2, the encoding is a thouble
worksheet.write(0, 0, u"ä")

My telegram bot does not support Persian language

I built a telegram bot with Python-Telegram-bot, and I want to send a bot to a user in Persian when the user sends /Start ;but the bot does not work.
My Code:
from telegram.ext import Updater,CommandHandler
updater = Updater(token='TOKEN')
def start_method(bot,update):
bot.sendMessage(update.message.chat_id,"سلام")
start_command = CommandHandler('start', start_method)
updater.dispatcher.add_handler(start_command)
updater.start_polling()
If you want to use unicode text in your code, you have to specify the file encoding according to PEP 263.
Place this comment at the beginning of your script:
#!/usr/bin/python
# -*- coding: utf-8 -*-
You can also use Python 3, which has much better unicode support in general and assumes utf-8 encoding for source files by default.
First, need to use a urllib. If your text is something like txt1, you need to quote it first and then send it as a message. like this:
from urllib.parse import quote
......
txt1 = 'سلام. خوش آمدید!'
txt = quote(txt1.encode('utf-8'))

Get non-ASCII filename from S3 notification event in Lambda

The key field in an AWS S3 notification event, which denotes the filename, is URL escaped.
This is evident when the filename contains spaces or non-ASCII characters.
For example, I have upload the following filename to S3:
my file řěąλλυ.txt
The notification is received as:
{
"Records": [
"s3": {
"object": {
"key": u"my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"
}
}
]
}
I've tried to decode using:
key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf-8')
but that yields:
my file ÅÄÄλλÏ.txt
Of course, when I then try to get the file from S3 using Boto, I get a 404 error.
tl;dr
You need to convert the URL encoded Unicode string to a bytes str before un-urlparsing it and decoding as UTF-8.
For example, for an S3 object with the filename: my file řěąλλυ.txt:
>>> utf8_urlencoded_key = event['Records'][0]['s3']['object']['key'].encode('utf-8')
# encodes the Unicode string to utf-8 encoded [byte] string. The key shouldn't contain any non-ASCII at this point, but UTF-8 will be safer.
'my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt'
>>> key_utf8 = urllib.unquote_plus(utf8_urlencoded_key)
# the previous url-escaped UTF-8 are now converted to UTF-8 bytes
# If you passed a Unicode object to unquote_plus, you'd have got a
# Unicode with UTF-8 encoded bytes!
'my file \xc5\x99\xc4\x9b\xc4\x85\xce\xbb\xce\xbb\xcf\x85.txt'
# Decodes key_utf-8 to a Unicode string
>>> key = key_utf8.decode('utf-8')
u'my file \u0159\u011b\u0105\u03bb\u03bb\u03c5.txt'
# Note the u prefix. The utf-8 bytes have been decoded to Unicode points.
>>> type(key)
<type 'unicode'>
>>> print(key)
my file řěąλλυ.txt
Background
AWS have commited the cardinal sin of changing the default encoding - https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/
The error you should've got from your decode() is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-19: ordinal not in range(128)
The value of key is a Unicode. In Python 2.x you could decode a Unicode, even though it doesn't make sense. In Python 2.x to decode a Unicode, Python first tries to encode it to a [byte] str first before decoding it using the given encoding. In Python 2.x the default encoding should be ASCII, which of course can't contain the characters used.
Had you got the proper UnicodeEncodeError from Python, you may have found suitable answers. On Python 3, you wouldn't have been able to call .decode() at all.
Just in case someone else comes here hoping for a JavaScript solution, here's what I ended up with:
function decodeS3EventKey (key = '') {
return decodeURIComponent(key.replace(/\+/g, ' '))
}
With limited testing, it seems to work fine:
test+image+%C3%BCtf+%E3%83%86%E3%82%B9%E3%83%88.jpg decodes to test image ütf テスト.jpg
my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt decodes to my file řěąλλυ.txt
For python 3:
from urllib.parse import unquote_plus
result = unquote_plus('input/%D0%BF%D1%83%D1%81%D1%82%D0%BE%D0%B8%CC%86.pdf')
print(result)
# will prints 'input/пустой.pdf'

Python 2.7 and Sublime 2 + unicode don't mix

First of all, I've looked here: Sublime Text 3, Python 3 and UTF-8 don't like each other and read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets but am still none the wiser to the following:
Running Python from a file created in Sublime (not compiling) and executing via command prompt on an XP machine
I have a couple of text files named with accents (German, Spanish & French mostly). I want to remove accented characters (umlauts, acutes, graves, cidillas etc) and replace them with their equilivant non accented look a like.
I can strip the accents if they are a string from with the script. But accesing a textfile of the same name causes the the strippAcent function to fail. I'm all out of ideas as I think this is due to a conflict with Sublime and Python.
Here's my script
# -*- coding: utf-8 -*-
import unicodedata
import os
def stripAccents(s):
try:
us = unicode(s,"utf-8")
nice = unicodedata.normalize("NFD", us).encode("ascii", "ignore")
print nice
return nice
except:
print ("Fail! : %s" %(s))
return None
stripAccents("Découvrez tous les logiciels à télécharger")
# Decouvrez tous les logiciels a telecharger
stripAccents("Östblocket")
# Ostblocket
stripAccents("Blühende Landschaften")
# Bluhende Landschaften
root = "D:\\temp\\test\\"
for path, subdirs, files in os.walk(root):
for name in files:
x = name
x = stripAccents(x)
For the record:
C:\chcp
gets me 437
This is what the code produces for me:
The error in full is:
C:\WINDOWS\system32>D:\LearnPython\unicode_accents.py
Decouvrez tous les logiciels a telecharger
Ostblocket
Bluhende Landschaften
Traceback (most recent call last):
File "D:\LearnPython\unicode_accents.py", line 37, in <module>
x = stripAccents(x)
File "D:\LearnPython\unicode_accents.py", line 8, in stripAccents
us = unicode(s,"utf-8")
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 2: invalid start byte
C:\WINDOWS\system32>
root = "D:\\temp\\test\\"
for path, subdirs, files in os.walk(root):
If you want to read Windows's filenames in their native Unicode form you have to ask for that specfically, by passing a Unicode string to filesystem functions:
root = u"D:\\temp\\test\\"
Otherwise Python will default to using the standard byte-based interfaces to the filesystem. On Windows, these return filenames to you encoded in the system's locale-specific legacy encoding (ANSI code page).
In stripAccents you try to decode the byte string you got from here using UTF-8, but the ANSI code page is never UTF-8, and the byte sequence you have doesn't happen to be a valid UTF-8 sequence so you get an error. You can decode from the ANSI code page using the pseudo-encoding mbcs, but it would be better to stick to Unicode filepath strings so you can include characters that don't fit in ANSI.
Always use Unicode strings to represent text in Python. Add from __future__ import unicode_literals at the top so that all "" literals would create Unicode strings. Or use u"" literals everywhere. Drop unicode(s, 'utf-8') from stripAccents(), always pass Unicode strings instead (try unidecode package, to transliterate Unicode to ascii).
Using Unicode solves several issues transparently:
there won't be UnicodeDecodeError because Windows provides Unicode API for filenames: if you pass Unicode input; you get Unicode output
you won't get a mojibake when a bytestring containing text encoded using your Windows encoding such as cp1252 is displayed in console using cp437 encoding e.g., Blühende -> Blⁿhende (ü is corrupted)
you might be able to work with text that can't be represented using neither cp1252 nor cp437 encoding e.g., '❤' (U+2764 HEAVY BLACK HEART).
To print Unicode text to Windows console, you could use win-unicode-console package.

PyCharm issue on encoding

I am trying pycharm and facing an encoding issue. Can you please help resolve it.
code:
# -*- coding: utf-8 -*-
__author__ = 'me'
import os, sys
def main():
print repeat('mike',False)
print repeat('mok', True)
"""
comments here..
"""
def repeat(s,exclaim):
result = s*3
if exclaim:
result = result +'!!!'
return result
if __name__ == '__main__':
main()
error:
C:\Python27\python.exe C:\Python27\python.exe C:/Users/prakashs/PycharmProjects/GooglePython/WarmUp.py
File "C:\Python27\python.exe", line 1
SyntaxError: Non-ASCII character '\x90' in file C:\Python27\python.exe on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Process finished with exit code 1
I have set the default encoding in pycharm to utf-8 as well. but i need to know where in pycharm we have to edit the settings.
Thank you.
Googling for Non-ASCII character '\x90' in file gives Using #-*- coding: utf-8 -*- does not remove "Non-ASCII character '\x90' in file hello.exe on line 1, but no encoding declared" error Stackoverflow question as the first hit. There you'll find answer to your question.
You have wrong command starting with C:\Python27\python.exe C:\Python27\python.exe... (python.exe is mentioned twice) which means you try to run executable (python.exe) instead of script file (WarmUp.py).