Decode other languages - python-2.7

In Python, I have text in another language:
import json
name = "அரவிந்த்"
result = {"Name": name}
j_res = json.dumps(result)
print j_res
Output:
{"Name": "\u0b85\u0bb0\u0bb5\u0bbf\u0ba8\u0bcd\u0ba4\u0bcd"}
Is there any way to get the name அரவிந்த் back from the text \u0b85\u0bb0\u0bb5\u0bbf\u0ba8\u0bcd\u0ba4\u0bcd?

Yes, it's just as simple:
# -*- coding: utf-8 -*-
import json
name = "அரவிந்த்"
result = {"Name": name}
j_res = json.dumps(result)
print j_res
print json.loads(j_res)
print json.loads(j_res)["Name"]
Output:
{"Name": "\u0b85\u0bb0\u0bb5\u0bbf\u0ba8\u0bcd\u0ba4\u0bcd"}
{u'Name': u'\u0b85\u0bb0\u0bb5\u0bbf\u0ba8\u0bcd\u0ba4\u0bcd'}
அரவிந்த்

In Python 2.7, str objects are simply sequences of bytes (values 0 through 255), and only the first 128 of those are ASCII. If you need to handle and display characters beyond that range, you should almost certainly use unicode objects (prefixed by u) instead of the naive str (the default for new strings).
In Python 3, this problem is solved for you: str is a sequence of Unicode code points that can represent characters from any script, and raw bytes live in the separate bytes type. If you can use Python 3, it may solve this and many similar problems around how strings and characters are stored and displayed.
If you're forced to use Python 2.7, you should read text in with an explicit encoding and make certain it is handled as unicode.
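Relatedly, if you want json.dumps itself to emit the original characters instead of \u escapes, it accepts an ensure_ascii flag; a minimal sketch for Python 2.7:
# -*- coding: utf-8 -*-
import json
name = u"அரவிந்த்"
result = {"Name": name}
# ensure_ascii=False keeps non-ASCII characters in the output string
j_res = json.dumps(result, ensure_ascii=False)
print j_res.encode('utf-8')
# {"Name": "அரவிந்த்"}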

Related

UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 12

I'm developing a chatbot with the chatterbot library. The chatbot is in my native language, Slovene, which has a lot of special characters (for example: š, č, ž). I'm using Python 2.7.
When I try to train the bot, the library has trouble with the characters mentioned above. For example, when I run the following code:
chatBot.set_trainer(ListTrainer)
chatBot.train([
    "Koliko imam še dopusta?",
    "Letos imate še 19 dni dopusta.",
])
it throws the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 12: invalid start byte
I added the # -*- coding: utf-8 -*- line to the top of my file, changed the encoding of all the files involved to UTF-8 via my editor (Sublime Text 3), and changed the system default encoding with the following code:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
The strings are of type unicode.
When I try to get a response containing these special characters, it works with no issues. For example, the following code, run in the same execution as the training code above (with 'š' changed to 's' and 'č' to 'c' in the training strings), throws no errors:
chatBot.set_trainer(ListTrainer)
chatBot.train([
    "Koliko imam se dopusta?",
    "Letos imate se 19 dni dopusta.",
])
chatBot.get_response("Koliko imam še dopusta?")
I can't find a solution to this issue. Any suggestions?
Thanks loads in advance. :)
EDIT: I used from __future__ import unicode_literals to make the strings of type unicode, and I checked that they really are unicode with type(myString).
I would also like to paste this link.
EDIT 2: @MallikarjunaraoKosuri's code works, but in my case I had one more thing inside the chatbot instance initialization:
chatBot = ChatBot(
    'Test',
    trainer='chatterbot.trainers.ListTrainer',
    storage_adapter='chatterbot.storage.JsonFileStorageAdapter'
)
This was the cause of my error: the JSON storage file the chatbot creates is written in my local encoding, not in UTF-8. The default storage (.sqlite3) doesn't seem to have this issue, so for now I'll just avoid the JSON storage. But I am still interested in finding a solution to this error.
The strings from your example are not of type unicode; otherwise Python would not throw the UnicodeDecodeError. This type of error says that at a certain step of the program's execution Python tries to decode a byte string into Unicode but fails. In your case the reason is that:
- decoding is configured as utf-8, and
- your source file is not in utf-8; it is almost certainly in cp1252:
import unicodedata
b = '\x9a'
# u = b.decode('utf-8') # UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a
# in position 0: invalid start byte
u = b.decode('cp1252')
print unicodedata.name(u) # LATIN SMALL LETTER S WITH CARON
print u # š
So, the 0x9a byte from your cp1252 source can't be decoded with utf-8.
The best solution is to do nothing except convert your source file to UTF-8. With Sublime Text 3 you can easily do it via: File -> Reopen with Encoding -> UTF-8.
But don't forget to copy your source code to the clipboard before the conversion, because right after it all your š, č, ž characters will be replaced with ?.
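If you prefer to do the conversion programmatically rather than through the editor, a minimal sketch (this assumes the file really is cp1252; the filename is hypothetical):
# Read the bytes as cp1252 and write them back out as UTF-8.
with open('my_bot.py', 'rb') as f:
    text = f.read().decode('cp1252')
with open('my_bot.py', 'wb') as f:
    f.write(text.encode('utf-8'))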
Some of our friends have already suggested good partial solutions; however, I would like to combine them all into one.
The author @gunthercox suggests some guidelines, which are described here: http://chatterbot.readthedocs.io/en/stable/encoding.html#how-do-i-fix-python-encoding-errors
# -*- coding: utf-8 -*-
from chatterbot import ChatBot

# Create a new chat bot named Test
chatBot = ChatBot(
    'Test',
    trainer='chatterbot.trainers.ListTrainer'
)

chatBot.train([
    "Koliko imam še dopusta?",
    "Letos imate še 19 dni dopusta.",
])
Python Terminal
>>> # -*- coding: utf-8 -*-
... from chatterbot import ChatBot
>>>
>>> # Create a new chat bot named Test
... chatBot = ChatBot(
... 'Test',
... trainer='chatterbot.trainers.ListTrainer'
... )
>>>
>>> chatBot.train([
... "Koliko imam še dopusta?",
... "Letos imate še 19 dni dopusta.",
... ])
List Trainer: [####################] 100%
>>>

How can I clean an Urdu data corpus in Python without NLTK

I have a corpus of more than 10,000 words in Urdu. Now what I want is to clean my data. Special Unicode characters such as "!؟ـ،" appear in my text, and whenever I use regular expressions I get an error that my data is not properly encoded.
Kindly provide me some help to clean my data.
Thank you
Here is my sample data:
ظہیر
احمد
ماہرہ
خان
کی،
تصاویر،
نے
دائیں
اور
بائیں
والوں
کو
آسمانوں
پر
پہنچایا
،ہوا
ہے۔
دائیں؟
والے
I used your sample to find all words containing ہ or ر.
Notice that I had to tell Python that I am dealing with Unicode data by using the u prefix in front of the regex string as well as the data string:
import re
data = u"""
ظہیر
احمد
ماہرہ
خان
.....
"""
result = re.findall(u'[^\s\n]+[ہر][^\s\n]+',data,re.MULTILINE)
print(result)
The output was
['ظہیر', 'ماہرہ', 'تصاویر،', 'پہنچایا', '،ہوا']
Another example removes all non-word characters except whitespace and makes sure only one space separates the words:
result = re.sub(u' +', u' ', re.sub(u'[\W\s]', u' ', data))
print(result)
The output is
ظہیر احمد ماہرہ خان کی تصاویر نے دائیں اور بائیں والوں کو آسمانوں پر پہنچایا ہوا ہے دائیں والے
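If you only want to strip the specific punctuation marks mentioned in the question, a minimal sketch with an explicit character class works too (the exact set of marks here is an assumption; extend it as needed, and data is the Unicode sample from above):
import re
# Urdu/Arabic punctuation: exclamation, question mark, tatweel, comma, full stop.
punctuation = u'[!؟ـ،۔]'
cleaned = re.sub(punctuation, u' ', data)
cleaned = re.sub(u'\\s+', u' ', cleaned).strip()
print(cleaned)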
You can also use a word tokenizer (note that this one does come from NLTK):
import nltk
result = nltk.tokenize.wordpunct_tokenize(data)
print(result)
The output will be
['ظہیر', 'احمد', 'ماہرہ', 'خان', 'کی', '،', 'تصاویر', '،', 'نے', 'دائیں', 'اور', 'بائیں', 'والوں', 'کو', 'آسمانوں', 'پر', 'پہنچایا', '،', 'ہوا', 'ہے', '۔', 'دائیں', '؟', 'والے']
Edit: for Python 2.7 you have to specify the encoding at the beginning of the code file, as well as tell re that the regex is Unicode by passing the re.UNICODE flag:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import re
data = u"""ظہیر
احمد
ماہرہ
خان
کی،
.....
"""
# note: flags must be passed as a keyword; positionally the fourth argument is `count`
result = re.sub(ur'\s+', u' ', re.sub(ur'[\W\s]', u' ', data, flags=re.UNICODE), flags=re.UNICODE)
print(result)
Also note the use of ur'' to specify that the pattern is a Unicode raw string.

Get non-ASCII filename from S3 notification event in Lambda

The key field in an AWS S3 notification event, which denotes the filename, is URL escaped.
This is evident when the filename contains spaces or non-ASCII characters.
For example, I have uploaded the following filename to S3:
my file řěąλλυ.txt
The notification is received as:
{
    "Records": [
        {
            "s3": {
                "object": {
                    "key": "my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"
                }
            }
        }
    ]
}
I've tried to decode using:
key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf-8')
but that yields:
my file ÅÄÄλλÏ.txt
Of course, when I then try to get the file from S3 using Boto, I get a 404 error.
tl;dr
You need to convert the URL-encoded Unicode string to a byte str before URL-unquoting it and then decoding it as UTF-8.
For example, for an S3 object with the filename: my file řěąλλυ.txt:
>>> utf8_urlencoded_key = event['Records'][0]['s3']['object']['key'].encode('utf-8')
# encodes the Unicode string to utf-8 encoded [byte] string. The key shouldn't contain any non-ASCII at this point, but UTF-8 will be safer.
'my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt'
>>> key_utf8 = urllib.unquote_plus(utf8_urlencoded_key)
# the previous url-escaped UTF-8 are now converted to UTF-8 bytes
# If you passed a Unicode object to unquote_plus, you'd have got a
# Unicode with UTF-8 encoded bytes!
'my file \xc5\x99\xc4\x9b\xc4\x85\xce\xbb\xce\xbb\xcf\x85.txt'
# Decode key_utf8 to a Unicode string
>>> key = key_utf8.decode('utf-8')
u'my file \u0159\u011b\u0105\u03bb\u03bb\u03c5.txt'
# Note the u prefix. The utf-8 bytes have been decoded to Unicode points.
>>> type(key)
<type 'unicode'>
>>> print(key)
my file řěąλλυ.txt
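Putting those steps together, a minimal Python 2 helper (a sketch; the function name is ours, not part of any AWS API):
import urllib

def decode_s3_key(key):
    # key is the Unicode string from the event; returns the decoded Unicode name.
    return urllib.unquote_plus(key.encode('utf-8')).decode('utf-8')

key = decode_s3_key(event['Records'][0]['s3']['object']['key'])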
Background
AWS have committed the cardinal sin of changing the default encoding: https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/
The error you should've got from your decode() is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-19: ordinal not in range(128)
The value of key is a Unicode string. In Python 2.x you can decode a Unicode string, even though it doesn't make sense: Python first tries to encode it to a byte str (using the default encoding) before decoding it with the encoding you gave. In Python 2.x the default encoding should be ASCII, which of course can't contain the characters used.
Had you got the proper UnicodeEncodeError from Python, you may have found suitable answers. On Python 3, you wouldn't have been able to call .decode() at all.
Just in case someone else comes here hoping for a JavaScript solution, here's what I ended up with:
function decodeS3EventKey (key = '') {
    return decodeURIComponent(key.replace(/\+/g, ' '))
}
With limited testing, it seems to work fine:
test+image+%C3%BCtf+%E3%83%86%E3%82%B9%E3%83%88.jpg decodes to test image ütf テスト.jpg
my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt decodes to my file řěąλλυ.txt
For Python 3:
from urllib.parse import unquote_plus
result = unquote_plus('input/%D0%BF%D1%83%D1%81%D1%82%D0%BE%D0%B8%CC%86.pdf')
print(result)
# prints 'input/пустой.pdf'

Unicode characters output from python I/O to files

I don't know if this is my misunderstanding of UTF-8 or of python, but I'm having trouble understanding how python writes Unicode characters to a file. I'm on a Mac under OSX by the way, if that makes a difference.
Let's say I have the following Unicode string:
foo=u'\x93Stuff in smartquotes\x94\n'
Here \x93 and \x94 are those awful smart quotes.
Then I write it to a file:
with open('file.txt','w') as file:
file.write(foo.encode('utf8'))
When I open the file in a text editor like TextWrangler or in a web browser file.txt seems like it was written as
\xc2\x93Stuff in smartquotes\xc2\x94\n
The text editor properly understands the file to be UTF8 encoded, but it renders \xc2\x93 as garbage. If I go in and manually strip out the \xc2 part, I get what I expect, and TextWrangler and Firefox render the utf characters as smartquotes.
This is exactly what I get when I read the file back into python without decoding it as 'utf8'. However, when I do read it in with the read().decode('utf8') method, I get back what I originally put in, without the \xc2 bit.
This is driving me bonkers, because I'm trying to parse a bunch of html files into text and the incorrect rendering of these unicode characters is screwing up a bunch of stuff.
I also tried it in python3 using the read/write methods normally, and it has the same behavior.
Edit: Regarding stripping out the \xc2 manually, it turns out that it rendered correctly when I did that because the browser and text editors were defaulting to a Latin encoding.
Also, as a follow-up, Firefox renders the text as
☐Stuff in smartquotes☐
where the boxes are empty unicode values, while Chrome renders the text as
Stuff in smartquotes
The problem is that u'\x93' and u'\x94' are not the Unicode code points for smart quotes. They are smart quotes in the Windows-1252 encoding, which is not the same as the latin1 encoding; in latin1, those positions hold unnamed control characters.
>>> import unicodedata as ud
>>> ud.name(u'\x93')
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
ValueError: no such name
>>> import unicodedata as ud
>>> ud.name(u'\x94')
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
ValueError: no such name
>>> ud.name(u'\u201c')
'LEFT DOUBLE QUOTATION MARK'
>>> ud.name(u'\u201d')
'RIGHT DOUBLE QUOTATION MARK'
So you should use one of:
foo = u'\u201cStuff in smartquotes\u201d'
foo = u'\N{LEFT DOUBLE QUOTATION MARK}Stuff in smartquotes\N{RIGHT DOUBLE QUOTATION MARK}'
or in a UTF-8 source file:
#coding:utf8
foo = u'“Stuff in smartquotes”'
Edit: If you somehow have a Unicode string with those incorrect bytes in it, here's a way to fix them. The first 256 Unicode code points map 1:1 to the latin1 encoding, so latin1 can be used to encode a mis-decoded Unicode string directly back to a byte string, after which the correct decoding can be applied:
>>> foo = u'\x93Stuff in smartquotes\x94'
>>> foo
u'\x93Stuff in smartquotes\x94'
>>> foo = foo.encode('latin1').decode('windows-1252')
>>> foo
u'\u201cStuff in smartquotes\u201d'
>>> print foo
“Stuff in smartquotes”
If you have the UTF-8-encoded version of the incorrect Unicode characters:
>>> foo = '\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo = foo.decode('utf8').encode('latin1').decode('windows-1252')
>>> foo
u'\u201cStuff in smartquotes\u201d'
>>> print foo
“Stuff in smartquotes”
And if you have the very worst case, the following Unicode string:
>>> foo = u'\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo.encode('latin1') # back to a UTF-8 encoded byte string.
'\xc2\x93Stuff in smartquotes\xc2\x94'
>>> foo.encode('latin1').decode('utf8') # Undo the UTF-8, but Unicode is still wrong.
u'\x93Stuff in smartquotes\x94'
>>> foo.encode('latin1').decode('utf8').encode('latin1') # back to a byte string.
'\x93Stuff in smartquotes\x94'
>>> foo.encode('latin1').decode('utf8').encode('latin1').decode('windows-1252') # Now decode correctly.
u'\u201cStuff in smartquotes\u201d'
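Wrapping that worst-case chain in a small helper makes the intent clearer (a sketch for Python 2; the function name is ours):
def fix_mojibake(s):
    # s: a Unicode string whose code points are really the UTF-8 bytes
    # of windows-1252 text. Reverse both layers of mis-decoding.
    return s.encode('latin1').decode('utf8').encode('latin1').decode('windows-1252')

print fix_mojibake(u'\xc2\x93Stuff in smartquotes\xc2\x94')
# “Stuff in smartquotes”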

Python 2.7 and Sublime 2 + unicode don't mix

First of all, I've looked here: Sublime Text 3, Python 3 and UTF-8 don't like each other, and read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets, but am still none the wiser about the following:
I'm running Python from a file created in Sublime (not compiling) and executing it via the command prompt on an XP machine.
I have a couple of text files named with accents (German, Spanish & French mostly). I want to remove accented characters (umlauts, acutes, graves, cedillas etc.) and replace them with their equivalent non-accented look-alikes.
I can strip the accents if the string is defined inside the script, but accessing a text file of the same name causes the stripAccents function to fail. I'm all out of ideas, as I think this is due to a conflict between Sublime and Python.
Here's my script:
# -*- coding: utf-8 -*-
import unicodedata
import os

def stripAccents(s):
    try:
        us = unicode(s, "utf-8")
        nice = unicodedata.normalize("NFD", us).encode("ascii", "ignore")
        print nice
        return nice
    except:
        print ("Fail! : %s" % (s))
        return None

stripAccents("Découvrez tous les logiciels à télécharger")
# Decouvrez tous les logiciels a telecharger
stripAccents("Östblocket")
# Ostblocket
stripAccents("Blühende Landschaften")
# Bluhende Landschaften

root = "D:\\temp\\test\\"
for path, subdirs, files in os.walk(root):
    for name in files:
        x = name
        x = stripAccents(x)
For the record:
C:\>chcp
gets me 437.
This is what the code produces for me; the error in full is:
C:\WINDOWS\system32>D:\LearnPython\unicode_accents.py
Decouvrez tous les logiciels a telecharger
Ostblocket
Bluhende Landschaften
Traceback (most recent call last):
File "D:\LearnPython\unicode_accents.py", line 37, in <module>
x = stripAccents(x)
File "D:\LearnPython\unicode_accents.py", line 8, in stripAccents
us = unicode(s,"utf-8")
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 2: invalid start byte
C:\WINDOWS\system32>
root = "D:\\temp\\test\\"
for path, subdirs, files in os.walk(root):
If you want to read Windows filenames in their native Unicode form, you have to ask for that specifically, by passing a Unicode string to the filesystem functions:
root = u"D:\\temp\\test\\"
Otherwise Python will default to using the standard byte-based interfaces to the filesystem. On Windows, these return filenames to you encoded in the system's locale-specific legacy encoding (ANSI code page).
In stripAccents you try to decode the byte string you got from there using UTF-8, but the ANSI code page is never UTF-8, and the byte sequence you have doesn't happen to be a valid UTF-8 sequence, so you get an error. You can decode from the ANSI code page using the pseudo-encoding mbcs, but it is better to stick to Unicode file path strings so you can include characters that don't fit in the ANSI code page.
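For illustration, decoding an ANSI byte filename via mbcs (a sketch; the mbcs codec exists only on Windows, and the byte shown assumes a cp1252-style Western locale):
# 'Bl\xfchende' is "Blühende" as cp1252 bytes (0xfc is ü in the Western ANSI code page)
name_bytes = 'Bl\xfchende Landschaften'
name = name_bytes.decode('mbcs')  # Windows-only pseudo-encoding for the ANSI code page
print name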
Always use Unicode strings to represent text in Python. Add from __future__ import unicode_literals at the top so that all "" literals create Unicode strings, or use u"" literals everywhere. Drop unicode(s, 'utf-8') from stripAccents() and always pass it Unicode strings instead (try the unidecode package to transliterate Unicode to ASCII).
Using Unicode solves several issues transparently:
there won't be a UnicodeDecodeError, because Windows provides a Unicode API for filenames: if you pass Unicode input, you get Unicode output
you won't get mojibake when a byte string containing text encoded in your Windows encoding (such as cp1252) is displayed in the console using the cp437 encoding, e.g., Blühende -> Blⁿhende (ü is corrupted)
you might be able to work with text that can't be represented in either cp1252 or cp437, e.g., '❤' (U+2764 HEAVY BLACK HEART)
To print Unicode text to the Windows console, you could use the win-unicode-console package.
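Putting the suggestions together, a sketch of the revised script (this assumes the third-party unidecode package is installed; it is not part of the standard library):
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import os
from unidecode import unidecode  # pip install unidecode

def stripAccents(s):
    # s is already a Unicode string here; no decode step needed.
    return unidecode(s)

root = "D:\\temp\\test\\"  # a Unicode literal, thanks to unicode_literals
for path, subdirs, files in os.walk(root):
    for name in files:  # Unicode root in, Unicode names out
        print stripAccents(name)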