Python 2.7.x - unicode issue

I'm scraping this site www.soundkartell.de, and I'm facing some unicode issues:
results = []
for article in soup.find_all('article'):
    if article.select('a[href*="alternative"]'):
        artist = article.h2.text
        results.append(artist.encode('latin1').decode("utf-8"))
print artist   # Din vän Skuggan
print results  # [u'Din v\xe4n Skuggan']
I have # -*- coding: utf-8 -*- at the top of my file.
Why does Python print the scraped data correctly but not the appended data?
How do I fix the unicode issue?
I am using Python 2.7.x

You likely do not actually have a problem. What you are seeing is a side effect of how Python prints things:
Sample Code:
artist = 'Din vän Skuggan'
artists = [artist]
print 'artist:', artist
print 'artists:', artists
print 'str:', str(artist)
print 'repr:', repr(artist)
Produces:
artist: Din vän Skuggan
artists: ['Din v\xc3\xa4n Skuggan']
str: Din vän Skuggan
repr: 'Din v\xc3\xa4n Skuggan'
So, as can be seen above, when Python prints a list it uses the repr() of each item in the list. In both cases you have the same contents; Python is just displaying them differently.
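If you actually want the readable text, print the items themselves (or join them) rather than the list object; a minimal sketch using the question's data:
results = [u'Din v\xe4n Skuggan']
for artist in results:
    print artist           # Din vän Skuggan
print u', '.join(results)  # Din vän Skuggan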
Side Note:
# -*- coding: utf-8 -*-
At the top of your script, this is needed for string literals containing non-ASCII text in your code.
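For example, without that declaration Python 2 assumes the source is ASCII and rejects a literal like the following with a SyntaxError:
# -*- coding: utf-8 -*-
# The declaration above tells Python 2 how to decode these source bytes.
artist = u'Din vän Skuggan'  # unicode literal with non-ASCII characters
print artist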

Related

UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 12

I'm developing a chatbot with the chatterbot library. The chatbot is in my native language, Slovene, which has a lot of special characters (for example: š, č, ž). I'm using Python 2.7.
When I try to train the bot, the library has trouble with the characters mentioned above. For example, when I run the following code:
chatBot.set_trainer(ListTrainer)
chatBot.train([
    "Koliko imam še dopusta?",
    "Letos imate še 19 dni dopusta.",
])
it throws the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 12: invalid start byte
I added the # -*- coding: utf-8 -*- line to the top of my file, changed the encoding of all used files to utf-8 via my editor (Sublime Text 3), and changed the system default encoding with the following code:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
The strings are of type unicode.
When I try to get a response containing these special characters, it works with no issues. For example, the following code, run in the same execution as the training code above (with 'š' changed to 's' and 'č' to 'c' in the training strings), throws no errors:
chatBot.set_trainer(ListTrainer)
chatBot.train([
    "Koliko imam se dopusta?",
    "Letos imate se 19 dni dopusta.",
])
chatBot.get_response("Koliko imam še dopusta?")
I can't find a solution to this issue. Any suggestions?
Thanks loads in advance. :)
EDIT: I used from __future__ import unicode_literals to make the strings unicode. I also checked that they really were unicode with type(myString).
I would also like to paste this link.
EDIT 2: @MallikarjunaraoKosuri's code works, but in my case I had one more thing inside the chatbot instance initialization:
chatBot = ChatBot(
    'Test',
    trainer='chatterbot.trainers.ListTrainer',
    storage_adapter='chatterbot.storage.JsonFileStorageAdapter'
)
This was the cause of my error. The JSON storage file the chatbot creates is written in my local encoding and not in utf-8. The default storage (.sqlite3) doesn't seem to have this issue, so for now I'll just avoid the JSON storage. But I am still interested in finding a solution to this error.
The strings from your example are not of type unicode; otherwise Python would not throw the UnicodeDecodeError.
This type of error means that at some step of the program's execution Python tries to decode a byte string into unicode but fails.
In your case the reason is that:
decoding is configured as utf-8
your source file is not in utf-8, but almost certainly in cp1252:
import unicodedata
b = '\x9a'
# u = b.decode('utf-8') # UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a
# in position 0: invalid start byte
u = b.decode('cp1252')
print unicodedata.name(u) # LATIN SMALL LETTER S WITH CARON
print u # š
So, the 0x9a byte from your cp1252 source can't be decoded with utf-8.
The best solution is to do nothing except convert your source file to utf-8.
With Sublime Text 3 you can easily do it by: File -> Reopen with Encoding -> UTF-8.
But don't forget to copy (Ctrl+C) your source code before the conversion, because right afterwards all your š, č, ž characters will be replaced with ?.
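If you prefer to do the conversion with Python itself rather than with the editor, here is a sketch (assuming the source file really is cp1252; bot.py is a hypothetical file name):
import codecs

# Read the bytes as cp1252 and write them back out as utf-8.
with codecs.open('bot.py', encoding='cp1252') as f:
    source = f.read()
with codecs.open('bot.py', 'w', encoding='utf-8') as f:
    f.write(source)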
Some of our friends have already suggested good partial solutions; however, I would like to combine all the solutions into one.
The author @gunthercox describes some relevant guidelines here: http://chatterbot.readthedocs.io/en/stable/encoding.html#how-do-i-fix-python-encoding-errors
# -*- coding: utf-8 -*-
from chatterbot import ChatBot

# Create a new chat bot named Test
chatBot = ChatBot(
    'Test',
    trainer='chatterbot.trainers.ListTrainer'
)

chatBot.train([
    "Koliko imam še dopusta?",
    "Letos imate še 19 dni dopusta.",
])
Python Terminal
>>> # -*- coding: utf-8 -*-
... from chatterbot import ChatBot
>>>
>>> # Create a new chat bot named Test
... chatBot = ChatBot(
... 'Test',
... trainer='chatterbot.trainers.ListTrainer'
... )
>>>
>>> chatBot.train([
... "Koliko imam še dopusta?",
... "Letos imate še 19 dni dopusta.",
... ])
List Trainer: [####################] 100%
>>>

How can I clean an Urdu data corpus in Python without NLTK?

I have a corpus of more than 10,000 words in Urdu, and I want to clean my data. Special unicode characters like "!؟ـ،" appear in my text, and whenever I use regular expressions I get an error that my data is not in encoded form.
Kindly provide me some help to clean my data.
Thank you
Here is my sample data:
ظہیر
احمد
ماہرہ
خان
کی،
تصاویر،
نے
دائیں
اور
بائیں
والوں
کو
آسمانوں
پر
پہنچایا
،ہوا
ہے۔
دائیں؟
والے
I used your sample to find all words containing ہ or ر.
Notice that I had to tell Python that I am dealing with utf-8 data by using the u prefix on the regex string as well as on the data string:
import re
data = u"""
ظہیر
احمد
ماہرہ
خان
.....
"""
result = re.findall(u'[^\s\n]+[ہر][^\s\n]+', data, re.MULTILINE)
print(result)
The output was
['ظہیر', 'ماہرہ', 'تصاویر،', 'پہنچایا', '،ہوا']
Another example removes all non-alphabetic characters except whitespace and makes sure only a single space separates the words:
result = re.sub(' +', ' ', re.sub(u'[\W\s]', ' ', data))
print(result)
The output is:
ظہیر احمد ماہرہ خان کی تصاویر نے دائیں اور بائیں والوں کو آسمانوں پر پہنچایا ہوا ہے دائیں والے
You can also use a word tokenizer:
import nltk
result = nltk.tokenize.wordpunct_tokenize(data)
print(result)
The output will be:
['ظہیر', 'احمد', 'ماہرہ', 'خان', 'کی', '،', 'تصاویر', '،', 'نے', 'دائیں', 'اور', 'بائیں', 'والوں', 'کو', 'آسمانوں', 'پر', 'پہنچایا', '،', 'ہوا', 'ہے', '۔', 'دائیں', '؟', 'والے']
Edit: for Python 2.7 you have to specify the encoding at the beginning of the code file, and also tell re that the regex is unicode by using re.UNICODE:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import re
data = u"""ظہیر
احمد
ماہرہ
خان
کی،
.....
"""
result = re.sub(ur'\s+', u' ', re.sub(ur'[\W\s]', u' ', data, flags=re.UNICODE), flags=re.UNICODE)
print(result)
Also note the use of ur'' to specify that the regex string is a unicode string, and that re.UNICODE must be passed as the flags= keyword argument, because the fourth positional argument of re.sub is count, not flags.

My telegram bot does not support Persian language

I built a Telegram bot with python-telegram-bot, and I want the bot to send a message in Persian when the user sends /start, but the bot does not work.
My Code:
from telegram.ext import Updater, CommandHandler

updater = Updater(token='TOKEN')

def start_method(bot, update):
    bot.sendMessage(update.message.chat_id, "سلام")

start_command = CommandHandler('start', start_method)
updater.dispatcher.add_handler(start_command)
updater.start_polling()
If you want to use unicode text in your code, you have to specify the file encoding according to PEP 263.
Place this comment at the beginning of your script:
#!/usr/bin/python
# -*- coding: utf-8 -*-
You can also use Python 3, which has much better unicode support in general and assumes utf-8 encoding for source files by default.
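For illustration, a minimal sketch of the same bot in Python 3, assuming the same python-telegram-bot API used in the question's code; no coding declaration is needed:
# Python 3 source files are utf-8 by default, so the Persian literal
# works without any encoding declaration.
from telegram.ext import Updater, CommandHandler

def start_method(bot, update):
    bot.sendMessage(update.message.chat_id, "سلام")

updater = Updater(token='TOKEN')
updater.dispatcher.add_handler(CommandHandler('start', start_method))
updater.start_polling()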
First, you need to use urllib. If your text is something like txt1 below, you need to quote it first and then send it as a message, like this:
from urllib.parse import quote
......
txt1 = 'سلام. خوش آمدید!'
txt = quote(txt1.encode('utf-8'))
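Presumably the quoted text is then used when calling the Bot API over raw HTTP, where the message must be URL-encoded; a hypothetical sketch (TOKEN and CHAT_ID are placeholders):
from urllib.parse import quote
from urllib.request import urlopen

TOKEN = '...'    # placeholder: your bot token
CHAT_ID = '...'  # placeholder: the target chat id

txt1 = 'سلام. خوش آمدید!'
txt = quote(txt1.encode('utf-8'))  # percent-encode the utf-8 bytes
url = ('https://api.telegram.org/bot%s/sendMessage?chat_id=%s&text=%s'
       % (TOKEN, CHAT_ID, txt))
# urlopen(url)  # uncomment with a real token and chat id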

Why doesn't my text-cleaning function work without decoding to UTF-8?

I wrote the following function in Python 2.7 to clean text, but it doesn't work without first decoding the tweet variable from utf-8:
# -*- coding: utf-8 -*-
import re

def clean_tweet(tweet):
    tweet = re.sub(u"[^\u0622-\u064A]", ' ', tweet, flags=re.U)
    return tweet

if __name__ == "__main__":
    s = "sadfas سيبس sdfgsdfg/dfgdfg ffeee منت منشس يت??بمنشس//تبي منشكسميكمنشسكيمنك ٌاإلا رًاٌااًٌَُ"
    print "not working " + clean_tweet(s)
    print "working " + clean_tweet(s.decode("utf-8"))
Could anyone explain why? I don't want to use decoding, as it makes manipulating the text in an SFrame in GraphLab too slow.
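A likely explanation (my assumption, not from the original post): without decoding, s is a Python 2 byte string, so the unicode range \u0622-\u064A is matched against individual utf-8 bytes rather than Arabic characters, and can never match. A quick check:
# -*- coding: utf-8 -*-
s = "سيبس"              # type 'str': 8 utf-8 bytes like '\xd8\xb3...'
u = s.decode("utf-8")   # type 'unicode': 4 code points like u'\u0633...'
print type(s), len(s)   # <type 'str'> 8
print type(u), len(u)   # <type 'unicode'> 4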

Tokenizing in french using nltk

I am trying to tokenize French words, but when I do, the words which contain the "^" (circumflex) accent come back as \xe escape sequences. The following is the code that I implemented:
import nltk
from nltk.tokenize import WhitespaceTokenizer
from nltk.tokenize import SpaceTokenizer
from nltk.tokenize import RegexpTokenizer
data = "Vous êtes au volant d'une voiture et vous roulez à vitesse"
#wst = WhitespaceTokenizer()
#tokenizer = RegexpTokenizer('\s+', gaps=True)
token=WhitespaceTokenizer().tokenize(data)
print token
Output I got:
['Vous', '\xeates', 'au', 'volant', "d'une", 'voiture', 'et', 'vous', 'roulez', '\xe0', 'vitesse']
Desired output
['Vous', 'êtes', 'au', 'volant', "d'une", 'voiture', 'et', 'vous', 'roulez', 'à', 'vitesse']
In Python 2, to write UTF-8 text in your code you need to start your file with # -*- coding: <encoding name> -*- when not using ASCII. You also need to prefix unicode string literals with u:
# -*- coding: utf-8 -*-
import nltk
...
data = u"Vous êtes au volant d'une voiture et vous roulez à grande vitesse"
print WhitespaceTokenizer().tokenize(data)
When you're not writing data in your Python code but reading it from a file, you must make sure that it's properly decoded by Python. The codecs module helps here:
import codecs
f = codecs.open('fichier.txt', encoding='utf-8')  # reads from f return unicode strings
This is good practice: if there is an encoding error, you will know about it right away, and it won't bite you later on, e.g. after processing your data. This is also the only approach that works in Python 3, where codecs.open becomes open and decoding is always done right away. More generally, avoid the Python 2 'str' type like the plague and always stick with unicode strings to make sure encoding is handled properly.
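A sketch of that decode-early pattern, assuming a utf-8 file named fichier.txt:
# -*- coding: utf-8 -*-
import codecs
from nltk.tokenize import WhitespaceTokenizer

# Decode at the boundary: every line read here is already unicode.
with codecs.open('fichier.txt', encoding='utf-8') as f:
    for line in f:
        print WhitespaceTokenizer().tokenize(line)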
Recommended readings:
Python 2: Unicode HOWTO
Python 3: Unicode HOWTO
Python 3 Text Files Processing
What's new in Python 3: Unicode
Bon courage !
Take a look at section "3.3 Text Processing with Unicode" in Chapter 3 of the NLTK book.
Make sure that your string is prefixed with a u and you should be OK. Also note from that chapter, as @tripleee suggested:
There are many factors determining what glyphs are rendered on your screen. If you are sure that you have the correct encoding, but your Python code is still failing to produce the glyphs you expected, you should also check that you have the necessary fonts installed on your system.
You don't really need the whitespace tokenizer for French if it's a simple sentence where tokens are naturally delimited by spaces. If not the nltk.tokenize.word_tokenize() would serve you better.
See How to print UTF-8 encoded text to the console in Python < 3?
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
sentence = "Vous êtes au volant d'une voiture et vous roulez à grande $3.88 vitesse"
print sentence.split()
from nltk.tokenize import word_tokenize
print word_tokenize(sentence)
from nltk.tokenize import wordpunct_tokenize
print wordpunct_tokenize(sentence)