My telegram bot does not support Persian language - python-2.7

I built a Telegram bot with python-telegram-bot, and I want it to send a message in Persian to the user when the user sends /start, but the bot does not work.
My Code:
from telegram.ext import Updater, CommandHandler

updater = Updater(token='TOKEN')

def start_method(bot, update):
    bot.sendMessage(update.message.chat_id, "سلام")

start_command = CommandHandler('start', start_method)
updater.dispatcher.add_handler(start_command)
updater.start_polling()

If you want to use unicode text in your code, you have to specify the file encoding according to PEP 263; without an encoding declaration, Python 2 refuses to compile a source file that contains non-ASCII characters and raises a SyntaxError.
Place this comment at the beginning of your script:
#!/usr/bin/python
# -*- coding: utf-8 -*-
You can also use Python 3, which has much better unicode support in general and assumes utf-8 encoding for source files by default.
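For example, a minimal sketch of the script above with the declaration added (the 'TOKEN' placeholder is kept from the question; depending on the library version, a unicode literal for the Persian text is the safer choice on Python 2):
#!/usr/bin/python
# -*- coding: utf-8 -*-
from telegram.ext import Updater, CommandHandler

updater = Updater(token='TOKEN')  # placeholder token, as in the question

def start_method(bot, update):
    # With the encoding declared, the Persian literal is read correctly;
    # u"سلام" (a unicode literal) avoids byte/unicode mix-ups on Python 2.
    bot.sendMessage(update.message.chat_id, u"سلام")

start_command = CommandHandler('start', start_method)
updater.dispatcher.add_handler(start_command)
updater.start_polling()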

First, you need to use urllib. If your text is something like txt1, you need to quote it first and then send it as a message, like this:
from urllib.parse import quote
......
txt1 = 'سلام. خوش آمدید!'
txt = quote(txt1.encode('utf-8'))
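Note that urllib.parse only exists on Python 3; since the question targets Python 2.7, a rough equivalent (assuming the same txt1 string and a utf-8 coding declaration at the top of the file) would be:
from urllib import quote  # Python 2.7 location of quote()

txt1 = 'سلام. خوش آمدید!'  # already a UTF-8 byte string under the coding declaration
txt = quote(txt1)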


UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 12

I'm developing a chatbot with the chatterbot library. The chatbot is in my native language, Slovene, which has a lot of special characters (for example: š, č, ž). I'm using Python 2.7.
When I try to train the bot, the library has trouble with the characters mentioned above. For example, when I run the following code:
chatBot.set_trainer(ListTrainer)
chatBot.train([
    "Koliko imam še dopusta?",
    "Letos imate še 19 dni dopusta.",
])
it throws the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 12: invalid start byte
I added the # -*- coding: utf-8 -*- line to the top of my file, I also changed the encoding of all used files to utf-8 via my editor (Sublime Text 3), and I changed the system default encoding with the following code:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
The strings are of type unicode.
When I try to get a response with these special characters, it works and has no issues with them. For example, running the following code in the same execution as the above training code (when I change 'š' to 's' and 'č' to 'c' in the training strings) throws no errors:
chatBot.set_trainer(ListTrainer)
chatBot.train([
    "Koliko imam se dopusta?",
    "Letos imate se 19 dni dopusta.",
])
chatBot.get_response("Koliko imam še dopusta?")
I can't find a solution to this issue. Any suggestions?
Thanks loads in advance. :)
EDIT: I used from __future__ import unicode_literals to make the strings of type unicode. I also checked whether they really were unicode with type(myString).
I would also like to paste this link.
EDIT 2: @MallikarjunaraoKosuri's code works, but in my case I had one more thing inside the chatbot instance initialization, which is the following:
chatBot = ChatBot(
    'Test',
    trainer='chatterbot.trainers.ListTrainer',
    storage_adapter='chatterbot.storage.JsonFileStorageAdapter'
)
This is the cause of my error. The json storage file the chatbot creates is written in my local encoding and not in utf-8. It seems the default storage (.sqlite3) doesn't have this issue, so for now I'll just avoid the json storage. But I am still interested in finding a solution to this error.
The strings from your example are not of type unicode.
Otherwise Python would not throw the UnicodeDecodeError.
This type of error means that at a certain step of the program's execution Python tries to decode a byte string into unicode but for some reason fails.
In your case the reason is that:
decoding is configured to use utf-8
your source file is not in utf-8, but almost certainly in cp1252:
import unicodedata
b = '\x9a'
# u = b.decode('utf-8') # UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a
# in position 0: invalid start byte
u = b.decode('cp1252')
print unicodedata.name(u) # LATIN SMALL LETTER S WITH CARON
print u # š
So, the 0x9a byte from your cp1252 source can't be decoded with utf-8.
The best solution is to do nothing except convert your source to utf-8.
With Sublime Text 3 you can easily do it via File -> Reopen with Encoding -> UTF-8.
But don't forget to copy (Ctrl+C) your source code before the conversion, because right after it all your š, č, ž chars will be replaced with ?.
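If you want to double-check the result, a quick sanity check is to try decoding the raw file contents as UTF-8 (the file name here is just a placeholder for your own script):
# Try to decode the raw source file as UTF-8; if it raises, the file
# is still in a legacy encoding such as cp1252.
with open('my_chatbot.py', 'rb') as f:  # hypothetical file name
    raw = f.read()
try:
    raw.decode('utf-8')
    print 'file is valid UTF-8'
except UnicodeDecodeError as e:
    print 'not UTF-8:', e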
Some of our friends have already suggested good partial solutions; however, I would like to combine all the solutions into one.
The author @gunthercox has also described some guidelines here: http://chatterbot.readthedocs.io/en/stable/encoding.html#how-do-i-fix-python-encoding-errors
# -*- coding: utf-8 -*-
from chatterbot import ChatBot

# Create a new chat bot named Test
chatBot = ChatBot(
    'Test',
    trainer='chatterbot.trainers.ListTrainer'
)

chatBot.train([
    "Koliko imam še dopusta?",
    "Letos imate še 19 dni dopusta.",
])
Python Terminal
>>> # -*- coding: utf-8 -*-
... from chatterbot import ChatBot
>>>
>>> # Create a new chat bot named Test
... chatBot = ChatBot(
... 'Test',
... trainer='chatterbot.trainers.ListTrainer'
... )
>>>
>>> chatBot.train([
... "Koliko imam še dopusta?",
... "Letos imate še 19 dni dopusta.",
... ])
List Trainer: [####################] 100%
>>>

I built a telegram bot with Python-Telegram-Bot, but it does not work

I built a telegram bot with Python-Telegram-Bot. I added the bot to a group and made it an admin of the group. I have defined a list (mlist) of words for the bot. The bot should check the messages that users send to the group, and if a message contains any of the words defined in mlist, the bot must delete that message.
# -*- coding: utf-8 -*-
import os, sys
from telegram.ext import Updater, MessageHandler, Filters
import re

def delete_method(bot, update):
    if not update.message.text:
        print("it does not contain text")
        return
    mlist = ['سلام', 'شادي']
    for i in mlist:
        if re.search(i, update.message.text):
            bot.delete_message(chat_id=update.message.chat_id, message_id=update.message.message_id)

def main():
    updater = Updater(token='TOKEN')
    dispatcher = updater.dispatcher
    dispatcher.add_handler(MessageHandler(Filters.all, delete_method))
    updater.start_polling()
    updater.idle()

if __name__ == '__main__':
    main()

# for exit
# updater.idle()
(The bot should delete the messages that are sent to the group and contain the words in mlist.)
But the bot does not work, and it gives no error.
Try to replace the words in mlist with English ones and see if it works then, just to check whether that's causing the problem.
EDIT: So it works with English words. The reason is that the Telegram API only supports UTF-8, but Python works with Unicode, and Unicode ≠ UTF-8. You have to encode your text with UTF-8. Take a string and add:
.encode('utf-8')
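A minimal sketch of how this suggestion could be applied to the handler above (assuming the patterns in mlist are UTF-8 byte strings thanks to the coding declaration, and that the helper name should_delete is just illustrative):
# -*- coding: utf-8 -*-
import re

mlist = ['سلام', 'شادي']  # UTF-8 byte strings because of the coding declaration

def should_delete(message_text):
    # message_text arrives from python-telegram-bot as a unicode string;
    # encode it to UTF-8 so re.search compares bytes with bytes.
    encoded = message_text.encode('utf-8')
    return any(re.search(word, encoded) for word in mlist)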

Python 2.7.x - unicode issue

I'm scraping this site www.soundkartell.de, and I'm facing some unicode issues:
results = []
for article in soup.find_all('article'):
    if article.select('a[href*="alternative"]'):
        artist = article.h2.text
        results.append(artist.encode('latin1').decode("utf-8"))
        print artist  # Din vän Skuggan
print results  # [u'Din v\xe4n Skuggan']
I have # -*- coding: utf-8 -*- at the top of my file.
Why does Python print the scraped data correctly but not the appended data?
How do I fix the unicode issue?
I am using Python 2.7.x
You likely do not actually have a problem. What you are seeing is a side effect of how python prints things:
Sample Code:
artist = 'Din vän Skuggan'
artists = [artist]
print 'artist:', artist
print 'artists:', artists
print 'str:', str(artist)
print 'repr:', repr(artist)
Produces:
artist: Din vän Skuggan
artists: ['Din v\xc3\xa4n Skuggan']
str: Din vän Skuggan
repr: 'Din v\xc3\xa4n Skuggan'
So as can be seen above, when Python prints a list, it uses repr() for the items in the list. In both cases you have the same contents; Python is just showing them differently.
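As a small usage note, printing the items themselves rather than the list shows the readable text, since that goes through str() instead of repr():
# Printing the elements (or a joined string) shows the text as-is
# instead of as escape sequences.
for name in artists:
    print name
print ', '.join(artists)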
Side Note:
# -*- coding: utf-8 -*-
at the top of your script is needed when your code contains string literals with non-ASCII text.

Why does OleFileIO_PL only work with .doc file types and not .docx in Python?

Right, so I'm working on a Python script (Python 2.7) that will extract the metadata from OLE files. I am using OleFileIO_PL and it works perfectly fine with Office 97 - 2003 OLE files, but for anything later than that it just says that it is not an OLE2 file type.
Is there any way I can modify my code to support both .doc and .docx? The same goes for .ppt and .pptx, etc.
Thank you in advance
Source Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import OleFileIO_PL
import StringIO
import optparse
import sys
import os

def printMetadata(fileName):
    data = open(fileName, 'rb').read()
    f = StringIO.StringIO(data)
    OLEFile = OleFileIO_PL.OleFileIO(f)
    meta = OLEFile.get_metadata()
    print('Author:', meta.author)
    print('Title:', meta.title)
    print('Creation date:', meta.create_time)
    meta.dump()
    OLEFile.close()

def main():
    parser = optparse.OptionParser('usage = -F + Name of the OLE file with the extention For example: python Ms Office Metadata Extraction Script.py -F myfile.docx ')
    parser.add_option('-F', dest='fileName', type='string',
                      help='specify OLE (MS Office) file name')
    (options, args) = parser.parse_args()
    fileName = options.fileName
    if fileName == None:
        print parser.usage
        exit(0)
    else:
        printMetadata(fileName)

if __name__ == '__main__':
    main()
To answer your question, this is because the newer MS Office 2007+ files (docx, xlsx, xlsb, pptx, etc) have a completely different structure from the legacy MS Office 97-2003 formats.
It is mainly a collection of XML files within a Zip archive. So with a little bit of work, you can extract everything you need using zipfile and ElementTree from the standard library.
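For example, here is a rough sketch of pulling a few core properties out of a .docx with just the standard library (the file name myfile.docx is only a placeholder):
import zipfile
import xml.etree.ElementTree as ET

DC = '{http://purl.org/dc/elements/1.1/}'
DCTERMS = '{http://purl.org/dc/terms/}'

def print_docx_metadata(file_name):
    # OOXML files are Zip archives; the core metadata lives in docProps/core.xml.
    with zipfile.ZipFile(file_name) as z:
        core = ET.fromstring(z.read('docProps/core.xml'))
    print 'Author:', core.findtext(DC + 'creator')
    print 'Title:', core.findtext(DC + 'title')
    print 'Creation date:', core.findtext(DCTERMS + 'created')

print_docx_metadata('myfile.docx')  # hypothetical file name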
If openxmllib (a library for handling such OpenXML files) does not work for you, you may try other solutions:
officedissector: https://www.officedissector.com/
python-opc: https://pypi.python.org/pypi/python-opc
openpack: https://pypi.python.org/pypi/openpack
paradocx: https://pypi.python.org/pypi/paradocx
BTW, OleFileIO_PL has been renamed to olefile, and the new project page is https://github.com/decalage2/olefile
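A quick sketch of the same legacy-format check with the renamed package (the API is assumed to be compatible with OleFileIO_PL, and the file name is a placeholder):
import olefile

if olefile.isOleFile('myfile.doc'):  # hypothetical legacy .doc file
    ole = olefile.OleFileIO('myfile.doc')
    meta = ole.get_metadata()
    print 'Author:', meta.author
    ole.close()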

Scrapy: issue with encoding when dumping to the json file

Here is the web-site I would like to parse: [web-site in Russian][1]
Here is the code that extracts the info that I need:
# -*- coding: utf-8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from flats.items import FlatsItem

class DmozSpider(Spider):
    name = "dmoz"
    start_urls = ['http://rieltor.ua/flats-sale/?ncrnd=6510']

    def parse(self, response):
        sel = Selector(response)
        flats = sel.xpath('//*[@id="content"]')
        flats_stored_info = []
        flat_item = FlatsItem()
        for flat in flats:
            flat_item['square'] = [s.encode("utf-8") for s in sel.xpath('//div/strong[@class="param"][1]/text()').extract()]
            flat_item['rooms_floor_floors'] = [s.encode("utf-8") for s in sel.xpath('//div/strong[@class="param"][2]/text()').extract()]
            flat_item['address'] = [s.encode("utf-8") for s in flat.xpath('//*[@id="content"]//h2/a/text()').extract()]
            flat_item['price'] = [s.encode("utf-8") for s in flat.xpath('//div[@class="cost"]/strong/text()').extract()]
            flat_item['subway'] = [s.encode("utf-8") for s in flat.xpath('//span[@class="flag flag-location"]/a/text()').extract()]
            flats_stored_info.append(flat_item)
        return flats_stored_info
This is how I dump to the json file:
scrapy crawl dmoz -o items.json -t json
The problem is that when I replace the code above so that it prints the extracted info to the console, i.e. like this:
flat_item['square'] = sel.xpath('//div/strong[@class="param"][1]/text()').extract()
for bla in flat_item['square']:
    print bla
the script properly displays the information in Russian.
But when I dump the scraped information using the first version of the script (with encoding to utf-8), it writes something like this to the json file:
[{"square": ["2-\u043a\u043e\u043c\u043d., 16 \u044d\u0442\u0430\u0436 16-\u044d\u0442. \u0434\u043e\u043c", "1-\u043a\u043e\u043c\u043d.,
How can I dump the information into the json file in Russian? Thank you for your advice.
[1]: http://rieltor.ua/flats-sale/?ncrnd=6510
It is correctly encoded; it's just that the json library escapes non-ASCII characters by default.
You can load the data and use it (copying data from your example):
>>> import json
>>> print json.loads('"2-\u043a\u043e\u043c\u043d., 16 \u044d\u0442\u0430\u0436 16-\u044d\u0442. \u0434\u043e\u043c"')
2-комн., 16 этаж 16-эт. дом
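If you additionally want the dumped file itself to contain readable Cyrillic rather than \u escapes, a small sketch is to serialize with ensure_ascii=False and write the result as UTF-8 (the output file name here is arbitrary):
# -*- coding: utf-8 -*-
import io
import json

data = [{'square': u'2-комн., 16 этаж 16-эт. дом'}]

# ensure_ascii=False keeps the Cyrillic characters instead of \uXXXX escapes;
# io.open with an explicit encoding writes them out as UTF-8.
with io.open('items_readable.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(data, ensure_ascii=False))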