How can I test output of non-ASCII characters using Sphinx doctest? - python-2.7

I'm at a loss how to test printing output that includes non-ASCII characters using Sphinx doctest.
When a test includes code that generates non-ASCII characters, or contains expected results that include non-ASCII characters, I get encoding errors.
For example, if I have:
def foo():
    return 'γ'
then a doctest including
>>> print(foo())
will produce an error of the form
Encoding error:
'ascii' codec can't encode character u'\u03b3' in position 0: ordinal not in range(128)
as will any test of the form
>>> print('')
γ
Only by ensuring that none of my functions whose results I'm attempting to print, and none of the expected printed results, contain such characters can I avoid these errors. As a result I've had to disable many important tests.
At the head of all my code I have
# encoding: utf8
from __future__ import unicode_literals
and (in desperation) I've tried things like
doctest_global_setup = (
    '#encoding: utf8\n\n'
    'from __future__ import unicode_literals\n'
)
and
.. testsetup::

    from __future__ import unicode_literals
but these (of course) don't change the outcome.
How can I test output of non-ASCII characters using Sphinx doctest?

I believe it is due to your from __future__ import unicode_literals statement. print will implicitly encode Unicode strings to the terminal encoding. Lacking a terminal, Python 2 will default to the ascii codec.
If you skip an explicit print, it will work with or without import:
>>> def foo():
...     return 'ë'
...
>>> foo()
'\x89'
Or:
>>> from __future__ import unicode_literals
>>> def foo():
...     return 'ë'
...
>>> foo()
u'\xeb'
Then you can test for the escaped representation of the string.
You can also try changing the encoding of print itself with PYTHONIOENCODING=utf8.
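The failure and the escaped-representation workaround can be reproduced outside Sphinx. The following is a minimal sketch, not part of the original answer: the variable names are illustrative, and the explicit encode call only imitates what print does when the output stream falls back to the ascii codec.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

text = 'γ'  # a unicode string, thanks to unicode_literals

# Imitate print on a stream that falls back to the ascii codec.
try:
    text.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = exc

# The escaped form is pure ASCII, so a doctest can compare against it safely.
escaped = text.encode('unicode_escape').decode('ascii')
print(escaped)  # prints \u03b3
```

Comparing against the escaped form keeps both the test source and the expected output ASCII-only, so the result does not depend on the encoding of the output stream.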

Related

UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 12

I'm developing a chatbot with the chatterbot library. The chatbot is in my native language --> Slovene, which has a lot of strange characters (for example: š, č, ž). I'm using python 2.7.
When I try to train the bot, the library has trouble with the characters mentioned above. For example, when I run the following code:
chatBot.set_trainer(ListTrainer)
chatBot.train([
    "Koliko imam še dopusta?",
    "Letos imate še 19 dni dopusta.",
])
it throws the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 12: invalid start byte
I added the # -*- coding: utf-8 -*- line to the top of my file, changed the encoding of all files involved to utf-8 in my editor (Sublime Text 3), and changed the system default encoding with the following code:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
The strings are of type unicode.
Getting a response for input that contains these characters works without any issues. For example, running the following code in the same execution as the training code above (after changing 'š' to 's' and 'č' to 'c' in the training strings) throws no errors:
chatBot.set_trainer(ListTrainer)
chatBot.train([
    "Koliko imam se dopusta?",
    "Letos imate se 19 dni dopusta.",
])
chatBot.get_response("Koliko imam še dopusta?")
I can't find a solution to this issue. Any suggestions?
Thanks loads in advance. :)
EDIT: I used from __future__ import unicode_literals, to make strings of type unicode. I also checked if they really were unicode with the method type(myString)
I would also like to paste this link.
EDIT 2: @MallikarjunaraoKosuri's code works, but in my case I had one more thing in the chatbot instance initialization, namely the following:
chatBot = ChatBot(
    'Test',
    trainer='chatterbot.trainers.ListTrainer',
    storage_adapter='chatterbot.storage.JsonFileStorageAdapter'
)
This is the cause of my error. The json storage file the chatbot creates, is created in my local encoding and not in utf-8. It seems the default storage (.sqlite3), doesn't have this issue, so for now I'll just avoid the json storage. But I am still interested in finding a solution to this error.
The strings from your example are not of type unicode.
Otherwise Python would not throw the UnicodeDecodeError.
This type of error means that at some step of the program's execution Python tries to decode a byte string into unicode and fails.
In your case the reason is that:
the decoder is configured for utf-8
your source file is not in utf-8 and is almost certainly in cp1252:
import unicodedata
b = '\x9a'
# u = b.decode('utf-8') # UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a
# in position 0: invalid start byte
u = b.decode('cp1252')
print unicodedata.name(u) # LATIN SMALL LETTER S WITH CARON
print u # š
So, the 0x9a byte from your cp1252 source can't be decoded with utf-8.
The best solution is simply to convert your source file to utf-8.
With Sublime Text 3 you can easily do it with: File -> Reopen with Encoding -> UTF-8.
But don't forget to Ctrl+C your source code before the conversion, because right after it all your š, č, ž characters will be replaced with ?.
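The mismatch described above can be checked on the raw bytes directly. This is a hedged sketch under the assumption that the source file really was saved as cp1252: the byte string below is hand-built to imitate the first training sentence, and the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# Bytes as a cp1252-encoded source file would store the first training string.
raw = b'Koliko imam \x9ae dopusta?'

try:
    raw.decode('utf-8')
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the offending byte

text = raw.decode('cp1252')        # the matching codec recovers the text
utf8_bytes = text.encode('utf-8')  # what a utf-8 file would contain instead
print(failed_at)  # prints 12
```

The failing position matches the "position 12" in the reported error, which is consistent with the cp1252 explanation.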
Others have already suggested good partial solutions, but I would like to combine them into one.
The author, @gunthercox, also describes some guidelines here: http://chatterbot.readthedocs.io/en/stable/encoding.html#how-do-i-fix-python-encoding-errors
# -*- coding: utf-8 -*-
from chatterbot import ChatBot
# Create a new chat bot named Test
chatBot = ChatBot(
    'Test',
    trainer='chatterbot.trainers.ListTrainer'
)
chatBot.train([
    "Koliko imam še dopusta?",
    "Letos imate še 19 dni dopusta.",
])
Python Terminal
>>> # -*- coding: utf-8 -*-
... from chatterbot import ChatBot
>>>
>>> # Create a new chat bot named Test
... chatBot = ChatBot(
...     'Test',
...     trainer='chatterbot.trainers.ListTrainer'
... )
>>>
>>> chatBot.train([
...     "Koliko imam še dopusta?",
...     "Letos imate še 19 dni dopusta.",
... ])
List Trainer: [####################] 100%
>>>

How can i clean urdu data corpus Python without nltk

I have a corpus of more than 10000 words in Urdu, and I want to clean it. The text contains special Unicode characters such as "!؟ـ،". Whenever I use regular expressions, I get an error saying that my data is not encoded.
Kindly provide me some help to clean my data.
Thank you
Here is my sample data:
ظہیر
احمد
ماہرہ
خان
کی،
تصاویر،
نے
دائیں
اور
بائیں
والوں
کو
آسمانوں
پر
پہنچایا
،ہوا
ہے۔
دائیں؟
والے
I used your sample to find all words containing ہ or ر.
Notice that I had to tell Python that I am dealing with Unicode text by putting u in front of both the regex string and the data string.
import re
data = u"""
ظہیر
احمد
ماہرہ
خان
.....
"""
result = re.findall(u'[^\s\n]+[ہر][^\s\n]+',data,re.MULTILINE)
print(result)
The output was
['ظہیر', 'ماہرہ', 'تصاویر،', 'پہنچایا', '،ہوا']
Another example removes all non-alphabetic characters except whitespace and makes sure only a single space separates the words:
result = re.sub(' +',' ',re.sub(u'[\W\s]',' ',data))
print(result)
the output is
ظہیر احمد ماہرہ خان کی تصاویر نے دائیں اور بائیں والوں کو آسمانوں پر پہنچایا ہوا ہے دائیں والے
You can also use a word tokenizer:
import nltk
result = nltk.tokenize.wordpunct_tokenize(data)
print(result)
the output will be
['ظہیر', 'احمد', 'ماہرہ'
, 'خان', 'کی', '،', 'تصاویر'
, '،', 'نے', 'دائیں', 'اور', 'بائیں', 'والوں'
, 'کو', 'آسمانوں', 'پر', 'پہنچایا'
, '،', 'ہوا', 'ہے', '۔', 'دائیں', '؟', 'والے']
Edit ... for Python 2.7 you have to specify the encoding at the beginning of the code file as well as telling re that the regex is 'unicode' using re.UNICODE
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import re
data = u"""ظہیر
احمد
ماہرہ
خان
کی،
.....
"""
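# re.UNICODE must be passed as flags=; the fourth positional argument of re.sub is count
result = re.sub(ur'\s+', u' ', re.sub(ur'[\W\s]', u' ', data, flags=re.UNICODE), flags=re.UNICODE)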
print(result)
also note the use of ur'' to specify the string is a unicode regex string
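The cleaning step can also be wrapped in a small helper that runs on both Python 2.7 and Python 3. This is a sketch rather than the answerer's code: the function name clean_urdu is hypothetical, and it passes the flag by keyword because the fourth positional argument of re.sub is count.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import re

def clean_urdu(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    # [^\w\s] matches anything that is neither a word character nor whitespace.
    no_punct = re.sub(r'[^\w\s]', ' ', text, flags=re.UNICODE)
    return re.sub(r'\s+', ' ', no_punct, flags=re.UNICODE).strip()

sample = 'کی،\nتصاویر،\nہے۔\nدائیں؟'
print(clean_urdu(sample))
```

Characters that Unicode classifies as letters count as word characters here, so the tatweel (ـ) and the underscore are kept; they would need to be listed explicitly in the pattern if they should be removed as well.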

How to convert CJK Extention B in QLineEdit of Python3-PyQt4 to utf-8 to Processing it with regex

I have code like this:
#!/usr/bin/env python3
#-*-coding:utf-8-*-
from PyQt4 import QtGui, QtCore
import re
.....
str = self.lineEdit.text() # lineEdit is a object in QtGui.QLineEdit class
# This line thanks to Fedor Gogolev et al from
#https://stackoverflow.com/questions/12214801/print-a-string-as-hex-bytes
print('\\u'+"\\u".join("{:x}".format(ord(c)) for c in str))
# u+20000-u+2a6d6 is CJK Ext B
cjk = re.compile("^[一-鿌㐀-䶵\U00020000-\U0002A6D6]+$",re.UNICODE)
if cjk.match(str):
    print("OK")
else:
    print("error")
when I inputted "敏感詞" (0x654F,0x611F, 0x8A5E in utf16 respectively), the result was:
\u654f\u611f\u8a5e
OK
But when I input "詞𠀷𠂁𠁍" (0x8A5E, 0xD840 0xDC37, 0xD840 0xDC81, 0xD840 0xDC4D in utf-16), which contains 3 characters from the CJK Extension B area, the unexpected result is:
\u8a5e\ud840\udc37\ud840\udc81\ud840\udc4d
error
How can I process these CJK characters, converting them to utf-8 if necessary, so that they can be handled properly by Python 3's re module?
P.S.
The value of sys.maxunicode is 1114111, so it is probably UCS-4. Hence I think this question is not the same as
python regex fails to match a specific Unicode > 2 hex values
another code:
#!/usr/bin/env python3
#-*-coding:utf-8-*-
import re
CJKBlock = re.compile("^[一-鿌㐀-䶵\U00020000-\U0002A6D6]+$") #CJK ext B
print(CJKBlock.search('詞𠀷𠂁𠁍'))
returns <_sre.SRE_Match object; span=(0, 4), match='詞𠀷𠂁𠁍'>, which is the expected result.
Even when I added self.lineEdit.setText("詞𠀷𠂁𠁍") inside the __init__ function of the window class and executed it, the text in the LineEdit was displayed correctly, but when I pressed Enter the result was still "error".
version:
Python3.4.3
Qt version: 4.8.6
PyQt version: 4.10.4.
There were a few PyQt4 bugs following the implementation of PEP-393 that can affect conversions between QString and python strings. If you use sip to switch to the v1 API, you should probably be able to confirm that the QString returned by the line-edit does not contain surrogate pairs. But if you then convert it to a python string, the surrogates should appear.
Here is how to test this in an interactive session:
>>> import sip
>>> sip.setapi('QString', 1)
>>> from PyQt4 import QtGui
>>> app = QtGui.QApplication([])
>>> w = QtGui.QLineEdit()
>>> w.setText('詞𠀷𠂁𠁍')
>>> qstr = w.text()
>>> qstr
PyQt4.QtCore.QString('詞𠀷𠂁𠁍')
>>> pystr = str(qstr)
>>> print('\\u' + '\\u'.join('{:x}'.format(ord(c)) for c in pystr))
\u8a5e\u20037\u20081\u2004d
Of course, this last line does not show surrogates for me, because I cannot do the test with PyQt-4.10.4. I have tested with PyQt-4.11.1 and PyQt-4.11.4, though, and I did not see any problems. So you should try to upgrade to one of those.
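If upgrading is not possible, one possible workaround is to recombine the surrogate pairs in the Python string before matching. This is a sketch and not part of the tested answer above: merge_surrogates is a hypothetical helper, and it assumes Python 3.4 or later, where the surrogatepass error handler works with the utf-16 codec.

```python
import re

def merge_surrogates(text):
    """Recombine UTF-16 surrogate pairs into single code points (Python 3)."""
    # 'surrogatepass' lets surrogate code points through the encoder; decoding
    # the resulting UTF-16 bytes then joins each high/low pair.
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')

# Same ranges as in the question, written with escapes.
cjk = re.compile('^[\u4e00-\u9fcc\u3400-\u4db5\U00020000-\U0002A6D6]+$')

broken = '\u8a5e\ud840\udc37\ud840\udc81\ud840\udc4d'  # as reported in the question
fixed = merge_surrogates(broken)
print(len(broken), len(fixed))  # prints 7 4
```

After the round trip the string has four code points, so the pattern from the question matches it.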

PyCharm issue on encoding

I am trying pycharm and facing an encoding issue. Can you please help resolve it.
code:
# -*- coding: utf-8 -*-
__author__ = 'me'
import os, sys
def main():
    print repeat('mike',False)
    print repeat('mok', True)
"""
comments here..
"""
def repeat(s,exclaim):
    result = s*3
    if exclaim:
        result = result +'!!!'
    return result
if __name__ == '__main__':
    main()
error:
C:\Python27\python.exe C:\Python27\python.exe C:/Users/prakashs/PycharmProjects/GooglePython/WarmUp.py
File "C:\Python27\python.exe", line 1
SyntaxError: Non-ASCII character '\x90' in file C:\Python27\python.exe on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Process finished with exit code 1
I have set the default encoding in pycharm to utf-8 as well. but i need to know where in pycharm we have to edit the settings.
Thank you.
Googling for "Non-ASCII character '\x90' in file" gives the Stack Overflow question Using #-*- coding: utf-8 -*- does not remove "Non-ASCII character '\x90' in file hello.exe on line 1, but no encoding declared" error as the first hit. There you'll find the answer to your question.
Your command is wrong: it starts with C:\Python27\python.exe C:\Python27\python.exe... (python.exe is mentioned twice), which means you are trying to run the executable (python.exe) as a script instead of the script file (WarmUp.py).

Selecting nodes with non-ASCII characters in Scrapy

I have the following simple web scraper written in Scrapy:
#!/usr/bin/env python
# -*- coding: latin-1 -*-
from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
class MySpiderTest(BaseSpider):
    name = 'MySpiderTest'
    allowed_domains = ["boliga.dk"]
    start_urls = ["http://www.boliga.dk/bbrinfo/3B71489C-AEA0-44CA-A0B2-7BD909B35618",]
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = bbrItem()
        print hxs.select("id('unitControl')/div[2]/table/tbody/tr[td//text()[contains(.,'Antal Badeværelser')]]/td[2]/text()").extract()
but when I run the spider I get the following syntax error:
SyntaxError: Non-ASCII character '\xe6' in file... on line 32, but no encoding declared
because of the æ in the xpath. The xpath is working in Xpath Checker for Firefox. I tried URL-encoding the æ, but that didn't work. What am I missing?
thanks!
UPDATE: I have added the encoding declaration in the beginning of the code (Latin-1 should support Danish characters)
Use a unicode string for your XPath expression
hxs.select(u"id('unitControl')/div[2]/table/tbody/tr[td//text()[contains(.,'Antal Badeværelser')]]/td[2]/text()").extract()
or
hxs.select(u"id('unitControl')/div[2]/table/tbody/tr[td//text()[contains(.,'Antal Badev\u00e6relser')]]/td[2]/text()").extract()
See Unicode Literals in Python Source Code
SyntaxError: Non-ASCII character ‘\xe2′ in file … on line 40,
but no decoding declared …
This is caused by standard characters such as the apostrophe (') being replaced by non-standard characters such as the curly quotation mark (‘) during copying.
Try editing the text copied from the PDF.
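Stray characters of this kind can be located before running the file. The helper below is a minimal sketch with a hypothetical name, find_non_ascii, and a made-up two-line sample; it only reports positions and does not decide which characters are legitimate.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

def find_non_ascii(source):
    """Return (line, column, character) for every non-ASCII character."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), 1):
        for col, ch in enumerate(line, 1):
            if ord(ch) > 127:
                hits.append((lineno, col, ch))
    return hits

pasted = "a = 1\nb = \u2018x\u2019\n"  # curly quotes, as pasted from a PDF
for lineno, col, ch in find_non_ascii(pasted):
    print(lineno, col, repr(ch))
```

Line and column numbers start at 1, so they can be compared with the line number in the SyntaxError message.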
response.xpath("//tr[contains(., '" + u'中文字符' + "')]").extract()