Character Encoding: Why can't my email-receiving code run in PyQt4? - python-2.7

I am finishing a spam classification application as my final project, and I have run into a problem.
The problem comes from a module that receives emails. I wrote the test code in a standalone .py file and it worked well. Here is the code:
#!/usr/bin/env python
# coding=utf-8
import poplib
from email import parser
host = 'pop.qq.com'
username = 'xxxxx@qq.com'
password = 'xxxxxxxxxxxxx'
pop_conn = poplib.POP3_SSL(host)
pop_conn.user(username)
pop_conn.pass_(password)
messages = [pop_conn.retr(i) for i in range(1, len(pop_conn.list()[1]) + 1)]
# Concat message pieces:
messages = ["\n".join(mssg[1]) for mssg in messages]
#print messages
messages = [parser.Parser().parsestr(mssg) for mssg in messages]
i = 0
for message in messages:
    i = i + 1
    mailName = "mail"+str(i)
    f = open(mailName + '.log', 'w');
    print >> f, "Date: ", message["Date"]
    print >> f, "From: ", message["From"]
    print >> f, "To: ", message["To"]
    print >> f, "Subject: ", message["Subject"]
    print >> f, "Data: "
    for part in message.walk():
        contentType = part.get_content_type()
        if contentType == 'text/plain' :
            data = part.get_payload(decode=True)
            print >> f, data
    f.close()
pop_conn.quit()
But when I moved exactly the same code into my PyQt4 application, it failed on this line:
messages = ["\n".join(mssg[1]) for mssg in messages]
with this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 4: ordinal not in range(128)
mssg[1] is a list that contains every line of the mail. I guess this happens because the text of the mail is encoded in "utf-8" or "gbk", which can't be decoded by the default "ascii" codec. So I tried to write the code like this:
messages = ["\n".join([m.decode("utf-8") for m in mssg[1]]) for mssg in messages]
The error then became:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 7
I used the Python chardet module to detect the encoding of the email text, and it reported "ascii". Now I am really confused. Why can't the same code run in my small application? What is the real problem, and how can I fix it? I would greatly appreciate your help.

I finally worked around this problem by receiving the email in a separate .py file and importing that file from my application. This may not be useful in other situations, because I didn't actually solve the character encoding problem. While implementing my application I ran into lots of encoding problems, and it is quite annoying. For this one, I guess it is caused by some irregular content in my mail (maybe some pictures), which showed up as garbled output when I tried to print some of my email data on the screen.
However, I still don't know why this cannot run in my application even though it worked well in a standalone file. The character encoding problem is very annoying, and maybe I still have a long way to go. :-D
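One possible explanation, offered only as a guess: in Python 2, joining byte strings with a unicode separator makes Python decode every piece with the default 'ascii' codec, which produces exactly this kind of UnicodeDecodeError. Whether the PyQt4 application really ends up with a unicode separator is an assumption. A defensive approach is to decode each raw line explicitly before joining. The sketch below uses Python 3 syntax; the helper name decode_with_fallback and the candidate encodings are hypothetical choices, not something taken from the original code.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gbk")):
    """Try each candidate encoding in turn and never raise."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a code point, so it cannot fail.
    return raw.decode("latin-1")


# Raw lines as a POP3 server might return them, in mixed encodings.
lines = [
    "Subject: 你好".encode("utf-8"),
    "Subject: 你好".encode("gbk"),
    b"\xff\xfe raw attachment bytes",
]

# Decode first, then join text with text.
message = "\n".join(decode_with_fallback(line) for line in lines)
```

The latin-1 fallback keeps the program running on binary junk such as attachments, at the cost of producing unreadable characters for those lines.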

Related

Python: replacing unusual characters in a text file

I am trying to make the following substitutions automatically in a text file.
â€\u9d = "
“ = "
’ = '
— = :
I consistently run into the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 452: character maps to <undefined>
Here's my recent code:
fin = open("example.md", "rt")
data = fin.read()
data = data.replace(r'â€\u9d', '\"')
data = data.replace(r'“', '\"')
data = data.replace(r'’', '\"')
data = data.replace(r'—', ':')
fin.close()
fin = open("data.txt", "wt")
fin.write(data)
fin.close()
According to this question, you can use re.sub, as shown below:
import re
my_str = "hey th~!ere"
my_new_string = re.sub('[^a-zA-Z0-9 \n\.]', '', my_str)
print my_new_string
I tested it and it works :)
You have two problems. First is that you're opening the file with the wrong encoding, leading to a case of mojibake as suggested by @JosefZ in the comments. The solution is exactly as he suggested:
fin = open("example.md", "rt", encoding="utf-8")
The second problem is that you're using a very ham-fisted way of correcting the first problem. You may find that once you read the characters correctly there's no need to fix them. But if you still need to convert curly quotes to straight ones so that everything's compatible with ASCII, there's a much easier way to do that with the unidecode module.
from unidecode import unidecode
data = unidecode(data)
This will take care of all the characters listed in your question, and more besides.
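If you would rather not add a dependency, the same idea can be sketched with the standard library only. This is a minimal Python 3 illustration, not a replacement for unidecode: the name PUNCT_TABLE, the helper straighten and the sample text are made up for the example, and the mapping of the dash to a colon simply follows the question. It also shows why the original read failed: the UTF-8 bytes of a right curly quote contain 0x9d, which cp1252 cannot decode.

```python
# Map typographic punctuation to ASCII equivalents (illustrative mapping).
PUNCT_TABLE = str.maketrans({
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u2014": ":",   # em dash, replaced by a colon as in the question
})


def straighten(text):
    """Replace curly quotes and dashes using the table above."""
    return text.translate(PUNCT_TABLE)


# Bytes as they would sit in a UTF-8 encoded file.
raw = "\u201cHello\u201d \u2014 it\u2019s fine".encode("utf-8")

# Decoding with the wrong codec fails on byte 0x9d.
try:
    raw.decode("cp1252")
    decoded_with_cp1252 = True
except UnicodeDecodeError:
    decoded_with_cp1252 = False

# Decoding with the right codec works, and the cleanup becomes trivial.
result = straighten(raw.decode("utf-8"))
```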

how to get python to recognize the ® symbol [duplicate]

Hi there, I am trying to make Python recognize ® as a symbol (in case it doesn't show up well here, it is the capital R within a circle, known as the 'registered' symbol).
I understand that it is not recognized in Python because of ASCII; however, I was wondering if anyone knows of a way to use a different encoding that includes this symbol, or a method to make Python 'ignore' it.
For some context:
I am trying to make an auto-checkout program for a website, so my program needs to match the item that the user wants. To do this I am using BeautifulSoup to scrape information; however, the symbol '®' appears in the names of a few of the items, causing Python to crash.
Here is the code that I am currently using, which fails because of ASCII:
for colour in soup.find_all('a', attrs={"class":"name-link"}, href=True):
    CnI.append(str(colour.text))
    Uhrefs.append(str(colour.get('href')))
Any help would be appreciated.
Here is the entirety of the program so far (ignore the mess, it is nowhere near done):
import time
import webbrowser
from selenium import webdriver
import mechanize
from bs4 import BeautifulSoup
import urllib2
from selenium.webdriver.support.ui import Select
CnI = []
item = []
colour = []
Uhrefs = []
Whrefs = []
FinalColours = []
selectItemindex = []
selectColourindex = []
#counters
Ccounter = 0
Icounter = 0
Splitcounter = 1
#wanted items suffix options:jackets, shirts, tops_sweaters, sweatshirts, pants, shorts, hats, bags, accessories, skate
suffix = 'accessories'
Wcolour = 'Black'
Witem = '2-Tone Nylon 6-Panel'
driver=webdriver.Chrome()
driver.get('http://www.supremenewyork.com/shop/all/'+suffix)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup)
for colour in soup.find_all('a', attrs={"class":"name-link"}, href=True):
    CnI.append(str(colour.text))
    Uhrefs.append(str(colour.get('href')))
    print(colour)
print('#############')
for each in CnI:
    each.split(',')
    print(each)
while Splitcounter<=len(CnI):
    item.append(CnI[Splitcounter-1])
    FinalColours.append(CnI[Splitcounter])
    Whrefs.append(Uhrefs[Splitcounter])
    Splitcounter+=2
print(Uhrefs)
for each in item:
    print(each)
for z in FinalColours:
    print(z)
for i in Whrefs:
    print(i)
##for i in item:
## hold = item.index(i)
## print(hold)
## if Witem == i and Wcolour == FinalColours[i]:
## print('correct')
##
##
for count,elem in enumerate(item):
    if Witem in elem:
        selectItemindex.append(count+1)
for count,elem in enumerate(FinalColours):
    if Wcolour in elem:
        selectColourindex.append(count+1)
print(selectColourindex)
print(selectItemindex)
for each in selectColourindex:
    if selectColourindex[Ccounter] in selectItemindex:
        point = selectColourindex[Ccounter]
        print(point)
    else:
        Ccounter+=1
web = 'http://www.supremenewyork.com'+Whrefs[point-1]
driver.get(web)
elem1 = driver.find_element_by_name('commit')
elem1.click()
time.sleep(1)
elem2 = driver.find_element_by_link_text('view/edit basket')
elem2.click()
time.sleep(1)
elem3 = driver.find_element_by_link_text('checkout now')
elem3.click()
"®" is a non-ASCII character (Unicode code point U+00AE), so in Python 2 calling str() on it fails with the default ASCII codec. Instead of using str(), use something like this:
unicode(input_string, 'utf8')
# or
unicode(input_string, 'unicode-escape')
Edit: Given the code surrounding the initial snippet that was posted later and the fact that BeautifulSoup actually returns unicode already, it seems that removal of str() might be the best course of action and @MarkTolonen's answer is spot-on.
BeautifulSoup returns Unicode strings. Stop converting them back to byte strings. Best practice when dealing with text is to:
1. Decode incoming text to Unicode (what BeautifulSoup is doing).
2. Process all text using Unicode.
3. Encode outgoing Unicode text to bytes (to file, to database, to sockets, etc.).
Small example of your issue:
text = u'\N{REGISTERED SIGN}' # syntax to create a Unicode codepoint by name.
bytes = str(text)
Output:
Traceback (most recent call last):
  File "test.py", line 2, in <module>
    bytes = str(text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 0: ordinal not in range(128)
Note the first line works and supports the character. Converting it to a byte string fails because it defaults to encoding in ASCII. You can explicitly encode it with another encoding (e.g. bytes = text.encode('utf8')), but that breaks rule 2 above and creates other issues.
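The three rules above can be sketched as follows. This is a minimal illustration in Python 3 syntax, where str is already Unicode; the function shout and the sample product name are made up for the example.

```python
def shout(text):
    # Rule 2: all processing happens on Unicode text, never on raw bytes.
    return text.upper()


# Bytes as they might arrive from a socket or a file.
incoming = "Supreme\u00ae Hanes\u00ae Socks".encode("utf-8")

text = incoming.decode("utf-8")        # Rule 1: decode at the input boundary
processed = shout(text)                # Rule 2: work with Unicode only
outgoing = processed.encode("utf-8")   # Rule 3: encode at the output boundary
```

The registered sign survives every step because no implicit ASCII conversion ever takes place.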
Suggested reading:
https://nedbatchelder.com/text/unipain.html
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 12

I'm developing a chatbot with the chatterbot library. The chatbot is in my native language, Slovene, which has a lot of strange characters (for example: š, č, ž). I'm using Python 2.7.
When I try to train the bot, the library has trouble with the characters mentioned above. For example, when I run the following code:
chatBot.set_trainer(ListTrainer)
chatBot.train([
    "Koliko imam še dopusta?",
    "Letos imate še 19 dni dopusta.",
])
it throws the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 12: invalid start byte
I added the # -*- coding: utf-8 -*- line to the top of my file, changed the encoding of all the files involved to utf-8 in my editor (Sublime Text 3), and changed the system default encoding with the following code:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
The strings are of type unicode.
When I try to get a response containing these strange characters, it works without any issues. For example, running the following code in the same execution as the training code above (after changing 'š' to 's' and 'č' to 'c' in the training strings) throws no errors:
chatBot.set_trainer(ListTrainer)
chatBot.train([
    "Koliko imam se dopusta?",
    "Letos imate se 19 dni dopusta.",
])
chatBot.get_response("Koliko imam še dopusta?")
I can't find a solution to this issue. Any suggestions?
Thanks loads in advance. :)
EDIT: I used from __future__ import unicode_literals to make the strings of type unicode. I also checked that they really were unicode with type(myString).
I would also like to paste this link.
EDIT 2: @MallikarjunaraoKosuri's code works, but in my case I had one more thing inside the chatbot instance initialization, which is the following:
chatBot = ChatBot(
    'Test',
    trainer='chatterbot.trainers.ListTrainer',
    storage_adapter='chatterbot.storage.JsonFileStorageAdapter'
)
This is the cause of my error. The JSON storage file that the chatbot creates is written in my local encoding and not in utf-8. The default storage (.sqlite3) doesn't seem to have this issue, so for now I'll just avoid the JSON storage. But I am still interested in finding a solution to this error.
The strings from your example are not of type unicode.
Otherwise Python would not throw the UnicodeDecodeError.
This type of error says that at a certain step of the program's execution Python tries to decode a byte string into unicode but fails for some reason.
In your case the reason is that:
decoding is configured to use utf-8
your source file is not in utf-8 and is almost certainly in cp1252:
import unicodedata
b = '\x9a'
# u = b.decode('utf-8') # UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a
# in position 0: invalid start byte
u = b.decode('cp1252')
print unicodedata.name(u) # LATIN SMALL LETTER S WITH CARON
print u # š
So, the 0x9a byte from your cp1252 source can't be decoded with utf-8.
The best solution is simply to convert your source file to utf-8.
With Sublime Text 3 you can easily do it by: File -> Reopen with Encoding -> UTF-8.
But don't forget to Ctrl+C your source code before the conversion, because right after it all your š, č, ž chars will be replaced with ?.
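If you prefer to do the conversion with a script rather than through the editor, a minimal sketch (Python 3 syntax) could look like this. The helper name convert_encoding and the throwaway demo file are hypothetical, and cp1252 as the source encoding is an assumption taken from the diagnosis above; pass whatever encoding your file is really in.

```python
import os
import tempfile


def convert_encoding(path, source_encoding, target_encoding="utf-8"):
    """Rewrite the file at `path` from source_encoding to target_encoding."""
    # newline="" leaves the original line endings untouched.
    with open(path, "r", encoding=source_encoding, newline="") as f:
        text = f.read()
    with open(path, "w", encoding=target_encoding, newline="") as f:
        f.write(text)


# Demonstration on a throwaway file that starts out in cp1252.
sample = "Koliko imam še dopusta?\n"
fd, demo_path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("cp1252"))

convert_encoding(demo_path, "cp1252")

# Read the raw bytes back to check what is now on disk.
with open(demo_path, "rb") as f:
    converted = f.read()
os.remove(demo_path)
```

Because the file is overwritten in place, keep a backup copy until you have checked the result.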
Others have already suggested good partial solutions, but I would like to combine them into one.
The library's author, @gunthercox, also describes some guidelines here: http://chatterbot.readthedocs.io/en/stable/encoding.html#how-do-i-fix-python-encoding-errors
# -*- coding: utf-8 -*-
from chatterbot import ChatBot
# Create a new chat bot named Test
chatBot = ChatBot(
    'Test',
    trainer='chatterbot.trainers.ListTrainer'
)
chatBot.train([
    "Koliko imam še dopusta?",
    "Letos imate še 19 dni dopusta.",
])
Python Terminal
>>> # -*- coding: utf-8 -*-
... from chatterbot import ChatBot
>>>
>>> # Create a new chat bot named Test
... chatBot = ChatBot(
...     'Test',
...     trainer='chatterbot.trainers.ListTrainer'
... )
>>>
>>> chatBot.train([
...     "Koliko imam še dopusta?",
...     "Letos imate še 19 dni dopusta.",
... ])
List Trainer: [####################] 100%
>>>

Reading from COM port into a text file in python

I have a list of commands saved in a text file ('command.log') which I want to run against a unit connected to 'COM5', saving the response to each command in a text file ('output.log'). The script gets stuck on the first command and I could not get it to run the remaining commands. Any help will be appreciated.
import serial
def cu():
    ser = serial.Serial(
        port='COM5',
        timeout=None,
        baudrate=115200,
        parity='N',
        stopbits=1,
        bytesize=8
    )
    ser.flushInput()
    ser.flushOutput()
    ## ser.write('/sl/app_config/status \r \n') #sample command
    fw = open('output.log','w')
    with open('data\command.log') as f:
        for line in f:
            ser.write(line + '\r\n')
            out = ''
            while out != '/>':
                out += ser.readline()
                fw.write(out)
                print(out)
    fw.close()
    ser.close()
    print "Finished ... "
cu()
The bytes problem
First of all, you're misusing the Serial.readline method: in Python 3 it returns a bytes object, and you treat it as a str object by doing out += ser.readline(), so a TypeError will be raised. Instead, write out += str(ser.readline(), 'utf-8'), which first converts the bytes into a str.
How to check when the transmission has ended?
Now, the problem lies in the out != '/>' condition: I think you want to test whether the message sent by the device is finished, and this message ends with '/>'. But in the while loop you do out += [...], so at the end of the message out looks like '<here the message>/>', which is totally different from '/>'. However, you're lucky: there is the str.endswith method! So, replace while out != '/>' with while not out.endswith('/>').
WWhatWhat'sWhat's theWhat's the f*** ?
Also, in your loop, you write the whole message received so far on every iteration until it is finished. This will give you, in output.log, something like <<me<mess<messag<messag<message>/>. Instead, I think you want to write only the newly received characters. This can be achieved using a temporary variable.
Another issue
Also, you're using the Serial.readline method: according to its docstring,
The line terminator is always b'\n'
This is not compatible with your code: you want out to end with "/>", so you should use Serial.read instead, which by default returns a single received byte per call.
Haaaa... the end ! \o/
Finally, your while loop will look as follows:
# Loop until the end-of-message marker has been received
while not out.endswith('/>'):
    # Read the next received byte and decode it
    # (assumes the device sends UTF-8/ASCII text)
    received_chars = str(ser.read(), 'utf-8')
    # Add it to `out`
    out += received_chars
    # Log it
    fw.write(received_chars)
    # Print it without a trailing newline, which is more user-friendly
    print(received_chars, end='')
# Finally, print a new line for clarity
print()
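The same loop can be tried without any hardware by wrapping it in a function and feeding it a stand-in for the serial port. The sketch below is only an illustration: read_until_prompt is a hypothetical helper, and io.BytesIO replaces the real serial.Serial object (both expose a read(size) method that returns bytes). Working on bytes and decoding only afterwards avoids splitting a multi-byte character in the middle.

```python
import io


def read_until_prompt(port, prompt=b"/>"):
    """Read one byte at a time until the data ends with `prompt`."""
    buffer = bytearray()
    while not buffer.endswith(prompt):
        chunk = port.read(1)
        if not chunk:
            # Timeout or end of stream: stop instead of looping forever.
            break
        buffer += chunk
    return bytes(buffer)


# Stand-in for the device: two responses, each terminated by the prompt.
fake_port = io.BytesIO(b"status: ok\r\n/>next response/>")

first = read_until_prompt(fake_port)
second = read_until_prompt(fake_port)
```

With a real port, the empty-read check only triggers if the Serial object was opened with a timeout; with timeout=None, as in the question, read blocks until a byte arrives.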

Unable to write to file using Python 2.7

I have written the following code. I am able to print the parsed values of lat and lon, but I am unable to write them to a file. I tried flush and I also tried closing the file, but to no avail. Can somebody point out what's wrong here?
import os
import serial
def get_present_gps():
    ser=serial.Serial('/dev/ttyUSB0',4800)
    ser.open()
    # open a file to write gps data
    f = open('/home/iiith/Desktop/gps1.txt', 'w')
    data=ser.read(1024) # read 1024 bytes
    f.write(data) #write data into file
    f = open('/home/iiith/Desktop/gps1.txt', 'r')# fetch the required file
    f1 = open('/home/iiith/Desktop/gps2.txt', 'a+')
    for line in f.read().split('\n'):
        if line.startswith('$GPGGA'):
            try:
                lat, _, lon= line.split(',')[2:5]
                lat=float(lat)
                lon=float(lon)
                print lat/100
                print lon/100
                a=[lat,lon]
                f1.write(lat+",")
                f1.flush()
                f1.write(lon+"\n")
                f1.flush()
                f1.close()
            except:
                pass
while True:
    get_present_gps()
You're covering the error up by using except: pass. Don't do that... ever. At least log the exception.
One error it definitely hides is lat+",", which fails with a TypeError because lat is a float and adding a float to a str is not supported. But there may be more.
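A minimal sketch of both points, written in Python 3 syntax: catch only the exception you expect, log it, and turn the floats into text before writing. The helper names parse_gpgga and format_record and the sample sentence are illustrative and not part of the original program.

```python
import logging


def parse_gpgga(line):
    """Return (lat, lon) from a $GPGGA sentence, scaled as in the question."""
    lat, _, lon = line.split(",")[2:5]
    # Dividing by 100 mirrors the question's code; it is not a full
    # conversion from NMEA's ddmm.mmmm format to decimal degrees.
    return float(lat) / 100, float(lon) / 100


def format_record(lat, lon):
    # str.format converts the floats to text, avoiding float + str.
    return "{},{}\n".format(lat, lon)


sample = "$GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,46.9,M,,*47"

records = []
for line in [sample, "$GPGGA,broken"]:
    try:
        lat, lon = parse_gpgga(line)
    except ValueError:
        # Log the failure with its traceback instead of hiding it.
        logging.exception("Could not parse line: %r", line)
        continue
    records.append(format_record(lat, lon))
```

Each entry in records can then be passed to f1.write(), and the file should be closed once after the loop rather than inside it.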