Python: replacing unusual characters in a text file

I am trying to do the following changes/substitutions automatically, in a text file.
â€\u9d = "
â€œ = "
â€™ = '
â€” = :
I consistently run into the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 452: character maps to <undefined>
Here's my recent code:
fin = open("example.md", "rt")
data = fin.read()
data = data.replace(r'â€\u9d', '\"')
data = data.replace(r'â€œ', '\"')
data = data.replace(r'â€™', '\"')
data = data.replace(r'â€”', ':')
fin.close()
fin = open("data.txt", "wt")
fin.write(data)
fin.close()

According to this question, you can use re.sub, as shown below:
import re
my_str = "hey th~!ere"
my_new_string = re.sub('[^a-zA-Z0-9 \n\.]', '', my_str)
print my_new_string
I tested it and it works :)

You have two problems. First is that you're opening the file with the wrong encoding, leading to a case of mojibake as suggested by @JosefZ in the comments. The solution is exactly as he suggested:
fin = open("example.md", "rt", encoding="utf-8")
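To see how the wrong encoding produces both the garbled text and the error, the following Python 3 sketch reproduces them. It assumes the platform default encoding was cp1252 (a common default on Western Windows systems and one of the "charmap" codecs); the question does not say which codec was in use, so treat that as a guess.

```python
# Curly quotes as they are stored in a UTF-8 file.
left_bytes = "\u201c".encode("utf-8")   # b'\xe2\x80\x9c'
right_bytes = "\u201d".encode("utf-8")  # b'\xe2\x80\x9d'

# Decoding the left quote with the wrong codec "works" but yields mojibake.
mojibake = left_bytes.decode("cp1252")  # 'â€œ'

# The right quote contains byte 0x9d, which cp1252 leaves undefined,
# so decoding it raises the UnicodeDecodeError from the question.
try:
    right_bytes.decode("cp1252")
    failed = False
except UnicodeDecodeError:
    failed = True

# Decoding with the correct codec recovers the original character.
recovered = right_bytes.decode("utf-8")
```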
The second problem is that you're using a very ham-fisted way of correcting the first problem. You may find that once you read the characters correctly there's no need to fix them. But if you still need to convert curly quotes to straight ones so that everything's compatible with ASCII, there's a much easier way to do that with the unidecode module.
from unidecode import unidecode
data = unidecode(data)
This will take care of all the characters listed in your question, and more besides.
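If installing unidecode is not an option, a minimal standard-library sketch of the same clean-up might look like this (Python 3). The mapping table and the function names are illustrative choices, and the table only covers the characters listed in the question.

```python
# Map typographic punctuation to the ASCII replacements from the question.
PUNCTUATION_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u2014": ":",  # em dash (the question replaces it with a colon)
})


def straighten(text):
    """Return text with the mapped characters replaced."""
    return text.translate(PUNCTUATION_MAP)


def convert_file(src, dst):
    """Read src as UTF-8, straighten the punctuation, write dst as UTF-8."""
    with open(src, "rt", encoding="utf-8") as fin:
        data = fin.read()
    with open(dst, "wt", encoding="utf-8") as fout:
        fout.write(straighten(data))
```

With the file names from the question, this would be called as convert_file("example.md", "data.txt").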

Related

how to get python to recognize the ® symbol [duplicate]

Hi there, I am trying to make Python recognize ® as a symbol (in case it doesn't show up well here, it is the capital R within a circle, known as the 'registered' symbol).
I understand that it is not recognized in Python because of ASCII; however, I was wondering if anyone knows of a way to use a different encoding that includes this symbol, or a method to make Python 'ignore' it.
For some context:
I am trying to make an auto checkout program for a website, so my program needs to match the item that the user wants. To do this I am using BeautifulSoup to scrape information; however, the symbol '®' appears in the names of a few of the items, causing Python to crash.
Here is the code that I am currently using, which is not working due to ASCII:
for colour in soup.find_all('a', attrs={"class":"name-link"}, href=True):
    CnI.append(str(colour.text))
    Uhrefs.append(str(colour.get('href')))
Any help would be appreciated
Here is the entirety of the program so far (ignore the mess, it is nowhere near done):
import time
import webbrowser
from selenium import webdriver
import mechanize
from bs4 import BeautifulSoup
import urllib2
from selenium.webdriver.support.ui import Select
CnI = []
item = []
colour = []
Uhrefs = []
Whrefs = []
FinalColours = []
selectItemindex = []
selectColourindex = []
#counters
Ccounter = 0
Icounter = 0
Splitcounter = 1
#wanted items suffix options:jackets, shirts, tops_sweaters, sweatshirts, pants, shorts, hats, bags, accessories, skate
suffix = 'accessories'
Wcolour = 'Black'
Witem = '2-Tone Nylon 6-Panel'
driver=webdriver.Chrome()
driver.get('http://www.supremenewyork.com/shop/all/'+suffix)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup)
for colour in soup.find_all('a', attrs={"class":"name-link"}, href=True):
    CnI.append(str(colour.text))
    Uhrefs.append(str(colour.get('href')))
    print(colour)
print('#############')
for each in CnI:
    each.split(',')
    print(each)
while Splitcounter<=len(CnI):
    item.append(CnI[Splitcounter-1])
    FinalColours.append(CnI[Splitcounter])
    Whrefs.append(Uhrefs[Splitcounter])
    Splitcounter+=2
print(Uhrefs)
for each in item:
    print(each)
for z in FinalColours:
    print(z)
for i in Whrefs:
    print(i)
##for i in item:
## hold = item.index(i)
## print(hold)
## if Witem == i and Wcolour == FinalColours[i]:
## print('correct')
##
##
for count,elem in enumerate(item):
    if Witem in elem:
        selectItemindex.append(count+1)
for count,elem in enumerate(FinalColours):
    if Wcolour in elem:
        selectColourindex.append(count+1)
print(selectColourindex)
print(selectItemindex)
for each in selectColourindex:
    if selectColourindex[Ccounter] in selectItemindex:
        point = selectColourindex[Ccounter]
        print(point)
    else:
        Ccounter+=1
web = 'http://www.supremenewyork.com'+Whrefs[point-1]
driver.get(web)
elem1 = driver.find_element_by_name('commit')
elem1.click()
time.sleep(1)
elem2 = driver.find_element_by_link_text('view/edit basket')
elem2.click()
time.sleep(1)
elem3 = driver.find_element_by_link_text('checkout now')
elem3.click()

"®" is not a character but a unicode codepoint so if you're using Python2, your code will never work. Instead of using str(), use something like this:
unicode(input_string, 'utf8')
# or
unicode(input_string, 'unicode-escape')
Edit: Given the code surrounding the initial snippet, which was posted later, and the fact that BeautifulSoup actually returns unicode already, it seems that removing str() might be the best course of action, and @MarkTolonen's answer is spot-on.

BeautifulSoup returns Unicode strings. Stop converting them back to byte strings. Best practice when dealing with text is to:
1. Decode incoming text to Unicode (what BeautifulSoup is doing).
2. Process all text using Unicode.
3. Encode outgoing Unicode text to bytes (to file, to database, to sockets, etc.).
Small example of your issue:
text = u'\N{REGISTERED SIGN}' # syntax to create a Unicode codepoint by name.
bytes = str(text)
Output:
Traceback (most recent call last):
File "test.py", line 2, in <module>
bytes = str(text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 0: ordinal not in range(128)
Note the first line works and supports the character. Converting it to a byte string fails because it defaults to encoding in ASCII. You can explicitly encode it with another encoding (e.g. bytes = text.encode('utf8')), but that breaks rule 2 above and creates other issues.
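The decode, process, encode pattern described above can be sketched as follows. This is a Python 3 illustration rather than the Python 2 code from the question; the sample item name and the shout function are invented for the example.

```python
import io


def shout(text):
    """Stand-in for real processing; operates on Unicode text only."""
    return text.upper()


# Bytes as they might arrive from a socket or file (sample data).
incoming = "Gore-Tex\u00ae jacket".encode("utf-8")

# 1. Decode incoming bytes to Unicode at the boundary.
text = incoming.decode("utf-8")

# 2. Process everything as Unicode; the registered sign needs no special care.
processed = shout(text)

# 3. Encode to bytes only when the text leaves the program.
outgoing = io.BytesIO()
outgoing.write(processed.encode("utf-8"))
```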
Suggested reading:
https://nedbatchelder.com/text/unipain.html
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

3D Drawing from a file in an extra directory [duplicate]

I'm trying to get a data parsing script up and running. It works as far as the data manipulation is concerned. What I'm trying to do is set this up so I can pass multiple user-defined CSVs with a single command.
e.g.
> python script.py One.csv Two.csv Three.csv
If you have any advice on how to automate the naming of the output CSV so that if input = test.csv, output = test1.csv, I'd appreciate that as well.
Getting
TypeError: coercing to Unicode: need string or buffer, list found
for the line
for line in csv.reader(open(args.infile)):
My code:
import csv
import pprint
pp = pprint.PrettyPrinter(indent=4)
res = []
import argparse
parser = argparse.ArgumentParser()
#parser.add_argument("infile", nargs="*", type=str)
#args = parser.parse_args()
parser.add_argument ("infile", metavar="CSV", nargs="+", type=str, help="data file")
args = parser.parse_args()
with open("out.csv","wb") as f:
    output = csv.writer(f)
    for line in csv.reader(open(args.infile)):
        for item in line[2:]:
            #to skip empty cells
            if not item.strip():
                continue
            item = item.split(":")
            item[1] = item[1].rstrip("%")
            print([line[1]+item[0],item[1]])
            res.append([line[1]+item[0],item[1]])
            output.writerow([line[1]+item[0],item[1].rstrip("%")])
I don't really understand what is going on with the error. Can someone explain this in layman's terms?
Bear in mind I am new to programming/python as a whole and am basically learning alone, so if possible could you explain what is going wrong/how to fix it so I can note it for future reference.

args.infile is a list of filenames, not one filename. Loop over it:
import os
for filename in args.infile:
    base, ext = os.path.splitext(filename)
    with open("{}1{}".format(base, ext), "wb") as outf, open(filename, 'rb') as inf:
        output = csv.writer(outf)
        for line in csv.reader(inf):
            # ... the rest of your per-line processing goes here
Here I used os.path.splitext() to split extension and base filename so you can generate a new output filename adding 1 to the base.
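Putting the pieces together, a Python 3 sketch of the whole script could look like the following. It is an adaptation, not the original code: Python 3's csv module expects text-mode files opened with newline="" rather than the "wb"/"rb" modes used above, str.partition is swapped in for split(":") so a cell without a colon does not raise, and the helper names output_name, convert and main are made up for illustration.

```python
import argparse
import csv
import os


def output_name(filename):
    """Derive the output name, e.g. test.csv -> test1.csv."""
    base, ext = os.path.splitext(filename)
    return "{}1{}".format(base, ext)


def convert(in_path, out_path):
    """Write one output row per non-empty 'name:value%' cell."""
    with open(in_path, newline="") as inf, \
            open(out_path, "w", newline="") as outf:
        writer = csv.writer(outf)
        for line in csv.reader(inf):
            for item in line[2:]:
                # Skip empty cells.
                if not item.strip():
                    continue
                name, _, value = item.partition(":")
                writer.writerow([line[1] + name, value.rstrip("%")])


def main(argv=None):
    """Process every CSV named on the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("infile", metavar="CSV", nargs="+", help="data file")
    args = parser.parse_args(argv)
    # args.infile is a list, so handle the files one at a time.
    for filename in args.infile:
        convert(filename, output_name(filename))
```

Calling main() with no arguments reads the file names from the command line, as in python script.py One.csv Two.csv Three.csv.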

If you specify nargs="+" (or "*", or an integer) in .add_argument, the argument will always be returned as a list.
Assuming you want to deal with all of the files specified, loop through that list:
for filename in args.infile:
    for line in csv.reader(open(filename)):
        for item in line[2:]:
            #to skip empty cells
            [...]
Or, if you really just want to be able to specify a single file, get rid of nargs="+".
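A small Python 3 sketch of the difference, using two throwaway parsers and passing the argument lists explicitly instead of reading the real command line:

```python
import argparse

# With nargs="+", the value is a list, even for one file name.
many_parser = argparse.ArgumentParser()
many_parser.add_argument("infile", nargs="+")
many_args = many_parser.parse_args(["One.csv", "Two.csv"])

# Without nargs, the value is a single string that open() accepts directly.
single_parser = argparse.ArgumentParser()
single_parser.add_argument("infile")
single_args = single_parser.parse_args(["One.csv"])
```

Passing many_args.infile straight to open() is what triggers the "need string or buffer, list found" error in the question.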

unicodecsv.DictReader not working with io.StringIO (Python 2.7)

I was trying to use csv.DictReader to parse UTF-8 data with special characters but I was getting the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 2: ordinal not in range(128)
I read online and found out that Python 2.7's csv library doesn't handle Unicode. I looked for an alternative library and found unicodecsv.
I replaced csv with unicodecsv but I get the same error. Here's a simplified version of my code:
from io import StringIO
from unicodecsv import DictReader, Dialect, QUOTE_MINIMAL
data = (
    'first_name,last_name,email\r'
    'Elmer,Fudd,elmer@looneytunes.com\r'
    'Jo\xc3\xa3o Ant\xc3\xb4nio,Ara\xc3\xbajo,joaoantonio@araujo.com\r'
)
unicode_data = StringIO(unicode(data, 'utf-8-sig'), newline=None)
class CustomDialect(Dialect):
    delimiter = ','
    doublequote = True
    escapechar = '\\'
    lineterminator = '\r\n'
    quotechar = '"'
    quoting = QUOTE_MINIMAL
    skipinitialspace = True
rows = DictReader(unicode_data, dialect=CustomDialect)
for row in rows:
    print row
If I replace StringIO with BytesIO, the encoding works but I can't pass the newline argument anymore and then I get:
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Does anybody have any idea how I could solve this? Shouldn't unicodecsv be handling StringIO? Thanks

I opened an issue in the unicodecsv github page and it turns out (a bit counterintuitively imo) that the unicodecsv reader expects a bytestring and not a unicode object.
After taking some time to make this whole thing with Unicode and encodings clearer in my head, it turns out I didn't really need unicodecsv in the first place. After all, the initial problem is that io.StringIO, when iterated with .next(), was returning unicode objects to the csv.DictReader, which expected bytestrings. So if unicodecsv also expects bytestrings it obviously can't solve the problem.
My solution was changing the file-like object I was passing to the csv.DictReader so that it returned properly encoded bytestrings instead of unicode objects:
class UTF8EncodedStringIO(StringIO):
    def next(self):
        return super(UTF8EncodedStringIO, self).next().encode('utf-8')
udata = UTF8EncodedStringIO(unicode(data, 'utf-8-sig'), newline=None)
By writing this simple wrapper around StringIO instead of using BytesIO I could solve the encoding problems and profit from the newline argument. There's a bit of decoding/encoding overhead but I was out of alternatives. If somebody has a better suggestion, feel free to share.
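For comparison, and assuming a move to Python 3 is possible (the thread itself is about Python 2.7), the standard csv module there reads Unicode text directly, so neither unicodecsv nor the wrapper class is needed. A minimal sketch with the same sample data, using the default dialect instead of CustomDialect:

```python
import csv
import io

# The same sample data as bytes, with bare carriage returns as line endings.
raw = (
    b"first_name,last_name,email\r"
    b"Elmer,Fudd,elmer@looneytunes.com\r"
    b"Jo\xc3\xa3o Ant\xc3\xb4nio,Ara\xc3\xbajo,joaoantonio@araujo.com\r"
)

# Decode once, and let newline=None translate "\r" line endings to "\n".
text_stream = io.StringIO(raw.decode("utf-8-sig"), newline=None)

# csv.DictReader consumes the text stream as-is in Python 3.
rows = list(csv.DictReader(text_stream))
```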

c++ macro for saving enum element names and values to file

Normally I try to avoid macros, so I don't actually know how to use them beyond the most basic ones, but I'm trying to do some meta-manipulation, so I assume macros are needed.
I have an enum listing various log entries and their respective id, e.g.
enum LogID
{
LOG_ID_ITEM1=0,
LOG_ID_ITEM2,
LOG_ID_ITEM3=10,
...
}
which is used within my program when writing data to the log file. Note that they will not, in general, be in any order.
I do most of my log file post-processing in Matlab so I'd like to write the same variable names and values to a file for Matlab to load in. e.g., a file looking like
LOG_ID_ITEM1=0;
LOG_ID_ITEM2=1;
LOG_ID_ITEM3=10;
...
I have no idea how to go about doing this, but it seems like it shouldn't be too complicated. If it helps, I am using c++11.
edit:
For clarification, I'm not looking for the macro itself to write the file. I want a way to store the enum element names and values as strings and ints somehow so I can then use a regular c++ function to write everything to file. I'm thinking the macro might then be used to build up the strings and values into vectors? Does that work? If so, how?

I agree with Adam Burry that a separate script is likely best for this. Not sure which languages you're familiar with, but here's a quick Python script that'll do the job:
#!/usr/bin/python
'''Makes a .m file from an enum in a C++ source file.'''
from __future__ import print_function
import sys
import re
def parse_cmd_line():
    '''Gets a filename from the first command line argument.'''
    if len(sys.argv) != 2:
        sys.stderr.write('Usage: enummaker [cppfilename]\n')
        sys.exit(1)
    return sys.argv[1]
def make_m_file(cpp_file, m_file):
    '''Makes an .m file from enumerations in a .cpp file.'''
    in_enum = False
    enum_val = 0
    lines = cpp_file.readlines()
    for line in lines:
        if in_enum:
            # Currently processing an enumeration
            if '}' in line:
                # Encountered a closing brace, so stop
                # processing and reset value counter
                in_enum = False
                enum_val = 0
            else:
                # No closing brace, so process line
                if '=' in line:
                    # If a value is supplied, use it
                    ev_string = re.match(r'[^=]*=(\d+)', line)
                    enum_val = int(ev_string.group(1))
                # Write output line to file
                e_out = re.match(r'[^=\n,]+', line)
                m_file.write(e_out.group(0).strip() + '=' +
                             str(enum_val) + ';\n')
                enum_val += 1
        else:
            # Not currently processing an enum,
            # so check for an enum definition
            enumstart = re.match(r'enum \w+ {', line)
            if enumstart:
                in_enum = True
def main():
    '''Main function.'''
    # Get file names
    cpp_name = parse_cmd_line()
    m_name = cpp_name.replace('cpp', 'm')
    print('Converting ' + cpp_name + ' to ' + m_name + '...')
    # Open the files
    try:
        cpp_file = open(cpp_name, 'r')
    except IOError:
        print("Couldn't open " + cpp_name + ' for reading.')
        sys.exit(1)
    try:
        m_file = open(m_name, 'w')
    except IOError:
        print("Couldn't open " + m_name + ' for writing.')
        sys.exit(1)
    # Translate the cpp file
    make_m_file(cpp_file, m_file)
    # Finish
    print("Done.")
    cpp_file.close()
    m_file.close()
if __name__ == '__main__':
    main()
Running ./enummaker.py testenum.cpp on the following file of that name:
/* Random code here */
enum LogID {
LOG_ID_ITEM1=0,
LOG_ID_ITEM2,
LOG_ID_ITEM3=10,
LOG_ID_ITEM4
};
/* More random code here */
enum Stuff {
STUFF_ONE,
STUFF_TWO,
STUFF_THREE=99,
STUFF_FOUR,
STUFF_FIVE
};
/* Yet more random code here */
produces a file testenum.m containing the following:
LOG_ID_ITEM1=0;
LOG_ID_ITEM2=1;
LOG_ID_ITEM3=10;
LOG_ID_ITEM4=11;
STUFF_ONE=0;
STUFF_TWO=1;
STUFF_THREE=99;
STUFF_FOUR=100;
STUFF_FIVE=101;
This script assumes that the closing brace of an enum block is always on a separate line, that the first identifier is defined on the line following the opening brace, that there are no blank lines between the braces, that enum appears at the start of a line, and that there is no space following the = and the number. Easy enough to modify the script to overcome these limitations. You could have your makefile run this automatically.

Have you considered "going the other way"? It usually makes more sense to maintain your data definitions in a (text) file, then generate a C++ header from it as part of your build process and include that. Python with Mako is a good tool for doing this.
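A rough sketch of that direction in plain Python 3, using string formatting instead of Mako to keep it dependency-free. The NAME=VALUE definitions format and all function names here are assumptions made for the example, not something specified in the thread.

```python
def parse_definitions(text):
    """Parse 'NAME=VALUE' lines (a hypothetical format) into pairs."""
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Ignore blank lines and '#' comments in the definitions file.
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        entries.append((name.strip(), int(value)))
    return entries


def render_cpp_header(enum_name, entries):
    """Render the entries as a C++ enum definition."""
    body = ",\n".join("    {}={}".format(n, v) for n, v in entries)
    return "enum {}\n{{\n{}\n}};\n".format(enum_name, body)


def render_matlab(entries):
    """Render the entries as MATLAB assignments, one per line."""
    return "".join("{}={};\n".format(n, v) for n, v in entries)
```

A build step would read the definitions file once, then write the output of render_cpp_header to a header and the output of render_matlab to a .m file, so both always agree.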

Character Encoding: Why my email receiving code cannot run in PyQt4?

I am finishing a spam classification application as my final project and have run into a problem.
The problem came from a module to receive emails. I wrote the test code in a single .py file and it worked really well. Here is the code:
#!/usr/bin/env python
# coding=utf-8
import poplib
from email import parser
host = 'pop.qq.com'
username = 'xxxxx@qq.com'
password = 'xxxxxxxxxxxxx'
pop_conn = poplib.POP3_SSL(host)
pop_conn.user(username)
pop_conn.pass_(password)
messages = [pop_conn.retr(i) for i in range(1, len(pop_conn.list()[1]) + 1)]
# Concat message pieces:
messages = ["\n".join(mssg[1]) for mssg in messages]
#print messages
messages = [parser.Parser().parsestr(mssg) for mssg in messages]
i = 0
for message in messages:
    i = i + 1
    mailName = "mail"+str(i)
    f = open(mailName + '.log', 'w');
    print >> f, "Date: ", message["Date"]
    print >> f, "From: ", message["From"]
    print >> f, "To: ", message["To"]
    print >> f, "Subject: ", message["Subject"]
    print >> f, "Data: "
    for part in message.walk():
        contentType = part.get_content_type()
        if contentType == 'text/plain' :
            data = part.get_payload(decode=True)
            print >> f, data
    f.close()
pop_conn.quit()
But when I moved exactly the same code into my PyQt4 application, it failed on this line:
messages = ["\n".join(mssg[1]) for mssg in messages]
with this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 4:ordinal not in range(128)
mssg[1] is a list that contains every line of the mail. I guess this is because the text from the mail was encoded by "utf-8" or "gbk" which can't be decoded by the default "ascii". So I tried to write the code like this:
messages = ["\n".join([m.decode("utf-8") for m in mssg[1]]) for mssg in messages]
The error then became:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 7
I used the Python chardet module to detect the encoding of the email text, and it turned out to be "ascii". Now I am really confused. Why can't the same code run in my small application? What is the real problem, and how can I fix it? I would greatly appreciate your help.

I finally solved this problem by receiving the email in a separate .py file and having my application import that file. This may not be useful in other situations, because I didn't actually solve the character encoding problem. While implementing my application I ran into lots of encoding problems, which is quite annoying. In this case, I guess it is caused by some irregular content in my mail (maybe some pictures), which is shown in the following picture:
This was shown when I tried to print some of my email data on the screen. However, I still don't know why this cannot run in my application when it worked well in a simple file. The character encoding problem is very annoying, and maybe I still have a long way to go. :-D
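One way to avoid mixing byte strings and Unicode strings here, assuming the code can run on Python 3 (the original is Python 2), is to keep the retrieved lines as bytes until the email package has parsed them. In Python 3, poplib returns each message line as bytes, so the lines can be joined with a bytes separator and passed to email.message_from_bytes. The sample lines below are invented stand-ins for what pop_conn.retr(i)[1] would return, and parse_retrieved is a hypothetical helper name.

```python
import email


def parse_retrieved(lines):
    """Join raw byte lines from POP3 and parse them without decoding first."""
    return email.message_from_bytes(b"\n".join(lines))


# Invented sample of the byte lines a POP3 server might return.
sample_lines = [
    b"From: sender@example.com",
    b"Subject: test",
    b"Content-Type: text/plain; charset=utf-8",
    b"Content-Transfer-Encoding: 8bit",
    b"",
    "\u4f60\u597d".encode("utf-8"),
]

message = parse_retrieved(sample_lines)

# The payload comes back as bytes; decode it with the declared charset,
# falling back to UTF-8 if the message does not declare one.
payload = message.get_payload(decode=True)
text = payload.decode(message.get_content_charset() or "utf-8")
```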
