how to get python to recognize the ® symbol [duplicate] - python-2.7

This question already has answers here:
Python to show special characters
(3 answers)
Closed 4 years ago.
Hi there. I am trying to make Python recognize ® as a symbol (in case it doesn't render well here, it is the capital R within a circle, known as the 'registered' symbol).
I understand that Python does not recognize it because it is outside ASCII. However, I was wondering if anyone knows of a way to use a different encoding that includes this symbol, or a method to make Python 'ignore' it.
For some context:
I am trying to make an auto checkout program for a website, so my program needs to match the item that the user wants. To do this I am using BeautifulSoup to scrape information; however, the symbol '®' appears in the names of a few of the items, causing Python to crash.
Here is the current command that I am using but is not working due to ASCII:
for colour in soup.find_all('a', attrs={"class":"name-link"}, href=True):
    CnI.append(str(colour.text))
    Uhrefs.append(str(colour.get('href')))
Any help would be appreciated
Here is the entirety of the program so far(ignore the mess nowhere near done):
import time
import webbrowser
from selenium import webdriver
import mechanize
from bs4 import BeautifulSoup
import urllib2
from selenium.webdriver.support.ui import Select
CnI = []
item = []
colour = []
Uhrefs = []
Whrefs = []
FinalColours = []
selectItemindex = []
selectColourindex = []
#counters
Ccounter = 0
Icounter = 0
Splitcounter = 1
#wanted items suffix options:jackets, shirts, tops_sweaters, sweatshirts, pants, shorts, hats, bags, accessories, skate
suffix = 'accessories'
Wcolour = 'Black'
Witem = '2-Tone Nylon 6-Panel'
driver=webdriver.Chrome()
driver.get('http://www.supremenewyork.com/shop/all/'+suffix)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup)
for colour in soup.find_all('a', attrs={"class":"name-link"}, href=True):
    CnI.append(str(colour.text))
    Uhrefs.append(str(colour.get('href')))
    print(colour)
print('#############')
for each in CnI:
    each.split(',')
    print(each)
while Splitcounter<=len(CnI):
    item.append(CnI[Splitcounter-1])
    FinalColours.append(CnI[Splitcounter])
    Whrefs.append(Uhrefs[Splitcounter])
    Splitcounter+=2
print(Uhrefs)
for each in item:
    print(each)
for z in FinalColours:
    print(z)
for i in Whrefs:
    print(i)
##for i in item:
## hold = item.index(i)
## print(hold)
## if Witem == i and Wcolour == FinalColours[i]:
## print('correct')
##
##
for count,elem in enumerate(item):
    if Witem in elem:
        selectItemindex.append(count+1)
for count,elem in enumerate(FinalColours):
    if Wcolour in elem:
        selectColourindex.append(count+1)
print(selectColourindex)
print(selectItemindex)
for each in selectColourindex:
    if selectColourindex[Ccounter] in selectItemindex:
        point = selectColourindex[Ccounter]
        print(point)
    else:
        Ccounter+=1
web = 'http://www.supremenewyork.com'+Whrefs[point-1]
driver.get(web)
elem1 = driver.find_element_by_name('commit')
elem1.click()
time.sleep(1)
elem2 = driver.find_element_by_link_text('view/edit basket')
elem2.click()
time.sleep(1)
elem3 = driver.find_element_by_link_text('checkout now')
elem3.click()

"®" is not a character but a unicode codepoint so if you're using Python2, your code will never work. Instead of using str(), use something like this:
unicode(input_string, 'utf8')
# or
unicode(input_string, 'unicode-escape')
Edit: Given the code surrounding the initial snippet that was posted later, and the fact that BeautifulSoup already returns unicode, removing str() seems to be the best course of action, and @MarkTolonen's answer is spot-on.

BeautifulSoup returns Unicode strings. Stop converting them back to byte strings. Best practice when dealing with text is to:
1. Decode incoming text to Unicode (what BeautifulSoup is doing).
2. Process all text using Unicode.
3. Encode outgoing Unicode text to bytes (to file, to database, to sockets, etc.).
Small example of your issue:
text = u'\N{REGISTERED SIGN}' # syntax to create a Unicode codepoint by name.
bytes = str(text)
Output:
Traceback (most recent call last):
  File "test.py", line 2, in <module>
    bytes = str(text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 0: ordinal not in range(128)
Note the first line works and supports the character. Converting it to a byte string fails because str() defaults to encoding in ASCII. You can explicitly encode it with another encoding (e.g. bytes = text.encode('utf8')), but that breaks rule 2 above and creates other issues.
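The decode, process, encode steps above can be sketched as follows. This is only an illustration: the byte string is a made-up stand-in for what a page might contain, and the variable names are hypothetical. It is written to run on both Python 2.7 and Python 3.

```python
# Hypothetical UTF-8 bytes as they might arrive from a web page.
raw = b'Supreme\xc2\xae/The North Face\xc2\xae Jacket'

# 1. Decode incoming bytes to Unicode at the boundary.
text = raw.decode('utf-8')

# 2. Process as Unicode: searches and replacements work on characters.
has_mark = u'\N{REGISTERED SIGN}' in text
cleaned = text.replace(u'\N{REGISTERED SIGN}', u'')

# 3. Encode only when the text leaves the program (file, socket, database).
outgoing = text.encode('utf-8')
```

Nothing in the middle step calls str(), so the ASCII codec never gets involved.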
Suggested reading:
https://nedbatchelder.com/text/unipain.html
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Related

Python: replacing unusual characters in a text file

I am trying to do the following changes/substitutions automatically, in a text file.
â€\u9d = "
“ = "
’ = '
— = :
I consistently run into the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 452: character maps to <undefined>
Here's my recent code:
fin = open("example.md", "rt")
data = fin.read()
data = data.replace(r'â€\u9d', '\"')
data = data.replace(r'“', '\"')
data = data.replace(r'’', '\"')
data = data.replace(r'—', ':')
fin.close()
fin = open("data.txt", "wt")
fin.write(data)
fin.close()
According to this question, you can use re.sub, as shown below. It removes every character other than ASCII letters, digits, spaces, newlines, and periods:
import re
my_str = "hey th~!ere"
my_new_string = re.sub(r'[^a-zA-Z0-9 \n.]', '', my_str)
print(my_new_string)
I tested it and it works.
You have two problems. First is that you're opening the file with the wrong encoding, leading to a case of mojibake as suggested by @JosefZ in the comments. The solution is exactly as he suggested:
fin = open("example.md", "rt", encoding="utf-8")
The second problem is that you're using a very ham-fisted way of correcting the first problem. You may find that once you read the characters correctly there's no need to fix them. But if you still need to convert curly quotes to straight ones so that everything's compatible with ASCII, there's a much easier way to do that with the unidecode module.
from unidecode import unidecode
data = unidecode(data)
This will take care of all the characters listed in your question, and more besides.
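If installing unidecode is not an option, a standard-library sketch along the same lines is to read the file as UTF-8 and map only the characters you care about with translate(). The sample text and the temporary file are made up for the example, and the mapping covers just the characters from the question, so it is far less complete than unidecode.

```python
import io
import os
import tempfile

# Hypothetical sample: UTF-8 text with curly quotes and an em dash.
sample = u'\u201cHello\u201d \u2014 it\u2019s fine'
path = os.path.join(tempfile.mkdtemp(), 'example.md')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(sample)

# Read with the correct encoding so no mojibake is produced.
with io.open(path, 'r', encoding='utf-8') as f:
    data = f.read()

# Map code points to the ASCII replacements listed in the question.
replacements = {
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
    0x2019: u"'",   # right single quotation mark
    0x2014: u':',   # em dash
}
ascii_data = data.translate(replacements)
```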

PyYAML shows "ScannerError: mapping values are not allowed here" in my unittest

I am trying to test a number of Python 2.7 classes using unittest.
Here is the exception:
ScannerError: mapping values are not allowed here
in "<unicode string>", line 3, column 32:
... file1_with_path: '../../testdata/concat1.csv'
Here is the example the error message relates to:
class TestConcatTransform(unittest.TestCase):
    def setUp(self):
        filename1 = os.path.dirname(os.path.realpath(__file__)) + '/../../testdata/concat1.pkl'
        self.df1 = pd.read_pickle(filename1)
        filename2 = os.path.dirname(os.path.realpath(__file__)) + '/../../testdata/concat2.pkl'
        self.df2 = pd.read_pickle(filename2)
        self.yamlconfig = u'''
        --- !ConcatTransform
        file1_with_path: '../../testdata/concat1.csv'
        file2_with_path: '../../testdata/concat2.csv'
        skip_header_lines: [0]
        duplicates: ['%allcolumns']
        outtype: 'dataframe'
        client: 'testdata'
        addcolumn: []
        '''
        self.testconcat = yaml.load(self.yamlconfig)
What is the problem?
Something not clear to me is that the directory structure I have is:
app
app/etl
app/tests
The ConcatTransform is in app/etl/concattransform.py and TestConcatTransform is in app/tests. I import ConcatTransform into the TestConcatTransform unittest with this import:
from app.etl import concattransform
How does PyYAML associate that class with the one defined in yamlconfig?
A YAML document can start with a document start marker ---, but that has to be at the beginning of a line, and yours is indented eight positions on the second line of the input. That causes the --- to be interpreted as the beginning of a multi-line plain (i.e. non-quoted) scalar, and within such a scalar you cannot have a : (colon + space). You can only have : in quoted scalars. And if your document does not have a mapping or sequence at the root level, as yours doesn't, the whole document can only consist of a single scalar.
If you want to keep your sources nicely indented like you have now, I recommend you use dedent from textwrap.
The following runs without error:
import ruamel.yaml
from textwrap import dedent
yaml_config = dedent(u'''\
    --- !ConcatTransform
    file1_with_path: '../../testdata/concat1.csv'
    file2_with_path: '../../testdata/concat2.csv'
    skip_header_lines: [0]
    duplicates: ['%allcolumns']
    outtype: 'dataframe'
    client: 'testdata'
    addcolumn: []
''')
yaml = ruamel.yaml.YAML()
data = yaml.load(yaml_config)
You should get into the habit of putting a backslash (\) directly after your opening triple quotes, so that your YAML document starts on the first line of the string. Had you done that, the error would have indicated line 2, because the document would no longer start with an empty line.
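To see what dedent and the trailing backslash change, here is a small standard-library sketch that only inspects the strings and does not load any YAML. The two-line document is a shortened, hypothetical version of the one above.

```python
from textwrap import dedent

# As in the question: the string starts with an empty line and every
# following line carries the indentation of the surrounding code.
indented = u'''
        --- !ConcatTransform
        client: 'testdata'
        '''

# With the backslash and dedent: no leading empty line, no indentation.
dedented = dedent(u'''\
    --- !ConcatTransform
    client: 'testdata'
''')

first_indented = indented.splitlines()[1]   # line holding the marker
first_dedented = dedented.splitlines()[0]
```

In the first string the --- marker sits after leading spaces, which is what keeps the parser from treating it as a document start marker; in the second it is at the beginning of the first line.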
During loading, the YAML parser encounters the tag !ConcatTransform. A constructor for an object is probably registered with the PyYAML loader during the import, associating that tag with the class by means of PyYAML's add_constructor.
Unfortunately, they registered their constructor with the default, non-safe loader. That is not necessary: they could have registered it with the SafeLoader and thereby not forced users to risk problems with uncontrolled input.

Django encoding error when reading from a CSV

When I try to run:
import csv
with open('data.csv', 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        pgd = Player.objects.get_or_create(
            player_name=row['Player'],
            team=row['Team'],
            position=row['Position']
        )
Most of my data gets created in the database, except for one particular row. When my script reaches the row, I receive the error:
ProgrammingError: You must not use 8-bit bytestrings unless you use a
text_factory that can interpret 8-bit bytestrings (like text_factory = str).
It is highly recommended that you instead just switch your application to Unicode strings.
The particular row in the CSV that causes this error is:
>>> row
{'FR\xed\x8aD\xed\x8aRIC.ST-DENIS', 'BOS', 'G'}
I've looked at the other similar Stackoverflow threads with the same or similar issues, but most aren't specific to using Sqlite with Django. Any advice?
If it matters, I'm running the script by going into the Django shell by calling python manage.py shell, and copy-pasting it in, as opposed to just calling the script from the command line.
This is the stacktrace I get:
Traceback (most recent call last):
  File "<console>", line 4, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 108, in next
    row = self.reader.next()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 302, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 1674: invalid continuation byte
EDIT: I decided to just manually import this entry into my database, rather than try to read it from my CSV, based on Alastair McCormack's feedback
Based on the output from your question, it looks like the person who made the CSV mojibaked it - it doesn't seem to represent FRÉDÉRIC.ST-DENIS. You can try using windows-1252 instead of utf-8 but I think you'll end up with FRíŠDíŠRIC.ST-DENIS in your database.
I suspect you're using Python 2 - open() returns str which are simply byte strings.
The error is telling you that you need to decode your text to Unicode string before use.
The simplest method is to decode each cell:
with open('data.csv', 'r') as csvfile:  # 'U' means Universal line mode and is not necessary
    reader = csv.DictReader(csvfile)
    for row in reader:
        pgd = Player.objects.get_or_create(
            player_name=row['Player'].decode('utf-8'),
            team=row['Team'].decode('utf-8'),
            position=row['Position'].decode('utf-8')
        )
That'll work, but it's ugly to add decodes everywhere, and it won't work in Python 3. Python 3 improves things by opening files in text mode and returning Python 3 strings, which are the equivalent of Unicode strings in Python 2.
To get the same functionality in Python 2, use the io module. This gives you an open() function which has an encoding option. Annoyingly, the Python 2.x csv module is broken with Unicode, so you need to install a backported version:
pip install backports.csv
To tidy your code and future proof it, do:
import io
from backports import csv
with io.open('data.csv', 'r', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        # now every row is automatically decoded from UTF-8
        pgd = Player.objects.get_or_create(
            player_name=row['Player'],
            team=row['Team'],
            position=row['Position']
        )
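For comparison, on Python 3 the standard csv module already works with text, so no backport is needed. The sketch below assumes Python 3 and a correctly encoded UTF-8 file; the temporary file and its single row are invented for the example, and the Django call is left out so that it runs on its own.

```python
import csv
import os
import tempfile

# Write a small, hypothetical UTF-8 CSV file to read back.
path = os.path.join(tempfile.mkdtemp(), 'data.csv')
with open(path, 'w', encoding='utf-8', newline='') as f:
    f.write('Player,Team,Position\r\nFR\xc9D\xc9RIC.ST-DENIS,BOS,G\r\n')

# Text mode with an explicit encoding: every cell arrives as str (Unicode).
with open(path, 'r', encoding='utf-8', newline='') as csvfile:
    rows = list(csv.DictReader(csvfile))
```

Each row could then be passed to get_or_create() as in the code above, without any per-cell decode calls.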
Encode the player name as UTF-8 by calling .encode('utf-8') on it:
import csv
with open('data.csv', 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        pgd = Player.objects.get_or_create(
            player_name=row['Player'].encode('utf-8'),
            team=row['Team'],
            position=row['Position']
        )
In Django, decode with latin-1: csv.DictReader(io.StringIO(csv_file.read().decode('latin-1'))). Latin-1 maps every byte to a character, so this avoids the decoding exceptions you get with utf-8, although characters that were stored in another encoding will come out garbled.
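A quick way to see the trade-off with latin-1 is to decode the same bytes both ways. The name below is a made-up example encoded as UTF-8; the sketch runs on Python 2.7 and Python 3.

```python
# Bytes as a UTF-8 encoded file would store the name.
raw = u'FR\xc9D\xc9RIC'.encode('utf-8')

# latin-1 never raises, because every byte value maps to a code point...
as_latin1 = raw.decode('latin-1')

# ...but only the real encoding gives back the original characters.
as_utf8 = raw.decode('utf-8')
```

Here as_latin1 has ten characters instead of eight, because each two-byte UTF-8 sequence has been turned into two separate characters.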

unicodecsv.DictReader not working with io.StringIO (Python 2.7)

I was trying to use csv.DictReader to parse UTF-8 data with special characters but I was getting the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 2: ordinal not in range(128)
I read online and found out that Python 2.7's csv library doesn't handle Unicode. I looked for an alternative library and found unicodecsv.
I replaced csv with unicodecsv but I get the same error. Here's a simplified version of my code:
from io import StringIO
from unicodecsv import DictReader, Dialect, QUOTE_MINIMAL
data = (
    'first_name,last_name,email\r'
    'Elmer,Fudd,elmer@looneytunes.com\r'
    'Jo\xc3\xa3o Ant\xc3\xb4nio,Ara\xc3\xbajo,joaoantonio@araujo.com\r'
)
unicode_data = StringIO(unicode(data, 'utf-8-sig'), newline=None)
class CustomDialect(Dialect):
    delimiter = ','
    doublequote = True
    escapechar = '\\'
    lineterminator = '\r\n'
    quotechar = '"'
    quoting = QUOTE_MINIMAL
    skipinitialspace = True
rows = DictReader(unicode_data, dialect=CustomDialect)
for row in rows:
    print row
If I replace StringIO with BytesIO, the encoding works but I can't pass the newline argument anymore and then I get:
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Does anybody have any idea how I could solve this? Shouldn't unicodecsv be handling StringIO? Thanks
I opened an issue in the unicodecsv github page and it turns out (a bit counterintuitively imo) that the unicodecsv reader expects a bytestring and not a unicode object.
After taking some time to make this whole thing with Unicode and encodings clearer in my head, it turns out I didn't really need unicodecsv in the first place. After all, the initial problem is that io.StringIO, when iterated with .next(), was returning unicode objects to the csv.DictReader, which expected bytestrings. So if unicodecsv also expects bytestrings it obviously can't solve the problem.
My solution was changing the file-like object I was passing to the csv.DictReader so that it returned properly encoded bytestrings instead of unicode objects:
class UTF8EncodedStringIO(StringIO):
    def next(self):
        return super(UTF8EncodedStringIO, self).next().encode('utf-8')
udata = UTF8EncodedStringIO(unicode(data, 'utf-8-sig'), newline=None)
By writing this simple wrapper around StringIO instead of using BytesIO I could solve the encoding problems and profit from the newline argument. There's a bit of decoding/encoding overhead but I was out of alternatives. If somebody has a better suggestion, feel free to share.
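For anyone reading this on Python 3, where csv works on text, one possible alternative is to decode the bytes once and let splitlines() deal with the bare \r line endings. Note that this swaps the StringIO wrapper for a plain list of lines, and it assumes Python 3 and the default csv dialect rather than the custom one above.

```python
import csv

data = (
    b'first_name,last_name,email\r'
    b'Elmer,Fudd,elmer@looneytunes.com\r'
    b'Jo\xc3\xa3o Ant\xc3\xb4nio,Ara\xc3\xbajo,joaoantonio@araujo.com\r'
)

# Decode once; splitlines() treats \r, \n and \r\n as line boundaries.
lines = data.decode('utf-8-sig').splitlines()

# csv.DictReader accepts any iterable of strings, not only file objects.
rows = list(csv.DictReader(lines))
```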

"UnicodeEncodeError: 'ascii' codec can't encode character"

I'm trying to pass big strings of random html through regular expressions and my Python 2.6 script is choking on this:
UnicodeEncodeError: 'ascii' codec can't encode character
I traced it back to a trademark superscript on the end of this word: Protection™ -- and I expect to encounter others like it in the future.
Is there a module to process non-ascii characters? or, what is the best way to handle/escape non-ascii stuff in python?
Thanks!
Full error:
E
======================================================================
ERROR: test_untitled (__main__.Untitled)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python26\Test2.py", line 26, in test_untitled
    ofile.write(Whois + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 1005: ordinal not in range(128)
Full Script:
from selenium import selenium
import unittest, time, re, csv, logging
class Untitled(unittest.TestCase):
    def setUp(self):
        self.verificationErrors = []
        self.selenium = selenium("localhost", 4444, "*firefox", "http://www.BaseDomain.com/")
        self.selenium.start()
        self.selenium.set_timeout("90000")
    def test_untitled(self):
        sel = self.selenium
        spamReader = csv.reader(open('SubDomainList.csv', 'rb'))
        for row in spamReader:
            sel.open(row[0])
            time.sleep(10)
            Test = sel.get_text("//html/body/div/table/tbody/tr/td/form/div/table/tbody/tr[7]/td")
            Test = Test.replace(",","")
            Test = Test.replace("\n", "")
            ofile = open('TestOut.csv', 'ab')
            ofile.write(Test + '\n')
            ofile.close()
    def tearDown(self):
        self.selenium.stop()
        self.assertEqual([], self.verificationErrors)
if __name__ == "__main__":
    unittest.main()
You're trying to convert unicode to ascii in "strict" mode:
>>> help(str.encode)
Help on method_descriptor:
encode(...)
S.encode([encoding[,errors]]) -> object
Encodes S using the codec registered for encoding. encoding defaults
to the default encoding. errors may be given to set a different error
handling scheme. Default is 'strict' meaning that encoding errors raise
a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
'xmlcharrefreplace' as well as any other name registered with
codecs.register_error that is able to handle UnicodeEncodeErrors.
You probably want something like one of the following:
s = u'Protection™'
print s.encode('ascii', 'ignore') # removes the ™
print s.encode('ascii', 'replace') # replaces with ?
print s.encode('ascii','xmlcharrefreplace') # turn into xml entities
print s.encode('ascii', 'strict') # throw UnicodeEncodeErrors
You're trying to pass a bytestring to something, but it's impossible (from the scarcity of info you provide) to tell what you're trying to pass it to. You start with a Unicode string that cannot be encoded as ASCII (the default codec), so, you'll have to encode by some different codec (or transliterate it, as @R.Pate suggests) -- but it's impossible for us to say what codec you should use, because we don't know what you're passing the bytestring to and therefore don't know what that unknown subsystem is going to be able to accept and process correctly in terms of codecs.
In such total darkness as you leave us in, utf-8 is a reasonable blind guess (since it's a codec that can represent any Unicode string exactly as a bytestring, and it's the standard codec for many purposes, such as XML) -- but it can't be any more than a blind guess, until and unless you're going to tell us more about what you're trying to pass that bytestring to, and for what purposes.
Passing thestring.encode('utf-8') rather than bare thestring will definitely avoid the particular error you're seeing right now, but it may result in peculiar displays (or whatever it is you're trying to do with that bytestring!) unless the recipient is ready, willing and able to accept utf-8 encoding (and how could WE know, having absolutely zero idea about what the recipient could possibly be?!-)
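If the recipient turns out to be a file, as the ofile.write() call in the traceback suggests, one way to apply the UTF-8 guess is to let the file object do the encoding. This is a sketch under that assumption: the text is a stand-in for the scraped string, and the file is written to a temporary directory. io.open is available on Python 2.6+ and is the same as open on Python 3.

```python
import io
import os
import tempfile

text = u'Protection\u2122'   # stand-in for the scraped text
path = os.path.join(tempfile.mkdtemp(), 'TestOut.csv')

# The file object encodes on write, so the caller only handles Unicode.
with io.open(path, 'a', encoding='utf-8') as ofile:
    ofile.write(text + u'\n')

# Read the raw bytes back to see what was actually stored.
with io.open(path, 'rb') as check:
    written = check.read()
```

Whether UTF-8 is the right choice still depends on whatever reads TestOut.csv afterwards.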
The "best" way always depends on your requirements; so, what are yours? Is ignoring non-ASCII appropriate? Should you replace ™ with "(tm)"? (Which looks fancy for this example, but quickly breaks down for other codepoints—but it may be just what you want.) Could the exception be exactly what you need; now you just need to handle it in some way?
Only you can really answer this question.
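Two of the options mentioned above, sketched side by side. The "(tm)" replacement and the choice of xmlcharrefreplace as a fallback are only examples of what the requirements might call for; the code runs on Python 2.7 and Python 3.

```python
s = u'Protection\u2122'

# Option 1: transliterate the symbols you know about, then require ASCII.
as_tm = s.replace(u'\u2122', u'(tm)').encode('ascii')

# Option 2: treat the exception as the signal and handle it explicitly.
try:
    encoded = s.encode('ascii')
except UnicodeEncodeError:
    # Example fallback: keep the information as an XML character reference.
    encoded = s.encode('ascii', 'xmlcharrefreplace')
```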
First of all, try installing translations for English language (or any other if needed):
sudo apt-get install language-pack-en
which provides translation data updates for all supported packages (including Python).
And make sure you use the right encoding in your code.
For example:
open(foo, encoding='utf-8')
Then double-check your system configuration, such as the value of LANG or the locale configuration (/etc/default/locale), and don't forget to log out and back in so the changes take effect.