unicodecsv.DictReader not working with io.StringIO (Python 2.7) - python-2.7

I was trying to use csv.DictReader to parse UTF-8 data with special characters but I was getting the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 2: ordinal not in range(128)
I read online and found out that Python 2.7's csv library doesn't handle Unicode. I looked for an alternative library and found unicodecsv.
I replaced csv with unicodecsv but I get the same error. Here's a simplified version of my code:
from io import StringIO
from unicodecsv import DictReader, Dialect, QUOTE_MINIMAL
data = (
    'first_name,last_name,email\r'
    'Elmer,Fudd,elmer@looneytunes.com\r'
    'Jo\xc3\xa3o Ant\xc3\xb4nio,Ara\xc3\xbajo,joaoantonio@araujo.com\r'
)
unicode_data = StringIO(unicode(data, 'utf-8-sig'), newline=None)
class CustomDialect(Dialect):
    delimiter = ','
    doublequote = True
    escapechar = '\\'
    lineterminator = '\r\n'
    quotechar = '"'
    quoting = QUOTE_MINIMAL
    skipinitialspace = True
rows = DictReader(unicode_data, dialect=CustomDialect)
for row in rows:
    print row
If I replace StringIO with BytesIO, the encoding works but I can no longer pass the newline argument, and then I get:
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Does anybody have any idea how I could solve this? Shouldn't unicodecsv be handling StringIO? Thanks

I opened an issue in the unicodecsv github page and it turns out (a bit counterintuitively imo) that the unicodecsv reader expects a bytestring and not a unicode object.
After taking some time to make this whole thing with Unicode and encodings clearer in my head, it turns out I didn't really need unicodecsv in the first place. After all, the initial problem is that io.StringIO, when iterated with .next(), was returning unicode objects to the csv.DictReader, which expected bytestrings. So if unicodecsv also expects bytestrings it obviously can't solve the problem.
My solution was changing the file-like object I was passing to the csv.DictReader so that it returned properly encoded bytestrings instead of unicode objects:
class UTF8EncodedStringIO(StringIO):
    def next(self):
        return super(UTF8EncodedStringIO, self).next().encode('utf-8')
udata = UTF8EncodedStringIO(unicode(data, 'utf-8-sig'), newline=None)
By writing this simple wrapper around StringIO instead of using BytesIO I could solve the encoding problems and profit from the newline argument. There's a bit of decoding/encoding overhead but I was out of alternatives. If somebody has a better suggestion, feel free to share.
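A subclass is not the only way to get encoded lines into the reader. As a rough sketch (not the author's code), a small generator can do the encoding, in the style of the recipe in the Python 2 csv documentation. The names utf8_lines and unicode_dict_rows are made up for illustration, the Python 3 branch is only there so the snippet runs on either version, and the Python 2 branch assumes every row has exactly as many fields as the header:

```python
# -*- coding: utf-8 -*-
import csv
import io
import sys

PY2 = sys.version_info[0] == 2


def utf8_lines(text_stream):
    """Yield each line of a text stream as a UTF-8 byte string."""
    for line in text_stream:
        yield line.encode('utf-8')


def unicode_dict_rows(text, **reader_kwargs):
    """Parse CSV text and return a list of dicts with text keys and values."""
    # newline=None turns on universal newlines, so bare '\r' becomes '\n'.
    stream = io.StringIO(text, newline=None)
    if PY2:
        # Python 2's csv module wants byte strings: encode on the way in,
        # decode every key and value on the way out.
        reader = csv.DictReader(utf8_lines(stream), **reader_kwargs)
        return [
            dict((k.decode('utf-8'), v.decode('utf-8')) for k, v in row.items())
            for row in reader
        ]
    # Python 3's csv module works on text directly.
    return [dict(row) for row in csv.DictReader(stream, **reader_kwargs)]


data = u'first_name,last_name\rElmer,Fudd\rJo\xe3o Ant\xf4nio,Ara\xfajo\r'
rows = unicode_dict_rows(data)
```

Compared with the subclass, the generator leaves StringIO untouched, at the cost of decoding each cell again after parsing.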

Related

List to CSV writing is not working in Python 2.7

I have converted my python3 code to python2 using the module 3to2 and now am not able to write to CSV file.
TypeError: must be unicode, not str
Here is the code after the conversion:
def writeCSV(self,filepath,data):
    with open(filepath, u'a', newline=u'') as file:
        writer = csv.writer(file)
        writer.writerow(data)
The data being passed is ['10/06/2020 04:28:57 AM', u'HCL Docker', u'Down'].
How can I overcome this issue?
In the list:
['10/06/2020 04:28:57 AM', u'HCL Docker', u'Down']
The first item, as the error message says, must be unicode, not str. As you can see, Unicode is prefixed with u, like this: u'HCL Docker'.
So you should make sure that each item in data is unicode. You can convert everything to unicode like this:
writer.writerow(map(unicode, data))
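If the TypeError persists after that, one possible cause (an assumption, since the question does not show its imports) is that 3to2 left the file opened through io.open, which only accepts unicode on write, while Python 2's csv.writer emits byte strings. A minimal sketch of a writer that picks the file mode per Python version; append_row is a hypothetical helper name and the temporary path is only there to make the example self-contained:

```python
# -*- coding: utf-8 -*-
import csv
import io
import os
import sys
import tempfile

PY2 = sys.version_info[0] == 2


def append_row(filepath, row):
    """Append one row to a CSV file, storing text as UTF-8."""
    if PY2:
        # Python 2: csv.writer emits byte strings, so open in binary mode
        # and encode each cell (assumes cells are unicode or ASCII-only str).
        with open(filepath, 'ab') as f:
            csv.writer(f).writerow([unicode(cell).encode('utf-8') for cell in row])
    else:
        # Python 3: csv.writer emits text, so open in text mode.
        with io.open(filepath, 'a', newline='', encoding='utf-8') as f:
            csv.writer(f).writerow(row)


path = os.path.join(tempfile.mkdtemp(), 'status.csv')
append_row(path, ['10/06/2020 04:28:57 AM', u'HCL Docker', u'Down'])

# Read the file back to check what was written.
with io.open(path, 'r', encoding='utf-8', newline='') as f:
    written = f.read()
```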

how to get python to recognize the ® symbol [duplicate]

Hi there, I am trying to make Python recognize ® as a symbol (in case it doesn't show up well here, it is the capital R within a circle known as the 'registered' symbol).
I understand that Python does not recognize it because of ASCII, but I was wondering if anyone knows a way to use a different encoding that includes this symbol, or a method to make Python 'ignore' it.
For some context:
I am trying to make an auto-checkout program for a website, so my program needs to match the item that the user wants. To do this I am using BeautifulSoup to scrape information; however, the symbol '®' is within the names of a few of the items, causing Python to crash.
Here is the current code that I am using, which is not working due to ASCII:
for colour in soup.find_all('a', attrs={"class":"name-link"}, href=True):
    CnI.append(str(colour.text))
    Uhrefs.append(str(colour.get('href')))
Any help would be appreciated
Here is the entirety of the program so far(ignore the mess nowhere near done):
import time
import webbrowser
from selenium import webdriver
import mechanize
from bs4 import BeautifulSoup
import urllib2
from selenium.webdriver.support.ui import Select
CnI = []
item = []
colour = []
Uhrefs = []
Whrefs = []
FinalColours = []
selectItemindex = []
selectColourindex = []
#counters
Ccounter = 0
Icounter = 0
Splitcounter = 1
#wanted items suffix options:jackets, shirts, tops_sweaters, sweatshirts, pants, shorts, hats, bags, accessories, skate
suffix = 'accessories'
Wcolour = 'Black'
Witem = '2-Tone Nylon 6-Panel'
driver=webdriver.Chrome()
driver.get('http://www.supremenewyork.com/shop/all/'+suffix)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup)
for colour in soup.find_all('a', attrs={"class":"name-link"}, href=True):
    CnI.append(str(colour.text))
    Uhrefs.append(str(colour.get('href')))
    print(colour)
print('#############')
for each in CnI:
    each.split(',')
    print(each)
while Splitcounter<=len(CnI):
    item.append(CnI[Splitcounter-1])
    FinalColours.append(CnI[Splitcounter])
    Whrefs.append(Uhrefs[Splitcounter])
    Splitcounter+=2
print(Uhrefs)
for each in item:
    print(each)
for z in FinalColours:
    print(z)
for i in Whrefs:
    print(i)
##for i in item:
## hold = item.index(i)
## print(hold)
## if Witem == i and Wcolour == FinalColours[i]:
## print('correct')
##
##
for count,elem in enumerate(item):
    if Witem in elem:
        selectItemindex.append(count+1)
for count,elem in enumerate(FinalColours):
    if Wcolour in elem:
        selectColourindex.append(count+1)
print(selectColourindex)
print(selectItemindex)
for each in selectColourindex:
    if selectColourindex[Ccounter] in selectItemindex:
        point = selectColourindex[Ccounter]
        print(point)
    else:
        Ccounter+=1
web = 'http://www.supremenewyork.com'+Whrefs[point-1]
driver.get(web)
elem1 = driver.find_element_by_name('commit')
elem1.click()
time.sleep(1)
elem2 = driver.find_element_by_link_text('view/edit basket')
elem2.click()
time.sleep(1)
elem3 = driver.find_element_by_link_text('checkout now')
elem3.click()
"®" is a non-ASCII character (Unicode code point U+00AE), so in Python 2 calling str() on it attempts an ASCII encode and fails. Instead of using str(), use something like this:
unicode(input_string, 'utf8')
# or
unicode(input_string, 'unicode-escape')
Edit: Given the code surrounding the initial snippet that was posted later and the fact that BeautifulSoup actually returns unicode already, it seems that removal of str() might be the best course of action and @MarkTolonen's answer is spot-on.
BeautifulSoup returns Unicode strings. Stop converting them back to byte strings. Best practice when dealing with text is to:
1. Decode incoming text to Unicode (what BeautifulSoup is doing).
2. Process all text using Unicode.
3. Encode outgoing text to bytes (to file, to database, to sockets, etc.).
Small example of your issue:
text = u'\N{REGISTERED SIGN}' # syntax to create a Unicode codepoint by name.
bytes = str(text)
Output:
Traceback (most recent call last):
File "test.py", line 2, in <module>
bytes = str(text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 0: ordinal not in range(128)
Note the first line works and supports the character. Converting it to a byte string fails because str() defaults to encoding in ASCII. You can explicitly encode it with another encoding (e.g. bytes = text.encode('utf8')), but that breaks rule 2 above and creates other issues.
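The decode, process, encode pattern described above can be sketched in a few lines. The item name is invented for illustration, and UTF-8 is an assumed encoding for both the incoming and the outgoing bytes:

```python
# -*- coding: utf-8 -*-
# 1. Decode incoming bytes to Unicode at the boundary.
incoming = b'Gore-Tex\xc2\xae Jacket'      # UTF-8 bytes, e.g. from a socket
text = incoming.decode('utf-8')

# 2. Do all processing on Unicode text.
has_mark = u'\N{REGISTERED SIGN}' in text
shouted = text.upper()

# 3. Encode only when the text leaves the program.
outgoing = shouted.encode('utf-8')
```

Nothing in the middle step needs to know which encoding the data arrived in, which is the point of keeping the conversions at the edges.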
Suggested reading:
https://nedbatchelder.com/text/unipain.html
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Django encoding error when reading from a CSV

When I try to run:
import csv
with open('data.csv', 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        pgd = Player.objects.get_or_create(
            player_name=row['Player'],
            team=row['Team'],
            position=row['Position']
        )
Most of my data gets created in the database, except for one particular row. When my script reaches the row, I receive the error:
ProgrammingError: You must not use 8-bit bytestrings unless you use a
text_factory that can interpret 8-bit bytestrings (like text_factory = str).
It is highly recommended that you instead just switch your application to Unicode strings.
The particular row in the CSV that causes this error is:
>>> row
{'FR\xed\x8aD\xed\x8aRIC.ST-DENIS', 'BOS', 'G'}
I've looked at the other similar Stackoverflow threads with the same or similar issues, but most aren't specific to using Sqlite with Django. Any advice?
If it matters, I'm running the script by going into the Django shell by calling python manage.py shell, and copy-pasting it in, as opposed to just calling the script from the command line.
This is the stacktrace I get:
Traceback (most recent call last):
File "<console>", line 4, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 108, in next
row = self.reader.next()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 302, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 1674: invalid continuation byte
EDIT: I decided to just manually import this entry into my database, rather than try to read it from my CSV, based on Alastair McCormack's feedback
Based on the output from your question, it looks like the person who made the CSV mojibaked it - it doesn't seem to represent FRÉDÉRIC.ST-DENIS. You can try using windows-1252 instead of utf-8 but I think you'll end up with FRíŠDíŠRIC.ST-DENIS in your database.
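To see why neither codec recovers the name, the bytes shown in the question can be decoded both ways. A small sketch; the variable names are illustrative only:

```python
# -*- coding: utf-8 -*-
raw = b'FR\xed\x8aD\xed\x8aRIC.ST-DENIS'   # bytes as shown in the question

# Strict UTF-8 decoding rejects the sequence.
try:
    raw.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# windows-1252 maps these bytes to characters, so decoding succeeds,
# but the result is still not the intended name.
as_cp1252 = raw.decode('windows-1252')
```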
I suspect you're using Python 2 - open() returns str which are simply byte strings.
The error is telling you that you need to decode your text to Unicode string before use.
The simplest method is to decode each cell:
with open('data.csv', 'r') as csvfile: # 'U' means Universal line mode and is not necessary
    reader = csv.DictReader(csvfile)
    for row in reader:
        pgd = Player.objects.get_or_create(
            player_name=row['Player'].decode('utf-8'),
            team=row['Team'].decode('utf-8'),
            position=row['Position'].decode('utf-8')
        )
That'll work, but it's ugly to add decodes everywhere and it won't work in Python 3. Python 3 improves things by opening files in text mode and returning Python 3 strings, which are the equivalent of Unicode strings in Python 2.
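If you stay with per-cell decoding, one way to keep the decodes out of the model call is a small helper applied to each row. This is only a sketch: decode_row is a hypothetical helper, not part of csv or Django, and it assumes every value in the file is UTF-8 encoded:

```python
# -*- coding: utf-8 -*-
def decode_row(row, encoding='utf-8'):
    """Return a copy of a csv.DictReader row with byte-string values decoded."""
    return dict(
        (key, value.decode(encoding) if isinstance(value, bytes) else value)
        for key, value in row.items()
    )


# Example row shaped like the ones csv.DictReader yields on Python 2.
raw_row = {'Player': b'Fr\xc3\xa9d\xc3\xa9ric St-Denis', 'Team': b'BOS', 'Position': b'G'}
clean_row = decode_row(raw_row)
```

Inside the loop this would be used as row = decode_row(row) before calling get_or_create.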
To get the same functionality in Python 2, use the io module. This gives you an open() function which has an encoding option. Annoyingly, the Python 2.x csv module is broken with Unicode, so you need to install a backported version:
pip install backports.csv
To tidy your code and future proof it, do:
import io
from backports import csv
with io.open('data.csv', 'r', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        # now every row is automatically decoded from UTF-8
        pgd = Player.objects.get_or_create(
            player_name=row['Player'],
            team=row['Team'],
            position=row['Position']
        )
Encode the player name as UTF-8 using .encode('utf-8'):
import csv
with open('data.csv', 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        pgd = Player.objects.get_or_create(
            player_name=row['Player'].encode('utf-8'),
            team=row['Team'],
            position=row['Position']
        )
In Django, you can also decode with latin-1: csv.DictReader(io.StringIO(csv_file.read().decode('latin-1'))). latin-1 maps every byte to a character, so it swallows all the special characters and avoids the decode exceptions you get with utf-8.

Python 2.7 and Sublime 2 + unicode don't mix

First of all, I've looked here: Sublime Text 3, Python 3 and UTF-8 don't like each other and read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets but am still none the wiser to the following:
I am running Python from a file created in Sublime (not compiling) and executing it via the command prompt on an XP machine.
I have a couple of text files named with accents (German, Spanish & French mostly). I want to remove accented characters (umlauts, acutes, graves, cedillas etc.) and replace them with their equivalent non-accented lookalikes.
I can strip the accents if they are in a string from within the script. But accessing a text file of the same name causes the stripAccents function to fail. I'm all out of ideas, as I think this is due to a conflict between Sublime and Python.
Here's my script
# -*- coding: utf-8 -*-
import unicodedata
import os
def stripAccents(s):
    try:
        us = unicode(s,"utf-8")
        nice = unicodedata.normalize("NFD", us).encode("ascii", "ignore")
        print nice
        return nice
    except:
        print ("Fail! : %s" %(s))
        return None
stripAccents("Découvrez tous les logiciels à télécharger")
# Decouvrez tous les logiciels a telecharger
stripAccents("Östblocket")
# Ostblocket
stripAccents("Blühende Landschaften")
# Bluhende Landschaften
root = "D:\\temp\\test\\"
for path, subdirs, files in os.walk(root):
    for name in files:
        x = name
        x = stripAccents(x)
For the record:
C:\chcp
gets me 437
This is what the code produces for me:
The error in full is:
C:\WINDOWS\system32>D:\LearnPython\unicode_accents.py
Decouvrez tous les logiciels a telecharger
Ostblocket
Bluhende Landschaften
Traceback (most recent call last):
File "D:\LearnPython\unicode_accents.py", line 37, in <module>
x = stripAccents(x)
File "D:\LearnPython\unicode_accents.py", line 8, in stripAccents
us = unicode(s,"utf-8")
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 2: invalid start byte
C:\WINDOWS\system32>
root = "D:\\temp\\test\\"
for path, subdirs, files in os.walk(root):
If you want to read Windows filenames in their native Unicode form you have to ask for that specifically, by passing a Unicode string to filesystem functions:
root = u"D:\\temp\\test\\"
Otherwise Python will default to using the standard byte-based interfaces to the filesystem. On Windows, these return filenames to you encoded in the system's locale-specific legacy encoding (ANSI code page).
In stripAccents you try to decode the byte string you got from here using UTF-8, but the ANSI code page is never UTF-8, and the byte sequence you have doesn't happen to be a valid UTF-8 sequence so you get an error. You can decode from the ANSI code page using the pseudo-encoding mbcs, but it would be better to stick to Unicode filepath strings so you can include characters that don't fit in ANSI.
Always use Unicode strings to represent text in Python. Add from __future__ import unicode_literals at the top so that all "" literals would create Unicode strings. Or use u"" literals everywhere. Drop unicode(s, 'utf-8') from stripAccents(), always pass Unicode strings instead (try unidecode package, to transliterate Unicode to ascii).
Using Unicode solves several issues transparently:
there won't be a UnicodeDecodeError, because Windows provides a Unicode API for filenames: if you pass Unicode input, you get Unicode output
you won't get mojibake when a bytestring containing text encoded using your Windows encoding such as cp1252 is displayed in the console using the cp437 encoding, e.g., Blühende -> Blⁿhende (ü is corrupted)
you might be able to work with text that can't be represented using either the cp1252 or the cp437 encoding, e.g., '❤' (U+2764 HEAVY BLACK HEART).
To print Unicode text to Windows console, you could use win-unicode-console package.
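Putting the advice together, a Unicode-only version of the accent stripper might look like the sketch below. It is not the original function: strip_accents is a renamed variant that takes and returns Unicode text instead of byte strings, and it drops combining marks rather than encoding to ASCII:

```python
# -*- coding: utf-8 -*-
import unicodedata


def strip_accents(text):
    """Return Unicode text with combining accent marks removed."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # Keep everything that is not a combining mark.
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


plain_de = strip_accents(u'Bl\xfchende Landschaften')
plain_sv = strip_accents(u'\xd6stblocket')
plain_fr = strip_accents(u'D\xe9couvrez tous les logiciels \xe0 t\xe9l\xe9charger')
```

With a Unicode root such as u"D:\\temp\\test\\", the names returned by os.walk are already Unicode and can be passed to strip_accents directly.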

"UnicodeEncodeError: 'ascii' codec can't encode character"

I'm trying to pass big strings of random html through regular expressions and my Python 2.6 script is choking on this:
UnicodeEncodeError: 'ascii' codec can't encode character
I traced it back to a trademark superscript on the end of this word: Protection™ -- and I expect to encounter others like it in the future.
Is there a module to process non-ascii characters? or, what is the best way to handle/escape non-ascii stuff in python?
Thanks!
Full error:
E
======================================================================
ERROR: test_untitled (__main__.Untitled)
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\Python26\Test2.py", line 26, in test_untitled
ofile.write(Whois + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 1005: ordinal not in range(128)
Full Script:
from selenium import selenium
import unittest, time, re, csv, logging
class Untitled(unittest.TestCase):
    def setUp(self):
        self.verificationErrors = []
        self.selenium = selenium("localhost", 4444, "*firefox", "http://www.BaseDomain.com/")
        self.selenium.start()
        self.selenium.set_timeout("90000")
    def test_untitled(self):
        sel = self.selenium
        spamReader = csv.reader(open('SubDomainList.csv', 'rb'))
        for row in spamReader:
            sel.open(row[0])
            time.sleep(10)
            Test = sel.get_text("//html/body/div/table/tbody/tr/td/form/div/table/tbody/tr[7]/td")
            Test = Test.replace(",","")
            Test = Test.replace("\n", "")
            ofile = open('TestOut.csv', 'ab')
            ofile.write(Test + '\n')
            ofile.close()
    def tearDown(self):
        self.selenium.stop()
        self.assertEqual([], self.verificationErrors)
if __name__ == "__main__":
    unittest.main()
You're trying to convert unicode to ascii in "strict" mode:
>>> help(str.encode)
Help on method_descriptor:
encode(...)
S.encode([encoding[,errors]]) -> object
Encodes S using the codec registered for encoding. encoding defaults
to the default encoding. errors may be given to set a different error
handling scheme. Default is 'strict' meaning that encoding errors raise
a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
'xmlcharrefreplace' as well as any other name registered with
codecs.register_error that is able to handle UnicodeEncodeErrors.
You probably want something like one of the following:
s = u'Protection™'
print s.encode('ascii', 'ignore') # removes the ™
print s.encode('ascii', 'replace') # replaces with ?
print s.encode('ascii','xmlcharrefreplace') # turn into xml entities
print s.encode('ascii', 'strict') # throw UnicodeEncodeErrors
You're trying to pass a bytestring to something, but it's impossible (from the scarcity of info you provide) to tell what you're trying to pass it to. You start with a Unicode string that cannot be encoded as ASCII (the default codec), so, you'll have to encode by some different codec (or transliterate it, as @R.Pate suggests) -- but it's impossible for us to say what codec you should use, because we don't know what you're passing the bytestring to and therefore don't know what that unknown subsystem is going to be able to accept and process correctly in terms of codecs.
In such total darkness as you leave us in, utf-8 is a reasonable blind guess (since it's a codec that can represent any Unicode string exactly as a bytestring, and it's the standard codec for many purposes, such as XML) -- but it can't be any more than a blind guess, until and unless you're going to tell us more about what you're trying to pass that bytestring to, and for what purposes.
Passing thestring.encode('utf-8') rather than bare thestring will definitely avoid the particular error you're seeing right now, but it may result in peculiar displays (or whatever it is you're trying to do with that bytestring!) unless the recipient is ready, willing and able to accept utf-8 encoding (and how could WE know, having absolutely zero idea about what the recipient could possibly be?!-)
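If the recipient is a file, as in the script in the question, one option is to let the file object do the encoding instead of calling .encode() by hand. A minimal sketch, assuming UTF-8 is acceptable to whatever later reads TestOut.csv; the temporary directory is only there to keep the example self-contained:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

line = u'Protection\u2122'                 # text containing the trademark sign
path = os.path.join(tempfile.mkdtemp(), 'TestOut.csv')

# io.open encodes on write; newline='' disables newline translation.
with io.open(path, 'a', encoding='utf-8', newline='') as ofile:
    ofile.write(line + u'\n')

# Read the raw bytes back to see what actually landed on disk.
with io.open(path, 'rb') as f:
    raw = f.read()
```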
The "best" way always depends on your requirements; so, what are yours? Is ignoring non-ASCII appropriate? Should you replace ™ with "(tm)"? (Which looks fancy for this example, but quickly breaks down for other codepoints—but it may be just what you want.) Could the exception be exactly what you need; now you just need to handle it in some way?
Only you can really answer this question.
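If replacing a handful of known symbols is what you want, a translation table keeps the substitutions in one place. A short sketch; the table contents and the name REPLACEMENTS are illustrative assumptions, not a complete mapping:

```python
# -*- coding: utf-8 -*-
# Map code points to ASCII stand-ins; extend as new symbols turn up.
REPLACEMENTS = {
    0x2122: u'(tm)',   # TRADE MARK SIGN
    0x00AE: u'(R)',    # REGISTERED SIGN
}

original = u'Protection\u2122'
ascii_safe = original.translate(REPLACEMENTS)

# After translation the text can be encoded as ASCII without errors.
encoded = ascii_safe.encode('ascii')
```

Any code point missing from the table still raises UnicodeEncodeError on the final encode, which makes unexpected symbols visible rather than silently dropped.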
First of all, try installing translations for English language (or any other if needed):
sudo apt-get install language-pack-en
which provides translation data updates for all supported packages (including Python).
And make sure you use the right encoding in your code.
For example (on Python 2, use io.open, since the built-in open() has no encoding argument):
io.open(foo, encoding='utf-8')
Then double-check your system configuration, such as the value of LANG or the locale configuration (/etc/default/locale), and don't forget to log in again afterwards.