I am parsing a CSV file (created on Windows) and trying to populate a database table using a model I've created.
I am getting this error:
pl = PriceList.objects.create(code=row[0], description=row[1],.........
Incorrect string value: '\xD0h:NAT...' for column 'description' at row 1
My table and the description field use utf-8 and utf8_general_ci collation.
The actual value I am trying to insert is this:
HOUSING:PS-187:1g\xd0h:NATURAL CO
I am not aware of any string processing I should do to get past this error.
I think I used a simple Python script before to populate the database using conn.escape_string() and it worked (if that helps).
Thanks
I've had trouble with the CSV reader and unicode before as well. In my case using the following got me past the errors.
From http://docs.python.org/library/csv.html
The csv module doesn’t directly support reading and writing Unicode, ... unicode_csv_reader() below is a generator that wraps csv.reader to handle Unicode CSV data (a list of Unicode strings). utf_8_encoder() is a generator that encodes the Unicode strings as UTF-8, one string (or row) at a time. The encoded strings are parsed by the CSV reader, and unicode_csv_reader() decodes the UTF-8-encoded cells back into Unicode:
import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
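On Python 3 the csv module works on text directly, so a wrapper like the one above is unnecessary; what matters is decoding the file's bytes with the right codec. A minimal sketch, assuming (this is a guess, not something the question confirms) that the Windows-created file is encoded as cp1252; read_csv_rows and sample are hypothetical names used only for illustration:

```python
import csv
import io


def read_csv_rows(raw_bytes, encoding="cp1252"):
    """Decode raw CSV bytes with an explicit codec, then parse them.

    The cp1252 default is an assumption about the source file.
    """
    text = raw_bytes.decode(encoding)
    # newline="" lets the csv module handle line endings itself
    return list(csv.reader(io.StringIO(text, newline="")))


# The value from the question, as raw bytes (0xD0 followed by "h" is not valid UTF-8)
sample = b"HOUSING:PS-187:1g\xd0h:NATURAL CO,some description\r\n"

try:
    sample.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

rows = read_csv_rows(sample)
print(utf8_ok)
print(rows[0])
```

With a real file, passing open(path, encoding="cp1252", newline="") straight to csv.reader does the same thing. If the decoded character looks wrong, the file probably uses a different code page.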
I am trying to build a tool that can convert .csv files into .yaml files for further use. I found a handy bit of code that does the job nicely from the link below:
Convert CSV to YAML, with Unicode?
which states that the following line takes the dict created from a .csv file and dumps it to a .yaml file:
out_file.write(ry.safe_dump(dict_example,allow_unicode=True))
However, one small kink I have noticed is that when it is run once, the generated .yaml file is typically incomplete by a line or two. In order to have the .csv file exhaustively read through to create a complete .yaml file, the code must be run two or even three times. Does anybody know why this could be?
UPDATE
Per request, here is the code I use to parse my .csv file, which is two columns long (with a string in the first column and a list of two strings in the second column) and will typically be 50 rows long (or maybe more). Also note that it is designed to remove any '\n' or spaces that could potentially cause problems later on in the code.
import csv

csv_contents = {}
with open("example1.csv", "rU") as csvfile:
    green = csv.reader(csvfile, dialect='excel')
    for line in green:
        candidate_number = line[0]
        first_sequence = line[1].replace(' ', '').replace('\r', '').replace('\n', '')
        second_sequence = line[2].replace(' ', '').replace('\r', '').replace('\n', '')
        csv_contents[candidate_number] = [first_sequence, second_sequence]
csv_contents.pop('Header name', None)
Ultimately, it is not that important that I maintain the order of the rows from the original dict, just that all the information within the rows is properly structured.
I am not sure what the cause could be, but you might be running out of memory, as you create the YAML document in memory first and then write it out. It is much better to stream it out directly.
You should also note that the code in the question you link to doesn't preserve the order of the original columns, something easily circumvented by using round_trip_dump instead of safe_dump.
You probably want to make a top-level sequence (list) as in the desired output of the linked question, with each element being a mapping (dict).
The following parses the CSV, taking the first line as keys for mappings created for each following line:
import sys
import csv
import ruamel.yaml as ry
import dateutil.parser  # pip install python-dateutil

def process_line(line):
    """convert lines, trying, int, float, date"""
    ret_val = []
    for elem in line:
        try:
            res = int(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = float(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = dateutil.parser.parse(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        ret_val.append(elem.strip())
    return ret_val

csv_file_name = 'xyz.csv'
data = []
header = None
with open(csv_file_name) as inf:
    for line in csv.reader(inf):
        d = process_line(line)
        if header is None:
            header = d
            continue
        data.append(ry.comments.CommentedMap(zip(header, d)))
ry.round_trip_dump(data, sys.stdout, allow_unicode=True)
With input xyz.csv:
id, title_english, title_russian
1, A Title in English, Название на русском
2, Another Title, Другой Название
This generates:
- id: 1
  title_english: A Title in English
  title_russian: Название на русском
- id: 2
  title_english: Another Title
  title_russian: Другой Название
The process_line function is just some sugar that tries to convert strings in the CSV file to more useful types, and strips leading spaces from the remaining strings (resulting in far fewer quotes in your output YAML file).
I have tested the above on files with 1000 rows, without any problems (I won't post the output though).
The above was done using Python 3 as well as Python 2.7, starting with a UTF-8 encoded file xyz.csv. If you are using Python 2, you can try unicodecsv if you need to handle Unicode input and things don't work out as well as they did for me.
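If the type conversion is not needed, the header-as-keys step can be sketched with the standard library alone. This is an alternative, not the code above: it uses csv.DictReader on Python 3, keeps every value as a string, and the sample and records names are made up for the illustration:

```python
import csv
import io

# Same sample data as xyz.csv above, inlined so the sketch is self-contained
sample = (
    "id, title_english, title_russian\n"
    "1, A Title in English, Название на русском\n"
    "2, Another Title, Другой Название\n"
)

# skipinitialspace drops the blank that follows each comma,
# for the header row as well as for the data rows
reader = csv.DictReader(io.StringIO(sample, newline=""), skipinitialspace=True)
records = [dict(row) for row in reader]

for record in records:
    print(record)
```

The resulting list of dicts has the same top-level shape (a sequence of mappings) as the data list built above, so it could be handed to a YAML dumper in the same way.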
I'm using Python to read values from SQL Server (pypyodbc) and insert them into PostgreSQL (psycopg2).
A value in the NAME field has come up that is causing errors:
Montaño
The value exists in my MSSQL database just fine (SQL_Latin1_General_CP1_CI_AS collation), and can be inserted into my PostgreSQL database just fine (UTF8) using pgAdmin and an INSERT statement.
The problem is that selecting it using Python causes the value to be converted to:
Monta\xf1o
(0xf1 is the Latin-1 code for 'Latin small letter n with tilde')
...which is causing the following error to be thrown when trying to insert into PostgreSQL:
invalid byte sequence for encoding "UTF8": 0xf1 0x6f 0x20 0x20
Is there any way to avoid the conversion of the input string to the string that is causing the error above?
Under Python 2 you actually do want to perform a conversion from a basic string to the unicode type. So, if your code looks something like
sql = """\
SELECT NAME FROM dbo.latin1test WHERE ID=1
"""
mssql_crsr.execute(sql)
row = mssql_crsr.fetchone()
name = row[0]
then you probably want to convert the basic latin1 string (retrieved from SQL Server) to the type unicode before using it as a parameter to the PostgreSQL INSERT, i.e., instead of
name = row[0]
you would do
name = unicode(row[0], 'latin1')
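The effect of that conversion can be checked in isolation. A small sketch in Python 3 syntax (an assumption on my part; there, the Python 2 str corresponds to bytes and unicode to str), with raw_name as a stand-in for row[0]:

```python
# The byte string as the SQL Server driver returned it (Latin-1 encoded)
raw_name = b"Monta\xf1o"

# Decode with the codec the bytes were actually written in
name = raw_name.decode("latin1")

# Encoding the text as UTF-8 yields a valid two-byte sequence for the n with tilde
utf8_bytes = name.encode("utf-8")

print(name)
print(utf8_bytes)
```

The lone 0xf1 byte from the error message is gone: once the value is real text, the driver can encode it correctly for a UTF8 database.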
new_list = eval(my_list[0])  # my_list contains dictionaries

def get_next_file():
    for key, value in new_list.iteritems():
        yield value

file = get_next_file
book = xlrd.open_workbook(file_contents=file)
for sheet in book.sheet_names():
    print sheet
I am trying to take a string from a dict and turn it back into an xls file so it can be processed. It was originally an xls file, which I converted with str(list(xls_file)) so that it could be saved in my database.
Any thoughts?
The saved string prints out as hex with some words in it.
The xlrd library you are using can only read information from an Excel file.
If you want to write a new Excel file, you need to use xlwt.
If you want to change some cells in an existing Excel file, you should use xlutils.
Homepage of these three libraries: http://www.python-excel.org/
Hi guys, I am having a problem inserting a UTF-8 Unicode character into my database.
The unicode value that I get from my form is u'AJDUK MARKO\u010d'. The next step is to encode it as UTF-8 with value.encode('utf-8'), which gives me the string 'AJDUK MARKO\xc4\x8d'.
Then I try to update the database (insert behaves the same way, by the way):
cur.execute( "UPDATE res_partner set %s = '%s' where id = %s;"%(columns, value, remote_partner_id))
The value gets inserted or updated in the database, but the problem is that it is stored in exactly the same format, AJDUK MARKO\xc4\x8d, and of course I want AJDUK MARKOČ. The database has UTF-8 encoding, so it is not that.
What am I doing wrong? Surprisingly, I couldn't really find anything useful on the forums.
\xc4\x8d is the UTF-8 encoding of č (U+010D). It looks like the insert has worked but you're not printing the result correctly, probably because you are printing the whole row as a list, i.e.:
>>> print "Č"
Č
>>> print ["Č"] # a list with one string
['\xc4\x8c']
We need to see more code to confirm this (it's always a good idea to give as much reproducible code as possible).
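The same effect is easy to reproduce without a database. A minimal sketch in Python 3 syntax (an assumption; the thread uses Python 2, where the list would show '...' rather than b'...'):

```python
value = u"AJDUK MARKO\u010d"      # the text from the form
encoded = value.encode("utf-8")   # the bytes that were stored

# A container shows the repr of its items, so the bytes appear escaped
as_shown_in_list = repr([encoded])

# Decoding the same bytes gives the original text back unchanged
round_tripped = encoded.decode("utf-8")

print(as_shown_in_list)
print(round_tripped == value)
```

The escapes are only a display form; the stored data and the original text are the same characters.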
You could decode the result (result.decode("utf-8")), but you should avoid manually encoding or decoding. Psycopg2 already allows you to send unicode objects, so you can do the following without encoding first:
cur.execute( u"UPDATE res_partner set %s = '%s' where id = %s;" % (columns, value, remote_partner_id))
Note the leading u.
Psycopg2 can also return unicode objects by having strings automatically decoded:
import psycopg2
import psycopg2.extensions
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
Edit:
SQL values should be passed as arguments to .execute() instead of being formatted into the query string. See the big red box at: http://initd.org/psycopg/docs/usage.html#the-problem-with-the-query-parameters
For example:
# Replace the columns field first.
# Strictly we should use http://initd.org/psycopg/docs/sql.html#module-psycopg2.sql
sql = u"UPDATE res_partner set {} = %s where id = %s;".format(columns)
cur.execute(sql, (value, remote_partner_id))
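The pattern can be tried without a PostgreSQL server. The sketch below uses the standard-library sqlite3 module purely as a stand-in (so the placeholder is ? rather than psycopg2's %s), and the ALLOWED_COLUMNS whitelist and update_partner helper are hypothetical names, not part of either library:

```python
import sqlite3

ALLOWED_COLUMNS = {"name"}  # hypothetical whitelist for the dynamic column


def update_partner(conn, column, value, partner_id):
    """Update one whitelisted column, passing the values as parameters."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError("unexpected column: %r" % column)
    # Only the vetted identifier is formatted in; values are passed separately
    sql = "UPDATE res_partner SET {} = ? WHERE id = ?".format(column)
    conn.execute(sql, (value, partner_id))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE res_partner (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO res_partner (id, name) VALUES (?, ?)", (1, "old"))

update_partner(conn, "name", u"AJDUK MARKO\u010d", 1)
stored = conn.execute("SELECT name FROM res_partner WHERE id = 1").fetchone()[0]
print(stored == u"AJDUK MARKO\u010d")
```

Checking the column name against a fixed set is one simple way to keep a dynamic identifier safe; with psycopg2 itself, the sql module linked in the comment above is the documented route.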
I'm running a parser in Python 2.7 that takes a text field of XML from a database and uses Beautiful Soup to find and pull out different tags in the XML. When I pull the <name> tags out of an <author> tag such as the one below and get the text, it returns the escaped line shown after the XML:
<author>
<name>Josef Šimánek</name>
</author>
Josef \xc5\xa0im\xc3\xa1nek
when what it should look like is
Josef Šimánek
My relevant code is as follows:
rss = str(f)
soup = BeautifulSoup(rss)
entries = soup.findAll('entry')
for entry in entries:
    author = entry.find('author')
    if author != None:
        for name in author.findAll("name"):
            if(checkNull(name).find(",") != -1):
                name = checkNull(name).split(",",1)
                for s in name:
                    print s
            else:
                print name
As you can see the code pulls out and cycles through the different tags and if the name tag contains a comma separated list of names, then it splits and prints each individually.
def checkNull(item):
    if item != None:
        return item.text.rstrip()
    return " "
Also, the checkNull function is just a helper to see whether the returned tag contains any text at all, as seen above.
I have tried the encode, decode, and unicode functions to resolve the issue, but none have succeeded. Are there any other methods I could try to fix this?
name is a BeautifulSoup.Tag object, not a string, so you're probably getting a __repr__ of the object that's suitable for a terminal that doesn't support UTF-8 (\xc5\xa0 is the UTF-8 byte sequence for Š, shown in Python escape notation). name.text is probably the value you actually want, which should be a Unicode string.
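That the escaped output is just the UTF-8 form of the name can be confirmed directly. A minimal sketch in Python 3 syntax (an assumption; the question runs Python 2.7, where the same bytes would be a plain str and .decode('utf-8') applies in the same way):

```python
# The bytes exactly as they appeared in the question's output
printed = b"Josef \xc5\xa0im\xc3\xa1nek"

# Decoding them as UTF-8 recovers the readable name
name = printed.decode("utf-8")

print(name)
print(len(printed), len(name))
```

The byte string is longer than the decoded text because each accented letter takes two bytes in UTF-8.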
If you're using Windows, it's best to avoid printing to the console as its console doesn't easily support UTF-8. You could use https://pypi.python.org/pypi/win_unicode_console, but it's easier to just write your output to a file instead.
I've cleaned up your code a little to make it simpler (quick null checks) and to write your output to a UTF-8 encoded file:
# io provides better access to files with working universal newline support
import io

# open a file in text mode, encoding all output to utf-8
output_file = io.open("output.txt", "w", encoding="utf-8")

rss = str(f)
soup = BeautifulSoup(rss)
entries = soup.findAll('entry')
for entry in entries:
    author = entry.find('author')
    # If not null or not empty
    if author:
        for name in author.findAll("name"):
            # .text contains the actual Unicode string value
            if name.text:
                names = name.text.split(",", 1)
                # If string contained a comma, you'll have two elements in a list
                # else you'll just have the 1 length list
                for flname in names:
                    # remove any whitespace on either side
                    output_file.write(flname.strip() + "\n")
output_file.close()