Hi guys, I am having a problem inserting UTF-8 Unicode characters into my database.
The Unicode string that I get from my form is u'AJDUK MARKO\u010d'. The next step is to encode it to UTF-8 with value.encode('utf-8'), which gives me the string 'AJDUK MARKO\xc4\x8d'.
Then I try to update the database (insert behaves the same way, by the way):
cur.execute( "UPDATE res_partner set %s = '%s' where id = %s;"%(columns, value, remote_partner_id))
The value gets inserted or updated, but it is stored in exactly the same form, AJDUK MARKO\xc4\x8d, and of course I want AJDUK MARKOČ. The database uses UTF-8 encoding, so that is not the problem.
What am I doing wrong? Surprisingly, I couldn't really find anything useful on the forums.
\xc4\x8d is the UTF-8 encoding of č (U+010D). It looks like the insert has worked but you're not printing the result correctly, probably by printing the whole row as a list, which shows the repr of each string. For example, with the capital Č (U+010C, UTF-8 \xc4\x8c):
>>> print "Č"
Č
>>> print ["Č"] # a list with one string
['\xc4\x8c']
We would need to see more code to confirm this (it's always a good idea to post as much reproducible code as possible).
You could decode the result (result.decode("utf-8")), but you should avoid manually encoding or decoding. Psycopg2 already allows you to send Unicode strings, so you can do the following without encoding first:
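To make the display point concrete, here is a minimal sketch. It is written so that it also runs on Python 3, where byte strings carry a b prefix, and the variable names are mine, not from the question. It only shows that the escaped form is a display of the same UTF-8 bytes, and that decoding gives the original text back.

```python
# -*- coding: utf-8 -*-
# Sketch only: the escaped form is just how the UTF-8 bytes are displayed.
name = u"AJDUK MARKO\u010d"           # the Unicode value from the question
encoded = name.encode("utf-8")        # the UTF-8 bytes
shown_in_list = repr([encoded])       # what printing a whole row (a list) displays
round_trip = encoded.decode("utf-8")  # decoding restores the original text
```

If round_trip compares equal to name, nothing was lost on the way; only the printed representation looked odd.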
cur.execute( u"UPDATE res_partner set %s = '%s' where id = %s;" % (columns, value, remote_partner_id))
- note the leading u
Psycopg2 can also return Unicode strings by having byte strings decoded automatically:
import psycopg2
import psycopg2.extensions
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
Edit:
SQL values should be passed as arguments to .execute() rather than interpolated into the query string. See the big red box at: http://initd.org/psycopg/docs/usage.html#the-problem-with-the-query-parameters
For example:
# Replace the columns field first.
# Strictly we should use http://initd.org/psycopg/docs/sql.html#module-psycopg2.sql
sql = u"UPDATE res_partner set {} = %s where id = %s;".format(columns)
cur.execute(sql, (value, remote_partner_id))
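Here is a runnable sketch of the same idea. To keep it self-contained I'm assuming the standard library's sqlite3 as a stand-in for psycopg2, so the placeholder is ? instead of %s. The ALLOWED_COLUMNS whitelist and the update_partner helper are hypothetical names of mine; the whitelist is a simple substitute for psycopg2.sql, not the same thing.

```python
# -*- coding: utf-8 -*-
import sqlite3

# Hypothetical whitelist: column names cannot be bound as parameters,
# so they are checked before being formatted into the SQL text.
ALLOWED_COLUMNS = {"name", "city"}


def update_partner(conn, column, value, partner_id):
    """Update one column; the values travel as bound parameters."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError("unexpected column: %r" % (column,))
    sql = u"UPDATE res_partner SET {} = ? WHERE id = ?".format(column)
    conn.execute(sql, (value, partner_id))


# In-memory database standing in for the real PostgreSQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE res_partner (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.execute("INSERT INTO res_partner (id, name) VALUES (1, 'old')")

update_partner(conn, "name", u"AJDUK MARKO\u010d", 1)
stored = conn.execute("SELECT name FROM res_partner WHERE id = 1").fetchone()[0]
```

The Unicode value is handed to the driver untouched, with no manual encode() and no quoting by hand.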
Related
Just another question about encoding in Python, I think. I have this program:
regex = re.compile(ur'\b[sw]\w+', flags= re.U | re.I)
ergebnisliste = []
for line in fileobject:
    print str(line)
    erg = regex.findall(line)
    ergebnisliste = ergebnisliste + erg
ergebnislistesortiert = sorted(ergebnisliste, key=lambda x: len(x))
print ergebnislistesortiert
fileobject.close()
I am searching a text file for words beginning with s or w. "ergebnislistesortiert" is the sorted result list.
When I print the result list, there appears to be a problem with the encoding:
['so', 'Wer', 'sp\xc3']
The 'sp\xc3' should be printed as spät. What is wrong here? Why is the list element UTF-8?
And how can I get the right decoding to print "spät"?
Thanks a lot guys!
\xc3 is not UTF-8. It's a fragment of the full UTF-8 encoding of U+00E4 but you're probably reading it with something like a Latin-1 decoder (which is effectively what Python 2 does if you read bytes without specifying an encoding), in which case the second byte in the UTF-8 sequence isn't matched by \w.
The real fix is to decode the data when you are reading it into Python in the first place. If you are writing new code, switching to Python 3 is probably the best and easiest fix.
If you're stuck on Python 2.7, a somewhat Python 3-compatible approach is something like
import io
fileobject = io.open(filename, encoding='utf-8')
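A small sketch of the difference, assuming a throwaway sample file created just for the demonstration (the path and variable names are mine). It reads the same UTF-8 file once with the right encoding and once as Latin-1, then applies the regex from the question.

```python
# -*- coding: utf-8 -*-
import io
import os
import re
import tempfile

# Same pattern as in the question; r"" plus re.UNICODE works on decoded text.
pattern = re.compile(r"\b[sw]\w+", flags=re.UNICODE | re.IGNORECASE)

# Write a small UTF-8 sample file (hypothetical content).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"so Wer sp\u00e4t")

# Decoded correctly: the regex sees the real character U+00E4.
with io.open(path, encoding="utf-8") as f:
    decoded_matches = pattern.findall(f.read())

# Decoded as Latin-1: the two UTF-8 bytes become two separate characters,
# and the second one is not a word character, so the match is cut short.
with io.open(path, encoding="latin-1") as f:
    misdecoded_matches = pattern.findall(f.read())
```

With the correct encoding the last match is the whole word; with Latin-1 it stops after the first byte of the two-byte sequence, which is the 'sp\xc3' from the question.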
If you have control over the input file and want to postpone the proper solution until you are older, (ask your parents for permission to) convert the UTF-8 input file to some legacy 8-bit encoding.
I'm using Python to read values from SQL Server (pypyodbc) and insert them into PostgreSQL (psycopg2)
A value in the NAME field has come up that is causing errors:
Montaño
The value exists in my MSSQL database just fine (SQL_Latin1_General_CP1_CI_AS collation), and can be inserted into my PostgreSQL database just fine (UTF8) using PGAdmin and an insert statement.
The problem is that selecting it using Python causes the value to be converted to:
Monta\xf1o
(\xf1 is the Latin-1 byte for 'Latin small letter n with tilde'; it is not ASCII)
...which is causing the following error to be thrown when trying to insert into PostgreSQL:
invalid byte sequence for encoding "UTF8": 0xf1 0x6f 0x20 0x20
Is there any way to avoid the conversion of the input string to the string that is causing the error above?
Under Python 2 you actually do want to perform a conversion from a byte string to the unicode type. So, if your code looks something like
sql = """\
SELECT NAME FROM dbo.latin1test WHERE ID=1
"""
mssql_crsr.execute(sql)
row = mssql_crsr.fetchone()
name = row[0]
then you probably want to convert the basic latin1 string (retrieved from SQL Server) to the type unicode before using it as a parameter to the PostgreSQL INSERT, i.e., instead of
name = row[0]
you would do
name = unicode(row[0], 'latin1')
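A quick sketch of why the conversion helps, using only the byte value from the question (variable names are mine, and it is written so it also runs on Python 3). The raw bytes are not valid UTF-8, but once decoded as Latin-1 they can be encoded to UTF-8 without trouble.

```python
# -*- coding: utf-8 -*-
raw = b"Monta\xf1o"  # bytes as they come back from the Latin-1 column

# The raw bytes are not valid UTF-8, which is what PostgreSQL complains about.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Decode with the source encoding; equivalent to unicode(raw, 'latin1') on Python 2.
name = raw.decode("latin-1")

# The Unicode value encodes to proper UTF-8.
utf8_bytes = name.encode("utf-8")
```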
I'm running a parser in Python 2.7 that takes a text field of XML from a database and uses Beautiful Soup to find and pull out different tags. When I pull the name out of an author tag such as
<author>
<name>Josef Šimánek</name>
</author>
the text comes back as
Josef \xc5\xa0im\xc3\xa1nek
when what it should look like is
Josef Šimánek
My relevant code is as follows:
rss = str(f)
soup = BeautifulSoup(rss)
entries = soup.findAll('entry')
for entry in entries:
    author = entry.find('author')
    if author != None:
        for name in author.findAll("name"):
            if(checkNull(name).find(",") != -1):
                name = checkNull(name).split(",",1)
                for s in name:
                    print s
            else:
                print name
As you can see the code pulls out and cycles through the different tags and if the name tag contains a comma separated list of names, then it splits and prints each individually.
def checkNull(item):
    if item != None:
        return item.text.rstrip()
    return " "
Also, the checkNull function is just a helper to see if the returned tag even contains any text at all, as seen above.
I have tried the encode, decode, and unicode functions to resolve the issue, but none have succeeded. Are there any other methods I could try to fix this?
name is a BeautifulSoup.Tag, not a string, so you're probably getting a __repr__ of the object that's suitable for a terminal that doesn't support UTF-8 (\xc5\xa0 is the Python escape for the UTF-8 bytes of Š). name.text is probably the value you actually want, which should be a Unicode string.
If you're using Windows, it's best to avoid printing to the console as its console doesn't easily support UTF-8. You could use https://pypi.python.org/pypi/win_unicode_console, but it's easier to just write your output to a file instead.
I've cleaned up your code a little to make it simpler (quick null checks) and to write your output to a UTF-8 encoded file:
# io provides better access to files with working universal newline support
import io
# open a file in text mode, encoding all output to utf-8
output_file = io.open("output.txt", "w", encoding="utf-8")
rss = str(f)
soup = BeautifulSoup(rss)
entries = soup.findAll('entry')
for entry in entries:
    author = entry.find('author')
    # If not null or not empty
    if author:
        for name in author.findAll("name"):
            # .text contains the actual Unicode string value
            if name.text:
                names = name.text.split(",", 1)
                # If string contained a comma, you'll have two elements in a list
                # else you'll just have the 1 length list
                for flname in names:
                    # remove any whitespace on either side
                    output_file.write(flname.strip() + "\n")
output_file.close()
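If you want to try the same flow without the database or BeautifulSoup, here is a self-contained sketch. Note that it swaps in the standard library's xml.etree.ElementTree as the parser, and the sample XML, the second author name and the output path are made up for the demonstration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document, supplied as UTF-8 bytes.
xml_bytes = (
    u"<feed><entry><author>"
    u"<name>Josef \u0160im\u00e1nek, Jane Doe</name>"
    u"</author></entry></feed>"
).encode("utf-8")

root = ET.fromstring(xml_bytes)

names = []
for entry in root.iter("entry"):
    for name in entry.findall("author/name"):
        # .text is already a decoded string; skip empty tags
        if name.text:
            for part in name.text.split(",", 1):
                names.append(part.strip())

# Write the names to a UTF-8 encoded file instead of printing them.
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")
with io.open(out_path, "w", encoding="utf-8") as output_file:
    for flname in names:
        output_file.write(flname + u"\n")

# Read the file back to check that the text survived the round trip.
with io.open(out_path, encoding="utf-8") as check:
    written = check.read()
```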
I'm running the code below. It creates a couple of DataFrames that use a column of another DataFrame, which holds a list of conference names, as their index.
df_conf = pd.read_sql("select distinct Conference from publications where year>=1991 and length(conference)>1 order by conference", db)
for index, row in df_conf.iterrows():
    row[0]=row[0].encode("utf-8")
df2= pd.DataFrame(index=df_conf['Conference'], columns=['Citation1991','Citation1992'])
df2 = df2.fillna(0)
df_if= pd.DataFrame(index=df_conf['Conference'], columns=['IF1994','IF1995'])
df_if = df_if.fillna(0)
df_pubs=pd.read_sql("select Conference, Year, count(*) as totalPubs from publications where year>=1991 group by conference, year", db)
for index, row in df_pubs.iterrows():
    row[0]=row[0].encode("utf-8")
df_pubs= df_pubs.pivot(index='Conference', columns='Year', values='totalPubs')
df_pubs.fillna(0)
for index, row in df2.iterrows():
    df_if.ix[index,'IF1994'] = df2.ix[index,'Citation1992'] / (df_pubs.ix[index,1992]+df_pubs.ix[index,1993])
The last line keeps giving me the following error:
KeyError: 'Analyse dynamischer Systeme in Medizin, Biologie und \xc3\x96kologie'
I'm not quite sure what I'm doing wrong. I tried encoding the indexes, but it doesn't work. I even tried .at, and it still won't work.
I know it has to do with encoding, as it always stops at indexes with non-ASCII characters.
I'm using Python 2.7.
I think the problem with this:
for index, row in df_conf.iterrows():
    row[0]=row[0].encode("utf-8")
is that it may or may not work: iterrows() can return a copy of each row, so assigning to row is not guaranteed to change the DataFrame. I'm surprised it didn't raise a warning.
Besides that, it's much quicker to use the vectorised str method to encode the series:
df_conf['col_name'] = df_conf['col_name'].str.encode('utf-8')
If needed you can also encode the index in a similar fashion:
df.index = df.index.str.encode('utf-8')
Does it happen on the last line of the code?
df_if.ix[index,'IF1994'] = df2.ix[index,'Citation1992'] / (df_pubs.ix[index,1992]+df_pubs.ix[index,1993])
If so, try
df_if.ix[index,u'IF1994'] = df2.ix[index,u'Citation1992'] / (df_pubs.ix[index,1992]+df_pubs.ix[index,1993])
That should work. DataFrame indexing with UTF-8 strings behaves strangely even when the script is declared with "# -*- coding: utf8 -*-". Just prefix string literals with "u" when you use them as DataFrame column or index labels.
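I'm not certain this is exactly what happens inside pandas in this case, but the underlying mismatch can be shown with a plain dict: a Unicode label and its UTF-8 encoded bytes are different keys. This is a minimal sketch with a shortened, made-up label and my own variable names; it does not use pandas.

```python
# -*- coding: utf-8 -*-
label = u"Biologie und \u00d6kologie"  # label as a Unicode string
encoded_label = label.encode("utf-8")   # same text as UTF-8 bytes

counts = {label: 3}                     # lookup table keyed by the Unicode label

found_with_unicode = label in counts          # same type as the key
found_with_bytes = encoded_label in counts    # encoded bytes are a different key
```

So if one side of a lookup has been encoded and the other has not, the key is simply not found, even though both print as the same text.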
I am parsing a CSV file (created in Windows) and trying to populate a database table using a model I've created.
I am getting this error:
pl = PriceList.objects.create(code=row[0], description=row[1],.........
Incorrect string value: '\xD0h:NAT...' for column 'description' at row 1
My table and the description field use utf-8 with the utf8_general_ci collation.
The actual value I am trying to insert is this:
HOUSING:PS-187:1g\xd0h:NATURAL CO
I am not aware of any string processing I should do to get past this error.
I think I used a simple Python script before to populate the database using conn.escape_string() and it worked (if that helps).
Thanks
I've had trouble with the CSV reader and unicode before as well. In my case using the following got me past the errors.
From http://docs.python.org/library/csv.html
The csv module doesn’t directly support reading and writing Unicode, ...
unicode_csv_reader() below is a generator that wraps csv.reader to handle Unicode CSV data (a list of Unicode strings). utf_8_encoder() is a generator that encodes the Unicode strings as UTF-8, one string (or row) at a time. The encoded strings are parsed by the CSV reader, and unicode_csv_reader() decodes the UTF-8-encoded cells back into Unicode:
import csv
def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]
def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
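The recipe above is for Python 2. On Python 3 the csv module works on text directly, so the main job is decoding the bytes with the file's real encoding first. Below is a rough Python 3 sketch. I'm assuming the Windows-created file is cp1252, which the question does not confirm, and the CODE1 cell and variable names are made up.

```python
# -*- coding: utf-8 -*-
# Python 3 sketch; cp1252 is an assumption about the file's encoding.
import csv
import io

# One CSV line containing the problem byte 0xD0 from the question.
raw = b"CODE1,HOUSING:PS-187:1g\xd0h:NATURAL CO\r\n"

# The raw bytes are not valid UTF-8, hence the "Incorrect string value" error.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Decode with the assumed source encoding, then let csv parse the text.
text_stream = io.StringIO(raw.decode("cp1252"), newline="")
rows = list(csv.reader(text_stream))

# The decoded cell can now be encoded as UTF-8 for the database.
description_utf8 = rows[0][1].encode("utf-8")
```

If the file turns out to use a different code page, only the encoding name in the decode() call would need to change.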