Pandas convert object column to str - column contains unicode, float etc - python-2.7

I have a pandas DataFrame where a column's dtype shows as object, but when I try to convert it to a string with
df['column'] = df['column'].astype('str')
a UnicodeEncodeError is thrown:
*** UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
My next approach was to handle the encoding part:
df['column'] = df['column'].apply(lambda x: x.encode('utf-8').strip())
But that gives the following error:
*** AttributeError: 'float' object has no attribute 'encode'
What's the best approach to convert this column to string?
Sample of strings in the column:
Thank you :)
Thank You !!!
responsibilities/assigned job.

I had the same problem in Python 2.7 when trying to run a script that was originally intended for Python 3. In Python 2.7, str defaults to encoding with the ASCII codec, which will apparently not work with your data. This can be replicated in a simple example:
import pandas as pd
df = pd.DataFrame({'column': ['asdf', u'uh ™ oh', 123]})
df['column'] = df['column'].astype('str')
Results in:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 3: ordinal not in range(128)
Instead, you can specify unicode:
df['column'] = df['column'].astype('unicode')
Verify that the number has been converted to a string:
df['column'][2]
This outputs u'123', so it has been converted to a unicode string. The special character ™ has been properly preserved as well.
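If you actually need UTF-8 byte strings rather than unicode (for example, for a library that expects encoded bytes), the .encode() approach from the question can be rescued with a type check so that floats no longer raise the AttributeError. A minimal sketch, assuming Python 2:
import pandas as pd
df = pd.DataFrame({'column': ['asdf', u'uh \u2122 oh', 123, 1.5]})
# Encode unicode values to utf-8 bytes and stringify everything else;
# the isinstance check is what avoids "'float' object has no attribute 'encode'".
df['column'] = df['column'].apply(
    lambda x: x.encode('utf-8') if isinstance(x, unicode) else str(x))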

Related

How to fix UnicodeEncodeError in PySpark while converting a DataFrame Row to a String

I have a simple dataframe with 3 columns.
+------------------+-------------------+-------+
| NM1_PROFILE| CURRENT_DATEVALUE| ID|
+------------------+-------------------+-------+
|XY_12345678 – Main|2019-12-19 00:00:00|myuser1|
+------------------+-------------------+-------+
All I want in the output is a single string consisting of all the values in the DataFrame row, separated by a comma or pipe. Although there are many rows in the DataFrame, I just want one row to solve my purpose.
XY_12345678 – Main,2019-12-19 00:00:00,myuser1
I have tried the below, and it has worked fine for my other DataFrames, but for the one above it gives me an error:
df.rdd.map(lambda line: ",".join([str(x) for x in line])).take(1)[0]
Error when it encounters "–" (u'\u2013'):
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 12: ordinal not in range(128)
I am using Spark 1.6 with Python 2 and have tried:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
According to the Spark 1.6 documentation, you can use the concat_ws function which, given a separator and a set of columns, will concatenate them into one string. So this should solve your issue:
from pyspark.sql.functions import col, concat_ws
df.select(concat_ws(",", col("NM1_PROFILE"), col("CURRENT_DATEVALUE"), col("ID")).alias("concat")).collect()
Or, if you prefer a more generic way, you can use something like this:
from pyspark.sql.functions import col, concat_ws
cols = [col(column) for column in df.columns]
df.select(concat_ws(",", *cols).alias("concat")).collect()
For more information: https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.functions.concat_ws
Hope this helps
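Alternatively, the original RDD approach works if str is swapped for unicode, which leaves unicode values untouched instead of encoding them with the ASCII codec. A minimal sketch under the same Spark 1.6 / Python 2 setup:
# unicode() stringifies the timestamp and ID but passes the en dash through unchanged
df.rdd.map(lambda line: u",".join([unicode(x) for x in line])).take(1)[0]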

UnicodeEncodeError: 'ascii' codec can't encode characters in position 321-322: ordinal not in range(128)

I am trying to join column values in an xlsx file using pandas. I am using the below code to do that:
(df.astype(str).groupby('name', as_index=False, sort=False)
.apply(lambda x: pd.Series({v: ','.join(x[v].unique()) for v in x})))
But I am getting an error like:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 321-322: ordinal not in range(128)
If you only need strings in your DataFrame, you can use the option dtype=unicode in your read_excel call and remove the astype(str).
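A minimal sketch of that suggestion, assuming Python 2, a pandas version where read_excel accepts dtype (0.20+), and a placeholder file name:
import pandas as pd
# dtype=unicode reads every column in as unicode strings up front,
# so no later astype(str) (which would encode to ASCII) is needed
df = pd.read_excel('data.xlsx', dtype=unicode)
result = (df.groupby('name', as_index=False, sort=False)
            .apply(lambda x: pd.Series({v: u','.join(x[v].unique()) for v in x})))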

Using Python codecs but still getting UnicodeDecodeError

I have a non-English list of rows where each row is a list of strings and ints. I need to write this data to a file and convert all numbers to strings accordingly.
The data contents are the following:
[[u'12', u'as', u'ss', u'ge', u'ge', u'm\xfcnze', u'10.0', u'25.2', u'68.05', 1, 2, 0],
[u'13', u'aas', u'sss', u'tge', u'a', u'mat', u'11.0', u'35.7', u'10.1', 1, 1, 1], ...]
The loop breaks on the first list which contains u'm\xfcnze'.
import codecs
with codecs.open("temp.txt", "w", encoding="utf-8") as f:
    for row in data:
        f.write(' '.join([str(r) for r in row]))
        f.write('\n')
The code above fails with UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 38: ordinal not in range(128).
Trying r.encode('utf-8') if isinstance(r, str) does not solve this issue, so what am I doing wrong?
This should work:
import codecs
with codecs.open("temp.txt", "w", encoding="utf-8") as f:
    for row in data:
        f.write(' '.join([unicode(r) for r in row]))
        f.write('\n')
I'm using the unicode() function instead of str(); unicode() converts the ints and leaves the unicode strings untouched, so nothing is pushed through the ASCII codec.
Note: because Python 3's string type is already Unicode, your code works fine in Python 3 without any modification (no str -> unicode change needed).
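For comparison, the Python 3 equivalent needs neither codecs nor unicode(), since open() accepts an encoding directly and str is already Unicode:
# Python 3 only
with open("temp.txt", "w", encoding="utf-8") as f:
    for row in data:
        f.write(' '.join(str(r) for r in row))
        f.write('\n')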

'ascii' codec can't decode byte 0xdb in position 942: ordinal not in range(128) SQLAlchemy (Django)

I use an SQLAlchemy query with UTF-8 encoding. When I run the query directly against MySQL I get output, but when I run the code from Python I get this error:
'ascii' codec can't decode byte 0xdb in position 942: ordinal not in range(128)
Query:
query = """SELECT * FROM (SELECT p.ID AS 'persons_ID', p.FirstName AS 'persons_FirstName',
    p.LastName AS 'persons_LastName', p.NationalCode AS 'persons_NationalCode', p.CityID AS 'persons_CityID',
    p.Mobile AS 'persons_Mobile', p.Address AS 'persons_Address',
    cities_1.ID AS 'cities_1_ID', cities_1.Name AS 'cities_1_Name', cities_1.ParentID AS 'cities_1_ParentID',
    cities_2.ID AS 'cities_2_ID', cities_2.Name AS 'cities_2_Name', cities_2.ParentID AS 'cities_2_ParentID',
    cast(@row := @row + 1 as unsigned) as 'persons_row_number'
    FROM Persons p
    LEFT OUTER JOIN cities AS cities_2 ON cities_2.ID = p.CityID
    LEFT OUTER JOIN cities AS cities_1 ON cities_1.ID = cities_2.ParentID,
    (select @row := 0) as init
    WHERE 1=1 AND p.FirstName LIKE N'{}%') AS persons""".format('رامین')
MySQL connector charset:
e = create_engine("mysql+pymysql://@localhost/test?charset=utf8")
Do you have an idea how to resolve this?
Thanks,
Python 2 uses byte strings (ASCII) by default, which support only ASCII characters. Python 3 uses Unicode strings by default.
Since you use some Arabic script in your query, you probably get some back in the response. The error says that Python can't decode the Arabic characters as ASCII. To handle Arabic (or any other non-Latin) characters you have to use unicode strings in Python. Note: this has nothing to do with the charset setting you provide, which affects only the database connection.
So your options are:
Switch to Python 3.
Stay as you are, but add from __future__ import unicode_literals at the start of every module to make string literals unicode by default (see the sketch after this list).
Use encode/decode every time you convert between unicode and byte strings, but that's the worst solution.
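A minimal sketch of the second option, assuming Python 2 (the __future__ import has to come before any other statement, and the coding line lets the source file contain the Arabic literal):
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
# every string literal below is now unicode, not an ASCII bytestring
name = 'رامین'
print type(name)  # <type 'unicode'>, so non-Latin characters survive intact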

pandas dataframe and u'\u2019'

I have a pandas DataFrame (Python 2.7) containing a u'\u2019' (right single quotation mark) that does not let me export my result as CSV.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 180: ordinal not in range(128)
Is there a way to query the DataFrame and substitute these characters with another one?
Try using a different encoding when saving to file (the default in pandas for Python 2.x is ASCII, which is why you get the error: it can't handle Unicode characters):
df.to_csv(path, encoding='utf-8')
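If you do want to substitute the character instead, as the question asks, the pandas .str accessor can replace it column by column; 'text' here is a placeholder column name:
# swap the curly apostrophe u'\u2019' for a plain ASCII one
df['text'] = df['text'].str.replace(u'\u2019', u"'")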
I did not manage to export the whole file. However, I managed to identify the rows with the characters causing problems and eliminate them:
faulty_rows = []
for i in range(len(outcome)):
    try:
        test = outcome.iloc[i]
        test.to_csv("/Users/john/test/test.csv")
    except UnicodeEncodeError:
        faulty_rows.append(i)
        print i

tocsv = outcome.drop(outcome.index[faulty_rows])
tocsv.to_csv("/Users/john/test/test.csv")
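Note that the answer above also applies to this workaround: passing encoding='utf-8' to the final to_csv call exports every row, including the ones containing u'\u2019', with nothing dropped:
outcome.to_csv("/Users/john/test/test.csv", encoding='utf-8')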