I have a pandas dataframe (Python 2.7) containing a u'\u2019' that does not let me export my result as CSV.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 180: ordinal not in range(128)
Is there a way to query the dataframe and substitute this character with another one?
Try using a different encoding when saving to file. The default in pandas for Python 2.x is ASCII, which is why you get the error: the ascii codec can't handle non-ASCII unicode characters.
df.to_csv(path, encoding='utf-8')
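For illustration, a minimal sketch reproducing the failure and the fix (the frame and the column name 'text' are made up for the example):
# -*- coding: utf-8 -*-
import pandas as pd

# u'\u2019' is the right single quotation mark that triggers the error
df = pd.DataFrame({'text': [u'it\u2019s fine']})

# the default ascii codec would raise UnicodeEncodeError here;
# passing encoding='utf-8' writes the file correctly on Python 2
df.to_csv('out.csv', encoding='utf-8')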
I did not manage to export the whole file. However, I managed to identify the rows with the character causing problems and eliminate them:
faulty_rows = []
for i in range(len(outcome)):
    try:
        test = outcome.iloc[[i]]                  # single-row slice as a DataFrame
        test.to_csv("/Users/john/test/test.csv")  # fails only if the row has a non-ASCII character
    except UnicodeEncodeError:
        faulty_rows.append(i)
        print i

tocsv = outcome.drop(outcome.index[faulty_rows])  # drop the offending rows
tocsv.to_csv("/Users/john/test/test.csv")
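Since the question asks about substituting the character, a sketch of an alternative that replaces it in place instead of dropping rows (assuming the offending column is a string column named 'text'; adjust to the real frame):
# 'text' is a hypothetical column name; swap the curly quote for a plain one
outcome['text'] = outcome['text'].str.replace(u'\u2019', u"'")
outcome.to_csv("/Users/john/test/test.csv", encoding='utf-8')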
I have a simple dataframe with 3 columns.
+------------------+-------------------+-------+
| NM1_PROFILE| CURRENT_DATEVALUE| ID|
+------------------+-------------------+-------+
|XY_12345678 – Main|2019-12-19 00:00:00|myuser1|
+------------------+-------------------+-------+
All I want in the output is a single string consisting of all the values in the dataframe row, separated by a comma or pipe. Although there are many rows in the dataframe, I just want one row to solve my purpose.
XY_12345678 – Main,2019-12-19 00:00:00,myuser1
I have tried the code below, and it has worked fine for my other dataframes, but for the one above it gives me an error.
df.rdd.map(lambda line: ",".join([str(x) for x in line])).take(1)[0]
Error when it encounters "–" (u'\u2013', the en dash):
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 12: ordinal not in range(128)
I am using Spark 1.6 with Python 2 and have tried:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
According to the Spark 1.6 documentation, you can use the concat_ws function, which, given a separator and a set of columns, concatenates them into one string. So this should solve your issue:
from pyspark.sql.functions import col, concat_ws
df.select(concat_ws(",", col("NM1_PROFILE"), col("CURRENT_DATEVALUE"), col("ID")).alias("concat")).collect()
Or, if you prefer a more generic way, you can use something like this:
from pyspark.sql.functions import col, concat_ws
cols = [col(column) for column in df.columns]
df.select(concat_ws(",", *cols).alias("concat")).collect()
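Note that collect() returns a list of Row objects rather than the plain string the question asks for. A small sketch of pulling the string out and printing it safely on Python 2 (using only the names defined above):
# take the first row and index into it to get the concatenated string
row = df.select(concat_ws(",", *cols).alias("concat")).first()
line = row[0]  # a unicode object, e.g. u'XY_12345678 \u2013 Main,2019-12-19 00:00:00,myuser1'
print line.encode('utf-8')  # encode explicitly so print doesn't fall back to the ascii codec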
For more information: https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.functions.concat_ws
Hope this helps
I have a pandas dataframe where the column type shows as object, but when I try to convert it to string,
df['column'] = df['column'].astype('str')
a UnicodeEncodeError gets thrown:
*** UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
My next approach was to handle the encoding part:
df['column'] = df['column'].apply(lambda x: x.encode('utf-8').strip())
But that gives following error:
*** AttributeError: 'float' object has no attribute 'encode'
What's the best approach to convert this column to string?
Sample strings in the column:
Thank you :)
Thank You !!!
responsibilities/assigned job.
I had the same problem in Python 2.7 when trying to run a script that was originally intended for Python 3. In Python 2.7, the default str functionality is to encode to ASCII, which will apparently not work with your data. This can be replicated in a simple example:
import pandas as pd
df = pd.DataFrame({'column': ['asdf', u'uh ™ oh', 123]})
df['column'] = df['column'].astype('str')
Results in:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 3: ordinal not in range(128)
Instead, you can specify unicode:
df['column'] = df['column'].astype('unicode')
Verify that the number has been converted to a string:
df['column'][2]
This outputs u'123', so it has been converted to a unicode string. The special character ™ has been properly preserved as well.
I am trying to join column values in an xlsx file using pandas. I am using the code below to do that.
(df.astype(str).groupby('name', as_index=False, sort=False)
.apply(lambda x: pd.Series({v: ','.join(x[v].unique()) for v in x})))
But I am getting an error like:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 321-322: ordinal not in range(128)
If you only need strings in your DataFrame, you can use the option dtype=unicode in your read_excel call and remove the astype(str).
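A minimal sketch of that approach, assuming a hypothetical data.xlsx (on Python 2 the built-in unicode type is what gets passed as the dtype):
import pandas as pd

# read every cell as unicode up front, so no later astype(str)
# (and its implicit ascii encode) is needed; the filename is hypothetical
df = pd.read_excel('data.xlsx', dtype=unicode)

result = (df.groupby('name', as_index=False, sort=False)
            .apply(lambda x: pd.Series({v: ','.join(x[v].unique()) for v in x})))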
I use an SQLAlchemy query with utf-8 encoding. When I run the query directly on MySQL I get output, but when I run the code in Python I get an error:
'ascii' codec can't decode byte 0xdb in position 942: ordinal not in range(128)
Query:
query = """SELECT * FROM (SELECT p.ID AS 'persons_ID', p.FirstName AS 'persons_FirstName', p.LastName AS 'persons_LastName',p.NationalCode AS 'persons_NationalCode', p.CityID AS 'persons_CityID', p.Mobile AS 'persons_Mobile',p.Address AS 'persons_Address', cities_1.ID AS 'cities_1_ID', cities_1.Name AS 'cities_1_Name',cities_1.ParentID AS 'cities_1_ParentID', cities_2.ID AS 'cities_2_ID', cities_2.Name AS 'cities_2_Name',cities_2.ParentID AS 'cities_2_ParentID' , cast(#row := #row + 1 as unsigned) as 'persons_row_number' FROM Persons p LEFT OUTER JOIN cities AS cities_2 ON cities_2.ID = p.CityID LEFT OUTER JOIN cities AS cities_1 ON cities_1.ID = cities_2.ParentID , (select #row := 0) as init WHERE 1=1 AND p.FirstName LIKE N'{}%'""".format('رامین')
MySQL connector charset:
e = create_engine("mysql+pymysql://#localhost/test?charset=utf8")
Do you have any idea how to resolve this?
Thanks,
Python 2 uses ASCII bytestrings by default, which support only Latin characters. Python 3 uses Unicode strings by default.
As I can see, you use some Arabic script in your query, so you probably get some back in the response. The error says that Python can't decode the Arabic characters to ASCII. To handle Arabic (or any other non-Latin) characters you have to use unicode in Python. Note: this has nothing to do with the charset setting you provide, which affects only the database connection.
So your options are:
Switch to Python 3.
Stay as you are, but add from __future__ import unicode_literals at the start of every module to make string literals unicode by default (see the sketch after this list).
Use encode/decode every time you convert between unicode and bytestrings, but that's the worst solution.
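For the second option, a minimal sketch of the module header (the sample value is the one from the query):
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # string literals are now unicode by default

name = 'رامین'     # a unicode object under unicode_literals, not a bytestring
print(type(name))   # <type 'unicode'> on Python 2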
Is there a way to encode the index of my dataframe? I have a dataframe whose index consists of the names of international conferences.
df2= pd.DataFrame(index=df_conf['Conference'], columns=['Citation1991','Citation1992'])
I keep getting:
KeyError: 'Leitf\xc3\xa4den der angewandten Informatik'
whenever my code references a foreign conference name with non-ASCII letters.
I tried:
df.at[x.encode("utf-8"), 'col1']
df.at[x.encode('ascii', 'ignore'), 'col']
Is there a way around it? I tried to see if I could encode the dataframe itself when creating it, but it doesn't seem I can do that either.
If you're not using csv, and you want to encode your string index, this is what worked for me:
df.index = df.index.str.encode('utf-8')
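After that, look rows up with keys encoded the same way (a sketch using the conference name and a column from the question):
# the index now holds UTF-8 bytestrings, so encode the lookup key the same way
key = u'Leitf\xe4den der angewandten Informatik'.encode('utf-8')
value = df2.at[key, 'Citation1991']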
The encoding should be handled when reading the input file, using the encoding option:
df = pd.read_csv('bibliography.csv', delimiter=',', encoding="utf-8")
or, if the file uses a BOM:
df = pd.read_csv('bibliography.csv', delimiter=',', encoding="utf-8-sig")
Just put "u" in front of utf8 strings such that
df2= pd.DataFrame(index=df_conf[u'Conference'], columns=[u'Citation1991',u'Citation1992'])
It will work.