Why is Pandas forcing unicode column names in place of strings? - python-2.7

Why is Pandas force-converting ASCII strings to unicode when converting a dictionary to a DataFrame? Is this a feature or a known bug?
I'm using Python 2.7.3 and Pandas 0.20.2
MWE included below.
import pandas as pd
sample_dict={}
sample_dict['A'] = {'Key_1': 'A1', 'Key-2': 'A2', 'Key_3': 'A3'}
sample_dict['B'] = {'Key_1': 'B1', 'Key-2': 'B2', 'Key_3': 'B3'}
sample_dict['C'] = {'Key_1': 'C1', 'Key-2': 'C2', 'Key_3': 'C3'}
print sample_dict['A'].keys()
sample_df = pd.DataFrame.from_dict(sample_dict, orient='index')
print sample_df.keys()
Results in:
['Key-2', 'Key_1', 'Key_3']
Index([u'Key-2', u'Key_1', u'Key_3'], dtype='object')
Addendum: I came across this similar question, but it's been inactive for a couple of years and does not discuss why this is happening.

From the pandas DataFrame repr docstring:
"""
Return a string representation for a particular object.
Yields Bytestring in Py2, Unicode String in py3.
"""
So I am fairly sure that in Python 3 you would not see any unicode prefix.
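For what it's worth, the u'' prefix is only how Python 2 displays unicode objects; ASCII-range str and unicode keys compare and hash as equal, so lookups with plain strings keep working. A quick check in Python 2, reusing the MWE above:
# Python 2: ascii str and unicode compare and hash equal,
# so the u'' prefix in the repr is purely cosmetic.
print u'Key_1' == 'Key_1'          # True
print sample_df['Key_1'].tolist()  # plain str key still works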

Related

How to fix UnicodeEncodeError in PySpark while converting a DataFrame Row to a String

I have a simple dataframe with 3 columns.
+------------------+-------------------+-------+
| NM1_PROFILE| CURRENT_DATEVALUE| ID|
+------------------+-------------------+-------+
|XY_12345678 – Main|2019-12-19 00:00:00|myuser1|
+------------------+-------------------+-------+
All I want in the output is a single string consisting of all the values in a DataFrame row, separated by a comma or pipe. Although there are many rows in the DataFrame, I just want one row to serve my purpose.
XY_12345678 – Main,2019-12-19 00:00:00,myuser1
I have tried the line below, and it has worked fine for my other DataFrames, but for the one above it gives me an error.
df.rdd.map(lambda line: ",".join([str(x) for x in line])).take(1)[0]
It errors when it encounters "–" (the en dash, u'\u2013'):
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 12: ordinal not in range(128)
I am using Spark 1.6 with Python 2 and have tried:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
According to the Spark 1.6 documentation, you can use the concat_ws function, which, given a separator and a set of columns, concatenates them into one string. So this should solve your issue:
from pyspark.sql.functions import col, concat_ws
df.select(concat_ws(",", col("NM1_PROFILE"), col("CURRENT_DATEVALUE"), col("ID")).alias("concat")).collect()
Or, if you prefer a more generic way, you can use something like this:
from pyspark.sql.functions import col, concat_ws
cols = [col(column) for column in df.columns]
df.select(concat_ws(",", *cols).alias("concat")).collect()
For more information: https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.functions.concat_ws
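As a hedged alternative (assuming Python 2, where str() implicitly encodes with the ascii codec), the original RDD approach also works if you join with unicode instead of str:
# str(u'\u2013') implicitly encodes with ascii and fails;
# unicode() keeps the value as unicode, so the join succeeds.
row = df.rdd.map(lambda line: u",".join([unicode(x) for x in line])).take(1)[0]
print row.encode('utf-8')  # encode explicitly only when writing bytes out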
Hope this helps

Python/Pandas: How do I convert from datetime64[ns] to datetime

I have a script that processes an Excel file. The department that sends it switched to a new system that generates the file, and my script stopped working.
I suddenly got the error Can only use .str accessor with string values, which use np.object_ dtype in pandas for the following line of code:
df['DATE'] = df['Date'].str.replace(r'[^a-zA-Z0-9\._/-]', '')
I checked the type of the date columns in the file from the old system (dtype: object) vs the file from the new system (dtype: datetime64[ns]).
How do I change the date format to something my script will understand?
I saw this answer but my knowledge about date formats isn't this granular.
You can use the apply function on the DataFrame column to convert the necessary column to a string. For example:
df['DATE'] = df['Date'].apply(lambda x: x.strftime('%Y-%m-%d'))
Make sure to import the datetime module.
apply() takes each cell in turn and applies the formatting specified in the lambda function.
pd.to_datetime returns a Series of datetime64 dtype, as described here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
To get plain datetime.date objects back, use this:
df['DATE'] = df['Date'].dt.date
or this:
df['Date'].map(datetime.datetime.date)
You can use pd.to_datetime
df['DATE'] = pd.to_datetime(df['DATE'])
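If the goal is just to make the original .str regex replacement work again, one option (a sketch, assuming a pandas version with Series.dt.strftime) is to render the datetime64 column back to strings first:
# Format the datetime64[ns] column as strings so .str methods apply again.
df['DATE'] = df['Date'].dt.strftime('%Y-%m-%d %H:%M:%S')
df['DATE'] = df['DATE'].str.replace(r'[^a-zA-Z0-9\._/-]', '')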

How can I parse multiple date columns in Pandas?

I have a field/column in a .csv file that I am loading into Pandas that will not parse as a datetime data type, and I don't really understand why. I want both FirstTime and SecondTime to parse as datetime64 in the Pandas DataFrame.
# Assigning a header for our data
header = ['FirstTime', 'Col1', 'Col2', 'Col3', 'SecondTime', 'Col4',
          'Col5', 'Col6', 'Col7', 'Col8']
# Loading our data into a dataframe
df = pd.read_csv('MyData.csv', names=header, parse_dates=['FirstTime', 'SecondTime'])
The code above will only parse SecondTime as datetime64[ns]. FirstTime is left as an object data type. If I use the following code instead:
# Assigning a header for our data
header = ['FirstTime', 'Col1', 'Col2', 'Col3', 'SecondTime', 'Col4',
          'Col5', 'Col6', 'Col7', 'Col8']
# Loading our data into a dataframe
df = pd.read_csv('MyData.csv', names=header, parse_dates=['FirstTime'])
It still will not parse FirstTime as datetime64[ns].
The format for both columns is the same:
# Example FirstTime
# (%f is always .000)
2015-11-05 16:52:37.000
# Example SecondTime
# (%f is always .000)
2015-11-04 15:33:15.000
What am I missing here? Is the first column not able to be datetime by default or something in Pandas?
Did you try:
df = pd.read_csv('MyData.csv', names=header, parse_dates=True)
I had a similar problem, and it turned out that one of my date variables contained an integer cell, so Python recognized that column as "object" while the other was recognized as "int64". You need to make sure both variables contain consistent values.
You can use df.dtypes to see the format of your variables.
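If parse_dates gives up silently, a minimal diagnostic sketch (reusing the header list from the question, and assuming a pandas version that supports errors='coerce') is to parse the column manually and inspect the cells that fail:
import pandas as pd

df = pd.read_csv('MyData.csv', names=header)
# Cells that cannot be parsed become NaT instead of raising an error.
parsed = pd.to_datetime(df['FirstTime'], errors='coerce')
print df.loc[parsed.isnull(), 'FirstTime'].head()  # the offending raw values
df['FirstTime'] = parsed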

pandas dataframe and u'\u2019'

I have a pandas dataframe (Python 2.7) containing the character u'\u2019', which keeps me from exporting my result as CSV.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 180: ordinal not in range(128)
Is there a way to query the dataframe and substitute this character with another one?
Try using a different encoding when saving to file (the default in pandas for Python 2.x is ascii, which is why you get the error: it can't handle non-ASCII characters):
df.to_csv(path, encoding='utf-8')
I did not manage to export the whole file. However, I managed to identify the rows with the character causing problems and eliminate them:
faulty_rows = []
for i in range(len(outcome)):
    try:
        test = outcome.iloc[i]
        test.to_csv("/Users/john/test/test.csv")
    except UnicodeEncodeError:
        faulty_rows.append(i)
        print i

tocsv = outcome.drop(outcome.index[faulty_rows])
tocsv.to_csv("/Users/john/test/test.csv")
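Since the question actually asks about substituting the character, a more direct sketch (assuming the affected columns are plain object/string columns) replaces it before exporting instead of dropping rows:
# Swap the curly apostrophe for a plain one everywhere, then export as usual.
outcome = outcome.replace(u'\u2019', "'", regex=True)
outcome.to_csv("/Users/john/test/test.csv")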

Dataframe encoding

Is there a way to encode the index of my dataframe? I have a dataframe where the index holds the names of international conferences.
df2= pd.DataFrame(index=df_conf['Conference'], columns=['Citation1991','Citation1992'])
I keep getting:
KeyError: 'Leitf\xc3\xa4den der angewandten Informatik'
whenever my code references a foreign conference name containing non-ASCII letters.
I tried:
df.at[x.encode("utf-8"), 'col1']
df.at[x.encode('ascii', 'ignore'), 'col']
Is there a way around it? I tried to see if I could encode the dataframe itself when creating it, but it doesn't seem that I can do that either.
If you're not using csv, and you want to encode your string index, this is what worked for me:
df.index = df.index.str.encode('utf-8')
The encoding should be handled when reading the input file, using the encoding option:
df = pd.read_csv('bibliography.csv', delimiter=',', encoding="utf-8")
or if the file uses BOM,
df = pd.read_csv('bibliography.csv', delimiter=',', encoding="utf-8-sig")
Just put "u" in front of utf8 strings such that
df2= pd.DataFrame(index=df_conf[u'Conference'], columns=[u'Citation1991',u'Citation1992'])
It will work.
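Putting these together, a minimal end-to-end sketch (assuming the bibliography.csv from the earlier answer contains a Conference column) that keeps everything unicode so no .encode() calls are needed:
# -*- coding: utf-8 -*-
import pandas as pd

# Read with an explicit encoding so the index holds unicode, not raw bytes.
df_conf = pd.read_csv('bibliography.csv', delimiter=',', encoding='utf-8')
df2 = pd.DataFrame(index=df_conf[u'Conference'],
                   columns=[u'Citation1991', u'Citation1992'])
# Look up rows with unicode keys as well (u'\xe4' is the decoded '\xc3\xa4').
print df2.loc[u'Leitf\xe4den der angewandten Informatik']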