How do I format a pandas dataframe string value to date? - python-2.7

I have the following string value obtained from a pandas dataframe:
u'1:19 AM Eastern, Tuesday, May 16, 2017'
How do I convert it to a datetime.datetime(2017,5,16) object?
Thx.

You need to create a custom date parser, to give you some ideas here's a reproducible example:
import pandas as pd
import datetime
from StringIO import StringIO
st = u'01:19 AM Eastern, Tuesday, May 16, 2017'
def parse_date(date_string):
    date_string = ",".join(date_string.split(',')[-2:]).strip()
    return datetime.datetime.strptime(date_string, '%B %d, %Y')
df = pd.read_csv(StringIO(st), header=None, sep="|", date_parser=parse_date, parse_dates=[0])
If you print the DataFrame content as follows:
print("dataframe content")
print(df)
you will get this output:
dataframe content
0
0 2017-05-16
Checking the dtypes confirms that the column is now of type datetime:
print("dataframe types")
print(df.dtypes)
output:
dataframe types
0 datetime64[ns]
dtype: object
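If all you need is a plain datetime.datetime object rather than a parsed DataFrame column, a minimal sketch of the same parsing idea without pandas (assuming the string always ends in "Month day, year") could be:
import datetime
s = u'1:19 AM Eastern, Tuesday, May 16, 2017'
# keep only the last two comma-separated pieces: "May 16, 2017"
date_part = ",".join(s.split(',')[-2:]).strip()
dt_obj = datetime.datetime.strptime(date_part, '%B %d, %Y')
print(repr(dt_obj))  # datetime.datetime(2017, 5, 16, 0, 0)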

Related

Replacing labels in dataframe while plotting

My dat.csv is as follows:
State, Pop
AP,100
UP,200
TN,90
I want to plot it and so my code is as follows:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('dat.csv')
df.plot(kind='bar').set_xticklabels(df.State)
plt.show()
However, I want to replace the labels which are in another csv file,
labels.csv
Column,Name,Level,Rename
State,AP,AP,Andhra Pradesh
State,TN,TN,Tamil Nadu
State,UP,UP,Uttar Pradesh
Is it possible for me to replace the labels in the plot with the labels in my labels.csv file?
Using merge + set_index
df = df.merge(labels, left_on='State', right_on='Name', how='left')
df
Out[1094]:
State Pop Column Name Level Rename
0 AP 100 State AP AP Andhra Pradesh
1 UP 200 State UP UP Uttar Pradesh
2 TN 90 State TN TN Tamil Nadu
df.set_index('Rename')['Pop'].plot(kind='bar')
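For completeness, a minimal end-to-end sketch (assuming labels.csv is loaded into a DataFrame called labels, and using skipinitialspace to cope with the space after the comma in the dat.csv header) could be:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('dat.csv', skipinitialspace=True)   # columns: State, Pop
labels = pd.read_csv('labels.csv')                    # columns: Column, Name, Level, Rename
df = df.merge(labels, left_on='State', right_on='Name', how='left')
df.set_index('Rename')['Pop'].plot(kind='bar')
plt.show()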

pandas dataframe getting daily data

I have a pandas dataframe with timestamps as index:
I would like to convert it to a dataframe with daily values, but without resampling the original dataframe (i.e. without summing or averaging the hourly data). Ideally I would like to get the 24 hourly values for each day in a vector, for example:
Is there a method to do this quickly?
Thanks!
IIUC you can groupby on the date attribute of your index and then apply a lambda that aggregates the values into a list:
In [21]:
import datetime as dt
import numpy as np
import pandas as pd
# generate some hourly data
df = pd.DataFrame({'GFS_rad': np.random.randn(100), 'GFS_tmp': np.random.randn(100)}, index=pd.date_range(dt.datetime(2016,1,1), freq='1h', periods=100))
df.groupby(df.index.date)['GFS_rad','GFS_tmp'].agg(lambda x: [x['GFS_rad'].values,x['GFS_tmp'].values])
Out[21]:
GFS_rad \
2016-01-01 [-0.324115177542, 1.59297335764, 0.58118555943...
2016-01-02 [-0.0547016526463, -1.10093451797, -1.55790161...
2016-01-03 [-0.34751220092, 1.06246918632, 0.181218794826...
2016-01-04 [0.950977469848, 0.422905080529, 1.98339145764...
2016-01-05 [-0.405124861624, 0.141470757613, -0.191169333...
GFS_tmp
2016-01-01 [-2.36889710412, -0.557972678049, -1.293544410...
2016-01-02 [-0.125562429825, -0.018852674365, -0.96735945...
2016-01-03 [0.802961514703, -1.68049099535, -0.5116769061...
2016-01-04 [1.35789157665, 1.37583167965, 0.538638510171,...
2016-01-05 [-0.297611872638, 1.10546853812, -0.8726761667...
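If you would rather end up with one column per hour of the day (24 values per row) instead of a list per cell, a sketch of the same idea using a pivot table (assuming at most one value per hour and the same sample data as above) could be:
import datetime as dt
import numpy as np
import pandas as pd
df = pd.DataFrame({'GFS_rad': np.random.randn(100), 'GFS_tmp': np.random.randn(100)},
                  index=pd.date_range(dt.datetime(2016, 1, 1), freq='1h', periods=100))
# one row per day, one column per hour, for a single variable
daily = df.pivot_table(values='GFS_rad', index=df.index.date, columns=df.index.hour)
print(daily.shape)  # (5, 24) for 100 hourly rows starting 2016-01-01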

Iterate through a Pandas DataFrame of Float time stamps and convert to date time

I have a Pandas Dataframe with 2000+ rows with date in float format as below:
42704.99686342593 representing datetime value of (2016, 11, 30, 23, 55, 29)
What I want to do is iterate each row in the dataframe and convert the float to the correct datetime format ideally d/m/Y H/M/S and save this to a new dataframe.
Using Python 2.7.
I couldn't find any duplicate questions and was unable to solve the issue with solutions to similar questions so any help appreciated.
Thanks.
It seems you are using a serial date, which is Excel's format.
The simplest solution is to subtract 25569 (the Excel serial number of 1970-01-01) and use to_datetime with the parameter unit='d':
df = pd.DataFrame({'date':[42704.99686342593,42704.99686342593]})
print (df)
date
0 42704.996863
1 42704.996863
print (pd.to_datetime(df.date - 25569, unit='d'))
0 2016-11-30 23:55:28.963200
1 2016-11-30 23:55:28.963200
Name: date, dtype: datetime64[ns]
Other solutions are to subtract a timedelta or an offset:
print (pd.to_datetime(df.date, unit='d') - pd.to_timedelta('25569 Days'))
0 2016-11-30 23:55:28.963200
1 2016-11-30 23:55:28.963200
Name: date, dtype: datetime64[ns]
print (pd.to_datetime(df.date, unit='d') - pd.offsets.Day(25569))
0 2016-11-30 23:55:28.963200
1 2016-11-30 23:55:28.963200
Name: date, dtype: datetime64[ns]
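If you also want the result rendered as the d/m/Y H:M:S string mentioned in the question, a hedged follow-up (recreating the sample df from above) could be:
import pandas as pd
df = pd.DataFrame({'date': [42704.99686342593, 42704.99686342593]})
converted = pd.to_datetime(df.date - 25569, unit='d')
print(converted.dt.strftime('%d/%m/%Y %H:%M:%S'))
# 0    30/11/2016 23:55:28
# 1    30/11/2016 23:55:28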
Thanks to Ted Petrou for the link.

Plot multiple lines on subplots with pandas df.plot

Is there a way to plot multiple dataframe columns on one plot, with several subplots for the dataframe?
E.g. If df has 12 data columns, on subplot 1, plot columns 1-3, subplot 2, columns 4-6, etc.
I understand how to use df.plot to have one subplot for each column, but am not sure how to group as specified above.
Thanks!
This is an example of how I do it:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2)
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randn(100, 6), columns=list('ABCDEF'))
df = df.div(100).add(1.01).cumprod()
df.iloc[:, :3].plot(ax=axes[0])
df.iloc[:, 3:].plot(ax=axes[1])
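To generalize this to the 12-column case from the question (three columns per subplot), a minimal sketch that loops over groups of columns (the column names here are made up) could be:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(100, 12), columns=['col%d' % i for i in range(12)])
fig, axes = plt.subplots(2, 2)
for i, ax in enumerate(axes.flat):
    # columns 0-2 on the first subplot, 3-5 on the second, and so on
    df.iloc[:, i * 3:(i + 1) * 3].plot(ax=ax)
plt.show()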

Pandas dataframe in pyspark to hive

How to send a pandas dataframe to a hive table?
I know if I have a spark dataframe, I can register it to a temporary table using
df.registerTempTable("table_name")
sqlContext.sql("create table table_name2 as select * from table_name")
but when I try to call registerTempTable on the pandas DataFrame, I get the error below:
AttributeError: 'DataFrame' object has no attribute 'registerTempTable'
Is there a way for me to register a pandas DataFrame as a temp table, or to convert it to a Spark DataFrame and then register that as a temp table, so that I can send it back to Hive?
I guess you are trying to use a pandas DF instead of Spark's DF.
A pandas DataFrame has no such method as registerTempTable.
You may try to create a Spark DF from the pandas DF.
UPDATE:
I've tested it under Cloudera (with installed Anaconda parcel, which includes Pandas module).
Make sure that you have set PYSPARK_PYTHON to your Anaconda Python installation (or another one containing the Pandas module) on all your Spark workers (usually in spark-conf/spark-env.sh).
Here is the result of my test:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('ABC'))
>>> sdf = sqlContext.createDataFrame(df)
>>> sdf.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 98| 33| 75|
| 91| 57| 80|
| 20| 87| 85|
| 20| 61| 37|
| 96| 64| 60|
| 79| 45| 82|
| 82| 16| 22|
| 77| 34| 65|
| 74| 18| 17|
| 71| 57| 60|
+---+---+---+
>>> sdf.printSchema()
root
|-- A: long (nullable = true)
|-- B: long (nullable = true)
|-- C: long (nullable = true)
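A hedged follow-up to push this into Hive, mirroring the snippet from the question (assuming the sqlContext above is a HiveContext):
>>> sdf.registerTempTable('table_name')
>>> sqlContext.sql("create table table_name2 as select * from table_name")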
First you need to convert the pandas dataframe to a Spark dataframe:
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
df = hive_context.createDataFrame(pd_df)
Then you can create a temp table, which is in memory:
df.registerTempTable('tmp')
Now you can use Hive QL to save the data into Hive:
hive_context.sql("""insert overwrite table target partition(p='p') select a, b from tmp""")
Note that the hive_context must be kept the same throughout!
I converted my pandas df to a temp table by:
1) Converting the pandas dataframe to a spark dataframe:
spark_df = sqlContext.createDataFrame(Pandas_df)
2) Making sure that the data migrated properly:
spark_df.select("*").show()
3) Converting the spark dataframe to a temp table for querying:
spark_df.registerTempTable("table_name")
Cheers..
By Following all the other answers here, I was able to convert a pandas dataframe to a permanent Hive table as follows:
# sc is the SparkContext of a Spark session started with Hive support enabled
from pyspark.sql import HiveContext
hc = HiveContext(sc)
# df is my pandas dataframe
hc.createDataFrame(df).registerTempTable('tmp')
# sch is the hive schema, and tabname is my new hive table name
hc.sql("create table sch.tabname as select * from tmp")