I want to create a new datetime column that uses a constant year, month, day, minutes, and seconds, retaining only the hour from the original values.
import pandas as pd
td = pd.DataFrame(['2015-01-01 09:03:00', '2015-01-11 15:47:00',
'2015-01-11 16:47:00', '2015-01-11 01:47:00', '2016-01-11 01:47:00'], columns=['datetime'])
td['datetime'] = pd.to_datetime(td['datetime'])
datetime
0 2015-01-01 09:03:00
1 2015-01-11 15:47:00
2 2015-01-11 16:47:00
3 2015-01-11 01:47:00
4 2016-01-11 01:47:00
The result should look like this.
datetime
0 1900-01-01 09:00:00
1 1900-01-01 15:00:00
2 1900-01-01 16:00:00
3 1900-01-01 01:00:00
4 1900-01-01 01:00:00
How can I code this out? Thanks!
Use pd.to_timedelta:
pd.to_datetime('1900-01-01') + pd.to_timedelta(td.datetime.dt.hour, unit='H')
0 1900-01-01 09:00:00
1 1900-01-01 15:00:00
2 1900-01-01 16:00:00
3 1900-01-01 01:00:00
4 1900-01-01 01:00:00
Name: datetime, dtype: datetime64[ns]
Alternatively, with the stdlib datetime module:
from datetime import datetime, timedelta
td['datetime'].apply(lambda x: datetime(1900, 1, 1) + timedelta(hours=x.hour))
0 1900-01-01 09:00:00
1 1900-01-01 15:00:00
2 1900-01-01 16:00:00
3 1900-01-01 01:00:00
4 1900-01-01 01:00:00
Name: datetime, dtype: datetime64[ns]
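For completeness, both approaches can be run end to end as a self-contained sketch (the result column names `vec` and `rowwise` are mine, added for comparison):

```python
import pandas as pd
from datetime import datetime, timedelta

td = pd.DataFrame({'datetime': pd.to_datetime(
    ['2015-01-01 09:03:00', '2015-01-11 15:47:00'])})

# Vectorized: a constant anchor date plus the hour as a timedelta
td['vec'] = pd.to_datetime('1900-01-01') + pd.to_timedelta(td['datetime'].dt.hour, unit='h')

# Row-wise equivalent using the stdlib datetime module
td['rowwise'] = td['datetime'].apply(
    lambda x: datetime(1900, 1, 1) + timedelta(hours=x.hour))
```

The vectorized version is preferable on large frames; the `apply` version is easier to extend if more components need custom handling.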
I need a fast numeric representation of the time of day.
Let's start with some basic data:
> z1 = structure(
+ c(1:5),.Dim = c(5L, 1L), .Dimnames = list(NULL, c("Hour")),
+ index = as.POSIXct(paste("2018-06-06",paste(1:5,":00:00",sep = ""),sep = " "), tz = 'America/Chicago'),
+ .indexCLASS = c("POSIXct", "POSIXt"), .indexTZ = 'America/Chicago',
+ tclass = c("POSIXct", "POSIXt"), tzone = 'America/Chicago', class = c("xts", "zoo"))
> z1
Hour
2018-06-06 01:00:00 1
2018-06-06 02:00:00 2
2018-06-06 03:00:00 3
2018-06-06 04:00:00 4
2018-06-06 05:00:00 5
> index(z1[1])
[1] "2018-06-06 01:00:00 CDT"
So I have 5 hourly times in Chicago time (CDT).
I need to be able to look at a time, like 1 AM, and get a numeric time like 1/24 = 0.0416666667.
The xts index is stored as seconds since 1970-01-01, so the math should be simple using the modulo operator %%.
Let's try:
> cbind(z1,(unclass(index(z1)) %% (60*60*24))/(60*60*24),(unclass(index(z1)) %% (60*60*24))/(60*60*24)*24)
Hour ..2 ..3
2018-06-06 01:00:00 1 0.2500000 6
2018-06-06 02:00:00 2 0.2916667 7
2018-06-06 03:00:00 3 0.3333333 8
2018-06-06 04:00:00 4 0.3750000 9
2018-06-06 05:00:00 5 0.4166667 10
I unclass the index (to get the same value Rcpp will see), then take the modulo by the number of seconds in a day to get the fraction of the day left over, and multiply by 24 for the hours left over.
The problem is obviously the timezone. The resulting time of day is in UTC, but I need it in Chicago time, just like the xts object. If I could simply get the numeric offset for the timezone it would be easy, but it seems getting the offset is not so simple.
So, I need a function in rcpp that if given an XTS time would give me the time of day in the correct timezone. It can be in days, hours or anything else, as long as it is numeric and fast.
The intended use of this TimeOfDay function is, say, running code only during typical workday hours of 9 AM to 5 PM.
if(TimeOfDay(Index(1)) > 9.0 && TimeOfDay(Index(1)) < 17.0)
{
//Code to run.
}
Here is a very simple solution using the hour() convenience function in data.table:
R> z2 <- data.table(pt=index(z1))
R> z2[, myhour:=hour(pt)][]
pt myhour
1: 2018-06-06 01:00:00 1
2: 2018-06-06 02:00:00 2
3: 2018-06-06 03:00:00 3
4: 2018-06-06 04:00:00 4
5: 2018-06-06 05:00:00 5
R>
We just pass the POSIXct object in, and derive hour from it. You'd be hard-pressed to beat it in home-grown C/C++ code -- and this solution already exists.
I have a pandas dataframe like this,
Timestamp Meter1 Meter2
0 234 NaN
1 235 NaN
2 236 NaN
0 NaN 100
1 NaN 101
2 NaN 102
and I'm having trouble merging the rows, based on the Timestamp index, into something like this:
Timestamp Meter1 Meter2
0 234 100
1 235 101
2 236 102
Option 0
df.max(level=0)
Meter1 Meter2
Timestamp
0 234.0 100.0
1 235.0 101.0
2 236.0 102.0
Option 1
df.sum(level=0)
Meter1 Meter2
Timestamp
0 234.0 100.0
1 235.0 101.0
2 236.0 102.0
Option 2
Disturbing Answer
df.stack().unstack()
Meter1 Meter2
Timestamp
0 234.0 100.0
1 235.0 101.0
2 236.0 102.0
As brought up by @jezrael and linked to in a GitHub issue.
However, as I've understood groupby.first and groupby.last, they return the first (or last) valid value in the group per column. In other words, I believe this is working as intended.
Option 3
df.groupby(level=0).first()
Meter1 Meter2
Timestamp
0 234.0 100.0
1 235.0 101.0
2 236.0 102.0
Option 4
df.groupby(level=0).last()
Meter1 Meter2
Timestamp
0 234.0 100.0
1 235.0 101.0
2 236.0 102.0
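A minimal reconstruction of the frame from the question shows that first (and likewise last) skips the NaNs, which is why every option above gives the same answer here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'Meter1': [234, 235, 236, np.nan, np.nan, np.nan],
     'Meter2': [np.nan, np.nan, np.nan, 100, 101, 102]},
    index=pd.Index([0, 1, 2, 0, 1, 2], name='Timestamp'))

# first() takes the first non-NaN value per group and column,
# so the two half-empty blocks collapse into one row per Timestamp
merged = df.groupby(level=0).first()
```

With exactly one valid value per group and column, max, sum, first, and last are interchangeable; they only differ when groups genuinely overlap.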
Use groupby:
df.groupby(level=0).max()
OR
df.groupby('Timestamp').max()
Output
Meter1 Meter2
Timestamp
0 234.0 100.0
1 235.0 101.0
2 236.0 102.0
Use groupby and aggregate sum:
df = df.groupby(level=0).sum()
print (df)
Meter1 Meter2
Timestamp
0 234.0 100.0
1 235.0 101.0
2 236.0 102.0
And if you want only ints:
df = df.groupby(level=0).sum().astype(int)
print (df)
Meter1 Meter2
Timestamp
0 234 100
1 235 101
2 236 102
But maybe the problem was that you forgot axis=1 in concat:
print (df1)
Meter1
Timestamp
0 234
1 235
2 236
print (df2)
Meter2
Timestamp
0 100
1 101
2 102
print (pd.concat([df1, df2]))
Meter1 Meter2
Timestamp
0 234.0 NaN
1 235.0 NaN
2 236.0 NaN
0 NaN 100.0
1 NaN 101.0
2 NaN 102.0
print (pd.concat([df1, df2], axis=1))
Meter1 Meter2
Timestamp
0 234 100
1 235 101
2 236 102
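The concat fix can be verified end to end; with axis=1 the two frames align column-wise on the shared index and no NaNs appear (frame contents follow the example above):

```python
import pandas as pd

idx = pd.Index([0, 1, 2], name='Timestamp')
df1 = pd.DataFrame({'Meter1': [234, 235, 236]}, index=idx)
df2 = pd.DataFrame({'Meter2': [100, 101, 102]}, index=idx)

# axis=1 aligns on the index instead of stacking rows,
# so each Timestamp gets both meter readings in one row
wide = pd.concat([df1, df2], axis=1)
```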
I want to split a time series into two sets: train and test.
Here's my code:
train = data.iloc[:1100]
test = data.iloc[1101:]
Here's what the time series looks like:
And here's the train series: there's no time, only the date in the index.
Finally, the test:
How can I change the indexes to the same form?
Consider the simplified series s
s = pd.Series(1, pd.date_range('2010-08-16', periods=5, freq='12H'))
s
2010-08-16 00:00:00 1
2010-08-16 12:00:00 1
2010-08-17 00:00:00 1
2010-08-17 12:00:00 1
2010-08-18 00:00:00 1
Freq: 12H, dtype: int64
But when I subset s leaving only Timestamps that need no time element, pandas does me the "favor" of not displaying a bunch of zeros for no reason.
s.iloc[::2]
2010-08-16 1
2010-08-17 1
2010-08-18 1
Freq: 24H, dtype: int64
But rest assured, the values are the same:
s.iloc[::2].index[0] == s.index[0]
True
And they have the same dtype and precision:
s.iloc[::2].index.values.dtype
dtype('<M8[ns]')
And
s.index.values.dtype
dtype('<M8[ns]')
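The point can be checked directly; equality holds even though the repr of the subset drops the zero times:

```python
import pandas as pd

s = pd.Series(1, pd.date_range('2010-08-16', periods=5, freq='12h'))
subset = s.iloc[::2]  # only the midnight timestamps remain

# The display omits '00:00:00', but the underlying values are unchanged
same_first = subset.index[0] == s.index[0]
same_dtype = subset.index.dtype == s.index.dtype
```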
I think if the same dataframe is split by iloc, it is only that the 00:00:00 times are not shown. So adding times is not necessary, because both indexes have the same DatetimeIndex dtype.
mux = pd.MultiIndex.from_product([['GOOG'],
pd.DatetimeIndex(['2010-08-16 00:00:00',
'2010-08-17 00:00:00',
'2010-08-18 00:00:00',
'2010-08-19 00:00:00',
'2010-08-20 15:00:00'])], names=('Ticker','Date'))
data = pd.Series(range(5), mux)
print (data)
Ticker Date
GOOG 2010-08-16 00:00:00 0
2010-08-17 00:00:00 1
2010-08-18 00:00:00 2
2010-08-19 00:00:00 3
2010-08-20 15:00:00 4
#splitting
train = data.iloc[:2]
test = data.iloc[2:]
print (train)
Ticker Date
GOOG 2010-08-16 0
2010-08-17 1
dtype: int32
It seems there are still some times, as mentioned by piRSquared:
print (test)
Ticker Date
GOOG 2010-08-18 00:00:00 2
2010-08-19 00:00:00 3
2010-08-20 15:00:00 4
dtype: int32
#check if same dtypes
print (train.index.get_level_values('Date').dtype)
datetime64[ns]
print (test.index.get_level_values('Date').dtype)
datetime64[ns]
#if you want to see only rows with non-midnight times in the test dataframe
m = test.index.get_level_values('Date').time != pd.to_datetime('2015-01-01').time()
only_times = test[m]
print (only_times)
Ticker Date
GOOG 2010-08-20 15:00:00 4
dtype: int32
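The filtering step at the end can be sketched stand-alone (using a flat DatetimeIndex instead of the MultiIndex, with values from the example):

```python
import pandas as pd

idx = pd.DatetimeIndex(['2010-08-18 00:00:00',
                        '2010-08-19 00:00:00',
                        '2010-08-20 15:00:00'])
test = pd.Series([2, 3, 4], index=idx)

# Any date works here; only its time() component, 00:00:00, is used
midnight = pd.to_datetime('2015-01-01').time()

# Keep only rows whose time component is not midnight
only_times = test[test.index.time != midnight]
```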
I am trying to get, in a new column, the difference in months between the min date and max date on which a product is sold. But I am getting an unusual return value when applying the function in a groupby.
Any help is much appreciated.
So my steps are :
data :
pch_date day product qty unit_price total_price year_month
421 2013-01-07 tuesday p3 13 4.58 59.54 1
141 2015-09-13 monday p8 3 3.77 11.31 9
249 2015-02-02 monday p5 3 1.80 5.40 2
826 2015-10-09 tuesday p5 6 1.80 10.80 10
427 2014-04-18 friday p7 6 4.21 25.26 4
function definition :
def diff_date(x):
max_date = x.max()
min_date = x.min()
diff_month = (max_date.year - min_date.year)*12 + max_date.month +1
return diff_month
When testing on the full column:
print diff_date(prod_df['pch_date'])
49, which is correct
But here's the problem:
print prod_df[['product','pch_date']].groupby(['product']).agg({'pch_date': diff_date}).reset_index()[:5]
The results come back coerced to datetimes:
product pch_date
0 p1 1970-01-01 00:00:00.000000049
1 p10 1970-01-01 00:00:00.000000048
2 p11 1970-01-01 00:00:00.000000045
3 p12 1970-01-01 00:00:00.000000049
4 p13 1970-01-01 00:00:00.000000045
How to get the difference in integer ?
You can use groupby.apply instead, which returns integers and not datetime objects:
df.groupby(['product'])['pch_date'].apply(diff_date).reset_index()
As a workaround to keep the integer values from being converted to datetimes, you can change the last line of your function to return str(diff_month) and continue using groupby.agg as shown:
df.groupby(['product'])['pch_date'].agg({'pch_date': diff_date}).reset_index()
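A minimal sketch of the apply route, with made-up sample rows and the question's diff_date formula, shows the result staying integer:

```python
import pandas as pd

# Made-up sample purchases for two products
df = pd.DataFrame({
    'product': ['p1', 'p1', 'p2'],
    'pch_date': pd.to_datetime(['2013-01-07', '2015-09-13', '2015-02-02'])})

def diff_date(x):
    # Month span formula copied from the question
    return (x.max().year - x.min().year) * 12 + x.max().month + 1

# apply keeps the integer return values; agg on a datetime column
# would coerce them back to nanosecond timestamps
out = df.groupby('product')['pch_date'].apply(diff_date).reset_index()
```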
I have a multi-index dataframe defined, e.g. as:
import pandas as pd
import numpy as np
dates = pd.date_range('20130101',periods=3,freq='5s')
dates = dates.append(dates)
locations = list('AAABBB')
gascode = ['no2','o3','so2']*2
tup = pd.MultiIndex.from_tuples(list(zip(locations, gascode, dates)), names=['Location', 'gas', 'Date'])
data = pd.DataFrame(data=range(6),index=tup,columns=['val1'])
>>> data
Location gas Date val1
A no2 2013-01-01 00:00:00 0
o3 2013-01-01 00:00:05 1
so2 2013-01-01 00:00:10 2
B no2 2013-01-01 00:00:00 3
o3 2013-01-01 00:00:05 4
so2 2013-01-01 00:00:10 5
Keeping data only from location 'A':
data = data.xs(key='A',level='Location')
Now, I want to create new columns according to the 'gas' index to yield:
Date no2 o3 so2
2013-01-01 00:00:00 0 nan nan
2013-01-01 00:00:05 nan 1 nan
2013-01-01 00:00:10 nan nan 2
I tried pivoting on the 'Date' index to move 'gas' into the columns, but this failed:
data = data.pivot(index=data.index.get_level_values(level='Date'),
columns=data.index.get_level_values(level='gas'))
I am at a loss as to how to achieve this; can anyone recommend an alternative?
You can unstack the result:
In [11]: data.xs(key='A', level='Location').unstack(0)
Out[11]:
val1
gas no2 o3 so2
Date
2013-01-01 00:00:00 0 NaN NaN
2013-01-01 00:00:05 NaN 1 NaN
2013-01-01 00:00:10 NaN NaN 2
[3 rows x 3 columns]
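The whole pipeline, reconstructed from the question's setup (with zip wrapped in list for Python 3), runs as:

```python
import pandas as pd

dates = pd.date_range('20130101', periods=3, freq='5s')
dates = dates.append(dates)
tup = pd.MultiIndex.from_tuples(
    list(zip(list('AAABBB'), ['no2', 'o3', 'so2'] * 2, dates)),
    names=['Location', 'gas', 'Date'])
data = pd.DataFrame({'val1': range(6)}, index=tup)

# Drop the 'Location' level via the cross-section, then rotate the
# remaining 'gas' level (level 0 after xs) into the columns
wide = data.xs(key='A', level='Location').unstack(0)
```

Since each gas was measured at a different timestamp, the off-diagonal cells come out NaN, matching the desired output.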