I have a multi-index dataframe defined, e.g. as:
import pandas as pd
import numpy as np
dates = pd.date_range('20130101',periods=3,freq='5s')
dates = dates.append(dates)
locations = list('AAABBB')
gascode = ['no2','o3','so2']*2
tup = pd.MultiIndex.from_tuples(list(zip(locations, gascode, dates)), names=['Location', 'gas', 'Date'])
data = pd.DataFrame(data=range(6),index=tup,columns=['val1'])
>>> data
Location gas Date val1
A no2 2013-01-01 00:00:00 0
o3 2013-01-01 00:00:05 1
so2 2013-01-01 00:00:10 2
B no2 2013-01-01 00:00:00 3
o3 2013-01-01 00:00:05 4
so2 2013-01-01 00:00:10 5
Keeping data only from location 'A':
data = data.xs(key='A',level='Location')
Now, I want to create new columns according to the 'gas' index to yield:
Date no2 o3 so2
2013-01-01 00:00:00 0 nan nan
2013-01-01 00:00:05 nan 1 nan
2013-01-01 00:00:10 nan nan 2
I tried pivoting on the 'Date' index to move 'gas' to columns, but this failed:
data = data.pivot(index=data.index.get_level_values(level='Date'),
                  columns=data.index.get_level_values(level='gas'))
I am at a loss of how to achieve this; can anyone recommend an alternative?
You can unstack the result:
In [11]: data.xs(key='A', level='Location').unstack(0)
Out[11]:
val1
gas no2 o3 so2
Date
2013-01-01 00:00:00 0 NaN NaN
2013-01-01 00:00:05 NaN 1 NaN
2013-01-01 00:00:10 NaN NaN 2
[3 rows x 3 columns]
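If you want exactly the layout shown in the question (plain no2/o3/so2 columns without the extra val1 level), you can drop that column level afterwards. A sketch, reproducing the question's setup:

```python
import pandas as pd

dates = pd.date_range('20130101', periods=3, freq='5s')
dates = dates.append(dates)
tup = pd.MultiIndex.from_tuples(
    list(zip(list('AAABBB'), ['no2', 'o3', 'so2'] * 2, dates)),
    names=['Location', 'gas', 'Date'])
data = pd.DataFrame(data=range(6), index=tup, columns=['val1'])

# Select location 'A', pivot 'gas' to columns, then drop the 'val1' level
wide = data.xs(key='A', level='Location').unstack('gas')
wide.columns = wide.columns.droplevel(0)  # remove the 'val1' column level
```

Unstacking by level name ('gas') rather than position (0) also makes the intent clearer.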
I have the following dataset from a crossover design study with participant_id, treatment_arm, and date_of_treatment as follows:
participant_id  treatment_arm  date_of_treatment
1               A              Jan 1 2022
1               B              Jan 2 2022
1               C              Jan 3 2022
2               C              Jan 4 2022
2               B              Jan 5 2022
2               A              Jan 6 2022
So for participant_id 1, based on the order of the date_of_treatment, the sequence would be ABC. For participant_id 2, it would be CBA.
Based on the above, I want to create column seq as follows:
participant_id  treatment_arm  date_of_treatment  seq
1               A              Jan 1 2022         ABC
1               B              Jan 2 2022         ABC
1               C              Jan 3 2022         ABC
2               C              Jan 4 2022         CBA
2               B              Jan 5 2022         CBA
2               A              Jan 6 2022         CBA
How do I go about creating the column using the 3 variables participant_id, treatment_arm, and date_of_treatment in a DATA step?
You could use a double DoW loop:
data want;
    length seq $3;
    do until (last.participant_id);
        set have;
        by participant_id;
        seq = cats(seq, treatment_arm);
    end;
    do until (last.participant_id);
        set have;
        by participant_id;
        output;
    end;
run;
Remember to increase the length of seq if there are more than 3 treatments per participant.
participant_id treatment_arm date_of_treatment seq
1 A 01JAN2022 ABC
1 B 02JAN2022 ABC
1 C 03JAN2022 ABC
2 C 04JAN2022 CBA
2 B 05JAN2022 CBA
2 A 06JAN2022 CBA
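For readers more comfortable in pandas, the same seq column can be built with a grouped transform. A sketch, using the question's column names on a hypothetical reconstruction of the data:

```python
import pandas as pd

have = pd.DataFrame({
    'participant_id': [1, 1, 1, 2, 2, 2],
    'treatment_arm': ['A', 'B', 'C', 'C', 'B', 'A'],
    'date_of_treatment': pd.to_datetime(
        ['2022-01-01', '2022-01-02', '2022-01-03',
         '2022-01-04', '2022-01-05', '2022-01-06'])})

# Sort by date within each participant, then concatenate the arms per group
have = have.sort_values(['participant_id', 'date_of_treatment'])
have['seq'] = (have.groupby('participant_id')['treatment_arm']
                   .transform(lambda s: ''.join(s)))
```

transform broadcasts the joined string back to every row of the group, mirroring what the second DoW loop does in SAS.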
I need a fast numeric representation of the time of day.
Let's start with some basic data:
> z1 = structure(
+ c(1:5),.Dim = c(5L, 1L), .Dimnames = list(NULL, c("Hour")),
+ index = as.POSIXct(paste("2018-06-06",paste(1:5,":00:00",sep = ""),sep = " "), tz = 'America/Chicago'),
+ .indexCLASS = c("POSIXct", "POSIXt"), .indexTZ = 'America/Chicago',
+ tclass = c("POSIXct", "POSIXt"), tzone = 'America/Chicago', class = c("xts", "zoo"))
> z1
Hour
2018-06-06 01:00:00 1
2018-06-06 02:00:00 2
2018-06-06 03:00:00 3
2018-06-06 04:00:00 4
2018-06-06 05:00:00 5
> index(z1[1])
[1] "2018-06-06 01:00:00 CDT"
So I have 5 hourly times in Chicago time (CDT).
I need to be able to look at a time, like 1 AM, and get a numeric time like 1/24 = 0.0416666667.
The xts index is in datetime format, i.e. seconds from 1970-01-01, so the math should be simple using the modulo operator %%.
Let's try:
> cbind(z1,(unclass(index(z1)) %% (60*60*24))/(60*60*24),(unclass(index(z1)) %% (60*60*24))/(60*60*24)*24)
Hour ..2 ..3
2018-06-06 01:00:00 1 0.2500000 6
2018-06-06 02:00:00 2 0.2916667 7
2018-06-06 03:00:00 3 0.3333333 8
2018-06-06 04:00:00 4 0.3750000 9
2018-06-06 05:00:00 5 0.4166667 10
I unclass the index (to get the same value Rcpp will see), then take the modulo with the seconds in a day to get the fraction of the day left over, and from that the hours.
The problem is obviously the timezone. The resulting time of day is in UTC, but I need it in Chicago time, just like the xts object. If I could simply get the numeric offset for the timezone it would be easy, but it seems getting the offset is not so simple.
So I need a function in Rcpp that, given an xts time, gives me the time of day in the correct timezone. It can be in days, hours or anything else, as long as it is numeric and fast.
The intended use of this TimeOfDay function is, say for running code during typical workday hours of 9 AM to 5 PM.
if(TimeOfDay(Index(1)) > 9.0 && TimeOfDay(Index(1)) < 17.0)
{
//Code to run.
}
Here is a very simple solution using the hour() convenience function in data.table:
R> library(data.table)
R> z2 <- data.table(pt=index(z1))
R> z2[, myhour:=hour(pt)][]
pt myhour
1: 2018-06-06 01:00:00 1
2: 2018-06-06 02:00:00 2
3: 2018-06-06 03:00:00 3
4: 2018-06-06 04:00:00 4
5: 2018-06-06 05:00:00 5
R>
We just pass the POSIXct object in and derive the hour from it. You'd be hard-pressed to beat it with home-grown C/C++ code -- and this solution already exists.
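The underlying fix for the modulo approach is to add the zone's UTC offset to the epoch count before reducing mod 86400. The same arithmetic ports directly to Rcpp once you have the offset; here is a sketch of the idea in Python using the standard zoneinfo module:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def time_of_day_hours(epoch_seconds, tz='America/Chicago'):
    """Numeric hour-of-day in the given zone, from a UTC epoch count."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    # UTC offset in seconds at that instant (handles DST correctly)
    offset = dt.astimezone(ZoneInfo(tz)).utcoffset().total_seconds()
    return ((epoch_seconds + offset) % 86400) / 3600

# 2018-06-06 01:00:00 CDT is 2018-06-06 06:00:00 UTC
epoch = int(datetime(2018, 6, 6, 6, 0, tzinfo=timezone.utc).timestamp())
```

With this, the workday check from the question becomes `9.0 < time_of_day_hours(epoch) < 17.0`.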
I want to split a time series into two set: train and test.
Here's my code:
train = data.iloc[:1100]
test = data.iloc[1101:]
Here's what the time series looks like:
And here's the train series: there's no time, only the date, in the index.
Finally, the test:
How do I change the test index to the same form?
Consider the simplified series s
s = pd.Series(1, pd.date_range('2010-08-16', periods=5, freq='12H'))
s
2010-08-16 00:00:00 1
2010-08-16 12:00:00 1
2010-08-17 00:00:00 1
2010-08-17 12:00:00 1
2010-08-18 00:00:00 1
Freq: 12H, dtype: int64
But when I subset s leaving only Timestamps that need no time element, pandas does me the "favor" of not displaying a bunch of zeros for no reason.
s.iloc[::2]
2010-08-16 1
2010-08-17 1
2010-08-18 1
Freq: 24H, dtype: int64
But rest assured, the values are the same:
s.iloc[::2].index[0] == s.index[0]
True
And have the same dtype and precision
s.iloc[::2].index.values.dtype
dtype('<M8[ns]')
And
s.index.values.dtype
dtype('<M8[ns]')
If the same DataFrame is split with iloc, pandas only hides the 00:00:00 times in the display. Adding the times back is not necessary, because both indexes have the same DatetimeIndex dtype.
mux = pd.MultiIndex.from_product([['GOOG'],
pd.DatetimeIndex(['2010-08-16 00:00:00',
'2010-08-17 00:00:00',
'2010-08-18 00:00:00',
'2010-08-19 00:00:00',
'2010-08-20 15:00:00'])], names=('Ticker','Date'))
data = pd.Series(range(5), mux)
print (data)
Ticker Date
GOOG 2010-08-16 00:00:00 0
2010-08-17 00:00:00 1
2010-08-18 00:00:00 2
2010-08-19 00:00:00 3
2010-08-20 15:00:00 4
#splitting
train = data.iloc[:2]
test = data.iloc[2:]
print (train)
Ticker Date
GOOG 2010-08-16 0
2010-08-17 1
dtype: int32
It seems there are some rows with times, as piRSquared mentioned:
print (test)
Ticker Date
GOOG 2010-08-18 00:00:00 2
2010-08-19 00:00:00 3
2010-08-20 15:00:00 4
dtype: int32
#check if same dtypes
print (train.index.get_level_values('Date').dtype)
datetime64[ns]
print (test.index.get_level_values('Date').dtype)
datetime64[ns]
#if you want to see only the rows with a time component in the test dataframe
m = test.index.get_level_values('Date').time != pd.to_datetime('2015-01-01').time()
only_times = test[m]
print (only_times)
Ticker Date
GOOG 2010-08-20 15:00:00 4
dtype: int32
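An equivalent way to find the rows whose timestamps have a time component is to compare against the normalized (midnight) values, avoiding the magic comparison date. A sketch with the same data:

```python
import pandas as pd

mux = pd.MultiIndex.from_product(
    [['GOOG'],
     pd.DatetimeIndex(['2010-08-16', '2010-08-17', '2010-08-18',
                       '2010-08-19', '2010-08-20 15:00:00'])],
    names=('Ticker', 'Date'))
data = pd.Series(range(5), mux)

dates = data.index.get_level_values('Date')
m = dates != dates.normalize()  # True where the time part is not midnight
only_times = data[m]
```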
I want to create a new datetime column by setting the year, month, day, minutes, and seconds to constants, retaining only the hour.
import pandas as pd
td = pd.DataFrame(['2015-01-01 09:03:00', '2015-01-11 15:47:00',
'2015-01-11 16:47:00', '2015-01-11 01:47:00', '2016-01-11 01:47:00'], columns=['datetime'])
td['datetime'] = pd.to_datetime(td['datetime'])
datetime
0 2015-01-01 09:03:00
1 2015-01-11 15:47:00
2 2015-01-11 16:47:00
3 2015-01-11 01:47:00
4 2016-01-11 01:47:00
The result should look like this.
datetime
0 1900-01-01 09:00:00
1 1900-01-01 15:00:00
2 1900-01-01 16:00:00
3 1900-01-01 01:00:00
4 1900-01-01 01:00:00
How can I code this out? Thanks!
Use pd.to_timedelta:
pd.to_datetime('1900-01-01') + pd.to_timedelta(td.datetime.dt.hour, unit='H')
0 1900-01-01 09:00:00
1 1900-01-01 15:00:00
2 1900-01-01 16:00:00
3 1900-01-01 01:00:00
4 1900-01-01 01:00:00
Name: datetime, dtype: datetime64[ns]
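To store the result back in the frame, just assign it to the column. A sketch with the sample data from the question:

```python
import pandas as pd

td = pd.DataFrame(['2015-01-01 09:03:00', '2015-01-11 15:47:00',
                   '2015-01-11 16:47:00', '2015-01-11 01:47:00',
                   '2016-01-11 01:47:00'], columns=['datetime'])
td['datetime'] = pd.to_datetime(td['datetime'])

# Constant 1900-01-01 date plus only the original hour
td['datetime'] = (pd.to_datetime('1900-01-01')
                  + pd.to_timedelta(td['datetime'].dt.hour, unit='h'))
```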
Or with a plain apply:
from datetime import datetime, timedelta
td['datetime'].apply(lambda x: datetime(1900, 1, 1) + timedelta(hours=x.hour))
0 1900-01-01 09:00:00
1 1900-01-01 15:00:00
2 1900-01-01 16:00:00
3 1900-01-01 01:00:00
4 1900-01-01 01:00:00
Name: datetime, dtype: datetime64[ns]
I am trying to get, in a new column, the difference in months between the min date and the max date a product is sold. But I am getting an unusual result when applying the function in a groupby.
Any help is much appreciated.
So my steps are:
data :
pch_date day product qty unit_price total_price year_month
421 2013-01-07 tuesday p3 13 4.58 59.54 1
141 2015-09-13 monday p8 3 3.77 11.31 9
249 2015-02-02 monday p5 3 1.80 5.40 2
826 2015-10-09 tuesday p5 6 1.80 10.80 10
427 2014-04-18 friday p7 6 4.21 25.26 4
function definition :
def diff_date(x):
    max_date = x.max()
    min_date = x.min()
    diff_month = (max_date.year - min_date.year)*12 + max_date.month + 1
    return diff_month
When testing the function directly:
print diff_date(prod_df['pch_date'])
49, which is correct.
But the problem:
print prod_df[['product','pch_date']].groupby(['product']).agg({'pch_date': diff_date}).reset_index()[:5]
The results come back as timestamps instead of integers:
product pch_date
0 p1 1970-01-01 00:00:00.000000049
1 p10 1970-01-01 00:00:00.000000048
2 p11 1970-01-01 00:00:00.000000045
3 p12 1970-01-01 00:00:00.000000049
4 p13 1970-01-01 00:00:00.000000045
How to get the difference in integer ?
You can use GroupBy.apply instead, which returns integers rather than datetime objects:
df.groupby(['product'])['pch_date'].apply(diff_date).reset_index()
As a workaround to keep the integer values from being converted to their DatetimeIndex equivalents, you can change the last line of your function to return str(diff_month) and continue using GroupBy.agg as shown:
df.groupby(['product'])['pch_date'].agg({'pch_date': diff_date}).reset_index()
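A minimal check of the apply-based route, on hypothetical sample data, with diff_date as defined in the question:

```python
import pandas as pd

def diff_date(x):
    max_date, min_date = x.max(), x.min()
    return (max_date.year - min_date.year) * 12 + max_date.month + 1

df = pd.DataFrame({
    'product': ['p1', 'p1', 'p2', 'p2'],
    'pch_date': pd.to_datetime(['2013-01-07', '2015-09-13',
                                '2015-02-02', '2015-10-09'])})

out = df.groupby('product')['pch_date'].apply(diff_date).reset_index()
# out['pch_date'] holds plain integers, not 1970-based timestamps
```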