Pandas: How to stack time series into a dataframe with time columns? - python-2.7

I have a pandas timeseries with minute tick data:
2011-01-01 09:30:00 -0.358525
2011-01-01 09:31:00 -0.185970
2011-01-01 09:32:00 -0.357479
2011-01-01 09:33:00 -1.486157
2011-01-01 09:34:00 -1.101909
2011-01-01 09:35:00 -1.957380
2011-01-02 09:30:00 -0.489747
2011-01-02 09:31:00 -0.341163
2011-01-02 09:32:00 1.588071
2011-01-02 09:33:00 -0.146610
2011-01-02 09:34:00 -0.185834
2011-01-02 09:35:00 -0.872918
2011-01-03 09:30:00 0.682824
2011-01-03 09:31:00 -0.344875
2011-01-03 09:32:00 -0.641186
2011-01-03 09:33:00 -0.501414
2011-01-03 09:34:00 0.877347
2011-01-03 09:35:00 2.183530
What is the best way to stack it into a dataframe such as:
09:30:00 09:31:00 09:32:00 09:33:00 09:34:00 09:35:00
2011-01-01 -0.358525 -0.185970 -0.357479 -1.486157 -1.101909 -1.957380
2011-01-02 -0.489747 -0.341163 1.588071 -0.146610 -0.185834 -0.872918
2011-01-03 0.682824 -0.344875 -0.641186 -0.501414 0.877347 2.183530

I'd make sure that this is actually what you want to do, as the resulting dataframe loses a lot of the nice time-series functionality that pandas has.
But here is some code that accomplishes it. First, a time column is added and the index is set to just the date part of the DatetimeIndex. The pivot command then reshapes the data, setting the times as columns.
In [74]: df.head()
Out[74]:
value
date
2011-01-01 09:30:00 -0.358525
2011-01-01 09:31:00 -0.185970
2011-01-01 09:32:00 -0.357479
2011-01-01 09:33:00 -1.486157
2011-01-01 09:34:00 -1.101909
In [75]: df['time'] = df.index.time
In [76]: df.index = df.index.date
In [77]: df2 = df.pivot(index=df.index, columns='time')
The resulting dataframe will have a MultiIndex for the columns (the top level being just the name of your values variable). If you want to get back to a flat list of columns, the code below flattens the column list.
In [78]: df2.columns = [c for (_, c) in df2.columns]
In [79]: df2
Out[79]:
09:30:00 09:31:00 09:32:00 09:33:00 09:34:00 09:35:00
2011-01-01 -0.358525 -0.185970 -0.357479 -1.486157 -1.101909 -1.957380
2011-01-02 -0.489747 -0.341163 1.588071 -0.146610 -0.185834 -0.872918
2011-01-03 0.682824 -0.344875 -0.641186 -0.501414 0.877347 2.183530
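For reference, here is a rough, self-contained sketch of the same approach; the random sample data and the name ts are just for illustration and are not from the original post:
import numpy as np
import pandas as pd

# build a small sample series with minute ticks on two days (illustrative data only)
idx = pd.date_range('2011-01-01 09:30', periods=6, freq='T').append(
    pd.date_range('2011-01-02 09:30', periods=6, freq='T'))
ts = pd.Series(np.random.randn(len(idx)), index=idx, name='value')

df = ts.to_frame()
df['time'] = df.index.time   # time of day -> future column labels
df['date'] = df.index.date   # calendar date -> future row labels
wide = df.pivot(index='date', columns='time', values='value')
print(wide)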

Related

YTD for each asset on a table (Data filtering - Dax)

Imagine a table with 3 columns:
Date        AssetType  Value
2022-01-01  A          1
2022-01-02  A          1.02
2022-01-03  A          1.05
2022-01-04  A          1.09
2022-01-05  A          1.06
2022-01-03  B          1
2022-01-04  B          1.05
2022-01-05  B          1.07
2022-01-06  B          1.09
2022-01-07  B          1.08
The first date of 2022 for each asset is different:
Asset A - 2022-01-01
Asset B - 2022-01-03
I want to create a new column or measure that returns the first date of 2022 for each asset.
So far I've tried to use = CALCULATE(STARTOFYEAR(table[date])), FILTER(Table, Table[AssetType] = [Asset type]
Note: [Asset Type] is a measure that gives me the type of asset.
But it returns the same date for both assets (2022-01-01).
Does anyone know how to get this done?
Date        AssetType  Value  FirstDate
2022-01-01  A          1      2022-01-01
2022-01-02  A          1.02   2022-01-01
2022-01-03  A          1.05   2022-01-01
2022-01-04  A          1.09   2022-01-01
2022-01-05  A          1.06   2022-01-01
2022-01-03  B          1      2022-01-03
2022-01-04  B          1.05   2022-01-03
2022-01-05  B          1.07   2022-01-03
2022-01-06  B          1.09   2022-01-03
2022-01-07  B          1.08   2022-01-03
Thx
OK. This time, create a calculated column and paste this code:
FirstDate =
CALCULATE (
    MIN ( YourTable[Date] ),
    ALLEXCEPT ( YourTable, YourTable[AssetType] )
)
The result:

How to remove the DATE TIME rows that have NaN values

How do I remove the DATE and TIME rows with NaN values in the 'oo' column?
This is my csv:
DATE,TIME,OPEN,HIGH,LOW,CLOSE,VOLUME
02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505
02/03/1997,09:05:00,3047.00,3048.00,3046.00,3047.00,162
02/03/1997,09:06:00,3047.50,3048.00,3047.00,3047.50,98
02/03/1997,09:07:00,3047.50,3047.50,3047.00,3047.50,228
02/03/1997,09:08:00,3048.00,3048.00,3047.50,3048.00,136
02/03/1997,09:09:00,3048.00,3048.00,3046.50,3046.50,174
02/03/1997,09:10:00,3046.50,3046.50,3045.00,3045.00,134
02/03/1997,09:11:00,3045.50,3046.00,3044.00,3045.00,43
02/03/1997,09:12:00,3045.00,3045.50,3045.00,3045.00,214
02/03/1997,09:13:00,3045.50,3045.50,3045.50,3045.50,8
02/03/1997,09:14:00,3045.50,3046.00,3044.50,3044.50,152
02/03/1997,09:15:00,3044.00,3044.00,3042.50,3042.50,126
02/03/1997,09:16:00,3043.50,3043.50,3043.00,3043.00,128
02/03/1997,09:17:00,3042.50,3043.50,3042.50,3043.50,23
02/03/1997,09:18:00,3043.50,3044.50,3043.00,3044.00,51
02/03/1997,09:19:00,3044.50,3044.50,3043.00,3043.00,18
02/03/1997,09:20:00,3043.00,3045.00,3043.00,3045.00,23
02/03/1997,09:21:00,3045.00,3045.00,3044.50,3045.00,51
02/03/1997,09:22:00,3045.00,3045.00,3045.00,3045.00,47
02/03/1997,09:23:00,3045.50,3046.00,3045.00,3045.00,77
02/03/1997,09:24:00,3045.00,3045.00,3045.00,3045.00,131
02/03/1997,09:25:00,3044.50,3044.50,3043.50,3043.50,138
02/03/1997,09:26:00,3043.50,3043.50,3043.50,3043.50,6
02/03/1997,09:27:00,3043.50,3043.50,3043.00,3043.00,56
02/03/1997,09:28:00,3043.00,3044.00,3043.00,3044.00,32
02/03/1997,09:29:00,3044.50,3044.50,3044.50,3044.50,63
02/03/1997,09:30:00,3045.00,3045.00,3045.00,3045.00,28
Here's my code:
exp = pd.read_csv('example.txt', parse_dates=[["DATE", "TIME"]], index_col=0)
exp['oo'] = exp.OPEN.resample("5Min").first()
print exp['oo']
and I get this
DATE_TIME
1997-02-03 09:04:00 NaN
1997-02-03 09:05:00 3047.0
1997-02-03 09:06:00 NaN
1997-02-03 09:07:00 NaN
1997-02-03 09:08:00 NaN
1997-02-03 09:09:00 NaN
1997-02-03 09:10:00 3046.5
I want to get rid of all the DATE_TIME rows with NaN values in the 'oo' column.
I've tried:
exp['oo'] = exp['oo'].dropna()
But I get the same thing.
I've looked all through http://pandas.pydata.org/pandas-docs/stable/missing_data.html
and all over this website.
I would like to keep my csv reader the same, but I'm not sure how.
If anybody could help it would be greatly appreciated thanks so much for your time.
I think you want this:
>>> exp.OPEN.resample("5Min", how='first')
DATE_TIME
1997-02-03 09:00:00 3046.0
1997-02-03 09:05:00 3047.0
1997-02-03 09:10:00 3046.5
1997-02-03 09:15:00 3044.0
1997-02-03 09:20:00 3043.0
1997-02-03 09:25:00 3044.5
1997-02-03 09:30:00 3045.0
Freq: 5T, Name: OPEN, dtype: float64
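On newer pandas versions the how= keyword has been removed; a minimal sketch of the equivalent, assuming the same exp dataframe, would be:
oo = exp['OPEN'].resample('5Min').first()
oo = oo.dropna()   # drop any 5-minute bins that came out empty
This also sidesteps the NaN issue, since the resampled series is kept on its own 5-minute index rather than being assigned back onto the minute-level index.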

SAS ARIMA modelling for 5 different variables

I am trying to do an ARIMA model estimation for 5 different variables. The data consists of 16 months of point-of-sale data. How do I approach this complicated ARIMA modelling?
Furthermore, I would like to do:
A simple moving average of each product group
A Holt-Winters exponential smoothing model
Data is as follows with date and product groups:
Date Gloves ShoeCovers Socks Warmers HeadWear
apr-14 11015 3827 3465 1264 772
maj-14 11087 2776 4378 1099 1423
jun-14 7645 1432 4490 674 670
jul-14 10083 7975 2577 1558 8501
aug-14 13887 8577 6854 1305 15621
sep-14 9186 5213 5244 1183 6784
okt-14 7611 4279 4150 977 6191
nov-14 6410 4033 2918 507 8276
dec-14 4856 3552 3192 450 4810
jan-15 17506 7274 3137 2216 3979
feb-15 21518 5672 8848 1838 2321
mar-15 17395 5200 5712 1604 2282
apr-15 11405 4531 5185 1479 1888
maj-15 11509 5690 4370 1145 2369
jun-15 9945 2610 4884 882 1709
jul-15 8707 5658 4570 1948 6255
Any skilled forecasters out there willing to help? Much appreciated!

Pandas - Python 2.7: How convert timeseries index to seconds of the day?

I'm trying to convert a time series index to seconds of the day, i.e. so that the seconds increase from 0-86399 as the day progresses. I can currently recover the time of day, but am having trouble converting it to seconds in a vectorized way:
df['timeofday'] = df.index.time
Any ideas? Thanks.
As #Jeff points out, my original answer misunderstood what you were doing. But the following should work, and it is vectorized. It relies on numpy datetime64 operations (subtract the beginning of the day from the current datetime64, then divide by a timedelta64 to get seconds):
>>> df
A
2011-01-01 00:00:00 -0.112448
2011-01-01 01:00:00 1.006958
2011-01-01 02:00:00 -0.056194
2011-01-01 03:00:00 0.777821
2011-01-01 04:00:00 -0.552584
2011-01-01 05:00:00 0.156198
2011-01-01 06:00:00 0.848857
2011-01-01 07:00:00 0.248990
2011-01-01 08:00:00 0.524785
2011-01-01 09:00:00 1.510011
2011-01-01 10:00:00 -0.332266
2011-01-01 11:00:00 -0.909849
2011-01-01 12:00:00 -1.275335
2011-01-01 13:00:00 1.361837
2011-01-01 14:00:00 1.924534
2011-01-01 15:00:00 0.618478
df['sec'] = (df.index.values
             - df.index.values.astype('datetime64[D]')) / np.timedelta64(1, 's')
A sec
2011-01-01 00:00:00 -0.112448 0
2011-01-01 01:00:00 1.006958 3600
2011-01-01 02:00:00 -0.056194 7200
2011-01-01 03:00:00 0.777821 10800
2011-01-01 04:00:00 -0.552584 14400
2011-01-01 05:00:00 0.156198 18000
2011-01-01 06:00:00 0.848857 21600
2011-01-01 07:00:00 0.248990 25200
2011-01-01 08:00:00 0.524785 28800
2011-01-01 09:00:00 1.510011 32400
2011-01-01 10:00:00 -0.332266 36000
2011-01-01 11:00:00 -0.909849 39600
2011-01-01 12:00:00 -1.275335 43200
2011-01-01 13:00:00 1.361837 46800
2011-01-01 14:00:00 1.924534 50400
2011-01-01 15:00:00 0.618478 54000
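On newer pandas versions you can stay in pandas rather than dropping to raw numpy; a small sketch (same df assumed) using DatetimeIndex.normalize(), which floors each timestamp to midnight:
df['sec'] = (df.index - df.index.normalize()).total_seconds()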
Maybe a bit overdone, but this would be my answer:
from pandas import date_range, Series, to_datetime
from numpy.random import randn  # needed for the sample data below

# Some test data
rng = date_range('1/1/2011 01:01:01', periods=3, freq='s')
df = Series(randn(len(rng)), index=rng).to_frame()

def sec_in_day(timestamp):
    date = timestamp.date()  # the date, without the time component
    elapsed_time = timestamp.to_datetime() - to_datetime(date)  # time since midnight
    return elapsed_time.total_seconds()

Series(df.index).apply(sec_in_day)
I modified KarlD's answer for datetime with time zone:
d = pd.DataFrame({"t_naive":pd.date_range("20160101","20160102", freq = "2H")})
d['t_utc'] = d['t_naive'].dt.tz_localize("UTC")
d['t_ct'] = d['t_utc'].dt.tz_convert("America/Chicago")
print(d.head())
# t_naive t_utc t_ct
# 0 2016-01-01 00:00:00 2016-01-01 00:00:00+00:00 2015-12-31 18:00:00-06:00
# 1 2016-01-01 02:00:00 2016-01-01 02:00:00+00:00 2015-12-31 20:00:00-06:00
# 2 2016-01-01 04:00:00 2016-01-01 04:00:00+00:00 2015-12-31 22:00:00-06:00
# 3 2016-01-01 06:00:00 2016-01-01 06:00:00+00:00 2016-01-01 00:00:00-06:00
# 4 2016-01-01 08:00:00 2016-01-01 08:00:00+00:00 2016-01-01 02:00:00-06:00
The answer by KarlD gives seconds of day in UTC:
s0 = (d["t_naive"].values - d["t_naive"].values.astype('datetime64[D]'))/np.timedelta64(1,'s')
s0
# array([ 0., 7200., 14400., 21600., 28800., 36000., 43200.,
# 50400., 57600., 64800., 72000., 79200., 0.])
s1 = (d["t_ct"].values - d["t_ct"].values.astype('datetime64[D]'))/np.timedelta64(1,'s')
s1
# array([ 0., 7200., 14400., 21600., 28800., 36000., 43200.,
# 50400., 57600., 64800., 72000., 79200., 0.])
For sec of day in local time, use:
s2 = (d["t_ct"].view("int64") - d["t_ct"].dt.normalize().view("int64"))//pd.Timedelta(1, unit='s')
#use d.index.normalize() for index
s2.values
# array([64800, 72000, 79200, 0, 7200, 14400, 21600, 28800, 36000,
# 43200, 50400, 57600, 64800], dtype=int64)
or,
s3 = d["t_ct"].dt.hour*60*60 + d["t_ct"].dt.minute*60+ d["t_ct"].dt.second
s3.values
# array([64800, 72000, 79200, 0, 7200, 14400, 21600, 28800, 36000,
# 43200, 50400, 57600, 64800], dtype=int64)

Resample pandas dataframe and count instances

If I have a dataframe such as:
index = pd.date_range(start='2014 01 01 00:00', end='2014 01 05 00:00', freq='12H')
df = pd.DataFrame(pd.np.random.randn(9),index=index,columns=['A'])
df
Out[5]:
A
2014-01-01 00:00:00 2.120577
2014-01-01 12:00:00 0.968724
2014-01-02 00:00:00 1.232688
2014-01-02 12:00:00 0.328104
2014-01-03 00:00:00 -0.836761
2014-01-03 12:00:00 -0.061087
2014-01-04 00:00:00 -1.239613
2014-01-04 12:00:00 0.513896
2014-01-05 00:00:00 0.089544
And if I want to resample to daily frequency, that is quite easy:
df.resample(rule='1D',how='mean')
Out[6]:
A
2014-01-01 1.544650
2014-01-02 0.780396
2014-01-03 -0.448924
2014-01-04 -0.362858
2014-01-05 0.089544
However, I need to track how many instances are going into each day. Is there a good pythonic way of using resample to both perform the specified "how" operation AND track number of data points going into each mean value, e.g. yielding
Out[6]:
A Instances
2014-01-01 1.544650 2
2014-01-02 0.780396 2
2014-01-03 -0.448924 2
2014-01-04 -0.362858 2
2014-01-05 0.089544 2
Conveniently, how accepts a list:
df1 = df.resample(rule='1D', how=['mean', 'count'])
This will return a DataFrame with a MultiIndex column: one level for 'A' and another level for 'mean' and 'count'. To get a simple DataFrame like the desired output in your question, you can drop the extra level with df1.columns = df1.columns.droplevel(0), or, better, do your resampling on df['A'] instead of df.
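On newer pandas versions, where the how= keyword is gone, the same idea can be written with agg on the single column; a sketch assuming the df from the question:
out = df['A'].resample('1D').agg(['mean', 'count'])
out.columns = ['A', 'Instances']   # rename to match the desired output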