Pandas: How to convert a 10-min interval time series into a dataframe? (python-2.7)

I have a time series similar to:
ts = pd.Series(np.random.randn(60),index=pd.date_range('1/1/2000',periods=60, freq='2h'))
Is there an easy way to make it so that the row index is the dates and the column index is the hour?
Basically I am trying to convert a time series into a DataFrame.

There's always a slicker way to do things than the way I reach for, but I'd make a flat frame first and then pivot. Something like
>>> ts = pd.Series(np.random.randn(10000),index=pd.date_range('1/1/2000',periods=10000, freq='10min'))
>>> df = pd.DataFrame({"date": ts.index.date, "time": ts.index.time, "data": ts.values})
>>> df = df.pivot("date", "time", "data")
This produces too large a frame to paste here, but looking at the top-left corner:
>>> df.iloc[:5, :5]
time 00:00:00 00:10:00 00:20:00 00:30:00 00:40:00
date
2000-01-01 -0.180811 0.672184 0.098536 -0.687126 -0.206245
2000-01-02 0.746777 0.630105 0.843879 -0.253666 1.337123
2000-01-03 1.325679 0.046904 0.291343 -0.467489 -0.531110
2000-01-04 -0.189141 -1.346146 1.378533 0.887792 2.957479
2000-01-05 -0.232299 -0.853726 -0.078214 -0.158410 0.782468
[5 rows x 5 columns]
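For the record, the same reshape can be done without the intermediate flat frame by building a (date, time) MultiIndex and unstacking the inner level; a minimal sketch along the same lines (my own variable names, same toy data):
>>> import numpy as np, pandas as pd
>>> ts = pd.Series(np.random.randn(10000), index=pd.date_range('1/1/2000', periods=10000, freq='10min'))
>>> s = pd.Series(ts.values, index=[ts.index.date, ts.index.time])  # MultiIndex of (date, time)
>>> df = s.unstack()  # rows: dates, columns: times of day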

Related

Converting List to pandas DataFrame

I need to build a DataFrame with a very specific structure. Yield curve values as the data, a single date as the index, and days to maturity as the column names.
In[1]: yield_data # list of size 38, with yield values
Out[1]:
[0.096651956137087325,
0.0927199778042056,
0.090000225505577847,
0.088300016028163508,...
In[2]: maturity_data # list of size 38, with days until maturity
Out[2]:
[6,
29,
49,
70,...
In[3]: today
Out[3]:
Timestamp('2017-07-24 00:00:00')
Then I try to create the DataFrame
pd.DataFrame(data=yield_data, index=[today], columns=maturity_data)
but it returns the error
ValueError: Shape of passed values is (1, 38), indices imply (38, 1)
I tried using the transpose of these lists, but plain lists cannot be transposed.
How can I create this DataFrame?
IIUC, you want a DataFrame with a single row, so you need to reshape your input data list into a list of lists.
yield_data = [0.09,0.092, 0.091]
maturity_data = [6,10,15]
today = pd.to_datetime('2017-07-25')
pd.DataFrame(data=[yield_data],index=[today],columns=maturity_data)
Output:
              6     10     15
2017-07-25  0.09  0.092  0.091
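Alternatively, since the question mentions transposing: plain Python lists have no transpose, but a DataFrame does, so another sketch (same toy data) is to build the frame the tall way and flip it, which produces the same single-row frame as above:
pd.DataFrame(data=yield_data, index=maturity_data, columns=[today]).T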

pandas dataframe getting daily data

I have a pandas dataframe with timestamps as index.
I would like to convert it into a DataFrame with daily values, but without resampling the original frame (no summing or averaging of the hourly data). Ideally I would like to get the 24 hourly values in a vector for each day.
Is there a method to do this quickly?
Thanks!
IIUC you can groupby on the date attribute of your index and then apply a lambda that aggregates the values into a list:
In [21]:
# generate some data
import datetime as dt; import numpy as np; import pandas as pd
df = pd.DataFrame({'GFS_rad': np.random.randn(100), 'GFS_tmp': np.random.randn(100)},
                  index=pd.date_range(dt.datetime(2016, 1, 1), freq='1h', periods=100))
# collect each column's values into a list per calendar date
df.groupby(df.index.date)[['GFS_rad', 'GFS_tmp']].agg(lambda s: list(s))
Out[21]:
GFS_rad \
2016-01-01 [-0.324115177542, 1.59297335764, 0.58118555943...
2016-01-02 [-0.0547016526463, -1.10093451797, -1.55790161...
2016-01-03 [-0.34751220092, 1.06246918632, 0.181218794826...
2016-01-04 [0.950977469848, 0.422905080529, 1.98339145764...
2016-01-05 [-0.405124861624, 0.141470757613, -0.191169333...
GFS_tmp
2016-01-01 [-2.36889710412, -0.557972678049, -1.293544410...
2016-01-02 [-0.125562429825, -0.018852674365, -0.96735945...
2016-01-03 [0.802961514703, -1.68049099535, -0.5116769061...
2016-01-04 [1.35789157665, 1.37583167965, 0.538638510171,...
2016-01-05 [-0.297611872638, 1.10546853812, -0.8726761667...
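If you would rather have the 24 hourly values as real columns (one per hour of day) instead of a list per cell, a pivot is a natural fit; a sketch using the df generated above (this assumes exactly one reading per hour, so the default mean aggregation is a no-op):
# rows: calendar dates, columns: hour of day (0-23), values: the hourly readings
daily_rad = df.pivot_table(values='GFS_rad', index=df.index.date, columns=df.index.hour)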

Pandas - Calculate the mean of Timestamps

I'm having trouble with calculating the mean of Timestamps.
I have a few values with Timestamps in my DataFrame, and I want to aggregate each group into a single row: the sum of the values and the value-weighted mean of the corresponding Timestamps.
My input is:
Timestamp Value
ID
0 2013-02-03 13:39:00 79
0 2013-02-03 14:03:00 19
2 2013-02-04 11:36:00 2
2 2013-02-04 12:07:00 2
3 2013-02-04 14:04:00 1
And I want to aggregate the data using the ID index.
I was able to sum the Values using
manp_func = {'Value': ['sum']}
new_table = table.groupby(level='ID').agg(manp_func)
but, how can I find the weighted mean of the Timestamps related to the values?
Thanks
S.A
agg = lambda x: (x['Timestamp'].astype('i8') * (x['Value'].astype('f8') / x['Value'].sum())).sum()
new_table = table.groupby(level='ID').apply(agg).astype('i8').astype('datetime64[ns]')
Output of new_table
ID
0 2013-02-03 13:43:39.183673344
2 2013-02-04 11:51:30.000000000
3 2013-02-04 14:04:00.000000000
dtype: datetime64[ns]
The main idea is to compute the weighted average as normal, but there are a couple of subtleties:
You have to convert the datetime64[ns] to an integer offset first because multiplication is not defined between those two types. Then you have to convert it back.
Calculating the weighted sum as sum(a*w)/sum(w) will result in overflow (a*w is too large to be represented as an 8-byte integer), so it has to be calculated as sum(a*(w/sum(w))).
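Putting the whole answer together as a self-contained sketch (toy data mirroring the input above):
import pandas as pd
table = pd.DataFrame(
    {'Timestamp': pd.to_datetime(['2013-02-03 13:39', '2013-02-03 14:03',
                                  '2013-02-04 11:36', '2013-02-04 12:07',
                                  '2013-02-04 14:04']),
     'Value': [79, 19, 2, 2, 1]},
    index=pd.Index([0, 0, 2, 2, 3], name='ID'))
# weighted mean in integer-nanosecond space, weights normalized first to avoid overflow
wmean = lambda x: (x['Timestamp'].astype('i8') * (x['Value'].astype('f8') / x['Value'].sum())).sum()
new_table = table.groupby(level='ID').apply(wmean).astype('i8').astype('datetime64[ns]')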
Preparing a sample dataframe:
import pandas as pd

# Initiate dataframe
date_var = "date"
df = pd.DataFrame(data=[['A', '2018-08-05 17:06:01'],
                        ['A', '2018-08-05 17:06:02'],
                        ['A', '2018-08-05 17:06:03'],
                        ['B', '2018-08-05 17:06:07'],
                        ['B', '2018-08-05 17:06:09'],
                        ['B', '2018-08-05 17:06:11']],
                  columns=['column', date_var])
# Convert date column to proper pandas datetime values / pd.Timestamps
df[date_var] = pd.to_datetime(df[date_var])
Extraction of the desired average Timestamp-value:
# Extract the numeric value associated with each timestamp (epoch time, in nanoseconds)
# NOTE: this is accomplished by accessing the .value attribute of each Timestamp in the column
In:
[tsp.value for tsp in df[date_var]]
Out:
[
1533488761000000000, 1533488762000000000, 1533488763000000000,
1533488767000000000, 1533488769000000000, 1533488771000000000
]
# Use this to calculate the mean (numpy imported as np), then convert the result back to a timestamp
In:
pd.Timestamp(np.nanmean([tsp.value for tsp in df[date_var]]))
Out:
Timestamp('2018-08-05 17:06:05.500000')
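As an aside, recent pandas versions can take the mean of a datetime64 column directly, which may make the epoch round-trip unnecessary (worth verifying on your version):
df[date_var].mean()                    # overall mean timestamp
df.groupby('column')[date_var].mean()  # mean timestamp per group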

Filter data to get only first day of the month rows

I have a dataset of daily data. I need to get only the data of the first day of each month in the data set (The data is from 1972 to 2013). So for example I would need Index 20, Date 2013-12-02 value of 0.1555 to be extracted.
The problem I have is that the first day present in each month differs, so I cannot use a fixed step such as relativedelta(months=1). How would I go about extracting these values from my dataset?
Is there a similar command as I have found in another post for R?
R - XTS: Get the first dates and values for each month from a daily time series with missing rows
Index  Date        Value
17     2013-12-05  0.1621
18     2013-12-04  0.1698
19     2013-12-03  0.1516
20     2013-12-02  0.1555
21     2013-11-29  0.1480
22     2013-11-27  0.1487
23     2013-11-26  0.1648
I would groupby the month and then get the zeroth (nth) row of each group.
First set as index (I think this is necessary):
In [11]: df1 = df.set_index('date')
In [12]: df1
Out[12]:
n val
date
2013-12-05 17 0.1621
2013-12-04 18 0.1698
2013-12-03 19 0.1516
2013-12-02 20 0.1555
2013-11-29 21 0.1480
2013-11-27 22 0.1487
2013-11-26 23 0.1648
Next sort, so that the first element is the first date of that month (Note: this doesn't appear to be necessary for nth, but I think that's actually a bug!):
In [13]: df1.sort_index(inplace=True)
In [14]: df1.groupby(pd.TimeGrouper('M')).nth(0)
Out[14]:
n val
date
2013-11-26 23 0.1648
2013-12-02 20 0.1555
Another option is to resample and take the first entry:
In [15]: df1.resample('M', 'first')
Out[15]:
n val
date
2013-11-30 23 0.1648
2013-12-31 20 0.1555
Thinking about this, you can do this much more simply by extracting the month and then grouping by that:
In [21]: pd.DatetimeIndex(df.date).to_period('M')
Out[21]:
<class 'pandas.tseries.period.PeriodIndex'>
[2013-12, ..., 2013-11]
Length: 7, Freq: M
In [22]: df.groupby(pd.DatetimeIndex(df.date).to_period('M')).nth(0)
Out[22]:
n date val
0 17 2013-12-05 0.1621
4 21 2013-11-29 0.1480
This time the sortedness of df.date is (correctly) relevant; if you know it's in descending date order you can use nth(-1):
In [23]: df.groupby(pd.DatetimeIndex(df.date).to_period('M')).nth(-1)
Out[23]:
n date val
3 20 2013-12-02 0.1555
6 23 2013-11-26 0.1648
If this isn't guaranteed, then sort by the date column first: df.sort('date') (sort_values in later pandas).
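For readers on current pandas, where TimeGrouper and DataFrame.sort have since been removed, a sketch of the same nth idea with the modern API (assumes the same df with a date column):
ds = df.sort_values('date')
first_per_month = ds.groupby(pd.DatetimeIndex(ds['date']).to_period('M')).nth(0)
# or, with the dates as index:
# df.set_index('date').sort_index().resample('M').first()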
One way is to add a column for the year, month and day:
df['year'] = df.SomeDatetimeColumn.map(lambda x: x.year)
df['month'] = df.SomeDatetimeColumn.map(lambda x: x.month)
df['day'] = df.SomeDatetimeColumn.map(lambda x: x.day)
Then group by the year and month, order by day, and take only the first entry (which will be the minimum day entry).
df.groupby(
    ['year', 'month']
).apply(lambda x: x.sort_values('day', ascending=True).head(1))
The use of the lambda expressions makes this less than ideal for large data sets. You may not wish to grow the size of the data by keeping separately stored year, month, and day values. However, for these kinds of ad hoc date alignment problems, sooner or later having these values separated is very helpful.
Another approach is to group directly by a function of the datetime column:
dfrm.groupby(
    by=dfrm['dt'].map(lambda x: (x.year, x.month))
).apply(lambda x: x.sort_values('dt', ascending=True).head(1))
Normally these problems arise because of a dysfunctional database or data storage schema that exists one level prior to the Python/pandas layer.
For example, in this situation, it should be commonplace to rely on the existence of a calendar database table or a calendar data set which contains (or makes it easy to query for) the earliest active date in a month relative to the given data set (such as, the first trading day, the first week day, the first business day, the first holiday, or whatever).
If a companion database table exists with this data, it should be easy to combine it with the dataset you already have loaded (say, by joining on the date column you already have) and then it's just a matter of applying a logical filter on the calendar data columns.
This becomes especially important once you need to use date lags: for example, lining up a company's 1-month-ago market capitalization with the company's current-month stock return, to calculate a total return realized over that 1-month period.
This can be done by lagging the columns in pandas with shift, or by a complicated self-join that is likely to be bug prone and perpetuates the particular date convention to every place downstream that uses data from that code.
Much better to simply demand (or arrange yourself) that the data have properly normalized date features in its raw format (database, flat files, whatever), to stop what you are doing, fix that date problem first, and only then get back to carrying out the analysis with the date data.
import pandas as pd
dates = pd.date_range('2014-02-05', '2014-03-15', freq='D')
df = pd.DataFrame({'vals': range(len(dates))}, index=dates)
g = df.groupby(lambda x: x.strftime('%Y-%m'), axis=0)
g.apply(lambda x: x.index.min())
# Or, depending on whether you want the index or the vals:
g.apply(lambda x: x.loc[x.index.min()])
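Assuming the index is sorted ascending, the last line can also be spelled with the built-in groupby aggregation, which avoids the lambda:
g.first()  # first row of each month group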
The above didn't work for me because I needed more than one row per month, where the number of rows in each month could change. This is what I did:
dates_month = pd.bdate_range(df['date'].min(), df['date'].max(), freq='1M')  # '1M' marks month ends; use 'MS' for month starts
df_mth = df[df['date'].isin(dates_month)]

Creating a pandas.DataFrame from a dict

I'm new to using pandas and I'm trying to make a dataframe with historical weather data.
The keys are the day of the year (ex. Jan 1) and the values are lists of temperatures from those days over several years.
I want to make a dataframe that is formatted like this:
... Jan1 Jan2 Jan3 etc
1 temp temp temp etc
2 temp temp temp etc
etc etc etc etc
I've managed to make a dataframe with my dictionary with
df = pandas.DataFrame(weather)
but I end up with 1 row and a ton of columns.
I've checked the documentation for DataFrame and DataFrame.from_dict, but neither was very extensive nor provided many examples.
Given that "the keys are the day of the year... and the values are lists of temperatures", your method of construction should work. For example,
In [12]: weather = {'Jan 1':[1,2], 'Jan 2':[3,4]}
In [13]: df = pd.DataFrame(weather)
In [14]: df
Out[14]:
   Jan 1  Jan 2
0      1      3
1      2      4
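If you would rather have the days as rows (the transpose of the above), from_dict with orient='index' is a one-step sketch using the same toy dict:
In [15]: pd.DataFrame.from_dict(weather, orient='index')
Out[15]:
       0  1
Jan 1  1  2
Jan 2  3  4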