I'm having trouble calculating the mean of Timestamps.
I have a few values with Timestamps in my DataFrame, and I want to aggregate them into a single value containing the sum of all values and the weighted mean of the corresponding Timestamps.
My input is:
Timestamp Value
ID
0 2013-02-03 13:39:00 79
0 2013-02-03 14:03:00 19
1 2013-02-04 11:36:00 2
2 2013-02-04 12:07:00 2
3 2013-02-04 14:04:00 1
And I want to aggregate the data using the ID index.
I was able to sum the Values using
manp_func = {'Value': ['sum']}
new_table = table.groupby(level='ID').agg(manp_func)
But how can I find the weighted mean of the Timestamps, weighted by the corresponding Values?
Thanks
S.A
agg = lambda x: (x['Timestamp'].astype('i8') * (x['Value'].astype('f8') / x['Value'].sum())).sum()
new_table = table.groupby(level='ID').apply(agg).astype('i8').astype('datetime64[ns]')
Output of new_table
ID
0 2013-02-03 13:43:39.183673344
2 2013-02-04 11:51:30.000000000
3 2013-02-04 14:04:00.000000000
dtype: datetime64[ns]
The main idea is to compute the weighted average as normal, but there are a couple of subtleties:
You have to convert the datetime64[ns] values to an integer offset first, because multiplication is not defined between datetime64[ns] and the float weights. Then you have to convert the result back.
Calculating the weighted mean as sum(a*w)/sum(w) will result in overflow (a*w is too large to be represented as an 8-byte integer), so it has to be calculated as sum(a*(w/sum(w))).
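Another way to sidestep both subtleties, offered as a sketch rather than as the method used above, is to average offsets from the earliest timestamp in each group instead of raw epoch integers. The offsets are small, so overflow is not a concern. The helper name weighted_mean_timestamp is hypothetical, and the sample table reproduces the input from the question.

```python
import numpy as np
import pandas as pd

# Sample data taken from the question; ID is the index.
table = pd.DataFrame(
    {
        "Timestamp": pd.to_datetime(
            [
                "2013-02-03 13:39:00",
                "2013-02-03 14:03:00",
                "2013-02-04 11:36:00",
                "2013-02-04 12:07:00",
                "2013-02-04 14:04:00",
            ]
        ),
        "Value": [79, 19, 2, 2, 1],
    },
    index=pd.Index([0, 0, 1, 2, 3], name="ID"),
)


def weighted_mean_timestamp(group):
    """Hypothetical helper: weighted mean of Timestamp, weighted by Value."""
    base = group["Timestamp"].min()
    # Offsets from the earliest timestamp, in seconds (small floats).
    offsets = (group["Timestamp"] - base).dt.total_seconds().to_numpy()
    mean_offset = np.average(offsets, weights=group["Value"].to_numpy())
    return base + pd.to_timedelta(mean_offset, unit="s")


result = table.groupby(level="ID").apply(weighted_mean_timestamp)
print(result)
```

Groups with a single row simply return their own timestamp, since the only offset is zero.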
Preparing a sample dataframe:
import numpy as np
import pandas as pd
# Build the sample dataframe
date_var = "date"
df = pd.DataFrame(data=[['A', '2018-08-05 17:06:01'],
['A', '2018-08-05 17:06:02'],
['A', '2018-08-05 17:06:03'],
['B', '2018-08-05 17:06:07'],
['B', '2018-08-05 17:06:09'],
['B', '2018-08-05 17:06:11']],
columns=['column', date_var])
# Convert date-column to proper pandas Datetime-values/pd.Timestamps
df[date_var] = pd.to_datetime(df[date_var])
Extraction of the desired average Timestamp-value:
# Extract the numeric value associated with each timestamp (epoch time in nanoseconds)
# NOTE: this is done by accessing the .value attribute of each Timestamp in the column
In:
[tsp.value for tsp in df[date_var]]
Out:
[
1533488761000000000, 1533488762000000000, 1533488763000000000,
1533488767000000000, 1533488769000000000, 1533488771000000000
]
# Use this to calculate the mean, then convert the result back to a timestamp
In:
pd.Timestamp(np.nanmean([tsp.value for tsp in df[date_var]]))
Out:
Timestamp('2018-08-05 17:06:05.500000')
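For comparison, here is a minimal sketch that skips the manual integer conversion, assuming a pandas version in which Series.mean() accepts datetime columns. It rebuilds the same sample data; the names overall_mean and group_means are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "column": ["A", "A", "A", "B", "B", "B"],
        "date": pd.to_datetime(
            [
                "2018-08-05 17:06:01",
                "2018-08-05 17:06:02",
                "2018-08-05 17:06:03",
                "2018-08-05 17:06:07",
                "2018-08-05 17:06:09",
                "2018-08-05 17:06:11",
            ]
        ),
    }
)

# Mean over the whole column.
overall_mean = df["date"].mean()

# Mean per group, computed by calling Series.mean() on each group.
group_means = df.groupby("column")["date"].apply(lambda s: s.mean())

print(overall_mean)
print(group_means)
```

The overall mean should land on roughly 17:06:05.5, in line with the np.nanmean result above, although the last few nanoseconds can differ because of floating-point rounding.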
I need to build a DataFrame with a very specific structure: yield curve values as the data, a single date as the index, and days to maturity as the column names.
In[1]: yield_data # list of size 38, with yield values
Out[1]:
[0.096651956137087325,
0.0927199778042056,
0.090000225505577847,
0.088300016028163508,...
In[2]: maturity_data # list of size 38, with days until maturity
Out[2]:
[6,
29,
49,
70,...
In[3]: today
Out[3]:
Timestamp('2017-07-24 00:00:00')
Then I try to create the DataFrame
pd.DataFrame(data=yield_data, index=[today], columns=maturity_data)
but it returns the error
ValueError: Shape of passed values is (1, 38), indices imply (38, 1)
I tried transposing these lists, but plain lists cannot be transposed.
How can I create this DataFrame?
IIUC, you want a DataFrame with a single row. To get that, wrap your input list in another list so that it becomes a list of lists.
yield_data = [0.09,0.092, 0.091]
maturity_data = [6,10,15]
today = pd.to_datetime('2017-07-25')
pd.DataFrame(data=[yield_data],index=[today],columns=maturity_data)
Output:
6 10 15
2017-07-25 0.09 0.092 0.091
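If the data already lives in NumPy, an equivalent sketch is to reshape the values into a 1 x N array before passing them in. This is only an alternative to the list-of-lists approach; the variable names row and frame are illustrative.

```python
import numpy as np
import pandas as pd

yield_data = [0.09, 0.092, 0.091]
maturity_data = [6, 10, 15]
today = pd.to_datetime("2017-07-25")

# reshape(1, -1) turns the flat list into one row with as many columns as needed.
row = np.asarray(yield_data).reshape(1, -1)
frame = pd.DataFrame(data=row, index=[today], columns=maturity_data)
print(frame)
```

The shape of the data (1, 3) now matches the one index label and three column labels, which is what the ValueError was complaining about.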
I have a Pandas Dataframe with 2000+ rows with date in float format as below:
42704.99686342593 representing datetime value of (2016, 11, 30, 23, 55, 29)
What I want to do is iterate over each row in the dataframe, convert the float to the correct datetime format (ideally d/m/Y H/M/S), and save this to a new dataframe.
Using Python 2.7.
I couldn't find any duplicate questions and was unable to solve the issue with solutions to similar questions, so any help is appreciated.
Thanks.
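It seems you are using serial dates, which is the Excel format.
The simplest approach is to subtract 25569 (the number of days between the Excel epoch, 1899-12-30, and the Unix epoch, 1970-01-01) and use to_datetime with the parameter unit='d':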
df = pd.DataFrame({'date':[42704.99686342593,42704.99686342593]})
print (df)
date
0 42704.996863
1 42704.996863
print (pd.to_datetime(df.date - 25569, unit='d'))
0 2016-11-30 23:55:28.963200
1 2016-11-30 23:55:28.963200
Name: date, dtype: datetime64[ns]
Other solutions are to subtract a timedelta or an offset:
print (pd.to_datetime(df.date, unit='d') - pd.to_timedelta('25569 Days'))
0 2016-11-30 23:55:28.963200
1 2016-11-30 23:55:28.963200
Name: date, dtype: datetime64[ns]
print (pd.to_datetime(df.date, unit='d') - pd.offsets.Day(25569))
0 2016-11-30 23:55:28.963200
1 2016-11-30 23:55:28.963200
Name: date, dtype: datetime64[ns]
Thank you Ted Petrou for the link.
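A further variant, sketched here under the assumption that the installed pandas supports the origin parameter of to_datetime, is to pass the Excel epoch directly instead of subtracting 25569 by hand. The strftime call then produces the d/m/Y layout asked for in the question. The names excel_epoch, converted and formatted are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"date": [42704.99686342593, 42704.99686342593]})

# Excel serial dates count days from 1899-12-30.
excel_epoch = pd.Timestamp("1899-12-30")
converted = pd.to_datetime(df["date"], unit="D", origin=excel_epoch)

# Render as day/month/year hour:minute:second strings.
formatted = converted.dt.strftime("%d/%m/%Y %H:%M:%S")
print(formatted)
```

No explicit row iteration is needed, because to_datetime and the .dt accessor operate on the whole column at once.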
I have a Pandas dataframe df1 with x rows. I also have a numpy.ndarray n1 with x rows. n1 has only one column, with values of either 0 or 1. I want to pick only the first column of the dataframe df1, for the rows where the corresponding ndarray value is 1. How can this be done?
The use case is this: I have an invoice dataframe whose first column is the customer code. I also have an ndarray which is the output of a scikit churn prediction, based on this invoice dataframe as input. The ndarray has 1 for invoices which show symptoms of churn and 0 for invoices which do not. So I want to extract the customers who churn. Of course the output will contain repeated values for the same customer, but those can be filtered out.
You can convert your indicators to booleans and then use boolean filtering.
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
n1 = np.array([0, 1, 1])
>>> df1
a b
0 1 4
1 2 5
2 3 6
>>> df1[n1.astype('bool')]
a b
1 2 5
2 3 6
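To get only the first column, as the question asks, the same boolean mask can be combined with a positional column selection. The sketch below assumes n1 has shape (x, 1), as described in the question, so it is flattened first; the column names customer and amount are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical invoice data: first column is the customer code.
df1 = pd.DataFrame({"customer": ["c1", "c2", "c2"], "amount": [4, 5, 6]})
n1 = np.array([[0], [1], [1]])  # one column of 0/1 predictions, shape (3, 1)

# Flatten to 1-D and convert to booleans so it can be used as a row mask.
mask = n1.ravel().astype(bool)

# Rows where the mask is True, first column only.
churned = df1.iloc[mask, 0]

# Drop repeated customer codes.
unique_customers = churned.drop_duplicates().tolist()
print(unique_customers)
```

Flattening matters here: a two-dimensional mask of shape (x, 1) does not line up with the rows in the way a one-dimensional mask of length x does.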
I'm new to using pandas and I'm trying to make a dataframe with historical weather data.
The keys are the day of the year (ex. Jan 1) and the values are lists of temperatures from those days over several years.
I want to make a dataframe that is formatted like this:
... Jan1 Jan2 Jan3 etc
1 temp temp temp etc
2 temp temp temp etc
etc etc etc etc
I've managed to make a dataframe with my dictionary with
df = pandas.DataFrame(weather)
but I end up with 1 row and a ton of columns.
I've checked the documentation for DataFrame and DataFrame.from_dict, but neither was very extensive or provided many examples.
Given that "the keys are the day of the year... and the values are lists of temperatures", your method of construction should work. For example,
In [12]: weather = {'Jan 1':[1,2], 'Jan 2':[3,4]}
In [13]: df = pd.DataFrame(weather)
In [14]: df
Out[14]:
Jan 1 Jan 2
0 1 3
1 2 4
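One case where the plain constructor does not work is when the lists have different lengths, for example if some days are missing in some years. That is an assumption, not something stated in the question. A possible sketch for that situation wraps each list in a Series so that shorter columns are padded with NaN:

```python
import pandas as pd

# Hypothetical data with lists of unequal length.
weather = {"Jan 1": [1, 2, 3], "Jan 2": [4, 5]}

# Each Series gets its own 0..n-1 index; pandas aligns them and fills gaps with NaN.
df = pd.DataFrame({day: pd.Series(temps) for day, temps in weather.items()})
print(df)
```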
I have a time series similar to:
ts = pd.Series(np.random.randn(60),index=pd.date_range('1/1/2000',periods=60, freq='2h'))
Is there an easy way to make it so that the row index is dates and the column index is the hour?
Basically I am trying to convert from a time-series into a dataframe.
There's always a slicker way to do things than the way I reach for, but I'd make a flat frame first and then pivot. Something like
>>> ts = pd.Series(np.random.randn(10000),index=pd.date_range('1/1/2000',periods=10000, freq='10min'))
>>> df = pd.DataFrame({"date": ts.index.date, "time": ts.index.time, "data": ts.values})
>>> df = df.pivot(index="date", columns="time", values="data")
This produces too large a frame to paste, but looking at the top left corner:
>>> df.iloc[:5, :5]
time 00:00:00 00:10:00 00:20:00 00:30:00 00:40:00
date
2000-01-01 -0.180811 0.672184 0.098536 -0.687126 -0.206245
2000-01-02 0.746777 0.630105 0.843879 -0.253666 1.337123
2000-01-03 1.325679 0.046904 0.291343 -0.467489 -0.531110
2000-01-04 -0.189141 -1.346146 1.378533 0.887792 2.957479
2000-01-05 -0.232299 -0.853726 -0.078214 -0.158410 0.782468
[5 rows x 5 columns]
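Applied to the 2-hourly series from the question, with the hour rather than the full time as the column index, the same flatten-then-pivot idea might look like the sketch below. It uses deterministic values instead of random ones so the result is easy to check; the name wide is illustrative.

```python
import numpy as np
import pandas as pd

# 60 points at 2-hour spacing cover exactly 5 days with 12 readings per day.
ts = pd.Series(
    np.arange(60.0),
    index=pd.date_range("2000-01-01", periods=60, freq="2h"),
)

# Flat frame with one row per observation, then pivot to date x hour.
df = pd.DataFrame(
    {"date": ts.index.date, "hour": ts.index.hour, "data": ts.to_numpy()}
)
wide = df.pivot(index="date", columns="hour", values="data")
print(wide.shape)
```

Each row of wide is one calendar date and each column one hour of the day (0, 2, ..., 22).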