I have a dataframe df1 like this, where starttime and endtime are datetime objects.
StartTime EndTime
9:08 9:10
9:10 9:35
9:35 9:55
9:55 10:10
10:10 10:20
If endtime.hour is not the same as starttime.hour, I would like to split the interval at the hour boundary, like this:
StartTime EndTime
9:08 9:10
9:10 9:35
9:35 9:55
9:55 10:00
10:00 10:10
10:10 10:20
Essentially insert a row into the existing dataframe df1. I have looked at a ton of examples but haven't figured out how to do this. If my question isn't clear please let me know.
Thanks
This does what you want ...
# load your data into a DataFrame
data="""StartTime EndTime
9:08 9:10
9:10 9:35
9:35 9:55
9:55 10:10
10:10 10:20
"""
from io import StringIO # on Python 2 use: from StringIO import StringIO
import pandas as pd
df = pd.read_csv(StringIO(data), header=0, sep=' ', index_col=None)
# convert strings to Pandas Timestamps (we will ignore the date bit) ...
import datetime as dt
df.StartTime = [dt.datetime.strptime(x, '%H:%M') for x in df.StartTime]
df.EndTime = [dt.datetime.strptime(x, '%H:%M') for x in df.EndTime]
# assumption - all intervals are less than 60 minutes
# - ie. no multi-hour intervals
# add rows
dfa = df[df.StartTime.dt.hour != df.EndTime.dt.hour].copy()
dfa.EndTime = [dt.datetime.strptime(str(x), '%H') for x in dfa.EndTime.dt.hour]
# play with the start hour ...
df.StartTime = df.StartTime.where(df.StartTime.dt.hour == df.EndTime.dt.hour,
other = [dt.datetime.strptime(str(x), '%H') for x in df.EndTime.dt.hour])
# bring back together and sort
df = pd.concat([df, dfa], axis=0) #top/bottom
df = df.sort_values('StartTime') # DataFrame.sort() was removed in pandas 0.20
# convert the Timestamps to times for easy reading
df.StartTime = [x.time() for x in df.StartTime]
df.EndTime = [x.time() for x in df.EndTime]
And yields
In [40]: df
Out[40]:
StartTime EndTime
0 09:08:00 09:10:00
1 09:10:00 09:35:00
2 09:35:00 09:55:00
3 09:55:00 10:00:00
3 10:00:00 10:10:00
4 10:10:00 10:20:00
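The same idea can also be sketched without strptime by flooring EndTime to the hour. This is only a sketch under the same assumption as above (every interval is shorter than 60 minutes); the helper name split_at_hour and the variables raw, times and result are made up for illustration and are not part of the original answer.

```python
import pandas as pd


def split_at_hour(df):
    """Split each interval that crosses an hour boundary into two rows.

    Assumes every interval is shorter than 60 minutes, so it can cross
    at most one hour boundary.
    """
    crosses = df["StartTime"].dt.hour != df["EndTime"].dt.hour
    boundary = df["EndTime"].dt.floor("h")  # top of the hour the interval ends in

    first = df.copy()
    first.loc[crosses, "EndTime"] = boundary[crosses]  # start .. hh:00

    second = df[crosses].copy()
    second["StartTime"] = boundary[crosses]  # hh:00 .. end

    out = pd.concat([first, second]).sort_values(["StartTime", "EndTime"])
    # an interval ending exactly on the hour would leave an empty second half
    out = out[out["StartTime"] < out["EndTime"]]
    return out.reset_index(drop=True)


raw = pd.DataFrame({
    "StartTime": ["9:08", "9:10", "9:35", "9:55", "10:10"],
    "EndTime": ["9:10", "9:35", "9:55", "10:10", "10:20"],
})
# parse the strings as timestamps; the date part is a dummy and is ignored
times = raw.apply(lambda col: pd.to_datetime(col, format="%H:%M"))
result = split_at_hour(times)
print(result.apply(lambda col: col.dt.time))
```

Resetting the index at the end avoids the duplicated index label (3, 3) seen in the output above. Multi-hour intervals would need a loop or a date_range per row, which this sketch does not attempt.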
Related
I have a dataframe as follows; the index is a datetime (every Friday).
begin close
date
2014-1-10 1.0 2.5
2014-1-17 2.6 2.6
........................
2016-12-30 3.5 3.8
2017-6-16 4.5 4.7
I want to extract the previous 2 years of data, counting back from 2017-6-16. My code is as follows.
import datetime
from dateutil.relativedelta import relativedelta
df_index = df.index
df_index_test = df_index[-1] - relativedelta(years=2)
df_test = df[df_index_test:-1]
But this seems to be wrong, since the date in df_index_test may not be in the dataframe.
Thanks!
You need boolean indexing. Instead of relativedelta, you can also use DateOffset:
df_test = df[df.index >= df_index_test]
Sample:
rng = pd.date_range('2001-04-03', periods=10, freq='15M')
df = pd.DataFrame({'a': range(10)}, index=rng)
print (df)
a
2001-04-30 0
2002-07-31 1
2003-10-31 2
2005-01-31 3
2006-04-30 4
2007-07-31 5
2008-10-31 6
2010-01-31 7
2011-04-30 8
2012-07-31 9
df_test = df[df.index >= df.index[-1] - pd.offsets.DateOffset(years=2)]
print (df_test)
a
2011-04-30 8
2012-07-31 9
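For the weekly Friday index from the question, the same approach might look like the sketch below. The column values are made up and the names cutoff, recent and recent_slice are only for illustration. The point is that the cutoff date does not have to exist in the index for the comparison to work.

```python
import pandas as pd

# weekly Friday index, as described in the question (values are made up)
idx = pd.date_range("2014-01-10", "2017-06-16", freq="W-FRI")
df = pd.DataFrame({"close": range(len(idx))}, index=idx)

# two years before the last date; this day is not a Friday,
# so it does not appear in the index
cutoff = df.index[-1] - pd.DateOffset(years=2)

# the boolean mask only compares values, so a missing label is fine
recent = df[df.index >= cutoff]

# on a sorted DatetimeIndex, label slicing should select the same rows
recent_slice = df.loc[cutoff:]

print(recent.index[0], recent.index[-1])
```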
I have a dataframe that gets read in from csv and has extraneous data.
Judgment on what is extraneous is made by evaluating one column, SystemStart.
For each row, any value in a column whose date heading is earlier than that row's SystemStart is set to NaN. For example, index 'one' has a SystemStart of '2016-1-5', so no values are replaced; index 'three' has '2016-1-7', so two values are replaced with NaN.
I can go row-by-row and throw np.nan values at all columns, but that is slow. Is there a faster way?
I've created a representative dataframe below, and am looking to get the same result without iterative operations, or a way to speed up those operations. Any help would be greatly appreciated.
import pandas as pd
import numpy as np
start_date = '2016-1-05'
end_date = '2016-1-7'
dates = pd.date_range(start_date, end_date, freq='D')
dt_dates = pd.to_datetime(dates, unit='D')
ind = ['one', 'two', 'three']
df = pd.DataFrame(np.random.randint(0,100,size=(3, 3)), columns = dt_dates, index = ind)
df['SystemStart'] = pd.to_datetime(['2016-1-5', '2016-1-6', '2016-1-7'])
print('Initial Dataframe: \n', df)
for msn in df.index:
    zero_date_range = pd.date_range(start_date, df.loc[msn,'SystemStart'] - pd.Timedelta(days=1), freq='D')
    # we set NaN for all columns in the index element in question - this is a horribly slow way to do this
    df.loc[msn, zero_date_range] = np.nan
print('\nAltered Dataframe: \n', df)
Below are the df outputs, Initial and Altered:
Initial Dataframe:
2016-01-05 00:00:00 2016-01-06 00:00:00 2016-01-07 00:00:00 \
one 24 23 65
two 21 91 59
three 62 77 2
SystemStart
one 2016-01-05
two 2016-01-06
three 2016-01-07
Altered Dataframe:
2016-01-05 00:00:00 2016-01-06 00:00:00 2016-01-07 00:00:00 \
one 24.0 23.0 65
two NaN 91.0 59
three NaN NaN 2
SystemStart
one 2016-01-05
two 2016-01-06
three 2016-01-07
First thing I do is make sure SystemStart is datetime
df.SystemStart = pd.to_datetime(df.SystemStart)
Then I strip out SystemStart to a separate series
st = df.SystemStart
Then I drop SystemStart from my df
d1 = df.drop('SystemStart', axis=1)
Then I convert the columns I have left to datetime
d1.columns = pd.to_datetime(d1.columns)
Finally I use numpy broadcasting to mask the appropriate cells and join SystemStart back in.
d1.where(d1.columns.values >= st.values[:, None]).join(st)
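To see what the broadcasting step does, here is a small self-contained sketch with made-up numbers. The names keep and masked are only for illustration. The column dates are laid out as a row vector and the SystemStart values as a column vector, so the comparison yields one boolean per cell.

```python
import numpy as np
import pandas as pd

cols = pd.date_range("2016-01-05", "2016-01-07", freq="D")
d1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=cols,
                  index=["one", "two", "three"])
st = pd.Series(pd.to_datetime(["2016-01-05", "2016-01-06", "2016-01-07"]),
               index=d1.index, name="SystemStart")

# shape (1, 3) compared with shape (3, 1) broadcasts to shape (3, 3):
# keep[i, j] is True when column date j is on or after row i's SystemStart
keep = d1.columns.values[None, :] >= st.values[:, None]

# cells where keep is False become NaN, in one vectorised step
masked = d1.where(keep)
print(masked.join(st))
```

No Python-level loop over rows is involved, which is where the speedup over the row-by-row version would come from.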
I have many columns in a data frame, and I need to find the time difference between two columns named in_time and out_time and put it in a new column in the same data frame.
The format of time is like this 2015-09-25T01:45:34.372Z.
I am using Pandas DataFrame.
I want to do like this:
df.days = df.out_time - df.in_time
I have many columns, and I want to add one more column named days and put the differences there.
You need to convert the strings to datetime dtype, you can then subtract whatever arbitrary date you want and on the resulting series call dt.days:
In [15]:
df = pd.DataFrame({'date':['2015-09-25T01:45:34.372Z']})
df
Out[15]:
date
0 2015-09-25T01:45:34.372Z
In [19]:
df['date'] = pd.to_datetime(df['date'])
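# the trailing 'Z' makes the column tz-aware (UTC) in current pandas, so subtract a UTC timestamp
df['day'] = (df['date'] - pd.Timestamp.now(tz='UTC')).dt.days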
df
Out[19]:
date day
0 2015-09-25 01:45:34.372 -252
Well, it all kinda depends on the time format you use. I'd recommend using datetime.
If in_time and out_time are currently strings, convert them with datetime.strptime():
from datetime import datetime
f = lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%fZ')
df.in_time = df.in_time.apply(f)
df.out_time = df.out_time.apply(f)
and then you can simply subtract them, and assign the result to a new column named 'days':
df['days'] = df.out_time - df.in_time
Example: (3 seconds and 1 day differences)
In[5]: df = pd.DataFrame({'in_time':['2015-09-25T01:45:34.372Z','2015-09-25T01:45:34.372Z'],
'out_time':['2015-09-25T01:45:37.372Z','2015-09-26T01:45:34.372Z']})
In[6]: df
Out[6]:
in_time out_time
0 2015-09-25T01:45:34.372Z 2015-09-25T01:45:37.372Z
1 2015-09-25T01:45:34.372Z 2015-09-26T01:45:34.372Z
In[7]: type(df.loc[0,'in_time'])
Out[7]: str
In[8]: df.in_time = df.in_time.apply(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%fZ'))
In[9]: df.out_time = df.out_time.apply(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%fZ'))
In[10]: df # notice that it looks exactly the same, but the type is different
Out[10]:
in_time out_time
0 2015-09-25 01:45:34.372 2015-09-25 01:45:37.372
1 2015-09-25 01:45:34.372 2015-09-26 01:45:34.372
In[11]: type(df.loc[0,'in_time'])
Out[11]: pandas.tslib.Timestamp
And the creation of the new column:
In[12]: df['days'] = df.out_time - df.in_time
In[13]: df
Out[13]:
in_time out_time days
0 2015-09-25 01:45:34.372 2015-09-25 01:45:37.372 0 days 00:00:03
1 2015-09-25 01:45:34.372 2015-09-26 01:45:34.372 1 days 00:00:00
Now you can play with the output format. For example, the difference in minutes:
In[14]: df.days = df.days.apply(lambda x: x.total_seconds()/60)
In[15]: df
Out[15]:
in_time out_time days
0 2015-09-25 01:45:34.372 2015-09-25 01:45:37.372 0.05
1 2015-09-25 01:45:34.372 2015-09-26 01:45:34.372 1440.00
Note: Regarding the in_time and out_time format, notice that I made some assumptions (for example, that you're using a 24H clock (thus using %H and not %I)). To play with the format have a look at: strptime() documentation.
Note2: It would obviously be better if you can design your program to use datetime from the beginning (instead of using strings and converting them).
First of all, you need to convert in_time and out_time columns to datetime type.
for col in ('in_time', 'out_time') : # Looping a tuple is faster than a list
    df[col] = pd.to_datetime(df[col])
You can check the type using dtypes:
df['in_time'].dtypes
Should give: datetime64[ns, UTC]
Now you can subtract them and get the time difference using dt.days, or using np.timedelta64 from numpy.
Example:
import numpy as np
df['days'] = (df['out_time'] - df['in_time']).dt.days
# Or
df['days'] = (df['out_time'] - df['in_time']) / np.timedelta64(1, 'D')
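The two options above do not give the same numbers, so it may help to see them side by side. This is a small sketch with made-up timestamps; the column names whole_days, fractional_days and minutes are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "in_time": ["2015-09-25T01:45:34.372Z", "2015-09-25T01:45:34.372Z"],
    "out_time": ["2015-09-25T01:45:37.372Z", "2015-09-26T13:45:34.372Z"],
})
for col in ("in_time", "out_time"):
    df[col] = pd.to_datetime(df[col])

delta = df["out_time"] - df["in_time"]  # a timedelta64 Series

df["whole_days"] = delta.dt.days                       # days component only
df["fractional_days"] = delta / pd.Timedelta(days=1)   # float number of days
df["minutes"] = delta.dt.total_seconds() / 60          # any other unit
print(df[["whole_days", "fractional_days", "minutes"]])
```

Here a gap of one day and twelve hours shows up as 1 in whole_days but 1.5 in fractional_days, so which one to use depends on whether partial days matter.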
I have several dataframes which have the same look but different data.
DataFrame 1
bid
close
time
2016-05-24 00:00:00 NaN
2016-05-24 00:05:00 0.000611
2016-05-24 00:10:00 -0.000244
2016-05-24 00:15:00 -0.000122
DataFrame 2
bid
close
time
2016-05-24 00:00:00 NaN
2016-05-24 00:05:00 0.000811
2016-05-24 00:10:00 -0.000744
2016-05-24 00:15:00 -0.000322
I need to build a list of the dataframes, then pass that list to a function that converts it to a numpy array. So below, each row in the matrix holds the elements of one dataframe's ('bid', 'close') column. Notice I don't need the index 'time' column.
data = np.array([dataFrames])
returns this (example not actual data)
[[-0.00114415 0.02502565 0.00507831 ..., 0.00653057 0.02183072
-0.00194293] `DataFrame` 1 is here ignore that the data doesn't match above
[-0.01527224 0.02899528 -0.00327654 ..., 0.0322364 0.01821731
-0.00766773] `DataFrame` 2 is here ignore that the data doesn't match above
....]]
Try
master_matrix = pd.concat(list_of_dfs, axis=1)
master_matrix = master_matrix.values.reshape(master_matrix.shape, order='F')
if each row in the final matrix corresponds to the same date
master_matrix = pd.concat(list_of_dfs, axis=1).values
otherwise.
Edit to address the newly added example.
In this case, you can use np.vstack on columns returned from each dataframe.
import pandas as pd
import numpy as np
from io import StringIO
df1 = pd.read_csv(StringIO(
'''
time bid_close
2016-05-24 00:00:00 NaN
2016-05-24 00:05:00 0.000611
2016-05-24 00:10:00 -0.000244
2016-05-24 00:15:00 -0.000122
'''), sep=r' +')
df2 = pd.read_csv(StringIO(
'''
time bid_close
2016-05-24 00:00:00 NaN
2016-05-24 00:05:00 0.000811
2016-05-24 00:10:00 -0.000744
2016-05-24 00:15:00 -0.000322
'''), sep=r' +')
dfs = [df1, df2]
out = np.vstack([df.iloc[:,-1].values for df in dfs]) # np.vstack needs a sequence, not a generator
Result:
In [10]: q.out
Out[10]:
array([[ nan, 0.000611, -0.000244, -0.000122],
[ nan, 0.000811, -0.000744, -0.000322]])
Setup
import pandas as pd
import numpy as np
df1 = pd.DataFrame([1, 2, 3, 4],
index=pd.date_range('2016-04-01', periods=4),
columns=pd.MultiIndex.from_tuples([('bid', 'close')]))
df2 = pd.DataFrame([5, 6, 7, 8],
index=pd.date_range('2016-03-01', periods=4),
columns=pd.MultiIndex.from_tuples([('bid', 'close')]))
print(df1)
bid
close
2016-04-01 1
2016-04-02 2
2016-04-03 3
2016-04-04 4
print(df2)
bid
close
2016-03-01 5
2016-03-02 6
2016-03-03 7
2016-03-04 8
Solution
df = np.concatenate([d.T.values for d in [df1, df2]])
print(df)
[[1 2 3 4]
[5 6 7 8]]
Note
The indices were not required to line up. This just takes the raw np.array from each dataframe and uses np.concatenate to do the rest.
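Since the question asks for a function that takes a list of dataframes, the idea above could be wrapped up roughly like this. The function name frames_to_matrix and its column parameter are made up for illustration, and the sketch assumes every dataframe has the same number of rows.

```python
import numpy as np
import pandas as pd


def frames_to_matrix(frames, column=("bid", "close")):
    """Stack one column from each dataframe into a 2-D array, one row per frame.

    The index of each frame is ignored, so the dates need not line up.
    Assumes all frames have the same length; otherwise np.vstack raises.
    """
    return np.vstack([frame[column].to_numpy() for frame in frames])


cols = pd.MultiIndex.from_tuples([("bid", "close")])
df1 = pd.DataFrame([1, 2, 3, 4],
                   index=pd.date_range("2016-04-01", periods=4), columns=cols)
df2 = pd.DataFrame([5, 6, 7, 8],
                   index=pd.date_range("2016-03-01", periods=4), columns=cols)

matrix = frames_to_matrix([df1, df2])
print(matrix)
```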
I have a model as follows:
class WorkTime(models.Model):
    person = models.ForeignKey(Person)
    entry_date = models.DateField(default=datetime.datetime.now(), verbose_name='date')
    start_time = models.TimeField(verbose_name='start')
    end_time = models.TimeField(verbose_name='end')
With data as follows:
person, date, start, end
1 01/01/2014 08:00 12:00
1 01/01/2014 13:00 18:00
1 02/01/2014 08:00 12:00
1 02/01/2014 13:00 18:00
1 03/01/2014 08:00 16:00
1 01/02/2014 08:30 12:00
1 01/02/2014 13:00 18:00
2 01/01/2014 09:00 13:00
2 01/01/2014 14:00 18:00
How would one sum up the time delta (i.e. end_time - start_time) and GROUP BY person to show the hours worked by person, as below?
person, hours
1 34:30
2 08:00
I don't know a good way to do this in the ORM. It's pretty straightforward to build a dictionary by iterating through a queryset.
from collections import defaultdict
from datetime import datetime, timedelta
totals = defaultdict(timedelta) # missing keys start at a zero timedelta
work_times = WorkTime.objects.all() # assuming all rows are wanted; the original call was left incomplete
for work_time in work_times:
    totals[work_time.person] += datetime.combine(work_time.entry_date, work_time.end_time) - datetime.combine(work_time.entry_date, work_time.start_time)
# print results
for person, total_time in totals.items():
    # total_time will be a timedelta, you can do some more work to return hours and minutes
    print(person, total_time)
Depending on your usage, the performance of this might be good enough.
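The "more work" to get hours and minutes could look something like the sketch below. It uses plain tuples instead of a queryset so it runs without Django; the names rows, total_by_person and format_hhmm are made up for illustration, and the dates are read as dd/mm/yyyy, which is an assumption (it does not affect the totals).

```python
from collections import defaultdict
from datetime import date, datetime, time, timedelta

# (person, entry_date, start_time, end_time), copied from the question's data
rows = [
    (1, date(2014, 1, 1), time(8, 0), time(12, 0)),
    (1, date(2014, 1, 1), time(13, 0), time(18, 0)),
    (1, date(2014, 1, 2), time(8, 0), time(12, 0)),
    (1, date(2014, 1, 2), time(13, 0), time(18, 0)),
    (1, date(2014, 1, 3), time(8, 0), time(16, 0)),
    (1, date(2014, 2, 1), time(8, 30), time(12, 0)),
    (1, date(2014, 2, 1), time(13, 0), time(18, 0)),
    (2, date(2014, 1, 1), time(9, 0), time(13, 0)),
    (2, date(2014, 1, 1), time(14, 0), time(18, 0)),
]


def total_by_person(rows):
    """Sum end - start per person; time objects need a date to be subtracted."""
    totals = defaultdict(timedelta)
    for person, day, start, end in rows:
        totals[person] += datetime.combine(day, end) - datetime.combine(day, start)
    return dict(totals)


def format_hhmm(delta):
    """Render a timedelta as HH:MM, letting hours run past 24."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return "{:02d}:{:02d}".format(hours, minutes)


totals = total_by_person(rows)
for person, total in sorted(totals.items()):
    print(person, format_hhmm(total))
```

Formatting by hand matters here because str() on a timedelta longer than 24 hours switches to a "1 day, 10:30:00" style, which is not the 34:30 the question asks for.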
This has been a while, but maybe someone will stumble upon this.
You can do it this way:
from django.db.models import F, Sum
queryset = queryset.values('person').annotate(sum_delta=Sum(F('end_time')-F('start_time'))).order_by('person__id')
This will sum up all timedeltas for each person, and you don't have to store extra stuff in the db.