Delete rows from an excel file in Python based on date - python-2.7

I have written this code that pulls an Excel file into Python, reorganizes the columns, and tries to remove any rows where the data in that row is older than 2012.
import pandas as pd
import numpy as np
import datetime
#Read the excel with the data from IHS
Coal_Exports = pd.read_excel("U:\drive\Coal Exports.xlsx")
#Rearrange the data by columns
Coal_Exports = pd.melt(Coal_Exports, id_vars= ["Export Country", "Import Country", "Coal type", "Date Last Updated", "Unit"],var_name="Date", value_name="value")
#set the Date as the Index
Coal_Exports = Coal_Exports.set_index(keys="Date")
#Delete pre 2012 Rows
Coal_Exports = Coal_Exports[(Coal_Exports.ix[0] > "01/01/2012")]
#Send the reformatted data to excel
Coal_Exports.to_excel ("U:\drive\Formatted Coal Exports.xlsx")
print "Done"
My issue is deleting the pre-2012 rows.
I get this error:
---------------------------------------------------------------------------
IndexingError Traceback (most recent call last)
<ipython-input-25-8eb1513ad7c2> in <module>()
14 #Delete pre 2012 Rows
15 #Coal_Exports = Coal_Exports.drop[(Coal_Exports.set_index(keys="Date")<('01/01/2012') )]
---> 16 Coal_Exports = Coal_Exports[(Coal_Exports.ix[0] > "01/01/2011")]
17
18
C:\Users\xxxxx\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
1989 if isinstance(key, (Series, np.ndarray, Index, list)):
1990 # either boolean or fancy integer index
-> 1991 return self._getitem_array(key)
1992 elif isinstance(key, DataFrame):
1993 return self._getitem_frame(key)
C:\Users\xxxxx\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.pyc in _getitem_array(self, key)
2029 # check_bool_indexer will throw exception if Series key cannot
2030 # be reindexed to match DataFrame rows
-> 2031 key = check_bool_indexer(self.index, key)
2032 indexer = key.nonzero()[0]
2033 return self.take(indexer, axis=0, convert=False)
C:\Users\xxxxx\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\indexing.pyc in check_bool_indexer(ax, key)
1799 mask = com.isnull(result._values)
1800 if mask.any():
-> 1801 raise IndexingError('Unalignable boolean Series key provided')
1802
1803 result = result.astype(bool)._values
IndexingError: Unalignable boolean Series key provided
The dates in the Excel file are in the format 31/02/2011 etc.
I think it's something to do with the format of dates in Excel being different from pandas, but I'm not sure. Can anyone help with this?
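One possible way to approach this, sketched under assumptions: .ix[0] selects the first row, so the comparison produces a boolean Series labelled by column names, which pandas cannot align with the row index. Comparing a real datetime column avoids both that and string comparison of dates. The small frame below is hypothetical stand-in data (made-up country names and values), and the day-first format string is assumed from the question.

```python
import pandas as pd

# Hypothetical stand-in for the wide Excel sheet: one column per date.
wide = pd.DataFrame({
    "Export Country": ["A", "B"],
    "Import Country": ["C", "D"],
    "01/06/2011": [1.0, 2.0],
    "01/06/2012": [3.0, 4.0],
    "01/06/2013": [5.0, 6.0],
})

# Reshape to one row per (country pair, date).
tidy = pd.melt(wide, id_vars=["Export Country", "Import Country"],
               var_name="Date", value_name="value")

# Parse the day-first date strings into real datetimes (format is an assumption).
tidy["Date"] = pd.to_datetime(tidy["Date"], format="%d/%m/%Y")

# Keep rows dated 2012 or later, then use the date as the index.
recent = tidy[tidy["Date"] >= pd.Timestamp("2012-01-01")].set_index("Date")
print(recent)
```

Filtering before calling set_index keeps the mask aligned with the rows it was computed from.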

Related

How to filter csv data to remove data after a specified year?

I am reading a CSV in Python with multiple columns.
The first column is the date, and I have to delete the rows that correspond to years before 2018.
time high low Volume Plot Rango
0 2017-12-22 25.17984 24.280560 970 0.329943 0.899280
1 2017-12-26 25.17984 23.381280 2579 1.057921 1.798560
2 2017-12-27 25.17984 23.381280 2499 0.998083 1.798560
3 2017-12-28 25.17984 24.280560 1991 0.919885 0.899280
4 2017-12-29 25.17984 24.100704 2703 1.237694 1.079136
.. ... ... ... ... ... ...
580 2020-04-16 5.45000 4.450000 117884 3.168380 1.000000
581 2020-04-17 5.35000 4.255200 58531 1.370538 1.094800
582 2020-04-20 4.66500 4.100100 25770 0.582999 0.564900
583 2020-04-21 4.42000 3.800000 20914 0.476605 0.620000
584 2020-04-22 4.22000 3.710100 23212 0.519275 0.509900
I want to delete the rows corresponding to years prior to 2018, so 2017, 2016, 2015... should be deleted.
I am trying this, but it does not work:
if 2017 in datos['time']: datos['time'].remove() #check if number 2017 is in each of the items of the column 'time'
The dates are recognized as numbers, not as datetime, but I think I do not need to declare them as datetime.
In pandas
Given your data (df is your DataFrame)
Use Boolean indexing
The time column must be in datetime64[ns] format
df.info() will give the dtypes
df['time'] = pd.to_datetime(df['time'])
df[df['time'].dt.year >= 2018]
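A minimal runnable sketch of the Boolean indexing described above. The frame is a small illustrative sample: the 2018 row is invented for the example, and only two of the question's columns are kept.

```python
import pandas as pd

# Illustrative sample; the 2018 row is made up for the example.
datos = pd.DataFrame({
    "time": ["2017-12-22", "2017-12-29", "2018-01-02", "2020-04-22"],
    "high": [25.17984, 25.17984, 24.0, 4.22],
})

# Convert the strings to datetime64 so the .dt accessor is available.
datos["time"] = pd.to_datetime(datos["time"])

# Boolean mask: True for rows dated 2018 or later.
filtered = datos[datos["time"].dt.year >= 2018]
print(filtered)
```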

How to apply operations on rows of a dataframe, but with variable columns affected?

I have a dataframe that gets read in from csv and has extraneous data.
Judgment on what is extraneous is made by evaluating one column, SystemStart.
For each row, any value in a column whose date heading is earlier than that row's SystemStart is set to NaN. For example, index = 'one' has a SystemStart date of '2016-1-5', so none of its values are replaced with NaN. index = 'three' has '2016-1-7' and hence has two NaN values replacing the original data.
I can go row-by-row and throw np.nan values at all columns, but that is slow. Is there a faster way?
I've created a representative dataframe below, and am looking to get the same result without iterative operations, or a way to speed up those operations. Any help would be greatly appreciated.
import pandas as pd
import numpy as np
start_date = '2016-1-05'
end_date = '2016-1-7'
dates = pd.date_range(start_date, end_date, freq='D')
dt_dates = pd.to_datetime(dates, unit='D')
ind = ['one', 'two', 'three']
df = pd.DataFrame(np.random.randint(0,100,size=(3, 3)), columns = dt_dates, index = ind)
df['SystemStart'] = pd.to_datetime(['2016-1-5', '2016-1-6', '2016-1-7'])
print 'Initial Dataframe: \n', df
for msn in df.index:
    # dates strictly before this row's SystemStart
    zero_date_range = pd.date_range(start_date, df.loc[msn,'SystemStart'] - pd.Timedelta(days=1), freq='D')
    # we set NaN for all of those columns in this row - this is a horribly slow way to do this
    df.loc[msn, zero_date_range] = np.nan
print '\nAltered Dataframe: \n', df
Below are the df outputs, Initial and Altered:
Initial Dataframe:
2016-01-05 00:00:00 2016-01-06 00:00:00 2016-01-07 00:00:00 \
one 24 23 65
two 21 91 59
three 62 77 2
SystemStart
one 2016-01-05
two 2016-01-06
three 2016-01-07
Altered Dataframe:
2016-01-05 00:00:00 2016-01-06 00:00:00 2016-01-07 00:00:00 \
one 24.0 23.0 65
two NaN 91.0 59
three NaN NaN 2
SystemStart
one 2016-01-05
two 2016-01-06
three 2016-01-07
First thing I do is make sure SystemStart is datetime
df.SystemStart = pd.to_datetime(df.SystemStart)
Then I strip out SystemStart to a separate series
st = df.SystemStart
Then I drop SystemStart from my df
d1 = df.drop('SystemStart', axis=1)
Then I convert the columns I have left to datetime
d1.columns = pd.to_datetime(d1.columns)
Finally I use numpy broadcasting to mask the appropriate cells and join SystemStart back in.
d1.where(d1.columns.values >= st.values[:, None]).join(st)

Difference between two dates in Pandas DataFrame

I have many columns in a data frame, and I need to find the time difference between two columns named in_time and out_time and store it in a new column in the same data frame.
The times are formatted like this: 2015-09-25T01:45:34.372Z.
I am using a pandas DataFrame.
I want to do something like this:
df.days = df.out_time - df.in_time
I have many columns, and I need to add one more column named days to hold the differences.
You need to convert the strings to datetime dtype. You can then subtract whatever date you want and call dt.days on the resulting series:
In [15]:
df = pd.DataFrame({'date':['2015-09-25T01:45:34.372Z']})
df
Out[15]:
date
0 2015-09-25T01:45:34.372Z
In [19]:
import datetime as dt
df['date'] = pd.to_datetime(df['date'])
df['day'] = (df['date'] - dt.datetime.now()).dt.days
df
Out[19]:
date day
0 2015-09-25 01:45:34.372 -252
Well, it all kinda depends on the time format you use. I'd recommend using datetime.
If in_time and out_time are currently strings, convert them with datetime.strptime():
from datetime import datetime
f = lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%fZ')
df.in_time = df.in_time.apply(f)
df.out_time = df.out_time.apply(f)
and then you can simply subtract them, and assign the result to a new column named 'days':
df['days'] = df.out_time - df.in_time
Example: (3 seconds and 1 day differences)
In[5]: df = pd.DataFrame({'in_time':['2015-09-25T01:45:34.372Z','2015-09-25T01:45:34.372Z'],
'out_time':['2015-09-25T01:45:37.372Z','2015-09-26T01:45:34.372Z']})
In[6]: df
Out[6]:
in_time out_time
0 2015-09-25T01:45:34.372Z 2015-09-25T01:45:37.372Z
1 2015-09-25T01:45:34.372Z 2015-09-26T01:45:34.372Z
In[7]: type(df.loc[0,'in_time'])
Out[7]: str
In[8]: df.in_time = df.in_time.apply(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%fZ'))
In[9]: df.out_time = df.out_time.apply(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%fZ'))
In[10]: df # notice that it looks almost the same, but the type is different
Out[10]:
in_time out_time
0 2015-09-25 01:45:34.372 2015-09-25 01:45:37.372
1 2015-09-25 01:45:34.372 2015-09-26 01:45:34.372
In[11]: type(df.loc[0,'in_time'])
Out[11]: pandas.tslib.Timestamp
And the creation of the new column:
In[12]: df['days'] = df.out_time - df.in_time
In[13]: df
Out[13]:
in_time out_time days
0 2015-09-25 01:45:34.372 2015-09-25 01:45:37.372 0 days 00:00:03
1 2015-09-25 01:45:34.372 2015-09-26 01:45:34.372 1 days 00:00:00
Now you can play with the output format. For example, the difference expressed in minutes:
In[14]: df.days = df.days.apply(lambda x: x.total_seconds()/60)
In[15]: df
Out[15]:
in_time out_time days
0 2015-09-25 01:45:34.372 2015-09-25 01:45:37.372 0.05
1 2015-09-25 01:45:34.372 2015-09-26 01:45:34.372 1440.00
Note: Regarding the in_time and out_time format, notice that I made some assumptions (for example, that you're using a 24H clock (thus using %H and not %I)). To play with the format have a look at: strptime() documentation.
Note2: It would obviously be better if you can design your program to use datetime from the beginning (instead of using strings and converting them).
First of all, you need to convert in_time and out_time columns to datetime type.
for col in ('in_time', 'out_time') : # Looping a tuple is faster than a list
    df[col] = pd.to_datetime(df[col])
You can check the type using dtypes:
df['in_time'].dtypes
Should give: datetime64[ns, UTC]
Now you can subtract them and get the time difference, using dt.days for whole days or dividing by np.timedelta64(1, 'D') from numpy for fractional days.
Example:
import numpy as np
df['days'] = (df['out_time'] - df['in_time']).dt.days
# Or
df['days'] = (df['out_time'] - df['in_time']) / np.timedelta64(1, 'D')
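For reference, a self-contained sketch of the approach, using the two sample rows from the earlier answer (differences of 3 seconds and 1 day). The extra minutes column and the delta variable are added here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "in_time": ["2015-09-25T01:45:34.372Z", "2015-09-25T01:45:34.372Z"],
    "out_time": ["2015-09-25T01:45:37.372Z", "2015-09-26T01:45:34.372Z"],
})

# Parse the ISO 8601 strings; utc=True makes the timezone handling explicit.
for col in ("in_time", "out_time"):
    df[col] = pd.to_datetime(df[col], utc=True)

delta = df["out_time"] - df["in_time"]          # timedelta64 Series

df["days"] = delta.dt.days                      # whole days only
df["minutes"] = delta.dt.total_seconds() / 60   # full difference in minutes
print(df[["days", "minutes"]])
```

Note that dt.days drops the sub-day remainder, so the 3-second row reports 0 days.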

Splitting time by the hour Python

I have a dataframe df1 like this, where StartTime and EndTime are datetime objects.
StartTime EndTime
9:08 9:10
9:10 9:35
9:35 9:55
9:55 10:10
10:10 10:20
If EndTime.hour is not the same as StartTime.hour, I would like to split the times like this:
StartTime EndTime
9:08 9:10
9:10 9:55
9:55 10:00
10:00 10:10
10:10 10:20
Essentially insert a row into the existing dataframe df1. I have looked at a ton of examples but haven't figured out how to do this. If my question isn't clear please let me know.
Thanks
This does what you want ...
# load your data into a DataFrame
data="""StartTime EndTime
9:08 9:10
9:10 9:35
9:35 9:55
9:55 10:10
10:10 10:20
"""
import pandas as pd
from StringIO import StringIO # import from io for Python 3
df = pd.read_csv(StringIO(data), header=0, sep=' ', index_col=None)
# convert strings to Pandas Timestamps (we will ignore the date bit) ...
import datetime as dt
df.StartTime = [dt.datetime.strptime(x, '%H:%M') for x in df.StartTime]
df.EndTime = [dt.datetime.strptime(x, '%H:%M') for x in df.EndTime]
# assumption - all intervals are less than 60 minutes
# - ie. no multi-hour intervals
# add rows
dfa = df[df.StartTime.dt.hour != df.EndTime.dt.hour].copy()
dfa.EndTime = [dt.datetime.strptime(str(x), '%H') for x in dfa.EndTime.dt.hour]
# play with the start hour ...
df.StartTime = df.StartTime.where(df.StartTime.dt.hour == df.EndTime.dt.hour,
other = [dt.datetime.strptime(str(x), '%H') for x in df.EndTime.dt.hour])
# bring back together and sort
df = pd.concat([df, dfa], axis=0) #top/bottom
df = df.sort_values('StartTime') # DataFrame.sort() was removed from pandas
# convert the Timestamps to times for easy reading
df.StartTime = [x.time() for x in df.StartTime]
df.EndTime = [x.time() for x in df.EndTime]
And yields
In [40]: df
Out[40]:
StartTime EndTime
0 09:08:00 09:10:00
1 09:10:00 09:35:00
2 09:35:00 09:55:00
3 09:55:00 10:00:00
3 10:00:00 10:10:00
4 10:10:00 10:20:00
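As an alternative sketch that does not rely on the "less than 60 minutes" assumption, the split can also be done with plain datetime objects. This is a different technique from the pandas answer above, and split_by_hour is a hypothetical helper name.

```python
from datetime import datetime, timedelta

def split_by_hour(start, end):
    """Split [start, end) into pieces that never cross an hour boundary."""
    pieces = []
    while start < end:
        # First instant of the hour following `start`.
        next_hour = start.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
        stop = min(end, next_hour)
        pieces.append((start, stop))
        start = stop
    return pieces

fmt = "%H:%M"
rows = [("9:08", "9:10"), ("9:10", "9:35"), ("9:35", "9:55"),
        ("9:55", "10:10"), ("10:10", "10:20")]

result = []
for s, e in rows:
    result.extend(split_by_hour(datetime.strptime(s, fmt), datetime.strptime(e, fmt)))

formatted = [(a.strftime(fmt), b.strftime(fmt)) for a, b in result]
for pair in formatted:
    print(pair)
```

Because the loop keeps cutting at each hour boundary, an interval such as 9:55 to 12:10 would come out as four pieces rather than two.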

Converting '4 days ago' etc. to the actual dates

I have a massive spreadsheet in which all dates are written this way:
2 days ago
9 days ago
34 days ago
54 days ago
etc.
Is there a clever Python way to convert these data to the actual dates, if I tell Python what date '1 day ago' is?
Use timedelta.
Extract the value from that string in your spreadsheet and then use
from datetime import date, timedelta
d = date.today() - timedelta(days=days_to_subtract)
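A minimal sketch of that idea, assuming every cell looks exactly like "N days ago" (or "1 day ago"). The helper name parse_days_ago and the reference date are illustrative choices, not part of the original answer.

```python
import re
from datetime import date, timedelta

def parse_days_ago(text, reference):
    """Turn a string like '34 days ago' into a date relative to `reference`."""
    match = re.match(r"\s*(\d+)\s+days?\s+ago\s*$", text)
    if match is None:
        raise ValueError("unrecognised value: %r" % (text,))
    return reference - timedelta(days=int(match.group(1)))

reference = date(2015, 3, 8)  # the date that counts as "0 days ago"
cells = ["2 days ago", "9 days ago", "34 days ago", "54 days ago"]
dates = [parse_days_ago(cell, reference) for cell in cells]
for d in dates:
    print(d)
```

Passing the reference date explicitly, rather than calling date.today(), keeps the conversion reproducible if the spreadsheet was exported on an earlier day.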
If the input date format may slightly vary (human input) then you could use parsedatetime module to parse human-readable date/time text into datetime objects:
#!/usr/bin/env python
import sys
from datetime import datetime
import parsedatetime # $ pip install parsedatetime
now = datetime(2015, 3, 8) # the reference date
cal = parsedatetime.Calendar()
for line in sys.stdin: # at most one date per line
    # parse_status is 0 when nothing could be parsed
    dt, parse_status = cal.parseDT(line, now)
    if parse_status > 0:
        print(dt)
Output
2015-03-06 00:00:00
2015-02-27 00:00:00
2015-02-02 00:00:00
2015-01-13 00:00:00