Splitting a column that contains multiple date formats - python-2.7

I have a csv file that contains a column with multiple date formats. I need to split them and get the extracted result in the same format.
Wednesday 12 August 2015
Wednesday 12 August 2015
Friday April 1 2016
Friday April 1 2016
5/12/2016
5/12/2016
This is the file, and I want the dates in mm/dd/yy format. My code is as follows:
import re
import csv
import pandas as pd
#delimiters = " ", "/"
#f = open('merged_34.csv')
f = open('test3.csv')
df = pd.read_csv('test3.csv')
newDate = []  # collects the converted values; must exist before the loop appends to it
for item in df['serverDatePrettyFirstAction']:
    if '/' in item:
        # already in slash-separated form, keep as-is
        newDate.append(item)
    else:
        # drop the leading weekday name
        item = item.split(' ', 1)[1]
        newDate.append(item)
df['newDate'] = newDate
df.to_csv('D:/Python/10.36.202.64/newfile.csv', index = False)
And this is what I get:
serverDatePrettyFirstAction    newDate
Wednesday 12 August 2015       12-Aug-15
Wednesday 12 August 2015       12-Aug-15
Friday April 1 2016            April 1 2016
Friday April 1 2016            April 1 2016
5/12/2016                      5/12/2016
5/12/2016                      5/12/2016
Also, is there a way to overwrite the values in the same column itself?

A faster approach would be to use the pandas function to_datetime():
In [2]: df
Out[2]:
                       Date
0  Wednesday 12 August 2015
1  Wednesday 12 August 2015
2       Friday April 1 2016
3       Friday April 1 2016
4                 5/12/2016
5                 5/12/2016
In [6]: df['newDate'] = pd.to_datetime(df['Date'])
Result:
In [7]: df
Out[7]:
                       Date    newDate
0  Wednesday 12 August 2015 2015-08-12
1  Wednesday 12 August 2015 2015-08-12
2       Friday April 1 2016 2016-04-01
3       Friday April 1 2016 2016-04-01
4                 5/12/2016 2016-05-12
5                 5/12/2016 2016-05-12

You can use the third-party dateutil library, as long as your data is not too big (after all, it guesses the format on every call).
import pandas as pd
from dateutil import parser
df = pd.read_csv('test3.csv')
df['newDate'] = df['serverDatePrettyFirstAction'].apply(parser.parse)
df.to_csv('newfile.csv', index=False, date_format='%Y-%m-%d')
To overwrite the values in the same column, assign the result back to that column:
df['serverDatePrettyFirstAction'] = df['serverDatePrettyFirstAction'].apply(parser.parse)
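If the formats in the column are known in advance, an alternative to letting a parser guess is to try each format explicitly with the standard library. The following is only a minimal sketch and not part of the answers above: the helper name to_mmddyy and the FORMATS tuple are hypothetical, and it assumes that slash-separated values such as 5/12/2016 mean month/day/year and that weekday and month names are in English.

```python
from datetime import datetime

# Candidate formats, assumed from the three sample values in the question.
FORMATS = (
    "%A %d %B %Y",  # Wednesday 12 August 2015
    "%A %B %d %Y",  # Friday April 1 2016
    "%m/%d/%Y",     # 5/12/2016 (assumed to be month/day/year)
)

def to_mmddyy(text):
    """Return text reformatted as mm/dd/yy, trying each known format in turn."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%m/%d/%y")
        except ValueError:
            continue  # this format did not match, try the next one
    raise ValueError("unrecognised date format: %r" % text)

samples = ["Wednesday 12 August 2015", "Friday April 1 2016", "5/12/2016"]
converted = [to_mmddyy(s) for s in samples]
```

With pandas, the same helper could be applied with df['serverDatePrettyFirstAction'].apply(to_mmddyy) and assigned back to the same column to overwrite it.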

Related

How to create a column that enumerates the month based on its month number?

So I have the following data table:
Month      Month##  id
November   11       BC221
July       7        1232SAD
August     8        DSAGD323
December   12       OKSDF93
October    10       OPAFSD83
September  9        POWER928
August     8        DSAGD323
December   12       DASF32
October    10       HSKJFH73264
September  9        9812973HJKSDF
I want to create a new column that enumerates/ranks the months in ascending order, like this:
Month      Month##  id             rnk
November   11       BC221          5
July       7        1232SAD        1
August     8        DSAGD323       2
December   12       OKSDF93        6
October    10       OPAFSD83       4
September  9        POWER928       3
August     8        DSAGD323       2
December   12       DASF32         6
October    10       HSKJFH73264    4
September  9        9812973HJKSDF  3
So as you can see above July is going to be the first, August the second, September the third and so on. How can I achieve this?
The easiest way is simple: subtract 6 from the month number. If the result is greater than 0, keep it; otherwise, add 6 to the original month number.
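As a rough illustration of that rule, here is a short Python sketch. The function name month_rank is hypothetical, and the month numbers are copied from the table in the question. Note that the rule gives each month a fixed position in a year that starts in July; it matches the requested rnk column here because the data contains every month from July to December.

```python
def month_rank(month_number):
    """Rank a calendar month (1-12) so that July is 1 and June is 12."""
    shifted = month_number - 6
    # Keep the shifted value when positive, otherwise add 6 to the original.
    return shifted if shifted > 0 else month_number + 6

# Month numbers from the table in the question, in row order.
months = [11, 7, 8, 12, 10, 9, 8, 12, 10, 9]
ranks = [month_rank(m) for m in months]

# The same mapping written with modular arithmetic.
ranks_mod = [(m - 7) % 12 + 1 for m in months]
```

Both lists come out as 5, 1, 2, 6, 4, 3, 2, 6, 4, 3, which is the rnk column shown in the question.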

How to remove duplicates in SAS data

I am trying to delete the observations in my data set that are the same across multiple variables.
For example
PIN  Start Date     End Date
1    Jan 1 2014     Jan 3 2014
1    Jan 1 2014     Jan 3 2015
3    March 2 2014   March 5 2014
4    July 1 2014    July 8 2014
5    July 1 2014    July 8 2014
6    August 9 2014  August 24 2014
I would want to remove those with the same PIN and Start Date.
Translate the string dates into SAS dates first.
data have2;
    set have(rename=(start_date = _start_date
                     end_date   = _end_date));
    /* width 20 so that longer strings such as "August 24 2014" are read in full */
    start_date = input(strip(_start_date), anydtdte20.);
    end_date   = input(strip(_end_date), anydtdte20.);
    format start_date end_date date9.;
    drop _start_date _end_date;
run;
Then use proc sort nodupkey.
proc sort data=have2 nodupkey;
    by pin start_date;
run;

Pandas add multiple new columns at once from list of lists

I have a list of timestamp lists where each inner list looks like this:
['Tue', 'Feb', '7', '10:07:40', '2017']
Is it possible with Pandas to add five new columns at the same time to an already created dataframe (same length as the outer list), that are equal to each of these values, with names 'day','month','date','time','year'?
I think you can use DataFrame constructor with concat:
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})
L = [['Tue', 'Feb', '7', '10:07:40', '2017'],
     ['Tue', 'Feb', '7', '10:07:40', '2017'],
     ['Tue', 'Feb', '7', '10:07:40', '2017']]
cols = ['day','month','date','time','year']
df1 = pd.DataFrame(L, columns=cols)
print (df1)
   day month date      time  year
0  Tue   Feb    7  10:07:40  2017
1  Tue   Feb    7  10:07:40  2017
2  Tue   Feb    7  10:07:40  2017
df2 = pd.concat([df, df1], axis=1)
print (df2)
   A  B  C  day month date      time  year
0  1  4  7  Tue   Feb    7  10:07:40  2017
1  2  5  8  Tue   Feb    7  10:07:40  2017
2  3  6  9  Tue   Feb    7  10:07:40  2017
One liner:
df2 = pd.concat([df, pd.DataFrame(L, columns=['day','month','date','time','year'])], axis=1)
print (df2)
   A  B  C  day month date      time  year
0  1  4  7  Tue   Feb    7  10:07:40  2017
1  2  5  8  Tue   Feb    7  10:07:40  2017
2  3  6  9  Tue   Feb    7  10:07:40  2017

How to obtain the "time" values of a schedule

Assume a fixed-rate bond with the schedule shown in the sample code below.
I am able to obtain the number of days between the tenors by using the businessDaysBetween function.
Now I would like the "time value". Is there a way of doing it without creating a new function?
Here is the expected result:
May 14th, 2012 .5
November 14th, 2012 .5
May 14th, 2013 .5
November 14th, 2013 .5
May 14th, 2014 .5
November 14th, 2014 .5
May 14th, 2015 .5
November 16th, 2015 .505556
May 16th, 2016 .5
November 14th, 2016 .49444
Here is the code:
from QuantLib import *
import pandas as pd
effective_date = Date(14, 11, 2011)
termination_date = Date(14, 11, 2016)
tenor = Period(Semiannual)
calendar = UnitedStates()
business_convention = ModifiedFollowing
termination_business_convention = Following
date_generation = DateGeneration.Forward
end_of_month = False
day_count = Thirty360()
schedule = Schedule(effective_date,
                    termination_date,
                    tenor,
                    calendar,
                    business_convention,
                    termination_business_convention,
                    date_generation,
                    end_of_month)
t = []
for i, d in enumerate(schedule):
    tmp = i+1, d,
    t.append(tmp)
df = pd.DataFrame(t,columns = ['tenorNo','tenorDate'])
nbDays = []
for x in df['tenorNo'] :
    if x == 1:
        tmp = 0
    else:
        tmp = calendar.businessDaysBetween(df['tenorDate'][x-2],df['tenorDate'][x-1])
    nbDays.append(tmp)
df['nbDays'] = nbDays
print df
    tenorNo            tenorDate  nbDays
0         1  November 14th, 2011       0
1         2       May 14th, 2012     125
2         3  November 14th, 2012     127
3         4       May 14th, 2013     124
4         5  November 14th, 2013     127
5         6       May 14th, 2014     124
6         7  November 14th, 2014     127
7         8       May 14th, 2015     124
8         9  November 16th, 2015     127
9        10       May 16th, 2016     125
10       11  November 14th, 2016     125
That's what DayCounter instances are for. The time will depend on the day-count convention you choose (for example, you seem to be using 30/360).
Calling
day_count.yearFraction(date1, date2)
will return the time between date1 and date2.
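To see where the figures in the expected result come from, the last three can be reproduced by hand. The sketch below is a plain-Python approximation of the 30/360 (bond basis) rule, not QuantLib's implementation, and the function names are hypothetical. With QuantLib itself, the same numbers would come from calling day_count.yearFraction on each pair of consecutive schedule dates.

```python
from datetime import date

def days_30_360(start, end):
    """Day count under a simplified 30/360 bond-basis rule (an assumption)."""
    d1 = min(start.day, 30)          # a start day of 31 is treated as 30
    d2 = end.day
    if d2 == 31 and d1 == 30:        # an end day of 31 is capped only after a 30
        d2 = 30
    return (360 * (end.year - start.year)
            + 30 * (end.month - start.month)
            + (d2 - d1))

def year_fraction_30_360(start, end):
    """Time between two dates, in years, on a 360-day year."""
    return days_30_360(start, end) / 360.0

# Last four schedule dates from the question's output.
schedule_tail = [date(2015, 5, 14), date(2015, 11, 16),
                 date(2016, 5, 16), date(2016, 11, 14)]
fractions = [year_fraction_30_360(a, b)
             for a, b in zip(schedule_tail, schedule_tail[1:])]
```

The three periods count 182, 180 and 178 days, which on a 360-day year gives roughly .505556, .5 and .49444, in line with the expected result in the question.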

How do I get coupon payment dates for a simple fixed bond using quantlib, quantlib-swig and python

I am trying to learn quantlib (1.3) and its Python bindings using quantlib-swig (1.2) on Ubuntu 13.04. As a starter, I am trying to determine the payment dates for a very simple bond, as given below, using the 30/360 European day counter.
from QuantLib import *
faceValue = 100.0
doi = Date(31, August, 2000)
dom = Date(31, August, 2008)
coupons = [0.05]
dayCounter = Thirty360(Thirty360.European)
schedule = Schedule(doi, dom, Period(Semiannual),
                    India(),
                    Unadjusted, Unadjusted,
                    DateGeneration.Backward, False)
Following are my questions:
Which method of schedule object will give me the payment dates?
Where do I need to specify the dayCounter object so that the dates are appropriately calculated?
Using Dimitri Reiswich's presentation, I tried mimicking the C++ code, but schedule.dates() raises an error saying there is no such method.
The payment dates for this fixed-rate bond (obtained using oocalc) are:
Feb 28, 2001; Aug 31, 2001
Feb 28, 2002; Aug 31, 2002
Feb 28, 2003; Aug 31, 2003
Feb 29, 2004; Aug 31, 2004
Feb 28, 2005; Aug 31, 2005
Feb 28, 2006; Aug 31, 2006
Feb 28, 2007; Aug 31, 2007
Feb 29, 2008; Aug 31, 2008
How do I get the payment dates for this simple bond using python & quantlib? Can someone please help?
If you want to look at the schedule you just generated, you can iterate over it:
>>> for d in schedule: print d
...
August 31st, 2000
February 28th, 2001
August 31st, 2001
February 28th, 2002
August 31st, 2002
February 28th, 2003
August 31st, 2003
February 29th, 2004
August 31st, 2004
February 28th, 2005
August 31st, 2005
February 28th, 2006
August 31st, 2006
February 28th, 2007
August 31st, 2007
February 29th, 2008
August 31st, 2008
or call list(schedule) if you want to store them. However, are you sure that those are the payment dates? They are the start and end date for accrual calculation; but some of these fall on a Saturday or a Sunday, and the bond will be paying on the next business day. You can see the effect if you instantiate the bond and retrieve the coupons:
>>> settlement_days = 3
>>> bond = FixedRateBond(settlement_days, faceValue, schedule, coupons, dayCounter)
>>> for c in bond.cashflows():
...     print c.date()
...
February 28th, 2001
August 31st, 2001
February 28th, 2002
September 2nd, 2002
February 28th, 2003
September 1st, 2003
March 1st, 2004
August 31st, 2004
February 28th, 2005
August 31st, 2005
February 28th, 2006
August 31st, 2006
February 28th, 2007
August 31st, 2007
February 29th, 2008
September 1st, 2008
September 1st, 2008
(that is, unless Saturdays and Sundays shouldn't be holidays for the Indian calendar. If you think they shouldn't, file a bug report with QuantLib).
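The weekend effect can be checked without QuantLib. The following is a minimal sketch that uses only the standard library and treats Saturdays and Sundays as the only non-business days, which is an assumption: a real calendar such as India() also contains public holidays. The function name roll_following is hypothetical.

```python
from datetime import date, timedelta

def roll_following(d):
    """Move a Saturday or Sunday forward to the next Monday (weekends only)."""
    while d.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        d += timedelta(days=1)
    return d

# A few accrual dates taken from the schedule above.
accrual_dates = [date(2001, 8, 31), date(2002, 8, 31), date(2003, 8, 31),
                 date(2004, 2, 29), date(2008, 8, 31)]
payment_dates = [roll_following(d) for d in accrual_dates]
```

Under that assumption, August 31st, 2001 (a Friday) stays where it is, while the other four dates move to September 2nd, 2002, September 1st, 2003, March 1st, 2004 and September 1st, 2008, matching the coupon dates printed above.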