python: obtaining a column of dates from the columns of years-months-days - python-2.7

Suppose I have a very simple dataframe:
>>> a
Out[158]:
monthE yearE dayE
0 10 2014 15
1 2 2012 15
2 2 2014 15
3 12 2015 15
4 2 2012 15
I want to create a column holding the date for each row, built from the three integer columns.
With plain integers it is enough to write:
>>> datetime.date(1983,11,8)
Out[159]: datetime.date(1983, 11, 8)
To create a whole column of dates (in theory a very basic request), I tried:
a.apply(lambda x: datetime.date(x['yearE'],x['monthE'],x['dayE']))
I obtain the following error:
KeyError: ('yearE', u'occurred at index monthE')

I think you can first remove the trailing E from the column names and then use to_datetime. Note that this gives pandas Timestamps, not Python dates:
df.columns = df.columns.str[:-1]
df['date'] = pd.to_datetime(df)
# if df has other columns, select only the date columns first
#df['date'] = pd.to_datetime(df[['year','month','day']])
print (df)
month year day date
0 10 2014 15 2014-10-15
1 2 2012 15 2012-02-15
2 2 2014 15 2014-02-15
3 12 2015 15 2015-12-15
4 2 2012 15 2012-02-15
print (df.date.dtypes)
datetime64[ns]
print (df.date.iloc[0])
2014-10-15 00:00:00
print (type(df.date.iloc[0]))
<class 'pandas.tslib.Timestamp'>
Thanks to MaxU for a solution that keeps the original column names:
df['date'] = pd.to_datetime(df.rename(columns = lambda x: x[:-1]))
# if df has other columns, select only the date columns first
#df['date'] = pd.to_datetime(df[['yearE','monthE','dayE']].rename(columns=lambda x: x[:-1]))
print (df)
monthE yearE dayE date
0 10 2014 15 2014-10-15
1 2 2012 15 2012-02-15
2 2 2014 15 2014-02-15
3 12 2015 15 2015-12-15
4 2 2012 15 2012-02-15
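If you really need Python dates, add axis=1 to apply so that the function receives rows rather than columns. The resulting column has object dtype, so some pandas datetime functions cannot be used on it: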
df['date'] =df.apply(lambda x: datetime.date(x['yearE'],x['monthE'],x['dayE']), axis=1)
print (df)
monthE yearE dayE date
0 10 2014 15 2014-10-15
1 2 2012 15 2012-02-15
2 2 2014 15 2014-02-15
3 12 2015 15 2015-12-15
4 2 2012 15 2012-02-15
print (df.date.dtypes)
object
print (df.date.iloc[0])
2014-10-15
print (type(df.date.iloc[0]))
<class 'datetime.date'>
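If Python dates are needed but a row-wise apply is undesirable, one option is to build the Timestamps with to_datetime first and convert afterwards with the .dt.date accessor. A minimal sketch, assuming the column names from the question (the frame is rebuilt here so the snippet runs on its own); the result is still an object column, so the same caveat applies:

```python
import datetime

import pandas as pd

# Sample frame rebuilt from the question.
df = pd.DataFrame({'monthE': [10, 2, 2, 12, 2],
                   'yearE': [2014, 2012, 2014, 2015, 2012],
                   'dayE': [15, 15, 15, 15, 15]})

# to_datetime expects columns named year/month/day, so strip the trailing "E".
parts = df[['yearE', 'monthE', 'dayE']].rename(columns=lambda c: c[:-1])
stamps = pd.to_datetime(parts)  # Series of pandas Timestamps

# .dt.date turns each Timestamp into a plain datetime.date (object dtype).
df['date'] = stamps.dt.date
print(type(df['date'].iloc[0]))
```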

Related

Replacing variable entries to be the same in each group

I'm working with panel data in Stata, and I have a set up like the following:
ID year value
1 2010
1 2011 20
1 2012 20
1 2013
1 2014
2 2010
2 2011 14
2 2012 14
2 2013 14
2 2014 14
and I want to change the blank entries to be the same as the other entries within that ID, for any year. I.e., I want something like the following:
ID year value
1 2010 20
1 2011 20
1 2012 20
1 2013 20
1 2014 20
2 2010 14
2 2011 14
2 2012 14
2 2013 14
2 2014 14
What do you recommend?
If the non-missing values of the variable value are always the same within id, you can use this:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte id int year byte value
1 2010 .
1 2011 20
1 2012 20
1 2013 .
1 2014 .
2 2010 .
2 2011 14
2 2012 14
2 2013 14
2 2014 14
end
*Get mean of values within id
bysort id : egen value2 = mean(value)
*Transfer values back to original var to maintain var labels etc. then drop value2
replace value = value2
drop value2
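For readers who work in pandas rather than Stata, the same idea (take the group mean, which ignores missing values, and write it back) might be sketched as follows. This is an analogue under the same assumption that the non-missing values are identical within each id, not a translation of the Stata commands; the frame is rebuilt from the example data:

```python
import numpy as np
import pandas as pd

# Example data from the question; NaN marks the blank entries.
df = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'year': [2010, 2011, 2012, 2013, 2014] * 2,
    'value': [np.nan, 20, 20, np.nan, np.nan, np.nan, 14, 14, 14, 14],
})

# The group mean skips NaN, so it equals the single non-missing value per id.
# transform broadcasts that mean back to every row of the group.
df['value'] = df.groupby('id')['value'].transform('mean')
print(df)
```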

Keep individuals in the same firm by year (Stata)

I have an employer-employee database and need to keep only the individuals that have at least one colleague considering the Firm_id variable, but I don't know how to do this in Stata. My dataset is like this:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
In the case above, I would keep Id 1 and 2 for both years, because they are in the same firm in both years of the sample, and Id 3 and 4 for 2010 only.
The output I'm looking for is like:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
4 22 2010
Any suggestions on how to perform this in Stata?
bysort Id (Firm_id) : keep if Firm_id[1] == Firm_id[_N]
See FAQ here.

Filter specific observations

I have an employer-employee database and need to keep only the individuals that have at least one colleague considering the Firm_id variable, but I don't know how to do this in Stata. My dataset is like:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
In the case above, I would keep the individuals with Id 1 and 2, because they are in the same firm in both years of the sample. Individual 3 in 2011 and individual 4 in 2011 would be dropped.
The output I'm looking for is like:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
4 22 2010
This works for your data example:
clear
input Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
end
bysort Year Firm_id : keep if Id[1] != Id[_N]
sort Id Year
list
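A rough pandas analogue of the same filter, for comparison: count the distinct individuals in each firm-year cell and keep the rows where that count exceeds one. This is a sketch on the example data, not part of the Stata answer; the names n_people and kept are made up for illustration:

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    'Id': [1, 1, 2, 2, 3, 3, 4, 4],
    'Firm_id': [50, 50, 50, 50, 22, 22, 22, 20],
    'Year': [2010, 2011, 2010, 2011, 2010, 2011, 2010, 2011],
})

# Number of distinct individuals per firm-year, broadcast back to each row.
n_people = df.groupby(['Year', 'Firm_id'])['Id'].transform('nunique')

# Keep only rows where at least two individuals share the firm in that year.
kept = df[n_people > 1].sort_values(['Id', 'Year']).reset_index(drop=True)
print(kept)
```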

Pandas add multiple new columns at once from list of lists

I have a list of timestamp lists where each inner list looks like this:
['Tue', 'Feb', '7', '10:07:40', '2017']
Is it possible with Pandas to add five new columns at the same time to an already created dataframe (same length as the outer list), that are equal to each of these values, with names 'day','month','date','time','year'?
I think you can use the DataFrame constructor with concat:
import pandas as pd
# existing dataframe
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})
# one inner list per row of df
L = [['Tue', 'Feb', '7', '10:07:40', '2017'],
     ['Tue', 'Feb', '7', '10:07:40', '2017'],
     ['Tue', 'Feb', '7', '10:07:40', '2017']]
cols = ['day','month','date','time','year']
df1 = pd.DataFrame(L, columns=cols)
print (df1)
day month date time year
0 Tue Feb 7 10:07:40 2017
1 Tue Feb 7 10:07:40 2017
2 Tue Feb 7 10:07:40 2017
df2 = pd.concat([df, df1], axis=1)
print (df2)
A B C day month date time year
0 1 4 7 Tue Feb 7 10:07:40 2017
1 2 5 8 Tue Feb 7 10:07:40 2017
2 3 6 9 Tue Feb 7 10:07:40 2017
One liner:
df2 = pd.concat([df, pd.DataFrame(L, columns=['day','month','date','time','year'])], axis=1)
print (df2)
A B C day month date time year
0 1 4 7 Tue Feb 7 10:07:40 2017
1 2 5 8 Tue Feb 7 10:07:40 2017
2 3 6 9 Tue Feb 7 10:07:40 2017
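One caveat: concat with axis=1 aligns rows on the index, so this works as shown only when df has the default 0..n-1 index. If df has some other index, passing it to the constructor keeps the rows lined up. A small sketch, where the index [10, 20, 30] is a made-up example:

```python
import pandas as pd

# Hypothetical frame whose index is not the default 0, 1, 2.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
                  index=[10, 20, 30])
L = [['Tue', 'Feb', '7', '10:07:40', '2017']] * 3
cols = ['day', 'month', 'date', 'time', 'year']

# Reuse df's index so that concat pairs the rows up instead of
# producing NaN-filled rows for non-matching labels.
df1 = pd.DataFrame(L, columns=cols, index=df.index)
df2 = pd.concat([df, df1], axis=1)
print(df2)
```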

Extract weeks from datetime (Python Pandas)

I have a dataframe:
time year month
0 12/28/2013 0:17 2013 12
1 12/28/2013 0:20 2013 12
2 12/28/2013 0:26 2013 12
3 12/29/2013 0:20 2013 12
4 12/29/2013 0:26 2013 12
5 12/30/2013 0:31 2013 12
6 12/30/2013 0:31 2013 12
7 12/31/2013 0:17 2013 12
8 12/31/2013 0:20 2013 12
9 12/31/2013 0:26 2013 12
10 1/1/2014 4:30 2014 1
11 1/1/2014 4:34 2014 1
12 1/1/2014 4:37 2014 1
13 1/2/2014 4:30 2014 1
14 1/2/2014 5:30 2014 1
15 1/3/2014 4:30 2014 1
16 1/3/2014 4:34 2014 1
17 1/3/2014 4:37 2014 1
18 1/4/2014 4:30 2014 1
19 1/4/2014 4:34 2014 1
20 1/4/2014 4:37 2014 1
I use the following code to extract the week information:
df['week'] = df['time'].dt.week
This gives the following dataframe:
time year month week
0 2013-12-28 00:17:00 2013 12 52
1 2013-12-28 00:20:00 2013 12 52
2 2013-12-28 00:26:00 2013 12 52
3 2013-12-29 00:20:00 2013 12 52
4 2013-12-29 00:26:00 2013 12 52
5 2013-12-30 00:31:00 2013 12 1
6 2013-12-30 00:31:00 2013 12 1
7 2013-12-31 00:17:00 2013 12 1
8 2013-12-31 00:20:00 2013 12 1
9 2013-12-31 00:26:00 2013 12 1
10 2014-01-01 04:30:00 2014 1 1
11 2014-01-01 04:34:00 2014 1 1
12 2014-01-01 04:37:00 2014 1 1
13 2014-01-02 04:30:00 2014 1 1
14 2014-01-02 05:30:00 2014 1 1
15 2014-01-03 04:30:00 2014 1 1
16 2014-01-03 04:34:00 2014 1 1
17 2014-01-03 04:37:00 2014 1 1
18 2014-01-04 04:30:00 2014 1 1
19 2014-01-04 04:34:00 2014 1 1
20 2014-01-04 04:37:00 2014 1 1
I would like to create another column showing year-week (e.g., 2013-52, 2014-1). The problem is that when I combine the two columns (year, week), rows 5 through 9 come out as 2013-1, i.e. the first week of 2013, which is not correct. Is there a solution for this?
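Use dt.strftime (reference: http://strftime.org/). Note that %W counts weeks starting on Monday and puts the days before the first Monday of the year in week 00: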
df.time.dt.strftime('%Y-%W')
0 2013-51
1 2013-51
2 2013-51
3 2013-51
4 2013-51
5 2013-52
6 2013-52
7 2013-52
8 2013-52
9 2013-52
10 2014-00
11 2014-00
12 2014-00
13 2014-00
14 2014-00
15 2014-00
16 2014-00
17 2014-00
18 2014-00
19 2014-00
20 2014-00
Name: time, dtype: object
As @TrigonaMinima pointed out, dt.week follows ISO 8601, which defines the first week of the year as follows:
It is the first week with a majority (4 or more) of its days in January
In your case, week = 1 has 2 days in December and the rest in January, thus fitting the definition of the first week.
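If the ISO numbering is what you want, the week has to be paired with the ISO year rather than the calendar year. A minimal sketch, assuming pandas 1.1 or later, where Series.dt.isocalendar() is available; only a few of the timestamps from the question are used, and the column name year_week is made up:

```python
import pandas as pd

# A subset of the timestamps from the question, written in ISO format.
df = pd.DataFrame({'time': pd.to_datetime([
    '2013-12-28 00:17', '2013-12-29 00:20', '2013-12-30 00:31',
    '2013-12-31 00:17', '2014-01-01 04:30', '2014-01-04 04:37',
])})

# isocalendar() returns a frame with the ISO year, ISO week and weekday.
iso = df['time'].dt.isocalendar()

# Combine ISO year and zero-padded ISO week into one label.
df['year_week'] = (iso['year'].astype(str) + '-'
                   + iso['week'].astype(str).str.zfill(2))
print(df)
```

With this pairing, 30 and 31 December 2013 are labelled as week 01 of 2014 instead of week 1 of 2013.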