Filter specific observations - stata

I have an employer-employee database and need to keep only the individuals that have at least one colleague considering the Firm_id variable, but I don't know how to do this in Stata. My dataset is like:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
In the case above, I would keep the individuals with Id 1 and 2 in both years, because they are in the same firm in both years of the sample, and Id 3 and 4 in 2010 only. Individual 3 in 2011 and individual 4 in 2011 would be dropped.
The output I'm looking for is like:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
4 22 2010

This works for your data example:
clear
input Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
end
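* Within each year-firm group (sorted by Id), keep observations only if
* the group contains at least two distinct Ids, i.e. there is a colleague
bysort Year Firm_id (Id) : keep if Id[1] != Id[_N]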
sort Id Year
list

Related

Replacing variable entries to be the same in each group

I'm working with panel data in Stata, and I have a set up like the following:
ID year value
1 2010
1 2011 20
1 2012 20
1 2013
1 2014
2 2010
2 2011 14
2 2012 14
2 2013 14
2 2014 14
and I want to change the blank entries to be the same as the other entries within that ID, for any year. I.e., I want something like the following:
ID year value
1 2010 20
1 2011 20
1 2012 20
1 2013 20
1 2014 20
2 2010 14
2 2011 14
2 2012 14
2 2013 14
2 2014 14
What do you recommend?
If the non-missing values of the variable value are always the same within id, you can use this:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte id int year byte value
1 2010 .
1 2011 20
1 2012 20
1 2013 .
1 2014 .
2 2010 .
2 2011 14
2 2012 14
2 2013 14
2 2014 14
end
*Get mean of values within id
bysort id : egen value2 = mean(value)
*Transfer values back to original var to maintain var labels etc. then drop value2
replace value = value2
drop value2

Keep individuals in the same firm by year (Stata)

I have an employer-employee database and need to keep only the individuals that have at least one colleague considering the Firm_id variable, but I don't know how to do this in Stata. My dataset is like this:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
In the case above, I would keep only the individuals corresponding to the Id 1 and 2 because they are in the same firm in both of the years in the sample and Id 3 and 4 for 2010.
The output I'm looking for is like:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
4 22 2010
Any suggestions on how to perform this in Stata?
Regards,
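* Within each Id, sorted by Firm_id, keep the observations only if the
* lowest and highest Firm_id coincide, i.e. the individual never changes firm
bysort Id (Firm_id) : keep if Firm_id[1] == Firm_id[_N]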
See FAQ here.

About keeping observation with specified criteria in SAS

Hello, and many thanks in advance for your answers and efforts to help newbie users in this forum.
I have a SAS table with the variables ID, Year, Month, and Creation date.
What I want is to keep only one row per ID for each combination of year, month, and creation date.
My HAVE data is :
ID Year Month Date of creation
1 2019 1 a
1 2019 1 a
1 2019 1 b
1 2019 2 c
1 2019 3 d
1 2020 5 e
2 2019 1 a
2 2019 1 b
2 2019 3 c
3 2021 8 m
3 2021 9 k
My WANT data is
ID Year Month Date of creation
1 2019 1 a
1 2019 1 b
1 2019 2 c
1 2019 3 d
1 2020 5 e
2 2019 1 a
2 2019 1 b
2 2019 3 c
3 2021 8 m
3 2021 9 k
I tried NODUPKEY, but it removes IDs.
Your example seems to work fine with the NODUPKEY option of PROC SORT. Perhaps you used the wrong BY variables?
data have;
input ID Year Month Creation $ ;
cards;
1 2019 1 a
1 2019 1 a
1 2019 1 b
1 2019 2 c
1 2019 3 d
1 2020 5 e
2 2019 1 a
2 2019 1 b
2 2019 3 c
3 2021 8 m
3 2021 9 k
;
proc sort data=have out=want nodupkey;
by id year month creation ;
run;
You can also use the DISTINCT keyword in PROC SQL; it removes duplicates based on all columns:
proc sql;
create table want
as
select distinct * from have;
quit;

Extract weeks from datetime (Python Pandas)

I have a dataframe:
time year month
0 12/28/2013 0:17 2013 12
1 12/28/2013 0:20 2013 12
2 12/28/2013 0:26 2013 12
3 12/29/2013 0:20 2013 12
4 12/29/2013 0:26 2013 12
5 12/30/2013 0:31 2013 12
6 12/30/2013 0:31 2013 12
7 12/31/2013 0:17 2013 12
8 12/31/2013 0:20 2013 12
9 12/31/2013 0:26 2013 12
10 1/1/2014 4:30 2014 1
11 1/1/2014 4:34 2014 1
12 1/1/2014 4:37 2014 1
13 1/2/2014 4:30 2014 1
14 1/2/2014 5:30 2014 1
15 1/3/2014 4:30 2014 1
16 1/3/2014 4:34 2014 1
17 1/3/2014 4:37 2014 1
18 1/4/2014 4:30 2014 1
19 1/4/2014 4:34 2014 1
20 1/4/2014 4:37 2014 1
I use the following code to extract the week information:
df['week'] = df['time'].dt.week
This gives the following dataframe:
time year month week
0 2013-12-28 00:17:00 2013 12 52
1 2013-12-28 00:20:00 2013 12 52
2 2013-12-28 00:26:00 2013 12 52
3 2013-12-29 00:20:00 2013 12 52
4 2013-12-29 00:26:00 2013 12 52
5 2013-12-30 00:31:00 2013 12 1
6 2013-12-30 00:31:00 2013 12 1
7 2013-12-31 00:17:00 2013 12 1
8 2013-12-31 00:20:00 2013 12 1
9 2013-12-31 00:26:00 2013 12 1
10 2014-01-01 04:30:00 2014 1 1
11 2014-01-01 04:34:00 2014 1 1
12 2014-01-01 04:37:00 2014 1 1
13 2014-01-02 04:30:00 2014 1 1
14 2014-01-02 05:30:00 2014 1 1
15 2014-01-03 04:30:00 2014 1 1
16 2014-01-03 04:34:00 2014 1 1
17 2014-01-03 04:37:00 2014 1 1
18 2014-01-04 04:30:00 2014 1 1
19 2014-01-04 04:34:00 2014 1 1
20 2014-01-04 04:37:00 2014 1 1
I would like to create another column showing year-week (e.g., 2013-52, 2014-1). The problem is when I combine two columns (year, week) in rows 5 through 9, the result is 2013-1 saying the first week of 2013. This is not correct. Is there any solution for this issue?
Use dt.strftime (reference: http://strftime.org/):
df.time.dt.strftime('%Y-%W')
0 2013-51
1 2013-51
2 2013-51
3 2013-51
4 2013-51
5 2013-52
6 2013-52
7 2013-52
8 2013-52
9 2013-52
10 2014-00
11 2014-00
12 2014-00
13 2014-00
14 2014-00
15 2014-00
16 2014-00
17 2014-00
18 2014-00
19 2014-00
20 2014-00
Name: time, dtype: object
As @TrigonaMinima pointed out, ISO 8601 (which dt.week follows) defines the first week of the year as follows:
It is the first week with a majority (4 or more) of its days in January.
In your case, week = 1 has 2 days in December and the rest in January, thus fitting the definition of the first week.
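Because an ISO 8601 week can belong to a different year than the calendar date, a year-week label is only consistent when the ISO week is paired with the ISO year rather than the calendar year. Below is a minimal sketch using only the standard library's isocalendar(); the helper name iso_year_week is hypothetical and the sample timestamps are taken from the question. In pandas 1.1 and later, Series.dt.isocalendar() exposes the same year and week fields as columns.

```python
from datetime import datetime


def iso_year_week(ts: datetime) -> str:
    """Return an ISO 8601 'year-week' label such as '2014-1' (hypothetical helper)."""
    # isocalendar() yields (ISO year, ISO week number, ISO weekday).
    iso_year, iso_week, _ = ts.isocalendar()
    return f"{iso_year}-{iso_week}"


# A few timestamps from the question, spanning the year boundary.
stamps = [
    datetime(2013, 12, 28, 0, 17),
    datetime(2013, 12, 29, 0, 20),
    datetime(2013, 12, 30, 0, 31),
    datetime(2013, 12, 31, 0, 17),
    datetime(2014, 1, 1, 4, 30),
]

labels = [iso_year_week(ts) for ts in stamps]
print(labels)
```

With this pairing, 30 and 31 December 2013 are labelled 2014-1 instead of 2013-1, while 28 and 29 December stay in 2013-52.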

python: obtaining a column of dates from the columns of years-months-days

Suppose I have a very simple dataframe:
>>> a
Out[158]:
monthE yearE dayE
0 10 2014 15
1 2 2012 15
2 2 2014 15
3 12 2015 15
4 2 2012 15
Suppose that I want to create a column with the date corresponding to each row, using the three columns of integers.
When I have plain numbers, it is enough to do:
>>> datetime.date(1983,11,8)
Out[159]: datetime.date(1983, 11, 8)
If I have to create a column of dates (theoretically a very basic request), instead:
a.apply(lambda x: datetime.date(x['yearE'],x['monthE'],x['dayE']))
I obtain the following error:
KeyError: ('yearE', u'occurred at index monthE')
I think you can first remove the trailing E from the column names and then use to_datetime, but then you get pandas Timestamps, not Python dates:
df.columns = df.columns.str[:-1]
df['date'] = pd.to_datetime(df)
#if the dataframe has other columns, select the subset first
#df['date'] = pd.to_datetime(df[['year','month','day']])
print (df)
month year day date
0 10 2014 15 2014-10-15
1 2 2012 15 2012-02-15
2 2 2014 15 2014-02-15
3 12 2015 15 2015-12-15
4 2 2012 15 2012-02-15
print (df.date.dtypes)
datetime64[ns]
print (df.date.iloc[0])
2014-10-15 00:00:00
print (type(df.date.iloc[0]))
<class 'pandas.tslib.Timestamp'>
Thanks to MaxU for this solution, which keeps the original column names:
df['date'] = pd.to_datetime(df.rename(columns = lambda x: x[:-1]))
#if there are other columns in df
#df['date'] = pd.to_datetime(df[['yearE','monthE','dayE']].rename(columns=lambda x: x[:-1]))
print (df)
monthE yearE dayE date
0 10 2014 15 2014-10-15
1 2 2012 15 2012-02-15
2 2 2014 15 2014-02-15
3 12 2015 15 2015-12-15
4 2 2012 15 2012-02-15
But if you really need Python dates, add axis=1 to apply; note that some pandas datetime functions will then no longer work on the column:
df['date'] =df.apply(lambda x: datetime.date(x['yearE'],x['monthE'],x['dayE']), axis=1)
print (df)
monthE yearE dayE date
0 10 2014 15 2014-10-15
1 2 2012 15 2012-02-15
2 2 2014 15 2014-02-15
3 12 2015 15 2015-12-15
4 2 2012 15 2012-02-15
print (df.date.dtypes)
object
print (df.date.iloc[0])
2014-10-15
print (type(df.date.iloc[0]))
<class 'datetime.date'>
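The row-wise construction that axis=1 performs can be sketched in plain Python, assuming the three columns are available as equal-length lists (the list names below are hypothetical, with values copied from the question). Without axis=1, apply passes each column to the lambda rather than each row, which is why the lookup of 'yearE' raised a KeyError.

```python
import datetime

# Column values copied from the question's dataframe (hypothetical list names).
month_e = [10, 2, 2, 12, 2]
year_e = [2014, 2012, 2014, 2015, 2012]
day_e = [15, 15, 15, 15, 15]

# datetime.date expects (year, month, day), so each row is built from
# one value of each column, in that order.
dates = [datetime.date(y, m, d) for y, m, d in zip(year_e, month_e, day_e)]

print(dates[0])
print(type(dates[0]))
```

Each element is a plain datetime.date, matching the object dtype shown in the last output above.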