Keep individuals in the same firm by year (Stata)

I have an employer-employee database and need to keep only the individuals who have at least one colleague, as identified by the Firm_id variable, but I don't know how to do this in Stata. My dataset looks like this:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
In the case above, I would keep Id 1 and 2 in both years, because they are in the same firm in both years of the sample, and Id 3 and 4 for 2010 only.
The output I'm looking for is like:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
4 22 2010
Any suggestions on how to perform this in Stata?
Regards,

bysort Id (Firm_id) : keep if Firm_id[1] == Firm_id[_N]
See FAQ here.
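Note that the command above keeps every Id whose Firm_id never changes, so on the example data it would retain Id 3 in 2011 and drop Id 4 altogether. A minimal sketch that instead reproduces the listed output, assuming a "colleague" means another Id with the same Firm_id in the same Year. The helper variables first and n_ids are illustrative names only:

```stata
clear
input Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
end
* mark the first row of each Id within a firm-year
bysort Year Firm_id Id : gen byte first = _n == 1
* count distinct Ids per firm-year (robust to repeated Id rows)
bysort Year Firm_id : egen n_ids = total(first)
* keep firm-years with at least two distinct individuals
keep if n_ids > 1
drop first n_ids
sort Id Year
list, sepby(Id)
```

Counting distinct Ids, rather than rows, is a design choice that guards against an individual appearing twice in the same firm-year.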

Related

Replacing variable entries to be the same in each group

I'm working with panel data in Stata, and I have a set up like the following:
ID year value
1 2010
1 2011 20
1 2012 20
1 2013
1 2014
2 2010
2 2011 14
2 2012 14
2 2013 14
2 2014 14
and I want to change the blank entries to be the same as the other entries within that ID, for any year. I.e., I want something like the following:
ID year value
1 2010 20
1 2011 20
1 2012 20
1 2013 20
1 2014 20
2 2010 14
2 2011 14
2 2012 14
2 2013 14
2 2014 14
What do you recommend?
If the nonmissing values of the variable value are always the same within id, you can use this:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte id int year byte value
1 2010 .
1 2011 20
1 2012 20
1 2013 .
1 2014 .
2 2010 .
2 2011 14
2 2012 14
2 2013 14
2 2014 14
end
*Get mean of values within id
bysort id : egen value2 = mean(value)
*Transfer values back to original var to maintain var labels etc. then drop value2
replace value = value2
drop value2
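The mean only recovers the right value when value really is constant within id. A hedged sketch that checks this assumption first and then fills only the missing entries. The names vmin and vmax are illustrative helpers, not part of the original answer:

```stata
clear
input byte id int year byte value
1 2010 .
1 2011 20
1 2012 20
1 2013 .
1 2014 .
2 2010 .
2 2011 14
2 2012 14
2 2013 14
2 2014 14
end
* egen's min() and max() ignore missing values within each id
bysort id : egen vmin = min(value)
bysort id : egen vmax = max(value)
* stop with an error if any id has two different nonmissing values
assert vmin == vmax
* fill only the missing entries, leaving observed values untouched
replace value = vmin if missing(value)
drop vmin vmax
sort id year
list, sepby(id)
```

If the assert fails, the data contain conflicting values within some id and a fill rule would have to be chosen deliberately.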

Filter specific observations

I have an employer-employee database and need to keep only the individuals who have at least one colleague, as identified by the Firm_id variable, but I don't know how to do this in Stata. My dataset looks like this:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
In the case above, I would keep Id 1 and 2 because they are in the same firm in both years of the sample. Individual 3 in 2011 and individual 4 in 2011 would be dropped.
The output I'm looking for is like:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
4 22 2010
This works for your data example:
clear
input Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
end
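* sorting on Id within each firm-year makes the first and last Id differ exactly when two or more distinct individuals are present
bysort Year Firm_id (Id) : keep if Id[1] != Id[_N]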
sort Id Year
list

How to add a factor/group variable to line plot in Stata

I would like to have a line plot of a continuous variable over time using xtline and overlay a scatterplot or label for each data point indicating a group membership at this point.
* Example generated by -dataex-. To install: ssc install dataex
clear
input double(id year group variable)
101 2003 3 12
102 2003 2 10
102 2005 1 10
102 2007 2 10
102 2009 1 10
102 2011 2 10
103 2003 4 3
103 2005 2 1
104 2003 4 50
105 2003 4 8
105 2005 4 12
105 2007 4 12
105 2009 4 12
106 2003 1 6
106 2005 1 28
106 2007 2 15
106 2009 2 4
106 2011 3 4
106 2015 1 2
106 2017 1 2
end
xtset id year
xtline variable, overlay
Here I added/marked/labelled groups of id 103.
I have four groups, which I hope can be shown in the legend as well.
Solutions
preserve
separate variable, by(id) veryshortlabel
line variable101-variable106 year ///
|| scatter variable year, ///
mla(group) ms(none) mlabc(black) ytitle(variable)
restore
Alternatively
xtline variable, overlay addplot(scatter variable year, mlabel(group))
I recommend direct labelling here. It is likely to yield a slightly messy graph, but your own example is already messy and will only get worse if you add more details.
Here is a reproducible example.
webuse grunfeld, clear
set scheme s1color
separate invest, by(company) veryshortlabel
line invest1-invest10 year , ysc(log) ///
|| scatter invest year if year == 1954, ///
mla(company) ms(none) mlabc(black) legend(off) yla(1 10 100 1000, ang(h)) ytitle(investment)
EDIT:
In your example two identifiers are present only for single years. To show some technique for line plots with panel data, I focus on the others.
* Example generated by -dataex-. To install: ssc install dataex
clear
input double(id year group variable)
101 2003 3 12
102 2003 2 10
102 2005 1 10
102 2007 2 10
102 2009 1 10
102 2011 2 10
103 2003 4 3
103 2005 2 1
104 2003 4 50
105 2003 4 8
105 2005 4 12
105 2007 4 12
105 2009 4 12
106 2003 1 6
106 2005 1 28
106 2007 2 15
106 2009 2 4
106 2011 3 4
106 2015 1 2
106 2017 1 2
end
bysort id : gen include = _N > 1
ssc install fabplot
set scheme s1color
fabplot line variable year if include, xla(2003 " 2003" 2010 2017 "2017 ") by(id) frontopts(lw(thick)) xtitle("")
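If the four groups should appear in the legend rather than as marker labels, one possible sketch is to split variable by group and overlay one scatter per group. This assumes the example data from the question. The stub g, the line colour and the legend text are illustrative choices:

```stata
clear
input double(id year group variable)
101 2003 3 12
102 2003 2 10
102 2005 1 10
102 2007 2 10
102 2009 1 10
102 2011 2 10
103 2003 4 3
103 2005 2 1
104 2003 4 50
105 2003 4 8
105 2005 4 12
105 2007 4 12
105 2009 4 12
106 2003 1 6
106 2005 1 28
106 2007 2 15
106 2009 2 4
106 2011 3 4
106 2015 1 2
106 2017 1 2
end
* one new variable per group value: g1, g2, g3, g4
separate variable, by(group) generate(g)
* connect(ascending) only joins points while year is increasing,
* so sorting by id and year breaks the line between panels
sort id year
twoway line variable year, connect(ascending) lcolor(gs10) ///
    || scatter g1 g2 g3 g4 year, ///
    legend(order(2 "group 1" 3 "group 2" 4 "group 3" 5 "group 4")) ///
    ytitle(variable)
```

Each scatter gets its own marker style and legend entry, while the grey lines show the trajectories without adding one legend key per id.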

About keeping observation with specified criteria in SAS

Hello, and many thanks in advance for your answers and your efforts to help new users in this forum.
I have a SAS table with the variables ID, Year, Month, and Creation date.
What I want is to keep only one row per ID for each combination of year, month, and creation date.
My HAVE data is :
ID Year Month Date of creation
1 2019 1 a
1 2019 1 a
1 2019 1 b
1 2019 2 c
1 2019 3 d
1 2020 5 e
2 2019 1 a
2 2019 1 b
2 2019 3 c
3 2021 8 m
3 2021 9 k
My WANT data is
ID Year Month Date of creation
1 2019 1 a
1 2019 1 b
1 2019 2 c
1 2019 3 d
1 2020 5 e
2 2019 1 a
2 2019 1 b
2 2019 3 c
3 2021 8 m
3 2021 9 k
I tried NODUPKEY, but it removes IDs.
Your example seems to work fine with NODUPKEY option of PROC SORT. Perhaps you used the wrong BY variables?
data have;
input ID Year Month Creation $ ;
cards;
1 2019 1 a
1 2019 1 a
1 2019 1 b
1 2019 2 c
1 2019 3 d
1 2020 5 e
2 2019 1 a
2 2019 1 b
2 2019 3 c
3 2021 8 m
3 2021 9 k
;
proc sort data=have out=want nodupkey;
by id year month creation ;
run;
You can also use the DISTINCT clause of PROC SQL, which removes duplicates based on all columns:
proc sql;
create table want
as
select distinct * from have;
quit;

Creating a flag using indexes

I'm looking to build flags for students who have repeated a grade, skipped a grade, or who have an unusual grade progression (e.g. 4th grade in 2008 and 7th grade in 2009). My data is unique at the student id-year-subject level and structured like this (albeit with more variables):
id year subject tested_grade
1 2011 m 10
1 2012 m 11
1 2013 m 12
2 2011 r 4
2 2012 r 7
2 2013 r 8
3 2011 m 6
3 2013 m 8
This is the code that I've used:
sort id year grade
gen repeat_flag = .
replace repeat_flag = 1 if year!=year[_n+1] & grade==grade[_n+1] ///
& subject!=subject[_n+1] & id==id[_n+1]
replace repeat_flag = 0 if repeat_flag==.
One problem is that there are a lot of students who took a test in, say, 6th grade, didn't take one in 7th, and then took one in 8th grade. This varies across years and school districts, as certain school districts adopted tests in different years for different grade levels. My code doesn't account for this.
Regardless though I think there must be more elegant ways to do this and as a side note I wanted to know if the use of indexes is appropriate for a problem like this. Thanks!
Edit
Included a sample of what my data looks like above in response to one of the comments below. If still not clear any feedback is welcomed.
What may seem anomalous are students progressing faster or more slowly in tested grade than the passage of time would imply. That's possibly just one line for the grunt work:
clear
input id year str1 subject tested_grade
1 2011 m 10
1 2012 m 11
1 2013 m 12
2 2011 r 4
2 2012 r 7
2 2013 r 8
3 2011 m 6
3 2013 m 8
end
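* flag = grades gained minus years elapsed since the student's previous record; 0 means normal progression
bysort id (year) : gen flag = (tested_grade - tested_grade[_n-1]) - (year - year[_n-1])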
list if flag != 0 & flag < . , sepby(id)
     +---------------------------------------+
     | id   year   subject   tested~e   flag |
     |---------------------------------------|
  5. |  2   2012         r          7      2 |
     +---------------------------------------+
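Building on the same difference, a possible sketch that splits the flag into separate indicators for repeated and skipped grades. It assumes one subject per student, as in the sample; with several subjects per student-year, one would presumably sort within id and subject instead. The names dgrade, dyear, repeated and skipped are illustrative:

```stata
clear
input id year str1 subject tested_grade
1 2011 m 10
1 2012 m 11
1 2013 m 12
2 2011 r 4
2 2012 r 7
2 2013 r 8
3 2011 m 6
3 2013 m 8
end
* change in grade and in year since the student's previous record
bysort id (year) : gen dgrade = tested_grade - tested_grade[_n-1]
by id : gen dyear = year - year[_n-1]
* fewer grades gained than years elapsed: repeated or held back
gen byte repeated = dgrade < dyear if !missing(dgrade, dyear)
* more grades gained than years elapsed: skipped or unusual jump
gen byte skipped = dgrade > dyear if !missing(dgrade, dyear)
list, sepby(id)
```

Because both differences are taken against the previous available record, a student tested in 6th grade in 2011 and 8th grade in 2013 is treated as progressing normally.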