I have a dataset covering a number of companies for which there is a variable for the firms employees. Some years the number of employees has not been reported, hence a some years appear blank while the year before and after contains a value.
The data is similar to:
COMPANY YEAR NO. EMPLOYEES
Company 1 2007 4
Company 1 2008 5
Company 1 2009 5
Company 1 2010 5
Company 2 2007 11
Company 2 2008 10
Company 2 2009
Company 2 2010 10
Company 3 2007 3
Company 3 2008 4
Company 3 2009
Company 3 2010 3
I would like to be able to search the dataset for any such occurrences, making an indicator of these years, and afterwards replace any blank spots with the year before. If there is no previous year to use as a replacement or the previous year is blank, the year after the blank spot. I am hoping for the dataset to like:
COMPANY YEAR NO. EMPLOYEES
Company 1 2007 4
Company 1 2008 5
Company 1 2009 5
Company 1 2010 5
Company 2 2007 11
Company 2 2008 10
Company 2 2009 10
Company 2 2010 10
Company 3 2007 3
Company 3 2008 4
Company 3 2009 4
Company 3 2010 3
To sum up, at first i need to check whether or not i do have a problem with missing values in-between two years (important that the codes do not replace missing values before or after the last year with a non-missing value, since som firms exit the sample). Next, if any blank years in between any two years that are non-blank, I would like to replace these blank spots as mentioned above.
The method I would use:
1. Sort the dataset company/year.
2. Replace missing values using LAG function if the missing value is not the first observation of the company group.
3. Reverse the sort order
4. Repeat step 2 on the dataset with reversed order
5. Return the dataset to the original order
Please note, I have changed your original data for Company 3 in order to have a case for your second scenario (missing value, no previous record).
DATA HAVE;
input COMPANY $ 0-10 YEAR 13-17 N_EMPLOYEES 24-27;
datalines;
Company 1 2007 4
Company 1 2008 5
Company 1 2009 5
Company 1 2010 5
Company 2 2007 11
Company 2 2008 10
Company 2 2009
Company 2 2010 10
Company 3 2007
Company 3 2008 3
Company 3 2009 4
Company 3 2010 3
;
run;
PROC SORT DATA=HAVE
OUT=DOSOMEWORKHERE;
BY COMPANY YEAR;
RUN;
DATA DOSOMEWORKHERE (drop=PREV_N_EMPLOYEES);
set DOSOMEWORKHERE;
by COMPANY;
PREV_N_EMPLOYEES = LAG(N_EMPLOYEES);
if first.COMPANY then
do;
PREV_N_EMPLOYEES = .;
end;
if N_EMPLOYEES = . then N_EMPLOYEES = PREV_N_EMPLOYEES;
run;
PROC SORT DATA=DOSOMEWORKHERE
OUT=DOSOMEWORKHERE;
BY DESCENDING COMPANY DESCENDING YEAR ;
RUN;
DATA DOSOMEWORKHERE (drop=PREV_N_EMPLOYEES);
set DOSOMEWORKHERE;
by DESCENDING COMPANY;
PREV_N_EMPLOYEES = LAG(N_EMPLOYEES);
if first.COMPANY then
do;
PREV_N_EMPLOYEES = .;
end;
if N_EMPLOYEES = . then N_EMPLOYEES = PREV_N_EMPLOYEES;
run;
PROC SORT DATA=DOSOMEWORKHERE
OUT=WANT;
BY COMPANY YEAR;
RUN;
Result:
Related
Hello and many thanks in advance for your answers and efforts to help newby users in this forum.
i have a sas table with the variables : ID, Year, Month, and Creation date.
What i desire is, per month and year and Creation date to keep only one ID.
My HAVE data is :
ID Year Month Date of creation
1 2019 1 a
1 2019 1 a
1 2019 1 b
1 2019 2 c
1 2019 3 d
1 2020 5 e
2 2019 1 a
2 2019 1 b
2 2019 3 c
3 2021 8 m
3 2021 9 k
My WANT data is
ID Year Month Date of creation
1 2019 1 a
1 2019 1 b
1 2019 2 c
1 2019 3 d
1 2020 5 e
2 2019 1 a
2 2019 1 b
2 2019 3 c
3 2021 8 m
3 2021 9 k
I tried nodup key but it removes ID's.
Your example seems to work fine with NODUPKEY option of PROC SORT. Perhaps you used the wrong BY variables?
data have;
input ID Year Month Creation $ ;
cards;
1 2019 1 a
1 2019 1 a
1 2019 1 b
1 2019 2 c
1 2019 3 d
1 2020 5 e
2 2019 1 a
2 2019 1 b
2 2019 3 c
3 2021 8 m
3 2021 9 k
;
proc sort data=have out=want nodupkey;
by id year month creation ;
run;
You can also use distinct clause from proc sql, it will remove duplicates based on all columns
proc sql;
create table want
as
select distinct * from have;
quit;
I want to keep only the row with the highest rank1 for each team. If there is a tie, I want the row with the higher rank2. And then the higher rank3.
For example,
data test;
input name $ team $ rank1 rank2 rank3 country $
datalines;
Bob A 5 6 5 US
Joe A 8 2 6 UK
Dav B 9 7 2 GER
Jim B 9 4 4 FRA
Bob C 3 4 1 FRA
Dan D 5 2 7 GER
Ike D 5 2 7 US
Jay D 5 2 8 UK
run;
I want:
Joe A 8 2 6 UK
Dav B 9 7 2 GER
Bob C 3 4 1 FRA
Jay D 5 2 8 UK
What is the most efficient way to do this? The dataset I'm working with is pretty big and is not sorted. I tried the below code but the sorts take forever to run. And the second sort sorts already sorted data. What if most teams only appear once in the dataset? Is it faster to split into duplicates and non-duplicates, sort only the duplicates and then append?
proc sort data=test;
by team descending rank1 descending rank2 descending rank3;
run;
proc sort data=test nodupkey;
by team;
run;
You can do that with PROC SUMMARY. Not sure about performance compared to what you are already doing.
proc summary data=test nway;
class team;
output out=ranked(drop=_:) idgroup(max(rank:) out(name rank: country)=);
run;
Hopefully this is a rather simple problem, unfortunately I haven't been able to crack the problem yet. I have a dataset of several companies containing a variable indicating when the company stops its activities. Unfortunately this dataset has been updated each year without adjusting any previous years, and as a consequence the actual year of exit/stop only enters once. Take for example company 1 in the table below. The company exits in 2010 but in each year leading up to 2010 a dummy ("9999") for still active is written instead. For company 1, I would like to replace this "9999" by "2010" (i.e. the year of exit) while leaving the "9999" for companies that are still active at the end of the period such as company 3.
company year exit/stop year
company 1 2007 9999
company 1 2008 9999
company 1 2009 9999
company 1 2010 9999
company 2 2007 9999
compnay 2 2008 9999
company 2 2009 2009
company 3 2007 9999
company 3 2008 9999
company 3 2009 9999
company 3 2010 9999
company 4 2007 9999
company 4 2008 2008
... ... ...
I have tried to find the lowest value for each company and replace all values in "EXIT/STOP YEAR" by the lowest value, but so far it have not worked properly so I was wondering if anyone might have an idea of how to do this operation?
Bests,
You could just take the last record and merge it back onto the data. Or it is even easier to just take the records that are not 9999 and merge them back on.
data have ;
input company &:$20. year exit ;
cards;
company 2 2007 9999
company 2 2008 9999
company 2 2009 2009
company 3 2007 9999
company 3 2008 9999
company 3 2009 9999
company 3 2010 9999
company 4 2007 9999
company 4 2008 2008
;
data want ;
merge have
have(keep=company exit rename=(exit=final)
where=(final ne 9999) )
;
by company ;
exit = coalesce(final,exit);
run;
I'm looking to build flags for students who have repeated a grade, skipped a grade, or who have an unusual grade progression (e.g. 4th grade in 2008 and 7th grade in 2009). My data is unique at the student id-year-subject level and structured like this (albeit with more variables):
id year subject tested_grade
1 2011 m 10
1 2012 m 11
1 2013 m 12
2 2011 r 4
2 2012 r 7
2 2013 r 8
3 2011 m 6
3 2013 m 8
This is the code that I've used:
sort id year grade
gen repeat_flag = .
replace repeat_flag = 1 if year!=year[_n+1] & grade==grade[_n+1] ///
& subject!=subject[_n+1] & id==id[_n+1]
replace repeat_flag = 0 if repeat_flag==.
One problem is that there are a lot of students who took a test in say 6 grade, didn't take one in 7th and then took one in 8th grade. This varies across years and school districts, as certain school districts adopted tests in different years for different grade levels. My code doesn't account this.
Regardless though I think there must be more elegant ways to do this and as a side note I wanted to know if the use of indexes is appropriate for a problem like this. Thanks!
Edit
Included a sample of what my data looks like above in response to one of the comments below. If still not clear any feedback is welcomed.
What may seem anomalous are students progressing faster or more slowly in tested grade than the passage of time would imply. That's possibly just one line for the grunt work:
clear
input id year str1 subject tested_grade
1 2011 m 10
1 2012 m 11
1 2013 m 12
2 2011 r 4
2 2012 r 7
2 2013 r 8
3 2011 m 6
3 2013 m 8
end
bysort id (year) : gen flag = (tested - tested[_n-1]) - (year - year[_n-1])
list if flag != 0 & flag < . , sepby(id)
+---------------------------------------+
| id year subject tested~e flag |
|---------------------------------------|
5. | 2 2012 r 7 2 |
+---------------------------------------+
Quick question. I'm working with code that produces a spreadsheet that contains the information like the following:
year business sales profit
2001 a 5 3
2002 a 6 4
2003 a 4 2
2001 b 2 1
2002 b 6 3
2003 b 7 5
How can I get Stata to total sales and profits across years?
Thanks
Try
collapse (sum) sales profit, by(year)
or, if you want to retain your original data,
bysort year: egen tot_sales = total(sales)
egen stands for extended generate, a very useful command.