I have two tables and need to create one more table working with other two:
first_table:
id term
3 2014
3 2015
4 2014
4 2015
5 2013
5 2014
6 2013
6 2014

second_table:
id term majr_code
3 2010 ACT
3 2010 ACT
3 2011 GNST
3 2015 BUSA
3 2015 BUSA
4 2009 TIM
4 2010 BAL
4 2014 TAR
5 2011 SAR
5 2013 COR
6 2010 PAT
6 2013 TOR
These are the two tables I have. I need to create another table that is the same as the first table, with one more column, majr_code, added.
first_table:
id term majr_code
3 2014 GNST
3 2015 BUSA
4 2014 TAR
4 2015 TAR
5 2013 COR
5 2014 COR
6 2013 TOR
6 2014 TOR
What I need to do is this: for the same id, if the second table has the same term as the first table, I keep that term's majr_code. Otherwise, I use the majr_code of the most recent earlier term. For example, if the first table has 2014 and the second table has 2011 and 2015, I need to use 2011's majr_code for the 2014 term. Likewise, if the first table has the 2013 and 2014 terms for the same id and the second table's highest term is 2013, I keep the same majr_code for both 2013 and 2014.
I know it's complicated; it should be clearer if you check the tables and the result. If it is still too complicated, I can delete the question. This is the best way I can explain it. Thanks!
I think the below code should do the trick. It works as follows:
1) Reads in the sample datasets.
2) Creates a table named second_table_nogaps, which is just second_table but with no yearly gaps up through 2015. For each id in the second table, it checks whether a record exists for each year. If so, the record is output; if not, a new record is created with the prior year's majr_code. If the last record for a given id is earlier than 2015, new records are generated up through 2015 (for example, a new record is created for id=4, year=2015, majr_code=TAR).
3) Merges the unique values of id+term+majr_code onto first_table. The resulting table First_table_2 should be what you're looking for! However, BE CAREFUL: if there are multiple majr_codes for the same id+term, this step will result in duplication.
Hope this helps! The code in step 2 could probably be simplified as my handling of the first and last record was not particularly efficient.
data first_table;
infile datalines ;
input id term;
datalines ;
3 2014
3 2015
4 2014
4 2015
5 2013
5 2014
6 2013
6 2014
;
run;
data second_table;
infile datalines ;
input id term majr_code $;
datalines ;
3 2010 ACT
3 2010 ACT
3 2011 GNST
3 2015 BUSA
3 2015 BUSA
4 2009 TIM
4 2010 BAL
4 2014 TAR
5 2011 SAR
5 2013 COR
6 2010 PAT
6 2013 TOR
;
run;
proc sort data=second_table ; by id term; run;
data second_table_nogaps (keep=id_nogaps term_nogaps majr_code_nogaps);
    set second_table end=eof;
    retain id_nogaps term_nogaps majr_code_nogaps;
    *first set up the first row... establishes retained variables and outputs;
    if _N_ = 1 then do;
        id_nogaps = id;
        term_nogaps = term;
        majr_code_nogaps = majr_code;
        output;
    end;
    *for all but the first and last row;
    else if not eof then do;
        do while ( (term_nogaps + 1 < term) /*fill in gaps between years (e.g. major code in 2011 and major code in 2014 within the same id)*/
                   or
                   ((id_nogaps ne id) and term_nogaps < 2015) /*fill major code for all terms up through 2015 (e.g. last major code for id 4 is in 2014)*/
                 );
            term_nogaps = term_nogaps + 1;
            output;
        end;
        id_nogaps = id;
        term_nogaps = term;
        majr_code_nogaps = majr_code;
        output;
    end;
    else do;
        do while (term_nogaps + 1 < term);
            term_nogaps = term_nogaps + 1;
            output;
        end;
        id_nogaps = id;
        term_nogaps = term;
        majr_code_nogaps = majr_code;
        output;
        do while (term_nogaps < 2015);
            term_nogaps = term_nogaps + 1;
            output;
        end;
    end;
run;
proc sql;
create table First_table_2 as
Select a.* , b.majr_code_nogaps as majr_code
from first_table a
left join
(select distinct id_nogaps, term_nogaps, majr_code_nogaps from second_table_nogaps) b /*select distinct values to prevent duplication*/
on a.id = b.id_nogaps and a.term = b.term_nogaps;
quit;
There are a few approaches to this, but SQL is probably the easiest. You don't provide code, so I'll just include a pointer: use a HAVING clause to filter the table after it has been grouped, keeping only the rows where term=max(term).
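As a rough sketch of that pointer applied to the sample tables from the question: join each first_table row to every second_table row for the same id with an equal or earlier term, then keep only the latest of those. The join condition b.term <= a.term and the output name first_table_2 are my assumptions about how to express "most recent earlier term", not something the question specifies.

```sas
/* Sample data from the question (the @@ reads several records per line) */
data first_table;
    input id term @@;
    datalines;
3 2014 3 2015 4 2014 4 2015 5 2013 5 2014 6 2013 6 2014
;
run;

data second_table;
    input id term majr_code $ @@;
    datalines;
3 2010 ACT 3 2010 ACT 3 2011 GNST 3 2015 BUSA 3 2015 BUSA 4 2009 TIM
4 2010 BAL 4 2014 TAR 5 2011 SAR 5 2013 COR 6 2010 PAT 6 2013 TOR
;
run;

proc sql;
    create table first_table_2 as
    /* distinct removes the copies caused by duplicate rows in second_table */
    select distinct a.id, a.term, b.majr_code
    from first_table a
    left join second_table b
        on a.id = b.id and b.term <= a.term   /* only same-or-earlier terms */
    group by a.id, a.term
    /* keep only the candidate with the latest term in each group */
    having b.term = max(b.term);
quit;
```

SAS will write a note about remerging summary statistics, which is expected here. As with the merge approach, two different majr_codes for the same id and term in second_table would still produce two output rows.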
I have a data set of approximately this format:
Table format:
ID 2012 2013 2014
A 1 3 .
B 2 . 4
And I want to transpose it to this format:
Table format:
ID Source Value
A 2012 1
A 2013 3
B 2012 2
B 2014 4
Using the Transpose task. I'm working in EG 5.1 and I've got a massive mental block on how to do this. Most of the guides are for doing this the opposite way around. Thanks so much in advance for any advice.
Use proc transpose instead. Create a new SAS program and run the following code:
proc transpose data = have
out = want(rename = (COL1 = Value)
where = (NOT missing(Value) )
)
name = Source;
by id;
var _NUMERIC_;
run;
Output:
ID Source Value
A 2012 1
A 2013 3
B 2012 2
B 2014 4
In Enterprise Guide, this is the Stack Columns task.
I have test scores from many students in 8 different years. I want to retain only the max total score of each student, but then also retain all the student-year related information to that test score (that is, all the columns from the same year in which the student got the highest total score).
An example of the datasets I have:
%macro score;
%do year = 2010 %to 2018;
data student_&year.;
do id=1 to 10;
english=25*rand('uniform');
math=25*rand('uniform');
sciences=25*rand('uniform');
history=25*rand('uniform');
total_score=sum(english, math, sciences, history);
output;
end;
%end;
run;
%mend;
%score;
In my expected output, I would like to retain the max of total_score for each student, and also have the other columns related to that total score. If possible, I would also like to have the information about the year in which the student got the max of total_score. An example of the expected output would be:
DATA want;
INPUT id total_score english math sciences history year;
CARDS;
1 75.4 15.4 20 20 20 2017
2 63.8 20 13.8 10 20 2016
3 48 10 10 18 10 2018
4 52 12 10 10 20 2016
5 69.5 20 19.5 20 10 2013
6 85 20.5 20.5 21 23 2011
7 41 5 12 14 10 2010
8 55.3 15 20.3 10 10 2012
9 51.5 10 20 10 11.5 2013
10 48.9 12.9 16 10 10 2015
;
RUN;
I have been trying to work with the SAS UPDATE statement, but it just gets the most recent value for each student, whereas I want the max total score. Also, UPDATE works on only two tables at a time, and I would like to compare all tables at the same time. So the strategy I am trying does not work:
data want;
update score_2010 score_2011;
by id;
Thanks to anyone who can provide insights.
It is easier to obtain what you want if you have only one longitudinal dataset with all the original information of your students. It also makes more sense, since you are comparing students across different years.
To build a longitudinal dataset, you will first need to add a variable recording the year to each of your original datasets. For example:
%macro score;
    %do year = 2010 %to 2018;
        data student_&year.;
            do id=1 to 10;
                english=25*rand('uniform');
                math=25*rand('uniform');
                sciences=25*rand('uniform');
                history=25*rand('uniform');
                total_score=sum(english, math, sciences, history);
                year=&year.;
                output;
            end;
        run;
    %end;
%mend;
%score;
After including the year, you can get a longitudinal dataset with:
data student_allyears;
set student_201:;
run;
Finally, you can get what you want with a proc sql, in which you select the max of "total_score" grouped by "id":
proc sql;
    create table want as
    select distinct *
    from student_allyears
    group by id
    having total_score=max(total_score);
quit;
Create a view that stacks the individual data sets and perform your processing on that.
Example (SQL select, group by, and having)
data scores / view=scores;
length year $4;
set work.student_2010-work.student_2018 indsname=dsname;
year = scan(dsname,-1,'_');
run;
proc sql;
create table want as
select * from scores
group by id
having total_score=max(total_score)
;
Example DOW loop processing
Stack the data so the view can be processed BY ID. The first DOW loop computes which record has the max total score within the group, and the second selects that record for OUTPUT.
data scores_by_id / view=scores_by_id;
set work.student_2010-work.student_2018 indsname=dsname;
by id;
year = scan(dsname,-1,'_');
run;
data want;
    * compute which record in group has max measure;
    do _n_ = 1 by 1 until (last.id);
        set scores_by_id;
        by id;
        if total_score > _max then do;
            _max = total_score;
            _max_at_n = _n_;
        end;
    end;
    * output entire record having the max measure;
    do _n_ = 1 to _n_;
        set scores_by_id;
        if _n_ = _max_at_n then OUTPUT;
    end;
    drop _max:;
run;
I have a SAS question. I have a dataset containing ID and year. I want to create the dummy variables "2011" and "2012", which should take the value 1 if the ID has an observation in the given year and 0 otherwise. E.g., ID 2 should have 2011=1 and 2012=0, since that ID only has an observation for 2011.
ID Year 2011 2012
1 2011 1 1
1 2012 1 1
2 2011 1 0
3 2012 0 1
Can anyone help? Thanks!
For one thing, 2011 or 2012 are not valid names for SAS variables. SAS variables must start with a letter or an underscore (e.g., _2011).
If you really need to, you can get around that limitation by setting the system option validvarname=any, surrounding your 'invalid' variable names with single quotes, and appending an n (a name literal, e.g. '2011'n).
This would do what you want:
data have;
infile datalines;
input ID year;
datalines;
1 2011
1 2012
2 2011
3 2012
;
run;
options validvarname=ANY;
proc sql;
create table want as
select ID
,year
,exists(select * from have b where year=2011 and a.id=b.id) as '2011'n
,exists(select * from have b where year=2012 and a.id=b.id) as '2012'n
from have a
;
quit;
I am trying to define a new value for an observation with a user defined format. However, my if/then/else statement seems to only work for observations with a year value of "2014". The put statements are not working for other values. In SAS, the put statement is blue in the first statement, and black in the other two. Here is a picture of what I mean:
Does anyone know what I am missing here? Here is my complete code:
data claims_t03_group;
set output.claims_t02_group;
if year = "2014" then test = put(compress(lookup,"_"),$G_14_PROD35.);
else if year = "2015" then test = put(compress(lookup,"_"),$G_15_PROD35.);
else test = put(compress(lookup,"_"),$G_16_PROD35.);
run;
Here is an example of what I mean when I say that the process seems to "work" for 2014:
As you can see, when the Year value is 2014, the format lookup works correctly, and the test field returns the value I am expecting. However, for years 2015 and 2016, the test field returns the lookup value without any formatting.
Your code uses user-defined formats, $G_14_PROD. through $G_16_PROD. My guess would be that there is a problem with one or more of these, but unless you can provide the format definitions it will be difficult to assist you further.
Try running the following and sharing the resulting output dataset work.prdfmts:
proc sql noprint;
    select cats(libname,'.',memname) into :myfmtlib
    from sashelp.vcatalg
    where objname = 'G_14_PROD';
quit;
proc format cntlout = prdfmts library=&myfmtlib;
    select $G_14_PROD $G_15_PROD $G_16_PROD; /*character formats need the $ prefix here*/
run;
N.B. this assumes that you only have one catalogue containing a format with that name, and that the format definitions for all 3 formats are contained in the same catalogue. If not, you will need to adapt this a bit and run it once for each format to find and export the definition.
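If you are not sure whether the formats live in more than one catalogue, a query along these lines could list where each one is stored before you export anything. This is only a sketch: it assumes the three format names from the question, assumes they are character formats (object type FORMATC), and writes to a table I have called fmt_locations purely for illustration.

```sas
proc sql;
    /* one row per catalogue entry that holds one of the three formats */
    create table fmt_locations as
    select libname, memname, objname, objtype
    from sashelp.vcatalg
    where upcase(objname) in ('G_14_PROD', 'G_15_PROD', 'G_16_PROD')
      and objtype = 'FORMATC'   /* character formats only */
    order by objname, libname, memname;
quit;
```

If a format shows up under more than one libname.memname, you would then run the proc format cntlout step once per catalogue, pointing library= at each in turn.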
Not that it solves your actual problem, but you could eliminate the IF/THEN by using the PUTC() function instead.
data have ;
do year=2014,2015,2016;
do lookup='00_01','00_02' ;
output;
end;
end;
run;
proc format ;
value $G_14_PROD '0001'='2014 - 1' '0002'='2014 - 2' ;
value $G_15_PROD '0001'='2015 - 1' '0002'='2015 - 2' ;
value $G_16_PROD '0001'='2016 - 1' '0002'='2016 - 2' ;
run;
data want ;
set have ;
length test $35 ;
if 2014 <= year <= 2016 then
test = putc(compress(lookup,'_'),cats('$G_',year-2000,'_PROD.'))
;
run;
Result
Obs year lookup test
1 2014 00_01 2014 - 1
2 2014 00_02 2014 - 2
3 2015 00_01 2015 - 1
4 2015 00_02 2015 - 2
5 2016 00_01 2016 - 1
6 2016 00_02 2016 - 2
I have panel data that looks something like this:
ID year dummy
1234 2007 0
1234 2008 0
1234 2009 0
1234 2010 1
1234 2011 1
2345 2008 0
2345 2009 1
2345 2010 1
2345 2011 1
3456 2008 0
3456 2009 0
3456 2010 1
3456 2011 1
With more observations following the same pattern and many more variables that aren't relevant to this problem.
I want to establish a treatment sample of IDs where the dummy variable "switches" at 2010 (is 0 when year<2010 and 1 when year>=2010). In the example data above, 1234 and 3456 would be in the sample and 2345 would not.
I'm fairly new to SAS and I guess I'm not familiar enough with CLASS and BY statements to figure out how to do this.
So far I've done this:
data c_temp;
set c_data_full;
if year < 2010 and dummy=0
then trtmt_grp=1;
else pre_grp=0;
if year >=2010 and dummy=1
then trtmt_grp=1;
run;
But that doesn't do anything about the panel aspect of the data. I can't figure out how to do the last step of selecting only the IDs where trtmt_grp is 1 for every year.
All help is appreciated! Thanks!
I don't think you need a double DOW loop, unless you need to append the data to the other rows. A simple single pass should suffice if you just need a single row per matching ID.
data want;
set have;
by id;
retain grpcheck; *keep its value for multiple passes;
if first.id and year < 2010 then grpcheck=1; *reset for each ID to 1 (kept);
else if first.id and year ge 2010 then grpcheck=0;
if (year<2010) and (dummy=1) then grpcheck=0; *if a non-zero is found before 2010, set to 0;
if (year >= 2010) and (dummy=0) then grpcheck=0; *if a 0 is found at/after 2010, set to 0;
if last.id and year >= 2010 and grpcheck=1; *if still 1 by last.id and it hits at least 2010 then output;
run;
Any time you want to do some logic for each ID (or, each logically grouped set of rows by some variable's value), you start by setting your flag/etc. in an if first.id statement group. Then, modify your flag as appropriate for each row. Then, add an if last.id group which checks to see if the flag is still set when you've hit the last row.
I think you probably want a double DOW loop. First loop to calculate your TRTMT_GRP flag at the ID level and the second to select the detailed records.
data want ;
    do until (last.id);
        set c_data_full;
        by id dummy ;
        if first.dummy and dummy=1 and year=2010 then trtmt_grp=1;
    end;
    do until (last.id);
        set c_data_full;
        by id ;
        if trtmt_grp=1 then output;
    end;
run;
It seems to me that PROC SQL can deliver a pretty straightforward approach:
proc sql;
select distinct id from have
group by id
having sum(year<=2009 and dummy = 1)=0 and sum(year>=2010 and dummy=0) = 0
;
quit;
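If you need the detail rows rather than just the list of IDs, the same HAVING condition can be combined with select *, since SAS remerges the group-level sums back onto every row of the group. A sketch using the sample data from the question (the table names have and want are placeholders):

```sas
data have;
    input id year dummy;
    datalines;
1234 2007 0
1234 2008 0
1234 2009 0
1234 2010 1
1234 2011 1
2345 2008 0
2345 2009 1
2345 2010 1
2345 2011 1
3456 2008 0
3456 2009 0
3456 2010 1
3456 2011 1
;
run;

proc sql;
    create table want as
    select *
    from have
    group by id
    /* no dummy=1 before 2010 and no dummy=0 from 2010 onwards */
    having sum(year <= 2009 and dummy = 1) = 0
       and sum(year >= 2010 and dummy = 0) = 0
    order by id, year;
quit;
```

With this sample, want keeps all rows for 1234 and 3456 and drops every row for 2345. Note that an ID observed only before 2010 (or only from 2010 onwards) would also pass this filter, so add a further condition if such IDs should be excluded.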