I have a SAS question. I have a large dataset containing unique ID's and a bunch of variables for each year in a time series. Some ID's are present throughout the entire timeseries, some new ID's are added and some old ID's are removed.
ID Year Var3 Var4
1 2015 500 200
1 2016 600 300
1 2017 800 100
2 2016 200 100
2 2017 100 204
3 2015 560 969
3 2016 456 768
4 2015 543 679
4 2017 765 534
As can be seen from the table above, ID 1 is present in all three years (2015-2017), ID 2 is present from 2016 and onwards, ID 3 is removed in 2017 and ID 4 is present in 2015, removed in 2016 and then present again in 2017.
I would like to know which ID's are new and which are removed in any given year, whilst keeping all the data. Eg. a new table with indicators for which ID's are new and which are removed. Furthermore, it would be nice to get a frequency of how many ID' are added/removed in a given year and the sum og their "Var3" and "Var4". Do you have any suggestions how to do that?
************* UPDATE ******************
Okay, so I tried the following program:
**** Addition to suggested code ****;
options validvarname=any;
proc sql noprint;
create table years as
select distinct year
from have;
create table ids as
select distinct id
from have;
create table all_id_years as
select a.id, b.year
from ids as a,
years as b
order by id, year;
create table indicators as
select coalesce(a.id,b.id) as id,
coalesce(a.year,b.year) as year,
coalesce(a.id/a.id,0) as indicator
from have as a
full join
all_id_years as b
on a.id = b.id
and a.year = b.year
order by id, year
;
quit;
Now this will provide me with a table that only contains the ID's that are new in 2017:
data new_in_17;
set indicators;
where ('2016'n=0) and ('2017'n=1);
run;
I can now merge this table to add var3 and var4:
data new17;
merge new_in_17(in=x1) have(in=x2);
by id;
if x1=x2;
run;
Now I can find the frequence of new ID's in 2017 and the sum of var3 and var4:
proc means data=new17 noprint;
var var3 var4;
where year in (2017);
output out=sum_var_freq_new sum(var3)=sum_var3 sum(var4)=sum_var4;
run;
This gives me the output I need. However, I would like the equivalent output for the ID's that are "gone" between 2016 and 2017 which can be made from:
data gone_in_17;
set indicators;
where ('2016'n=1) and ('2017'n=0);
run;
data gone17;
merge gone_in_17(in=x1) have(in=x2);
by id;
if x1=x2;
run;
proc means data=gone17 noprint;
var var3 var4;
where year in (2016);
output out=sum_var_freq_gone sum(var3)=sum_var3 sum(var4)=sum_var4;
run;
The end result should be a combination of the two tables "sum_var_freq_new" and "sum_var_freq_gone" into one table. Furthermore, I need this table for every new year, so my current approach is very inefficient. Do you guys have any suggestions how to achieve this efficiently?
Aside from a different sample, you didn't provide much extra info from your previous question in order to understand what was lacking in the previous answer.
To build on the latter though, you could use a macro do loop to dynamically account for the distinct year values present in your dataset.
data have;
infile datalines;
input ID year var3 var4;
datalines;
1 2015 500 200
1 2016 600 300
1 2017 800 100
2 2016 200 100
2 2017 100 204
3 2015 560 969
3 2016 456 768
4 2015 543 679
4 2017 765 534
;
run;
proc sql noprint;
select distinct year
into :year1-
from have
;
quit;
%macro doWant;
proc sql;
create table want as
select distinct ID
%let i=1;
%do %while(%symexist(year&i.));
,exists(select * from have b where year=&&year&i.. and a.id=b.id) as "&&year&i.."n
%let i=%eval(&i.+1);
%end;
from have a
;
quit;
%mend;
%doWant;
This will produce the following result:
ID 2015 2016 2017
-----------------
1 1 1 1
2 0 1 1
3 1 1 0
4 1 0 1
Here is a more efficient way of doing this and also giving you the summary values.
First a little SQL magic. Create the cross product of years and IDs, then join that to the table you have to create an indicator;
proc sql noprint;
/*All Years*/
create table years as
select distinct year
from have;
/*All IDS*/
create table ids as
select distinct id
from have;
/*All combinations of ID/year*/
create table all_id_years as
select a.id, b.year
from ids as a,
years as b
order by id, year;
/*Original data with rows added for missing years. Indicator=1 if it*/
/*existed prior, 0 if not.*/
create table indicators as
select coalesce(a.id,b.id) as id,
coalesce(a.year,b.year) as year,
coalesce(a.id/a.id,0) as indicator
from have as a
full join
all_id_years as b
on a.id = b.id
and a.year = b.year
order by id, year
;
quit;
Now transpose that.
proc transpose data=indicators out=indicators(drop=_name_);
by id;
id year;
var indicator;
run;
Create the sums. You could also add other summary stats if you wanted here:
proc summary data=have;
by id;
var var3 var4;
output out=summary sum=;
run;
Merge the indicators and the summary values:
data want;
merge indicators summary(keep=id var3 var4);
by id;
run;
Related
I have test scores from many students in 8 different years. I want to retain only the max total score of each student, but then also retain all the student-year related information to that test score (that is, all the columns from the same year in which the student got the highest total score).
An example of the datasets I have:
%macro score;
%do year = 2010 %to 2018;
data student_&year.;
do id=1 to 10;
english=25*rand('uniform');
math=25*rand('uniform');
sciences=25*rand('uniform');
history=25*rand('uniform');
total_score=sum(english, math, sciences, history);
output;
end;
%end;
run;
%mend;
%score;
In my expected output, I would like to retain the max of total_score for each student, and also have the other columns related to that total score. If possible, I would also like to have the information about the year in which the student got the max of total_score. An example of the expected output would be:
DATA want;
INPUT id total_score english math sciences history year;
CARDS;
1 75.4 15.4 20 20 20 2017
2 63.8 20 13.8 10 20 2016
3 48 10 10 18 10 2018
4 52 12 10 10 20 2016
5 69.5 20 19.5 20 10 2013
6 85 20.5 20.5 21 23 2011
7 41 5 12 14 10 2010
8 55.3 15 20.3 10 10 2012
9 51.5 10 20 10 11.5 2013
10 48.9 12.9 16 10 10 2015
;
RUN;
I have been trying to work with the SAS UPDATE procedure. But it just get the most recent value for each student. I want the max total score. Also, within the update framework, I need to update two tables at a time. I would like to compare all tables at the same time. So this strategy I am trying does not work:
data want;
update score_2010 score_2011;
by id;
Thanks to anyone who can provide insights.
It is easier to obtain what you want if you have only one longitudinal dataset with all the original information of your students. It also makes more sense, since you are comparing students across different years.
To build a longitudinal dataset, you will first need to insert a variable informing the year of each of your original datasets. For example with:
%macro score;
%do year = 2010 %to 2018;
data student_&year.;
do id=1 to 10;
english=25*rand('uniform');
math=25*rand('uniform');
sciences=25*rand('uniform');
history=25*rand('uniform');
total_score=sum(english, math, sciences, history);
year=&year.;
output;
end;
%end;
run;
%mend;
%score;
After including the year, you can get a longitudinal dataset with:
data student_allyears;
set student_201:;
run;
Finally, you can get what you want with a proc sql, in which you select the max of "total_score" grouped by "id":
proc sql;
create table want as
select distinct *
from student_allyears
group by id
having total_score=max(total_score);
Create a view that stacks the individual data sets and perform your processing on that.
Example (SQL select, group by, and having)
data scores / view=scores;
length year $4;
set work.student_2010-work.student_2018 indsname=dsname;
year = scan(dsname,-1,'_');
run;
proc sql;
create table want as
select * from scores
group by id
having total_score=max(total_score)
;
Example DOW loop processing
Stack data so the view is processible BY ID. The first DOW loops computes which record has the max total score over the group and the second selects the record in the group for OUTPUT
data scores_by_id / view=scores_by_id;
set work.student_2010-work.student_2018 indsname=dsname;
by id;
year = scan(dsname,-1,'_');
run;
data want;
* compute which record in group has max measure;
do _n_ = 1 by 1 until (last.id);
set scores_by_id;
by id;
if total_score > _max then do;
_max = total_score;
_max_at_n = _n_;
end;
end;
* output entire record having the max measure;
do _n_ = 1 to _n_;
set scores_by_id;
if _n_ = _max_at_n then OUTPUT;
end;
drop _max:;
run;
Dataset a:-
cc dob enrolled
1 10-13-1981 10-13-2001
2 10-17-1984 12-15-2004
3 07-20-1957 12-20-2007
4 10-13-1989 12-24-2010
5 10-13-1996 12-28-2013
6 10-14-1996 12-11-1999
7 10-15-1996 12-24-2010
8 10-16-1996 12-24-2010
9 10-17-1996 12-24-2010
10 10-18-1996 12-24-2010
SAS Code:-
proc sql;
select distinct count(*) as cust_enrolled ,year(enrolled) as yr
from a
group by yr
order by cust_enrolled desc;
quit;
Result:-
cust_enrolled yr
5 2010
1 2013
1 2004
1 1999
1 2001
1 2007
My query is to get the first row from this result. How can I achieve this?
Typically I would use a having clause testing an aggregate such as freq=max(freq). However, since freq is already an aggregate count(*) that has to be in a sub-select.
Example:
data have;
input cc dob: mmddyy10. enrolled: mmddyy10.;
format dob enrolled mmddyy10.;
datalines;
1 10-13-1981 10-13-2001
2 10-17-1984 12-15-2004
3 07-20-1957 12-20-2007
4 10-13-1989 12-24-2010
5 10-13-1996 12-28-2013
6 10-14-1996 12-11-1999
7 10-15-1996 12-24-2010
8 10-16-1996 12-24-2010
9 10-17-1996 12-24-2010
10 10-18-1996 12-24-2010
;
proc sql;
create table most_popular_enrollment_year as
select * from
(select count(*) as freq, year(enrolled) as yr_enroll
from have
group by yr_enroll
)
having freq=max(freq)
;
quit;
If there are multiple years with the max number of year enrollment count the query will return multiple rows. If you want the earliest year of those you need another nesting.
proc sql;
create table earliest_most_popular as
select * from
(
select * from
(
select count(*) as freq, year(enrolled) as yr_enroll
from have
group by yr_enroll
)
having freq=max(freq)
)
having yr_enroll=min(yr_enroll)
;
quit;
Another way is to sort by yr_enroll and use Proc SQL option OUTOBS=1 to grab the first
proc sql outobs=1;
create table earliest_most_popular as
select * from
(
select count(*) as freq, year(enrolled) as yr_enroll
from have
group by yr_enroll
)
having freq=max(freq)
order by yr_enroll
;
reset outobs=max;
You can use the OUTOBS option of PROC SQL to control how many observations the SELECT statement writes to the output destination(s).
First let's convert your listing into an actual dataset.
data have;
input cc dob :mmddyy. enrolled :mmddyy.;
format dob enrolled date9.;
datalines;
1 10-13-1981 10-13-2001
2 10-17-1984 12-15-2004
3 07-20-1957 12-20-2007
4 10-13-1989 12-24-2010
5 10-13-1996 12-28-2013
6 10-14-1996 12-11-1999
7 10-15-1996 12-24-2010
8 10-16-1996 12-24-2010
9 10-17-1996 12-24-2010
10 10-18-1996 12-24-2010
;
Now let's run your SELECT statement with OUTOBS set to 1. Make sure to give it some criteria for deciding which observation to take when there are ties for the largest count.
proc sql outobs=1;
select year(enrolled) as yr
, count(*) as cust_enrolled
from have
group by yr
order by cust_enrolled desc, yr
;
quit;
Results:
cust_
yr enrolled
----------------------
2010 5
You can use data set options anywhere. SQL doesn't guarantee an order so you often will want logic that's more complicated than simply the first, but if that's what you want using the OBS=1 option is a decent option.
proc sql;
select * from sashelp.class(obs=1);
quit;
If you want something besides the first, use FIRSTOBS and OBS together.
proc sql;
select * from sashelp.class(firstobs=10 obs=10);
quit;
I want to do summation for each group and create a new variable for the sum for each group. I tried proc sql, but it only created a new variable.
My dataset looks like:
data have;
input firm year product$ value;
datalines;
1 2012 a 5
1 2012 a 6
1 2012 b 3
1 2013 a 4
1 2013 a 3
1 2013 b 4
1 2013 b 3
2 2012 a 5
2 2012 a 6
2 2012 b 3
2 2012 b 4
2 2012 b 2
2 2013 a 4
2 2013 a 5
2 2013 b 3
2 2013 b 3
;
run;
what I want is a table with four columns: firm year productA_sum productB_sum.
I tried this way:
proc sql;
create table h.want as
select a.*, sum(a.value) as sumvalue
from h.have as a
group by firm, year, product;
quit;
But it only create a new column.
because u group three variables, but in the select, you choose all variables. it will cause group by function useless.
/*Try this one*/
proc sql;
create table h.want as
select a.firm, a.year, a.product, sum(a.value) as sumvalue
from h.have as a
group by firm, year, product;
quit;
To get separate SUM() results based on another variable's value you need to use a CASE statement, not include it in the grouping variables.
proc sql;
create table want as
select firm, year
, sum(case when (product='a') then value else . end) as sum_product_A
, sum(case when (product='b') then value else . end) as sum_product_B
from have
group by firm,year
;
quit;
If you want the sum to be zero instead of missing if the product never appears then replace the missing values in the else clauses with 0 instead.
You are pivoting an aggregate sum. A two step approach could be more desirable if there are more than two product values to contend with.
proc summary data=have nway noprint;
class firm year product;
var value;
output out=class_sums sum=sum;
run;
proc transpose data=sums suffix=_sum out=want(drop=_name_);
by firm year;
id product;
var sum;
run;
I have a sas datebase with something like this:
id birthday Date1 Date2
1 12/4/01 12/4/13 12/3/14
2 12/3/01 12/6/13 12/2/14
3 12/9/01 12/4/03 12/9/14
4 12/8/13 12/3/14 12/10/16
And I want the data in this form:
id Date Datetype
1 12/4/01 birthday
1 12/4/13 1
1 12/3/14 2
2 12/3/01 birthday
2 12/6/13 1
2 12/2/14 2
3 12/9/01 birthday
3 12/4/03 1
3 12/9/14 2
4 12/8/13 birthday
4 12/3/14 1
4 12/10/16 2
Thanks by ur help, i'm on my second week using sas <3
Edit: thanks by remain me that i was not finding a sorting method.
Good day. The following should be what you are after. I did not come up with an easy way to rename the columns as they are not in beginning data.
/*Data generation for ease of testing*/
data begin;
input id birthday $ Date1 $ Date2 $;
cards;
1 12/4/01 12/4/13 12/3/14
2 12/3/01 12/6/13 12/2/14
3 12/9/01 12/4/03 12/9/14
4 12/8/13 12/3/14 12/10/16
; run;
/*The trick here is to use date: The colon means everything beginning with date, comparae with sql 'date%'*/
proc transpose data= begin out=trans;
by id;
var birthday date: ;
run;
/*Cleanup. Renaming the columns as you wanted.*/
data trans;
set trans;
rename _NAME_= Datetype COL1= Date;
run;
See more from Kent University site
Two steps
Pivot the data using Proc TRANSPOSE.
Change the names of the output columns and their labels with PROC DATASETS
Sample code
proc transpose
data=have
out=want
( keep=id _label_ col1)
;
by id;
var birthday date1 date2;
label birthday='birthday' date1='1' date2='2' ; * Trick to force values seen in pivot;
run;
proc datasets noprint lib=work;
modify want;
rename
_label_ = Datetype
col1 = Date
;
label
Datetype = 'Datetype'
;
run;
The column order in the TRANSPOSE output table is:
id variables
copy variables
_name_ and _label_
data based column names
The sample 'want' shows the data named columns before the _label_ / _name_ columns. The only way to change the underlying column order is to rewrite the data set. You can change how that order is perceived when viewed is by using an additional data view, or an output Proc that allows you to specify the specific order desired.
Here is my data :
data example;
input id sports_name;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
This is just a sample. The variable sports_name is categorical with 56 types.
I am trying to transpose the data to wide form where each row would have a user_id and the names of sports as the variables with values being 1/0 indicating Presence or absence.
So far, I used proc freq procedure to get the cross tabulated frequency table and put that in a different data set and then transposed that data. Now i have missing values in some cases and count of the sports in rest of the cases.
Is there any better way to do this?
Thanks!!
You need a way to create something from nothing. You could have also used the SPARSE option in PROC FREQ. SAS names cannot have length greater than 32.
data example;
input id sports_name :$16.;
retain y 1;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;;;;
run;
proc print;
run;
proc summary data=example nway completetypes;
class id sports_name;
output out=freq(drop=_type_);
run;
proc print;
run;
proc transpose data=freq out=wide(drop=_name_);
by id;
var _freq_;
id sports_name;
run;
proc print;
run;
Same theory here, generate a list of all possible combinations using SQL instead of Proc Summary and then transposing the results.
data example;
informat sports_name $20.;
input id sports_name $;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;
run;
proc sql;
create table complete as
select a.id, a_x.sports_name, case when not missing(e.sports_name) then 1 else 0 end as Present
from (select distinct ID from example) a
cross join (select distinct sports_name from example) a_x
full join example as e
on e.id=a.id
and e.sports_name=a_x.sports_name;
quit;
proc transpose data=complete out=want;
by id;
id sports_name;
var Present;
run;