SAS - I'd like to count the number of times a record appears within a variable (Ref) and apply a count in a new variable (Count) for these.
Eg.
Ref
Count
1000
1
1001
1
2000
1
3000
1
1000
2
1000
3
What is the best way to do this?
That is what PROC FREQ is for. It will count the number of OBSERVATIONS for each value of a variable (or combination of variables).
proc freq data=have;
tables REF ;
run;
If you want the result in a dataset then use the OUT= option of the TABLES statement.
proc freq data=have;
tables REF / out=want;
run;
Managed to achive the results with the below code.
Please note - Data needs to be sorted with a PROC SORT before running this.
DATA want;
set have;
BY Variable;
IF FIRST.Variable then counter = 1
ELSE counter + 1;
RUN;
If you use the SAS hash object, you can do it with the original order intact
data have;
input Ref $;
datalines;
1000
1001
2000
3000
1000
1000
;
data want;
if _N_ = 1 then do;
dcl hash h();
h.definekey('Ref');
h.definedata('Count');
h.definedone();
end;
set have;
if h.find() ne 0 then Count = 1;
else Count + 1;
h.replace();
run;
I have a data set in SAS containing individuals as rows and a variable for each period as columns. It looks something like this:
data have;
input individual t1 t2 t3;
cards;
1 112 111 123
2 112 111 123
3 111 111 123
4 112 112 111
;
run;
What I want is for SAS to count how many there is of each number for each time period. So I want to get something like it:
data want;
input count t1 t2 t3;
cards;
111 1 3 1
112 3 1 0
123 0 0 3
;
run;
I could do this with proc freq, but outputting this doesn't work very well, when I have a lot of columns.
Thanks
In general having data in the meta data is a bad idea, as here where PERIOD is coded into the Tn variables and you really want that to be a group. Having said that you can still have your cake and eat it too.
PROC SUMMARY can get the counts for each Tn quickly and then you will have smaller data set to fiddle with. Here is one approach that should work well for many time periods.
data have;
input individual t1 t2 t3;
cards;
1 112 111 123
2 112 111 123
3 111 111 123
4 112 112 111
;;;;
run;
proc print;
run;
proc summary data=have chartype;
class t:;
ways 1;
output out=want;
run;
proc print;
run;
data want;
set want;
p = findc(_type_,'1');
c = coalesce(of t1-t3);
run;
proc print;
run;
proc summary data=want nway completetypes;
class c p;
freq _freq_;
output out=final;
run;
proc print;
run;
proc transpose data=final out=morefinal(drop=_name_) prefix=t;
by c;
id p;
var _freq_;
run;
proc print;
run;
First restructure the data so that it is in more of a vertical fashion. This will be easier to work with. We also want to create a flag that we will use as a counter later on.
data have2;
set have;
array arr[*] t1-t3;
flag = 1;
do period=lbound(arr) to hbound(arr);
val = arr[period];
output;
end;
keep period val flag;
run;
Summarize the data so we have the number of times that value occurred in each of the periods.
proc sql noprint;
create table smry as
select val,
period,
sum(flag) as count
from have3
group by 1,2
order by 1,2
;
quit;
Transpose the data so we have one line per value and then the counts for each period after that:
proc transpose data=smry out=want(drop=_name_);
by val;
id period;
var count;
run;
Note that when you define the array in the first step you could use this notation which would allow for a dynamic number of periods:
array arr[*] t:;
This assumes every variable beginning with 't' in the dataset should go into the array.
If your computer memory is large enough to hold the entire output, then Hash could be a viable solution:
data have;
input individual t1 t2 t3;
cards;
1 112 111 123
2 112 111 123
3 111 111 123
4 112 112 111
;
run;
data _null_;
if _n_=1 then
do;
/*This is to construct a Hash, where count is tracked and t1-t3 is maintained*/
declare hash h(ordered:'a');
h.definekey('count');
h.definedata('count', 't1','t2','t3');
h.definedone();
call missing(count, t1,t2,t3);
end;
set have(rename=(t1-t3=_t1-_t3))
/*rename to avoid conflict between input data and Hash object*/
end=last;
array _t(*) _t:;
array t(*) t:;
/*The key is to set up two arrays, one is for input data,
another is for Hash feed, and maneuver their index variable accordingly*/
do i=1 to dim(_t);
count=_t(i);
rc=h.find(); /*search the Hash and bring back data elements if found*/
/*If there is a match, then corresponding 't' will increase by '1'*/
if rc=0 then
t(i)+1;
else
do;
/*If there is no match, then corresponding 't' will be initialized as '1',
and all of the other 't' reset to '0'*/
do j=1 to dim(t);
t(j)=0;
end;
t(i)=1;
end;
rc=h.replace(); /*Update the Hash*/
end;
if last then
rc=h.output(dataset:'want');
run;
Try this:
%macro freq(dsn);
proc sql;
select name into:name separated by ' ' from dictionary.columns where libname='WORK' and memname='HAVE' and name like 't%';
quit;
%let ncol=%sysfunc(countw(&name,%str( )));
%do i=1 %to &ncol;
%let col=%scan(&name,&i);
proc freq data=have;
table &col/out=col_&i(keep=&col count rename=(&col=count count=&col));
run;
%end;
data temp;
merge
%do i=1 %to &ncol;
col_&i
%end;
;
by count;
run;
data want;
set temp;
array vars t:;
do over vars;
if missing(vars) then vars=0;
end;
run;
%mend;
%freq(have)
I have a data set with one row for each country and 100 columns (10 variables with 10 data years each).
For each variable I am trying to make a new data set with the three most recent data years for that variable for each country (which might not be successive).
This is what I have so far, but I know its wrong because of the nest loop, and its has same value for recent1 recent2 recent3 however I haven't figured out how to create recent1 recent2 recent3 without two loops.
%macro test();
data Maternal_care_recent;
set wb;
keep country MATERNAL_CARE_2004 -- MATERNAL_CARE_2013 recent_1 recent_2 recent_3;
%let rc = 1;
%do i = 2013 %to 2004 %by -1;
%do rc = 1 %to 3 %by 1;
%if MATERNAL_CARE_&i. ne . %then %do;
recent_&rc. = MATERNAL_CARE_&i.;
%end;
%end;
%end; run; %mend; %test();
You don't need to use a macro to do this - just some arrays:
data Maternal_care_recent;
set wb;
keep country MATERNAL_CARE_2004-MATERNAL_CARE_2013 recent_1 recent_2 recent_3;
array mc {*} MATERNAL_CARE_2004-MATERNAL_CARE_2013;
array recent {*} recent1-recent3;
do i = 2013 to 2004 by -1;
do rc = 1 to 3 by 1;
if mc[i] ne . then do;
recent[rc] = mc[i];
end;
end;
run;
Maybe I don't get your request, but according to your description:
"For each variable I am trying to make a new data set with the three most recent data years for that variable for each country (which might not be successive)" I created this sample dataset with dt1 and dt2 and 2 locations.
The output will be 2 datasets (and generally the number of the variables starting with DT) named DS1 and DS2 with 3 observations for each country, the first one for the first variable, the second one for the second variable.
This is the sample dataset:
data sample_ds;
length city $10 dt1 dt2 8.;
infile datalines dlm=',';
input city $ dt1 dt2;
datalines;
MS,5,0
MS,3,9
MS,3,9
MS,2,0
MS,1,8
MS,1,7
CA,6,1
CA,6,.
CA,6,.
CA,2,8
CA,1,5
CA,0,4
;
This is the sample macro:
%macro help(ds=);
data vars(keep=dt:); set &ds; if _n_ not >0; run;
%let op = %sysfunc(open(vars));
%let nvrs = %sysfunc(attrn(&op,nvars));
%let cl = %sysfunc(close(&op));
%do idx=1 %to &nvrs.;
proc sort data=&ds(keep=city dt&idx.) out=ds&idx.(where=(dt&idx. ne .)) nodupkey; by city DESCENDING dt&idx.; run;
data ds&idx.; set ds&idx.;
retain cnt;
by city DESCENDING dt&idx.;
if first.city then cnt=0; else cnt=cnt+1;
run;
data ds&idx.(drop=cnt); set ds&idx.(where=(cnt<3)); rename dt&idx.=act&idx.; run;
%end;
%mend;
You will run this macro with:
%help(ds=sample_ds);
In the first statement of the macro I select the variables on which I want to iterate:
data vars(keep=dt:); set &ds; if _n_ not >0; run;
Work on this if you want to make this work for your code, or simply rename your variables as DT1 DT2...
Let me know if it is correct for you.
When writing macro code, always keep in mind what has to be done when. SAS processes your code stepwise.
Before your sas code is even compiled, your macro variables are resolved and your macro code is executed
Then the resulting SAS Base code is compiled
Finally the code is executed.
When you write %if MATERNAL_CARE_&i. ne . %then %do, this is macro code interpreded before compilation.
At that time MATERNAL_CARE_&i. is not a variable but a text string containing a macro variable.
The first time you run trhough your %do i = 2013 %to 2004 by -1, it is filled in as MATERNAL_CARE_2013, the second as MATERNAL_CARE_2012., etc.
Then the macro %if statement is interpreted, and as the text string MATERNAL_CARE_1 is not equal to a dot, it is evaluated to FALSE
and recent_&rc. = MATERNAL_CARE_&i. is not included in the code to pass to your compiler.
You can see that if you run your code with option mprint;
The resolution;
options mprint;
%macro test();
data Maternal_care_recent;
set wb;
keep country MATERNAL_CARE_: recent_:;
** The : acts as a wild card here **;
%do i = 2013 %to 2004 %by -1;
if MATERNAL_CARE_&i. ne . then do;
%do rc = 1 %to 3 %by 1;
recent_&rc. = MATERNAL_CARE_&i.;
%end;
end;
%end;
run;
%mend;
%test();
Now, before compilation of if MATERNAL_CARE_&i. ne . then do, only the &i. is evalueated and if MATERNAL_CARE_2013 ne . then do is passed to the compiler.
The compiler will see this as a test if the SAS variable MATERNAL_CARE_1 has value missing, and that is just what you wanted;
Remark:
It is not essential that I moved the if statement above the ``. It is just more efficient because the condition is then evaluated less often.
It is however essential that you close your %ifs and %dos with an %end and your ifs and dos with an end;
Remark:
you do not need %let rc = 1, because %do rc = 1 to 3 already initialises &rc.;
For completeness SAS is compiled stepwise:
The next PROC or data step and its macro code are only considered when the preveous one is executed.
That is why you can write macro variables from a data step or sql select into that will influence the code you compile in your next step,
somehting you can not do for instance with C++ pre compilation;
Thanks everyone. Found a hybrid solution from a few solutions posted.
data sample_ds;
infile datalines dlm=',';
input country $ maternal_2004 maternal_2005
maternal_2006 maternal_2007 maternal_2008 maternal_2009 maternal_2010 maternal_2011 maternal_2012 maternal_2013;
datalines;
MS,5,0,5,0,5,.,5,.,5,.
MW,3,9,5,0,5,0,5,.,5,0
WE,3,9,5,0,5,.,.,.,.,0
HU,2,0,5,.,5,.,5,0,5,0
MI,1,8,5,0,5,0,5,.,5,0
HJ,1,7,5,0,5,0,.,0,.,0
CJ,6,1,5,0,5,0,5,0,5,0
CN,6,1,.,5,0,5,0,5,0,5
CE,6,5,0,5,0,.,0,5,.,8
CT,2,5,0,5,0,5,0,5,0,9
CW,1,5,0,5,0,5,.,.,0,7
CH,0,5,0,5,0,.,0,.,0,5
;
%macro test(var);
data &var._recent;
set sample_ds;
keep country &var._1 &var._2 &var._3;
array mc {*} &var._2004-&var._2013;
array recent {*} &var._1-&var._25;
count=1;
do i = 10 to 1 by -1;
if mc[i] ne . then do;
recent[count] = mc[i];
count=count+1;
end;
end;
run;
%mend;
Sorry for the vauge title.
My data set looks essentially like this:
ID X
18 1
18 1
18 2
18 1
18 2
369 2
369 3
369 3
361 1
what I want is to find the max value of x for each ID. In this dataset, that would be 2 for ID=18 and 3 for ID=361.
Any feedback would be greatly appreciated.
Proc Means with a class statement (so you don't have to sort) and requesting the max statistic is probably the most straightforward approach (untested):
data sample;
input id x;
datalines;
18 1
18 1
18 2
18 1
18 2
369 2
369 3
369 3
361 1
;
run;
proc means data=sample noprint max nway missing;
class id;
var x;
output out=sample_max (drop=_type_ _freq_) max=;
run;
Check out the online SAS documentation for more details on Proc Means (http://support.sas.com/onlinedoc/913/docMainpage.jsp).
I don't quite understand your example. I can't imagine that the input data set really has all the values in one observation. Do you instead mean something like this?
data sample;
input myid myvalue;
datalines;
18 1
18 1
18 2
18 1
18 2
369 2
369 3
369 3
361 1
;
proc sort data=sample;
by myid myvalue;
run;
data result;
set sample;
by myid;
if last.myid then output;
run;
proc print data=result;
run;
This would give you this result:
Obs myid myvalue
1 18 2
2 361 1
3 369 3
If you want to keep both all records and the max value of X by id, I would use either the PROC MEANS aproach followed by a merge statement, or you can sort the data by Id and DESCENDING X first, and then use the RETAIN statement to create the max_value directly in the datastep:
PROC SORT DATA=A; BY ID DESCENDING X; RUN;
DATA B; SET A;
BY ID;
RETAIN X_MAX;
IF FIRST.ID THEN X_MAX = X;
ELSE X_MAX = X_MAX;
RUN;
You could try this:
PROC SQL;
CREATE TABLE CHCK AS SELECT MYID, MAX(MYVALUE) FROM SAMPLE
GROUP BY 1;
QUIT;
A couple of more over-engineered options that might be of interest for anyone who needs to do this with a really big dataset, where performance is more of a concern:
If your dataset is already sorted by ID, but not by X within each ID, you can still do this in a single data step without any sorting, using a retained max within each by group. Alternatively, you can use proc means (as per the top answer) but with a by statement rather than a class statement - this reduces the memory usage.
data sample;
input id x;
datalines;
18 1
18 1
18 2
18 1
18 2
369 2
369 3
369 3
361 1
;
run;
data want;
do until(last.ID);
set sample;
by ID;
xmax = max(x, xmax);
end;
x = xmax;
drop xmax;
run;
Even if your dataset is not sorted by ID, you can still do this in one data step, without sorting it, by using a hash object to keep track of the maximum x value you've found for each ID as you go along. This will be a little faster than proc means and will typically use less memory, as proc means does various calculations in the background which are not needed in the output dataset.
data _null_;
set sample end = eof;
if _n_ = 1 then do;
call missing(xmax);
declare hash h(ordered:'a');
rc = h.definekey('ID');
rc = h.definedata('ID','xmax');
rc = h.definedone();
end;
rc = h.find();
if rc = 0 then do;
if x > xmax then do;
xmax = x;
rc = h.replace();
end;
end;
else do;
xmax = x;
rc = h.add();
end;
if eof then rc = h.output(dataset:'want2');
run;
In this example, on my PC, the hash approach used this much memory:
memory 966.15k
OS Memory 27292.00k
vs. this much for an equivalent proc summary:
memory 8706.90k
OS Memory 35760.00k
Not a bad saving if you really need it to scale up!
Use an appropriate proc with the by statement. For instance,
data sample;
input myid myvalue;
datalines;
18 1
18 1
18 2
18 1
18 2
369 2
369 3
369 3
361 1
;
run;
proc sort data=sample;
by myid;
run;
proc means data=sample;
var myvalue;
by myid;
run;
I would just sort by x and id putting the highest value for each ID at the top.
NODUPKEY removes every duplicate below.
proc sort data=yourstacked_data out=yourstacked_data_sorted;
by DECENDING x id;
run;
proc sort data=yourstacked_data NODUPKEY out=top_value_only;
by id;
run;
A multidata hash should be used if you want the result to show each id at max value. That is, for the cases when more than one id is found having a max value
Example code:
Find the ids associated with the max value of 40 different numeric variables.
The code is Proc DS2 data program.
data have;
call streaminit(123);
do id = 1 to 1e5; %* 10,000 rows;
array v v1-v40; %* 40 different variables;
do over v; v=ceil(rand('uniform', 2e5)); end;
output;
end;
run;
proc ds2;
data _null_;
declare char(32) _name_ ; %* global declarations;
declare double value id;
declare package hash result();
vararray double v[*] v:; %* variable based array, limit yourself to 1,000;
declare double max[1000]; %* temporary array for holding the vars maximum values;
method init();
declare package sqlstmt s('drop table want'); %* DS2 version of `delete`;
s.execute();
result.keys([_name_]); %* instantiate a multidata hash;
result.data([_name_ value id]);
result.multidata();
result.ordered('ascending');
result.defineDone();
end;
method term();
result.output('want'); %* write the results to a table;
end;
method run();
declare int index;
set have;
%* process each variable being examined for 'id at max';
do index = 1 to dim(v);
if v[index] > max[index] then do; %* new maximum for this variable ?
_name_ = vname(v[index]); %* retrieve variable name;
value = v[index]; %* move value into hash host variable;
if not missing (max[index]) then do;
result.removeall(); %* remove existing multidata items associated with the variable;
end;
result.add(); %* add new multidata item to hash;
max[index] = v[index]; %* track new maximum;
end;
else
if v[index] = max[index] then do; %* two or more ids have same max;
_name_ = vname(v[index]);
value = v[index];
result.add(); %* add id to the multidata item;
end;
end;
end;
enddata;
run;
quit;
%let syslast=want;
Reminder: Proc DS2 defaults are to not overwrite existing tables. To 'overwrite' a table you need to either:
Use table option overwrite=yes when syntax allows
The package hash .output() method does not recognize the table option
Drop the table before recreating it
The above code can be used in a Base SAS DATA step with minor modifications.