Calculate the number of each variable where each variable =1 - sas

Could you help me count, for each variable, the number of observations where that variable = 1? I posted a question about how to count the missing values here. Hopefully, this can be done in a similar way. Thanks in advance.
/*y00*/
%let list0=OCALZHMR OCARTERY OCARTH OCCHD OCDIABTS OCHBP OCMENTAL OCMYOCAR
OCOTHART OCPSYCH OCSTROKE;
/*y01 and y02*/
%let list1=D_CFAIL D_CHD D_HBP D_MYOCAR D_OTHHRT D_PSYCH D_RHYTHM D_STROKE
D_VALVE OCALZHMR OCARTERY OCARTH OCCHD OCDIABTS OCHBP OCMENTAL OCMYOCAR
OCOTHART OCPSYCH OCSTROKE;
proc means data=cohort00 nmiss noprint;
var &list0;
output out=y2000_nmiss(drop=_:) nmiss= ;
run;
proc means data=cohort01 nmiss noprint;
var &list1;
output out=y2001_nmiss(drop=_:) nmiss= ;
run;
data y2000_nmiss;
set y2000_nmiss;
j=1;
run;
data y2001_nmiss;
set y2001_nmiss;
j=1;
run;
proc transpose data=y2000_nmiss out=long0(rename=(COL1=Y2000 _name_=VAR));
by j;
run;
proc transpose data=y2001_nmiss out=long1(rename=(COL1=Y2001 _name_=VAR));
by j;
run;
data ATC_missing;
merge long0 long1;
by VAR;
drop j;
run;
Here is part of the output table with the number of missing values:
VARS Y2000 Y2001 Y2002
OCDIABTS 0 1 0
OCHBP 0 0 0
OCMENTAL 17 18 10
OCMYOCAR 0 0 0
OCOTHART 0 0 4758
OCOTHHRT . . .
OCPSYCH 0 0 0

%let list1=Width Length Depth;
data work.is_even / view=work.is_even;
set sashelp.lake;
array vars {*} &list1 ;
drop i;
do i=1 to dim(vars);
if mod(round(vars(i), 1),2) = 0 /* would be VARS(I)=1 for your case */
then vars(i)=1;
else vars(i)=.;
end;
run;
proc means data=work.is_even n;
run;
First I create a DATA step view, work.is_even (a view, in order to avoid making a full copy of the data), that manipulates the data the way I need.
Here it overwrites the original values of the variables (only in the view, not in the original data) with 1 if the rounded value is even and with missing if it is odd.
Then, just count the nonmissing values (the N statistic in PROC MEANS).
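For the original question (counting values equal to 1), the same view idea might look like the sketch below. The data set work.demo, its values, and the names work.is_one and work.n_ones are made up for illustration; only the variable names come from the question's list.

```sas
/* Hypothetical test data: three indicator variables with some missing values */
data work.demo;
  input OCCHD OCHBP OCSTROKE;
  datalines;
1 0 1
1 1 .
0 1 1
. 1 0
;
run;

%let list0=OCCHD OCHBP OCSTROKE;

/* View that keeps a value only when it equals 1; everything else becomes missing */
data work.is_one / view=work.is_one;
  set work.demo;
  array vars {*} &list0;
  do i = 1 to dim(vars);
    if vars(i) ne 1 then vars(i) = .;
  end;
  drop i;
run;

/* N now counts the values that were equal to 1 */
proc means data=work.is_one n noprint;
  var &list0;
  output out=work.n_ones(drop=_:) n=;
run;

proc print data=work.n_ones;
run;
```

With this test data the counts should be 2 for OCCHD, 3 for OCHBP and 2 for OCSTROKE. The one-row output has the same shape as the y2000_nmiss table in the question, so the same transpose and merge steps could be applied to it.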

Related

Loop through SAS variables and create data sets

I have a SAS data set t3. I want to run a data step inside a loop through a set of variables to create additional sets based on the variable value = 1, and rank two variables bal and otheramt in each subset, and then merge the ranks for each subset onto the original data set. Each rank column needs to be dynamically named so I know what subset is getting ranked. I know how to do proc rank and macros basically but do not know how to do this in the most dynamic way inside of a macro. Can you assist?
ID bal otheramt firstvar secondvar lastvar
444 581 100 1 1
555 255 200 1 1 1
666 255 300 -- 1 --
%macro dog();
data new;
set t3;
ARRAY Indicators(5) FirstVar--LastVar;
/*create data set for each of the subsets if firstvar = 1, secondvar = 1 ... lastvar = 1 */
/*for each new data set, rank by bal and otheramt*/
/*name the new rank columns [FirstVar]BalRank, [FirstVar]OtherAmtRank; */
/*merge the new ranks onto the original data set by ID*/
%mend;
%dog()
The Proc rank section would be something like this, but I would need the rank columns to have information about what subset I am ranking.
proc rank data=subset1 out=subset1ranked;
var bal otheramt;
ranks bal_rank otheramt_rank;
run;
Instead of using a macro, transform and reshape the data so that simpler steps can be written.
Example:
Rows are split into multiple rows based on the flags so that BY-group processing in PROC RANK can occur. Two transposes are then required to reshape the results back to a single row per id.
data have;
call streaminit(20230216);
do id = 1 to 100;
foo = rand('integer', 50,150);
bar = rand('integer', 100,200);
flag1 = rand('integer', 0, 1);
flag2 = rand('integer', 0, 1);
flag3 = rand('integer', 0, 1);
output;
end;
run;
data step1;
set have;
/* important: the group value becomes part of the variable name later */
if flag1 then do; group='flag1_'; output; end;
if flag2 then do; group='flag2_'; output; end;
if flag3 then do; group='flag3_'; output; end;
drop flag:;
run;
proc sort data=step1;
by group;
run;
proc rank data=step1 out=step2;
by group;
var foo bar;
ranks foo_rank bar_rank;
run;
proc sort data=step2;
by id group;
run;
* pivot (reshape) so there is one row per ranked var;
proc transpose data=step2 out=step3(drop=_label_);
by id foo bar group;
var foo_rank bar_rank;
run;
* pivot again so there is one row per id;
proc transpose data=step3 out=step4(drop=_name_);
by id;
var col1;
id group _name_;
run;
* merge so those 0 0 0 flag rows remain intact;
data want;
merge have step4;
by id;
run;
Since we don't have much sample data, I created test data from sashelp.class with some indicator variables like yours.
data have;
set sashelp.class;
firstvar=round(rand('uniform',1));
secondvar=round(rand('uniform',1));
thirdvar=round(rand('uniform',1));
drop sex weight;
run;
Partial output:
Name Age Height firstvar secondvar thirdvar
Alfred 14 69 1 0 1
Alice 13 56.5 0 1 1
Barbara 13 65.3 1 0 0
Carol 14 62.8 0 0 0
To dynamically rank data based on indicator variables, I created a macro that accepts a list of indicators and rank variables. The 2 lists help to create the specific variable names you requested. Here's the macro call:
%rank(indicators=firstvar secondvar thirdvar,
rank_vars=age height);
Here's part of the final output. Notice the indicators in the sample output above coincide with the ranks in this output. Also note that Carol is not in the output because she had no indicators set to 1.
Name Age Height firstvar_age_rank firstvar_height_rank secondvar_age_rank secondvar_height_rank thirdvar_age_rank thirdvar_height_rank
Alfred 14 69 8 11 . . 6.5 10
Alice 13 56.5 . . 3.5 2 4.5 2
Barbara 13 65.3 6.5 8 . . . .
Henry 14 63.5 . . 5.5 5 . .
The full macro is listed below. It has 3 parts.
Create a temp data set with a group variable that contains the number of the indicator variable based on the order of the variable in the list. Whenever an indicator = 1 the obs is output. If an obs has all 3 indicators set to 1 then it will be output 3 times with the group variable set to the number of each indicator variable. This step is important because proc rank will rank groups independently.
Generate the rankings on the temp data set. Each group will be ranked independently of the other groups and can be done in one step.
Construct the final data set by essentially transposing the ranked data into columns.
%macro rank(indicators=, rank_vars=);
%let cnt_ind = %sysfunc(countw(&indicators));
%let cnt_vars = %sysfunc(countw(&rank_vars));
data temp;
set have;
array indicators(*) &indicators;
do i = 1 to dim(indicators);
if indicators(i) = 1 then do;
group = i; * create a group based on order of indicators;
output; * an obs can be output multiple times;
end;
end;
drop i &indicators;
run;
proc sort data=temp;
by group;
run;
* Generate rankings by group;
proc rank data=temp out=ranks;
by group;
var &rank_vars;
ranks
%let vars = ;
%do i = 1 %to &cnt_vars;
%let var = %scan(&rank_vars, &i);
%let vars = &vars &var._rank;
%end;
&vars;
run;
proc sort data=ranks;
by name group;
run;
* Construct final data set by transposing the ranks into columns;
data want;
set ranks;
by name;
* retain statement to declare new variables and retain values;
retain
%let vars = ;
%do i = 1 %to &cnt_ind;
%let ivar = %scan(&indicators, &i);
%do j = 1 %to &cnt_vars;
%let jvar = %scan(&rank_vars, &j);
%let vars = &vars &ivar._&jvar._rank;
%end;
%end;
&vars;
if first.name then call missing (of &vars);
* option 1: build series of IF statements;
%let vars = ;
%do i = 1 %to &cnt_ind;
%let ivar = %scan(&indicators, &i);
%str(if group = &i then do;)
%do j = 1 %to &cnt_vars;
%let jvar = %scan(&rank_vars, &j);
%let newvar = &ivar._&jvar._rank;
%str(&newvar = &jvar._rank;)
%end;
%str(end;)
%end;
if last.name then output;
drop group
%let vars = ;
%do i = 1 %to &cnt_vars;
%let var = %scan(&rank_vars, &i);
%let vars = &vars &var._rank;
%end;
&vars;
run;
%mend;
When constructing the final data set and transposing the rank variables, there are a couple of options. The first option shown above is to dynamically build a series of if statements. Here is what the code generates:
MPRINT(RANK): * option 1: build series of IF statements;
MPRINT(RANK): if group = 1 then do;
MPRINT(RANK): firstvar_age_rank = age_rank;
MPRINT(RANK): firstvar_height_rank = height_rank;
MPRINT(RANK): end;
MPRINT(RANK): if group = 2 then do;
MPRINT(RANK): secondvar_age_rank = age_rank;
MPRINT(RANK): secondvar_height_rank = height_rank;
MPRINT(RANK): end;
MPRINT(RANK): if group = 3 then do;
MPRINT(RANK): thirdvar_age_rank = age_rank;
MPRINT(RANK): thirdvar_height_rank = height_rank;
MPRINT(RANK): end;
The 2nd option is to use an array and mathematically calculate the index into the array by the group number and variable number. Here is the snippet of macro code to replace the if series code:
* option 2: create arrays and calculate index into array
* by group number and variable number;
array ranks(*) &vars;
array rankvars(*)
%let vars = ;
%do i = 1 %to &cnt_vars;
%let var = %scan(&rank_vars, &i);
%let vars = &vars &var._rank;
%end;
&vars;
%str(idx = dim(rankvars) * (group - 1);)
%str(do i = 1 to dim(rankvars);)
%str(ranks(idx + i) = rankvars(i);)
%str(end;)
Here is the generated code:
MPRINT(RANK): * option 2: create arrays and calculate index into array * by group number and variable number;
MPRINT(RANK): array ranks(*) firstvar_age_rank firstvar_height_rank secondvar_age_rank secondvar_height_rank thirdvar_age_rank
thirdvar_height_rank;
MPRINT(RANK): array rankvars(*) age_rank height_rank;
MPRINT(RANK): idx = dim(rankvars) * (group - 1);
MPRINT(RANK): do i = 1 to dim(rankvars);
MPRINT(RANK): ranks(idx + i) = rankvars(i);
MPRINT(RANK): end;
It takes a minute to understand the array option, but once you do, it is preferable to generating if statements. As the number of variables increases, the code generated by the array option stays the same and operates more efficiently.
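The index arithmetic behind the array option can be checked on its own. The following is a minimal sketch, assuming 3 groups and 2 rank variables as in the example call; the variable names n_vars and slot are made up for illustration.

```sas
/* Show which slot of the combined ranks array each (group, variable) pair maps to */
data _null_;
  n_vars = 2;                               /* plays the role of dim(rankvars) */
  do group = 1 to 3;
    do i = 1 to n_vars;
      slot = n_vars * (group - 1) + i;      /* same formula as idx + i in the macro */
      put group= i= slot=;
    end;
  end;
run;
```

The log should show slots 1 and 2 for group 1, 3 and 4 for group 2, and 5 and 6 for group 3, which matches the order of the variables in the generated array ranks(*) statement.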

Count the number of times record appears within a variable and apply count in a new variable

SAS - I'd like to count the number of times a record appears within a variable (Ref) and apply a count in a new variable (Count) for these.
Eg.
Ref Count
1000 1
1001 1
2000 1
3000 1
1000 2
1000 3
What is the best way to do this?
That is what PROC FREQ is for. It will count the number of OBSERVATIONS for each value of a variable (or combination of variables).
proc freq data=have;
tables REF ;
run;
If you want the result in a dataset then use the OUT= option of the TABLES statement.
proc freq data=have;
tables REF / out=want;
run;
Managed to achieve the result with the code below.
Please note: the data needs to be sorted by the BY variable with PROC SORT before running this.
DATA want;
set have;
BY Variable;
IF FIRST.Variable then counter = 1;
ELSE counter + 1;
RUN;
If you use the SAS hash object, you can do it with the original order intact
data have;
input Ref $;
datalines;
1000
1001
2000
3000
1000
1000
;
data want;
if _N_ = 1 then do;
dcl hash h();
h.definekey('Ref');
h.definedata('Count');
h.definedone();
end;
set have;
if h.find() ne 0 then Count = 1;
else Count + 1;
h.replace();
run;

Frequency of entire table in SAS

Is it possible to get the frequency of an entire table in SAS? For example I want to count how many yes's or no's are in an entire table? Thanks
A hash component object has keys and can track FIND references to each key in a key summary variable, which is named with the keysum: argument tag supplied at instantiation. Each reference increments the key summary by the value of the variable named with suminc:, so when that variable is 1 the key summary ends up holding a frequency count.
data have;
* Words array from Abstract;
* "How Do I Love Hash Tables? Let Me Count The Ways!";
* by Judy Loren, Health Dialog Analytic Solutions;
* SGF 2008 - Beyond the Basics;
* https://support.sas.com/resources/papers/proceedings/pdfs/sgf2008/029-2008.pdf;
array words(17) $10 _temporary_ (
'I' 'love' 'hash' 'tables'
'You' 'will' 'too' 'after' 'you' 'see'
'what' 'they' 'can' 'do' '--' 'Judy' 'Loren'
);
call streaminit(123);
do row = 1 to 127;
attrib RESPONSE1-RESPONSE20 length = $10;
array RESPONSE RESPONSE1-RESPONSE20;
do over RESPONSE;
RESPONSE = words(rand('integer', 1, dim(words)));
end;
output;
end;
run;
data _null_;
if _n_ = 1 then do;
length term $10;
call missing (term);
retain one 1;
retain count 0;
declare hash bins(suminc:'one', keysum:'count');
bins.defineKey('term');
bins.defineData('term');
bins.defineDone();
end;
set have end=lastrow;
array response response1-response20;
do over response;
if bins.find(key:response) ne 0 then do;
bins.add(key:response, data:response, data:1);
end;
end;
if lastrow;
bins.output(dataset:'all_freq');
run;
Original answer, which presumed only Yes and No values
Yes. You can put the values in an array, compute a 0/1 flag for each No/Yes value, and then use SUM to count the 0's and 1's. SUM yields a frequency count only because the flags are just 0's and 1's.
Example:
data have;
call streaminit(123);
do row = 1 to 100;
attrib ANSWER1-ANSWER20 length = $3;
array ANSWER ANSWER1-ANSWER20;
do over ANSWER; ANSWER = ifc(rand('uniform') > 0.15,'Yes','No'); end;
output;
end;
run;
data want(keep=freq_1 freq_0);
set have end=lastrow;
array ANSWER ANSWER1-ANSWER20;
array X(20) _temporary_;
do over ANSWER; x(_I_) = ANSWER = 'Yes'; end;
freq_1 + sum (of X(*));
freq_0 + dim(X) - sum (of X(*));
if lastrow;
run;
Transpose your main data and then run PROC FREQ. This is fully dynamic and scales as the number of questions or the number of distinct responses grows. All the variables do need to be of the same type, though: character or numeric.
*generate fake data;
data have;
call streaminit(99);
array q(30) q1-q30;
do i=1 to 100;
do j=1 to dim(q);
q(j) = rand('bernoulli', 0.8);
end;
output;
end;
run;
*flip it to a long format;
proc transpose data=have out=long;
by I;
var q1-q30;
run;
*get the summaries needed;
proc freq data=long;
table col1;
run;
You should get output as follows:
The FREQ Procedure
COL1 Frequency Percent Cumulative Frequency Cumulative Percent
0 581 19.37 581 19.37
1 2419 80.63 3000 100.00
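The same long-format idea can be sketched for character Yes/No answers with a DATA step instead of PROC TRANSPOSE. The data set and the names long and value below are made up for illustration, and the sketch assumes every character variable in the table is an answer column.

```sas
/* Hypothetical table of character answers */
data have;
  length ANSWER1-ANSWER3 $3;
  input ANSWER1-ANSWER3;
  datalines;
Yes No Yes
No No Yes
Yes Yes Yes
;
run;

/* Stack every character variable into a single column */
data long(keep=value);
  set have;
  array ans {*} _character_;   /* all character variables defined so far */
  length value $3;             /* declared after the array, so not part of it */
  do i = 1 to dim(ans);
    value = ans(i);
    output;
  end;
run;

/* One frequency table for the whole table */
proc freq data=long;
  tables value;
run;
```

For this small table the frequencies should be 6 for Yes and 3 for No.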

Count number of 0 values

Similar to here, I can count the number of missing observations:
data dataset;
input a b c;
cards;
1 2 3
0 1 0
0 0 0
7 6 .
. 3 0
0 0 .
;
run;
proc means data=dataset NMISS N;
run;
But how can I also count the number of observations that are 0?
If you want to count the number of observations that are 0, you'd want to use proc tabulate or proc freq, and do a frequency count.
If you have a lot of values and you just want "0/not 0", that's easy to do with a format.
data have;
input a b c;
cards;
1 2 3
0 1 0
0 0 0
7 6 .
. 3 0
0 0 .
;
run;
proc format;
value zerof
0='Zero'
.='Missing'
other='Not Zero';
quit;
proc freq data=have;
format _numeric_ zerof.;
tables _numeric_/missing;
run;
Something along those lines. Obviously be careful about _numeric_ as that's all numeric variables and could get messy quickly if you have a lot of them...
I add this as an additional answer. It requires you to have PROC IML.
This uses matrix manipulation to do the count.
(ds=0) -- creates a matrix of 0/1 values (false/true) marking where the values = 0
[+,] -- sums down the rows of each column. With 0/1 values, this is the number of values = 0 in each column.
` -- the transpose operator.
|| -- horizontal concatenation of matrices: {0} || {1} = {0 1}
Then we just print the values.
proc iml;
use dataset;
read all var _num_ into ds[colname=names];
close dataset;
ds2 = ((ds=0)[+,])`;
n = nrow(ds);
ds2 = ds2 || repeat(n,ncol(ds),1);
cnames = {"N = 0", "Count"};
mattrib ds2 rowname=names colname=cnames;
print ds2;
quit;
Easiest to use PROC SQL. You will have to use a UNION to replicate the MEANS output.
Each section of the first FROM counts the 0 values for each variable and UNION stacks them up.
The last section just counts the number of observations in DATASET.
proc sql;
select n0.Variable,
n0.N_0 label="Number 0",
n.count as N
from (
select "A" as Variable,
count(a) as N_0
from dataset
where a=0
UNION
select "B" as Variable,
count(b) as N_0
from dataset
where b=0
UNION
select "C" as Variable,
count(c) as N_0
from dataset
where c=0
) as n0,
(
select count(*) as count
from dataset
) as n;
quit;
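There is an NLEVELS option in PROC FREQ you could use. It reports the number of distinct levels of each variable, and the frequency tables themselves show how often each value, including 0, occurs.
proc freq data=dataset nlevels;
table _numeric_;
run;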
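Another possibility, not taken from the answers above, is to sum a comparison in PROC SQL. A comparison such as a=0 evaluates to 1 when true and 0 otherwise (including when a is missing), so its sum is the number of zeros. This is a minimal sketch using the question's data; the column aliases are made up.

```sas
data dataset;
  input a b c;
  cards;
1 2 3
0 1 0
0 0 0
7 6 .
. 3 0
0 0 .
;
run;

proc sql;
  select sum(a=0)  as a_zero,   /* each comparison yields 1 or 0 */
         sum(b=0)  as b_zero,
         sum(c=0)  as c_zero,
         count(*)  as n
  from dataset;
quit;
```

For this data the result should be 3 zeros for a, 2 for b and 3 for c, out of 6 observations. The result is a single row, so it needs no UNION, but each variable still has to be listed by hand.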

How to find max value of variable for each unique observation in "stacked" dataset

Sorry for the vague title.
My data set looks essentially like this:
ID X
18 1
18 1
18 2
18 1
18 2
369 2
369 3
369 3
361 1
What I want is to find the max value of x for each ID. In this dataset, that would be 2 for ID=18, 3 for ID=369 and 1 for ID=361.
Any feedback would be greatly appreciated.
Proc Means with a class statement (so you don't have to sort) and requesting the max statistic is probably the most straightforward approach (untested):
data sample;
input id x;
datalines;
18 1
18 1
18 2
18 1
18 2
369 2
369 3
369 3
361 1
;
run;
proc means data=sample noprint max nway missing;
class id;
var x;
output out=sample_max (drop=_type_ _freq_) max=;
run;
Check out the online SAS documentation for more details on Proc Means (http://support.sas.com/onlinedoc/913/docMainpage.jsp).
I don't quite understand your example. I can't imagine that the input data set really has all the values in one observation. Do you instead mean something like this?
data sample;
input myid myvalue;
datalines;
18 1
18 1
18 2
18 1
18 2
369 2
369 3
369 3
361 1
;
proc sort data=sample;
by myid myvalue;
run;
data result;
set sample;
by myid;
if last.myid then output;
run;
proc print data=result;
run;
This would give you this result:
Obs myid myvalue
1 18 2
2 361 1
3 369 3
If you want to keep all records together with the max value of X by id, I would use either the PROC MEANS approach followed by a merge, or you can sort the data by ID and DESCENDING X first, and then use the RETAIN statement to create the max value directly in the DATA step:
PROC SORT DATA=A; BY ID DESCENDING X; RUN;
DATA B; SET A;
BY ID;
RETAIN X_MAX;
IF FIRST.ID THEN X_MAX = X;
ELSE X_MAX = X_MAX;
RUN;
You could try this:
PROC SQL;
CREATE TABLE CHCK AS SELECT MYID, MAX(MYVALUE) FROM SAMPLE
GROUP BY 1;
QUIT;
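If all records should be kept alongside the group maximum, PROC SQL can also do that in one query, because SAS remerges the summary value back onto the detail rows (it writes a note about remerging to the log). A minimal sketch, with the table name chck_all and the column name max_value made up for illustration:

```sas
data sample;
  input myid myvalue;
  datalines;
18 1
18 1
18 2
18 1
18 2
369 2
369 3
369 3
361 1
;
run;

proc sql;
  create table chck_all as
  select myid,
         myvalue,
         max(myvalue) as max_value   /* group maximum repeated on every row */
  from sample
  group by myid;
quit;
```

Every row of sample is kept, and max_value should be 2 for ID 18, 1 for ID 361 and 3 for ID 369.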
A couple of more over-engineered options that might be of interest for anyone who needs to do this with a really big dataset, where performance is more of a concern:
If your dataset is already sorted by ID, but not by X within each ID, you can still do this in a single data step without any sorting, using a retained max within each by group. Alternatively, you can use proc means (as per the top answer) but with a by statement rather than a class statement - this reduces the memory usage.
data sample;
input id x;
datalines;
18 1
18 1
18 2
18 1
18 2
369 2
369 3
369 3
361 1
;
run;
data want;
do until(last.ID);
set sample;
by ID;
xmax = max(x, xmax);
end;
x = xmax;
drop xmax;
run;
Even if your dataset is not sorted by ID, you can still do this in one data step, without sorting it, by using a hash object to keep track of the maximum x value you've found for each ID as you go along. This will be a little faster than proc means and will typically use less memory, as proc means does various calculations in the background which are not needed in the output dataset.
data _null_;
set sample end = eof;
if _n_ = 1 then do;
call missing(xmax);
declare hash h(ordered:'a');
rc = h.definekey('ID');
rc = h.definedata('ID','xmax');
rc = h.definedone();
end;
rc = h.find();
if rc = 0 then do;
if x > xmax then do;
xmax = x;
rc = h.replace();
end;
end;
else do;
xmax = x;
rc = h.add();
end;
if eof then rc = h.output(dataset:'want2');
run;
In this example, on my PC, the hash approach used this much memory:
memory 966.15k
OS Memory 27292.00k
vs. this much for an equivalent proc summary:
memory 8706.90k
OS Memory 35760.00k
Not a bad saving if you really need it to scale up!
Use an appropriate proc with the by statement. For instance,
data sample;
input myid myvalue;
datalines;
18 1
18 1
18 2
18 1
18 2
369 2
369 3
369 3
361 1
;
run;
proc sort data=sample;
by myid;
run;
proc means data=sample;
var myvalue;
by myid;
run;
I would just sort by descending x and id, putting the highest value for each ID at the top.
NODUPKEY then removes every duplicate below it, provided the second sort reads the already sorted data.
proc sort data=yourstacked_data out=yourstacked_data_sorted;
by DESCENDING x id;
run;
proc sort data=yourstacked_data_sorted NODUPKEY out=top_value_only;
by id;
run;
A multidata hash should be used if you want the result to show every id at the max value, that is, for cases where more than one id has the max value.
Example code:
Find the ids associated with the max value of 40 different numeric variables.
The code is a PROC DS2 data program.
data have;
call streaminit(123);
do id = 1 to 1e5; %* 100,000 rows;
array v v1-v40; %* 40 different variables;
do over v; v=ceil(rand('uniform', 2e5)); end;
output;
end;
run;
proc ds2;
data _null_;
declare char(32) _name_ ; %* global declarations;
declare double value id;
declare package hash result();
vararray double v[*] v:; %* variable based array, limit yourself to 1,000;
declare double max[1000]; %* temporary array for holding the vars maximum values;
method init();
declare package sqlstmt s('drop table want'); %* DS2 version of `delete`;
s.execute();
result.keys([_name_]); %* instantiate a multidata hash;
result.data([_name_ value id]);
result.multidata();
result.ordered('ascending');
result.defineDone();
end;
method term();
result.output('want'); %* write the results to a table;
end;
method run();
declare int index;
set have;
%* process each variable being examined for 'id at max';
do index = 1 to dim(v);
if v[index] > max[index] then do; %* new maximum for this variable ?;
_name_ = vname(v[index]); %* retrieve variable name;
value = v[index]; %* move value into hash host variable;
if not missing (max[index]) then do;
result.removeall(); %* remove existing multidata items associated with the variable;
end;
result.add(); %* add new multidata item to hash;
max[index] = v[index]; %* track new maximum;
end;
else
if v[index] = max[index] then do; %* two or more ids have same max;
_name_ = vname(v[index]);
value = v[index];
result.add(); %* add id to the multidata item;
end;
end;
end;
enddata;
run;
quit;
%let syslast=want;
Reminder: Proc DS2 defaults are to not overwrite existing tables. To 'overwrite' a table you need to either:
Use table option overwrite=yes when syntax allows
The package hash .output() method does not recognize the table option
Drop the table before recreating it
The above code can be used in a Base SAS DATA step with minor modifications.