sas collapsing categorical variables clustering analysis - sas

I came across the following code in the logistic regression modeling course offered by SAS:
data dataset(drop=i);
set data;
array mi{*} mi_Ag mi_Inc
mi_WR;
array x{*} Ag Inc WR;
do i=1 to dim(mi);
mi{i}=(x{i}=.);
end;
run;
I need to understand two things:
1.) Once this data step runs, a column named "i" is created. What does it signify, and why is it there? The drop=i option removes it, but if I don't use the drop option the column stays in the data set.
2.) This do loop sets a flag to 1 wherever a value is missing and to 0 otherwise. How does that happen when nothing in the loop states explicitly what needs to be done? In my eyes, "do i=1 to dim(mi); mi{i}=(x{i}=.);" should simply put dots in mi{i} wherever it finds dots in x{i}.
Part 2:
While collapsing the categorical variable, the following code is used:
proc freq data=example1 noprint;
tables CLUSTER_CODE*TARGET_B / chisq;
output out=out_chi(keep=_pchi_) chisq;
run;
data ex_cutoff;
if _n_=1 then set out_chi;
set ex_cluster;
chisquare=_pchi_*rsquared;
degfree=numberofclusters-1;
logpvalue=logsdf('CHISQ',chisquare,degfree);
run;
What is _n_=1 doing? Also, why are we creating chisquare=_pchi_*rsquared? _pchi_ is already the chi-square statistic, so what is the point of multiplying it by R-squared?
Thanks
P.S. The code is from one of the SAS learning courses. Hopefully I am allowed to share it here for discussion/learning purposes.

i is the array iterator (created by the do loop). It is dropped because it is not meant to be kept in the dataset; it is just an index that lets you step through the array one element at a time and reference a single element on each pass.
mi{i}=(x{i}=.); assigns 1 or 0 like this:
x{i}=. is either true or false. If it is true, it evaluates to 1; if it is false, it evaluates to 0. So when x{i} is missing, mi{i} is assigned 1; otherwise it is assigned 0. That is simply how SAS handles boolean (true/false) values, and many other languages work the same way (true is nonzero, false is zero). When a comparison is converted to a number, true becomes 1; going the other way, any nonzero, nonmissing value counts as true.
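As a minimal sketch of that behavior, the step below runs the same loop on two made-up rows and prints the flags to the log. The dataset names demo and flagged and the input values are assumptions for illustration only; the variable names come from the question.

```sas
/* Two invented rows; a lone period is read as a missing numeric value */
data demo;
  input Ag Inc WR;
  datalines;
35 . 0.4
. 52000 .
;

data flagged(drop=i);
  set demo;
  array mi{*} mi_Ag mi_Inc mi_WR;
  array x{*} Ag Inc WR;
  do i = 1 to dim(mi);
    /* the comparison yields 1 when x{i} is missing, 0 otherwise */
    mi{i} = (x{i} = .);
  end;
  put Ag= Inc= WR= mi_Ag= mi_Inc= mi_WR=;
run;
```

For the first row the log shows mi_Ag=0 mi_Inc=1 mi_WR=0, and for the second mi_Ag=1 mi_Inc=0 mi_WR=1. The missing() function should give the same flags if you prefer a function call to a comparison.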

Related

How does one or more data step OUTPUT statements work and can it be implicit?

When running a data step in SAS, why does the output statement seem to 'stop' the iterating of the set statement?
I need to conditionally output duplicate observations. While I can use a plethora of output statements, I'd like it if SAS did its normal iterating and output just created an additional observation.
1) Does the run statement in SAS have a built in output statement? (The way sum statements have a built in retain)
2) What is happening when I ask SAS to output certain observations - in particular after a set statement? Will it set all the values until a condition and then only keep the values I request? or does it have some kind of similarities with other statements such as the point= statement?
3) Is there a similar statement to output that will continue to set the values from a previous data step and then output an additional observation when requested?
For example:
data test;
do i = 1 to 100;
output;
end;
run;
data test2;
set test;
if _N_ in (4 8 11) then output;
run;
data test3;
set test;
if _N_ in (4 8 11) then output;
output;
run;
test has 100 observations, test2 has 3 observations, and test3 has 103 observations. This makes me think that there is some kind of built-in output statement for either the run statement or the data step itself.
output in SAS is an explicit instruction to write out a row to the output dataset(s) (all of the dataset(s) named in the data statement, unless you specify a single dataset in output).
run, in addition to ending the step (meaning no statements after run are processed until that data step is completed - equivalent to the ending } in a c-style programming language module, basically) contains an implicit return statement.
Unless you are using link or goto, return tells SAS to return to the beginning of the data step loop. In addition, return contains an implicit output statement that outputs rows to all datasets named in the data statement, unless there is an output statement in the data step code - in which case that is not present.
It is return that causes SAS to actually stop processing things after it - not the output. In fact, SAS happily does things after the output statement; they just may not be output anywhere. For example:
data x;
do row = 1 to 100;
output;
row_prev+1;
end;
run;
That row_prev+1 statement is executed, even though it's after the output statement - its presence can be seen on the next row. In your example where you told it to just output three rows, it still processed the other 97 - just nothing was output from them. If any effects had happened from that processing, it would occur - in fact, the incrementing of _n_ is one of those effects (_n_ is not the row number, but the iteration count of data step looping).
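A small sketch of that last point, reusing the test dataset from the question: the end= variable name last and the message text are my own choices, not part of the original code.

```sas
data test;
  do i = 1 to 100;
    output;
  end;
run;

data test2;
  set test end=last;
  if _n_ in (4 8 11) then output;
  /* Runs on every iteration, whether or not a row was written */
  if last then put 'Iterations processed: ' _n_;
run;
```

The log reports 100 iterations even though test2 ends up with only three observations.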
You should probably read up on the data step itself. SAS documentation includes a lot of information on that, or you could read papers like The Essence of Data Step Programming. This sort of thing is quite common in SGF papers, in part because SAS certification requires understanding this fairly well.
The best way to understand everything is by reading about the Program Data Vector (PDV). The short answer to your questions:
The output statement is implied at the run boundary of every SAS data step that uses set, merge, update, or no input statement at all.
The set statement reads the contents of the current row into the PDV (assuming a single set statement).
The output statement simply writes the contents of the PDV at that moment to your output dataset.
SAS only moves to a new row in the source dataset named in your set statement after it reaches a run boundary, a delete statement, a return statement, or a subsetting if (an if without then) whose condition fails.
point= forces SAS to go directly to the observation number held in a variable; otherwise, it reads every row sequentially, one by one.
It's implicit at the end, unless it's used in one or more places in that data step.
Each time the execution encounters an OUTPUT statement, or the implicit one if it exists, it will output a new row.
You are very close.
1) There is an implied OUTPUT at the end of the data step, unless your data step includes an explicit OUTPUT statement. That is why test2, whose only OUTPUT is conditional, wrote just three observations, while test3 wrote all 100 plus three duplicates.
2) The OUTPUT statement tells SAS to write the current record to the output dataset.
3) There is not a direct way to do what you want to duplicate records without using OUTPUT statements, but for some similar problems you can cause the duplication on the input side instead of the output side.
For example, if you felt your class didn't have enough eleven-year-olds, you could make two copies of all eleven-year-olds by reading them twice.
data want;
set sashelp.class
sashelp.class(where=(age=11))
;
by name;
run;

Naming variable using _n_, a column for each iteration of a datastep

I need to declare a variable for each iteration of a data step (for each _n_), but when I run the code, SAS keeps only the last variable declared, the one for the greatest _n_.
It may seem silly to declare a variable for each row, but I need this result: I'm working on a dataset created by proc freq, and I need a column for each group (each row of the dataset).
The result will go in a macro, so it has to be completely flexible.
proc freq data=&data noprint ;
table &group / out=frgroup;
run;
data group1;
set group (keep=&group count ) end=eof;
call symput('gr', _n_);
*REQUESTED code will go here;
run;
I tried these:
var&gr.=.;
call missing(var&gr.);
and a lot of other statements, but none worked.
The result is always the same: the dataset includes only var&gr, where &gr is the maximum _n_.
It seems that the PDV is overwriting the new variable on each iteration, even though the name is different.
Please keep the solution in a single data step, or at least make the code run as quickly as possible.
Any idea how I can achieve the requested result?
Thanks.
Macro variables don't work the way you think they do. Every macro variable reference in a data step is resolved at compile time, so your call symput changes the value of the macro variable only after all the references have already been resolved. You get a variable named for the maximum _n_ because that is the value &gr was left with the last time you ran the code.
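The timing can be seen in a short sketch. The starting value 0, the value 42, and the variable name from_symget are arbitrary choices for illustration.

```sas
%let gr = 0;

data _null_;
  call symputx('gr', 42);
  /* &gr was already replaced with 0 when the step was compiled */
  put "Resolved at compile time: &gr";
  /* symget looks the value up while the step is running */
  from_symget = symget('gr');
  put 'Read at execution time: ' from_symget;
run;

%put After the step: &gr;
```

The first put writes 0, the second writes 42, and the %put after the step also shows 42, because by then the step has finished and the new value is in place.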
If you know you can determine the maximum _n_, you can put the max value into a macro variable and declare an array like so:
Find max _n_ and assign value to maxn:
data _null_;
set have end=eof;
if eof then call symputx('maxn',_n_); /* symputx strips the leading blanks that symput would keep */
run;
Create variables:
data want;
set have;
array var (&maxn);
run;
If you don't like proc transpose (if you need 3 columns you can always use it once for every column and then put together the outputs) what you ask can be done with arrays.
First you need to determine the number of groups (i.e. rows) in the input dataset, and then define an array with a dimension equal to that number.
Then the i-th element of your array can be recalled using _n_ as index.
In the following code &gr. contains the number of groups:
data group1;
set group;
array arr_counts(&gr.) var1-var&gr.;
arr_counts(_n_)= count;
run;
In SAS there are several methods to determine the number of observations in a dataset; my favorite is the following (it doesn't work with views):
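data _null_;
if 0 then set group nobs=n; /* never executes; nobs= is filled in at compile time */
call symputx('gr',n);
stop; /* end the step after one pass, since no row is ever read */
run;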
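For comparison, the proc transpose route mentioned at the start of this answer could look roughly like the sketch below. The contents of group are invented stand-ins for the proc freq output, and the output name wide is an assumption.

```sas
/* Stand-in for the proc freq output; names and values are made up */
data group;
  input grp $ count;
  datalines;
A 12
B 7
C 30
;

/* One output row, with var1, var2, ... holding the count from each input row */
proc transpose data=group out=wide prefix=var;
  var count;
run;

proc print data=wide;
run;
```

Unlike the array version, this puts all the counts on a single row and does not need the number of groups in advance.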

Is sorting more favorable (efficient) in if-else statement?

Assume two functions fun1, fun2 have been defined to carry out some calculation given input x.
The structure of the dataset have is:
Day Group x
01Jul14 A 1.5
02Jul14 B 2.7
I want to do something like this:
data want;
set have;
if Group = 'A' then y = fun1(x);
if Group = 'B' then y = fun2(x);
run;
Is it better to run proc sort data=have;by Group;run; first and then move on to the data step? Or does it not matter, because each time it just picks one observation and determines which if statement applies?
So long as you are not doing anything to alter the normal input of observations - such as using random access (point=), building a hash table, using a by statement, etc. - sorting will have no impact: you read each row regardless of the if statement, check both lines, execute one of them. Nothing different occurs sorted or unsorted.
This is easy to test. Write something like this:
%put Before Unsorted Time: %sysfunc(time(),time8.);
***your datastep here***;
%put After Unsorted Time: %sysfunc(time(),time8.);
proc sort data=your_dataset;
by x;
run;
%put Before Sorted Time: %sysfunc(time(),time8.);
***your datastep here***;
%put After Sorted Time: %sysfunc(time(),time8.);
Or just run your datasteps and look at the execution time!
You may be confusing this with sorting your if statements (ie, changing the order of them in the code). That could have an impact, if your data is skewed and you use else. That's because SAS won't have to evaluate further downstream conditionals. It's not very common for this to have any sort of impact - it only matters when you have extremely skewed data, large numbers of observations, and certain other conditions based on your code - so I wouldn't program for it.
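Here is a sketch of what using else looks like for the question's data step. The bodies of fun1 and fun2 are placeholders I made up with proc fcmp so the example runs on its own; the real functions are whatever the asker has defined.

```sas
/* Placeholder functions; the real fun1 and fun2 are not shown in the question */
proc fcmp outlib=work.funcs.demo;
  function fun1(x);
    return (x * 2);
  endsub;
  function fun2(x);
    return (x + 10);
  endsub;
run;
options cmplib=work.funcs;

data have;
  input Day :date7. Group $ x;
  format Day date7.;
  datalines;
01Jul14 A 1.5
02Jul14 B 2.7
;

data want;
  set have;
  if Group = 'A' then y = fun1(x);
  /* skipped entirely whenever the first condition is true */
  else if Group = 'B' then y = fun2(x);
run;
```

With else, a row in group A never reaches the second comparison. Putting the most frequent group first is the only ordering that could matter, and as noted above it rarely does.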

How does the drop statement work behind the scenes?

data _null_ ;
set sashelp.cars ;
markup=invoice+msrp;
drop invoice msrp ;
run;
In the compilation phase of the data step:
the PDV is initialized with all the SAS variables in the input dataset
then one more column (markup) is added to the PDV
but then the drop statement drops two columns (invoice, msrp) from the PDV
So how, in the execution phase, can it calculate the value of markup, which uses the values of columns it has already dropped?
The drop statement affects what variables are written out to any datasets produced by it, marking those variables to not be written out (similarly, keep affects in the opposite way - marking only those variables listed to be written out, and not the remaining variables). Neither operation has any effect on the contents of the PDV during operation; it only affects what is sent to the resultant dataset(s).
There are quite a few other variables that are available during data step execution but are not written out. Use put _all_; to see them. Among others are _N_, the first. and last. variables produced by the by statement, temporary array variables, error checking variables, and more.
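A quick way to check this is to write the dropped variables to the log while the step runs. The sketch below keeps the question's calculation but writes to a dataset I have named markup_only (an assumption) and reads a single row to keep the log short.

```sas
data markup_only;
  set sashelp.cars(obs=1);
  markup = invoice + msrp;
  drop invoice msrp;
  /* Both dropped variables still hold values in the PDV at this point */
  put invoice= msrp= markup=;
run;

/* The stored dataset has markup but neither invoice nor msrp */
proc contents data=markup_only;
run;
```

The put statement shows nonmissing values for invoice and msrp even though it comes after the drop statement, while proc contents confirms they were not written out.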

SAS proc Freq & gchart display additional value's frequency/ bars

This might be a weird question. I have a data set that contains responses such as agree, neutral, disagree... for many questions. There are not many observations, so for some questions one or more options has a frequency of 0, say neutral. When I run proc freq, since neutral never shows up in that variable, the table does not contain a row for neutral, and I end up with tables that have different numbers of rows. I would like to know if there is an option to show these zero-frequency rows. I will also need to run proc gchart on the same data set, and I will run into the same problem of having different numbers of bars. Please help me with this. Thank you!
This depends on how exactly you are running your PROC FREQ. It has the sparse option, which tells it to create a value for every logical cell on the table when creating an output dataset; normally, while you would have a cell with a missing value (or zero) in a crosstab, if that is output to a dataset (which is vertical, ie each combination of x and y axis value are placed in one row) those rows are left off. Sparse makes sure that doesn't happen; and in a larger (n-dimensional) crosstab, it creates rows for every possible combination of every variable, even ones that don't occur in the data.
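As a rough sketch of the sparse option on a two-way table, assuming the data are stored long with one row per question and answer (the variable names question and answer and the values are invented):

```sas
data mydata;
  input question $ answer :$8.;
  datalines;
q1 agree
q1 disagree
q2 agree
q2 neutral
;

/* sparse adds the combinations that never occur, with a count of 0 */
proc freq data=mydata noprint;
  tables question*answer / sparse out=counts;
run;

proc print data=counts;
run;
```

The counts dataset has six rows, including q1 with neutral and q2 with disagree at a count of 0. This only works because each answer appears somewhere in the data.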
However, if you're just doing
proc freq data=mydata;
tables myvar;
run;
That won't help you, as SAS doesn't really have anything to go on to figure out what should be there.
For that, you have to use a class variable procedure. Proc Tabulate is one such procedure, and is similar to Proc Freq in its syntax (sort of). You need to use either CLASSDATA on the proc statement or PRINTMISS on the table statement. In the former case, you do not need to use a format, I don't believe. In the latter case (PRINTMISS), you need to create a format for your variable (if you don't already have one) that contains all levels of the data that you want to display (even if it's just an identity format, e.g. formatting character strings to identical character strings), and specify PRELOADFMT on the class statement. See this man page for more details.
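A minimal sketch of the PRINTMISS route, assuming a character variable myvar and an identity format that I have named $resp (both names and the data are made up):

```sas
/* Identity format listing every level that should appear */
proc format;
  value $resp
    'agree'    = 'agree'
    'neutral'  = 'neutral'
    'disagree' = 'disagree';
run;

/* No neutral responses in this sample */
data mydata;
  input myvar :$8.;
  datalines;
agree
agree
disagree
;

proc tabulate data=mydata;
  class myvar / preloadfmt;
  format myvar $resp.;
  /* printmiss keeps the empty level; misstext shows it as 0 */
  table myvar, n / printmiss misstext='0';
run;
```

The table should then include a neutral row even though no observation has that value. MISSTEXT= is an extra touch of mine so the empty cell displays as 0 rather than a missing value.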