How does the DROP statement work behind the scenes? - sas

data _null_ ;
set sashelp.cars ;
markup=invoice+msrp;
drop invoice msrp ;
run;
In the compilation phase of the data step,
the PDV is initialized with all the SAS variables in the input dataset,
then one more column (markup) is added to the PDV,
but then the drop statement drops two columns (invoice, msrp) from the PDV.
How, then, does the execution phase calculate the value of markup, which uses the values of columns that have already been dropped?

The drop statement affects which variables are written out to any datasets produced by the data step, marking those variables to not be written out (keep works the opposite way, marking only the listed variables to be written out and excluding the rest). Neither statement has any effect on the contents of the PDV during execution; it only affects what is sent to the resultant dataset(s).
There are quite a few other variables that are available during data step execution but are not written out. Use put _all_; to see them. Among others are _N_, the first. and last. variables produced by the by statement, temporary array variables, error checking variables, and more.
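A minimal sketch of this behaviour, assuming the sashelp.cars sample library is available; the output dataset name work.cars_markup is made up for illustration. The put _all_ line should show invoice and msrp still holding values in the PDV, while proc contents should list the output dataset without them.

```sas
/* Sketch: DROP only affects the output dataset, not the PDV.
   work.cars_markup is a hypothetical dataset name. */
data work.cars_markup;
  set sashelp.cars(obs=3);   /* read three rows to keep the log short */
  markup = invoice + msrp;   /* invoice and msrp are still in the PDV here */
  drop invoice msrp;         /* only flags them as "do not write out" */
  put _all_;                 /* log lists every PDV variable, incl. _N_ and _ERROR_ */
run;

/* The variable list of the output dataset no longer contains invoice or msrp */
proc contents data=work.cars_markup;
run;
```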

Related

Common Values in Two Tables in SAS (without Proc SQL)

I am learning SAS at the moment and I want to know how to join two tables without using any SQL, keeping only the values common to both tables.
Both tables share a unique id. Apart from that, they have no variables in common.
Please don't give any documentation links as I already have and I know merge. I am trying it with an IN operator.
Table 1 : Screenshot
Table 2 : Screenshot
Description: The first table has 157 records and the other has 161 records.
I tried searching for a solution but didn't find any. Please suggest one.
Thanks !
In DATA Step you will want to use the MERGE statement and the IN= option, which sets up flags indicating 'contribution' to the current state of the program data vector (PDV):
data want;
merge
have1 (in=_from1)
have2 (in=_from2)
;
by uniqueid; * variable of same name, type and length should be in have1 and have2;
if _from1 and _from2; * subsetting if;
run;
A DATA Step is an implicit loop. The MERGE statement automatically advances its reads through the contributing datasets, synchronizing them on the BY variables.
When a DATA Step has no explicit OUTPUT statement, there is an implicit OUTPUT of the values in the PDV when control reaches the bottom of the step. Thus, the if without a then is called a subsetting if: control only goes past the if (and reaches the bottom for the implicit output) when both flags are true, that is, when data is coming from both tables at a common key value.
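As a rough illustration, the same join can be tried on two tiny made-up tables, with the subsetting if written out as the explicit conditional OUTPUT it stands for. The datasets have1 and have2 and their values are invented here; only ids 2 and 3 appear in both, so only those two rows should reach want.

```sas
/* Hypothetical sample data: ids 2 and 3 are common to both tables */
data have1;
  input uniqueid x;
  datalines;
1 10
2 20
3 30
;
run;

data have2;
  input uniqueid y;
  datalines;
2 200
3 300
4 400
;
run;

data want;
  merge have1(in=_from1) have2(in=_from2);
  by uniqueid;
  /* same effect as the subsetting "if _from1 and _from2;" */
  if _from1 and _from2 then output;
run;

proc print data=want;
run;
```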

SAS NOTSORTED Equivalent

I was using the following code to analyze data:
set taq.cq_&yyyymmdd:;
by symbol date time NOTSORTED ex;
There are thousands of datasets, one per day, that I am running the code on. When &yyyymmdd specifies only one dataset (one day, for example 20130102), it works. However, when I try to run it for multiple datasets (for example, 201301:), SAS returns the following error:
BY NOTSORTED/NOBYSORTED cannot be used with SET statement when
more than one data set is specified.
If I cannot use NOTSORTED here, what is an equivalent statement that I could use?
My understanding of the keyword NOTSORTED is that you use it when the data is not sorted yet. Therefore, do I need to sort it first? How to do it?
I am also confused by the number of variables that NOTSORTED is referencing. Does it only have an effect on "time", or does it have an effect on "symbol, date, time"?
Many thanks!
UPDATE#2:
The rest of the process immediately following the set statement is: (pseudo code as i don't have the permission to post the original code)
Data _quotes;
SET STATEMENT HERE
Change the name of a variable in the dataset (Variable name is EXN).
last.EXN in a if statement. If the condition is satisfied, label EXN.
Drop some variables.
Run;
DATA NEWDATASET (sortedby= SYMBOL DATE TIME index=(SYMBOL)
label="WRDS-TAQ NBBO Data");
SET _quotes;
by symbol date time;
....
Run;
NOTSORTED tells SAS that observations with the same BY values are grouped together, but that the groups are not necessarily in ascending or descending order. The data may never have gone through a PROC SORT; SAS simply starts a new BY group whenever the BY values change.
NOTSORTED applies to all variables in the BY statement, wherever it appears in the list. Given your question, I suspect you don't fully understand BY group processing.
It's usually a bit dangerous to use, especially if you don't understand BY group processing. If observations belonging to the same group are not adjacent, they are treated as separate groups and no error is produced. The correct workaround depends on your process, to be honest.
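A small sketch of that pitfall, using invented data: the value N appears in two separate runs, and with NOTSORTED each run is treated as its own BY group without any warning. The dataset names grouped and runs and the counter run_id are made up for illustration.

```sas
/* Hypothetical data: EX is grouped in runs but not sorted,
   and the value N appears in two separate runs */
data grouped;
  input ex $ bid;
  datalines;
N 10
N 11
T 12
T 13
N 14
;
run;

data runs;
  set grouped;
  by ex notsorted;
  /* first.ex is set whenever EX changes from the previous row,
     so the final N row starts a third group */
  if first.ex then run_id + 1;
run;

proc print data=runs;
run;
```

Without NOTSORTED the same BY statement would stop with an error at the row where the data falls out of sort order, which is usually the safer failure mode.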
I would suggest reviewing the documentation regarding BY group processing. It's quite in depth and has lots of samples to illustrate the different type of calculations.
http://support.sas.com/documentation/cdl/en/lrcon/69852/HTML/default/viewer.htm#n138da4gme3zb7n1nifpfhqv7clq.htm
NOTSORTED is often used in example posts to either avoid a sort or when using a custom sort that's difficult to implement in other ways. Explicitly sorting will remove this issue but you may also be misunderstanding how SAS processes data when you have a SET statement with a BY statement. I believe this is called interleaving.
http://support.sas.com/documentation/cdl/en/lrcon/69852/HTML/default/viewer.htm#n1tgk0uanvisvon1r26lc036k0w7.htm
I suspect that the NOTSORTED keyword is being used to find groups of observations with the same value of the EX variable within the same symbol, date, time. If you only need to find the FIRST then you can use the LAG() function to calculate the FIRST.EX flag.
data want;
set taq.cq_&yyyymmdd:;
by symbol date time;
first_ex = first.time or ex ne lag(ex);
run;
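The LAG()-based flag can be tried on a small made-up dataset before pointing it at the real TAQ files. The dataset quotes and the helper variable prev_ex are assumptions for this sketch. Calling LAG() in its own assignment makes it plain that the function runs on every iteration, which it needs in order to return the previous row's value.

```sas
/* Hypothetical quotes, already sorted by symbol and time */
data quotes;
  input symbol $ time ex $;
  datalines;
A 1 N
A 1 N
A 1 T
A 1 N
A 2 N
;
run;

data flagged;
  set quotes;
  by symbol time;
  prev_ex = lag(ex);   /* executed unconditionally on every row */
  /* new EX run: start of a symbol/time group, or EX differs from the prior row */
  first_ex = first.time or (ex ne prev_ex);
run;

proc print data=flagged;
run;
```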
Otherwise, perhaps you want to convert the process to data step views and then set the views together.
data work.view_cq_20130102 / view=work.view_cq_20130102;
set taq.cq_20130102;
by symbol date time ex NOTSORTED;
...
run;
...
data want ;
set work.view_cq_201301: ;
by symbol date time;
...
run;

How does one or more data step OUTPUT statements work and can it be implicit?

When running a data step in SAS, why does the output statement seem to 'stop' the iterating of the set statement?
I need to conditionally output duplicate observations. While I can use a plethora of output statements, I'd like it if SAS did its normal iterating and output just created an additional observation.
1) Does the run statement in SAS have a built in output statement? (The way sum statements have a built in retain)
2) What is happening when I ask SAS to output certain observations - in particular after a set statement? Will it set all the values until a condition and then only keep the values I request? or does it have some kind of similarities with other statements such as the point= statement?
3) Is there a similar statement to output that will continue to set the values from a previous data step and then output an additional observation when requested?
For example:
data test;
do i = 1 to 100;
output;
end;
run;
data test2;
set test;
if _N_ in (4 8 11) then output;
run;
data test3;
set test;
if _N_ in (4 8 11) then output;
output;
run;
test has 100 observations, test2 has 3 observations, and test3 has 103 observations. This makes me think that there is some kind of built-in output statement for either the run statement or the data step itself.
output in SAS is an explicit instruction to write out a row to the output dataset(s) (all of the dataset(s) named in the data statement, unless you specify a single dataset in output).
run, in addition to ending the step (meaning no statements after run are processed until that data step is completed - equivalent to the ending } in a c-style programming language module, basically) contains an implicit return statement.
Unless you are using link or goto, return tells SAS to return to the beginning of the data step loop. In addition, return contains an implicit output statement that outputs rows to all datasets named in the data statement, unless there is an output statement in the data step code - in which case that is not present.
It is return that causes SAS to actually stop processing things after it - not the output. In fact, SAS happily does things after the output statement; they just may not be output anywhere. For example:
data x;
do row = 1 to 100;
output;
row_prev+1;
end;
run;
That row_prev+1 statement is executed, even though it's after the output statement - its presence can be seen on the next row. In your example where you told it to just output three rows, it still processed the other 97 - just nothing was output from them. If any effects had happened from that processing, it would occur - in fact, the incrementing of _n_ is one of those effects (_n_ is not the row number, but the iteration count of data step looping).
You should probably read up on the data step itself. SAS documentation includes a lot of information on that, or you could read papers like The Essence of Data Step Programming. This sort of thing is quite common in SGF papers, in part because SAS certification requires understanding this fairly well.
The best way to understand everything is by reading about the Program Data Vector (PDV). The short answer to your questions:
An output statement is implied at the run boundary of every SAS data step that uses set, merge, update, or nothing at all, provided the step contains no explicit output statement.
The set statement reads the contents of the current row into the PDV, if you have a single set statement
The output statement simply outputs the contents of the PDV at that moment into your output dataset
SAS only goes to a new row in the source dataset defined by your set statement when it reaches a run boundary, delete statement, return statement, or failing the conditions of an if without then statement
point= forces SAS to go directly to an observation number defined by a variable; otherwise, it will read every row sequentially, one by one
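For comparison with sequential reading, here is a minimal sketch of direct access with point=, assuming the sashelp.class sample dataset (19 rows) is available. The names picked and obs_num are made up. Because point= never reaches end-of-file on its own, the explicit stop is what ends the step.

```sas
/* Sketch: read observations 4, 8 and 11 directly by number */
data picked;
  do obs_num = 4, 8, 11;
    set sashelp.class point=obs_num;   /* jump straight to that row */
    output;                            /* write the PDV for this row */
  end;
  stop;   /* required: point= gives SAS no end-of-file to stop on */
run;

proc print data=picked;
run;
```

The point= variable itself is temporary and is not written to the output dataset.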
It's implicit at the end, unless it's used in one or more places in that data step.
Each time the execution encounters an OUTPUT statement, or the implicit one if it exists, it will output a new row.
You are very close.
1) There is an implied OUTPUT at the end of the data step, unless your data step includes an explicit OUTPUT statement. That is why your first step wrote all 100 observations and the second only three.
2) The OUTPUT statement tells SAS to write the current record to the output dataset.
3) There is not a direct way to do what you want to duplicate records without using OUTPUT statements, but for some similar problems you can cause the duplication on the input side instead of the output side.
For example if you felt your class didn't have enough eleven year-olds you could make two copies of all eleven year-olds by reading them twice.
data want;
set sashelp.class
sashelp.class(where=(age=11))
;
by name;
run;

Set in the first row, why do I have two output rows?

I tried this code in SAS but the output isn't the same as I expect.
data temp;
input sumy;
datalines;
36
;
run;
data aaa;
if _n_ = 1 then
set temp;
run;
proc print data = aaa;
run;
May I ask why there are two observations? Has SAS executed "set" twice? How do "set" and the PDV work here during iteration? Thank you in advance.
Because you executed the implied OUTPUT at the end of the data step twice. The first time, with _N_=1, you read one observation from the input dataset. The second time you did not read a new observation, since _N_ now is 2, and the values from the previous observation were retained. After the second observation SAS stops because it has detected that your data step is in a loop.
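If you want only one observation then either add an explicit OUTPUT statement followed by a STOP statement before the RUN statement (STOP on its own ends the step before the implied OUTPUT is reached), or recode the data step to use the OBS=1 dataset option on the input dataset instead of the IF statement.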
Note that if the input data set was empty then you would have output zero observations because the data step would have stopped when the SET statement read past the end of the input dataset.
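Both ways of getting a single observation can be sketched as follows, reusing the temp dataset from the question; the output names aaa_stop and aaa_obs are made up. Note that STOP ends the step without executing the implied OUTPUT, so this sketch writes the row explicitly first.

```sas
data temp;
  input sumy;
  datalines;
36
;
run;

/* Option 1: write the row explicitly, then end the step */
data aaa_stop;
  if _n_ = 1 then set temp;
  output;   /* STOP would otherwise skip the implied OUTPUT */
  stop;
run;

/* Option 2: let the OBS= dataset option limit the read to one row */
data aaa_obs;
  set temp(obs=1);
run;

proc print data=aaa_stop;
run;
proc print data=aaa_obs;
run;
```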
There are two observations because during the second DATA Step iteration no read operation occurred.
The SET statement has two roles.
Compile time role (unconditional) - the data set header is read by the compiler and its variables are added to the step's PDV.
Run time role (conditional) - a row from the data set is read and the values are placed in the PDV each time the code reaches the SET statement.
Additionally, every variable coming from (or corresponding to) a SET statement has its value automatically retained. That is why the second observation created by your sample code has sumy=36.
Additional detail from the SAS support site:
Usage Note 8914: DATA step stopped due to looping message
If a DATA step is written such that no data reading statements (e.g.
SET, INPUT) are executed, the step is terminated after one iteration
and the following message is written to the SAS log:
NOTE: DATA STEP stopped due to looping.
As SAS creates a new data set, it reads one record at a time, saving values from that record in the Program Data Vector (PDV) until values from the next record replace them. SAS continues in this way until it has reached the last record.
You can refer to this link for a better understanding:
http://www.lexjansen.com/nesug/nesug07/cc/cc45.pdf
You can also go through this answer on Stack Overflow:
SAS: Are variables set to missing at every iteration of a data step?

sas collapsing categorical variables clustering analysis

I came across the following code in the logistic regression modeling course offered by SAS:
data dataset(drop=i);
set data;
array mi{*} mi_Ag mi_Inc
mi_WR;
array x{*} Ag Inc WR;
do i=1 to dim(mi);
mi{i}=(x{i}=.);
end;
run;
I need to understand two things:
1.) There is a column titled "i" created once this data step is run. What does it signify and why is it there? The drop=i option drops it, but if I don't use the drop option the column stays in the data set.
2.) this do step is replacing all the missing values with a 1 and rest with 0. How is that happening when nothing is clearly specified in the do step as to what needs to be done. In my eyes, "do i=1 to dim(mi); mi{i}=(x{i}=.);" should simply put dots in mi(i) wherever it finds dots in x(i).
Part 2:
While collapsing the categorical variable, following code has been used:
proc freq data=example1 noprint;
tables CLUSTER_CODE*TARGET_B / chisq;
output out=out_chi(keep=_pchi_) chisq;
run;
data ex_cutoff;
if _n_=1 then set out_chi;
set ex_cluster;
chisquare=_pchi_*rsquared;
degfree=numberofclusters-1;
logpvalue=logsdf('CHISQ',chisquare,degfree);
run;
What is _n_=1 doing? Also, why are we creating chisquare=_pchi_*rsquared? _pchi_ is already the chi-square, so what's the point of multiplying it by R square?
Thanks
P.S. The code is from one of the SAS learning courses. Hopefully I am allowed to share it here for discussion/learning purposes.
i is the array iterator (created in the do loop). It's dropped since it's not really intended to be kept on the dataset, it's just an iterator (letting you go through the array one element at a time and during that iteration letting you reference a single element).
mi{i}=(x{i}=.); is assigning 1/0 like this:
x(i)=. is either true or false. If it is true, it evaluates to 1. If it is false, it evaluates to 0. Thus when it's true that x(i)=. then mi(i) is assigned a 1; otherwise it is assigned a 0. That's just how SAS works with boolean (True/False) values; many other languages work that way as well (True is nonzero, False is zero); and when converted to a number, True is converted to 1 (but any nonzero nonmissing value is 'True' when converted the other way around).
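The comparison-as-number behaviour can be checked on two invented rows; the dataset name flags and the input values are assumptions for this sketch. In the log, each mi_ variable should show 1 where its source variable is missing and 0 elsewhere.

```sas
/* Sketch: a comparison evaluates to 1 (true) or 0 (false) */
data flags(drop=i);
  input Ag Inc WR;
  array mi{*} mi_Ag mi_Inc mi_WR;   /* missing-value indicators */
  array x{*} Ag Inc WR;             /* variables being checked  */
  do i = 1 to dim(mi);
    mi{i} = (x{i} = .);             /* 1 if x{i} is missing, else 0 */
  end;
  put _all_;                        /* i is still in the PDV despite drop=i */
  datalines;
35 . 0.4
. 52000 .
;
run;
```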