creating two data sets in one data step in sas - sas

This should be very simple, but somehow I confuse myself.
data in_both
missing_name (drop = name);
merge employee (in=in_employee)
hours (in = in_hours);
by ID;
if in_employee and in_hours then output in_both;
else if in_employee and not in_hours then output missing_name;
run;
I have two questions:
(1): For the first statement "missing_name(drop = name)", I understand that, it means keep all the data except the column whose head is name. But keep which data here? What is the input?
(2): I know we can create two datasets within one data step, but that means we should use "data in_both missing_name", instead of "data in_both", right?
Many thanks for your time and attention. I appreciate your help.

(1) The DROP= option refers to dropping variables from the dataset MISSING_NAME. With no drop= or keep= option, all variables that exist in EMPLOYEE or HOURS would be written to MISSING_NAME. You can run PROC CONTENTS on the four datasets to see which variables are included in each.
(2) As written, your code will output two datasets IN_BOTH and MISSING_NAME. As #Tom just commented, your current DATA statement already lists both datasets, because the semicolon ends the statement, not the white space/carriage return.

The DATA statement is determining which datasets will be created by the data step. The dataset options, like the DROP= option in your example, can we used to control which of the variables are written into those datasets.
It is the OUTPUT statement that is deciding which observations will be written. So in your example your IF/THEN/ELSE logic is determining which output statements to execute.

Using your posted code:
data in_both
missing_name (drop = name);
merge employee (in=in_employee)
hours (in = in_hours);
by ID;
run;
Inputs - merge_employee & hours
Outputs - in_both & missing_name
In this example the output missing_name has the column NAME dropped.
The best way to view what's going on if the line breaks are confusing is to look for the semi-colon. At first glance I got a little confused too!

Related

How to accumulate the records of a newly imported table with the records of another table that I have stored on the servers in SAS?

I am new to SAS and I have the following problem:
When trying to join records I just imported (in one table) with records I have stored in another table.
What happens is that I am going to run the code in SAS daily, and I need the table that I am going to create today (17/05/2021) by importing a file 'X', to join the table that I created yesterday (16/05/2021) by importing a file 'Y'.
And so the code will be executed tomorrow, the next day and so on.
In conclusion the records will accumulate as the days go by.
To tackle this problem, I am first creating two variables, one with the date of the day the code will be executed and the other with the date of the last execution.
%let daily_date = 20210423; /*AAAAMMDD*/
%let last_execution_date = 20210422; /*AAAAMMDD*/
Then the import of a file is done, we can see that the name of this created table has the date of the day in which the code is being executed.
data InputAC.RA_ratings&daily_date;
infile "&ruta_InputRA." FIRSTOBS=2
dsd lrecl=4096 truncover;
input
#1 RA_Customer_ID $10.
#11 Rating_ID 10.
#21 ISRM_Model_Overlay_ID $10.
#31 Constant_ID 10.
#41 Value $100.
;
run;
proc sort data=inputac.RA_ratings&daily_date;
by RA_Customer_ID Rating_ID;
quit;
Finally the union of InputAC.RA_ratings&daily_date with InputAC.RA_ratings&last_execution_date is made. ('InputAC.RA_ratings&last_execution_date' should be the table that was imported at an earlier date than today.)
data InputAC.RA_ratings&fec_diario;
merge
InputAC.RA_ratings&fec_diario
InputAC.RA_ratings&ultima_fecha_de_ejecucion;
by RA_Customer_ID Rating_ID;
run;
This is how the tables are being stored on the server.
(Ignore date 20210413, let's imagine it is 20210422)
However, I have to perform this task without using the variable 'last_execution_date'.
I've been researching but I still can't find any SAS function that can help me with this problem.
I hope someone can help me, thank you very much in advance.
This is a pretty complex and interesting question from an operations point of view. The answer depends on a few things.
How much control do you have over the execution of this process?
Is "yesterday" guaranteed, or does the process need to work if "last execution date" is not yesterday?
What should happen if the process is run twice today?
The best practices way to solve this is to have a dataset (or table) that stores the last execution date. That allows you to handle #2 trivially, and the answer to #3 might guide exactly how you store this but is easily handled anyway.
Say for example you have a table, MetaAC.LastExecDate (or, in spanish, MetaAC.UltimaFecha or similar). It could store things this way:
data LastExecDate;
timestamp = datetime();
execdate = input(&daily_date,yymmdd8.);
run;
proc append base=MetaAC.LastExecDate data=LastExecDate;
run;
This lets you store an arbitrary execdate even if it's not today, and also store when you ran it (for audit purposes), and you could even add who ran it if that's interesting (there is a macro variable &sysuserid or similar). Then put all this at the bottom of your process, and it updates as you go.
Then, you can pull out from this the exact info you want - for example,
proc sql;
select max(execdate)
into :last_exec_date
from MetaAC.LastExecDate
where execdate ne today()
;
quit;
Now, if you don't have control over this for some reason, you could determine this in a different way. Again, the exact process depends on your circumstances and your answers to 2 and 3.
If your answer to 2 is you always want it to be yesterday, then this is really easy - just do this:
%let daily_date=20210517;
%let last_execution_date = %sysfunc(putn(%sysevalf(%sysfunc(inputn(&daily_date,yymmdd8.))-1),yymmddn8.));
%put &=last_execution_date;
The two %sysfuncs just do the input/put from SAS datastep inside the macro language, and %sysevalf lets you do math.
If you don't want it to always be the prior day (if there are weekends, or other days you don't necessarily want to assume it's the prior day), then your best bet is to either use the dictionary tables to look at what's there and find the largest date prior to your date, or maybe use a x command to look at the folder and do the same thing (might be easier to use OS command than to use SQL for this, sometimes SQL dictionary tables can be slow).

using by group processing First. and Last

I just start learning sas and would like some help with understanding the following chunk of code. The following program computes the annual payroll by department.
proc sort data = company.usa out=work.temp;
by dept;
run;
data company.budget(keep=dept payroll);
set work.temp;
by dept;
if wagecat ='S' then yearly = wagrate *12;
else if wagecat = 'H' then yearly = wagerate *2000;
if first.dept then payroll=0;
payroll+yearly;
if last.dept;
run;
Questions:
What does out = work.temp do in the first line of this code?
I understand the data step created 2 temporary variables for each by variable (first.varibale/last.variable) and the values are either 1 or 0, but what does first.dept and last.dept exactly do here in the code?
Why do we need payroll=0 after first.dept in the second to the last line?
This code takes the data for salaries and calculates the payroll amount for each department for a year, assuming salary is the same for all 12 months and that an hourly worker works 2000 hours.
It creates a copy of the data set which is sorted and stored in the work library. RTM.
From the docs
OUT= SAS-data-set
names the output data set. If SAS-data-set does not exist, then PROC SORT creates it.
CAUTION:
Use care when you use PROC SORT without OUT=.
Without the OUT= option, PROC SORT replaces the original data set with the sorted observations when the procedure executes without errors.
Default Without OUT=, PROC SORT overwrites the original data set.
Tips With in-database sorts, the output data set cannot refer to the input table on the DBMS.
You can use data set options with OUT=.
See SAS Data Set Options: Reference
Example Sorting by the Values of Multiple Variables
First.DEPT is an indicator variable that indicates the first observation of a specific BY group. So when you encounter the first record for a department it is identified. Last.DEPT is the last record for that specific department. It means the next record would the first record for a different department.
It sets PAYROLL to 0 at the first of each record. Since you have if last.dept; that means that only the last record for each department is outputted. This code is not intuitive - it's a manual way to sum the wages for people in each department. The common way would be to use a summary procedure, such as MEANS/SUMMARY but I assume they were trying to avoid having two passes of the data. Though if you're not sorting it may be just as fast anyways.
Again, RTM here. The SAS documentation is quite thorough on these beginner topics.
Here's an alternative method that should generate the exact same results but is more intuitive IMO.
data temp;
set company.usa;
if wagecat='S' then factor=12; *salary in months;
else if wagecat='H' then factor=2000; *salary in hours;
run;
proc means data=temp noprint NWAY;
class dept;
var wagerate;
weight factor;
output out=company.budget sum(wagerate)=payroll;
run;

SAS NOTSORTED Equivalent

I was using the following code to analyze data:
set taq.cq_&yyyymmdd:;
by symbol date time NOTSORTED ex;
There are are thousands of datasets I am running the code on in the unit of days. When &yyyymmdd only specifies one dataset (for one day. for example, 20130102), it works. However, when I try to run it for multiple datasets (for example, 201301:), SAS returns the following errors:
BY NOTSORTED/NOBYSORTED cannot be used with SET statement when
more than one data set is specified.
If I cannot use NOTSORTED here, what is an equivalent statement that I could use?
My understanding of the keyword NOTSORTED is that you use it when the data is not sorted yet. Therefore, do I need to sort it first? How to do it?
I am also confused by the number of variables that NOTSORTED is referencing. Does it only have an effect on "time", or it has effect on "symbol, data, time"?
Many thanks!
UPDATE#2:
The rest of the process immediately following the set statement is: (pseudo code as i don't have the permission to post the original code)
Data _quotes;
SET STATEMENT HERE
Change the name of a variable in the dataset (Variable name is EXN).
last.EXN in a if statement. If the condition is satisfied, label EXN.
Drop some variables.
Run;
DATA NEWDATASET (sortedby= SYMBOL DATE TIME index=(SYMBOL)
label="WRDS-TAQ NBBO Data");
SET _quotes;
by symbol date time;
....
Run;
NOTSORTED means that SAS can assume the sort order in the data is correct, so it may not have explicitly gone through a PROC SORT but it is in logical order as listed in the BY statement.
All variables in the BY statement are included in the NOTSORTED option. Given that I suspect you fully don't understand BY group processing.
It's usually a bit dangerous to use, especially if you don't understand BY group processing. If your data is in the same group but not adjacent it won't work properly and will not produce an error. The correct workaround depends on your processes to be honest.
I would suggest reviewing the documentation regarding BY group processing. It's quite in depth and has lots of samples to illustrate the different type of calculations.
http://support.sas.com/documentation/cdl/en/lrcon/69852/HTML/default/viewer.htm#n138da4gme3zb7n1nifpfhqv7clq.htm
NOTSORTED is often used in example posts to either avoid a sort or when using a custom sort that's difficult to implement in other ways. Explicitly sorting will remove this issue but you may also be misunderstanding how SAS processes data when you have a SET statement with a BY statement. I believe this is called interleaving.
http://support.sas.com/documentation/cdl/en/lrcon/69852/HTML/default/viewer.htm#n1tgk0uanvisvon1r26lc036k0w7.htm
I suspect that the NOTSORTED keyword is being using to find groups for observations with the same value for the EX variable within the same symbol,date,time. If you only need to find the FIRST then you can use the LAG() function to calculate the FIRST.EX flag.
data want;
set taq.cq_&yyyymmdd:;
by symbol date time;
first_ex = first.time or ex ne lag(ex);
Otherwise then perhaps you want to convert the process to data step views and then set the views together.
data work.view_cq_20130102 / view=work.view_cq_20130102;
set taq.cq_20130102;
by symbol date time ex NOTSORTED;
...
run;
...
data want ;
set work.view_cq_201301: ;
by symbol date time;
...

How does one or more data step OUTPUT statements work and can it be implicit?

When running a data step in SAS, why does the output statement seem to 'stop' the iterating of the set statement?
I need to conditionally output duplicate observations. While I can use a plethora of output statements, I'd like if SAS did it's normal iterating and output just created an additional observation.
1) Does the run statement in SAS have a built in output statement? (The way sum statements have a built in retain)
2) What is happening when I ask SAS to output certain observations - in particular after a set statement? Will it set all the values until a condition and then only keep the values I request? or does it have some kind of similarities with other statements such as the point= statement?
3) Is there a similar statement to output that will continue to set the values from a previous data step and then output an additional observation when requested?
For example:
data test;
do i = 1 to 100;
output;
end;
run;
data test2;
set test;
if _N_ in (4 8 11) then output;
run;
data test3;
set test;
if _N_ in (4 8 11) then output;
output;
run;
test has 100 observations, test2 has 3 observations, and test3 has 103 observations. This make me think that there is some kind of built in output statement for either the run statement, or the data step itself.
output in SAS is an explicit instruction to write out a row to the output dataset(s) (all of the dataset(s) named in the data statement, unless you specify a single dataset in output).
run, in addition to ending the step (meaning no statements after run are processed until that data step is completed - equivalent to the ending } in a c-style programming language module, basically) contains an implicit return statement.
Unless you are using link or goto, return tells SAS to return to the beginning of the data step loop. In addition, return contains an implicit output statement that outputs rows to all datasets named in the data statement, unless there is an output statement in the data step code - in which case that is not present.
It is return that causes SAS to actually stop processing things after it - not the output. In fact, SAS happily does things after the output statement; they just may not be output anywhere. For example:
data x;
do row = 1 to 100;
output;
row_prev+1;
end;
run;
That row_prev+1 statement is executed, even though it's after the output statement - its presence can be seen on the next row. In your example where you told it to just output three rows, it still processed the other 97 - just nothing was output from them. If any effects had happened from that processing, it would occur - in fact, the incrementing of _n_ is one of those effects (_n_ is not the row number, but the iteration count of data step looping).
You should probably read up on the data step itself. SAS documentation includes a lot of information on that, or you could read papers like The Essence of Data Step Programming. This sort of thing is quite common in SGF papers, in part because SAS certification requires understanding this fairly well.
The best way to understand everything is by reading about the Program Data Vector (PDV). The short answer to your questions:
The output statement is implied at the run boundary of every SAS data step that uses set, merge, update, or (nothing).
The set statement takes the contents of the current row and reads them into the PDV, if you have a single set statement
The output statement simply outputs the contents of the PDV at that moment into your output dataset
SAS only goes to a new row in the source dataset defined by your set statement when it reaches a run boundary, delete statement, return statement, or failing the conditions of an if without then statement
point= forces SAS to go directly to an observation number defined by a variable; otherwise, it will read every row sequentially, one by one
It's implicit at the end, unless it's used in one or more places in that data step.
Each time the execution encounters an OUTPUT statement, or the implicit one if it exists, it will output a new row.
You are very close.
1) There is an implied OUTPUT at the end of the data step, unless your data step includes an explicit OUTPUT statement. That is why your first step wrote all 100 observations and the second only three.
2) The OUTPUT statement tells SAS to write the current record to the output dataset.
3) There is not a direct way to do what you want to duplicate records without using OUTPUT statements, but for some similar problems you can cause the duplication on the input side instead of the output side.
For example if you felt your class didn't have enough eleven year-olds you could make two copies of all eleven year-olds by reading them twice.
data want;
set sashelp.class
sashelp.class(where=(age=11))
;
by name;
run;

What is the difference between concatenate, appending and merge on SAS?

I am trying to run codes on SAS for Concatenate, Appending and Merge and unable to understand the difference between the same. Looking for some one to help me understand the same with examples.
Concatenate and Append are similar, but not used the same way. In SAS, Append is used most commonly to mean concatenation in place. In other words, adding rows to a dataset without reading in the original dataset. This is very efficient, as you skip reading one of the datasets, but it has limitations (largely, you can't interleave or do other data step type things while appending). Append is most often done by PROC APPEND.
Concatenate, on the other hand, while it can mean appending, is usually used when combining the rows from two datasets into a new dataset with all of the rows from each source dataset as separate rows, but not in-place. This would be done with a set statement in a data step, most commonly. This reads both datasets in and writes a new dataset (that could replace one of the original input datasets, or have a new name). Concatenate also is often used to mean combine two string values into a variable; that's actually the most common usage I've heard it in.
Merge is not the same at all, though; it is side-by-side in some fashion, placing the data from one dataset in new variables on the same rows as the data from the other dataset. New rows can be created as part of merge, when one dataset has different key identifier values from the other, but that's usually not the point of the merge (usually!). Merge is done most often in the data step, either with the merge or the update statement.
Concatenate and Merge can also be done in other ways, of course, including SQL.
In a nutshell:
Concatenate: add a dataset on top (or to the bottom) of another one. Look into the SET statement of the DATA Step or the UNION clause of PROC SQL.
Append: Just another word for concatenate. Look into PROC DATASETS / APPEND, but it accomplishes the same task with different means.
Merge: add a dataset to the side (right, generally) of another one. Look into the MERGE statement of the DATA Step and/or the various JOIN's allowed by PROC SQL.
SAS Documentation will show you plenty of examples!
concatenate :it is used to append observations from one data set to another data set,so you specify a list of data set names in the set statement,to concatenate two data set the SAS system must process all observations from both data sets to create a new one.
APPEND:bypasses the observations from original data set and add the new observations directly at the end of the original data set when you have different variables you use the force=option with the append procedure.It function the same way as the append statement in the datasets statement.
and you can only append one data set at a time while in concatenation you can add as many data set in the set statement.
MERGE:you should have a common variable or several variables which taken together uniquely identify each observations,it sequentaly checks observation for each by-value(you have to sort your data sets before you can merge them),then write the combined observation to the new data set