SAS sortedby option - sas

I am reading some code that involves a sortedby option which confuses me a little bit.
So in the first part of the code, the following code was used to readin the data:
data _quotes / view=_quotes;
set taq.&yyyymmdd:;
by symbol date time NOTSORTED ex; length EXN 3.;
run;
Note that there is a NOTSORTED option here. when I remove it, SAS would return an error ERROR: BY variables are not properly sorted on data set.
Based on my understanding of how SAS NOTSORTED works, the taq dataset is not properly sorted, but in proper groups.
However, in the following code almost immediately after the previous code (no sort code is involved in between), there in no more NOTSORTED option, but there is no error:
data &outset (sortedby= SYMBOL DATE TIME index=(SYMBOL)
label="WRDS-TAQ NBBO Data");
set _quotes;
by symbol date time;
run;
Therefore, I was wondering if it is because of the sortedby statement that made the difference? I read the SAS documentation, it seems that sortedby will NOT sort the dataset but only specify how the data is currently sorted.
But why did the by statement without the NOTSORTED option work in the second code but not the first code?

Note that the variable ex is not on the BY statement on the second step. If the source data is sorted by symbol date time but is not sorted by ex, what you are observing makes sense.
It's unusual to put the notsorted option in the middle of a list of variables. No matter where it is placed, it applies to the entire list. In this case, it's possible the author was intending to suggest to the reader which variable was not sorted. I find this style confusing.
To check, add ex to the BY statement on the second step and see if it throws an error.

Related

Common Values in Two Tables in SAS (without Proc SQL)

I am learning SAS at the moment and I wanted to know how do you join two tables without using any SQL where I need to get only the common values in two tables.
Both tables have a common unique id. Also the tables don't have common variables.
Please don't give any documentation links as I already have and I know merge. I am trying it with an IN operator.
Table 1 : Screenshot
Table 2 : Screenshot
Description: The first table has 157 records and the other has 161 records.
I tried searching a solution but didn't get any. Please refer a solution.
Thanks !
In DATA Step you will want to use the MERGE statement and the IN= option which sets up flags indicating 'contribution' to the current state of the program data vector (PDV)
data want;
merge
have1 (in=_from1)
have2 (in=_from2)
;
by uniqueid; * variable of same name, type and length should be in have1 and have2;
if _from1 and _from2; * subsetting if;
run;
DATA Step is an implicit loop. The MERGE automatically advances reads through the contributing data, synchronizing about the BY variables.
When a DATA Step has no explicit OUTPUT statement, there will be an implicit OUPUT of the values in the PDV when control reaches the bottom of the step. Thus, the if without a then is called subsetting because control only goes past the if (and reaches the bottom for implicit output) when both flags are true (or when data is coming from both tables at a common key value)

How do I get SAS to ignore missing variable names?

If a SAS DATA step references a non-existant variable in a DROP, KEEP, or RENAME statement, it returns an error saying such and stops the DATA step due to this error.
How do I get SAS to keep going with the step when it references a non-existent variable? I assume there's an OPTION for this (?) but I can't figure out what it's called if this is the case.
(I'm dealing with yearly datasets for which variables occasionally get added or deleted from year to year.)
Try using:
options dkricond=nowarn dkrocond=nowarn;
First one is for input datasets, second one is for output datasets.
You might want to set these back to warn or error after you are done with the specific data steps where you know this will be an issue.
SAS Manual page

SAS NOTSORTED Equivalent

I was using the following code to analyze data:
set taq.cq_&yyyymmdd:;
by symbol date time NOTSORTED ex;
There are are thousands of datasets I am running the code on in the unit of days. When &yyyymmdd only specifies one dataset (for one day. for example, 20130102), it works. However, when I try to run it for multiple datasets (for example, 201301:), SAS returns the following errors:
BY NOTSORTED/NOBYSORTED cannot be used with SET statement when
more than one data set is specified.
If I cannot use NOTSORTED here, what is an equivalent statement that I could use?
My understanding of the keyword NOTSORTED is that you use it when the data is not sorted yet. Therefore, do I need to sort it first? How to do it?
I am also confused by the number of variables that NOTSORTED is referencing. Does it only have an effect on "time", or it has effect on "symbol, data, time"?
Many thanks!
UPDATE#2:
The rest of the process immediately following the set statement is: (pseudo code as i don't have the permission to post the original code)
Data _quotes;
SET STATEMENT HERE
Change the name of a variable in the dataset (Variable name is EXN).
last.EXN in a if statement. If the condition is satisfied, label EXN.
Drop some variables.
Run;
DATA NEWDATASET (sortedby= SYMBOL DATE TIME index=(SYMBOL)
label="WRDS-TAQ NBBO Data");
SET _quotes;
by symbol date time;
....
Run;
NOTSORTED means that SAS can assume the sort order in the data is correct, so it may not have explicitly gone through a PROC SORT but it is in logical order as listed in the BY statement.
All variables in the BY statement are included in the NOTSORTED option. Given that I suspect you fully don't understand BY group processing.
It's usually a bit dangerous to use, especially if you don't understand BY group processing. If your data is in the same group but not adjacent it won't work properly and will not produce an error. The correct workaround depends on your processes to be honest.
I would suggest reviewing the documentation regarding BY group processing. It's quite in depth and has lots of samples to illustrate the different type of calculations.
http://support.sas.com/documentation/cdl/en/lrcon/69852/HTML/default/viewer.htm#n138da4gme3zb7n1nifpfhqv7clq.htm
NOTSORTED is often used in example posts to either avoid a sort or when using a custom sort that's difficult to implement in other ways. Explicitly sorting will remove this issue but you may also be misunderstanding how SAS processes data when you have a SET statement with a BY statement. I believe this is called interleaving.
http://support.sas.com/documentation/cdl/en/lrcon/69852/HTML/default/viewer.htm#n1tgk0uanvisvon1r26lc036k0w7.htm
I suspect that the NOTSORTED keyword is being using to find groups for observations with the same value for the EX variable within the same symbol,date,time. If you only need to find the FIRST then you can use the LAG() function to calculate the FIRST.EX flag.
data want;
set taq.cq_&yyyymmdd:;
by symbol date time;
first_ex = first.time or ex ne lag(ex);
Otherwise then perhaps you want to convert the process to data step views and then set the views together.
data work.view_cq_20130102 / view=work.view_cq_20130102;
set taq.cq_20130102;
by symbol date time ex NOTSORTED;
...
run;
...
data want ;
set work.view_cq_201301: ;
by symbol date time;
...

Simplifying a Macro Definition

So I have created a macro, which works perfectly fine. Within the macro, I set where the observation will begin reading, and then how many observations it will read.
But, in my proc print call, I am not able to simply do:
(firstobs=&start obs=&obs)
Because although firstobs will correctly start where I want it, obs does not cooperate, as it must be a higher number than firstobs. For example,
%testmacro(start=5, obs=3)
Does not work, because it is reading in the first 3 observations, but trying to start at observation 5. What I want the macro to do is, start at observation 5, and then read the next 3. So what I did is this:
(firstobs=&start obs=%eval((&obs-1)+&start))
This works perfectly fine when I use it. But I am just wondering if there is a simpler way to do this, rather than having to use the whole %eval... call. Is there one simple call, something like numberofobservations=...?
I don't think there is. You can only simplify your macro a little, within the %eval(). .
%let start=5;
%let obs=3;
data want;
set sashelp.class (firstobs=&start obs=%eval(&obs-1+&start));
run;
Data set options are listed here:
http://support.sas.com/documentation/cdl/en/ledsoptsref/68025/HTML/default/viewer.htm#p0h5nwbig8mobbn1u0dwtdo0c0a0.htm
You could count the obs inside the data step using a counter and only outputting the records desired, but that won't work on something like proc print and isn't efficient for larger data steps.
You could try the point= option, but I'm not familiar with that method, and again I don't think it will work with proc print.
As #Reeza said - there is not a dataset option that will do what you are looking for. You need to calculate the ending observation unfortunately, and %eval() is about as good a way to do it as any.
On a side-note, I would recommend making your macro parameter more flexible. Rather than this:
%testmacro(start=5, obs=3)
Change it to take a single parameter which will be the list of data-set options to apply:
%macro testmacro(iDsOptions);
data want;
set sashelp.class (&iDsOptions);
run;
%mend;
%testmacro(firstobs=3 obs=7);
This provides more flexibility if you need to add in additional options later, which means fewer future code changes, and it's simpler to call the macro. You also defer figuring out the observation counts in this case to the calling program which is a good thing.

creating two data sets in one data step in sas

This should be very simple, but somehow I confuse myself.
data in_both
missing_name (drop = name);
merge employee (in=in_employee)
hours (in = in_hours);
by ID;
if in_employee and in_hours then output in_both;
else if in_employee and not in_hours then output missing_name;
run;
I have two questions:
(1): For the first statement "missing_name(drop = name)", I understand that, it means keep all the data except the column whose head is name. But keep which data here? What is the input?
(2): I know we can create two datasets within one data step, but that means we should use "data in_both missing_name", instead of "data in_both", right?
Many thanks for your time and attention. I appreciate your help.
(1) The DROP= option refers to dropping variables from the dataset MISSING_NAME. With no drop= or keep= option, all variables that exist in EMPLOYEE or HOURS would be written to MISSING_NAME. You can run PROC CONTENTS on the four datasets to see which variables are included in each.
(2) As written, your code will output two datasets IN_BOTH and MISSING_NAME. As #Tom just commented, your current DATA statement already lists both datasets, because the semicolon ends the statement, not the white space/carriage return.
The DATA statement is determining which datasets will be created by the data step. The dataset options, like the DROP= option in your example, can we used to control which of the variables are written into those datasets.
It is the OUTPUT statement that is deciding which observations will be written. So in your example your IF/THEN/ELSE logic is determining which output statements to execute.
Using your posted code:
data in_both
missing_name (drop = name);
merge employee (in=in_employee)
hours (in = in_hours);
by ID;
run;
Inputs - merge_employee & hours
Outputs - in_both & missing_name
In this example the output missing_name has the column NAME dropped.
The best way to view what's going on if the line breaks are confusing is to look for the semi-colon. At first glance I got a little confused too!