How to subset data in SAS

How to subset data in SAS - sas

Variable ParameterEstimate
slope 1
intercept 2.5
slope 2
intercept 5.6
slope 22.2
intercept 9
Suppose my dataset looks something like this, where the variable names are Variable and ParameterEstimate. I want to extract just the ParameterEstimates of the slope. However, I can't think of a simple way to do that. How can I go about getting just the slopes, i.e just 1, 2, and 22.2?

You can use a subsetting where clause like this:
data want;
set have;
where variable = 'slope';
run;
This reads only those observations from data set "have" where the value of variable equals 'slope'

Depending on what you're doing with it next, you can probably do this without a separate datastep.
Say you wanted the mean of the slopes:
proc means data=have(where=(variable='slope'));
var parameterEstimate;
run;
In most cases you can use a where data set option, unless you have reason to create a new dataset (perhaps you're going to use this subset for 50 different steps or somesuch where it's easier to create one than to type that bit out 50 times).

Related

Picking rows in SAS based on a specific value in a specific column

So I'm working with a data set that has millions of rows. I'm trying to cut down the number of rows, so that I can merge this data set and another data set by zipcode.
What I'm trying to do is take a specific column "X6" and search through it for the value of "357". Then every row that has that value I want to move into a new data set.
I'm assuming that I'm going to have to use some form of if/then statement, but I can't get anything to work successfully. If needed I can post a snapshot of some of my data or what SAS code I currently have. I've seen other things that are similar, but none of them involve SAS.
Thanks for all of your help in advanced.

RamB gave a great way to parse into two datasets.
If you just want a new dataset that is a subset of the original, the following will work well
DATA NEW;
SET ORIGINAL;
IF X6="357"; *NOTE: THIS ASSUMES X6 IS DEFINED AS CHARACTER*
RUN;
A nice function can also parse multiple criteria. Say you wanted to keep records where X6 = 357 or 588.
DATA NEW;
SET ORIGINAL;
IF X6 IN("357","588"); *NOTE: THIS ASSUMES X6 IS DEFINED AS CHARACTER*
RUN;
Lastly, the NOTIN also works to exclude.

With data step this is really simple. I'll give you an example.
data dataset_with_357
original_without_357;
set original_dataset;
if compress(x6) = "357" then output dataset_with_357;
else output original_without_357;
run;
As I said, there are several ways of doing this, and it wasn't clear for me which is better for you.

Just use Proc SQL to create your data set, then reference the value your looking for in your query -
Proc SQL;
Create table new as
Select *
From dataset
Where x6 = 357
;
Quit;
Assuming your x6 variable is numeric...
On a mobile device...sorry for no code text

Simplifying a Macro Definition

So I have created a macro, which works perfectly fine. Within the macro, I set where the observation will begin reading, and then how many observations it will read.
But, in my proc print call, I am not able to simply do:
(firstobs=&start obs=&obs)
Because although firstobs will correctly start where I want it, obs does not cooperate, as it must be a higher number than firstobs. For example,
%testmacro(start=5, obs=3)
Does not work, because it is reading in the first 3 observations, but trying to start at observation 5. What I want the macro to do is, start at observation 5, and then read the next 3. So what I did is this:
(firstobs=&start obs=%eval((&obs-1)+&start))
This works perfectly fine when I use it. But I am just wondering if there is a simpler way to do this, rather than having to use the whole %eval... call. Is there one simple call, something like numberofobservations=...?

I don't think there is. You can only simplify your macro a little, within the %eval(). .
%let start=5;
%let obs=3;
data want;
set sashelp.class (firstobs=&start obs=%eval(&obs-1+&start));
run;
Data set options are listed here:
http://support.sas.com/documentation/cdl/en/ledsoptsref/68025/HTML/default/viewer.htm#p0h5nwbig8mobbn1u0dwtdo0c0a0.htm
You could count the obs inside the data step using a counter and only outputting the records desired, but that won't work on something like proc print and isn't efficient for larger data steps.
You could try the point= option, but I'm not familiar with that method, and again I don't think it will work with proc print.

As #Reeza said - there is not a dataset option that will do what you are looking for. You need to calculate the ending observation unfortunately, and %eval() is about as good a way to do it as any.
On a side-note, I would recommend making your macro parameter more flexible. Rather than this:
%testmacro(start=5, obs=3)
Change it to take a single parameter which will be the list of data-set options to apply:
%macro testmacro(iDsOptions);
data want;
set sashelp.class (&iDsOptions);
run;
%mend;
%testmacro(firstobs=3 obs=7);
This provides more flexibility if you need to add in additional options later, which means fewer future code changes, and it's simpler to call the macro. You also defer figuring out the observation counts in this case to the calling program which is a good thing.

Creating a template dataset from multiple datasets with different columns

Currently, I have several sets of business unit data that I'd like to put into a standard template format. Some business unit data contains columns that others don't. I would like to check if certain columns exist and then to create them if they don't. I understand that techniques to achieve similar functionality have been discussed earlier, here and here. However, I was wondering if a better method exists.
My current code is:
data Source_Data4;
set Interm.Source_Data3;
if 0 then do;
a="";
b="";
end;
run;

Using the RETAIN statement should be the fastest and easiest way to do this. If the field you are checking for is numeric then put a . instead of "".
data Source_Data4;
set Interm.Source_Data3;
retain a b "";
run;

If you have multiple datasets with different columns that you want to use a template for, an excellent way to do this is something like this:
data want;
if 0 then set template;
set have2;
run;
This is far easier to code than a bunch of retain/length statements. It accomplishes the identical results as the retain solution (it defines the PDV), with one exception; it will define lengths and formats of variables based on template (while retain does not affect length or format). This may be desirable or may not be, depending on your use case. It is very helpful when combining multiple datasets, as it provides a single point at which length/format differences can be tested for; once this step occurs, you can be confident that your various datasets are all identical in variable length/format.
Creating this dataset can be done a number of ways. One simple way is:
data template;
if 0 then set have;
if 0 then set have2;
stop;
run;
That will create a blank dataset with have1 order followed by any new variables from have2. If that's not desired, you may want to add a RETAIN statement prior to the if 0's that draws from a data dictionary.

Extracting sub-data from a SAS dataset & applying to a different dataset

I have written a macro to use proc univariate to calculate custom quantiles for variables in a dataset (say dsn1) %cust_quants(dsn= , varlist= , quant_list= ). The output is a summary dataset (say dsn2)that looks something like the following:
q_1 q_2.5 q_50 q_80 q_97.5 q_99 var_name
1 2.5 50 80 97.5 99 ex_var_1_100
-2 10 25 150 500 20000 ex_var_pos_skew
-20000 -500 -150 0 10 50 ex_var_neg_skew
What I would like to do is to use the summary dataset to cap/floor extreme values in the original dataset. My idea is to extract the column of interest (say q_99) and put it into a vector of macro-variables (say q_99_1, q_99_2, ..., q_99_n). I can then do something like the following:
/* create summary of dsn1 as above example */
%cust_quants(dsn= dsn1, varlist= ex_var_1_100 ex_var_pos_skew ex_var_neg_skew,
quant_list= 1 2.5 50 80 97.5 99);
/* cap dsn1 var's at 99th percentile */
data dsn1_cap;
set dsn1;
if ex_var_1_100 > &q_99_1 then ex_var_1_100 = &q_99_1;
if ex_var_pos_skew > &q_99_2 then ex_var_pos_skew = &q_99_2;
/* don't cap neg skew */
run;
In R, it is very easy to do this. One can extract sub-data from a data-frame using matrix like indexing and assign this sub-data to an object. This second object can then be referenced later. R example--extracting b from data-frame a:
> a <- as.data.frame(cbind(c(1,2,3), c(4,5,6)))
> print(a)
V1 V2
1 1 4
2 2 5
3 3 6
> a[, 2]
[1] 4 5 6
> b <- a[, 2]
> b[1]
[1] 4
Is it possible to do the same thing in SAS? I want to be able to assign a column(s) of sub-data to a macro variable / array, such that I can then use the macro / array within a 2nd data step. One thought is proc sql into::
proc sql noprint;
select v2 into :v2_macro separated by " "
from a;
run;
However, this creates a single string variable when what I really want is a vector of variables (or array--no vectors in SAS). Another thought is to add %scan (assuming this is inside a macro):
proc sql noprint;
select v2 into :v2_macro separated by " "
from a;
run;
%let i = 1;
%do %until(%scan(&v2_macro, &i) = "");
%let var_&i = %scan(&v2_macro, &i);
%let &i = %eval(&i + 1);
%end;
This seems inefficient and takes a lot of code. It also requires the programmer to remember which var_&i corresponds to each future purpose. Is there a simpler / cleaner way to do this?
**Please let me know in the comments if this is enough background / example. I'm happy to give a more complete description of why I'm doing what I'm attempting if needed.

First off, I assume you are talking about SAS/Base not SAS/IML; SAS/IML is essentially similar to R and has the same kind of operations available in the same manner.
SAS/Base is more similar to a database language than a matrix language (though has some elements of both, and some elements of an OOP language, as well as being a full-featured functional programming language).
As a result, you do things somewhat differently in order to achieve the same goal. Additionally, because of the cost of moving data in a large data table, you are given multiple methods to achieve the same result; you can choose the appropriate method for the required situation.
To begin with, you generally should not store data in a macro variable in the manner you suggest. It is bad programming practice, and it is inefficient (as you have already noticed). SAS Datasets exist to store data; SAS macro variables exist to help simplify your programming tasks and drive the code.
Creating the dataset "b" as above is trivial in Base SAS:
data b;
set a;
keep v2;
run;
That creates a new dataset with the same rows as A, but only the second column. KEEP and DROP allow you to control which columns are in the dataset.
However, there would be very little point in this dataset, unless you were planning on modifying the data; after all, it contains the same information as A, just less. So for example, if you wanted to merge V2 into another dataset, rather than creating b, you could simply use a dataset option with A:
data c;
merge z a(keep=v2);
by id;
run;
(Note: I presuppose an ID variable of some form to combine A and Z.)
This merge combines the v2 column onto z, in a new dataset, c. This is equivalent to vertically concatenating two matrices (although a straight-up concatenation would remove the 'by id;' requirement, in databases you do not typically do that, as order is not guaranteed to be what you expect).
If you plan on using b to do something else, how you create and/or use it depends on that usage. You can create a format, which is a mapping of values [ie, 1='Hello' 2='Goodbye'] and thus allows you to convert one value to another with a single programming statement. You can load it into a hash table. You can transpose it into a row (proc transpose). Supply more detail and a more specific answer can be provided.

Interleaving output from two different procedures with a by value

I have a large SAS dataset and I would like to make a series of tables and charts using by value processing. I am outputing these to a PDF.
Is there any way to get SAS to alternate between the table and the chart as it goes through the data? Right now, I have to print all of the tables first and then print the charts. If it were just 4 tables/charts, then I would be ok writing
Here is a simple example:
data sample;
input byval $ item $ amount;
datalines;
A X 15
A Y 16
A Z 12
B X 25
B Y 10
B Z 18
;
run;
symbol1 i=j;
proc print data=sample;
by byval;
var item amount;
run;
proc gplot uniform data=sample;
by byval;
plot amount*item;
run;
This prints 2 tables, followed by 2 charts.
I would like the Chart for "A" to come after the table for "A" so that the reader can flip through the pdf and always see the associated charts and tables together.
I could write separate procs for each one, but then the gplot won't have a uniform axis (and it gets messy if I have 100 different groups instead of 2).
I thought about pumping them into greplay but then you can't use titles with "#BYVAL1".
Is there any easy way to do this?

I've never used it, but it may be worth checking out ODS DOCUMENT. This allows you to store the output of all your procedures and then reference specific items from them using PROC DOCUMENT.
Below is a link to the SAS website with useful information about this, in particular the paper by Cynthia Zender for the SAS Global Forum 2009.
http://support.sas.com/rnd/base/ods/odsdocument/index.html
Cynthia also regularly contributes to the SAS Support Communities website (https://communities.sas.com/community/support-communities), so it may be worth asking on there if you are still stuck.
Good luck

I don't know of any way to do what you ask directly. GREPLAY is probably the closest you'll come; the primary problem is that SAS processes the PROCs linearly, first processing the entire PROC PRINT, then the entire PROC GPLOT. GREPLAY would allow you to redisplay the output, but if that doesn't work for your needs due to the #BYVAL issue, I'm not sure there's a better solution. Perhaps you can modify the title afterwards (not sure if GREPLAY allows this)?
You could try using ODS LAYOUT, but I don't think that would be any better. The one way it could be better is if you can work out having two columns on a 'page', one column being the PROC PRINT outputs, one the PROC GPLOT, and then print the columns one page than the other. I don't think this is possible, but it might be worth exploring.
You might also try setting up a macro to do each BYVAL separately, defining the axis in a uniform manner manually (ie, defining it based on your own calculation of the correct axis parameters, as an argument to the macro). That is probably the easiest solution that might still allow #BYVAL to work properly.
You might also try browsing about Richard DeVenezia's site (http://www.devenezia.com/downloads/sas/samples/ ) which has a lot of examples of SAS/GRAPH solutions. He also posts on SAS-L (sasl#listserv.uga.edu) sometimes, not sure if I've seen him on StackOverflow. He's probably the person most likely to be able to answer the question that I know of.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js