CREATING SUBSETS IN SAS "automatically" [duplicate] - sas

I have generated a dataset in SAS consisting of 144 rows, and now I want to create 18 subsets of it (each containing 8 rows). I know how to create subsets "manually" using the firstobs= and obs= options. However, I want the subsets to be created "automatically", because I am extracting data from a particular website and the dataset is a different size each time I run the code. All I know is that each time I generate a dataset I want to create X subsets of 8 rows each (e.g. the first subset will consist of rows 1-8, the second of rows 9-16, and so on).
So my question is: how do I go about attacking this problem?

I don't know what the purpose of your subsets is, but it's generally a bad idea to split a dataset into multiple versions. The better approach is to create a variable in the original dataset that holds the subset number; that way it is easy to extract particular subsets or group by them later on.
Here's an easy way to do that. Note that the final subset will not have 8 rows if the number of records is not divisible by 8. I assumed from your question that 8 rows is a fixed amount regardless of the number of records.
data want;
set sashelp.citimon;
subset = ceil(_n_/8);
run;
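To illustrate why a subset variable is usually enough, here is a minimal sketch of how it could be used afterwards. The choice of subset 3 and the use of PROC PRINT and PROC MEANS are only illustrative assumptions; any procedure that accepts WHERE or BY would work the same way.

```sas
/* Tag each row with its subset number (same step as above). */
data want;
set sashelp.citimon;
subset = ceil(_n_/8);
run;

/* Extract a single subset on demand; subset 3 is an arbitrary example. */
proc print data=want;
where subset = 3;
run;

/* Or process every subset in one pass with BY-group processing. */
/* The rows are already in subset order because subset is derived from _n_. */
proc means data=want n mean;
by subset;
run;
```

With this layout no extra datasets are created, and the number of subsets adjusts itself to however many rows the source has.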

Macro approach:
This is just rough code; optimize it according to your requirements.
proc sql;
select count(*) into :total from source_data;
quit;
%macro create_subsets(count,ds);
%let cnt=%sysfunc(ceil(%sysevalf(&count/8)));
%let num=1;
%do i=1 %to &cnt;
%if(&i = &cnt) %then %do;
%let toread=&count;
%end;
%else %do;
%let toread=&num+7;
%end;
data &ds._&i;
set &ds(firstobs=&num obs=%eval(&toread));
run;
%let num=%eval(&num + 8);
%end;
%mend create_subsets;
%create_subsets(&total,source_data)
Edited based on the valuable comment from Robert.
Note: A non-macro approach will always be easier and more efficient.

Related

How to skip code if created dataset has zero rows

I have a job which at first imports some xlsx files, then connects to multiple DB tables. Based on conditions, the job selects rows to output, and creates an excel file to send on to the final end-user.
Sometimes, that job returns zero rows, which is acceptable; in that case, I would prefer to create an empty excel file with only the variables, but not run the other code (checking/cleaning code).
How can I conditionally execute code only when there are results?
Something like this:
I get 0 rows
If Result = 0 then Go to *"here"*
Else *"just run the code further"*
You have a few useful things that can help you here.
First off, PROC SQL sets a macro variable SQLOBS, which is particularly useful in identifying how many records were returned from the last SQL query it ran.
proc sql;
select * from sashelp.class;
quit;
%put I returned &SQLOBS rows;
You might use this to drive further processing, either with %IF blocks as Tom notes in comments or other methods I will cover below.
You can also check how many rows are in a dataset explicitly, if you prefer a slightly more robust option.
proc sql;
select count(*) into :class_count from sashelp.class;
quit;
%put I returned &class_count rows;
For very large datasets, there are faster options (using the dataset descriptors, dictionary tables, or a few other options), but for most tables this is fine.
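As a rough sketch of the "dataset descriptor" option mentioned above, the row count can be read from the dataset header with the OPEN, ATTRN and CLOSE functions, without scanning the data. The macro variable names dsid, class_count and rc are my own choices. NLOBS can come back as -1 for views and some external engines, so treat this as an assumption that holds for ordinary SAS datasets.

```sas
/* Open the dataset and read the logical observation count from its header. */
%let dsid = %sysfunc(open(sashelp.class));
%let class_count = %sysfunc(attrn(&dsid, nlobs));
/* Always close the dataset identifier again. */
%let rc = %sysfunc(close(&dsid));
%put sashelp.class has &class_count rows;
```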
Either way, what I would typically do with a program I intended to run in production would be then to drive the rest of the program from macros.
%macro whatIWantToDo(params);
...
do stuff
...
%mend whatIWantToDo;
proc sql;
mySqlStuff;
quit;
%if &sqlobs. gt 0 %then %do;
%whatIWantToDo(params);
%end;
%else %do;
%put Nothing to do;
%end;
Another option is to use call execute; this is appropriate if your data drives the macro parameters. The big advantage of call execute is that it only runs if you have data rows - if you have zero, it won't do anything!
Say you have some datasets to run code on. You could have up to twelve - one per month - but only have them for the current calendar year, so in Jan you have one, Feb you have two, etc. You could do this:
data mydata_jan mydata_feb mydata_mar;
set sashelp.class;
run;
%macro printit(data=);
title "Printing &data.";
proc print data=&data;
run;
title;
%mend printit;
data _null_;
set sashelp.vtable;
where libname = 'WORK' and upcase(memname) like 'MYDATA_%' and nobs gt 0;
callstr = cats('%printit(data=',memname,')');
call execute(Callstr);
run;
First I make the datasets, with a name I can programmatically identify. Then I make the macro that I want to run on each (this could be checking, cleaning, whatever). Then I use sashelp.vtable which shows which tables are created, and check the nobs variable (number of observations) is more than zero. Then I use call execute to run the macro on that dataset!

SAS: How to create datasets in loop based on macro variable

I have a macro variable like this:
%let months = 202002 202001 201912 201911 201910;
As one can see, we have 5 months, separated by space ' '.
I would like to create 5 datasets like a_202002, a_202001, a_201912, a_201911, a_201910. How can I run this in a loop and create 5 datasets, instead of writing the data step 5 times?
Pseudo code:
for m in &months.
data a_m;
....
....
run;
How can I do that in SAS? I tried %do_over but that did not help me.
Use the knowledge you gained from @Tom's answer to an earlier question to create the macro
%macro datasets_for_months ...
...
%mend;
Specify the output data sets in the DATA statement:
DATA %datasets_for_months(...);
...
RUN;
Direct rows to specific output data sets by naming the data set, such as
OUTPUT a_202002;
Note:
A step with no OUTPUT statements will implicitly output to each data set
A step with an OUTPUT statement will cause records to be written to either ALL data sets, or only the ones named in the statement:
OUTPUT writes records to all output data sets
OUTPUT data-set-name-1 writes records to only data sets specified
The DATA Step documentation covers what you need to know in greater detail
DATA Statement
Begins a DATA step and provides names for any output such as SAS data sets, views, or programs.
...
Syntax
Form 1:
DATA statement for creating output data sets
DATA <data-set-name-1 <(data-set-options-1)>>
... <data-set-name-n <(data-set-options-n)>>
... ;
THE ROAD AHEAD
You will likely discover that month will be better served in a conceptual role as a categorical variable in a single large data set, instead of breaking the data into multiple month-named data sets.
A categorical variable will let you leverage the power of SAS' partitioning and segregating statements such as WHERE, BY and CLASS when pursuing processing, reporting and visualization of your data at different combinations of class level values.
How about this approach? Create the data set names in another macro variable and use a single data step.
%let months = 202002 202001 201912 201911 201910;
data _null_;
ds = prxchange('s/(\d+)/a_$1/', -1, "&months.");
call symputx('ds', ds);
run;
options symbolgen;
data &ds.;
run;
You can use a %DO loop and the %SCAN() function. Use the COUNTW() function to find the upper bound.
%do i=1 %to %sysfunc(countw(&months,%str( )));
%let month=%scan(&months,&i,%str( ));
....
%end;
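A %DO loop has to live inside a macro definition, so the fragment above needs a wrapper before it can run. The following is a minimal sketch under that assumption; the macro name make_month_datasets and the body of the data step (a single row holding the month) are placeholders for whatever each monthly dataset should actually contain.

```sas
%let months = 202002 202001 201912 201911 201910;

%macro make_month_datasets(list);
%local i month;
/* Loop over the space-delimited list, one word per iteration. */
%do i=1 %to %sysfunc(countw(&list,%str( )));
%let month=%scan(&list,&i,%str( ));
/* Placeholder step: creates a_202002, a_202001, ... with one row each. */
data a_&month;
month = &month;
run;
%end;
%mend make_month_datasets;

%make_month_datasets(&months)
```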

SAS: proc reg and macro

I have data that contain 30 variables and 2000 observations.
I want to run a regression in a loop, where in each step I delete the i-th row of the data.
So in the end my output should be 2001 regressions: one on all the data and 2000 that each drop one row.
I am new to SAS, and I tried to find out how to do it with a macro, but I didn't understand.
Any comments and help will be appreciated!
This will create the data set I was talking about in my comment to Chris.
data del1V /view=del1v;
length group _obs_ 8;
set sashelp.class nobs=nobs;
_obs_ = _n_;
group=0;
output;
do group=1 to nobs;
if group eq _n_ then;
else output;
end;
run;
proc sort out=analysis;
by group;
run;
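DATA NEW;
SET OLD;
/* write one copy of each row per group; the row is left out of its own group */
do i = 1 to 2001;
IF _N_ ^= i THEN group=i;
else group=.;
output;
end;
run;
proc sort data=new;
by group;
run;
/* "proc reg syntax" stands in for your actual PROC REG statements */
proc reg syntax;
by group;
run;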
This will create a data set that is much longer. You will only call proc reg once, but it will run 2001 models.
Examining 2001 regression outputs will be difficult just written as output. You will likely need to go read the PROC REG support documentation and look into the output options for whatever type of output you're interested in. SAS can create a data set with the GROUP column to differentiate the results.
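As one possible way to collect the results in a data set, PROC REG's OUTEST= option writes one row of parameter estimates per BY group. The sketch below uses sashelp.class as stand-in data, and the model weight = height age as well as the data set names loo and est are illustrative assumptions, not taken from the question.

```sas
/* Build leave-one-out groups: group 0 keeps every row, */
/* group k keeps every row except row k. */
data loo;
set sashelp.class nobs=nobs;
group = 0;
output;
do group = 1 to nobs;
if group ne _n_ then output;
end;
run;

proc sort data=loo;
by group;
run;

/* One PROC REG call fits one model per BY group. */
/* NOPRINT suppresses the printed output; OUTEST= stores the estimates. */
proc reg data=loo outest=est noprint;
by group;
model weight = height age; /* illustrative model */
run;
quit;

/* Each row of EST is one regression, identified by GROUP. */
proc print data=est;
var group intercept height age _rmse_;
run;
```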
I edited my original answer per @data null's suggestion. I agree that the above is probably faster, though I'm not as confident that it would be 100x faster. I do not know enough about the overhead of proc reg versus the cost of the by group statement and a larger data set. Regardless, the answer above is simpler programming. Here is my original answer/alternate approach.
You can do this within a macro program. It will have this general structure:
%macro regress;
%do i=1 %to 2001;
DATA NEW;
SET OLD;
IF _N_=&I THEN DELETE;
RUN;
proc reg syntax;
run;
%end;
%mend;
%regress
Macros are an advanced programming feature in SAS. The macro program is required in order to loop over proc reg. The %'s indicate macro statements. &i is a macro variable (& is the prefix used to reference a macro variable). The macro is defined in a block that starts with %macro and ends with %mend, and is called with %regress.
Examining 2001 regression outputs will be difficult just written as output. You will likely need to go read the PROC REG support documentation and look into the output options for whatever type of output you're interested in. Use &i to create a different data set each time and then append together as part of the macro loop.

Using a series of values for a SAS macro parameter

I'm looking for a way to use a series of values for a macro parameter instead of a single value. I'm basically manipulating a series of files for consecutive months (May 2014 to Sept 2015) and I've written a macro to take advantage of the naming conventions. However, I'm still manually writing out the months to use the macro. I'm doing this many times over with lots of different files from this month. Is there a way to have the parameter reference a list of values and go through them like an array/do-loop? I've looked into %ARRAY as a possibility, but that doesn't seem to do what I'm looking for, unless I'm not seeing its full capability. I've attached a sample of this code below.
%MACRO memmonth(monyr=);
proc freq data=work.both_&monyr ;
table var1/ out=work.mem_&monyr;
run;
data work.mem_&monyr;
set work.mem_&monyr;
count_&monyr=count;
run;
%MEND memmonth;
%memmonth(monyr=may14)
%memmonth(monyr=jun14)
%memmonth(monyr=jul14)
%memmonth(monyr=aug14)
%memmonth(monyr=sep14)
%memmonth(monyr=oct14)
%memmonth(monyr=nov14)
%memmonth(monyr=dec14)
%memmonth(monyr=jan15)
%memmonth(monyr=feb15)
%memmonth(monyr=mar15)
%memmonth(monyr=apr15)
%memmonth(monyr=may15)
%memmonth(monyr=jun15)
%memmonth(monyr=jul15)
%memmonth(monyr=aug15)
%memmonth(monyr=sep15)
In general I would recommend passing the list of values as a space delimited list and adding looping logic in the macro. If spaces are valid characters in the values then use some other delimiter. Do not use comma as the delimiter as it means you will need to use macro quoting to call the macro.
So your basic macro is this.
%macro memmonth(monyr);
proc freq data=work.both_&monyr ;
table var1/ out=work.mem_&monyr (rename=(count=count_&monyr)) ;
run;
%mend memmonth;
%memmonth(may14)
%memmonth(jun14)
You could change it to this.
%macro memmonth(monyrlist);
%local i monyr;
%do i=1 %to %sysfunc(countw(&monyrlist));
%let monyr=%scan(&monyrlist,&i);
proc freq data=work.both_&monyr ;
table var1/ out=work.mem_&monyr (rename=(count=count_&monyr)) ;
run;
%end;
%mend memmonth;
%memmonth(may14 jun14)
If you always want to process all of the months in an interval then you could just pass in the start and end month of the interval.
%macro memmonth(start,end);
%local i monyr;
%do i=0 %to %sysfunc(intck(month,"01&start"d,"01&end"d));
%let monyr=%sysfunc(intnx(month,"01&start"d,&i),monyy5.);
proc freq data=work.both_&monyr ;
table var1/ out=work.mem_&monyr (rename=(count=count_&monyr)) ;
run;
%end;
%mend memmonth;
%memmonth(may14,sep15)
If you have a source list, whether it is a text file or a dataset, you can use a simple data step to generate macro calls. So if you have an input dataset with the variable MONYR then your driver program would look like this:
data _null_;
set mylist ;
call execute(cats('%nrstr(%memmonth)(',MONYR,')'));
run;
If the source is a file with the names then replace the SET statement with the appropriate INFILE and INPUT statements. If the source is a directory name then look at one of the many SAS ways to read the names of files in a directory into a dataset and use that to drive the macro call generation.
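For the text-file case, a minimal sketch might look like the following. The file name months.txt, its layout (one month value such as may14 per line) and the $5. informat are assumptions for illustration; the %memmonth macro is the one defined above.

```sas
data _null_;
/* Hypothetical input file with one value per line, e.g. may14 */
infile 'months.txt' truncover;
input monyr :$5.;
/* %nrstr delays execution so the macro runs after this step finishes. */
call execute(cats('%nrstr(%memmonth)(',monyr,')'));
run;
```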

Sas macro with proc sql

I want to perform some regressions and I would like to count the number of nonmissing observations for each variable. But I don't know yet which variables I will use. I've come up with the following solution, which does not work. Any help?
Here I basically put each of my explanatory variables in its own variable. For example,
var1 var2 -> w1 = var1, w2 = var2. Notice that I don't know how many variables I have in advance, so I leave room for ten variables.
Then I store the potential variables using symput.
data _null_;
cntw=countw("&parameters");
i = 1;
array w{10} $15.;
do while(i <= cntw);
w[i]= scan("&parameters",i, ' ');
i = i +1;
end;
/* store a variable globally*/
do j=1 to 10;
call symput("explanVar"||left(put(j,3.)), w(j));
end;
run;
My next step is to perform a proc sql using the variables I've stored. It does not work if I have fewer than 10 variables.
proc sql;
select count(&explanVar1), count(&explanVar2),
count(&explanVar3), count(&explanVar4),
count(&explanVar5), count(&explanVar6),
count(&explanVar7), count(&explanVar8),
count(&explanVar9), count(&explanVar10)
from estimation
;quit;
Can this code work with fewer than 10 variables?
You haven't provided the full context for this project, so it's unclear if this will work for you - but I think this is what I'd do.
First off, you're in SAS, use SAS where it's best - counting things. Instead of the PROC SQL and the data step, use PROC MEANS:
proc means data=estimation n;
var &parameters.;
run;
That, without any extra work, gets you the number of nonmissing values for all of your variables in one nice table.
Secondly, if there is a reason to do the PROC SQL, it's probably a bit more logical to structure it this way.
/* the %do loop must sit inside a macro definition */
%macro count_params;
proc sql;
select
%do i = 1 %to %sysfunc(countw(&parameters.));
count(%scan(&parameters.,&i.) ) as Parameter_&i., /* or could reuse the %scan result to name this better*/
%end; count(1) as Total_Obs
from estimation;
quit;
%mend count_params;
%count_params
The final Total_Obs column is useful to simplify the code (dealing with the extra comma is mildly annoying). You could also put it at the start and prepend the commas.
You finally could also drive this from a dataset rather than a macro variable. I like that better, in general, as it's easier to deal with in a lot of ways. If your parameter list is in a data set somewhere (one parameter per row, in the dataset "Parameters", with "var" as the name of the column containing the parameter), you could do
proc sql;
select cats('%countme(var=',var,')') into :countlist separated by ','
from parameters;
quit;
%macro countme(var=);
count(&var.) as &var._count
%mend countme;
proc sql;
select &countlist from estimation;
quit;
This I like the best, as it is the simplest code and is very easy to modify. You could even drive it from a contents of estimation, if it's easy to determine what your potential parameters might be from that (or from dictionary.columns).
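As a sketch of the dictionary.columns idea, the list of COUNT expressions can be built straight from the metadata. This assumes the data set is WORK.ESTIMATION and that every numeric column is a candidate parameter; both are assumptions, and the WHERE clause would need adjusting otherwise. Column names longer than 26 characters would also make the generated name_count alias exceed the 32-character limit.

```sas
proc sql noprint;
/* Build "count(x) as x_count, count(y) as y_count, ..." from metadata. */
select cats('count(', name, ') as ', name, '_count')
into :countlist separated by ', '
from dictionary.columns
where libname = 'WORK' and memname = 'ESTIMATION' and type = 'num';
quit;

proc sql;
select &countlist from estimation;
quit;
```

If no column matches, countlist is never created, so a production version would want to check &sqlobs before running the second query.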
I'm not sure about your SAS macro, but the SQL query will work with these two notes:
1) If you don't follow your COUNT() functions with an identifier such as "COUNT() AS VAR1", your results will not have field headings. If that's ok with you, then you may not need to worry about it. But if you export the data, it will help to name them by adding "... AS MY_NAME".
2) For observations with fewer than 10 variables, the query will return NULL values. So don't worry about not getting all of the results with what you have, because as long as the table you're querying has space for 10 variables (10 separate fields), you will get data back.