I have the following code that is being used to generate running totals of features for the past 1 day, 7 days, 1 month, 3 months, and 6 months.
LIBNAME A "C:\Users\James\Desktop\data\Base Data";
LIBNAME DATA "C:\Users\James\Desktop\data\Data1";
%MACRO HELPER(P);
data a1;
set data.final_master_&P. ;
QUERY = '%TEST('||STRIP(DATETIME)||','||STRIP(PARTICIPANT)||');';
CALL EXECUTE(QUERY);
run;
%MEND;
%MACRO TEST(TIME,PAR);
proc sql;
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_24, :APP_2_24, :APP_3_24, :APP_4_24, :APP_5_24
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24) AND &TIME.;
/* 7 Days */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_7DAY, :APP_2_7DAY, :APP_3_7DAY, :APP_4_7DAY, :APP_5_7DAY
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7) AND &TIME.;
/* One Month */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_1MONTH, :APP_2_1MONTH, :APP_3_1MONTH, :APP_4_1MONTH, :APP_5_1MONTH
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7*4) AND &TIME.;
/* Three Months */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_3MONTH, :APP_2_3MONTH, :APP_3_3MONTH, :APP_4_3MONTH, :APP_5_3MONTH
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7*4*3) AND &TIME.;
/* Six Months */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_6MONTH, :APP_2_6MONTH, :APP_3_6MONTH, :APP_4_6MONTH, :APP_5_6MONTH
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7*4*6) AND &TIME.;
quit;
DATA T;
PARTICIPANT = &PAR.;
DATETIME = &TIME;
APP_1_24 = &APP_1_24.;
APP_2_24 = &APP_2_24.;
APP_3_24 = &APP_3_24.;
APP_4_24 = &APP_4_24.;
APP_5_24 = &APP_5_24.;
APP_1_7DAY = &APP_1_7DAY.;
APP_2_7DAY = &APP_2_7DAY.;
APP_3_7DAY = &APP_3_7DAY.;
APP_4_7DAY = &APP_4_7DAY.;
APP_5_7DAY = &APP_5_7DAY.;
APP_1_1MONTH = &APP_1_1MONTH.;
APP_2_1MONTH = &APP_2_1MONTH.;
APP_3_1MONTH = &APP_3_1MONTH.;
APP_4_1MONTH = &APP_4_1MONTH.;
APP_5_1MONTH = &APP_5_1MONTH.;
APP_1_3MONTH = &APP_1_3MONTH.;
APP_2_3MONTH = &APP_2_3MONTH.;
APP_3_3MONTH = &APP_3_3MONTH.;
APP_4_3MONTH = &APP_4_3MONTH.;
APP_5_3MONTH = &APP_5_3MONTH.;
APP_1_6MONTH = &APP_1_6MONTH.;
APP_2_6MONTH = &APP_2_6MONTH.;
APP_3_6MONTH = &APP_3_6MONTH.;
APP_4_6MONTH = &APP_4_6MONTH.;
APP_5_6MONTH = &APP_5_6MONTH.;
FORMAT DATETIME DATETIME.;
RUN;
PROC APPEND BASE=DATA.FLAGS_&par. DATA=T;
RUN;
%MEND;
%helper(1);
This code runs perfectly if I limit the number of observations in the %helper macro, using an (obs=) option when creating the a1 dataset. However, when I put no limit on the number of observations, i.e. execute the %test macro for every row in the dataset a1, I get errors. In SAS EG, I get a "server disconnected" popup after the status bar hangs at "running data step", and in Base SAS 9.4 I get errors saying that none of the macro variables created in the PROC SQL INTO clauses have been resolved.
I'm confused, as the code works fine for a limited number of observations, but on the whole dataset it hangs or errors out. The dataset I'm doing this for has around 130,000 observations.
The answer to your actual question is that you're simply generating too much macro code, and perhaps simply taking too much time. The way you are doing this operates at an O(n²) level, as you're basically doing a cartesian join of every record to every record, and then some. 130,000 * 130,000 is about 17 billion comparisons, and on top of that you're opening the SQL environment several times for each of the 130,000 rows. Ouch.
The solution is to do it either in a way that isn't too slow, or, if it must be slow, in a way that at least doesn't have too much overhead.
The fast solution is to not do the cartesian join, or to limit how much needs to be joined. One good option is to restructure the problem so that not every record has to be compared: treat each calendar day as a period, at least for the over-24h windows (the 24h window you might still do the way you do above, but not the other four). For 1 month, 3 months, etc., do you really need resolution down to the time of day? It probably won't make much difference. If you can give that up, you can use built-in PROCs to precompute all possible 1-month periods, all possible 3-month periods, etc., and then join each record to the appropriate one. But that won't work with 130,000 distinct start points; it only works if you can limit it to one per day.
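To make that concrete, here is a minimal sketch of the day-level idea, under the assumption that a1 has a numeric DATETIME and APP_1-APP_5 as in your code (the daily and roll7 names are made up for illustration). Once the data is collapsed to one row per calendar day, even a brute-force day-level self-join is cheap, since a year of data is only a few hundred rows:
proc sql;
/* collapse to one row per calendar day */
create table daily as
select datepart(datetime) as date format=date9.,
sum(app_1) as app_1, sum(app_2) as app_2, sum(app_3) as app_3,
sum(app_4) as app_4, sum(app_5) as app_5
from a1
group by calculated date;
/* rolling 7-day sums at day granularity; the same pattern works for 30/90/180-day windows */
create table roll7 as
select a.date,
sum(b.app_1) as app_1_7day, sum(b.app_2) as app_2_7day,
sum(b.app_3) as app_3_7day, sum(b.app_4) as app_4_7day,
sum(b.app_5) as app_5_7day
from daily a inner join daily b
on b.date between a.date - 6 and a.date
group by a.date;
quit;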
If you must do it at the second level (or worse), what you'll want to do is avoid the cartesian join, and instead keep track of the various records you've seen already, and the sums. The short explanation of the algorithm is:
For each row:
Add this row's values to the rolling sums (at the end of the queue)
Check if the current item of the queue is outside of the period; if it is, subtract it from the rolling sums, and check the next item (repeat until not outside of the period), updating the current queue position
Return the sum at this point
This requires checking each row typically twice (except at odd boundaries, where no rows get popped off for several iterations because months have different numbers of days). It operates in O(n) time, much faster than the cartesian join, and on top of that requires far less memory/space (the cartesian join might need to hit disk).
The hash version of this solution is below. I think this will be the fastest solution that compares every row. Note that I intentionally make the test data have 1 for every row and the same number of rows for every day; that lets you see how it works on a row-wise basis very easily. (For example, every 24h period has 481 rows, because I made exactly 480 rows per day, and 481 includes the same time yesterday - if you change lt to le it will be 480, if you prefer not to include the same time yesterday.)
You can see that the 'month' based periods will have slightly odd results at the boundaries where months change, because the '01FEB20xx' to '01MAY20xx' period has far fewer days (and thus rows) than the '01JUL20xx' to '01OCT20xx' period, for example; 30/90/180-day periods would behave better.
data test_data;
array app[5] app_1-app_5;
do _i = 1 to 130000;
dt_var = datetime() - _i*180;
do _j = 1 to dim(app);
*app[_j] = floor(rand('Uniform')*6); *generate 0 to 5 integer;
app[_j]=1;
end;
output;
end;
format dt_var datetime17.;
run;
proc sort data=test_data;
by dt_var;
run;
%macro add(array=);
do _i = 1 to dim(app);
&array.[_i] + app[_i];
end;
%mend add;
%macro subtract(array=);
do _i = 1 to dim(app);
&array.[_i] + (-1*app[_i]);
end;
%mend subtract;
%macro process_array_add(array=);
array app_&array. app_&array._1-app_&array._5;
%add(array=app_&array.);
%mend process_array_add;
%macro process_array_subtract(array=, period=, number=);
if _n_ eq 1 then do;
declare hiter hi_&array.('td');
rc_&array. = hi_&array..first();
end;
else do;
rc_&array. = hi_&array..setcur(key:firstval_&array.);
end;
do while (intnx("&period.",dt_var,&number.,'s') lt curr_dt_var and rc_&array.=0);
%subtract(array=app_&array.);
rc_&array. = hi_&array..next();
end;
retain firstval_&array.;
firstval_&array. = dt_var;
%mend process_array_subtract;
data want;
set test_data;
* if _n_ > 10000 then stop;
curr_dt_var = dt_var;
array app[5] app_1-app_5;
if _n_ eq 1 then do;
declare hash td(ordered:'a');
td.defineKey('dt_var');
td.defineData('dt_var','app_1','app_2','app_3','app_4','app_5');
td.defineDone();
end;
rc_a = td.add();
*start macro territory;
%process_array_add(array=24h);
%process_array_add(array=1wk);
%process_array_add(array=1mo);
%process_array_add(array=3mo);
%process_array_add(array=6mo);
%process_array_subtract(array=24h,period=DTDay, number=1);
%process_array_subtract(array=1wk,period=DTDay, number=7);
%process_array_subtract(array=1mo,period=DTMonth, number=1);
%process_array_subtract(array=3mo,period=DTMonth, number=3);
%process_array_subtract(array=6mo,period=DTMonth, number=6);
*end macro territory;
rename curr_dt_var=dt_var;
format curr_dt_var datetime21.3;
drop dt_var rc: _:;
output;
run;
Here's a pure data step non-hash version. On my machine it's actually faster than the hash solution; I suspect that wouldn't hold on a machine with an HDD (I have an SSD, so point access is not substantially slower than hash access, and I avoid having to load the hash). I would recommend using it if you don't know hashes well or at all, as it'll be easier to troubleshoot, and it scales similarly. For most rows it accesses 11 rows: the current row, plus two reads for each of the five periods (one row to test, subtract it, then the next row), for a total of around a million and a half reads for 130k rows. (Compare that to about 17 billion reads for the cartesian...)
I suffix the macros with "_2" to differentiate them from the macros in the hash solution.
data test_data;
array app[5] app_1-app_5;
do _i = 1 to 130000;
dt_var = datetime() - _i*180;
do _j = 1 to dim(app);
*app[_j] = floor(rand('Uniform')*6); *generate 0 to 5 integer;
app[_j]=1;
end;
output;
end;
format dt_var datetime17.;
run;
proc sort data=test_data;
by dt_var;
run;
%macro add_2(array=);
do _i = 1 to dim(app);
&array.[_i] + app[_i];
end;
%mend add_2;
%macro subtract_2(array=);
do _i = 1 to dim(app);
&array.[_i] + (-1*app[_i]);
end;
%mend subtract_2;
%macro process_array_add_2(array=);
array app_&array. app_&array._1-app_&array._5; *define array;
%add_2(array=app_&array.); *add current row to array;
%mend process_array_add_2;
%macro process_array_sub_2(array=, period=, number=);
if _n_ eq 1 then do; *initialize point variable;
point_&array. = 1;
end;
else do; *skip this when _n_=1, since the current row is the only row in the window so far;
set test_data point=point_&array.; *set the row that we may be subtracting;
end;
do while (intnx("&period.",dt_var,&number.,'s') lt curr_dt_var and point_&array. < _N_); *until we hit a row that is within the period...;
%subtract_2(array=app_&array.); *subtract the rows values;
point_&array. + 1; *increment the point to look at;
set test_data point=point_&array.; *set the new row;
end;
%mend process_array_sub_2;
data want;
set test_data;
*if _n_ > 10000 then stop; *useful for testing if you want to check time to execute;
curr_dt_var = dt_var; *save dt_var value from originally set record;
array app[5] app_1-app_5; *base array;
*start macro territory;
%process_array_add_2(array=24h); *have to do all of these adds before we start subtracting;
%process_array_add_2(array=1wk); *otherwise we have the wrong record values;
%process_array_add_2(array=1mo);
%process_array_add_2(array=3mo);
%process_array_add_2(array=6mo);
%process_array_sub_2(array=24h,period=DTDay, number=1); *now start checking to subtract what we need to;
%process_array_sub_2(array=1wk,period=DTDay, number=7);
%process_array_sub_2(array=1mo,period=DTMonth, number=1);
%process_array_sub_2(array=3mo,period=DTMonth, number=3);
%process_array_sub_2(array=6mo,period=DTMonth, number=6);
*end macro territory;
rename curr_dt_var=dt_var;
format curr_dt_var datetime21.3;
drop dt_var _:;
output; *unneeded in this version but left for comparison to hash;
run;
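If you run both solutions on the same test_data (renaming one output first, say to want_hash), a quick sanity check that the two approaches agree could be:
proc compare base=want_hash compare=want;
run;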
I'm using SAS and I'd like to create an indicator variable.
The data I have is like this (DATA I HAVE):
and I want to change this to (DATA I WANT):
I have a fixed total number of time values that I want to use, and starttime has duplicate values (in this example, c1 and c2 both started at time 3). Although the example I'm using is small, with 5 names and 12 time values, the actual data is very large (about 40,000 names and 100,000 time values - so the outcome I want is a 100,000 x 40,000 matrix).
Can someone please provide any tips/solution on how to handle this?
40k variables is a lot. It will be interesting to see how well this scales. How do you determine the stop time?
data have;
input starttime name :$32.;
retain one 1;
cards;
1 varx
3 c1
3 c2
5 c3x
10 c4
11 c5
;;;;
run;
proc print;
run;
proc transpose data=have out=have2(drop=_name_ rename=(starttime=time));
by starttime;
id name;
var one;
run;
data time;
if 0 then set have2(drop=time);
array _n[*] _all_;
retain _n 0;
do time=.,1 to 12;
output;
call missing(of _n[*]);
end;
run;
data want0 / view=want0;
merge time have2;
by time;
retain dummy '1';
run;
data want;
length time 8;
update want0(obs=0) want0;
by dummy;
if not missing(time);
output;
drop dummy;
run;
proc print;
run;
This will work. There may be a simpler solution that does it all in one data step. My data step creates staggered results that have to be collapsed, which I do by summing in the sort/means steps.
data have;
input starttime name $;
datalines;
3 c1
3 c2
5 c3
10 c4
11 c5
;
run;
data want(drop=starttime name);
set have;
array cols (*) c1-c5;
do time=1 to 100;
if starttime < time then cols(_N_)=1;
else cols(_N_)=0;
output;
end;
run;
proc sort data=want;
by time;
proc means data=want noprint;
by time;
var _numeric_;
output out=want2(drop=_type_ _freq_) sum=;
run;
I am not recommending you do it this way. You didn't provide enough information to let us know why you want a matrix of that size. You may have processing issues getting it to run.
In the line do time=1 to 100, you can change 100 to 100000 or whatever length you need.
I think the code below will work:
%macro answer_macro(data_in, data_out);
/* Deduplication of initial dataset just to assure that every variable has a unique starting time*/
proc sort data=&data_in. out=data_have_nodup; by name starttime; run;
proc sort data=data_have_nodup nodupkey; by name; run;
/*Getting min and max starttime values - here I am assuming that starttime takes only integer values*/
proc sql noprint;
select min(starttime)
,max(starttime)
into :min_starttime /*not used. Use this (and change the loop on the next dataset) to start the time variable from the value where the first variable starts*/
,:max_starttime
from data_have_nodup
;quit;
/*Getting all pairs of name/starttime*/
proc sql noprint;
select name
,starttime
into :name1 - :name1000000
,:time1 - :time1000000
from data_have_nodup
;quit;
/*Getting total number of variables*/
proc sql noprint;
select count(*) into :nvars
from data_have_nodup
;quit;
/* Creating dataset with possible start values */
/*I'm not sure this step could be done with a single datastep, but I don't have SAS
on my PC to make tests, so I used the method below*/
data &data_out.;
do i = 1 to &max_starttime. + 1;
time = i; output;
end;
drop i;
run;
data &data_out.;
set &data_out.;
%do i = 1 %to &nvars.;
if time >= &&time&i then &&name&i = 1;
else &&name&i = 0;
%end;
run;
%mend answer_macro;
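Invocation would presumably look like this (positional parameters; have and want are just illustrative dataset names):
%answer_macro(have, want);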
Unfortunately I don't have SAS on my machine right now, so I can't confirm that the code works. But even if it doesn't, you can use the logic in it.
I have a process flow in SAS Enterprise Guide which consists mainly of data views rather than tables, for the sake of storage in the work library.
The problem is that I need to calculate percentiles (using proc univariate) from one of the data views and left join the result to the final table.
Is there any way that I can specify the outfile in the univariate procedure as a data view, so that the procedure doesn't force everything prior to it in the flow to be calculated? When the percentiles are left joined to the final table, the flow is calculated again, so I'm effectively doubling my processing time.
Please find the code for the univariate procedure below:
proc univariate data=WORK.QUERY_FOR_SGFIX noprint;
var CSA_Price;
by product_id;
output out= work.CSA_Percentiles_Prod
pctlpre= P
pctlpts= 40 to 60 by 10;
run;
In SAS, my understanding is that procs such as proc univariate cannot generally produce views as output. The only workaround I can think of would be for you to replicate the proc logic within a data step and produce a view from the data step. You could do this e.g. by transposing your variables into temporary arrays and using the pctl function.
Here's a simple example:
data example /view = example;
array _height[19]; /*Number of rows in sashelp.class dataset*/
/*Populate array*/
do _n_ = 1 by 1 until(eof);
set sashelp.class end = eof;
_height[_n_] = height;
end;
/*Calculate quantiles*/
array quantiles[3] q40 q50 q60;
array points[3] (40 50 60);
do i = 1 to 3;
quantiles[i] = pctl(points[i], of _height{*});
end;
/*Keep only the quantiles we calculated*/
keep q40--q60;
run;
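Since example is defined as a view, no percentiles are actually computed until something reads it, for instance:
proc print data=example;
run;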
With a bit more work, you could also make this approach return percentiles for individual by groups rather than for the whole dataset at once. You would need to write a double-DOW loop, and the input has to be sorted by the by variable first, e.g.:
proc sort data=sashelp.class out=class;
by sex;
run;
data example;
array _height[19];
array quantiles[3] q40 q50 q60;
array points[3] _temporary_ (40 50 60);
/*Clear heights array between by groups*/
call missing(of _height[*]);
/*Populate heights array*/
do _n_ = 1 by 1 until(last.sex);
set class end = eof;
by sex;
_height[_n_] = height;
end;
/*Calculate quantiles*/
do i = 1 to 3;
quantiles[i] = pctl(points[i], of _height{*});
end;
/* Output all rows from input dataset, with by-group quantiles attached*/
do _n_ = 1 to _n_;
set class;
output;
end;
keep name sex q40--q60;
run;
I have written this code to do the following:
read records in the table "not_identified" one by one
for each record, pass the "name_firstname" variable to a macro named "mCalcul_lev_D33",
then the macro calculates the Levenshtein distance between the variable passed as a parameter and all the values of the variable "name_firstname_in_D33" in the "data_all" table,
if the Levenshtein distance is less than or equal to 3, then the record of "data_all" is copied to the "lev_D33" table.
rsubmit;
%macro mCalcul_lev_D33(theName);
data result.lev_D33;
set result.data_all;
name_LEV=complev(&theName, name_firstname_in_D33);
if name_LEV<=3 then output;
run;
%mend mCalcul_lev_D33;
endrsubmit;
rsubmit;
data _null_;
set result.not_identified;
call execute ('%mCalcul_lev_D33('||name_firstname||')');
run;
endrsubmit;
There are 53,700,000 records in "data_all". The code has been running since yesterday. Because I cannot see the result, I am asking:
Is the code doing what I want?
How would I code it if I want to write "name_firstname" (the variable passed as a parameter) at the beginning of each record of "lev_D33"?
Thank you!
D.O.:
I posit your macros are making the task more difficult than it needs to be. There is also a coding problem: each row in not_identified will cause result.lev_D33 to be rebuilt, so if your long-running program ever does finish, the lev_D33 output data set will correspond to only the last row of not_identified.
You are effectively doing a cross join (Cartesian product), comparing ALL_COUNT * NOT_IDENT_COUNT rows in the process.
How many rows are in not_identified? Hopefully far fewer than in data_all.
Is the result libname pointing to a network drive or remote server? Network I/O can make things run a very long time, and even win you a phone call from the network team.
A cross join in a DATA step can be done with nested loops and a POINT= option on the inner loop's SET statement. In a DATA step, the outer loop is the implicit loop.
Consider this sample code:
data all_data;
do row = 1 to 100;
length name_firstname $20;
name_firstname
= repeat (byte(65 + mod(row,26)), 4*ranuni(123))
|| repeat(byte(65 + 26*ranuni(123)), 4*ranuni(123))
;
output;
end;
run;
data not_identified;
do row = 1 to 10;
length name_firstname $20;
name_firstname = repeat (byte(65 + mod(row,26)), 10*ranuni(123));
output;
end;
run;
data lev33;
set all_data;
do check_row = 1 to check_count;
set not_identified (keep=name_firstname rename=name_firstname=check_name)
nobs=check_count
point=check_row
;
name_lev = complev (check_name, name_firstname);
if name_lev <= 3 then output;
end;
run;
This approach tests each not_identified before moving to the next row. This is a useful method when the all_data is very large and you might want to process chunks of it at a time. Chunk processing is an appropriate place to start macro coding:
%macro do_chunk (FROM_OBS=, TO_OBS=);
data lev33_&FROM_OBS._&TO_OBS;
set all_data (firstobs=&FROM_OBS obs=&TO_OBS);
do check_row = 1 to check_count;
set not_identified (keep=name_firstname rename=name_firstname=check_name)
nobs=check_count
point=check_row
;
name_lev = complev (check_name, name_firstname);
if name_lev <= 3 then output;
end;
run;
%mend;
%macro do_chunks;
%local index;
%do index = 1 %to 100 %by 10;
%do_chunk ( FROM_OBS=&index, TO_OBS=%eval(&index+9) )
%end;
%mend;
%do_chunks
You might shepherd the whole process, bypassing do_chunks and manually invoking do_chunk for various ranges of your choosing.
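For example, to run just the first two chunks by hand (the ranges here are arbitrary):
%do_chunk ( FROM_OBS=1, TO_OBS=10 )
%do_chunk ( FROM_OBS=11, TO_OBS=20 )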
Thanks to @Richard. I have used your second example to write this code:
rsubmit;
data result.lev_D33;
set result.not_identified (firstobs=1 obs=10);
do check_row = 1 to 1000000;
set &lib..data_all (firstobs=1 obs=1000000) point=check_row;
name_lev = complev (name_firstname, name_firstname_D3);
if name_lev <= 3 then output;
end;
run;
endrsubmit ;
And it worked like I wanted.
In this example, I compare name_firstname in the not_identified table to every name_firstname_D3 in data_all. If the COMPLEV is less than or equal to 3, the two records (one from not_identified merged with one from data_all) are written to the result table "lev_D33".
As a test, I took 10 records from not_identified and looked for concordance of the names and firstnames in only 1,000,000 rows of data_all.
I have panel data with a quarterly time series for each ID, and each ID has a different start and end date. I'd like to run a rolling regression on each ID with a 10-observation window, and I'm wondering how I can write a loop to do it. For instance, for each ID I'd like to run a regression from the first obs (dates 19880831-19910228), then from the second obs (19890228-19910531), and so on until the last date for that ID. Thanks.
Consider the following macro with nested do loops. The outer loop iterates through panel IDs, and the inner loop slides a 10-observation window along each ID's series by filtering on the row number _N_. Regression results are then saved into individual one-line results data sets suffixed by the ID and the window's starting observation. Be sure to change the loop limits (50, 1000) to accommodate your data.
%macro rollingregression;
%do i = 1 %to 50; ** ID LOOP;
%do j = 1 %to 1000; ** WINDOW START LOOP;
data new_data;
set current_data;
where ID = &i; * with WHERE, _N_ counts only this ID's rows;
if _N_ >= &j and _N_ <= (&j + 9);
run;
proc reg data = new_data noprint outest=results&i._&j.;
model y = x1 x2 x3;
quit;
%end;
%end;
%mend rollingregression;
%rollingregression;
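If you later want the individual one-line OUTEST data sets stacked into a single table, here is a hedged sketch using a name prefix list and the INDSNAME= option (SAS 9.2+; this assumes the results&i._&j. naming above):
data all_results;
length source $41;
set results: indsname=dsn; * read every data set whose name starts with "results";
source = dsn; * e.g. WORK.RESULTS3_17 is ID 3, window starting at obs 17;
run;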
Is it possible to repeat a data step a number of times (like you might in a %do-%while loop) where the number of repetitions depends on the result of the data step?
I have a data set with a numeric variable A. I calculate a new variable result = min(1, A). I would like the average value of result to equal a target, and I can get there by scaling A by a constant k. That is, solve for k where target = average(min(1, A*k)), with k and target constants and A a list of values.
Here is what I have so far:
filename f0 'C:\Data\numbers.csv';
filename f1 'C:\Data\target.csv';
data myDataSet;
infile f0 dsd dlm=',' missover firstobs=2;
input A;
init_A = A; /* store the initial value of A */
run;
/* read in the target value (1 observation) */
data targets;
infile f1 dsd dlm=',' missover firstobs=2;
input target;
K = 1; * initialise the constant K;
run;
%macro iteration; /* I need to repeat this macro a number of times */
data myDataSet;
retain key 1;
set myDataSet;
set targets point=key;
A = INIT_A * K; /* update the value of A */
result = min(1, A);
run;
/* calculate average result */
proc sql;
create table estimate as
select avg(result) as estimate0
from myDataSet;
quit;
/* compare estimate0 to target and update K */
data targets;
set targets;
set estimate;
K = K * (target / estimate0);
run;
%mend iteration;
I can get the desired answer by running %iteration a few times, but ideally I would like to keep iterating until (target - estimate0 < 0.01). Is such a thing possible?
Thanks!
I had a similar problem to this just the other day. The approach below is what I used; you will need to change the loop structure from a for loop to a do-while loop (or whatever suits your purposes):
First perform an initial scan of the table to figure out your loop termination conditions and get the number of rows in the table:
data read_once;
set sashelp.class end=eof;
if eof then do;
call symput('number_of_obs', cats(_n_) );
call symput('number_of_times_to_loop', cats(3) );
end;
run;
Make sure results are as expected:
%put &=number_of_obs;
%put &=number_of_times_to_loop;
Loop over the source table again multiple times:
data final;
do loop=1 to &number_of_times_to_loop;
do row=1 to &number_of_obs;
set sashelp.class point=row;
output;
end;
end;
stop; * REQUIRED BECAUSE WE ARE USING POINT=;
run;
Two part answer.
First, it's certainly possible to do what you describe. There are examples of code that works like this available online if you want a working, useful example of iterative macros; for example, David Izrael's seminal Rakinge macro, which performs a rim-weighting procedure by iterating over a relatively simple process (PROC FREQs, basically). This is pretty similar to what you're doing: in a data step it checks the various termination criteria and outputs a macro variable holding the total number of criteria met (as each stratification variable separately needs to meet the termination criterion); it then checks %if that criterion is met, and terminates if so.
The core of this is two things. First, you should have a fixed maximum number of iterations, unless you like infinite loops; that number should be larger than the largest number you could ever reasonably need, often by around a factor of two. Second, you need convergence criteria so that you can terminate the loop once they're met.
For example:
data have;
x=5;
run;
%macro reduce(data=, var=, amount=, target=, iter=20);
data want;
set &data.;
run;
%let calc=.;
%let _i=0;
%do %until (&calc.=&target. or &_i.=&iter.);
%let _i = %eval(&_i.+1);
data want;
set want;
&var. = &var. - &amount.;
call symputx('calc',&var.);
run;
%end;
%if &calc.=&target. %then %do;
%put &var. reduced to &target. in &_i. iterations.;
%end;
%else %do;
%put &var. not reduced to &target. in &iter. iterations. Try a larger number.;
%end;
%mend reduce;
%reduce(data=have,var=x,amount=1,target=0);
That is a very simple example, but it has all of the same elements. I prefer to use do-until and increment on my own but you can do the opposite also (as %rakinge does). Sadly the macro language doesn't allow for do-by-until like the data step language does. Oh well.
Secondly, you can often do things like this inside a single data step. Even in older versions (9.2 etc.), you can do all of what you ask above in a single data step, though it might look a little clunky. In 9.3+, and particularly 9.4, there are ways to run that proc sql inside the data step and get the result back without waiting for another data step, using RUN_MACRO or DOSUBL and/or the FCMP language. Even something simple, like this:
data have;
initial_a=0.3;
a=0.3;
target=0.5;
output;
initial_a=0.6;
a=0.6;
output;
initial_a=0.8;
a=0.8;
output;
run;
data want;
k=1;
do iter=1 to 20 until (abs(target-estimate0) < 0.001);
do _n_ = 1 to nobs;
if _n_=1 then result_tot=0;
set have nobs=nobs point=_n_;
a=initial_a*k;
result=min(1,a);
result_tot+result;
end;
estimate0 = result_tot/nobs;
k = k * (target/estimate0);
end;
output;
stop;
run;
That does it all in one data step. I'm cheating a bit because I'm writing my own data step iterator, but that's fairly common in this sort of thing, and it is very fast. Macros iterating over multiple data steps and PROC SQL steps will typically be much slower, as each separate step adds overhead.
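As an aside, here is a minimal sketch of the DOSUBL route mentioned above (9.3+), running a complete PROC SQL step from inside a data step and reading the result back; sashelp.class and avg_h are just illustrative names:
data _null_;
rc = dosubl('
proc sql noprint;
select avg(height) into :avg_h from sashelp.class;
quit;
'); * run a side session, then pull the macro variable back;
estimate = symgetn('avg_h');
put estimate=;
run;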