I want to store an instance of a data step variable in a macro-variable using call symput, then use that macro-variable in the same data step to populate a new field, assigning it a new value every 36 records.
I tried the following code:
data a;
set a;
if MOB = 1 then do;
MOB1_accounts = accounts;
call symput('MOB1_acct', MOB1_accounts);
end;
else if MOB > 1 then MOB1_accounts = &MOB1_acct.;
run;
I have a series of repeating MOB's (1-36). I want to create a field called MOB1_Accts, set it equal to the # of accounts for that cohort where MOB = 1, and keep that value when MOB = 2, 3, 4 etc. I basically want to "drag down" the MOB 1 value every 36 records.
For some reason this macro-variable is returning "1" instead of the correct # accounts. I think it might be a char/numeric issue but unsure. I've tried every possible permutation of single quotes, double quotes, symget, etc... no luck.
Thanks for the help!
You are misusing the macro system.
The ampersand (&) introducer in source code tells SAS to resolve the following symbol and place it into the code submission stream. Thus, the resolved &MOB1_acct. cannot be changed in the running DATA step. In other words, a running step cannot change its own source code -- the resolved macro variable will be the same for all implicit iterations of the step because its value became part of the step's source code before the step started running.
You can use SYMPUT() and SYMGET() functions to move strings out of and into a DATA Step. But that is still the wrong approach for your problem.
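For completeness, a minimal sketch of that approach using your variable names (SYMGETN reads the value back at run time, unlike the &MOB1_acct. reference, which is fixed before the step runs):
data a;
set a;
if MOB = 1 then call symputx('MOB1_acct', accounts);
MOB1_accounts = symgetn('MOB1_acct'); * read back from the macro symbol table at run time;
run;
But again, a retained variable is the better tool here.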
The most straightforward technique would be a combination of
a retained variable, and
a mod(_n_, 36) computation to determine every 36th row (_n_ is a proxy for the row number in a simple step with a single SET).
Example:
data a;
set a;
retain mob1_accounts;
* every 36 rows change the value, otherwise the value is retained;
if mod(_n_,36) = 1 then mob1_accounts = accounts;
run;
You didn't show any data, so the actual program statements you need might be slightly different.
Contrasting SYMPUT/SYMGET with RETAIN
As stated, SYMPUT/SYMGET is a possible way to retain values by storing them off in the macro symbol table. There is a penalty, though: each SYM* call incurs a function call, whatever black-box work is needed to store or retrieve a symbol value, and possibly additional conversions between character and numeric.
Example:
1,000,000 rows are read. DATA _null_ steps are used so that data set writing overhead is not part of the comparison.
data have;
do rownum = 1 to 1e6;
mob + 1;
accounts = sum(accounts, rand('integer', 1,50) - 10);
if mob > 36 then mob = 1;
output;
end;
run;
data _null_;
set have;
if mob = 1 then call symput ('mob1_accounts', cats(accounts));
mob1_accounts = symgetn('mob1_accounts');
run;
data _null_;
set have;
retain mob1_accounts;
if mob = 1 then mob1_accounts = accounts;
run;
On my system the log shows
142 data _null_;
143 set have;
144
145 if mob = 1 then call symput ('mob1_accounts', cats(accounts));
146
147 mob1_accounts = symgetn('mob1_accounts');
148 run;
NOTE: There were 1000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 0.34 seconds
cpu time 0.34 seconds
149
150 data _null_;
151 set have;
152 retain mob1_accounts;
153
154 if mob = 1 then mob1_accounts = accounts;
155 run;
NOTE: There were 1000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.03 seconds
Or
way real cpu
------------- ------ ----
SYMPUT/SYMGET 0.34 0.34
RETAIN 0.04 0.03
Related
I am extracting data from a database that has all values posted in strings, in the format +000000xx.xxx or -00000xx.xxx . I need to convert these to numeric to operate on.
data want;
set have;
numeric_var = string_var*1;
run;
works fine, but to save compute time and resources on the final run, which will be over a much larger dataset, and in the interest of doing things properly, I'd rather do that with a format or informat statement.
data want;
set have;
numeric_var = input(string_var, best8.);
run;
seems to output wrong values and to round everything to 0.
Any ideas?
Using best8. is telling SAS to only consider the first 8 characters of the string, so that's never going to work. You should use just best. or possibly best32. if you feel you have to pre-specify the length.
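For example, a minimal sketch using the variable names from the question:
data want;
set have;
* best32. (or just best.) reads the whole string, not only the first 8 characters;
numeric_var = input(string_var, best32.);
run;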
However, make sure you run some benchmarks before changing your current simple solution. SAS is already doing a character-to-numeric conversion as part of the numeric_var = string_var*1; statement, and is apparently doing it correctly; changing the code to use an informat will not automatically be any faster.
It would be cool if you benchmarked both methods and reported the results back here.
EDIT:
I did some benchmarking on this, out of curiosity. The code and log are below but TL;DR - the informat seems to be very slightly but consistently faster - 7.58 seconds vs 7.83 seconds in the run below on a 50 million observation data set. So the informat method is the way to go, but the 3% performance gain wouldn't be worth refactoring a large program, particularly if you don't have good test coverage to be sure of avoiding regressions.
483 * Set small for testing, big for benchmarking;
484 %let obs = 50000000;
485
486 * Generate test data;
487 data testdata;
488 do i = 1 to &obs;
489 numeric = round(ranuni(0)*100, 0.001);
490 char = '+' || put(numeric, z12.3-L);
491 output;
492 end;
493 run;
NOTE: The data set WORK.TESTDATA has 50000000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 12.55 seconds
user cpu time 11.41 seconds
system cpu time 0.84 seconds
memory 4375.18k
OS Memory 20784.00k
Timestamp 12/10/2019 10:36:11 AM
Step Count 51 Switch Count 0
494
495 %macro charToNum(in=, method=, obs=);
496
497 * Convert back to numeric;
498 data converted;
499 set &in;
500 %if "&method" = "MULT-BY-ONE" %then %do;
501 converted = char * 1;
502 %end; %else %if "&method" = "INFORMAT" %then %do;
503 converted = input(char, 32.);
504 %end;
505 if converted ne numeric then do;
506 put "ERROR: Conversion failed: " numeric= char= converted=;
507 end;
508 run;
509
510 %mend;
511
512 %charToNum(in = testdata, method = MULT-BY-ONE, obs = &obs);
NOTE: Character values have been converted to numeric values at the places given by:
(Line):(Column).
3:20
NOTE: There were 50000000 observations read from the data set WORK.TESTDATA.
NOTE: The data set WORK.CONVERTED has 50000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 7.83 seconds
user cpu time 5.92 seconds
system cpu time 1.88 seconds
memory 14642.84k
OS Memory 31036.00k
Timestamp 12/10/2019 10:36:18 AM
Step Count 52 Switch Count 0
513 %charToNum(in = testdata, method = INFORMAT, obs = &obs);
NOTE: There were 50000000 observations read from the data set WORK.TESTDATA.
NOTE: The data set WORK.CONVERTED has 50000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 7.58 seconds
user cpu time 5.36 seconds
system cpu time 2.15 seconds
memory 14646.18k
OS Memory 31036.00k
Timestamp 12/10/2019 10:36:26 AM
Step Count 53 Switch Count 0
If you want to keep only the digits, use the code below.
Using compress this way, only the digits in the string are kept.
The first parameter is the variable name. The second is optional, in this case the characters to be kept. The third is "k", which means keep.
data want;
set have;
numeric_var = input(compress(string_var,"0123456789","k"), best8.);
run;
I have come across a scenario where _NULL_ seems to be preventing a data step from executing.
Can someone please have a look and confirm why this is happening?
Ran this in SAS EG :
/*create a TEMP1 table*/
data TEMP1;
input Name $ age score;
cards;
A 10 100
B . 20
C 20 .
D . .
;
run;
/* step to overwrite WORK.TEMP1 dots with 0 */
DATa _NULL_;
SET TEMP1;
file print;
array a1 _numeric_;
do over a1;
if a1=. then a1=0;
end;
run;
The expectation is that all numeric fields containing a dot will be overwritten with 0.
It only works when DATA _NULL_ is replaced with DATA TEMP1.
A bit of a conundrum.
Here are some comments that may help. Basically, as others have indicated, _NULL_ does not create an output data set, so your assumption there is incorrect.
You're also using FILE incorrectly, I suspect, but I don't know what you're trying to do with that statement.
You are also using a DO OVER loop, which has been deprecated since SAS V7, so you shouldn't use it in production code.
DATa _NULL_;*_Null_ means no output data set is created;
SET TEMP1; *input data set means temp1;
file print; *routes PUT output to the standard print (listing) destination -- not clear what you intend it for here;
array a1 _numeric_; *creates an array of all numeric values;
do over a1; *Do over is deprecated as of 20 years ago, it works but I don't recommend using it in production code;
if a1=. then a1=0; *replaces missing with 0;
end;*ends loop;
*no put statements, so nothing is written to the file PRINT;
run;
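For reference, a minimal sketch of a _NULL_ step that actually writes to FILE PRINT (the PUT statement is what produces the lines):
data _null_;
set temp1;
file print;
* PUT is what actually sends a line to the PRINT destination;
put name= age= score=;
run;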
You could fix it by doing this, but I don't recommend using the same data set name. It makes it hard to debug your code later on.
/* step to overwrite WORK.TEMP1 dots with 0 */
DATa TEMP1;
SET TEMP1;
array a1 _numeric_;
do over a1;
if a1=. then a1=0;
end;
run;
Here are two different ways to replace values in an existing table:
overwriting the entire table with a new copy of itself
data name; set name; …
modifying values in place within the existing table
data name; modify name; …
Example
1 data class;
2 set sashelp.class;
3 run;
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.CLASS has 19 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.17 seconds
cpu time 0.00 seconds
4
5 data class; /* output data set named is same as input data set */
6 set class;
7 age = age * 2;
8 run;
NOTE: There were 19 observations read from the data set WORK.CLASS.
NOTE: The data set WORK.CLASS has 19 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.03 seconds
9
10 data class; /* output data set name */
11 modify class; /* is same as modify name, values updated in place */
12 age = age / 2;
13 run; /* observations are rewritten (see log) */
NOTE: There were 19 observations read from the data set WORK.CLASS.
NOTE: The data set WORK.CLASS has been updated. There were 19 observations rewritten, 0
observations added and 0 observations deleted.
NOTE: DATA statement used (Total process time):
real time 0.05 seconds
cpu time 0.00 seconds
A third way would be to use an SQL UPDATE statement with SET expressions based on COALESCE; however, that is not amenable to array processing.
Proc SQL;
update mydata set
a1 = coalesce(a1, 0)
, a2 = coalesce(a2, 0)
…
;
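Applied concretely to the TEMP1 table created above, a minimal sketch (assuming you want every missing age and score replaced with 0):
Proc SQL;
update temp1 set
age = coalesce(age, 0)
, score = coalesce(score, 0)
;
quit;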
When you use data _NULL_ instead of data temp1, you only read from temp1, and your changes are written nowhere. That is not a conundrum; that is basic SAS functionality. Only use _NULL_ when you don't need data to be written.
I have a dataset that looks like:
Month Cost_Center Account Actual Annual_Budget
June 53410 Postage 13 234
June 53420 Postage 0 432
June 53430 Postage 48 643
June 53440 Postage 0 917
June 53710 Postage 92 662
June 53410 Phone 73 267
June 53420 Phone 103 669
June 53430 Phone 90 763
...
I would like to first sum the Actual and Annual_Budget columns, respectively, and then create a variable that flags whether the Actual, extrapolated for the entire year, is greater than the Annual_Budget column.
I have the following code:
Data Test;
set Combined;
%All_CC; /*MACRO TO INCLUDE ALL COST CENTERS*/
%Total_Other_Expenses;/*MACRO TO INCLUDE SPECIFIC Account Descriptions*/
Sum_Actual = sum(Actual);
Sum_Annual = sum(Annual_Budget);
Run_Rate = Sum_Actual*12;
if Run_Rate > Sum_Annual then Over_Budget_Alarm = 1;
run;
However, when I run this code, it does not sum by group, for example, this is the output I get:
Account_Description Sum_Actual Sum_Annual Run_Rate Over_Budget_Alarm
Postage 13 234 146
Postage 0 432 0
Postage 48 643 963 1
Postage 0 917 0
Postage 92 662 634 1
I'm looking for output where all the 'postage' are summed for Actual and Annual, leaving just one row of data.
Use PROC MEANS to summarize the data
Use a data step and IF/THEN statement to create your flags.
proc means data=have N SUM NWAY STACKODS;
class account;
var actual annual_budget;
ods output summary = summary_stats1;
output out = summary_stats2 N= SUM= / AUTONAME;
run;
data want;
set summary_stats2;
if actual_sum > annual_budget_sum then flag=1;
else flag=0;
run;
SAS DATA step behavior is quite complex ("About DATA Step Execution" in SAS Language Reference: Concepts). The default behavior, which you're seeing, is: at the end of each iteration (i.e. for each input row) the row is written to the output data set, and variables computed within the step are reset to missing (variables read with SET are retained, but your sums are recomputed from scratch on each row).
You can't expect to write Base SAS "intuitively" without spending a few days learning it first, so I recommend using PROC SQL, unless you have a reason not to.
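For example, a rough PROC SQL sketch of the aggregation you describe (assuming the column names from your post; the run rate and flag logic follow your data step):
proc sql;
create table test as
select account,
sum(actual) as sum_actual,
sum(annual_budget) as sum_annual,
(calculated sum_actual * 12 > calculated sum_annual) as over_budget_alarm
from combined
group by account;
quit;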
If you really want to aggregate in a data step, you have to use something called BY-group processing: after ensuring the input data set is sorted by the BY vars, you can use something like the following:
data Test (keep = Month Account Sum_Actual Sum_Annual /*...your Run_Rate and Over_Budget_Alarm...*/);
set Combined; /* the input table */
by Month Account; /* must be sorted by these */
retain Sum_Actual Sum_Annual; /* don't clobber for each input row */
if first.account then do; /* instead do it manually for each group */
Sum_Actual = 0;
Sum_Annual = 0;
end;
/* accumulate the values from each row */
Sum_Actual = sum(Sum_Actual, Actual);
Sum_Annual = sum(Sum_Annual, Annual_Budget);
/* Note that Sum_Actual = Sum_Actual+Actual; will not work if any of the input values is 'missing'. */
if last.account then do;
/* The group has been processed.
Do any additional processing for the group as a whole, e.g.
calculate Over_Budget_Alarm. */
output; /* write one output row per group */
end;
run;
Proc SQL can be very effective for this kind of aggregate examination. Without seeing what the macros do, I would say perform the run rate checks after outputting data set test.
You don't show rows for other months, but I must presume the annual_budget values are constant across all months -- if so, I don't see a reason to ever sum annual_budget; comparing anything to sum(annual_budget) is probably at the incorrect time scale and not useful.
From the data shown it's hard to tell whether you want to know any of these:
which (or if some) months had a run_rate that exceeded the annual_budget
which (or if some) months run_rate exceeded the balance of annual_budget (i.e. the annual_budget less the prior months expenditure)
Presume each row in test is for a single year/month/costCenter/account -- if not the underlying data would have to be aggregated to that level.
Proc SQL;
* retrieve presumed constant annual_budget values from data;
* this information might (should) already exist in another table;
* presume constant annual budget value at each cost center | account combination;
* distinct because there are multiple months with the same info;
create table annual_budgets as
select distinct Cost_Center, Account, Annual_Budget
from test;
create table account_budgets as
select account, sum(annual_budget) as annual_budget
from annual_budgets
group by account;
* flag for some run rate condition;
create table annual_budget_mon_runrate_check as
select
2019 as year,
account,
sum(actual) as yr_actual, /* across all month/cost center */
min (
select annual_budget from account_budgets as inner
where inner.account = outer.account
) as account_budget,
max (
case when actual * 12 > annual_budget then 1 else 0 end
) as excessive_runrate_flag label="At least one month had a cost center run rate that would exceed its annual_budget"
from
test as outer
group by
year, account;
quit;
You can add a where clause to restrict the accounts processed.
Changing the max to sum in the flag computation would return the number of cost center months with excessive run rates.
I would like to create one macro variable called 'currency_rate' which holds the correct value depending on the conditions stipulated
(either 'new_rate' or a static value of 1.5):
%macro MONEY;
%Do i=1 %to 5;
data get_currency_&i (keep=month code new_rate currency_rate);
set Table1;
if month = &i and code = 'USD' then currency_rate=new_rate;
else currency_rate=1.5;
run;
data _null_;
set get_currency_&i;
if month = &i and code = 'USD' then currency_rate=new_rate;
else currency_rate=1.5;
call symput ('currency_rate', ???);
run;
%End;
%mend MONEY;
%MONEY
I am happy with the do loop and the first data step. It is the call symput I am stuck on. Is call symput the correct function to use to assign two possible values to one macro variable?
A snippet example of the way I will be using 'currency_rate' in a proc sql:
t1.income/&currency_rate.
I am a beginner level SAS user, any guidance would be great!
Thanks
Let's simulate your case. Suppose we have 3 datasets, as shown below -
data get_currency_1;
input month code $ new_rate currency_rate;
cards;
1 USD 2 2
2 CHF 2 1.5
3 GBP 1 1.5
;
data get_currency_2;
input month code $ new_rate currency_rate;
cards;
1 USD 3 1.5
2 USD 4 4
3 JPY 0.5 1.5
;
data get_currency_3;
input month code $ new_rate currency_rate;
cards;
1 USD 1 1.5
2 USD 3 1.5
3 USD 2.5 2.5
;
Now, let's run your code where we assign a value to currency_rate.
Let i=1. So the dataset get_currency_1 will be accessed. As the step runs, each row is read in turn and the value of currency_rate is assigned to the macro variable currency_rate; this continues until the end of the data step. At that point, the last value of currency_rate read is the final value of the macro variable currency_rate, because beyond that the step ends.
%let i=1; /*Let's assign 1 to i*/
data _null_;
set get_currency_&i;
if month = &i and code = 'USD' then currency_rate=new_rate;
else currency_rate=1.5;
call symput ('currency_rate', currency_rate);
run;
%put Currency rate is: &currency_rate;
Currency rate is: 1.5
Let i=3:
%let i=3; /*Let's assign 3 to i*/
data _null_;
set get_currency_&i;
if month = &i and code = 'USD' then currency_rate=new_rate;
else currency_rate=1.5;
call symput ('currency_rate', currency_rate);
run;
%put Currency rate is: &currency_rate;
Currency rate is: 2.5
You cannot have multiple values on one macro variable.
You say you are a beginner, so the best course of action is to avoid macro programming at this point. You would be better served learning about where, merge (or join) and by statements.
You state you will need to use a currency_rate in a statement such as
t1.income / &currency_rate.
The t1. to me suggests t1 is an alias in a SQL join and thus the far more likely scenario is that you need to left join table t1 that contains incomes with table1 (call it monthly_datum) that contains the monthly currency rates.
select
t1.income / coalesce(monthly_datum.currency_rates, 1.5)
, …
from
income_data as t1
left join
monthly_datum
on t1.month = monthly_datum.month
The rate of 1.5 would be used when the income is associated with a month that is not present in monthly_datum.
A macro variable can only hold a single value.
Since you're only ever assigning a single value, you can easily use CALL SYMPUTX().
call symputx('currency_rate', currency_rate);
But if your data has more than one row, then the value will be the last value set in the data set.
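Putting that together with your logic, a minimal sketch of the second step (it will still end up holding whatever value the last row read produced):
data _null_;
set get_currency_&i;
if month = &i and code = 'USD' then currency_rate = new_rate;
else currency_rate = 1.5;
* SYMPUTX trims blanks and handles the numeric-to-character conversion;
call symputx('currency_rate', currency_rate);
run;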
To my disappointment, the following code, which sums up 'value' by week from 'master' for the weeks that appear in 'transaction', does not work:
data master;
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input change_week ;
datalines;
1
3
;
run;
data _null_;
set transaction;
do until(done);
set master end=done;
where week=change_week;
sum = sum(value, sum);
end;
file print;
put week= sum=;
run;
SAS complains, rightly, because it doesn't see 'change_week' in master and does not know how to operate on it.
Surely there must be a way of doing some operation on a subset of a master set (of course, suitably indexed), given a transaction dataset... Does anyone know?
I believe this is the closest answer to what the asker has requested.
This method uses an index on week on the large dataset, allowing for the possibility of invalid week values in the transaction dataset, and without requiring either dataset to be sorted in any particular order. Performance will probably be better if the master dataset is in week order.
For small transaction datasets, this should perform quite a lot better than the other solutions as it only retrieves the required observations from the master dataset. If you're dealing with > ~30% of the records in the master dataset in a single transaction dataset, Quentin's method may sometimes perform better due to the overhead of using the index.
data master(index = (week));
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input week ;
datalines;
1
3
4
;
run;
data _null_;
set transaction;
file print;
do until(done);
set master key = week end=done;
/*Prevent implicit retain from previous row if the key isn't found,
or we've read past the last record for the current key*/
if _IORC_ ne 0 then do;
_ERROR_ = 0;
call missing(value);
end;
else sum = sum(value, sum);
end;
put week= sum=;
run;
N.B. for this to work, the indexed variable in the master dataset must have exactly the same name and type as the variable in the transaction dataset. Also, the index must be of the non-unique variety in order to accommodate multiple rows with the same key value.
Also, it is possible to replace the set master... statement with an equivalent modify master... statement if you want to apply transactional changes directly, i.e. without SAS making a massive temp file and replacing the original.
You are correct, there are many ways to do this in SAS. Your example is inefficient because (once we got it working) it would still require a full read of "master" for every line of "transaction".
(The reason you got the error is that you used where instead of if. In SAS, the sub-setting where in a data step is only aware of columns that already exist within the data set it's sub-setting. Both options are kept because where is faster when it's usable.)
An alternative solution would be to use proc sql. Hopefully this example is self-explanatory:
proc sql;
select
a.change_week,
sum(b.value) as value
from
transaction as a,
master as b
where a.change_week = b.week
group by change_week;
quit;
I don't suggest the solution below (I like Jeff's SQL solution, or even a hash, better). But just for playing with data step logic, I think the approach below would work, if you trust that every key in transaction will exist in master. It relies on the fact that both datasets are sorted, so it makes only one pass through each dataset.
On the first iteration of the DATA step, it reads the first record from the transaction dataset, then keeps reading through the master dataset until it has found all the matching records for that key; then the DATA step loop iterates and does the same for the next transaction record.
1003 data _null_;
1004 set transaction;
1005 by change_week;
1006
1007 do until(last.week and _found);
1008 set master;
1009 by week;
1010
1011 if week=change_week then do;
1012 sum = sum(value, sum);
1013 _found=1;
1014 end;
1015 end;
1016
1017 *file print;
1018 put week= sum= ;
1019 run;
week=1 sum=60
week=3 sum=75