use format name as function input in proc fcmp - sas

I have two different formats defined.
proc format;
value fmtA
1 = 3
2 = 5
;
value fmtB
1 = 2
2 = 4
;
run;
The function myfun returns the formatted value:
proc fcmp outlib=WORK.pac.funcs;
function myfun(n);
val = put(n,fmta.);
return (val);
endsub;
run;
I want to make this more dynamic: the format used to compute val should be chosen by a function argument.
EDIT
proc fcmp outlib=WORK.pac.funcs;
function myfun(n,myfmt $);
if myfmt = 'fmtA' then val = put(n,fmtA.);
else if myfmt = 'fmtB' then val = put(n,fmtB.);
else val = n;
return (val);
endsub;
run;
data test;
n = 2;
myfmt = 'fmtA';
output;
myfmt = 'fmtB';
output;
myfmt = 'fmtC';
output;
run;
data test2;
set test;
/* try to do sth like this */
value = myfun(n,myfmt);
run;
This solution works. However, it requires a long chain of checks when there are many different formats, and I cannot write that chain until I have looked at the format names in the input test dataset.

Just use the PUTN() function, which takes the format name as a character value at run time.
proc format;
value fmtA 1 = '3' 2 = '5' ;
value fmtB 1 = '2' 2 = '4' ;
value fmtc 1 = '1' 2 = '3' ;
run;
data test;
do myfmt = 'fmtA.','fmtB.','fmtC.';
do n= 1,2;
str = putn(n,myfmt);
value = input(str,32.);
output;
end;
end;
run;
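If the format names are stored without the trailing period, as in the question's test dataset, the period can be appended before the PUTN() call. The following is a minimal sketch, assuming all three formats exist and are defined with character labels as above; it replaces the call to myfun with a direct PUTN() call in the DATA step.

```sas
proc format;
  value fmtA 1 = '3' 2 = '5';
  value fmtB 1 = '2' 2 = '4';
  value fmtC 1 = '1' 2 = '3';
run;

/* Format names stored WITHOUT the trailing period, as in the question */
data test;
  length myfmt $32;
  n = 2;
  do myfmt = 'fmtA', 'fmtB', 'fmtC';
    output;
  end;
run;

data test2;
  set test;
  /* cats() appends the period so PUTN sees a complete format specification; */
  /* input() converts the formatted text back to a number                    */
  value = input(putn(n, cats(myfmt, '.')), 32.);
run;
```

With n = 2, the three rows of test2 should get value = 5, 4 and 3 respectively.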

How do I select the first 5 observations with regard to duplicates? [closed]

I have a large dataset containing over 80 000 000 rows, sorted by "name" and "income" (with duplicates for both name and income). For the first name I would like to have the 5 lowest incomes. For the second name I would like to have the 5 lowest incomes, but incomes already selected for the first name are disqualified from being selected again. And so on, until the last name (if there are any incomes left at that point).
You first want to rank income within names. So:
proc rank data=yourdata out=temp ties=low;
by name;
var income;
ranks incomerank;
run;
Then you want to filter the 5 lowest incomes by name, so:
proc sql;
create table want as
select distinct *
from temp
where incomerank < 6;
quit;
You will need to sort and track incomes:
Use an array to sort and track the five lowest incomes of a name.
Use a hash to record each income that has been output, so that it is ineligible for output by later names.
Example:
An insertion sort of the eligible low-valued incomes is used; it is fast because only 5 items are kept.
data have;
call streaminit(1234);
do name = 1 to 1e6;
do seq = 1 to rand('integer', 20);
income = rand('integer', 20000, 1000000);
output;
end;
end;
run;
data
want (label='Lowest 5 incomes (first occurring over all names) of each name')
want_barren(keep=name label='Names whose all incomes were previously output for earlier names')
;
array X(5) _temporary_;
if _n_ = 1 then do;
if 0 then set have;
declare hash incomes();
incomes.defineKey('income');
incomes.defineDone();
end;
_maxmin5 = 1e15;
x(1) = 1e15;
x(2) = 1e15;
x(3) = 1e15;
x(4) = 1e15;
x(5) = 1e15;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
if incomes.check() = 0 then continue;
* insert sort - lowest five not observed previously;
if income > _maxmin5 then continue;
do _i_ = 1 to 5;
if income < x(_i_) then do;
do _j_ = 5 to _i_+1 by -1;
x(_j_) = x(_j_-1);
end;
x(_i_) = income;
_maxmin5 = x(5);
incomes.add();
leave;
end;
end;
end;
_outflag = 0;
do _n_ = 1 to _n_;
set have;
if income in x then do;
_outflag = 1;
OUTPUT want;
end;
end;
if not _outflag then
OUTPUT want_barren;
drop _:;
run;
data have;
do n = 1 to 8e5;
do _N_ = 1 to 100;
income = ceil(rand('uniform') * 1e4);
address = cats('Address_', _N_);
output;
end;
end;
run;
data want(drop=c);
if _N_ = 1 then do;
dcl hash h(dataset : 'have(obs=0)', ordered : 'a', multidata : 'y');
h.definekey('income');
h.definedata(all : 'y');
h.definedone();
dcl hiter i('h');
dcl hash inc();
inc.definekey('income');
inc.definedone();
end;
do until (last.n);
set have;
by n;
h.add();
end;
do c = 0 by 0 while (i.next() = 0);
if inc.add() = 0 then do;
c + 1;
output;
end;
if c = 5 then leave;
end;
_N_ = i.first();
_N_ = i.prev();
h.clear();
run;
Here is my interpretation of your problem and a solution.
Suppose a simplified version of your data looks like this and you want the 2 lowest income for each name. For simplicity, I use a numeric variable n as name, but a character var will work as well.
data have;
input n income;
datalines;
1 100
1 200
1 300
2 400
2 100
2 500
3 600
3 200
3 500
;
From this data, my guess is that your logic goes like this:
Start with n = 1.
Output the 2 observations with the lowest income (100 and 200)
Go to the next name (n=2).
Output the 2 observations with the lowest income that have not already been output (400 and 500). 100 has already been output in the n=1 group.
...And so on...
This gives the desired result below:
data want;
input n income;
datalines;
1 100
1 200
2 400
2 500
3 600
;
Try out the solution below and verify that you get the result as posted above.
data want(drop=c);
if _N_ = 1 then do;
dcl hash h(ordered : 'a', multidata : 'y');
h.definekey('income');
h.definedone();
dcl hiter i('h');
dcl hash inc();
inc.definekey('income');
inc.definedone();
end;
do until (last.n);
set have;
by n;
h.add();
end;
do c = 0 by 0 while (i.next() = 0);
if inc.add() = 0 then do;
c + 1;
output;
end;
if c = 2 then leave;
end;
_N_ = i.first();
_N_ = i.prev();
h.clear();
run;
Finally, let us create representative example data with 80 million observations. I change the if c = 2 then leave; statement to if c = 5 then leave; to go back to your actual problem.
The code below runs in about 45 seconds on my system and processes the data in a single pass. Let me know if it works for you :-)
data have;
do n = 1 to 8e5;
do _N_ = 1 to 100;
income = ceil(rand('uniform') * 1e4);
output;
end;
end;
run;
data want(drop=c);
if _N_ = 1 then do;
dcl hash h(ordered : 'a', multidata : 'y');
h.definekey('income');
h.definedone();
dcl hiter i('h');
dcl hash inc();
inc.definekey('income');
inc.definedone();
end;
do until (last.n);
set have;
by n;
h.add();
end;
do c = 0 by 0 while (i.next() = 0);
if inc.add() = 0 then do;
c + 1;
output;
end;
if c = 5 then leave;
end;
_N_ = i.first();
_N_ = i.prev();
h.clear();
run;

how to create variables that names are concat with two array variable names

I have an HCC dataset, DATA_HCC, with a member ID and 79 binary variables:
Member_ID HCC1 HCC2 HCC6 HCC8 ... HCC189
XXXXXXX1 1 0 1 0 ... 0
XXXXXXX2 0 0 1 0 ... 0
XXXXXXX3 0 1 0 0 ... 1
I am trying to create an output dataset with a new binary variable for every pairwise combination of those 79 variables. Each new variable indicates whether a member has both of the original variables equal to 1.
%LET hccList = HCC1 HCC2 HCC6 HCC8 HCC9 HCC10 HCC11 HCC12 HCC17 HCC18 HCC19 HCC21 HCC22 HCC23 HCC27
HCC28 HCC29 HCC33 HCC34 HCC35 HCC39 HCC40 HCC46 HCC47 HCC48 HCC54 HCC55 HCC57 HCC58
HCC70 HCC71 HCC72 HCC73 HCC74 HCC75 HCC76 HCC77 HCC78 HCC79 HCC80 HCC82 HCC83 HCC84
HCC85 HCC86 HCC87 HCC88 HCC96 HCC99 HCC100 HCC103 HCC104 HCC106 HCC107 HCC108 HCC110
HCC111 HCC112 HCC114 HCC115 HCC122 HCC124 HCC134 HCC135 HCC136 HCC137 HCC157 HCC158
HCC161 HCC162 HCC166 HCC167 HCC169 HCC170 HCC173 HCC176 HCC186 HCC188 HCC189;
DATA COUNT_HCC; SET DATA_HCC;
ARRAY HCC [*] &hccList.;
DO i = 1 TO DIM(HCC);
DO j = i+1 TO DIM(HCC);
%LET HCC_COMBO = CATX('_', VARNAME(HCC[i]), VARNAME(HCC[j]));
&HCC_COMBO. = MIN(HCC[i], HCC[j]);
END;
END;
RUN;
I tried to use the CATX function to concatenate the two variable names, but it didn't work.
Here is the log error that I got:
ERROR: Undeclared array referenced: CATX.
ERROR: Variable CATX has not been declared as an array.
ERROR 71-185: The VARNAME function call does not have enough arguments.
The output should look like this:
Member_ID HCC1_HCC2 HCC1_HCC6 HCC1_HCC8 ... HCC188_HCC189
XXXXXXX1 0 1 0 ... 0
XXXXXXX2 0 0 0 ... 0
XXXXXXX3 0 0 0 ... 1
To achieve dynamic variable name generation, use a macro to create the variables that you need. The below code generates dynamic variable names and generates data step code to create the variables.
%macro get_hcc_combo_mins;
%do i = 1 %to %sysfunc(countw(&hccList.));
%do j = %eval(&i.+1) %to %sysfunc(countw(&hccList.));
%let hcc1 = %scan(&hccList., &i.);
%let hcc2 = %scan(&hccList., &j.);
&hcc1._&hcc2. = min(&hcc1., &hcc2.);
%end;
%end;
%mend;
DATA COUNT_HCC; SET DATA_HCC;
ARRAY HCC [*] &hccList.;
%get_hcc_combo_mins;
RUN;
The macro %get_hcc_combo_mins generates this code in the data step:
HCC1_HCC2 = min(HCC1, HCC2);
HCC1_HCC6 = min(HCC1, HCC6);
HCC1_HCC8 = min(HCC1, HCC8);
...
There may be other ways to do this all within one data step that I'm not aware of, but macros can get the job done.
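To see what the macro generates, it can help to run it on a very small case with MPRINT turned on. The sketch below uses a hypothetical three-variable list and two made-up members (neither is from the original data); the macro body is the same as above, with a %local statement added as a precaution.

```sas
/* Hypothetical three-variable list and two-member dataset, for illustration only */
%let hccList = HCC1 HCC2 HCC6;

data DATA_HCC;
  input Member_ID $ HCC1 HCC2 HCC6;
  datalines;
XXXXXXX1 1 0 1
XXXXXXX2 0 1 1
;
run;

%macro get_hcc_combo_mins;
  %local i j hcc1 hcc2;
  %do i = 1 %to %sysfunc(countw(&hccList.));
    %do j = %eval(&i. + 1) %to %sysfunc(countw(&hccList.));
      %let hcc1 = %scan(&hccList., &i.);
      %let hcc2 = %scan(&hccList., &j.);
      /* one assignment statement is generated per pair */
      &hcc1._&hcc2. = min(&hcc1., &hcc2.);
    %end;
  %end;
%mend;

options mprint;  /* echo the generated statements to the log */

data COUNT_HCC;
  set DATA_HCC;
  %get_hcc_combo_mins
run;

proc print data=COUNT_HCC;
run;
```

For this toy input, the first member should get HCC1_HCC2 = 0, HCC1_HCC6 = 1, HCC2_HCC6 = 0, and the second member 0, 0, 1.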
A DATA step with CALL LEXCOMB can generate the variable name pairs. CALL EXECUTE then submits a statement using those names.
Example:
Presume the variable names start with HCC, but which specific ones exist is not known a priori.
data have;
call streaminit(1234);
do id = 1 to 100;
array hcc hcc1 hcc3 hcc5 hcc7 hcc10-hcc79 hcc150 hcc155 hcc180 hcc190-hcc191;
do over hcc;
hcc = rand('uniform', dim(hcc)) < _i_;
end;
output;
end;
run;
data _null_;
set have;
array hcc hcc:;
do _n_ = 1 to dim(hcc);
hcc(_n_) = _n_;
end;
call execute("data pairwise; set have;");
do _n_ = 1 to comb(dim(hcc),2);
call lexcomb(_n_, 2, of hcc(*));
index1 = hcc(1);
index2 = hcc(2);
name1 = vname(hcc(index1));
name2 = vname(hcc(index2));
put name1=;
call execute (cats(
catx( '_',name1,name2),
'=',
catx(' and ',name1,name2),
';'
));
end;
call execute('run;');
stop;
run;
See if you can use this as a template.
/* Example data */
data have (drop = i j);
array h {*} HCC1 HCC2 HCC6 HCC8 HCC9 HCC10 HCC11 HCC12 HCC17 HCC18 HCC19 HCC21 HCC22 HCC23 HCC27
HCC28 HCC29 HCC33 HCC34 HCC35 HCC39 HCC40 HCC46 HCC47 HCC48 HCC54 HCC55 HCC57 HCC58
HCC70 HCC71 HCC72 HCC73 HCC74 HCC75 HCC76 HCC77 HCC78 HCC79 HCC80 HCC82 HCC83 HCC84
HCC85 HCC86 HCC87 HCC88 HCC96 HCC99 HCC100 HCC103 HCC104 HCC106 HCC107 HCC108 HCC110
HCC111 HCC112 HCC114 HCC115 HCC122 HCC124 HCC134 HCC135 HCC136 HCC137 HCC157 HCC158
HCC161 HCC162 HCC166 HCC167 HCC169 HCC170 HCC173 HCC176 HCC186 HCC188 HCC189;
do i = 1 to 10;
do j = 1 to dim (h);
h [j] = rand('uniform') > .5;
end;
output;
end;
run;
/* Create long version of output data */
data temp (drop = i j);
set have;
array a {*} HC:;
do i = 1 to dim (a)-1;
do j = i+1 to dim (a);
v = catx('_', vname (a[i]), vname (a[j]));
d = a [i] * a [j];
n = _N_;
output;
end;
end;
run;
/* Transpose to wide format */
proc transpose data=temp out=temp2 (drop=_: n);
by n;
id v;
var d;
run;
/* Merge back with original data */
data want;
merge have temp2;
run;

Automate check for number of distinct values SAS

Looking to automate some checks and print some warnings to a log file. I think I've gotten the general idea but I'm having problems generalising the checks.
For example, I have two datasets my_data1 and my_data2. I wish to print a warning if nobs_my_data2 < nobs_my_data1. Additionally, I wish to print a warning if the number of distinct values of the variable n in my_data2 is less than 11.
Some dummy data and an attempt of the first check:
%LET N = 1000;
DATA my_data1(keep = i u x n);
a = -1;
b = 1;
max = 10;
do i = 1 to &N - 100;
u = rand("Uniform"); /* decimal values in (0,1) */
x = a + (b-a) * u; /* decimal values in (a,b) */
n = floor((1 + max) * u); /* integer values in 0..max */
OUTPUT;
END;
RUN;
DATA my_data2(keep = i u x n);
a = -1;
b = 1;
max = 10;
do i = 1 to &N;
u = rand("Uniform"); /* decimal values in (0,1) */
x = a + (b-a) * u; /* decimal values in (a,b) */
n = floor((1 + max) * u); /* integer values in 0..max */
OUTPUT;
END;
RUN;
DATA _NULL_;
FILE "\\filepath\log.txt" MOD;
SET my_data1 NOBS = NOBS1 my_data2 NOBS = NOBS2 END = END;
IF END = 1 THEN DO;
PUT "HERE'S A HEADER LINE";
END;
IF NOBS1 > NOBS2 AND END = 1 THEN DO;
PUT "WARNING!";
END;
IF END = 1 THEN DO;
PUT "HERE'S A FOOTER LINE";
END;
RUN;
How can I set up the check for the number of distinct values of n in my_data2?
A proc sql way to do it -
%macro nobsprint(tab1,tab2);
options nonotes; *suppresses all notes;
proc sql;
select count(*) into:nobs&tab1. from &tab1.;
select count(*) into:nobs&tab2. from &tab2.;
select count(distinct n) into:distn&tab2. from &tab2.;
quit;
%if &&nobs&tab2. < &&nobs&tab1. %then %put |WARNING! &tab2. has less recs than &tab1.|;
%if &&distn&tab2. < 11 %then %put |WARNING! distinct VAR n count in &tab2. less than 11|;
options notes; *overrides the previous option;
%mend nobsprint;
%nobsprint(my_data1,my_data2);
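This would break if you have to specify libnames with the datasets, because the period in a two-level name (libref.table) cannot appear in the macro variable names that are built from &tab1. and &tab2. You can also use proc printto with the log= option to write the log to a file.
To print only the %put messages, wrap the macro call like this -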
filename mylog temp;
proc printto log=mylog; run;
options nomprint nomlogic;
%nobsprint(my_data1,my_data2);
proc printto; run;
This won't print any extraneous text to the SAS log other than your custom warnings.
@samkart provided perhaps the most direct, easily understood way to compare the obs counts. Another consideration is performance: if your data sets have millions of obs, you can get the counts without reading the entire data sets.
One method is to use the nobs= option in the set statement like you did in your code, but your code reads the data sets unnecessarily. Because nobs= is assigned at compile time, the values are available before any set statement executes, so the following gets the counts and compares them without reading any observations.
data _null_;
if nobs1 ne nobs2 then putlog 'WARNING: Obs counts do not match.';
stop;
set sashelp.cars nobs=nobs1;
set sashelp.class nobs=nobs2;
run;
WARNING: Obs counts do not match.
Another option is to get the counts from sashelp.vtable or dictionary.tables. Note that you can only query dictionary.tables with proc sql.
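As a rough sketch of the dictionary.tables route combined with the distinct-value check from the question: the code below assumes both data sets live in WORK and reuses the file path from the question. The macro variable names nobs1, nobs2 and distn2 are made up for illustration. Note that the nobs column counts physical observations, and that counting distinct values still requires reading my_data2.

```sas
proc sql noprint;
  /* Observation counts come from metadata, so the tables are not read */
  select nobs into :nobs1 trimmed
    from dictionary.tables
    where libname = 'WORK' and memname = 'MY_DATA1';
  select nobs into :nobs2 trimmed
    from dictionary.tables
    where libname = 'WORK' and memname = 'MY_DATA2';
  /* The distinct count does need a pass over the data */
  select count(distinct n) into :distn2 trimmed
    from my_data2;
quit;

data _null_;
  file "\\filepath\log.txt" mod;
  put "HERE'S A HEADER LINE";
  if &nobs2 < &nobs1 then
    put "WARNING! my_data2 has fewer observations than my_data1";
  if &distn2 < 11 then
    put "WARNING! fewer than 11 distinct values of n in my_data2";
  put "HERE'S A FOOTER LINE";
run;
```

If either table does not exist, the corresponding macro variable is never created and the DATA step fails, so a production version would want to check for that first.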

Find the next non blank value in SAS

I am trying to linearly interpolate values within a panel data set. To do that, I need to find the next non-missing value of a variable whenever its current value is ".".
For example, if X = {1, 2, ., ., ., 7}, I want to store 7 in a variable "Y" and subtract the lag value of X from it to get the numerator of the slope. Can anyone help with this step?
If you cannot transpose your data, here is a way that will work for your given example:
data test;
input id $3. x best12.;
datalines;
AAA 1
BBB 2
CCC .
DDD .
EEE .
FFF 7
;
run;
data test2;
set test;
n = _n_;
if x ne .;
run;
data test3;
set test2;
lagx = lag(x);
lagn = lag(n);
if _n_ > 1 and n ne lagn + 1 then do;
postiondiff = n - lagn;
valuediff = x - lagx;
do i = (lagx + ((x-lagx)/(n-lagn))) to x by ((x-lagx)/(n-lagn));
x = i;
output;
end;
end;
else output;
keep x;
run;
data test4;
merge test test3 (rename = (x=newx));
run;
So we are basically rebuilding the variable with the interpolated values, then merging it back into the original dataset without a by variable, which lines up the new interpolated data with the missing points.
Is there a way you could transpose all your data? Interpolating like that is much easier when all the data you need is in a single observation. Like this:
data test;
input x best12.;
datalines;
1
2
.
.
.
7
;
run;
proc transpose data = test
out = test2;
run;
data test3;
set test2;
array xvalues {*} COL1-COL6;
array interpol {4,10} begin1-begin10 end1-end10 begposition1-begposition10 endposition1-endposition10;
rangenum = 1;
* Find the endpoints of the missing ranges;
do i = 1 to dim(xvalues);
if xvalues{i} ne . then lastknownx = xvalues{i};
else do;
interpol{1,rangenum} = lastknownx;
if interpol{3,rangenum} = . then interpol{3,rangenum} = i - 1;
end;
if i > 1 and xvalues{i} ne . then do;
if xvalues{i-1} = . then do;
interpol{2,rangenum} = xvalues{i};
interpol{4,rangenum} = i;
rangenum = rangenum + 1;
end;
end;
end;
* Interpolate;
rangenum = 1;
do j = 1 to dim(xvalues);
if xvalues{j} = . then do;
xvalues{j} = interpol{1,rangenum} + (j-interpol{3,rangenum})*((interpol{2,rangenum}-interpol{1,rangenum})/(interpol{4,rangenum}-interpol{3,rangenum}));
end;
else if j > 1 and xvalues{j} ne . then do;
if xvalues{j-1} = . then rangenum = rangenum + 1;
end;
end;
keep col1-col6;
run;
That can handle up to 10 different missing ranges per observation, though you could tweak the code to handle many more than that by creating bigger arrays.
The SAS data step reads a dataset one record at a time from top to bottom. So at record i, it can't access i+1 because it hasn't read it yet; it can only access i-1. Assume you have a dataset with a variable x.
data intrpl;
retain _x;
set yourdata;
by x notsorted;
if not missing(x) then do;
_x = x;
if last.x then do;
slope = _x - lag(_x);
output;
end;
end;
run;
Transposing can get kind of messy if x takes on a lot of values, so I recommend this method. I hope it helps!
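Since the DATA step cannot look ahead, one common workaround is to make two passes: read the data in reverse order to carry the next non-missing value backward, then read it forward to carry the previous one and interpolate. The following is only a sketch under simplifying assumptions: a single series with no panel id, and the helper names n, next_x, next_n, prev_x, prev_n and y are all made up for illustration.

```sas
data have;
  input x;
  n = _n_;               /* row position, used as the x-axis */
  datalines;
1
2
.
.
.
7
;
run;

/* Pass 1: reverse order, carry the NEXT non-missing value backward */
proc sort data=have out=rev;
  by descending n;
run;

data back;
  set rev;
  retain next_x next_n;
  if not missing(x) then do;
    next_x = x;
    next_n = n;
  end;
run;

proc sort data=back;
  by n;
run;

/* Pass 2: forward order, carry the PREVIOUS non-missing value and interpolate */
data want;
  set back;
  retain prev_x prev_n;
  if not missing(x) then do;
    prev_x = x;
    prev_n = n;
    y = x;
  end;
  else if not missing(prev_x) and not missing(next_x) then
    y = prev_x + (n - prev_n) * (next_x - prev_x) / (next_n - prev_n);
run;
```

For the example series, y should come out as 1, 2, 3.25, 4.5, 5.75, 7. Leading or trailing missing values stay missing because one of the two endpoints is unavailable. For panel data, the sorts and the retained values would need to respect the panel id as a BY variable.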

symget - a list of macro variables

I'm trying to reproduce the code found here, specifically on page 7:
http://www.nesug.org/proceedings/nesug04/pm/pm13.pdf
/* set up example*/
%let var_1 = 'abc';
%let var_2 = 'def';
%let var_3 = 'ghi';
%let val_1 = 1.5;
%let val_2 = 3;
%let val_3 = 4.5;
/* use symget to create a list of var names and values */
data scores;
length var_name $32 value 8.;
do _N_ = 1 to 3;
var_name = symget('var_' || left(_N_));
value = symget('val_' || left(_N_));
end;
run;
However, the end result that I'm getting is only the last variable, not all 3:
var_name value
ghi 4.5
I want:
var_name value
abc 1.5
def 3
ghi 4.5
Why isn't this working?
You are missing an output statement to write each row. Insert it here:
do _N_ = 1 to 3;
var_name = symget('var_' || left(_N_));
value = symget('val_' || left(_N_));
output;
end;
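For reference, a complete version with the output statement in place might look like the sketch below. Two changes are my own assumptions rather than part of the original code: the macro values are written without quotes so that var_name comes out as abc rather than 'abc', and cats() plus an explicit input() are used so that the numeric-to-character and character-to-numeric conversions are explicit.

```sas
/* Values without quotes (an assumption; the original stores the quotes too) */
%let var_1 = abc;
%let var_2 = def;
%let var_3 = ghi;
%let val_1 = 1.5;
%let val_2 = 3;
%let val_3 = 4.5;

data scores;
  length var_name $32 value 8;
  do i = 1 to 3;
    /* cats() builds the macro variable name, e.g. var_1 */
    var_name = symget(cats('var_', i));
    /* symget returns text, so convert it to a number explicitly */
    value = input(symget(cats('val_', i)), best32.);
    output;   /* write one row per loop iteration */
  end;
  drop i;
run;
```

This should produce three rows: abc 1.5, def 3 and ghi 4.5.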