Comparing observations - compare

Suppose my dataset includes the following variables:
set obs 100
generate var1 = rnormal()
generate var2 = rnormal()
input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end
input double(id var5 var6)
1 10000 0.4
2 22000 0.55
3 25000 0.5
4 40000 1
end
I need to delete rows of ids that have an increased value of var5 and reduced value of var6 compared with at least one other id. In the first example, number 4 with 2028 and 17.396 should be deleted. In the second example, number 3 with 25000 and 0.5 should be deleted. After the elimination, the observations of the three variables should look like this:
1 1052 17.348
2 1288 17.378
3 1536 17.387
5 1810 17.402
6 2034 17.407
1 10000 0.4
2 22000 0.55
4 40000 1
while var1 and var2 should remain intact.
How can I do this?

This is very odd because you appear to say that you have a dataset with completely unrelated variables. You have an initial dataset of 100 observations with variables var1 and var2 and then a secondary dataset with 6 observations with variables var5 and var6. Your objective appears to be to remove observations, but only for values contained in variables var5 and var6. This looks like spreadsheet thinking as Stata only has a single dataset in memory at any given time.
Identifying the observations to drop requires comparing each observation that has values for var5 and var6 with every other such observation. This can be done in Stata by forming all pairwise combinations using the cross command.
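For readers who want to see the comparison rule on its own, here is a minimal Python sketch (not Stata) of the pairwise check. The function name ids_to_drop and the tuple layout are assumptions made for illustration; the data are the two examples from the question.

```python
from itertools import product

def ids_to_drop(rows):
    """Return ids whose var5 is higher and var6 lower than at least one other row.

    rows is a list of (id, var5, var6) tuples; this layout is an assumption.
    """
    drop = set()
    # Form every ordered pair of rows, the same idea as crossing the data with itself
    for (id_a, v5_a, v6_a), (id_b, v5_b, v6_b) in product(rows, repeat=2):
        if v5_a > v5_b and v6_a < v6_b:
            drop.add(id_a)
    return drop

example1 = [(1, 1052, 17.348), (2, 1288, 17.378), (3, 1536, 17.387),
            (4, 2028, 17.396), (5, 1810, 17.402), (6, 2034, 17.407)]
example2 = [(1, 10000, 0.4), (2, 22000, 0.55), (3, 25000, 0.5), (4, 40000, 1)]

print(ids_to_drop(example1))  # {4}
print(ids_to_drop(example2))  # {3}
```

Both results agree with the rows the question says should be deleted.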
Here's a solution that starts with data organized exactly as you presented it and separates the two datasets in order to perform the task of dropping the observations based on var5 and var6 values. Since the datasets appear completely unrelated, an unmatched merge is used to recombine the data.
clear
set obs 100
generate var1 = rnormal()
generate var2 = rnormal()
input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end
tempfile main
save "`main'"
* extract secondary dataset
keep id var5 var6
keep if !mi(id)
tempfile data2
save "`data2'"
* form all pairwise combinations
rename * =_0
cross using "`data2'"
* identify cases where there's an increase in var5 and decrease in var6
gen todrop = var5_0 > var5 & var6_0 < var6
* drop id if there's at least one case, reduce to original obs and vars
bysort id_0 (todrop): keep if !todrop[_N]
keep if id == id_0
keep id var5 var6
list
* now merge back with original data, use unmatched merge since
* secondary data is unrelated
sort id
tempfile newdata2
save "`newdata2'"
use "`main'", clear
drop id var5 var6
merge 1:1 _n using "`newdata2'", nogen

Here's one way to do this without separating the datasets. Identifying the observations to drop requires a double loop to make all pairwise comparisons. There is, however, no command in Stata to drop observations for just a few variables. In the following example, I switch to Mata to load the observations to keep, clear out the values, and then store the kept observations back into the Stata variables:
clear
set obs 100
generate var1 = rnormal()
generate var2 = rnormal()
input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end
* an observation index
gen obsid = _n if !mi(id)
* identify observations to drop
gen todrop = 0 if !mi(id)
sum obsid, meanonly
local n = r(N)
quietly forvalues i = 1/`n' {
forvalues j = 1/`n' {
replace id = . if var5[`i'] > var5[`j'] & var6[`i'] < var6[`j'] & _n == `i'
}
}
* take a trip to Mata to load the data to keep and store it back from there
mata:
// load data, ignore observations with missing values
X = st_data(., ("id","var5","var6"), 0)
// set all obs to missing
st_store(., ("id","var5","var6") ,J(st_nobs(),3,.))
// store non-missing values back into the variables
st_store((1,rows(X)), ("id","var5","var6") ,X)
end
drop obsid todrop
Alternatively, you can manually move up values by doing some observation index gymnastics:
clear
set obs 100
generate var1 = rnormal()
generate var2 = rnormal()
input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end
* an observation index
gen obsid = _n if !mi(id)
* identify observations to drop
gen todrop = 0 if !mi(id)
sum obsid, meanonly
local n = r(N)
quietly forvalues i = 1/`n' {
forvalues j = 1/`n' {
replace id = . if var5[`i'] > var5[`j'] & var6[`i'] < var6[`j'] & _n == `i'
}
}
* move observations up
local j 0
quietly forvalues i = 1/`n' {
if !mi(id[`i']) {
local ++j
replace id = id[`i'] in `j'
replace var5 = var5[`i'] in `j'
replace var6 = var6[`i'] in `j'
}
}
local ++j
replace id = . in `j'/l
replace var5 = . in `j'/l
replace var6 = . in `j'/l
drop obsid todrop
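The "move observations up" step in the last two listings is independent of Stata, so here is a small Python sketch of the same idea. It assumes that a flagged row is marked by a missing id (None here), as in the loops above; the function name compact is hypothetical.

```python
def compact(column_block):
    """Move rows without missing values to the top and pad the rest with None.

    column_block is a list of (id, var5, var6) rows; None marks a missing value.
    """
    kept = [row for row in column_block if None not in row]
    width = len(column_block[0]) if column_block else 0
    padding = [(None,) * width] * (len(column_block) - len(kept))
    return kept + padding

# The second row has been flagged (id set to missing); the last row was empty to begin with
block = [(1, 1052, 17.348), (None, 2028, 17.396), (5, 1810, 17.402), (None, None, None)]
print(compact(block))
```

The number of rows does not change, which is why the unrelated variables var1 and var2 are left intact.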

Stata: Using if with value labels

I faced an issue using if with value labels.
set obs 5
gen var1 = _n
label define l_var1 1 "cat1" 2 "cat1" 3 "cat2" 4 "cat3" 5 "cat3"
label val var1 l_var1
keep if var1=="cat3":l_var1
(4 observations deleted)
I expected 3 records to be deleted. How can I achieve this?
I am using Stata 16.1.
"cat3":l_var1 does not look up all values in l_var1 that corresponds to "cat3". It returns the first value that corresponds to the string "cat3".
So "cat3":l_var1 evaluates to 4, which means keep if var1=="cat3":l_var1 is equivalent to keep if var1==4, and therefore only one observation is kept.
See code below that shows this behavior. This is not the way you seem to want "cat3":l_var1 to behave, but this is how it behaves.
set obs 5
gen var1 = _n
label define l_var1 1 "cat1" 2 "cat1" 3 "cat2" 5 "cat3" 4 "cat3"
label val var1 l_var1
gen var2 = "cat3":l_var1
gen var3 = 1 if var1=="cat3":l_var1
This answers what is going on in your code. The code below is a better way to solve what you are trying to do.
set obs 5
gen var1 = _n
label define l_var1 1 "cat1" 2 "cat1" 3 "cat2" 5 "cat3" 4 "cat3"
label val var1 l_var1
decode var1, generate(var_str)
keep if var_str == "cat3"
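As an analogy only, the difference between the two approaches can be sketched in Python. The dictionary stands in for the value label from the question, and taking the lowest matching value as "the first" is an assumption consistent with the result described above (4).

```python
labels = {1: "cat1", 2: "cat1", 3: "cat2", 4: "cat3", 5: "cat3"}
var1 = [1, 2, 3, 4, 5]

# Analogue of "cat3":l_var1 as described above: a single value, the first match
first_match = next(v for v in sorted(labels) if labels[v] == "cat3")
kept_single = [v for v in var1 if v == first_match]

# Analogue of decode followed by a string comparison: check every observation's label
kept_all = [v for v in var1 if labels[v] == "cat3"]

print(first_match, kept_single, kept_all)  # 4 [4] [4, 5]
```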

In SAS: How to flag unique combinations of a set of variable values

In SAS, how can I create an identifier for each unique combination of a set of variables?
I have, for example, several thousand observations, each with a dichotomous value for six variables. There are 2^6 possible combinations of the values of these variables. I would like to create an identifier for each unique combination, and eventually group my observations according to this value.
Have:
SubjectID Var1 Var2 Var3 Var4 Var5 Var6
---------------------------------------------------------------
ID1 1 1 1 1 1 1
ID2 1 0 1 1 1 1
ID3 0 1 1 1 1 1
ID4 0 0 1 1 1 0
... ... ... ... ... ... ...
ID3000 1 1 0 1 0 0
Want:
SubjectID Var1 Var2 Var3 Var4 Var5 Var6 Identifier
------------------------------------------------------------------------------
ID1 1 1 1 1 1 1 A
ID2 1 1 1 1 1 1 A
ID3 0 1 1 1 1 1 B
ID4 0 0 1 1 1 0 C
... ... ... ... ... ... ...
ID3000 1 1 0 1 0 0 Z
A would represent 1, 1, 1, 1, 1, 1 as a unique combination and B would represent 0, 1, 1, 1, 1, 1 etc.
I have thought about creating a dummy variable based on 64 Var1-Var6 conditional statements. I've also thought about concatenating the values from Var1-Var6 into a new row to create a unique identifier.
Is there a more straightforward way of going about this?
I prefer an approach that assigns a specific identifier to a specific combination of the values, rather than one that just generates some arbitrary unique string whenever a new combination comes up.
Proc summary works well with the LEVELS option. This technique works for any values of the group variables numeric or character.
data have;
input (v1-v6)(1.);
cards;
111111
111110
111101
111011
110111
;;;;
proc print;
proc summary data=have nway;
class v1-v6;
output out=unique(drop=_type_) / levels;
run;
Why not just concatenate the values?
So your combinations are:
111111
111110
111101
111011
110111
....
You can use PROC FREQ to check the number of each type.
proc freq data=have;
table var1*var2*var3*var4*var5*var6 / out=want list;
run;
You can get the result by collecting the distinct combinations of the variables, assigning a letter to each one, and joining the letters back to the original data:
data inp;
length combined $6.;
input subjectid $4. v1 1. v2 1. v3 1. v4 1. v5 1. v6 1.;
combined=cats(of v1-v6); /* cats strips blanks and avoids numeric-to-character conversion notes */
datalines;
ID1 111111
ID2 011111
ID3 001111
ID4 111110
ID5 000111
ID6 111111
ID7 000111
;
run;
proc sql;
create table uniq
as
select distinct combined from inp order by combined desc;
quit;
data uniq1;
set uniq;
retain alphabet 65;
Id=byte(alphabet) ;
alphabet+1;
drop alphabet;
run;
proc sql;
create table final_ds
as
select subjectid, v1, v2, v3, v4, v5, v6, Id
from inp a
left join uniq1 b
on a.combined=b.combined;
quit;
Assuming the data is sorted by your grouping variables then just use BY group processing.
data want;
set have;
by var1-var6 ;
groupid + first.var6 ;
run;
Or you could just convert the 6 binary variables into a single unique value.
group2 = input(cats(of var1-var6),binary6.);
This has the added benefit of not requiring that you sort the data, but it does require that none of the grouping variables be missing.
Result
SubjectID Var1 Var2 Var3 Var4 Var5 Var6 Identifier Want groupno group2
ID4 0 0 1 1 1 0 C 1 14
ID3 0 1 1 1 1 1 B 2 31
ID1 1 1 1 1 1 1 A 3 63
ID2 1 1 1 1 1 1 A 3 63
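The two identifiers in the result above can be reproduced with a short Python sketch. It is only an illustration of the logic: the sequential id numbers the distinct combinations in sorted order, and the fixed id reads the six flags as one binary number. The dictionary layout is an assumption; the values are the four rows shown in the result.

```python
rows = {
    "ID1": (1, 1, 1, 1, 1, 1),
    "ID2": (1, 1, 1, 1, 1, 1),
    "ID3": (0, 1, 1, 1, 1, 1),
    "ID4": (0, 0, 1, 1, 1, 0),
}

# Sequential id: number the distinct combinations in sorted order
order = {combo: n for n, combo in enumerate(sorted(set(rows.values())), start=1)}
groupid = {sid: order[combo] for sid, combo in rows.items()}

# Fixed id: read the six 0/1 flags as one binary number
group2 = {sid: int("".join(map(str, combo)), 2) for sid, combo in rows.items()}

print(groupid)  # {'ID1': 3, 'ID2': 3, 'ID3': 2, 'ID4': 1}
print(group2)   # {'ID1': 63, 'ID2': 63, 'ID3': 31, 'ID4': 14}
```

The binary version always gives the same number to the same combination, whereas the sequential number depends on which combinations are present in the data.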

Automatically replace outlying values with missing values

Suppose the data set have contains various outliers which have been identified in an outliers data set. These outliers need to be replaced with missing values, as demonstrated below.
Have
Obs group replicate height weight bp cholesterol
1 1 A 0.406 0.887 0.262 0.683
2 1 B 0.656 0.700 0.083 0.836
3 1 C 0.645 0.711 0.349 0.383
4 1 D 0.115 0.266 666.000 0.015
5 2 A 0.607 0.247 0.644 0.915
6 2 B 0.172 333.000 555.000 0.924
7 2 C 0.680 0.417 0.269 0.499
8 2 D 0.787 0.260 0.610 0.142
9 3 A 0.406 0.099 0.263 111.000
10 3 B 0.981 444.000 0.971 0.894
11 3 C 0.436 0.502 0.563 0.580
12 3 D 0.814 0.959 0.829 0.245
13 4 A 0.488 0.273 0.463 0.784
14 4 B 0.141 0.117 0.674 0.103
15 4 C 0.152 0.935 0.250 0.800
16 4 D 222.000 0.247 0.778 0.941
Want
Obs group replicate height weight bp cholesterol
1 1 A 0.4056 0.8870 0.2615 0.6827
2 1 B 0.6556 0.6995 0.0829 0.8356
3 1 C 0.6445 0.7110 0.3492 0.3826
4 1 D 0.1146 0.2655 . 0.0152
5 2 A 0.6072 0.2474 0.6444 0.9154
6 2 B 0.1720 . . 0.9241
7 2 C 0.6800 0.4166 0.2686 0.4992
8 2 D 0.7874 0.2595 0.6099 0.1418
9 3 A 0.4057 0.0988 0.2632 .
10 3 B 0.9805 . 0.9712 0.8937
11 3 C 0.4358 0.5023 0.5626 0.5799
12 3 D 0.8138 0.9588 0.8293 0.2448
13 4 A 0.4881 0.2731 0.4633 0.7839
14 4 B 0.1413 0.1166 0.6743 0.1032
15 4 C 0.1522 0.9351 0.2504 0.8003
16 4 D . 0.2465 0.7782 0.9412
The "get it done" approach is to manually enter each variable/value combination in a conditional which replaces with missing when true.
data have;
input group replicate $ height weight bp cholesterol;
datalines;
1 A 0.4056 0.8870 0.2615 0.6827
1 B 0.6556 0.6995 0.0829 0.8356
1 C 0.6445 0.7110 0.3492 0.3826
1 D 0.1146 0.2655 666 0.0152
2 A 0.6072 0.2474 0.6444 0.9154
2 B 0.1720 333 555 0.9241
2 C 0.6800 0.4166 0.2686 0.4992
2 D 0.7874 0.2595 0.6099 0.1418
3 A 0.4057 0.0988 0.2632 111
3 B 0.9805 444 0.9712 0.8937
3 C 0.4358 0.5023 0.5626 0.5799
3 D 0.8138 0.9588 0.8293 0.2448
4 A 0.4881 0.2731 0.4633 0.7839
4 B 0.1413 0.1166 0.6743 0.1032
4 C 0.1522 0.9351 0.2504 0.8003
4 D 222 0.2465 0.7782 0.9412
;
run;
data outliers;
input parameter :$11. group replicate $ measurement; /* colon modifier: read up to the next blank, max width 11 */
datalines;
cholesterol 3 A 111
height 4 D 222
weight 2 B 333
weight 3 B 444
bp 2 B 555
bp 1 D 666
;
run;
EDIT: Updated outliers so that parameter avoids truncation and changed measurement to be numeric type so as to match the corresponding height, weight, bp, cholesterol. This shouldn't change the responses.
data want;
set have;
if group = 3 and replicate = 'A' and cholesterol = 111 then cholesterol = .;
if group = 4 and replicate = 'D' and height = 222 then height = .;
if group = 2 and replicate = 'B' and weight = 333 then weight = .;
if group = 3 and replicate = 'B' and weight = 444 then weight = .;
if group = 2 and replicate = 'B' and bp = 555 then bp = .;
if group = 1 and replicate = 'D' and bp = 666 then bp = .;
run;
This, however, doesn't utilize the outliers data set. How can the replacement process be made automatic?
I immediately think of the IN= operator, but that won't work. It's not the entire row which needs to be matched. Perhaps an SQL key matching approach would work? But to match the key, don't I need to use a where statement? I'd then effectively be writing everything out manually again. I could probably create macro variables which contain the various if or where statements, but that seems excessive.
I don't think generating statements is excessive in this case. The complexity arises here because your outlier dataset cannot be merged easily since the parameter values represent variable names in the have dataset. If it is possible to reorient the outliers dataset so you have a 1 to 1 merge, this logic would be simpler.
Let's assume you cannot. There are a few ways to use a variable in a dataset that corresponds to a variable in another.
You could use an array like array params{*} height -- cholesterol; and then use the vname function as you loop through the array to compare to the value in the parameter variable, but this gets complicated in your case because you have a one to many merge, so you would have to retain the replacements and only output the last record for each by group... so it gets complicated.
You could transpose the outliers data using proc transpose, but that will get lengthy because you will need a transpose for each parameter, and then you'd need to merge all the transposed datasets back to the have dataset. My main issue with this method is that code with a bunch of transposes like that gets unwieldy.
You could create the macro variable logic that you think might be excessive. Compared with the other ways of getting the values of the parameter variable to match up with the variable names in the have dataset, I don't think something like this is excessive:
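data _null_;
set outliers;
/* symputx and cats strip blanks, so the names resolve to outlierstatement1, outlierstatement2, ... */
call symputx(cats("outlierstatement",_n_),
catx(" ","if group =",group,"and replicate =",cats("'",replicate,"'"),"and",parameter,"=",measurement,"then",parameter,"= .;"));
call symputx("outliercount",_n_);
run;
%macro makewant();
data want;
set have;
%do i = 1 %to &outliercount;
&&outlierstatement&i;
%end;
run;
%mend;
/* the macro has to be invoked to create want */
%makewant()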
Lorem:
Transposition is the key to a fully automatic programmatic approach. The transposition that will occur is of the filter data, not the original data. The transposed filter data will have fewer rows than the original. As John indicated, transposition of the want data can create a very tall table and has to be transposed back after applying the filters.
As to the filter data, the presence of a filter row for a specific group, replicate and parameter should be enough to mark a cell for filtering. This is on the presumption that you have a system for automatic outlier detection and the filter values will always be in concordance with the original values.
So, what has to be done to automate the filter application process without code generating a wall of test and assign statements?
Transpose filter data into same form as want data, call it Filter^
Merge Want and Filter^ by record key (which is the by group of Group and Replicate)
Array process the data elements, looking for filtering conditions.
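The three steps above can be sketched in Python before looking at the SAS version. This is only an illustration under stated assumptions: the records are a subset of the data in this thread (including the deliberately misaligned 5 E test record), None plays the role of a SAS missing value, and the names filters, want and warnings are hypothetical.

```python
from collections import defaultdict

have = [
    {"group": 1, "replicate": "D", "height": 0.1146, "weight": 0.2655, "bp": 666, "cholesterol": 0.0152},
    {"group": 2, "replicate": "B", "height": 0.1720, "weight": 333, "bp": 555, "cholesterol": 0.9241},
    {"group": 5, "replicate": "E", "height": 222, "weight": 0.2465, "bp": 0.7782, "cholesterol": 0.9412},
]
outliers = [("bp", 1, "D", 666), ("weight", 2, "B", 333), ("bp", 2, "B", 555), ("height", 5, "E", 223)]

# Step 1: pivot the filter rows so each (group, replicate) key holds parameter -> value
filters = defaultdict(dict)
for parameter, group, replicate, measurement in outliers:
    filters[(group, replicate)][parameter] = measurement

# Steps 2 and 3: match each record to its filters by key, then check every filtered cell
want, warnings = [], []
for record in have:
    cleaned = dict(record)
    for parameter, expected in filters.get((record["group"], record["replicate"]), {}).items():
        if cleaned[parameter] == expected:
            cleaned[parameter] = None  # value matches the filter: blank it out
        else:
            warnings.append((record["group"], record["replicate"], parameter))
    want.append(cleaned)

print(want[1]["weight"], want[1]["bp"])  # None None
print(warnings)                          # [(5, 'E', 'height')]
```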
For your consideration, try the following SAS code. There is an erroneous filter record added to the mix.
data have;
input group replicate $ height weight bp cholesterol;
datalines;
1 A 0.4056 0.8870 0.2615 0.6827
1 B 0.6556 0.6995 0.0829 0.8356
1 C 0.6445 0.7110 0.3492 0.3826
1 D 0.1146 0.2655 666 0.0152
2 A 0.6072 0.2474 0.6444 0.9154
2 B 0.1720 333 555 0.9241
2 C 0.6800 0.4166 0.2686 0.4992
2 D 0.7874 0.2595 0.6099 0.1418
3 A 0.4057 0.0988 0.2632 111
3 B 0.9805 444 0.9712 0.8937
3 C 0.4358 0.5023 0.5626 0.5799
3 D 0.8138 0.9588 0.8293 0.2448
4 A 0.4881 0.2731 0.4633 0.7839
4 B 0.1413 0.1166 0.6743 0.1032
4 C 0.1522 0.9351 0.2504 0.8003
4 D 222 0.2465 0.7782 0.9412
5 E 222 0.2465 0.7782 0.9412 /* test record for filter value misalignment test */
;
run;
data outliers;
length parameter $32; %* <--- widened parameter so it can be transposed into a column via id;
input parameter $ group replicate $ measurement ; %* <--- changed measurement to numeric variable;
datalines;
cholesterol 3 A 111
height 4 D 222
height 5 E 223 /* test record for filter value misalignment test */
weight 2 B 333
weight 3 B 444
bp 2 B 555
bp 1 D 666
;
run;
data want;
set have;
if group = 3 and replicate = 'A' and cholesterol = 111 then cholesterol = .;
if group = 4 and replicate = 'D' and height = 222 then height = .;
if group = 2 and replicate = 'B' and weight = 333 then weight = .;
if group = 3 and replicate = 'B' and weight = 444 then weight = .;
if group = 2 and replicate = 'B' and bp = 555 then bp = .;
if group = 1 and replicate = 'D' and bp = 666 then bp = .;
run;
/* Create a view with 1st row having all the filtered parameters
* This is necessary so that the first transposed filter row
* will have the parameters as columns in alphabetic order;
*/
proc sql noprint;
create view outliers_transpose_ready as
select distinct parameter from outliers
union
select * from outliers
order by group, replicate, parameter
;
/* Generate an alphabetically ordered list of parameters for use
* as a variable (aka column) list in the filter application step */
select distinct parameter
into :parameters separated by ' '
from outliers
order by parameter
;
quit;
%put NOTE: &=parameters;
/* transpose the filter data
* The ID statement pivots row data into column names.
* The prefix=_filter_ ensures the new column names
* will not collide with the original data, and can be
* listed with the _filter_: shortcut in an array statement.
*/
proc transpose data=outliers_transpose_ready out=outliers_apply_ready prefix=_filter_;
by group replicate notsorted;
id parameter;
var measurement;
run;
/* Robust production code should contain a bin for
* data that does not conform to the filter application conditions
*/
data
want2(label="Outlier filtering applied" drop=_i_ _filter_:)
want2_warnings(label="Outlier filtering: misaligned values")
;
merge have outliers_apply_ready(keep=group replicate _filter_:);
by group replicate;
/* The arrays are for like named columns
* due to the alphabetic ordering enforced in data and codegen preparation
*/
array value_filter_check _filter_:;
array value &parameters;
if group ne .;
do _i_ = 1 to dim(value);
if value(_i_) EQ value_filter_check(_i_) then
value(_i_) = .;
else
if not missing(value_filter_check(_i_)) AND
value(_i_) NE value_filter_check(_i_)
then do;
put 'WARNING: Filtering expected but values do not match. ' group= replicate= value(_i_)= value_filter_check(_i_)=;
output want2_warnings;
end;
end;
output want2;
run;
Confirm your want and automated want2 agree.
proc compare noprint data=want compare=want2 outnoequal out=diffs;
by group replicate;
run;
Enjoy your SAS
You could use a hash table. Load a hash table with the outlier dataset, with parameter-group-replicate defined as the key. Then read in the data, and as you read each record, check each of the variables to see if that combination of parameter-group-replicate can be found in the hash table. I think below works (I'm no hash expert):
data want;
if 0 then set outliers (keep=parameter group replicate);
if _N_ = 1 then
do;
declare hash h(dataset:'outliers') ;
h.defineKey('parameter', 'group', 'replicate') ;
h.defineDone() ;
end;
set have ;
array vars {*} height weight bp cholesterol ;
do i=1 to dim(vars);
parameter=vname(vars{i});
if h.check()=0 then call missing(vars{i});
end;
drop i parameter;
run;
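The same lookup idea, sketched in Python for comparison: a set of (parameter, group, replicate) keys stands in for the hash object, and each measurement is blanked when its key is found. The names outlier_keys and blank_outliers are hypothetical; the keys are the six outliers listed in the question.

```python
outlier_keys = {
    ("cholesterol", 3, "A"), ("height", 4, "D"), ("weight", 2, "B"),
    ("weight", 3, "B"), ("bp", 2, "B"), ("bp", 1, "D"),
}
measures = ("height", "weight", "bp", "cholesterol")

def blank_outliers(record):
    """Return a copy of record with flagged cells replaced by None."""
    cleaned = dict(record)
    for name in measures:
        # Key lookup only: the stored measurement is not compared with the cell value
        if (name, record["group"], record["replicate"]) in outlier_keys:
            cleaned[name] = None
    return cleaned

row = {"group": 2, "replicate": "B", "height": 0.1720, "weight": 333, "bp": 555, "cholesterol": 0.9241}
print(blank_outliers(row))
```

Like the hash version, this trusts the key alone and never checks that the flagged value actually equals the recorded measurement.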
I like @John's suggestion:
You could use an array like array params{*} height -- cholesterol; and
then use the vname function as you loop through the array to compare
to the value in the parameter variable, but this gets complicated in
your case because you have a one to many merge, so you would have to
retain the replacements and only output the last record for each by
group... so it gets complicated.
Generally in a one to many merge I would avoid recoding variables from the dataset that is unique, because variables are retained within BY groups. But in this case, it works out well.
proc sort data=outliers;
by group replicate;
run;
data want (keep=group replicate height weight bp cholesterol);
merge have (in=a)
outliers (keep=group replicate parameter in=b)
;
by group replicate;
array vars {*} height weight bp cholesterol ;
do i=1 to dim(vars);
if vname(vars{i})=parameter then call missing(vars{i});
end;
if last.replicate;
run;
Thank you @John for providing a proof of concept. My implementation is a little different and I think it is worth a separate entry for posterity. I went with a macro variable approach because I feel it is the most intuitive, being a simple text replacement. However, since a macro variable can contain only 65534 characters, it is conceivable that there could be enough outliers to exceed this limit. In such a case, any of the other solutions would make fine alternatives. Note that it is important that the put function use a format like best32.; too short a width will truncate the value.
If you want a dataset containing the if statements (perhaps for verification), simply remove the into : clause and place a create table ... as line directly before the select in the PROC SQL step.
data have;
input group replicate $ height weight bp cholesterol;
datalines;
1 A 0.4056 0.8870 0.2615 0.6827
1 B 0.6556 0.6995 0.0829 0.8356
1 C 0.6445 0.7110 0.3492 0.3826
1 D 0.1146 0.2655 666 0.0152
2 A 0.6072 0.2474 0.6444 0.9154
2 B 0.1720 333 555 0.9241
2 C 0.6800 0.4166 0.2686 0.4992
2 D 0.7874 0.2595 0.6099 0.1418
3 A 0.4057 0.0988 0.2632 111
3 B 0.9805 444 0.9712 0.8937
3 C 0.4358 0.5023 0.5626 0.5799
3 D 0.8138 0.9588 0.8293 0.2448
4 A 0.4881 0.2731 0.4633 0.7839
4 B 0.1413 0.1166 0.6743 0.1032
4 C 0.1522 0.9351 0.2504 0.8003
4 D 222 0.2465 0.7782 0.9412
;
run;
data outliers;
input parameter :$11. group replicate $ measurement; /* colon modifier: read up to the next blank, max width 11 */
datalines;
cholesterol 3 A 111
height 4 D 222
weight 2 B 333
weight 3 B 444
bp 2 B 555
bp 1 D 666
;
run;
proc sql noprint;
select
cat('if group = '
, strip(put(group, best32.))
, " and replicate = '"
, strip(replicate)
, "' and "
, strip(parameter)
, ' = '
, strip(put(measurement, best32.))
, ' then '
, strip(parameter)
, ' = . ;')
into : listIfs separated by ' '
from outliers
;
quit;
%put %quote(&listIfs);
data want;
set have;
&listIfs;
run;

Why does proc arima with NoEst throw 'There is not enough data to fit the model' error?

I am using proc arima in SAS 9.4 to produce a forecast using a previously calibrated model, but it is throwing an error as if it were trying to calibrate the model itself:
ERROR: There is not enough data to fit the model
sample data:
data inputs;
input x var1 var2 var3 var4 var5;
datalines;
20 5 2 4 5 4
25 12 56 13 44 4
20 5 2 4 5 4
25 12 56 13 44 4
20 5 2 4 5 4
25 12 56 13 44 4
. 2 5 6 5 4
;
failing version:
proc arima;
identify
data = inputs
var = x
crossCorr = ( var1 var2 var3 var4 var5 )
noPrint;
estimate
p = 1 input = ( var1 var2 var3 var4 var5 )
ar = 0.9
initVal = ( 0.1$var1 0.2$var2 0.3$var3 0.4$var4 0.4$var5 )
noint
noEst /* Using noEst so should not need to do any estimation and short data-set should not be a problem */
method=ml
noprint
;
forecast lead=1 out=outputs noOutAll noprint;
quit;
If I remove the final variable from the model, it works fine:
proc arima;
identify
data = inputs
var = x
crossCorr = ( var1 var2 var3 var4 )
noPrint;
estimate
p = 1 input = ( var1 var2 var3 var4 )
ar = 0.9
initVal = ( 0.1$var1 0.2$var2 0.3$var3 0.4$var4 )
noint
noEst /* Using noEst so should not need to do any estimation and short data-set should not be a problem */
method=ml
noprint
;
forecast lead=1 out=outputs noOutAll noprint;
quit;
I can also get it to 'work' by adding one more value to the data. However, this shouldn't be necessary when the model is already calibrated (using much more data).
I've checked the SAS documentation to see if there are any flags to prevent the unnecessary check that causes this error but none of them helped.
The answer has been provided on the SAS communities forum. It is known behaviour and so my uncommon use case is not supported. The only workaround would be to add some dummy data, but in my case with MA terms that would change the results.
Response on SAS Communities

Ranking values based on another data set in SAS

Say I have two data sets A and B that have identical variables and want to rank values in B based on values in A, not B itself (as "PROC RANK data=B" does.)
Here's a simplified example of data sets A, B and want (the desired output):
A:
obs_A VAR1 VAR2 VAR3
1 10 100 2000
2 20 300 1000
3 30 200 4000
4 40 500 3000
5 50 400 5000
B:
obs_B VAR1 VAR2 VAR3
1 15 150 2234
2 14 352 1555
3 36 251 1000
4 41 350 2011
5 60 553 5012
want:
obs VAR1 VAR2 VAR3
1 2 2 3
2 2 4 2
3 4 3 1
4 5 4 3
5 6 6 6
I came up with a macro loop that involves PROC RANK and PROC APPEND, like below:
%macro MyRank(A,B);
data AB; set &A &B; run;
%do i=1 %to 5;
proc rank data=AB(where=(obs_A ne . OR obs_B=&i)) out=tmp;
var VAR1-VAR3;
run;
proc append base=want data=tmp(where=(obs_B=&i) rename=(obs_B=obs)); run;
%end;
%mend;
This is OK when the number of observations in B is small, but with a very large number of observations it takes too long to be a good solution.
Thanks in advance for suggestions.
I would create formats to do this. What you're really doing is defining ranges via A that you want to apply to B. Formats are very fast - here assuming "A" is relatively small, "B" can be as big as you like and it's always going to take just as long as it takes to read and write out the B dataset once, plus a couple read/writes of A.
First, reading in the A dataset:
data ranking_vals;
input obs_A VAR1 VAR2 VAR3;
datalines;
1 10 100 2000
2 20 300 1000
3 30 200 4000
4 40 500 3000
5 50 400 5000
;;;;
run;
Then transposing it to vertical, as this will be the easiest way to rank them (just plain old sorting, no need for proc rank).
data for_ranking;
set ranking_vals;
array var[3];
do _i = 1 to dim(var);
var_name = vname(var[_i]);
var_value = var[_i];
output;
end;
run;
proc sort data=for_ranking;
by var_name var_value;
run;
Then we create a format input dataset, and use the rank as the label. The range is (previous value -> current value), and label is the rank. I leave it to you how you want to handle ties.
data for_fmt;
set for_ranking;
by var_name var_value;
retain prev_value;
if first.var_name then do; *initialize things for a new varname;
rank=0;
prev_value=.;
hlo='l'; *first record has 'minimum' as starting point;
end;
rank+1;
fmtname=cats(var_name,'F');
start=prev_value;
end=var_value;
label=rank;
output;
if last.var_name then do; *For last record, some special stuff;
start=var_value;
end=.;
hlo='h';
label=rank+1;
output; * Output that 'high' record;
start=.;
end=.;
label=.;
hlo='o';
output; * And a "invalid" record, though this should never happen;
end;
prev_value=var_value; * Store the value for next row.;
run;
proc format cntlin=for_fmt;
quit;
And then we test it out.
data test_b;
input obs_B VAR1 VAR2 VAR3;
var1r=put(var1,var1f.);
var2r=put(var2,var2f.);
var3r=put(var3,var3f.);
datalines;
1 15 150 2234
2 14 352 1555
3 36 251 1000
4 41 350 2011
5 60 553 5012
;;;;
run;
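The range logic behind the formats can be checked with a short Python sketch. It is not the SAS implementation, just the same rule expressed with a binary search: the rank of a B value is one plus the number of A values strictly below it, so a tie with a value in A receives that value's rank. The data are the A and B tables from the question.

```python
from bisect import bisect_left

a = {"VAR1": [10, 20, 30, 40, 50], "VAR2": [100, 300, 200, 500, 400], "VAR3": [2000, 1000, 4000, 3000, 5000]}
b = [(15, 150, 2234), (14, 352, 1555), (36, 251, 1000), (41, 350, 2011), (60, 553, 5012)]

# Sort each column of A once, then look up every B value with a binary search
sorted_a = {name: sorted(values) for name, values in a.items()}

def rank_row(row):
    # rank = 1 + number of values in A strictly below the B value
    return tuple(bisect_left(sorted_a[name], value) + 1 for name, value in zip(sorted_a, row))

for row in b:
    print(rank_row(row))
```

The printed rows are (2, 2, 3), (2, 4, 2), (4, 3, 1), (5, 4, 3) and (6, 6, 6), which match the desired output in the question.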
One way to rank by a variable from a separate dataset is to use proc sql's correlated subqueries. Essentially, you count the number of lower values in the lookup dataset for each value in the data to be ranked.
proc sql;
create table want as
select
B.obs_B,
(
select count(distinct A.Var1) + 1
from A
where A.var1 < B.var1
) as var1
from B;
quit;
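To make the counting rule concrete, here is the same calculation as a minimal Python sketch for VAR3, using the data from the question. The function name rank_against is hypothetical. The comparison operator decides how ties are treated; a strict less-than is used here because it reproduces the desired output, where 1000 in B ranks 1.

```python
def rank_against(lookup, value):
    """Return 1 + the number of distinct lookup values strictly below value."""
    return len({x for x in lookup if x < value}) + 1

a_var3 = [2000, 1000, 4000, 3000, 5000]
b_var3 = [2234, 1555, 1000, 2011, 5012]

print([rank_against(a_var3, v) for v in b_var3])  # [3, 2, 1, 3, 6]
```

Every B value scans all of A, which is why this approach slows down as A grows.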
This can be wrapped in a macro. Below, a macro loop writes each of the subqueries: it loops through the list of variables and parametrises the subquery as required.
%macro rankBy(
inScore /*Dataset containing data to be ranked*/,
inLookup /*Dataset containing data against which to rank*/,
varID /*Variable by which to identify an observation*/,
varsRank /*Space separated list of variable names to be ranked*/,
outData /*Output dataset name*/);
/* Rank variables in one dataset by identically named variables in another */
proc sql;
create table &outData. as
select
scr.&varID.
/* Loop through each variable to be ranked */
%do i = 1 %to %sysfunc(countw(&varsRank., %str( )));
/* Store the variable name in a macro variable */
%let var = %scan(&varsRank., &i., %str( ));
/* Rank: count all the rows with lower value in lookup */
, (
select count(distinct lkp&i..&var.) + 1
from &inLookup. as lkp&i.
where lkp&i..&var. < scr.&var.
) as &var.
%end;
from &inScore. as scr;
quit;
%mend rankBy;
%rankBy(
inScore = B,
inLookup = A,
varID = obs_B,
varsRank = VAR1 VAR2 VAR3,
outData = want);
Regarding speed, this will be slow if your A is large, but should be okay for large B and small A.
In rough testing on a slow PC I saw:
A: 1e1 B: 1e6 time: ~1s
A: 1e2 B: 1e6 time: ~2s
A: 1e3 B: 1e6 time: ~5s
A: 1e1 B: 1e7 time: ~10s
A: 1e2 B: 1e7 time: ~12s
A: 1e4 B: 1e6 time: ~30s
Edit:
As Joe points out below, the length of time the query takes depends not just on the number of observations in the dataset, but also on how many unique values exist within the data. Apparently SAS performs optimisations to reduce the comparisons to only the distinct values in B, thereby reducing the number of times the elements in A need to be counted. This means that if the dataset B contains a large number of unique values (in the ranking variables), the process will take significantly longer than the times shown. This is more likely to happen if your data is not integers, as Joe demonstrates.
Edit:
Runtime test rig:
data A;
input obs_A VAR1 VAR2 VAR3;
datalines;
1 10 100 2000
2 20 300 1000
3 30 200 4000
4 40 500 3000
5 50 400 5000
;
run;
data B;
do obs_B = 1 to 1e7;
VAR1 = ceil(rand("uniform")* 60);
VAR2 = ceil(rand("uniform")* 500);
VAR3 = ceil(rand("uniform")* 6000);
output;
end;
run;
%let start = %sysfunc(time());
%rankBy(
inScore = B,
inLookup = A,
varID = obs_B,
varsRank = VAR1 VAR2 VAR3,
outData = want);
%let time = %sysfunc(putn(%sysevalf(%sysfunc(time()) - &start.), time12.2));
%put &time.;
Output:
0:00:12.41