SAS Studio: Finding Values of Column 2 in Column 1 Until Column 2 is Specific Value - sas

I have a simple question that I can't seem to answer. I HAVE a large data set where I am searching for values of column 2 that are found in column 1, until column 2 is a specific value. Sounds like a DO loop but I don't have much experience using them. Please see image as this likely will explain better.
Essentially, I have a "starting" point (with the first_match flag=1). Then, I want to grab the value of column 2 in this row (B in this example). Next, I want to search for this value (B) in column 1. Once I find that row (with column 1 = B & column 2 = C), I again grab the value in column 2 (C). Again, I find where in column 1 this new value occurs and obtain the corresponding value of column 2. I repeat this process until column 2 has a value of Z. That's my stopping point. The WANT table shows my desired output.
My apologies if the above is confusing, but it seems like a simple exercise that I can't seem to solve. Any help would be greatly appreciated. Glad to supply further clarification as well.
Have & Want
I have tried PROC SQL to create flags and grab the appropriate rows, but the code is extremely bulky and doesn't seem efficient. Also, the example I laid out has a desired output table with 3 rows. This may not be the case as the desired output could contain between 1 and 10 rows.

This question has been asked and answered previously.
Path traversal can be done using a DATA Step hash object.
Example:
data have;
length vertex1 vertex2 $8;
input vertex1 vertex2;
datalines;
A B
X B
D B
E B
B C
Q C
C Z
Z X
;
data want(keep=vertex1 vertex2 crumb);
length vertex1 vertex2 $8 crumb $1;
declare hash edges ();
edges.defineKey('vertex1');
edges.defineData('vertex2', 'crumb');
edges.defineDone();
crumb = ' ';
do while (not last_edge);
set have end=last_edge;
edges.add();
end;
trailhead = 'A';
vertex1 = trailhead;
do while (0 = edges.find());
if not missing(crumb) then leave;
output;
edges.replace(key:vertex1, data:vertex2, data:'*');
vertex1 = vertex2;
end;
if not missing(crumb) then output;
stop;
run;
All paths in the data can be discovered with an additional outer loop iterating (HITER) over a hash of the vertex1 values.

Related

Data step manipulation based on two fields conditioning

For the dataset,
data testing;
input key $ output $;
datalines;
1 A
1 B
1 C
2 A
2 B
2 C
3 A
3 B
3 C
;
run;
Desired Output,
1 A
2 B
3 C
The logic is if either key or output appear within the column before then delete the observation.
1 A (as 1 and A never appear then keep the obs)
1 B (as 1 appear already then delete)
1 C (as 1 appear then delete)
2 A (as A appear then delete)
2 B (as 2 and B never appear then keep the obs)
2 C (as 2 appear then delete)
3 A (as A appear then delete)
3 B (as B appear then delete)
3 C (as 3 and C never appear then keep the obs)
My effort:
The basic idea here is you keep a dictionary of what's already been used, and search that. Here's a simple array based method; a hash table might be better, certainly less memory intensive, anyway, and likely faster - I would leave that to your imagination.
data want;
set testing;
array _keys[30000] _temporary_; *temporary arrays to store 'used' values;
array _outputs[30000] $ _temporary_;
retain _keysCounter 1 _outputsCounter 1; *counters to help us store the values;
if whichn(key, of _keys[*]) = 0 and whichc(output,of _outputs[*]) = 0 /* whichn and whichc search lists (or arrays) for a value. */
then do;
_keys[_keysCounter] = key; *store the key in the next spot in the dictionary;
_keysCounter+1; *increment its counter;
_outputs[_outputsCounter] = output; *store the output in the next spot in the dictionary;
_outputsCounter+1; *increment its counter;
output; *output the actual datarow;
end;
keep key output;
run;

SAS/IML: how to use individual variance components in RANDNORMAL

This is a programming question, but I'll give you a little of the stats background first. This question refers to part of a data sim for a mixed-effects location scale model (i.e., heterogeneous variances). I'm trying to simulate two MVN variance components using the RANDNORMAL function in IML. Because both variance components are heterogeneous, the variances used by RANDNORMAL will differ across people. Thus, I need IML to select the specific row (e.g., row 1 = person 1) and use the RANDNORMAL function before moving onto the next row, and so on.
My example code below is for 2 people. I use DO to loop through each person's specific variance components (VC1 and VC2). I get the error: "Module RANDNORMAL called again before exit from prior call." I am assuming I need some kind of BREAK or EXIT function in the DO loop, but none I have tried work.
PROC IML;
ColNames = {"ID" "VC1" "VC2"};
A = {1 2 3,
2 8 9};
PRINT A[COLNAME=ColNames];
/*Set men of each variance component to 0*/
MeanVector = {0, 0};
/*Loop through each person's data using THEIR OWN variances*/
DO i = 1 TO 2;
VC1 = A[i,2];
VC2 = A[i,3];
CovMatrix = {VC1 0,
0 VC2};
CALL RANDSEED(1);
U = RANDNORMAL(2, MeanVector, CovMatrix);
END;
QUIT;
Any help is appreciated. Oh, and I'm using SAS 9.4.
You want to move some things around, but mostly you don't want to rewrite U twice: you need to write U's 1st row, then U's 2nd row, if I understand what you're trying to do. The below is a bit more efficient also, since I j() the U and _cv matrices rather than constructing then de novo every time through the loop (which is slow).
proc iml;
a = {1 2 3,2 8 9};
print(a);
_mv = {0,0};
U = J(2,2);
_cv = J(2,2,0);
CALL RANDSEED(1);
do i = 1 to 2;
_cv[1,1] = a[i,2];
_cv[2,2] = a[i,3];
U[i,] = randnormal(1,_mv, _cv);
end;
print(u);
quit;
Your mistake is the line
CovMatrix = {VC1 0, 0 VC2}; /* wrong */
which is not valid SAS/IML syntax. Instead, use #Joe's approach or use
CovMatrix = (VC1 || 0) // (0 || VC2);
For details, see the article "How to build matrices from expressions."
You might also be interested in this article that describes how to carry out this simulation with a block-diagonal matrix: "Constructing block matrices with applications to mixed models."

Simulating ARMA/ARIMA time series processes in SAS

I've been trying to find the simplest way to generate simulated time series datasets in SAS. I initially was experimenting with the LAG operator, but this requires input data, so is proabably not the best way to go. (See this question: SAS: Using the lag function without a set statement (to simulate time series data.))
Has anyone developed a macro or dataset that enables time series to be genereated with an arbitrary number of AR and MA terms? What is the best way to do this?
To be specific, I'm looking to generate what SAS calls an ARMA(p,q) process, where p denotes the autoregressive component (lagged values of the dependent variable), and q is the moving average component (lagged values of the error term).
Thanks very much.
I have developed a macro to attempt to answer this question, but I'm not sure whether this is the most efficient way of doing this. Anyway, I thought it might be useful to someone:
%macro TimeSeriesSimulation(numDataPoints=100, model=y=e,outputDataSetName=ts, maxLags=10);
data &outputDataSetName (drop=j);
array lagy(&maxlags) _temporary_;
array lage(&maxlags) _temporary_;
/*Initialise values*/
e = 0;
y=0;
t=1;
do j = 1 to 10;
lagy(j) = 0;
lage(j) = 0;
end;
output;
do t = 2 to &numDataPoints; /*Change this for number of observations*/
/*SPECIFY MODEL HERE*/
e = rannorm(-1); /*Draw from a N(0,1)*/
&model;
/*Update values of lags on the moving average and autoregressive terms*/
do j = &maxlags-1 to 1 by -1; /*Note you have to do this backwards because otherwise you cascade the current value to all past values!*/
lagy(j+1) = lagy(j);
lage(j+1) = lage(j);
end;
lagy(1) = y;
lage(1) = e;
output;
end;
run;
%mend;
/*Example 1: Unit root*/
%TimeSeriesSimulation(numDataPoints=1000, model=y=lagy(1)+e)
/*Example 2: Simple process with AR and MA components*/
%TimeSeriesSimulation(numDataPoints=1000, model=y=0.5*lagy(1)+0.5*lage(1)+e)

Differentiating between missing and total in output of proc means?

I've got something like the following:
proc means data = ... missing;
class 1 2 3 4 5;
var a b;
output sum=;
run;
This does what I want it to do, except for the fact that it is very difficult to differentiate between a missing value that represents a total, and a missing value that represents a missing value. For example, the following would appear in my output:
1 2 3 4 5 type sumA sumB
. . . . . 0 num num
. . . . . 1 num num
Ways I can think of handling this:
1) Change missings to a cardinal value prior to proc means. This is definitely doable...but not exactly clean.
2) Format the missings to something else prior, and then use preloadfmt? This is a bit of a pain...I'd really rather not.
3) Somehow use the proc means-generated variable type to determine whether the missing is a missing or a total
4) Other??
I feel like this is clearly a common enough problem that there must be a clean, easy way, and I just don't know what it is.
Option 3, for sure . Type is simply a binary number with 1 for each class variable, in order, that is included in the current row and 0 for each one that is missing. You can use the CHARTYPE option to ask for it to be given explicitly as a string ('01101110' etc.), or work with math if that's more your style.
How exactly you use this depends on what you're trying to accomplish. Rows that have a missing value on them will have a type that suggests a class variable should exist, but doesn't. So for example:
data want;
set have; *post-proc means assuming used CHARTYPE option;
array classvars a b c d e; *whatever they are;
hasmissing=0;
do _t = 1 to dim(classvars);
if char(_type_,_t) = 1 and classvars[_t] = . then hasmissing=1;
end;
*or;
if cmiss(of classvars[*]) = countc(_type_,'0') then hasmissing=0;
else hasmissing=1; *number of 0s = number of missings = total row, otherwise not;
run;
That's a brute force application, of course. You may also be able to identify it based on the number of missings, if you have a small number of types requested. For example, let's say you have 3 class variables (so 0 to 7 values for type), and you only asked for the 3 way combination (7, '111') and the 3 two way combination 'totals' (6,5,3, ie, '110','101','011'). Then:
data want;
set have;
if (_type_=7 and cmiss(of a b c) = 0) or (cmiss(of a b c) = 1) then ... ; *either base row or total row, no missings;
else ... ; *has at least one missing;
run;
Depending on your data, NMISS may also work. That checks to see if the number of missings is appropriate for the type of data.
Joe's strategy, modified slightly for my exact problem, because it may be useful to somebody at some point in the future.
data want;
set have;
array classvars a b c d e;
do _t = 1 to dim(classvars);
if char(_type_,_t) = 1 and (strip(classvars[_t] = "") or strip(classvars[_t]) = ".") then classvars[_t] = "TOTAL";
end;
run;
The rationale for the changes is as follows:
1) I'm working with (mostly) character variables, not numeric.
2) I'm not interested in whether a row has any missing or not, as those are very frequent, and I want to keep them. Instead, I just want the output to differentiate between the missings and the totals, which I have accomplished by renaming the instances of non-missing to something that indicates total.

SAS computation using double loops

I am trying to compute using two loops. But I am not very familiar with loop elements.
Here is my data:
data try;
input rs t a b c;
datalines;
0 600
1 600 0.02514 667.53437 0.1638
2 600 0.2766 724.60233 0.30162
3 610 0.01592 792.34628 0.21354
4 615.2869 0.03027 718.30377 0.22097
5 636.0273 0.01967 705.45965 0.16847
;
run;
What I am trying to compute is that for each 'T' value, all elements of a, b, and c need to be used for the equation. Then I create varaibles v1-v6 to put results of the equation for each T1-T6. After that, I create CS to sum all the elements of v.
So my result dataset will look like this:
rs T a b c v1 v2 v3 v4 v5 v6 CS
0 600 sum of v1
1 600 0.02514 667.53437 0.1638 sum of v2
2 600 0.2766 724.60233 0.30162 sum of v3
3 610 0.01592 792.34628 0.21354 sum of v4
4 615.2869 0.03027 718.30377 0.22097 sum of v5
5 636.0273 0.01967 705.45965 0.16847 sum of v6
I wrote a code below to do this but got errors. Mainly I am not sure how to use i and j properly to link all elements of variables. Can someone point out what i did not think correct? I am aware that myabe I should not use sum function to cum up elements of a variable but not sure which function to use.
data try3;
set try;
retain v1-v6;
retain t a b c;
array v(*) v1-v6;
array var(*) t a b c;
cs=0;
do i=1 to 6;
do j=1 to 6;
v[i,j]=(2.89*(a[j]**2*(1-c[j]))/
((c[j]+exp(1.7*a[j]*(t[i]-b[j])))*
((1+exp(-1.7*a[j]*(t[i]-b[j])))**2));
cs[i]=sum(of v[i,j]-v[i,j]);
end;
end;
run;
Forexample, v1 will be computed like v[1,1] =0 because there is no values for a b c.
For v[1,2]=(2.89*0.02514**2(1-0.1638))/((0.1638+exp(1.7*0.02514*600-667.53437)))*((1+exp(-1.7*0.02514*(600-667.5347)))**2)).
v[1,3]]=(2.89*0.2766**2(1-0.30162))/((0.30162+exp(1.7*0.2766*600-724.60233)))*((1+exp(-1.7*0.2766*(600-724.60233)))**2)).
v[1,4] will be using the next line values of a b c but the t will be same as the t[1]. and do this until the last row. And that will be v1. And then I need to sum all the elements of v1 like v1{1,1] +v1[1,2]+ v1{1,3] ....v1[1,6] to make cs[1,1].
The SAS language isn't that good at doing these kinds of things, which are essentially matrix calculations. The DATA step normally processes one observation at a time, though you can carry calculations over using the RETAIN statement. It is possible that you could get a cleaner result than this if you had access to PROC IML (which does matrix calculations natively), but assuming that you don't have access to IML, you need to do something like the following. I'm not 100% sure that it is what you need, but I think it is along the right lines:
data try;
infile cards missover;
input rs t a b c;
datalines;
0 600
1 600 0.02514 667.53437 0.1638
2 600 0.2766 724.60233 0.30162
3 610 0.01592 792.34628 0.21354
4 615.2869 0.03027 718.30377 0.22097
5 636.0273 0.01967 705.45965 0.16847
;
run;
data try4(rename=(aa=a bb=b cc=c css=cs tt=t vv1=v1 vv2=v2 vv3=v3 vv4=v4 vv5=v5 vv6=v6));
* Construct arrays into which we will read all of the records;
array t(6);
array a(6);
array b(6);
array c(6);
array v(6,6);
array cs(6);
* Read all six records;
do i=1 to 6;
set try(rename=(t=tt a=aa b=bb c=cc));
t[i] = tt;
a[i] = aa;
b[i] = bb;
c[i] = cc;
end;
* Now do the calculation, which involves values from each
row at each iteration;
do i=1 to 6;
cs[i]=0;
do j=1 to 6;
v[i,j]=(2.89*(a[j]**2*(1-c[j]))/
((c[j]+exp(1.7*a[j]*(t[i]-b[j])))*
((1+exp(-1.7*a[j]*(t[i]-b[j])))**2)));
cs[i]+v[i,j];
end;
* Then output the values for this iteration;
tt=t[i];
aa=a[i];
bb=b[i];
cc=c[i];
css=cs[i];
vv1=v[i,1];
vv2=v[i,2];
vv3=v[i,3];
vv4=v[i,4];
vv5=v[i,5];
vv6=v[i,6];
keep tt aa bb cc vv1-vv6 css;
output try4;
end;
Note that I have to construct arrays of known size, that is you have to know how many input records there are.
The first half of the DATA step constructs arrays into which the values from the input data set are read. We read all of the records, and then we do all of the calculations, since we have all of the values in memory in the matricies.
There is some fiddling with RENAMES so that you can keep the array names t, a, b, c etc but still have variables named a, b, c etc in the output data set.
So hopefully that might help you along a bit. Either that or confuse you because I've misunderstood what you're trying to do!