SAS multiple datasets logical order - sas

I'm sorry to ask this question, but my English is poor and I don't know what to search for on Google.
I want to do:
data test;
set mytable1 to mytable999;
run;
How can I tell SAS to set all the tables from 1 to 999 without writing each one out (because that is long to do)? Something like mytable1-999.
Thank you very much. I know it's a basic feature, but I don't remember what it is called in English.

Just use the colon wildcard (:) in SAS. In the following example,
data myTable1;
do i = 1 to 3;
j = 2*i;
output;
end;
run;
data myTable2;
do i = 1 to 3;
j = -i;
output;
end;
run;
data myAll;
set myTable:;
run;
myTable: is equivalent to the list of all tables whose names start with myTable.
The result is
i j
== ==
1 2
2 4
3 6
1 -1
2 -2
3 -3
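Note that the colon prefix picks up every data set whose name starts with myTable, not only the ones numbered 1 to 999. If only a numbered range is wanted, SAS 9.2 and later also accept a numbered range list in the SET statement. A minimal sketch, assuming that every data set from mytable1 through mytable999 exists in the WORK library (a missing member would cause an error):

```sas
/* Concatenate mytable1, mytable2, ..., mytable999 in numeric order. */
data test;
  set mytable1-mytable999;
run;
```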

Related

SAS 9.4 Replacing all values after current line based on current values

I am matching files based on ID numbers. I need to format a data set with the IDs to be matched, so that the same ID number is not repeated in column a (because column b's ID is the surviving ID after the match is completed). My list of IDs has over 1 million observations, and the same ID may be repeated multiple times in either or both columns.
Here is an example of what I've got/need:
Sample Data
ID1 ID2
1 2
3 4
2 5
6 1
1 7
5 8
The surviving IDs would be:
2
4
5
error - 1 no longer exists
error - 1 no longer exists
8
WHAT I NEED
ID1 ID2
1 2
3 4
2 5
6 5
5 7
7 8
I am, probably very obviously, a SAS novice, but here is what I have tried, re-running over and over again because I have some IDs that are repeated upward of 50 times or more.
Proc sort data=Have;
by ID1;
run;
This sort makes the repeated ID1 values consecutive, so that I could use LAG to replace the destroyed ID1s with the surviving ID2 from the line above.
Data Want;
set Have;
by ID1;
lagID1=LAG(ID1);
lagID2=LAG(ID2);
If NOT first.ID1 THEN DO;
If ID1=lagID1 THEN ID1=lagID2;
KEEP ID1 ID2;
IF ID1=ID2 then delete;
end;
run;
That sort of works, but I still end up with some that end up with duplicates that won't resolve no matter how many times I run (I would have looped it, but I don't know how), because they are just switching back and forth between IDs that have other duplicates (I can get down to about 2,000 of these).
I have figured out that instead of using LAG, I need to replace all values after the current line with ID2 for each ID1 value, but I cannot figure out how to do that.
I want to read observation 1, find all later instances of the value of ID1, in either the ID1 or ID2 column, and replace that value with the current observation's ID2 value. Then I want to repeat that process with line 2 and so on.
For the example, I would want to look for any instances after line one of the value 1, and replace it with 2, since that is the surviving ID of that pair. The value 1 may appear further down multiple times in either of the columns, and I need all of them to be replaced. Line two would look for later values of 3 and replace them with 4, and so on. The end result should be that an ID number only appears once ever in the ID1 column (though it may appear multiple times in the ID2 column).
ID1 ID2
1 2
3 4
2 5
6 1
1 7
5 8
After first line has been read, data set would look as follows:
ID1 ID2
1 2
3 4
2 5
6 2
2 7
5 8
Reading observation two would make no changes since 3 does not appear again; after observation 3, the set would be:
ID1 ID2
1 2
3 4
2 5
6 5
5 7
5 8
Again, there would be no changes from observation four, but observation 5 would cause the final change:
ID1 ID2
1 2
3 4
2 5
6 5
5 7
7 8
I have tried using the following statement but I can't even tell if I am on the complete wrong track or if I just can't get the syntax figured out.
Data want;
Set have;
Do i=_n_;
ID=ID2;
Replace next var{EUID} where (EUID1=EUID1 AND EUID2=EUID1);
End;
Run;
Thanks for your help!
There is no need to work back and forth thru the data file. You just need to retain the replacement information so that you can process the file in a single pass.
One way to do that is to make a temporary array using the values of the ID variables as the index. That is easy to do for your simple example with small ID values.
So for example if all of the ID values are integers between 1 and 1000 then this step will do the job.
data want ;
set have ;
array xx (1000) _temporary_;
do while (not missing(xx(id1))); id1=xx(id1); end;
do while (not missing(xx(id2))); id2=xx(id2); end;
output;
xx(id1)=id2;
run;
You probably need to add a test to prevent cycles (1 -> 2 -> 1).
For a more general solution you should replace the array with a hash object instead. So something like this:
data want ;
if _n_=1 then do;
declare hash h();
h.definekey('old');
h.definedata('new');
h.definedone();
call missing(new,old);
end;
set have ;
do while (not h.find(key:id1)); id1=new; end;
do while (not h.find(key:id2)); id2=new; end;
output;
h.add(key: id1,data: id2);
drop old new;
run;
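As for the cycle warning above: because both IDs are resolved to values that are not yet keys in the hash before a new mapping is added, the only loop this step can create is a self-mapping, which happens when both IDs resolve to the same survivor. A minimal sketch of a guard, assuming numeric IDs and assuming that such rows can simply be dropped (as the IF ID1=ID2 THEN DELETE in the question suggests):

```sas
data want;
  if _n_ = 1 then do;
    declare hash h();
    h.defineKey('old');
    h.defineData('new');
    h.defineDone();
    call missing(old, new);   /* create the host variables for the hash */
  end;
  set have;
  /* follow the chain of replacements until a surviving ID is reached */
  do while (h.find(key: id1) = 0); id1 = new; end;
  do while (h.find(key: id2) = 0); id2 = new; end;
  /* guard: a row whose IDs resolve to the same survivor would add id -> id */
  if id1 ne id2 then do;
    output;
    h.add(key: id1, data: id2);
  end;
  drop old new;
run;
```

Without the guard, a later lookup of that ID would never leave the DO WHILE loop.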
Here's an implementation of the algorithm you've suggested, using a modify statement to load and rewrite each row one at a time. It works with your trivial example but with messier data you might get duplicate values in ID1.
data have;
input ID1 ID2 ;
datalines;
1 2
3 4
2 5
6 1
1 7
5 8
;
run;
title "Before making replacements";
proc print data = have;
run;
/*Optional - should improve performance at cost of increased memory usage*/
sasfile have load;
data have;
do i = 1 to nobs;
do j = i to nobs;
modify have point = j nobs = nobs;
/* Make copies of target and replacement value for this pass */
if j = i then do;
id1_ = id1;
id2_ = id2;
end;
else do;
flag = 0; /* Keep track of whether we made a change */
if id1 = id1_ then do;
id1 = id2_;
flag = 1;
end;
if id2 = id1_ then do;
id2 = id2_;
flag = 1;
end;
if flag then replace; /* Only rewrite the row if we made a change */
end;
end;
end;
stop;
run;
sasfile have close;
title "After making replacements";
proc print data = have;
run;
Please bear in mind that as this modifies the dataset in place, interrupting the data step while it is running could result in data loss. Make sure you have a backup first in case you need to roll your changes back.
Seems like this should do the trick and is fairly straightforward. Let me know if it is what you are looking for:
data have;
input id1 id2;
datalines;
1 2
3 4
2 5
6 1
1 7
5 8
;
run;
%macro test();
proc sql noprint;
select count(*) into: cnt
from have;
quit;
%do i = 1 %to &cnt;
proc sql noprint;
select id1,id2 into: id1, :id2
from have
where monotonic() = &i;quit;
data have;
set have;
if (_n_ > input("&i",8.))then do;
if (id1 = input("&id1",8.))then id1 = input("&id2",8.);
if (id2 = input("&id1",8.))then id2 = input("&id2",8.);
end;
run;
%end;
%mend test;
%test();
This might be a little faster:
data have2;
input id1 id2;
datalines;
1 2
3 4
2 5
6 1
1 7
5 8
;
run;
%macro test2();
proc sql noprint;
select count(*) into: cnt
from have2;
quit;
%do i = 1 %to &cnt;
proc sql noprint;
select id1,id2 into: id1, :id2
from have2
where monotonic() = &i;
update have2 set id1 = &id2
where monotonic() > &i
and id1 = &id1;
quit;
proc sql noprint;
update have2 set id2 = &id2
where monotonic() > &i
and id2 = &id1;
quit;
%end;
%mend test2;
%test2();

Does SAS have an equivalent function to all() or any() in R

In R you can perform a condition across all rows in a column variable by using the all() or any() function. Is there an equivalent method in SAS?
I want a condition that returns TRUE if ANY row in column x is negative.
Or one that returns TRUE if ALL rows in column y are negative.
For example
x y
-1 -2
2 -4
3 -4
4 -3
In R:
all(x<0) would give the output FALSE
all(y<0) would give the output TRUE
I wish to replicate the same column-wise operation in SAS.
For completeness' sake, here is the SAS/IML solution. Of course, it's trivial, as the any and all functions exist by the same name. I also include an example of using loc to identify the positive elements.
data have ;
input x y @@;
cards;
1 2
2 4
3 -4
4 -3
;
run;
proc iml;
use have;
read all var {"x" "y"};
print x y;
x_all = all(x>0);
x_any = any(x>0);
y_all = all(y>0);
y_any = any(y>0);
y_pos = y[loc(y>0)];
print x_all x_any y_all y_any;
print y_pos;
quit;
If you want to operate on all observations, that might be easiest to do using SQL summary functions.
SAS evaluates boolean expressions as 1 for true and 0 for false. So to find out whether any observation meets a condition, test whether MAX( condition ) is true (i.e., equal to 1). To find out whether all observations meet the condition, test whether MIN( condition ) is true.
data have ;
input x y @@;
cards;
-1 -2 2 -4 3 -4 4 -3
;
proc sql ;
create table want as
select
min(x<0) as ALL_X
, max(x<0) as ANY_X
, min(y<0) as ALL_Y
, max(y<0) as ANY_Y
from have
;
quit;
Result
Obs ALL_X ANY_X ALL_Y ANY_Y
1 0 1 1 1
SQL probably feels the most similar to the R approach, but the data step is just as efficient and lends itself a bit better to any sort of modification. Frankly, if you're trying to learn SAS, it is probably the way to go, simply because it teaches you how to do things the SAS way.
data want;
set have end=eof;
retain any_x all_x; *persist the value across rows;
any_x = max(any_x, (x>0)); *(x>0) 1=true 0=false, keep the MAX (so keep any true);
all_x = min(all_x, (x>0)); *(x>0) keep the MIN (so keep any false);
if eof then output; *only output the final row to a dataset;
*and/or;
if eof then do; *here we output the any/all values to macro variables;
call symputx('any_x',any_x); *these macro variables can further drive logic;
call symputx('all_x',all_x); *and exist in a global scope (unless we define otherwise);
end;
run;

SAS: How to output the last observation in a sequence of a SAS data set

I want to output the last observation of each run of a variable that is an integer sequence in a SAS data set.
I have this data set:
data have;
input seq var;
datalines;
1 7
2 6
3 3
1 1
2 4
1 8
2 9
3 1
4 8
;
run;
I would like to achieve the following:
seq var
3 3
2 4
4 8
I have thoroughly searched for my answer online but couldn't find anything.
You can use a look-ahead technique. This is one of many ways to write it.
data last;
set have end=eof;
if not eof then set have(firstobs=2 keep=seq rename=(seq=nseq));
if nseq eq 1 or eof then output;
drop nseq;
run;
Just to give an indication of the slickness of the look-ahead approach - you can do the same thing with lag, but it takes nearly twice as many lines of code:
data want(drop=prev_:);
set have end = eof;
prev_seq = lag(seq);
prev_var = lag(var);
if seq < prev_seq then do;
seq = prev_seq;
var = prev_var;
end;
if eof or seq = prev_seq;
run;
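A third option, shown here only as a sketch, is to number the sequences first and then use BY-group processing. It assumes that every sequence starts at seq = 1; the data set names grouped and last_by_group and the variable group are made up for this illustration.

```sas
/* Step 1: start a new group number whenever the sequence restarts at 1. */
data grouped;
  set have;
  if seq = 1 then group + 1;   /* sum statement: group is retained and starts at 0 */
run;

/* Step 2: keep only the last observation of each group. */
data last_by_group(drop=group);
  set grouped;
  by group;
  if last.group;
run;
```

For the sample data this keeps the rows (3, 3), (2, 4) and (4, 8). It needs two passes over the data, so the look-ahead version is still the more compact choice.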

How do I perform a calculation on the last n observations

How can I perform a calculation on the last n observations in a data set?
For example, if I have 10 observations, I would like to create a variable that sums the last 5 values of another variable. Please do not suggest that I lag 5 times or use module ( N ). I need a slightly more elegant solution than that.
In the code below, alpha is the data set that I have and bravo is the one I need.
data alpha;
input lima @@ ;
cards ;
3 1 4 21 3 3 2 4 2 5
;
run ;
data bravo;
input lima juliet;
cards;
3 .
1 .
4 .
21 .
3 32
3 32
2 33
4 33
2 14
5 16
;
run;
thank you in advance!
You can do this in the data step or using PROC EXPAND from SAS/ETS if available.
For the data step the idea is that you start with a cumulative sum (summ), but keep track of the number of values that were added so far (ninsum). Once that reaches 5, you start outputting the cumulative sum to the target variable (juliet), and from the next step you start subtracting the lagged-5 value to only store the sum of the last five values.
data beta;
set alpha;
retain summ ninsum 0;
summ + lima;
ninsum + 1;
l5 = lag5(lima);
if ninsum = 6 then do;
summ = summ - l5;
ninsum = ninsum - 1;
end;
if ninsum = 5 then do;
juliet = summ;
end;
run;
proc print data=beta;
run;
However, there is a procedure that can do all kinds of cumulative and moving-window calculations: PROC EXPAND, in which this is really just one line. We just tell it to calculate the backward moving sum in a window of width 5 and set the first 4 observations to missing (by default it will expand your series by 0's on the left).
proc expand data=alpha out=gamma;
convert lima = juliet / transformout=(movsum 5 trimleft 4);
run;
proc print data=gamma;
run;
Edit
If you want to do more complicated calculations, you need to carry the previous values in retained variables. I thought you wanted to avoid that, but here it is:
data epsilon;
set alpha;
array lags {5};
retain lags1 - lags5;
/* shift the stored values and put the current value first, */
/* so that lags1-lags5 hold the last five values including this one */
do i=5 to 2 by -1;
lags{i} = lags{i-1};
end;
lags{1} = lima;
/* do whatever calculation is needed; */
/* juliet stays missing until five values are available */
juliet = 0;
do i=1 to 5;
juliet = juliet + lags{i};
end;
drop i;
run;
proc print data=epsilon;
run;
I can offer a rather ugly solution:
1. Run a data step and add an increasing number to each group.
2. Run an SQL step and add a column of max(group).
3. Run another data step and check if the value from (2)-(1) is less than 5. If so, assign to the _num_to_sum_ variable (for example) the value that you want to sum, otherwise leave it blank or assign 0.
4. Finally, do an SQL step with sum(_num_to_sum_) and group the results by the grouping variable from (1).
EDIT: I have added a live example of the concept in a bit more compacted way.
data sourcetable;
input var1 $ var2;
cards;
aaa 3
aaa 5
aaa 7
aaa 1
aaa 11
aaa 8
aaa 6
bbb 3
bbb 2
bbb 4
bbb 6
;
run;
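data step1;
set sourcetable;
by var1;
retain obs 0;
/* number the observations within each group: 1, 2, 3, ... */
if first.var1 then obs = 0;
obs = obs+1;
run;
proc sql;
create table rezults as
select var1, sum(var2) as needed_summs
from (select var1, var2, obs, max(obs) as max_obs
from step1
group by var1)
/* keep only the last 5 observations of each group */
where max_obs - obs < 5
group by var1;
quit;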
In case anyone reads this :)
I solved it the way I needed it to be solved, although now I am more curious which of the two (the retain approach or my solution) is faster in terms of processing time.
Here is my solution:
data bravo(keep = lima juliet);
set alpha;
/* from the 5th observation on, re-read the last 5 rows by observation number */
if _n_ >= 5 then do i=_n_ to _n_-4 by -1;
set alpha(rename=(lima=prev)) point=i;
juliet=sum(juliet,prev);
end;
run;

SAS Transpose without using PROC Transpose

I have a question about transposing data without using PROC Transpose.
0 a b c
1 dog cat camel
2 9 7 2534
Without using PROC TRANSPOSE, how can I get a resulting dataset of:
Animals Weight
1 dog 9
2 cat 7
3 camel 2534
This is a bit of a curious request. This example code is hard coded for your 3 variables. You will have to generalize this if needed.
data temp;
input a $ b $ c $;
datalines;
dog cat camel
9 7 2534
;
run;
data animal_weight;
set temp end=last;
format animal animals1-animals3 $8.;
format weight weights1-weights3 best. ;
retain animals: weights:;
array animals[3];
array weights[3];
if _n_ = 1 then do;
animals[1] = a;
animals[2] = b;
animals[3] = c;
end;
else if _n_ = 2 then do;
weights[1] = input(a,best.);
weights[2] = input(b,best.);
weights[3] = input(c,best.);
end;
if last then do;
do i=1 to 3;
animal = animals[i];
weight = weights[i];
output;
end;
end;
drop i animals: weights: a b c;
run;
Read the values into 2 arrays, converting the weights from strings into numbers. Use the _N_ variable to figure out which array to populate. At the end of the data set, output the values in the arrays.
I wouldn't give this as an answer to a homework problem that I actually wanted to get a good grade on (because it's far too advanced, so it's obvious you asked for help); but the hash solution is almost certainly the most flexible and what I'd hope someone doing this in the real world would do (assuming there is a 'don't use proc transpose' real world reason, such as available resources). The problem is somewhat undefined, so this is only moderately fault-tolerant.
data have;
input a $ b $ c $;
datalines;
dog cat camel
9 7 2534
;;;;
run;
data _null_;
set have end=eof;
array charvars _character_;
if _n_ = 1 then do;
length animal $15 weight 8;
declare hash h();
h.defineKey('row');
h.defineData('animal','weight');
h.defineDone();
end;
animal=' ';
weight=.;
do row = 1 to dim(charvars);
rc_f = h.find();
if rc_f ne 0 then do;
animal=charvars[row];
rc_a = h.add();
animal=' ';
end;
else if rc_f eq 0 then do;
weight=input(charvars[row],best12.);
rc_r = h.replace();
end;
end;
if eof then rc_o = h.output(dataset:'want');
run;
Do you always have just two rows, or are both the number of columns and the number of rows dynamic?
If both are dynamic, the ideal way is to use the open function and get the number of columns into a macro variable. This will be the number of rows in your new dataset. Then take the number of rows in your original dataset, which will be the number of columns in your new dataset. This must happen before the actual transpose step. After this you can read the data into an array and, using the macro variables as the dimensions, output the values into the new dataset.
Having said all this, why would you want to reinvent the wheel when SAS already provides a ready-made PROC TRANSPOSE?
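As a rough sketch of the first part of that idea, the dimensions can be read with OPEN and ATTRN through %SYSFUNC. The data set name have is borrowed from the earlier answer, and the macro variable names dsid, ncols, nrows and rc are made up for this illustration.

```sas
/* Read the dimensions of HAVE into macro variables. */
%let dsid  = %sysfunc(open(have));            /* open the data set and get its id */
%let ncols = %sysfunc(attrn(&dsid, nvars));   /* number of variables */
%let nrows = %sysfunc(attrn(&dsid, nlobs));   /* number of non-deleted observations */
%let rc    = %sysfunc(close(&dsid));          /* close the data set again */
%put NOTE: have has &nrows rows and &ncols columns.;
```

The two values could then be used as the array dimensions in the data step that writes the transposed data set.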