Read wide file with repeated variables in SAS - sas

I have input data shaped like this:
var1 var2 var3 var2 var3 ...
where each row has one value of var1 followed by a varying number of var2-var3 pairs. After reading this input, I want the data set to have multiple records for each var1: one record for each pair of var2/var3.
So if the first two lines of the input file are
A 1 2 7 3 4 5
B 2 3
this would generate 4 records:
A 1 2
A 7 3
A 4 5
B 2 3
Is there an simple/elegant way to do this? I've tried reading each row as one long variable and splitting with scan but it's getting messy and I'm betting there's a really easy way to do this.

I'm sure there are many ways to do this, but here is the first that comes to my mind:
data want(keep=var1 var2 var3);
infile 'path-to-your-file';
input;
var1 = input(scan(_infile_,1),$8.);
i = 1;
do while(i ne 0);
i + 1;
var2 = input(scan(_infile_,i),8.);
i + 1;
var3 = input(scan(_infile_,i),8.);
if var3 = . then i = 0;
else output;
end;
run;
_infile_ is an automatic SAS variable that contains the currently read record. Use an appropriate informat for each variable you read.

Like this (conditional input with jumping back):
data test;
infile datalines missover;
input var1 $ var2 $ var3 $ temp $ #;
output;
do while(not missing(temp));
input +(-2) var2 $ var3 $ temp $ #;
output;
end;
drop temp;
datalines;
A 1 2 7 3 4 5
B 2 3
;
run;

Related

SAS Concatenation based on values

Below is the sample data.
NAME VAR2 VAR3 VAR4 VAR5
ABC X Y 2
DEF P Q R 3
GHI L 1
The count of variables (from VAR2-VAR4) is present under VAR5 for each record, I want the following output with NewVar as the concatenation of the variables which contain a value.
NAME VAR2 VAR3 VAR4 VAR5 NewVar
ABC X Y 2 X,Y
DEF P Q R 3 P,Q,R
GHI L 1 L
I have no clue how to do it in SAS. Any help is appreciated.
Use the CATX() function to concatenate the variables; with this function you have the option to specify the delimiter character to use between the values. Ex. CATX(',',VAR2,VAR3,VAR4)
Input Data:
data have;
input NAME $ VAR2 $ VAR3 $ VAR4 $ VAR5;
datalines;
ABC X Y . 2
DEF P Q R 3
GHI L . . 1
;
run;
Solution:
data want;
set have;
NewVar= catx(',',VAR2,VAR3,VAR4);
run;
or
%let list=VAR2,VAR3,VAR4;
data want2;
set have;
NewVar= catx(',',&list.);
run;
or (Tom's Recommendation)
data want3;
set have;
NewVar= catx(',',of var2-var4);
run;
Output:
NAME=ABC VAR2=X VAR3=Y VAR4= VAR5=2 NewVar=X,Y
NAME=DEF VAR2=P VAR3=Q VAR4=R VAR5=3 NewVar=P,Q,R
NAME=GHI VAR2=L VAR3= VAR4= VAR5=1 NewVar=L

Compare value to max of variable

I have a dataset with a lot of lines and I'm studying a group of variables.
For each line and each variable, I want to know if the value is equal to the max for this variable or more than or equal to 10.
Expected output (with input as all variables without _B) :
(you can replace T/F by TRUE/FALSE or 1/0 as you wish)
+----+------+--------+------+--------+------+--------+
| ID | Var1 | Var1_B | Var2 | Var2_B | Var3 | Var3_B |
+----+------+--------+------+--------+------+--------+
| A | 1 | F | 5 | F | 15 | T |
| B | 1 | F | 5 | F | 7 | F |
| C | 2 | T | 5 | F | 15 | T |
| D | 2 | T | 6 | T | 10 | T |
+----+------+--------+------+--------+------+--------+
Note that for Var3, the max is 15 but since 15>=10, any value >=10 will be counted as TRUE.
Here is what I've maid up so far (doubt it will be any help but still) :
%macro pleaseWorkLittleMacro(table, var, suffix);
proc means NOPRINT data=&table;
var &var;
output out=Varmax(drop=_TYPE_ _FREQ_) max=;
run;
proc transpose data=Varmax out=Varmax(rename=(COL1=varmax));
run;
data Varmax;
set Varmax;
varmax = ifn(varmax<10, varmax, 10);
run; /* this outputs the max for every column, but how to use it afterward ? */
%mend;
%pleaseWorkLittleMacro(MY_TABLE, VAR1 VAR2 VAR3 VAR4, _B);
I have the code in R, works like a charm but I really have to translate it to SAS :
#in a for loop over variable names, db is my data.frame, x is the
#current variable name and x2 is the new variable name
x.max = max(db[[x]], na.rm=T)
x.max = ifelse(x.max<10, x.max, 10)
db[[x2]] = (db[[x]] >= x.max) %>% mean(na.rm=T) %>% percent(2)
An old school sollution would be to read the data twice in one data step;
data expect ;
input ID $ Var1 Var1_B $ Var2 Var2_B $ Var3 Var3_B $ ;
cards;
A 1 F 5 F 15 T
B 1 F 5 F 7 F
C 2 T 5 F 15 T
D 2 T 6 T 10 T
;
run;
data my_input;
set expect;
keep ID Var1 Var2 Var3 ;
proc print;
run;
It is a good habit to declare the most volatile things in your code as macro variables.;
%let varList = Var1 Var2 Var3;
%let markList = Var1_B Var2_B Var3_B;
%let varCount = 3;
Read the data twice;
data my_result;
set my_input (in=maximizing)
my_input (in=marking);
Decklare and Initialize arrays;
format &markList $1.;
array _vars [&&varCount] &varList;
array _maxs [&&varCount] _temporary_;
array _B [&&varCount] &markList;
if _N_ eq 1 then do _varNr = 1 to &varCount;
_maxs(_varNr) = -1E15;
end;
While reading the first time, Calculate the maxima;
if maximizing then do _varNr = 1 to &varCount;
if _vars(_varNr) gt _maxs(_varNr) then _maxs(_varNr) = _vars(_varNr);
end;
While reading the second time, mark upt to &maxMarks maxima;
if marking then do _varNr = 1 to &varCount;
if _vars(_varNr) eq _maxs(_varNr) or _vars(_varNr) ge 10
then _B(_varNr) = 'T';
else _B(_varNr) = 'F';
end;
Drop all variables starting with an underscore, i.e. all my working variables;
drop _:;
Only keep results when reading for the second time;
if marking;
run;
Check results;
proc print;
var ID Var1 Var1_B Var2 Var2_B Var3 Var3_B;
proc compare base=expect compare=my_result;
run;
This is quite simple to solve in sql
proc sql;
create table my_result as
select *
, Var1_B = (Var1 eq max_Var1)
, Var1_B = (Var2 eq max_Var2)
, Var1_B = (Var3 eq max_Var3)
from my_input
, (select max(Var1) as max_Var1
, max(Var2) as max_Var2
, max(Var3) as max_Var3)
;
quit;
(Not tested, as our SAS server is currently down, which is the reason I pass my time on Stack Overflow)
If you need that for a lot of variables, consult the system view VCOLUMN of SAS:
proc sql;
select ''|| name ||'_B = ('|| name ||' eq max_'|| name ||')'
, 'max('|| name ||') as max_'|| name
from sasHelp.vcolumn
where libName eq 'WORK'
and memName eq 'MY_RESULT'
and type eq 'num'
and upcase(name) like 'VAR%'
;
into : create_B separated by ', '
, : select_max separated by ', '
create table my_result as
select *, &create_B
, Var1_B = (Var1 eq max_Var1)
, Var1_B = (Var2 eq max_Var2)
, Var1_B = (Var3 eq max_Var3)
from my_input
, (select max(Var1) as max_Var1
, max(Var2) as max_Var2
, max(Var3) as max_Var3)
;
quit;
(Again not tested)
After Proc MEANS computes the maximum value for each column you can run a data step that combines the original data with the maximums.
data want;
length
ID $1 Var1 8 Var1_B $1. Var2 8 Var2_B $1. Var3 8 Var3_B $1. var4 8 var4_B $1;
input
ID Var1 Var1_B Var2 Var2_B Var3 Var3_B ; datalines;
A 1 F 5 F 15 T
B 1 F 5 F 7 F
C 2 T 5 F 15 T
D 2 T 6 T 10 T
run;
data have;
set want;
drop var1_b var2_b var3_b var4_b;
run;
proc means NOPRINT data=have;
var var1-var4;
output out=Varmax(drop=_TYPE_ _FREQ_) max= / autoname;
run;
The neat thing the VAR statement is that you can easily list numerically suffixed variable names. The autoname option automatically appends _ to the names of the variables in the output.
Now combine the maxes with the original (have). The set varmax automatically retains the *_max variables, and they will not get overwritten by values from the original data because the varmax variable names are different.
Arrays are used to iterate over the values and apply the business logic of flagging a row as at max or above 10.
data want;
if _n_ = 1 then set varmax; * read maxes once from MEANS output;
set have;
array values var1-var4;
array maxes var1_max var2_max var3_max var4_max;
array flags $1 var1_b var2_b var3_b var4_b;
do i = 1 to dim(values); drop i;
flags(i) = ifc(min(10,maxes(i)) <= values(i),'T','F');
end;
run;
The difficult part above is that the MEANS output creates variables that can not be listed using the var1 - varN syntax.
When you adjust the naming convention to have all your conceptually grouped variable names end in numeric suffixes the code is simpler.
* number suffixed variable names;
* no autoname, group rename on output;
proc means NOPRINT data=have;
var var1-var4;
output out=Varmax(drop=_TYPE_ _FREQ_ rename=var1-var4=max_var1-max_var4) max= ;
run;
* all arrays simpler and use var1-varN;
data want;
if _n_ = 1 then set varmax;
set have;
array values var1-var4;
array maxes max_var1-max_var4;
array flags $1 flag_var1-flag_var4;
do i = 1 to dim(values); drop i;
flags(i) = ifc(min(10,maxes(i)) <= values(i),'T','F');
end;
run;
You can use macro code or arrays, but it might just be easier to transform your data into a tall variable/value structure.
So let's input your test data as an actual SAS dataset.
data expect ;
input ID $ Var1 Var1_B $ Var2 Var2_B $ Var3 Var3_B $ ;
cards;
A 1 F 5 F 15 T
B 1 F 5 F 7 F
C 2 T 5 F 15 T
D 2 T 6 T 10 T
;
First you can use PROC TRANSPOSE to make the tall structure.
proc transpose data=expect out=tall ;
by id ;
var var1-var3 ;
run;
Now your rules are easy to apply in PROC SQL step. You can derive a new name for the flag variable by appending a suffix to the original variable's name.
proc sql ;
create table want_tall as
select id
, cats(_name_,'_Flag') as new_name
, case when col1 >= min(max(col1),10) then 'T' else 'F' end as value
from tall
group by 2
order by 1,2
;
quit;
Then just flip it back to horizontal and merge with the original data.
proc transpose data=want_tall out=flags (drop=_name_);
by id;
id new_name ;
var value ;
run;
data want ;
merge expect flags;
by id;
run;

Append multiple variable from same dataset into single variable

Given the following dataset:.
obs var1 var2 var3
1 123 456 .
2 123 . 789
3 . 456 789
How does one go about to append all the variables into a single variable whilst ignoring the empty observations (denoted by ".")?
Desired output:.
obs var4
1 123
2 123
3 456
4 456
5 789
6 789
Data step:.
data have;
input
var1 var2 var3; cards;
123 456 .
123 . 789
. 456 789
;run;
Not sure why you read the numbers in as char, but if I change to num, it could be done like this:
data have;
input var1 var2 var3;
cards;
123 456 .
123 . 789
. 456 789
;run;
data want (keep=var4);
set have;
var4=var1;if var4 ne . then output;
var4=var2;if var4 ne . then output;
var4=var3;if var4 ne . then output;
run;
OK, let's assume you have a file vith the values in it, and you do not know how many variables are in each row. First I need to create a sample textfile:
filename x temp;
data _nulL_;
file x;
put "123 456 . ";
put "123 . 789 ";
put ". 456 789 ";
run;
Then I need to read the first line and count the number of variables:
data _null_;
infile x;
input;
call symputx("number_of_variables",put(countw(_infile_," ","c"),best.));
stop;
run;
%put &number_of_variables;
Now I can dynamically read the variables:
%macro doit();
data have;
infile x;
input
%do i=1 %to &number_of_variables;
var&i
%end;
;
run;
data want (keep=var%eval(&number_of_variables + 1));
set have;
%do i=1 %to &number_of_variables;
var%eval(&number_of_variables + 1)=var&i;
if var%eval(&number_of_variables + 1) ne . then output;
%end;
run;
%mend;
%doit;
You can use proc transpose to do this but there is a trick to doing so. You will need to append a unique identifier to each row, prior to doing the transpose.
I've taken #Stig's sample data and added the observation number to use as a unique identifier:
data have;
input var1 var2 var3;
x = _n_; * ADDING A UNIQUE IDENTIFIER TO EVERY ROW;
cards;
123 456 .
123 . 789
. 456 789
;run;
Then it's simply a case of running proc transpose:
proc transpose data=have out=xx;
by x;
run;
And finally, remove any results where col1 is missing, and add in the observation number:
data want;
obs = _n_;
set xx (keep=col1);
where col1 ne .;
run;
As the order is not important then you can do this in one step, using arrays. As the data step moves through each row, the array enables the variable values to be stored in memory, so you can loop through them. I've set it up so that each time a non-missing value is found, then output it to the new variable.
In creating the array, I've set it to var1--var3, the double dash means all variables between var1 and var3 inclusive. If your real variables are numbered the same way then you can use var1-var3, which means all sequential numbers between the two variables.
data have;
input var1 var2 var3;
datalines;
123 456 .
123 . 789
. 456 789
;
run;
data want;
set have;
array allnums var1--var3;
do i = 1 to dim(allnums);
if not missing(allnums{i}) then do;
var4 = allnums{i};
output;
end;
end;
drop var1--var3 i;
run;

How to easly reformat dataset in SAS

Suppose a data are as follows:
A B C
1 3 2
1 4 9
2 6 0
2 7 3
where A B and C are the variable names.
Is there a way to transform the table to
A 1
A 1
A 2
A 2
B 3
B 4
B 6
B 7
C 2
C 9
C 0
C 3
Expanding on the advice from #donPablo, here's how you would code it. Create an array to read across the data, then output each iteration of that array so you end up with the number of rows being the rows * columns from the original dataset. The VNAME function enables you to store the variable name (A, B, C) as a value in a separate variable.
data have;
input A B C;
datalines;
1 3 2
1 4 9
2 6 0
2 7 3
;
run;
data want;
set have;
length var1 $10;
array vars{*} _numeric_;
do i=1 to dim(vars);
var1=vname(vars{i});
var2=vars{i};
keep var1 var2;
output;
end;
run;
proc sort data=want;
by var1;
run;
The least amount of (expensive) development time might be --
Read and store the first row
For each subsequent row
Read the row
Create three records
Until end
Sort
How many times will this be run? Per day/ per year?
What number of rows are there?
Might we save 1 hr / month? 1 min / year? Something will need to read the entire file. Optomize last. Make it work first.
tkx
It should work correctly:
DATA A(keep A);
new_var = 'A';
SET your_data;
RUN;
DATA B(keep B);
new_var = 'B';
SET your_data;
RUN;
DATA C(keep C);
new_var = 'C';
SET your_data;
RUN;
PROC APPEND base=A data=B FORCE;
RUN;
PROC APPEND base=A data=C FORCE;
RUN;
Data A is a result data set.

Test if a variable exists

I want to test if a variable exists and if it doesn't, create it.
The open()&varnum() functions can be used. Non-zero output from varnum() indicates the variable exists.
data try;
input var1 var2 var3;
datalines;
7 2 2
5 5 3
7 2 7
;
data try2;
set try;
if _n_ = 1 then do;
dsid=open('try');
if varnum(dsid,'var4') = 0 then var4 = .;
rc=close(dsid);
end;
drop rc dsid;
run;
data try2;
set try;
var4 = coalesce(var4,.);
run;
(assuming var4 is numeric)
Assign var4 to itself. The assignment will create the variable if it doesn't exist and leave the contents in place if it does.
data try;
input var1 var2 var3;
datalines;
7 2 2
5 5 3
7 2 7
;
data try2;
set try;
var4 = var4;
run;
Just remember that creating var4 this way when it doesn't exist will use the default variable attributes, so you may need to use an explicit attrib statement if you require specific formatting/length etc.
This is a very late answer/comment, but this method works for me and is pretty simple (SAS 9.4). In the below example, I used missing numeric and character variables and assigned a value to the missing character variable is missing.
data try;
input var1 var2 var3;
datalines;
7 2 2
5 5 3
7 2 7
;
data try2;
length var4 $20;
length var5 8;
set try;
var4 = var4;
if var4 = ' ' then var4 = 'Not on Source File';
run;