I have a SAV file with 5 variables. Two of them contain NA values (stored as the string "NA"). I need to convert NA to an empty (missing) value.
I thought of using "if ... then rename" but with no success: it either prints only one observation or keeps the NA values as they are.
Please assist.
data QUS;
if happy=NA then rename happy="";
if educ=NA then rename educ="";
run;
proc print data=QUS;
run;
You are close.
In SAS you need to put strings in quotes. (Your code compares if variables Happy and NA are the same but it's not working because you don't have variable named NA).
Rename is for renaming variable names.
Also, you need to specify your input dataset in the SET statement.
This will create dataset QUS, which is identical to SAV except that in happy and educ the value NA is replaced with a missing value.
data QUS;
set SAV;
if happy="NA" then happy="";
if educ="NA" then educ="";
run;
proc print data=QUS;
run;
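For contrast, here is a minimal sketch of what each statement does, using a small made-up dataset (SAV_DEMO, QUS_DEMO and the new name happy_clean are hypothetical, not from the question). The assignment changes values; RENAME only changes the variable's name in the output dataset.

```sas
/* Toy data standing in for the real file (assumption) */
data SAV_DEMO;
  length happy educ $10;
  input happy $ educ $;
  datalines;
yes NA
NA college
;
run;

data QUS_DEMO;
  set SAV_DEMO;
  /* Assignment: replaces the string "NA" with a missing (blank) value */
  if happy = "NA" then happy = "";
  if educ  = "NA" then educ  = "";
  /* RENAME: values are untouched, the column is just called happy_clean in QUS_DEMO */
  rename happy = happy_clean;
run;

proc print data=QUS_DEMO;
run;
```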
Here is an example that changes all character or numeric values using arrays.
Something similar can be found in another SO answer and in a SAS Communities answer.
data HAVE;
Length id $5 var1 var2 var3 var4 var5 $15;
Input id $ var1 $ var2 $ var3 $ var4 $ var5 $ num1 num2 num3;
datalines;
00001 NA VALUE NA VALUE NA . 1 .
00002 VALUE . VALUE NA VALUE 1 . 1
00003 NA VALUE NA VALUE NA . 1 .
00004 VALUE . VALUE NA VALUE 1 . 1
00005 NA VALUE NA VALUE NA . 1 .
;
Run;
data WANT;
set HAVE;
array CHAR _character_ ;
array NUM _numeric_ ;
do over CHAR;
if CHAR="NA" then call missing(CHAR);
else if missing(CHAR) then CHAR="WAS MISSING";
end;
do over NUM;
if NUM=1 then call missing(NUM);
else if missing(NUM) then NUM=0;
end;
run ;
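DO OVER relies on implicitly subscripted arrays, which is older syntax. As a rough equivalent, the same logic can be sketched with explicitly indexed arrays; this assumes the HAVE dataset created above, and the names WANT2, chars, nums and i are just illustrative.

```sas
data WANT2;
  set HAVE;
  /* Arrays are declared before i exists, so i is not part of _numeric_ */
  array chars{*} _character_;
  array nums{*}  _numeric_;
  do i = 1 to dim(chars);
    if chars{i} = "NA" then call missing(chars{i});
    else if missing(chars{i}) then chars{i} = "WAS MISSING";
  end;
  do i = 1 to dim(nums);
    if nums{i} = 1 then call missing(nums{i});
    else if missing(nums{i}) then nums{i} = 0;
  end;
  drop i;
run;
```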
Below is the sample data.
NAME  VAR2  VAR3  VAR4  VAR5
ABC   X     Y           2
DEF   P     Q     R     3
GHI   L                 1
The count of variables (from VAR2-VAR4) is present under VAR5 for each record, I want the following output with NewVar as the concatenation of the variables which contain a value.
NAME  VAR2  VAR3  VAR4  VAR5  NewVar
ABC   X     Y           2     X,Y
DEF   P     Q     R     3     P,Q,R
GHI   L                 1     L
I have no clue how to do it in SAS. Any help is appreciated.
Use the CATX() function to concatenate the variables; with this function you have the option to specify the delimiter character to use between the values. Ex. CATX(',',VAR2,VAR3,VAR4)
Input Data:
data have;
input NAME $ VAR2 $ VAR3 $ VAR4 $ VAR5;
datalines;
ABC X Y . 2
DEF P Q R 3
GHI L . . 1
;
run;
Solution:
data want;
set have;
NewVar= catx(',',VAR2,VAR3,VAR4);
run;
or
%let list=VAR2,VAR3,VAR4;
data want2;
set have;
NewVar= catx(',',&list.);
run;
or (Tom's Recommendation)
data want3;
set have;
NewVar= catx(',',of var2-var4);
run;
Output:
NAME=ABC VAR2=X VAR3=Y VAR4= VAR5=2 NewVar=X,Y
NAME=DEF VAR2=P VAR3=Q VAR4=R VAR5=3 NewVar=P,Q,R
NAME=GHI VAR2=L VAR3= VAR4= VAR5=1 NewVar=L
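Since VAR5 is supposed to hold the number of filled variables, it may be worth cross-checking it while building NewVar. A small sketch, assuming the HAVE dataset above; the names check, n_filled and count_ok are made up for illustration. CMISS counts missing arguments of either type, so 3 minus that count is the number of non-missing values.

```sas
data check;
  set have;
  NewVar   = catx(',', of VAR2-VAR4);      /* CATX skips missing values */
  n_filled = 3 - cmiss(of VAR2-VAR4);      /* non-missing among the 3 variables */
  count_ok = (n_filled = VAR5);            /* 1 if VAR5 agrees, 0 otherwise */
run;

proc print data=check;
run;
```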
I have a dataset with a lot of lines and I'm studying a group of variables.
For each line and each variable, I want to know if the value is equal to the max for this variable or more than or equal to 10.
Expected output (with input as all variables without _B) :
(you can replace T/F by TRUE/FALSE or 1/0 as you wish)
+----+------+--------+------+--------+------+--------+
| ID | Var1 | Var1_B | Var2 | Var2_B | Var3 | Var3_B |
+----+------+--------+------+--------+------+--------+
| A | 1 | F | 5 | F | 15 | T |
| B | 1 | F | 5 | F | 7 | F |
| C | 2 | T | 5 | F | 15 | T |
| D | 2 | T | 6 | T | 10 | T |
+----+------+--------+------+--------+------+--------+
Note that for Var3, the max is 15 but since 15>=10, any value >=10 will be counted as TRUE.
Here is what I've come up with so far (I doubt it will be of any help, but still):
%macro pleaseWorkLittleMacro(table, var, suffix);
proc means NOPRINT data=&table;
var &var;
output out=Varmax(drop=_TYPE_ _FREQ_) max=;
run;
proc transpose data=Varmax out=Varmax(rename=(COL1=varmax));
run;
data Varmax;
set Varmax;
varmax = ifn(varmax<10, varmax, 10);
run; /* this outputs the max for every column, but how to use it afterward ? */
%mend;
%pleaseWorkLittleMacro(MY_TABLE, VAR1 VAR2 VAR3 VAR4, _B);
I have the code in R, works like a charm but I really have to translate it to SAS :
#in a for loop over variable names, db is my data.frame, x is the
#current variable name and x2 is the new variable name
x.max = max(db[[x]], na.rm=T)
x.max = ifelse(x.max<10, x.max, 10)
db[[x2]] = (db[[x]] >= x.max) %>% mean(na.rm=T) %>% percent(2)
An old-school solution would be to read the data twice in one data step:
data expect ;
input ID $ Var1 Var1_B $ Var2 Var2_B $ Var3 Var3_B $ ;
cards;
A 1 F 5 F 15 T
B 1 F 5 F 7 F
C 2 T 5 F 15 T
D 2 T 6 T 10 T
;
run;
data my_input;
set expect;
keep ID Var1 Var2 Var3 ;
proc print;
run;
* It is a good habit to declare the most volatile things in your code as macro variables;
%let varList = Var1 Var2 Var3;
%let markList = Var1_B Var2_B Var3_B;
%let varCount = 3;
* Read the data twice;
data my_result;
set my_input (in=maximizing)
my_input (in=marking);
* Declare and initialize arrays;
format &markList $1.;
array _vars [&varCount] &varList;
array _maxs [&varCount] _temporary_;
array _B [&varCount] &markList;
if _N_ eq 1 then do _varNr = 1 to &varCount;
_maxs(_varNr) = -1E15;
end;
* While reading the first time, calculate the maxima;
if maximizing then do _varNr = 1 to &varCount;
if _vars(_varNr) gt _maxs(_varNr) then _maxs(_varNr) = _vars(_varNr);
end;
* While reading the second time, mark every value that equals the maximum or is at least 10;
if marking then do _varNr = 1 to &varCount;
if _vars(_varNr) eq _maxs(_varNr) or _vars(_varNr) ge 10
then _B(_varNr) = 'T';
else _B(_varNr) = 'F';
end;
* Drop all variables starting with an underscore, i.e. all my working variables;
drop _:;
* Only keep results when reading for the second time;
if marking;
run;
* Check results;
proc print;
var ID Var1 Var1_B Var2 Var2_B Var3 Var3_B;
proc compare base=expect compare=my_result;
run;
This is quite simple to solve in SQL:
proc sql;
create table my_result as
select t.*
, (Var1 eq max_Var1 or Var1 ge 10) as Var1_B
, (Var2 eq max_Var2 or Var2 ge 10) as Var2_B
, (Var3 eq max_Var3 or Var3 ge 10) as Var3_B
from my_input as t
, (select max(Var1) as max_Var1
, max(Var2) as max_Var2
, max(Var3) as max_Var3
from my_input) as m
;
quit;
(Not tested, as our SAS server is currently down, which is the reason I pass my time on Stack Overflow)
If you need that for a lot of variables, consult the system view VCOLUMN of SAS:
proc sql noprint;
select '(' || strip(name) || ' eq max_' || strip(name) || ' or ' || strip(name) || ' ge 10) as ' || strip(name) || '_B'
, 'max(' || strip(name) || ') as max_' || strip(name)
into : create_B separated by ', '
, : select_max separated by ', '
from sasHelp.vcolumn
where libName eq 'WORK'
and memName eq 'MY_INPUT'
and type eq 'num'
and upcase(name) like 'VAR%'
;
create table my_result as
select t.*, &create_B
from my_input as t
, (select &select_max from my_input) as m
;
quit;
(Again not tested)
After Proc MEANS computes the maximum value for each column you can run a data step that combines the original data with the maximums.
data want;
length
ID $1 Var1 8 Var1_B $1. Var2 8 Var2_B $1. Var3 8 Var3_B $1. var4 8 var4_B $1;
input
ID Var1 Var1_B Var2 Var2_B Var3 Var3_B ; datalines;
A 1 F 5 F 15 T
B 1 F 5 F 7 F
C 2 T 5 F 15 T
D 2 T 6 T 10 T
;
run;
data have;
set want;
drop var1_b var2_b var3_b var4_b;
run;
proc means NOPRINT data=have;
var var1-var4;
output out=Varmax(drop=_TYPE_ _FREQ_) max= / autoname;
run;
The neat thing about the VAR statement is that you can easily list numerically suffixed variable names. The autoname option automatically appends _<statistic> (here _Max) to the names of the variables in the output.
Now combine the maxes with the original (have). The set varmax automatically retains the *_max variables, and they will not get overwritten by values from the original data because the varmax variable names are different.
Arrays are used to iterate over the values and apply the business logic of flagging a row as at max or above 10.
data want;
if _n_ = 1 then set varmax; * read maxes once from MEANS output;
set have;
array values var1-var4;
array maxes var1_max var2_max var3_max var4_max;
array flags $1 var1_b var2_b var3_b var4_b;
do i = 1 to dim(values); drop i;
flags(i) = ifc(min(10,maxes(i)) <= values(i),'T','F');
end;
run;
The difficult part above is that the MEANS output creates variables that can not be listed using the var1 - varN syntax.
When you adjust the naming convention to have all your conceptually grouped variable names end in numeric suffixes the code is simpler.
* number suffixed variable names;
* no autoname, group rename on output;
proc means NOPRINT data=have;
var var1-var4;
output out=Varmax(drop=_TYPE_ _FREQ_ rename=(var1-var4=max_var1-max_var4)) max= ;
run;
* all arrays simpler and use var1-varN;
data want;
if _n_ = 1 then set varmax;
set have;
array values var1-var4;
array maxes max_var1-max_var4;
array flags $1 flag_var1-flag_var4;
do i = 1 to dim(values); drop i;
flags(i) = ifc(min(10,maxes(i)) <= values(i),'T','F');
end;
run;
You can use macro code or arrays, but it might just be easier to transform your data into a tall variable/value structure.
So let's input your test data as an actual SAS dataset.
data expect ;
input ID $ Var1 Var1_B $ Var2 Var2_B $ Var3 Var3_B $ ;
cards;
A 1 F 5 F 15 T
B 1 F 5 F 7 F
C 2 T 5 F 15 T
D 2 T 6 T 10 T
;
First you can use PROC TRANSPOSE to make the tall structure.
proc transpose data=expect out=tall ;
by id ;
var var1-var3 ;
run;
Now your rules are easy to apply in PROC SQL step. You can derive a new name for the flag variable by appending a suffix to the original variable's name.
proc sql ;
create table want_tall as
select id
, cats(_name_,'_Flag') as new_name
, case when col1 >= min(max(col1),10) then 'T' else 'F' end as value
from tall
group by 2
order by 1,2
;
quit;
Then just flip it back to horizontal and merge with the original data.
proc transpose data=want_tall out=flags (drop=_name_);
by id;
id new_name ;
var value ;
run;
data want ;
merge expect flags;
by id;
run;
I have a series of string values with missing observations. I would like to use flat substitution. For instance variable x has 3 available values. There should be a 33.333% chance that a missing value will be assigned to the available values for x under this substitution method. How would I do this?
DATA have;
INPUT id a $ b $ c $ x;
CARDS;
1 Y Male . 5
2 Y Female . 4
3 . Female Tall 4
4 Y . Short 2
5 N Male Tall 1
;
Run;
You could use temporary arrays to store the possible values. Then generate a random index into the array.
DATA have;
INPUT id a $ b $ c $ x;
CARDS;
1 Y Male . 5
2 Y Female . 4
3 . Female Tall 4
4 Y . Short 2
5 N Male Tall 1
;
data want ;
set have ;
array possible_b (2) $8 _temporary_ ('Male','Female') ;
if missing(b) then b=possible_b(1+int(rand('uniform')*dim(possible_b)));
run;
I did this with generating random numbers and hard coding the limits. There should be an easier way to do this, but for the purposes of the question this should work.
option missing='';
data begin;
input a $;
cards;
a
.
b
c
.
e
.
f
g
h
.
.
j
.
;
run;
data intermediate;
set begin;
if a EQ '' then help= rand("uniform");
else help=.;
run;
data wanted;
set intermediate;
if a EQ '' then do;
if 0<=help<0.33 then a='V1';
else if 0.33<=help<0.66 then a='V2';
else if 0.66<=help then a='V3';
end;
drop help;
run;
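One possibly simpler variant is to let RAND('TABLE', ...) pick the index, so the limits need not be hard-coded. This is only a sketch: it assumes the BEGIN dataset above and the same three replacement values V1, V2, V3; the names wanted2 and fill and the seed are arbitrary. With two probabilities of 1/3 supplied, RAND('TABLE') returns 1 or 2 with probability 1/3 each and 3 with the remaining probability.

```sas
data wanted2;
  set begin;
  /* Candidate replacement values, kept out of the output dataset */
  array fill{3} $8 _temporary_ ('V1', 'V2', 'V3');
  /* Fix the seed so the run is repeatable (seed value is arbitrary) */
  call streaminit(12345);
  /* Returns 1, 2 or 3, each with probability 1/3 */
  if missing(a) then a = fill{rand('table', 1/3, 1/3)};
run;
```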
Given the following dataset:
obs var1 var2 var3
1 123 456 .
2 123 . 789
3 . 456 789
How does one go about appending all the variables into a single variable whilst ignoring the empty observations (denoted by ".")?
Desired output:
obs var4
1 123
2 123
3 456
4 456
5 789
6 789
Data step:
data have;
input
var1 var2 var3; cards;
123 456 .
123 . 789
. 456 789
;run;
Not sure why you read the numbers in as char, but if I change to num, it could be done like this:
data have;
input var1 var2 var3;
cards;
123 456 .
123 . 789
. 456 789
;run;
data want (keep=var4);
set have;
var4=var1;if var4 ne . then output;
var4=var2;if var4 ne . then output;
var4=var3;if var4 ne . then output;
run;
OK, let's assume you have a file with the values in it, and you do not know how many variables are in each row. First I need to create a sample text file:
filename x temp;
data _nulL_;
file x;
put "123 456 . ";
put "123 . 789 ";
put ". 456 789 ";
run;
Then I need to read the first line and count the number of variables:
data _null_;
infile x;
input;
call symputx("number_of_variables",put(countw(_infile_," ","c"),best.));
stop;
run;
%put &number_of_variables;
Now I can dynamically read the variables:
%macro doit();
data have;
infile x;
input
%do i=1 %to &number_of_variables;
var&i
%end;
;
run;
data want (keep=var%eval(&number_of_variables + 1));
set have;
%do i=1 %to &number_of_variables;
var%eval(&number_of_variables + 1)=var&i;
if var%eval(&number_of_variables + 1) ne . then output;
%end;
run;
%mend;
%doit;
You can use proc transpose to do this but there is a trick to doing so. You will need to append a unique identifier to each row, prior to doing the transpose.
I've taken @Stig's sample data and added the observation number to use as a unique identifier:
data have;
input var1 var2 var3;
x = _n_; * ADDING A UNIQUE IDENTIFIER TO EVERY ROW;
cards;
123 456 .
123 . 789
. 456 789
;run;
Then it's simply a case of running proc transpose:
proc transpose data=have out=xx;
by x;
run;
And finally, remove any results where col1 is missing, and add in the observation number:
data want;
obs = _n_;
set xx (keep=col1);
where col1 ne .;
run;
As the order is not important then you can do this in one step, using arrays. As the data step moves through each row, the array enables the variable values to be stored in memory, so you can loop through them. I've set it up so that each time a non-missing value is found, then output it to the new variable.
In creating the array, I've set it to var1--var3, the double dash means all variables between var1 and var3 inclusive. If your real variables are numbered the same way then you can use var1-var3, which means all sequential numbers between the two variables.
data have;
input var1 var2 var3;
datalines;
123 456 .
123 . 789
. 456 789
;
run;
data want;
set have;
array allnums var1--var3;
do i = 1 to dim(allnums);
if not missing(allnums{i}) then do;
var4 = allnums{i};
output;
end;
end;
drop var1--var3 i;
run;
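To illustrate the difference between the two list styles, here is a small sketch with a made-up dataset (demo, lists, other and the array names are all hypothetical). The double dash follows the position of variables in the dataset, so anything sitting between var1 and var3 is included; the single dash follows the numeric suffix only.

```sas
data demo;
  input var1 other var2 var3;
  datalines;
1 99 2 3
;
run;

data lists;
  set demo;
  array by_position var1--var3;   /* var1, other, var2, var3 */
  array by_number   var1-var3;    /* var1, var2, var3 */
  n_position = dim(by_position);  /* 4 elements */
  n_number   = dim(by_number);    /* 3 elements */
  put n_position= n_number=;
run;
```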
I would like to create a variable called DATFL that would have the following value for the last observation:
DATFL
gender/scan
Here is the code :
data mix_ ;
input id $ name $ gender $ scan $;
datalines;
1 jon M F
2 jill F L
3 james F M
4 jonas M M
;
run;
data mix_3; set mix_;
length datfl datfl_ $ 50;
array m4(*) id name gender scan;
retain datfl;
do i=1 to dim(m4);
if index(m4(i) ,'M') then do;
datfl_=vname(m4(i)) ;
if missing(datfl) then datfl=datfl_;
else datfl=strip(datfl)||"/"||datfl_;
end;
end;
run;
Unfortunately, the value I get for DATFL at the last observation is 'gender/scan/gender/scan'. Because of the RETAIN statement I used for DATFL, I end up with duplicates. At the end of this data step I was planning to use a CALL SYMPUT statement to load the last value into a macro variable, but I won't do that until I fix this issue. Can anyone provide guidance on how to prevent DATFL from having duplicate values at the end of the dataset? Cheers
sas_kappel
Don't retain DATFL. Instead, retain DATFL_.
data mix_3; set mix_;
length datfl datfl_ $ 50;
array m4(*) id name gender scan;
retain datfl_;
do i=1 to dim(m4);
if index(m4(i) ,'M') then do;
datfl_=vname(m4(i)) ;
if missing(datfl) then datfl=datfl_;
else datfl=strip(datfl)||"/"||datfl_;
end;
end;
if missing(datfl) then datfl = datfl_;
run;
It doesn't work. Let me change the dataset (mix_) so you can see that RETAIN DATFL_ is not working in this scenario.
data mix_ ;
input id $ name $ gender $ scan $;
datalines;
1 jon M M
2 Marc F L
3 james F M
4 jonas H M
;
run;
To summarize, what I want is the DISTINCT values of DATFL in a macro variable. For each record, my code searches for variables containing the letter M; if one is found, DATFL receives that array variable's name. Multiple variable names are separated by '/'. For the following records it should do the same, BUT add only variable names that satisfy the condition AND are not already in DATFL. Currently, when I run my program, DATFL at observation 4 is gender/scan/name/scan/scan, but I would like DATFL=gender/scan/name, because those are the distinct values. Ultimately, I will then write the following code:
if eof then CALL SYMPUT('DATFL',datfl);
sas_kappel
Your revised data makes it much clearer what you're looking for. Here is some code that should give the correct result.
I've used the CALL CATX routine to add new values to DATFL, separated by a /. It first checks that the relevant variable name doesn't already exist in the string.
data mix_ ;
input id $ name $ gender $ scan $;
datalines;
1 jon M M
2 Marc F L
3 james F M
4 jonas H M
;
run;
data _null_;
set mix_ end=eof;
length datfl $100; /*or whatever*/
retain datfl;
array m4{*} $ id name gender scan;
do i = 1 to dim(m4);
if index(m4{i},'M') and not index(datfl,vname(m4{i})) then call catx('/',datfl,vname(m4{i}));
end;
if eof then call symput('DATFL', datfl);
run;
%put datfl = &DATFL.;
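One caveat with INDEX is that it matches substrings, so a variable called scan would be treated as already present if DATFL contained, say, scan2. If that could matter, a whole-word check with FINDW is one option. The sketch below is a variant under that assumption, reusing the mix_ dataset above; vn is a helper variable introduced here, and CALL SYMPUTX is swapped in to strip trailing blanks from the macro variable.

```sas
data _null_;
  set mix_ end=eof;
  length datfl $100 vn $32;
  retain datfl;
  array m4{*} $ id name gender scan;
  do i = 1 to dim(m4);
    vn = vname(m4{i});
    /* FINDW looks for vn as a complete '/'-delimited word;    */
    /* modifier i ignores case, t trims trailing blanks        */
    if index(m4{i}, 'M') and not findw(datfl, vn, '/', 'it') then
      call catx('/', datfl, vn);
  end;
  if eof then call symputx('DATFL', datfl);
run;
%put datfl = &DATFL.;
```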