Rename nonsequential variable names to sequential names in sas - sas

I am working with survey data where the variable names in our database are descriptive, and not sequentially numbered. They are sequential in the database (moving from left to right). I would like to work in my programs with numbered variables, and I have been unsuccessful in trying to rename them programmatically without having to write out every change by hand (there are 87 total variables).
I have tried to use array, but that has not worked since they are not named sequentially nor do they have a common structure (no common prefix or suffix).
Example data is below:
data svy;
input id relationship outburst checkwork goodideas ;
cards;
101 3 4 5 6
102 4 5 6 6
103 1 1 8 1
104 2 3 2 4
;
run;
***** does not work ;
data svy_1; set svy;
rename relationship--goodideas = var01--var04;
run;
quit;
The above code returns the following error in the log:
ERROR: Missing numeric suffix on a numbered variable list (relationship-goodideas).
I would like to rename the variables to something like: var01, var02, etc...
Any help is greatly appreciated.

A few things:
Your data step #2 isn't right - it doesn't have a set statement. Also, it doesn't require 'quit' - quit is only for certain PROCs that generally are 'programming environments', such as PROC SQL, PROC FORMAT, PROC DATASETS. It doesn't do any harm but it looks odd :)
Sequential-in-the-dataset variable lists are double dash. So, you could trivially create an array with these:
array myvars relationship--goodideas;
So if that's good enough for you (no rename), then, go for it. If you really want to rename them (a bit of a bad idea IMO since it takes away some meaning of the variable name, making code harder to read, though I understand the reasoning why you'd want to), you can't use this unfortunately - while it's correct, the RENAME statement does not support it.
82 ***** does not work ;
83 data svy_1;
84 rename relationship--goodideas = var01-var04;
------------
47
ERROR 47-185: Given form of variable list is not supported by RENAME. Statement is ignored.
85 run;
You cannot use an array to perform rename statements, unfortunately; so you'll have to do something else. Here's one answer.
proc contents data=svy out=svy_vars(keep=name varnum) noprint;
run;
proc sort data=svy_vars;
by varnum;
run;
data for_rename;
set svy_vars;
if name in ('relationship' 'outburst' 'checkwork' 'goodideas') then do;
namectr+1;
new_name=cats(name,'=','var',put(namectr,z2.));
output;
end;
run;
proc sql;
select new_name into :renlist separated by ' ' from for_rename;
quit;
proc datasets nolist;
modify svy;
rename &renlist;
quit;
You can do something similar in a shorter fashion using PROC SQL and the DICTIONARY.COLUMNS table, or a data step and SASHELP.VCOLUMN, but the proc contents method is somewhat more transparent as to what's happening. If you have more than four variables, you may want to change that IN statement into a negative statement (if name not in (list of things to not change)) if that's easier, or even use the VARNUM variable itself to determine which variables you want to change (if varnum in (2:5) would work there).

A colleague came up with the best approach:
***** does work ;
data svy_1;
set svy;
array old { 4 } relationship--goodideas;
array var { 4 } ;
do i = 1 to 4;
var[i] = old[i];
end;
drop i;
run;

Related

SAS Missing Values Findings

I am working on a SAS Dataset which has missing values.
I can identify whether a particular variable has missing values using IS NULL/IS MISSING operator.
Is there any alternative way, through which I can identify which variables have missing values in one shot.
Thanks in Advance
The syntax IS NULL or IS MISSING is limited to use in SQL code (also in WHERE statements or WHERE= dataset options since those essentially use the same parser.)
To test if a value is missing you can also use the MISSING() function. Or compare it to a missing value. So for character variables test if it is equal to all blanks: c=' '. For numeric you can test x=., but you also need to look out for special missing values. So you might test if x <= .z.
To get a quick summary of number of distinct missing values for each variable you could use the NLEVEL option on PROC FREQ. Note it might not work for a large dataset with too many distinct values as the procedure will run out of memory.
use array and vname to find variable with missing values. If you want rows with missing values use cmiss function.
data have;
infile datalines missover;
input id num char $ var $;
datalines;
1 . A C
2 3 D
5 6 B D
;
/* gives variables with missing values*/
data want1(keep=miss);
set have;
array chars(*) _character_;
array nums(*) _numeric_;
do i=1 to dim(chars);
if chars(i)=' ' then
miss=vname(chars(i));
if nums(i)=. then
miss=vname(nums(i));
end;
if miss=' ' then
delete;
run;
/* gives rows with missing value*/
data want(drop=rows);
set have;
rows=cmiss(of id -- var);
if rows=1;
run;
You can use proc freq table statement with missing option. It includes missing category if missing values exist. Useful for categorical data.
data example;
input A Freq;
datalines;
1 2
2 2
. 2
;
*list variables in tables statement;
proc freq data=example;
tables A / missing;
run;
You can also use Proc Univariate it creates MissingValues table in ODS by default if any missing values exist. Useful for numeric data.
Two options (in addition to Peter Slezák's) I can suggest are :
- Use proc means with nmiss
proc means data = ___ n nmiss;
var _numeric_;
run;
In SAS Enterprise Guide, there is a characterize data task - this helps profile character variables too. (Under the hood, it is a combination of various procs, but is an easy to use option).
Hope this helps,
regards,
Sundaresh

Adding columns to a dataset in SAS using a for loop

I'm coming at SAS from a Python/R/Stata background, and learning that things are rather different in SAS. I'm approaching the following problem from the standpoint of one of these languages, perhaps SAS isn't up to what I want to do.
I have a panel dataset with an age column in it. I want to add new columns to the dataset using this age column. I'm going to simplify the functions of age to keep it simple in my example.
The goal is to loop over a sequence, and use the value of that sequence at each loop step to 1. assign the name of the new column and 2. assign the values of that column. I'm hoping to get my starting dataset, with new columns added to it taking values spline1 spline2... spline7
data somePath.FinalDataset;
do i = 1 to 7;
if i = 1 then
spline&i. = age;
if i ^= 1 then spline&i. = age + i;
end;
set somePath.StartingDataset;
run;
This code won't even run, though in an earlier version I was able to get it to run, but the new columns had their values shifted down one row from what they should have been. I include this code block as pseudocode of what I'm trying to do. Any help is much appreciated
One way to do this in SAS is with arrays. A SAS array can be used to reference a group of variables, and it can also create variables.
data have;
input age;
cards;
5
10
;
run;
data want;
set have;
array spline{7}; *create spline1 spline2 ... spline7;
do i=1 to 7;
if i = 1 then spline{i} = age;
else spline{i} = age + i;
end;
drop i;
run;
Spline{i} referes to the ith variable of the array named spline.
i is a regular variable, the DROP statement prevents it from being written to the output dataset.
When you say new columns were "shifted by one," note that spline1=age and spline2=age+2. You can change your code accordingly, e.g. if you want spline2=age+1, you could change your else statement to else spline{i} = age + i - 1 ; It is also possible to change the array statement to define it with 0 as the lower bound, rather than 1.
Arrays are likely the best way to solve this, but I will demonstrate a macro approach, which is necessary in some cases.
SAS separates its doing-things-with-data language from its writing-code language into the 'data step language' and the 'macro language'. They don't really talk to each other during a data step, because the macro language runs during the compilation stage (before any data is processed) while the data step language runs during the execution stage (while rows of data are being processed).
In any event, for something like this it's quite possible to write a macro to do what you want. Borrowing Quentin's general structure and initial dataset:
data have;
input age;
cards;
5
10
;
run;
%macro make_spline(var=, count=);
%local i;
%do i = 1 %to &count;
%if &i=1 %then &var.&i. = &var.;
%else &var.&i. = &var. + &i.;
; *this semicolon ends the assignment statement;
%end;
/* You end up with the IF statement generating:
age1 = age
and the extra semicolon after the if/else generates the ; for that line, making it
age1 = age;
etc. for the other lines.
*/
%mend make_spline;
data want;
set have;
%make_spline(var=age,count=7);
run;
This would then perform what you're looking to perform. The looping is in the macro language, not in the data step. You can assign parameters however you see fit; I prefer to have parameters like above, or even more (start loop could also be a parameter, and in fact the assignment code could be a parameter!).

Get values of Macro Variables in a SAS Table

I have a set of input macro variables in SAS. They are dynamic and generated based on the user selection in a sas stored process.
For example:There are 10 input values 1 to 10.
The name of the macro variable is VAR_. If a user selects 2,5,7 then 4 macro variables are created.
&VAR_0=3;
&VAR_=2;
&VAR_1=5;
&VAR_2=7;
The first one with suffix 0 provides the count. The next 3 provides the values.
Note:If a user select only one value then only one macro variable is created. For example If a user selects 9 then &var_=9; will be created. There will not be any count macro variable.
I am trying to create a sas table using these variables.
It should be like this
OBS VAR
-----------
1 2
2 5
3 7
-----------
This is what I tried. Not sure if this is the right way to do approach it.
It doesn't give me a final solution but I can atleast get the name of the macro variables in a table. How can I get their values ?
data tbl1;
do I=1 to &var_0;
VAR=CAT('&VAR_',I-1);
OUTPUT;
END;
RUN;
PROC SQL;
CREATE TABLE TBL2 AS
SELECT I,
CASE WHEN VAR= '&VAR_0' THEN '&VAR_' ELSE VAR END AS VAR
from TBL1;
QUIT;
Thank You for your help.
Jay
SAS helpfully stores them in a table for you already, you just need to parse out the ones you want. The table is called SASHELP.VMACRO or DICTIONARY.MACROS
Here's an example:
%let var=1;
%let var2=3;
%let var4=5;
proc sql;
create table want as
select * from sashelp.vmacro
where name like 'VAR%';
quit;
proc print data=want;
run;
I think the real issue is the inconsistent behavior of the stored process. It only creates the 0 and 1 variable when there are multiple selections. I think that your example is a little off. If the value of VAR_0 is three then their should be a VAR_3 macro variable. Also the value of VAR_ and VAR_1 should be set to the same thing.
To fix this in the past I have done something like this. First let's assign the parameter name a macro variable so that the code is reusable for other programs.
%let name=VAR_;
Then first make sure the minimal macro variables exist.
%global &name &name.0 &name.1 ;
Then make sure that you have a count by setting the 0 variable to 1 when it is empty.
%let &name.0 = %scan(&&&name.0 1,1);
Then make sure that you have a 1 variable. Since it should have the same value as the macro variable without a suffix just re-assign it.
%let &name.1 = &&&name ;
Now your data step is easier.
data want ;
length var $32 value $200 ;
do i=1 to &&&name.0 ;
var=cats(symget('name'),i);
value=symget(var);
output;
end;
run;
I don't understand your numbering scheme and recommend changing it, if you can; the &var_ variable is very confusing.
Anyway, the easiest way to do this is SYMGET. That returns a value from the macro symbol table which you can specify at runtime.
%let VAR_0=3;
%let VAR_=2;
%let VAR_1=5;
%let VAR_2=7;
data want;
do obs = 1 to &var_0.;
var = input(symget(cats('VAR_',ifc(obs=1,'',put(obs-1,2.)))),2.);
output;
end;
run;

extracting a list of observations from a single sas cell

I have a sas dataset that has a list of variables embedded within a single character variable, delimited by pipes. It looks something like this:
Obs. List_of_forms
1,"|FormA(04-15-2003)||FormB(04-15-2004)|",
2,"|FormA(04-15-2002)||FormA(04-15-2003)||FormB(04-15-2003)|"
I would like to extract each of the items delimited by pipes as individual variables, so the data would look something like this:
Obs., form1, form2, form3
1,"FormA(04-15-2003)","FormB(04-15-2004)",.,
2,"FormA(04-15-2002)","FormA(04-15-2003)","FormB(04-15-2003)"
But I'm at a loss for how to do this. I've thought about coding a do-loop to iterate through each pipe, but this seems needlessly complex. Any advice for a more elegant solution?
Use the SCAN() function. First we can setup your example data.
data have ;
obs+1;
input list_of_forms $60. ;
cards;
|FormA(04-15-2003)||FormB(04-15-2004)|
|FormA(04-15-2002)||FormA(04-15-2003)||FormB(04-15-2003)|
;;;;
Now we can convert it to multiple columns.
data want;
set have ;
array form (3) $60 ;
do i=1 to dim(form);
form(i) = scan(list_of_forms,i,'|');
end;
drop i;
run;
To make it more dynamic you could find the maximum number of values over the whole dataset and replace the hard coded upper bound of 3 on the new variables.
proc sql noprint ;
select max(countw(list_of_forms,'|'))
into :nforms
from have
;
run;
...
array form (&nforms) $60 ;

Create dynamic SAS variable name from string

I have something similar to the code below, I want to create every 2 character combination within my strings and then count the occurrence of each and store in a table. I will be changing the substr statement to a do loop to iterate through the whole string. But for now I just want to get the first character pair to work;
data temp;
input cat $50.;
call symput ('regex', substr(cat,1,2));
&regex = count(cat,substr(cat,1,2));
datalines;
bvbvbsbvbvbvbvblb
dvdvdvlxvdvdgd
cdcdcdcdvdcdcdvcdcded
udvdvdvdevdvdvdvdvdvdvevdedvdv
dvdkdkdvdkdkdkudvkdkd
kdkvdkdkvdkdkvudkdkdukdvdkdkdkdv
dvkvwduvwdedkd
;
run;
Expected results;
cat bv dv cd ud kd
#### 6
#### 4
#### 8
#### 1
#### 3
#### 9
#### 1
I'd prefer not to use a proc transpose as I can't loop through the string to create all the character pairs. I'll have to manually create them and I have upto 500 characters per string, plus I would like to search for 3 and 4 string patterns.
You can't do what you're asking to directly. You will either have to use the macro language, or use PROC TRANSPOSE. SAS doesn't let you reference data in the way you're trying to, because it has to have already constructed the variable names and such before it reads anything in.
I'll post a different solution that uses the macro language, but I suspect TRANSPOSE is the ultimate solution here; there's no practical reason this shouldn't work with your actual problem, and if you're having trouble with that it should be possible to help - post the do loop and what you're wanting, and we can of course help. Likely you just need to put the OUTPUT in the do loop.
data temp;
input cat $50.;
cat_val = substr(cat,1,2);
_var_ = count(cat,substr(cat,1,2));
output;
datalines;
bvbvbsbvbvbvbvblb
dvdvdvlxvdvdgd
cdcdcdcdvdcdcdvcdcded
udvdvdvdevdvdvdvdvdvdvevdedvdv
dvdkdkdvdkdkdkudvkdkd
kdkvdkdkvdkdkvudkdkdukdvdkdkdkdv
dvkvwduvwdedkd
;
run;
proc transpose data=temp out=temp_T(drop=_name_);
by cat notsorted; *or by some ID variable more likely;
id cat_val;
var _var_;
run;
Here's a solution that uses CALL EXECUTE rather than the macro language, as I decided that was actually a better solution. I wouldn't use this in production, but it hopefully shows the concept (in particular, I would not run a PROC DATASETS for each variable separately - I would concat all the renames into one string then run that at the end. I thought this better for showing how the process might work.)
This takes advantage of timing - namely, CALL EXECUTE happens after the data step terminates, so by that point you do know what variable maps to what data point. It does have to pass the data twice in order to drop the spurious variables, though if you either know the actual number of variables you want to have, or if you're okay with the excess variables hanging around, it would be okay to skip that, and PROC DATASETS doesn't actually open the whole dataset, so it would be quite fast (even the above with five calls is quite fast).
data temp;
input cat $50.;
array _catvars[50]; *arbitrary 50 chosen here - pick one big enough for your data;
array _catvarnames[50] $ _temporary_;
cat_val = substr(cat,1,2);
_iternum = whichc(cat_val, of _catvarnames[*]);
if _iternum=0 then do;
_iternum = whichc(' ',of _catvarnames[*]);
_catvarnames[_iternum]=cat_val;
call execute('proc datasets lib=work; modify temp; rename '||vname(_catvars[_iternum])||' = '||cat_val||'; quit;');
end;
_catvars[_iternum]= count(cat,substr(cat,1,2));
if _n_=7 then do; *this needs to actually be a test for end-of-file (so add `end=eof` to the set statement or infile), but you cannot do that in DATALINES so I hardcode the example.;
call execute('data temp; set temp; drop _catvars'||put(whichc(' ',of _catvarnames[*]),2. -l)||'-_catvars50;run;');
end;
datalines;
bvbvbsbvbvbvbvblb
dvdvdvlxvdvdgd
cdcdcdcdvdcdcdvcdcded
udvdvdvdevdvdvdvdvdvdvevdedvdv
dvdkdkdvdkdkdkudvkdkd
kdkvdkdkvdkdkvudkdkdukdvdkdkdkdv
dvkvwduvwdedkd
;
run;