I'm a new SAS user and I have a small problem.
I have one large empty table A with, let's say, 100 columns, which I created with a simple proc sql; create table.
I have another table B with, let's say, 40 columns and a table C with 55 columns.
I want to add these two tables into table A. Basically, I want a table with 100 columns containing the data from tables B and C, and I'm doing this with a UNION command.
Since I don't have values for all 100 variables, I have to set default values.
Let's say I have a column named nourishment in table A that is called food in table B and has no equivalent in table C. I have rules like "if the data comes from table B then value = xxx; if it's from table C then value = 'DefaultValue'".
I'd do this easily with R or Python, but I'm struggling with SAS.
I'm using SAS SQL commands (a UNION command).
How do you set default values (for all data types: character, numeric or date)?
Dates in SAS are actually just numeric values. Often they have a date format applied to make them readable.
So you could just assign a missing value by default like so:
. as ColumnName
or any default date like so
'17NOV2017'd as ColumnName
. as MyColumnName
SAS can deal with missing values.
Using a specially coded value, such as 'NA', to represent a missing value condition can work but may lead to headaches and extra coding. Recommended read in SAS help: "Working with Missing Values"
The default SAS missing value for numerics (which also includes dates) is a period.
. as MyColumnName
SAS also has 27 special missing values for numerics, written as a period followed by a letter or an underscore (.A through .Z and ._)
.A as MyColumnName
.Z as MyColumnName
._ as MyColumnName
The missing value for character variables is a single space
' '
'' empty quote string also works
' ' as does a longer empty string
Rule of thumb: be consistent when coding your missing values.
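As a minimal sketch of how these literals fit into a UNION, assuming two small stand-in tables B and C (their columns qty and seen, and the output name A_filled, are invented for illustration and are not from the question):

```sas
/* Hypothetical stand-ins for tables B and C */
data B;
  length food $12;
  food = 'apple'; qty = 3;
run;
data C;
  qty = 7; seen = '17NOV2017'd;
  format seen date9.;
run;

proc sql;
  create table A_filled as
  /* rows from B: food maps to nourishment, no date available */
  select food as nourishment length=12
       , qty
       , . as seen format=date9.
  from B
  union all
  /* rows from C: no food column, so apply the default text */
  select 'DefaultValue' as nourishment length=12
       , qty
       , seen
  from C
  ;
quit;
```

UNION ALL aligns columns by position, so each SELECT has to list the columns in the same order and with compatible types.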
You can use OPTIONS MISSING to specify what character is shown when a missing value is printed.
OPTIONS MISSING = '*'; * My special representation of missing for this report;
Proc PRINT data=myData;
run;
OPTIONS MISSING = '.'; * Restore to the default;
SAS custom formats can also be used to customize what is printed for missing values.
proc format;
  value MissingN
    . = 'N/A'
    .N = 'Special N/A different than regular N/A' /* for .N */
  ;
  value $MissingC
    ' ' = 'N/A'
  ;
  value SillyChristmasStocking
    .C = 'Bad'
    .O = 'children'
    .A = 'get'
    .L = 'No toys'
  ;
run;
The token after the value keyword can be any new valid SAS name that you want to use for your format name.
Proc PRINT data=myData;
  format myColumnName MissingN.;
  format name $MissingC.;
  format behaviour SillyChristmasStocking.;
run;
As for your character missing value conditions, I would continue to use " " or ' '
You mention UNION, which is a SQL feature. In SQL, JOINs also occur, perhaps more often than UNIONs. When JOINing and values from two source columns collide, you will want to use either the COALESCE() function or a CASE expression to select the non-missing value.
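A small sketch of the COALESCE() case, assuming two hypothetical tables left_t and right_t that share an id key and a colliding val column (all names here are made up for the example):

```sas
/* Hypothetical inputs: each side is missing val on a different row */
data left_t;
  id = 1; val = .;  output;
  id = 2; val = 20; output;
run;
data right_t;
  id = 1; val = 10; output;
  id = 2; val = .;  output;
  id = 3; val = 30; output;
run;

proc sql;
  create table merged as
  select coalesce(l.id,  r.id)  as id   /* key from whichever side has the row */
       , coalesce(l.val, r.val) as val  /* first non-missing of the two columns */
  from left_t as l
  full join right_t as r
    on l.id = r.id
  ;
quit;
```

Inside PROC SQL, COALESCE() accepts either numeric or character arguments, as long as all arguments in one call have the same type.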
I would not recommend using UNION in PROC SQL at any point in your SAS usage. UNION is almost always inferior to a simple data step, or a data step view.
That's because the data step seamlessly handles issues like differing variables on different tables. SAS is quite comfortable with vertically combining datasets; SQL is always a bit trickier when they're not identical.
data c;
  set a b;
run;
That runs whether or not a and b are identical, so long as a and b don't have conflicting variable names (that aren't intended to be in the same column); and if they do, just use the rename dataset option to resolve it.
If you do as the above, and don't use union, you'll get a missing value automatically for those dates.
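To make the rename remark concrete, here is a hedged sketch using the column names from the question (food, nourishment). The tiny b and c datasets, the qty column, the in= flag name and the output name a_stacked are assumptions for illustration:

```sas
/* Hypothetical inputs: b has food, c has no equivalent column */
data b; food = 'rice'; run;
data c; qty = 5; run;

data a_stacked;
  length nourishment $12;                 /* target length set before SET */
  set b(rename=(food=nourishment))        /* line food up with nourishment */
      c(in=fromC);                        /* fromC is 1 for rows read from c */
  if fromC then nourishment = 'DefaultValue';
run;
```

Columns that exist in only one input (qty here) are simply missing on the rows from the other input.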
A DATA Step approach for stacking data is the simplest. Use SET to stack the data and array processing to apply your defaults. For example:
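data stacked_data;
  set TARGET_TEMPLATE (obs=0) ONE TWO; * template first so every target column exists;
  array allchar _character_;
  array allnum _numeric_;
  array dates d1-d5;
  do over allchar; if missing(allchar) then allchar = '*UNKNOWN*'; end;
  * dates are numeric too, so default them before the generic numeric pass;
  do over dates; if missing(dates) then dates='01NOV1971'd; end;
  do over allnum; if missing(allnum) then allnum = -995; end;
run;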
A subtle issue is that any missing values in ONE or TWO will be replaced with the default value.
Proc SQL
In Proc SQL you will want to create a single row table containing the default values for A. That table can be joined to the union of B and C. The join select will involve coalesce() in order to choose the predefined default value when a column is not from B or C.
For example, suppose you have an empty (zero rows), richly columned, target table (your A) acting as a template:
data TARGET_TEMPLATE;
  length _n_ 8;
  length a1-a5 $25 d1-d5 4 x1-x20 y1-y20 z1-z20 p1-p20 q1-q20 r1-r20 8;
  call missing (of _all_);
  format d1-d5 yymmdd10.;
  stop; * zero rows, structure only;
run;
Because Proc SQL does not provide syntax for a default constraint you need to create a table of your own defaults. This is probably easiest with DATA Step:
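data TARGET_DEFAULTS; * dataset name is an assumption, the original DATA statement was lost;
  if 0 then set TARGET_TEMPLATE (obs=0); * prep pdv to match TARGET;
  array allchar _character_ (1000 * '*UNKNOWN*');
  array allnum _numeric_ (1000 * -995);
  array d d1-d5 (5 * '01NOV1971'd); * override the allnum array initialization;
run; * no SET executes, so the step writes exactly one row of defaults;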
Here is some generated demo data, ONE and TWO, that correspond to your B and C:
data ONE;
  if 0 then set TARGET_TEMPLATE (obs=0); * prep pdv of demo data to match TARGET;
  do _n_ = 1 to 100;
    array a a1 a3 a5;
    array num x: y: z:;
    array d d1 d2;
    do over a; a = catx (' ', 'ONE', _n_, _i_); end;
    do over num; num = 1000 + _n_ + _i_; end;
    retain foodate '01jan1975'd;
    do over d; d=foodate; foodate+1; end;
    output;
  end;
  keep a1 a3 a5 x: y: z: d1 d2; * keep the disparate columns that were populated;
run;
data TWO;
  if 0 then set TARGET_TEMPLATE (obs=0); * prep pdv of demo data to match TARGET;
  do _n_ = 1 to 200;
    array a a1 a2 a3;
    array num x5 y5 z5 p: q: r:;
    array d d1 d2;
    do over a; a = catx (' ', 'TWO', _n_, _i_); end;
    do over num; num = 20000 + _n_*10 + _i_; end;
    retain foodate '01jan1985'd;
    do over d; d=foodate; foodate+1; end;
    output;
  end;
  keep a1 a2 a3 x5 y5 z5 p: q: r:; * keep the disparate columns that were populated;
run;
A stacking of A, B and C is simple SQL but does not introduce target specific default values:
proc sql noprint;
  * generic UNION stack with SAS missing values (space and dot) for cells
  * where ONE and TWO did not contribute any data;
  create table stacked_data as
  select * from TARGET_TEMPLATE %*** empty template first ensures template column order and formats are honored in output data;
  outer union corresponding %*** align by column name, do not remove duplicates;
  select * from ONE
  outer union corresponding
  select * from TWO
  ;
quit;
When the stacking is put in a sub-query, it can be joined with the defaults. The choosing of the target default value for each column involves examining DICTIONARY.COLUMNS and generating the SQL source for selecting the coalescence of stack and default.
proc sql noprint;
  * codegen select items ;
  select cat('coalesce(STACK.',trim(name),',DEFAULT.',trim(name),') as ',trim(name))
  into :coalesces separated by ','
  from DICTIONARY.COLUMNS
  where libname = 'WORK' and memname = 'TARGET_TEMPLATE' %* dictionary lib and mem name values are always uppercase;
  order by npos
  ;
  create table stacked_data_with_defaults as
  select * from TARGET_TEMPLATE %*** output honors template;
  outer union corresponding
  select
    source
    , &coalesces %*** apply codegen;
  from
  (
    select * from WORK.TARGET_TEMPLATE %*** ensure fully columned sub-select that will align with coalesces;
    outer union corresponding
    select 'one' as source, * from ONE
    outer union corresponding
    select 'two' as source, * from TWO
  ) as STACK
  join TARGET_DEFAULTS as DEFAULT /* single-row defaults table; its name is an assumption */
  on 1=1
  ;
quit;
Why would you create an empty dataset? What is it going to be used for? Perhaps you want to use it as a default structure definition? If so and you want to stack B and C and get them in the structure defined by A you could code this way.
data want ;
  set a(obs=0) b c ;
run;
Not sure what the purpose would be to have default values. Couldn't you use formats if you want missing values to display in special ways?
Or you could create code to default values and perhaps just %include it or wrap the logic into a macro. So if you had a code file named '' with lines like this.
Then your little program to make a new dataset that looks like A and uses the data from B and C would look like this.
data want ;
  set a(obs=0) b c ;
  %include '';
run;
If you really did want to aggregate the records into some large dataset then perhaps you want to use PROC APPEND to add the records once they are created in the right structure.
proc append data=want base=a ;
run;
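As a rough illustration of what PROC APPEND does when the new records do not match the base exactly, here is a sketch with invented columns (x, name, extra); the FORCE option is only needed because of the deliberately mismatched extra column:

```sas
/* Hypothetical base table: structure only, zero rows */
data a;
  length x 8 name $10;
  stop;
run;

/* Hypothetical new records with one column the base does not have */
data want;
  x = 1; extra = 99;
run;

/* FORCE lets the append proceed: extra is dropped, name is left missing */
proc append base=a data=want force;
run;
```

When the structures already match, as they would after the set a(obs=0) step, FORCE can be left off.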
I stumbled upon the following code snippet in which the variable top3 has to be filled from a table have rather than from an array of numbers.
%let top3 = 14 15 42; /* This should be made obsolete.. */
%let no = 3;
proc sql;
create table want as
select *
from (select x, y from foo) a
%do i = 1 %to &no.;
%let current = %scan(&top3.,&i.); /* What do I need to put here? */
left join (select x, y from bar where z=&current.) row_&current.
on a.x = row_&current..x
%end;
;
quit;
The table have contains the xs from the string and looks as follows:
i x
1 14
2 15
3 42
I am now wondering how I should modify the %let current = ... line such that current is populated from the table have. I know how to populate a macro variable using proc sql with select .. into, but I am afraid that the way I am going right now is fully against SAS philosophy.
It looks like you're more or less transposing something. If that's the case, this is doable in macro/sql pretty easily.
First, here's the simple version - no macro.
proc sql;
create table class_t as
select * from (
select name from sashelp.class ) class
left join (
select name, age as age_Alfred
from sashelp.class
where name='Alfred') Alfred
on class.name = Alfred.name
;
quit;
We grab the value of age from the Alfred row and put it on the main join. This isn't exactly what you're doing, but it seems similar. (I'm just using one table, but you can of course use two here.)
Now, how do we extend this to be table-driven and not handwritten? Macros!
First, here's the macro - just taking the Alfred bit and making it generic.
%macro joiner(name=);
left join (
select name, age as age_&name.
from sashelp.class
where name="&name.") &name.
on class.name = &name..name
%mend joiner;
Second, we look at this and see two things we need to put into macro lists: the SELECT variable list (we'll get one new variable for each call), and the JOIN list.
proc sql;
select cats('%joiner(name=',name,')')
into :joinlist separated by ' '
from sashelp.class;
select cats(name,'.age_',name)
into :selectlist separated by ','
from sashelp.class;
quit;
And then, we just call it!
proc sql;
create table class_t as
select class.name, &selectlist. from (
select name from sashelp.class) class
&joinlist.
;
quit;
Now, your dataset you call the macro lists from is perhaps the dataset with the 3 rows in it you have above ("have"). The dataset you actually get the appending data from is some other dataset ("bar"), right? And then the ones you join to is perhaps a third dataset ("foo"). Here I just use the one, for simplicity, but the concept is the same, just different sources.
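If the immediate goal is only to stop hard-coding %let top3 and %let no, one possible sketch is to fill both from have with select ... into. The have dataset below is re-created from the listing in the question; everything else in the original snippet is assumed unchanged:

```sas
/* HAVE as listed in the question */
data have;
  input i x;
  datalines;
1 14
2 15
3 42
;
run;

proc sql noprint;
  /* build the space-separated list in the order given by i */
  select x into :top3 separated by ' '
  from have
  order by i;
  /* SQLOBS holds the number of rows the last SELECT returned */
  %let no = &sqlobs;
quit;

%put NOTE: &=top3 &=no;
```

The existing %scan(&top3.,&i.) line could then stay as it is, since top3 and no carry the same kind of values as before.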
When the lookup data is in a table you can perform a three way join without any need for SAS Macro. You don't provide any data so the example will mock some.
Suppose a master record has several associated detail records, and the detail records contain a z value used for selection into a result set per a wanted z lookup table.
data masters;
  call streaminit(2020);
  do id = 1 to 100;
    do x = 1 to 100;
      m_rownum + 1;
      code = rand('integer', 10,45);
      output;
    end;
  end;
run;
data details;
  call streaminit(2020);
  do date = 1 to 20;
    do x = 1 to 100;
      do rep = 1 to 5;
        d_rownum + 1;
        amount = rand('integer', 100,200);
        z = rand('integer', 10,45);
        output;
      end;
    end;
  end;
run;
data zs;
input z @@; datalines;
14 15 42
;
run;
proc sql;
  create table want as
  select
    masters.id
  , d_rownum
  , masters.x
  , masters.code
  , details.z
  , details.amount
  from masters
  left join details
    on details.x = masters.x
  inner join zs
    on zs.z = details.z
  order by masters.id, masters.x, details.z, details.date
  ;
quit;
I have a 2 column dataset - accounts and attributes, where there are 6 types of attributes.
I am trying to use PROC TRANSPOSE in order to set the 6 different attributes as 6 new columns and set 1 where the column has that attribute and 0 where it doesn't
This answer shows two approaches:
Proc TRANSPOSE on a staging view that adds a numeric flag, and
array based transposition using index lookup via hash.
If every account is missing the same attribute, the data itself cannot exhibit all the attributes. Ideally the allowed or expected attributes should be listed in a separate table as part of your data reshaping.
When working with a table of only account and attribute you will need to construct a view adding a numeric variable that can be transposed. After TRANSPOSE the result data will have to be further massaged, replacing missing values (.) with 0.
data have;
  call streaminit(123);
  do account = 1 to 10;
    do attribute = 'a','b','c','d','e','f';
      if rand('uniform') < 0.75 then output;
    end;
  end;
run;
data stage / view=stage;
  set have;
  num = 1;
run;
proc transpose data=stage out=want;
  by account;
  id attribute;
  var num;
run;
data want;
  set want;
  array attrs _numeric_;
  do index = 1 to dim(attrs);
    if missing(attrs(index)) then attrs(index) = 0;
  end;
  drop index;
run;
proc sql;
  drop view stage;
quit;
Advanced technique - Array and Hash mapping
In some cases the Proc TRANSPOSE is deemed unusable by the coder or operator, perhaps very many by groups and very many attributes. An alternate way to transpose attribute values into like named flag variables is to code:
Two scans
Scan 1 determine attribute values that will be encountered and used as column names
Store list of values in a macro variable
Scan 2
Arrayify the attribute values as variable names
Map values to array index using hash (or custom informat per @Joe)
Process each group. Set arrayed variable corresponding to each encountered attribute value to 1. Array index obtained via lookup through hash map.
* pass #1, determine attribute values present in data, the values will become column names;
proc sql noprint;
  select distinct attribute into :attrs separated by ' ' from have;
  * or make list of attributes from table of attributes (if such a table exists outside of 'have');
  * select distinct attribute into :attrs separated by ' ' from attributes;
quit;
%put NOTE: &=attrs;
* pass #2, perform array based transposition;
data want2(drop=attribute);
  * prep pdv, promulgate by group variable attributes;
  if 0 then set have(keep=account);
  array attrs &attrs.;
  format &attrs. 4.;
  if _n_=1 then do;
    declare hash attrmap();
    * key and data definitions are reconstructed, the find() below needs them;
    attrmap.defineKey('attribute');
    attrmap.defineData('_n_');
    attrmap.defineDone();
    do _n_ = 1 to dim(attrs);
      attrmap.add(key:vname(attrs(_n_)), data: _n_);
    end;
  end;
  * preset all flags to zero;
  do _n_ = 1 to dim(attrs);
    attrs(_n_) = 0;
  end;
  * DOW loop over by group;
  do until (last.account);
    set have;
    by account;
    attrmap.find(); * lookup array index for attribute as column;
    attrs(_n_) = 1; * set flag for attribute (as column);
  end;
  * implicit output one row per by group;
run;
One other option for doing this not using PROC TRANSPOSE is the data step array technique.
Here, I have a dataset that hopefully matches yours approximately. ID is probably your account, Product is your attribute.
data have;
  call streaminit(2007);
  do id = 1 to 4;
    do prodnum = 1 to 6;
      if rand('Uniform') > 0.5 then do;
        product = byte(96+prodnum);
        output;
      end;
    end;
  end;
run;
Now, here we transpose it. We make an array with the six variables that could occur in HAVE. Then we iterate through the array to see if that variable is there. You can add a few additional lines to the if block to set all of the variables to 0 instead of missing initially (I think missing is better, but YMMV).
data want;
  set have;
  by id;
  array vars[6] a b c d e f;
  retain a b c d e f;
  if first.id then call missing(of vars[*]);
  do _i = 1 to dim(vars);
    if lowcase(vname(vars[_i])) = product then
      vars[_i] = 1;
  end;
  if last.id then output;
run;
We could do it a lot faster if we knew how the dataset was constructed, of course.
data want;
  set have;
  by id;
  array vars[6] a b c d e f;
  if first.id then call missing(of vars[*]);
  retain a b c d e f;
  vars[prodnum] = 1; * prodnum is already the position in the array;
  if last.id then output;
run;
While your data doesn't really work that way, you could make an informat though that did this.
*First we build an informat relating the product to its number in the array order;
proc format;
  invalue arrayi
    'a'=1 'b'=2 'c'=3 'd'=4 'e'=5 'f'=6
  ;
run;
*Now we can use that!;
data want;
  set have;
  by id;
  array vars[6] a b c d e f;
  if first.id then call missing(of vars[*]);
  retain a b c d e f;
  vars[input(product,arrayi.)] = 1;
  if last.id then output;
run;
This last one is probably the absolute fastest option - most likely much faster than PROC TRANSPOSE, which tends to be one of the slower procs in my book, but at the cost of having to know ahead of time what variables you're going to have in that array.
I have two input datasets which I need to interweave. The input files have defined lengths for numeric fields depending on the size of the integer. When I interweave the datasets -- either a DATA or PROC SQL statement -- the lengths of numeric fields are all reset to the default of 8. Outside of explicitly defining the length for each field in a LENGTH statement, is there an option for SAS to keep the original attributes of the input columns?
More details ...
data A ;
length numeric_variable 3 ;
{input data}
data B ;
length numeric_variable 3 ;
{input data}
data AB ;
set A B ;
by some_id_variable ;
In the data set AB, the variable NUMERIC_VARIABLE is length 8 instead of 3. I can explicitly put another length statement in the "data AB" statement, but I have tons of columns.
Your description is wrong. A data step will set the length based on how it is first defined. If you just select the variable in SQL it keeps its length. However in SQL if you are doing something like UNION that combines variables from different sources then the length will be set to 8.
data one; length x 3; x=1; run;
data two; length x 5; x=2; run;
data one_two; set one two; run;
data two_one; set two one; run;
proc sql ;
create table sql_one as select * from one;
create table sql_two as select * from two;
create table sql_one_two as select * from one union select * from two;
create table sql_two_one as select * from two union select * from one;
proc sql;
  select memname,name,length
  from dictionary.columns
  where libname='WORK'
  and (memname like '%ONE%'
  or memname like '%TWO%')
  ;
quit;
Member Name Column Name Length
ONE x 3
TWO x 5
So if you want to control how your variables are defined, either add the LENGTH statement as you mentioned, or create a template dataset and reference it in your data steps before referencing the other dataset(s). For complex SQL code you will need to include the LENGTH= option in your SELECT clause to force the lengths for the variables you are creating.
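A minimal sketch of the template idea, reusing datasets one and two from the example above; the names template and two_one_fixed are invented for illustration:

```sas
/* Template: structure only, x defined with the wanted length */
data template;
  length x 3;
  stop;
run;

data one; length x 3; x=1; run;
data two; length x 5; x=2; run;

data two_one_fixed;
  if 0 then set template;  /* never executes, but defines x as length 3 at compile time */
  set two one;             /* without the template, x would take length 5 from two */
run;

proc contents data=two_one_fixed; run;
```

The first definition the compiler sees wins, so listing the template first has the same effect as a LENGTH statement for every column it contains.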
Can you post code that demonstrates the problem?
This code does NOT exhibit a final data set in which the numeric lengths get changed from 3 to 8.
data A; id = 'A'; length x 3; x=1;
data B; id = 'A'; length x 3; x=2;
data AB;
set A B;
by id;
proc contents data=AB; run;
# Variable Type Len
1 id Char 1
2 x Num 3
Ok so here it goes,
I am working with a dataset containing Minimum Inhibitory Concentration (MIC) values for different antibiotics (About 30 different antibiotics). Each antibiotic has MIC values from different test-types and interpretations for each of those MICs.
The MIC variables for antibiotic Amikacin have a common mnemonic suffix AMK
ALL the antibiotics have variables similar to above (I.e. the micmr, micms, interpmr, etc is all the same for each variable. The only thing that changes is the last few letters that correspond to the antibiotic name)
I am attempting to validate these data, I have a list of valid MIC values for each type of test. Is there a way to write a program that will check all the variables that start with “mic” so that I don’t have to specify each individual variable name?
I'm not a microbiologist, but guessing your variable name construct has lots of information in it.
Variable name construct:
mic - minimum inhibitory concentration
inter - susceptibility interpretation
<susceptibility> - <mr|ms|vk|px|et>
mr - resistant
ms - sensitive
vk -
px -
et -
There are two approaches to validating the presence of MIC related variable names in the data set:
Way #1 - List a comparison of the data set variables to a constructed list of variables. The list is based on pre-specified list of antibodies, OR
Way #2 - Deconstruct the data set variable names into MIC variable name parts. Report on the data set variables and any possibly missing MIC variables.
Simulate MIC data set - Make a data set with some MIC variable names
* simulate some data;
data have;
  do sampleid = 1 to 1000;
    length instrumentid $20.;
    format rundate yymmdd10.;
    length operator $10.;
    array construct_names
      interpetamk micetamk interpmramk micmramk interpmsamk micmsamk
      interppxamk micpxamk interpvkamk micvkamk
      interpetimi micetimi interpmrimi micmrimi interpmsimi micmsimi
      interppximi micpximi interpvkimi micvkimi
      interpmsfubar micmsfubar
      interppxfubar micpxfoobar
    ;
    do over construct_names;
      construct_names = round(rand("normal", 50,9), 0.25);
    end;
    output;
  end;
run;
Get metadata
* get data set variable names as data;
proc contents noprint data=have out=have_names(keep=varnum name);
run;
Way #1
* compute variable names for expected MIC naming constructs;
* match only expected antibody variables;
data expect_names(keep=sequence name);
  * load arrays with construct parts;
  array part1(2) $6 ('mic', 'interp');
  array part2(5) $2 ('mr', 'ms', 'vk', 'px', 'et');
  array part3(4) $10 ('AMK', 'IMIP', 'TOBI', 'TYPO'); /* 4 expected antibodies */
  * construct expected names;
  do part3_index = 1 to dim(part3);
    do part2_index = 1 to dim(part2);
      do part1_index = 1 to dim(part1);
        sequence + 1;
        name = cats(part1[part1_index], part2[part2_index], part3[part3_index]);
        output;
      end;
    end;
  end;
run;
* Way 1 data validation: compare data variable names to expectations;
proc sql;
  create table name_comparison as
  select
    have.varnum,
    coalesce(have.name, expect.name) as name,
    case
      when have.name is null and expect.name is not null then 'Expected MIC variable was not in the data set'
      when have.name is not null and expect.name is null then 'NOT a MIC variable construct'
      else 'OK'
    end as status
  from have_names as have
  full join expect_names as expect
  on upper(have.name) eq upper(expect.name)
  order by have.varnum, expect.sequence
  ;
quit;
ods html file='compare.html' style=plateau;
proc print data=name_comparison;
  var varnum;
  var name / style=[fontfamily=monospace];
  var status;
run;
ods html close;
The report would be a simple listing showing how the variable names were evaluated
Way #2
Deconstruct data set variable names and color coded grid report.
* Compute construct parts and check for completeness;
proc sql;
  create table part1 (
    order num, mnemonic char(6), meaning char(200)
  );
  insert into part1
    values (1, 'mic', 'minimum inhibitory concentration')
    values (2, 'interp', 'susceptibility interpretation')
  ;
  create table part2 (
    order num, mnemonic char(6), meaning char(200)
  );
  insert into part2
    values (1, 'mr', '??')
    values (2, 'ms', '??')
    values (3, 'vk', '??')
    values (4, 'px', '??')
    values (5, 'et', '??')
  ;
  create table mic_name_prefixes as
  select
    part1.order as part1z format=2.
  , part1.mnemonic as part1
  , part2.order as part2z format=2.
  , part2.mnemonic as part2
  , cats(part1.mnemonic,part2.mnemonic) as prefix
  from part1 cross join part2
  ;
  create table antibodies(label="Extract antibody from variable names with proper prefix") as
  select
    substr(upper(name),length(prefix)+1) as antibody
  , min(varnum) as abz format=6.
  from have_names
  join mic_name_prefixes
  on upper(name) like upper(cats(prefix,'%'))
  group by antibody
  order by abz
  ;
  * sub select CROSS JOIN for complete grid;
  * FULL JOIN for complete comparison;
  create table name_grid_data as
  select
    abz, part1z, part2z
  , grid.part1, grid.part2, grid.antibody
  , coalesce(grid.name, have.name) as varname length=32
  , not missing(have.name) as expected_found format=1.
  from
  ( select PREFIX.*, AB.*, cats(part1,part2,antibody) as name
    from mic_name_prefixes PREFIX
    cross join antibodies AB
  ) as grid
  full join have_names as have
  on upper(have.name) = upper(grid.name)
  order by
    coalesce(abz,have.varnum+1e6), part2z, part1z
  ;
  reset noprint;
  select count(distinct antibody) into :abcount trimmed from name_grid_data;
  select count(distinct 0) into :abmissing trimmed from name_grid_data where missing(antibody);
quit;
%let abcount = %eval(&abcount + &abmissing);
%put NOTE: &=abcount;
%macro cols (from,to);
  /* needed for array statement in compute block */
  %local index;
  %do index = &from %to &to;
    _c&index._
  %end;
%mend cols;
ods html file = 'mic_names.html';
proc report data=name_grid_data spanrows missing;
  column
    part1 part2
    antibody,varname /* 'display var under across var' trick, display will be shown */
    antibody=ab,expected_found /* same trick with ab alias, to get _c#_ column for compute block logic */
    placeholder
  ;
  define part1 / group order=data ' ' style=header;
  define part2 / group order=data ' ' style=header;
  define antibody / across order=data ' ';
  define ab / across order=data ' ' noprint; /* NOPRINT, _c#_ available, but not rendered */
  define varname / ' ' style=[fontfamily=monospace];
  define placeholder / noprint; /* required for 'display under across' trick */
  /* right most column has access to all leftward columns */
  compute placeholder;
    array name_col %cols(3, %eval(2+&abcount)); /* array for _c#_ columns */
    array have_col %cols(%eval(3+&abcount), %eval(2+2*&abcount)); /* array for _c#_ columns */
    /* conditionally highlight the missing variables */
    do index = 1 to &abcount - &abmissing;
      if not missing ( name_col(index) ) then do;
        if not have_col(index) then
          call define (vname(name_col(index)), 'style', 'style=[background=lightred]');
        else
          call define (vname(name_col(index)), 'style', 'style=[background=lightgreen]');
      end;
    end;
  endcomp;
run;
ods html close;
Color coded grid report
I have a SAS dataset as follow :
Key A B C D E
001 1 . 1 . 1
002 . 1 . 1 .
Other than keeping the existing variables, I want to replace each variable's value with the variable name: if variable A has value 1, then the new variable should have value A, else blank.
Currently I am hardcoding the values. Does anyone have a better solution?
The following should do the trick (the first dstep sets up the example):-
data test_data;
  length key A B C D E 3;
  format key z3.; ** Force leading zeroes for KEY;
  key=001; A=1; B=.; C=1; D=.; E=1; output;
  key=002; A=.; B=1; C=.; D=1; E=.; output;
run;
proc sort;
  by key;
run;
data results(drop = _: i);
  set test_data(rename=(A=_A B=_B C=_C D=_D E=_E));
  array from_vars[*] _:;
  array to_vars[*] $1 A B C D E;
  do i=1 to dim(from_vars);
    to_vars[i] = ifc( from_vars[i], substr(vname(from_vars[i]),2), '');
  end;
run;
It all looks a little awkward as we have to rename the original (assumed numeric) variables to then create same-named character variables that can hold values 'A', 'B', etc.
If your 'real' data has many more variables, the renaming can be laborious so you might find a double proc transpose more useful:-
proc transpose data = test_data out = test_data_tran;
  by key;
run;
proc transpose data = test_data_tran out = results2(drop = _:);
  by key;
  var _name_;
  id _name_;
  where col1;
run;
However, your variables will be in the wrong order on the output dataset and will be of length $8 rather than $1, which can be a waste of space. If either point is important (they seldom are), both can be remedied by following up with a length statement in a subsequent datastep:-
option varlenchk = nowarn;
data results2;
  length A B C D E $1;
  set results2;
run;
option varlenchk = warn;
This organises the variables in the right order and minimises their length. Still, you're now hard-coding your variable names which means you might as well have just stuck with the original array approach.