SAS PROC SQL UNION ALL minimizes column length

I have 8 tables, all with the same number and order of columns. One column, named ATTRIBUTE, holds values whose length varies from 4 to 25 characters. When I combine the tables with PROC SQL and UNION ALL, the ATTRIBUTE column appears to be cut down to the shortest length (4 characters).
How do I solve that, i.e. keep the full length of the data?

Example, per @Lee
data have1;
attrib name length=$10 format=$10.;
name = "Anton Short";
run;
data have2;
attrib name length=$50 format=$50.;
name = "Pippy Longstocking of Stoyville";
run;
* column attributes such as format, informat and label of the selected columns
* in the result set are 'inherited' on a first found first kept order, dependent on
* the SQL join plan (i.e. the order of the tables as coded for the query);
proc sql;
create table want as
select name from have1 union
select name from have2
;
proc contents data=want varnum;
run;
Here the format is shorter than the length, so any output display of longer values will appear to have been truncated at the data level, even though the stored values are intact.
* attributes of columns can be reset,
* (cleared so as to be dealt with in default manners),
* without rewriting the entire data set;
proc datasets nolist lib=work;
modify want;
attrib name format=; * format= removes the format of a variable;
run;
proc contents data=want varnum;
run;
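As an alternative to clearing the format afterwards, the attributes can be stated in the first SELECT, since that is where the result column takes them from. This is only a sketch: it reuses have1 and have2 from above, the output name want_full is made up for illustration, and it assumes the widest value is known in advance (50 here).

```sas
proc sql;
create table want_full as
/* the first SELECT supplies the attributes of the result column, */
/* so request a length and format wide enough for every table     */
select name length=50 format=$50. from have1
union all
select name from have2
;
quit;
proc contents data=want_full varnum;
run;
```

PROC CONTENTS should then report both a length and a format of 50 for name. For the 8 tables in the question, the same modifiers would go on ATTRIBUTE in the first SELECT only.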

Related

How to select columns only containing the certain string in SAS [duplicate]

I would like to know is it possible to perform an action that keeps only columns that contain a certain character.
For example, lets say that I have columns: name, surname, sex, age.
I want to keep only columns that start with letter 's' (surname and sex).
How do I do that?
There are several variations on how to filter names.
For prefixes or lists of variables it's pretty easy. For suffixes or more complex patterns it gets more complicated. In general you can shortcut lists as follows:
_numeric_ : all numeric variables
_character_ : all character variables
_all_ : all variables
prefix1 - prefix# : all variables with the same prefix assuming they're numbered
prefix: : all variables that start with prefix
firstVar -- lastVar : variables based on location between first and last variable, including the first and last.
first-numeric-lastVar : variables that are numeric based on location between first and last variable
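A quick sketch of a few of these shortcuts, using the sample tables sashelp.class and sashelp.cars that ship with SAS. The output data set names are made up for illustration.

```sas
/* sashelp.class columns, in order: Name Sex Age Height Weight */
data numeric_only;
set sashelp.class(keep=_numeric_); /* Age Height Weight */
run;
data by_position;
set sashelp.class(keep=name--age); /* Name Sex Age */
run;
data numeric_by_position;
set sashelp.class(keep=sex-numeric-weight); /* Age Height Weight */
run;
data by_prefix;
set sashelp.cars(keep=mpg:); /* MPG_City MPG_Highway */
run;
```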
Anything more complex requires that you filter it via the metadata list. SAS keeps metadata about each data set, so you can query that information to build your lists. Data about columns and types is in the sashelp.vcolumn view or the dictionary.columns table (the latter is available only inside PROC SQL).
To filter all columns that have the word mpg for example:
*generate variable list;
proc sql noprint;
select name into :var_list separated by " "
from sashelp.vcolumn
where libname = 'SASHELP' and memname = 'CARS'
and lowcase(name) like '%mpg%';
quit;
*check log for results;
%put &var_list;
*verification from original table;
proc contents data=sashelp.cars;
run;
*example of usage;
data want;
set sashelp.cars;
keep &var_list;
run;
Some more details are available in this blog post and here (documentation).
If you want to keep only variables that start with an s, then use the name prefix list operator (a colon after the prefix).
data want;
set have(keep=s:);
run;
It's possible.
In the code below I create a macro variable that holds the names of the matching columns in a table. After running the code, the macro variable contains the names of the columns you want.
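PROC SQL;
SELECT
NAME
INTO
:NMVAR SEPARATED BY " " /* SAVE ALL MATCHING NAMES IN ONE MACRO VARIABLE; WITHOUT SEPARATED BY ONLY THE FIRST NAME IS KEPT */
FROM SASHELP.VCOLUMN
WHERE
LIBNAME EQ "YOUR LIBNAME" AND /* THE NAME OF LIB MUST BE WRITTEN IN UPPERCASE */
MEMNAME EQ "YOUR TABLE" AND /* THE NAME OF 'TABLE/DATA SET' MUST BE WRITTEN IN UPPERCASE */
UPCASE(SUBSTR(NAME,1,1)) EQ "S"; /* COLUMN NAMES KEEP THEIR ORIGINAL CASE, SO COMPARE IN UPPERCASE */
QUIT; /* PROC SQL ENDS WITH QUIT, NOT RUN */
%PUT &NMVAR; /* CHECK THE LOG FOR THE LIST */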
For complex variable name selection, such as regular expressions or a lookup in an external metadata control table, you will need to process the metadata of the table itself to construct source code that can be applied.
This example demonstrates two, of many, ways that source code can be generated.
metadata table from target table, Proc CONTENTS
process metadata, Proc SQL
construct source code
  Expectation of name lists < 64K
    SQL INTO :<macro-variable> for source code expected to be < 64K characters
  Very large name lists or robust
    A macro that streams source code from metadata table
From a data set with 50,000 variables, select the columns whose name contains 2912:
data have;
retain id 'HOOPLA12345' x1-x50000 .;
stop;
run;
* obtain metadata of target table;
proc contents noprint data=have
out=varlist_table
( keep=name
where= (
prxmatch('/x.*2912.*/',name) /* name selection criteria */
)
);
run;
* Short lists;
* construct source code for name list;
proc sql noprint;
select name into :varlist separated by ' ' from varlist_table;
data want;
set have (keep=&varlist); /* apply generated source code */
run;
* Arbitrary or Long lists expected;
%macro stream_column (data=, column=);
%local dsid index &column;
%let dsid=%sysfunc(open(&data(keep=&column)));
%if &dsid %then %do;
%syscall SET(dsid);
%do %while (0=%sysfunc(fetch(&dsid)));
&&&column. /* emit column value from table */
%end;
%let dsid = %sysfunc(close(&dsid));
%end;
%mend;
options mprint;
data want2;
set have (keep=
/* stream source code as macro text emissions */
%stream_column(data=varlist_table,column=name)
);
run;

Proc Freq Combine Results Rows based on Contents

I am doing a Proc Freq on a large amount of user-entered data. I would like to know if I can combine the result rows based on the contents of the first column.
You appear to want to perform a frequency of the first word (or 1st scanned part of a column). Such a case will require data manipulation to reduce the longer value to the desired shortened value, in a different variable, to be frequency binned.
data have;
input;
user_entered_data = _infile_;
datalines;
Nyfaria - January
Nyfaria - Febuary
Michelangelo - January
Michelangelo - Feburary
run;
data have_for_freq;
set have;
item = scan (user_entered_data,1,' ');
run;
options nocenter;
ods noproctitle;
proc freq data=have_for_freq;
title "Freq of raw data";
table user_entered_data;
run;
proc freq data=have_for_freq;
title "Freq of raw data formatted as $4.";
table user_entered_data;
format user_entered_data $4.;
run;
proc freq data=have_for_freq;
title "Freq of raw data - item scanned out";
table item;
run;
Note: In some cases you can use a format to control the mapping of a raw value to a reported value. However, there is no built-in format that returns the first 'word' of a value the way scan does; the $4. example above only works when every first word is the same length.
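When the set of raw values is small and known, a user-defined format is one way to do that mapping without adding a variable. The sketch below is illustrative only: the format name $who is made up, and the raw values (including their misspellings) are copied from the sample data above.

```sas
proc format;
/* map each known raw value to the label to be reported */
value $who
'Nyfaria - January', 'Nyfaria - Febuary' = 'Nyfaria'
'Michelangelo - January', 'Michelangelo - Feburary' = 'Michelangelo';
run;
proc freq data=have;
title "Freq of raw data grouped by a custom format";
table user_entered_data;
format user_entered_data $who.; /* PROC FREQ bins on the formatted value */
run;
```

Any raw value not listed in the format is reported unchanged, so for genuinely free-form input the scan approach is likely to hold up better.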

Append of Tables with the Same Variables but Differing Attributes

My question is about the append of two different tables that are supposed to have the same name/format/type/length variables.
I am trying to create a step in my SAS program where I don't allow my program to be executed if the format/type/length of variables with the same name is not the same.
For example, in one table I have a date stored as a string in the form "dd-mm-yyyy", and in the other table it is "yyyy-mm-dd" or "dd-mm-yyyy hh:mm:ss". After the append, our daily executions based on these input tables didn't work as expected. Sometimes the values come up as missing or out of order, since the formats are different.
I tried using PROC COMPARE, which allowed me to check which variables have differing attributes (type, length, format, informat and label).
proc compare base = SAS-data-set
compare = SAS-data-set;
run;
However, I only got a printed listing of the common variables with differing attributes, and I was not able to act on it programmatically.
On the other hand, I would like to know if there's a chance to have a structured output table with this information, in order to use it as a control statement.
Creating an automatic task to do it would save me a lot of time.
You can use Proc CONTENTS to get information about a data set's variables. Do that for both data sets, and then use Proc COMPARE on the two results to create a data set that reports the differences in variable attributes.
data cars1;
set sashelp.cars (obs=10);
date = today ();
format date date9.;
cars1_only = 1;
x = 1.458; label x = "x-factor";
run;
data cars2;
length type $50;
set sashelp.cars (obs=10);
format date yymmdd10.;
cars2_only = 1;
X = 1.548; label x = "X factor to apply";
run;
proc contents noprint data=cars1 out=cars1_contents;
proc contents noprint data=cars2 out=cars2_contents;
run;
data cars1_contents;
set cars1_contents;
upName = upcase(Name);
run;
data cars2_contents;
set cars2_contents;
upName = upcase(Name);
run;
proc sort data=cars1_contents; by upName;
proc sort data=cars2_contents; by upName;
run;
proc compare noprint
base=cars1_contents
compare=cars2_contents
outall
out=cars_contents_compare (where=(_TYPE_ ne 'PERCENT'))
;
by upName;
run;
There is also an ODS table you can capture directly without having to run Proc CONTENTS, but the capture is not 'data-rific'
ods output CompareVariables=work.cars_vars;
proc compare base=cars1 compare=cars2;
run;
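To turn this into a control step, one option is to count the mismatches and branch on the result. The following is a sketch rather than a finished guard: it assumes cars1_contents and cars2_contents were created by the Proc CONTENTS steps above, the macro name append_if_compatible and the macro variable n_mismatch are hypothetical, and only type, length and format are checked.

```sas
proc sql noprint;
/* count common variables whose type, length or format differ */
select count(*) into :n_mismatch trimmed
from cars1_contents as a
inner join cars2_contents as b
on upcase(a.name) = upcase(b.name)
where a.type ne b.type
or a.length ne b.length
or a.format ne b.format
or a.formatl ne b.formatl;
quit;
%macro append_if_compatible;
%if &n_mismatch > 0 %then %do;
%put ERROR: &n_mismatch common variable(s) have differing attributes. Append skipped.;
%end;
%else %do;
/* reached only when all common variables agree */
proc append base=cars1 data=cars2;
run;
%end;
%mend append_if_compatible;
%append_if_compatible
```

With the cars1 and cars2 examples above the count is greater than zero (Type differs in length and date differs in format), so the append is skipped. Add informat or label to the WHERE clause if those should also block the append.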

SAS - Add origin table name as a column in report

I have an output table that contains 300+ variables from 30 different tables that are joined by UNION, which is used for modelling. I have created a macro that creates a report with a number of statistics, such as mean, min/max values etc. using this output table. I am trying to add a column to the report that details which table(s) the variables come from. I say table(s) as some of the variables are shared across different tables. I want to avoid having the same variable in the report multiple times as the statistics are the same irrespective of what table the variable comes from. Is there an efficient way to do this?
Instead of UNION, consider stacking the tables in a DATA step and using the INDSNAME= option of the SET statement to capture the source table name.
data want;
set sashelp.class sashelp.cars indsname=source;
source_dataset = source;
run;
If it were me, I would loop over each of the union datasets and just put the table name and variable names into a compiled dataset. You probably have all the table names in either a macro list or typed out, so you can just add a few more lines of code to run proc contents on each of those to compile a full list of table and variable names. Note that like your example, there will be duplicate variable names that you can modify after the table is compiled:
** create different tables **;
data height; set sashelp.class(keep=name height); run;
data weight; set sashelp.class(keep=name weight); run;
data sex; set sashelp.class(keep=name sex); run;
** put your datasets into a list either manually or dynamically **;
/* manually */
%let ds_list=height weight sex;
/* dynamically -- be careful to include only tables in your union */
proc sql noprint;
select MEMNAME
into: ds_list separated by " "
from sashelp.vmember
where libname = "WORK" and memname not in ("SASMACR","FORMATS");
quit;
%put &ds_list.;
** loop over each table to put the table name and variables in a dataset **;
%MACRO get_names(ds_list);
%do i=1 %to %sysfunc(countw(&ds_list.));
%let ds = %scan(&ds_list.,&i.);
proc contents data = &ds. noprint
out=names_&ds.(keep=MEMNAME NAME rename=(MEMNAME=SOURCE_DATASET));
run;
proc append data = names_&ds. base=full force; run;
%end;
%MEND;
%get_names(&ds_list.);
I managed to do this using the following:
Create table with source tables.
PROC SQL;
CREATE TABLE SOURCES AS
SELECT NAME
,MEMNAME
FROM DICTIONARY.COLUMNS
WHERE LIBNAME='LIBNAME'
ORDER BY 1,2;
QUIT;
Join to my stats table.
PROC SQL;
CREATE TABLE STATS_NEW AS
SELECT memname AS TABLE_NAME,a.*
FROM STATS a
LEFT JOIN SOURCES b
ON a.name = b.name
ORDER BY a.name; /* a GROUP BY with no summary function is only turned into ORDER BY by PROC SQL, so it is dropped here */
QUIT;
Transpose data and add in comma separators.
DATA STATS_TRANSPOSE (drop=TABLE_NAME);
LENGTH INPUT_TABLES $1000;
SET STATS_NEW;
BY name;
RETAIN INPUT_TABLES;
IF FIRST.name THEN DO; INPUT_TABLES=TABLE_NAME; END;
IF NOT FIRST.name
THEN DO;
INPUT_TABLES=CATX(', ',INPUT_TABLES,TABLE_NAME); /* CATS would strip the blank from the ', ' separator */
END;
IF LAST.name THEN DO;
IF name IN ('FIELD1','FIELD2')
THEN DO; INPUT_TABLES='ALL'; END;
OUTPUT;
END;
RUN;