Sort all rows by length of string in variable X (longer strings first) - sas

I have a variable UserName that contains IDs of variable length. A shortened example:
How can I sort all rows by variable X so that longer strings are listed first?
Context: This is for calculating HEI 2015 scores using the ASA24 macro. It writes:
/*Note: Some users have found that the SAS program will drop observations from the analysis if the ID field is not the same length for all observations. To prevent this error, the observations with the longest ID length should be listed first when the data is imported into SAS. */

Use Proc SQL with an ORDER BY clause whose ordering value is computed in a CASE expression.
The expression case when length(X) > 8 then -length(X) else 0 end ensures the longest values come first when sorted, and all values whose length is at most some capping length (8 here) are treated equally.
ORDER BY length(X) desc, X would also put the longest X values first and then order by X itself, but length would dominate the ordering even among values of length 8 or less.
data have;
length X $50;
input X; datalines;
GFHsp036
GFHsp038
GFHsp039
GFHsp040
GFHsp0400
GFHsp0401
GFHsp0402
GFHsp04021
;
proc sql;
create table want as
select * from have
order by
case when length(x) > 8 then -length(X) else 0 end,
X
;
quit;
proc print;
var X / style=[fontfamily='Courier'];
run;

Here is probably the simplest way to do this:
data have;
input string $;
datalines;
abcde
ab
a
abcd
abc
;
proc sql;
create table want as
select * from have
order by length(string) desc;
quit;

Re-ordering the IDs did not help in my case; PROC IMPORT needed GUESSINGROWS=MAX.
Please see "SAS Macro Truncating IDs" for how to fix the truncated IDs that this question attempted to address.
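The GUESSINGROWS fix mentioned above might look like the following minimal sketch. The file name asa24_totals.csv and the output dataset name are hypothetical placeholders, not taken from the original post.

```sas
/* Sketch only: the CSV path and dataset name are assumed for illustration */
proc import datafile="asa24_totals.csv"
    out=work.asa24_totals
    dbms=csv
    replace;
  guessingrows=max;  /* scan all rows before choosing column types and lengths */
run;
```

With all rows scanned, the character length of the ID column should be based on the longest ID rather than on the first few rows, so the row order no longer matters.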

Related

SAS changes all numerics to length 8 even when input data sets define numerics otherwise

I have two input datasets which I need to interweave. The input files have defined lengths for numeric fields depending on the size of the integer. When I interweave the datasets -- either a DATA or PROC SQL statement -- the lengths of numeric fields are all reset to the default of 8. Outside of explicitly defining the length for each field in a LENGTH statement, is there an option for SAS to keep the original attributes of the input columns?
More details ...
data A ;
length numeric_variable 3 ;
{input data}
;
data B ;
length numeric_variable 3 ;
{input data}
;
data AB ;
set A B ;
by some_id_variable ;
{stuff};
;
In the data set AB, the variable NUMERIC_VARIABLE is length 8 instead of 3. I can explicitly put another length statement in the "data AB" statement, but I have tons of columns.
Your description is wrong. A data step will set the length based on how it is first defined. If you just select the variable in SQL it keeps its length. However in SQL if you are doing something like UNION that combines variables from different sources then the length will be set to 8.
Example:
data one; length x 3; x=1; run;
data two; length x 5; x=2; run;
data one_two; set one two; run;
data two_one; set two one; run;
proc sql ;
create table sql_one as select * from one;
create table sql_two as select * from two;
create table sql_one_two as select * from one union select * from two;
create table sql_two_one as select * from two union select * from one;
quit;
proc sql;
select memname,name,length
from dictionary.columns
where libname='WORK'
and (memname like '%ONE%'
or memname like '%TWO%')
;
quit;
Results:
Member Name    Column Name    Column Length
-------------------------------------------
ONE            x                          3
ONE_TWO        x                          3
SQL_ONE        x                          3
SQL_ONE_TWO    x                          8
SQL_TWO        x                          5
SQL_TWO_ONE    x                          8
TWO            x                          5
TWO_ONE        x                          5
So if you want to control how your variables are defined, either add the LENGTH statement as you mentioned, or create a template dataset and reference it in your data steps before the other dataset(s). For complex SQL code you will need to include the LENGTH= option in your SELECT clause to force the lengths of the variables you are creating.
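As a rough illustration of the two options just described, here is a minimal sketch. The dataset name template is hypothetical, and applying LENGTH= on an outer SELECT wrapped around the UNION is one assumed way to use the option, not necessarily what the answer's author had in mind.

```sas
/* Demo inputs with different numeric lengths (same as the example above) */
data one; length x 3; x=1; run;
data two; length x 5; x=2; run;

/* Hypothetical zero-row template that fixes the attributes */
data template;
  length x 3;
  stop;
run;

/* DATA step: the template is read first, so its length is the one used */
data one_two;
  set template one two;
run;

/* PROC SQL: force the length on the outer SELECT */
proc sql;
  create table sql_one_two as
  select x length=3
  from (select * from one
        union
        select * from two);
quit;

/* Inspect the resulting lengths */
proc contents data=one_two; run;
proc contents data=sql_one_two; run;
```

PROC CONTENTS lets you confirm which length each approach produced on your own data.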
Can you post code that demonstrates the problem?
This code does NOT exhibit a final data set in which the numeric lengths get changed from 3 to 8.
data A; id = 'A'; length x 3; x=1;
data B; id = 'A'; length x 3; x=2;
data AB;
set A B;
by id;
run;
proc contents data=AB; run;
Contents
# Variable Type Len
1 id Char 1
2 x Num 3

Replacing values of a column with its minimum value in sas

I am very new to SAS and I have the following work table.
I want to create a new table in which columns Date and Z remain the same, but all values in column X are replaced with the minimum value in column X and all values in column Y are replaced with the minimum value in column Y.
Sample output is as follows
You can use the fact that PROC SQL will automatically remerge aggregate statistics back onto detail observations.
proc sql;
create table want as
select date, min(x) as x, min(y) as y, z
from have
;
quit;
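If you don't want to use PROC SQL, you can adapt this code from https://blogs.sas.com/content/iml/2014/12/01/max-and-min-rows-and-cols.html. As written, it computes the minimum and maximum across the numeric variables within each observation (row), so it needs modification to produce column minimums.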
data MinMaxRows;
set sashelp.Iris;
array x {*} _numeric_; /* x[1] is 1st var,...,x[4] is 4th var */
min = min(of x[*]); /* min value for this observation */
max = max(of x[*]); /* max value for this observation */
run;
proc print data=MinMaxRows(obs=7);
var _numeric_;
run;
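For column minimums without PROC SQL, one possible approach is to compute them with PROC MEANS and merge them back in a DATA step. This is only a sketch: the sample rows in have are invented for illustration (the question's table was not preserved), and the names mins, min_x and min_y are hypothetical.

```sas
/* Invented sample data with the column names from the question */
data have;
  input date :date9. x y z;
  format date date9.;
  datalines;
01JAN2020 5 9 1
02JAN2020 3 7 2
03JAN2020 8 4 3
;

/* One-row dataset holding the minimum of X and of Y */
proc means data=have noprint;
  var x y;
  output out=mins(keep=min_x min_y) min=min_x min_y;
run;

/* Read the one-row minimums once; their values are retained for every row */
data want;
  if _n_ = 1 then set mins;
  set have;
  x = min_x;
  y = min_y;
  drop min_x min_y;
run;
```

Date and Z pass through unchanged, while X and Y are overwritten on every observation.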

defining default values

I'm a new SAS user and I have a small problem.
I have one large empty table A with, let's say, 100 columns that I created with a simple proc sql; create table.
I have another table B with, let's say, 40 columns, and a table C with 55 columns.
I want to add these two tables into table A; basically I want a table with 100 columns containing the data from tables B and C, and I'm doing this with a UNION command.
Since I don't have values for all 100 variables, I have to set default values.
Let's say I have a column named nourishment in table A, which is named food in table B and has no equivalent in table C. I have rules like "If the data comes from table B then value = xxx; if it's from table C then value = 'DefaultValue'".
I'd do this easily with R or Python but I'm struggling with SAS.
I'm using SAS SQL commands (a UNION command).
How do you set default values (for all data types: character, numeric or date)?
Dates in SAS are actually just numeric values. Often they have a date format applied to make them readable.
So you could just assign a missing value by default like so:
. as ColumnName
or any default date like so
'17NOV2017'd as ColumnName
. as MyColumnName
SAS can deal with missing values.
Using a specially coded value, such as 'NA', to represent a missing value condition can work but may lead to headaches and extra coding. Recommended read in SAS help: "Working with Missing Values"
The default SAS missing value for numerics (which also includes dates) is period.
. as MyColumnName
SAS also has 27 special missing values for numerics that are expressed as .<character>
.A as MyColumnName
...
.Z as MyColumnName
._ as MyColumnName
The missing value for character variables is a single space
' '
'' empty quote string also works
' ' as does a longer empty string
Rule of thumb: be consistent when coding your missing values.
You can use OPTIONS MISSING to specify what character is shown when a missing value is printed.
OPTIONS MISSING = '*'; * My special representation of missing for this report;
Proc PRINT data=myData;
run;
OPTIONS MISSING = '.'; * Restore to the default;
SAS custom formats can also be used to customize what is printed for missing values.
Proc FORMAT;
value MissingN
. = 'N/A'
.N = 'Special N/A different than regular N/A' /* for .N */
;
value $MissingC
' ' = 'N/A'
;
value SillyChristmasStocking
.C = 'Bad'
.O = 'children'
.A = 'get'
.L = 'No toys'
;
run;
The token after the value keyword can be any new valid SAS name that you want to use as your format name.
Proc PRINT data=myData;
format myColumnName MissingN.;
format name $MissingC.;
format behaviour SillyChristmasStocking.;
run;
As for your character missing value conditions, I would continue to use " " or ' '
You mention UNION, which is a SQL feature. In SQL, JOINs also occur, perhaps more often than UNIONs. When JOINing, if values from two source columns collide, you will want to use either the COALESCE() function or a CASE expression to select the non-missing value.
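A minimal sketch of that COALESCE idea follows. The tables b and c, their columns, and the join key id are all invented for illustration; only the column names food and nourishment and the literal 'DefaultValue' echo the question.

```sas
/* Invented demo tables; the id key is an assumption */
data b;
  length food $12;
  id=1; food='apple'; output;
  id=2; food=' ';     output;
run;
data c;
  length food $12;
  id=2; food='bread'; output;
  id=3; food='corn';  output;
run;

proc sql;
  create table joined as
  select coalesce(b.id, c.id) as id,
         /* first non-missing of B, then C, then the default */
         coalesce(b.food, c.food, 'DefaultValue') as nourishment
  from b full join c
    on b.id = c.id;
quit;
```

COALESCE returns its first non-missing argument, so listing the default last makes it apply only when neither source has a value.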
I would not recommend using UNION in PROC SQL at any point in your SAS usage. UNION is almost always inferior to a simple data step, or a data step view.
That's because the data step seamlessly handles issues like differing variables on different tables. SAS is quite comfortable with vertically combining datasets; SQL is always a bit trickier when they're not identical.
data c;
set a b;
run;
That runs whether or not a and b are identical, so long as a and b don't have conflicting variable names (that aren't intended to be in the same column); and if they do, just use the rename dataset option to resolve it.
If you do as the above, and don't use union, you'll get a missing value automatically for those dates.
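The source-dependent rule from the question (one value when the row comes from B, a default when it comes from C) could be sketched with the IN= and RENAME= dataset options. The tiny tables below and the variable names fromB and fromC are hypothetical; only food, nourishment and 'DefaultValue' come from the question.

```sas
/* Hypothetical stand-ins for the question's tables A, B and C */
data a;
  length id 8 nourishment $20 amount 8;
  stop;                      /* zero rows: structure only */
run;
data b;
  length food $20;
  id=1; food='apple'; amount=3;
run;
data c;
  id=2; amount=5;
run;

data want;
  /* A supplies the column definitions; IN= flags record which table each row came from */
  set a(obs=0)
      b(in=fromB rename=(food=nourishment))
      c(in=fromC);
  if fromC then nourishment = 'DefaultValue';
run;
```

Columns that neither B nor C supplies are simply left missing, as described above.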
NFN:
DATA Step
A DATA Step approach for stacking data is the simplest. Use SET to stack the data and array processing to apply your defaults. For example:
data stacked_data;
set
TARGET_TEMPLATE (obs = 0)
ONE
TWO
;
array allchar _character_;
array allnum _numeric_;
array dates d1-d5;
do over allchar; if missing(allchar) then allchar = '*UNKNOWN*'; end;
do over dates; if missing(dates) then dates='01NOV1971'd; end; /* dates first: allnum also covers d1-d5 */
do over allnum; if missing(allnum) then allnum = -995; end;
run;
A subtle issue is that any missing values in ONE or TWO will be replaced with the default value.
Proc SQL
In Proc SQL you will want to create a single row table containing the default values for A. That table can be joined to the union of B and C. The join select will involve coalesce() in order to choose the predefined default value when a column is not from B or C.
For example, suppose you have an empty (zero rows), richly columned, target table (your A) acting as a template:
data TARGET_TEMPLATE;
length _n_ 8;
length a1-a5 $25 d1-d5 4 x1-x20 y1-y20 z1-z20 p1-p20 q1-q20 r1-r20 8;
call missing (of _all_);
format d1-d5 yymmdd10.;
stop;
run;
Because Proc SQL does not provide syntax for a default constraint you need to create a table of your own defaults. This is probably easiest with DATA Step:
data TARGET_DEFAULTS;
if 0 then set TARGET_TEMPLATE (obs=0); * prep pdv to match TARGET;
array allchar _character_ (1000 * '*UNKNOWN*');
array allnum _numeric_ (1000 * -995);
array d d1-d5 (5 * '01NOV1971'd); * override the allnum array initialization;
output;
stop;
run;
Here is some generated demo data, ONE and TWO, that correspond to your B and C:
data ONE;
if 0 then set TARGET_TEMPLATE (obs=0); * prep pdv of demo data to match TARGET;
do _n_ = 1 to 100;
array a a1 a3 a5;
array num x: y: z:;
array d d1 d2;
do over a; a = catx (' ', 'ONE', _n_, _i_); end;
do over num; num = 1000 + _n_ + _i_; end;
retain foodate '01jan1975'd;
do over d; d=foodate; foodate+1; end;
output;
end;
keep a1 a3 a5 x: y: z: d1 d2; * keep the disparate columns that were populated;
run;
data TWO;
if 0 then set TARGET_TEMPLATE (obs=0); * prep pdv of demo data to match TARGET;
do _n_ = 1 to 200;
array a a1 a2 a3;
array num x5 y5 z5 p: q: r:;
array d d1 d2;
do over a; a = catx (' ', 'TWO', _n_, _i_); end;
do over num; num = 20000 + _n_*10 + _i_; end;
retain foodate '01jan1985'd;
do over d; d=foodate; foodate+1; end;
output;
end;
keep a1 a2 a3 x5 y5 z5 p: q: r:; * keep the disparate columns that were populated;
run;
A stacking of A, B and C is simple SQL but does not introduce target specific default values:
proc sql noprint;
* generic UNION stack with SAS missing values (space and dot) for cells
* where ONE and TWO did not contribute any data;
create table stacked_data as
select * from TARGET_TEMPLATE %*** empty template first ensures template column order and formats are honored in output data;
outer union corresponding %*** align by column name, do not remove duplicates;
select * from ONE
outer union corresponding
select * from TWO
;
When the stacking is put in a sub-query, it can be joined with the defaults. The choosing of the target default value for each column involves examining DICTIONARY.COLUMNS and generating the SQL source for selecting the coalescence of stack and default.
proc sql noprint;
* codegen select items ;
select cat('coalesce(STACK.',trim(name),',DEFAULT.',trim(name),') as ',trim(name))
into :coalesces separated by ','
from DICTIONARY.COLUMNS
where libname = 'WORK' and memname = 'TARGET_TEMPLATE' %* dictionary lib and mem name values are always uppercase;
order by npos
;
create table stacked_data_with_defaults as
select * from TARGET_TEMPLATE %*** output honors template;
outer union corresponding
select
source
, &coalesces %*** apply codegen;
from
(
select * from WORK.TARGET_TEMPLATE %*** ensure fully columned sub-select that will align with coalesces;
outer union corresponding
select 'one' as source, * from ONE
outer union corresponding
select 'two' as source, * from TWO
) as STACK
join
TARGET_DEFAULTS as DEFAULT
on 1=1
;
quit;
Why would you create an empty dataset? What is it going to be used for? Perhaps you want to use it as a default structure definition? If so and you want to stack B and C and get them in the structure defined by A you could code this way.
data want ;
set a(obs=0) b c ;
run;
Not sure what the purpose would be to have default values. Couldn't you use formats if you want missing values to display in special ways?
Or you could create code to set default values and perhaps just %include it or wrap the logic into a macro. So if you had a code file named 'defaults.sas' with lines like this:
startdate=coalesce(startdate,'01JAN2013'd);
gender=coalescec(gender,'UNKNOWN');
Then your little program to make a new dataset that looks like A and uses the data from B and C would look like this.
data want ;
set a(obs=0) b c ;
%include 'defaults.sas';
run;
If you really did want to aggregate the records into some large dataset then perhaps you want to use PROC APPEND to add the records once they are created in the right structure.
proc append data=want base=a ;
run;

SAS SCAN Function and Missing Values

I am trying to develop a recursive program to fill in missing string values using flat probabilities (for instance, if a variable had three possible values and one observation was missing, the missing observation would have a 33% chance of being replaced with any one of those values).
Note: The purpose of this post is not to discuss the merit of imputation techniques.
DATA have;
INPUT id gender $ b $ c $ x;
CARDS;
1 M Y . 5
2 F N . 4
3 N Tall 4
4 M Short 2
5 F Y Tall 1
;
/* Counts number of categories i.e. 2 */
proc sql;
SELECT COUNT(Unique(gender)) into :rescats
FROM have
WHERE Gender ~= " " ;
Quit;
%let rescats = &rescats;
%put &rescats; /*internal check */
/* Collects response categories separated by commas i.e. F,M */
proc sql;
SELECT UNIQUE gender into :genders separated by ","
FROM have
WHERE Gender ~= " "
GROUP BY Gender;
QUIT;
%let genders = &genders;
%put &genders; /*internal check */
/* Counts entries to be evaluated. In this case observations 1 - 5 */
/* Note CustomerKey is an ID variable */
proc sql;
SELECT COUNT (UNIQUE(customerKey)) into :ID
FROM have
WHERE customerkey < 6;
QUIT;
%let ID = &ID;
%put &ID; /*internal check */
data want;
SET have;
DO i = 1 to &ID; /* Control works from 1 to 5 */
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 and 2 */
RandGender = (ROUND(u*(&rescats - 1)) + 1)*1;
/* PROBLEM Should if gender is missing set string value of M or F */
IF gender = ' ' THEN gender = SCAN(&genders, RandGender, ',');
END;
RUN;
The SCAN function does not create an F or M value within gender. It also appears to create new M and F variables. Additionally, the DO loop creates additional entries under CustomerKey. Is there any way to get rid of these?
I would prefer to use loops and macros to solve this. I'm not yet proficient with arrays.
Here is my attempt at tidying this up a little:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
infile cards dlm=','; /* INFILE must come before INPUT for its options to apply */
INPUT id gender $ b $ c $ x;
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
/*Consolidated into 1 proc, added noprint and removed unnecessary group by*/
proc sql noprint;
/* Counts number of categories i.e. 2 */
SELECT COUNT(unique(gender)) into :rescats
FROM have
WHERE not(missing(Gender));
/* Collects response categories separated by commas i.e. F,M */
SELECT unique gender into :genders separated by ","
FROM have
WHERE not(missing(Gender))
;
Quit;
/*Removed redundant %let statements*/
%put rescats = &rescats; /*internal check */
%put genders = &genders; /*internal check */
/*Removed ID list code as it wasn't making any difference to the imputation in this example*/
data want;
SET have;
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 or 2 */
RandGender = ROUND(u*(&rescats - 1)) + 1;
IF missing(gender) THEN gender = SCAN("&genders", RandGender, ','); /*Added quotes around &genders to prevent SAS interpreting M and F as variable names*/
RUN;
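One caveat about the ROUND expression: it gives equal odds only when there are two categories. With three, ROUND(u*2)+1 returns 1 for u below 0.25, 2 for u between 0.25 and 0.75, and 3 above that, so the middle category is drawn twice as often as either end. The sketch below swaps in CEIL to get a flat 1/k chance for any k. It is an assumed variation on the code above, not part of the original answer, and the TRIMMED keyword requires SAS 9.3 or later.

```sas
data have;
  input id gender $ b $ c $ x;
  datalines;
1 M Y . 5
2 F N . 4
3 . N Tall 4
4 M . Short 2
5 F Y Tall 1
;

proc sql noprint;
  /* number of distinct non-missing categories */
  select count(distinct gender) into :rescats trimmed
    from have
    where not missing(gender);
  /* comma-separated list of the categories */
  select distinct gender into :genders separated by ','
    from have
    where not missing(gender);
quit;

data want;
  set have;
  if missing(gender) then do;
    /* RANUNI returns a value strictly between 0 and 1, so CEIL yields 1..&rescats */
    RandGender = ceil(&rescats * ranuni(12345));
    gender = scan("&genders", RandGender, ',');
  end;
run;
```

The same pattern can be repeated for other variables by changing the column name and macro variable names.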
Halo8:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
infile cards dlm=','; /* INFILE must come before INPUT for its options to apply */
INPUT id gender $ b $ c $ x;
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
run;
Tip: You can use a dot (.) to mean a missing value for a character variable during INPUT.
Tip: DATALINES is the modern alternative to CARDS.
Tip: Data values don't have to line up, but it helps humans.
Thus this works as well:
/*List input with a dot (.) marking each missing value, so values end up in the right columns*/
DATA have;
INPUT id gender $ b $ c $ x;
DATALINES;
1 M Y . 5
2 F N . 4
3 . N Tall 4
4 M . Short 2
5 F Y Tall 1
;
run;
Tip: Your technique requires two passes over the data.
One to determine the distinct values.
A second to apply your imputation.
Most approaches require two passes per variable processed. A hash approach can do only two passes but requires more memory.
There are many ways to determine distinct values: SORTING+FIRST., Proc FREQ, DATA Step HASH, SQL, and more.
Tip: Solutions that move data to code back to data are sometimes needed, but can be troublesome. Often the cleanest way is to let data remain data.
For example: INTO will be the wrong approach if the concatenated distinct values would require more than 64K
Tip: Data to Code is especially troublesome for continuous values and other values that are not represented exactly the same when they become code.
For example: high precision numeric values, strings with control-characters, strings with embedded quotes, etc...
This is one approach using SQL. As mentioned before, Proc SURVEYSELECT is far better for real applications.
Proc SQL;
Create table REPLACEMENTS as select distinct gender from have where gender is NOT NULL;
%let REPLACEMENT_COUNT = &SQLOBS; %* Tip: Take advantage of automatic macro variable SQLOBS;
data REPLACEMENTS;
set REPLACEMENTS;
rownum+1; * rownum needed for RANUNI matching;
run;
Proc SQL;
* Perform replacement of missing values;
Update have
set gender =
(
select gender
from REPLACEMENTS
where rownum = ceil(&REPLACEMENT_COUNT * ranuni(1234))
)
where gender is NULL
;
%let SYSLAST = have;
DM 'viewtable have' viewtable;
You don't have to be concerned about columns that have no missing values, because no replacement occurs in those. For columns that do have missing values, the list of candidate REPLACEMENTS excludes the missing value, and REPLACEMENT_COUNT is the correct count for computing the uniform probability of replacement, 1/COUNT, coded as rownum = ceil(random).

Do loop and If statement in Proc IML

I have table1 that contains one column (city), I have a second table (table2) that has two columns (city, distance),
I am trying to create a third table, table 3, this table contains two columns (city, distance), the city in table 3 will come from the city column in table1 and the distance will be the corresponding distance in table2.
I tried doing this using Proc IML based on Joe's suggestion and this is what I have.
proc iml;
use Table1;
read all var _CHAR_ into Var2 ;
use Table2;
read all var _NUM_ into Var4;
read all var _CHAR_ into Var5;
do i=1 to nrow(Var2);
do j=1 to nrow(Var5);
if Var2[i,1] = Var5[j,1] then
x[i] = Var4[i];
end;
create Table3 from x;
append from x;
close Table3 ;
quit;
I am getting an error, matrix x has not been set to a value. Can somebody please help me here. Thanks in advance.
The technique you want to use is called the "unique-loc technique". It enables you to loop over unique values of a categorical variable (in this case, unique cities) and do something for each value (in this case, copy the distance into another array).
So that others can reproduce the idea, I've embedded the data directly into the program:
proc iml;
T1_City = {"Gould","Boise City","Felt","Gould","Gould"};
T2_City = {"Gould","Boise City","Felt"};
T2_Dist = {10, 15, 12};
T1_Dist = j(nrow(T1_City),1,.); /* allocate vector for results */
do i = 1 to nrow(T2_City);
idx = loc(T1_City = T2_City[i]); /* rows of Table1 that match this city */
if ncol(idx)>0 then
T1_Dist[idx] = T2_Dist[i];
end;
print T1_City T1_Dist;
quit;
The IF-THEN statement guards against cities in Table2 that are not in Table1, in which case LOC returns an empty matrix. You can read about why it is important to use that IF-THEN statement. The IF-THEN statement is not needed if Table2 contains all unique elements of Table1 cities.
This technique is discussed and used extensively in my book Statistical Programming with SAS/IML Software.
You need a nested loop, or to use a function that finds a value in another matrix.
IE:
do i = 1 to nrow(table1);
do j = 1 to nrow(table2);
...
end;
end;
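Filled in, the nested-loop version might look like the sketch below. It reuses the inline data from the other answer rather than reading Table1 and Table2, so the matrix names are assumptions. Compared with the code in the question, it allocates x before the loops (which is what the "has not been set to a value" error was about) and indexes the distance by j, the row of the second table, instead of i.

```sas
proc iml;
T1_City = {"Gould","Boise City","Felt","Gould","Gould"};
T2_City = {"Gould","Boise City","Felt"};
T2_Dist = {10, 15, 12};

x = j(nrow(T1_City), 1, .);      /* allocate the result vector first */
do i = 1 to nrow(T1_City);       /* each city in the first table */
  do j = 1 to nrow(T2_City);     /* search the second table for a match */
    if T1_City[i] = T2_City[j] then
      x[i] = T2_Dist[j];         /* distance comes from row j of the second table */
  end;
end;

print T1_City x;
quit;
```

Cities with no match in the second table keep a missing distance.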