I have a column with many flags that were parsed out by an XML parser. The data looks like this:
USERKEYED=Y;VALMATCH=N;DEVICEVERIFIED=N;EXCEPTION=N;USERREGISTRD=N;ASSOCIATE=Y;EXTERNAL=N;GROSSGIVEN=Y;UMAPPED=N;
I have to create a table with all these column names to capture the flags. Like:
USERKEYED VALMATCH DEVICEVERIFIED EXCEPTION USERREGISTRD ASSOCIATE EXTERNAL GROSSGIVEN UMAPPED
Y N N N N Y N Y N
Y N N N N Y Y Y N
Y N N Y N Y N Y N
How can I capture values dynamically in SAS? Either in a DATA step or a PROC step?
Thanks in advance.
Let's start with your example output data.
data expect ;
id+1;
length USERKEYED VALMATCH DEVICEVERIFIED EXCEPTION
USERREGISTRD ASSOCIATE EXTERNAL GROSSGIVEN UMAPPED $1 ;
input USERKEYED -- UMAPPED;
cards4;
Y N N N N Y N Y N
Y N N N N Y Y Y N
Y N N Y N Y N Y N
;;;;
Now we can recreate your example input data:
data have ;
  do until (last.id);
    set expect ;
    by id ;
    array flag _character_;
    length string $200 ;
    do _n_=1 to dim(flag);
      string=catx(';',string,catx('=',vname(flag(_n_)),flag(_n_)));
    end;
  end;
  keep id string;
run;
Which will look like this:
USERKEYED=Y;VALMATCH=N;DEVICEVERIFIED=N;EXCEPTION=N;USERREGISTRD=N;ASSOCIATE=Y;EXTERNAL=N;GROSSGIVEN=Y;UMAPPED=N
USERKEYED=Y;VALMATCH=N;DEVICEVERIFIED=N;EXCEPTION=N;USERREGISTRD=N;ASSOCIATE=Y;EXTERNAL=Y;GROSSGIVEN=Y;UMAPPED=N
USERKEYED=Y;VALMATCH=N;DEVICEVERIFIED=N;EXCEPTION=Y;USERREGISTRD=N;ASSOCIATE=Y;EXTERNAL=N;GROSSGIVEN=Y;UMAPPED=N
So to process this we need to parse out the pairs from the variable STRING into multiple observations with the individual pairs' values split into NAME and VALUE variables.
data middle ;
  set have ;
  do _n_=1 by 1 while(_n_=1 or scan(string,_n_,';')^=' ');
    length name $32 ;
    name = scan(scan(string,_n_,';'),1,'=');
    value = scan(scan(string,_n_,';'),2,'=');
    output;
  end;
  keep id name value ;
run;
Then we can use PROC TRANSPOSE to convert those observations into variables.
proc transpose data=middle out=want (drop=_name_) ;
by id;
id name ;
var value ;
run;
The data that you have is a series of name/value pairs, using a ; as a delimiter. We can extract each name/value pair one at a time, and then parse those into values:
data tmp;
  length my_string next_pair name value $200;
  my_string = "USERKEYED=Y;VALMATCH=N;DEVICEVERIFIED=N;EXCEPTION=N;USERREGISTRD=N;ASSOCIATE=Y;EXTERNAL=N;GROSSGIVEN=Y;UMAPPED=N;";
  cnt = 1;
  next_pair = scan(my_string,cnt,";");
  do while (next_pair ne "");
    name = scan(next_pair,1,"=");
    value = scan(next_pair,2,"=");
    output;
    cnt = cnt + 1;
    next_pair = scan(my_string,cnt,";");
  end;
  keep name value;
run;
Gives us:
name value
=================== =====
USERKEYED Y
VALMATCH N
DEVICEVERIFIED N
EXCEPTION N
USERREGISTRD N
ASSOCIATE Y
EXTERNAL N
GROSSGIVEN Y
UMAPPED N
We can then transpose the data so that the name is used for the column names:
proc transpose data=tmp out=want(drop=_name_);
id name;
var value;
run;
Which gives you the desired table.
DATA <MY_DATASET>;
  SET INPUT_DATASET;
  /* the offset added to FIND is the length of each 'NAME=' search string */
  USERKEYED = substr(input_column, find(input_column, 'USERKEYED=')+10,1);
  VALMATCH = substr(input_column, find(input_column, 'VALMATCH=')+9,1);
  DEVICEVERIFIED = substr(input_column, find(input_column, 'DEVICEVERIFIED=')+15,1);
  EXCEPTION = substr(input_column, find(input_column, 'EXCEPTION=')+10,1);
  USERREGISTRD = substr(input_column, find(input_column, 'USERREGISTRD=')+13,1);
  ASSOCIATE = substr(input_column, find(input_column, 'ASSOCIATE=')+10,1);
  EXTERNAL = substr(input_column, find(input_column, 'EXTERNAL=')+9,1);
  GROSSGIVEN = substr(input_column, find(input_column, 'GROSSGIVEN=')+11,1);
  UMAPPED = substr(input_column, find(input_column, 'UMAPPED=')+8,1);
run;
My answer is essentially in the first block of code; the rest is explanation, one alternative, and a nice tip.
Based on the answer you gave, the input data is already in a SAS data set, so it can be read to create a file of SAS code which is then run using %include; proc transpose is not required:
filename tempcode '<path><file-name.txt>'; /* set this up yourself */
/* write out SAS code to the fileref tempcode */
data _null_;
file tempcode;
set have;
if _n_=1 then
put 'Y="Y"; N="N"; drop Y N;';
put input_column;
put 'output;';
run;
/* %include the code to create the desired output */
data want;
%include tempcode;
run;
As the input data already looks almost like SAS assignment statements, we take advantage of that, so the SAS code run from fileref tempcode using %include should look like:
Y="Y"; N="N"; drop Y N;
USERKEYED=Y;VALMATCH=N;DEVICEVERIFIED=N;EXCEPTION=N;USERREGISTRD=N;ASSOCIATE=Y;EXTERNAL=N;GROSSGIVEN=Y;UMAPPED=N;
output;
USERKEYED=Y;VALMATCH=N;DEVICEVERIFIED=N;EXCEPTION=N;USERREGISTRD=N;ASSOCIATE=Y;EXTERNAL=Y;GROSSGIVEN=Y;UMAPPED=N;
output;
USERKEYED=Y;VALMATCH=N;DEVICEVERIFIED=N;EXCEPTION=Y;USERREGISTRD=N;ASSOCIATE=Y;EXTERNAL=N;GROSSGIVEN=Y;UMAPPED=N;
output;
As an alternative, fileref tempcode could contain all of the code for data step "data want;":
/* write out entire SAS data step code to the fileref tempcode */
data _null_;
file tempcode;
set have end=lastrec;
if _n_=1 then
put 'data want;'
/'Y="Y"; N="N"; drop Y N;';
put input_column;
put 'output;';
if lastrec then
put 'run;';
run;
%include tempcode; /* no need for surrounding SAS code */
As a tip, to see the code being processed by %include in the log you can use the following variation:
%include tempcode / source2;
I need to get a dataset for a uniform 20x20 grid using info from SASHELP.CARS, so that the x and y variables are obtained as follows:
do y = min(weight) to max(weight) by (min(weight)+max(weight))/20;
do x = min(horsepower) to max(horsepower) by (min(horsepower)+max(horsepower))/20;
output;
end;
end;
Weight and HorsePower are variables of SASHELP.CARS. Furthermore, the grid dataset has to have two more columns, EngineSizeMean and LengthMean, with the same value in every row, equal to mean(EngineSize) and mean(Length) from SASHELP.CARS (I need all this to build a dependency graph for a regression model).
First calculate the statistics you need to use.
proc summary data=sashelp.cars ;
var weight horsepower enginesize length ;
output out=stats
min(weight horsepower)=
max(weight horsepower)=
mean(enginesize length)=
/ autoname
;
run;
Then use those values to generate your "grid".
data want;
  set stats;
  do y = 1 to 20 ;
    weight= weight_min + (y-1)*(weight_min+weight_max)/20;
    do x = 1 to 20 ;
      horsepower = horsepower_min + (x-1)*(horsepower_min+horsepower_max)/20;
      output;
    end;
  end;
run;
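If you want the extra mean columns named exactly as in your question (EngineSizeMean and LengthMean) rather than the AUTONAME versions, a RENAME= dataset option on the SET statement is enough. A minimal sketch, assuming PROC SUMMARY's AUTONAME produced EngineSize_Mean and Length_Mean as above:
data want;
  /* rename the autoname statistics to the column names requested in the question */
  set stats (rename=(enginesize_mean=EngineSizeMean length_mean=LengthMean));
  do y = 1 to 20 ;
    weight= weight_min + (y-1)*(weight_min+weight_max)/20;
    do x = 1 to 20 ;
      horsepower = horsepower_min + (x-1)*(horsepower_min+horsepower_max)/20;
      output;
    end;
  end;
run;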
I've got a pretty big table where I want to replace rare values (for this example, values that have fewer than 10 occurrences, but the real case is more complicated: a variable might have 1,000 levels while I want to keep only 15). The list of possible levels might change, so I don't want to hardcode anything.
My code is like:
%let var = Make;
proc sql;
create table stage1_ as
select &var.,
count(*) as count
from sashelp.cars
group by &var.
having count >= 10
order by count desc
;
quit;
/* Join table with table including only top obs to replace rare
values with "other" category */
proc sql;
create table stage2_ as
select t1.*,
case when t2.&var. is missing then "Other_&var." else t1.&var. end as &var._new
from sashelp.cars t1 left join
stage1_ t2 on t1.&var. = t2.&var.
;
quit;
/* Drop old variable and rename the new as old */
data result;
set stage2_(drop= &var.);
rename &var._new=&var.;
run;
It works, but unfortunately it is not very efficient, as it needs a join for each variable (in the real case I am doing this in a loop).
Is there a better way to do it? Maybe some smart replace function?
Thanks!!
You probably don't want to change the actual data values. Instead consider creating a custom format for each variable that will map the rare values to an 'Other' category.
ODS output from the FREQ procedure can capture the counts and percentages of every variable listed into a single table. (Note: the OUT= option on the TABLES statement captures only the last listed variable, which is why ODS OUTPUT is used here.) Those counts can then be used to construct the formats according to the 'othering' rules you want to implement.
data have;
  do row = 1 to 1000;
    array x x1-x10;
    do over x;
      if row < 600
        then x = ceil(100*ranuni(123));
        else x = ceil(150*ranuni(123));
    end;
    output;
  end;
run;
ods output onewayfreqs=counts;
proc freq data=have ;
table x1-x10;
run;
data count_stack;
  length name $32;
  set counts;
  array x x1-x10;
  do over x;
    name = vname(x);
    value = x;
    if value then output;
  end;
  keep name value frequency;
run;
proc sort data=count_stack;
by name descending frequency ;
run;
data cntlin;
  do _n_ = 1 by 1 until (last.name);
    set count_stack;
    by name;
    length fmtname $32;
    fmtname = trim(name)||'top';
    start = value;
    label = cats(value);
    if _n_ < 11 then output;
  end;
  hlo = 'O';
  label = 'Other';
  output;
run;
proc format cntlin=cntlin;
run;
ods html;
proc freq data=have;
table x1-x10;
format
x1 x1top.
x2 x2top.
x3 x3top.
x4 x4top.
x5 x5top.
x6 x6top.
x7 x7top.
x8 x8top.
x9 x9top.
x10 x10top.
;
run;
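If you really do need the grouped value as an actual data value (for example, to feed a step that cannot use formats), you can materialize it with the PUT function once the formats exist. A minimal sketch using the x1top. format built above; the output data set name and the variable x1_grp are just illustrative:
data have_othered;
  set have;
  length x1_grp $8;
  /* PUT applies the format, so rare values come through as the text 'Other' */
  x1_grp = put(x1, x1top.);
run;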
Looking to automate some checks and print some warnings to a log file. I think I've gotten the general idea but I'm having problems generalising the checks.
For example, I have two datasets my_data1 and my_data2. I wish to print a warning if nobs_my_data2 < nobs_my_data1. Additionally, I wish to print a warning if the number of distinct values of the variable n in my_data2 is less than 11.
Some dummy data and an attempt of the first check:
%LET N = 1000;
DATA my_data1(keep = i u x n);
a = -1;
b = 1;
max = 10;
do i = 1 to &N - 100;
u = rand("Uniform"); /* decimal values in (0,1) */
x = a + (b-a) * u; /* decimal values in (a,b) */
n = floor((1 + max) * u); /* integer values in 0..max */
OUTPUT;
END;
RUN;
DATA my_data2(keep = i u x n);
a = -1;
b = 1;
max = 10;
do i = 1 to &N;
u = rand("Uniform"); /* decimal values in (0,1) */
x = a + (b-a) * u; /* decimal values in (a,b) */
n = floor((1 + max) * u); /* integer values in 0..max */
OUTPUT;
END;
RUN;
DATA _NULL_;
FILE "\\filepath\log.txt" MOD;
SET my_data1 NOBS = NOBS1 my_data2 NOBS = NOBS2 END = END;
IF END = 1 THEN DO;
PUT "HERE'S A HEADER LINE";
END;
IF NOBS1 > NOBS2 AND END = 1 THEN DO;
PUT "WARNING!";
END;
IF END = 1 THEN DO;
PUT "HERE'S A FOOTER LINE";
END;
RUN;
How can I set up the check for the number of distinct values of n in my_data2?
A proc sql way to do it -
%macro nobsprint(tab1,tab2);
options nonotes; *suppresses all notes;
proc sql;
select count(*) into:nobs&tab1. from &tab1.;
select count(*) into:nobs&tab2. from &tab2.;
select count(distinct n) into:distn&tab2. from &tab2.;
quit;
%if &&nobs&tab2. < &&nobs&tab1. %then %put |WARNING! &tab2. has less recs than &tab1.|;
%if &&distn&tab2. < 11 %then %put |WARNING! distinct VAR n count in &tab2. less than 11|;
options notes; *overrides the previous option;
%mend nobsprint;
%nobsprint(my_data1,my_data2);
This would break if you had to pass two-level names (libname.dataset), because the dot cannot appear in the macro variable names that are built from the table names. And you can use proc printto to route the log to a file.
For the other part, to print just the %put messages to a file, use the above as a call:
filename mylog temp;
proc printto log=mylog; run;
options nomprint nomlogic;
%nobsprint(my_data1,my_data2);
proc printto; run;
This won't print any extraneous text to the log other than your custom warnings.
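If you do need to pass two-level names (the libname caveat above), one workaround is to stop building the table name into the macro variable names and use fixed %LOCAL variables instead. A minimal sketch of that variation; nobsprint2 is just a hypothetical name and the checks are the same as above:
%macro nobsprint2(tab1,tab2);
  %local nobs1 nobs2 distn2;
  proc sql noprint;
    select count(*) into :nobs1 from &tab1.;
    select count(*) into :nobs2 from &tab2.;
    select count(distinct n) into :distn2 from &tab2.;
  quit;
  %if &nobs2. < &nobs1. %then %put |WARNING! &tab2. has fewer recs than &tab1.|;
  %if &distn2. < 11 %then %put |WARNING! distinct VAR n count in &tab2. less than 11|;
%mend nobsprint2;
%nobsprint2(work.my_data1, work.my_data2);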
#samkart provided perhaps the most direct, easily understood way to compare the obs counts. Another consideration is performance: if your data sets have millions of obs, you can get the counts without reading the entire data set.
One method is to use the nobs= option on the set statement, as you did in your code, but there you read the data sets unnecessarily. The following gets the counts and compares them without reading any observations:
data _null_;
  if nobs1 ne nobs2 then putlog 'WARNING: Obs counts do not match.';
  stop;
  set sashelp.cars nobs=nobs1;
  set sashelp.class nobs=nobs2;
run;
The log then shows:
WARNING: Obs counts do not match.
Another option is to get the counts from sashelp.vtable or dictionary.tables. Note that you can only query dictionary.tables with proc sql.
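A minimal sketch of the dictionary.tables route; NLOBS holds the logical observation count for ordinary SAS data sets, and the WORK library and data set names are assumed from the question:
proc sql noprint;
  select nlobs into :nobs1 trimmed from dictionary.tables
    where libname='WORK' and memname='MY_DATA1';
  select nlobs into :nobs2 trimmed from dictionary.tables
    where libname='WORK' and memname='MY_DATA2';
quit;
%macro obs_check;
  %if &nobs2. < &nobs1. %then %put WARNING: my_data2 has fewer observations than my_data1.;
%mend obs_check;
%obs_check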
I am trying to develop a recursive program to fill in missing string values using flat probabilities (for instance, if a variable has three possible values and one observation is missing, the missing observation would have a 33% chance of being replaced with each value).
Note: The purpose of this post is not to discuss the merit of imputation techniques.
DATA have;
INPUT id gender $ b $ c $ x;
CARDS;
1 M Y . 5
2 F N . 4
3 N Tall 4
4 M Short 2
5 F Y Tall 1
;
/* Counts number of categories i.e. 2 */
proc sql;
SELECT COUNT(Unique(gender)) into :rescats
FROM have
WHERE Gender ~= " " ;
Quit;
%let rescats = &rescats;
%put &rescats; /*internal check */
/* Collects response categories separated by commas i.e. F,M */
proc sql;
SELECT UNIQUE gender into :genders separated by ","
FROM have
WHERE Gender ~= " "
GROUP BY Gender;
QUIT;
%let genders = &genders;
%put &genders; /*internal check */
/* Counts entries to be evaluated. In this case observations 1 - 5 */
/* Note CustomerKey is an ID variable */
proc sql;
SELECT COUNT (UNIQUE(customerKey)) into :ID
FROM have
WHERE customerkey < 6;
QUIT;
%let ID = &ID;
%put &ID; /*internal check */
data want;
SET have;
DO i = 1 to &ID; /* Control works from 1 to 5 */
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 and 2 */
RandGender = (ROUND(u*(&rescats - 1)) + 1)*1;
/* PROBLEM Should if gender is missing set string value of M or F */
IF gender = ' ' THEN gender = SCAN(&genders, RandGender, ',');
END;
RUN;
The SCAN function does not assign an F or M value to gender; it also appears to create new M and F variables. Additionally, the DO loop creates additional entries within CustomerKey. Is there any way to get rid of these?
I would prefer to use loops and macros to solve this. I'm not yet proficient with arrays.
Here is my attempt at tidying this up a little:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
INPUT id gender $ b $ c $ x;
infile cards dlm=',';
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
/*Consolidated into 1 proc, added noprint and removed unnecessary group by*/
proc sql noprint;
/* Counts number of categories i.e. 2 */
SELECT COUNT(unique(gender)) into :rescats
FROM have
WHERE not(missing(Gender));
/* Collects response categories separated by commas i.e. F,M */
SELECT unique gender into :genders separated by ","
FROM have
WHERE not(missing(Gender))
;
Quit;
/*Removed redundant %let statements*/
%put rescats = &rescats; /*internal check */
%put genders = &genders; /*internal check */
/*Removed ID list code as it wasn't making any difference to the imputation in this example*/
data want;
SET have;
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 or 2 */
RandGender = ROUND(u*(&rescats - 1)) + 1;
IF missing(gender) THEN gender = SCAN("&genders", RandGender, ','); /*Added quotes around &genders to prevent SAS interpreting M and F as variable names*/
RUN;
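One small caution on the random index: with two categories ROUND(u*(&rescats - 1)) + 1 is uniform, but with three or more it gives the first and last categories half the probability of the middle ones. A CEIL version stays flat for any number of categories; a sketch of just that line, relying on RANUNI returning values strictly between 0 and 1:
RandGender = ceil(u * &rescats); /* each of 1..&rescats with probability 1/&rescats */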
Halo8:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
INPUT id gender $ b $ c $ x;
infile cards dlm=',';
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
run;
Tip: You can use a dot (.) to mean a missing value for a character variable during INPUT.
Tip: DATALINES is the modern alternative to CARDS.
Tip: Data values don't have to line up, but it helps humans.
Thus this works as well:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
INPUT id gender $ b $ c $ x;
DATALINES;
1 M Y . 5
2 F N . 4
3 . N Tall 4
4 M . Short 2
5 F Y Tall 1
;
run;
Tip: Your technique requires two passes over the data.
One to determine the distinct values.
A second to apply your imputation.
Most approaches require two passes per variable processed. A hash approach can keep it to two passes in total, but requires more memory.
There are many ways to determine distinct values: SORTING+FIRST., Proc FREQ, DATA step HASH, SQL, and more.
Tip: Solutions that move data into code and back into data are sometimes needed, but can be troublesome. Often the cleanest way is to let data remain data.
For example: INTO will be the wrong approach if the concatenated distinct values would require more than 64K characters.
Tip: Data to Code is especially troublesome for continuous values and other values that are not represented exactly the same when they become code.
For example: high precision numeric values, strings with control-characters, strings with embedded quotes, etc...
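As an illustration of letting data remain data, the distinct non-missing values can also be collected into a data set with PROC FREQ rather than packed into a macro variable. A small sketch only; the SQL version below is what the rest of this answer builds on:
proc freq data=have noprint;
  where not missing(gender);
  tables gender / out=REPLACEMENTS (keep=gender);
run;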
This is one approach using SQL. For real applications, Proc SURVEYSELECT is a far better tool.
Proc SQL;
Create table REPLACEMENTS as select distinct gender from have where gender is NOT NULL;
%let REPLACEMENT_COUNT = &SQLOBS; %* Tip: Take advantage of automatic macro variable SQLOBS;
data REPLACEMENTS;
set REPLACEMENTS;
rownum+1; * rownum needed for RANUNI matching;
run;
Proc SQL;
* Perform replacement of missing values;
Update have
set gender =
(
select gender
from REPLACEMENTS
where rownum = ceil(&REPLACEMENT_COUNT * ranuni(1234))
)
where gender is NULL
;
%let SYSLAST = have;
DM 'viewtable have' viewtable;
You don't have to be concerned about columns that have no missing values, because no replacement would occur in those. For a column that does have a missing value, the list of candidate REPLACEMENTS excludes the missing value, and REPLACEMENT_COUNT is correct for computing the uniform probability of replacement, 1/COUNT, coded as rownum = ceil(random).
Is there a more elegant way than the one presented below for the following task:
to create indicator variables (below "MAX_X1" and "MAX_X2") within each group (below "key1") of multiple observations (below "key2"), with value 1 if the observation corresponds to the maximum value of the variable in each group and 0 otherwise.
data have;
  call streaminit(4321);
  do key1=1 to 10;
    do key2=1 to 5;
      do x1=rand("uniform");
        x2=rand("Normal");
        output;
      end;
    end;
  end;
run;
proc means data=have noprint;
by key1;
var x1 x2;
output out=max
max= / autoname;
run;
data want;
merge have max;
by key1;
drop _:;
run;
proc sql;
title "MAX";
select name into :MAXvars separated by ' '
from dictionary.columns
WHERE LIBNAME="WORK" AND MEMNAME="WANT" AND NAME like '%_Max'
order by name;
quit;
title;
data want; set want;
array MAX (*) &MAXvars;
array XVars (*) x1 x2;
array Indicators (*) MAX_X1 MAX_X2;
do i=1 to dim(MAX);
if XVars[i]=MAX[i] then Indicators[i]=1; else Indicators[i]=0;
end;
drop i;
run;
Thanks for any optimization suggestions.
Proc sql can be used with a group by statement to allow summary functions across values of a variable; SAS SQL automatically remerges the group-level maximums back onto each row (you will see a remerge note in the log).
data have;
  call streaminit(4321);
  do key1=1 to 10;
    do key2=1 to 5;
      do x1=rand("uniform");
        x2=rand("Normal");
        output;
      end;
    end;
  end;
run;
proc sql;
create table want
as select
key1,
key2,
x1,
x2,
case
when x1 = max(x1) then 1
else 0 end as max_x1,
case
when x2 = max(x2) then 1
else 0 end as max_x2
from have
group by key1
order by key1, key2;
quit;
It is also possible to do this in a single data step, provided that you read the input dataset twice - this is an example of a double DOW-loop.
data have;
  call streaminit(4321);
  do key1=1 to 10;
    do key2=1 to 5;
      do x1=rand("uniform");
        x2=rand("Normal");
        output;
      end;
    end;
  end;
run;
/*Sort by key1 (or generate index) if not already sorted*/
proc sort data = have;
by key1;
run;
data want;
  if 0 then set have;
  array xvars[3,2] x1 x2 x1_max_flag x2_max_flag t_x1_max t_x2_max;
  /*1st DOW-loop*/
  do _n_ = 1 by 1 until(last.key1);
    set have;
    by key1;
    do i = 1 to 2;
      xvars[3,i] = max(xvars[1,i],xvars[3,i]);
    end;
  end;
  /*2nd DOW-loop*/
  do _n_ = 1 to _n_;
    set have;
    do i = 1 to 2;
      xvars[2,i] = (xvars[1,i] = xvars[3,i]);
    end;
    output;
  end;
  drop i t_:;
run;
This may be a bit complicated to understand, so here's a rough explanation of how it flows:
Read one BY group with the first DOW-loop, updating the rolling max variables as each row is read in. Don't output anything yet.
Then read the same BY group again using the second DOW-loop, checking whether each row equals the rolling max, and output each row.
Go back to the first DOW-loop, read the next BY group, and repeat.