I'm very new to SAS and I'm trying to find my way around it. I'm trying to figure out how to use the COMPARE procedure. Basically, I want to check whether the values in one column match the values in another column multiplied by 2, and count the number of mistakes. So if I have this data set:
a b
2 4
1 2
3 5
It should check whether b = 2 * a and tell me how many errors there are. I've been reading through the documentation for the COMPARE procedure but, like I said, I'm very new and I can't seem to figure out how to check for this.
You could do it with PROC COMPARE, but you would still need to compute 2*a first, and you can't do that within PROC COMPARE. I would create a FLAG and summarize it. The IFN function below returns 1 when the values are NOT equal and 0 otherwise. PROC MEANS then summarizes the flag: the mean is the proportion of non-matching rows and the sum is their count.
data comp;
input a b;
flag = ifn(b NE 2*a,1,0);
cards;
2 4
1 2
3 5
;;;;
run;
proc means data=comp n mean sum;
var flag;
run;
PROC COMPARE is mostly used to compare values in two different datasets, whereas your variables are both in one dataset. The following may be simplest:
data matches errors;
set temp; /* temp stands for your input data set */
if b = 2 * a then output matches;
else output errors;
run;
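For what it's worth, here is a minimal sketch that combines the two suggestions above. It assumes the sample data from the question; the variable name expected and the column alias n_errors are made up for illustration. When no COMPARE= data set is given, PROC COMPARE uses the VAR and WITH statements to compare variables within the same data set, so b can be checked against a precomputed 2*a. The PROC SQL step is an alternative that only counts the mismatches.

```sas
/* Sample data from the question, plus the value b is supposed to have */
data comp;
  input a b;
  expected = 2 * a;   /* hypothetical helper variable */
  cards;
2 4
1 2
3 5
;
run;

/* No COMPARE= data set: VAR and WITH compare two variables of the same data set */
proc compare base=comp;
  var b;
  with expected;
run;

/* Count mismatches directly: a true comparison evaluates to 1, a false one to 0 */
proc sql;
  select sum(b ne 2*a) as n_errors
  from comp;
quit;
```

For the three sample rows, only the last one (3 5) fails the check, so both steps should report a single mismatch.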
I have the following data:
data df;
input id $ d1 d2 d3;
datalines;
a . 2 3
b . . .
c 1 . 3
d . . .
;
run;
I want to apply some transformation/operation across a subset of columns. In this case, that means dropping all rows where columns prefixed with d are all missing/null.
Here's one way I accomplished this, taking heavy influence from this SO post.
First, sum all numeric columns, row-wise.
data df_total;
set df;
total = sum(of _numeric_);
run;
Next, drop all rows where total is missing/null.
data df_final;
set df_total;
where total is not missing;
run;
Which gives me the output I wanted:
a . 2 3
c 1 . 3
My issue, however, is that this approach assumes that there's only one "primary-key" column (id, in this case) and everything else is numeric and should be considered as a part of this sum(of _numeric_) is not missing logic.
In reality, I have a diverse array of other columns in the original dataset, df, and it's not feasible to simply drop all of them, writing all of that out. I know the columns for which I want to run this "test" all are prefixed with d (and more specifically, match the pattern d<mm><dd>).
How can I extend this approach to a particular subset of columns?
Use a different shortcut reference. Since you know the variable names all start with D, you can use the prefix list D: with either of these:
total = sum( of D:);
if n(of D:) = 0 then delete;
The first statement adds up the numeric variables that start with D. If there are variables starting with D that you want to exclude, that's problematic.
Since the variables are numeric, you can also use the N() function instead, which counts the non-missing values in the row. In general, though, SAS handles missing values automatically in most PROCs such as REG/GLM (though not in a data step, obviously).
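As a minimal, self-contained sketch of the N() variant, using the sample data from the question (the output name df_final is borrowed from the question):

```sas
data df;
  input id $ d1 d2 d3;
  datalines;
a . 2 3
b . . .
c 1 . 3
d . . .
;
run;

data df_final;
  set df;
  /* d: expands to every variable whose name starts with d;
     N() counts how many of them are non-missing on this row */
  if n(of d:) = 0 then delete;
run;
```

This keeps rows a and c, the same result as the two-step sum approach in the question.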
If that doesn't work for some reason, you can query the list of variables from the SASHELP.VCOLUMN view.
proc sql noprint;
/* names are stored in their original case, so compare them in upper case */
select name into :var_list separated by ", " from sashelp.vcolumn
where libname='WORK' and memname='DF' and upcase(name) like 'D%';
quit;
data df_final;
set df;
if n(&var_list.)=0 then delete;
run;
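If some variables start with D but should not take part in the test, one option is to filter the names more tightly when building the macro variable. The sketch below is only an illustration under assumptions: the column names d0101, d0102 and dummy are invented, and the regular expression encodes my reading of the d<mm><dd> pattern as "d followed by exactly four digits".

```sas
/* Hypothetical data: two d<mm><dd> columns plus an unrelated column starting with d */
data df;
  input id $ d0101 d0102 dummy;
  datalines;
a . 2 9
b . . 9
c 1 . 9
d . . 9
;
run;

/* Keep only numeric variables named d + four digits (case-insensitive) */
proc sql noprint;
  select name into :var_list separated by ' '
  from sashelp.vcolumn
  where libname = 'WORK' and memname = 'DF'
    and type = 'num'
    and prxmatch('/^d\d{4}$/i', strip(name)) > 0;
quit;

data df_final;
  set df;
  /* var_list is assumed to be non-empty; the step fails if no name matched */
  if n(of &var_list.) = 0 then delete;
run;
```

Here dummy is never missing, but it is left out of the list, so rows b and d are still deleted.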
I am trying to find a quick way to replace missing values with the average of the two nearest non-missing values. Example:
Id Amount
1 10
2 .
3 20
4 30
5 .
6 .
7 40
Desired output
Id Amount
1 10
2 **15**
3 20
4 30
5 **35**
6 **35**
7 40
Any suggestions? I tried using RETAIN, but I can only figure out how to retain the last non-missing value.
I think what you are looking for might be closer to interpolation. While this is not the mean of the two closest values, it might be useful. For the sample data, linear interpolation would fill ids 5 and 6 with roughly 33.3 and 36.7 rather than 35 and 35.
There is a nifty little tool for interpolating in datasets called PROC EXPAND. (It should do extrapolation as well, but I haven't tried that yet.) It's very handy when building series of dates and cumulative calculations.
data have;
input Id Amount;
datalines;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;
run;
proc expand data=have out=Expanded;
convert amount=amount_expanded / method=join;
id id; /*second is column name */
run;
For more on the proc expand see documentation: https://support.sas.com/documentation/onlinedoc/ets/132/expand.pdf
This works:
data have;
input id amount;
cards;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;
run;
proc sort data=have out=reversed;
by descending id;
run;
data retain_non_missing;
set reversed;
retain next_non_missing;
if amount ne . then next_non_missing = amount;
run;
proc sort data=retain_non_missing out=ordered;
by id;
run;
data final;
set ordered;
retain last_non_missing;
if amount ne . then last_non_missing = amount;
if amount = . then amount = (last_non_missing + next_non_missing) / 2;
run;
but as ever, will need extra error checking etc for production use.
The key idea is to sort the data into reverse order, allowing it to use RETAIN to carry the next_non_missing value back up the data set. When sorted back into the correct order, you then have enough information to interpolate the missing values.
There may well be a PROC to do this in a more controlled way (I don't know anything about PROC STANDARDIZE, mentioned in Reeza's comment) but this works as a data step solution.
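One example of the extra checking mentioned above: if the series starts or ends with a missing value, one of the two retained neighbours is itself missing and the plain average comes out missing. A possible tweak, sketched below on the assumption that repeating the single available neighbour is acceptable in that case, is to use the MEAN function, which ignores missing arguments. The sample data here adds a trailing missing row (id 8) purely for illustration.

```sas
data have;
  input id amount;
  cards;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
8 .
;
run;

/* Carry the next non-missing value backwards by processing in reverse order */
proc sort data=have out=reversed;
  by descending id;
run;

data with_next;
  set reversed;
  retain next_non_missing;
  if not missing(amount) then next_non_missing = amount;
run;

proc sort data=with_next out=ordered;
  by id;
run;

data final;
  set ordered;
  retain last_non_missing;
  if not missing(amount) then last_non_missing = amount;
  /* MEAN skips missing arguments, so a gap at either end of the
     data falls back to the one neighbour that exists */
  else amount = mean(last_non_missing, next_non_missing);
  drop last_non_missing next_non_missing;
run;
```

With this data, ids 2, 5 and 6 are filled with 15, 35 and 35 as before, and id 8 simply repeats 40.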
Here's an alternative requiring no sorting. It does require IDs to be sequential, though that can be worked around if they're not.
What it does is uses two set statements, one that gets the main (and previous) amounts, and one that sets until the next amount is found. Here I use the sequence of id variables to guarantee it will be the right record, but you could write this differently if needed (keeping track of what loop you're on) if the id variables aren't sequential or in an order of any sort.
I use the first.amount check to make sure we don't try to execute the second set statement more than we should (which would terminate early).
You need to do two things differently if you want first/last rows treated differently. Here I assume prev_amount is 0 if it's the first row, and I assume next_amount is missing for the last row, meaning the last row just gets the last prev_amount repeated, while the first row is averaged between 0 and the next_amount. You can treat either one differently if you choose; I don't know your data.
data have;
input Id Amount;
datalines;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;;;;
run;
data want;
set have;
by amount notsorted; *so we can tell if we have consecutive missings;
retain prev_amount; *next_amount is auto-retained;
if not missing(amount ) then prev_amount=amount;
else if _n_=1 then prev_amount=0; *or whatever you want to treat the first row as;
else if first.amount then do;
do until ((next_id > id and not missing(next_amount)) or (eof));
set have(rename=(id=next_id amount=next_amount)) end=eof;
end;
amount = mean(prev_amount,next_amount);
end;
else amount = mean(prev_amount,next_amount);
run;
I'd like the sum of the rows of a data set. In particular, I would like to sum from the second element to the last element (skipping the first entry).
How can I achieve this?
It sounds like you want to add up everything except the first column. You also don't know how many variables you have, and it may change over time.
There may be a smarter way to do this, but here are 3 options.
If your ID value is stored as text while everything else is a number, then it is trivial to say:
data sum;
set test;
sum = sum(of _numeric_);
run;
which will simply add up all numeric variables. However it sounds like you have integer IDs, so perhaps one of these options would work. First, some sample data:
data test;
input id var1 var2 var3;
cards;
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
;
run;
Option 1 - Simply add up all of the numeric variables, and then subtract your ID value, this leaves you with the sum of everything except the ID:
data test2;
set test;
sum=sum(of _numeric_)-id;
run;
Option 2 - You can tell SAS to operate over a range of variables in the order they are listed in the dataset. You could just do sum = sum(of var1--var3); (the OF keyword is required, otherwise the expression is read as var1 minus negative var3). However, you might not know what the first and last variables are. There's also a possibility that your ID variable is in the middle somewhere.
A solution to this would be to make sure your ID variable is first, and then create dummy variables before and after the range of variables you want to sum:
data test3;
format id START_SUM;
set test;
END_SUM = .;
sum = sum(of START_SUM--END_SUM);
drop START_SUM END_SUM;
run;
This creates ID and START_SUM before setting your data, and then creates the empty END_SUM at the end of your data. It then sums everything from START_SUM to END_SUM, and because sum(of ...) skips over missing values, you only get the sum of the variables you actually care about. Then you drop your dummy variables as they are no longer necessary.
Option 1 is obviously simpler, but Option 2 has some potential benefits in that it works with both numeric and non-numeric IDs, and has no chance of being subject to any sorts of weird rounding issues when you add and subtract the ID (although that won't happen if everything is an integer).
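Another possibility, if "skip the first entry" really means "skip the first numeric variable, whatever it is called", is to loop over an array of the numeric variables starting at index 2. This is only a sketch, and it assumes the variable to skip comes first in the data set; the names nums, total and i are made up.

```sas
data test;
  input id var1 var2 var3;
  cards;
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
;
run;

data test4;
  set test;
  /* _numeric_ covers only the numeric variables defined so far
     (id var1 var2 var3); total and i are created later in the step */
  array nums {*} _numeric_;
  total = 0;
  do i = 2 to dim(nums);          /* start at 2 to skip the first variable */
    total = sum(total, nums{i});  /* SUM ignores missing values */
  end;
  drop i;
run;
```

For the sample data this gives totals of 6, 9, 12 and 15.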
Say that my data set has quite a lot of missing/invalid values and I would like to remove (or drop) the entire variable (or column) if it contains too many invalid values.
Take the following example, the variable 'gender' has quite a lot of "#N/A"s. I would like to remove that variable if a certain percentage of the data points in there are "#N/A"s, say more than 50%, more than 30%.
In addition, I would like to make the percentage a configurable value, i.e., I am willing to remove the entire variable if more than x% of the observations under that variable are "#N/A". And I also want to be able to define what an invalid value is, could be "#N/A", could be "Invalid Value", could be " ", could be anything else that I pre-define.
data dat;
input id score gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
Please make the solution as generalized as possible. For example, if the real data set contains thousands of variables, I need to be able to loop through all those variables instead of referencing their variable names one by one. Furthermore, the data set could contain more than just "#N/A" as bad values, other things like ".", "Invalid Obs", "N.A." could also exist at the same time.
PS: Actually I thought of a way to make this problem easier. We could probably read in all the data points as numerical values, so that all the "#N/A", "N.A.", " " stuff get turned into ".", which makes the drop criterion easier. Hope that helps you solve this problem for me ...
Update: below is the code I am working on. Got stuck at the last block.
data dat;
input id $ score $ gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
proc contents data=dat out=test0(keep=name type) noprint;
/*A DATA step is used to subset the test0 data set to keep only the character */
/*variables and exclude the one ID character variable. A new list of numeric*/
/*variable names is created from the character variable name with a "_n" */
/*appended to the end of each name. */
data test0;
set test0;
if type=2;
newname=trim(left(name))||"_n";
/*The macro system option SYMBOLGEN is set to be able to see what the macro*/
/*variables resolved to in the SAS log. */
options symbolgen;
/*PROC SQL is used to create three macro variables with the INTO clause. One */
/*macro variable named c_list will contain a list of each character variable */
/*separated by a blank space. The next macro variable named n_list will */
/*contain a list of each new numeric variable separated by a blank space. The */
/*last macro variable named renam_list will contain a list of each new numeric */
/*variable and each character variable separated by an equal sign to be used on*/
/*the RENAME statement. */
proc sql noprint;
select trim(left(name)), trim(left(newname)),
trim(left(newname))||'='||trim(left(name))
into :c_list separated by ' ', :n_list separated by ' ',
:renam_list separated by ' '
from test0;
quit;
/*The DATA step is used to convert the character values to numeric. An ARRAY */
/*statement is used for the list of character variables and another ARRAY for */
/*the list of numeric variables. A DO loop is used to process each variable */
/*to convert the value from character to numeric with the INPUT function. The */
/*DROP statement is used to prevent the character variables from being written */
/*to the output data set, and the RENAME statement is used to rename the new */
/*numeric variable names back to the original character variable names. */
data test2;
set dat;
array ch(*) $ &c_list;
array nu(*) &n_list;
do i = 1 to dim(ch);
nu(i)=input(ch(i),8.);
end;
drop i &c_list;
rename &renam_list;
run;
data test3;
set test2;
array myVars(*) &c_list;
countTotal=1;
do i = 1 to dim(myVars);
myCounter = count(.,myVars(i));
/* if sum(countMissing)/sum(countTotal) lt 0.5 then drop VNAME(myVars(i)); */
end;
run;
The problem, and where I got stuck, is that I am not able to drop the variables that I want to drop. The reason is that I do not want to use the variable names in the DROP statement. Instead, I want it done in a loop where I can reference the variable names with the loop index "i". I tried to use the array "myVars(i)" but it doesn't seem to work with the DROP statement.
My understanding is that SAS processes drop statements during data step compilation, i.e. before it looks at any of the data from any input datasets. Therefore, you cannot use the vname function like that to select variables to drop, as it doesn't evaluate the variable names until the data step has finished compiling and has moved on to execution.
You will need to output a temporary dataset or view containing all your variables, including the ones you don't want, build up a list of variables that you want to drop, in a macro variable, then drop them in a subsequent data step.
Refer to this paper and page 3 in particular for more details of which things run during compilation rather than execution:
http://www.lexjansen.com/nesug/nesug11/ds/ds04.pdf
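A rough sketch of that two-pass idea, following the suggestion in the question's PS to treat everything as numeric so that bad values become missing. The macro variable names max_bad_pct and droplist, and the data set names dat_num, stats, long and dat_fixed, are all made up for illustration, and only gender is converted here to keep the example short.

```sas
%let max_bad_pct = 50;   /* hypothetical threshold: drop a variable above this % missing */
%let droplist = ;        /* stays empty if no variable qualifies */

/* Read gender as text, then convert; ?? suppresses the invalid-data notes,
   so "#N/A" simply becomes a numeric missing value */
data dat_num;
  input id score gender_c $;
  gender = input(gender_c, ?? 8.);
  drop gender_c;
  cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;

/* One row holding the number of missing values of every numeric variable */
proc means data=dat_num noprint;
  var _numeric_;
  output out=stats(drop=_type_) nmiss=;
run;

/* One row per variable: _NAME_ is the variable name, nmiss its missing count */
proc transpose data=stats out=long(rename=(col1=nmiss));
  by _freq_;
run;

/* Collect the names of the variables that exceed the threshold */
proc sql noprint;
  select _name_ into :droplist separated by ' '
  from long
  where 100 * nmiss / _freq_ > &max_bad_pct;
quit;

/* Drop them in a separate step, where the list is already known at compile time.
   Assumes at least one variable qualified; otherwise skip this step. */
data dat_fixed;
  set dat_num;
  drop &droplist;
run;
```

In the sample, gender is missing in 7 of 11 rows (about 64%), so it is the only variable that ends up in droplist.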
In general, you'll find this sort of thing simplified using built in procs - this is SAS's bread and butter. You just need to restate the question.
What you want is to drop variables with a % of missing/bad data higher than 50%, so you need a frequency table of variables, right?
So - use PROC FREQ. This is the simplified version (only looks for "#N/A"), but it should be easy to modify the last step to make it look for other values (and to sum up the percents for them). Or, like you'll see in the linked question (from my comment on the question), you can use a special format that puts all invalid values to one formatted value, and all valid values to another formatted value. (You'll have to construct this format.)
Concept: use PROC FREQ to get a frequency table, then look at that dataset to find the rows that have an invalid value in the F_ column and a percent above 50.
This won't work with actual missing values (" " or .); you'll need to add the MISSING option to the TABLES statement in PROC FREQ if you have those also.
data dat;
input id $ score $ gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
*shut off ODS for the moment, and only use ODS OUTPUT, so we do not get a mess in our results window;
ods exclude all;
ods output onewayfreqs=freq_tables;
proc freq data=dat;
tables id score gender;
run;
ods output close;
ods exclude none;
*now we check for variables that match our criteria;
data has_missing;
set freq_tables;
if coalescec(of f_:) ='#N/A' and percent>50;
varname = substr(table,7);
run;
*now we put those into a macro variable to drop;
proc sql;
select varname
into :droplist separated by ' '
from has_missing;
quit;
*and we drop them;
data dat_fixed;
set dat;
drop &droplist.;
run;
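The format idea mentioned above could look something like the sketch below. The format name $badfmt, the list of bad values and the macro variable max_bad_pct are my own placeholders; adjust them to whatever you define as invalid. Every invalid value is formatted to 'BAD', so the frequency table adds their percentages together and the check only has to look for that one formatted value.

```sas
%let max_bad_pct = 50;   /* hypothetical threshold */

data dat;
  input id $ score $ gender $;
  cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;

/* Map every value you consider invalid to one formatted value */
proc format;
  value $badfmt
    '#N/A', 'N.A.', 'Invalid Obs', '.', ' ' = 'BAD'
    other = 'OK';
run;

ods exclude all;
ods output onewayfreqs=freq_tables;
proc freq data=dat;
  tables _character_ / missing;   /* MISSING keeps blank values in the counts */
  format _character_ $badfmt.;
run;
ods exclude none;

/* The F_ columns hold the formatted values, so one test covers all bad values */
data has_bad;
  set freq_tables;
  if coalescec(of f_:) = 'BAD' and percent > &max_bad_pct;
  varname = substr(table, 7);
run;
```

The has_bad data set can then feed the same PROC SQL and DROP steps as above.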
I have a data set where each observation is a combination of binary indicator variables, but not necessarily all possible combinations. I'd like to eliminate observations that are subsets of other observations. As an example, suppose I had these three observations:
var1 var2 var3 var4
0 0 1 1
1 0 0 1
0 1 1 1
In this case, I would want to eliminate observation 1, because it's a subset of observation 3. Observation 2 isn't a subset of anything else, so my output data set should contain observations 2 and 3.
Is there an elegant and preferably fast way to do this in SAS? My best solution thus far is a brute force loop through the data set using a second set statement with the point option to see if the current observation is a subset of any others, but these data sets could become huge once I start working with a lot of variables, so I'm hoping to find a better way.
First off, one consideration: is it possible for one row to have 1 for all indicators? You should check for that first - if one row does have all 1s, then it will always be the unique solution.
POINT= is inefficient, but loading into a hash table isn't a terribly bad way to do it. Just load up a hash table with a string of the binary indicators CATted together, and then search that table.
First, use PROC SORT NODUPKEY to eliminate the exact matches. Unless you have a very large number of indicator variables, this will eliminate many rows.
Then, sort it in an order where the more "complicated" rows are at the top, and the less complicated at the bottom. This might be as simple as making a variable which is the sum of binary indicators and sort by that descending; or if your data suggests, might be sorting by a particular order of indicators (if some are more likely to be present). The purpose of this is to reduce the number of times we search; if the likely matches are on top, we will leave the loop faster.
Finally, use a hash iterator to search the list, in descending order by the indicators variable, for any matches.
See below for a partially-tested example. I didn't verify that it eliminated every valid elimination, but it eliminates around half of the rows, which sounds reasonable.
data have;
array vars var1-var20;
do _u = 1 to 1e4;
do _t = 1 to dim(Vars);
vars[_t] = round(ranuni(7),1);
end;
complexity = sum(of vars[*]);
indicators = cats(of vars[*]);
output;
end;
drop _:;
run;
proc sort nodupkey data=have;
by indicators;
run;
proc sort data=have;
by descending complexity;
run;
data want;
if _n_ = 1 then do;
format indicators $20.;
call missing(indicators, complexity);
declare hash indic(dataset:'have', ordered:'d');
indic.defineKey('indicators');
indic.defineData('complexity','indicators');
indic.defineDone();
declare hiter inditer('indic');
end;
set have(drop=indicators rename=complexity=thisrow_complex); *assuming have has a variable, "indicators", like "0011001";
array vars var1-var20;
rc=inditer.first();
rowcounter=1;
do while (rc=0 and complexity ge thisrow_complex);
do _t = 1 to dim(vars);
if vars[_t]=1 and char(indicators,_t) ne '1' then leave;
end;
if _t gt dim(Vars) then delete;
else rc=inditer.next();
rowcounter=rowcounter+1;
end;
run;
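Since the example above is only partially tested, a brute-force cross-check on a small sample can be handy. The sketch below is not meant to be fast (it compares every pair of rows), and the table names have_small and want_check are invented; it just spells out the subset rule directly in PROC SQL using the three observations from the question.

```sas
data have_small;
  input var1-var4;
  datalines;
0 0 1 1
1 0 0 1
0 1 1 1
;
run;

proc sql;
  create table want_check as
  select distinct a.*
  from have_small as a
  where not exists (
    select *
    from have_small as b
    /* b covers a: every indicator set in a is also set in b ... */
    where b.var1 >= a.var1 and b.var2 >= a.var2
      and b.var3 >= a.var3 and b.var4 >= a.var4
      /* ... and b has at least one extra indicator, so it is a different pattern */
      and sum(b.var1, b.var2, b.var3, b.var4) > sum(a.var1, a.var2, a.var3, a.var4)
  );
quit;
```

On the three sample rows this keeps 1 0 0 1 and 0 1 1 1 and removes 0 0 1 1, which matches the expected output in the question. Comparing its result with the hash version on a few thousand rows would be one way to gain confidence in the faster code.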
I'm pretty sure there is probably a more math-oriented way of doing this, but for now this is what I can think of. Proceed with caution, as I only checked a small number of test cases.
My pseudo-algorithm:
(Bit pattern= concatenation of all binary variables into a string.)
Get a unique list of binary (bit) patterns sorted in ascending order (this is what the PROC SQL step is doing).
Set up two arrays. 1 array to track the variables (var1 to var4) on the current row and another array to track lag(of var1 to var4).
If a bit pattern is a subset of another bit pattern (lagged version) then "dot product vector multiplication" should give back the lagged bit pattern.
If bit pattern = lagged_bit_pattern then flag that pattern to be excluded.
In the last data step you will get the list of bit patterns you need to exclude. NOTE: this approach does not take care of duplicate patterns such as the following:
record1: 1 0 0 1
record2: 1 0 0 1
which can be easily excluded via PROC SORT & NODUPKEY.
/*sample data*/
data sample_data;
input id var1-var4;
bit_pattern = compress(catx('',var1,var2,var3,var4));
datalines;
1 0 0 1 1
2 1 0 0 1
3 0 1 1 1
4 0 0 0 1
5 1 1 1 0
;
run;
/*in the above example, 0001 0011 need to be eliminated. These will be highlighted in the last datastep*/
/*get unique combination of patterns in the dataset*/
proc sql ;
create table all_poss_patterns as
select
var1,var2, var3,var4,
count(*) as freq
from sample_data
group by var1,var2, var3,var4
order by var1,var2, var3,var4;
quit;
data patterns_to_exclude;
set all_poss_patterns;
by var1-var4;
length lagged1-lagged4 8;
array first_array{*} var1-var4;
array lagged{*}lagged1-lagged4;
length bit_pattern $32.;
length lagged_bit_pattern $32.;
bit_pattern = '';
lagged_bit_pattern='';
do i = 1 to dim(first_array);
lagged{i}=lag(first_array{i});
end;
do i = 1 to dim(first_array);
bit_pattern=cats("", bit_pattern,lagged{i}*first_array{i});
lagged_bit_pattern=cats("",lagged_bit_pattern,lagged{i});
end;
if bit_pattern=lagged_bit_pattern then exclude_pattern=1;
else exclude_pattern=0;
/*uncomment the following two lines to just keep the patterns that need to be excluded*/
/*if bit_pattern ne '....' and exclude_pattern=1;*/ /*note the bit_pattern ne '....' the no. of dots should equal no. of binary vars*/
/*keep bit_pattern;*/
run;