Consider the following example:
/* Create two not too interesting datasets: */
Data ones (keep = A);
Do i = 1 to 3;
A = 1;
output;
End;
run;
Data numbers;
Do B = 1 to 5;
output;
End;
Run;
/* The interesting step: */
Data together;
Set ones numbers;
if B = 2 then A = 2;
run;
So dataset ones contains one variable (A) with 3 observations, all ones, and dataset numbers contains one variable (B) with 5 observations: the numbers 1 to 5.
I expect the resulting dataset together to have two columns (A and B) and the A column to read (vertically) 1, 1, 1, . , 2, . , . , .
However, when executing the code I find that column A reads 1, 1, 1, . , 2, 2, 2, 2
Apparently the 2 created in the fifth observation is retained all the way down for no apparent reason. What is going on here?
(For the sake of completeness: when I split the last data step into two as below:
Data together;
set ones numbers;
run;
Data together;
set together;
if B = 2 then A = 2;
run;
it does do what I expect.)
Yes, any variable that comes in through a SET, MERGE, or UPDATE statement is automatically retained (not set to missing at the top of the data step loop). You can effectively undo that with
output;
call missing(of <list of variables to clear out>);
run;
at the end of your data step.
This is how MERGE works for many-to-one merges, by the way, and the reason that many-to-many merges don't usually work the way you want them to.
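Applied to the original step, a minimal sketch of that fix might look like this (clearing only A, since A is the variable being retained across iterations):

/* Sketch: explicit OUTPUT, then reset A so the retained value
   does not leak into the following iterations. */
data together;
    set ones numbers;
    if B = 2 then A = 2;
    output;
    call missing(A);
run;

With this version, column A should read 1, 1, 1, ., 2, ., ., . as originally expected, because A is re-missing at the top of every iteration in which SET does not supply a fresh value.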
The difference between the 'together' and the 'separate' cases is that in the together case you are reading from two data sets with different variables, while in the separate case the second step reads a single data set that already contains both variables. If you are running this in interactive mode, i.e. the SAS Program Editor or Enhanced Editor (not EG or batch mode), you can use the data step debugger to see this a little more clearly. You would see the following:
At the end of the last row of the ones dataset:
i A B
3 1 .
Notice B exists, but is missing. Then it goes back to the top of the data step loop. All three variables are left alone since they're all from the data sets. Then it attempts to read from ones one more time, which generates:
i A B
. . .
Then it realizes it cannot read from ones, and starts to read from numbers. At the end of the first row of the numbers dataset:
i A B
. . 1
Then it goes to the top, again changes nothing; then it reads in a 2 for B.
i A B
. . 2
Then it sets A to 2, per your program:
i A B
. 2 2
Then it returns to the start of the data step loop again.
i A B
. 2 2
Then it reads in B=3:
i A B
. 2 3
Then it continues looping, for B=4, 5.
Now, compare that to the two-step version, where the second data step reads a single dataset containing all three variables. It will be nearly the same (with a small difference at the switch between datasets that does not change the result). Now we get to the step where A=2 and B=2:
i A B
. 2 2
Now when the data step reads in the next row, it has all three variables on it. So it yields:
i A B
. . 3
Since it read A=. in from that row, it overwrites the retained 2 with missing. In the one-data-step version, there was no A value on the incoming row to read, so the 2 was not replaced.
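If the interactive debugger isn't available, a PUT _ALL_ sketch of the same one-step program shows the retention in the log (this just dumps the PDV, including the automatic variables _N_ and _ERROR_):

/* Sketch: dump the PDV to the log at the top of each iteration
   and again just before the implicit output. */
data together;
    put 'top of iteration: ' _all_;
    set ones numbers;
    if B = 2 then A = 2;
    put 'before output:    ' _all_;
run;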
I have the following data:
data df;
input id $ d1 d2 d3;
datalines;
a . 2 3
b . . .
c 1 . 3
d . . .
;
run;
I want to apply some transformation/operation across a subset of columns. In this case, that means dropping all rows where columns prefixed with d are all missing/null.
Here's one way I accomplished this, taking heavy influence from this SO post.
First, sum all numeric columns, row-wise.
data df_total;
set df;
total = sum(of _numeric_);
run;
Next, drop all rows where total is missing/null.
data df_final;
set df_total;
where total is not missing;
run;
Which gives me the output I wanted:
a . 2 3
c 1 . 3
My issue, however, is that this approach assumes that there's only one "primary-key" column (id, in this case) and that everything else is numeric and should be considered part of the sum(of _numeric_) is not missing logic.
In reality, I have a diverse array of other columns in the original dataset, df, and it's not feasible to simply drop all of them, writing all of that out. I know the columns for which I want to run this "test" all are prefixed with d (and more specifically, match the pattern d<mm><dd>).
How can I extend this approach to a particular subset of columns?
Use a different shortcut reference, since you know the names all start with D:
total = sum(of D:);
if n(of D:) = 0 then delete;
This will include every numeric variable whose name starts with D. If you have variables that start with D but should be excluded, that's problematic.
Since the variables are numeric, you can also use the N() function, which counts the non-missing values in a row. In general, though, SAS will do this automatically for most PROCs such as REG/GLM (not in a data step, obviously).
If that doesn't work for some reason, you can query the list of variables from the SASHELP.VCOLUMN view.
proc sql noprint;
select name into :var_list separated by ", " from sashelp.vcolumn
where libname='WORK' and memname='DF' and name like 'D%';
quit;
data df_final;
set df;
if n(&var_list.)=0 then delete;
run;
Can someone explain this code to me in depth? I have a list of comments in the code where I am confused. Is there any way I can attach a CSV of the data? Thanks in advance.
data have;
infile "&sasforum.\datasets\Returns.csv" firstobs=2 dsd truncover;
input DATE :mmddyy10. A B B_changed;
format date yymmdd10.;
run;
data spread;
do nb = 1 by 1 until(not missing(B));
set have;
end;
br = B;
do i = 1 to nb;
set have; *** I don't get how you can do i = 1 to nb with set have. There is no variable nb on have; the variable nb is read into the dataset spread;
if nb > 1 then B_spread = (1+br)**(1/nb) - 1;
else B_spread = B;
output;
end;
drop nb i br;
run;
***** If I comment out "drop nb i br", I get to see that nb takes a value of 2 for the null values of B. I don't get how this is done or possible, because if I stop the code right after the line "br = B" and put an output statement in the first do loop, I clearly see that nb takes a value of one for null values of B. Honestly, it is like the first do loop reads in future observations of B as br. Can you please explain this to me? The second dataset "bunch" seems to follow the same principles as the first, so I imagine that if I get a grasp on how the dataset spread is created, I will understand how bunch is created.;
This is an advanced DATA step programming technique, commonly referred to as a DoW loop. If you search lexjansen.com for DoW, you will find helpful papers like http://support.sas.com/resources/papers/proceedings09/038-2009.pdf. The DoW loop codes an explicit loop around a SET statement. This is actually a "Double-DoW loop", because you have two explicit loops.
I made some sample data, and added some PUT statements to your code:
data have ;
input B ;
cards ;
.
.
1
2
.
.
.
3
;
data spread;
do nb = 1 by 1 until(not missing(B));
set have;
put _n_= "top do-loop " (nb B)(=) ;
end;
br = B;
do i = 1 to nb;
set have;
if nb > 1 then B_spread = (1+br)**(1/nb) - 1;
else B_spread = B;
output;
put _n_= "bottom do-loop " (nb B br B_spread)(=) ;
end;
drop nb i br;
run;
With that sample data, on the first iteration of the DATA step (_N_=1), the top DO loop will iterate three times, reading the first three records of HAVE. At that point, (not missing(B)) will be true, and the loop will not iterate again. The variable NB will have a value of 3. The bottom loop will then iterate 3 times, because NB has a value of 3. It will also read the first three records of HAVE. It will compute B_spread, and output each record.
On the second iteration of the DATA step, the top DO loop will iterate only once. It will read the 4th record, with B=2. The bottom loop will iterate once, reading the 4th record, computing B_spread, and output.
On the third iteration of the DATA step, the top DO loop will iterate four times, reading the 5th through 8th records. The bottom loop will also iterate four times, reading the 5th through 8th records, computing B_spread, and output.
On the fourth iteration of the DATA step, the step will complete, because the SET statement in the top loop will read the End Of File mark.
The core concept of a Double-DoW loop is that typically you are reading the data in groups. Often groups are identified by an ID. Here they are defined by sequential records read until not missing(B). The top DO-loop reads the first group of records, and computes some value (in this case, it computes NB, the number of records in the group). Then the bottom DO-loop reads the first group of records, and computes some new value, using the value computed in top DO-loop. In this case, the bottom DO-loop computes B_spread, using NB.
I'd like the sum of the rows of a data set. In particular, I would like to sum from the second element to the last element (skipping the first entry).
How can I achieve this?
It sounds like you want to add up everything except the first column. You also don't know how many variables you have, and that may change over time.
There may be a smarter way to do this, but here are 3 options.
If your ID value is stored as text while everything else is a number, then it is trivial to say:
data sum;
set test;
sum = sum(of _numeric_);
run;
which will simply add up all numeric variables. However it sounds like you have integer IDs, so perhaps one of these options would work. First, some sample data:
data test;
input id var1 var2 var3;
cards;
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
;
run;
Option 1 - Simply add up all of the numeric variables, and then subtract your ID value, this leaves you with the sum of everything except the ID:
data test2;
set test;
sum=sum(of _numeric_)-id;
run;
Option 2 - You can tell SAS to operate over a range of variables in the order they are listed in the dataset. You could just do sum = sum(of var1--var3);, however you might not know what the first and last variables are. There's also a possibility that your ID variable is in the middle somewhere.
A solution to this would be to make sure your ID variable is first, and then create dummy variables before and after the range of variables you want to sum:
data test3;
format id START_SUM;
set test;
END_SUM = .;
sum = sum(of START_SUM--END_SUM);
drop START_SUM END_SUM;
run;
This creates ID and START_SUM before setting your data, and then creates the empty END_SUM at the end of your data. It then sums everything from START_SUM to END_SUM, and because sum(of ...) skips over missing values, you only get the sum of the variables you actually care about. Then you drop your dummy variables as they are no longer necessary.
Option 1 is obviously simpler, but Option 2 has some potential benefits in that it works with both numeric and non-numeric IDs, and has no chance of being subject to any sorts of weird rounding issues when you add and subtract the ID (although that won't happen if everything is an integer).
I'm very new to SAS and I'm trying to figure out my way around using it. I'm trying to figure out how to use the COMPARE procedure. Basically, what I want to do is see whether the values in one column match the values in another column multiplied by 2, and count the number of mistakes. So if I have this data set:
a b
2 4
1 2
3 5
It should check whether b = 2 * a and tell me how many errors there are. I've been reading through the documentation for the COMPARE procedure, but like I said, I'm very new and I can't seem to figure out how to check for this.
You could do it with PROC COMPARE, but you would still need to compute 2*a first, and you can't do that within PROC COMPARE. I would create a flag and summarize the flag. Here the IFN function returns 1 when the values are NOT equal. PROC MEANS then counts the 1s: the mean is the percent of mismatches and the sum is the count of non-matching rows.
data comp;
input a b;
flag = ifn(b NE 2*a,1,0);
cards;
2 4
1 2
3 5
;;;;
run;
proc means n mean sum;
var flag;
run;
PROC COMPARE compares values in two different datasets, whereas your variables are both in one dataset. The following may be simplest:
data matches errors;
set temp;
if b = 2 * a then output matches;
else output errors;
run;
Say that my data set has quite a lot of missing/invalid values and I would like to remove (or drop) the entire variable (or column) if it contains too many invalid values.
Take the following example, the variable 'gender' has quite a lot of "#N/A"s. I would like to remove that variable if a certain percentage of the data points in there are "#N/A"s, say more than 50%, more than 30%.
In addition, I would like to make the percentage a configurable value, i.e., I am willing to remove the entire variable if more than x% of the observations under that variable are "#N/A". And I also want to be able to define what an invalid value is, could be "#N/A", could be "Invalid Value", could be " ", could be anything else that I pre-define.
data dat;
input id score gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
Please make the solution as generalized as possible. For example, if the real data set contains thousands of variables, I need to be able to loop through all those variables instead of referencing their variable names one by one. Furthermore, the data set could contain more than just "#N/A" as bad values, other things like ".", "Invalid Obs", "N.A." could also exist at the same time.
PS: Actually I thought of a way to make this problem easier. We could probably read in all the data points as numerical values, so that all the "#N/A", "N.A.", " " stuff get turned into ".", which makes the drop criterion easier. Hope that helps you solve this problem for me ...
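That idea can be sketched with the ?? modifier on the INPUT function, which turns anything unreadable as a number ("#N/A", "N.A.", blanks, ...) into an ordinary missing value without writing errors to the log; the dataset and variable names here just follow the example above:

/* Sketch: re-read one character variable as numeric; every bad
   value collapses to '.', which is easy to count later. */
data dat_numeric;
    set dat;
    gender_n = input(gender, ?? best12.);
    drop gender;
    rename gender_n = gender;
run;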
Update: below is the code I am working on. Got stuck at the last block.
data dat;
input id $ score $ gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
proc contents data=dat out=test0(keep=name type) noprint;
/*A DATA step is used to subset the test0 data set to keep only the character */
/*variables and exclude the one ID character variable. A new list of numeric*/
/*variable names is created from the character variable name with a "_n" */
/*appended to the end of each name. */
data test0;
set test0;
if type=2;
newname=trim(left(name))||"_n";
run;
/*The macro system option SYMBOLGEN is set to be able to see what the macro*/
/*variables resolved to in the SAS log. */
options symbolgen;
/*PROC SQL is used to create three macro variables with the INTO clause. One */
/*macro variable named c_list will contain a list of each character variable */
/*separated by a blank space. The next macro variable named n_list will */
/*contain a list of each new numeric variable separated by a blank space. The */
/*last macro variable named renam_list will contain a list of each new numeric */
/*variable and each character variable separated by an equal sign to be used on*/
/*the RENAME statement. */
proc sql noprint;
select trim(left(name)), trim(left(newname)),
trim(left(newname))||'='||trim(left(name))
into :c_list separated by ' ', :n_list separated by ' ',
:renam_list separated by ' '
from test0;
quit;
/*The DATA step is used to convert the character values to numeric. An ARRAY */
/*statement is used for the list of character variables and another ARRAY for */
/*the list of numeric variables. A DO loop is used to process each variable */
/*to convert the value from character to numeric with the INPUT function. The */
/*DROP statement is used to prevent the character variables from being written */
/*to the output data set, and the RENAME statement is used to rename the new */
/*numeric variable names back to the original character variable names. */
data test2;
set dat;
array ch(*) $ &c_list;
array nu(*) &n_list;
do i = 1 to dim(ch);
nu(i)=input(ch(i),8.);
end;
drop i &c_list;
rename &renam_list;
run;
data test3;
set test2;
array myVars(*) &c_list;
countTotal=1;
do i = 1 to dim(myVars);
myCounter = count(.,myVars(i));
/* if sum(countMissing)/sum(countTotal) lt 0.5 then drop VNAME(myVars(i)); */
end;
run;
The problem, and where I got stuck, is that I am not able to drop the variables that I want to drop, because I do not want to hard-code the variable names in the DROP statement. Instead, I want it done in a loop where I can reference the variable names through the index "i". I tried to use the array reference "myVars(i)", but that doesn't seem to work with DROP.
My understanding is that SAS processes drop statements during data step compilation, i.e. before it looks at any of the data from any input datasets. Therefore, you cannot use the vname function like that to select variables to drop, as it doesn't evaluate the variable names until the data step has finished compiling and has moved on to execution.
You will need to output a temporary dataset or view containing all your variables, including the ones you don't want, build up a list of variables that you want to drop, in a macro variable, then drop them in a subsequent data step.
Refer to this paper and page 3 in particular for more details of which things run during compilation rather than execution:
http://www.lexjansen.com/nesug/nesug11/ds/ds04.pdf
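A minimal sketch of that two-step pattern, with illustrative names (the WHERE clause here is just a stand-in for whatever your real "too many bad values" test computes):

/* Sketch: build the drop list at run time in a macro variable,
   then apply it in a second, separately-compiled data step. */
proc sql noprint;
    select name into :drop_list separated by ' '
    from dictionary.columns
    where libname = 'WORK' and memname = 'TEST2'
      and name like 'GENDER%';   /* stand-in for the real criterion */
quit;

data want;
    set test2;
    drop &drop_list;
run;

Because the second data step is compiled after the macro variable has been filled in, the DROP statement sees an ordinary list of names at compile time.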
In general, you'll find this sort of thing simplified using built in procs - this is SAS's bread and butter. You just need to restate the question.
What you want is to drop variables with a % of missing/bad data higher than 50%, so you need a frequency table of variables, right?
So - use PROC FREQ. This is the simplified version (only looks for "#N/A"), but it should be easy to modify the last step to make it look for other values (and to sum up the percents for them). Or, like you'll see in the linked question (from my comment on the question), you can use a special format that puts all invalid values to one formatted value, and all valid values to another formatted value. (You'll have to construct this format.)
Concept: use PROC FREQ to get the frequency table, then look at that dataset for rows with Percent > 50 and an invalid value in the F_ column.
This won't work with actual missing values (" " or .); you'll need to add the /MISSING option to PROC FREQ if you have those as well.
data dat;
input id $ score $ gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
*shut off ODS for the moment, and only use ODS OUTPUT, so we do not get a mess in our results window;
ods exclude all;
ods output onewayfreqs=freq_tables;
proc freq data=dat;
tables id score gender;
run;
ods output close;
ods exclude none;
*now we check for variables that match our criteria;
data has_missing;
set freq_tables;
if coalescec(of f_:) ='#N/A' and percent>50;
varname = substr(table,7);
run;
*now we put those into a macro variable to drop;
proc sql;
select varname
into :droplist separated by ' '
from has_missing;
quit;
*and we drop them;
data dat_fixed;
set dat;
drop &droplist.;
run;