Here is the sample data.
data faminc;
input famid faminc1-faminc12;
cards;
1 3281 3413 3114 2500 2700 . 3114 3319 3514 1282 2434 2818
2 4042 . . . . . 1531 2914 3819 4124 4274 4471
3 6015 . . . . . . . . . . .
;
run;
I would like to create an indicator variable called fam_indicator. If variables faminc2-faminc12 are all missing, then fam_indicator=1. Otherwise fam_indicator=0.
I tried the code below but it didn't work.
data fam;
set faminc;
if missing(faminc2-faminc12) then fam_indicator=1;
else fam_indicator=0;
run;
You can do this a bunch of different ways. If the variables are all numeric, then the n function, which counts the non-missing numeric arguments, will do it for you.
data fam;
set faminc;
if n(of faminc2-faminc12) eq 0 then fam_indicator=1;
else fam_indicator=0;
run;
cmiss and nmiss also could work; cmiss is generic regardless of type, while nmiss is only for numerics. They would count the number of missings, so you'd want if cmiss(of faminc2-faminc12) eq 11 or similar.
The other thing you needed was the of. n(faminc2-faminc12) would just subtract the one from the other. of says "the next thing here is a variable list" and it will then expand the list out.
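For illustration, here is a minimal sketch of the cmiss version that avoids hard-coding the 11. The array name inc is my own choice, not something from the question; dim(inc) supplies the number of variables in the list.
data fam;
set faminc;
array inc{*} faminc2-faminc12; /* inc is a hypothetical array name */
/* the comparison evaluates to 1 when every variable in the list is missing, else 0 */
fam_indicator = (cmiss(of inc{*}) = dim(inc));
run;
With the sample data above, only famid 3 should end up with fam_indicator=1.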
The nmiss function can be used directly. The sum function is another option: it ignores missing arguments, so it returns a missing value only when every argument is missing.
fam_indicator=ifn(sum(of faminc2-faminc12)=.,1,0);
I am trying to find a quick way to replace missing values with the average of the two nearest non-missing values. Example:
Id Amount
1 10
2 .
3 20
4 30
5 .
6 .
7 40
Desired output
Id Amount
1 10
2 **15**
3 20
4 30
5 **35**
6 **35**
7 40
Any suggestions? I tried using the retain function, but I can only figure out how to retain last non-missing value.
I think what you are looking for might be more like interpolation. While this is not the mean of the two closest values, it might be useful.
There is a nifty little tool for interpolating in datasets called proc expand. (It should do extrapolation as well, but I haven't tried that yet.) It's very handy when making series of dates and cumulative calculations.
data have;
input Id Amount;
datalines;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;
run;
proc expand data=have out=Expanded;
convert amount=amount_expanded / method=join;
id id; /*second is column name */
run;
For more on the proc expand see documentation: https://support.sas.com/documentation/onlinedoc/ets/132/expand.pdf
This works:
data have;
input id amount;
cards;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;
run;
proc sort data=have out=reversed;
by descending id;
run;
data retain_non_missing;
set reversed;
retain next_non_missing;
if amount ne . then next_non_missing = amount;
run;
proc sort data=retain_non_missing out=ordered;
by id;
run;
data final;
set ordered;
retain last_non_missing;
if amount ne . then last_non_missing = amount;
if amount = . then amount = (last_non_missing + next_non_missing) / 2;
run;
but as ever, will need extra error checking etc for production use.
The key idea is to sort the data into reverse order, allowing it to use RETAIN to carry the next_non_missing value back up the data set. When sorted back into the correct order, you then have enough information to interpolate the missing values.
There may well be a PROC to do this in a more controlled way (I don't know anything about PROC STANDARDIZE, mentioned in Reeza's comment) but this works as a data step solution.
Here's an alternative requiring no sorting. It does require IDs to be sequential, though that can be worked around if they're not.
It uses two set statements: one that reads the main (and previous) amounts, and one that reads ahead until the next non-missing amount is found. Here I use the sequence of id values to guarantee it will be the right record, but you could write this differently if needed (keeping track of which loop iteration you're on) if the id values aren't sequential or in any particular order.
I use the first.amount check to make sure we don't try to execute the second set statement more than we should (which would terminate early).
You need to do two things differently if you want first/last rows treated differently. Here I assume prev_amount is 0 if it's the first row, and I assume last_amount is missing, meaning the last row just gets the last prev_amount repeated, while the first row is averaged between 0 and the next_amount. You can treat either one differently if you choose, I don't know your data.
data have;
input Id Amount;
datalines;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;;;;
run;
data want;
set have;
by amount notsorted; *so we can tell if we have consecutive missings;
retain prev_amount; *next_amount is auto-retained;
if not missing(amount ) then prev_amount=amount;
else if _n_=1 then prev_amount=0; *or whatever you want to treat the first row as;
else if first.amount then do;
do until ((next_id > id and not missing(next_amount)) or (eof));
set have(rename=(id=next_id amount=next_amount)) end=eof;
end;
amount = mean(prev_amount,next_amount);
end;
else amount = mean(prev_amount,next_amount);
run;
In my dataset there are several observations (IDs) with all or too many variables missing. I want to know which IDs have no data (all variables are missing). I used proc freq, but it gives me only the frequency of each variable, which does not serve my purpose. Proc means with nmiss also gives me just the total number of missing values. I want to know exactly which IDs have missing variables. I searched online but couldn't find a solution to my problem. Help would be appreciated. Below is the sample data:
ID a b c d e
1 . 3 1 2 2
2 . . . . .
3 . . . . .
4 3 . 5 . .
I want result in a way that show me data of ID with complete missing information like;
ID a b c d e
2 . . . . .
3 . . . . .
Thanks
Thanks in advance
Use the nmiss function instead, which counts the number of missing values in the row for a specified list of variables. If you're looking at 3 variables, for example:
If nmiss(var1, var2, var3) =3;
Keep ID;
This will keep only records with all three variables missing.
The n function returns the number of non-missing numeric values in a list. This means you could use a variable list and not worry about counting the variables:
if n(of _numeric_) = 0 then output;
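or, if ID is itself numeric (it would then be counted by _numeric_ and is never missing), name the range of variables explicitly: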
if n(of a--e) = 0 then output;
If you're checking character variables, there is no character counterpart to the n function, but you could use the coalescec function to do something similar. The coalesce functions return the first non-missing value from a list of values. To select rows with all character values missing, use something like:
if missing(coalescec(of _character_)) then output;
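Put together with the sample data from the question, a complete run might look like the sketch below. The dataset names have and all_missing are assumptions of mine, and ID is read as numeric, which is why the list is a--e rather than _numeric_.
data have;
input ID a b c d e;
datalines;
1 . 3 1 2 2
2 . . . . .
3 . . . . .
4 3 . 5 . .
;
run;
data all_missing; /* hypothetical output name */
set have;
/* subsetting if: keep only rows where none of a through e has a value */
if n(of a--e) = 0;
run;
This should leave only IDs 2 and 3 in all_missing.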
I have data set that contains empty cells. It looks like
Year Volume ID
2000 999 LSE
2001 . LSE
. 555 LSE
2008 . NYSE
2010 1099 NYSE
I need to delete the row that contains empty cells. The output should look like this
Year Volume ID
2000 999 LSE
2010 1099 NYSE
I tried following code
data test;
set data;
if volume = " . " then delete;
if year= " . " then delete;
run;
But output file has 0 observations and SAS gives me
NOTE: Character values have been converted to numeric values at the
places given by (Line):(Column).
Also I tried
options missing = ' ';
data test;
set data;
if missing(cats(of _all_)) then delete;
run;
But it's not working either.
I just want to delete the rows with empty cells.
Anyone can help me to solve this issue ? Thanks in advance !!!
Options Missing only affects how things are printed or converted when going numeric -> character. In this case you have numerics, so it accomplishes nothing.
Your first code sample is mostly correct- at least, when I try it, it works. " . " is not really right, but it will convert (as the note says) to missing since none of those characters are a number.
The proper way to do this is one of the two:
data have;
input Year Volume ID $;
datalines;
2000 999 LSE
2001 . LSE
. 555 LSE
2008 . NYSE
2010 1099 NYSE
;;;;
run;
data want;
set have;
if year = . then delete;
if volume = . then delete;
run;
or
data want;
set have;
if missing(year) then delete;
if missing(volume) then delete;
run;
missing returns true if the variable is missing (for numerics that covers 28 distinct missing values: ., ._ and .A through .Z, though . is by far the most common).
A better way to do more than one is to use the nmiss or cmiss functions (nmiss for numbers, cmiss for character or mixed type).
data want;
set have;
if nmiss(year,volume) = 0;
run;
That will return the number of missing values, which you can then test for whatever value you are looking for (in this case, zero values). You could even do:
data want;
set have;
if nmiss(of _NUMERIC_) = 0;
run;
where _NUMERIC_ is all numeric variables. (of is needed for variable lists like this to tell SAS to expect a list.)
Your second doesn't work, by the way, because it's catting the ID variable together with the others. You could have seen this by looking at the value of that cats (ie, assign it to a variable). You could have said
if cats(of _all_) = ID then delete;
but as several of us have shown that's probably inferior to the simpler solutions using nmiss.
You can just use a subsetting if nmiss() by checking the variables that must be populated:
data test;
set data;
if nmiss(year,volume)=0 ;
run;
Edit: This works if year and volume are numeric; if they are strings, you can use the cmiss() function.
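As a rough sketch of the cmiss() variant, assuming the same year, volume and ID variables as in the question (ID being character) and a source dataset named have: cmiss accepts numeric and character arguments together, so the character column can be checked in the same call.
data want;
set have;
/* keep the row only when none of the listed variables is missing */
if cmiss(year, volume, id) = 0;
run;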
Don't use quotes with numeric variables, e.g.:
if volume = . then delete;
Other option that works for either character or numeric:
if missing(volume) then delete;
You could use a where= data set option in the set statement here as well. Both conditions have to hold, so they are joined with and:
data new_dataset;
set old_dataset (where = (volume is not missing and year is not missing));
run;
I always enjoy using the is not missing syntax because it seems too much like writing normal English to work
I have a SAS dataset which looks like this:
Month Col1 Col2 Col3 Col4
200801 11 2 3 20
200802 5 9 4 10
. . . . .
. . . . .
. . . . .
201212 3 34 1 0
I want to create a dataset by shifting each row's Col1-Col4 values to the right, one more column per row, so that the result looks diagonally shifted.
Month Col1 Col2 Col3 Col4 Col5 Col6 Col7 . . . . . . . Coln
200801 11 2 3 20
200802 . 5 9 4 10
. . . . .
. . . . .
. . . . .
201212 . . . . . . . . . 3 34 1 0
Can someone suggest how I can do it?
Thanks!
First off, if you can avoid doing so, do. This is a pretty sparse way to store data, and will involve large datasets (definitely use OPTIONS COMPRESS at least), and usually can be worked around with good use of CLASS variables.
If you really must do this, PROC TRANSPOSE is your friend. While this is possible in the data step, it's less messy and more flexible in PROC TRANSPOSE.
First, make a totally vertical dataset (month+colname+colvalue):
data pre_t;
set have;
array cols col1-col4;
do _t = 1 to dim(cols);
colname = cats("col",((_N_-1) + _t)); *shifting here, edit this logic as needed;
value = cols[_t];
output;
end; *closes the do loop;
keep colname value month;
run;
In that data step, you are creating the eventual column name in colname and setting it up for the transpose. If your data are not laid out like the above (in particular, if they are grouped by something else), _N_ may not work and you may need some logic (such as computing the difference in months from 200801) to calculate the column number.
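If the row number can't be relied on, one way to derive the shift from the month itself is sketched here. It assumes month is stored as a numeric YYYYMM value and that 200801 is the first month; the variable name _offset is made up for the example.
data pre_t;
set have;
array cols col1-col4;
/* months elapsed since January 2008; assumes month is numeric YYYYMM */
_offset = intck('month', '01JAN2008'd, input(put(month, 6.), yymmn6.));
do _t = 1 to dim(cols);
colname = cats("col", _offset + _t); /* 200801 gives col1-col4, 200802 gives col2-col5 */
value = cols[_t];
output;
end;
keep colname value month;
run;
The proc transpose step stays the same either way.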
Then, proc transpose:
proc transpose data=pre_t out=want;
by month;
id colname;
var value;
run;
And voilà, you should have what you were looking for. Make sure it's sorted properly in order to get the output in the expected order.
I have a data set with a number of records (the contents of which are irrelevant) and a series of flag variables at the end of each record. Something like this:
Record ID Flag 1 Flag 2 Flag 3
1 Y
2 Y Y
3
4
5
6 Y Y
I would like to create a printed report (or, ideally, a data set that I could then print) that would look something like the following:
Variable N % Not Missing
Flag 1 6 33.33333
Flag 2 6 33.33333
Flag 3 6 16.66666
I can accomplish something close to what I want for one variable at a time using Proc Freq with something like this:
proc freq data=Work.Records noprint;
tables Flag1 /out=Work.Temp;
run;
I suppose I could easily write a macro to loop through each variable and concatenate the results into one data set... but that seems WAY too complex for this. There has to be a built-in SAS procedure for this, but I'm not finding it.
Any thoughts out there?
Thanks!
Welcome to the fabulous world of Proc Tabulate.
PROC TABULATE is the procedure to use when PROC FREQ can't cut it. It works similarly to PROC FREQ in that it has table statements, but beyond that it's pretty much on steroids.
data test;
input RecordID (Flag1 Flag2 Flag3) ($);
datalines;
1 Y . .
2 . Y Y
3 . . .
4 . . .
5 . . .
6 . Y Y
;;;;
run;
proc tabulate data=test;
class flag1-flag3/missing;
tables (flag1-flag3),colpctn; *things that generate rows,things that generate columns;
run;
Unfortunately it's not easy to hide the blank rows without using some advanced style tricks, but if you don't mind seeing both rows this works pretty well.
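If the blank rows are a problem, a different route (not PROC TABULATE, just a plain data step plus PROC MEANS) is to turn each flag into a 0/1 indicator and take its mean, which is the share of non-missing values. This is only a sketch against the test dataset above; flags_num and nFlag1-nFlag3 are names I made up.
data flags_num; /* hypothetical dataset name */
set test;
array f{*} Flag1-Flag3; /* existing character flags */
array nf{*} nFlag1-nFlag3; /* new numeric 0/1 indicators */
do i = 1 to dim(f);
nf{i} = not missing(f{i}); /* 1 when the flag has a value, 0 otherwise */
end;
drop i;
run;
proc means data=flags_num n mean;
var nFlag1-nFlag3;
run;
Multiply the mean by 100 to read it as a percentage. Adding an output statement to PROC MEANS would give you a dataset instead of a printed report.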