Replacing missing values with non-missing values in the same dataset

I have a question regarding the following problem. I have data that looks like this:
State Total
AZ 1000
AZ 1000
AZ -
CA -
CA 4000
That is, I have missing data for the variable "total" for some observations. I would like to replace the missing values with total from non-missing observations.
Desired output
State Total
AZ 1000
AZ 1000
AZ **1000**
CA **4000**
CA 4000
Any ideas?

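If the value is constant within each state, use PROC STDIZE to replace the missing values. The data must be sorted by state for the BY statement, and the VAR statement must name the variable to fill (total here).
proc stdize data=have out=want missing=mean reponly;
by state;
var total;
run;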

Here is a solution I came up with. There are surely more elegant ways to do this, but this one is tested and works.
The idea is to sort the data so that missing values come after the non-missing ones, then process each state in turn: save the 'total' value from the first observation and apply it to any missing values in that state.
data begin;
length state $3 total 5;
input state Total;
cards;
AZ 1000
AZ 1000
AZ .
CA .
CA 4000
OZ .
OZ 3000
OZ .
;
run;
proc sort data=begin; by state descending total ; run;
data Filled;
set begin;
by state; /*Handle each state as own subset*/
retain memory; /*Keep 'memory' across observations instead of resetting it to missing*/
if first.state then memory=total; /*Save the value to temporary column*/
if total=. then total=memory; /*Fill blanks*/
drop memory; /*Cleanup*/
run;

Join each row to its state mean and use coalesce() to fill the missing values.
proc sql;
select a.state, coalesce(a.total, b.total) as total
from have a
left join (select state, mean(total) as total from have group by state) b
on a.state = b.state;
quit;

Related

Missing values in VARMAX

I have a dataset with visitors and weather variables. I'm trying to forecast visitors based on the weather variables. Since the dataset only contains visitors in season, there are missing values and gaps in every year. Running PROC REG in SAS works fine, but the problem comes when I use PROC VARMAX: I cannot run the regression because of the missing values. How can I tackle this?
proc varmax data=tivoli4 printall plots=forecast(all);
id obs interval=day;
model lvisitors = rain sunshine averagetemp
dfebruary dmarch dmay djune djuly daugust doctober dnovember ddecember
dwednesday dthursday dfriday dsaturday dsunday
d_24Dec2016 d_05Dec2013 d_24Dec2017 d_24Dec2014 d_24Dec2015 d_24Dec2019
d_24Dec2018 d_24Sep2012 d_06Jul2015
d_08feb2019 d_16oct2014 d_15oct2019 d_20oct2016 d_15oct2015 d_22sep2017 d_08jul2015
d_20Sep2019 d_08jul2016 d_16oct2013 d_01aug2012 d_18oct2012 d_23dec2012 d_30nov2013 d_20sep2014 d_17oct2012 d_17jun2014
dFrock2012 dFrock2013 dFrock2014 dFrock2015 dFrock2016 dFrock2017 dFrock2018 dFrock2019
dYear2015 dYear2016 dYear2017
/p=7 q=2 Method=ml dftest;
garch p=1 q=1 form=ccc OUTHT=CONDITIONAL;
restrict
ar(3,1,1)=0, ar(4,1,1)=0, ar(5,1,1)=0,
XL(0,1,13)=0, XL(0,1,14)=0, XL(0,1,13)=0, XL(0,1,27)=0, XL(0,1,38)=0, XL(0,1,42)=0;
output lead=10 out=forecast;
run;
As with any forecast, you first need to prepare your time series. Run your data through PROC TIMESERIES to fill in or impute missing values. Which imputation is most appropriate depends on your variables. The code below will:
Sum lvisitors by day and set missing values to 0
Set missing values of averagetemp to average
Set missing values of rain, sunshine, and your variables starting with d to 0 (assuming these are indicators)
Code:
proc timeseries data=have out=want;
id obs interval = day
setmissing = 0
notsorted
;
var lvisitors / accumulate=total;
crossvar averagetemp / accumulate=none setmissing=average;
crossvar rain sunshine d: / accumulate=none;
run;
Important Time Interval Consideration
Depending on your data, this could bias your error rate and estimates, since you always know that no one will be around in the off-season. If you have many missing values for off-season data, you will want to remove those rows.
Since PROC VARMAX does not support custom time intervals, you can instead create a simple time identifier. Alternatively, you can turn this into a format with PROC FORMAT and convert time_id back at the end.
data want;
set have;
time_id+1;
run;
proc varmax data=want;
id time_id interval=day;
...
output lead=10 out=myforecast;
run;
data myforecast;
merge myforecast
want(keep=time_id date)
;
by time_id;
run;
Or, if you made a format:
data myforecast;
set myforecast;
date = put(time_id, timeid.);
drop time_id;
run;
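The timeid. format used above is not defined anywhere in this answer. One way it could be built is from the time_id/date pairs with a CNTLIN= dataset. This is only a sketch: the dataset name fmt is made up for illustration, and it assumes that want contains a numeric SAS date variable called date.

```sas
/* Hypothetical sketch: build a numeric format TIMEID. that maps
   each time_id to its date, written as text (e.g. 01JAN2019).
   Assumes want has time_id and a SAS date variable named date. */
data fmt;
    set want(keep=time_id date);
    retain fmtname 'timeid' type 'N';  /* format name, numeric type */
    start = time_id;                   /* value to be formatted     */
    label = put(date, date9.);         /* text shown for that value */
    keep fmtname type start label;
run;

proc format cntlin=fmt;
run;
```

Note that put(time_id, timeid.) returns a character value, so date in the step above would be a character variable unless it is converted back with input().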

SAS vlookup - view all data not only joined

I would like to see all the data from dataset "one". Where no match exists between the tables, the value should be set to 0. The current code only gives me rows where there is a match. This is the table I need:
data one;
input lastname: $15. typeofcar: $15. mileage;
datalines;
Jones Toyota 3000
Smith Toyota 13001
Jones2 Ford 3433
Smith2 Toyota 15032
Shepherd Nissan 4300
Shepherd2 Honda 5582
Williams Ford 10532
;
data two;
input startrange endrange typeofservice & $35.;
datalines;
3000 5000 oil change
5001 6000 overdue oil change
6001 8000 oil change and tire rotation
8001 9000 overdue oil change
9001 11000 oil change
11001 12000 overdue oil change
12001 14000 oil change and tire rotation
15032 14999 overdue oil change
13001 15999 15000 mile check
;
data combine;
do until (mileage<15000);
set one;
do i=1 to nobs;
set two point=i nobs=nobs;
if startrange = mileage then
output;
end;
end;
run;
proc print;
run;
Description of the code from the SAS support site:
Read the first observation from the SAS data set outside the DO loop. Assign the FOUND variable to 0. Start the DO loop reading observations from the SAS data set inside the DO loop. Process the IF condition; if the IF condition is true, OUTPUT the observation and set the FOUND variable to 1. Assigning the FOUND variable to 1 will cause the DO loop to stop processing because of the UNTIL (FOUND) that is coded on the DO loop. Go back to the top of the DATA step and read the next observation from the data set outside the DO loop and process through the DATA step again until all observations from the data set outside the DO loop have been read.
You could do that with a LEFT JOIN in PROC SQL, taking all variables from one and adding two conditions that fill startrange and endrange with 0 when they are missing.
proc sql noprint;
create table want as
select t1.*
, case when t2.startrange=. then 0 else t2.startrange end as startrange
, case when t2.endrange=. then 0 else t2.endrange end as endrange
, t2.typeofservice
from one t1 left join two t2
on (t1.mileage = t2.startrange)
;quit;
Or do it in 2 steps (I personally find the if of the data step cleaner than the case when of the proc sql.)
proc sql noprint;
create table want as select *
from one t1 left join two t2 on (t1.mileage = t2.startrange)
;quit;
data want; set want;
if startrange=. then do; startrange=0; endrange=0; end;
run;
I can't use proc sql because I need the lookup (like VLOOKUP) inside the DO UNTIL loop. I need another solution.
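If it has to stay in a data step, something along these lines might work. It is a rough sketch of the FOUND-flag pattern described in the question, not the code from the SAS support site: every row of one is written exactly once, and the lookup columns are set to 0 (and typeofservice to blank) when nothing in two matches. The match condition is the one from the question (startrange = mileage).

```sas
data combine;
    set one;                             /* outer table: one row per pass    */
    found = 0;
    do i = 1 to nobs until (found);      /* stop scanning at the first match */
        set two point=i nobs=nobs;
        if startrange = mileage then found = 1;
    end;
    if not found then do;                /* no match: clear the lookup columns */
        startrange = 0;
        endrange = 0;
        typeofservice = ' ';
    end;
    output;                              /* keep the row either way */
    drop found;
run;
```

Unlike a join, this keeps only the first matching row from two. To match on the whole range instead, the condition could be changed to startrange <= mileage <= endrange.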
Data step is not the best way to code this. It is much easier to code fuzzy matches using SQL code.
Not sure why you need to have zeros instead of missing values, but coalesce() should make it easy to provide them.
proc sql ;
create table combine as
select a.*
, coalesce(b.startrange,0) as startrange
, coalesce(b.endrange,0) as endrange
, b.typeofservice
from one a left join two b
on a.mileage between b.startrange and b.endrange
;
quit;

Summing values by character in SAS

I created this fakedata as an example:
data fakedata;
length name $5;
infile datalines;
input name count percent;
return;
datalines;
Ania 1 17
Basia 1 3
Ola 1 10
Basia 1 52
Basia 1 2
Basia 1 16
;
run;
The result I want to have is:
---> summed counts and percents for Basia
I would like to have the summed count and percent for Basia, as if she appeared only once in the table, with count 4 and percent 83. I tried converting name into a number to do a GROUP BY in proc sql, but it was changed into an ORDER BY (I got a message to that effect). I suppose it isn't that difficult, but I can't find the solution. I also tried some arrays without any success. Any help appreciated!
It sounds like proc sql does what you want:
proc sql;
select name, sum(count) as cnt, sum(percent) as sum_percent
from fakedata
group by name;
quit;
You can add a where clause to get the results just for one name.
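For example, restricting the result to a single name could look like this (a minimal sketch against the fakedata table from the question; the aliases sum_count and sum_percent are just illustrative):

```sas
proc sql;
    select name,
           sum(count)   as sum_count,
           sum(percent) as sum_percent
    from fakedata
    where name = 'Basia'     /* keep only the rows for one name */
    group by name;
quit;
```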
Hm, actually I got an answer.
proc summary data=fakedata nway;
class name; /* CLASS does not need sorted input; BY would fail on the unsorted example data */
var count percent;
output out=wynik (drop = _FREQ_ _TYPE_) sum(count)=count sum(percent)=percent;
run;
You can most likely go back a step and use PROC FREQ to generate this output in a single step. Based on the counts, the percents are not correct, but I'm not sure they're intended to be; right now they add up to over 100%. If you already have some summaries, use the WEIGHT statement to account for the counts.
proc freq data=fakedata;
table name;
weight count;
run;

SAS: in the DATA step, how to calculate the grand mean of a subset of observations, skipping over missing values

I'm trying to calculate the grand mean of a subset of observations (e.g., observation 20 to observation 50) in the data step. In this calculation, I also want to skip over (ignore) any missing values.
I've tried to play around with the mean function using various if … then statements, but I can't seem to fit all of it together.
Any help would be much appreciated.
For reference, here's the basic outline of my data steps:
data sas1;
infile '[file path]';
input v1 $ 1-9 v2 $ 11 v3 13-17 [redacted] RPQ 50-53 [redacted] v23 101-106;
v1=translate(v1,"0"," ");
format [redacted];
label [redacted];
run;
data gmean;
set sas1;
id=_N_;
if id = 10-40 then do;
avg = mean(RPQ);
end;
/*Here, I am trying to calculate the grand mean of the RPQ variable*/
/*but only observations 10 to 40, and skipping over missing values*/
run;
Use the automatic variable _N_ to identify the rows. Keep a running sum that is retained from row to row, then divide by the number of non-missing observations at the end. Use the missing() function to decide whether a row counts as an observation and whether to add it to the running total.
data stocks;
set sashelp.stocks;
retain sum_total n_count 0;
if 10<=_n_<=40 and not missing(open) then do;
n_count=n_count+1;
sum_total=sum_total+open;
end;
if _n_ = 40 then average=sum_total/n_count;
run;
proc print data=stocks(obs=40 firstobs=40);
var average;
run;
*check with proc means that the value is correct;
proc means data=sashelp.stocks (firstobs=10 obs=40) mean;
var open;
run;

In SAS how do I group my ratios into high medium and low?

I calculate a ratio for 40 stocks. I need to sort those into three groups high, medium and low based on the value of the ratio. The ratios are fractions of one and there aren't many repetitions. What I need is to create three groups of about 13 stocks each, in group 1 to have the high ratios, in group 2 medium ratios and group 3 low ratios. I have the below code but it just assigns rank 1 to all my stocks.
How can I correct this?
data sourceh.combinedfreq2;
merge sourceh.nonnfreq2 sourceh.nofreq2 sourcet.caps;
by symbol;
ratio=(freqnn/freq);
run;
proc rank data=sourceh.combinedFreq2 out=sourceh.ranked groups=3;
by symbol notsorted;
var ratio;
ranks rank;
run;
If you want to automatically partition into three relatively even groups, you can use PROC RANK (See example using sashelp.stocks):
data have;
set sashelp.stocks;
ratio=high/low;
run;
proc rank data=have out=want groups=3;
by stock notsorted;
var ratio;
ranks rank;
run;
That partitions them into three groups. As long as you have 40 different values (i.e., not a lot of repeats of one value), it will make 3 evenly split groups (with ~13 in each).
In your case, do not use by anything - by will create separate sets of ranks (here I'm ranking dates by stock, but you want to rank stocks.)
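Applied to the datasets from the question, that might look like the following. This is a sketch, not tested against the real data: it drops the BY statement and adds DESCENDING so that the highest ratios land in the first group. PROC RANK numbers the groups from 0, so the values are 0, 1 and 2 rather than 1, 2 and 3.

```sas
proc rank data=sourceh.combinedfreq2 out=sourceh.ranked groups=3 descending;
    var ratio;
    ranks rank;   /* 0 = highest third, 1 = middle third, 2 = lowest third */
run;
```

If groups numbered 1 to 3 are needed, adding 1 to rank in a following data step would do it.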
I think people are making this more complicated than it needs to be. Let's do this on easy mode.
First, we'll create the dataset and create our ratios.
Second, we'll sort the data by ratio.
Lastly, we'll assign a group based on observation number.
WARNING! UNTESTED CODE!
/*Make the dataset. I stole this from your code above*/
data sourceh.combinedfreq2;
merge sourceh.nonnfreq2 sourceh.nofreq2 sourcet.caps;
by symbol;
ratio=(freqnn/freq);
run;
/*Sort the data so that it's ordered by ratio*/
PROC SORT DATA=sourceh.combinedfreq2 OUT=sourceh.combinedfreq2 ;
BY DESCENDING ratio ;
RUN ;
/*Assign a value based on observation number*/
Data sourceh.combinedfreq2;
Set sourceh.combinedfreq2;
length Group $6.;
if _N_ <= 13 then Group = "High";
else if _N_ <= 26 then Group = "Medium";
else Group = "Low";
run;