The variable variable_1 ranges from 0 to 10, plus 99 for missing values. I want to run a Stata calculation using only the values from 0 to 10, but the if command doesn't seem to work: the sum command still reports 99 as the max value. How can I select all the values that are not 99?
if variable_1 !=99 {
sum variable_1
}
I think you want an if qualifier, not an if command. That is, append the if condition to the command rather than writing it before your calculation.
sum variable_1 if variable_1 != 99
http://www.stata.com/support/faqs/data-management/multiple-operations/
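To see why the two forms behave differently, here is a rough analogy in Python (not Stata; the list and its values are made up for illustration). As far as I know, Stata's if command evaluates its expression once, using the first observation when a variable is referenced, while the if qualifier is applied to every observation.

```python
# Hypothetical data: values 0-10 are valid, 99 codes a missing value.
variable_1 = [3, 99, 7, 10, 99, 0]

# Analogy for the `if` command: the condition is checked once,
# against the first observation only, so nothing gets filtered.
if variable_1[0] != 99:
    max_if_command = max(variable_1)

# Analogy for the `if` qualifier: the condition is applied to each observation.
kept = [v for v in variable_1 if v != 99]
max_if_qualifier = max(kept)

print(max_if_command)    # the 99s are still in the calculation
print(max_if_qualifier)  # only the values 0-10 are used
```

This is only meant to show the logic; in Stata itself the fix is the one-line if qualifier shown above.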
I have numeric variables pm25_total2000 to pm25_total2018, one per year.
Each person has a starting year between 2013 and 2018; we can call that variable "reqyear".
Now I want to calculate, for each person, the mean over the 10 years up to and including the starting year.
For example, if a person has starting year 2015, I want mean(of pm25_total2006-pm25_total2015).
Or if a person has starting year 2013, I want mean(of pm25_total2004-pm25_total2013).
How can I do this?
data _null_;
set scapkon;
reqyear=substr(iCDate,1,4)*1;
call symput('reqy',reqyear);
run;
data scatm;
set scapkon;
/* Mean of the 10 years before the recruitment year */
pm25means=mean(of pm25_total%eval(&reqy.-9)-pm25_total%eval(&reqy.));
run;
The problem is that %eval(&reqy.-9) resolves to a single constant value for every row (in my case 2007), rather than a value per person.
So that doesn't work.
You can compute the mean with a traditional loop.
data want;
set have;
array x x2000-x2018;
call missing(sum, mean, n);
do _n_ = 1 to 10;
v = x ( start - 1999 -_n_ );
if not missing(v) then do;
sum + v;
n + 1;
end;
end;
if n then mean = sum / n;
run;
If you want to flex your SAS skill, you can use POKE and PEEK concepts to copy a fixed length slice (i.e. a fixed number of array elements) of an array to another array and compute the mean of the slice.
Example:
You will need to add sentinel elements and range checks on start to prevent errors when start-10 < 2000.
data have;
length id start x2000-x2018 8;
do id = 1 to 15;
start = 2013 + mod(id,6);
array x x2000-x2018;
do over x;
x = _n_;
_n_+1;
end;
output;
end;
format x: 5.;
run;
data want;
length id start mean10yrPriorStart 8;
set have;
array x x2000-x2018;
array slice(10) _temporary_;
call pokelong (
peekclong ( addrlong ( x(start-1999-10) ) , 10*8 ) ,
addrlong ( slice (1))
);
mean10yrPriorStart = mean(of slice(*));
run;
use an array and loop
index the array with years
accumulate the sum of the values
accumulate the count to account for any missing values
divide to obtain the mean value
data want;
set have;
array _pm(2000:2018) pm25_total2000 - pm25_total2018;
do year=reqyear to (reqyear-9) by -1;
*add totals;
total = sum(total, _pm(year));
*add counts;
nyears = sum(nyears,not missing(_pm(year)));
end;
*accounts for possible missing years;
mean = total/nyears;
run;
Note this loop goes in reverse (start year to 9 years previous) because it's slightly easier to understand this way IMO.
If you have no missing values you can remove the nyears step, but not a bad thing to include anyways.
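For readers who want to check the logic outside SAS, the steps above can be sketched in Python. This is only an illustration under assumptions: the function name window_mean, the dictionary layout, and the sample values are all made up, and None stands in for a SAS missing value.

```python
def window_mean(values_by_year, reqyear, width=10):
    """Mean of the non-missing values for the `width` years ending at reqyear."""
    total = 0.0
    nyears = 0
    for year in range(reqyear - width + 1, reqyear + 1):
        value = values_by_year.get(year)
        if value is not None:      # skip missing years
            total += value         # accumulate the sum
            nyears += 1            # accumulate the count
    return total / nyears if nyears else None

# Hypothetical person: pm25 value equals (year - 2000), with 2010 missing.
pm25 = {year: float(year - 2000) for year in range(2000, 2019)}
pm25[2010] = None

person_mean = window_mean(pm25, 2015)  # uses 2006 through 2015
print(person_mean)
```

Dividing by the count of non-missing years, rather than by 10, is the same design choice as the nyears step in the SAS code.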
NOTE: My first answer did not address the OP's question, so this is a redux.
For this solution, I used Richard's code for generating test data. However, I added a line to randomly add missing values.
x = _n_;
if ranuni(1) < .1 then x = .;
_n_+1;
This alternative does not perform any explicit checks for missing values, because the sum() and n() functions already ignore them. The loop over the dynamic slice of the data array only copies the values into a temporary array. The final sum and count are computed on the temporary array outside of the loop.
data want;
set have;
array x(2000:2018) x:;
array t(10) _temporary_;
j = 1;
do i = start-9 to start;
t(j) = x(i);
j + 1;
end;
sum = sum(of t(*));
cnt = n(of t(*));
mean = sum / cnt;
drop x: i j;
run;
Result:
id start sum cnt mean
1 2014 72 7 10.285714286
2 2015 305 10 30.5
3 2016 458 9 50.888888889
4 2017 631 9 70.111111111
I have the following SAS PROC MEANS statement that works great as it is.
proc means data=MBA_NODUP_APPLICANT_&TERM. missing nmiss n mean median p10 p90 fw = 8;
where ENR = 1;
by SRC_TYPE;
var gmattotal greverb2 grequant2 greanwrt;
run;
However, I am trying to add a new statistic calculated as nmiss/(nmiss+n). I don't see any examples of this online, but also nothing that says it cannot be done.
To calculate the percent missing, which is what your formula means, just use the OUTPUT statement to generate a dataset with the NMISS and N values. Then add a step to do the arithmetic yourself.
Or you could create a new binary variable using the MISSING() function and take the MEAN of that. The mean of a 1/0 variable is the same as the proportion of observations that were 1 (TRUE).
Example:
data test;
set sashelp.cars;
missing_cylinders=missing(cylinders);
run;
proc means data=test nmiss n mean;
var cylinders missing_cylinders ;
run;
So 2/428 is a little less than 0.5%.
The MEANS Procedure

Variable            N Miss     N        Mean
---------------------------------------------
Cylinders                2   426   5.8075117
missing_cylinders        0   428   0.0046729
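The equivalence between nmiss/(nmiss+n) and the mean of a missing-value indicator can be checked with a few lines of Python. This is just a sketch with made-up values (None stands in for a SAS missing value), not a replacement for PROC MEANS.

```python
# Hypothetical column with two missing values out of eight.
values = [4, None, 6, 8, None, 4, 6, 6]

# Route 1: counts, as you would get from NMISS and N.
nmiss = sum(v is None for v in values)
n = len(values) - nmiss
pct_from_counts = nmiss / (nmiss + n)

# Route 2: mean of a 1/0 indicator, like missing_cylinders above.
indicator = [1 if v is None else 0 for v in values]
pct_from_mean = sum(indicator) / len(indicator)

print(pct_from_counts, pct_from_mean)
```

Both routes give the same proportion, so which one to use in SAS is mostly a matter of whether you prefer an extra DATA step before or after PROC MEANS.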
I am confused by the syntax SAS uses when summing down a column.
I wrote the following code to sum across the columns:
DATA SUM_RESULTS_ADF;
SET VOLUME_DOLLAR;
by SYM_ROOT;
if upcase(EX) = 'D';
if first.SYM_ROOT then
do;
SUMMED_DOLLARSIZE=0;
SUMMED_SIZE=0;
end;
SUMMED_DOLLARSIZE + DOLLAR_SIZE;
SUMMED_SIZE + SIZE;
if last.SYM_ROOT then output;
drop DOLLAR_SIZE SIZE;
RUN;
I just want to sum all the numbers in the columns named DOLLAR_SIZE and SIZE, but I am not sure if I am doing it correctly.
In OOP languages, we would usually write: SUMMED_DOLLARSIZE = SUMMED_DOLLARSIZE + DOLLAR_SIZE;
But it seems that SAS doesn't need the equal sign here.
The use of the SUM statement or the SUM(,...) function will handle missing values differently than just using the + operator. With SUM the missing values are ignored, but with + they will generate a missing result.
You are using the SUM statement. It is just a shortcut to save some typing.
The SUM statement has the form:
variable + expression ;
It is equivalent to these two statements:
retain variable 0 ;
variable = sum(variable,expression);
If you used simple addition instead of the SUM(,...) function then any observations with missing values would result in the sum being missing.
Here is a worked example:
data want ;
input cost ;
sum1 + cost ;
retain sum2 0;
sum2 = sum(sum2,cost);
retain sum3 0;
sum3 = sum3 + cost;
cards;
10
20
.
30
;
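If you don't have SAS at hand, the difference between the three running totals can be traced with a small Python sketch. The helper names sas_sum and sas_plus are made up for this illustration, and None stands in for a SAS missing value; sum1 (the SUM statement) behaves like sum2 here.

```python
def sas_sum(*args):
    """Mimic SUM(): ignore missing values; missing only if all are missing."""
    present = [a for a in args if a is not None]
    return sum(present) if present else None

def sas_plus(a, b):
    """Mimic the + operator: any missing operand gives a missing result."""
    return None if a is None or b is None else a + b

costs = [10, 20, None, 30]   # same values as the cards above
sum2, sum3 = 0, 0            # both retained and initialised to 0
rows = []
for cost in costs:
    sum2 = sas_sum(sum2, cost)    # like sum1 + cost; or sum2 = sum(sum2, cost);
    sum3 = sas_plus(sum3, cost)   # like sum3 = sum3 + cost;
    rows.append((cost, sum2, sum3))

for row in rows:
    print(row)
```

Once sum3 hits the missing cost it stays missing for every later row, because the retained missing value keeps feeding back into the addition.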
I have a data set like this :
ID  I201401  I201402  ...  I201411  I201412  START    END
1   1        0        ...  1        1        I201402  I201410
2   0        0        ...  0        1        I201401  I201408
3   1        1        ...  0        0        I201408  I201412
To explain the dataset simply: each ID has a 1 or 0 in columns I201401 through I201412, depending on a certain factor. Depending on another factor, I set the START and END columns too. Not all IDs have the same START and END values.
What I want to do is create another column that is the sum of the columns from the one named in START through the one named in END. For quick understanding, here is what the dataset should look like:
ID SUM
1 (SUM of I201402 through I201410)
2 (SUM of I201401 through I201408)
3 (SUM of I201408 through I201412)
The thing is, I don't really know how to tell the sum function to use the values of the START and END columns for this operation.
Thank you!
I don't know how to do this without looping, but with an array and the vname() function, you should be able to do what you need:
data want (keep=id sum);
set have;
array var_array I201401--I201412;
sum=0;
do over var_array;
if start le vname(var_array) le end then sum = sum + var_array;
end;
run;
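The idea in that loop is to compare each column's name, as a string, against START and END. A small Python sketch of the same comparison, assuming made-up flag values for the full twelve months (the question only shows some of them) and a hypothetical helper sum_between:

```python
def sum_between(row, start, end):
    """Sum the values of the columns whose names sort between start and end."""
    return sum(value for name, value in row.items() if start <= name <= end)

# Hypothetical row: one 1/0 flag per month, I201401 through I201412.
months = [f"I2014{m:02d}" for m in range(1, 13)]
flags = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1]
row = dict(zip(months, flags))

total = sum_between(row, "I201402", "I201410")
print(total)
```

This works because the names share a fixed prefix and zero-padded months, so alphabetical order matches calendar order. With names that don't sort that way you would have to compare positions instead of names.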
I am trying to compute using two loops. But I am not very familiar with loop elements.
Here is my data:
data try;
input rs t a b c;
datalines;
0 600
1 600 0.02514 667.53437 0.1638
2 600 0.2766 724.60233 0.30162
3 610 0.01592 792.34628 0.21354
4 615.2869 0.03027 718.30377 0.22097
5 636.0273 0.01967 705.45965 0.16847
;
run;
What I am trying to compute is this: for each value of t, all elements of a, b, and c need to be used in the equation. I then create variables v1-v6 to hold the results of the equation for each of the six t values. After that, I create CS as the sum of all the elements of v.
So my result dataset will look like this:
rs T a b c v1 v2 v3 v4 v5 v6 CS
0 600 sum of v1
1 600 0.02514 667.53437 0.1638 sum of v2
2 600 0.2766 724.60233 0.30162 sum of v3
3 610 0.01592 792.34628 0.21354 sum of v4
4 615.2869 0.03027 718.30377 0.22097 sum of v5
5 636.0273 0.01967 705.45965 0.16847 sum of v6
I wrote the code below to do this but got errors. Mainly, I am not sure how to use i and j properly to link all the elements of the variables. Can someone point out what I got wrong? I am aware that maybe I should not use the sum function to add up the elements of a variable, but I am not sure which function to use instead.
data try3;
set try;
retain v1-v6;
retain t a b c;
array v(*) v1-v6;
array var(*) t a b c;
cs=0;
do i=1 to 6;
do j=1 to 6;
v[i,j]=(2.89*(a[j]**2*(1-c[j]))/
((c[j]+exp(1.7*a[j]*(t[i]-b[j])))*
((1+exp(-1.7*a[j]*(t[i]-b[j])))**2));
cs[i]=sum(of v[i,j]-v[i,j]);
end;
end;
run;
For example, v1 will be computed like this: v[1,1] = 0 because there are no values for a, b, and c in the first row.
v[1,2] = (2.89*0.02514**2*(1-0.1638)) / ((0.1638+exp(1.7*0.02514*(600-667.53437))) * ((1+exp(-1.7*0.02514*(600-667.53437)))**2)).
v[1,3] = (2.89*0.2766**2*(1-0.30162)) / ((0.30162+exp(1.7*0.2766*(600-724.60233))) * ((1+exp(-1.7*0.2766*(600-724.60233)))**2)).
v[1,4] will use the next row's values of a, b, and c, but t will still be t[1], and so on until the last row. Together these make up v1. Then I need to sum all the elements of v1, that is v[1,1] + v[1,2] + v[1,3] + ... + v[1,6], to get cs[1].
The SAS language isn't that good at doing these kinds of things, which are essentially matrix calculations. The DATA step normally processes one observation at a time, though you can carry calculations over using the RETAIN statement. It is possible that you could get a cleaner result than this if you had access to PROC IML (which does matrix calculations natively), but assuming that you don't have access to IML, you need to do something like the following. I'm not 100% sure that it is what you need, but I think it is along the right lines:
data try;
infile cards missover;
input rs t a b c;
datalines;
0 600
1 600 0.02514 667.53437 0.1638
2 600 0.2766 724.60233 0.30162
3 610 0.01592 792.34628 0.21354
4 615.2869 0.03027 718.30377 0.22097
5 636.0273 0.01967 705.45965 0.16847
;
run;
data try4(rename=(aa=a bb=b cc=c css=cs tt=t vv1=v1 vv2=v2 vv3=v3 vv4=v4 vv5=v5 vv6=v6));
* Construct arrays into which we will read all of the records;
array t(6);
array a(6);
array b(6);
array c(6);
array v(6,6);
array cs(6);
* Read all six records;
do i=1 to 6;
set try(rename=(t=tt a=aa b=bb c=cc));
t[i] = tt;
a[i] = aa;
b[i] = bb;
c[i] = cc;
end;
* Now do the calculation, which involves values from each
row at each iteration;
do i=1 to 6;
cs[i]=0;
do j=1 to 6;
v[i,j]=(2.89*(a[j]**2*(1-c[j]))/
((c[j]+exp(1.7*a[j]*(t[i]-b[j])))*
((1+exp(-1.7*a[j]*(t[i]-b[j])))**2)));
cs[i]+v[i,j];
end;
* Then output the values for this iteration;
tt=t[i];
aa=a[i];
bb=b[i];
cc=c[i];
css=cs[i];
vv1=v[i,1];
vv2=v[i,2];
vv3=v[i,3];
vv4=v[i,4];
vv5=v[i,5];
vv6=v[i,6];
keep tt aa bb cc vv1-vv6 css;
output try4;
end;
run;
Note that I have to construct arrays of known size; that is, you have to know how many input records there are.
The first half of the DATA step constructs arrays into which the values from the input data set are read. We read all of the records, and then we do all of the calculations, since we have all of the values in memory in the matrices.
There is some fiddling with RENAMES so that you can keep the array names t, a, b, c etc but still have variables named a, b, c etc in the output data set.
So hopefully that might help you along a bit. Either that or confuse you because I've misunderstood what you're trying to do!
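If it helps to check the numbers outside SAS, here is a rough Python sketch of the same double loop. It is only an illustration under my reading of the question: the helper name term is made up, None stands in for the missing a, b, c of the first record, and missing terms are skipped when summing, which mirrors what the SAS sum statement does.

```python
import math

# The six records from the data step above.
t = [600, 600, 600, 610, 615.2869, 636.0273]
a = [None, 0.02514, 0.2766, 0.01592, 0.03027, 0.01967]
b = [None, 667.53437, 724.60233, 792.34628, 718.30377, 705.45965]
c = [None, 0.1638, 0.30162, 0.21354, 0.22097, 0.16847]

def term(ti, aj, bj, cj):
    """One v[i,j] element; missing inputs give a missing result."""
    if aj is None or bj is None or cj is None:
        return None
    z = 1.7 * aj * (ti - bj)
    return (2.89 * aj**2 * (1 - cj)) / ((cj + math.exp(z)) * (1 + math.exp(-z)) ** 2)

# v[i][j] uses t from record i and a, b, c from record j.
v = [[term(ti, a[j], b[j], c[j]) for j in range(6)] for ti in t]

# cs[i] is the row sum of v[i], ignoring missing elements.
cs = [sum(x for x in row if x is not None) for row in v]

for i, total in enumerate(cs, start=1):
    print(i, total)
```

Records 1 to 3 share t = 600, so their cs values come out identical, which is a quick sanity check on the indexing.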