add value of another variable to variable in loop - stata

I'd like to do something like this:
gen var1 = 0
gen var2 = 0
forval x = 1/5 {
replace var1 = `x'
replace var2 = var2 + var1
}
Namely I want to replace var2 by its old value plus var1. In a programming language like Python this works but in Stata it doesn't.
My goal is not to create a lot of variables! That's why I want to update the variable var2 in every cycle of the loop. If my loop ran from 1 to 100, I wouldn't want to create 100 variables just to get a working solution.
In my example, in the first cycle of the loop, var1 becomes 1 and var2 also becomes 1. In the second cycle var1 should be 2 and var2 should become 3, since it adds the old value of var2 (which is 1) to the new value of var1 (which is 2). In the third cycle var1 should become 3 and var2 should become 3 + 3 = 6, which is the old value of var2 plus the value of var1 in this cycle. And so on. That's what I want to have!
Could someone please help me?

No need for a loop:
clear all
set obs 100
gen id = _n
tsset id
gen var1 = _n - 1
gen var2 = 0
replace var2 = l.var2 + l.var1 if _n > 1
If you just want to know the "end-result", i.e. the values for var1 and var2 at the end of the loop, then you can use Mata:
mata
a = 0
b = 0
for (i = 1 ; i <= 100; i++) {
a = i
b = b + a
}
a
b
end
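For comparison, here is a minimal Python sketch of the behaviour the question describes: a single variable that is overwritten on every pass, with no extra variables created. It assumes the loop runs from 1 to 100, as in the Mata example, and it only illustrates the intended arithmetic; it is not a Stata solution.

```python
# Running total kept in one variable and updated on every pass of the loop.
var1 = 0
var2 = 0
for x in range(1, 101):   # 1 through 100, mirroring forval x = 1/100
    var1 = x              # var1 takes the current loop value
    var2 = var2 + var1    # old value of var2 plus the new value of var1

print(var1, var2)
```

After the third pass var2 is 1 + 2 + 3 = 6, which matches the values spelled out in the question.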

Related

Stata: Using if with value labels

I faced an issue using if with value labels.
set obs 5
gen var1 = _n
label define l_var1 1 "cat1" 2 "cat1" 3 "cat2" 4 "cat3" 5 "cat3"
label val var1 l_var1
keep if var1=="cat3":l_var1
(4 observations deleted)
I expected 3 records to be deleted. How can I achieve this?
I am using Stata 16.1.
"cat3":l_var1 does not look up all values in l_var1 that correspond to "cat3". It returns the first value that corresponds to the string "cat3".
Here "cat3":l_var1 evaluates to 4, so keep if var1=="cat3":l_var1 becomes keep if var1==4, and therefore only one observation is kept.
See code below that shows this behavior. This is not the way you seem to want "cat3":l_var1 to behave, but this is how it behaves.
set obs 5
gen var1 = _n
label define l_var1 1 "cat1" 2 "cat1" 3 "cat2" 5 "cat3" 4 "cat3"
label val var1 l_var1
gen var2 = "cat3":l_var1
gen var3 = 1 if var1=="cat3":l_var1
This answers what is going on in your code. The code below is a better way to solve what you are trying to do.
set obs 5
gen var1 = _n
label define l_var1 1 "cat1" 2 "cat1" 3 "cat2" 5 "cat3" 4 "cat3"
label val var1 l_var1
decode var1, generate(var_str)
keep if var_str == "cat3"
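To make the difference concrete, here is a small Python sketch of the two strategies. The dictionary is only a stand-in for the value label l_var1 (an assumption for illustration, not how Stata stores labels): the first lookup returns a single value, while comparing decoded labels matches every value that carries the text.

```python
# Stand-in for the value label: numeric value -> label text.
labels = {1: "cat1", 2: "cat1", 3: "cat2", 4: "cat3", 5: "cat3"}
var1 = [1, 2, 3, 4, 5]

# Strategy 1: resolve the text to ONE value (the first match), then compare.
first_match = next(v for v, text in labels.items() if text == "cat3")
kept_by_value = [v for v in var1 if v == first_match]

# Strategy 2: decode every observation to its label text, then compare text.
kept_by_label = [v for v in var1 if labels[v] == "cat3"]

print(first_match, kept_by_value, kept_by_label)
```

The second strategy corresponds to the decode approach and keeps both observations labelled "cat3".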

Wide to Long Dataset in SAS

I have a dataset that has multiple measures taken as multiple time points.
The data look like this:
UserID Var1_2008 Var1_2009 Var1_2010 Var2_2008 Var2_2009 Var2_2010 Race
1 Y N Y 20 30 20 1
2 N N N 15 30 35 0
I want the data to look like this:
Year UserID Var1 Var2 Race
2008 1 Y 20 1
2009 1 N 30 1
....
How can I do this? I'm totally lost
You could use an array, assuming you have the same years for all of the var1_ and var2_ variables.
data want ;
set have ;
/* Need two arrays, as one is character, the other numeric */
array v1{*} var1_: ; /* wildcard all 'var1_'-prefixed variables */
array v2{*} var2_: ; /* same for var2_ */
/* loop along v1 array */
do i = 1 to dim(v1) ;
/* use vname function to get variable name associated to this array element */
year = input(scan(vname(v1{i}),-1,'_'),8.) ;
var1 = v1{i} ;
var2 = v2{i} ;
output ;
end ;
/* drop the loop index and the original wide columns */
drop i var1_: var2_: ;
run ;
There's a macro for that! I think running the following will do exactly what you want to accomplish:
filename ut url 'https://raw.githubusercontent.com/FriedEgg/Papers/master/An_Easier_and_Faster_Way_to_Untranspose_a_Wide_File/src/untranspose.sas';
%include ut ;
%untranspose(data=have, out=want, by=UserID, id=year, delimiter=_,
var=Var1 Var2, copy=Race)
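The reshaping logic of the array approach can also be sketched outside SAS. The Python below is only an illustration under the same assumption (every Var1_ year has a matching Var2_ year); the function name wide_to_long and the list-of-dicts layout are made up for the example.

```python
# Two wide records copied from the question.
rows = [
    {"UserID": 1, "Var1_2008": "Y", "Var1_2009": "N", "Var1_2010": "Y",
     "Var2_2008": 20, "Var2_2009": 30, "Var2_2010": 20, "Race": 1},
    {"UserID": 2, "Var1_2008": "N", "Var1_2009": "N", "Var1_2010": "N",
     "Var2_2008": 15, "Var2_2009": 30, "Var2_2010": 35, "Race": 0},
]

def wide_to_long(row):
    """Yield one long record per year found in the Var1_ column names."""
    # The year is the part after the last underscore, as with scan(..., -1, '_').
    years = sorted(int(name.rsplit("_", 1)[1])
                   for name in row if name.startswith("Var1_"))
    for year in years:
        yield {"Year": year, "UserID": row["UserID"],
               "Var1": row[f"Var1_{year}"], "Var2": row[f"Var2_{year}"],
               "Race": row["Race"]}

long_rows = [rec for row in rows for rec in wide_to_long(row)]
for rec in long_rows:
    print(rec)
```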

Comparing observations

Suppose my dataset includes the following variables:
set obs 100
generate var1 = rnormal()
generate var2 = rnormal()
input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end
input double(id var5 var6)
1 10000 0.4
2 22000 0.55
3 25000 0.5
4 40000 1
end
I need to delete rows of ids that have an increased value of var5 and reduced value of var6 compared with at least one other id. In the first example, number 4 with 2028 and 17.396 should be deleted. In the second example, number 3 with 25000 and 0.5 should be deleted. After the elimination, the observations of the three variables should look like this:
1 1052 17.348
2 1288 17.378
3 1536 17.387
5 1810 17.402
6 2034 17.407
1 10000 0.4
2 22000 0.55
4 40000 1
while var1 and var2 should remain intact.
How can I do this?
This is very odd because you appear to say that you have a dataset with completely unrelated variables. You have an initial dataset of 100 observations with variables var1 and var2 and then a secondary dataset with 6 observations with variables var5 and var6. Your objective appears to be to remove observations, but only for values contained in variables var5 and var6. This looks like spreadsheet thinking as Stata only has a single dataset in memory at any given time.
The task of identifying observations to drop requires that you compare each observation that has values for var5 and var6 with every other observation that has values for those variables. This can be done in Stata by forming all pairwise combinations using the cross command.
Here's a solution that starts with data organized exactly as you presented it and separates the two datasets in order to perform the task of dropping the observations based on var5 and var6 values. Since the datasets appear completely unrelated, an unmatched merge is used to recombine the data.
clear
set obs 100
generate var1 = rnormal()
generate var2 = rnormal()
input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end
tempfile main
save "`main'"
* extract secondary dataset
keep id var5 var6
keep if !mi(id)
tempfile data2
save "`data2'"
* form all pairwise combinations
rename * =_0
cross using "`data2'"
* identify cases where there's an increase in var5 and decrease in var6
gen todrop = var5_0 > var5 & var6_0 < var6
* drop id if there's at least one case, reduce to original obs and vars
bysort id_0 (todrop): keep if !todrop[_N]
keep if id == id_0
keep id var5 var6
list
* now merge back with original data, use unmatched merge since
* secondary data is unrelated
sort id
tempfile newdata2
save "`newdata2'"
use "`main'", clear
drop id var5 var6
merge 1:1 _n using "`newdata2'", nogen
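The pairwise rule itself is easy to check outside Stata. The following Python sketch applies the same test (drop a row if at least one other row has a smaller var5 and a larger var6) to both example datasets from the question; the function name drop_dominated is made up for the illustration.

```python
def drop_dominated(rows):
    """rows: list of (id, var5, var6) tuples.

    A row is dropped when at least one other row has a smaller var5 and a
    larger var6, the same condition as the todrop indicator above.
    """
    kept = []
    for rid, v5, v6 in rows:
        # Comparing a row with itself is harmless: v5 > v5 is never true.
        dominated = any(v5 > o5 and v6 < o6 for _, o5, o6 in rows)
        if not dominated:
            kept.append((rid, v5, v6))
    return kept

first = [(1, 1052, 17.348), (2, 1288, 17.378), (3, 1536, 17.387),
         (4, 2028, 17.396), (5, 1810, 17.402), (6, 2034, 17.407)]
second = [(1, 10000, 0.4), (2, 22000, 0.55), (3, 25000, 0.5), (4, 40000, 1)]

print([r[0] for r in drop_dominated(first)])
print([r[0] for r in drop_dominated(second)])
```

Like cross, this makes every pairwise comparison, so the work grows with the square of the number of rows.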
Here's one way to do this without separating the datasets. The task of identifying the observations to drop requires a double loop to make all pairwise comparisons. There is, however, no command in Stata to drop observations for just a few variables. In the following example, I switch to Mata to load the observations to preserve, then clear out the values and store the preserved observations back into the Stata variables:
clear
set obs 100
generate var1 = rnormal()
generate var2 = rnormal()
input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end
* an observation index
gen obsid = _n if !mi(id)
* identify observations to drop
gen todrop = 0 if !mi(id)
sum obsid, meanonly
local n = r(N)
quietly forvalues i = 1/`n' {
forvalues j = 1/`n' {
replace id = . if var5[`i'] > var5[`j'] & var6[`i'] < var6[`j'] & _n == `i'
}
}
* take a trip to Mata to load the data to keep and store it back from there
mata:
// load data, ignore observations with missing values
X = st_data(., ("id","var5","var6"), 0)
// set all obs to missing
st_store(., ("id","var5","var6") ,J(st_nobs(),3,.))
// store non-missing values back into the variables
st_store((1,rows(X)), ("id","var5","var6") ,X)
end
drop obsid todrop
Alternatively, you can manually move up values by doing some observation index gymnastics:
clear
set obs 100
generate var1 = rnormal()
generate var2 = rnormal()
input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end
* an observation index
gen obsid = _n if !mi(id)
* identify observations to drop
gen todrop = 0 if !mi(id)
sum obsid, meanonly
local n = r(N)
quietly forvalues i = 1/`n' {
forvalues j = 1/`n' {
replace id = . if var5[`i'] > var5[`j'] & var6[`i'] < var6[`j'] & _n == `i'
}
}
* move observations up
local j 0
quietly forvalues i = 1/`n' {
if !mi(id[`i']) {
local ++j
replace id = id[`i'] in `j'
replace var5 = var5[`i'] in `j'
replace var6 = var6[`i'] in `j'
}
}
local ++j
replace id = . in `j'/l
replace var5 = . in `j'/l
replace var6 = . in `j'/l
drop obsid todrop
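The "move observations up" step amounts to compacting the surviving rows to the top and padding the rest with missing values. A hedged Python sketch of just that step, using None for missing and a made-up function name compact:

```python
def compact(rows):
    """Move rows with a non-missing id to the top, keeping their order,
    and pad the remainder with all-missing rows so the length is unchanged."""
    kept = [r for r in rows if r[0] is not None]
    padding = [(None, None, None)] * (len(rows) - len(kept))
    return kept + padding

# id 4 has already been blanked by the pairwise comparison; the last two
# rows stand for observations that never had id/var5/var6 values.
rows = [(1, 1052, 17.348), (2, 1288, 17.378), (3, 1536, 17.387),
        (None, 2028, 17.396), (5, 1810, 17.402), (6, 2034, 17.407),
        (None, None, None), (None, None, None)]

for r in compact(rows):
    print(r)
```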

Checking for proper ordering of numeric, time, etc

My data looks something like this:
data tmp ;
input id var1 - var5 ;
datalines ;
1 1 2 3 4 5
2 1 2 . . .
3 1 . . . 4
4 . 3 . . .
5 . . . . 5
6 1 3 2 2 3
7 5 3 7 8 9
8 1 . . . 2
9 1 . 2 3 4
;
run ;
I'm trying to determine if n variables are properly 'ordered.' By ordered, I mean numerically or sequential in time (or even alphabetic). So in this example, my desired output would be:
dummy = 1 1 1 1 1 0 0 1 1 since the ones where dummy = 1 are in correct order.
It would be trivial if I had complete data:
if var1 <= var2 <= ... <= varn then dummy = 1; else dummy = 0;
I do not have complete data, unfortunately. So the problem may be that SAS treats . as a very small number(?) and also that I cannot perform operations on ., since this also failed:
if 0 * (var1 = .) + var1 <=
var1 * (var2 = .) + var2 <=
var2 * (var3 = .) + var3 <= ... <=
var_n-1 * (varn = .) + varn
then dummy = 1;
else dummy = 0;
Basically this would check to see if a variable is . and if it is, then use the previous value in the inequality, but if it is not missing, proceed as normal. This works sometimes, but still requires most of the info to be nonmissing.
I have also tried something like:
if var2 = max(var1, var2) & var1 <= var2 &
var3 = max(var1 -- var3) & var2 <= var3 & ...
but this approach also needs complete data. And I have tried transposing the data into a long format so that I can just delete the missing columns (and only keep variables I am interested in knowing the order of) but a transposed data set of thousands of variables isn't useful to me (if you would convert back to wide, there would still be missing columns).
Clearly, I am not the best SASer, but I would ideally like to write a macro or something since this issue comes up for me a lot (basically just a data check to see if dates are in order and occur when they should be regarding their relative timeline).
Here is all the code:
data tmp ;
input id var1 - var5 ;
datalines ;
1 1 2 3 4 5
2 1 2 . . .
3 1 . . . 4
4 . 3 . . .
5 . . . . 5
6 1 3 2 2 3
7 5 3 7 8 9
8 1 . . . 2
9 1 . 2 3 4
;
run ;
data tmp1 ;
set tmp ;
if var1 <= var2 <= var3 <= var4 <= var5 then dummy1 = 1 ; else dummy1 = 0 ;
if 0 * (var1 = .) + var1 <=
var1 * (var2 = .) + var2 <=
var2 * (var3 = .) + var3 <=
var3 * (var4 = .) + var4 <=
var4 * (var5 = .) + var5
then dummy2 = 1 ;
else dummy2 = 0 ;
if var2 = max(var1,var2) & var1 ~= var2 &
var3 = max(var1, var2, var3) & var2 ~= var3 &
var4 = max(var1, var2, var3, var4) & var3 ~= var4 &
var5 = max(var1, var2, var3, var4, var5) & var4 ~= var5
then dummy3 = 1 ;
else dummy3 = 0 ;
* none of dummy1 - 3 pick up the observations that are in proper order ;
run ;
data tmp1_varsIwant ;
set tmp1 ;
keep id var1 -- var5 ;
run ;
proc transpose data = tmp1_varsIwant out = tmp1_long ;
by id ;
run ;
data tmp1_long ;
set tmp1_long ;
if col1 = . then delete ;
if _name_ in('var6', 'var999') then delete ;
run ;
proc sort data = tmp1_long ;
by id col1 ;
run ;
Maybe you could force all the logic into one conditional, but it's probably simpler to use a loop like this:
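data tmp1 ;
set tmp ;
array vars (*) var1-var5;
/* last_highest holds the most recent non-missing value seen so far */
last_highest = .;
dummy = 1;
do i = 1 to dim(vars);
/* a non-missing value below the previous non-missing one breaks the order */
if vars(i) > . and vars(i) < last_highest then do;
dummy = 0;
leave;
end;
/* carry the previous value forward across missing entries */
last_highest = coalesce(vars(i),last_highest);
end;
drop i last_highest;
run ;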
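The same idea (skip missing values and compare each value only with the last non-missing one) can be checked against the sample data with a short Python sketch. None stands in for the SAS missing value, and the function name is_ordered is made up for the illustration.

```python
def is_ordered(values):
    """Return 1 if the non-missing values never decrease, else 0."""
    last = None
    for v in values:
        if v is None:          # missing: ignore and keep the previous value
            continue
        if last is not None and v < last:
            return 0           # a later value is smaller than an earlier one
        last = v
    return 1

data = [
    [1, 2, 3, 4, 5],
    [1, 2, None, None, None],
    [1, None, None, None, 4],
    [None, 3, None, None, None],
    [None, None, None, None, 5],
    [1, 3, 2, 2, 3],
    [5, 3, 7, 8, 9],
    [1, None, None, None, 2],
    [1, None, 2, 3, 4],
]

dummy = [is_ordered(row) for row in data]
print(dummy)
```

Ties count as ordered here, matching the <= comparison in the question.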

Read wide file with repeated variables in SAS

I have input data shaped like this:
var1 var2 var3 var2 var3 ...
where each row has one value of var1 followed by a varying number of var2-var3 pairs. After reading this input, I want the data set to have multiple records for each var1: one record for each pair of var2/var3.
So if the first two lines of the input file are
A 1 2 7 3 4 5
B 2 3
this would generate 4 records:
A 1 2
A 7 3
A 4 5
B 2 3
Is there a simple/elegant way to do this? I've tried reading each row as one long variable and splitting with scan, but it's getting messy and I'm betting there's a really easy way to do this.
I'm sure there are many ways to do this, but here is the first that comes to my mind:
data want(keep=var1 var2 var3);
infile 'path-to-your-file';
input;
var1 = input(scan(_infile_,1),$8.);
i = 1;
do while(i ne 0);
i + 1;
var2 = input(scan(_infile_,i),8.);
i + 1;
var3 = input(scan(_infile_,i),8.);
if var3 = . then i = 0;
else output;
end;
run;
_infile_ is an automatic SAS variable that contains the currently read record. Use an appropriate informat for each variable you read.
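Like this (conditional input that holds the line with a trailing @ and jumps back to re-read the lookahead value; the +(-2) step assumes single-character values, as in the sample data):
data test;
infile datalines missover;
/* read the first pair plus one lookahead value; the trailing @ holds the record */
input var1 $ var2 $ var3 $ temp $ @;
output;
do while(not missing(temp));
/* step back over the lookahead value and read it again as var2 */
input +(-2) var2 $ var3 $ temp $ @;
output;
end;
drop temp;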
datalines;
A 1 2 7 3 4 5
B 2 3
;
run;
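For reference, the splitting logic that both answers implement (one leading key, then values consumed two at a time) looks like this as a Python sketch. It assumes whitespace-delimited input with numeric var2/var3, and the function name split_pairs is made up for the example; an unpaired trailing value is ignored, much as the first answer stops when var3 is missing.

```python
def split_pairs(line):
    """Turn 'A 1 2 7 3 4 5' into [('A', 1, 2), ('A', 7, 3), ('A', 4, 5)]."""
    tokens = line.split()
    if not tokens:
        return []
    var1, rest = tokens[0], tokens[1:]
    # Walk the remaining tokens two at a time; a dangling last token is skipped.
    return [(var1, int(rest[i]), int(rest[i + 1]))
            for i in range(0, len(rest) - 1, 2)]

lines = ["A 1 2 7 3 4 5", "B 2 3"]
records = [rec for line in lines for rec in split_pairs(line)]
for rec in records:
    print(rec)
```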