Stata: Using if with value labels

I faced an issue using if with value labels.
set obs 5
gen var1 = _n
label define l_var1 1 "cat1" 2 "cat1" 3 "cat2" 4 "cat3" 5 "cat3"
label val var1 l_var1
keep if var1=="cat3":l_var1
(4 observations deleted)
I expected 3 records to be deleted. How can I achieve this?
I am using Stata 16.1.

"cat3":l_var1 does not look up all values in l_var1 that corresponds to "cat3". It returns the first value that corresponds to the string "cat3".
So "cat3":l_var1 evaluates to 4 so keep if var1=="cat3":l_var1 evaluates to keep if var1==4 and therefore only one observation is kept.
See code below that shows this behavior. This is not the way you seem to want "cat3":l_var1 to behave, but this is how it behaves.
set obs 5
gen var1 = _n
label define l_var1 1 "cat1" 2 "cat1" 3 "cat2" 5 "cat3" 4 "cat3"
label val var1 l_var1
gen var2 = "cat3":l_var1
gen var3 = 1 if var1=="cat3":l_var1
That explains what is going on in your code. The code below is a better way to accomplish what you are trying to do.
set obs 5
gen var1 = _n
label define l_var1 1 "cat1" 2 "cat1" 3 "cat2" 5 "cat3" 4 "cat3"
label val var1 l_var1
decode var1, generate(var_str)
keep if var_str == "cat3"
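If you prefer not to create an extra string variable, here is a minimal sketch of the same idea using the extended macro function : label to look up each code's label text (the keepflag variable is just illustrative):
* loop over the codes present in var1 and flag every code whose label is "cat3"
gen byte keepflag = 0
levelsof var1, local(codes)
foreach c of local codes {
    if `"`: label l_var1 `c''"' == "cat3" {
        quietly replace keepflag = 1 if var1 == `c'
    }
}
keep if keepflag
drop keepflag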

Related

SAS concatenate in SAS Data Step

I don't know how to describe this question, but here is an example. I have an initial dataset that looks like this:
input first second $3.;
cards;
1 A
1 B
1 C
1 D
2 E
2 F
3 S
3 A
4 C
5 Y
6 II
6 UU
6 OO
6 N
7 G
7 H
...
;
I want an output dataset like this:
input first second $;
cards;
1 "A,B,C,D"
2 "E,F"
3 "S,A"
4 "C"
5 "Y"
6 "II,UU,OO,N"
7 "G,H"
...
;
Both tables have two columns. The unique values of the column "first" can range from 1 to any number.
Can someone help me?
Something like below:
proc sort data=have;
  by first second;
run;
data want(rename=(b=second));
  length new_second $50.;
  do until(last.first);
    set have;
    by first second;
    new_second = catx(',', new_second, second);
    b = quote(strip(new_second));
  end;
  drop second new_second;
run;
The output is:
first second
1 "A,B,C,D"
2 "E,F"
3 "A,S"
4 "C"
5 "Y"
6 "II,N,OO,UU"
7 "G,H"
You can use by-group processing and the retain statement to achieve this.
Create a sample dataset:
data have;
input id value $3.;
cards;
1 A
1 B
1 C
1 D
2 E
2 F
3 S
3 A
4 C
5 Y
6 II
6 UU
6 OO
6 N
7 G
7 H
;
run;
First ensure that your dataset is sorted by your id variable:
proc sort data=have;
by id;
run;
Then use the first. and last. notation to identify when the id variable is changing or about to change. The retain statement tells the data step to keep the value of concatenated_value across observations rather than resetting it to a blank value. Use the catx() function to perform the actual concatenation, separating the values with a comma, and use the quote() function to wrap the result in " characters before outputting the record.
data want;
  length concatenated_value $500.;
  set have;
  by id;
  retain concatenated_value;
  if first.id then do;
    concatenated_value = '';
  end;
  concatenated_value = catx(',', concatenated_value, value);
  if last.id then do;
    concatenated_value = quote(cats(concatenated_value));
    output;
  end;
  drop value;
run;
Output:
concatenated_value    id
"A,B,C,D"             1
"E,F"                 2
"S,A"                 3
"C"                   4
"Y"                   5
"II,UU,OO,N"          6
"G,H"                 7

Comparing observations

Suppose my dataset includes the following variables:
set obs 100
generate var1 = rnormal()
generate var2 = rnormal()
input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end
input double(id var5 var6)
1 10000 0.4
2 22000 0.55
3 25000 0.5
4 40000 1
end
I need to delete rows of ids that have an increased value of var5 and reduced value of var6 compared with at least one other id. In the first example, number 4 with 2028 and 17.396 should be deleted. In the second example, number 3 with 25000 and 0.5 should be deleted. After the elimination, the observations of the three variables should look like this:
1 1052 17.348
2 1288 17.378
3 1536 17.387
5 1810 17.402
6 2034 17.407
1 10000 0.4
2 22000 0.55
4 40000 1
while var1 and var2 should remain intact.
How can I do this?
This is very odd because you appear to say that you have a dataset with completely unrelated variables. You have an initial dataset of 100 observations with variables var1 and var2 and then a secondary dataset with 6 observations with variables var5 and var6. Your objective appears to be to remove observations, but only for values contained in variables var5 and var6. This looks like spreadsheet thinking as Stata only has a single dataset in memory at any given time.
The task of identifying observations to drop requires comparing each observation that has values for var5 and var6 with all other observations that have values for those variables. This can be done in Stata by forming all pairwise combinations using the cross command.
Here's a solution that starts with data organized exactly as you presented it and separates the two datasets in order to perform the task of dropping the observations based on var5 and var6 values. Since the datasets appear completely unrelated, an unmatched merge is used to recombine the data.
clear
set obs 100
generate var1 = rnormal()
generate var2 = rnormal()
input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end
tempfile main
save "`main'"
* extract secondary dataset
keep id var5 var6
keep if !mi(id)
tempfile data2
save "`data2'"
* form all pairwise combinations
rename * =_0
cross using "`data2'"
* identify cases where there's an increase in var5 and decrease in var6
gen todrop = var5_0 > var5 & var6_0 < var6
* drop id if there's at least one case, reduce to original obs and vars
bysort id_0 (todrop): keep if !todrop[_N]
keep if id == id_0
keep id var5 var6
list
* now merge back with original data, use unmatched merge since
* secondary data is unrelated
sort id
tempfile newdata2
save "`newdata2'"
use "`main'", clear
drop id var5 var6
merge 1:1 _n using "`newdata2'", nogen
Here's one way to do this without separating the datasets. The task of identifying the observations to drop requires a double loop to make all pairwise comparisons. There is, however, no command in Stata to drop observations for just a few variables. In the following example, I switch to Mata to load the observations to preserve, clear out the values, and store the kept observations back into the Stata variables:
clear
set obs 100
generate var1 = rnormal()
generate var2 = rnormal()
input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end
* an observation index
gen obsid = _n if !mi(id)
* identify observations to drop
gen todrop = 0 if !mi(id)
sum obsid, meanonly
local n = r(N)
quietly forvalues i = 1/`n' {
forvalues j = 1/`n' {
replace id = . if var5[`i'] > var5[`j'] & var6[`i'] < var6[`j'] & _n == `i'
}
}
* take a trip to Mata to load the data to keep and store it back from there
mata:
// load data, ignore observations with missing values
X = st_data(., ("id","var5","var6"), 0)
// set all obs to missing
st_store(., ("id","var5","var6") ,J(st_nobs(),3,.))
// store non-missing values back into the variables
st_store((1,rows(X)), ("id","var5","var6") ,X)
end
drop obsid todrop
Alternatively, you can manually move up values by doing some observation index gymnastics:
clear
set obs 100
generate var1 = rnormal()
generate var2 = rnormal()
input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end
* an observation index
gen obsid = _n if !mi(id)
* identify observations to drop
gen todrop = 0 if !mi(id)
sum obsid, meanonly
local n = r(N)
quietly forvalues i = 1/`n' {
forvalues j = 1/`n' {
replace id = . if var5[`i'] > var5[`j'] & var6[`i'] < var6[`j'] & _n == `i'
}
}
* move observations up
local j 0
quietly forvalues i = 1/`n' {
if !mi(id[`i']) {
local ++j
replace id = id[`i'] in `j'
replace var5 = var5[`i'] in `j'
replace var6 = var6[`i'] in `j'
}
}
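* blank out the remaining tail; l refers to the last observation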
local ++j
replace id = . in `j'/l
replace var5 = . in `j'/l
replace var6 = . in `j'/l
drop obsid todrop

Stata: identify consecutive rows with numbers that can cancel out

I have a dataset in long form that lists observations by month. I want to identify if consecutive rows for a variable can cancel out (in other words, have the same absolute value). And if so, I want to change both observations to zero. In addition, I want to have an additional dummy variable that tells me if I've changed anything for that row. How can I structure the code?
For example,
Date Var1 Var2
Jan2010 5 6
Feb2010 6 0
Mar2010 -6 1
In the above example, I want to make the dataset into below
Date Var1 Var2 Dummy
Jan2010 5 6 0
Feb2010 0 0 1
Mar2010 0 0 1
This (seemingly) meets the criteria described, but other considerations may come into play if there are other factors not explicitly mentioned (e.g., do you need to consider whether Var2 also "cancels out"? What if Var1 in Apr2010 were 6? etc.).
clear
input str7 Date Var1 Var2
"Jan2010" 5 6
"Feb2010" 6 0
"Mar2010" -6 1
end
gen Dummy = Var1 == Var1[_n+1] * -1 | Var1 == Var1[_n-1] * -1
replace Var1 = 0 if Dummy
replace Var2 = 0 if Dummy
li , noobs
yielding
+-------------------------------+
| Date Var1 Var2 Dummy |
|-------------------------------|
| Jan2010 5 6 0 |
| Feb2010 0 0 1 |
| Mar2010 0 0 1 |
+-------------------------------+
Or perhaps more correctly, Dummy should be generated with respect to actual months and not observations:
gen Month = monthly(Date, "MY")
format Month %tm
tsset Month , monthly
gen Dummy = Var1 == Var1[_n+1] * -1 | Var1 == Var1[_n-1] * -1
Edit: As Roberto rightly points out, an earlier version of this code (using abs()) was written around the example posted; multiplying by -1 is more robust and yields the same result for the sample data. His suggestion to preserve the original variables before overwriting them is also a generally good idea.
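For example, a minimal sketch of keeping copies of the originals before zeroing them out (the *_orig names are just illustrative):
* keep untouched copies before overwriting (clonevar preserves labels and formats)
clonevar Var1_orig = Var1
clonevar Var2_orig = Var2
replace Var1 = 0 if Dummy
replace Var2 = 0 if Dummy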

How to take x amount of observations and find their mean in SAS?

Say I have a dataset with 3 variables. It looks like:
Var1 Var2 Var3
1 1 4
1 2 5
1 3 1
2 1 6
2 2 2
2 3 8
3 1 2
3 2 7
3 3 9
How can I find the mean of Var3 for each "group" it is in? (4, 5, 1 from Var3 share the value 1 in Var1; 6, 2, 8 share 2; etc.) Would using a where expression work, and would I be able to loop it over the values in Var1?
I think you can just use the CLASS statement in PROC MEANS or similar. E.g.:
PROC MEANS DATA=DAT MEAN;
CLASS Var1;
VAR Var3;
run;
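If the group means are wanted in a dataset rather than just printed output, an OUTPUT statement can be added; a sketch under that assumption (the dataset and variable names var3means/meanVar3 are illustrative):
proc means data=dat noprint nway;   /* nway keeps only the by-class rows */
    class Var1;
    var Var3;
    output out=var3means mean=meanVar3;
run;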

Easily splitting out multiple saved mean values into separate macro variables in SAS

I have a data set with a ton of variables. For example:
ID v1 v2 v3 v4 v5 v6 v7 v8
1 4 1 2 2 2 2 1 2
2 2 3 1 4 3 4 4 2
3 3 5 1 3 4 3 4 3
4 3 1 2 3 2 2 4 2
5 5 1 5 5 3 5 1 5
...
I want to take the average of each variable, store it, and then be able to use it for other data sets.
What I have tried so far is for each variable, over and over:
proc means data=data;
var v1;
output out=v1out mean=meanv1;
run;
proc means data=data;
var v2;
output out=v2out mean=meanv2;
run;
...
then, for each (again):
data v1temp;
set v1out;
call symput("meanv1",meanv1);
run;
data v2temp;
set v2out;
call symput("meanv2",meanv2);
run;
...
But this is very tedious with a lot of variables. Is there an easier way?
I want to take the average of each variable, store it, and then be
able to use it for other data sets.
There doesn't seem to be an advantage to using global macro variables for this. Another option is to calculate the means as @user102890 suggests above:
proc means data = myData noprint;
var v1-v8;
output out = myDataMeans(drop = _type_ _freq_
where = (_stat_='MEAN')
rename = (v1-v8 = meanV1-meanV8));
run;
And then just set that one observation into your data set:
DATA myData;
set myData;
if _N_ = 1 then set myDataMeans;
...;
RUN;
Then you have variables meanV1-meanV8 available as actual data set values on every observation of myData. You could do the same thing for any other data set for which you want to use the means of those variables.
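For instance, a sketch of using those merged means, e.g. to center the original variables (the output dataset and centered variable names are just illustrative):
data myDataCentered;
    set myData;
    if _N_ = 1 then set myDataMeans;   /* brings meanV1-meanV8 onto every row */
    v1_centered = v1 - meanV1;         /* example use: mean-centering v1 */
run;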
Behold the power of PROC SQL;)
data myData;
input id v1-v8;
datalines;
1 4 1 2 2 2 2 1 2
2 2 3 1 4 3 4 4 2
3 3 5 1 3 4 3 4 3
4 3 1 2 3 2 2 4 2
5 5 1 5 5 3 5 1 5
;
run;
proc transpose data= myData out= myXData;
by id;
var v1-v8;
run;
proc sql noprint;
select mean( col1 )
into :mean1 - :mean8
from myXData
group by _name_
;
quit;
%put &mean1 &mean2 &mean3 &mean4 &mean5 &mean6 &mean7 &mean8;
Log output:
171
172 %put &mean1 &mean2 &mean3 &mean4 &mean5 &mean6 &mean7 &mean8;
3.4 2.2 2.2 3.4 2.8 3.2 2.8 2.8
I still concur that macro variables are not the best way of storing sequential data.
data myData;
input id v1-v8;
datalines;
1 4 1 2 2 2 2 1 2
2 2 3 1 4 3 4 4 2
3 3 5 1 3 4 3 4 3
4 3 1 2 3 2 2 4 2
5 5 1 5 5 3 5 1 5
;
run;
proc means data = myData noprint;
var v1-v8;
output out = myDataMeans(drop = _type_ _freq_
where = (_stat_='MEAN')
rename = (v1-v8 = meanV1-meanV8));
run;
The output dataset, myDataMeans, looks like the following:
_STAT_ meanV1 meanV2 meanV3 meanV4 meanV5 meanV6 meanV7 meanV8
MEAN 3.4 2.2 2.2 3.4 2.8 3.2 2.8 2.8
The following will read the myDataMeans dataset and put each of its columns into its own macro variable.
%let dsid=%sysfunc(open(myDataMeans,i));/*open the dataset which has macro vars to read in cols*/
%syscall set(dsid); /*no leading ampersand with %SYSCALL */
%let rc=%sysfunc(fetchobs(&dsid,1));/*just reading 1 obs*/
%let rc=%sysfunc(close(&dsid));/*close dataset after reading*/
%put _user_;
The following global macro variables are created as shown in the log:
GLOBAL _STAT_ MEAN
GLOBAL MEANV1 3.4
GLOBAL MEANV2 2.2
GLOBAL MEANV3 2.2
GLOBAL MEANV4 3.4
GLOBAL MEANV5 2.8
GLOBAL MEANV6 3.2
GLOBAL MEANV7 2.8
GLOBAL MEANV8 2.8
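Those macro variables can then be referenced against other data sets, for example (otherData and v1_dev are hypothetical names):
data scored;
    set otherData;            /* hypothetical other data set */
    v1_dev = v1 - &meanV1;    /* deviation from the stored mean of v1 */
run;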