Local macro on subsample data using if statement in Stata

I want to use the local command in Stata to store several variables that I afterwards want to export as two subsamples. I separate the dataset by the grouping variable grouping_var, which is either 0 or 1. I tried:
if grouping_var==0 local vars_0 var1 var2 var3 var4
preserve
keep `vars_0'
saveold "data1", replace
restore
if grouping_var==1 local vars_1 var1 var2 var3 var4
preserve
keep `vars_1'
saveold "data2", replace
restore
However, the output is not as I expected and the data is not divided into two subsamples. The first list includes the whole dataset. Is there anything wrong in how I use the if statement here?

There is a bit of confusion between the "if qualifier" and the "if command" here. The syntax if (condition) (command) is the "if command", and generally does not provide the desired behavior when written using observation-level logical conditions.
In short, Stata evaluates if (condition) only once, using the first observation. A local macro just stores text and has no link to particular observations, so defining it conditionally cannot split the data. That is why your entire data set is being kept/saved in the first block (i.e., in your current sort order, grouping_var[1] == 0). See http://www.stata.com/support/faqs/programming/if-command-versus-if-qualifier/ for more information.
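To see the difference directly, here is a minimal sketch. It assumes Stata's bundled auto dataset as a stand-in for your data, with foreign playing the role of grouping_var; neither is part of the original question.

```stata
* Assumption: the auto dataset ships with Stata; foreign stands in for grouping_var
sysuse auto, clear

* if command: the condition is evaluated once, as if it read foreign[1] == 0
if foreign == 0 {
    display "condition is true for observation 1, so this block runs once"
}

* if qualifier: the condition is evaluated separately for every observation
count if foreign == 0
```

The first form decides whether a block of commands runs at all; only the second form restricts a command to a subset of observations.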
Assuming you want to keep different variables in each case, something like the code below should work:
local vars_0 var1 var2 var3 var4
local vars_1 var5 var6 var7 var8
forvalues g = 0/1 {
    preserve
    keep if grouping_var == `g'
    keep `vars_`g''
    save data`g', replace
    restore
}
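If instead you want the same variables in both files, as in the question, a variant along these lines might do. It is a sketch that assumes the file names data1 and data2 from the question; the macro names vars and k are just illustrative.

```stata
* Assumption: both subsamples keep the same variables and are saved
* under the names used in the question (data1 for group 0, data2 for group 1)
local vars var1 var2 var3 var4
forvalues g = 0/1 {
    local k = `g' + 1
    preserve
    keep if grouping_var == `g'
    keep `vars'
    saveold "data`k'", replace
    restore
}
```

The subsetting is done by keep if, which uses the if qualifier; the local macro only supplies the variable list.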

Related

SAS: How to vertically join multiple datasets where a variable is numeric in one dataset and character in the other

This is complicated a little because we're using a pipeline with a filelist to compile the data, so there are 50+ datasets coming in. I need to combine many, many datasets vertically, but var2 is numeric in some and character in others. Var1 is not important, so we can drop it, but when I try to drop it in the data step, it is throwing an error because of the differing data types. More details below.
Here's what I want to do at its most basic...
data in1;
input var1 $ var2;
datalines;
a 1
b 2
;
data in2;
input var1 $ var2 $;
datalines;
a 1a
b 2b
;
data newdd;
set in1 in2;
run;
Is it possible to combine these datasets in the "data newdd" step without changing the inputs? Is there a way to drop var2 in this data step in a way that will still let it merge var1 and not throw an error? Or better yet, can I make var2 read in as character in all cases?
To drop var2 use a data set option.
data newdd;
set in1(drop=var2) in2(drop=var2);
run;
To combine and ensure it's the same:
data newdd;
set in1(in=t1 rename=(var2=_var2)) in2(in=t2);
if t1 then var2 = put(_var2, 8. -l);
drop _var2;
run;
Long answer: you need to fix how you read in the files, all 50 from source, so they're consistent. You can have SAS generate the correct types/INPUT statement if you have a master list of variables with type/length/format/informat (type at minimum).

Stata Using Putexcel with bysort() command

I have a large set of categorical data that I need moved to Excel. I also need to sort the data by specific criteria - whether a state has adopted a certain policy - which I'll call var1. For each var2, var3, ... , var n, I want to write something similar to
bysort var1: tab var2 [aw = weight]
Because the data are categorical, I'm not interested in mean, sd, etc., only the number and proportion of responses for each category. But when I write
putexcel A1 = bysort var1: tab var2 [aw = weight]
the console tells me "weights not allowed." If I add parentheses and write
putexcel A1 = (bysort var1: tab var2 [aw = weight])
the console says "bysort not found."
Any idea what's going on here?

Stata: Replace values of one row based on another if data are missing

In Stata, I am trying to change the values--both string and numeric--of one row based on the one just above or just below it only if the values are missing. Here are some sample data:
input str40 id var1 var2 var3 var4 str40 var5_string str40 var6_string
"correctly-spelled" 10 20 . . "random text 1" ""
"misspelled" . . 30 40 "" "random text 2"
end
Essentially, I want my final dataset to look as follows:
input
id var1 var2 var3 var4 var5_string var6_string
"correctly-spelled" 10 20 30 40 "random text 1" "random text 2"
end
I need a row-specific solution (i.e. avoiding collapse), because my (wide) dataset has thousands of labeled variables, and I don't want to lose the labels due to collapse. Also, not all of the variables are numeric, and the naming conventions of the variables are not consistent. Accordingly, fixing the spelling of id with a simple replace, executing a collapse (firstnm) id var5_string var6_string (mean) var1 var2 var3 var4, by(id), or using var* for anything won't help. Basically, what happened was one person merged using the "correctly-spelled" id, the other person merged using the "misspelled" id, and I don't have any of the source files. Thanks!
If you can assume that the misspelled ID comes right after (or right before) the correctly spelled one, you can use _n±1 to get the previous or following value. For more information on system variables, see help _variables.
If you assume the correct one always comes first, then the second replace would be sufficient.
mi() is the abbreviated form of the missing() function.
The second condition, & !mi(`var'[_n±1]), only skips cases where the neighboring value is itself missing, so a missing value is never overwritten with another missing value. Depending on your data, this further condition might not be necessary. Note that neither condition checks id, so two valid (but different) IDs that come up sequentially would also fill each other's gaps.
local list_of_vars var1 var2 var3 var4 var5_string var6_string
foreach var of local list_of_vars {
    replace `var' = `var'[_n-1] if mi(`var') & !mi(`var'[_n-1])
    replace `var' = `var'[_n+1] if mi(`var') & !mi(`var'[_n+1])
}
. list
     +-------------------------------------------------------------------------------+
     | id                  var1   var2   var3   var4     var5_string     var6_string |
     |-------------------------------------------------------------------------------|
  1. | correctly-spelled     10     20     30     40   random text 1   random text 2 |
  2. | misspelled            10     20     30     40   random text 1   random text 2 |
     +-------------------------------------------------------------------------------+
Then just keep the correct ones. Hopefully you can identify them somehow.
// The following just flags the correct IDs. Adapt it so that it matches only the correctly spelled IDs, or use whatever other way you have of identifying them.
gen _ck_correct_id = (id=="correctly-spelled")
keep if _ck_correct_id==1

Summary table of many variables when each needs to be restricted using if

I have three different variables in Stata, var1, var2, and var3.
I need to make a summary table of these three variables so that I have the observation number, mean, sd, min, max as the fields in the resulting summary table.
I am using the following code :
su var1 if restriction == 2
su var2 if restriction == 3
su var3 if restriction == 4
Since the summary table is created from variables that are applied with restrictions, I am unable to use :
su var1 var2 var3
I would be very grateful if anyone has any ideas on how to modify my code so that instead of three lines of code I can use one line of code to get a single table with all the stats I require, which I can then copy as a table into my Word document.
Nothing reproducible here without example data. Please study https://stackoverflow.com/help/mcve
But I would go
gen var1_2 = var1 if restriction == 2
gen var2_3 = var2 if restriction == 3
gen var3_4 = var3 if restriction == 4
summarize var1_2 var2_3 var3_4
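If you want only the statistics you listed, in one compact table, tabstat may be worth a look. This is a sketch that assumes the three restricted variables have been created as above.

```stata
* Assumption: var1_2, var2_3 and var3_4 already exist (see the generate lines above)
* One row per variable; columns are N, mean, sd, min and max
tabstat var1_2 var2_3 var3_4, statistics(n mean sd min max) columns(statistics)
```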

How to convert a variable with both character and numeric content into a numeric variable in SAS

I looked at the previous links related to the topic and tried using the commands, but they give errors.
I have a variable var1 = census tract 244.1, which is in character format with length 25. I need a final variable that contains only the number 244.1, and it should be numeric.
I used the following command:
newvar = input (var1, 8.)
but it gave an error saying it is an invalid argument to function INPUT.
I also used:
newvar = input (var1, best32.)
but got the same error message as above.
I tried to remove the words 'census tract' using:
TRACT =tranwrd(var1, "Census Tract", '');
The message said that var1 is defined as both a character and a numeric variable.
I have run out of options, so I need help. I'm using SAS 9.3.
You'll have to do this in two steps:
Extract the characters "244.1"
Since we're only interested in 244.1, we'll get rid of the rest. This could have been done in a number of ways, one of which is tranwrd as you pointed out.
var2 = substr(var1, 13, 6);
Convert the character value "244.1" to the number 244.1
We need to take the character value and convert it to a number. The input function allows us to take a character value and convert it to a number using an informat. An informat is just a way of telling SAS how to interpret the value. In this case, 8. tells SAS to read up to 8 characters as a standard numeric value.
var3 = input(var2, 8.);
Full example program:
data work.one;
var1 = "census tract 244.1";
var2 = substr(var1, 13, 6);
var3 = input(var2, 8.);
run;
/* Show that var3 is a numeric variable */
proc contents data=work.one;
run;
Bonus Info
Note that you cannot save the converted value back to the original "var1" variable, since once it has been declared as a character variable it cannot store a number. If you did want to keep the same variable you would have to drop var1, then rename var3 to var1:
drop var1;
rename var3=var1;