Stata: svy: mean affecting variable format

The code I am running uses svy: mean, and no subpop() option is used.
My issue is that for certain variables, it renames some of the values of the variable to _subpop_1, etc., while others keep their original labels. For example, I have a county variable. After running svy: mean, some counties show up as ALAMEDA, ALPINE, etc., while others show up as _subpop_7, _subpop_8, etc.
Does anyone know why this is?
When using a tab command on the same variable, none of the formats are affected and every county shows up.
An example of my code and output (I hid the numbers) would be:
foreach var of varlist county {
svy: mean deport, over(`var')
}
Survey: Mean estimation
Number of strata = . Number of obs = .
Number of PSUs = . Population size = .
Design df = .
ALAMEDA: county = ALAMEDA
ALPINE: county = ALPINE
AMADOR: county = AMADOR
BUTTE: county = BUTTE
CALAVERAS: county = CALAVERAS
COLUSA: county = COLUSA
_subpop_7: county = CONTRA COSTA
_subpop_8: county = DEL NORTE
_subpop_9: county = EL DORADO
FRESNO: county = FRESNO
GLENN: county = GLENN
HUMBOLDT: county = HUMBOLDT
IMPERIAL: county = IMPERIAL

More than a programming problem, this is simply a case of Stata doing what it states it'll do. From help mean:
"Noninteger values, negative values, and labels that are not valid Stata names are substituted with a default identifier."
An example reproducing the "problem" is:
webuse hbp
// some value labels with spaces
label define lblcity 1 "contra costa" 2 "el dorado" 3 "alameda" 5 "alpine"
label values city lblcity
mean hbp, over(city)
More on valid Stata names in [U] 11 Language syntax.
(Note the svy : prefix plays no role here.)
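One way around this, sketched under the assumption that underscores are acceptable in the displayed labels, is to attach value labels that are themselves valid Stata names. The label name lblcity2 is made up for this illustration, and newer Stata releases may lay out over() results differently:

```stata
webuse hbp, clear
* value labels that are valid Stata names (underscores instead of spaces)
label define lblcity2 1 "contra_costa" 2 "el_dorado" 3 "alameda" 5 "alpine"
label values city lblcity2
mean hbp, over(city)
```

With labels like these, mean should have no reason to fall back on the _subpop_# identifiers.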

Looping with distance matching

I want to match treated firms to control firms by industry and year considering firms that are the closest in terms of profitability (roa). I want a 1:1 match. I am using a distance measure (mahalanobis).
I have 530,000 firm-year observations in my sample, approximately 267,000 treated observations and 263,000 control observations. Here is my code:
gen neighbor1 = .
gen idobs = .
levelsof industry
local a = r(levels)
levelsof year
local b = r(levels)
foreach i in `a' {
foreach j in `b'{
capture noisily psmatch2 treat if industry == `i' & year == `j', mahalanobis(roa)
capture noisily replace neighbor1 = _n1 if industry == `i' & year == `j'
capture noisily replace idobs = _id if industry == `i' & year == `j'
drop _treated _support _weight _id _n1 _nn
}
}
Treat is my treatment variable. It takes the value of 1 for treated observations and 0 for non-treated observations.
The command psmatch2 creates the variable _n1 and _id among others. _n1 is the id number of the matched observation (closest neighbor) and _id is an id number (1 - 530,000) that is unique to each observation.
The code 'works', i.e. I get no error message. My variable neighbor1 has 290,724 non-missing observations.
However, these 290,724 observations vary between 1 and 933 which is odd. The variable neighbor1 should provide me the observation id number of the matched observation, which can vary between 1 and 530,000.
It seems that the code erases or ignores the result of the matching process in different subgroups. What am I doing wrong?
Edit:
I found a public dataset and adapted my previous code so that you can run my code with this dataset and see more clearly what the problem could be.
I am using the Vella and Verbeek (1998) panel data on 545 men who worked every year from 1980 to 1987, available from this website: https://www.stata.com/texts/eacsap/
Let's say that I want to match treated observations, i.e. people, to control observations by marriage status (married) and year considering people that worked a similar number of hours (hours), i.e. the shortest distance.
I create a random treatment variable (treat) for the sake of this example.
use http://www.stata.com/data/jwooldridge/eacsap/wagepan.dta
gen treat = round(runiform())
gen neighbor1 = .
gen idobs = .
levelsof married
local a = r(levels)
levelsof year
local b = r(levels)
foreach i in `a' {
foreach j in `b'{
capture noisily psmatch2 treat if married == `i' & year == `j', mahalanobis(hours)
capture noisily replace neighbor1 = _n1 if married == `i' & year == `j'
capture noisily replace idobs = _id if married == `i' & year == `j'
drop _treated _support _weight _id _n1 _nn
}
}
What this code should do is to look at each subgroup of observations: 444 observations in 1980 that are not married, 101 observations in 1980 that are married, ..., and 335 observations in 1987 that are married. In each of these subgroups, I would like to match a treated observation to a control observation considering the shortest distance in the number of hours worked.
There are two problems that I see after running the code.
First, the variable idobs should take a unique number between 1 and 4360 because there are 4360 observations in this dataset. It is just an ID number. That is not the case: several observations share ID number 1, several share 2, and so on.
Second, neighbor1 varies between 1 and 204 meaning that the matched observations have only ID numbers varying from 1 to 204.
What is the problem with my code?
Here is a solution using the command iematch, which is installed as part of the package ietoolkit (ssc install ietoolkit). For disclosure, I wrote this command. psmatch2 is great if you want the ATT. But if all you want is to match observations across two groups using nearest neighbor, then iematch is cleaner.
In both commands you need to make each industry-year match in a subset, then combine that information. In both commands the matched group ID will restart from 1 in each subset.
Using your example data, this creates one matchID var for each subset, then you will have to find a way to combine these to a single matchID without conflicts across the data set.
* use data set and keep only vars required for simplicity
use http://www.stata.com/data/jwooldridge/eacsap/wagepan.dta, clear
keep year married hour
* Set seed for replicability. NEVER use the 123456 seed in production, randomize a new seed
set seed 123456
*Generate mock treatment
gen treat = round(runiform())
*generate vars to store results
gen matchResult = .
gen matchDiff = .
gen matchCount = .
*Create locals for loops
levelsof married
local married_statuses = r(levels)
levelsof year
local years = r(levels)
*Loop over each subgroup
foreach year of local years {
foreach married_status of local married_statuses {
*This command is similar to psmatch2, but a simplified version for
* when you are not looking for the ATT.
* This command is only about matching.
iematch if married == `married_status' & year == `year', grp(treat) match(hour) seedok m1 maxmatch(1)
*These variables hold meta info about the match (see the help file for docs).
*The lines below copy that info from each subset in this loop to single vars
*for the full data set. Then the loop-specific vars are dropped
replace matchResult = _matchResult if married == `married_status' & year == `year'
replace matchDiff = _matchDiff if married == `married_status' & year == `year'
replace matchCount = _matchCount if married == `married_status' & year == `year'
drop _matchResult _matchDiff _matchCount
*For each loop you will get a match ID restarting at 1 for each loop.
*Therefore we need to save them in one var for each loop and combine afterwards.
rename _matchID matchID_`married_status'_`year'
}
}
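The final combining step is left open above. Here is a minimal sketch of one possibility, assuming _matchID is numeric and missing outside the subset it was created for (worth checking against the iematch help file). The names submatch_id and match_id_all are invented for this illustration:

```stata
* Collapse the subset-specific IDs into one column: each observation
* should have a nonmissing value in at most one matchID_* variable
egen submatch_id = rowfirst(matchID_*)
* Number each (married, year, subset ID) combination 1, 2, ... so IDs no
* longer collide across subsets; observations without an ID stay missing
egen match_id_all = group(married year submatch_id)
```

Once match_id_all looks right, the matchID_* variables and submatch_id can be dropped.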

How many observations in the output dataset?

A raw data file is listed below:
RANCH,1250,2,1,Sheppard Avenue, "$64,000"
SPLIT,1190,1,1,Rand Street, "$65,850"
CONDON, 1400,2,1,Market Street, "80,050"
TWOSTORY, 1810,4,3,Garris Street, "$107,250"
RANCH, 1500,3,3,Kemble Avenue, "$86,650"
SPLIT, 1615, 4,3, West Drive, "94,450"
SPLIT, 1305, 3,1.5,Graham Avenue, "$73,650"
The following is the code:
data work.condo_ranch;
infile "file_specification" dsd;
input style $ @;
if style = 'CONDO' or style = 'RANCH' then
input sqfeet bedrooms baths street $ price: dollar10.;
run;
I think the output dataset should contain 3 observations, but the correct answer is that it contains 7 observations. Can anyone tell me why? Many thanks for your time and attention.
Why would you expect the output dataset to have only 3 observations? There is an implied OUTPUT statement at the bottom of the DATA step, so every iteration writes an observation whether or not the second INPUT runs. If you want to output only those records where STYLE IN ("CONDO","RANCH") you could add a conditional OUTPUT, e.g.:
if style = 'CONDO' or style = 'RANCH' then do;
input sqfeet bedrooms baths street $ price: dollar10.;
output;
end;
If you only want to output the records where style is CONDO or RANCH you could just change your THEN to a semi-colon. That would make your IF statement a subsetting IF. So the data step would return at that point and never run the second INPUT or the implied OUTPUT at the end of the step.

Check all values in a column

Is there a way to select all values in a column, and then check whether the entire column includes only certain parameters? Here is the code that I have tried, but I can see why it is not doing what I want:
IF City = "8" or
City = "12" or
City = "15" or
City = "24" or
City = "35"
THEN put "All cities are within New York";
I am trying to select the entire column on 'City', and check to see if that column includes ONLY those 5 values. If it includes ONLY those 5 values, then I want it to print to the log saying that. But, I can see that my method checks each row if it includes one of those, and if even only just 1 row contains one of them, it will print to the log. So I am getting a print to the log for every instance of this.
What I am trying to do is:
want-- IF (all of)City includes 8 & 12 & 15 & 24 & 35
THEN put "All cities are within New York."
You just need to test whether any observation is NOT in your list of values.
data _null_;
if eof then put 'All cities are within New York';
set have end=eof;
where not (city in ('8','12','15','24','35') );
put 'Found city NOT within New York. ' CITY= ;
stop;
run;
A SQL alternative could be:
proc sql;
select case when sum(not city in ('8','12','15','24','35'))>0 then 'Found city NOT within New York' else 'All cities are within New York' end
from have;
quit;
This prints to the output rather than the log, and it is not as efficient as the one @Tom offered.

TRANWRD to fix Merge error?

I recently combined two datasets with a pretty straightforward Merge statement. I was using an ACS dataset and a Census population dataset. I needed a flag from the latter to be in the former. When I merged, the place variable (town/county, state) was not de-duplicated because one dataset used state abbreviations while the other used the full spelling:
Obs GeoID GeoName
1 . Abbeville County, SC
2 45001 Abbeville County, South Carolina
I need to change the GeoName for Obs1 so that it equals Obs2
Would an index function work? Or do I need the TRANWRD function? Thanks.
Solved:
data _null_;
length geoName $100;
GeoName_C = scan(GeoName,1,',');
GeoName_S = scan(GeoName,-1,','); *-1 scans from the right in case you could have commas in the city - check for this and adjust GeoName_C to include them if it is possible;
GeoName_S_F = stnamel(strip(GeoName_S));
GeoName = catx(', ',GeoName_C,GeoName_S_F); * comma plus space, to match "Abbeville County, South Carolina";
put _all_;
run;
What I would do is separate the city from the state and use SAS's inbuilt function stnamel to convert the abbreviation to the full name.
data _null_;
length geoName $100;
GeoName='Abbeville Road, SC';
GeoName_C = scan(GeoName,1,',');
GeoName_S = scan(GeoName,-1,','); *-1 scans from the right in case you could have commas in the city - check for this and adjust GeoName_C to include them if it is possible;
GeoName_S_F = stnamel(strip(GeoName_S));
GeoName = catx(', ',GeoName_C,GeoName_S_F); * comma plus space, to match "Abbeville County, South Carolina";
put _all_;
run;

"too many variables specified" error with predict following logit

I have a panel of data (firm-years) that spans several countries. For each country I estimate a logit model using the first five years, then I use this model to predict probabilities in subsequent years. I loop over the countries with foreach and over the subsequent years with forvalues.
The first few countries work well (both estimations and predictions), but the fifth country's first out-of-sample prediction fails with:
Country: United Kingdom
Year: 1994
too many variables specified
r(103);
The model fits and 1994 has enough data to predict a probability. My predict call is:
predict temp_`c'`y' ///
if (country == "`c'") ///
& (fyear == `y'), ///
pr
Do you have any ideas what could cause this error? I am confused because logit and predict work elsewhere in the same loop. Thanks!
FWIW, here's the .do file.
* generate table 5 from Denis and Osobov (2008 JFE)
preserve
* loop to estimate model by country
levelsof country, local(countries)
foreach c of local countries {
display "Country: `c'"
summarize fyear if (country == "`c'"), meanonly
local est_low = `r(min)'
local est_high = `=`r(min)' + 4'
local pred_low = `=`r(min)' + 5'
local pred_high = `r(max)'
logit payer size v_a_tr e_a_tr re_be_tr ///
if (country == "`c'") ///
& inrange(fyear, `est_low', `est_high')
forvalues y = `pred_low'/`pred_high' {
display "Country: `c'"
display "Year: `y'"
predict temp_`c'`y' ///
if (country == "`c'") ///
& (fyear == `y'), ///
pr
}
}
* combine fitted values and generate delta
egen payer_expected = rowfirst(temp_*)
drop temp_*
generate delta = payer - payer_expected
* table
table country fyear, ///
contents(count payer mean payer mean payer_expected)
*
restore
Update: If I drop (country == "United Kingdom"), then the same problem shifts to the United States (next and last country in panel). If I drop inlist(country, "United Kingdom", "United States") then the problem disappears and the .do file runs through.
You are using country names as part of the new variable name that predict is creating. However, when you get to "United Kingdom" your line
predict temp_`c'`y'
implies something like
predict temp_United Kingdom1812
But Stata sees that as two variable names where only one is allowed.
Otherwise put, you are being bitten by a simple rule: Stata does not allow spaces within variable names.
Clearly the same problem would bite with "United States".
The simplest fudge is to change the values so that spaces become underscores "_". Stata's OK with variable names including underscores. That could be
gen country2 = subinstr(country, " ", "_", .)
followed by a loop over country2.
Note for everyone not up on the historical details: 1812 alludes to the War of 1812, during which British troops burnt down the White House (in 1814). Feel free to substitute "1776" or some other date of choice.
(By the way, credit for a crystal-clear question!)
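As a variation on the underscore fix, Stata's strtoname() function turns an arbitrary string into a legal name. The sketch below only displays the names that would result. The local stub is a made-up name, and the year 1994 is taken from the error message in the question:

```stata
* Sketch only: build legal variable-name stubs from country strings
foreach c in "United Kingdom" "United States" {
    local stub = strtoname("`c'")
    display "`c' -> temp_`stub'1994"
}
```

Inside the original loop, the same idea would mean computing stub from `c' and calling predict temp_`stub'`y', while still comparing country == "`c'" in the if condition. Variable names are limited to 32 characters, so very long country names might need shortening.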
Here's another approach to your problem. Initialise a variable to hold the predicted values. Then, as you loop over the possibilities, replace it chunk by chunk with each set of predictions. That avoids the whole business of generating a bunch of variables with different names which you don't want to hold on to long-term.
* generate table 5 from Denis and Osobov (2008 JFE)
preserve
gen payer_expected = .
* loop to estimate model by country
levelsof country, local(countries)
foreach c of local countries {
display "Country: `c'"
summarize fyear if (country == "`c'"), meanonly
local est_low = `r(min)'
local est_high = `=`r(min)' + 4'
local pred_low = `=`r(min)' + 5'
local pred_high = `r(max)'
logit payer size v_a_tr e_a_tr re_be_tr ///
if (country == "`c'") ///
& inrange(fyear, `est_low', `est_high')
forvalues y = `pred_low'/`pred_high' {
display "Country: `c'"
display "Year: `y'"
predict temp ///
if (country == "`c'") ///
& (fyear == `y'), pr
quietly replace payer_expected = temp if temp < .
drop temp
}
}
generate delta = payer - payer_expected
* table
table country fyear, ///
contents(count payer mean payer mean payer_expected)
*
restore