Move values of different variables in the same observation - stata

I have a dataset imported from Excel where countries are stored in one variable, while the corresponding populations are stored in a separate variable, but in the next observation.
For example:
clear
input str32 country population
"United States of America" .
"" 3447
"Afghanistan" .
"" 727
"Belgium" .
"" 992
"China" .
"" 12000
end
How can I get the population values in the same observation as those in country?

The following works for me:
replace population = population[_n+1] if population == .
drop if country == ""


Moving variable to next column in SAS

I have the following table in SAS:
SAS Table: Price
ID  Description  Price  Discount
20  Hot blue     warm   12.0
21  Durable A    15.0   0
22  Flexible     13.5   0
23  Bendable     and A  12.3
I want to move 'warm' and 'and A' from the Price column to the Description column, and '12.0' and '12.3' to the Price column. How should I do this?
You cannot change the type of an existing variable, but you can change the name.
Use the INPUT() function to convert strings to numbers. You can use the ?? modifier to suppress errors generated by strings that do not represent numbers.
data want;
set price(rename=(price=char_price));
price = input(char_price,??32.);
if missing(price) then description=catx(' ',description,char_price);
run;

Looping with distance matching

I want to match treated firms to control firms by industry and year considering firms that are the closest in terms of profitability (roa). I want a 1:1 match. I am using a distance measure (mahalanobis).
I have 530,000 firm-year observations in my sample: approximately 267,000 treated observations and 263,000 control observations. Here is my code:
gen neighbor1 = .
gen idobs = .
levelsof industry
local a = r(levels)
levelsof year
local b = r(levels)
foreach i in `a' {
foreach j in `b'{
capture noisily psmatch2 treat if industry == `i' & year == `j', mahalanobis(roa)
capture noisily replace neighbor1 = _n1 if industry == `i' & year == `j'
capture noisily replace idobs = _id if industry == `i' & year == `j'
drop _treated _support _weight _id _n1 _nn
}
}
Treat is my treatment variable. It takes the value of 1 for treated observations and 0 for non-treated observations.
The command psmatch2 creates the variable _n1 and _id among others. _n1 is the id number of the matched observation (closest neighbor) and _id is an id number (1 - 530,000) that is unique to each observation.
The code 'works', i.e. I get no error message. My variable neighbor1 has 290,724 non-missing observations.
However, these 290,724 observations vary between 1 and 933 which is odd. The variable neighbor1 should provide me the observation id number of the matched observation, which can vary between 1 and 530,000.
It seems that the code erases or ignores the result of the matching process in different subgroups. What am I doing wrong?
Edit:
I found a public dataset and adapted my previous code so that you can run my code with this dataset and see more clearly what the problem could be.
I am using the Vella and Verbeek (1998) panel data on 545 men who worked every year from 1980 to 1987, from this website: https://www.stata.com/texts/eacsap/
Let's say that I want to match treated observations, i.e. people, to control observations by marriage status (married) and year considering people that worked a similar number of hours (hours), i.e. the shortest distance.
I create a random treatment variable (treat) for the sake of this example.
use http://www.stata.com/data/jwooldridge/eacsap/wagepan.dta
gen treat = round(runiform())
gen neighbor1 = .
gen idobs = .
levelsof married
local a = r(levels)
levelsof year
local b = r(levels)
foreach i in `a' {
foreach j in `b'{
capture noisily psmatch2 treat if married == `i' & year == `j', mahalanobis(hours)
capture noisily replace neighbor1 = _n1 if married == `i' & year == `j'
capture noisily replace idobs = _id if married == `i' & year == `j'
drop _treated _support _weight _id _n1 _nn
}
}
What this code should do is to look at each subgroup of observations: 444 observations in 1980 that are not married, 101 observations in 1980 that are married, ..., and 335 observations in 1987 that are married. In each of these subgroups, I would like to match a treated observation to a control observation considering the shortest distance in the number of hours worked.
There are two problems that I see after running the code.
First, the variable idobs should take a unique number between 1 and 4360 because there are 4360 observations in this dataset. It is just an ID number. That is not the case: several observations share the ID numbers 1, 2 and so on.
Second, neighbor1 varies between 1 and 204 meaning that the matched observations have only ID numbers varying from 1 to 204.
What is the problem with my code?
Here is a solution using the command iematch, which is installed with the package ietoolkit (ssc install ietoolkit). For disclosure, I wrote this command. psmatch2 is great if you want the ATT, but if all you want is to match observations across two groups using nearest neighbor, then iematch is cleaner.
In both commands you need to make each industry-year match in a subset, then combine that information. In both commands the matched group ID will restart from 1 in each subset.
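If you prefer to stay with psmatch2, one way to keep the per-subset results is to translate _n1 into an identifier of your own before the next run overwrites _id. The following is only a sketch under assumptions: obsid and neighbor_obsid are names made up for illustration, and it relies on the pattern described in the psmatch2 help file (sort by _id, then index with [_n1]) applied within each subset. It has not been checked against the original firm data.

```stata
* Sketch only: assumes psmatch2 is installed (ssc install psmatch2)
use http://www.stata.com/data/jwooldridge/eacsap/wagepan.dta, clear
set seed 123456
gen treat = round(runiform())

* obsid (hypothetical name) is a stable identifier that survives re-sorting
gen long obsid = _n
gen long neighbor_obsid = .

levelsof married, local(statuses)
levelsof year, local(years)
foreach m of local statuses {
    foreach y of local years {
        capture noisily psmatch2 treat if married == `m' & year == `y', mahalanobis(hours)
        * skip subsets where the matching itself failed
        if _rc != 0 {
            continue
        }
        * _n1 refers to the _id created in this run, so translate it
        * into obsid before the next run replaces _id
        sort _id
        replace neighbor_obsid = obsid[_n1] if married == `m' & year == `y' & !missing(_n1)
        drop _treated _support _weight _id _n1 _nn
    }
}
```

Under those assumptions, neighbor_obsid holds the obsid of the matched control for each treated observation, and obsid stays unique across the whole data set. The rest of this answer uses iematch instead.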
Using your example data, the iematch code below creates one matchID variable for each subset. You will then have to combine these into a single matchID without conflicts across the data set.
* use data set and keep only vars required for simplicity
use http://www.stata.com/data/jwooldridge/eacsap/wagepan.dta, clear
keep year married hour
* Set seed for replicability. NEVER use the 123456 seed in production, randomize a new seed
set seed 123456
*Generate mock treatment
gen treat = round(runiform())
*generate vars to store results
gen matchResult = .
gen matchDiff = .
gen matchCount = .
*Create locals for loops
levelsof married
local married_statuses = r(levels)
levelsof year
local years = r(levels)
*Loop over each subgroup
foreach year of local years {
foreach married_status of local married_statuses {
*This command is similar to psmatch2, but a simplified version for
* when you are not looking for the ATT.
* This command is only about matching.
iematch if married == `married_status' & year == `year', grp(treat) match(hour) seedok m1 maxmatch(1)
*These variables hold meta info about the match (see the help file for docs).
*The lines below copy that info from each subset in this loop to single vars
*for the full data set. Then the loop-specific vars are dropped
replace matchResult = _matchResult if married == `married_status' & year == `year'
replace matchDiff = _matchDiff if married == `married_status' & year == `year'
replace matchCount = _matchCount if married == `married_status' & year == `year'
drop _matchResult _matchDiff _matchCount
*For each loop you will get a match ID restarting at 1 for each loop.
*Therefore we need to save them in one var for each loop and combine afterwards.
rename _matchID matchID_`married_status'_`year'
}
}
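One possible way to combine the per-subset IDs afterwards is sketched here. It assumes the matchID_* variables are numeric and that each observation has a non-missing value in at most one of them, because it belonged to only one subset. The names subsetMatchID and matchID are made up for illustration.

```stata
* Collapse the matchID_* variables into one column: each observation
* takes its first (and, by assumption, only) non-missing value
egen subsetMatchID = rowfirst(matchID_*)

* Number each (married, year, subsetMatchID) combination uniquely across
* the data set; observations with a missing subsetMatchID stay missing
egen matchID = group(married year subsetMatchID)

drop matchID_* subsetMatchID
```

Because the subset is part of the grouping, an ID of 1 in one married-year cell can no longer collide with an ID of 1 in another.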

Use putexcel to insert a string in Excel

I am trying to make an easy to use do file where the user inserts the names of the towns s/he wants to summarize and then Stata:
summarizes the towns
saves the results in an Excel file
exports the names of the towns summarized
I am using a list saved in a local macro since it works well with the inlist() function:
clear
input Date AskingRent str10 Town
2019 12 Boston
2019 13 Cambridge
2018 14 Boston
2018 15 Cambridge
end
local towns `" "Billerica", "Boston" "'
keep if inlist(Town, `towns')
***some analysis
putexcel set "results.xlsx", modify
putexcel A1 = `towns'
I want the Excel file to have "Billerica, Boston" in cell A1.
However, I get an error in the last line of code that says:
nothing found where expression expected
The following works for me:
clear
input foo1 str20 foo2
5 "Billerica"
6 "Boston"
7 "London"
8 "New York"
end
. local towns `" "Billerica", "Boston" "'
. keep if inlist(foo2, `towns')
. putexcel set "results.xlsx", modify
. putexcel A1 = `"`towns'"'
file results.xlsx saved
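The original line presumably fails because `towns' expands to two quoted strings separated by a comma, which is not a single string expression. Wrapping the macro in compound double quotes passes it as one string, but as far as I can tell the inner double quotes are then written to the cell as well. If the cell should contain Billerica, Boston without them, one option is to strip the quotes into a second macro first. A sketch, where towns_clean is a name made up for illustration:

```stata
local towns `" "Billerica", "Boston" "'

* remove every double quote, then trim the leading and trailing blanks
local towns_clean : subinstr local towns `"""' "", all
local towns_clean = strtrim("`towns_clean'")

putexcel set "results.xlsx", modify
putexcel A1 = "`towns_clean'"
```

The original towns macro is left untouched, so it can still be passed to inlist().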

Check all values in a column

Is there a way to select all values in a column, and then check whether the entire column includes only certain parameters? Here is the code that I have tried, but I can see why it is not doing what I want:
IF City = "8" or
City = "12" or
City = "15" or
City = "24" or
City = "35"
THEN put "All cities are within New York";
I am trying to select the entire column on 'City', and check to see if that column includes ONLY those 5 values. If it includes ONLY those 5 values, then I want it to print to the log saying that. But, I can see that my method checks each row if it includes one of those, and if even only just 1 row contains one of them, it will print to the log. So I am getting a print to the log for every instance of this.
What I am trying to do is:
want-- IF (all of)City includes 8 & 12 & 15 & 24 & 35
THEN put "All cities are within New York."
You just need to test whether any observation is NOT in your list of values.
data _null_;
if eof then put 'All cities are within New York';
set have end=eof;
where not (city in ('8','12','15','24','35') );
put 'Found city NOT within New York. ' CITY= ;
stop;
run;
A SQL alternative could be:
proc sql;
select case when sum(not city in ('8','12','15','24','35'))>0 then 'Found city NOT within New York' else 'All cities are within New York' end
from have;
quit;
This is not the same as printing to the log, and not as efficient as the one @Tom offered.

Stata: svy:mean affecting variable format

The code I am running uses svy: mean, and no subpop() option is used.
My issue is that, for certain variables, it renames some of the values of the variable to _subpop_1, etc., while others keep their original format. For example, I have a county variable. After using the svy: mean command, some counties show up by name (Alameda, Alpine, etc.) while others show up as _subpop_7, _subpop_8, etc.
Does anyone know why this is?
When using a tab command on the same variable, none of the formats are affected and every county shows up.
An example of my code and output (I hid the numbers) would be:
foreach var of varlist county {
svy: mean deport, over(`var')
}
Survey: Mean estimation
Number of strata = . Number of obs = .
Number of PSUs = . Population size = .
Design df = .
ALAMEDA: county = ALAMEDA
ALPINE: county = ALPINE
AMADOR: county = AMADOR
BUTTE: county = BUTTE
CALAVERAS: county = CALAVERAS
COLUSA: county = COLUSA
_subpop_7: county = CONTRA COSTA
_subpop_8: county = DEL NORTE
_subpop_9: county = EL DORADO
FRESNO: county = FRESNO
GLENN: county = GLENN
HUMBOLDT: county = HUMBOLDT
IMPERIAL: county = IMPERIAL
More than a programming problem, this is simply a case of Stata doing what it states it'll do. From help mean:
"Noninteger values, negative values, and labels that are not valid Stata names are substituted with a default identifier."
An example reproducing the "problem" is:
webuse hbp
// some value labels with spaces
label define lblcity 1 "contra costa" 2 "el dorado" 3 "alameda" 5 "alpine"
label values city lblcity
mean hbp, over(city)
More on valid Stata names in [U] 11 Language syntax.
(Note the svy : prefix plays no role here.)
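If the default identifiers are unwanted, one workaround is to use value labels that are themselves valid Stata names, for instance by replacing spaces with underscores. This is a sketch based on the example above, not something taken from the help file, and it applies to Stata versions that behave as quoted there:

```stata
webuse hbp, clear
// same labels as before, with underscores instead of spaces,
// so that each label is a valid Stata name
label define lblcity 1 "contra_costa" 2 "el_dorado" 3 "alameda" 5 "alpine"
label values city lblcity
mean hbp, over(city)
```

For the county variable in the question, the same idea would mean relabeling values such as CONTRA COSTA to CONTRA_COSTA before running svy: mean.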