I am trying to make an easy-to-use do-file where the user inserts the names of the towns s/he wants to summarize, and then Stata:
summarizes the towns
saves the results in an Excel file
exports the names of the towns summarized
I am using a list saved in a local macro since it works well with the inlist() function:
clear
input Date AskingRent str10 Town
2019 12 Boston
2019 13 Cambridge
2018 14 Boston
2018 15 Cambridge
end
local towns `" "Billerica", "Boston" "'
keep if inlist(Town, `towns')
***some analysis
putexcel set "results.xlsx", modify
putexcel A1 = `towns'
I want the Excel file to have "Billerica, Boston" in cell A1.
However, I get an error in the last line of code that says:
nothing found where expression expected
The following works for me:
clear
input foo1 str20 foo2
5 "Billerica"
6 "Boston"
7 "London"
8 "New York"
end
. local towns `" "Billerica", "Boston" "'
. keep if inlist(foo2, `towns')
. putexcel set "results.xlsx", modify
. putexcel A1 = `"`towns'"'
file results.xlsx saved
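Note that the line above writes the macro contents verbatim, so the cell still contains the double quotes around each name. If the goal is plain text such as Billerica, Boston, one possible sketch is to strip the quotes before exporting. The macro name towns_text is made up for illustration, and strtrim() assumes Stata 14 or newer:

```stata
* Same list as in the example above.
local towns `" "Billerica", "Boston" "'

* Remove every embedded double quote from the list (towns_text is a hypothetical name).
local towns_text : subinstr local towns `"""' "", all

* Trim the leading and trailing blanks left over from the original definition.
local towns_text = strtrim("`towns_text'")

putexcel set "results.xlsx", modify
putexcel A1 = "`towns_text'"
```

The original macro is left untouched, so it can still be passed to inlist() as before.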
In Stata, I am trying to change the values--both string and numeric--of one row based on the one just above or just below it only if the values are missing. Here are some sample data:
input
str40 id var1 var2 var3 var4 str40 var5_string str40 var6_string
"correctly-spelled" 10 20 . . "random text 1" ""
"misspelled" . . 30 40 "" "random text 2"
end
Essentially, I want my final dataset to look as follows:
input
id var1 var2 var3 var4 var5_string var6_string
"correctly-spelled" 10 20 30 40 "random text 1" "random text 2"
end
I need a row-specific solution (i.e. avoiding collapse), because my (wide) dataset has thousands of labeled variables, and I don't want to lose the labels due to collapse. Also, not all of the variables are numeric, and the naming conventions of the variables are not consistent. Accordingly, fixing the spelling of id with a simple replace, executing a collapse (firstnm) id var5_string var6_string (mean) var1 var2 var3 var4, by(id), or using var* for anything won't help. Basically, what happened was one person merged using the "correctly-spelled" id, the other person merged using the "misspelled" id, and I don't have any of the source files. Thanks!
If you can assume that the misspelled ID comes right after (or right before) the correctly spelled one, you can use _n±1 to get the previous or following value. For more information on system variables, see help _variables.
If you assume the correct one always comes first, then the second replace would be sufficient.
mi() is the abbreviated missing() function.
The second condition, & !mi(`var'[_n±1]), just makes sure that non-missing values don't get replaced by missing values, should two valid (but different) IDs come up sequentially. Depending on your data, this further condition might not be necessary.
local list_of_vars var1 var2 var3 var4 var5_string var6_string
foreach var of local list_of_vars {
replace `var' = `var'[_n-1] if mi(`var') & !mi(`var'[_n-1])
replace `var' = `var'[_n+1] if mi(`var') & !mi(`var'[_n+1])
}
. list
+-------------------------------------------------------------------------------+
| id var1 var2 var3 var4 var5_string var6_string |
|-------------------------------------------------------------------------------|
1. | correctly-spelled 10 20 30 40 random text 1 random text 2 |
2. | misspelled 10 20 30 40 random text 1 random text 2 |
+-------------------------------------------------------------------------------+
Then just keep the correct ones. Hopefully you can identify them somehow.
// the following is just to be able to identify the correct id's, of course you will have to adapt it so that it matches only the correctly-spelled IDs or you have other way of identifying them :)
gen _ck_corect_id = (id=="correctly-spelled")
keep if _ck_corect_id==1
I have a dataset imported from Excel where countries are stored as one variable, while their corresponding population are placed in a separate variable but in the next observation.
For example:
clear
input str32 country population
"United States of America" .
"" 3447
"Afghanistan" .
"" 727
"Belgium" .
"" 992
"China" .
"" 12000
end
How can I get the population values in the same observation as those in country?
The following works for me:
replace population = population[_n+1] if population == .
drop if country == ""
I have a dataset of the top management teams of US banks from 2005 - 2015.
Now I want to generate a change-variable if a TMT composition changed between 2006 and 2009.
So first I used:
drop if Year > 2009
drop if Year < 2006
by id (id), sort: gen changed = (DirectorID[1] != DirectorID[_N])
and afterwards I used
by id (id), sort: gen changed = (DirectorID[1] != DirectorID[_N]) if Year < 2010 & Year > 2005
However there is a difference in output between two variables:
247 cases of "No change" and 853 cases of "Change" in the first and 116 cases of "No change" and the rest as "Changed" in the second variable
Could anyone clarify what the differences between these two commands are in Stata?
There are a couple reasons you may be seeing a different count of changes to the dataset. The data is most likely sorted differently for these two calls. The (id) parts have no effect here because you are already sorting by id. What you likely want to do is residually sort by year. So, bysort id (Year) - this way the dataset will be in the same order for each command you type. In the second command, the if clause is going to set the variable changed to missing for observations outside of the year range, but those observations are still being included in the calculation. You could create a new variable to flag the years of interest, and then add that new variable to the bysort call.
Lastly, you need to decide whether you only want to look at changes year-over-year (the value of the changed could vary by year within id), or have the value of changed reflect whether there were any changes in DirectorID over the entire time frame of interest (the value of changed would be constant within id).
Here's a toy example illustrating the difference. Essentially, when you drop the data, the last and the first observation could be the same, but in general you will have less data to compare the first and last observation since much of the data will be gone. When you use if, then the data is still there, even though the calculation is restricted to the middle observation by the if:
. clear
. input id year director_id
id year directo~d
1. 1 2016 10
2. 1 2017 20
3. 1 2018 30
4. end
.
. bys id (year): gen changed = (director_id[1] != director_id[_N]) if year < 2018 & year > 2016
(2 missing values generated)
. list, clean noobs
id year direct~d changed
1 2016 10 .
1 2017 20 1
1 2018 30 .
.
. drop if inlist(year, 2016,2018)
(2 observations deleted)
. bys id (year): gen changed2 = (director_id[1] != director_id[_N]) if year < 2018 & year > 2016
. list, clean noobs
id year direct~d changed changed2
1 2017 20 1 0
I added a sort by year since that seems in the spirit of your exercise.
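The earlier suggestion of flagging the years of interest and adding the flag to the bysort call could look roughly like the sketch below. It uses the variable names from the question (id, Year, DirectorID); in_window is a made-up name, and the result is the "constant within id" variant of changed:

```stata
* Flag the years of interest (in_window is a hypothetical variable name).
gen byte in_window = inrange(Year, 2006, 2009)

* Group by id and flag, sort by Year within each group, so that [1] and [_N]
* are the earliest and latest observations inside the window only.
bysort id in_window (Year): gen byte changed = (DirectorID[1] != DirectorID[_N]) if in_window
```

Because the flag is part of the by-group, observations outside 2006-2009 no longer enter the comparison; they simply get a missing value for changed, and nothing has to be dropped.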
A raw data file is listed below:
RANCH,1250,2,1,Sheppard Avenue, "$64,000"
SPLIT,1190,1,1,Rand Street, "$65,850"
CONDO, 1400,2,1,Market Street, "80,050"
TWOSTORY, 1810,4,3,Garris Street, "$107,250"
RANCH, 1500,3,3,Kemble Avenue, "$86,650"
SPLIT, 1615, 4,3, West Drive, "94,450"
SPLIT, 1305, 3,1.5,Graham Avenue, "$73,650"
The following is the code:
data work.condo_ranch;
infile "file_specification" dsd;
input style $ @;
if style = 'CONDO' or style = 'RANCH' then
input sqfeet bedrooms baths street $ price: dollar10.;
run;
So, I think the output dataset contains 3 observations, while the correct answer is that the output contains 7 observations. Can anyone tell me why? Many thanks for your time and attention.
Why would you expect the output dataset to have only 3 observations? There is an implied OUTPUT statement at the bottom of the DATA step, so every record is written whether or not the second INPUT runs. If you want to output only those records where STYLE IN ("CONDO","RANCH") you could add a conditional OUTPUT, e.g.:
if style = 'CONDO' or style = 'RANCH' then do;
input sqfeet bedrooms baths street $ price: dollar10.;
output;
end;
If you only want to output the records where style is CONDO or RANCH, you could just change your THEN to a semicolon. That would make your IF statement a subsetting IF, so for any other record the data step would return at that point and never run the second INPUT or the implied OUTPUT at the end of the step.
Is there a way to select all values in a column, and then check whether the entire column includes only certain parameters? Here is the code that I have tried, but I can see why it is not doing what I want:
IF City = "8" or
City = "12" or
City = "15" or
City = "24" or
City = "35"
THEN put "All cities are within New York";
I am trying to select the entire column on 'City', and check to see if that column includes ONLY those 5 values. If it includes ONLY those 5 values, then I want it to print to the log saying that. But, I can see that my method checks each row if it includes one of those, and if even only just 1 row contains one of them, it will print to the log. So I am getting a print to the log for every instance of this.
What I am trying to do is:
want-- IF (all of)City includes 8 & 12 & 15 & 24 & 35
THEN put "All cities are within New York."
You just need to test whether any observation is NOT in your list of values.
data _null_;
if eof then put 'All cities are within New York';
set have end=eof;
where not (city in ('8','12','15','24','35') );
put 'Found city NOT within New York. ' CITY= ;
stop;
run;
A SQL alternative could be:
proc sql;
select case when sum(not city in ('8','12','15','24','35'))>0 then 'Found city NOT within New York' else 'All cities are within New York' end
from have;
quit;
This prints the message to the results output rather than to the log, and it is not as efficient as the one @Tom offered.