Run a regression of countries by quartiles for a specific year - stata

I am exploring an effect that I think will vary by GDP levels, from a data set that has, vertically, country and year (1960 to 2015), so each country label is on 55 rows. I ran
sort year
by year: egen yrank = xtile(rgdp), nquantiles(4)
which tags every year row with what quartile of GDP they were in that year. I want to run this:
xtreg fiveyearg taxratio if yrank == 1 & year==1960
which would regress my variable (tax ratio) against some averaged gdp data from countries that were in the bottom quartile of GDPs in 1960 alone. So even if later on they grew enough to change ranks, the later data would still be in the regression pool. Sadly, I cannot get this code, or any variation, to run.
My current approach is to try to generate some new variable that would give every row with country label X a value of 1 if they were in the bottom quartile in 1960, but I can't get that to work either. i have run out of ideas, so I thought I would ask!

Based on your latest comment, which describes the (un)expected behavior:
clear
set more off
*----- example data -----
input ///
country year rank
1 1960 2
1 1961 1
1 1962 2
2 1960 1
2 1961 1
2 1962 1
3 1960 3
3 1961 3
3 1962 3
end
list, sepby(country)
*----- what you want -----
// tag countries whose first observation for -rank- is 1
// (I assume the first observation for -year- is always 1960)
bysort country : gen toreg = rank[1] == 1
list, sepby(country)
// run regression conditional on -toreg-
xtreg ... if toreg
Check help subscripting if in doubt.

Related

Count by groups and collapse

I have a dataset in Stata and want to count by group (loc_ID) and year. I used the following two lines of code:
egen count_obsv = tag(loc_ID year)
This adds a counter to my dataset (count_obsv) which is 1 (and 0 for every element that has the same combination of loc_ID and year) for every new combination.
Then I use:
collapse (sum) count_obsv, by(loc_ID year)
according to various Stata forum posts this should result in eg.:
loc_ID year count_obsv
1 2000 342
1 2001 23
2 2008 23
...
But my output is:
loc_ID year count_obsv
1 2000 1
1 2001 1
2 2008 1
...
What am I summarizing wrong?
When you call up the tag() function of the egen command, you assign the value 1 to just one of any number of observations with the same distinct values for the specified variables, and 0 to all the others. Then when you ask for the sum of those values in the same groups of observations, you get the group sums of one 1 and any number of 0s, and each sum is thus necessarily 1.
Your question is probably abstracted from some other calculations that worked as you expected, but if all you wanted was a dataset with frequencies, then
contract loc_ID year
would do that for you. If you wanted a dataset with summaries of other variables too, you would need something more like
collapse (count) count=foo (mean) mean=foo (sd) sd=foo, by(loc_ID year)
I doubt that any Statalist posts state otherwise. (I wrote tag() in 1999, and I am not aware of this as a misunderstanding.) There is a related but so to speak distinct problem where tag() comes in useful, which is counting distinct values (often called unique values).
sysuse auto, clear
egen tag = tag(foreign rep78)
egen distinct = total(tag), by(foreign)
tabdisp foreign, c(distinct)
would be a way to get at the number of distinct values of rep78 within categories of foreign.

Difference between dropping years and "if Year > 2005"

I have a dataset of the top management teams of US banks from 2005 - 2015.
Now I want to generate a change-variable if a TMT composition changed between 2006 and 2009.
So first I used:
drop if Year > 2009
drop if Year < 2006
by id (id), sort: gen changed = (DirectorID[1] != DirectorID[_N])
and afterwards I used
by id (id), sort: gen changed = (DirectorID[1] != DirectorID[_N]) if Year < 2010 & Year > 2005
However there is a difference in output between two variables:
247 cases of "No change" and 853 cases of "Change" in the first and 116 cases of "No change" and the rest as "Changed" in the second variable
Could anyone clarify what the differences between these two commands are in Stata?
There are a couple reasons you may be seeing a different count of changes to the dataset. The data is most likely sorted differently for these two calls. The (id) parts have no effect here because you are already sorting by id. What you likely want to do is residually sort by year. So, bysort id (Year) - this way the dataset will be in the same order for each command you type. In the second command, the if clause is going to set the variable changed to missing for observations outside of the year range, but those observations are still being included in the calculation. You could create a new variable to flag the years of interest, and then add that new variable to the bysort call.
Lastly, you need to decide whether you only want to look at changes year-over-year (the value of the changed could vary by year within id), or have the value of changed reflect whether there were any changes in DirectorID over the entire time frame of interest (the value of changed would be constant within id).
Here's a toy example illustrating the difference. Essentially, when you drop the data, the last and the first observation could be the same, but in general you will have less data to compare the first and last observation since much of the data will be gone. When you use if, then the data is still there, even though the calculation is restricted to the middle observation by the if:
. clear
. input id year director_id
id year directo~d
1. 1 2016 10
2. 1 2017 20
3. 1 2018 30
4. end
.
. bys id (year): gen changed = (director_id[1] != director_id[_N]) if year < 2018 & year > 2016
(2 missing values generated)
. list, clean noobs
id year direct~d changed
1 2016 10 .
1 2017 20 1
1 2018 30 .
.
. drop if inlist(year, 2016,2018)
(2 observations deleted)
. bys id (year): gen changed2 = (director_id[1] != director_id[_N]) if year < 2018 & year > 2016
. list, clean noobs
id year direct~d changed changed2
1 2017 20 1 0
I added a sort by year since that seems in the spirit of your exercise.

SAS: PROC FREQ with multiple ID variables

I have data that's tracking a certain eye phenomena. Some patients have it in both eyes, and some patients have it in a single eye. This is what some of the data looks like:
EyeID PatientID STATUS Gender
1 1 1 M
2 1 0 M
3 2 1 M
4 3 0 M
5 3 1 M
6 4 1 M
7 4 0 M
8 5 1 F
9 6 1 F
10 6 0 F
11 7 1 F
12 8 1 F
13 8 0 F
14 9 1 F
As you can see from the data above, there are 9 patients total and all of them have the particular phenomena in one eye.
I need the count the number of patients with this eye phenomena.
To get the number of total patients in the dataset, I used:
PROC FREQ data=new nlevels;
tables PatientID;
run;
To count the number of patients with this eye phenomena, I used:
PROC SORT data=new out=new1 nodupkey;
by Patientid Status;
run;
proc freq data=new1 nlevels;
tables Status;
run;
However, it gave the correct number of patients with the phenomena (9), but not the correct number without (0).
I now need to calculate the gender distribution of this phenomena. I used:
proc freq data=new1;
tables gender*Status/chisq;
run;
However, in the cross table, it has the correct number of patients who have the phenomena (9), but not the correct number without (0). Does anyone have any thoughts on how to do this chi-square, where if the has this phenomena in at least 1 eye, then they are positive for this phenomena?
Thanks!
PROC FREQ is doing what you told it to: counting the status=0 cases.
In general here you are using sort of blunt tools to accomplish what you're trying to accomplish, when you probably should use a more precise tool. PROC SORT NODUPKEY is sort of overkill for example, and it doesn't really do what you want anyway.
To set up a dataset of has/doesn't have, for example, let's do a few things. First I add one more row - someone who actually doesn't have - so we see that working.
data have;
input eyeID patientID status gender $;
datalines;
1 1 1 M
2 1 0 M
3 2 1 M
4 3 0 M
5 3 1 M
6 4 1 M
7 4 0 M
8 5 1 F
9 6 1 F
10 6 0 F
11 7 1 F
12 8 1 F
13 8 0 F
14 9 1 F
15 10 0 M
;;;;
run;
Now we use the data step. We want a patient-level dataset at the end, where we have eye-level now. So we create a new patient-level status.
data patient_level;
set have;
by patientID;
retain patient_status;
if first.patientID then patient_status =0;
patient_status = (patient_Status or status);
if last.patientID then output;
keep patientID patient_Status gender;
run;
Now, we can run your second proc freq. Also note you have a nice dataset of patients.
title "Patients with/without condition in any eye";
proc freq data=patient_level;
tables patient_status;
run;
title;
You also may be able to do your chi-square analysis, though I'm not a statistician and won't dip my toe into whether this is an appropriate analysis. It's likely better than your first, anyway - as it correctly identifies has/doesn't have status in at least one eye. You may need a different indicator, if you need to know number of eyes.
title "Crosstab of gender by patient having/not having condition";
proc freq data=patient_level;
tables gender*patient_Status/chisq;
run;
title;
If your actual data has every single patient having the condition, of course, it's unlikely a chi-square analysis is appropriate.

Modifying data in SAS: copying part of the value of a cell, adding missing data and labeling it

I have three different questions about modifying a dataset in SAS. My data contains: the day and the specific number belonging to the tag which was registred by an antenna on a specific day.
I have three separate questions:
1) The tag numbers are continuous and range from 1 to 560. Can I easily add numbers within this range which have not been registred on a specific day. So, if 160-280 is not registered for 23-May and 40-190 for 24-May to add these non-registered numbers only for that specific day? (The non registered numbers are much more scattered and for a dataset encompassing a few weeks to much to do by hand).
2) Furthermore, I want to make a new variable saying a tag has been registered (1) or not (0). Would it work to make this variable and set it to 1, then add the missing variables and (assuming the new variable is not set for the new number) set the missing values to 0.
3) the last question would be in regard to the format of the registered numbers which is along the line of 528 000000000400 and 000 000000000054. I am only interested in the last three digits of the number and want to remove the others. If I could add the missing numbers I could make a new variable after the data has been sorted by date and the original transponder code but otherwise what would you suggest?
I would love some suggestions and thank you in advance.
I am inventing some data here, I hope I got your questions right.
data chickens;
do tag=1 to 560;
output;
end;
run;
data registered;
input date mmddyy8. antenna tag;
format date date7.;
datalines;
01012014 1 1
01012014 1 2
01012014 1 6
01012014 1 8
01022014 1 1
01022014 1 2
01022014 1 7
01022014 1 9
01012014 2 2
01012014 2 3
01012014 2 4
01012014 2 7
01022014 2 4
01022014 2 5
01022014 2 8
01022014 2 9
;
run;
proc sql;
create table dates as
select distinct date, antenna
from registered;
create table DatesChickens as
select date, antenna, tag
from dates, chickens
order by date, antenna, tag;
quit;
proc sort data=registered;
by date antenna tag;
run;
data registered;
merge registered(in=INR) DatesChickens;
by date antenna tag;
Registered=INR;
run;
data registeredNumbers;
input Numbers $16.;
datalines;
528 000000000400
000 000000000054
;
run;
data registeredNumbers;
set registeredNumbers;
NewNumbers=substr(Numbers,14);
run;
I do not know SAS, but here is how I would do it in SQL - may give you an idea of how to start.
1 - Birds that have not registered through pophole that day
SELECT b.BirdId
FROM Birds b
WHERE NOT EXISTS
(SELECT 1 FROM Pophole_Visits p WHERE b.BirdId = p.BirdId AND p.date = ????)
2 - Birds registered through pophole
If you have a dataset with pophole data you can query that to find if a bird has been through. What would you flag be doing - finding a bird that has never been through any popholes? Looking for dodgy sensor tags or dead birds?
3 - Data code
You might have more joy with the SUBSTRING function
Good luck

Creating "GDP in 1960" variable from GDP variables for different years

I'm pretty new to Stata.
I have a set of observations of the form "Country GDP Year". I want to create a new variable GDP1960, which gives the GDP in 1960 of each country for each year:
USA $100m 1960 USA $100m 1960 $100m
USA $200m 1965 --> USA $200m 1965 $100m
Canada $60m 1960 Canada $60m 1960 $60m
What's the right syntax to make this happen? (I assume egen is involved in some mysterious way)
You've found a solution with cond(), but here's a couple of suggestions that might make modeling your data easier and help you avoid problems with issues that might arise when sorting by creating your rank variable (and I've got the egen solution that you asked about below):
Paste the code below into your do-file editor and run it:
*---------------------------------BEGIN EXAMPLE
clear
inp str20 country str10 gdp year
"USA" "$100m" 1960
"USA" "$200m" 1965
"Canada" "$60m" 1960
"Canada" "$120m" 1965
"USA" "$250m" 1970
"Mexico" "$90m" 1970
"Canada" "$800m" 1970
"Mexico" "$160m" 1960
"Mexico" "$220m" 1965
"Mexico" "$350m" 1975
end
//1. destring gdp so that we can work with it
destring gdp, ignore("$", "m") replace
//2. Create GDP for 1960 var:
bys country: g x = gdp if year==1960
bys country: egen gdp60 = max(x)
drop x
**you could also create balanced panels to see gaps in your data**
preserve
ssc install panels
panels country year
fillin country year
li //take a look at the results win. to see how filled panel data would look
restore
//3. create a gdp variable for each year (reshape the dataset)
drop gdp60
reshape wide gdp, i(country) j(year)
**much easier to use this format for modeling
su gdp1970
**here's a fake "outcome" or response variable to work with**
g outcome = 500+int((1000-500+1)*runiform())
anova outcome gdp1960-gdp1970 //or whatever makes sense for your situation
*---------------------------------END EXAMPLE
A one-line solution is
egen gdp60 = mean(gdp / (year == 1960)), by(country)
The trick here is the division by the expression year == 1960. This is true for 1960, in which case we divide by 1, which leaves the gdp for that year unchanged. It is false for all other years, in which case we divide by 0. That sounds crazy, but the consequence whenever we divide by zero is just missing values, which will be ignored by egen's mean() function.
You could use other egen functions, as in this case there should be at most one value for 1960 for each country, so e.g. max(), min(), total() should all work too. (If a country has no value for 1960, or a missing value, we will end up with missing, which is precisely as it should be.)
Discussion at http://www.stata-journal.com/article.html?article=dm0055
Well, I found a solution in the end. It relies on the fact that generate and replace work on the data in its sorted order, and that you can refer to the current observation with _n.
gen rank = 100
replace rank = 50 if year == 1960
gen gdp60 = .
sort country rank
replace gdp60 = cond(iso == iso[_n-1], gdp60[_n-1], gdp[_n])
drop rank
sort country year
EDIT: A more direct solution with the same flavour:
gen wanted = year == 1960
bysort country (wanted) : gen gdp60 = gdp[_N]
drop wanted
sort country year
Here wanted will be 1 for 1960 and 0 otherwise.
I can't think of anything shorter than these two lines:
gen temp = gdp if year == 1960
by country : egen gdp60 = max(temp)
If you want a variable for each year (e.g., gdp60, gdp61, gdp62,...) then you probably should use reshape