Having an issue with stopping a loop in Stata

So, I have a bunch of variables in my data set which are binary and contain information on whether an individual was married or not. So, for example, marr79, is whether a person was married in 1979 or not.
I'm trying to find how many years a person was married (the first time) from the child's birth. So, if the child was born in 1980, and the person was married in 1980, it would add to child_marr, and it would do the same for the following 18 years of their life. I want it to stop, though, if it encounters a 0. So if there are 1's for 1980, 1981, and 1982, and a 0 for 1983, I want it to stop at 1983, even if there is a 1 in 1984.
My code below (and it is one of many iterations I've tried) either has it run through all the years without stopping, or never run at all, leaving values of all 0.
Any help is appreciated.
gen child_marr=0;
forvalues y=79(1)99 {;
gen temp_yr=1900+`y';
if (ch_yob<=temp_yr & marr`y'==1 & temp_yr<(ch_yob+18))==1 {;
replace child_marr = child_marr + 1;
};
else if (marr`y'==0 & ch_yob<=temp_yr) {;
continue, break;
};
drop temp_yr;
};

A few comments:
Your condition if (test1 & test2 & test3) == 1 does not need the == 1 portion -- Stata infers that if (condition) means if condition == 1 (caveat: for cases where the logical test is {0,1}).
There is no need to generate a temporary variable, since you can compare the value of a variable to a local macro directly.
To the issue at hand, your loop is comparing observation-level criteria (e.g., the value of the variable temp_yr to the value of the variable ch_yob). This can seem correct, but is often problematic -- see Stata FAQ: if command versus if qualifier.
A first pass at a solution would be to recode your forvalues loop to use the if qualifier rather than the if command:
gen child_marr = 0
forvalues y = 79/99 {
    local yr = 1900 + `y'
    replace child_marr = child_marr + 1 if (ch_yob <= `yr') & (marr`y' == 1) & (`yr' < (ch_yob + 18))
}
But as mentioned, a concrete solution would be easier with a reproducible example.

Generating categorical variable

In my Stata data set, the "alternative" variable consists of 4 modes including pier, private, beach and charter.
I want to generate new variable y as follows:
We collapse the model to three alternatives and order the alternatives, with y = 0 if fishing from a pier or beach, y = 1 if fishing from a private boat and y = 2 if fishing from a charter.
I tried to do this by following the Stata tips on this website, but I couldn't solve it.
Note: I don't fully understand the dataset, and I get an error related to the variable type when generating the variable. I downloaded the dataset from https://www.stata-press.com/data/musr/musr.zip; the data file is mus15data.
The variables in the dataset are as follows:
Here, the "mode" variable holds the alternatives.
If I understand correctly, this is
gen y = 0 if inlist(1, dbeach, dpier)
* gen y = 0 if dbeach == 1 | dpier == 1
replace y = 1 if dprivate == 1
replace y = 2 if dcharter == 1
Many other solutions are possible. Here is one more.
gen y = cond(inlist(1, dbeach, dpier), 0, 2 * (dcharter == 1) + (dprivate == 1))
If all those variables are only ever 0 or 1 (and never missing) some simplifications are possible.
Go only with code you find clear and can explain to others.
I am assuming that pier, beach, private, charter are mutually exclusive. I've not checked with the dataset.

How to perform rolling window calculations without SSC packages

Goal: perform rolling window calculations on panel data in Stata with variables PanelVar, TimeVar, and Var1, where the window can change within a loop over different window sizes.
Problem: no access to SSC for the packages that would take care of this (like rangestat)
I know that
by PanelVar: gen Var1_1 = Var1[_n]
produces a copy of Var1 in Var1_1. So I thought it would make sense to try
by PanelVar: gen Var1SumLag = sum(Var1[(_n-3)/_n])
to produce a rolling window calculation for _n-3 to _n for the whole variable. But it fails to produce the results I want, it just produces zeros.
You could use sum(Var1) - sum(Var1[_n-3]), but I also want to be able to make the rolling window left justified (summing future observations) as well as right justified (summing past observations).
Essentially I would like to replicate Python's ".rolling().agg()" functionality.
In Stata _n is the index of the current observation. The expression (_n - 3) / _n yields -2 when _n is 1 and increases slowly with _n but is always less than 1. As a subscript applied to extract values from observations of a variable it always yields missing values given an extra rule that Stata rounds down expressions so supplied. Hence it reduces to -2, -1 or 0: in each case it yields missing values when given as a subscript. Experiment will show you that given any numeric variable say numvar references to numvar[-2] or numvar[-1] or numvar[0] all yield missing values. Otherwise put, you seem to be hoping that the / yields a set of subscripts that return a sequence you can sum over, but that is a long way from what Stata will do in that context: the / is just interpreted as division. (The running sum of missings is always returned as 0, which is an expression of missings being ignored in that calculation: just as 2 + 3 + . + 4 is returned as 9 so also . + . + . + . is returned as 0.)
A fairly general way to do what you want is to use time series operators, and this is strongly preferable to subscripts as (1) doing the right thing with gaps (2) automatically working for panels too. Thus after a tsset or xtset
L0.numvar + L1.numvar + L2.numvar + L3.numvar
yields the sum of the current value and the three previous and
L0.numvar + F1.numvar + F2.numvar + F3.numvar
yields the sum of the current value and the three next. If any of these terms is missing, the sum will be too; a work-around for that is to return say
cond(missing(L3.numvar), 0, L3.numvar)
More general code will require some kind of loop.
Given a desire to loop over lags (negative) and leads (positive) some code might look like this, given a range of subscripts as local macros i <= j
* example i and j
local i = -3
local j = 0
gen double wanted = 0
forval k = `i'/`j' {
    if `k' < 0 {
        local k1 = -(`k')
        replace wanted = wanted + L`k1'.numvar
    }
    else replace wanted = wanted + F`k'.numvar
}
Alternatively, use Mata.
EDIT There's a simpler method, to use tssmooth ma to get moving averages and then multiply up by the number of terms.
tssmooth ma wanted1=numvar, w(3 1)
tssmooth ma wanted2=numvar, w(0 1 3)
replace wanted1 = 4 * wanted1
replace wanted2 = 4 * wanted2
Note that in contrast to the method above tssmooth ma uses whatever is available at the beginning and end of each panel. So, the first moving average, the average of the first value and the three previous, is returned as just the first value at the beginning of each panel (when the three previous values are unknown).

How to create a variable taking value X+1 if an event doesn't occur in X periods?

How can I create a new variable that takes value X+1 if an event doesn't occur in X periods of time?
Specifically, I have data on many people over 12 years. For one question, they could answer yes (1) or no (0). I care about the first time someone says Yes during the 12 years, and I created a variable that takes the value of the number of years with Yes replies.
But if someone replies No for all 12 years, I want to set that variable equal to 13. I'm stuck on how to do that.
by hhidpn (wave), sort: gen byte EarlyHeart = sum(rhearte) == 1
gen EarlyHeart1=year if EarlyHeart==1
(what's next?)
If the last cumulative sum for an individual is 0, then they all are.
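* cumheart (an illustrative name) holds the running count of Yes replies
by hhidpn (wave), sort: gen cumheart = sum(rhearte)
by hhidpn: gen byte EarlyHeart = cumheart == 1
by hhidpn: replace EarlyHeart = 13 if cumheart[_N] == 0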

Nearest Neighbor Matching in Stata

I need to program a nearest neighbor algorithm in Stata from scratch because my dataset does not allow me to use any of the available solutions (as far as I am aware).
To be precise: I have a dataset with a structure similar to the following (the original has around 14k observations)
input id value treatment match
1 0.14 0 .
2 0.32 0 .
3 0.465 1 2
4 0.878 1 2
5 0.912 1 2
6 0.001 1 1
end
I want to generate a variable called match (already included in the example above). For each observation with treatment == 1 the variable match should store the id of another observation from within treatment == 0 whose value is closest to value of the considered observation (treatment == 1).
I am new to stata programming, so I am not yet familiar with the syntax. My first shot is the following however it does not produce any changes to the match variable. I am sure this is a novice question but I am hoping for some advice on how to make the code running.
EDIT: I have changed the code slightly and now it seems to work. Do you see any problems that may arise if I run it on a bigger dataset?
set more off
clear all
input id pscore treatment
1 0.14 0
2 0.32 0
3 0.465 1
4 0.878 1
5 0.912 1
6 0.001 1
end
gen match = .
forval i = 1/`= _N' {
if treatment[`i'] == 1 {
local dist 1
forvalues j = 1/`= _N' {
if (treatment[`j'] == 0) {
local current_dist (pscore[`i'] - pscore[`j'])^2
if `dist' > `current_dist' {
local dist `current_dist' // update smallest distance
replace match = id[`j'] in `i' // write match
}
}
}
}
}
Consider some simulated data: 1,000 observations, 200 of them untreated (treat == 0) and the rest treated (treat == 1). Then the code included below will be much more efficient than the code originally posted. (Ties, as in your code, are not explicitly handled.)
clear
set more off
*----- example data -----
set obs 1000
set seed 32956
gen id = _n
gen pscore = runiform()
gen treat = cond(_n <= 200, 0, 1)
*----- new method -----
timer clear
timer on 1
// get id of last non-treated and first treated
// (data is sorted by treat and ids are consecutive)
bysort treat (id): gen firsttreat = id[1]
local firstt = firsttreat[_N]
local lastnt = `firstt' - 1
// start loop
gen match = .
gen dif = .
quietly forvalues i = `firstt'/`=_N' {
    // compute distances
    replace dif = (pscore[`i'] - pscore)^2
    summarize dif in 1/`lastnt', meanonly
    // identify id of minimum-distance observation
    replace match = . in 1/`lastnt'
    replace match = id in 1/`lastnt' if dif == r(min)
    summarize match in 1/`lastnt', meanonly
    // save the minimum-distance id
    replace match = r(max) in `i'
}
// clean variable and drop
replace match = . in 1/`lastnt'
drop dif firsttreat
timer off 1
tempfile first
save `first'
*----- your method -----
drop match
timer on 2
gen match = .
quietly forval i = 1/`= _N' {
if treat[`i'] == 1 {
local dist 1
forvalues j = 1/`= _N' {
if (treat[`j'] == 0) {
local current_dist (pscore[`i'] - pscore[`j'])^2
if `dist' > `current_dist' {
local dist `current_dist' // update smallest distance
replace match = id[`j'] in `i' // write match
}
}
}
}
}
timer off 2
tempfile second
save `second'
// check for equality of results
cf _all using `first'
// check times
timer list
The results in seconds to finish execution:
. timer list
1: 0.19 / 1 = 0.1930
2: 10.79 / 1 = 10.7900
The difference is huge, especially considering this data set has only 1,000 observations.
An interesting thing to notice is that as the number of non-treated cases increases relative to the number of treated, the original method improves, but never reaches the efficiency of the new method. As an example, invert the number of cases, so there are now 800 untreated and 200 treated (change the data setup to gen treat = cond(_n <= 800, 0, 1)). The result is
. timer list
1: 0.07 / 1 = 0.0720
2: 4.45 / 1 = 4.4470
You can see that the new method also improves and is still much faster. In fact, the relative difference is roughly the same (about 56 times faster in the first run, about 62 times here).
Another way to do this is using joinby or cross. The problem is that they temporarily expand (a lot) the size of your dataset. In many cases, they are not feasible due to the hard limit Stata has on the number of possible observations (see help limits). You can find an example of joinby here: https://stackoverflow.com/a/19784222/2077064.
Edit
If there's a large number of treated relative to untreated, your code suffers because you go through the whole first loop many more times (due to the first if). Furthermore, going through that whole loop once, implies going through another loop that has itself two if conditions, _N more times.
The opposite case in which there are few treated observations means that you go through the whole first loop only in a small number of occasions, speeding up your code substantially.
The reason my code can maintain its efficiency is due to the use of in. This always offers speed gains over if. Stata will go directly to those observations with no logical checking needed. Your problem provides an opportunity for that replacement and it's wise to seize it.
If my code used if where in is in place, the results would be different. Your code would be faster for the case in which there's a large number of untreated relative to treated, and again, that is because in your code there would not be the need to go through the complete loop, requiring very little work; the first loop is short-circuited with the first if. For the opposite case, my code would still dominate.
The key is to "separate" treated from untreated and work on each group using in.

Is it possible to define a variable in expression in C++?

I have this insane homework where I have to create an expression to validate date with respect to Julian and Gregorian calendar and many other things ...
The problem is that it must be all in one expression, so I can't use any ;
Are there any options of defining variable in expression? Something like
d < 31 && (bool leapyear = y % 4 == 0) || (leapyear ? d % 2 : 3) ....
where I could define and initialize one or more variables and use them in that one expression without using any ;?
Edit: It is explicitly said, that it must be a one-line expression. No functions ..
The way I'm doing it right now is writing macros and expanding them, so I end up with stuff like this
#define isJulian(d, m, y) ((y) < 1752 || ((y) == 1752 && ((m) < 9 || ((m) == 9 && (d) <= 2))))
#define isJulianLeapYear(y) ((y) % 4 == 0)
#define isGregorian(d, m, y) ((y) > 1752 || ((y) == 1752 && ((m) > 9 || ((m) == 9 && (d) > 13))))
#define isGregorianLeapYear(y) (((y) % 4 == 0 && (y) % 100 != 0) || (y) % 400 == 0)
// etc etc ....
looks like this is the only suitable way to solve the problem
edit: Here is original question
Suppose we have variables d m and y containing day, month and year. Task is to write one single expression which decides, if date is valid or not. Value should be true (non-zero value) if date is valid and false (zero) if date is not valid.
This is an example of expression (correct expression would look something like this):
d + 4 == y ^ 85 ? ~m : d * (y-2)
These are examples of wrong answers (not expressions):
if ( log ( d ) == 1752 ) m = 1;
or:
for ( i = 0; i < 32; i ++ ) m = m / 2;
Submit only file containing only one single expression without ; at the end. Don't submit commands or whole program.
Until 2.9.1752 the Julian calendar was in use; after that date, the Gregorian calendar.
In the Julian calendar, every year divisible by 4 is a leap year.
In the Gregorian calendar, a leap year is every year that is divisible by 4 but not divisible by 100. Years that are divisible by 400 are a further exception and are leap years.
1800, 1801, 1802, 1803, 1805, 1806, ..., 1899, 1900, 1901, ..., 2100, ..., 2200 are not leap years.
1896, 1904, 1908, ..., 1996, 2000, 2004, ..., 2396, 2400 are leap years.
In September 1752 there is another exception: 2.9.1752 was followed by 14.9.1752, so the dates 3.9.1752, 4.9.1752, ..., 13.9.1752 are not valid.
((m >0)&&(m<13)&&(d>0)&&(d<32)&&(y!=0)&&(((d==31)&&
((m==1)||(m==3)||(m==5)||(m==7)||(m==8)||(m==10)||(m==12)))
||((d<31)&&((m!=2)||(d<29)))||((d==29)&&(m==2)&&((y<=1752)?((y%4)==0):
((((y%4)==0)&&((y%100)!=0))
||((y%400)==0)))))&&(((y==1752)&&(m==9))?((d<3)||(d>13)):true))
<evil>
Why would you define a new one, if you can reuse an existing one? errno is a perfect temporary variable.
</evil>
I think the intent of the homework is to ask you to do this without using variables, and what you are trying to do might defeat its purpose.
In standard C++, this is not possible. G++ has an extension known as statement expressions that can do that.
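For completeness: since C++11, an immediately invoked lambda comes close to a local variable inside an expression in standard C++. Its body still contains statements and semicolons, so it would not meet the assignment's constraint. The sketch below is only an illustration; the function name febDayOk is hypothetical, and it assumes the Gregorian leap-year rule for every year.

```cpp
#include <cassert>

// Hypothetical helper: is day d a valid day of February in year y?
// Assumes the Gregorian rule for all years (no Julian handling).
bool febDayOk(int d, int y)
{
    // The lambda is defined and called inside the return expression;
    // leapyear is a local variable that exists only within the lambda.
    return d >= 1 && [&]() -> bool {
        bool leapyear = (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
        return d <= (leapyear ? 29 : 28);
    }();
}
```

Called as febDayOk(29, 2000), the lambda computes leapyear once and uses it in the day comparison.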
I don't believe you can, but even if you could, it would only have scope inside of the parentheses they are defined in (in your example) and cannot be used outside of them.
Your solution, which I will not provide fully for you, will probably go along these lines:
isJulian ? isJulianLeapyear : isGregorianLeapyear
To make it more specific, it could be like this:
isJulian ? (year % 4) == 0 : (((year % 4) == 0 && (year % 100) != 0) || (year % 400) == 0)
You'll just have to make sure your algorithm is correct. I'm not a calendar expert, so I wouldn't know about that.
First: Don't. It may be cute, but even if there's an extension that allows it, code golf is a dangerous game that will almost always end up causing more grief than it solves.
Okay, back to the 'real' question as defined by the homework. Can you make additional functions? If so, instead of capturing whether or not it's a leap year in a variable, make a function isLeapYear(int year) that returns the correct value.
Yes, that means you'll calculate it more than once. If that ends up being a performance issue, I'd be incredibly surprised... and it's a premature optimization to worry about that in the first place.
I'd be very surprised if you weren't allowed to write functions as part of doing this. It seems like that'd be half the point of an exercise like this.
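A minimal sketch of the isLeapYear(int year) function suggested above, written as a single return expression so it can later be pasted into a larger expression. Treating all of 1752 as Julian for this test is an assumption made here; it is harmless because 1752 is a leap year under both rules. The checkLeapYears function is a hypothetical helper added only to exercise it.

```cpp
#include <cassert>

// Leap-year test for the calendar in force in the given year:
// Julian rule up to and including 1752, Gregorian rule afterwards.
bool isLeapYear(int year)
{
    return year <= 1752
        ? year % 4 == 0
        : (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

// Hypothetical helper that spot-checks a few years from the assignment text.
void checkLeapYears()
{
    assert(isLeapYear(1700));   // Julian: divisible by 4
    assert(!isLeapYear(1800));  // Gregorian: divisible by 100, not by 400
    assert(isLeapYear(1896));
    assert(isLeapYear(2000));   // Gregorian: divisible by 400
}
```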
......
Okay, so here's a quick overview of what you'll need to do.
First, basic verification - that month, day, year are possible values at all - month 0-11 (assuming 0-based), day 0-30, year non-negative (assuming that's a constraint).
Once you're past that, I'd probably check for the 1752 special cases.
If that's not relevant, the 'regular' months can be handled pretty simply.
This leaves us with the leap year cases, which can be broken down into two expressions - whether something is a leap year (which will be broken down additionally based on gregorian/julian), and whether the date is valid at that point.
So, at the highest level, your expression looks something like this:
areWithinRange(d,m,y) && passes1752SpecialCases(d,m,y) && passes30DayMonths(d,m,y) && passes31DayMonths(d,m,y) && passesFebruaryChecks(d,m,y)
If we assume that we only return false from our sub-expressions if we actively detect a rule break (31 days in June for the 30DayMonth rule returns false, but 30 days in February is irrelevant and so passes true), then we can pretty much say that the logic at that level is correct.
At this point, I'd write separate functions for the individual pieces (as pure expressions, a single return ... statement). Once you've gotten those in place, you can replace the method call in your top-level expression with the expanded version. Just make sure you parenthesize (is that a word?) everything sufficiently.
I'd also make a test harness program that uses the expression and has a number of valid and invalid inputs, and verifies that you're doing the right thing. You can write that in a function for ease of cut and paste for the final turn-in by doing something like:
bool isValidDate(int d, int m, int y)
{
    return
        // your expression here
        ;
}
Since the expression will be on a line by itself, it'll be easy to cut and paste.
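One way such a harness might look, using the single expression quoted earlier in this thread as the body (only re-indented). The runChecks function and the chosen test dates are illustrative assumptions, not part of the assignment.

```cpp
#include <cassert>

bool isValidDate(int d, int m, int y)
{
    return
        (m > 0) && (m < 13) && (d > 0) && (d < 32) && (y != 0)
        && (((d == 31) && ((m == 1) || (m == 3) || (m == 5) || (m == 7)
                           || (m == 8) || (m == 10) || (m == 12)))
            || ((d < 31) && ((m != 2) || (d < 29)))
            || ((d == 29) && (m == 2)
                && ((y <= 1752) ? ((y % 4) == 0)
                                : ((((y % 4) == 0) && ((y % 100) != 0))
                                   || ((y % 400) == 0)))))
        && (((y == 1752) && (m == 9)) ? ((d < 3) || (d > 13)) : true);
}

// Hypothetical harness: a few valid and invalid dates (day, month, year).
void runChecks()
{
    assert(isValidDate(29, 2, 2000));    // Gregorian, divisible by 400
    assert(!isValidDate(29, 2, 1900));   // Gregorian century year
    assert(isValidDate(29, 2, 1700));    // Julian, divisible by 4
    assert(isValidDate(2, 9, 1752));     // last Julian day
    assert(!isValidDate(3, 9, 1752));    // skipped in the calendar switch
    assert(isValidDate(14, 9, 1752));    // first Gregorian day
    assert(!isValidDate(31, 4, 2000));   // April has 30 days
    assert(!isValidDate(0, 1, 2000));    // day out of range
}
```

Calling runChecks() from main after every change to the expression gives a quick signal that a simplification has not broken one of the special cases.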
You may find other ways to simplify your logic - excepting the 1752 special cases, days between 1 and 28 are always valid, for instance.
Considering it is homework, I think the best advice would be an approach to deriving your own solution.
If I were tackling this assignment, I would start by breaking the rules.
I would write a C++ function that, given the variables d, m, and y, returns a boolean result on the validity of the date. I would use as many non-recursive helper functions as needed, and feel free to use if, else if, and else, but no looping allowed.
I would then inline all helper functions.
I would reduce all if, else if, and else statements to ? : notation.
If I were successful at limiting my use of variables, I might be able to reduce all this to a single function with no variables, the body of which would contain the single expression I seek.
Good Luck.
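The reduction step described above can be sketched on a small piece of the problem. Both function names and the leap parameter are hypothetical; the first version uses if / else if / else freely, and the second is the same logic collapsed into one ? : expression.

```cpp
#include <cassert>

// Step 1: "breaking the rules" with ordinary control flow.
int daysInMonthWithIfs(int m, bool leap)
{
    if (m == 2) {
        if (leap)
            return 29;
        return 28;
    } else if (m == 4 || m == 6 || m == 9 || m == 11) {
        return 30;
    } else {
        return 31;
    }
}

// Step 2: the same decisions written as a single conditional expression.
int daysInMonthExpr(int m, bool leap)
{
    return m == 2 ? (leap ? 29 : 28)
         : (m == 4 || m == 6 || m == 9 || m == 11) ? 30
         : 31;
}
```

Comparing the two versions over every month, for leap and non-leap years, is a cheap way to confirm that the rewrite preserved the behaviour before inlining it further.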
You clearly have to have the date passed in somehow. Beyond that, all you're really going to be doing is chaining && and || (assuming we get the date as a tm struct):
#include <ctime>
bool validate(tm date)
{
    return (
        // sanity check: day is at least 1, month and year are non-negative
        date.tm_mday >= 1 && date.tm_mon >= 0 && date.tm_year >= 0
        // check for valid days
        && ((date.tm_mon == 0 && date.tm_mday <= 31)
            || (date.tm_mon == 1 && date.tm_mday <= (date.tm_year % 4 ? 28 : 29))
            || (date.tm_mon == 2 && date.tm_mday <= 31)
            // yadda yadda for the other months
            || (date.tm_mon == 11 && date.tm_mday <= 31))
    );
}
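The parentheses around date.tm_year % 4 ? 28 : 29 are required: the conditional operator has lower precedence than <= and &&, so without them the line would parse as (date.tm_mon == 1 && date.tm_mday <= date.tm_year % 4) ? 28 : 29.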
UPDATE
Looking at a comment, you'll also need similar rules to validate dates that don't exist in the Gregorian calendar.
UPDATE II
Since you're dealing with dates in the past you will need to implement a more correct leap year test. However, I generally deal with dates in the future, and this incorrect leap year test will give correct results in 2012, 2016, 2020, 2024, 2028, 2032, 2036, 2040, 2044, 2048, 2052, 2056, 2060, 2064, 2068, 2072, 2076, 2080, 2084, 2088, 2092, and 2096. I will make a prediction that before this test fails in 2100 silicon-based computers will be forgotten relics. I seriously doubt we will use C++ on the quantum computers in use then. Besides, I won't be the programmer assigned to fix the bug.
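A sketch of what a more correct test could look like for a tm value, assuming the Gregorian rule only (the Julian dates from the assignment would need the extra handling discussed elsewhere in this thread). The function name daysInFebruary is hypothetical. Note that tm_year counts years since 1900, so it is converted to a calendar year first.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical helper: number of days in February for the year in date,
// using the full Gregorian leap-year rule.
int daysInFebruary(const std::tm& date)
{
    return (((date.tm_year + 1900) % 4 == 0 && (date.tm_year + 1900) % 100 != 0)
            || (date.tm_year + 1900) % 400 == 0)
        ? 29
        : 28;
}
```

With this version the year 2100 (tm_year == 200) gets 28 days, which is the case the simpler % 4 test gets wrong.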