Determine time period since event (up to n leads/lags) - Stata

I want to measure the number of periods (here years) since an event occurred (here represented by indicator variable pos) up to a given number of leads and lags (here three).
The following code works, but seems hackish and like I'm missing something fundamental. Is there a more robust solution that takes advantage of built in functions or a better logic? I'm on 11.2. Thanks!
version 11.2
clear
* generate annual data
set obs 40
generate country = cond(_n <= 20, "USA", "UK")
bysort country: generate year = 1766 + _n
generate pos = 1 if (year == 1776)
* generate years since event (up to three)
encode country, generate(countryn)
xtset countryn year
generate time_to_pos = 0 if (pos == 1)
forvalues i = 1/3 {
    replace time_to_pos = `i' if (l`i'.pos == 1)
    replace time_to_pos = -1 * `i' if (f`i'.pos == 1)
}

Clear question.
This can be shortened. Here is one way, starting with your code to set up a sandpit:
version 11.2
clear
* generate annual data
set obs 40
generate country = cond(_n <= 20, "USA", "UK")
bysort country: generate year = 1766 + _n
Now it is
gen time_to_pos = year - 1776 if abs(1776 - year) <= 3
That is all that seems needed for your example. If you want to generalise to multiple events within each panel, I'd like to know the rules for such events.
I was going to show a trick from http://www.stata-journal.com/article.html?article=dm0055 but it doesn't appear needed.
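For what it's worth, here is a hedged sketch of a slightly more general version that avoids hard-coding the event year -- my own addition, assuming at most one event per panel and pos defined as in the question (eventyear is an invented variable name):
* event year within each panel, then signed distance from it
egen eventyear = min(cond(pos == 1, year, .)), by(country)
gen time_to_pos = year - eventyear if abs(year - eventyear) <= 3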

Transforming a logic constraint into Python PuLP code

I have been working on a problem for the past several days...
A company plans its business over a three-month period. It can produce
up to 110 units per month at a cost of 600 each. The minimum amount it must
produce per month is 15 units if active (but of course, it can choose to be
closed during the month and produce 0 units). Each month it can subcontract
the production of up to 60 units, at a cost of 660 each. Storing a unit costs
$20 per unit per month. The marketing department has forecast sales of
100, 130 and 150 units for the next three months, respectively.
The goal is to meet the demand each month while minimizing the total
cost.
I deduced that we need an objective function of the form min sum(i=1..3) [600*x1_i + 660*x2_i + 20*x3_i].
We need to add some constraints: x1_i >= 15, and 0 <= x2_i <= 60.
We will also need another constraint for each month...
For the first month, i=1 => x1 + x2 = 100 - x3last (x3last is an extra variable that should hold the amount in storage from the previous month), and the same kind of constraint for i=2 and i=3.
I don't have any idea how to write this in PuLP, and I would appreciate some help. Thx ^_^
I'd tend to agree with @Erwin that you should focus first on formulating the problem as a linear program. It is then easy to translate it into code in PuLP or one of many other LP libraries/tools/languages.
As an example of this, let's work through that process for the problem you have written out in your question.
Decision Variables
The first thing to decide is what you can/should decide. This set of information is called the decision variables. Picking the best/easiest decision variables for your problem comes with practice - the important thing is that once you know the values of the variables you have a unique solution to the problem.
Here I would suggest the following. These assume that the forecasts for demand are perfect. For each month i:
Whether the production line should be open - o[i]
How much to produce in that month - p[i]
How much to hold in storage for next month - s[i]
How much to get made externally - e[i]
Objective Function
The objective in your case is obvious - minimise the total cost. So we can just write it down: sum(i=1..3)[p[i]*600 + s[i]*20 + e[i]*660]
Constraints
Let's lift these directly out of your problem description:
"It can produce 110 units at a cost of 600 each. The minimum amount it must produce per month is 15 units if active (but of course, it can choose to be closed during the month, and produce 0 units)."
p[i] >= o[i]*15
p[i] <= o[i]*110
The first constraint forces the minimum production amount to be 15 if the line is open that month (o[i] == 1); if the line is not open, this constraint has no effect. The second constraint caps p[i] at 110 if the line is open that month and at 0 if it is closed (o[i] == 0).
"Each month it can subcotract the prodution of 60 units, at a cost of 660 each"
e[i] <= 60
"The marketing department has forcasted sales of 100, 130 and 150 units for the next three months, respectively. The goal is to meet the demand each month while minimizing the total cost." If we declare the sales in each mongth to be sales[i], we can define our "flow constraint" as:
p[i] + e[i] + s[i-1] == s[i] + sales[i]
The way to think of this constraint is inputs on the left and outputs on the right. Inputs of units are production, external production, and stock taken out of storage from last month. Outputs are units put into storage for next month and sales.
Finally in code:
from pulp import *
all_i = [1, 2, 3]
all_i_with_0 = [0, 1, 2, 3]
sales = {1: 100, 2: 130, 3: 150}
o = LpVariable.dicts('open', all_i, cat='Binary')
p = LpVariable.dicts('production', all_i, lowBound=0, cat='Continuous')
s = LpVariable.dicts('stored', all_i_with_0, lowBound=0, cat='Continuous')
e = LpVariable.dicts('external', all_i, lowBound=0, cat='Continuous')
prob = LpProblem("MinCost", LpMinimize)
prob += lpSum([p[i]*600 + s[i]*20 + e[i]*660 for i in all_i])  # objective
for i in all_i:
    prob += p[i] >= o[i]*15
    prob += p[i] <= o[i]*110
    prob += e[i] <= 60
    prob += p[i] + e[i] + s[i-1] == sales[i] + s[i]
prob += s[0] == 0  # no stock inherited from previous months
prob.solve()
# The status of the solution
print("Status:", LpStatus[prob.status])
# Display the optimum of each variable
for v in prob.variables():
    print(v.name, "=", v.varValue)
# Objective function value
print("Obj. Fcn: ", value(prob.objective))
Which returns:
Status: Optimal
external_1 = 0.0
external_2 = 10.0
external_3 = 40.0
open_1 = 1.0
open_2 = 1.0
open_3 = 1.0
production_1 = 110.0
production_2 = 110.0
production_3 = 110.0
stored_0 = 0.0
stored_1 = 10.0
stored_2 = 0.0
stored_3 = 0.0
Obj. Fcn: 231200.0
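As a sanity check, the objective value matches a hand computation: 330 units produced in-house at 600 each (198,000), 50 units subcontracted at 660 each (33,000), and 10 unit-months of storage at 20 (200), for a total of 231,200.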

How to generate item variables from total score variable

I want to simulate the item score from total score.
For example, I have generated a total score with values between 5 and 25. I would like to distribute this total score across five items, each scored on a 5-point Likert scale.
Then I used a while loop in Stata 15 to check the condition. The code takes too long to finish looping and I do not know whether I have made a mistake.
Perhaps someone would like to suggest another way to simulate the item score from the total score?
My code:
set obs 200
generate id=_n
generate u_i= rnormal(0, 0.5)
generate gr = runiform()>0.5
generate sex = runiform()>0.4
generate age = round(rnormal(65, 10))
expand 5
bysort id: generate time=_n
generate e_ij = rnormal(0, 1.0)
generate run=_n
*Generate Sum score 5-25
generate y = 3.0 + 2.0*gr + 0.2*age -1.2*sex + 0.5*time + u_i + e_ij
summarize y
replace y = round(y)
*Generate each item
forvalues k = 1(1)5 {
    generate item`k' = runiform(1, 5)
    replace item`k' = round(item`k')
}
egen sum_item=rowtotal(item1 item2 item3 item4 item5)
generate diff = y - sum_item
*Looping check if y=sum_item
forvalues a = 1(1)`=_N' {
    quietly gsort -diff
    while sum_item != y[`a'] {
        replace sum_item = . if sum_item != y[_n]
        forvalues k = 1(1)5 {
            replace item`k' = . if sum_item == .
            replace item`k' = runiform(1, 5) if item`k' == .
            replace item`k' = round(item`k')
        }
        replace sum_item = item1 + item2 + item3 + item4 + item5 if sum_item == .
        replace diff = y - sum_item
        if (sum_item == y[`a']) continue, break
    }
}
The expected data would have each row's five item scores (item1-item5) summing to its total score y. As it stands, after running the loop there are always 2-4 cases for which the program keeps regenerating the item scores until the diff variable equals zero, which takes very long.
If I'm understanding correctly, you could loop something like the following (after setting all the items to initial values of 1, since possible values are 1 to 5):
capture generate rand_int = 0
replace rand_int = floor( 5 * runiform() + 1 ) // random int, 1 to 5
capture generate cnd = 0
forvalues k = 1(1)5 {
    replace cnd = rand_int == `k' & sum_item < y & item`k' < 5
    replace item`k' = item`k' + 1 if cnd
}
replace sum_item = item1+item2+item3+item4+item5
In words, that says: if sum_item < y, then randomly add 1 to one of the items (as long as that item is not already equal to 5), and then keep doing it until sum_item == y for all rows.
That's going to converge in roughly 20 iterations if the max value of y is 25 and items run from 1 to 5. I say "roughly" because there is a little waste here when the random draw picks an item that is already equal to 5. You could add some extra code for that, but I wouldn't bother if this is fast enough. E.g. for high values of sum_item it would be more efficient to start with initial values of 5 and randomly subtract 1 until it converges.
I'm not enough of a statistician to say that's the best or even an adequate way to do it, but intuitively it seems OK if you want a fairly uniform distribution of values. If you wanted the modal value to be 4, for example, that's a lot harder and not really a programming question any longer.
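For concreteness, here is a minimal sketch of the outer loop implied above -- my own wrapper around the snippet, assuming the items were initialized to 1 and sum_item computed from them:
* repeat the random increments until every row satisfies sum_item == y
quietly count if sum_item < y
while r(N) > 0 {
    replace rand_int = floor( 5 * runiform() + 1 )
    forvalues k = 1(1)5 {
        replace cnd = rand_int == `k' & sum_item < y & item`k' < 5
        replace item`k' = item`k' + 1 if cnd
    }
    replace sum_item = item1+item2+item3+item4+item5
    quietly count if sum_item < y
}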

Combining indicator variables

I am using the CCES 2016 dataset.
I am only interested in whites who have a high school diploma or less (that is, no college), and who identify as democrats.
The three variables are race, educ, and pid3:
race = 1 if white
educ = 2 if high school diploma and educ = 1 if not
pid3 = 1 if democrat
I would like to create a new variable made up of people who selected 1 for race; 1 or 2 for educ; and 1 for pid3.
What commands should I type in Stata 13 to achieve this?
For a (0, 1) indicator, consider
generate wanted = (race == 1) & inlist(educ, 1, 2) & (pid3 == 1)
See
the help for operators and inlist()
https://www.stata.com/support/faqs/data-management/true-and-false/
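If the goal is then to keep only those observations rather than just flag them, a natural follow-up (my addition, not part of the original answer) is:
keep if wanted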

Having an issue with stopping a loop in Stata

So, I have a bunch of variables in my data set which are binary and contain information on whether an individual was married or not. So, for example, marr79, is whether a person was married in 1979 or not.
I'm trying to find how many years a person was married (the first time) from the child's birth. So, if the child was born in 1980, and the person was married in 1980, it would add to child_marr, and it would do the same for the following 18 years of their life. I want it to stop, though, if it encounters a 0. So if there are 1's for 1980, 1981, and 1982, and a 0 for 1983, I want it to stop at 1983, even if there is a 1 in 1984.
My code below (one of many iterations I've tried) either runs through all the years without stopping, or never runs at all, leaving values of all 0.
Any help is appreciated.
gen child_marr=0;
forvalues y=79(1)99 {;
    gen temp_yr=1900+`y';
    if (ch_yob<=temp_yr & marr`y'==1 & temp_yr<(ch_yob+18))==1 {;
        replace child_marr = child_marr + 1;
    };
    else if (marr`y'==0 & ch_yob<=temp_yr) {;
        continue, break;
    };
    drop temp_yr;
};
A few comments:
Your condition if (test1 & test2 & test3) == 1 does not need the == 1 portion -- Stata infers that if (condition) means if condition == 1 (caveat: for cases where the logical test is {0,1}).
There is no need to generate a temporary variable, since you can compare the value of a variable to a local macro directly.
To the issue at hand, your loop is comparing observation-level criteria (e.g., the value of the variable temp_yr to the value of the variable ch_yob). This can seem correct, but is often problematic -- see Stata FAQ: if command versus if qualifier.
A first pass at a solution would be to recode your forvalues loop to use the if qualifier rather than the if command:
gen child_marr = 0
forvalues y = 79/99 {
    local yr = 1900 + `y'
    replace child_marr = child_marr + 1 if (ch_yob <= `yr') & (marr`y' == 1) & (`yr' < (ch_yob + 18))
}
But as mentioned, a concrete solution would be easier with a reproducible example.
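To add the stopping rule the question asks for (stop at the first 0 after the child's birth), here is a hedged sketch along the same lines; the flag variable still_married is my own invention, and this assumes the marr`y' variables are 0/1 and nonmissing:
gen child_marr = 0
gen byte still_married = 1
forvalues y = 79/99 {
    local yr = 1900 + `y'
    * once a 0 appears at or after the child's birth year, stop counting
    replace still_married = 0 if (ch_yob <= `yr') & (marr`y' == 0)
    replace child_marr = child_marr + 1 ///
        if still_married & (ch_yob <= `yr') & (marr`y' == 1) & (`yr' < ch_yob + 18)
}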

Nearest Neighbor Matching in Stata

I need to program a nearest-neighbor algorithm in Stata from scratch because my dataset does not allow me to use any of the available solutions (as far as I can tell).
To be precise, I have a dataset similar in structure to the following (the original has around 14k observations):
input id value treatment match
1 0.14 0 .
2 0.32 0 .
3 0.465 1 2
4 0.878 1 2
5 0.912 1 2
6 0.001 1 1
end
I want to generate a variable called match (already included in the example above). For each observation with treatment == 1, the variable match should store the id of another observation from within treatment == 0 whose value is closest to the value of the considered observation (treatment == 1).
I am new to Stata programming, so I am not yet familiar with the syntax. My first attempt is the following; however, it does not produce any changes to the match variable. I am sure this is a novice question, but I am hoping for some advice on how to get the code running.
EDIT: I have changed the code slightly and now it seems to work. Do you see any problems that may arise if I run it on a bigger dataset?
set more off
clear all
input id pscore treatment
1 0.14 0
2 0.32 0
3 0.465 1
4 0.878 1
5 0.912 1
6 0.001 1
end
gen match = .
forval i = 1/`= _N' {
    if treatment[`i'] == 1 {
        local dist 1
        forvalues j = 1/`= _N' {
            if (treatment[`j'] == 0) {
                local current_dist (pscore[`i'] - pscore[`j'])^2
                if `dist' > `current_dist' {
                    local dist `current_dist' // update smallest distance
                    replace match = id[`j'] in `i' // write match
                }
            }
        }
    }
}
Consider some simulated data: 1,000 observations, 200 of them untreated (treat == 0) and the rest treated (treat == 1). Then the code included below will be much more efficient than the one originally posted. (Ties, as in your code, are not explicitly handled.)
clear
set more off
*----- example data -----
set obs 1000
set seed 32956
gen id = _n
gen pscore = runiform()
gen treat = cond(_n <= 200, 0, 1)
*----- new method -----
timer clear
timer on 1
// get id of last non-treated and first treated
// (data is sorted by treat and ids are consecutive)
bysort treat (id): gen firsttreat = id[1]
local firstt = firsttreat[_N]
local lastnt = `firstt' - 1
// start loop
gen match = .
gen dif = .
quietly forvalues i = `firstt'/`=_N' {
    // compute distances
    replace dif = (pscore[`i'] - pscore)^2
    summarize dif in 1/`lastnt', meanonly
    // identify id of minimum-distance observation
    replace match = . in 1/`lastnt'
    replace match = id in 1/`lastnt' if dif == r(min)
    summarize match in 1/`lastnt', meanonly
    // save the minimum-distance id
    replace match = r(max) in `i'
}
// clean variable and drop
replace match = . in 1/`lastnt'
drop dif firsttreat
timer off 1
tempfile first
save `first'
*----- your method -----
drop match
timer on 2
gen match = .
quietly forval i = 1/`= _N' {
    if treat[`i'] == 1 {
        local dist 1
        forvalues j = 1/`= _N' {
            if (treat[`j'] == 0) {
                local current_dist (pscore[`i'] - pscore[`j'])^2
                if `dist' > `current_dist' {
                    local dist `current_dist' // update smallest distance
                    replace match = id[`j'] in `i' // write match
                }
            }
        }
    }
}
timer off 2
tempfile second
save `second'
// check for equality of results
cf _all using `first'
// check times
timer list
The results in seconds to finish execution:
. timer list
1: 0.19 / 1 = 0.1930
2: 10.79 / 1 = 10.7900
The difference is huge, especially considering this data set has only 1,000 observations.
An interesting thing to notice is that as the number of untreated cases increases relative to the number of treated, the original method improves, but never reaches the efficiency of the new method. As an example, invert the numbers of cases, so there are now 800 untreated and 200 treated (change the data setup to gen treat = cond(_n <= 800, 0, 1)). The result is
. timer list
1: 0.07 / 1 = 0.0720
2: 4.45 / 1 = 4.4470
You can see that the new method also improves and is still much faster. In fact, the relative difference is still the same.
Another way to do this is using joinby or cross. The problem is that they temporarily expand (a lot) the size of your dataset. In many cases they are not feasible due to the hard limit Stata has on the number of possible observations (see help limits). You can find an example of joinby here: https://stackoverflow.com/a/19784222/2077064.
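For illustration, here is a minimal sketch of the cross route on the simulated data above -- my own, with id0 and pscore0 as assumed names for the renamed untreated columns:
preserve
keep if treat == 0
keep id pscore
rename (id pscore) (id0 pscore0)   // untreated pool, renamed to avoid clashes
tempfile controls
save `controls'
restore
keep if treat == 1
cross using `controls'             // every treated x untreated pair
gen dif = (pscore - pscore0)^2
bysort id (dif): keep if _n == 1   // nearest untreated for each treated
rename id0 match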
Edit
If there's a large number of treated relative to untreated cases, your code suffers because it goes through the whole first loop many more times (due to the first if). Furthermore, each pass through that loop implies going through an inner loop, which itself has two if conditions, _N more times. The opposite case, in which there are few treated observations, means that you go through the whole first loop on only a small number of occasions, speeding up your code substantially.
The reason my code can maintain its efficiency is the use of in. This always offers speed gains over if: Stata goes directly to those observations, with no logical checking needed. Your problem provides an opportunity for that replacement, and it's wise to seize it.
If my code used if where in is now in place, the results would be different. Your code would be faster in the case where there's a large number of untreated relative to treated observations, and again, that is because your code would not need to go through the complete loop, requiring very little work; the first loop is short-circuited by the first if. In the opposite case, my code would still dominate.
The key is to "separate" treated from untreated and work on each group using in.