Dummy between another dummy - stata

I'm working in Stata and wondering how to create a dummy that falls between the 1s of another dummy. I have a time variable and a dummy for election years, and I want to create a dummy that indicates the year midway between consecutive elections.
For example

If there are always 4 years between consecutive 1s in elec, you can use this code. If that is not the case, you will have to give us more information.
* Example generated by -dataex-. For more info, type help dataex
clear
input int time byte elec
2000 0
2001 1
2002 0
2003 0
2004 0
2005 1
2006 0
end
* Initialize all values to 0
gen elec1 = 0
* Replace elec1 with 1 if the values of elec two rows above and two rows below are both 1
replace elec1 = 1 if elec[_n-2] == 1 & elec[_n+2] == 1
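If the gap between elections is not always four years, one possible generalization is to find the nearest election on each side of every year and flag the year exactly halfway between them. This is only a sketch under stated assumptions: the variable names prev, next and mid are made up for illustration, the extra year 2007 is added to the example purely to show an uneven gap, and the data are a single series with one observation per year. Gaps with no exact midpoint year (an odd number of years apart) get no flag.

```stata
clear
input int time byte elec
2000 0
2001 1
2002 0
2003 0
2004 0
2005 1
2006 0
2007 1
end

* Year of the most recent election at or before each year
sort time
gen int prev = time if elec == 1
replace prev = prev[_n-1] if missing(prev)

* Year of the next election at or after each year (carry backwards)
gsort -time
gen int next = time if elec == 1
replace next = next[_n-1] if missing(next)

* Flag non-election years lying exactly halfway between two elections
sort time
gen byte mid = elec == 0 & time == (prev + next) / 2
list, sep(0)
```

Years before the first election or after the last one have a missing prev or next, so they are never flagged.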

There is a fair bit unexplained in this question. I read it this way and I think @TheIceBear is doing the same. (Thanks to them for the data example: OP, please note how to do it!)
Elections are held every four years in an area. As examples, elections were held in 2001 and 2005.
What is wanted is an indicator (a.k.a. dummy) for years halfway between elections. For example, 2003 is one such.
This works for the example. We notice that election years have remainder 1 on division by 4. So, the wanted years will have remainder 3.
* Example generated by -dataex-. For more info, type help dataex
clear
input int time byte elec
2000 0
2001 1
2002 0
2003 0
2004 0
2005 1
2006 0
end
gen test = mod(time, 4) == 1
assert elec == test
gen wanted = mod(time, 4) == 3
list, sep(0)
     +-----------------------------+
     | time   elec   test   wanted |
     |-----------------------------|
  1. | 2000      0      0        0 |
  2. | 2001      1      1        0 |
  3. | 2002      0      0        0 |
  4. | 2003      0      0        1 |
  5. | 2004      0      0        0 |
  6. | 2005      1      1        0 |
  7. | 2006      0      0        0 |
     +-----------------------------+
As a test, you can check out 2007:
display mod(2007, 4)
See this paper for just a few uses of the modulus, strictly the remainder.
The direct method of generating indicators as true (1) or false (0) results of a true-or-false equality or inequality is discussed at many places, such as this FAQ and this paper.
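One caveat with the direct method is worth a small illustration. Stata treats numeric missing values as larger than any number, so an inequality such as x > 5 is true when x is missing. The sketch below uses a made-up variable x and made-up indicator names big1 and big2 to show the difference an if qualifier makes.

```stata
clear
input float x
1
7
.
end

* Direct method: the expression evaluates to 1 (true) or 0 (false)
gen byte big1 = x > 5

* Same indicator, but left missing wherever x itself is missing
gen byte big2 = x > 5 if !missing(x)

list
```

Here big1 is 1 for the missing observation, while big2 is missing there. With equality tests on complete data, as in the election example, the issue does not arise.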

Related

How to apply maximum value to a whole group using Stata [duplicate]

I want to generate a variable max_count wherein, for a given group ID, if the value of count for the current year is higher than for the previous year then max_count takes the value for the current year. The value for the current year will be applied to the succeeding years until a higher value than that in the current year occurs. For instance, in the example below for ID 2, the value of count in 2001 is 10 but the succeeding years (2002 and 2003) have values less than 10 (i.e. 2 and 4) so 2002 and 2003 then take the value of 10 (the highest value after 2001).
I used this Stata code but it doesn't work:
bysort id (Year): gen max_count=max(count, count[_n-1])
The highest value is only applied to the immediately succeeding year and not to all succeeding years.
ID Year count max_count
1 2000 5 5
1 2001 0 5
1 2002 3 5
1 2003 7 7
2 2000 5 5
2 2001 10 10
2 2002 2 10
2 2003 4 10
3 2000 2 2
3 2001 5 5
3 2002 9 9
3 2003 6 9
clear
input ID Year count max_count
1 2000 5 5
1 2001 0 5
1 2002 3 5
1 2003 7 7
2 2000 5 5
2 2001 10 10
2 2002 2 10
2 2003 4 10
3 2000 2 2
3 2001 5 5
3 2002 9 9
3 2003 6 9
end
bysort ID (Year) : gen wanted = count[1]
by ID : replace wanted = max(wanted[_n-1], count) if _n > 1
list, sepby(ID)
     +---------------------------------------+
     | ID   Year   count   max_co~t   wanted |
     |---------------------------------------|
  1. |  1   2000       5          5        5 |
  2. |  1   2001       0          5        5 |
  3. |  1   2002       3          5        5 |
  4. |  1   2003       7          7        7 |
     |---------------------------------------|
  5. |  2   2000       5          5        5 |
  6. |  2   2001      10         10       10 |
  7. |  2   2002       2         10       10 |
  8. |  2   2003       4         10       10 |
     |---------------------------------------|
  9. |  3   2000       2          2        2 |
 10. |  3   2001       5          5        5 |
 11. |  3   2002       9          9        9 |
 12. |  3   2003       6          9        9 |
     +---------------------------------------+
There is a detailed discussion of how to get such records (the maximum or minimum so far is the "record", as in sport) in this Stata FAQ.
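To see why the attempt in the question falls short, it may help to run both versions side by side on one group. The attempt compares count only with the previous value of count, so the maximum is forgotten after one year. The two-step code refers to the previous value of the new variable itself, which works because replace proceeds observation by observation, so each result is available to the next. The names naive and running below are illustrative only.

```stata
clear
input ID Year count
2 2000 5
2 2001 10
2 2002 2
2 2003 4
end

* Looks back one year in count only
bysort ID (Year) : gen naive = max(count, count[_n-1])

* Looks back one year in the running maximum itself
bysort ID (Year) : gen running = count[1]
by ID : replace running = max(running[_n-1], count) if _n > 1

list, sepby(ID)
```

In 2003 naive is 4, because it only sees 2 and 4, whereas running stays at 10.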
For a one-line solution, install rangestat from SSC and then
rangestat (max) WANTED = count, int(Year . 0) by(ID)
The problem of when the record occurred is naturally related:
by ID : gen when = Year[1]
by ID : replace when = cond(wanted > wanted[_n-1], Year, when[_n-1]) if _n > 1

Conditional summation in time-to-event data

I have the following data, which have been prepared with stset. The resulting variables give cohort entry and exit times along with event status. In addition, a numeric variable, prob, has been calculated from the size of the risk set.
For those subjects that are not cases (where _d == 0), I need to sum all values of the prob variable where _t falls within that subject's follow-up time.
For example, subject 8 enters the cohort at _t0 == 0 and exits at _t == 8. Between these times there are three prob values (0.9, 0.875 and 0.875), giving the desired answer for subject 8 as 2.65.
* Example generated by -dataex-. To install: ssc install dataex
clear
input long id byte(_t0 _t _d) float prob
1 0 1 0 .
2 0 2 0 .
3 1 3 1 .9
4 0 4 0 .
5 0 5 1 .875
6 0 6 1 .875
7 5 7 0 .
8 0 8 0 .
9 0 9 1 .8333333
10 0 10 1 .8
11 0 11 0 .
12 8 12 1 .6666667
13 0 13 0 .
14 0 14 0 .
15 0 15 0 .
end
The desired output would return all of the data with an additional variable signifying the summed values of prob.
Thanks so much in advance.
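One way to read the request is as a loop over subjects: for each non-case, sum prob over every observation whose _t lies in that subject's follow-up interval. The following is a minimal sketch under that reading, not a tested solution for the real data. The variable name sumprob is made up, and treating the interval as open at entry and closed at exit, (_t0, _t], is an assumption based on the usual stset convention. The loop is quadratic in the number of observations, so it may be slow on large datasets.

```stata
clear
input long id byte(_t0 _t _d) float prob
1 0 1 0 .
2 0 2 0 .
3 1 3 1 .9
4 0 4 0 .
5 0 5 1 .875
6 0 6 1 .875
7 5 7 0 .
8 0 8 0 .
9 0 9 1 .8333333
10 0 10 1 .8
11 0 11 0 .
12 8 12 1 .6666667
13 0 13 0 .
14 0 14 0 .
15 0 15 0 .
end

gen double sumprob = .
forvalues i = 1/`=_N' {
    * Only non-cases get a sum; cases are left missing
    if _d[`i'] == 0 {
        * Assumption: follow-up interval is (_t0, _t] for subject i
        quietly summarize prob if _t > _t0[`i'] & _t <= _t[`i'], meanonly
        quietly replace sumprob = r(sum) in `i'
    }
}
list, sep(0)
```

For subject 8 this gives 2.65, matching the worked example in the question. Subjects with no prob values inside their interval get 0.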

Subsampling years before and after an event of an unbalanced panel dataset

I'm trying to figure out a concise way to keep only the two years before and after the year in which an event takes place using daily panel data in Stata. The panel is unbalanced. Ultimately, I'm trying to conduct an event study but I experienced issues because the unique groups report inconsistent years.
The data looks something like this:
ID year month day event
1 1999 1 1 0
1 1999 1 2 0
1 1999 1 3 0
1 1999 1 4 0
1 1999 1 5 0
1 1999 1 6 0
1 1999 1 7 0
1 1999 1 8 0
1 1999 1 9 0
1 1999 1 10 0
1 1999 1 11 0
1 1999 1 12 0
1 1999 1 13 0
1 1999 1 14 0
1 1999 1 15 0
1 1999 1 16 0
1 1999 1 17 0
1 1999 1 18 0
1 1999 1 19 0
1 1999 1 20 0
1 1999 1 21 0
1 1999 1 22 0
1 1999 1 23 0
1 1999 1 24 0
1 1999 1 25 0
1 1999 1 26 0
1 1999 1 27 0
1 1999 1 28 0
1 1999 1 29 0
1 1999 1 30 0
1 1999 1 31 0
1 1999 2 1 1
1 1999 2 2 1
In this case, the event takes place in February 1999. The event is monthly, but I need the daily data for a later part of the analysis. I want to somehow tag the 24 months before February 1999 and the 24 months after February 1999. However, I need to do this in a way that won't codify any months in 2002 if group 1 reported no data in 2000.
I got the following to work on a similar set of monthly data but I can't figure out a way to do it with daily data. Furthermore, if anyone has suggestions for a less clunky solution, I would be very appreciative.
bys ID year (month) : egen year_change = max(event)
bys ID (year month) : replace year_change = 2 if ///
(year_change[_n+24] == 1 & year[_n] == year[_n+24] - 2) | ///
(year_change[_n+12] == 1 & year[_n] == year[_n+12] - 1) | ///
(year_change[_n-12] == 1 & year[_n] == year[_n-12] + 1) | ///
(year_change[_n-24] == 1 & year[_n] == year[_n-24] + 2)
keep if year_change >= 1
It seems that your event date is the first date with event 1. So,
gen dailydate = mdy(month, day, year)
bysort ID : egen key = min(cond(event == 1, dailydate, .))
gen wanted = inrange(dailydate, key - 730, key + 730)
Check that wanted gives the dates you want and then modify the rule or keep accordingly.
This code doesn't assume that the event date is the same for each panel, but that would not be a problem.
See this paper for a review of related technique.
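Since the question speaks of 24 months rather than 730 days, the same idea can be sketched with monthly dates, which avoids any concern about leap years and unequal month lengths. The example data and the names mdate, keym and inwindow are invented for illustration. mofd() converts a daily date to a monthly date, so the window is counted in calendar months on either side of the event month.

```stata
clear
input ID year month day event
1 1997 1 15 0
1 1997 2 1 0
1 1999 1 31 0
1 1999 2 1 1
1 2001 2 28 0
1 2001 3 1 0
end

gen dailydate = mdy(month, day, year)
format dailydate %td

* Monthly date of each observation
gen mdate = mofd(dailydate)
format mdate %tm

* First event month within each panel
bysort ID (dailydate) : egen keym = min(cond(event == 1, mdate, .))

* Within 24 calendar months either side of the event month
gen byte inwindow = inrange(mdate, keym - 24, keym + 24)
list, sepby(ID)
```

With the event in February 1999, January 1997 and March 2001 fall outside the window, while February 1997 and February 2001 are its first and last months. Because the rule works on dates and not on observation positions, a panel with missing years cannot have the wrong rows tagged.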
For your task, I suggest working with actual Stata dates instead of relying on separate year, month and day variables. That makes it easier to add or subtract 24 months without relying on how the data are sorted (the "_n+24" part of your code), and the coding would not suffer from the missing-data problem that you outline in the question.
I see a straightforward solution, which relies on an assumption about your setting that you did not state but that is typical of event studies: the event date is the same for all IDs, so there is no group-specific "treatment" date.
g stata_date = mdy(month, day, year) // generate variable with Stata date
/* Unique event on Feb 1, 1999 */
bys ID: egen treat_group = max(event) // indicator for an ID to ever be "treated"
g event_window = (stata_date >= td(01Feb1997) & stata_date < td(01Feb2001)) // indicator for event window - 2 years before and after Feb 1, 1999
g event_treatment = treat_group * event_window // indicator for a treated ID during the event window

Stata: Reducing observations based on yearly data

I want to create a variable that is one or zero if a company (companyid below) is "multicolor" in each year. Below is my data:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str6 companyid int year float(red blue green)
"001045" 2015 0 1 0
"001045" 2015 0 1 0
"001045" 2015 0 1 0
"001045" 2015 0 1 0
"001045" 2017 1 0 0
"001045" 2017 1 0 0
"001049" 2019 0 1 0
"001049" 2019 0 0 1
"001055" 2018 1 0 0
"001055" 2018 0 1 0
"001055" 2018 0 0 1
So for example, company #001055 is red, blue, and green for 2018 so this 'multicolor' variable should equal to one.
Additionally, I also want to create variables for the different combinations. I.e. a red-blue var = 1 if a company is red and blue = 1 in each year.
I was trying to do something with bysort companyid year: gen multicolor = 1 if red == 1 & blue == 1 & green == 1 but I realize that has a lot missing in what I want to accomplish.
The overall goal is to reduce multiple year observations so I have one observation per year per company.
This single year/company record would have the info if that company was red, green, blue, or the exact mix of these colors if it is mixed. Below would be the example of data that I want to create from the data above.
input str6 companyid int year float(red blue green r-b-g red-blue blue-green ...more...)
"001045" 2015 0 1 0 0 0 0 ...
"001045" 2017 1 0 0 0 0 0 ...
"001049" 2019 0 0 0 0 0 1 ...
"001055" 2018 0 0 0 1 0 0 ...
I think this is a lot easier than you are fearing. First, collapse to maximum values by company and year. Then you have the individual values of red blue green. Second, concatenate the values, so that "110" is red and blue but not green, and so on.
tabulate would generate all the indicators corresponding to combinations found in the data.
In effect, the 3 colors and 2 possibilities permit binary encoding, and the string is a binary number too.
With true coded 1 and false coded 0, the maximum over a set of 0s and 1s means "any" and the minimum means "all". That correspondence is obvious once understood, but worth spelling out otherwise. For a Stata context, see this FAQ.
clear
input str6 companyid int year float(red blue green)
"001045" 2015 0 1 0
"001045" 2015 0 1 0
"001045" 2015 0 1 0
"001045" 2015 0 1 0
"001045" 2017 1 0 0
"001045" 2017 1 0 0
"001049" 2019 0 1 0
"001049" 2019 0 0 1
"001055" 2018 1 0 0
"001055" 2018 0 1 0
"001055" 2018 0 0 1
end
collapse (max) red blue green, by(companyid year)
egen colors = concat(red blue green)
list
     +-----------------------------------------------+
     | compan~d   year   red   blue   green   colors |
     |-----------------------------------------------|
  1. |   001045   2015     0      1       0      010 |
  2. |   001045   2017     1      0       0      100 |
  3. |   001049   2019     0      1       1      011 |
  4. |   001055   2018     1      1       1      111 |
     +-----------------------------------------------+
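As a sketch of the last step, the concatenated string can be turned into indicators either wholesale with tabulate, generate() or one at a time with the direct true-or-false method. The data below are the collapsed results shown in the listing. The stub combo and the names rbg and blue_green are made up for illustration.

```stata
clear
input str6 companyid int year byte(red blue green)
"001045" 2015 0 1 0
"001045" 2017 1 0 0
"001049" 2019 0 1 1
"001055" 2018 1 1 1
end

egen colors = concat(red blue green)

* One indicator per combination present in the data: combo1, combo2, ...
tabulate colors, generate(combo)

* Or name specific combinations explicitly
gen byte rbg = colors == "111"
gen byte blue_green = colors == "011"

list companyid year colors rbg blue_green
```

tabulate numbers the new variables in the sorted order of colors, so here combo1 corresponds to "010" and combo4 to "111". Combinations that never occur in the data get no variable, which is why the explicit version may be preferable when a fixed set of columns is needed.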

Merging two stat(sum) codes

I have obtained a list of projects that in total generate zero revenue (total revenue over a period of time)
tabstat revenue, by(project) stat(sum)
I have identified 261 projects (out of 1000s) that generate zero revenue for the whole period of time.
Now I want to look at the total value of a specific variable that can be tracked over multiple periods for each of these zero-revenue projects. I know that I can go after each project by typing
tabstat variable_of_interest if project==127, stat(sum)
Again, here project 127 generated zero revenue.
Is there a way to merge these two codes so that I can generate a table with the following logic
generate total sum of the variable_of_interest if project's stat(sum) was equal to zero?
here is a data sample
project revenue var_of_intr
1 0 5
1 0 8
1 2 10
1 0 5
2 0 5
2 0 90
2 0 2
2 0 0
3 0 76
3 0 5
3 0 23
3 0 4
4 0 75
4 8 2
4 0 9
4 0 6
5 0 88
5 0 20
5 0 9
5 0 14
Since projects 1 and 4 generated revenue > 0, the code should ignore them when summing the variable of interest by project. The table I am interested in should therefore look like this:
project var_of_intr
2 97
3 108
5 131
You can use collapse:
clear
set more off
*----- example data -----
input ///
project revenue somevar
1 0 5
1 0 8
1 2 10
1 0 5
2 0 5
2 0 90
2 0 2
2 0 0
3 0 76
3 0 5
3 0 23
3 0 4
4 0 75
4 8 2
4 0 9
4 0 6
5 0 88
5 0 20
5 0 9
5 0 14
end
list
*----- what you want -----
collapse (sum) revenue somevar, by(project)
keep if revenue == 0
That will replace the dataset in memory with the collapsed results, of course, but it might be useful anyway. You don't really specify whether this approach is acceptable.
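If losing the original data is the only objection to collapse, wrapping it in preserve and restore is one way around that. A minimal sketch on a cut-down version of the example:

```stata
clear
input project revenue somevar
1 0 5
1 2 10
2 0 5
2 0 90
end

* Work on a temporary copy; the original data come back at restore
preserve
collapse (sum) revenue somevar, by(project)
keep if revenue == 0
list
restore

* All four original observations are still in memory
describe, short
```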
For a table, you can flag projects with revenue equal to zero and condition on that:
bysort project (revenue): gen revzero = revenue[_N] == 0
tabstat somevar if revzero, by(project) stat(sum)
If you have missing or negative revenues, modifications are required.
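One possible modification, sketched here with made-up data and the illustrative names totrev and allzero, is to flag projects in which every revenue value is exactly zero. Taking the minimum of the true-or-false expression revenue == 0 gives 1 only if the expression is true for all observations in the project. A missing revenue counts as not zero under this rule, which is an assumption about what is wanted.

```stata
clear
input project revenue somevar
1 0 5
1 2 10
2 0 5
2 0 90
3 -1 7
3 1 8
end

* Total revenue: zero for project 3 even though it had nonzero revenues
egen totrev = total(revenue), by(project)

* 1 only if every revenue in the project is exactly 0
egen byte allzero = min(revenue == 0), by(project)

tabstat somevar if allzero, by(project) stat(sum)
```

Project 3 shows why the choice matters: its revenues of -1 and 1 sum to zero, so a rule based on totrev would include it, while allzero excludes it.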