Stata: remove rows before event

I have an unbalanced panel data set and am having a hard time analyzing it.
I would like to remove all rows that I have for each person before an event happens, because I'm interested in the effect of the event on other variables.
One problem is the many missing values, mostly every other year. The second problem is persons for whom the "event" occurs more than once; in this case I'm only interested in the last event, as you can see for person 3 in my picture.

If all you are interested in is identifying the last non-missing event, this should do the trick:
clear
* Input data
input float id str7 event
1 "1"
1 "missing"
1 "1"
1 "missing"
1 "0"
1 "missing"
2 "missing"
2 "0"
2 "missing"
2 "missing"
2 "1"
2 "missing"
2 "0"
3 "missing"
3 "0"
3 "1"
3 "0"
3 "missing"
3 "1"
3 "0"
end
* Re-structure event variable to byte and preserve sort order
replace event = "." if event == "missing"
destring event, replace
gen sortorder = _n
* Identify last event iff event == 1
bysort id event (sortorder): gen lastevent = _n == _N if event == 1
I assume you have a time variable that would preserve the sort order better, but sortorder serves for this example. Once the last events are identified, you can do with them what you wish. Use keep if lastevent == 1 to keep only the last-event observations and drop all others, or keep the remaining observations around: lastevent is 0 for earlier events and missing where event is 0 or missing.
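The question asks to remove the rows before the last event rather than to isolate the event row itself. A minimal sketch of one way to do that, assuming a numeric event variable and using sortorder as a stand-in for a real time variable; eventpos and lastpos are hypothetical helper names, and leaving persons without any event untouched is a choice made here, not something the question specifies:

```stata
clear
input float id byte event
1 .
1 1
1 0
2 0
2 1
2 .
2 1
2 0
end
gen long sortorder = _n

* Position of each event row; missing on all non-event rows
gen long eventpos = sortorder if event == 1
* Latest event position within each person (max() ignores missing values)
bysort id (sortorder): egen long lastpos = max(eventpos)
* Drop rows strictly before the last event.
* Persons with no event at all (lastpos missing) are left untouched.
drop if sortorder < lastpos & !missing(lastpos)
drop eventpos lastpos
list, sepby(id)
```

With a real time variable, replace sortorder by that variable throughout.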

Related

Stata: Keep the first observation by group

I have a data set that looks like this:
id firm earnings A
1 A 100 0
1 A 200 0
2 B 50 1
2 B 70 1
3 C 900 0
bys id firm, I want to keep only the first observation if A==0 and want to keep all the observations if A==1.
I've tried the following code:
if A==0{
bys id firm: keep if _n==1
}
However, this code drops all the _n>1 observations no matter what the A value is.
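The if (condition) {do something} syntax is the if command, which is used for control flow rather than for selecting observations. It evaluates the condition once, and when the condition refers to a variable, Stata uses that variable's value in the first observation. As you have it now, Stata is only testing whether A==0 in the first row. To select observations, use the if qualifier instead and combine conditions with and (&) or or (|). Try this: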
bys id firm: keep if (_n==1 & A==0) | A==1
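As a check, the one-liner can be run on the example data from the question. This sketch adds one assumption: a hypothetical obsno variable records the original row order, because bysort without a secondary sort key does not guarantee which observation ends up first within a group.

```stata
clear
input id str1 firm earnings A
1 A 100 0
1 A 200 0
2 B 50 1
2 B 70 1
3 C 900 0
end
* Record the original order so that "first" is well defined within groups
gen long obsno = _n
* if qualifier: evaluated for every observation, unlike the if command
bysort id firm (obsno): keep if (_n == 1 & A == 0) | A == 1
list, sepby(id)
```

Only the second row of firm A is dropped; both rows of firm B survive because A==1 there.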

what is this program doing exactly? (SAS)

I was confused by the following SAS code. The SAS data set named WORK.SALARY contains 10 observations for each department, and is currently ordered by Department. The following SAS program is submitted:
data WORK.TOTAL;
set WORK.SALARY(keep=Department MonthlyWageRate);
by Department;
if First.Department=1 then Payroll=0;
Payroll+(MonthlyWageRate*12);
if Last.Department=1;
run;
So, what exactly is First.Department and Last.Department? Many thanks for your time and attention.
Your data step calculates the total PAYROLL for each DEPARTMENT.
The FIRST. and LAST. variables are generated automatically when you use a BY statement. They are true when the current observation is the first (or last) observation in the BY group. See "How the DATA Step Identifies BY Groups" in the SAS documentation.
The sum statement (Syntax: var+expression;) for PAYROLL means that the value of PAYROLL is retained (or carried over) to the next observation.
The IF/THEN statement initializes the value to zero when a new group starts.
The subsetting IF statement will make sure that only the final observation for each department is output.
As explained, it is calculating payroll for each department.
First.department is set to 1 when the first record for a department is read. Last.department is set to 1 when the last record for the department is read.
So if you have :
Department Wage
1 100
1 200
1 300
2 1000
2 2000
2 3000
With the first. and last. assigned, it will look like this:
Department Wage first.department last.department
1 100 1 0
1 200 0 0
1 300 0 1
2 1000 1 0
2 2000 0 0
2 3000 0 1
Now you can follow your logic as to what happens when first.department = 1.
By the way, the statement if Last.Department=1; has no THEN clause, so it is a subsetting IF: only observations for which the condition is true are written to the output data set.

select last row by group with "collapse (last) ..." syntax

I want to select the last row in each subset of the data determined by one or more categorical variables.
Background. For each ticket in my data set, I have a ticketid and multiple transactions (sale, refund, sale, refund, sale...). I am only interested in keeping series that end in "sale".
My first step was to drop ticketids with evenly matched sales and refunds:
duplicates tag ticketid, gen(mult)
by ticketid: egen count_sale = total(transtatus == "Sale")
by ticketid: egen count_ref = total(transtatus == "Refund")
drop if mult & count_sale == count_ref
Now, I want to keep just the final sale when count_sale = count_ref + 1
sort ticketid time
preserve
** some collapse command
save "temp_terminal_sales.dta"
restore
append using "temp_terminal_sales.dta"
I can't figure out how (if at all) to use collapse here. I think I may just have to keep if mult, tag the last row with by ticketid: gen last = _n == _N and keep if last...? It seems like collapse should work. Here is the (wrong) syntax that seemed intuitive to me:
collapse (last), by(ticketid)
collapse (last) *, by(ticketid)
These don't work because (i) a varlist is required, and (ii) the by variables cannot be in the varlist.
Example data:
ticketid time myvar transtatus
1 1 2 "Sale"
1 2 2 "Refund"
2 1 2 "Sale"
3 1 2 "Sale"
3 2 2 "Refund"
3 3 2 "Sale"
3 4 2 "Refund"
4 1 2 "Sale"
4 2 2 "Refund"
4 3 2 "Sale"
Desired result:
ticketid time myvar transtatus
2 1 2 "Sale"
4 3 2 "Sale"
The easiest generic way to keep the last of a group is as follows. For a concrete example I assume panel data with identifier id and time variable time:
bysort id (time): keep if _n == _N
The generalisation is
bysort <variables defining groups> (<variable defining order first ... last>): keep if _n == _N
Many Stata commands support the in qualifier, but here we need if and the syntax hinges crucially on the fact that under the aegis of by: observation number _n and number of observations _N are determined within the groups defined by by:. Thus _n == 1 identifies the first and _n == _N identifies the last observation in each group.
drop if _n < _N is a dual command here.
You touched on this approach in your question, but the intermediate step of creating an indicator variable is unnecessary.
For collapse a work-around is presumably just to use some other variable, or even to create one for the purpose as in gen anything = 1. But I would always use by: for your purpose.
There is a discursive tutorial on by: at http://www.stata-journal.com/article.html?article=pr0004 Searching the Stata Journal archives using by as a keyword will reveal more applications.
@NickCox has already provided the general answer. Now that you have given example data, I post a reproducible example with several syntaxes:
clear all
set more off
input ///
ticketid time myvar str10 transtatus
1 1 2 "Sale"
1 2 2 "Refund"
2 1 2 "Sale"
3 1 2 "Sale"
3 2 2 "Refund"
3 3 2 "Sale"
3 4 2 "Refund"
4 1 2 "Sale"
4 2 2 "Refund"
4 3 2 "Sale"
end
list, sepby(ticketid)
*-----
* Method 1
bysort ticketid (time): keep if transtatus[_N] == "Sale" // keep subsets
by ticketid: keep if _n == _N // keep last observation of subsets
list
*-----
* Method 2
// list of all variables except ticketid
unab allvars: _all
local exclvar ticketid
local mycvars: list allvars - exclvar
bysort ticketid (time): keep if transtatus[_N] == "Sale" // keep subsets
collapse (last) `mycvars', by(ticketid) // keep last observation of subsets
list
*-----
* Method 3
bysort ticketid (time): keep if transtatus[_N] == "Sale" & _n == _N
list
(Remember to reload the data for each method.)
Consider also tagging the observations of interest instead of dropping the rest, and then running subsequent estimation commands with the if qualifier. For example, regress ... if ...
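A minimal sketch of the tagging approach, assuming the same variable names as the example data; lastsale is a hypothetical indicator name introduced here for illustration:

```stata
clear
input ticketid time myvar str10 transtatus
1 1 2 "Sale"
1 2 2 "Refund"
2 1 2 "Sale"
4 1 2 "Sale"
4 2 2 "Refund"
4 3 2 "Sale"
end
* Tag the final row of each ticket, but only when that row is a sale
bysort ticketid (time): gen byte lastsale = _n == _N & transtatus[_N] == "Sale"
* Nothing is dropped; later commands select the tagged rows with if
list if lastsale
```

Because the full data set stays in memory, there is no need to reload it or to use preserve and restore between analyses.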

Stata: tag all values in a group based on a characteristic of any values in the group

I think egen might help me here, but for whatever reason I can't quite figure out the right syntax. I'd like to create a new variable that takes a value of 1 for all observations in a group if, for any of the observations in the group, X is true. So, for example, my data has the obs, group, and flag variables, and I want to generate the variable grpflag.
obs group flag grpflag
1 1 0 1
2 1 1 1
3 1 0 1
4 2 0 0
5 2 0 0
6 2 0 0
7 3 1 1
8 3 0 1
So, in the example data, since flag==1 for one (i.e., any) of the observations in group 1, I want grpflag to take the value 1 for all observations in group 1. The same is true for group 3, and the opposite is true for group 2.
You were right: the egen command can do this.
egen grpflag = max(flag), by(group)
See the Stata FAQ http://www.stata.com/support/faqs/data-management/create-variable-recording/ for more detail on the correspondences any:maximum and all:minimum exploited in Stata.
Note that while your example is easy (flag is already 0 or 1, so max() can be applied directly to flag) the argument of max() can be an expression, so the syntax extends easily to more general cases, e.g. max(foo == 42).
Even if egen were not available, or did not work like this, this kind of one-liner is possible in Stata, and will be more efficient than calling egen:
bysort group (flag) : gen grpflag = flag[_N]
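However, that would be thrown by missings on flag: missing values sort after all other numbers, so flag[_N] would be missing whenever a group contains a missing. You would need to work around that. In turn that could just be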
gen isflag = flag == 1
bysort group (isflag) : gen grpflag = isflag[_N]
The general principle is that so long as what you are sorting is just 0 and 1, any values of 1 will be sorted to the end within each block of observations.
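The two approaches can be compared on a small data set that includes a missing value on flag. This is only a sketch: the data are made up for illustration, and grpflag_egen and grpflag_by are hypothetical names.

```stata
clear
input obs group flag
1 1 0
2 1 1
3 1 .
4 2 0
5 2 0
6 3 1
7 3 0
end
* egen with an expression: flag == 1 evaluates to 0 when flag is missing
egen byte grpflag_egen = max(flag == 1), by(group)
* by: equivalent: sort the 0/1 indicator so any 1 lands last in its group
gen byte isflag = flag == 1
bysort group (isflag): gen byte grpflag_by = isflag[_N]
* Both versions should agree on every observation
assert grpflag_egen == grpflag_by
sort obs
list, sepby(group)
```

Groups 1 and 3 get the value 1 and group 2 gets 0, regardless of the missing value in group 1.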

How to number households?

I have a set of household data with more than 20,000 records for 4,200 households. In my data set there is no column for household ID, and all the records are per household member. There is a column for the person's serial number, and every "1" marks the start of a new household (i.e., if we number households starting from the very first record, where the person's serial number equals 1, the corresponding HH_ID should be 1; at the next record where the serial number equals 1, HH_ID should become 2). So I want to add a column named HH_ID and number it from 1 to 4200. How could I write a program for this in Stata?
What you want is (assuming a variable personid for person identifier)
. gen hhid = sum(personid == 1)
That's it. The explanation is longer than the code. The expression personid == 1 evaluates as 1 when true and 0 when false. For the first household, first person, this will be 1, and for the other persons in the same household 0. For the second household, first person, this will be 1, and so on. The function sum() gives the cumulative or running sum, so that you should end with something that goes 1,1,1,2,2,2,2,3,3,3,... Clearly the actual numbers of 1s, 2s, 3s etc. will depend on the numbers of persons in the households.
On true and false in Stata see
http://www.stata.com/support/faqs/data-management/true-and-false/index.html
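The running-sum idea can be tried on a few made-up records. This sketch assumes the data are already in their original order and that the first record has serial number 1; the values are for illustration only.

```stata
clear
input personid
1
2
3
1
2
1
2
3
4
end
* personid == 1 is 1 on the first member of each household and 0 otherwise,
* so the running sum increases by one at the start of every household
gen long hhid = sum(personid == 1)
list, sepby(hhid)
```

Do not sort the data before this step, since the household boundaries exist only in the record order.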