I want to select the last row in each subset of the data determined by one or more categorical variables.
Background. For each ticket in my data set, I have a ticketid and multiple transactions (sale, refund, sale, refund, sale...). I am only interested in keeping series that end in "sale".
My first step was to drop ticketids with evenly matched sales and refunds:
duplicates tag ticketid, gen(mult)
bysort ticketid: egen count_sale = total(transtatus == "Sale")
by ticketid: egen count_ref = total(transtatus == "Refund")
drop if mult & count_sale == count_ref
Now I want to keep just the final sale when count_sale == count_ref + 1:
sort ticketid time
preserve
** some collapse command
save "temp_terminal_sales.dta"
restore
append using "temp_terminal_sales.dta"
I can't figure out how (if at all) to use collapse here. I think I may just have to keep if mult, tag the last row with by ticketid: gen last = _n == _N and keep if last...? It seems like collapse should work. Here is the (wrong) syntax that seemed intuitive to me:
collapse (last), by(ticketid)
collapse (last) *, by(ticketid)
These don't work because (i) a varlist is required, and (ii) the by variables cannot be in the varlist.
Example data:
ticketid time myvar transtatus
1 1 2 "Sale"
1 2 2 "Refund"
2 1 2 "Sale"
3 1 2 "Sale"
3 2 2 "Refund"
3 3 2 "Sale"
3 4 2 "Refund"
4 1 2 "Sale"
4 2 2 "Refund"
4 3 2 "Sale"
Desired result:
ticketid time myvar transtatus
2 1 2 "Sale"
4 3 2 "Sale"
The easiest generic way to keep the last of a group is as follows. For a concrete example I assume panel data with identifier id and time variable time:
bysort id (time): keep if _n == _N
The generalisation is
bysort <variables defining groups> (<variable defining order first ... last>): keep if _n == _N
Many Stata commands support the in qualifier, but here we need if. The syntax hinges on the fact that, under the aegis of by:, the observation number _n and the number of observations _N are determined within the groups defined by by:. Thus _n == 1 identifies the first and _n == _N identifies the last observation in each group.
Under the same bysort prefix, drop if _n < _N is the dual command: it drops everything except the last observation of each group.
You touched on this approach in your question, but the intermediate step of creating an indicator variable is unnecessary.
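As a minimal sketch of how _n and _N behave under by:, the following uses Stata's bundled auto dataset (an assumption made for illustration, not the ticket data from the question) and tags the cheapest and the most expensive car within each category of foreign. The variable names first and last are made up for the example.

```stata
sysuse auto, clear

* sort by price within each group of foreign;
* _n then runs from 1 to _N separately within each group
bysort foreign (price): gen byte first = _n == 1
by foreign: gen byte last = _n == _N

* one first and one last observation per group
list foreign make price if first | last, sepby(foreign)
```

Replacing the two gen statements with keep if _n == _N would leave just the last observation per group, with no indicator variable needed.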
For collapse a work-around is presumably just to use some other variable, or even to create one for the purpose as in gen anything = 1. But I would always use by: for your purpose.
There is a discursive tutorial on by: at http://www.stata-journal.com/article.html?article=pr0004. Searching the Stata Journal archives using by as a keyword will reveal more applications.
@NickCox has already provided the general answer. Now that you have given example data, I post a reproducible example with several syntaxes:
clear all
set more off
input ///
ticketid time myvar str10 transtatus
1 1 2 "Sale"
1 2 2 "Refund"
2 1 2 "Sale"
3 1 2 "Sale"
3 2 2 "Refund"
3 3 2 "Sale"
3 4 2 "Refund"
4 1 2 "Sale"
4 2 2 "Refund"
4 3 2 "Sale"
end
list, sepby(ticketid)
*-----
* Method 1
bysort ticketid (time): keep if transtatus[_N] == "Sale" // keep subsets
by ticketid: keep if _n == _N // keep last observation of subsets
list
*-----
* Method 2
// list of all variables except ticketid
unab allvars: _all
local exclvar ticketid
local mycvars: list allvars - exclvar
bysort ticketid (time): keep if transtatus[_N] == "Sale" // keep subsets
collapse (last) `mycvars', by(ticketid) // keep last observation of subsets
list
*-----
* Method 3
bysort ticketid (time): keep if transtatus[_N] == "Sale" & _n == _N
list
(Remember to reload the data for each method.)
Consider also tagging the observations of interest, rather than dropping the rest, and then running any subsequent estimation commands with an if qualifier. For example, regress ... if ...
I have two datasets: one with one observation and two variables, the other with 10 observations and four variables.
Dataset 1
Final Result
X Fail
Dataset 2
A B C D Output
1 1 2 Pass
2 1 2 Pass
3 1 2 Pass
4 1 2 Fail
5 1 2 Pass
6 1 2 Fail
7 1 2 Pass
8 1 2 Fail
9 1 2 Pass
10 1 2 Pass
I would like to generate a fifth variable (Output) in the second dataset, depending on the value of the second variable (Result) in the first dataset.
If Result in the first dataset equals Fail, Output in the second dataset should be Fail. If Result equals Pass, Output should take the value of column D of the second dataset.
Just use some simple IF/THEN logic. Since you know DATASET1 only has one observation then only read one observation from it.
data want;
if _n_=1 then set dataset1 ;
set dataset2 ;
length OUTPUT $4 ;
if upcase(RESULT)='FAIL' then OUTPUT=RESULT; /* the data hold 'Fail'; string comparison is case-sensitive */
else OUTPUT=D ;
run;
I am trying to collapse all variables in my dataset, which is as follows.
date number_of_patients health_center vaccinations
6/25/21 1 healthcentername 1
6/18/21 2 healthcentername 2
10/9/20 2 healthcentername 1
10/2/20 2 healthcentername 1
10/16/20 1 healthcentername 1
I am trying to collapse across dates, summing within each health center, into:
number_of_patients health_center vaccinations
8 healthcentername 6
I am trying to do this across all health centers, but I can't seem to do it without naming the specific variables I want to collapse. Unfortunately, that isn't feasible because I have 3500 variables in the dataset.
Somehow you need to tell Stata which variables you want to sum by health center, but that doesn't mean that you need to type them all. You can use ds to create a list of variable names. If you use the option not then ds will list all but the variable names you are mentioning. Like this:
* Example generated by -dataex-. For more info, type help dataex
clear
input str8 date byte number_of_patients str16 health_center byte vaccinations
"6/25/21" 1 "healthcentername" 1
"6/18/21" 2 "healthcentername" 2
"10/9/20" 2 "healthcentername" 1
"10/2/20" 2 "healthcentername" 1
"10/16/20" 1 "healthcentername" 1
end
*List all variables but the one mentioned and store list in r(varlist)
ds date health_center, not
*Sum by health center all but the variables explicitly excluded above
collapse (sum) `r(varlist)' , by(health_center)
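If the variables to leave out happen to be exactly the string variables, as they are in this example, a possible alternative is to let ds select by storage type instead of by name. This is only a sketch under that assumption: it would also pick up any numeric variable you did not intend to sum, such as a date stored as a number.

```stata
clear
input str8 date byte number_of_patients str16 health_center byte vaccinations
"6/25/21" 1 "healthcentername" 1
"6/18/21" 2 "healthcentername" 2
"10/9/20" 2 "healthcentername" 1
"10/2/20" 2 "healthcentername" 1
"10/16/20" 1 "healthcentername" 1
end

* list only the numeric variables and store the list in r(varlist)
ds, has(type numeric)

* sum every numeric variable within each health center
collapse (sum) `r(varlist)', by(health_center)
list
```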
I have a dataset in Stata and want to count by group (loc_ID) and year. I used the following two lines of code:
egen count_obsv = tag(loc_ID year)
This adds a variable (count_obsv) that is 1 for one observation of each distinct combination of loc_ID and year, and 0 for every other observation with the same combination.
Then I use:
collapse (sum) count_obsv, by(loc_ID year)
According to various Stata forum posts, this should result in, e.g.:
loc_ID year count_obsv
1 2000 342
1 2001 23
2 2008 23
...
But my output is:
loc_ID year count_obsv
1 2000 1
1 2001 1
2 2008 1
...
What am I summarizing wrong?
When you call up the tag() function of the egen command, you assign the value 1 to just one of any number of observations with the same distinct values for the specified variables, and 0 to all the others. Then when you ask for the sum of those values in the same groups of observations, you get the group sums of one 1 and any number of 0s, and each sum is thus necessarily 1.
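The point can be checked directly. The sketch below uses the bundled auto dataset, with foreign and rep78 standing in for loc_ID and year (an assumption for illustration only), and compares the group sum of the tag with the actual group size obtained from _N. The variable names sum_tag and n_obs are made up for the example.

```stata
sysuse auto, clear
drop if missing(rep78)   // tag() ignores observations with missing values by default

* 1 for exactly one observation per (foreign, rep78) combination, 0 otherwise
egen tag = tag(foreign rep78)

* summing the tag within the same groups can only ever give 1
egen sum_tag = total(tag), by(foreign rep78)

* the number of observations in each group is _N under by:
bysort foreign rep78: gen n_obs = _N

list foreign rep78 tag sum_tag n_obs in 1/10, sepby(foreign rep78)
```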
Your question is probably abstracted from some other calculations that worked as you expected, but if all you wanted was a dataset with frequencies, then
contract loc_ID year
would do that for you. If you wanted a dataset with summaries of other variables too, you would need something more like
collapse (count) count=foo (mean) mean=foo (sd) sd=foo, by(loc_ID year)
I doubt that any Statalist posts state otherwise. (I wrote tag() in 1999, and I am not aware of this as a misunderstanding.) There is a related but so to speak distinct problem where tag() comes in useful, which is counting distinct values (often called unique values).
sysuse auto, clear
egen tag = tag(foreign rep78)
egen distinct = total(tag), by(foreign)
tabdisp foreign, c(distinct)
would be a way to get at the number of distinct values of rep78 within categories of foreign.
I am exploring an effect that I think will vary by GDP levels, from a data set that has, vertically, country and year (1960 to 2015), so each country label is on 55 rows. I ran
sort year
by year: egen yrank = xtile(rgdp), nquantiles(4)
which tags every year row with what quartile of GDP they were in that year. I want to run this:
xtreg fiveyearg taxratio if yrank == 1 & year==1960
which would regress my variable (tax ratio) against some averaged gdp data from countries that were in the bottom quartile of GDPs in 1960 alone. So even if later on they grew enough to change ranks, the later data would still be in the regression pool. Sadly, I cannot get this code, or any variation, to run.
My current approach is to try to generate some new variable that would give every row with country label X a value of 1 if that country was in the bottom quartile in 1960, but I can't get that to work either. I have run out of ideas, so I thought I would ask!
Based on your latest comment, which describes the (un)expected behavior:
clear
set more off
*----- example data -----
input ///
country year rank
1 1960 2
1 1961 1
1 1962 2
2 1960 1
2 1961 1
2 1962 1
3 1960 3
3 1961 3
3 1962 3
end
list, sepby(country)
*----- what you want -----
// tag countries whose first observation for -rank- is 1
// (I assume the first observation for -year- is always 1960)
bysort country (year) : gen toreg = rank[1] == 1
list, sepby(country)
// run regression conditional on -toreg-
xtreg ... if toreg
Check help subscripting if in doubt.
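If you would rather not rely on 1960 being the first observation for every country, one possible variation (a sketch, not part of the approach above) is to spread the 1960 condition across each country with egen's max() function. The expression inside max() is 1 only for a 1960 row with rank 1, so its maximum within a country is 1 exactly when such a row exists.

```stata
clear
input country year rank
1 1960 2
1 1961 1
1 1962 2
2 1960 1
2 1961 1
2 1962 1
3 1960 3
3 1961 3
3 1962 3
end

* 1 for every row of a country that had rank 1 in 1960, 0 otherwise
egen byte toreg = max(rank == 1 & year == 1960), by(country)

list, sepby(country)
```

Country 1 has rank 1 in 1961 but not in 1960, so it is not tagged here.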
I think egen might help me here, but for whatever reason I can't quite figure out the right syntax. I'd like to create a new variable that takes a value of 1 for all observations in a group if, for any of the observations in the group, X is true. So, for example, my data has the obs, group, and flag variables, and I want to generate the variable grpflag.
obs group flag grpflag
1 1 0 1
2 1 1 1
3 1 0 1
4 2 0 0
5 2 0 0
6 2 0 0
7 3 1 1
8 3 0 1
So, in the example data, since flag==1 for one (i.e., any) of the observations in group 1, I want grpflag to take the value 1 for all observations in group 1. The same is true for group 3, and the opposite is true for group 2.
You were right: the egen command can do this.
egen grpflag = max(flag), by(group)
See the Stata FAQ http://www.stata.com/support/faqs/data-management/create-variable-recording/ for more detail on the correspondences any:maximum and all:minimum exploited in Stata.
Note that while your example is easy (flag is already 0 or 1, so max() can be applied directly to flag) the argument of max() can be an expression, so the syntax extends easily to more general cases, e.g. max(foo == 42).
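To make the any:maximum and all:minimum correspondences concrete, here is a small sketch on the example data from the question. The names anyflag and allflag are made up for illustration.

```stata
clear
input obs group flag
1 1 0
2 1 1
3 1 0
4 2 0
5 2 0
6 2 0
7 3 1
8 3 0
end

* "any": 1 if flag == 1 for at least one observation in the group
egen anyflag = max(flag == 1), by(group)

* "all": 1 only if flag == 1 for every observation in the group
egen allflag = min(flag == 1), by(group)

list, sepby(group)
```

Because flag == 1 is an expression that evaluates to 0 or 1, the same pattern works for any true-or-false condition on the data.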
Even if egen were not available, or did not work like this, this kind of one-liner is possible in Stata, and will be more efficient than calling egen:
bysort group (flag) : gen grpflag = flag[_N]
However, that would be thrown by missings on flag, so you would need to work around that. In turn that could just be
gen isflag = flag == 1
bysort group (isflag) : gen grpflag = isflag[_N]
The general principle is that so long as what you are sorting is just 0 and 1, any values of 1 will be sorted to the end within each block of observations.