Collapsing across all variables in Stata

I am trying to collapse all variables in my dataset, which is as follows.
date number_of_patients health_center vaccinations
6/25/21 1 healthcentername 1
6/18/21 2 healthcentername 2
10/9/20 2 healthcentername 1
10/2/20 2 healthcentername 1
10/16/20 1 healthcentername 1
I am trying to collapse over dates, summing each variable, into:
number_of_patients health_center vaccinations
8 healthcentername 6
I want to do this for every health center, but I can't seem to do it without naming each variable I want to collapse. That isn't feasible because the dataset has 3500 variables.

Stata does need to be told which variables to sum by health center, but that doesn't mean you have to type them all. You can use ds to build a list of variable names. With the option not, ds lists every variable except the ones you name and stores the list in r(varlist). Like this:
* Example generated by -dataex-. For more info, type help dataex
clear
input str8 date byte number_of_patients str16 health_center byte vaccinations
"6/25/21" 1 "healthcentername" 1
"6/18/21" 2 "healthcentername" 2
"10/9/20" 2 "healthcentername" 1
"10/2/20" 2 "healthcentername" 1
"10/16/20" 1 "healthcentername" 1
end
*List all variables but the one mentioned and store list in r(varlist)
ds date health_center, not
*Sum by health center all but the variables explicitly excluded above
collapse (sum) `r(varlist)' , by(health_center)
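With 3500 variables, some of them may be strings, and collapse (sum) would then stop with a type mismatch. A minimal sketch of one way around that, assuming health_center is the only grouping variable and that every numeric variable should be summed (the macro names tosum and byvar are just illustrative):

```stata
* Small example dataset in the same layout as above
clear
input str8 date byte number_of_patients str16 health_center byte vaccinations
"6/25/21" 1 "healthcentername" 1
"6/18/21" 2 "healthcentername" 2
"10/9/20" 2 "healthcentername" 1
end
* List only the numeric variables; string variables such as date are skipped
ds, has(type numeric)
local tosum `r(varlist)'
* Remove the grouping variable from the list in case it is numeric
local byvar health_center
local tosum : list tosum - byvar
* Sum every remaining variable within each health center
collapse (sum) `tosum', by(`byvar')
list
```

If the date needs to survive the collapse in some form, it would have to be converted to a numeric date first and given its own statistic, which this sketch does not attempt.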

Related

Count the missing values in panel

I have a data set that looks like this:
id A
1 5
1 5
1 .
1 5
5 .
5 .
5 8
13 .
13 .
13 .
13 .
I want to count, in Stata, the missing values of A in panels where at least one A value is not missing.
In the example above, there are 3 missing values in panels that also contain non-missing values.
There is one missing A value when id is 1 and as there also are non-missing A values when id=1, I want to count that one.
Similarly, there are two missing A values when id is 5 and as there are also non-missing values when id=5, I want to count those two too.
There are 4 missing A values when id=13 but as there are no non-missing values when id=13, I don't want to count these.
I can't follow this, but the number of observations in each panel is
bysort id : gen count = _N
and the number of non-missing values of A is
by id : egen A_nm = count(A)
from which the missing values can be counted by subtraction. Alternatively, missing values can be counted directly by
by id: egen A_m = total(missing(A))
If that doesn't help, you may have to expand your question by showing what the new variable(s) you want looks like.
EDIT What you want may be just an application of this: you want to look at A_m values conditional on A_nm being positive.
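As a rough sketch of that last step, using the example data from the question, the missing values can be counted directly in panels that have at least one non-missing value of A:

```stata
* Example data from the question
clear
input id A
1 5
1 5
1 .
1 5
5 .
5 .
5 8
13 .
13 .
13 .
13 .
end
* Number of non-missing values of A in each panel
bysort id: egen A_nm = count(A)
* Count missing values of A only in panels where something is non-missing
count if missing(A) & A_nm > 0
display "missing values in panels with some data: " r(N)
```

Here id 13 contributes nothing because A_nm is 0 for that panel.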

Run a regression of countries by quartiles for a specific year

I am exploring an effect that I think will vary by GDP levels, from a data set that has, vertically, country and year (1960 to 2015), so each country label is on 55 rows. I ran
sort year
by year: egen yrank = xtile(rgdp), nquantiles(4)
which tags every year row with what quartile of GDP they were in that year. I want to run this:
xtreg fiveyearg taxratio if yrank == 1 & year==1960
which would regress my variable (tax ratio) against some averaged gdp data from countries that were in the bottom quartile of GDPs in 1960 alone. So even if later on they grew enough to change ranks, the later data would still be in the regression pool. Sadly, I cannot get this code, or any variation, to run.
My current approach is to try to generate some new variable that would give every row with country label X a value of 1 if they were in the bottom quartile in 1960, but I can't get that to work either. I have run out of ideas, so I thought I would ask!
Based on your latest comment, which describes the (un)expected behavior:
clear
set more off
*----- example data -----
input ///
country year rank
1 1960 2
1 1961 1
1 1962 2
2 1960 1
2 1961 1
2 1962 1
3 1960 3
3 1961 3
3 1962 3
end
list, sepby(country)
*----- what you want -----
// tag countries whose first observation for -rank- is 1
// (sorting on -year- within country puts the earliest year first; I assume it is always 1960)
bysort country (year): gen toreg = rank[1] == 1
list, sepby(country)
// run regression conditional on -toreg-
xtreg ... if toreg
Check help subscripting if in doubt.
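If 1960 is not guaranteed to be the first year observed for every country, a variant that names the year explicitly may be safer. This is only a sketch, assuming the quartile variable is called yrank as in the question; egen's max() over a true/false expression yields 1 for a country if the condition holds in any of its rows:

```stata
* Example data: quartile rank of each country in each year
clear
input country year yrank
1 1960 2
1 1961 1
1 1962 2
2 1960 1
2 1961 1
2 1962 1
3 1960 3
3 1961 3
3 1962 3
end
* 1 for every row of a country that was in the bottom quartile in 1960
egen toreg = max(yrank == 1 & year == 1960), by(country)
list, sepby(country)
```

The indicator can then be used in the same way, as an if qualifier on the estimation command.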

select last row by group with "collapse (last) ..." syntax

I want to select the last row in each subset of the data determined by one or more categorical variables.
Background. For each ticket in my data set, I have a ticketid and multiple transactions (sale, refund, sale, refund, sale...). I am only interested in keeping series that end in "sale".
My first step was to drop ticketids with evenly matched sales and refunds:
duplicates tag ticketid, gen(mult)
bysort ticketid: egen count_sale = total(transtatus == "Sale")
by ticketid: egen count_ref = total(transtatus == "Refund")
drop if mult & count_sale == count_ref
Now, I want to keep just the final sale when count_sale = count_ref + 1
sort ticketid time
preserve
** some collapse command
save "temp_terminal_sales.dta"
restore
append using "temp_terminal_sales.dta"
I can't figure out how (if at all) to use collapse here. I think I may just have to keep if mult, tag the last row with by ticketid: gen last = _n == _N and keep if last...? It seems like collapse should work. Here is the (wrong) syntax that seemed intuitive to me:
collapse (last), by(ticketid)
collapse (last) *, by(ticketid)
These don't work because (i) a varlist is required, and (ii) the by variables cannot be in the varlist.
Example data:
ticketid time myvar transtatus
1 1 2 "Sale"
1 2 2 "Refund"
2 1 2 "Sale"
3 1 2 "Sale"
3 2 2 "Refund"
3 3 2 "Sale"
3 4 2 "Refund"
4 1 2 "Sale"
4 2 2 "Refund"
4 3 2 "Sale"
Desired result:
ticketid time myvar transtatus
2 1 2 "Sale"
4 3 2 "Sale"
The easiest generic way to keep the last of a group is as follows. For a concrete example I assume panel data with identifier id and time variable time:
bysort id (time): keep if _n == _N
The generalisation is
bysort <variables defining groups> (<variable defining order first ... last>): keep if _n == _N
Many Stata commands support the in qualifier, but here we need if and the syntax hinges crucially on the fact that under the aegis of by: observation number _n and number of observations _N are determined within the groups defined by by:. Thus _n == 1 identifies the first and _n == _N identifies the last observation in each group.
Using drop if _n < _N with the same by: prefix is the dual command and has the same effect.
You touched on this approach in your question, but the intermediate step of creating an indicator variable is unnecessary.
For collapse a work-around is presumably just to use some other variable, or even to create one for the purpose as in gen anything = 1. But I would always use by: for your purpose.
There is a discursive tutorial on by: at http://www.stata-journal.com/article.html?article=pr0004 Searching the Stata Journal archives using by as a keyword will reveal more applications.
@NickCox has already provided the general answer. Now that you have given example data, I post a reproducible example with several syntaxes:
clear all
set more off
input ///
ticketid time myvar str10 transtatus
1 1 2 "Sale"
1 2 2 "Refund"
2 1 2 "Sale"
3 1 2 "Sale"
3 2 2 "Refund"
3 3 2 "Sale"
3 4 2 "Refund"
4 1 2 "Sale"
4 2 2 "Refund"
4 3 2 "Sale"
end
list, sepby(ticketid)
*-----
* Method 1
bysort ticketid (time): keep if transtatus[_N] == "Sale" // keep subsets
by ticketid: keep if _n == _N // keep last observation of subsets
list
*-----
* Method 2
// list of all variables except ticketid
unab allvars: _all
local exclvar ticketid
local mycvars: list allvars - exclvar
bysort ticketid (time): keep if transtatus[_N] == "Sale" // keep subsets
collapse (last) `mycvars', by(ticketid) // keep last observation of subsets
list
*-----
* Method 3
bysort ticketid (time): keep if transtatus[_N] == "Sale" & _n == _N
list
(Remember to reload the data for each method.)
Instead of dropping observations, consider also creating an indicator variable that tags the ones you want and then running the subsequent estimation commands with an if qualifier. For example, regress ... if ...

Modifying data in SAS: copying part of the value of a cell, adding missing data and labeling it

I have three questions about modifying a dataset in SAS. My data contain the day and the number of each tag that was registered by an antenna on that day.
The questions are:
1) The tag numbers are continuous and range from 1 to 560. Can I easily add the numbers within this range that were not registered on a specific day? For example, if 160-280 were not registered on 23-May and 40-190 were not registered on 24-May, I want to add those non-registered numbers only for that specific day. (In reality the non-registered numbers are much more scattered, and for a dataset spanning a few weeks that is too much to do by hand.)
2) Furthermore, I want to make a new variable indicating whether a tag has been registered (1) or not (0). Would it work to create this variable and set it to 1, then add the missing numbers and (assuming the new variable is not set for the added numbers) set the missing values to 0?
3) The last question concerns the format of the registered numbers, which looks like 528 000000000400 and 000 000000000054. I am only interested in the last three digits of the number and want to remove the others. If I could add the missing numbers, I could make a new variable after the data have been sorted by date and the original transponder code, but otherwise what would you suggest?
I would love some suggestions and thank you in advance.
I am inventing some data here, I hope I got your questions right.
data chickens;
do tag=1 to 560;
output;
end;
run;
data registered;
input date mmddyy8. antenna tag;
format date date7.;
datalines;
01012014 1 1
01012014 1 2
01012014 1 6
01012014 1 8
01022014 1 1
01022014 1 2
01022014 1 7
01022014 1 9
01012014 2 2
01012014 2 3
01012014 2 4
01012014 2 7
01022014 2 4
01022014 2 5
01022014 2 8
01022014 2 9
;
run;
proc sql;
create table dates as
select distinct date, antenna
from registered;
create table DatesChickens as
select date, antenna, tag
from dates, chickens
order by date, antenna, tag;
quit;
proc sort data=registered;
by date antenna tag;
run;
data registered;
merge registered(in=INR) DatesChickens;
by date antenna tag;
Registered=INR;
run;
data registeredNumbers;
input Numbers $16.;
datalines;
528 000000000400
000 000000000054
;
run;
data registeredNumbers;
set registeredNumbers;
NewNumbers=substr(Numbers,14);
run;
I do not know SAS, but here is how I would do it in SQL - may give you an idea of how to start.
1 - Birds that have not registered through pophole that day
SELECT b.BirdId
FROM Birds b
WHERE NOT EXISTS
(SELECT 1 FROM Pophole_Visits p WHERE b.BirdId = p.BirdId AND p.date = ????)
2 - Birds registered through pophole
If you have a dataset with pophole data you can query that to find if a bird has been through. What would your flag be doing - finding a bird that has never been through any popholes? Looking for dodgy sensor tags or dead birds?
3 - Data code
You might have more joy with the SUBSTRING function
Good luck

Use of Lag /Lead function

Kindly refer to the sample data. I have Month, Region and Values in my data set, and I need an Output column as shown below. Basically, for each Region, Output should hold the Values from the next Month. Kindly help.
Month Region Values Output
1 R1 2 3
1 R2 4 5
2 R1 3 4
2 R2 5 7
3 R1 4 6
3 R2 7 5
4 R1 6
4 R2 5
Thanks,
Gauraw
If I got it right, you want to assign as OUTCOME the value from the next month within each region. If so, then you can use two SET-statements, the second of which will add the same dataset, but shifted by one record (FIRSTOBS=2).
proc sort data=yourdata; by region month; run;
data result;
set yourdata;
by region;
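if not eof then set yourdata(firstobs=2 keep=values rename=(values=outcome)) end=eof;
if LAST.region then call missing(outcome);
run;
The second SET has to be conditional on NOT EOF, because otherwise we'd lose the last record of the dataset: the end of the second, shifted instance of the same dataset is reached one record earlier, and the DATA step stops as soon as a SET statement tries to read past the end of its input. (Wrapping the second SET in a DO UNTIL(EOF) loop would not work here, since the loop would read the whole shifted dataset during the first iteration of the DATA step.)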