Displaying variable sets that define each row - stata

To say that a dataset is at the (person, year) level means that each row of that dataset has a distinct (person, year) combination, like this:
person year wage
Mike 2000 10
Mike 2010 30
Jack 1990 20
How can I make Stata display exactly those (person, year) variable sets that uniquely define each row?
I want to make a log file to record
person year
only, but not display any individual information (displaying individuals' information in a log file is against the rules set by the data provider).
How could I do this?
What I thought about is using bysort in some way
bysort person year: gen num=_n
and if every num is 1, then it means (person, year) defines each row.
But if a dataset is extremely large, then checking whether every num is 1 is too tedious. Is there any smarter way?

The command isid checks whether the variables you supply do jointly specify observations uniquely. Here is an example you can try:
. webuse grunfeld, clear
. isid company
variable company does not uniquely identify the observations
r(459);
. isid company year
Note the principle: no news is good news.
Another way to check for problems is through duplicates. For example, try duplicates list person year. In your case, you don't want that in the log. But what you can do first is anonymise your persons through
egen id = group(person)
and then check for duplicates on id year.
See also this FAQ.
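A minimal sketch of both checks, written so that nothing identifying reaches the log. It uses the grunfeld dataset from the example above (webuse needs internet access). The variable name dup is hypothetical. In the asker's data, person and year (or the anonymised id and year) would replace company and year.

```stata
* Load the example data used above
webuse grunfeld, clear

* capture suppresses isid's output; _rc is 0 if the key is unique, 459 if not
capture isid company year
display "isid return code: " _rc

* duplicates report shows only counts of copies, never the values themselves
duplicates report company year

* Tag observations that share a key with another observation, then count them
duplicates tag company year, generate(dup)
count if dup > 0
```

Because duplicates report and count print only totals, they should be safe to log even where listing individual observations is not allowed.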

Related

Combining surveys with distinct analytical weights in Stata

I have a dataset that combines 14 household surveys from 14 countries. The surveys were conducted in different years, and each has a household weight variable that is specific to that country's context (the data structure is the same across the 14 countries).
Now I have merged them and tried to cross-tabulate the country and gender_area variables (gender_area takes four values: male_rural, female_rural, male_urban, female_urban) with weights (tab country gender [aw=hhweight], m). But I found that this cross-tabulation produces odd values for some of the countries.
For example, if I add an if condition at the end of the tab command (tab country gender [aw=hhweight] if abc==1, m), the row totals for some countries (KHM, NPL) are greater than their row totals without the condition, even though the condition should give a smaller subsample. If I don't use the weight (tab country gender, m), there is no such problem. If I tabulate just one country with the weight, there is no such problem either.
So I wonder if there is any way for me to compare all countries with weight. I am not that familiar with survey data reference in Stata (svyset, strata, etc).
I tried to refer to the book Applied Survey Data Analysis, but it seems that it doesn't contain methodology to deal with such a combination.

identify groups with few observations in paneldata models (stata)

How can I identify groups with few observations in panel-data models?
I estimated using xtlogit several random effects models. On average I have 26 obs per group but some groups only record 1 observation. I want to identify them and exclude them from the models... any suggestion how?
My panel data is set using: xtset countrycode year
Let's suppose your magic number for a big enough panel is 7 and that you fit a first model.
bysort countrycode : egen n_used = total(e(sample))
then gives you a count of how many observations were available and can be used, after which your criterion for a later model is if n_used >= 7
You could just go
bysort countrycode : gen n_available = _N
regardless of a model fit.
The differences are two-fold:
The last statement counts every observation in the panel, whether or not any of the variables used in a model fit are missing.
If you also used if and/or in to restrict model fit to particular subsets of observations, then e(sample) knows about that, but the last statement does not.
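The e(sample) approach above can be sketched end to end as follows. This is an illustration only: it assumes the union dataset available through webuse (with panel variable idcode and outcome union), not the asker's data, and the threshold of 7 is the arbitrary "magic number" from the answer.

```stata
* Example panel data (requires internet access); idcode is the panel, year the time
webuse union, clear
xtset idcode year

* First model: fitted on all usable observations
xtlogit union age grade, re

* Count, per panel, the observations actually used in that fit
bysort idcode: egen n_used = total(e(sample))

* Later model: keep only panels with at least 7 usable observations
xtlogit union age grade if n_used >= 7, re
```

After the second fit, e(sample) refers to the restricted sample, so n_used should be recomputed under a new name if a further round of filtering is needed.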

Calculate the number of firms at a given month

I'm working on a dataset in Stata
The first column is the name of the firm. The second column is the start date of this firm and the third column is the expiration date of this firm. If the expdate is missing, this firm is still in business. I want to create a variable that will record the number of firms at a given time (preferably a monthly variable).
I'm really lost here. Please help!
Next time, try using dataex (ssc install dataex) rather than a screenshot. This is recommended in the Stata tag wiki, and will help others help you!
Here is an example of how to count the number of firms that are alive in each period (I'll use years, but point out where you can switch to month). This example borrows from Nick Cox's Stata Journal article on this topic.
First, load the data:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long(firmID dt_start dt_end)
3923155 20080123 99991231
2913168 20070630 99991231
3079566 20000601 20030212
3103920 20020805 20070422
3357723 20041201 20170407
4536020 20120201 20170407
2365954 20070630 20190630
4334271 20110721 20191130
4334338 20110721 20170829
4334431 20110721 20190429
end
Note that in my example data the dates are not in Stata format, so I'll convert them here:
tostring dt_start, replace
generate startdate=date(dt_start, "YMD")
tostring dt_end, replace
generate enddate=date(dt_end, "YMD")
format startdate enddate %td
Next make a variable with the time interval you'd like to count within:
generate startyear = year(startdate)
generate endyear = year(enddate)
In my dataset I have missing end dates that begin with '9999' while you have them as '.' I'll set these to the current year, the assumption being that the dataset is current. You'll have to decide whether this is appropriate in your data.
replace endyear = year(date("$S_DATE","DMY")) if endyear == 9999
Next create an observation for the first and last years (or months) that the firm is alive:
expand 2
by firmID, sort: generate year = cond(_n == 1, startyear, endyear)
keep firmID year
duplicates drop // keeps one observation for firms that die in the period they were born
Now expand the dataset to have an observation for every period between the start and end date. For this I use tsfill.
xtset firmID year
tsfill
Now I have one observation per existing firm in each period. All that remains is to count the observations by year:
egen entities = count(firmID), by(year)
drop firmID
duplicates drop
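For monthly counts, the same steps should carry over with mofd() in place of year(). The following is a self-contained sketch on three made-up firms (the IDs and dates are hypothetical, and the dates are entered as strings to skip the tostring step). It is an adaptation of the yearly code above, not part of the original answer.

```stata
* Hypothetical toy data: three firms with start and end dates as YYYYMMDD strings
clear
input long firmID str8(dt_start dt_end)
1 "20200115" "20200320"
2 "20200210" "20200210"
3 "20200301" "20200501"
end

* Convert to Stata daily dates, then to monthly dates with mofd()
generate startmonth = mofd(date(dt_start, "YMD"))
generate endmonth   = mofd(date(dt_end, "YMD"))
format startmonth endmonth %tm

* One observation each for the first and last month a firm is alive
expand 2
bysort firmID: generate month = cond(_n == 1, startmonth, endmonth)
keep firmID month
duplicates drop   // firms that start and end in the same month keep one row

* Fill in the months between start and end, then count firms per month
xtset firmID month, monthly
tsfill
egen entities = count(firmID), by(month)
```

With real data, open-ended firms would still need an end month assigned first, as in the 9999 step above.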

Count unique patients and overall observation using PROC SQL

I am working in SAS but using some SQL code to count the number of unique patients and also the total number of observations for a set of indicators. Each record has a patient identifier, the facility where the patient is, and a group of binary indicators (0,1) for each bed section (the particular place in the hospital where the patient is). For each patient record, only 1 bed section can have a value of '1'. Overall, patients can have multiple observations in a bed section or in other bed sections, i.e. patients can be hospitalized more than once. The idea is to roll this data set up by facility and count the total # of admissions for each bed section but also the total people for each bed section. The people count will always be <= the observation count. Counting people was just added to my to-do list, and up to this point I was only summing up observations for each bed section using the code below:
proc sql;
create table fac_bedsect as
select facility,
sum(bedsect_alc) as bedsect_alc,
sum(bedsect_blind) as bedsect_blind,
sum(bedsect_gen) as bedsect_gen
from bedsect_type
group by facility;
quit;
Is there a way I can incorporate into this code the # of unique people for each bed section? Thanks.
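With no knowledge of the source table(s) it is impossible to answer precisely. Note that the query below counts distinct patients per facility; a distinct count per bed section would probably need a conditional form such as COUNT(DISTINCT CASE WHEN bedsect_alc = 1 THEN patient_id END), which is a suggestion and has not been tested against your table. The basic syntax for counting distinct values is as seen below. You will need to use the correct column name where I have used "patient_id":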
SELECT
facility
, COUNT(DISTINCT patient_id) AS patient_count
, SUM(bedsect_alc) AS bedsect_alc
, SUM(bedsect_blind) AS bedsect_blind
, SUM(bedsect_gen) AS bedsect_gen
FROM bedsect_type
GROUP BY
facility
;

Stata: sales growth rate of multiple groups

Cross-posting:
german: http://www.stata-forum.de/post1716.html#p1716
english: http://www.talkstats.com/showthread.php/47299-sales-growth-rate-with-multiple-groups-conditions
I want to calculate the annual sales growth rate of different firm-groups in Stata. The firms are grouped by variables country and industry.
I summed sales for each group (called it sales_total: sales of all firms in a group with equal country, industry and year):
bysort country year industry: egen sales_total = sum(sales)
I have a much bigger sample, but I tried to calculate the growth-rate with a smaller sample.
I tried multiple combinations such as:
egen group = group(year country industry)
xtset group year, yearly
bys group: g salesgrowth = log(D.sales_total)
or
bysort group: gen salesgrowth=(sales[_n]-sales[_n-1])/sales[_n-1]
also with tsset.
and tried everything from this answer:
Generate percent change between annual observations in Stata?
but I always get error messages such as
repeated time values within panel
or
repeated time values within sample
due to the repetition of the number in a variable such as group.
Can you help me to find the yearly growth rate from each group (firms from same country & industry)?
update
Here again is an example of my observations (the full data normally have 10,000 firms over 10 years). There are also missing values (for sales, industry, year, country).
firms  country  year  industry  sales
a      usa      1     1         300
a      usa      2     1         4000
b      ger      1     1         200
b      ger      2     1         400
c      usa      1     1         100
c      usa      2     1         300
d      usa      1     1         400
d      usa      2     1         200
e      usa      1     1         7000
e      usa      2     1         900
f      ger      1     2         100
f      ger      2     2         700
h      ger      1     2         700
h      ger      2     2         600
... etc.
I tried the programming you mentioned, but I ended up with several variables spread across the same row rather than in the same column (which is probably what I need). Is it possible to keep the data without reshaping, in the long layout, for example by grouping the observations:
egen group=group(industry year country)
and then try
xtset group year
bysort group: sales_growth = log(D.sales)
or
bysort group: gen sales_growth = (sales[_n]-sales[_n-1])/sales[_n-1]
Thank you!
The strategy here is trying to work at the wrong level of resolution. You should
collapse (sum) sales, by(country year industry)
and then work with that reduced dataset. Depending on what you want precisely, you will probably need to restructure that data with reshape so that different industries give different variables. Then
xtset country year
and growth rates will then be easier to calculate.
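As a sketch of the collapse idea that avoids the reshape, each (country, industry) pair can be treated as its own panel. This is one possible reading of what is wanted, not necessarily the answerer's intended route. The variable names panel and growth are hypothetical, and the data are the first six rows of the asker's example. Note that xtset needs a numeric panel variable, which egen's group() supplies.

```stata
* First rows of the example data from the question
clear
input str1 firm str3 country byte(year industry) double sales
"a" "usa" 1 1 300
"a" "usa" 2 1 4000
"b" "ger" 1 1 200
"b" "ger" 2 1 400
"c" "usa" 1 1 100
"c" "usa" 2 1 300
end

* Reduce to one row per country-industry-year, with sales summed over firms
collapse (sum) sales, by(country year industry)

* One numeric panel identifier per country-industry pair
egen panel = group(country industry)
xtset panel year

* Growth rate as the change relative to the previous year's total
generate growth = D.sales / L.sales
```

Because year is no longer part of the group definition, each panel has one observation per year and the "repeated time values" error should not arise.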