Create unique personal id - grouping

I have a Stata data set like this:
HouseholdId PersonId OtherVariables
1 1
1 2
2 1
2 2
3 1
3 2
Here HouseholdId is a unique identifier for each household and PersonId is a unique identifier for each person in a household. If I want to create a unique personal id for each person within the sample, period. How would I do this?
I have tried egen per_id = group(PersonID HouseholdID)
but that doesn't seem to work.

I take it that you want a unique identifier for each person within the entire dataset. That could be just
sort HouseholdId PersonId
gen long obs Id = _n
as follows from an accessible discussion in this Stata FAQ. That would have been found by typing in Stata
search identifier
or even
search id
(Meta-answer: You can and should look within Stata for information on basic notions like this.)
I add a strong recommendation that the word unique still carries its original meaning of appearing once only. The word distinct is, I suggest, a much better word when that is what you mean. More on that on p.558 of this paper.

Related

identify groups with few observations in paneldata models (stata)

How can I identify groups with few observations in panel-data models?
I estimated using xtlogit several random effects models. On average I have 26 obs per group but some groups only record 1 observation. I want to identify them and exclude them from the models... any suggestion how?
My panel data is set using: xtset countrycode year
Let's suppose your magic number for a big enough panel is 7 and that you fit a first model.
bysort countrycode : egen n_used = total(e(sample))
then gives you a count of how many observations were available and can be used, after which your criterion for a later model is if n_used >= 7
You could just go
bysort countrycode : gen n_available = _N
regardless of a model fit.
The differences are two-fold:
That last statement would disregard any missing values in the variables used in a model fit.
If you also used if and/or in to restrict model fit to particular subsets of observations, then e(sample) knows about that, but the last statement does not.

How to rank without skipping values? [duplicate]

I have some data in Stata which look like the first two columns of:
group_id var_to_rank desired_rank
____________________________________
1 10 1
1 20 2
1 30 3
1 40 4
2 10 1
2 20 2
2 20 2
2 30 3
I'd like to create a rank of each observation within group (group_id) according to one variable (var_to_rank). Usually, for this purpose I used:
gen id = _n
However some of my observations (group_id = 2 in my small example) have the same values of ranking variable and this approach doesn't work.
I have also tried using:
egen rank
command with different options, but cannot make my rank variables make to look like desired_rank.
Could you point me to a solution to this problem?
The following works for me:
bysort group_id: egen desired_rank=rank(var_to_rank)
I'd say this question is posed the wrong way round for best understanding. The aim is to group observations, those with the lowest value all being assigned a grade 1, the next lowest being all assigned 2 and so forth. This isn't ranking in most senses that I have seen discussed, but Stata's egen, rank() does get you part of the way.
But the direct way, which was mentioned in the Statalist thread cited elewhere in this thread (start here) is simpler in spirit than any solution quoted:
bysort group_id (var_to_rank): gen desired_rank = sum(var_to_rank != var_to_rank[_n-1])
Once data are sorted on var_to_rank then when values differ from previous values at the start of each block of distinct values a value of 1 is the result of var_to_rank != var_to_rank[_n-1]; otherwise 0 is the result. Summing those 1s and 0s cumulatively gives the desired variable. The prefix command bysort does the sorting required and ensures that this is all done separately within the groups defined by group_id. No need for egen at all (a command that many people who only use Stata occasionally often find bizarre).
Declaration of interest: The Statalist thread cited shows that when asked a similar question I too did not think of this solution in one.
Stumbled upon such solution on the Statalist:
bysort group_id (var_to_rank) : gen rank = var_to_rank != var_to_rank[_n-1]
by group_id : replace rank = sum(rank)
Seems to sort out this issue.
#radek: you surely got it sorted out in the meantime ... but this would have been an easy (though not very elegant) solution:
bysort group_id: egen desired_rank_HELP =rank(var_to_rank), field
egen desired_rank =group(grup_id desired_rank_HELP)
drop desired_rank_HELP
Way too much work. Easy and elegant. Try this one.
gen desired_rank=int(var_to_rank/10)
try this command, it works for me so well: egen newid=group(oldid)

Displaying variable sets that define each row

To say that a dataset is (person, year) level means that each row of that dataset has different (person, year) like this:
person year wage
Mike 2000 10
Mike 2010 30
Jack 1990 20
How can I make Stata display exactly those (person, year) variable sets that uniquely define each row?
I want to make a log file to record
person year
only, but not display any individual information (displaying individuals' information in a log file is against the rules set by the data provider).
How could I do this?
What I thought about is using bysort in some way
bysort person year: gen num=_n
and if every num is 1, then it means (person, year) defines each row.
But if a dataset is extremely large, then checking whether every num is 1 is too tedious. Is there any smarter way?
The command isid checks whether the variables you supply do jointly specify observations uniquely. Here is an example you can try:
. webuse grunfeld, clear
. isid company
variable company does not uniquely identify the observations
r(459);
. isid company year
Note the principle: no news is good news.
Another way to check for problems is through duplicates. For example, try duplicates list person year. In your case, you don't want that in the log. But what you can do first is anonymise your persons through
egen id = group(person)
and then check for duplicates on id year.
See also this FAQ.

Count unique patients and overall observation using PROC SQL

Working in SAS but using some SQL code to count the number of unique patients but also the total number of observations for a set of indicators. Each record has a patient identifier, the facility where the patient is, and a group of binary indicators (0,1) for each bed section (the particular place in the hospital where the patient is). For each patient record, only 1 bed section can have a value of '1'. Overall, patients can have multiple observations in a bed section or in other bed sections, i.e. patients can be hospitalized > 1. The idea is to roll this data set up by facility and count the total # of admissions for each bed section but also the total people for each bed section. The people count will always be <= to the observation count. Counting people was just added to my to-do list and to this point I was only summing up observations for each bed section using the code below:
proc sql;
create table fac_bedsect as
select facility,
sum(bedsect_alc) as bedsect_alc,
sum(bedsect_blind) as bedsect_blind,
sum(bedsect_gen) as bedsect_gen
from bedsect_type
group by facility;
quit;
Is there a way I can incorporate into this code the # of unique people for each bed section? Thanks.
With no knowledge of the source table(s) it is impossible to answer precisely, but the syntax for counting distinct values is as seen below. You will need to use the correct column name where I have used "patient_id":
SELECT
facility
, COUNT(DISTINCT patient_id) AS patient_count
, SUM(bedsect_alc) AS bedsect_alc
, SUM(bedsect_blind) AS bedsect_blind
, SUM(bedsect_gen) AS bedsect_gen
FROM bedsect_type
GROUP BY
facility
;

EU-SILC database about education and experience

I am using EU-SILC database for 2008 for Greece. Firstly, I would like to use PE040 so as to create three dummies: primeduc for education on pre-primary AND primary school seceduc on lower secondary education +(upper) secondary + post-secondary non tertiary education and tereduc on 1st + 2nd tertiary stage.
Secondly, I would like to make a variable about working experience based on the idea exper=age-educ-6 where educ I would like sth about the years (generally) spent in education.
Any ideas of which commands I should use on stata???
What I've tried so far
About stata syntax:
tabulate PE040, gen(educ)
gen primeduc=educ1+educ2
gen seceduc=educ3+educ4+educ5
gen tereduc=educ6
Having defined lnwage as =log(PY010N/(PL060+PL070)) and age as =2008-PB140, I've tried to regress and it takes only into account 191 obs.
For your first question, I think you want a 0-1 indicator, equal to 1 if either of the indicated educational categories was recorded.
gen primeduc=educ1 | educ2
gen seceduc =educ3 |educ4 |educ5
The "|" stands for logical "or". For example, primeduc will be 1 if educ1 is 1 or educ2 is 1.