Generating a variable which records the number of repetitions - stata

I have a longitudinal dataset in Stata in which IDs are repeated. I want to generate a new variable that numbers the observations within each ID (like the column "visit" in the image). How can I write the code?
[Image: example data with a "visit" column]

You can use: bysort ID : gen visit = _n
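For readers who want to check the logic outside Stata, here is a minimal Python sketch of what a within-ID running count does. The function name visit_numbers and the sample IDs are made up for illustration; the sketch assumes the rows are already in the desired order within each ID.

```python
from collections import defaultdict


def visit_numbers(ids):
    """Return the 1-based running count of each ID, in input order."""
    seen = defaultdict(int)  # how many times each ID has appeared so far
    visits = []
    for i in ids:
        seen[i] += 1
        visits.append(seen[i])
    return visits


# Hypothetical IDs, already grouped as they would be after sorting by ID
ids = [1, 1, 1, 2, 2, 3]
visits = visit_numbers(ids)
print(visits)  # [1, 2, 3, 1, 2, 1]
```

If visits should follow calendar order, sort the rows by ID and visit date first, which is what bysort ID (datevar) would do in Stata with a date variable of your own.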

Related

Identify groups with few observations in panel-data models (Stata)

How can I identify groups with few observations in panel-data models?
I estimated several random-effects models using xtlogit. On average I have 26 observations per group, but some groups record only 1 observation. I want to identify them and exclude them from the models. Any suggestions on how to do this?
My panel data is set using: xtset countrycode year
Let's suppose your magic number for a big enough panel is 7 and that you fit a first model.
bysort countrycode : egen n_used = total(e(sample))
then gives you a count of how many observations were available and can be used, after which your criterion for a later model is if n_used >= 7
You could just go
bysort countrycode : gen n_available = _N
regardless of a model fit.
The differences are two-fold:
That last statement counts every observation in the group, including any with missing values in the variables used in a model fit, which e(sample) excludes.
If you also used if and/or in to restrict model fit to particular subsets of observations, then e(sample) knows about that, but the last statement does not.
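The difference between the two counts can be sketched in Python. The rows below are hypothetical, and "used" is approximated as "outcome not missing", which is only a stand-in for what e(sample) records after a real model fit.

```python
from collections import Counter

# Hypothetical rows: (countrycode, year, outcome); None marks a missing value
rows = [
    ("A", 2000, 1),
    ("A", 2001, None),
    ("A", 2002, 0),
    ("B", 2000, 1),
]

# Analogue of _N by group: every observation counts
n_available = Counter(code for code, _, _ in rows)

# Analogue of total(e(sample)) by group: only usable observations count
n_used = Counter(code for code, _, y in rows if y is not None)

# Keep groups with enough usable observations (2 here; the answer uses 7)
min_obs = 2
kept = [r for r in rows if n_used[r[0]] >= min_obs]

print(dict(n_available))  # {'A': 3, 'B': 1}
print(dict(n_used))       # {'A': 2, 'B': 1}
```

Group B is dropped under either count here, while group A has 3 observations available but only 2 usable.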

Finding string value associated with max value of record subset in long format

For non-longitudinal analysis using long-formatted data, when subjects have multiple visits or records, I will typically hunt down a record within each subject using bysort ID, and set a temporary variable to hold the integer or real value that I found, and then egen max() to find the max value for all records found, then set a final value in record _n==1 for that subject. This is so I can have the values I want from different visits percolate to a single record for each subject. Each single record per subject will then be used during analysis (but not longitudinal, maybe cross-sectional or regression, ANOVA, etc.)
Let's say I want the highest cholesterol (ldl) value for the 3rd year of a trial, where ldl is measured quarterly (every 3 months) for all subjects, which can be accomplished using the code below:
cap drop ldl3tmp
cap drop ldl3max
cap drop ldl3
bysort id (visitdate): gen ldl3tmp = ldl if trialyear==3
bysort id (visitdate): egen ldl3max = max(ldl3tmp)
bysort id (visitdate): gen ldl3 = ldl3max if _n==1
Suppose there are initials for the lab technician or phlebotomist that did the blood draw. How can I percolate a string value to record _n==1 that's associated with the greatest ldl value among the subset of records for the 3rd year of the trial? String values can't be sorted, so I am guessing the answer might be to eliminate records for which ldl is not the greatest value in year 3, then the string will be in that record?
In this case, how can I find out what _n is for the maximum value? If I know that, I could use
bysort id (visitdate): drop if _n!=6 //if _n==6 has the max value of ldl
Here is how to find the record number associated with the greatest ldl value within 4 quarterly ldl values in year 3 of a trial. The result is a variable called recmax, which will only be filled in for the specific record where the greatest value was found (among all records for each subject).
cap drop tmpldl3
cap drop maxldl3
cap drop recmax
cap drop visitdate
gen long visitdate = date(dateofvisit, "MDY") //You have to convert date ("MM/DD/YYYY") to a long integer format - based on #days since Jan 1, 1960
bysort id (visitdate): gen tmpldl3 = ldl if trialyear ==3
bysort id (visitdate): egen maxldl3 = max(tmpldl3)
bysort id (visitdate): gen recmax = _n if tmpldl3==maxldl3 & tmpldl3!=. & maxldl3!=.
You can then analyze all the other data (such as string data) in that record cross-sectionally (ANOVA, correlation, regression) by specifying if recmax!=. as the trailing if qualifier of any analysis command. Note that if two records tie for the maximum, both are flagged. If you are careful, you could also drop all other records with extraneous ldl values not of interest by using the command drop if recmax==. (equivalently, keep if recmax!=.), provided you realize you dropped data and, if you save, save to a filename with "_reduced" or "_dropped" in it.
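The same idea, carrying a string along with the record that holds the maximum, can be sketched in Python. The field name tech, the function name tech_at_max_ldl, and the sample records are assumptions made for illustration; on ties this sketch keeps the first record in input order, which is a choice the Stata code above does not make.

```python
def tech_at_max_ldl(records, year=3):
    """Map each subject id to (max ldl, technician) among records in `year`."""
    best = {}
    for rec in records:
        # Skip other trial years and missing ldl values
        if rec["trialyear"] != year or rec["ldl"] is None:
            continue
        current = best.get(rec["id"])
        # Strict > keeps the earliest record when values tie
        if current is None or rec["ldl"] > current["ldl"]:
            best[rec["id"]] = rec
    return {sid: (r["ldl"], r["tech"]) for sid, r in best.items()}


# Hypothetical records, assumed sorted by id and visit date
records = [
    {"id": 1, "trialyear": 3, "ldl": 130, "tech": "AB"},
    {"id": 1, "trialyear": 3, "ldl": 155, "tech": "CD"},
    {"id": 1, "trialyear": 2, "ldl": 190, "tech": "EF"},
    {"id": 2, "trialyear": 3, "ldl": None, "tech": "GH"},
    {"id": 2, "trialyear": 3, "ldl": 101, "tech": "IJ"},
]
result = tech_at_max_ldl(records)
print(result)  # {1: (155, 'CD'), 2: (101, 'IJ')}
```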

Calculate the number of firms at a given month

I'm working on a dataset in Stata.
The first column is the name of the firm, the second column is the start date of the firm, and the third column is its expiration date. If expdate is missing, the firm is still in business. I want to create a variable that records the number of firms at a given time (preferably a monthly variable).
I'm really lost here. Please help!
Next time, try using dataex (ssc install dataex) rather than a screenshot. This is recommended in the Stata tag wiki and will help others help you!
Here is an example of how to count the number of firms that are alive in each period (I'll use years, but point out where you can switch to months). This example borrows from Nick Cox's Stata Journal article on this topic.
First, load the data:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long(firmID dt_start dt_end)
3923155 20080123 99991231
2913168 20070630 99991231
3079566 20000601 20030212
3103920 20020805 20070422
3357723 20041201 20170407
4536020 20120201 20170407
2365954 20070630 20190630
4334271 20110721 20191130
4334338 20110721 20170829
4334431 20110721 20190429
end
Note that in my example data the dates are not in Stata format, so I'll convert them here:
tostring dt_start, replace
generate startdate=date(dt_start, "YMD")
tostring dt_end, replace
generate enddate=date(dt_end, "YMD")
format startdate enddate %td
Next make a variable with the time interval you'd like to count within:
generate startyear = year(startdate)
generate endyear = year(enddate)
In my dataset, missing end dates begin with '9999', while yours are coded as '.'. I'll set these to the current year, the assumption being that the dataset is current. With '.' end dates, the condition would be if missing(endyear) instead. You'll have to decide whether this is appropriate in your data.
replace endyear = year(date("$S_DATE","DMY")) if endyear == 9999
Next create an observation for the first and last years (or months) that the firm is alive:
expand 2
by firmID, sort: generate year = cond(_n == 1, startyear, endyear)
keep firmID year
duplicates drop // keeps one observation for firms that die in the period they were born
Now expand the dataset to have an observation for every period between the start and end date. For this I use tsfill.
xtset firmID year
tsfill
Now I have one observation per existing firm in each period. All that remains is to count the observations by year:
egen entities = count(firmID), by(year)
drop firmID
duplicates drop
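As a cross-check on the expand-and-fill approach, the counting logic can be sketched in Python. The function name firms_alive_by_year is made up for illustration; the three spans are the start and end years of firms 3079566, 3103920 and 3357723 from the example data above.

```python
from collections import Counter


def firms_alive_by_year(spans):
    """Count firms alive in each year, given (startyear, endyear) pairs."""
    counts = Counter()
    for start, end in spans:
        # A firm counts in every year from its first to its last, inclusive
        for year in range(start, end + 1):
            counts[year] += 1
    return dict(sorted(counts.items()))


# (startyear, endyear) for three firms from the example data
spans = [(2000, 2003), (2002, 2007), (2004, 2017)]
alive = firms_alive_by_year(spans)
print(alive[2001], alive[2003], alive[2010])  # 1 2 1
```

For monthly counts the same loop would work on a month index (for example year * 12 + month) instead of the year.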

Comparing values within columns in SAS

I have not been able to find an answer to this question, although there are some similar ones.
I have a large dataset (9 million rows) that contains
an id column ("id"),
an identifier if "id" is new for that period ("new_id"),
a latitude column ("lat")
and a longitude column ("long").
In SAS, what I want to do is, using the geodist function, compare the distance between each row and create an indicator, "nearby", that equals 1 if the distance between that id and any other id is less than 50 miles and "new_id" = 1 for the other row.
Below is some pseudo code of what I'm trying to do. Any help would be greatly appreciated. Thanks!
Pseudo code:
For all_rows1 in data
    For all_rows2 in data
        if geo_dist(all_rows1(lat), all_rows1(long), all_rows2(lat), all_rows2(long)) < 50
            and all_rows2(new_id) = 1
        then all_rows1(nearby) = 1
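The pseudo code can be made concrete with a small Python sketch. It is not SAS and does not call geodist: the haversine formula is swapped in as an approximate great-circle distance, and the function names and sample rows are hypothetical. It also assumes one row per id, so "any other id" is treated as "any other row".

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance in miles between two points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def flag_nearby(rows, limit=50.0):
    """Return 1 for each row with another new_id == 1 row closer than `limit`."""
    flags = []
    for i, a in enumerate(rows):
        near = any(
            j != i
            and b["new_id"] == 1
            and haversine_miles(a["lat"], a["long"], b["lat"], b["long"]) < limit
            for j, b in enumerate(rows)
        )
        flags.append(1 if near else 0)
    return flags


# Hypothetical rows, one per id
rows = [
    {"id": 1, "new_id": 0, "lat": 40.0, "long": -75.0},
    {"id": 2, "new_id": 1, "lat": 40.3, "long": -75.0},
    {"id": 3, "new_id": 1, "lat": 45.0, "long": -75.0},
]
flags = flag_nearby(rows)
print(flags)  # [1, 0, 0]
```

This all-pairs loop does work proportional to the square of the number of rows, so it only illustrates the logic; with 9 million rows the comparison would first need to be restricted, for example to rows with new_id = 1 and to coarse latitude/longitude bins.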

Counting different people associated with one phone number in SAS

I have a few million records with a list of names and phone numbers. I need to count how many people are associated with each unique phone number. The phone numbers are associated with duplicate names and unique names, so for each phone number I need to count the number of distinct users. Then this needs to be mapped to a list of stores. I tried selecting distinct phones/distinct phones, but that only gives me a ratio of a distribution. For example, if there are 10 people using three phones, my ratio tells me that 3 phones are distributed among 10 people, but it doesn't tell me the actual number of people within that distribution associated with each phone. Can anyone please help me with the SAS code to get the correct count, so that I know exactly how many people are associated with the same phone number? Thanks in advance.
If you want just the number of rows that have the same phone number, you use:
proc sql;
create table phone_number_counts as
select phonenumber, count(1) as count_users
from dset
group by phonenumber;
quit;
If you want to count distinct names within each phone number, i.e., if
555-123-4567 John H
555-123-4567 John H
555-123-4567 Mary Y
should result in 2, not in 3 (the first code would yield 3), then use count(distinct name) instead of count(1).
If you want something else, some example data would be helpful, i.e., an example of the initial data and an example of a correct final dataset.
I believe you're looking for count(distinct name):
proc sql;
create table phone_number_counts as
select phonenumber,
count(*) as count_rows,
count(distinct name) as unique_names
from dset
group by phonenumber;
quit;
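To see the difference between the two counts outside SAS, here is a minimal Python sketch using the three example rows from the first answer. The function name count_by_phone is made up for illustration.

```python
from collections import defaultdict


def count_by_phone(rows):
    """Map each phone number to (row count, distinct name count)."""
    n_rows = defaultdict(int)
    names = defaultdict(set)
    for phone, name in rows:
        n_rows[phone] += 1      # analogue of count(*)
        names[phone].add(name)  # a set keeps each name once
    return {p: (n_rows[p], len(names[p])) for p in n_rows}


rows = [
    ("555-123-4567", "John H"),
    ("555-123-4567", "John H"),
    ("555-123-4567", "Mary Y"),
]
counts = count_by_phone(rows)
print(counts)  # {'555-123-4567': (3, 2)}
```

Names are compared exactly as typed here, so variants in spelling, case or trailing spaces would count as different people unless they are cleaned first.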