Which function in R would be equivalent to bysort variable1: generate variable2=_n (Stata)?

I have a database with ids and hospital visits. The same id can have several visits. Among those who have more than one visit, I would like to know which visit number corresponds to each hospital admission. In Stata I would do:
bysort id: generate number_visit=_n
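If the visits should be numbered in chronological order, a sort key can be added inside the parentheses. A minimal sketch, assuming an admission date variable (here called admission_date, a name not in the original post):
* number visits within id in admission-date order (admission_date is an assumed variable name)
bysort id (admission_date): generate number_visit = _n
* count visits per id, to identify the ids with more than one visit
bysort id: generate n_visits = _N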

Related

Combining surveys with distinct analytical weights in Stata

I have a dataset that combines 14 household surveys from 14 countries. Each survey was conducted in a different year, and each has a household weight variable specific to that country's context (the data structure is the same across the 14 countries).
After merging them, I tried to cross-tabulate country against gender_area (four values: male_rural, female_rural, male_urban, female_urban) with weights (tab country gender [aw=hhweight], m), but the cross-tabulation produces odd values for some countries.
For example, if I add an if condition at the end of the tab (tab country gender [aw=hhweight] if abc==1, m), some countries' (KHM, NPL) row totals come out greater than their row totals without the condition, even though a condition should only shrink the subsample. If I drop the weight (tab country gender, m), there is no such problem, and there is no problem either if I tabulate a single country with the weight.
So I wonder whether there is any way to compare all countries with weights. I am not very familiar with survey-data handling in Stata (svyset, strata, etc.).
I tried to refer to the book Applied Survey Data Analysis, but it does not seem to cover this kind of combination.

Generating a variable which records the number of repetitions

I have a longitudinal dataset in Stata in which IDs are repeated. I want to generate a new variable that counts the repetitions of each ID (like the column "visit" in the image). How can I write the code?
[image: example data showing an ID column alongside the desired "visit" counter]
You can use: bysort ID : gen visit = _n
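A minimal, self-contained sketch with made-up data (the visit dates are illustrative only):
clear
input long ID str10 visitdate
1 "2020-01-05"
1 "2020-03-11"
1 "2020-06-02"
2 "2020-02-20"
end
bysort ID : gen visit = _n
list, sepby(ID)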

Finding string value associated with max value of record subset in long format

For non-longitudinal analysis of long-format data, when subjects have multiple visits or records, I typically hunt down a record within each subject using bysort ID, set a temporary variable to hold the integer or real value I found, use egen max() to find the maximum over all records found, and then set a final value in the _n==1 record for that subject. This lets the values I want from different visits percolate to a single record per subject, which is then used for analysis (not longitudinal; perhaps cross-sectional regression, ANOVA, etc.).
Let's say I want the highest cholesterol (ldl) value for the 3rd year of a trial, where ldl is measured quarterly (every 3 months) for all subjects, which can be accomplished using the code below:
cap drop ldl3tmp
cap drop ldl3max
cap drop ldl3
bysort id (visitdate): gen ldl3tmp = ldl if trialyear==3
bysort id (visitdate): egen ldl3max = max(ldl3tmp)
bysort id (visitdate): gen ldl3 = ldl3max if _n==1
Suppose there are initials for the lab technician or phlebotomist who did the blood draw. How can I percolate to the _n==1 record the string value associated with the greatest ldl value among that subject's year-3 records? egen max() does not work on string values, so I am guessing the answer might be to eliminate the records for which ldl is not the greatest value in year 3, so that the string ends up in the surviving record?
In this case, how can I find out what _n is for the maximum value? If I know that, I could use
bysort id (visitdate): drop if _n!=6 //if _n==6 has the max value of ldl
Here is how to find the record number associated with the greatest ldl value among the four quarterly ldl values in year 3 of the trial. The result is a variable called recmax, which is filled in only for the record(s) where the greatest value was found within each subject.
cap drop tmpldl3
cap drop maxldl3
cap drop recmax
cap drop visitdate
gen long visitdate = date(dateofvisit, "MDY") //You have to convert date ("MM/DD/YYYY") to a long integer format - based on #days since Jan 1, 1960
bysort id (visitdate): gen tmpldl3 = ldl if trialyear ==3
bysort id (visitdate): egen maxldl3 = max(tmpldl3)
bysort id (visitdate): gen recmax = _n if tmpldl3==maxldl3 & tmpldl3!=. & maxldl3!=.
You can then analyze all the other data in that record (such as string data) cross-sectionally (ANOVA, correlation, regression) by adding if recmax != . to any analysis command. If you are careful, you can also drop all the other records, those carrying extraneous ldl values not of interest, with drop if recmax == ., provided you keep in mind that you have dropped data; if you save, save to a filename with "_reduced" or "_dropped" in it.
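To address the string part of the question, one common idiom is to copy the string from the flagged record and spread it within the subject. A minimal sketch, assuming the initials are stored in a string variable called tech (a name not in the original post):
cap drop tech3
gen tech3 = tech if recmax != .               // copy initials from the record(s) holding the year-3 maximum (tech is an assumed name)
bysort id (tech3): replace tech3 = tech3[_N]  // "" sorts first, so the value spreads to every record of the id; with ties, the last value in sort order wins
After that, tech3 is available in the _n==1 record (or wherever recmax != .) for cross-sectional analysis.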

Displaying variable sets that define each row

To say that a dataset is at the (person, year) level means that each row of that dataset has a different (person, year) combination, like this:
person year wage
Mike 2000 10
Mike 2010 30
Jack 1990 20
How can I make Stata display exactly those (person, year) variable sets that uniquely define each row?
I want to make a log file to record
person year
only, but not display any individual information (displaying individuals' information in a log file is against the rules set by the data provider).
How could I do this?
What I thought about is using bysort in some way
bysort person year: gen num=_n
and if every num is 1, then it means (person, year) defines each row.
But if a dataset is extremely large, then checking whether every num is 1 is too tedious. Is there any smarter way?
The command isid checks whether the variables you supply do jointly specify observations uniquely. Here is an example you can try:
. webuse grunfeld, clear
. isid company
variable company does not uniquely identify the observations
r(459);
. isid company year
Note the principle: no news is good news.
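If you would rather have a do-file report the result explicitly instead of halting on the error, a small sketch using capture and the return code:
capture isid person year
if _rc {
    display as error "(person, year) do not uniquely identify the observations"
}
else {
    display "(person, year) uniquely identify the observations"
}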
Another way to check for problems is through duplicates. For example, try duplicates list person year. In your case, you don't want that in the log. But what you can do first is anonymise your persons through
egen id = group(person)
and then check for duplicates on id year.
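Putting those two steps together, a minimal sketch that keeps individual identifiers out of the log (duplicates report prints only counts, not values):
egen id = group(person)        // anonymised person identifier
duplicates report id year      // counts of surplus copies only
duplicates list id year        // safe to log: shows only id and year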
See also this FAQ.

Stata: sales growth rate of multiple groups

Cross-posting:
German: http://www.stata-forum.de/post1716.html#p1716
English: http://www.talkstats.com/showthread.php/47299-sales-growth-rate-with-multiple-groups-conditions
I want to calculate the annual sales growth rate of different firm-groups in Stata. The firms are grouped by variables country and industry.
I summed sales for each group (calling it sales_total: the sales of all firms in a group with the same country, industry, and year):
bysort country year industry: egen sales_total = sum(sales)
My actual sample is much bigger, but I first tried to calculate the growth rate on a smaller sample.
I tried multiple combinations such as:
egen group = group(year country industry)
xtset group year, yearly
bys group: g salesgrowth = log(D.sales_total)
or
bysort group: gen salesgrowth=(sales[_n]-sales[_n-1])/sales[_n-1]
also with tsset.
and tried everything from this answer:
Generate percent change between annual observations in Stata?
but I always get error messages such as
repeated time values within panel
or
repeated time values within sample
due to repeated values in a variable such as group.
Can you help me to find the yearly growth rate from each group (firms from same country & industry)?
Update:
Here again is an example of my observations (the full data have about 10,000 firms over 10 years). There are also missing values (for sales, industry, year, and country):
firms  country  year  industry  sales
a      usa      1     1           300
a      usa      2     1          4000
b      ger      1     1           200
b      ger      2     1           400
c      usa      1     1           100
c      usa      2     1           300
d      usa      1     1           400
d      usa      2     1           200
e      usa      1     1          7000
e      usa      2     1           900
f      ger      1     2           100
f      ger      2     2           700
h      ger      1     2           700
h      ger      2     2           600
... etc.
I tried the programming you mentioned, but I ended up with a couple of variables that need to be used in the same row and not in the same column (which is probably what I need). Is there a way to keep the data without reshaping, leaving them in rows, for example by grouping the observations:
egen group=group(industry year country)
and then try
xtset group year
bysort group: gen sales_growth = log(D.sales)
or
bysort group: gen sales_growth = (sales[_n]-sales[_n-1])/sales[_n-1]
Thank you!
The strategy here is trying to work at the wrong level of resolution. You should
collapse (sum) sales, by(country year industry)
and then work with that reduced dataset. Depending on what you want precisely, you will probably need to restructure that data with reshape so that different industries give different variables. Then
xtset country year
and growth rates will be easier to calculate.
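A sketch of that workflow using the example's variable names (the encode step is an addition here because country is a string in the example, and the growth formula is one common choice):
collapse (sum) sales, by(country industry year)    // one row per country-industry-year, sales now holds the group total
reshape wide sales, i(country year) j(industry)    // one variable per industry: sales1, sales2, ...
encode country, gen(cid)                           // xtset needs a numeric panel variable
xtset cid year
gen growth1 = (sales1 - L.sales1) / L.sales1       // annual growth of industry 1's total sales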